ParaFormer: A transformer architecture for small business credit underwriting

Abstract. We present ParaFormer, a Temporal Fusion Transformer (Lim et al., 2019) built on the same self-attention mechanism that powers frontier LLMs such as ChatGPT and Claude. ParaFormer is trained on daily revenue time series from over one million U.S.-based small and medium-sized businesses across Parafin’s embedded-lending partner ecosystem. It replaces a hand-engineered XGBoost ensemble (28 separate quantile-horizon models, ~130 covariates) with a single sequence model that emits the full quantile distribution across all twelve forward months in one forward pass. To the best of our knowledge, this is the first published production deployment of a transformer-based model to size small business credit at this scale. As of this writing, ParaFormer generates daily forecasts that drive offer sizing for over a million merchants across partners including Amazon, DoorDash, and Walmart.

This post is a companion to “Forecasting at scale: Managing risk across millions of merchants,” which established why forecasting is the central technical problem in our underwriting stack. Here we describe the architecture, the empirical results, and what the model’s internals reveal about the structure of small business cash flows.

1. Forecasting is the fundamental input for sizing offers

Parafin is the embedded lending infrastructure that powers small business lending on Amazon, DoorDash, Walmart, and dozens of other leading and emerging platforms serving the small business sector. Our products, which include Flex Loans, Term Loans, Merchant Cash Advances, and Pay Over Time, are extended to merchants we never directly source. We do not underwrite using personal or business credit bureau scores, because we believe these scores do not accurately capture the performance of a business. Instead, we size offers against the forecasted forward cash flow of the small business. This approach lets Parafin serve small businesses without requiring owners to commingle personal and business liability, and it lets them receive the amount of financing their business warrants.

Small business revenue is overwhelmingly seasonal. A landscaper books roughly 70% of annual receipts between April and September. A tax preparer compresses most of its volume into February through April. Sizing an offer correctly demands an accurate projection of the next 3 to 12 months of sales, conditioned on the merchant’s idiosyncratic seasonal arc. Project sales correctly and we extend more capital ahead of peaks and less ahead of troughs, growing originations without changing our risk posture. Project them poorly and we either leave growth unrealized or impose repayment stress the merchant cannot absorb. Replacing the production forecasting model was the highest-leverage machine learning investment available to us.

Figure 1. Sample forecasts from ParaFormer.

2. From a stacked XGBoost grid to a single transformer-based model

The legacy system was a stacked XGBoost ensemble: ~130 hand-engineered features per merchant (sales aggregations, growth rates, lagged windows), a separate model for each (quantile, horizon) pair (7 × 4 = 28 models), and a downstream join between independently trained on-platform and off-platform sales paths. Each design choice was locally rational. In aggregate the system became a tower in which every improvement traversed four layers, the model grid was structurally unable to share signal across quantiles or horizons, and platform partner-specific seasonality patterns required hand retuning. The architecture was the ceiling.

In September 2025, the four of us, Kartik (Data Science), Pierre (Engineering), Edwin (Engineering), and Jerome (Data Science), built a Temporal Fusion Transformer prototype during a two-day hackathon. The hypothesis: a single deep sequence model trained on millions of merchant time series should subsume the preprocessing layer, the seasonality module, the per-horizon model grid, and the on/off-platform split, and should outperform the production stack on the metrics that drive offer sizing. The prototype validated the hypothesis. We named it ParaFormer.

The architecture is a Temporal Fusion Transformer (Lim et al., 2019), built on the same self-attention mechanism that underlies frontier large language models such as ChatGPT and Claude. ParaFormer emits all 17 quantiles across all 12 forward months in a single forward pass, with every layer of representation shared across them. Improvements at one horizon transfer to neighboring horizons automatically. Quantiles are coherent: q30 and q40 emerge from a shared underlying state rather than from independent learners that happened to land near each other. On-platform and off-platform sales, previously handled by separate models, are now jointly modeled.

The hackathon prototype graduated to the production roadmap. We built out a complete pipeline (data processing → training → inference → evaluation), tuned for GPU throughput, and integrated with our underwriting infrastructure. ParaFormer now executes daily, generating quantile forecasts with full uncertainty bands for over one million businesses. Our internal shipping discipline is simple: prototype scrappily, validate against production on real metrics, scale only what wins.

A natural question is how ParaFormer compares to classical seasonality-aware baselines like ARIMA and Prophet. Those models are fit once per merchant, with no sharing of signal across the population. More fundamentally, their uncertainty bands are assumed, not learned from data. Neither can represent the fact that small business sales are inherently asymmetric: bounded at zero on the downside, open-ended on the upside. ParaFormer is fit directly against 17 quantile targets, which forces the shape of the conditional distribution (asymmetry, dispersion, tail behavior) to be learned from the patterns in our merchant population rather than assumed.
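To make "fit directly against quantile targets" concrete, here is a minimal sketch of the quantile (pinball) loss commonly used for this kind of training. It is an illustration rather than our training code: the function name, array shapes, and the two quantile levels in the usage example are assumptions chosen for readability.

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantiles):
    """Mean quantile (pinball) loss.

    y_true:    shape (n,)    realized values
    y_pred:    shape (n, q)  one predicted value per quantile level
    quantiles: shape (q,)    quantile levels in (0, 1)
    """
    y_true = np.asarray(y_true, dtype=float)[:, None]
    y_pred = np.asarray(y_pred, dtype=float)
    q = np.asarray(quantiles, dtype=float)[None, :]
    diff = y_true - y_pred  # positive when the forecast is too low
    # Under-forecasts are weighted by q, over-forecasts by (1 - q).
    loss = np.maximum(q * diff, (q - 1.0) * diff)
    return float(loss.mean())

# Illustrative only: one merchant-month with realized sales of 100 and the
# same forecast of 80 submitted for both the 10th and 90th percentiles.
levels = np.array([0.1, 0.9])
example = pinball_loss([100.0], [[80.0, 80.0]], levels)
```

Because the penalty for under-forecasting grows with the quantile level, a forecast of 80 against a realized 100 costs far more at the 90th percentile than at the 10th. Minimizing this loss across many levels is what pushes the predicted quantiles to trace an asymmetric distribution instead of a symmetric band.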

3. The attention mechanism processes sales the way it processes language

Predicting the next number in a sequence is structurally the same as predicting the next word in a sentence. When a language model encounters “I deposited money at the bank,” its core mechanism, attention, disambiguates “bank” as a financial institution by attending to surrounding context. ParaFormer applies the same mechanism to sales history. A December spike triggers attention over prior Decembers, classifying the spike as a recurring annual peak rather than a permanent regime shift. A weak August prompts attention over prior Augusts, distinguishing a seasonal lull from business deterioration.

The training objective is the same as a language model’s: predict the next value in the sequence under a causal mask that prevents future-position leakage. There is no per-horizon label construction and no constructed cumulative target. Extending the cumulative forecast window to 15 months, 18 months, or whatever the next product requires needs no retraining, provided the window falls within the model’s pre-configured limit. The model emits monthly forecasts, which are aggregated post hoc over any such window.
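The causal mask mentioned above can be sketched in a few lines of NumPy. This is a generic single-head, scaled dot-product attention with no learned projections, not ParaFormer's implementation; the function name and the toy input are assumptions for illustration.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (T, d). Position t may attend only to
    positions <= t. Returns (output (T, d), weights (T, T)).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # True strictly above the diagonal marks future positions.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy sequence: 6 time steps with 4 features each (random, illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
out, w = causal_attention(x, x, x)
```

Every entry of `w` above the diagonal is zero, so changing a later time step cannot alter the output at an earlier one. That is the property that prevents future-position leakage during training.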

4. The model independently recovers annual seasonality

The claim “we use the same architecture as LLMs” carries no weight unless the architecture uses its mechanisms the same way. We therefore probed the trained model’s eight attention heads.

Inside any transformer, attention is multi-headed: a parallel set of context-finders, each free to specialize. In language models, one head may learn grammatical dependency, another coreference, another topical relatedness. ParaFormer’s heads have the same freedom. Three categories of temporal relationship emerged without supervision:

Heads 0 and 1 track annual seasonality relative to the predicted month (when forecasting August, they attend to last August and the August before; the peaks slide with the prediction horizon).
Heads 4 and 7 anchor every forecast to fixed calendar references (lags −1, −13, and −25: the most recent month plus the same month in each of the two prior years).
The remaining four heads carry supporting signals, such as recent volatility and partial seasonality near the autocorrelation boundary. No head is dead.
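One simple way to tell the first two categories apart is to check how a head's peak attention lag moves as the forecast horizon advances. The sketch below is a hypothetical probe, not necessarily the analysis we ran: `classify_head`, the lag and horizon indexing, and the synthetic attention matrices are all assumptions for illustration.

```python
import numpy as np

def classify_head(attn, lags, horizons):
    """Label a head by how its peak attention lag moves with the horizon.

    attn:     (n_horizons, n_lags) attention weights, e.g. averaged over merchants
    lags:     (n_lags,) negative offsets from the forecast origin (-1 = latest month)
    horizons: (n_horizons,) forward steps, 1 = first forecast month
    """
    peak_lag = lags[np.argmax(attn, axis=1)]  # strongest lag for each horizon
    fixed_spread = peak_lag.max() - peak_lag.min()
    shifted = peak_lag - horizons             # constant if the peak slides 1:1
    sliding_spread = shifted.max() - shifted.min()
    if fixed_spread == 0:
        return "origin-anchored"
    if sliding_spread == 0:
        return "horizon-relative"
    return "mixed"

# Synthetic attention maps (illustrative only, not model output).
lags = np.arange(-36, 0)     # 36 months of history; -1 is the latest month
horizons = np.arange(1, 7)   # six forward months

sliding = np.zeros((6, 36))
anchored = np.zeros((6, 36))
for i, h in enumerate(horizons):
    sliding[i, (h - 13) + 36] = 1.0  # same calendar month one year before the target
    anchored[i, -13 + 36] = 1.0      # always lag -13, regardless of horizon

labels = (classify_head(sliding, lags, horizons),
          classify_head(anchored, lags, horizons))
```

On these synthetic inputs the first matrix is labeled horizon-relative and the second origin-anchored. Real attention maps are noisier, so a practical probe would likely compare the top few lags per horizon and not a single argmax.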

None of this was specified. The model discovered it. The same mechanism that lets a language model resolve which “she” refers to which earlier name lets ParaFormer recognize that next August’s sales correlate with last August’s.

Figure 2. Head 1 attention heatmap. This head focuses on "same month one year and two years ago, relative to the month being predicted."

Figure 3. Head 4 attention heatmap. This head anchors on fixed reference points: the most recent observed month, same month last year, and two years ago.

5. Static features modulate the temporal pathway

ParaFormer ingests 14 static covariates per merchant, including seasonal amplitude, baseline level, year-over-year correlation, history length, and an on/off-platform flag. Their role is structurally different from the role features played in XGBoost. In the legacy system, features were the model: they directly encoded seasonality, trend, and momentum. In ParaFormer, attention extracts the seasonal pattern itself from raw history; static features answer meta-questions (“Is this series seasonal at all? How predictable? How large?”) and modulate the temporal pathway.

This is the fusion in Temporal Fusion Transformer: the same December spike is read differently depending on merchant type. It may be a retailer’s annual peak, a tax preparer’s pre-peak buildup, or a landscaper’s off-season anomaly. Static features tell the model which annual cycle the merchant occupies; attention reads the history accordingly. The pathway is general: it can accept platform partner ID, MCC, geography, payments-channel mix, or any other per-merchant attribute we choose to add, and most of that capacity remains unexercised.
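The modulation described above can be illustrated with a simplified gated residual block in the spirit of the one described by Lim et al. (2019), in which a static context vector is injected into the transformation applied to each time step. This is a sketch under stated assumptions: the weights are random and untrained, layer normalization has no learned scale or bias, and the two context vectors are hypothetical stand-ins for different merchant types.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gated_residual(a, c, p):
    """Simplified gated residual block.

    a: (T, d) temporal features, c: (d,) static context, p: dict of weights.
    """
    eta2 = elu(a @ p["W2"] + c @ p["W3"] + p["b2"])  # context enters here
    eta1 = eta2 @ p["W1"] + p["b1"]
    gate = sigmoid(eta1 @ p["W4"] + p["b4"])         # how much of the update to let through
    value = eta1 @ p["W5"] + p["b5"]
    return layer_norm(a + gate * value)              # residual connection, then normalize

# Random, untrained parameters (illustrative only).
rng = np.random.default_rng(0)
d, T = 8, 24
params = {f"W{i}": rng.normal(scale=0.5, size=(d, d)) for i in range(1, 6)}
params.update({f"b{i}": np.zeros(d) for i in (1, 2, 4, 5)})

history = rng.normal(size=(T, d))   # the same encoded sales history
context_a = rng.normal(size=d)      # hypothetical static context, merchant type A
context_b = rng.normal(size=d)      # hypothetical static context, merchant type B

read_a = gated_residual(history, context_a, params)
read_b = gated_residual(history, context_b, params)
```

The same `history` produces different representations under `context_a` and `context_b`, which is the sense in which static features change how the temporal pathway reads an identical spike.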

Figure 4. Heads 0 and 1 (horizon-relative seasonal heads) contribute the most to the predictions, with peak contributions in the high-autocorrelation segment. Heads 4 and 7 (origin-relative anchor heads) contribute meaningfully as well.

Figure 5. Static feature ranking. log_stddev_res_yr1 measures residual noise after detrending: how predictable the series is once seasonality and trend are removed. Combined with min_max_sales_yr1 (range) and detrended_pattern_corr (seasonality strength), the top features encode noise, scale, and consistency of each series.

6. Empirical results

ParaFormer outperforms the production system on the metrics that drive offer sizing. Beyond lower terminal loss, the model anticipates seasonal patterns the production forecaster misses, and the improvement translates into measurable downstream business outcomes.

We backtested four quarterly snapshots (January, April, July, October 2025) of model predictions against (i) realized sales, (ii) the production forecaster, and (iii) repayment outcomes for ~15,000 funded businesses, with repayment stress observed at 65% of advance duration (~175 days). The findings follow.

Finding 1: ParaFormer anticipates seasonal direction. For merchants with detectable seasonality, the TFT’s adjustment relative to production (model_lift) correlates with how actuals deviated from production (actual_surprise) at winsorized Pearson r = 0.50–0.82 across all four cohorts, with 72.7% directional accuracy. When the model predicts a peak, peaks materialize; when it predicts a trough, troughs materialize. The signal is consistent across quarters and is not a single-cohort artifact.
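The two statistics in Finding 1 can be computed along the following lines. This is a sketch with assumed definitions: here model_lift and actual_surprise are taken as relative differences from the production forecast, and winsorization clips at the 1st and 99th percentiles. The exact definitions and cutoffs in our backtest may differ, and the arrays are toy values, not results.

```python
import numpy as np

def winsorize(x, lower=0.01, upper=0.99):
    """Clip values to the given lower and upper quantiles (assumed cutoffs)."""
    lo, hi = np.quantile(x, [lower, upper])
    return np.clip(x, lo, hi)

def lift_diagnostics(model_fc, prod_fc, actual):
    """Return (winsorized Pearson r, directional accuracy).

    model_fc, prod_fc, actual: arrays of shape (n,), one entry per merchant.
    """
    model_fc = np.asarray(model_fc, dtype=float)
    prod_fc = np.asarray(prod_fc, dtype=float)
    actual = np.asarray(actual, dtype=float)
    model_lift = model_fc / prod_fc - 1.0     # assumed: relative adjustment vs production
    actual_surprise = actual / prod_fc - 1.0  # assumed: relative miss of production
    r = np.corrcoef(winsorize(model_lift), winsorize(actual_surprise))[0, 1]
    directional = np.mean(np.sign(model_lift) == np.sign(actual_surprise))
    return float(r), float(directional)

# Toy inputs (illustrative only): production forecasts a flat 100 per merchant.
prod = np.array([100.0, 100.0, 100.0, 100.0])
model = np.array([120.0, 90.0, 110.0, 80.0])
realized = np.array([115.0, 95.0, 104.0, 70.0])
r, hit_rate = lift_diagnostics(model, prod, realized)
```

Directional accuracy here is simply the share of merchants for which the model adjusted the production forecast in the same direction that actuals ended up deviating from it.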

Finding 2: Without ever observing loss data, the model tilts conservatively on businesses that subsequently default. Stratified by terminal loss status, ParaFormer systematically predicts lower for businesses that ultimately take losses than for comparable businesses that do not. The individual-level correlation between model_lift and terminal loss is essentially zero (r = −0.06): the model is not a loss predictor and is never trained on a loss signal. But across the funded population it errs in the correct direction on the lossy tail, sizing advances more conservatively for the part of the book that needs it.

Caveat. ParaFormer is trained on active businesses and forecasts seasonal sales conditioned on the business remaining healthy. It does not predict idiosyncratic deterioration or churn since those are separate problems handled by separate loss models. Its contribution is upstream: better seasonal forecast → better offer sizing → less repayment stress. Direct loss prediction remains downstream of the forecast.

7. Implications and what comes next

ParaFormer is the first deployment of transformer-based sequence modeling inside Parafin's underwriting stack, adapting the attention-based paradigm that powers frontier LLMs to a forecasting setting. It will not be the last. The same architecture can absorb additional per-merchant attributes such as partner ID, MCC, geography, and payments-channel mix. The model scales to any horizon within its pre-configured limit without retraining, at near-zero marginal engineering cost. It exposes interpretable internals that a tower of stacked classical models could not. It establishes a research and engineering pattern we intend to repeat across the underwriting stack: identify a load-bearing model, replace it with a single architecture that subsumes the surrounding hand-engineered scaffolding, and ship.

More broadly, the result suggests that the architectural advances driving frontier AI are directly applicable to the structured-data problems at the core of small business finance. The mechanism that lets a language model resolve long-range linguistic dependencies is the same mechanism that resolves long-range temporal dependencies in revenue. Parafin sits on one of the largest and most diverse corpora of small business cash flow data anywhere, with over a million active small businesses across dozens of partner platforms. That makes us the natural place to deploy these methods first.

References

Lim, B., Arık, S. Ö., Loeff, N., & Pfister, T. (2019). Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. arXiv:1912.09363.

Parafin internal data on small business and partner platform metrics, April 2026.