The federal forecasting problem in one paragraph
Federal logistics — DLA, GSA, VA supply, DoD component commands, USPS — runs on forecasts. Bad forecasts cost money in two directions: stockouts (mission parts unavailable) and over-buy (working capital tied up, storage cost, obsolescence). The items being forecasted are heterogeneous — consumables, spares, capital equipment — with wildly different demand patterns. The data is a mix of ERP extracts, transaction logs, and operational records that are rarely as clean as the team pitching the model assumes. Real federal forecasting is not a Kaggle competition; it is a pipeline problem with modeling as one component among many.
Data patterns in federal logistics

- Long tail. A few SKUs have high, regular demand. Most have sporadic, low-volume, intermittent demand.
- Regime shifts. OPTEMPO changes, budget cycles, mission shifts produce structural breaks that purely historical models miss.
- Hierarchy. SKU → category → region → total. Forecasts are needed at multiple levels for different decisions.
- Exogenous drivers. Weather, deployments, maintenance schedules, acquisition events. Enriching the feature set with these typically matters more than the model architecture.
- Seasonality at multiple scales. Day-of-week, month, fiscal quarter, annual, multi-year.
Model classes, honestly
Chart: model class vs. performance-to-effort ratio in federal logistics (field estimate).
Ratio = accuracy improvement relative to implementation and maintenance cost. GBT wins most federal logistics competitions on this metric. Foundation models are improving rapidly — re-evaluate quarterly.
Classical statistical
ARIMA, ETS, Theta, Croston (for intermittent demand). Fast, interpretable, hard to beat on regular series with moderate history. statsmodels and sktime in Python. Prophet (Meta) is classical+ with a nice API.
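The simplest member of the ETS family is short enough to sketch in full. A minimal pure-Python simple exponential smoothing (the function name and default alpha are illustrative; production code would use statsmodels or sktime):

```python
def simple_exp_smoothing(series, alpha=0.3):
    """Simple exponential smoothing:
    level_t = alpha * y_t + (1 - alpha) * level_{t-1}.
    Returns the one-step-ahead forecast (the final smoothed level)."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level
```

On a flat series the forecast is the flat value; on a noisy series, alpha trades responsiveness against smoothing, which is the knob ETS variants estimate from data.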
Gradient boosted trees
LightGBM or XGBoost on engineered features (lags, rolling stats, calendar, exogenous). Frequently wins in practical federal forecasting competitions and production. Simple to deploy, strong per-dollar-of-effort.
Deep learning (classical deep)
DeepAR, NBEATS, NHiTS, TFT (Temporal Fusion Transformer). GluonTS, NeuralForecast, Darts. Strong on datasets with many related series. More compute; more tuning.
Foundation models
Chronos (Amazon), TimeGPT (Nixtla), Moirai (Salesforce), Lag-Llama. Pretrained on diverse time series; usable zero-shot or fine-tuned. Status as of 2026: useful as a fast baseline; fine-tuned on domain data, competitive with purpose-trained models; zero-shot, often strong on well-behaved series.
Ensemble
Average or stack of classical + tree + deep. Often the best production answer. The ensemble also stabilizes against any one model's failure mode.
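The equal-weight average is the simplest ensemble and a reasonable default before fitting stacking weights on a validation window. A minimal sketch (function name is illustrative), assuming each model produces a forecast vector over the same horizon:

```python
def ensemble_mean(forecasts):
    """Equal-weight ensemble: average aligned forecast vectors
    from several models elementwise across the horizon."""
    return [sum(vals) / len(vals) for vals in zip(*forecasts)]
```

Averaging across model classes is what buys the stability against any single model's failure mode; weighted stacking refines it only when the validation window is long enough to estimate weights reliably.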
Intermittent demand: the DLA problem
A huge fraction of DLA SKUs see demand in a minority of weeks. Classical point forecasting fails here because "expected demand" is near zero most of the time; the question is really the probability of a non-zero event.
Croston and variants
(SBA, TSB) handle intermittent demand with separate estimates of demand interval and demand size.
Negative binomial regression
on lagged and exogenous features.
Zero-inflated models
where the zero-event structure is explicit.
Quantile regression
(LightGBM quantile objective, quantile neural nets) to produce action-relevant quantiles directly.
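Croston's recursion is short enough to show in full. A minimal pure-Python sketch (function name and default alpha are illustrative; production code would use a maintained implementation such as statsforecast's), with the SBA bias correction as an option:

```python
def croston(demand, alpha=0.1, sba=False):
    """Croston's method for intermittent demand.

    Smooths non-zero demand size (z) and the inter-demand
    interval (p) separately; the per-period forecast is z / p.
    SBA multiplies by (1 - alpha / 2) to correct Croston's
    positive bias."""
    z = p = None          # smoothed size and interval
    periods_since = 0     # periods since last non-zero demand
    for y in demand:
        periods_since += 1
        if y > 0:
            if z is None:  # initialize on first observed demand
                z, p = float(y), float(periods_since)
            else:
                z = alpha * y + (1 - alpha) * z
                p = alpha * periods_since + (1 - alpha) * p
            periods_since = 0
    if z is None:
        return 0.0        # no demand ever observed
    f = z / p
    return f * (1 - alpha / 2) if sba else f
```

Note what the separation buys: a series with demand 4 every other week and a series with demand 2 every week both average 2 per week, but Croston keeps the size/interval structure that the reorder policy actually needs.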
Hierarchical reconciliation
Bottom-up (sum SKU forecasts): unbiased but high variance at the top; loses aggregation signal. Top-down (allocate total to SKUs): smooth at the top but loses per-SKU signal. MinT (Minimum Trace) and ERM methods exploit correlation structure across the hierarchy to produce reconciled forecasts that are both coherent and lower variance. hierarchicalforecast (Python) and hts (R) implement these.
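The common machinery behind all of these is the summing matrix S that maps bottom-level forecasts to every level of the hierarchy. A minimal numpy sketch of bottom-up reconciliation on a toy hierarchy (the hierarchy shape and function name are illustrative; MinT additionally needs an estimated forecast-error covariance, so use hierarchicalforecast for that in practice):

```python
import numpy as np

# Toy hierarchy: total -> two categories -> four SKUs (two per category).
# Rows of S: [total, cat_A, cat_B, sku_1, sku_2, sku_3, sku_4];
# columns are the bottom-level SKUs.
S = np.array([
    [1, 1, 1, 1],   # total = sum of all SKUs
    [1, 1, 0, 0],   # category A
    [0, 0, 1, 1],   # category B
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

def bottom_up(sku_forecasts):
    """Reconcile by summing SKU-level forecasts up the hierarchy.
    The output is coherent by construction: every aggregate equals
    the sum of its children."""
    return S @ np.asarray(sku_forecasts)

reconciled = bottom_up([10.0, 5.0, 3.0, 2.0])
```

MinT keeps the same S but replaces the naive summation with a weighted projection of base forecasts from all levels, which is where the variance reduction comes from.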
Federal programs that skip reconciliation end up with decision-makers at different levels making contradictory calls because their forecasts do not agree.
Probabilistic forecasting
Federal supply decisions are asymmetric. The cost of a stockout on a mission part is not the same as the cost of an extra week of inventory. Point forecasts obscure this. Quantile forecasts (P10, P50, P90) or full predictive distributions enable the downstream service-level calculation that actually drives the order quantity.
Scoring: CRPS (continuous ranked probability score) and pinball loss at the target quantiles, not RMSE. An operationally calibrated quantile forecast is the deliverable.
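Pinball loss is simple enough to state exactly, and stating it shows why it targets quantiles: under-prediction is penalized by q and over-prediction by (1 - q), so a P90 forecast is punished nine times harder for coming in low than high. A minimal sketch (function name is illustrative):

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss at quantile q, averaged over
    observations. Under-prediction costs q per unit; over-prediction
    costs (1 - q) per unit, so minimizing it targets the q-th
    quantile of the predictive distribution."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        total += q * (y - f) if y >= f else (1 - q) * (f - y)
    return total / len(y_true)
```

This asymmetry is the whole point: at q = 0.9, missing low by two units costs 1.8 while missing high by two units costs 0.2, which is exactly the stockout-versus-overstock asymmetry the section describes.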
Evaluation discipline
Walk-forward CV
Train on t0-t1, test on t1-t2. Roll forward. Aggregate across windows. Never randomly split time series.
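The rolling split above can be sketched as a small generator. A minimal expanding-window version (function name and parameters are illustrative; sklearn's TimeSeriesSplit covers the same ground):

```python
def walk_forward_splits(n, initial_train, test_size, step=None):
    """Yield (train_idx, test_idx) pairs that roll forward in time.
    Training data always precedes test data; a time series is
    never randomly shuffled into folds."""
    step = step or test_size
    start = initial_train
    while start + test_size <= n:
        yield list(range(start)), list(range(start, start + test_size))
        start += step
```

Metrics are then aggregated across the yielded windows, which is what makes the estimate honest about how the model will be used: trained on the past, scored on the future.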
Multiple horizons
1-week, 4-week, 13-week, 52-week forecasts each have different use cases and different accuracy.
Per-stratum metrics
Report metrics by SKU class (fast-moving, slow-moving, intermittent), by region, and by category. A single aggregate MAPE hides everything that matters.
Naive baselines
Seasonal naive, last-value, moving average. Your fancy model must beat these by a material margin to be worth the complexity.
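Seasonal naive is a few lines, which is exactly why it makes a fair floor. A minimal sketch (function name is illustrative):

```python
def seasonal_naive(series, season_length, horizon):
    """Forecast each future step with the value from one season
    earlier; for horizons past one season, repeat the last
    observed season."""
    last_season = series[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]
```

If a tuned model cannot beat this by a material margin under walk-forward evaluation, the complexity is not paying for itself.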
Operational translation
Convert forecast quality to operational quality: stockout days avoided, inventory dollars saved, ordering cycles enabled. The model does not ship unless the operational metric improves.
Feature engineering beats model choice
- Lags and rolling statistics at multiple windows.
- Calendar features: day of week, week of year, fiscal quarter, holidays, end-of-quarter, end-of-fiscal-year.
- Exogenous: OPTEMPO proxies, deployments, weather, maintenance schedules, contract award events.
- Cross-series: related SKUs, category aggregate, regional aggregate.
- Price and substitution when applicable.
- Budget cycle phase.
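The lag and rolling-statistic features in the list above reduce to a small transformation. A minimal pure-Python sketch (function and feature names are illustrative; in production this would be a pandas or Polars pipeline with calendar and exogenous columns joined in):

```python
def make_features(series, lags=(1, 7), roll_windows=(4,)):
    """Build one feature row per timestep: lagged values and
    trailing rolling means, plus the target y. Timesteps without
    full history are skipped, as they would be in training."""
    max_lag = max(max(lags), max(roll_windows))
    rows = []
    for t in range(max_lag, len(series)):
        row = {f"lag_{k}": series[t - k] for k in lags}
        for w in roll_windows:
            window = series[t - w:t]          # strictly trailing: no leakage
            row[f"roll_mean_{w}"] = sum(window) / w
        row["y"] = series[t]
        rows.append(row)
    return rows
```

The one discipline that matters here is leakage: every feature at time t must be computable from data available strictly before t, which is why the rolling window above excludes the current observation.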
Deployment patterns
- Batch scoring nightly or weekly. Real-time forecasting is rarely the right pattern for supply chain.
- Forecast output as a versioned table in the warehouse, with prediction intervals and feature contributions.
- Monitoring: forecast error by stratum, bias drift, coverage of prediction intervals, feature drift on exogenous inputs.
- Re-training cadence: weekly for active series; monthly or quarterly for stable series; event-triggered after regime shifts.
- Governance: model registry with training data version, feature version, hyperparameters, eval results. Promotion gated.
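Of the monitoring signals above, interval coverage is the cheapest to compute and the quickest to flag miscalibration. A minimal sketch (function name is illustrative):

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of actuals that fall inside their prediction
    intervals. Compare against the nominal level (e.g. 0.80 for a
    P10-P90 band) per stratum to detect calibration drift."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper)
               if lo <= y <= hi)
    return hits / len(y_true)
```

A P10-P90 band covering 60% of actuals means the intervals are too narrow and the downstream service-level calculation is being fed optimistic uncertainty; that is a re-training or re-calibration trigger, not a footnote.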
Where this fits in our practice
We build forecasting platforms for federal supply, capacity, and readiness. See our MLOps on GovCloud post for the surrounding infrastructure, and our data labeling post for the upstream annotation patterns when classification or anomaly-detection components feed the forecasting system.