
Time series forecasting for federal logistics.

April 12, 2026 · 14 min read · DLA, supply chain, probabilistic forecasts, hierarchical reconciliation, and what accuracy actually looks like.

The federal forecasting problem in one paragraph

Federal logistics — DLA, GSA, VA supply, DoD component commands, USPS — runs on forecasts. Bad forecasts cost money in two directions: stockouts (mission parts unavailable) and over-buy (working capital tied up, storage cost, obsolescence). The items being forecasted are heterogeneous — consumables, spares, capital equipment — with wildly different demand patterns. The data is a mix of ERP extracts, transaction logs, and operational records that are rarely as clean as the team pitching the model assumes. Real federal forecasting is not a Kaggle competition; it is a pipeline problem with modeling as one component among many.

Rule of thumb. Spend more time on the data preparation, hierarchical structure, and cross-validation design than on picking the model. A well-set-up classical model often beats a poorly-set-up transformer.

Data patterns in federal logistics

  • Long tail. A few SKUs have high, regular demand. Most have sporadic, low-volume, intermittent demand.
  • Regime shifts. OPTEMPO changes, budget cycles, mission shifts produce structural breaks that purely historical models miss.
  • Hierarchy. SKU → category → region → total. Forecasts are needed at multiple levels for different decisions.
  • Exogenous drivers. Weather, deployments, maintenance schedules, acquisition events. Enriching the feature set with these typically matters more than the model architecture.
  • Seasonality at multiple scales. Day-of-week, month, fiscal quarter, annual, multi-year.

Model classes, honestly

Model Class — Performance-to-Effort Ratio in Federal Logistics (field estimate)

  • Ensemble (classical + tree + deep): 92%
  • Gradient boosted trees (LightGBM / XGBoost): 88%
  • Classical statistical (ARIMA / Prophet): 75%
  • Deep learning (DeepAR / TFT): 65%
  • Foundation models zero-shot (Chronos / TimeGPT): 52%

Ratio = accuracy improvement relative to implementation and maintenance cost. GBT wins most federal logistics competitions on this metric. Foundation models are improving rapidly — re-evaluate quarterly.

Classical statistical

ARIMA, ETS, Theta, Croston (for intermittent demand). Fast, interpretable, hard to beat on regular series with moderate history. statsmodels and sktime in Python. Prophet (Meta) is classical+ with a nice API.

Gradient boosted trees

LightGBM or XGBoost on engineered features (lags, rolling stats, calendar, exogenous). Frequently the top performer in practical federal forecasting bake-offs and in production. Simple to deploy; strong accuracy per dollar of effort.

Deep learning (classical deep)

DeepAR, NBEATS, NHiTS, TFT (Temporal Fusion Transformer). GluonTS, NeuralForecast, Darts. Strong on datasets with many related series. More compute; more tuning.

Foundation models

Chronos (Amazon), TimeGPT (Nixtla), Moirai (Salesforce), Lag-Llama. Pretrained on diverse time series, usable zero-shot or fine-tuned. 2026 status: useful as a fast baseline; fine-tuned on domain data, they are competitive with purpose-trained models; zero-shot, they are often strong on well-behaved series.

Ensemble

Average or stack of classical + tree + deep. Often the best production answer. The ensemble also stabilizes against any one model's failure mode.
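One simple combination scheme, sketched here with a hypothetical helper that weights each model by its inverse validation error (plain averaging is the special case of equal errors):

```python
def inverse_error_ensemble(forecasts, val_errors):
    """Combine per-model forecasts, weighting each model by the
    inverse of its validation error (e.g. MAE on a holdout window).

    forecasts:  list of per-model forecast lists, same horizon.
    val_errors: one positive validation error per model.
    """
    weights = [1.0 / e for e in val_errors]
    total = sum(weights)
    weights = [w / total for w in weights]
    horizon = len(forecasts[0])
    return [sum(w * f[h] for w, f in zip(weights, forecasts))
            for h in range(horizon)]

# Two models, one-step horizon: the lower-error model dominates.
blend = inverse_error_ensemble([[10.0], [20.0]], val_errors=[1.0, 3.0])
```

The stabilizing effect mentioned above comes directly from the weighting: a model whose validation error blows up after a regime shift contributes proportionally less to the blend.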

Intermittent demand: the DLA problem

A huge fraction of DLA SKUs see demand in a minority of weeks. Classical point forecasting fails here because "expected demand" is near zero most of the time; the question is really the probability of a non-zero event.

  • Croston and variants (SBA, TSB): handle intermittent demand with separate estimates of demand interval and demand size.
  • Negative binomial regression on lagged and exogenous features.
  • Zero-inflated models where the zero-event structure is explicit.
  • Quantile regression (LightGBM quantile objective, quantile neural nets) to produce action-relevant quantiles directly.
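The Croston estimator is small enough to sketch in full; this is an illustrative reference implementation with the Syntetos-Boylan (SBA) bias correction, not the statsmodels/sktime API:

```python
def croston_sba(demand, alpha=0.1):
    """Croston's method with the SBA bias correction.

    demand: per-period demand counts (mostly zeros for an
    intermittent series). Returns a one-step-ahead demand rate.
    """
    z = None  # smoothed nonzero demand size
    p = None  # smoothed inter-demand interval (in periods)
    periods_since_demand = 1
    for d in demand:
        if d > 0:
            if z is None:  # initialize on the first demand event
                z, p = float(d), float(periods_since_demand)
            else:          # simple exponential smoothing of size and interval
                z += alpha * (d - z)
                p += alpha * (periods_since_demand - p)
            periods_since_demand = 1
        else:
            periods_since_demand += 1
    if z is None:
        return 0.0
    # SBA correction shrinks Croston's z/p, which is biased high.
    return (1 - alpha / 2) * z / p
```

For a series with demand of 3 units every third week, the smoothed size is 3 and the smoothed interval is 3 periods, so the SBA forecast is 0.95 units per week rather than the naive average.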

Hierarchical reconciliation

Bottom-up (sum SKU forecasts): unbiased but high variance at the top; loses aggregation signal. Top-down (allocate the total to SKUs): smooth at the top but loses per-SKU signal. MinT (minimum trace) and ERM methods exploit the correlation structure across the hierarchy to produce reconciled forecasts that are coherent and lower variance. hierarchicalforecast (Python) and hts (R) implement these.

Federal programs that skip reconciliation end up with decision-makers at different levels making contradictory calls because their forecasts do not agree.
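The reconciliation step can be illustrated on a toy two-level hierarchy with OLS reconciliation, the identity-weight member of the MinT family (toy numbers; production use would go through hierarchicalforecast):

```python
import numpy as np

# Toy hierarchy: total = region_A + region_B.
# S maps the bottom-level series [A, B] to all nodes [total, A, B].
S = np.array([[1, 1],
              [1, 0],
              [0, 1]], dtype=float)

# Independent base forecasts for [total, A, B]. Note the incoherence:
# 100 != 55 + 40, so decision-makers at two levels would disagree.
y_hat = np.array([100.0, 55.0, 40.0])

# OLS reconciliation (MinT with identity weights): project the base
# forecasts onto the coherent subspace spanned by S.
P = np.linalg.inv(S.T @ S) @ S.T
y_tilde = S @ P @ y_hat
```

After reconciliation the total exactly equals the sum of the regions, and each level's forecast has moved toward the others in proportion to the shared information. Full MinT replaces the identity weights with an estimate of the base-forecast error covariance.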

A forecast that does not tell you its uncertainty is not a forecast; it is a point estimate and a guess at what to do with it.

Probabilistic forecasting

Federal supply decisions are asymmetric. The cost of a stockout on a mission part is not the same as the cost of an extra week of inventory. Point forecasts obscure this. Quantile forecasts (P10, P50, P90) or full predictive distributions enable the downstream service-level calculation that actually drives the order quantity.

Scoring: CRPS (continuous ranked probability score) and pinball loss at the target quantiles. Not RMSE. An operationally-calibrated quantile forecast is the deliverable.
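Pinball loss is simple enough to state inline; a minimal implementation, showing why it scores the asymmetry that RMSE cannot:

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss at quantile q.

    Under-forecasts are penalized by q per unit, over-forecasts by
    (1 - q) per unit, so a P90 forecast is pushed to cover 90% of
    outcomes -- exactly the stockout asymmetry described above.
    """
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

# At q = 0.9, missing low by 2 units costs 9x missing high by 2 units.
under = pinball_loss([10.0], [8.0], 0.9)   # forecast below actual
over = pinball_loss([10.0], [12.0], 0.9)   # forecast above actual
```

A model is calibrated at P90 when roughly 90% of actuals fall at or below the P90 forecast; pinball loss at each target quantile is what walk-forward evaluation should aggregate.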

Evaluation discipline

Walk-forward CV

Train on t0-t1, test on t1-t2. Roll forward. Aggregate across windows. Never randomly split time series.
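The splitting logic is a few lines; a sketch with a hypothetical helper (scikit-learn's TimeSeriesSplit implements a similar scheme):

```python
def walk_forward_splits(n, train_min, test_len, step=None):
    """Yield (train_idx, test_idx) ranges for walk-forward CV.

    n:         number of time-ordered observations.
    train_min: minimum training window before the first test fold.
    test_len:  length of each test window.
    step:      how far the window rolls forward (default: test_len).
    """
    step = step or test_len
    start = train_min
    while start + test_len <= n:
        # Train on everything before the cutoff, test on the next window.
        yield range(0, start), range(start, start + test_len)
        start += step

splits = list(walk_forward_splits(10, train_min=6, test_len=2))
```

Every test fold sits strictly after its training window, so no future information leaks into the fit; metrics are then aggregated across the folds.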

Multiple horizons

1-week, 4-week, 13-week, 52-week forecasts each have different use cases and different accuracy.

Per-stratum metrics

SKU class (fast-moving, slow-moving, intermittent). Region. Category. Single-number MAPE hides everything that matters.

Naive baselines

Seasonal naive, last-value, moving average. Your fancy model must beat these by a material margin to be worth the complexity.
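The seasonal naive baseline, for reference (illustrative helper; the function name is ours):

```python
def seasonal_naive(history, season, horizon):
    """Forecast each future period as the value one season earlier.

    history: time-ordered observations, len(history) >= season.
    season:  seasonal period (e.g. 52 for weekly data, annual cycle).
    horizon: number of periods to forecast.
    """
    return [history[len(history) - season + (h % season)]
            for h in range(horizon)]

# Weekly data with a 7-period cycle: the forecast replays last cycle.
fc = seasonal_naive([1, 2, 3, 4, 5, 6, 7], season=7, horizon=3)
```

Cheap to compute for every SKU, and a surprisingly hard floor to beat on strongly seasonal series; report every candidate model's skill relative to it.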

Operational translation

Convert forecast quality to operational quality: stockout days avoided, inventory dollars saved, ordering cycles enabled. The model does not ship unless the operational metric improves.

Feature engineering beats model choice

  • Lags and rolling statistics at multiple windows.
  • Calendar features: day of week, week of year, fiscal quarter, holidays, end-of-quarter, end-of-fiscal-year.
  • Exogenous: OPTEMPO proxies, deployments, weather, maintenance schedules, contract award events.
  • Cross-series: related SKUs, category aggregate, regional aggregate.
  • Price and substitution when applicable.
  • Budget cycle phase.
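The first three bullets can be sketched in pandas (the `make_features` helper and column names are illustrative; the fiscal-quarter arithmetic assumes the federal fiscal year starting October 1):

```python
import pandas as pd

def make_features(df):
    """df: columns ['date', 'qty'], one row per period for one SKU.

    All lag/rolling features are shifted so that only past values
    enter each row -- no leakage from the target period.
    """
    out = df.copy()
    out["lag_1"] = out["qty"].shift(1)
    out["lag_4"] = out["qty"].shift(4)
    out["roll_mean_4"] = out["qty"].shift(1).rolling(4).mean()
    out["week_of_year"] = out["date"].dt.isocalendar().week.astype(int)
    # Federal fiscal year starts in October: Oct-Dec is FQ1.
    out["fiscal_quarter"] = (out["date"].dt.month - 10) % 12 // 3 + 1
    return out

df = pd.DataFrame({
    "date": pd.date_range("2025-10-05", periods=8, freq="W"),
    "qty": list(range(8)),
})
feats = make_features(df)
```

Exogenous and cross-series features join onto the same frame by date and key; the point is that this table, not the model class, is where most of the accuracy lives.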

Deployment patterns

  • Batch scoring nightly or weekly. Real-time forecasting is rarely the right pattern for supply chain.
  • Forecast output as a versioned table in the warehouse, with prediction intervals and feature contributions.
  • Monitoring: forecast error by stratum, bias drift, coverage of prediction intervals, feature drift on exogenous inputs.
  • Re-training cadence: weekly for active series; monthly or quarterly for stable series; event-triggered after regime shifts.
  • Governance: model registry with training data version, feature version, hyperparameters, eval results. Promotion gated.

Where this fits in our practice

We build forecasting platforms for federal supply, capacity, and readiness. See our MLOps on GovCloud post for the surrounding infrastructure, and our data labeling post for the upstream annotation patterns when classification or anomaly-detection components feed the forecasting system.

FAQ

What accuracy is realistic for federal demand forecasting?
Item-level weekly forecasts on DLA-scale portfolios typically achieve 30-60% MAPE at the SKU level, dropping to 10-25% at aggregated levels (category, region). Low-volume or intermittent demand items (a majority of DLA SKUs by count) have weaker point forecasts; quantile forecasts are more informative there.
What is hierarchical reconciliation and why does it matter?
Forecasts at different levels of aggregation (SKU, category, region, total) need to be consistent — the sum of forecasts at the SKU level should equal the regional forecast. Hierarchical reconciliation methods (MinT, ERM, bottom-up, top-down, middle-out) enforce this. Without it, downstream decisions made at different levels contradict each other.
Are foundation models for time series useful in federal?
Chronos (Amazon), TimeGPT (Nixtla), Lag-Llama, and Moirai have shown strong zero-shot performance on diverse benchmarks. For federal, they are a fast baseline worth evaluating alongside trained models. Fine-tuning on agency data typically beats zero-shot on well-behaved series and matches on noisy intermittent-demand series.
When do you use probabilistic forecasts vs point forecasts?
Use probabilistic (quantile or full distribution) forecasts whenever downstream decisions involve inventory buffer, service-level targets, risk, or cost asymmetry. That is almost all of federal logistics. Point forecasts are appropriate only when decisions are genuinely symmetric.
What tooling actually works for federal time series at scale?
For classical models, statsmodels and Prophet remain solid. For deep learning at scale, GluonTS (MXNet/PyTorch), Darts, and NeuralForecast. Foundation models via Hugging Face. SageMaker DeepAR for managed. The framework matters less than the feature engineering, cross-validation design, and reconciliation.
How do you validate forecasts for federal production?
Walk-forward (time-series split) cross-validation, multiple horizons, per-stratum metrics (by SKU class, region, seasonality pattern), probabilistic scoring (CRPS, pinball loss, not just RMSE), and operational metrics (stockout risk, over-stock risk) translated from forecasts through downstream decision logic.


Building forecasting for a federal supply chain?

We build demand, logistics, and capacity forecasting systems on federal data — with honest uncertainty, hierarchical reconciliation, and audit-grade pipelines.