Foundational Models for Time Series Forecasting: Are We There Yet?


Luca Baggi

May 30, 2025

Transcript

  1. Foundational models for Time Series Forecasting Are we there yet?

    Luca Baggi Gabriele Orlandi Bologna, 30/05/2025
  2. Outcomes • What are time-series foundational models? • Why do

    they matter? • How are we building and evaluating them? • Words of caution when going to the real world
  3. 🔗 Source 🔗 Pre-print Transformers in production Tim Januschowski However,

    it’s known that Amazon, Google, Alibaba and Zalando use transformers in forecasting (in Zalando’s case since 2019). So what’s up? To me, the current most plausible explanation is that the success of transformers in forecasting is a function of the data. With data like Amazon's or Zalando’s demand data sets, transformers make a difference – and scaling laws even seem to kick in.
  4. From transformers to foundational models Recipe for a foundational model:

    1. Create an arbitrarily large neural network with a transformer-based architecture 2. Grab all the data that you can: even more than what it would take to train a successful transformer-based model. Any domain, any frequency, any length 3. Train the model 4. Hope the model learns general patterns thanks to the diversity of domains, frequencies, seasonalities, skewness, sparsity...
  5. What are the advantages of a zero-shot forecaster? 1. If

    the training set is broad enough, it should have reasonable performance on unseen data regardless of their granularity, frequency, sparsity and perhaps even distribution. 2. You don't need to wait until you have enough data to train a model from scratch (e.g. an ARIMA, but this might also apply to a global model such as XGBoost). 3. When the data starts to come in, you can fine-tune the zero-shot model to your domain (or for other purposes, e.g. conformal predictions). See the sketch below.
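
To make the zero-shot workflow concrete, here is a minimal sketch assuming the chronos-forecasting package and the ChronosPipeline API shown in its README; the checkpoint name, argument names and the shape of the returned samples are taken from that README and may differ across versions:

```python
import torch
from chronos import ChronosPipeline  # assumes the chronos-forecasting package is installed

# Illustrative sketch: load a pretrained zero-shot forecaster (no training on our own data).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Any univariate history works; here, two years of monthly observations.
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0,
                        148.0, 148.0, 136.0, 119.0, 104.0, 118.0,
                        115.0, 126.0, 141.0, 135.0, 125.0, 149.0,
                        170.0, 170.0, 158.0, 133.0, 114.0, 140.0])

# Sample-based probabilistic forecast for the next 12 steps:
# the returned tensor has shape (num_series, num_samples, prediction_length).
samples = pipeline.predict(context, prediction_length=12)
median_forecast = samples.quantile(0.5, dim=1)
```
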
  6. Open time-series data The brief answer is: not much, but

    we're getting there. • 🔗 Monash Time Series Forecasting Repository: the OG, <1B observations • 🔗 LoTSA: introduced with Moirai, >27B • 🔗 Timeseries-PILE, introduced with Moment, ~1.23B (though it contains Monash) Unfortunately, unlike NLP, there are no datasets specifically designed and/or set aside for evaluation (think of GSM8K...).
  7. Open time-series data Unfortunately, unlike NLP, there aren’t many datasets

    specifically designed and/or set aside for evaluation (think of GSM8K…): • 🔗 BOOM (observability, 2025)
  8. Enter synthetic data • So far, in two ways: ⚬

    Convex combinations of real data ⚬ Combinations of patterns (different AR processes, trends...) • A positive effect was found (possibly due to increased diversity) • Even 10% of the data mix appears to boost performance. Y axis: lower is better (from the Chronos paper)
  9. The latest releases • 🔗 Lag-Llama (Academia + ServiceNow, 2024)

    • 🔗 Moment (Academia, 2024) • 🔗 UniTS (Academia, 2024) • 🔗 Chronos (AWS, 2024) • 🔗 TimesFM (Google, 2024) • 🔗 Moirai (Salesforce, 2024) • 🔗 Granite TSFM (IBM, 2024) • 🔗 Toto (Datadog, 2024, open source 2025) • 🔗 Time-MoE (Academia, 2025)
  10. Architecture Framework to repurpose existing transformer-based language models, in

    three steps: • Apply mean scaling and quantise the time series, mapping values to tokens of a finite vocabulary • Train an existing LLM architecture from scratch (the authors use T5 and GPT-2) • At inference, multi-step autoregressive probabilistic predictions are de-scaled and de-quantised Chronos architecture
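
A minimal sketch of the scale-then-quantise step described above; the bin count, clipping limit and helper names are illustrative placeholders, not the exact choices from the Chronos paper:

```python
import numpy as np

# Illustrative sketch, not the Chronos implementation.
def tokenize(series: np.ndarray, n_bins: int = 4094, limit: float = 15.0):
    """Mean-scale a series and map each value to a token id in a finite vocabulary."""
    scale = np.abs(series).mean() + 1e-8          # mean scaling
    scaled = series / scale
    edges = np.linspace(-limit, limit, n_bins + 1)  # uniform bins over [-limit, limit]
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, scale

def detokenize(tokens: np.ndarray, scale: float, n_bins: int = 4094, limit: float = 15.0):
    """Map token ids back to approximate values and undo the scaling."""
    edges = np.linspace(-limit, limit, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2        # bin midpoints
    return centers[tokens] * scale

y = np.array([10.0, 12.0, 9.5, 11.0, 30.0])
toks, s = tokenize(y)
y_hat = detokenize(toks, s)                       # close to y, up to quantisation error
```
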
  11. Training Data: Real 55 datasets from "energy, transport, healthcare, retail,

    web, weather, finance, and with sampling frequencies ranging from 5 minutes up to yearly", including Monash repository, M competitions and public Kaggle datasets. Datasets are used as follows: • 13 datasets exclusively for training • 15 datasets for training and in-domain evaluation • 27 datasets exclusively for out-of-domain evaluation
  12. Training Data: Synthetic Two strategies: 1. TSMixup: Linear combinations of

    randomly sampled real-world series 2. KernelSynth: Combinations (additions/multiplications) of "fundamental time series patterns" (named kernels). Schematics of TSMixup
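
A rough sketch of the TSMixup idea, i.e. convex combinations of randomly sampled (and normalised) real series; the number of components, weights and window length are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative TSMixup-style sketch, not the Chronos implementation.
def ts_mixup(pool: list[np.ndarray], k: int = 3, length: int = 128) -> np.ndarray:
    """Convex combination of k randomly sampled, cropped and normalised real series."""
    weights = rng.dirichlet(np.ones(k))           # convex weights summing to 1
    mix = np.zeros(length)
    for w in weights:
        series = pool[rng.integers(len(pool))]
        start = rng.integers(0, max(1, len(series) - length))
        window = series[start:start + length]
        window = (window - window.mean()) / (window.std() + 1e-8)  # normalise before mixing
        mix += w * window
    return mix

# Toy "real" pool: a trend, a seasonal signal and noise.
t = np.arange(512)
pool = [0.01 * t, np.sin(2 * np.pi * t / 24), rng.normal(size=512)]
synthetic = ts_mixup(pool)
```
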
  13. Hyperparameters The authors noted the model might fail to pick up

    exponential trends (possibly due to their underrepresentation in the training data). Also, the model can underestimate linear trends; in that case, use a bigger sequence length. Chronos and linear trends
  14. Decoder-only causal self-attention transformer architecture (200M params) • Patches are embedded

    as tokens via an MLP with residual connections and positional encodings • Randomly masking parts of patches or full patches allows the model to adapt to any context length • Multi-step auto-regressive decoding for outputs of length 128, approaching one-shot forecasting of the full horizon Architecture
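
A tiny sketch of the patching step described above, with patch length 32 (the sweet spot mentioned later in the deck); the embedding dimension and MLP layers are illustrative placeholders rather than TimesFM's actual modules:

```python
import torch
import torch.nn as nn

# Illustrative sketch of patch embedding, not TimesFM's code.
patch_len, d_model = 32, 256

proj = nn.Linear(patch_len, d_model)
mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

context = torch.randn(1, 512)                       # (batch, context_length)
patches = context.unfold(1, patch_len, patch_len)   # (batch, num_patches, patch_len)
h = proj(patches)                                   # embed each patch as a token
tokens = h + mlp(h)                                 # residual connection; positional encodings would be added next
```
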
  15. 100B time points of various domains and frequencies. Main sources:

    • Google Trends & Wikipedia pageviews at various frequencies, for real-world data spanning all human interests • Synthetic data for “time-series grammar”: ARMA generators, seasonal patterns, trends, step functions... (see the sketch below) • During training, an 80% real / 20% synthetic mix is sampled from all sources Training Data
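
A toy example of such a "time-series grammar" generator, composing a trend, a seasonal pattern, a step function and AR(1) noise; all components and coefficients are arbitrary illustrations, not TimesFM's actual generators:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sketch of a synthetic-series generator.
def synthetic_series(n: int = 256) -> np.ndarray:
    """Compose trend + seasonality + step change + AR(1) noise into one synthetic series."""
    t = np.arange(n)
    trend = 0.05 * t
    seasonal = 2.0 * np.sin(2 * np.pi * t / 24)
    step = np.where(t > n // 2, 3.0, 0.0)            # level shift halfway through
    ar = np.zeros(n)
    for i in range(1, n):                            # AR(1): x_t = 0.8 * x_{t-1} + eps_t
        ar[i] = 0.8 * ar[i - 1] + rng.normal(scale=0.5)
    return trend + seasonal + step + ar

batch = np.stack([synthetic_series() for _ in range(8)])  # (8, 256) synthetic training examples
```
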
  16. Increasing model complexity pays off: the largest model performs better and

    there’s still margin for improvement Hyperparameters
  17. A longer output patch size approaches one-shot forecasting of the full horizon,

    which is argued to perform generally better than autoregressive decoding. Indeed, the study shows performance is better with longer outputs, although with diminishing returns Hyperparameters
  18. There’s a sweet spot for input patch size at 32

    In contrast with the output size, longer inputs hurt performance Hyperparameters
  19. The usefulness of synthetic data really shows when applied to

    datasets with under-represented frequencies, such as the 15-minute frequency in Monash and ETTm. Data Ablation
  20. Benchmarking on datasets used for training in the TimesFM paper

    • There’s no standard data collection for benchmarking used consistently across papers • Proper leave-out datasets are also lacking: benchmark pollution is a real risk, as with LLMs ...is not as easy as it sounds
  21. • MASE for point forecasts, i.e. MAE scaled by a

    baseline naive model's MAE (the baseline is not standard across papers) • Weighted Quantile Loss (WQL) for probabilistic forecasts, also scaled by a baseline model (see the sketch below) Metrics Evaluation of both point and probabilistic forecasts in the Chronos paper
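
A minimal sketch of both metrics on a single series, using a seasonal naive forecast as the MASE baseline and a small quantile grid for WQL; the exact baseline and quantile levels vary across papers:

```python
import numpy as np

# Illustrative sketch of the metrics, not the exact definitions used in any one paper.
def mase(y_true, y_pred, y_train, season: int = 1) -> float:
    """MAE of the forecast scaled by the in-sample MAE of a (seasonal) naive forecast."""
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

def wql(y_true, q_pred: dict[float, np.ndarray]) -> float:
    """Weighted quantile (pinball) loss, normalised by the scale of the target."""
    losses = []
    for q, pred in q_pred.items():
        diff = y_true - pred
        losses.append(np.mean(np.maximum(q * diff, (q - 1) * diff)))
    return 2 * np.mean(losses) / np.mean(np.abs(y_true))

y_train = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])
y_true = np.array([13.0, 15.0])
y_pred = np.array([12.5, 14.0])
quantile_preds = {0.1: y_pred - 1.0, 0.5: y_pred, 0.9: y_pred + 1.0}

print(mase(y_true, y_pred, y_train), wql(y_true, quantile_preds))
```
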
  22. To fairly assess their ability as zero-shot forecasters, foundational models

    should be evaluated both in- and out-of-domain Out-of-domain Out-of-domain evaluation in the Chronos paper
  23. The possibility of fine-tuning adds a whole new dimension to

    evaluation: we need to take it into account if we want to fairly compare models that can be vastly different in scope (local vs global vs foundational) Fine-tuning Evaluation of fine-tuned Chronos T5, from the paper
  24. A promising landscape... 1. Foundational models are on the rise, and

    we’re starting to notice them 2. As we speak, efforts are being made to compile bigger and bigger collections of data 3. Synthetic data looks promising for teaching the model the most fundamental patterns 4. We’re at the early stages, with lots of room for improvement both in collecting more data and in creating better-suited architectures: keep an eye out for those 5. Don’t sleep on fine-tuning: it is still under-developed, but could be key to success
  25. To be explored with caution 1. Public benchmarks are insufficient due

    to the lack of a standard leave-out data collection 2. Don't just pick the model that ranks highest on benchmarks: do your own testing on the frequency, domain and horizon you’re interested in 3. Publicly available time series are limited, and not nearly as ubiquitous as text 4. We’re not sure that performance will scale with data and model complexity