it’s known that Amazon, Google, Alibaba and Zalando use transformers in forecasting (in Zalando’s case since 2019). So what’s up? To me, the most plausible explanation right now is that the success of transformers in forecasting is a function of the data. With data like Amazon’s or Zalando’s demand datasets, transformers make a difference, and scaling laws even seem to kick in.
1. Create an arbitrarily large neural network with a transformer-based architecture
2. Grab all the data you can, even more than what it would take to train a successful transformer-based model: any domain, any frequency, any length
3. Train the model
4. Hope the model learns general patterns thanks to the diversity of domains, frequencies, seasonalities, skewness, sparsity...
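To make the recipe concrete, here is a minimal sketch of that training loop in PyTorch. Everything in it is illustrative: `corpus`, `sample_batch` and `TinyForecaster` are made-up names, the model is far from "arbitrarily large", and real pipelines add scaling, positional encodings and probabilistic heads.

```python
import torch
import torch.nn as nn

# Hypothetical pooled corpus: 1-D series of very different lengths,
# frequencies and domains (retail demand, weather, web traffic, ...).
corpus = [torch.randn(torch.randint(200, 2000, (1,)).item()) for _ in range(1000)]

CONTEXT, HORIZON = 512, 64

def sample_batch(batch_size=32):
    """Draw random (context, target) windows from randomly chosen series."""
    xs, ys = [], []
    while len(xs) < batch_size:
        series = corpus[torch.randint(len(corpus), (1,)).item()]
        if len(series) < CONTEXT + HORIZON:
            continue
        start = torch.randint(len(series) - CONTEXT - HORIZON + 1, (1,)).item()
        xs.append(series[start:start + CONTEXT])
        ys.append(series[start + CONTEXT:start + CONTEXT + HORIZON])
    return torch.stack(xs), torch.stack(ys)

# Stand-in for "an arbitrarily large transformer": a tiny encoder + linear head
# (no positional encodings or probabilistic output, unlike the real models).
class TinyForecaster(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.proj = nn.Linear(1, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, HORIZON)

    def forward(self, x):                         # x: (batch, CONTEXT)
        h = self.encoder(self.proj(x.unsqueeze(-1)))
        return self.head(h[:, -1])                # forecast from the last position

model = TinyForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):                           # "train the model (and hope)"
    x, y = sample_batch()
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```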
the training set is broad enough, it should have reasonable performance on unseen data regardless of granularity, frequency, sparsity and perhaps even distribution.
2. You don't need to wait until you have enough data to train a model from scratch (e.g. ARIMA, though this might also apply to a global model such as XGBoost).
3. When the data starts to come in, you can fine-tune the zero-shot model to your domain (or for other purposes, e.g. conformal predictions).
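As an illustration of points 1 and 2, zero-shot forecasting with a published checkpoint takes only a few lines. The sketch below assumes the `chronos-forecasting` package and the `amazon/chronos-t5-small` checkpoint on Hugging Face, following the usage shown in the project's README (the API may differ across versions):

```python
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load a pretrained checkpoint: no training on your own data required.
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Any 1-D history works, regardless of domain or frequency.
context = torch.tensor([112., 118., 132., 129., 121., 135., 148., 148., 136., 119.])

# Returns samples of shape (num_series, num_samples, prediction_length),
# from which quantiles / prediction intervals can be computed.
forecast = pipeline.predict(context, prediction_length=12)
low, median, high = torch.quantile(
    forecast[0].float(), torch.tensor([0.1, 0.5, 0.9]), dim=0
)
```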
we're getting there:
• 🔗 Monash Time Series Forecasting Repository: the OG, <1B observations
• 🔗 LOTSA: introduced with Moirai, >27B
• 🔗 Timeseries-PILE: introduced with Moment, ~1.23B (though it contains Monash)
Unfortunately, unlike NLP, there are no datasets specifically designed and/or set aside for evaluation (think of GSM8K...).
Convex combinations of real data
⚬ Combinations of patterns (different AR processes, trends...)
• Found a positive effect (possibly due to increased diversity)
• Even a 10% share of synthetic data in the training mix appears to boost performance
[Figure from the Chronos paper; y axis: lower is better]
three steps:
• Apply mean scaling, quantise the time series and map each value to a token from a finite vocabulary
• Train an existing LLM architecture from scratch (the authors use T5 and GPT-2)
• At inference, multi-step autoregressive probabilistic predictions are de-scaled and de-quantised
[Figure: Chronos architecture]
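A rough numpy illustration of the first and third steps: mean-scale the series, clip and bucket values into a finite token vocabulary, then map token ids back to values. The bin count, clipping range and scaling details are simplified stand-ins, not the exact values from the paper.

```python
import numpy as np

N_BINS = 4094          # finite "vocabulary" of value tokens (illustrative size)
LIMIT = 15.0           # clip scaled values to [-LIMIT, LIMIT] before binning
EDGES = np.linspace(-LIMIT, LIMIT, N_BINS + 1)
CENTERS = (EDGES[:-1] + EDGES[1:]) / 2

def tokenize(series):
    """Mean-scale a series and map each value to a discrete bin id (token)."""
    scale = np.mean(np.abs(series)) + 1e-8
    scaled = np.clip(series / scale, -LIMIT, LIMIT)
    tokens = np.digitize(scaled, EDGES[1:-1])      # ids in [0, N_BINS - 1]
    return tokens, scale

def detokenize(tokens, scale):
    """Map predicted token ids back to real values (de-quantise, de-scale)."""
    return CENTERS[tokens] * scale

series = np.array([10., 12., 9., 14., 13., 11.])
tokens, scale = tokenize(series)
reconstructed = detokenize(tokens, scale)          # close to the original series
```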
web, weather, finance, and with sampling frequencies ranging from 5 minutes up to yearly", including the Monash repository, the M competitions and public Kaggle datasets. The datasets are used as follows:
• 13 datasets exclusively for training
• 15 datasets for training and in-domain evaluation
• 27 datasets exclusively for out-of-domain evaluation
randomly sampled real-world series
2. KernelSynth: combinations (additions/multiplications) of "fundamental time series patterns" (named kernels)
[Figure: schematics of TSMixup]
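A toy version of both ideas, for intuition only: the real KernelSynth samples and composes Gaussian-process kernels, whereas here a few hand-written pattern generators stand in for them, and all weights and lengths are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def ts_mixup(series_pool, k=3, length=256):
    """Convex combination of k randomly sampled (truncated) real series."""
    weights = rng.dirichlet(np.ones(k))            # non-negative, sum to 1
    picks = [series_pool[i][:length] for i in rng.integers(len(series_pool), size=k)]
    return sum(w * s for w, s in zip(weights, picks))

# Simple stand-ins for KernelSynth's "fundamental patterns" (kernels).
def kernel_bank(length=256):
    t = np.arange(length)
    return [
        np.sin(2 * np.pi * t / 24),                # daily-like seasonality
        0.01 * t,                                  # linear trend
        rng.normal(0, 0.1, length),                # noise
    ]

def kernel_synth(length=256, k=2):
    """Combine k random patterns by addition or multiplication."""
    patterns = kernel_bank(length)
    out = patterns[rng.integers(len(patterns))]
    for _ in range(k - 1):
        nxt = patterns[rng.integers(len(patterns))]
        out = out + nxt if rng.random() < 0.5 else out * nxt
    return out

pool = [rng.normal(size=256).cumsum() for _ in range(100)]   # fake "real" series
augmented = ts_mixup(pool)
synthetic = kernel_synth()
```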
exponential trends (might be due to underrepresentation in the training data). The model can also underestimate linear trends; in that case, use a longer sequence length.
[Figure: Chronos and linear trends]
as tokens via an MLP, with residual connections and positional encodings
• Randomly masking partial or full patches allows the model to adapt to any context length
• Multi-step autoregressive decoding with output patches of length 128, approaching one-shot forecasting of the full horizon
[Figure: architecture]
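A minimal sketch of the patching idea: split the context into fixed-length patches, zero out randomly masked ones, and embed each patch with an MLP plus a residual projection. The patch length, model width and masking rate below are assumptions for illustration, not the model's actual hyperparameters.

```python
import torch
import torch.nn as nn

PATCH_LEN = 32      # assumed input patch length (illustrative)
D_MODEL = 128       # assumed embedding width (illustrative)

class PatchEmbedding(nn.Module):
    """Turn a series into patch tokens via an MLP with a residual projection."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(PATCH_LEN, D_MODEL), nn.ReLU(), nn.Linear(D_MODEL, D_MODEL)
        )
        self.residual = nn.Linear(PATCH_LEN, D_MODEL)

    def forward(self, x, mask=None):
        # x: (batch, context_len), context_len a multiple of PATCH_LEN
        patches = x.unfold(1, PATCH_LEN, PATCH_LEN)          # (batch, n_patches, PATCH_LEN)
        if mask is not None:                                 # zero out masked patches
            patches = patches * (~mask).unsqueeze(-1)
        tokens = self.mlp(patches) + self.residual(patches)  # (batch, n_patches, D_MODEL)
        positions = torch.arange(tokens.shape[1])            # for positional encodings
        return tokens, positions

x = torch.randn(8, 512)                                      # batch of 8 series
mask = torch.rand(8, 512 // PATCH_LEN) < 0.2                 # mask ~20% of patches
tokens, positions = PatchEmbedding()(x, mask)
```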
• Google Trends & Wikipedia pageviews at various frequencies, for real-world data spanning all human interests
• Synthetic data for “time-series grammar”: ARMA generators, seasonal patterns, trends, step functions...
During training, an 80% real and 20% synthetic mix is sampled from all sources.
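In that spirit, a toy "time-series grammar" generator: an AR(2) process plus seasonality, a trend and a step change. All coefficients and the 80/20 split below are illustrative stand-ins, not the actual generators used for training.

```python
import numpy as np

rng = np.random.default_rng(1)

def synthetic_series(length=512):
    """AR(2) noise + weekly-like seasonality + trend + a random step change."""
    t = np.arange(length)

    # AR(2) process: x_t = 0.6 x_{t-1} - 0.2 x_{t-2} + eps_t
    x = np.zeros(length)
    eps = rng.normal(0, 1, length)
    for i in range(2, length):
        x[i] = 0.6 * x[i - 1] - 0.2 * x[i - 2] + eps[i]

    seasonality = 2.0 * np.sin(2 * np.pi * t / 7)
    trend = 0.02 * t
    step = 3.0 * (t > rng.integers(length))          # level shift at a random point

    return x + seasonality + trend + step

# 80/20 real/synthetic mix as described above (real data stubbed out here).
real = [rng.normal(size=512).cumsum() for _ in range(80)]
synthetic = [synthetic_series() for _ in range(20)]
training_pool = real + synthetic
```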
argued to perform generally better than a purely autoregressive mode. Indeed, the study shows that performance improves with longer outputs, although with diminishing returns.
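The difference between the two decoding modes, sketched with a stand-in `model` that maps a context window to its next values (the naive lambda is just a placeholder):

```python
import numpy as np

def forecast_autoregressive(model, context, horizon, output_len=32):
    """Predict output_len steps, append them to the context, repeat."""
    history = list(context)
    while len(history) < len(context) + horizon:
        step = model(np.asarray(history))[:output_len]
        history.extend(step)
    return np.asarray(history[len(context):len(context) + horizon])

def forecast_direct(model, context, horizon):
    """One-shot: a single call whose output patch covers the whole horizon."""
    return model(np.asarray(context))[:horizon]

# Dummy "model" for illustration: repeats the last observed value, 128 steps ahead.
naive = lambda ctx: np.full(128, ctx[-1])
context = np.sin(np.arange(200) / 10)
print(forecast_autoregressive(naive, context, horizon=96, output_len=32))
print(forecast_direct(naive, context, horizon=96))
```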
...is not as easy as it sounds:
• There’s no standard data collection for benchmarking used consistently across papers
• Proper leave-out datasets are also lacking: benchmark pollution is a real risk, as with LLMs
baseline naive model (not standard across papers)
• Weighted Quantile Loss (WQL) for probabilistic forecasts, also scaled by the baseline model
[Figure: evaluation of both point and probabilistic forecasts, from the Chronos paper]
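The point metric referenced above is presumably MASE (the usual choice, scaled by a naive baseline); a sketch of it and of WQL with the standard definitions (exact scaling and aggregation choices vary across papers, as noted):

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: MAE divided by the in-sample
    seasonal-naive MAE of the training history."""
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

def wql(y_true, quantile_preds, quantiles):
    """Weighted Quantile Loss: pinball losses averaged over quantile levels,
    normalised by the sum of absolute target values."""
    losses = []
    for q, pred in zip(quantiles, quantile_preds):
        diff = y_true - pred
        losses.append(2 * np.sum(np.maximum(q * diff, (q - 1) * diff)))
    return np.mean(losses) / np.sum(np.abs(y_true))

y_train = np.array([10., 12., 11., 13., 12., 14.])
y_true = np.array([15., 14., 16.])
y_pred = np.array([14., 15., 15.])
print(mase(y_true, y_pred, y_train))

quantiles = [0.1, 0.5, 0.9]
quantile_preds = [y_pred - 1, y_pred, y_pred + 1]
print(wql(y_true, quantile_preds, quantiles))
```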
evaluation: we need to take it into account if we want to fairly compare models that can be vastly different in scope (local vs global vs foundational).
[Figure: evaluation of fine-tuned Chronos T5, from the paper]
we’re starting to notice them
2. As we speak, efforts are being made to compile bigger and bigger collections of data
3. Synthetic data looks promising for teaching the model the most fundamental patterns
4. We’re at the early stages, with lots of room for improvement both in collecting more data and in creating better-suited architectures: keep an eye out for those
5. Don’t sleep on fine-tuning: it is still under-developed and could be key to success
to a lack of a standard leave-out data collection
2. Don't just pick the model that ranks highest on benchmarks: do your own testing on the frequency, domain and horizon you’re interested in
3. Publicly available time series are limited, and not nearly as ubiquitous as text
4. We’re not sure that performance will scale with data and model complexity