Machine Learning for Time Series Forecasting

Miguel Cabrera, Senior Data Scientist at NewYorker (@mfcabrera)
Photo by Joel Duncan on Unsplash

HELLO! I’m Miguel Cabrera, Data Scientist at NewYorker (@mfcabrera)

GOALS
➔ Understand the basics of time series and time series forecasting.
➔ Learn the different approaches to solving the problem with machine learning techniques.
➔ Learn some strategies and techniques commonly used to deal with these kinds of problems.

THE AGENDA FOR TODAY
➔ Introduction
➔ Models
➔ Practice

Basic Concepts Time Series and their properties

TIME SERIES AND FORECASTING
➔ Forecasting is needed in many situations.
➔ Most of the time you want to use previous experience to make assumptions about the future of a process.
➔ You have, at the very least, a time dimension and a variable of interest.
➔ Forecasts cover different horizons (short, medium, long).

DEMAND FORECASTING
Demand forecasting refers to predicting future demand (or sales), assuming that the factors which affected demand in the past and are affecting it in the present will still have an influence in the future. [1]
Chart: HISTORICAL (2018) vs. PREDICTIONS (2019)
Date          Sales
Feb 2018      3500
Mar 2018      3000
April 2018    2000
May 2018      500
Jun 2018      500
…             …
T             1000
T+1           ??
T+2           ??
T+3           ??
…             ??
T+n           ??

TIME SERIES - PATTERNS
➔ Trend
➔ Seasonality
➔ Cycle
Source: Hyndman, R.J., & Athanasopoulos, G. (2019). Forecasting: Principles and Practice, 3rd edition. OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2020-01-15.


TIME SERIES - STATIONARITY
Image source: “An introduction to time series analysis”, https://medium.com/swlh/an-introduction-to-time-series-analysis-ef1a9200717a
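
The stationarity figures are not part of this transcript. As a rough illustration only, differencing is one common way to push a series towards stationarity (it is what the "d" and "D" orders in (S)ARIMA refer to). The series below is made up for the example; the trend slope and seasonal pattern are assumptions.

```python
import numpy as np

# Hypothetical monthly series: linear trend plus a repeating 12-step pattern.
t = np.arange(48)
season = np.tile([5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 3], 4)
y = 100.0 + 2.0 * t + season

# First difference y[t] - y[t-1]: removes the linear trend,
# leaving a series that fluctuates around the slope.
d1 = np.diff(y)

# Seasonal difference y[t] - y[t-12]: the repeating pattern cancels,
# leaving only 12 steps of trend.
d12 = y[12:] - y[:-12]

print(d1.mean())
print(d12[:5])
```

After the seasonal difference the series is constant here, because the toy data has no noise; on real data the result would still fluctuate, and a test such as ADF is typically used to check it.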


RECAP - TIME SERIES CONCEPTS
➔ Trend, Seasonality and Cycle
➔ Stationarity
➔ Single-Step and Multi-Step forecasts
➔ Different Horizons
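
To make the single-step versus multi-step distinction concrete, here is a minimal sketch of how a series can be cut into (past window, future target) pairs. The helper name `make_windows` and the sample values are illustrative assumptions, not something taken from the talk.

```python
import numpy as np

def make_windows(series, n_lags, horizon):
    """Turn a 1-D series into (X, Y) pairs.

    X[i] holds n_lags consecutive past values and Y[i] holds the next
    `horizon` values. horizon=1 gives a single-step target.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_lags - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for this n_lags/horizon")
    X = np.stack([series[i:i + n_lags] for i in range(n_samples)])
    Y = np.stack([series[i + n_lags:i + n_lags + horizon]
                  for i in range(n_samples)])
    return X, Y

# Illustrative sales values (hypothetical, six consecutive periods).
sales = [3500, 3000, 2000, 500, 500, 1000]

X1, Y1 = make_windows(sales, n_lags=3, horizon=1)  # single-step targets
X2, Y2 = make_windows(sales, n_lags=3, horizon=2)  # two-step targets
print(X1.shape, Y1.shape)
print(X2.shape, Y2.shape)
```

With a multi-step target a model can either predict all steps at once (direct) or predict one step and feed it back in (recursive); this sketch only prepares the data for the direct case.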

Models

TIME SERIES FORECASTING - REQUIREMENTS
1. Non-stationary time series support
2. Multiple (many) time series
3. Multi-Step & Multi-Horizon
4. Cold start problem
5. Model interpretability
6. Model capability

TIME SERIES FORECASTING - SCORE CARD
Characteristic / Requirement      Score
Highly non-stationary
Multiple time series support
Multi-horizon forecast
Model interpretability
Model Capability
Computational Efficiency
Handle the cold-start problem

MODELING APPROACHES
• (S)ARIMA
• (G)ARCH
• Exponential Smoothing
• VAR
• FB Prophet
• Linear Regression
• SVM
• Gaussian Process
• Tree Based Models
• Random Forests
• Gradient Boosted Trees
• MLP
• RNN
• LSTM
• SEQ2SEQ


ARIMA
Auto-Regressive Integrated Moving Average
AR(p): Past Values
MA(q): Past Errors
ARIMA(p, d, q)
SARIMA(p, d, q)x(P, D, Q, m)
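
As a rough illustration of the AR part ("past values"), the sketch below simulates an AR(1) process and recovers its coefficient by regressing each value on the previous one. The coefficient 0.7, the sample size and the seed are assumptions chosen for the example; a full ARIMA fit also handles differencing and the MA terms, which this does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate AR(1): y[t] = phi * y[t-1] + noise[t], with an assumed phi.
phi_true = 0.7
n = 2000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + noise[t]

# Least-squares estimate of phi: regress y[t] on y[t-1] (no intercept).
prev, curr = y[:-1], y[1:]
phi_hat = float(prev @ curr / (prev @ prev))

# One-step-ahead forecast from the last observed value.
forecast = phi_hat * y[-1]
print(round(phi_hat, 2))
```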


ARIMA
• Study the ACF/PACF charts to determine the parameters, or use an automated algorithm.
• Seasonal pattern (strong correlation between observations one seasonal period apart).
• Model found by the algorithm: SARIMAX(1, 1, 1)x(0, 1, 1, 12)
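
The ACF chart itself is not in the transcript. As a minimal sketch of what it measures, the function below computes the sample autocorrelation with NumPy on a synthetic series with a period of 12 (an assumption standing in for monthly data with a yearly season). The helper name `acf` is made up for this example; libraries such as statsmodels provide their own implementations.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = float(x @ x)
    values = [1.0]
    for k in range(1, max_lag + 1):
        values.append(float(x[:-k] @ x[k:]) / denom)
    return np.array(values)

# Synthetic seasonal series: 20 full cycles of period 12.
t = np.arange(240)
y = np.sin(2 * np.pi * t / 12)

r = acf(y, 24)
print(round(r[12], 3))  # strong positive correlation one season apart
print(round(r[6], 3))   # strong negative correlation half a season apart
```

A pronounced spike at the seasonal lag is the kind of pattern that suggests adding seasonal terms to the model.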

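pmdarima SAMPLE

    import pmdarima as pm

    # ts: DataFrame with a "sales" column; n: number of periods to forecast.
    # Both are assumed to be defined already.
    model = pm.auto_arima(
        ts["sales"],
        start_p=1, start_q=1,
        test="adf",         # ADF test, used to choose 'd' when d is None
        max_p=3, max_q=3,   # maximum p and q
        m=12,               # frequency of the series
        d=1, D=1,           # fixed here; set d=None to let the test choose it
        seasonal=True,
        # ... other parameters
    )

    # Forecast n periods ahead with 95% confidence intervals.
    y_hat, conf = model.predict(n_periods=n, return_conf_int=True, alpha=0.05)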

LIBRARIES AND TOOLS
➔ pmdarima: https://alkaline-ml.com/pmdarima/
➔ sktime: https://alan-turing-institute.github.io/sktime/index.html

TIME SERIES MODELS
Characteristic / Requirement      Score
Highly non-stationary             Limited
Multiple time series              Limited
Multi-horizon forecast            Yes
Model interpretability            High
Model Capability                  Low
Computational Efficiency          Medium
Handle cold-starts                No
Figure: sample plots of fashion product sales.

MACHINE LEARNING
‣ Additional features in the model.
‣ One single model can handle many or all time series.
‣ Feature engineering is very important.

MACHINE LEARNING - FEATURES
SOURCE → EXTRACTION
Time Series: moving averages, statistics, lagged features
Product Attributes: category, brand, color, size, style, identifier
Time: day of week, month of year, number of week, season
Location: holiday, weather, macroeconomic information
ENCODING → FEATURES
Numerical, One Hot Encoding, Feature Hashing, Embeddings
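
A minimal sketch of a few of these feature families with pandas: lagged values and moving averages from the series, calendar features from the date, and one-hot encoding for a product attribute. The column names and numbers are invented for the example.

```python
import pandas as pd

# Hypothetical long-format sales data: two products, four months each.
df = pd.DataFrame({
    "product": ["A"] * 4 + ["B"] * 4,
    "category": ["dress"] * 4 + ["shirt"] * 4,
    "date": list(pd.date_range("2018-02-01", periods=4, freq="MS")) * 2,
    "sales": [3500, 3000, 2000, 500, 40, 50, 60, 70],
})

sales_by_product = df.groupby("product")["sales"]

# Lagged feature: previous period's sales of the same product.
df["lag_1"] = sales_by_product.shift(1)

# Moving average of the two previous periods. The shift keeps the
# current target value out of its own feature.
df["roll_mean_2"] = sales_by_product.transform(
    lambda s: s.shift(1).rolling(2).mean()
)

# Calendar features extracted from the date.
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek

# One-hot encode the categorical product attribute.
features = pd.get_dummies(df, columns=["category"])
print(features.head())
```

Shifting before computing rolling statistics is a design choice that matters: without it the feature would contain the value being predicted.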


MACHINE LEARNING - MODELS
LINEAR REGRESSION
Estimate the dependent variable (the target) as a linear expression of the features.
‣ Least Squares ‣ Ridge / Lasso ‣ Elastic Net ‣ ARIMA + X
TREE BASED
Use decision trees to learn the characteristics of the data to make predictions.
‣ Regression Tree ‣ Random Forest ‣ Gradient Boosting ‣ Catboost ‣ LightGBM ‣ XGBoost
SUPPORT VECTOR REGRESSION
Minimise the error within the support vector threshold, using a non-linear kernel to model non-linear relationships.
‣ NuSVR ‣ LibLinear ‣ LibSVM ‣ SKLearn


TREE BASED MODELS - REGRESSION TREES
[Figure: a regression tree that splits on X (X < 40, X < 20, X < 60, X < 5, …) with numeric predictions at the leaves (3.2, 5, 13.6, …, 8.5, 17.6).]
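
To show how a single split like "X < 40" can be chosen, here is a minimal sketch that scans candidate thresholds and keeps the one with the lowest squared error, predicting the mean on each side. The data and the helper name `best_split` are assumptions for illustration; real tree libraries repeat this recursively and far more efficiently.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold fits between equal values
        left, right = y[:i], y[i:]
        sse = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if best is None or sse < best[0]:
            threshold = (x[i - 1] + x[i]) / 2
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]

# Toy data with an obvious jump between x=20 and x=45.
x = [5, 10, 15, 20, 45, 50, 55, 60]
y = [3.0, 3.4, 3.2, 3.2, 13.5, 13.7, 13.6, 13.6]

threshold, left_mean, right_mean = best_split(x, y)
print(threshold, left_mean, right_mean)
```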

TREE BASED MODELS - RANDOM FOREST
• Bootstrap (random resampling with replacement)
• Independently trained trees
• Random feature selection at each split
• “Bagging”
• Parallel training
• Generates a wide variety of trees

TREE BASED MODELS - GRADIENT BOOSTED TREES
• Boosting
• Sequentially trained trees
• Resample with weights
• Important parameters:
  ◦ Learning rate
  ◦ Number of trees
  ◦ Depth
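
A minimal sketch of the boosting idea for squared error, assuming depth-1 trees (stumps): each new tree is fitted to the residuals of the current ensemble and added with a learning rate. This is a simplified illustration of the three parameters listed above, not how LightGBM or XGBoost are implemented; the function names and toy data are made up.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree to the residuals; return a predictor."""
    best = None
    for thr in np.unique(x)[:-1]:  # keep the right side non-empty
        left = x <= thr
        lm, rm = residual[left].mean(), residual[~left].mean()
        sse = (((residual[left] - lm) ** 2).sum()
               + ((residual[~left] - rm) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda z: np.where(z <= thr, lm, rm)

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Each stump fits the current residuals and is added with shrinkage."""
    base = y.mean()
    pred = np.full_like(y, base)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)
        pred = pred + learning_rate * stump(x)
        stumps.append(stump)
    return lambda z: base + learning_rate * sum(s(z) for s in stumps)

# Toy regression problem.
x = np.linspace(0.0, 10.0, 60)
y = np.sin(x)

def train_mse(model):
    return float(((model(x) - y) ** 2).mean())

mse_few = train_mse(boost(x, y, n_trees=5))
mse_many = train_mse(boost(x, y, n_trees=100))
print(mse_few, mse_many)
```

More trees lower the training error here; on real data the number of trees and the learning rate are usually tuned together against held-out data.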


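LightGBM SAMPLE

    import lightgbm as lgb
    from sklearn.model_selection import GridSearchCV

    # X_train, y_train: feature matrix and target, assumed to be defined already.
    estimator = lgb.LGBMRegressor(num_leaves=31)

    # Hyper-parameter grid to search over.
    param_grid = {
        "learning_rate": [0.01, 0.1, 1],
        "n_estimators": [20, 40],
    }

    # cv=3 uses plain 3-fold splits; for time-ordered data a time-aware
    # splitter such as sklearn's TimeSeriesSplit can be passed as cv instead.
    gbm = GridSearchCV(estimator, param_grid, cv=3)
    gbm.fit(X_train, y_train)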

MACHINE LEARNING
Characteristic / Requirement      Score
Highly non-stationary             Yes
Multiple time series              Yes
Multi-horizon forecast            Yes
Model interpretability            Medium
Model Capability                  Medium
Computational Efficiency          Medium
Handle cold-starts                Partially
➔ Requires expert knowledge
➔ Time-consuming feature engineering required
➔ Some features are difficult to capture
➔ Some methods might not be able to extrapolate

NEURAL NETWORKS AND DEEP LEARNING - MODELS
MULTILAYER PERCEPTRON
Fully connected multilayer artificial neural network.
LONG SHORT TERM MEMORY & RECURRENT NN
A type of recurrent neural network used for sequential learning. Cell states are updated by gates. Used for speech recognition, language models, translation, etc.
SEQ2SEQ
Encoder-decoder architecture. It uses two RNNs that work together to predict the next sequence from the previous sequence.
Image credits: https://github.com/ledell/sldm4-h2o/ and https://smerity.com/articles/2016/google_nmt_arch.html

NEURAL NETS - MLP
Image source: Faloutsos et al. 2019 [9]
• Hidden layers apply non-linearities
• Flexible general function estimator
• The larger and deeper the network, the more complex the functions it can represent
• Can learn complex relationships
• Needs data for training
• Careful tuning needed
• Feature engineering needed

DEEP LEARNING - MLP WITH EMBEDDINGS
Figure: the learned German state embedding mapped to a 2D space with t-SNE.
Source: Cheng Guo and Felix Berkhahn. 2016. [7]

RECURRENT NEURAL NETS
Source: Faloutsos et al. 2019 [9]

DeepAR - LSTM + AUTOREGRESSIVE
Source: Faloutsos et al. 2019 [9]



NEURAL NETS AND DEEP LEARNING
Characteristic / Requirement      Score
Highly non-stationary             Yes
Multiple time series              Yes
Multi-horizon forecast            Yes
Model interpretability            Low
Model Capability                  High
Computational Efficiency          Low
Cold start problem                No
➔ Very flexible approach
➔ Automated feature learning is more limited due to the lack of unlabelled data.
➔ Some feature engineering is still necessary.
➔ Poor model interpretability
➔ No clear consensus on which model (RNN, LSTM, CNN) works best.

MODELS - SUMMARY
TRADITIONAL MODELS
‣ Good model interpretability
‣ Limited model complexity to handle non-linear data
‣ Difficult to model multiple time series
‣ Difficult to integrate shared features across different time series
MACHINE LEARNING
‣ Flexible
‣ Can incorporate many features across the time series
‣ A lot of feature engineering required
NEURAL NETS AND DEEP LEARNING
‣ Very flexible
‣ Automated feature learning via embeddings
‣ Still some degree of feature engineering necessary
‣ Poor model interpretability
‣ Hard to train
‣ It is not clear which models or approaches are the best.

Practice

EVALUATION AND METRICS
Metric: Notes
MAE (mean absolute error): Intuitive
MAPE (mean absolute percentage error): Independent of the scale of measurement
SMAPE (symmetric mean absolute percentage error): Avoids the asymmetry of MAPE
MSE (mean squared error): Penalizes extreme errors
MSLE (mean squared logarithmic error): Large errors are not penalised significantly more than small ones
Quantile loss: Measures the predicted distribution
RMSPE (root mean squared percentage error): Independent of the scale of measurement
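
The formula column of this slide did not survive in the transcript. The sketch below uses common textbook definitions of each metric, which are an assumption about what the slide showed; SMAPE and MSLE in particular have several variants in the literature (the MSLE here uses log(1 + y)). The sample numbers are made up.

```python
import numpy as np

def mae(y, f):
    return float(np.mean(np.abs(y - f)))

def mape(y, f):
    # Undefined when y contains zeros.
    return float(np.mean(np.abs((y - f) / y)))

def smape(y, f):
    return float(np.mean(2.0 * np.abs(y - f) / (np.abs(y) + np.abs(f))))

def mse(y, f):
    return float(np.mean((y - f) ** 2))

def msle(y, f):
    return float(np.mean((np.log1p(y) - np.log1p(f)) ** 2))

def rmspe(y, f):
    return float(np.sqrt(np.mean(((y - f) / y) ** 2)))

def quantile_loss(y, f, q):
    # Pinball loss: under-forecasts are weighted by q, over-forecasts by 1 - q.
    diff = y - f
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

y = np.array([100.0, 200.0, 300.0])   # actuals (hypothetical)
f = np.array([110.0, 190.0, 330.0])   # forecasts (hypothetical)

for name, value in [("MAE", mae(y, f)), ("MAPE", mape(y, f)),
                    ("SMAPE", smape(y, f)), ("MSE", mse(y, f)),
                    ("MSLE", msle(y, f)), ("RMSPE", rmspe(y, f)),
                    ("Q0.9", quantile_loss(y, f, 0.9))]:
    print(name, round(value, 4))
```

With q = 0.5 the quantile loss is half the MAE, which is a quick sanity check for an implementation.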

TARGET VARIABLE TRANSFORMATION
➔ Box-Cox Transformation
➔ Log-Transformation
➔ Scaling / Normalization
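
As a minimal sketch of the first two transformations, written directly from the standard Box-Cox definition (the function names are made up for this example, and the lambda value is an arbitrary assumption; in practice lambda is usually estimated from the data, for instance with SciPy):

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform: log(y) if lam == 0, else (y**lam - 1) / lam."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox needs strictly positive values")
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inv_boxcox(z, lam):
    """Inverse of boxcox, used to bring predictions back to the original scale."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

sales = np.array([3500.0, 3000.0, 2000.0, 500.0, 500.0, 1000.0])

z = boxcox(sales, lam=0.5)
back = inv_boxcox(z, lam=0.5)

# A log transform that tolerates zero sales: log1p forward, expm1 back.
with_zero = np.array([0.0, 10.0, 100.0])
roundtrip = np.expm1(np.log1p(with_zero))
print(z.round(2), back.round(2), roundtrip)
```

Whatever transform is applied to the target during training, its inverse has to be applied to the predictions before computing metrics on the original scale.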


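TARGET VARIABLE TRANSFORMATION

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor

    # Wrap any regressor so that it is trained on log(y) and its
    # predictions are mapped back with exp. YourAwesomeRegressor is a
    # placeholder for the estimator of your choice.
    # np.log requires strictly positive targets.
    tt = TransformedTargetRegressor(
        regressor=YourAwesomeRegressor(),
        func=np.log,
        inverse_func=np.exp
    )
    ...
    tt.fit(X, y)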

CROSS VALIDATION
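
The figure for this slide is not in the transcript. One common time-aware scheme is an expanding window, where every training set ends before its test set begins; the sketch below is an assumption about the kind of scheme meant, and the helper name is made up. scikit-learn's `TimeSeriesSplit` offers a ready-made splitter along the same lines.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) index lists.

    The training window grows with each split and always ends right
    before the test window starts, so no future data leaks into training.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        start = first_test_start + k * test_size
        yield list(range(start)), list(range(start, start + test_size))

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```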

USEFUL PREDICTORS
‣ Trend or Sequence
‣ Seasonal Variables
‣ Intervention Variables
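
A minimal sketch of how these three kinds of predictors could be built with pandas for a monthly series. The date range and the promotion months used as the intervention are hypothetical.

```python
import pandas as pd

dates = pd.date_range("2018-01-01", periods=24, freq="MS")

# Trend or sequence: a simple counter 1, 2, 3, ...
X = pd.DataFrame({"trend": list(range(1, len(dates) + 1))}, index=dates)

# Intervention variable: 1 in months with a (hypothetical) promotion.
promo_months = pd.to_datetime(["2018-11-01", "2019-11-01"])
X["promo"] = X.index.isin(promo_months).astype(int)

# Seasonal variables: month-of-year dummies. One level is dropped so the
# dummies are not collinear with an intercept.
X["month"] = X.index.month
X = pd.get_dummies(X, columns=["month"], drop_first=True)

print(X.head())
```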

SUMMARY
➔ Time series forecasting has a lot of practical applications.
➔ Traditional methods might still be relevant in many use cases.
➔ Machine learning, in particular gradient boosting, seems to offer a good compromise between model capacity and interpretability.
➔ Feature engineering is key, and some is still necessary when using deep learning.
➔ Deep learning has not yet “cracked” time series forecasting, but recent models show promise.
➔ Avoid feature leakage by using a robust time series cross-validation approach.

REFERENCES
• [1] Choi, T. M., Hui, C. L., & Yu, Y. (2014). Intelligent Fashion Forecasting Systems: Models and Applications (pp. 1–194). Springer Berlin Heidelberg.
• [2] Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice, 2nd edition. OTexts: Melbourne, Australia. OTexts.com/fpp2. Accessed on 01.10.2019.
• [3] H&M, a Fashion Giant, Has a Problem: $4.3 Billion in Unsold Clothes.
• [4] Thomassey, S. (2014). Sales Forecasting in Apparel and Fashion Industry: A Review. In Intelligent Fashion Forecasting Systems: Models and Applications (pp. 9–27). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-39869-8_2
• [5] Box, G. E. P., & Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.
• [6] Autoregressive integrated moving average (ARIMA). https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average. Accessed: 2019-05-02.
• [7] Guo, C., & Berkhahn, F. (2016). Entity Embeddings of Categorical Variables. arXiv preprint arXiv:1604.06737.
• [8] Shen, Yuan, Wu and Pei. Data Science in Retail-as-a-Service Workshop. KDD 2018. London.
• [9] Faloutsos, C., Flunkert, V., Gasthaus, J., Januschowski, T., & Wang, Y. (2019). Forecasting Big Time Series: Theory and Practice. KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3209–3210. doi:10.1145/3292500.3332289
• [10] The M4 Competition: 100,000 time series and 61 forecasting methods [Makridakis et al., 2018].
• [11] Salinas, D., Flunkert, V., & Gasthaus, J. (2017). DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. Retrieved from http://arxiv.org/abs/1704.04110

IMAGE CREDITS
• decision tree by H Alberto Gongora from the Noun Project
• Ship Freight by ProSymbols from the Noun Project
• warehouse by ProSymbols from the Noun Project
• Store by AomAm from the Noun Project
• Neural Network by Knut M. Synstad from the Noun Project
• Tunic Dress by Vectors Market from the Noun Project
• sales by Kantor Tegalsari from the Noun Project
• time series by tom from the Noun Project
• fashion by Smalllike from the Noun Project
• Time by Anna Sophie from the Noun Project
• linear regression by Becris from the Noun Project
• Random Forest by Becris from the Noun Project
• SVM by sachin modgekar from the Noun Project
• production by Orin zuu from the Noun Project
• Auto by Graphic Tigers from the Noun Project
• Factory by Graphic Tigers from the Noun Project
• Express Delivery by Vectors Market from the Noun Project
• Stand Out by BomSymbols from the Noun Project
• Photo Credit: https://www.flickr.com/photos/157635012@N07/47981346167/ by Artem Beliaikin on Flickr via Compfight CC 2.0
• Photo Credit: https://www.flickr.com/photos/157635012@N07/48014587002/ by Artem Beliaikin on Flickr via Compfight CC 2.0
• regression analysis by Vectors Market from the Noun Project
• Research Experiment by Vectors Market from the Noun Project
• weather by Alice Design from the Noun Project
• Shirt by Ben Davis from the Noun Project
• fashion by Eat Bread Studio from the Noun Project
• renew by david from the Noun Project
• price by Adrien Coquet from the Noun Project
• requirements by ProSymbols from the Noun Project
• marketing by Gregor Cresnar from the Noun Project
• macroeconomic by priyanka from the Noun Project
• competition by Gregor Cresnar from the Noun Project

Thank you! Miguel Cabrera @mfcabrera

BACKUP SLIDES

AWS DeepAR - LSTM + AUTOREGRESSIVE
Source: Salinas, D., Flunkert, V., & Gasthaus, J. (2017). [11]

So what method should I use?

THE M4 COMPETITION
➔ January-May 2018
➔ 100,000 series of the following frequencies: monthly, quarterly, yearly, daily, weekly, and hourly.
➔ 95% of the series fall within the first 3 categories.
➔ The forecasting horizons varied, e.g., six for yearly, 18 for monthly, and 48 for hourly series.
➔ Point forecasts and prediction intervals were evaluated.
➔ No extra features / exogenous variables

THE M4 COMPETITION
➔ Winning solution: an RNN with an integrated exponential smoothing formula.
➔ Second: ensembles of classical solutions using sophisticated time series feature extraction.
➔ The dataset might not be the most representative, but it offers a baseline.

ATTENTION: LAG IS ALL YOU NEED
Winning entry of the Kaggle Wikipedia web traffic forecasting competition: https://github.com/Arturus/kaggle-web-traffic/blob/master/how_it_works.md

APPLICATIONS OF TIME SERIES FORECASTING
➔ Manufacturing
➔ Government services (budgeting)
➔ Supply chain and retail/commerce
➔ Workforce scheduling
➔ Cloud computing
➔ Website traffic prediction
➔ Healthcare

