Slide 1

Machine Learning for Time Series Forecasting
Miguel Cabrera, Senior Data Scientist at NewYorker, @mfcabrera
Photo by Joel Duncan on Unsplash

Slide 2

HELLO! I'm Miguel Cabrera, Data Scientist at NewYorker. @mfcabrera

Slide 3

(Image-only slide)

Slide 4

(Image-only slide)

Slide 5

GOALS
➔ Understand the basics of time series and time series forecasting
➔ Learn the different approaches to solving the problem using machine learning techniques
➔ Learn some strategies and techniques commonly used to deal with these kinds of problems

Slide 6

THE AGENDA FOR TODAY
➔ Introduction
➔ Models
➔ Practice

Slide 7

Basic Concepts
Time Series and their properties

Slide 8

TIME SERIES AND FORECASTING
➔ Forecasting is needed in many situations
➔ Most of the time you want to use your previous experience to make assumptions about the future
➔ You have, at the very least, a time dimension and a variable of interest
➔ Different horizons (short, medium, long)

Slide 9

DEMAND FORECASTING
Demand forecasting refers to predicting future demand (or sales), assuming that the factors which affected demand in the past, and are affecting it in the present, will still have an influence in the future. [1]

HISTORICAL (2018)         PREDICTIONS (2019)
Date        Sales         Date    Sales
Feb 2018    3500          T+1     ??
Mar 2018    3000          T+2     ??
Apr 2018    2000          T+3     ??
May 2018     500          …       ??
Jun 2018     500          T+n     ??
…            …
T           1000

Slide 10

TIME SERIES - PATTERNS
➔ Trend
➔ Seasonality
➔ Cycle
Source: Hyndman, R.J., & Athanasopoulos, G. (2019). Forecasting: Principles and Practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed 2020-01-15.
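
A classical decomposition makes these patterns visible in practice. A minimal sketch, assuming a hypothetical monthly pandas Series ts with a DatetimeIndex (e.g. sales):

from statsmodels.tsa.seasonal import seasonal_decompose

# Split the series into trend, seasonal and residual components
result = seasonal_decompose(ts, model="additive", period=12)
result.plot()  # panels for the observed, trend, seasonal and residual parts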

Slide 12

TIME SERIES - STATIONARITY
Image source: "An introduction to time series analysis": https://medium.com/swlh/an-introduction-to-time-series-analysis-ef1a9200717a
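
Stationarity can be checked with a unit-root test such as the Augmented Dickey-Fuller test, the same test auto_arima uses later (test="adf") to pick d. A sketch, assuming a hypothetical pandas Series ts:

from statsmodels.tsa.stattools import adfuller

# Null hypothesis: the series has a unit root (i.e. is non-stationary)
stat, p_value, *rest = adfuller(ts)
if p_value < 0.05:
    print("likely stationary: reject the unit-root null")
else:
    print("likely non-stationary: consider differencing")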

Slide 14

RECAP - TIME SERIES CONCEPTS
➔ Trend, seasonality and cycle
➔ Stationarity
➔ Single-step and multi-step forecasts
➔ Different horizons

Slide 15

Models

Slide 16

TIME SERIES FORECASTING - REQUIREMENTS
1. Non-stationary time series support
2. Multiple (many) time series
3. Multi-step & multi-horizon forecasts
4. Cold start problem
5. Model interpretability
6. Model capability

Slide 17

TIME SERIES FORECASTING - SCORE CARD
Each family of models is scored against these characteristics / requirements in the following slides:
➔ Highly non-stationary
➔ Multiple time series support
➔ Multi-horizon forecast
➔ Model interpretability
➔ Model capability
➔ Computational efficiency
➔ Handles the cold-start problem

Slide 18

MODELING APPROACHES
➔ Traditional: (S)ARIMA, (G)ARCH, Exponential Smoothing, VAR, FB Prophet
➔ Machine learning: Linear Regression, SVM, Gaussian Processes, tree-based models (Random Forests, Gradient Boosted Trees)
➔ Deep learning: MLP, RNN, LSTM, Seq2Seq

Slide 20

ARIMA - Auto-Regressive Integrated Moving Average
➔ AR(p): regression on past values
➔ MA(q): regression on past errors
➔ I(d): differencing applied d times
ARIMA(p, d, q); seasonal variant: SARIMA(p, d, q)×(P, D, Q)m
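
In equation form (standard ARIMA notation, as in Hyndman & Athanasopoulos [2]; here y'_t denotes the series after differencing d times):

y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t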

Slide 21

ARIMA

Slide 22

ARIMA
• Study the ACF/PACF charts to determine the parameters, or use an automated algorithm.
• Seasonal pattern (strong correlation between y_t and y_{t-12})
• Algorithm found: SARIMAX(1, 1, 1)×(0, 1, 1, 12)
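
The ACF/PACF study can be done with statsmodels. A minimal sketch, assuming the same hypothetical ts["sales"] series as the pmdarima sample on the next slide:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Spikes at lags 12, 24, ... in the ACF point to a yearly seasonal pattern
plot_acf(ts["sales"], lags=36)
plot_pacf(ts["sales"], lags=36)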

Slide 23

pmdarima SAMPLE

import pmdarima as pm

model = pm.auto_arima(
    ts["sales"],
    start_p=1, start_q=1,
    test="adf",        # use the ADF test to find the optimal 'd'
    max_p=3, max_q=3,  # maximum p and q
    m=12,              # frequency (seasonal period) of the series
    d=1, D=1,
    seasonal=True,
    # ... other parameters
)

y_hat, conf = model.predict(
    n_periods=n,       # forecast horizon
    return_conf_int=True,
    alpha=0.05,
)

Slide 24

LIBRARIES AND TOOLS
➔ pmdarima: https://alkaline-ml.com/pmdarima/
➔ sktime: https://alan-turing-institute.github.io/sktime/index.html

Slide 25

TIME SERIES MODELS
➔ Highly non-stationary: Limited
➔ Multiple time series: Limited
➔ Multi-horizon forecast: Yes
➔ Model interpretability: High
➔ Model capability: Low
➔ Computational efficiency: Medium
➔ Handles cold-starts: No
(Figure: sample plots of fashion product sales)

Slide 26

MACHINE LEARNING
‣ Additional features in the model.
‣ One single model can handle many or all time series.
‣ Feature engineering is very important.

Slide 27

MACHINE LEARNING - FEATURES
Sources and extracted features:
➔ Time series: moving averages, statistics, lagged features
➔ Product attributes: category, brand, color, size, style, identifier
➔ Time: day of week, month of year, week number, season
➔ Location and external context: holiday, weather, macroeconomic information
Encodings: numerical, one-hot encoding, feature hashing, embeddings
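
As a concrete illustration of the time-series and calendar sources, a sketch of lag, rolling and calendar features with pandas (df is a hypothetical frame with a datetime "date" column and a "sales" column for a single product):

import pandas as pd

df = df.sort_values("date")

# Lagged features: sales from previous periods
for lag in (1, 7, 28):
    df[f"sales_lag_{lag}"] = df["sales"].shift(lag)

# Rolling statistics, shifted by one step so the current target never leaks in
df["sales_ma_7"] = df["sales"].shift(1).rolling(7).mean()
df["sales_std_28"] = df["sales"].shift(1).rolling(28).std()

# Calendar features
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["week"] = df["date"].dt.isocalendar().week  # pandas >= 1.1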

Slide 29

MACHINE LEARNING - MODELS
LINEAR REGRESSION: Estimate the dependent variable as a linear combination of the features.
‣ Least Squares ‣ Ridge / Lasso ‣ Elastic Net ‣ ARIMA + X
TREE BASED: Use decision trees to learn the characteristics of the data and make predictions.
‣ Regression Tree ‣ Random Forest ‣ Gradient Boosting (CatBoost, LightGBM, XGBoost)
SUPPORT VECTOR REGRESSION: Minimise the error within the support vector threshold, using a non-linear kernel to model non-linear relationships.
‣ NuSVR ‣ LibLinear ‣ LibSVM ‣ scikit-learn

Slide 31

TREE BASED MODELS - REGRESSION TREES
(Diagram: a regression tree splitting on thresholds such as X < 40, X < 20, X < 60 and X < 5, with constant predictions at the leaves, e.g. 3.2, 5, 13.6, 8.5 and 17.6.)

Slide 32

TREE BASED MODELS - RANDOM FOREST
● Bootstrap (random resampling with replacement)
● Independent classifiers
● Random feature selection at each split
● "Bagging"
● Parallel training
● Generates a wide variety of trees
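
These properties map directly onto scikit-learn's implementation. A minimal sketch on hypothetical X_train / y_train / X_test data:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,     # number of trees built on bootstrap resamples (bagging)
    max_features="sqrt",  # random feature subset considered at each split
    n_jobs=-1,            # trees are independent, so they train in parallel
)
rf.fit(X_train, y_train)
y_hat = rf.predict(X_test)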

Slide 33

TREE BASED MODELS - GRADIENT BOOSTED TREES
● Boosting
● Sequential classifiers
● Resampling with weights
● Important parameters: learning rate, number of trees, depth

Slide 34

GRADIENT BOOSTING - LIBRARIES

Slide 35

LightGBM SAMPLE

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

estimator = lgb.LGBMRegressor(num_leaves=31)

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40],
}

# 3-fold grid search over the learning rate and the number of trees;
# for time series, prefer a time-aware split (see CROSS VALIDATION below)
gbm = GridSearchCV(estimator, param_grid, cv=3)
gbm.fit(X_train, y_train)

Slide 36

MACHINE LEARNING
➔ Highly non-stationary: Yes
➔ Multiple time series: Yes
➔ Multi-horizon forecast: Yes
➔ Model interpretability: Medium
➔ Model capability: Medium
➔ Computational efficiency: Medium
➔ Handles cold-starts: Partially
Caveats:
➔ Requires expert knowledge
➔ Time-consuming feature engineering required
➔ Some features are difficult to capture
➔ Some methods might not be able to extrapolate

Slide 37

NEURAL NETWORKS AND DEEP LEARNING - MODELS
MULTILAYER PERCEPTRON: A fully connected multilayer artificial neural network.
LONG SHORT-TERM MEMORY & RECURRENT NN: A type of recurrent neural network used for sequential learning; cell states are updated by gates. Used for speech recognition, language models, translation, etc.
SEQ2SEQ: An encoder-decoder architecture; two RNNs work together to predict the next sequence from the previous one.
Image credits: https://github.com/ledell/sldm4-h2o/ and https://smerity.com/articles/2016/google_nmt_arch.html

Slide 38

NEURAL NETS - MLP
Image source: Faloutsos et al. 2019 [9]

Slide 40

NEURAL NETS - MLP
Image source: Faloutsos et al. 2019 [9]
● Hidden layers provide the non-linearities
● Flexible, general function estimator
● The larger and deeper the network, the more complex the functions it can represent
● Can learn complex relationships
● Needs data for training
● Careful tuning needed
● Feature engineering needed
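
A minimal MLP regression sketch with scikit-learn (hypothetical X_train / y_train engineered features; an illustration, not the model from [9]):

from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    hidden_layer_sizes=(64, 32),  # two hidden layers: the non-linear part
    activation="relu",
    learning_rate_init=1e-3,      # one of the knobs that needs careful tuning
    max_iter=500,
)
mlp.fit(X_train, y_train)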

Slide 41

DEEP LEARNING - MLP WITH EMBEDDINGS
(Figure: the learned German state embeddings mapped to a 2D space with t-SNE.)
Source: Guo, C., & Berkhahn, F. (2016) [7]
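
The idea can be sketched with the Keras functional API: a categorical id goes through an Embedding layer and is concatenated with the numerical features. A simplified sketch, not the exact architecture of [7]; all shapes are hypothetical:

import tensorflow as tf
from tensorflow.keras import layers

state_in = layers.Input(shape=(1,), name="state")   # categorical id in [0, 16)
num_in = layers.Input(shape=(10,), name="numeric")  # 10 numerical features

# 16 categories embedded into a 2-dimensional learned representation
state_emb = layers.Embedding(input_dim=16, output_dim=2)(state_in)
x = layers.Concatenate()([layers.Flatten()(state_emb), num_in])
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1)(x)

model = tf.keras.Model([state_in, num_in], out)
model.compile(optimizer="adam", loss="mae")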

Slide 42

RECURRENT NEURAL NETS
Source: Faloutsos et al. 2019 [9]

Slide 43

DeepAR - LSTM + AUTOREGRESSIVE
Source: Faloutsos et al. 2019 [9]

Slide 45

NN & DL - TOOLS
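
Toolkits such as GluonTS ship a DeepAR implementation [11]. A sketch assuming the GluonTS API around version 0.4 (module paths have moved in later releases); sales_values is a hypothetical array of monthly observations:

from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

training_data = ListDataset(
    [{"start": "2017-01-01", "target": sales_values}], freq="M"
)
estimator = DeepAREstimator(
    freq="M", prediction_length=6, trainer=Trainer(epochs=10)
)
predictor = estimator.train(training_data=training_data)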

Slide 46

NEURAL NETS AND DEEP LEARNING
➔ Highly non-stationary: Yes
➔ Multiple time series: Yes
➔ Multi-horizon forecast: Yes
➔ Model interpretability: Low
➔ Model capability: High
➔ Computational efficiency: Low
➔ Handles cold-starts: No
Caveats:
➔ Very flexible approach
➔ Automated feature learning is more limited due to the lack of unlabelled data
➔ Some feature engineering is still necessary
➔ Poor model interpretability
➔ No clear consensus on which model (RNN, LSTM, CNN) works best

Slide 47

MODELS - SUMMARY
TRADITIONAL MODELS
‣ Good model interpretability
‣ Limited model complexity to handle non-linear data
‣ Difficult to model multiple time series
‣ Difficult to integrate shared features across different time series
MACHINE LEARNING
‣ Flexible
‣ Can incorporate many features across the time series
‣ A lot of feature engineering required
NEURAL NETS AND DEEP LEARNING
‣ Very flexible
‣ Automated feature learning via embeddings
‣ Still some degree of feature engineering necessary
‣ Poor model interpretability
‣ Hard to train
‣ It is not clear which model or approach is best

Slide 48

Practice

Slide 49

EVALUATION AND METRICS
➔ MAE (mean absolute error): mean(|y − ŷ|). Intuitive.
➔ MAPE (mean absolute percentage error): 100 · mean(|y − ŷ| / |y|). Independent of the scale of measurement.
➔ SMAPE (symmetric mean absolute percentage error): 100 · mean(2|y − ŷ| / (|y| + |ŷ|)). Avoids the asymmetry of MAPE.
➔ MSE (mean squared error): mean((y − ŷ)²). Penalizes extreme errors.
➔ MSLE (mean squared logarithmic error): mean((log(1 + y) − log(1 + ŷ))²). Large errors are not penalised significantly more than small ones.
➔ Quantile loss: mean(max(q · (y − ŷ), (q − 1) · (y − ŷ))). Measures the distribution.
➔ RMSPE (root mean squared percentage error): sqrt(mean(((y − ŷ) / y)²)). Independent of the scale of measurement.
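
A few of these metrics implemented directly in NumPy (straightforward sketches of the formulas above):

import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    return 100 * np.mean(np.abs((y - y_hat) / y))  # undefined when y contains zeros

def smape(y, y_hat):
    # symmetric variant; undefined where both y and y_hat are zero
    return 100 * np.mean(2 * np.abs(y_hat - y) / (np.abs(y) + np.abs(y_hat)))

def msle(y, y_hat):
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)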

Slide 50

TARGET VARIABLE TRANSFORMATION
➔ Box-Cox transformation
➔ Log transformation
➔ Scaling / normalization

Slide 53

TARGET VARIABLE TRANSFORMATION

import numpy as np
from sklearn.compose import TransformedTargetRegressor

# Train on log(y) and map predictions back with exp;
# YourAwesomeRegressor is a placeholder for any scikit-learn regressor
tt = TransformedTargetRegressor(
    regressor=YourAwesomeRegressor(),
    func=np.log,
    inverse_func=np.exp,
)
...
tt.fit(X, y)

Slide 54

CROSS VALIDATION
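
scikit-learn's TimeSeriesSplit gives expanding-window splits in which training data always precedes validation data, which avoids leaking the future into the model. A minimal sketch:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # hypothetical time-ordered data
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx, "validate:", val_idx)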

Slide 55

USEFUL PREDICTORS
‣ Trend or sequence
‣ Seasonal variables
‣ Intervention variables
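
A sketch of all three predictor types with pandas (df is a hypothetical frame indexed by date and sorted in time; promo_dates is a hypothetical list of intervention dates):

import numpy as np
import pandas as pd

# Trend / sequence index
df["t"] = np.arange(len(df))

# Seasonal dummy variables: one column per month
dummies = pd.get_dummies(df.index.month, prefix="month")
dummies.index = df.index
df = df.join(dummies)

# Intervention variable: 1 on promotion days, 0 otherwise
df["promo"] = df.index.isin(promo_dates).astype(int)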

Slide 58

SUMMARY
➔ Time series forecasting has a lot of practical applications
➔ Traditional methods might still be relevant in many use cases
➔ Machine learning, in particular gradient boosting, seems to offer a good compromise between model capacity and interpretability
➔ Feature engineering is key, and (some) is still necessary when using deep learning
➔ Deep learning has not yet "cracked" time series forecasting, but recent models show promise
➔ Avoid feature leakage by using a robust time series cross-validation approach

Slide 59

REFERENCES
• [1] Choi, T. M., Hui, C. L., & Yu, Y. (Eds.) (2014). Intelligent Fashion Forecasting Systems: Models and Applications (pp. 1–194). Springer Berlin Heidelberg.
• [2] Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2. Accessed 01.10.2019.
• [3] "H&M, a Fashion Giant, Has a Problem: $4.3 Billion in Unsold Clothes." The New York Times, 2018.
• [4] Thomassey, S. (2014). Sales Forecasting in Apparel and Fashion Industry: A Review. In Intelligent Fashion Forecasting Systems: Models and Applications (pp. 9–27). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-39869-8_2
• [5] Box, G. E. P., & Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.
• [6] Autoregressive integrated moving average (ARIMA). https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average. Accessed 2019-05-02.
• [7] Guo, C., & Berkhahn, F. (2016). Entity Embeddings of Categorical Variables. arXiv preprint arXiv:1604.06737.
• [8] Shen, Yuan, Wu, & Pei. Data Science in Retail-as-a-Service Workshop. KDD 2018, London.
• [9] Faloutsos, C., Flunkert, V., Gasthaus, J., Januschowski, T., & Wang, Y. (2019). Forecasting Big Time Series: Theory and Practice. KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3209–3210. https://doi.org/10.1145/3292500.3332289
• [10] Makridakis, S., et al. (2018). The M4 Competition: 100,000 time series and 61 forecasting methods.
• [11] Salinas, D., Flunkert, V., & Gasthaus, J. (2017). DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. http://arxiv.org/abs/1704.04110

Slide 60

IMAGE CREDITS
● decision tree by H Alberto Gongora from the Noun Project
● Ship Freight by ProSymbols from the Noun Project
● warehouse by ProSymbols from the Noun Project
● Store by AomAm from the Noun Project
● Neural Network by Knut M. Synstad from the Noun Project
● Tunic Dress by Vectors Market from the Noun Project
● sales by Kantor Tegalsari from the Noun Project
● time series by tom from the Noun Project
● fashion by Smalllike from the Noun Project
● Time by Anna Sophie from the Noun Project
● linear regression by Becris from the Noun Project
● Random Forest by Becris from the Noun Project
● SVM by sachin modgekar from the Noun Project
● production by Orin zuu from the Noun Project
● Auto by Graphic Tigers from the Noun Project
● Factory by Graphic Tigers from the Noun Project
● Express Delivery by Vectors Market from the Noun Project
● Stand Out by BomSymbols from the Noun Project
● Photo Credit: https://www.flickr.com/photos/157635012@N07/47981346167/ by Artem Beliaikin on Flickr via Compfight CC 2.0
● Photo Credit: https://www.flickr.com/photos/157635012@N07/48014587002/ by Artem Beliaikin on Flickr via Compfight CC 2.0
● regression analysis by Vectors Market from the Noun Project
● Research Experiment by Vectors Market from the Noun Project
● weather by Alice Design from the Noun Project
● Shirt by Ben Davis from the Noun Project
● fashion by Eat Bread Studio from the Noun Project
● renew by david from the Noun Project
● price by Adrien Coquet from the Noun Project
● requirements by ProSymbols from the Noun Project
● marketing by Gregor Cresnar from the Noun Project
● macroeconomic by priyanka from the Noun Project
● competition by Gregor Cresnar from the Noun Project

Slide 61

Gracias! (Thank you!) Miguel Cabrera @mfcabrera

Slide 62

BACKUP SLIDES

Slide 63

AWS DeepAR - LSTM + AUTOREGRESSIVE
Source: Salinas, D., Flunkert, V., & Gasthaus, J. (2017) [11]

Slide 64

So what method should I use?

Slide 65

THE M4 COMPETITION
➔ January-May 2018
➔ 100,000 series of the following frequencies: monthly, quarterly, yearly, daily, weekly, and hourly
➔ 95% of series fall within the first three categories
➔ The forecasting horizons varied, e.g., six for yearly, 18 for monthly, and 48 for hourly series
➔ Point forecasts and prediction intervals were evaluated
➔ No extra features / exogenous variables

Slide 66

THE M4 COMPETITION
➔ Winning solution: an RNN with an integrated exponential smoothing formula
➔ Second place: ensembles of classical methods using sophisticated time series feature extraction
➔ The dataset might not be the most representative, but it offers a baseline

Slide 67

ATTENTION: LAG IS ALL YOU NEED
Winning entry of the Kaggle Wikipedia web-traffic forecasting competition: https://github.com/Arturus/kaggle-web-traffic/blob/master/how_it_works.md

Slide 68

APPLICATIONS OF TIME SERIES FORECASTING
➔ Manufacturing
➔ Government services (budgeting)
➔ Supply chain and retail/commerce
➔ Workforce scheduling
➔ Cloud computing
➔ Website traffic prediction
➔ Healthcare

Slide 69

MODEL SUMMARY