Machine Learning for Time Series Forecasting

PyCon Colombia - Medellin February 2020

Miguel Cabrera

February 08, 2020
Transcript

  1. Machine Learning for Time Series Forecasting
    Miguel Cabrera
    Senior Data Scientist at NewYorker @mfcabrera
    Photo by Joel Duncan on Unsplash

  2. HELLO!
    I’m Miguel Cabrera
    Data Scientist at NewYorker
    @mfcabrera

  3. GOALS
    ➔ Understand the basics of time series and time series forecasting
    ➔ Learn the different approaches to solving the problem with machine learning
    techniques
    ➔ Learn some strategies and techniques commonly used to deal with this kind of
    problem

  4. THE AGENDA FOR TODAY
    INTRODUCTION, MODELS, PRACTICE

  5. Basic Concepts
    Time Series and their properties

  6. TIME SERIES AND FORECASTING
    ➔ Forecasting is needed in many situations
    ➔ Most of the time you want to use past experience to make assumptions
    about how the process will behave in the future
    ➔ You have at the very least a time dimension and a variable of interest
    ➔ Different horizons (Short, Medium, Long)

  7. DEMAND FORECASTING
    Demand forecasting refers to predicting future demand (or sales), assuming that
    the factors which affected demand in the past, and are affecting it in the
    present, will still have an influence in the future. [1]
    Date         Sales
    Feb 2018     3500
    Mar 2018     3000
    April 2018   2000
    May 2018     500
    Jun 2018     500
    …            …
    T            1000
    T+1          ??
    T+2          ??
    T+3          ??
    …            ??
    T+n          ??
    Rows up to T are HISTORICAL values; rows T+1 to T+n are the PREDICTIONS to be made.
    [Figure: sales over time, 2018 to 2019, split into historical and predicted regions]

  8. TIME SERIES - PATTERNS
    Source: Hyndman, R.J., & Athanasopoulos, G. (2019) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2020-01-15
    ➔ Trend
    ➔ Seasonality
    ➔ Cycle
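The trend and seasonality patterns listed above can be separated with a classical additive decomposition. The following is a minimal sketch under stated assumptions: the series is synthetic, the function name `decompose_additive` is hypothetical, and a centered moving average is just one simple way to estimate the trend (libraries such as statsmodels offer more complete implementations).

```python
import numpy as np

def decompose_additive(y, period):
    """Split y into trend, seasonal and remainder parts (additive model)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # A centered moving average over one full period estimates the trend.
    # For an even period, half weights at both ends keep the window centered.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # undefined at the edges of the series
    trend[half:n - half] = np.convolve(y, weights, mode="valid")

    # Average the detrended values at each position within the period.
    detrended = y - trend
    seasonal_means = np.array(
        [np.nanmean(detrended[k::period]) for k in range(period)]
    )
    seasonal_means -= seasonal_means.mean()  # seasonal effects sum to zero
    seasonal = np.tile(seasonal_means, n // period + 1)[:n]

    remainder = y - trend - seasonal
    return trend, seasonal, remainder

# Synthetic monthly series: linear trend plus a yearly seasonal pattern.
t = np.arange(72)
y = 100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12)
trend, seasonal, remainder = decompose_additive(y, period=12)
```

On this noise-free example the estimated trend recovers the straight line and the seasonal component recovers the sine wave; on real sales data the remainder would carry noise and any cyclic behaviour that is not tied to a fixed period.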

  10. TIME SERIES - STATIONARITY
    Image Source: “An introduction to time series analysis”: https://medium.com/swlh/an-introduction-to-time-series-analysis-ef1a9200717a
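A stationary series has statistical properties, such as its mean, that do not change over time. As a rough illustration, the sketch below builds a hypothetical trending series and shows that first-order differencing removes the drift in the mean. The data, the seed and the helper `half_means` are assumptions made for this example; a formal check would use a unit-root test such as ADF.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
# Hypothetical series: linear trend plus noise, so the mean drifts over time.
y = 0.5 * t + rng.normal(scale=1.0, size=t.size)

# First-order differencing: dy[t] = y[t+1] - y[t]
dy = np.diff(y)

def half_means(x):
    """Mean of the first half and of the second half of a series."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

print(half_means(y))   # the two halves have very different means
print(half_means(dy))  # the two halves have similar means, close to the slope
```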

  12. RECAP - TIME SERIES CONCEPTS
    ➔ Trend, Seasonality and Cycle
    ➔ Stationarity
    ➔ Single-Step and Multi-step forecasts
    ➔ Different horizons
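One common way to turn a single-step model into a multi-step forecast is the recursive strategy: predict one step, append the prediction to the history, and repeat. The sketch below assumes a plain autoregressive model fitted by least squares; the function names and the toy series are illustrative and not taken from the talk.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of y_t = c + a_1*y_{t-1} + ... + a_p*y_{t-p}."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # The row for time t holds [1, y_{t-1}, ..., y_{t-p}].
    X = np.column_stack(
        [np.ones(n - p)] + [y[p - k:n - k] for k in range(1, p + 1)]
    )
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, a_1, ..., a_p]

def forecast_recursive(y, coef, steps):
    """Multi-step forecast: feed each one-step prediction back as an input."""
    p = len(coef) - 1
    history = list(np.asarray(y, dtype=float)[-p:])
    out = []
    for _ in range(steps):
        lags = history[::-1][:p]  # most recent value first
        y_next = coef[0] + float(np.dot(coef[1:], lags))
        out.append(y_next)
        history.append(y_next)
    return np.array(out)

# Toy series generated by y_t = 1 + 0.5 * y_{t-1}, starting at 10.
series = [10.0]
for _ in range(19):
    series.append(1 + 0.5 * series[-1])

coef = fit_ar(series, p=1)
forecast = forecast_recursive(series, coef, steps=3)
```

Errors accumulate with this strategy because later steps depend on earlier predictions; training one model per horizon (the direct strategy) is a frequently used alternative.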

  13. TIME SERIES FORECASTING - REQUIREMENTS
    1. Non-stationary time series support
    2. Multiple (many) time series
    3. Multi-Step & Multi-Horizon
    4. Cold start problem
    5. Model interpretability
    6. Model capability

  14. TIME SERIES FORECASTING - SCORE CARD
    Characteristic / Requirement Score
    Highly non-stationary
    Multiple time series support
    Multi-horizon forecast
    Model interpretability
    Model Capability
    Computational Efficiency
    Handle the cold-start problem

  15. MODELING APPROACHES
    • (S)ARIMA
    • (G)ARCH
    • Exponential Smoothing
    • VAR
    • FB Prophet
    • Linear Regression
    • SVM
    • Gaussian Process
    • Tree Based Models
    • Random Forests
    • Gradient Boosted Trees
    • MLP
    • RNN
    • LSTM
    • SEQ2SEQ

  17. ARIMA
    Auto-Regressive Integrated Moving Average
    AR(p): past values
    MA(q): past errors
    ARIMA(p, d, q)
    SARIMA(p, d, q)x(P, D, Q, m)
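For reference, a common textbook way to write the ARIMA(p, d, q) model is shown below. The symbols are assumptions of this sketch rather than notation from the slides: y'_t is the series after differencing d times, c is a constant, the phi terms weight the p past values, the theta terms weight the q past errors, and epsilon_t is white noise.

```latex
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
```

The seasonal variant adds a second set of autoregressive, differencing and moving-average terms (P, D, Q) that operate at lags that are multiples of the seasonal period m.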

  18. ARIMA
    • Study the ACF/PACF charts to determine the parameters, or use an automated algorithm.
    • Seasonal pattern (strong correlation between observations one seasonal period apart)
    • Model found by the algorithm: SARIMAX(1, 1, 1)x(0, 1, 1, 12)

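  19. pmdarima SAMPLE
    import pmdarima as pm

    # ts is assumed to be a DataFrame with a monthly "sales" column
    model = pm.auto_arima(
        ts["sales"],
        start_p=1,
        start_q=1,
        max_p=3,
        max_q=3,       # maximum p and q
        m=12,          # seasonal period (12 for monthly data)
        seasonal=True,
        d=1,           # fixed here; set d=None to let the test choose d
        D=1,           # fixed seasonal differencing order
        test="adf",    # ADF test, only used to select d when d=None
        # ... other parameters
    )

    n = 12  # number of periods to forecast
    y_hat, conf = model.predict(
        n_periods=n,
        return_conf_int=True,
        alpha=0.05,    # 95% confidence intervals
    )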

  20. LIBRARIES AND TOOLS
    pmdarima: https://alkaline-ml.com/pmdarima/
    sktime: https://alan-turing-institute.github.io/sktime/index.html

  21. TIME SERIES MODELS
    Characteristic / Requirement    Score
    Highly non-stationary           Limited
    Multiple time series            Limited
    Multi-horizon forecast          Yes
    Model interpretability          High
    Model Capability                Low
    Computational Efficiency        Medium
    Handle cold-starts              No
    [Figure: sample plots of fashion product sales]

  22. MACHINE LEARNING
    ‣ Additional features in the model.
    ‣ One single model can handle many or all time series.
    ‣ Feature Engineering is very important.

  23. MACHINE LEARNING - FEATURES
    SOURCE and EXTRACTION
    Time Series: moving averages, statistics, lagged features
    Product Attributes: category, brand, color, size, style, identifier
    Time: day of week, month of year, number of week, season
    Location: holiday, weather, macroeconomic information
    ENCODING
    Numerical, One Hot Encoding, Feature Hashing, Embeddings
    Result: FEATURES
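A minimal sketch of how some of these features could be built with pandas is given below. The data, the column names and the window sizes are hypothetical choices for illustration only. Note that the time series features are shifted so that each row only sees values from earlier days.

```python
import pandas as pd

# Hypothetical daily sales for a single product; names are illustrative.
df = pd.DataFrame(
    {"sales": [12, 15, 11, 18, 20, 17, 25, 13, 16, 12, 19, 22, 18, 27]},
    index=pd.date_range("2020-01-06", periods=14, freq="D"),
)

# Time features extracted from the timestamp.
df["day_of_week"] = df.index.dayofweek  # Monday = 0
df["month"] = df.index.month

# Time series features. Everything is shifted by at least one step so the
# row for day t only uses sales observed up to day t-1 (no target leakage).
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
df["rolling_mean_7"] = df["sales"].shift(1).rolling(window=7).mean()

# One-hot encoding of a categorical feature.
df = pd.get_dummies(df, columns=["day_of_week"], prefix="dow")
```

Rows at the start of the series contain missing values for the lagged and rolling columns; they are usually dropped or imputed before training.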

  24. MACHINE LEARNING - FEATURES
    SOURCE ENCODING

  25. MACHINE LEARNING - MODELS
    LINEAR REGRESSION
    Estimate the dependent variable as a linear combination of the features.
    ‣ Least Squares
    ‣ Ridge / Lasso
    ‣ Elastic Net
    ‣ ARIMA + X
    TREE BASED
    Use decision trees to learn the characteristics of the data to make predictions.
    ‣ Regression Tree
    ‣ Random Forest
    ‣ Gradient Boosting
    ‣ Catboost
    ‣ LightGBM
    ‣ XGBoost
    SUPPORT VECTOR REGRESSION
    Minimise the error within the support vector threshold, using a non-linear
    kernel to model non-linear relationships.
    ‣ NuSVR
    ‣ LibLinear
    ‣ LibSVM
    ‣ SKLearn

  27. TREE BASED MODELS - REGRESSION TREES
    [Figure: a regression tree that splits on X (X < 40, then X < 20 and X < 60, then X < 5), with leaf predictions such as 3.2, 5, 8.5, 13.6 and 17.6]
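To give an idea of how a single split such as "X < 40" can be chosen, the sketch below scans candidate thresholds and keeps the one that minimises the squared error when each side is predicted by its mean. This is a simplified illustration with made-up data and a hypothetical function name; real libraries add depth limits, multiple features and faster search.

```python
import numpy as np

def best_split(x, y):
    """Find the threshold on x that minimises the summed squared error
    when each side of the split is predicted by its own mean."""
    order = np.argsort(x)
    x = np.asarray(x, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]
    best_thr, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold separates identical values
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Place the threshold halfway between the two neighbouring points.
            best_thr, best_sse = (x[i - 1] + x[i]) / 2, sse
    return best_thr, best_sse

# Made-up data with an obvious jump between x = 3 and x = 10.
thr, sse = best_split([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20])
```

A full regression tree applies the same search recursively to each side of the split until a stopping rule is met, and each leaf predicts the mean of the training targets that reach it.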

  28. TREE BASED MODELS - RANDOM FOREST
    ● Bootstrap (random resampling with replacement)
    ● Independent learners
    ● Random feature selection at each split
    ● “Bagging”
    ● Parallel training
    ● Generates a wide variety of trees

  29. TREE BASED MODELS - GRADIENT BOOSTED TREES
    ● Boosting
    ● Sequential learners
    ● Resample with weights
    ● Important parameters:
    ○ Learning rate
    ○ Number of trees
    ○ Depth
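The role of the learning rate and the number of trees can be illustrated with a small from-scratch sketch. It assumes squared-error loss, a single feature and depth-1 trees (stumps), and fits each new stump to the residuals of the ensemble so far, which is how gradient boosting is usually formulated for regression. All function names and the toy data are hypothetical; libraries such as LightGBM or XGBoost are far more elaborate.

```python
import numpy as np

def fit_stump(x, residual):
    """Depth-1 regression tree: best threshold and the two leaf means."""
    best = (np.inf, None, 0.0, 0.0)
    for thr in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def fit_boosting(x, y, n_trees=200, learning_rate=0.1):
    """Sequentially add stumps, each one fitted to the current residuals."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    base = y.mean()  # the initial prediction is the mean of the target
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_trees):
        thr, left_val, right_val = fit_stump(x, y - pred)
        pred += learning_rate * np.where(x <= thr, left_val, right_val)
        stumps.append((thr, left_val, right_val))
    return base, learning_rate, stumps

def predict_boosting(x, model):
    """Sum the shrunken contributions of all stumps."""
    base, learning_rate, stumps = model
    x = np.asarray(x, dtype=float)
    pred = np.full(len(x), base)
    for thr, left_val, right_val in stumps:
        pred += learning_rate * np.where(x <= thr, left_val, right_val)
    return pred

# Toy data: a smooth non-linear target.
x_train = np.linspace(0, 6, 60)
y_train = np.sin(x_train)
model = fit_boosting(x_train, y_train)
mse_boost = np.mean((predict_boosting(x_train, model) - y_train) ** 2)
mse_base = np.mean((y_train.mean() - y_train) ** 2)
```

A smaller learning rate usually needs more trees to reach the same training error, which is why these two parameters are normally tuned together.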

  30. GRADIENT BOOSTING - LIBRARIES

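  31. LightGBM SAMPLE
    import lightgbm as lgb
    from sklearn.model_selection import GridSearchCV

    # X_train and y_train are assumed to be prepared feature and target arrays
    estimator = lgb.LGBMRegressor(num_leaves=31)

    # Hyperparameter grid to search over
    param_grid = {
        "learning_rate": [0.01, 0.1, 1],
        "n_estimators": [20, 40],
    }

    # cv=3 uses plain K-fold splits; for time-ordered data a time-aware
    # splitter such as scikit-learn's TimeSeriesSplit can be passed instead
    gbm = GridSearchCV(estimator, param_grid, cv=3)
    gbm.fit(X_train, y_train)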

  32. MACHINE LEARNING
    Characteristic / Requirement    Score
    Highly non-stationary           Yes
    Multiple time series            Yes
    Multi-horizon forecast          Yes
    Model interpretability          Medium
    Model Capability                Medium
    Computational Efficiency        Medium
    Handle cold-starts              Partially
    ➔ Requires expert knowledge
    ➔ Time-consuming feature engineering required
    ➔ Some features are difficult to capture
    ➔ Some methods might not be able to extrapolate

  33. NEURAL NETWORKS AND DEEP LEARNING - MODELS
    MULTILAYER PERCEPTRON
    Fully connected multilayer artificial neural network.
    LONG SHORT TERM MEMORY & RECURRENT NN
    A type of recurrent neural network used for sequential learning.
    Cell states are updated by gates.
    Used for speech recognition, language models, translation, etc.
    SEQ2SEQ
    Encoder-decoder architecture.
    It uses two RNNs that work together to predict the next state sequence
    from the previous sequence.
    Image credits: https://github.com/ledell/sldm4-h2o/ https://smerity.com/articles/2016/google_nmt_arch.html

  34. NEURAL NETS - MLP
    Image source: Faloutsos et al. 2019 [9]

  36. NEURAL NETS - MLP
    Image source: Faloutsos et al. 2019 [9]
    ● Hidden layers apply non-linearities
    ● Flexible general function estimator
    ● The larger and deeper the network, the more complex the functions it can represent
    ● Can learn complex relationships
    ● Needs data for training
    ● Careful tuning needed
    ● Feature engineering needed

  37. DEEP LEARNING - MLP WITH EMBEDDINGS
    Source: Cheng Guo and Felix Berkhahn. 2016. [7]
    The learned German state embedding mapped to a 2D space with t-SNE.

  38. RECURRENT NEURAL NETS
    Source: Faloutsos et al. 2019 [9]

  39. DeepAR - LSTM + AUTOREGRESSIVE
    Source: Faloutsos et al. 2019 [9]

  41. NN & DL - TOOLS

  42. NEURAL NETS AND DEEP LEARNING
    Characteristic / Requirement    Score
    Highly non-stationary           Yes
    Multiple time series            Yes
    Multi-horizon forecast          Yes
    Model interpretability          Low
    Model Capability                High
    Computational Efficiency        Low
    Cold start problem              No
    ➔ Very flexible approach
    ➔ Automated feature learning is more limited due to the lack of unlabelled data.
    ➔ Some feature engineering is still necessary.
    ➔ Poor model interpretability
    ➔ No clear consensus on which model (RNN, LSTM, CNN) works best.

  43. MODELS - SUMMARY
    TRADITIONAL MODELS
    ‣ Good model interpretability
    ‣ Limited model complexity to handle non-linear data
    ‣ Difficult to model multiple time series
    ‣ Difficult to integrate shared features across different time series
    MACHINE LEARNING
    ‣ Flexible
    ‣ Can incorporate many features across the time series
    ‣ A lot of feature engineering required
    NEURAL NETS AND DEEP LEARNING
    ‣ Very flexible
    ‣ Automated feature learning via embeddings
    ‣ Still some degree of feature engineering necessary
    ‣ Poor model interpretability
    ‣ Hard to train
    ‣ It is not clear which model or approaches are the best.

  44. EVALUATION AND METRICS
    Metric: Notes
    MAE (mean absolute error): Intuitive
    MAPE (mean absolute percentage error): Independent of the scale of measurement
    SMAPE (symmetric mean absolute percentage error): Avoids the asymmetry of MAPE
    MSE (mean squared error): Penalizes extreme errors
    MSLE (mean squared logarithmic error): Large errors are not more significantly penalised than small ones
    Quantile Loss: Measures the distribution
    RMSPE (root mean squared percentage error): Independent of the scale of measurement
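The formulas for these metrics are not part of the transcript, so the sketch below uses their usual textbook definitions in NumPy. Treat the exact forms as assumptions: SMAPE in particular has several variants in the literature, and MSLE is written here with log(1 + y), the convention that tolerates zeros.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    # Undefined when y contains zeros.
    return np.mean(np.abs((y - y_hat) / y))

def smape(y, y_hat):
    # One common variant; the result lies between 0 and 2.
    return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def msle(y, y_hat):
    # Requires y and y_hat to be greater than -1.
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)

def rmspe(y, y_hat):
    return np.sqrt(np.mean(((y - y_hat) / y) ** 2))

def quantile_loss(y, y_hat, q):
    # Pinball loss: under-prediction is weighted by q, over-prediction by 1 - q.
    diff = y - y_hat
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Made-up actuals and forecasts for illustration.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])
print(mae(y_true, y_pred), mape(y_true, y_pred), mse(y_true, y_pred))
```

With q = 0.5 the quantile loss is half the MAE, so it rewards forecasting the median; other values of q target other quantiles of the predictive distribution.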

  45. TARGET VARIABLE TRANSFORMATION
    ➔ Box-Cox Transformation
    ➔ Log-Transformation
    ➔ Scaling / Normalization
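For reference, the one-parameter Box-Cox transformation is usually defined as below for strictly positive values of y. The symbol lambda is the transformation parameter, typically estimated from the data; the log-transformation is the special case lambda = 0.

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\log y, & \lambda = 0
\end{cases}
\qquad (y > 0)
```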

  48. TARGET VARIABLE TRANSFORMATION
    import numpy as np
    from sklearn.compose import TransformedTargetRegressor

    tt = TransformedTargetRegressor(
        regressor=YourAwesomeRegressor(),  # placeholder for any scikit-learn regressor
        func=np.log,          # applied to y before fitting; requires y > 0
        inverse_func=np.exp,  # applied to predictions to return to the original scale
    )
    ...
    tt.fit(X, y)

  49. CROSS VALIDATION
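For time series, cross-validation folds should respect time order, so that the model is always trained on the past and evaluated on the future. The sketch below shows one possible scheme, an expanding training window; the function name and parameters are hypothetical. scikit-learn provides a comparable splitter in `sklearn.model_selection.TimeSeriesSplit`.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs in which the training
    data always precedes the test block in time."""
    for k in range(n_splits):
        # The last fold ends at the end of the series; earlier folds end sooner.
        test_end = n_samples - (n_splits - 1 - k) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        yield list(range(test_start)), list(range(test_start, test_end))

# Example: 10 time-ordered observations, 3 folds of 2 test points each.
splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Because every test block lies strictly after its training block, information from the future cannot leak into training, which a shuffled K-fold split would not guarantee.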

  50. USEFUL PREDICTORS
    ‣ Trend or Sequence
    ‣ Seasonal Variables
    ‣ Intervention Variables

  53. SUMMARY
    ➔ Time series forecasting has a lot of practical applications.
    ➔ Traditional methods might still be relevant in many use cases.
    ➔ Machine learning, and gradient boosting in particular, seems to offer a good compromise between model
    capacity and interpretability.
    ➔ Feature engineering is key, and some is still necessary when using deep learning.
    ➔ Deep learning has not yet “cracked” time series forecasting, but recent models show promise.
    ➔ Avoid feature leakage by using a robust time series cross-validation approach.

  54. REFERENCES
    • [1] Choi, T. M., Hui, C. L., & Yu, Y. (2014). Intelligent Fashion Forecasting Systems: Models and Applications (pp. 1–194). Springer Berlin Heidelberg.
    • [2] Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2. Accessed on 01.10.2019
    • [3] H&M, a Fashion Giant, Has a Problem: $4.3 Billion in Unsold Clothes.
    • [4] Thomassey, S. (2014). Sales Forecasting in Apparel and Fashion Industry: A Review. In Intelligent Fashion Forecasting Systems: Models and Applications (pp. 9–27). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-39869-8_2
    • [5] Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San Francisco: Holden-Day.
    • [6] Autoregressive integrated moving average (ARIMA). https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average. Accessed: 2019-05-02
    • [7] Guo, C., & Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737.
    • [8] Shen, Yuan, Wu and Pei. Data Science in Retail-as-a-Service Workshop. KDD 2018. London.
    • [9] Faloutsos, C., Flunkert, V., Gasthaus, J., Januschowski, T., & Wang, Y. (2019). Forecasting Big Time Series: Theory and Practice. KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3209-3210. 10.1145/3292500.3332289.
    • [10] Makridakis et al. (2018). The M4 Competition: 100,000 time series and 61 forecasting methods.
    • [11] Salinas, D., Flunkert, V., & Gasthaus, J. (2017). DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. Retrieved from http://arxiv.org/abs/1704.04110

  55. IMAGE CREDITS
    ● decision tree by H Alberto Gongora from the Noun Project
    ● Ship Freight by ProSymbols from the Noun Project
    ● warehouse by ProSymbols from the Noun Project
    ● Store by AomAm from the Noun Project
    ● Neural Network by Knut M. Synstad from the Noun Project
    ● Tunic Dress by Vectors Market from the Noun Project
    ● sales by Kantor Tegalsari from the Noun Project
    ● time series by tom from the Noun Project
    ● fashion by Smalllike from the Noun Project
    ● Time by Anna Sophie from the Noun Project
    ● linear regression by Becris from the Noun Project
    ● Random Forest by Becris from the Noun Project
    ● SVM by sachin modgekar from the Noun Project
    ● production by Orin zuu from the Noun Project
    ● Auto by Graphic Tigers from the Noun Project
    ● Factory by Graphic Tigers from the Noun Project
    ● Express Delivery by Vectors Market from the Noun Project
    ● Stand Out by BomSymbols from the Noun Project
    ● Photo Credit: https://www.flickr.com/photos/157635012@N07/47981346167/ by Artem Beliaikin on Flickr via Compfight CC 2.0
    ● Photo Credit: https://www.flickr.com/photos/157635012@N07/48014587002/ by Artem Beliaikin on Flickr via Compfight CC 2.0
    ● regression analysis by Vectors Market from the Noun Project
    ● Research Experiment by Vectors Market from the Noun Project
    ● weather by Alice Design from the Noun Project
    ● Shirt by Ben Davis from the Noun Project
    ● fashion by Eat Bread Studio from the Noun Project
    ● renew by david from the Noun Project
    ● price by Adrien Coquet from the Noun Project
    ● requirements by ProSymbols from the Noun Project
    ● marketing by Gregor Cresnar from the Noun Project
    ● macroeconomic by priyanka from the Noun Project
    ● competition by Gregor Cresnar from the Noun Project

  56. Thank you!
    Miguel Cabrera
    @mfcabrera

  57. BACKUP SLIDES

  58. AWS DeepAR - LSTM + AUTOREGRESSIVE
    Source: Salinas, D., Flunkert, V., & Gasthaus, J. (2017). [11]

  59. So what method should I use?

  60. THE M4 COMPETITION
    ➔ January-May 2018
    ➔ 100,000 series of the following frequencies:
    monthly, quarterly, yearly, daily, weekly, and
    hourly.
    ➔ 95% of the series fall within the first 3 categories.
    ➔ The forecasting horizons varied, e.g., six for
    yearly, 18 for monthly, and 48 for hourly series.
    ➔ Both point forecasts and prediction intervals
    were evaluated.
    ➔ No extra features / exogenous variables

  61. THE M4 COMPETITION
    ➔ Winning solution: an RNN with integrated
    exponential smoothing formula.
    ➔ Second: ensembles of classical solutions
    using sophisticated time series feature
    extraction.
    ➔ The dataset might not be the most
    representative, but it will offer a baseline.

  62. ATTENTION: LAG IS ALL YOU NEED
    Wikipedia Winning entry:
    https://github.com/Arturus/kaggle-web-traffic/blob/master/how_it_works.md

  63. APPLICATIONS OF TIME SERIES FORECASTING
    ➔ Manufacturing
    ➔ Government services (budgeting)
    ➔ Supply chain and retail/commerce
    ➔ Workforce scheduling
    ➔ Cloud computing
    ➔ Website traffic prediction
    ➔ Healthcare

  64. MODEL SUMMARY