Automatic Feature Engineering for Large Scale Time Series Data Using tsfresh and Dask

The internet of things, digitized health care systems, financial markets, smart cities, etc. continuously generate time series data of different types, sizes and complexities. Time series data is different from non-temporal data: the observation at any instant depends on observations from the past, based on the underlying process. It often contains noise and redundant information. To make things more complex, most traditional machine learning algorithms are developed for non-temporal data. Thus, extracting meaningful features from the raw time series plays a major role. While some features are generic across different flavors/types of time series, others are specific to particular domains. As a result, feature engineering often demands familiarity with domain-specific and/or signal processing algorithms, making the process complicated.

This presentation introduces a Python library called tsfresh. tsfresh accelerates the feature engineering process by automatically generating 750+ features for time series data. However, when the time series data is large, we start encountering two kinds of problems: large execution time and the need for more memory. This is where another Python framework, Dask, comes into the picture. Dask parallelizes the feature extraction process of tsfresh, and by using out-of-core computing it also addresses the problem of larger-than-RAM datasets.

The corresponding GitHub repo can be found here: https://github.com/arnabbiswas1/feature_engineering_with_tsfresh_and_dask

Arnab Biswas

January 09, 2021

Transcript

1. Automated Feature Engineering for Large Scale Time Series Data with tsfresh & Dask • Arnab Biswas • @arnabbiswas1 • /arnabbiswas1 • arnab.blog
2. Time Series • Measurement of a variable taken repeatedly over time • Series of data points indexed in time order • Data: https://www.kaggle.com/c/predict-volcanic-eruptions-ingv-oe
3. Time Series vs Cross Sectional Data • Time Series • Follows a natural temporal ordering • An observation at any instant depends on observations from the past, based on the underlying process • Cross Sectional • Observations collected from subjects at one point in time • No natural ordering • Analysis focuses on comparing differences among subjects, without regard to differences in time
4. Challenges with Time Series Data • Temporal Ordering: Traditional ML algorithms are not designed with time series in mind • Non-Stationary • Noisy: Low signal-to-noise ratio • Different sampling frequencies • Uneven lengths • Different domains
5. Feature Engineering • Process of capturing the most important characteristics of a time series in a few metrics • Compresses raw time series data into a shorter representation • Necessary for ML algorithms that were not designed for time series data
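To make the idea concrete, here is a minimal sketch (not code from the talk) of compressing a raw series into a handful of summary metrics with pandas; the series and its name are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw series: a single sensor sampled 1000 times
rng = np.random.default_rng(42)
ts = pd.Series(rng.normal(size=1000), name="sensor_1")

# Compress the raw series into a handful of summary metrics
features = {
    "mean": ts.mean(),
    "std": ts.std(),
    "max": ts.max(),
    "min": ts.min(),
    "autocorr_lag_1": ts.autocorr(lag=1),
}
print(features)
```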
6. Which features? • Standard generic features to summarize data • Domain specific • No Free Lunch: Features vary across tasks
7. tsfresh • Open Source Python Library • Automated Feature Generation • 750+ features (70+ calculators with different parameters) • Wide range of features • Descriptive statistics (mean, max, autocorrelation) • Physics-based indicators of nonlinearity & complexity • Digital signal processing related • History compression • Hypothesis Test Based Feature Selection • Identifies features relevant for the target • Supports parallelization & large data • https://tsfresh.readthedocs.io/
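A minimal usage sketch based on tsfresh's documented API follows; the file names and the "id", "time", "value" and "target" column names are assumptions for illustration:

```python
import pandas as pd
from tsfresh import extract_features, extract_relevant_features
from tsfresh.utilities.dataframe_functions import impute

# Long-format input: one row per observation; "id" identifies the series,
# "time" gives the ordering, "value" holds the measurement.
# (File and column names are assumptions for this sketch.)
df = pd.read_csv("timeseries.csv")
y = pd.read_csv("target.csv", index_col="id")["target"]

# Brute-force extraction of the 750+ features for every series
X = extract_features(df, column_id="id", column_sort="time")
impute(X)  # replace NaN/inf produced by some feature calculators

# Or: extraction plus hypothesis-test based selection in one step,
# keeping only the features relevant for the target y
X_relevant = extract_relevant_features(df, y, column_id="id", column_sort="time")
```

extract_relevant_features combines extraction with the hypothesis-test based selection mentioned above, so only features that are statistically relevant for the target are kept.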
8. Challenges with large time series data • Large execution time (compute bound problem) • Need for larger memory (memory bound problem)
9. Fundamentals of a Computer System • Components • Computing Unit: CPU • Memory Unit: RAM, Hard Drive, CPU Cache (L1/L2) • Bus: Connector between the computing & memory units • Network Connection: Very slow link to other computing and memory units
10. Computing Unit • Properties • Instructions per cycle (IPC): Number of operations the CPU can do in one cycle • Clock Speed: Number of cycles the CPU can execute in one second • Multicore Architecture • Processors are not getting faster • Chip makers are putting multiple CPUs ("cores") within the same physical unit, increasing total capacity • http://cpudb.stanford.edu/visualize/clock_frequency • https://www.practicaldatascience.org/html/parallelism.html
11. Parallelism • Parallelism as a process • Break a problem into many smaller independent problems • Solve each smaller problem independently using a single core • Combine the results • Parallelism is hard • Computer time is spent breaking the problem into pieces, distributing them across cores and recombining the partial results • Developing parallel code is harder • Most data science code can be parallelized well
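As an illustration of the break/solve/combine pattern (not code from the talk), here is a small sketch using Python's multiprocessing; the chunk size and worker count are arbitrary:

```python
from multiprocessing import Pool

def summarize(chunk):
    # Solve one small, independent problem on a single core
    return sum(chunk) / len(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))

    # Break the problem into equally sized, independent pieces
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    # Solve each piece on a separate core
    with Pool(processes=4) as pool:
        partial_means = pool.map(summarize, chunks)

    # Combine the partial results (valid here because the chunks are equal-sized)
    print(sum(partial_means) / len(partial_means))
```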
12. Memory Units • Properties • Read/Write speed, Latency • Capacity • Different Memory Units • Spinning Hard Drive: Extremely slow read/write, but large capacity (TB+) • Solid State Drive: Faster read/write, smaller capacity (GB - TB) • RAM: Faster read/write, limited capacity (GB) • CPU Cache: Extremely fast, very small capacity (MB) • Data going to the CPU passes through here
13. Memory Units… • Tiered Approach • Data starts on the hard drive, moves to RAM, a smaller subset moves to the CPU cache, and finally arrives at the processor • Moving data around is often the most expensive part (compared to the actual computation) • With large data • If data doesn't fit into RAM, moving data directly between the processor and the hard disk is unimaginably slow (minutes -> days)
14. Large Data Strategies • Simple definition: Data that doesn't fit into RAM • Strategy • Slim & trim the data to fit into memory • Use more RAM • Out of Core Computing • Minimize the penalties of streaming data out of the hard drive • Use Distributed Computing
15. Distributed Computing • Break data into small pieces • Send those to different computers • Execute those on different computers • Bring back the partial results from those computers • Recombine the partial results to generate the final output
16. Data Science with Large Data: Approach • Single Computer • Parallelize using multiple cores • Try to fit data into RAM; if not, stream data out of the hard drive • Distributed Computing • https://azure.microsoft.com/en-in/pricing/details/virtual-machines/linux/
17. Data Science with Large Data: Challenges • Popular libraries (NumPy, Pandas, etc.) are not designed to scale beyond a single core/processor • Not designed to scale beyond a single machine • Can't be used for larger-than-RAM data • Data needs to be loaded into RAM for computation
18. Dask to the Rescue • Parallelizes the Python eco-system • Parallelizes using multiple cores (single or multiple computers) • Handles large data • Out of Core Computing: For larger-than-RAM data, streams from the hard drive • Familiar API • Uses the data structures of Pandas, NumPy (etc.) internally • Dask copies most of their API • Scales up to a cluster, scales down to a laptop • Enables transitioning from single-machine workflows to parallel & distributed computing without learning new frameworks or changing much code • https://docs.dask.org/
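A short sketch of the "familiar API" point; the file pattern and the "sensor_id"/"value" schema are purely illustrative:

```python
import dask.dataframe as dd

# Lazily read a collection of CSV files that together may not fit in RAM
# (file pattern and column names are assumptions for this sketch)
df = dd.read_csv("sensor_data_*.csv")

# Pandas-like API; the work is split across partitions and cores
result = df.groupby("sensor_id")["value"].mean()

# Nothing has run yet: .compute() triggers the parallel, out-of-core
# execution and returns an ordinary pandas object
print(result.compute())
```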
19. tsfresh with Large Data • Data fits into memory • Single Machine: Out of the box using the "multiprocessing" package • Distributed Environment • tsfresh's distribution module (tsfresh.utilities.distribution) • Larger than Memory Data • Support for Dask & PySpark (tsfresh.convenience.bindings)
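For the larger-than-memory case, here is a sketch following the Dask bindings described in the tsfresh documentation (tsfresh.convenience.bindings.dask_feature_extraction_on_chunk); the Parquet path and the "id"/"kind"/"time"/"value" column names are assumptions:

```python
import dask.dataframe as dd
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
from tsfresh.feature_extraction.settings import MinimalFCParameters

# Larger-than-memory time series in long ("molten") format: one row per
# observation, with "id" (series), "kind" (variable name), "time" and "value".
# (File path and column names are assumptions for this sketch.)
df = dd.read_parquet("timeseries.parquet")

# Group so that each chunk holds one time series of one kind
df_grouped = df.groupby(["id", "kind"])

# Lazy, partition-wise feature extraction handled by the Dask scheduler
features = dask_feature_extraction_on_chunk(
    df_grouped,
    column_id="id",
    column_kind="kind",
    column_sort="time",
    column_value="value",
    default_fc_parameters=MinimalFCParameters(),
)

# `features` is still a Dask DataFrame in long format (id, variable, value);
# computed here only for illustration; in practice you would typically write
# it back to disk or keep it distributed.
result = features.compute()
```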
20. References • tsfresh documentation • tsfresh on Large Data Samples - Part I & Part II • Dask Documentation & Videos • Scale Up Data Science Workflow using Dask • High Performance Python, 2nd Edition • Duke MIDS Fall 2020 Practical Data Science Course