Automatic Feature Engineering for Large Scale Time Series Data Using tsfresh and Dask

The internet of things, digitized health care systems, financial markets, and smart cities are continuously generating time series data of different types, sizes, and complexities. Time series data is different from non-temporal data: an observation at any instant of time depends on past observations through the underlying process. It often contains noise and redundant information. To make things more complex, most traditional machine learning algorithms were developed for non-temporal data. Thus, extracting meaningful features from raw time series plays a major role. While some features are generic across different types of time series, others are specific to particular domains. As a result, feature engineering often demands familiarity with domain-specific and/or signal-processing algorithms, making the process complicated.

This presentation introduces a Python library called tsfresh. tsfresh accelerates the feature engineering process by automatically generating 750+ features from time series data. However, if the time series data is large, we start encountering two kinds of problems: large execution time and the need for more memory. This is where another Python framework, Dask, comes into the picture. Dask parallelizes the feature extraction process of tsfresh. Also, by using out-of-core computing, it addresses the problem of larger-than-RAM datasets.

The corresponding GitHub repo can be found here: https://github.com/arnabbiswas1/feature_engineering_with_tsfresh_and_dask

Arnab Biswas

January 09, 2021



  1. Automated Feature Engineering for Large Scale Time Series Data with

    tsfresh & Dask Arnab Biswas @arnabbiswas1 /arnabbiswas1 arnab.blog
  2. Time Series • Measurement of a variable taken repeatedly over

    time • Series of data points indexed in time order Data: https://www.kaggle.com/c/predict-volcanic-eruptions-ingv-oe
  3. Time Series vs Cross Sectional Data • Time Series •

    Follows natural temporal ordering • One observation at any instance of time depends on the observations from the past based on the underlying process • Cross Sectional • Observations collected from subjects at one point of time • No natural ordering • Analysis focuses on comparing differences among subjects, without regard to differences in time
  4. Challenges with Time Series Data • Temporal Ordering: Traditional ML

    algorithms are not designed keeping time series in mind • Non-Stationary • Noisy: Low Signal to noise ratio • Different sampling frequency • Uneven length • Different domains
  5. Feature Engineering • Process of capturing most important characteristics of

    a time series into few metrics • Compresses raw time series data into shorter representation • Necessary for ML algorithms which were not designed for time series data
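As a sketch of the idea, a raw series can be compressed into a handful of descriptive metrics with plain NumPy (the feature names here are illustrative, not tsfresh's):

```python
import numpy as np

def summarize(series):
    """Compress a raw time series into a few descriptive features."""
    arr = np.asarray(series, dtype=float)
    return {
        "mean": arr.mean(),
        "std": arr.std(),
        "max": arr.max(),
        "min": arr.min(),
        # Lag-1 autocorrelation: how strongly an observation
        # depends on the one immediately before it.
        "autocorr_lag1": np.corrcoef(arr[:-1], arr[1:])[0, 1],
    }

features = summarize([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])
```

A six-point series is reduced to five numbers that an ML algorithm designed for non-temporal data can consume directly.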
  6. Which features? • Standard generic features to summarize data •

    Domain specific • No Free Lunch: Features vary across tasks
  7. tsfresh • Open Source Python Library • Automated Feature Generation

    • 750+ features (70+ calculators with different parameters) • Wide range of features • Descriptive Statistics (mean, max, autocorrelation) • Physics based indicators for nonlinearity & complexity • Digital Signal Processing related • History compressing • Hypothesis Test Based Feature Selection • Identifies features relevant for the target • Supports parallelization & large data https://tsfresh.readthedocs.io/
  8. Challenges with large time series data • Large execution time

    (Compute bound problem) • Need for larger memory (Memory bound problem)
  9. Fundamentals of Computer System • Components • Computing Unit: CPU

    • Memory Unit: RAM, Hard Drive, CPU Cache (L1/L2) • Bus: Connector between Computing & Memory Unit • Network Connection: Very slow connection connecting to other computing and memory units
  10. Computing Unit • Properties • Instructions per cycle (IPC): Number

    of operations CPU can do in one cycle • Clock Speed: Number of cycles CPU can execute in one second • Multicore Architecture • Processors are not getting faster • Chip makers are putting multiple CPUs ("cores") within the same physical unit, increasing total capacity http://cpudb.stanford.edu/visualize/clock_frequency https://www.practicaldatascience.org/html/parallelism.html
  11. Parallelism • Parallelism as a process • Break a problem

    into lots of smaller independent problems • Solve each smaller problem independently using a single core • Combine the results • Parallelism is hard • It costs computer time to break the problem into pieces, distribute them across cores, and recombine them into the final result • Developing parallel code is hard • Most data science code can be parallelized well
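The break/solve/combine pattern above can be sketched with Python's standard library (a toy parallel sum, not tied to tsfresh):

```python
from concurrent.futures import ProcessPoolExecutor

def solve_piece(chunk):
    """Solve one small independent problem: sum a chunk."""
    return sum(chunk)

def parallel_sum(data, n_chunks=4):
    # 1. Break the problem into smaller independent pieces.
    size = max(1, len(data) // n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # 2. Solve each piece on a separate core.
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(solve_piece, chunks))
    # 3. Combine the partial results.
    return sum(partials)

if __name__ == "__main__":
    total = parallel_sum(list(range(1000)))
```

The splitting, process start-up, and recombination are exactly the overhead the slide warns about: for small inputs this is slower than a plain `sum`.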
  12. Memory Units • Properties • Read/Write Speed, Latency • Capacity

    • Different Memory Units • Spinning Hard Drive: Extremely slow read/write, but large capacity (TB+) • Solid State Hard Drive: Faster read/write, smaller capacity (GB - TB) • RAM: Faster read/write. Limited capacity (GB) • CPU Cache • Extremely fast, very small capacity (MB) • All data going to the CPU passes through here.
  13. Memory Units… • Tiered Approach • Data starts from the hard

    drive, moves to RAM, smaller subset moves to CPU cache, finally arrives at Processor • Moving around data is often the most expensive thing (compared to the actual computation) • With large data • If data doesn’t fit into RAM, moving Data directly between Processor and Hard Disk is unimaginably slow (minutes -> days)
  14. Large Data Strategies • Simple Definition: Data that doesn’t fit

    into RAM • Strategy • Slim & Trim the data to fit into memory • Use more RAM • Out of Core Computing • Minimize penalties of streaming data out of hard drive • Use Distributed Computing
  15. Distributed Computing • Break data into small pieces • Send

    those to different computers • Execute those on different computers • Bring back the partial results from those computers • Recombine the partial results to generate final output
  16. Data Science with Large Data : Approach • Single Computer

    • Use parallelization using multiple cores • Try to fit data into RAM, if not, stream data out of hard drive • Distributed Computing https://azure.microsoft.com/en-in/pricing/details/virtual-machines/linux/
  17. Data Science with Large Data : Challenges • Popular libraries

    (Numpy, Pandas etc.) are not designed to scale beyond a single core/processor • Not designed to scale beyond a single machine • For larger than RAM data, can’t be used • Data needs to be loaded into RAM for computation
  18. Dask to the Rescue • Parallelizes the Python ecosystem • Parallelize using

    Multiple Cores (single/multiple computers) • Handles large data • Out of Core Computing: For larger than RAM data, streams from hard drive • Familiar API • Uses data structures of Pandas, NumPy, etc. internally • Dask copies most of their API • Scales up on a cluster. Scales down to a laptop • Enables transitioning from single-machine workflows to parallel & distributed computing without learning new frameworks or changing much code https://docs.dask.org/
  19. tsfresh with Large Data • Data fits into memory •

    Single Machine: Out of the box using “multiprocessing” package • Distributed Environment • tsfresh’s distribution module (tsfresh.utilities.distribution) • Larger than Memory Data • Support for Dask & PySpark (tsfresh.convenience.bindings)
  20. References • tsfresh documentation • tsfresh on Large Data Samples

    - Part I & Part II • Dask Documentation & Videos • Scale Up Data Science Workflow using Dask • High Performance Python, 2nd Edition • Duke MIDS Fall 2020 Practical Data Science Course