Slide 1

Automated Feature Engineering for Large Scale Time Series Data with tsfresh & Dask
Arnab Biswas
@arnabbiswas1 | /arnabbiswas1 | arnab.blog

Slide 2

Time Series
• Measurement of a variable taken repeatedly over time
• Series of data points indexed in time order
Data: https://www.kaggle.com/c/predict-volcanic-eruptions-ingv-oe

Slide 3

Time Series vs Cross-Sectional Data
• Time Series
  • Follows a natural temporal ordering
  • An observation at any instant of time depends on past observations, through the underlying process
• Cross-Sectional
  • Observations collected from subjects at a single point in time
  • No natural ordering
  • Analysis focuses on comparing differences among subjects, without regard to differences in time

Slide 4

Challenges with Time Series Data
• Temporal ordering: traditional ML algorithms are not designed with time series in mind
• Non-stationary
• Noisy: low signal-to-noise ratio
• Different sampling frequencies
• Uneven lengths
• Different domains

Slide 5

Feature Engineering
• Process of capturing the most important characteristics of a time series in a few metrics
• Compresses the raw time series into a shorter representation (see the sketch below)
• Necessary for ML algorithms that were not designed for time series data
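A minimal sketch of this idea with plain pandas, assuming a toy long-format DataFrame with columns "id" and "value" (both illustrative): each raw series is compressed into a handful of summary metrics.

# Compress each raw time series into a few summary metrics.
# The toy data and column names are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],              # one time series per id
    "value": [1.0, 2.0, 3.0, 2.0, 2.5, 1.5],
})

# One row of features per series instead of the full raw signal
features = df.groupby("id")["value"].agg(["mean", "max", "std"])
print(features)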

Slide 6

Which features?
• Standard generic features to summarize data
• Domain-specific features
• No Free Lunch: features vary across tasks

Slide 7

tsfresh
• Open source Python library
• Automated feature generation (see the usage sketch below)
• 750+ features (70+ feature calculators with different parameters)
• Wide range of features
  • Descriptive statistics (mean, max, autocorrelation)
  • Physics-based indicators of nonlinearity & complexity
  • Digital signal processing related features
  • History-compressing features
• Hypothesis-test based feature selection
  • Identifies features relevant for the target
• Supports parallelization & large data
https://tsfresh.readthedocs.io/
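A minimal usage sketch of tsfresh's extraction and selection API on a tiny toy dataset. The column names, the toy data, and the target labels are assumptions; on data this small, the selection step will typically keep few or no features.

# Extract the full default feature set with tsfresh, then run its
# hypothesis-test based feature selection. The toy data is an assumption.
import numpy as np
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

rng = np.random.default_rng(0)
frames = []
for ts_id in range(4):
    frames.append(pd.DataFrame({
        "id": ts_id,                            # one time series per id
        "time": np.arange(20),
        "value": rng.normal(size=20) + ts_id,
    }))
df = pd.concat(frames, ignore_index=True)
y = pd.Series([0, 0, 1, 1], index=range(4))     # one target label per id

X = extract_features(df, column_id="id", column_sort="time")  # 750+ features per id
impute(X)                                       # replace NaN/inf before selection
X_selected = select_features(X, y)              # keep features relevant for the target
print(X.shape, X_selected.shape)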

Slide 8

Code walkthrough: arnabbiswas1/feature_engineering_with_tsfresh_and_dask

Slide 9

Challenges with Large Time Series Data
• Long execution time (compute-bound problem)
• Need for more memory (memory-bound problem)

Slide 10

Revisiting the fundamentals

Slide 11

Fundamentals of a Computer System
• Components
  • Computing unit: CPU
  • Memory units: RAM, hard drive, CPU cache (L1/L2)
  • Bus: connector between the computing and memory units
  • Network connection: a very slow connection to other computing and memory units

Slide 12

Computing Unit
• Properties
  • Instructions per cycle (IPC): number of operations the CPU can perform in one cycle
  • Clock speed: number of cycles the CPU can execute in one second
• Multicore architecture
  • Processors are not getting faster
  • Chip makers put multiple CPUs ("cores") in the same physical unit, increasing total capacity
http://cpudb.stanford.edu/visualize/clock_frequency
https://www.practicaldatascience.org/html/parallelism.html

Slide 13

Parallelism
• Parallelism as a process (see the sketch below)
  • Break a problem into many smaller, independent problems
  • Solve each smaller problem independently using a single core
  • Combine the results
• Parallelism is hard
  • Computer time is spent breaking the problem into pieces, distributing them across cores, and recombining them into the final result
  • Developing parallel code is hard
• Still, most data science code can be parallelized effectively
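As a sketch of the break / solve-in-parallel / combine pattern, here is a minimal example using Python's built-in multiprocessing module; the chunking scheme and the per-chunk work (a sum of squares) are illustrative assumptions.

# Split a problem into independent pieces, solve them on separate cores,
# then combine the partial results.
from multiprocessing import Pool

def solve_chunk(chunk):
    # Solve one small, independent sub-problem on a single core
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_chunks = 4
    chunks = [data[i::n_chunks] for i in range(n_chunks)]   # break into pieces

    with Pool(processes=n_chunks) as pool:
        partial_results = pool.map(solve_chunk, chunks)     # solve independently

    total = sum(partial_results)                            # combine the results
    print(total)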

Slide 14

Memory Units
• Properties
  • Read/write speed, latency
  • Capacity
• Different memory units
  • Spinning hard drive: extremely slow read/write, but large capacity (TBs)
  • Solid state drive: faster read/write, smaller capacity (GBs to TBs)
  • RAM: even faster read/write, limited capacity (GBs)
  • CPU cache (L1/L2): extremely fast, very small capacity (MBs); data travelling to the CPU passes through here

Slide 15

Memory Units…
• Tiered approach
  • Data starts on the hard drive, moves to RAM, a smaller subset moves into the CPU cache, and finally arrives at the processor
  • Moving data around is often the most expensive part, compared to the actual computation
• With large data
  • If the data doesn't fit into RAM, moving it directly between the processor and the hard disk is unimaginably slow (minutes turn into days)

Slide 16

Large Data Strategies
• Simple definition: data that doesn't fit into RAM
• Strategies
  • Slim & trim the data so that it fits into memory
  • Use more RAM
  • Out-of-core computing: minimize the penalties of streaming data from the hard drive (see the sketch below)
  • Use distributed computing
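A minimal sketch of the out-of-core idea with plain pandas, assuming a hypothetical CSV ("sensor_readings.csv") that does not fit into RAM: the file is streamed in fixed-size chunks and only a running aggregate is kept in memory.

# Stream a larger-than-RAM CSV in chunks instead of loading it at once.
# The file name and column name are illustrative assumptions.
import pandas as pd

running_sum, running_count = 0.0, 0

# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv("sensor_readings.csv", chunksize=1_000_000):
    running_sum += chunk["value"].sum()
    running_count += len(chunk)

mean_value = running_sum / running_count   # aggregate computed out of core
print(mean_value)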

Slide 17

Distributed Computing
• Break the data into small pieces
• Send those pieces to different computers
• Execute the computation on the different computers
• Bring back the partial results from those computers
• Recombine the partial results into the final output

Slide 18

Data Science with Large Data: Approach
• Single computer
  • Parallelize using multiple cores
  • Try to fit the data into RAM; if it doesn't fit, stream it from the hard drive
• Distributed computing
https://azure.microsoft.com/en-in/pricing/details/virtual-machines/linux/

Slide 19

Data Science with Large Data: Challenges
• Popular libraries (NumPy, pandas, etc.) are not designed to scale beyond a single core/processor
• Not designed to scale beyond a single machine
• Can't be used for larger-than-RAM data, because data needs to be loaded into RAM for computation

Slide 20

Dask to the Rescue
• Parallelizes the Python ecosystem
  • Parallelizes using multiple cores (on a single computer or across many)
• Handles large data
  • Out-of-core computing: for larger-than-RAM data, streams from the hard drive
• Familiar API (see the sketch below)
  • Uses the data structures of pandas, NumPy (etc.) internally
  • Dask copies most of their APIs
• Scales up to a cluster; scales down to a laptop
• Enables transitioning from single-machine workflows to parallel & distributed computing without learning new frameworks or changing much code
https://docs.dask.org/
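A minimal sketch of the same workflow with Dask (file and column names are assumptions): the API mirrors pandas, but the work is split into partitions and executed in parallel, out of core, only when compute() is called.

# Dask mirrors the pandas API while parallelizing and streaming from disk.
import dask.dataframe as dd

ddf = dd.read_csv("sensor_readings_*.csv")            # lazily treats many files as one dataframe
result = ddf.groupby("sensor_id")["value"].mean()     # builds a task graph, nothing runs yet
print(result.compute())                               # executes in parallel / out of core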

Slide 21

Manage Large Time Series Data
• Utilize parallelization
• Handle larger-than-memory data

Slide 22

tsfresh with Large Data
• Data fits into memory
  • Single machine: works out of the box using the "multiprocessing" package
  • Distributed environment: tsfresh's distribution module (tsfresh.utilities.distribution)
• Larger-than-memory data
  • Support for Dask & PySpark (tsfresh.convenience.bindings); see the sketch below
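A minimal sketch of the Dask binding, following the pattern described in the tsfresh "Large Data Samples" documentation. The file layout and the column names ("id", "kind", "time", "value") are assumptions, and the pivot at the end is only needed if a wide feature matrix is wanted.

# Feature extraction on a larger-than-memory dask dataframe in long
# ("molten") format: one row per (id, kind, time) observation.
import dask.dataframe as dd
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
from tsfresh.feature_extraction import MinimalFCParameters

df = dd.read_csv("timeseries_*.csv")                  # assumed long-format input files
df_grouped = df.groupby(["id", "kind"])

features = dask_feature_extraction_on_chunk(
    df_grouped,
    column_id="id",
    column_kind="kind",
    column_sort="time",
    column_value="value",
    default_fc_parameters=MinimalFCParameters(),      # small feature set for the sketch
)

# The result is a lazy dask dataframe in long format (id, variable, value);
# pivot to a wide feature matrix and compute only when needed.
features = features.categorize(columns=["variable"])
wide = features.reset_index(drop=True).pivot_table(
    index="id", columns="variable", values="value", aggfunc="sum"
)
print(wide.compute().head())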

Slide 23

Code walkthrough: arnabbiswas1/feature_engineering_with_tsfresh_and_dask

Slide 24

References
• tsfresh documentation
• tsfresh on Large Data Samples - Part I & Part II
• Dask documentation & videos
• Scale Up Data Science Workflow using Dask
• High Performance Python, 2nd Edition
• Duke MIDS Fall 2020 Practical Data Science course

Slide 25

Questions?
arnabbiswas1/feature_engineering_with_tsfresh_and_dask