
Modern Time Series Analysis with STUMPY

Sean Law
November 14, 2020

Traditional time series analysis techniques have found success in a variety of data mining tasks. However, they often require years of experience to master and the recent development of straightforward, easy-to-use analysis tools has been lacking. STUMPY is a scientific Python library for modern time series analysis that leverages popular open source software and enables you to do better science!

Sean Law is a senior applied scientific researcher and data scientist who currently works with a multi-talented Exploration Lab team and serves as an advisor on an enterprise A.I. Council at TD Ameritrade. He has experience producing cutting-edge methodologies, building high-performance predictive models, and developing rapid prototypes. Additionally, he is a co-organizer of PyData Ann Arbor and the creator and core maintainer of STUMPY, a powerful and scalable open source Python library that can be used for a variety of time series data mining tasks.

Transcript

  1. #IllustrativeExample Time Series with Length, n = 13

    0 1 3 2 9 1 14 15 1 2 2 10 7
  2. #Goal If a behavior is conserved, there must have been a reason why it was conserved,

    and teasing out these reasons/causes is often very useful…
  3. Do any conserved behaviors exist in my time series data?

    If there are conserved behaviors, what are they and where are they?
  4. #KeyQualities What’s the most simple and intuitive approach?

    Easy To Interpret; User/Data Agnostic; Parameter Free; No Prior Knowledge
  5. #Parameter Choose Subsequence Length, m

    0 1 3 2 9 1 14 15 1 2 2 10 7
  6. #IllustrativeExample Compare Subsequences

    0 1 3 2 9 1 14 15 1 2 2 10 7
  7. #Back2School Euclidean Distance

    0 1 3 2 9 1 14 15 1 2 2 10 7
  8. #Back2School Euclidean Distance

    0 1 3 2 9 1 14 15 1 2 2 10 7 [Diagram labels: (0,1), (1,2), X, Y, x1, x2, y1, y2]
  9. #Back2School Euclidean Distance

    0 1 3 2 9 1 14 15 1 2 2 10 7 [Diagram labels: (0,1), (1,2), H, X, Y, x1, x2, y1, y2]
  10. #Back2School Euclidean Distance

    0 1 3 2 9 1 14 15 1 2 2 10 7 [Diagram labels: (0,1), (1,2), H, A, B, X, Y, x1, x2, y1, y2]
  11. #Back2School Pythagorean Theorem

    0 1 3 2 9 1 14 15 1 2 2 10 7 [Diagram labels: (0,1), (1,2), H, A, B, X, Y, x1, x2, y1, y2]
  12. #Back2School Pythagorean Theorem

    0 1 3 2 9 1 14 15 1 2 2 10 7 [Diagram labels: (0,1), (1,2), H, A, B, X, Y, x1, x2, y1, y2]
  13. #Back2School Pythagorean Theorem

    0 1 3 2 9 1 14 15 1 2 2 10 7 [Diagram labels: (0,1), (1,2), H, A, B, X, Y, x1, x2, y1, y2]
  14. #Back2School Pythagorean Theorem

    0 1 3 2 9 1 14 15 1 2 2 10 7 [Diagram labels: (0,1), (1,2), H, A, B, X, Y, x1, x2, y1, y2]
  15. #Back2School Euclidean Distance

    0 1 3 2 9 1 14 15 1 2 2 10 7 [Diagram labels: (0,1,3,2), (1,2,2,10), H, x1, x2, x3, x4, y1, y2, y3, y4]
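
The step from two-dimensional points to length-4 subsequences on the #Back2School slides can be sketched in a few lines. This is only an illustration: the helper name `euclidean_distance` is ours, not from the talk, and the sketch assumes that (0,1) and (1,2) are the two points in the diagram and that (0,1,3,2) and (1,2,2,10) are the two subsequences being compared.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    # Pythagorean theorem generalized to len(x) dimensions.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two points in the plane (subsequence length m = 2).
d2 = euclidean_distance((0, 1), (1, 2))

# Two subsequences of the example time series (m = 4).
d4 = euclidean_distance((0, 1, 3, 2), (1, 2, 2, 10))

print(round(d2, 3), round(d4, 3))
```

The second value rounds to 8.2, which is the entry that appears for this pair in the distance profile on the later slides.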
  16. #DistanceProfile Pairwise Euclidean Distance

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
  17. #DistanceProfile Pairwise Euclidean Distance, O(nm)

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
    Distance profile: 0.0 7.4 6.9 14.7 19.3 17.7 19.9 15.0 8.2 8.9
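
The distance profile on the #DistanceProfile slide can be reproduced with a short loop. The sketch below assumes the subsequence length m = 4 used elsewhere in the talk and plain (non-normalized) Euclidean distance, which is what the slide's numbers correspond to. The function name `distance_profile` is ours.

```python
import math

ts = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
m = 4  # subsequence length (assumed, as in the brute-force slides)


def distance_profile(ts, m, i):
    """Distances from subsequence i to every length-m subsequence of ts."""
    n_sub = len(ts) - m + 1
    profile = []
    for j in range(n_sub):            # n - m + 1 comparisons ...
        sq = 0.0
        for k in range(m):            # ... each costing m operations: O(nm)
            sq += (ts[i + k] - ts[j + k]) ** 2
        profile.append(math.sqrt(sq))
    return profile


profile = distance_profile(ts, m, 0)
print([round(d, 1) for d in profile])
```

The first entry is the trivial match of the subsequence with itself, which is why it is exactly zero.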
  18. #TrivialMatch Pairwise Euclidean Distance

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
    Distance profile: 0.0 7.4 6.9 14.7 19.3 17.7 19.9 15.0 8.2 8.9
  19. #NextBestMatch Pairwise Euclidean Distance

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
    Distance profile: 0.0 7.4 6.9 14.7 19.3 17.7 19.9 15.0 8.2 8.9
  20. #NextBestMatch Pairwise Euclidean Distance

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
    Distance profile: 0.0 7.4 6.9 14.7 19.3 17.7 19.9 15.0 8.2 8.9
  21. #DistanceMatrix

    [Figure: 10 × 10 distance matrix (entries shown as *) for the time series 0 1 3 2 9 1 14 15 1 2 2 10 7, annotated with the values 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2]
  22. #DistanceMatrix Stacked Distance Profiles, Distance Matrix

    [Figure: 10 × 10 distance matrix (entries shown as *), annotated with the values 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2]
  23. #KeyQualities What’s the most simple and intuitive approach?

    Easy To Interpret; User/Data Agnostic; Parameter Free; No Prior Knowledge [Figure: 10 × 10 distance matrix (entries shown as *), annotated with the values 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2]
  24. #KeyQuestions What’s the most simple and intuitive approach?

    Easy To Interpret; Minimal Assumptions; Parameter Free; No Prior Knowledge
  25. #DistanceMatrix Brute Force Calculation

    [Figure: 10 × 10 distance matrix (entries shown as *), annotated with the values 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2]
  26. #DistanceMatrix Brute Force Calculation

    [Figure: 10 × 10 distance matrix (entries shown as *), annotated with the values 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2, for the time series 0 1 3 2 9 1 14 15 1 2 2 10 7]

    import math
    import numpy as np

    ts = np.array([0., 1., 3., 2., 9., 1., 14., 15., 1., 2., 2., 10., 7.])
    n = len(ts)  # 13
    m = 4
    for i in range(n-m+1):
        for j in range(n-m+1):
            distance = 0
            for k in range(m):
                distance += (ts[i+k]-ts[j+k])**2  # accumulate squared differences
            distance = math.sqrt(distance)  # distance between subsequences i and j
  27. #DistanceMatrix Brute Force Calculation

    (Same figure and code as slide 26.)
  28. #DistanceMatrix Brute Force Calculation

    (Same figure and code as slide 26.)
  29. #DistanceMatrix Brute Force Calculation

    (Same figure and code as slide 26.) Time O(n²m), Space O(n²)
  30. Redefining #BackOfTheEnvelope Brute Force Calculation

    n = 5 years x 365 days/year x 24 hours/day x 60 mins/hour x 20 times/min = 52,560,000 data points. (n² - n)/2 = 1,381,276,773,720,000 pairwise distances x 0.000 000 1 seconds each = 1,598.7 days (4.4 years) and 11.1 PB memory (32 bit) to compute!
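
The back-of-the-envelope figures can be checked with plain integer arithmetic. In this sketch the time estimate follows the slide directly. The memory estimate rests on our assumptions that the full n × n matrix is stored as 32-bit values and that 1 PB means 10^15 bytes, since the slide does not spell these out.

```python
SECONDS_PER_DISTANCE = 0.000_000_1  # per-distance cost quoted on the slide
BYTES_PER_VALUE = 4                 # 32-bit values

n = 5 * 365 * 24 * 60 * 20          # samples: 20 per minute for 5 years
pairs = (n * n - n) // 2            # unique pairwise distances

seconds = pairs * SECONDS_PER_DISTANCE
days = seconds / 86_400
years = days / 365

# Assumption: the whole n x n matrix is held in memory, 1 PB = 10**15 bytes.
petabytes = n * n * BYTES_PER_VALUE / 1e15

print(f"n = {n:,}")
print(f"pairs = {pairs:,}")
print(f"{days:.1f} days ({years:.1f} years), {petabytes:.1f} PB")
```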
  31. matrix profile /ˈmātriks/ /ˈprōˌfīl/ noun a vector that stores the

    distance* between each subsequence within a time series and its nearest neighbor * z-normalized Euclidean distance
  32. Distance Matrix and Matrix Profile

    [Figure: 10 × 10 distance matrix (entries shown as *) beside the matrix profile]
    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2
  33. Distance Matrix and Matrix Profile

    [Figure: 10 × 10 distance matrix (entries shown as *) beside the matrix profile]
    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2
    Space O(n), Time O(n²)
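
To make the reduction from distance matrix to matrix profile concrete, here is a minimal brute-force sketch. It keeps only the smallest off-diagonal distance in each row, so it never stores the full matrix. Two caveats: it uses plain Euclidean distance to match the toy numbers on the slides (the talk's definition uses z-normalized distance), and it excludes only the exact self-match rather than a wider exclusion zone. The function name `brute_force_matrix_profile` is ours.

```python
import math


def brute_force_matrix_profile(ts, m):
    """Return (profile, indices): nearest-neighbor distance and location
    for every length-m subsequence, skipping the trivial self-match."""
    n_sub = len(ts) - m + 1
    profile, indices = [], []
    for i in range(n_sub):
        best_d, best_j = math.inf, -1
        for j in range(n_sub):
            if j == i:
                continue  # trivial match: a subsequence compared with itself
            d = math.sqrt(sum((ts[i + k] - ts[j + k]) ** 2 for k in range(m)))
            if d < best_d:
                best_d, best_j = d, j
        profile.append(best_d)
        indices.append(best_j)
    return profile, indices


ts = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
profile, indices = brute_force_matrix_profile(ts, m=4)
print([round(d, 1) for d in profile])
print(indices)
```

Only two vectors of length n - m + 1 are kept, which is where the O(n) space figure comes from. This naive version still spends O(n²m) time.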
  34. Do any conserved behaviors exist in my time series data?

    If there are conserved behaviors, what are they and where are they?
  35. #Motif

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2
  36. #Motif

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2 (Motif highlighted)
  37. #MatrixProfileIndex

    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2
    Matrix profile index: 2 8 9 1 9 2 7 2 1 2
    Position: 0 1 2 3 4 5 6 7 8 9 10 11 12
    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
  38. #MatrixProfileIndex

    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2
    Matrix profile index: 2 8 9 1 9 2 7 2 1 2
    Position: 0 1 2 3 4 5 6 7 8 9 10 11 12
    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
  39. #Discord

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2 (Discord highlighted)
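
Once the matrix profile and its index vector exist, locating the motif and the discord is a lookup. The sketch below uses the values printed on the slides and assumes m = 4. It takes the motif to be the pair with the smallest profile value and the discord to be the subsequence with the largest.

```python
ts = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
m = 4  # subsequence length (assumed, as in the brute-force slides)

# Values as printed on the slides.
matrix_profile = [6.9, 1.4, 6.2, 7.9, 11.4, 13.6, 14.1, 14.0, 1.4, 6.2]
profile_index = [2, 8, 9, 1, 9, 2, 7, 2, 1, 2]

positions = range(len(matrix_profile))

# Motif: the subsequence closest to its nearest neighbor, plus that neighbor.
motif_idx = min(positions, key=matrix_profile.__getitem__)
motif_nn = profile_index[motif_idx]

# Discord: the subsequence farthest from its nearest neighbor.
discord_idx = max(positions, key=matrix_profile.__getitem__)

print("motif:  ", ts[motif_idx:motif_idx + m], "and", ts[motif_nn:motif_nn + m])
print("discord:", ts[discord_idx:discord_idx + m])
```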
  40. #MatrixProfilePerformance

    Brute Force (Naive): Time O(n²m), Space O(n²)
    STAMP (FFT): Time O(n² log n), Space O(n)
    STOMP (Algebra): Time O(n²), Space O(n)
    GPU-STOMP (Hardware): Time O(n²), Space O(n)
  41. Given the matrix profile, most time series data mining problems

    are trivial or easy to solve in a few lines of code #EamonnKeogh
  42. These are the best ideas in time series data mining

    in the last two decades. #EamonnKeogh
  43. STUMPY /ˈstəmpē/ noun a powerful and scalable Python library that

    efficiently computes the matrix profile, which can be used for a variety of time series data mining tasks
  44. #Winning Minimal Dependencies (3)

    Python 3.6+ (Core) + NumPy (Numerical) + Numba (Parallelize) + SciPy (Accessory) + Dask (Distribute)
  45. State of the Art

    [Bar chart, axis in days (0, 3.25, 6.5, 9.75, 13), n = 100,000,000. Labels: Today (Distributed 256 CPUs), GPU-STOMP. Values: 12.13 Days, 9.75 Days]
  46. #Why

    Interpretable: Euclidean Distance; Complementary: Mix-and-Match; User Friendly: Developed For (Data) Scientists; Fast & Scalable: Multi-CPU/GPU Support; Reliable: 100% Code Coverage
  47. #STUMPY And Moar!

    Multi-dimensional; Semantic Segmentation; Motif/Discord Discovery; Time Series Chains; Shapelets & Snippets; MPdist Clustering; ML Features
  48. search: “python stumpy”

    tutorials: https://tiny.cc/stumpy-tutorials
    demo: https://tiny.cc/stumpy-demo
    gpu: https://tiny.cc/stumpy-colab
    code: https://tiny.cc/stumpy-code
    docs: https://tiny.cc/stumpy-docs
    pubs: https://tiny.cc/stumpy-pubs
    @seanmylaw @stumpy_dev
  49. #TrivialMatch Trivial Match Only, Exclusion Zone

    [Figure: the 10 × 10 distance matrix repeated with different masked regions (shown as *) and the resulting matrix profile beneath each copy. Labels: Trivial Match Only; Exclusion Zone; i±1, i±2, i±3, i±4, i±5, i±6]
  50. z-normalize /zē/ /ˈnôrməˌlīz/ verb a vector that stores the z-normalized

    Euclidean distance between each subsequence and its nearest neighbor
  51. #IllustrativeExample Z-normalize

    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
    Subtract the Subsequence Mean (5.875): -2.875 -3.875 3.125 -4.875 8.125 9.125 -4.875 -3.875
    Divide by the Subsequence Standard Deviation (5.533): -0.520 -0.700 0.565 -0.881 1.469 1.650 -0.881 -0.700
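
The two z-normalization steps can be written out directly. The mean-subtracted values on the slide are consistent with the eight points 3 2 9 1 14 15 1 2 (positions 2 through 9 of the example series) and with the population standard deviation. Both are our inferences from the numbers, not something the slide states. The helper name `z_normalize` is ours.

```python
import math


def z_normalize(subsequence):
    """Return (z, mean, std) using the population standard deviation."""
    n = len(subsequence)
    mean = sum(subsequence) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in subsequence) / n)
    if std == 0:
        raise ValueError("constant subsequence cannot be z-normalized")
    # Step 1: subtract the mean. Step 2: divide by the standard deviation.
    return [(x - mean) / std for x in subsequence], mean, std


ts = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
z, mean, std = z_normalize(ts[2:10])  # assumed subsequence: 3 2 9 1 14 15 1 2
print(round(mean, 3), round(std, 3))
print([round(v, 3) for v in z])
```

After normalization the subsequence has zero mean and unit variance, so two subsequences with the same shape but different offset or scale end up close together.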
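  52. Left Matrix Profile Indices

    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2
    Matrix profile indices: 2 8 9 1 9 2 7 2 1 2
    Left matrix profile indices: -1 0 0 1 1 2 3 2 1 2
    Position: 0 1 2 3 4 5 6 7 8 9 10 11 12
    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7
  53. Left Matrix Profile Indices, Right Matrix Profile Indices

    Matrix profile: 6.9 1.4 6.2 7.9 11.4 13.6 14.1 14.0 1.4 6.2
    Matrix profile indices: 2 8 9 1 9 2 7 2 1 2
    Left matrix profile indices: -1 0 0 1 1 2 3 2 1 2
    Right matrix profile indices: 2 8 9 8 9 9 7 9 9 -1
    Position: 0 1 2 3 4 5 6 7 8 9 10 11 12
    Time series: 0 1 3 2 9 1 14 15 1 2 2 10 7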
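
The left and right index vectors can be reproduced by restricting the nearest-neighbor search to earlier or later subsequences. This is a brute-force sketch under three assumptions of ours: m = 4, plain Euclidean distance (matching the toy numbers), and -1 as the marker for "no neighbor on that side", as the slide values suggest.

```python
import math


def left_right_indices(ts, m):
    """Nearest neighbor of each subsequence among earlier (left) and
    later (right) subsequences; -1 where no such neighbor exists."""
    n_sub = len(ts) - m + 1

    def dist(i, j):
        return math.sqrt(sum((ts[i + k] - ts[j + k]) ** 2 for k in range(m)))

    left, right = [], []
    for i in range(n_sub):
        # min() falls back to -1 when the candidate range is empty.
        left.append(min(range(0, i), key=lambda j: dist(i, j), default=-1))
        right.append(min(range(i + 1, n_sub), key=lambda j: dist(i, j), default=-1))
    return left, right


ts = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
left, right = left_right_indices(ts, m=4)
print(left)
print(right)
```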
  54. #Distance Brief Recap: Standard Dot Product!

    0 1 3 2 9 1 14 15 1 2 2 10 7
  55. #DotProduct Reusing Dot Products

    [Figure: distance matrix (entries shown as *) for the time series 0 1 3 2 9 1 14 15 1 2 2 10 7 on both axes, with dot products between subsequences written out as sums of four products]
  56. #DotProduct Reusing Dot Products

    [Figure: distance matrix (entries shown as *) for the time series 0 1 3 2 9 1 14 15 1 2 2 10 7 on both axes. Labels: Xi, Yj, Xi-1, Yj-1, i, j] Time O(n²)
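
The idea of reusing dot products can be sketched as follows. Moving one step along a diagonal of the matrix, the new dot product equals the previous one minus the product of the two values that drop out plus the product of the two values that enter, so each entry costs O(1) instead of O(m). The recurrence and the variable names below are our reading of the slide, written for plain (non-normalized) distance with m = 4, and should not be taken as STUMPY's actual implementation.

```python
import math

ts = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
m = 4
n_sub = len(ts) - m + 1


def dot(i, j):
    """Direct O(m) dot product of subsequences i and j."""
    return sum(ts[i + k] * ts[j + k] for k in range(m))


# First row and column are computed directly (the self-join is symmetric).
qt = [[0] * n_sub for _ in range(n_sub)]
for j in range(n_sub):
    qt[0][j] = dot(0, j)
    qt[j][0] = qt[0][j]

# Every other entry reuses its upper-left neighbor in O(1).
for i in range(1, n_sub):
    for j in range(1, n_sub):
        qt[i][j] = (qt[i - 1][j - 1]
                    - ts[i - 1] * ts[j - 1]            # pair that drops out
                    + ts[i + m - 1] * ts[j + m - 1])   # pair that enters

# Distances follow from |x - y|^2 = x.x + y.y - 2 x.y
sq = [qt[i][i] for i in range(n_sub)]
dist = [[math.sqrt(sq[i] + sq[j] - 2 * qt[i][j]) for j in range(n_sub)]
        for i in range(n_sub)]

print(qt[0][1], round(dist[0][2], 1), round(dist[1][8], 1))
```

With all n² entries filled in constant time each, the total cost drops from O(n²m) to O(n²), matching the figure on the slide.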
  57. #NoTrivialMatch Sidebar: AB-Joins

    ts_A: 0 1 3 2 9 1 14 15 1 2 2 10 7
    ts_B: 7 4 0 1 3 2 8
    Matrix profile (each subsequence of ts_A against ts_B): 0.0 1.0 6.9 8.6 16.2 17.7 13.1 8.8 2.2 8.2
    stumpy.stump(ts_A, m, ts_B) or stumpy.stump(ts_B, m, ts_A)
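
An AB-join compares every subsequence of one series against every subsequence of a second series, so there is no trivial match to exclude. The brute-force sketch below assumes m = 4 and plain Euclidean distance, and takes the second series to be 7 4 0 1 3 2 8 as read from the slide. Since the talk defines the matrix profile with z-normalized distance, the values returned by `stumpy.stump` itself may differ. The function name `ab_join` is ours.

```python
import math


def ab_join(ts_a, ts_b, m):
    """For each length-m subsequence of ts_a, the distance to its
    nearest neighbor among the length-m subsequences of ts_b."""
    profile = []
    for i in range(len(ts_a) - m + 1):
        best = math.inf
        for j in range(len(ts_b) - m + 1):
            d = math.sqrt(sum((ts_a[i + k] - ts_b[j + k]) ** 2 for k in range(m)))
            best = min(best, d)
        profile.append(best)
    return profile


ts_a = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
ts_b = [7, 4, 0, 1, 3, 2, 8]
profile_ab = ab_join(ts_a, ts_b, m=4)
print([round(d, 1) for d in profile_ab])
```

The join is not symmetric: swapping the arguments yields one value per subsequence of ts_b instead, which is why the slide lists both call orders.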
  58. #Winning Right Tools, Right Job

    Python (Core) + NumPy (Numerical) + Numba (Parallelize) + Dask (Distribute); GPU Support
  59. Pros and Cons

    Pros: No additional dependencies; Code in Python; Compiles to CUDA; Excellent host-device API; Multi-GPU
    Cons: Separate code targeting GPUs; GPU multithreading paradigm; Delayed CUDA functionality; Memory Limitations; Writing unit tests and CI is hard!