Finding Atypical Patterns in Large Datasets

Talk by Adi Andrei @atypiko, at Data Science London meetup @ds_ldn

Data Science London

July 20, 2014

Transcript

  1. Outline
    - Atypicalities
      - What they are
      - Why they are relevant
    - Real-world application: NASA’s Morning Report
      - Premise
      - Methodology
      - Results
    - Conclusion
  2. Patterns vs Values
    [Figure: items labelled 1 to 7]
    - Atypicality = unusual, out-of-the-ordinary pattern
    - Can you spot the outlier value (1-7)?
    - Can you spot the atypical pattern (1-7)?
    - Which one is easier to detect by the human eye? What about by a computer?
  3. Why Atypicalities?
    - Regular data science
      - Extract the (most frequent) patterns out of the noisy data
    - Atypicality detection
      - Of the multitude of patterns, identify the really interesting ones
    - Why they are relevant
      - In mission-critical operations, anything unusual or out of the ordinary can signal a possible problem that may lead to a major disruption later on. Discovering atypicalities early is key to managing the situation and preventing possible loss.
  4. Aviation and Safety
    - Aviation is one of the safest means of transportation.
      - Even the smallest bolt on an aircraft is certified
      - Strict rules and procedures for all operations
      - Safety is continuously monitored
    - When flights operate outside of the norm, they may also be operating outside the realm of safety.
    - Traditionally, safety analysts compare data to preset parameters to identify atypical events
    - What about issues that might otherwise go unforeseen?
  5. Mandate
    - Year: 1999
    - NASA Aeronautics Research Mission Directorate, Aviation Safety and Security Program
    - Mission: make aviation safer by developing advanced tools that find latent safety issues from large sources of operational flight data
  6. The Data
    - 1000+ sensors per aircraft, which collect data every second
    - 9000 seconds: average flight duration
    - 30,000 commercial flights per day in the US
    - 270 billion data points per day
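
The 270 billion figure follows from multiplying the three quantities on the slide. A quick check, treating "1000+" as exactly 1000 sensors and assuming one reading per sensor per second (both are simplifications, not statements from the talk):

```python
# Back-of-the-envelope check of the daily data volume quoted on the slide.
# Assumptions: exactly 1000 sensors, one reading per sensor per second.
SENSORS_PER_AIRCRAFT = 1_000
SECONDS_PER_FLIGHT = 9_000      # average flight duration
FLIGHTS_PER_DAY = 30_000        # commercial flights per day in the US

points_per_flight = SENSORS_PER_AIRCRAFT * SECONDS_PER_FLIGHT
points_per_day = points_per_flight * FLIGHTS_PER_DAY

print(f"{points_per_flight:,} data points per flight")
print(f"{points_per_day:,} data points per day")
```

Since the slide says "1000+", the real volume would be at least this large.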
  7. Houston, we have a problem
    - How to find what is interesting, or a potential safety issue, in these very large data sets
    - How to do it fast, yet reliably, given the technology constraints
    - 2012 Galaxy S3: quad-core ARM, 1400 MHz, 1 GB RAM
    - 2001 Dell Inspiron: single-core P3, 900 MHz, 500 MB RAM
  8. Methodology
    - Phase separation
    - Feature matrix
    - Dimensionality reduction
    - Compute atypicality
    - Rationale
    - Visualisation
  9. Feature Extraction
    How to transform this [figure: time-series curve] into a vector of numbers such that the properties of the curve are preserved well enough to allow comparison with similar time series from other flights
  10. Curve Coefficients
    - Take an 11-second sliding window and run it across a whole flight phase
    - For each window, compute the a, b, c, d coefficients, where:
      - $Y = a + bx + cx^2$
      - $d = \sqrt{SSE / (n - 3)}$
    - Coefficient meaning:
      - a = intercept | average value
      - b = slope | speed of change
      - c = curvature | acceleration
      - d = noise
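
The windowed quadratic fit can be sketched as follows. This is a minimal illustration, not the Morning Report implementation. It assumes one sample per second (so an 11-second window holds 11 samples), a window step of one sample, and x centred on each window, which makes `a` the fitted value at the window centre. The names `fit_window` and `sliding_coefficients` are hypothetical.

```python
import math

def fit_window(y):
    """Least-squares fit of y = a + b*x + c*x**2 over one window.

    Assumption: x is centred on the window (x = -h..h for an odd window
    length), so `a` is the fitted value at the window centre.
    Returns (a, b, c, d) with d = sqrt(SSE / (n - 3)).
    """
    n = len(y)
    if n % 2 == 0 or n < 5:
        raise ValueError("window length must be odd and at least 5")
    h = n // 2
    xs = range(-h, h + 1)
    sxx = sum(x * x for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(y)
    sxy = sum(x * v for x, v in zip(xs, y))
    sx2y = sum(x * x * v for x, v in zip(xs, y))
    # With centred x the odd power sums vanish, so the normal equations
    # split into one equation for b and a 2x2 system for a and c.
    det = n * sx4 - sxx * sxx
    a = (sx4 * sy - sxx * sx2y) / det
    b = sxy / sxx
    c = (n * sx2y - sxx * sy) / det
    sse = sum((v - (a + b * x + c * x * x)) ** 2 for x, v in zip(xs, y))
    d = math.sqrt(sse / (n - 3))
    return a, b, c, d

def sliding_coefficients(series, window=11):
    """Run the window across the series, one sample at a time."""
    return [fit_window(series[i:i + window])
            for i in range(len(series) - window + 1)]

# Synthetic, noise-free phase: y = 2 + 3t + 0.5t^2 for t = 0..19
series = [2 + 3 * t + 0.5 * t * t for t in range(20)]
coeffs = sliding_coefficients(series)
a, b, c, d = coeffs[0]
print(len(coeffs), round(a, 6), round(b, 6), round(c, 6), round(d, 6))
```

On noise-free quadratic data the fit recovers the curvature exactly and `d` is essentially zero. On real sensor data `d` would capture the scatter around the fitted curve.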
  11. Feature Vector
    - Create the feature vector (or flight signature) by applying min, max, average and standard deviation to the a, b, c, d coefficients across all windows:
      - min a, max a, avg a, std a
      - min b, max b, avg b, std b
      - min c, max c, avg c, std c
      - min d, max d, avg d, std d
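
A small sketch of how the 16-value signature for one parameter could be assembled from per-window coefficients. The function name `flight_signature` and the sample numbers are hypothetical, and the use of the population standard deviation is an assumption, since the slide does not say which variant is used.

```python
import statistics

def flight_signature(window_coeffs):
    """Summarise per-window (a, b, c, d) tuples into 16 numbers.

    Order: min, max, avg, std of a, then of b, c and d.
    Assumption: population standard deviation (the slide does not say which).
    """
    signature = []
    for column in zip(*window_coeffs):      # one column per coefficient
        signature += [min(column), max(column),
                      statistics.fmean(column), statistics.pstdev(column)]
    return signature

# Hypothetical coefficients for three windows of one parameter
windows = [(1.0, 0.1, 0.00, 0.2),
           (2.0, 0.3, 0.01, 0.4),
           (3.0, 0.2, 0.02, 0.3)]
sig = flight_signature(windows)
print(len(sig), sig[:4])
```

Whatever the number of windows in a phase, the signature has a fixed length, which is what makes flights of different durations comparable.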
  12. Reduce Dimensionality
    [Figure: feature matrix with one row per flight (Flight 1, Flight 2, ..., Flight N); each row concatenates, for Parameter 1 through Parameter N, the 16 values min/max/avg/std of a, b, c, d]
    - Feature matrix: 135 parameters × 16 values = 2160 columns
    - Principal Component Analysis (PCA): reduce the 2160 columns to about 50
    [Figure: reduced matrix with 50 components per flight (Flight 1, Flight 2, ..., Flight N)]
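
To make the PCA step concrete, here is a toy sketch in pure Python that uses power iteration with deflation on the covariance matrix. It is only an illustration on a tiny matrix. The talk does not say which PCA routine was used, and for a real 2160-column matrix a numerical library would be the practical choice. The function name `pca` and its parameters are hypothetical.

```python
import math
import random

def pca(rows, k, iters=500, seed=0):
    """Project rows onto their first k principal components.

    Illustrative version: power iteration with deflation on the sample
    covariance matrix. Returns (scores, components).
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(m)]
           for i in range(m)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:                 # no variance left to explain
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * cov[i][j] * v[j]
                     for i in range(m) for j in range(m))
        components.append(v)
        # Deflate: remove the variance explained by this component
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(m)]
               for i in range(m)]
    scores = [[sum(r[j] * comp[j] for j in range(m)) for comp in components]
              for r in centred]
    return scores, components

# Toy feature matrix: column 2 is twice column 1, column 3 is constant,
# so a single component carries all the variance.
rows = [[float(t), 2.0 * t, 5.0] for t in range(10)]
scores, comps = pca(rows, k=1)
print([round(x, 4) for x in comps[0]])
```

The sign of a principal component is arbitrary, so the printed direction may be flipped. The projection keeps one number per flight here, in the same way the slide keeps about 50 per flight.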
  13. Atypicality Score
    - Mahalanobis distance
      - A measure of the distance (similarity) between a point P and a distribution D
    - P-value (gamma distribution)
      - Reduces the atypicality score to a value between 0 and 1
    [Figure: atypicality score distribution]
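
A minimal sketch of the Mahalanobis distance, under the assumption that the inputs are uncorrelated columns (as PCA scores are), so the covariance matrix is diagonal. The function name `mahalanobis_diagonal` and the sample data are hypothetical. The gamma-based p-value step is left out because the slide does not describe how the gamma distribution is fitted.

```python
import math
import statistics

def mahalanobis_diagonal(point, population):
    """Mahalanobis distance of `point` from `population`.

    Assumption: the columns are uncorrelated (as PCA scores are), so the
    covariance matrix is diagonal and the distance reduces to a
    variance-scaled Euclidean distance.
    """
    columns = list(zip(*population))
    means = [statistics.fmean(col) for col in columns]
    variances = [statistics.variance(col) for col in columns]  # sample variance
    return math.sqrt(sum((x - mu) ** 2 / var
                         for x, mu, var in zip(point, means, variances)))

# Hypothetical two-component scores for four flights (uncorrelated columns)
population = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
at_centre = mahalanobis_diagonal((1.0, 1.0), population)
far_away = mahalanobis_diagonal((3.0, 1.0), population)
print(at_centre, round(far_away, 4))
```

A point at the centre of the distribution has distance zero. The distance grows as a point moves away from the centre, measured in units of each component's spread.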
  14. Global Atypicality
    - Cluster score
      - cluster_size / population_size
      - Smaller for smaller clusters and singletons
    - Global atypicality
      - Global atypicality = -log(p_value) - log(cluster_score)
      - The “-log” is used so that the most atypical flights get the largest global atypicality scores
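
The two formulas on this slide translate directly into code. The sketch below assumes the natural logarithm (the slide only says "log") and uses made-up p-values and cluster sizes.

```python
import math

def cluster_score(cluster_size, population_size):
    """Fraction of the population that falls in the flight's cluster."""
    return cluster_size / population_size

def global_atypicality(p_value, cluster_size, population_size):
    """-log(p_value) - log(cluster_score); natural log is an assumption."""
    return (-math.log(p_value)
            - math.log(cluster_score(cluster_size, population_size)))

# Hypothetical flights in a population of 1000
typical = global_atypicality(p_value=0.8, cluster_size=600, population_size=1000)
unusual = global_atypicality(p_value=0.001, cluster_size=1, population_size=1000)
print(round(typical, 3), round(unusual, 3))
```

A flight with a tiny p-value that sits alone in its cluster accumulates a large score from both terms, while a flight in a big cluster with a high p-value scores close to zero.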
  15. Show me your… true colour
    - For each phase:
      - Sort on global atypicality
      - Assign phase level:
        - Red: top 1%
        - Green: 5%
        - Blue: 10%
    - Sort and assign flight level:
      - Red: top 1%
      - Yellow: 5%
      - Blue: 10%
    [Figure: excerpt from the report, truncated in the slide image; cut-off text is marked with […]] The Morning Report identified only 1 level-3 atypical flight (highlighted in red in the Flight ID colu[…] figure 4), 3 level-2 atypical flights (the yellow ones) and 11 level-1 atypical flights. Among these 15 flig[…] phases of flight had level-3 atypicality scores, 12 phases had level-2 atypicality scores, and 21 phase[…] level-1 atypicality scores. Notice that The Morning Report does not necessarily assign a high-severity le[…] the entire flight even though it had a level-3 atypical phase in that flight. For each atypical phase, the tool identifies which parameters contribute the most to this atypicality[…] Morning Report generates plots of these parameters, as indicated in figure 5. The characteristic of each […]cal parameter is shown in the figure overlaid on the baseline of all the data (i.e., including both the basel[…] 210 flights and the 79 flights being examined) in column 2 and overlaid on the baseline data alone in co[…] 3. Column 4 indicates the cluster to which this flight belongs, and column 5 shows the recorded flight tr[…] the parameter.
    Figure 4. Atypical flight and phases.
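
The sort-and-colour step might look like the sketch below. It assumes the bands are cumulative (red is the top 1%, yellow the rest of the top 5%, blue the rest of the top 10%), which the slide does not state explicitly. It uses the flight-level colours, and the function name `assign_levels` and the scores are hypothetical.

```python
def assign_levels(scores, bands=((1, "red"), (5, "yellow"), (10, "blue"))):
    """Map each id to a colour by its rank on global atypicality.

    Assumption: the bands are cumulative, i.e. red is the top 1%, yellow
    the rest of the top 5%, blue the rest of the top 10%; everything else
    gets None.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    levels = {}
    for rank, item in enumerate(ranked, start=1):
        levels[item] = None
        for percent, colour in bands:
            if rank * 100 <= percent * n:   # integer test, no float rounding
                levels[item] = colour
                break
    return levels

# Hypothetical scores: flight i has global atypicality i
scores = {f"flight_{i}": float(i) for i in range(100)}
levels = assign_levels(scores)
print(levels["flight_99"], levels["flight_96"], levels["flight_92"], levels["flight_50"])
```

Passing `bands=((1, "red"), (5, "green"), (10, "blue"))` would give the phase-level colours listed on the slide.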
  16. Tell me why
    - Rationale
      - Mathematical explanation, in “plain English”, of the reason a certain pattern is atypical
      - Focused on just one parameter at a time
    - Performance envelope
      - Heat map of all instances of a given parameter
    [Figure labels: Fuel flow, Normal power, Full power]
  17. Flight Atypicalities Identified
    - High-energy arrivals
      - Well above the desired glide path
      - Tend to result in unstable approaches, which are potential precursors to accidents
      - Create a new parameter: Kinetic Energy on Approach
    - Go-arounds
    - Landing rollout anomalies
      - Included atypical use of reverse thrust or application of elevator or rudder during landing rollout
    - Takeoff anomalies
      - Rotating and lifting off at atypically high airspeeds
    - Atypical climbs
    - Unusual arrival paths
  18. Report conclusions:
    - Morning Report revealed interesting operational situations that could not be identified by the regular aviation safety software.
    - The number of atypical situations identified by Morning Report is small enough that it is practical for a safety officer to analyze them.
    Source: NASA/TM–2009-215379, "Comparative Analyses of Operational Flights with AirFASE and The Morning Report Tools", Nicolas P. Maille (ONERA, BA 701, 13661 Salon Air Cedex, France) and Irving C. Statler (Ames Research Center, Moffett Field, California)
  19. Conclusion
    - Finding atypicalities: proven to identify relevant operational situations that could not be identified with existing methodology
    - Other applications
      - Unusual patterns of out-of-stock events
      - Atypicalities in sales data
      - Boiler sensor data