Slide 1

Slide 1 text

Finding Atypical Patterns in Large Datasets Adi Andrei – Atypiko.com

Slide 2

Slide 2 text

Outline
•  Atypicalities
  •  What they are
  •  Why they are relevant
•  Real-world application: NASA’s Morning Report
  •  Premise
  •  Methodology
  •  Results
•  Conclusion

Slide 3

Slide 3 text

Patterns vs Values
[Figure: panels numbered 1 to 7]
Atypicality = an unusual, out-of-the-ordinary pattern
Can you spot the outlier value (1-7)?
Can you spot the atypical pattern (1-7)?
Which one is easier to detect by the human eye? What about by computer?

Slide 4

Slide 4 text

Why Atypicalities?
•  Regular data science
  •  Extract the (most frequent) patterns out of the noisy data
•  Atypicality detection
  •  Of the multitude of patterns, identify the really interesting ones
•  Why they are relevant
  •  In mission-critical operations, anything unusual or out of the ordinary can signal a possible problem that may lead to a major disruption later on. Discovering the atypicalities early is key to managing the situation and preventing possible loss.

Slide 5

Slide 5 text

Aviation and Safety
•  Aviation is one of the safest means of transportation.
  •  Even the smallest bolt on an aircraft is certified
  •  Strict rules and procedures for all operations
  •  Safety is continuously monitored
•  When flights operate outside of the norm, they may also be operating outside the realm of safety.
•  Traditionally, safety analysts compare data to preset parameters to identify atypical events
•  What about issues that might otherwise have been unforeseen?

Slide 6

Slide 6 text

Mandate
Year: 1999
NASA Aeronautics Research Mission Directorate
Aviation Safety And Security Program
Mission: Make aviation safer by developing advanced tools that find latent safety issues in large sources of operational flight data

Slide 7

Slide 7 text

Workflow
Download flight data
Process Data & Report
Analyze & Validate
Learn

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

The Data
•  1000+ sensors per aircraft
  •  which collect data every second
•  9000 seconds – average flight duration
•  30,000 commercial flights per day in the US
Total: at least 270 billion data points per day (1000 x 9000 x 30,000)
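The daily total follows directly from the figures on this slide. A quick check, assuming exactly one reading per sensor per second (the variable names are illustrative, not from the talk):

```python
# Figures taken from the slide; variable names are illustrative.
sensors_per_aircraft = 1_000   # "1000+" sensors, so this is a lower bound
samples_per_second = 1         # assumed: one reading per sensor per second
avg_flight_seconds = 9_000     # average flight duration
flights_per_day = 30_000       # commercial flights per day in the US

points_per_flight = sensors_per_aircraft * samples_per_second * avg_flight_seconds
points_per_day = points_per_flight * flights_per_day

print(f"{points_per_flight:,} data points per flight")
print(f"{points_per_day:,} data points per day")
```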

Slide 10

Slide 10 text

Houston, we have a problem
•  How to find what is interesting, or a potential safety issue, in these very large data sets
•  How to do it fast, yet reliably, given the technology constraints
2012 Galaxy S3: quad-core ARM, 1400 MHz, 1 GB RAM
2001 Dell Inspiron: single-core P3, 900 MHz, 500 MB RAM

Slide 11

Slide 11 text

Methodology
•  Phase separation
•  Feature Matrix
•  Dimensionality reduction
•  Compute Atypicality
•  Rationale
•  Visualisation

Slide 12

Slide 12 text

Flight phases

Slide 13

Slide 13 text

Raw Data

Slide 14

Slide 14 text

Feature Extraction
How to transform this:
[Figure: raw time-series curve]
into a vector of numbers such that the properties of the curve are preserved well enough to allow comparison with similar time series from other flights

Slide 15

Slide 15 text

Curve Coefficients
•  Take an 11-second sliding window and run it across a whole flight phase
•  For each window, compute the a, b, c, d coefficients, where:
  •  y = a + bx + cx^2
  •  d = sqrt(SSE / (n - 3)), with SSE the sum of squared errors of the fit and n the number of points in the window
Coefficient meaning:
•  a = intercept | average value
•  b = slope | speed of change
•  c = curvature | acceleration
•  d = noise
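The windowed fit can be sketched in plain Python. This is a minimal illustration, not the Morning Report implementation: it assumes one sample per second, x measured from the centre of each window (so that a is the fitted value at the window centre), and a window step of one sample. The function names are hypothetical.

```python
import math

def window_coefficients(window):
    """Least-squares fit of y = a + b*x + c*x^2 to one window; returns (a, b, c, d)."""
    n = len(window)
    half = (n - 1) / 2
    xs = [i - half for i in range(n)]  # centred x (assumption): odd power sums vanish
    s2 = sum(x ** 2 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    sy = sum(window)
    sxy = sum(x * y for x, y in zip(xs, window))
    sx2y = sum(x * x * y for x, y in zip(xs, window))
    det = n * s4 - s2 ** 2
    a = (s4 * sy - s2 * sx2y) / det    # intercept
    b = sxy / s2                       # slope
    c = (n * sx2y - s2 * sy) / det     # curvature
    sse = sum((y - (a + b * x + c * x * x)) ** 2 for x, y in zip(xs, window))
    d = math.sqrt(sse / (n - 3))       # residual noise, 3 fitted parameters
    return a, b, c, d

def sliding_coefficients(series, width=11):
    """Run the fit over every full window of `width` consecutive samples."""
    return [window_coefficients(series[i:i + width])
            for i in range(len(series) - width + 1)]

# Example: a noise-free quadratic, so d should be (numerically) zero.
series = [2 + 3 * t + 0.5 * t * t for t in range(20)]
coefficients = sliding_coefficients(series)
print(len(coefficients), coefficients[0])
```

Centring x keeps the normal equations small enough to solve by hand; with a different origin for x, b and c are unchanged in meaning but a becomes the fitted value at that origin instead.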

Slide 16

Slide 16 text

Feature Vector
•  Create the feature vector (or flight signature) by applying:
  •  min
  •  max
  •  average
  •  standard deviation
  to the a, b, c, d coefficients across all windows
16 values per parameter:
min a, max a, avg a, std a, min b, max b, avg b, std b, min c, max c, avg c, std c, min d, max d, avg d, std d
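A small sketch of how the 16 summary values for one parameter could be assembled from per-window (a, b, c, d) tuples. The slide does not say whether the standard deviation is the population or sample form; the population form is assumed here, and `flight_signature` is a hypothetical name.

```python
import statistics

def flight_signature(coefficient_rows):
    """Collapse per-window (a, b, c, d) tuples into 16 summary values.

    Output order: min, max, avg, std of a, then of b, c and d.
    """
    signature = []
    for column in zip(*coefficient_rows):      # all a-values, then b, c, d
        signature += [
            min(column),
            max(column),
            statistics.mean(column),
            statistics.pstdev(column),         # population std (assumption)
        ]
    return signature

# Three made-up windows for a single sensor parameter.
rows = [(1.0, 0.5, 0.0, 0.1), (3.0, 0.5, 0.2, 0.3), (2.0, -0.5, 0.1, 0.2)]
sig = flight_signature(rows)
print(len(sig), sig[:4])
```

Concatenating these 16-value blocks for every parameter gives one fixed-length row per flight, regardless of how long the flight phase lasted.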

Slide 17

Slide 17 text

Reduce Dimensionality
Feature Matrix: one row per flight (Flight 1, Flight 2, ..., Flight N); for each of Parameter 1, Parameter 2, ..., Parameter N the row holds the 16 values min a, max a, avg a, std a, ..., min d, max d, avg d, std d
135 parameters * 16 values = 2160 columns
Principal Component Analysis (PCA): reduce the 2160 columns to about 50
Result: 50 components per flight (Flight 1, Flight 2, ..., Flight N)
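To make the PCA step concrete, here is a dependency-free sketch that uses power iteration with deflation on the covariance matrix. It is meant for tiny examples only; the talk does not say which PCA routine was used, and at 2160 columns a linear-algebra library would be the practical choice. The function name and defaults are assumptions.

```python
import math
import random

def pca(rows, n_components, iterations=200, seed=0):
    """Return (scores, components) for the top principal components."""
    rng = random.Random(seed)
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Sample covariance matrix (m x m).
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(m)]
           for i in range(m)]
    components = []
    for _component in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(m)]   # random start vector
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:                              # no variance left
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(m) for j in range(m))
        components.append(v)
        # Deflate: remove the variance explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(m)]
               for i in range(m)]
    scores = [[sum(r[j] * c[j] for j in range(m)) for c in components]
              for r in centred]
    return scores, components

# Example: points on the line y = 2x have a single direction of variance.
rows = [[float(x), 2.0 * x] for x in range(5)]
scores, components = pca(rows, 1)
print(components[0], scores[0])
```

The sign of each component is arbitrary, so only the magnitudes of the loadings and scores are meaningful when comparing runs.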

Slide 18

Slide 18 text

Atypicality Score
•  Mahalanobis Distance
  •  a measure of the distance (dissimilarity) between a point P and a distribution D
•  P-value – gamma distribution
  •  reduce the atypicality score to a value between 0 and 1
[Figure: atypicality score distribution]
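One possible reading of this step, sketched in plain Python under two loudly stated assumptions: the inputs are PCA scores, whose components are uncorrelated, so the covariance matrix is treated as diagonal; and the gamma distribution is fitted to the distances by the method of moments, which the slide does not specify. All function names are hypothetical.

```python
import math

def mahalanobis_diagonal(rows):
    """Mahalanobis distance of each row from the sample mean.

    Assumes uncorrelated columns (e.g. PCA scores), i.e. a diagonal
    covariance matrix, and non-zero variance in every column.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
                 for j in range(m)]
    return [math.sqrt(sum((r[j] - means[j]) ** 2 / variances[j]
                          for j in range(m))) for r in rows]

def gamma_sf(x, shape, scale):
    """Upper tail P(X > x) of a gamma distribution."""
    if x <= 0.0:
        return 1.0
    z = x / scale
    log_prefactor = -z + shape * math.log(z) - math.lgamma(shape)
    if z < shape + 1.0:
        # Series for the lower tail; subtract from 1 for the upper tail.
        denom, term = shape, 1.0 / shape
        total = term
        for _ in range(1000):
            denom += 1.0
            term *= z / denom
            total += term
            if abs(term) < abs(total) * 1e-14:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz) for the upper tail.
    tiny = 1e-300
    b = z + 1.0 - shape
    c, d = 1.0 / tiny, 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - shape)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(log_prefactor)

def p_values(distances):
    """Gamma fit by method of moments (assumption), then upper-tail p-values."""
    n = len(distances)
    mean = sum(distances) / n
    var = sum((d - mean) ** 2 for d in distances) / (n - 1)
    shape, scale = mean * mean / var, var / mean
    return [gamma_sf(d, shape, scale) for d in distances]

dist = mahalanobis_diagonal([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
ps = p_values([1.0, 2.0, 3.0, 10.0])
print(dist[0], ps)
```

A small p-value means the distance lies far out in the upper tail of the fitted distribution, so the most atypical flights get values close to 0.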

Slide 19

Slide 19 text

Global Atypicality
•  Cluster Score
  •  cluster_size / population_size
  •  smaller for smaller clusters and singletons
•  Global Atypicality
  •  -log(p_value) - log(cluster_score)
  •  the “-log” is used so that the most atypical flights get the largest Global Atypicality scores
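The formula can be tried out directly. In this sketch the natural logarithm is assumed (the slide does not give the base), and the function name is illustrative.

```python
import math

def global_atypicality(p_value, cluster_size, population_size):
    """-log(p_value) - log(cluster_score); natural log assumed.

    p_value must be strictly positive, and cluster_size at least 1.
    """
    cluster_score = cluster_size / population_size
    return -math.log(p_value) - math.log(cluster_score)

# A typical flight in a large cluster versus an unusual singleton.
typical = global_atypicality(p_value=0.5, cluster_size=50, population_size=100)
unusual = global_atypicality(p_value=0.01, cluster_size=1, population_size=100)
print(round(typical, 3), round(unusual, 3))
```

Both terms grow as their inputs shrink, so a flight that is far from the bulk of the data and also sits in a tiny cluster is pushed to the top of the ranking.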

Slide 20

Slide 20 text

Show me your… true colour
•  For each phase:
  •  Sort on global atypicality
  •  Assign phase level:
    •  Red: top 1%
    •  Green: 5%
    •  Blue: 10%
•  Sort and assign flight level:
  •  Red: top 1%
  •  Yellow: 5%
  •  Blue: 10%
[Report excerpt shown on the slide; the text is clipped at the right margin, gaps marked [...]]
The Morning Report identified only 1 level-3 atypical flight (highlighted in red in the Flight ID column [...] figure 4), 3 level-2 atypical flights (the yellow ones) and 11 level-1 atypical flights. Among these 15 flights, [...] phases of flight had level-3 atypicality scores, 12 phases had level-2 atypicality scores, and 21 phases [...] level-1 atypicality scores. Notice that The Morning Report does not necessarily assign a high-severity level [...] the entire flight even though it had a level-3 atypical phase in that flight.
For each atypical phase, the tool identifies which parameters contribute the most to this atypicality [...] Morning Report generates plots of these parameters, as indicated in figure 5. The characteristic of each atypical parameter is shown in the figure overlaid on the baseline of all the data (i.e., including both the baseline [...] 210 flights and the 79 flights being examined) in column 2 and overlaid on the baseline data alone in column 3. Column 4 indicates the cluster to which this flight belongs, and column 5 shows the recorded flight tr[...] the parameter.
Figure 4. Atypical flight and phases.
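The level assignment can be sketched as a rank-based cut. This assumes the percentages are cumulative shares of the sorted list (top 1%, then the rest of the top 5%, then the rest of the top 10%) and uses the level numbers from the report excerpt (3 = red, 2 = yellow, 1 = blue at flight level). The function and constant names are hypothetical.

```python
FLIGHT_LEVEL_COLOURS = {3: "red", 2: "yellow", 1: "blue"}

def assign_levels(scores, cutoffs=((1, 3), (5, 2), (10, 1))):
    """Return one level per score: 3 for the top 1%, 2 up to 5%, 1 up to 10%, else 0.

    `cutoffs` pairs a cumulative percentage with the level it earns.
    """
    n = len(scores)
    # Indices sorted from most to least atypical.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    levels = [0] * n
    for rank, index in enumerate(order):
        for percent, level in cutoffs:
            # Integer comparison avoids floating-point edge cases at the cut.
            if (rank + 1) * 100 <= percent * n:
                levels[index] = level
                break
    return levels

# Example: 100 flights whose score equals their position in the list.
levels = assign_levels(list(range(100)))
print(levels.count(3), levels.count(2), levels.count(1))
```

With fewer than 100 items the top 1% band is empty under this rule, so a real implementation would need a policy for small samples.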

Slide 21

Slide 21 text

Tell me why
•  Rationale
  •  Mathematical explanation, in “plain English”, of the reason a certain pattern is atypical
  •  Focused on just one parameter at a time
•  Performance Envelope
  •  heat map of all instances of a given parameter
[Figure: performance envelope for fuel flow, labelled "Normal power" and "Full power"]

Slide 22

Slide 22 text

The Morning Report

Slide 23

Slide 23 text

Flight Atypicalities Identified
•  High-energy arrivals
  •  well above the desired glide path
  •  tend to result in unstable approaches, which are potential precursors to accidents
  •  Create a new parameter: Kinetic Energy on Approach
•  Go-arounds
•  Landing rollout anomalies
  •  included atypical use of reverse thrust or application of elevator or rudder during landing rollout
•  Takeoff anomalies
  •  rotating and lifting off at atypically high airspeeds
•  Atypical climbs
•  Unusual arrival paths

Slide 24

Slide 24 text

Full Throttle (atypical takeoff)

Slide 25

Slide 25 text

Report conclusions:
•  The Morning Report revealed interesting operational situations that could not be identified by the regular aviation safety software.
•  The number of atypical situations identified by the Morning Report is small enough that it is practical for a safety officer to analyze them.
Source: NASA/TM–2009-215379, Comparative Analyses of Operational Flights with AirFASE and The Morning Report Tools. Nicolas P. Maille (ONERA, BA 701, 13661 Salon Air Cedex, France); Irving C. Statler (Ames Research Center, Moffett Field, California)

Slide 26

Slide 26 text

Conclusion
•  Finding atypicalities: proven to identify relevant operational situations that could not be identified with existing methodology
•  Other applications
  •  Unusual patterns of out-of-stock events
  •  Atypicalities in sales data
  •  Boiler sensor data

Slide 27

Slide 27 text

Thank You! for flying Atypiko
www.Atypiko.com