Capital One: RITHM — Automated Insight

Presentation from Elastic Power{ON} Tysons Corner

Douglas Daly
June 1, 2016
Tysons Corner, VA

Elastic Co

Transcript

  1. Fully implemented, RITHM will eliminate downtime and drive higher production support efficiency

    • Integrate disparate data sources into a curated single pane of glass. “E Pluribus Unum”: from many monitors, one view of performance.
    • Massively parallelize investigation. Slice across many dimensions – customer behavior, network flows, hardware load – to pinpoint the failure source.
    • Codify tribal knowledge into rules and models. Couple machine learning models with analyst expertise for automated root cause analysis and remediation.
  2. RITHM leverages the Elastic Stack to show mobile health on a single pane of glass

    Stage labels: LOG STREAMING | SOURCE | STREAM PROCESSING | ANALYTICS & PRESENTATION

    • Sources: data center devices, server logs, app perf logs, storage logs, network logs, app audit logs, infrastructure monitors
    • Logs stream into a Kafka topic in real time
    • Logstash parses the live stream into key/value pairs and indexes them into Elasticsearch
    • Elasticsearch with Python code summarizes data and performs analytics
    • Kibana displays Elasticsearch data in a GUI: availability state in RYG by datacenter, and transaction volume by ID and datacenter

    Sample log excerpt (truncated on the slide):

    14:41:51","stkId":0,"evtAudtg":null},
    {"evtDetlId":0,"cretTs":null,"evtDetlKeyTxt":"lineOfBusiness","evtDetlKeyValTxt":"CARD","stkId":0,"evtAudtg":null},
    {"evtDetlId":0,"cretTs":null,"evtDetlKeyTxt":"sorID","evtDetlKeyValTxt":"7","stkId":0,"evtAudtg":null},
    {"evtDetlId":0,"cretTs":null,"evtDetlKeyTxt":"userStatusMsg","evtDetlKeyValTxt":"SUCCESS","stkId":0,"evtAudtg":null},
    {"evtDetlId":0,"cretTs":null,"evtDetlKeyTxt":"ssoID","evtDetlKeyValTxt":"4b4022f85932ebca4261785d2d36fa4a","stkId":0,"evtAudtg":null},
    {"evtDetlId":0,"cretTs":null,"evtDetlKeyTxt":"userID","evtDetlKeyValTxt":"steve443123","stkId":0,"evtAudtg":null},
    {"evtDetlId":0,"cretTs":null,"evtDetlKeyTxt":"sessionId","evtDetlKeyValTxt":"8sT5VLwTWpdyWNLvfk21YDpf17rjHxT1LQPV0ZG4jTJhzTBvVV8Y!272707040!1439232110492","stkId":0,"evtAudtg":null},
    {"evtDetlId":0,"cretTs":null,"evtDetlKeyTxt":"serviceID","

    Sample summarized data:

    Time   #Pass  #Fail  Trans  Chan  CPU  Mem
    10:00  6105   2      Login  4.x   12   27
    10:01  6113   3      Login  4.x   15   26
    10:02  6155   7      Login  4.x   13   28
    10:03  6040   5      Login  4.x   11   33
    10:04  6177   2      Login  4.x   14   31

    **Actual transaction data & automated analytics, 8/2/15 22:00 – 23:00
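The summarization step can be illustrated with a minimal Python sketch. It assumes each event carries a list of detail records shaped like the log excerpt (`evtDetlKeyTxt` / `evtDetlKeyValTxt` pairs) and that a `userStatusMsg` of `SUCCESS` counts as a pass; the event tuple layout and both function names are hypothetical, not taken from the deck.

```python
from collections import Counter


def flatten_details(details):
    """Collapse key/value detail records into one flat dict."""
    return {d["evtDetlKeyTxt"]: d["evtDetlKeyValTxt"] for d in details}


def summarize_by_minute(events):
    """Count passes and fails per (minute, transaction).

    Each event is a hypothetical (timestamp 'HH:MM:SS', transaction, details) tuple.
    """
    counts = {}
    for timestamp, transaction, details in events:
        fields = flatten_details(details)
        minute = timestamp[:5]  # 'HH:MM'
        bucket = counts.setdefault((minute, transaction), Counter())
        # Assumption: anything other than SUCCESS is treated as a failure.
        outcome = "pass" if fields.get("userStatusMsg") == "SUCCESS" else "fail"
        bucket[outcome] += 1
    return counts
```

Each `(minute, transaction)` bucket then corresponds to one row of a per-minute table like the one on the slide.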
  3. It detects failures in real-time based on customer behavior and infrastructure response

    • Tri-state information in RYG format
    • Hierarchical framework: Datacenter – Environment – Transaction
    • Overall state based on key metrics – Txn volatility and Error rates
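One way to picture the hierarchy is a roll-up in which each level takes the worst state beneath it. The slide does not say how states are combined, so the worst-of rule, the nested-dict layout, and all names below are assumptions made for illustration.

```python
SEVERITY = {"G": 0, "Y": 1, "R": 2}


def worst(states):
    """Return the most severe state in an iterable (green if it is empty)."""
    return max(states, key=SEVERITY.__getitem__, default="G")


def transaction_state(volatility_state, error_state):
    """Assumed rule: a transaction is as bad as its worse key metric."""
    return worst([volatility_state, error_state])


def roll_up(tree):
    """tree: {datacenter: {environment: {transaction: (volatility, error)}}}"""
    result = {}
    for datacenter, environments in tree.items():
        env_states = {
            env: worst(transaction_state(*metrics) for metrics in txns.values())
            for env, txns in environments.items()
        }
        result[datacenter] = {
            "state": worst(env_states.values()),
            "environments": env_states,
        }
    return result
```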
  4. By placing thousands of logs into logical aggregates, RITHM gives you the context to pinpoint the problem

    Gaps with previous tools
    • Successful high-volume transactions hiding failing low-volume transactions
    • Large-scale failure appearing as many individual small failures
    • Many point metrics without context

    RITHM approach
    • Volume-neutral stacked bar charts reflecting each transaction’s state
    • Transaction states grouped in logical aggregates (DC, ENV, etc.)
    • Logical “Red, Yellow, Green” state description based on current data and history

    RITHM addresses the key challenges of making an actionable enterprise dashboard
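A small sketch of what "volume neutral" could mean in practice: each transaction contributes one unit to its aggregate's bar regardless of how much traffic it carries, so a failing low-volume transaction is not drowned out. The tuple layout and the `state_shares` name are hypothetical.

```python
from collections import Counter


def state_shares(transactions):
    """Share of transactions in each RYG state per aggregate, ignoring volume.

    transactions: iterable of (aggregate, transaction, volume, state) tuples.
    """
    counts = {}
    for aggregate, _name, _volume, state in transactions:
        # Volume is deliberately unused: every transaction counts once.
        counts.setdefault(aggregate, Counter())[state] += 1
    return {
        aggregate: {s: c[s] / sum(c.values()) for s in "RYG"}
        for aggregate, c in counts.items()
    }
```

With a green Login at 6,000 requests and a red BankStatements at 40 requests in the same datacenter, the bar is half red, whereas a volume-weighted view would show the datacenter as almost entirely healthy.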
  5. Automated filtering tells you where to focus

    • RITHM highlights failed and failing transactions
        – 2 metrics: Raw measures and State
        – 2 dimensions: Error rate and Transaction volume

    RITHM detects high error rates for BankStatements in both datacenters
    RITHM finds transaction volatility for Login, Bank Acct Details, Card Acct Details, Credit Tracker and Transfer-Schedule
  6. Python code auto-generates alerts via AWS messaging

    • Email and text dashboard status to key contacts
    • Hyperlink to Kibana dashboard
    • Attach live snapshot of performance to emails
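The alerting step might look roughly like the sketch below. The deck does not name the AWS service or client library, so delivery is left to a caller-supplied `publish(subject, body)` callable (for example, a thin wrapper around whatever messaging client is in use); the function names, message wording, and dashboard URL are all hypothetical.

```python
def build_alert(dashboard_url, states):
    """Build (subject, body) for non-green transactions, or None if all green.

    states: {transaction name: 'R' | 'Y' | 'G'}
    """
    flagged = {name: s for name, s in sorted(states.items()) if s != "G"}
    if not flagged:
        return None
    subject = f"RITHM alert: {len(flagged)} transaction(s) not green"
    lines = [f"{name}: {state}" for name, state in flagged.items()]
    # Include a hyperlink back to the dashboard, as the slide describes.
    body = "\n".join(lines + ["", f"Dashboard: {dashboard_url}"])
    return subject, body


def send_alert(publish, dashboard_url, states):
    """Send an alert through the injected publish callable; return True if sent."""
    alert = build_alert(dashboard_url, states)
    if alert is None:
        return False
    publish(*alert)
    return True
```

Injecting `publish` keeps the message formatting testable without AWS credentials.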
  7. RITHM computes health state and set points in real-time

    There are 2 major challenges to correct estimation:

    Customer usage patterns
    • Transaction volumes vary significantly by time of day, day of week, whether or not it is a holiday, etc.
    • It can also just be a ‘weird day’

    Failure state volatility
    • Need to discount data when in a failing state; it is a poor estimator of volumes during normal conditions

    You need an adaptable algorithm to adjust set points and confidence intervals in real time
  8. The RYG state machine determines how to compute score; the score tells you what state you’re in

    State transitions: G→Y when score > Tgy; Y→R when score > Tyr; R→Y when score < Try; Y→G when score < Tyg

    Initialize: current mean & standard deviation estimate, and RYG state
    Then repeat:
    • Collect current data
    • Select ‘alpha’ from RYG state
    • Update mean & standard deviation estimate w/ current alpha
    • Compute z-score from data, mean and standard deviation
    • Change state if z-score exceeds thresholds, T

    RITHM determines the current state from both its scoring metric and previous state
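The loop on this slide can be sketched in Python as follows. This is one interpretation under stated assumptions, not the deck's implementation: it assumes "alpha" is an exponential smoothing weight, that the score is the absolute z-score, and that the state moves at most one step per observation. The alpha and threshold values are hypothetical, since the slides give no numbers.

```python
import math

# Hypothetical tuning values. Smaller alpha in Y and R discounts data seen
# while failing, so it barely shifts the baseline.
ALPHA = {"G": 0.05, "Y": 0.02, "R": 0.005}
T_GY, T_YR = 2.0, 3.0  # escalate when the score rises above these
T_RY, T_YG = 2.5, 1.5  # recover when the score falls below these


class RygTracker:
    def __init__(self, mean, std, state="G"):
        self.mean = float(mean)
        self.var = float(std) ** 2
        self.state = state

    def update(self, x):
        """Fold in one observation, update the state, and return its z-score."""
        alpha = ALPHA[self.state]  # select alpha from the current RYG state
        delta = x - self.mean
        # Exponentially weighted mean and variance update.
        self.mean += alpha * delta
        self.var = (1 - alpha) * (self.var + alpha * delta * delta)
        std = math.sqrt(self.var)
        z = (x - self.mean) / std if std > 0 else 0.0
        score = abs(z)  # assumption: deviations in either direction count
        if self.state == "G" and score > T_GY:
            self.state = "Y"
        elif self.state == "Y" and score > T_YR:
            self.state = "R"
        elif self.state == "Y" and score < T_YG:
            self.state = "G"
        elif self.state == "R" and score < T_RY:
            self.state = "Y"
        return z
```

Because the escalation thresholds sit above the recovery thresholds, the state does not flap when the score hovers near a boundary. Note that updating the estimate before scoring, as in the step order above, caps the achievable score at sqrt((1 - alpha) / alpha), so the thresholds must stay below that bound for each state's alpha.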