
Phoenix Data Conference 2014 - Nauman Fakhar


Real Time Data processing with Storm

teamclairvoyant

October 25, 2014



Transcript

  1. Real time processing in Hadoop
     Trucking Company Use Case
     Nauman Fakhar, Solutions Engineer, Hortonworks
     © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  2. Agenda
     • Overview of logistics industry scenario
     • Quick overview of streaming architecture on HDP
     • Streaming Demo
     • Integrating Predictive Analytics in streaming scenarios
     • Spark Demo
     © Hortonworks Inc. 2012 Professional Services
  3. Scenario Overview
  4. Trucking company with a large fleet of trucks in the Midwest
     A truck generates millions of events for a given route; an event could be:
     • 'Normal' events: starting / stopping of the vehicle
     • 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance
     The company uses an application that monitors truck locations and violations from the truck/driver in real time.
     Route? Truck? Driver? Analysts query a broad history to understand whether today's violations are part of a larger problem with specific routes, trucks, or drivers.
  5. Trucking Company's YARN-enabled Architecture
     [Architecture diagram] Components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Alerts & Events (ActiveMQ); Real-Time User Interface; SQL Interactive Query (Hive on Tez); Many Workloads: YARN; Distributed Storage: HDFS.
     One cluster with consistent security, governance & operations
  6. Demo - Streaming
  7. Streaming Demo - High Level Architecture
     [Architecture diagram] Components: Truck Streaming Data from trucks T(1), T(2), ... T(N); Inbound Messaging (Kafka) with the Truck Events Topic; Storm Stream Processing on YARN with a Kafka Spout, HBase Bolt, HDFS Bolt and Monitoring Bolt; HBase (Dangerous Events Table); HDFS (Truck Events); ActiveMQ; Web App; Distributed Storage: HDFS.
  8. Demo – Analyzing Events with Tableau
  9. Analyzing Raw Events – dangerous drivers
  10. Analyzing Raw Events – dangerous routes
  11. Analyzing Raw Events – violations by location
  12. Enriching truck events for analysis with Pig
     [Data flow diagram]
     Sources: Raw Truck Events (HDFS); Raw Weather Data (Weather Data Sets); Payroll Data (HR & Payroll DBs); HCatalog (Metadata)
     Steps: Load Raw Truck Events; Clean & Filter (Cleaned Events); Transform (Transformed Events); Join with HR & weather data (Enriched Events); Store Enriched Events; analyze in Tableau
  13. Analyzing Enriched Events – noncertified and fatigued drivers more dangerous
  14. Analyzing Enriched Events – top 3 dangerous routes seem to be driven by fatigued drivers
  15. Analyzing Enriched Events – foggy weather leads to violations
  16. Analyzing Enriched Events – but top 3 safest routes are also foggy
  17. Integrating Predictive Analytics
  18. CDO's vision: Build a Predictive Business, not a Reactive one
     CDO's Requirements
     • Offline predictions
     • Identify investments that will increase safety and reduce the company's liabilities
     • Real-time predictions
     • Anticipate driver violations before they happen and take precautionary actions
     Data Scientist's Response
     • ♬ I've been waiting for this moment all my life ♬
     • Verify BI tool trends against TBs of events data via machine learning
     • Generate predictive models with Spark MLlib on HDP
     • Plug in Spark models in Storm to predict driver violations in real time
  19. Integrate Predictive Analytics in Stream Processing
     [Architecture diagram] Components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm) with a Prediction Bolt; Real-time Serving (HBase); Interactive Query (Hive on Tez); Machine Learning (Spark); YARN; HDFS.
     • Millions of enriched truck events
     • Train Spark ML model with millions of truck events
     • Plug Spark model into Storm bolt
  20. Streaming Demo - Updated Architecture
     [Architecture diagram] Components: Truck Streaming Data from trucks T(1), T(2), ... T(N); Inbound Messaging (Kafka) with the Truck Events Topic; Storm Stream Processing on YARN with a Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt and Prediction Bolt; HBase (PayRoll Table); HDFS (Truck Events); ActiveMQ; Web App; Distributed Storage: HDFS.
     Prediction Bolt:
     • Enrich event
     • Predict violation in real time & alert via MQ
     • Render real-time predictions on UI
  21. Building the Predictive Model on HDP
     1. Explore a small subset of events (Tableau) to identify predictive features and make a hypothesis, e.g. "foggy weather causes driver violations"
     2. Identify suitable ML algorithms to train a model – we will use classification algorithms as we have labeled events data
     3. Transform enriched events data to a format that is friendly to Spark MLlib – many ML libs expect training data in a certain format
     4. Train a logistic classification Spark model on YARN, with the above events as training input, and iterate to fine-tune the generated model
     5. Integrate the Spark MLlib model in a Storm bolt to predict violations in real time
  22. Transforming training data for Spark MLlib
     Enriched Events Data
     Event Type | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Longitude | Latitude | Weather Foggy | Weather Rainy | Weather Windy
     Normal     | Yes                  | Hourly    | 45           | 2721         | -91.3     | 38.14    | No            | No            | No
     Overspeed  | No                   | Miles     | 72           | 4152         | -94.23    | 37.09    | Yes           | Yes           | No
     …          | …                    | …         | …            | …            | …         | …        | …             | …             | …
     Spark MLlib Training Data
     Label | Is Driver Certified? | Wage Plan | Hours Driven | Miles Driven | Weather Foggy | Weather Rainy | Weather Windy
     0     | 1                    | 1         | 0.45         | 0.2721       | 0             | 0             | 0
     1     | 0                    | 0         | 0.72         | 0.4152       | 1             | 1             | 0
     …     | …                    | …         | …            | …            | …             | …             | …
     • Normal events labeled as 0 and violation events as 1
     • Feature scaling applied to hours and miles to improve algorithm performance
     • Features with binary values denoted as 0 and 1
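     The transformation on this slide can be sketched in plain Java. This is only an illustration, not the presenter's code: the class and method names are hypothetical, and the scaling divisors (100 for hours, 10,000 for miles) are inferred from the two sample rows (45 becomes 0.45, 2721 becomes 0.2721). Longitude and latitude are dropped, as in the slide's training table.

```java
import java.util.Arrays;

// Hypothetical helper: encodes one enriched truck event as a label plus a
// feature vector, following the column mapping shown on the slide.
final class TruckEventEncoder {
    // Divisors inferred from the sample rows; the slide does not state them.
    static final double HOURS_SCALE = 100.0;
    static final double MILES_SCALE = 10_000.0;

    // Label: 0 for a Normal event, 1 for any other (violation) event type.
    static int label(String eventType) {
        return "Normal".equalsIgnoreCase(eventType) ? 0 : 1;
    }

    // Binary features are written as 1.0 (yes) or 0.0 (no).
    static double flag(boolean value) {
        return value ? 1.0 : 0.0;
    }

    static double[] features(boolean certified, String wagePlan, double hoursDriven,
                             double milesDriven, boolean foggy, boolean rainy, boolean windy) {
        return new double[] {
            flag(certified),
            flag("Hourly".equalsIgnoreCase(wagePlan)), // Hourly -> 1, Miles -> 0
            hoursDriven / HOURS_SCALE,
            milesDriven / MILES_SCALE,
            flag(foggy), flag(rainy), flag(windy)
        };
    }

    public static void main(String[] args) {
        // Second sample row from the slide: Overspeed, uncertified, paid by miles.
        System.out.println(label("Overspeed") + " "
            + Arrays.toString(features(false, "Miles", 72, 4152, true, true, false)));
    }
}
```

     Keeping the feature order fixed matters, because the trained model's weights are positional.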
  23. Running Spark ML on YARN
     1. Run the spark-submit script to launch a Spark job on YARN:
        spark-submit --class org.apache.spark.examples.mllib.BinaryClassification --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 truckml.jar --algorithm LR --regType L2 --regParam 1.0 /user/root/truck_training --numIterations 100
        (/user/root/truck_training is the training data location on HDFS)
     2. Monitor progress of the Spark job in the YARN Resource Manager UI
  24. Interpreting Spark Logistic Regression Results
     Precision: 87.5%
     Recall: 88%
     Top three predictors of violations:
     1. Foggy Weather
     2. Rainy Weather
     3. Driver Certification
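     For readers unfamiliar with the two metrics above, here is a minimal Java sketch of how precision and recall are computed for a binary classifier. The class name and the counts in `main` are hypothetical and are not the data behind the slide's figures.

```java
// Illustrative only: precision and recall from confusion-matrix counts,
// where "positive" means "predicted violation".
final class BinaryMetrics {
    // Precision: of all events predicted as violations, the share that really were.
    static double precision(long truePositives, long falsePositives) {
        long predictedPositives = truePositives + falsePositives;
        return predictedPositives == 0 ? 0.0 : (double) truePositives / predictedPositives;
    }

    // Recall: of all real violations, the share the model caught.
    static double recall(long truePositives, long falseNegatives) {
        long actualPositives = truePositives + falseNegatives;
        return actualPositives == 0 ? 0.0 : (double) truePositives / actualPositives;
    }

    public static void main(String[] args) {
        // Hypothetical counts chosen only to exercise the formulas.
        long tp = 8, fp = 2, fn = 2;
        System.out.println("precision=" + precision(tp, fp) + " recall=" + recall(tp, fn));
    }
}
```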
  25. Integrating Spark model in Storm
     [Data flow diagram] Kafka Spout; Storm Prediction Bolt; Real-time Serving (HBase); ActiveMQ; Ops Center; LOB Dashboards
     Storm Prediction Bolt:
     • Initialize Spark model
     • Parse truck event
     • Enrich event with HBase data
     • Predict violation with model
     • Send alert if violation predicted
  26. Recommendations to CDO
     • Investment recommendations, in order of priority
       1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
       2. Invest in slip resistant tires to fight rainy conditions
       3. Invest in certifying drivers to reduce violation probability
     • Power of real time predictions
       - 40% reduction in violation rates by predicting high risk situations in real time and sending immediate alerts to drivers
  27. Value of large scale ML on HDP
     • Accelerate time to market/value
       - Test out multiple ML algorithms against TBs of training data in reasonable time frames
       - Confirm hypothesis against TBs of training data with confidence
       - We confirmed that fog does impact safety and wage plans do not, whereas BI tools indicated otherwise
     • Easily integrate predictive models in data driven apps
       - Run predictive models in Storm or any other app in your enterprise
     • Run all of the above in a multi-tenant YARN cluster
       - Large scale ML on YARN respects other tenants in an HDP cluster
  28. Calling Spark from a Storm Bolt
     • The outputs of a logistic regression model are weights and an intercept value (Scala):

         val algorithm = new LogisticRegressionWithSGD()
         val model = algorithm.run(training).clearThreshold()
         println(model.weights)
         println(model.intercept)

       Weights [-0.40819922025591465, 0.06392530395655666, -0.1346227352186122, -0.07188217286407801, 0.7277326276521062, 0.508779221680863, -0.024689093098281954]
       Intercept 0.0

     • The model can then be reconstructed in a Storm bolt with the above weights to make predictions (Java):

         import org.apache.spark.mllib.classification.LogisticRegressionModel;
         import org.apache.spark.mllib.linalg.Vector;
         import org.apache.spark.mllib.linalg.Vectors;
         ...
         Vector weights = Vectors.dense(new double[] { <array of weights like above> });
         LogisticRegressionModel model = new LogisticRegressionModel(weights, 0.0);
         double prediction = model.predict(<input features>);
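     To make clear what the reconstructed model computes, the following plain-Java sketch applies the logistic function to the weighted sum of the features, with no Spark dependency. It is an illustration under assumptions, not the presenter's bolt code: the class name is hypothetical, and the feature order is assumed to match the training-data columns from the earlier slide (certified, wage plan, hours, miles, foggy, rainy, windy).

```java
// Plain-Java sketch: logistic regression scoring from exported weights.
final class TruckViolationScorer {
    // Weights and intercept copied from the model output printed on the slide.
    static final double[] WEIGHTS = {
        -0.40819922025591465, 0.06392530395655666, -0.1346227352186122,
        -0.07188217286407801, 0.7277326276521062, 0.508779221680863,
        -0.024689093098281954
    };
    static final double INTERCEPT = 0.0;

    // Returns the estimated probability of a violation, between 0 and 1.
    static double score(double[] features) {
        if (features.length != WEIGHTS.length) {
            throw new IllegalArgumentException("expected " + WEIGHTS.length + " features");
        }
        double margin = INTERCEPT;
        for (int i = 0; i < WEIGHTS.length; i++) {
            margin += WEIGHTS[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-margin));
    }

    // Assumed decision rule: flag a violation when the score exceeds the threshold.
    static boolean predictsViolation(double[] features, double threshold) {
        return score(features) > threshold;
    }

    public static void main(String[] args) {
        // Feature order is an assumption; values taken from the sample Overspeed row.
        double[] uncertifiedInFogAndRain = {0, 0, 0.72, 0.4152, 1, 1, 0};
        System.out.println(score(uncertifiedInFogAndRain));
    }
}
```

     Exporting only the weights keeps the bolt lightweight: scoring one event is a dot product and one exponential, so it needs no Spark cluster at prediction time.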
  29. Why Apache Kafka?
     Open source real-time event stream processing platform that provides fixed, continuous & low latency processing for very high frequency streaming data
     • Highly scalable: horizontally scalable like Hadoop; e.g. a 3 node cluster can store 5M messages per second
     • Fault-tolerant: automatically reassigns on failed nodes
     • Guarantees delivery: supports message acknowledgements
     • Language agnostic: producers and consumers exist for many programming languages
     • Apache project: brand, governance & a large active community
  30. Key Capabilities of Storm
     • Data Ingest: extremely high ingest rates – millions of events/second
     • Processing: ability to easily plug in different processing frameworks; guaranteed processing – at-least-once processing semantics
     • Persistence: ability to persist data to multiple relational and non-relational data stores
     • Operations: HA, fault tolerance & management support
  31. Our preferred solution architecture
     [Architecture diagram] Real-time data feeds; Apache Kafka; HDP 2.x Data Lake: Real Time Stream Processing (Storm), Online Data Processing (HBase), Search (Solr), YARN, HDFS
  32. What is Real Time Event Processing
     Real Time Event Processing System: a system that processes events as they happen and generates real-time information/actions
     Requirements
     • Ingest data at a high rate
     • Process the data while it's being collected
     • Continuously running
     • Low latency
     Components
     • Collection – process to collect raw data
     • Data Flow – process to move data
     • Processing – process to analyze data
     • Delivery – process to deliver the extracted information
     Technologies shown: Kafka, Storm
  33. What is Kafka?
     • High throughput distributed messaging system
     • Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes
     [Diagram] Several producers write to the Kafka Cluster; several consumers read from it
  34. Kafka: Anatomy of a Topic
     [Diagram] A topic split into Partition 0, Partition 1 and Partition 2. Each partition is an ordered sequence of messages numbered from 0 (old) upward (new); writes are appended at the new end, and the partitions have different lengths.
     • Partitioning allows topics to scale beyond a single machine/node
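     The topic anatomy above can be mimicked with a small in-memory toy. This is not the Kafka client API and not how Kafka stores data; the class, its methods and the hash-based key-to-partition rule are all assumptions made for illustration. It only shows the idea that each partition is an append-only log with its own offsets starting at 0.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a partitioned topic: one append-only list per partition.
final class ToyTopic {
    private final List<List<String>> partitions = new ArrayList<>();

    ToyTopic(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Assumed routing rule: messages with the same key land in the same partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends at the "new" end of the partition and returns the message's offset.
    long append(String key, String message) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(message);
        return log.size() - 1;
    }

    // Reads the message stored at a given offset of a given partition.
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    public static void main(String[] args) {
        ToyTopic truckEvents = new ToyTopic(3);
        long offset = truckEvents.append("truck-1", "speeding");
        System.out.println("partition " + truckEvents.partitionFor("truck-1") + ", offset " + offset);
    }
}
```

     Because offsets are per partition, ordering is only defined within a partition, which is why a key such as a truck ID is a natural routing choice in this scenario.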
  35. Key Constructs in Apache Storm
     • Tuples
     • Streams
     • Spouts
     • Bolts
     • Topology
  36. Tuples and Streams
     • What is a Tuple?
       - The fundamental data structure in Storm: a named list of values that can be of any data type.
     • What is a Stream?
       - An unbounded sequence of tuples.
       - Streams are the core abstraction in Storm and are what you "process" in Storm.
  37. Spouts
     • What is a Spout?
       - A source of streams: it generates the tuples that enter the topology
       - E.g.: JMS, Twitter, Log, Kafka Spout
       - Can spin up multiple instances of a Spout and dynamically adjust as needed
  38. Bolts
     • What is a Bolt?
       - Processes any number of input streams and produces output streams
       - Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, alerting logic
       - Can spin up multiple instances of a Bolt and dynamically adjust as needed
     • Bolts used in the Use Case:
       1. HBaseBolt: persisting and counting in HBase
       2. HDFSBolt: persisting into HDFS as Avro files using Flume
       3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold
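     The MonitoringBolt's alerting rule can be sketched without Storm or HBase. In this hypothetical Java class an in-memory map stands in for the HBase counts, and the caller is expected to send the email or ActiveMQ message when `recordViolation` returns true; none of these names come from the demo's source code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of threshold-based alerting: count incidents per driver and signal
// an alert once a driver's count exceeds the configured threshold.
final class ViolationMonitor {
    private final int threshold;
    // Stand-in for the per-driver incident counts the slide keeps in HBase.
    private final Map<String, Integer> incidentCounts = new HashMap<>();

    ViolationMonitor(int threshold) {
        this.threshold = threshold;
    }

    // Records one incident; returns true when the driver is now over the threshold.
    boolean recordViolation(String driverId) {
        int count = incidentCounts.merge(driverId, 1, Integer::sum);
        return count > threshold;
    }

    int countFor(String driverId) {
        return incidentCounts.getOrDefault(driverId, 0);
    }

    public static void main(String[] args) {
        ViolationMonitor monitor = new ViolationMonitor(2);
        for (int i = 1; i <= 3; i++) {
            boolean alert = monitor.recordViolation("driver-42");
            System.out.println("incident " + i + " alert=" + alert);
        }
    }
}
```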
  39. Topology
     • What is a Topology?
       - A network of spouts and bolts wired together into a workflow
     [Diagram] Truck-Event-Processor Topology: the Kafka Spout feeds streams to the HBase Bolt, Monitoring Bolt, HDFS Bolt and WebSocket Bolt