Big Data Intro

mhookey
April 02, 2013

High level slides from Big Data discussion in HK

Transcript

  1. This document contains proprietary and confidential information and may be subject to a non-disclosure agreement. If you have received this in error, please notify the sender immediately.

     1. Analytics / Big Data intro
     2. Data Science 101 workshop
        • Data cleaning
        • Variable selection
        • Parametric modeling
        • Non-parametric modeling
     3. Discussion
  2. There are 2 flavors of Big Data … (www.demystdata.com)

     [Diagram axes: Observations, Attributes]

     “Short and fat”, e.g.
     • 10k customer quotes and conversions with hundreds of demographic attributes
     • 100k insurance claims with thousands of vehicle data points
     • High information content

     “Tall and skinny”, e.g.
     • Billions of tweets with user and content only
     • Web log data with IP and controller action only
     • Petabytes and more
  3. … each raising different questions …

     [Diagram axes: Observations, Attributes]

     • How do we apply machine learning and statistics in distributed, streaming systems?
     • How do we translate complicated models into realtime analytics?
     • How do we get basic insights at scale?
     • Which attributes matter the most?

     Labels: Data science (today’s focus); Software engineering
  4. The sequence of analytical activities has not fundamentally changed …

     [Stack diagram, top to bottom]
     • Better decisions
     • Visualization
     • Simulation & optimization
     • Machine learning
     • Data reduction
     • Enriched data
     • Raw starting data
     • Hardware

     Other diagram labels: Track and learn; capacity; results; sources; approach
  5. … let’s demonstrate with a simple exercise

     Census dataset
     • 32k rows
     • Predict Pr(Salary > $50k)

     Steps
     1. Cross tabulation
     2. Basic model
     3. Non-parametric model

     Layers shown: Machine learning, Data reduction, Enriched data
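
     The first step, cross tabulation, can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the rows, the attribute names (`relationship`, `high_income`) and the helper names (`cross_tab`, `positive_rate`) are hypothetical and are not taken from the deck or from the census file itself.

```ruby
# Hypothetical sample rows; the exercise on the slide uses a ~32k-row census dataset.
ROWS = [
  { relationship: "married",   high_income: true  },
  { relationship: "married",   high_income: false },
  { relationship: "married",   high_income: true  },
  { relationship: "unmarried", high_income: false },
  { relationship: "unmarried", high_income: false },
  { relationship: "unmarried", high_income: true  },
  { relationship: "unmarried", high_income: false },
].freeze

# Count rows for every (attribute value, outcome) pair.
def cross_tab(rows, attribute, outcome)
  table = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  rows.each { |row| table[row[attribute]][row[outcome]] += 1 }
  table
end

# Turn the counts into the observed share of positive outcomes per value.
def positive_rate(table)
  table.transform_values do |counts|
    counts[true].to_f / counts.values.sum
  end
end

counts = cross_tab(ROWS, :relationship, :high_income)
positive_rate(counts).each do |value, rate|
  puts format("%-10s %.2f", value, rate)
end
```

     Each rate is an empirical estimate of Pr(Salary > $50k) for one attribute value, which is the kind of one-way view a cross tabulation gives before any model is fitted.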
  6. Nonparametric model – No assumption of sampling from a specific distribution

     Other approaches:
     - Neural networks
     - SVM
     - K-means
     - Gradient boosting
     - Hybrid
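
     To make "no assumption of a specific distribution" concrete, here is a minimal sketch of a single tree-style split whose threshold is chosen directly from the data. The slide does not say which technique was used in the workshop, so the single split is swapped in purely for illustration; the sample values and the helper names (`errors_for`, `best_threshold`) are hypothetical.

```ruby
# Hypothetical (capital_gain, high_income) pairs, for illustration only.
SAMPLE = [
  [0, false], [500, false], [2_000, false], [3_000, true],
  [7_500, true], [9_000, true], [15_000, true], [1_000, false],
].freeze

# Number of misclassified rows if we predict "high income"
# whenever the value is at or above the threshold.
def errors_for(sample, threshold)
  sample.count { |value, label| (value >= threshold) != label }
end

# Try every observed value as a candidate threshold and keep
# the one with the fewest misclassifications.
def best_threshold(sample)
  sample.map(&:first).uniq.sort.min_by { |t| errors_for(sample, t) }
end

threshold = best_threshold(SAMPLE)
puts "split at #{threshold} with #{errors_for(SAMPLE, threshold)} errors"
```

     Nothing here fits a mean, variance or coefficient; the rule is read off the observed values, which is why adding more data can change its shape as well as its numbers.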
  7. How do we productionalize these insights … batch & realtime

     One way cut:

         if mycustomer.relationship.start_with?('married')
           render 'expensive_products'
         else
           render 'discount'
         end

     Parametric model:

         income = 0
         mycustomer.each do |k, v|
           income += coefficients[k][v]  # add the coefficient for each attribute value
         end

     Non parametric model:

         if mycustomer.relationship.start_with?('married')
           if mycustomer.capital.gain < 7000
             render 'a'
           else
             render 'b'
           end
         else # ...
           render 'discount'
         end
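
     The slide's snippets rely on a surrounding web framework (`render`) and on objects that are not shown. As a rough, self-contained sketch of the same two scoring styles, the version below uses plain hashes. The coefficient values, the attribute names and the method names (`parametric_score`, `tree_segment`) are assumptions made for illustration, not the deck's actual model.

```ruby
# Hypothetical coefficients: one additive contribution per attribute value.
COEFFICIENTS = {
  relationship: { "married" => 1.5, "single" => -0.5 },
  education:    { "bachelors" => 1.0, "hs-grad" => 0.0 },
}.freeze

# Parametric model: sum the coefficient for each attribute value.
# Attributes or values without a coefficient contribute 0.0.
def parametric_score(customer, coefficients = COEFFICIENTS)
  customer.sum do |attribute, value|
    coefficients.fetch(attribute, {}).fetch(value, 0.0)
  end
end

# Non-parametric model: nested rules of the kind a tree would produce.
def tree_segment(customer)
  if customer[:relationship].start_with?("married")
    customer[:capital_gain] < 7000 ? "a" : "b"
  else
    "discount"
  end
end

customer = { relationship: "married", education: "bachelors", capital_gain: 2_500 }
puts parametric_score(customer)
puts tree_segment(customer)
```

     Both functions are cheap lookups over data that is already in memory, so the same code could in principle run per request (realtime) or over a whole file (batch).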
  8. The goal posts are shifting with innovation at every level

     [Stack diagram: Better decisions; Visualization; Simulation & optimization; Machine learning; Data reduction; Enriched data; Raw starting data; Hardware]

     Then                       Now
     Silo’d by channel          Cross channel & industry optimization
     Niche providers            Tools & capabilities
     Pivot tables               Hosted BI platforms
     Linear regression          Non parametric
     SQL scripts                Hive/Pig/NoSQL
     Enrich only ‘as needed’    Gather everything
     CRM data                   Every touchpoint
     Dedicated capacity         AWS/Private clouds
  9. Ok. So what?

     Discussion:
     • Will this favor the incumbent or new entrant? E.g.
       • Banks
       • Telcos
       • Retailers
       • B2C startups
     • What will it mean for service providers? E.g.
       • Technology conglomerates
       • Consulting firms
       • B2B startups
     • What will be outsourced / insourced?
     • Does Hong Kong have the potential to be a Big Data hub?