Python and Hadoop: Big Data Application Development with PyCascading

PyCon Canada

August 11, 2013

Transcript

  1. Python and Hadoop: Big Data Application Development with PyCascading
     Craig Hawco, Technical Lead | August 11, 2013
     Data Processing with Hive and Cascading
  2. Agenda
     •  Overview of building analytics applications
     •  Brief introduction to MapReduce and Hadoop
     •  Introduction to Cascading and PyCascading
  3. Who is this guy?
     •  First got into analytics in finance in 2005
        o  Different world
        o  No Hadoop
     •  Hired as analytics developer at Polar Mobile in 2009
        o  Designed Hive-based mobile analytics platform
        o  Ended up as Director of Engineering
     •  Engineer at Kontagent
        o  Yellow elephant tamer
        o  Working on new analytics platform based on Hadoop
  4. Kontagent Facts
     •  Founded in 2007
     •  130+ employees and growing
     •  100s of Customers
     •  1000s of Apps Instrumented
     •  250+ billion events per month
     •  200MM+ MAUs
     •  1 Trillion Events in 2013
  5. What is Analytics?
     •  Analytics is making sense of your data.
     •  Gain insight into some aspect of your business and produce actionable results.
     [Chart: Cumulative Spend by Days Since Install]
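As a rough illustration of the kind of chart on this slide, the plain-Python sketch below computes cumulative spend by days since install. The purchase records and the function name `cumulative_spend` are made up for illustration; they are not Kontagent data or code from the talk.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical (days_since_install, amount_spent) purchase records.
purchases = [(0, 5.0), (0, 2.5), (3, 10.0), (7, 1.0), (7, 4.0)]


def cumulative_spend(records):
    """Return (day, running_total) pairs sorted by day since install."""
    per_day = defaultdict(float)
    for day, amount in records:
        per_day[day] += amount
    days = sorted(per_day)
    totals = accumulate(per_day[d] for d in days)
    return list(zip(days, totals))


curve = cumulative_spend(purchases)
```

Each pair in `curve` is one point on a cumulative-spend curve; days with no purchases are simply absent from the output.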
  6. Data at Kontagent
     •  Web API logs
     •  Event log based on event type
        o  APA – install message
        o  MTU – monetization
        o  EVT – custom event
  7. MapReduce
     •  Interest sparked by Google’s whitepapers
     •  Inspired by functional programming
     •  Computational model for dealing with distributed systems
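To make the computational model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It is not Hadoop code: the function names and the sample records are assumptions for illustration, and only the three event type codes come from the "Data at Kontagent" slide.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical event records as (event_type, user) pairs; the users are made up.
events = [("APA", "u1"), ("EVT", "u1"), ("MTU", "u2"), ("EVT", "u2"), ("EVT", "u3")]


def map_phase(record):
    """Emit one (key, 1) pair per input record."""
    event_type, _user = record
    yield (event_type, 1)


def shuffle(pairs):
    """Group mapped values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


mapped = (pair for record in events for pair in map_phase(record))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster the map and reduce calls run on many machines and the shuffle moves data between them; the sketch only shows the shape of the three steps.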
  8. Hadoop
     Hadoop is a distributed data processing framework.
     •  Based on the Google MapReduce and GFS whitepapers
     •  Open Source
     •  Ecosystem
  9. Hadoop MapReduce Frameworks
     Humans don’t think in MapReduce!
     •  Many options which are more natural for humans
        o  Hive
        o  Pig
        o  Cascading
     •  Frameworks provide alternative computational models
        o  Declarative – Let the planner do the work
        o  Imperative but not MapReduce – Some input over execution
  10. Cascading
      Flow-based computational model
      •  Not explicitly tied to Hadoop, but runs on MapReduce
      •  Many Domain-Specific Languages available
         o  Cascalog
         o  Scalding
         o  Lingual
         o  PyCascading
         o  Cascading.JRuby
  11. Taps
      Start/Finish!
      •  Produces a Tuple stream to process (or write out)
      •  Scheme = Input Format
      •  No wrappers in PyCascading
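The idea of a tap paired with a scheme can be sketched in plain Python as follows. This is an illustration of the concept only, not the Cascading or PyCascading API: `tsv_scheme`, `source_tap`, and `sink_tap` are hypothetical names, and in-memory streams stand in for files on HDFS.

```python
import io


def tsv_scheme(line):
    """Hypothetical scheme: parse one tab-separated line into a tuple."""
    return tuple(line.rstrip("\n").split("\t"))


def source_tap(stream, scheme):
    """Start of a flow: yield one tuple per line of the input stream."""
    for line in stream:
        yield scheme(line)


def sink_tap(stream, tuples):
    """End of a flow: write each tuple out as a tab-separated line."""
    for fields in tuples:
        stream.write("\t".join(fields) + "\n")


source = io.StringIO("APA\tu1\nEVT\tu2\n")
sink = io.StringIO()
sink_tap(sink, source_tap(source, tsv_scheme))
```

The scheme decides how raw lines become tuples, while the tap decides where the data lives, so the same scheme can be reused with a different storage location.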
  12. Operators
      Do some work!
      •  Transform the tuple stream
      •  Lots of different types
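A few common kinds of tuple-stream transformation can be sketched with plain Python generators. The operator names below (`each`, `keep`, `count_by`) and the sample input are hypothetical and chosen for illustration; they are not the operator names used by Cascading or PyCascading.

```python
from itertools import groupby
from operator import itemgetter


def each(fn, tuples):
    """Map-style operator: turn every input tuple into zero or more tuples."""
    for t in tuples:
        yield from fn(t)


def keep(predicate, tuples):
    """Filter-style operator: pass through only the tuples that match."""
    return (t for t in tuples if predicate(t))


def count_by(index, tuples):
    """Grouping operator: count tuples per value of the field at `index`."""
    key = itemgetter(index)
    for value, group in groupby(sorted(tuples, key=key), key=key):
        yield (value, sum(1 for _ in group))


def split_words(t):
    """Hypothetical function: emit one single-field tuple per word."""
    for word in t[0].split():
        yield (word,)


lines = [("a tap is a tap",)]
words = each(split_words, lines)
long_enough = keep(lambda t: len(t[0]) > 1, words)
result = list(count_by(0, long_enough))
```

Because each operator consumes and produces a stream of tuples, they can be stacked in any order that makes sense for the data.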
  13. Operators

  14. Flows
      Fit all of your pieces together!
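One way to picture fitting the pieces together is to chain stages with the `|` operator. The sketch below is plain Python under that assumption; the `Stage` class, the stage names, and the sample tuples are hypothetical and do not reproduce the PyCascading API.

```python
class Stage:
    """Wrap a function from tuple stream to tuple stream so stages chain with |."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Composing two stages gives a stage that runs them back to back.
        return Stage(lambda tuples: other.fn(self.fn(tuples)))

    def run(self, tuples):
        return list(self.fn(tuples))


# Hypothetical stages for an event stream of (event_type, user) tuples.
only_custom = Stage(lambda ts: (t for t in ts if t[0] == "EVT"))
users = Stage(lambda ts: ((t[1],) for t in ts))
distinct = Stage(lambda ts: sorted(set(ts)))

flow = only_custom | users | distinct
result = flow.run([("APA", "u1"), ("EVT", "u2"), ("EVT", "u1"), ("EVT", "u2")])
```

Nothing executes until `run` is called, which loosely mirrors the idea of first describing a flow and then handing it to a planner to execute.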
  15. Run it!

  16. Questions?
      Need a job? We’re hiring: http://www.kontagent.com/company/careers/