Python and Hadoop: Big Data Application Development with PyCascading

PyCon Canada

August 11, 2013

Transcript

  1. Python and Hadoop: Big Data Application Development with PyCascading
    Craig Hawco, Technical Lead | August 11, 2013
    Data Processing with Hive and Cascading
  2. Agenda
    •  Overview of building analytics applications
    •  Brief introduction to MapReduce and Hadoop
    •  Introduction to Cascading and PyCascading
  3. Who is this guy?
    •  First got into analytics in finance in 2005
      o  Different world
      o  No Hadoop
    •  Hired as analytics developer at Polar Mobile in 2009
      o  Designed Hive-based mobile analytics platform
      o  Ended up as Director of Engineering
    •  Engineer at Kontagent
      o  Yellow elephant tamer
      o  Working on new analytics platform based on Hadoop
  4. Kontagent Facts
    •  Founded in 2007
    •  130+ employees and growing
    •  100s of Customers
    •  1000s of Apps Instrumented
    •  250+ billion events per month
    •  200MM+ MAUs
    •  1 Trillion Events in 2013
  5. What is Analytics?
    •  Analytics is making sense of your data.
    •  Gain insight into some aspect of your business and produce actionable results.
    [Chart: Cumulative Spend by Days Since Install]
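The chart on this slide plots cumulative spend against days since install. A minimal sketch of how such a curve could be computed in plain Python follows; the `purchases` records and the `cumulative_spend` helper are hypothetical and are not taken from the talk.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical purchase records: (days_since_install, amount_spent).
purchases = [(0, 5.0), (0, 2.5), (3, 10.0), (7, 1.0), (3, 4.0)]


def cumulative_spend(records):
    """Return [(day, running_total)] sorted by day since install."""
    per_day = defaultdict(float)
    for day, amount in records:
        per_day[day] += amount          # total spend on each day
    days = sorted(per_day)
    totals = accumulate(per_day[d] for d in days)  # running sum over days
    return list(zip(days, totals))


curve = cumulative_spend(purchases)
print(curve)
```

Days with no purchases are simply absent from the result; a plotting step could carry the previous total forward to fill them.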
  6. Data at Kontagent
    •  Web API logs
    •  Event log based on event type
      o  APA – install message
      o  MTU – monetization
      o  EVT – custom event
  7. MapReduce
    •  Interest sparked by Google’s whitepapers
    •  Inspired by functional programming
    •  Computational model for dealing with distributed systems
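To make the computational model concrete, here is a single-process sketch of the map, shuffle, and reduce phases that counts events by type. It runs in plain Python rather than on Hadoop, and the record layout and function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical event records: (event_type, user_id).
events = [("APA", "u1"), ("EVT", "u1"), ("MTU", "u2"), ("EVT", "u2"), ("EVT", "u1")]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    event_type, _user = record
    yield event_type, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(events)
print(counts)
```

On a real cluster the map and reduce calls run on different machines, and the shuffle moves data across the network; only the shape of the computation is shown here.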
  8. Hadoop
    Hadoop is a distributed data processing framework.
    •  Based on the Google MapReduce and GFS whitepapers
    •  Open Source
    •  Ecosystem
  9. Hadoop MapReduce Frameworks
    Humans don’t think in MapReduce!
    •  Many options which are more natural for humans
      o  Hive
      o  Pig
      o  Cascading
    •  Frameworks provide alternative computational models
      o  Declarative – Let the planner do the work
      o  Imperative but not MapReduce – Some input over execution
  10. Cascading
    Flow-based computational model
    •  Not explicitly tied to Hadoop, but runs on MapReduce
    •  Many Domain-Specific Languages available
      o  Cascalog
      o  Scalding
      o  Lingual
      o  PyCascading
      o  Cascading.JRuby
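A flow-based model can be pictured as a stream of tuples passing through a chain of stages. The sketch below imitates that idea with plain Python generators; it is not the Cascading or PyCascading API, and `flow`, `only_type`, and `project` are hypothetical helpers.

```python
from functools import reduce


def flow(source, *stages):
    """Chain stages so each consumes the previous stage's tuple stream."""
    return reduce(lambda stream, stage: stage(stream), stages, source)


def only_type(wanted):
    """Stage that keeps tuples whose first field equals `wanted`."""
    def stage(stream):
        return (t for t in stream if t[0] == wanted)
    return stage


def project(*indexes):
    """Stage that keeps only the fields at the given positions."""
    def stage(stream):
        return (tuple(t[i] for i in indexes) for t in stream)
    return stage


# Hypothetical records: (event_type, user_id, amount).
records = [("MTU", "u1", 4.99), ("EVT", "u1", 0.0), ("MTU", "u2", 0.99)]
result = list(flow(iter(records), only_type("MTU"), project(1, 2)))
print(result)
```

Because every stage is lazy, nothing is computed until the final `list(...)` pulls tuples through the chain, which loosely mirrors how a flow is first assembled and then run.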
  11. Taps
    Start/Finish!
    •  Produces a Tuple stream to process (or write out)
    •  Scheme = Input Format
    •  No wrappers in PyCascading
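The roles of a tap and a scheme can be illustrated without Hadoop: the scheme decides how a raw line becomes a tuple (and back), and the tap is the place where the stream starts or ends. The following is a plain-Python analogy under those assumptions, not PyCascading code; every name in it is hypothetical.

```python
import io


def tsv_parse(line):
    """Scheme, input side: turn one tab-separated line into a tuple."""
    return tuple(line.rstrip("\n").split("\t"))


def tsv_format(tup):
    """Scheme, output side: turn a tuple back into a tab-separated line."""
    return "\t".join(str(field) for field in tup) + "\n"


def source_tap(handle, parse=tsv_parse):
    """Start of a flow: yield a tuple stream read from `handle`."""
    for line in handle:
        if line.strip():                 # skip blank lines
            yield parse(line)


def sink_tap(stream, handle, fmt=tsv_format):
    """End of a flow: write the tuple stream out and return the count."""
    written = 0
    for tup in stream:
        handle.write(fmt(tup))
        written += 1
    return written


raw = io.StringIO("APA\tu1\nMTU\tu2\n")   # stands in for an input file
out = io.StringIO()                        # stands in for an output file
n = sink_tap(source_tap(raw), out)
print(n, repr(out.getvalue()))
```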
  12. Operators
    Do some work!
    •  Transform the tuple stream
    •  Lots of different types
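As a rough illustration of what transforming a tuple stream can mean, the sketch below shows three common kinds of operator: a per-tuple map, a filter, and a grouped aggregate. These are generic plain-Python stand-ins chosen for this example; the slide does not list the specific operator types, and the names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_op(fn):
    """Apply `fn` to every tuple in the stream."""
    return lambda stream: (fn(t) for t in stream)


def filter_op(predicate):
    """Keep only tuples for which `predicate` is true."""
    return lambda stream: (t for t in stream if predicate(t))


def group_sum(key_index, value_index):
    """Group tuples by one field and sum another field within each group."""
    def op(stream):
        ordered = sorted(stream, key=itemgetter(key_index))  # groupby needs sorted input
        for key, group in groupby(ordered, key=itemgetter(key_index)):
            yield key, sum(t[value_index] for t in group)
    return op


# Hypothetical records: (user_id, spend_in_cents).
spend = [("u1", 200), ("u2", 500), ("u1", 300), ("u3", 0)]

paying = filter_op(lambda t: t[1] > 0)(iter(spend))
totals = group_sum(0, 1)(paying)
dollars = list(map_op(lambda t: (t[0], t[1] / 100))(totals))
print(dollars)
```

The grouped aggregate is the only operator here that has to see the whole stream before emitting anything, which is the step that corresponds to a reduce phase on Hadoop.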