Python and Hadoop: Big Data Application Development with PyCascading

Craig Hawco, Technical Lead | August 11, 2013
Data Processing with Hive and Cascading

Agenda
•  Overview of building analytics applications
•  Brief introduction to MapReduce and Hadoop
•  Introduction to Cascading and PyCascading

Who is this guy?
•  First got into analytics in finance in 2005
    o  Different world
    o  No Hadoop
•  Hired as analytics developer at Polar Mobile in 2009
    o  Designed Hive-based mobile analytics platform
    o  Ended up as Director of Engineering
•  Engineer at Kontagent
    o  Yellow elephant tamer
    o  Working on new analytics platform based on Hadoop

Kontagent Facts
•  Founded in 2007
•  130+ employees and growing
•  100s of Customers
•  1000s of Apps Instrumented
•  250+ billion events per month
•  200MM+ MAUs
•  1 Trillion Events in 2013

What is Analytics?
•  Analytics is making sense of your data.
•  Gain insight into some aspect of your business and produce actionable results.
[Chart: Cumulative Spend vs. Days Since Install]

Data at Kontagent
•  Web API logs
•  Event log based on event type
    o  APA – install message
    o  MTU – monetization
    o  EVT – custom event

MapReduce
•  Interest sparked by Google’s whitepapers
•  Inspired by functional programming
•  Computational model for dealing with distributed systems
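The map and reduce steps can be sketched in plain Python on a single machine. This is only an illustration of the computational model, not Hadoop code: the record layout and the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are assumptions made up for the example, reusing the APA/MTU/EVT event codes from the previous slide.

```python
from collections import defaultdict

# Hypothetical log records: (event_type, user_id). The real Kontagent
# log format is not shown in the slides.
RECORDS = [
    ("APA", "u1"),
    ("EVT", "u1"),
    ("MTU", "u2"),
    ("EVT", "u2"),
    ("EVT", "u3"),
]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    event_type, _user_id = record
    yield event_type, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(RECORDS)
```

Because `map_phase` looks at one record at a time and `reduce_phase` looks at one key at a time, both steps can be spread across many machines, which is the part a framework such as Hadoop takes care of.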

Hadoop
Hadoop is a distributed data processing framework.
•  Based on the Google MapReduce and GFS whitepapers
•  Open Source
•  Ecosystem

Hadoop MapReduce Frameworks
Humans don’t think in MapReduce!
•  Many options which are more natural for humans
    o  Hive
    o  Pig
    o  Cascading
•  Frameworks provide alternative computational models
    o  Declarative – Let the planner do the work
    o  Imperative but not MapReduce – Some input over execution

Cascading
Flow-based computational model
•  Not explicitly tied to Hadoop, but runs on MapReduce
•  Many Domain-Specific Languages available
    o  Cascalog
    o  Scalding
    o  Lingual
    o  PyCascading
    o  Cascading.JRuby

Taps
Start/Finish!
•  Produces a Tuple stream to process (or write out)
•  Scheme = Input Format
•  No wrappers in PyCascading
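The idea of a tap as the start or finish of a tuple stream, with a scheme deciding how raw input becomes fields, can be sketched in plain Python. This is not the Cascading or PyCascading API: `tsv_scheme`, `source_tap` and `sink_tap` are hypothetical names, and the tab-separated layout is an assumption for the example.

```python
import io


def tsv_scheme(line):
    """Hypothetical scheme: turn one raw line into a tuple of fields."""
    return tuple(line.rstrip("\n").split("\t"))


def source_tap(stream, scheme):
    """Start of a flow: read raw lines and yield a stream of tuples."""
    for line in stream:
        if line.strip():  # skip blank lines
            yield scheme(line)


def sink_tap(tuples, stream):
    """End of a flow: write each tuple out as one tab-separated line."""
    written = 0
    for tup in tuples:
        stream.write("\t".join(tup) + "\n")
        written += 1
    return written


# In-memory stand-ins for files, so the sketch runs anywhere.
raw = io.StringIO("APA\tu1\nEVT\tu1\n\nMTU\tu2\n")
out = io.StringIO()
written = sink_tap(source_tap(raw, tsv_scheme), out)
```

Keeping the scheme separate from the tap means the same source can be read as tab-separated fields, whole lines or some other layout without changing the rest of the flow.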

Operators
Do some work!
•  Transform the tuple stream
•  Lots of different types
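A few kinds of transformation on a tuple stream can be sketched with plain Python generators. The helpers below are loosely modeled on the per-tuple and grouping operators Cascading offers, but `each_function`, `each_filter` and `group_by_count` are hypothetical names for this illustration, not PyCascading calls.

```python
from itertools import groupby
from operator import itemgetter


def each_function(tuples, fn):
    """Apply fn to every tuple and pass the result downstream."""
    for tup in tuples:
        yield fn(tup)


def each_filter(tuples, keep):
    """Pass a tuple downstream only when keep(tuple) is true."""
    for tup in tuples:
        if keep(tup):
            yield tup


def group_by_count(tuples, field):
    """Group tuples on one field position and count each group."""
    ordered = sorted(tuples, key=itemgetter(field))
    for key, group in groupby(ordered, key=itemgetter(field)):
        yield key, sum(1 for _ in group)


# Hypothetical (event_type, user_id) tuples.
events = [("evt", "u1"), ("apa", "u1"), ("evt", "u2"), ("mtu", "u2")]

upper = each_function(events, lambda t: (t[0].upper(), t[1]))
no_installs = each_filter(upper, lambda t: t[0] != "APA")
counts = list(group_by_count(no_installs, 0))
```

The per-tuple helpers never need to see more than one tuple at a time, while the grouping helper has to collect and sort its input first. That is roughly the difference between work that can stay on the map side and work that needs a reduce step.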

Operators

Flows
Fit all of your pieces together!
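Fitting the pieces together can be sketched as composing small steps into a single object that is built first and run later. The `Pipe` class and the `each` and `keep` helpers below are hypothetical, and the `|` chaining is only a notation chosen for this example. It is a single-process sketch of the idea, not the PyCascading flow API.

```python
class Pipe:
    """Hypothetical wrapper that lets steps be chained with the | operator."""

    def __init__(self, step):
        # step: callable taking an iterable of tuples, returning an iterable
        self.step = step

    def __or__(self, other):
        # Compose: feed this step's output into the next step.
        return Pipe(lambda tuples: other.step(self.step(tuples)))

    def run(self, tuples):
        # Nothing executes until run() is called on the assembled flow.
        return list(self.step(tuples))


def each(fn):
    """Build a step that transforms every tuple."""
    return Pipe(lambda tuples: (fn(t) for t in tuples))


def keep(pred):
    """Build a step that drops tuples failing the predicate."""
    return Pipe(lambda tuples: (t for t in tuples if pred(t)))


# Hypothetical (event_type, user_id, amount) tuples.
data = [("MTU", "u1", "4.99"), ("EVT", "u1", "0"), ("MTU", "u2", "0.99")]

# Assemble the flow first, then run it.
flow = keep(lambda t: t[0] == "MTU") | each(lambda t: (t[1], float(t[2])))
result = flow.run(data)
```

Separating assembly from execution is what lets a planner inspect the whole flow before deciding how to run it.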

Run it!

Questions?
Need a job? We’re hiring: http://www.kontagent.com/company/careers/

PyCon Canada
