Slide 1

Slide 1 text

Python for Data? No viz, no analytics, just the hidden truth of engineering. Bence Faludi, Wunderlist / Microsoft

Slide 2

Slide 2 text

whoami • Data & Applied Scientist @Microsoft • Working on @Wunderlist's data infrastructure. • Member of the Python Software Foundation and NumFOCUS. • Open source addict, author of several packages. bfaludi https://github.com/bfaludi

Slide 3

Slide 3 text

Today we'll cover... • Doing data in Python • Real estate website use case • Wunderlist use case • What's next?

Slide 4

Slide 4 text

Data stack in Python

Slide 5

Slide 5 text

There is a library for everything!

Slide 6

Slide 6 text

IPython Notebook* * The picture shows Pineapple, a standalone IPython frontend for Mac that doesn't require Anaconda.

Slide 7

Slide 7 text

If you need more speed, just use PyPy, without changing a line of code.
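The claim above applies to pure-Python, CPU-bound code with no C-extension dependencies. A sketch of the kind of function that benefits (the word-counting example itself is illustrative, not from the talk) runs unchanged under either interpreter:

```python
# A CPU-bound, pure-Python function: the kind of code PyPy's JIT
# speeds up with no source changes at all.
def word_counts(lines):
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

if __name__ == "__main__":
    data = ["to do lists", "to do apps", "lists of lists"] * 10_000
    # Invoke with `python script.py` or `pypy script.py` interchangeably.
    word_counts(data)
```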

Slide 8

Slide 8 text

Sounds too good to be true, doesn't it?

Slide 9

Slide 9 text

Multiple problems • Not every package is ported to Python 3. • UnicodeDecodeError in older Python versions. • Few packages integrate well with Hadoop. • Data handling is memory bound and far from optimal. • Global Interpreter Lock (GIL) limitations. • Monolithic packages with tons of dependencies.

Slide 10

Slide 10 text

Have you ever tried to ... • write a CSV file in Python 2.x with Chinese letters in it? • load an XML file which is larger than 8 GB? • use shared memory with multiprocessing? • use matplotlib right after you used ggplot? • stream gzipped data from AWS with boto3?

Slide 11

Slide 11 text

Python is a great language in general

Slide 12

Slide 12 text

Let's look at a Real estate website use case

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Data infrastructure's goal • Process up to 200k real estate updates per day. • Grab agencies' data from 100+ sources and create a standardised output. • Geolocate addresses based on incomplete data and descriptions. • Check changes and skip unchanged data. • Detect similarities & duplicates within the dataset.
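The deck doesn't show how the similarity and duplicate detection works; as an illustrative sketch only (not the production logic), fuzzy matching of normalized listing text with the standard library's difflib gives the flavor:

```python
from difflib import SequenceMatcher

def normalize(text):
    # Crude normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def are_similar(a, b, threshold=0.9):
    """Flag two listing descriptions as likely duplicates when their
    normalized texts are nearly identical."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold
```

A real pipeline would also compare structured fields (price, address, geolocation), but the thresholded-similarity idea is the same.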

Slide 15

Slide 15 text

Data pipeline (diagram) Sources (XML, JSON, CSV, TSV, Spreadsheet, APIs) flow through the steps: download files (where the source was sliced into multiple files) → map the data into a standardised format → filter out and log incomplete records → geolocate → merge source → collect changes → detect similarity → JSON production database. Diagram legend: streaming, computed, parallel sources.

Slide 16

Slide 16 text

Data pipeline (diagram, annotated with tooling) The same pipeline as before, with each step labeled by its implementation: download is Python & bash scripts, most mapping/filtering/merging steps are mETL, geolocation and similarity detection are mETL with extensions, and one step is a Golang script.

Slide 17

Slide 17 text

mETL** An extract, transform and load library for Python 2.7: • Works with 9 source types and 11 target types. • Over 35 built-in transformations. • No GUI; configuration in YAML format. • Checks differences between migrations. • Quick transformations and manipulations, plus it's easy to add your own. ** https://github.com/ceumicrodata/mETL

Slide 18

Slide 18 text

mETL** example

source:
  source: CSV
  resource: http://path/to/file.csv
  skipRows: 1
  headerRow: 0
  fields:
    - name: country_code
    - name: name
    - name: nfkd_name
      map: name
      transforms:
        - transform: Homogenize
        - transform: LowerCase
    - name: type
    - name: gender
    - name: population
      type: Integer
target:
  type: JSON
  compact: false
  rootIterator: records
  resource: output.json

** https://github.com/ceumicrodata/mETL

Slide 19

Slide 19 text

mETL** Pros: • A project manager can modify the configuration files. • Understandable & reliable. • Really easy to extend and to write your own Python scripts. • You can use it standalone. A lifesaver sometimes. (csv+unicode) • Perfect for converting and cleaning data. Cons: • Python 3 is not supported. • Huge package, lots of dependencies, and using it without YAML is hard. • Doesn't contain a flow scheduler, and Luigi is overkill. • Can't override attributes from bash. ** https://github.com/ceumicrodata/mETL

Slide 20

Slide 20 text

Wunderlist use case

Slide 21

Slide 21 text

What is Wunderlist? • Productivity app on every platform. • 14.75+ million users in 5 years. • From monolith Rails to polyglot microservices (Scala, Clojure, Go), heavy on AWS. "Wunderlist is the easiest way to get stuff done. Whether you’re planning a holiday, sharing a shopping list with a partner or managing multiple work projects, Wunderlist is here to help you tick off all your personal and professional to-dos." — http://www.wunderlist.com

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Data infrastructure's goal • Collect every event from tracking (~125M/day). • Parse and load compressed log files' content into Redshift (~320 GB/day). • Mirror production databases (~35 sources, 30 GB incremental/day). • Load external sources into Redshift (e.g. app store, payments). • Calculate KPIs, aggregates and business logic (200+ queries).

Slide 24

Slide 24 text

Data infrastructure (diagram) Clients (phone, tablet, etc.) and microservice applications feed tracking (SNS, SQS, dumper) and logging (Rsyslog, Noxy, EMR); data lands in S3 and is loaded into hot and cold storage (both Redshift), alongside mirrored production database(s) and external sources; Postamt handles email and Chart.io sits on top for reporting. Diagram legend: AWS + DWH, reporting, data flow.

Slide 25

Slide 25 text

Mantra Keep it simple. Don't reinvent the wheel.

Slide 26

Slide 26 text

Implementation plan 1. Use cron for scheduling. 2. Use make for dependencies, partial results, and retries. 3. Glue everything together with a handful of bash scripts. • Most process handling and output redirection should be handled by bash and make, because they are good at it and it is more work to do right in Ruby or Python. • All complex logic (and math) should be in either a tool (make) or Ruby/Python. 4. Use Python or Ruby for the actual workers. 5. Inject variables and logic into SQL with Ruby's ERB.
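Step 5 above uses Ruby's ERB; the same variable-injection idea, sketched here in Python with the standard library's string.Template purely for illustration (this is not the tooling the team used), looks like:

```python
from string import Template

# A SQL text with placeholders to be filled in before execution.
SQL = Template("""
SELECT date, count(*) AS events
FROM $schema.tracking
WHERE date >= '$start_date'
GROUP BY date;
""")

def render(schema, start_date):
    # Substitute the variables into the SQL before handing it to the client.
    return SQL.substitute(schema=schema, start_date=start_date)
```

Templating the SQL this way keeps the queries under version control as plain text while letting the scheduler vary schema names and date windows per run.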

Slide 27

Slide 27 text

Data infrastructure (diagram, annotated with languages) The same architecture as before, with each component labeled by its implementation: microservices in Scala, Clojure, Golang and Node.js; Ruby for tooling; the loading and aggregation steps in bash + SQL (some with ERB); and the orchestration layer built on night-shift + trackingshell + Flask.

Slide 28

Slide 28 text

night-shift*** The skeleton of our data flow. Almost no dependencies. Written in Python, Ruby and bash. • Makefile wrapper that gets triggered by cron. • Runs all make targets in a tracking shell1, so timing information, output and errors can be logged. • Has a timer script for cron-like target timing. • Has a script to inject conditionals, variables and Ruby logic into SQL. • Converts SQL results into CSV from MySQL, PostgreSQL and Redshift. • Has a Flask application to monitor your logs. 1 https://github.com/wunderlist/trackingshell *** https://github.com/wunderlist/night-shift
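The tracking-shell idea above, wrapping each make target so its timing, output and exit status get logged, can be sketched in miniature like this (an illustrative toy, not the real trackingshell implementation):

```python
import subprocess
import time

def run_tracked(target_name, command, log):
    """Run one make target's command, recording timing, output
    and exit status in a shared log."""
    start = time.time()
    result = subprocess.run(command, shell=True,
                            capture_output=True, text=True)
    log.append({
        "target": target_name,
        "seconds": round(time.time() - start, 3),
        "stdout": result.stdout,
        "stderr": result.stderr,
        "returncode": result.returncode,
    })
    return result.returncode == 0
```

make would invoke such a wrapper as its SHELL, so every recipe line gets timed and captured without changing the Makefile targets themselves.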

Slide 29

Slide 29 text

night-shift*** *** https://github.com/wunderlist/night-shift

Slide 30

Slide 30 text

Let's recap.

Slide 31

Slide 31 text

Python is good for everything.

Slide 32

Slide 32 text

which means Python is good for data.

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

What's next?

Slide 35

Slide 35 text

What's next? • A lot of Python 3 compatible releases in 2016. • More Apache Spark than before (it already supports Python). • Python as Redshift's user-defined function language. • mETL v2.0 will be released early next year with Python 3 support. Sliced into micro-packages (riwo, daprot, uniopen, dm, etc.), it will provide an easy-to-use Python interface and better bash support, and will integrate with night-shift out of the box. • night-shift will support Azure and Apache Spark. Plans to work with Mako templates.

Slide 36

Slide 36 text

Thank you for your attention! Bence Faludi @bfaludi