Using Python for data?

Bence Faludi
October 25, 2015

This talk was presented at PyCon Ireland, 2015.

Do you feel that R is too complex and Java is a pain in the ass? Is your company using PHP because somebody said it would be okay? Did you start working with R, then realize you can't write a microservice in it and need to pick another language?

Regardless of your answers, this talk is about what your biggest problems would be if you had chosen Python for your work in the first place.


Transcript

  1. Python for Data? No viz., no analytics, just the hidden truth of engineering. Bence Faludi, Wunderlist / Microsoft
  2. whoami
     • Data & Applied Scientist @Microsoft.
     • Working on @Wunderlist's data infrastructure.
     • Member of the Python Software Foundation and NumFOCUS.
     • Open source addict, author of several packages.
     bfaludi, https://github.com/bfaludi
  3. Today we'll cover...
     • Doing data in Python
     • Real estate website use case
     • Wunderlist use case
     • What's next?
  4. IPython Notebook*
     * The screenshot shows Pineapple, a standalone IPython frontend for Mac that doesn't require Anaconda.
  5. Multiple problems
     • Not every package is ported to Python 3.
     • UnicodeDecodeError in older Python versions.
     • Few packages integrate well with Hadoop.
     • Data handling is memory bound and far from optimal.
     • Global Interpreter Lock (GIL) limitations (see the sketch below).
     • Monolithic packages with tons of dependencies.
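
     On the GIL and memory points: a common workaround is multiprocessing with a shared array, so worker processes read one buffer instead of each receiving a pickled copy of the data. A minimal sketch, assuming POSIX fork semantics; the sizes and names are illustrative:

         from multiprocessing import Pool, Array
         import ctypes

         # One shared, lock-free buffer; forked workers inherit it instead
         # of receiving a pickled copy of the data.
         shared = Array(ctypes.c_double, 10000000, lock=False)

         def partial_sum(bounds):
             start, stop = bounds
             return sum(shared[i] for i in range(start, stop))

         if __name__ == '__main__':
             pool = Pool(4)  # sidesteps the GIL: 4 OS processes, not threads
             chunks = [(i, i + 2500000) for i in range(0, 10000000, 2500000)]
             total = sum(pool.map(partial_sum, chunks))
             pool.close()
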
  6. Have you ever tried to ...
     • write a CSV file in Python 2.x with Chinese characters in it?
     • load an XML file which is larger than 8 GB?
     • use shared memory with multiprocessing?
     • use matplotlib right after you used ggplot?
     • stream gzipped data from AWS with boto3? (see the sketch below)
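
     On the last point: gzip.GzipFile historically wants a seekable file, and boto3's streaming body is not seekable. One workaround is decompressing the stream chunk by chunk with zlib; a sketch, with the bucket, key and chunk size as illustrative placeholders:

         import zlib
         import boto3

         def stream_gzip_lines(bucket, key):
             # botocore's StreamingBody supports read() but not seek(),
             # so decompress incrementally with zlib instead of gzip.
             body = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body']
             decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip header
             pending = b''
             for chunk in iter(lambda: body.read(64 * 1024), b''):
                 pending += decompressor.decompress(chunk)
                 lines = pending.split(b'\n')
                 pending = lines.pop()  # keep the trailing partial line
                 for line in lines:
                     yield line
             pending += decompressor.flush()
             if pending:
                 yield pending
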
  7. Data infrastructure's goals
     • Process up to 200k real estate updates per day.
     • Grab agencies' data from 100+ sources and create a standardised output.
     • Geolocate addresses based on incomplete data and descriptions.
     • Check changes and skip unchanged data (see the sketch below).
     • Detect similarities & duplicates within the dataset.
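
     Skipping unchanged data can be as simple as comparing a fingerprint of each normalised record against the one stored for the previous run. A sketch of the idea, not the production code; the names are made up:

         import hashlib
         import json

         def fingerprint(record):
             # Serialise with sorted keys so the hash is stable across runs.
             canonical = json.dumps(record, sort_keys=True)
             return hashlib.sha1(canonical.encode('utf-8')).hexdigest()

         def changed_records(records, previous):
             # previous: {listing_id: fingerprint} loaded from the last run
             for listing_id, record in records:
                 digest = fingerprint(record)
                 if previous.get(listing_id) != digest:
                     previous[listing_id] = digest
                     yield listing_id, record
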
  8. Data pipeline (flow diagram): download files where the source was sliced into multiple files, map the data into a standardised format, filter out and log incomplete records, geolocate, merge sources, collect changes, detect similarity. Inputs: XML, JSON, CSV, TSV, spreadsheets, APIs; output: JSON into the production database. The stages run streaming and in parallel across sources.
  9. Data pipeline, annotated (flow diagram): the same flow as above, with each stage labelled by tool: Python & bash scripts, mETL for most stages, mETL with extensions for two of them, and a Golang script.
  10. mETL**
     An extract, transform and load library for Python 2.7:
     • Works with 9 source types and 11 target types.
     • Over 35 built-in transformations.
     • No GUI; configuration is in YAML format.
     • Checks differences between migrations.
     • Quick transformations and manipulations, plus it's easy to add your own.
     ** https://github.com/ceumicrodata/mETL
  11. mETL** example (YAML configuration):

         source:
           source: CSV
           resource: http://path/to/file.csv
           skipRows: 1
           headerRow: 0
           fields:
             - name: country_code
             - name: name
             - name: nfkd_name
               map: name
               transforms:
                 - transform: Homogenize
                 - transform: LowerCase
             - name: type
             - name: gender
             - name: population
               type: Integer
         target:
           type: JSON
           compact: false
           rootIterator: records
           resource: output.json

     ** https://github.com/ceumicrodata/mETL
  12. mETL**
     Pros:
     + A project manager can modify the configuration files.
     + Understandable & reliable.
     + Really easy to extend and write your own Python scripts.
     + You can use it standalone; a lifesaver sometimes (csv + unicode).
     + Perfect for converting and cleaning data.
     Cons:
     - Python 3 is not supported.
     - Huge package with a lot of dependencies, and using it without YAML is hard.
     - Doesn't contain a flow scheduler, and Luigi is overkill.
     - Can't override attributes from bash.
     ** https://github.com/ceumicrodata/mETL
  13. What is Wunderlist?
     • Productivity app on every platform.
     • 14.75+ million users in 5 years.
     • From a monolithic Rails app to polyglot microservices (Scala, Clojure, Go), heavy on AWS.
     "Wunderlist is the easiest way to get stuff done. Whether you're planning a holiday, sharing a shopping list with a partner or managing multiple work projects, Wunderlist is here to help you tick off all your personal and professional to-dos." (http://www.wunderlist.com)
  14. Data infrastructure's goals
     • Collect every event from tracking (~125M/day).
     • Parse and load compressed log files' content into Redshift (~320 GB/day; see the sketch below).
     • Mirror production databases (~35 sources, 30 GB incremental/day).
     • Load external sources into Redshift (e.g. app store, payments).
     • Calculate KPIs, aggregates and business logic (200+ queries).
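
     At that volume, loading means Redshift's COPY from S3 rather than row-by-row inserts. A sketch with psycopg2; the table, bucket and credentials here are placeholders, not Wunderlist's actual setup:

         import psycopg2

         # Bulk-load one day's gzipped JSON logs straight from S3.
         COPY_SQL = """
             COPY tracking_events
             FROM 's3://example-bucket/logs/2015-10-25/'
             CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
             GZIP JSON 'auto';
         """

         connection = psycopg2.connect(host='dwh.example.com', port=5439,
                                       dbname='analytics', user='loader',
                                       password='...')
         with connection:  # commits on success, rolls back on error
             with connection.cursor() as cursor:
                 cursor.execute(COPY_SQL)
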
  15. Data infrastructure (architecture diagram). Components: clients (phone, tablet, etc.), tracking via SNS and SQS, microservice applications, Rsyslog, Noxy, EMR, logging, dumper, email tracking (Postamt), S3, production database(s), external sources, hot storage (Redshift), cold storage (Redshift), and Chart.io for AWS + DWH reporting data flow.
  16. Implementation plan
     1. Use cron for scheduling.
     2. Use make for dependencies, partial results, and retries.
     3. Glue everything together with a handful of bash scripts.
        • Most process handling and output redirection should be handled by bash and make, because they are good at it and it is more work to do right in Ruby or Python.
        • All complex logic (and math) should be either in a tool (make) or in Ruby/Python.
     4. Use Python or Ruby for the actual workers.
     5. Inject variables and logic into SQL with Ruby's ERB (see the sketch below).
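
     night-shift does point 5 with Ruby's ERB; purely to show the idea in this talk's language, the same pattern with Python's built-in string.Template would look like this (the query and variable names are made up):

         from string import Template

         QUERY = Template("""
             SELECT user_id, count(*) AS events
             FROM tracking_events
             WHERE created_at >= '$start' AND created_at < '$end'
             GROUP BY user_id;
         """)

         # Render the SQL for one day's run before piping it to the database.
         print(QUERY.substitute(start='2015-10-24', end='2015-10-25'))
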
  17. Data infrastructure, annotated (architecture diagram): the same components as slide 15, each labelled with its implementation: Node.js, Clojure, Ruby, Scala and Golang for the services; bash + SQL + ERB for the loading and reporting steps; night-shift + trackingshell + Flask for orchestration.
  18. night-shift***
     The skeleton of our data flow. Almost no dependencies; written in Python, Ruby and bash.
     • A Makefile wrapper that gets triggered by cron.
     • Runs all make targets in a tracking shell [1], so timing information, output and errors can be logged (see the sketch below).
     • Has a timer script for cron-like target timing.
     • Has a script to inject conditionals, variables and Ruby logic into SQL.
     • Converts SQL results into CSV from MySQL, PostgreSQL and Redshift.
     • Has a Flask application to monitor your logs.
     [1] https://github.com/wunderlist/trackingshell
     *** https://github.com/wunderlist/night-shift
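
     The tracking-shell trick works because make lets you replace /bin/sh via the SHELL variable. This is not trackingshell's actual code, just a minimal sketch of the mechanism:

         import subprocess
         import sys
         import time

         # make invokes this as its SHELL, e.g.  make SHELL='python trackingshell_sketch.py'
         # so every recipe line arrives as:  trackingshell_sketch.py -c '<recipe line>'
         def main():
             command = sys.argv[-1]
             started = time.time()
             exit_code = subprocess.call(command, shell=True)
             elapsed = time.time() - started
             sys.stderr.write('[%6.2fs] exit=%d  %s\n' % (elapsed, exit_code, command))
             sys.exit(exit_code)

         if __name__ == '__main__':
             main()
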
  19. What's next?
     • A lot of Python 3 compatible releases in 2016.
     • More Apache Spark than before (it already supports Python).
     • Python as Redshift's user-defined function language.
     • mETL v2.0 will be released early next year with Python 3 support, sliced into micro-packages (riwo, daprot, uniopen, dm, etc.). It will provide an easy-to-use Python interface and better bash support, and will integrate with night-shift out of the box.
     • night-shift will support Azure and Apache Spark, with plans to work with Mako templates.