Python and Big Data Frameworks (PyData Berlin Meetup)

Talk on "Python and Big Data Frameworks" at the PyData Berlin Meetup (2015/01/29, http://www.meetup.com/PyData-Berlin/events/219668075/) giving a broad overview of approaches and technologies to build scalable data processing solutions using Python alone and in combination with Big Data frameworks like Hadoop, Spark, Impala, etc.

Frank Kaufer

January 29, 2015

Transcript

  1. BESPOKE DATA ENGINEERING

    Python and Big Data Frameworks Frank Kaufer @fkaufer 29.01.15 PyData Berlin Meetup
  2. Speaker: Frank Kaufer

    ▪  Computer Scientist ▪  AI, Data Engineering, Distributed Systems ▪  Freelancer since 2001 ▪  Former PhD student at Hasso Plattner Institute ▪  Co-founded bakdata in 2013 ▪  Pythonista since 2004
  3. bakdata - Bespoke Data Engineering

    ▪  Founded 2013 ▪  Born at Hasso Plattner Institute, Potsdam ▪  Bespoke Data Engineering ¤  Individual data solutions. Areas: ▶  Data Systems: Data Management Systems, Cluster Computing, DevOps, ... ▶  Data Processing: ETL, polyglot data-oriented algorithms, ... ▶  Data Analytics: Data Mining, ML, Statistics, DWH, Visualization, ... ¤  Open-source preference (but not limited to) ¤  No vendor dependency, no VC, 100% equity-financed ▪  Early-stage product development
  4. Why Python?

    Alternatives: ▪  Classical universal languages: Java/JVM, C/C++ ▪  Scientific Computing, Statistics: Matlab, R, SAS, SPSS, Stata, Mathematica ▪  Concurrent Computing: Fortran, Erlang, Clojure ▪  “Next Generation”: Scala, Julia
    Pro Python: ▪  Efficient coding, nice syntax ▪  Multi-paradigm ▪  Batteries included ▪  Community ▪  Operator overloading, DSLs ▪  One language for prototyping and production ▪  One language for systems, analytics, data engineering, networking, Web apps, ... ▪  Optimization option: C/C++ extensions, Cython, Numba, ... ▪  REPL, IPython ▪  “Glue language” ▪  PyData!
  5. PyData

    PyData “Stack”: ▪  NumPy, SciPy ▪  matplotlib, seaborn, Bokeh ▪  pandas ▪  scikit-learn (“sklearn”), statsmodels ▪  SQLAlchemy ▪  IPython, Jupyter
    Community, Market, Drivers: ▪  Enthought ▪  Continuum Analytics (Anaconda, Numba, blaze, Bokeh) ▪  Cloudera/DataPad ▪  DARPA XDATA ▪  databricks Cloud Platform ▪  Microsoft Azure ML ▪  PyData conference ▪  PyData Berlin ¤  http://pydata.berlin ¤  @pydataberlin
  6. “Big Data” - Ockham’s Razor

    ▪  3, 5, 7, ... Vs, MapReduce, NSA, ... ▪  Big Data ¤  Data volume is central aspect of the problem ¤  Scalability is central requirement to the solution ▪  Data volume ▪  Scalability
  7. Implementing scalable data processing

    ▪  Scaling physically ¤  Up: RAM, multiple CPUs, multiple cores, Hyperthreading ¤  Out: multiple machines, Cluster Computing ▪  Computing paradigms ¤  Parallel ¤  Concurrent ¤  Distributed ▪  “Embarrassingly parallel/distributed”
  8. Parallel/Concurrent Computing

    ▪  Parallel ≠ Concurrent ▪  Execution models ¤  Multiprocessing ¤  Multithreading ▶  OS kernel level ▶  User/Application level ¤  Asynchronous programming ¤  Cooperative vs Preemptive
  9. Python Myths

    Python only for prototyping, small data, slow, does not scale, no concurrency (GIL), ... Can we do that in Python?
  10. Myth “Python is slow”

    ▪  Language vs. runtime, compare “SQL is slow” ▪  TMP issue “Too Much (pure) Python” - Wes McKinney ▪  Many optimization options ¤  C/C++ core, Python wrapper ¤  NumPy & Intel MKL ¤  Cython ¤  Numba – optimization by annotation, LLVM-based JIT compiler, GPU ¤  Vectorized code ¤  High-Performance Python ▪  Many popular libs already optimized and pretty fast
  11. Python scalability, concurrency

    ▪  Scale up, concurrency ¤  RAM – efficiency? ¤  Multiprocessing, std lib multiprocessing ¤  Multithreading ▶  OS Kernel threads -> std lib threading, but GIL ▶  User level threads: many options ▪  Scale out ¤  See Google, Dropbox, Instagram, ... ¤  Networking, System automation: Python stronghold ¤  Lack of (integrated) open source frameworks: cluster management, scheduler, dispatcher, data ingestion, ... -> no obvious technical reason
  12. GIL (Global Interpreter Lock)

    ▪  CPython: only one thread per process executes Python bytecode at a time ▪  Reasons for GIL: legacy, performance, non-thread-safe C libs ▪  Affects only CPU-heavy kernel threads, but not: ¤  I/O threads ¤  Image processing ¤  NumPy et al: release the GIL and do C/C++ multithreading under the hood ¤  Cython “with gil” / “with nogil” statements ¤  User level threads -> Asynchronous programming
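The point that the GIL does not hurt I/O-bound threads can be checked with a small sketch. It assumes that `time.sleep` is an acceptable stand-in for a blocking I/O call (it releases the GIL while waiting); the names `fake_io_task` and `run_in_threads` are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(task_id, delay=0.2):
    """Stand-in for a blocking I/O call; sleeping releases the GIL."""
    time.sleep(delay)
    return task_id


def run_in_threads(n_tasks=4, delay=0.2):
    """Run n_tasks blocking calls on n_tasks threads; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        results = list(pool.map(lambda i: fake_io_task(i, delay), range(n_tasks)))
    return results, time.perf_counter() - start


results, elapsed = run_in_threads()
# Run serially, the four calls would need about 0.8 s; the threads overlap their waits.
print(results, round(elapsed, 2))
```

A CPU-bound function in place of `fake_io_task` would not show this speed-up under CPython, which is where multiprocessing or GIL-releasing extensions come in.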
  13. Asynchronous programming

    ▪  Non-preemptive (cooperative) user level threads ¤  less overhead, faster ¤  keep control, individual scheduling ¤  simpler code, use non-thread-safe libs ¤  Lib greenlet ▪  Event loops, callbacks ¤  libevent ¤  Twisted (chained deferreds) ▪  Coroutines, Generators (“semi-coroutines”) ¤  Enhanced Generators (PEP 342): yield, send, throw, close ¤  gevent (uses greenlet & libevent) ▪  Higher-Level ¤  Networking: Eventlet, Tornado ¤  Actor model (popularized by Scala’s Akka): Pykka, Pulsar ▪  Many new standard libs in Python 3: ¤  asyncio (Python 2 backport: trollius), already many aio... versions of libs, wrappers ¤  concurrent.futures
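Cooperative user-level threads can be sketched with plain generators and a round-robin loop. This is a toy illustration of the idea, not how greenlet, gevent or asyncio are implemented; `worker` and `run_round_robin` are hypothetical names.

```python
from collections import deque


def worker(name, steps, log):
    """A cooperative task: does one unit of work, then yields control voluntarily."""
    for i in range(steps):
        log.append((name, i))
        yield  # nothing preempts this task; it decides when to pause


def run_round_robin(tasks):
    """Minimal scheduler: resume each task in turn until all are finished."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task finished, do not reschedule it
        queue.append(task)


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
print(log)
```

Because a task only gives up control at a `yield`, code between two yields never interleaves with other tasks, which is why non-thread-safe libraries can be used without locks in this model.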
  14. Embarrassingly parallel/distributed and other Distributed Systems lies

    ▪  Free markets – regulation, competition authorities ▪  Open source projects – benevolent dictators ▪  ... ▪  Distributed Systems ¤  Resource management ¤  Synchronization, Coordination ¤  Scheduling ¤  Dispatching ¤  Monitoring ¤  ... ¤  Complex system management, complex programming
  15. Big Data stacks – TL;DR

    Sources: Cloudera, Hortonworks, MapR, Gigaom, jameskaskade.com
  16. Big Data - components/services

    ▪  DevOps ▪  Storage, Distributed File Systems, Data Stores ▪  DAG runtimes, Distributed Processing Operators ¤  Map, Reduce ¤  Beyond Map Reduce ▪  Resource Management, Meta-Data Management, Scheduling, Optimizer ▪  Workflow management/programming, ETL, Compilers ▪  Data/Job/Message Dispatching, Distributed Queues ▪  Stream Processing ▪  Connectors, File Formats, Loaders ▪  Query Engines ▪  Monitoring, Tracking ▪  Higher-Level Layers, Applications: Batch, Stream, Analytics/Statistics/ML, SQL, Data Warehousing, Graphs, Search
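The Map and Reduce operators named above can be sketched in a single process. The helper `map_reduce` and the sample readings are invented for illustration; a real framework distributes the three phases across machines.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Single-process sketch of the map -> shuffle -> reduce phases."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    # Reduce: fold the values of each key into one result.
    return {key: reduce(reducer, values) for key, values in groups.items()}


readings = [("berlin", 3), ("potsdam", 5), ("berlin", 7), ("potsdam", 1)]
max_per_city = map_reduce(readings, lambda r: [(r[0], r[1])], max)
print(max_per_city)
```

Since each key is reduced independently, the reduce phase is embarrassingly parallel once the shuffle has grouped the data.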
  17. Big Data – technology

    ▪  Technology: ¤  MapReduce (* 2004, Google), Hadoop (* 2005, Yahoo) ¤  Hive, HBase, Spark, Flink, Impala, ... ¤  RDBMS? ▶  PostgreSQL -> Greenplum, CitusDB (“Scalable PostgreSQL”) ▶  MPP databases ▪  Market, R&D ¤  Specialists: Cloudera, Hortonworks, MapR ¤  Google, Yahoo, Amazon, Microsoft, SAP, Oracle, IBM, Teradata, ... ¤  Research: ▶  UC Berkeley AMPLab (-> Apache Spark and more) ▶  Germany: Stratosphere project (-> Apache Flink) -> Berlin Big Data Center ▪  Strata, Hadoop World ▪  Programming languages ¤  JVM (Java/Scala) dominant ¤  Python?
  18. Python & Big Data

    ▪  Python/PyData community and “Big Data” ¤  Scientific vs. industry-driven communities ¤  “Medium Data” ¤  Less focus on systems, more algorithm-oriented + sophisticated tools ▪  Python projects for distributed data processing ¤  wrapper/abstraction projects ¤  single components ¤  Definitely a lack of easy-to-use integrated systems or aligned components ¤  Also some individual (proprietary) solutions ▶  One example: AdRoll Deliroll system - outperforming 3 node Amazon Redshift with 1 node using PostgreSQL, Multicorn, Numba (see talk by Ville Tuulos at SFO PUG)
  19. Python Big Data projects/components

    ▪  joblib (not distributed) ▪  clusterlib ▪  mpi4py (OpenMPI) ▪  RPyC ▪  Queues: Celery, bindings/clients for ZeroMQ, RabbitMQ (pika, py-amqplib), ... ▪  IPython.parallel (uses ZeroMQ) ▪  spartan - Distributed NumPy ▪  Disco (MapReduce, Erlang + Python) ▪  Luigi (ETL, Spotify) ▪  Blaze
  20. Disco

    ▪  Started in 2009 at Nokia ▪  MapReduce framework ▪  Backend in Erlang, Jobs in Python ▪  Satellite projects ¤  DiscoDB – key-value store ¤  Hustle – column-oriented event database, NoSQL Python DSL
  21. Luigi

    ▪  Developed at Spotify (started in 2008) ▪  Workflow engine, ETL tool - automated data pipelines ▪  Support for HDFS and Hadoop MR ▪  Complementary to Pig, Cascading, Scalding, Crunch ▪  Features ¤  dependency resolution ¤  workflow management ¤  visualization ¤  command line integration ¤  Planned: scheduling ▪  Active development, but mature, used at Spotify, Foursquare, ...
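What "dependency resolution" means for a pipeline can be shown without Luigi itself. The sketch below is not Luigi's API or implementation; the task names and the `execution_order` helper are hypothetical, and cycle detection is deliberately left out.

```python
def execution_order(target, requires):
    """Return tasks so that every task appears after all of its requirements."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        seen.add(task)
        for dependency in requires.get(task, []):
            visit(dependency)   # schedule upstream tasks first
        order.append(task)      # then the task itself

    visit(target)
    return order


# Hypothetical pipeline: each task lists the tasks it depends on.
requires = {
    "report": ["aggregate"],
    "aggregate": ["clean_logs", "load_users"],
    "clean_logs": ["fetch_logs"],
}
plan = execution_order("report", requires)
print(plan)
```

A workflow engine adds to this the check whether a task's output already exists, so that completed steps are skipped on a re-run.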
  22. Blaze

    ▪  Most promising PyData Big Data project ▪  2012/2013 started by Continuum ▪  Boost and new focus in 2014, lead developer Matthew Rocklin ▪  Data (type) abstraction ▪  Many backends – Pandas, Spark, Hive, MongoDB, SQLAlchemy, ... ▪  Symbolic expressions, deferred computation (see also SymPy) ▪  Spin-off libs ¤  Data abstraction: datashape ¤  Data ingestion/migration: into ¤  Scheduling abstraction: dask ¤  Dynamic multi-dimensional arrays (“new NumPy”): LibDyND ▪  See also Matthew’s blog: http://matthewrocklin.com/blog/
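The idea of symbolic expressions with deferred computation can be sketched in a few lines using operator overloading. This is a toy model and not Blaze's API; the classes `Expr`, `Symbol`, `Const` and `BinOp` are invented for illustration.

```python
import operator


class Expr:
    """Node of a tiny expression tree; nothing is computed until evaluate()."""

    def __add__(self, other):
        return BinOp(operator.add, self, wrap(other))

    def __mul__(self, other):
        return BinOp(operator.mul, self, wrap(other))


class Symbol(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]   # look the value up only at evaluation time


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def evaluate(self, env):
        return self.value


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, env):
        return self.op(self.left.evaluate(env), self.right.evaluate(env))


def wrap(value):
    """Turn plain Python values into constant nodes."""
    return value if isinstance(value, Expr) else Const(value)


x = Symbol("x")
expr = (x + 1) * 2                  # builds a tree, computes nothing yet
result = expr.evaluate({"x": 4})    # computation happens here
print(result)
```

Because the expression exists as data before it runs, a library in this style can translate the same tree to different backends instead of evaluating it in Python.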
  23. Interchange technologies – stores, serialization, formats

    ▪  File, Data storage systems ¤  “Plain files” (CSV, Excel, Stata DTA, ...) ¤  Optimized data container: HDF5, bcolz ¤  Databases: relational/structured, semi-/unstructured, key-value, ... ¤  Storage Services like S3 ▪  Serialization, formats ¤  XML ¤  JSON ¤  MessagePack ¤  Apache Avro ¤  Protocol buffers ¤  Apache Thrift ¤  Apache Parquet ▪  Compression ¤  Zlib, LZ4/LZ4HC, Snappy ¤  Blosc
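Serialization and compression usually go together when data is exchanged between systems. A minimal sketch with two of the listed technologies, JSON and zlib, using only the standard library; the sample records are made up.

```python
import json
import zlib

# Made-up records with repetitive structure, which compresses well.
records = [{"id": i, "city": "Berlin", "value": i * 0.5} for i in range(1000)]

raw = json.dumps(records).encode("utf-8")   # serialize to bytes
packed = zlib.compress(raw, 6)              # compress (level 6 of 0-9)

# The receiving side reverses both steps.
restored = json.loads(zlib.decompress(packed).decode("utf-8"))

print(len(raw), len(packed), restored == records)
```

Binary formats such as MessagePack or Avro follow the same round-trip pattern but need third-party packages.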
  24. Interchange technologies – communication, compilation

    ▪  Low-level IPC ¤  Shared Memory ¤  Memory-Mapped files ¤  Pipes ▪  Messaging ¤  Socket ¤  RPC/Web Services ¤  Message/Job/Event queues/brokers: ▶  ZeroMQ, RabbitMQ, Apache Kafka, Apache Qpid, ActiveMQ ▶  Overview: http://queues.io/ ▪  Bridges, gateways, compilers, intermediate representation (IR) ¤  Java: Py4J, JPy ¤  PyCall (Julia) ¤  LLVM, Numba ¤  Cython, C extensions ▪  Polyglot notebooks - IPython/Jupyter ¤  IJulia ¤  IPython-SQL ¤  IScala ¤  IRKernel, rmagic
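Of the low-level IPC options, memory-mapped files are easy to try from the standard library. The sketch below writes through one mapping and reads through a second one in the same process; with two processes, each would open its own mapping of the same file. The file is a throwaway temp file.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x00" * 16)              # mmap cannot map an empty file
    with mmap.mmap(fd, 16) as writer:       # read/write mapping of 16 bytes
        writer[0:5] = b"hello"              # slice assignment writes into the file
        writer.flush()
    # A second, read-only mapping sees the bytes written through the first.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as reader:
            message = reader[0:5]
finally:
    os.close(fd)
    os.remove(path)

print(message)
```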
  25. Hadoop cont’d

    ▪  HDFS ▪  Resource Manager (YARN) ▪  MapReduce ▪  Hadoop ecosystem ¤  Data store: HBase, Accumulo ¤  Structured DB: Cassandra ¤  SQL, DWH: Hive, Tajo ¤  Machine Learning: Mahout ¤  Management: Oozie, Zookeeper, Ambari, Azkaban ¤  Data Serialization, Loader: Sqoop, Avro, Flume ¤  Stream: Storm ¤  Programming, ETL: Apache Pig, Cascading
  26. Hadoop & Python

    ▪  Generic: Hadoop Streaming ▪  HDFS-Client: snakebite (Spotify) ▪  MapReduce: mrjob (Yelp) ▪  pywebhdfs ▪  yarn-api-client ▪  dumbo, hadoopy, pydoop, ...
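Hadoop Streaming runs any executable as mapper or reducer and exchanges tab-separated key/value lines over stdin and stdout, with the reducer receiving its input sorted by key. The sketch below simulates that protocol locally with a word count; in a real job the two functions would live in separate scripts that read `sys.stdin` and print their output, and the local `sorted` call merely stands in for Hadoop's sort phase.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


text = io.StringIO("Python and Spark\npython and Hadoop\n")
shuffled = sorted(mapper(text))     # local stand-in for the sort/shuffle phase
counts = list(reducer(shuffled))
print(counts)
```

Libraries such as mrjob wrap this line protocol so that mapper and reducer can be written as methods of one Python class.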
  27. Spark

    ▪  AMPLab (2009/2010), Databricks, Apache project in 2013 ▪  Resilient Distributed Datasets ¤  distributed collections ¤  memory with options to persist/spill-over to disk ▪  DAG engine. More Operators than Map & Reduce: ¤  Transformations: map, join, cogroup, groupByKey, reduceByKey, filter, union, intersection, ... ¤  Actions: reduce, foreach, take, ... ▪  Hadoop “compatible” ¤  Data: HDFS, HBase, Cassandra, ... ¤  Cluster management: YARN, but also Mesos ▪  Higher-level tools ¤  Spark SQL (successor to Shark) ¤  Spark Streaming ¤  MLlib ¤  GraphX
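The split between transformations (lazy, they only extend the execution plan) and actions (they trigger computation) can be imitated in pure Python. This is a single-machine toy and not the PySpark API; `LazyCollection` and `square` are hypothetical names.

```python
import functools
import itertools
import operator


class LazyCollection:
    """Toy stand-in for an RDD: transformations are lazy, actions compute."""

    def __init__(self, source):
        self._source = source   # zero-argument callable returning a fresh iterator

    # Transformations: return a new collection, run nothing.
    def map(self, fn):
        return LazyCollection(lambda: (fn(x) for x in self._source()))

    def filter(self, predicate):
        return LazyCollection(lambda: (x for x in self._source() if predicate(x)))

    # Actions: pull data through the whole chain.
    def reduce(self, fn):
        return functools.reduce(fn, self._source())

    def take(self, n):
        return list(itertools.islice(self._source(), n))


calls = []


def square(x):
    calls.append(x)             # record when the function really runs
    return x * x


numbers = LazyCollection(lambda: iter(range(1, 6)))
pipeline = numbers.map(square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)        # still 0: only a plan exists so far
first_two = pipeline.take(2)            # runs just enough of the plan
total = pipeline.reduce(operator.add)   # recomputes the plan from the source
print(calls_before_action, first_two, total)
```

The second action recomputes everything from the source, which mirrors why Spark offers explicit persistence for datasets that are reused.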
  28. Spark & Python

    ▪  PySpark ▪  API ¤  Client API – Wrapper using Py4J ¤  On Cluster: JVM executors communicate with Python workers via Pipes ▪  Python support ¤  MLlib (since Spark 0.9) ¤  Streaming (since Spark 1.2) ¤  GraphX (not yet) ▪  Spark SQL External Script Query Source: Apache
  29. How long do we still call it Hadoop?

    YARN HDFS MapReduce Spark Hive Mahout Mesos GlusterFS Spark Tachyon Hive Mahout Flink Storm
  30. Impala

    ▪  Cloudera, Open Source ▪  Scalable interactive SQL ▪  Distributed query processing engine ▪  Apache Parquet ▪  Related ¤  SQL on Hadoop – “on HDFS” vs. “on MapReduce” ¤  Apache Hive – MapReduce ¤  Hive on Spark ¤  Spark SQL ¤  Others: Apache Tajo, Pivotal HAWQ, Facebook Presto, Amazon Redshift Source: Cloudera
  31. Impala & Python

    ▪  impyla ▪  DB-API incl. support for HiveServer2, Beeswax, Kerberos ▪  Results as Pandas DataFrame ▪  Under development, experimental: ¤  Fast Python UDFs using Numba/LLVM ¤  BigDataFrame – Pandas+Spark RDD in Impala ¤  Integration with Blaze, SQLAlchemy ¤  sklearn-style wrapper for MADlib
  32. GraphLab

    ▪  Started as GraphLab project at Carnegie Mellon ▪  Scalable, parallel ML algorithms exploiting structural sparseness ▪  Open-source, commercial version/support by Dato ▪  GraphLab Create API: Python lib with C++ engine ¤  SFrame structure ¤  Can be created from Pandas DataFrame, Apache Avro, PySpark RDD, ...
  33. More Python & Big Data Systems

    ▪  streamparse: Stream data, Apache Storm integration ▪  pysolr: Apache Solr wrapper ▪  PyHive: Python interface to Hive and Presto (by Dropbox) ▪  Apache Aurora: Mesos mgmt framework with Python DSL ▪  Python YARN client ▪  Kazoo: Apache Zookeeper API ▪  HappyBase: Apache HBase lib ▪  Apache Flink: new Python API upcoming PR#202 ▪  h2o-dev – “Dev-Friendly Rewrite of H2O with Spark API” ▪  kafka-python: Apache Kafka client ▪  libgfapi-python: GlusterFS API ▪  Python-RQ: Python Redis Queue ▪  PyCascading (but: outdated, Jython) ▪  multicorn: PostgreSQL FDW
  34. Virtualization

    ▪  System Virtualization ¤  Technologies: vSphere, Xen, VirtualBox, LXC (Linux container) ¤  As a service: Amazon (AWS) & Co: EC2, EMR, Google Cloud Platform, MS Azure, IBM SoftLayer ▪  Virtualization and Big Data – related, but ambivalent ¤  Related: clustering, scalability ¤  But data clusters in production typically on physical machines, shared nothing, JBODs ¤  Container-level virtualization also in production ¤  Small companies, developers: virtual environments good to start and test ▪  Tools to create isolated, repeatable environments ¤  Docker ▶  LXC automation, provisioning ▶  Ferry - Hadoop, Cassandra, Spark, GlusterFS, and Open MPI on Docker ¤  Vagrant ▶  Originally only VirtualBox automation, now support for many Hypervisors and also Docker/LXC ▶  Veewee – custom Vagrant boxes
  35. Python Virtualization tools

    ▪  boto (AWS, "Cloud") ▪  pyvsphere (vSphere/ESX, "Private Cloud") ▪  libvirt (Python bindings) ▪  virtualenv: not system virtualization, virtual (Python) environments ▪  libcloud: Meta lib for more than 30 virtualization providers ▪  OpenStack ¤  Free, open-source infrastructure for virtualization and distributed computing ¤  Data processing: Sahara subproject
  36. DevOps

    ▪  Configuration Management, Deployment, Provisioning: ¤  Chef, Puppet ¤  Python: Ansible, Salt ▪  fabric – slim application deployment, system automation via SSH ▪  Cloudera Manager Python API ▪  Cloudera Hue ¤  Web Management UI for CDH ¤  Written in Python, Django-based, extensible in Python, Python SDK ▪  Supervisor: Process control system ▪  CI: python-jenkins, jenkinsAPI, TravisPy, buildbot