Python and Big Data Frameworks (PyData Berlin Meetup)

Bespoke Data Engineering
Python and Big Data Frameworks
Frank Kaufer, @fkaufer
29.01.15, PyData Berlin Meetup

Speaker: Frank Kaufer
▪  Computer Scientist
▪  AI, Data Engineering, Distributed Systems
▪  Freelancer since 2001
▪  Former PhD student at Hasso Plattner Institute
▪  Co-founded bakdata in 2013
▪  Pythonista since 2004

bakdata - Bespoke Data Engineering
▪  Founded 2013
▪  Born at Hasso Plattner Institute, Potsdam
▪  Bespoke Data Engineering
  ¤  Individual data solutions. Areas:
    ▶  Data Systems: Data Management Systems, Cluster Computing, DevOps, ...
    ▶  Data Processing: ETL, polyglot data-oriented algorithms, ...
    ▶  Data Analytics: Data Mining, ML, Statistics, DWH, Visualization, ...
  ¤  Open-source preference (but not limited to it)
  ¤  No vendor dependency, no VC, 100% equity-financed
▪  Early-stage product development

Who are you?
▪  Background knowledge
▪  Experience
▪  Interests
▪  Expectations
Bespoke Talk Engineering

Python & Big Data systems – Why?

Python, PyData

Why Python?
Alternatives
▪  Classical universal languages: Java/JVM, C/C++
▪  Scientific Computing, Statistics: Matlab, R, SAS, SPSS, Stata, Mathematica
▪  Concurrent Computing: Fortran, Erlang, Clojure
▪  “Next Generation”: Scala, Julia
Pro Python
▪  Efficient coding, nice syntax
▪  Multi-paradigm
▪  Batteries included
▪  Community
▪  Operator overloading, DSLs
▪  One language for prototyping and production
▪  One language for systems, analytics, data engineering, networking, Web apps, ...
▪  Optimization option: C/C++ extensions, Cython, Numba, ...
▪  REPL, IPython
▪  “Glue language”
▪  PyData!

PyData
PyData “Stack”
▪  NumPy, SciPy
▪  matplotlib, seaborn, Bokeh
▪  pandas
▪  scikit-learn (“sklearn”), statsmodels
▪  SQLAlchemy
▪  IPython, Jupyter
Community, Market, Drivers
▪  Enthought
▪  Continuum Analytics (Anaconda, Numba, blaze, Bokeh)
▪  Cloudera/DataPad
▪  DARPA XDATA
▪  databricks Cloud Platform
▪  Microsoft Azure ML
▪  PyData conference
▪  PyData Berlin
  ¤  http://pydata.berlin
  ¤  @pydataberlin

Big Data

“Big Data” - Ockham’s Razor
▪  3, 5, 7, ... Vs, MapReduce, NSA, ...
▪  Big Data
  ¤  Data volume is the central aspect of the problem
  ¤  Scalability is the central requirement for the solution
▪  Data volume
▪  Scalability

Scalability
▪  Scale up
▪  Scale out

Implementing scalable data processing
Parallel, Concurrent, Distributed

Implementing scalable data processing
▪  Scaling physically
  ¤  Up: RAM, multiple CPUs, multiple cores, Hyperthreading
  ¤  Out: multiple machines, Cluster Computing
▪  Computing paradigms
  ¤  Parallel
  ¤  Concurrent
  ¤  Distributed
▪  “Embarrassingly parallel/distributed”

Parallel/Concurrent Computing
▪  Parallel ≠ Concurrent
▪  Execution models
  ¤  Multiprocessing
  ¤  Multithreading
    ▶  OS kernel level
    ▶  User/Application level
  ¤  Asynchronous programming
  ¤  Cooperative vs Preemptive

Can we do that in Python?
Python Myths: Python only for prototyping, small data, slow, does not scale, no concurrency (GIL), ...

Myth “Python is slow”
▪  Language vs. runtime, compare “SQL is slow”
▪  TMP issue: “Too Much (pure) Python” (Wes McKinney)
▪  Many optimization options
  ¤  C/C++ core, Python wrapper
  ¤  NumPy & Intel MKL
  ¤  Cython
  ¤  Numba: optimization by annotation, LLVM-based JIT compiler, GPU support
  ¤  Vectorized code
  ¤  High-Performance Python
▪  Many popular libs are already optimized and pretty fast
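The “Too Much (pure) Python” point can be illustrated with a minimal standard-library sketch. It only assumes that CPython's built-in sum() runs its loop in C; the function names and the problem size are illustrative and not from the talk. Vectorized NumPy code follows the same principle of moving the inner loop out of the interpreter.

```python
import timeit


def total_pure_python(n):
    """Sum 0..n-1 with an explicit interpreter-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def total_builtin(n):
    """Same result, but the loop runs inside the C-implemented sum()."""
    return sum(range(n))


if __name__ == "__main__":
    n = 1_000_000  # illustrative size, not a benchmark from the talk
    for fn in (total_pure_python, total_builtin):
        seconds = timeit.timeit(lambda: fn(n), number=5)
        print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

Both functions return the same value; only the place where the loop executes differs, which is what the timing comparison makes visible.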

Python scalability, concurrency
▪  Scale up, concurrency
  ¤  RAM – efficiency?
  ¤  Multiprocessing, std lib multiprocessing
  ¤  Multithreading
    ▶  OS Kernel threads -> std lib threading, but GIL
    ▶  User level threads: many options
▪  Scale out
  ¤  See Google, Dropbox, Instagram, ...
  ¤  Networking, System automation: Python stronghold
  ¤  Lack of (integrated) open source frameworks: cluster management, scheduler, dispatcher, data ingestion, ... -> no obvious technical reason

GIL (Global Interpreter Lock)
▪  CPython: only one thread executes Python bytecode at a time, regardless of the number of cores
▪  Reasons for GIL: legacy, performance, non-thread-safe C libs
▪  Affects only CPU-heavy kernel threads, but not:
  ¤  I/O threads
  ¤  Image processing
  ¤  NumPy et al.: release the GIL and do C/C++ multithreading under the hood
  ¤  Cython “with gil” / “with nogil” statements
  ¤  User level threads -> Asynchronous programming
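A minimal sketch of why the GIL does not hurt I/O-bound threads. As an assumption made to keep the example self-contained, time.sleep stands in for a blocking I/O call; like real blocking I/O it releases the GIL while waiting. The helper names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    """Stand-in for a blocking read: sleeping releases the GIL."""
    time.sleep(delay)
    return delay


def run_threaded(delays):
    """Run all waits in parallel threads; return results and elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        results = list(pool.map(fake_io, delays))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_threaded([0.3, 0.3, 0.3, 0.3])
    # Four waits of 0.3s overlap, so the total is far below the sequential 1.2s.
    print(results, f"{elapsed:.2f}s")
```

With CPU-heavy pure-Python work in place of the sleep, the same threads would take turns holding the GIL and the speed-up would disappear.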

Asynchronous programming
▪  Non-preemptive (cooperative) user level threads
  ¤  less overhead, faster
  ¤  keep control, individual scheduling
  ¤  simpler code, use non-thread-safe libs
  ¤  Lib greenlet
▪  Event loops, callbacks
  ¤  libevent
  ¤  Twisted (chained deferreds)
▪  Coroutines, Generators (“semi-coroutines”)
  ¤  Enhanced Generators (PEP 342): yield, send, throw, close
  ¤  gevent (uses greenlet & libev, formerly libevent)
▪  Higher-Level
  ¤  Networking: Eventlet, Tornado
  ¤  Actor model (popularized by Scala’s Akka): Pykka, Pulsar
▪  Many new standard libs in Python 3:
  ¤  asyncio (Python 2 backport: trollius), already many aio... versions of libs, wrappers
  ¤  concurrent.futures
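Cooperative user-level threads can be sketched with plain generators: each task runs until it yields, and a tiny round-robin scheduler decides who goes next. This is a toy illustration under stated assumptions; the names worker and run_round_robin are hypothetical, and real libraries such as greenlet, gevent or asyncio add I/O integration on top of the same idea.

```python
from collections import deque


def worker(name, steps, log):
    """A cooperative task: do one step of work, then give up control."""
    for step in range(steps):
        log.append((name, step))
        yield  # voluntary switch point; nothing can preempt us elsewhere


def run_round_robin(tasks):
    """Resume each generator in turn until all of them are exhausted."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, do not reschedule it
        queue.append(task)


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
print(log)
# → [('a', 0), ('b', 0), ('a', 1), ('b', 1), ('b', 2)]
```

Because switches happen only at yield, the shared list needs no lock, which is the “simpler code” argument from the slide.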

Scale out
Out of core, distributed concurrent data processing

Scale out

Embarrassingly parallel/distributed and other Distributed Systems lies
▪  Free markets – regulation, competition authorities
▪  Open source projects – benevolent dictators
▪  ...
▪  Distributed Systems
  ¤  Resource management
  ¤  Synchronization, Coordination
  ¤  Scheduling
  ¤  Dispatching
  ¤  Monitoring
  ¤  ...
  ¤  Complex system management, complex programming

Big Data stacks – TL;DR
Sources: Cloudera, Hortonworks, MapR, Gigaom, jameskaskade.com

Big Data - components/services
▪  DevOps
▪  Storage, Distributed File Systems, Data Stores
▪  DAG runtimes, Distributed Processing Operators
  ¤  Map, Reduce
  ¤  Beyond Map Reduce
▪  Resource Management, Meta-Data Management, Scheduling, Optimizer
▪  Workflow management/programming, ETL, Compilers
▪  Data/Job/Message Dispatching, Distributed Queues
▪  Stream Processing
▪  Connectors, File Formats, Loaders
▪  Query Engines
▪  Monitoring, Tracking
▪  Higher-Level Layers, Applications: Batch, Stream, Analytics/Statistics/ML, SQL, Data Warehousing, Graphs, Search

Big Data – technology
▪  Technology:
  ¤  MapReduce (* 2004, Google), Hadoop (* 2005, Yahoo)
  ¤  Hive, HBase, Spark, Flink, Impala, ...
  ¤  RDBMS?
    ▶  PostgreSQL -> Greenplum, CitusDB (“Scalable PostgreSQL”)
    ▶  MPP databases
▪  Market, R&D
  ¤  Specialists: Cloudera, Hortonworks, MapR
  ¤  Google, Yahoo, Amazon, Microsoft, SAP, Oracle, IBM, Teradata, ...
  ¤  Research:
    ▶  UC Berkeley AMPLab (-> Apache Spark and more)
    ▶  Germany: Stratosphere project (-> Apache Flink) -> Berlin Big Data Center
▪  Strata, Hadoop World
▪  Programming languages
  ¤  JVM (Java/Scala) dominant
  ¤  Python?

Python & Big Data

Python & Big Data
▪  Python/PyData community and “Big Data”
  ¤  Scientific vs. industry-driven communities
  ¤  “Medium Data”
  ¤  Less focus on systems, more algorithm-oriented + sophisticated tools
▪  Python projects for distributed data processing
  ¤  wrapper/abstraction projects
  ¤  single components
  ¤  Definitely a lack of easy-to-use integrated systems or aligned components
  ¤  Also some individual (proprietary) solutions
    ▶  One example: AdRoll Deliroll system - outperforming a 3-node Amazon Redshift cluster with 1 node using PostgreSQL, Multicorn, Numba (see talk by Ville Tuulos at SFO PUG)

Python Big Data projects/components
▪  joblib (not distributed)
▪  clusterlib
▪  mpi4py (MPI bindings, e.g. for Open MPI)
▪  RPyC
▪  Queues: Celery, bindings/clients for ZeroMQ, RabbitMQ (pika, py-amqplib), ...
▪  IPython.parallel (uses ZeroMQ)
▪  spartan - Distributed NumPy
▪  Disco (MapReduce, Erlang + Python)
▪  Luigi (ETL, Spotify)
▪  Blaze

Disco
▪  Started in 2009 at Nokia
▪  MapReduce framework
▪  Backend in Erlang, Jobs in Python
▪  Satellite projects
  ¤  DiscoDB – key-value store
  ¤  Hustle – column-oriented event database, NoSQL Python DSL

Luigi
▪  Developed at Spotify (started in 2008)
▪  Workflow engine, ETL tool - automated data pipelines
▪  Support for HDFS and Hadoop MR
▪  Complementary to Pig, Cascading, Scalding, Crunch
▪  Features
  ¤  dependency resolution
  ¤  workflow management
  ¤  visualization
  ¤  command line integration
  ¤  Planned: scheduling
▪  Active development, but mature, used at Spotify, Foursquare, ...

Blaze
▪  Most promising PyData Big Data project
▪  Started 2012/2013 by Continuum
▪  Boost and new focus in 2014, lead developer Matthew Rocklin
▪  Data (type) abstraction
▪  Many backends – Pandas, Spark, Hive, MongoDB, SQLAlchemy, ...
▪  Symbolic expressions, deferred computation (see also SymPy)
▪  Spin-off libs
  ¤  Data abstraction: datashape
  ¤  Data ingestion/migration: into
  ¤  Scheduling abstraction: dask
  ¤  Dynamic multi-dimensional arrays (“new NumPy”): LibDyND
▪  See also Matthew’s blog: http://matthewrocklin.com/blog/

Interchange technologies
Interfacing mixed-language data systems

Interchange technologies – stores, serialization, formats
▪  File, Data storage systems
  ¤  “Plain files” (CSV, Excel, Stata DTA, ...)
  ¤  Optimized data container: HDF5, bcolz
  ¤  Databases: relational/structured, semi-/unstructured, key-value, ...
  ¤  Storage Services like S3
▪  Serialization, formats
  ¤  XML
  ¤  JSON
  ¤  MessagePack
  ¤  Apache Avro
  ¤  Protocol Buffers
  ¤  Apache Thrift
  ¤  Apache Parquet
▪  Compression
  ¤  Zlib, LZ4/LZ4HC, Snappy
  ¤  Blosc
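Serialization and compression are usually combined when data crosses a language or machine boundary. A minimal sketch using only the standard library: JSON as the format and zlib as the codec. The function names and sample records are illustrative assumptions; Avro, Parquet, Snappy or Blosc would take the same place in the pipeline but need third-party packages.

```python
import json
import zlib


def pack(records):
    """Serialize to compact JSON, then compress the UTF-8 bytes."""
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


def unpack(blob):
    """Reverse of pack(): decompress, decode, parse."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical sample data, deliberately repetitive so it compresses well.
records = [{"id": i, "city": "Berlin"} for i in range(1000)]
blob = pack(records)
print(len(json.dumps(records)), "bytes as JSON text,", len(blob), "bytes packed")
assert unpack(blob) == records
```

Any consumer that can inflate zlib data and parse JSON, whatever its language, can read the result.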

Interchange technologies – communication, compilation
▪  Low-level IPC
  ¤  Shared Memory
  ¤  Memory-Mapped files
  ¤  Pipes
▪  Messaging
  ¤  Sockets
  ¤  RPC/Web Services
  ¤  Message/Job/Event queues/broker:
    ▶  ZeroMQ, RabbitMQ, Apache Kafka, Apache Qpid, ActiveMQ
    ▶  Overview: http://queues.io/
▪  Bridges, gateways, compilers, intermediate representation (IR)
  ¤  Java: Py4J, JPy
  ¤  PyCall (Julia)
  ¤  LLVM, Numba
  ¤  Cython, C extensions
▪  Polyglot notebooks - IPython/Jupyter
  ¤  IJulia
  ¤  IPython-SQL
  ¤  IScala
  ¤  IRkernel, rmagic
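As a small illustration of the low-level end of this list, the sketch below edits a file through a memory map from the standard library. It runs in a single process for brevity, which is a simplification: the point of the technique is that a second process, in any language, mapping the same file would see the same bytes. The function name is hypothetical.

```python
import mmap
import tempfile


def patch_via_mmap(payload: bytes, patch: bytes) -> bytes:
    """Write payload to a temp file and overwrite its start through a memory map.

    Assumes payload is non-empty and patch is no longer than payload.
    """
    with tempfile.TemporaryFile() as handle:
        handle.write(payload)
        handle.flush()
        # Length 0 maps the whole file into memory.
        with mmap.mmap(handle.fileno(), 0) as view:
            view[: len(patch)] = patch  # in-place edit, no read()/write() calls
            view.flush()
        handle.seek(0)
        return handle.read()


print(patch_via_mmap(b"hello world", b"HELLO"))
# → b'HELLO world'
```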

Python & Big Data Systems
Hadoop, Spark, Impala, GraphLab

Hadoop

Hadoop
[Figure: Hadoop 1 (MapReduce on top of HDFS) vs. Hadoop 2 (MapReduce and other data processing on top of YARN and HDFS)]

Hadoop cont’d
▪  HDFS
▪  Resource Manager (YARN)
▪  MapReduce
▪  Hadoop ecosystem
  ¤  Data store: HBase, Accumulo
  ¤  Structured DB: Cassandra
  ¤  SQL, DWH: Hive, Tajo
  ¤  Machine Learning: Mahout
  ¤  Management: Oozie, Zookeeper, Ambari, Azkaban
  ¤  Data Serialization, Loader: Sqoop, Avro, Flume
  ¤  Stream: Storm
  ¤  Programming, ETL: Apache Pig, Cascading

Hadoop & Python
▪  Generic: Hadoop Streaming
▪  HDFS-Client: snakebite (Spotify)
▪  MapReduce: mrjob (Yelp)
▪  pywebhdfs
▪  yarn-api-client
▪  dumbo, hadoopy, pydoop, ...
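Hadoop Streaming runs arbitrary executables as mapper and reducer and exchanges records as text lines with a tab-separated key and value; the reducer receives its input sorted by key. A minimal word-count sketch in that style follows. The functions work on iterables of lines so they can be tried locally, and run_local is a hypothetical helper that emulates the sort between the two phases; on a cluster each function would be wired to sys.stdin and sys.stdout in its own script.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word, as a streaming mapper would print it."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Emulate map -> sort -> reduce in one process (no Hadoop involved)."""
    return list(reducer(sorted(mapper(lines))))


print(run_local(["Big data big", "data"]))
# → ['big\t2', 'data\t2']
```

Libraries such as mrjob wrap this same contract in a class-based API and handle job submission.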

Spark
▪  AMPLab (2009/2010), Databricks, Apache project in 2013
▪  Resilient Distributed Datasets
  ¤  distributed collections
  ¤  in memory, with options to persist/spill over to disk
▪  DAG engine. More operators than Map & Reduce:
  ¤  Transformations: map, join, cogroup, groupByKey, reduceByKey, filter, union, intersection, ...
  ¤  Actions: reduce, foreach, take, ...
▪  Hadoop “compatible”
  ¤  Data: HDFS, HBase, Cassandra, ...
  ¤  Cluster management: YARN, but also Mesos
▪  Higher-level tools
  ¤  Spark SQL (successor of Shark)
  ¤  Spark Streaming
  ¤  MLlib
  ¤  GraphX
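The split between lazy transformations and result-producing actions can be sketched without a cluster. The class below is a toy stand-in written for this illustration; LazyCollection is a hypothetical name and is not the Spark or PySpark API. It only mimics the behavior that map and filter record work, while reduce and take trigger it.

```python
import itertools
import operator
from functools import reduce as fold


class LazyCollection:
    """Toy, single-machine imitation of a lazily evaluated dataset (NOT Spark)."""

    def __init__(self, source, ops=()):
        self._source = source  # must be re-iterable, e.g. a list or range
        self._ops = tuple(ops)

    # Transformations: only extend the recorded plan, compute nothing.
    def map(self, fn):
        return LazyCollection(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyCollection(self._source, self._ops + (("filter", predicate),))

    def _iterate(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: walk the plan and produce a result.
    def reduce(self, fn):
        return fold(fn, self._iterate())

    def take(self, n):
        return list(itertools.islice(self._iterate(), n))


calls = []


def square(x):
    calls.append(x)  # record when work really happens
    return x * x


pipeline = LazyCollection(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
print(calls)                         # → []  (nothing computed yet)
print(pipeline.take(2))              # → [1, 9]
print(pipeline.reduce(operator.add)) # → 35
```

A real engine additionally partitions the data, ships the plan to workers and can cache intermediate results; the toy recomputes the whole plan for every action.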

Spark & Python
▪  PySpark
▪  API
  ¤  Client API – Wrapper using Py4J
  ¤  On cluster: JVM executors communicate with Python workers via pipes
▪  Python support
  ¤  MLlib (since Spark 0.9)
  ¤  Streaming (since Spark 1.2)
  ¤  GraphX (not yet)
▪  Spark SQL External Script Query
Source: Apache

Databricks Cloud Platform
Source: Databricks

How long will we still call it Hadoop?
[Figure: stack diagrams with the components YARN, HDFS, MapReduce, Spark, Hive, Mahout and Mesos, GlusterFS, Spark, Tachyon, Hive, Mahout, Flink, Storm]

Impala

Impala
▪  Cloudera, Open Source
▪  Scalable interactive SQL
▪  Distributed query processing engine
▪  Apache Parquet
▪  Related
  ¤  SQL on Hadoop – “on HDFS” vs. “on MapReduce”
  ¤  Apache Hive – MapReduce
  ¤  Hive on Spark
  ¤  Spark SQL
  ¤  Others: Apache Tajo, Pivotal HAWQ, Facebook Presto, Amazon Redshift
Source: Cloudera

Impala & Python
▪  impyla
▪  DB-API incl. support for HiveServer2, Beeswax, Kerberos
▪  Results as Pandas DataFrame
▪  Under development, experimental:
  ¤  Fast Python UDFs using Numba/LLVM
  ¤  BigDataFrame – Pandas+Spark RDD in Impala
  ¤  Integration with Blaze, SQLAlchemy
  ¤  sklearn-style wrapper for MADlib

GraphLab
▪  Started as GraphLab project at Carnegie Mellon
▪  Scalable, parallel ML algorithms exploiting structural sparseness
▪  Open-source, commercial version/support by Dato
▪  GraphLab Create API: Python lib with C++ engine
  ¤  SFrame structure
  ¤  Can be created from Pandas DataFrame, Apache Avro, PySpark RDD, ...

More Python & Big Data Systems
▪  streamparse: Stream data, Apache Storm integration
▪  pysolr: Apache Solr wrapper
▪  PyHive: Python interface to Hive and Presto (by Dropbox)
▪  Apache Aurora: Mesos mgmt framework with Python DSL
▪  Python YARN client
▪  Kazoo: Apache Zookeeper API
▪  HappyBase: Apache HBase lib
▪  Apache Flink: new Python API upcoming PR#202
▪  h2o-dev – “Dev-Friendly Rewrite of H2O with Spark API”
▪  kafka-python: Apache Kafka client
▪  libgfapi-python: GlusterFS API
▪  Python-RQ: Python Redis Queue
▪  PyCascading (but: outdated, Jython)
▪  multicorn: PostgreSQL FDW

Virtualization & DevOps

Virtualization
▪  System Virtualization
  ¤  Technologies: vSphere, Xen, VirtualBox, LXC (Linux container)
  ¤  As a service: Amazon (AWS) & Co: EC2, EMR, Google Cloud Platform, MS Azure, IBM SoftLayer
▪  Virtualization and Big Data – related, but ambivalent
  ¤  Related: clustering, scalability
  ¤  But data clusters in production typically run on physical machines, shared nothing, JBODs
  ¤  Container-level virtualization also in production
  ¤  Small companies, developers: virtual environments good to start and test
▪  Tools to create isolated, repeatable environments
  ¤  Docker
    ▶  LXC automation, provisioning
    ▶  Ferry - Hadoop, Cassandra, Spark, GlusterFS, and Open MPI on Docker
  ¤  Vagrant
    ▶  Originally only VirtualBox automation, now support for many hypervisors and also Docker/LXC
    ▶  Veewee – custom Vagrant boxes

Python Virtualization tools
▪  boto (AWS, "Cloud")
▪  pyvsphere (VSphere/ESX, "Private Cloud")
▪  libvirt (Python bindings)
▪  virtualenv: not system virtualization, virtual (Python) environments
▪  libcloud: Meta lib for more than 30 virtualization providers
▪  OpenStack
  ¤  Free, open-source infrastructure for virtualization and distributed computing
  ¤  Data processing: Sahara subproject

DevOps
▪  Configuration Management, Deployment, Provisioning:
  ¤  Chef, Puppet
  ¤  Python: Ansible, Salt
▪  fabric – slim application deployment, system automation via SSH
▪  Cloudera Manager Python API
▪  Cloudera Hue
  ¤  Web Management UI for CDH
  ¤  Written in Python, Django-based, extensible in Python, Python SDK
▪  Supervisor: Process control system
▪  CI: python-jenkins, jenkinsAPI, TravisPy, buildbot

Many thanks!
http://www.bakdata.com/
@fkaufer, @bakdata
http://pydata.berlin
