Polyglot Data Science Platforms on Big Data Systems


Extended slide deck from the talk at the PyData Berlin 2016 conference.

Keywords: Polyglot Data Science, Data Science Platforms, Python, NumPy, PyData, Scala, Big Data, Scale Out, Spark, Infrastructure, Automation, DevOps, Boxes, Packaging, Ansible, Packer, Virtualization, Container, Docker, Kubernetes, Jupyter, Pipelines, Recommender, ALS, BLAS.

http://pydata.org/berlin2016/

Frank Kaufer

May 24, 2016
Transcript

  1. BESPOKE DATA ENGINEERING

    Polyglot Data Science Platforms on Big Data Systems | Frank Kaufer @fkaufer | 21.05.16 PyData Berlin 2016 | extended slide deck
  2. 21.05.16 PyData Berlin 2016 About me, bakdata ▪  AI, Data

    Engineering, Distributed Systems ▪  Pythonista since 2004 ▪  Co-founder bakdata ▪  bakdata ¤  One-stop-shop for bespoke data engineering solutions ▶  Systems ▶  Data Management ▶  Information Integration ▶  Analytics and Reporting ¤  Recent projects/clients: Otto, GfK, Elsevier, E.CA Economics, ... ¤  Polyglot software engineering: Python, Java, Scala, SQL, C/C++, R, ...
  3. 21.05.16 PyData Berlin 2016 This talk ▪  ... is for

    ¤  Data Scientists working with Python on “Big Data” ¤  (Big) Data (Systems) Engineers building platforms and tools for Data Scientists ▪  Polyglot Data Science Platforms on Big Data Systems ▪  Building blocks ¤  Boxes ¤  Pipelines ¤  Interfaces/Abstractions (high-level convenience, low-level optimization)
  4. Big Data? 21.05.16 PyData Berlin 2016

    [Images] "Very Large Data Bases": since 1975 (Source: vldb.org) | "Huuuuge Data": starting Nov 2016 (Source: CBS)
  5. Big Data – Scale Down 21.05.16 PyData Berlin 2016 ▪ 

    Sampling, Filtering ▪  Compression, C-Stores ¤  Pandas Categorical ¤  parquet-python ¤  Blosc, bcolz, castra
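    To make the Pandas Categorical point concrete, a minimal hedged sketch (the data is made up):

        import pandas as pd

        # Low-cardinality strings stored as object dtype are memory-hungry;
        # as a categorical they become small integer codes plus one lookup table
        s = pd.Series(['red', 'green', 'blue'] * 1000000)
        s_cat = s.astype('category')

        print(s.memory_usage(deep=True))      # object dtype: one Python object per value
        print(s_cat.memory_usage(deep=True))  # typically an order of magnitude smaller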
  6. Big Data – Scale Out 21.05.16 PyData Berlin 2016 Disruption:

    Tool stack, programming paradigm, technical/organizational restrictions
  7. 21.05.16 PyData Berlin 2016 Parallel, Out-of-Core, Scale Out with Python

    ▪  Manual sharding is often sufficient ▪  Staying in Python is less a matter of language than of tool stack and programming paradigm ▪  Ufora ▪  asyncio, concurrent.futures ▪  Cython “with nogil” ▪  IPython.parallel ▪  Celery ▪  dask, distributed ▪  Blaze ▪  Ibis ▪  Google Data Flow / Apache Beam (Python SDK)
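    As one concrete flavor of the options above, a hedged dask sketch for out-of-core, pandas-like processing (file pattern and column names are hypothetical):

        import dask.dataframe as dd

        # Partitioned DataFrame over many CSVs; nothing is loaded yet (lazy)
        df = dd.read_csv('data/part-*.csv')

        # Pandas-like API; compute() executes the task graph in parallel
        means = df.groupby('key')['value'].mean().compute()
        print(means)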
  8. 21.05.16 PyData Berlin 2016 Big Data Zoo on Fire Source:

    http://j2eedev.org/wp-content/uploads/2014/05/ecosystem-hadoop.png
  9. 21.05.16 PyData Berlin 2016 Big Data Zoo on Fire ...

    and Steroids Source: http://j2eedev.org/wp-content/uploads/2014/05/ecosystem-hadoop.png
  10. Spark – Holistic Data Framework 21.05.16 PyData Berlin 2016

    [Stack diagram] Libraries: SQL, DataFrames, Streaming Data, Machine Learning, Graph Analytics | Engine: Spark Core (DAG Compute Engine) | Language bindings: Java, Scala, Python, R | Cluster managers: Standalone, YARN, Mesos | Storage: HDFS, Parquet, S3, JDBC, HBase, Cassandra, …
  11. 21.05.16 PyData Berlin 2016 Big Data challenges ▪  Technology Zoo:

    configurations, interfaces, integration, monitoring ▪  Big Data land is – mostly – JVM land ▪  Programming paradigm shift ▪  Needle in the Haystack ¤  Volume, heterogeneity ¤  Data Science? Data Prep! ▪  Cluster = Shared Resource ¤  Resource Management ¤  Multitenancy ¤  Security (Sentry, Ranger, Kerberos, RecordService)
  12. 21.05.16 PyData Berlin 2016 Boxes ▪  Encapsulate Complexity ▪  Reproducible

    Working and Execution Environments ▪  Reproducible Analytics ▪  Engines, Runtimes, Services ▪  Packaging ▪  System environments ▪  Runtime environments ▪  Machines and Virtualization ▪  Resource pools ▪  Nested: System, Virtualization, Runtime Environments, Packages
  13. Big Data Engines: Runtimes, Services 21.05.16 PyData Berlin 2016

    [Layer diagram] Compute/Query/Storage/Stream Engines (HDFS, HBase, Spark Compute, Spark SQL, Hive, Impala) on top of cluster, hardware, networks
  14. 21.05.16 PyData Berlin 2016 Use Case: Working Environments Otto BRAIN

    ▪  Hadoop ▪  Spark ▪  Python ▪  Hive ▪  Impala ▪  HBase ▪  Ignite ▪  Sentry, Kerberos ▪  Talend ▪  Teradata ▪  ...
  15. NoBrainer – Build and Distribute Boxes 21.05.16 PyData Berlin 2016

    [Build diagram] Declarative sources (YAML, JSON, PY) built into distributable box artifacts: OVF, ISO, Docker images, Vagrant boxes
  16. 21.05.16 PyData Berlin 2016 DevOps Tools ▪  Virtualization ¤  Hosted

    Hypervisor: VirtualBox (Parallels, VMware Fusion) ¤  Bare-Metal Hypervisor: vSphere/ESX ¤  Container: Docker, Kubernetes ▪  Cross-platform image builder: Packer ▪  Configuration management, provisioning: Ansible ▪  Box deployment: Vagrant ▪  Artifacts ¤  SCM: Git-Server ¤  Binary: JFrog Artifactory, Nexus
  17. 21.05.16 PyData Berlin 2016 Ansible ▪  Configuration Management, Software Provisioning,

    Automation ¤  Alternatives: Chef, Puppet, Saltstack, CFEngine ¤  Declarative, idempotent ¤  Modular, re-usable ▪  Why Ansible? ¤  Lean - Agentless, Push (SSH) ¤  Configuration: YAML or Python API ¤  Python ▶  Plugins, modules ▶  Ansible 2 Python API: https://serversforhackers.com/running-ansible-2-programmatically
  18. Ansible Tasks 21.05.16 PyData Berlin 2016

    YAML:

        - name: install Hadoop
          apt: name=hadoop-client state=present
        - name: install Hive
          apt: name=hive state=present
        - name: install numpy via conda
          conda: name=numpy state=latest
        - name: install scipy 0.17 via conda
          conda: name=scipy version="0.17"
        - name: remove matplotlib from conda
          conda: name=matplotlib state=absent

    Python:

        play_source = dict(
            hosts='localhost',
            tasks=[
                dict(name='install Hadoop',
                     action=dict(module='apt',
                                 args=dict(name='hadoop-client', state='present'))),
                ...
            ]
        )
        play = Play().load(play_source)
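    For orientation, a hedged sketch of how such tasks are typically executed: wrapped in a playbook and pushed over SSH with ansible-playbook (file names and host group are hypothetical; the conda module is a third-party module, in line with the Python extensibility mentioned above):

        # site.yml (hypothetical playbook)
        - hosts: datanodes
          become: yes
          tasks:
            - name: install Hadoop
              apt: name=hadoop-client state=present
            - name: install numpy via conda
              conda: name=numpy state=latest

    Run agentless against an inventory: ansible-playbook -i inventory site.yml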
  19. 21.05.16 PyData Berlin 2016 Aligned Working/Execution Environments ▪  System Environments

    ¤  Virtualization ¤  Container ▪  Runtime environments ¤  Virtualenv/conda environments ¤  Maven, Gradle ▪  Software, Configuration Management ▪  Jupyter Kernels ▪  Spark ¤  Client – Master/Driver – Worker ¤  Spark Core runtime (JVM) ¤  Guest API runtime (Python): spark.yarn.appMasterEnv.PYSPARK_PYTHON
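    As a concrete illustration of pinning the guest API runtime, a hedged spark-submit example (the environment path is hypothetical):

        $ spark-submit \
            --master yarn --deploy-mode cluster \
            --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/envs/ds/bin/python \
            --conf spark.executorEnv.PYSPARK_PYTHON=/opt/envs/ds/bin/python \
            app.py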
  20. Execution Environments 21.05.16 PyData Berlin 2016

    [Diagram] Engines (HDFS, HBase, Spark Compute, Spark SQL, Hive, Impala) reached via two paths: Workflow Engine + Scheduler (Batch/Prod) and Working Env (Interactive/Lab)
  21. Notebook Service, Jupyter Kernels 21.05.16 PyData Berlin 2016

    [Architecture diagram] Browser → HTTP Proxy → JupyterHub (API, Kerberos Authenticator, Kubernetes Spawner) → per-user Notebook Servers with Kernels → Spark; aligned with the Local Working Environment
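    For reference, each kernel is registered via a kernelspec (kernel.json); a minimal hedged sketch wiring a kernel to a conda environment and PySpark (display name and paths are hypothetical):

        {
          "display_name": "Python (ds env, PySpark)",
          "language": "python",
          "argv": ["/opt/envs/ds/bin/python", "-m", "ipykernel", "-f", "{connection_file}"],
          "env": {"PYSPARK_PYTHON": "/opt/envs/ds/bin/python"}
        }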
  22. 21.05.16 PyData Berlin 2016 From Unix Pipes to Big Data

    Pipelines ▪  cat *.txt | wc -l ▪  cat *.txt | mapper.py | sort -k1,1 | reducer.py ▪  hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /path/*.txt -output /path/count.dat
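    A minimal hedged sketch of such streaming scripts in the classic word-count shape (the scripts are illustrative, not from the deck); the sort -k1,1 step guarantees the reducer sees each key's records contiguously:

        #!/usr/bin/env python
        # mapper.py: emit one (word, 1) pair per token
        import sys

        for line in sys.stdin:
            for word in line.split():
                print('%s\t1' % word)

        #!/usr/bin/env python
        # reducer.py: sum counts per word; input arrives sorted by key
        import sys
        from itertools import groupby

        pairs = (line.rstrip('\n').split('\t', 1) for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print('%s\t%d' % (word, sum(int(c) for _, c in group)))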
  23. Pipelines – building blocks 21.05.16 PyData Berlin 2016 ▪  Workflow

    vs dataflow ▪  Dependency management ▪  Execution environment ▪  Single-platform/runtime vs. cross-platform ▪  Triggers (time, event) ▪  Exception handling ▪  Optimizer (Query Plan) ▪  Scheduler ▪  Graph ▪  Typically DAG ▪  Tasks, steps ¤  Sources ¤  Sinks ¤  Flows ¤  Connectors ¤  Operators ▪  Nested pipelines
  24. Pipelines – systems and technologies 21.05.16 PyData Berlin 2016

    ▪  Unix Pipes ▪  PyToolz pipes ▪  Hadoop streaming ▪  Pandas, Spark operator chaining ▪  ML Pipelines ¤  sklearn.pipeline ¤  TensorFlow ¤  Spark ML (persistence in Spark 2.0) ▪  ETL, cross-platform workflows ¤  Oozie ¤  Cascading/Scalding ¤  Python: Luigi, Airflow, Pinball ▪  Implicit (Runtime) Pipelines ¤  Query Plans (DAG with UDFs) ¤  Reactive Programming ¤  Notebooks ▪  CI/CD Pipelines ▪  Unified Programming ¤  Google Dataflow/Apache Beam ¤  Blaze
  25. 21.05.16 PyData Berlin 2016 Luigi, Airflow ▪  Configuration as Python

    code ▪  Dynamic task/pipeline generation ▪  Luigi ¤  Maturity ¤  More connectors, Spark ▪  Airflow ¤  scheduler ¤  code deploy (pickle) ¤  cross-dag dependencies ¤  Apache Incubator http://bytepawn.com/luigi-airflow-pinball.html
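    To give a flavor of "configuration as Python code", a minimal hedged Luigi sketch (task, parameter, and paths are hypothetical):

        import luigi

        class CountLines(luigi.Task):
            """Count lines of an input file and persist the result."""
            path = luigi.Parameter()

            def output(self):
                # Target existence doubles as completion marker (idempotent reruns)
                return luigi.LocalTarget(self.path + '.count')

            def run(self):
                with open(self.path) as f:
                    n = sum(1 for _ in f)
                with self.output().open('w') as out:
                    out.write('%d\n' % n)

        if __name__ == '__main__':
            luigi.run()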
  26. 21.05.16 PyData Berlin 2016 Continuous Integration/Deployment/Delivery ▪  Seamless integration of

    CI/CD pipelines, general data pipelines and ML pipelines is a major challenge for fast non-disruptive transition from Data Science lab to production ¤  Boxes ¤  Automation ¤  Pipelines as code ▪  GoCD, Bamboo: Pipelines as first class citizens ▪  Jenkins 2: Pipelines, Groovy Job DSL, Docker/Ansible steps ▪  LambdaCD: Clojure Job DSL ▪  Buildbot: Python Job DSL ▪  CI-Servers can be and are (mis)used as data pipeline engines
  27. Interfaces 21.05.16 PyData Berlin 2016

    High-level: convenience, unified programming | Push down with minimal overhead | Low-level: machine-/storage-level computation/optimization
  28. 21.05.16 PyData Berlin 2016 Low-level Interfaces, Interchange ▪  Storage, Serialization:

    Parquet, Avro, HBase/HFiles, Thrift, ... upcoming: Kudu, Arrow ▪  Compression ▪  IPC - Unix Pipes (Hadoop Streaming, Spark rdd.pipe) ▪  Messaging – RPC, Queues ▪  Low-Level Compilers ¤  Numba, LLVM (Impala UDFs) ¤  Cython ▪  Machine-level libraries (C, Fortran): BLAS, LAPACK, MAGMA, ...
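    As a small illustration of the low-level compiler idea, a hedged Numba sketch (the function is illustrative):

        import numpy as np
        from numba import jit

        @jit(nopython=True)  # compiled via LLVM to machine code, no Python objects
        def dot(a, b):
            total = 0.0
            for i in range(a.shape[0]):
                total += a[i] * b[i]
            return total

        x = np.random.rand(1000)
        y = np.random.rand(1000)
        print(dot(x, y))  # first call compiles, later calls run at native speed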
  29. 21.05.16 PyData Berlin 2016 High-Level Interfaces, Compilers ▪  Pig ▪ 

    Cascading ▪  Apache Lens ▪  Apache Ignite ▪  Blaze ▪  Ibis ▪  Spark ▪  Apache Beam/Google DataFlow. Not interchangeable alternatives: very different approaches!
  30. 21.05.16 PyData Berlin 2016 Spark – High-Level and Low-Level Interfaces

    ▪  Spark SQL, DataFrame API ¤  Hive vs Spark HiveContext ¤  Catalyst – Query optimizer ¤  Spark SQL and Security ▪  Spark Low-Level Optimization ¤  Tuning Java Garbage Collection for Spark [1] ¤  Tungsten ¤  Off-Heap Memory – Tachyon ¤  Linear Algebra Subsystems Integration: Breeze, netlib-java [2], JIT Optimizations Source: Databricks References [1] https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html [2] Sam Halliday (Scala Exchange, 2014): High Performance Linear Algebra, http://fommil.github.io/scalax14/#/
  31. Use Case: Recommendations with Spark 21.05.16 PyData Berlin 2016

    [Diagram] ALS matrix factorization: DATA, a sparse 7 M users x 500 K products matrix, is factorized into a MODEL of two factor matrices (7 M x 70 user factors and 70 x 500 K product factors, 70 latent factors); multiplying them yields RECOMMENDATIONS, a dense 7 M x 500 K matrix
  32. Online, Batch, Top K Recommendations 21.05.16 PyData Berlin 2016

    [Cost table]
                   ONLINE                           BATCH
    PREDICT        #F x #P = 35 Mio Ops             #U x #F x #P = 245 Trillion Ops
    TOP K (SORT)   #P x log2(#P) = 9.5 Mio Ops      #U x #P x log2(#P) = 7 Trillion Ops
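    To make the arithmetic explicit, using the previous slide's figures (#U = 7 M users, #P = 500 K products, #F = 70 factors): online predict is 70 x 500 K = 35 M ops; batch predict is 7 M x 70 x 500 K = 245 trillion ops; online top-K sorting is 500 K x log2(500 K) ≈ 500 K x 19 ≈ 9.5 M ops; batch top-K is 7 M users times that, ≈ 7 trillion ops.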
  33. 21.05.16 PyData Berlin 2016 Scalable Matrix Multiplication ▪  Matrix multiplication is

    embarrassingly parallel ▪  But at which level? Which granularity, blocks? ▪  Spark built-in options: ¤  MatrixFactorizationModel.predict() ¤  PySpark: predict is only a thin wrapper around the Scala API predict ¤  Iterates over products/users ▶  Computation dominated by shuffling ▶  Only applicable in single-node standalone mode ¤  Since Spark 1.4 (PySpark 1.6): recommendProductsForUsers(num) ¤  Still slow (run time would be weeks). Why?
  34. 21.05.16 PyData Berlin 2016 3.5 Trillion recommendations in < 30

    min (*) ▪  Broadcast the smaller feature matrix (here: products) to memory on all nodes ▪  Top-K prediction as a one-liner (sketched below) ¤  Spark API: user.map ¤  Vectorized multiplication in NumPy: numpy.dot(user_features, product_features) ¤  Top-K: cytoolz.topk ▪  CyToolz ¤  Cython-optimized variant of toolz: functional programming ¤  lazy evaluation: streaming ¤  CyToolz pipeline in Spark operator ▪  numpy.dot -> BLAS.DGEMM/BLAS.DGEMV (*) no extensive benchmarking, fewer than 20 workers, FAIR scheduler, can be further optimized
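    A minimal sketch of this approach, assuming a SparkContext sc, a NumPy matrix product_features of shape (#P, #F) small enough to broadcast, and an RDD user_features of (user_id, factor_vector) pairs; all names are illustrative, not the production code:

        import numpy as np
        from cytoolz import topk

        pf = sc.broadcast(product_features)  # ship the small factor matrix to every worker

        def recommend(user, k=10):
            user_id, factors = user
            # One vectorized dot scores all products at once
            # (dispatches to BLAS DGEMV under the hood)
            scores = pf.value.dot(factors)
            # Lazy top-k instead of a full sort: O(#P log k) per user
            best = topk(k, enumerate(scores), key=lambda pair: pair[1])
            return user_id, list(best)

        recommendations = user_features.map(recommend)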
  35. Linear Algebra Subsystems 21.05.16 PyData Berlin 2016

    [Stack diagram] Spark Scala API → Breeze → netlib-java → BLAS/LAPACK; PySpark → NumPy → BLAS/LAPACK; BLAS/LAPACK implementations: Intel MKL, AMD ACML, OpenBLAS, ATLAS, Accelerate. Custom topk predict: minutes | Standard topk predict: weeks
  36. 21.05.16 PyData Berlin 2016 Summary ▪  Building blocks for Polyglot

    Data Science Platforms on Big Data Systems ¤  Boxes ¤  Pipelines ¤  High-Level and Low-Level Interfaces ▪  Avoid disruptions, enable seamless transitions ¤  Local – Cluster ¤  Lab/Interactive – Production/Batch ▪  Automate Everything ¤  DevOps tools ¤  Pipelines and infrastructure as code
  37. 21.05.16 PyData Berlin 2016 We are hiring! ▪  Data Engineer/Scientist

    ¤  Senior and Graduate ¤  Python, Java, Scala, C/C++ ▪  Berlin-Kreuzberg office ▪  Interesting industry projects ▪  Cutting-edge technologies ▪  Research and continuous learning - collaboration with Hasso Plattner Institute ▪  Events (e.g. lake-side retreat + drone hacking) ▪  Club Mate ∞ ! [email protected]