
Polyglot Data Science Platforms on Big Data Systems

Extended slide deck from a talk at the PyData Berlin 2016 conference.

Keywords: Polyglot Data Science, Data Science Platforms, Python, NumPy, PyData, Scala, Big Data, Scale Out, Spark, Infrastructure, Automation, DevOps, Boxes, Packaging, Ansible, Packer, Virtualization, Container, Docker, Kubernetes, Jupyter, Pipelines, Recommender, ALS, BLAS.

http://pydata.org/berlin2016/

Frank Kaufer

May 24, 2016

Transcript

  1. BESPOKE DATA ENGINEERING
    Polyglot Data Science Platforms
    on Big Data Systems
    Frank Kaufer @fkaufer
    21.05.16 PyData Berlin 2016 | extended slide deck

  2. About me, bakdata
    ■  AI, Data Engineering, Distributed Systems
    ■  Pythonista since 2004
    ■  Co-founder bakdata
    ■  bakdata
    ¤  One-stop shop for bespoke data engineering solutions
    ▶  Systems
    ▶  Data Management
    ▶  Information Integration
    ▶  Analytics and Reporting
    ¤  Recent projects/clients: Otto, GfK, Elsevier, E.CA Economics, ...
    ¤  Polyglot software engineering: Python, Java, Scala, SQL, C/C++, R, ...

  3. This talk
    ■  ... is for
    ¤  Data Scientists working with Python on “Big Data”
    ¤  (Big) Data (Systems) Engineers building platforms and tools for Data Scientists
    ■  Polyglot Data Science Platforms on Big Data Systems
    ■  Building blocks
    ¤  Boxes
    ¤  Pipelines
    ¤  Interfaces/Abstractions (high-level convenience, low-level optimization)

  4. Pythonic Data Science
    jupyter
    numpy
    pandas
    seaborn
    scikit-learn

  5. Polyglot Data Science
    Matlab, SAS, Stata, SPSS, EViews, ...
    XLSX, CSV, HDF5, ...
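
    A minimal pandas sketch of reading such interchange formats (file names are hypothetical):

    import pandas as pd

    # Exchange data with other stacks via common file formats
    df_csv = pd.read_csv("export.csv")
    df_xlsx = pd.read_excel("export.xlsx")
    df_hdf = pd.read_hdf("export.h5", key="table")
    df_stata = pd.read_stata("export.dta")       # Stata
    df_sas = pd.read_sas("export.sas7bdat")      # SAS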

  6. Data Science and Big Data
    [Diagram: typical exchanges between the Data Science environment and the Big Data system: dump, extract, SQL, UDF]

  7. Big Data?
    Very Large Data Bases: since 1975 (Source: vldb.org)
    "Huuuuge Data": starting Nov 2016 (Source: CBS)

  8. Big Data – Scale Up
    ?

  9. Big Data – Scale Down
    ■  Sampling, Filtering
    ■  Compression, C-Stores
    ¤  Pandas Categorical (see sketch below)
    ¤  parquet-python
    ¤  Blosc, bcolz, castra
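
    A minimal sketch of the Pandas Categorical point above (sizes and values are illustrative):

    import numpy as np
    import pandas as pd

    # A low-cardinality string column shrinks drastically as a Categorical
    s = pd.Series(np.random.choice(["red", "green", "blue"], size=1000000))
    print(s.memory_usage(deep=True))                     # object dtype: tens of MB
    print(s.astype("category").memory_usage(deep=True))  # category codes: ~1 MB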

  10. Big Data – Scale Out
    Disruption: tool stack, programming paradigm, technical/organizational restrictions

  11. Parallel, Out-of-Core, Scale Out with Python
    ■  Manual sharding is often sufficient
    ■  Staying in Python is less a matter of language than of tool stack and programming paradigm
    ■  Ufora
    ■  asyncio, concurrent.futures
    ■  Cython “with nogil”
    ■  IPython.parallel
    ■  Celery
    ■  dask, distributed (see sketch below)
    ■  Blaze
    ■  Ibis
    ■  Google Data Flow / Apache Beam (Python SDK)
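
    A minimal dask sketch of the out-of-core/scale-out idea (file pattern and columns are hypothetical):

    import dask.dataframe as dd

    # pandas-style code over data that does not fit in memory;
    # with dask.distributed the same task graph runs across nodes
    df = dd.read_csv("events-*.csv")                   # lazily builds a task graph
    top = df.groupby("user_id")["amount"].sum().nlargest(10)
    print(top.compute())                               # triggers parallel execution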

  12. Big Data Zoo
    Source: http://j2eedev.org/wp-content/uploads/2014/05/ecosystem-hadoop.png

  13. Big Data Zoo on Fire
    Source: http://j2eedev.org/wp-content/uploads/2014/05/ecosystem-hadoop.png

  14. Big Data Zoo on Fire ... and Steroids
    Source: http://j2eedev.org/wp-content/uploads/2014/05/ecosystem-hadoop.png

  15. Spark – Holistic Data Framework
    [Diagram: the Spark stack. Libraries: SQL/DataFrames, Streaming Data, Machine Learning, Graph Analytics, all on Spark Core (DAG compute engine). Language APIs: Java, Scala, Python, R. Cluster managers: Standalone, YARN, Mesos. Storage: HDFS, Parquet, S3, JDBC, HBase, Cassandra, …]

  16. Big Data challenges
    ■  Technology Zoo: configurations, interfaces, integration, monitoring
    ■  Big Data land is – mostly – JVM land
    ■  Programming paradigm shift
    ■  Needle in the Haystack
    ¤  Volume, heterogeneity
    ¤  Data Science? Data Prep!
    ■  Cluster = Shared Resource
    ¤  Resource Management
    ¤  Multitenancy
    ¤  Security (Sentry, Ranger, Kerberos, RecordService)

  17. Boxes

  18. Boxes
    ■  Encapsulate Complexity
    ■  Reproducible Working and Executing Environments
    ■  Reproducible Analytics
    ■  Engines, Runtimes, Services
    ■  Packaging
    ■  System environments
    ■  Runtime environments
    ■  Machines and Virtualization
    ■  Resource pools
    ■  Nested: System, Virtualization, Runtime Environments, Packages

  19. Big Data Engines: Runtimes, Services
    [Diagram: compute/query/storage/stream engines (HDFS, HBase, Spark Compute, Spark SQL, Hive, Impala) running on the cluster, hardware, networks]

  20. Use Case: Working Environments Otto BRAIN
    ■  Hadoop
    ■  Spark
    ■  Python
    ■  Hive
    ■  Impala
    ■  HBase
    ■  Ignite
    ■  Sentry, Kerberos
    ■  Talend
    ■  Teradata
    ■  ...

  21. NoBrainer - Infrastructure as code
    [Diagram: infrastructure as code: an ISO image plus Spark, Teradata, Impala, Hive, HBase, HDFS components]

  22. Working Env: BRAIN Data Science Box

  23. NoBrainer – Build and Distribute Boxes
    [Diagram: build definitions (YAML, JSON, PY) are turned into distributable artifacts: ISO and OVF images, Docker images, Vagrant boxes]

  24. DevOps Tools
    ■  Virtualization
    ¤  Hosted Hypervisor: VirtualBox (Parallels, VMware Fusion)
    ¤  Bare-Metal Hypervisor: vSphere/ESX
    ¤  Container: Docker, Kubernetes
    ■  Cross-platform image builder: Packer
    ■  Configuration management, provisioning: Ansible
    ■  Box deployment: Vagrant
    ■  Artifacts
    ¤  SCM: Git-Server
    ¤  Binary: JFrog Artifactory, Nexus

  25. Ansible
    ■  Configuration Management, Software Provisioning, Automation
    ¤  Alternatives: Chef, Puppet, Saltstack, CFEngine
    ¤  Declarative, idempotent
    ¤  Modular, re-usable
    ■  Why Ansible?
    ¤  Lean - Agentless, Push (SSH)
    ¤  Configuration: YAML or Python API
    ¤  Python
    ▶  Plugins, modules
    ▶  Ansible 2 Python API:
    https://serversforhackers.com/running-ansible-2-programmatically

  26. Ansible Tasks
    YAML:
    - name: install Hadoop
      apt: name=hadoop-client state=present
    - name: install Hive
      apt: name=hive state=present
    - name: install numpy via conda
      conda: name=numpy state=latest
    - name: install scipy 0.17 via conda
      conda: name=scipy version="0.17"
    - name: remove matplotlib from conda
      conda: name=matplotlib state=absent

    Python (Ansible 2 API):
    from ansible.playbook.play import Play

    play_source = dict(
        hosts='localhost',
        tasks=[
            dict(name="install Hadoop",
                 action=dict(module='apt',
                             args=dict(name='hadoop-client',
                                       state='present'))),
            ...
        ]
    )
    play = Play().load(play_source)

  27. Aligned Working/Execution Environments
    ■  System Environments
    ¤  Virtualization
    ¤  Container
    ■  Runtime environments
    ¤  Virtualenv/conda environments
    ¤  Maven, Gradle
    ■  Software, Configuration Management
    ■  Jupyter Kernels
    ■  Spark
    ¤  Client – Master/Driver – Worker
    ¤  Spark Core runtime (JVM)
    ¤  Guest API runtime (Python): spark.yarn.appMasterEnv.PYSPARK_PYTHON
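
    A minimal PySpark sketch of pinning the guest runtime (the conda env path is hypothetical and must exist on every node):

    from pyspark import SparkConf, SparkContext

    py = "/opt/conda/envs/brain/bin/python"  # hypothetical per-node conda env
    conf = (SparkConf()
            .setAppName("aligned-env")
            .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", py)
            .set("spark.executorEnv.PYSPARK_PYTHON", py))
    sc = SparkContext(conf=conf)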

  28. Execution Environments
    [Diagram: the engines (HDFS, HBase, Spark Compute, Spark SQL, Hive, Impala) serve two execution paths: a workflow engine and scheduler for batch/prod, and a working environment for interactive/lab]

  29. Notebook Service, Jupyter Kernels
    [Diagram: from the local working environment, the browser connects through an HTTP proxy to JupyterHub (API, Kerberos authenticator, Kubernetes spawner), which spawns per-user notebook servers whose kernels can talk to Spark]

  30. Pipelines

  31. Pipelines
    [Diagram: working environment and pipeline engine, with a question mark in between: how do they connect?]

  32. From Unix Pipes to Big Data Pipelines
    ■  cat *.txt | wc -l
    ■  cat *.txt | mapper.py | sort -k1,1 | reducer.py
    ■  hadoop jar hadoop-streaming.jar
       -file mapper.py -mapper mapper.py
       -file reducer.py -reducer reducer.py
       -input /path/*.txt
       -output /path/count.dat
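
    A minimal sketch of what mapper.py and reducer.py could look like for a word count (the slide does not show their contents):

    #!/usr/bin/env python
    # mapper.py: emit "word<TAB>1" for every word on stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum counts per word; relies on the sorted input
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))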

  33. Pipelines – building blocks
    ■  Workflow vs dataflow
    ■  Dependency management
    ■  Execution environment
    ■  Single-platform/runtime vs. cross-platform
    ■  Triggers (time, event)
    ■  Exception handling
    ■  Optimizer (Query Plan)
    ■  Scheduler
    ■  Graph: typically a DAG
    ■  Tasks, steps
    ¤  Sources
    ¤  Sinks
    ¤  Flows
    ¤  Connectors
    ¤  Operators
    ■  Nested pipelines

  34. Pipelines – systems and technologies
    ■  Unix Pipes
    ■  PyToolz pipes
    ■  Hadoop streaming
    ■  Pandas, Spark operator chaining
    ■  ML Pipelines
    ¤  sklearn.pipeline (see sketch after this list)
    ¤  TensorFlow
    ¤  Spark ML (persistence in Spark 2.0)
    ■  ETL, cross-platform workflows
    ¤  Oozie
    ¤  Cascading/Scalding
    ¤  Python: Luigi, Airflow, Pinball
    ■  Implicit (Runtime) Pipelines
    ¤  Query Plans (DAG with UDFs)
    ¤  Reactive Programming
    ¤  Notebooks
    ■  CI/CD Pipelines
    ■  Unified Programming
    ¤  Google Dataflow/Apache Beam
    ¤  Blaze
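
    A minimal sklearn.pipeline sketch (steps are illustrative):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Chained, re-fittable steps: fit/predict run the whole pipeline
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),
    ])
    # pipe.fit(X_train, y_train); pipe.predict(X_test)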

  35. Luigi, Airflow
    ■  Configuration as Python code
    ■  Dynamic task/pipeline generation
    ■  Luigi (see sketch below)
    ¤  Maturity
    ¤  More connectors, Spark
    ■  Airflow
    ¤  scheduler
    ¤  code deploy (pickle)
    ¤  cross-dag dependencies
    ¤  Apache Incubator
    http://bytepawn.com/luigi-airflow-pinball.html
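
    A minimal Luigi sketch of configuration as Python code (tasks and targets are hypothetical):

    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("raw-%s.csv" % self.date)

        def run(self):
            with self.output().open("w") as f:
                f.write("user,amount\n")

    class Aggregate(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return Extract(date=self.date)  # dependency graph declared in code

        def output(self):
            return luigi.LocalTarget("agg-%s.csv" % self.date)

        def run(self):
            with self.input().open() as fin, self.output().open("w") as fout:
                fout.write(fin.read())  # placeholder for the real aggregation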

  36. Continuous Integration/Deployment/Delivery
    ■  Seamless integration of CI/CD pipelines, general data pipelines, and ML pipelines is a major challenge for a fast, non-disruptive transition from the Data Science lab to production
    ¤  Boxes
    ¤  Automation
    ¤  Pipelines as code
    ■  GoCD, Bamboo: Pipelines as first class citizens
    ■  Jenkins 2: Pipelines, Groovy Job DSL, Docker/Ansible steps
    ■  LambdaCD: Clojure Job DSL
    ■  Buildbot: Python Job DSL
    ■  CI servers can be, and are, (mis)used as data pipeline engines

  37. Interfaces
    High-level: convenience, unified programming
    Push down with minimal overhead
    Low-level: machine-/storage-level computation/optimization

  38. Low-level Interfaces, Interchange
    ■  Storage, Serialization: Parquet, Avro, HBase/HFiles, Thrift, ...
    upcoming: Kudu, Arrow
    ■  Compression
    ■  IPC - Unix Pipes (Hadoop Streaming, Spark rdd.pipe)
    ■  Messaging – RPC, Queues
    ■  Low-Level Compilers
    ¤  Numba, LLVM (Impala UDFs); see Numba sketch below
    ¤  Cython
    ■  Machine-level libraries (C, Fortran): BLAS, LAPACK, MAGMA, ...
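
    A generic Numba JIT sketch (not an Impala UDF; illustrative only):

    import numpy as np
    from numba import jit

    @jit(nopython=True)          # compiled to machine code via LLVM
    def dot(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * b[i]
        return total

    print(dot(np.random.rand(1000), np.random.rand(1000)))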

  39. High-Level Interfaces, Compilers
    ■  Pig
    ■  Cascading
    ■  Apache Lens
    ■  Apache Ignite
    ■  Blaze
    ■  Ibis
    ■  Spark
    ■  Apache Beam/Google DataFlow
    Not alternatives to each other: very different approaches!

  40. Spark – High-Level and Low-Level Interfaces
    ■  Spark SQL, DataFrame API
    ¤  Hive vs Spark HiveContext
    ¤  Catalyst – Query optimizer
    ¤  Spark SQL and Security
    ■  Spark Low-Level Optimization
    ¤  Tuning Java Garbage Collection for Spark [1]
    ¤  Tungsten
    ¤  Off-Heap Memory – Tachyon
    ¤  Linear Algebra Subsystems Integration: Breeze, netlib-java [2], JIT Optimizations
    [Figure source: Databricks]
    References
    [1] https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
    [2] Sam Halliday (Scala Exchange, 2014): High Performance Linear Algebra, http://fommil.github.io/scalax14/#/

  41. Use Case: Recommendations with Spark
    [Diagram: ALS matrix factorization. DATA: sparse matrix of 7 M users x 500 K products. MODEL: factorization into a 7 M x 70 user-factor matrix and a 70 x 500 K product-factor matrix. RECOMMENDATIONS: their product, a dense 7 M x 500 K matrix]

  42. Online, Batch, Top K Recommendations
    ■  ONLINE predict (one user): #F x #P = 35 Mio ops
    ■  BATCH predict (all users): #U x #F x #P = 245 trillion ops
    ■  ONLINE top K (sort, one user): #P x log2(#P) = 9.5 Mio ops
    ■  BATCH top K (sort, all users): #U x #P x log2(#P) = 7 trillion ops

  43. Scalable Matrix Multiplication
    ■  Matrix multiplication is embarrassingly parallel
    ■  But at which level? Which granularity, blocks?
    ■  Spark built-in options:
    ¤  MatrixFactorizationModel.predict()
    ¤  PySpark: predict is only a thin wrapper around the Scala API predict
    ¤  Iterates over products/users
    ▶  Computation dominated by shuffling
    ▶  Only applicable in single-node standalone mode
    ¤  Since Spark 1.4 (PySpark 1.6): recommendProductsForUsers(num), see call sketch below
    ¤  Still slow (run time would be weeks). Why?
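
    For reference, the built-in batch call is a one-liner (sketch; model is the trained MatrixFactorizationModel):

    # RDD of (user_id, [Rating(user, product, score), ...])
    recs = model.recommendProductsForUsers(10)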

  44. Fixed Block size, Scala/JVM Overhead
    Breeze – “Scala’s NumPy”

  45. PySpark – Python/JVM Interaction

  46. Distributed Matrix Multiplication
    Shuffle vs Broadcast

  47. 3.5 Trillion recommendations in < 30 min (*)
    ■  Broadcast the smaller feature matrix (here: products) to memory on all nodes
    ■  Top-K prediction as a one-liner (see sketch below)
    ¤  Spark API: user.map
    ¤  Vectorized multiplication in NumPy: numpy.dot(user_features, product_features)
    ¤  Top-K: cytoolz.topk
    ■  CyToolz
    ¤  Cython-optimized variant of toolz: functional programming
    ¤  lazy evaluation, streaming
    ¤  CyToolz pipeline in Spark operator
    ■  numpy.dot -> BLAS.DGEMM/BLAS.DGEMV
    (*) no extensive benchmarking, less than 20 workers, FAIR scheduler, can be further optimized
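
    A minimal PySpark sketch of this approach (illustrative: model is a trained MatrixFactorizationModel, sc the SparkContext, K = 10):

    import numpy as np
    from cytoolz import topk

    # Broadcast the smaller factor matrix (products) to every worker
    product_ids, product_factors = zip(*model.productFeatures().collect())
    P = sc.broadcast(np.array(product_factors))   # shape: (#products, #factors)
    ids = sc.broadcast(np.array(product_ids))

    def user_top_k(user):
        user_id, user_features = user
        scores = np.dot(P.value, np.array(user_features))       # BLAS-backed (DGEMV)
        return user_id, list(topk(10, zip(scores, ids.value)))  # lazy, Cython top-K

    recommendations = model.userFeatures().map(user_top_k)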

  48. Linear Algebra Subsystems
    [Diagram: two call paths down to BLAS/LAPACK. Standard topk predict: Spark Scala API -> Breeze -> netlib-java -> BLAS/LAPACK (weeks). Custom topk predict: PySpark -> NumPy -> BLAS/LAPACK (minutes). BLAS backends: Intel MKL, AMD ACML, OpenBLAS, ATLAS, Accelerate]

  49. Summary
    ■  Building blocks for Polyglot Data Science Platforms on Big Data Systems
    ¤  Boxes
    ¤  Pipelines
    ¤  High-Level and Low-Level Interfaces
    ■  Avoid disruptions, enable seamless transitions
    ¤  Local – Cluster
    ¤  Lab/Interactive – Production/Batch
    ■  Automate Everything
    ¤  DevOps tools
    ¤  Pipelines and infrastructure as code

  50. Thank you

  51. We are hiring!
    ■  Data Engineer/Scientist
    ¤  Senior and Graduate
    ¤  Python, Java, Scala, C/C++
    ■  Berlin-Kreuzberg office
    ■  Interesting industry projects
    ■  Cutting-edge technologies
    ■  Research and continuous learning - collaboration with Hasso Plattner Institute
    ■  Events (e.g. lake-side retreat + drone hacking)
    ■  Club Mate ∞
    ! [email protected]
