
Data Science meets Software Development

I work in a Data Innovation Lab with a horde of Data Scientists. Data Scientists gather data, clean data, apply Machine Learning algorithms and produce results, all of that with specialized tools (Dataiku, Scikit-Learn, R...). These processes run on a single machine, on data that is fixed in time, and they have no constraint on execution speed.

With my fellow Developers, our goal is to bring these processes to production. Our constraints are very different: we want the code to be versioned, to be tested, to be deployed automatically and to produce logs. We also need it to run in production on distributed architectures (Spark, Hadoop), with fixed versions of languages and frameworks (Scala...), and with data that changes every day.

In this talk, I will explain how we, Developers, work hand-in-hand with Data Scientists to shorten the path to running data workflows in production.

Alexis Seigneurin

August 26, 2015

Transcript

1. Who I am • Software engineer for 15 years • Consultant at Ippon Tech in Paris, France • Favorite subjects: Spark, Cassandra, Ansible, Docker • @aseigneurin
2. • 200 software engineers in France and the US • In the US: offices in DC, NYC and Richmond, Virginia • Digital, Big Data and Cloud applications • Java & Agile expertise • Open-source projects: JHipster, Tatami, etc. • @ipponusa
3. The project • Data Innovation Lab of a large insurance company • Data → Business value • Team of 30 Data Scientists + Software Developers
4. Skill set of a Data Scientist • Strong in: • Science (maths / statistics) • Machine Learning • Analyzing data • Good / average in: • Programming • Not good in: • Software engineering
5. Machine Learning • Algorithms: • Logistic Regression • Decision trees • Random forests • Implementations: • Dataiku • Scikit-Learn • Vowpal Wabbit
6. Skill set of a Developer • Strong in: • Software engineering • Programming • Good / average in: • Science (maths / statistics) • Analyzing data • Not good in: • Machine Learning
7. How Developers work • Programming languages • Java • Scala • Development environment • Eclipse • IntelliJ IDEA • Toolbox • Maven • …
8. Workflow • 1. Data Cleansing • 2. Feature Engineering • 3. Train a Machine Learning model (split the dataset into training/validation/test datasets, then train the model) • 4. Apply the model on new data
9. Data Cleansing • Convert strings to numbers/booleans/… • Parse dates • Handle missing values • Handle data in an incorrect format • …
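
In practice, such a cleansing step can be expressed with Spark's DataFrame API. Below is a minimal Scala sketch; the column names ("amount", "active", "birth_date") are purely illustrative, not from the project:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Minimal cleansing sketch; column names are hypothetical
    def cleanse(raw: DataFrame): DataFrame = {
      raw
        .withColumn("amount", col("amount").cast("double"))    // string -> number
        .withColumn("active", col("active").cast("boolean"))   // string -> boolean
        .withColumn("birth_date", to_date(col("birth_date")))  // parse dates
        .na.fill(Map("amount" -> 0.0))                         // handle missing values
        .filter(col("birth_date").isNotNull)                   // drop rows in an incorrect format
    }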
10. Feature Engineering • Transform data into numerical features • E.g.: • A birth date → age • Dates of phone calls → Number of calls • Text → Vector of words • 2 names → Levenshtein distance
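
The examples above map directly to Spark SQL functions. A Scala sketch, assuming hypothetical column names (levenshtein, datediff and current_date are available from Spark 1.5 onwards):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Feature engineering sketch; column names are hypothetical
    def addFeatures(df: DataFrame): DataFrame = {
      df
        // birth date -> age (approximate, in years)
        .withColumn("age", floor(datediff(current_date(), col("birth_date")) / 365.25))
        // 2 names -> Levenshtein distance
        .withColumn("name_distance", levenshtein(col("name"), col("name_on_contract")))
    }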
11. Machine Learning • Train a model • Test an algorithm with different params • Cross validation (Grid Search) • Compare different algorithms, e.g.: • Logistic regression • Gradient boosting trees • Random forest
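
With Spark ML (one possible implementation, besides Dataiku or Scikit-Learn), a grid search with cross-validation looks roughly like the sketch below; the parameter values are arbitrary:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
    import org.apache.spark.sql.DataFrame

    // Grid search + cross-validation sketch (Spark ML)
    def trainModel(training: DataFrame) = {
      val lr = new LogisticRegression()

      val paramGrid = new ParamGridBuilder()
        .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
        .addGrid(lr.maxIter, Array(50, 100))
        .build()

      val cv = new CrossValidator()
        .setEstimator(lr)
        .setEvaluator(new BinaryClassificationEvaluator())  // area under ROC by default
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3)

      cv.fit(training)  // expects "label" and "features" columns
    }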
12. Machine Learning • Evaluate the accuracy of the model • Root Mean Square Error (RMSE) • ROC curve • … • Examine predictions • False positives, false negatives…
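
Both metrics exist as evaluators in Spark ML; a sketch, assuming the predictions DataFrame carries the default "label", "prediction" and "rawPrediction" columns:

    import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}
    import org.apache.spark.sql.DataFrame

    // RMSE for a regression model
    def rmse(predictions: DataFrame): Double =
      new RegressionEvaluator().setMetricName("rmse").evaluate(predictions)

    // Area under the ROC curve for a binary classifier
    def areaUnderROC(predictions: DataFrame): Double =
      new BinaryClassificationEvaluator().setMetricName("areaUnderROC").evaluate(predictions)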
13. Disclaimer • Context of this project: • Not So Big Data (but Smart Data) • No real-time workflows (yet?)
14. Distribute the processing • Data Scientists work with data samples • No constraint on processing time • Processing on the Data Scientist's workstation (IPython Notebook) or on a single server (Dataiku)
15. Distribute the processing • In production: • H/W resources are constrained • Large data sets to process • Spark: • Included in CDH • DataFrames (Spark 1.3+) ≃ Pandas DataFrames • Fast!
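
To show why the DataFrame API feels close to Pandas, here is a small, self-contained Scala sketch (Spark 1.x API; the path and column names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Spark 1.x boilerplate: a local context, for illustration only
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Pandas-like operations: filter, group by, aggregate
    val contracts = sqlContext.read.parquet("hdfs:///data/contracts.parquet")
    contracts
      .filter(contracts("age") > 30)
      .groupBy("region")
      .count()
      .show()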
16. Use a centralized data store • Data Scientists store data on their workstations • Limited storage • Data not shared within the team • Data privacy not enforced • Subject to data losses
17. Use a centralized data store • Store data on HDFS: • Hive tables (SQL) • Parquet files • Security: Kerberos + permissions • Redundant + potentially unlimited storage • Easy access from Spark and Dataiku
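
From Spark, the same storage is reachable through a HiveContext; a sketch, where the table and path names are assumptions:

    import org.apache.spark.sql.hive.HiveContext

    // sc is an existing SparkContext (see the previous sketch)
    val hiveContext = new HiveContext(sc)

    // Read a Hive table, write the result back to HDFS as Parquet
    val customers = hiveContext.table("datalab.customers")
    customers
      .filter(customers("country") === "FR")
      .write
      .parquet("hdfs:///datalab/customers_fr.parquet")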
18. Programming languages • Data Scientists write code on their workstations • This code may not run in the datacenter • Language variety → Hard to share knowledge
19. Programming languages • Use widespread languages • Spark in Python/Scala • Support for R is too young • Provide assistance to ease the adoption!
20. Use an IDE • Notebooks: • Powerful for exploratory work • Weak for code editing and code structuring • Inadequate for code versioning
21. Use an IDE • IntelliJ IDEA / PyCharm • Code compilation • Refactoring • Execution of unit tests • Support for Git
22. Source Control • Data Scientists work on their workstations • Code is not shared • Code may be lost • Intermediate versions are not preserved • Lack of code review
23. Source Control • Git + GitHub / GitLab • Versioning • Easy to go back to a version running in production • Easy sharing (+ permissions) • Code review
24. Packaging the code • Source code has dependencies • Dependencies in production ≠ at dev time • Assemble the code + its dependencies
25. Packaging the code • Freeze the dependencies: • Scala → Maven • Python → Setuptools • Packaging: • Scala → Jar (Maven Shade plugin) • Python → Egg (Setuptools) • Compliant with spark-submit.sh
26. Secure the build process • Data Scientists may commit code… without running tests first! • Quality may decrease over time • Packages built by hand on a workstation are not reproducible
27. Secure the build process • Jenkins • Unit test report • Code coverage report • Packaging: Jar / Egg • Dashboard • Notifications (Slack + email)
28. Automate the process • Data is loaded manually in HDFS: • CSV files, sometimes compressed • Often received by email • Often samples
29. Automate the process • No human intervention should be required • All steps should be code / tools • E.g. automate file transfers, unzipping…
30. Adapt to living data • Data Scientists work with: • Frozen data • Samples • Risks with data received on a regular basis: • Incorrect format (dates, numbers…) • Corrupt data (incl. encoding changes) • Missing values
31. Adapt to living data • Data Checking & Cleansing • Preliminary steps before processing the data • Decide what to do with invalid data • Thetis • Internal tool • Performs most checking & cleansing operations
32. Library of transformations • Dataiku « shakers »: • Parse dates • Split a URL (protocol, host, path, …) • Transform a post code into a city / department name • … • Cannot be used outside Dataiku
33. Library of transformations • All transformations should be code • Reuse transformations between projects • Provide a library • Transformation = DataFrame → DataFrame • Unit tests
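
A possible shape for such a library, sketched in Scala: every transformation is a pure DataFrame → DataFrame function, so it can be reused across projects and tested in isolation. The object, function and column names are illustrative:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    object Transformations {

      // Parse a string column into a date
      def parseDate(column: String)(df: DataFrame): DataFrame =
        df.withColumn(column, to_date(col(column)))

      // Derive an age from a birth date column
      def birthDateToAge(birthDate: String, age: String)(df: DataFrame): DataFrame =
        df.withColumn(age, floor(datediff(current_date(), col(birthDate)) / 365.25))
    }

    // Transformations compose by plain function application:
    // val result = Transformations.birthDateToAge("birth_date", "age")(
    //                Transformations.parseDate("birth_date")(raw))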
34. Unit test the data pipeline • Independent data processing steps • Data pipeline not often tested from beginning to end • Data pipeline easily broken
35. Unit test the data pipeline • Unit test each data transformation stage • Scala: Scalatest • Python: Unittest • Use mock data • Compare DataFrames: • No library (yet?) • Compare lists of lists
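
A ScalaTest sketch of such a test, reusing the hypothetical Transformations.parseDate from the previous sketch and comparing the result as a plain list of values rather than as DataFrames:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.scalatest.FunSuite

    class ParseDateSuite extends FunSuite {

      test("parseDate converts a string column into a date") {
        val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Mock data
        val input = Seq(("alice", "1985-04-03"), ("bob", "1990-11-20")).toDF("name", "birth_date")

        val result = Transformations.parseDate("birth_date")(input)

        // Compare lists of values, not DataFrames
        val dates = result.collect().map(_.getDate(1).toString).toList
        assert(dates === List("1985-04-03", "1990-11-20"))

        sc.stop()
      }
    }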
36. Assemble the workflow • Separate transformation processes: • Transformations applied to some data • Results are frozen and used in other processes • Jobs are launched manually • No built-in scheduler in Spark
37. Assemble the workflow • Oozie: • Spark • Map-Reduce • Shell • … • Scheduling • Alerts • Logs
38. Summary • Keys: • Use industrialization-ready tools • Pair Programming: Data Scientist + Developer • Success criteria: • Lower time to market • Higher processing speed • More robust processes