Data Science meets Software Development

I work in a Data Innovation Lab with a horde of Data Scientists. Data Scientists gather data, clean data, apply Machine Learning algorithms and produce results, all of that with specialized tools (Dataiku, Scikit-Learn, R...). These processes run on a single machine, on data that is fixed in time, and they have no constraint on execution speed.

With my fellow Developers, our goal is to bring these processes to production. Our constraints are very different: we want the code to be versioned, to be tested, to be deployed automatically and to produce logs. We also need it to run in production on distributed architectures (Spark, Hadoop), with fixed versions of languages and frameworks (Scala...), and with data that changes every day.

In this talk, I will explain how we, Developers, work hand-in-hand with Data Scientists to shorten the path to running data workflows in production.


Alexis Seigneurin

August 26, 2015

Transcript

  1. DATA SCIENCE MEETS SOFTWARE DEVELOPMENT Alexis Seigneurin - Ippon Technologies

  2. Who I am • Software engineer for 15 years • Consultant at Ippon Tech in Paris, France • Favorite subjects: Spark, Cassandra, Ansible, Docker • @aseigneurin
  3. • 200 software engineers in France and the US • In the US: offices in DC, NYC and Richmond, Virginia • Digital, Big Data and Cloud applications • Java & Agile expertise • Open-source projects: JHipster, Tatami, etc. • @ipponusa
  4. The project • Data Innovation Lab of a large insurance company • Data → Business value • Team of 30 Data Scientists + Software Developers
  5. Data Scientists • Who they are & How they work

  6. Skill set of a Data Scientist • Strong in: Science (maths / statistics), Machine Learning, Analyzing data • Good / average in: Programming • Not good in: Software engineering
  7. Programming languages • Mostly Python, incl. frameworks: NumPy, Pandas, scikit-learn • SQL • R
  8. Development environments • IPython Notebook

  9. Development environments • Dataiku

  10. Machine Learning • Algorithms: Logistic Regression, Decision trees, Random forests • Implementations: Dataiku, scikit-learn, Vowpal Wabbit
  11. Programmers • Who they are & How they work • http://xkcd.com/378/

  12. Skill set of a Developer • Strong in: Software engineering, Programming • Good / average in: Science (maths / statistics), Analyzing data • Not good in: Machine Learning
  13. How Developers work • Programming languages: Java, Scala • Development environment: Eclipse, IntelliJ IDEA • Toolbox: Maven, …
  14. A typical Data Science project • In the Lab

  15. Workflow • 1. Data Cleansing • 2. Feature Engineering • 3. Train a Machine Learning model (split the dataset into training/validation/test datasets, then train the model) • 4. Apply the model on new data
  16. Data Cleansing • Convert strings to numbers/booleans/… • Parse dates • Handle missing values • Handle data in an incorrect format • …
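As an illustration only (not taken from the deck), a cleansing step in Spark could look like the following Scala sketch; the column names amount, birth_date and active are made up, and to_date assumes Spark 1.5+:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Hypothetical cleansing step: cast strings to typed columns and handle missing values
    def cleanse(raw: DataFrame): DataFrame = {
      raw
        .withColumn("amount", col("amount").cast("double"))    // convert strings to numbers
        .withColumn("birth_date", to_date(col("birth_date")))  // parse dates (Spark 1.5+)
        .withColumn("active", col("active") === "true")        // convert strings to booleans
        .na.fill(Map("amount" -> 0.0))                         // handle missing values
    }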
  17. Feature Engineering • Transform data into numerical features • E.g.: a birth date → age, dates of phone calls → number of calls, text → vector of words, 2 names → Levenshtein distance
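A hedged sketch of such a feature engineering step with Spark SQL functions (column names are hypothetical; datediff, current_date and levenshtein are available from Spark 1.5):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Hypothetical feature engineering: turn raw columns into numerical features
    def buildFeatures(df: DataFrame): DataFrame = {
      df
        .withColumn("age", floor(datediff(current_date(), col("birth_date")) / 365))    // birth date -> age
        .withColumn("name_distance", levenshtein(col("first_name"), col("last_name")))  // 2 names -> Levenshtein distance
    }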
  18. Machine Learning • Train a model • Test an algorithm with different params • Cross validation (Grid Search) • Compare different algorithms, e.g.: Logistic regression, Gradient-boosted trees, Random forest
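For illustration, grid search with cross validation could be written with Spark ML as in this minimal sketch (trainingData is a hypothetical DataFrame that already has "label" and "features" columns; this is not necessarily the setup used on the project):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Hypothetical grid search: cross-validate a logistic regression over a grid of parameters
    val lr = new LogisticRegression().setMaxIter(100)
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .build()
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())   // area under ROC by default
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
    val model = cv.fit(trainingData)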
  19. Machine Learning • Evaluate the accuracy of the model: Root Mean Square Error (RMSE), ROC curve, … • Examine predictions: false positives, false negatives…
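Continuing the sketch above, the model could then be evaluated on held-out data; the metric shown here is the area under the ROC curve, and testData is a hypothetical hold-out DataFrame:

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    // Hypothetical evaluation: area under the ROC curve on held-out data
    val predictions = model.transform(testData)
    val auc = new BinaryClassificationEvaluator()
      .setMetricName("areaUnderROC")
      .evaluate(predictions)
    println(s"Area under ROC = $auc")

    // Examine predictions: e.g. list false positives
    predictions.filter("label = 0.0 AND prediction = 1.0").show()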
  20. Industrialization Cookbook

  21. Disclaimer • Context of this project: Not So Big Data (but Smart Data) • No real-time workflows (yet?)
  22. Distribute the processing (Recipe #1)
  23. Distribute the processing • Data Scientists work with data samples • No constraint on processing time • Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
  24. Distribute the processing • In production: H/W resources are constrained, large data sets to process • Spark: included in CDH, DataFrames (Spark 1.3+) ≃ Pandas DataFrames, fast!
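To give an idea of the resemblance with Pandas, here is a small, hypothetical aggregation with the DataFrame API (the data set and column names are made up; sqlContext is a SQLContext, Spark 1.4+):

    // Hypothetical aggregation: the DataFrame API reads much like Pandas, but runs distributed
    val events = sqlContext.read.json("hdfs:///data/events")
    events
      .filter(events("amount") > 100)
      .groupBy("customer_id")
      .count()
      .show()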
  25. Use a centralized data store (Recipe #2)
  26. Use a centralized data store • Data Scientists store data on their workstations • Limited storage • Data not shared within the team • Data privacy not enforced • Subject to data losses
  27. Use a centralized data store • Store data on HDFS: Hive tables (SQL), Parquet files • Security: Kerberos + permissions • Redundant + potentially unlimited storage • Easy access from Spark and Dataiku
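For illustration, reading and writing such data from Spark might look like this (paths, table names and the cleansed DataFrame are made up; querying Hive tables requires a HiveContext):

    // Hypothetical storage layer: Parquet files and Hive tables on HDFS
    cleansed.write.parquet("hdfs:///lab/datasets/claims_cleansed")     // shared, redundant storage
    val claims = sqlContext.read.parquet("hdfs:///lab/datasets/claims_cleansed")
    val customers = sqlContext.sql("SELECT * FROM customers")          // Hive table (HiveContext)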
  28. Rationalize the use of programming languages (Recipe #3)
  29. Programming languages • Data Scientists write code on their workstations • This code may not run in the datacenter • Language variety → hard to share knowledge
  30. Programming languages • Use widely adopted languages • Spark in Python/Scala • Support for R is too young • Provide assistance to ease the adoption!
  31. Use an IDE (Recipe #4)
  32. Use an IDE • Notebooks are powerful for exploratory work • Weak for code editing and code structuring • Inadequate for code versioning
  33. Use an IDE • IntelliJ IDEA / PyCharm • Code compilation • Refactoring • Execution of unit tests • Support for Git
  34. Source Control (Recipe #5)

  35. Source Control • Data Scientists work on their workstations • Code is not shared • Code may be lost • Intermediate versions are not preserved • Lack of code review
  36. Source Control • Git + GitHub / GitLab • Versioning • Easy to go back to a version running in production • Easy sharing (+ permissions) • Code review
  37. Packaging the code (Recipe #6)
  38. Packaging the code • Source code has dependencies • Dependencies in production ≠ at dev time • Assemble the code + its dependencies
  39. Packaging the code • Freeze the dependencies: Scala → Maven, Python → Setuptools • Packaging: Scala → Jar (Maven Shade plugin), Python → Egg (Setuptools) • Compliant with spark-submit.sh
  40. Secure the build process (Recipe #7)
  41. Secure the build process • Data Scientists may commit code… without running tests first! • Quality may decrease over time • Packages built by hand on a workstation are not reproducible
  42. Secure the build process • Jenkins • Unit test report • Code coverage report • Packaging: Jar / Egg • Dashboard • Notifications (Slack + email)
  43. Automate the process (Recipe #8)
  44. Automate the process • Data is loaded manually into HDFS: CSV files, sometimes compressed • Often received by email • Often samples
  45. Automate the process • No human intervention should be required • All steps should be performed by code / tools • E.g. automate file transfers, unzipping…
  46. Adapt to living data (Recipe #9)
  47. Adapt to living data • Data Scientists work with frozen data and samples • Risks with data received on a regular basis: incorrect format (dates, numbers…), corrupt data (incl. encoding changes), missing values
  48. Adapt to living data • Data Checking & Cleansing • Preliminary steps before processing the data • Decide what to do with invalid data • Thetis: internal tool that performs most checking & cleansing operations
  49. Provide a library of transformations (Recipe #10)
  50. Library of transformations • Dataiku "shakers": parse dates, split a URL (protocol, host, path, …), transform a post code into a city / department name, … • Cannot be used outside Dataiku
  51. Library of transformations • All transformations should be code • Reuse transformations between projects • Provide a library • Transformation = DataFrame → DataFrame • Unit tests
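A sketch of what such a library could look like, each transformation being a DataFrame → DataFrame function (the transformation names and the input DataFrame are made up; the post-code example keeps the first two digits, which identify the French department):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Hypothetical shared library: each transformation is a DataFrame => DataFrame function
    object Transformations {
      def parseDates(column: String)(df: DataFrame): DataFrame =
        df.withColumn(column, to_date(col(column)))

      // French post codes: the first two digits identify the department
      def postCodeToDepartment(column: String)(df: DataFrame): DataFrame =
        df.withColumn("department", substring(col(column), 1, 2))
    }

    // Transformations compose and can be reused across projects
    // (input: a hypothetical DataFrame with birth_date and post_code columns)
    val result = Transformations.postCodeToDepartment("post_code")(
      Transformations.parseDates("birth_date")(input))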
  52. Unit test the data pipeline (Recipe #11)
  53. Unit test the data pipeline • Independent data processing steps • Data pipeline not often tested from beginning to end • Data pipeline easily broken
  54. Unit test the data pipeline • Unit test each data transformation stage • Scala: Scalatest • Python: Unittest • Use mock data • Compare DataFrames: no library (yet?), compare lists of lists
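A minimal Scalatest sketch testing the hypothetical parseDates transformation from the library sketch above, using mock data and comparing collected rows as lists (since there is no DataFrame comparison library):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.FlatSpec

    // Hypothetical unit test: run one transformation on mock data and compare collected rows
    class ParseDatesSpec extends FlatSpec {
      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      "parseDates" should "parse ISO dates" in {
        val input = Seq(("a", "2015-08-26")).toDF("id", "birth_date")
        val result = Transformations.parseDates("birth_date")(input)
        // No DataFrame comparison library: compare lists of values instead
        assert(result.collect().map(_.toSeq.toList).toList ===
          List(List("a", java.sql.Date.valueOf("2015-08-26"))))
      }
    }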
  55. Assemble the Workflow (Recipe #12)
  56. Assemble the Workflow • Separate transformation processes: transformations applied to some data, results are frozen and used in other processes • Jobs are launched manually • No built-in scheduler in Spark
  57. Assemble the workflow • Oozie: Spark, Map-Reduce, Shell, … • Scheduling • Alerts • Logs
  58. Summary & Conclusion

  59. Summary • Keys: use industrialization-ready tools, Pair Programming (Data Scientist + Developer) • Success criteria: lower time to market, higher processing speed, more robust processes
  60. Thank you! @aseigneurin - @ipponusa