Data Science meets Software Development

I work in a Data Innovation Lab with a horde of Data Scientists. Data Scientists gather data, clean data, apply Machine Learning algorithms and produce results, all of that with specialized tools (Dataiku, Scikit-Learn, R...). These processes run on a single machine, on data that is fixed in time, and they have no constraint on execution speed.

With my fellow Developers, our goal is to bring these processes to production. Our constraints are very different: we want the code to be versioned, to be tested, to be deployed automatically and to produce logs. We also need it to run in production on distributed architectures (Spark, Hadoop), with fixed versions of languages and frameworks (Scala...), and with data that changes every day.

In this talk, I will explain how we, Developers, work hand-in-hand with Data Scientists to shorten the path to running data workflows in production.


Alexis Seigneurin

August 26, 2015

Transcript

  1. DATA SCIENCE MEETS SOFTWARE DEVELOPMENT Alexis Seigneurin - Ippon Technologies

  2. Who I am • Software engineer for 15 years • Consultant at Ippon Tech in Paris, France • Favorite subjects: Spark, Cassandra, Ansible, Docker • @aseigneurin
  3. • 200 software engineers in France and the US • In the US: offices in DC, NYC and Richmond, Virginia • Digital, Big Data and Cloud applications • Java & Agile expertise • Open-source projects: JHipster, Tatami, etc. • @ipponusa
  4. The project • Data Innovation Lab of a large insurance company • Data → Business value • Team of 30 Data Scientists + Software Developers
  5. Data Scientists • Who they are & How they work

  6. Skill set of a Data Scientist • Strong in: Science (maths / statistics), Machine Learning, Analyzing data • Good / average in: Programming • Not good in: Software engineering
  7. Programming languages • Mostly Python, incl. frameworks: NumPy, Pandas, scikit-learn • SQL • R
  8. Development environments • IPython Notebook

  9. Development environments • Dataiku

  10. Machine Learning • Algorithms: Logistic Regression, Decision trees, Random forests • Implementations: Dataiku, scikit-learn, Vowpal Wabbit
  11. Programmers • Who they are & How they work • http://xkcd.com/378/

  12. Skill set of a Developer • Strong in: Software engineering, Programming • Good / average in: Science (maths / statistics), Analyzing data • Not good in: Machine Learning
  13. How Developers work • Programming languages: Java, Scala • Development environment: Eclipse, IntelliJ IDEA • Toolbox: Maven, …
  14. A typical Data Science project • In the Lab

  15. Workflow • 1. Data Cleansing • 2. Feature Engineering • 3. Train a Machine Learning model (split the dataset into training/validation/test datasets, then train the model) • 4. Apply the model on new data
  16. Data Cleansing • Convert strings to numbers/booleans/… • Parse dates • Handle missing values • Handle data in an incorrect format • …
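As an illustration only (not taken from the deck), a cleansing step in Spark could look like the following Scala sketch; the column names amount, birth_date and active are made up, and to_date assumes Spark 1.5+:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Hypothetical cleansing step: cast strings to typed columns and handle missing values
    def cleanse(raw: DataFrame): DataFrame = {
      raw
        .withColumn("amount", col("amount").cast("double"))    // convert strings to numbers
        .withColumn("birth_date", to_date(col("birth_date")))  // parse dates (Spark 1.5+)
        .withColumn("active", col("active") === "true")        // convert strings to booleans
        .na.fill(Map("amount" -> 0.0))                         // handle missing values
    }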
  17. Feature Engineering • Transform data into numerical features • E.g.: a birth date → age, dates of phone calls → number of calls, text → vector of words, 2 names → Levenshtein distance
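A hedged sketch of such a feature engineering step with Spark SQL functions (column names are hypothetical; datediff, current_date and levenshtein are available from Spark 1.5):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Hypothetical feature engineering: turn raw columns into numerical features
    def buildFeatures(df: DataFrame): DataFrame = {
      df
        .withColumn("age", floor(datediff(current_date(), col("birth_date")) / 365))    // birth date -> age
        .withColumn("name_distance", levenshtein(col("first_name"), col("last_name")))  // 2 names -> Levenshtein distance
    }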
  18. Machine Learning • Train a model • Test an algorithm with different params • Cross validation (Grid Search) • Compare different algorithms, e.g.: Logistic regression, Gradient-boosted trees, Random forest
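For illustration, grid search with cross validation could be written with Spark ML as in this minimal sketch (trainingData is a hypothetical DataFrame that already has "label" and "features" columns; this is not necessarily the setup used on the project):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Hypothetical grid search: cross-validate a logistic regression over a grid of parameters
    val lr = new LogisticRegression().setMaxIter(100)
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .build()
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())   // area under ROC by default
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
    val model = cv.fit(trainingData)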
  19. Machine Learning • Evaluate the accuracy of the model: Root Mean Square Error (RMSE), ROC curve, … • Examine predictions: false positives, false negatives…
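Continuing the sketch above, the model could then be evaluated on held-out data; the metric shown here is the area under the ROC curve, and testData is a hypothetical hold-out DataFrame:

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    // Hypothetical evaluation: area under the ROC curve on held-out data
    val predictions = model.transform(testData)
    val auc = new BinaryClassificationEvaluator()
      .setMetricName("areaUnderROC")
      .evaluate(predictions)
    println(s"Area under ROC = $auc")

    // Examine predictions: e.g. list false positives
    predictions.filter("label = 0.0 AND prediction = 1.0").show()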
  20. Industrialization Cookbook

  21. Disclaimer • Context of this project: Not So Big Data (but Smart Data) • No real-time workflows (yet?)
  22. Distribute the processing (Recipe #1)
  23. Distribute the processing • Data Scientists work with data samples • No constraint on processing time • Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
  24. Distribute the processing • In production: H/W resources are constrained, large data sets to process • Spark: included in CDH, DataFrames (Spark 1.3+) ≃ Pandas DataFrames, fast!
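To give an idea of the resemblance with Pandas, here is a small, hypothetical aggregation with the DataFrame API (the data set and column names are made up; sqlContext is a SQLContext, Spark 1.4+):

    // Hypothetical aggregation: the DataFrame API reads much like Pandas, but runs distributed
    val events = sqlContext.read.json("hdfs:///data/events")
    events
      .filter(events("amount") > 100)
      .groupBy("customer_id")
      .count()
      .show()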
  25. Use a centralized data store (Recipe #2)
  26. Use a centralized data store • Data Scientists store data on their workstations • Limited storage • Data not shared within the team • Data privacy not enforced • Subject to data losses
  27. Use a centralized data store • Store data on HDFS: Hive tables (SQL), Parquet files • Security: Kerberos + permissions • Redundant + potentially unlimited storage • Easy access from Spark and Dataiku
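For illustration, reading and writing such data from Spark might look like this (paths, table names and the cleansed DataFrame are made up; querying Hive tables requires a HiveContext):

    // Hypothetical storage layer: Parquet files and Hive tables on HDFS
    cleansed.write.parquet("hdfs:///lab/datasets/claims_cleansed")     // shared, redundant storage
    val claims = sqlContext.read.parquet("hdfs:///lab/datasets/claims_cleansed")
    val customers = sqlContext.sql("SELECT * FROM customers")          // Hive table (HiveContext)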
  28. Rationalize the use of programming languages (Recipe #3)
  29. Programming languages • Data Scientists write code on their workstations • This code may not run in the datacenter • Language variety → hard to share knowledge
  30. Programming languages • Use widely adopted languages • Spark in Python/Scala • Support for R is too young • Provide assistance to ease the adoption!
  31. Use an IDE (Recipe #4)
  32. Use an IDE • Notebooks are powerful for exploratory work • Weak for code editing and code structuring • Inadequate for code versioning
  33. Use an IDE • IntelliJ IDEA / PyCharm • Code compilation • Refactoring • Execution of unit tests • Support for Git
  34. Source Control (Recipe #5)

  35. Source Control • Data Scientists work on their workstations • Code is not shared • Code may be lost • Intermediate versions are not preserved • Lack of code review
  36. Source Control • Git + GitHub / GitLab • Versioning • Easy to go back to a version running in production • Easy sharing (+ permissions) • Code review
  37. Packaging the code (Recipe #6)
  38. Packaging the code • Source code has dependencies • Dependencies in production ≠ at dev time • Assemble the code + its dependencies
  39. Packaging the code • Freeze the dependencies: Scala → Maven, Python → Setuptools • Packaging: Scala → Jar (Maven Shade plugin), Python → Egg (Setuptools) • Compliant with spark-submit.sh
  40. Secure the build process (Recipe #7)
  41. Secure the build process • Data Scientists may commit code… without running tests first! • Quality may decrease over time • Packages built by hand on a workstation are not reproducible
  42. Secure the build process • Jenkins • Unit test report • Code coverage report • Packaging: Jar / Egg • Dashboard • Notifications (Slack + email)
  43. Automate the process (Recipe #8)
  44. Automate the process • Data is loaded manually into HDFS: CSV files, sometimes compressed • Often received by email • Often samples
  45. Automate the process • No human intervention should be required • All steps should be performed by code / tools • E.g. automate file transfers, unzipping…
  46. Adapt to living data (Recipe #9)
  47. Adapt to living data • Data Scientists work with frozen data and samples • Risks with data received on a regular basis: incorrect format (dates, numbers…), corrupt data (incl. encoding changes), missing values
  48. Adapt to living data • Data Checking & Cleansing • Preliminary steps before processing the data • Decide what to do with invalid data • Thetis: internal tool that performs most checking & cleansing operations
  49. Provide a library of transformations (Recipe #10)
  50. Library of transformations • Dataiku "shakers": parse dates, split a URL (protocol, host, path, …), transform a post code into a city / department name, … • Cannot be used outside Dataiku
  51. Library of transformations • All transformations should be code • Reuse transformations between projects • Provide a library • Transformation = DataFrame → DataFrame • Unit tests
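A sketch of what such a library could look like, each transformation being a DataFrame → DataFrame function (the transformation names and the input DataFrame are made up; the post-code example keeps the first two digits, which identify the French department):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Hypothetical shared library: each transformation is a DataFrame => DataFrame function
    object Transformations {
      def parseDates(column: String)(df: DataFrame): DataFrame =
        df.withColumn(column, to_date(col(column)))

      // French post codes: the first two digits identify the department
      def postCodeToDepartment(column: String)(df: DataFrame): DataFrame =
        df.withColumn("department", substring(col(column), 1, 2))
    }

    // Transformations compose and can be reused across projects
    // (input: a hypothetical DataFrame with birth_date and post_code columns)
    val result = Transformations.postCodeToDepartment("post_code")(
      Transformations.parseDates("birth_date")(input))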
  52. Unit test the data pipeline (Recipe #11)
  53. Unit test the data pipeline • Independent data processing steps • Data pipeline not often tested from beginning to end • Data pipeline easily broken
  54. Unit test the data pipeline • Unit test each data transformation stage • Scala: Scalatest • Python: Unittest • Use mock data • Compare DataFrames: no library (yet?), compare lists of lists
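A minimal Scalatest sketch testing the hypothetical parseDates transformation from the library sketch above, using mock data and comparing collected rows as lists (since there is no DataFrame comparison library):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.FlatSpec

    // Hypothetical unit test: run one transformation on mock data and compare collected rows
    class ParseDatesSpec extends FlatSpec {
      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      "parseDates" should "parse ISO dates" in {
        val input = Seq(("a", "2015-08-26")).toDF("id", "birth_date")
        val result = Transformations.parseDates("birth_date")(input)
        // No DataFrame comparison library: compare lists of values instead
        assert(result.collect().map(_.toSeq.toList).toList ===
          List(List("a", java.sql.Date.valueOf("2015-08-26"))))
      }
    }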
  55. Assemble the Workflow (Recipe #12)
  56. Assemble the Workflow • Separate transformation processes: transformations applied to some data, results are frozen and used in other processes • Jobs are launched manually • No built-in scheduler in Spark
  57. Assemble the workflow • Oozie: Spark, Map-Reduce, Shell, … • Scheduling • Alerts • Logs
  58. Summary & Conclusion

  59. Summary • Keys: use industrialization-ready tools, Pair Programming (Data Scientist + Developer) • Success criteria: lower time to market, higher processing speed, more robust processes
  60. Thank you! @aseigneurin - @ipponusa