DATA SCIENCE MEETS
SOFTWARE DEVELOPMENT
Alexis Seigneurin - Ippon Technologies
Who I am
• Software engineer for 15 years
• Consultant at Ippon Tech in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
Data Scientists
Who they are & how they work
Skill set of a Data Scientist
• Strong in:
• Science (maths / statistics)
• Machine Learning
• Analyzing data
• Good / average in:
• Programming
• Not good in:
• Software engineering
Programming languages
• Mostly Python, incl. frameworks:
• NumPy
• Pandas
• scikit-learn
• SQL
• R
Programmers
Who they are & how they work
http://xkcd.com/378/
Skill set of a Developer
• Strong in:
• Software engineering
• Programming
• Good / average in:
• Science (maths / statistics)
• Analyzing data
• Not good in:
• Machine Learning
How Developers work
• Programming languages
• Java
• Scala
• Development environment
• Eclipse
• IntelliJ IDEA
• Toolbox
• Maven
• …
A typical Data Science project
In the Lab
Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model
1. Split the dataset: training/validation/test datasets
2. Train the model
4. Apply the model on new data
Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
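
For illustration, a minimal Pandas sketch of such a cleansing step (the column names are made up):

import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Convert strings to numbers / booleans; incorrect formats become NaN
    out["premium"] = pd.to_numeric(out["premium"], errors="coerce")
    out["is_active"] = out["is_active"].map({"yes": True, "no": False})
    # Parse dates
    out["birth_date"] = pd.to_datetime(out["birth_date"], errors="coerce")
    # Handle missing values
    out["premium"] = out["premium"].fillna(out["premium"].median())
    return out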
Feature Engineering
• Transform data into numerical features
• E.g.:
• A birth date → age
• Dates of phone calls → Number of calls
• Text → Vector of words
• 2 names → Levenshtein distance
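
A sketch of what these transformations can look like in Python (column names are hypothetical; the Levenshtein distance is the classic two-row dynamic-programming version):

import pandas as pd

def add_age(df: pd.DataFrame) -> pd.DataFrame:
    # Birth date -> age in years
    out = df.copy()
    out["age"] = (pd.Timestamp.now() - out["birth_date"]).dt.days // 365
    return out

def levenshtein(a: str, b: str) -> int:
    # Edit distance between 2 names, e.g. to detect near-duplicates
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

# Text -> vector of words: e.g. CountVectorizer from scikit-learn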
Machine Learning
• Train a model
• Test an algorithm with different parameters
• Cross validation (Grid Search)
• Compare different algorithms, e.g.:
• Logistic regression
• Gradient boosting trees
• Random forest
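
A minimal scikit-learn sketch of this model selection step: a cross-validated grid search over the three algorithms, on toy data standing in for the engineered features:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy dataset + training/test split (validation happens inside the cross validation)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (GradientBoostingClassifier(), {"n_estimators": [50, 100]}),
    (RandomForestClassifier(), {"n_estimators": [50, 100], "max_depth": [None, 10]}),
]

best = None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    print(type(estimator).__name__, search.best_score_, search.best_params_)
    if best is None or search.best_score_ > best.best_score_:
        best = search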
Machine Learning
• Evaluate the accuracy of the model
• Root Mean Square Error (RMSE)
• ROC curve
• …
• Examine predictions
• False positives, false negatives…
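
Continuing the previous sketch, the evaluation step with scikit-learn metrics:

import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

y_pred = best.predict(X_test)
y_score = best.predict_proba(X_test)[:, 1]

rmse = np.sqrt(mean_squared_error(y_test, y_pred))          # Root Mean Square Error
auc = roc_auc_score(y_test, y_score)                        # area under the ROC curve
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()   # examine the predictions
print(f"RMSE={rmse:.3f}  AUC={auc:.3f}  false positives={fp}  false negatives={fn}")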
Industrialization
Cookbook
Disclaimer
• Context of this project:
• Not So Big Data (but Smart Data)
• No real-time workflows (yet?)
Distribute the processing
RECIPE #1
Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
Distribute the processing
• In production:
• H/W resources are constrained
• Large data sets to process
• Spark:
• Included in CDH
• DataFrames (Spark 1.3+) ≃ Pandas DataFrames
• Fast!
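
To illustrate the analogy, the same aggregation written with Pandas on a sample and with a Spark DataFrame on the full dataset (current PySpark API; paths and columns are made up):

import pandas as pd
from pyspark.sql import SparkSession

# Pandas: a sample on the Data Scientist's workstation
sample = pd.read_csv("sample.csv")
mean_by_region = sample.groupby("region")["amount"].mean()

# Spark: the full dataset, distributed on the cluster
spark = SparkSession.builder.appName("demo").getOrCreate()
full = spark.read.csv("hdfs:///data/full.csv", header=True, inferSchema=True)
full.groupBy("region").avg("amount").show()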
Use a centralized data store
RECIPE #2
Use a centralized data store
• Data Scientists store data on their workstations
• Limited storage
• Data not shared within the team
• Data privacy not enforced
• Subject to data losses
Use a centralized data store
• Store data on HDFS:
• Hive tables (SQL)
• Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
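
A sketch of what reading and writing to the shared store looks like from Spark (database, table and path names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("store").enableHiveSupport().getOrCreate()
df = spark.read.csv("hdfs:///incoming/claims.csv", header=True, inferSchema=True)

# Parquet files on HDFS
df.write.mode("overwrite").parquet("hdfs:///datalake/claims.parquet")

# Hive table, queryable in SQL
df.write.mode("overwrite").saveAsTable("lab.claims")
spark.sql("SELECT COUNT(*) FROM lab.claims").show()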
Rationalize the use of programming languages
RECIPE #3
Programming languages
• Data Scientists write code on their workstations
• This code may not run in the datacenter
• Language variety → Hard to share knowledge
Programming languages
• Use widely spread languages
• Spark in Python/Scala
• Support for R is too young
• Provide assistance to ease the adoption!
Use an IDE
RECIPE #4
Use an IDE
• Notebooks:
• Powerful for exploratory work
• Weak for code editing and code structuring
• Inadequate for code versioning
Use an IDE
• IntelliJ IDEA / PyCharm
• Code compilation
• Refactoring
• Execution of unit tests
• Support for Git
Source Control
RECIPE #5
Source Control
• Data Scientists work on their workstations
• Code is not shared
• Code may be lost
• Intermediate versions are not preserved
• Lack of code review
Source Control
• Git + GitHub / GitLab
• Versioning
• Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
Packaging the code
RECIPE #6
Packaging the code
• Source code has dependencies
• Dependencies in production ≠ at dev time
• Assemble the code + its dependencies
Packaging the code
• Freeze the dependencies:
• Scala → Maven
• Python → Setuptools
• Packaging:
• Scala → Jar (Maven Shade plugin)
• Python → Egg (Setuptools)
• Compliant with spark-submit
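
On the Python side, a minimal setup.py sketch with frozen dependencies; the package name and versions are illustrative:

from setuptools import find_packages, setup

setup(
    name="datalab-pipeline",
    version="1.0.0",
    packages=find_packages(),
    install_requires=[
        "numpy==1.24.4",   # frozen versions: production == dev
        "pandas==2.0.3",
    ],
)

# Build the egg:  python setup.py bdist_egg
# Ship it:        spark-submit --py-files dist/datalab_pipeline-1.0.0-py3.11.egg main.py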
Secure the build process
RECIPE #7
Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
Secure the build process
• Jenkins
• Unit test report
• Code coverage report
• Packaging: Jar / Egg
• Dashboard
• Notifications (Slack + email)
Automate the process
RECIPE #8
Automate the process
• Data is loaded manually in HDFS:
• CSV files, sometimes compressed
• Often received by email
• Often samples
Automate the process
• No human intervention should be required
• All steps should be performed by code / tools
• E.g. automate file transfers, unzipping…
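
A sketch of that kind of automation: fetch a compressed CSV, unzip it and push it to HDFS with no manual step (URL and paths are made up):

import subprocess
import urllib.request
import zipfile

def ingest(url: str, hdfs_dir: str) -> None:
    local_zip, _ = urllib.request.urlretrieve(url, "/tmp/incoming.zip")
    with zipfile.ZipFile(local_zip) as archive:
        archive.extractall("/tmp/incoming")
    # Standard HDFS CLI; a WebHDFS client would work too
    subprocess.run(["hdfs", "dfs", "-put", "-f", "/tmp/incoming", hdfs_dir], check=True)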
Adapt to living data
RECIPE #9
Adapt to living data
• Data Scientists work with:
• Frozen data
• Samples
• Risks with data received on a regular basis:
• Incorrect format (dates, numbers…)
• Corrupt data (incl. encoding changes)
• Missing values
Adapt to living data
• Data Checking & Cleansing
• Preliminary steps before processing the data
• Decide what to do with invalid data
• Thetis
• Internal tool
• Performs most checking & cleansing operations
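
Thetis is an internal tool, but a checking step boils down to rules of this kind (column names and rules are hypothetical):

import pandas as pd

def check(df: pd.DataFrame) -> pd.DataFrame:
    # Flag rows that must not enter the pipeline
    invalid = pd.Series(False, index=df.index)
    invalid |= pd.to_datetime(df["contract_date"], errors="coerce").isna()  # incorrect dates
    invalid |= pd.to_numeric(df["premium"], errors="coerce").isna()         # incorrect numbers
    invalid |= df["customer_id"].isna()                                     # missing values
    # Decide what to do with invalid data: here, simply drop it
    return df[~invalid]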
Provide a library of transformations
RECIPE #10
Library of transformations
• Dataiku "shakers":
• Parse dates
• Split a URL (protocol, host, path, …)
• Transform a post code into a city / department name
• …
• Cannot be used outside Dataiku
Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library
• Transformation = DataFrame → DataFrame
• Unit tests
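
A sketch of what such reusable transformations can look like: pure functions from DataFrame to DataFrame (column names are hypothetical):

from pyspark.sql import DataFrame, functions as F

def parse_date(df: DataFrame, column: str, fmt: str = "yyyy-MM-dd") -> DataFrame:
    # Replace a string column by its parsed date
    return df.withColumn(column, F.to_date(F.col(column), fmt))

def split_url(df: DataFrame, column: str) -> DataFrame:
    # Add protocol and host columns extracted from a URL column
    return (df
            .withColumn("protocol", F.regexp_extract(F.col(column), r"^(\w+)://", 1))
            .withColumn("host", F.regexp_extract(F.col(column), r"://([^/]+)", 1)))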
Unit test the data pipeline
RECIPE #11
Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
Unit test the data pipeline
• Unit test each data transformation stage
• Scala: ScalaTest
• Python: unittest
• Use mock data
• Compare DataFrames:
• No library (yet?)
• Compare lists of lists
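
A sketch of such a unit test with unittest: mock data in, DataFrames compared as lists of values (assumes the parse_date function above lives in a hypothetical transformations module):

import datetime
import unittest
from pyspark.sql import SparkSession
from transformations import parse_date  # hypothetical module holding the library

class ParseDateTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_parse_date(self):
        df = self.spark.createDataFrame([("2015-10-28",)], ["event_date"])  # mock data
        result = parse_date(df, "event_date")
        # Compare as lists of lists rather than as DataFrames
        self.assertEqual([list(row) for row in result.collect()],
                         [[datetime.date(2015, 10, 28)]])

if __name__ == "__main__":
    unittest.main()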
Assemble the Workflow
RECIPE #12
Assemble the Workflow
• Separate transformation processes:
• Transformations applied to some data
• Results are frozen and used in other processes
• Jobs are launched manually
• No built-in scheduler in Spark