Slide 1

DATA SCIENCE MEETS SOFTWARE DEVELOPMENT
Alexis Seigneurin - Ippon Technologies

Slide 2

Who I am
• Software engineer for 15 years
• Consultant at Ippon Tech in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin

Slide 3

• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa

Slide 4

The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers

Slide 5

Data Scientists
Who they are & How they work

Slide 6

Skill set of a Data Scientist
• Strong in:
  • Science (maths / statistics)
  • Machine Learning
  • Analyzing data
• Good / average in:
  • Programming
• Not good in:
  • Software engineering

Slide 7

Programming languages
• Mostly Python, incl. frameworks:
  • NumPy
  • Pandas
  • scikit-learn
• SQL
• R

Slide 8

Development environments
• IPython Notebook

Slide 9

Development environments
• Dataiku

Slide 10

Machine Learning
• Algorithms:
  • Logistic Regression
  • Decision trees
  • Random forests
• Implementations:
  • Dataiku
  • scikit-learn
  • Vowpal Wabbit

Slide 11

Programmers
Who they are & How they work
http://xkcd.com/378/

Slide 12

Skill set of a Developer
• Strong in:
  • Software engineering
  • Programming
• Good / average in:
  • Science (maths / statistics)
  • Analyzing data
• Not good in:
  • Machine Learning

Slide 13

How Developers work
• Programming languages:
  • Java
  • Scala
• Development environment:
  • Eclipse
  • IntelliJ IDEA
• Toolbox:
  • Maven
  • …

Slide 14

A typical Data Science project
In the Lab

Slide 15

Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model
   1. Split the dataset: training/validation/test datasets
   2. Train the model
4. Apply the model on new data
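
For illustration, a minimal scikit-learn sketch of steps 3 and 4 (X and y stand for a feature matrix and a label vector, which are not part of the deck; the 60/20/20 split ratio is an assumption):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # 3.1 - split the dataset: 60% training, 20% validation, 20% test
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

    # 3.2 - train the model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # 4 - apply the model on new data
    predictions = model.predict(X_test)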

Slide 16

Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
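
As a rough pandas sketch of these cleansing steps (the file and column names are made up for illustration):

    import pandas as pd

    df = pd.read_csv("clients.csv")  # hypothetical input file
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")           # strings -> numbers
    df["active"] = df["active"].map({"yes": True, "no": False})          # strings -> booleans
    df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce")  # parse dates
    df["amount"] = df["amount"].fillna(df["amount"].median())            # handle missing values
    df = df.dropna(subset=["birth_date"])  # drop rows whose date was in an incorrect format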

Slide 17

Feature Engineering
• Transform data into numerical features
• E.g.:
  • A birth date → age
  • Dates of phone calls → Number of calls
  • Text → Vector of words
  • 2 names → Levenshtein distance
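
A possible pandas/scikit-learn sketch of the first three examples (file, column and variable names are invented):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.read_csv("clients.csv", parse_dates=["birth_date"])  # hypothetical inputs
    calls = pd.read_csv("calls.csv", parse_dates=["call_date"])

    # birth date -> age (approximate, in years)
    df["age"] = (pd.Timestamp("now") - df["birth_date"]).dt.days // 365

    # dates of phone calls -> number of calls per client
    df["nb_calls"] = df["client_id"].map(calls.groupby("client_id").size()).fillna(0)

    # text -> vector of word counts
    word_vectors = CountVectorizer().fit_transform(df["comments"].fillna(""))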

Slide 18

Machine Learning
• Train a model
• Test an algorithm with different params
• Cross validation (Grid Search)
• Compare different algorithms, e.g.:
  • Logistic regression
  • Gradient boosting trees
  • Random forest
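
A minimal grid-search sketch with scikit-learn, tuning a random forest (the parameter grid is arbitrary; X_train and y_train come from the split sketched earlier):

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    # cross-validated grid search over a few hyper-parameter combinations
    params = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}
    search = GridSearchCV(RandomForestClassifier(), params, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)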

Slide 19

Machine Learning
• Evaluate the accuracy of the model:
  • Root Mean Square Error (RMSE)
  • ROC curve
  • …
• Examine predictions
  • False positives, false negatives…
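
A sketch of these evaluation steps with scikit-learn (RMSE applies to a regression model, the ROC curve and confusion matrix to a binary classifier; y_test, predictions and the fitted `search` come from the previous sketches):

    import numpy as np
    from sklearn.metrics import mean_squared_error, roc_curve, confusion_matrix

    # RMSE, for a regression model
    rmse = np.sqrt(mean_squared_error(y_test, predictions))

    # ROC curve, for a binary classifier with predicted probabilities
    probas = search.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, probas)

    # examine false positives / false negatives
    tn, fp, fn, tp = confusion_matrix(y_test, search.predict(X_test)).ravel()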

Slide 20

Industrialization Cookbook

Slide 21

Disclaimer
• Context of this project:
  • Not So Big Data (but Smart Data)
  • No real-time workflows (yet?)

Slide 22

RECIPE #1: Distribute the processing

Slide 23

Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)

Slide 24

Distribute the processing
• In production:
  • H/W resources are constrained
  • Large data sets to process
• Spark:
  • Included in CDH
  • DataFrames (Spark 1.3+) ≃ Pandas DataFrames
  • Fast!
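
A small PySpark sketch of the Pandas-like DataFrame API (written against the modern SparkSession entry point rather than the SQLContext of the Spark 1.3 era; the file path and columns are illustrative):

    from pyspark.sql import SparkSession

    # Spark distributes the processing below across the cluster
    spark = SparkSession.builder.appName("pipeline").getOrCreate()
    df = spark.read.csv("hdfs:///data/clients.csv", header=True, inferSchema=True)

    # Pandas-like operations, executed in a distributed fashion
    df.filter(df["age"] >= 18).groupBy("city").count().show()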

Slide 25

RECIPE #2: Use a centralized data store

Slide 26

Use a centralized data store
• Data Scientists store data on their workstations
• Limited storage
• Data not shared within the team
• Data privacy not enforced
• Subject to data losses

Slide 27

Use a centralized data store
• Store data on HDFS:
  • Hive tables (SQL)
  • Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
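
Continuing the PySpark sketch above, storing results centrally (paths and table names are made up):

    # write the data to HDFS as Parquet files
    df.write.parquet("hdfs:///datalake/clients.parquet")

    # or register it as a Hive table, queryable in SQL
    df.write.saveAsTable("clients")
    spark.sql("SELECT city, COUNT(*) FROM clients GROUP BY city").show()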

Slide 28

RECIPE #3: Rationalize the use of programming languages

Slide 29

Programming languages
• Data Scientists write code on their workstations
• This code may not run in the datacenter
• Language variety → Hard to share knowledge

Slide 30

Programming languages
• Use widely adopted languages
• Spark in Python/Scala
  • Support for R is too young
• Provide assistance to ease adoption!

Slide 31

RECIPE #4: Use an IDE

Slide 32

Use an IDE
• Notebooks:
  • Powerful for exploratory work
  • Weak for code editing and code structuring
  • Inadequate for code versioning

Slide 33

Use an IDE
• IntelliJ IDEA / PyCharm:
  • Code compilation
  • Refactoring
  • Execution of unit tests
  • Support for Git

Slide 34

RECIPE #5: Source Control

Slide 35

Source Control
• Data Scientists work on their workstations
• Code is not shared
• Code may be lost
• Intermediate versions are not preserved
• Lack of code review

Slide 36

Source Control
• Git + GitHub / GitLab
• Versioning
  • Easy to go back to a version running in production
• Easy sharing (+ permissions)
• Code review

Slide 37

RECIPE #6: Packaging the code

Slide 38

Packaging the code
• Source code has dependencies
• Dependencies in production ≠ at dev time
• Assemble the code + its dependencies

Slide 39

Packaging the code
• Freeze the dependencies:
  • Scala → Maven
  • Python → Setuptools
• Packaging:
  • Scala → Jar (Maven Shade plugin)
  • Python → Egg (Setuptools)
• Compliant with spark-submit
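
On the Python side, a minimal Setuptools sketch (the project name and dependency list are invented):

    from setuptools import setup, find_packages

    setup(
        name="datalab-pipeline",               # hypothetical project name
        version="1.0.0",
        packages=find_packages(),
        install_requires=["numpy", "pandas"],  # frozen dependencies
    )

Building with "python setup.py bdist_egg" produces an egg under dist/ that can be shipped to the cluster via the --py-files option of spark-submit.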

Slide 40

RECIPE #7: Secure the build process

Slide 41

Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible

Slide 42

Secure the build process
• Jenkins:
  • Unit test report
  • Code coverage report
  • Packaging: Jar / Egg
  • Dashboard
  • Notifications (Slack + email)

Slide 43

RECIPE #8: Automate the process

Slide 44

Automate the process
• Data is loaded manually into HDFS:
  • CSV files, sometimes compressed
  • Often received by email
  • Often samples

Slide 45

Automate the process
• No human intervention should be required
• All steps should be performed by code / tools
• E.g. automate file transfers, unzipping…
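
For instance, a small Python sketch that unzips incoming files and pushes them to HDFS (the directories are invented; it relies on the standard "hdfs dfs" CLI):

    import glob
    import gzip
    import shutil
    import subprocess

    # unzip every incoming file, then load it into HDFS - no human intervention
    for path in glob.glob("/incoming/*.csv.gz"):
        csv_path = path[:-3]  # strip the .gz suffix
        with gzip.open(path, "rb") as src, open(csv_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        subprocess.check_call(["hdfs", "dfs", "-put", csv_path, "/datalake/incoming/"])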

Slide 46

RECIPE #9: Adapt to living data

Slide 47

Adapt to living data
• Data Scientists work with:
  • Frozen data
  • Samples
• Risks with data received on a regular basis:
  • Incorrect format (dates, numbers…)
  • Corrupt data (incl. encoding changes)
  • Missing values

Slide 48

Adapt to living data
• Data Checking & Cleansing:
  • Preliminary steps before processing the data
  • Decide what to do with invalid data
• Thetis:
  • Internal tool
  • Performs most checking & cleansing operations

Slide 49

RECIPE #10: Provide a library of transformations

Slide 50

Library of transformations
• Dataiku "shakers":
  • Parse dates
  • Split a URL (protocol, host, path, …)
  • Transform a postal code into a city / department name
  • …
• Cannot be used outside Dataiku

Slide 51

Library of transformations
• All transformations should be expressed as code
• Reuse transformations between projects
• Provide a library:
  • Transformation = DataFrame → DataFrame
  • Unit tests
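
A minimal PySpark sketch of the DataFrame → DataFrame convention (the function and column names are illustrative, not the project's actual library):

    from pyspark.sql import DataFrame
    from pyspark.sql.functions import to_date

    def parse_dates(df: DataFrame, column: str) -> DataFrame:
        """Reusable transformation: takes a DataFrame, returns a DataFrame."""
        return df.withColumn(column, to_date(df[column]))

    # transformations sharing this signature compose naturally
    clean = parse_dates(df, "birth_date")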

Slide 52

RECIPE #11: Unit test the data pipeline

Slide 53

Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken

Slide 54

Unit test the data pipeline
• Unit test each data transformation stage
  • Scala: ScalaTest
  • Python: unittest
• Use mock data
• Compare DataFrames:
  • No library (yet?)
  • Compare lists of lists
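
A sketch of such a test with unittest, using mock data and comparing collected rows as plain Python values (parse_dates is the hypothetical transformation sketched earlier):

    import datetime
    import unittest
    from pyspark.sql import SparkSession
    from pipeline.transformations import parse_dates  # hypothetical library module

    class ParseDatesTest(unittest.TestCase):
        def test_parse_dates(self):
            spark = SparkSession.builder.master("local[1]").getOrCreate()
            df = spark.createDataFrame([("1980-01-01",)], ["birth_date"])
            result = parse_dates(df, "birth_date")
            # no DataFrame comparison library: compare lists of plain values
            self.assertEqual([row.asDict() for row in result.collect()],
                             [{"birth_date": datetime.date(1980, 1, 1)}])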

Slide 55

RECIPE #12: Assemble the Workflow

Slide 56

Assemble the Workflow
• Separate transformation processes:
  • Transformations applied to some data
  • Results are frozen and used in other processes
• Jobs are launched manually
• No built-in scheduler in Spark

Slide 57

Assemble the workflow
• Oozie:
  • Spark
  • Map-Reduce
  • Shell
  • …
• Scheduling
• Alerts
• Logs

Slide 58

Summary & Conclusion

Slide 59

Summary
• Keys:
  • Use industrialization-ready tools
  • Pair Programming: Data Scientist + Developer
• Success criteria:
  • Lower time to market
  • Higher processing speed
  • More robust processes

Slide 60

Thank you! @aseigneurin - @ipponusa