Slide 1

Slide 1 text

IVORY: A DATA STORE FOR DATA SCIENCE
http://github.com/ambiata/ivory
© Ambiata 2014

Slide 2

Slide 2 text

DATA SCIENCE IN THE REAL WORLD

Slide 3

Slide 3 text

PROBLEM #1

Slide 4

Slide 4 text

“DATA WRANGLING”

Slide 5

Slide 5 text

WHAT WE START WITH

Slide 6

Slide 6 text


Slide 7

Slide 7 text

WHAT WE NEED

Slide 8

Slide 8 text

Feature vectors

Rows are instances; columns are features.

instance   gender  balance  purchases  zipcode  prop_online  num_accs
89340218   M       0.00     3          3001     1.00         1
48149407   -       634.83   16         4670     0.6875       3
18452274   F       15.12    2          -        0.50         1
07499337   M       33.56    2          -        1.00         1
62948721   F       98.34    12         3303     0.8333       4
93754723   -       523.81   23         2046     0.4782       1
00272446   F       1086.05  17         -        1.00         2
13374497   F       224.81   9          -        0.2222       1
31989993   M       78.21    2          2134     0.50         1
46474236   -       126.48   4          -        0.0          1
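One way to picture how a table like this comes about is to pivot individual observations into one row per instance. The sketch below is only an illustration: the `facts` list, the `MISSING` marker, and the `to_feature_vectors` helper are hypothetical names invented here, and the sample values are illustrative.

```python
from collections import defaultdict

# Hypothetical raw observations: (instance id, feature name, value).
facts = [
    ("89340218", "gender", "M"),
    ("89340218", "balance", 0.00),
    ("89340218", "purchases", 3),
    ("48149407", "balance", 634.83),
    ("48149407", "purchases", 16),
]

FEATURES = ["gender", "balance", "purchases"]
MISSING = "-"  # assumed marker for a feature with no observed value


def to_feature_vectors(facts, features):
    """Pivot (instance, feature, value) facts into one row per instance."""
    rows = defaultdict(dict)
    for instance, feature, value in facts:
        rows[instance][feature] = value
    # Emit values in a fixed feature order, filling gaps with MISSING.
    return {
        instance: [values.get(f, MISSING) for f in features]
        for instance, values in rows.items()
    }


vectors = to_feature_vectors(facts, FEATURES)
```

Each entry in `vectors` is one row of the table, with a `-` wherever no value was observed for that instance.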

Slide 9

Slide 9 text

A typical workflow

Diagram: Data set B, Data set C, Data set D; Feature Eng; Model train; Score

Multiple data sources:
• Transaction logs
• Database snapshots
• Segmentation models
• 3rd-party data

Feature engineering:
• Data sources are joined
• Instances are created
• Features are engineered

The cool stuff:
• Models are built
• Instances are scored

Slide 10

Slide 10 text

Feature preparation: 85%
Modelling: 15%

Slide 11

Slide 11 text

Multiple data science projects

Diagram: Data sets A, B, C, D, E; Feature Eng 1, Train 1, Score 1; Feature Eng 2, Train 2, Score 2; Feature Eng 3, Train 3, Score 3

Feature engineering is in a silo: no reuse between model builds

Slide 12

Slide 12 text

PROBLEM #2

Slide 13

Slide 13 text

“LAB TO FACTORY” AKA DEV OPS

Slide 14

Slide 14 text

• Continually receiving data
• Want to leverage a history of all this data
• Continually training + scoring
• Data may need to be corrected
• Need to extend data model on-the-fly

Slide 15

Slide 15 text

LAMBDA ARCHITECTURE

Slide 16

Slide 16 text

query = function(all data)
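Read literally, the equation says that any answer can be recomputed from the complete history of raw data. A minimal sketch, assuming a hypothetical in-memory `all_data` list of purchase events and a made-up `prop_online` query:

```python
# Hypothetical master dataset: every purchase event ever received.
all_data = [
    {"customer": "a", "amount": 20.0, "online": True},
    {"customer": "a", "amount": 35.5, "online": False},
    {"customer": "b", "amount": 12.0, "online": False},
]


def prop_online(all_data, customer):
    """Query as a pure function of all data: share of online purchases."""
    purchases = [e for e in all_data if e["customer"] == customer]
    if not purchases:
        return None  # no history for this customer
    return sum(e["online"] for e in purchases) / len(purchases)
```

Because the function reads only the raw events, correcting or extending the data and re-running it always gives a consistent answer. The cost is a full scan per query, which is what precomputed views are meant to avoid.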

Slide 17

Slide 17 text

Diagram: New data stream; Magical query engine; Query

Slide 18

Slide 18 text

Diagram layers: BATCH LAYER; SPEED LAYER; SERVING LAYER
Diagram components: New data stream; All data; Precomputed views; Stream processing; Incremental views; Real-time store; Query
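The interplay of the layers can be sketched in a few lines. This is a toy model under stated assumptions, not a real system: `batch_view`, `SpeedLayer`, and `query` are hypothetical names, and the "view" is just a purchase count per customer.

```python
from collections import Counter


def batch_view(master):
    """Batch layer: recompute the view from scratch over all data."""
    return Counter(event["customer"] for event in master)


class SpeedLayer:
    """Speed layer: incrementally update a view for events not yet batched."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["customer"]] += 1


def query(batch, speed, customer):
    """Serving side: merge the precomputed and incremental views."""
    return batch[customer] + speed.view[customer]


# Hypothetical data: three events already in the master dataset...
master = [{"customer": "a"}, {"customer": "a"}, {"customer": "b"}]
batch = batch_view(master)

# ...and two that arrived after the last batch run.
speed = SpeedLayer()
speed.on_event({"customer": "a"})
speed.on_event({"customer": "c"})
```

In this sketch, once the next batch run has absorbed the new events, the speed layer's view would be discarded and rebuilt from an empty state.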

Slide 19

Slide 19 text

Diagram: New data stream; All data; Feature view; Stream processing; Incremental views; Real-time store; Query

Slide 20

Slide 20 text

Diagram: New data stream; All data; Feature view; Stream processing; Incremental views; Real-time store; Model train and score; Query

Slide 21

Slide 21 text

IVORY

Slide 22

Slide 22 text

A shared feature view asset

Diagram: Data sets A, B, C, D, E; Ivory; Train 1, Score 1; Train 2, Score 2; Train 3, Score 3

Slide 23

Slide 23 text

An immutable, batch-oriented data store

Diagram: New data stream; All data; Ivory; Stream processing; Incremental views; Real-time store; Model train and score; Query
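To make "immutable" concrete, here is a minimal sketch of an append-only store in which a correction is just another appended record, and a view is derived for any point in time. It is an assumption-laden illustration, not Ivory's actual API or storage format: `Fact`, `append`, and `snapshot` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """Hypothetical immutable record: a value observed for an entity on a day."""
    entity: str
    attribute: str
    value: object
    day: date


def append(store, fact):
    """Never mutate: return a new store with the fact added at the end."""
    return store + (fact,)


def snapshot(store, as_of):
    """Latest value per (entity, attribute) among facts dated on or before as_of.

    The sort is stable, so for facts with the same day the one appended
    later wins, which is how a correction overrides an earlier value.
    """
    latest = {}
    for fact in sorted(store, key=lambda f: f.day):
        if fact.day <= as_of:
            latest[(fact.entity, fact.attribute)] = fact.value
    return latest


store = ()
store = append(store, Fact("c1", "balance", 100.0, date(2014, 1, 1)))
store = append(store, Fact("c1", "balance", 80.0, date(2014, 3, 1)))
# A correction to the January value, appended later rather than edited in place.
store = append(store, Fact("c1", "balance", 120.0, date(2014, 1, 1)))
```

A model trained "as of" 1 February would see the corrected January balance, while scoring in April would see the March value. Nothing is ever overwritten.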

Slide 24

Slide 24 text

An extensible data model, backed by HDFS/S3

Diagram: Feature vectors; Ivory; HDFS / S3
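One way to read "extensible" is that new features can be declared without rewriting data already stored. The sketch below assumes a hypothetical `FeatureStore` class with an in-memory dictionary of declared features. It does not reflect Ivory's real interfaces, and the HDFS/S3 backing is left out entirely.

```python
class FeatureStore:
    """Toy store: a dictionary of declared features plus a list of facts."""

    def __init__(self, dictionary):
        self.dictionary = dict(dictionary)  # feature name -> expected type
        self.facts = []                     # (entity, feature, value)

    def declare(self, feature, value_type):
        """Extend the data model; existing facts are left untouched."""
        self.dictionary[feature] = value_type

    def ingest(self, entity, feature, value):
        """Accept a fact only if its feature is declared and the type matches."""
        if feature not in self.dictionary:
            raise KeyError(f"undeclared feature: {feature}")
        if not isinstance(value, self.dictionary[feature]):
            raise TypeError(f"{feature} expects {self.dictionary[feature].__name__}")
        self.facts.append((entity, feature, value))

    def vector(self, entity):
        """Feature vector for one entity; None marks a missing value."""
        values = {f: v for e, f, v in self.facts if e == entity}
        return {f: values.get(f) for f in self.dictionary}


store = FeatureStore({"balance": float, "purchases": int})
store.ingest("89340218", "balance", 0.0)
store.ingest("89340218", "purchases", 3)

# Extend the model on the fly, then start ingesting the new feature.
store.declare("num_accs", int)
store.ingest("89340218", "num_accs", 1)
```

Entities ingested before `num_accs` was declared simply show a missing value for it, so no backfill is needed.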

Slide 25

Slide 25 text

Apache V2 Licence
github.com/ambiata/ivory