Slide 1

Slide 1 text

Big Data and Machine Learning APIs

Slide 2

Slide 2 text

Me: Sam Bessalah (@samklr). Software Engineer, Freelance. Data Engineering, Distributed systems, Machine Learning. Paris Data Geek Meetup (@DataParis).

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Big Data Legends ….

Slide 7

Slide 7 text

Big Data Legends … Web logs, Sensors, Other data sources, …

Slide 8

Slide 8 text

A Big Data Legend … Web logs, Sensors, Other data sources, …

Slide 12

Slide 12 text

A Big Data Legend … Web logs, Sensors, Other data sources, … Data Driven Decisions, Smart Applications

Slide 13

Slide 13 text

BUT ….

Slide 14

Slide 14 text

- Building big data infrastructure is no easy task.
- Leveraging data for decision making requires a mix of multiple skills:
  . System Engineering
  . Distributed computing
  . Statistics
  . Machine Learning

Slide 15

Slide 15 text

Solutions …
- Build data platforms as a service.
- Build robust and consistent APIs to bring big data to the masses.
- Leverage fluent APIs for fast data science.

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Big Data is not just about throwing data at Hadoop.

Slide 18

Slide 18 text

It’s also about data pipelines

Slide 19

Slide 19 text

Data Sources

Slide 20

Slide 20 text

Data Sources

Slide 21

Slide 21 text

Data Sources
- High-throughput distributed messaging platform
- Publish-subscribe model
- Modelled as a distributed, replicated log
- Persists messages to disk
- Categorizes messages into topics
- Allows message retention for a specified, potentially long, amount of time
- Allows stream replay in case of failure
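
The properties listed on this slide (topics, an append-only log, retention, replay from an offset) can be illustrated with a toy in-memory sketch. The slide text does not name the platform, and the class and method names below (`TopicLog`, `publish`, `read_from`) are hypothetical; they do not correspond to any real client library. Disk persistence and replication are left out, and retention is by message count rather than by time.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for a log-structured message broker.

    Hypothetical names; a real broker also persists to disk and replicates.
    """

    def __init__(self, max_messages_per_topic=1000):
        # Retention by count keeps the sketch short; real systems
        # typically retain by time or by size.
        self.max_messages = max_messages_per_topic
        self._logs = defaultdict(list)  # topic -> retained messages
        self._base = defaultdict(int)   # topic -> offset of first retained message

    def publish(self, topic, message):
        """Append a message to a topic and return its offset."""
        log = self._logs[topic]
        log.append(message)
        offset = self._base[topic] + len(log) - 1
        if len(log) > self.max_messages:
            # Drop the oldest message once the retention limit is exceeded.
            del log[0]
            self._base[topic] += 1
        return offset

    def read_from(self, topic, offset):
        """Return (offset, message) pairs from `offset` onward.

        Offsets older than the retention window are skipped.
        """
        base = self._base[topic]
        start = max(offset, base) - base
        log = self._logs[topic]
        return [(base + i, msg) for i, msg in enumerate(log) if i >= start]


log = TopicLog(max_messages_per_topic=3)
for event in ["a", "b", "c", "d"]:
    log.publish("weblogs", event)

# A consumer that failed after handling offset 1 replays from offset 2.
replayed = log.read_from("weblogs", 2)
```

Because consumers track their own offset, recovering from a failure is just a matter of reading again from the last offset they handled, as long as it is still inside the retention window.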

Slide 22

Slide 22 text

Data Sources Machine Learning High Latency Batch Apps Real Time Processing

Slide 23

Slide 23 text

How do you build an API around that?

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

/ingest REST API

Slide 26

Slide 26 text

/ingest

Slide 27

Slide 27 text

/ingest /query /trainModel /process
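
One way to picture these four endpoints is a thin routing layer that maps each path to a backend job. The sketch below is framework-free and only illustrative: the handler names, the payload shapes, and the in-memory `STORE` and `MODELS` are all assumptions, since the slides give only the paths.

```python
import json

STORE = []   # hypothetical stand-in for the ingestion pipeline
MODELS = {}  # hypothetical stand-in for a model registry


def ingest(payload):
    """Accept a batch of records."""
    records = payload.get("records", [])
    STORE.extend(records)
    return {"ingested": len(records)}


def query(payload):
    """Return stored records whose `field` equals `value`."""
    field, value = payload["field"], payload["value"]
    return {"rows": [r for r in STORE if r.get(field) == value]}


def train_model(payload):
    """'Train' a trivial model: the mean of one numeric field."""
    field = payload["field"]
    values = [r[field] for r in STORE if field in r]
    MODELS[payload["name"]] = sum(values) / len(values) if values else None
    return {"model": payload["name"], "trained_on": len(values)}


def process(payload):
    """Placeholder for submitting a batch or streaming job."""
    return {"job": payload.get("job", "unnamed"), "status": "accepted"}


ROUTES = {
    "/ingest": ingest,
    "/query": query,
    "/trainModel": train_model,
    "/process": process,
}


def handle(path, body):
    """Dispatch a POST-style request: JSON string in, (status, JSON string) out."""
    handler = ROUTES.get(path)
    if handler is None:
        return 404, json.dumps({"error": "unknown endpoint"})
    try:
        return 200, json.dumps(handler(json.loads(body)))
    except (KeyError, ValueError) as exc:
        # Missing fields and malformed JSON both map to a client error.
        return 400, json.dumps({"error": str(exc)})
```

In a real deployment each handler would enqueue work on the underlying platform rather than run it inline; the point of the sketch is that callers only ever see the four paths.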

Slide 28

Slide 28 text

Things to be careful with
- Multitenancy (YARN, Mesos, Docker…)
- Job scheduling
- Security
- Serialisation: ProtoBuf, Thrift, Avro
- Storage format: optimize queries with columnar storage
- Compression: LZO, Snappy
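
To make the storage-format and compression points concrete, here is a small sketch that lays out the same records row-wise and column-wise. It uses only the standard library: `zlib` stands in for LZO or Snappy (which need third-party bindings), and JSON stands in for a real serialisation format such as Avro. The records are synthetic.

```python
import json
import zlib

# Synthetic records, invented for illustration.
rows = [
    {"user_id": i, "country": "FR" if i % 3 else "US", "latency_ms": 100 + i % 7}
    for i in range(1000)
]

# Row layout: whole records stored one after another.
row_bytes = json.dumps(rows).encode("utf-8")

# Columnar layout: one array per field, field names stored once.
columns = {name: [r[name] for r in rows] for name in rows[0]}
col_bytes = json.dumps(columns).encode("utf-8")

# A query over one field only needs to read that column.
avg_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])

# Compress both layouts; compare the sizes on your own data.
row_compressed = zlib.compress(row_bytes)
col_compressed = zlib.compress(col_bytes)

print(len(row_bytes), len(col_bytes), len(row_compressed), len(col_compressed))
```

Columnar layouts often compress well because similar values sit next to each other, but the actual ratio depends on the data and the codec, so it is worth measuring rather than assuming.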

Slide 29

Slide 29 text

Making sense of data …

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

What is Machine Learning?

Slide 32

Slide 32 text

http://dilbert.com/strips/comic/2013-02-02

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark-1

Slide 35

Slide 35 text

Machine Learning workflow

Slide 36

Slide 36 text

Machine Learning workflow Text, Images, etc

Slide 37

Slide 37 text

Machine Learning workflow Text, Images, etc Feature Extraction

Slide 38

Slide 38 text

Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm Training

Slide 39

Slide 39 text

Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm Training Predictive Model

Slide 40

Slide 40 text

Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm Training Predictive Model New Data Feature Vector Prediction

Slide 41

Slide 41 text

Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm Predictive Model New Data Prediction

Slide 42

Slide 42 text

Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm Predictive Model New Data Prediction BLACK BOX

Slide 43

Slide 43 text

Machine Learning Libraries and Frameworks

Slide 44

Slide 44 text

scikit-learn.org

Slide 45

Slide 45 text

Text, Images, etc Feature Extraction Predictive Model New Data Prediction
X = vect.fit_transform(input)
clf.fit(X, y)
X_new = vect.transform(new_input)
y_new = clf.predict(X_new)
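
A runnable version of this pattern, assuming scikit-learn is installed. The choice of `CountVectorizer` and `MultinomialNB` and the toy spam/ham data are illustrative assumptions; the slide only shows the generic `vect` / `clf` pattern. New data goes through `transform` rather than `fit_transform`, so it is encoded with the vocabulary learned during training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set, invented for illustration.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project status meeting notes",
]
y = ["spam", "spam", "ham", "ham"]

# Feature extraction: learn the vocabulary and encode the training text.
vect = CountVectorizer()
X = vect.fit_transform(train_docs)

# Training: fit the classifier on the extracted features.
clf = MultinomialNB()
clf.fit(X, y)

# Prediction: reuse the fitted vectorizer, do not refit it.
new_docs = ["buy cheap pills", "monday project meeting"]
X_new = vect.transform(new_docs)
y_new = clf.predict(X_new)
```

Refitting the vectorizer on new data would build a different vocabulary, so the feature columns would no longer line up with what the classifier was trained on.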

Slide 46

Slide 46 text

http://arxiv.org/abs/1309.0238

Slide 47

Slide 47 text

From library to web APIs

Slide 48

Slide 48 text

Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm Predictive Model New Data Prediction BLACK BOX

Slide 49

Slide 49 text

Machine Learning workflow Text, Images, etc Transformed Data Application Prediction Predictive API
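
Seen from the application, a predictive API reduces the workflow to one request and one response. The sketch below shows that contract with a deliberately trivial stand-in for the model: the function names, the JSON shape, and the keyword rule are all hypothetical and not taken from any real service.

```python
import json

# Hypothetical stand-in for a trained model hidden behind the API.
SPAM_WORDS = {"buy", "cheap", "free"}


def _extract_features(text):
    """Feature extraction the caller never sees: count of flagged words."""
    tokens = text.lower().split()
    return sum(token in SPAM_WORDS for token in tokens)


def _predict(feature):
    """Decision rule standing in for a real predictive model."""
    return "spam" if feature >= 2 else "ham"


def predict_endpoint(request_body):
    """JSON in, JSON out: the only contract the application depends on."""
    text = json.loads(request_body)["text"]
    return json.dumps({"prediction": _predict(_extract_features(text))})


# Application side: no vectorizer, no classifier, just a request.
response = json.loads(predict_endpoint(json.dumps({"text": "Buy cheap watches"})))
```

Everything prefixed with an underscore could be swapped for a different feature extractor or model without the application noticing, which is both the appeal and the "black box" concern of this approach.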

Slide 50

Slide 50 text

Predictive Web APIs

Slide 51

Slide 51 text

Some examples

Slide 52

Slide 52 text

Challenges of Predictive APIs

Slide 53

Slide 53 text

http://www.r-bloggers.com/data-science-toolbox-survey-results-surprise-r-and-python-win/

Slide 54

Slide 54 text

Modeling and Prediction are just a small part of the process

Slide 55

Slide 55 text

- Data locality and data gravity
- Support the full workflow
- Verticalization of platforms
- Scalability
- Collaboration and interoperability
- Black boxing of implementations

Slide 56

Slide 56 text

Explore machine learning for API orchestration. Talk to Ori (@OriPekelman). Next frontier? Or actual reality?

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

http://speakerdeck.com/samklr