Sam Bessalah
@samklr
Software Engineer, Freelance
Data Engineering, Distributed Systems, Machine Learning
Paris Data Geek Meetup
@DataParis
me :
Slide 6
Slide 6 text
Big Data Legends ….
Slide 7
Slide 7 text
Big Data Legends …
Web logs
Sensors
Other data sources
…
Slide 12
Slide 12 text
A Big Data Legend …
Web logs
Sensors
Other data sources
…
Data-Driven Decisions
Smart Applications
Slide 13
Slide 13 text
BUT ….
Slide 14
Slide 14 text
- Building big data infrastructure is no easy task.
- Leveraging data for decision making requires a mix of multiple skills:
. Systems engineering
. Distributed computing
. Statistics
. Machine learning
Slide 15
Slide 15 text
Solutions ….
- Build data platforms as a service.
- Build robust and consistent APIs to bring big data to the masses.
- Leverage fluent APIs for fast data science.
Slide 16
Slide 16 text
No content
Slide 17
Slide 17 text
Big Data is not just about throwing data at Hadoop.
Slide 18
Slide 18 text
It’s also about data pipelines
Slide 19
Slide 19 text
Data Sources
Slide 21
Slide 21 text
Data Sources
- High-throughput distributed messaging platform
- Publish-subscribe model
- Modelled as a distributed, replicated log
- Persists messages to disk
- Categorizes messages into topics
- Retains messages for a long, configurable amount of time
- Allows stream replay in case of failure
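The properties listed above can be sketched with a toy in-memory log. This is a minimal illustration, not the platform's API: the `TopicLog` class, its method names, and the count-based retention policy are all assumptions made for the example. Log-based brokers such as Apache Kafka work along these lines, but they persist to disk and replicate across machines, which this sketch does not.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for a message log (hypothetical, no disk or replication)."""

    def __init__(self, retention=1000):
        self.retention = retention            # max messages kept per topic (assumed policy)
        self.topics = defaultdict(list)       # topic -> list of (offset, message)
        self.next_offset = defaultdict(int)   # topic -> offset assigned to the next message

    def publish(self, topic, message):
        """Append a message to a topic and return its offset."""
        offset = self.next_offset[topic]
        log = self.topics[topic]
        log.append((offset, message))
        self.next_offset[topic] += 1
        # Retention: drop the oldest entries once the limit is exceeded.
        excess = len(log) - self.retention
        if excess > 0:
            del log[:excess]
        return offset

    def read(self, topic, from_offset=0):
        """Return retained messages at or after from_offset; consumers track their own offsets."""
        return [(o, m) for o, m in self.topics[topic] if o >= from_offset]


log = TopicLog(retention=3)
for line in ["GET /a", "GET /b", "GET /c", "GET /d"]:
    log.publish("weblogs", line)

# A consumer that failed after processing offset 1 replays from offset 2.
replayed = log.read("weblogs", from_offset=2)
```

Because consumers ask for messages by offset instead of having them deleted on delivery, a restart only needs the last processed offset to resume.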
Slide 22
Slide 22 text
Data Sources
Machine Learning
High-Latency Batch Apps
Real-Time Processing
Slide 23
Slide 23 text
How do you build an API around that?
Slide 25
Slide 25 text
/ingest
REST API
Slide 27
Slide 27 text
/ingest
/query
/trainModel
/process
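One way to picture the endpoints above is a single dispatcher that maps each path to a handler. The sketch below is only an assumption about how such an API could be laid out: the handler bodies, payload fields, and response shapes are hypothetical, and a real platform would delegate to its ingestion, query, training, and processing backends.

```python
import json


# Hypothetical handlers; the payload fields and responses are illustrative only.
def ingest(payload):
    return {"status": "accepted", "records": len(payload.get("records", []))}


def query(payload):
    return {"status": "ok", "query": payload.get("q", "")}


def train_model(payload):
    return {"status": "scheduled", "model": payload.get("model", "default")}


def process(payload):
    return {"status": "scheduled", "job": payload.get("job", "default")}


ROUTES = {
    "/ingest": ingest,
    "/query": query,
    "/trainModel": train_model,
    "/process": process,
}


def handle(path, body):
    """Dispatch a JSON request body to the handler for path; return (status, json_text)."""
    handler = ROUTES.get(path)
    if handler is None:
        return 404, json.dumps({"error": "unknown endpoint"})
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "invalid JSON"})
    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "expected a JSON object"})
    return 200, json.dumps(handler(payload))


status, response = handle("/ingest", '{"records": [{"ts": 1}, {"ts": 2}]}')
```

Keeping the routing table separate from the handlers means the same dispatcher can sit behind any HTTP server or framework.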
Slide 28
Slide 28 text
Things to be careful with
- Multitenancy (YARN, Mesos, Docker…)
- Job scheduling
- Security
- Serialization: Protobuf, Thrift, Avro
- Storage format: optimize queries with columnar storage.
- Compression: LZO, Snappy
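The storage-format and compression points above can be illustrated with a small sketch. Everything here is an assumption for illustration: the records are made up, JSON stands in for a real serialization format, and `zlib` stands in for LZO or Snappy, which are not in the Python standard library.

```python
import json
import zlib

# Hypothetical web-log records.
rows = [{"user": f"u{i % 10}", "status": 200, "bytes": 512 + i} for i in range(1000)]

# Row-oriented layout: one JSON object per record.
row_blob = json.dumps(rows).encode()

# Column-oriented layout: one array per field.
columns = {name: [r[name] for r in rows] for name in ("user", "status", "bytes")}
col_blobs = {name: json.dumps(values).encode() for name, values in columns.items()}

# A query over a single field only has to read that column.
total_bytes = sum(json.loads(col_blobs["bytes"]))

# Compressed sizes of both layouts (zlib as a stand-in codec).
row_size = len(zlib.compress(row_blob))
col_size = sum(len(zlib.compress(blob)) for blob in col_blobs.values())
```

Comparing `row_size` and `col_size` on real data is a quick way to check whether a columnar layout helps; similar values stored next to each other often compress better, but the gain depends on the data.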
Machine Learning workflow
Text, Images, etc
Feature Extraction
Learning algorithm Training
Predictive Model
Slide 40
Slide 40 text
Machine Learning workflow
Text, Images, etc
Feature Extraction
Learning algorithm Training
Predictive Model
New Data
Feature Vector
Prediction
Slide 41
Slide 41 text
Machine Learning workflow
Text, Images, etc
Feature Extraction
Learning algorithm
Predictive Model
New Data Prediction
Slide 42
Slide 42 text
Machine Learning workflow
Text, Images, etc
Feature Extraction
Learning algorithm
Predictive Model
New Data Prediction
BLACK BOX
Slide 43
Slide 43 text
Machine Learning Libraries and Frameworks
Slide 44
Slide 44 text
scikit-learn.org
Slide 45
Slide 45 text
Text, Images, etc
Feature Extraction
Predictive Model
New Data Prediction
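X = vect.fit_transform(input)        # learn the vocabulary and vectorize the training data
clf.fit(X, y)                        # train the classifier on the labelled vectors
X_new = vect.transform(new_input)    # reuse the fitted vectorizer; do not refit on new data
y_new = clf.predict(X_new)           # predict labels for the new data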
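The fit / transform / predict pattern in this workflow can be tried without any dependency. The sketch below uses toy stand-ins, not scikit-learn's classes: `ToyVectorizer` and `ToyClassifier` are hypothetical names, and the scoring rule is a deliberately naive assumption. Note that new data goes through `transform`, which reuses the vocabulary learned during training, so the feature vectors keep the same layout.

```python
class ToyVectorizer:
    """Bag-of-words counts over a vocabulary learned in fit_transform (toy stand-in)."""

    def fit_transform(self, docs):
        words = sorted({w for d in docs for w in d.split()})
        self.vocab = {w: i for i, w in enumerate(words)}
        return self.transform(docs)

    def transform(self, docs):
        vectors = []
        for d in docs:
            v = [0] * len(self.vocab)
            for w in d.split():
                if w in self.vocab:        # words unseen in training are ignored
                    v[self.vocab[w]] += 1
            vectors.append(v)
        return vectors


class ToyClassifier:
    """Scores each label by overlap with the summed training counts for that label."""

    def fit(self, X, y):
        self.totals = {}
        for v, label in zip(X, y):
            acc = self.totals.setdefault(label, [0] * len(v))
            for i, count in enumerate(v):
                acc[i] += count
        return self

    def predict(self, X):
        def score(v, label):
            return sum(a * b for a, b in zip(v, self.totals[label]))
        return [max(self.totals, key=lambda label: score(v, label)) for v in X]


vect, clf = ToyVectorizer(), ToyClassifier()
X = vect.fit_transform(["cheap pills now", "meeting at noon", "cheap offer now"])
clf.fit(X, ["spam", "ham", "spam"])

X_new = vect.transform(["cheap meeting pills"])   # same vocabulary as training
y_new = clf.predict(X_new)
```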
Slide 46
Slide 46 text
http://arxiv.org/abs/1309.0238
Slide 47
Slide 47 text
From library to web APIs
Slide 48
Slide 48 text
Machine Learning workflow
Text, Images, etc
Feature Extraction
Learning algorithm
Predictive Model
New Data Prediction
BLACK BOX
Slide 49
Slide 49 text
Machine Learning workflow
Text, Images, etc
Transformed Data
Application
Prediction
Predictive API
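A predictive API in this sense hides feature extraction and the model behind one call. The sketch below is a minimal illustration under assumptions: `make_predict_endpoint`, the JSON field names, and the stand-in `transform` and `predict` functions are all hypothetical, and a real service would load a fitted vectorizer and a trained model instead.

```python
import json


def make_predict_endpoint(transform, predict):
    """Wrap a feature-extraction step and a model behind one JSON-in, JSON-out call."""
    def endpoint(request_body):
        data = json.loads(request_body)
        features = transform(data["input"])
        return json.dumps({"prediction": predict(features)})
    return endpoint


# Hypothetical stand-ins for a fitted vectorizer and a trained model.
def transform(text):
    return [len(text), text.count("!")]


def predict(features):
    return "spam" if features[1] >= 2 else "ham"


predict_api = make_predict_endpoint(transform, predict)
response = predict_api('{"input": "Win now!! Click!"}')
```

The application only sees the request and the prediction; everything between them stays a black box that can be retrained or replaced without changing the caller.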
Modeling and Prediction are just
a small part of the process
Slide 55
Slide 55 text
- Data locality and data gravity
- Support the full workflow
- Verticalization of platforms
- Scalability
- Collaboration and interoperability
- Black boxing of implementations
Slide 56
Slide 56 text
Explore machine learning for API orchestration.
Talk to Ori
@OriPekelman
Next frontier? Or actual reality?