Big data and Machine learning APIs

Sam Bessalah
December 03, 2014

Transcript

  1. Big Data and Machine
    Learning APIs

  2. Sam Bessalah
    @samklr
    Software Engineer, Freelance
    Data Engineering, Distributed systems,
    Machine Learning
    Paris Data Geek Meetup
    @DataParis
    me :

  6. Big Data Legends ….

  7. Big Data Legends …
    Web logs
    Sensors
    Other Data sources
    …

  8. A Big Data Legend …
    Web logs
    Sensors
    Other Data sources
    …

  12. A Big Data Legend …
    Web logs
    Sensors
    Other Data sources
    …
    Data Driven Decisions
    Smart Applications

  13. BUT ….

  14. - Building big data infrastructure is no easy task.
    - Leveraging data for decision making requires a mix
    of multiple skills:
    . System Engineering
    . Distributed computing
    . Statistics
    . Machine Learning

  15. Solutions ….
    - Build data platforms as a service.
    - Build robust and consistent APIs to bring big
    data to the masses.
    - Leverage fluent APIs for fast data science.

  17. Big Data is not just about throwing data at Hadoop.

  18. It’s also about data pipelines

  19. Data Sources

  21. Data Sources
    - High-throughput distributed messaging
    platform
    - Publish Subscribe Model
    - Modelled as a distributed replicated log
    - Persists messages to disk
    - Categorizes messages into Topics
    - Allows message retention for a long, configurable
    amount of time
    - Allows stream replay in case of failure
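
The properties listed on this slide can be illustrated with a toy, single-process log. This is only a sketch under stated assumptions: the class `TopicLog` and its methods are hypothetical names, the data lives in memory rather than on disk, and nothing is replicated or distributed. It shows topics, append-only offsets, time-based retention, and replay from an earlier offset.

```python
import time
from collections import defaultdict


class TopicLog:
    """Toy, in-memory stand-in for a topic-based message log (hypothetical)."""

    def __init__(self, retention_seconds=3600.0):
        self.retention_seconds = retention_seconds
        # topic name -> append-only list of (offset, timestamp, message)
        self._topics = defaultdict(list)
        # topic name -> next offset to assign; offsets are never reused
        self._next_offset = defaultdict(int)

    def publish(self, topic, message, now=None):
        """Append a message to a topic and return its offset."""
        now = time.time() if now is None else now
        offset = self._next_offset[topic]
        self._topics[topic].append((offset, now, message))
        self._next_offset[topic] = offset + 1
        return offset

    def read(self, topic, from_offset=0):
        """Return (offset, message) pairs at or after from_offset.

        Reading never deletes anything, so a consumer that failed can
        replay the stream by reading again from its last known offset.
        """
        return [(o, m) for o, _, m in self._topics[topic] if o >= from_offset]

    def expire(self, now=None):
        """Drop messages older than the retention window."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        for topic, entries in self._topics.items():
            self._topics[topic] = [e for e in entries if e[1] >= cutoff]
```

A consumer keeps track of the last offset it processed; after a crash it calls `read(topic, from_offset=last_offset + 1)` and picks up where it stopped, as long as the messages are still inside the retention window.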

  22. Data Sources
    Machine Learning
    High Latency Batch Apps
    Real Time Processing

  23. How do you build an API around that?

  25. /ingest
    REST API

  26. /ingest

  27. /ingest
    /query
    /trainModel
    /process
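
One way to picture these four endpoints is as a small routing table in front of the platform. The sketch below is an assumption, not the design from the talk: the handler names, the payload fields (`field`, `value`, `name`), the status codes, and the in-memory `RECORDS` and `MODELS` stores are all hypothetical, and the "training" and "processing" steps are placeholders for real jobs.

```python
import json

RECORDS = []  # ingested events (hypothetical in-memory store)
MODELS = {}   # trained "models" keyed by name (hypothetical)


def ingest(payload):
    RECORDS.append(payload)
    return 202, {"accepted": 1, "total": len(RECORDS)}


def query(payload):
    field, value = payload["field"], payload["value"]
    return 200, {"results": [r for r in RECORDS if r.get(field) == value]}


def train_model(payload):
    # Placeholder "training": remember the mean of a numeric field.
    name, field = payload["name"], payload["field"]
    values = [r[field] for r in RECORDS if field in r]
    if not values:
        return 400, {"error": "no data for field " + field}
    MODELS[name] = sum(values) / len(values)
    return 201, {"model": name}


def process(payload):
    # Placeholder batch job: count records per value of a field.
    counts = {}
    for record in RECORDS:
        key = str(record.get(payload["field"]))
        counts[key] = counts.get(key, 0) + 1
    return 200, {"counts": counts}


ROUTES = {
    ("POST", "/ingest"): ingest,
    ("POST", "/query"): query,
    ("POST", "/trainModel"): train_model,
    ("POST", "/process"): process,
}


def handle(method, path, body):
    """Dispatch a request to its handler and return (status, JSON string)."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, json.dumps({"error": "unknown route"})
    try:
        payload = json.loads(body)
        if not isinstance(payload, dict):
            raise ValueError("payload must be a JSON object")
        status, result = handler(payload)
    except (ValueError, KeyError) as exc:
        return 400, json.dumps({"error": str(exc)})
    return status, json.dumps(result)
```

For example, `handle("POST", "/ingest", '{"user": "a", "latency": 10}')` stores one record, and a later `handle("POST", "/query", '{"field": "user", "value": "a"}')` returns it. In a real deployment each handler would hand off to the messaging, storage, or compute layer instead of a Python list.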

  28. Things to be careful with
    - Multitenancy (Yarn, Mesos, Docker…)
    - Job Scheduling
    - Security
    - Serialisation : ProtoBuf, Thrift, Avro
    - Storage Format : Optimize queries with columnar
    storage.
    - Compression : LZO, Snappy
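
The storage-format and compression points can be sketched in a few lines. Assumptions are stated loudly here: the sample rows and helper names are hypothetical, JSON stands in for a real serialisation format such as Avro, and `zlib` from the standard library stands in for LZO or Snappy, which are separate packages. The idea shown is that a columnar layout lets a query decompress only the column it needs.

```python
import json
import zlib

# Hypothetical row-oriented records; every row is assumed to have the same fields.
rows = [
    {"user": "u1", "country": "FR", "latency_ms": 120},
    {"user": "u2", "country": "US", "latency_ms": 80},
    {"user": "u3", "country": "FR", "latency_ms": 100},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def compress_column(values):
    """Serialise one column and compress it (zlib as a stand-in codec)."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


def read_column(blob):
    """Decompress and deserialise one column."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Each column is stored as its own compressed blob.
store = {name: compress_column(vals) for name, vals in to_columns(rows).items()}


def mean_latency(column_store):
    # Only the latency_ms blob is read; user and country stay compressed.
    latencies = read_column(column_store["latency_ms"])
    return sum(latencies) / len(latencies)
```

With a row-oriented layout the same query would have to deserialise every field of every record just to read one number per row.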

  29. Making sense of data …

  31. What is Machine
    Learning?

  32. http://dilbert.com/strips/comic/2013-02-02

  34. https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark-1

  35. Machine Learning workflow

  36. Machine Learning workflow
    Text, Images, etc

  37. Machine Learning workflow
    Text, Images, etc
    Feature Extraction

  38. Machine Learning workflow
    Text, Images, etc
    Feature Extraction
    Learning algorithm Training

  39. Machine Learning workflow
    Text, Images, etc
    Feature Extraction
    Learning algorithm Training
    Predictive Model

  40. Machine Learning workflow
    Text, Images, etc
    Feature Extraction
    Learning algorithm Training
    Predictive Model
    New Data
    Feature Vector
    Prediction
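
The workflow on this slide can be walked through end to end in plain Python. This is a minimal sketch, not the method from the talk: the function names are hypothetical, feature extraction is a simple word count, and a nearest-centroid rule stands in for the learning algorithm. The point is the shape of the pipeline: the vocabulary is learned from training data, and new data is mapped into the same feature space before prediction.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Feature extraction, step 1: learn the vocabulary from training text."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}


def to_vector(text, vocab):
    """Feature extraction, step 2: map text to a word-count feature vector."""
    vec = [0] * len(vocab)
    for word, count in Counter(text.lower().split()).items():
        if word in vocab:  # words unseen during training are ignored
            vec[vocab[word]] = count
    return vec


def train(vectors, labels):
    """Learning algorithm: average the vectors of each label (centroids)."""
    sums, counts = {}, Counter(labels)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, vec):
    """Predictive model: return the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((c - v) ** 2 for c, v in zip(centroid, vec))
    return min(model, key=lambda label: squared_distance(model[label]))


# Training: text -> features -> model (toy data, for illustration only)
texts = ["cheap pills buy now", "buy cheap watches",
         "meeting agenda for monday", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]
vocab = fit_vocabulary(texts)
model = train([to_vector(t, vocab) for t in texts], labels)

# Prediction: new data -> feature vector (same vocabulary) -> prediction
prediction = predict(model, to_vector("buy cheap pills", vocab))
```

Note that `to_vector` reuses the vocabulary learned during training; refitting it on new data would produce vectors the model cannot interpret.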

  41. Machine Learning workflow
    Text, Images, etc
    Feature Extraction
    Learning algorithm
    Predictive Model
    New Data Prediction

  42. Machine Learning workflow
    Text, Images, etc
    Feature Extraction
    Learning algorithm
    Predictive Model
    New Data Prediction
    BLACK BOX

  43. Machine Learning Libraries and Frameworks

  44. scikit-learn.org

  45. Text, Images, etc
    Feature Extraction
    Predictive Model
    New Data Prediction
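    # training: fit the vectorizer and the classifier on labelled data
    X = vect.fit_transform(input)
    clf.fit(X, y)
    # prediction: reuse the fitted vectorizer (transform, not fit_transform)
    X_new = vect.transform(new_input)
    y_new = clf.predict(X_new)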

  46. http://arxiv.org/abs/1309.0238

  47. From library to web APIs

  48. Machine Learning workflow
    Text, Images, etc
    Feature Extraction
    Learning algorithm
    Predictive Model
    New Data Prediction
    BLACK BOX

  49. Machine Learning workflow
    Text, Images, etc
    Transformed Data
    Application
    Prediction
    Predictive API
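
Seen from the application, a predictive API is raw data in and a prediction out, with the transformation and the model hidden behind the call. The sketch below assumes a JSON request with a `text` field and a JSON response with `label` and `score`; those field names, the word weights, and the threshold are hypothetical placeholders for a model that would be trained elsewhere.

```python
import json

# Hypothetical pre-trained model: per-word weights and a decision threshold.
WEIGHTS = {"cheap": 1.5, "buy": 1.0, "meeting": -2.0}
THRESHOLD = 1.0


def transform(text):
    """Turn raw text into the features the model expects (word counts)."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def predict_api(request_body):
    """JSON in, JSON out: the caller never sees features or weights."""
    try:
        text = json.loads(request_body)["text"]
    except (ValueError, KeyError, TypeError):
        text = None
    if not isinstance(text, str):
        return json.dumps({"error": "expected a JSON object with a 'text' string"})
    features = transform(text)
    score = sum(WEIGHTS.get(word, 0.0) * n for word, n in features.items())
    label = "spam" if score >= THRESHOLD else "ham"
    return json.dumps({"label": label, "score": score})
```

The application only calls `predict_api('{"text": "buy cheap watches"}')` and reads the label from the response, so the model behind it can be retrained or replaced without changing the caller.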

  50. Predictive Web APIs

  51. Some examples

  52. Challenges of Predictive APIs

  53. http://www.r-bloggers.com/data-science-toolbox-survey-results-surprise-r-and-python-win/

  54. Modeling and Prediction are just
    a small part of the process

  55. - Data locality and data gravity
    - Support the full workflow
    - Verticalization of platforms
    - Scalability
    - Collaboration and interoperability
    - Black boxing of implementations

  56. Explore machine learning for
    API orchestration.
    Talk to Ori
    @OriPekelman
    Next frontier? Or actual reality?

  58. http://speakerdeck.com/samklr
