
Big data and Machine learning APIs

Sam Bessalah
December 03, 2014

Transcript

  1. Big Data and Machine Learning APIs

  2. Me: Sam Bessalah, @samklr. Software Engineer, Freelance. Data Engineering, Distributed Systems, Machine Learning. Paris Data Geek Meetup, @DataParis
  3. None
  4. None
  5. None
  6. Big Data Legends ….

  7. Big Data Legends … Web logs, Sensors, Other data sources, …
  8. A Big Data Legend … Web logs, Sensors, Other data sources, …
  12. A Big Data Legend … Web logs, Sensors, Other data sources, …; Data-driven decisions; Smart applications
  13. BUT ….

  14. - Building big data infrastructure is no easy task.
    - Leveraging data for decision making requires a mix of multiple skills: system engineering, distributed computing, statistics, machine learning.
  15. Solutions …
    - Build data platforms as a service.
    - Build robust and consistent APIs to bring big data to the masses.
    - Leverage fluent APIs for fast data science.
  16. None
  17. Big Data is not just about throwing data at Hadoop.

  18. It’s also about data pipelines

  19. Data Sources

  20. Data Sources

  21. Data Sources
    - High-throughput distributed messaging platform
    - Publish-subscribe model
    - Modelled as a distributed, replicated log
    - Persists messages to disk
    - Categorizes messages into topics
    - Allows message retention for a long, specified amount of time
    - Allows stream replay in case of failure
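The log-and-topic model on this slide can be illustrated with a toy in-memory sketch. The properties listed resemble a log-based broker such as Apache Kafka, although the transcript does not name the system. The class `TopicLog` and its methods are hypothetical names, not any real client API, and the sketch deliberately leaves out disk persistence, replication, and retention limits.

```python
class TopicLog:
    """Toy append-only log grouped by topic (hypothetical, in-memory only)."""

    def __init__(self):
        # topic name -> ordered list of messages
        self._topics = {}

    def publish(self, topic, message):
        """Append a message to a topic and return its offset in that topic."""
        log = self._topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset=0):
        """Return every message in a topic from the given offset onward."""
        return list(self._topics.get(topic, [])[offset:])


log = TopicLog()
log.publish("weblogs", "GET /index.html")
log.publish("weblogs", "GET /about.html")
log.publish("sensors", "temp=21.5")

# Messages are retained after being read, so a consumer that fails
# can rewind its own offset and replay the stream.
first_pass = log.read("weblogs", offset=0)
replayed = log.read("weblogs", offset=0)
```

Because reading never removes messages, each consumer only needs to remember an offset, which is what makes replay after a failure cheap in this model.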
  22. Data Sources, Machine Learning, High Latency Batch Apps, Real Time Processing
  23. How do you build an API around that?

  24. None
  25. /ingest REST API

  26. /ingest

  27. /ingest /query /trainModel /process
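As a rough illustration of how endpoints like these might sit behind one dispatcher, here is a minimal sketch with no web framework and no network layer. Everything in it is an assumption made for illustration: the handler names, the JSON payload shapes, the use of POST, and the in-memory `EVENTS` list standing in for real storage. Only `/ingest` and `/query` are wired up; `/trainModel` and `/process` would be further entries in the same table.

```python
import json

EVENTS = []  # hypothetical in-memory stand-in for the storage layer


def ingest(payload):
    """Store one event and report how many are held so far."""
    EVENTS.append(payload)
    return {"status": "accepted", "count": len(EVENTS)}


def query(payload):
    """Return the stored events whose 'source' matches the request."""
    source = payload.get("source")
    return {"results": [e for e in EVENTS if e.get("source") == source]}


# (method, path) -> handler; more endpoints would be added here.
ROUTES = {
    ("POST", "/ingest"): ingest,
    ("POST", "/query"): query,
}


def handle(method, path, body):
    """Dispatch a request and return (status_code, JSON response string)."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, json.dumps({"error": "unknown endpoint"})
    return 200, json.dumps(handler(json.loads(body)))


status, response = handle(
    "POST", "/ingest", json.dumps({"source": "weblogs", "line": "GET /index.html"})
)
```

Keeping the routing table separate from the handlers means the same functions could later be mounted in a real HTTP server without changing their logic.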

  28. Things to be careful with
    - Multitenancy (Yarn, Mesos, Docker…)
    - Job scheduling
    - Security
    - Serialisation: ProtoBuf, Thrift, Avro
    - Storage format: optimize queries with columnar storage
    - Compression: LZO, Snappy
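The storage-format and compression points can be made concrete with a small sketch that lays the same records out by row and by column. It uses only the standard library, so JSON stands in for ProtoBuf, Thrift, or Avro, and `zlib` stands in for LZO or Snappy; the records themselves are made up for illustration, and actual sizes will vary with the data.

```python
import json
import zlib

# Made-up sample records: 1000 events with three fields each.
rows = [
    {"user": "u%d" % (i % 10), "country": "FR", "latency_ms": 100 + i % 7}
    for i in range(1000)
]

# Row-oriented layout: every record repeats its field names.
row_bytes = json.dumps(rows).encode("utf-8")

# Column-oriented layout: one list of values per field.
columns = {name: [r[name] for r in rows] for name in rows[0]}
col_bytes = json.dumps(columns).encode("utf-8")

row_compressed = zlib.compress(row_bytes)
col_compressed = zlib.compress(col_bytes)

# A query over a single field only has to touch that one column.
avg_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])

print("row layout:   ", len(row_bytes), "bytes,", len(row_compressed), "compressed")
print("column layout:", len(col_bytes), "bytes,", len(col_compressed), "compressed")
```

The aggregate at the end is the kind of query that columnar formats target: it reads one field and never deserializes the others.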
  29. Making sense of data …

  30. None
  31. What is Machine Learning?

  32. http://dilbert.com/strips/comic/2013-02-02

  33. None
  34. https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark-1

  35. Machine Learning workflow

  36. Machine Learning workflow Text, Images, etc

  37. Machine Learning workflow Text, Images, etc Feature Extraction

  38. Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm

    Training
  39. Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm

    Training Predictive Model
  40. Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm

    Training Predictive Model New Data Feature Vector Prediction
  41. Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm

    Predictive Model New Data Prediction
  42. Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm

    Predictive Model New Data Prediction BLACK BOX
  43. Machine Learning Libraries and Frameworks

  44. scikit-learn.org

  45. Text, Images, etc Feature Extraction Predictive Model New Data Prediction

    X = vect.fit_transform(input)
    clf.fit(X, y)
    X_new = vect.transform(new_input)  # reuse the fitted vectorizer on new data
    y_new = clf.predict(X_new)
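The fit, transform, predict sequence on this slide can be walked through end to end with a dependency-free sketch. The two classes below are toy stand-ins that only mimic the method names of the scikit-learn convention; they are not scikit-learn's implementations, and the training sentences and labels are invented for illustration. The point to notice is that new data goes through `transform`, so it is encoded with the vocabulary learned during training.

```python
from collections import Counter


class ToyVectorizer:
    """Bag-of-words vectorizer (toy stand-in, not scikit-learn)."""

    def fit(self, docs):
        words = sorted({w for d in docs for w in d.lower().split()})
        self.vocab_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        vectors = []
        for d in docs:
            counts = Counter(d.lower().split())
            # Words unseen during fit are ignored, so vector length stays fixed.
            vectors.append([counts.get(w, 0) for w in self.vocab_])
        return vectors

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


class ToyCentroidClassifier:
    """Predicts the label whose mean training vector is nearest."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in set(y):
            members = [x for x, lab in zip(X, y) if lab == label]
            self.centroids_[label] = [sum(col) / len(members) for col in zip(*members)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids_, key=lambda lab: dist(x, self.centroids_[lab])) for x in X]


train_docs = ["cheap pills buy now", "buy cheap watches", "meeting at noon", "lunch meeting tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]

vect = ToyVectorizer()
clf = ToyCentroidClassifier()

X = vect.fit_transform(train_docs)  # learn the vocabulary, then vectorize
clf.fit(X, train_labels)

new_docs = ["buy pills", "team meeting"]
X_new = vect.transform(new_docs)    # reuse the learned vocabulary; do not refit
y_new = clf.predict(X_new)
print(y_new)  # → ['spam', 'ham']
```

Calling `fit_transform` on the new documents instead would rebuild the vocabulary from two short sentences, and the resulting vectors would no longer line up with the features the classifier was trained on.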
  46. http://arxiv.org/abs/1309.0238

  47. From library to web APIs

  48. Machine Learning workflow Text, Images, etc Feature Extraction Learning algorithm

    Predictive Model New Data Prediction BLACK BOX
  49. Machine Learning workflow Text, Images, etc Transformed Data Application Prediction

    Predictive API
  50. Predictive Web APIs

  51. Some examples

  52. Challenges of Predictive APIs

  53. http://www.r-bloggers.com/data-science-toolbox-survey-results-surprise-r-and-python-win/

  54. Modeling and Prediction are just a small part of the process
  55. - Data locality and data gravity
    - Support the full workflow
    - Verticalization of platforms
    - Scalability
    - Collaboration and interoperability
    - Black boxing of implementations
  56. Explore machine learning for API orchestration. Talk to Ori @OriPekelman. Next frontier? Or actual reality?
  57. None
  58. http://speakerdeck.com/samklr