

Demystifying the Buzz in Machine Learning! (This Time for Real)

Slides of my keynote talk at the Chief Data Officer Exchange Europe 2019.

Abstract:
When I started my data science career in 2013, everyone was into big data. In fact, big data was at the peak of inflated expectations (source: Gartner). You had to use tools like Hadoop and Spark to be one of the cool kids. Many data prophets out there told you that data is the new oil, or even gold. In 2019, things haven't changed. Data is still cool and going strong. It's eating the world, and yes, you still need big data, and now also deep, deep, very deep learning. There's a lot of bullshit bingo out there.

In this talk, I want to demystify the buzz in machine learning by presenting some simple guidelines for successful data projects and real, practical use cases. And yes, it involves deep learning, and yes, it can get quite technical at times.


Dat Tran

June 17, 2019

Transcript

  1. Demystifying the Buzz in Machine Learning! (This time for real)
     Dat Tran, Head of AI ~ 17 June 2019 ~ Berlin ~ #AxelSpringerAI
  2. Problem Statement
     • For over 50% of the lead-outs, we don't know whether users bought or not
     • We know it for Amazon & eBay, but with a two-day lag; other problems are direct vs. indirect sales
     • Predicting sales is valuable, for example for CRM, a recommendation engine and many other use cases
  3. Supervised Learning
     Training samples → ML model:
     • price: 80, pis: 5, ... → sale
     • price: 5, pis: 1, ... → non-sale
     • price: 17, pis: 3, ... → sale
     Model predictions:
     • price: 99, pis: 8, ... → non-sale
     • price: 65, pis: 2, ... → sale (82%)
     • price: 32, pis: 9, ... → sale (30%)
     • price: 40, pis: 5, ... → sale (50%)
     • price: 20, pis: 2, ... → sale (71%)
     Deep Learning????
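
To make the supervised-learning setup on this slide concrete, here is a minimal sketch of training a binary sale/non-sale classifier on tabular features such as price and pis. It is not from the deck: the sample values and the use of scikit-learn's LogisticRegression are illustrative assumptions only.

```python
# Minimal supervised-learning sketch (illustrative, not the deck's code):
# fit a binary sale / non-sale classifier on tabular features such as
# "price" and "pis", then score new, unlabeled lead-outs.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled samples: columns = [price, pis]; 1 = sale, 0 = non-sale
X_train = np.array([[80, 5], [5, 1], [17, 3]])
y_train = np.array([1, 0, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# New lead-outs to score; predict_proba returns P(non-sale), P(sale) per row
X_new = np.array([[99, 8], [65, 2], [32, 9]])
sale_probability = model.predict_proba(X_new)[:, 1]
print(sale_probability)
```

A simple baseline like this is often a reasonable first step before reaching for deep learning, in line with the deck's "think simple first" advice.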
  4. Problem Statement
     • 2,306,658 accommodations
     • 308,519,299 images
     • ~133 images per accommodation
     Humans? Deep Learning??
  5. How to start a Deep Learning project
     1. Computer Vision: ImageNet, AlexNet
     2. NLP: Language models (still immature)
  6. Automate Image Quality Assessment
     To automate the image quality assessment, we trained:
     • Aesthetic model → predicts the aesthetic score of an image
     • Technical model → predicts the technical image quality (distortion, blur, etc.)
     We followed the Google paper "NIMA: Neural Image Assessment", published 09/2017.
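
As a rough illustration of what a NIMA-style model looks like, here is a hedged sketch of a MobileNet backbone with a 10-bucket softmax head, following the idea of the NIMA paper; the exact layers, dropout rate and training setup are assumptions, not the team's actual code.

```python
# Hedged NIMA-style sketch (assumptions: tf.keras, MobileNet backbone,
# 10 score buckets as in the NIMA paper; not the deck's actual implementation).
import numpy as np
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import MobileNet

# ImageNet-pretrained backbone without its classification head
base = MobileNet(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))

# NIMA predicts a distribution over score buckets 1..10, not a single number
x = layers.Dropout(0.75)(base.output)
outputs = layers.Dense(10, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)

def mean_score(distribution):
    """Collapse a predicted 10-bucket distribution into one quality score."""
    buckets = np.arange(1, 11)
    return float(np.sum(buckets * distribution))

# Usage (assuming `batch` is a preprocessed image batch of shape (1, 224, 224, 3)):
# probs = model.predict(batch)[0]
# print(mean_score(probs))
```

The same kind of head can in principle back both the aesthetic and the technical model, trained on different labels.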
  7. Results - First Iteration
     Aesthetic model - MobileNet
     • Linear correlation coefficient (LCC): 0.5987
     • Spearman's rank correlation coefficient (SRCC): 0.6072
     • Earth Mover's Distance: 0.2018
     • Accuracy (threshold at 5): 0.74
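
For readers unfamiliar with these metrics, the snippet below shows one way to compute them with NumPy/SciPy on made-up ground-truth and predicted mean scores. Note that NIMA computes the Earth Mover's Distance between score distributions, so the 1-D version on mean scores here is a simplification, not the evaluation the team actually ran.

```python
# Illustrative metric computation on made-up scores (not the deck's numbers).
import numpy as np
from scipy.stats import pearsonr, spearmanr, wasserstein_distance

y_true = np.array([6.2, 4.8, 7.1, 5.0, 3.9])  # ground-truth mean scores (placeholder)
y_pred = np.array([5.9, 5.1, 6.4, 4.7, 4.5])  # predicted mean scores (placeholder)

lcc, _ = pearsonr(y_true, y_pred)              # linear correlation coefficient
srcc, _ = spearmanr(y_true, y_pred)            # Spearman's rank correlation coefficient
emd = wasserstein_distance(y_true, y_pred)     # simplified 1-D Earth Mover's Distance
accuracy = np.mean((y_true > 5) == (y_pred > 5))  # binary accuracy with threshold at 5

print(f"LCC={lcc:.4f} SRCC={srcc:.4f} EMD={emd:.4f} ACC={accuracy:.2f}")
```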
  8. Learnings
     • The first results were not good, but we only learned that because we released it
       ◦ More domain-specific data is needed
     • We could load test our applications, which is very valuable (see the sketch below)
       ◦ Used MobileNet instead of VGG-16
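
Since load testing is called out as valuable here, below is a minimal sketch of how such a load test could look with Locust; the tool choice, the hypothetical /predict endpoint and the payload are assumptions, as the deck does not name them.

```python
# Hypothetical load test with Locust (assumption: Locust 1.x+ and a JSON
# /predict endpoint; run with `locust -f loadtest.py --host=http://<service>`).
from locust import HttpUser, task, between

class PredictionUser(HttpUser):
    # Each simulated user waits 0.1-0.5 s between requests
    wait_time = between(0.1, 0.5)

    @task
    def predict(self):
        # Fire a prediction request with an example payload
        self.client.post("/predict", json={"price": 99, "pis": 8})
```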
  9. Second Iteration
     • We built a simple labeling application
     • ~12 people from idealo Reise and Data Science labeled
       ◦ 1,000 hotel images for aesthetics
       ◦ 3,000 hotel images for technical quality
     • We fine-tuned the aesthetic model with 800 training images
     • Built an aesthetic test dataset with 200 images
  10. This is our tech stack... only an extract ;)
      Categories: PyData, Deep Learning, Big Data, Computer Vision, NLP, Production, Machine Learning, Visualization, Data Preparation
  11. Learnings
      • Data changes constantly, so monitor your model performance on a regular basis
      • A re-training pipeline is also important
      • Don't do it manually; use appropriate tools for this, e.g. Apache Airflow (see the sketch below)
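
As an illustration of the "don't do it manually" point, here is a minimal sketch of a scheduled re-training pipeline as an Apache Airflow DAG. The DAG name, task functions and weekly schedule are placeholders, and it assumes Airflow 2.x; the deck does not show the team's actual pipeline.

```python
# Hedged sketch of a re-training pipeline as an Airflow DAG (Airflow 2.x assumed;
# the DAG id, schedule and task bodies are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_training_data():
    """Pull the latest labeled samples from the data store (placeholder)."""

def retrain_model():
    """Re-train the model and persist the new artifact (placeholder)."""

def evaluate_and_deploy():
    """Compare against the current model and deploy only if it improves (placeholder)."""

with DAG(
    dag_id="model_retraining",       # hypothetical name
    start_date=datetime(2019, 6, 1),
    schedule_interval="@weekly",     # re-train on a regular basis
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_training_data)
    train = PythonOperator(task_id="train", python_callable=retrain_model)
    deploy = PythonOperator(task_id="deploy", python_callable=evaluate_and_deploy)

    extract >> train >> deploy
```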
  12. Learnings
      • Use git
      • Dockerize, a.k.a. containerize, everything
      • Use conda and/or pip for package management
      • Automatic pipeline management (testing, data)
      • TDD & API-first strategy (everything as a microservice; see the sketch below)
      • Don't use Jupyter notebooks for production systems
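
To illustrate the "everything as a microservice" point, here is a minimal sketch of an API-first prediction service. Flask, the /predict endpoint and the pickled model artifact are illustrative assumptions rather than the team's actual stack.

```python
# Hypothetical prediction microservice (assumptions: Flask and a pickled
# scikit-learn-style model shipped inside the container as model.pkl).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:  # hypothetical model artifact
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                    # e.g. {"price": 99, "pis": 8}
    features = [[payload["price"], payload["pis"]]]
    probability = float(model.predict_proba(features)[0][1])
    return jsonify({"sale_probability": probability})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A service cut like this is easy to containerize and to test endpoint by endpoint, which is what the TDD & API-first strategy aims at.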
  13. Summary
      1. Think simple first and then, if it's really needed, get more complex
      2. Define your data product MVP and release as early as possible
      3. Creating data products is a team sport
      4. Use the right tool for the right problem
      5. Use the cloud
      6. Measure your model and improve it from time to time
      7. Your results need to be reproducible
      8. Prioritize the projects with the biggest business impact