First Steps Towards Your First Machine Learning Project
Galuh Sahid (Twitter: @galuhsahid)
Data Scientist at Gojek
Google Developer Expert in Machine Learning

I love…
• Sports
  - Idea: build a model that predicts championship winners based on past history
• Literature
  - Idea: text generator in the style of famous authors
• Movies
  - Idea: classify movie review sentiments

Formulating an ML problem
Supervised Learning

Leaf Width | Leaf Length | Species
2.7        | 4.9         | small-leaf
3.2        | 5.5         | big-leaf
2.9        | 5.1         | small-leaf
3.4        | 6.8         | big-leaf

Adapted from Google’s Machine Learning Problem Framing

Formulating an ML problem
Supervised Learning

Leaf Width (feature) | Leaf Length (feature) | Species (label)
2.7                  | 4.9                   | small-leaf
3.2                  | 5.5                   | big-leaf
2.9                  | 5.1                   | small-leaf
3.4                  | 6.8                   | big-leaf

The features and their corresponding labels are fed into an algorithm in a process called training.

What happens during training? The algorithm will gradually determine the relationship between features and their corresponding labels. This relationship is called the model.

Adapted from Google’s Machine Learning Problem Framing

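The idea of training can be sketched in a few lines of plain Python. The example below uses the four rows of the leaf table and a nearest-centroid rule, which is only a simple stand-in chosen for illustration (the slides do not name an algorithm, and this one learns in a single pass rather than gradually). The function names `train` and `predict` are assumptions of this sketch.

```python
# Training data copied from the leaf table: (leaf width, leaf length) -> species.
features = [(2.7, 4.9), (3.2, 5.5), (2.9, 5.1), (3.4, 6.8)]
labels = ["small-leaf", "big-leaf", "small-leaf", "big-leaf"]


def train(features, labels):
    """Learn one centroid (average feature vector) per label.

    The returned dict plays the role of the "model": the learned
    relationship between features and labels.
    """
    sums = {}
    counts = {}
    for (width, length), label in zip(features, labels):
        total_w, total_l = sums.get(label, (0.0, 0.0))
        sums[label] = (total_w + width, total_l + length)
        counts[label] = counts.get(label, 0) + 1
    return {
        label: (sums[label][0] / counts[label], sums[label][1] / counts[label])
        for label in sums
    }


def predict(model, width, length):
    """Return the label whose centroid is closest to the given leaf."""
    def squared_distance(centroid):
        return (centroid[0] - width) ** 2 + (centroid[1] - length) ** 2

    return min(model, key=lambda label: squared_distance(model[label]))


model = train(features, labels)
print(predict(model, 2.8, 5.0))  # a leaf that was not in the training data
print(predict(model, 3.3, 6.5))
```

With only four rows, a model like this is far too small to be reliable; the point is just to show features and labels going in and a reusable model coming out.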
Formulating an ML problem
Supervised Learning: Classification (Binary)
• Is this spam or not spam?
• Is the sentiment of this movie review positive or negative?
• Is this a picture of a cat or a dog?

Formulating an ML problem
Supervised Learning: Classification (Multi-class)
• Is this a comedy, horror, or drama movie?
• Is this the voice of a dog, a bird, a cat, or a grasshopper?
• Is this a picture of a shirt, a book, or food?

Formulating an ML problem
Supervised Learning: Regression
• What is the price of a house with 2 floors, 4 bedrooms, and 2 bathrooms?
• What will the temperature in Paris be tomorrow?

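One rough way to tell these three framings apart is to look at what the labels are. The sketch below encodes that as a heuristic: numeric labels suggest regression, two distinct categories suggest binary classification, and more than two suggest multi-class classification. The function name `problem_type` is hypothetical, the temperature values are made up, and the rule itself is an assumption of this sketch (for instance, classes encoded as 0 and 1 would be misread as regression).

```python
def problem_type(labels):
    """Guess the supervised-learning framing from a list of example labels.

    Heuristic only (an assumption of this sketch, not a general rule).
    """
    if not labels:
        raise ValueError("need at least one label")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in labels):
        return "regression"
    distinct = set(labels)
    if len(distinct) == 2:
        return "binary classification"
    if len(distinct) > 2:
        return "multi-class classification"
    raise ValueError("need at least two distinct labels to classify")


print(problem_type(["spam", "not spam", "spam"]))
print(problem_type(["comedy", "horror", "drama"]))
print(problem_type([21.5, 19.0, 23.2]))  # hypothetical temperatures in Paris
```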
Formulating an ML problem
Identifying Data Sources
• Use ready-to-use dataset
• Collect & build your own dataset from scratch
• Extract the data from existing data sources

Formulating an ML problem
Ready-to-use dataset
• Data cleansing, manipulation, and transformation are often still necessary
• You need to know the labels that you expect

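Both points can be illustrated with a small cleansing pass. The sketch below is only an example under stated assumptions: the raw rows are hypothetical (imagined as read from a downloaded CSV of the leaf data), and `clean` and `EXPECTED_LABELS` are names made up for this illustration.

```python
# The labels we expect to see, taken from the leaf example.
EXPECTED_LABELS = {"small-leaf", "big-leaf"}

# Hypothetical raw rows, as they might come out of a downloaded CSV file.
raw_rows = [
    ["2.7", "4.9", "small-leaf"],
    [" 3.2", "5.5 ", "Big-Leaf"],   # stray spaces, inconsistent casing
    ["2.9", "", "small-leaf"],      # missing leaf length
    ["3.4", "6.8", "medium-leaf"],  # a label we did not expect
]


def clean(rows, expected_labels):
    """Return (cleaned, rejected): usable records and the rows set aside."""
    cleaned, rejected = [], []
    for row in rows:
        # Cleansing: trim whitespace and normalise label casing.
        width, length, label = (cell.strip() for cell in row)
        label = label.lower()
        try:
            # Transformation: turn text into numbers.
            record = (float(width), float(length), label)
        except ValueError:
            rejected.append(row)  # missing or non-numeric measurement
            continue
        if label not in expected_labels:
            rejected.append(row)  # unknown label, needs a human decision
            continue
        cleaned.append(record)
    return cleaned, rejected


cleaned, rejected = clean(raw_rows, EXPECTED_LABELS)
print(cleaned)
print(rejected)
```

Rejected rows are kept rather than silently dropped so that they can be inspected; whether to fix or discard them depends on the project.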
Formulating an ML problem
Extract the data by yourself
- Scraping Twitter data
  Example applications:
  - Sentiment analysis (must be labeled)
  - Topic detection, e.g. 50% of the tweets: Chris Evans’ new TV series, 25% of the tweets: Avengers, 25% of the tweets: Golden Globes
- Scraping news websites

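The topic-detection output above (a share of tweets per topic) can be sketched as follows. This is a deliberately naive keyword match used as a stand-in for a real topic-detection method; the tweets and keyword lists are hypothetical, and collecting the tweets themselves is not shown.

```python
from collections import Counter

# Hypothetical keyword lists, one per topic named in the example above.
TOPIC_KEYWORDS = {
    "Chris Evans' new TV series": ["chris evans"],
    "Avengers": ["avengers"],
    "Golden Globes": ["golden globes"],
}

# Hypothetical tweets standing in for scraped data.
tweets = [
    "Chris Evans is starring in a new TV series!",
    "Can't wait for the Chris Evans show",
    "Rewatching Avengers tonight",
    "Who won at the Golden Globes?",
]


def detect_topic(text):
    """Return the first topic whose keyword appears in the text, else 'other'."""
    lowered = text.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return topic
    return "other"


def topic_shares(texts):
    """Return the percentage of texts assigned to each detected topic."""
    counts = Counter(detect_topic(text) for text in texts)
    return {topic: 100 * count / len(texts) for topic, count in counts.items()}


for topic, share in topic_shares(tweets).items():
    print(f"{share:.0f}% of the tweets: {topic}")
```

Unlike sentiment analysis, nothing here requires labeled tweets, but the keyword lists have to be chosen by hand, which is the main limitation of this simplified approach.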
Formulating an ML problem
Build your own dataset from scratch
• Might be time-consuming, especially if you need a lot of data
• Need to ensure that the way your data is collected matches real-world conditions
  - Example: building a bird audio dataset by recording sounds of different birds

Formulating an ML problem
Now you know:
- The problem statement of your project (e.g. “Our problem is best framed as 3-class classification, which predicts whether a video will be in one of three classes—{very popular, somewhat popular, not popular}—28 days after being uploaded”)
- What data you need to process (text? images?)
- Whether you need labeled data or not
- Possible algorithms for your project

Adapted from Google’s Machine Learning Problem Framing

Tools & resources
Programming languages
• Python is usually the go-to programming language
• However, you can now train your own machine learning models using JavaScript thanks to TensorFlow.js

Takeaways
• Building machine learning projects can be a great way to learn machine learning
• ML projects don’t have to be super fancy or complicated; they just have to be something you enjoy :)
• The process of formulating your ML problem can help you figure out your next steps (e.g. what data to collect, what tools to use, which algorithms to consider)
• There are lots of resources & tools out there to help you!