
Data knows better workshop at Ironhack: Hobby data science projects

longhow lam
August 19, 2019


Presentation of some of my hobby data science projects. How to apply data science and machine learning algorithms to #daily #life #problems.

For example:
* Uncovering the topics and relations in #soap series,
* Which #restaurants to visit,
* Predict the #prices of homes and cars,
* How to select an #IKEA product
* The Dutch movie world in a network graph

Without going into too much technical detail, the presentation is intended for a broad audience.


Transcript

  1. DATA KNOWS BETTER: A beginner's guide to machine learning JUST

    SOME APPLICATIONS OF MACHINE LEARNING IN DAILY LIFE Let’s link on LinkedIn https://www.linkedin.com/in/longhowlam Longhow Lam Freelance data scientist: Just contact me if you need me :-)
  2. AGENDA Ø INTRODUCTION Ø JAAP.NL HOME ANALYTICS (REGRESSION ANALYSIS, TEXT

    MINING) Ø SOAP ANALYTICS: GTST AND THE BOLD (TEXT MINING, TOPIC MODELING, WORD EMBEDDINGS) Ø IKEA ANALYTICS (COMPUTER VISION, TIME SERIES) Ø CAR ANALYTICS: BMW’S, PEUGEOTS, PRICES & OIL (COMPUTER VISION, SPLINE REGRESSION ANALYSIS) Ø RESTAURANTS ANALYTICS, DUTCH FILM WORLD (ASSOCIATION RULES MINING, GRAPH ANALYSIS)
  3. INTRODUCTION • An overview of different data science (machine learning)

    techniques • Applied to some 'playful' hobby projects • I can never disclose company data in public → Scrape data • However, all techniques are applied in “real life” • My data science tool set:
  4. INTRODUCTION But you need to keep learning your whole life! Data

    science environments evolve, come and go. I once did stuff in SPSS....
  5. JAAP.NL HOME ANALYTICS BUSINESS ISSUE What is the price of

    a home? Can we predict home values? Approach Scrape Jaap.nl for data
  6. PREDICTIVE MODEL A FEW CLICKS AND YOU ARE THERE! There

    are many machine learning algorithms; just a few: Tree ensembles • Random Forests • Boosting trees Linear regression y = f(x) = a0 + a1 x1 + a2 x2 + … + an xn Neural networks y = f(g(h(x))) A single tree MORE COMPLEX & LESS EXPLAINABLE vs. SIMPLE & EXPLAINABLE Which one to use? Start with the simple ones and use more complex models if needed
  7. LINEAR REGRESSION Simple & explainable, but not the most accurate

    Parameter                                Price effect (€)
    Intercept                                24,006
    First 2 digits postal code: 10           240,839
    First 2 digits postal code: 96           −103,000
    First 2 digits postal code: 12           204,591
    First 2 digits postal code: 79           −49,002
    ...
    Type of home: Villa                      173,000
    Type of home: Tussenwoning (terraced)    −41,000
    Type of home: Vrijstaand (detached)      73,000
    ...
    Living area, per m2                      2,064
    Number of rooms, each extra room         4,500
  8. K-NEAREST NEIGHBOR METHOD Is not a ‘real’ model, i.e. no

    explicit formula. Ø Given a house x0 that you want to price, Ø Find the k houses x1, x2, ..., xk that are closest to x0 Ø Predict the home value of house x0 as the average of the k houses found. 5 nearest neighbors of x0
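The three steps above can be sketched in a few lines of plain Python (the coordinates and prices below are made-up toy data, not the jaap.nl set):

```python
import math

def knn_predict(houses, target, k=5):
    """Predict the price of `target` as the mean price of its k nearest houses.

    `houses` is a list of (longitude, latitude, price) tuples;
    `target` is a (longitude, latitude) pair.
    """
    # Sort the known houses by Euclidean distance to the target house
    by_distance = sorted(houses, key=lambda h: math.dist(h[:2], target))
    # Average the prices of the k closest ones
    nearest = by_distance[:k]
    return sum(price for _, _, price in nearest) / k

# Toy data: (longitude, latitude, price in €)
known = [
    (4.90, 52.37, 400_000),
    (4.91, 52.37, 420_000),
    (4.89, 52.36, 380_000),
    (5.50, 52.00, 250_000),
    (5.51, 52.01, 260_000),
]
print(knn_predict(known, (4.90, 52.36), k=3))  # 400000.0: mean of the 3 nearby houses
```

With only location as a feature this is exactly the broker-style pricing of slide 10; in practice you would add rooms, surface area, etc. to the distance.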
  9. K-NN METHOD, WHICH K TO TAKE? 1 nearest neighbor 15

    nearest neighbors K too small → too wobbly, too much noise K too large → too smoothed, lost the signal
  10. K-NN EXAMPLE Take the location (long/lat) of a house

    for which you want a price. Look at the nearest houses for which you know the price. JAAP.NL HOMES The K-NN prediction is now just the average of the nearest houses. Calculating prices like this is very common among home brokers
  11. K-NN EXAMPLE JAAP.NL HOMES For my home data set JAAP

    It turns out k = 5 is the best value: largest R2 on the test set
  12. OK AND NOW? WHO WANTS TO USE MY MODEL? Mail

    me your dream house with its features, and I can price it. Any application that needs a home value prediction. Eeuhm NO, DON'T DO THAT! Create a REST API from the model and put it on a server online. Dataiku facilitates this process of model deployment
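The REST-API idea can be sketched with Python's standard library alone. The "model" here is a hand-made linear formula standing in for the real one, and the endpoint shape merely mimics the deployed service on the next slide; it is not the actual Dataiku deployment:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_price(features):
    """Toy stand-in for the real model: a made-up linear formula."""
    base = 24_000
    return base + 2_000 * features.get("Oppervlakte", 0) + 4_500 * features.get("kamers", 0)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, score it, and return the prediction as JSON
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = {"prediction": predict_price(payload["features"]), "ignored": False}
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 12000), PredictHandler).serve_forever()
print(predict_price({"HouseType": "Tussenwoning", "kamers": 6, "Oppervlakte": 134}))  # 319000
```

Any client can then POST features and get a price back, which is exactly what the curl call on the next slide does against the real service.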
  13. OK AND NOW? WHO WANTS A PREDICTION? curl -X POST

    http://188.166.112.55:12000/public/api/v1/house_xgboost/pc2model/predict --data '{ "features": { "HouseType": "Tussenwoning", "kamers": 6, "Oppervlakte": 134, "VON": 0, "PC": "16" }}'
    {"prediction":241287.40,"ignored":false}
    For PC "10": {"prediction":607246.62,"ignored":false}
  14. HOME VALUE PREDICTION WITH LASSO REGRESSION OR XGBOOST TERM DOCUMENT

    MATRIX Super sparse: 65,000 rows and 50,000 columns, but a lot of zeros!
                 home value  kitchen  big_garden  garage  ..(many more terms)..  swimming_pool
    home 1       235,000     1        0           1       ...                    0
    home 2       450,000     0        1           0       ...                    0
    home 3       376,000     1        0           0       ...                    0
    ...          ...         ...      ...         ...     ...                    ...
    home 65,000  621,000     1        1           ...     ...                    1
    SO: Each home description is now a long vector of 50,000 numbers. Mathematically we have a point in the 50,000-dimensional space ℝ!
  15. TERM DOCUMENT MATRIX For each term (column) estimate a β

    (beta) parameter. Too many columns for normal linear regression: regularization is needed! For example: “lasso” regression. VALUE PREDICTION WITH LASSO REGRESSION
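What the lasso penalty does to an individual β can be illustrated with the simplest (orthonormal-design) case, where the lasso estimate is the ordinary least-squares estimate shrunk by soft-thresholding; the numbers below are illustrative, not the deck's fitted values:

```python
def soft_threshold(beta_ols, lam):
    """Lasso shrinkage in the orthonormal case.

    Coefficients smaller (in absolute value) than lam are set exactly to 0,
    which is why lasso drops most of the 50,000 terms entirely.
    """
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0

# Large effects survive (shrunk towards 0), small ones are dropped entirely
print(soft_threshold(60_168, lam=5_000))   # 55168
print(soft_threshold(1_200, lam=5_000))    # 0.0
print(soft_threshold(-39_644, lam=5_000))  # -34644
```

In the general (non-orthonormal) case the same shrink-and-select behaviour comes out of the full penalized optimization, as in the coefficients on the next slide.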
  16. LASSO REGRESSION NEGATIVE AND POSITIVE BETA PARAMETERS R2 = 0.66

    Positive betas (€):               Negative betas (€):
    Intercept           238,260      parkkosten           −39,644
    familiehuis          60,168      recreatiebungalow    −32,614
    vrijstaande_villa    48,180      bungalowpark         −31,801
    belegging            45,814      limburgse            −23,483
    beleggingsobject     42,543      2_kamer              −23,034
    entree_vestibule     41,674      plinten              −22,510
    rijksmonument        39,379      overdekt_zwembad     −21,971
    recreatief           39,142      2_kamerappartement   −20,625
    verhuurd             36,171      aannemer             −20,314
    detaillering         35,000      recreatiewoning      −19,748
    visgraat             33,589      proeven              −19,631
    eigen_badkamer       33,454      betaalbaar           −19,621
    woningen_1           33,321      starterswoning       −19,502
    toiletten            32,836      kunststofkozijnen    −18,775
    representatieve      31,904      eigen_gebruik        −18,430
    gezinshuis           31,297
  17. SOAP ANALYTICS TEXT ANALYTICS Business pain Looking at GTST, what

    the heck is it all about? Are there trends in the series? Is it not all the same? Approach Take 5000 recaps and apply text mining / topic modeling
  18. SOAP ANALYTICS FINDING TOPICS The 5000 recaps can be transformed

    into a Term Document Matrix. So each recap is again a point in the 50,000-dimensional space. Instead of predicting we can cluster: create TOPICS
  19. All recaps as one pile Cluster algorithm Topic 1 Described

    by key words Topic 2 Described by key words Topic 3 Described by key words Topic 4 Described by key words
  20. SOAP ANALYTICS ZOOM IN ON A TOPIC Sub-topics: topic 16

    (Ludo, Isabelle, Martine, Janine) · Harmsen feels alone · Jack's plan, dangerous · Writing a farewell letter · Panic, fear · Questions about children · Paying money, getting money back
  21. SOAP ANALYTICS ARE ALL EPISODES NOT THE SAME? Create a

    3-dimensional UMAP (Uniform Manifold Approximation and Projection) of all 5000 Goede Tijden Slechte Tijden episodes Ø Each GTST episode is now a high-dimensional vector Ø Project this vector to a lower-dimensional space Ø It is easier to see if points are grouped together or not
  22. WORD EMBEDDINGS IN BOLD & BEAUTIFUL RECAPS Term Document Matrix

    Each document / recap is a vector of numbers Word embedding Each word is a vector of numbers A word embedding has to be trained / learned from a large collection of documents / recaps Amsterdam = (0.83, 0.89, 0.34, … , 0.63, 0.19) Steffy = (0.33, 0.19, 0.79, … , 0.13, 0.01) Germany = (0.72, 0.65, 0.43, … , 0.36, 0.57) Laugh = (0.85, 0.77, 0.24, … , 0.88, 0.29) … … See code in my GitHub repo
  23. WORD EMBEDDINGS LINGUISTIC REGULARITIES Closest words Word relations 250 dimensional

    space president trump car media press house man woman king queen vector(“man") − vector(“woman") is roughly vector(“king”) − vector(“queen") Trump speaks with the press The president talks to the media
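The man/woman/king/queen regularity can be checked with cosine similarity on toy vectors (these 3-dimensional vectors are invented for illustration; real embeddings like the 250-dimensional ones above are learned from data):

```python
import math

# Made-up toy embeddings, hand-crafted so the analogy works
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.2],
    "woman": [0.5, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# vector("king") - vector("man") + vector("woman") should land near "queen"
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
best = max(vectors, key=lambda word: cosine(vectors[word], target))
print(best)  # queen
```

The same cosine-similarity calculation produces the "closest words" lists on the slides that follow.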
  24. WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE 4000 daily recaps scraped

    from the last 15 years. These recaps contain around 10,000 unique words
  25. WORD EMBEDDINGS SOME R CODE Then from the vocabulary create the term

    co-occurrence matrix and the embeddings
  26. WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE Closest words to "steffy":

    1 steffy 1.00, 2 liam 0.82, 3 hope 0.79, 4 said 0.78, 5 wyatt 0.76, 6 bill 0.69, 7 asked 0.68, 8 quinn 0.67, 9 agreed 0.65, 10 rick 0.65
  27. WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE Closest words to the

    vector Steffy − Liam: death 0.223, furious 0.2006, lastly 0.1963, excused 0.1958, frustration 0.1950, onset 0.1937
  28. WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE Closest words to the

    vector Bill − anger: liam 0.5550, katie 0.4845, wyatt 0.4829, steffy 0.4645, quinn 0.4491, said 0.4201
  29. IMPORTANT: Business validation! I asked my wife, a loyal GTST

    and TBTB follower. She completely recognizes the results of my analyses!
  30. CHAT BOT A little digression: Neural Networks (NN) Ø Simple

    (shallow) NNs have existed for a long time. Use input features (# rooms, surface, region, home type: x1, x2, x3, x4) and one or two hidden layers (z1, z2, z3) to predict an outcome y, e.g. home value € 239,111. With labeled data (train data): send data through the network to produce a prediction → forward propagation. You know the 'ground truth', so the error of the prediction is now known; update the weights w → backward propagation. Fully connected network
  31. CHAT BOT A little digression: Neural Networks (NN) Some years

    ago deep learning became popular: neural nets with many hidden layers and a 'special structure' in the hidden layers. For normal (boring :-) ) structured data the added value of deep learning is limited
  32. CHAT BOT A little digression: Neural Networks (NN) Ø For

    texts and images using deep neural nets is key! The networks are not fully connected: special structures are used • Convolutional neural networks for images • Recurrent neural networks for texts
  33. CHAT BOT Ø Term Document Matrices and topic modeling are

    relatively old, so-called "shallow methods" for texts. Ø Relatively new are deep learning methods for texts. Ø One example is Sequence to Sequence (Seq2Seq) models, a special form of RNN. Use cases are machine translation, text summarization and question/answering (chat bots).
    Train data for machine translation (input → output): I am home → Ik ben thuis; I am not old → Ik ben niet oud; There are men → Er zijn mannen; ....
    Train data for a chat bot (input → output): Where are you? → I am at school; What is that? → That is a car; Can you help me? → No I won't; ...
    THE BOLD SUBTITLES: "I am home now" → Seq2Seq model → "Ik ben nu thuis". Input: a sequence of words; output: another sequence of words
  34. CHAT BOT THE BOLD SUBTITLES The sequence of words is

    fed word for word into the encoder: Ø The first word enters the encoder, Ø a so-called "hidden state" is returned, Ø the second word and the previous hidden state enter the encoder, Ø a new hidden state is returned, Ø etc., etc. Ø Then the decoder takes these hidden states to generate the output. INPUT OUTPUT Animation taken from Jay Alammar's blog
  35. THE BOLD SUBTITLES CHAT BOT So-called SRT subtitle text

    files. For The Bold I have only 1 file: 177 sentence pairs only!! Not ideal, but: assume a conversation of input 1, output 2, input 3, output 4, etc. Feed this train data to a Seq2Seq model. Shamelessly reused the PyTorch seq2seq tutorial. Here is my Python code and the srt file.
  36. CHAT BOT SOME EXAMPLE INPUTS: OUTPUT FROM THE BOLD BOT.....

    input = are you going ?
    input = is it your first time ?
    input = how is stephanie ?
    input = what the hell stephanie ?
    output = jackie nobody . stephanie oh please . <EOS>
    output = well so i want you to go back to forrester . <EOS>
    output = my childhood was a lot different from yours kat . <EOS>
    output = yeah . <EOS>
    Need more data……… preferably the scripts!
  37. Can you shake hands with the two neighbors? A LITTLE

    STATISTICAL EXPERIMENT Two statistics I like to share with you:
  38. 50.1% of all people don't wash their hands after using

    the toilet A LITTLE STATISTICAL EXPERIMENT
  39. 84.6% of all statistics are made up on the spot!!

    A LITTLE STATISTICAL EXPERIMENT
  40. IKEA WEBSITE KEEP TRACK OF THE STOCK LEVEL The IKEA

    Billy Index Change of stock over time
  41. IKEA BILLY SALES FORECAST Time series forecast model: y(t) =

    g(t) + s(t) + h(t) + e(t), where y is the quantity you want to forecast (i.e. Billy sales), g the (long-term) trend term, s the seasonality terms, h the holiday and event terms, and e the error term
  42. DEEP LEARNING PRE-TRAINED NETWORKS Deep learning: neural networks with ‘many’

    hidden layers. The so-called deep convolutional networks are very useful for image classification. The VGG16 network consists of millions of parameters and is already trained on millions of labeled images over 1000 classes (1. dog, 2. cat, 3. car, 4. house, 5. plane, 6. tree, …, 999. castle, 1000. chair). We can just re-use this network
  43. DEEP LEARNING PRE-TRAINED NETWORKS Just strip off the top layers

    so that we have a network that generates image features vgg16_notop = application_vgg16(weights = 'imagenet', include_top = FALSE) One line of code in R or Python REMOVED
  44. IKEA PRODUCT IMAGES HACKATHON AT IKEA DECEMBER 2017 • Scrape

    9000 product images from the Ikea web site • Score each image with the pre-trained VGG network • Create an R shiny app to upload an image • Determine the Ikea images that are closest to your image
  45. CAR ANALYTICS Papa, is that a BMW or a Peugeot?

    What does a car cost per KM? Am I leaking oil?
  46. IMAGE CLASSIFIER Create your own Peugeot / BMW classifier model

    Flatten PEUGEOT BMW Download 150 BMW images and Download 150 Peugeot images. You can then do: 1. Train a “small” CNN from scratch 2. Take an existing pre-trained CNN and retrain top layer(s) (TRANSFER LEARNING) Convolutional and pooling layers in the network
  47. IMAGE CLASSIFIER Create your own Peugeot / BMW classifier model

    Approach 1 does not really work on my images, too few images…..
  48. IMAGE CLASSIFIER Create your own Peugeot / BMW classifier model

    Flatten 8192 Dense 256 PEUGEOT BMW Download 150 BMW images and Download 150 Peugeot images. You can: 1. Train a “small” CNN from scratch 2. Take an existing pre-trained CNN and retrain top layer(s) (TRANSFER LEARNING)
  49. IMAGE CLASSIFIER Create your own Peugeot / BMW classifier model

    One epoch on my laptop takes 382 sec. Very long! On a GCP Tesla V100 GPU: 24 sec. Accuracy of ~82% on validation pictures. Not very good, I think my son can do better! TRANSFER LEARNING
  50. IMAGE CLASSIFIER Google AutoML Vision Ø No code required, Just

    upload labeled images Ø Runs fast on the Google cloud with GPUs Ø Much more accurate: years of Google research under the hood Ø Model is ready to be used! Ø But it's not free, $$ for training and using Ø You don't have the model.
  51. THE COSTS OF A CAR per kilometer? www.gaspedaal.nl: 435,000

    cars Car data set with price, mileage, make and model, and gear type
  52. NON LINEAR RELATION? A little digression Splines can help you!

    Spline fit: piecewise polynomials that are glued together smoothly. Continuous slopes!! Four ways to fit price (P) against kilometers (KM): ordinary linear fit (simple: the slope is the price per KM, but it is not accurate), step-wise fit (jumps in the prediction), piecewise linear fit (jumps in the slopes), and a spline fit
  53. THE COSTS OF A CAR Per kilometer Renault CLIO Price

    vs. kilometers driven (0 to 250K). Notice the difference in new value. R CODE: Clio = gaspedaal %>% filter(Merk == "Renault", Model == "Clio"); Model_out = lm(Price ~ ns(mileage, 7), Clio)
  54. THE COSTS OF A CAR Per kilometer BMW 3-Serie Price

    vs. kilometers driven (0 to 300K)
  55. Price per kilometer BMW 3-Serie Small shiny R app to

    see for yourself. Have fun! Cents per KM vs. kilometers driven
  56. WHICH CAR Will leak the most oil? RDW open data

    car data available: • 9 mln. Dutch cars with features • 21 mln. APK maintenance services
    date      defect       kenteken (license plate)  car make  age
    13-04-17  oil leakage  1-abc-12                  Peugeot   12
    19-05-18  brake        2-xyz-32                  Opel      9
    23-02-17  lights       3-klm-34                  Kia       7
    ...       ...          ...                       ...       ...
    03-10-17  gear         5-qwe-45                  BMW       8
    Now you can derive: Ø how many cars are of make ABC and of age X Ø how many cars are of make ABC and age X and have an oil leakage
  57. RESTAURANT ANALYTICS Business pain Ø What are the key words

    for good / bad restaurants? Ø I have eaten Chinese, OK nice! But where to eat the next time? Approach Look at restaurant reviews and look where the other reviewers went
  58. RESTAURANT ANALYTICS Ø Same text mining approach as in the "jaap

    home descriptions" Ø We now have a binary target: BAD = "score < 5", GOOD = "score > 8" Ø Train a regularized logistic regression and look at the largest and smallest coefficients. Good and bad words:
    Review    aardig  eten  italiaans  keuken  ...  vis  zout  Target
    Review 1  1       0     0          0       ...  1    1     BAD
    Review 2  0       1     0          1       ...  0    1     BAD
    ...       ...     ...   ...        ...     ...  ...  ...   ...
    Review N  0       0     1          1       ...  1    0     GOOD
  59. TERM DOCUMENT MATRIX For each term (column) estimate a β

    (beta) parameter, just as in ordinary linear regression in our jaap example. Too many columns for logistic regression: regularization is needed! TARGET PREDICTION WITH LASSO LOGISTIC REGRESSION: β_lasso = argmin_β Σ_i log(1 + exp(−y_i x_i'β)) + λ‖β‖₁
  60. RESTAURANT ANALYTICS Good and bad words size of word is

    related to β estimates Reviews data and Jupyter notebook with analysis on my GitHub
  61. A FEW FACTS… IENS DATA (TRADITIONAL BI) Most occurring restaurant

    name: 39 times among Dutch restaurants. % sustainable kitchens: Biological (67%), French (58%), Fish (44%), Vegetarian (39%), …, Chinese (3%). 700 reviews on a "normal" Saturday; Valentine's Day 2015 had 1200 reviews (1.7 times as many)
  62. A FEW FACTS… DON'T SAY TOO MUCH, THAT IS BAD

    :-) Distribution of the number of words in a review: lower scores observed for longer reviews
  63. ASSOCIATION RULES MINING ALSO CALLED MARKET BASKET ANALYSIS Identify frequent

    item sets (rules) in transactional data: IF items A and B THEN item C: {A, B} → {C}; IF item X THEN items Y and Z: {X} → {Y, Z}. When is a rule frequent? If the 'support' > a threshold: Support{X → Y} = (# trxs. containing X and Y) / (total # trxs.). Support Chips → Beer 0.823%; Chips → Milk 0.002%
  64. ASSOCIATION RULES MINING ALSO CALLED MARKET BASKET ANALYSIS Lift &

    Confidence Other statistics used to assess the usefulness of a rule: Lift{X → Y} = Support{X → Y} / (Support(X) * Support(Y)); Conf{X → Y} = Support{X → Y} / Support(X). Example: a lift of 8.3 for {Chips} → {Beer} means: if I know someone has already bought chips, then it is 8.3 times more likely that he will also buy beer
  65. IENS RESTAURANT LENGTH TWO RULES A → B Interactive

    network Very generic rules, lift is not really high
  66. IENS RESTAURANT LENGTH THREE RULES A, B → C Interactive

    picture Much more specific, higher lift
  67. IENS RESTAURANT VIRTUAL ITEMS: MAKE IT EVEN MORE PERSONAL Transaction

    data with customers and items:
    customer 1: A, X
    customer 2: A, B, C
    customer 3: E, T
    customer 4: S
    Possible rules: { A, B } → { C }, { X } → { Z }. Add customer features as virtual items:
    customer 1: A, X, Male, (18, 25]
    customer 2: A, B, C, Male, (45, 65]
    customer 3: E, T, Male, (30, 35]
    customer 4: S, Male, (30, 35]
    Possible rules: { Male, (18, 25], A, B } → { C }, { Female, (40, 45], X } → { Z }
  68. GRAPH BASICS A FEW TERMS Node or Vertex: a point in the

    network. Edge or Link: a relation between two nodes (can be directional)
  69. GRAPH BASICS A FEW TERMS Node centrality How central is a node? *

    Degree (number of connections) * Betweenness (number of shortest paths through a node) * Eigencentrality (Google's PageRank is a version of this) Community detection Are there nodes that belong together? Example: nodes 6 and 3 have the same degree, but node 6 has a higher betweenness than node 3
  70. WWW.IMDB.COM INTERNET MOVIE DATABASE Download movie data: * Dutch movies

    in the last 25 years * Per movie we know the cast and crew * A node is a person * Node X links with node Y if X and Y were in the same movie. R script of this analysis: see my GitHub repo
  71. DUTCH MOVIE WORLD IN A NETWORK GRAPH Interactive graph Actors

    are in the nodes data.frame, their links are in the edges data.frame: visNetwork(nodes, edges). nodes: Chantal Jantzen, Hans de Wolf, … edges: Chantal Jantzen – Stef Tijding; Hans de wolf – Jeroen Krabee; Johan Nijenhuis – Frans van Gestel; …
  72. COMMUNITIES There are 1257 persons They are divided into 191

    communities Take community 6: 54 persons in a wordcloud (centrality based)
  73. Thanks for your time! Questions? Need me as Freelancer? Let’s

    have a cup of coffee https://www.linkedin.com/in/longhowlam https://longhowlam.wordpress.com/ @longhowlam