
Data knows better workshop at Ironhack: Hobby data science projects

longhow lam
August 19, 2019


Presentation of some of my hobby data science projects. How to apply data science and machine learning algorithms to #daily #life #problems.

For example:
* Uncovering the topics and relations in #soap series,
* Which #restaurants to visit,
* Predict the #prices of homes and cars,
* How to select an #IKEA product
* The Dutch movie world in a network graph

Without going into too much technical detail, the presentation is intended for a broad audience.


Transcript

  1. DATA KNOWS BETTER: A beginner's guide to machine learning JUST

    SOME APPLICATIONS OF MACHINE LEARNING IN DAILY LIFE Let’s link on LinkedIn https://www.linkedin.com/in/longhowlam Longhow Lam Freelance data scientist: Just contact me if you need me :-)
  2. AGENDA Ø INTRODUCTION Ø JAAP.NL HOME ANALYTICS (REGRESSION ANALYSIS, TEXT

    MINING) Ø SOAP ANALYTICS: GTST AND THE BOLD (TEXT MINING, TOPIC MODELING, WORD EMBEDDINGS) Ø IKEA ANALYTICS (COMPUTER VISION, TIME SERIES) Ø CAR ANALYTICS: BMW’S, PEUGEOTS, PRICES & OIL (COMPUTER VISION, SPLINE REGRESSION ANALYSIS) Ø RESTAURANTS ANALYTICS, DUTCH FILM WORLD (ASSOCIATION RULES MINING, GRAPH ANALYSIS)
  3. INTRODUCTION • An overview of different data science (machine learning)

    techniques • Applied to some 'playful' hobby projects • I can never disclose company data in public → Scrape data • However, all techniques are applied in “real life” • My data science tool set:
  4. INTRODUCTION But you need to keep learning your whole life! Data

    science environments evolve, come and go. I once did stuff in SPSS....
  5. JAAP.NL HOME ANALYTICS BUSINESS ISSUE What is the price of

    a home? Can we predict home values? Approach Scrape Jaap.nl for data
  6. PREDICTIVE MODEL A FEW CLICKS AND YOU ARE THERE! There

    are many machine learning algorithms; just a few: Tree ensembles • Random Forests • Boosting trees Linear regression y = f(x) = a0 + a1 x1 + a2 x2 + … + an xn Neural networks y = f(g(h(x))) A single tree MORE COMPLEX & LESS EXPLAINABLE vs. SIMPLE & EXPLAINABLE Which one to use? Start with the simple ones and use more complex models if needed
  7. LINEAR REGRESSION Simple & explainable, but not the most accurate

    Parameter                                Price effect (€)
    Intercept                                24,006
    First 2 digits postal code: 10           240,839
    First 2 digits postal code: 96           −103,000
    First 2 digits postal code: 12           204,591
    First 2 digits postal code: 79           −49,002
    ...
    Type of home: Villa                      173,000
    Type of home: Tussenwoning (terraced)    −41,000
    Type of home: Vrijstaand (detached)      73,000
    ...
    Living area, per m2                      2,064
    Number of rooms, each extra room         4,500
  8. K-NEAREST NEIGHBOR METHOD Is not a ‘real’ model, i.e. no

    explicit formula. Ø Given a house x0 that you want to price, Ø Find the k houses x1, x2, ..., xk that are closest to x0 Ø Predict the home value of house x0 as the average of the k houses found. 5 nearest neighbors of x0
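The three steps above can be sketched in a few lines of plain Python (the coordinates and prices below are made-up toy data, not the jaap.nl set):

```python
import math

def knn_predict(houses, target, k=5):
    """Predict the price of `target` as the mean price of its k nearest houses.

    `houses` is a list of (longitude, latitude, price) tuples;
    `target` is a (longitude, latitude) pair.
    """
    # Sort the known houses by Euclidean distance to the target house
    by_distance = sorted(houses, key=lambda h: math.dist(h[:2], target))
    # Average the prices of the k closest ones
    nearest = by_distance[:k]
    return sum(price for _, _, price in nearest) / k

# Toy data: (longitude, latitude, price in €)
known = [
    (4.90, 52.37, 400_000),
    (4.91, 52.37, 420_000),
    (4.89, 52.36, 380_000),
    (5.50, 52.00, 250_000),
    (5.51, 52.01, 260_000),
]
print(knn_predict(known, (4.90, 52.36), k=3))  # 400000.0: mean of the 3 nearby houses
```

With only location as a feature this is exactly the broker-style pricing of slide 10; in practice you would add rooms, surface area, etc. to the distance.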
  9. K-NN METHOD, WHICH K TO TAKE? 1 nearest neighbor 15

    nearest neighbors K too small → too wobbly, too much noise K too large → too smoothed, lost the signal
  10. K-NN EXAMPLE Take the location (long/lat) of a house

    for which you want a price. Look at the nearest houses for which you know the price. JAAP.NL HOMES The K-NN prediction is now just the average of the nearest houses. Calculating prices like this is very common among home brokers
  11. K-NN EXAMPLE JAAP.NL HOMES For my home data set JAAP

    It turns out k = 5 is the best value: largest R2 on the test set
  12. OK AND NOW? WHO WANTS TO USE MY MODEL? Mail

    me your dream house with its features, and I can price it. Any application that needs a home value prediction. Eeuhm NO, DON'T DO THAT! Create a REST API from the model and put it on a server online. Dataiku facilitates this process of model deployment
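The REST-API idea can be sketched with Python's standard library alone. The "model" here is a hand-made linear formula standing in for the real one, and the endpoint shape merely mimics the deployed service on the next slide; it is not the actual Dataiku deployment:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_price(features):
    """Toy stand-in for the real model: a made-up linear formula."""
    base = 24_000
    return base + 2_000 * features.get("Oppervlakte", 0) + 4_500 * features.get("kamers", 0)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, score it, and return the prediction as JSON
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = {"prediction": predict_price(payload["features"]), "ignored": False}
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 12000), PredictHandler).serve_forever()
print(predict_price({"HouseType": "Tussenwoning", "kamers": 6, "Oppervlakte": 134}))  # 319000
```

Any client can then POST features and get a price back, which is exactly what the curl call on the next slide does against the real service.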
  13. OK AND NOW? WHO WANTS A PREDICTION? curl -X POST

    http://188.166.112.55:12000/public/api/v1/house_xgboost/pc2model/predict --data '{ "features": { "HouseType": "Tussenwoning", "kamers": 6, "Oppervlakte": 134, "VON": 0, "PC": "16" }}'
    {"prediction":241287.40,"ignored":false}
    For PC "10": {"prediction":607246.62,"ignored":false}
  14. HOME VALUE PREDICTION WITH LASSO REGRESSION OR XGBOOST TERM DOCUMENT

    MATRIX Super sparse: 65,000 rows and 50,000 columns, but a lot of zeros!
                 home value  kitchen  big_garden  garage  ..(many more terms)..  swimming_pool
    home 1       235,000     1        0           1       ...                    0
    home 2       450,000     0        1           0       ...                    0
    home 3       376,000     1        0           0       ...                    0
    ...          ...         ...      ...         ...     ...                    ...
    home 65,000  621,000     1        1           ...     ...                    1
    SO: Each home description is now a long vector of 50,000 numbers. Mathematically we have a point in the 50,000-dimensional space ℝ!
  15. TERM DOCUMENT MATRIX For each term (column) estimate a β

    (beta) parameter. Too many columns for normal linear regression: regularization is needed! For example: “lasso” regression. VALUE PREDICTION WITH LASSO REGRESSION
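What the lasso penalty does to an individual β can be illustrated with the simplest (orthonormal-design) case, where the lasso estimate is the ordinary least-squares estimate shrunk by soft-thresholding; the numbers below are illustrative, not the deck's fitted values:

```python
def soft_threshold(beta_ols, lam):
    """Lasso shrinkage in the orthonormal case.

    Coefficients smaller (in absolute value) than lam are set exactly to 0,
    which is why lasso drops most of the 50,000 terms entirely.
    """
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0

# Large effects survive (shrunk towards 0), small ones are dropped entirely
print(soft_threshold(60_168, lam=5_000))   # 55168
print(soft_threshold(1_200, lam=5_000))    # 0.0
print(soft_threshold(-39_644, lam=5_000))  # -34644
```

In the general (non-orthonormal) case the same shrink-and-select behaviour comes out of the full penalized optimization, as in the coefficients on the next slide.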
  16. LASSO REGRESSION NEGATIVE AND POSITIVE BETA PARAMETERS R2 = 0.66

    Positive betas (€):               Negative betas (€):
    Intercept           238,260      parkkosten           −39,644
    familiehuis          60,168      recreatiebungalow    −32,614
    vrijstaande_villa    48,180      bungalowpark         −31,801
    belegging            45,814      limburgse            −23,483
    beleggingsobject     42,543      2_kamer              −23,034
    entree_vestibule     41,674      plinten              −22,510
    rijksmonument        39,379      overdekt_zwembad     −21,971
    recreatief           39,142      2_kamerappartement   −20,625
    verhuurd             36,171      aannemer             −20,314
    detaillering         35,000      recreatiewoning      −19,748
    visgraat             33,589      proeven              −19,631
    eigen_badkamer       33,454      betaalbaar           −19,621
    woningen_1           33,321      starterswoning       −19,502
    toiletten            32,836      kunststofkozijnen    −18,775
    representatieve      31,904      eigen_gebruik        −18,430
    gezinshuis           31,297
  17. SOAP ANALYTICS TEXT ANALYTICS Business pain Looking at GTST, what

    the heck is it all about? Are there trends in the series? Is it not all the same? Approach Take 5000 recaps and apply text mining / topic modeling
  18. SOAP ANALYTICS FINDING TOPICS The 5000 recaps can be transformed

    into a Term Document Matrix. So each recap is again a point in the 50,000-dimensional space. Instead of predicting we can cluster: create TOPICS
  19. All recaps as one pile Cluster algorithm Topic 1 Described

    by key words Topic 2 Described by key words Topic 3 Described by key words Topic 4 Described by key words
  20. SOAP ANALYTICS ZOOM IN ON A TOPIC Sub-topics: topic 16

    (Ludo, Isabelle, Martine, Janine) · Harmsen feels alone · Jack's plan, dangerous · Writing a farewell letter · Panic, fear · Questions about children · Paying money, getting money back
  21. SOAP ANALYTICS ARE ALL EPISODES NOT THE SAME? Create a

    3-dimensional UMAP (Uniform Manifold Approximation and Projection) of all 5000 Goede Tijden Slechte Tijden episodes Ø Each GTST episode is now a high-dimensional vector Ø Project this vector to a lower-dimensional space Ø It is easier to see if points are grouped together or not
  22. WORD EMBEDDINGS IN BOLD & BEAUTIFUL RECAPS Term Document Matrix

    Each document / recap is a vector of numbers Word embedding Each word is a vector of numbers A word embedding has to be trained / learned from a large collection of documents / recaps Amsterdam = (0.83, 0.89, 0.34, … , 0.63, 0.19) Steffy = (0.33, 0.19, 0.79, … , 0.13, 0.01) Germany = (0.72, 0.65, 0.43, … , 0.36, 0.57) Laugh = (0.85, 0.77, 0.24, … , 0.88, 0.29) … … See code in my GitHub repo
  23. WORD EMBEDDINGS LINGUISTIC REGULARITIES Closest words Word relations 250 dimensional

    space president trump car media press house man woman king queen vector(“man") − vector(“woman") is roughly vector(“king”) − vector(“queen") Trump speaks with the press The president talks to the media
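The man/woman/king/queen regularity can be checked with cosine similarity on toy vectors (these 3-dimensional vectors are invented for illustration; real embeddings like the 250-dimensional ones above are learned from data):

```python
import math

# Made-up toy embeddings, hand-crafted so the analogy works
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.2],
    "woman": [0.5, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# vector("king") - vector("man") + vector("woman") should land near "queen"
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
best = max(vectors, key=lambda word: cosine(vectors[word], target))
print(best)  # queen
```

The same cosine-similarity calculation produces the "closest words" lists on the slides that follow.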
  24. WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE 4000 daily recaps scraped

    from the last 15 years. These recaps contain around 10,000 unique words
  25. WORD EMBEDDINGS SOME R CODE Then from the vocabulary create the term

    co-occurrence matrix and the embeddings
  26. WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE Closest words to "steffy":

    1 steffy 1.00, 2 liam 0.82, 3 hope 0.79, 4 said 0.78, 5 wyatt 0.76, 6 bill 0.69, 7 asked 0.68, 8 quinn 0.67, 9 agreed 0.65, 10 rick 0.65
  27. WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE Closest words to the

    vector Steffy − Liam: death 0.223, furious 0.2006, lastly 0.1963, excused 0.1958, frustration 0.1950, onset 0.1937
  28. WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE Closest words to the

    vector Bill − anger: liam 0.5550, katie 0.4845, wyatt 0.4829, steffy 0.4645, quinn 0.4491, said 0.4201
  29. IMPORTANT: Business validation! I asked my wife, a loyal GTST

    and TBTB follower. She completely recognizes the results of my analyses!
  30. CHAT BOT A little digression: Neural Networks (NN) Ø Simple

    (shallow) NNs have existed for a long time. Use input features (# rooms, surface, region, home type: x1, x2, x3, x4) and one or two hidden layers (z1, z2, z3) to predict an outcome y, e.g. home value € 239,111. With labeled data (train data): send data through the network to produce a prediction → forward propagation. You know the 'ground truth', so the error of the prediction is now known; update the weights w → backward propagation. Fully connected network
  31. CHAT BOT A little digression: Neural Networks (NN) Some years

    ago deep learning became popular: neural nets with many hidden layers and a 'special structure' in the hidden layers. For normal (boring :-) ) structured data the added value of deep learning is limited
  32. CHAT BOT A little digression: Neural Networks (NN) Ø For

    texts and images using deep neural nets is key! The networks are not fully connected: special structures are used • Convolutional neural networks for images • Recurrent neural networks for texts
  33. CHAT BOT Ø Term Document Matrices and topic modeling are

    relatively old, so-called "shallow methods" for texts. Ø Relatively new are deep learning methods for texts. Ø One example is Sequence to Sequence (Seq2Seq) models, a special form of RNN. Use cases are machine translation, text summarization and question/answering (chat bots).
    Train data for machine translation (input → output): I am home → Ik ben thuis; I am not old → Ik ben niet oud; There are men → Er zijn mannen; ....
    Train data for a chat bot (input → output): Where are you? → I am at school; What is that? → That is a car; Can you help me? → No I won't; ...
    THE BOLD SUBTITLES: "I am home now" → Seq2Seq model → "Ik ben nu thuis". Input: a sequence of words; output: another sequence of words
  34. CHAT BOT THE BOLD SUBTITLES The sequence of words is

    fed word for word into the encoder: Ø The first word enters the encoder, Ø a so-called "hidden state" is returned, Ø the second word and the previous hidden state enter the encoder, Ø a new hidden state is returned, Ø etc., etc. Ø Then the decoder takes these hidden states to generate the output. INPUT OUTPUT Animation taken from Jay Alammar's blog
  35. THE BOLD SUBTITLES CHAT BOT So-called SRT subtitle text

    files. For The Bold I have only 1 file: 177 sentence pairs only!! Not ideal, but: assume a conversation of input 1, output 2, input 3, output 4, etc. Feed this train data to a Seq2Seq model. Shamelessly reused the PyTorch seq2seq tutorial. Here is my Python code and the srt file.
  36. CHAT BOT SOME EXAMPLE INPUTS: OUTPUT FROM THE BOLD BOT.....

    input = are you going ?
    input = is it your first time ?
    input = how is stephanie ?
    input = what the hell stephanie ?
    output = jackie nobody . stephanie oh please . <EOS>
    output = well so i want you to go back to forrester . <EOS>
    output = my childhood was a lot different from yours kat . <EOS>
    output = yeah . <EOS>
    Need more data……… preferably the scripts!
  37. Can you shake hands with the two neighbors? A LITTLE

    STATISTICAL EXPERIMENT Two statistics I like to share with you:
  38. 50.1% of all people don't wash their hands after using

    the toilet A LITTLE STATISTICAL EXPERIMENT
  39. 84.6% of all statistics are made up on the spot!!

    A LITTLE STATISTICAL EXPERIMENT
  40. IKEA WEBSITE KEEP TRACK OF THE STOCK LEVEL The IKEA

    Billy Index Change of stock over time
  41. IKEA BILLY SALES FORECAST Time series forecast model: y(t) =

    g(t) + s(t) + h(t) + e(t), where y is the quantity you want to forecast (i.e. Billy sales), g the (long-term) trend term, s the seasonality terms, h the holiday and event terms, and e the error term
  42. DEEP LEARNING PRE-TRAINED NETWORKS Deep learning: neural networks with ‘many’

    hidden layers. The so-called deep convolutional networks are very useful for image classification. The VGG16 network consists of millions of parameters and is already trained on millions of labeled images over 1000 classes (1. dog, 2. cat, 3. car, 4. house, 5. plane, 6. tree, …, 999. castle, 1000. chair). We can just re-use this network
  43. DEEP LEARNING PRE-TRAINED NETWORKS Just strip off the top layers

    so that we have a network that generates image features vgg16_notop = application_vgg16(weights = 'imagenet', include_top = FALSE) One line of code in R or Python REMOVED
  44. IKEA PRODUCT IMAGES HACKATHON AT IKEA DECEMBER 2017 • Scrape

    9000 product images from the Ikea web site • Score each image with the pre-trained VGG network • Create an R shiny app to upload an image • Determine the Ikea images that are closest to your image
  45. CAR ANALYTICS Papa, is that a BMW or a Peugeot?

    What does a car cost per KM? Am I leaking oil?
  46. IMAGE CLASSIFIER Create your own Peugeot / BMW classifier model

    Flatten PEUGEOT BMW Download 150 BMW images and Download 150 Peugeot images. You can then do: 1. Train a “small” CNN from scratch 2. Take an existing pre-trained CNN and retrain top layer(s) (TRANSFER LEARNING) Convolutional and pooling layers in the network
  47. IMAGE CLASSIFIER Create your own Peugeot / BMW classifier model

    Approach 1 does not really work on my images, too few images…..
  48. IMAGE CLASSIFIER Create your own Peugeot / BMW classifier model

    Flatten 8192 Dense 256 PEUGEOT BMW Download 150 BMW images and Download 150 Peugeot images. You can: 1. Train a “small” CNN from scratch 2. Take an existing pre-trained CNN and retrain top layer(s) (TRANSFER LEARNING)
  49. IMAGE CLASSIFIER Create your own Peugeot / BMW classifier model

    One epoch on my laptop takes 382 sec. Very long! On a GCP Tesla V100 GPU: 24 sec. Accuracy of ~82% on validation pictures. Not very good, I think my son can do better! TRANSFER LEARNING
  50. IMAGE CLASSIFIER Google AutoML Vision Ø No code required, Just

    upload labeled images Ø Runs fast on the Google cloud with GPUs Ø Much more accurate: years of Google research under the hood Ø Model is ready to be used! Ø But it's not free, $$ for training and using Ø You don't have the model.
  51. THE COSTS OF A CAR per kilometer? www.gaspedaal.nl: 435,000

    cars Car data set with price, mileage, make and model, and gear type
  52. NON LINEAR RELATION? A little digression Splines can help you!

    Spline fit: piecewise polynomials that are glued together smoothly. Continuous slopes!! Four ways to fit price (P) against kilometers (KM): ordinary linear fit (simple: the slope is the price per KM, but it is not accurate), step-wise fit (jumps in the prediction), piecewise linear fit (jumps in the slopes), and a spline fit
  53. THE COSTS OF A CAR Per kilometer Renault CLIO Price

    vs. kilometers driven (0 to 250K). Notice the difference in new value. R CODE: Clio = gaspedaal %>% filter(Merk == "Renault", Model == "Clio"); Model_out = lm(Price ~ ns(mileage, 7), Clio)
  54. THE COSTS OF A CAR Per kilometer BMW 3-Serie Price

    vs. kilometers driven (0 to 300K)
  55. Price per kilometer BMW 3-Serie Small shiny R app to

    see for yourself. Have fun! Cents per KM vs. kilometers driven
  56. WHICH CAR Will leak the most oil? RDW open data

    car data available: • 9 mln. Dutch cars with features • 21 mln. APK maintenance services
    date      defect       kenteken (license plate)  car make  age
    13-04-17  oil leakage  1-abc-12                  Peugeot   12
    19-05-18  brake        2-xyz-32                  Opel      9
    23-02-17  lights       3-klm-34                  Kia       7
    ...       ...          ...                       ...       ...
    03-10-17  gear         5-qwe-45                  BMW       8
    Now you can derive: Ø how many cars are of make ABC and of age X Ø how many cars are of make ABC and age X and have an oil leakage
  57. RESTAURANT ANALYTICS Business pain Ø What are the key words

    for good / bad restaurants? Ø I have eaten Chinese, OK nice! But where to eat the next time? Approach Look at restaurant reviews and look where the other reviewers went
  58. RESTAURANT ANALYTICS Ø Same text mining approach as in the "jaap

    home descriptions" Ø We now have a binary target: BAD = "score < 5", GOOD = "score > 8" Ø Train a regularized logistic regression and look at the largest and smallest coefficients. Good and bad words:
    Review    aardig  eten  italiaans  keuken  ...  vis  zout  Target
    Review 1  1       0     0          0       ...  1    1     BAD
    Review 2  0       1     0          1       ...  0    1     BAD
    ...       ...     ...   ...        ...     ...  ...  ...   ...
    Review N  0       0     1          1       ...  1    0     GOOD
  59. TERM DOCUMENT MATRIX For each term (column) estimate a β

    (beta) parameter, just as in ordinary linear regression in our jaap example. Too many columns for logistic regression: regularization is needed! TARGET PREDICTION WITH LASSO LOGISTIC REGRESSION: β_lasso = argmin_β Σ_i log(1 + exp(−y_i x_i'β)) + λ‖β‖₁
  60. RESTAURANT ANALYTICS Good and bad words size of word is

    related to β estimates Reviews data and Jupyter notebook with analysis on my GitHub
  61. A FEW FACTS… IENS DATA (TRADITIONAL BI) Most occurring restaurant

    name: 39 times among Dutch restaurants. % sustainable kitchens: Biological (67%), French (58%), Fish (44%), Vegetarian (39%), …, Chinese (3%). 700 reviews on a "normal" Saturday; Valentine's Day 2015 had 1200 reviews (1.7 times as many)
  62. A FEW FACTS… DON'T SAY TOO MUCH, THAT IS BAD

    :-) Distribution of the number of words in a review: lower scores observed for longer reviews
  63. ASSOCIATION RULES MINING ALSO CALLED MARKET BASKET ANALYSIS Identify frequent

    item sets (rules) in transactional data: IF items A and B THEN item C: {A, B} → {C}; IF item X THEN items Y and Z: {X} → {Y, Z}. When is a rule frequent? If the 'support' > a threshold: Support{X → Y} = (# trxs. containing X and Y) / (total # trxs.). Support Chips → Beer 0.823%; Chips → Milk 0.002%
  64. ASSOCIATION RULES MINING ALSO CALLED MARKET BASKET ANALYSIS Lift &

    Confidence Other statistics used to assess the usefulness of a rule: Lift{X → Y} = Support{X → Y} / (Support(X) * Support(Y)); Conf{X → Y} = Support{X → Y} / Support(X). Example: a lift of 8.3 for {Chips} → {Beer} means: if I know someone has already bought chips, then it is 8.3 times more likely that he will also buy beer
  65. IENS RESTAURANT LENGTH TWO RULES A → B Interactive

    network Very generic rules, lift is not really high
  66. IENS RESTAURANT LENGTH THREE RULES A, B → C Interactive

    picture Much more specific, higher lift
  67. IENS RESTAURANT VIRTUAL ITEMS: MAKE IT EVEN MORE PERSONAL Transaction

    data with customers and items:
    customer 1: A, X
    customer 2: A, B, C
    customer 3: E, T
    customer 4: S
    Possible rules: { A, B } → { C }, { X } → { Z }. Add customer features as virtual items:
    customer 1: A, X, Male, (18, 25]
    customer 2: A, B, C, Male, (45, 65]
    customer 3: E, T, Male, (30, 35]
    customer 4: S, Male, (30, 35]
    Possible rules: { Male, (18, 25], A, B } → { C }, { Female, (40, 45], X } → { Z }
  68. GRAPH BASICS A FEW TERMS Node or Vertex: a point in the

    network. Edge or Link: a relation between two nodes (can be directional)
  69. GRAPH BASICS A FEW TERMS Node centrality How central is a node? *

    Degree (number of connections) * Betweenness (number of shortest paths through a node) * Eigencentrality (Google's PageRank is a version of this) Community detection Are there nodes that belong together? Example: nodes 6 and 3 have the same degree, but node 6 has a higher betweenness than node 3
  70. WWW.IMDB.COM INTERNET MOVIE DATABASE Download movie data: * Dutch movies

    in the last 25 years * Per movie we know the cast and crew * A node is a person * Node X links with node Y if X and Y were in the same movie. R script of this analysis: see my GitHub repo
  71. DUTCH MOVIE WORLD IN A NETWORK GRAPH Interactive graph Actors

    are in the nodes data.frame, their links are in the edges data.frame: visNetwork(nodes, edges). nodes: Chantal Jantzen, Hans de Wolf, … edges: Chantal Jantzen – Stef Tijding; Hans de wolf – Jeroen Krabee; Johan Nijenhuis – Frans van Gestel; …
  72. COMMUNITIES There are 1257 persons They are divided into 191

    communities Take community 6: 54 persons in a wordcloud (centrality based)
  73. Thanks for your time! Questions? Need me as Freelancer? Let’s

    have a cup of coffee https://www.linkedin.com/in/longhowlam https://longhowlam.wordpress.com/ @longhowlam