WHAT ON EARTH ARE EMBEDDINGS?

An introduction to embeddings, presented to an audience from different backgrounds (product owners, data practitioners, and software engineers).

OmaymaS

July 23, 2023

Transcript

  1. WHAT ON EARTH ARE EMBEDDINGS?

    1- NO IDEA
    2- BASIC KNOWLEDGE
    3- DEEP UNDERSTANDING

    Embeddings in 2-3 words
  2. TABULAR DATA

    item_id  dist_to_center  price
    1001     1               500
    1003     15              600
    1005     2.5             400
    1006     6.3             150
    ……
  3. TABULAR DATA

    item_id  dist_to_center  price
    1001     1               500
    1003     15              600
    1005     2.5             400
    1006     6.3             150
    ……

    VARIABLES
    FEATURES (NOT PRODUCT FEATURES)
    ITEMS
  4. TABULAR DATA

    item_id  [dist_to_center, price]
    1001     [1, 500]
    1003     [15, 600]
    1005     [2.5, 400]
    1006     [6.3, 150]
    ……
  5. TABULAR DATA

    item_id  dist_to_center  price  chain
    1001     1               500    False
    1003     15              600    True
    1005     2.5             400    True
    1006     6.3             150    False
    ……
  6. TABULAR DATA

    item_id  dist_to_center  price  chain  type
    1001     1               500    False  guest_house
    1003     15              600    True   hotel
    1005     2.5             400    True   hotel
    1006     6.3             150    False  apartment
    ……
  7. TABULAR DATA

    item_id  dist_to_center  price  chain  type
    1001     1               500    False  guest_house
    1003     15              600    True   hotel
    1005     2.5             400    True   hotel
    1006     6.3             150    False  apartment
    ……
  8. TABULAR DATA

    item_id  dist_to_center  price  chain  type_id
    1001     1               500    False  102
    1003     15              600    True   103
    1005     2.5             400    True   103
    1006     6.3             150    False  100
    ……
  9. TABULAR DATA: ORDINAL ENCODING

    item_id  dist_to_center  price  chain  type
    1001     1               500    False  guest_house
    1003     15              600    True   hotel
    1005     2.5             400    True   hotel
    1006     6.3             150    False  apartment
    ……

    { 'hotel': 2, 'apartment': 0, 'guest_house': 1 }
  10. TABULAR DATA: ONE HOT ENCODING

    item_id  dist_to_center  price  chain  type
    1001     1               500    False  guest_house
    1003     15              600    True   hotel
    1005     2.5             400    True   hotel
    1006     6.3             150    False  apartment
    ……

    apartment  guest_house  hotel
    0          1            0
    0          0            1
    0          0            1
    1          0            0
  11. TABULAR DATA: ONE HOT ENCODING

    item_id  dist_to_center  price  chain  apartment  guest_house  hotel
    1001     1               500    False  0          1            0
    1003     15              600    True   0          0            1
    1005     2.5             400    True   0          0            1
    1006     6.3             150    False  1          0            0
    ……
  12. TEXT DATA: HOW TO REPRESENT?

    item
    You shall know a word by the company it keeps.
    Actions speak louder than words.
    A picture is worth a thousand words.
    Silence is more eloquent than words.
    Here are some words about words.
  13. TEXT DATA: CORPUS

    item
    You shall know a word by the company it keeps.
    Actions speak louder than words.
    A picture is worth a thousand words.
    Silence is more eloquent than words.
    Here are some words about words.
  14. TEXT DATA: CORPUS, DOCUMENTS

    item
    1  You shall know a word by the company it keeps.
    2  Actions speak louder than words.
    3  A picture is worth a thousand words.
    4  Silence is more eloquent than words.
    5  Here are some words about words.
  15. TEXT DATA: DOCUMENTS, CORPUS, VOCABULARY, TOKENS

    item
    You shall know a word by the company it keeps.
    Actions speak louder than words.
    A picture is worth a thousand words.
    Silence is more eloquent than words.
    Here are some words about words.

    ['about', 'actions', 'are', 'by', 'company', 'eloquent', 'here', 'is', 'it', 'keeps', 'know', 'louder', 'more', 'picture', 'shall', 'silence', 'some', 'speak', 'than', 'the', 'thousand', 'word', 'words', 'worth', 'you']
  16. TEXT DATA

    item                                            about  actions  by  company  eloquent  …  words  worth
    You shall know a word by the company it keeps.
    Actions speak louder than words.
    A picture is worth a thousand words.
    Silence is more eloquent than words.
    Here are some words about words.
  17. TEXT DATA: COUNT

    item                                            about  actions  by  company  eloquent  …  words  worth
    You shall know a word by the company it keeps.  0      0        1   1        0            0      0
    Actions speak louder than words.                0      1        0   0        0            1      0
    A picture is worth a thousand words.            0      0        0   0        0            1      1
    Silence is more eloquent than words.            0      0        0   0        1            1      0
    Here are some words about words.                1      0        0   0        0            2      0
  18. TEXT DATA

    item
    You shall know a word by the company it keeps.
    Actions speak louder than words.
    A picture is worth a thousand words.
    Silence is more eloquent than words.
    Here are some words about words.

    ['about words', 'actions speak', 'are some', 'by the', 'company it', 'eloquent than', 'here are', 'is more', 'is worth', 'it keeps', 'know word', 'louder than', 'more eloquent', 'picture is', 'shall know', 'silence is', 'some words', 'speak louder', 'than words', 'the company', 'thousand words', 'word by', 'words about', 'worth thousand', 'you shall']
  19. TEXT DATA: COUNT

    item                                            about words  actions speak  ………………….  you shall
    You shall know a word by the company it keeps.
    Actions speak louder than words.
    A picture is worth a thousand words.
    Silence is more eloquent than words.
    Here are some words about words.
  20. TEXT DATA: COUNT

    item                                            about words  actions speak  ………………….  you shall
    You shall know a word by the company it keeps.  0            0                        1
    Actions speak louder than words.                0            1                        0
    A picture is worth a thousand words.            0            0                        0
    Silence is more eloquent than words.            0            0                        0
    Here are some words about words.                1            0                        0
  21. TEXT DATA: Term Frequency - Inverse Document Frequency (TF-IDF)

    Term Frequency: the frequency of any term/token in a given document.
    Inverse Document Frequency: accounts for the ratio of documents that include this specific term.
  22. TEXT DATA: Term Frequency - Inverse Document Frequency (TF-IDF)

    item                                            about actions by company eloquent … words worth
    You shall know a word by the company it keeps.  0 0 0.447214 0 0 0
    Actions speak louder than words.                0 0.549036 0 0 0.309317 0
    A picture is worth a thousand words.            0 0 0 0 0.309317 0.549036
    Silence is more eloquent than words.            0 0 0 0.6569 0.370086 0
    Here are some words about words.                0.483957 0 0 0 0.545305 0
  23. TEXT DATA: Term Frequency - Inverse Document Frequency (TF-IDF)

    item                                            about actions by company eloquent … words worth
    You shall know a word by the company it keeps.  0 0 0.447214 0 0 0
    Actions speak louder than words.                0 0.549036 0 0 0.309317 0
    A picture is worth a thousand words.            0 0 0 0 0.309317 0.549036
    Silence is more eloquent than words.            0 0 0 0.6569 0.370086 0
    Here are some words about words.                0.483957 0 0 0 0.545305 0
  24. TEXT DATA: Term Frequency - Inverse Document Frequency (TF-IDF)

    item                                            about actions by company eloquent … words worth
    You shall know a word by the company it keeps.  0 0 0.447214 0 0 0
    Actions speak louder than words.                0 0.549036 0 0 0.309317 0
    A picture is worth a thousand words.            0 0 0 0 0.309317 0.549036
    Silence is more eloquent than words.            0 0 0 0.6569 0.370086 0
    Here are some words about words.                0.483957 0 0 0 0.545305 0
  25. TEXT DATA: TF-IDF

    item                                            tf_idf_vector
    You shall know a word by the company it keeps.  [0. , 0. , 0.447214, 0. , 0. , 0.447214, 0.447214, 0. , 0. , 0.447214, 0. , 0. , 0. , 0. , 0.447214, 0. , 0. ]
    Actions speak louder than words.                [0. , 0.549036, 0. , 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0.309317, 0. ]
    A picture is worth a thousand words.            [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0. , 0. , 0.549036, 0. , 0.309317, 0.549036]
    Silence is more eloquent than words.            [0. , 0. , 0. , 0.6569 , 0. , 0. , 0. , 0. , 0. , 0. , 0.6569 , 0. , 0. , 0. , 0. , 0.370086, 0. ]
    Here are some words about words.                [0.483957, 0. , 0. , 0. , 0.483957, 0. , 0. , 0. , 0. , 0. , 0. , 0.483957, 0. , 0. , 0. , 0.545305, 0. ]
  26. TEXT DATA: TF-IDF

    item                                            tf_idf_vector
    You shall know a word by the company it keeps.  [0. , 0. , 0.447214, 0. , 0. , 0.447214, 0.447214, 0. , 0. , 0.447214, 0. , 0. , 0. , 0. , 0.447214, 0. , 0. ]
    Actions speak louder than words.                [0. , 0.549036, 0. , 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0.309317, 0. ]
    A picture is worth a thousand words.            [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0. , 0. , 0.549036, 0. , 0.309317, 0.549036]
    Silence is more eloquent than words.            [0. , 0. , 0. , 0.6569 , 0. , 0. , 0. , 0. , 0. , 0. , 0.6569 , 0. , 0. , 0. , 0. , 0.370086, 0. ]
    Here are some words about words.                [0.483957, 0. , 0. , 0. , 0.483957, 0. , 0. , 0. , 0. , 0. , 0. , 0.483957, 0. , 0. , 0. , 0.545305, 0. ]
  27. TEXT DATA: TF-IDF

    item                                            tf_idf_vector
    You shall know a word by the company it keeps.  [0. , 0. , 0.447214, 0. , 0. , 0.447214, 0.447214, 0. , 0. , 0.447214, 0. , 0. , 0. , 0. , 0.447214, 0. , 0. ]
    Actions speak louder than words.                [0. , 0.549036, 0. , 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0.309317, 0. ]
    A picture is worth a thousand words.            [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0. , 0. , 0.549036, 0. , 0.309317, 0.549036]
    Silence is more eloquent than words.            [0. , 0. , 0. , 0.6569 , 0. , 0. , 0. , 0. , 0. , 0. , 0.6569 , 0. , 0. , 0. , 0. , 0.370086, 0. ]
    Here are some words about words.                [0.483957, 0. , 0. , 0. , 0.483957, 0. , 0. , 0. , 0. , 0. , 0. , 0.483957, 0. , 0. , 0. , 0.545305, 0. ]

    MORE TOKENS?
  28. TEXT DATA: TF-IDF

    item                                            tf_idf_vector
    You shall know a word by the company it keeps.  [0. , 0. , 0.447214, 0. , 0. , 0.447214, 0.447214, 0. , 0. , 0.447214, 0. , 0. , 0. , 0. , 0.447214, 0. , 0. ]
    Actions speak louder than words.                [0. , 0.549036, 0. , 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0.309317, 0. ]
    A picture is worth a thousand words.            [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.549036, 0. , 0. , 0. , 0. , 0.549036, 0. , 0.309317, 0.549036]
    Silence is more eloquent than words.            [0. , 0. , 0. , 0.6569 , 0. , 0. , 0. , 0. , 0. , 0. , 0.6569 , 0. , 0. , 0. , 0. , 0.370086, 0. ]
    Here are some words about words.                [0.483957, 0. , 0. , 0. , 0.483957, 0. , 0. , 0. , 0. , 0. , 0. , 0.483957, 0. , 0. , 0. , 0.545305, 0. ]

    MORE TOKENS? 100K, 1M, 10M?
  29. TEXT DATA: word2vec

    You shall know a word by the company it keeps.

    INPUT  OUTPUT  OUTPUT
    know   You     1
    know   shall   1
    know   a       1
    know   word    1
    …
  30. TEXT DATA: word2vec

    You shall know a word by the company it keeps.

    INPUT  OUTPUT  OUTPUT
    know   You     1
    know   shall   1
    know   a       1
    know   word    1
    …
  31. TEXT DATA: word2vec

    You shall know a word by the company it keeps.

    INPUT  OUTPUT  OUTPUT
    know   You     1
    know   shall   1
    know   a       1
    know   word    1
    know   taco    0
    know   …

    NEGATIVE EXAMPLES
  32. TEXT DATA: word2vec

    [-0.26559, 1.0267, -0.21582, -0.17478, -0.061777, 0.28052, 0.73311, -1.3842, -0.84804, 0.13089, 0.15464, 0.45867, -4.3862, 0.19371, 0.060209, 0.18071, -0.1115, -0.039883, 0.21777, -0.64575, -0.65452, -0.39837, -0.18885, 1.1001, -0.68782]
    worth

    [1.3070e-01, -3.9077e-01, -8.6797e-01, 1.6434e+00, -3.9666e-01, -2.5728e-01, 1.5599e+00, -6.6073e-01, -1.7415e-01, -3.6898e-04, -3.3396e-01, 5.8920e-01, -2.7834e+00, 9.0820e-01, 2.4679e-01, 1.3508e+00, -4.8421e-01, 6.1785e-01, -2.6561e-01, -4.4429e-01, 4.7028e-01, -4.2136e-02, 2.3388e-01, -6.5706e-01, -2.8852e-01]
    silence
  33. TEXT DATA: glove-wiki-gigaword-50

    nonsense

    [ ('bullshit', 0.6888), ('hogwash', 0.6415), ('guff', 0.6295), ('baloney', 0.6275), ('balderdash', 0.6187),
      ('poppycock', 0.6143), ('utter_nonsense', 0.6071), ('bull_****', 0.6027), ('malarkey', 0.5966), ('tosh', 0.5939) ]
  34. TEXT DATA: glove-wiki-gigaword-50

    Booking.com

    [ ('HotelsCombined', 0.6622), ('HotelClub.com', 0.6537), ('Hotelopia', 0.6528), ('hotel.info', 0.6412), ('Expedia_Travelocity_Orbitz', 0.6405),
      ('Expedia_Priceline', 0.6388), ('Hotels.com_Expedia', 0.6361), ('www.lastminute.com', 0.633), ('www.sidestep.com', 0.6316), ('trivago', 0.6289) ]
  35. TEXT DATA: glove-twitter-25

    Immigrant

    [ ('pro-life', 0.893), ('clergy', 0.8886), ('undocumented', 0.8883), ('migrant', 0.88), ('communist', 0.8769),
      ('socialist', 0.8752), ('activist', 0.875), ('jewish', 0.874), ('circumcision', 0.8685), ('ugandan', 0.8638),
      ('missionary', 0.8635), ('aboriginal', 0.8616), ('militia', 0.8528), ('priests', 0.8527), ('zionist', 0.851),
      ('refugee', 0.8478), ('minority', 0.8467), ('offender', 0.8434), ('jihadi', 0.843), ('extremist', 0.8424),
      ('bangladeshi', 0.8419), ('mormon', 0.8415), ('foreign', 0.8414), ('jihadist', 0.8414), ('kenyan', 0.8408),
      ('anti-gay', 0.8405), ('capitalist', 0.8394), ('hispanic', 0.8365), ('reproductive', 0.8365), ('businessman', 0.8343) ]
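
The ordinal and one-hot encodings on slides 9 to 11 can be sketched in plain Python. This is a minimal illustration, not the code behind the deck: the rows are copied from the slides, while the names `items`, `ordinal`, and `one_hot` are made up for this example. Sorting the categories alphabetically is an assumption that happens to reproduce the mapping shown on slide 9.

```python
# Toy rows copied from the slides: (item_id, type).
items = [
    (1001, "guest_house"),
    (1003, "hotel"),
    (1005, "hotel"),
    (1006, "apartment"),
]

# Assumption: categories are ordered alphabetically, which gives a stable column order.
categories = sorted({item_type for _, item_type in items})

# Ordinal encoding: each category is replaced by its index in the sorted list.
ordinal = {category: index for index, category in enumerate(categories)}


def one_hot(item_type):
    """Return a vector with a single 1 in the column of `item_type`."""
    return [1 if category == item_type else 0 for category in categories]


# One-hot encoding: one 0/1 column per category, keyed by item_id.
one_hot_rows = {item_id: one_hot(item_type) for item_id, item_type in items}

print(categories)
print(ordinal)
for item_id, row in one_hot_rows.items():
    print(item_id, row)
```

Ordinal encoding keeps a single column but implies an order between categories (apartment < guest_house < hotel) that has no real meaning here; one-hot encoding avoids that at the cost of one column per category.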
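
The count representations on slides 15 to 20 (single tokens, then pairs of consecutive tokens) can be reproduced with a short sketch. The tokenizer below lowercases the text and keeps only tokens of two or more characters; that rule is an assumption inferred from the slide vocabulary, which omits "a" (it is also the default token pattern of scikit-learn's `CountVectorizer`). The function names are hypothetical.

```python
import re
from collections import Counter

corpus = [
    "You shall know a word by the company it keeps.",
    "Actions speak louder than words.",
    "A picture is worth a thousand words.",
    "Silence is more eloquent than words.",
    "Here are some words about words.",
]


def tokenize(text):
    # Assumption: lowercase, keep tokens with at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_matrix(docs, n=1):
    """Return the sorted vocabulary and one row of term counts per document."""
    doc_terms = [ngrams(tokenize(doc), n) for doc in docs]
    vocabulary = sorted({term for terms in doc_terms for term in terms})
    rows = []
    for terms in doc_terms:
        counts = Counter(terms)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


vocab, counts = count_matrix(corpus, n=1)
bigram_vocab, bigram_counts = count_matrix(corpus, n=2)

print(len(vocab), vocab)
print(len(bigram_vocab), bigram_vocab)
```

Each document becomes a vector as long as the vocabulary, so the vectors grow with every new distinct token or token pair in the corpus.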
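
The TF-IDF vectors on slides 22 to 28 can be approximated with the sketch below. Several choices are assumptions rather than facts from the deck: the stop-word list is inferred from the 17-dimensional vectors on the slides, and the weighting uses a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, followed by L2 normalization (the defaults of scikit-learn's `TfidfVectorizer`). Under those assumptions the output agrees with the slide values to six decimals.

```python
import math
import re
from collections import Counter

corpus = [
    "You shall know a word by the company it keeps.",
    "Actions speak louder than words.",
    "A picture is worth a thousand words.",
    "Silence is more eloquent than words.",
    "Here are some words about words.",
]

# Assumption: stop words inferred from the slide vectors, not stated in the deck.
STOP_WORDS = {"are", "by", "is", "it", "more", "than", "the", "you"}


def tokenize(text):
    # Assumption: lowercase, keep tokens of two or more characters, drop stop words.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def tfidf(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF vector per document."""
    doc_tokens = [tokenize(doc) for doc in docs]
    vocabulary = sorted({token for tokens in doc_tokens for token in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(token for tokens in doc_tokens for token in set(tokens))
    # Smoothed inverse document frequency (an assumption, see the text above).
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    vectors = []
    for tokens in doc_tokens:
        tf = Counter(tokens)  # raw term counts in this document
        weights = [tf[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors


vocabulary, vectors = tfidf(corpus)
for doc, vector in zip(corpus, vectors):
    print(doc, [round(value, 6) for value in vector])
```

A term such as "words", which appears in four of the five documents, receives a lower weight than a term that appears in only one, which is the effect described on slide 21.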
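
The training rows on slides 29 to 31 (input word, output word, label) can be generated with a small sketch. The window of two words on each side is inferred from the rows shown for "know"; the noise vocabulary and the uniform sampling of negatives are simplifying assumptions for illustration, and the function names are hypothetical. This only builds the training pairs; it does not train a word2vec model.

```python
import random

sentence = "You shall know a word by the company it keeps".split()


def skipgram_pairs(tokens, window=2):
    """Return (input, output, 1) for every word within `window` of each input word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j], 1))
    return pairs


def negative_pairs(positives, noise_words, per_positive=1, seed=0):
    """Return (input, noise_word, 0) pairs that never occur among the positives."""
    rng = random.Random(seed)
    seen = {(inp, out) for inp, out, _ in positives}
    negatives = []
    for inp, _, _ in positives:
        candidates = [w for w in noise_words if (inp, w) not in seen and w != inp]
        for _ in range(per_positive):
            if candidates:
                # Assumption: uniform sampling, a simplification for this sketch.
                negatives.append((inp, rng.choice(candidates), 0))
    return negatives


positives = skipgram_pairs(sentence, window=2)
# Hypothetical noise vocabulary; "taco" is the negative example shown on slide 31.
negatives = negative_pairs(positives, noise_words=["taco", "banana", "keeps"])

know_rows = [pair for pair in positives if pair[0] == "know"]
print(know_rows)
print(negatives[:5])
```

A model trained on such rows learns to score true neighbours close to 1 and sampled non-neighbours close to 0, and the learned weights for each input word become its embedding.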
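
The ranked (word, score) lists on slides 33 to 35 are nearest-neighbour lookups in the embedding space. A minimal sketch of such a lookup with cosine similarity follows. The three-dimensional vectors are hypothetical values made up for this example (real pretrained embeddings have 25 to 300 dimensions), and `most_similar` is an illustrative helper, not the function used to produce the slides.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
toy_vectors = {
    "hotel": [0.9, 0.1, 0.0],
    "hostel": [0.8, 0.2, 0.1],
    "apartment": [0.6, 0.5, 0.1],
    "taco": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=3):
    """Return the `topn` other words ranked by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, round(cosine_similarity(query, vector), 4))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = most_similar("hotel", toy_vectors)
print(neighbours)
```

Because the ranking only reflects which words occur in similar contexts in the training corpus, the neighbours can also expose associations and biases present in that corpus, as the list on slide 35 suggests.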