
Data Science with Elasticsearch

Using Elasticsearch "significant_terms" aggregation to recommend movies by analyzing a foreground set against a background set.

Felipe Dornelas

March 29, 2016

Transcript

  1. DATA SCIENCE WITH ELASTICSEARCH FELIPE DORNELAS

  2. None
  3. ABOUT ME
    Software Engineer
    Data NERD
    Electronic Music enthusiast
    Work at ThoughtWorks
  4. WHAT IS ELASTICSEARCH?

  5. A real-time distributed search and analytics engine

  6. Free and Open Source

  7. Distributed document store, full-text search, real-time analytics

  8. DISTRIBUTED DOCUMENT STORE
    RESTful API
    Automatic scale (Plug & Play)
    Capable of handling petabytes of data
  9. FULL-TEXT SEARCH
    Built on Lucene
    Handles the human language:
      Synonyms, typos and misspellings
      Internationalization
    Sort results by relevance score
  10. REAL-TIME ANALYTICS
    Lots of aggregations and metrics
    Geolocations
    Can be combined with search
    Real-time (no batch-processing)
  11. SEARCH

  12. STRUCTURED SEARCH (SQL)
    "Does the document match the query?"
    Yes or no question
  13. FULL-TEXT SEARCH
    "How well does the document match the search?"
    Relevance score
  14. INVERTED INDEX

  15. ID  Text
    1   "The quick brown fox jumped over the lazy dog."
    2   "Quick brown foxes leap over lazy dogs in summer."
  16. TOKENIZATION
    ID  Tokens
    1   "The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
    2   "Quick", "brown", "foxes", "leap", "over", "lazy", "dogs", "in", "summer"
  17. NORMALIZATION

  18. CAPITALIZATION "Quick" → "quick"
  19. STEMMING "foxes" → "fox"
  20. REPLACING SYNONYMS "jumped" ~ "leap" → "jump"
  21. REMOVING COMMON WORDS "the"

  22. Term    Doc #1  Doc #2
    brown   X       X
    dog     X       X
    fox     X       X
    in      -       X
    jump    X       X
    lazy    X       X
    over    X       X
    quick   X       X
    summer  -       X
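The tokenization and normalization steps above can be sketched in a few lines of Python. This is only an illustration: the lookup tables `STEMS`, `SYNONYMS` and `STOP_WORDS` are hand-written assumptions that cover just the two example sentences, not the analyzers Elasticsearch or Lucene actually ship.

```python
import re
from collections import defaultdict

# Toy normalization tables (assumptions for this example only).
STOP_WORDS = {"the"}
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}


def analyze(text):
    """Tokenize and normalize text into a list of index terms."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        term = token.lower()              # capitalization
        term = STEMS.get(term, term)      # stemming (lookup table, not a real stemmer)
        term = SYNONYMS.get(term, term)   # synonym replacement
        if term not in STOP_WORDS:        # common-word removal
            terms.append(term)
    return terms


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


docs = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "Quick brown foxes leap over lazy dogs in summer.",
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

Each term maps to the documents that contain it, so a lookup by term never has to scan the original text.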
  23. SEARCH EXAMPLE "Quick brown foxes in summer?"
  24. ELASTICSEARCH API
    GET /example/document/_search
    {
      "query": {
        "match": { "text": "Quick brown foxes in summer?" }
      }
    }
  25. QUERY IS ALSO NORMALIZED "quick", "brown", "fox", "in", "summer"
  26. MATCHING THE INVERTED INDEX
    Term    Doc #1  Doc #2
    quick   X       X
    brown   X       X
    fox     X       X
    in      -       X
    summer  -       X
  27. SEARCH RESULTS
    "hits": [
      {
        "_score": 0.16273327,
        "_id": "2",
        "_source": { "text": "Quick brown foxes leap over lazy dogs in summer." }
      },
      {
        "_score": 0.01273327,
        "_id": "1",
        "_source": { "text": "The quick brown fox jumped over the lazy dog." }
      }
    ]
  28. SEARCH RESULTS
    Document #2 is a better match
    Higher relevance score than #1
    Search results are sorted by relevance
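To make the ranking concrete, here is a rough Python sketch that scores each document by the fraction of normalized query terms it contains. This overlap score is a deliberate simplification chosen for illustration; Lucene's real relevance formula is more elaborate and would not produce these numbers. The `REWRITES` table is an assumption covering only the example sentences.

```python
import re

# Toy normalization (assumption): one table standing in for stemming and synonyms.
STOP_WORDS = {"the"}
REWRITES = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}


def analyze(text):
    """Lowercase, drop stop words, then apply the stem/synonym rewrites."""
    tokens = (t.lower() for t in re.findall(r"[A-Za-z]+", text))
    return [REWRITES.get(t, t) for t in tokens if t not in STOP_WORDS]


def search(query, docs):
    """Rank documents by the share of query terms they contain (simplified score)."""
    query_terms = set(analyze(query))
    hits = []
    for doc_id, text in docs.items():
        matched = query_terms & set(analyze(text))
        if matched:
            hits.append({"_id": doc_id, "_score": len(matched) / len(query_terms)})
    return sorted(hits, key=lambda hit: hit["_score"], reverse=True)


docs = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "Quick brown foxes leap over lazy dogs in summer.",
}
hits = search("Quick brown foxes in summer?", docs)
for hit in hits:
    print(hit)
```

Document 2 matches all five query terms and document 1 only three, so document 2 comes first, in line with the ordering on the slide.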
  29. AGGREGATIONS

  30. BUCKETS + METRICS

  31. BUCKETS Collections of documents that meet certain criteria

  32. GENDER SOMEONE IDENTIFIES AS
    Alice ⇒ female
    Josh ⇒ male
    Karen ⇒ non-binary
  33. CITIES FROM A STATE
    San Francisco ⇒ California
    Belo Horizonte ⇒ Minas Gerais
  34. DAYS FROM A MONTH
    2014-10-28 ⇒ October
    2014-11-15 ⇒ November
  35. METRICS
    Calculations on top of Buckets
    Ex: min, max, mean, sum…
  36. AGGREGATION EXAMPLE
    partition citizens by state
    then by gender
    then by age ranges
    then calculate average salary for each bucket (metric)
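  37. [Diagram: buckets nested by state (California, New York, Texas), gender (Male, Female, Non-Binary) and age range (age < 21, 21 < age < 50, age > 50), with an average salary of ~$5000/month shown for one bucket]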
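The "buckets + metrics" idea can be mimicked in plain Python: group records by a composite key (the buckets), then compute a statistic per group (the metric). The `citizens` records and the exact age boundaries below are made up for illustration and are not data from the talk.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample records, purely illustrative.
citizens = [
    {"state": "California", "gender": "female", "age": 19, "salary": 4000},
    {"state": "California", "gender": "female", "age": 20, "salary": 6000},
    {"state": "California", "gender": "male", "age": 35, "salary": 5500},
    {"state": "Texas", "gender": "non-binary", "age": 62, "salary": 7000},
]


def age_range(age):
    """Assign an age to one of three ranges (boundaries are an assumption)."""
    if age < 21:
        return "age < 21"
    if age <= 50:
        return "21 <= age <= 50"
    return "age > 50"


def average_salary_by_bucket(records):
    """Bucket by state, then gender, then age range; metric is the mean salary."""
    buckets = defaultdict(list)
    for record in records:
        key = (record["state"], record["gender"], age_range(record["age"]))
        buckets[key].append(record["salary"])
    return {key: mean(salaries) for key, salaries in buckets.items()}


averages = average_salary_by_bucket(citizens)
for bucket, avg in sorted(averages.items()):
    print(bucket, avg)
```

Elasticsearch expresses the same nesting declaratively, with one aggregation nested inside another, rather than with a composite key.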
  38. REAL-TIME ANALYTICS

  39. CAR TRANSACTIONS EXAMPLE
    GET /cars/transactions/AVFr1xbVmdUYWpF46Ps4
    {
      "price": 10000,
      "color": "red",
      "make": "honda",
      "sold": "2014-10-28"
    }
  40. BEST SELLING CAR COLOR
    GET /cars/transactions/_search?search_type=count
    {
      "aggs": {
        "colors": {
          "terms": { "field": "color" }
        }
      }
    }
  41. BEST SELLING CAR COLOR
    {
      "colors": {
        "buckets": [
          { "key": "red", "doc_count": 16 },
          { "key": "blue", "doc_count": 8 },
          { "key": "green", "doc_count": 8 }
        ]
      }
    }
  42. AVERAGE CAR COLOR PRICE
    GET /cars/transactions/_search?search_type=count
    {
      "aggs": {
        "colors": {
          "terms": { "field": "color" },
          "aggs": {
            "avg_price": {
              "avg": { "field": "price" }
            }
          }
        }
      }
    }
  43. AVERAGE CAR COLOR PRICE
    {
      "colors": {
        "buckets": [
          { "key": "red", "doc_count": 16, "avg_price": { "value": 32500.0 } },
          { "key": "blue", "doc_count": 8, "avg_price": { "value": 20000.0 } },
          { "key": "green", "doc_count": 8, "avg_price": { "value": 21000.0 } }
        ]
      }
    }
  44. CAR SALES REVENUE HISTOGRAM
    GET /cars/transactions/_search?search_type=count
    {
      "aggs": {
        "price": {
          "histogram": { "field": "price", "interval": 20000 },
          "aggs": {
            "revenue": {
              "sum": { "field": "price" }
            }
          }
        }
      }
    }
  45. CAR SALES REVENUE HISTOGRAM
    {
      "price": {
        "buckets": [
          { "key": 0, "doc_count": 12, "revenue": { "value": 148000.0 } },
          { "key": 20000, "doc_count": 16, "revenue": { "value": 380000.0 } },
          { "key": 40000, "doc_count": 0, "revenue": { "value": 0.0 } },
          { "key": 60000, "doc_count": 0, "revenue": { "value": 0.0 } },
          { "key": 80000, "doc_count": 4, "revenue": { "value": 320000.0 } }
        ]
      }
    }
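For intuition, the bucketing behind such a histogram can be imitated in plain Python: each price is rounded down to a multiple of the interval, and the `sum` metric is accumulated per bucket. The `prices` list is invented for the example, and the gap filling simply mirrors the empty buckets visible in the response above rather than any particular Elasticsearch default.

```python
def revenue_histogram(prices, interval):
    """Bucket prices by interval and sum the revenue of each bucket."""
    buckets = {}
    for price in prices:
        key = (price // interval) * interval  # round down to the bucket's lower bound
        bucket = buckets.setdefault(key, {"key": key, "doc_count": 0, "revenue": 0})
        bucket["doc_count"] += 1
        bucket["revenue"] += price
    if not buckets:
        return []
    # Fill the gaps between the lowest and highest key with empty buckets.
    for key in range(min(buckets), max(buckets) + interval, interval):
        buckets.setdefault(key, {"key": key, "doc_count": 0, "revenue": 0})
    return [buckets[key] for key in sorted(buckets)]


prices = [10000, 12000, 25000, 30000, 80000]  # made-up sale prices
histogram = revenue_histogram(prices, 20000)
for bucket in histogram:
    print(bucket)
```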
  46. CAR SALES REVENUE HISTOGRAM

  47. TIME-SERIES DATA
    Any data with a timestamp
    Ex: server logs, sales history, stock prices
  48. HOW MANY CARS SOLD PER MONTH?
    GET /cars/transactions/_search?search_type=count
    {
      "aggs": {
        "sales": {
          "date_histogram": {
            "field": "sold",
            "interval": "month",
            "format": "yyyy-MM-dd"
          }
        }
      }
    }
  49. HOW MANY CARS SOLD PER MONTH?
    {
      "sales": {
        "buckets": [
          { "key_as_string": "2014-01-01", "doc_count": 4 },
          { "key_as_string": "2014-02-01", "doc_count": 4 },
          { "key_as_string": "2014-03-01", "doc_count": 0 },
          { "key_as_string": "2014-04-01", "doc_count": 0 },
          { "key_as_string": "2014-05-01", "doc_count": 4 },
          { "key_as_string": "2014-06-01", "doc_count": 0 },
          { "key_as_string": "2014-07-01", "doc_count": 4 },
          { "key_as_string": "2014-08-01", "doc_count": 4 },
          { "key_as_string": "2014-09-01", "doc_count": 0 },
          { "key_as_string": "2014-10-01", "doc_count": 4 },
          { "key_as_string": "2014-11-01", "doc_count": 8 }
        ]
      }
    }
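A date histogram follows the same idea with calendar-aware keys. The sketch below truncates each sale date to the first day of its month and counts per key, which is roughly what the `key_as_string` values above represent. Only the first date comes from the earlier transaction example; the others are made up. Unlike the response above, this simplified version does not emit empty months.

```python
from collections import Counter
from datetime import date


def sales_per_month(sold_dates):
    """Count ISO dates (YYYY-MM-DD) per month, keyed by the month's first day."""
    counts = Counter()
    for iso_date in sold_dates:
        day = date.fromisoformat(iso_date)
        counts[day.replace(day=1).isoformat()] += 1
    return dict(sorted(counts.items()))


sold = ["2014-10-28", "2014-11-05", "2014-11-15", "2014-01-01"]
monthly = sales_per_month(sold)
print(monthly)
```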
  50. HOW MANY CARS SOLD PER MONTH?

  51. COMMON SETUP - ELK Logstash ⇒ Elasticsearch ⇒ Kibana

  52. LOGSTASH Collect and stream logs into Elasticsearch

  53. KIBANA An analytics dashboard for Elasticsearch

  54. None
  55. MOVIE RECOMMENDATIONS

  56. THE MOVIELENS DATA SETS
    Movies Catalog
    User Movie Recommendations (0 to 5)
    User Movie Tags
  57. THE MOVIELENS 10M DATASET
    10 million ratings
    10,000 movies
    72,000 users
    Released in 2009
  58. MOVIES.DAT
    MovieID::Title::Genres
  59. MOVIES.DAT
    1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
    2::Jumanji (1995)::Adventure|Children|Fantasy
    3::Grumpier Old Men (1995)::Comedy|Romance
    4::Waiting to Exhale (1995)::Comedy|Drama|Romance
    5::Father of the Bride Part II (1995)::Comedy
    6::Heat (1995)::Action|Crime|Thriller
    7::Sabrina (1995)::Comedy|Romance
    8::Tom and Huck (1995)::Adventure|Children
    9::Sudden Death (1995)::Action
    10::GoldenEye (1995)::Action|Adventure|Thriller
    (...)
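Parsing this format needs little more than two splits: `::` between fields and `|` between genres. The helper below is a hypothetical sketch, not the loader script from the talk. It rejoins the middle fields so that a title containing `::` would still come out in one piece.

```python
def parse_movie(line):
    """Parse one MOVIES.DAT line of the form MovieID::Title::Genres."""
    parts = line.rstrip("\n").split("::")
    movie_id, genres = parts[0], parts[-1]
    title = "::".join(parts[1:-1])  # tolerate '::' inside a title
    return {"id": int(movie_id), "title": title, "genres": genres.split("|")}


sample = [
    "1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy",
    "6::Heat (1995)::Action|Crime|Thriller",
]
movies = [parse_movie(line) for line in sample]
for movie in movies:
    print(movie)
```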
  60. RATINGS.DAT
    UserID::MovieID::Rating::Timestamp
  61. RATINGS.DAT
    2::110::5::868245777
    2::151::3::868246450
    2::260::5::868244562
    2::376::3::868245920
    2::539::3::868246262
    2::590::5::868245608
    2::648::2::868244699
    2::719::3::868246191
    2::733::3::868244562
    2::736::3::868244698
    (...)
  62. LOADING DATA SETS
    Python script to load the data sets into Elasticsearch
    Using the new elasticsearch-dsl library
    One line at a time...
  63. LOADING DATA SETS
    ...it took too long to load the 10M data set
    Improved time by using the bulk API
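The original loader script is not shown, so the following is only a guess at its shape. It groups RATINGS.DAT lines into one document per user, keeping movies rated 4 or higher, and builds action dictionaries in the form that the `elasticsearch.helpers.bulk` helper of the Python client accepts. The function names, the `movielens` index name and the `ratings` type name are assumptions, and no cluster is contacted here.

```python
from collections import defaultdict


def recommended_by_user(lines, min_rating=4):
    """Group rating lines into {user_id: [movie_id, ...]} for ratings >= min_rating."""
    recommended = defaultdict(list)
    for line in lines:
        user_id, movie_id, rating, _timestamp = line.strip().split("::")
        if float(rating) >= min_rating:
            recommended[int(user_id)].append(int(movie_id))
    return dict(recommended)


def bulk_actions(recommended, index="movielens", doc_type="ratings"):
    """Yield one bulk action per user (index and type names are assumptions)."""
    for user_id, movies in recommended.items():
        yield {
            "_index": index,
            "_type": doc_type,  # mapping types existed in the Elasticsearch versions of the time
            "_source": {"user": user_id, "recommended_movies": movies},
        }


lines = [
    "2::110::5::868245777",
    "2::151::3::868246450",
    "2::260::5::868244562",
    "2::376::3::868245920",
    "2::539::3::868246262",
    "2::590::5::868245608",
]
recommended = recommended_by_user(lines)
actions = list(bulk_actions(recommended))
print(actions)
# With a live cluster, these actions could be handed to
# elasticsearch.helpers.bulk(client, actions), which sends them in batches
# instead of one request per document (not executed in this sketch).
```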
  64. THE MOVIE TYPE
    GET /movielens/movie/2
    {
      "name": "Jumanji (1995)"
    }
  65. THE USER RATING TYPE
    GET /movielens/ratings/AVFr1xbVmdUYWpF46Ps4
    {
      "user": 15,
      "recommended_movies": [122, 185, 231, 292, 316, 329]
    }
    Recommended movies have score >= 4
  66. MOVIE RECOMMENDATIONS
    Given Talladega Nights, starring Will Ferrell
    We want to find comedies in a similar style
  67. RECOMMENDING BASED ON POPULARITY
    Find all users who recommended a movie
    Aggregate their recommendations
    Take the top five most-popular
  68. Find the Talladega Nights ID:
    GET /movielens/movie/_search
    {
      "query": {
        "match": { "title": "Talladega Nights" }
      }
    }
  69. Find the Talladega Nights ID:
    {
      "hits": [
        {
          "_id": "46970",
          "_source": {
            "title": "Talladega Nights: The Ballad of Ricky Bobby (2006)"
          }
        }
      ]
    }
  70. Find the most popular movies from people who also like Talladega Nights:
    GET /movielens/ratings/_search?search_type=count
    {
      "query": {
        "filtered": {
          "filter": {
            "term": { "movie": 46970 }
          }
        }
      },
      "aggs": {
        "most_popular": {
          "terms": { "field": "movie", "size": 6 }
        }
      }
    }
  71. After correlating the ids to the titles, we got:
    1. Matrix, The
    2. Shawshank Redemption
    3. Pulp Fiction
    4. Fight Club
    5. Star Wars Episode IV: A New Hope
  72. Very good list! But almost everyone likes them! These are universally well-liked movies.
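The popularity approach amounts to a filter followed by a count. Below is a small in-memory imitation, assuming documents shaped like the user rating type shown earlier. The `ratings` list uses invented users and movie IDs apart from 46970, and the seed movie is dropped from the counts, whereas the query above asks for six buckets because the seed itself appears among them.

```python
from collections import Counter

# Made-up user documents in the shape of the rating type (illustrative only).
ratings = [
    {"user": 1, "recommended_movies": [46970, 10, 20]},
    {"user": 2, "recommended_movies": [46970, 10]},
    {"user": 3, "recommended_movies": [10, 30]},
    {"user": 4, "recommended_movies": [46970, 20, 10]},
]


def most_popular_among_fans(docs, seed_movie, size=5):
    """Count what fans of seed_movie recommend, excluding the seed itself."""
    fans = [doc for doc in docs if seed_movie in doc["recommended_movies"]]
    counts = Counter(
        movie
        for doc in fans
        for movie in doc["recommended_movies"]
        if movie != seed_movie
    )
    return counts.most_common(size)


popular = most_popular_among_fans(ratings, 46970)
print(popular)
```

Movie 10 tops the list here only because every user in the sample recommends it, which is exactly the weakness noted above: raw popularity surfaces what everyone likes, not what is specific to the fans.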
  73. Finding the most popular movies of all time:
    GET /movielens/ratings/_search?search_type=count
    {
      "aggs": {
        "most_popular": {
          "terms": { "field": "movie", "size": 5 }
        }
      }
    }
  74. After correlating the ids to the titles, we got:
    1. Shawshank Redemption
    2. Silence of the Lambs, The
    3. Pulp Fiction
    4. Forrest Gump
    5. Star Wars Episode IV: A New Hope
  75. SIGNIFICANT_TERMS
    Aggregation based on statistics
    Finds uncommonly common terms in a data set
    i.e., statistical anomalies
  76. FOREGROUND GROUP Popular movies among people who enjoy Talladega Nights

  77. BACKGROUND GROUP Most popular movies among the entire user base

  78. Using the significant_terms aggregation:
    GET /movielens/ratings/_search?search_type=count
    {
      "query": {
        "filtered": {
          "filter": {
            "term": { "movie": 46970 }
          }
        }
      },
      "aggs": {
        "most_significant": {
          "significant_terms": { "field": "movie", "size": 6 }
        }
      }
    }
  79. Returned movies:
    {
      "aggregations": {
        "most_significant": {
          "doc_count": 271,
          "buckets": [
            { "key": 46970, "doc_count": 271, "bg_count": 271 },
            { "key": 55245, "doc_count": 59, "bg_count": 185 },
            { "key": 8641, "doc_count": 107, "bg_count": 762 },
            { "key": 58156, "doc_count": 17, "bg_count": 28 },
            { "key": 52973, "doc_count": 95, "bg_count": 857 },
            { "key": 35836, "doc_count": 128, "bg_count": 1610 }
          ]
        }
      }
    }
  80. After correlating the ids to the titles, we got:
    1. Blades of Glory
    2. Anchorman: The Legend of Ron Burgundy
    3. Semi-Pro
    4. Knocked Up
    5. 40-Year-Old Virgin, The
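To see why these movies rank above the universally popular ones, it helps to compare each movie's share of the foreground set with its share of the background set. The sketch below uses the JLH formula that the Elasticsearch documentation describes for `significant_terms`, (fg% - bg%) * (fg% / bg%), applied to the `doc_count` and `bg_count` values returned above. The background size of 72,000 documents is an assumption (one rating document per user, taken from the rounded user count of the data set), so the absolute scores are only indicative.

```python
def jlh_score(doc_count, fg_total, bg_count, bg_total):
    """JLH-style significance: absolute change times relative change in frequency."""
    fg = doc_count / fg_total
    bg = bg_count / bg_total
    if bg == 0 or fg <= bg:
        return 0.0
    return (fg - bg) * (fg / bg)


FG_TOTAL = 271    # users who recommended Talladega Nights (from the response)
BG_TOTAL = 72000  # ASSUMPTION: roughly one rating document per user

# (movie id, doc_count, bg_count) from the response, listed here by movie id.
buckets = [
    (8641, 107, 762),
    (35836, 128, 1610),
    (46970, 271, 271),
    (52973, 95, 857),
    (55245, 59, 185),
    (58156, 17, 28),
]

by_significance = sorted(
    buckets,
    key=lambda b: jlh_score(b[1], FG_TOTAL, b[2], BG_TOTAL),
    reverse=True,
)
by_popularity = sorted(buckets, key=lambda b: b[1], reverse=True)

print("significance:", [movie for movie, _, _ in by_significance])
print("popularity:  ", [movie for movie, _, _ in by_popularity])
```

Under this assumption the significance ordering reproduces the bucket order of the response, while sorting by raw `doc_count` would push movie 35836 to second place even though it is also common across the whole user base.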
  81. THANK YOU! @felipead → GitHub, Twitter, MixCloud