Data Science with Elasticsearch

Using Elasticsearch "significant_terms" aggregation to recommend movies by analyzing a foreground set against a background set.

Felipe Dornelas

March 29, 2016

Transcript

  1. FULL-TEXT SEARCH
    Built on Lucene
    Handles the human language: synonyms, typos and misspellings, internationalization
    Sort results by relevance score
  2. REAL-TIME ANALYTICS
    Lots of aggregations and metrics
    Geolocations
    Can be combined with search
    Real-time (no batch processing)
  3. ID Text
    1 "The quick brown fox jumped over the lazy dog."
    2 "Quick brown foxes leap over lazy dogs in summer."
  4. TOKENIZATION ID Tokens
    1 "The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
    2 "Quick", "brown", "foxes", "leap", "over", "lazy", "dogs", "in", "summer"
  5. REPLACING SYNONYMS
    "jumped" ~ "leap" → "jump"
  6. Term     Doc #1   Doc #2
    brown    X        X
    dog      X        X
    fox      X        X
    in       -        X
    jump     X        X
    lazy     X        X
    over     X        X
    quick    X        X
    summer   -        X
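The tokenization, synonym and term-table slides above can be sketched in plain Python. This is only a toy illustration and not how Lucene's analyzers are implemented: the `STOP_WORDS` set and the `NORMAL_FORMS` mapping are assumptions, hand-picked so that the two example sentences produce the terms listed in the table.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "Quick brown foxes leap over lazy dogs in summer.",
}

# ASSUMPTION: hand-written normalization rules, just enough for the two examples.
STOP_WORDS = {"the"}
NORMAL_FORMS = {"jumped": "jump", "leap": "jump", "foxes": "fox", "dogs": "dog"}


def analyze(text):
    """Lowercase, split into word tokens, drop stop words, map to normal forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [NORMAL_FORMS.get(token, token) for token in tokens if token not in STOP_WORDS]


def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


INDEX = build_index(DOCS)
for term in sorted(INDEX):
    print(term, sorted(INDEX[term]))
```

Here `in` and `summer` map only to document 2, while every other term maps to both documents.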
  7. SEARCH EXAMPLE
    "Quick brown foxes in summer?"
  8. ELASTICSEARCH API
    GET /example/document/_search
    { "query": { "match": { "text": "Quick brown foxes in summer?" } } }
  9. QUERY IS ALSO NORMALIZED
    "quick", "brown", "fox", "in", "summer"
  10. SEARCH RESULTS
    "hits": [
      { "_score": 0.16273327, "_id": "2", "_source": { "text": "Quick brown foxes leap over lazy dogs in summer." } },
      { "_score": 0.01273327, "_id": "1", "_source": { "text": "The quick brown fox jumped over the lazy dog." } }
    ]
  11. SEARCH RESULTS
    Document #2 is a better match
    Higher relevance score than #1
    Search results are sorted by relevance
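To see why document #2 outranks document #1, here is a minimal sketch that scores each document by the fraction of normalized query terms it contains. This term-overlap score is a deliberately simple stand-in and an assumption of this example; Elasticsearch computes `_score` with Lucene's relevance model, so the numbers differ from the ones in the response, but the ordering is the same.

```python
# Normalized terms per document and for the query, as listed on the slides.
DOC_TERMS = {
    1: {"brown", "dog", "fox", "jump", "lazy", "over", "quick"},
    2: {"brown", "dog", "fox", "in", "jump", "lazy", "over", "quick", "summer"},
}
QUERY_TERMS = ["quick", "brown", "fox", "in", "summer"]


def rank(query_terms, doc_terms):
    """Return (doc_id, score) pairs, best match first.

    The score is the share of query terms found in the document
    (a toy measure, not Lucene's scoring formula).
    """
    scores = {
        doc_id: sum(term in terms for term in query_terms) / len(query_terms)
        for doc_id, terms in doc_terms.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


RANKING = rank(QUERY_TERMS, DOC_TERMS)
print(RANKING)
```

Document 2 contains all five query terms, while document 1 lacks `in` and `summer`, so it is listed second.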
  12. DAYS FROM A MONTH
    2014-10-28 ⇒ October
    2014-11-15 ⇒ November
  13. METRICS
    Calculations on top of Buckets
    Ex: min, max, mean, sum…
  14. AGGREGATION EXAMPLE
    partition citizens by state
    then by gender
    then by age ranges
    then calculate average salary for each bucket (metric)
  15. [Diagram: buckets by state (California, New York, Texas), by gender (Male, Female, Non-Binary) and by age range (age < 21, 21 < age < 50, age > 50), leading to a metric of ~ $5000/month avg salary]
  16. CAR TRANSACTIONS EXAMPLE
    GET /cars/transactions/AVFr1xbVmdUYWpF46Ps4
    { "price": 10000, "color": "red", "make": "honda", "sold": "2014-10-28" }
  17. BEST SELLING CAR COLOR
    GET /cars/transactions/_search?search_type=count
    { "aggs": { "colors": { "terms": { "field": "color" } } } }
  18. BEST SELLING CAR COLOR
    { "colors": { "buckets": [
      { "key": "red", "doc_count": 16 },
      { "key": "blue", "doc_count": 8 },
      { "key": "green", "doc_count": 8 }
    ] } }
  19. AVERAGE CAR COLOR PRICE
    GET /cars/transactions/_search?search_type=count
    { "aggs": { "colors": {
      "terms": { "field": "color" },
      "aggs": { "avg_price": { "avg": { "field": "price" } } }
    } } }
  20. AVERAGE CAR COLOR PRICE
    { "colors": { "buckets": [
      { "key": "red", "doc_count": 16, "avg_price": { "value": 32500.0 } },
      { "key": "blue", "doc_count": 8, "avg_price": { "value": 20000.0 } },
      { "key": "green", "doc_count": 8, "avg_price": { "value": 21000.0 } }
    ] } }
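Conceptually, a `terms` aggregation with a nested `avg` metric is a group-by followed by a per-group mean. The sketch below reproduces that in plain Python. Only the first transaction (price 10000, red) comes from the slides; the other rows are a hypothetical sample, chosen so that the per-color averages match the response above (the document counts are smaller than the deck's).

```python
from collections import defaultdict

# First row is from the slides; the rest is HYPOTHETICAL sample data.
TRANSACTIONS = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "green", "price": 30000},
    {"color": "blue", "price": 15000},
    {"color": "green", "price": 12000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 80000},
    {"color": "blue", "price": 25000},
]


def avg_price_by_color(transactions):
    """Bucket transactions by color, then compute the average price per bucket."""
    prices = defaultdict(list)
    for transaction in transactions:
        prices[transaction["color"]].append(transaction["price"])
    buckets = [
        {"key": color, "doc_count": len(values), "avg_price": sum(values) / len(values)}
        for color, values in prices.items()
    ]
    # Largest buckets first; ties are broken alphabetically by key.
    return sorted(buckets, key=lambda bucket: (-bucket["doc_count"], bucket["key"]))


COLOR_BUCKETS = avg_price_by_color(TRANSACTIONS)
for bucket in COLOR_BUCKETS:
    print(bucket)
```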
  21. CAR SALES REVENUE HISTOGRAM
    GET /cars/transactions/_search?search_type=count
    { "aggs": { "price": {
      "histogram": { "field": "price", "interval": 20000 },
      "aggs": { "revenue": { "sum": { "field": "price" } } }
    } } }
  22. CAR SALES REVENUE HISTOGRAM
    { "price": { "buckets": [
      { "key": 0, "doc_count": 12, "revenue": { "value": 148000.0 } },
      { "key": 20000, "doc_count": 16, "revenue": { "value": 380000.0 } },
      { "key": 40000, "doc_count": 0, "revenue": { "value": 0.0 } },
      { "key": 60000, "doc_count": 0, "revenue": { "value": 0.0 } },
      { "key": 80000, "doc_count": 4, "revenue": { "value": 320000.0 } }
    ] } }
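The `histogram` aggregation assigns each document to the bucket whose key is the value rounded down to a multiple of `interval`, and the nested `sum` adds up the prices in each bucket. A minimal sketch of that idea follows; the price list is a hypothetical sample (only the 10000 price appears in the slides), so the counts and revenues are smaller than in the response above.

```python
from collections import defaultdict

# HYPOTHETICAL sample prices; only 10000 appears in the slides.
PRICES = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]


def price_histogram(prices, interval):
    """Group prices into fixed-width buckets and sum the revenue per bucket."""
    counts = defaultdict(int)
    revenue = defaultdict(float)
    for price in prices:
        key = (price // interval) * interval  # lower bound of the bucket
        counts[key] += 1
        revenue[key] += price
    # Emit every bucket between the smallest and largest key, including empty ones.
    keys = range(min(counts), max(counts) + interval, interval)
    return [
        {"key": key, "doc_count": counts.get(key, 0), "revenue": revenue.get(key, 0.0)}
        for key in keys
    ]


PRICE_BUCKETS = price_histogram(PRICES, 20000)
for bucket in PRICE_BUCKETS:
    print(bucket)
```

As in the response, the empty 40000 and 60000 buckets are still listed, with a count and revenue of zero.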
  23. HOW MANY CARS SOLD PER MONTH?
    GET /cars/transactions/_search?search_type=count
    { "aggs": { "sales": {
      "date_histogram": { "field": "sold", "interval": "month", "format": "yyyy-MM-dd" }
    } } }
  24. HOW MANY CARS SOLD PER MONTH?
    { "sales": { "buckets": [
      { "key_as_string": "2014-01-01", "doc_count": 4 },
      { "key_as_string": "2014-02-01", "doc_count": 4 },
      { "key_as_string": "2014-03-01", "doc_count": 0 },
      { "key_as_string": "2014-04-01", "doc_count": 0 },
      { "key_as_string": "2014-05-01", "doc_count": 4 },
      { "key_as_string": "2014-06-01", "doc_count": 0 },
      { "key_as_string": "2014-07-01", "doc_count": 4 },
      { "key_as_string": "2014-08-01", "doc_count": 4 },
      { "key_as_string": "2014-09-01", "doc_count": 0 },
      { "key_as_string": "2014-10-01", "doc_count": 4 },
      { "key_as_string": "2014-11-01", "doc_count": 8 }
    ] } }
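A monthly `date_histogram` truncates each date to the first day of its month and counts the documents per month. The sketch below mimics that in plain Python, including the empty months. The sale dates are a hypothetical sample (only 2014-10-28 appears in the slides), so the counts are smaller than in the response above.

```python
from collections import Counter
from datetime import date

# HYPOTHETICAL sample sale dates; only 2014-10-28 appears in the slides.
SOLD = [
    "2014-10-28", "2014-11-05", "2014-05-18", "2014-07-02",
    "2014-08-19", "2014-11-05", "2014-01-01", "2014-02-12",
]


def month_histogram(days):
    """Count dates per calendar month, listing empty months in between as zero."""
    counts = Counter(date.fromisoformat(day).replace(day=1) for day in days)
    first, last = min(counts), max(counts)
    buckets = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        key = date(year, month, 1)
        buckets.append({"key_as_string": key.isoformat(), "doc_count": counts.get(key, 0)})
        # Step to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


MONTH_BUCKETS = month_histogram(SOLD)
for bucket in MONTH_BUCKETS:
    print(bucket)
```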
  25. MOVIES.DAT
    MovieID::Title::Genres
  26. MOVIES.DAT
    1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
    2::Jumanji (1995)::Adventure|Children|Fantasy
    3::Grumpier Old Men (1995)::Comedy|Romance
    4::Waiting to Exhale (1995)::Comedy|Drama|Romance
    5::Father of the Bride Part II (1995)::Comedy
    6::Heat (1995)::Action|Crime|Thriller
    7::Sabrina (1995)::Comedy|Romance
    8::Tom and Huck (1995)::Adventure|Children
    9::Sudden Death (1995)::Action
    10::GoldenEye (1995)::Action|Adventure|Thriller
    (...)
  27. RATINGS.DAT
    UserID::MovieID::Rating::Timestamp
  28. RATINGS.DAT
    2::110::5::868245777
    2::151::3::868246450
    2::260::5::868244562
    2::376::3::868245920
    2::539::3::868246262
    2::590::5::868245608
    2::648::2::868244699
    2::719::3::868246191
    2::733::3::868244562
    2::736::3::868244698
    (...)
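Both files use `::` as the field separator, and the genres are separated by `|`. A minimal parsing sketch is shown below; the function names and the dictionary keys are assumptions of this example, and it assumes that no title contains `::` itself.

```python
def parse_movie(line):
    """Parse one 'MovieID::Title::Genres' line into a dictionary."""
    movie_id, title, genres = line.strip().split("::")
    return {"id": int(movie_id), "title": title, "genres": genres.split("|")}


def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a dictionary."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return {
        "user": int(user_id),
        "movie": int(movie_id),
        "rating": float(rating),
        "timestamp": int(timestamp),
    }


MOVIE = parse_movie("2::Jumanji (1995)::Adventure|Children|Fantasy")
RATING = parse_rating("2::110::5::868245777")
print(MOVIE)
print(RATING)
```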
  29. LOADING DATA SETS
    Python script to load the data sets into Elasticsearch
    Using the new elasticsearch-dsl library
    One line at a time...
  30. LOADING DATA SETS
    ...it took too long to load the 10M data set
    Improved time by using the bulk API
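The bulk API takes a newline-delimited body in which every document is preceded by an action line, so many documents travel in a single request instead of one request each. The sketch below only builds such a body and does not send it; it is not the deck's loader script. The `bulk_body` helper is hypothetical, and the `_type` metadata field assumes the older index/type layout used by the URLs in this deck.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a newline-delimited bulk body from (doc_id, source) pairs.

    Each document becomes two lines: an 'index' action line with its
    metadata, followed by the document source itself.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


BODY = bulk_body("movielens", "movie", [
    (1, {"name": "Toy Story (1995)"}),
    (2, {"name": "Jumanji (1995)"}),
])
print(BODY)
```

In practice a client library would normally build and send these batches; writing the body out by hand simply makes the two-lines-per-document format visible.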
  31. THE MOVIE TYPE
    GET /movielens/movie/2
    { "name": "Jumanji (1995)" }
  32. THE USER RATING TYPE
    GET /movielens/ratings/AVFr1xbVmdUYWpF46Ps4
    { "user": 15, "recommended_movies": [122, 185, 231, 292, 316, 329] }
    Recommended movies have score >= 4
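One way to derive such a document from the raw ratings is to keep, for each user, the movies rated 4 or higher. The sketch below applies that rule to the ten sample ratings of user 2 shown earlier in the deck; the `recommended_by_user` helper is hypothetical and is not taken from the deck's loader script.

```python
from collections import defaultdict

# The sample ratings.dat lines shown earlier in the deck (all from user 2).
RATING_LINES = [
    "2::110::5::868245777", "2::151::3::868246450", "2::260::5::868244562",
    "2::376::3::868245920", "2::539::3::868246262", "2::590::5::868245608",
    "2::648::2::868244699", "2::719::3::868246191", "2::733::3::868244562",
    "2::736::3::868244698",
]


def recommended_by_user(lines, min_rating=4):
    """Collect, per user, the ids of the movies rated at least min_rating."""
    recommended = defaultdict(list)
    for line in lines:
        user, movie, rating, _timestamp = line.split("::")
        if float(rating) >= min_rating:
            recommended[int(user)].append(int(movie))
    return [{"user": user, "recommended_movies": movies} for user, movies in recommended.items()]


RATING_DOCS = recommended_by_user(RATING_LINES)
print(RATING_DOCS)
```

Of the ten sample ratings, only the three with a score of 5 pass the threshold.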
  33. RECOMMENDING BASED ON POPULARITY
    Find all users who recommended a movie
    Aggregate their recommendations
    Take the top five most popular
  34. Find the Talladega Nights ID:
    GET /movielens/movie/_search
    { "query": { "match": { "title": "Talladega Nights" } } }
  35. Find the Talladega Nights ID:
    { "hits": [
      { "_id": "46970", "_source": { "title": "Talladega Nights: The Ballad of Ricky Bobby (2006)" } }
    ] }
  36. Find the most popular movies from people who also like Talladega Nights:
    GET /movielens/ratings/_search?search_type=count
    { "query": { "filtered": { "filter": { "term": { "movie": 46970 } } } },
      "aggs": { "most_popular": { "terms": { "field": "movie", "size": 6 } } } }
  37. After correlating the ids to the titles, we got:
    1. Matrix, The
    2. Shawshank Redemption
    3. Pulp Fiction
    4. Fight Club
    5. Star Wars Episode IV: A New Hope
  38. Finding the most popular movies of all time:
    GET /movielens/ratings/_search?search_type=count
    { "aggs": { "most_popular": { "terms": { "field": "movie", "size": 5 } } } }
  39. After correlating the ids to the titles, we got:
    1. Shawshank Redemption
    2. Silence of the Lambs, The
    3. Pulp Fiction
    4. Forrest Gump
    5. Star Wars Episode IV: A New Hope
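The two popularity queries differ only in the filter: one counts recommendations among users who liked a given movie, the other counts them across all users. A minimal sketch with a `Counter` shows the idea. The four rating documents are hypothetical toy data, and apart from 46970 (Talladega Nights) the movie ids are arbitrary placeholders.

```python
from collections import Counter

# HYPOTHETICAL toy data in the shape of the deck's rating documents.
RATINGS = [
    {"user": 1, "recommended_movies": [46970, 318, 296]},
    {"user": 2, "recommended_movies": [46970, 318, 8641]},
    {"user": 3, "recommended_movies": [318, 296]},
    {"user": 4, "recommended_movies": [46970, 8641, 318]},
]


def most_popular(ratings, size, liked=None):
    """Count recommendations, optionally only among users who recommended `liked`."""
    counts = Counter()
    for rating in ratings:
        if liked is None or liked in rating["recommended_movies"]:
            counts.update(rating["recommended_movies"])
    return counts.most_common(size)


print(most_popular(RATINGS, 3, liked=46970))  # popularity among Talladega Nights fans
print(most_popular(RATINGS, 1))               # popularity across all users
```

The liked movie always appears in its own filtered result, which is presumably why the filtered query asks for six buckets to obtain five recommendations. In this toy data, movie 318 is the overall favorite and also ranks high among the fans, much as both title lists in the deck are dominated by generally popular films.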
  40. SIGNIFICANT_TERMS
    Aggregation based on statistics
    Finds uncommonly common terms in a data set
    i.e., statistical anomalies
  41. Using the significant_terms aggregation:
    GET /movielens/ratings/_search?search_type=count
    { "query": { "filtered": { "filter": { "term": { "movie": 46970 } } } },
      "aggs": { "most_significant": { "significant_terms": { "field": "movie", "size": 6 } } } }
  42. Returned movies:
    { "aggregations": { "most_significant": {
      "doc_count": 271,
      "buckets": [
        { "key": 46970, "doc_count": 271, "bg_count": 271 },
        { "key": 55245, "doc_count": 59, "bg_count": 185 },
        { "key": 8641, "doc_count": 107, "bg_count": 762 },
        { "key": 58156, "doc_count": 17, "bg_count": 28 },
        { "key": 52973, "doc_count": 95, "bg_count": 857 },
        { "key": 35836, "doc_count": 128, "bg_count": 1610 }
      ]
    } } }
  43. After correlating the ids to the titles, we got:
    1. Blades of Glory
    2. Anchorman: The Legend of Ron Burgundy
    3. Semi-Pro
    4. Knocked Up
    5. 40-Year-Old Virgin, The
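To make "uncommonly common" concrete, the sketch below scores the returned buckets with the JLH heuristic that Elasticsearch documents for `significant_terms`: the absolute change in frequency between the foreground and background sets, multiplied by the relative change. The `doc_count` and `bg_count` values are copied from the response in the deck. The background size is not stated in the slides, so `BACKGROUND_SIZE` is an assumed round figure, and whether the deck's cluster used exactly this heuristic is also an assumption.

```python
# (movie id, doc_count, bg_count) copied from the response in the deck.
BUCKETS = [
    (46970, 271, 271), (55245, 59, 185), (8641, 107, 762),
    (58156, 17, 28), (52973, 95, 857), (35836, 128, 1610),
]
FOREGROUND_SIZE = 271     # users who recommended movie 46970 (top-level doc_count)
BACKGROUND_SIZE = 70_000  # ASSUMPTION: total rating documents, not given in the slides


def jlh_score(doc_count, bg_count, fg_size, bg_size):
    """JLH score: (fg% - bg%) * (fg% / bg%), or 0 if the term is not over-represented."""
    fg_rate = doc_count / fg_size
    bg_rate = bg_count / bg_size
    if fg_rate <= bg_rate:
        return 0.0
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


def score(bucket):
    _movie, doc_count, bg_count = bucket
    return jlh_score(doc_count, bg_count, FOREGROUND_SIZE, BACKGROUND_SIZE)


RANKED = sorted(BUCKETS, key=score, reverse=True)
for movie, doc_count, bg_count in RANKED:
    print(movie, doc_count, bg_count, round(score((movie, doc_count, bg_count)), 2))
```

Under the assumed background size, the scores come out in the same order as the buckets in the response. Sorting by `doc_count` alone would instead move 35836 up to second place: it is popular among these users but also popular with everyone else, so it is less significant than its raw count suggests.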