Slide 1

Slide 1 text

DATA SCIENCE WITH ELASTICSEARCH
FELIPE DORNELAS

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

ABOUT ME
Software Engineer
Data NERD
Electronic Music enthusiast
Work at ThoughtWorks

Slide 4

Slide 4 text

WHAT IS ELASTICSEARCH?

Slide 5

Slide 5 text

A real-time distributed search and analytics engine

Slide 6

Slide 6 text

Free and Open Source

Slide 7

Slide 7 text

Distributed document store
Full-text search
Real-time analytics

Slide 8

Slide 8 text

DISTRIBUTED DOCUMENT STORE
RESTful API
Automatic scale (Plug & Play)
Capable of handling petabytes of data

Slide 9

Slide 9 text

FULL-TEXT SEARCH
Built on Lucene
Handles the human language:
  Synonyms, typos and misspellings
  Internationalization
Sort results by relevance score

Slide 10

Slide 10 text

REAL-TIME ANALYTICS
Lots of aggregations and metrics
Geolocations
Can be combined with search
Real-time (no batch processing)

Slide 11

Slide 11 text

SEARCH

Slide 12

Slide 12 text

STRUCTURED SEARCH (SQL)
"Does the document match the query?"
Yes-or-no question

Slide 13

Slide 13 text

FULL-TEXT SEARCH
"How well does the document match the search?"
Relevance score

Slide 14

Slide 14 text

INVERTED INDEX

Slide 15

Slide 15 text

ID  Text
1   "The quick brown fox jumped over the lazy dog."
2   "Quick brown foxes leap over lazy dogs in summer."

Slide 16

Slide 16 text

TOKENIZATION
ID  Tokens
1   "The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
2   "Quick", "brown", "foxes", "leap", "over", "lazy", "dogs", "in", "summer"

Slide 17

Slide 17 text

NORMALIZATION

Slide 18

Slide 18 text

CAPITALIZATION
"Quick" → "quick"

Slide 19

Slide 19 text

STEMMING
"foxes" → "fox"

Slide 20

Slide 20 text

REPLACING SYNONYMS
"jumped" ~ "leap" → "jump"

Slide 21

Slide 21 text

REMOVING COMMON WORDS
"the"

Slide 22

Slide 22 text

Term     Doc #1  Doc #2
brown    X       X
dog      X       X
fox      X       X
in       -       X
jump     X       X
lazy     X       X
over     X       X
quick    X       X
summer   -       X
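
The tokenization and normalization steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word, stemming and synonym tables are tiny hand-written stand-ins (assumptions, not Lucene's analyzers), and the names `analyze` and `build_inverted_index` are hypothetical.

```python
import re
from collections import defaultdict

# Toy normalization tables: hypothetical stand-ins for a real analyzer.
STOP_WORDS = {"the"}
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}

DOCS = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "Quick brown foxes leap over lazy dogs in summer.",
}


def analyze(text):
    """Tokenize on letters, then lowercase, stem, map synonyms, drop stop words."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        term = token.lower()
        term = STEMS.get(term, term)
        term = SYNONYMS.get(term, term)
        if term not in STOP_WORDS:
            terms.append(term)
    return terms


def build_inverted_index(docs):
    """Map each normalized term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


INDEX = build_inverted_index(DOCS)
```

Looking up a term is then a dictionary access, which is what makes the inverted index fast for search.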

Slide 23

Slide 23 text

SEARCH EXAMPLE
"Quick brown foxes in summer?"

Slide 24

Slide 24 text

ELASTICSEARCH API
GET /example/document/_search
{
  "query": {
    "match": { "text": "Quick brown foxes in summer?" }
  }
}

Slide 25

Slide 25 text

QUERY IS ALSO NORMALIZED
"quick", "brown", "fox", "in", "summer"

Slide 26

Slide 26 text

MATCHING THE INVERTED INDEX
Term     Doc #1  Doc #2
quick    X       X
brown    X       X
fox      X       X
in       -       X
summer   -       X

Slide 27

Slide 27 text

SEARCH RESULTS
"hits": [
  {
    "_score": 0.16273327,
    "_id": "2",
    "_source": { "text": "Quick brown foxes leap over lazy dogs in summer." }
  },
  {
    "_score": 0.01273327,
    "_id": "1",
    "_source": { "text": "The quick brown fox jumped over the lazy dog." }
  }
]

Slide 28

Slide 28 text

SEARCH RESULTS
Document #2 is a better match
Higher relevance score than #1
Search results are sorted by relevance
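
A rough way to see why document #2 ranks first is to count how many query terms each document matches. The sketch below does only that. The real `_score` values come from Lucene's scoring model, not from this count, and the `rank` helper is a hypothetical name.

```python
# Pre-normalized inverted index for the two example documents.
INDEX = {
    "brown": {1, 2}, "dog": {1, 2}, "fox": {1, 2}, "in": {2},
    "jump": {1, 2}, "lazy": {1, 2}, "over": {1, 2}, "quick": {1, 2},
    "summer": {2},
}


def rank(query_terms, index):
    """Return (doc_id, matched_term_count) pairs, best match first."""
    scores = {}
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest count first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


RANKED = rank(["quick", "brown", "fox", "in", "summer"], INDEX)
```

Document 2 matches all five normalized query terms, while document 1 matches three.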

Slide 29

Slide 29 text

AGGREGATIONS

Slide 30

Slide 30 text

BUCKETS + METRICS

Slide 31

Slide 31 text

BUCKETS
Collection of documents that meet a certain criterion

Slide 32

Slide 32 text

GENDER SOMEONE IDENTIFIES WITH
Alice ⇒ female
Josh ⇒ male
Karen ⇒ non-binary

Slide 33

Slide 33 text

CITIES FROM A STATE
San Francisco ⇒ California
Belo Horizonte ⇒ Minas Gerais

Slide 34

Slide 34 text

DAYS FROM A MONTH
2014-10-28 ⇒ October
2014-11-15 ⇒ November

Slide 35

Slide 35 text

METRICS
Calculations on top of Buckets
Ex: min, max, mean, sum…

Slide 36

Slide 36 text

AGGREGATION EXAMPLE
partition citizens by state
then by gender
then by age ranges
then calculate average salary for each bucket (metric)

Slide 37

Slide 37 text

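[Diagram: citizens bucketed by state (California, New York, Texas), then by gender (Male, Female, Non-Binary), then by age range (age < 21, 21 < age < 50, age > 50); the final bucket shows an average salary of ~$5000/month]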
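
The same bucket-then-metric idea, sketched in plain Python rather than as an Elasticsearch request. The sample records, the age-range labels and the helper names are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records; none of these values come from the talk.
CITIZENS = [
    {"state": "California", "gender": "female", "age": 34, "salary": 6000},
    {"state": "California", "gender": "female", "age": 41, "salary": 4000},
    {"state": "California", "gender": "male", "age": 19, "salary": 1500},
    {"state": "Texas", "gender": "non-binary", "age": 58, "salary": 5200},
]


def age_range(age):
    """Assign an age to one of three illustrative ranges."""
    if age < 21:
        return "under 21"
    if age <= 50:
        return "21 to 50"
    return "over 50"


def average_salary_by_bucket(people):
    """Bucket by (state, gender, age range), then average salary per bucket."""
    buckets = defaultdict(list)
    for person in people:
        key = (person["state"], person["gender"], age_range(person["age"]))
        buckets[key].append(person["salary"])
    return {key: mean(salaries) for key, salaries in buckets.items()}


AVERAGES = average_salary_by_bucket(CITIZENS)
```

Each dictionary key plays the role of a nested bucket, and the mean is the metric computed on top of it.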

Slide 38

Slide 38 text

REAL-TIME ANALYTICS

Slide 39

Slide 39 text

CAR TRANSACTIONS EXAMPLE
GET /cars/transactions/AVFr1xbVmdUYWpF46Ps4
{
  "price": 10000,
  "color": "red",
  "make": "honda",
  "sold": "2014-10-28"
}

Slide 40

Slide 40 text

BEST SELLING CAR COLOR
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    }
  }
}

Slide 41

Slide 41 text

BEST SELLING CAR COLOR
{
  "colors": {
    "buckets": [
      { "key": "red", "doc_count": 16 },
      { "key": "blue", "doc_count": 8 },
      { "key": "green", "doc_count": 8 }
    ]
  }
}

Slide 42

Slide 42 text

AVERAGE CAR COLOR PRICE
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": { "field": "color" },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}

Slide 43

Slide 43 text

AVERAGE CAR COLOR PRICE
{
  "colors": {
    "buckets": [
      { "key": "red", "doc_count": 16, "avg_price": { "value": 32500.0 } },
      { "key": "blue", "doc_count": 8, "avg_price": { "value": 20000.0 } },
      { "key": "green", "doc_count": 8, "avg_price": { "value": 21000.0 } }
    ]
  }
}
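
To make the shape of this response less magical, here is a plain-Python sketch of a terms bucket with a nested average. Only the first transaction comes from the slides. The other three, and the `terms_with_avg` helper, are made up for the example.

```python
from collections import defaultdict

# The first document is the one shown earlier; the rest are hypothetical.
TRANSACTIONS = [
    {"price": 10000, "color": "red", "make": "honda", "sold": "2014-10-28"},
    {"price": 20000, "color": "red", "make": "honda", "sold": "2014-11-05"},
    {"price": 30000, "color": "green", "make": "ford", "sold": "2014-05-18"},
    {"price": 15000, "color": "blue", "make": "toyota", "sold": "2014-07-02"},
]


def terms_with_avg(docs, bucket_field, metric_field):
    """Group docs by bucket_field and average metric_field in each group."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc[bucket_field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "avg_price": sum(values) / len(values),
        }
        for key, values in grouped.items()
    ]
    # Largest buckets first; ties are broken alphabetically in this sketch.
    buckets.sort(key=lambda bucket: (-bucket["doc_count"], bucket["key"]))
    return buckets


BUCKETS = terms_with_avg(TRANSACTIONS, "color", "price")
```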

Slide 44

Slide 44 text

CAR SALES REVENUE HISTOGRAM
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "price": {
      "histogram": { "field": "price", "interval": 20000 },
      "aggs": {
        "revenue": {
          "sum": { "field": "price" }
        }
      }
    }
  }
}

Slide 45

Slide 45 text

CAR SALES REVENUE HISTOGRAM
{
  "price": {
    "buckets": [
      { "key": 0, "doc_count": 12, "revenue": { "value": 148000.0 } },
      { "key": 20000, "doc_count": 16, "revenue": { "value": 380000.0 } },
      { "key": 40000, "doc_count": 0, "revenue": { "value": 0.0 } },
      { "key": 60000, "doc_count": 0, "revenue": { "value": 0.0 } },
      { "key": 80000, "doc_count": 4, "revenue": { "value": 320000.0 } }
    ]
  }
}
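
The bucket keys in a histogram are just prices rounded down to a multiple of the interval. A minimal sketch of that arithmetic, with a sum metric per bucket, follows. The prices are hypothetical and `revenue_histogram` is an illustrative name, not an Elasticsearch API.

```python
def revenue_histogram(prices, interval):
    """Bucket prices by interval and sum the prices that fall in each bucket."""
    if not prices:
        return []
    buckets = {}
    for price in prices:
        key = (price // interval) * interval  # round down to the bucket start
        count, revenue = buckets.get(key, (0, 0))
        buckets[key] = (count + 1, revenue + price)
    # Also emit the empty buckets between the lowest and highest key.
    result = []
    for key in range(min(buckets), max(buckets) + interval, interval):
        count, revenue = buckets.get(key, (0, 0))
        result.append({"key": key, "doc_count": count, "revenue": revenue})
    return result


PRICES = [10000, 12000, 25000, 80000]  # hypothetical sale prices
HISTOGRAM = revenue_histogram(PRICES, 20000)
```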

Slide 46

Slide 46 text

CAR SALES REVENUE HISTOGRAM

Slide 47

Slide 47 text

TIME-SERIES DATA
Any data with a timestamp
Ex: server logs, sales history, stock prices

Slide 48

Slide 48 text

HOW MANY CARS SOLD PER MONTH?
GET /cars/transactions/_search?search_type=count
{
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold",
        "interval": "month",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

Slide 49

Slide 49 text

HOW MANY CARS SOLD PER MONTH?
{
  "sales": {
    "buckets": [
      { "key_as_string": "2014-01-01", "doc_count": 4 },
      { "key_as_string": "2014-02-01", "doc_count": 4 },
      { "key_as_string": "2014-03-01", "doc_count": 0 },
      { "key_as_string": "2014-04-01", "doc_count": 0 },
      { "key_as_string": "2014-05-01", "doc_count": 4 },
      { "key_as_string": "2014-06-01", "doc_count": 0 },
      { "key_as_string": "2014-07-01", "doc_count": 4 },
      { "key_as_string": "2014-08-01", "doc_count": 4 },
      { "key_as_string": "2014-09-01", "doc_count": 0 },
      { "key_as_string": "2014-10-01", "doc_count": 4 },
      { "key_as_string": "2014-11-01", "doc_count": 8 }
    ]
  }
}
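
A date histogram works the same way as a numeric one, with calendar months as buckets. The sketch below counts sales per month in plain Python and fills in empty months. The sale dates are illustrative and `sales_per_month` is a hypothetical helper.

```python
from collections import Counter
from datetime import date


def sales_per_month(sold_dates):
    """Count ISO dates per calendar month, including empty months in between."""
    months = Counter()
    for iso in sold_dates:
        day = date.fromisoformat(iso)
        months[(day.year, day.month)] += 1
    if not months:
        return []
    (year, month), last = min(months), max(months)
    buckets = []
    while (year, month) <= last:
        buckets.append({
            "key_as_string": date(year, month, 1).isoformat(),
            "doc_count": months[(year, month)],  # Counter returns 0 if absent
        })
        # Step to the first day of the next month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


SOLD = ["2014-10-28", "2014-11-15", "2014-11-20", "2014-08-03"]  # sample dates
MONTHLY = sales_per_month(SOLD)
```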

Slide 50

Slide 50 text

HOW MANY CARS SOLD PER MONTH?

Slide 51

Slide 51 text

COMMON SETUP - ELK
Logstash ⇒ Elasticsearch ⇒ Kibana

Slide 52

Slide 52 text

LOGSTASH Collect and stream logs into Elasticsearch

Slide 53

Slide 53 text

KIBANA An analytics dashboard for Elasticsearch

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

MOVIE RECOMMENDATIONS

Slide 56

Slide 56 text

THE MOVIELENS DATA SETS
Movies Catalog
User Movie Recommendations (0 to 5)
User Movie Tags

Slide 57

Slide 57 text

THE MOVIELENS 10M DATASET
10 million ratings
10,000 movies
72,000 users
Released in 2009

Slide 58

Slide 58 text

MOVIES.DAT
MovieID::Title::Genres

Slide 59

Slide 59 text

MOVIES.DAT
1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
2::Jumanji (1995)::Adventure|Children|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama|Romance
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
(...)

Slide 60

Slide 60 text

RATINGS.DAT
UserID::MovieID::Rating::Timestamp

Slide 61

Slide 61 text

RATINGS.DAT
2::110::5::868245777
2::151::3::868246450
2::260::5::868244562
2::376::3::868245920
2::539::3::868246262
2::590::5::868245608
2::648::2::868244699
2::719::3::868246191
2::733::3::868244562
2::736::3::868244698
(...)
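
Both files use `::` as the field separator, so each line can be split into a small dictionary. A minimal sketch, assuming titles never contain `::`. The output field names and the choice to parse ratings as floats are my own, not taken from the talk.

```python
def parse_movie(line):
    """Parse one 'MovieID::Title::Genres' line into a dict."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    return {"id": int(movie_id), "title": title, "genres": genres.split("|")}


def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a dict."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return {
        "user": int(user_id),
        "movie": int(movie_id),
        "rating": float(rating),  # float, in case a rating is not a whole number
        "timestamp": int(timestamp),
    }


MOVIE = parse_movie("2::Jumanji (1995)::Adventure|Children|Fantasy")
RATING = parse_rating("2::110::5::868245777")
```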

Slide 62

Slide 62 text

LOADING DATA SETS
Python script to load the data sets into Elasticsearch
Using the new elasticsearch-dsl library
One line at a time...

Slide 63

Slide 63 text

LOADING DATA SETS
...it took too long to load the 10M data set
Improved time by using the bulk API
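
The talk does not show the loading script, so the following is only a sketch of what a bulk request body looks like: newline-delimited JSON with one action line and one source line per document. It builds the body and sends nothing. The index and type names mirror the URLs used in this deck, the `_type` entry assumes an older Elasticsearch version that still has mapping types, and `bulk_body` is a hypothetical helper.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a newline-delimited bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line: tells Elasticsearch where to index the next line.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


MOVIES = [
    ("1", {"title": "Toy Story (1995)"}),
    ("2", {"title": "Jumanji (1995)"}),
]
BODY = bulk_body("movielens", "movie", MOVIES)
```

Sending many documents in one such request avoids paying one HTTP round trip per line, which is the likely reason the bulk API improved the load time.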

Slide 64

Slide 64 text

THE MOVIE TYPE
GET /movielens/movie/2
{
  "name": "Jumanji (1995)"
}

Slide 65

Slide 65 text

THE USER RATING TYPE
GET /movielens/ratings/AVFr1xbVmdUYWpF46Ps4
{
  "user": 15,
  "recommended_movies": [122, 185, 231, 292, 316, 329]
}
Recommended movies have score >= 4
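
One way to get from individual ratings to a per-user document like this is to group ratings by user and keep the movies rated 4 or higher. This is a sketch of that grouping, not the script used in the talk. The input rows reuse a few of the RATINGS.DAT lines shown earlier, and `recommended_by_user` is an illustrative name.

```python
from collections import defaultdict


def recommended_by_user(ratings, threshold=4):
    """Group ratings by user, keeping only movies rated at or above threshold."""
    by_user = defaultdict(list)
    for rating in ratings:
        if rating["rating"] >= threshold:
            by_user[rating["user"]].append(rating["movie"])
    return [
        {"user": user, "recommended_movies": sorted(movies)}
        for user, movies in sorted(by_user.items())
    ]


RATINGS = [
    {"user": 2, "movie": 110, "rating": 5},
    {"user": 2, "movie": 151, "rating": 3},
    {"user": 2, "movie": 260, "rating": 5},
    {"user": 2, "movie": 376, "rating": 3},
    {"user": 2, "movie": 648, "rating": 2},
]
USER_DOCS = recommended_by_user(RATINGS)
```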

Slide 66

Slide 66 text

MOVIE RECOMMENDATIONS
Given Talladega Nights, starring Will Ferrell
We want to find comedies in a similar style

Slide 67

Slide 67 text

RECOMMENDING BASED ON POPULARITY
Find all users who recommended a movie
Aggregate their recommendations
Take the top five most popular
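
These three steps can be sketched without Elasticsearch at all. The user documents and movie IDs below are invented, and `most_popular_among_fans` is a hypothetical helper. Unlike a raw terms aggregation, it leaves the seed movie out of its own results.

```python
from collections import Counter

# Hypothetical per-user documents with made-up movie IDs.
USER_DOCS = [
    {"user": 1, "recommended_movies": [10, 20, 30]},
    {"user": 2, "recommended_movies": [10, 20]},
    {"user": 3, "recommended_movies": [20, 30]},
    {"user": 4, "recommended_movies": [10, 20, 30, 40]},
]


def most_popular_among_fans(user_docs, movie_id, size=5):
    """Count what else the fans of movie_id recommended; return the top `size`."""
    counts = Counter()
    for doc in user_docs:
        movies = doc["recommended_movies"]
        if movie_id in movies:  # step 1: users who recommended the movie
            # step 2: aggregate their other recommendations
            counts.update(m for m in movies if m != movie_id)
    return counts.most_common(size)  # step 3: take the most popular


TOP = most_popular_among_fans(USER_DOCS, 10, size=2)
```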

Slide 68

Slide 68 text

Find the Talladega Nights ID:
GET /movielens/movie/_search
{
  "query": {
    "match": { "title": "Talladega Nights" }
  }
}

Slide 69

Slide 69 text

Find the Talladega Nights ID:
{
  "hits": [
    {
      "_id": "46970",
      "_source": {
        "title": "Talladega Nights: The Ballad of Ricky Bobby (2006)"
      }
    }
  ]
}

Slide 70

Slide 70 text

Find the most popular movies from people who also like Talladega Nights:
GET /movielens/ratings/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "movie": 46970 }
      }
    }
  },
  "aggs": {
    "most_popular": {
      "terms": { "field": "movie", "size": 6 }
    }
  }
}

Slide 71

Slide 71 text

After correlating the ids to the titles, we got:
1. Matrix, The
2. Shawshank Redemption
3. Pulp Fiction
4. Fight Club
5. Star Wars Episode IV: A New Hope

Slide 72

Slide 72 text

Very good list! But almost everyone likes them! These are universally well-liked movies.

Slide 73

Slide 73 text

Finding the most popular movies of all time:
GET /movielens/ratings/_search?search_type=count
{
  "aggs": {
    "most_popular": {
      "terms": { "field": "movie", "size": 5 }
    }
  }
}

Slide 74

Slide 74 text

After correlating the ids to the titles, we got:
1. Shawshank Redemption
2. Silence of the Lambs, The
3. Pulp Fiction
4. Forrest Gump
5. Star Wars Episode IV: A New Hope

Slide 75

Slide 75 text

SIGNIFICANT_TERMS
Aggregation based on statistics
Finds uncommonly common terms in a data set
i.e., statistical anomalies

Slide 76

Slide 76 text

FOREGROUND GROUP
Popular movies among people who enjoy Talladega Nights

Slide 77

Slide 77 text

BACKGROUND GROUP
Most popular movies among the entire user base

Slide 78

Slide 78 text

Using the significant_terms aggregation:
GET /movielens/ratings/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "movie": 46970 }
      }
    }
  },
  "aggs": {
    "most_significant": {
      "significant_terms": { "field": "movie", "size": 6 }
    }
  }
}

Slide 79

Slide 79 text

Returned movies:
{
  "aggregations": {
    "most_significant": {
      "doc_count": 271,
      "buckets": [
        { "key": 46970, "doc_count": 271, "bg_count": 271 },
        { "key": 55245, "doc_count": 59, "bg_count": 185 },
        { "key": 8641, "doc_count": 107, "bg_count": 762 },
        { "key": 58156, "doc_count": 17, "bg_count": 28 },
        { "key": 52973, "doc_count": 95, "bg_count": 857 },
        { "key": 35836, "doc_count": 128, "bg_count": 1610 }
      ]
    }
  }
}
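
To get a feel for how `doc_count` (foreground) and `bg_count` (background) combine into a ranking, here is a sketch of a JLH-style score: the absolute change in frequency multiplied by the relative change. To my understanding this resembles Elasticsearch's default heuristic, but treat it as an approximation. The response does not include the background set size, so `BG_SIZE` below is an assumption based on the roughly 72,000 users quoted for the dataset.

```python
def jlh_score(doc_count, bg_count, fg_size, bg_size):
    """Absolute change times relative change between foreground and background."""
    fg_pct = doc_count / fg_size
    bg_pct = bg_count / bg_size
    if bg_pct == 0 or fg_pct <= bg_pct:
        return 0.0  # not more common in the foreground: not significant
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)


FG_SIZE = 271    # top-level doc_count in the response above
BG_SIZE = 72000  # ASSUMPTION: approximate number of user documents

# (movie id, doc_count, bg_count) copied from the response above.
BUCKETS = [
    (35836, 128, 1610),
    (52973, 95, 857),
    (58156, 17, 28),
    (8641, 107, 762),
    (55245, 59, 185),
    (46970, 271, 271),
]

RANKED = sorted(
    BUCKETS,
    key=lambda b: jlh_score(b[1], b[2], FG_SIZE, BG_SIZE),
    reverse=True,
)
RANKED_IDS = [movie_id for movie_id, _, _ in RANKED]
```

Under this assumption, a movie liked by only 17 of the 271 fans can outrank one liked by 128 of them, because almost nobody outside the foreground group likes it.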

Slide 80

Slide 80 text

After correlating the ids to the titles, we got:
1. Blades of Glory
2. Anchorman: The Legend of Ron Burgundy
3. Semi-Pro
4. Knocked Up
5. 40-Year-Old Virgin, The

Slide 81

Slide 81 text

THANK YOU!
@felipead → GitHub, Twitter, MixCloud