Data Science with Elasticsearch

Using Elasticsearch "significant_terms" aggregation to recommend movies by analyzing a foreground set against a background set.

Felipe Dornelas

March 29, 2016

Transcript

  1. DATA SCIENCE WITH ELASTICSEARCH
    FELIPE DORNELAS

  3. ABOUT ME
    Software Engineer
    Data NERD
    Electronic Music enthusiast
    Work at ThoughtWorks

  4. WHAT IS ELASTICSEARCH?

  5. A real-time distributed search and analytics engine

  6. Free and Open Source

  7. Distributed document store
    Full-text search
    Real-time analytics

  8. DISTRIBUTED DOCUMENT STORE
    RESTful API
    Automatic scale (Plug & Play)
    Capable of handling petabytes of data

  9. FULL-TEXT SEARCH
    Built on Lucene
    Handles the human language:
    Synonyms, typos and misspellings
    Internationalization
    Sort results by relevance score

  10. REAL-TIME ANALYTICS
    Lots of aggregations and metrics
    Geolocations
    Can be combined with search
    Real-time (no batch-processing)

  11. SEARCH

  12. STRUCTURED SEARCH (SQL)
    "Does the document match the query?"
    Yes or no question

  13. FULL-TEXT SEARCH
    "How well does the document match the search?"
    Relevance score

  14. INVERTED INDEX

  15. ID Text
    1  "The quick brown fox jumped over the lazy dog."
    2  "Quick brown foxes leap over lazy dogs in summer."

  16. TOKENIZATION
    ID Tokens
    1  "The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
    2  "Quick", "brown", "foxes", "leap", "over", "lazy", "dogs", "in", "summer"

  17. NORMALIZATION

  18. CAPITALIZATION
    "Quick" → "quick"

  19. STEMMING
    "foxes" → "fox"

  20. REPLACING SYNONYMS
    "jumped" ~ "leap" → "jump"

  21. REMOVING COMMON WORDS
    "the"

  22. Term      Doc #1  Doc #2
    brown       X       X
    dog         X       X
    fox         X       X
    in          -       X
    jump        X       X
    lazy        X       X
    over        X       X
    quick       X       X
    summer      -       X
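
The tokenization and normalization steps from the preceding slides can be sketched in plain Python. This is a toy illustration under stated assumptions, not how Lucene's analyzers are implemented: the `STEMS`, `SYNONYMS` and `STOPWORDS` tables are hypothetical lookups that only cover the two example documents, and `analyze` and `build_inverted_index` are names made up for this sketch.

```python
import re
from collections import defaultdict

# Toy normalization tables: hypothetical stand-ins for a real analyzer.
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}
STOPWORDS = {"the"}


def analyze(text):
    """Tokenize, lowercase, stem, map synonyms and drop common words."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        term = token.lower()
        term = STEMS.get(term, term)
        term = SYNONYMS.get(term, term)
        if term not in STOPWORDS:
            terms.append(term)
    return terms


def build_inverted_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


docs = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "Quick brown foxes leap over lazy dogs in summer.",
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

Running the same `analyze` function over a query string is what lets a search for "foxes" find a document that only contains "fox".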

  23. SEARCH EXAMPLE
    "Quick brown foxes in summer?"

  24. ELASTICSEARCH API
    GET /example/document/_search
    {
      "query": {
        "match": { "text": "Quick brown foxes in summer?" }
      }
    }

  25. QUERY IS ALSO NORMALIZED
    "quick", "brown", "fox", "in", "summer"

  26. MATCHING THE INVERTED INDEX
    Term        Doc #1  Doc #2
    quick       X       X
    brown       X       X
    fox         X       X
    in          -       X
    summer      -       X

  27. SEARCH RESULTS
    "hits": [
      {
        "_score": 0.16273327,
        "_id": "2",
        "_source": {
          "text": "Quick brown foxes leap over lazy dogs in summer."
        }
      },
      {
        "_score": 0.01273327,
        "_id": "1",
        "_source": {
          "text": "The quick brown fox jumped over the lazy dog."
        }
      }
    ]

  28. SEARCH RESULTS
    Document #2 is a better match
    Higher relevance score than #1
    Search results are sorted by relevance
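
Why document #2 wins can be illustrated with a deliberately simplified ranking: count how many normalized query terms each document contains. This is only a sketch. Lucene's actual relevance formula also weighs how often a term occurs and how rare it is, so the counts below are not the `_score` values shown earlier. The `rank` function is a hypothetical helper.

```python
# Inverted index restricted to the query terms (term -> ids of matching documents).
inverted_index = {
    "quick": {1, 2},
    "brown": {1, 2},
    "fox": {1, 2},
    "in": {2},
    "summer": {2},
}


def rank(query_terms, index):
    """Score each document by how many query terms it contains, best first."""
    scores = {}
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank(["quick", "brown", "fox", "in", "summer"], inverted_index)
print(ranking)
```

Document #2 matches all five query terms while document #1 matches three, so #2 is listed first.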

  29. AGGREGATIONS

  30. BUCKETS + METRICS

  31. BUCKETS
    A collection of documents that meet certain criteria

  32. GENDER SOMEONE IDENTIFIES AS
    Alice ⇒ female
    Josh ⇒ male
    Karen ⇒ non-binary

  33. CITIES FROM A STATE
    San Francisco ⇒ California
    Belo Horizonte ⇒ Minas Gerais

  34. DAYS FROM A MONTH
    2014-10-28 ⇒ October
    2014-11-15 ⇒ November

  35. METRICS
    Calculations on top of Buckets
    Ex: min, max, mean, sum

  36. AGGREGATION EXAMPLE
    partition citizens by state
    then by gender
    then by age ranges
    then calculate average salary for each bucket (metric)
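
The four steps above map onto nested aggregations: each bucket level carries its own `aggs` entry, and the metric sits at the innermost level. The following is a minimal sketch of such a request body as a Python dictionary. The field names (`state`, `gender`, `age`, `salary`), the aggregation names and the `salary_breakdown_body` helper are assumptions made for illustration; the deck does not show this query.

```python
import json


def salary_breakdown_body():
    """Build a nested aggregation: state -> gender -> age range -> average salary."""
    return {
        "aggs": {
            "by_state": {
                "terms": {"field": "state"},
                "aggs": {
                    "by_gender": {
                        "terms": {"field": "gender"},
                        "aggs": {
                            "by_age": {
                                "range": {
                                    "field": "age",
                                    "ranges": [
                                        {"to": 21},
                                        {"from": 21, "to": 50},
                                        {"from": 50},
                                    ],
                                },
                                "aggs": {
                                    "avg_salary": {"avg": {"field": "salary"}}
                                },
                            }
                        },
                    }
                },
            }
        }
    }


body = salary_breakdown_body()
print(json.dumps(body, indent=2))
```

The printed JSON is the kind of body that would be sent to a `_search` endpoint, in the same style as the car examples later in the deck.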

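  37. [Diagram: buckets nested by state (California, New York, Texas), gender (Male, Female, Non-Binary) and age range (age < 21, 21 < age < 50, age > 50), with an avg salary metric on a bucket, e.g. ~ $5000/month]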

  38. REAL-TIME ANALYTICS

  39. CAR TRANSACTIONS EXAMPLE
    GET /cars/transactions/AVFr1xbVmdUYWpF46Ps4
    {
      "price": 10000,
      "color": "red",
      "make": "honda",
      "sold": "2014-10-28"
    }

  40. BEST SELLING CAR COLOR
    GET /cars/transactions/_search?search_type=count
    {
      "aggs": {
        "colors": {
          "terms": { "field": "color" }
        }
      }
    }

  41. BEST SELLING CAR COLOR
    {
      "colors": {
        "buckets": [
          { "key": "red", "doc_count": 16 },
          { "key": "blue", "doc_count": 8 },
          { "key": "green", "doc_count": 8 }
        ]
      }
    }

  42. AVERAGE CAR COLOR PRICE
    GET /cars/transactions/_search?search_type=count
    {
      "aggs": {
        "colors": {
          "terms": { "field": "color" },
          "aggs": {
            "avg_price": {
              "avg": { "field": "price" }
            }
          }
        }
      }
    }

  43. AVERAGE CAR COLOR PRICE
    {
      "colors": {
        "buckets": [
          { "key": "red", "doc_count": 16, "avg_price": { "value": 32500.0 } },
          { "key": "blue", "doc_count": 8, "avg_price": { "value": 20000.0 } },
          { "key": "green", "doc_count": 8, "avg_price": { "value": 21000.0 } }
        ]
      }
    }

  44. CAR SALES REVENUE HISTOGRAM
    GET /cars/transactions/_search?search_type=count
    {
      "aggs": {
        "price": {
          "histogram": {
            "field": "price",
            "interval": 20000
          },
          "aggs": {
            "revenue": {
              "sum": { "field": "price" }
            }
          }
        }
      }
    }

  45. CAR SALES REVENUE HISTOGRAM
    {
      "price": {
        "buckets": [
          { "key": 0, "doc_count": 12, "revenue": { "value": 148000.0 } },
          { "key": 20000, "doc_count": 16, "revenue": { "value": 380000.0 } },
          { "key": 40000, "doc_count": 0, "revenue": { "value": 0.0 } },
          { "key": 60000, "doc_count": 0, "revenue": { "value": 0.0 } },
          { "key": 80000, "doc_count": 4, "revenue": { "value": 320000.0 } }
        ]
      }
    }
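
How a price ends up in a histogram bucket can be sketched in a few lines of Python: the bucket key is the price rounded down to a multiple of the interval, and the `revenue` metric is the sum of prices inside each bucket. The prices below are hypothetical and are not the data set behind the slides; `revenue_histogram` is an illustrative helper, not part of any Elasticsearch client.

```python
import math
from collections import defaultdict


def revenue_histogram(prices, interval):
    """Bucket prices by interval, returning (key, doc_count, revenue) rows."""
    counts = defaultdict(int)
    revenue = defaultdict(float)
    for price in prices:
        key = math.floor(price / interval) * interval  # lower bound of the bucket
        counts[key] += 1
        revenue[key] += price
    if not counts:
        return []
    # Emit empty buckets between the smallest and largest key as well.
    keys = range(min(counts), max(counts) + interval, interval)
    return [(key, counts[key], revenue[key]) for key in keys]


# Hypothetical prices, not the data set behind the slides.
rows = revenue_histogram([10000, 12000, 25000, 30000, 80000], interval=20000)
for row in rows:
    print(row)
```

As in the response above, buckets with no sales (40000 and 60000 here) still appear with a count of zero, which keeps the chart's x axis continuous.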

  46. CAR SALES REVENUE HISTOGRAM

  47. TIME-SERIES DATA
    Any data with a timestamp
    Ex: server logs, sales history, stock prices

  48. HOW MANY CARS SOLD PER MONTH?
    GET /cars/transactions/_search?search_type=count
    {
      "aggs": {
        "sales": {
          "date_histogram": {
            "field": "sold",
            "interval": "month",
            "format": "yyyy-MM-dd"
          }
        }
      }
    }

  49. HOW MANY CARS SOLD PER MONTH?
    {
      "sales": {
        "buckets": [
          { "key_as_string": "2014-01-01", "doc_count": 4 },
          { "key_as_string": "2014-02-01", "doc_count": 4 },
          { "key_as_string": "2014-03-01", "doc_count": 0 },
          { "key_as_string": "2014-04-01", "doc_count": 0 },
          { "key_as_string": "2014-05-01", "doc_count": 4 },
          { "key_as_string": "2014-06-01", "doc_count": 0 },
          { "key_as_string": "2014-07-01", "doc_count": 4 },
          { "key_as_string": "2014-08-01", "doc_count": 4 },
          { "key_as_string": "2014-09-01", "doc_count": 0 },
          { "key_as_string": "2014-10-01", "doc_count": 4 },
          { "key_as_string": "2014-11-01", "doc_count": 8 }
        ]
      }
    }
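
The monthly bucketing can be mimicked in plain Python by truncating each sale date to the first day of its month and counting. This is a sketch with hypothetical dates, and `sales_per_month` is an illustrative name. Unlike the response above, it does not emit empty months.

```python
from collections import Counter
from datetime import date


def sales_per_month(sold_dates):
    """Count sales per calendar month, keyed by the first day of the month."""
    counts = Counter()
    for sold in sold_dates:
        day = date.fromisoformat(sold)
        counts[day.replace(day=1).isoformat()] += 1
    return dict(sorted(counts.items()))


# Hypothetical sale dates in the same yyyy-MM-dd format as the "sold" field.
monthly = sales_per_month(["2014-10-28", "2014-11-05", "2014-11-15", "2014-10-02"])
print(monthly)
```

The keys have the same shape as the `key_as_string` values in the response, which is why every bucket there is labelled with the first day of a month.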

  50. HOW MANY CARS SOLD PER MONTH?

  51. COMMON SETUP - ELK
    Logstash ⇒ Elasticsearch ⇒ Kibana

  52. LOGSTASH
    Collect and stream logs into Elasticsearch

  53. KIBANA
    An analytics dashboard for Elasticsearch

  55. MOVIE RECOMMENDATIONS

  56. THE MOVIELENS DATA SETS
    Movies Catalog
    User Movie Recommendations (0 to 5)
    User Movie Tags

  57. THE MOVIELENS 10M DATASET
    10 million ratings
    10,000 movies
    72,000 users
    Released in 2009

  58. MOVIES.DAT
    MovieID::Title::Genres

  59. MOVIES.DAT
    1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
    2::Jumanji (1995)::Adventure|Children|Fantasy
    3::Grumpier Old Men (1995)::Comedy|Romance
    4::Waiting to Exhale (1995)::Comedy|Drama|Romance
    5::Father of the Bride Part II (1995)::Comedy
    6::Heat (1995)::Action|Crime|Thriller
    7::Sabrina (1995)::Comedy|Romance
    8::Tom and Huck (1995)::Adventure|Children
    9::Sudden Death (1995)::Action
    10::GoldenEye (1995)::Action|Adventure|Thriller
    (...)

  60. RATINGS.DAT
    UserID::MovieID::Rating::Timestamp

  61. RATINGS.DAT
    2::110::5::868245777
    2::151::3::868246450
    2::260::5::868244562
    2::376::3::868245920
    2::539::3::868246262
    2::590::5::868245608
    2::648::2::868244699
    2::719::3::868246191
    2::733::3::868244562
    2::736::3::868244698
    (...)

  62. LOADING DATA SETS
    Python script to load the data sets into Elasticsearch
    Using the new elasticsearch-dsl library
    One line at a time...

  63. LOADING DATA SETS
    ...it took too long to load the 10M data set
    Improved time by using the bulk API

  64. THE MOVIE TYPE
    GET /movielens/movie/2
    {
      "name": "Jumanji (1995)"
    }

  65. THE USER RATING TYPE
    GET /movielens/ratings/AVFr1xbVmdUYWpF46Ps4
    {
      "user": 15,
      "recommended_movies": [122, 185, 231, 292, 316, 329]
    }
    Recommended movies have score >= 4
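
The loading step could look roughly like the sketch below, which parses `UserID::MovieID::Rating::Timestamp` lines, keeps the movies each user rated 4 or higher, and produces one dictionary per user. The dictionaries follow the action shape (`_index`, `_type`, `_source`) accepted by the `bulk` helper in the `elasticsearch.helpers` module of the Python client, but nothing is sent to a cluster here. The `bulk_actions` function and its parameters are assumptions for illustration; the speaker's actual script is not shown in the deck.

```python
from collections import defaultdict


def bulk_actions(lines, min_rating=4.0, index="movielens", doc_type="ratings"):
    """Turn ratings.dat lines into one bulk action per user."""
    liked = defaultdict(list)
    for line in lines:
        user_id, movie_id, rating, _timestamp = line.strip().split("::")
        if float(rating) >= min_rating:
            liked[int(user_id)].append(int(movie_id))
    for user_id, movies in liked.items():
        yield {
            "_index": index,
            "_type": doc_type,  # mapping types existed in Elasticsearch versions of that era
            "_source": {"user": user_id, "recommended_movies": movies},
        }


sample = [
    "2::110::5::868245777",
    "2::151::3::868246450",
    "2::260::5::868244562",
    "2::648::2::868244699",
]
actions = list(bulk_actions(sample))
print(actions)
```

Sending many such actions in one request, rather than indexing one document per call, is the kind of change the bulk API slide describes.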

  66. MOVIE RECOMMENDATIONS
    Given Talladega Nights, starring Will Ferrell
    We want to find comedies in a similar style

  67. RECOMMENDING BASED ON POPULARITY
    Find all users who recommended a movie
    Aggregate their recommendations
    Take the top five most-popular
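
The three steps above can be simulated on an in-memory mapping before looking at the Elasticsearch query. This is a sketch with made-up data: the `recommendations` dictionary and the movie ids other than 46970 are hypothetical, and `most_popular_among_fans` is an illustrative helper.

```python
from collections import Counter


def most_popular_among_fans(recommendations, movie_id, size=5):
    """Rank movies by how many fans of movie_id also recommended them."""
    counts = Counter()
    for movies in recommendations.values():
        if movie_id in movies:
            counts.update(m for m in movies if m != movie_id)
    return counts.most_common(size)


# Hypothetical user -> recommended movie ids (46970 is Talladega Nights on the slides).
recommendations = {
    1: [46970, 2571, 318],
    2: [46970, 2571],
    3: [318, 296],
    4: [46970, 2571, 318, 296],
}
top = most_popular_among_fans(recommendations, 46970, size=2)
print(top)
```

The helper skips the seed movie itself. A terms aggregation does not, which is presumably why the query on the following slides asks for six buckets to obtain five recommendations.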

  68. Find the Talladega Nights ID:
    GET /movielens/movie/_search
    {
      "query": {
        "match": { "title": "Talladega Nights" }
      }
    }

  69. Find the Talladega Nights ID:
    {
      "hits": [
        {
          "_id": "46970",
          "_source": {
            "title": "Talladega Nights: The Ballad of Ricky Bobby (2006)"
          }
        }
      ]
    }

  70. Find the most popular movies from people who also like Talladega Nights:
    GET /movielens/ratings/_search?search_type=count
    {
      "query": {
        "filtered": {
          "filter": { "term": { "movie": 46970 } }
        }
      },
      "aggs": {
        "most_popular": {
          "terms": { "field": "movie", "size": 6 }
        }
      }
    }

  71. After correlating the ids to the titles, we got:
    1. Matrix, The
    2. Shawshank Redemption
    3. Pulp Fiction
    4. Fight Club
    5. Star Wars Episode IV: A New Hope

  72. Very good list!
    But almost everyone likes them!
    These are universally well-liked movies.

  73. Finding the most popular movies of all time:
    GET /movielens/ratings/_search?search_type=count
    {
      "aggs": {
        "most_popular": {
          "terms": { "field": "movie", "size": 5 }
        }
      }
    }

  74. After correlating the ids to the titles, we got:
    1. Shawshank Redemption
    2. Silence of the Lambs, The
    3. Pulp Fiction
    4. Forrest Gump
    5. Star Wars Episode IV: A New Hope

  75. SIGNIFICANT_TERMS
    Aggregation based on statistics
    Finds uncommonly common terms in a data set
    i.e., statistical anomalies

  76. FOREGROUND GROUP
    Popular movies among people who enjoy Talladega Nights

  77. BACKGROUND GROUP
    Most popular movies among the entire user base

  78. Using the significant_terms aggregation:
    GET /movielens/ratings/_search?search_type=count
    {
      "query": {
        "filtered": {
          "filter": { "term": { "movie": 46970 } }
        }
      },
      "aggs": {
        "most_significant": {
          "significant_terms": { "field": "movie", "size": 6 }
        }
      }
    }

  79. Returned movies:
    {
      "aggregations": {
        "most_significant": {
          "doc_count": 271,
          "buckets": [
            { "key": 46970, "doc_count": 271, "bg_count": 271 },
            { "key": 55245, "doc_count": 59, "bg_count": 185 },
            { "key": 8641, "doc_count": 107, "bg_count": 762 },
            { "key": 58156, "doc_count": 17, "bg_count": 28 },
            { "key": 52973, "doc_count": 95, "bg_count": 857 },
            { "key": 35836, "doc_count": 128, "bg_count": 1610 }
          ]
        }
      }
    }
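
To see why these buckets rank as they do, one can compare each movie's share of the foreground set (`doc_count` out of 271) with its share of the background set (`bg_count` out of all documents). The sketch below uses the JLH heuristic, which is the default scoring for `significant_terms`: absolute change in frequency multiplied by relative change. The deck does not show the total number of documents, so `BG_TOTAL = 72000` is an assumption borrowed from the user count on the dataset slide, and the scores are illustrative only.

```python
def jlh_score(doc_count, fg_total, bg_count, bg_total):
    """JLH heuristic: absolute change times relative change in frequency."""
    fg_rate = doc_count / fg_total
    bg_rate = bg_count / bg_total
    if bg_rate == 0 or fg_rate <= bg_rate:
        return 0.0
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


FG_TOTAL = 271     # top-level doc_count in the response above
BG_TOTAL = 72000   # assumption: roughly one document per MovieLens 10M user

# (movie id, doc_count, bg_count) copied from the returned buckets.
buckets = [
    (46970, 271, 271),
    (55245, 59, 185),
    (8641, 107, 762),
    (58156, 17, 28),
    (52973, 95, 857),
    (35836, 128, 1610),
]
scored = sorted(
    ((jlh_score(fg, FG_TOTAL, bg, BG_TOTAL), key) for key, fg, bg in buckets),
    reverse=True,
)
for score, key in scored:
    print(key, round(score, 2))
```

Under these assumptions the computed order matches the order of the returned buckets. Movie 35836 has the second-highest foreground count yet ranks last, because it is also widely recommended across the whole user base.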

  80. After correlating the ids to the titles, we got:
    1. Blades of Glory
    2. Anchorman: The Legend of Ron Burgundy
    3. Semi-Pro
    4. Knocked Up
    5. 40-Year-Old Virgin, The

  81. THANK YOU!
    @felipead → GitHub, Twitter, MixCloud
