
Learning to Rank 101, Bringing personalisation to data discovery

Pere Urbón
December 06, 2017


Transcript

  1. Learning To Rank 101
    Pere Urbon Bayes — Data Wrangler
    www.springernature.com
    www.purbon.com

  2. About me
    Pere Urbon - Bayes (Berliner since 2011)
    Software Architect and Data Engineer
    All about systems, data and teams
    Open Source Advocate and Contributor

  3. All will be available from
    ● github.com/purbon/learning_to_rank_101
    ● speakerdeck.com/purbon

  4. Building a new search functionality

  5. Building Search
    A search engine is an information retrieval
    system designed to help find information stored
    on a computer system.
    wikipedia.org/wiki/Search_engine_(computing)

  6. Building Search
    When search works, it can feel almost
    magical: you simply type in what you’re looking
    for and it’s served up in mere milliseconds. It’s
    fast, convenient, and super efficient – no
    wonder so many users prefer search over
    clicking around the site’s categories!
    www.baymard.com

  7. Search, how does this work?
    A set of documents D = {d_1, d_2, ..., d_N} is indexed by the IR
    system; a query q returns a ranked list of documents
    d_{q,1}, d_{q,2}, ..., d_{q,n}.
    Ranking is based on relevance: TF-IDF, BM25.

  8. Building search
    The phases of building a search engine:
    ● Tokenization
    ○ synonyms (filter)
    ○ stop words (filter)
    ○ whitespace
    ○ ngram
    ● Analyzer
    ○ languages
    ○ keywords
    ○ standard
    ● Normalization
    These phases apply both at indexing time and at query time.
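
    The indexing-time phases above can be sketched in a few lines of
    Python; the synonym and stop-word tables here are hypothetical,
    for illustration only:

```python
import re

# Hypothetical synonym and stop-word tables, for illustration only.
SYNONYMS = {"colour": "color"}
STOP_WORDS = {"the", "a", "an", "of", "and"}

def analyze(text):
    """Index-time analysis chain: lowercase, whitespace tokenization,
    then synonym and stop-word filters."""
    tokens = re.split(r"\s+", text.lower().strip())
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(analyze("The Colour of Magic"))  # ['color', 'magic']
```

    The same chain (or a compatible one) must run on queries, so that
    query terms match the terms stored in the index.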

  9. TF-IDF
    Term Frequency - Inverse Document Frequency
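
    A minimal sketch of the classic TF-IDF weight,
    tf(t, d) · log(N / df(t)), over a toy corpus (production engines
    use smoothed variants of both factors):

```python
import math

def tf_idf(term, doc, corpus):
    """Score one (term, document) pair: raw term frequency times the
    log of corpus size over the number of documents containing the term."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

docs = [["rank", "learn", "rank"],
        ["search", "engine"],
        ["rank", "search"]]
print(round(tf_idf("rank", docs[0], docs), 3))  # 0.811
```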

  10. Okapi BM25
    Okapi Best Matching 25 (BM25)
    Others: PageRank, Learning to Rank, ...
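
    A single-term BM25 sketch using the common k1 and b defaults; a
    real engine sums this score over all query terms:

```python
import math

def bm25(term, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 for a single term: saturated term frequency,
    normalised by document length relative to the corpus average,
    weighted by a smoothed IDF."""
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
    tf = doc.count(term)
    avgdl = sum(len(d) for d in corpus) / n
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))

docs = [["rank", "learn", "rank"],
        ["search", "engine"],
        ["rank", "search"]]
```

    Unlike raw TF-IDF, the k1 term makes repeated occurrences saturate,
    and b penalises matches in longer-than-average documents.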

  11. The second line of defence
    ● Tags and Ontologies.
    ● Natural Language Processing.
    ● Result click tracking.
    ● Genetic and evolutionary methods to optimize boosting and weights.
    ● Build your own scorer
    ● ...
    Scary and Complex!!!

  12. Building great search (can be an art)

  13. Learning to Rank

  14. Learning to Rank
    The use of machine learning (supervised, semi-supervised, ...) to
    improve the construction of ranking models for information retrieval.
    Common applications are search engines, collaborative filtering,
    machine translation, computational biology, etc.
    The idea was introduced in 1992 by Norbert Fuhr, who described
    learning in information retrieval as a parameter estimation problem.

  15. Learning to Rank, how does this work?
    Training: judged document lists d_{1,1}, d_{1,2}, ..., d_{1,n} for
    query q_1 through d_{m,1}, d_{m,2}, ..., d_{m,n} for query q_m feed
    a learning system, which produces a scoring function f(q, d).
    Query time: a new query q_{m+1} goes through the IR system over the
    documents D = {d_1, d_2, ..., d_N}, and each candidate document is
    scored with f(q_{m+1}, d_i), yielding the ranked list
    d_{q,1}, d_{q,2}, ..., d_{q,n}.
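
    Once a scoring function f(q, d) is learned, query-time re-ranking
    is just a sort by score. A sketch with hypothetical feature vectors
    and a stand-in linear model:

```python
def rerank(candidates, f):
    """Order candidate documents for a query by a learned scoring
    function f(features) -> relevance, highest first."""
    return sorted(candidates, key=lambda item: f(item[1]), reverse=True)

# Hypothetical feature vectors per document: [bm25_score, title_match].
candidates = [("d1", [0.2, 0.0]), ("d2", [0.9, 1.0]), ("d3", [0.5, 1.0])]
f = lambda x: 0.7 * x[0] + 0.3 * x[1]  # stand-in for a learned model
print([doc for doc, _ in rerank(candidates, f)])  # ['d2', 'd3', 'd1']
```

    In practice the base retrieval (e.g. BM25) selects a small candidate
    set and the learned model only re-orders those top results.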

  16. Learning to Rank
    Algorithms can be divided into three different groups:
    ● Pointwise: if we assume that each (query, document) pair gets a
    score, the problem can be approximated by a regression.
    ● Pairwise: the problem is treated as a classification problem,
    learning how to correctly order each given pair of documents.
    ● Listwise: tries to directly optimize the quality of whole ranked
    lists, averaged over all queries.
    Order of quality: Listwise > Pairwise > Pointwise.
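
    The pairwise idea can be sketched as a perceptron on feature
    differences: whenever the model orders a training pair wrongly, the
    weights are nudged toward the difference vector. The features and
    training pairs below are hypothetical:

```python
def pairwise_perceptron(pairs, n_features, epochs=20, lr=0.1):
    """Learn weights w so that w.x_preferred > w.x_other for every
    training pair; each misordered pair nudges w toward the feature
    difference (a perceptron on pair differences)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for preferred, other in pairs:
            diff = [p - o for p, o in zip(preferred, other)]
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# Hypothetical per-document features: [bm25, click_rate];
# in each pair the first document is the preferred one.
pairs = [([0.9, 0.8], [0.2, 0.1]), ([0.7, 0.9], [0.4, 0.2])]
w = pairwise_perceptron(pairs, 2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
```

    Real pairwise methods such as RankNet replace the hard perceptron
    update with a differentiable loss on the pair score difference.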

  17. Learning to Rank
    Most popular algorithms are:
    ● RankNet, LambdaRank and LambdaMART by Chris J.C. Burges et al.
    www.microsoft.com/en-us/research/publication/ranking-boosting-and-model-adaptation/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F69536%2Ftr-2008-109.pdf
    ● RankSVM and other gradient descent variants.

  18. Not only for the big companies.

  19. References
    Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul
    Lamere. The Million Song Dataset. In Proceedings of the 12th
    International Society for Music Information Retrieval Conference
    (ISMIR 2011), 2011.
    Million Song Dataset, official website by Thierry Bertin-Mahieux,
    available at: http://labrosa.ee.columbia.edu/millionsong/
    Tie-Yan Liu (2009), "Learning to Rank for Information Retrieval",
    Foundations and Trends in Information Retrieval, 3 (3): 225–331,
    doi:10.1561/1500000016, ISBN 978-1-60198-244-5.

  20. Demo Time...

  21. Thanks! Questions?
    Pere Urbon Bayes — Data Wrangler
    www.springernature.com
    www.purbon.com
