The Ghost in the Search Machine

Elastic Co
February 17, 2016

Elasticsearch is much more than a full-text search engine – it can be ‘taught’ to detect at-risk students, predict the weather, and find similar images at high scale. Attend this session to learn how you can turn Elasticsearch into a smart comparison powerhouse for your organization.

Transcript

  1. Why Search
     • What does a dedicated search engine do that a database doesn’t?
     • Why not [MySQL | MongoDB | Cassandra | etc.]?
  2. Strawman: Why Not MySQL?
     • Our mission: Find all the “Darth Vader” in SciFi StackExchange Posts!
     OpenSource Connections
     P  U  C  V  Body
     0  1  2  1  <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>  (Found!)
     1  2  2  5  <p>Been meaning to read the Foundation Series, what should I read first?</p>  (Missing!)
  3. Why not MySQL – SQL Like?
     • SQL “LIKE” operator – scan all rows for a specific wildcard match
         SELECT * FROM posts WHERE body LIKE "%darth vader%"
     Match? Match? Match? Match?
     Performs Table Scan
     Approx 300ms to search a measly 20K docs! (what if we had 20 Million?)
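The table scan on this slide can be imitated in a few lines. This is only a sketch of the idea, not how MySQL executes LIKE: the `posts` list reuses the two post bodies from the previous slide, and `like_scan` is a hypothetical helper that assumes a case-insensitive substring match.

```python
# Toy illustration of a LIKE-style scan: every row is tested, one by one.
posts = [
    "<p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>",
    "<p>Been meaning to read the Foundation Series, what should I read first?</p>",
]


def like_scan(rows, needle):
    """Return the indexes of rows whose text contains `needle` (case ignored)."""
    needle = needle.lower()
    matches = []
    for row_id, body in enumerate(rows):  # visits every row: cost grows with row count
        if needle in body.lower():
            matches.append(row_id)
    return matches


print(like_scan(posts, "darth vader"))
```

The loop touches every row no matter how few of them match, which is why the cost grows with the size of the table.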
  4. SQL Like | CTRL+F | grep is
     1. Extremely Slow -- scan! yuck!
     2. Not fuzzy -- Needs exact literal matches, no fuzziness!
     3. Unranked -- Simply says y/n whether there is a match. But what's relevant?
  5. Elasticsearch
     We've all done Elasticsearch 101:
         curl -XPUT http://localhost:9200/stackexchange
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "<p>Darth Vader dined with Luke</p>", "Title": "..."}'
  6. But what is being built? The answer can be found in your textbook…
     Book Index:
     • Topics -> page no
     • Very efficient tool – compare to scanning the whole book!
     Lucene uses an index:
     • Tokens => document ids:
         laser => [2, 4]
         light => [2, 5]
         lightsaber => [0, 1, 5, 7]
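The token-to-document-ids mapping above can be sketched with a plain dictionary. This is a minimal illustration, not Lucene's on-disk structure; the `docs` corpus and the `build_index` helper are made up for the example, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical mini-corpus; doc ids and texts are invented for illustration.
docs = {
    0: "lightsaber duel",
    1: "red lightsaber",
    2: "laser light show",
}


def build_index(documents):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


index = build_index(docs)
print(index["lightsaber"])  # a single dictionary lookup, no scan over documents
```

Like the index at the back of a book, finding a token is a lookup by key rather than a pass over every document.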
  7. Computers == Dumb
     • Humans are smart
         o I see “cat” or “cats” in the back of a book, no duh – jump to page 9
     • Computers are dumb
         o “CAT” != “cat” – “cat” != “cats” – no match returned
  8. Normalization aka Text Analysis
         curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
         {
             "tokens": [
                 {
                     "end_offset": 5,
                     "position": 1,
                     "start_offset": 0,
                     "token": "darth",
                     "type": "<ALPHANUM>"
                 },
                 {
                     "end_offset": 11,
                     "position": 2,
                     "start_offset": 6,
                     "token": "vader",
                     "type": "<ALPHANUM>"
                 },
                 {
                     "end_offset": 17,
                     "position": 3,
                     "start_offset": 12,
                     "token": "dine",
                     "type": "<ALPHANUM>"
                 },
                 {
                     "end_offset": 27,
                     "position": 5,
                     "start_offset": 23,
                     "token": "luke",
                     "type": "<ALPHANUM>"
                 }
             ]
         }
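The analysis step can be approximated in a few lines to show what normalization buys. This is a toy sketch, not the Snowball algorithm: the stopword list and the two suffix rules are assumptions chosen only so that the slide's sentence yields the same four tokens.

```python
import re

STOPWORDS = {"with", "the", "a", "and"}   # assumed tiny stopword list
SUFFIX_RULES = [("ed", "e"), ("s", "")]   # toy rules, NOT the Snowball stemmer


def analyze(text):
    """Lowercase, split into alphanumeric tokens, drop stopwords, trim suffixes."""
    tokens = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOPWORDS:
            continue
        for suffix, replacement in SUFFIX_RULES:
            long_enough = len(word) > len(suffix) + 1
            if word.endswith(suffix) and not word.endswith("ss") and long_enough:
                word = word[: -len(suffix)] + replacement
                break
        tokens.append(word)
    return tokens


print(analyze("Darth Vader dined with Luke"))
print(analyze("CAT"), analyze("cats"))
```

Because the same function runs at index time and at query time, “CAT”, “cat” and “cats” all end up as the same token and can match each other.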
  9. What is being built?
         field Body
           term darth
             doc 1 <metadata>
             doc 2 <metadata>
           term vader
             doc 1 <metadata>
           term dine
             doc 1 <metadata>
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "<p>Darth Vader dined with Luke</p>", "Title": "..."}'
         curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{ "Body": "<p>We love Darth</p>", "Title": "..."}'
  10. Ranking
         field Body
           term darth
             doc 1 <metadata>
             doc 2 <metadata>
           term vader
             doc 1 <metadata>
           term dine
             doc 1 <metadata>
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "<p>Darth Vader dined with Luke</p>", "Title": "..."}'
         curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{ "Body": "<p>We love Darth</p>", "Title": "..."}'
     Can we store anything here to help decide how relevant this term is for this doc? Yes!
     - Term Frequency
         - How much “darth” is in this doc?
     - Position within document
         - Helps when we search for the phrase “darth vader”
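One way to picture the per-document metadata is a postings structure that records where each term occurs. The sketch below is an illustration under assumptions, not Lucene's format: the token lists are written out by hand for the two example posts, positions are counted from 0, and `build_postings`, `term_frequency` and `phrase_match` are hypothetical helpers.

```python
from collections import defaultdict

# Hand-analyzed tokens for the two example posts (an assumption for illustration).
analyzed_docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}


def build_postings(docs):
    """term -> {doc id -> list of positions where the term occurs}."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            postings[token].setdefault(doc_id, []).append(position)
    return postings


def term_frequency(postings, term, doc_id):
    """How many times `term` occurs in the document."""
    return len(postings.get(term, {}).get(doc_id, []))


def phrase_match(postings, first, second, doc_id):
    """True if `second` occurs immediately after `first` in the document."""
    first_positions = postings.get(first, {}).get(doc_id, [])
    second_positions = set(postings.get(second, {}).get(doc_id, []))
    return any(p + 1 in second_positions for p in first_positions)


postings = build_postings(analyzed_docs)
print(term_frequency(postings, "darth", 1))
print(phrase_match(postings, "darth", "vader", 1))
print(phrase_match(postings, "darth", "vader", 2))
```

The term frequency falls out of the position list's length, and adjacent positions are what make a phrase query such as “darth vader” answerable without rereading the document.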
  11. Query Documents
     • When did Darth Vader and Luke have dinner?
         curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
         {
             "query": {
                 "match": {
                     "Body": "luke darth dinner"
                 }
             }
         }'
     User Query: luke darth dinner
  12. What happens when we query?
     Query from client: luke darth dinner
         field Body
           term darth
             doc 1 <metadata>
             doc 2 <metadata>
           term vader
             doc 1 <metadata>
           term dine
             doc 1 <metadata>
     How to consult index for matches?
     • Analysis: [luke] [darth] [dine]
     • Look up [darth], [dine], ... in the index
     • Score for [darth] docs (1 and 2)
     • Score for [dine] docs (1)
     • Return sorted docs to the client
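The query flow on this slide can be sketched end to end. This is a simplified illustration, not Lucene's scoring formula: the score is just the sum of term frequencies, the `index` dictionary copies the three terms shown on the slide with an assumed frequency of 1 each, and the query terms are taken as already analyzed.

```python
from collections import defaultdict

# Index from the slide: term -> {doc id: term frequency} (frequencies assumed).
index = {
    "darth": {1: 1, 2: 1},
    "vader": {1: 1},
    "dine": {1: 1},
}


def search(index, query_terms):
    """Sum a toy per-term score for each matching doc and sort best-first."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():  # unknown terms add nothing
            scores[doc_id] += tf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = search(index, ["luke", "darth", "dine"])
print(results)
```

Doc 1 matches both [darth] and [dine] and so ranks above doc 2, which only matches [darth]; only the postings for the query terms are ever read.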