The Ghost in the Search Machine

Elastic Co
February 17, 2016

Elasticsearch is much more than a full-text search engine – it can be ‘taught’ to detect at-risk students, predict the weather, and find similar images at high scale. Attend this session to learn how you can turn Elasticsearch into a smart comparison powerhouse for your organization.

Transcript

  1. Why Search
     •  What does a dedicated search engine do that a database doesn’t?
     •  Why not [MySQL | MongoDB | Cassandra | etc.]?
  2. Strawman: Why Not MySQL?
     •  Our mission: Find all the “Darth Vader” in SciFi StackExchange Posts!
         P  U  C  V  Body
         0  1  2  1  <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>   (Found!)
         1  2  2  5  <p>Been meaning to read the Foundation Series, what should I read first?</p>   (Missing!)
     OpenSource Connections
  3. Why not MySQL – SQL LIKE?
     •  SQL “LIKE” operator
         –  scan all rows for a specific wildcard match
         SELECT * FROM posts WHERE body LIKE "%darth vader%"
     Row by row: Match? Match? Match? Match?
     Performs Table Scan
     Approx 300ms to search a measly 20K docs! (what if we had 20 Million?)
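
The table scan on this slide can be sketched in a few lines of Python. This is only an illustration of why the cost grows with the amount of text, not how MySQL is implemented; the `like_scan` helper is hypothetical, the two rows are the posts from the previous slide, and the case-insensitive comparison assumes a case-insensitive collation.

```python
def like_scan(rows, needle):
    """Mimic LIKE '%needle%': test every row for a literal substring."""
    needle = needle.lower()  # assumption: case-insensitive collation
    matches = []
    for doc_id, body in rows:  # every row is visited, match or not
        if needle in body.lower():
            matches.append(doc_id)
    return matches


posts = [
    (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader "
        "before a New Hope started?</p>"),
    (1, "<p>Been meaning to read the Foundation Series, "
        "what should I read first?</p>"),
]

print(like_scan(posts, "darth vader"))  # [0]
print(like_scan(posts, "vader darth"))  # [] : word order must match exactly
```

The second call hints at the next slide's complaints: the match is literal, and the result is a plain yes/no per row with no notion of relevance.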
  4. SQL LIKE | CTRL+F | grep is
     1.  Extremely Slow -- scan! yuck!
     2.  Not fuzzy -- Needs exact literal matches, no fuzziness!
     3.  Unranked -- Simply says y/n whether there is a match. But what's relevant?
  5. Elasticsearch
     We've all done Elasticsearch 101:
         curl -XPUT http://localhost:9200/stackexchange
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
           "Body": "<p>Darth Vader dined with Luke</p>",
           "Title": "..."}'
  6. But what is being built? The answer can be found in your textbook…
     Book Index:
     •  Topics -> page no
     •  Very efficient tool – compare to scanning the whole book!
     Lucene uses an index:
     •  Tokens => document ids:
         laser => [2, 4]
         light => [2, 5]
         lightsaber => [0, 1, 5, 7]
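
As a rough illustration of the token => document ids mapping, here is a minimal inverted index in Python. It is a sketch of the idea, not Lucene's on-disk format. The `build_index` helper and the document texts are hypothetical, chosen only so that the resulting postings match the ones listed on the slide.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a token repeated within one document is listed once
        for token in set(docs[doc_id].split()):
            index[token].append(doc_id)
    return dict(index)


# Hypothetical documents, already lowercased and whitespace-separated.
docs = {
    0: "lightsaber duel",
    1: "lightsaber",
    2: "laser light",
    4: "laser",
    5: "light lightsaber",
    7: "lightsaber",
}

index = build_index(docs)
print(index["laser"])       # [2, 4]
print(index["light"])       # [2, 5]
print(index["lightsaber"])  # [0, 1, 5, 7]
```

Looking up a token is now a single dictionary access instead of a scan over every document, which is the same trade a book index makes.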
  7. Computers == Dumb
     •  Humans are smart
         o  I see “cat” or “cats” in the back of a book, no duh – jump to page 9
     •  Computers are dumb
         o  “CAT” != “cat” – “cat” != “cats” – no match returned
  8. Normalization aka Text Analysis
         curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
         {
           "tokens": [
             { "end_offset": 5,  "position": 1, "start_offset": 0,  "token": "darth", "type": "<ALPHANUM>" },
             { "end_offset": 11, "position": 2, "start_offset": 6,  "token": "vader", "type": "<ALPHANUM>" },
             { "end_offset": 17, "position": 3, "start_offset": 12, "token": "dine",  "type": "<ALPHANUM>" },
             { "end_offset": 27, "position": 5, "start_offset": 23, "token": "luke",  "type": "<ALPHANUM>" }
           ]
         }
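
The analysis steps visible in that response (lowercasing, dropping “with”, reducing “dined” to “dine”, keeping offsets and positions) can be imitated with a toy analyzer. This sketch is not the Snowball analyzer: `STOPWORDS` and `STEMS` are tiny hypothetical lookup tables standing in for a real stopword list and stemmer, and `analyze` is an illustrative name.

```python
import re

STOPWORDS = {"with", "the", "a", "and"}   # hypothetical, far from complete
STEMS = {"dined": "dine", "cats": "cat"}  # hypothetical stand-in for a stemmer


def analyze(text):
    """Tokenize, lowercase, drop stopwords, and apply the toy stem table."""
    tokens = []
    words = re.finditer(r"[A-Za-z0-9]+", text)
    for position, match in enumerate(words, start=1):
        word = match.group().lower()
        if word in STOPWORDS:
            continue  # the stopword is dropped but still uses up a position
        tokens.append({
            "token": STEMS.get(word, word),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for tok in analyze("Darth Vader dined with Luke"):
    print(tok["position"], tok["token"], tok["start_offset"], tok["end_offset"])
```

With the same analyzer applied at index time and at query time, “CAT”, “cat” and “cats” all end up as the token `cat`, which is what lets them match.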
  9. What is being built?
     field Body
       term darth
         doc 1  <metadata>
         doc 2  <metadata>
       term vader
         doc 1  <metadata>
       term dine
         doc 1  <metadata>
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
           "Body": "<p>Darth Vader dined with Luke</p>",
           "Title": "..."}'
         curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
           "Body": "<p>We love Darth</p>",
           "Title": "..."}'
  10. Ranking
     field Body
       term darth
         doc 1  <metadata>
         doc 2  <metadata>
       term vader
         doc 1  <metadata>
       term dine
         doc 1  <metadata>
         curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
           "Body": "<p>Darth Vader dined with Luke</p>",
           "Title": "..."}'
         curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
           "Body": "<p>We love Darth</p>",
           "Title": "..."}'
     Can we store anything here to help decide how relevant this term is for this doc? Yes!
     -  Term Frequency
         -  How much “darth” is in this doc?
     -  Position within document
         -  Helps when we search for the phrase “darth vader”
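
One way to picture the per-document `<metadata>` is a list of positions for each (term, doc) pair: its length gives the term frequency, and the positions themselves support phrase matching. The sketch below assumes the two documents have already been analyzed into tokens and ignores the position gap a removed stopword would leave; `build_postings` and `phrase_docs` are hypothetical helpers, not Lucene APIs.

```python
from collections import defaultdict


def build_postings(docs):
    """term -> {doc_id: [positions]}; term frequency is len(positions)."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            postings[term].setdefault(doc_id, []).append(pos)
    return postings


def phrase_docs(postings, first, second):
    """Ids of docs where `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in postings.get(first, {}).items():
        following = set(postings.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)


# Assumed output of analysis for the two indexed posts.
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}
postings = build_postings(docs)

print(postings["darth"])                          # {1: [0], 2: [2]}
print(len(postings["darth"][1]))                  # 1
print(phrase_docs(postings, "darth", "vader"))    # [1]
```

Document 2 contains “darth” but not followed by “vader”, so only document 1 satisfies the phrase.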
  11. Query Documents
     •  When did Darth Vader and Luke have dinner?
     User Query:
         curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
         {
           "query": {
             "match": {
               "Body": "luke darth dinner"
             }
           }
         }'
  12. What happens when we query?
     Query: luke darth dinner
     Analysis: [luke] [darth] [dine]
     How to consult index for matches?
     field Body
       term darth
         doc 1  <metadata>
         doc 2  <metadata>
       term vader
         doc 1  <metadata>
       term dine
         doc 1  <metadata>
     [darth]: Score for [darth] docs (1 and 2)
     [dine]: Score for [dine] docs (1)
     ...
     Return sorted docs (to client)
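
The flow on this slide (analyze the query, look up each term, score the matching docs, return them sorted) can be sketched end to end. Everything here is a simplification under stated assumptions: `INDEX` mirrors the three terms drawn on the slide with a term frequency of 1 each, `STEMS` is a hypothetical lookup table that maps “dinner” to “dine” as the slide does, and the score is a plain sum of term frequencies, whereas Lucene's actual scoring formula is more involved.

```python
# term -> {doc_id: term frequency}, mirroring the index drawn on the slide
INDEX = {
    "darth": {1: 1, 2: 1},
    "vader": {1: 1},
    "dine": {1: 1},
}
STEMS = {"dinner": "dine", "dined": "dine"}  # hypothetical stemmer stand-in


def analyze(text):
    """Lowercase, split on whitespace, apply the toy stem table."""
    return [STEMS.get(word, word) for word in text.lower().split()]


def search(query):
    """Return (doc_id, score) pairs, best score first."""
    scores = {}
    for term in analyze(query):
        # terms missing from the index (here: "luke") contribute nothing
        for doc_id, tf in INDEX.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(analyze("luke darth dinner"))  # ['luke', 'darth', 'dine']
print(search("luke darth dinner"))   # [(1, 2), (2, 1)]
```

Document 1 matches both [darth] and [dine] and so ranks above document 2, which matches only [darth].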