Terms of endearment - the ElasticSearch Query DSL explained

Given at YAPC::EU 2011

Clinton Gormley

August 17, 2011

Transcript

  1. “Terms of Endearment” The ElasticSearch query language explained Clinton Gormley,

    YAPC::EU 2011 DRTECH @clintongormley
  2. We can search for: “DELETE QUERY”

  3. We can search for: “DELETE QUERY”

    and find: “deleteByQuery”
  4. but you can only find what is stored in the

    database
  5. Normalise values “deleteByQuery” 'delete' 'by' 'query' 'deletebyquery'

  6. Normalise values and search terms “deleteByQuery” “DELETE QUERY” 'delete' 'by'

    'query' 'deletebyquery'
  7. Normalise values and search terms “deleteByQuery” “DELETE QUERY” 'delete' 'by'

    'query' 'deletebyquery'
  8. Analyse values and search terms “deleteByQuery” “DELETE QUERY” 'delete' 'by'

    'query' 'deletebyquery'
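
The normalisation step on the slides above can be sketched in plain Perl. This is only an illustration: `analyse` is a hypothetical helper, not part of ElasticSearch or ElasticSearch.pm, and the splitting rules (non-alphanumerics plus camelCase boundaries, then lowercasing) are assumptions chosen to reproduce the terms shown on the slides.

```perl
use strict;
use warnings;

# Hypothetical analyser: split on non-alphanumerics, split camelCase words
# into parts, lowercase everything, and keep the joined word as a term too.
sub analyse {
    my ($text) = @_;
    my @terms;
    for my $word ( split /[^A-Za-z0-9]+/, $text ) {
        next unless length $word;
        my @parts = map { lc $_ } split /(?<=[a-z0-9])(?=[A-Z])/, $word;
        push @terms, @parts;
        push @terms, lc $word if @parts > 1;
    }
    return @terms;
}

# The stored value and the search string go through the same analysis,
# so the search terms can be looked up among the stored terms.
my %stored  = map { $_ => 1 } analyse('deleteByQuery');
my @missing = grep { !$stored{$_} } analyse('DELETE QUERY');

print join( ' ', analyse('deleteByQuery') ), "\n";
print @missing ? "no match\n" : "match\n";
```

Because both sides are reduced to the same lowercase terms, the search for “DELETE QUERY” finds the value “deleteByQuery”.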
  9. What is stored in ElasticSearch?

  10. Document:

    { tweet => "Perl is GREAT!", posted => "2011-08-15", user => { name => "Clinton Gormley", email => 'drtech@cpan.org', }, tags => ["perl","opinion"], posts => 2, }
  11. Fields:

    { tweet => "Perl is GREAT!", posted => "2011-08-15", user => { name => "Clinton Gormley", email => 'drtech@cpan.org', }, tags => ["perl","opinion"], posts => 2, }
  12. Values:

    { tweet => "Perl is GREAT!", posted => "2011-08-15", user => { name => "Clinton Gormley", email => 'drtech@cpan.org', }, tags => ["perl","opinion"], posts => 2, }
  13. Field types:

    {                                      # object
        tweet  => "Perl is GREAT!",        # string
        posted => "2011-08-15",            # date
        user   => {                        # nested object
            name  => "Clinton Gormley",    # string
            email => 'drtech@cpan.org',    # string
        },
        tags   => ["perl","opinion"],      # array of enums
        posts  => 2,                       # integer
    }
  14. Nested objects flattened:

    { tweet => "Perl is GREAT!", posted => "2011-08-15", user => { name => "Clinton Gormley", email => 'drtech@cpan.org', }, tags => ["perl","opinion"], posts => 2, }
  15. Nested objects flattened

    { tweet => "Perl is GREAT!", posted => "2011-08-15", user.name => "Clinton Gormley", user.email => 'drtech@cpan.org', tags => ["perl","opinion"], posts => 2, }
  16. Values analyzed into terms

    { tweet => "Perl is GREAT!", posted => "2011-08-15", user.name => "Clinton Gormley", user.email => 'drtech@cpan.org', tags => ["perl","opinion"], posts => 2, }
  17. Values analyzed into terms

    { tweet => ['perl','great'], posted => [Date(2011-08-15)], user.name => ['clinton','gormley'], user.email => ['drtech','cpan.org'], tags => ['perl','opinion'], posts => [2], }
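
The flattening of nested objects into dotted field names shown above can be sketched as a small recursive function. `flatten` is a hypothetical helper written for this illustration, not an ElasticSearch.pm call; it only handles nested hashes and leaves arrays and scalars untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: turn { user => { name => ... } } into
# { 'user.name' => ... } by joining the keys on the way down.
sub flatten {
    my ( $doc, $prefix ) = @_;
    my %flat;
    for my $key ( keys %$doc ) {
        my $path  = defined $prefix ? "$prefix.$key" : $key;
        my $value = $doc->{$key};
        if ( ref $value eq 'HASH' ) {
            %flat = ( %flat, flatten( $value, $path ) );
        }
        else {
            $flat{$path} = $value;
        }
    }
    return %flat;
}

my %flat = flatten(
    {   tweet  => "Perl is GREAT!",
        posted => "2011-08-15",
        user   => { name => "Clinton Gormley", email => 'drtech@cpan.org' },
        tags   => [ "perl", "opinion" ],
        posts  => 2,
    }
);

print "$_\n" for sort keys %flat;
```

Note the single quotes around the email address: in a double-quoted Perl string `@cpan` would be treated as an array to interpolate.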
  18. In MySQL:

    database ⇒ many tables; table ⇒ many rows, one schema; row ⇒ many columns
  19. In ElasticSearch:

    index ⇒ many types; type ⇒ many documents, one mapping; document ⇒ many fields
  20. Create index with mappings $es->create_index( index => 'twitter', mappings =>

    { tweet => { properties => { title => { type => 'string' }, created => { type => 'date' } } } } );
  21. Add a mapping $es->put_mapping( index => 'twitter', type => 'user',

    mapping => { properties => { name => { type => 'string' }, created => { type => 'date' }, } } );
  22. Can add to existing mapping

  23. Can add to an existing mapping; cannot change the mapping of an existing field

  24. Core field types { type => 'string', }

  25. Core field types

    {
        type => 'string',
        # byte|short|integer|long|double|float
        # date, ip addr, geolocation
        # boolean
        # binary (as base 64)
    }
  26. Core field types

    {
        type  => 'string',
        index => 'analyzed',
        # 'Foo Bar' ⇒ [ 'foo', 'bar' ]
    }
  27. Core field types

    {
        type  => 'string',
        index => 'not_analyzed',
        # 'Foo Bar' ⇒ [ 'Foo Bar' ]
    }
  28. Core field types

    {
        type  => 'string',
        index => 'no',
        # 'Foo Bar' ⇒ [ ]
    }
  29. Core field types { type => 'string', index => 'analyzed',

    analyzer => 'default', }
  30. Core field types { type => 'string', index => 'analyzed',

    index_analyzer => 'default', search_analyzer => 'default', }
  31. Core field types { type => 'string', index => 'analyzed',

    analyzer => 'default', boost => 2, }
  32. Core field types { type => 'string', index => 'analyzed',

    analyzer => 'default', boost => 2, include_in_all => 1 } # include_in_all: 1 or 0
  33. Built in analyzers

    • Standard • Simple • Whitespace • Stop • Keyword • Pattern • Language • Snowball • Custom
  34. The Brown-Cow's Part_No. #A.BC123-456 joe@bloggs.com

    keyword: The Brown-Cow's Part_No. #A.BC123-456 joe@bloggs.com
    whitespace: The, Brown-Cow's, Part_No., #A.BC123-456, joe@bloggs.com
    simple: the, brown, cow, s, part, no, a, bc, joe, bloggs, com
    standard: brown, cow's, part_no, a.bc123, 456, joe, bloggs.com
    snowball (English): brown, cow, part_no, a.bc123, 456, joe, bloggs.com
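
Three of the outputs above are simple enough to reproduce in a few lines of Perl. The functions below are stand-ins written for this transcript (their names and regexes are assumptions, not ElasticSearch code); they only imitate the keyword, whitespace and simple analyzers on this one input.

```perl
use strict;
use warnings;

my $text = q{The Brown-Cow's Part_No. #A.BC123-456 joe@bloggs.com};

# keyword: the whole value is a single term
sub keyword_analyzer { return ( $_[0] ) }

# whitespace: split on runs of whitespace, nothing else
sub whitespace_analyzer { return split ' ', $_[0] }

# simple: split on anything that is not a letter, then lowercase
sub simple_analyzer {
    return map { lc $_ } grep { length $_ } split /[^A-Za-z]+/, $_[0];
}

print "keyword: ",    join( ', ', keyword_analyzer($text) ),    "\n";
print "whitespace: ", join( ', ', whitespace_analyzer($text) ), "\n";
print "simple: ",     join( ', ', simple_analyzer($text) ),     "\n";
```

The standard and snowball analyzers additionally drop stop words and, for snowball, stem the terms, which is why they are left out of this sketch.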
  35. Token filters • Standard • ASCII Folding • Length •

    Lowercase • NGram • Edge NGram • Porter Stem • Shingle • Stop • Word Delimiter • Stemmer • KStem • Snowball • Phonetic • Synonym • Compound Word • Reverse • Elision • Truncate • Unique
  36. Custom Analyzer $c->create_index( index => 'twitter', settings => { analysis

    => { analyzer => { ascii_html => { type => 'custom', tokenizer => 'standard', filter => [ qw( standard lowercase asciifolding stop ) ], char_filter => ['html_strip'] } } }} );
  37. Searching $result = $es->search( index => 'twitter', type => 'tweet',

    );
  38. Searching $result = $es->search( index => ['twitter','facebook'], type => ['tweet','post'],

    );
  39. Searching $result = $es->search( # all indices # all types

    );
  40. Searching $result = $es->search( index => 'twitter', type => 'tweet',

    query => { text => { _all => 'foo' }}, );
  41. Searching $result = $es->search( index => 'twitter', type => 'tweet',

    queryb => 'foo', ); # b == ElasticSearch::SearchBuilder
  42. Searching $result = $es->search( index => 'twitter', type => 'tweet',

    query => { text => { _all => 'foo' }}, sort => [{ '_score' => 'desc' }] );
  43. Searching $result = $es->search( index => 'twitter', type => 'tweet',

    query => { text => { _all => 'foo' }}, sort => [{ '_score' => 'desc' }], from => 0, size => 10, );
  44. Query DSL

  45. Queries vs Filters

  46. Queries vs Filters

    Queries: • full text & terms
    Filters: • terms only
  47. Queries vs Filters

    Queries: • full text & terms • relevance scoring
    Filters: • terms only • no scoring
  48. Queries vs Filters

    Queries: • full text & terms • relevance scoring • slower
    Filters: • terms only • no scoring • faster
  49. Queries vs Filters

    Queries: • full text & terms • relevance scoring • slower • no caching
    Filters: • terms only • no scoring • faster • cacheable
  50. Queries vs Filters

    Queries: • full text & terms • relevance scoring • slower • no caching
    Filters: • terms only • no scoring • faster • cacheable
    Use filters for anything that doesn't affect the relevance score!
  51. Query only Query DSL: $es->search( query => { text =>

    { title => 'perl' } } ); SearchBuilder: $es->search( queryb => { title => 'perl' } );
  52. Filter only Query DSL: $es->search( query => { constant_score =>

    { filter => { term => { tag => 'perl' }} } }); SearchBuilder: $es->search( queryb => { -filter => { tag => 'perl' } });
  53. Query and filter Query DSL: $es->search( query => { filtered

    => { query => { text => { title => 'perl' }}, filter =>{ term => { tag => 'perl' }} } }); SearchBuilder: $es->search( queryb => { title => 'perl', -filter => { tag => 'perl' } });
  54. Filters

  55. Filters : equality Query DSL: { term => { tags

    => 'perl' }} { terms => { tags => ['perl','ruby'] }} SearchBuilder: { tags => 'perl' } { tags => ['perl','ruby'] }
  56. Filters : range Query DSL: { range => { date

    => { gte => '2010-11-01', lt => '2010-12-01' }}} SearchBuilder: { date => { gte => '2010-11-01', lt => '2010-12-01' }}
  57. Filters : range (many values) Query DSL: { numeric_range =>

    { date => { gte => '2010-11-01', lt => '2010-12-01' }}} SearchBuilder: { date => { '>=' => '2010-11-01', '<' => '2010-12-01' }}
  58. Filters : and | or | not Query DSL:

    { and => [ {term=>{X=>1}}, {term=>{Y=>2}} ]}
    { or => [ {term=>{X=>1}}, {term=>{Y=>2}} ]}
    { not => { or => [ {term=>{X=>1}}, {term=>{Y=>2}} ] }}
    SearchBuilder:
    { X => 1, Y => 2 }
    [ X => 1, Y => 2 ]
    { -not => { X => 1, Y => 2 } } # and
    { -not => [ X => 1, Y => 2 ] } # or
  59. Filters : exists | missing Query DSL: { exists =>

    { field => 'title' }} { missing => { field => 'title' }} SearchBuilder: { -exists => 'title' } { -missing => 'title' }
  60. Filter example SearchBuilder: { -filter => [ featured => 1,

    { created_at => { gt => '2011-08-01' }, status => { '!=' => 'pending' }, }, ] }
  61. Filter example Query DSL: { constant_score => { filter =>

    { or => [ { term => { featured => 1 }}, { and => [ { not => { term => { status => 'pending' }}}, { range => { created_at => { gt => '2011-08-01' }}}, ] } ] } } }
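
To see what the filter above selects, it can help to evaluate the same structure against plain Perl hashes. The `matches` function below is a toy interpreter written for this transcript: it covers only `term`, `range`, `and`, `or` and `not`, compares values as strings (which happens to work for ISO dates), and says nothing about how ElasticSearch executes or caches filters.

```perl
use strict;
use warnings;

# Toy evaluator for a subset of the filter DSL (an assumption-laden sketch).
sub matches {
    my ( $filter, $doc ) = @_;
    my ( $type, $body ) = %$filter;    # each filter hash has exactly one key

    if ( $type eq 'term' ) {
        my ( $field, $value ) = %$body;
        return ( defined $doc->{$field} && $doc->{$field} eq $value ) ? 1 : 0;
    }
    if ( $type eq 'range' ) {
        my ( $field, $limits ) = %$body;
        my $value = $doc->{$field};
        return 0 unless defined $value;
        return 0 if exists $limits->{gt}  && !( $value gt $limits->{gt} );
        return 0 if exists $limits->{gte} && !( $value ge $limits->{gte} );
        return 0 if exists $limits->{lt}  && !( $value lt $limits->{lt} );
        return 0 if exists $limits->{lte} && !( $value le $limits->{lte} );
        return 1;
    }
    if ( $type eq 'and' ) { return ( grep { !matches( $_, $doc ) } @$body ) ? 0 : 1 }
    if ( $type eq 'or' )  { return ( grep { matches( $_, $doc ) } @$body )  ? 1 : 0 }
    if ( $type eq 'not' ) { return matches( $body, $doc ) ? 0 : 1 }
    die "unsupported filter: $type";
}

# The filter from the slide: featured, OR (not pending AND created after 1 Aug)
my $filter = {
    or => [
        { term => { featured => 1 } },
        {   and => [
                { not   => { term => { status => 'pending' } } },
                { range => { created_at => { gt => '2011-08-01' } } },
            ]
        },
    ]
};

my %featured = ( featured => 1, status => 'pending', created_at => '2011-01-01' );
my %recent   = ( featured => 0, status => 'live',    created_at => '2011-08-10' );
my %pending  = ( featured => 0, status => 'pending', created_at => '2011-08-10' );

printf "featured: %d, recent: %d, pending: %d\n",
    matches( $filter, \%featured ),
    matches( $filter, \%recent ),
    matches( $filter, \%pending );
```

Each filter gives a plain yes or no for a document, with no score involved.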
  62. Filters : others • script • nested • has_child •

    query • match_all • prefix • limit • ids • type • geo_distance • geo_distance_range • geo_bbox • geo_polygon
  63. Queries

    Text / Analyzed: • text • query_string / field • flt / flt_field • mlt / mlt_field
    Term / Not analyzed: • term / terms • range • prefix • fuzzy • wildcard • ids • span queries
    Combining: • bool • dis_max • boosting
    Scripting: • custom_score • custom_filters_score
    Wrappers: • match_all • constant_score • filtered
    “Joins”: • nested • has_child • top_children
  64. Queries

    Text / Analyzed: • text • query_string / field • flt / flt_field • mlt / mlt_field
    Term / Not analyzed: • term / terms • range • prefix • fuzzy • wildcard • ids • span queries
    Combining: • bool • dis_max • boosting
    Scripting: • custom_score • custom_filters_score
    Wrappers: • match_all • constant_score • filtered
    “Joins”: • nested • has_child • top_children
  65. Text/Analyzed Queries mapping aware

  66. Text/Analyzed Queries not_analyzed ⇒ term query

  67. Text/Analyzed Queries analyzed ⇒ text query using search_analyzer

  68. Text-Query Family Query DSL: { text => { title =>

    'great perl' }} Search Builder: { title => 'great perl' }
  69. Text-Query Family Query DSL: { text => { title =>

    { query => 'great perl' }}} Search Builder: { title => { '=' => { query => 'great perl' }}}
  70. Text-Query Family Query DSL: { text => { title =>

    { query => 'great perl' , operator => 'and' }}} Search Builder: { title => { '=' => { query => 'great perl', operator => 'and' }}}
  71. Text-Query Family Query DSL: { text => { title =>

    { query => 'great perl' , fuzziness => 0.5 }}} Search Builder: { title => { '=' => { query => 'great perl', fuzziness => 0.5 }}}
  72. Text-Query Family Query DSL: { text => { title =>

    { query => 'great perl', type => 'phrase' }}} Search Builder: { title => { '==' => { query => 'great perl', }}}
  73. Text-Query Family Query DSL: { text => { title =>

    { query => 'great perl', type => 'phrase' }}} Search Builder: { title => { '==' => { query => 'great perl', }}}
  74. Text-Query Family Query DSL: { text => { title =>

    { query => 'perl is great', type => 'phrase' }}} Search Builder: { title => { '==' => { query => 'perl is great', }}}
  75. Text-Query Family Query DSL: { text => { title =>

    { query => 'perl great', type => 'phrase', slop => 3 }}} Search Builder: { title => { '==' => { query => 'perl great', slop => 3 }}}
  76. Text-Query Family Query DSL: { text => { title =>

    { query => 'perl is gr', type => 'phrase_prefix', }}} Search Builder: { title => { '^' => { query => 'perl is gr', }}}
  77. Query string / Field Lucene Query Syntax aware “perl is

    great”~5 AND author:clint* -deleted
  78. Query string / Field Syntax errors: AND perl is great”

    author: clint* -
  79. Query string / Field Syntax errors: AND perl is great”

    author: clint* - ElasticSearch::QueryParser
  80. Combining: Bool Query DSL: { bool => { must =>

    [ { term => { foo => 1}}, ... ], must_not => [ { term => { bar => 1}}, ... ], should => [ { term => { X => 2}}, { term => { Y => 2}},... ], minimum_number_should_match => 1, }}
  81. Combining: Bool SearchBuilder: { foo => 1, bar => {

    '!=' => 1}, -or => [ X => 2, Y => 2], } { -bool => { must => { foo => 1 }, must_not => { bar => 1 }, should => [{ X => 2}, { Y => 2 }], minimum_number_should_match => 1, }}
  82. Combining: DisMax Query DSL: { dis_max => { queries =>

    [ { term => { foo => 1}}, { term => { bar => 1}}, ] }} SearchBuilder: { -dis_max => [ { term => { foo => 1}}, { term => { bar => 1}}, ], }
  83. Bool: combines scores

    DisMax: uses highest score from all matching clauses
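
The difference between the two combining queries can be shown with two made-up clause scores. This is a deliberately simplified sketch: `bool_score` and `dis_max_score` are hypothetical names, the scores are invented, and real scoring involves further factors that are ignored here.

```perl
use strict;
use warnings;
use List::Util qw(sum max);

# Simplified model, assuming at least one clause matched:
# bool adds up the scores of the matching clauses,
# dis_max keeps only the best one.
sub bool_score    { return sum(@_) }
sub dis_max_score { return max(@_) }

# Invented scores for the same term matching in two fields
my @clause_scores = ( 0.5, 0.25 );

printf "bool: %s, dis_max: %s\n",
    bool_score(@clause_scores),
    dis_max_score(@clause_scores);
```

With bool, a document matching in several clauses is rewarded for each of them; with dis_max, only its single best clause counts.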
  84. Tweaking relevance:

  85. Tweaking relevance: Boosting

  86. Boosting: at index time { properties => { content =>

    { type => 'string' }, title => { type => 'string' }, } }
  87. Boosting: at index time { properties => { content =>

    { type => 'string' }, title => { type => 'string', boost => 2, }, }, }
  88. Boosting: at index time { properties => { content =>

    { type => 'string' }, title => { type => 'string', boost => 2, }, rank => { type => 'integer' }, }, _boost => { name => 'rank', null_value => 1.0 }, }
  89. Boosting: at search time Query DSL: { bool => {

    should => [ { text => { content => 'perl' }}, { text => { title => 'perl' }}, ] }} SearchBuilder: { content => 'perl', title => 'perl' }
  90. Boosting: at search time Query DSL: { bool => {

    should => [ { text => { content => 'perl' }}, { text => { title => { query => 'perl', }}}, ] }} SearchBuilder: { content => 'perl', title => { '=' => { query => 'perl' }} }
  91. Boosting: at search time Query DSL: { bool => {

    should => [ { text => { content => 'perl' }}, { text => { title => { query => 'perl', boost => 2 }}}, ] }} SearchBuilder: { content => 'perl', title => { '=' => { query => 'perl', boost => 2 }} }
  92. Boosting: custom_score Query DSL: { custom_score => { query =>

    { text => { title => 'perl' }}, script => "_score * foo /doc['rank'].value", }} SearchBuilder: { -custom_score => { query => { title => 'perl' }, script => "_score * foo /doc['rank'].value", }}
  93. Query example SearchBuilder: { -or => [ title => {

    '=' => { query => 'custom score', boost => 2 }}, content => 'custom score', ], -filter => { repo => 'elasticsearch/elasticsearch', created_at => { '>=' => '2011-07-01', '<' => '2011-08-01'}, -or => [ creator_id => 123, assignee_id => 123, ], labels => ['bug','breaking'] } }
  94. Query example Query DSL: { query => { filtered =>

    { query => { bool => { should => [ { text => { content => "custom score" } }, { text => { title => { boost => 2, query => "custom score" } } }, ], }, }, filter => { and => [ { or => [ { term => { creator_id => 123 } }, { term => { assignee_id => 123 } }, ]}, { terms => { labels => ["bug", "breaking"] } }, { term => { repo => "elasticsearch/elasticsearch" } }, { numeric_range => { created_at => { gte => "2011-07-01", lt => "2011-08-01" }}}, ]}, }}}
  96. https://github.com/clintongormley/GitHubSearch