Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Elasticsearch for Data Engineers

Duy Do
September 16, 2016

Elasticsearch for Data Engineers

This is Elasticsearch crash course for Data Engineers at Sentifi AG

Duy Do

September 16, 2016
Tweet

More Decks by Duy Do

Other Decks in Technology

Transcript

  1. About • A Father, A Husband, A Software Engineer •

    Founder of Vietnamese Elasticsearch Community • Author of Vietnamese Elasticsearch Analysis Plugin • Technical Consultant at Sentifi AG • Co-Founder at Krom • Follow me @duydo
  2. Basic Terms • Cluster is a collection of nodes. •

    Node is a single server, part of a cluster. • Index is a collection of shards ~ database. • Shard is a collection of documents. • Type is a category/partition of an index ~ table in database. • Document is a Json object ~ record in database.
  3. One node, One shard Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } }
  4. Two nodes, One shard Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } } Node 2
  5. One node, Two shards Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } P1
  6. Two Nodes, Two Shards Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } Node 2 employees P1 P1
  7. Two nodes, Two shards, One replica of each shard Node

    1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 1 } } R1 Node 2 employees P1 R0
  8. Create Index PUT /employees { “settings”: {...}, “mappings”: { “type_one”:

    {...}, “type_two”: {...} }, “aliases”: { “alias_one”: {...}, “alias_two”: {...} } }
  9. Index Mappings PUT /employees/_mappings { “employee”: { “properties”: { “name”:

    {“type”: “string”}, “gender”: {“type”: “string”, “index”: “not_analyzed”}, “email”: {“type”: “string”, “index”: “not_analyzed”}, “dob”: {“type”: “date”}, “country”: {“type”: “string”, “index”: “not_analyzed”}, “salary”: {“type”: “double”}, } } }
  10. Index a Document with ID PUT /employees/employee/1 { “name”: “Duy

    Do”, “email”: “[email protected]”, “dob”: “1984-06-20”, “country”: “VN” “gender”: “male”, “salary”: 100.0 }
  11. Index a Document without ID POST /employees/employee/ { “name”: “Duy

    Do”, “email”: “[email protected]”, “dob”: “1984-06-20”, “country”: “VN” “gender”: “male”, “salary”: 100.0 }
  12. Structured Search Date, Times, Numbers, Text • Finding Exact Values

    • Finding Multiple Exact Values • Ranges • Working with Null Values • Combining Filters
  13. Finding Exact Values GET /employees/employee/_search { “query”: { “term”: {

    “country”: “VN” } } } SQL: SELECT * FROM employee WHERE country = ‘VN’;
  14. Finding Multiple Exact Values GET /employees/employee/_search { “query”: { “terms”:

    { “country”: [“VN”, “US”] } } } SQL: SELECT * FROM employee WHERE country = ‘VN’ OR country = ‘US’;
  15. Ranges GET /employees/employee/_search { “query”: { “range”: { “dob”: {“gt”:

    “1984-01-01”, “lt”: “2000-01-01”} } } } SQL: SELECT * FROM employee WHERE dob BETWEENS ‘1984-01-01’ AND ‘2000-01-01’;
  16. Working with Null values GET /employees/employee/_search { “query”: { “filtered”:

    { “filter”: { “exists”: {“field”: “email”} } } } } SELECT * FROM employee WHERE email IS NOT NULL;
  17. Working with Null Values GET /employees/employee/_search { “query”: { “filtered”:

    { “filter”: { “missing”: {“field”: “email”} } } } } SELECT * FROM employee WHERE email IS NULL;
  18. Combining Filters GET /employees/employee/_search { “query”: { “filtered”: { “filter”:

    { “bool”: { “must”:[{“exists”: {“field”: “email”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } } } }
  19. Combining Filters SQL: SELECT * FROM employee WHERE email IS

    NOT NULL AND gender != ‘female’ AND (country = ‘VN’ OR country = ‘US’);
  20. Match Query - Single Word GET /employees/employee/_search { “query”: {

    “match”: { “name”: { “query”: “Duy” } } } }
  21. Match Query - Multi Words GET /employees/employee/_search { “query”: {

    “match”: { “name”: { “query”: “Duy Do”, “operator”: “and” } } } }
  22. Combining Queries GET /employees/employee/_search { “query”: { “bool”: { “must”:[{“match”:

    {“name”: “Do”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } }
  23. Boosting Query Clauses GET /employees/employee/_search { “query”: { “bool”: {

    “must”:[{“term”: {“gender”: “female”}}], # default boost 1 “should”:[ {“term”: {“country”: {“query”:“VN”, “boost”:3}}} # the most important {“term”: {“country”: {“query”:“US”, “boost”:2}}} # important than #1 but not as important as #2 ], } } }
  24. Aggregations Analyze & Summarize • How many needles in the

    haystack? • What is the average length of the needles? • What is the median length of the needles, broken down by manufacturer? • How many needles are added to the haystacks each month? • What are the most popular needle manufacturers? • ...
  25. Buckets & Metrics SELECT COUNT(country) # a metric FROM employee

    GROUP BY country # a bucket GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”: {“field”: “country”} } } }
  26. Combination Buckets & Metrics • Partitions employees by country (bucket)

    • Then partitions each country bucket by gender (bucket) • Finally calculate the average salary for each gender bucket (metric)
  27. Combination Query GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”:

    {“field”: “country”}, “aggs”: { “by_gender”: { “terms”: {“field”: “gender”}, “aggs”: { “avg_salary”: {“avg”: “field”: “salary”} } } } } } }
  28. More Aggregations • Histogram • Date Histogram • Date Range

    • Filter/Filters • Missing • Geo Distance • Nested • ...
  29. Indexing • Use bulk indexing APIs. • Tune your bulk

    size 5-10MB. • Partitions your time series data by time period (monthly, weekly, daily). • Use aliases for your indices. • Turn off refresh, replicas while indexing. Turn on once it’s done • Multiple shards for parallel indexing. • Multiple replicas for parallel reading.
  30. Mapping • Disable _all field • Keep _source field, do

    not store any field. • Use not_analyzed if possible
  31. Query • Use filters instead of queries if possible. •

    Consider orders and scope of your filters. • Do not use string query. • Do not load too many results with single query, use scroll API instead.