Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Elasticsearch for Data Engineers

Duy Do
September 16, 2016

Elasticsearch for Data Engineers

This is Elasticsearch crash course for Data Engineers at Sentifi AG

Duy Do

September 16, 2016
Tweet

More Decks by Duy Do

Other Decks in Technology

Transcript

  1. Elasticsearch Crash Course for Data Engineers Duy Do (@duydo)

  2. About • A Father, A Husband, A Software Engineer •

    Founder of Vietnamese Elasticsearch Community • Author of Vietnamese Elasticsearch Analysis Plugin • Technical Consultant at Sentifi AG • Co-Founder at Krom • Follow me @duydo
  3. Elasticsearch is Everywhere

  4. None
  5. What is Elasticsearch?

  6. Elasticsearch is a distributed search and analytics engine, designed for

    horizontal scalability with easy management.
  7. Basic Terms • Cluster is a collection of nodes. •

    Node is a single server, part of a cluster. • Index is a collection of shards ~ database. • Shard is a collection of documents. • Type is a category/partition of an index ~ table in database. • Document is a Json object ~ record in database.
  8. Distributed & Scalable

  9. Shards & Replicas

  10. One node, One shard Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } }
  11. Two nodes, One shard Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } } Node 2
  12. One node, Two shards Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } P1
  13. Two Nodes, Two Shards Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } Node 2 employees P1 P1
  14. Two nodes, Two shards, One replica of each shard Node

    1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 1 } } R1 Node 2 employees P1 R0
  15. Index Management

  16. Create Index PUT /employees { “settings”: {...}, “mappings”: { “type_one”:

    {...}, “type_two”: {...} }, “aliases”: { “alias_one”: {...}, “alias_two”: {...} } }
  17. Index Settings PUT /employees/_settings { “number_of_replicas”: 1 }

  18. Index Mappings PUT /employees/_mappings { “employee”: { “properties”: { “name”:

    {“type”: “string”}, “gender”: {“type”: “string”, “index”: “not_analyzed”}, “email”: {“type”: “string”, “index”: “not_analyzed”}, “dob”: {“type”: “date”}, “country”: {“type”: “string”, “index”: “not_analyzed”}, “salary”: {“type”: “double”}, } } }
  19. Delete Index DELETE /employees

  20. Put Data In, Get Data Out

  21. Index a Document with ID PUT /employees/employee/1 { “name”: “Duy

    Do”, “email”: “[email protected]”, “dob”: “1984-06-20”, “country”: “VN” “gender”: “male”, “salary”: 100.0 }
  22. Index a Document without ID POST /employees/employee/ { “name”: “Duy

    Do”, “email”: “[email protected]”, “dob”: “1984-06-20”, “country”: “VN” “gender”: “male”, “salary”: 100.0 }
  23. Retrieve a Document GET /employees/employee/1

  24. Update a Document POST /employees/employee/1/_update { “doc”:{ “salary”: 500.0 }

    }
  25. Delete a Document DELETE /employees/employee/1

  26. Searching

  27. Structured Search Date, Times, Numbers, Text • Finding Exact Values

    • Finding Multiple Exact Values • Ranges • Working with Null Values • Combining Filters
  28. Finding Exact Values GET /employees/employee/_search { “query”: { “term”: {

    “country”: “VN” } } } SQL: SELECT * FROM employee WHERE country = ‘VN’;
  29. Finding Multiple Exact Values GET /employees/employee/_search { “query”: { “terms”:

    { “country”: [“VN”, “US”] } } } SQL: SELECT * FROM employee WHERE country = ‘VN’ OR country = ‘US’;
  30. Ranges GET /employees/employee/_search { “query”: { “range”: { “dob”: {“gt”:

    “1984-01-01”, “lt”: “2000-01-01”} } } } SQL: SELECT * FROM employee WHERE dob BETWEENS ‘1984-01-01’ AND ‘2000-01-01’;
  31. Working with Null values GET /employees/employee/_search { “query”: { “filtered”:

    { “filter”: { “exists”: {“field”: “email”} } } } } SELECT * FROM employee WHERE email IS NOT NULL;
  32. Working with Null Values GET /employees/employee/_search { “query”: { “filtered”:

    { “filter”: { “missing”: {“field”: “email”} } } } } SELECT * FROM employee WHERE email IS NULL;
  33. Combining Filters GET /employees/employee/_search { “query”: { “filtered”: { “filter”:

    { “bool”: { “must”:[{“exists”: {“field”: “email”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } } } }
  34. Combining Filters SQL: SELECT * FROM employee WHERE email IS

    NOT NULL AND gender != ‘female’ AND (country = ‘VN’ OR country = ‘US’);
  35. More Queries • Prefix • Wildcard • Regex • Fuzzy

    • Type • Ids • ...
  36. Full-Text Search Relevance, Analysis • Match Query • Combining Queries

    • Boosting Query Clauses
  37. Match Query - Single Word GET /employees/employee/_search { “query”: {

    “match”: { “name”: { “query”: “Duy” } } } }
  38. Match Query - Multi Words GET /employees/employee/_search { “query”: {

    “match”: { “name”: { “query”: “Duy Do”, “operator”: “and” } } } }
  39. Combining Queries GET /employees/employee/_search { “query”: { “bool”: { “must”:[{“match”:

    {“name”: “Do”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } }
  40. Boosting Query Clauses GET /employees/employee/_search { “query”: { “bool”: {

    “must”:[{“term”: {“gender”: “female”}}], # default boost 1 “should”:[ {“term”: {“country”: {“query”:“VN”, “boost”:3}}} # the most important {“term”: {“country”: {“query”:“US”, “boost”:2}}} # important than #1 but not as important as #2 ], } } }
  41. More Queries • Multi Match • Common Terms • Query

    Strings • ...
  42. Analytics

  43. Aggregations Analyze & Summarize • How many needles in the

    haystack? • What is the average length of the needles? • What is the median length of the needles, broken down by manufacturer? • How many needles are added to the haystacks each month? • What are the most popular needle manufacturers? • ...
  44. Buckets & Metrics SELECT COUNT(country) # a metric FROM employee

    GROUP BY country # a bucket GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”: {“field”: “country”} } } }
  45. Bucket is a collection of documents that meet certain criteria.

  46. Metric is simple mathematical operations such as: min, max, mean,

    sum and avg.
  47. Combination Buckets & Metrics • Partitions employees by country (bucket)

    • Then partitions each country bucket by gender (bucket) • Finally calculate the average salary for each gender bucket (metric)
  48. Combination Query GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”:

    {“field”: “country”}, “aggs”: { “by_gender”: { “terms”: {“field”: “gender”}, “aggs”: { “avg_salary”: {“avg”: “field”: “salary”} } } } } } }
  49. More Aggregations • Histogram • Date Histogram • Date Range

    • Filter/Filters • Missing • Geo Distance • Nested • ...
  50. Best Practices

  51. Indexing • Use bulk indexing APIs. • Tune your bulk

    size 5-10MB. • Partitions your time series data by time period (monthly, weekly, daily). • Use aliases for your indices. • Turn off refresh, replicas while indexing. Turn on once it’s done • Multiple shards for parallel indexing. • Multiple replicas for parallel reading.
  52. Mapping • Disable _all field • Keep _source field, do

    not store any field. • Use not_analyzed if possible
  53. Query • Use filters instead of queries if possible. •

    Consider orders and scope of your filters. • Do not use string query. • Do not load too many results with single query, use scroll API instead.
  54. Tools

  55. Kibana for Discovery, Visualization

  56. Sense for Query

  57. Marvel for Monitoring