Elasticsearch for Data Engineers

6ac96bcb854145bb47eac6fa80b50d44?s=47 Duy Do
September 16, 2016

Elasticsearch for Data Engineers

This is Elasticsearch crash course for Data Engineers at Sentifi AG

6ac96bcb854145bb47eac6fa80b50d44?s=128

Duy Do

September 16, 2016
Tweet

Transcript

  1. Elasticsearch Crash Course for Data Engineers Duy Do (@duydo)

  2. About • A Father, A Husband, A Software Engineer •

    Founder of Vietnamese Elasticsearch Community • Author of Vietnamese Elasticsearch Analysis Plugin • Technical Consultant at Sentifi AG • Co-Founder at Krom • Follow me @duydo
  3. Elasticsearch is Everywhere

  4. None
  5. What is Elasticsearch?

  6. Elasticsearch is a distributed search and analytics engine, designed for

    horizontal scalability with easy management.
  7. Basic Terms • Cluster is a collection of nodes. •

    Node is a single server, part of a cluster. • Index is a collection of shards ~ database. • Shard is a collection of documents. • Type is a category/partition of an index ~ table in database. • Document is a Json object ~ record in database.
  8. Distributed & Scalable

  9. Shards & Replicas

  10. One node, One shard Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } }
  11. Two nodes, One shard Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } } Node 2
  12. One node, Two shards Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } P1
  13. Two Nodes, Two Shards Node 1 employees P0 PUT /employees

    { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } Node 2 employees P1 P1
  14. Two nodes, Two shards, One replica of each shard Node

    1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 1 } } R1 Node 2 employees P1 R0
  15. Index Management

  16. Create Index PUT /employees { “settings”: {...}, “mappings”: { “type_one”:

    {...}, “type_two”: {...} }, “aliases”: { “alias_one”: {...}, “alias_two”: {...} } }
  17. Index Settings PUT /employees/_settings { “number_of_replicas”: 1 }

  18. Index Mappings PUT /employees/_mappings { “employee”: { “properties”: { “name”:

    {“type”: “string”}, “gender”: {“type”: “string”, “index”: “not_analyzed”}, “email”: {“type”: “string”, “index”: “not_analyzed”}, “dob”: {“type”: “date”}, “country”: {“type”: “string”, “index”: “not_analyzed”}, “salary”: {“type”: “double”}, } } }
  19. Delete Index DELETE /employees

  20. Put Data In, Get Data Out

  21. Index a Document with ID PUT /employees/employee/1 { “name”: “Duy

    Do”, “email”: “duy.do@sentifi.com”, “dob”: “1984-06-20”, “country”: “VN” “gender”: “male”, “salary”: 100.0 }
  22. Index a Document without ID POST /employees/employee/ { “name”: “Duy

    Do”, “email”: “duy.do@sentifi.com”, “dob”: “1984-06-20”, “country”: “VN” “gender”: “male”, “salary”: 100.0 }
  23. Retrieve a Document GET /employees/employee/1

  24. Update a Document POST /employees/employee/1/_update { “doc”:{ “salary”: 500.0 }

    }
  25. Delete a Document DELETE /employees/employee/1

  26. Searching

  27. Structured Search Date, Times, Numbers, Text • Finding Exact Values

    • Finding Multiple Exact Values • Ranges • Working with Null Values • Combining Filters
  28. Finding Exact Values GET /employees/employee/_search { “query”: { “term”: {

    “country”: “VN” } } } SQL: SELECT * FROM employee WHERE country = ‘VN’;
  29. Finding Multiple Exact Values GET /employees/employee/_search { “query”: { “terms”:

    { “country”: [“VN”, “US”] } } } SQL: SELECT * FROM employee WHERE country = ‘VN’ OR country = ‘US’;
  30. Ranges GET /employees/employee/_search { “query”: { “range”: { “dob”: {“gt”:

    “1984-01-01”, “lt”: “2000-01-01”} } } } SQL: SELECT * FROM employee WHERE dob BETWEENS ‘1984-01-01’ AND ‘2000-01-01’;
  31. Working with Null values GET /employees/employee/_search { “query”: { “filtered”:

    { “filter”: { “exists”: {“field”: “email”} } } } } SELECT * FROM employee WHERE email IS NOT NULL;
  32. Working with Null Values GET /employees/employee/_search { “query”: { “filtered”:

    { “filter”: { “missing”: {“field”: “email”} } } } } SELECT * FROM employee WHERE email IS NULL;
  33. Combining Filters GET /employees/employee/_search { “query”: { “filtered”: { “filter”:

    { “bool”: { “must”:[{“exists”: {“field”: “email”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } } } }
  34. Combining Filters SQL: SELECT * FROM employee WHERE email IS

    NOT NULL AND gender != ‘female’ AND (country = ‘VN’ OR country = ‘US’);
  35. More Queries • Prefix • Wildcard • Regex • Fuzzy

    • Type • Ids • ...
  36. Full-Text Search Relevance, Analysis • Match Query • Combining Queries

    • Boosting Query Clauses
  37. Match Query - Single Word GET /employees/employee/_search { “query”: {

    “match”: { “name”: { “query”: “Duy” } } } }
  38. Match Query - Multi Words GET /employees/employee/_search { “query”: {

    “match”: { “name”: { “query”: “Duy Do”, “operator”: “and” } } } }
  39. Combining Queries GET /employees/employee/_search { “query”: { “bool”: { “must”:[{“match”:

    {“name”: “Do”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } }
  40. Boosting Query Clauses GET /employees/employee/_search { “query”: { “bool”: {

    “must”:[{“term”: {“gender”: “female”}}], # default boost 1 “should”:[ {“term”: {“country”: {“query”:“VN”, “boost”:3}}} # the most important {“term”: {“country”: {“query”:“US”, “boost”:2}}} # important than #1 but not as important as #2 ], } } }
  41. More Queries • Multi Match • Common Terms • Query

    Strings • ...
  42. Analytics

  43. Aggregations Analyze & Summarize • How many needles in the

    haystack? • What is the average length of the needles? • What is the median length of the needles, broken down by manufacturer? • How many needles are added to the haystacks each month? • What are the most popular needle manufacturers? • ...
  44. Buckets & Metrics SELECT COUNT(country) # a metric FROM employee

    GROUP BY country # a bucket GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”: {“field”: “country”} } } }
  45. Bucket is a collection of documents that meet certain criteria.

  46. Metric is simple mathematical operations such as: min, max, mean,

    sum and avg.
  47. Combination Buckets & Metrics • Partitions employees by country (bucket)

    • Then partitions each country bucket by gender (bucket) • Finally calculate the average salary for each gender bucket (metric)
  48. Combination Query GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”:

    {“field”: “country”}, “aggs”: { “by_gender”: { “terms”: {“field”: “gender”}, “aggs”: { “avg_salary”: {“avg”: “field”: “salary”} } } } } } }
  49. More Aggregations • Histogram • Date Histogram • Date Range

    • Filter/Filters • Missing • Geo Distance • Nested • ...
  50. Best Practices

  51. Indexing • Use bulk indexing APIs. • Tune your bulk

    size 5-10MB. • Partitions your time series data by time period (monthly, weekly, daily). • Use aliases for your indices. • Turn off refresh, replicas while indexing. Turn on once it’s done • Multiple shards for parallel indexing. • Multiple replicas for parallel reading.
  52. Mapping • Disable _all field • Keep _source field, do

    not store any field. • Use not_analyzed if possible
  53. Query • Use filters instead of queries if possible. •

    Consider orders and scope of your filters. • Do not use string query. • Do not load too many results with single query, use scroll API instead.
  54. Tools

  55. Kibana for Discovery, Visualization

  56. Sense for Query

  57. Marvel for Monitoring