Slide 1

Slide 1 text

Elasticsearch Crash Course for Data Engineers Duy Do (@duydo)

Slide 2

Slide 2 text

About ● A Father, A Husband, A Software Engineer ● Founder of Vietnamese Elasticsearch Community ● Author of Vietnamese Elasticsearch Analysis Plugin ● Technical Consultant at Sentifi AG ● Co-Founder at Krom ● Follow me @duydo

Slide 3

Slide 3 text

Elasticsearch is Everywhere

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

What is Elasticsearch?

Slide 6

Slide 6 text

Elasticsearch is a distributed search and analytics engine, designed for horizontal scalability with easy management.

Slide 7

Slide 7 text

Basic Terms ● A cluster is a collection of nodes. ● A node is a single server that is part of a cluster. ● An index is a collection of shards (roughly a database). ● A shard is a collection of documents. ● A type is a category/partition of an index (roughly a table in a database). ● A document is a JSON object (roughly a record in a database).

Slide 8

Slide 8 text

Distributed & Scalable

Slide 9

Slide 9 text

Shards & Replicas

Slide 10

Slide 10 text

One node, One shard

Node 1: employees [P0]

    PUT /employees
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }

Slide 11

Slide 11 text

Two nodes, One shard

Node 1: employees [P0]
Node 2: (empty)

    PUT /employees
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }

Slide 12

Slide 12 text

One node, Two shards

Node 1: employees [P0, P1]

    PUT /employees
    {
      "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 0
      }
    }

Slide 13

Slide 13 text

Two Nodes, Two Shards

Node 1: employees [P0]
Node 2: employees [P1]

    PUT /employees
    {
      "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 0
      }
    }

Slide 14

Slide 14 text

Two nodes, Two shards, One replica of each shard

Node 1: employees [P0, R1]
Node 2: employees [P1, R0]

    PUT /employees
    {
      "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
      }
    }
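The layouts on the preceding slides can be reproduced with a toy allocator. This is a minimal sketch, not Elasticsearch's real allocation algorithm: the function names `total_shard_copies` and `toy_allocate` and the node labels are hypothetical. It only illustrates two facts: an index has number_of_shards × (1 + number_of_replicas) shard copies, and a replica is never placed on the node that already holds a copy of the same shard.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Primaries plus all of their replicas."""
    return number_of_shards * (1 + number_of_replicas)


def toy_allocate(nodes, number_of_shards, number_of_replicas):
    """Place primaries round-robin, then each replica on the least-loaded
    node that does not already hold a copy of the same shard.

    Returns (layout, unassigned). A replica with no eligible node stays
    unassigned, which is what happens on a single-node cluster.
    """
    layout = {node: [] for node in nodes}
    unassigned = []
    for shard in range(number_of_shards):
        primary_node = nodes[shard % len(nodes)]
        layout[primary_node].append(f"P{shard}")
        holders = {primary_node}
        for _ in range(number_of_replicas):
            candidates = [n for n in nodes if n not in holders]
            if not candidates:
                unassigned.append(f"R{shard}")
                continue
            target = min(candidates, key=lambda n: len(layout[n]))
            layout[target].append(f"R{shard}")
            holders.add(target)
    return layout, unassigned


# Two nodes, two shards, one replica of each shard.
layout, unassigned = toy_allocate(["node1", "node2"], 2, 1)
print(total_shard_copies(2, 1), layout, unassigned)
```

With one node and one replica, the same function reports the replica as unassigned, since no second node is available to hold it.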

Slide 15

Slide 15 text

Index Management

Slide 16

Slide 16 text

Create Index

    PUT /employees
    {
      "settings": {...},
      "mappings": {
        "type_one": {...},
        "type_two": {...}
      },
      "aliases": {
        "alias_one": {...},
        "alias_two": {...}
      }
    }

Slide 17

Slide 17 text

Index Settings

    PUT /employees/_settings
    {
      "number_of_replicas": 1
    }

Slide 18

Slide 18 text

Index Mappings

    PUT /employees/_mappings
    {
      "employee": {
        "properties": {
          "name": {"type": "string"},
          "gender": {"type": "string", "index": "not_analyzed"},
          "email": {"type": "string", "index": "not_analyzed"},
          "dob": {"type": "date"},
          "country": {"type": "string", "index": "not_analyzed"},
          "salary": {"type": "double"}
        }
      }
    }

Slide 19

Slide 19 text

Delete Index

    DELETE /employees

Slide 20

Slide 20 text

Put Data In, Get Data Out

Slide 21

Slide 21 text

Index a Document with ID

    PUT /employees/employee/1
    {
      "name": "Duy Do",
      "email": "[email protected]",
      "dob": "1984-06-20",
      "country": "VN",
      "gender": "male",
      "salary": 100.0
    }

Slide 22

Slide 22 text

Index a Document without ID

    POST /employees/employee/
    {
      "name": "Duy Do",
      "email": "[email protected]",
      "dob": "1984-06-20",
      "country": "VN",
      "gender": "male",
      "salary": 100.0
    }

Slide 23

Slide 23 text

Retrieve a Document

    GET /employees/employee/1

Slide 24

Slide 24 text

Update a Document

    POST /employees/employee/1/_update
    {
      "doc": {
        "salary": 500.0
      }
    }

Slide 25

Slide 25 text

Delete a Document

    DELETE /employees/employee/1
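The index, retrieve, update, and delete calls above map directly onto plain HTTP requests. The following is a minimal sketch using only the Python standard library; the helper name `es_request` is hypothetical, and `BASE_URL` assumes a local node listening on the default HTTP port 9200. The requests are built but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, default port


def es_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Sample document using the fields from the slides (email omitted).
employee = {
    "name": "Duy Do",
    "dob": "1984-06-20",
    "country": "VN",
    "gender": "male",
    "salary": 100.0,
}

crud_requests = [
    es_request("PUT", "/employees/employee/1", employee),       # index
    es_request("GET", "/employees/employee/1"),                 # retrieve
    es_request("POST", "/employees/employee/1/_update",
               {"doc": {"salary": 500.0}}),                     # update
    es_request("DELETE", "/employees/employee/1"),              # delete
]

for req in crud_requests:
    print(req.get_method(), req.full_url)

# To send a request against a running node:
#     urllib.request.urlopen(req).read()
```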

Slide 26

Slide 26 text

Searching

Slide 27

Slide 27 text

Structured Search: Dates, Times, Numbers, Text ● Finding Exact Values ● Finding Multiple Exact Values ● Ranges ● Working with Null Values ● Combining Filters

Slide 28

Slide 28 text

Finding Exact Values

    GET /employees/employee/_search
    {
      "query": {
        "term": {
          "country": "VN"
        }
      }
    }

SQL: SELECT * FROM employee WHERE country = 'VN';

Slide 29

Slide 29 text

Finding Multiple Exact Values

    GET /employees/employee/_search
    {
      "query": {
        "terms": {
          "country": ["VN", "US"]
        }
      }
    }

SQL: SELECT * FROM employee WHERE country = 'VN' OR country = 'US';

Slide 30

Slide 30 text

Ranges

    GET /employees/employee/_search
    {
      "query": {
        "range": {
          "dob": {"gt": "1984-01-01", "lt": "2000-01-01"}
        }
      }
    }

SQL: SELECT * FROM employee WHERE dob > '1984-01-01' AND dob < '2000-01-01';

Slide 31

Slide 31 text

Working with Null Values

    GET /employees/employee/_search
    {
      "query": {
        "filtered": {
          "filter": {
            "exists": {"field": "email"}
          }
        }
      }
    }

SQL: SELECT * FROM employee WHERE email IS NOT NULL;

Slide 32

Slide 32 text

Working with Null Values

    GET /employees/employee/_search
    {
      "query": {
        "filtered": {
          "filter": {
            "missing": {"field": "email"}
          }
        }
      }
    }

SQL: SELECT * FROM employee WHERE email IS NULL;

Slide 33

Slide 33 text

Combining Filters

    GET /employees/employee/_search
    {
      "query": {
        "filtered": {
          "filter": {
            "bool": {
              "must": [{"exists": {"field": "email"}}],
              "must_not": [{"term": {"gender": "female"}}],
              "should": [{"terms": {"country": ["VN", "US"]}}]
            }
          }
        }
      }
    }

Slide 34

Slide 34 text

Combining Filters

SQL: SELECT * FROM employee WHERE email IS NOT NULL AND gender != 'female' AND (country = 'VN' OR country = 'US');
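To make the correspondence between the bool filter and its SQL form concrete, here is a small in-memory evaluator. It is only a sketch: the function `matches` and the sample documents are hypothetical, exact-value comparison stands in for lookups on not_analyzed fields, and `should` is treated as "at least one clause must match", following the SQL on the slide.

```python
def matches(clause, doc):
    """Evaluate one filter clause against a document (a plain dict)."""
    kind, spec = next(iter(clause.items()))
    if kind == "term":
        field, value = next(iter(spec.items()))
        return doc.get(field) == value
    if kind == "terms":
        field, values = next(iter(spec.items()))
        return doc.get(field) in values
    if kind == "exists":
        return doc.get(spec["field"]) is not None
    if kind == "missing":
        return doc.get(spec["field"]) is None
    if kind == "bool":
        must = spec.get("must", [])
        must_not = spec.get("must_not", [])
        should = spec.get("should", [])
        return (
            all(matches(c, doc) for c in must)
            and not any(matches(c, doc) for c in must_not)
            and (not should or any(matches(c, doc) for c in should))
        )
    raise ValueError(f"unsupported clause: {kind}")


# The filter from the "Combining Filters" slide.
combined = {
    "bool": {
        "must": [{"exists": {"field": "email"}}],
        "must_not": [{"term": {"gender": "female"}}],
        "should": [{"terms": {"country": ["VN", "US"]}}],
    }
}

# Hypothetical sample documents.
employees = [
    {"name": "A", "email": "a@example.com", "gender": "male", "country": "VN"},
    {"name": "B", "email": None, "gender": "male", "country": "US"},
    {"name": "C", "email": "c@example.com", "gender": "female", "country": "VN"},
    {"name": "D", "email": "d@example.com", "gender": "male", "country": "DE"},
    {"name": "E", "email": "e@example.com", "gender": "male", "country": "US"},
]

selected = [e["name"] for e in employees if matches(combined, e)]
print(selected)
```

Only the documents that have an email, are not female, and are in VN or US pass, which is the same set the SQL WHERE clause selects.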

Slide 35

Slide 35 text

More Queries ● Prefix ● Wildcard ● Regex ● Fuzzy ● Type ● Ids ● ...

Slide 36

Slide 36 text

Full-Text Search Relevance, Analysis ● Match Query ● Combining Queries ● Boosting Query Clauses

Slide 37

Slide 37 text

Match Query - Single Word

    GET /employees/employee/_search
    {
      "query": {
        "match": {
          "name": {
            "query": "Duy"
          }
        }
      }
    }

Slide 38

Slide 38 text

Match Query - Multi Words

    GET /employees/employee/_search
    {
      "query": {
        "match": {
          "name": {
            "query": "Duy Do",
            "operator": "and"
          }
        }
      }
    }
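The difference between the default `or` operator and `"operator": "and"` can be illustrated with a toy model of analysis. This is a sketch under simplifying assumptions: `analyze` and `match` are hypothetical helpers, and lowercasing plus splitting on non-word characters only approximates what a real analyzer does. No relevance scoring is attempted.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def match(field_value, query, operator="or"):
    """Return True if the analyzed query terms are found in the analyzed
    field value: any term for "or", every term for "and"."""
    doc_tokens = set(analyze(field_value))
    hits = [token in doc_tokens for token in analyze(query)]
    return all(hits) if operator == "and" else any(hits)


print(match("Duy Do", "duy"))               # single word, case-insensitive
print(match("Do Van", "Duy Do"))            # "or": one shared term is enough
print(match("Do Van", "Duy Do", "and"))     # "and": every term is required
```

Because both the stored name and the query text pass through the same analysis step, "duy" finds "Duy Do", whereas a term query against an analyzed field would compare the raw, unanalyzed value.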

Slide 39

Slide 39 text

Combining Queries

    GET /employees/employee/_search
    {
      "query": {
        "bool": {
          "must": [{"match": {"name": "Do"}}],
          "must_not": [{"term": {"gender": "female"}}],
          "should": [{"terms": {"country": ["VN", "US"]}}]
        }
      }
    }

Slide 40

Slide 40 text

Boosting Query Clauses

    GET /employees/employee/_search
    {
      "query": {
        "bool": {
          "must": [{"term": {"gender": "female"}}],               # default boost of 1
          "should": [
            {"term": {"country": {"value": "VN", "boost": 3}}},   # the most important clause
            {"term": {"country": {"value": "US", "boost": 2}}}    # more important than the default, but less important than the VN clause
          ]
        }
      }
    }
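The effect of boosts on ranking can be sketched with a deliberately simplified scorer. This is not how Elasticsearch computes relevance: in this toy model each matching `should` clause contributes exactly its boost, whereas the real engine scales a computed relevance score. The names `toy_score` and `search` and the sample documents are hypothetical.

```python
# Hypothetical sample documents.
employees = [
    {"name": "A", "gender": "female", "country": "DE"},
    {"name": "B", "gender": "female", "country": "US"},
    {"name": "C", "gender": "female", "country": "VN"},
    {"name": "D", "gender": "male", "country": "VN"},
]

# (field, value, boost) triples mirroring the should clauses on the slide.
SHOULD = [("country", "VN", 3), ("country", "US", 2)]


def toy_score(doc, should_clauses):
    """Sum the boosts of the should clauses that the document matches."""
    return sum(boost for field, value, boost in should_clauses
               if doc.get(field) == value)


def search(docs):
    """Apply the must clause (gender = female), then rank by toy score."""
    candidates = [d for d in docs if d["gender"] == "female"]
    return sorted(candidates, key=lambda d: toy_score(d, SHOULD), reverse=True)


ranked = [d["name"] for d in search(employees)]
print(ranked)
```

The `must` clause decides which documents are returned at all; the boosted `should` clauses only change their order, placing VN first, then US, then everyone else.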

Slide 41

Slide 41 text

More Queries ● Multi Match ● Common Terms ● Query Strings ● ...

Slide 42

Slide 42 text

Analytics

Slide 43

Slide 43 text

Aggregations Analyze & Summarize ● How many needles in the haystack? ● What is the average length of the needles? ● What is the median length of the needles, broken down by manufacturer? ● How many needles are added to the haystacks each month? ● What are the most popular needle manufacturers? ● ...

Slide 44

Slide 44 text

Buckets & Metrics

    SELECT COUNT(country)   # a metric
    FROM employee
    GROUP BY country        # a bucket

    GET /employees/employee/_search
    {
      "aggs": {
        "by_country": {
          "terms": {"field": "country"}
        }
      }
    }

Slide 45

Slide 45 text

A bucket is a collection of documents that meet certain criteria.

Slide 46

Slide 46 text

A metric is a simple mathematical operation computed over the documents in a bucket, such as min, max, sum, or avg (mean).

Slide 47

Slide 47 text

Combining Buckets & Metrics ● Partition employees by country (bucket). ● Then partition each country bucket by gender (bucket). ● Finally, calculate the average salary for each gender bucket (metric).

Slide 48

Slide 48 text

Combination Query

    GET /employees/employee/_search
    {
      "aggs": {
        "by_country": {
          "terms": {"field": "country"},
          "aggs": {
            "by_gender": {
              "terms": {"field": "gender"},
              "aggs": {
                "avg_salary": {"avg": {"field": "salary"}}
              }
            }
          }
        }
      }
    }
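The nested aggregation above computes the same thing as a two-level group-by followed by an average. A minimal in-memory sketch makes the bucket and metric roles explicit; the function name and the sample data are hypothetical, and a real cluster would compute this per shard and merge the results.

```python
from collections import defaultdict
from statistics import mean


def avg_salary_by_country_and_gender(employees):
    """Bucket by country, then by gender, then average the salaries."""
    buckets = defaultdict(lambda: defaultdict(list))
    for e in employees:
        buckets[e["country"]][e["gender"]].append(e["salary"])   # buckets
    return {
        country: {gender: mean(salaries)                         # metric
                  for gender, salaries in by_gender.items()}
        for country, by_gender in buckets.items()
    }


# Hypothetical sample documents.
employees = [
    {"country": "VN", "gender": "male", "salary": 100.0},
    {"country": "VN", "gender": "male", "salary": 300.0},
    {"country": "VN", "gender": "female", "salary": 200.0},
    {"country": "US", "gender": "female", "salary": 400.0},
]

result = avg_salary_by_country_and_gender(employees)
print(result)
```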

Slide 49

Slide 49 text

More Aggregations ● Histogram ● Date Histogram ● Date Range ● Filter/Filters ● Missing ● Geo Distance ● Nested ● ...

Slide 50

Slide 50 text

Best Practices

Slide 51

Slide 51 text

Indexing ● Use the bulk indexing API. ● Tune your bulk size; 5-10 MB per request is a reasonable starting point. ● Partition your time series data by time period (monthly, weekly, daily). ● Use aliases for your indices. ● Turn off refresh and replicas while bulk indexing, and turn them back on once it is done. ● Use multiple shards for parallel indexing. ● Use multiple replicas for parallel reading.
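Two of the indexing tips lend themselves to a short sketch: batching documents into size-limited bulk bodies and deriving a time-based index name. The bulk body is newline-delimited JSON, one action line followed by one source line per document, posted to `/_bulk`. The helper names `bulk_lines`, `bulk_chunks`, and `monthly_index`, the 5 MB default, and the `logs` prefix are assumptions made for illustration.

```python
import datetime
import json


def bulk_lines(index, doc_type, docs):
    """Yield one action+source pair (two NDJSON lines) per (doc_id, doc)."""
    for doc_id, doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        yield json.dumps(action) + "\n" + json.dumps(doc) + "\n"


def bulk_chunks(index, doc_type, docs, max_bytes=5 * 1024 * 1024):
    """Group pairs into request bodies of at most max_bytes each.

    A single pair larger than max_bytes still gets a body of its own.
    """
    chunk, size = [], 0
    for pair in bulk_lines(index, doc_type, docs):
        pair_size = len(pair.encode("utf-8"))
        if chunk and size + pair_size > max_bytes:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(pair)
        size += pair_size
    if chunk:
        yield "".join(chunk)


def monthly_index(prefix, day):
    """Name of the monthly index a document dated `day` belongs to."""
    return f"{prefix}-{day:%Y.%m}"


docs = [(str(i), {"n": i}) for i in range(5)]
bodies = list(bulk_chunks("employees", "employee", docs, max_bytes=200))
print(len(bodies), monthly_index("logs", datetime.date(2015, 6, 20)))
```

Each body would be sent as one `POST /_bulk` request. Measuring throughput at a few different `max_bytes` values is the practical way to tune the batch size for a given cluster.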

Slide 52

Slide 52 text

Mapping ● Disable the _all field. ● Keep the _source field; do not store individual fields. ● Use not_analyzed where possible.

Slide 53

Slide 53 text

Query ● Use filters instead of queries where possible. ● Consider the order and scope of your filters. ● Avoid the query string query. ● Do not load too many results with a single query; use the scroll API instead.
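The scroll advice boils down to a simple loop: start a search with a `scroll` timeout, then keep asking for the next page with the returned `_scroll_id` until a page comes back empty. The sketch below keeps the HTTP calls out of the picture: `start_search` and `continue_scroll` are hypothetical callables that stand in for the initial search request and the follow-up scroll request, and the fake pages only mimic the `_scroll_id` and `hits.hits` fields of a response.

```python
def scroll_all(start_search, continue_scroll):
    """Yield every hit, page by page, until an empty page is returned."""
    page = start_search()
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        # Always use the most recent scroll id for the next page.
        page = continue_scroll(page["_scroll_id"])


# Fake responses standing in for a cluster (hypothetical data).
fake_pages = iter([
    {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    {"_scroll_id": "s3", "hits": {"hits": []}},
])

ids = [hit["_id"] for hit in scroll_all(
    start_search=lambda: next(fake_pages),
    continue_scroll=lambda scroll_id: next(fake_pages),
)]
print(ids)
```

Because the function is a generator, the caller processes one page at a time and never holds the full result set in memory, which is the point of scrolling instead of requesting one very large page.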

Slide 54

Slide 54 text

Tools

Slide 55

Slide 55 text

Kibana for Discovery, Visualization

Slide 56

Slide 56 text

Sense for Query

Slide 57

Slide 57 text

Marvel for Monitoring