Elasticsearch for Data Engineers

Elasticsearch Crash Course for Data Engineers Duy Do (@duydo)

About • A Father, A Husband, A Software Engineer •
Founder of Vietnamese Elasticsearch Community • Author of Vietnamese Elasticsearch Analysis Plugin • Technical Consultant at Sentifi AG • Co-Founder at Krom • Follow me @duydo

Elasticsearch is Everywhere

What is Elasticsearch?

Elasticsearch is a distributed search and analytics engine, designed for
horizontal scalability with easy management.

Basic Terms • Cluster is a collection of nodes. •
Node is a single server, part of a cluster. • Index is a collection of shards ~ database. • Shard is a collection of documents. • Type is a category/partition of an index ~ table in database. • Document is a Json object ~ record in database.

Distributed & Scalable

Shards & Replicas

One node, One shard Node 1 employees P0 PUT /employees
{ “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } }

Two nodes, One shard Node 1 employees P0 PUT /employees
{ “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } } Node 2

One node, Two shards Node 1 employees P0 PUT /employees
{ “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } P1

Two Nodes, Two Shards Node 1 employees P0 PUT /employees
{ “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } Node 2 employees P1 P1

Two nodes, Two shards, One replica of each shard Node
1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 1 } } R1 Node 2 employees P1 R0

Index Management

Create Index PUT /employees { “settings”: {...}, “mappings”: { “type_one”:
{...}, “type_two”: {...} }, “aliases”: { “alias_one”: {...}, “alias_two”: {...} } }

Index Settings PUT /employees/_settings { “number_of_replicas”: 1 }

Index Mappings PUT /employees/_mappings { “employee”: { “properties”: { “name”:
{“type”: “string”}, “gender”: {“type”: “string”, “index”: “not_analyzed”}, “email”: {“type”: “string”, “index”: “not_analyzed”}, “dob”: {“type”: “date”}, “country”: {“type”: “string”, “index”: “not_analyzed”}, “salary”: {“type”: “double”}, } } }

Delete Index DELETE /employees

Put Data In, Get Data Out

Index a Document with ID PUT /employees/employee/1 { “name”: “Duy
Do”, “email”: “[email protected]”, “dob”: “1984-06-20”, “country”: “VN” “gender”: “male”, “salary”: 100.0 }

Index a Document without ID POST /employees/employee/ { “name”: “Duy
Do”, “email”: “[email protected]”, “dob”: “1984-06-20”, “country”: “VN” “gender”: “male”, “salary”: 100.0 }

Retrieve a Document GET /employees/employee/1

Update a Document POST /employees/employee/1/_update { “doc”:{ “salary”: 500.0 }
}

Delete a Document DELETE /employees/employee/1

Searching

Structured Search Date, Times, Numbers, Text • Finding Exact Values
• Finding Multiple Exact Values • Ranges • Working with Null Values • Combining Filters

Finding Exact Values GET /employees/employee/_search { “query”: { “term”: {
“country”: “VN” } } } SQL: SELECT * FROM employee WHERE country = ‘VN’;

Finding Multiple Exact Values GET /employees/employee/_search { “query”: { “terms”:
{ “country”: [“VN”, “US”] } } } SQL: SELECT * FROM employee WHERE country = ‘VN’ OR country = ‘US’;

Ranges GET /employees/employee/_search { “query”: { “range”: { “dob”: {“gt”:
“1984-01-01”, “lt”: “2000-01-01”} } } } SQL: SELECT * FROM employee WHERE dob BETWEENS ‘1984-01-01’ AND ‘2000-01-01’;

Working with Null values GET /employees/employee/_search { “query”: { “filtered”:
{ “filter”: { “exists”: {“field”: “email”} } } } } SELECT * FROM employee WHERE email IS NOT NULL;

Working with Null Values GET /employees/employee/_search { “query”: { “filtered”:
{ “filter”: { “missing”: {“field”: “email”} } } } } SELECT * FROM employee WHERE email IS NULL;

Combining Filters GET /employees/employee/_search { “query”: { “filtered”: { “filter”:
{ “bool”: { “must”:[{“exists”: {“field”: “email”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } } } }

Combining Filters SQL: SELECT * FROM employee WHERE email IS
NOT NULL AND gender != ‘female’ AND (country = ‘VN’ OR country = ‘US’);

More Queries • Prefix • Wildcard • Regex • Fuzzy
• Type • Ids • ...

Full-Text Search Relevance, Analysis • Match Query • Combining Queries
• Boosting Query Clauses

Match Query - Single Word GET /employees/employee/_search { “query”: {
“match”: { “name”: { “query”: “Duy” } } } }

Match Query - Multi Words GET /employees/employee/_search { “query”: {
“match”: { “name”: { “query”: “Duy Do”, “operator”: “and” } } } }

Combining Queries GET /employees/employee/_search { “query”: { “bool”: { “must”:[{“match”:
{“name”: “Do”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } }

Boosting Query Clauses GET /employees/employee/_search { “query”: { “bool”: {
“must”:[{“term”: {“gender”: “female”}}], # default boost 1 “should”:[ {“term”: {“country”: {“query”:“VN”, “boost”:3}}} # the most important {“term”: {“country”: {“query”:“US”, “boost”:2}}} # important than #1 but not as important as #2 ], } } }

More Queries • Multi Match • Common Terms • Query
Strings • ...

Analytics

Aggregations Analyze & Summarize • How many needles in the
haystack? • What is the average length of the needles? • What is the median length of the needles, broken down by manufacturer? • How many needles are added to the haystacks each month? • What are the most popular needle manufacturers? • ...

Buckets & Metrics SELECT COUNT(country) # a metric FROM employee
GROUP BY country # a bucket GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”: {“field”: “country”} } } }

Bucket is a collection of documents that meet certain criteria.

Metric is simple mathematical operations such as: min, max, mean,
sum and avg.

Combination Buckets & Metrics • Partitions employees by country (bucket)
• Then partitions each country bucket by gender (bucket) • Finally calculate the average salary for each gender bucket (metric)

Combination Query GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”:
{“field”: “country”}, “aggs”: { “by_gender”: { “terms”: {“field”: “gender”}, “aggs”: { “avg_salary”: {“avg”: “field”: “salary”} } } } } } }

More Aggregations • Histogram • Date Histogram • Date Range
• Filter/Filters • Missing • Geo Distance • Nested • ...

Best Practices

Indexing • Use bulk indexing APIs. • Tune your bulk
size 5-10MB. • Partitions your time series data by time period (monthly, weekly, daily). • Use aliases for your indices. • Turn off refresh, replicas while indexing. Turn on once it’s done • Multiple shards for parallel indexing. • Multiple replicas for parallel reading.

Mapping • Disable _all field • Keep _source field, do
not store any field. • Use not_analyzed if possible

Query • Use filters instead of queries if possible. •
Consider orders and scope of your filters. • Do not use string query. • Do not load too many results with single query, use scroll API instead.

Kibana for Discovery, Visualization

Sense for Query

Marvel for Monitoring

Elasticsearch for Data Engineers

Elasticsearch for Data Engineers

More Decks by Duy Do

Other Decks in Technology

Featured

Transcript