Indexing A Document
PUT localhost:9200/logs/mysql/1
{
  "message": "070917 16:29:01 21 Query select * from location"
}
070917 16:29:01   21         Query         select * from location
time              thread_id  command_type  query_body
Logstash is made for this!
input { … }

filter {
  grok {
    match => {
      "message" => "^%{NUMBER:date} *%{TIME:time} *%{NUMBER:thread_id} *%{WORD:command_type} *%{DATA:command}"
    }
  }
  mutate { remove_field => ["message"] }
}

output { … }
Ingest Pipeline
[Diagram: an ingest pipeline is just the filters (grok, remove); the inputs and outputs live outside of it]
Indexing [An Enriched] Document
PUT localhost:9200/logs/mysql/1?pipeline=my_mysql_pipeline
{
  "message": "070917 16:29:01 21 Query select * from location"
}
specifying an ingest pipeline to execute before indexing
Indexing [An Enriched] Document
PUT localhost:9200/logs/mysql/1
{
  "timestamp": "2007-09-17T16:29:01-08:00",
  "thread_id": "21",
  "command_type": "Query",
  "command": "select * from location"
}
the document as it is actually indexed, after the pipeline has run
Ingest Node
• Enabled by default in Elasticsearch 5.0
• All nodes can become ingest nodes
• Pipelines run on any ingest node
• Logstash filters and Ingest processors will be compatible with one another
Enrich documents before indexing
Examples and Use-Cases
Beats
• filebeat: no middleman, ship logs directly into Elasticsearch
Reindex
• misnamed field: bulk-rename using the Reindex API with a pipeline defined
The Pipeline
• A description of a sequence of processors to execute against each document that is to be indexed
• Defined in JSON
• Runs before the documents are indexed into Elasticsearch
Pipeline Management
Create, Retrieve, Update, Delete
PUT _ingest/pipeline/pipeline-name
GET _ingest/pipeline/pipeline-name
GET _ingest/pipeline/*
DELETE _ingest/pipeline/pipeline-name
Example Pipeline
extract mysql fields from the `message` field, then remove it from the document
{
  "description": "mysql pipeline",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [ "…" ]
      }
    },
    {
      "remove": {
        "field": "message"
      }
    }
  ]
}
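Putting the management API and the example together, a minimal sketch of registering this pipeline under the name used in the earlier ?pipeline= request (my_mysql_pipeline), with the grok pattern still elided:

PUT _ingest/pipeline/my_mysql_pipeline
{
  "description": "mysql pipeline",
  "processors": [
    { "grok": { "field": "message", "patterns": [ "…" ] } },
    { "remove": { "field": "message" } }
  ]
}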
The Processors
Grok
Rename
Set
Convert
Attachment
Geoip
Date
Grok Processor
Parses new field values out of pattern-matched substrings
{
  "grok": {
    "field": "field1",
    "patterns": [ "%{DATE:date}" ]
  }
}
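As an illustration (a sketch; field1 and its value are made up, and %{DATE} matches US- or EU-style dates), the processor above would turn the first document into the second:

Input:  { "field1": "06/24/2016" }
Result: { "field1": "06/24/2016", "date": "06/24/2016" }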
Geoip Processor
Adds information about the geographical location of IP addresses
{
  "geoip": {
    "field": "my_ip_field"
  }
}
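For example (a sketch; the output field names follow the processor's defaults, and the exact values depend on the bundled GeoIP database):

Input:  { "my_ip_field": "8.8.8.8" }
Result: { "my_ip_field": "8.8.8.8",
          "geoip": { "country_iso_code": "US", "location": { "lat": 37.751, "lon": -97.822 } } }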
Introducing new processors is as easy as writing a new plugin!
{
  "your_plugin_here": {
    "field": "field1",
    …
  }
}
Pipeline Error Handling
Handling Failure
• Sometimes the input documents do not match what is expected. Failure is sometimes expected, and can be handled.
• Use the on_failure parameter to define processors to run when the parent processor or pipeline throws an exception (a sketch follows below)
• on_failure blocks behave like little pipelines!
Data is not clean; pipelines clean it up
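A minimal sketch of such a guard (the foo/bar field names are hypothetical): if rename fails because foo is missing, the on_failure block tags the document instead of failing the indexing request.

{
  "description": "pipeline with a failure guard",
  "processors": [
    {
      "rename": {
        "field": "foo",
        "target_field": "bar",
        "on_failure": [
          { "set": { "field": "error", "value": "could not rename foo" } }
        ]
      }
    }
  ]
}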
Pipeline With Failure Guards
[Diagram, built up over four slides: a pipeline of grok → date → set processors, where a failing processor hands control to an on_failure guard that runs set → date instead]
Investigating Pipelines
• on_failure is great for allowing you to catch exceptions
• What if you do not know what exceptions can occur?
• What if you are unsure what a processor will do?
• What if there were a way to test a pipeline against your documents before indexing?
Peek into the behavior of a pipeline
Simulation
_simulate
• The Simulate API can be used to run pipelines against input documents
• These input documents are not indexed
• Use the verbose flag to see the output of each processor within a pipeline, not just the end result
• Useful for debugging the output of specific processors within a pipeline
running pipelines without indexing
Simulate Endpoints
POST _ingest/pipeline/pipeline-name/_simulate
POST _ingest/pipeline/_simulate
POST _ingest/pipeline/_simulate?verbose
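A sketch of an ad-hoc simulation: the pipeline definition is passed inline, the document reuses the mysql example from earlier, and nothing is indexed.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "mysql pipeline",
    "processors": [ … ]
  },
  "docs": [
    {
      "_source": {
        "message": "070917 16:29:01 21 Query select * from location"
      }
    }
  ]
}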
Ingest node internals
• Ingest node is part of core Elasticsearch
  • the infrastructure
  • many processors
• Some processors are available as plugins
Ingest node internals - the infrastructure
• Ingest needs to intercept documents before they get indexed, and preprocess them
• But where should the documents be intercepted?
• The cost of preprocessing, and the ability to isolate it, are important
Ingest node internals - other details
Ingest node in production
Ingest node in production - performance
• The type of operations your pipelines are going to perform
• The number of processors in your pipelines
• The number of fields your pipelines add or remove
• The number of nodes ingest can run on (see the configuration sketch below)
• So it depends!
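One way to control that last point (a sketch, using the Elasticsearch 5.x node-role settings in elasticsearch.yml) is to run dedicated ingest nodes by switching the other roles off:

node.master: false
node.data: false
node.ingest: true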
Please attribute Elastic with a link to elastic.co
Except where otherwise noted, this work is licensed under
http://creativecommons.org/licenses/by-nd/4.0/
Creative Commons and the double C in a circle are
registered trademarks of Creative Commons in the United States and other countries.
Third party marks and brands are the property of their respective holders.