
Ingest Node: Enriching Documents within Elasticsearch

Elastic Co
February 18, 2016


When ingesting data into Elasticsearch, sometimes only simple transforms need to be performed on the data prior to indexing. Enter Ingest Node: a new node type that will allow you to do just that! This talk will introduce you to Ingest Node and how to integrate it with the rest of the Elastic Stack.


Transcript

  1. A raw MySQL log line and the fields it contains:

     070917 16:29:01 21 Query select * from location
     (Time, thread_id, command_type, query_body)
  2. Logstash is made for this!

     input { … }
     filter {
       grok {
         match => { "message" => "^%{NUMBER:date} *%{TIME:time} *%{WORD:command_type} *%{DATA:command}" }
       }
       mutate { remove_field => ["message"] }
     }
     output { … }
  3. Indexing [An Enriched] Document: specifying an ingest pipeline to execute before indexing

     PUT localhost:9200/logs/mysql/1?pipeline=my_mysql_pipeline
     {
       "message": "070917 16:29:01 21 Query select * from location"
     }
  4. Indexing [An Enriched] Document: the document as it is being indexed

     PUT localhost:9200/logs/mysql/1
     {
       "timestamp": "2007-09-17T16:29:01-08:00",
       "thread_id": "21",
       "command_type": "Query",
       "command": "select * from location"
     }
  5. Ingest Node: enrich documents before indexing

     • Enabled by default in Elasticsearch 5.0
     • All nodes can become ingest nodes (configurable per node; see the sketch below)
     • Pipelines run on any ingest node
     • Logstash filters and ingest processors will be compatible with one another
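     Whether a node accepts ingest work is a node setting; a minimal elasticsearch.yml sketch, assuming the 5.x setting name:

     # ingest is enabled by default in 5.0; set to false
     # to opt this node out of running pipelines
     node.ingest: false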
  6. Examples and Use-Cases

     • Beats: filebeat, with no middleman, ships logs directly into Elasticsearch
     • Reindex: a misnamed field can be bulk-renamed using the Reindex API with a pipeline defined (sketched below)
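     A sketch of that reindex use-case; the index and pipeline names here are hypothetical. The Reindex API accepts a pipeline on the destination, so every copied document is preprocessed on the way in:

     POST _reindex
     {
       "source": { "index": "logs-old" },
       "dest": {
         "index": "logs-new",
         "pipeline": "rename_misnamed_field"
       }
     }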
  7. The Pipeline

     • A description of a sequence of processors to execute against each document that is to be indexed
     • Defined in JSON
     • Run before the documents are indexed into Elasticsearch
  8. Pipeline Management: create, retrieve, update, delete

     PUT _ingest/pipeline/pipeline-name
     GET _ingest/pipeline/pipeline-name
     GET _ingest/pipeline/*
     DELETE _ingest/pipeline/pipeline-name
  9. Example Pipeline: extract mysql fields from the `message` field and then remove it from the document

     {
       "description": "mysql pipeline",
       "processors": [
         { "grok": { "field": "message", "pattern": "…" } },
         { "remove": { "field": "message" } }
       ]
     }
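     Stored with the management API from the previous slide, under the name used in the earlier indexing example:

     PUT _ingest/pipeline/my_mysql_pipeline
     { …the pipeline definition above… }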
  10. Mutate Processors

     set, convert, split, uppercase, lowercase, remove, rename, append, and more!
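     As a minimal sketch (the field names and values are hypothetical), two of these processors inside a pipeline's processors array:

     { "set": { "field": "source_type", "value": "mysql" } },
     { "lowercase": { "field": "command_type" } }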
  11. Grok Processor: parses new field values out of pattern-matched substrings

     { "grok": { "field": "field1", "pattern": "%{DATE:date}" } }
  12. Geoip Processor: adds information about the geographical location of IP addresses

     { "geoip": { "field": "my_ip_field" } }
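     Indicatively, the enriched document then carries a geoip object. The exact fields depend on the GeoIP database, so this output shape is an assumption:

     {
       "my_ip_field": "8.8.8.8",
       "geoip": {
         "country_iso_code": "US",
         "location": { "lat": 37.751, "lon": -97.822 }
       }
     }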
  13. Introducing new processors is as easy as writing a new plugin!

     { "your_plugin_here": { "field": "field1", … } }
  14. Handling Failure: data is not clean; pipelines clean it

     • Sometimes the input documents do not match what is expected. Failure is sometimes expected, and can be handled
     • Use the on_failure parameter to define processors to run when the parent processor or pipeline throws an exception (sketched below)
     • on_failure blocks behave like little pipelines!
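     A minimal sketch of a pipeline-level on_failure block; the parse_failure flag field is hypothetical:

     {
       "description": "mysql pipeline with failure handling",
       "processors": [
         { "grok": { "field": "message", "pattern": "…" } }
       ],
       "on_failure": [
         { "set": { "field": "parse_failure", "value": true } }
       ]
     }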
  15. Investigating Pipelines: peek into the behavior of a pipeline

     • on_failure is great for allowing you to catch exceptions
     • But what if you do not know what exceptions can occur?
     • What if you are unsure what a processor will do?
     • What if there was a way to test a pipeline against your documents before indexing?
  16. _simulate: running pipelines without indexing

     • The Simulate API can be used to run pipelines against input documents
     • These input documents are not indexed
     • Use the verbose flag to see the output of each processor within a pipeline, not just the end result
     • Useful for debugging the output of specific processors within a pipeline
  17. _simulate request and response

     Request:
     POST _ingest/pipeline/_simulate
     {
       "pipeline": { … },
       "docs": [
         { "_source": { "message": "transform!" } },
         { "_source": { "message": "me too!" } }
       ]
     }

     Response:
     { "docs": [ {…}, {…} ] }
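     The verbose variant mentioned on the previous slide is the same request with a query flag; the response then reports each processor's result rather than only the final documents:

     POST _ingest/pipeline/_simulate?verbose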
  18. Ingest node internals

     • Ingest node is part of core Elasticsearch
     • The infrastructure
     • Many processors
     • Some processors are available as plugins
  19. Ingest node internals: the infrastructure

     • Ingest needs to intercept documents before they are indexed, and preprocess them
     • But where should the documents be intercepted?
     • The cost of preprocessing, and the ability to isolate that cost, are both important
  20. Ingest node internals: the infrastructure

     [Diagram: a four-node cluster, with primary (P) and replica (R) shards of index1 and index2 spread across the nodes]
  21. Ingest node internals: the infrastructure

     [Diagram: the same cluster, with every node holding a copy of the cluster state (CS), which is where pipeline definitions are stored]
  22. Ingest node in production: performance

     • The type of operations your pipelines are going to perform
     • The number of processors in your pipelines
     • The number of fields your pipelines add or remove
     • The number of nodes ingest can run on
     • So it depends!
  23. Ingest node in production: default architecture

     [Diagram: the same four-node cluster holding the shards of index1 and index2; every node also runs ingest]
  24. Ingest node in production: dedicated ingest nodes

     [Diagram: four data+master (d+m) nodes holding the shards, plus four dedicated ingest nodes in front of them]
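     A minimal elasticsearch.yml sketch for one of the dedicated ingest nodes, assuming the 5.x setting names:

     # dedicated ingest node: no master or data duties
     node.master: false
     node.data: false
     node.ingest: true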
  25. Dedicated ingest node in production: benchmark setup

     • Indexing apache logs
     • Bulk indexing with 5 concurrent threads
     • 2-node cluster: one data+master node and one ingest-only node
  26. Dedicated ingest node in production: benchmark results

     [Chart: relative indexing throughput, 0% to 100%, at document counts from 0.3M to 9.6M, for four pipelines: empty set, grok, grok + geoip, grok + remove]

  31. Filebeat demo: Date processor

     {
       "date": {
         "match_field": "timestamp",
         "target_field": "timestamp",
         "match_formats": ["dd/MMM/YYYY:HH:mm:ss Z"]
       }
     }
  32. Filebeat demo: Convert processor with on_failure

     {
       "convert": {
         "field": "bytes",
         "type": "integer",
         "on_failure": [
           { "set": { "field": "bytes", "value": 0 } }
         ]
       }
     }
  34. Please attribute Elastic with a link to elastic.co

     Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.