Indexing A Document
PUT localhost:9200/logs/mysql/1
{
  "message": "070917 16:29:01 21 Query select * from location"
}
070917 16:29:01   21         Query         select * from location
time              thread_id  command_type  query_body
Logstash is made for this!
input { … }

filter {
  grok {
    match => {
      "message" => "^%{NUMBER:date} *%{TIME:time} *%{NUMBER:thread_id} *%{WORD:command_type} *%{DATA:command}"
    }
  }
  mutate { remove_field => ["message"] }
}

output { … }
Ingest Pipeline
[Diagram: an ingest pipeline is just the filters (grok, remove); the inputs and outputs live outside of it]
Indexing [An Enriched] Document
PUT localhost:9200/logs/mysql/1?pipeline=my_mysql_pipeline
{
  "message": "070917 16:29:01 21 Query select * from location"
}
specifying an ingest pipeline to execute before indexing
Indexing [An Enriched] Document
PUT localhost:9200/logs/mysql/1
{
  "timestamp": "2007-09-17T16:29:01-08:00",
  "thread_id": "21",
  "command_type": "Query",
  "command": "select * from location"
}
the document as it is actually indexed, after the pipeline has run
Ingest Node
• Enabled by default in Elasticsearch 5.0
• All nodes can become ingest nodes
• Pipelines run on any ingest node
• Logstash filters and Ingest processors will be compatible with one another
Enrich documents before indexing
Examples and Use-Cases
Beats
• filebeat: no middleman, ship logs directly into Elasticsearch
Reindex
• misnamed field: bulk-rename using the Reindex API with a pipeline defined
The Pipeline
• A description of a sequence of processors to execute against each document that is to be indexed
• Defined in JSON
• Runs before the documents are indexed into Elasticsearch
Pipeline Management
Create, Retrieve, Update, Delete
PUT _ingest/pipeline/pipeline-name
GET _ingest/pipeline/pipeline-name
GET _ingest/pipeline/*
DELETE _ingest/pipeline/pipeline-name
Example Pipeline
extract mysql fields from the `message` field, then remove it from the document
{
  "description": "mysql pipeline",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [ "…" ]
      }
    },
    {
      "remove": {
        "field": "message"
      }
    }
  ]
}
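Putting the management API and the example together, a minimal sketch of registering this pipeline under the name used in the earlier ?pipeline= request (my_mysql_pipeline), with the grok pattern still elided:

PUT _ingest/pipeline/my_mysql_pipeline
{
  "description": "mysql pipeline",
  "processors": [
    { "grok": { "field": "message", "patterns": [ "…" ] } },
    { "remove": { "field": "message" } }
  ]
}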
The Processors
Grok
Rename
Set
Convert
Attachment
Geoip
Date
Grok Processor
Parses new field values out of pattern-matched substrings
{
  "grok": {
    "field": "field1",
    "patterns": [ "%{DATE:date}" ]
  }
}
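As an illustration (a sketch; field1 and its value are made up, and %{DATE} matches US- or EU-style dates), the processor above would turn the first document into the second:

Input:  { "field1": "06/24/2016" }
Result: { "field1": "06/24/2016", "date": "06/24/2016" }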
Geoip Processor
Adds information about the geographical location of IP addresses
{
  "geoip": {
    "field": "my_ip_field"
  }
}
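For example (a sketch; the output field names follow the processor's defaults, and the exact values depend on the bundled GeoIP database):

Input:  { "my_ip_field": "8.8.8.8" }
Result: { "my_ip_field": "8.8.8.8",
          "geoip": { "country_iso_code": "US", "location": { "lat": 37.751, "lon": -97.822 } } }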
Introducing new processors is as easy as writing a new plugin!
{
  "your_plugin_here": {
    "field": "field1",
    …
  }
}
Pipeline Error Handling
Handling Failure
• Sometimes the input documents do not match what is expected. Failure is sometimes expected, and can be handled.
• Use the on_failure parameter to define processors to run when the parent processor or pipeline throws an exception (a sketch follows below)
• on_failure blocks behave like little pipelines!
Data is not clean; pipelines clean it up
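A minimal sketch of such a guard (the foo/bar field names are hypothetical): if rename fails because foo is missing, the on_failure block tags the document instead of failing the indexing request.

{
  "description": "pipeline with a failure guard",
  "processors": [
    {
      "rename": {
        "field": "foo",
        "target_field": "bar",
        "on_failure": [
          { "set": { "field": "error", "value": "could not rename foo" } }
        ]
      }
    }
  ]
}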
Pipeline With Failure Guards
[Diagram, built up over four slides: a pipeline of grok → date → set processors, where a failing processor hands control to an on_failure guard that runs set → date instead]
Investigating Pipelines
• on_failure is great for allowing you to catch exceptions
• What if you do not know what exceptions can occur?
• What if you are unsure what a processor will do?
• What if there were a way to test a pipeline against your documents before indexing?
Peek into the behavior of a pipeline
Simulation
_simulate
• The Simulate API can be used to run pipelines against input documents
• These input documents are not indexed
• Use the verbose flag to see the output of each processor within a pipeline, not just the end result
• Useful for debugging the output of specific processors within a pipeline
running pipelines without indexing
Simulate Endpoints
POST _ingest/pipeline/pipeline-name/_simulate
POST _ingest/pipeline/_simulate
POST _ingest/pipeline/_simulate?verbose
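A sketch of an ad-hoc simulation: the pipeline definition is passed inline, the document reuses the mysql example from earlier, and nothing is indexed.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "mysql pipeline",
    "processors": [ … ]
  },
  "docs": [
    {
      "_source": {
        "message": "070917 16:29:01 21 Query select * from location"
      }
    }
  ]
}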
Ingest node internals
• Ingest node is part of core Elasticsearch
  • the infrastructure
  • many processors
• Some processors are available as plugins
Ingest node internals - the infrastructure
• Ingest needs to intercept documents before they get indexed, and preprocess them
• But where should the documents be intercepted?
• The cost of preprocessing, and the ability to isolate it, are important
Ingest node internals - other details
Ingest node in production
Ingest node in production - performance
• The type of operations your pipelines are going to perform
• The number of processors in your pipelines
• The number of fields your pipelines add or remove
• The number of nodes ingest can run on (see the configuration sketch below)
• So it depends!
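One way to control that last point (a sketch, using the Elasticsearch 5.x node-role settings in elasticsearch.yml) is to run dedicated ingest nodes by switching the other roles off:

node.master: false
node.data: false
node.ingest: true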
Please attribute Elastic with a link to elastic.co
Except where otherwise noted, this work is licensed under
http://creativecommons.org/licenses/by-nd/4.0/
Creative Commons and the double C in a circle are
registered trademarks of Creative Commons in the United States and other countries.
Third party marks and brands are the property of their respective holders.