
Ingest Node: Enriching Documents within Elasticsearch

Elastic Co
February 18, 2016


When ingesting data into Elasticsearch, sometimes only simple transforms need to be performed on the data prior to indexing. Enter Ingest Node: a new node type that will allow you to do just that! This talk will introduce you to Ingest Node and how to integrate it with the rest of the Elastic Stack.


Transcript

  1. A raw MySQL log line and the fields it contains:

     070917 16:29:01 21 Query select * from location
     (Time, thread_id, command_type, query_body)
  2. Logstash is made for this!

     input { … }
     filter {
       grok {
         match => { "message" => "^%{NUMBER:date} *%{TIME:time} *%{WORD:command_type} *%{DATA:command}" }
       }
       mutate { remove_field => ["message"] }
     }
     output { … }
  3. Indexing [An Enriched] Document: specifying an ingest pipeline to execute before indexing

     PUT localhost:9200/logs/mysql/1?pipeline=my_mysql_pipeline
     {
       "message": "070917 16:29:01 21 Query select * from location"
     }
  4. Indexing [An Enriched] Document: the document as it is being indexed

     PUT localhost:9200/logs/mysql/1
     {
       "timestamp": "2007-09-17T16:29:01-08:00",
       "thread_id": "21",
       "command_type": "Query",
       "command": "select * from location"
     }
  5. Ingest Node: enrich documents before indexing

     • Enabled by default in Elasticsearch 5.0
     • All nodes can become ingest nodes (configurable per node; see the sketch below)
     • Pipelines run on any ingest node
     • Logstash filters and ingest processors will be compatible with one another
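     Whether a node accepts ingest work is a node setting; a minimal elasticsearch.yml sketch, assuming the 5.x setting name:

     # ingest is enabled by default in 5.0; set to false
     # to opt this node out of running pipelines
     node.ingest: false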
  6. Examples and Use-Cases

     • Beats: filebeat, with no middleman, ships logs directly into Elasticsearch
     • Reindex: a misnamed field can be bulk-renamed using the Reindex API with a pipeline defined (sketched below)
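     A sketch of that reindex use-case; the index and pipeline names here are hypothetical. The Reindex API accepts a pipeline on the destination, so every copied document is preprocessed on the way in:

     POST _reindex
     {
       "source": { "index": "logs-old" },
       "dest": {
         "index": "logs-new",
         "pipeline": "rename_misnamed_field"
       }
     }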
  7. The Pipeline

     • A description of a sequence of processors to execute against each document that is to be indexed
     • Defined in JSON
     • Run before the documents are indexed into Elasticsearch
  8. Pipeline Management: create, retrieve, update, delete

     PUT _ingest/pipeline/pipeline-name
     GET _ingest/pipeline/pipeline-name
     GET _ingest/pipeline/*
     DELETE _ingest/pipeline/pipeline-name
  9. Example Pipeline: extract mysql fields from the `message` field and then remove it from the document

     {
       "description": "mysql pipeline",
       "processors": [
         { "grok": { "field": "message", "pattern": "…" } },
         { "remove": { "field": "message" } }
       ]
     }
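     Stored with the management API from the previous slide, under the name used in the earlier indexing example:

     PUT _ingest/pipeline/my_mysql_pipeline
     { …the pipeline definition above… }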
  10. Mutate Processors

     set, convert, split, uppercase, lowercase, remove, rename, append, and more!
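     As a minimal sketch (the field names and values are hypothetical), two of these processors inside a pipeline's processors array:

     { "set": { "field": "source_type", "value": "mysql" } },
     { "lowercase": { "field": "command_type" } }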
  11. Grok Processor: parses new field values out of pattern-matched substrings

     { "grok": { "field": "field1", "pattern": "%{DATE:date}" } }
  12. Geoip Processor: adds information about the geographical location of IP addresses

     { "geoip": { "field": "my_ip_field" } }
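     Indicatively, the enriched document then carries a geoip object. The exact fields depend on the GeoIP database, so this output shape is an assumption:

     {
       "my_ip_field": "8.8.8.8",
       "geoip": {
         "country_iso_code": "US",
         "location": { "lat": 37.751, "lon": -97.822 }
       }
     }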
  13. Introducing new processors is as easy as writing a new plugin!

     { "your_plugin_here": { "field": "field1", … } }
  14. Handling Failure: data is not clean; pipelines clean it

     • Sometimes the input documents do not match what is expected. Failure is sometimes expected, and can be handled
     • Use the on_failure parameter to define processors to run when the parent processor or pipeline throws an exception (sketched below)
     • on_failure blocks behave like little pipelines!
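     A minimal sketch of a pipeline-level on_failure block; the parse_failure flag field is hypothetical:

     {
       "description": "mysql pipeline with failure handling",
       "processors": [
         { "grok": { "field": "message", "pattern": "…" } }
       ],
       "on_failure": [
         { "set": { "field": "parse_failure", "value": true } }
       ]
     }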
  15. Investigating Pipelines: peek into the behavior of a pipeline

     • on_failure is great for allowing you to catch exceptions
     • But what if you do not know what exceptions can occur?
     • What if you are unsure what a processor will do?
     • What if there was a way to test a pipeline against your documents before indexing?
  16. _simulate: running pipelines without indexing

     • The Simulate API can be used to run pipelines against input documents
     • These input documents are not indexed
     • Use the verbose flag to see the output of each processor within a pipeline, not just the end result
     • Useful for debugging the output of specific processors within a pipeline
  17. _simulate request and response

     Request:
     POST _ingest/pipeline/_simulate
     {
       "pipeline": { … },
       "docs": [
         { "_source": { "message": "transform!" } },
         { "_source": { "message": "me too!" } }
       ]
     }

     Response:
     { "docs": [ {…}, {…} ] }
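     The verbose variant mentioned on the previous slide is the same request with a query flag; the response then reports each processor's result rather than only the final documents:

     POST _ingest/pipeline/_simulate?verbose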
  18. Ingest node internals

     • Ingest node is part of core Elasticsearch
     • The infrastructure
     • Many processors
     • Some processors are available as plugins
  19. Ingest node internals: the infrastructure

     • Ingest needs to intercept documents before they are indexed, and preprocess them
     • But where should the documents be intercepted?
     • The cost of preprocessing, and the ability to isolate that cost, are both important
  20. Ingest node internals: the infrastructure

     [Diagram: a four-node cluster, with primary (P) and replica (R) shards of index1 and index2 spread across the nodes]
  21. Ingest node internals: the infrastructure

     [Diagram: the same cluster, with every node holding a copy of the cluster state (CS), which is where pipeline definitions are stored]
  22. Ingest node in production: performance

     • The type of operations your pipelines are going to perform
     • The number of processors in your pipelines
     • The number of fields your pipelines add or remove
     • The number of nodes ingest can run on
     • So it depends!
  23. Ingest node in production: default architecture

     [Diagram: the same four-node cluster holding the shards of index1 and index2; every node also runs ingest]
  24. Ingest node in production: dedicated ingest nodes

     [Diagram: four data+master (d+m) nodes holding the shards, plus four dedicated ingest nodes in front of them]
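     A minimal elasticsearch.yml sketch for one of the dedicated ingest nodes, assuming the 5.x setting names:

     # dedicated ingest node: no master or data duties
     node.master: false
     node.data: false
     node.ingest: true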
  25. Dedicated ingest node in production: benchmark setup

     • Indexing apache logs
     • Bulk indexing with 5 concurrent threads
     • 2-node cluster: one data+master node and one ingest-only node
  26. Dedicated ingest node in production: benchmark results

     [Chart: relative indexing throughput, 0% to 100%, at document counts from 0.3M to 9.6M, for four pipelines: empty set, grok, grok + geoip, grok + remove]

  31. Filebeat demo: Date processor

     {
       "date": {
         "match_field": "timestamp",
         "target_field": "timestamp",
         "match_formats": ["dd/MMM/YYYY:HH:mm:ss Z"]
       }
     }
  32. Filebeat demo: Convert processor with on_failure

     {
       "convert": {
         "field": "bytes",
         "type": "integer",
         "on_failure": [
           { "set": { "field": "bytes", "value": 0 } }
         ]
       }
     }
  34. Please attribute Elastic with a link to elastic.co

     Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.