Ingest Node: (re)Indexing and Enriching Documents within Elasticsearch

When ingesting data into Elasticsearch, sometimes only simple transforms need to be performed on the data prior to indexing. Enter Ingest Node: a new node type that lets you do just that. This talk introduces Ingest Node and shows how to integrate it with the rest of the Elastic Stack. It also covers the reindex API, which can be combined with ingest pipelines to modify data while reindexing.

Luca Cavanna

April 19, 2016

Transcript

  1. 2.
  2. 5.
  3. 6.

    Logstash common setup

    127.0.0.1 - - [19/Apr/2016:12:00:00 +0200] "GET /robots.txt HTTP/1.1" 200 68
    127.0.0.1 - - [19/Apr/2016:12:00:01 +0200] "GET /cgi-bin/try/ HTTP/1.1" 200 3395
    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
    127.0.0.1 - - [19/Apr/2016:12:00:15 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:18 +2000] "GET /favicon.ico HTTP/1.1" 200 3638

    message
  4. 7.

    Ingest node setup

    127.0.0.1 - - [19/Apr/2016:12:00:00 +0200] "GET /robots.txt HTTP/1.1" 200 68
    127.0.0.1 - - [19/Apr/2016:12:00:01 +0200] "GET /cgi-bin/try/ HTTP/1.1" 200 3395
    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
    127.0.0.1 - - [19/Apr/2016:12:00:15 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:18 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
  5. 8.

    Filebeat: collect and ship

    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] \"GET /not_found/ HTTP/1.1\" 404 7218" }
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] \"GET /favicon.ico HTTP/1.1\" 200 3638" }
  6. 9.

    Elasticsearch: enrich and index

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }

    {
      "request" : "/",
      "auth" : "-",
      "ident" : "-",
      "verb" : "GET",
      "@timestamp" : "2016-04-19T10:00:04.000Z",
      "response" : "200",
      "bytes" : "24",
      "clientip" : "127.0.0.1",
      "httpversion" : "1.1",
      "rawrequest" : null,
      "timestamp" : "19/Apr/2016:12:00:04 +0200"
    }
  7. 11.
  8. 12.

    Define a pipeline

    PUT /_ingest/pipeline/apache-log
    {
      "processors" : [
        { "grok" : { "field": "message", "pattern": "%{COMMONAPACHELOG}" } },
        { "date" : { "match_field" : "timestamp", "match_formats" : ["dd/MMM/YYYY:HH:mm:ss Z"] } },
        { "remove" : { "field" : "message" } }
      ]
    }
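To make the three steps of the pipeline concrete, here is a minimal Python sketch that imitates grok, date and remove on a single document. It is not Elasticsearch code: the regular expression is a simplified, hand-written stand-in for `%{COMMONAPACHELOG}`, and the function names (`grok`, `date`, `remove`, `run_pipeline`) are made up for this illustration.

```python
import re
from datetime import datetime, timezone

# Simplified stand-in for %{COMMONAPACHELOG}; the real grok pattern is more permissive.
COMMON_LOG = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def grok(doc):
    """Extract named fields from the raw "message" field."""
    match = COMMON_LOG.match(doc["message"])
    if match is None:
        raise ValueError("message does not match the log pattern")
    doc.update(match.groupdict())
    return doc


def date(doc):
    """Parse "timestamp" and store it as a UTC "@timestamp"."""
    parsed = datetime.strptime(doc["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    utc = parsed.astimezone(timezone.utc)
    doc["@timestamp"] = utc.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return doc


def remove(doc):
    """Drop the raw "message" field once it has been parsed."""
    del doc["message"]
    return doc


def run_pipeline(doc, processors=(grok, date, remove)):
    doc = dict(doc)  # work on a copy, leave the caller's document alone
    for processor in processors:
        doc = processor(doc)
    return doc


indexed = run_pipeline(
    {"message": '127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24'}
)
print(indexed["@timestamp"])  # 2016-04-19T10:00:04.000Z
```

As in the slides, every extracted value stays a string (`"bytes"` is `"24"`), because nothing in this pipeline converts types.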
  9. 13.

    Index a document

    Provide the id of the pipeline to execute

    PUT /logs/apache/1?pipeline=apache-log
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }
  10. 14.

    What has actually been indexed

    GET /logs/apache/1
    {
      "request" : "/",
      "auth" : "-",
      "ident" : "-",
      "verb" : "GET",
      "@timestamp" : "2016-04-19T10:00:04.000Z",
      "response" : "200",
      "bytes" : "24",
      "clientip" : "127.0.0.1",
      "httpversion" : "1.1",
      "rawrequest" : null,
      "timestamp" : "19/Apr/2016:12:00:04 +0200"
    }
  11. 15.

    Pipeline management

    Create, Read, Update & Delete

    PUT /_ingest/pipeline/apache-log { … }
    GET /_ingest/pipeline/apache-log
    GET /_ingest/pipeline/*
    DELETE /_ingest/pipeline/apache-log
  12. 16.

    grok, remove, attachment, convert, uppercase, foreach, trim, append, gsub, set, split, fail, geoip, join, lowercase, rename, date
  13. 17.

    Grok processor

    Extracts structured fields out of a single text field

    { "grok": { "field": "message", "pattern": "%{DATE:date}" } }
  14. 18.

    Mutate processors

    set, remove, rename, convert, gsub, split, join, lowercase, uppercase, trim, append

    { "remove": { "field": "message" } }
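The mutate processors each make one small change to a field. The Python sketch below imitates four of them (convert, rename, gsub, split) on a plain dictionary. The parameter names and the sample document are assumptions chosen for this illustration, and the sketch splits on a literal separator, which may differ from how the actual split processor treats its separator.

```python
import re


def convert(doc, field, type_):
    """Change the type of a field, e.g. the string "24" to the integer 24."""
    casts = {"integer": int, "float": float, "string": str}
    doc[field] = casts[type_](doc[field])


def rename(doc, field, to):
    """Move a value from one field name to another."""
    doc[to] = doc.pop(field)


def gsub(doc, field, pattern, replacement):
    """Replace every regex match inside a string field."""
    doc[field] = re.sub(pattern, replacement, doc[field])


def split(doc, field, separator):
    """Turn a string field into a list (literal separator in this sketch)."""
    doc[field] = doc[field].split(separator)


# Hypothetical document, shaped like the output of the apache-log pipeline.
doc = {"bytes": "24", "verb": "GET", "request": "/cgi-bin/try/"}
convert(doc, "bytes", "integer")
rename(doc, "verb", "method")
gsub(doc, "request", r"^/|/$", "")  # strip the leading and trailing slash
split(doc, "request", "/")
print(doc)  # {'bytes': 24, 'request': ['cgi-bin', 'try'], 'method': 'GET'}
```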
  15. 19.

    Date processor

    Parses a date from a string

    { "date": { "field": "timestamp", "match_formats": ["YYYY"] } }
  16. 20.

    Geoip processor

    Adds information about the geographical location of IP addresses

    { "geoip": { "field": "ip" } }
  17. 21.

    Foreach processor

    Do something for every element of an array

    {
      "foreach": {
        "field" : "values",
        "processors" : [
          { "uppercase" : { "field" : "_value" } }
        ]
      }
    }
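The foreach configuration above applies its inner processors to each array element in turn, with `_value` standing for the current element. A rough Python equivalent, assuming a hypothetical document whose `values` field is a list of strings:

```python
def uppercase(value):
    return value.upper()


def foreach(doc, field, processors):
    """Run every processor on every element of doc[field]."""
    result = dict(doc)
    values = []
    for value in doc[field]:
        for processor in processors:
            value = processor(value)
        values.append(value)
    result[field] = values
    return result


doc = {"values": ["foo", "bar", "baz"]}  # hypothetical input
print(foreach(doc, "values", [uppercase]))  # {'values': ['FOO', 'BAR', 'BAZ']}
```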
  18. 22.

    Fail processor

    Raises an exception with a configurable message

    { "fail": { "message": "custom error" } }
  19. 23.

    Plugins

    Introducing new processors is as easy as writing a plugin

    { "your_plugin": { … } }
  20. 25.

    Processors shown: grok, date, remove

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
  21. 26.

    Processors shown: grok, date, remove

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

    400 Bad Request
    unable to parse date [19/Apr/2016:12:00:00 +040]
  22. 27.

    On failure processors at the pipeline level

    Processors shown: grok, date, remove, set

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
  23. 28.

    On failure processors at the pipeline level

    Processors shown: grok, date, remove, set

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

    200 OK
  24. 29.

    On failure processors at the processor level

    Processors shown: grok, date, remove, set, remove

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
  25. 30.

    On failure processors at the processor level

    Processors shown: grok, date, remove, set, remove

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

    200 OK
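The difference between the two levels of failure handling can be sketched in Python. This is an illustration under assumptions, not Elasticsearch code: grok is skipped (the document already carries a `timestamp` field), the `error` field written by `set_error` is hypothetical, and `run` is a made-up helper. A handler attached to a single processor lets the pipeline carry on with the next processor, while a pipeline-level handler runs instead of the remaining processors.

```python
from datetime import datetime, timezone


class ProcessorError(Exception):
    """Raised by a processor that cannot handle the document."""


def parse_date(doc):
    # Stand-in for the date processor; "+040" is not a valid +HHMM offset.
    try:
        parsed = datetime.strptime(doc["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        raise ProcessorError("unable to parse date [%s]" % doc["timestamp"])
    doc["@timestamp"] = parsed.astimezone(timezone.utc).isoformat()


def remove_message(doc):
    doc.pop("message", None)


def set_error(doc):
    doc["error"] = "date could not be parsed"  # hypothetical field and value


def run(doc, steps, on_failure=None):
    """steps is a list of (processor, processor_level_handlers) pairs."""
    doc = dict(doc)
    try:
        for processor, handlers in steps:
            try:
                processor(doc)
            except ProcessorError:
                if handlers is None:
                    raise  # no processor-level handler: escalate
                for handler in handlers:
                    handler(doc)  # handled: continue with the next processor
    except ProcessorError:
        if on_failure is None:
            raise  # nothing handles the failure: the request is rejected
        for handler in on_failure:
            handler(doc)  # handled: remaining processors are skipped
    return doc


bad = {
    "message": '127.0.0.1 - - [19/Apr/2016:12:00:00 +040] "GET / HTTP/1.1" 200 24',
    "timestamp": "19/Apr/2016:12:00:00 +040",
}

pipeline_level = run(
    bad, [(parse_date, None), (remove_message, None)], on_failure=[set_error]
)
processor_level = run(bad, [(parse_date, [set_error]), (remove_message, None)])
print(sorted(pipeline_level))   # ['error', 'message', 'timestamp']
print(sorted(processor_level))  # ['error', 'timestamp']
```

With no handler at either level, `run` raises `ProcessorError`, which corresponds to the `400 Bad Request` case shown earlier.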
  26. 32.

    Default scenario

    Cluster (CS: Cluster State):
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS

    logs index: 3 primary shards, 1 replica each

    All nodes are equal:
    - node.data: true
    - node.master: true
    - node.ingest: true
  27. 33.

    Default scenario

    Pre-processing on the coordinating node

    Cluster:
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS

    index request for shard 3

    All nodes are equal:
    - node.data: true
    - node.master: true
    - node.ingest: true
  28. 34.

    Default scenario

    Indexing on the primary shard

    Cluster:
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS

    index request for shard 3

    All nodes are equal:
    - node.data: true
    - node.master: true
    - node.ingest: true
  29. 35.

    Default scenario

    Indexing on the replica shard

    Cluster:
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS

    index request for shard 3

    All nodes are equal:
    - node.data: true
    - node.master: true
    - node.ingest: true
  30. 36.

    Ingest dedicated nodes

    Cluster:
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS
    node4: CS
    node5: CS

    Ingest dedicated nodes (node4, node5):
    - node.data: false
    - node.master: false
    - node.ingest: true

    Other nodes (node1, node2, node3):
    - node.data: true
    - node.master: true
    - node.ingest: false
  31. 37.

    Ingest dedicated nodes

    Forward request to an ingest node

    Cluster:
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS
    node4: CS
    node5: CS

    index request for shard 3
  32. 38.

    Ingest dedicated nodes

    Pre-processing on the ingest node

    Cluster:
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS
    node4: CS
    node5: CS

    index request for shard 3
  33. 39.

    Ingest dedicated nodes

    Indexing on the primary shard

    Cluster:
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS
    node4: CS
    node5: CS

    index request for shard 3
  34. 40.

    Ingest dedicated nodes

    Indexing on the replica shard

    Cluster:
    Client
    node1: logs 2P, logs 3R, CS
    node2: logs 3P, logs 1R, CS
    node3: logs 1P, logs 2R, CS
    node4: CS
    node5: CS

    index request for shard 3
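The two cluster scenarios can be summarised as a short Python sketch that lists the steps an index request for shard 3 goes through. The shard layout is taken from the slides; everything else is an assumption made for illustration: the `Node` class and `index_steps` function are hypothetical, the client is assumed to send its request to node1, and the sketch simply picks the first ingest-capable node when forwarding, which is not necessarily how the cluster chooses one.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    ingest: bool
    primaries: set = field(default_factory=set)  # primary shards of "logs"
    replicas: set = field(default_factory=set)   # replica shards of "logs"


def index_steps(receiving, nodes, shard):
    """Return the hops for one index request that carries a pipeline."""
    steps = []
    ingest_node = receiving
    if not receiving.ingest:
        # The receiving node cannot run pipelines: hand the request over.
        ingest_node = next(node for node in nodes if node.ingest)
        steps.append(f"forward from {receiving.name} to {ingest_node.name}")
    steps.append(f"pre-process on {ingest_node.name}")
    primary = next(node for node in nodes if shard in node.primaries)
    steps.append(f"index on primary shard {shard} ({primary.name})")
    for node in nodes:
        if shard in node.replicas:
            steps.append(f"index on replica shard {shard} ({node.name})")
    return steps


def cluster(data_nodes_ingest, with_dedicated):
    nodes = [
        Node("node1", data_nodes_ingest, {2}, {3}),
        Node("node2", data_nodes_ingest, {3}, {1}),
        Node("node3", data_nodes_ingest, {1}, {2}),
    ]
    if with_dedicated:
        nodes += [Node("node4", True), Node("node5", True)]
    return nodes


default = cluster(data_nodes_ingest=True, with_dedicated=False)
dedicated = cluster(data_nodes_ingest=False, with_dedicated=True)

for step in index_steps(default[0], default, shard=3):
    print("default:  ", step)
for step in index_steps(dedicated[0], dedicated, shard=3):
    print("dedicated:", step)
```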
  35. 43.

    Bulk api

    PUT /logs/_bulk
    { "index": { "_type": "apache", "_id": "1", "pipeline": "apache-log" } }\n
    { "message" : "…" }\n
    { "index": {"_type": "mysql", "_id": "1", "pipeline": "mysql-log" } }\n
    { "message" : "…" }\n
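A bulk request body is newline-delimited JSON: one action line, one source line, each terminated by `\n`. The Python sketch below builds such a body for the two documents on the slide, each with its own pipeline. The `bulk_body` helper is hypothetical, the elided messages are kept as `…`, and actually sending the body to a cluster is left out.

```python
import json


def bulk_body(actions):
    """Serialise (metadata, source) pairs as newline-delimited JSON."""
    lines = []
    for metadata, source in actions:
        lines.append(json.dumps({"index": metadata}, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body([
    ({"_type": "apache", "_id": "1", "pipeline": "apache-log"}, {"message": "…"}),
    ({"_type": "mysql", "_id": "1", "pipeline": "mysql-log"}, {"message": "…"}),
])
print(body, end="")
```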
  36. 44.

    Reindex api

    Scan/scroll & bulk indexing made easy

    POST /_reindex
    {
      "source": { "index": "logs", "type": "apache" },
      "dest": { "index": "apache-logs", "pipeline" : "apache-log" }
    }
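Conceptually, reindexing with a pipeline reads the source documents page by page, runs each one through the pipeline, and writes the results to the destination in batches. The in-memory Python sketch below shows that loop. It makes no calls to Elasticsearch; the `reindex` function, the `trim_message` pipeline, the batch size and the sample documents are all hypothetical.

```python
def trim_message(doc):
    # Hypothetical one-step pipeline: strip surrounding whitespace.
    doc["message"] = doc["message"].strip()
    return doc


def reindex(source, pipeline, batch_size=2):
    """Copy every document from source to a new index, applying the pipeline."""
    dest = {}
    doc_ids = sorted(source)
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]  # stands in for one scroll page
        for doc_id in batch:                       # stands in for one bulk request
            dest[doc_id] = pipeline(dict(source[doc_id]))
    return dest


logs = {
    "1": {"message": "  first  "},
    "2": {"message": "second "},
    "3": {"message": " third"},
}
apache_logs = reindex(logs, trim_message)
print(apache_logs["1"])  # {'message': 'first'}
```

The source index is left untouched, since each document is copied before the pipeline modifies it.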