Indexing and Parsing Documents into Elasticsearch with Ingest

Elastic Co
December 15, 2016

These slides are a version of the Ingest Node slides that were presented at Confoo Vancouver and at a TLV meetup in 2016.

Transcript

  1. (re)indexing and parsing documents within Elasticsearch @talevy

  2. The Elastic Stack

    www.elastic.co Copyright Elasticsearch BV 2016. Copying, publishing and/or distributing without written permission is strictly prohibited.
  3. Really Quick Intro: Elasticsearch

    index
    POST /myindex/books/1 {"title": "Elasticsearch in Action", "price": 35.69}
    POST /myindex/books/2 {"title": "Elasticsearch: The Definitive Guide", "price": 39.90}
    POST /myindex/books/3 {"title": "Relevant Search", "price": 39.61}

    query
    POST /myindex/_search { "query": { "bool": { "must": { "match": { "title": "elasticsearch" } }, "filter": { "range": { "price": { "lt": 40 } } } } } }
  4. Really Quick Intro: Logstash

    Pipeline
    input { kafka { topics => ["my-topic"] } }
    filter { mutate { remove_field => [ "field-a" ] } }
    output { elasticsearch { index => "my-kafka-index" } }
  5. Really Quick Intro: Beats

  6. Really Quick Intro: Kibana

  7. Why ingest node? Parsing Indexed Documents

  8. I just want to tail a file.

  9. Logstash: collect, enrich & transport

    The file -> input -> Filters (grok, date, mutate) -> output -> Elasticsearch
  10. Logstash common setup

    message
    127.0.0.1 - - [19/Apr/2016:12:00:00 +0200] "GET /robots.txt HTTP/1.1" 200 68
    127.0.0.1 - - [19/Apr/2016:12:00:01 +0200] "GET /cgi-bin/try/ HTTP/1.1" 200 3395
    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
    127.0.0.1 - - [19/Apr/2016:12:00:15 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:18 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
  11. Ingest node setup

    127.0.0.1 - - [19/Apr/2016:12:00:00 +0200] "GET /robots.txt HTTP/1.1" 200 68
    127.0.0.1 - - [19/Apr/2016:12:00:01 +0200] "GET /cgi-bin/try/ HTTP/1.1" 200 3395
    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
    127.0.0.1 - - [19/Apr/2016:12:00:15 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:18 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
  12. Filebeat: collect and ship

    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] \"GET /not_found/ HTTP/1.1\" 404 7218" }
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] \"GET /favicon.ico HTTP/1.1\" 200 3638" }
  13. Elasticsearch: parse and index

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }

    { "request" : "/", "auth" : "-", "ident" : "-", "verb" : "GET", "@timestamp" : "2016-04-19T10:00:04.000Z", "response" : "200", "bytes" : "24", "clientip" : "127.0.0.1", "httpversion" : "1.1", "rawrequest" : null, "timestamp" : "19/Apr/2016:12:00:04 +0200" }
  14. How does ingest node work?

  15. Ingest pipeline

    Pipeline: a set of processors
    document -> grok -> date -> remove -> enriched document
  16. Define a pipeline

    PUT /_ingest/pipeline/apache-log
    {
      "processors" : [
        { "grok" : { "field": "message", "patterns": ["%{COMMONAPACHELOG}"] } },
        { "date" : { "field" : "timestamp", "formats" : ["dd/MMM/YYYY:HH:mm:ss Z"] } },
        { "remove" : { "field" : "message" } }
      ]
    }
  17. Index a document

    Provide the id of the pipeline to execute
    PUT /logs/apache/1?pipeline=apache-log
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }
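As a rough illustration of what the apache-log pipeline does to this document, the three processors can be imitated in plain Python. This is only a sketch under assumptions: the regular expression is a simplified stand-in for %{COMMONAPACHELOG} (it does not produce rawrequest), and run_pipeline and the processor function names are made up for the example rather than Elasticsearch APIs.

```python
import re
from datetime import datetime, timezone

# Simplified, hypothetical stand-in for the %{COMMONAPACHELOG} grok pattern.
COMMON_LOG = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def grok(doc):
    """Extract structured fields out of the single text field 'message'."""
    match = COMMON_LOG.match(doc["message"])
    if match is None:
        raise ValueError("message does not match the log pattern")
    doc.update(match.groupdict())
    return doc


def date(doc):
    """Parse 'timestamp' and store it as a UTC ISO string in '@timestamp'."""
    parsed = datetime.strptime(doc["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    utc = parsed.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    doc["@timestamp"] = utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"
    return doc


def remove(doc):
    """Drop the raw 'message' field once it has been parsed."""
    del doc["message"]
    return doc


def run_pipeline(doc, processors):
    """Apply each processor in order, like a pipeline's processor list."""
    for processor in processors:
        doc = processor(doc)
    return doc


source = {"message": '127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24'}
indexed = run_pipeline(dict(source), [grok, date, remove])
print(indexed)
```

The order matters in the same way as in the pipeline definition: date depends on the timestamp field that grok extracts, and remove only runs after both have succeeded.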
  18. What has actually been indexed

    GET /logs/apache/1
    { "request" : "/", "auth" : "-", "ident" : "-", "verb" : "GET", "@timestamp" : "2016-04-19T10:00:04.000Z", "response" : "200", "bytes" : "24", "clientip" : "127.0.0.1", "httpversion" : "1.1", "rawrequest" : null, "timestamp" : "19/Apr/2016:12:00:04 +0200" }
  19. Pipeline management: Create, Read, Update & Delete

    PUT /_ingest/pipeline/apache-log { … }
    GET /_ingest/pipeline/apache-log
    GET /_ingest/pipeline/*
    DELETE /_ingest/pipeline/apache-log
  20. grok, remove, attachment, convert, uppercase, foreach, trim, append, gsub, set, split, fail, geoip, join, lowercase, rename, date
  21. Grok processor

    Extracts structured fields out of a single text field
    { "grok": { "field": "message", "patterns": ["%{DATE:date}"] } }
  22. Mutate processors

    set, remove, rename, convert, gsub, split, join, lowercase, uppercase, trim, append
    { "remove": { "field": "message" } }
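To make the mutate-style processors a little more concrete, here is a small Python sketch of what a few of them do to a document. The function names, argument names and the sample document are hypothetical, chosen for the example; they mirror the processor names on the slide but are not the Elasticsearch implementation.

```python
import re


def convert(doc, field, type_):
    """Cast a field to another type, e.g. the string "24" to the integer 24."""
    casts = {"integer": int, "float": float, "string": str}
    doc[field] = casts[type_](doc[field])


def trim(doc, field):
    doc[field] = doc[field].strip()


def uppercase(doc, field):
    doc[field] = doc[field].upper()


def split(doc, field, separator):
    """Turn a delimited string into a list."""
    doc[field] = doc[field].split(separator)


def append(doc, field, value):
    doc[field] = list(doc[field]) + [value]


def rename(doc, field, target_field):
    doc[target_field] = doc.pop(field)


def gsub(doc, field, pattern, replacement):
    """Replace every regex match inside a string field."""
    doc[field] = re.sub(pattern, replacement, doc[field])


# Hypothetical sample document, not taken from the slides.
doc = {"bytes": "24", "verb": " get ", "tags": "web,apache", "clientip": "127.0.0.1"}
convert(doc, "bytes", "integer")
trim(doc, "verb")
uppercase(doc, "verb")
split(doc, "tags", ",")
append(doc, "tags", "ingest")
rename(doc, "clientip", "client_ip")
gsub(doc, "client_ip", r"\.", "-")
print(doc)
```

Each function changes exactly one field in place, which is why these processors compose well when listed one after another in a pipeline.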
  23. Date processor

    Parses a date from a string
    { "date": { "field": "timestamp", "formats": ["YYYY"] } }
  24. Geoip processor

    Adds information about the geographical location of IP addresses
    { "geoip": { "field": "ip" } }
  25. Attachment Processor

    Parses structured documents like PDF, doc, etc.
    { "attachment": { "field": "my-doc-field" } }
  26. Open NLP Processor

    Runs entity recognition and stores the output in the JSON document before it is stored.
    { "opennlp": { "field": "my_field" } }
    { "my_field" : "Kobe Bryant was one of the best basketball players of all times. Not even Michael Jordan has ever scored 81 points in one game. Munich is really an awesome city, but New York is as well. Yesterday has been the hottest day of the year.", "entities" : { "locations" : [ "Munich", "New York" ], "dates" : [ "Yesterday" ], "names" : [ "Kobe Bryant", "Michael Jordan" ] } }
  27. Plugins

    Introducing new processors is as easy as writing a plugin
    https://www.elastic.co/blog/writing-your-own-ingest-processor-for-elasticsearch
    { "your_plugin": { … } }
  28. Error handling

  29. grok, date, remove

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
  30. grok, date, remove

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
    400 Bad Request: unable to parse date [19/Apr/2016:12:00:00 +040]
  31. on failure processors at the pipeline level

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
    grok, date, remove, set
  32. on failure processors at the pipeline level

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
    grok, date, remove, set
    200 OK
  33. on failure processors at the processor level

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
    grok, date, remove, set, remove
  34. on failure processors at the processor level

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }
    grok, date, remove, set, remove
    200 OK
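The difference between the two levels of failure handling can be sketched in Python. This is a minimal model, assuming the usual behaviour that a pipeline-level handler replaces the rest of the pipeline while a processor-level handler lets the pipeline continue with the next processor. The run helper, the (processor, handlers) tuple shape and the error field are all hypothetical names for the example.

```python
from datetime import datetime


def date(doc):
    # Raises ValueError for a malformed offset such as "+040".
    parsed = datetime.strptime(doc["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    doc["@timestamp"] = parsed.isoformat()


def remove(doc):
    doc.pop("message", None)


def set_error(doc):
    # Stand-in for a "set" processor used as a failure handler.
    doc["error"] = "unable to parse date"


def run(steps, doc, on_failure=None):
    """Run (processor, handlers) steps; handlers is None or a list of handlers."""
    try:
        for processor, handlers in steps:
            try:
                processor(doc)
            except Exception:
                if handlers is None:
                    raise
                # Processor level: handle the error, then keep going.
                for handler in handlers:
                    handler(doc)
    except Exception:
        if on_failure is None:
            raise  # nothing handles it: the request is rejected
        # Pipeline level: remaining processors are skipped.
        for handler in on_failure:
            handler(doc)
    return doc


def sample():
    return {
        "message": '127.0.0.1 - - [19/Apr/2016:12:00:00 +040] "GET / HTTP/1.1" 200 24',
        "timestamp": "19/Apr/2016:12:00:00 +040",
    }


pipeline_level = run([(date, None), (remove, None)], sample(), on_failure=[set_error])
processor_level = run([(date, [set_error]), (remove, None)], sample())
print(pipeline_level)
print(processor_level)
```

In this model the pipeline-level run keeps the raw message because remove never executes, while the processor-level run records the error and still removes the message.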
  35. Ingest node internals

  36. Default scenario

    cluster: Client; CS (Cluster State) on every node
    node1: logs 2P, logs 3R
    node2: logs 3P, logs 1R
    node3: logs 1P, logs 2R
    logs index: 3 primary shards, 1 replica each
    All nodes are equal:
    - node.data: true
    - node.master: true
    - node.ingest: true
  37. Default scenario

    index request for shard 3
    Pre-processing on the coordinating node
  38. Default scenario

    index request for shard 3
    Indexing on the primary shard
  39. Default scenario

    index request for shard 3
    Indexing on the replica shard
  40. Ingest dedicated nodes

    cluster: Client; CS on every node
    node1: logs 2P, logs 3R
    node2: logs 3P, logs 1R
    node3: logs 1P, logs 2R
    node1, node2, node3: node.data: true, node.master: true, node.ingest: false
    node4, node5: node.data: false, node.master: false, node.ingest: true
  41. Ingest dedicated nodes

    index request for shard 3
    Forward request to an ingest node
  42. Ingest dedicated nodes

    index request for shard 3
    Pre-processing on the ingest node
  43. Ingest dedicated nodes

    index request for shard 3
    Indexing on the primary shard
  44. Ingest dedicated nodes

    index request for shard 3
    Indexing on the replica shard
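The two scenarios above differ only in where pre-processing happens. A small Python model of that decision, assuming the node that receives the request pre-processes it when it is itself an ingest node and otherwise forwards it to some ingest node: how Elasticsearch actually picks among several ingest nodes is not stated in the slides, so taking the first one here is purely an assumption, as are the Node class and function name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    data: bool
    master: bool
    ingest: bool


def preprocessing_node(coordinator, cluster):
    """Return the node that would run the ingest pipeline for a request."""
    if coordinator.ingest:
        # Default scenario: pre-processing on the coordinating node.
        return coordinator
    ingest_nodes = [node for node in cluster if node.ingest]
    if not ingest_nodes:
        raise RuntimeError("no ingest node available in the cluster")
    # Dedicated scenario: forward to an ingest node (first one, by assumption).
    return ingest_nodes[0]


default_cluster = [Node(f"node{i}", True, True, True) for i in (1, 2, 3)]
dedicated_cluster = (
    [Node(f"node{i}", True, True, False) for i in (1, 2, 3)]
    + [Node(f"node{i}", False, False, True) for i in (4, 5)]
)

print(preprocessing_node(default_cluster[0], default_cluster).name)
print(preprocessing_node(dedicated_cluster[0], dedicated_cluster).name)
```

After pre-processing, the enriched document still goes to the primary shard and then the replica, exactly as in the slides; the model only covers the first hop.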
  45. Where can ingest pipelines be used?

  46. Beats

    output.elasticsearch.pipeline: '%{[fields.example]}'

    output.elasticsearch.pipelines:
      - pipeline: 'ok-pipeline'
        when.range:
          http.code: [200, 299]
      - pipeline: 'verybad-pipeline'
        when.range:
          http.code: [500, 999]
      - pipeline: 'default-pipeline'
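The conditional pipelines list reads as a set of rules checked in order. A Python sketch of that selection logic follows, assuming the first matching rule wins, that the [low, high] bounds are inclusive, and that a rule without a condition acts as the fallback. The RULES structure and the function names are illustrative only, not Beats code.

```python
# Rules mirroring the slide: (pipeline name, optional (field, low, high) range).
RULES = [
    ("ok-pipeline", ("http.code", 200, 299)),
    ("verybad-pipeline", ("http.code", 500, 999)),
    ("default-pipeline", None),
]


def get_path(event, dotted):
    """Look up a dotted field name such as 'http.code' in a nested dict."""
    value = event
    for key in dotted.split("."):
        value = value[key]
    return value


def select_pipeline(event, rules=RULES):
    """Return the pipeline of the first rule whose condition matches."""
    for pipeline, condition in rules:
        if condition is None:
            return pipeline
        field, low, high = condition
        if low <= get_path(event, field) <= high:
            return pipeline
    return None


for code in (200, 404, 503):
    print(code, select_pipeline({"http": {"code": code}}))
```

Under these assumptions a 404 falls through both ranges and ends up in default-pipeline.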
  47. Index API

    PUT /logs/apache/1?pipeline=apache-log
    { "message" : "…" }

  48. Bulk api

    PUT /logs/_bulk
    { "index": { "_type": "apache", "_id": "1", "pipeline": "apache-log" } }\n
    { "message" : "…" }\n
    { "index": {"_type": "mysql", "_id": "1", "pipeline": "mysql-log" } }\n
    { "message" : "…" }\n
  49. Reindex API

    Scan/scroll & bulk indexing made easy
    POST /_reindex
    { "source": { "index": "logs", "type": "apache" }, "dest": { "index": "apache-logs", "pipeline" : "apache-log" } }
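Conceptually, reindexing with a pipeline is "scroll through the source in batches, run each document through the pipeline, bulk-write the result to the destination". The Python sketch below models that loop entirely in memory; the lists standing in for indices, the batch size and the toy pipeline function are all assumptions for illustration and say nothing about how the Reindex API is implemented internally.

```python
from itertools import islice


def scroll(docs, size):
    """Yield the source documents in batches, like paging through a scroll."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def reindex(source_docs, dest_docs, pipeline, batch_size=2):
    """Copy every source document into dest_docs after applying the pipeline."""
    for batch in scroll(source_docs, batch_size):
        # Each batch plays the role of one bulk request to the destination.
        dest_docs.extend(pipeline(dict(doc)) for doc in batch)
    return dest_docs


def toy_pipeline(doc):
    # Hypothetical pipeline: derive a field, then drop the raw message.
    doc["length"] = len(doc["message"])
    del doc["message"]
    return doc


logs = [{"message": "a"}, {"message": "bb"}, {"message": "ccc"}]
apache_logs = reindex(logs, [], toy_pipeline)
print(apache_logs)
```

The source documents are copied before processing, so the original index contents stay untouched while the destination receives the enriched versions.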
  50. Reindex From Remote

    Great way to upgrade if you have the hardware for two
    POST /_reindex
    { "source": { "index": "logs", "type": "apache", "remote": { "host": "http://localhost:9200" } }, "dest": { "index": "apache-logs", "pipeline" : "apache-log" } }
  51. Go Get Elastic Stack v5!

    https://www.elastic.co/products
    https://writequit.org/org/es/presentations/whats-new-elasticsearch-5.0.html
    https://www.elastic.co/blog/elastic-stack-5-0-0-released
  52. Thank you