Elastic Co
December 15, 2016

Indexing and Parsing Documents into Elasticsearch with Ingest

These slides are a version of the Ingest Node slides that were presented at ConFoo Vancouver and at a TLV meetup in 2016.


Transcript

  1. The Elastic Stack

    www.elastic.co. Copyright Elasticsearch BV 2016. Copying, publishing and/or distributing without written permission is strictly prohibited.
  2. Really Quick Intro: Elasticsearch

    Index:

    POST /myindex/books/1
    {"title": "Elasticsearch in Action", "price": 35.69}

    POST /myindex/books/2
    {"title": "Elasticsearch: The Definitive Guide", "price": 39.90}

    POST /myindex/books/3
    {"title": "Relevant Search", "price": 39.61}

    Query:

    POST /myindex/_search
    {
      "query": {
        "bool": {
          "must": { "match": { "title": "elasticsearch" } },
          "filter": { "range": { "price": { "lt": 40 } } }
        }
      }
    }
  3. Really Quick Intro: Logstash

    Pipeline:

    input {
      kafka { topics => ["my-topic"] }
    }
    filter {
      mutate { remove_field => [ "field-a" ] }
    }
    output {
      elasticsearch { index => "my-kafka-index" }
    }
  4. Logstash common setup

    message:

    127.0.0.1 - - [19/Apr/2016:12:00:00 +0200] "GET /robots.txt HTTP/1.1" 200 68
    127.0.0.1 - - [19/Apr/2016:12:00:01 +0200] "GET /cgi-bin/try/ HTTP/1.1" 200 3395
    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
    127.0.0.1 - - [19/Apr/2016:12:00:15 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:18 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
  5. Ingest node setup

    127.0.0.1 - - [19/Apr/2016:12:00:00 +0200] "GET /robots.txt HTTP/1.1" 200 68
    127.0.0.1 - - [19/Apr/2016:12:00:01 +0200] "GET /cgi-bin/try/ HTTP/1.1" 200 3395
    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
    127.0.0.1 - - [19/Apr/2016:12:00:15 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:18 +2000] "GET /favicon.ico HTTP/1.1" 200 3638
  6. Filebeat: collect and ship

    127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24
    127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] "GET /not_found/ HTTP/1.1" 404 7218
    127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] "GET /favicon.ico HTTP/1.1" 200 3638

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:07 +0200] \"GET /not_found/ HTTP/1.1\" 404 7218" }
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:09 +2000] \"GET /favicon.ico HTTP/1.1\" 200 3638" }
  7. Elasticsearch: parse and index

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }

    {
      "request" : "/",
      "auth" : "-",
      "ident" : "-",
      "verb" : "GET",
      "@timestamp" : "2016-04-19T10:00:04.000Z",
      "response" : "200",
      "bytes" : "24",
      "clientip" : "127.0.0.1",
      "httpversion" : "1.1",
      "rawrequest" : null,
      "timestamp" : "19/Apr/2016:12:00:04 +0200"
    }
  8. Define a pipeline

    PUT /_ingest/pipeline/apache-log
    {
      "processors" : [
        { "grok" : { "field" : "message", "patterns" : ["%{COMMONAPACHELOG}"] } },
        { "date" : { "field" : "timestamp", "formats" : ["dd/MMM/yyyy:HH:mm:ss Z"] } },
        { "remove" : { "field" : "message" } }
      ]
    }
  9. Index a document

    Provide the id of the pipeline to execute:

    PUT /logs/apache/1?pipeline=apache-log
    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] \"GET / HTTP/1.1\" 200 24" }
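What a grok, date, remove pipeline such as apache-log does to this document can be sketched in plain Python. This is only an illustration under stated assumptions: the regular expression is a simplified stand-in for `%{COMMONAPACHELOG}` (it does not produce every field the real pattern does, for example `rawrequest`), and the function names are hypothetical, not Elasticsearch APIs.

```python
import re
from datetime import datetime, timezone

# Simplified stand-in for the %{COMMONAPACHELOG} grok pattern (an assumption,
# not the real pattern definition shipped with Elasticsearch).
COMMON_LOG = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def grok(doc):
    """Extract named fields from the raw message into the document."""
    match = COMMON_LOG.match(doc["message"])
    if match is None:
        raise ValueError("message does not match the log pattern")
    doc.update(match.groupdict())


def date(doc):
    """Parse the extracted timestamp and store it as UTC in @timestamp."""
    parsed = datetime.strptime(doc["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    utc = parsed.astimezone(timezone.utc)
    # The log has second resolution, so the milliseconds are fixed at 000.
    doc["@timestamp"] = utc.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def remove(doc):
    """Drop the raw message once it has been parsed."""
    del doc["message"]


def run_pipeline(doc):
    """Apply the three processors in order, mutating and returning the doc."""
    for processor in (grok, date, remove):
        processor(doc)
    return doc


doc = run_pipeline(
    {"message": '127.0.0.1 - - [19/Apr/2016:12:00:04 +0200] "GET / HTTP/1.1" 200 24'}
)
print(doc)
```

As in the real pipeline, every extracted value stays a string unless a later processor converts it.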
  10. What has actually been indexed

    GET /logs/apache/1

    {
      "request" : "/",
      "auth" : "-",
      "ident" : "-",
      "verb" : "GET",
      "@timestamp" : "2016-04-19T10:00:04.000Z",
      "response" : "200",
      "bytes" : "24",
      "clientip" : "127.0.0.1",
      "httpversion" : "1.1",
      "rawrequest" : null,
      "timestamp" : "19/Apr/2016:12:00:04 +0200"
    }
  11. Pipeline management: Create, Read, Update & Delete

    PUT /_ingest/pipeline/apache-log { … }
    GET /_ingest/pipeline/apache-log
    GET /_ingest/pipeline/*
    DELETE /_ingest/pipeline/apache-log
  12. grok, remove, attachment, convert, uppercase, foreach, trim, append, gsub, set, split, fail, geoip, join, lowercase, rename, date
  13. Grok processor

    Extracts structured fields out of a single text field.

    { "grok": { "field": "message", "patterns": ["%{DATE:date}"] } }
  14. Mutate processors

    set, remove, rename, convert, gsub, split, join, lowercase, uppercase, trim, append

    { "remove": { "field": "message" } }
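To make the mutate processors concrete, here is a rough Python sketch of a few of them operating on a dictionary. The option names (`field`, `target_field`, `separator`, `value`) follow the processor configuration style used in these slides, but the functions, the `PROCESSORS` table and `apply` are hypothetical helpers; the real split processor, for instance, treats its separator as a regular expression, which this sketch does not.

```python
def set_field(doc, field, value):
    doc[field] = value


def remove(doc, field):
    del doc[field]


def rename(doc, field, target_field):
    doc[target_field] = doc.pop(field)


def lowercase(doc, field):
    doc[field] = doc[field].lower()


def trim(doc, field):
    doc[field] = doc[field].strip()


def split(doc, field, separator):
    # Literal separator only; a simplification of the real processor.
    doc[field] = doc[field].split(separator)


def join(doc, field, separator):
    doc[field] = separator.join(doc[field])


# Maps a processor name to its illustrative implementation.
PROCESSORS = {
    "set": set_field,
    "remove": remove,
    "rename": rename,
    "lowercase": lowercase,
    "trim": trim,
    "split": split,
    "join": join,
}


def apply(doc, steps):
    """Run each {"name": {options}} step against the document, in order."""
    for step in steps:
        (name, options), = step.items()
        PROCESSORS[name](doc, **options)
    return doc


result = apply(
    {"message": "raw", "tags": " A,B "},
    [
        {"remove": {"field": "message"}},
        {"trim": {"field": "tags"}},
        {"lowercase": {"field": "tags"}},
        {"split": {"field": "tags", "separator": ","}},
        {"rename": {"field": "tags", "target_field": "labels"}},
        {"set": {"field": "source", "value": "demo"}},
    ],
)
print(result)
```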
  15. Date processor

    Parses a date from a string.

    { "date": { "field": "timestamp", "formats": ["yyyy"] } }
  16. Geoip processor

    Adds information about the geographical location of IP addresses.

    { "geoip": { "field": "ip" } }
  17. Attachment processor

    Parses structured documents like PDF, doc, etc.

    { "attachment": { "field": "my-doc-field" } }
  18. Open NLP processor

    Performs entity recognition and stores the output in the JSON document before it is stored.

    { "opennlp": { "field": "my_field" } }

    {
      "my_field" : "Kobe Bryant was one of the best basketball players of all times. Not even Michael Jordan has ever scored 81 points in one game. Munich is really an awesome city, but New York is as well. Yesterday has been the hottest day of the year.",
      "entities" : {
        "locations" : [ "Munich", "New York" ],
        "dates" : [ "Yesterday" ],
        "names" : [ "Kobe Bryant", "Michael Jordan" ]
      }
    }
  19. Plugins

    Introducing new processors is as easy as writing a plugin:
    https://www.elastic.co/blog/writing-your-own-ingest-processor-for-elasticsearch

    { "your_plugin": { … } }
  20. Pipeline: grok -> date -> remove

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

  21. Pipeline: grok -> date -> remove

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

    400 Bad Request: unable to parse date [19/Apr/2016:12:00:00 +040]

  22. On failure processors at the pipeline level

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

    Pipeline: grok -> date -> remove. On failure: set

  23. On failure processors at the pipeline level

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

    Pipeline: grok -> date -> remove. On failure: set

    200 OK

  24. On failure processors at the processor level

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

    Pipeline: grok -> date -> remove. On failure: set, remove

  25. On failure processors at the processor level

    { "message" : "127.0.0.1 - - [19/Apr/2016:12:00:00 +040] \"GET / HTTP/1.1\" 200 24" }

    Pipeline: grok -> date -> remove. On failure: set, remove

    200 OK
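The difference between the two failure-handling levels can be sketched in Python. This is a toy model, not the Elasticsearch implementation: it assumes that processor-level handlers run and the pipeline then continues with the next processor, while pipeline-level handlers run and the remaining processors are skipped. The names `run` and `set_error` are hypothetical, and the `grok` here is a trivial stand-in that only extracts the bracketed timestamp.

```python
from datetime import datetime


def grok(doc):
    # Toy extraction: the text between the square brackets is the timestamp.
    message = doc["message"]
    doc["timestamp"] = message[message.index("[") + 1 : message.index("]")]


def date(doc):
    # Raises ValueError for a malformed offset such as "+040".
    parsed = datetime.strptime(doc["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    doc["@timestamp"] = parsed.isoformat()


def remove(doc):
    doc.pop("message", None)


def set_error(doc, error):
    # Plays the role of the "set" processor used as a failure handler.
    doc["error"] = str(error)


def run(doc, processors, on_failure=()):
    """Run (processor, handlers) pairs in order.

    handlers: processor-level handlers; after they run, the pipeline continues.
    on_failure: pipeline-level handlers; after they run, the pipeline stops.
    With neither, the error propagates (the request would be rejected).
    """
    for processor, handlers in processors:
        try:
            processor(doc)
        except Exception as error:
            if handlers:
                for handler in handlers:
                    handler(doc, error)
                continue
            if on_failure:
                for handler in on_failure:
                    handler(doc, error)
                return doc
            raise
    return doc


BAD = {"message": '127.0.0.1 - - [19/Apr/2016:12:00:00 +040] "GET / HTTP/1.1" 200 24'}
plain = [(grok, ()), (date, ()), (remove, ())]

# No handlers at all: the date failure propagates to the caller.
try:
    run(dict(BAD), plain)
    rejected = False
except ValueError:
    rejected = True

# Pipeline level: the error is recorded and "remove" never runs.
pipeline_level = run(dict(BAD), plain, on_failure=(set_error,))

# Processor level: the error is recorded and "remove" still runs.
processor_level = run(dict(BAD), [(grok, ()), (date, (set_error,)), (remove, ())])
```

In this model the pipeline-level result still carries the raw `message`, whereas the processor-level result does not.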
  26. Default scenario: cluster state

    Cluster diagram: a Client and three nodes, each holding the cluster state (CS).
    - node1: logs 2P, logs 3R
    - node2: logs 3P, logs 1R
    - node3: logs 1P, logs 2R

    logs index: 3 primary shards, 1 replica each.

    All nodes are equal:
    - node.data: true
    - node.master: true
    - node.ingest: true

  27. Default scenario: pre-processing on the coordinating node

    Same cluster diagram. Index request for shard 3.

  28. Default scenario: indexing on the primary shard

    Same cluster diagram. Index request for shard 3.

  29. Default scenario: indexing on the replica shard

    Same cluster diagram. Index request for shard 3.
  30. Ingest dedicated nodes

    Cluster diagram: a Client and five nodes, each holding the cluster state (CS).
    - node1: logs 2P, logs 3R
    - node2: logs 3P, logs 1R
    - node3: logs 1P, logs 2R
    - node4: no shards
    - node5: no shards

    Ingest nodes: node.data: false, node.master: false, node.ingest: true
    Data and master nodes: node.data: true, node.master: true, node.ingest: false

  31. Ingest dedicated nodes: forward request to an ingest node

    Same cluster diagram. Index request for shard 3.

  32. Ingest dedicated nodes: pre-processing on the ingest node

    Same cluster diagram. Index request for shard 3.

  33. Ingest dedicated nodes: indexing on the primary shard

    Same cluster diagram. Index request for shard 3.

  34. Ingest dedicated nodes: indexing on the replica shard

    Same cluster diagram. Index request for shard 3.
  35. Beats

    output.elasticsearch.pipeline: '%{[fields.example]}'

    output.elasticsearch.pipelines:
      - pipeline: 'ok-pipeline'
        when.range:
          http.code: [200, 299]
      - pipeline: 'verybad-pipeline'
        when.range:
          http.code: [500, 999]
      - pipeline: 'default-pipeline'
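The conditional pipeline selection above reads as a list of rules where the first matching rule wins and a rule without a condition acts as the default. A small Python sketch of that logic follows; it assumes the ranges are inclusive and that events are flat dictionaries keyed by field name, and `RULES` and `select_pipeline` are hypothetical names rather than anything Beats exposes.

```python
# Rules mirror the configuration: (field, low, high) ranges, checked in order.
RULES = [
    {"pipeline": "ok-pipeline", "range": ("http.code", 200, 299)},
    {"pipeline": "verybad-pipeline", "range": ("http.code", 500, 999)},
    {"pipeline": "default-pipeline"},  # no condition: always matches
]


def select_pipeline(event, rules=RULES):
    """Return the pipeline of the first rule whose condition matches."""
    for rule in rules:
        condition = rule.get("range")
        if condition is None:
            return rule["pipeline"]
        field, low, high = condition
        value = event.get(field)
        # Assumption: both range bounds are inclusive.
        if value is not None and low <= value <= high:
            return rule["pipeline"]
    return None


for code in (200, 404, 503):
    print(code, select_pipeline({"http.code": code}))
```

Because the unconditional rule comes last, any status code outside both ranges, such as a 404, falls through to `default-pipeline`.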
  36. Bulk API

    PUT /logs/_bulk
    { "index": { "_type": "apache", "_id": "1", "pipeline": "apache-log" } }\n
    { "message" : "…" }\n
    { "index": { "_type": "mysql", "_id": "1", "pipeline": "mysql-log" } }\n
    { "message" : "…" }\n
  37. Reindex API

    Scan/scroll & bulk indexing made easy.

    POST /_reindex
    {
      "source": { "index": "logs", "type": "apache" },
      "dest": { "index": "apache-logs", "pipeline" : "apache-log" }
    }
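For comparison, the manual scroll-and-bulk work that a reindex request replaces can be sketched as turning search hits into a newline-delimited bulk body that names the pipeline in each action. This sketch makes no network calls; `hits_to_bulk` is a hypothetical helper, and the sample hits are made-up data shaped like search hits (`_index`, `_type`, `_id`, `_source`).

```python
import json


def hits_to_bulk(hits, dest_index, pipeline, doc_type=None):
    """Build a bulk request body (one JSON object per line) from search hits.

    Each hit becomes an "index" action line followed by its source line.
    Hits whose _type differs from doc_type are skipped when doc_type is set.
    """
    lines = []
    for hit in hits:
        if doc_type is not None and hit["_type"] != doc_type:
            continue
        action = {
            "index": {
                "_index": dest_index,
                "_type": hit["_type"],
                "_id": hit["_id"],
                "pipeline": pipeline,
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(hit["_source"]))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Made-up hits standing in for one page of scroll results.
hits = [
    {"_index": "logs", "_type": "apache", "_id": "1", "_source": {"message": "a"}},
    {"_index": "logs", "_type": "mysql", "_id": "1", "_source": {"message": "b"}},
]

body = hits_to_bulk(hits, "apache-logs", "apache-log", doc_type="apache")
print(body, end="")
```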
  38. Reindex From Remote

    Great way to upgrade if you have the hardware for two clusters.

    POST /_reindex
    {
      "source": {
        "index": "logs",
        "type": "apache",
        "remote": { "host": "http://localhost:9200" }
      },
      "dest": { "index": "apache-logs", "pipeline" : "apache-log" }
    }