
Ingest Node: Enriching Documents within Elasticsearch

Elastic Co
February 18, 2016

When ingesting data into Elasticsearch, sometimes only simple transforms need to be performed on the data prior to indexing. Enter Ingest Node: a new node type that will allow you to do just that! This talk will introduce you to Ingest Node and how to integrate it with the rest of the Elastic Stack.

Transcript

  1. Ingest Node
    @talevy
    @lucacavanna
    @mvgroningen


  2. Indexing A Document
    2
    PUT localhost:9200/logs/mysql/1
    {
      "message": "070917 16:29:01 21 Query select * from location"
    }


  3. 3
    070917 16:29:01   21          Query           select * from location
    time              thread_id   command_type    query_body


  4. Logstash is made for this!
    4
    input { … }
    filter {
      grok {
        match => {
          "message" => "^%{NUMBER:date} *%{TIME:time} *%{WORD:command_type} *%{DATA:command}"
        }
      }
      mutate { remove_field => ["message"] }
    }
    output { … }


  5. Ingest Pipeline
    5
    just the filters (grok, remove); no inputs or outputs


  6. Indexing [An Enriched ] Document
    6
    PUT localhost:9200/logs/mysql/1?pipeline=my_mysql_pipeline
    {
      "message": "070917 16:29:01 21 Query select * from location"
    }
    specifying an ingest pipeline to execute before indexing


  7. Indexing [An Enriched ] Document
    7
    PUT localhost:9200/logs/mysql/1
    {
      "timestamp": "2007-09-17T16:29:01-08:00",
      "thread_id": "21",
      "command_type": "Query",
      "command": "select * from location"
    }
    document as it is being indexed


  8. 8
    Ingest Node
    • Enabled by default in Elasticsearch 5.0
    • All nodes can become ingest nodes.
    • Pipelines run on any ingest node.
    • Logstash filters and ingest processors will be compatible with one
      another.
    Enrich documents before indexing
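
    Whether a node takes the ingest role is a node setting; as a minimal
    sketch of the 5.x elasticsearch.yml (the value shown is the default):

      # elasticsearch.yml
      # every node is ingest-capable by default in 5.0
      node.ingest: true
      # set to false to keep pipelines from executing on this node
      # node.ingest: false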


  9. Examples and Use-Cases
    9
    Beats
    • filebeat: no middle man, ship a logs directory straight into
      Elasticsearch
    Reindex
    • misnamed field: bulk rename using the Reindex API with a pipeline
      defined


  10. The Pipeline
    10


  11. 11
    The Pipeline
    • A description of a sequence of processors to execute against each
      document that is to be indexed
    • Defined in JSON
    • Runs before the documents are indexed into Elasticsearch


  12. Pipeline Management
    Create, Retrieve, Update, Delete
    12
    PUT _ingest/pipeline/pipeline-name
    GET _ingest/pipeline/pipeline-name
    GET _ingest/pipeline/*
    DELETE _ingest/pipeline/pipeline-name
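
    For illustration, a minimal create-then-fetch round trip; the pipeline
    name and the set processor's field are invented for this sketch:

      PUT _ingest/pipeline/stamp-docs
      {
        "description": "stamp each document at ingest time",
        "processors": [
          { "set": { "field": "indexed_at", "value": "{{_ingest.timestamp}}" } }
        ]
      }

      GET _ingest/pipeline/stamp-docs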


  13. Example Pipeline
    13
    extract mysql fields from the
    `message` field and then
    remove it from the document
    {
      "description": "mysql pipeline",
      "processors": [
        {
          "grok": {
            "field": "message",
            "pattern": "…"
          }
        },
        {
          "remove": {
            "field": "message"
          }
        }
      ]
    }


  14. The Processors
    14
    Grok
    Rename
    Set
    Convert
    Attachment
    Geoip
    Date


  15. Mutate Processors
    15
    set, convert, split, uppercase,
    lowercase, remove, rename,
    append, and more!
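
    A minimal sketch chaining a few of these in one pipeline; the field
    names (source, level, tags, raw) are invented for illustration:

      {
        "processors": [
          { "set":       { "field": "source", "value": "filebeat" } },
          { "uppercase": { "field": "level" } },
          { "split":     { "field": "tags", "separator": "," } },
          { "remove":    { "field": "raw" } }
        ]
      }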


  16. 16
    Grok Processor
    Parses new field values out of
    pattern-matched substrings
    {
      "grok": {
        "field": "field1",
        "pattern": "%{DATE:date}"
      }
    }
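
    As an illustrative before/after (the document and value are made up;
    %{DATE} matches common numeric date layouts):

      before: { "field1": "02/18/2016 some text" }
      after:  { "field1": "02/18/2016 some text", "date": "02/18/2016" }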


  17. 17
    Geoip Processor
    Adds information about the
    geographical location of IP
    addresses
    {
      "geoip": {
        "field": "my_ip_field"
      }
    }
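
    By way of a hedged sketch (the exact set of fields depends on the GeoIP
    database the plugin uses), the processor adds a geoip object alongside
    the source field:

      {
        "my_ip_field": "8.8.8.8",
        "geoip": {
          "country_iso_code": "US",
          "location": { "lat": …, "lon": … }
        }
      }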


  18. Introducing new
    processors is as
    easy as writing a
    new plugin!
    18
    {
      "your_plugin_here": {
        "field": "field1",
        …
      }
    }


  19. Pipeline Error Handling
    19


  20. 20
    Handling Failure
    • Sometimes the input documents do not match what is expected. Failure is
      sometimes expected, and can be handled.
    • Use the on_failure parameter to define processors to run when the parent
      processor or pipeline throws an exception (see the sketch below).
    • on_failure blocks behave like little pipelines!
    Data is not clean; pipelines clean it up
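
    A minimal sketch of a pipeline-level on_failure block; the error field
    name is an assumption, and _ingest.on_failure_message is ingest metadata
    describing what went wrong:

      {
        "description": "guarded pipeline",
        "processors": [
          { "grok": { "field": "message", "pattern": "…" } }
        ],
        "on_failure": [
          { "set": { "field": "error", "value": "{{ _ingest.on_failure_message }}" } }
        ]
      }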


  21. Pipeline With Failure Guards
    21
    [diagram, built up over slides 21-24: a pipeline of grok, date, and set
    processors, with an on_failure branch containing set and date processors]


  25. 25
    Investigating Pipelines
    • on_failure is great for allowing you to catch exceptions
    • What if you do not know what exceptions can occur?
    • What if you are unsure what a processor will do?
    • What if there was a way to test a pipeline against your documents before
      indexing?
    Peek into the behavior of a pipeline


  26. Simulation
    26


  27. 27
    _simulate
    • The Simulate API can be used to run pipelines against input documents
    • These input documents are not indexed
    • Use the verbose flag to see the output of each processor within a
      pipeline, not just the end result
    • Useful for debugging the output of specific processors within a pipeline
    running pipelines without indexing


  28. Simulate Endpoints
    28
    POST _ingest/pipeline/pipeline-name/_simulate
    POST _ingest/pipeline/_simulate
    POST _ingest/pipeline/_simulate?verbose


  29. 29
    _simulate
    request:
    POST _ingest/pipeline/_simulate
    {
      "pipeline" : { … },
      "docs" : [
        { "message": "transform!" },
        { "message": "me too!" }
      ]
    }
    response:
    {
      "docs" : [
        { … },
        { … }
      ]
    }
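
    With ?verbose, the response breaks each result down per processor;
    roughly, as a sketch with the document contents elided:

      {
        "docs" : [
          {
            "processor_results" : [
              { "doc" : { … } },
              { "doc" : { … } }
            ]
          }
        ]
      }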


  30. Ingest node internals
    30


  31. 31
    Ingest node internals
    • Ingest node is part of core Elasticsearch
    • The infrastructure
    • Many processors
    • Some processors are available as plugins


  32. 32
    Ingest node internals - the infrastructure
    • Ingest needs to intercept each document before it gets indexed, and
      preprocess it.
    • But where should the documents be intercepted?
    • The cost of preprocessing, and the ability to isolate that cost, are
      important.


  33. Ingest node internals - the infrastructure
    33
    [diagram: a four-node cluster holding the primary (P) and replica (R)
    shards of index1 and index2]


  34. Ingest node internals - the infrastructure
    34
    [diagram, repeated on slides 34-35: a single node holding the index1 1P
    and index1 2R shards]


  36. Ingest node internals - the infrastructure
    36
    [diagram: a node holding the cluster state (CS)]


  37. Ingest node internals - the infrastructure
    37
    [diagram: the cluster state (CS) is present on every node, alongside the
    index1 and index2 primary and replica shards]


  38. 38
    Ingest node internals - other details
    [diagram: within a node, ingest processing happens before indexing]


  39. Ingest node in production
    39


  40. 40
    Ingest node in production - performance
    Performance depends on:
    • The type of operations your pipelines are going to perform
    • The number of processors in your pipelines
    • The number of fields your pipelines add or remove
    • The number of nodes ingest can run on
    So it depends!


  41. 41
    Ingest node in production - default architecture
    [diagram: a four-node cluster; every node is ingest-capable by default
    and holds index1 and index2 primary and replica shards]


  42. 42
    Ingest node in production - dedicated ingest nodes
    [diagram: four data+master (d+m) nodes holding the index1 and index2
    shards, fronted by four dedicated ingest nodes]
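
    A sketch of the 5.x node role settings behind this split:

      # data + master node (no ingest)
      node.master: true
      node.data: true
      node.ingest: false

      # dedicated ingest node
      node.master: false
      node.data: false
      node.ingest: true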


  43. 43
    Dedicated ingest node in production - benchmark
    • Indexing apache logs
    • Bulk indexing with 5 concurrent threads
    • 2 node cluster:
      • Data+master node
      • Ingest-only node


  44. Dedicated ingest node in production - benchmark
    44
    [chart: relative indexing throughput (0-100%) for document volumes from
    0.3M to 9.6M, comparing pipelines: empty, set, grok, grok + geoip, and
    grok + remove]


  45. Pipeline Construction Demo
    experiment with pipeline creation
    from within Kibana


  46. 46


  47. 47


  48. 48


  49. Filebeat Demo
    Shipping Logs to Ingest from Beats
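
    For reference, a hedged sketch of pointing Filebeat 5.x directly at an
    ingest pipeline (the pipeline name is made up for this example):

      # filebeat.yml
      output.elasticsearch:
        hosts: ["localhost:9200"]
        pipeline: filebeat-apache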


  50. 50


  51. Filebeat demo
    Grok processor
    51
    {
    "grok": {
    "field": "message",
    "pattern": "%{COMBINEDAPACHELOG}"
    }
    }


  52. Filebeat demo
    Date processor
    52
    {
    "date": {
    "match_field" : "timestamp",
    "target_field" : "timestamp",
    "match_formats" : ["dd/MMM/YYYY:HH:mm:ss Z"]
    }
    }


  53. Filebeat demo
    Convert processor
    53
    {
    "convert": {
    "field" : "response",
    "type" : "integer"
    }
    }


  54. Filebeat demo
    Convert processor with on_failure
    54
    {
    "convert": {
    "field" : "bytes",
    "type" : "integer",
    "on_failure" : [
    {
    "set" : {
    "field" : "bytes",
    "value" : 0
    }
    }
    ]
    }
    }
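
    Putting the demo processors together, a sketch of registering the full
    pipeline (the pipeline name is invented; the processors are exactly the
    ones shown on the previous slides):

      PUT _ingest/pipeline/filebeat-apache
      {
        "description": "parse apache access logs shipped by filebeat",
        "processors": [
          { "grok": { "field": "message", "pattern": "%{COMBINEDAPACHELOG}" } },
          { "date": { "match_field": "timestamp", "target_field": "timestamp",
                      "match_formats": ["dd/MMM/YYYY:HH:mm:ss Z"] } },
          { "convert": { "field": "response", "type": "integer" } },
          { "convert": { "field": "bytes", "type": "integer",
                         "on_failure": [ { "set": { "field": "bytes", "value": 0 } } ] } }
        ]
      }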


  55. 55


  56. See you at AMA
    56


  57. 57
    Please attribute Elastic with a link to elastic.co
    Except where otherwise noted, this work is licensed under
    http://creativecommons.org/licenses/by-nd/4.0/
    Creative Commons and the double C in a circle are
    registered trademarks of Creative Commons in the United States and other
    countries.
    Third party marks and brands are the property of their respective holders.
