Elastic Stack News (2.x + 5.0)

In this 45-minute talk, Pablo will give an overview of the main features added to the Elastic Stack in the last year (e.g. Reindex API, Pipeline Aggregations, and Beats). Pablo will point out what you should pay attention to and what you should be careful with. Pablo will also talk about the new features that are coming in 5.0 (e.g. Graph API, Ingest Node, and Logstash Persistency).

Pablo Musa

May 10, 2016

Transcript

  1. Elastic Stack 5.0

    Diagram: Elastic Cloud; X-Pack (Security, Monitoring, Alerting, Graph); Kibana (User Interface); Elasticsearch (Store, Index, and Analyze); Logstash and Beats (Ingest)
  3. Query DSL Update and Optimization (2.0)

    • Elasticsearch 2.x intelligently executes queries
    • No need to write queries "the best way"
    • Examples:
      ‒ For "conjunction" queries (2 or more "match" queries in a must section)
        ‒ Sub-queries are sorted by term frequency
        ‒ Executed lowest to highest by term frequency
      ‒ For complex queries ("match_phrase" for instance) a 2-phase execution strategy is used
        ‒ Approximation phase
        ‒ Verification phase
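The conjunction strategy above can be sketched in a few lines of Python. This is an illustration of the idea only, not Lucene's implementation: the `postings` dictionary and the `conjunction` helper are hypothetical, and the number of documents containing a term stands in for its frequency.

```python
from typing import Dict, List, Set


def conjunction(postings: Dict[str, Set[int]], terms: List[str]) -> Set[int]:
    """Intersect posting lists, starting with the rarest term."""
    if not terms:
        return set()
    # Order sub-queries from the least frequent term to the most frequent one.
    ordered = sorted(terms, key=lambda term: len(postings.get(term, set())))
    result = set(postings.get(ordered[0], set()))
    for term in ordered[1:]:
        if not result:
            break  # no candidate documents left, so stop early
        result &= postings.get(term, set())
    return result


# Hypothetical posting lists: term -> ids of the documents containing it.
postings = {
    "elastic": {1, 2, 3, 4, 5, 6},
    "stack": {2, 4, 6},
    "beats": {4},
}
matches = conjunction(postings, ["elastic", "stack", "beats"])
print(matches)
```

Starting with the rarest term keeps the candidate set small, so the more common terms only have to be checked against a handful of documents.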
  4. Query DSL Update and Optimization

    • The "Query Cache" will cache filter parts if they appear enough times
      ‒ Complex queries - 2 executions in the last 256 queries
      ‒ Typical queries - 5 executions in the last 256 queries
      ‒ Simple queries - 20 executions in the last 256 queries
    • No need to manage this with _cache or _cache_key any longer - deprecated features from 1.x
    • Only big segments are cached
      ‒ Segments that contain 3% of index documents or 10,000 documents
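A rough Python sketch of the usage-tracking rule above, assuming a sliding history of the last 256 filters. The `UsageTracker` class and the way filters are labelled "complex", "typical", or "simple" are illustrative assumptions, not Elasticsearch internals.

```python
from collections import deque

HISTORY_SIZE = 256
# Minimum number of recent uses before a filter is considered worth caching.
MIN_USES = {"complex": 2, "typical": 5, "simple": 20}


class UsageTracker:
    """Hypothetical tracker that remembers the most recently used filters."""

    def __init__(self) -> None:
        self.history = deque(maxlen=HISTORY_SIZE)

    def should_cache(self, filter_key: str, kind: str) -> bool:
        """Record one use of filter_key and report whether it is now cache-worthy."""
        self.history.append(filter_key)
        return self.history.count(filter_key) >= MIN_USES[kind]


tracker = UsageTracker()
decisions = [tracker.should_cache("status:open", "typical") for _ in range(5)]
print(decisions)
```

Because the history is bounded, a filter that stops appearing eventually falls out of the window and has to earn its place in the cache again.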
  5. Doc Values and Field Data (2.0)

    • Inverted Index
      ‒ For a "value", which "docs" contain it?
    • What if we need the opposite:
      ‒ For a "doc", what is a particular field's "value"?
    • Why do we need this?
      ‒ Sorting
      ‒ Aggregations
      ‒ Some Scripting
    • Two approaches for storing and accessing this structure...
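The two lookup directions can be illustrated with plain Python dictionaries. This is a toy model under obvious assumptions (three made-up documents with a single `tag` field), not how Lucene lays the data out.

```python
from collections import Counter, defaultdict

# Hypothetical documents: id -> source.
docs = {
    1: {"tag": "search"},
    2: {"tag": "logging"},
    3: {"tag": "search"},
}

# Inverted index: for a value, which docs contain it?
inverted = defaultdict(list)
for doc_id, source in docs.items():
    inverted[source["tag"]].append(doc_id)

# Column-style structure: for a doc, what is the field's value?
column = {doc_id: source["tag"] for doc_id, source in docs.items()}

hits = inverted["search"]                                     # searching uses the inverted index
sorted_ids = sorted(docs, key=lambda doc_id: column[doc_id])  # sorting needs doc -> value
tag_counts = Counter(column[doc_id] for doc_id in docs)       # so do aggregations

print(hits, sorted_ids, dict(tag_counts))
```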
  6. Doc Values

    • Build columnar style data structure on disk
    • We call these "doc values" (Lucene construct)
    • Created at indexing time, stored as part of the segment
    • Read like other pieces of the Lucene index
      ‒ Don't take up heap space
      ‒ Uses file system cache
    • Default for not_analyzed string and numeric fields in 2.0+
  7. Field Data

    • Data structure built on the fly at query time
    • Held PER SEGMENT in the JVM memory
    • "15-20% faster", but comes at the cost of large heap usage (*GC)
    • To intentionally enable field data on 2.0+ ("not advised"):

    "properties" : {
      "tag": {
        "type": "string",
        "index" : "not_analyzed",
        "doc_values": false
      }
    }
  8. Significant Terms (find the “uncommonly common”)

    • Terms Aggregation is about popularity.
    • Significant Terms Aggregation is about significance.
    • Create a foreground dataset
    • See which terms are “significant” to it VS the background dataset
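To make the popularity-versus-significance distinction concrete, here is a small Python sketch. It scores each term by how much more often it appears in the foreground than in the background. That ratio is a simplification chosen for illustration and is not the heuristic Elasticsearch uses; the documents are made up, and the foreground is assumed to be a subset of the background.

```python
from collections import Counter
from typing import Dict, List


def significance(foreground: List[List[str]], background: List[List[str]]) -> Dict[str, float]:
    """Score terms by foreground rate divided by background rate."""
    fg_counts = Counter(term for doc in foreground for term in set(doc))
    bg_counts = Counter(term for doc in background for term in set(doc))
    scores = {}
    for term, fg_count in fg_counts.items():
        fg_rate = fg_count / len(foreground)
        # Non-zero because the foreground is a subset of the background.
        bg_rate = bg_counts[term] / len(background)
        scores[term] = fg_rate / bg_rate
    return scores


# Hypothetical tokenized documents.
background = [
    ["the", "elasticsearch"],
    ["the", "kibana"],
    ["the", "logstash"],
    ["the", "elasticsearch", "shard"],
]
foreground = [background[0], background[3]]  # e.g. the documents matching a query

scores = significance(foreground, background)
print(scores)
```

"the" appears in every foreground document, so a terms aggregation would rank it at the top, but it is just as common in the background and therefore gets the lowest score here.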
  9. Sampler Aggregation (2.0)

    • Limit the number of documents a sub-aggregation will operate on
    • Reduce noise
    • Get better and faster results
  10. Pipeline Aggregations (2.0)

    • After you've aggregated data, how can you aggregate the results?
    • Elasticsearch 2.0 introduced "Pipeline Aggregations"
    • Many types of aggregations such as moving averages, derivatives, bucket selectors and more!
  11. Pipeline Aggregations

    • Simple to use
    • Specify a type of Pipeline Aggregation
    • Specify a "buckets_path"
    • Optionally, use bucket_selectors to filter out buckets you don't want to pipeline aggregate

    GET stack/question/_search
    {
      "size": 0,
      "aggs": {
        "daily_comments": {
          "date_histogram": { "field": "creation_date", "interval": "hour" },
          "aggs" : {
            "comment_counts" : { "sum" : { "field" : "comment_count" } },
            "comments_moving_avg" : {
              "moving_avg": { "buckets_path": "comment_counts", "model": "simple" }
            }
          }
        }
      }
    }
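What the "simple" model computes over the `comment_counts` buckets can be sketched as a plain trailing average. The bucket values and the `simple_moving_avg` helper below are made up for illustration, and the window here includes the current bucket, which is an assumption that may not match how Elasticsearch aligns its output.

```python
from typing import List


def simple_moving_avg(values: List[float], window: int) -> List[float]:
    """Unweighted mean of the last `window` values, current bucket included."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical per-hour sums, as the comment_counts sub-aggregation might return them.
hourly_comment_counts = [4, 8, 6, 2, 10]
smoothed = simple_moving_avg(hourly_comment_counts, window=3)
print(smoothed)
```

The pipeline aggregation works the same way conceptually: it never sees documents, only the values that the sibling aggregation named in `buckets_path` produced for each bucket.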
  12. Pipeline Aggregations

    • Chart with simple model, 30m intervals and a window of 50 (blue line: count, red line: moving average)
  13. Pipeline Aggregations

    • Chart with linear model, 30m intervals and a window of 50 (blue line: count, red line: moving average)
  14. Pipeline Aggregations

    • Chart with holt-winters model, 30m intervals and a window of 100, period of 48 and prediction of 200 (blue line: count, red line: moving average)
  15. Query Profiler (2.2)

    • “SQL Explain for ES”
    • Attempts to time execution of query components
    • Best-effort profiling
    • Expensive! Verbose!

    GET /my-index/_search
    {
      "profile": true,
      "query": { "match_all": {} }
    }

    https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html
  17. Update by Query (2.3)

    • Gets a 'snapshot' of the index
    • Indexes what it finds
    • version++
    • Version conflict if there are changes between 'snapshot' and update
    • Failures abort the request
    • No rollback
    • "conflicts": "proceed"

    POST /twitter/_update_by_query
    {
      "script": { "inline": "ctx._source.likes++" },
      "query": { "term": { "user": "kimchy" } }
    }
  18. Update by Query (2.3)

    • Batches of 100 docs
    • ?scroll_size=200
    • The first failure aborts the request, but all failures returned by the failing bulk request are reported

    { "took" : 639, "updated": 1235, "batches": 13, "version_conflicts": 2, "failures" : [ ] }

    https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html
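The snapshot, version check, and "conflicts": "proceed" behaviour can be modelled with an in-memory dictionary. This is a toy simulation, not the Elasticsearch implementation: the `update_by_query` function, the `after_snapshot` hook (used to fake a concurrent writer), and the sample documents are all hypothetical.

```python
from typing import Callable, Dict, Optional


class VersionConflict(Exception):
    """Raised when a document changed between the snapshot and the update."""


def update_by_query(index: Dict[str, dict],
                    predicate: Callable[[dict], bool],
                    script: Callable[[dict], None],
                    conflicts: str = "abort",
                    after_snapshot: Optional[Callable[[Dict[str, dict]], None]] = None) -> dict:
    # 1. Take a 'snapshot': remember the version of every matching document.
    snapshot = {doc_id: doc["_version"]
                for doc_id, doc in index.items() if predicate(doc["_source"])}
    if after_snapshot is not None:
        after_snapshot(index)  # hypothetical hook simulating a concurrent writer
    result = {"updated": 0, "version_conflicts": 0}
    # 2. Re-index each document, checking that its version is unchanged.
    for doc_id, seen_version in snapshot.items():
        doc = index[doc_id]
        if doc["_version"] != seen_version:
            result["version_conflicts"] += 1
            if conflicts != "proceed":
                # Abort without rolling back the documents already updated.
                raise VersionConflict(doc_id)
            continue
        script(doc["_source"])
        doc["_version"] += 1  # version++
        result["updated"] += 1
    return result


def is_kimchy(source: dict) -> bool:
    return source["user"] == "kimchy"


def like(source: dict) -> None:
    source["likes"] += 1


def concurrent_writer(idx: Dict[str, dict]) -> None:
    idx["2"]["_source"]["likes"] = 100
    idx["2"]["_version"] += 1


index = {
    "1": {"_version": 1, "_source": {"user": "kimchy", "likes": 0}},
    "2": {"_version": 1, "_source": {"user": "kimchy", "likes": 5}},
    "3": {"_version": 1, "_source": {"user": "other", "likes": 7}},
}
result = update_by_query(index, is_kimchy, like,
                         conflicts="proceed", after_snapshot=concurrent_writer)
print(result)
```

With the default `conflicts="abort"`, the same run would raise `VersionConflict` on document "2" while leaving the update to document "1" in place, which mirrors the "no rollback" point above.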
  19. Reindex API (2.3)

    • Gets a 'snapshot' of the index
    • Indexes to a new index
    • Failures cause abortion
    • no roll back
    • "conflicts": "proceed"
    • Multiple indices and types

    POST _reindex
    {
      "source": {
        "index": "old_index",
        "query": { "match": { "user": "twitter" } }
      },
      "dest": { "index": "new_index" }
    }
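As a sketch of how the request body above extends to the other bullets, the following Python helper assembles a `_reindex` body with several source indices, optional types, a query, and "conflicts": "proceed". The helper name `reindex_body` and its arguments are illustrative; the field placement follows my reading of the 2.3 reindex documentation and should be checked against the version in use.

```python
import json
from typing import List, Optional


def reindex_body(source_indices: List[str],
                 dest_index: str,
                 types: Optional[List[str]] = None,
                 query: Optional[dict] = None,
                 conflicts: Optional[str] = None) -> dict:
    """Build the JSON body for POST _reindex (hypothetical helper)."""
    source: dict = {
        # A single index is sent as a string, several as a list.
        "index": source_indices if len(source_indices) > 1 else source_indices[0]
    }
    if types:
        source["type"] = types
    if query is not None:
        source["query"] = query
    body: dict = {"source": source, "dest": {"index": dest_index}}
    if conflicts is not None:
        body["conflicts"] = conflicts
    return body


body = reindex_body(["old_index"], "new_index",
                    query={"match": {"user": "twitter"}},
                    conflicts="proceed")
print(json.dumps(body, indent=2))
```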
  20. Reindex API (2.3)

    • Conflicts are not likely, but
      ‒ "version_type": "internal"
      ‒ "version_type": "external"
      ‒ "op_type": "create"
    • "size": 100
    • "sort": { "date": "desc" }
    • Very flexible and powerful (scripts, refresh, wait_for_completion)

    { "took" : 639, "updated": 112, "batches": 130, "version_conflicts": 0, "failures" : [ ], "created": 12344 }

    https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
  21. Task Management API (2.3)

    • Monitoring and cancellation of running tasks

    GET /_tasks
    GET /_tasks?nodes=nodeId1,nodeId2
    GET /_tasks?nodes=nodeId1,nodeId2&actions=cluster:*
    GET /_tasks/taskId1
    GET /_tasks?parent_task_id=parentTaskId1
    GET /_tasks/taskId1?wait_for_completion=true&timeout=10s
    POST /_tasks/taskId1/_cancel

    https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html
  22. Ingest Node (5.0)

    • Adding the power of Logstash filters inside an Elasticsearch node
    • Pre-process documents before the actual indexing takes place
    • Enabled by default (disable with node.ingest: false)

    PUT _ingest/pipeline/pipeline-name
    GET _ingest/pipeline/pipeline-name
    GET _ingest/pipeline/*
    DELETE _ingest/pipeline/pipeline-name

    https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html
  23. Ingest Node (5.0)

    • You define pipelines as series of processors
    • For example:
      ‒ extract mysql fields from `message` field and then remove it from the document
    • Simplifies the ingestion pipeline

    {
      "description": "mysql pipeline",
      "processors": [
        { "grok" : { "field" : "message", "pattern" : "..." } },
        { "remove" : { "field" : "message" } }
      ]
    }
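The "series of processors" idea can be mimicked in Python to show the order of operations: extract fields from `message`, then drop `message`. This is a conceptual sketch only. A named-group regular expression stands in for a real grok pattern, and the log line, field names, and helper functions are all invented for the example.

```python
import re
from typing import Callable, Dict, List

Processor = Callable[[Dict[str, str]], Dict[str, str]]


def grok(field: str, regex: str) -> Processor:
    """Stand-in for the grok processor: copy named regex groups into the document."""
    compiled = re.compile(regex)

    def processor(doc: Dict[str, str]) -> Dict[str, str]:
        match = compiled.search(doc[field])
        if match:
            doc.update(match.groupdict())
        return doc

    return processor


def remove(field: str) -> Processor:
    """Stand-in for the remove processor: drop a field from the document."""

    def processor(doc: Dict[str, str]) -> Dict[str, str]:
        doc.pop(field, None)
        return doc

    return processor


def run_pipeline(processors: List[Processor], doc: Dict[str, str]) -> Dict[str, str]:
    # Processors run in the order they are listed, each seeing the previous result.
    for processor in processors:
        doc = processor(doc)
    return doc


# Hypothetical log format and pattern.
pipeline = [
    grok("message", r"user=(?P<user>\w+) query_time=(?P<query_time>[\d.]+)"),
    remove("message"),
]
doc = run_pipeline(pipeline, {"message": "user=app query_time=0.42"})
print(doc)
```

Order matters: if `remove` ran first, `grok` would have nothing left to parse.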
  24. Painless Scripting (5.0)

    • Fast
    • Secure
    • Single function only
    • Groovy-like syntax
    • Dynamic & static typing

    # {"name":"JC", "goals":[9,27], "assists":[0,0]}
    GET /hockey-stats/_search
    {
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "lang": "painless",
              "inline": "int total = 0; for (int i = 0; i < input.doc.goals.size(); ++i) { total += input.doc.goals[i]; } return total;"
            }
          }
        }
      }
    }

    https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-painless.html
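For readers who have not seen Painless yet, the inline script above does the same thing as this short Python function applied to the sample document from the comment. The names `total_goals` and `player` are illustrative only.

```python
def total_goals(doc: dict) -> int:
    """Python equivalent of the inline script: sum the values of the goals field."""
    total = 0
    for i in range(len(doc["goals"])):
        total += doc["goals"][i]
    return total


player = {"name": "JC", "goals": [9, 27], "assists": [0, 0]}
score = total_goals(player)
print(score)
```

In the search request, this total becomes the document's score through `script_score`, so players with more goals rank higher.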
  25. Java HTTP Client (5.0)

    • Decouple server/client
    • Minimize dependencies
    • Similar to other clients

    https://github.com/elastic/elasticsearch/issues/7743
  26. Lucene 6 multi-dimensional points (5.0)

    • improves indexing of numeric fields
    • faster indexing
    • less memory at search time

    Chart: PointField compared with NumericField (NumericField = 100%) on Index Size, Index Time, Search Time, and Search Time Heap Usage; PointField values shown: 15%, 76%, 49%, 49% (Michael McCandless, https://www.elastic.co/blog/lucene-points-6.0)
  27. String Mappings (5.0)

    • The string field datatype has been replaced by
      ‒ the text field for full text analyzed content
      ‒ the keyword field for not-analyzed string values

    "city": {
      "type": "text",
      "fields": { "raw": { "type": "keyword" } }
    }

    "my_number": {
      "type": "long",
      "fields": { "raw": { "type": "keyword" } }
    }

    https://www.elastic.co/guide/en/elasticsearch/reference/master/breaking_50_mapping_changes.html
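The practical difference between the two field types is what ends up in the index. The Python sketch below approximates it: `index_text` lowercases and splits on non-word characters, a deliberately crude stand-in for a real analyzer, while `index_keyword` keeps the value as a single term. Both helpers are hypothetical.

```python
import re
from typing import List


def index_text(value: str) -> List[str]:
    """Rough stand-in for an analyzed text field: lowercase, split into words."""
    return re.findall(r"\w+", value.lower())


def index_keyword(value: str) -> List[str]:
    """A keyword field indexes the whole value as one exact term."""
    return [value]


city = "New York"
text_terms = index_text(city)
keyword_terms = index_keyword(city)
print(text_terms, keyword_terms)
```

This is why the `city` mapping above keeps both: the `text` field answers full-text searches for "york", while the `city.raw` keyword sub-field supports exact matching, sorting, and aggregations on "New York".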
  29. Kibana

    • 4.X
      ‒ Status Page
      ‒ Shield integration
      ‒ Flexibility (filters, legend colors, dark theme)
      ‒ UI framework
    • 5.0
      ‒ New design
      ‒ First-class applications
      ‒ Packs and a new plugin installer
  30. Filters

    • Edit with the full power of the Elasticsearch DSL
    • Pin it then take it with you.
    • Alias for commonly used filters
  31. Creating Kibana Apps

    # install
    npm install -g yo # install yeoman
    npm install -g generator-kibana-plugin

    # configure
    mkdir my-new-plugin
    cd my-new-plugin
    yo kibana-plugin # Generate an app skeleton
    npm start # Start the plugin development environment

    # create
    cd ../kibana
    npm start # start the kibana dev environment (needs Elasticsearch)
    # go to http://localhost:5601
  32. Packs, and a new plugin installer

    # Want to install a third party pack? Just give it a url:
    bin/kibana-plugin install https://example.com/mypack.zip

    # Or how about one of our own
    bin/kibana-plugin install timelion

    # Want security, monitoring, reporting, and graph?
    bin/kibana-plugin install x-pack
  44. Found -> Cloud

    • Easy updates - 2 clicks from all the new features
    • Kibana
    • Security
    • Monitoring
    • Flexibility (configs, plugins, ...)
    • Back up every 30 minutes
    • Easy AWS integration
  46. Beats

    Open source platform for building lightweight data shippers

    Diagram: Topbeat, Filebeat, Packetbeat, and {Community}beat built on libbeat (Beats Platform), shipping to Elasticsearch and Kibana, with Logstash optional
  47. libbeat

    • Foundation for all Beats
    • Go library
    • Just worry about how to collect (parse) the data
    • Do not worry about
      ‒ where to ship the data
      ‒ how to connect
    • Create a new beat guide
      ‒ https://www.elastic.co/guide/en/beats/libbeat/current/new-beat.html
  48. Other Inputs and Outputs

    • Outputs
      ‒ Kafka (built-in)
      ‒ Redis (built-in)
    • Inputs
      ‒ Windows event logs (built-in)
      ‒ Nginx, Apache
      ‒ Redis
      ‒ MySQL
    • https://www.elastic.co/guide/en/beats/libbeat/master/community-beats.html
  50. Logstash

    • Deprecating support for node protocol (only http) (2.0)
    • Installing Plugins Offline (2.2)
    • Config Reload (2.3)
    • Next Generation (NG) pipeline (5.0)
    • Metrics (5.0)
    • Configuration Management (5.0)
    • Persistency (5.0)
  51. Config Reloading

    Why?
    • Previously: any config change made to the file required a process restart
    • Feedback loop for development/testing was slow
    • Processing pipeline must be long-lived

    How?
    • File watched for changes, or SIGHUP triggers reload
    • Current pipeline stopped
    • Config validated
    • New pipeline started - no process restart
  53. Metrics (5.0)

    • Current web api resources (default port 9600):
      ‒ http://localhost:9600/_node/hot_threads
      ‒ http://localhost:9600/_node/stats/
      ‒ http://localhost:9600/_node/stats/events
      ‒ http://localhost:9600/_stats/jvm
      ‒ http://localhost:9600/_plugins/
      ‒ …..
  55. X-Pack

    • A new product that extends the Elastic Stack with features:
      ‒ Security (Shield) - Protect your data across the Elastic Stack.
      ‒ Alerting (Watcher) - Get notifications about changes in your data.
      ‒ Monitoring (Marvel) - Keep a pulse on the health of your stack.
      ‒ Graph - Query and visualize meaningful relationships in your data.
      ‒ Reporting - Generate, schedule, and email PDF reports.