How Workday Search Built their Metrics Pipeline with the Elastic Stack

Elastic Co
March 09, 2017

After struggling to find a traditional database that could ingest large volumes of application metrics at an acceptable rate, Workday Search noticed that each of their existing Elastic Stack deployments was able to process over 1 billion log events a week without issues. This talk will share how Workday Search expanded their deployment by implementing a robust, easy-to-use metrics processing pipeline.

Bo and Thomas will provide details on how they architected their pipeline, the scripts and frontend tools they use to visualize their data and proactively alert them to issues in production, as well as the metrics they look at to provide insight into usage patterns and facilitate intelligent product decisions.

Bodecker DellaMaria | Software Engineer | Workday
Thomas Kim | Principal Engineer | Workday

Transcript

  1. How Workday Search Built our Metrics Pipeline With the Elastic Stack

    Workday, Thursday, March 9, 2017
    Bodecker DellaMaria, Software Engineer
    Thomas Kim, Principal Engineer
    bojdell, tksfz
  2. "I'll begin with a shockingly controversial statement: programming languages vary in power." - Paul Graham

    Source: http://paulgraham.com/avg.html
  3. Outline

    1. About Workday Search
    2. Our Metrics
    3. Metrics Pipeline: Elastic Stack
    4. Metrics in Action
  4. Disclaimer

    This talk is about what works for us
    Maybe it’ll work for your team, maybe it won’t
    You should do what works for you :)
  5. 6

  6. 8

  7. How we use Elasticsearch

    Enterprise Search
    → Indexing Pipeline, Query Service
    → Multi-tenant architecture
    → Per-tenant encryption
    [Diagram: Indexing Pipeline, ES, Query Service]
  8. Types of Metrics

    Metric Type   | Examples
    System Health | Elasticsearch cluster health; process status (running, stopped)
    Correctness   | Incremental indexing coverage; query error rate
    Performance   | Full reindex times; query speed
  11. 20

  12. 21

  13. 22

  14. Metrics Pipeline: Old

    [Diagram: Indexing Pipeline, ES, Query Pipeline, Query Service, InfluxDB, Grafana, ES (Logs); legend icon = Logstash]
  18. Metrics Pipeline: New

    [Diagram: Indexing Pipeline, ES, Query Pipeline, Query Service, ES (???); legend icon = ???]
  19. Metrics Pipeline: New

    [Diagram: Indexing Pipeline, ES, Query Pipeline, Query Service, ES (Metrics); legend icon = Logstash, Marvel]
  20. Metrics Pipeline: New

    [Diagram: Indexing Pipeline, ES, Query Pipeline, Query Service, ES (Metrics); legend icon = Logstash, Marvel]
    → Logs
    → Marvel Metrics
    → App Metrics
  23. [Architecture diagram; labels: ES (Metrics): Vanilla ES Node x12; Query Service: HAProxy (x2), searcher (microservice) x8, Logstash, app.log, metrics.log; ES: Masters, Clients, Data x32, Marvel, Logstash, elasticsearch.log]
  24. Metrics Pipeline: New

    [Diagram: Indexing Pipeline, ES, Query Pipeline, Query Service, ES (Metrics); legend icon = Logstash, Marvel]
  25. Metrics Pipeline: New

    [Diagram: Indexing Pipeline, ES, Query Pipeline, Query Service, ES (Metrics), feeding Grafana, Kibana, Nagios, ...]
  27. Index Maintenance

    Daily index, store 2 weeks of data
    Use Elastic Curator to rotate indices
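
The retention rule on this slide (daily indices, two weeks kept) can be illustrated with a small Python sketch. It does not call Curator; it only shows the date arithmetic a rotation job has to perform. The index name pattern is taken from the example document later in the deck (`metrics-searcher-server-2017.02.18`), while the function names and the 14-day constant are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta

RETENTION_DAYS = 14  # "store 2 weeks of data"; the exact cutoff rule is an assumption


def index_date(index_name):
    """Parse the trailing YYYY.MM.DD suffix of a daily index name."""
    suffix = index_name.rsplit("-", 1)[-1]
    return datetime.strptime(suffix, "%Y.%m.%d").date()


def expired_indices(index_names, today, retention_days=RETENTION_DAYS):
    """Return the indices whose date falls before the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(name for name in index_names if index_date(name) < cutoff)


indices = [
    "metrics-searcher-server-2017.02.18",
    "metrics-searcher-server-2017.02.23",
    "metrics-searcher-server-2017.03.09",
]
to_delete = expired_indices(indices, today=date(2017, 3, 9))
```

With the talk date as `today`, only the index dated 2017.02.18 falls outside the window and would be handed to whatever tool performs the deletion.
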
  28. Data Breakdown - Last 2 Weeks

    Data Type | # of Data Points
    Logs      | 15,513,194,303
    Metrics   | 9,371,706,585
    Marvel    | 191,023,311
  29. Data Breakdown - Last 2 Weeks

    Data Type | # of Data Points
    Logs      | 12,825
    Metrics   | 7,747
    Marvel    | 157
  30. Data Breakdown - Last 2 Weeks

    Data Type | # of Data Points
    Logs      | ~11 TB
    Metrics   | ~3 TB
    Marvel    | ~1 TB
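
The three breakdown slides share one column header but show different units. A quick arithmetic check, assuming the window is exactly 14 days, suggests that the smaller figures (12,825 / 7,747 / 157) are average data points per second, rounded down. The sketch below is only that check; the variable names are illustrative.

```python
SECONDS_IN_TWO_WEEKS = 14 * 24 * 60 * 60  # 1,209,600 seconds

# Totals as shown on the first breakdown slide.
totals = {
    "Logs": 15_513_194_303,
    "Metrics": 9_371_706_585,
    "Marvel": 191_023_311,
}


def per_second(total, seconds=SECONDS_IN_TWO_WEEKS):
    """Average data points per second over the window, rounded down."""
    return total // seconds


rates = {name: per_second(total) for name, total in totals.items()}
```

The computed `rates` match the second breakdown slide, which is consistent with the per-second reading, although the slide itself does not state the unit.
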
  31. Data Breakdown - Last 2 Weeks

    [Chart: Usage of Metrics Cluster by Data Type (# of data points)]
  32. Logstash Conf

        ### Metrics Parsing Pipeline
        if [path] =~ /metrics.log/ {
          grok {
            match => {
              "message" => [
                "\[%{TIMESTAMP_ISO8601:timestamp}\]%{SPACE} %{WORD:type}:%{SPACE}%{GREEDYDATA:metrics_json}"
              ]
            }
          }
          if [metrics_json] {
            json {
              source => 'metrics_json'
              target => 'metrics'
            }
            ruby {
              code => "
                event['metrics']['values'].each do |k, v|
                  event['value_' + k.to_s] = v
                end
                event['metrics']['tags'].each do |k, v|
                  event['tag_' + k.to_s] = v
                end
              "
            }
            mutate {
              remove_field => [ "metrics", "metrics_json" ]
            }
          }
        }
  33. Logstash Conf

        ### Metrics Parsing Pipeline
        if [path] =~ /metrics.log/ {
          grok {
            match => {
              "message" => [
                "\[%{TIMESTAMP_ISO8601:timestamp}\]%{SPACE} %{WORD:type}:%{SPACE}%{GREEDYDATA:metrics_json}"
              ]
            }
          }
  36. Regex:

        "\[%{TIMESTAMP_ISO8601:timestamp}\]%{SPACE} %{WORD:type}:%{SPACE}%{GREEDYDATA:metrics_json}"

    Real Log Line:

        [2017-02-18 00:01:30,054] searcher_metrics: {"values":{"oms_request_deser_time":3,"es_time":291,"num_filter_iids":122,"sel_aggs_dimensions_count":0,"obj_es_query_length":6881,"es_response_deser_time":0,"query_generation_time":17,"es_response_post_proc_time":0,"cluster_index":0,"obj_total":85,"check_security_time":6,"obj_num_results":85,"query_length":7,"keywords_fields_count":15,"sel_aggregations_count":0,"es_request_ser_time":0,"searcher_time":366,"aggs_fields_count":0},"tags":{"request_id":"F5S|3CB29DF8|58A78ED9","sid":"15609$34","environment":"tsprod","tenant":"super"}}
  40. Logstash Conf (excerpt)

        if [metrics_json] {
          json {
            source => 'metrics_json'
            target => 'metrics'
          }
          ruby {
            code => "
              event['metrics']['values'].each do |k, v|
                event['value_' + k.to_s] = v
              end
              event['metrics']['tags'].each do |k, v|
                event['tag_' + k.to_s] = v
              end
            "
          }
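
The grok, json, and ruby steps can be mimicked outside Logstash to see what they do to a single log line. The Python sketch below is an approximation under stated assumptions: the regular expression is a rough stand-in for the grok pattern (not a translation of `TIMESTAMP_ISO8601`), the function name is hypothetical, and the sample line is a shortened version of the real log line shown in the deck.

```python
import json
import re

# Rough stand-in for the grok pattern: [timestamp] type: {json}
LINE_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s*(?P<type>\w+):\s*(?P<metrics_json>.*)"
)


def parse_metrics_line(line):
    """Split a metrics log line and flatten its JSON payload.

    Mirrors the filter chain: extract fields (grok), decode the payload
    (json), copy values/tags to prefixed top-level keys (ruby), and drop
    the intermediate fields (mutate).
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    event = {"timestamp": match.group("timestamp"), "type": match.group("type")}
    metrics = json.loads(match.group("metrics_json"))
    for key, value in metrics.get("values", {}).items():
        event["value_" + key] = value
    for key, value in metrics.get("tags", {}).items():
        event["tag_" + key] = value
    return event


sample = (
    '[2017-02-18 00:01:30,054] searcher_metrics: '
    '{"values":{"es_time":291,"searcher_time":366},'
    '"tags":{"environment":"tsprod","tenant":"super"}}'
)
event = parse_metrics_line(sample)
```

Flattening into `value_*` and `tag_*` keys gives every metric its own top-level field in the indexed document, which is what the dashboards and alerts then query.
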
  41.

        {
          "_index": "metrics-searcher-server-2017.02.18",
          "_type": "searcher_metrics",
          "_id": "AVpOhgbsmrDvR7Q0C5wY",
          "_score": 1,
          "_source": {
            "message": "[2017-02-18 00:01:30,054] searcher_metrics: {\"values\":{\"oms_request_deser_time\":3,\"es_time\":291,\"num_filter_iids\":122,\"sel_aggs_dimensions_count\":0,\"obj_es_query_length\":6881,\"es_response_deser_time\":0,\"query_generation_time\":17,\"es_response_post_proc_time\":0,\"cluster_index\":0,\"obj_total\":85,\"check_security_time\":6,\"obj_num_results\":85,\"query_length\":7,\"keywords_fields_count\":15,\"sel_aggregations_count\":0,\"es_request_ser_time\":0,\"searcher_time\":366,\"aggs_fields_count\":0},\"tags\":{\"request_id\":\"F5S|3CB29DF8|58A78ED9\",\"sid\":\"15609$34\",\"environment\":\"tsprod\",\"tenant\":\"super\"}}",
            "@version": "1",
            "@timestamp": "2017-02-18T00:01:30.054Z",
            "host": "shr-pdxeng-04.smn.eng.pdx.wd",
            "path": "/var/log/searcher-server/metrics.log",
            "type": "searcher_metrics",
            "value_oms_request_deser_time": 3,
            "value_es_time": 291,
            "value_num_filter_iids": 122,
            "value_sel_aggs_dimensions_count": 0,
            "value_obj_es_query_length": 6881,
            "value_es_response_deser_time": 0,
            "value_query_generation_time": 17,
  42.

            "host": "shr-pdxeng-04.smn.eng.pdx.wd",
            "path": "/var/log/searcher-server/metrics.log",
            "type": "searcher_metrics",
            "value_oms_request_deser_time": 3,
            "value_es_time": 291,
            "value_num_filter_iids": 122,
            "value_sel_aggs_dimensions_count": 0,
            "value_obj_es_query_length": 6881,
            "value_es_response_deser_time": 0,
            "value_query_generation_time": 17,
            "value_es_response_post_proc_time": 0,
            "value_cluster_index": 0,
            "value_obj_total": 85,
            "value_check_security_time": 6,
            "value_obj_num_results": 85,
            "value_query_length": 7,
            "value_keywords_fields_count": 15,
            "value_sel_aggregations_count": 0,
            "value_es_request_ser_time": 0,
            "value_searcher_time": 366,
            "value_aggs_fields_count": 0,
            "tag_request_id": "F5S|3CB29DF8|58A78ED9",
            "tag_sid": "15609$34",
            "tag_environment": "tsprod",
            "tag_tenant": "super",
            "loglevel": "INFO"
          }
  43. Scala API

        // singleton
        object StatsLogger {
          val DEFAULT_METRICS_FILE = "metrics.log"
          ...
          val RequestIdTag = "request_id"
          val ClusterIndexTag = "cluster_index"
        }

        trait StatsLogger {
          lazy val logger: Logger = ...
          def logFile: String

          // actually log a data point
          def writePoint(name: String, values: Map[String, AnyVal], tags: Map[String, String]): Unit =
            logger.info(s"$name: ${JsonUtils.toJson(Map("values" -> values, "tags" -> tags))}")
          ...
        }
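
For readers who want to try the same idea outside Scala, here is a minimal Python sketch of the `writePoint` contract: one log line per data point, with the metric name followed by a JSON object holding `values` and `tags`. It is not the team's implementation. The helper names are hypothetical, and an in-memory stream stands in for `metrics.log` so the example is self-contained.

```python
import io
import json
import logging


def make_stats_logger(stream, name="stats"):
    """Build a logger that writes '[timestamp] message' lines to a stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    # Default asctime looks like 2017-02-18 00:01:30,054
    handler.setFormatter(logging.Formatter("[%(asctime)s] %(message)s"))
    logger.handlers = [handler]
    return logger


def write_point(logger, name, values, tags):
    """Log one data point as '<name>: {"values": ..., "tags": ...}'."""
    payload = json.dumps({"values": values, "tags": tags}, separators=(",", ":"))
    logger.info("%s: %s", name, payload)


buffer = io.StringIO()  # stand-in for the metrics.log file
stats = make_stats_logger(buffer)
write_point(stats, "searcher_metrics", {"es_time": 291}, {"tenant": "super"})
line = buffer.getvalue().strip()
```

Because the application only writes a log line, it needs no knowledge of Logstash or Elasticsearch; swapping the stream for a file handler would persist the points to disk.
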
  45. More Benefits

    Pipeline is abstracted from application
    Metrics persisted to disk
    As a Search team, we know ES
    High degree of skill overlap
  46. The Challenge:

    All customer data written to disk must be encrypted with a per-tenant key.
  47. Two Options:

    1. Store all customer data unencrypted in-memory
    2. Implement on-disk encryption in Elasticsearch
  48. The Easy Way: Memory Store

    Pre-ES 2.0 → store.type: memory
    Wrote custom in-memory translog
    (We know, it was a bad idea)
  49. The Right Way: Encryption at Rest

    → Lucene Directory-level encryption (LUCENE-2228)
    → Translog encryption
    → Per-tenant keys
    → Implemented as a store plugin
  50. Metrics-Driven Validation

    [Diagram: Indexing Pipeline, Query Pipeline, Query Service, ES, ES (In-memory)]
  51. Metrics-Driven Validation

    [Diagram: ES1 (In-memory), ES2 (Encrypted, On-disk), Indexing Pipeline, Query Pipeline, Query Service; Foreground, Background]
  52. Metrics-Driven Validation

    [Diagram: ES1 (In-memory), ES2 (Encrypted, On-disk), Indexing Pipeline, Query Pipeline, Query Service; Foreground, Background]
    Compare metrics by cluster_id
    → Full reindex times
    → Query speed
    → Query relevance
  53. Metrics-Driven Validation

    [Diagram: ES1 (In-memory), ES2 (Encrypted, On-disk), Indexing Pipeline, Query Pipeline, Query Service; Foreground, Background]
    Compare metrics by cluster_id
    → Full reindex times
    → Query speed
    → Query relevance
    New correctness metrics
    → Set difference
    → Kendall tau distance
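
The slide names "Set difference" without defining it. One plausible reading, offered here purely as an assumption, is the number of results that one cluster returns and the other does not (the size of the symmetric difference). The function name is hypothetical.

```python
def set_difference_count(results_a, results_b):
    """Count items that appear in exactly one of the two result lists."""
    return len(set(results_a) ^ set(results_b))


# Same query answered by two clusters (illustrative result IDs).
foreground = ["A", "B", "C"]
background = ["B", "C", "D"]
difference = set_difference_count(foreground, background)
```

A value of zero means both clusters returned the same set of results, regardless of order; ordering differences are what the Kendall tau distance captures.
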
  54. Kendall tau distance: L1 = [A, B, C, D] vs. L2 = [B, A, D, C]

    Result Pair (item1, item2) | L1[item1] < L1[item2] ? | L2[item1] < L2[item2] ? | Diff. order?
    (A, B)                     | 0 < 1                   | 1 > 0                   | 1
    (A, C)                     | 0 < 2                   | 1 < 3                   | 0
    (A, D)                     | 0 < 3                   | 1 < 2                   | 0
    (B, C)                     | 1 < 2                   | 0 < 3                   | 0
    (B, D)                     | 1 < 3                   | 0 < 2                   | 0
    (C, D)                     | 2 < 3                   | 3 > 2                   | 1
    Kendall tau distance       |                         |                         | 2
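
The pair-by-pair count in this table can be written as a short Python function. This is a minimal sketch that assumes both lists contain the same distinct items. The normalized variant (dividing by the number of pairs) is also an assumption: the fractional thresholds quoted elsewhere in the deck suggest some normalization, but the slides do not say which one.

```python
from itertools import combinations


def kendall_tau_distance(l1, l2):
    """Count the item pairs that appear in a different relative order."""
    pos1 = {item: i for i, item in enumerate(l1)}
    pos2 = {item: i for i, item in enumerate(l2)}
    if pos1.keys() != pos2.keys():
        raise ValueError("both lists must contain the same items")
    return sum(
        1
        for a, b in combinations(l1, 2)
        if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
    )


def normalized_kendall_tau_distance(l1, l2):
    """Distance divided by the number of pairs, giving a value in [0, 1]."""
    n = len(l1)
    pairs = n * (n - 1) // 2
    return kendall_tau_distance(l1, l2) / pairs if pairs else 0.0


distance = kendall_tau_distance(["A", "B", "C", "D"], ["B", "A", "D", "C"])
```

For the two lists on the slide, the pairs (A, B) and (C, D) are the only ones in a different order, so `distance` is 2, or 2 out of 6 pairs when normalized.
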
  55. Metrics-Driven Validation

    [Diagram: ES1 (In-memory), ES2 (Encrypted, On-disk), Indexing Pipeline, Query Pipeline, Query Service; Foreground, Background]
    Identical data
    → Identical scoring
    → Identical order
    → KT Distance = 0?
  56. Metrics-Driven Validation

    [Diagram: ES1 (In-memory), ES2 (Encrypted, On-disk), Indexing Pipeline, Query Pipeline, Query Service; Foreground, Background]
    Identical data
    → Identical scoring
    → Identical order
    → KT Distance = 0?
    KTD > 0.5 for 30% of queries
    → Something is seriously wrong...
  57. Metrics-Driven Validation

    Virtually all major discrepancies traced to multi-shard tenants
    For single-shard tenants:
    KTD > 0 for ~3.8% of queries
    KTD > 0.05 for ~0.3%
  59. Outline

    1. About Workday Search
    2. Our Metrics
    3. Metrics Pipeline: Elastic Stack
    4. Metrics in Action
  60. What’s Next for Our Metrics?

    Metrics Pipeline 3.0
    ES 1.7 → ES 5.2
    Logstash → Filebeat
    Grafana → Kibana