The telecom's event space with Elasticsearch

Achieving a metrics data lake with Elastic stack to support performance engineering tasks and speed up troubleshooting.

filipealmeida

January 05, 2017
Transcript

  1. 1 Elasticsearch - The telecom’s event space with Elasticsearch

    The telecom's event space with Elasticsearch
    https://docs.google.com/presentation/d/1H_r6DuYSK1eq9rLlKMxRX4lqyDXz7zdJ6awjrqvmg0o
  2. Who we are and what we do

    Filipe Almeida, Console Jockey: https://www.linkedin.com/in/thespecialone
    Daniel Valente, Data Juggler: https://www.linkedin.com/in/danielvalente
  3. Presentation’s roadmap

    The pains of being a performance engineer
    • Mostly problem solving, but the purpose is to optimize.
    • Many sources of information.
    • Almost every source has its own format.
    • Many interactions needed.
    • Starting a problem-solving task takes a long time.

    Alleviating the pain
    • Elastic stack (and friends) to the rescue: the setup.

    Life afterwards, a few examples
    1. Shortcutting RCA.
    2. Correlation: measuring the impact of events on performance.
    3. Freestyle exploratory data analysis.
    4. Collaborating in devops.
  4. Who we are and what we sometimes still do

    IT notables sorting out the logjam
  5. The Problems and the problem

    Getting to problem solving takes a long time. Let’s speed it up!
  6. Shipping (1)

    If we have it, we don’t need to fetch it.
  7. Data normalization, transformation and enrichment (3) - Mapping

    {
      "mytemplateforwindows" : {
        ...
        "template" : "windows_*",
        "settings" : { ... },
        "mappings" : {
          "_default_" : {
            "dynamic_templates" : [
              {
                "strings" : {
                  "mapping" : { "index" : "not_analyzed", "type" : "string", "doc_values" : true },
                  "match_mapping_type" : "string",
                  "match" : "*"
                }
              },
              {
                "common" : {
                  ...
                  "metric_value" : { "type" : "double" },
                  "metric_string" : { "index" : "not_analyzed", "type" : "string" },
                  "metric_name" : { "index" : "not_analyzed", "type" : "string" },
                  "sourcetype" : { "index" : "not_analyzed", "type" : "string" },
                  "serviceid" : { "index" : "not_analyzed", "type" : "string" },
  8. Data normalization, transformation and enrichment (3) - Dataminion

    {
      "description": "Stream definition, a pair of input/output to be handled by a dataminion",
      "input": {
        "classname": "dataminion.input.ampq.RabbitMQ",
        "enabled": true,
        "ack": false,
        "add_field": { "shipper": "dataminion" },
        "auto_delete": false,
        "codec": "json",
        "consumer_tag": "dataminion",
        "durable": true,
        "exchange": "windowslogs",
        "exchange_type": "topic",
        "exclusive": false,
        "host": "greathost.lan",
        "key": "thatkey",
        "passive": false,
        "password": "thatpassword",
        "port": 5672,
        "prefetch_count": 128,
        "queue": "theRAWqueue",
        "ssl": false,
        "tags": [],
        "threads": 1,
        "type": "is deprecated",
        "user": "someuser",
        "verify_ssl": false,
        "vhost": "prd"
      },
      "filter": {
        "classname": "dataminion.filter.windowslogs.MouraoMagic",
        "coordinator_root": "/nexus/knowledgebase/headers"
      },
      "output": {
        "classname": "dataminion.output.ampq.RabbitMQ",
        "add_field": { "shipper": "dataminion" },
        "codec": "json",
        "durable": true,
        "exchange": "windows_es",
        "exchange_type": "topic",
        ...
  9. Indexing (4) - Storing, querying and delivering data

    Elasticsearch nodes: MASTER#1, MASTER#2, DATA#1, DATA#2, QUERY#1, QUERY#2
    Data in: from AMQP via Logstash; from Probespawner
    Data out: Kibana; Grafana; queries, scripts and tailored viz; ES-Hadoop/Spark
  10. Explore (5) - The awesome of data visualization

    Tools: Kibana, Grafana
    Data sources: Elasticsearch, Influx/OpenTSDB
  11. Setup overview

    Hence, the actual setup:
    1. Shipping: data gathering/reception from any source.
    2. Data routing
    3. Data normalization, transformation and enrichment
    4. Indexing
    5. Data exploration and reporting
  12. Data through the pipeline: some examples

    Metrics (perfmon, collectl, SNMP, syslog, ...)
    Events (scheduled operations, tests, business and user activity, ...)
    Logs (apache, iis, syslog, event viewer, DB stashes, ...)
    Diagnostics (heap dumps, stack traces, probes, DB monitoring, health checks, ...)

    Microsoft performance monitor
    1. Create a new user-defined data collector set in Perfmon and store the data as CSV.
    2. Have NXLOG ship the data through a TCP socket, encapsulated in JSON, adding file, host and type fields.
    3. Logstash receives the data and publishes it to an AMQP queue.
    4. Dataminion parses, normalizes and enriches the data, then publishes it to the index queue.
    5. Logstash indexes the data from the AMQP idx queue into Elasticsearch.

    Scheduled operations, business events
    1. Scheduled interventions are read from the database using Probespawner (JDBC).
    2. Data is enriched and normalized with Probespawner filters.
    3. Data is indexed by Probespawner.

    Microsoft log files
    1. Configure NXLOG to ship the files through a TCP socket, encapsulated in JSON, adding file, host and type fields.
    2. Logstash receives the data and publishes it to an AMQP queue.
    3. Dataminion parses, normalizes and enriches the data, then publishes it to the index queue.
    4. Logstash indexes the data from the AMQP idx queue into Elasticsearch.

    Stack traces
    1. Diagnostics tools are configured to ship data directly to AMQP queues (e.g.: a Java stack trace, tcpdump, sysdig, ...).
    2. Either Dataminion handles the raw data and places it in the index queue, or the diagnostic tool does it by itself (*).
    3. Logstash indexes the data from the AMQP idx queue into Elasticsearch.
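The normalization step of the Microsoft performance monitor path above could look roughly like the following Python sketch. It is not Dataminion's code. It assumes Perfmon's CSV layout (first column is the sample time, the remaining columns are counter paths such as `\\HOST\Object(instance)\Counter`), and the function name `normalize_perfmon_row`, the sample host `WEB01`, the file name `perfmon.csv` and the exact set of output fields are hypothetical, chosen to echo the `metric_name` and `metric_value` fields of the mapping slide and the file, host and type fields added by NXLOG.

```python
import csv
import io
import json
from datetime import datetime


def normalize_perfmon_row(header, row, file_name, event_type="perfmon"):
    """Turn one Perfmon CSV sample into one flat document per counter."""
    # Assumption: the first column is the sample time, MM/DD/YYYY HH:MM:SS.mmm.
    timestamp = datetime.strptime(row[0], "%m/%d/%Y %H:%M:%S.%f")
    docs = []
    for counter_path, raw_value in zip(header[1:], row[1:]):
        # Counter paths look like \\HOST\Object(instance)\Counter.
        parts = counter_path.lstrip("\\").split("\\", 1)
        if len(parts) != 2:
            continue  # not a counter path this sketch understands
        host, metric_name = parts
        try:
            value = float(raw_value)
        except ValueError:
            continue  # blank cell: the sample is missing for this counter
        docs.append({
            "@timestamp": timestamp.isoformat(),
            "host": host,
            "file": file_name,
            "type": event_type,
            "metric_name": metric_name,
            "metric_value": value,
        })
    return docs


# Hypothetical two-counter sample; the second counter has no value.
SAMPLE_CSV = (
    '"(PDH-CSV 4.0) (GMT Standard Time)(0)",'
    '"\\\\WEB01\\Processor(_Total)\\% Processor Time",'
    '"\\\\WEB01\\Memory\\Available MBytes"\n'
    '"01/30/2016 17:36:42.000","12.5"," "\n'
)

rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))
docs = normalize_perfmon_row(rows[0], rows[1], "perfmon.csv")
for doc in docs:
    print(json.dumps(doc))
```

Emitting one document per counter, rather than one wide document per sample, keeps every metric queryable through the same two fields regardless of which source produced it.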
  13. Interesting stuff

    Very interesting stuff! What now?
  14. Shortcutting in RCA

    Logfile: “Somelogfile.log”
    ...
    Início PASSO: P062 30-01-2016 17:36:42
    Início PASSO: P063 30-01-2016 17:38:38
    Início PASSO: P064 30-01-2016 17:40:02
    Início PASSO: P070 30-01-2016 17:45:55
    Início PASSO: P071 30-01-2016 17:58:24
    Início PASSO: P073 30-01-2016 18:00:17
    ...

    A Perl script computes the elapsed time between steps and returns JSON:
    {"@timestamp":"2016-01-30T17:36:42+0000","EXECUTION_TIME":104000,"component":"PACK_DWPRQMOVSRV","filename":"PRQ_SRV_Ph.PACK_DWPRQMOVSRV.DWPRQSRVDP30Q.sql.20160129.2.log","message":"Início PASSO: P062 30-01-2016 17:36:42","process":"PRQ_SRV_Ph","segment":"PRQ_SRV_Ph","shipper":"parser","state":"running","step":"P061"}
    {"@timestamp":"2016-01-30T17:38:38+0000","EXECUTION_TIME":116000,"component":"PACK_DWPRQMOVSRV","filename":"PRQ_SRV_Ph.PACK_DWPRQMOVSRV.DWPRQSRVDP30Q.sql.20160129.2.log","message":"Início PASSO: P063 30-01-2016 17:38:38","process":"PRQ_SRV_Ph","segment":"PRQ_SRV_Ph","shipper":"parser","state":"running","step":"P062"}
    {"@timestamp":"2016-01-30T17:40:02+0000","EXECUTION_TIME":84000,"component":"PACK_DWPRQMOVSRV","filename":"PRQ_SRV_Ph.PACK_DWPRQMOVSRV.DWPRQSRVDP30Q.sql.20160129.2.log","message":"Início PASSO: P064 30-01-2016 17:40:02","process":"PRQ_SRV_Ph","segment":"PRQ_SRV_Ph","shipper":"parser","state":"running","step":"P063"}
    {"@timestamp":"2016-01-30T17:45:55+0000","EXECUTION_TIME":353000,"component":"PACK_DWPRQMOVSRV","filename":"PRQ_SRV_Ph.PACK_DWPRQMOVSRV.DWPRQSRVDP30Q.sql.20160129.2.log","message":"Início PASSO: P070 30-01-2016 17:45:55","process":"PRQ_SRV_Ph","segment":"PRQ_SRV_Ph","shipper":"parser","state":"running","step":"P064"}
    {"@timestamp":"2016-01-30T17:58:24+0000","EXECUTION_TIME":749000,"component":"PACK_DWPRQMOVSRV","filename":"PRQ_SRV_Ph.PACK_DWPRQMOVSRV.DWPRQSRVDP30Q.sql.20160129.2.log","message":"Início PASSO: P071 30-01-2016 17:58:24","process":"PRQ_SRV_Ph","segment":"PRQ_SRV_Ph","shipper":"parser","state":"running","step":"P070"}
    {"@timestamp":"2016-01-30T18:00:17+0000","EXECUTION_TIME":113000,"component":"PACK_DWPRQMOVSRV","filename":"PRQ_SRV_Ph.PACK_DWPRQMOVSRV.DWPRQSRVDP30Q.sql.20160129.2.log","message":"Início PASSO: P073 30-01-2016 18:00:17","process":"PRQ_SRV_Ph","segment":"PRQ_SRV_Ph","shipper":"parser","state":"running","step":"P071"}

    Each event is stamped with the moment the next step starts: EXECUTION_TIME is the time, in milliseconds, that the previous step (the "step" field) took.

    Next, we run the one-liner to populate Elasticsearch with the generated metrics:
    $ ./parsefiles.pl PRQ_SRV_Ph/PRQ_SRV_Ph.PACK_DWPRQMOVSRV.DWPRQSRVDP30Q.sql.20160129.2.log | ~/opt/logstash/bin/logstash -f ~/etc/jsonin_ampqout.conf

    Finally, insight through Kibana: “who took the greater time slice?”, “what component deviates the most?”, ...
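The elapsed-time computation described above can be sketched in Python. This is not the Perl script from the slide, only a minimal re-creation of its visible behaviour: each event carries the timestamp of the line that starts the next step, and `EXECUTION_TIME` is the gap to the previous step's start, in milliseconds. The names `STEP_LINE` and `elapsed_events` are hypothetical, the timestamps are assumed to be UTC (the slide's output shows `+0000`), and fields such as `component` or `filename` are left out.

```python
import json
import re
from datetime import datetime, timezone

# Matches the step name and start time in lines like
# "Início PASSO: P062 30-01-2016 17:36:42".
STEP_LINE = re.compile(
    r"PASSO: (?P<step>\S+) (?P<stamp>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})"
)


def elapsed_events(lines):
    """Yield one event per finished step, stamped when the next step begins."""
    previous = None  # (step name, start time) of the step still running
    for line in lines:
        match = STEP_LINE.search(line)
        if match is None:
            continue
        started = datetime.strptime(
            match["stamp"], "%d-%m-%Y %H:%M:%S"
        ).replace(tzinfo=timezone.utc)  # assumption: log times are UTC
        if previous is not None:
            prev_step, prev_started = previous
            elapsed_ms = int((started - prev_started).total_seconds() * 1000)
            yield {
                "@timestamp": started.strftime("%Y-%m-%dT%H:%M:%S%z"),
                "EXECUTION_TIME": elapsed_ms,
                "step": prev_step,
                "message": line.strip(),
            }
        previous = (match["step"], started)


LOG = """\
Início PASSO: P062 30-01-2016 17:36:42
Início PASSO: P063 30-01-2016 17:38:38
Início PASSO: P064 30-01-2016 17:40:02
"""

events = list(elapsed_events(LOG.splitlines()))
for event in events:
    print(json.dumps(event, ensure_ascii=False))
```

For the three sample lines this yields two events, for steps P062 and P063, whose `EXECUTION_TIME` values agree with the corresponding entries in the slide's output.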
  15. Event HUD with Grafana

    • Probespawner periodically grabs the operations’ schedule (JDBC) and stores it directly in Elasticsearch, with tags according to the message class.
    • Nagios alert information is sent through the RAW queue to Elasticsearch (tags: alert).
    • Logs and scripts, where configured, dispatch special events (tags: shutdown, start, deploy, etc.).
    • Configure Grafana to query Elasticsearch (tags: {opevent,deploy,shutdown,alert,...}).
    • Enjoy the overlay of events along with your metrics.
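To make the tagging idea concrete, here is a hedged Python sketch of a tagged event document and of two ways to select events by tag. The field names `@timestamp`, `message` and `tags` follow common Logstash conventions and are assumptions about this setup, as are the helper names. The query body uses the `bool`/`filter` form of the Elasticsearch query DSL, which assumes Elasticsearch 2.0 or later; the Lucene-style string is the kind of expression a dashboard query box typically accepts.

```python
import json
from datetime import datetime, timezone

# Tags listed on the slide; the real list is longer ("...").
EVENT_TAGS = ["opevent", "deploy", "shutdown", "alert"]


def make_event(message, tags, when):
    """Build a minimal tagged event document (hypothetical field layout)."""
    return {"@timestamp": when.isoformat(), "message": message, "tags": list(tags)}


def lucene_tag_query(tags):
    """Lucene query string matching documents that carry any of the tags."""
    return "tags:(" + " OR ".join(tags) + ")"


def dsl_tag_query(tags, since="now-24h"):
    """Query DSL body: any of the tags, restricted to a recent time window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"tags": list(tags)}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        }
    }


event = make_event(
    "Scheduled deploy started",  # hypothetical message
    ["deploy"],
    datetime(2016, 1, 30, 17, 36, 42, tzinfo=timezone.utc),
)
print(json.dumps(event))
print(lucene_tag_query(EVENT_TAGS))
print(json.dumps(dsl_tag_query(EVENT_TAGS), indent=2))
```

Keeping the tag vocabulary small and consistent across Probespawner, Nagios and the scripts is what lets a single query overlay all event classes on one chart.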
  17. Exploratory data analysis

    Kibana is paramount for drilling down and sifting through the data to find what is relevant to solving a problem. When troubleshooting, we usually plunge first into the dashboard on the right. There we can watch and follow any given metric in any asset. Let’s find out if we had any problem in the past few days.
  18. DEVOPS - Continuous testing (kind of)

    1. Create a test plan in JMeter.
    2. Have every target ship information (metrics and events from logs) to Elasticsearch.
    3. Add backend listeners to your test plan, configured for the Graphite input.
    4. Create a dashboard with the pertinent data and automate reporting... or just watch it happen.
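The Graphite input mentioned above receives metrics in Graphite's plaintext protocol, one `path value timestamp` line per sample. The Python sketch below formats and parses such lines to show what travels between the test plan and the pipeline. The metric path `jmeter.checkout.ok.count` is a made-up example, and the mapping onto `metric_name` and `metric_value` is an assumption that borrows the field names from the mapping slide.

```python
from datetime import datetime, timezone


def format_graphite_line(path, value, timestamp):
    """Render one sample in Graphite's plaintext form: '<path> <value> <epoch>'."""
    return f"{path} {value} {int(timestamp)}\n"


def parse_graphite_line(line):
    """Turn a plaintext Graphite sample into a flat metric document."""
    path, value, timestamp = line.split()
    when = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return {
        "@timestamp": when.isoformat(),
        "metric_name": path,
        "metric_value": float(value),
    }


# Hypothetical sample: 42 successful requests counted at 2016-01-30 17:36:42 UTC.
line = format_graphite_line("jmeter.checkout.ok.count", 42, 1454175402)
doc = parse_graphite_line(line)
print(line, end="")
print(doc)
```

Because the load generator's samples end up in the same shape as the target's own metrics, both can share a dashboard and a time axis.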
  19. Difficulties, problems, issues, ghouls and some statistics

    • 3000+ sources
    • 250k+ queries daily
    • 1800M documents (max stable with the current distribution of document types)
    • ~1500M documents present daily (1,549,239,356)
    • 475 indexes
    • Approximately 720 GB of data per node (754525372 Kbytes)
    • Running on commodity hardware (no SSD)
    • Data nodes: 2 x 16 GB Quad Xeon (no swap)
    • Master nodes: 2 x 8 GB Dual Core x64 (ESX)
    • Query nodes: 2 x 8 GB Quad i5 (Desktops)
    • Logstash: 2 inputs; DNS round-robin
    • ~12000 TPS max stable (more than that causes delays)
    • 2030 TPS avg (24h)
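A few derived figures help put the statistics above in proportion. The Python sketch below is back-of-the-envelope arithmetic only. It assumes that a "Kbyte" is 1024 bytes, that both data nodes hold about the same volume, and that the per-node figure covers everything stored on the node (including replicas, if any); none of this is stated on the slide.

```python
# Figures quoted on the slide.
data_kbytes_per_node = 754_525_372
documents = 1_549_239_356
indexes = 475
data_nodes = 2
avg_tps = 2030
max_tps = 12_000

# Assumption: 1 Kbyte = 1024 bytes, so 1 GiB = 1024 * 1024 Kbytes.
gib_per_node = data_kbytes_per_node / (1024 * 1024)

# Average number of documents held by each index.
docs_per_index = documents / indexes

# Rough on-disk footprint per document, assuming both data nodes are equally full.
bytes_per_doc = data_nodes * data_kbytes_per_node * 1024 / documents

# Events ingested over 24 hours at the average rate.
events_per_day = avg_tps * 24 * 60 * 60

# How far the stable maximum sits above the daily average.
headroom = max_tps / avg_tps

print(f"{gib_per_node:.1f} GiB per data node")
print(f"{docs_per_index:,.0f} documents per index on average")
print(f"{bytes_per_doc:.0f} bytes on disk per document")
print(f"{events_per_day:,} events per day at the average rate")
print(f"{headroom:.1f}x headroom between average and max stable TPS")
```

Under these assumptions the per-node figure works out to just under 720 GiB, which is consistent with the "approximately 720" quoted on the slide.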
  20. Forging now

    Event producers
    • Get and produce events from:
      ◦ Nagios
      ◦ HP Performance Manager
      ◦ Applications (log entries)
    • Agent to supervise metrics and produce events on off-pattern metrics.

    ML
    • Open Spark access to work the data lake.
    • Harness Dark Data.
    • (near) Automatic RCA.
    • Performance recommendations engine.
  21. Next steps

    Big data program direction
    • Milestone #1: Scale to support business events
    • Milestone #2: Scale to support client events
    • Head towards total DW/DM replacement

    Logging/Data normalization (structure)
    • Establish and enforce logging strategies
    • Have paths available to receive logs from many flavours
    • Reduce the number of vertices towards Elasticsearch

    Interaction
    • Intelligent/scripted operation (e.g.: Slack bots)

    Alerting
    • Automatic handling of events, humans the last line of defense

    Security
    • Protect and segregate the data

    Reporting
    • Have reports sent via email with charts and tables, etc.
  22. Ask me anything
  23. The telecom's event space with Elasticsearch

    DIT - Direção de IT
    TCS/PER - Perf, Reliab & Tools