Observe all your applications

This talk gives an overview of how the SAP HANA QA team monitors a huge stack, from 2,000 physical machines up to 10,000 Python application processes, micro-service instances and batch processing jobs running in parallel.

Christoph Heer

October 25, 2018

Transcript

  1. Background

    • Quality assurance for SAP HANA
    • Automated testing of >700 commits per day
    • Testing with physical hardware: 1400 nodes, 560 TB RAM (256 GB – 8 TB)
    • Development of special tools and services for ~650 HANA developers

    PUBLIC © 2018 SAP SE or an SAP affiliate company. All rights reserved.
  2. “Anything that can go wrong will go wrong” (Murphy’s law)
  3. Something goes wrong

    Stages: Identify, Analyze, Solve

    • How and when do you know about a problem?
    • What is the impact of the problem?
    • How can you analyze the problem?
    • Can you build a fix with the available data?
    • How do you prove the fix works as expected?
  4. Observability

    • Observability tools: Logging, Metrics, Distributed Tracing, Errors
    • [Diagram labels: System, Developer, User, analyze, improve]
  5. Logging: Messages

    • Maybe obvious but powerful observability tool
    • Log messages are for humans
    • Writing good log messages is a hard problem

    Bad
    • Acquire lock
    • Skip cache eviction because cache size 744352895045 bytes is already below target 1099511627776 bytes

    Better
    • Blocking attempt to acquire lock: /scheduler/master
    • Skip cache eviction because current cache size (744 GB) is already below configured target size (1 TB)
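The "Better" cache message above can be produced with a small formatting helper. This is a minimal sketch, not the team's code: `human_size` and `eviction_skip_message` are hypothetical names, and decimal units (1 GB = 1000³ bytes) with rounding to whole numbers are an assumption chosen because they reproduce the figures on the slide.

```python
def human_size(num_bytes):
    """Render a byte count with a decimal unit, rounded to a whole number."""
    units = ["bytes", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1000,
        # or at the largest unit we know about.
        if value < 1000 or unit == units[-1]:
            return f"{value:.0f} {unit}"
        value /= 1000


def eviction_skip_message(current_bytes, target_bytes):
    """Build the human-oriented variant of the cache eviction message."""
    return (
        "Skip cache eviction because current cache size (%s) is already "
        "below configured target size (%s)"
        % (human_size(current_bytes), human_size(target_bytes))
    )
```

Rounding to whole units loses precision on purpose: the message is meant for a person scanning logs, while exact values belong in metrics.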
  6. Logging: Formatting

    Python default formatting:

    >>> import logging
    >>> logging.basicConfig()
    >>> logging.error('Test')
    ERROR:root:Test

    Our default formatting:

    [2018-10-06 12:50:45 +0000] 1 MainThread application.package.module INFO

    Fields, in order: timestamp with timezone info, PID, thread, logger, level
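A format like the one shown above can be approximated with the standard `logging` module. The sketch below is an assumption about how such a layout could be built, not the formatter used by the team; `UTCFormatter`, `LOG_FORMAT` and `make_handler` are hypothetical names. The timestamp is rendered through `datetime` so that the `+0000` offset does not depend on the platform's `strftime` behaviour.

```python
import logging
from datetime import datetime, timezone

# Timestamp, PID, thread name, logger name, level, then the message.
LOG_FORMAT = (
    "[%(asctime)s] %(process)d %(threadName)s "
    "%(name)s %(levelname)s %(message)s"
)


class UTCFormatter(logging.Formatter):
    """Formatter that always prints UTC timestamps with an explicit offset."""

    def formatTime(self, record, datefmt=None):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return created.strftime(datefmt or "%Y-%m-%d %H:%M:%S %z")


def make_handler(stream=None):
    """Return a stream handler (stderr by default) using the layout above."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(UTCFormatter(LOG_FORMAT))
    return handler
```

Attaching the handler with `logging.getLogger().addHandler(make_handler())` would apply the layout to all loggers that propagate to the root logger.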
  7. Logging: Central collection

    • Distributed system => Distributed logs
    • ssh + grep doesn’t scale well

    Common architecture for central log collection
    [Diagram components: Application, Log file, Processor, Parsing Rules, Datastore, Interface, Developer]

    Parsing rule:

    ^\[(?P<timestamp>\d+-\d+-\d+ \d+:\d+:\d+ \+\d+)\] (?P<pid>\d+) (?P<thread_name>\S+) (?P<logger>\S+) (?P<log_level>\S+) (?P<message>.*?)$
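The parsing rule above is written with Python-style named groups, so it can be tried directly with the `re` module. This is only an illustration of what a log processor has to do per line; `LOG_LINE` and `parse_line` are hypothetical names and the sample message in the usage note is made up.

```python
import re

# The parsing rule from the slide, split across lines for readability.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>\d+-\d+-\d+ \d+:\d+:\d+ \+\d+)\] "
    r"(?P<pid>\d+) (?P<thread_name>\S+) (?P<logger>\S+) "
    r"(?P<log_level>\S+) (?P<message>.*?)$"
)


def parse_line(line):
    """Return the named fields of a formatted log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None
```

For example, `parse_line("[2018-10-06 12:50:45 +0000] 1 MainThread application.package.module INFO Cache warmed up")` yields a dictionary with the keys `timestamp`, `pid`, `thread_name`, `logger`, `log_level` and `message`. Lines that do not follow the layout, such as the continuation lines of a traceback, return `None` and need separate handling.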
  8. Logging: Central collection

    • Formatted log files are for humans
    • Integrate log collection into application’s logging framework

    Internal in-house architecture for central log collection
    [Diagram components: Application, stdout, API, SAP HANA, HTTP API, Web Interface, Developer]
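Integrating collection into the logging framework means the structured fields never have to be parsed back out of formatted text. A minimal sketch with the standard `logging.Handler` interface follows; `CollectingHandler` and its `send` callable are hypothetical, and how events actually reach the in-house API is not described in the slides.

```python
import logging


class CollectingHandler(logging.Handler):
    """Turn each log record into a structured event and pass it to `send`."""

    def __init__(self, send):
        super().__init__()
        # `send` is any callable taking one dict, e.g. a client for a
        # collection API (assumed here, not part of the original talk).
        self._send = send

    def emit(self, record):
        try:
            event = {
                "timestamp": record.created,
                "pid": record.process,
                "thread_name": record.threadName,
                "logger": record.name,
                "log_level": record.levelname,
                "message": record.getMessage(),
            }
            self._send(event)
        except Exception:
            # Never let a collection failure break the application.
            self.handleError(record)
```

A human-readable stream handler can stay attached next to it, so developers still get formatted output on stdout while the collector receives the raw fields.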
  9. Errors

    • Central log collection unveils exception:

      [2018-10-02 07:46:03 +0000] 1 MainThread ERROR Exception while processing HTTP request: AttributeError: 'NoneType' object has no attribute 'status_code'

    Central error/exception collection
    • Open-Source/SaaS solution: [product logo not captured in transcript]
    • Specialized in error cases
    • Errors are rare events, so collect as much data & context as possible
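To illustrate what "as much data & context as possible" can mean for a single exception, the sketch below builds an event from the traceback, including the local variables of every frame. It uses only the standard library and does not represent the API of the solution used in the talk; `build_error_event` and the `context` keys are hypothetical.

```python
import traceback
from datetime import datetime, timezone


def _safe_repr(value):
    """repr() that cannot raise, so error reporting never fails itself."""
    try:
        return repr(value)
    except Exception:
        return "<unrepresentable>"


def build_error_event(exc, context=None):
    """Collect type, message, stack frames with locals, and caller context."""
    frames = []
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        frames.append({
            "function": frame.f_code.co_name,
            "filename": frame.f_code.co_filename,
            "lineno": lineno,
            "locals": {k: _safe_repr(v) for k, v in frame.f_locals.items()},
        })
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "frames": frames,  # outermost frame first, failing frame last
        "context": dict(context or {}),
    }
```

With the locals of the failing frame attached, the `AttributeError` above would show directly that the response object was `None`. Local variables can contain credentials or personal data, so a real collector would need to filter them before sending.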
  10. Metrics

    • Error: single event at single point in time
    • Metrics: Multiple events over time (e.g. interval) with data points
    • Collecting various host/container metrics with own daemon

    Layers: Hardware, Operating system, Container, Application
    • CPU, Memory, IO, Network, Disk space, HW RAID status
    • User login (event), Queue length (utilization), Response time (latency), Health (state)
    [Diagram labels: daemon, SAP HANA]
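A metric differs from an error event mainly in shape: a name, a numeric value, a timestamp and some tags, sampled repeatedly. The following sketch shows one way a collection daemon could sample disk space using the standard library; `MetricPoint`, `collect_disk_metrics` and the metric names are hypothetical and say nothing about the team's own daemon.

```python
import shutil
import time
from dataclasses import dataclass, field


@dataclass
class MetricPoint:
    """One data point of a time series."""
    name: str
    value: float
    timestamp: float
    tags: dict = field(default_factory=dict)


def collect_disk_metrics(path="/", now=None):
    """Sample disk usage for `path` and return one point per measurement."""
    usage = shutil.disk_usage(path)
    timestamp = time.time() if now is None else now
    return [
        MetricPoint("disk.total_bytes", usage.total, timestamp, {"path": path}),
        MetricPoint("disk.used_bytes", usage.used, timestamp, {"path": path}),
        MetricPoint("disk.free_bytes", usage.free, timestamp, {"path": path}),
    ]
```

Calling this function at a fixed interval and shipping the points to a datastore produces the series of data points over time that the slide contrasts with single error events.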
  11. Metrics

    • Without visualization metrics are often useless
    • Visualization and alerting with: [tool logo not captured in transcript]
    • Enable developers to build dashboards
    • Dashboard per service/application maintained by developer
    • Allow ad-hoc queries for incident analysis
    • Prove problem resolution
  12. Distributed tracing

    • Typical monolith: application + database
        • Performance problem: Inspect application log & expensive statement log
    • Distributed system with various components
        • Performance problem: Inspect more logs and metrics
    • Pass unique ID through the system and attach it to all events (timespans, logs and metrics)
    • On-going experiments with OpenTracing & Jaeger
    • OpenTracing 2.0 is a major step for the Python API
    • Basic instrumentation revealed various areas for performance improvements
    • Visualization helps to understand time and application flow
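The core idea of passing a unique ID through the system and attaching it to all events can be sketched without any tracing library. The example below uses `contextvars` and a `logging.Filter` from the standard library; it is not the OpenTracing or Jaeger API, and `start_trace`, `TraceIdFilter`, `outgoing_headers` and the `X-Trace-Id` header name are hypothetical.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being processed; "-" means none.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


def start_trace(trace_id=None):
    """Adopt an incoming trace ID, or create a new one at the system's edge."""
    trace_id = trace_id or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record passing through."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def outgoing_headers():
    """Headers to add to calls to other services so the ID travels along."""
    return {"X-Trace-Id": trace_id_var.get()}
```

With the filter installed on a handler, `%(trace_id)s` becomes available in the log format, and a downstream service that calls `start_trace` with the received header value logs under the same ID. Tracing systems add timing spans and parent-child relations on top of this correlation.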
  13. ???

  14. Conclusion

    • Observability != Monitoring
    • Improve application observability
    • Allows monitoring with better data
    • Faster & data-driven incident handling
    • Targeted performance tuning
    • Easy integration and visualization is key to convince developers
    • Better instrumentation => higher data quality
    • Enable developers to use tools
  15. Thank you.

    Christoph Heer
    [email protected]
    @ChristophHeer

    We are hiring!
    • Python Developer: https://jobs.sap.com/s/6UvDKk
    • Cloud Engineer: https://jobs.sap.com/s/Pw235g