Slide 1

Slide 1 text

© Hitachi, Ltd. 2024. All rights reserved. Unlocking the Power of OpenTelemetry: Enhancing Design, Development, and Testing Oct. 28, 2024 Takaya Ide, Yasuo Nakashima Services Computing Research Dept. Research & Development Group, Hitachi, Ltd.

Slide 2

Slide 2 text

© Hitachi, Ltd. 2024. All rights reserved. Today’s complex distributed systems involve multiple interacting services. This makes design, debugging, and testing increasingly challenging. Increasing Complexity of Development 1 1. Top Software Development Trends for 2024, SIMFORM 2. Business Wire. 76% of CIOs Say It’s Impossible to Manage Digital Performance Complexity, 2018 3. DORA. (2022). State of DevOps 2022. DevOps Research and Assessment. 74% of orgs use microservice architecture 1 76% of CIOs Say It Could Become Impossible to Manage Digital Performance, as IT Complexity Soars 2 42% of orgs use hybrid cloud 3 “82% of software makers report defects associated with undiagnosed test failures causing production problems” 5 “Developers say they tend to spend 25–50% of their time per year on debugging” 6 “Among projects over 500 person-months, 51.7% missed deadlines, and 40.4% exceeded budgets” 4 4. JUAS, “企業IT動向調査報告書2024 (Corporate IT Trends Survey Report 2024)”, 2024 5. Undo.io, Optimizing the software supplier and customer relationship, 2020 6. Undo, “Time spent debugging software”

Slide 3

Slide 3 text

© Hitachi, Ltd. 2024. All rights reserved. How should we address such issues? Challenges Faced by Developers 2 What value will implementing a cache server bring? Or will it just add burdens in terms of operations and costs? When a DB failure occurs, how will the effects spread? What is the risk of cascading failures? What is the impact on latency? What processing is the bottleneck in OIDC authentication? It’s tough because parameter adjustments shift the bottleneck.

Slide 4

Slide 4 text

© Hitachi, Ltd. 2024. All rights reserved. OpenTelemetry (OTel) 3

Slide 5

Slide 5 text

© Hitachi, Ltd. 2024. All rights reserved. • OpenTelemetry is an open-source observability framework and spec. • Measuring, Collecting, Processing, Exporting telemetry signals OpenTelemetry (OTel) 4 App1 OTel API/SDK App2 OTel Auto-Inst. OTel Collector ... ... Other Monitoring Tools / Services Measuring Collecting Processing Exporting Storing Analyzing Visualizing Signals Application Monitoring Input Output Signals Signals
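As a concrete picture of the Measuring → Exporting path, here is a minimal sketch using the Python SDK with an OTLP/gRPC exporter pointed at a local Collector; the service name and endpoint are assumptions for illustration, not part of the slide.

# Minimal sketch: measure a span in the app and export it to an OTel Collector.
# Assumes: pip install opentelemetry-sdk opentelemetry-exporter-otlp
# and a Collector listening on localhost:4317 (OTLP/gRPC).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Measuring: create a tracer bound to this service.
provider = TracerProvider(resource=Resource.create({"service.name": "app1"}))
# Exporting: batch spans and send them to the Collector, which handles
# processing and forwarding to other monitoring tools/services.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example")
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.request.method", "GET")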

Slide 6

Slide 6 text

© Hitachi, Ltd. 2024. All rights reserved. OTel should also be a powerful tool for design and development: it enables data-driven decision making and correlates signals across multiple applications, though it may be considered excessive for development use. Favorable Trends • Auto-instrumentation allows fast attach/detach, minimizing setup cost • Growing support from open-source and cloud services for analyzing OTel signals Unlocking OTel Beyond Operations 5 → Can we leverage OTel in design and development by attaching it only when needed?

Slide 7

Slide 7 text

© Hitachi, Ltd. 2024. All rights reserved. Key Features 6 Auto- Instrumentation Signals (Semantic Convention)

Slide 8

Slide 8 text

© Hitachi, Ltd. 2024. All rights reserved. Attach instrumentation without modifying program code • Achieved via monkey patching, which dynamically modifies the target code • Supported: Java, JavaScript, Python, PHP, .NET, Ruby, (Go) Auto-Instrumentation 7 app.jar opentelemetry-agent.jar • Analyze intermediate code • Detect libs (e.g., Spring) • Modify code Monkey Patching
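The slide shows the Java agent, which is attached at JVM startup with -javaagent. As a cross-language illustration of the same monkey-patching idea, the Python contrib packages patch libraries at runtime; a minimal sketch, assuming opentelemetry-instrumentation-requests is installed (the zero-code route is the opentelemetry-instrument launcher).

# Sketch of monkey-patch-based instrumentation in Python; the Java agent does the
# equivalent by rewriting bytecode when classes are loaded.
# Assumes: pip install opentelemetry-instrumentation-requests
import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Patches requests.Session at runtime: every outgoing HTTP call now produces a
# client span, with no change to the application code that uses `requests`.
RequestsInstrumentor().instrument()

requests.get("https://example.com")  # traced automatically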

Slide 9

Slide 9 text

© Hitachi, Ltd. 2024. All rights reserved. Languages that compile to binaries, like Golang or C, can’t use this, can they? Question 8

Slide 10

Slide 10 text

© Hitachi, Ltd. 2024. All rights reserved. OpenTelemetry Go Instrumentation Auto-instrumentation is inherently hard to apply to compiled binaries. Efforts are underway to solve this issue (Ref.) Auto-Instrumentation for binaries 9 [WIP] Auto-Instrumentation based on Traffic Pattern (Hitachi 2021) [WIP] opentelemetry-go-instrumentation (OTel community) https://github.com/open-telemetry/opentelemetry-go-instrumentation Golang Process eBPF Analyzer Inst. Manager Set probe. Load eBPF program This agent analyzes a target Go process and finds instrumentable functions. Then it attaches eBPF programs to hooks in those functions. Traces Detect process. Find funcs. Analyze stack & CPU registers

Slide 11

Slide 11 text

© Hitachi, Ltd. 2024. All rights reserved. OpenTelemetry Operator enables auto-instrumentation of containers and deploys the OTel Collector in a Kubernetes-native way OpenTelemetry Operator for Kubernetes 10 OTel Collector https://github.com/open-telemetry/opentelemetry-operator Instrumentation custom resource OpenTelemetryCollector custom resource Add OTel agent as init-container Deploy Signals Controller OpenTelemetry Operator App OTel agent Auto Inst. Kubernetes

Slide 12

Slide 12 text

© Hitachi, Ltd. 2024. All rights reserved. Key Features 11 Auto- Instrumentation Signals (Semantic Convention)

Slide 13

Slide 13 text

© Hitachi, Ltd. 2024. All rights reserved. OTel Signals 12 As of 2024, Auto-Instrumentation measures metrics, logs, and traces Enrich log output Metrics Logs Traces Time-series data like latency, with various values defined by Semantic Conventions. Converts standard logs into structured logs with contextual information. Visualizes process call relationships; generates large data volumes.

Slide 14

Slide 14 text

© Hitachi, Ltd. 2024. All rights reserved. Semantic Convention 13 https://opentelemetry.io/docs/specs/semconv/ Semantic Conventions define common attributes that give meaning to signals. E.g., Metric: http.server.request.duration

Attribute | Type | Description | Examples | Requirement Level | Stability
http.request.method | string | HTTP request method. [1] | GET; POST; HEAD | Required | stable
url.scheme | string | The URI scheme component identifying the used protocol. | http; https | Required | stable
error.type | string | Describes a class of error the operation ended with. [3] | timeout; java.net.UnknownHostException; server_certificate_invalid; 500 | Conditionally Required: if the request has ended with an error. | stable
http.response.status_code | int | HTTP response status code. | 200 | Conditionally Required: if and only if one was received/sent. | stable
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮

→ Java Auto-Instrumentation supports not only HTTP but also JVM and process semantic conventions
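For illustration, a minimal sketch of recording this metric by hand with the Python metrics API, using the semconv attribute names from the table (auto-instrumentation normally records it for you, and an SDK MeterProvider must be configured for the data to actually be exported).

# Sketch: recording the semconv metric http.server.request.duration manually.
# Without an SDK MeterProvider configured, the default meter is a no-op.
from opentelemetry import metrics

meter = metrics.get_meter("example")
duration = meter.create_histogram(
    "http.server.request.duration", unit="s",
    description="Duration of HTTP server requests")

# Attributes follow the Semantic Conventions so any backend can interpret them.
duration.record(0.142, {
    "http.request.method": "GET",
    "url.scheme": "https",
    "http.response.status_code": 200,
})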

Slide 15

Slide 15 text

© Hitachi, Ltd. 2024. All rights reserved. Enhancing Design, Development, and Testing with OpenTelemetry 14

Slide 16

Slide 16 text

© Hitachi, Ltd. 2024. All rights reserved. Examples of developer support using OTel 15 Overall performance and downtime can be monitored using conventional external monitoring. Req. Resp. Request generator Frontend App. Frontend Backend App. Backend http request Long response time Internal Server Error 500 … ? ? ? However, it is difficult to evaluate each component. Load Balancer DB Cache Server Load Balancer

Slide 17

Slide 17 text

© Hitachi, Ltd. 2024. All rights reserved. Examples of developer support using OTel 16 Req. Resp. Request generator App. App. http request Long response time Internal Server Error 500 … Dashboard / Analysis by script (e.g., anomaly detection, downtime calculation) OpenTelemetry Measurement Agent OpenTelemetry Measurement Agent Traces, Metrics, Logs Collecting telemetry for each component using OTel Understanding the performance and behavior of each component DB Cache Server Telemetry management platform (e.g., Amazon CloudWatch, Grafana) Frontend Backend Load Balancer Load Balancer

Slide 18

Slide 18 text

© Hitachi, Ltd. 2024. All rights reserved. Concept of analysis with OTel 17
Information that can be obtained
✓ Start and end timings of events (failures or load spikes)
✓ Error rate, latency, and resource consumption before, during, and after the event
✓ Changes in Java connection count, Java thread count, and other metrics
Use case examples
1. Failure testing: Investigate how the duration of frontend errors changes during DB failover when RDS Proxy is introduced.
2. Performance testing: Test whether resource consumption under load remains below the specified limits and investigate the components causing bottlenecks for potential improvements.
3. Proof of Concept: Investigate how the response time for database access changes when a cache server is introduced.
Example analysis output (the slide annotates the anomaly time, the target component, and its behavior before, during, and after the failure):
[{
  "name": "my_aurora_db",
  "start_time": "2024-06-19T12:00:00Z",
  "end_time": "2024-06-19T12:01:00Z",
  "abnormal_time_seconds": 60,
  "metrics": {
    "before_abnormal": {
      "average_latency_ms": 50, "total_errors": 0, "error_percentage": 0.0,
      "requests_per_second": 200, "cpu_usage_percentage": 30.0, "memory_usage_mb": 2048,
      "read_iops": 600, "write_iops": 400, "retry_attempts": 0,
      "cache_hit_ratio": 95.0, "connection_errors": 0, "transaction_rollbacks": 0
    },
    "under_abnormal": {
      "average_latency_ms": 250, "total_errors": 150, "error_percentage": 5.0,
      "requests_per_second": 80, "cpu_usage_percentage": 75.0, "memory_usage_mb": 4096,
      "read_iops": 1200, "write_iops": 800, "retry_attempts": 20,
      "cache_hit_ratio": 85.0, "connection_errors": 30, "transaction_rollbacks": 10
    },
    "after_abnormal": { ... }
  }
},
{ "name": "backend_app", ... }]
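As a sketch of how such a summary could be computed (not the actual analysis script; the field names simply mirror the example output above, and the input is assumed to be per-request latency/error records):

# Sketch: given timestamped samples and a detected anomaly window, report
# statistics before, during, and after the event, as in the example output.
from statistics import mean

def summarize(samples, start, end):
    """samples: list of (timestamp, latency_ms, is_error); start/end: datetimes."""
    def stats(rows):
        if not rows:
            return {}
        latencies = [lat for (_, lat, _) in rows]
        errors = sum(1 for (_, _, err) in rows if err)
        return {
            "average_latency_ms": round(mean(latencies), 1),
            "total_errors": errors,
            "error_percentage": round(100 * errors / len(rows), 1),
        }
    return {
        "abnormal_time_seconds": (end - start).total_seconds(),
        "metrics": {
            "before_abnormal": stats([r for r in samples if r[0] < start]),
            "under_abnormal": stats([r for r in samples if start <= r[0] <= end]),
            "after_abnormal": stats([r for r in samples if r[0] > end]),
        },
    }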

Slide 19

Slide 19 text

© Hitachi, Ltd. 2024. All rights reserved. Issue: Evaluating how database failures affect applications during fault testing Case 1: Analysis of database failures during fault testing 18 Request generator (k6) Amazon ECS Service (Tasks) App. w/ OTel agent container Collector container AWS FIS Template Fault injection (reboot DB) Aurora PostgreSQL Analysis Script (Python+boto3) Req./Res. Get traces/metrics Transmit traces/metrics Amazon CloudWatch, AWS X-Ray Dashboard Detail Summary Start test Analysis tools Testing tools Target system ALB RDS Proxy ALB: Application Load Balancer RDS: Relational Database Service FIS: Fault Injection Service

Slide 20

Slide 20 text

© Hitachi, Ltd. 2024. All rights reserved. Issue: Evaluating how database failures affect applications during fault testing Case 1: Analysis of database failures during fault testing 19 Request generator (k6) AWS FIS Template Fault injection (reboot DB) Analysis Script (Python+boto3) Get traces/metrics Amazon CloudWatch, AWS X-Ray Dashboard Detail Summary Start test Analysis tools Testing tools Amazon ECS Service (Tasks) App. w/ OTel agent container Collector container Aurora PostgreSQL Transmit traces/metrics Target system ALB RDS Proxy 1. Select metrics about the fault with target attributes (e.g., MetricName="FaultRate", ServiceName="test-app", ServiceType="AWS::ECS::Fargate") 2. Detect the anomaly time for each component using the metrics 3. Analyze related metrics and traces around the anomaly time Req./Res.
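A minimal boto3 sketch of step 1, pulling metric datapoints from Amazon CloudWatch; the namespace, metric name, and dimensions are illustrative assumptions that depend on how the Collector exports to CloudWatch in your setup.

# Sketch of step 1: fetch OTel metrics from CloudWatch with boto3.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="MyApp/OTel",                                  # assumed export namespace
    MetricName="http.server.duration",
    Dimensions=[{"Name": "ServiceName", "Value": "test-app"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=60,                                               # 1-minute datapoints
    Statistics=["Average", "Maximum", "SampleCount"],        # or ExtendedStatistics=["p99"]
)
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], dp["Average"], dp["Maximum"])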

Slide 21

Slide 21 text

© Hitachi, Ltd. 2024. All rights reserved. Information in metrics 20 Metric attributes (the situation in which the metric was recorded) Metric name (what the metric means) Metric datapoints (statistics of the metric, including sample count, average, max, min, p99, …) *From Amazon CloudWatch

Slide 22

Slide 22 text

© Hitachi, Ltd. 2024. All rights reserved. ◆ about HTTP • http.server.duration, http.client.duration ◆ about Runtime environment • jvm.threads.count, jvm.memory.usage, jvm.cpu.utilization, … ◆ about Database • db.client.connection.count, db.client.connection.create_time, db.client.connection.pending_requests, … Examples of OTel metrics 21

Slide 23

Slide 23 text

© Hitachi, Ltd. 2024. All rights reserved. Case 1: How database failures affect applications 22 Request generator (k6) AWS FIS Template Fault injection (reboot DB) Analysis Script (Python+boto3) Get traces/metrics Detail Summary Start test Analysis tools Testing tools Amazon ECS Service (Tasks) App. w/ OTel agent container Collector container Aurora PostgreSQL Transmit traces/metrics Target system ALB RDS Proxy Req./Res. Dashboard Amazon CloudWatch, AWS X-Ray

Slide 24

Slide 24 text

© Hitachi, Ltd. 2024. All rights reserved. Metrics provide insight into statistical behavior over relatively long periods of time (1 minute or more). Case 1: Information obtained from metrics 23 Reboot DB by FIS Metrics obtained from the API Dashboard (Amazon CloudWatch) Anomaly time detection using thresholds. Example: ✓ Average error rate (5xx) > 0 ✓ p99 of response time > 1 second ✓ p95 of response time > p99 of response time under normal conditions *From Amazon CloudWatch
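A small sketch of that threshold rule, assuming per-minute datapoints have already been fetched and reduced to (timestamp, p99 latency in seconds, 5xx error rate) tuples; the input format is an assumption for illustration.

# Sketch of threshold-based anomaly-time detection over per-minute datapoints.
def detect_anomaly_window(points, p99_threshold_s=1.0):
    """points: list of (timestamp, p99_latency_s, error_rate_5xx), sorted by time."""
    abnormal = [t for (t, p99, err) in points if p99 > p99_threshold_s or err > 0]
    # First and last abnormal minute bound the anomaly window (None if no anomaly).
    return (abnormal[0], abnormal[-1]) if abnormal else None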

Slide 25

Slide 25 text

© Hitachi, Ltd. 2024. All rights reserved. Case 1: Analysis using traces 24 Trace: information on each request ◆ Start time ◆ HTTP status ◆ Response time of each request in the trace Anomaly time detection in seconds: start time (sec) vs. response time (sec); HTTP status = 500 Analysis around the anomaly time. Related metrics to check: ◆ DB connection ✓ db.client.connection.count ✓ db.client.connection.wait_time ◆ Resource usage by retries ✓ jvm.memory.usage ✓ jvm.system.cpu.utilization ✓ jvm.gc.duration The responses stopped when the database failed because no timeout was set on the application

Slide 26

Slide 26 text

© Hitachi, Ltd. 2024. All rights reserved. Issue: Detecting performance bottlenecks for specific requests during load testing. Case 2: Bottleneck detection during load testing 25 Request generator (k6) Amazon ECS Service (Tasks) App. w/ OTel agent container Collector container Aurora PostgreSQL Analysis Script (Python+boto3) Req./Res. Get traces/metrics Transmit traces/metrics Amazon CloudWatch, AWS X-Ray Dashboard Detail Summary Start test Analysis tools Testing tools Target system ALB RDS Proxy Increase # of requests ALB: Application Load Balancer RDS: Relational Database Service

Slide 27

Slide 27 text

© Hitachi, Ltd. 2024. All rights reserved. Case 2: Analysis using traces 26 *From Grafana A trace includes the duration of each segment → useful for bottleneck detection
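Since this setup exports traces to AWS X-Ray, a bottleneck ranking could be sketched with boto3 as below; the filter expression, time window, and thresholds are illustrative, and the actual analysis script may differ.

# Sketch: pull slow traces from AWS X-Ray and rank segments by duration.
import json
from datetime import datetime, timedelta, timezone
import boto3

xray = boto3.client("xray")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)

summaries = xray.get_trace_summaries(
    StartTime=start, EndTime=end,
    FilterExpression="responsetime > 1")          # only traces slower than 1 s
trace_ids = [s["Id"] for s in summaries["TraceSummaries"]][:5]

if trace_ids:
    for trace in xray.batch_get_traces(TraceIds=trace_ids)["Traces"]:
        # Each segment document records its own start/end time; the longest
        # segments are the bottleneck candidates.
        docs = [json.loads(seg["Document"]) for seg in trace["Segments"]]
        docs.sort(key=lambda d: d["end_time"] - d["start_time"], reverse=True)
        for d in docs[:3]:
            print(trace["Id"], d["name"],
                  round(d["end_time"] - d["start_time"], 3), "s")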

Slide 28

Slide 28 text

© Hitachi, Ltd. 2024. All rights reserved. Issue: Assessing the impact of incorporating a cache server on system behavior during the design phase. Case 3: Evaluation of performance change by a cache server 27 Evaluate with / without a cache server Request generator (k6) Amazon ECS Service (Tasks) App. w/ OTel agent container Collector container Aurora PostgreSQL Analysis Script (Python+boto3) Req./Res. Get traces/metrics Transmit traces/metrics Amazon CloudWatch, AWS X-Ray Dashboard Detail Summary Start test Analysis tools Testing tools Target system ALB RDS Proxy Amazon ElastiCache ALB: Application Load Balancer RDS: Relational Database Service

Slide 29

Slide 29 text

© Hitachi, Ltd. 2024. All rights reserved. Case 3: Evaluation of performance change by a cache server 28 Statistical values of latency (average, min, max, p99, p50, …) w/o a cache server vs. w/ a cache server App. w/ OTel agent container Collector container Aurora PostgreSQL RDS Proxy Amazon ElastiCache ✓ The cache reduces latency but increases operational costs. ✓ I want to evaluate the benefits quantitatively, rather than relying on intuition or experience. Will a cache server provide benefits that justify the cost? Quantitative evaluation using OTel metrics: is the latency improvement worth the cost or not? ✓ Statistical evaluation can be done easily. → Implementation decisions can be made based on solid evidence.
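A minimal sketch of the comparison, assuming per-request latency samples (in seconds) have been collected via OTel for one run without the cache and one with it:

# Sketch: compare latency statistics between a run without the cache and a run
# with it, reporting the improvement per statistic.
from statistics import mean, quantiles

def latency_stats(samples_s):
    qs = quantiles(samples_s, n=100)      # qs[49] ~ p50, qs[98] ~ p99
    return {"avg": mean(samples_s), "min": min(samples_s), "max": max(samples_s),
            "p50": qs[49], "p99": qs[98]}

def compare(without_cache, with_cache):
    base, cached = latency_stats(without_cache), latency_stats(with_cache)
    return {k: {"without": base[k], "with": cached[k],
                "improvement_%": round(100 * (base[k] - cached[k]) / base[k], 1)}
            for k in base}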

Slide 30

Slide 30 text

© Hitachi, Ltd. 2024. All rights reserved. Useful telemetry for testing 29

What to measure | Metrics | Traces
Throughput | http.server.duration [1], rpc.server.duration | -
Latency | http.server.duration, rpc.server.duration | (end_time) – (start_time)
Error rate | status attribute of http.server.duration | -
Failover time | http.server.duration | distribution of traces
Resource utilization | jvm.threads.count, jvm.memory.usage (process.memory.usage [2]), jvm.cpu.utilization (process.cpu.time), … | -
Connection to DB | db.client.connection.count, db.client.connection.create_time, db.client.connection.pending_requests | -

1. v1.20 or later: http.server.request.duration
2. Python auto-instrumentation (v1.27.0) does not support process metrics.

Slide 31

Slide 31 text

© Hitachi, Ltd. 2024. All rights reserved. OTel metrics (HTTP, DB) 30

Category | Information | Metrics | Unit
HTTP | Duration of HTTP server requests | http.server.request.duration | second
DB | The number of connections currently in the state described by the state attribute | db.client.connection.count (db.client.connections.usage) | # of connections
DB | The maximum/minimum number of idle open connections allowed | db.client.connection.max, db.client.connection.idle.min | # of connections
DB | The number of current pending requests for an open connection | db.client.connection.pending_requests | # of connections
DB | The time it took to create a new connection | db.client.connection.create_time | second
DB | The time it took to obtain an open connection from the pool | db.client.connection.wait_time | second
DB | The time between borrowing a connection and returning it to the pool | db.client.connection.use_time | second

*https://opentelemetry.io/docs/specs/semconv/

Slide 32

Slide 32 text

© Hitachi, Ltd. 2024. All rights reserved. OTel metrics (JVM) 31

Category | Information | Metrics | Unit
jvm | Thread count | process.runtime.jvm.threads.count | # of threads
jvm | Recent system-wide CPU usage | process.runtime.jvm.system.cpu.utilization | CPU usage
jvm | Average system-wide CPU load over the past minute | process.runtime.jvm.system.cpu.load_1m | # of CPU cores
jvm | Memory in use | process.runtime.jvm.memory.usage | Byte
jvm | Garbage collection duration | process.runtime.jvm.gc.duration | second
jvm | Process CPU usage | process.runtime.jvm.cpu.utilization | CPU usage
jvm | Number of classes unloaded since JVM startup | process.runtime.jvm.classes.unloaded | # of classes
jvm | Number of classes loaded since JVM startup | process.runtime.jvm.classes.loaded | # of classes
jvm | Number of classes currently loaded | process.runtime.jvm.classes.current_loaded | # of classes
jvm | Memory used by buffers | process.runtime.jvm.buffer.usage | Byte
jvm | Maximum memory used by buffers | process.runtime.jvm.buffer.limit | Byte
jvm | Number of buffers in the pool | process.runtime.jvm.buffer.count | # of buffers

*https://opentelemetry.io/docs/specs/semconv/

Slide 33

Slide 33 text

© Hitachi, Ltd. 2024. All rights reserved. Our practice, Tips 32 *From AWS Cost Explorer (comparing collecting traces throughout the entire day vs. only during tests)  Even if the telemetry transmitter is OSS, costs are incurred when using a managed service on the receiver side. ◆ We use OTel as the transmitter and AWS as the receiver ✓ If distributed traces are collected without sampling, • it could cost $20/day for 50 requests per second on a 2-layer system • i.e., $0.2/day for 1 request per second per component. → Sending 1,000 req./sec to 10 components could cost $2,000/day = $60,000/month just for tracing. ✓ Be cautious about long-running tests (and tests you forget to stop), and configure sampling rules to avoid such costs.
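One way to put such a sampling rule in place is head sampling in the SDK; a minimal sketch with the Python SDK is below. With auto-instrumentation, the equivalent is the OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG environment variables, and sampling can also be done in the Collector.

# Sketch: cap tracing cost with head sampling. ParentBased + TraceIdRatioBased
# keeps ~10% of new traces (and respects the parent's sampling decision), so a
# long-running or forgotten test does not export every span.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)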

Slide 34

Slide 34 text

© Hitachi, Ltd. 2024. All rights reserved.  Some metrics are optional ◆ The metrics you want to collect may not be implemented, depending on the collector or programming language. ◆ In HTTP metrics: Required: http.server.request.duration, http.client.request.duration Optional: http.server.active_requests, http.server.request.body.size, http.server.response.body.size, http.client.request.body.size, http.client.response.body.size, http.client.open_connections, http.client.connection.duration, http.client.active_requests  In some versions of OTel, the names of metrics may have changed ◆ e.g., db.client.connections.usage (v1.24.0) → db.client.connection.count (v1.26.0) ◆ In distributions like ADOT, metrics might still be collected under older names, and there can be a lag before updates from the OSS are reflected. ◆ When checking or configuring metrics, it's important to be aware of these name changes based on the version in use. Our practice, Tips 33

Slide 35

Slide 35 text

© Hitachi, Ltd. 2024. All rights reserved.  Comments from a development team ◆ It is challenging to build application logging that captures as much information as OTel collects. ◆ While commercial software offers rich functionality, its implementation burden is significant, so it is desirable to achieve the same with OSS. ◆ Since support is limited to the OSS level, careful consideration is needed when integrating it into products. Our practice, Tips 34

Slide 36

Slide 36 text

© Hitachi, Ltd. 2024. All rights reserved. • Increasing complexity of development • OpenTelemetry (OTel) can help you design and develop applications • OTel is relatively easy to adopt, enabling analysis of user experience and resource changes • Be mindful of the telemetry receiver and the cost when collecting telemetry Key Takeaways 35 Let’s use OpenTelemetry to enhance our development experience!

Slide 37

Slide 37 text

© Hitachi, Ltd. 2024. All rights reserved. • Amazon Fault Injection Service, Application Load Balancer, Amazon CloudWatch, AWS X-Ray, Amazon Aurora, Amazon RDS Proxy, Amazon Elastic Container Service, AWS Cost Explorer, Amazon ElastiCache, and boto3 are trademarks of Amazon Web Services, Inc. in the United States and/or other countries. • OpenTelemetry (OTel) and Kubernetes are registered trademarks of the Linux Foundation in the United States and/or other countries. • Grafana and k6 are registered trademarks of Grafana Labs in the United States and/or other countries. Trademarks 36

Slide 38

Slide 38 text

No content