Do More with Less - Unit and Integration Testing is Broken

Kevin Crawley, Developer Relations, Instana
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

@notsureifkevin
$> whoami
• Developer Relations @ Instana
  ◦ Education / Awareness
  ◦ Product Focus on SRE
  ◦ Content Development
• Principal SRE @ Single Music
  ◦ Co-Owner and Consultant
  ◦ Built Delivery Systems and Manage Infrastructure
  ◦ Maintain Production Excellence
• 20 years software dev exp
  ◦ Early Adoption of Docker (2014)
  ◦ Docker Captain
  ◦ GitLab Hero

Discussion Points
• Culture of Testing
• What is Sufficiently Advanced Monitoring?
• Observability In Action: Single Music

The Culture of Testing
Is Testing Agile or Not?

Testing has nothing to do with “Agile” ...
http://softwareprocess.es/pubs/borle2017EMSE-TDD.pdf
http://www.sserg.org/publications/uploads/04b700e040d0cac8681ba3d039be87a56020dd41.pdf

Unit Testing is like putting your code in a tar pit
• More abstraction
• More logic
• More code
• Velocity is slowed
• Productivity is reduced

Benefits diminish as complexity increases
• Distributed Architecture
  ◦ Microservices
  ◦ Lambda
• External Dependencies
• More Integration Testing?

When production is broken
How many of you wait for integration tests to pass?
https://xkcd.com/license.html

“Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in.”

What should we test?
Critical Systems Code / Pathways
• Nuclear Reactor Control Systems
• Aircraft Flight Software
• Open Source Libraries
• Regressions

What shouldn’t we be testing?
• Non-critical user interactions (which can be monitored)
• Fence-Post errors
• CRUD, edge concerns, etc.

Distributed Tracing
Is this sufficiently advanced monitoring?

Distributed Tracing Abstract
In 2010, Google published a technical report on its distributed tracing project, Dapper. The abstract summarizes why Google built Dapper in the first place:
“Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.”
Google Technical Report dapper-2010-1, April 2010, p. 1 - https://ai.google/research/pubs/pub36356

How to Visualize Distributed Interactions
• Every request has a custom header injected into it, which is intercepted and processed by a system of record
• Once the initial trace is generated, it is visualized as a Gantt chart that shows the hierarchical structure and timing of every transaction that occurred
Google Technical Report dapper-2010-1, April 2010, p. 3, fig. 2 - https://ai.google/research/pubs/pub36356
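The header-injection idea above can be sketched in a few lines of Python. This is an illustrative sketch only, not Dapper's or Instana's implementation: the header name `X-Trace-Context`, the `Span` fields, the in-memory `COLLECTOR` (standing in for the system of record), and the helper names are all assumptions made for this example.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

TRACE_HEADER = "X-Trace-Context"  # hypothetical header name, chosen for this sketch


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int


COLLECTOR = []  # in-memory stand-in for the system of record


def extract(headers):
    """Read trace context from incoming headers, or start a new trace."""
    value = headers.get(TRACE_HEADER)
    if value is None:
        return uuid.uuid4().hex, None
    trace_id, parent_id = value.split(":")
    return trace_id, parent_id


def handle_request(headers, name, start_ms, end_ms):
    """Record a span for this hop and return headers for downstream calls."""
    trace_id, parent_id = extract(headers)
    span = Span(trace_id, uuid.uuid4().hex[:8], parent_id, name, start_ms, end_ms)
    COLLECTOR.append(span)
    outgoing = dict(headers)
    outgoing[TRACE_HEADER] = f"{span.trace_id}:{span.span_id}"
    return outgoing


def render_gantt(spans, ms_per_char=10):
    """Draw a text Gantt chart: indentation shows hierarchy, bars show timing."""
    by_id = {s.span_id: s for s in spans}

    def depth(span):
        d = 0
        while span.parent_id in by_id:
            span = by_id[span.parent_id]
            d += 1
        return d

    origin = min(s.start_ms for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        label = "  " * depth(s) + s.name
        offset = (s.start_ms - origin) // ms_per_char
        length = max(1, (s.end_ms - s.start_ms) // ms_per_char)
        lines.append(f"{label:<16}|{' ' * offset}{'#' * length}")
    return "\n".join(lines)
```

Calling `handle_request({}, "frontend", 0, 100)` and passing the returned headers to downstream calls links every span to the same trace ID, and `render_gantt(COLLECTOR)` then prints one indented bar per span.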

What Do We Get From Distributed Tracing?
• A tremendous amount of telemetry data which is perfect for:
  ◦ Aggregation
  ◦ Machine Learning
  ◦ Performance Analysis
  ◦ Testing and Debugging

… or rather, a big ol’ Data Lake

We Need More Than Just Lakes and Query Tools
• No longer treating services like Schrödinger's cat
• Much more context around events and transactions
• Interactive dashboards and metrics generated by aggregated request-scoped events

What do aggregated request-scoped events actually mean?
• Create individual buckets of data that have meaning, i.e., grouped by “services” or “applications”
• Roll up “golden metrics” for those buckets, which include latency percentiles, throughput, and error rate
• Retain the context of those aggregated metrics, along with logs and application metrics, so anomalies can be quickly discovered and all contributing factors can be investigated
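The three steps above can be sketched as a small roll-up function. This is a minimal illustration, not how any particular product computes its metrics: the event fields (`service`, `latency_ms`, `error`, `trace_id`) and the nearest-rank percentile method are assumptions made for this example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_metrics(events, window_seconds):
    """Bucket request-scoped events by service and roll up golden metrics."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event["service"]].append(event)

    rollup = {}
    for service, bucket in buckets.items():
        latencies = [e["latency_ms"] for e in bucket]
        errors = [e for e in bucket if e["error"]]
        rollup[service] = {
            "p50_ms": percentile(latencies, 50),
            "p99_ms": percentile(latencies, 99),
            "throughput_rps": len(bucket) / window_seconds,
            "error_rate": len(errors) / len(bucket),
            # Keep the context: trace IDs let you jump from the
            # aggregate back to the individual failing requests.
            "error_trace_ids": [e["trace_id"] for e in errors],
        }
    return rollup
```

Keeping the trace IDs next to the aggregate is the part that matters here: an anomaly in the error rate can be followed back to the requests that caused it.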

Single Music Case Study
How and Why We Use Observability Tools

• Built and maintained by 3 engineers
• Peak 20k transactions / hour, 20+ integrations, 150k LOC, with less than 15% test coverage
• Launched in 2018 with 15 microservices on Docker Swarm, and has since expanded to over 35 microservices with zero additional engineering personnel
• One-touch deployment and provisioning for new and existing services

What Does Single Music Do?
• Marketing and Management tools for bands, artists, and labels using the Shopify App Store
  ◦ Digital Music Reporting, Fulfillment, and Distribution
  ◦ Physical Music Reporting
  ◦ Sales Tracking
  ◦ Landing Pages
  ◦ Analytics and Reporting
  ◦ Integrations with Soundscan, Spotify, Soundcloud and more

Visualizing Our Infrastructure
Easily Dive Into and Explore Our Stack

Interactive Dashboards
Give us the health and quality of our application and its services

Visualize Patterns
Understanding scope, algorithms, and more

Exponential Backoff
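For readers unfamiliar with the pattern named on this slide, here is a minimal sketch of retrying with exponential backoff. It is not taken from Single Music's code: the function names, default values, and the jitter strategy are assumptions made for illustration. In a trace view, retries like these would typically show up as repeated calls spaced at growing intervals.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=5):
    """Delay before each retry: base * factor**n, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


def call_with_backoff(fn, attempts=5, base=0.1, factor=2.0, cap=5.0,
                      sleep=time.sleep, jitter=random.random):
    """Call fn(), retrying on any exception with exponentially growing waits."""
    delays = backoff_delays(base, factor, cap, attempts)
    for attempt, delay in enumerate(delays):
        try:
            return fn()
        except Exception:
            if attempt == len(delays) - 1:
                raise  # out of attempts: surface the last error
            # Scale the delay by a random factor in [0, 1) so that many
            # clients retrying at once do not synchronize.
            sleep(delay * jitter())
```

The `sleep` and `jitter` parameters are injectable only so the behavior can be checked without waiting on a real clock.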

Solving Real Problems

Rising Latency

Rise in Latency + Processing Time
• DBO (Hibernate Query) causing O(n log n) rise in latency and processing time
• Application Dashboard indicated an issue with overall latency increasing
• Fix deployed and improvement was observed immediately

Issue Resolved

Caching Solved One Problem … but caused another
• We implemented Redis for caching, and processing time went down
• However, we didn’t account for token policies changing, and tokens suddenly began to expire after 30 seconds
• Alerting on error rates for this endpoint made us aware of the issue
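The slides do not say how this was fixed. One common way to avoid this class of bug, sketched here purely as an assumption, is to derive the cache lifetime from the expiry the token issuer reports instead of hard-coding a TTL. The `TokenCache` class, the `fetch_token` callback returning `(token, expires_in_seconds)`, and the safety margin are all hypothetical, and a plain in-memory field stands in for Redis.

```python
import time


class TokenCache:
    """Cache a token only for as long as the issuer says it is valid."""

    def __init__(self, fetch_token, clock=time.monotonic, safety_margin=5.0):
        self._fetch_token = fetch_token      # returns (token, expires_in_seconds)
        self._clock = clock
        self._safety_margin = safety_margin  # refresh slightly before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            token, expires_in = self._fetch_token()
            self._token = token
            # Follow the issuer's current policy rather than a fixed TTL,
            # so a change to 30-second tokens is picked up automatically.
            self._expires_at = now + max(0.0, expires_in - self._safety_margin)
        return self._token
```

With a Redis-backed cache the same idea would apply by setting the key's expiry from the reported lifetime, but that variant is not shown here.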

Logs and Metrics Need Distributed Context
Easily visualize WARN and ERROR log aggregates, then jump to the related trace and the relevant metrics
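Jumping from a log line to its trace requires the log line to carry the trace ID. The sketch below shows one way to do that with Python's standard `logging` and `contextvars` modules. It is an illustration under assumptions, not what an agent such as Instana does internally: the logger name `orders`, the `trace=` field, and the helper names are made up for this example.

```python
import contextvars
import logging

# Holds the trace ID of the request currently being handled ("-" if none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    """Return a logger whose lines include the trace ID for correlation."""
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceContextFilter())
    logger.handlers = [handler]
    return logger
```

A request handler would call `current_trace_id.set(...)` when a request arrives, so every WARN or ERROR line emitted while serving it can be grouped by trace and linked back to the trace view.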

In Summary
• Microservices have allowed Single Music to build and scale effectively and meet the needs of the world's largest bands
• Observability tooling has helped Single Music become more effective at deploying and managing those microservice applications
• We rely on Instana because we want to focus on delivering music and value to our customers, not spend our time setting up monitoring solutions and constantly querying the data needed to validate our architecture and reliability

“Sufficiently advanced monitoring is indistinguishable from testing …”
Ed Keyes, Site Reliability Engineer – Google // 2008

I think our organization could use more observability!

Thank you!
Kevin Crawley
Twitter: @notsureifkevin
Visit our booth #511 or schedule some time with us: http://bit.ly/instana-reinvent
