Do More with Less - Unit and Integration Testing is Broken

Unit testing and integration testing fall short in helping engineers understand how their applications run in production, and in helping them avoid testing bits of code that should never be tested. In this session, Kevin Crawley of Instana shows real-world examples of modern observability tools that enable organizations to operate on as little as 15% code coverage and focus on what actually matters: the health of their production application. With these examples, Kevin demonstrates how sufficiently advanced monitoring is indistinguishable from testing and ultimately helps organizations recover from test coverage fatigue.

Kevin Crawley

December 04, 2019

Transcript

  1. Do More with Less - Unit and Integration Testing is Broken
    Kevin Crawley, Developer Relations, Instana
    © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. @notsureifkevin $> whoami
    • Developer Relations @ Instana
      ◦ Education / Awareness
      ◦ Product Focus on SRE
      ◦ Content Development
    • Principal SRE @ Single Music
      ◦ Co-Owner and Consultant
      ◦ Built Delivery Systems and Manage Infrastructure
      ◦ Maintain Production Excellence
    • 20 years software dev exp
      ◦ Early Adoption of Docker (2014)
      ◦ Docker Captain
      ◦ Gitlab Hero
  3. Discussion Points
    • Culture of Testing
    • What is Sufficiently Advanced Monitoring?
    • Observability In Action: Single Music
  4. Testing has nothing to do with “Agile” ...
    http://softwareprocess.es/pubs/borle2017EMSE-TDD.pdf
    http://www.sserg.org/publications/uploads/04b700e040d0cac8681ba3d039be87a56020dd41.pdf
  5. Unit Testing is like putting your code in a tar pit
    • More abstraction
    • More logic
    • More code
    • Velocity is slowed
    • Productivity is reduced
  6. Benefits diminish as complexity increases
    • Distributed Architecture
      ◦ Microservices
      ◦ Lambda
    • External Dependencies
    • More Integration Testing?
  7. When production is broken
    How many of you wait for integration tests to pass?
    https://xkcd.com/license.html
  8. “Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in.”
  9. What should we test?
    Critical Systems Code / Pathways
    • Nuclear Reactor Control Systems
    • Aircraft Flight Software
    • Open Source Libraries
    • Regressions
  10. What shouldn’t we be testing?
    • Non-critical user interactions (which can be monitored)
    • Fence-post errors
    • CRUD, edge concerns, etc.
  11. Distributed Tracing Abstract
    In 2010, Google published a technical report on their distributed tracing project named Dapper. In their abstract they summarized why they built Dapper in the first place:
    “Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.”
    Google Technical Report dapper-2010-1, April 2010, p. 1 - https://ai.google/research/pubs/pub36356
  12. How to Visualize Distributed Interactions
    • Every request has a custom header injected into it, which is intercepted and processed by a system of record
    • Once the initial trace is generated, it is visualized as a Gantt chart showing the hierarchical structure and timing of every transaction that occurred
    Google Technical Report dapper-2010-1, April 2010, p. 3, fig. 2 - https://ai.google/research/pubs/pub36356
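
The header injection described on slide 12 can be sketched in a few lines of Python. This is a minimal, in-process illustration and not Dapper's or Instana's implementation: the header names `X-Trace-Id` and `X-Parent-Span-Id`, the `COLLECTED` list standing in for the system of record, and the three toy services are all assumptions made for the example.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical header names; real tracers define their own.
TRACE_HEADER = "X-Trace-Id"
PARENT_HEADER = "X-Parent-Span-Id"

# Stand-in for the "system of record" that receives finished spans.
COLLECTED = []


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    end: float = 0.0


def handle(name, headers, downstream=()):
    """Handle one request: continue the caller's trace or start a new one."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    span = Span(trace_id, uuid.uuid4().hex[:16], headers.get(PARENT_HEADER),
                name, time.perf_counter())
    # Inject the trace context into every outgoing call.
    outgoing = {TRACE_HEADER: trace_id, PARENT_HEADER: span.span_id}
    for call in downstream:
        call(outgoing)
    span.end = time.perf_counter()
    COLLECTED.append(span)
    return span


# Three toy services that call each other.
def database(headers):
    return handle("database", headers)


def payments(headers):
    return handle("payments", headers, [database])


def frontend(headers):
    return handle("frontend", headers, [payments, database])


def render(spans):
    """Return an indented, Gantt-like outline of the spans, ordered by start time."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start):
            duration_ms = (span.end - span.start) * 1000
            lines.append(f"{'  ' * depth}{span.name} ({duration_ms:.3f} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


if __name__ == "__main__":
    frontend({})  # an external request arrives without trace headers
    print("\n".join(render(COLLECTED)))
```

Because every span carries the same trace id and a pointer to its parent, the collector can rebuild the call hierarchy after the fact, which is what the Gantt-style view relies on.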
  13. What Do We Get From Distributed Tracing?
    • A tremendous amount of telemetry data which is perfect for:
      ◦ Aggregation
      ◦ Machine Learning
      ◦ Performance Analysis
      ◦ Testing and Debugging
  14. We Need More Than Just Lakes and Query Tools
    • No longer treating services like Schrödinger's cat
    • Much more context around events and transactions
    • Interactive dashboards and metrics generated by aggregated request-scoped events
  15. What do aggregated request-scoped events actually mean?
    • Create individual buckets of data that have meaning, i.e., grouped by “services” or “applications”
    • Roll up “golden metrics” for those buckets, which include latency percentiles, throughput, and error rate
    • Retain the context of those aggregated metrics, along with logs and application metrics, so anomalies can be quickly discovered and all contributing factors may be investigated
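
The roll-up described on slide 15 can be illustrated with a small sketch. The event shape (`service`, `latency_ms`, `error`), the function names, and the choice of the nearest-rank percentile method are assumptions for this example; an observability product would compute these continuously over streaming data.

```python
import math
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_metrics(events, window_seconds):
    """Bucket request-scoped events by service and roll up golden metrics."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event["service"]].append(event)

    metrics = {}
    for service, bucket in buckets.items():
        latencies = sorted(e["latency_ms"] for e in bucket)
        errors = sum(1 for e in bucket if e["error"])
        metrics[service] = {
            "p50_ms": percentile(latencies, 50),
            "p99_ms": percentile(latencies, 99),
            "throughput_rps": len(bucket) / window_seconds,
            "error_rate": errors / len(bucket),
        }
    return metrics


if __name__ == "__main__":
    # Hypothetical events collected over a 2-second window.
    sample = [
        {"service": "checkout", "latency_ms": 10, "error": False},
        {"service": "checkout", "latency_ms": 20, "error": False},
        {"service": "checkout", "latency_ms": 30, "error": True},
        {"service": "checkout", "latency_ms": 40, "error": False},
        {"service": "search", "latency_ms": 5, "error": False},
    ]
    for service, values in golden_metrics(sample, window_seconds=2).items():
        print(service, values)
```

Keeping the raw events alongside the rolled-up numbers is what makes it possible to jump from an anomalous metric back to the individual requests that produced it.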
  16. • Built and maintained by 3 engineers
    • Peak 20k transactions / hour, 20+ integrations, 150k LOC, with less than 15% test coverage
    • Launched in 2018 with 15 microservices on Docker Swarm; has since expanded to over 35 microservices with zero additional engineering personnel
    • One-touch deployment and provisioning for new and existing services
  17. What Does Single Music Do?
    • Marketing and Management tools for bands, artists, and labels using the Shopify App Store
      ◦ Digital Music Reporting, Fulfillment, and Distribution
      ◦ Physical Music Reporting
      ◦ Sales Tracking
      ◦ Landing Pages
      ◦ Analytics and Reporting
      ◦ Integrations with Soundscan, Spotify, Soundcloud and more.
  18. Rise in Latency + Processing Time
    • DBO (Hibernate Query) causing O(n log n) rise in latency and processing time
    • Application Dashboard indicated an issue with overall latency increasing
    • Fix deployed and improvement was observed immediately
  19. Caching Solved One Problem … but caused another
    • We implemented Redis for caching, and processing time went down
    • However, we didn’t account for token policies changing, and tokens suddenly began to expire after 30 seconds
    • Alerting on error rates for this endpoint made us aware of the issue
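
The slide does not say how the expiry problem was fixed. One common way to avoid serving stale tokens from a cache is to bound each entry's lifetime by the expiry the provider reports, as in the sketch below. It uses an in-memory dictionary as a stand-in for Redis, and `TokenCache`, `get_token`, `fetch`, and the 5-second safety margin are all hypothetical names and values chosen for illustration.

```python
import time


class TokenCache:
    """In-memory stand-in for a Redis-style cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic, safety_margin=5.0):
        self._clock = clock
        self._margin = safety_margin
        self._entries = {}  # key -> (token, expires_at)

    def put(self, key, token, expires_in):
        # Keep the entry for slightly less than the provider-reported
        # lifetime, so a cached token is never served after it has expired.
        ttl = max(0.0, expires_in - self._margin)
        self._entries[key] = (token, self._clock() + ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        token, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return token


def get_token(cache, key, fetch):
    """Return a cached token, refreshing it through fetch() on a miss.

    fetch is a hypothetical callable returning (token, expires_in_seconds).
    """
    token = cache.get(key)
    if token is None:
        token, expires_in = fetch()
        cache.put(key, token, expires_in)
    return token
```

With this shape, a provider that shortens token lifetimes to 30 seconds only causes more frequent refreshes rather than a burst of authentication errors; error-rate alerting remains the safety net when the provider's behaviour changes in ways the code does not anticipate.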
  20. Logs and Metrics Need Distributed Context
    Easily visualize WARN and ERROR log aggregates, then jump to the trace and to the relevant metrics
  21. In Summary
    • Microservices have allowed Single Music to build and scale effectively and meet the needs of the world's largest bands
    • Observability Tooling has helped Single Music become more effective at deploying and managing those microservice applications
    • We rely on Instana because we want to focus on delivering music and value to our customers - not spend our time setting up monitoring solutions and constantly querying the data needed to validate our architecture and reliability
  22. Thank you!
    Kevin Crawley
    Twitter: @notsureifkevin
    Visit our booth #511 or schedule some time with us: http://bit.ly/instana-reinvent