Slide 1


Slide 2

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Do More with Less - Unit and Integration Testing is Broken
Kevin Crawley
Session ID
Developer Relations
Instana

Slide 3

@notsureifkevin

$> whoami
● Developer Relations @ Instana
  ○ Education / Awareness
  ○ Product Focus on SRE
  ○ Content Development
● Principal SRE @ Single Music
  ○ Co-Owner and Consultant
  ○ Built Delivery Systems and Manage Infrastructure
  ○ Maintain Production Excellence
● 20 years of software development experience
  ○ Early Adoption of Docker (2014)
  ○ Docker Captain
  ○ GitLab Hero

Slide 4

Discussion Points
● Culture of Testing
● What is Sufficiently Advanced Monitoring?
● Observability In Action: Single Music

Slide 5

The Culture of Testing
Is Testing Agile or Not?

Slide 6


Slide 7

Testing has nothing to do with “Agile” ...
http://softwareprocess.es/pubs/borle2017EMSE-TDD.pdf
http://www.sserg.org/publications/uploads/04b700e040d0cac8681ba3d039be87a56020dd41.pdf

Slide 8

Unit Testing is like putting your code in a tar pit
● More abstraction
● More logic
● More code
● Velocity is slowed
● Productivity is reduced

Slide 9

Benefits diminish as complexity increases
● Distributed Architecture
  ○ Microservices
  ○ Lambda
● External Dependencies
● More Integration Testing?

Slide 10

When production is broken
How many of you wait for integration tests to pass?
https://xkcd.com/license.html

Slide 11

“Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in.”

Slide 12

What should we test?
Critical Systems Code / Pathways
● Nuclear Reactor Control Systems
● Aircraft Flight Software
● Open Source Libraries
● Regressions

Slide 13

What shouldn’t we be testing?
● Non-critical user interactions (which can be monitored)
● Fence-post errors
● CRUD, edge concerns, etc.
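
For readers unfamiliar with the term, a fence-post (off-by-one) error looks like this; the pagination scenario and function name are invented for illustration:

```python
def page_bounds(page, page_size, total):
    """Return the [start, end) item indices for a 1-based page number.

    The classic fence-post mistake here is `start = page * page_size`,
    which silently skips the entire first page.
    """
    start = (page - 1) * page_size       # page 1 starts at index 0
    end = min(start + page_size, total)  # half-open end, clamped to total
    return start, end
```

The argument on this slide is that bugs of this shape are cheaper to catch by monitoring real traffic than by exhaustively unit testing every boundary.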

Slide 14

Distributed Tracing
Is this sufficiently advanced monitoring?

Slide 15

Distributed Tracing Abstract

In 2010, Google published a technical report on their distributed tracing project, named Dapper. In their abstract they summarized why they built Dapper in the first place:

“Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.”

Google Technical Report dapper-2010-1, April 2010, p. 1 - https://ai.google/research/pubs/pub36356

Slide 16

How to Visualize Distributed Interactions
● Every request has a custom header injected into it, which is intercepted and processed by a system of record
● This is visualized with a Gantt chart to show the hierarchical structure and timing of every transaction that occurred once the initial trace was generated

Google Technical Report dapper-2010-1, April 2010, p. 3, fig. 2 - https://ai.google/research/pubs/pub36356
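
The header-injection mechanism described above can be sketched in a few lines. The `traceparent` layout below is loosely modeled on the W3C Trace Context format, and the helper names are illustrative rather than any particular tracer's API:

```python
import secrets

def inject_trace_headers(headers, trace_id=None):
    """Attach trace context to an outgoing request's headers.

    If no trace_id is supplied, this service is the root of the trace.
    """
    trace_id = trace_id or secrets.token_hex(16)  # 128-bit trace id
    span_id = secrets.token_hex(8)                # 64-bit span id
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return trace_id, span_id

def extract_trace_headers(headers):
    """Recover trace context on the receiving service, so its spans
    join the same trace and the system of record can stitch them together."""
    _version, trace_id, span_id, _flags = headers["traceparent"].split("-")
    return trace_id, span_id
```

Every hop repeats this inject/extract step, which is what lets the collector rebuild the Gantt-style hierarchy for a single request.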

Slide 17

What Do We Get From Distributed Tracing?
● A tremendous amount of telemetry data, which is perfect for:
  ○ Aggregation
  ○ Machine Learning
  ○ Performance Analysis
  ○ Testing and Debugging

Slide 18

… or rather, a big ol’ Data Lake

Slide 19

We Need More Than Just Lakes and Query Tools
• No longer treating services like Schrödinger's cat
• Much more context around events and transactions
• Interactive dashboards and metrics generated by aggregated request-scoped events

Slide 20

What does “aggregated request-scoped events” actually mean?
● Create individual buckets of data that have meaning, i.e., grouped by “services” or “applications”
● Roll up “golden metrics” for those buckets, including latency percentiles, throughput, and error rate
● Retain the context of those aggregated metrics, along with logs and application metrics, so anomalies can be quickly discovered and all contributing factors investigated
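
A minimal sketch of that roll-up, assuming each request-scoped event is a record carrying a latency and an HTTP status (the field names are hypothetical):

```python
def golden_metrics(events, window_seconds):
    """Aggregate one bucket's request-scoped events into the
    'golden metrics': latency percentiles, throughput, and error rate."""
    latencies = sorted(e["latency_ms"] for e in events)

    def pct(p):
        # nearest-rank percentile over the sorted latencies
        return latencies[min(len(latencies) - 1, int(p / 100 * len(latencies)))]

    errors = sum(1 for e in events if e["status"] >= 500)
    return {
        "p50_ms": pct(50),
        "p99_ms": pct(99),
        "throughput_rps": len(events) / window_seconds,
        "error_rate": errors / len(events),
    }
```

The third bullet is the key difference from plain metrics: because each aggregate is built from request-scoped events, it can keep links back to the underlying traces and logs.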

Slide 21

Single Music Case Study
How and Why We Use Observability Tools

Slide 22

● Built and maintained by 3 engineers
● Peak of 20k transactions / hour, 20+ integrations, 150k LOC, with less than 15% test coverage
● Launched in 2018 with 15 microservices on Docker Swarm – has since expanded to over 35 microservices with zero additional engineering personnel
● One-touch deployment and provisioning for new and existing services

Slide 23

What Does Single Music Do?
● Marketing and Management tools for bands, artists, and labels using the Shopify App Store
  ○ Digital Music Reporting, Fulfillment, and Distribution
  ○ Physical Music Reporting
  ○ Sales Tracking
  ○ Landing Pages
  ○ Analytics and Reporting
  ○ Integrations with Soundscan, Spotify, SoundCloud, and more

Slide 24

Visualizing Our Infrastructure
Easily Dive Into and Explore Our Stack

Slide 25


Slide 26


Slide 27

Interactive Dashboards
Give us the health and quality of our application and its services

Slide 28


Slide 29


Slide 30


Slide 31

Visualize Patterns
Understanding scope, algorithms, and more

Slide 32


Slide 33

Exponential Backoff
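
The retry pattern behind this slide, sketched as "full jitter" exponential backoff (one common variant; the base and cap values are illustrative):

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Compute sleep durations for successive retries: the ceiling
    doubles with each attempt up to `cap`, and the actual delay is a
    random value below the ceiling so concurrent clients spread out."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0.0, ceiling))
    return delays
```

Plotted over time, this produces exactly the widening gaps between retry spikes that the trace view makes easy to spot.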

Slide 34

Solving Real Problems

Slide 35

Rising Latency

Slide 36

Rise in Latency + Processing Time
● A DBO (Hibernate query) was causing an O(n log n) rise in latency and processing time
● The Application Dashboard indicated an issue with overall latency increasing
● A fix was deployed and improvement was observed immediately

Slide 37

Issue Resolved

Slide 38


Slide 39

Caching Solved One Problem … but caused another
● We implemented Redis for caching, and processing time went down
● However, we didn’t account for token policies changing, and tokens suddenly began to expire after 30 seconds
● Alerting on error rates for this endpoint raised our awareness of the issue
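
One way to avoid the failure mode described above is to honor the TTL the token provider reports, minus a safety margin. This is an in-memory stand-in for the Redis cache with hypothetical names, not Single Music's actual code:

```python
import time

class TokenCache:
    """Cache a token until shortly before its provider-reported TTL,
    so a policy change (e.g. a sudden 30-second expiry) is picked up
    instead of serving expired tokens and driving up error rates."""

    def __init__(self, safety_margin=5.0):
        self.safety_margin = safety_margin
        self._entry = None  # (token, expires_at)

    def get(self, fetch_token, now=None):
        now = time.monotonic() if now is None else now
        if self._entry and now < self._entry[1]:
            return self._entry[0]           # still fresh, serve cached
        token, ttl = fetch_token()          # e.g. returns ("abc123", 30.0)
        self._entry = (token, now + max(0.0, ttl - self.safety_margin))
        return token
```

Because the expiry is derived from each fetch rather than hard-coded, a provider-side policy change shortens the cache lifetime automatically.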

Slide 40


Slide 41


Slide 42

Logs and Metrics Need Distributed Context
Easily visualize WARN and ERROR log aggregates, then jump to the trace and to relevant metrics

Slide 43


Slide 44

In Summary
● Microservices have allowed Single Music to build and scale effectively and meet the needs of the world’s largest bands
● Observability tooling has helped Single become more effective at deploying and managing those microservice applications
● We rely on Instana because we want to focus on delivering music and value to our customers - not on setting up monitoring solutions and constantly querying the data needed to validate our architecture and reliability

Slide 45

“Sufficiently advanced monitoring is indistinguishable from testing …”
Ed Keyes, Site Reliability Engineer, Google, 2008

Slide 46

I think our organization could use more observability!

Slide 47

Thank you!
Kevin Crawley
Twitter: @notsureifkevin
Visit our booth #511 or schedule some time with us
http://bit.ly/instana-reinvent
