Why Observability Matters

Nylas
August 03, 2018

The macro landscape in software has changed in the last decade—metrics, monitoring, unstructured logs, and alerting aren't enough to tell what our software is doing anymore. This opening keynote from o11ycon 2018 will focus on how we've achieved observability on the Nylas platform, using many concrete examples—and will show what observability means to us and where we're still lacking.

Transcript

  1. Now:
    • Many services
    • Containers
    • Service discovery
    • Cattle, not pets
    • Autoscaling
    • Orchestration
    • High availability
  2. What is Observability?
    Software is opaque by default; it must generate data in order to clue humans in on what it is doing. Observable systems allow humans to answer the question, “Is it working properly?” and, if the answer is “no”, to diagnose the scope of impact and identify what is going wrong.
  3. What is Observability?
    Observable systems not only have the data available to understand them; that data is also accessible, explorable, and understandable in a fast, user-friendly manner.
  4. Who needs observability at Nylas?
    • Engineers
    • Customer support
    • Customers!
      ◦ Our customers are developers too
  5. How do we achieve observability?
    Our software generates...
    • Traditional time-series metrics
    • Structured logs & events
    • Stack traces / exceptions
    We use tools to explore this data.
  6. Example: Long-running database sessions
    • Our codebase uses transactions heavily (ORM default)
    • Long-running transactions are bad for performance in an RDBMS
    • Exhausting max connections on database hosts causes outages
    • Tricky to track down the source
  7. Example: Account sync state changes
    • Nylas sync backend involves persistent sync processes
    • Custom health system measures that sync processes are still alive & syncing properly
    • Accounts may transition between working/not-working states
  8. Example: Time to first message
    • Customers were reporting that mailboxes were showing no data for minutes
    • Used event exploration to find which accounts were affected, & then in-order log reading to figure out what was happening with those events
  9. Example: Time to first message
    Account slow to start up: tracked to a load-related issue with how we were claiming accounts to sync on sync fleet instances.
    Step 2: Examine instances of the problem in detail. In this case, we also consulted system metrics (CPU, etc.) and the logs for the new account queue.
  10. If the needles we were looking for were API requests, and we had several services involved in servicing those requests, step 2 might be a request-tracing tool instead.
  11. Why aren’t log search, metrics, monitoring enough?
    • Log search is too slow, and a lot of value comes out of grouping & filtering by fields. Hard to find the needles in the haystack.
    • Can’t zoom in on individual customers with metrics unless you thought of doing that beforehand. Tools blow up when trying to aggregate across many different metrics.
    • Cost scaling: in large-scale platforms, log data quickly balloons in volume
  12. Tools are complementary
    • Metrics & alerting to proactively learn about predictable issues
    • Data exploration to find potential sources or examples of a problem first, or to visualize patterns / trends on demand
    • Then tracing or log-reading to get details about specific issues
    • Sometimes spelunking into database objects is necessary to really get to the root
  13. What’s next?
    • What happens when your scale is so large that it’s no longer cost-effective to save detailed in-order logs to dive into individual problems?
    • Better tools for exploring synced data in the DB
    • A big part of our product is data state, which may need to be directly inspected to debug a problem
    • Engineers need to know the detailed, up-to-date system architecture to know which services may be involved in a problem
  14. Food for thought
    • What have your experiences with observability been?
    • Where could observability be improved in your own teams, companies, projects?
    • Who needs observability @ your org who is not an engineer?
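
Slides 5, 6 and 11 talk about emitting structured events and then grouping and filtering them by field, for example to find the source of long-running database sessions. A minimal Python sketch of that idea is given here. It is not Nylas's actual tooling: the names `emit`, `timed_session` and `slowest_callers`, the field names (`caller`, `account_id`, `duration_s`), and the plain list standing in for an event store are all assumptions made for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


def emit(sink, event_type, **fields):
    """Append one structured event (a flat dict of fields) to the sink."""
    event = {"event": event_type, **fields}
    sink.append(event)
    return event


@contextmanager
def timed_session(sink, caller, account_id, clock=time.monotonic):
    """Time a block of work and emit a db_session event when it ends.

    Hypothetical helper: a real system would wrap the ORM session or
    transaction here instead of an arbitrary block.
    """
    start = clock()
    try:
        yield
    finally:
        # Emit even if the block raised, so failed sessions are visible too.
        emit(sink, "db_session", caller=caller, account_id=account_id,
             duration_s=clock() - start)


def slowest_callers(events, threshold_s):
    """Group slow db_session events by caller.

    Returns (caller, count, max_duration_s) rows, slowest first.
    """
    durations = defaultdict(list)
    for e in events:
        if e["event"] == "db_session" and e["duration_s"] >= threshold_s:
            durations[e["caller"]].append(e["duration_s"])
    rows = [(caller, len(d), max(d)) for caller, d in durations.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Wrapping a unit of work as `with timed_session(events, "sync_folder", "acct-1"): ...` records one event per session, and `slowest_callers(events, 5.0)` then narrows the haystack to the code paths that held a session for five seconds or longer. Because each event carries `account_id`, the same data can be filtered per customer after the fact, which slide 11 notes is hard to do with pre-aggregated metrics.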