
Why we built Thundra at OpsGenie, and what is in it for you!

In this presentation, I talked about the motivation behind moving to a Serverless architecture and building a Serverless monitoring tool.

Serhat Can

April 25, 2018

Transcript

  1. Why we built Thundra at OpsGenie, and what is in it for you!
     25.04.2018 | Serhat Can
  2. About me
     • Ex-Software Engineer Technical Evangelist at
     • Co-organizer
       ◦ Serverless Turkey Meetup
       ◦ DevOpsDays İstanbul
       ◦ DevOps Turkey Meetup
     • @srhtcn on Twitter
  3. OpsGenie
     • Always-on
     • Don’t react, RESPOND!
     • Reliable and flexible alerting
     • Reporting and analytics
     • Close to 200th integration
  4. OpsGenie’s tech stack
     • Java as the main language
     • AWS on multiple zones & regions
     • EC2, VPC, SQS, SNS, DynamoDB and more
     • A lot of external services with backups and retries
  5. Pain points of current solutions
     • Fast scaling under immediate high load
     • Under-utilized machines
     • Pricing (still not a huge concern)
     • Operational complexity
     • Learning curve - Kubernetes?
  6. Why Serverless - AWS Lambda?
     • Migration to Microservices architecture
     • Better resource utilization
     • True auto-scaling (or is it?)
     • Less code and infra maintenance
     • We know and already use AWS
     • AWS Lambda is robust compared to others
  7. DynamoDB auto scale
     • Needed an automated way to scale DynamoDB to avoid read & write throttles
     • Capacity needs to be adjusted during migration
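The scaling decision behind slide 7 can be sketched as a small pure function. This is a minimal illustration, not the tool OpsGenie built: the name `target_capacity`, the utilization target, and the min/max bounds are all hypothetical. It picks a provisioned capacity so that the observed consumption sits at a chosen utilization percentage, which leaves headroom before throttling starts.

```python
import math


def target_capacity(consumed_units: float,
                    target_utilization_percent: int,
                    min_units: int,
                    max_units: int) -> int:
    """Return the provisioned capacity that puts consumption at the target utilization.

    All parameters are illustrative assumptions, not values from the talk.
    """
    if not 0 < target_utilization_percent <= 100:
        raise ValueError("target_utilization_percent must be in (0, 100]")
    # Capacity needed so that consumed / provisioned == target utilization.
    desired = math.ceil(consumed_units * 100 / target_utilization_percent)
    # Clamp to the configured bounds so a spike cannot scale without limit.
    return max(min_units, min(max_units, desired))


# Hypothetical reading: 700 consumed units, aiming for 70% utilization.
print(target_capacity(700, 70, min_units=5, max_units=10_000))
```

A scheduler would call such a function with recent consumption metrics and apply the result to the table; during a migration the bounds could be widened temporarily.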
  8. Our history with Serverless
     • DynamoDB cross region replication (at least before Global Tables)
     • Custom solutions and integrations
     • Our new feature, Service and Incident Management
  9. Pain points
     • Concurrent execution limit
       ◦ Takes time to increase the limit when needed
       ◦ One function can consume a lot
     • No well-known good practices
     • Hard to develop locally
     • API Gateway is not our favorite AWS service :)
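The concurrent execution limit on slide 9 can be made concrete with a rough estimate. For a steady load, concurrency is approximately requests per second multiplied by average duration. The function names, the traffic numbers, and the shared limit of 1000 below are illustrative assumptions, not figures from the talk.

```python
def estimated_concurrency(requests_per_second: float, avg_duration_ms: float) -> float:
    """Approximate number of simultaneous executions for a steady load."""
    return requests_per_second * avg_duration_ms / 1000


def share_of_limit(concurrency: float, shared_limit: int) -> float:
    """Fraction of a shared concurrency limit used by one function."""
    return concurrency / shared_limit


# Hypothetical numbers: 500 requests/s at 400 ms each, shared limit of 1000.
busy_function = estimated_concurrency(500, 400)
print(busy_function, share_of_limit(busy_function, 1000))
```

Under these assumed numbers a single function already takes a fifth of the shared limit, which is how one function can consume a lot.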
  10. Pain points that lead to Thundra
      • Hard to debug
      • Hard to auto-instrument
        ◦ You can’t attach JVM agents
      • Hard to search in logs
      • Hard to see the big picture
  11. The name: Thundra
      • Thundra is a type of Genie.
      • Through the use of her amulet, she can manipulate the weather.
      • She controls an army of clouds that spread rain throughout the world.
  12. Why not traditional methods?
      • CloudWatch: Only logs and not easy to search
      • AWS X-Ray
      • Existing APM solutions for non-serverless environments
  13. Why Thundra?
      • Zero overhead (async)
      • No code change (only implement our interface)
      • Automatic instrumentation and profiling
      • Three pillars of observability
      • Integrations (AWS SDK, JDBC, Redis etc.)
      • Reduce cold starts by warm-up
      [Diagram labels: Advanced searching, Metric & Log Aggregation, Debugging, Profiling, Tracing, Zero Overhead Instrumentation]
  14. Three pillars of observability
      • Trace
        ◦ No code change
        ◦ Rule and level based tracing
        ◦ Line by line tracing and debugging
      • Metric
        ◦ Environment (Java, Node.js, Go, Python) specific metric collection
        ◦ Rule based metric collection
      • Log
        ◦ Aggregate logs with traces
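The idea of aggregating logs with traces on slide 14 can be illustrated with a wrapper that is applied around a handler while leaving the handler body untouched. This is a hedged sketch and not Thundra's agent or API: `traced`, the trace dictionary layout, and the `sink` callback are hypothetical names. It records duration and errors, and collects log messages emitted during the invocation under the same trace id.

```python
import logging
import time
import uuid
from typing import Any, Callable, Dict


class _TraceLogCollector(logging.Handler):
    """Logging handler that appends each log message to the active trace."""

    def __init__(self, trace: Dict[str, Any]) -> None:
        super().__init__()
        self.trace = trace

    def emit(self, record: logging.LogRecord) -> None:
        self.trace["logs"].append(record.getMessage())


def traced(handler: Callable[[Any, Any], Any],
           sink: Callable[[Dict[str, Any]], None]) -> Callable[[Any, Any], Any]:
    """Wrap a handler so each invocation produces one trace record (hypothetical layout)."""

    def wrapper(event: Any, context: Any) -> Any:
        trace: Dict[str, Any] = {
            "trace_id": uuid.uuid4().hex,
            "name": handler.__name__,
            "logs": [],
            "error": None,
        }
        collector = _TraceLogCollector(trace)
        root = logging.getLogger()
        root.addHandler(collector)
        start = time.perf_counter()
        try:
            return handler(event, context)
        except Exception as exc:
            trace["error"] = repr(exc)
            raise  # the wrapper observes the failure but does not hide it
        finally:
            trace["duration_ms"] = (time.perf_counter() - start) * 1000.0
            root.removeHandler(collector)
            sink(trace)

    return wrapper
```

Only records that pass the logger's level reach the collector, so with Python's default configuration this captures warnings and above.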
  15. Async Monitoring
      • Doesn’t block invocation for publishing monitor data
      • Can switch between sync and async modes by configuration
      • Use cases:
        ◦ No extra delay is acceptable (min 20ms)
        ◦ Invocation should not fail due to monitor data publish failures
        ◦ Failing publications of monitor data should be retried
        ◦ Lambda runs in VPC so there is no internet access for HTTP(S)
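The sync/async switch on slide 15 can be sketched as follows. This is one possible shape under stated assumptions, not Thundra's implementation: `MonitorReporter`, `send_with_retries`, and the idea that async mode hands the payload to a log stream for a separate shipper to publish are all hypothetical. In both modes a publishing failure is reported as `False` and never raised into the invocation.

```python
import json
from typing import Any, Callable, Dict


def send_with_retries(send: Callable[[str], None], payload: str,
                      max_attempts: int = 3) -> bool:
    """Try to publish the payload; True on success, False if every attempt fails."""
    for _ in range(max_attempts):
        try:
            send(payload)
            return True
        except Exception:
            continue  # try again until attempts run out
    return False


class MonitorReporter:
    """Publishes monitor data either inline (sync) or via a log stream (async)."""

    def __init__(self, mode: str, send: Callable[[str], None],
                 write_log: Callable[[str], None]) -> None:
        if mode not in ("sync", "async"):
            raise ValueError("mode must be 'sync' or 'async'")
        self.mode = mode
        self.send = send            # hypothetical transport, e.g. an HTTP client call
        self.write_log = write_log  # hypothetical log writer, e.g. print

    def report(self, data: Dict[str, Any]) -> bool:
        payload = json.dumps(data, sort_keys=True)
        if self.mode == "async":
            # Hand off cheaply; a shipper outside the invocation publishes later.
            try:
                self.write_log(payload)
                return True
            except Exception:
                return False
        # Sync mode: publish inline, retrying before giving up.
        return send_with_retries(self.send, payload)
```

In this sketch the async path needs no outbound network call from the function itself, which matches the VPC use case on the slide, and the out-of-band shipper could reuse `send_with_retries` to cover the retry use case.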