Why we built Thundra at OpsGenie, and what is in it for you!

25.04.2018 | Serhat Can

About me
• Ex-Software Engineer, Technical Evangelist
• Co-organizer
  ◦ Serverless Turkey Meetup
  ◦ DevOpsDays İstanbul
  ◦ DevOps Turkey Meetup
• @srhtcn on Twitter

OpsGenie
• Always-on
• Don’t react, RESPOND!
• Reliable and flexible alerting
• Reporting and analytics
• Close to 200 integrations

Requirements of OpsGenie
• Always-on
• Reliable, on-time notification delivery
• Aggressive auto-scaling
• Security

OpsGenie’s tech stack
• Java as the main language
• AWS across multiple zones and regions
• EC2, VPC, SQS, SNS, DynamoDB and more
• Many external services, with backups and retries

Pain points of current solutions
• Fast scaling under immediate high load
• Under-utilized machines
• Pricing (still not a huge concern)
• Operational complexity
• Learning curve (Kubernetes?)

Why Serverless - AWS Lambda?
• Migration to a microservices architecture
• Better resource utilization
• True auto-scaling (or is it?)
• Less code and infra maintenance
• We know and already use AWS
• AWS Lambda is robust compared to others

Our history with Serverless
• 20 million+ invocations per day in production

DynamoDB auto scale
• Needed an automated way to scale DynamoDB to avoid read & write throttles
• Capacity needs to be adjusted during migration
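
The scaling decision behind such a tool can be sketched as a target-utilization calculation. This is only an illustration under stated assumptions, not the tool OpsGenie built: `CapacityPlanner`, its parameters, and the utilization policy are hypothetical names and choices.

```java
// Hypothetical sketch: pick a provisioned capacity so that the observed
// consumption sits at a target utilization, clamped to configured bounds.
final class CapacityPlanner {

    /**
     * @param consumedUnits            observed consumed read or write capacity units
     * @param targetUtilizationPercent desired consumed/provisioned ratio, 1..100
     * @param minUnits                 lower bound for provisioned capacity
     * @param maxUnits                 upper bound for provisioned capacity
     */
    static long targetCapacity(long consumedUnits, int targetUtilizationPercent,
                               long minUnits, long maxUnits) {
        if (targetUtilizationPercent < 1 || targetUtilizationPercent > 100) {
            throw new IllegalArgumentException("targetUtilizationPercent must be in 1..100");
        }
        if (consumedUnits < 0 || minUnits > maxUnits) {
            throw new IllegalArgumentException("invalid consumption or bounds");
        }
        // Integer ceiling of consumedUnits / (targetUtilizationPercent / 100).
        long needed = (consumedUnits * 100 + targetUtilizationPercent - 1)
                / targetUtilizationPercent;
        return Math.max(minUnits, Math.min(maxUnits, needed));
    }
}
```

For example, `CapacityPlanner.targetCapacity(700, 70, 5, 10_000)` asks for enough capacity that 700 consumed units correspond to 70% utilization. During a migration, the upper bound could be raised temporarily to absorb the extra write load.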

Our history with Serverless
• DynamoDB cross-region replication (at least before Global Tables)
• Custom solutions and integrations
• Our new feature, Service and Incident Management

Service and Incident Management

Pain points
• Cold starts in Java
• 100 ms-based pricing
• Retries with SQS (coming soon!)
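
The cost effect of 100 ms-based pricing can be shown with a little arithmetic. The sketch below assumes the billed duration is the actual duration rounded up to the next 100 ms increment; `BilledDuration` and its method names are illustrative.

```java
// Sketch of billing at 100 ms granularity: durations are rounded up,
// so short invocations pay for time they did not use.
final class BilledDuration {
    static final long BILLING_UNIT_MS = 100;

    /** Rounds the actual duration up to the next multiple of 100 ms. */
    static long billedMillis(long actualMillis) {
        long duration = Math.max(1, actualMillis); // assumption: at least one unit is billed
        return ((duration + BILLING_UNIT_MS - 1) / BILLING_UNIT_MS) * BILLING_UNIT_MS;
    }

    /** Fraction of the billed time that the invocation did not actually use. */
    static double wastedFraction(long actualMillis) {
        long billed = billedMillis(actualMillis);
        return (billed - Math.max(1, actualMillis)) / (double) billed;
    }
}
```

Under this assumption, a 20 ms invocation is billed as 100 ms, and a 101 ms invocation is billed as 200 ms, which is why shaving a few milliseconds around a 100 ms boundary can matter at 20 million+ invocations per day.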

Pain points
• Concurrent execution limit
  ◦ Takes time to increase the limit when needed
  ◦ One function can consume a lot
• No well-known good practices
• Hard to develop locally
• API Gateway is not our favorite AWS service :)

Pain points that led to Thundra
• Hard to debug
• Hard to auto-instrument
  ◦ You can’t attach JVM agents
• Hard to search in logs
• Hard to see the big picture

Thundra: A Spin-off

The name: Thundra
• Thundra is a type of Genie.
• Through the use of her amulet, she can manipulate the weather.
• She controls an army of clouds that spread rain throughout the world.

Why not traditional methods?
• CloudWatch: only logs, and not easy to search
• AWS X-Ray
• Existing APM solutions are built for non-serverless environments

Why Thundra?
• Zero overhead (async)
• No code change (only implement our interface)
• Automatic instrumentation and profiling
• Three pillars of observability
• Integrations (AWS SDK, JDBC, Redis etc.)
• Reduce cold starts by warm-up
• Features: advanced searching, metric & log aggregation, debugging, profiling, tracing, zero-overhead instrumentation
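
One common way to reduce cold starts by warm-up is to invoke the function periodically with a marker event and let the handler return early. The sketch below illustrates that general idea only; `WarmupAwareHandler`, the `warmup` event key, and the use of a plain `Map` as the event are assumptions for the example, not Thundra's API.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper: short-circuits warm-up invocations so the container
// stays warm without running (or paying for) the business logic.
final class WarmupAwareHandler implements Function<Map<String, Object>, String> {
    static final String WARMUP_KEY = "warmup"; // assumed marker in the event payload

    private final Function<Map<String, Object>, String> delegate;
    private int realInvocations;

    WarmupAwareHandler(Function<Map<String, Object>, String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public String apply(Map<String, Object> event) {
        if (Boolean.TRUE.equals(event.get(WARMUP_KEY))) {
            return "warmed"; // warm-up ping: skip the real handler
        }
        realInvocations++;
        return delegate.apply(event);
    }

    /** Number of non-warm-up invocations that reached the delegate. */
    int realInvocations() {
        return realInvocations;
    }
}
```

Wrapping the handler this way keeps the business logic unchanged, which is the same spirit as the "only implement our interface" bullet above.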

Three pillars of observability
• Trace
  ◦ No code change
  ◦ Rule and level based tracing
  ◦ Line by line tracing and debugging
• Metric
  ◦ Environment (Java, Node.js, Go, Python) specific metric collection
  ◦ Rule based metric collection
• Log
  ◦ Aggregate logs with traces

Async Monitoring
• Doesn’t block invocation for publishing monitor data
• Can switch between sync and async modes by configuration
• Use cases:
  ◦ No extra delay is acceptable (min 20 ms)
  ◦ Invocation should not fail due to monitor data publish failures
  ◦ Failing publications of monitor data should be retried
  ◦ Lambda runs in VPC so there is no internet access for HTTP(S)
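
The difference between the two modes can be illustrated with a generic publisher. This is a sketch under assumptions, not how Thundra implements async mode: `MonitorDataPublisher`, the `Consumer<String>` sink, and the in-process background thread are all illustrative. In Lambda, background work may not finish once the handler returns, so a real implementation would more likely hand the data off outside the function's process.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical publisher: async mode hands data to a background thread so the
// caller is not delayed; both modes retry and never propagate sink failures.
final class MonitorDataPublisher implements AutoCloseable {
    private final Consumer<String> sink; // assumed transport for monitor data
    private final boolean async;         // switchable by configuration
    private final int maxAttempts;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    MonitorDataPublisher(Consumer<String> sink, boolean async, int maxAttempts) {
        this.sink = sink;
        this.async = async;
        this.maxAttempts = maxAttempts;
    }

    /** Publishes monitor data; never throws because of sink failures. */
    void publish(String data) {
        if (async) {
            executor.execute(() -> sendWithRetry(data));
        } else {
            sendWithRetry(data);
        }
    }

    private boolean sendWithRetry(String data) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                sink.accept(data);
                return true;
            } catch (RuntimeException e) {
                // Swallow and retry: monitoring must not fail the invocation.
            }
        }
        return false; // data is dropped after the last attempt
    }

    /** Stops accepting work and waits briefly for queued data to be sent. */
    @Override
    public void close() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In sync mode the retries add latency to the invocation; in async mode they happen off the calling thread, which is the trade-off the use cases above describe.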

Links and references
• engineering.opsgenie.com
• thundra.io
• medium.com/thundra
• Twitter: @opsgenie and @thundraio

Thank you! @srhtcn
