
Processing TeraBytes of data every day and sleeping at night

This is the story of how we built a highly available data pipeline that processes terabytes of network data every day, making it available to security researchers for assessment and threat hunting.

Building this kind of stuff in the cloud is not that complicated, but if you have to make it near real-time, fault tolerant and 24/7 available, well... that's another story.

In this talk, we will tell you how we achieved this ambitious goal and how we missed a few good nights of sleep while trying to do that!

Spoiler alert: contains AWS, serverless, Elasticsearch, monitoring, alerting & more!

Luciano Mammino

February 06, 2019

Transcript

  1. Processing Terabytes of data every day … and sleeping at

    night @katavic_d - @loige User Group Belfast 06/02/2019 loige.link/tera-bel
  2. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  3. AI to detect and hunt for cyber attackers Cognito Platform

    • Detect • Recall @katavic_d - @loige
  5. Cognito Detect: on-premises solution • Analyzing network traffic and

    logs • Uses AI to deliver real-time attack visibility • Behaviour driven • Host centric • Provides threat context and most relevant attack details @katavic_d - @loige
  5. Cognito Recall • Collects network metadata and stores it in

    “the cloud” • Data is processed, enriched and standardised • Data is made searchable @katavic_d - @loige A new Vectra product for Incident Response
  6. Recall requirements • Data isolation • Ingestion speed: ~2 GB/min per

    customer (up to ~3 TB per day per customer; see the quick arithmetic below) • Investigation tool: flexible data exploration @katavic_d - @loige
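A quick sanity check of the figures on this slide, as a minimal Python sketch. The ~2 GB/min number comes straight from the slide; the rest is plain arithmetic:

```python
# Back-of-the-envelope check: ~2 GB of network metadata per minute,
# arriving around the clock for a single customer.
gb_per_minute = 2
minutes_per_day = 60 * 24
gb_per_day = gb_per_minute * minutes_per_day   # 2880 GB

print(f"~{gb_per_day / 1000:.1f} TB per day per customer")  # ~2.9 TB/day, i.e. "up to ~3 TB"
```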
  7. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  8. Security • Separate VPCs • Strict Security Groups (whitelisting) •

    Red, amber, green subnets • Encryption at rest through AWS services • Client Certificates + TLS • Pentest @katavic_d - @loige
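To illustrate the "Strict Security Groups (whitelisting)" point, a minimal boto3 sketch; the security group ID, port and CIDR below are placeholders, not the actual Vectra setup:

```python
import boto3

ec2 = boto3.client("ec2")

# Whitelisting mindset: the group starts with no inbound rules, and we only
# open the specific port/peer combinations a component actually needs.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",            # placeholder security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{
            "CidrIp": "203.0.113.0/24",        # placeholder: a single customer range
            "Description": "customer sensor, TLS only",
        }],
    }],
)
```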
  9. Warning: different timezones! @katavic_d - @loige

    *yeah, we actually look that cute when we sleep!
  10. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  11. Lambda timeouts incident • AWS Lambda timeout: 5 minutes (now 15 minutes)

    • We are receiving files every minute (containing 1 minute of network traffic) • During peak hours for the biggest customer, files can be too big to be processed within the timeout limit @katavic_d - @loige
  12. Lessons learned • Predictable data input for predictable performance •

    Data ingestion parallelization (exploiting serverless capabilities) @katavic_d - @loige
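One common way to apply the "data ingestion parallelization" lesson with serverless building blocks is a fan-out: split a large input file into chunks and process each chunk in its own invocation. A hedged sketch, where the worker function name and payload shape are invented for illustration (the talk does not specify how the split is done):

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def fan_out(bucket: str, key: str, num_chunks: int) -> None:
    """Instead of one Lambda chewing through a whole multi-GB file before the
    timeout, fire one asynchronous invocation per chunk."""
    for chunk in range(num_chunks):
        lambda_client.invoke(
            FunctionName="ingest-chunk",      # hypothetical worker function
            InvocationType="Event",           # async, fire-and-forget
            Payload=json.dumps({
                "bucket": bucket,
                "key": key,
                "chunk": chunk,
                "total_chunks": num_chunks,
            }),
        )
```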
  13. Lambda IP starvation incident • Spinning up many lambdas consumed

    all the available IPs in a subnet • Failure to get an IP for the new ES machines • Elasticsearch cannot scale up • Solution: separate Elasticsearch and Lambda subnets @katavic_d - @loige
  14. Lessons learned • Every running lambda inside a VPC uses

    an ENI (Elastic Network Interface) • Every ENI takes a private IP address • Edge conditions or bugs might generate spikes in the number of running lambdas and you might run out of IPs in the subnet! • Consider putting lambdas in their dedicated subnet @katavic_d - @loige
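To see why VPC-attached Lambdas can eat a subnet, here is a sketch of the ENI sizing guideline AWS documented at the time (ENIs roughly equal peak concurrent executions times memory in GB divided by 3 GB). The concurrency and memory figures below are invented for illustration:

```python
import math

def projected_eni_capacity(peak_concurrency: int, memory_gb: float) -> int:
    """Rough number of ENIs (and therefore private IPs) that VPC-attached
    Lambdas may claim, per the sizing guideline in the AWS docs of that era:
    ENIs ~= peak concurrent executions * (memory in GB / 3 GB)."""
    return math.ceil(peak_concurrency * (memory_gb / 3))

# Example (made-up numbers): a spike of 600 concurrent 1.5 GB lambdas
# can claim ~300 private IPs, more than a /24 subnet has to offer
# (251 usable addresses once AWS reserves its 5).
print(projected_eni_capacity(600, 1.5))  # -> 300
```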
  15. • New lambda version triggered insertion failures • Elasticsearch rejecting

    inserts and logging errors • Our log reporting agents got stuck (we DDoS’d ourselves!) • Monitoring/Alerting failed Resolution: • Fixed mismatching schema • Scaled out centralised logging system Why didn’t we receive the page? @katavic_d - @loige
  16. Alerting on lambda failures Using logs: • Best case: no

    logs • Worst case: no logs (available)! A better approach: • Attach a DLQ to your lambdas • Alert on queue size with CloudWatch! • Visibility on Lambda retries @katavic_d - @loige
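A minimal boto3 sketch of the "attach a DLQ and alert on queue size with CloudWatch" idea from the slide above; the queue name and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page whenever anything at all lands in the Lambda's dead letter queue.
cloudwatch.put_metric_alarm(
    AlarmName="ingestion-lambda-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "ingestion-dlq"}],   # placeholder queue
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:pager-topic"],  # placeholder topic
)
```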
  17. Fast retry at peak times • Lambda retry logic is

    not configurable loige.link/lambda-retry • Most events will be retried 2 times • Time between retry attempts is not clearly defined (observed to be in the order of a few seconds) • What if all retry attempts happen at peak times? @katavic_d - @loige
  18. Fast retry at peak times Processing in this range of

    time is likely to succeed @katavic_d - @loige
  19. Fast retry at peak times Processing in this range of

    time is likely to fail @katavic_d - @loige
  20. Fast retry at peak times If the 1st and 2nd retry both land in the

    same (peak) zone, the message will fail and go to the DLQ
  21. @katavic_d - @loige Extended retry period We normally trigger our

    ingestion Lambda when a new file is stored in S3
  22. @katavic_d - @loige Extended retry period If the Lambda fails,

    the event is automatically retried, up to 2 times
  23. @katavic_d - @loige Extended retry period If the Lambda still

    fails, the event is copied to the Dead Letter Queue (DLQ)
  24. @katavic_d - @loige Extended retry period At this point our

    Lambda can receive an SQS event from the DLQ (custom retry logic)
  25. @katavic_d - @loige Extended retry period If the processing still

    fails, we can extend the VisibilityTimeout (event delay), up to 3 times
  26. @katavic_d - @loige Extended retry period If the processing still

    fails after that, we eventually drop the message and alert for manual intervention.
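Slides 21 to 26 in code form: a hedged sketch of a custom retry Lambda fed by the DLQ. It assumes a batch size of 1 on the SQS trigger; `process` and `alert_for_manual_intervention` are placeholders for the real ingestion and paging logic, and the queue URL is invented:

```python
import boto3

sqs = boto3.client("sqs")

DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ingestion-dlq"  # placeholder
MAX_RECEIVES = 3                    # the "up to 3 times" on the slides
RETRY_DELAY_SECONDS = 30 * 60       # push the next attempt well past the peak


def process(body: str) -> None:
    """Placeholder for the real ingestion logic (parse, enrich, index)."""
    raise NotImplementedError


def alert_for_manual_intervention(record: dict) -> None:
    """Placeholder: emit the custom metric / page instead of failing silently."""
    print("giving up on message", record["messageId"])


def handler(event, context):
    # SQS trigger with batch size 1: one record per invocation.
    record = event["Records"][0]
    receive_count = int(record["attributes"]["ApproximateReceiveCount"])
    try:
        process(record["body"])
    except Exception:
        if receive_count >= MAX_RECEIVES:
            # Eventually drop the message and ask a human to look at it.
            alert_for_manual_intervention(record)
            return  # returning success lets Lambda delete the message from the DLQ
        # Delay the next attempt by extending the message's visibility timeout,
        # then fail the invocation so SQS keeps the message for another round.
        sqs.change_message_visibility(
            QueueUrl=DLQ_URL,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=RETRY_DELAY_SECONDS * receive_count,
        )
        raise
```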
  27. Lessons learned • Cannot always rely on the default retry

    logic • SQS events + DLQ = custom SERVERLESS retry logic • Now we only alert on custom metrics when we are sure the event will fail (logic error) • https://loige.link/async-lambda-retry @katavic_d - @loige
  28. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  29. AWS nuances • Serverless is generally cheap, but be careful!

    ◦ You are paying for wait time ◦ Bugs may be expensive ◦ Billed in 100 ms blocks (worked example below) • https://loige.link/lambda-pricing • https://loige.link/serverless-costs-all-wrong @katavic_d - @loige
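A worked example of the "you are paying for wait time" and "100 ms billing blocks" points, using the public Lambda compute price at the time ($0.00001667 per GB-second, duration rounded up to 100 ms); the memory size and duration below are invented:

```python
import math

PRICE_PER_GB_SECOND = 0.00001667   # Lambda compute price at the time of the talk


def invocation_cost(duration_ms: float, memory_mb: int) -> float:
    billed_ms = math.ceil(duration_ms / 100) * 100          # 100 ms billing blocks
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND


# A 1536 MB function that mostly *waits* on downstream services for 90 s,
# invoked once a minute, around the clock, for a single customer:
per_run = invocation_cost(90_000, 1536)
per_month = per_run * 60 * 24 * 30
print(f"${per_run:.5f} per run, ~${per_month:.0f} per customer per month")
```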
  30. AWS nuances • Not every service/feature is available in every

    region or AZ ◦ SQS FIFO :( ◦ Not all AWS regions have 3 AZs ◦ Not all instance types are available in every availability zone • https://loige.link/aws-regional-services @katavic_d - @loige
  31. AWS nuances • Limits everywhere! ◦ Soft vs hard limits

    ◦ Take them into account in your design • https://loige.link/aws-service-limits @katavic_d - @loige
  32. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  33. Process: how to deal with incidents • Page • Engineers

    on call • Incident Retrospective • Actions @katavic_d - @loige
  34. Pages • A page is an alarm for people on call

    (PagerDuty) • Rotate ops & devs (share the pain) • Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.) • When a page is received, it needs to be acknowledged or it is automatically escalated • If customer facing (e.g. service not available), the customer is notified @katavic_d - @loige
  35. Engineers on call 1. Use operational handbook 2. Might escalate

    to other engineers 3. Find mitigation / remediation 4. Update handbook 5. Prepare for retrospective @katavic_d - @loige
  36. Incidents Retrospective "Regardless of what we discover, we understand and

    truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." – Norm Kerth, Project Retrospectives: A Handbook for Team Reviews TL;DR: NOT A BLAMING GAME! @katavic_d - @loige
  37. Incidents Retrospective • Summary • Events timeline • Contributing Factors

    • Remediation / Solution • Actions for the future • Transparency @katavic_d - @loige
  38. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  39. Development best practices • Regular Retrospectives (not just for incidents)

    ◦ What’s good ◦ What’s bad ◦ Actions to improve • Kanban Board ◦ All work visible ◦ One card at a time ◦ Work In Progress limit ◦ “Stop Starting, Start Finishing” @katavic_d - @loige
  40. Development best practices • Clear acceptance criteria ◦ Collectively defined

    (3 amigos) ◦ Make sure you know when a card is done • Split the work into small cards ◦ High throughput ◦ More predictability • Bugs take priority over features! @katavic_d - @loige
  41. Development best practices • Pair programming ◦ Share the knowledge/responsibility

    ◦ Improve team dynamics ◦ Enforced by low WIP limit • Quality over deadlines • Don’t estimate without data @katavic_d - @loige
  42. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  43. Release process • Infrastructure as code ◦ Deterministic deployments

    ◦ Infrastructure versioning using git • No “snowflakes”: one code base for all customers • Feature flags: ◦ Special features ◦ Soft releases • Automated tests before release @katavic_d - @loige
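To make the "feature flags / soft releases" bullet concrete, a tiny hypothetical sketch of per-customer flags kept alongside the versioned infrastructure code (the talk does not say how theirs are stored; the names below are invented):

```python
# Per-customer feature flags, living in the same versioned repository as the
# infrastructure code, so a "soft release" is just a reviewed config change.
FEATURE_FLAGS = {
    "default":   {"new_enrichment_pipeline": False},
    "acme-corp": {"new_enrichment_pipeline": True},   # soft release to one customer
}


def is_enabled(customer: str, flag: str) -> bool:
    flags = FEATURE_FLAGS.get(customer, FEATURE_FLAGS["default"])
    return flags.get(flag, FEATURE_FLAGS["default"].get(flag, False))


if is_enabled("acme-corp", "new_enrichment_pipeline"):
    print("run the new pipeline for this customer")
```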
  44. Conclusion We are still waking up at night sometimes, but

    we are definitely sleeping a lot more and better! Takeaways: • Have healthy and clear processes • Allow your team space to fail • Always review and strive for improvement • Monitor/Instrument as much as you can • Use managed services to reduce the operational overhead (but learn their nuances) @katavic_d - @loige
  45. We are hiring … Talk to us! @katavic_d - @loige

    Thank you! loige.link/tera-bel
  46. Credits Pictures from Unsplash Huge thanks for support and reviews

    to: • All the Vectra team • Yan Cui (@theburningmonk) • Paul Dolan • @gbinside • @augeva • @Podgeypoos79 • @PawrickMannion • @micktwomey • Vedran Jukic