
Processing terabytes of data every day and sleeping at night

This is the story of how we built a highly available data pipeline that processes terabytes of network data every day, making it available to security researchers for assessment and threat hunting. Building this kind of stuff in the cloud is not that complicated, but if you have to make it near real-time, fault-tolerant and 24/7 available, well... that's another story. In this talk, we will tell you how we achieved this ambitious goal and how we missed a few good nights of sleep while trying to do that! Spoiler alert: contains AWS, serverless, Elasticsearch, monitoring, alerting & more!

Luciano Mammino

November 30, 2018

Transcript

  1. Processing Terabytes of data every day … and sleeping at night @katavic_d - @loige Milan, 30/11/2018 loige.link/terabytes
  2. Agenda • The problem space • Our first MVP & Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  3. AI to detect and hunt for cyber attackers Cognito Platform • Detect • Recall @katavic_d - @loige
  4. Cognito Detect, on-premise solution • Analyzing network traffic and logs • Uses AI to deliver real-time attack visibility • Behaviour-driven, host-centric • Provides threat context and the most relevant attack details @katavic_d - @loige
  5. Cognito Recall A new Vectra product for Incident Response • Collects network metadata and stores it in “the cloud” • Data is processed, enriched and standardised • Data is made searchable @katavic_d - @loige
  6. Recall requirements • Data isolation • Ingestion speed: ~2 GB/min per customer (up to ~3 TB per day per customer) • Forensic tool: flexible data exploration @katavic_d - @loige
  7. Agenda • The problem space • Our first MVP & Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  8. Security • Separate VPCs • Strict Security Groups (whitelisting) • Red, amber, green subnets • Encryption at rest through AWS services • Client Certificates + TLS • Pentest @katavic_d - @loige
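A minimal sketch of the "Client Certificates + TLS" idea from a sender's point of view, using Python's requests library; the endpoint URL and certificate paths are illustrative, not taken from the talk:

```python
import requests

# Illustrative endpoint and certificate paths
INGEST_URL = "https://ingest.example.com/v1/metadata"

response = requests.post(
    INGEST_URL,
    data=open("batch.json.gz", "rb"),
    # Present a client certificate so the server can authenticate the sender
    cert=("/etc/recall/client.crt", "/etc/recall/client.key"),
    # Verify the server against a private CA instead of the system bundle
    verify="/etc/recall/ca.pem",
    timeout=30,
)
response.raise_for_status()
```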
  9. Warning: different timezones!* @katavic_d - @loige *yeah, we actually look that cute when we sleep!
  10. Agenda • The problem space • Our first MVP & Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  11. Lambda timeouts incident • AWS Lambda timeout: 5 minutes (now 15) • We are receiving files every minute (containing 1 minute of network traffic) • During peak hours for the biggest customer, files can be too big to be processed within 5 minutes @katavic_d - @loige
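One possible way to defend against this kind of timeout (not necessarily what the team did): have the handler watch the remaining execution time and hand leftover work to another invocation. The event shape, queue URL and process_record helper are hypothetical:

```python
import json

import boto3

sqs = boto3.client("sqs")
LEFTOVER_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/leftover-work"  # hypothetical
SAFETY_MARGIN_MS = 30_000  # stop well before the hard Lambda timeout


def handler(event, context):
    records = event["records"]  # hypothetical event shape
    for i, record in enumerate(records):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Not enough time left: defer the rest to another invocation
            sqs.send_message(QueueUrl=LEFTOVER_QUEUE_URL, MessageBody=json.dumps(records[i:]))
            return {"processed": i, "deferred": len(records) - i}
        process_record(record)
    return {"processed": len(records), "deferred": 0}


def process_record(record):
    ...  # parse, enrich and index a single record
```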
  12. Lessons learned • Predictable data input for predictable performance • Data ingestion parallelization (exploiting serverless capabilities) @katavic_d - @loige
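As a sketch of what "exploiting serverless capabilities" can look like, a coordinator function might split each incoming file into chunks and asynchronously invoke one worker Lambda per chunk; the worker name and chunk size are assumptions:

```python
import json

import boto3

lambda_client = boto3.client("lambda")
WORKER_FUNCTION = "recall-ingest-worker"  # hypothetical worker Lambda
CHUNK_SIZE = 5_000


def fan_out(records):
    """Invoke one worker per chunk so ingestion time stays flat as input grows."""
    for start in range(0, len(records), CHUNK_SIZE):
        chunk = records[start:start + CHUNK_SIZE]
        lambda_client.invoke(
            FunctionName=WORKER_FUNCTION,
            InvocationType="Event",  # asynchronous: don't wait for the worker to finish
            Payload=json.dumps({"records": chunk}),
        )
```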
  13. Lambdas IP starvation incident • Spinning up many lambdas consumed all the available IPs in a subnet • Failure to get an IP for the new ES machines • ElasticSearch cannot scale up • Solution: separate ElasticSearch and Lambda subnets @katavic_d - @loige
  14. Lessons learned • Every lambda takes an IP from the subnet • Edge conditions or bugs might generate spikes in the number of running lambdas and you might run out of IPs in the subnet! • Consider putting lambdas in their dedicated subnet @katavic_d - @loige
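A small boto3 sketch of how you might keep an eye on free addresses in the Lambda subnets; the subnet IDs and threshold are illustrative, and in practice the warning would feed the alerting pipeline rather than print:

```python
import boto3

ec2 = boto3.client("ec2")
LAMBDA_SUBNET_IDS = ["subnet-0123456789abcdef0"]  # illustrative
MIN_FREE_IPS = 50


def check_subnet_capacity():
    subnets = ec2.describe_subnets(SubnetIds=LAMBDA_SUBNET_IDS)["Subnets"]
    for subnet in subnets:
        free = subnet["AvailableIpAddressCount"]
        if free < MIN_FREE_IPS:
            # Placeholder: in a real setup this would raise a page
            print(f"WARNING: {subnet['SubnetId']} has only {free} free IPs")
```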
  15. • New lambda version: triggered insertion failures • ElasticSearch rejecting inserts and logging errors • Our log reporting agents got stuck (we DDoS’d ourselves!) • Monitoring/Alerting failed • Resolution: ◦ Fix mismatching schema ◦ Scaled out centralised logging system • Why didn’t we receive the page? @katavic_d - @loige
  16. Alerting on lambda failures Using logs: • Best case: no logs • Worst case: no logs (available)! A better approach: • Attach a DLQ to your lambdas • Alert on queue size with CloudWatch! • Visibility on Lambda retries @katavic_d - @loige
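For example, a CloudWatch alarm that fires as soon as the DLQ is not empty might look like this (the queue name and SNS topic ARN are illustrative):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ingest-lambda-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "ingest-lambda-dlq"}],  # illustrative queue
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:pager"],  # illustrative topic wired to paging
)
```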
  17. Missing data incident: the return! • Missing data in the database • Stack is working fine • Not receiving data in the collector node • Root cause: a new customer firewall rule was blocking our traffic at source! • As soon as the rule was fixed, data was flowing in again @katavic_d - @loige
  18. How to deal with lack of data... • When is this an actual problem? • How to alert effectively? @katavic_d - @loige
  19. Lessons learned • Ping & health checks to make sure everything is working and data can flow • Tracing to track performance degradations and pipeline issues • Instrumentation can be very valuable: do it! @katavic_d - @loige
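One way to implement the "data can flow" check is a scheduled freshness probe: look at the newest object in the ingestion bucket and publish the lag as a metric you can alarm on. The bucket, prefix and metric names below are assumptions, and a real implementation would paginate past the first 1000 objects:

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")
BUCKET, PREFIX = "recall-ingest", "customer-a/"  # illustrative names


def handler(event, context):
    """Scheduled health check: how stale is the newest file in the ingest bucket?"""
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    if objects:
        newest = max(obj["LastModified"] for obj in objects)
        lag_seconds = (datetime.now(timezone.utc) - newest).total_seconds()
    else:
        lag_seconds = 24 * 3600  # nothing there at all: report a full day of lag

    cloudwatch.put_metric_data(
        Namespace="Recall/Ingestion",
        MetricData=[{
            "MetricName": "IngestLagSeconds",
            "Dimensions": [{"Name": "Customer", "Value": "customer-a"}],
            "Value": lag_seconds,
            "Unit": "Seconds",
        }],
    )
```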
  20. Agenda • The problem space • Our first MVP & Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  21. Instrumentation & metrics Statsd • Daemon for stats aggregations • Very lightweight (UDP based) • Simple to integrate • Visualize data through Grafana @katavic_d - @loige
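Integration really is simple; with the common statsd Python client (metric names here are illustrative), counters and gauges are one-liners fired over UDP:

```python
import statsd

# One UDP client for the whole process; sends are fire-and-forget, so overhead is tiny
stats = statsd.StatsClient("localhost", 8125, prefix="recall.ingest")

stats.incr("files.received")       # count an event
stats.gauge("batch.size", 12_500)  # report a point-in-time value
```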
  22. Timing Used to report time measurements, e.g. how long it took to process and insert a batch
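With the same kind of client, a timing measurement can wrap the batch processing step; enrich and bulk_insert below are stand-ins for the real processing and Elasticsearch insert:

```python
import statsd

stats = statsd.StatsClient("localhost", 8125, prefix="recall.ingest")

# Everything inside the block is measured and shipped as a statsd timing metric
with stats.timer("batch.process_and_insert"):
    enriched = enrich(batch)   # stand-in for the real enrichment step
    bulk_insert(enriched)      # stand-in for the Elasticsearch bulk insert
```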
  23. Agenda • The problem space • Our first MVP & Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  24. AWS nuances • Serverless is cheap, but be aware of timeouts! • Not every service/feature is available everywhere ◦ SQS FIFO :( ◦ Not all AWS regions have 3 AZs ◦ Not all instance types are available in every availability zone • Limits everywhere! ◦ Soft vs hard limits ◦ Take them into account in your design @katavic_d - @loige
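Some of those soft limits can also be checked from code; for instance, boto3 exposes the Lambda account limits and current usage:

```python
import boto3

lambda_client = boto3.client("lambda")

settings = lambda_client.get_account_settings()
limit = settings["AccountLimit"]["ConcurrentExecutions"]
deployed = settings["AccountUsage"]["FunctionCount"]
print(f"Concurrent execution limit: {limit}, functions deployed: {deployed}")
```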
  25. Agenda • The problem space • Our first MVP & Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  26. Process How to deal with incidents • Page • Engineers on call • Incident Retrospective • Actions @katavic_d - @loige
  27. Pages • A page is an alarm for the people on call (PagerDuty) • Rotate ops & devs (share the pain) • Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.) • When a page is received, it needs to be acknowledged or it is automatically escalated • If customer facing (e.g. service not available), the customer is notified @katavic_d - @loige
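As a sketch, a source that is not natively integrated with the pager can raise a page by publishing to a shared SNS topic that the paging service subscribes to; the topic ARN and message shape are assumptions:

```python
import json

import boto3

sns = boto3.client("sns")
PAGER_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:pager"  # illustrative


def page(summary, details):
    """Raise a page from application code via the shared alerting topic."""
    sns.publish(
        TopicArn=PAGER_TOPIC_ARN,
        Subject=summary[:100],  # SNS caps subjects at 100 characters
        Message=json.dumps({"summary": summary, "details": details}),
    )
```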
  28. Engineers on call 1. Use operational handbook 2. Might escalate to other engineers 3. Find mitigation / remediation 4. Update handbook 5. Prepare for retrospective @katavic_d - @loige
  29. Incidents Retrospective "Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." – Norm Kerth, Project Retrospectives: A Handbook for Team Reviews TL;DR: NOT A BLAMING GAME! @katavic_d - @loige
  30. Incidents Retrospective • Summary • Events timeline • Contributing Factors • Remediation / Solution • Actions for the future • Transparency @katavic_d - @loige
  31. Agenda • The problem space • Our first MVP & Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  32. Development best practices • Regular Retrospectives (not just for incidents) ◦ What’s good ◦ What’s bad ◦ Actions to improve • Kanban Board ◦ All work visible ◦ One card at a time ◦ Work In Progress limit ◦ “Stop Starting, Start Finishing” @katavic_d - @loige
  33. Development best practices • Clear acceptance criteria ◦ Collectively defined (3 amigos) ◦ Make sure you know when a card is done • Split the work into small cards ◦ High throughput ◦ More predictability • Bugs take priority over features! @katavic_d - @loige
  34. Development best practices • Pair programming ◦ Share the knowledge/responsibility ◦ Improve team dynamics ◦ Enforced by low WIP limit • Quality over deadlines • Don’t estimate without data @katavic_d - @loige
  35. Agenda • The problem space • Our first MVP & Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  36. Release process • Infrastructure as code (Terraform + Ansible) ◦ Deterministic deployments ◦ Infrastructure versioning using git • No “snowflakes”, one code base for all customers • Feature flags: ◦ Special features ◦ Soft releases • Automated tests before release @katavic_d - @loige
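A minimal sketch of how feature flags can gate special features and soft releases while keeping one code base; the flag source and names are illustrative, and the real mechanism may differ:

```python
import json
import os

# Flags could come from per-customer configuration; an environment variable is used here for brevity
FLAGS = json.loads(os.environ.get("FEATURE_FLAGS", "{}"))  # e.g. '{"new_enrichment": true}'


def is_enabled(flag_name, default=False):
    return bool(FLAGS.get(flag_name, default))


if is_enabled("new_enrichment"):
    ...  # soft-released code path
else:
    ...  # current behaviour
```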
  37. Conclusion @katavic_d - @loige We are still waking up at night sometimes, but we are definitely sleeping a lot more and better! Takeaways: • Have healthy and clear processes • Always review and strive for improvement • Monitor/Instrument as much as you can (even the monitoring itself) • Use managed services to reduce the operational overhead (but learn their nuances)
  38. Credits Pictures from Unsplash Huge thanks to: • All the Vectra team • Paul Dolan • @gbinside • @augeva • @Podgeypoos79 • @PawrickMannion • @micktwomey • Vedran Jukic for support and reviews!