
Processing TeraBytes of data every day and sleeping at night

This is the story of how we built a highly available data pipeline that processes terabytes of network data every day, making it available to security researchers for assessment and threat hunting.

Building this kind of stuff in the cloud is not that complicated, but if you have to make it near real-time, fault tolerant and 24/7 available, well... that's another story.

In this talk, we will tell you how we achieved this ambitious goal and how we missed a few good nights of sleep while trying to do that!

Spoiler alert: contains AWS, serverless, Elasticsearch, monitoring, alerting & more!

Luciano Mammino

February 06, 2019

Transcript

  1. Processing Terabytes of data every day … and sleeping at

    night @katavic_d - @loige User Group Belfast 06/02/2019 loige.link/tera-bel
  2. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  3. AI to detect and hunt for cyber attackers Cognito Platform

    • Detect • Recall @katavic_d - @loige
  5. Cognito Detect: on-premises solution • Analyzing network traffic and

    logs • Uses AI to deliver real-time attack visibility • Behaviour driven • Host centric • Provides threat context and most relevant attack details @katavic_d - @loige
  5. Cognito Recall • Collects network metadata and stores it in

    “the cloud” • Data is processed, enriched and standardised • Data is made searchable @katavic_d - @loige A new Vectra product for Incident Response
  6. Recall requirements • Data isolation • Ingestion speed: ~2 GB/min per

    customer (up to ~3 TB per day per customer; see the quick arithmetic below) • Investigation tool: flexible data exploration @katavic_d - @loige
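A quick sanity check of the figures on this slide, as a minimal Python sketch. The ~2 GB/min number comes straight from the slide; the rest is plain arithmetic:

```python
# Back-of-the-envelope check: ~2 GB of network metadata per minute,
# arriving around the clock for a single customer.
gb_per_minute = 2
minutes_per_day = 60 * 24
gb_per_day = gb_per_minute * minutes_per_day   # 2880 GB

print(f"~{gb_per_day / 1000:.1f} TB per day per customer")  # ~2.9 TB/day, i.e. "up to ~3 TB"
```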
  7. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  8. Security • Separate VPCs • Strict Security Groups (whitelisting) •

    Red, amber, green subnets • Encryption at rest through AWS services • Client Certificates + TLS • Pentest @katavic_d - @loige
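To illustrate the "Strict Security Groups (whitelisting)" point, a minimal boto3 sketch; the security group ID, port and CIDR below are placeholders, not the actual Vectra setup:

```python
import boto3

ec2 = boto3.client("ec2")

# Whitelisting mindset: the group starts with no inbound rules, and we only
# open the specific port/peer combinations a component actually needs.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",            # placeholder security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{
            "CidrIp": "203.0.113.0/24",        # placeholder: a single customer range
            "Description": "customer sensor, TLS only",
        }],
    }],
)
```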
  9. Warning: different timezones! @katavic_d - @loige

    *yeah, we actually look that cute when we sleep!
  10. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  11. Lambda timeouts incident • AWS Lambda timeout: 5 minutes (now 15 minutes)

    • We are receiving files every minute (containing 1 minute of network traffic) • During peak hours for the biggest customer, files can be too big to be processed within the timeout limit @katavic_d - @loige
  12. Lessons learned • Predictable data input for predictable performance •

    Data ingestion parallelization (exploiting serverless capabilities) @katavic_d - @loige
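One common way to apply the "data ingestion parallelization" lesson with serverless building blocks is a fan-out: split a large input file into chunks and process each chunk in its own invocation. A hedged sketch, where the worker function name and payload shape are invented for illustration (the talk does not specify how the split is done):

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def fan_out(bucket: str, key: str, num_chunks: int) -> None:
    """Instead of one Lambda chewing through a whole multi-GB file before the
    timeout, fire one asynchronous invocation per chunk."""
    for chunk in range(num_chunks):
        lambda_client.invoke(
            FunctionName="ingest-chunk",      # hypothetical worker function
            InvocationType="Event",           # async, fire-and-forget
            Payload=json.dumps({
                "bucket": bucket,
                "key": key,
                "chunk": chunk,
                "total_chunks": num_chunks,
            }),
        )
```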
  13. Lambda IP starvation incident • Spinning up many lambdas consumed

    all the available IPs in a subnet • Failure to get an IP for the new ES machines • Elasticsearch cannot scale up • Solution: separate Elasticsearch and Lambda subnets @katavic_d - @loige
  14. Lessons learned • Every running lambda inside a VPC uses

    an ENI (Elastic Network Interface) • Every ENI takes a private IP address • Edge conditions or bugs might generate spikes in the number of running lambdas and you might run out of IPs in the subnet! • Consider putting lambdas in their dedicated subnet @katavic_d - @loige
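To see why VPC-attached Lambdas can eat a subnet, here is a sketch of the ENI sizing guideline AWS documented at the time (ENIs roughly equal peak concurrent executions times memory in GB divided by 3 GB). The concurrency and memory figures below are invented for illustration:

```python
import math

def projected_eni_capacity(peak_concurrency: int, memory_gb: float) -> int:
    """Rough number of ENIs (and therefore private IPs) that VPC-attached
    Lambdas may claim, per the sizing guideline in the AWS docs of that era:
    ENIs ~= peak concurrent executions * (memory in GB / 3 GB)."""
    return math.ceil(peak_concurrency * (memory_gb / 3))

# Example (made-up numbers): a spike of 600 concurrent 1.5 GB lambdas
# can claim ~300 private IPs, more than a /24 subnet has to offer
# (251 usable addresses once AWS reserves its 5).
print(projected_eni_capacity(600, 1.5))  # -> 300
```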
  15. • New lambda version triggered insertion failures • Elasticsearch rejecting

    inserts and logging errors • Our log reporting agents got stuck (we DDoS’d ourselves!) • Monitoring/Alerting failed Resolution: • Fixed mismatching schema • Scaled out centralised logging system Why didn’t we receive the page? @katavic_d - @loige
  16. Alerting on lambda failures Using logs: • Best case: no

    logs • Worst case: no logs (available)! A better approach: • Attach a DLQ to your lambdas • Alert on queue size with CloudWatch! • Visibility on Lambda retries @katavic_d - @loige
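A minimal boto3 sketch of the "attach a DLQ and alert on queue size with CloudWatch" idea from the slide above; the queue name and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page whenever anything at all lands in the Lambda's dead letter queue.
cloudwatch.put_metric_alarm(
    AlarmName="ingestion-lambda-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "ingestion-dlq"}],   # placeholder queue
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:pager-topic"],  # placeholder topic
)
```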
  17. Fast retry at peak times • Lambda retry logic is

    not configurable loige.link/lambda-retry • Most events will be retried 2 times • Time between retry attempts is not clearly defined (observed to be in the order of a few seconds) • What if all retry attempts happen at peak times? @katavic_d - @loige
  18. Fast retry at peak times Processing in this range of

    time is likely to succeed @katavic_d - @loige
  19. Fast retry at peak times Processing in this range of

    time is likely to fail @katavic_d - @loige
  20. Fast retry at peak times If the 1st and 2nd retry both land in the

    same (peak) zone, the message will fail and go to the DLQ
  21. @katavic_d - @loige Extended retry period We normally trigger our

    ingestion Lambda when a new file is stored in S3
  22. @katavic_d - @loige Extended retry period If the Lambda fails,

    the event is automatically retried, up to 2 times
  23. @katavic_d - @loige Extended retry period If the Lambda still

    fails, the event is copied to the Dead Letter Queue (DLQ)
  24. @katavic_d - @loige Extended retry period At this point our

    Lambda can receive an SQS event from the DLQ (custom retry logic)
  25. @katavic_d - @loige Extended retry period If the processing still

    fails, we can extend the VisibilityTimeout (event delay), up to 3 times
  26. @katavic_d - @loige Extended retry period If the processing still

    fails after that, we eventually drop the message and alert for manual intervention.
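Slides 21 to 26 in code form: a hedged sketch of a custom retry Lambda fed by the DLQ. It assumes a batch size of 1 on the SQS trigger; `process` and `alert_for_manual_intervention` are placeholders for the real ingestion and paging logic, and the queue URL is invented:

```python
import boto3

sqs = boto3.client("sqs")

DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ingestion-dlq"  # placeholder
MAX_RECEIVES = 3                    # the "up to 3 times" on the slides
RETRY_DELAY_SECONDS = 30 * 60       # push the next attempt well past the peak


def process(body: str) -> None:
    """Placeholder for the real ingestion logic (parse, enrich, index)."""
    raise NotImplementedError


def alert_for_manual_intervention(record: dict) -> None:
    """Placeholder: emit the custom metric / page instead of failing silently."""
    print("giving up on message", record["messageId"])


def handler(event, context):
    # SQS trigger with batch size 1: one record per invocation.
    record = event["Records"][0]
    receive_count = int(record["attributes"]["ApproximateReceiveCount"])
    try:
        process(record["body"])
    except Exception:
        if receive_count >= MAX_RECEIVES:
            # Eventually drop the message and ask a human to look at it.
            alert_for_manual_intervention(record)
            return  # returning success lets Lambda delete the message from the DLQ
        # Delay the next attempt by extending the message's visibility timeout,
        # then fail the invocation so SQS keeps the message for another round.
        sqs.change_message_visibility(
            QueueUrl=DLQ_URL,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=RETRY_DELAY_SECONDS * receive_count,
        )
        raise
```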
  27. Lessons learned • Cannot always rely on the default retry

    logic • SQS events + DLQ = custom SERVERLESS retry logic • Now we only alert on custom metrics when we are sure the event will fail (logic error) • https://loige.link/async-lambda-retry @katavic_d - @loige
  28. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  29. AWS nuances • Serverless is generally cheap, but be careful!

    ◦ You are paying for wait time ◦ Bugs may be expensive ◦ Billed in 100 ms blocks (worked example below) • https://loige.link/lambda-pricing • https://loige.link/serverless-costs-all-wrong @katavic_d - @loige
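A worked example of the "you are paying for wait time" and "100 ms billing blocks" points, using the public Lambda compute price at the time ($0.00001667 per GB-second, duration rounded up to 100 ms); the memory size and duration below are invented:

```python
import math

PRICE_PER_GB_SECOND = 0.00001667   # Lambda compute price at the time of the talk


def invocation_cost(duration_ms: float, memory_mb: int) -> float:
    billed_ms = math.ceil(duration_ms / 100) * 100          # 100 ms billing blocks
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND


# A 1536 MB function that mostly *waits* on downstream services for 90 s,
# invoked once a minute, around the clock, for a single customer:
per_run = invocation_cost(90_000, 1536)
per_month = per_run * 60 * 24 * 30
print(f"${per_run:.5f} per run, ~${per_month:.0f} per customer per month")
```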
  30. AWS nuances • Not every service/feature is available in every

    region or AZ ◦ SQS FIFO :( ◦ Not all AWS regions have 3 AZs ◦ Not all instance types are available in every availability zone • https://loige.link/aws-regional-services @katavic_d - @loige
  31. AWS nuances • Limits everywhere! ◦ Soft vs hard limits

    ◦ Take them into account in your design • https://loige.link/aws-service-limits @katavic_d - @loige
  32. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  33. Process: how to deal with incidents • Page • Engineers

    on call • Incident Retrospective • Actions @katavic_d - @loige
  34. Pages • A page is an alarm for people on call

    (PagerDuty) • Rotate ops & devs (share the pain) • Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.) • When a page is received, it needs to be acknowledged or it is automatically escalated • If customer facing (e.g. service not available), the customer is notified @katavic_d - @loige
  35. Engineers on call 1. Use operational handbook 2. Might escalate

    to other engineers 3. Find mitigation / remediation 4. Update handbook 5. Prepare for retrospective @katavic_d - @loige
  36. Incidents Retrospective "Regardless of what we discover, we understand and

    truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." – Norm Kerth, Project Retrospectives: A Handbook for Team Reviews TL;DR: NOT A BLAMING GAME! @katavic_d - @loige
  37. Incidents Retrospective • Summary • Events timeline • Contributing Factors

    • Remediation / Solution • Actions for the future • Transparency @katavic_d - @loige
  38. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  39. Development best practices • Regular Retrospectives (not just for incidents)

    ◦ What’s good ◦ What’s bad ◦ Actions to improve • Kanban Board ◦ All work visible ◦ One card at a time ◦ Work In Progress limit ◦ “Stop Starting, Start Finishing” @katavic_d - @loige
  40. Development best practices • Clear acceptance criteria ◦ Collectively defined

    (3 amigos) ◦ Make sure you know when a card is done • Split the work into small cards ◦ High throughput ◦ More predictability • Bugs take priority over features! @katavic_d - @loige
  41. Development best practices • Pair programming ◦ Share the knowledge/responsibility

    ◦ Improve team dynamics ◦ Enforced by low WIP limit • Quality over deadlines • Don’t estimate without data @katavic_d - @loige
  42. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  43. Release process • Infrastructure as code ◦ Deterministic deployments

    ◦ Infrastructure versioning using git • No “snowflakes”: one code base for all customers • Feature flags: ◦ Special features ◦ Soft releases • Automated tests before release @katavic_d - @loige
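To make the "feature flags / soft releases" bullet concrete, a tiny hypothetical sketch of per-customer flags kept alongside the versioned infrastructure code (the talk does not say how theirs are stored; the names below are invented):

```python
# Per-customer feature flags, living in the same versioned repository as the
# infrastructure code, so a "soft release" is just a reviewed config change.
FEATURE_FLAGS = {
    "default":   {"new_enrichment_pipeline": False},
    "acme-corp": {"new_enrichment_pipeline": True},   # soft release to one customer
}


def is_enabled(customer: str, flag: str) -> bool:
    flags = FEATURE_FLAGS.get(customer, FEATURE_FLAGS["default"])
    return flags.get(flag, FEATURE_FLAGS["default"].get(flag, False))


if is_enabled("acme-corp", "new_enrichment_pipeline"):
    print("run the new pipeline for this customer")
```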
  44. Conclusion We are still waking up at night sometimes, but

    we are definitely sleeping a lot more and better! Takeaways: • Have healthy and clear processes • Allow your team space to fail • Always review and strive for improvement • Monitor/Instrument as much as you can • Use managed services to reduce the operational overhead (but learn their nuances) @katavic_d - @loige
  45. We are hiring … Talk to us! @katavic_d - @loige

    Thank you! loige.link/tera-bel
  46. Credits Pictures from Unsplash Huge thanks for support and reviews

    to: • All the Vectra team • Yan Cui (@theburningmonk) • Paul Dolan • @gbinside • @augeva • @Podgeypoos79 • @PawrickMannion • @micktwomey • Vedran Jukic