Slide 1

Processing Terabytes of data every day … and sleeping at night @katavic_d - @loige User Group Belfast 06/02/2019 loige.link/tera-bel

Slide 2

Domagoj Katavic Senior Software Engineer @katavic_d github.com/dkatavic

Slide 3

Luciano Mammino Cloud Architect @loige github.com/lmammino loige.co Co-author of Node.js Design Patterns with @mariocasciaro (4.7 out of 5 stars on Amazon.com)

Slide 4

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 5

AI to detect and hunt for cyber attackers Cognito Platform ● Detect ● Recall @katavic_d - @loige

Slide 6

Cognito Detect: on-premise solution ● Analyses network traffic and logs ● Uses AI to deliver real-time attack visibility ● Behaviour-driven, host-centric ● Provides threat context and the most relevant attack details @katavic_d - @loige

Slide 7

@katavic_d - @loige

Slide 8

Cognito Recall: a new Vectra product for Incident Response ● Collects network metadata and stores it in “the cloud” ● Data is processed, enriched and standardised ● Data is made searchable @katavic_d - @loige

Slide 9

Recall requirements ● Data isolation ● Ingestion speed: ~2 GB/min per customer (up to ~3 TB per day per customer: 2 GB × 1,440 min ≈ 2.9 TB) ● Investigation tool: flexible data exploration @katavic_d - @loige

Slide 10

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 11

Our first iteration @katavic_d - @loige

Slide 12

@katavic_d - @loige Control plane Centralised Logging & Metrics

Slide 13

Security ● Separate VPCs ● Strict Security Groups (whitelisting) ● Red, amber, green subnets ● Encryption at rest through AWS services ● Client Certificates + TLS ● Pentest @katavic_d - @loige
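
As an illustration of the whitelisting approach, here is a minimal sketch (not the production setup; the security group IDs are hypothetical) that allows HTTPS into an ElasticSearch security group only from one named source group, leaving everything else blocked by default:

    import boto3

    ec2 = boto3.client("ec2")

    # Whitelist-style rule: HTTPS is allowed only from one known source
    # security group; anything not explicitly allowed stays blocked.
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # hypothetical: ElasticSearch nodes
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "UserIdGroupPairs": [
                {"GroupId": "sg-0fedcba9876543210"}  # hypothetical: ingestion layer
            ],
        }],
    )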

Slide 14

Let’s start the beta! @katavic_d - @loige

Slide 15

Warning: different timezones!* @katavic_d - @loige *yeah, we actually look that cute when we sleep!

Slide 16

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 17

@katavic_d - @loige

Slide 18

@katavic_d - @loige

Slide 19

Lambda timeouts incident ● AWS Lambda timeout: was 5 minutes, now 15 minutes ● We receive files every minute (each containing 1 minute of network traffic) ● During peak hours for the biggest customer, files can be too big to be processed within the timeout limit @katavic_d - @loige

Slide 20

Splitter lambda @katavic_d - @loige

Slide 21

Message-aware splitting
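
A minimal sketch of the idea, assuming newline-delimited metadata records and a hypothetical bucket layout (the real splitter understands Recall's message format, hence "message-aware"): the splitter Lambda streams the large object and re-uploads it as chunks small enough to process within the timeout, splitting only on record boundaries.

    import boto3

    s3 = boto3.client("s3")
    MAX_LINES = 100_000  # tuned so each chunk processes well within the Lambda timeout

    def handler(event, context):
        # Triggered by an S3 "object created" notification for the raw file
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        chunk, part = [], 0
        # Split only on record boundaries, never mid-message ("message-aware")
        for line in body.iter_lines():
            chunk.append(line)
            if len(chunk) >= MAX_LINES:
                upload_chunk(bucket, key, part, chunk)
                chunk, part = [], part + 1
        if chunk:
            upload_chunk(bucket, key, part, chunk)

    def upload_chunk(bucket, key, part, lines):
        # Each chunk lands under a prefix that triggers the ingestion Lambda
        s3.put_object(
            Bucket=bucket,
            Key=f"chunks/{key}.part{part:05d}",
            Body=b"\n".join(lines),
        )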

Slide 22

Lessons learned ● Predictable data input for predictable performance ● Data ingestion parallelization (exploiting serverless capabilities) @katavic_d - @loige

Slide 23

@katavic_d - @loige

Slide 24

Lambda IP starvation incident ● Spinning up many Lambdas consumed all the available IPs in a subnet ● New ElasticSearch machines failed to get an IP ● ElasticSearch could not scale up ● Solution: separate ElasticSearch and Lambda subnets @katavic_d - @loige

Slide 25

Lessons learned ● Every running lambda inside a VPC uses an ENI (Elastic Network Interface) ● Every ENI takes a private IP address ● Edge conditions or bugs might generate spikes in the number of running lambdas and you might run out of IPs in the subnet! ● Consider putting lambdas in their dedicated subnet @katavic_d - @loige
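
One way to see this coming (a sketch, not the team's actual tooling; the subnet IDs are hypothetical) is to publish each subnet's free-IP count as a custom CloudWatch metric on a schedule and alarm when it drops too low:

    import boto3

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")

    LAMBDA_SUBNET_IDS = ["subnet-0a1b2c3d4e5f67890"]  # hypothetical

    def handler(event, context):
        # Runs on a CloudWatch Events schedule (e.g. every 5 minutes)
        subnets = ec2.describe_subnets(SubnetIds=LAMBDA_SUBNET_IDS)["Subnets"]
        for subnet in subnets:
            cloudwatch.put_metric_data(
                Namespace="Custom/Networking",
                MetricData=[{
                    "MetricName": "AvailableIpAddressCount",
                    "Dimensions": [{"Name": "SubnetId", "Value": subnet["SubnetId"]}],
                    "Value": float(subnet["AvailableIpAddressCount"]),
                }],
            )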

Slide 26

@katavic_d - @loige

Slide 27

@katavic_d - @loige Missing data incident

Slide 28

@katavic_d - @loige

Slide 29

@katavic_d - @loige

Slide 30

● New Lambda version triggered insertion failures ● ElasticSearch rejecting inserts and logging errors ● Our log-reporting agents got stuck (we DDoS'd ourselves!) ● Monitoring/alerting failed: why didn't we receive the page? Resolution: ● Fixed the mismatching schema ● Scaled out the centralised logging system @katavic_d - @loige

Slide 31

Alerting on lambda failures Using logs: ● Best case: no logs ● Worst case: no logs (available)! A better approach: ● Attach a DLQ to your lambdas ● Alert on queue size with CloudWatch! ● Visibility on Lambda retries @katavic_d - @loige
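
A sketch of the CloudWatch side of that approach, with a hypothetical queue name and SNS topic: any visible message in the DLQ means an event already exhausted its Lambda retries, so we alarm as soon as the queue is non-empty.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Page as soon as anything lands in the DLQ
    cloudwatch.put_metric_alarm(
        AlarmName="ingestion-dlq-not-empty",  # hypothetical names throughout
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": "ingestion-dlq"}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:eu-west-1:123456789012:oncall-page"],
    )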

Slide 32

@katavic_d - @loige

Slide 33

@katavic_d - @loige

Slide 34

@katavic_d - @loige

Slide 35

Fast retry at peak times ● Lambda retry logic is not configurable (loige.link/lambda-retry) ● Most events will be retried 2 times ● Time between retry attempts is not clearly defined (observed in the order of a few seconds) ● What if all retry attempts happen at peak times? @katavic_d - @loige

Slide 36

Fast retry at peak times @katavic_d - @loige

Slide 37

Fast retry at peak times Processing in this range of time is likely to succeed @katavic_d - @loige

Slide 38

Fast retry at peak times @katavic_d - @loige

Slide 39

Fast retry at peak times Processing in this range of time is likely to fail @katavic_d - @loige

Slide 40

Fast retry at peak times If the retries (1st retry, 2nd retry) are in the same zone, the message will fail and go to the DLQ

Slide 41

Can we extend the retry period in case of failure? @katavic_d - @loige

Slide 42

@katavic_d - @loige Extended retry period We normally trigger our ingestion Lambda when a new file is stored in S3

Slide 43

@katavic_d - @loige Extended retry period If the Lambda fails, the event is automatically retried, up to 2 times

Slide 44

@katavic_d - @loige Extended retry period If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)

Slide 45

@katavic_d - @loige Extended retry period At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)

Slide 46

@katavic_d - @loige Extended retry period If the processing still fails, we can extend the VisibilityTimeout (event delay) x3

Slide 47

@katavic_d - @loige Extended retry period If the processing still fails, we eventually drop the message and alert for manual intervention.

Slide 48

Lessons learned ● Cannot always rely on the default retry logic ● SQS events + DLQ = custom SERVERLESS retry logic (see the sketch below) ● Now we only alert on custom metrics when we are sure the event will fail (logic error) ● https://loige.link/async-lambda-retry @katavic_d - @loige
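
Putting slides 42-47 together, a minimal sketch of that custom retry logic, assuming the DLQ is wired to the same Lambda through an SQS event-source mapping with batch size 1 (the queue URL and both helpers are hypothetical): on failure we push the redelivery further into the future by extending the message's visibility timeout, and after three attempts we give up and page a human.

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ingestion-dlq"  # hypothetical
    MAX_ATTEMPTS = 3
    BASE_DELAY = 900  # seconds; grows with each attempt to drift away from peak times

    def process_file(message):
        # Hypothetical: re-run the ingestion that originally failed
        raise NotImplementedError

    def alert_for_manual_intervention(record):
        # Hypothetical: publish to SNS / trigger a page
        print("Giving up on", record["messageId"])

    def handler(event, context):
        record = event["Records"][0]  # SQS event from the DLQ, batch size 1
        attempts = int(record["attributes"]["ApproximateReceiveCount"])
        try:
            process_file(json.loads(record["body"]))
        except Exception:
            if attempts >= MAX_ATTEMPTS:
                # Returning cleanly lets Lambda delete the message; a human takes over
                alert_for_manual_intervention(record)
                return
            # Extend the visibility timeout, then raise: the failed invocation
            # returns the message to the queue, and SQS redelivers it only
            # after the extended delay expires
            sqs.change_message_visibility(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=record["receiptHandle"],
                VisibilityTimeout=BASE_DELAY * attempts,
            )
            raise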

Slide 49

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 50

AWS nuances ● Serverless is generally cheap, but be careful! ○ You are paying for wait time ○ Bugs may be expensive ○ 100ms charging blocks ● https://loige.link/lambda-pricing ● https://loige.link/serverless-costs-all-wrong @katavic_d - @loige
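
A back-of-the-envelope illustration of both points, using the Lambda pricing model of the time ($0.0000166667 per GB-second, billed in 100 ms blocks; the workload numbers are made up):

    import math

    GB_SECOND_PRICE = 0.0000166667  # per GB-second, pricing at the time of the talk
    memory_gb = 1.5                 # 1536 MB function
    duration_s = 45.0               # mostly spent *waiting* on ElasticSearch inserts

    # Billed duration rounds up to the next 100 ms block
    billed_s = math.ceil(duration_s / 0.1) * 0.1
    cost_per_invocation = billed_s * memory_gb * GB_SECOND_PRICE

    invocations_per_day = 60 * 24   # one file per minute
    retry_multiplier = 3            # a persistent bug triples cost via automatic retries
    monthly = cost_per_invocation * invocations_per_day * retry_multiplier * 30
    print(f"~${monthly:.2f}/month for a single 'cheap' function")  # ~$146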

Slide 51

AWS nuances ● Not every service/feature is available in every region or AZ ○ SQS FIFO :( ○ Not all AWS regions have 3 AZs ○ Not all instance types are available in every availability zone ● https://loige.link/aws-regional-services @katavic_d - @loige

Slide 52

AWS nuances ● Limits everywhere! ○ Soft vs hard limits ○ Take them into account in your design ● https://loige.link/aws-service-limits @katavic_d - @loige

Slide 53

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 54

Process How to deal with incidents ● Page ● Engineers on call ● Incident Retrospective ● Actions @katavic_d - @loige

Slide 55

Pages ● A page is an alarm for the people on call (PagerDuty) ● Rotate ops & devs (share the pain) ● Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.) ● When a page is received, it needs to be acknowledged or it is automatically escalated ● If customer-facing (e.g. service not available), the customer is notified @katavic_d - @loige

Slide 56

Engineers on call 1. Use operational handbook 2. Might escalate to other engineers 3. Find mitigation / remediation 4. Update handbook 5. Prepare for retrospective @katavic_d - @loige

Slide 57

Incidents Retrospective "Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." – Nor t , Pro t R os t e : A Han k o T m e TLDR; NOT A BLAMING GAME! @katavic_d - @loige

Slide 58

Incident Retrospective ● Summary ● Events timeline ● Contributing Factors ● Remediation / Solution ● Actions for the future ● Transparency @katavic_d - @loige

Slide 59

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 60

Development best practices ● Regular retrospectives (not just for incidents) ○ What’s good ○ What’s bad ○ Actions to improve ● Kanban board ○ All work visible ○ One card at a time ○ Work In Progress limit ○ “Stop Starting, Start Finishing” @katavic_d - @loige

Slide 61

Development best practices ● Clear acceptance criteria ○ Collectively defined (3 amigos) ○ Make sure you know when a card is done ● Split the work into small cards ○ High throughput ○ More predictability ● Bugs take priority over features! @katavic_d - @loige

Slide 62

Development best practices ● Pair programming ○ Share the knowledge/responsibility ○ Improve team dynamics ○ Enforced by low WIP limit ● Quality over deadlines ● Don’t estimate without data @katavic_d - @loige

Slide 63

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 64

Release process ● Infrastructure as code ○ Deterministic deployments ○ Infrastructure versioning using git ● No “snowflakes”: one code base for all customers ● Feature flags: ○ Special features ○ Soft releases ● Automated tests before release @katavic_d - @loige
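
A minimal sketch of the feature-flag idea for soft releases (the flag name, customers, and both code paths are hypothetical): one code base for everyone, with per-customer configuration deciding which path runs.

    # Hypothetical per-customer configuration, versioned with the infrastructure code
    CUSTOMER_FLAGS = {
        "customer-a": {"new_enrichment_pipeline": True},  # soft release: one customer first
        "customer-b": {},                                 # everyone else keeps the default
    }

    def is_enabled(customer_id, flag, default=False):
        return CUSTOMER_FLAGS.get(customer_id, {}).get(flag, default)

    def enrich(record, customer_id):
        if is_enabled(customer_id, "new_enrichment_pipeline"):
            return {**record, "enriched_by": "v2"}  # hypothetical new path
        return {**record, "enriched_by": "v1"}      # hypothetical stable path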

Slide 65

Conclusion We are still waking up at night sometimes, but we are definitely sleeping a lot more and better! Takeaways: ● Have healthy and clear processes ● Allow your team space to fail ● Always review and strive for improvement ● Monitor/Instrument as much as you can ● Use managed services to reduce the operational overhead (but learn their nuances) @katavic_d - @loige

Slide 66

We are hiring … Talk to us! @katavic_d - @loige Thank you! loige.link/tera-bel

Slide 67

Credits Pictures from Unsplash Huge thanks for support and reviews to: ● All the Vectra team ● Yan Cui (@theburningmonk) ● Paul Dolan ● @gbinside ● @augeva ● @Podgeypoos79 ● @PawrickMannion ● @micktwomey ● Vedran Jukic