Slide 1

Slide 1 text

Processing Terabytes of data every day … and sleeping at night @katavic_d - @loige Milan, 30/11/2018 loige.link/terabytes

Slide 2

Slide 2 text

Domagoj Katavic Senior Software Engineer @katavic_d github.com/dkatavic

Slide 3

Slide 3 text

Luciano Mammino Solution Architect @loige github.com/lmammino loige.co 4.7 out of 5 stars on Amazon.com With @mariocasciaro

Slide 4

Slide 4 text

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● Monitoring and instrumentation ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 5

Slide 5 text

AI to detect and hunt for cyber attackers Cognito Platform ● Detect ● Recall @katavic_d - @loige

Slide 6

Slide 6 text

Cognito Detect On-premise solution ● Analyzes network traffic and logs ● Uses AI to deliver real-time attack visibility ● Behaviour driven, host centric ● Provides threat context and the most relevant attack details @katavic_d - @loige

Slide 7

Slide 7 text

@katavic_d - @loige

Slide 8

Slide 8 text

Cognito Recall ● Collects network metadata and stores it in “the cloud” ● Data is processed, enriched and standardised ● Data is made searchable @katavic_d - @loige A new Vectra product for Incident Response

Slide 9

Slide 9 text

Recall requirements ● Data isolation ● Ingestion speed: ~2GB/min per customer (up to ~3TB per day per customer) ● Forensic tool: flexible data exploration @katavic_d - @loige
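As a quick sanity check (a worked example, not from the slides), the per-day figure follows directly from the per-minute ingestion rate:

```python
# Sanity check: 2 GB/min of network metadata per customer, sustained all day
gb_per_min = 2
gb_per_day = gb_per_min * 60 * 24   # 2880 GB per day
tb_per_day = gb_per_day / 1000      # ~2.88 TB, i.e. "up to ~3TB per day"
```
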

Slide 10

Slide 10 text

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● Monitoring and instrumentation ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 11

Slide 11 text

Our first iteration @katavic_d - @loige fastify.rocks/v2

Slide 12

Slide 12 text

@katavic_d - @loige Control plane Centralised Logging & Metrics

Slide 13

Slide 13 text

Security ● Separate VPCs ● Strict Security Groups (whitelisting) ● Red, amber, green subnets ● Encryption at rest through AWS services ● Client Certificates + TLS ● Pentest @katavic_d - @loige

Slide 14

Slide 14 text

Let’s start the beta! @katavic_d - @loige

Slide 15

Slide 15 text

Warning: different timezones!* @katavic_d - @loige *yeah, we actually look that cute when we sleep!

Slide 16

Slide 16 text

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● Monitoring and instrumentation ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 17

Slide 17 text

@katavic_d - @loige

Slide 18

Slide 18 text

@katavic_d - @loige

Slide 19

Slide 19 text

Lambda timeouts incident ● AWS Lambda timeout: 5 minutes (now 15) ● We are receiving files every minute (containing 1 minute of network traffic) ● During peak hours for the biggest customer, files can be too big to be processed within 5 minutes @katavic_d - @loige

Slide 20

Slide 20 text

Splitter lambda @katavic_d - @loige

Slide 21

Slide 21 text

Message-aware splitting
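A minimal sketch of what message-aware splitting could look like (an illustration, not the actual splitter lambda): chunks are capped by size, but a record is never cut in half, so each chunk stays independently processable within the Lambda timeout.

```python
def split_batches(records, max_bytes):
    """Group records into chunks of at most max_bytes each,
    without ever splitting a single record across chunks."""
    chunks, current, size = [], [], 0
    for record in records:
        record_size = len(record)
        # Flush the current chunk if adding this record would exceed the cap
        if current and size + record_size > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be handed to its own Lambda invocation, which is how the parallelization in the next slide pays off.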

Slide 22

Slide 22 text

Lessons learned ● Predictable data input for predictable performance ● Data ingestion parallelization (exploiting serverless capabilities) @katavic_d - @loige

Slide 23

Slide 23 text

@katavic_d - @loige

Slide 24

Slide 24 text

Lambda IP starvation incident ● Spinning up many lambdas consumed all the available IPs in a subnet ● Failure to get an IP for the new ES machines ● ElasticSearch cannot scale up ● Solution: separate ElasticSearch and Lambda subnets @katavic_d - @loige

Slide 25

Slide 25 text

Lessons learned ● Every lambda takes an IP from the subnet ● Edge conditions or bugs might generate spikes in the number of running lambdas and you might run out of IPs in the subnet! ● Consider putting lambdas in their dedicated subnet @katavic_d - @loige
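The budget is easy to compute (a sketch using Python's stdlib; the CIDRs are made up for illustration). AWS reserves 5 addresses in every subnet, and each VPC-attached Lambda ENI consumes one of the rest:

```python
import ipaddress

AWS_RESERVED = 5  # network, VPC router, DNS, "future use", broadcast

def usable_ips(cidr):
    """How many addresses a subnet actually has left for ENIs
    (Lambda, ElasticSearch nodes, ...) after AWS's reserved five."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED

# A /24 leaves 251 usable addresses; a burst of a few hundred
# concurrent lambdas in that subnet starves everything else.
```
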

Slide 26

Slide 26 text

@katavic_d - @loige

Slide 27

Slide 27 text

@katavic_d - @loige Missing data incident

Slide 28

Slide 28 text

@katavic_d - @loige

Slide 29

Slide 29 text

@katavic_d - @loige

Slide 30

Slide 30 text

● New lambda version: triggered insertion failures ● ElasticSearch rejecting inserts and logging errors ● Our log reporting agents got stuck (we DDoS’d ourselves!) ● Monitoring/Alerting failed Resolution: ● Fix mismatching schema ● Scaled out centralised logging system Why didn’t we receive the page? @katavic_d - @loige

Slide 31

Slide 31 text

Alerting on lambda failures Using logs: ● Best case: no logs ● Worst case: no logs (available)! A better approach: ● Attach a DLQ to your lambdas ● Alert on queue size with CloudWatch! ● Visibility on Lambda retries @katavic_d - @loige
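One way to express that alert (a sketch, not the authors' setup; the queue name is hypothetical) is a CloudWatch alarm on the DLQ's `ApproximateNumberOfMessagesVisible` metric, which pages as soon as any failed invocation lands in the queue:

```python
def dlq_alarm(queue_name):
    """Parameter set you could pass to CloudWatch's put_metric_alarm
    to page whenever a Lambda DLQ is non-empty."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,           # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # i.e. any message at all
    }
```

Unlike log-based alerting, this fires even when the failing lambda produced no logs at all.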

Slide 32

Slide 32 text

@katavic_d - @loige

Slide 33

Slide 33 text

@katavic_d - @loige

Slide 34

Slide 34 text

@katavic_d - @loige

Slide 35

Slide 35 text

Missing data incident: the return! ● Missing data in the database ● Stack is working fine ● Not receiving data in the collector node ● Root cause: a new customer firewall rule was blocking our traffic at source! ● As soon as the rule was fixed, data was flowing in again @katavic_d - @loige

Slide 36

Slide 36 text

● When is this an actual problem? ● How to alert effectively? @katavic_d - @loige How to deal with lack of data...

Slide 37

Slide 37 text

loige.link/def-canary @katavic_d - @loige The canary

Slide 38

Slide 38 text

@katavic_d - @loige

Slide 39

Slide 39 text

Lessons learned ● Ping & health checks to make sure everything is working and data can flow ● Tracing to track performance degradations and pipeline issues. ● Instrumentation can be very valuable: do it! @katavic_d - @loige

Slide 40

Slide 40 text

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● Monitoring and instrumentation ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 41

Slide 41 text

Instrumentation & metrics StatsD ● Daemon for stats aggregation ● Very lightweight (UDP based) ● Simple to integrate ● Visualize data through Grafana @katavic_d - @loige

Slide 42

Slide 42 text

Counter Used to count events E.g. number of files received over time

Slide 43

Slide 43 text

Timing Used to report time measurements E.g. how long did it take to process and insert a batch

Slide 44

Slide 44 text

Gauges Record arbitrary absolute values E.g. CPU Load, memory usage, money in the bank, etc.
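The three metric types above all map onto the same tiny StatsD wire format, which is why the integration is so simple. A stdlib-only sketch (metric names are invented for illustration; real code would use a StatsD client library):

```python
import socket

def send_metric(name, value, metric_type, host="localhost", port=8125):
    """Emit one metric in the StatsD wire format over UDP.
    metric_type: 'c' = counter, 'ms' = timing, 'g' = gauge."""
    payload = f"{name}:{value}|{metric_type}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # fire-and-forget: UDP never blocks the app
    sock.close()
    return payload

# send_metric("files.received", 1, "c")        # counter
# send_metric("batch.insert_time", 532, "ms")  # timing
# send_metric("cpu.load", 0.72, "g")           # gauge
```

Because it is UDP, a dead or overloaded StatsD daemon cannot slow down or crash the instrumented service.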

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● Monitoring and instrumentation ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 48

Slide 48 text

AWS nuances ● Serverless is cheap, but be aware of timeouts! ● Not every service/feature is available everywhere ○ SQS FIFO :( ○ Not all AWS regions have 3 AZs ○ Not all instance types are available in every availability zone ● Limits everywhere! ○ Soft vs hard limits ○ Take them into account in your design @katavic_d - @loige

Slide 49

Slide 49 text

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● Monitoring and instrumentation ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 50

Slide 50 text

Process: how to deal with incidents ● Page ● Engineers on call ● Incident Retrospective ● Actions @katavic_d - @loige

Slide 51

Slide 51 text

Pages ● A page is an alarm for people on call (PagerDuty) ● Rotate ops & devs (share the pain) ● Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.) ● When a page is received, it needs to be acknowledged or it is automatically escalated ● If customer facing (e.g. service not available), the customer is notified @katavic_d - @loige

Slide 52

Slide 52 text

Engineers on call 1. Use operational handbook 2. Might escalate to other engineers 3. Find mitigation / remediation 4. Update handbook 5. Prepare for retrospective @katavic_d - @loige

Slide 53

Slide 53 text

Incident Retrospective "Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." – Norm Kerth, Project Retrospectives: A Handbook for Team Reviews TL;DR: NOT A BLAMING GAME! @katavic_d - @loige

Slide 54

Slide 54 text

Incident Retrospective ● Summary ● Events timeline ● Contributing factors ● Remediation / solution ● Actions for the future ● Transparency @katavic_d - @loige

Slide 55

Slide 55 text

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● Monitoring and instrumentation ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 56

Slide 56 text

Development best practices ● Regular Retrospectives (not just for incidents) ○ What’s good ○ What’s bad ○ Actions to improve ● Kanban Board ○ All work visible ○ One card at a time ○ Work In Progress limit ○ “Stop Starting, Start Finishing” @katavic_d - @loige

Slide 57

Slide 57 text

Development best practices ● Clear acceptance criteria ○ Collectively defined (3 amigos) ○ Make sure you know when a card is done ● Split the work in small cards ○ High throughput ○ More predictability ● Bugs take priority over features! @katavic_d - @loige

Slide 58

Slide 58 text

Development best practices ● Pair programming ○ Share the knowledge/responsibility ○ Improve team dynamics ○ Enforced by low WIP limit ● Quality over deadlines ● Don’t estimate without data @katavic_d - @loige

Slide 59

Slide 59 text

Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● Monitoring and instrumentation ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige

Slide 60

Slide 60 text

Release process ● Infrastructure as code (Terraform + Ansible) ○ Deterministic deployments ○ Infrastructure versioning using git ● No “snowflakes”, one code base for all customers ● Feature flags: ○ Special features ○ Soft releases ● Automated tests before release @katavic_d - @loige
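With one code base for all customers, a soft release boils down to a per-customer flag lookup. A minimal sketch (the flag and customer names are hypothetical, and a real system would load flags from config rather than hard-code them):

```python
# Hypothetical feature-flag table: flag name -> customers it is enabled for.
FLAGS = {
    "new_enrichment_pipeline": {"customer_a"},  # soft release to one customer first
}

def is_enabled(flag, customer):
    """True if the given flag is switched on for this customer."""
    return customer in FLAGS.get(flag, set())
```

Everyone runs the same build; only the flag table differs, so there are no per-customer "snowflake" deployments to maintain.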

Slide 61

Slide 61 text

Conclusion @katavic_d - @loige We are still waking up at night sometimes, but we are definitely sleeping a lot more and better! Takeaways: ● Have healthy and clear processes ● Always review and strive for improvement ● Monitor/Instrument as much as you can (even the monitoring itself) ● Use managed services to reduce the operational overhead (but learn their nuances)

Slide 62

Slide 62 text

We are hiring … Talk to us! @katavic_d - @loige GRAZIE! loige.link/terabytes

Slide 63

Slide 63 text

Credits Pictures from Unsplash Huge thanks to: ● All the Vectra team ● Paul Dolan ● @gbinside ● @augeva ● @Podgeypoos79 ● @PawrickMannion ● @micktwomey ● Vedran Jukic for support and reviews!