
TAP(ping) Out Security Threats at FireEye

Elastic Co
February 17, 2016

FireEye’s Threat Analytics Platform (TAP) leverages Elasticsearch to index hundreds of thousands of events per second and maintain more than a petabyte of data. Learn what the security hunting use case is and how FireEye built a platform to allow its enterprise customers to find evil in their organizations.



Transcript

  3. TAP Overview

     Industry's best threat intelligence, applied to event data:
     • 7M+ indicators
     • 1,000+ proprietary rules
     • Tight integration with FireEye Intelligence Center for context
     • Analytics provides heuristic detection
     • Integrates with other FireEye products

     Fast response:
     • Sub-second search across billions of events
     • Pivoting and grouping to facilitate hunting
     • Integrated case management
  4. What is hunting?

     The collective name for any interactive or semi-automated technique
     used to detect security incidents.
  5. An analyst asks TAP for Citrix connections originating from Russia,
     China, and Ireland, grouped by duration, received bytes, and
     destination port.
  6. Our Elasticsearch Use Case
     • Big aggregations
     • Expensive regex
     • Heavy indexing
     • Petabytes of data
  7. [Chart: "Elasticsearch Kids" shows wakeups per week (0 to 8) by
     Elasticsearch version, with the number of kids in parentheses:
     0.9 (3), 1.1 (3), 1.2 (3), 1.3 (3), 1.5 (3), 1.7 (4)]
  8. How We Think About Clusters (for an indexing-driven workload)
     • A shard is simply a unit of performance with an observable
       characteristic over time
     • An Elasticsearch instance is just a shard container with upper
       sizing limits (primarily the JVM heap)
     • All operations compete for resources within a single Elasticsearch
       instance
     • Determine shard counts based on workload requirements
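The "determine shard counts from workload requirements" point reduces to capacity arithmetic. A minimal sketch for the indexing-driven case; the per-shard throughput figure is an illustrative assumption, not a FireEye number (measure your own shards):

```python
import math

def shards_for_indexing(events_per_sec, per_shard_eps, replicas=1):
    """Primary and total shard counts needed to absorb an indexing rate.

    per_shard_eps is the observed sustainable indexing rate of one shard,
    treating the shard as "a unit of performance with an observable
    characteristic over time".
    """
    primaries = math.ceil(events_per_sec / per_shard_eps)
    return primaries, primaries * (1 + replicas)

# 300K EPS (the production figure below) at an assumed 5K EPS per shard:
print(shards_for_indexing(300_000, per_shard_eps=5_000))  # (60, 120)
```

The instance-level limits (JVM heap, contention between operations) then cap how many of those shards each Elasticsearch instance can safely host.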
  9. Production Footprint
     • ~40 production clusters
     • 3.6 PB raw storage
     • 700B events indexed
     • 300K EPS (events per second) indexed to production, peak 20B/day
     • 400+ nodes
  11. Show me credit card data!

      {
        "query": {
          "filtered": {
            "filter": {
              "and": [
                {"range": {"meta_ts": {
                  "gte": "2015-10-25T13:00:00.000Z",
                  "lte": "2015-10-26T13:37:07.554Z"}}},
                {"query": {"common": {"metaclass": {
                  "query": "http_proxy",
                  "low_freq_operator": "and",
                  "high_freq_operator": "and",
                  "cutoff_frequency": 0.001,
                  "analyzer": "standard"}}}},
                {"script": {
                  "script": "regexp",
                  "lang": "native",
                  "params": {
                    "regexp": ".*encoding\\\\=.*\\\\&t\\\\=.*\\\\&cc\\\\=.*\\\\&process\\\\=.*\\\\&track\\\\=/",
                    "field": "uri",
                    "limit": -1}}}
              ]
            }
          }
        },
        "size": 10,
        "from": 0,
        "timeout": 120000
      }
  12. Eggs Fried Per Query

      1. Thermal mass for a single egg is 274 J/°C. Integrating temperature
         from 4°C to 80°C gives a total heat of 274 J/°C × (80 − 4)°C =
         20,824 J per egg.
      2. The D2 series uses Haswell Intel Xeon E5-2673v3 processors with a
         thermal design power of 120 W. We used 8 of the 12 cores:
         0.75 × 120 W = 90 W per processor, × 135 processors = 12,150 W.
      3. Total query execution time: 83 min × 60 s = 4,980 s.
      4. Total energy = 12,150 W × 4,980 s ≈ 60.5 MJ.
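The slide's arithmetic, reproduced as a script. All constants are the slide's own figures, including its 0.75 utilization factor; the final egg count is just the implied quotient:

```python
# "Eggs fried per query" back-of-the-envelope, using the slide's figures.
EGG_THERMAL_MASS = 274                    # J/°C for a single egg
egg_heat = EGG_THERMAL_MASS * (80 - 4)    # J to take one egg from 4°C to 80°C

cluster_power = 0.75 * 120 * 135          # utilization x TDP (W) x processors
query_seconds = 83 * 60                   # the query ran for 83 minutes
total_energy = cluster_power * query_seconds  # joules dissipated

print(egg_heat)                           # 20824
print(int(cluster_power))                 # 12150
print(round(total_energy / 1e6, 1))       # 60.5 (MJ)
print(int(total_energy // egg_heat))      # 2905 eggs per query
```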
  14. How do we fix this? Several approaches:
      • In-flight query limitations: don't add insult to injury
      • Limit the documents the regex executes on: helps, but a bad query
        on a single document can still be painful
      • Judicious use of PrefixQuery: leverage existing Lucene
        functionality
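The PrefixQuery point can be sketched in the Elasticsearch query DSL. The field name `uri` comes from the earlier credit-card example; the values are invented for illustration:

```python
import json

# Expensive: an unanchored regexp has to consider essentially every
# candidate term/document before it can reject it.
regex_query = {"regexp": {"uri": ".*cc=.*&track=.*"}}

# Cheap: an anchored prefix walks the term dictionary directly, skipping
# everything that doesn't start with the prefix (Lucene PrefixQuery).
prefix_query = {"prefix": {"uri": "http://tracker.example/collect?"}}

print(json.dumps(prefix_query, sort_keys=True))
```

When part of the pattern is a known literal head, rewriting the query this way avoids the linear scan entirely.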
  15. How do we fix this? Deferral of cluster-related tasks (Sleepwalk):
      • Custom tooling to schedule expensive cluster tasks
        ‒ Relocations/rebalances
        ‒ Forced optimizations
        ‒ Recovery/transfer rate
  16. Final word on regex: the not-so-dumb linear scan
      • Stuff we know exists but aren't using… yet
        ‒ Lucene has native functionality for regex
          (max_determinized_states)
        ‒ N-grams (tri & hex)
      • What are we doing?
        ‒ Native scripts: an NFA (nondeterministic finite automaton)
          approach
        ‒ The backtracking algorithm allows for the possibility of
          exponential query time
      • Why?
        ‒ An additional filter does nothing to limit the documents the
          regex is run on
        ‒ An intuitive way of alerting customers that their query has too
          many states
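The Lucene functionality mentioned is exposed in the Elasticsearch regexp query as a per-query bound: patterns whose determinized automaton would exceed the state limit are rejected up front instead of scanned. A sketch (the pattern and limit are illustrative):

```python
# Elasticsearch regexp query with the Lucene determinization bound the
# slide refers to; field name "uri" reuses the earlier example.
regexp_query = {
    "regexp": {
        "uri": {
            "value": ".*encoding=.*&cc=.*",
            # fail the query if compiling the pattern to a DFA would
            # need more than this many states
            "max_determinized_states": 10000,
        }
    }
}
```

This gives the same "your query has too many states" feedback the native-script NFA approach aims for, but enforced at query-parse time.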
  17. r3.2xl: 61 GB of memory, 8 vCPUs, 4 TB of EBS GP-SSD
      d2.2xl: 61 GB of memory, 8 vCPUs, 12 TB of DAS
  18. "Steel is too heavy, it can't float; you shouldn't build ships from
      it." Said no one ever.
  19. Reliability: keeping clusters alive with one weird trick
      • Understand, observe, and measure performance.
      • Decide what's important: do you actually need sub-ms seeks, or
        added storage capacity?
      • Be a good 'engineer' and quantify your requirements.
      • Establish what good health looks like:
        ‒ Indexing latency
        ‒ Thread pool starvation
        ‒ GC times (duh)
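The health signals listed are all available from the nodes stats API (`GET _nodes/stats/jvm,thread_pool`). A sketch that flags thread-pool starvation from an abbreviated, made-up stats payload:

```python
# Truncated shape of a _nodes/stats response; all values are invented.
stats = {
    "nodes": {
        "node-1": {
            "jvm": {"gc": {"collectors": {"old": {
                "collection_time_in_millis": 95000}}}},
            "thread_pool": {"search": {"rejected": 12},
                            "bulk": {"rejected": 0}},
        },
    }
}

alerts = []
for name, node in stats["nodes"].items():
    for pool, data in node["thread_pool"].items():
        # rejected tasks mean the pool and its queue were saturated
        if data["rejected"] > 0:
            alerts.append(f"{name}: {pool} pool rejected {data['rejected']} tasks")

print(alerts)
```

Polling these counters over time (together with indexing latency from your own pipeline) is enough to establish a baseline for "good health".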
  20. RAID10. Yes, seriously, in a distributed/replicated database.
      • Even with RAID10, storage I/O is still faster than network I/O
      • Second to memory starvation, drive failures were the most
        devastating cause of downtime
      • Reduces leaning on ES's replication/recovery mechanism when you
        have huge minimum transfer times
      • Large 2 AM emergency recoveries plummeted; query experience
        remains good
  21. The Migration Plan: Shooting for No Downtime
      1. Add new d2 nodes to the cluster
      2. Route indexing to the d2s
      3. Set exclusion on the r3s
      4. Profit
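Steps 2 and 3 map onto Elasticsearch's standard allocation filtering. A sketch assuming the nodes are tagged with a custom `instance_type` node attribute; the attribute name and index pattern are assumptions, not FireEye's actual configuration:

```python
# Step 2: route newly created indices to the d2 nodes via an index
# template (PUT _template/route-to-d2 with this body).
index_template = {
    "template": "events-*",
    "settings": {"index.routing.allocation.include.instance_type": "d2"},
}

# Step 3: exclude the r3 nodes so their existing shards drain onto the
# d2s while replicas stay available -- hence zero downtime.
cluster_settings = {
    "transient": {"cluster.routing.allocation.exclude.instance_type": "r3"},
}
```

Once the r3s hold no shards, they can be removed from the cluster (step 4: profit).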
  22. We migrated over 400 nodes, saving the company tens of thousands of
      dollars per month, with ZERO downtime!
  26. What's next for our Elasticsearch architecture? Elasticsearch 2.x:
      • Deflate compression + better memory management = denser nodes
      • Leverage more native Elasticsearch functionality