
TAP(ping) Out Security Threats at FireEye

Elastic Co
February 17, 2016

FireEye’s Threat Analytics Platform (TAP) leverages Elasticsearch to index hundreds of thousands of events per second and maintain more than a petabyte of data. Learn what the security hunting use case is and how FireEye built a platform to allow its enterprise customers to find evil in their organizations.



Transcript

  3. TAP Overview

     Industry's best threat intelligence, applied to event data:
     • 7M+ indicators
     • 1,000+ proprietary rules
     • Tight integration with FireEye Intelligence Center for context
     • Analytics provides heuristic detection
     • Integrates with other FireEye products

     Fast response:
     • Sub-second search across billions of events
     • Pivoting and grouping to facilitate hunting
     • Integrated case management
  4. What is hunting?

     The collective name for any interactive or semi-automated technique
     used to detect security incidents.
  5. An analyst asks TAP for Citrix connections originating from Russia,
     China, and Ireland, grouped by duration, received bytes, and
     destination port.
  6. Our Elasticsearch Use Case
     • Big aggregations
     • Expensive regex
     • Heavy indexing
     • Petabytes of data
  7. [Chart: "Elasticsearch Kids" shows wakeups per week (0 to 8) by
     Elasticsearch version, with the number of kids in parentheses:
     0.9 (3), 1.1 (3), 1.2 (3), 1.3 (3), 1.5 (3), 1.7 (4)]
  8. How We Think About Clusters (for an indexing-driven workload)
     • A shard is simply a unit of performance with an observable
       characteristic over time
     • An Elasticsearch instance is just a shard container with upper
       sizing limits (primarily the JVM heap)
     • All operations compete for resources within a single Elasticsearch
       instance
     • Determine shard counts based on workload requirements
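The "determine shard counts from workload requirements" point reduces to capacity arithmetic. A minimal sketch for the indexing-driven case; the per-shard throughput figure is an illustrative assumption, not a FireEye number (measure your own shards):

```python
import math

def shards_for_indexing(events_per_sec, per_shard_eps, replicas=1):
    """Primary and total shard counts needed to absorb an indexing rate.

    per_shard_eps is the observed sustainable indexing rate of one shard,
    treating the shard as "a unit of performance with an observable
    characteristic over time".
    """
    primaries = math.ceil(events_per_sec / per_shard_eps)
    return primaries, primaries * (1 + replicas)

# 300K EPS (the production figure below) at an assumed 5K EPS per shard:
print(shards_for_indexing(300_000, per_shard_eps=5_000))  # (60, 120)
```

The instance-level limits (JVM heap, contention between operations) then cap how many of those shards each Elasticsearch instance can safely host.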
  9. Production Footprint
     • ~40 production clusters
     • 3.6 PB raw storage
     • 700B events indexed
     • 300K EPS (events per second) indexed to production, peak 20B/day
     • 400+ nodes
  11. Show me credit card data!

      {
        "query": {
          "filtered": {
            "filter": {
              "and": [
                {"range": {"meta_ts": {
                  "gte": "2015-10-25T13:00:00.000Z",
                  "lte": "2015-10-26T13:37:07.554Z"}}},
                {"query": {"common": {"metaclass": {
                  "query": "http_proxy",
                  "low_freq_operator": "and",
                  "high_freq_operator": "and",
                  "cutoff_frequency": 0.001,
                  "analyzer": "standard"}}}},
                {"script": {
                  "script": "regexp",
                  "lang": "native",
                  "params": {
                    "regexp": ".*encoding\\\\=.*\\\\&t\\\\=.*\\\\&cc\\\\=.*\\\\&process\\\\=.*\\\\&track\\\\=/",
                    "field": "uri",
                    "limit": -1}}}
              ]
            }
          }
        },
        "size": 10,
        "from": 0,
        "timeout": 120000
      }
  12. Eggs Fried Per Query

      1. Thermal mass for a single egg is 274 J/°C. Integrating temperature
         from 4°C to 80°C gives a total heat of 274 J/°C × (80 − 4)°C =
         20,824 J per egg.
      2. The D2 series uses Haswell Intel Xeon E5-2673v3 processors with a
         thermal design power of 120 W. We used 8 of the 12 cores:
         0.75 × 120 W = 90 W per processor, × 135 processors = 12,150 W.
      3. Total query execution time: 83 min × 60 s = 4,980 s.
      4. Total energy = 12,150 W × 4,980 s ≈ 60.5 MJ.
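The slide's arithmetic, reproduced as a script. All constants are the slide's own figures, including its 0.75 utilization factor; the final egg count is just the implied quotient:

```python
# "Eggs fried per query" back-of-the-envelope, using the slide's figures.
EGG_THERMAL_MASS = 274                    # J/°C for a single egg
egg_heat = EGG_THERMAL_MASS * (80 - 4)    # J to take one egg from 4°C to 80°C

cluster_power = 0.75 * 120 * 135          # utilization x TDP (W) x processors
query_seconds = 83 * 60                   # the query ran for 83 minutes
total_energy = cluster_power * query_seconds  # joules dissipated

print(egg_heat)                           # 20824
print(int(cluster_power))                 # 12150
print(round(total_energy / 1e6, 1))       # 60.5 (MJ)
print(int(total_energy // egg_heat))      # 2905 eggs per query
```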
  14. How do we fix this? Several approaches:
      • In-flight query limitations: don't add insult to injury
      • Limit the documents the regex executes on: helps, but a bad query
        on a single document can still be painful
      • Judicious use of PrefixQuery: leverage existing Lucene
        functionality
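The PrefixQuery point can be sketched in the Elasticsearch query DSL. The field name `uri` comes from the earlier credit-card example; the values are invented for illustration:

```python
import json

# Expensive: an unanchored regexp has to consider essentially every
# candidate term/document before it can reject it.
regex_query = {"regexp": {"uri": ".*cc=.*&track=.*"}}

# Cheap: an anchored prefix walks the term dictionary directly, skipping
# everything that doesn't start with the prefix (Lucene PrefixQuery).
prefix_query = {"prefix": {"uri": "http://tracker.example/collect?"}}

print(json.dumps(prefix_query, sort_keys=True))
```

When part of the pattern is a known literal head, rewriting the query this way avoids the linear scan entirely.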
  15. How do we fix this? Deferral of cluster-related tasks (Sleepwalk):
      • Custom tooling to schedule expensive cluster tasks
        ‒ Relocations/rebalances
        ‒ Forced optimizations
        ‒ Recovery/transfer rate
  16. Final word on regex: the not-so-dumb linear scan
      • Stuff we know exists but aren't using… yet
        ‒ Lucene has native functionality for regex
          (max_determinized_states)
        ‒ N-grams (tri & hex)
      • What are we doing?
        ‒ Native scripts: an NFA (nondeterministic finite automaton)
          approach
        ‒ The backtracking algorithm allows for the possibility of
          exponential query time
      • Why?
        ‒ An additional filter does nothing to limit the documents the
          regex is run on
        ‒ An intuitive way of alerting customers that their query has too
          many states
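The Lucene functionality mentioned is exposed in the Elasticsearch regexp query as a per-query bound: patterns whose determinized automaton would exceed the state limit are rejected up front instead of scanned. A sketch (the pattern and limit are illustrative):

```python
# Elasticsearch regexp query with the Lucene determinization bound the
# slide refers to; field name "uri" reuses the earlier example.
regexp_query = {
    "regexp": {
        "uri": {
            "value": ".*encoding=.*&cc=.*",
            # fail the query if compiling the pattern to a DFA would
            # need more than this many states
            "max_determinized_states": 10000,
        }
    }
}
```

This gives the same "your query has too many states" feedback the native-script NFA approach aims for, but enforced at query-parse time.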
  17. r3.2xl: 61 GB of memory, 8 vCPUs, 4 TB of EBS GP-SSD
      d2.2xl: 61 GB of memory, 8 vCPUs, 12 TB of DAS
  18. "Steel is too heavy, it can't float; you shouldn't build ships from
      it." Said no one ever.
  19. Reliability: keeping clusters alive with one weird trick
      • Understand, observe, and measure performance.
      • Decide what's important: do you actually need sub-ms seeks, or
        added storage capacity?
      • Be a good 'engineer' and quantify your requirements.
      • Establish what good health looks like:
        ‒ Indexing latency
        ‒ Thread pool starvation
        ‒ GC times (duh)
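The health signals listed are all available from the nodes stats API (`GET _nodes/stats/jvm,thread_pool`). A sketch that flags thread-pool starvation from an abbreviated, made-up stats payload:

```python
# Truncated shape of a _nodes/stats response; all values are invented.
stats = {
    "nodes": {
        "node-1": {
            "jvm": {"gc": {"collectors": {"old": {
                "collection_time_in_millis": 95000}}}},
            "thread_pool": {"search": {"rejected": 12},
                            "bulk": {"rejected": 0}},
        },
    }
}

alerts = []
for name, node in stats["nodes"].items():
    for pool, data in node["thread_pool"].items():
        # rejected tasks mean the pool and its queue were saturated
        if data["rejected"] > 0:
            alerts.append(f"{name}: {pool} pool rejected {data['rejected']} tasks")

print(alerts)
```

Polling these counters over time (together with indexing latency from your own pipeline) is enough to establish a baseline for "good health".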
  20. RAID10. Yes, seriously, in a distributed/replicated database.
      • Even with RAID10, storage I/O is still faster than network I/O
      • Second to memory starvation, drive failures were the most
        devastating cause of downtime
      • Reduces leaning on ES's replication/recovery mechanism when you
        have huge minimum transfer times
      • Large 2 AM emergency recoveries plummeted; query experience
        remains good
  21. The Migration Plan: Shooting for No Downtime
      1. Add new d2 nodes to the cluster
      2. Route indexing to the d2s
      3. Set exclusion on the r3s
      4. Profit
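Steps 2 and 3 map onto Elasticsearch's standard allocation filtering. A sketch assuming the nodes are tagged with a custom `instance_type` node attribute; the attribute name and index pattern are assumptions, not FireEye's actual configuration:

```python
# Step 2: route newly created indices to the d2 nodes via an index
# template (PUT _template/route-to-d2 with this body).
index_template = {
    "template": "events-*",
    "settings": {"index.routing.allocation.include.instance_type": "d2"},
}

# Step 3: exclude the r3 nodes so their existing shards drain onto the
# d2s while replicas stay available -- hence zero downtime.
cluster_settings = {
    "transient": {"cluster.routing.allocation.exclude.instance_type": "r3"},
}
```

Once the r3s hold no shards, they can be removed from the cluster (step 4: profit).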
  22. We migrated over 400 nodes, saving the company tens of thousands of
      dollars per month, with ZERO downtime!
  26. What's next for our Elasticsearch architecture? Elasticsearch 2.x:
      • Deflate compression + better memory management = denser nodes
      • Leverage more native Elasticsearch functionality