Slide 1

Slide 1 text

OpenSOC

Slide 2

Slide 2 text

2 Over Next Few Minutes
- Problem Statement & Business Case for OpenSOC
- Solution Architecture and Design
- Best Practices and Lessons Learned
- Q & A

Slide 3

Slide 3 text

3 Business Case

Slide 4

Slide 4 text

4 “There's now a growing sense of fatalism: It's no longer if or when you get hacked, but the assumption is that you've already been hacked, with a focus on minimizing the damage.” Source: Dark Reading / Security’s New Reality: Assume The Worst

Slide 5

Slide 5 text

5 Breaches Happen in Hours…

Slide 6

Slide 6 text

6 Cisco Global Cloud Index Source: 2014 Cisco Global Cloud Index

Slide 7

Slide 7 text

7 Introducing OpenSOC

Slide 8

Slide 8 text

8 OpenSOC Journey
- Sept 2013: First Prototype
- Dec 2013: Hortonworks joins the project
- March 2014: Platform development finished
- April 2014: First beta test at customer site
- May 2014: CR Work off
- Sept 2014: General Availability

Slide 9

Slide 9 text

9 Solution Architecture & Design

Slide 10

Slide 10 text

10 OpenSOC Conceptual Architecture
- Telemetry sources: Raw Network Stream, Network Metadata Stream, Netflow, Syslog, Raw Application Logs, Other Streaming Telemetry
- Processing: Parse + Format, Enrich, Alert
- Processing inputs: Threat Intelligence Feeds, Enrichment Data
- Storage: Hive (Long-Term Store), HBase (Raw Packet Store), Elastic Search (Real-Time Index)
- Applications + Analyst Tools: Network Packet Mining and PCAP Reconstruction; Log Mining and Analytics; Big Data Exploration, Predictive Modeling

Slide 11

Slide 11 text

11 Key Functional Capabilities
- Raw Network Packet Capture, Store, Traffic Reconstruction
- Telemetry Ingest, Enrichment and Real-Time Rules-Based Alerts
- Real-Time Telemetry Search and Cross-Telemetry Matching
- Automated Reports, Anomaly Detection and Anomaly Alerts
- Rich Analytics Apps and Integration with Existing Analytics Tools

Slide 12

Slide 12 text

12 The OpenSOC Advantage
- Fully Backed by Cisco and Used Internally for Multiple Customers
- Free, Open Source and Apache Licensed
- Built on Highly Scalable and Proven Platforms (Hadoop, Kafka, Storm)
- Extensible and Pluggable Design
- Flexible Deployment Model (On-Premise or Cloud)
- Centralize your processes, people and data

Slide 13

Slide 13 text

13 OpenSOC Deployment at Cisco
Hardware footprint (40u)
- 14 Data Nodes (UCS C240 M3)
- 3 Cluster Control Nodes (UCS C220 M3)
- 2 ESX Hypervisor Hosts (UCS C220 M3)
- 1 PCAP Processor (UCS C220 M3 + Napatech NIC)
- 2 SourceFire Threat alert processors
- 1 Anue Network Traffic splitter
- 1 Router
- 1 48 Port 10GE Switch
Software Stack
- HDP 2.1
- Kafka 0.8
- Elastic Search 1.1
- MySQL 5.5

Slide 14

Slide 14 text

14 OpenSOC - Stitching Things Together
- Source Systems: Passive Tap, Traffic Replicator; Telemetry Sources (Syslog, HTTP, File System, Other)
- Data Collection: PCAP, DPI; Flume (Agent A, Agent B, Agent N)
- Messaging System: Kafka (PCAP Topic, DPI Topic, A Topic, B Topic, N Topic)
- Real Time Processing: Storm (PCAP Topology, DPI Topology, A Topology, B Topology, N Topology)
- Storage: Hive (Raw Data, ORC), Elastic Search (Index), HBase (PCAP Table)
- Access: Analytic Tools (R / Python, Power Pivot, Tableau); Web Services (Search, PCAP Reconstruction)

Slide 15

Slide 15 text

15 OpenSOC - Stitching Things Together (repeats the Slide 14 diagram, adding a "Deeper Look" callout next to the PCAP Topology and DPI Topology)

Slide 16

Slide 16 text

16 PCAP Topology
- Real Time Processing (Storm): Kafka Spout, Parser Bolt, HDFS Bolt, HBase Bolt, ES Bolt
- Storage: Hive (Raw Data, ORC), HBase (PCAP Table), Elastic Search (Index)

Slide 17

Slide 17 text

17 DPI Topology & Telemetry Enrichment
- Real Time Processing (Storm): Kafka Spout, Parser Bolt, GEO Enrich, Whois Enrich, CIF Enrich, HDFS Bolt, ES Bolt
- Storage: Hive (Raw Data, ORC), HBase (PCAP Table), Elastic Search (Index)

Slide 18

Slide 18 text

18 Enrichments
Pipeline: Parser Bolt, GEO Enrich, Who Is Enrich, CIF Enrich

RAW Message:

    {
      "msg_key1": "msg value1",
      "src_ip": "10.20.30.40",
      "dest_ip": "20.30.40.50",
      "domain": "mydomain.com"
    }

Added by GEO Enrich (Cache, backed by MySQL Geo Lite Data):

    "geo": [
      {
        "region": "CA",
        "postalCode": "95134",
        "areaCode": "408",
        "metroCode": "807",
        "longitude": -121.946,
        "latitude": 37.425,
        "locId": 4522,
        "city": "San Jose",
        "country": "US"
      }
    ]

Added by Who Is Enrich (Cache, backed by HBase Who Is Data):

    "whois": [
      {
        "OrgId": "CISCOS",
        "Parent": "NET-144-0-0-0-0",
        "OrgAbuseName": "Cisco Systems Inc",
        "RegDate": "1991-01-171991-01-17",
        "OrgName": "Cisco Systems",
        "Address": "170 West Tasman Drive",
        "NetType": "Direct Assignment"
      }
    ],

Added by CIF Enrich (Cache, backed by HBase CIF Data):

    "cif": "Yes"

Result: Enriched Message
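The enrichment chain on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not OpenSOC code: the in-memory dictionaries stand in for the MySQL and HBase backed caches, the function names are hypothetical, and the slide does not say which message field each enrichment keys on (the sketch assumes `src_ip` for geo and `domain` for whois and CIF). The `"No"` value for a CIF miss is also an assumption.

```python
# Hypothetical in-memory stand-ins for the caches shown on the slide.
GEO_CACHE = {"10.20.30.40": {"city": "San Jose", "region": "CA", "country": "US"}}
WHOIS_CACHE = {"mydomain.com": {"OrgName": "Cisco Systems", "NetType": "Direct Assignment"}}
CIF_CACHE = {"mydomain.com"}  # set of indicators flagged by threat intelligence


def geo_enrich(msg):
    """Attach a "geo" list if the source IP is known (assumed key: src_ip)."""
    hit = GEO_CACHE.get(msg.get("src_ip"))
    return {**msg, "geo": [hit]} if hit else msg


def whois_enrich(msg):
    """Attach a "whois" list if the domain is known (assumed key: domain)."""
    hit = WHOIS_CACHE.get(msg.get("domain"))
    return {**msg, "whois": [hit]} if hit else msg


def cif_enrich(msg):
    """Flag the message with "cif": "Yes" or "No" (the "No" value is assumed)."""
    return {**msg, "cif": "Yes" if msg.get("domain") in CIF_CACHE else "No"}


def enrich(msg):
    """Run the enrichments in slide order; each step returns a new dict."""
    for step in (geo_enrich, whois_enrich, cif_enrich):
        msg = step(msg)
    return msg


raw = {
    "msg_key1": "msg value1",
    "src_ip": "10.20.30.40",
    "dest_ip": "20.30.40.50",
    "domain": "mydomain.com",
}
enriched = enrich(raw)
print(sorted(enriched))
```

Each step copies the message instead of mutating it, so a failed lookup leaves the original fields untouched and the raw message can still be stored as received.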

Slide 19

Slide 19 text

19 Applications: Telemetry Matching and DPI
- Step 1: Search
- Step 2: Match
- Step 3: Analyze
- Step 4: Build PCAP

Slide 20

Slide 20 text

20 Integration with Analytics Tools Dashboards Reports

Slide 21

Slide 21 text

21 Best Practices

Slide 22

Slide 22 text

22 Journey Towards Highly Scalable Application

Slide 23

Slide 23 text

23 Kafka Tuning

Slide 24

Slide 24 text

24 This is where we began

Slide 25

Slide 25 text

25 Some code optimizations and increased parallelism

Slide 26

Slide 26 text

26 Kafka Tuning
- Kafka is disk I/O heavy
- Kafka 0.8+ supports replication and JBOD
- Better performance with JBOD compared to RAID
- Parallelism is largely driven by the number of disks and partitions per topic
- Key configuration parameters:
  - num.io.threads: keep it at least equal to the number of disks provided to Kafka
  - num.network.threads: adjust it based on the number of concurrent producers, consumers and the replication factor
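As a rough illustration of the num.io.threads guidance, the Python helper below renders a broker configuration fragment for a JBOD layout. The directory paths, the disk count of 12 and the network thread value of 6 are hypothetical; `log.dirs`, `num.io.threads` and `num.network.threads` are standard Kafka broker settings, and 8 and 3 are their usual defaults. The slide gives no formula for num.network.threads, so the sketch leaves that value to the caller.

```python
def broker_properties(log_dirs, network_threads=3):
    """Render a server.properties fragment for a JBOD broker.

    num.io.threads is kept at least equal to the number of data
    directories (one per disk), and never below Kafka's default of 8.
    """
    io_threads = max(8, len(log_dirs))
    lines = [
        "log.dirs=" + ",".join(log_dirs),
        f"num.io.threads={io_threads}",
        f"num.network.threads={network_threads}",
    ]
    return "\n".join(lines)


# Hypothetical layout: 12 disks, each mounted separately (JBOD).
disks = [f"/data/{i}/kafka" for i in range(12)]
print(broker_properties(disks, network_threads=6))
```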

Slide 27

Slide 27 text

27 After Kafka Tuning

Slide 28

Slide 28 text

28 Bottleneck Isolation, Resource Profiling, Load Balancing

Slide 29

Slide 29 text

29 HBase Tuning

Slide 30

Slide 30 text

30 This is where we began

Slide 31

Slide 31 text

31 Row Key Design
- Row key design is critical (gets or scans or both?)
- Keys with IP addresses in dotted-decimal form
  - The first character is heavily skewed toward 1 and 2 (every first octet from 100 to 255 starts with one of them)
  - Minimum key length is 7 characters and maximum 15, with a typical average of 12
  - Subnet range scans become difficult: keys sort as strings, so a range of 90 to 220 excludes 112
- IP converted to hex (10.20.30.40 => 0a141e28)
  - Gives 16 variations of the first key character
  - Consistently 8-character key
  - Easy to search for subnet ranges
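The hex conversion can be sketched in Python as follows. The function names are hypothetical and the sketch covers only the IP portion of a key; whatever else the real row key contains is not stated on the slide.

```python
import ipaddress


def ip_to_row_key(ip):
    """Convert a dotted-decimal IPv4 address to a fixed 8-character hex key."""
    return format(int(ipaddress.IPv4Address(ip)), "08x")


def subnet_scan_range(cidr):
    """Return (first_key, last_key) covering every address in a subnet.

    Both bounds are inclusive here. HBase scans treat the stop row as
    exclusive by default, so a caller would need to adjust the upper bound.
    """
    net = ipaddress.IPv4Network(cidr)
    return (
        format(int(net.network_address), "08x"),
        format(int(net.broadcast_address), "08x"),
    )


print(ip_to_row_key("10.20.30.40"))       # 0a141e28
print(subnet_scan_range("10.20.0.0/16"))  # ('0a140000', '0a14ffff')

# String ordering of decimal octets does not follow numeric ordering...
print(sorted(["90", "112", "220"]))       # ['112', '220', '90']
# ...while fixed-width hex keys sort the same way the addresses do.
print(sorted(ip_to_row_key(f"{o}.0.0.0") for o in (220, 90, 112)))
```

Because every key has the same width, lexicographic order and numeric order coincide, which is what makes a subnet a single contiguous key range.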

Slide 32

Slide 32 text

32 Experiments with Row Key

Slide 33

Slide 33 text

33 Region Splits
- Know your data
- Auto split under high workload can result in hotspots and split storms
- Understand your data and presplit the regions
- Identify how many regions a RegionServer (RS) can have to perform optimally, using the formula below:
  (RS memory) * (total memstore fraction) / ((memstore size) * (# column families))
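A worked instance of the formula, as a small Python sketch. The input values (16 GiB of RegionServer heap, a memstore fraction of 0.4, a 128 MiB memstore size) are hypothetical and are not the numbers used in the Cisco deployment; in HBase terms the last two presumably correspond to the global memstore limit and the per-region memstore flush size.

```python
import math


def max_regions_per_rs(rs_memory_bytes, memstore_fraction,
                       memstore_size_bytes, column_families):
    """Apply the slide's formula and round down to whole regions."""
    regions = (rs_memory_bytes * memstore_fraction) / (
        memstore_size_bytes * column_families
    )
    return math.floor(regions)


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical sizing: 16 GiB heap, 40% for memstores, 128 MiB per memstore.
print(max_regions_per_rs(16 * GIB, 0.4, 128 * MIB, column_families=1))  # 51
print(max_regions_per_rs(16 * GIB, 0.4, 128 * MIB, column_families=2))  # 25
```

Adding a column family roughly halves the number of regions a RegionServer can carry, since each family gets its own memstore per region.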

Slide 34

Slide 34 text

34 With Region Pre-Splits

Slide 35

Slide 35 text

35 Know Your Application
- Enable micro batching (client-side buffer)
- Smart shuffle/grouping in Storm
- Understand your data and situationally exploit the various WAL options
- Watch for many minor compactions
- For heavy write workloads, increase hbase.hstore.blockingStoreFiles (we used 200)
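The micro-batching idea can be illustrated with a generic client-side buffer in Python. This is a sketch of the pattern only, not the HBase client API: the class name, the `flush_fn` callback and the threshold of 1000 are all hypothetical.

```python
class BufferedWriter:
    """Collect writes client-side and send them in batches.

    flush_fn is a hypothetical callback that ships one batch to the store.
    """

    def __init__(self, flush_fn, max_buffered=1000):
        self._flush_fn = flush_fn
        self._max_buffered = max_buffered
        self._buffer = []

    def put(self, row_key, value):
        self._buffer.append((row_key, value))
        if len(self._buffer) >= self._max_buffered:
            self.flush()

    def flush(self):
        """Send whatever is buffered; safe to call when empty."""
        if self._buffer:
            self._flush_fn(self._buffer)
            self._buffer = []


batches = []
writer = BufferedWriter(batches.append, max_buffered=1000)
for i in range(2500):
    writer.put(f"{i:08x}", b"payload")
writer.flush()  # do not forget the partial tail batch
print([len(b) for b in batches])  # [1000, 1000, 500]
```

The trade-off is that anything still sitting in the buffer is lost if the process dies before the next flush, so the buffer size should be chosen together with the topology's error handling.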

Slide 36

Slide 36 text

36 And Finally

Slide 37

Slide 37 text

37 Kafka Spout

Slide 38

Slide 38 text

38 Kafka Spout
- Parallelism is controlled by the number of partitions per topic
- Set Kafka spout parallelism equal to the number of partitions in the topic
- Other key parameters that drive performance:
  - fetchSizeBytes
  - bufferSizeBytes
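Why spout parallelism should match the partition count can be seen with a toy assignment in Python. The round-robin rule below is an illustrative assumption, not necessarily the exact algorithm the Kafka spout uses, and the function name is hypothetical.

```python
def assign_partitions(num_partitions, num_tasks):
    """Spread partitions over spout tasks round-robin (illustrative rule)."""
    assignment = {task: [] for task in range(num_tasks)}
    for partition in range(num_partitions):
        assignment[partition % num_tasks].append(partition)
    return assignment


def summarize(num_partitions, num_tasks):
    """Return (number of idle tasks, most partitions held by one task)."""
    loads = [len(p) for p in assign_partitions(num_partitions, num_tasks).values()]
    return loads.count(0), max(loads)


print(summarize(8, 8))   # (0, 1): every task reads exactly one partition
print(summarize(8, 12))  # (4, 1): four tasks have nothing to read
print(summarize(8, 3))   # (0, 3): some tasks carry more load than others
```

More tasks than partitions wastes executors, while fewer tasks than partitions makes a single task the ceiling on read throughput.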

Slide 39

Slide 39 text

39 Mysteriously Missing Data

Slide 40

Slide 40 text

40 Mysteriously Missing Data: Root Cause
- A bug in the Kafka spout caused it to miss some partitions and lose data
- It is now fixed and available from the Hortonworks repository (http://repo.hortonworks.com/content/repositories/releases/org/apache/storm/storm-Kafka)

Slide 41

Slide 41 text

41 Storm

Slide 42

Slide 42 text

42 Storm
- Every small thing counts at scale
- Even simple string operations can slow down throughput when executed on millions of tuples

Slide 43

Slide 43 text

43 Storm
- Error handling is critical
- Poorly handled errors can lead to topology failure and eventually to loss of data (or data duplication)

Slide 44

Slide 44 text

44 Storm
- Tune and scale individual spouts and bolts before performance testing/tuning the entire topology
- Write your own simple data generator spouts and no-op bolts
- Making as many things configurable as possible helps a lot
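The generator-and-no-op approach can be mimicked outside Storm with a small Python harness. This is an analogy under stated assumptions rather than Storm code: `synthetic_messages` plays the role of a data generator spout, `no_op` the role of a no-op bolt, and `serialize` is a hypothetical stage under test.

```python
import json
import time


def synthetic_messages(count):
    """Stand-in for a data generator spout: emits identical small messages."""
    for seq in range(count):
        yield {"src_ip": "10.20.30.40", "dest_ip": "20.30.40.50", "seq": seq}


def no_op(msg):
    """Stand-in for a no-op bolt: consumes the message and does nothing."""
    return None


def serialize(msg):
    """Hypothetical stage under test: one string operation per message."""
    return json.dumps(msg)


def measure(stage, messages):
    """Push every message through one stage; return (count, seconds)."""
    start = time.perf_counter()
    processed = 0
    for msg in messages:
        stage(msg)
        processed += 1
    return processed, time.perf_counter() - start


count = 100_000
for stage in (no_op, serialize):
    n, seconds = measure(stage, synthetic_messages(count))
    print(f"{stage.__name__}: {n} messages in {seconds:.3f}s")
```

The no-op run gives the baseline cost of generating and handing over messages, so the difference between the two runs is attributable to the stage itself.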

Slide 45

Slide 45 text

45 Lessons Learned
- When it comes to Hadoop…partner up
- Separate the hype from the opportunity
- Start small then scale up
- Design iteratively
- It doesn’t work unless you have proven it at scale
- Keep an eye on ROI

Slide 46

Slide 46 text

46 Looking for Community Partners
How can you contribute?
- Technology Partner Program: contribute developers to join the Cisco and Hortonworks team

Slide 47

Slide 47 text

Thank you! We are hiring: [email protected] [email protected]