Machine Learning for Cybersecurity

Slide 1

Slide 1 text

GVDJ #IDSECCONF2016 machine learning for cybersecurity

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

USA SOUTH KOREA NORTH KOREA INDONESIA

Slide 5

Slide 5 text

security goals
▸ security goals
  ▸ confidentiality of information and resources
  ▸ integrity of information and resources
  ▸ availability of information and resources
▸ basic definitions
  ▸ threat: potential violation of a security goal
  ▸ security: protection from intentional threats
  ▸ attack: intentional violation of a security goal
SOURCE: MACHINE LEARNING FOR COMPUTER SECURITY

Slide 6

Slide 6 text

security mechanisms
▸ security policies and mechanisms
  ▸ policy: statement of what is and what is not allowed
  ▸ mechanism: method or tool enforcing a security policy
  ▸ security is a process, not a product!
▸ strategies for security mechanisms
  ▸ prevention of attacks, e.g. encryption
  ▸ detection of attacks, e.g. virus scanner
  ▸ analysis of attacks, e.g. forensics
SOURCE: MACHINE LEARNING FOR COMPUTER SECURITY (https://www.tu-braunschweig.de/sec/teaching/ss16/mlsec)

Slide 7

Slide 7 text

prevention is a hard task
▸ continuous discovery of vulnerabilities
▸ insecure software and hardware
▸ lack of developer awareness
examples: goto fail (february 2014), heartbleed (april 2014), shellshock (september 2014)
SOURCE: MACHINE LEARNING FOR COMPUTER SECURITY (https://www.tu-braunschweig.de/sec/teaching/ss16/mlsec)

Slide 8

Slide 8 text

attacks against services
▸ numerous security breaches at popular web services
▸ identities often include real names, addresses, emails, passwords, etc.
';--have i been pwned?: 142 pwned websites; 1,444,567,928 pwned accounts; 39,842 pastes; 31,108,929 paste accounts
SOURCE: MACHINE LEARNING FOR COMPUTER SECURITY (https://www.tu-braunschweig.de/sec/teaching/ss16/mlsec)

Slide 9

Slide 9 text

imbalance of security cycle
▸ increasing imbalance of security cycle
  ▸ increasing number of vulnerabilities
  ▸ high number of novel attacks
  ▸ high diversity of malicious software
▸ bottleneck: human analyst in the loop
  ▸ manual discovery of vulnerabilities
  ▸ manual generation of attack signatures
  ▸ manual analysis of malicious software
SOURCE: MACHINE LEARNING FOR COMPUTER SECURITY (https://www.tu-braunschweig.de/sec/teaching/ss16/mlsec)

Slide 10

Slide 10 text

conventional detection
▸ conventional attack detection using signatures
  ▸ ineffective against novel and unknown attacks
  ▸ inherent delay until novel signatures become available
  ▸ analysis obstructed by polymorphism and obfuscation
figure (NIMDA WORM): packet with HEADER (... IP TCP) and APPLICATION PAYLOAD (GET /scripts/ ..%c1%9c.. /system32/cmd.exe), matched by the signature TCP ..%c1%9c..
SOURCE: MACHINE LEARNING FOR COMPUTER SECURITY (https://www.tu-braunschweig.de/sec/teaching/ss16/mlsec)
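
The weakness of signature matching can be illustrated with a minimal sketch. It assumes signatures are plain byte patterns searched for in the payload; the signature table, the function name match_signatures, and both sample requests are hypothetical, and only the ..%c1%9c.. pattern comes from the slide.

```python
# Hypothetical signature table: name -> byte pattern searched for in the payload.
# Only the "..%c1%9c.." pattern is taken from the slide; the name is illustrative.
SIGNATURES = {
    "nimda-unicode-traversal": b"..%c1%9c..",
}


def match_signatures(payload: bytes) -> list:
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


# Illustrative request containing the known pattern.
known = b"GET /scripts/..%c1%9c../system32/cmd.exe HTTP/1.0"
# Illustrative variant using a different encoded path separator:
# the fixed byte pattern no longer matches.
variant = b"GET /scripts/..%c0%af../system32/cmd.exe HTTP/1.0"

print(match_signatures(known))    # ['nimda-unicode-traversal']
print(match_signatures(variant))  # []
```

The variant request is missed until someone writes a second signature, which is the inherent delay the slide refers to.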

Slide 11

Slide 11 text

intelligent defence
▸ construction of intelligent security systems
  ▸ combining computer security and machine learning
  ▸ minimal human intervention in prevention, detection, and analysis
▸ challenges in practice
  ▸ effectiveness, efficiency, and robustness
  ▸ transparency and controllability
SOURCE: MACHINE LEARNING FOR COMPUTER SECURITY (https://www.tu-braunschweig.de/sec/teaching/ss16/mlsec)

Slide 12

Slide 12 text

machine learning for cybersecurity

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

MACHINE LEARNING PREDICTION PLATFORM HUMAN INTUITION

Slide 15

Slide 15 text

attack mitigation issues
supervised
▸ rule-driven (limited by experience and expertise)
▸ high rate of undetected attacks (false negatives)
▸ delayed response (between detection and prevention)
unsupervised
▸ statistics-driven (improved detection of new attacks)
▸ substantial investigative effort (false positives)
▸ alarm fatigue and distrust (reversion to supervised methods)

Slide 16

Slide 16 text

implementation challenges
▸ lack of data: limited or no history of previous attacks (which supervised learning models require)
▸ evolving attacks: attackers constantly change their behaviour, making current models obsolete
▸ limited resources: relying on security analysts to investigate attacks can be costly and time-consuming

Slide 17

Slide 17 text

components
[diagram labels: THREAT PREDICTION PLATFORM, MODEL, ANALYSTS, PREDICTION, FEATURE, RAW DATA, ACTION, EVENTS MODELLING, CONTEXTUAL MODELLING]

Slide 18

Slide 18 text

components
▸ big data processing system: quantifying features from raw data
▸ outlier detection system: learning a descriptive model of those features through unsupervised learning
▸ feedback mechanism and continuous learning: incorporating analyst input

Slide 19

Slide 19 text

data characteristics

Slide 20

Slide 20 text

data characteristics
0.1 data sources
▸ common sources: network device and application logs
  ▸ router, switch, firewall, ids, ips, and load balancer devices
  ▸ web, database, and micro services
  ▸ frontend and backend applications
▸ delivered in real time from widely distributed systems

Slide 21

Slide 21 text

data characteristics
0.2 data dimensions and unique entities
▸ volume of raw data: measured in size (GB/TB) or in number of log lines (≥ tens of millions per day)
▸ unique entities specific to behavioural analytics: IP addresses, users, sessions, etc.

Slide 22

Slide 22 text

data characteristics
0.3 malicious activity prevalence
▸ under normal circumstances, malicious activities are extremely rare (generally ≤ 0.1%)
  ▸ resulting in extreme class imbalance in supervised learning
  ▸ making detection more difficult
▸ unknown and/or unreported attacks introduce noise into the data
▸ attack vectors can take a wide variety of shapes
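
A short worked example shows why such a low prevalence makes detection hard. The 0.1% prevalence is the slide's upper bound; the detector's true positive rate and false positive rate are assumed values chosen only for illustration, and the function name precision is hypothetical.

```python
def precision(prevalence: float, tpr: float, fpr: float) -> float:
    """Fraction of raised alerts that are real attacks (Bayes' rule)."""
    true_alerts = prevalence * tpr            # attacks that are flagged
    false_alerts = (1.0 - prevalence) * fpr   # benign events that are flagged
    return true_alerts / (true_alerts + false_alerts)


# Assumed detector: catches 99% of attacks, flags 1% of benign events.
p = precision(prevalence=0.001, tpr=0.99, fpr=0.01)
print(f"{p:.3f}")  # 0.090
```

Under these assumptions only about 9% of alerts are real attacks, even though the detector looks accurate on paper.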

Slide 23

Slide 23 text

big data analytics
[diagram: RAW DATA aggregated DAILY, WEEKLY, and MONTHLY into AGGREGATED DATA for the entity JIM; FEATURES: IS NEW USER?, LAST CHANGED PASSWORD, LAST IP ADDRESS, LAST SESSION LENGTH, ....., NUMBER OF FAILED LOGIN]

Slide 24

Slide 24 text

big data analytics
0.1 behavioural signatures
▸ quantifying signatures (which often comprise a series of attack steps) from raw data
▸ quantitative values can be defined by security analysts
▸ extracting features on a per-entity and per-time-segment basis
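
A minimal sketch of analyst-defined features computed per entity and per time segment. The event layout (entity, day, event type), the event names, and the two features are all hypothetical; the deck does not specify a schema.

```python
from collections import defaultdict

# Hypothetical event records: (entity, day, event_type). Field names are assumed.
EVENTS = [
    ("jim", "2016-09-01", "login_failed"),
    ("jim", "2016-09-01", "login_failed"),
    ("jim", "2016-09-01", "login_ok"),
    ("ann", "2016-09-01", "login_ok"),
    ("jim", "2016-09-02", "password_changed"),
]

# Analyst-defined features: name -> function over one entity's events in one segment.
FEATURES = {
    "failed_logins": lambda evs: sum(e == "login_failed" for e in evs),
    "changed_password": lambda evs: int("password_changed" in evs),
}


def extract_features(events):
    """Group events by (entity, time segment) and evaluate every feature."""
    groups = defaultdict(list)
    for entity, day, event_type in events:
        groups[(entity, day)].append(event_type)
    return {
        key: {name: fn(evs) for name, fn in FEATURES.items()}
        for key, evs in groups.items()
    }


features = extract_features(EVENTS)
print(features[("jim", "2016-09-01")])  # {'failed_logins': 2, 'changed_password': 0}
```

Keeping the features in a table of named functions lets an analyst add a new quantitative value without touching the grouping logic.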

Slide 25

Slide 25 text

big data analytics
0.2 design requirements
▸ capable of analysing ≥ 10 million entities on a daily basis
▸ capable of updating and retrieving signatures of active entities, on demand and/or in real time

Slide 26

Slide 26 text

big data analytics
0.3.1 process: activity tracking
▸ absorbing the log stream: identifying the entities and updating the corresponding records
▸ over short temporal windows: 30 minutes, 1 hour, 12 hours, or 24 hours
▸ focus on efficient retrieval for feature computation
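
Activity tracking could be sketched as below, assuming parsed log lines of the form (unix timestamp, entity, event type) and an in-memory store keyed by entity and window start. The class name ActivityTracker, its methods, and the sample lines are hypothetical; a real deployment would use a distributed store.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 30 * 60  # 30-minute window, the shortest one listed on the slide


class ActivityTracker:
    """Keeps per-entity event counts, keyed by the start of each time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        # records[entity][window_start] -> Counter of event types
        self.records = defaultdict(lambda: defaultdict(Counter))

    def _window_start(self, timestamp):
        return timestamp - (timestamp % self.window_seconds)

    def absorb(self, timestamp, entity, event_type):
        """Update the record of `entity` for the window containing `timestamp`."""
        self.records[entity][self._window_start(timestamp)][event_type] += 1

    def lookup(self, entity, timestamp):
        """Retrieve the counts for one entity and window (empty if unseen)."""
        windows = self.records.get(entity, {})
        return dict(windows.get(self._window_start(timestamp), {}))


tracker = ActivityTracker()
# Hypothetical parsed log lines: (unix timestamp, entity, event type).
for line in [(1000, "10.0.0.5", "request"), (1500, "10.0.0.5", "request"),
             (1900, "10.0.0.5", "error"), (1200, "10.0.0.9", "request")]:
    tracker.absorb(*line)

print(tracker.lookup("10.0.0.5", 1700))  # {'request': 2}
```

Indexing by entity first and window second keeps retrieval for feature computation to two dictionary lookups.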

Slide 27

Slide 27 text

big data analytics
0.3.2 process: activity aggregation
▸ computing behavioural features over an interval of time
▸ retrieving all activity records within the given interval
▸ aggregating smaller time units (minutes, hours, days, weeks) as the feature demands
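
The aggregation step can be sketched as follows, assuming activity records are stored as per-hour counters for one entity. The record layout, the sample values, and the function names aggregate and roll_up are hypothetical.

```python
from collections import Counter

# Hypothetical hourly activity records for one entity: hour index -> counts.
HOURLY = {
    0: Counter(failed_login=1),
    1: Counter(failed_login=4, request=10),
    25: Counter(request=3),
    30: Counter(failed_login=2),
}


def aggregate(records, start, end):
    """Sum all records whose time unit falls in the half-open interval [start, end)."""
    total = Counter()
    for unit, counts in records.items():
        if start <= unit < end:
            total.update(counts)
    return total


def roll_up(records, units_per_bucket):
    """Merge small time units into larger ones, e.g. 24 hours into one day."""
    buckets = {}
    for unit, counts in records.items():
        buckets.setdefault(unit // units_per_bucket, Counter()).update(counts)
    return buckets


day0 = aggregate(HOURLY, 0, 24)   # feature over a given interval
daily = roll_up(HOURLY, 24)       # hours merged into days
print(day0["failed_login"], daily[1]["failed_login"])  # 5 2
```

The same roll_up call with a different bucket size would turn days into weeks, so one routine covers every granularity a feature might demand.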

Slide 28

Slide 28 text

algorithm selection

Slide 29

Slide 29 text

algorithm selection

Slide 30

Slide 30 text

outlier detection
[figure label: OUTLIER]

Slide 31

Slide 31 text

outlier detection
▸ matrix decomposition-based outlier analysis
▸ replicator neural networks
▸ density-based outlier analysis
▸ score interpretation
▸ transforming scores into probabilities
▸ detection ensembles
[figure labels: MATRIX DECOMPOSITION, REPLICATOR NEURAL NETWORKS]
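
The path from raw outlier scores to probabilities and then to an ensemble might be sketched as below. This is not the deck's implementation: the density-based detector is approximated by k-nearest-neighbour distance, a simple z-score detector stands in for the matrix decomposition and replicator network methods, and scores are mapped to probabilities by empirical rank. All function names and data are hypothetical; math.dist needs Python 3.8 or later.

```python
import math
from statistics import mean, pstdev


def knn_scores(points, k=2):
    """Density-based score: distance to the k-th nearest neighbour (larger = sparser)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def zscore_scores(points):
    """Stand-in second detector: largest per-feature absolute z-score."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against zero variance
    return [max(abs(x - m) / s for x, m, s in zip(p, mus, sigmas)) for p in points]


def to_probabilities(scores):
    """Map raw scores to (0, 1] by empirical rank: share of scores <= this one."""
    n = len(scores)
    return [sum(other <= s for other in scores) / n for s in scores]


def ensemble(points):
    """Average the per-detector probabilities into a single outlier score."""
    probs = [to_probabilities(f(points)) for f in (knn_scores, zscore_scores)]
    return [mean(ps) for ps in zip(*probs)]


# Hypothetical feature vectors (e.g. failed logins, session minutes) per entity.
data = [(1, 10), (2, 11), (1, 12), (2, 9), (3, 10), (40, 90)]
scores = ensemble(data)
top = max(range(len(data)), key=scores.__getitem__)
print(top)  # 5
```

Converting each detector's output to a common probability scale first is what makes averaging meaningful, since raw distances and z-scores are not comparable.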

Slide 32

Slide 32 text

continuous learning
▸ overcomes limited analyst bandwidth
▸ overcomes weaknesses of unsupervised learning
▸ actively adapts and synthesises new models
[figure labels: PREDICT, ACT, TRAIN]
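
The predict, act, train cycle could look roughly like the toy loop below. Everything in it is an assumption made for illustration: a single numeric feature, an unsupervised score based on deviation from the median, a threshold model as the supervised stand-in, an analyst who reviews two events per day, and a hard-coded analyst verdict.

```python
from statistics import mean, median


def unsupervised_scores(batch):
    """Outlier score without labels: absolute deviation from the batch median."""
    centre = median(batch)
    return [abs(x - centre) for x in batch]


def supervised_scores(batch, threshold):
    """Score from the learned model: how far a value lies above the threshold."""
    return [x - threshold for x in batch]


def train_threshold(labelled):
    """Tiny supervised stand-in: midpoint between mean attack and mean benign value."""
    attacks = [x for x, is_attack in labelled if is_attack]
    benign = [x for x, is_attack in labelled if not is_attack]
    if not attacks or not benign:
        return None  # not enough feedback yet
    return (mean(attacks) + mean(benign)) / 2


def analyst(value):
    """Hypothetical analyst verdict: in this toy data only values above 90 are attacks."""
    return value > 90


days = [[50, 52, 48, 2, 99], [51, 49, 1, 97, 3]]  # hypothetical feature values
labelled, threshold = [], None
for batch in days:
    # PREDICT: fall back to the unsupervised score until a model has been trained.
    if threshold is None:
        scores = unsupervised_scores(batch)
    else:
        scores = supervised_scores(batch, threshold)
    # ACT: the analyst only has bandwidth for the two highest-scoring events.
    ranked = sorted(range(len(batch)), key=lambda i: scores[i], reverse=True)
    for i in ranked[:2]:
        labelled.append((batch[i], analyst(batch[i])))
    # TRAIN: refit the supervised model on all feedback gathered so far.
    threshold = train_threshold(labelled)

print(labelled)   # [(99, True), (2, False), (97, True), (51, False)]
print(threshold)  # 62.25
```

On the first day the unusually low value 2 consumes one of the analyst's two reviews; once that feedback is learned, the low values on the second day are no longer ranked near the top.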

Slide 33

Slide 33 text

example: open network insight
leveraging insights from flow and packet analysis

Slide 34

Slide 34 text

example: open network insight
advantages

Slide 35

Slide 35 text

example: open network insight
how it works

Slide 36

Slide 36 text

example: entrada
network data analytics platform

Slide 37

Slide 37 text

summary
▸ current problems of security
  ▸ automation of attacks
  ▸ massive amount of novel malicious code
  ▸ defences involving manual actions (often ineffective)
▸ machine learning in security
  ▸ adaptive defences using learning algorithms
  ▸ automatic detection and analysis of threats

Slide 38

Slide 38 text

references
▸ machine learning for computer security, by konrad rieck, fabian yamaguchi, and alwin maier (institute of system security, tu-braunschweig): https://www.tu-braunschweig.de/sec/teaching/ss16/mlsec
▸ 360° unsupervised anomaly-based intrusion detection, by stefano zanero: https://www.blackhat.com/presentations/bh-dc-07/Zanero/Presentation/bh-dc-07-Zanero.pdf
▸ mlsec project: http://www.mlsecproject.org/
▸ redefining infosec by combining ai and human intuition: https://www.patternex.com/redefining-infosec-by-combining-ai-and-human-intuition-wp

Slide 39

Slide 39 text

questions?