Slide 1

Slide 1 text

Regular Expressions Performance: Optimizing event capture, building better Ossim Agent plugins

Slide 2

Slide 2 text

About A3Sec
● AlienVault's spin-off
● Professional Services, SIEM deployments
● AlienVault's Authorized Training Center (ATC) for Spain and LATAM
● Team of more than 25 Security Experts
● Own developments and tool integrations
● Advanced Health Check Monitoring
● Web: www.a3sec.com, Twitter: @a3sec

Slide 3

Slide 3 text

About Me
● David Gil
● Developer, Sysadmin, Project Manager
● Really believes in the Open Source model
● Programming since he was 9 years old
● Ossim developer at its early stage
● Agent core engine (full regex) and first plugins
● Python lover :-)
● Debian package maintainer (a long, long time ago)
● Sci-Fi book reader and mountain bike rider

Slide 4

Slide 4 text

Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools

Slide 6

Slide 6 text

Regular Expressions What is a regex? Regular expression: (bb|[^b]{2})

Slide 7

Slide 7 text

Regular Expressions What is a regex? Regular expression: (bb|[^b]{2})\d\d Input strings: bb445, 2ac3357bb, bb3aa2c7, a2ab64b, abb83fh6l3hi22ui
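To see which of the input strings above contain a match, the pattern can be run through Python's re module. This is a minimal sketch: the variable names are illustrative, and only the first match in each string is reported.

```python
import re

# Pattern and input strings copied from the slide.
pattern = re.compile(r'(bb|[^b]{2})\d\d')
inputs = ['bb445', '2ac3357bb', 'bb3aa2c7', 'a2ab64b', 'abb83fh6l3hi22ui']

# Map each input to its first match, or None when nothing matches.
first_match = {}
for text in inputs:
    m = pattern.search(text)
    first_match[text] = m.group(0) if m else None

for text, found in first_match.items():
    print(text, '->', found)
```

Strings such as bb3aa2c7 fail because neither alternative is ever followed by two digits.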

Slide 12

Slide 12 text

Regular Expressions: To RE or not to RE
● Regular expressions are almost never the right answer
  ○ Difficult to debug and maintain
  ○ Performance reasons: slower for simple matching
  ○ Learning curve
● Python string methods are small C loops: super fast!
  ○ startswith(), endswith(), split(), etc.
● Use standard parsing libraries! Formats: JSON, HTML, XML, CSV, etc.
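As a small illustration of the string-method point, the sketch below does the same prefix test with and without a regex, and splits a key=value pair with no regex at all. The log line and the key=value string are made up for the example.

```python
import re

line = 'sshd: Accepted password for root'  # hypothetical log line

# Regex version of a prefix test (re.match anchors at the start).
starts_re = re.match(r'sshd:', line) is not None

# String-method version: same answer, no regex engine involved.
starts_str = line.startswith('sshd:')

# Splitting a key=value pair with a string method.
key, sep, value = 'user=root'.partition('=')

print(starts_re, starts_str, key, value)
```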

Slide 13

Slide 13 text

Regular Expressions: To RE or not to RE
Example: URL parsing
● regex:
    ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
● parse_url() PHP function:
    $url = "http://username:password@hostname/path?arg=value#anchor";
    print_r(parse_url($url));
    (
        [scheme] => http
        [host] => hostname
        [user] => username
        [pass] => password
        [path] => /path
        [query] => arg=value
        [fragment] => anchor
    )
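The same approach is available in Python, the language used in the rest of these slides. The sketch below uses urlsplit from the Python 3 standard library as a stand-in for the PHP function above; it is not part of the original deck.

```python
from urllib.parse import urlsplit

# Same URL as in the PHP example.
url = 'http://username:password@hostname/path?arg=value#anchor'
parts = urlsplit(url)

# Each component is exposed as an attribute of the result.
print(parts.scheme, parts.hostname, parts.username, parts.password)
print(parts.path, parts.query, parts.fragment)
```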

Slide 14

Slide 14 text

Regular Expressions: To RE or not to RE
But there are a lot of reasons to use regex:
● powerful
● portable
● fast (with performance in mind)
● useful for complex patterns
● save development time
● short code
● fun :-)
● beautiful?

Slide 16

Slide 16 text

Regular Expressions: Basics - Characters
● \d, \D: digits and non-digits. \w, \W: word and non-word characters. \s, \S: whitespace and non-whitespace
    >>> re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')
    >>> re.findall(r'(\S+)\s+(\S+)', 'foo bar')
● ^, $: beginning and end of string
    >>> re.findall(r'(\d+)', 'cba3456csw')
    >>> re.findall(r'^(\d+)$', 'cba3456csw')
● . (dot): any character (except a newline, by default)
    >>> re.findall(r'foo(.)bar', 'foo=bar')
    >>> re.findall(r'(...)=(...)', 'foo=bar')
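The slide omits the results of these calls. The sketch below runs the same patterns (written as raw strings) and stores each result under an illustrative name.

```python
import re

month = re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')   # captures the month only
pair = re.findall(r'(\S+)\s+(\S+)', 'foo bar')              # two whitespace-separated tokens
digits = re.findall(r'(\d+)', 'cba3456csw')                 # digits anywhere in the string
anchored = re.findall(r'^(\d+)$', 'cba3456csw')             # whole string must be digits
sep = re.findall(r'foo(.)bar', 'foo=bar')                   # the single character in between
sides = re.findall(r'(...)=(...)', 'foo=bar')               # three characters on each side

print(month, pair, digits, anchored, sep, sides)
```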

Slide 17

Slide 17 text

Regular Expressions: Basics - Repetitions
● *: 0 or more repetitions. +: 1 or more repetitions
    >>> re.findall(r'FO+', 'FOOOOOOOOO')
    >>> re.findall(r'BA*R', 'BR')
● ?: 0 or 1 repetitions
    >>> re.findall(r'colou?r', 'color')
    >>> re.findall(r'colou?r', 'colour')
● {n}: exactly n repetitions. {n,m}: between n and m repetitions
    >>> re.findall(r'\d{2}', '2013-07-21')
    >>> re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')
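For reference, the same quantifier examples can be run as a script. The variable names are illustrative only.

```python
import re

one_or_more = re.findall(r'FO+', 'FOOOOOOOOO')   # + needs at least one O
zero_or_more = re.findall(r'BA*R', 'BR')         # * also accepts zero A's
optional = [re.findall(r'colou?r', w) for w in ('color', 'colour')]
exactly_two = re.findall(r'\d{2}', '2013-07-21') # non-overlapping pairs of digits
ip = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')

print(one_or_more, zero_or_more, optional, exactly_two, ip)
```

Note that the {1,3} pattern only checks the shape of an IPv4 address, not that each number is at most 255.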

Slide 18

Slide 18 text

Regular Expressions: Basics - Groups
● [...]: set of characters
    >>> re.findall(r'[a-z]+=[a-z]+', 'foo=bar')
● ...|...: alternation
    >>> re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')
● (...) and \1, \2, ...: group and backreferences
    >>> re.findall(r'(\w+)=(\1)', 'foo=bar')
    >>> re.findall(r'(\w+)=(\1)', 'foo=foo')
● (?P<name>...): named group
    >>> re.findall(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
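A short sketch of how these group constructs behave. The group name day is an assumption made for this example, and re.search is used for the named group so that it can be read back by name.

```python
import re

char_set = re.findall(r'[a-z]+=[a-z]+', 'foo=bar')
alternation = re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')

# \1 must repeat exactly what group 1 captured.
backref_miss = re.findall(r'(\w+)=(\1)', 'foo=bar')
backref_hit = re.findall(r'(\w+)=(\1)', 'foo=foo')

# Named group: 'day' is a hypothetical name chosen for illustration.
m = re.search(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
day = m.group('day')

print(char_set, alternation, backref_miss, backref_hit, day)
```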

Slide 21

Slide 21 text

Regular Expressions: Greedy & Lazy quantifiers: *?, +?
● Greedy vs non-greedy (lazy)
    >>> re.findall('A+', 'AAAA')
    ['AAAA']
    >>> re.findall('A+?', 'AAAA')
    ['A', 'A', 'A', 'A']
● An overall match takes precedence over an overall non-match
    >>> re.findall('<.*>.*', '<b>i am bold</b>')
    >>> re.findall('<(.*)>.*', '<b>i am bold</b>')
● Minimal matching, non-greedy
    >>> re.findall('<(.*)>.*', '<b>i am bold</b>')
    >>> re.findall('<(.*?)>.*', '<b>i am bold</b>')
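The difference between greedy and lazy matching is easiest to see on tagged text. The sketch below uses the sample string `<b>i am bold</b>`; the tag is an assumption of this example.

```python
import re

html = '<b>i am bold</b>'  # assumed sample input

# Greedy: .* runs to the end and backtracks to the LAST '>'.
greedy_all = re.findall(r'<.*>', html)
# Lazy: .*? stops at the FIRST '>' it can.
lazy_all = re.findall(r'<.*?>', html)

# Same effect inside a capturing group.
greedy_group = re.findall(r'<(.*)>.*', html)
lazy_group = re.findall(r'<(.*?)>.*', html)

print(greedy_all, lazy_all, greedy_group, lazy_group)
```

In the greedy group case the engine gives back just one character so that the final '>' can still match, which is the "overall match takes precedence" rule in action.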

Slide 23

Slide 23 text

Regular Expressions: Performance Tests
Different implementations of a custom is_a_word() function:
● #1 Regexp
● #2 Char iteration
● #3 String functions

Slide 26

Slide 26 text

Regular Expressions: Performance Test #1

    import re
    import string

    def is_a_word(word):
        # Python 2 names; on Python 3 use string.ascii_uppercase and string.ascii_lowercase
        CHARS = string.uppercase + string.lowercase
        regexp = r'^[%s]+$' % CHARS
        if re.search(regexp, word):
            return "YES"
        else:
            return "NOP"

    timeit.timeit(s, 'is_a_word(%s)' %(w))

    Time (s)        Result  Length  Input
    1.49650502205   YES     4       word
    1.65614509583   YES     25      wordlongerthanpreviousone..
    1.92520785332   YES     60      wordlongerthanpreviosoneplusan..
    2.38850092888   YES     120     wordlongerthanpreviosoneplusan..
    1.55924701691   NOP     10      not a word
    1.7087020874    NOP     25      not a word, just a phrase..
    1.92521882057   NOP     50      not a word, just a phrase bigg..
    2.39075493813   NOP     102     not a word, just a phrase bigg..

The longer the target string, the slower the regex match, whether it succeeds or fails.

Slide 29

Slide 29 text

Regular Expressions: Performance Test #2

    def is_a_word(word):
        # CHARS is the same letter set as in Test #1
        for char in word:
            if char not in CHARS:
                return "NOP"
        return "YES"

    timeit.timeit(s, 'is_a_word(%s)' %(w))

    Time (s)        Result  Length  Input
    0.687522172928  YES     4       word
    1.0725839138    YES     25      wordlongerthanpreviousone..
    2.34717106819   YES     60      wordlongerthanpreviosoneplusan..
    4.31543898582   YES     120     wordlongerthanpreviosoneplusan..
    0.54797577858   NOP     10      not a word
    0.547253847122  NOP     25      not a word, just a phrase..
    0.546499967575  NOP     50      not a word, just a phrase bigg..
    0.553755998611  NOP     102     not a word, just a phrase bigg..

On success, two nested loops (over the word, and over CHARS for each character) run to the end, which is slow. On failure, every input stops at the same point (the first space), so the time is the same regardless of length.

Slide 32

Slide 32 text

Regular Expressions: Performance Test #3

    def is_a_word(word):
        return "YES" if word.isalpha() else "NOP"

    timeit.timeit(s, 'is_a_word(%s)' %(w))

    Time (s)        Result  Length  Input
    0.146447896957  YES     4       word
    0.212563037872  YES     25      wordlongerthanpreviousone..
    0.318686008453  YES     60      wordlongerthanpreviosoneplusan..
    0.493942975998  YES     120     wordlongerthanpreviosoneplusan..
    0.14647102356   NOP     10      not a word
    0.146160840988  NOP     25      not a word, just a phrase..
    0.147103071213  NOP     50      not a word, just a phrase bigg..
    0.146239995956  NOP     102     not a word, just a phrase bigg..

Python string methods are small, fast C loops.
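The three tests can be reproduced with a small harness along these lines. This is a sketch for Python 3, not the original benchmark script: the function names and the benchmark() helper are made up here, the regex is precompiled (the slide builds it on every call), and absolute timings will differ from the figures above. The three versions agree only on simple inputs such as non-empty ASCII text.

```python
import re
import string
import timeit

LETTERS = string.ascii_letters            # Python 3 spelling of the letter set
WORD_RE = re.compile(r'^[%s]+$' % LETTERS)

def is_a_word_regex(word):
    # Test #1: regular expression
    return "YES" if WORD_RE.search(word) else "NOP"

def is_a_word_loop(word):
    # Test #2: character iteration
    for char in word:
        if char not in LETTERS:
            return "NOP"
    return "YES"

def is_a_word_str(word):
    # Test #3: string method
    return "YES" if word.isalpha() else "NOP"

IMPLEMENTATIONS = (is_a_word_regex, is_a_word_loop, is_a_word_str)

def benchmark(samples, number=10000):
    """Return {(function name, sample): seconds taken by `number` calls}."""
    results = {}
    for func in IMPLEMENTATIONS:
        for sample in samples:
            results[(func.__name__, sample)] = timeit.timeit(
                lambda: func(sample), number=number)
    return results

if __name__ == "__main__":
    for (name, sample), seconds in benchmark(['word', 'not a word']).items():
        print('%-16s %-14r %.4f' % (name, sample, seconds))
```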

Slide 38

Slide 38 text

Regular Expressions: Performance Strategies (Writing regex)
● Be careful with repetitions (+, *, {n,m})
    (abc|def){2,4} produces (abc|def)(abc|def)((abc|def)(abc|def)?)?
    (abc|def){2,1000} produces ...
● Be careful with wildcards
    re.findall(r'(ab).*(cd).*(ef)', 'ab cd ef')   # slower
    re.findall(r'(ab)\s(cd)\s(ef)', 'ab cd ef')   # faster
● Longer target string -> slower regex matching
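One way to check the wildcard claim on your own data is to time both patterns on the same input. The sketch below is only a starting point: time_pattern() is a hypothetical helper, and the difference on a string this short may be small.

```python
import re
import timeit

target = 'ab cd ef'
wildcard = re.compile(r'(ab).*(cd).*(ef)')   # .* overshoots, then backtracks
specific = re.compile(r'(ab)\s(cd)\s(ef)')   # \s consumes exactly one character

# Both patterns extract the same groups from this input.
wildcard_result = wildcard.findall(target)
specific_result = specific.findall(target)

def time_pattern(pattern, text, number=10000):
    """Seconds taken by `number` findall() calls."""
    return timeit.timeit(lambda: pattern.findall(text), number=number)

if __name__ == "__main__":
    print(wildcard_result, specific_result)
    print('wildcard:', time_pattern(wildcard, target))
    print('specific:', time_pattern(specific, target))
```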

Slide 42

Slide 42 text

Regular Expressions: Performance Strategies (Writing regex)
● Use a non-capturing group when you do not need to capture the text and save it to a variable
    (?:abc|def|ghi) instead of (abc|def|ghi)
● Put the pattern most likely to match first, and factor out common prefixes
    (TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
    TRAFFIC_(ALLOW|DROP|DENY)
● Use anchors (^ and $) to limit the scope
    re.findall(r'(ab){2}', 'abcabcabc')
    re.findall(r'^(ab){2}', 'abcabcabc')   # failures occur faster
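The sketch below shows what these choices change in practice: what findall() returns with and without a capturing group, and how an anchored pattern is rejected without scanning further. The log line is hypothetical.

```python
import re

line = 'fw01 TRAFFIC_DROP src=10.0.0.1'  # hypothetical log line

# Capturing group: findall() returns only the captured part.
capturing = re.findall(r'TRAFFIC_(ALLOW|DROP|DENY)', line)
# Non-capturing group: nothing is saved, so findall() returns the whole match.
non_capturing = re.findall(r'TRAFFIC_(?:ALLOW|DROP|DENY)', line)

# Anchored pattern: only position 0 is tried.
anchored = re.compile(r'^(?:ab){2}')
miss = anchored.search('abcabcabc')
hit = anchored.search('ababcabc')

print(capturing, non_capturing, miss, hit.group(0))
```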

Slide 45

Slide 45 text

Regular Expressions: Performance Strategies (Writing Agent plugins)
● A new process is forked for each loaded plugin
  ○ Load only the plugins that you really need!
● A plugin is a set of rules (regexp operations) for matching log lines
  ○ If a plugin doesn't match a log entry, the entry has been tested against ALL of its rules!
  ○ Reduce the number of rules: use a [translation] table
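The translation-table idea can be sketched in plain Python: one rule captures the variable part, and a lookup table turns it into an event ID, instead of one rule per value. This is not the Ossim Agent's plugin syntax; the table contents, field names, and log line are all hypothetical.

```python
import re

# Hypothetical translation table: action keyword -> event sub-ID.
TRANSLATION = {'ALLOW': 1, 'DROP': 2, 'DENY': 3}

# One rule covers all three actions instead of one rule per action.
RULE = re.compile(r'TRAFFIC_(?P<action>ALLOW|DROP|DENY)\s+src=(?P<src_ip>\S+)')

def parse(line):
    """Return an event dict, or None when the line does not match the rule."""
    m = RULE.search(line)
    if m is None:
        return None
    return {'sid': TRANSLATION[m.group('action')], 'src_ip': m.group('src_ip')}

print(parse('fw01 TRAFFIC_DROP src=10.0.0.1'))
```

A non-matching line now costs one regex evaluation rather than three.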

Slide 47

Slide 47 text

Regular Expressions: Performance Strategies (Writing Agent plugins)
● Rules are matched in alphabetical order
  ○ Order your rules by priority: pattern most likely to match first
● Divide and conquer
  ○ A plugin is configured to read from one source file: use a dedicated source file per technology
  ○ Also, use a dedicated plugin for each technology

Slide 48

Slide 48 text

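Regular Expressions: Performance Strategies
  Tool1 (20 logs/sec) -> /var/log/syslog
  Tool2 (20 logs/sec) -> /var/log/syslog
  Tool3 (20 logs/sec) -> /var/log/syslog
  Tool4 (20 logs/sec) -> /var/log/syslog
  Tool5 (20 logs/sec) -> /var/log/syslog
  /var/log/syslog receives 100 logs/sec
5 plugins with 1 rule each, all reading /var/log/syslog: 5 x 100 = 500 total regex/sec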

Slide 49

Slide 49 text

Regular Expressions: Performance Strategies
  Tool1 (20 logs/sec) -> /var/log/tool1
  Tool2 (20 logs/sec) -> /var/log/tool2
  Tool3 (20 logs/sec) -> /var/log/tool3
  Tool4 (20 logs/sec) -> /var/log/tool4
  Tool5 (20 logs/sec) -> /var/log/tool5
  (100 logs/sec in total)
5 plugins with 1 rule each, reading /var/log/tool{1-5}: 5 x 20 = 100 total regex/sec (5x faster)
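The arithmetic behind the two layouts can be written as a small function. It assumes each plugin evaluates each of its rules once per line it reads, which matches the single-rule plugins in the example; the function name is illustrative.

```python
def regex_evals_per_sec(plugins):
    """plugins: list of (number of rules, lines per second read by that plugin)."""
    return sum(rules * rate for rules, rate in plugins)

# Shared file: each of the 5 plugins reads all 100 lines/sec.
shared = regex_evals_per_sec([(1, 100)] * 5)
# Dedicated files: each plugin reads only its own 20 lines/sec.
dedicated = regex_evals_per_sec([(1, 20)] * 5)

print(shared, dedicated, shared // dedicated)
```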

Slide 51

Slide 51 text

Regular Expressions: Tools for testing Regex
Python:
    >>> import re
    >>> re.findall(r'(\S+) (\S+)', 'foo bar')
    [('foo', 'bar')]
    >>> result = re.search(
    ...     r'(?P<key>\w+)\s*=\s*(?P<value>\w+)',
    ...     'foo=bar'
    ... )
    >>> result.groupdict()
    {'key': 'foo', 'value': 'bar'}

Slide 52

Slide 52 text

Regular Expressions: Tools for testing Regex
Regex debuggers:
● Kiki
● Kodos
Online regex testers:
● http://gskinner.com/RegExr/ (java)
● http://regexpal.com/ (javascript)
● http://rubular.com/ (ruby)
● http://www.pythonregex.com/ (python)
Online regex visualization:
● http://www.regexper.com/ (javascript)

Slide 53

Slide 53 text

any (?:question|doubt|comment)+\?

Slide 54

Slide 54 text

A3Sec
web: www.a3sec.com
email: [email protected]
twitter: @a3sec

Spain Head Office
C/ Aravaca, 6, Piso 2
28040 Madrid
Tlf. +34 533 09 78

México Head Office
Avda. Paseo de la Reforma, 389 Piso 10
México DF
Tlf. +52 55 5980 3547