Regular Expressions Performance

Regular Expressions Performance: Optimizing event capture, building better Ossim Agent plugins

About A3Sec
• AlienVault's spin-off
• Professional Services, SIEM deployments
• AlienVault's Authorized Training Center (ATC) for Spain and LATAM
• Team of more than 25 Security Experts
• Own developments and tool integrations
• Advanced Health Check Monitoring
• Web: www.a3sec.com, Twitter: @a3sec

About Me
• David Gil <dgil@a3sec.com>
• Developer, Sysadmin, Project Manager
• Really believes in the Open Source model
• Programming since he was 9 years old
• Ossim developer at its early stage
• Agent core engine (full regex) and first plugins
• Python lover :-)
• Debian package maintainer (a long, long time ago)
• Sci-Fi book reader and mountain bike rider

Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools

Regular Expressions: What is a regex?
Regular expression: (bb|[^b]{2})\d\d
Input strings: bb445, 2ac3357bb, bb3aa2c7, a2ab64b, abb83fh6l3hi22ui
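To see what the pattern above does with each input string, here is a minimal sketch using Python's re module. The helper name first_match is only illustrative, and re.search is used so that the pattern may match anywhere in the string.

```python
import re

# Pattern from the slide: "bb" or any two non-"b" characters, then two digits.
PATTERN = re.compile(r'(bb|[^b]{2})\d\d')

INPUTS = ['bb445', '2ac3357bb', 'bb3aa2c7', 'a2ab64b', 'abb83fh6l3hi22ui']


def first_match(text):
    """Return the first matching substring, or None when nothing matches."""
    found = PATTERN.search(text)
    return found.group(0) if found else None


results = {text: first_match(text) for text in INPUTS}
for text in INPUTS:
    print('%-18s -> %s' % (text, results[text]))
```

Three of the five strings contain a match (bb44, ac33 and bb83); bb3aa2c7 and a2ab64b do not, because no "bb" or pair of non-"b" characters is directly followed by two digits.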

Regular Expressions: To RE or not to RE
• Regular expressions are almost never the right answer
  ◦ Difficult to debug and maintain
  ◦ Performance reasons: slower for simple matching
  ◦ Learning curve
• Python string functions are small C loops: super fast!
  ◦ startswith(), endswith(), split(), etc.
• Use standard parsing libraries!
  ◦ Formats: JSON, HTML, XML, CSV, etc.
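As a rough illustration of the two alternatives to regex named above, the sketch below uses plain string methods for simple checks and the standard library parsers for structured data. The log line and the JSON and CSV samples are made up for the example.

```python
import csv
import io
import json

# Hypothetical log line, only for illustration.
line = 'sshd: Failed password for root from 10.0.0.5'

# Simple checks and splits need no regex at all.
is_sshd = line.startswith('sshd:')
ends_with_ip = line.endswith('10.0.0.5')
source_ip = line.split()[-1]

# Structured formats have dedicated parsers in the standard library.
event = json.loads('{"user": "root", "action": "login"}')
row = next(csv.reader(io.StringIO('root,10.0.0.5,"failed, twice"')))

print(is_sshd, ends_with_ip, source_ip)
print(event['user'], row)
```

Note how the CSV parser handles the quoted field containing a comma, a case that a naive regex or split(',') would get wrong.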

Regular Expressions: To RE or not to RE
Example: URL parsing
• regex: ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
• parse_url() PHP function:
    $url = "http://username:password@hostname/path?arg=value#anchor";
    print_r(parse_url($url));
    (
        [scheme] => http
        [host] => hostname
        [user] => username
        [pass] => password
        [path] => /path
        [query] => arg=value
        [fragment] => anchor
    )
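The slide uses PHP's parse_url(); a comparable sketch in Python, assuming Python 3 and its standard urllib.parse module, would look like this. The dictionary keys simply mirror the PHP output above.

```python
from urllib.parse import urlsplit

url = 'http://username:password@hostname/path?arg=value#anchor'
parts = urlsplit(url)

# Collect the same fields that parse_url() reports.
fields = {
    'scheme': parts.scheme,
    'host': parts.hostname,
    'user': parts.username,
    'pass': parts.password,
    'path': parts.path,
    'query': parts.query,
    'fragment': parts.fragment,
}

for key, value in fields.items():
    print('[%s] => %s' % (key, value))
```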

Regular Expressions: To RE or not to RE
But there are a lot of reasons to use regex:
• powerful
• portable
• fast (with performance in mind)
• useful for complex patterns
• save development time
• short code
• fun :-)
• beautiful?

Regular Expressions: Basics - Characters
• \d, \D: digits and non-digits. \w, \W: word and non-word characters. \s, \S: whitespace and non-whitespace
    >>> re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')
    >>> re.findall(r'(\S+)\s+(\S+)', 'foo bar')
• ^, $: beginning and end of string
    >>> re.findall(r'(\d+)', 'cba3456csw')
    >>> re.findall(r'^(\d+)$', 'cba3456csw')
• . (dot): any character
    >>> re.findall(r'foo(.)bar', 'foo=bar')
    >>> re.findall(r'(...)=(...)', 'foo=bar')
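The slide shows these calls without their results. A small sketch that runs them under Python 3 (raw strings are used to avoid escape warnings, and the variable names are only illustrative):

```python
import re

# Character classes: only the captured group is returned by findall().
month = re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')   # ['07']
words = re.findall(r'(\S+)\s+(\S+)', 'foo bar')              # [('foo', 'bar')]

# Anchors: the anchored pattern requires the whole string to be digits.
digits_anywhere = re.findall(r'(\d+)', 'cba3456csw')         # ['3456']
digits_only = re.findall(r'^(\d+)$', 'cba3456csw')           # []

# Dot: matches any single character (except a newline, by default).
separator = re.findall(r'foo(.)bar', 'foo=bar')              # ['=']
sides = re.findall(r'(...)=(...)', 'foo=bar')                # [('foo', 'bar')]

print(month, words, digits_anywhere, digits_only, separator, sides)
```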

Regular Expressions: Basics - Repetitions
• *, +: 0 or more, 1 or more repetitions
    >>> re.findall(r'FO+', 'FOOOOOOOOO')
    >>> re.findall(r'BA*R', 'BR')
• ?: 0 or 1 repetitions
    >>> re.findall(r'colou?r', 'color')
    >>> re.findall(r'colou?r', 'colour')
• {n}, {n,m}: exactly n, or between n and m, repetitions
    >>> re.findall(r'\d{2}', '2013-07-21')
    >>> re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')
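For reference, this is what the repetition examples return when run under Python 3. The variable names are not from the slides.

```python
import re

shout = 'FOOOOOOOOO'
one_or_more = re.findall(r'FO+', shout)          # the whole string, as one match
zero_or_more = re.findall(r'BA*R', 'BR')         # ['BR']: zero "A"s is fine

american = re.findall(r'colou?r', 'color')       # ['color']
british = re.findall(r'colou?r', 'colour')       # ['colour']

pairs = re.findall(r'\d{2}', '2013-07-21')       # ['20', '13', '07', '21']
address = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')

print(one_or_more, zero_or_more, american, british, pairs, address)
```

The last pattern only checks the shape of an IPv4 address: it would also accept 999.999.999.999, so range validation still has to happen elsewhere.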

Regular Expressions: Basics - Groups
• [...]: set of characters
    >>> re.findall(r'[a-z]+=[a-z]+', 'foo=bar')
• ...|...: alternation
    >>> re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')
• (...) and \1, \2, ...: group and backreferences
    >>> re.findall(r'(\w+)=(\1)', 'foo=bar')
    >>> re.findall(r'(\w+)=(\1)', 'foo=foo')
• (?P<name>...): named group
    >>> re.findall(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
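A short sketch of what the group examples return under Python 3. The re.search call at the end is an addition to show how a named group is read back by name.

```python
import re

char_set = re.findall(r'[a-z]+=[a-z]+', 'foo=bar')             # ['foo=bar']
alternation = re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')    # [('foo', 'bar')]

# \1 must repeat exactly what group 1 captured.
backref_miss = re.findall(r'(\w+)=(\1)', 'foo=bar')            # []
backref_hit = re.findall(r'(\w+)=(\1)', 'foo=foo')             # [('foo', 'foo')]

# A named group behaves like a numbered one, but can be read by name.
found = re.search(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
day = found.group('day')                                       # '23'

print(char_set, alternation, backref_miss, backref_hit, day)
```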

Regular Expressions: Greedy & Lazy quantifiers: *?, +?
• Greedy vs non-greedy (lazy)
    >>> re.findall(r'A+', 'AAAA')
    ['AAAA']
    >>> re.findall(r'A+?', 'AAAA')
    ['A', 'A', 'A', 'A']
• An overall match takes precedence over an overall non-match
    >>> re.findall(r'<.*>.*</.*>', '<b>i am bold</b>')
    >>> re.findall(r'<(.*)>.*</(.*)>', '<b>i am bold</b>')
• Minimal matching, non-greedy
    >>> re.findall(r'<(.*)>.*', '<b>i am bold</b>')
    >>> re.findall(r'<(.*?)>.*', '<b>i am bold</b>')
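The tag examples only make sense if the input contains markup, so this sketch assumes the sample string was '<b>i am bold</b>' (an assumption: the bare text 'i am bold' cannot match a pattern that contains '<'). Under that assumption, Python 3 gives:

```python
import re

# Assumed input: the tags are not visible in the transcript.
html = '<b>i am bold</b>'

# Greedy .* first grabs too much, then gives characters back until
# the rest of the pattern can match: the overall match wins.
whole = re.findall(r'<.*>.*</.*>', html)          # ['<b>i am bold</b>']
tags = re.findall(r'<(.*)>.*</(.*)>', html)       # [('b', 'b')]

# Without a closing-tag constraint, greedy keeps as much as it can.
greedy = re.findall(r'<(.*)>.*', html)            # ['b>i am bold</b']
lazy = re.findall(r'<(.*?)>.*', html)             # ['b']

print(whole, tags, greedy, lazy)
```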

Regular Expressions: Performance Tests
Different implementations of a custom is_a_word() function:
• #1 Regexp
• #2 Char iteration
• #3 String functions

Regular Expressions: Performance Test #1 (regexp)
    import re
    import string

    def is_a_word(word):
        # ASCII letters (string.uppercase + string.lowercase on Python 2)
        CHARS = string.ascii_uppercase + string.ascii_lowercase
        regexp = r'^[%s]+$' % CHARS
        if re.search(regexp, word):
            return "YES"
        return "NOP"

    # s: setup code that defines is_a_word; w: the input string
    timeit.timeit('is_a_word(%r)' % w, setup=s)

    time (s)        result  len  input
    1.49650502205   YES     4    word
    1.65614509583   YES     25   wordlongerthanpreviousone..
    1.92520785332   YES     60   wordlongerthanpreviosoneplusan..
    2.38850092888   YES     120  wordlongerthanpreviosoneplusan..
    1.55924701691   NOP     10   not a word
    1.7087020874    NOP     25   not a word, just a phrase..
    1.92521882057   NOP     50   not a word, just a phrase bigg..
    2.39075493813   NOP     102  not a word, just a phrase bigg..

The longer the target string, the slower the regex matching, whether the match succeeds or fails.

Regular Expressions: Performance Test #2 (char iteration)
    import string

    # Same letter set as in Test #1
    CHARS = string.ascii_uppercase + string.ascii_lowercase

    def is_a_word(word):
        for char in word:
            if char not in CHARS:
                return "NOP"
        return "YES"

    # s: setup code that defines is_a_word; w: the input string
    timeit.timeit('is_a_word(%r)' % w, setup=s)

    time (s)        result  len  input
    0.687522172928  YES     4    word
    1.0725839138    YES     25   wordlongerthanpreviousone..
    2.34717106819   YES     60   wordlongerthanpreviosoneplusan..
    4.31543898582   YES     120  wordlongerthanpreviosoneplusan..
    0.54797577858   NOP     10   not a word
    0.547253847122  NOP     25   not a word, just a phrase..
    0.546499967575  NOP     50   not a word, just a phrase bigg..
    0.553755998611  NOP     102  not a word, just a phrase bigg..

Two nested loops when the match succeeds (slow). The failing inputs all fail at the same point, and so in the same time (the first space).

Regular Expressions: Performance Test #3 (string functions)
    def is_a_word(word):
        return "YES" if word.isalpha() else "NOP"

    # s: setup code that defines is_a_word; w: the input string
    timeit.timeit('is_a_word(%r)' % w, setup=s)

    time (s)        result  len  input
    0.146447896957  YES     4    word
    0.212563037872  YES     25   wordlongerthanpreviousone..
    0.318686008453  YES     60   wordlongerthanpreviosoneplusan..
    0.493942975998  YES     120  wordlongerthanpreviosoneplusan..
    0.14647102356   NOP     10   not a word
    0.146160840988  NOP     25   not a word, just a phrase..
    0.147103071213  NOP     50   not a word, just a phrase bigg..
    0.146239995956  NOP     102  not a word, just a phrase bigg..

Python string functions are small, fast C loops.
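To repeat the comparison on your own machine, a self-contained harness could look like the sketch below. It is not the benchmark script used for the slides: the function names, the sample strings and the repetition count are assumptions, and absolute timings will differ from the figures above.

```python
import re
import string
import timeit

CHARS = string.ascii_letters


def is_a_word_regex(word):
    """Test #1: regular expression."""
    return "YES" if re.search(r'^[%s]+$' % CHARS, word) else "NOP"


def is_a_word_loop(word):
    """Test #2: character iteration."""
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"


def is_a_word_str(word):
    """Test #3: built-in string method."""
    return "YES" if word.isalpha() else "NOP"


IMPLEMENTATIONS = [is_a_word_regex, is_a_word_loop, is_a_word_str]
# Hypothetical samples: short and long words, short and long phrases.
SAMPLES = ['word', 'word' * 30, 'not a word', 'not a word, just a phrase' * 4]


def benchmark(number=10000):
    """Time every implementation on every sample; return result rows."""
    rows = []
    for func in IMPLEMENTATIONS:
        for sample in SAMPLES:
            seconds = timeit.timeit(lambda: func(sample), number=number)
            rows.append((func.__name__, len(sample), func(sample), seconds))
    return rows


if __name__ == '__main__':
    for name, length, answer, seconds in benchmark():
        print('%-16s len=%-4d %s %.4f' % (name, length, answer, seconds))
```

The three versions agree on these ASCII samples but are not strictly equivalent: isalpha() also accepts non-ASCII letters, and the loop version answers "YES" for an empty string.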

Regular Expressions: Performance Strategies - Writing regex
• Be careful with repetitions (+, *, {n,m})
    (abc|def){2,4} produces (abc|def)(abc|def)((abc|def)(abc|def)?)?
    (abc|def){2,1000} produces ...
• Be careful with wildcards
    re.findall(r'(ab).*(cd).*(ef)', 'ab cd ef')    # slower
    re.findall(r'(ab)\s(cd)\s(ef)', 'ab cd ef')    # faster
• Longer target string -> slower regex matching
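One way to observe the wildcard cost is to time both patterns on a long string that almost matches. This is only a sketch: the near-miss string and the repetition count are invented for the example, and the two patterns are not equivalent in general (.* accepts any run of characters, \s exactly one whitespace character).

```python
import re
import timeit

LOOSE = re.compile(r'(ab).*(cd).*(ef)')    # wildcard version from the slide
STRICT = re.compile(r'(ab)\s(cd)\s(ef)')   # specific version from the slide

matching = 'ab cd ef'
# Hypothetical near miss: starts like a match but never contains "ef",
# so each greedy .* has to backtrack through the long tail before giving up.
near_miss = 'ab cd ' + 'x' * 2000


def time_search(pattern, text, number=1000):
    """Return the seconds taken by `number` searches of text with pattern."""
    return timeit.timeit(lambda: pattern.search(text), number=number)


if __name__ == '__main__':
    for label, pattern in (('loose', LOOSE), ('strict', STRICT)):
        print(label, time_search(pattern, near_miss))
```

On inputs like near_miss the strict pattern can give up after a few characters, while the wildcard version typically does work proportional to the length of the line.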

Regular Expressions: Performance Strategies - Writing regex
• Use a non-capturing group when there is no need to capture and save text to a variable
    (?:abc|def|ghi) instead of (abc|def|ghi)
• Put the pattern most likely to match first, and factor out the common prefix
    (TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
    TRAFFIC_(ALLOW|DROP|DENY)
• Use anchors (^ and $) to limit the scope
    re.findall(r'(ab){2}', 'abcabcabc')
    re.findall(r'^(ab){2}', 'abcabcabc')    # failures occur faster
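The effect of these choices on what a match returns can be checked directly. In this sketch the log line is hypothetical; the patterns are the ones from the slide.

```python
import re

# Hypothetical firewall log line.
log = 'fw01 TRAFFIC_DROP src=10.0.0.5'

# A capturing group makes findall() return only the group...
capturing = re.findall(r'TRAFFIC_(ALLOW|DROP|DENY)', log)         # ['DROP']
# ...a non-capturing group groups without saving anything.
non_capturing = re.findall(r'TRAFFIC_(?:ALLOW|DROP|DENY)', log)   # ['TRAFFIC_DROP']

# Neither pattern matches: "abab" never occurs in the target.
# The unanchored one may be tried at every start position,
# the anchored one can only match at position 0.
unanchored = re.search(r'(ab){2}', 'abcabcabc')
anchored = re.search(r'^(ab){2}', 'abcabcabc')

print(capturing, non_capturing, unanchored, anchored)
```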

Regular Expressions: Performance Strategies - Writing Agent plugins
• A new process is forked for each loaded plugin
  ◦ Load only the plugins that you really need!
• A plugin is a set of rules (regexp operations) for matching log lines
  ◦ A log entry that the plugin doesn't match is tested against ALL of its rules before failing!
  ◦ Reduce the number of rules: use a [translation] table
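The idea behind a translation table can be sketched in plain Python: one rule captures the variable part of the message, and a lookup table turns it into an event id, instead of one rule per message type. This is not the Ossim agent's plugin syntax; the table values, DEFAULT_SID and the classify() helper are all hypothetical.

```python
import re

# One rule instead of three: capture the action, then translate it.
RULE = re.compile(r'TRAFFIC_(?P<action>ALLOW|DROP|DENY)')

# Hypothetical translation table: action keyword -> event sub-id (made-up values).
TRANSLATION = {'ALLOW': 1, 'DROP': 2, 'DENY': 3}
DEFAULT_SID = 99  # hypothetical fallback for untranslated values


def classify(line):
    """Return the translated id for a log line, or None if the rule fails."""
    found = RULE.search(line)
    if found is None:
        return None
    return TRANSLATION.get(found.group('action'), DEFAULT_SID)


print(classify('fw01 TRAFFIC_DROP src=10.0.0.5'))
print(classify('unrelated log line'))
```

A line that does not match now costs one regex attempt rather than three.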

Regular Expressions: Performance Strategies - Writing Agent plugins
• Rules are matched in alphabetical order
  ◦ Order your rules by priority: the pattern most likely to match goes first
• Divide and conquer
  ◦ A plugin is configured to read from one source file: use a dedicated source file per technology
  ◦ Also, use a dedicated plugin for each technology
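A simplified model of ordered rule matching shows why the order matters. It assumes that rules are sorted by name and that the first match wins; the rule names and patterns are invented, and this is not the agent's actual code.

```python
import re

# Hypothetical rule set: the names are chosen so that alphabetical order
# puts the pattern assumed to be the most frequent first.
RULES = {
    '01-traffic-allow': re.compile(r'TRAFFIC_ALLOW'),
    '02-traffic-drop': re.compile(r'TRAFFIC_DROP'),
    '03-traffic-deny': re.compile(r'TRAFFIC_DENY'),
}


def match_line(line, rules=RULES):
    """Try rules in alphabetical name order; return (rule name, attempts)."""
    attempts = 0
    for name in sorted(rules):
        attempts += 1
        if rules[name].search(line):
            return name, attempts
    return None, attempts


print(match_line('fw01 TRAFFIC_ALLOW'))   # first rule: 1 attempt
print(match_line('fw01 TRAFFIC_DENY'))    # last rule: 3 attempts
print(match_line('noise'))                # no match: every rule was tried
```

If most lines are TRAFFIC_ALLOW, most of them cost a single regex attempt; a line from an unrelated technology always costs one attempt per rule, which is the argument for dedicated source files.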

Regular Expressions: Performance Strategies
Tool1 to Tool5, 20 logs/sec each, all logging to /var/log/syslog (100 logs/sec)
5 plugins with 1 rule each, reading /var/log/syslog
5 x 100 = 500 total regex/sec

Regular Expressions: Performance Strategies
Tool1 to Tool5, 20 logs/sec each, logging to /var/log/tool1 ... /var/log/tool5 (100 logs/sec in total)
5 plugins with 1 rule each, reading /var/log/tool{1-5}
5 x 20 = 100 total regex/sec (5x faster)
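The arithmetic behind the two layouts can be written down as a tiny model: every plugin applies its rules to every line of the file it reads. The function name and the tuple layout are only illustrative, and the model counts the worst case in which every rule is tried on every line.

```python
def regex_per_second(plugins):
    """plugins: list of (rules in the plugin, lines/sec in its source file)."""
    return sum(rules * rate for rules, rate in plugins)


# 5 plugins with 1 rule each, all reading /var/log/syslog at 100 logs/sec.
shared = [(1, 100)] * 5
# 5 plugins with 1 rule each, each reading its own file at 20 logs/sec.
dedicated = [(1, 20)] * 5

print(regex_per_second(shared))      # 500
print(regex_per_second(dedicated))   # 100
```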

Regular Expressions: Tools for testing Regex
Python:
    >>> import re
    >>> re.findall(r'(\S+) (\S+)', 'foo bar')
    [('foo', 'bar')]
    >>> result = re.search(
    ...     r'(?P<key>\w+)\s*=\s*(?P<value>\w+)',
    ...     'foo=bar'
    ... )
    >>> result.groupdict()
    {'key': 'foo', 'value': 'bar'}

Regular Expressions: Tools for testing Regex
Regex debuggers:
• Kiki
• Kodos
Online regex testers:
• http://gskinner.com/RegExr/ (java)
• http://regexpal.com/ (javascript)
• http://rubular.com/ (ruby)
• http://www.pythonregex.com/ (python)
Online regex visualization:
• http://www.regexper.com/ (javascript)

any (?:question|doubt|comment)+\?

A3Sec
web: www.a3sec.com, email: training@a3sec.com, twitter: @a3sec
Spain Head Office: C/ Aravaca, 6, Piso 2, 28040 Madrid. Tlf. +34 533 09 78
México Head Office: Avda. Paseo de la Reforma, 389, Piso 10, México DF. Tlf. +52 55 5980 3547
