Slide 1

Slide 1 text

Probabilistic Data Structures in Redis • Srinivasan Rangarajan • @cnu

Slide 2

Slide 2 text

Srinivasan Rangarajan • [email protected] • @cnu • https://cnu.name

Slide 3

Slide 3 text

Log Analysis

Slide 4

Slide 4 text

User Events Kinesis Firehose ELK

Slide 5

Slide 5 text

Sample Event Data
{
  "ip": "123.123.123.123",
  "client_id": 232,
  "user_id": "35827",
  "email": "[email protected]",
  "product_id": "ABC-12345",
  "image_id": 3,
  "action": "pageview",
  "datetime": "2017-06-29T12:42:53Z"
}
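The later slides store counters under keys such as 'users:2017083120' and 'ip:20170831'. A minimal sketch of how such keys could be derived from an event's "datetime" field is shown here. It assumes the key suffix is the event time formatted as YYYYMMDDHH (hourly) or YYYYMMDD (daily); the helper names hourly_key and daily_key are hypothetical and do not come from the talk.

```python
import json
from datetime import datetime

# A trimmed copy of the sample event from the slide.
RAW_EVENT = """
{
  "ip": "123.123.123.123",
  "user_id": "35827",
  "action": "pageview",
  "datetime": "2017-06-29T12:42:53Z"
}
"""


def _event_time(event):
    """Parse the event's ISO-8601 UTC timestamp."""
    return datetime.strptime(event["datetime"], "%Y-%m-%dT%H:%M:%SZ")


def hourly_key(prefix, event):
    """Build a key like 'users:2017062912' (assumed YYYYMMDDHH suffix)."""
    return f"{prefix}:{_event_time(event):%Y%m%d%H}"


def daily_key(prefix, event):
    """Build a key like 'ip:20170629' (assumed YYYYMMDD suffix)."""
    return f"{prefix}:{_event_time(event):%Y%m%d}"


event = json.loads(RAW_EVENT)
print(hourly_key("users", event))  # users:2017062912
print(daily_key("ip", event))      # ip:20170629
```

Bucketing by time in the key name keeps each sketch small and makes it possible to expire or merge whole hours and days later.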

Slide 6

Slide 6 text

Challenges
• Hundreds of millions of events processed every day
• Peak of ~10 million events in an hour
• Real-time processing required
• Low memory and storage requirements

Slide 8

Slide 8 text

User Events Kinesis Firehose ELK AWS Lambda Redis

Slide 9

Slide 9 text

Cost Accuracy Scale

Slide 10

Slide 10 text

Probabilistic Data Structures

Slide 11

Slide 11 text

xkcd/1132

Slide 12

Slide 12 text

Loading Modules
• Command line: ./redis-server --loadmodule /path/to/module.so
• redis.conf: loadmodule /path/to/module.so
• At runtime: MODULE LOAD /path/to/module.so

Slide 13

Slide 13 text

Execute custom commands
>>> import redis
>>> r = redis.Redis()
>>> out = r.execute_command('CMD param1 param2')
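redis-py's execute_command also accepts the command name and each argument as separate positional parameters, which avoids building one long string by hand. The sketch below wraps that pattern. The helper names module_command and bf_exists are hypothetical, and the BF.EXISTS reply handling assumes the 1/0 integer replies shown later in this deck.

```python
def module_command(client, name, *args):
    """Send a module command, passing every argument separately.

    `client` is expected to be a redis.Redis instance (or anything with a
    compatible execute_command method).
    """
    return client.execute_command(name, *args)


def bf_exists(client, key, item):
    """Return True if the Bloom filter at `key` may contain `item`.

    Assumes BF.EXISTS replies with the integer 1 or 0.
    """
    return module_command(client, "BF.EXISTS", key, item) == 1


# Usage against a live server with the module loaded (not run here):
#   import redis
#   r = redis.Redis()
#   bf_exists(r, "u:sess:20170831", 456)
```

Keeping the wrapper this thin means any module command can be called the same way, while typed helpers such as bf_exists give callers a plain Python boolean.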

Slide 14

Slide 14 text

Data Structures • HyperLogLog • TopK • CountMinSketch • Bloom Filters

Slide 15

Slide 15 text

HyperLogLog Count the Cardinality of a Set

Slide 16

Slide 16 text

Count Unique Visitors/hour
>>> r.pfadd('users:2017083120', 123, 456, 789)
1
>>> r.pfcount('users:2017083120')
3
>>> r.pfadd('users:2017083120', 456)
0

Slide 17

Slide 17 text

Merge Hourly into Daily
>>> r.pfadd('users:2017083121', 121, 454, 787)
1
>>> r.pfmerge('users:20170831', 'users:2017083120', 'users:2017083121')
True
>>> r.pfcount('users:20170831')
6
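To show why hourly sketches can be merged into a daily one without double counting, here is a toy HyperLogLog in pure Python. It is a teaching sketch under stated assumptions, not Redis's implementation: the class name TinyHLL, the SHA-1 based hash and the 4096-register default are all choices made for this example (Redis uses 16384 registers and its own hash and bias corrections).

```python
import hashlib
import math


class TinyHLL:
    """A toy HyperLogLog: one small register per bucket, merged by max."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        index = x >> (64 - self.p)                 # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining bits
        # Rank = 1-based position of the first 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        """Union with another sketch: keep the larger rank per register."""
        if other.p != self.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # constant valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return round(m * math.log(m / zeros))
        return round(raw)


hour_20, hour_21 = TinyHLL(), TinyHLL()
for user_id in range(0, 5000):
    hour_20.add(user_id)
for user_id in range(2500, 10000):      # overlaps with the previous hour
    hour_21.add(user_id)

day = TinyHLL()
day.merge(hour_20)
day.merge(hour_21)
print(day.count())   # an estimate close to 10000 distinct users
```

Because merging keeps the per-register maximum, a user seen in both hours affects the daily sketch exactly as if they had been added to it once.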

Slide 18

Slide 18 text

Links • https://redis.io/commands#hyperloglog • http://antirez.com/news/75

Slide 19

Slide 19 text

TopK Get top K elements in a set

Slide 20

Slide 20 text

Top K IP Addresses
>>> r.execute_command('TOPK.ADD ip:20170831 3 123.45.67.89')
>>> r.execute_command('TOPK.ADD ip:20170831 3 123.45.67.90')
>>> r.execute_command('TOPK.ADD ip:20170831 3 123.45.67.91')
1L
>>> r.execute_command('TOPK.ADD ip:20170831 3 123.45.67.92')
-1L

Slide 21

Slide 21 text

Top K IP Addresses
>>> r.zrange('ip:20170831', 0, -1, withscores=True)
[('TOPK:1.0.1:1.0:\xff\xff\xff\xff\xff\xff\xff\xff\x04\x00\x00\x00\x00\x00\x00\x00', 1.0),
 ('123.45.67.89', 1.0),
 ('123.45.67.90', 1.0),
 ('123.45.67.92', 2.0)]
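One common way to track the top K items in bounded memory is the Space-Saving scheme: keep at most K counters, and when a new item arrives with no free slot, replace the smallest counter and start the newcomer at that count plus one. The sketch below illustrates that idea only. Whether the TopK module works this way is an assumption to verify against its repository, and the class name and IP addresses are hypothetical.

```python
class SpaceSavingTopK:
    """Approximate top-K tracker that stores at most k counters."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Evict the current minimum; the newcomer inherits its count + 1,
            # so counts may overestimate but never underestimate.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1

    def top(self):
        """Tracked items, highest count first (ties broken by item name)."""
        return sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))


tracker = SpaceSavingTopK(k=3)
stream = ["10.0.0.1"] * 5 + ["10.0.0.2"] * 3 + ["10.0.0.3", "10.0.0.4"]
for ip in stream:
    tracker.add(ip)

print(tracker.top())
# [('10.0.0.1', 5), ('10.0.0.2', 3), ('10.0.0.4', 2)]
```

Note that 10.0.0.4 was seen once but is reported with a count of 2: that overestimate is the price of never storing more than K counters.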

Slide 22

Slide 22 text

Links • https://github.com/RedisLabsModules/topk

Slide 23

Slide 23 text

CountMinSketch Count the frequency of items

Slide 24

Slide 24 text

     1  2  3  4
h1   0  0  0  0
h2   0  0  0  0
h3   0  0  0  0

Slide 25

Slide 25 text

     1  2  3  4
h1   1  0  0  0
h2   0  1  0  0
h3   0  0  1  0
h1(s1) = 1; h2(s1) = 2; h3(s1) = 3

Slide 26

Slide 26 text

     1  2  3  4
h1   1  0  0  1
h2   0  1  0  1
h3   0  0  1  1
h1(s2) = 4; h2(s2) = 4; h3(s2) = 4

Slide 27

Slide 27 text

     1  2  3  4
h1   2  0  0  1
h2   1  1  0  1
h3   1  0  1  1
h1(s3) = 1; h2(s3) = 1; h3(s3) = 1
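The walkthrough on slides 24 to 27 can be replayed in a few lines of Python. This sketch hard-codes the hash values stated on the slides instead of computing real hashes, so it only illustrates the update and query rules: each insert increments one cell per row, and the estimate for an item is the minimum of its cells.

```python
# Column picked by h1, h2, h3 for each item, as stated on the slides
# (columns are numbered 1..4).
HASHES = {
    "s1": (1, 2, 3),
    "s2": (4, 4, 4),
    "s3": (1, 1, 1),
}
WIDTH, DEPTH = 4, 3
table = [[0] * WIDTH for _ in range(DEPTH)]


def add(item):
    """Increment one counter in every row."""
    for row, col in enumerate(HASHES[item]):
        table[row][col - 1] += 1


def estimate(item):
    """The smallest of the item's counters is the least polluted one."""
    return min(table[row][col - 1] for row, col in enumerate(HASHES[item]))


for item in ("s1", "s2", "s3"):
    add(item)

for name, row in zip(("h1", "h2", "h3"), table):
    print(name, row)
# h1 [2, 0, 0, 1]
# h2 [1, 1, 0, 1]
# h3 [1, 0, 1, 1]

print({item: estimate(item) for item in HASHES})
# {'s1': 1, 's2': 1, 's3': 1}
```

s1 and s3 collide in row h1, which pushes that cell to 2, but taking the minimum across rows still recovers the true count of 1 for both.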

Slide 28

Slide 28 text

User Pageview counter
>>> r.execute_command('CMS.INCRBY u:pv:20170831 123 1 456 3 789 2 234 1 567 1')
'OK'
>>> r.execute_command('CMS.QUERY u:pv:20170831 123 456 789 234 567')
[1L, 3L, 2L, 1L, 1L]

Slide 29

Slide 29 text

Merge Counters
>>> r.execute_command('CMS.MERGE u:pv:201708 3 u:pv:20170829 u:pv:20170830 u:pv:20170831')
'OK'
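Merging works because two Count-Min sketches with the same dimensions and hash functions can simply be added cell by cell. The pure-Python sketch below shows that property. The class name, the SHA-256 based hashing and the width and depth defaults are assumptions made for this example and are not taken from the Redis module.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: `depth` rows of `width` counters."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def incrby(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions only ever add to a cell, so the minimum never undercounts.
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, *others):
        """Add other sketches into this one, cell by cell."""
        for other in others:
            if (other.width, other.depth) != (self.width, self.depth):
                raise ValueError("sketches must share width and depth")
            for row in range(self.depth):
                for col in range(self.width):
                    self.table[row][col] += other.table[row][col]


day_30, day_31 = CountMinSketch(), CountMinSketch()
day_30.incrby("123", 4)
day_30.incrby("456", 1)
day_31.incrby("123", 1)
day_31.incrby("456", 3)

month = CountMinSketch()
month.merge(day_30, day_31)
print(month.query("123"), month.query("456"))  # each at least 5 and 4
```

The merged sketch is identical to one that had received every increment directly, so daily counters can be rolled up into monthly ones at any time.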

Slide 30

Slide 30 text

Links
• https://github.com/RedisLabsModules/countminsketch
• https://redislabs.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/

Slide 31

Slide 31 text

Bloom Filters Test Membership in a Set

Slide 32

Slide 32 text

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Empty Bit Array

Slide 33

Slide 33 text

0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
h1(item1) = 2; h2(item1) = 5; h3(item1) = 8
Insert Item 1

Slide 34

Slide 34 text

0 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0
h1(item2) = 7; h2(item2) = 8; h3(item2) = 10
Insert Item 2

Slide 35

Slide 35 text

0 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0
h1(item3) = 2; h2(item3) = 11; h3(item3) = 0
Check Item 3: bits 11 and 0 are not set, so item3 is definitely not in the set

Slide 36

Slide 36 text

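0 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0
h1(item4) = 10; h2(item4) = 8; h3(item4) = 7
Check Item 4: all three bits are set, so the filter answers "maybe in the set" even though item4 was never inserted (a false positive)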

Slide 37

Slide 37 text

Bloom Filter returns    What it means
False                   Definitely not in the set
True                    Maybe in the set
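The bit-array walkthrough on slides 32 to 36 can be replayed directly. This sketch hard-codes the hash positions stated on the slides (zero-indexed, 16 bits) instead of computing real hashes, so it only demonstrates the insert and lookup rules and how a false positive arises.

```python
# Bit positions chosen by h1, h2, h3 for each item, as stated on the slides.
HASHES = {
    "item1": (2, 5, 8),
    "item2": (7, 8, 10),
    "item3": (2, 11, 0),
    "item4": (10, 8, 7),
}
bits = [0] * 16


def insert(item):
    """Set every bit the item hashes to."""
    for position in HASHES[item]:
        bits[position] = 1


def might_contain(item):
    """False means definitely absent; True only means possibly present."""
    return all(bits[position] == 1 for position in HASHES[item])


insert("item1")
insert("item2")

print(bits)
# [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(might_contain("item3"))  # False: bits 11 and 0 are unset
print(might_contain("item4"))  # True: a false positive, item4 was never inserted
```

item4's three bits were all set by item1 and item2 together, which is exactly why a True answer can only ever mean "maybe".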

Slide 38

Slide 38 text

Check User Session
>>> r.execute_command('BF.MADD u:sess:20170831 123 456 789')
[1L, 1L, 1L]
>>> r.execute_command('BF.EXISTS u:sess:20170831 456')
1L
>>> r.execute_command('BF.EXISTS u:sess:20170831 234')
0L
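How much memory a Bloom filter needs depends on how many items it should hold and which false-positive rate is acceptable. The sketch below applies the standard textbook sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function name is hypothetical, and the Redis module may size its filters differently.

```python
import math


def bloom_parameters(expected_items, false_positive_rate):
    """Return (bits, hash_count) for a classic Bloom filter.

    Uses the textbook optimum: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    if expected_items <= 0 or not 0 < false_positive_rate < 1:
        raise ValueError("need expected_items > 0 and 0 < rate < 1")
    bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hash_count = max(1, round(bits / expected_items * math.log(2)))
    return bits, hash_count


bits, hash_count = bloom_parameters(1_000_000, 0.01)
print(f"{bits / 8 / 1024 / 1024:.2f} MiB, {hash_count} hash functions")
```

For one million session IDs at a 1% false-positive rate this comes to a little under 10 bits per item, far less than storing the IDs themselves in a set.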

Slide 39

Slide 39 text

Links
• https://github.com/RedisLabsModules/rebloom
• https://redislabs.com/blog/rebloom-bloom-filter-datatype-redis/
• https://github.com/kristoff-it/redis-cuckoofilter (better than Bloom filters)

Slide 40

Slide 40 text

“An 80% solution today is much better than a 100% solution tomorrow.”

Slide 41

Slide 41 text

Thank You https://cnu.name/talks/redisconf-2018/