Slide 1

Slide 1 text

Stream processing made easy with riko
8th Light University - Chicago, IL
May 30, 2017
by Reuben Cummings
@reubano #8LU

Slide 2

Slide 2 text

Who am I?
Managing Director, Nerevu Development
Programming in Python since 2011
Author of several popular packages

Slide 3

Slide 3 text

streams

Slides 4-10

Slides 4-10 text

[Animated diagram of a basic stream process: the input values 1, 2, 3 flow through a "* 2" map step, then a "> 2" filter step, then a "sum" reducer. 1 doubles to 2 and is dropped by the filter; 2 and 3 double to 4 and 6, which sum to 10.]
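The animated pipeline above can be sketched in a few lines of plain Python, using generator expressions for the map and filter stages:

```python
# Stream of input values
nums = [1, 2, 3]

# Map stage: double each value
doubled = (x * 2 for x in nums)

# Filter stage: keep only values greater than 2 (drops 1 * 2 == 2)
filtered = (x for x in doubled if x > 2)

# Reduce stage: sum the surviving values (4 + 6)
total = sum(filtered)
print(total)  # 10
```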

Slide 11

Slide 11 text

constructing streams

Slide 12

Slide 12 text

>>> 'abracadabra'[0]
'a'
>>> 'hello 8th Light University'.split(' ')[0]
'hello'
>>> range(1, 11)[0]
1
>>> [{'x': x} for x in range(4)][0]
{'x': 0}
>>> ({'x': x} for x in range(4))
<generator object <genexpr> at 0x103c10830>
>>> next({'x': x} for x in range(4))
{'x': 0}
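Unlike the list examples, the generator expression is lazy: no item is computed until one is requested with next(). A minimal sketch of pulling items off the stream on demand:

```python
# A lazy stream of dicts; nothing is computed yet
gen = ({'x': x} for x in range(4))

# Each next() call pulls exactly one item off the stream
first = next(gen)
second = next(gen)
print(first, second)  # {'x': 0} {'x': 1}
```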

Slide 13

Slide 13 text

processing streams

Slide 14

Slide 14 text

>>> [ord(x) for x in 'abracadabra']
[97, 98, 114, 97, 99, 97, 100, 97, 98, 114, 97]
>>> [2 * x for x in range(1, 11)]
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
>>> [x for x in range(1, 11) if x > 5]
[6, 7, 8, 9, 10]
>>> stream = ({'num': x} for x in range(4))
>>> sum(s['num'] for s in stream)
6
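One caveat worth noting (not on the slide): a generator-backed stream can be consumed only once. Summing the same stream a second time yields nothing:

```python
stream = ({'num': x} for x in range(4))

# First pass consumes the stream: 0 + 1 + 2 + 3
print(sum(s['num'] for s in stream))  # 6

# The generator is now exhausted, so a second pass sees no items
print(sum(s['num'] for s in stream))  # 0
```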

Slide 15

Slide 15 text

so what!

Slide 16

Slide 16 text

RSS feeds (feedly)

Slide 17

Slide 17 text

aggregators (kayak)

Slide 18

Slide 18 text

news feeds (linkedin)

Slide 19

Slide 19 text

frameworks

Slide 20

Slide 20 text

$ cat file.txt
What is Lorem Ipsum? Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap...

Slide 21

Slide 21 text

hello world Hadoop

Slide 22

Slide 22 text

>>> from mrjob.job import MRJob
>>> from mrjob.step import MRStep
>>>
>>>
>>> class MRWordCount(MRJob):
...     def steps(self):
...         kwargs = {
...             'mapper': self.mapper,
...             'combiner': self.combiner,
...             'reducer': self.reducer}
...
...         return [MRStep(**kwargs)]
...
...     def mapper(self, _, line):

Slide 23

Slide 23 text

>>>
>>> class MRWordCount(MRJob):
...     def steps(self):
...         kwargs = {
...             'mapper': self.mapper,
...             'combiner': self.combiner,
...             'reducer': self.reducer}
...
...         return [MRStep(**kwargs)]
...
...     def mapper(self, _, line):
...         for word in line.split(' '):
...             yield word.lower(), 1
...
...     def combiner(self, word, counts):

Slide 24

Slide 24 text

...         return [MRStep(**kwargs)]
...
...     def mapper(self, _, line):
...         for word in line.split(' '):
...             yield word.lower(), 1
...
...     def combiner(self, word, counts):
...         yield word, sum(counts)
...
...     def reducer(self, word, counts):
...         yield word, sum(counts)
>>>
>>> if __name__ == '__main__':
...     MRWordCount.run()

Slide 25

Slide 25 text

$ python hadoop_job.py file.txt
'1500s'      1
'a'          2
'an'         1
'and'        1
'been'       1
'book'       1
'but'        1
'centuries'  1
'dummy'      1
'ever'       1
'five'       1
'galley'     1
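For comparison, the same mapper/reducer logic collapses to a few lines of single-machine Python with collections.Counter (using an inline sample string here in place of file.txt):

```python
from collections import Counter

# Sample text standing in for the contents of file.txt
text = "What is Lorem Ipsum? Lorem Ipsum is simply dummy text"

# Mirror the mapper: split on spaces and lowercase each word
# (punctuation stays attached, just as in the MRJob mapper above)
counts = Counter(word.lower() for word in text.split(' '))
print(counts['lorem'], counts['is'])  # 2 2
```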

Slide 26

Slide 26 text

hello world Spark

Slide 27

Slide 27 text

>>> from operator import add
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> fpath = 'hdfs:///file.txt'
>>>
>>> stream = (spark.read.text(fpath).rdd
...     .flatMap(lambda row: row[0].split(' '))
...     .map(lambda word: (word.lower(), 1))
...     .reduceByKey(add)
...     .collect())

Slide 28

Slide 28 text

>>> stream[0]
('1500s', 1)

Slide 29

Slide 29 text

introducing riko github.com/nerevu/riko

Slide 30

Slide 30 text

github.com/nerevu/riko

Slide 31

Slide 31 text

riko

Slide 32

Slide 32 text

riko
{
    'greeting': 'hello',
    'location': '8th Light',
    'enthusiasm': 9
}

Slide 33

Slide 33 text

pip install riko

Slide 34

Slide 34 text

obtaining data

Slide 35

Slide 35 text

8th Light Blog https://8thlight.com/blog/feed/ rss.xml

Slide 37

Slide 37 text

>>> from riko.modules.fetch import pipe
>>>
>>> url = 'https://8thlight.com/blog/feed'
>>> url += '/rss.xml'
>>> stream = pipe(conf={'url': url})
>>> item = next(stream)
>>> item['author']
'Rabea Gleissner'
>>> item['published']
'Fri, 26 May 2017 00:00:00 -0500'
>>> item['title']
'How to set up a React project without flipping tables'

Slide 38

Slide 38 text

transforming data

Slide 39

Slide 39 text

>>> from riko.collections import SyncPipe
>>>
>>> frule = {
...     'field': 'title',
...     'op': 'contains',
...     'value': 'erlang'}
>>>
>>> stream = (
...     SyncPipe('fetch', conf={'url': url})
...     .filter(conf={'rule': frule})
...     .output)

Slide 40

Slide 40 text

>>> item = next(stream)
>>> item['title']
'The Core of Erlang'
>>> item['tags'][0]['term']
'Coding'
>>> item['link']
'https://8thlight.com/blog/kofi-gumbs/2017/05/02/core-erlang.html'

Slide 41

Slide 41 text

hello world riko

Slide 42

Slide 42 text

>>> conf = {'url': 'file:///file.txt'}
>>> tconf = {'delimiter': ' '}
>>> rule = {'transform': 'lower'}
>>> cconf = {'count_key': 'strtransform'}
>>>
>>> stream = (SyncPipe('fetchtext', conf=conf)
...     .tokenizer(conf=tconf, emit=True)
...     .strtransform(conf={'rule': rule})
...     .count(conf=cconf)
...     .output)
>>>
>>> next(stream)
{'1500s': 1}

Slide 43

Slide 43 text

parallel processing

Slide 44

Slide 44 text

The Core of Erlang
https://8thlight.com/blog/kofi-gumbs/2017/05/02/core-erlang.html

Slide 45

Slide 45 text

>>> from riko.modules import xpathfetchpage
>>>
>>> pipe = xpathfetchpage.pipe
>>> xpath = '/html/body/section/div/div[1]'
>>> xpath += '/div/div/article/div[3]/div'
>>> xpath += '/ul[1]/li/a'
>>>
>>> xconf = {
...     'url': item['link'], 'xpath': xpath}
>>>
>>> stream = pipe(conf=xconf)

Slide 46

Slide 46 text

>>> next(stream)
{'content': "Two Design Patterns You're Probably Already Using",
 'href': '/blog/becca-nelson/2017/05/22/two-design-patterns-youre-probably-already-using.html'}

Slide 47

Slide 47 text

>>> kwargs = {'conf': xconf}
>>> parts = [
...     {'value': 'http://8thlight.com'},
...     {'subkey': 'href'}]
>>>
>>> fconf = {
...     'url': {'subkey': 'strconcat'},
...     'start': '<p>',
...     'end': '</p>'}
>>>
>>> stream = (
...     SyncPipe('xpathfetchpage', **kwargs)
...     .strconcat(conf={'part': parts})
...     .fetchpage(conf=fconf)
...     .output)

Slide 49

Slide 49 text

>>> next(stream)['content'].decode('utf-8')
'I came into this field from a very non-technical background. And when I say very non-technical, I mean that I was an elementary fine arts teacher. The most important calculation I performed on a daily basis was counting my kindergarteners when they lined up to leave to make sure I hadn’t lost any since the beginning of the class period.'

Slide 50

Slide 50 text

>>> from time import monotonic
>>>
>>> start = monotonic()
>>> count = len(list(stream))
>>> stop = monotonic() - start
>>> count, stop
(9, 0.4573155799989763)

Slide 51

Slide 51 text

>>> kwargs = {'conf': xconf, 'parallel': True}
>>> start = monotonic()
>>>
>>> stream = (
...     SyncPipe('xpathfetchpage', **kwargs)
...     .strconcat(conf={'part': parts})
...     .fetchpage(conf=fconf)
...     .output)
>>>
>>> count = len(list(stream))
>>> stop = monotonic() - start
>>> count, stop
(10, 0.2804829629985761)

Slide 52

Slide 52 text

async processing

Slide 53

Slide 53 text

pip install riko[async]

Slide 54

Slide 54 text

>>> from riko.bado import coroutine, react
>>> from riko.collections import AsyncPipe
>>>
>>> @coroutine
... def run(reactor):
...     start = monotonic()
...     flow = AsyncPipe(
...         'xpathfetchpage', **kwargs)
...
...     stream = yield (flow
...         .strconcat(conf={'part': parts})
...         .fetchpage(conf=fconf)
...         .output)
...

Slide 55

Slide 55 text

>>> @coroutine
... def run(reactor):
...     start = monotonic()
...     flow = AsyncPipe(
...         'xpathfetchpage', **kwargs)
...
...     stream = yield (flow
...         .strconcat(conf={'part': parts})
...         .fetchpage(conf=fconf)
...         .output)
...
...     count = len(list(stream))
...     stop = monotonic() - start
...     print((count, stop))

Slide 56

Slide 56 text

>>> react(run)
(10, 0.2462857809982961)

Slide 57

Slide 57 text

Head to Head

                    Spark, etc.   Huginn     riko
installation        complex       moderate   simple
push/pull           push          push       pull
native ingestors    few           many       many
parallel            ✔             ✔          ✔
async                                        ✔
json serializable                 ✔          ✔
distributed         ✔

Slide 58

Slide 58 text

Thanks!

Reuben Cummings
[email protected]
https://www.reubano.xyz
@reubano #8LU