Stream processing made easy with riko

Slide 1

Slide 1 text

Stream processing made easy with riko
PyConZA - Cape Town, SA
Oct 6, 2016
by Reuben Cummings
@reubano #PyConZA16

Slide 2

Slide 2 text

Who am I?
Managing Director, Nerevu Development
Programming in Python since 2011
Author of several popular packages
Flask over Django, Twisted over Tornado, functions over classes

Slide 3

Slide 3 text

streams

Slides 4-18

Slides 4-18 text

basic stream process (animation frames): the items 1, 2 and 3 pass one at a time through three stages: * 2, then > 2, then sum. 1 becomes 2 and is dropped by the > 2 filter; 2 becomes 4 and 3 becomes 6, and both pass the filter; sum adds 4 and 6 and emits 10.
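The animated pipeline can be sketched with plain Python generators. This is only an illustration of the three stages on the slides (double, filter, sum); the function names `double` and `greater_than` are made up for this sketch.

```python
def double(items):
    """First stage: multiply each item by 2."""
    for item in items:
        yield item * 2


def greater_than(items, threshold):
    """Second stage: pass only items above the threshold."""
    for item in items:
        if item > threshold:
            yield item


source = iter([1, 2, 3])

# Items are pulled through the stages one at a time; sum is the final stage.
result = sum(greater_than(double(source), 2))
print(result)  # 10
```

Nothing is computed until `sum` starts pulling items, which is the "pull" behaviour the rest of the talk relies on.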

Slide 19

Slide 19 text

constructing streams

Slide 20

Slide 20 text

>>> 'abracadabra'[0]
'a'
>>> range(1, 11)[0]
1
>>> 'hello pycon attendees'.split(' ')[0]
'hello'
>>> [{'x': x} for x in range(4)][0]
{'x': 0}
>>> ({'num': x} for x in range(4))
<generator object <genexpr> at 0x103c10830>
>>> next({'num': x} for x in range(4))
{'num': 0}

Slide 21

Slide 21 text

processing streams

Slide 22

Slide 22 text

>>> [ord(x) for x in 'abracadabra']
[97, 98, 114, 97, 99, 97, 100, 97, 98, 114, 97]
>>> [2 * x for x in range(1, 11)]
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
>>> [x for x in range(1, 11) if x > 5]
[6, 7, 8, 9, 10]
>>> stream = ({'num': x} for x in range(4))
>>> sum(s['num'] for s in stream)
6
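One property of generator-based streams worth seeing in code: they can be consumed only once. The sketch below reuses the `{'num': x}` stream from the slide; `make_stream` is a hypothetical helper added for illustration.

```python
stream = ({'num': x} for x in range(4))

# The first pass consumes every item in the generator.
first_total = sum(item['num'] for item in stream)

# The second pass finds the generator already exhausted.
second_total = sum(item['num'] for item in stream)


def make_stream():
    """Hypothetical helper: build a fresh stream each time it is needed."""
    return ({'num': x} for x in range(4))


fresh_total = sum(item['num'] for item in make_stream())

print(first_total, second_total, fresh_total)  # 6 0 6
```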

Slide 23

Slide 23 text

so what!

Slide 24

Slide 24 text

RSS feeds (feedly)

Slide 25

Slide 25 text

aggregators (kayak)

Slide 26

Slide 26 text

mashups (portwiture)

Slide 27

Slide 27 text

frameworks

Slide 28

Slide 28 text

What is Lorem Ipsum? Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap...

Slide 29

Slide 29 text

hello world Hadoop

Slide 30

Slide 30 text

>>> import re
>>> from mrjob.job import MRJob
>>> from mrjob.step import MRStep
>>>
>>> WORD_RE = re.compile(r"[\w']+")
>>>
>>>
>>> class MRWordCount(MRJob):
...     def steps(self):
...         kwargs = {
...             'mapper': self.mapper,
...             'combiner': self.combiner,
...             'reducer': self.reducer}
...
...         return [MRStep(**kwargs)]

Slide 31

Slide 31 text

...     def mapper(self, _, line):
...         for word in WORD_RE.findall(line):
...             yield word.lower(), 1
...
...     def combiner(self, word, counts):
...         yield word, sum(counts)
...
...     def reducer(self, word, counts):
...         yield word, sum(counts)
>>>
>>> if __name__ == '__main__':
...     MRWordCount.run()
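For comparison, the same map and reduce steps can be run in plain Python without a cluster. This is a rough single-process sketch, not how mrjob executes a job; the sample lines are taken from the Lorem Ipsum text on the earlier slide, and there is no separate combiner because everything runs in one process.

```python
import re
from collections import Counter
from itertools import chain

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit a (word, 1) pair for every word on the line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(pairs):
    """Add up the counts emitted for each word."""
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return counts


lines = [
    'Lorem Ipsum is simply dummy text',
    "Lorem Ipsum has been the industry's standard dummy text",
]

counts = reducer(chain.from_iterable(mapper(line) for line in lines))
print(counts['lorem'], counts['dummy'], counts['is'])  # 2 2 1
```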

Slide 32

Slide 32 text

hello world Spark

Slide 33

Slide 33 text

>>> from operator import add
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # .rdd yields Row objects, so take the text column with row[0]
>>> stream = (spark.read.text('hdfs://file.txt').rdd
...     .flatMap(lambda row: row[0].split(' '))
...     .map(lambda word: (word.lower(), 1))
...     .reduceByKey(add)
...     .collect())
>>>
>>> stream[0]
('"de', 1)

Slide 34

Slide 34 text

introducing riko github.com/nerevu/riko

Slide 35

Slide 35 text

riko

Slide 36

Slide 36 text

riko
{'greeting': 'hello', 'location': 'pycon', 'enthusiasm': 3}

Slide 37

Slide 37 text

basic usage

Slide 38

Slide 38 text

pip install riko

Slide 39

Slide 39 text

hello world riko

Slide 40

Slide 40 text

>>> from riko.collections.sync import SyncPipe
>>>
>>> url = 'file:///file.txt'
>>> conf = {'delimiter': ' '}
>>> rule = {'transform': 'lower'}
>>>
>>> stream = (SyncPipe('fetchtext', conf={'url': url})
...     .stringtokenizer(conf=conf, emit=True)
...     .strtransform(conf={'rule': rule})
...     .count(conf={'count_key': 'strtransform'})
...     .output)
>>>
>>> next(stream)
{'"de': 1}
>>> next(stream)
{'"lorem': 1}
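To show how a chained pipe like this can sit on top of ordinary generators, here is a toy sketch. The `Pipe` class and its `tokenize`, `lower` and `count` methods are hypothetical stand-ins written for this example; they are not riko's `SyncPipe` or its implementation, and the sorted output order is a choice made here.

```python
from collections import Counter


class Pipe:
    """Hypothetical stand-in for a chained pipe; not riko's SyncPipe."""

    def __init__(self, items):
        self.output = iter(items)

    def tokenize(self, delimiter=' '):
        # Split every text item into words.
        return Pipe(
            word for text in self.output for word in text.split(delimiter))

    def lower(self):
        # Lowercase every word.
        return Pipe(word.lower() for word in self.output)

    def count(self):
        # Counting needs the whole stream, so it runs on the first next().
        def counted(words):
            for word, total in sorted(Counter(words).items()):
                yield {word: total}

        return Pipe(counted(self.output))


text = ['Lorem Ipsum is simply dummy text', 'lorem ipsum has been']
stream = Pipe(text).tokenize().lower().count().output

first = next(stream)
rest = list(stream)
print(first)  # {'been': 1}
```

Each method returns a new `Pipe` wrapping a generator, so no work happens until something calls `next` on `.output`.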

Slide 41

Slide 41 text

let's get some data

Slide 42

Slide 42 text

Code for South Africa (data.code4sa.org)

Slide 43

Slide 43 text

home page

Slide 44

Slide 44 text

Data catalogue

Slide 45

Slide 45 text

API access

Slide 46

Slide 46 text

>>> from riko.modules.fetchdata import pipe
>>> url = 'data.code4sa.org/resource/6rgz-ak57.json'
>>> stream = pipe(conf={'url': url})
>>> next(stream)
{'crime': 'All theft not mentioned elsewhere',
 'incidents': '2266.0',
 'police_station': 'Bellville',
 'province': 'WC',
 'year': '2014'}
>>> next(stream)
{'crime': 'Drug-related crime',
 'incidents': '2578.0',
 'police_station': 'Bishop Lavis',
 'province': 'WC',
 'year': '2014'}

Slide 47

Slide 47 text

filtering & truncating

Slide 48

Slide 48 text

>>> from riko.collections.sync import SyncPipe
>>> filter_conf = {
...     'rule': {
...         'field': 'province',
...         'op': 'eq',
...         'value': 'GP'}}
>>> sort_conf = {
...     'rule': {
...         'sort_key': 'incidents', 'sort_dir': 'desc'}}
>>> stream = (SyncPipe('fetchdata', conf={'url': url})
...     .filter(conf=filter_conf)
...     .sort(conf=sort_conf)
...     .truncate(conf={'count': '5'})
...     .output)

Slide 49

Slide 49 text

>>> next(stream)
{'crime': 'All theft not mentioned elsewhere',
 'incidents': '3339.0',
 'police_station': 'Pretoria Central',
 'province': 'GP',
 'year': '2014'}
>>> next(stream)
{'crime': 'Drug-related crime',
 'incidents': '3125.0',
 'police_station': 'Eldorado Park',
 'province': 'GP',
 'year': '2014'}
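The same filter, sort and truncate steps can be sketched in plain Python on the four records shown on these slides. Two assumptions in this sketch: the `incidents` strings are converted to floats before sorting, and the records are held in a local list instead of being fetched.

```python
from itertools import islice

# The four records shown on the slides, in arbitrary order.
records = [
    {'crime': 'All theft not mentioned elsewhere', 'incidents': '2266.0',
     'police_station': 'Bellville', 'province': 'WC', 'year': '2014'},
    {'crime': 'Drug-related crime', 'incidents': '3125.0',
     'police_station': 'Eldorado Park', 'province': 'GP', 'year': '2014'},
    {'crime': 'Drug-related crime', 'incidents': '2578.0',
     'police_station': 'Bishop Lavis', 'province': 'WC', 'year': '2014'},
    {'crime': 'All theft not mentioned elsewhere', 'incidents': '3339.0',
     'police_station': 'Pretoria Central', 'province': 'GP', 'year': '2014'},
]

# filter: keep Gauteng (GP) records only
gp_only = (record for record in records if record['province'] == 'GP')

# sort: descending by incident count (assumed numeric comparison)
by_incidents = sorted(
    gp_only, key=lambda record: float(record['incidents']), reverse=True)

# truncate: keep at most 5 records
top = list(islice(by_incidents, 5))

print([record['police_station'] for record in top])
# ['Pretoria Central', 'Eldorado Park']
```

Note that sorting has to read the whole filtered stream before it can emit anything, while filtering and truncating work item by item.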

Slide 50

Slide 50 text

joins

Slide 51

Slide 51 text

>>> from riko.modules import fetchdata, join
>>>
>>> url2 = 'data.code4sa.org/resource/qtx7-xbrs.json'
>>> stream = fetchdata.pipe(conf={'url': url})
>>> stream2 = fetchdata.pipe(conf={'url': url2})
>>> conf = {
...     'join_key': 'police_station',
...     'other_join_key': 'station'}
>>> joined = join.pipe(stream, conf=conf, other=stream2)
>>> next(stream2)
{'station': 'Aberdeen', 'sum_2014_2015': '1153'}

Slide 52

Slide 52 text

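>>> next(joined)
{'crime': 'All theft not mentioned elsewhere',
 'incidents': '2266.0',
 'police_station': 'Bellville',
 'province': 'WC',
 'station': 'Bellville',
 'sum_2014_2015': '28989',
 'year': '2014'}

(slide labels: 'crime', 'incidents', 'police_station', 'province' and 'year' are stream data; 'station' and 'sum_2014_2015' are stream2 data)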

Slide 53

Slide 53 text

>>> next(joined)
{'crime': 'Drug-related crime',
 'incidents': '2578.0',
 'police_station': 'Bishop Lavis',
 'province': 'WC',
 'station': 'Bishop Lavis',
 'sum_2014_2015': '24983',
 'year': '2014'}
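A join like the one above can be sketched as a simple hash join in plain Python. This assumes inner-join behaviour (unmatched records are dropped) and is not taken from riko's join module; `hash_join` is a name made up for this sketch, and the records are the ones shown on the slides.

```python
def hash_join(stream, other, join_key, other_join_key):
    """Merge each item in stream with the item in other sharing its key."""
    # Index the second stream by its join key.
    lookup = {item[other_join_key]: item for item in other}

    for item in stream:
        match = lookup.get(item[join_key])

        if match is not None:
            # Combine the fields of both records into one dict.
            yield {**item, **match}


stream = [
    {'crime': 'All theft not mentioned elsewhere', 'incidents': '2266.0',
     'police_station': 'Bellville', 'province': 'WC', 'year': '2014'},
    {'crime': 'Drug-related crime', 'incidents': '2578.0',
     'police_station': 'Bishop Lavis', 'province': 'WC', 'year': '2014'},
]

stream2 = [
    {'station': 'Aberdeen', 'sum_2014_2015': '1153'},
    {'station': 'Bellville', 'sum_2014_2015': '28989'},
    {'station': 'Bishop Lavis', 'sum_2014_2015': '24983'},
]

joined = list(hash_join(stream, stream2, 'police_station', 'station'))
print(joined[0]['station'], joined[0]['sum_2014_2015'])  # Bellville 28989
```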

Slide 54

Slide 54 text

riko's many paradigms

Slide 55

Slide 55 text

async API (Twisted)

Slide 56

Slide 56 text

pip install riko[async]

Slide 57

Slide 57 text

>>> from riko.bado import coroutine, react
>>> from riko.collections.async import AsyncCollection
>>> sources = [
...     {'url': url, 'type': 'fetchdata'},
...     {'url': url2, 'type': 'fetchdata'}]
>>> flow = AsyncCollection(sources)
>>>
>>> @coroutine
... def run(reactor):
...     stream = yield flow.async_fetch()
...     print(next(stream))
>>>
>>> react(run)
{'police_station': 'Bellville', 'crime': 'All theft...', 'year': '2014', 'province': 'WC', 'incidents': '2266.0'}
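The idea of fetching several sources concurrently and then reading them as one stream can be sketched with the standard library's asyncio. Note the swap: this uses asyncio in place of Twisted, the `fetch` coroutine only simulates I/O, and the `FAKE_SOURCES` data is a stand-in built from records shown on the slides, so no network access happens.

```python
import asyncio

# Stand-in data keyed by the two URLs from the slides; nothing is downloaded.
FAKE_SOURCES = {
    'data.code4sa.org/resource/6rgz-ak57.json': [
        {'police_station': 'Bellville', 'province': 'WC',
         'incidents': '2266.0', 'year': '2014'},
    ],
    'data.code4sa.org/resource/qtx7-xbrs.json': [
        {'station': 'Aberdeen', 'sum_2014_2015': '1153'},
    ],
}


async def fetch(url):
    """Simulated fetch: yield to the event loop, then return canned data."""
    await asyncio.sleep(0)
    return FAKE_SOURCES[url]


async def fetch_all(urls):
    """Fetch every source concurrently and chain the results into a stream."""
    results = await asyncio.gather(*(fetch(url) for url in urls))
    return (item for result in results for item in result)


stream = asyncio.run(fetch_all(list(FAKE_SOURCES)))
first = next(stream)
print(first['police_station'])  # Bellville
```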

Slide 58

Slide 58 text

parallel API (threads)

Slide 59

Slide 59 text

>>> from riko.collections.sync import SyncCollection
>>>
>>> flow = SyncCollection(sources, parallel=True)
>>> stream = flow.list
>>> stream[0]
{'crime': 'All theft not mentioned elsewhere',
 'incidents': '2266.0',
 'police_station': 'Bellville',
 'province': 'WC',
 'year': '2014'}
>>> stream[-1]
{'station': 'Tierpoort', 'sum_2014_2015': '327'}

Slide 60

Slide 60 text

parallel API (processes)

Slide 61

Slide 61 text

>>> kwargs = {'parallel': True, 'threads': False}
>>> flow = SyncCollection(sources, **kwargs)
>>> stream = flow.list
>>> stream[0]
{'crime': 'All theft not mentioned elsewhere',
 'incidents': '2266.0',
 'police_station': 'Bellville',
 'province': 'WC',
 'year': '2014'}
>>> stream[-1]
{'station': 'Tierpoort', 'sum_2014_2015': '327'}
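What "parallel" means here can be sketched with the standard library's concurrent.futures. This is an illustration with a thread pool and canned data, not riko's implementation; the `fetch` function and the `data` key on each source are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources carrying canned records instead of URLs.
sources = [
    {'type': 'fetchdata', 'data': [
        {'police_station': 'Bellville', 'province': 'WC',
         'incidents': '2266.0', 'year': '2014'}]},
    {'type': 'fetchdata', 'data': [
        {'station': 'Tierpoort', 'sum_2014_2015': '327'}]},
]


def fetch(source):
    """Stand-in for a slow fetch; returns the source's canned records."""
    return list(source['data'])


# Each source is fetched in its own worker thread.
with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in the same order as the sources.
    results = list(pool.map(fetch, sources))

stream = [item for result in results for item in result]
print(stream[0]['police_station'], stream[-1]['station'])
# Bellville Tierpoort
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` gives the process-based variant, which suits CPU-bound work, while threads are usually enough for I/O-bound fetching.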

Slide 62

Slide 62 text

Head to Head

                  Spark, etc.  Huginn    riko
installation      complex      moderate  simple
push/pull         push         push      pull
native ingestors  few          many      many
parallel          ✔            ✔         ✔
async                                    ✔
distributed       ✔

Slide 63

Slide 63 text

Thanks!
Reuben Cummings
https://github.com/nerevu/riko
@reubano #PyConZA16

Slide 64

Slide 64 text

fetching RSS feeds

Slide 65

Slide 65 text

>>> from riko.modules.fetchsitefeed import pipe
>>>
>>> url = 'arstechnica.com/rss-feeds/'
>>> stream = pipe(conf={'url': url})
>>> item = next(stream)
>>> item.keys()
dict_keys(['tags', 'summary_detail', 'author.name', 'y:published', 'content', 'title', 'pubDate', 'id', 'summary', 'authors', 'links', 'y:id', 'author', 'link', 'published'])
>>> item['title'], item['author'], item['id']
('Gravity doesn’t care about quantum spin', 'Chris Lee', 'http://arstechnica.com/?p=924009')