Stream processing made easy with riko

Talk on stream processing given at PyConZA16

Reuben Cummings

October 06, 2016

Transcript

  1. Stream processing made easy with riko
    PyConZA - Cape Town, SA
    Oct 6, 2016
    by Reuben Cummings
    @reubano #PyConZA16
  2. Who am I?
    Managing Director, Nerevu Development
    Programming in Python since 2011
    Author of several popular packages
    Flask over Django, Twisted over Tornado, functions over classes
  3. streams

  4-18. basic stream process (animation frames): the items 1, 2 and 3 flow one at a time through "* 2", then the filter "> 2", then "sum". The first result (2) is dropped by the filter, 4 and 6 pass, and the final sum is 10.
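
The animation above can be sketched in plain Python with chained generator expressions. This is a minimal illustration rather than code from the talk; the names `source`, `doubled` and `kept` are made up for the example.

```python
# Sketch of the "* 2", "> 2", "sum" pipeline from the animation.
# Each stage is a lazy generator, so items flow through one at a time.
source = iter([1, 2, 3])

doubled = (x * 2 for x in source)      # yields 2, 4, 6
kept = (x for x in doubled if x > 2)   # yields 4, 6 (the 2 is dropped)
total = sum(kept)                      # pulls every item through both stages

print(total)  # 10
```

Nothing is computed until `sum` starts pulling items, which is the same pull-based behaviour the later slides attribute to riko.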

  19. constructing streams

  20. >>> 'abracadabra'[0]
      'a'
      >>> range(1, 11)[0]
      1
      >>> 'hello pycon attendees'.split(' ')[0]
      'hello'
      >>> [{'x': x} for x in range(4)][0]
      {'x': 0}
      >>> ({'num': x} for x in range(4))
      <generator object <genexpr> at 0x103c10830>
      >>> next({'num': x} for x in range(4))
      {'num': 0}
  21. processing streams

  22. >>> [ord(x) for x in 'abracadabra']
      [97, 98, 114, 97, 99, 97, 100, 97, 98, 114, 97]
      >>> [2 * x for x in range(1, 11)]
      [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
      >>> [x for x in range(1, 11) if x > 5]
      [6, 7, 8, 9, 10]
      >>> stream = ({'num': x} for x in range(4))
      >>> sum(s['num'] for s in stream)
      6
  23. so what!

  24. RSS feeds (feedly)

  25. aggregators (kayak)

  26. mashups (portwiture)

  27. frameworks

  28. What is Lorem Ipsum? Lorem Ipsum is simply dummy text

    of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap...
  29. hello world Hadoop

  30. >>> import re
      >>> from mrjob.job import MRJob
      >>> from mrjob.step import MRStep
      >>>
      >>> WORD_RE = re.compile(r"[\w']+")
      >>>
      >>>
      >>> class MRWordCount(MRJob):
      ...     def steps(self):
      ...         kwargs = {
      ...             'mapper': self.mapper,
      ...             'combiner': self.combiner,
      ...             'reducer': self.reducer}
      ...
      ...         return [MRStep(**kwargs)]
  31. ...     def mapper(self, _, line):
      ...         for word in WORD_RE.findall(line):
      ...             yield word.lower(), 1
      ...
      ...     def combiner(self, word, counts):
      ...         yield word, sum(counts)
      ...
      ...     def reducer(self, word, counts):
      ...         yield word, sum(counts)
      >>>
      >>> if __name__ == '__main__':
      ...     MRWordCount.run()
  32. hello world Spark

  33. >>> from operator import add
      >>> from pyspark.sql import SparkSession
      >>>
      >>> spark = SparkSession.builder.getOrCreate()
      >>>
      >>> stream = (spark.read.text('hdfs://file.txt').rdd
      ...     .map(lambda row: row[0])  # each Row holds one line of text
      ...     .flatMap(lambda line: line.split(' '))
      ...     .map(lambda word: (word.lower(), 1))
      ...     .reduceByKey(add)
      ...     .collect())
      >>>
      >>> stream[0]
      ('"de', 1)
  34. introducing riko github.com/nerevu/riko

  35. riko

  36. riko { 'greeting':'hello', 'location':'pycon', 'enthusiasm': 3 }

  37. basic usage

  38. pip install riko

  39. hello world riko

  40. >>> from riko.collections.sync import SyncPipe
      >>>
      >>> url = 'file:///file.txt'
      >>> conf = {'delimiter': ' '}
      >>> rule = {'transform': 'lower'}
      >>>
      >>> stream = (SyncPipe('fetchtext', conf={'url': url})
      ...     .stringtokenizer(conf=conf, emit=True)
      ...     .strtransform(conf={'rule': rule})
      ...     .count(conf={'count_key': 'strtransform'})
      ...     .output)
      >>>
      >>> next(stream)
      {'"de': 1}
      >>> next(stream)
      {'"lorem': 1}
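
For comparison, the same word count can be written with only the standard library. This is a sketch of what the pipeline above computes, not of how riko implements it; the sample text and the `word_counts` helper are assumptions made for the example.

```python
from collections import Counter


def word_counts(text, delimiter=' '):
    """Split on the delimiter, lowercase each token, and count occurrences."""
    tokens = (token.lower() for token in text.split(delimiter))
    return Counter(tokens)


# Hypothetical sample text standing in for the contents of file.txt.
sample = 'Lorem Ipsum is simply dummy text and Lorem Ipsum is old'
counts = word_counts(sample)

print(counts['lorem'])  # 2
print(counts['dummy'])  # 1
```

The three steps (tokenize, transform, count) map one-to-one onto the `stringtokenizer`, `strtransform` and `count` stages named on the slide.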
  41. let's get some data

  42. Code for South Africa (data.code4sa.org)

  43. home page

  44. Data catalogue

  45. API access

  46. >>> from riko.modules.fetchdata import pipe
      >>>
      >>> url = 'data.code4sa.org/resource/6rgz-ak57.json'
      >>> stream = pipe(conf={'url': url})
      >>> next(stream)
      {'crime': 'All theft not mentioned elsewhere',
       'incidents': '2266.0',
       'police_station': 'Bellville',
       'province': 'WC',
       'year': '2014'}
      >>> next(stream)
      {'crime': 'Drug-related crime',
       'incidents': '2578.0',
       'police_station': 'Bishop Lavis',
       'province': 'WC',
       'year': '2014'}
  47. filtering & truncating

  48. >>> from riko.collections.sync import SyncPipe
      >>>
      >>> sort_conf = {
      ...     'rule': {
      ...         'sort_key': 'incidents',
      ...         'sort_dir': 'desc'}}
      >>> filter_conf = {
      ...     'rule': {
      ...         'field': 'province',
      ...         'op': 'eq',
      ...         'value': 'GP'}}
      >>> stream = (SyncPipe('fetchdata', conf={'url': url})
      ...     .filter(conf=filter_conf)
      ...     .sort(conf=sort_conf)
      ...     .truncate(conf={'count': '5'})
      ...     .output)
  49. >>> next(stream)
      {'crime': 'All theft not mentioned elsewhere',
       'incidents': '3339.0',
       'police_station': 'Pretoria Central',
       'province': 'GP',
       'year': '2014'}
      >>> next(stream)
      {'crime': 'Drug-related crime',
       'incidents': '3125.0',
       'police_station': 'Eldorado Park',
       'province': 'GP',
       'year': '2014'}
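
The filter, sort and truncate steps above can be approximated in plain Python. This is a sketch under stated assumptions, not riko's implementation: the `top_incidents` helper is hypothetical, the four records are the ones shown on the slides, and the numeric cast is this example's own choice because `incidents` arrives as a string.

```python
from itertools import islice

# Records copied from the slides; the real feed has many more rows.
records = [
    {'crime': 'All theft not mentioned elsewhere', 'incidents': '2266.0',
     'police_station': 'Bellville', 'province': 'WC', 'year': '2014'},
    {'crime': 'Drug-related crime', 'incidents': '3125.0',
     'police_station': 'Eldorado Park', 'province': 'GP', 'year': '2014'},
    {'crime': 'Drug-related crime', 'incidents': '2578.0',
     'police_station': 'Bishop Lavis', 'province': 'WC', 'year': '2014'},
    {'crime': 'All theft not mentioned elsewhere', 'incidents': '3339.0',
     'police_station': 'Pretoria Central', 'province': 'GP', 'year': '2014'},
]


def top_incidents(rows, province, count):
    """Keep one province, sort by incidents (descending), take `count` rows."""
    matching = (row for row in rows if row['province'] == province)
    # 'incidents' is a string in the feed, so compare it numerically.
    ranked = sorted(matching, key=lambda row: float(row['incidents']),
                    reverse=True)
    return list(islice(ranked, count))


top = top_incidents(records, province='GP', count=5)
print([row['police_station'] for row in top])
# ['Pretoria Central', 'Eldorado Park']
```

Sorting is the one stage that cannot stay lazy: every matching row has to be seen before the first result can be emitted.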
  50. joins

  51. >>> from riko.modules import fetchdata, join
      >>>
      >>> url2 = 'data.code4sa.org/resource/qtx7-xbrs.json'
      >>> stream = fetchdata.pipe(conf={'url': url})
      >>> stream2 = fetchdata.pipe(conf={'url': url2})
      >>> conf = {
      ...     'join_key': 'police_station',
      ...     'other_join_key': 'station'}
      >>> joined = join.pipe(stream, conf=conf, other=stream2)
      >>> next(stream2)
      {'station': 'Aberdeen', 'sum_2014_2015': '1153'}
  52. >>> next(joined)
      {'crime': 'All theft not mentioned elsewhere',
       'incidents': '2266.0',
       'police_station': 'Bellville',
       'province': 'WC',
       'station': 'Bellville',
       'sum_2014_2015': '28989',
       'year': '2014'}
      (slide labels: stream data, stream2 data)
  53. >>> next(joined)
      {'crime': 'Drug-related crime',
       'incidents': '2578.0',
       'police_station': 'Bishop Lavis',
       'province': 'WC',
       'station': 'Bishop Lavis',
       'sum_2014_2015': '24983',
       'year': '2014'}
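
A join like the one above can be sketched as a simple hash join over two streams of dicts. The talk does not say how riko's `join` handles unmatched rows or clashing field names, so this example assumes an inner join in which right-hand fields win; `hash_join` is a hypothetical helper, and the rows are taken from the slides.

```python
def hash_join(left, right, left_key, right_key):
    """Inner-join two streams of dicts on left[left_key] == right[right_key]."""
    # Index the right-hand stream once so each left row is matched quickly.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    for row in left:
        for match in index.get(row[left_key], []):
            # Merge the two dicts; right-hand fields win on a name clash.
            yield {**row, **match}


stream = iter([
    {'crime': 'All theft not mentioned elsewhere', 'incidents': '2266.0',
     'police_station': 'Bellville', 'province': 'WC', 'year': '2014'},
    {'crime': 'Drug-related crime', 'incidents': '2578.0',
     'police_station': 'Bishop Lavis', 'province': 'WC', 'year': '2014'},
])
stream2 = iter([
    {'station': 'Aberdeen', 'sum_2014_2015': '1153'},
    {'station': 'Bellville', 'sum_2014_2015': '28989'},
    {'station': 'Bishop Lavis', 'sum_2014_2015': '24983'},
])

joined = hash_join(stream, stream2, 'police_station', 'station')
first = next(joined)
print(first['station'], first['sum_2014_2015'])  # Bellville 28989
```

The right-hand stream is read in full before the first joined row appears, while the left-hand stream stays lazy.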
  54. riko's many paradigms

  55. async API (Twisted)

  56. pip install riko[async]

  57. >>> from riko.bado import coroutine, react
      >>> from riko.collections.async import AsyncCollection
      >>>
      >>> sources = [
      ...     {'url': url, 'type': 'fetchdata'},
      ...     {'url': url2, 'type': 'fetchdata'}]
      >>> flow = AsyncCollection(sources)
      >>>
      >>> @coroutine
      ... def run(reactor):
      ...     stream = yield flow.async_fetch()
      ...     print(next(stream))
      >>>
      >>> react(run)
      {'police_station': 'Bellville', 'crime': 'All theft...', 'year': '2014', 'province': 'WC', 'incidents': '2266.0'}
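
The idea behind the async API, starting several fetches and combining the results once they are all in, can be illustrated with the standard library. Note that this sketch swaps Twisted for `asyncio` and replaces the network calls with in-memory stand-ins, so `fetch`, `fetch_all` and the `rows` key are all assumptions of the example. Note also that `async` has been a reserved word since Python 3.7, so the `riko.collections.async` import on slide 57 only parses on older interpreters.

```python
import asyncio


async def fetch(source):
    """Stand-in for a network fetch; yields control as real I/O would."""
    await asyncio.sleep(0)
    return source['rows']


async def fetch_all(sources):
    """Fetch every source concurrently and chain the rows in source order."""
    results = await asyncio.gather(*(fetch(source) for source in sources))
    return [row for rows in results for row in rows]


# Hypothetical in-memory sources replacing the two URLs from the slides.
sources = [
    {'rows': [{'police_station': 'Bellville', 'incidents': '2266.0'}]},
    {'rows': [{'station': 'Tierpoort', 'sum_2014_2015': '327'}]},
]

stream = asyncio.run(fetch_all(sources))
print(stream[0]['police_station'])  # Bellville
```

`asyncio.gather` returns results in the order the awaitables were passed in, so the combined stream is deterministic even though the fetches overlap.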
  58. parallel API (threads)

  59. >>> from riko.collections.sync import SyncCollection
      >>>
      >>> flow = SyncCollection(sources, parallel=True)
      >>> stream = flow.list
      >>> stream[0]
      {'crime': 'All theft not mentioned elsewhere',
       'incidents': '2266.0',
       'police_station': 'Bellville',
       'province': 'WC',
       'year': '2014'}
      >>> stream[-1]
      {'station': 'Tierpoort', 'sum_2014_2015': '327'}
  60. parallel API (processes)

  61. >>> kwargs = {'parallel': True, 'threads': False}
      >>> flow = SyncCollection(sources, **kwargs)
      >>> stream = flow.list
      >>> stream[0]
      {'crime': 'All theft not mentioned elsewhere',
       'incidents': '2266.0',
       'police_station': 'Bellville',
       'province': 'WC',
       'year': '2014'}
      >>> stream[-1]
      {'station': 'Tierpoort', 'sum_2014_2015': '327'}
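
The thread and process variants above follow a familiar pattern: load each source in a worker, then chain the results. A minimal standard-library sketch, assuming in-memory sources in place of the two URLs (`load`, `parallel_list` and the `rows` key are hypothetical and say nothing about riko's internals):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def load(source):
    """Stand-in for fetching one source; returns its rows."""
    return source['rows']


def parallel_list(sources, max_workers=2):
    """Load all sources in worker threads and flatten the rows into one list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so rows stay grouped by source.
        return list(chain.from_iterable(pool.map(load, sources)))


# Hypothetical in-memory sources replacing the two URLs from the slides.
sources = [
    {'rows': [{'police_station': 'Bellville', 'incidents': '2266.0'}]},
    {'rows': [{'station': 'Tierpoort', 'sum_2014_2015': '327'}]},
]

stream = parallel_list(sources)
print(stream[-1])  # {'station': 'Tierpoort', 'sum_2014_2015': '327'}
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` gives the process-based version; in that case `load` must be a picklable top-level function and the call should sit under an `if __name__ == '__main__':` guard on platforms that spawn new interpreters.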
  62. Head to Head

                          Spark, etc.   Huginn     riko
      installation        complex       moderate   simple
      push/pull           push          push       pull
      native ingestors    few           many       many
      parallel            ✔             ✔          ✔
      async                                        ✔
      distributed         ✔
  63. Reuben Cummings reubano@gmail.com https://github.com/nerevu/riko @reubano #PyConZA16 Thanks!

  64. fetching RSS feeds

  65. >>> from riko.modules.fetchsitefeed import pipe
      >>>
      >>> url = 'arstechnica.com/rss-feeds/'
      >>> stream = pipe(conf={'url': url})
      >>> item = next(stream)
      >>> item.keys()
      dict_keys(['tags', 'summary_detail', 'author.name', 'y:published',
                 'content', 'title', 'pubDate', 'id', 'summary', 'authors',
                 'links', 'y:id', 'author', 'link', 'published'])
      >>> item['title'], item['author'], item['id']
      ('Gravity doesn’t care about quantum spin', 'Chris Lee', 'http://arstechnica.com/?p=924009')