Stream processing made easy with riko

8th Light University talk on stream processing with riko

Reuben Cummings

May 30, 2017

Transcript

  1. Stream processing
    made easy with riko
    8th Light University - Chicago, IL
    May 30, 2017
    by Reuben Cummings
    @reubano #8LU

  2. Who am I?
    Managing Director, Nerevu Development
    Programming in Python since 2011
    Author of several popular packages

  3. streams

  4. basic stream process
    1
    * 2 > 2 sum
    2
    3

  5. 2
    basic stream process
    * 2 > 2 sum
    2
    3

  6. 4
    basic stream process
    * 2 > 2 sum
    3

  7. 6 4
    basic stream process
    * 2 > 2 sum

  8. 4
    6
    basic stream process
    * 2 > 2 sum

  9. 10
    basic stream process
    * 2 > 2 sum
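
The frames above can be written as a chain of lazy generator steps. This is a minimal sketch in plain Python, assuming the three boxes on the slide mean "multiply by 2", "keep values greater than 2", and "sum"; the function names are made up for illustration.

```python
def double(stream):
    """Map step: multiply each item by 2."""
    for item in stream:
        yield item * 2


def greater_than_two(stream):
    """Filter step: pass along only items greater than 2."""
    for item in stream:
        if item > 2:
            yield item


source = iter([1, 2, 3])
# Nothing runs until sum() starts pulling items through the chain.
total = sum(greater_than_two(double(source)))
print(total)  # 10
```

Each item travels through the whole chain before the next one is read, which is what the frame-by-frame slides appear to depict.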

  11. constructing
    streams

  12. >>> 'abracadabra'[0]
    'a'
    >>> 'hello 8th Light University'.split(' ')[0]
    'hello'
    >>> range(1, 11)[0]
    1
    >>> [{'x': x} for x in range(4)][0]
    {'x': 0}
    >>> ({'x': x} for x in range(4))
    <generator object <genexpr> at 0x103c10830>
    >>> next({'x': x} for x in range(4))
    {'x': 0}

  13. processing
    streams

  14. >>> [ord(x) for x in 'abracadabra']
    [97, 98, 114, 97, 99, 97, 100, 97, 98, 114,
    97]
    >>> [2 * x for x in range(1, 11)]
    [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
    >>> [x for x in range(1, 11) if x > 5]
    [6, 7, 8, 9, 10]
    >>> stream = ({'num': x} for x in range(4))
    >>> sum(s['num'] for s in stream)
    6
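
One detail the last example relies on: a generator-backed stream can only be consumed once. The sketch below reuses the slide's `stream` name and adds a hypothetical `make_stream` factory (not from the talk) to show the second pass coming back empty and one way to get a fresh stream.

```python
def make_stream():
    """Hypothetical helper: build a fresh generator of items on each call."""
    return ({'num': x} for x in range(4))


stream = make_stream()
first_pass = sum(s['num'] for s in stream)
# The generator is now exhausted, so a second pass sees no items.
second_pass = sum(s['num'] for s in stream)

# Rebuilding the stream makes the items available again.
fresh_pass = sum(s['num'] for s in make_stream())
print(first_pass, second_pass, fresh_pass)  # 6 0 6
```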

  15. so what!

  16. RSS feeds (feedly)

  17. aggregators (kayak)

  18. news feeds (linkedin)

  19. frameworks

  20. $ cat file.txt
    What is Lorem Ipsum?
    Lorem Ipsum is simply dummy text of the
    printing and typesetting industry. Lorem Ipsum
    has been the industry's standard dummy text
    ever since the 1500s, when an unknown printer
    took a galley of type and scrambled it to make
    a type specimen book. It has survived not only
    five centuries, but also the leap...

  21. hello world
    Hadoop

  22. >>> from mrjob.job import MRJob
    >>> from mrjob.step import MRStep
    >>>
    >>>
    >>> class MRWordCount(MRJob):
    ...     def steps(self):
    ...         kwargs = {
    ...             'mapper': self.mapper,
    ...             'combiner': self.combiner,
    ...             'reducer': self.reducer}
    ...
    ...         return [MRStep(**kwargs)]
    ...
    ...     def mapper(self, _, line):

  23. >>>
    >>> class MRWordCount(MRJob):
    ...     def steps(self):
    ...         kwargs = {
    ...             'mapper': self.mapper,
    ...             'combiner': self.combiner,
    ...             'reducer': self.reducer}
    ...
    ...         return [MRStep(**kwargs)]
    ...
    ...     def mapper(self, _, line):
    ...         for word in line.split(' '):
    ...             yield word.lower(), 1
    ...
    ...     def combiner(self, word, counts):

  24. ...         return [MRStep(**kwargs)]
    ...
    ...     def mapper(self, _, line):
    ...         for word in line.split(' '):
    ...             yield word.lower(), 1
    ...
    ...     def combiner(self, word, counts):
    ...         yield word, sum(counts)
    ...
    ...     def reducer(self, word, counts):
    ...         yield word, sum(counts)
    >>>
    >>> if __name__ == '__main__':
    ...     MRWordCount.run()

  25. $ python hadoop_job.py file.txt
    '1500s' 1
    'a' 2
    'an' 1
    'and' 1
    'been' 1
    'book' 1
    'but' 1
    'centuries' 1
    'dummy' 1
    'ever' 1
    'five' 1
    'galley' 1
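
To see what the mapper and reducer contribute without installing mrjob or Hadoop, the same map, group, and reduce steps can be imitated with the standard library. This is only a local sketch: the sort-then-`groupby` step is an assumed stand-in for Hadoop's shuffle phase, and the two input lines are a shortened sample.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for each space-separated word, lowercased."""
    for word in line.split(' '):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for a single word."""
    return word, sum(counts)


lines = ['What is Lorem Ipsum?', 'Lorem Ipsum is simply dummy text']
pairs = [pair for line in lines for pair in mapper(line)]

# Stand-in for the shuffle phase: sort by key so equal words are adjacent.
pairs.sort(key=itemgetter(0))
grouped = groupby(pairs, key=itemgetter(0))
counts = dict(reducer(word, (n for _, n in group)) for word, group in grouped)
print(counts['lorem'], counts['is'])  # 2 2
```

Note that splitting on spaces alone keeps punctuation attached, so `'ipsum?'` and `'ipsum'` are counted as different words here.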

  26. hello world
    Spark

  27. >>> from operator import add
    >>> from pyspark.sql import SparkSession
    >>>
    >>> spark = SparkSession.builder.getOrCreate()
    >>> fpath = 'hdfs:///file.txt'
    >>>
    >>> stream = (spark.read.text(fpath).rdd
    ...     .map(lambda row: row[0])  # each Row holds one line of text
    ...     .flatMap(lambda line: line.split(' '))
    ...     .map(lambda word: (word.lower(), 1))
    ...     .reduceByKey(add)
    ...     .collect())

  28. >>> stream[0]
    ('1500s', 1)

  29. introducing
    riko
    github.com/nerevu/riko

  30. github.com/nerevu/riko

  31. riko

  32. riko
    {
    'greeting':'hello',
    'location':'8th Light',
    'enthusiasm': 9
    }

  33. pip install riko

  34. obtaining data

  35. 8th Light Blog
    https://8thlight.com/blog/feed/
    rss.xml

  37. >>> from riko.modules.fetch import pipe
    >>>
    >>> url = 'https://8thlight.com/blog/feed'
    >>> url += '/rss.xml'
    >>> stream = pipe(conf={'url': url})
    >>> item = next(stream)
    >>> item['author']
    'Rabea Gleissner'
    >>> item['published']
    'Fri, 26 May 2017 00:00:00 -0500'
    >>> item['title']
    'How to set up a React project without flipping tables'

  38. transforming
    data

  39. >>> from riko.collections import SyncPipe
    >>>
    >>> frule = {
    ...     'field': 'title',
    ...     'op': 'contains',
    ...     'value': 'erlang'}
    >>>
    >>> stream = (
    ...     SyncPipe('fetch', conf={'url': url})
    ...     .filter(conf={'rule': frule})
    ...     .output)

  40. >>> item = next(stream)
    >>> item['title']
    'The Core of Erlang'
    >>> item['tags'][0]['term']
    'Coding'
    >>> item['link']
    'https://8thlight.com/blog/kofi-gumbs/2017/05/02/core-erlang.html'
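
For intuition about what a rule such as `frule` does, here is a rough pure-Python analogue. It is not riko's implementation: `matches` and `filter_stream` are hypothetical helpers, only the `contains` operator is handled, and the case-insensitive comparison is an assumption made so that `'erlang'` matches the title shown above.

```python
def matches(item, rule):
    """Return True if the item's field satisfies the rule (only 'contains')."""
    field_value = str(item.get(rule['field'], ''))
    if rule['op'] == 'contains':
        # Case-insensitive match is an assumption of this sketch.
        return rule['value'].lower() in field_value.lower()
    raise ValueError('unsupported op: {}'.format(rule['op']))


def filter_stream(stream, rule):
    """Lazily yield only the items that match the rule."""
    return (item for item in stream if matches(item, rule))


frule = {'field': 'title', 'op': 'contains', 'value': 'erlang'}
items = [
    {'title': 'How to set up a React project without flipping tables'},
    {'title': 'The Core of Erlang'},
]
result = next(filter_stream(iter(items), frule))
print(result['title'])  # The Core of Erlang
```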

  41. hello world riko

  42. >>> conf = {'url': 'file:///file.txt'}
    >>> tconf = {'delimiter': ' '}
    >>> rule = {'transform': 'lower'}
    >>> cconf = {'count_key': 'strtransform'}
    >>>
    >>> stream = (SyncPipe('fetchtext', conf=conf)
    ...     .tokenizer(conf=tconf, emit=True)
    ...     .strtransform(conf={'rule': rule})
    ...     .count(conf=cconf)
    ...     .output)
    >>>
    >>> next(stream)
    {'1500s': 1}

  43. parallel
    processing

  44. The Core of Erlang
    https://8thlight.com/blog/kofi-gumbs/2017/05/02/core-erlang.html

  45. >>> from riko.modules import xpathfetchpage
    >>>
    >>> pipe = xpathfetchpage.pipe
    >>> xpath = '/html/body/section/div/div[1]'
    >>> xpath += '/div/div/article/div[3]/div'
    >>> xpath += '/ul[1]/li/a'
    >>>
    >>> xconf = {
    ...     'url': item['link'], 'xpath': xpath}
    >>>
    >>> stream = pipe(conf=xconf)

  46. >>> next(stream)
    {'content': "Two Design Patterns You're Probably Already Using",
     'href': '/blog/becca-nelson/2017/05/22/two-design-patterns-youre-probably-already-using.html'}
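
The shape of that item (link text under `content`, the attribute under `href`) can be reproduced offline with the standard library. This sketch assumes a tiny, made-up, well-formed page and uses `xml.etree.ElementTree`, which supports only a subset of XPath; it illustrates the idea and is not how riko's `xpathfetchpage` works internally.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for the fetched page.
page = """
<html><body><article>
  <ul>
    <li><a href="/blog/post-one.html">Post One</a></li>
    <li><a href="/blog/post-two.html">Post Two</a></li>
  </ul>
</article></body></html>
""".strip()

root = ET.fromstring(page)
# The path is relative to the root <html> element.
links = (
    {'content': a.text, 'href': a.get('href')}
    for a in root.iterfind('./body/article/ul/li/a'))

first = next(links)
print(first)  # {'content': 'Post One', 'href': '/blog/post-one.html'}
```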

  47. >>> kwargs = {'conf': xconf}
    >>> parts = [
    ...     {'value': 'http://8thlight.com'},
    ...     {'subkey': 'href'}]
    >>>
    >>> fconf = {
    ...     'url': {'subkey': 'strconcat'},
    ...     'start': '', 'end': ''}
    >>>
    >>> stream = (
    ...     SyncPipe('xpathfetchpage', **kwargs)
    ...     .strconcat(conf={'part': parts})
    ...     .fetchpage(conf=fconf)
    ...     .output)
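
The `parts` list mixes a literal `value` with a `subkey` lookup into the current item. A plain-Python reading of that step, offered as an assumption about the semantics rather than riko's actual code (the `strconcat` function and the sample item here are hypothetical):

```python
def strconcat(item, parts):
    """Join literal 'value' parts and 'subkey' lookups into one string."""
    pieces = []
    for part in parts:
        if 'value' in part:
            pieces.append(part['value'])  # literal text
        else:
            pieces.append(item[part['subkey']])  # field taken from the item
    return ''.join(pieces)


parts = [{'value': 'http://8thlight.com'}, {'subkey': 'href'}]
item = {'content': 'Example post', 'href': '/blog/example.html'}

# Store the result under 'strconcat', the key the slide's `fconf` reads from.
item['strconcat'] = strconcat(item, parts)
print(item['strconcat'])  # http://8thlight.com/blog/example.html
```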

  49. >>> next(stream)['content'].decode('utf-8')
    'I came into this field from a very
    non-technical background. And when I say very
    non-technical, I mean that I was an
    elementary fine arts teacher. The most
    important calculation I performed on a daily
    basis was counting my kindergarteners when
    they lined up to leave to make sure I hadn’t
    lost any since the beginning of the class
    period.'

  50. >>> from time import monotonic
    >>>
    >>> start = monotonic()
    >>> count = len(list(stream))
    >>> stop = monotonic() - start
    >>> count, stop
    (9, 0.4573155799989763)

  51. >>> kwargs = {'conf': xconf, 'parallel': True}
    >>> start = monotonic()
    >>>
    >>> stream = (
    ...     SyncPipe('xpathfetchpage', **kwargs)
    ...     .strconcat(conf={'part': parts})
    ...     .fetchpage(conf=fconf)
    ...     .output)
    >>>
    >>> count = len(list(stream))
    >>> stop = monotonic() - start
    >>> count, stop
    (10, 0.2804829629985761)
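
The speed-up comes from overlapping the time spent waiting on each page. The slides do not show how riko's `parallel` option is implemented, so the sketch below demonstrates the general idea with the standard library's `ThreadPoolExecutor`; `fake_fetch`, the sleep duration, and the example.com URLs are all stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for a page fetch: sleep briefly, return a dict."""
    time.sleep(0.05)
    return {'url': url, 'content': 'body of ' + url}


urls = ['http://example.com/{}'.format(n) for n in range(10)]

start = time.monotonic()
sequential = [fake_fetch(url) for url in urls]
sequential_time = time.monotonic() - start

start = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() preserves input order even though the calls overlap in time.
    parallel = list(pool.map(fake_fetch, urls))
parallel_time = time.monotonic() - start

print(len(sequential), len(parallel))
print(sequential_time, parallel_time)
```

Exact timings vary by machine, but the threaded run should finish in roughly the time of one simulated fetch rather than ten.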

  52. async
    processing

  53. pip install riko[async]

  54. >>> from riko.bado import coroutine, react
    >>> from riko.collections import AsyncPipe
    >>>
    >>> @coroutine
    ... def run(reactor):
    ...     start = monotonic()
    ...     flow = AsyncPipe(
    ...         'xpathfetchpage', **kwargs)
    ...
    ...     stream = yield (flow
    ...         .strconcat(conf={'part': parts})
    ...         .fetchpage(conf=fconf)
    ...         .output)
    ...

  55. >>> @coroutine
    ... def run(reactor):
    ...     start = monotonic()
    ...     flow = AsyncPipe(
    ...         'xpathfetchpage', **kwargs)
    ...
    ...     stream = yield (flow
    ...         .strconcat(conf={'part': parts})
    ...         .fetchpage(conf=fconf)
    ...         .output)
    ...
    ...     count = len(list(stream))
    ...     stop = monotonic() - start
    ...     print((count, stop))

  56. >>> react(run)
    (10, 0.2462857809982961)
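
The same "wait on many fetches at once" idea can be tried without riko. The sketch below swaps in the standard library's `asyncio` in place of riko's `bado` helpers, so it is an analogue and not the talk's code; `fake_fetch` and the example.com URLs are hypothetical.

```python
import asyncio


async def fake_fetch(url):
    """Hypothetical stand-in for a non-blocking page fetch."""
    await asyncio.sleep(0.05)  # yields control while "waiting on the network"
    return {'url': url, 'content': 'body of ' + url}


async def run(urls):
    # Start all fetches at once and wait for every result, in input order.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


urls = ['http://example.com/{}'.format(n) for n in range(10)]
results = asyncio.run(run(urls))
print(len(results))  # 10
```

Unlike the threaded approach, everything here runs on a single thread; the overlap comes from each coroutine yielding while it waits.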

  57. Head to Head
                       Spark, etc.  Huginn    riko
    installation       complex      moderate  simple
    push/pull          push         push      pull
    native ingestors   few          many      many
    parallel           ✔            ✔         ✔
    async ✔
    json serializable ✔ ✔
    distributed ✔

  58. Reuben Cummings
    [email protected]
    https://www.reubano.xyz
    Thanks!
