Quantum of Data: A data science journey

A data science talk given at Python Exposé, Nairobi.

Reuben Cummings

April 01, 2017

Transcript

  1. Quantum of Data
    A data science journey
    Python Exposé ● Nairobi, Kenya
    April 1, 2017
    by Reuben Cummings

  2. Who am I?
    Managing Director, Nerevu Development
    Founder of Arusha Coders
    Author of several popular Python packages
    reubano on Twitter and GitHub

  3. MISSION
    DOCID AGENT
    0001 00111
    CASE OF THE MISSING SHORTCAKE

  4. Ransom Note
    If you want to see your
    shortcake again, visit bitly.com/pyexpose
    for further instructions

  5. EVIDENCE
    DOCID AGENT
    0010 00111
    DROPBOX FOLDER
    WWW.DROPBOX.COM/HOME/EXPOSE/RANSOM

  6. GPG encrypted file
    encryption key unknown
    decryptme.txt.gpg
    readme.txt

  7. To obtain the key, first get the number of attendees
    from the previous meetups.
    decryptme.txt.gpg
    readme.txt

  8. HINTS
    DOCID AGENT
    0011 00111
    HINT #1
    WWW.MEETUP.COM/PYTHON-NAIROBI/EVENTS/PAST

  10. from html.parser import HTMLParser
    from itertools import chain
    class AttendanceParser(HTMLParser):
        def reset(self):
            HTMLParser.reset(self)
            self.match = False
            self.nums = iter([])
        def handle_starttag(self, tag, attrs):
            entry = dict(attrs)
            # Flag the element that holds an attendance count
            if entry.get('class') == 'event-rating':
                self.match = True

  11. from html.parser import HTMLParser
    from itertools import chain
    class AttendanceParser(HTMLParser):
        ...
        def handle_data(self, data):
            num = data.strip()
            # Only keep text that follows a flagged start tag
            if self.match and num:
                self.nums = chain(self.nums, [int(num)])
                self.match = False
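
The two parser slides can be tried without touching the network. The sketch below is a self-contained variant, assuming markup in which each attendance count sits inside an element with class 'event-rating'. The sample HTML is hypothetical (it is not the real Meetup page), and a plain list stands in for the chained iterator to keep the example short.

```python
from html.parser import HTMLParser


class AttendanceParser(HTMLParser):
    """Collect integers found inside elements with class 'event-rating'."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), so parser state is set up here.
        super().reset()
        self.match = False
        self.nums = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get('class') == 'event-rating':
            self.match = True

    def handle_data(self, data):
        num = data.strip()
        if self.match and num:
            self.nums.append(int(num))
            self.match = False


# Hypothetical sample markup, for illustration only.
SAMPLE = '''
<ul>
  <li><span class="event-rating">65</span> went</li>
  <li><span class="event-rating">83</span> went</li>
</ul>
'''

parser = AttendanceParser()
parser.feed(SAMPLE)
sample_nums = parser.nums
print(sample_nums)  # [65, 83]
```

Text that does not directly follow a flagged start tag, such as " went", is ignored because the flag is cleared as soon as a number is captured.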

  12. from urllib.request import urlopen
    BASE = 'https://www.meetup.com/Python-Nairobi'
    BASE_URL = '{base}/events/past/?page={page}'
    def extract_attendance():
        parser = AttendanceParser()
        url = BASE_URL.format(base=BASE, page=0)
        f = urlopen(url)
        encoding = f.info().get_content_charset()
        # Feed the page to the parser one decoded line at a time
        for line in f:
            parser.feed(line.decode(encoding))
        return list(parser.nums)
    >>> extract_attendance()
    [65, 83, 50, 64, 46]

  13. def extract_attendance():
        parser = AttendanceParser()
        # Outer loop to extract each page
        for page in range(5):
            url = BASE_URL.format(base=BASE, page=page)
            f = urlopen(url)
            encoding = f.info().get_content_charset()
            # Inner loop to parse each line
            for line in f:
                parser.feed(line.decode(encoding))
        yield from parser.nums
    >>> len(list(extract_attendance()))
    25
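
To check the paging logic offline, the page fetch and the parse step can be passed in as functions. This is only a sketch under stated assumptions: `fetch_page`, `parse`, and the fake pages are hypothetical stand-ins for `urlopen` and `AttendanceParser`, and the fake data simply repeats the five first-page numbers.

```python
def extract_attendance(fetch_page, parse, pages=5):
    """Yield attendance numbers page by page.

    fetch_page(page) returns the page text and parse(text) returns a list
    of ints. Both are hypothetical hooks so the loop runs without a network.
    """
    for page in range(pages):
        yield from parse(fetch_page(page))


# Fake data standing in for five pages of past events.
FAKE_PAGES = ['65 83 50 64 46'] * 5


def fake_fetch(page):
    return FAKE_PAGES[page]


def fake_parse(text):
    return [int(token) for token in text.split()]


attendance = list(extract_attendance(fake_fetch, fake_parse))
print(len(attendance))  # 25, five numbers on each of five pages
```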

  14. Hint #2
    This code is available at bitly.com/pyexpose-attendance
    in a shell, enter the command:
    python extract-attendance.py

  15. Hint #3
    Each number represents the Unicode
    code point of one character of the password.

  16. Hint #4
    chr(i) returns the string
    representing a character whose
    Unicode code point is the integer i

  17. >>> attendance = list(extract_attendance())
    >>> chr(attendance[0])
    'A'
    >>> chr(attendance[10])
    '\x11'
    >>> print(chr(attendance[10]))
    >>> chr(attendance[10]).isprintable()
    False

  18. >>> printable = [
    ...: chr(x) for x in range(150)
    ...: if chr(x).isprintable()]
    >>> len(printable)
    95
    >>> [
    ...: (x, chr(x)) for x in range(150)
    ...: if chr(x).isprintable()]
    [(32, ' '), (33, '!'), (34, '"'), (35, '#')...]
    >>> ''.join(printable[num] for num in attendance)
    'asR`N_WMH.153F24682579(?7'
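
The lookup on this slide treats each attendance number as an index into the list of printable characters rather than as a code point itself. A minimal sketch of that mapping and its inverse follows, using only the five first-page numbers shown earlier. The `decode` and `encode` helper names are illustrative and do not come from the talk.

```python
# Printable characters among the first 150 code points (space through '~').
PRINTABLE = [chr(x) for x in range(150) if chr(x).isprintable()]


def decode(numbers):
    """Map each number to the printable character at that index."""
    return ''.join(PRINTABLE[num] for num in numbers)


def encode(text):
    """Inverse of decode: map each character back to its index."""
    return [PRINTABLE.index(char) for char in text]


first_page = [65, 83, 50, 64, 46]
prefix = decode(first_page)
print(prefix)  # asR`N
assert encode(prefix) == first_page
```

Because the printable list starts at the space character (code point 32), index 65 lands on 'a' rather than 'A', which is why the password begins with a lowercase letter.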

  19. Hint #5
    in a shell, enter the command:
    gpg ransom/decryptme.txt.gpg
    when prompted, enter the password:
    asR`N_WMH.153F24682579(?7

  20. Hint #6
    This decrypted message is available
    at bitly.com/pyexpose-decrypted

  21. Decrypted message
    Your shortcake is at a cafe in Nairobi that shares
    an object with a snake in this flickr group
    https://www.flickr.com/groups/1329313@N21/
    Find the first photo taken by the most prolific
    group member in 2017.

  22. EVIDENCE
    DOCID AGENT
    0100 00111
    FLICKR GROUP
    WWW.FLICKR.COM/GROUPS/1329313@N21/

  23. HINTS
    DOCID AGENT
    0101 00111
    HINT #6
    API.FLICKR.COM/SERVICES/FEEDS/GROUPS_POOL.GNE?ID=1329313@N21

  24. Hint #7
    in a shell, enter the command:
    pip install riko

  25. >>> from riko.collections import SyncPipe
    >>>
    >>> BASE = 'https://api.flickr.com/services/feeds'
    >>> BASE_URL = '{}/groups_pool.gne?id=1329313@N21'
    >>> conf = {'url': BASE_URL.format(BASE)}
    >>> stream = SyncPipe('fetch', conf=conf).output
    >>> next(stream)

  26. {'author.name': 'Sharon B Mott',
    'link': 'https://www.flickr.com/photos/...',
    'pubDate': time.struct_time(tm_year=2017, tm_mo,...),
    'tags': [
    {'label': None,
    'scheme': 'https://www.flickr.com/photos/tags/',
    'term': 'boaconstrictor'},
    {'label': None,
    'scheme': 'https://www.flickr.com/photos/tags/',
    'term': 'boa'},
    ...
    ],
    'title': 'Hints of blue',
    ...
    }

  27. >>> from datetime import datetime as dt
    >>>
    >>> rule = {
    ...: 'field': 'pubDate',
    ...: 'op': 'after',
    ...: 'value': dt(2016, 12, 31)}
    >>> stream = (
    ...: SyncPipe('fetch', conf=conf)
    ...: .filter(conf={'rule': rule})
    ...: .list)
    >>> len(stream)
    15
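
For readers without riko installed, the same "after" rule can be approximated in plain Python. This is a sketch and not riko's implementation: the `items` list is hypothetical sample data shaped like the feed entry shown earlier, and `published_after` is an illustrative helper name.

```python
import time
from datetime import datetime as dt


def published_after(items, cutoff):
    """Keep items whose 'pubDate' (a time.struct_time) falls after cutoff."""
    return [
        item for item in items
        # The first six struct_time fields are year, month, day, h, m, s.
        if dt(*item['pubDate'][:6]) > cutoff]


# Hypothetical entries; only the fields used here are filled in.
items = [
    {'author.name': 'A', 'pubDate': time.strptime('2017-01-15', '%Y-%m-%d')},
    {'author.name': 'B', 'pubDate': time.strptime('2016-11-02', '%Y-%m-%d')},
]

recent = published_after(items, dt(2016, 12, 31))
recent_authors = [item['author.name'] for item in recent]
print(recent_authors)  # ['A']
```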

  28. >>> creators = [
    ...: item.get('author.name') for item in stream]
    >>> creators
    ['Sharon B Mott',
    'baker.cameron43',
    'stevekpriest',
    'TessaSmits',
    'TessaSmits',
    'TessaSmits',
    'Sharon B Mott',
    ...
    ]

  29. Hint #8
    collections.Counter is a
    dict subclass for counting hashable
    objects

  30. >>> from collections import Counter
    >>>
    >>> c = Counter(creators)
    >>> c
    Counter({'Jesonis|Photography_On/Off (super busy)': 1,
    'Sabrina Filipiak Vasseur': 3,
    'Sharon B Mott': 5,
    'TessaSmits': 3,
    'baker.cameron43': 2,
    'stevekpriest': 1})
    >>> c.most_common(1)
    [('Sharon B Mott', 5)]
    >>> top_creator = c.most_common(1)[0][0]
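
As a quick consistency check, the counts on this slide can be rebuilt into a Counter and summed. The sketch below copies the counts from the slide output. The variable names are illustrative.

```python
from collections import Counter

# Counts copied from the Counter output shown on the slide.
counts = Counter({
    'Jesonis|Photography_On/Off (super busy)': 1,
    'Sabrina Filipiak Vasseur': 3,
    'Sharon B Mott': 5,
    'TessaSmits': 3,
    'baker.cameron43': 2,
    'stevekpriest': 1,
})

total = sum(counts.values())
top_creator, top_count = counts.most_common(1)[0]
print(total)                   # 15, matching len(stream)
print(top_creator, top_count)  # Sharon B Mott 5
```

The total of 15 agrees with the length of the filtered stream, so every 2017 item is attributed to exactly one creator.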

  31. >>> links = [
    ...: item['link'] for item in stream
    ...: if item.get('author.name') == top_creator]
    >>> links[-1]
    'https://www.flickr.com/photos/125407841@N08/32344001285/in/pool-1329313@N21'
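
Taking `links[-1]` appears to rely on the feed listing the newest items first, so that the last match is the earliest photo. If that ordering cannot be assumed, the earliest item can be picked explicitly by `pubDate`. The sketch below uses hypothetical items (the links, authors, and dates are made up for illustration), and `first_link_by` is an illustrative helper name.

```python
import time


def first_link_by(items, author):
    """Return the link of the earliest item published by author."""
    own = [item for item in items if item.get('author.name') == author]
    # time.struct_time compares like a tuple, so min() finds the earliest.
    return min(own, key=lambda item: item['pubDate'])['link']


# Hypothetical feed entries in newest-first order.
items = [
    {'author.name': 'X', 'link': 'photo-3',
     'pubDate': time.strptime('2017-03-01', '%Y-%m-%d')},
    {'author.name': 'Y', 'link': 'photo-2',
     'pubDate': time.strptime('2017-02-01', '%Y-%m-%d')},
    {'author.name': 'X', 'link': 'photo-1',
     'pubDate': time.strptime('2017-01-05', '%Y-%m-%d')},
]

earliest = first_link_by(items, 'X')
print(earliest)  # photo-1
```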

  32. Hint #9
    This code is available at bitly.com/pyexpose-flickr
    in a shell, enter the command:
    python get-flickr-link.py

  33. EVIDENCE
    DOCID AGENT
    0110 00111
    FLICKR GROUP PHOTO
    WWW.FLICKR.COM/PHOTOS/125407841@N08/32344001285/SIZES/L

  34. Hint #10
    Your shortcake is at a cafe in Nairobi
    that shares an object with a snake in
    this flickr group

  35. EVIDENCE
    DOCID AGENT
    0111 00111
    GOOGLE MAPS
    WWW.GOOGLE.CO.KE/MAPS/SEARCH/LEAF+CAFE+NAIROBI/

  37. MISSION
    DOCID AGENT
    0001 00111
    CASE OF THE MISSING SHORTCAKE
    SOLVED!

  38. Thank you!
    Questions?
    Reuben Cummings
    @reubano
