Learn Python Through Public Data Hacking

Tutorial. PyCon 2013. Santa Clara. Conference video at https://www.youtube.com/watch?v=RrPZza_vZ3w

David Beazley

March 13, 2013

Transcript

  1. Copyright (C) 2013, http://www.dabeaz.com
    Learn Python Through
    Public Data Hacking
    1
    David Beazley
    @dabeaz
    http://www.dabeaz.com
    Presented at PyCon'2013, Santa Clara, CA
    March 13, 2013


  2. Copyright (C) 2013, http://www.dabeaz.com
    Requirements
    2
    • Python 2.7 or 3.3
    • Support files:
    http://www.dabeaz.com/pydata
    • Also, datasets passed around on USB-key


  3. Copyright (C) 2013, http://www.dabeaz.com
    Welcome!
    • And now for something completely different
    • This tutorial merges two topics
    • Learning Python
    • Public data sets
    • I hope you find it to be fun
    3


  4. Copyright (C) 2013, http://www.dabeaz.com
    Primary Focus
    • Learn Python through practical examples
    • Learn by doing!
    • Provide a few fun programming challenges
    4


  5. Copyright (C) 2013, http://www.dabeaz.com
    Not a Focus
    • Data science
    • Statistics
    • GIS
    • Advanced Math
    • "Big Data"
    • We are learning Python
    5


  6. Copyright (C) 2013, http://www.dabeaz.com
    Approach
    • Coding! Coding! Coding! Coding!
    • Introduce yourself to your neighbors
    • You're going to work together
    • A bit like a hackathon
    6


  7. Copyright (C) 2013, http://www.dabeaz.com
    Your Responsibilities
    • Ask questions!
    • Don't be afraid to try things
    • Read the documentation!
    • Ask for help if stuck
    7


  8. Copyright (C) 2013, http://www.dabeaz.com
    Ready, Set, Go...
    8


  9. Copyright (C) 2013, http://www.dabeaz.com
    Running Python
    • Run it from a terminal
    bash % python
    Python 2.7.3 (default, Jun 13 2012, 15:29:09)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on dar
    Type "help", "copyright", "credits" or "license"
    >>> print 'Hello World'
    Hello World
    >>> 3 + 4
    7
    >>>
    9
    • Start typing commands


  10. Copyright (C) 2013, http://www.dabeaz.com
    IDLE
    • Look for it in the "Start" menu
    10


  11. Copyright (C) 2013, http://www.dabeaz.com
    Interactive Mode
    • The interpreter runs a "read-eval" loop
    >>> print "hello world"
    hello world
    >>> 37*42
    1554
    >>> for i in range(5):
    ...     print i
    ...
    0
    1
    2
    3
    4
    >>>
    • It runs what you type
    11


  12. Copyright (C) 2013, http://www.dabeaz.com
    Interactive Mode
    • Some notes on using the interactive shell
    >>> print "hello world"
    hello world
    >>> 37*42
    1554
    >>> for i in range(5):
    ...     print i
    ...
    0
    1
    2
    3
    4
    >>>
    • >>> is the interpreter prompt for starting a new statement
    • ... is the interpreter prompt for continuing a statement (it may be blank in some tools)
    • Enter a blank line to finish typing and to run


  13. Copyright (C) 2013, http://www.dabeaz.com
    Creating Programs
    • Programs are put in .py files
    # helloworld.py
    print "hello world"
    • Create with your favorite editor (e.g., emacs)
    • Can also edit programs with IDLE or other
    Python IDE (too many to list)
    13


  14. Copyright (C) 2013, http://www.dabeaz.com
    Running Programs
    • Running from the terminal
    • Command line (Unix)
    bash % python helloworld.py
    hello world
    bash %
    • Command shell (Windows)
    C:\SomeFolder>helloworld.py
    hello world
    C:\SomeFolder>c:\python27\python helloworld.py
    hello world
    14


  15. Copyright (C) 2013, http://www.dabeaz.com
    Pro-Tip
    • Use python -i
    bash % python -i helloworld.py
    hello world
    >>>
    • It runs your program and then enters the
    interactive shell
    • Great for debugging, exploration, etc.
    15


  16. Copyright (C) 2013, http://www.dabeaz.com
    Running Programs (IDLE)
    • Select "Run Module" from editor
    • Will see output in IDLE shell window
    16


  17. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Statements
    • A Python program is a sequence of statements
    • Each statement is terminated by a newline
    • Statements are executed one after the other
    until you reach the end of the file.
    17


  18. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Comments
    • Comments are denoted by #
    # This is a comment
    height = 442 # Meters
    18
    • Extend to the end of the line


  19. Copyright (C) 2013, http://www.dabeaz.com
    Python 101: Variables
    • A variable is just a name for some value
    • Name consists of letters, digits, and _.
    • Must start with a letter or _
    height = 442
    user_name = "Dave"
    filename1 = 'Data/data.csv'
    19


  20. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Basic Types
    • Numbers
    a = 12345 # Integer
    b = 123.45 # Floating point
    • Text Strings
    name = 'Dave'
    filename = "Data/stocks.dat"
    20
    • Nothing (a placeholder)
    f = None


  21. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Math
    • Math operations behave normally
    y = 2 * x**2 - 3 * x + 10
    z = (x + y) / 2.0
    • Potential Gotcha: Integer Division in Python 2
    >>> 7/4
    1
    >>> 2/3
    0
    21
    • Use decimals if it matters
    >>> 7.0/4
    1.75
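The division gotcha can be checked directly. This is a minimal sketch assuming Python 3 syntax, where `/` is always true division and `//` is floor division; in Python 2.7 the same behavior is available with `from __future__ import division`. The variable names are illustrative only.

```python
# Python 3 semantics: / is true division, // is floor division.
true_quotient = 7 / 4      # 1.75
floor_quotient = 7 // 4    # 1

# Converting one operand to float works in both Python 2 and 3.
portable_quotient = float(7) / 4   # 1.75

print(true_quotient, floor_quotient, portable_quotient)
```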


  22. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Text Strings
    • A few common operations
    a = 'Hello'
    b = 'World'
    >>> len(a) # Length
    5
    >>> a + b # Concatenation
    'HelloWorld'
    >>> a.upper() # Case convert
    'HELLO'
    >>> a.startswith('Hell') # Prefix Test
    True
    >>> a.replace('H', 'M') # Replacement
    'Mello'
    >>>
    22


  23. Copyright (C) 2013, http://www.dabeaz.com
    Python 101: Conversions
    • To convert values
    a = int(x) # Convert x to integer
    b = float(x) # Convert x to float
    c = str(x) # Convert x to string
    • Example:
    >>> xs = '123'
    >>> xs + 10
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: cannot concatenate 'str' and 'int' o
    >>> int(xs) + 10
    133
    >>>
    23


  24. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Conditionals
    • If-else
    if a < b:
        print "Computer says no"
    else:
        print "Computer says yes"
    • If-elif-else
    if a < b:
        print "Computer says not enough"
    elif a > b:
        print "Computer says too much"
    else:
        print "Computer says just right"
    24


  25. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Relations
    25
    • Relational operators
    < > <= >= == !=
    • Boolean expressions (and, or, not)
    if b >= a and b <= c:
        print "b is between a and c"
    if not (b < a or b > c):
        print "b is still between a and c"


  26. Copyright (C) 2013, http://www.dabeaz.com
    Python 101: Looping
    • while executes a loop
    • Executes the indented statements
    underneath while the condition is true
    n = 10
    while n > 0:
        print 'T-minus', n
        n = n - 1
    print 'Blastoff!'


  27. Copyright (C) 2013, http://www.dabeaz.com
    Python 101: Iteration
    • for iterates over a sequence of data
    • Processes the items one at a time
    • Note: variable name doesn't matter
    names = ['Dave', 'Paula', 'Thomas', 'Lewis']
    for name in names:
        print name
    for n in names:
        print n


  28. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Indentation
    • There is a preferred indentation style
    • Always use spaces
    • Use 4 spaces per level
    • Avoid tabs
    • Always use a Python-aware editor
    28


  29. Copyright (C) 2013, http://www.dabeaz.com
    Python 101 : Printing
    • The print statement (Python 2)
    print x
    print x, y, z
    print "Your name is", name
    print x, # Omits newline
    • The print function (Python 3)
    29
    print(x)
    print(x, y, z)
    print("Your name is", name)
    print(x, end=' ') # Omits newline


  30. Copyright (C) 2013, http://www.dabeaz.com
    Python 101: Files
    • Opening a file
    f = open("foo.txt","r") # Open for reading
    g = open("bar.txt","w") # Open for writing
    • To read data
    data = f.read() # Read all data
    • To write text to a file
    g.write("some text\n")
    30


  31. Copyright (C) 2013, http://www.dabeaz.com
    Python 101: File Iteration
    • Reading a file one line at a time
    f = open("foo.txt","r")
    for line in f:
        # Process the line
        ...
    f.close()
    • Extremely common with data processing
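The same line-by-line pattern is often written with a `with` block, which closes the file automatically. This is a minimal sketch in Python 3 syntax; the temporary file and its contents are made up so that the example is self-contained.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
with open(path, "w") as g:
    g.write("first line\nsecond line\n")

# Read it back one line at a time; the file is closed automatically.
lines = []
with open(path, "r") as f:
    for line in f:
        lines.append(line.rstrip("\n"))   # strip the trailing newline

print(lines)
```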


  32. Copyright (C) 2013, http://www.dabeaz.com
    Python 101: Functions
    • Defining a new function
    def hello(name):
        print('Hello %s!' % name)
    def distance(lat1, lat2):
        'Return approx miles between lat1 and lat2'
        return 69 * abs(lat1 - lat2)
    • Example:
    >>> hello('Guido')
    Hello Guido!
    >>> distance(41.980262, 42.031662)
    3.5465999999995788
    >>>


  33. Copyright (C) 2013, http://www.dabeaz.com
    Python 101: Imports
    • There is a huge library of functions
    • Example: math functions
    import math
    x = math.sin(2)
    y = math.cos(2)
    33
    • Reading from the web
    import urllib # urllib.request on Py3
    u = urllib.urlopen('http://www.python.org')
    data = u.read()
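Since `urlopen` lives in `urllib.request` on Python 3, one way to write code that runs on both versions is a guarded import. This is a sketch only; `fetch` is a hypothetical helper name, not part of the tutorial's support files.

```python
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib import urlopen           # Python 2

def fetch(url):
    """Download url and return the raw response body as bytes."""
    u = urlopen(url)
    try:
        return u.read()
    finally:
        u.close()

# Example (needs network access):
# data = fetch('http://www.python.org')
```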


  34. Copyright (C) 2013, http://www.dabeaz.com
    Coding Challenge
    34
    "The Traveling Suitcase"


  35. Copyright (C) 2013, http://www.dabeaz.com
    The Traveling Suitcase
    35
    Travis traveled to Chicago and took
    the Clark Street #22 bus up to
    Dave's office.
    Problem: He just left his suitcase on the bus!
    Your task: Get it back!


  36. Copyright (C) 2013, http://www.dabeaz.com
    Panic!
    36
    >>> import urllib
    >>> u = urllib.urlopen('http://ctabustracker.com/bustime/map/getBusesForRoute.jsp?route=22')
    >>> data = u.read()
    >>> f = open('rt22.xml', 'wb')
    >>> f.write(data)
    >>> f.close()
    >>>
    • Start the Python interpreter and type this
    • Don't ask questions: you have 5 minutes...


  37. Copyright (C) 2013, http://www.dabeaz.com
    Hacking Transit Data
    37
    • Many major cities provide a transit API
    • Example: Chicago Transit Authority (CTA)
    http://www.transitchicago.com/developers/
    • Available data:
    • Real-time GPS tracking
    • Stop predictions
    • Alerts


  38. Copyright (C) 2013, http://www.dabeaz.com
    38


  39. Copyright (C) 2013, http://www.dabeaz.com
    Here's the Data
    39


    1:14 PM

    6801
    22
    North Bound
    N
    41.875033214174465
    -87.62907409667969
    3932
    North Bound
    P209
    Howard
    34058
    ...

    ...


  41. Copyright (C) 2013, http://www.dabeaz.com
    Your Challenge
    • Task 1:
    Travis doesn't know the number of the bus he
    was riding. Find likely candidates by parsing
    the data just downloaded and identifying
    northbound vehicles that are north of Dave's office.
    Dave's office is located at:
    latitude 41.980262
    longitude -87.668452


  42. Copyright (C) 2013, http://www.dabeaz.com
    Your Challenge
    42
    • Task 2:
    Write a program that periodically monitors
    the identified buses and reports their current
    distance from Dave's office.
    When the bus gets closer than 0.5 miles, have
    the program issue an alert by popping up a
    web-page showing the bus location on a map.
    Travis will meet the bus and get his suitcase.
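One way to approach the distance check in Task 2 is sketched below. It reuses the latitude-only `distance` function from the Functions slide; `buses_to_alert`, `OFFICE_LAT`, and the sample positions are hypothetical names and made-up values, and the code uses Python 3 print syntax.

```python
def distance(lat1, lat2):
    'Return approx miles between lat1 and lat2'
    return 69 * abs(lat1 - lat2)

OFFICE_LAT = 41.980262   # Dave's office, from the challenge slide

def buses_to_alert(positions, threshold=0.5):
    """positions maps bus id -> latitude; return ids within threshold miles."""
    return sorted(busid for busid, lat in positions.items()
                  if distance(lat, OFFICE_LAT) < threshold)

# Made-up positions standing in for one poll of the feed.
sample = {'1875': 41.9996211482, '1406': 42.0126361553, '1780': 41.983}
for busid in buses_to_alert(sample):
    print('Bus', busid, 'is close!')
```

A full solution would call this inside a loop that re-downloads the feed, pauses between polls (for example with `time.sleep`), and opens a map with `webbrowser.open` when the list is not empty.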


  43. Copyright (C) 2013, http://www.dabeaz.com
    Parsing XML
    43
    from xml.etree.ElementTree import parse
    doc = parse('rt22.xml')
    • Parsing a document into a tree


    [Diagram: doc is a tree. The root holds a time element
    and a series of bus elements; each bus holds id, rt, d,
    dn, lat, lon, ...]


  44. Copyright (C) 2013, http://www.dabeaz.com
    Parsing XML
    for bus in doc.findall('bus'):
        ...
    • Iterating over specific element type
    [Diagram: the document tree, with root, time, and bus elements]


  45. Copyright (C) 2013, http://www.dabeaz.com
    Parsing XML
    for bus in doc.findall('bus'):
        ...
    • Iterating over specific element type
    • Produces a sequence of matching elements
    [Diagram: bus refers to one matching bus element in the tree]


  49. Copyright (C) 2013, http://www.dabeaz.com
    Parsing XML
    for bus in doc.findall('bus'):
        d = bus.findtext('d')
        lat = float(bus.findtext('lat'))
    • Extracting data : elem.findtext()
    [Diagram: for one bus element, findtext('d') gives "North Bound"
    and findtext('lat') gives "41.9979871114"]
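Putting `findall()` and `findtext()` together, Task 1 might be approached as sketched below. The element names `bus`, `id`, `d`, and `lat` are the ones used in the slides; the root element name `buses`, the sample values, and the names `SAMPLE`, `OFFICE_LAT`, and `candidates` are assumptions made for illustration. The sketch parses an in-memory string with `fromstring` rather than the downloaded `rt22.xml`, and uses Python 3 print syntax.

```python
from xml.etree.ElementTree import fromstring

# Hand-written sample; the root name and the values are made up.
SAMPLE = """<buses>
  <time>1:14 PM</time>
  <bus><id>6801</id><d>North Bound</d><lat>41.875033</lat></bus>
  <bus><id>1875</id><d>North Bound</d><lat>41.999621</lat></bus>
  <bus><id>1406</id><d>South Bound</d><lat>42.012636</lat></bus>
</buses>"""

OFFICE_LAT = 41.980262

root = fromstring(SAMPLE)
candidates = []
for bus in root.findall('bus'):
    d = bus.findtext('d')
    lat = float(bus.findtext('lat'))
    # Keep northbound buses that have already passed the office.
    if d.startswith('North') and lat > OFFICE_LAT:
        candidates.append(bus.findtext('id'))

print(candidates)
```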

  50. Copyright (C) 2013, http://www.dabeaz.com
    Mapping
    50
    • To display a map : Maybe Google Static Maps
    https://developers.google.com/maps/documentation/staticmaps/
    • To show a page in a browser
    import webbrowser
    webbrowser.open('http://...')
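A map URL can be assembled with `urlencode` and handed to `webbrowser.open`. This is a rough sketch: the endpoint and the `center`, `zoom`, `size`, and `markers` parameters are assumptions based on the Static Maps documentation linked above, the service may also require an API key, and `map_url` is a hypothetical helper name.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def map_url(lat, lon, zoom=14, size='400x400'):
    """Build a static-map URL with one marker at (lat, lon)."""
    point = '%s,%s' % (lat, lon)
    params = [('center', point), ('zoom', zoom),
              ('size', size), ('markers', point)]
    return 'https://maps.googleapis.com/maps/api/staticmap?' + urlencode(params)

url = map_url(41.980262, -87.668452)
print(url)
# import webbrowser; webbrowser.open(url)   # uncomment to pop up the map
```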


  51. Copyright (C) 2013, http://www.dabeaz.com
    51


  52. Copyright (C) 2013, http://www.dabeaz.com
    Go Code...
    52
    30 Minutes
    • Talk to your neighbors
    • Consult handy cheat-sheet
    • http://www.dabeaz.com/pydata


  53. Copyright (C) 2013, http://www.dabeaz.com
    New Concepts
    53


  54. Copyright (C) 2013, http://www.dabeaz.com
    Data Structures
    • Real programs have more complex data
    • Example: A place marker
    Bus 6541 at 41.980262, -87.668452
    • An "object" with three parts
    • Label ("6541")
    • Latitude (41.980262)
    • Longitude (-87.668452)
    54


  55. Copyright (C) 2013, http://www.dabeaz.com
    Tuples
    • A collection of related values grouped together
    • Example:
    55
    bus = ('6541', 41.980262, -87.668452)
    • Analogy: A row in a database table
    • A single object with multiple parts


  56. Copyright (C) 2013, http://www.dabeaz.com
    Tuples (cont)
    • Tuple contents are ordered (like an array)
    bus = ('6541', 41.980262, -87.668452)
    id = bus[0] # '6541'
    lat = bus[1] # 41.980262
    lon = bus[2] # -87.668452
    • However, the contents can't be modified
    >>> bus[0] = '1234'
    TypeError: 'tuple' object does not support item assignment
    56


  57. Copyright (C) 2013, http://www.dabeaz.com
    Tuple Unpacking
    • Unpacking values from a tuple
    bus = ('6541', 41.980262, -87.668452)
    id, lat, lon = bus
    # id = '6541'
    # lat = 41.980262
    # lon = -87.668452
    • This is extremely common
    • Example: Unpacking database row into vars
    57


  58. Copyright (C) 2013, http://www.dabeaz.com
    Dictionaries
    • A collection of values indexed by "keys"
    • Example:
    bus = {
    'id' : '6541',
    'lat' : 41.980262,
    'lon' : -87.668452
    }
    58
    • Use:
    >>> bus['id']
    '6541'
    >>> bus['lat'] = 42.003172
    >>>


  59. Copyright (C) 2013, http://www.dabeaz.com
    Lists
    • An ordered sequence of items
    names = ['Dave', 'Paula', 'Thomas']
    59
    • A few operations
    >>> len(names)
    3
    >>> names.append('Lewis')
    >>> names
    ['Dave', 'Paula', 'Thomas', 'Lewis']
    >>> names[0]
    'Dave'
    >>>


  60. Copyright (C) 2013, http://www.dabeaz.com
    List Usage
    • Typically hold items of the same type
    nums = [10, 20, 30]
    buses = [
    ('1412', 41.8750332142, -87.6290740967),
    ('1406', 42.0126361553, -87.6747320322),
    ('1307', 41.8886332973, -87.6295552408),
    ('1875', 41.9996211482, -87.6711741429),
    ('1780', 41.9097633362, -87.6315689087),
    ]
    60


  61. Copyright (C) 2013, http://www.dabeaz.com
    Dicts as Lookup Tables
    • Use a dict for fast, random lookups
    • Example: Bus locations
    61
    bus_locs = {
    '1412': (41.8750332142, -87.6290740967),
    '1406': (42.0126361553, -87.6747320322),
    '1307': (41.8886332973, -87.6295552408),
    '1875': (41.9996211482, -87.6711741429),
    '1780': (41.9097633362, -87.6315689087),
    }
    >>> bus_locs['1307']
    (41.8886332973, -87.6295552408)
    >>>
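A few more lookup-table operations, as a minimal sketch in Python 3 syntax. The data is a subset of the table above; the key `'9999'` and the variable names are made up for illustration.

```python
bus_locs = {
    '1412': (41.8750332142, -87.6290740967),
    '1406': (42.0126361553, -87.6747320322),
    '1307': (41.8886332973, -87.6295552408),
}

# Direct lookup, then unpack the tuple stored under the key.
lat, lon = bus_locs['1307']

# .get() returns a default instead of raising KeyError for a missing key.
missing = bus_locs.get('9999')

# Iterate over all entries as (key, value) pairs.
north_of_42 = [busid for busid, (lat_, lon_) in bus_locs.items() if lat_ > 42.0]

print(lat, lon, missing, north_of_42)
```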


  62. Copyright (C) 2013, http://www.dabeaz.com
    Sets
    • An unordered collection of unique items
    62
    ids = set(['1412', '1406', '1307', '1875'])
    • Common operations
    >>> ids.add('1642')
    >>> ids.remove('1406')
    >>> '1307' in ids
    True
    >>> '1871' in ids
    False
    >>>
    • Useful for detecting duplicates, related tasks
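Duplicate detection with a set might look like the sketch below (Python 3 syntax). The stream of bus ids is made up, standing in for ids collected over repeated polls.

```python
ids_seen = set()
duplicates = set()

# Made-up stream of bus ids, as might come from repeated polls.
for busid in ['1412', '1406', '1412', '1875', '1406']:
    if busid in ids_seen:
        duplicates.add(busid)
    ids_seen.add(busid)

print(sorted(ids_seen))     # unique ids
print(sorted(duplicates))   # ids that appeared more than once
```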


  63. Copyright (C) 2013, http://www.dabeaz.com
    Coding Challenge
    63
    "Diabolical Road Biking"


  64. Copyright (C) 2013, http://www.dabeaz.com
    Problem
    64
    Not content to ride your
    bike on the lakefront
    path, you seek a new
    road biking challenge
    involving large potholes
    and heavy traffic.
    Your Task: Find the five most post-apocalyptic
    pothole-filled 10-block sections of road in Chicago.
    Bonus: Identify the worst road based on historical
    data involving actual number of patched potholes.


  65. Copyright (C) 2013, http://www.dabeaz.com
    Data Portals
    65
    • Many cities are publishing datasets online
    • http://data.cityofchicago.org
    • https://data.sfgov.org/
    • https://explore.data.gov/
    • You can download and play with data


  66. Copyright (C) 2013, http://www.dabeaz.com
    66


  67. Copyright (C) 2013, http://www.dabeaz.com
    67
    Pothole Data
    https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Pot-Holes-Reported/7as2-ds3y


  68. Copyright (C) 2013, http://www.dabeaz.com
    Getting the Data
    • You can download from the website
    • I have provided a copy on USB-key
    68
    Data/potholes.csv
    • Approx: 31 MB, 137000 lines


  69. Copyright (C) 2013, http://www.dabeaz.com
    Parsing CSV Data
    • You will need to parse CSV data
    import csv
    f = open('potholes.csv')
    for row in csv.DictReader(f):
        addr = row['STREET ADDRESS']
        num = row['NUMBER OF POTHOLES FILLED ON BLOCK']
    • Use the CSV module


  70. Copyright (C) 2013, http://www.dabeaz.com
    Tabulating Data
    • You'll probably need to make lookup tables
    potholes_by_block = {}
    f = open('potholes.csv')
    for row in csv.DictReader(f):
        ...
        potholes_by_block[block] += num_potholes
        ...
    • Use a dict. Map keys to counts.
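With a plain dict, `+=` on a key that has not been seen yet raises `KeyError`, so the elided parts of the loop need to handle the first occurrence. One possible way is `dict.get` with a default, sketched here in Python 3 syntax. The in-memory rows are made up; the column names are the ones used in the CSV parsing slide, and the blank count is caught with `try`-`except`.

```python
import csv
import io

# Tiny in-memory stand-in for potholes.csv (rows are made up).
SAMPLE = io.StringIO(
    "STREET ADDRESS,NUMBER OF POTHOLES FILLED ON BLOCK\n"
    "350 N STATE ST,5\n"
    "310 N STATE ST,\n"
    "1200 W MADISON ST,12\n"
    "350 N STATE ST,3\n"
)

potholes_by_addr = {}
for row in csv.DictReader(SAMPLE):
    addr = row['STREET ADDRESS']
    try:
        num = int(row['NUMBER OF POTHOLES FILLED ON BLOCK'])
    except ValueError:
        num = 0            # blank or malformed count
    # dict.get supplies 0 the first time a key is seen.
    potholes_by_addr[addr] = potholes_by_addr.get(addr, 0) + num

print(potholes_by_addr)
```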


  71. Copyright (C) 2013, http://www.dabeaz.com
    String Splitting
    • You might need to manipulate strings
    >>> addr = '350 N STATE ST'
    >>> parts = addr.split()
    >>> parts
    ['350', 'N', 'STATE', 'ST']
    >>> num = parts[0]
    >>> parts[0] = num[:-2] + 'XX'
    >>> parts
    ['3XX', 'N', 'STATE', 'ST']
    >>> ' '.join(parts)
    '3XX N STATE ST'
    >>>
    71
    • For example, to rewrite addresses
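The rewriting steps above can be wrapped in a function so they apply to every row. `block_of` is a hypothetical name, and the second address is made up. Whether grouping by hundreds like this is the right notion of a road section for the challenge is left to the reader.

```python
def block_of(addr):
    """Rewrite '350 N STATE ST' as '3XX N STATE ST'."""
    parts = addr.split()
    num = parts[0]
    parts[0] = num[:-2] + 'XX'   # drop the last two digits of the number
    return ' '.join(parts)

print(block_of('350 N STATE ST'))
print(block_of('1265 W MADISON ST'))
```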


  72. Copyright (C) 2013, http://www.dabeaz.com
    Data Reduction/Sorting
    • Some useful data manipulation functions
    >>> nums = [50, 10, 5, 7, -2, 8]
    >>> min(nums)
    -2
    >>> max(nums)
    50
    >>> sorted(nums)
    [-2, 5, 7, 8, 10, 50]
    >>> sorted(nums, reverse=True)
    [50, 10, 8, 7, 5, -2]
    >>>
    72
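To rank a lookup table rather than a flat list, `sorted` can be given the dict's items and a key function. This is a sketch in Python 3 syntax with made-up totals; `ranked` and `worst` are illustrative names.

```python
# Made-up totals, keyed by block.
potholes_by_block = {
    '3XX N STATE ST': 8,
    '12XX W MADISON ST': 12,
    '56XX S LAKE SHORE DR': 31,
    '1XX E OHIO ST': 2,
}

# Sort (block, count) pairs by count, largest first, and keep the top three.
ranked = sorted(potholes_by_block.items(),
                key=lambda item: item[1], reverse=True)
worst = ranked[:3]

for block, count in worst:
    print(count, block)
```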


  73. Copyright (C) 2013, http://www.dabeaz.com
    Exception Handling
    • You might need to account for bad data
    for row in csv.DictReader(f):
        try:
            n = int(row['NUMBER OF POTHOLES FILLED ON BLOCK'])
        except ValueError:
            n = 0
        ...
    • Use try-except to catch exceptions (if needed)


  74. Copyright (C) 2013, http://www.dabeaz.com
    Code...
    74
    40 Minutes
    Hint: This problem requires more thought
    than actual coding
    (The solution is small)


  75. Copyright (C) 2013, http://www.dabeaz.com
    Power Tools
    (Python powered)
    75


  76. Copyright (C) 2013, http://www.dabeaz.com
    List Comprehensions
    • Creates a new list by applying an operation
    to each element of a sequence.
    >>> a = [1,2,3,4,5]
    >>> b = [2*x for x in a]
    >>> b
    [2, 4, 6, 8, 10]
    >>>
    76
    • Shorthand for this:
    >>> b = []
    >>> for x in a:
    ...     b.append(2*x)
    ...
    >>>


  77. Copyright (C) 2013, http://www.dabeaz.com
    List Comprehensions
    • A list comprehension can also filter
    >>> a = [1, -5, 4, 2, -2, 10]
    >>> b = [2*x for x in a if x > 0]
    >>> b
    [2, 8, 4, 20]
    >>>
    77


  78. Copyright (C) 2013, http://www.dabeaz.com
    List Comp: Examples
    • Collecting the values of a specific field
    addrs = [r['STREET ADDRESS'] for r in records]
    • Performing database-like queries
    filled = [r for r in records
              if r['STATUS'] == 'Completed']
    • Building new data structures
    locs = [ (r['LATITUDE'], r['LONGITUDE'])
             for r in records ]


  79. Copyright (C) 2013, http://www.dabeaz.com
    Simplified Tabulation
    • Counter objects
    79
    from collections import Counter
    words = ['yes','but','no','but','yes']
    wordcounts = Counter(words)
    >>> wordcounts['yes']
    2
    >>> wordcounts.most_common()
    [('yes', 2), ('but', 2), ('no', 1)]
    >>>
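A `Counter` can also accumulate weighted totals, not just occurrences, because missing keys start at zero. This is a sketch in Python 3 syntax; the `(block, count)` pairs are made up.

```python
from collections import Counter

# (block, potholes filled) pairs; values are made up.
rows = [('3XX N STATE ST', 5), ('12XX W MADISON ST', 12),
        ('3XX N STATE ST', 3), ('1XX E OHIO ST', 2)]

totals = Counter()
for block, num in rows:
    totals[block] += num      # missing keys start at 0

print(totals.most_common(2))
```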


  80. Copyright (C) 2013, http://www.dabeaz.com
    Advanced Sorting
    • Use of a key-function
    80
    records.sort(key=lambda p: p['COMPLETION DATE'])
    records.sort(key=lambda p: p['ZIP'])
    • lambda: creates a tiny in-line function
    f = lambda p: p['COMPLETION DATE']
    # Same as
    def f(p):
        return p['COMPLETION DATE']
    • Result of key func determines sort order


  81. Copyright (C) 2013, http://www.dabeaz.com
    Grouping of Data
    • Iterating over groups of sorted data
    from itertools import groupby
    groups = groupby(records, key=lambda r: r['ZIP'])
    for zipcode, group in groups:
        for r in group:
            # All records with same zip-code
            ...
    • Note: data must already be sorted by field
    records.sort(key=lambda r: r['ZIP'])
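A small runnable version of the sort-then-group pattern, using made-up records in which only the `ZIP` field matters. It counts records per zip code; `counts` is an illustrative name and the code uses Python 3 syntax.

```python
from itertools import groupby

# Made-up records; only the 'ZIP' field matters here.
records = [
    {'ZIP': '60640', 'STATUS': 'Completed'},
    {'ZIP': '60637', 'STATUS': 'Open'},
    {'ZIP': '60640', 'STATUS': 'Open'},
]

# groupby only merges adjacent items, so sort by the same key first.
records.sort(key=lambda r: r['ZIP'])

counts = {}
for zipcode, group in groupby(records, key=lambda r: r['ZIP']):
    counts[zipcode] = len(list(group))   # consume the group iterator

print(counts)
```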


  82. Copyright (C) 2013, http://www.dabeaz.com
    Index Building
    • Building indices to data
    from collections import defaultdict
    zip_index = defaultdict(list)
    for r in records:
        zip_index[r['ZIP']].append(r)
    • Builds a dictionary
    zip_index = {
    '60640' : [ rec, rec, ... ],
    '60637' : [ rec, rec, rec, ... ],
    ...
    }
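Once built, the index answers "all records for this zip code" with a single lookup. A minimal sketch in Python 3 syntax; the records and street addresses are made up for illustration.

```python
from collections import defaultdict

records = [
    {'ZIP': '60640', 'STREET ADDRESS': '5000 N CLARK ST'},
    {'ZIP': '60637', 'STREET ADDRESS': '6000 S ELLIS AVE'},
    {'ZIP': '60640', 'STREET ADDRESS': '4800 N BROADWAY'},
]

zip_index = defaultdict(list)
for r in records:
    zip_index[r['ZIP']].append(r)   # a missing key starts as an empty list

addrs = [r['STREET ADDRESS'] for r in zip_index['60640']]
print(addrs)
```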


  83. Copyright (C) 2013, http://www.dabeaz.com
    Third Party Libraries
    • Many useful packages
    • numpy/scipy (array processing)
    • matplotlib (plotting)
    • pandas (statistics, data analysis)
    • requests (interacting with APIs)
    • ipython (better interactive shell)
    • Too many others to list
    83


  84. Copyright (C) 2013, http://www.dabeaz.com
    Coding Challenge
    84
    "Hmmmm.... Pies"


  86. Copyright (C) 2013, http://www.dabeaz.com
    Problem
    86
    You're ravenously
    hungry after all of that
    biking, but you can
    never be too careful.
    Your Task: Analyze Chicago's food inspection data
    and make a series of tasty pie charts and tables


  87. Copyright (C) 2013, http://www.dabeaz.com
    87
    The Data
    https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5
    • It's a 77MB CSV file. Don't download
    • Available on USB key (passed around)
    • New challenges abound!


  88. Copyright (C) 2013, http://www.dabeaz.com
    88
    Problems of Interest
    • Outcomes of a health-inspection (pass, fail)
    • Risk levels
    • Breakdown of establishment types
    • Most common code violations
    • Use your imagination...
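Before any plotting, the numbers behind a pie chart can be tabulated with a `Counter`. This is a sketch in Python 3 syntax; the outcome labels are made up and stand in for one column of the inspection CSV, whose real column names and values are not shown in these slides.

```python
from collections import Counter

# Made-up inspection outcomes standing in for one column of the CSV.
results = ['Pass', 'Fail', 'Pass', 'Pass w/ Conditions', 'Pass', 'Fail']

outcomes = Counter(results)
total = sum(outcomes.values())

# Percentage table; these shares are what a pie chart would display.
shares = [(name, 100.0 * n / total) for name, n in outcomes.most_common()]
for name, pct in shares:
    print('%-20s %5.1f%%' % (name, pct))
```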


  89. Copyright (C) 2013, http://www.dabeaz.com
    89
    To Make Charts...
    You're going to have to
    install some packages...


  90. Copyright (C) 2013, http://www.dabeaz.com
    90
    Bleeding Edge


  91. Copyright (C) 2013, http://www.dabeaz.com
    Code
    91
    45 Minutes
    • Code should not be long
    • For plotting/ipython consider EPD-Free,
    Anaconda CE, or other distribution
    • See samples at http://www.dabeaz.com/pydata


  92. Copyright (C) 2013, http://www.dabeaz.com
    92
    Where To Go From Here?
    • Python coding
    • Functions, modules, classes, objects
    • Data analysis
    • Numpy/Scipy, pandas, matplotlib
    • Data sources
    • Open government, data portals, etc.


  93. Copyright (C) 2013, http://www.dabeaz.com
    93
    Final Comments
    • Thanks!
    • Hope you had some fun!
    • Learned at least a few new things
    • Follow me on Twitter: @dabeaz
