The Vanishing Pattern: from iterators to generators in Python

The core of the talk is refactoring a simple iterable class from the classic Iterator design pattern (as described in the GoF book) into compatible but less verbose implementations using generators. This provides a meaningful context for understanding the value of generators. Along the way, the talk presents the behavior of the iter function, the sequence protocol and the Iterable interface. The motivating examples are database applications.

Luciano Ramalho

July 25, 2013

Transcript

  4. >>> from django.db import connection
    >>> q = connection.queries
    >>> q
    []
    >>> from municipios.models import *
    >>> res = Municipio.objects.all()[:5]  # this expression makes a Django QuerySet
    >>> q  # QuerySets are “lazy”: no database access so far
    []
    >>> for m in res: print m.uf, m.nome  # the query is made only when we iterate over the results
    ...
    GO Abadia de Goiás
    MG Abadia dos Dourados
    GO Abadiânia
    MG Abaeté
    PA Abaetetuba
    >>> q
    [{'time': '0.000', 'sql': u'SELECT "municipios_municipio"."id", "municipios_municipio"."uf", "municipios_municipio"."nome", "municipios_municipio"."nome_ascii", "municipios_municipio"."meso_regiao_id", "municipios_municipio"."capital", "municipios_municipio"."latitude", "municipios_municipio"."longitude", "municipios_municipio"."geohash" FROM "municipios_municipio" ORDER BY "municipios_municipio"."nome_ascii" ASC LIMIT 5'}]
  5. @ramalhoorg Lazy

    • Avoids unnecessary work, by postponing it as long as possible
    • The opposite of eager

    In Computer Science, being “lazy” is often a good thing!
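A minimal sketch of the eager/lazy contrast, in Python 3 (not from the slides; the `process` function and the `log` list are made up for illustration). A list comprehension does all the work up front, while a generator expression postpones each call until an item is requested:

```python
log = []  # records which items were actually processed


def process(n):
    """Hypothetical unit of work: note the call, then return a result."""
    log.append(n)
    return n * n


# Eager: all five calls to process() happen right here.
eager = [process(n) for n in range(5)]
eager_calls = len(log)

log.clear()

# Lazy: building the generator expression calls process() zero times.
lazy = (process(n) for n in range(5))
lazy_calls_before = len(log)

# Work is done only when an item is requested.
first = next(lazy)
lazy_calls_after = len(log)
```

After `next(lazy)` only one item has been processed; the other four calls never happen unless something keeps iterating.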
  6. Iteration: C and Python

    #include <stdio.h>
    int main(int argc, char *argv[]) {
        int i;
        for(i = 0; i < argc; i++)
            printf("%s\n", argv[i]);
        return 0;
    }

    import sys
    for arg in sys.argv:
        print arg
  7. Iteration: Java (classic)

    class Arguments {
        public static void main(String[] args) {
            for (int i=0; i < args.length; i++)
                System.out.println(args[i]);
        }
    }

    $ java Arguments alfa bravo charlie
    alfa
    bravo
    charlie
  8. Iteration: Java ≥1.5

    • Enhanced for (a.k.a. foreach) since 2004

    class Arguments2 {
        public static void main(String[] args) {
            for (String arg : args)
                System.out.println(arg);
        }
    }

    $ java Arguments2 alfa bravo charlie
    alfa
    bravo
    charlie
  9. Iteration: Java ≥1.5

    • Enhanced for (a.k.a. foreach), since 2004

    class Arguments2 {
        public static void main(String[] args) {
            for (String arg : args)
                System.out.println(arg);
        }
    }

    Python, since 1991:

    import sys
    for arg in sys.argv:
        print arg
  10. You can iterate over many Python objects

    • strings
    • files
    • XML: ElementTree nodes
    • not limited to built-in types:
      • Django QuerySet
      • etc.
  11. So, what is an iterable?

    • Informal, recursive definition:
      • iterable: fit to be iterated
      • just as: edible: fit to be eaten
  12. List comprehension

    • An expression that builds a list from any iterable
    • Known in Portuguese as “compreensão de lista” or “abrangência”
    • Example: use all the elements: L2 = [n*10 for n in L]

    >>> s = 'abracadabra'
    >>> l = [ord(c) for c in s]
    >>> l
    [97, 98, 114, 97, 99, 97, 100, 97, 98, 114, 97]

    input: any iterable object
    output: a list (always)
  13. Set comprehension

    • An expression that builds a set from any iterable

    >>> s = 'abracadabra'
    >>> set(s)
    {'b', 'r', 'a', 'd', 'c'}
    >>> {ord(c) for c in s}
    {97, 98, 99, 100, 114}
  14. Dict comprehensions

    • An expression that builds a dict from any iterable

    >>> s = 'abracadabra'
    >>> {c:ord(c) for c in s}
    {'a': 97, 'r': 114, 'b': 98, 'c': 99, 'd': 100}
  15. Syntactic support for iterables

    • Tuple unpacking, parallel assignment

    >>> a, b, c = 'XYZ'
    >>> a
    'X'
    >>> b
    'Y'
    >>> c
    'Z'

    >>> l = [(c, ord(c)) for c in 'XYZ']
    >>> l
    [('X', 88), ('Y', 89), ('Z', 90)]
    >>> for char, code in l:
    ...     print char, '->', code
    ...
    X -> 88
    Y -> 89
    Z -> 90
  16. Syntactic support for iterables (2)

    • Function calls: exploding arguments with *

    >>> import math
    >>> def hypotenuse(a, b):
    ...     return math.sqrt(a*a + b*b)
    ...
    >>> hypotenuse(3, 4)
    5.0
    >>> sides = (3, 4)
    >>> hypotenuse(sides)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: hypotenuse() takes exactly 2 arguments (1 given)
    >>> hypotenuse(*sides)
    5.0
  17. Built-in iterable types

    • basestring
      • str
      • unicode
    • dict
    • file
    • frozenset
    • list
    • set
    • tuple
    • xrange
  18. Built-in functions that take iterable arguments

    • all
    • any
    • filter
    • iter
    • len
    • map
    • max
    • min
    • reduce
    • sorted
    • sum
    • zip (unrelated to compression)
  19. Iterator is...

    • a classic design pattern

    Design Patterns, by Gamma, Helm, Johnson & Vlissides. Addison-Wesley, ISBN 0-201-63361-2
  20. Head First Design Patterns Poster. O'Reilly, ISBN 0-596-10214-3

    “The Iterator Pattern provides a way to access the elements of an aggregate object sequentially without exposing the underlying representation.”
  21. An iterable Train class

    >>> train = Train(4)
    >>> for car in train:
    ...     print(car)
    car #1
    car #2
    car #3
    car #4
  22. An iterable Train with iterator

    class Train(object):  # iterable
        def __init__(self, cars):
            self.cars = cars
        def __len__(self):
            return self.cars
        def __iter__(self):
            return TrainIterator(self)

    class TrainIterator(object):  # iterator
        def __init__(self, train):
            self.train = train
            self.current = 0
        def __next__(self): # Python 3
            if self.current < len(self.train):
                self.current += 1
                return 'car #%s' % (self.current)
            else:
                raise StopIteration()
  23. Iterable ABC

    • collections.Iterable abstract base class
    • A concrete subclass of Iterable must implement .__iter__
    • .__iter__ returns an Iterator
    • You don’t usually call .__iter__ directly
    • when needed, call iter(x)
  24. Iterator ABC

    • Iterator provides .next (Python 2) or .__next__ (Python 3)
    • .__next__ returns the next item
    • You don’t usually call .__next__ directly
    • when needed, call next(x) (Python ≥ 2.6)
  25. Train with iterator

    for car in train:
    • calls iter(train) to obtain a TrainIterator
    • makes repeated calls to next(aTrainIterator) until it raises StopIteration

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __len__(self):
            return self.cars
        def __iter__(self):
            return TrainIterator(self)

    class TrainIterator(object):
        def __init__(self, train):
            self.train = train
            self.current = 0
        def __next__(self): # Python 3
            if self.current < len(self.train):
                self.current += 1
                return 'car #%s' % (self.current)
            else:
                raise StopIteration()

    >>> train = Train(3)
    >>> for car in train:
    ...     print(car)
    car #1
    car #2
    car #3
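The two steps above can be spelled out by hand. The sketch below (Python 3) is only an approximation of what the `for` statement does; `manual_for` is a made-up helper name, and the `__iter__` method on `TrainIterator` is an addition not shown on the slide (the Iterator ABC expects iterators to be iterable themselves).

```python
class Train:
    def __init__(self, cars):
        self.cars = cars

    def __len__(self):
        return self.cars

    def __iter__(self):
        return TrainIterator(self)


class TrainIterator:
    def __init__(self, train):
        self.train = train
        self.current = 0

    def __iter__(self):
        # Addition to the slide code: an iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.current < len(self.train):
            self.current += 1
            return 'car #%s' % self.current
        raise StopIteration()


def manual_for(iterable):
    """Collect items roughly the way `for item in iterable` visits them."""
    items = []
    iterator = iter(iterable)      # step 1: obtain an iterator
    while True:
        try:
            item = next(iterator)  # step 2: fetch the next item...
        except StopIteration:      # ...until the iterator is exhausted
            break
        items.append(item)
    return items


cars = manual_for(Train(3))
```

Here `cars` holds the same three strings that the `for` loop on the slide prints.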
  26. Design patterns in dynamic languages

    • Dynamic languages: Lisp, Smalltalk, Python, Ruby, PHP, JavaScript...
    • Many features not found in C++, where most of the original 23 Design Patterns were identified
    • Java is more dynamic than C++, but much more static than Lisp, Python etc.

    Gamma, Helm, Johnson, Vlissides a.k.a. the Gang of Four (GoF)
  27. Dynamic types

    • No need to declare types or interfaces
    • It does not matter what an object claims to be, only what it is capable of doing
  28. Duck typing

    “In other words, don't check whether it is-a duck: check whether it quacks-like-a duck, walks-like-a duck, etc, etc, depending on exactly what subset of duck-like behaviour you need to play your language-games with.”
    Alex Martelli, comp.lang.python (2000)
  29. A Python iterable is...

    • An object from which the iter function can produce an iterator
    • The iter(x) call:
      • invokes x.__iter__() to obtain an iterator (Iterable interface)
      • but, if x has no __iter__:
        • iter makes an iterator which tries to fetch items from x by doing x[0], x[1], x[2]... (sequence protocol)
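A small sketch of the fallback described above, in Python 3. The `Letters` class is hypothetical and defines only `__getitem__`; `iter` still builds an iterator for it, even though the object does not register as an instance of the `Iterable` ABC (which looks for `__iter__`).

```python
from collections.abc import Iterable


class Letters:
    """Hypothetical class with __getitem__ but no __iter__."""

    def __init__(self, text):
        self._text = text

    def __getitem__(self, index):
        # Raises IndexError past the end, which ends the iteration.
        return self._text[index]


letters = Letters('abc')

# iter() falls back to calling letters[0], letters[1], ... until IndexError.
items = list(iter(letters))

# The ABC check only looks for __iter__, so it reports False here.
is_iterable_abc = isinstance(letters, Iterable)
```

This is why calling `iter(x)` is a more reliable test of iterability than an `isinstance` check against the ABC.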
  30. Train: a sequence of cars

    train = Train(4)

    train[0] train[1] train[2] train[3]
  31. Train: a sequence of cars

    >>> train = Train(4)
    >>> len(train)
    4
    >>> train[0]
    'car #1'
    >>> train[3]
    'car #4'
    >>> train[-1]
    'car #4'
    >>> train[4]
    Traceback (most recent call last):
      ...
    IndexError: no car at 4
    >>> for car in train:
    ...     print(car)
    car #1
    car #2
    car #3
    car #4
  32. Train: a sequence of cars

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __getitem__(self, key):
            index = key if key >= 0 else self.cars + key
            # compare against self.cars: this version defines no __len__
            if 0 <= index < self.cars: # index 2 -> car #3
                return 'car #%s' % (index + 1)
            else:
                raise IndexError('no car at %s' % key)

    if __getitem__ exists, iteration “just works”
  33. The sequence protocol at work

    >>> t = Train(4)
    >>> len(t)  # __len__
    4
    >>> t[0]  # __getitem__
    'car #1'
    >>> t[3]
    'car #4'
    >>> t[-1]
    'car #4'
    >>> for car in t:  # __getitem__
    ...     print(car)
    car #1
    car #2
    car #3
    car #4
  34. Protocol

    • protocol: a synonym for interface used in dynamic languages like Smalltalk, Python, Ruby, Lisp...
    • not declared, and not enforced by static checks
  35. Sequence protocol

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __len__(self):
            return self.cars
        def __getitem__(self, key):
            index = key if key >= 0 else self.cars + key
            if 0 <= index < len(self): # index 2 -> car #3
                return 'car #%s' % (index + 1)
            else:
                raise IndexError('no car at %s' % key)

    __len__ and __getitem__ implement the immutable sequence protocol
  36. Sequence ABC

    • collections.Sequence abstract base class (Python ≥ 2.6)

    import collections

    class Train(collections.Sequence):
        def __init__(self, cars):
            self.cars = cars
        def __len__(self):  # abstract method
            return self.cars
        def __getitem__(self, key):  # abstract method
            index = key if key >= 0 else self.cars + key
            if 0 <= index < len(self): # index 2 -> car #3
                return 'car #%s' % (index + 1)
            else:
                raise IndexError('no car at %s' % key)
  37. Sequence ABC

    • collections.Sequence abstract base class
    • implement these 2: __len__ and __getitem__

    import collections

    class Train(collections.Sequence):
        def __init__(self, cars):
            self.cars = cars
        def __len__(self):
            return self.cars
        def __getitem__(self, key):
            index = key if key >= 0 else self.cars + key
            if 0 <= index < len(self): # index 2 -> car #3
                return 'car #%s' % (index + 1)
            else:
                raise IndexError('no car at %s' % key)
  38. Sequence ABC

    • collections.Sequence abstract base class
    • inherit these 5

    import collections

    class Train(collections.Sequence):
        def __init__(self, cars):
            self.cars = cars
        def __len__(self):
            return self.cars
        def __getitem__(self, key):
            index = key if key >= 0 else self.cars + key
            if 0 <= index < len(self): # index 2 -> car #3
                return 'car #%s' % (index + 1)
            else:
                raise IndexError('no car at %s' % key)
  39. Sequence ABC

    • collections.Sequence abstract base class

    >>> train = Train(4)
    >>> 'car #2' in train
    True
    >>> 'car #7' in train
    False
    >>> for car in reversed(train):
    ...     print(car)
    car #4
    car #3
    car #2
    car #1
    >>> train.index('car #3')
    2
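For readers on current Python 3, here is a sketch of the same Train written against `collections.abc.Sequence`. The ABCs have lived in `collections.abc` since Python 3.3, and the old `collections.Sequence` alias was removed in Python 3.10. Apart from the import, the class follows the slides.

```python
from collections.abc import Sequence  # Python 3 home of the ABC


class Train(Sequence):
    def __init__(self, cars):
        self.cars = cars

    # The two abstract methods that a Sequence subclass must implement:
    def __len__(self):
        return self.cars

    def __getitem__(self, key):
        index = key if key >= 0 else self.cars + key
        if 0 <= index < len(self):  # index 2 -> car #3
            return 'car #%s' % (index + 1)
        raise IndexError('no car at %s' % key)


train = Train(4)

# Inherited from Sequence: __contains__, __iter__, __reversed__, index, count
has_car_2 = 'car #2' in train
backwards = list(reversed(train))
position = train.index('car #3')
```

Only `__len__` and `__getitem__` are written by hand; the membership test, `reversed` and `index` come from the ABC.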
  40. Iteration in C (example 2)

    #include <stdio.h>
    int main(int argc, char *argv[]) {
        int i;
        for(i = 0; i < argc; i++)
            printf("%d : %s\n", i, argv[i]);
        return 0;
    }

    $ ./args2 alfa bravo charlie
    0 : ./args2
    1 : alfa
    2 : bravo
    3 : charlie
  41. Iteration in Python (ex. 2): not Pythonic

    import sys
    for i in range(len(sys.argv)):
        print i, ':', sys.argv[i]

    $ python args2.py alfa bravo charlie
    0 : args2.py
    1 : alfa
    2 : bravo
    3 : charlie
  42. Iteration in Python (ex. 2): Pythonic!

    import sys
    for i, arg in enumerate(sys.argv):
        print i, ':', arg

    $ python args2.py alfa bravo charlie
    0 : args2.py
    1 : alfa
    2 : bravo
    3 : charlie
  43. Iteration in Python (ex. 2)

    import sys
    for i, arg in enumerate(sys.argv):
        print i, ':', arg

    $ python args2.py alfa bravo charlie
    0 : args2.py
    1 : alfa
    2 : bravo
    3 : charlie

    • enumerate(sys.argv): this returns a lazy iterable object
    • that object yields tuples (index, item) on demand, at each iteration
  44. What enumerate does

    • enumerate builds an enumerate object

    >>> e = enumerate('Turing')
    >>> e
    <enumerate object at 0x...>
  45. What enumerate does

    • enumerate builds an enumerate object (this builds a generator)
    • and that is iterable

    >>> e = enumerate('Turing')
    >>> e
    <enumerate object at 0x...>
    >>> for item in e:
    ...     print item
    ...
    (0, 'T')
    (1, 'u')
    (2, 'r')
    (3, 'i')
    (4, 'n')
    (5, 'g')
  46. What enumerate does

    • the enumerate object produces an (index, item) tuple for each next(e) call
    • The enumerate object is an example of a generator

    >>> e = enumerate('Turing')
    >>> e
    <enumerate object at 0x...>
    >>> next(e)
    (0, 'T')
    >>> next(e)
    (1, 'u')
    >>> next(e)
    (2, 'r')
    >>> next(e)
    (3, 'i')
    >>> next(e)
    (4, 'n')
    >>> next(e)
    (5, 'g')
    >>> next(e)
    Traceback (most recent...):
      ...
    StopIteration
  47. Iterator x generator

    • By definition (in GoF) an iterator retrieves successive items from an existing collection
    • A generator implements the iterator interface (next) but produces items not necessarily in a collection
      • a generator may iterate over a collection, but return the items decorated in some way, skip some items...
      • it may also produce items independently of any existing data source (e.g. Fibonacci sequence generator)
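As an illustration of the last bullet, here is one possible Fibonacci generator in Python 3 (a sketch, not code from the talk). It produces items from no underlying collection and never ends on its own, so the caller decides how many items to take, here with `itertools.islice`.

```python
from itertools import islice


def fibonacci():
    """Yield Fibonacci numbers forever; no collection is being traversed."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# Take only the first ten items of the endless series.
first_ten = list(islice(fibonacci(), 10))
```

Because the generator is lazy, the infinite loop in its body is harmless: it only advances when the consumer asks for another item.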
  48. Generator function

    • Any function that has the yield keyword in its body is a generator function

    >>> def gen_123():
    ...     yield 1
    ...     yield 2
    ...     yield 3
    ...
    >>> for i in gen_123(): print(i)
    1
    2
    3

    the keyword gen was considered for defining generator functions, but def prevailed
  49. Generator function

    • When invoked, a generator function returns a generator object

    >>> def gen_123():
    ...     yield 1
    ...     yield 2
    ...     yield 3
    ...
    >>> for i in gen_123(): print(i)
    1
    2
    3
    >>> g = gen_123()
    >>> g
    <generator object gen_123 at ...>
  50. Generator function

    • Generator objects implement the Iterator interface

    >>> def gen_123():
    ...     yield 1
    ...     yield 2
    ...     yield 3
    ...
    >>> g = gen_123()
    >>> g
    <generator object gen_123 at ...>
    >>> next(g)
    1
    >>> next(g)
    2
    >>> next(g)
    3
    >>> next(g)
    Traceback (most recent call last):
      ...
    StopIteration
  51. Generator behavior

    • Note how the output of the generator function is interleaved with the output of the calling code

    >>> def gen_AB():
    ...     print('START')
    ...     yield 'A'
    ...     print('CONTINUE')
    ...     yield 'B'
    ...     print('END.')
    ...
    >>> for c in gen_AB():
    ...     print('--->', c)
    ...
    START
    ---> A
    CONTINUE
    ---> B
    END.
  52. Generator behavior

    • The body is executed only when next is called, and it runs only up to the following yield

    >>> def gen_AB():
    ...     print('START')
    ...     yield 'A'
    ...     print('CONTINUE')
    ...     yield 'B'
    ...     print('END.')
    ...
    >>> g = gen_AB()
    >>> next(g)
    START
    'A'
  53. Generator behavior

    • When the body of the function returns, the generator object raises StopIteration
    • The for statement catches that for you

    >>> def gen_AB():
    ...     print('START')
    ...     yield 'A'
    ...     print('CONTINUE')
    ...     yield 'B'
    ...     print('END.')
    ...
    >>> g = gen_AB()
    >>> next(g)
    START
    'A'
    >>> next(g)
    CONTINUE
    'B'
    >>> next(g)
    END.
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration
  54. Train with generator function

    for car in train:
    • calls iter(train) to obtain a generator
    • makes repeated calls to next(generator) until the function returns, which raises StopIteration

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __iter__(self):
            for i in range(self.cars):
                # index 2 is car #3
                yield 'car #%s' % (i+1)

    >>> train = Train(3)
    >>> for car in train:
    ...     print(car)
    car #1
    car #2
    car #3
  55. Classic iterator x generator

    Classic iterator: 2 classes, 12 lines of code

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __len__(self):
            return self.cars
        def __iter__(self):
            return TrainIterator(self)

    class TrainIterator(object):
        def __init__(self, train):
            self.train = train
            self.current = 0
        def __next__(self): # Python 3
            if self.current < len(self.train):
                self.current += 1
                return 'car #%s' % (self.current)
            else:
                raise StopIteration()

    Generator: 1 class, 3 lines of code

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __iter__(self):
            for i in range(self.cars):
                yield 'car #%s' % (i+1)
  56. The pattern just vanished

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __iter__(self):
            for i in range(self.cars):
                yield 'car #%s' % (i+1)
  57. class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __iter__(self):
            for i in range(self.cars):
                yield 'car #%s' % (i+1)

    “When I see patterns in my programs, I consider it a sign of trouble. The shape of a program should reflect only the problem it needs to solve. Any other regularity in the code is a sign, to me at least, that I'm using abstractions that aren't powerful enough -- often that I'm generating by hand the expansions of some macro that I need to write.”
    Paul Graham, Revenge of the Nerds (2002)
  58. Generator expression (genexp)

    >>> g = (c for c in 'ABC')
    >>> g
    <generator object <genexpr> at 0x10045a410>
    >>> for l in g:
    ...     print(l)
    ...
    A
    B
    C
  59. Generator expression (genexp)

    • When evaluated, returns a generator object

    >>> g = (n for n in [1, 2, 3])
    >>> g
    <generator object <genexpr> at 0x...>
    >>> next(g)
    1
    >>> next(g)
    2
    >>> next(g)
    3
    >>> next(g)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration
  61. Train with generator expression

    for car in train:
    • calls iter(train) to obtain a generator
    • makes repeated calls to next(generator) until the function returns, which raises StopIteration

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __iter__(self):
            return ('car #%s' % (i+1) for i in range(self.cars))

    >>> train = Train(3)
    >>> for car in train:
    ...     print(car)
    car #1
    car #2
    car #3
  62. Generator function x genexp

    With a generator expression:

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __iter__(self):
            return ('car #%s' % (i+1) for i in range(self.cars))

    With a generator function:

    class Train(object):
        def __init__(self, cars):
            self.cars = cars
        def __iter__(self):
            for i in range(self.cars):
                yield 'car #%s' % (i+1)
  63. Built-in functions that return iterables, iterators or generators

    • dict
    • enumerate
    • frozenset
    • list
    • reversed
    • set
    • tuple
  64. The itertools module

    • boundless generators
      • count(), cycle(), repeat()
    • generators which combine several iterables:
      • chain(), tee(), izip(), imap(), product(), compress()...
    • generators which select or group items:
      • compress(), dropwhile(), groupby(), ifilter(), islice()...
    • generators producing combinations of items:
      • product(), permutations(), combinations()...

    Don’t reinvent the wheel, use itertools!
    these were not reinvented: ported from Haskell
    great for MapReduce
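A few of the functions listed above, one per category, as a short Python 3 sketch (the sample data is arbitrary). Note that `izip`, `imap` and `ifilter` exist only in Python 2; in Python 3 the built-in `zip`, `map` and `filter` are already lazy.

```python
import itertools

# Boundless generator, cut to size with islice
evens = list(itertools.islice(itertools.count(0, 2), 5))

# Combining several iterables into one stream
merged = list(itertools.chain('AB', [1, 2]))

# Grouping consecutive equal items
groups = [(key, len(list(group)))
          for key, group in itertools.groupby('aaabbc')]

# Producing combinations of items
pairs = list(itertools.combinations('ABC', 2))
```

Each call returns a lazy iterator; `list` is used here only to make the results visible.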
  65. Generators in Python 3

    • Several functions and methods of the standard library that used to return lists now return generators and other lazy iterables in Python 3
      • dict.keys(), dict.items(), dict.values()...
      • range(...)
        • like xrange in Python 2.x (more than a generator)
    • If you really need a list, just pass the generator to the list constructor, e.g. list(range(10))
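A brief Python 3 sketch of what “more than a generator” can mean in practice (variable names are arbitrary). A `range` object is lazy but also supports `len`, indexing, membership tests and repeated iteration, and dict views stay in sync with their dict.

```python
numbers = range(10)

# Lazy, yet it behaves like a sequence:
size = len(numbers)
fourth = numbers[3]
has_seven = 7 in numbers

# It can be iterated more than once, unlike a generator object.
first_pass = list(numbers)
second_pass = list(numbers)

# Dict views reflect later changes to the dict.
d = {'a': 1}
keys = d.keys()
d['b'] = 2
current_keys = sorted(keys)
```

Passing any of these objects to `list` gives a real list when one is needed.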
  66. A practical example using generator functions

    • Generator functions to decouple reading and writing logic in a database conversion tool designed to handle large datasets

    https://github.com/ramalho/isis2json
  67. main: determine input format

    • input generator function is selected based on the input file extension
    • selected generator function is passed as an argument
  68. writeJsonArray: iterates over one of the input generator functions

    • selected generator function received as an argument...
    • ...and called to produce input generator
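The structure described in the last two slides can be sketched as follows, in Python 3. This is not the actual isis2json code: the reader functions, the `READERS` table, the record layout and the file names are all hypothetical, chosen only to show a writer that receives a generator function as an argument and knows nothing about the input format.

```python
import io
import json
import os


def iter_csv_records(lines):
    """Hypothetical reader: one dict per 'id,name' line."""
    for line in lines:
        ident, name = line.rstrip('\n').split(',')
        yield {'id': ident, 'name': name}


def iter_tsv_records(lines):
    """Hypothetical reader: one dict per tab-separated 'id<TAB>name' line."""
    for line in lines:
        ident, name = line.rstrip('\n').split('\t')
        yield {'id': ident, 'name': name}


# Assumed mapping from file extension to input generator function.
READERS = {'.csv': iter_csv_records, '.tsv': iter_tsv_records}


def write_json_array(input_gen_func, source, output):
    """Write records as a JSON array, one record at a time."""
    output.write('[')
    # The generator function is called here to produce the input generator.
    for i, record in enumerate(input_gen_func(source)):
        if i > 0:
            output.write(',\n')
        output.write(json.dumps(record, sort_keys=True))
    output.write(']\n')


def main(filename, source, output):
    """Select the reader by file extension and pass it to the writer."""
    extension = os.path.splitext(filename)[1]
    write_json_array(READERS[extension], source, output)


out = io.StringIO()
main('cities.csv', ['1,alfa\n', '2,bravo\n'], out)
```

Because records flow through one at a time, neither side needs to hold the whole dataset in memory, which is the point of using generator functions for large inputs.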
  69. We did not cover

    • other generator methods:
      • gen.close(): causes a GeneratorExit exception to be raised within the generator body, at the point where it is paused
      • gen.throw(e): causes any exception e to be raised within the generator body, at the point where it is paused

    Mostly useful for long-running processes. Often not needed in batch processing scripts.
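For the curious, a small Python 3 sketch of both methods (the `numbered` generator and the `events` list are made up for illustration). `throw` raises the exception at the paused `yield`, where the generator may handle it and keep going; `close` raises GeneratorExit at the paused `yield`, which lets a `finally` clause run.

```python
events = []  # records what happens inside the generator


def numbered():
    """Hypothetical generator that counts up until told otherwise."""
    n = 0
    try:
        while True:
            n += 1
            yield n
    except ValueError:
        events.append('got ValueError')
        yield -1  # a generator may handle the exception and keep yielding
    finally:
        events.append('cleanup')


g = numbered()
first = next(g)                      # runs up to the first yield
recovered = g.throw(ValueError('x'))  # raised at the paused yield
g.close()                            # GeneratorExit raised; finally runs
```

After `close()` the generator is finished, and any further `next(g)` raises StopIteration.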
  70. We did not cover

    • generator delegation with yield from (Python ≥ 3.3)
    • sending data into a generator function with the gen.send(x) method (instead of next(gen)), and using yield as an expression to get the data sent
    • using generator functions as coroutines

    not useful in the context of iteration

    “Coroutines are not related to iteration” David Beazley
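For reference only, a minimal Python 3 sketch of the two features named above (function names are made up). `outer` delegates part of its output to `inner` with `yield from`; `averager` is a coroutine-style generator that receives values through `send` and uses `yield` as an expression.

```python
def inner():
    yield 1
    yield 2


def outer():
    yield 0
    yield from inner()  # delegation: items from inner() pass straight through
    yield 3


def averager():
    """Coroutine-style generator: receives numbers, yields the running mean."""
    total, count, average = 0.0, 0, None
    while True:
        value = yield average  # yield as an expression: gets the value sent
        total += value
        count += 1
        average = total / count


delegated = list(outer())

avg = averager()
next(avg)              # prime the coroutine: advance to the first yield
mean_1 = avg.send(10)
mean_2 = avg.send(20)
```

The `averager` example is about pushing data in rather than pulling items out, which is why the talk treats coroutines as a separate subject from iteration.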
  71. How to learn generators

    • Forget about .send() and coroutines: that is a completely different subject. Look into that only after mastering and becoming really comfortable using generators for iteration.
    • Study and use the itertools module
    • Don’t worry about .close() and .throw() initially. You can be productive with generators without using these methods.
    • yield from is only available in Python 3.3 and later, and only relevant if you need to use .close() and .throw()