
Iterators & generators: the Python Way

Did you know that "for a, (b, c) in s:" is a valid Python line? From the elegant for statement through list/set/dict comprehensions and generator functions, this talk shows how the Iterator pattern is so deeply embedded in the syntax of Python, and so widely supported by its libraries, that some of its most powerful applications can be overlooked by programmers coming from other languages.

Luciano Ramalho

October 20, 2012

Transcript

  1. Iterators & generators: the Python way Luciano Ramalho luciano@ramalho.org @ramalhoorg

  2. Iteration: C and Python

    C:

        #include <stdio.h>

        int main(int argc, char *argv[]) {
            int i;
            for (i = 0; i < argc; i++)
                printf("%s\n", argv[i]);
            return 0;
        }

    Python:

        import sys
        for arg in sys.argv:
            print arg

  3. Iteration: Java (classic)

        class Argumentos {
            public static void main(String[] args) {
                for (int i = 0; i < args.length; i++)
                    System.out.println(args[i]);
            }
        }

        $ java Argumentos alfa bravo charlie
        alfa
        bravo
        charlie

  4. Iteration: Java ≥1.5

    • Enhanced for, since 2004

        class Argumentos2 {
            public static void main(String[] args) {
                for (String arg : args)
                    System.out.println(arg);
            }
        }

        $ java Argumentos2 alfa bravo charlie
        alfa
        bravo
        charlie

  5. Iteration: Java ≥1.5

    • Java: enhanced for, since 2004 (same Argumentos2 class as on slide 4)
    • Python: since 1991

        import sys
        for arg in sys.argv:
            print arg
  6. Demo: some iterables

    • High-level iteration: not limited to built-in types
    • string
    • file
    • XML: ElementTree nodes
    • Django QuerySet
    • etc.

  7. In Django, QuerySet is a lazy iterable

        >>> from django.db import connection
        >>> q = connection.queries
        >>> q
        []
        >>> from municipios.models import *
        >>> res = Municipio.objects.all()[:5]
        >>> q
        []
        >>> for m in res: print m.uf, m.nome
        ...
        GO Abadia de Goiás
        MG Abadia dos Dourados
        GO Abadiânia
        MG Abaeté
        PA Abaetetuba
        >>> q
        [{'time': '0.000', 'sql': u'SELECT "municipios_municipio"."id", "municipios_municipio"."uf", "municipios_municipio"."nome", "municipios_municipio"."nome_ascii", "municipios_municipio"."meso_regiao_id", "municipios_municipio"."capital", "municipios_municipio"."latitude", "municipios_municipio"."longitude", "municipios_municipio"."geohash" FROM "municipios_municipio" ORDER BY "municipios_municipio"."nome_ascii" ASC LIMIT 5'}]

    • After the QuerySet is built, q is still empty: no database access so far
    • The query is made only when the iteration happens

  8. The for statement is not the only construct that groks iterables...
  9. List comprehensions

    • Expressions that build lists from arbitrary iterables
    • The input can be any iterable; the result is always a list
    • ≈ math set notation

        >>> s = 'abracadabra'
        >>> l = [ord(c) for c in s]
        >>> l
        [97, 98, 114, 97, 99, 97, 100, 97, 98, 114, 97]

    • Example: using all the elements: L2 = [n*10 for n in L]

  10. Set & dict comprehensions

    • Expressions that build sets / dicts from arbitrary iterables

        >>> s = 'abracadabra'
        >>> {c for c in s}
        set(['a', 'r', 'b', 'c', 'd'])
        >>> {c:ord(c) for c in s}
        {'a': 97, 'r': 114, 'b': 98, 'c': 99, 'd': 100}
  11. Built-in iterable types

    • basestring • str • unicode • dict • file • frozenset • list • set • tuple • xrange

  12. Built-in functions that take iterable arguments

    • all • any • filter • iter • len • map • max • min • reduce • sorted • sum • zip
    • (zip is unrelated to compression)
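
As a brief illustration of slide 12, the sketch below feeds generator expressions and other iterables straight into a few of these built-ins. It uses Python 3 syntax, and the `words` list is made up for the example.

```python
# Hypothetical sample data, not from the talk.
words = ['iterator', 'generator', 'comprehension']

# sum() and any() consume a generator expression one item at a time.
total_length = sum(len(w) for w in words)
has_long_word = any(len(w) > 10 for w in words)

# sorted() accepts any iterable and always returns a new list.
by_length = sorted(words, key=len, reverse=True)

# zip() pairs up items from several iterables of different types.
pairs = list(zip('abc', range(3)))

print(total_length, has_long_word, by_length, pairs)
```
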
  13. Syntactic support

    • Tuple unpacking
    • parallel assignment
    • function calls with *

        >>> def soma(a, b):
        ...     return a + b
        ...
        >>> soma(1, 2)
        3
        >>> t = (3, 4)
        >>> soma(t)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: soma() takes exactly 2 arguments (1 given)
        >>> soma(*t)
        7

        >>> a, b, c = 'XYZ'
        >>> a
        'X'
        >>> b
        'Y'
        >>> c
        'Z'
        >>> g = (n for n in [1, 2, 3])
        >>> a, b, c = g
        >>> a
        1
        >>> b
        2
        >>> c
        3
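
Tuple unpacking also works in the target of a for statement, which is what makes the line quoted in the talk abstract, `for a, (b, c) in s:`, valid. A minimal sketch in Python 3 syntax follows; the contents of `s` are invented for the example.

```python
# Hypothetical data: each item is a pair whose second element is itself a pair.
s = [(1, (2, 3)), (4, (5, 6))]

totals = []
for a, (b, c) in s:  # nested tuple unpacking in the for target
    totals.append(a + b + c)

print(totals)
```
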
  14. A Python iterable is...

    • An object from which the iter function can produce an iterator
    • The iter(x) call:
      • invokes x.__iter__() to obtain an iterator
      • but, if x has no __iter__:
        • iter makes an iterator which tries to fetch items from x by doing x[0], x[1], x[2]...
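
The fallback described on slide 14 can be sketched roughly as follows, in Python 3 syntax. The `Squares` class and the `getitem_iter` helper are hypothetical names made up for this illustration; `getitem_iter` only approximates what the built-in `iter` does when `__iter__` is missing.

```python
class Squares:
    """Hypothetical class with __getitem__ but no __iter__."""

    def __getitem__(self, index):
        if 0 <= index < 5:
            return index * index
        raise IndexError(index)


def getitem_iter(x):
    """Rough emulation of the fallback: fetch x[0], x[1], ... until IndexError."""
    index = 0
    while True:
        try:
            yield x[index]
        except IndexError:
            return
        index += 1


builtin_result = list(iter(Squares()))           # built-in iter() uses the fallback
emulated_result = list(getitem_iter(Squares()))  # hand-written approximation
print(builtin_result, emulated_result)
```
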
  15. Train: a sequence of cars

    • (figure labels: train, train[0])
    • sequences were called trains in ABC, the language that preceded Python
  16. Train: a sequence of cars

        >>> train = Train(4)
        >>> len(train)
        4
        >>> train[0]
        'car #1'
        >>> train[3]
        'car #4'
        >>> train[-1]
        'car #4'
        >>> train[4]
        Traceback (most recent call last):
        ...
        IndexError: no car at 4
        >>> for car in train:
        ...     print(car)
        car #1
        car #2
        car #3
        car #4

    • if __getitem__ exists, iteration “just works”

  17. Train: a sequence of cars

        class Train(object):

            def __init__(self, cars):
                self.cars = cars

            def __len__(self):
                return self.cars

            def __getitem__(self, key):
                index = key if key >= 0 else self.cars + key
                if 0 <= index < len(self):  # index 2 -> car #3
                    return 'car #%s' % (index + 1)
                else:
                    raise IndexError('no car at %s' % key)

  18. Sequence protocol

    (same Train class as on slide 17)

    • protocol: a synonym for interface used in dynamic languages like Smalltalk, Python, Ruby...
    • not declared, and not enforced by static checks
    • __len__ and __getitem__ implement the immutable sequence protocol
  19. Sequence ABC

    • collections.Sequence abstract base class
    • __len__ and __getitem__ are its abstract methods

        import collections

        # collections.abc.Sequence in Python 3.3 and later
        class Train(collections.Sequence):

            def __init__(self, cars):
                self.cars = cars

            def __len__(self):
                return self.cars

            def __getitem__(self, key):
                index = key if key >= 0 else self.cars + key
                if 0 <= index < len(self):  # index 2 -> car #3
                    return 'car #%s' % (index + 1)
                else:
                    raise IndexError('no car at %s' % key)

  20. Sequence ABC

    (same Train class as on slide 19)

    • implement __len__ and __getitem__
    • inherit 5 methods

        >>> train = Train(4)
        >>> 'car #2' in train
        True
        >>> 'car #7' in train
        False
        >>> for car in reversed(train):
        ...     print(car)
        car #4
        car #3
        car #2
        car #1
        >>> train.index('car #3')
        2
  21. Iterable ABC

    • A concrete subclass of Iterable must implement __iter__
    • __iter__ returns an Iterator
    • Iterator must implement a next method
      • in Python 3: __next__
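
The relationship between the two ABCs can be checked with `isinstance`. The sketch below assumes Python 3, where these classes live in `collections.abc`; the variable names are made up for the example.

```python
from collections.abc import Iterable, Iterator

numbers = [1, 2, 3]
it = iter(numbers)

list_is_iterable = isinstance(numbers, Iterable)  # a list implements __iter__
list_is_iterator = isinstance(numbers, Iterator)  # but it has no __next__
it_is_iterator = isinstance(it, Iterator)         # the object returned by iter() has both
it_is_own_iterator = iter(it) is it               # an iterator returns itself from __iter__

print(list_is_iterable, list_is_iterator, it_is_iterator, it_is_own_iterator)
```
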
  22. Iterator is...

    • a classic design pattern
    • Design Patterns, Gamma, Helm, Johnson & Vlissides, Addison-Wesley, ISBN 0-201-63361-2

  23. Head First Design Patterns Poster, O'Reilly, ISBN 0-596-10214-3

  24. “The Iterator Pattern provides a way to access the elements of an aggregate object sequentially without exposing the underlying representation”

    Head First Design Patterns Poster, O'Reilly, ISBN 0-596-10214-3
  25. Train with iterator

    for car in train:
    • calls iter(train) to obtain a TrainIterator
    • makes repeated calls to aTrainIterator.__next__() until it raises StopIteration

        class Train(object):

            def __init__(self, cars):
                self.cars = cars

            def __len__(self):
                return self.cars

            def __iter__(self):
                return TrainIterator(self)

        class TrainIterator(object):

            def __init__(self, train):
                self.train = train
                self.current = 0

            def __next__(self):  # Python 3
                if self.current < len(self.train):
                    self.current += 1
                    return 'car #%s' % (self.current)
                else:
                    raise StopIteration()

            next = __next__  # Python 2 compatibility

        >>> train = Train(3)
        >>> for car in train:
        ...     print(car)
        car #1
        car #2
        car #3
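
The two steps listed on slide 25 can be written out by hand as a while loop. This is only a sketch of what the for statement does, in Python 3 syntax, using a plain list in place of the Train class; the `seen` list exists just to record the result.

```python
cars = ['car #1', 'car #2', 'car #3']

seen = []
iterator = iter(cars)         # step 1: obtain an iterator from the iterable
while True:
    try:
        car = next(iterator)  # step 2: ask the iterator for the next item
    except StopIteration:     # the iterator signals that it is exhausted
        break
    seen.append(car)          # body of the equivalent for loop

print(seen)
```
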
  26. A Python iterable is...

    • An object from which the iter function can produce an iterator
    • The iter(x) call:
      • invokes x.__iter__() to obtain an iterator (Iterable interface)
      • but, if x has no __iter__:
        • iter makes an iterator which tries to fetch items from x by doing x[0], x[1], x[2]... (sequence protocol)
  27. Iteration in C (example 2)

        #include <stdio.h>

        int main(int argc, char *argv[]) {
            int i;
            for (i = 0; i < argc; i++)
                printf("%d : %s\n", i, argv[i]);
            return 0;
        }

        $ ./args2 alfa bravo charlie
        0 : ./args2
        1 : alfa
        2 : bravo
        3 : charlie

  28. Iteration in Python (ex. 2)

    • not Pythonic

        import sys
        for i in range(len(sys.argv)):
            print i, ':', sys.argv[i]

        $ python args2.py alfa bravo charlie
        0 : args2.py
        1 : alfa
        2 : bravo
        3 : charlie

  29. Iteration in Python (ex. 2)

    • enumerate(sys.argv) returns a lazy iterable
    • it yields (index, item) tuples on demand, at each iteration

        import sys
        for i, arg in enumerate(sys.argv):
            print i, ':', arg

        $ python args2.py alfa bravo charlie
        0 : args2.py
        1 : alfa
        2 : bravo
        3 : charlie
  30. Iterator vs. generator

    • By definition (GoF) an iterator retrieves successive items from an existing collection
    • A generator implements the iterator interface but produces items not necessarily in a collection
      • a generator may iterate over a collection, but return the items decorated in some way
      • it may also produce items independently of any other data structure (e.g. Fibonacci generator)
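
The Fibonacci generator mentioned on slide 30 is not shown in the slides; one possible sketch, in Python 3 syntax, is below. The function name `fibonacci` is an assumption, and `itertools.islice` is used only to take a finite prefix of the endless series.

```python
from itertools import islice


def fibonacci():
    """Yields Fibonacci numbers forever; no underlying collection exists."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# islice stops after 10 items, so the infinite generator is never exhausted.
first_ten = list(islice(fibonacci(), 10))
print(first_ten)
```
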
  31. Generator function (Python 2.x)

    • When invoked, returns a generator object
    • Generator objects implement the iterator interface: .next (.__next__ in Python 3)

        >>> def gen_123():
        ...     yield 1
        ...     yield 2
        ...     yield 3
        ...
        >>> for i in gen_123(): print(i)
        1
        2
        3
        >>> g = gen_123()
        >>> g
        <generator object gen_123 at ...>
        >>> g.next()
        1
        >>> g.next()
        2
        >>> g.next()
        3
        >>> g.next()
        Traceback (most recent call last):
        ...
        StopIteration

  32. Generator function (Python 3.x)

    (same gen_123 definition and for loop as on slide 31)

        >>> g = gen_123()
        >>> g
        <generator object gen_123 at ...>
        >>> g.__next__()
        1
        >>> g.__next__()
        2
        >>> g.__next__()
        3
        >>> g.__next__()
        Traceback (most recent call last):
        ...
        StopIteration

  33. Generator function (Python ≥ 2.6)

    (same gen_123 definition and for loop as on slide 31)

        >>> g = gen_123()
        >>> g
        <generator object gen_123 at ...>
        >>> next(g)
        1
        >>> next(g)
        2
        >>> next(g)
        3
        >>> next(g)
        Traceback (most recent call last):
        ...
        StopIteration
  34. Generator behavior

    • Invoking a generator function builds the generator object but does not execute the body of the function

        >>> def gen_ab():
        ...     print('starting...')
        ...     yield 'A'
        ...     print('here comes B:')
        ...     yield 'B'
        ...     print('the end.')
        ...
        >>> for s in gen_ab(): print(s)
        starting...
        A
        here comes B:
        B
        the end.
        >>> g = gen_ab()
        >>> next(g)
        starting...
        'A'
        >>> next(g)
        here comes B:
        'B'
        >>> next(g)
        Traceback (most recent call last):
        ...
        StopIteration

  35. Generator behavior

    (same gen_ab session as on slide 34)

    • The body is executed only when next is called, and only up to the following yield
  36. Train with generator function

    for car in train:
    • calls iter(train) to obtain a generator
    • makes repeated calls to next(generator) until the function returns, which raises StopIteration

        class Train(object):

            def __init__(self, cars):
                self.cars = cars

            def __iter__(self):
                for i in range(self.cars):
                    # index 2 is car #3
                    yield 'car #%s' % (i+1)

        >>> train = Train(3)
        >>> for car in train:
        ...     print(car)
        car #1
        car #2
        car #3

  37. Classic iterator vs. generator

    Classic iterator: 2 classes, 12 lines of code

        class Train(object):

            def __init__(self, cars):
                self.cars = cars

            def __len__(self):
                return self.cars

            def __iter__(self):
                return TrainIterator(self)

        class TrainIterator(object):

            def __init__(self, train):
                self.train = train
                self.current = 0

            def __next__(self):  # Python 3
                if self.current < len(self.train):
                    self.current += 1
                    return 'car #%s' % (self.current)
                else:
                    raise StopIteration()

    Generator: 1 class, 3 lines of code

        class Train(object):

            def __init__(self, cars):
                self.cars = cars

            def __iter__(self):
                for i in range(self.cars):
                    yield 'car #%s' % (i+1)
  38. Generator expression

    • When evaluated, returns a generator object

        >>> g = (n for n in [1, 2, 3])
        >>> for i in g: print i
        ...
        1
        2
        3
        >>> g = (n for n in [1, 2, 3])
        >>> g
        <generator object <genexpr> at 0x109a4deb0>
        >>> g.next()
        1
        >>> g.next()
        2
        >>> g.next()
        3
        >>> g.next()
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        StopIteration

  39. Train with generator expression

    for car in train:
    • calls iter(train) to obtain a generator
    • makes repeated calls to next(generator) until the generator raises StopIteration

        class Train(object):

            def __init__(self, cars):
                self.cars = cars

            def __iter__(self):
                return ('car #%s' % (i+1) for i in range(self.cars))

        >>> train = Train(3)
        >>> for car in train:
        ...     print(car)
        car #1
        car #2
        car #3
  40. Built-in functions that return iterables, iterators or generators

    • dict • enumerate • frozenset • list • reversed • set • tuple

  41. The itertools module

    • boundless generators
      • count(), cycle(), repeat()
    • generators which combine several iterables:
      • chain(), tee(), izip(), imap(), product(), compress()...
    • generators which select or group items:
      • compress(), dropwhile(), groupby(), ifilter(), islice()...
    • generators producing combinations of items:
      • product(), permutations(), combinations()...
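
A few of the functions from slide 41 in action, as a short sketch in Python 3 syntax (where `izip`, `imap` and `ifilter` no longer exist, because the built-in `zip`, `map` and `filter` are lazy). The sample inputs are made up for the example.

```python
from itertools import chain, count, groupby, islice

# count() is boundless; islice() takes just the first five items.
evens = list(islice(count(0, 2), 5))

# chain() walks several iterables, of any type, one after the other.
joined = list(chain('ab', [1, 2]))

# groupby() groups consecutive equal items; each group is itself lazy.
groups = [(key, len(list(group))) for key, group in groupby('aaabcc')]

print(evens, joined, groups)
```
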
  42. A practical example using generator functions

    • Generator functions to decouple reading and writing logic in a database conversion tool designed to handle large datasets
    • https://github.com/ramalho/isis2json

  43. Main loop writes JSON file

  44. Another loop reads the input records

  45. One implementation: same loop reads/writes

  46. But what if we need to read another format?

  47. Functions in the script

    • iterMstRecords*
    • iterIsoRecords*
    • writeJsonArray
    • main
    * generator functions

  48. main: read command line arguments

  49. main: determine input format

    • input generator function is selected based on the input file extension
    • selected generator function is passed as an argument

  50. writeJsonArray: write JSON records

  51. writeJsonArray: iterates over one of the input generator functions

    • selected generator function received as an argument...
    • and called to produce input generator
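
The design walked through on slides 43 to 51 can be sketched as follows. This is not the isis2json code: the two reader functions, their toy line formats and the name `write_json_array` are hypothetical, chosen only to show a writer that receives a generator function as an argument and calls it to obtain the input generator. Python 3 syntax is assumed.

```python
import io
import json


def iter_csv_records(lines):
    """Hypothetical reader: yields one dict per 'id,title' line."""
    for line in lines:
        rec_id, title = line.strip().split(',', 1)
        yield {'id': rec_id, 'title': title}


def iter_kv_records(lines):
    """Hypothetical reader for a second format: 'id=...;title=...' lines."""
    for line in lines:
        yield dict(field.split('=', 1) for field in line.strip().split(';'))


def write_json_array(input_gen_func, source, output):
    """Writes records as a JSON array, one record at a time.

    The writer never sees the input format: it just calls the generator
    function it received and iterates over the resulting generator.
    """
    output.write('[')
    for i, record in enumerate(input_gen_func(source)):
        if i > 0:
            output.write(',\n')
        output.write(json.dumps(record, sort_keys=True))
    output.write(']\n')


out_csv = io.StringIO()
write_json_array(iter_csv_records, ['1,Alpha', '2,Bravo'], out_csv)

out_kv = io.StringIO()
write_json_array(iter_kv_records, ['id=1;title=Alpha', 'id=2;title=Bravo'], out_kv)

print(out_csv.getvalue())
```

Because each reader yields one record at a time, the writer holds only a single record in memory, which is the property that matters for large datasets. Supporting another input format means adding another generator function, with no change to the writer.
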
  52. iterIsoRecords: read records from ISO-2709 format file

    • generator function!

  53. iterIsoRecords

    • creates a new dict in each iteration
    • yields one record, structured as a dict

  54. iterMstRecords: read records from ISIS .MST file

    • generator function!

  55. iterIsoRecords / iterMstRecords

    • creates a new dict in each iteration
    • yields one record, structured as a dict

  56. Generators at work

  57. Generators at work

  58. Generators at work
  59. What we did not cover

    • sending data into a generator with the .send() method (instead of .next()), and using yield as an expression to get the data sent
    • using generator functions as coroutines
    • neither is very useful in the context of iteration
    • “Coroutines are not related to iteration” (David Beazley)
  60. Q & A Luciano Ramalho luciano@ramalho.org @ramalhoorg https://github.com/ramalho/isis2json