
New dict implementation in Python 3.6 (KLab Tech Meetup 2017-09-04)

INADA Naoki
September 04, 2017


Transcript

  1. New dict implementation in Python 3.6 Inada Naoki (@methane)

  2. About me
    @methane
    K-Labo, KLab Inc.
    Python core developer
    C, Go, network (server) programming, MySQL clients
    ISUCON 6 winner (see http://isucon.net/ )
  3. Table of contents
    • Dict in Python
    • Python 3.5 implementation
    • Python 3.6 implementation
    • Toward Python 3.7
  4. Dict in Python

  5. Dict
    Key-value storage, a.k.a. associative array, map, or hash.
        x = {"foo": 42, "bar": 84}
        print(x["foo"])  # => 42
    Key features:
    • Constant-time lookup
    • Amortized constant-time insertion
    • Support for custom (user-defined) key types
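As a minimal sketch of the "custom key type" point: a user-defined class can serve as a dict key once it defines `__hash__` and `__eq__` consistently. The `Point` class and the `grid` dict are hypothetical names used only for illustration.

```python
class Point:
    """Hypothetical user-defined key type (not from the slides)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # Two points are equal when their coordinates match.
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal objects must produce equal hash values.
        return hash((self.x, self.y))


grid = {Point(0, 0): "origin"}
grid[Point(1, 2)] = "treasure"

# A different but equal object finds the same entry.
found = grid[Point(1, 2)]
print(found)
```

The dict first uses `__hash__` to pick a slot, then `__eq__` to confirm that the stored key really matches.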
  6. Dicts are everywhere in Python
        x = 5             # The global namespace is a dict. Insert 'x' into it.
        def add(a):       # Insert 'add' into the global dict
            return a + x  # Look up 'x' in the global dict
        print(add(7))     # Look up 'print' and 'add' in the global dict
    There are many dicts in a Python program. Lookup speed is critical. Insertion speed and memory usage are very important too.
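One way to see that a global namespace really is a dict is to run the slide's code with an explicit dict as its globals. This is only an illustrative sketch; the names `namespace` and `source` are made up for the example.

```python
# Run the slide's code with a plain dict acting as the global namespace.
namespace = {}
source = """
x = 5
def add(a):
    return a + x
"""
exec(source, namespace)

# Both names were inserted into the dict.
has_names = "x" in namespace and "add" in namespace

first = namespace["add"](7)   # 'x' is looked up in the dict at call time

# Changing the dict entry changes what the function sees.
namespace["x"] = 100
second = namespace["add"](7)

print(has_names, first, second)
```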
  7. Python 3.5 implementation

  8. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        d["foo"] = "spam"  # insert new item
        hash("foo") = 42   # hash value is 42
        42 % 8 = 2         # hash value % hash table size = 2
  9. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        d["foo"] = "spam"
        hash("foo") = 42
        42 % 8 = 2
    Slot 2: "foo" 42 "spam"
  10. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        d["bar"] = "ham"
        hash("bar") = 52
        52 % 8 = 4
    Slot 2: "foo" 42 "spam"
    Slot 4: "bar" 52 "ham"
  11. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        d["baz"] = "egg"
        hash("baz") = 58
        58 % 8 = 2         # "baz" collides with "foo"
    Slot 2: "foo" 42 "spam"
    Slot 4: "bar" 52 "ham"
  12. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
    "Open addressing" uses another slot in the table. (Another strategy is "chaining".)
    For example, the "linear probing" algorithm uses the next entry.
    ※ Python uses more complex probing, but I use a simpler way in this example.
    Slot 2: "foo" 42 "spam"
    Slot 3: "baz" 58 "egg"
    Slot 4: "bar" 52 "ham"
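The insertions on the preceding slides can be sketched as follows. This is a simplified model, not CPython's code: it uses linear probing as the slides do, and the `FAKE_HASH` table hard-codes the slide's hash values (42, 52, 58) because real `hash()` results differ. It also assumes the table never fills up.

```python
TABLE_SIZE = 8
# Hash values taken from the slides, not real hash() results.
FAKE_HASH = {"foo": 42, "bar": 52, "baz": 58}


def insert(table, key, value):
    """Store (key, hash, value) and return the slot that was used."""
    h = FAKE_HASH[key]
    slot = h % len(table)
    # Linear probing: on a collision, try the next slot (wrapping around).
    while table[slot] is not None and table[slot][0] != key:
        slot = (slot + 1) % len(table)
    table[slot] = (key, h, value)
    return slot


table = [None] * TABLE_SIZE
slots = [
    insert(table, key, value)
    for key, value in [("foo", "spam"), ("bar", "ham"), ("baz", "egg")]
]
print(slots)  # [2, 4, 3]
```

"baz" hashes to slot 2, finds "foo" there, and lands in slot 3.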
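  13. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        del d["foo"]
        hash("foo") = 42
        42 % 8 = 2
    Slot 2: "foo" 42 "spam"
    Slot 3: "baz" 58 "egg"
    Slot 4: "bar" 52 "ham"
  14. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        del d["foo"]
        hash("foo") = 42
        42 % 8 = 2
    Slot 2: (empty)
    Slot 3: "baz" 58 "egg"
    Slot 4: "bar" 52 "ham"
  15. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        x = d["baz"]
        hash("baz") = 58
        58 % 8 = 2 (!!?)   # slot 2 is now empty, so the lookup stops before reaching "baz"
    Slot 2: (empty)
    Slot 3: "baz" 58 "egg"
    Slot 4: "bar" 52 "ham"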
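  16. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        del d["foo"]       # leaves a DUMMY key behind
    Slot 2: DUMMY
    Slot 3: "baz" 58 "egg"
    Slot 4: "bar" 52 "ham"
  17. Hash table with 8 slots (0 to 7); each entry holds key, hash, value
        x = d["baz"]
        hash("baz") = 58
        58 % 8 = 2         # collides with DUMMY, then linear probing continues
    Slot 2: DUMMY
    Slot 3: "baz" 58 "egg"
    Slot 4: "bar" 52 "ham"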
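The deletion problem and the DUMMY fix can be reproduced with a small model. As before, this is a sketch under assumptions: linear probing instead of Python's real probing, hard-coded hash values from the slides, and a table that never fills up. The `leave_dummy` flag is a hypothetical switch added to compare both behaviours.

```python
# Hash values taken from the slides, not real hash() results.
FAKE_HASH = {"foo": 42, "bar": 52, "baz": 58}
DUMMY = ("<dummy>", None, None)  # tombstone left in place of a deleted entry


def insert(table, key, value):
    slot = FAKE_HASH[key] % len(table)
    while table[slot] is not None and table[slot][0] != key:
        slot = (slot + 1) % len(table)
    table[slot] = (key, FAKE_HASH[key], value)


def lookup(table, key):
    slot = FAKE_HASH[key] % len(table)
    # An empty slot ends the probe sequence; a DUMMY slot does not.
    while table[slot] is not None:
        if table[slot][0] == key:
            return table[slot][2]
        slot = (slot + 1) % len(table)
    raise KeyError(key)


def delete(table, key, leave_dummy):
    slot = FAKE_HASH[key] % len(table)
    while table[slot] is not None:
        if table[slot][0] == key:
            table[slot] = DUMMY if leave_dummy else None
            return
        slot = (slot + 1) % len(table)
    raise KeyError(key)


def build():
    table = [None] * 8
    for key, value in [("foo", "spam"), ("bar", "ham"), ("baz", "egg")]:
        insert(table, key, value)
    return table


# Naive deletion empties slot 2, which cuts "baz" off from its probe chain.
naive = build()
delete(naive, "foo", leave_dummy=False)
try:
    lookup(naive, "baz")
    naive_found = True
except KeyError:
    naive_found = False

# Leaving a DUMMY keeps the chain intact.
fixed = build()
delete(fixed, "foo", leave_dummy=True)
fixed_value = lookup(fixed, "baz")

print(naive_found, fixed_value)
```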
  18. Problems in classical open addressing hash table
    • Large memory usage
      ◦ At least 1/3 of entries are empty
        ▪ Otherwise, probing can be too slow
      ◦ One entry uses 3 words
        ▪ word = 8 bytes on recent machines
      ◦ Minimum size = 192 bytes
        ▪ 8 (bytes/word) * 3 (words/entry) * 8 (table width)
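The arithmetic on this slide can be checked directly. The constant names below are made up for the example; the numbers come from the slide.

```python
WORD_BYTES = 8        # bytes per word on a 64-bit machine
WORDS_PER_ENTRY = 3   # key pointer, hash, value pointer
TABLE_WIDTH = 8       # number of slots in the smallest table

# Every slot reserves a full entry, whether it is used or not.
legacy_min_bytes = WORD_BYTES * WORDS_PER_ENTRY * TABLE_WIDTH

# At least 1/3 of the slots must stay empty, so at most 2/3 can be used.
max_used_entries = TABLE_WIDTH * 2 // 3

print(legacy_min_bytes, max_used_entries)  # 192 5
```

So the smallest table reserves 192 bytes while holding at most 5 items.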
  19. Python 3.6 implementation

  20. Compact and ordered dict
    PyPy implemented it in 2015:
    https://morepypy.blogspot.jp/2015/01/faster-more-memory-efficient-and-more.html
    The Python 3.6 dict is almost the same as PyPy's. Ruby 2.4 and PHP 7 have similar ones.
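  21. Index table with 8 slots (0 to 7) pointing into a dense entries array (key, hash, value)
        d["foo"] = "spam"  # hash("foo") = 42, 42 % 8 = 2
    Entries: 0: "foo" 42 "spam"
    Index: slot 2 → 0
  22. Index table with 8 slots (0 to 7) pointing into a dense entries array (key, hash, value)
        d["foo"] = "spam"
        d["bar"] = "ham"   # hash("bar") = 52, 52 % 8 = 4
    Entries: 0: "foo" 42 "spam"; 1: "bar" 52 "ham"
    Index: slot 2 → 0; slot 4 → 1
  23. Index table with 8 slots (0 to 7) pointing into a dense entries array (key, hash, value)
        d["foo"] = "spam"
        d["bar"] = "ham"
        d["baz"] = "egg"
        del d["foo"]
    Entries: 1: "bar" 52 "ham"; 2: "baz" 58 "egg"
    Index: slot 2 → DUMMY; slot 3 → 2; slot 4 → 1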
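The layout on these slides, a sparse index table pointing into a dense entries array, can be modelled roughly as below. This is a toy sketch, not CPython's implementation: the `CompactDict` class is hypothetical, it uses linear probing and the slide's hard-coded hash values, and it never resizes, so it assumes the table does not fill up.

```python
# Hash values taken from the slides, not real hash() results.
FAKE_HASH = {"foo": 42, "bar": 52, "baz": 58}
FREE, DUMMY = -1, -2  # special markers stored in the index table


class CompactDict:
    """Toy model: sparse index table plus dense, insertion-ordered entries."""

    def __init__(self, size=8):
        self.index = [FREE] * size  # small integers only
        self.entries = []           # (key, hash, value) tuples, or None if deleted

    def _probe(self, key):
        # Linear probing over the index table (simplified).
        slot = FAKE_HASH[key] % len(self.index)
        while True:
            yield slot
            slot = (slot + 1) % len(self.index)

    def __setitem__(self, key, value):
        for slot in self._probe(key):
            i = self.index[slot]
            if i == FREE:
                # New key: append to the entries and record its position.
                self.index[slot] = len(self.entries)
                self.entries.append((key, FAKE_HASH[key], value))
                return
            if i != DUMMY and self.entries[i][0] == key:
                self.entries[i] = (key, FAKE_HASH[key], value)
                return

    def __getitem__(self, key):
        for slot in self._probe(key):
            i = self.index[slot]
            if i == FREE:
                raise KeyError(key)
            if i != DUMMY and self.entries[i][0] == key:
                return self.entries[i][2]

    def __delitem__(self, key):
        for slot in self._probe(key):
            i = self.index[slot]
            if i == FREE:
                raise KeyError(key)
            if i != DUMMY and self.entries[i][0] == key:
                self.index[slot] = DUMMY  # keep the probe chain intact
                self.entries[i] = None
                return

    def keys(self):
        # Iteration walks the dense entries, so insertion order is preserved.
        return [entry[0] for entry in self.entries if entry is not None]


d = CompactDict()
d["foo"] = "spam"
d["bar"] = "ham"
d["baz"] = "egg"
del d["foo"]

print(d.index)   # [-1, -1, -2, 2, 1, -1, -1, -1]
print(d.keys())  # ['bar', 'baz']
```

The index table ends up holding DUMMY, 2, 1 in slots 2 to 4, as on the last slide.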
  24. New dict vs Legacy dict
    • Less memory usage
      ◦ Index can be 1 byte for a small dict
      ◦ 3 * 8 * 5 (entries) + 8 (index table) = 128 bytes
        ▪ It was 192 bytes in the legacy implementation
    • Faster iteration (dense entries)
    • Preserves insertion order
    • (cons) One more indirect memory access
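A short check of the numbers above, followed by the ordering behaviour. The variable names are made up for the example. Insertion order was a CPython implementation detail in 3.6 and became a language guarantee in Python 3.7.

```python
ENTRY_BYTES = 3 * 8                   # 3 words of 8 bytes per entry

legacy_bytes = ENTRY_BYTES * 8        # one full entry per slot
compact_bytes = ENTRY_BYTES * 5 + 8   # 5 dense entries + 8 one-byte index slots
saved_bytes = legacy_bytes - compact_bytes

# Iteration follows insertion order.
d = {}
d["foo"] = "spam"
d["bar"] = "ham"
d["baz"] = "egg"
order = list(d)

print(compact_bytes, saved_bytes, order)
```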
  25. Toward Python 3.7

  26. Working on ...
    • Remove redundant code that optimized the legacy implementation.
    • OrderedDict based on the new dict
      ◦ Remove the doubly linked list used to keep order
      ◦ About 1/2 memory usage!
      ◦ Faster creation and iteration.
      ◦ (cons) Slower .move_to_end() method
  27. We're looking for new contributors
    Contributing to Python is easier, thanks to GitHub.
    • Read the devguide (https://devguide.python.org/ )
    • Find an easy bug on https://bugs.python.org/ and fix it.
    • Review others' code
    • Translate the documentation on Transifex
      ◦ See https://docs.python.org/ja/
  30. Future ideas
    • Specialized dict for namespaces
      ◦ All keys are interned strings
      ◦ Only pointer comparison
      ◦ No "hash" in entry -> more compact
    • Implement set like dict
      ◦ Current set is larger than dict...
    • functools.lru_cache
      ◦ Use `od.move_to_end(key)` instead of a linked list
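The `lru_cache` idea can be illustrated with `collections.OrderedDict`. This is only a sketch of the technique, not how `functools.lru_cache` is implemented; the `LRUCache` class and its `get`/`put` methods are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical LRU cache that relies on OrderedDict.move_to_end()."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # evict the least recently used item


cache = LRUCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used key
cache.put("c", 3)  # evicts "b"

remaining = list(cache.data)
print(remaining)   # ['a', 'c']
```

The order kept by the dict itself replaces the separate linked list.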
  31. PEP 412: Key sharing dict

  32. PEP 412: Key sharing dict
    Introduced in Python 3.3.
    Instances of the same class can share a keys object.
  33. Example class with two instances
        class A:
            def __init__(self, a, b):
                self.foo = a
                self.bar = b

        a = A("spam", "ham")
        b = A("bacon", "egg")
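  34. Keys object held by the class, with a separate values array per instance
    Shared keys (class): entries 0: "foo" 42; 1: "bar" 52; index: slot 2 → 0; slot 4 → 1
    Instance a values: "spam", "ham"
    Instance b values: "bacon", "egg"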
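A rough way to picture this layout: one list of keys owned by the class, and one list of values per instance. The sketch below is a simplified model, not CPython's data structure, and `get_attr` is a hypothetical helper.

```python
# One keys object, shared by every instance of the class (simplified to a list).
shared_keys = ["foo", "bar"]

# Each instance stores only its values, in the same positions as the keys.
a_values = ["spam", "ham"]
b_values = ["bacon", "egg"]


def get_attr(keys, values, name):
    """Find the key's position in the shared keys, then read that position."""
    return values[keys.index(name)]


a_foo = get_attr(shared_keys, a_values, "foo")
b_bar = get_attr(shared_keys, b_values, "bar")
print(a_foo, b_bar)
```

The keys and hashes are stored once, so each extra instance costs only its values array.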
  35. Problem
    • Two instances can have different insertion orders
      ◦ Drop the key sharing dict?
        ▪ The key sharing dict can save more memory.
          • But __slots__ can be used for such cases!
        ▪ Performance improves in some microbenchmarks
          • Does it matter in real cases? __slots__?
        ▪ Needs consensus
          • It's more difficult than the implementation
  36. Keep key sharing dict support
    • Only exactly the same order is permitted
      ◦ "Skipped" keys are prohibited
      ◦ Deletion is also prohibited
    • Otherwise, stop "key sharing"
      ◦ `self.x = None` is faster than `del self.x`
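The rule on this slide can be written as a small check. It is a simplified model under stated assumptions: the shared key order is treated as fixed, and `keeps_sharing` is a hypothetical function, not part of CPython.

```python
def keeps_sharing(shared_keys, operations):
    """Return True if an instance can keep using the shared keys object.

    `operations` is a list of ("set", key) or ("del", key) tuples.
    Assumption: the order in `shared_keys` is fixed and cannot grow.
    """
    filled = 0  # how many shared keys this instance has set so far
    for op, key in operations:
        if op == "del":
            return False          # deletion is prohibited
        if key in shared_keys[:filled]:
            continue              # overwriting an existing key is fine
        if filled < len(shared_keys) and shared_keys[filled] == key:
            filled += 1           # next key in exactly the shared order
            continue
        return False              # skipped or out-of-order key
    return True


keys = ["foo", "bar"]
same_order = keeps_sharing(keys, [("set", "foo"), ("set", "bar")])
skipped = keeps_sharing(keys, [("set", "bar")])
deleted = keeps_sharing(keys, [("set", "foo"), ("set", "bar"), ("del", "foo")])
overwritten = keeps_sharing(keys, [("set", "foo"), ("set", "bar"), ("set", "foo")])

print(same_order, skipped, deleted, overwritten)
```

The last case is the `self.x = None` pattern: overwriting keeps the instance on the shared keys, while `del self.x` does not.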