Upgrade to Pro — share decks privately, control downloads, hide ads and more …

New dict implementation in Python 3.6 (KLab Tech Meetup 2017-09-04)

INADA Naoki
September 04, 2017

New dict implementation in Python 3.6 (KLab Tech Meetup 2017-09-04)

INADA Naoki

September 04, 2017
Tweet

More Decks by INADA Naoki

Other Decks in Programming

Transcript

  1. 自己紹介 @methane K-Labo, KLab Inc. Python core developer C, Go,

    Network (server) programming, MySQL clients ISUCON 6 winner (See http://isucon.net/ )
  2. Table of contents • dict in Python • Python 3.5

    implementation • Python 3.6 implementation • Toward Python 3.7
  3. Dict Key-Value storage. A.k.a. associative-array, map, hash. x = {"foo":

    42, "bar": 84} print( x["foo"] ) # => 42 Key feature: • Constant time lookup • Amortized constant time insertion • Support custom (user-defined) key type
  4. Dicts are everywhere in Python x = 5 # global

    namespace is dict. Insert 'x' to it. def add(a): # Insert 'add' to global dict return a + x # lookup 'x' from global dict print(add(7)) # search 'print' and 'add' from global dict There are many dicts in Python program. Lookup speed is critical. Insertion speed and memory usage is very important too.
  5. Key hash value 0 1 2 3 4 5 6

    7 d["foo"] = "spam" # insert new item hash("foo") = 42 # hash value is 42 42 % 8 = 2 # hash value % hash table size = 2
  6. Key hash value 0 1 2 3 4 5 6

    7 d["foo"] = "spam" hash("foo") = 42 42 % 8 = 2 "foo" 42 "spam"
  7. Key hash value 0 1 2 3 4 5 6

    7 d["bar"] = "ham" hash("bar") = 52 52 % 8 = 4 "foo" 42 "spam" "bar" 52 "ham"
  8. Key hash value 0 1 2 3 4 5 6

    7 d["baz"] = "egg" hash("baz") = 58 58 % 8 = 2 # "baz" is conflict with "foo" "foo" 42 "spam" "bar" 52 "ham"
  9. Key hash value 0 1 2 3 4 5 6

    7 "Open addressing" uses another slot in the table. (Another strategy is "chaining") For example, "linear probing" algorithm uses next entry. ※Python uses more complex probing, but I use simpler way in this example. "foo" 42 "spam" "bar" 52 "ham" "baz" 58 "egg"
  10. Key hash value 0 1 2 3 4 5 6

    7 del d["foo"] hash("foo") = 42 42 % 8 = 2 "foo" 42 "spam" "bar" 52 "ham" "baz" 58 "egg"
  11. Key hash value 0 1 2 3 4 5 6

    7 del d["foo"] hash("foo") = 42 42 % 8 = 2 "bar" 52 "ham" "baz" 58 "egg"
  12. Key hash value 0 1 2 3 4 5 6

    7 x = d["baz"] hash("baz") = 58 58 % 8 = 2 (!!?) "bar" 52 "ham" "baz" 58 "egg"
  13. Key hash value 0 1 2 3 4 5 6

    7 del d["foo"] remains DUMMY key "bar" 52 "ham" "baz" 58 "egg" DUMMY
  14. Key hash value 0 1 2 3 4 5 6

    7 x = d["baz"] hash("baz") = 58 58 % 8 = 2 (conflict with dummy, then linear probing) "bar" 52 "ham" "baz" 58 "egg" DUMMY
  15. Problems in classical open addressing hash table • Large memory

    usage ◦ At least 1/3 of entries are empty ▪ Otherwise, "probing" can be too slow ◦ One entry uses 3 words ▪ word = 8 bytes on recent machine ◦ minimum size = 192 byte ▪ 8 (byte/word) * 3 (word/entry) * 8 (table width)
  16. Key hash value 0 1 2 3 4 5 6

    7 d["foo"] = "spam" # hash("foo") = 42, 42 % 8 = 2 "foo" 42 "spam" 0 index
  17. Key hash value 0 1 2 3 4 5 6

    7 d["foo"] = "spam" d["bar"] = "ham" # hash("bar") = 52 , 52 % 8 = 4 "bar" 52 "ham" "foo" 42 "spam" 0 1 index
  18. Key hash value 0 1 2 3 4 5 6

    7 d["foo"] = "spam" d["bar"] = "ham" d["baz"] = "egg" del d["foo"] "bar" 52 "ham" "baz" 58 "egg" DUMMY 2 1 index
  19. • Less memory usage ◦ Index can be 1 byte

    for small dict ◦ 3*8 *5 (entries) + 8 (index table) = 128 bytes ▪ It was 192 bytes in legacy implementation • Faster iteration (dense entries) • Preserve insertion order • (cons) One more indirect memory access New dict vs Legacy dict
  20. Working on ... • Remove redundant code for optimize legacy

    implementation. • OrderedDict based on New dict ◦ Remove doubly linked list used for keep order ◦ About 1/2 memory usage! ◦ Faster creation and iterating. ◦ (cons) Slower .move_to_end() method
  21. We're finding new contributors Contributing to Python is easier, thanks

    to Github. • Read devguide (https://devguide.python.org/ ) • Find easy bug on https://bugs.python.org/ and fix it. • Review other's code • Translate document on Transifex ◦ See https://docs.python.org/ja/
  22. Future ideas • specialized dict for namespace ◦ all keys

    are interned string ◦ only pointer comparison ◦ no "hash" in entry -> more compact • Implement set like dict ◦ current set is larger than dict... • functools.lru_cache ◦ Use `od.move_to_end(key)`, instead of linked list
  23. class A: def __init__(self, a, b): self.foo = a self.bar

    = b a = A("spam", "ham") b = A("bacon", "egg")
  24. Key Class value 0 1 2 3 4 5 6

    7 "bar" 52 "foo" 42 0 1 index "ham" "spam" values "egg" "bacon" values instance instance
  25. Problem • Two instances can have different insertion order ◦

    drop key sharing dict? ▪ key sharing dict can save more memory. • But __slots__ can be used for such cases! ▪ performance improvements in some microbench • Is it matter for real case? __slots__? ▪ Needs consensus • it's more difficult than implementation
  26. Keep key sharing dict support • Only exactly same order

    can be permitted ◦ "skipped" keys are prohibited ◦ deletion is also prohibited • Otherwise, stop "key sharing" ◦ `self.x = None` is faster than `del self.x`