New dict implementation in Python 3.6

INADA Naoki
January 31, 2017

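I had planned to introduce the new dict implementation in Python 3.6 and talk about the difficulties along the way.
The parts I intended to explain verbally are missing from the slides, so if you are curious, please also refer to this blog post.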
http://dsas.blog.klab.org/archives/python-compact-dict.html


Transcript

  1. New dict implementation in Python 3.6

    Python 3.6 release party
    Inada Naoki (@methane)
  2. 稲田 直哉 (Inada Naoki) @methane

    KLab Inc. (We're hiring)
    CPython core developer
    C, Go, Network, MySQL clients
  3. Table of contents

    Open addressing hash table
    New implementation
    Key sharing dict (PEP 412)
    Future idea
  4. Open addressing hash table

  5. d["foo"] = "spam"

    hash("foo") = 42
    42 % 8 = 2

    Table (slots 0-7; columns: key, hash, value):
      (all slots empty)
  6. d["foo"] = "spam"

    hash("foo") = 42
    42 % 8 = 2

    Table (slots 0-7; columns: key, hash, value):
      slot 2: "foo", 42, "spam"
  7. d["bar"] = "ham"

    hash("bar") = 52
    52 % 8 = 4

    Table (slots 0-7; columns: key, hash, value):
      slot 2: "foo", 42, "spam"
      slot 4: "bar", 52, "ham"
  8. d["baz"] = "egg"

    hash("baz") = 58
    58 % 8 = 2 (conflict!)

    Table (slots 0-7; columns: key, hash, value):
      slot 2: "foo", 42, "spam"
      slot 4: "bar", 52, "ham"
  9. "probing"

    "Linear probing" uses the next entry.
    (CPython uses "5i + 1" | (hash >> 5) probing, but this example uses the simpler way.)

    Table (slots 0-7; columns: key, hash, value):
      slot 2: "foo", 42, "spam"
      slot 3: "baz", 58, "egg"
      slot 4: "bar", 52, "ham"
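
The probing note on slide 9 can be made concrete with a small sketch. It assumes the recurrence described in the comments of CPython's `dictobject.c`: the next slot is `5*i + 1 + perturb`, where `perturb` starts as the hash and is shifted right by 5 bits at every step. The function name `probe_sequence` is made up for illustration, and the exact ordering of the shift and the update has differed between CPython versions, so this is a model rather than the real implementation.

```python
from itertools import islice


def probe_sequence(hash_value, table_size):
    """Yield slot numbers in a CPython-style perturbed probe order.

    Assumes hash_value >= 0 and table_size is a power of two.
    """
    mask = table_size - 1
    perturb = hash_value
    i = hash_value & mask
    while True:
        yield i
        # Mix the upper hash bits in, 5 bits at a time.
        perturb >>= 5
        # Once perturb reaches 0 this is i = 5*i + 1 (mod table_size),
        # which visits every slot of a power-of-two table.
        i = (5 * i + perturb + 1) & mask


# First slots tried for hash("baz") = 58 in an 8-slot table.
first_slots = list(islice(probe_sequence(58, 8), 4))
```

Unlike linear probing, two keys that collide on their first slot usually diverge on the next one, because their upper hash bits differ.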
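  10. del d["foo"]

    hash("foo") = 42
    42 % 8 = 2

    Table (slots 0-7; columns: key, hash, value):
      slot 2: "foo", 42, "spam"
      slot 3: "baz", 58, "egg"
      slot 4: "bar", 52, "ham"
  11. del d["foo"]

    hash("foo") = 42
    42 % 8 = 2

    Table (slots 0-7; columns: key, hash, value):
      slot 3: "baz", 58, "egg"
      slot 4: "bar", 52, "ham"
  12. x = d["baz"]

    hash("baz") = 58
    58 % 8 = 2 (!!?)
    Slot 2 is now empty, so the search would stop there without reaching "baz" in slot 3.

    Table (slots 0-7; columns: key, hash, value):
      slot 3: "baz", 58, "egg"
      slot 4: "bar", 52, "ham"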
  13. del d["foo"] leaves a DUMMY key

    Table (slots 0-7; columns: key, hash, value):
      slot 2: DUMMY
      slot 3: "baz", 58, "egg"
      slot 4: "bar", 52, "ham"
  14. x = d["baz"]

    hash("baz") = 58
    58 % 8 = 2 (conflict with dummy, then linear probing)

    Table (slots 0-7; columns: key, hash, value):
      slot 2: DUMMY
      slot 3: "baz", 58, "egg"
      slot 4: "bar", 52, "ham"
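
The walkthrough on slides 5 to 14 can be replayed with a minimal sketch of an open addressing table that uses linear probing and DUMMY markers. The class `LinearProbingTable` and its methods are hypothetical names for illustration; hashes are passed in explicitly so that the values from the slides (42, 52, 58) can be reused, and resizing is left out.

```python
EMPTY = None        # slot that has never been used
DUMMY = object()    # marker left behind by a deletion


class LinearProbingTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, h):
        # Linear probing: start at h % size, then try the next slot.
        size = len(self.slots)
        for step in range(size):
            yield (h + step) % size

    def insert(self, key, h, value):
        first_dummy = None
        for i in self._probe(h):
            slot = self.slots[i]
            if slot is EMPTY:
                # Key is absent; reuse an earlier DUMMY slot if we saw one.
                target = i if first_dummy is None else first_dummy
                self.slots[target] = (key, h, value)
                return target
            if slot is DUMMY:
                if first_dummy is None:
                    first_dummy = i
            elif slot[0] == key:
                self.slots[i] = (key, h, value)
                return i
        if first_dummy is not None:
            self.slots[first_dummy] = (key, h, value)
            return first_dummy
        raise RuntimeError("table is full")

    def find(self, key, h):
        for i in self._probe(h):
            slot = self.slots[i]
            if slot is EMPTY:
                return None          # a truly empty slot ends the search
            if slot is not DUMMY and slot[0] == key:
                return i
            # DUMMY or a different key: keep probing
        return None

    def delete(self, key, h):
        i = self.find(key, h)
        if i is None:
            raise KeyError(key)
        self.slots[i] = DUMMY        # not EMPTY, so later probes continue


table = LinearProbingTable()
table.insert("foo", 42, "spam")
table.insert("bar", 52, "ham")
table.insert("baz", 58, "egg")      # collides with "foo" at slot 2
table.delete("foo", 42)
baz_slot = table.find("baz", 58)    # still reachable thanks to DUMMY
```

If `delete` stored `EMPTY` instead of `DUMMY`, the final `find` would stop at slot 2 and miss "baz", which is the problem slide 12 points at.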
  15. Problems in classical open addressing hash table

    • Large memory usage
      ◦ at least 1/3 of entries are empty
        ▪ otherwise, "probing" can be too slow
      ◦ one entry uses 3 words (24 bytes on a 64-bit machine)
      ◦ 8 * 8 * 3 = 192 bytes for the minimum dict
  16. New dict implementation

  17. Compact dict

    The original idea is from Raymond Hettinger.
    PyPy implements it with some customizations.
    https://morepypy.blogspot.jp/2015/01/faster-more-memory-efficient-and-more.html
    CPython 3.6's implementation is almost the same as PyPy's.
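  18. d["foo"] = "spam"; d["bar"] = "ham"

    Index table (slots 0-7):
      slot 2: 0
      slot 4: 1

    Entries (columns: key, hash, value):
      0: "foo", 42, "spam"
      1: "bar", 52, "ham"
  19. d["foo"] = "spam"; d["bar"] = "ham"; d["baz"] = "egg"; del d["foo"]

    Index table (slots 0-7):
      slot 2: DUMMY
      slot 3: 2
      slot 4: 1

    Entries (columns: key, hash, value):
      0: (deleted)
      1: "bar", 52, "ham"
      2: "baz", 58, "egg"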
  20. Pros and cons

    • Less memory usage
      ◦ index can be 1 byte for size < 255
      ◦ 3 * 8 * 5 + 8 = 128 bytes (was 192 bytes)
    • Faster iteration
    • Keeps insertion order
    • (cons) One more lookup stage
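
The layout on slides 18 to 20 can be sketched as a small index table plus a dense entries list. The class `CompactDict` and the constants `FREE` and `DUMMY` are hypothetical names for illustration; hashes are passed in explicitly to match the slides, linear probing stands in for CPython's real probe sequence, and resizing is omitted.

```python
FREE = -1     # index slot that has never been used
DUMMY = -2    # index slot whose entry was deleted


class CompactDict:
    def __init__(self, size=8):
        self.indices = [FREE] * size   # small integers, one per slot
        self.entries = []              # (key, hash, value), insertion order

    def _lookup(self, key, h):
        """Return (slot, entry_index); entry_index is None if key is absent."""
        size = len(self.indices)
        reusable = None
        for step in range(size):
            slot = (h + step) % size
            ix = self.indices[slot]
            if ix == FREE:
                return (slot if reusable is None else reusable), None
            if ix == DUMMY:
                if reusable is None:
                    reusable = slot
            elif self.entries[ix][0] == key:   # the extra lookup stage
                return slot, ix
        if reusable is not None:
            return reusable, None
        raise RuntimeError("index table is full")

    def set(self, key, h, value):
        slot, ix = self._lookup(key, h)
        if ix is None:
            self.indices[slot] = len(self.entries)
            self.entries.append((key, h, value))
        else:
            self.entries[ix] = (key, h, value)

    def get(self, key, h):
        _, ix = self._lookup(key, h)
        if ix is None:
            raise KeyError(key)
        return self.entries[ix][2]

    def delete(self, key, h):
        slot, ix = self._lookup(key, h)
        if ix is None:
            raise KeyError(key)
        self.indices[slot] = DUMMY
        self.entries[ix] = None        # leave a hole; order is preserved

    def keys(self):
        # Iteration walks the dense entries list, not the sparse table.
        return [e[0] for e in self.entries if e is not None]


d = CompactDict()
d.set("foo", 42, "spam")
d.set("bar", 52, "ham")
d.set("baz", 58, "egg")
d.delete("foo", 42)
remaining = d.keys()
```

After these calls `d.indices` holds DUMMY at slot 2, entry number 2 at slot 3 and entry number 1 at slot 4, which matches slide 19. Iteration is fast and ordered because it only walks `entries`.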
  21. PEP 412: Key sharing dict

  22. PEP 412: Key sharing dict

    Introduced in Python 3.3
    Instances of the same class can share a keys object
  23. class A:
          def __init__(self, a, b):
              self.foo = a
              self.bar = b

      a = A("spam", "ham")
      b = A("bacon", "egg")
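  24. Keys belong to the class, values to each instance

    Class:
      Index table (slots 0-7):
        slot 2: 0
        slot 4: 1
      Keys (columns: key, hash):
        0: "foo", 42
        1: "bar", 52

    Instance a values: "spam", "ham"
    Instance b values: "bacon", "egg"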
  25. Problem

    • Two instances can have different insertion order
      ◦ drop key sharing dict?
        ▪ key sharing dict can save more memory.
          • But __slots__ can be used for such cases!
        ▪ performance improvements in some microbenchmarks
          • Does it matter in real cases? __slots__?
        ▪ Needs consensus
          • it's more difficult than the implementation
  26. Keep key sharing dict support

    • Only exactly the same insertion order is permitted
      ◦ "skipped" keys are prohibited
      ◦ deletion is also prohibited
    • Otherwise, stop "key sharing"
      ◦ `self.x = None` is faster than `del self.x`
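
The rules on slide 26 can be modelled with a toy sketch. `SharedKeys` and `InstanceDict` are hypothetical classes, not CPython's data structures, and the model is deliberately simplified (for example, any instance that has filled all shared keys may append a new one). It only shows when an instance keeps sharing and when it falls back to its own dict.

```python
class SharedKeys:
    """Key table owned by the class: attribute names in insertion order."""

    def __init__(self):
        self.order = []


class InstanceDict:
    def __init__(self, shared):
        self.shared = shared    # SharedKeys, or None once sharing stopped
        self.values = []        # values only, positions follow shared.order
        self.own = None         # private dict used after sharing stopped

    def set(self, key, value):
        if self.shared is None:
            self.own[key] = value
            return
        keys = self.shared.order
        n = len(self.values)
        if key in keys[:n]:
            # Already set on this instance: overwrite in place.
            self.values[keys.index(key)] = value
        elif n < len(keys) and keys[n] == key:
            # Next key in exactly the shared order.
            self.values.append(value)
        elif n == len(keys):
            # New key appended at the end of the shared order.
            keys.append(key)
            self.values.append(value)
        else:
            # Skipped key or different order: stop key sharing.
            self._unshare()
            self.own[key] = value

    def delete(self, key):
        # Deletion is not allowed while sharing, so stop sharing first.
        self._unshare()
        del self.own[key]

    def _unshare(self):
        if self.shared is not None:
            self.own = dict(zip(self.shared.order, self.values))
            self.shared = None
            self.values = None


shared = SharedKeys()
a = InstanceDict(shared)
a.set("foo", "spam")
a.set("bar", "ham")
b = InstanceDict(shared)
b.set("foo", "bacon")
b.set("bar", "egg")        # same order as a: still shared
c = InstanceDict(shared)
c.set("bar", "oops")       # skips "foo": c stops sharing
```

In this model, assigning `None` to an attribute keeps the instance in shared mode, while `delete` always forces the fallback, which is one way to read the last bullet on slide 26.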
  27. Future ideas

  28. Ideas that will be tried later...

    • specialized dict for namespace
      ◦ all keys are interned strings
      ◦ only pointer comparison
      ◦ no "hash" in entry -> more compact
    • OrderedDict based on new dict
      ◦ no more doubly linked list
      ◦ `od.move_to_end(k, last=False)` is difficult, but it's possible
    • functools.lru_cache
      ◦ no more doubly linked list
      ◦ Using `od.move_to_end(key)`
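
The lru_cache idea on slide 28 can be illustrated with a minimal sketch built on `od.move_to_end(key)`. It uses today's `collections.OrderedDict` as a stand-in for the dict-based version the slide proposes, and the class `LRUCache` is a hypothetical name; it is not how `functools.lru_cache` is implemented.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()   # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)   # evict least recently used


cache = LRUCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used key
cache.put("c", 3)    # evicts "b"
cached_keys = list(cache.data)
```

All of the ordering work is done by `move_to_end` and `popitem(last=False)`, so no separate doubly linked list has to be maintained by the cache itself.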
  29. We're moving to GitHub! New contributors are welcome!