Slide 1

Slide 1 text

New dict implementation in Python 3.6
Python 3.6 release party
Inada Naoki (@methane)

Slide 2

Slide 2 text

稲田 直哉 (Inada Naoki)
@methane
KLab Inc. (We're hiring)
CPython core developer
C, Go, Network, MySQL clients

Slide 3

Slide 3 text

Table of contents
● Open addressing hash table
● New implementation
● Key sharing dict (PEP 412)
● Future idea

Slide 4

Slide 4 text

Open addressing hash table

Slide 5

Slide 5 text

d["foo"] = "spam"
hash("foo") = 42
42 % 8 = 2

Table (slots 0 to 7; columns: key, hash, value): all slots empty

Slide 6

Slide 6 text

d["foo"] = "spam"
hash("foo") = 42
42 % 8 = 2

Table (slots 0 to 7; columns: key, hash, value):
slot 2: "foo", 42, "spam"

Slide 7

Slide 7 text

d["bar"] = "ham"
hash("bar") = 52
52 % 8 = 4

Table (slots 0 to 7; columns: key, hash, value):
slot 2: "foo", 42, "spam"
slot 4: "bar", 52, "ham"

Slide 8

Slide 8 text

d["baz"] = "egg"
hash("baz") = 58
58 % 8 = 2 (conflict!)

Table (slots 0 to 7; columns: key, hash, value):
slot 2: "foo", 42, "spam"
slot 4: "bar", 52, "ham"

Slide 9

Slide 9 text

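"probing"
"Linear probing" uses the next entry.
(CPython actually probes with i = 5*i + 1 + perturb, where perturb starts as the hash and is shifted right by 5 bits at each step, but this example uses the simpler way.)

Table (slots 0 to 7; columns: key, hash, value):
slot 2: "foo", 42, "spam"
slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"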
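The insertions on the previous slides can be sketched in a few lines. This is a minimal illustration, not CPython's code: `insert_linear` and `perturbed_probe` are hypothetical names, the hash values 42, 52 and 58 are the made-up ones from the slides, and the perturbed sequence only approximates the recurrence CPython uses (it assumes a non-negative hash and a power-of-two table size).

```python
EMPTY = None


def insert_linear(table, key, hash_value, value):
    """Insert into a fixed-size table with linear probing; return the slot used."""
    size = len(table)
    i = hash_value % size
    for _ in range(size):
        slot = table[i]
        if slot is EMPTY or slot[0] == key:
            table[i] = (key, hash_value, value)
            return i
        i = (i + 1) % size  # conflict: try the next slot
    raise RuntimeError("table is full")


def perturbed_probe(hash_value, size, count):
    """Yield the first `count` slots of a CPython-style perturbed probe sequence."""
    mask = size - 1  # size is assumed to be a power of two
    perturb = hash_value
    i = hash_value & mask
    for _ in range(count):
        yield i
        perturb >>= 5
        i = (5 * i + perturb + 1) & mask


table = [EMPTY] * 8
slots = [
    insert_linear(table, "foo", 42, "spam"),
    insert_linear(table, "bar", 52, "ham"),
    insert_linear(table, "baz", 58, "egg"),
]
print(slots)  # [2, 4, 3]
print(list(perturbed_probe(58, 8, 4)))  # [2, 4, 5, 2]
```

With linear probing "baz" lands in slot 3, right after the conflicting slot 2. The perturbed sequence jumps around instead, which tends to break up long runs of occupied slots.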

Slide 10

Slide 10 text

del d["foo"]
hash("foo") = 42
42 % 8 = 2

Table (slots 0 to 7; columns: key, hash, value):
slot 2: "foo", 42, "spam"
slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"

Slide 11

Slide 11 text

del d["foo"]
hash("foo") = 42
42 % 8 = 2

Table (slots 0 to 7; columns: key, hash, value):
slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"

Slide 12

Slide 12 text

x = d["baz"]
hash("baz") = 58
58 % 8 = 2 (!!?)

Table (slots 0 to 7; columns: key, hash, value):
slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"

Slide 13

Slide 13 text

del d["foo"] leaves a DUMMY key behind

Table (slots 0 to 7; columns: key, hash, value):
slot 2: DUMMY
slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"

Slide 14

Slide 14 text

x = d["baz"]
hash("baz") = 58
58 % 8 = 2 (conflict with dummy, then linear probing)

Table (slots 0 to 7; columns: key, hash, value):
slot 2: DUMMY
slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"
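The deletion problem and the DUMMY fix can be reproduced with a small model. This is a sketch under the slides' simplifications (linear probing, hard-coded hash values); the class `LinearProbingTable` and its `tombstone` flag are hypothetical and exist only to contrast the two behaviours.

```python
EMPTY = None
DUMMY = object()  # marks a slot whose entry was deleted


class LinearProbingTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, hash_value):
        size = len(self.slots)
        for step in range(size):
            yield (hash_value + step) % size

    def insert(self, key, hash_value, value):
        free = None
        for i in self._probe(hash_value):
            slot = self.slots[i]
            if slot is EMPTY:
                if free is None:
                    free = i
                break
            if slot is DUMMY:
                if free is None:
                    free = i  # reusable, but the key may still be further on
            elif slot[0] == key:
                free = i  # overwrite the existing entry
                break
        if free is None:
            raise RuntimeError("table is full")
        self.slots[free] = (key, hash_value, value)

    def find(self, key, hash_value):
        for i in self._probe(hash_value):
            slot = self.slots[i]
            if slot is EMPTY:
                return None  # an empty slot ends the probe chain
            if slot is not DUMMY and slot[0] == key:
                return i
        return None

    def delete(self, key, hash_value, tombstone=True):
        i = self.find(key, hash_value)
        if i is None:
            raise KeyError(key)
        self.slots[i] = DUMMY if tombstone else EMPTY


def build():
    t = LinearProbingTable()
    t.insert("foo", 42, "spam")
    t.insert("bar", 52, "ham")
    t.insert("baz", 58, "egg")
    return t


naive = build()
naive.delete("foo", 42, tombstone=False)
print(naive.find("baz", 58))  # None: the chain is broken at slot 2

fixed = build()
fixed.delete("foo", 42)
print(fixed.find("baz", 58))  # 3: probing continues past the DUMMY
```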

Slide 15

Slide 15 text

Problems in classical open addressing hash table
● Large memory usage
  ○ at least 1/3 of entries are empty
    ■ otherwise, "probing" can be too slow
  ○ one entry uses 3 words (24 bytes on a 64-bit machine)
  ○ 8 * 8 * 3 = 192 bytes for minimum dict

Slide 16

Slide 16 text

New dict implementation

Slide 17

Slide 17 text

Compact dict
The original idea is from Raymond Hettinger.
PyPy implements it with some customizations.
https://morepypy.blogspot.jp/2015/01/faster-more-memory-efficient-and-more.html
CPython 3.6's implementation is almost the same as PyPy's.

Slide 18

Slide 18 text

d["foo"] = "spam"
d["bar"] = "ham"

Index table (slots 0 to 7):
slot 2: 0
slot 4: 1

Entries (columns: key, hash, value):
entry 0: "foo", 42, "spam"
entry 1: "bar", 52, "ham"

Slide 19

Slide 19 text

d["foo"] = "spam"
d["bar"] = "ham"
d["baz"] = "egg"
del d["foo"]

Index table (slots 0 to 7):
slot 2: DUMMY
slot 3: 2
slot 4: 1

Entries (columns: key, hash, value):
entry 0: (deleted)
entry 1: "bar", 52, "ham"
entry 2: "baz", 58, "egg"

Slide 20

Slide 20 text

Pros and cons
● Less memory usage
  ○ index can be 1 byte for size < 255
  ○ 3 * 8 * 5 + 8 = 128 bytes (was 192 bytes)
● Faster iteration
● Keeps insertion order
● (cons) One more lookup stage
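The compact layout can be modelled with a sparse index list and a dense entries list. This is a simplified sketch, not the CPython implementation: `CompactDict`, `FREE` and `DUMMY` are illustrative names, hashes are passed in by hand as on the slides, and linear probing stands in for the real probe sequence.

```python
FREE = -1   # index slot never used
DUMMY = -2  # index slot whose entry was deleted


class CompactDict:
    def __init__(self, size=8):
        self.indices = [FREE] * size  # sparse table of small integers
        self.entries = []             # dense (key, hash, value) list, insertion order

    def _lookup(self, key, hash_value):
        """Return (index slot, entry number or None if the key is absent)."""
        size = len(self.indices)
        first_free = None
        for step in range(size):
            slot = (hash_value + step) % size
            ix = self.indices[slot]
            if ix == FREE:
                return (slot if first_free is None else first_free), None
            if ix == DUMMY:
                if first_free is None:
                    first_free = slot
            elif self.entries[ix][0] == key:
                return slot, ix
        if first_free is None:
            raise RuntimeError("index table is full")
        return first_free, None

    def set(self, key, hash_value, value):
        slot, ix = self._lookup(key, hash_value)
        if ix is None:
            self.indices[slot] = len(self.entries)
            self.entries.append((key, hash_value, value))
        else:
            self.entries[ix] = (key, hash_value, value)

    def delete(self, key, hash_value):
        slot, ix = self._lookup(key, hash_value)
        if ix is None:
            raise KeyError(key)
        self.indices[slot] = DUMMY
        self.entries[ix] = None  # later entries keep their positions

    def items(self):
        # Iteration walks the dense list only, so it is fast and ordered.
        return [(e[0], e[2]) for e in self.entries if e is not None]


d = CompactDict()
d.set("foo", 42, "spam")
d.set("bar", 52, "ham")
d.set("baz", 58, "egg")
d.delete("foo", 42)
print(d.indices)  # [-1, -1, -2, 2, 1, -1, -1, -1]
print(d.items())  # [('bar', 'ham'), ('baz', 'egg')]

# Memory arithmetic from the slides, for a minimum dict on a 64-bit machine.
WORD = 8
old_size = 8 * 3 * WORD          # 8 slots of 3 words each
new_size = 3 * WORD * 5 + 8 * 1  # 5 usable entries plus 8 one-byte indices
print(old_size, new_size)  # 192 128
```

Iterating over `entries` instead of the sparse table is what makes iteration faster and insertion-ordered. The extra hop from `indices` to `entries` is the additional lookup stage listed as the cost.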

Slide 21

Slide 21 text

PEP 412: Key sharing dict

Slide 22

Slide 22 text

PEP 412: Key sharing dict
Introduced in Python 3.3
Instances of the same class can share a keys object

Slide 23

Slide 23 text

class A:
    def __init__(self, a, b):
        self.foo = a
        self.bar = b

a = A("spam", "ham")
b = A("bacon", "egg")

Slide 24

Slide 24 text

Class (shared keys object)
Index table (slots 0 to 7):
slot 2: 0
slot 4: 1

Keys (columns: key, hash):
entry 0: "foo", 42
entry 1: "bar", 52

Instance a, values: "spam", "ham"
Instance b, values: "bacon", "egg"
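The split between one keys object per class and one values array per instance can be imitated in plain Python. This only models the layout from the slide. The names `shared_keys`, `values_a`, `values_b` and `get_attr` are made up for the illustration, and the real sharing happens inside the interpreter, where it is not directly visible from Python code.

```python
class A:
    def __init__(self, a, b):
        self.foo = a
        self.bar = b


a = A("spam", "ham")
b = A("bacon", "egg")

# Model: the keys are stored once (conceptually on the class) ...
shared_keys = list(vars(a))  # ['foo', 'bar']

# ... and each instance only carries its values, in the same order.
values_a = [vars(a)[k] for k in shared_keys]
values_b = [vars(b)[k] for k in shared_keys]


def get_attr(values, name):
    """Look up an attribute through the shared keys (illustrative helper)."""
    return values[shared_keys.index(name)]


print(values_a)                   # ['spam', 'ham']
print(values_b)                   # ['bacon', 'egg']
print(get_attr(values_b, "bar"))  # egg
```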

Slide 25

Slide 25 text

Problem
● Two instances can have different insertion order
  ○ drop key sharing dict?
    ■ key sharing dict can save more memory.
      ● But __slots__ can be used for such cases!
    ■ performance improvements in some microbenchmarks
      ● Does it matter in real cases? __slots__?
    ■ Needs consensus
      ● it's more difficult than implementation

Slide 26

Slide 26 text

Keep key sharing dict support
● Only exactly the same order is permitted
  ○ "skipped" keys are prohibited
  ○ deletion is also prohibited
● Otherwise, stop "key sharing"
  ○ `self.x = None` is faster than `del self.x`
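The rules above can be expressed as a toy model. `SharedKeys` and `InstanceDict` are hypothetical classes written for this illustration and do not mirror CPython's internals; in particular, letting any instance extend the shared keys is a simplifying assumption.

```python
class SharedKeys:
    """Attribute names in insertion order, shared by the instances of one class."""

    def __init__(self):
        self.names = []


class InstanceDict:
    def __init__(self, shared):
        self.shared = shared  # becomes None once sharing has stopped
        self.values = []      # used while sharing
        self.own = None       # ordinary dict used after sharing stops

    @property
    def is_sharing(self):
        return self.shared is not None

    def _unshare(self):
        self.own = dict(zip(self.shared.names, self.values))
        self.shared = None
        self.values = None

    def set(self, name, value):
        if self.shared is not None:
            names = self.shared.names
            n = len(self.values)
            if name in names[:n]:
                self.values[names.index(name)] = value  # update in place
                return
            if n == len(names):
                names.append(name)  # extend the shared order
            if names[n] == name:
                self.values.append(value)  # exactly the same order: allowed
                return
            self._unshare()  # a key would be skipped: stop sharing
        self.own[name] = value

    def delete(self, name):
        if self.shared is not None:
            self._unshare()  # deletion is not allowed while sharing
        del self.own[name]


keys = SharedKeys()

a = InstanceDict(keys)
a.set("foo", "spam")
a.set("bar", "ham")

skipper = InstanceDict(keys)
skipper.set("bar", "egg")      # skips "foo"

deleter = InstanceDict(keys)
deleter.set("foo", 1)
deleter.set("bar", 2)
deleter.delete("foo")          # like `del self.foo`

clearer = InstanceDict(keys)
clearer.set("foo", 1)
clearer.set("bar", 2)
clearer.set("foo", None)       # like `self.foo = None`

print(a.is_sharing, skipper.is_sharing, deleter.is_sharing, clearer.is_sharing)
# True False False True
```

In this model, assigning `None` keeps the instance on the shared keys, while deleting forces a private dict. That matches the slide's hint that `self.x = None` is the cheaper choice.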

Slide 27

Slide 27 text

Future ideas

Slide 28

Slide 28 text

Ideas that will be tried later...
● specialized dict for namespace
  ○ all keys are interned strings
  ○ only pointer comparison
  ○ no "hash" in entry -> more compact
● OrderedDict based on new dict
  ○ no more doubly linked list
  ○ `od.move_to_end(k, last=False)` is difficult, but it's possible
● functools.lru_cache
  ○ no more doubly linked list
  ○ Using `od.move_to_end(key)`
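The last idea, an LRU cache built on `od.move_to_end(key)`, can be sketched with `collections.OrderedDict` as it exists today. The `LRUCache` class below is a hypothetical illustration of the approach, not the code of `functools.lru_cache`.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: the most recently used key sits at the end."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # evict the least recently used key


cache = LRUCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used key
cache.put("c", 3)  # evicts "b"
print(list(cache.data))  # ['a', 'c']
```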

Slide 29

Slide 29 text

We're moving to GitHub! New contributors are welcome!