New dict implementation in Python 3.6 (KLab Tech Meetup 2017-09-04)

New dict implementation in Python 3.6 Inada Naoki (@methane)

About me
@methane
K-Labo, KLab Inc.
Python core developer
C, Go, network (server) programming, MySQL clients
ISUCON 6 winner (see http://isucon.net/ )

Table of contents
• Dict in Python
• Python 3.5 implementation
• Python 3.6 implementation
• Toward Python 3.7

Dict in Python

Dict
Key-value storage, a.k.a. associative array, map, or hash.

    x = {"foo": 42, "bar": 84}
    print(x["foo"])  # => 42

Key features:
• Constant-time lookup
• Amortized constant-time insertion
• Support for custom (user-defined) key types
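As a minimal sketch of the "custom key type" feature: a class can be used as a dict key when it defines `__eq__` and `__hash__` consistently. The `Point` class and the `names` dict are hypothetical names used only for illustration.

```python
class Point:
    """Hypothetical user-defined key type (not from the talk)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # Keys that compare equal must also produce equal hashes.
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


names = {Point(0, 0): "origin"}
names[Point(1, 2)] = "p"       # insertion
found = names[Point(0, 0)]     # lookup with an equal, but distinct, object
print(found)                   # origin
```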

Dicts are everywhere in Python

    x = 5              # The global namespace is a dict; this inserts 'x' into it.

    def add(a):        # Inserts 'add' into the global dict.
        return a + x   # Looks up 'x' in the global dict.

    print(add(7))      # Searches the global dict for 'print' and 'add'
                       # ('print' is then found in builtins).

There are many dicts in a Python program. Lookup speed is critical. Insertion speed and memory usage are very important too.
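One way to see that the global namespace really is a dict is to run the same code with an explicit dict passed to `exec`. This is only a sketch; the names `source`, `namespace`, `first` and `second` are made up for the example.

```python
source = """
x = 5

def add(a):
    return a + x
"""

namespace = {}            # an ordinary dict used as the global namespace
exec(source, namespace)   # running the code inserts 'x' and 'add' into it

add = namespace["add"]
user_names = sorted(k for k in namespace if not k.startswith("__"))
print(user_names)         # ['add', 'x']

first = add(7)            # looks up 'x' in namespace
print(first)              # 12

namespace["x"] = 100      # rebinding the global is just a dict store
second = add(7)
print(second)             # 107
```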

Python 3.5 implementation

Hash table with 8 slots (0-7); each entry holds key, hash, and value. All slots are empty.
d["foo"] = "spam"   # insert new item
hash("foo") = 42    # hash value is 42
42 % 8 = 2          # hash value % hash table size = 2

d["foo"] = "spam"
hash("foo") = 42
42 % 8 = 2
Slot 2: "foo", 42, "spam"

d["bar"] = "ham"
hash("bar") = 52
52 % 8 = 4
Slot 2: "foo", 42, "spam"
Slot 4: "bar", 52, "ham"

d["baz"] = "egg"
hash("baz") = 58
58 % 8 = 2          # "baz" conflicts with "foo"
Slot 2: "foo", 42, "spam"
Slot 4: "bar", 52, "ham"

"Open addressing" uses another slot in the table. (Another strategy is "chaining".)
For example, the "linear probing" algorithm uses the next entry.
※ Python uses more complex probing, but I use a simpler way in this example.
Slot 2: "foo", 42, "spam"
Slot 3: "baz", 58, "egg"
Slot 4: "bar", 52, "ham"

del d["foo"]
hash("foo") = 42
42 % 8 = 2
Slot 2: "foo", 42, "spam"
Slot 3: "baz", 58, "egg"
Slot 4: "bar", 52, "ham"

del d["foo"]
hash("foo") = 42
42 % 8 = 2
Slot 2: (empty)
Slot 3: "baz", 58, "egg"
Slot 4: "bar", 52, "ham"

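x = d["baz"]
hash("baz") = 58
58 % 8 = 2 (!!?)    # slot 2 is empty, so the lookup would stop before reaching "baz" in slot 3
Slot 2: (empty)
Slot 3: "baz", 58, "egg"
Slot 4: "bar", 52, "ham"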

del d["foo"] leaves a DUMMY key behind
Slot 2: DUMMY
Slot 3: "baz", 58, "egg"
Slot 4: "bar", 52, "ham"

x = d["baz"]
hash("baz") = 58
58 % 8 = 2 (conflicts with the dummy, then linear probing)
Slot 2: DUMMY
Slot 3: "baz", 58, "egg"
Slot 4: "bar", 52, "ham"
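The walkthrough above can be replayed with a toy table. This is a simplified sketch, not CPython's code: it uses plain linear probing (as the slides do), never resizes, and takes the hash values 42, 52 and 58 from the slides through the hypothetical `SLIDE_HASH` mapping instead of calling `hash()`.

```python
EMPTY = None
DUMMY = object()   # tombstone left behind by a deletion

# Hash values copied from the slides; real hash() results differ.
SLIDE_HASH = {"foo": 42, "bar": 52, "baz": 58}


class LinearProbingTable:
    def __init__(self, size=8):
        self.size = size
        self.slots = [EMPTY] * size   # a used slot holds (hash, key, value)

    def _probe(self, h):
        # Visit slot h % size, then the following slots, wrapping around once.
        start = h % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def insert(self, key, value):
        h = SLIDE_HASH[key]
        free = None
        for i in self._probe(h):
            slot = self.slots[i]
            if slot is EMPTY:
                if free is None:
                    free = i
                break
            if slot is DUMMY:
                if free is None:
                    free = i          # reusable, but keep looking for the key
            elif slot[1] == key:
                self.slots[i] = (h, key, value)
                return i
        if free is None:
            raise RuntimeError("table is full")
        self.slots[free] = (h, key, value)
        return free

    def _find(self, key):
        h = SLIDE_HASH[key]
        for i in self._probe(h):
            slot = self.slots[i]
            if slot is EMPTY:
                break                 # the key cannot be further along
            if slot is not DUMMY and slot[1] == key:
                return i
        raise KeyError(key)

    def lookup(self, key):
        return self.slots[self._find(key)][2]

    def delete(self, key):
        # Store DUMMY rather than EMPTY so later probes keep going.
        self.slots[self._find(key)] = DUMMY


t = LinearProbingTable()
positions = [t.insert("foo", "spam"), t.insert("bar", "ham"), t.insert("baz", "egg")]
print(positions)          # [2, 4, 3]
t.delete("foo")
print(t.lookup("baz"))    # egg
```

If `delete` stored `EMPTY` instead of `DUMMY`, the final lookup would stop at slot 2 and raise `KeyError`.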

Problems in classical open addressing hash table
• Large memory usage
  ◦ At least 1/3 of entries are empty
    ▪ Otherwise, "probing" can be too slow
  ◦ One entry uses 3 words
    ▪ word = 8 bytes on recent machines
  ◦ Minimum size = 192 bytes
    ▪ 8 (bytes/word) * 3 (words/entry) * 8 (table width)

Python 3.6 implementation

Compact and ordered dict
PyPy implemented it in 2015:
https://morepypy.blogspot.jp/2015/01/faster-more-memory-efficient-and-more.html
The Python 3.6 dict is almost the same as PyPy's. Ruby 2.4 and PHP 7 have similar ones.

d["foo"] = "spam"   # hash("foo") = 42, 42 % 8 = 2
Index table (8 slots, 0-7): slot 2 -> 0
Entries: 0: "foo", 42, "spam"

d["foo"] = "spam"
d["bar"] = "ham"    # hash("bar") = 52, 52 % 8 = 4
Index table (8 slots, 0-7): slot 2 -> 0, slot 4 -> 1
Entries: 0: "foo", 42, "spam"; 1: "bar", 52, "ham"

d["foo"] = "spam"
d["bar"] = "ham"
d["baz"] = "egg"
del d["foo"]
Index table (8 slots, 0-7): slot 2 -> DUMMY, slot 3 -> 2, slot 4 -> 1
Entries: 0: (deleted); 1: "bar", 52, "ham"; 2: "baz", 58, "egg"
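The same operations can be replayed on a toy version of the compact layout: a sparse index table of small integers plus a dense entries list. This is a simplified sketch, not CPython's code. It uses linear probing, never resizes, never reuses DUMMY slots, and takes hash values from the slides through the hypothetical `SLIDE_HASH` mapping; `FREE` and `DUMMY` are marker values chosen for this example.

```python
FREE = -1    # index slot that was never used
DUMMY = -2   # index slot whose entry was deleted

# Hash values copied from the slides; real hash() results differ.
SLIDE_HASH = {"foo": 42, "bar": 52, "baz": 58}


class CompactDict:
    def __init__(self, size=8):
        self.size = size
        self.indices = [FREE] * size   # sparse: positions into self.entries
        self.entries = []              # dense: [hash, key, value] in insertion order

    def _slots(self, h):
        start = h % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def __setitem__(self, key, value):
        h = SLIDE_HASH[key]
        for i in self._slots(h):
            ix = self.indices[i]
            if ix == FREE:
                self.indices[i] = len(self.entries)
                self.entries.append([h, key, value])
                return
            if ix != DUMMY and self.entries[ix][1] == key:
                self.entries[ix][2] = value
                return
        raise RuntimeError("table is full")

    def _find(self, key):
        h = SLIDE_HASH[key]
        for i in self._slots(h):
            ix = self.indices[i]
            if ix == FREE:
                break
            if ix != DUMMY and self.entries[ix][1] == key:
                return i, ix
        raise KeyError(key)

    def __getitem__(self, key):
        return self.entries[self._find(key)[1]][2]

    def __delitem__(self, key):
        i, ix = self._find(key)
        self.indices[i] = DUMMY
        self.entries[ix] = None        # leave a hole; the other entries keep their order

    def keys(self):
        # Iteration walks the dense list, so insertion order is preserved.
        return [e[1] for e in self.entries if e is not None]


d = CompactDict()
d["foo"] = "spam"
d["bar"] = "ham"
d["baz"] = "egg"
del d["foo"]
print(d.indices)   # [-1, -1, -2, 2, 1, -1, -1, -1]
print(d.keys())    # ['bar', 'baz']
print(d["baz"])    # egg
```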

New dict vs Legacy dict
• Less memory usage
  ◦ Index can be 1 byte for small dict
  ◦ 3 * 8 * 5 (entries) + 8 (index table) = 128 bytes
    ▪ It was 192 bytes in the legacy implementation
• Faster iteration (dense entries)
• Preserves insertion order
• (cons) One more indirect memory access
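The two size figures can be reproduced with a little arithmetic. This sketch assumes 8-byte words and that the compact layout reserves entries for 2/3 of the table width (rounded down), which gives the 5 entries used on the slide; the function names are made up, and object headers are ignored as in the slides.

```python
WORD = 8  # bytes per word on a 64-bit machine


def legacy_size(table_width):
    # Every slot of the hash table holds a full 3-word entry (hash, key, value).
    return WORD * 3 * table_width


def compact_size(table_width, index_bytes=1):
    # Assumption: dense entries for 2/3 of the table width, plus the index table.
    usable = table_width * 2 // 3
    return WORD * 3 * usable + index_bytes * table_width


print(legacy_size(8))    # 192
print(compact_size(8))   # 128
```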

Toward Python 3.7

Working on ...
• Remove redundant code that existed to optimize the legacy implementation.
• OrderedDict based on the new dict
  ◦ Remove the doubly linked list used to keep order
  ◦ About 1/2 memory usage!
  ◦ Faster creation and iteration.
  ◦ (cons) Slower .move_to_end() method

We're looking for new contributors
Contributing to Python is easier, thanks to GitHub.
• Read the devguide (https://devguide.python.org/ )
• Find an easy bug on https://bugs.python.org/ and fix it.
• Review others' code
• Translate documentation on Transifex
  ◦ See https://docs.python.org/ja/

Future ideas
• Specialized dict for namespaces
  ◦ All keys are interned strings
  ◦ Only pointer comparison
  ◦ No "hash" in entry -> more compact
• Implement set like dict
  ◦ Current set is larger than dict...
• functools.lru_cache
  ◦ Use `od.move_to_end(key)` instead of a linked list
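The last idea can be sketched with `collections.OrderedDict`. This is only an illustration of the `move_to_end` approach, not how `functools.lru_cache` is implemented; the `LRUCache` class is a hypothetical name.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical LRU cache that keeps recency order in an OrderedDict."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)   # evict the least recently used


cache = LRUCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")            # "a" becomes the most recent
cache.put("c", 3)         # evicts "b"
print(list(cache.data))   # ['a', 'c']
```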

PEP 412: Key sharing dict

PEP 412: Key sharing dict
Introduced in Python 3.3.
Instances of the same class can share a keys object.

    class A:
        def __init__(self, a, b):
            self.foo = a
            self.bar = b

    a = A("spam", "ham")
    b = A("bacon", "egg")

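Class A holds the shared keys object (entries hold key and hash only):
Index table (8 slots, 0-7): slot 2 -> 0, slot 4 -> 1
Keys: 0: "foo", 42; 1: "bar", 52
Values of instance a: 0: "spam"; 1: "ham"
Values of instance b: 0: "bacon"; 1: "egg"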

Problem
• Two instances can have different insertion orders
  ◦ Drop key sharing dict?
    ▪ Key sharing dict can save more memory.
      • But __slots__ can be used for such cases!
    ▪ Performance improvements in some microbenchmarks
      • Does it matter in real cases? __slots__?
    ▪ Needs consensus
      • It's more difficult than the implementation

Keep key sharing dict support
• Only exactly the same insertion order is permitted
  ◦ "Skipped" keys are prohibited
  ◦ Deletion is also prohibited
• Otherwise, stop "key sharing"
  ◦ `self.x = None` is faster than `del self.x`
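The rules above can be modeled with a small toy. This is a loose sketch of the idea, not CPython's data structures: `SharedKeys` and `InstanceDict` are hypothetical names, and the exact conditions under which CPython shares or stops sharing keys are more involved.

```python
class SharedKeys:
    """Keys object owned by the class: attribute names in insertion order."""

    def __init__(self):
        self.names = []


class InstanceDict:
    def __init__(self, shared):
        self.shared = shared   # becomes None once this instance stops sharing
        self.values = []       # while sharing: values[i] belongs to shared.names[i]
        self.own = None        # private dict used after sharing stops

    def set(self, name, value):
        if self.shared is None:
            self.own[name] = value
            return
        names = self.shared.names
        n = len(self.values)
        if name in names[:n]:                       # existing attribute: overwrite
            self.values[names.index(name)] = value
        elif n < len(names) and names[n] == name:   # next key in the shared order
            self.values.append(value)
        elif n == len(names):                       # brand-new key: extend shared keys
            names.append(name)
            self.values.append(value)
        else:                                       # would skip a key: stop sharing
            self._unshare()
            self.own[name] = value

    def delete(self, name):
        if self.shared is not None:
            self._unshare()                         # deletion is prohibited while sharing
        del self.own[name]

    def _unshare(self):
        self.own = dict(zip(self.shared.names, self.values))
        self.shared = None
        self.values = None


keys = SharedKeys()
a = InstanceDict(keys)
a.set("foo", "spam")
a.set("bar", "ham")
b = InstanceDict(keys)
b.set("foo", "bacon")
b.set("bar", "egg")
c = InstanceDict(keys)
c.set("bar", "x")      # skips "foo", so c stops sharing
a.set("foo", None)     # like `self.x = None`: a keeps sharing
b.delete("foo")        # like `del self.x`: b stops sharing

print(keys.names)                                             # ['foo', 'bar']
print(a.shared is keys, b.shared is keys, c.shared is keys)   # True False False
```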
