Slide 1

Slide 1 text

Python dictionary: past, present, future
Dmitry Alimov, Senior Software Engineer, Zodiac Interactive
2016, SPb Python Interest Group

Slide 2

Slide 2 text

Dictionary in Python

Slide 3

Slide 3 text

>>> d = {}  # the same as d = dict()
>>> d['a'] = 123
>>> d['b'] = 345
>>> d['c'] = 678
>>> d
{'a': 123, 'c': 678, 'b': 345}
>>> d['b']
345
>>> del d['c']
>>> d
{'a': 123, 'b': 345}

Slide 4

Slide 4 text

Dictionary keys must be hashable

An object is hashable if it has a hash value which never changes during its lifetime

>>> d[list()] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> d[set()] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> d[dict()] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

All of Python’s immutable built-in objects are hashable
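A minimal sketch of checking hashability before using an object as a key. The helper name is_hashable is hypothetical and not part of the talk. It also shows one nuance: a tuple is hashable only if everything inside it is hashable.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, i.e. obj can be used as a dict key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


# A few built-in objects; the last tuple contains a list, so hashing it fails.
samples = [1, 'text', (1, 2), frozenset([1]), [1, 2], {1}, {'a': 1}, (1, [2])]
for obj in samples:
    print('%r -> %s' % (obj, is_hashable(obj)))
```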

Slide 5

Slide 5 text

Random hash is a bad idea

import random

class A(object):
    def __init__(self, index):
        self.index = index

    def __eq__(self, other):
        return True

    def __hash__(self):
        return random.randint(0, 3)

    def __repr__(self):
        return 'A%d' % self.index

d = {A(0): 0, A(1): 1, A(2): 2}
print('keys: %s' % d.keys())
print('values: %s' % d.values())
for k in d:
    print('%s = %s' % (k, d.get(k, 'not found')))

Run 1
keys: [A1, A2, A0]
values: [1, 2, 0]
A1 = 1
A2 = not found
A0 = 0

Run 2
keys: [A1, A0]
values: [2, 0]
A1 = not found
A0 = not found
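For contrast, a sketch of a well-behaved key class: the hash is derived from the same fields that __eq__ compares, so it never changes for an unmodified object and equal objects hash equally. The class Point is a hypothetical example, not taken from the talk.

```python
class Point(object):
    """Hypothetical value class with a hash consistent with equality."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __ne__(self, other):
        # Python 2 does not derive != from ==, so define it explicitly.
        return not self == other

    def __hash__(self):
        # Hash exactly the fields that take part in the equality check.
        return hash((self.x, self.y))

    def __repr__(self):
        return 'Point(%d, %d)' % (self.x, self.y)


d = {Point(0, 0): 'origin', Point(1, 2): 'p'}
lookups = [d.get(Point(0, 0)), d.get(Point(1, 2)), d.get(Point(9, 9), 'not found')]
print(lookups)
```

Every lookup with an equal key now finds its value, unlike the runs with the random hash.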

Slide 6

Slide 6 text

Past

Slide 7

Slide 7 text

Dictionary in CPython >2.1

- Hash table
- Open addressing collision resolution strategy
- Initial size = 8
- Load factor = 2/3
- Growth rate = 2 or 4 (depending on the number of cells used)
- “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

Three kinds of slots in the table:
1) Unused
2) Active
3) Dummy

Slide 8

Slide 8 text

#define PyDict_MINSIZE 8

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

ma_fill – the number of non-NULL keys (sum of Active and Dummy)
ma_used – the number of Active items
ma_mask – the mask, equal to the table size minus 1 (initially PyDict_MINSIZE - 1)
ma_lookup – the lookup function (lookdict_string by default)

Slide 9

Slide 9 text

Hash functions

Good hash functions are needed

>>> map(hash, [0, 1, 2, 3, 4])
[0, 1, 2, 3, 4]
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[1540938117, 1540938118, 1540938119, 1540938112, 1540938113]

Modified FNV (Fowler–Noll–Vo) hash function for strings

“-R” option – turns on hash randomization, so that the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value

>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[-218138032, -218138029, -218138030, -218138027, -218138028]

Slide 10

Slide 10 text

Collision resolution

A collision occurs when two distinct pieces of data have the same hash value.

Probing is a scheme for resolving collisions in hash tables, which maintain a collection of key–value pairs and look up the value associated with a given key.

CPython uses pseudo-random probing:

PERTURB_SHIFT = 5
perturb = hash(key)
j = hash(key) % 2**i  # initial index
while True:
    j = (5 * j) + 1 + perturb
    perturb >>= PERTURB_SHIFT
    index = j % 2**i

See “/Objects/dictobject.c”

CPython <2.2 used polynomial-based index computation
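The probing recurrence above can be sketched as a small runnable function. The function name probe_sequence and the 64-bit unsigned treatment of perturb are assumptions made for this illustration (C code would use an unsigned type, and Python integers have no fixed width). The hash value -1297030748 is the one the deck uses later for the key '!!!'.

```python
PERTURB_SHIFT = 5
SIZE_T_MASK = 2 ** 64 - 1  # assumption: model perturb as a 64-bit unsigned value


def probe_sequence(hash_value, table_size, count):
    """Return the first `count` slot indices probed for `hash_value`.

    `table_size` must be a power of two so that `& mask` selects the low bits.
    """
    mask = table_size - 1
    perturb = hash_value & SIZE_T_MASK
    i = hash_value & mask          # initial index: low bits of the hash
    indices = [i]
    while len(indices) < count:
        i = ((i * 5) + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT  # higher hash bits gradually join in
        indices.append(i)
    return indices


print(probe_sequence(-1297030748, 8, 3))
```

Once perturb has been shifted down to zero, the recurrence reduces to i = 5*i + 1 modulo the table size, which visits every slot of a power-of-two table, so a free slot is always found eventually.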

Slide 11

Slide 11 text

Index computing

>>> PyDict_MINSIZE = 8
>>> key = 123
>>> hash(key) % PyDict_MINSIZE
3

Instead of the modulo operation, use bitwise "AND" with the mask

>>> mask = PyDict_MINSIZE - 1
>>> hash(key) & mask
3

Get the least significant bits of the hash:
2 ** i = PyDict_MINSIZE, hence i = 3, i.e. the three least significant bits are enough
hash(123) = 123 = 0b1111011
mask = PyDict_MINSIZE - 1 = 8 - 1 = 7 = 0b111
index = hash(123) & mask = 0b1111011 & 0b111 = 0b011 = 3
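A short check of why the mask trick works: for a power-of-two table size, the bitwise AND gives the same index as Python's modulo, including for negative hash values. The helper names below are hypothetical. The last lines show that the equivalence breaks for a size that is not a power of two.

```python
def index_by_modulo(hash_value, table_size):
    return hash_value % table_size


def index_by_mask(hash_value, table_size):
    return hash_value & (table_size - 1)


power_of_two_sizes = [8, 16, 1024]
hashes = [0, 123, -1297030748, 2 ** 40 + 5]

# Collect every (hash, size) pair where the two methods disagree.
mismatches = [(h, size)
              for size in power_of_two_sizes
              for h in hashes
              if index_by_modulo(h, size) != index_by_mask(h, size)]
print('mismatches for power-of-two sizes: %r' % mismatches)

# With a size of 10 the mask is 0b1001, which no longer selects the low bits.
print(index_by_modulo(123, 10), index_by_mask(123, 10))
```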

Slide 12

Slide 12 text

Integers

mask = PyDict_MINSIZE - 1
index = hash(123) & mask

Slide 13

Slide 13 text

Strings

mask = PyDict_MINSIZE - 1
index = hash(123) & mask

Slide 14

Slide 14 text

Dictionary in CPython >2.1

Dictionary initialization
PyDict_New()
ma_used = 0
ma_fill = 0
ma_mask = PyDict_MINSIZE - 1
ma_table = ma_smalltable
ma_lookup = lookdict_string

Add an item
PyDict_SetItem()
insertdict()
ma_used += 1
ma_fill += 1
dictresize() if ma_fill >= 2/3 * size

Delete an item
PyDict_DelItem()
ma_used -= 1

Slide 15

Slide 15 text

Add item

Slide 16

Slide 16 text

Add item

Slide 17

Slide 17 text

Add item

Slide 18

Slide 18 text

Add item

Slide 19

Slide 19 text

Add item

Slide 20

Slide 20 text

hash('!!!') = -1297030748
i = -1297030748 & 7 = 4

perturb = -1297030748
# i = (i * 5) + 1 + perturb
i = (4 * 5) + 1 + (-1297030748) = -1297030727
index = -1297030727 & 7 = 1

# perturb = perturb >> PERTURB_SHIFT
perturb = -1297030748 >> 5 = -40532211
# i = (i * 5) + 1 + perturb
i = (-1297030727 * 5) + 1 + (-40532211) = -6525685845
index = -6525685845 & 7 = 3

Slide 21

Slide 21 text

Add item

>>> d
{'python': 2, 'article': 4, '!!!': 5, 'dict': 3, 'a key': 1}
>>> d.__sizeof__()
248

Slide 22

Slide 22 text

Hash table resize

>>> d
{'!!!': 5, 'python': 2, 'dict': 3, 'a key': 1, 'article': 4, ';)': 6}
>>> d.__sizeof__()
1016

Slide 23

Slide 23 text

Hash table resize

PyDict_SetItem(...)
{
    ...
    dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
    ...
}

dictresize(PyDictObject *mp, Py_ssize_t minused)
{
    ...
    /* Find the smallest table size > minused. */
    for (newsize = 8; newsize <= minused && newsize > 0; newsize <<= 1)
        ;
    ...
}

In the example:
ma_fill = 6 > (8 * 2 / 3)
ma_used = 6
Hence minused = 4 * 6 = 24, therefore newsize = 32
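The resize arithmetic can be replayed in Python. This is a sketch of the two rules quoted on the slides (the 2/3 fill threshold and the 4x or 2x growth request), not a port of dictresize(). The function names are hypothetical.

```python
PyDict_MINSIZE = 8


def needs_resize(ma_fill, table_size):
    """True once the table is at least 2/3 full (Active plus Dummy slots)."""
    return ma_fill * 3 >= table_size * 2


def new_table_size(ma_used):
    """Smallest power of two, starting from 8, that is greater than minused."""
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = PyDict_MINSIZE
    while newsize <= minused:
        newsize <<= 1
    return newsize


print(needs_resize(6, 8))   # sixth non-NULL key in a table of 8
print(new_table_size(6))    # minused = 24
print(new_table_size(1))    # minused = 4
```

The last call corresponds to the "Interesting case" slides: when only one item is Active, the requested size is 4, so the table is rebuilt at the same size of 8 and only the Dummy slots are discarded.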

Slide 24

Slide 24 text

Addition order

>>> d1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
>>> d2 = {'three': 3, 'two': 2, 'five': 5, 'four': 4, 'one': 1}
>>> d1 == d2
True
>>> d1.keys()
['four', 'three', 'five', 'two', 'one']
>>> d2.keys()
['four', 'one', 'five', 'three', 'two']

The slot an item lands in depends on the items already in the dictionary, so equal dictionaries built in a different order can list their keys differently

Slide 25

Slide 25 text

>>> 7.0 == 7 == (7+0j)
True
>>> d = {}
>>> d[7.0] = 'float'
>>> d
{7.0: 'float'}
>>> d[7] = 'int'
>>> d
{7.0: 'int'}
>>> d[7+0j] = 'complex'
>>> d
{7.0: 'complex'}
>>> type(d.keys()[0])  # int, float or complex?
<type 'float'>
>>> hash(7)
7
>>> hash(7.0)
7
>>> hash(7+0j)
7
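The same experiment as a script that also runs on Python 3, where keys() is a view and has to be turned into a list before indexing. It illustrates that assigning through an equal key with an equal hash replaces the value but keeps the key object that was inserted first.

```python
d = {}
d[7.0] = 'float'
d[7] = 'int'           # 7 == 7.0 and hash(7) == hash(7.0): same slot, value replaced
d[7 + 0j] = 'complex'  # same again for the complex number

keys = list(d.keys())
print(d)
print(type(keys[0]))
```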

Slide 26

Slide 26 text

Adding item during iteration

>>> d = {'a': 1}
>>> for i in d:
...     d['new item'] = 123
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
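One common way around this error is to iterate over a snapshot of the keys, so that additions do not affect the running loop. A minimal sketch with illustrative key names:

```python
d = {'a': 1, 'b': 2}

# list(d) copies the keys first, so the dictionary can grow inside the loop.
for key in list(d):
    d[key + '_copy'] = d[key]

# Iterating over the dictionary itself while adding keys still fails.
error = None
try:
    for key in d:
        d[key + '!'] = 0
except RuntimeError as exc:
    error = str(exc)

print(sorted(d))
print(error)
```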

Slide 27

Slide 27 text

Delete item

dummy = PyString_FromString("<dummy key>");

Slide 28

Slide 28 text

Interesting case

Slide 29

Slide 29 text

Interesting case

ma_fill = 6 > (8 * 2 / 3)
dictresize()

Slide 30

Slide 30 text

Interesting case

ma_fill = 6 > (8 * 2 / 3)
ma_used = 1
Hence minused = 4 * 1 = 4, therefore newsize = 8

Slide 31

Slide 31 text

Cache

Cache locality and collisions

PyDictEntry ma_smalltable[8];

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

On 32-bit x86 (three 4-byte fields per entry) with 64 bytes per cache line: 64 / (4 * 3) = 5.333 entries

See “/Objects/dictnotes.txt”

Source      Access time
L1 Cache    1 ns
L2 Cache    4 ns
RAM         100 ns
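The cache-line arithmetic as a tiny calculation. The 8-byte case is an added assumption for a typical 64-bit build, where each of the three fields is twice as wide. The function name is hypothetical.

```python
CACHE_LINE_BYTES = 64
FIELDS_PER_ENTRY = 3  # me_hash, me_key, me_value


def entries_per_cache_line(field_size):
    """How many PyDictEntry structs fit in one cache line for a given field width."""
    return CACHE_LINE_BYTES / float(field_size * FIELDS_PER_ENTRY)


print('%.3f' % entries_per_cache_line(4))  # 4-byte fields, as on the slide
print('%.3f' % entries_per_cache_line(8))  # 8-byte fields (assumed 64-bit build)
```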

Slide 32

Slide 32 text

Open addressing vs separate chaining

Note: the illustration shows linear probing rather than the pseudo-random probing used in CPython

Slide 33

Slide 33 text

OrderedDict

from collections import OrderedDict

- Internal dict
- Circular doubly linked list
- “/Lib/collections/__init__.py”
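A toy illustration of the "internal dict plus circular doubly linked list" idea. It is a simplified sketch written for this transcript, not the code from collections, and the class name LinkedOrderedDict is hypothetical. Each link is a list [prev, next, key], and a sentinel root closes the circle.

```python
class LinkedOrderedDict(object):
    """Keeps insertion order with a circular doubly linked list of keys."""

    def __init__(self):
        self._values = {}             # key -> value (the internal dict)
        self._links = {}              # key -> [prev, next, key]
        self._root = root = []
        root[:] = [root, root, None]  # empty circle: root points to itself

    def __setitem__(self, key, value):
        if key not in self._values:
            # Append a new link just before the sentinel, i.e. at the end.
            root = self._root
            last = root[0]
            last[1] = root[0] = self._links[key] = [last, root, key]
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

    def __delitem__(self, key):
        del self._values[key]
        # Unlink in O(1): the neighbours are reconnected to each other.
        prev_link, next_link, _ = self._links.pop(key)
        prev_link[1] = next_link
        next_link[0] = prev_link

    def __iter__(self):
        root = self._root
        current = root[1]
        while current is not root:
            yield current[2]
            current = current[1]


od = LinkedOrderedDict()
od['b'] = 1
od['a'] = 2
od['c'] = 3
del od['a']
od['a'] = 4  # re-inserted keys go to the end
print(list(od))
```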

Slide 34

Slide 34 text

Present

Slide 35

Slide 35 text

Dictionary in CPython 3.5

- PEP 412 - Key-Sharing Dictionary
- The DictObject can be in one of two forms: combined table or split table
- Initial size = 4 (split table) or 8 (combined table)
- Maximum dictionary load = (2*n+1)/3
- Growth rate = used*2 + capacity/2
- “/Objects/dict-common.h”, “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”

typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    PyDictKeyEntry dk_entries[1];
};

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;

Slide 36

Slide 36 text

Combined table vs split table

Combined table
- For explicit dictionaries (dict() and {})
- ma_values = NULL, dk_refcnt = 1
- Never becomes a split-table dictionary

Split table
- For attribute dictionaries (the __dict__ attribute of an object)
- ma_values != NULL, dk_refcnt >= 1
- Only string (unicode) keys are allowed
- Values are stored in the ma_values array
- When a split dictionary is resized, it is converted to a combined table (but if the resize is the result of storing an instance attribute, and there is only one instance of the class, then the dictionary is re-split immediately)
- Lookup function = lookdict_split

Slide 37

Slide 37 text

Dictionary in CPython 3.5

typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

A new kind of slot:
1) Unused
2) Active
3) Dummy
4) Pending (me_key != NULL, me_key != dummy and me_value == NULL)

Slide 38

Slide 38 text

Split table

Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3, i.e. initially ma_keys->dk_usable = 3

Slide 39

Slide 39 text

Split table

class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
setattr(a, 'd', 4)  # re-split
print(a.__dict__.__sizeof__())  # 168
print({}.__sizeof__())  # 264

Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3
Growth rate = used*2 + capacity/2 = 3*2 + 4/2 = 8, hence minused = 8, therefore newsize = 16 (see dictresize)
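The load and growth formulas above can be checked with a few lines of Python. This is a sketch of the arithmetic only. The function names are hypothetical, and the assumption that the size search starts from 8 (the combined-table initial size given in the slides) is labeled in the code.

```python
def usable_fraction(n):
    """Maximum number of entries for a table of size n: (2*n+1)/3, rounded down."""
    return (2 * n + 1) // 3


def grown_size(used, capacity, start_size=8):
    """Smallest power of two greater than minused = used*2 + capacity/2.

    Assumption: the search starts from 8, the combined-table initial size.
    """
    minused = used * 2 + capacity // 2
    newsize = start_size
    while newsize <= minused:
        newsize <<= 1
    return newsize


print(usable_fraction(4))  # split table of size 4
print(grown_size(3, 4))    # the fourth attribute triggers the resize
```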

Slide 40

Slide 40 text

Split table

class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
b = A()
setattr(a, 'd', 4)  # no re-split because of b
print(a.__dict__.__sizeof__())  # 456

Split table is converted to a combined table

Slide 41

Slide 41 text

Summary

Key differences between this implementation and CPython 2.x:
- The table can be split into two parts – the keys and the values
- A new kind of slot
- No more ma_smalltable embedded in the dict
- General dictionaries are slightly larger
- All object dictionaries of a single class can share a single key-table, saving about 60% memory for such cases (according to https://github.com/python/cpython/blob/3.5/Objects/dictnotes.txt)

Bugs still happen: unbounded memory growth resizing split-table dicts (https://bugs.python.org/issue28147)

Slide 42

Slide 42 text

Hash functions in CPython 3.5

PEP 456 – Secure and interchangeable hash algorithm

SipHash for strings and bytes (>= CPython 3.4)
- Resistant against hash flooding DoS attacks
- Successfully used in many other languages

Slightly modified hash function for float
hash(float("+inf")) == 314159, hash(float("-inf")) == -314159 (was -271828)
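On a Python 3.4+ interpreter these details can be inspected through sys.hash_info. The snippet below is a small check, not part of the original slides. The algorithm name it prints depends on the interpreter version and build, so no particular value is assumed.

```python
import sys

# Name of the string/bytes hash algorithm chosen at build time (PEP 456).
algorithm = sys.hash_info.algorithm

positive_inf_hash = hash(float('inf'))
negative_inf_hash = hash(float('-inf'))

print('algorithm: %s' % algorithm)
print('hash(+inf) = %d, hash(-inf) = %d' % (positive_inf_hash, negative_inf_hash))
```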

Slide 43

Slide 43 text

OrderedDict in CPython 3.5

- Doubly-linked-list
- od_fast_nodes hash table that mirrors the od_dict table
- “/Include/odictobject.h”, “/Objects/odictobject.c”

Slide 44

Slide 44 text

Alternative versions

Slide 45

Slide 45 text

Dictionary in PyPy

- Starting from PyPy 2.5.0 – ordereddict is used by default
- Initial size = 16
- Load factor up to 2/3
- Growth rate = 4 (up to 30000 items) or 2
- If a lot of items are deleted the compaction is performed
- “/rpython/rtyper/lltypesystem/rordereddict.py”

struct dicttable {
    int num_live_items;
    int num_ever_used_items;
    int resize_counter;
    variable_int *indexes; // byte, short, int, long
    dictentry *entries;
    ...
}

struct dictentry {
    PyObject *key;
    PyObject *value;
    long hash;
    bool valid;
}

Slide 46

Slide 46 text

Dictionary in PyPy

struct dicttable {
    variable_int *indexes;
    dictentry *entries;
    ...
}

FREE = 0
DELETED = 1
VALID_OFFSET = 2

Slide 47

Slide 47 text

PyDictionary in Jython

- Based on ConcurrentHashMap
- Separate chaining collision resolution
- Initial size = 16, load factor = 0.75, growth rate = 2
- Segments and thread safety

Slide 48

Slide 48 text

PythonDictionary in IronPython

- Based on Dictionary (.NET)
- Separate chaining collision resolution
- Initial size = 0, load factor = 1.0
- Rehashing if the number of collisions >= 100
- Growth rate = 2 (the new size is the next higher prime number from a set of primes = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107,… , 4999559, 5999471, 7199369})

Slide 49

Slide 49 text

Future

Slide 50

Slide 50 text

Raymond Hettinger is happy

Slide 51

Slide 51 text

Dictionary in CPython 3.6

typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used; /* number of items in the dictionary */
    uint64_t ma_version_tag; /* unique, changes when dict modified */
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;

- ma_version_tag is added (PEP 509 – Add a private version to dict)
- Initial size = 8 (for split table too)
- Maximum dictionary load = (2*n)/3
- Contributed by INADA Naoki in https://bugs.python.org/issue27350

Four kinds of slots in the table:
1) Unused (index == DKIX_EMPTY == -1)
2) Active (index >= 0, me_key != NULL and me_value != NULL)
3) Dummy (index == DKIX_DUMMY == -2, only for combined table)
4) Pending (index >= 0, me_key != NULL and me_value == NULL, only for split table)

Slide 52

Slide 52 text

Dictionary in CPython 3.6

- Added dk_nentries and dk_indices

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size; /* Size of the hash table (dk_indices) */
    dict_lookup_func dk_lookup; /* Function to lookup in dk_indices */
    Py_ssize_t dk_usable; /* Number of usable entries in dk_entries */
    Py_ssize_t dk_nentries; /* Number of used entries in dk_entries */
    union {
        int8_t as_1[8];
        int16_t as_2[4];
        int32_t as_4[2];
#if SIZEOF_VOID_P > 4
        int64_t as_8[1];
#endif
    } dk_indices;
    PyDictKeyEntry dk_entries[dk_usable]; /* using DK_ENTRIES macro */
};
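To make the dk_indices / dk_entries split concrete, here is a toy Python model of a compact, ordered hash table: a sparse index table holds positions into a dense, insertion-ordered entry list. It is an illustrative sketch only (insert and lookup, no deletion). The class name CompactDict, the 64-bit unsigned perturb, and the probing order borrowed from the earlier collision-resolution slide are assumptions, not the CPython 3.6 source.

```python
DKIX_EMPTY = -1
PERTURB_SHIFT = 5


class CompactDict(object):
    """Toy model: sparse `indices` table plus dense, ordered `entries` list."""

    def __init__(self, size=8):
        self.indices = [DKIX_EMPTY] * size  # slot -> position in entries
        self.entries = []                   # (hash, key, value) in insertion order

    def _probe(self, hash_value):
        mask = len(self.indices) - 1
        perturb = hash_value & (2 ** 64 - 1)  # assumption: unsigned 64-bit
        i = hash_value & mask
        while True:
            yield i
            i = (i * 5 + 1 + perturb) & mask
            perturb >>= PERTURB_SHIFT

    def _find(self, key, hash_value):
        """Return (slot, position); position is DKIX_EMPTY if the key is absent."""
        for slot in self._probe(hash_value):
            position = self.indices[slot]
            if position == DKIX_EMPTY:
                return slot, DKIX_EMPTY
            entry_hash, entry_key, _ = self.entries[position]
            if entry_hash == hash_value and entry_key == key:
                return slot, position

    def _resize(self):
        # Only the small index table is rebuilt; entries keep their order.
        self.indices = [DKIX_EMPTY] * (len(self.indices) * 2)
        for position, entry in enumerate(self.entries):
            slot, _ = self._find(entry[1], entry[0])
            self.indices[slot] = position

    def __setitem__(self, key, value):
        hash_value = hash(key)
        slot, position = self._find(key, hash_value)
        if position != DKIX_EMPTY:
            self.entries[position] = (hash_value, key, value)
            return
        # Keep at most (2*n)/3 entries for an index table of size n.
        if (len(self.entries) + 1) * 3 > len(self.indices) * 2:
            self._resize()
            slot, _ = self._find(key, hash_value)
        self.indices[slot] = len(self.entries)
        self.entries.append((hash_value, key, value))

    def __getitem__(self, key):
        _, position = self._find(key, hash(key))
        if position == DKIX_EMPTY:
            raise KeyError(key)
        return self.entries[position][2]

    def keys(self):
        return [entry[1] for entry in self.entries]


words = ['python', 'article', '!!!', 'dict', 'a key', ';)']
d = CompactDict()
for number, word in enumerate(words):
    d[word] = number
print(d.keys())
```

Because iteration walks the dense entry list, keys come back in insertion order, and the index table can use the narrowest integer type that fits the number of entries.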

Slide 53

Slide 53 text

Dictionary in CPython 3.6 (Combined table)

Slide 54

Slide 54 text

Summary

Key differences between this implementation and CPython 3.5:
- Compact and ordered
- Added dk_indices, whose element type depends on the size of the dictionary
- Added ma_version_tag (PEP 509)
- Initial size for split table is changed to 8
- Maximum dictionary load changed to (2*n)/3
- Deleting an item converts a split-table dict to a combined table
- Preserving the order of **kwargs in a function (PEP 468) is implemented
- Preserving Class Attribute Definition Order (PEP 520) is implemented
- The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5 (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes)
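Two of the visible consequences, PEP 468 and PEP 520, can be observed directly. The sketch assumes Python 3.6 or later. The function and class names are made up for the example.

```python
def keyword_order(**kwargs):
    """Return the keyword names in the order they were passed (PEP 468)."""
    return list(kwargs)


class Example(object):
    first = 1
    second = 2
    third = 3


order = keyword_order(zebra=1, apple=2, mango=3)

# The class namespace keeps definition order (PEP 520); skip dunder entries.
attribute_order = [name for name in vars(Example) if not name.startswith('__')]

print(order)
print(attribute_order)
```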

Slide 55

Slide 55 text

References

1. The implementation of a dictionary in Python 2.7 https://habrahabr.ru/post/247843/
2. Python hash calculation algorithms http://delimitry.blogspot.com/2014/07/python-hash-calculation-algorithms.html
3. PEP 412 - Key-Sharing Dictionary https://www.python.org/dev/peps/pep-0412/
4. PEP 456 - Secure and interchangeable hash algorithm https://www.python.org/dev/peps/pep-0456/
5. Mirror of the CPython repository https://github.com/python/cpython/
6. Faster, more memory efficient and more ordered dictionaries on PyPy https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-and-more.html
7. PyDictionary (Jython API documentation) http://www.jython.org/javadoc/org/python/core/PyDictionary.html
8. Jython repository https://bitbucket.org/jython/jython
9. Java theory and practice: Building a better HashMap http://www.ibm.com/developerworks/library/j-jtp08223/
10. Back to basics: Dictionary part 2, .NET implementation https://blog.markvincze.com/back-to-basics-dictionary-part-2-net-implementation/
11. http://referencesource.microsoft.com/mscorlib/system/collections/generic/dictionary.cs.html
12. https://github.com/IronLanguages/main/blob/ipy-2.7-maint/Languages/IronPython/IronPython/
13. https://bitbucket.org/pypy/pypy/
14. https://twitter.com/raymondh
15. PEP 509 - Add a private version to dict https://www.python.org/dev/peps/pep-0509/
16. Compact and ordered dict http://bugs.python.org/issue27350
17. What’s New In Python 3.6 https://docs.python.org/3.6/whatsnew/3.6.html
18. PEP 468 - Preserving the order of **kwargs in a function https://www.python.org/dev/peps/pep-0468/
19. PEP 520 - Preserving Class Attribute Definition Order https://www.python.org/dev/peps/pep-0520/
20. https://en.wikipedia.org/

Images from:
http://www.rcreptiles.com/blog/index.php/2008/06/28/read_the_operating_manual_first
http://kiwigamer450.deviantart.com/art/Back-to-The-Past-Logo-567858767
http://beyondplm.com/wp-content/uploads/2014/04/time-paradox-past-future-present.jpg
http://itband.ru/wp-content/uploads/2014/10/Future.jpg
https://en.wikipedia.org/wiki/Hash_table

Slide 56

Slide 56 text

Q & A @delimitry spbpython.guru SPb Python Interest Group

Slide 57

Slide 57 text

Additional slides

Slide 58

Slide 58 text

Separate chaining collision resolution Open addressing collision resolution (pseudo-random probing)