Python dictionary past, present, future

Dmitry Alimov, Senior Software Engineer, Zodiac Interactive
2016, SPb Python Interest Group

Dictionary in Python

>>> d = {}  # the same as d = dict()
>>> d['a'] = 123
>>> d['b'] = 345
>>> d['c'] = 678
>>> d
{'a': 123, 'c': 678, 'b': 345}
>>> d['b']
345
>>> del d['c']
>>> d
{'a': 123, 'b': 345}

Dictionary keys must be hashable
An object is hashable if it has a hash value which never changes during its lifetime
>>> d[list()] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> d[set()] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> d[dict()] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
All of Python’s immutable built-in objects are hashable
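One nuance of the rule above is that an immutable container is hashable only if everything inside it is hashable too. A minimal sketch; the helper name is_hashable is introduced here for illustration only:

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable((1, 2)))          # tuple of ints
print(is_hashable(frozenset({1})))  # immutable set
print(is_hashable((1, [2])))        # tuple that contains a list
print(is_hashable([]))              # mutable list
```

The first two calls report True and the last two report False: the tuple itself is immutable, but hashing it has to hash the list it contains.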

Random hash is a bad idea
import random

class A(object):
    def __init__(self, index):
        self.index = index
    def __eq__(self, other):
        return True
    def __hash__(self):
        return random.randint(0, 3)
    def __repr__(self):
        return 'A%d' % self.index

d = {A(0): 0, A(1): 1, A(2): 2}
print('keys: %s' % d.keys())
print('values: %s' % d.values())
for k in d:
    print('%s = %s' % (k, d.get(k, 'not found')))

Run 1
keys: [A1, A2, A0]
values: [1, 2, 0]
A1 = 1
A2 = not found
A0 = 0

Run 2
keys: [A1, A0]
values: [2, 0]
A1 = not found
A0 = not found
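For contrast, here is a sketch of a key type that behaves well: its hash is derived from exactly the fields that __eq__ compares, so equal objects always hash alike and the value never changes between calls. The Point class is hypothetical and not part of the talk.

```python
class Point(object):
    """Hypothetical key type: __hash__ uses the same fields as __eq__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Stable as long as x and y are not mutated after insertion.
        return hash((self.x, self.y))

    def __repr__(self):
        return 'Point(%d, %d)' % (self.x, self.y)


d = {Point(0, 0): 'origin', Point(1, 2): 'p'}
print(d[Point(1, 2)])  # a fresh but equal object finds the stored value
```

The remaining caveat is mutation: changing x or y on an object that is already used as a key would move its hash and reproduce the "not found" behaviour shown in the runs above.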

Dictionary in CPython >2.1
- Hash table
- Open addressing collision resolution strategy
- Initial size = 8
- Load factor = 2/3
- Growth rate = 2 or 4 (depending on the number of cells used)
- “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
Three kinds of slots in the table:
1) Unused
2) Active
3) Dummy

ma_fill – the number of non-NULL keys (sum of Active and Dummy)
ma_used – the number of Active items
ma_mask – the mask, equal to the table size - 1 (initially PyDict_MINSIZE - 1)
ma_lookup – the lookup function (lookdict_string by default)
#define PyDict_MINSIZE 8
typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Hash functions
Good hash functions are needed
>>> map(hash, [0, 1, 2, 3, 4])
[0, 1, 2, 3, 4]
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[1540938117, 1540938118, 1540938119, 1540938112, 1540938113]
Modified FNV (Fowler–Noll–Vo) hash function for strings
“-R” option – turns on hash randomization, so that the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[-218138032, -218138029, -218138030, -218138027, -218138028]
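The non-randomized string hashes above can be reproduced in pure Python. The sketch below is a reconstruction of the CPython 2 string hash with randomization off, assuming a 32-bit build (which is what the values on the slide suggest); the function name py2_string_hash and the bits parameter are introduced here for illustration.

```python
def py2_string_hash(s, bits=32):
    """Reconstruction of the CPython 2 str hash without -R (assumes `bits`-wide long)."""
    if not s:
        return 0
    mask = (1 << bits) - 1
    x = (ord(s[0]) << 7) & mask
    for ch in s:
        # Multiply by the FNV-style constant, then mix in the next character.
        x = ((1000003 * x) ^ ord(ch)) & mask
    x ^= len(s)
    # Reinterpret the unsigned word as a signed C long.
    if x >= 1 << (bits - 1):
        x -= 1 << bits
    # -1 is reserved as an error code in the C API.
    return -2 if x == -1 else x


print([py2_string_hash(s) for s in ['abca', 'abcb', 'abcc', 'abcd', 'abce']])
# [1540938117, 1540938118, 1540938119, 1540938112, 1540938113]
```

Strings that differ only in the last character get nearly consecutive hashes, because the final character is only XORed into the low bits.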

Collision resolution
A collision occurs when two distinct pieces of data have the same hash value.
Probing is a scheme for resolving collisions in hash tables that maintain a collection of key–value pairs and look up the value associated with a given key.
CPython uses pseudo-random probing:
PERTURB_SHIFT = 5
perturb = hash(key)
while True:
    j = (5 * j) + 1 + perturb
    perturb >>= PERTURB_SHIFT
    index = j % 2**i
See “/Objects/dictobject.c”
CPython <2.2 used polynomial-based index computation
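Once perturb has been shifted down to zero, the recurrence reduces to j = (5 * j) + 1 modulo 2**i, which visits every slot of the table before repeating. A small sketch that checks this; the function name recurrence_cycle is introduced here for illustration:

```python
def recurrence_cycle(i):
    """Follow j = (5 * j) + 1 mod 2**i from j = 0 until it returns to 0."""
    size = 2 ** i
    j = 0
    visited = []
    while True:
        visited.append(j)
        j = ((5 * j) + 1) % size
        if j == 0:
            return visited


print(recurrence_cycle(3))  # [0, 1, 6, 7, 4, 5, 2, 3]
```

Because every slot is eventually probed, a lookup in a table that still has an Unused slot always terminates.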

Index computing
>>> PyDict_MINSIZE = 8
>>> key = 123
>>> hash(key) % PyDict_MINSIZE
3
Instead of the modulo operation, use bitwise "AND" with the mask:
>>> mask = PyDict_MINSIZE - 1
>>> hash(key) & mask
3
Get the least significant bits of the hash:
2 ** i = PyDict_MINSIZE, hence i = 3, i.e. the three least significant bits are enough
hash(123) = 123 = 0b1111011
mask = PyDict_MINSIZE - 1 = 8 - 1 = 7 = 0b111
index = hash(123) & mask = 0b1111011 & 0b111 = 0b011 = 3

Integers
mask = PyDict_MINSIZE - 1
index = hash(123) & mask

Strings
mask = PyDict_MINSIZE - 1
index = hash(123) & mask

Dictionary in CPython >2.1
Dictionary initialization: PyDict_New()
- ma_used = 0
- ma_fill = 0
- ma_mask = PyDict_MINSIZE – 1
- ma_table = ma_smalltable
- ma_lookup = lookdict_string
Add an item: PyDict_SetItem()
- insertdict(): ma_used += 1, ma_fill += 1
- dictresize() if ma_fill >= 2/3 * size
Delete an item: PyDict_DelItem()
- ma_used -= 1

Add item

hash('!!!') = -1297030748
i = -1297030748 & 7 = 4
perturb = -1297030748
# i = (i * 5) + 1 + perturb
i = (4 * 5) + 1 + (-1297030748) = -1297030727
index = -1297030727 & 7 = 1
# perturb = perturb >> PERTURB_SHIFT
perturb = -1297030748 >> 5 = -40532211
# i = (i * 5) + 1 + perturb
i = (-1297030727 * 5) + 1 + (-40532211) = -6525685845
index = -6525685845 & 7 = 3
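The arithmetic for '!!!' can be replayed with a few lines of Python. This is a sketch that follows the slide's signed integer arithmetic; the function name probe_indexes is introduced here, and as far as I can tell the C source keeps perturb in an unsigned type, so a faithful port would first mask the hash to the machine word size.

```python
PERTURB_SHIFT = 5


def probe_indexes(hash_value, mask, count):
    """Return the first `count` slot indexes probed for hash_value."""
    i = hash_value & mask
    perturb = hash_value
    indexes = [i]
    for _ in range(count - 1):
        i = (i * 5) + 1 + perturb
        perturb >>= PERTURB_SHIFT  # Python's >> floors, as in the slide
        indexes.append(i & mask)
    return indexes


# hash('!!!') = -1297030748 is the value from the slide, table size 8.
print(probe_indexes(-1297030748, 7, 3))  # [4, 1, 3]
```

So '!!!' first targets slot 4, then slot 1, then slot 3, until it finds a free slot.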

Add item
>>> d
{'python': 2, 'article': 4, '!!!': 5, 'dict': 3, 'a key': 1}
>>> d.__sizeof__()
248

Hash table resize
>>> d
{'!!!': 5, 'python': 2, 'dict': 3, 'a key': 1, 'article': 4, ';)': 6}
>>> d.__sizeof__()
1016

Hash table resize
PyDict_SetItem(...)
{
    ...
    dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
    ...
}

dictresize(PyDictObject *mp, Py_ssize_t minused)
{
    ...
    /* Find the smallest table size > minused. */
    for (newsize = 8; newsize <= minused && newsize > 0; newsize <<= 1)
        ;
    ...
}
In the example: ma_fill = 6 > (8 * 2 / 3), ma_used = 6
Hence minused = 4 * 6 = 24, therefore newsize = 32
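The sizing rule from the two C fragments can be written as one small Python function. A sketch only; the name new_table_size is introduced here, and the overflow guard (newsize > 0) is dropped because Python integers do not overflow.

```python
def new_table_size(ma_used, min_size=8):
    """Table size chosen on resize: smallest power of two > minused."""
    # Growth rate is 4 for small dictionaries and 2 above 50000 items.
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = min_size
    while newsize <= minused:
        newsize <<= 1
    return newsize


print(new_table_size(6))      # 32, the example from the slide
print(new_table_size(50001))  # 131072
```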

Addition order
>>> d1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
>>> d2 = {'three': 3, 'two': 2, 'five': 5, 'four': 4, 'one': 1}
>>> d1 == d2
True
>>> d1.keys()
['four', 'three', 'five', 'two', 'one']
>>> d2.keys()
['four', 'one', 'five', 'three', 'two']
The iteration order depends on the order of insertion: the slot a new item lands in depends on the items already in the dictionary

int, float, complex
>>> 7.0 == 7 == (7+0j)
True
>>> d = {}
>>> d[7.0] = 'float'
>>> d
{7.0: 'float'}
>>> d[7] = 'int'
>>> d
{7.0: 'int'}
>>> d[7+0j] = 'complex'
>>> d
{7.0: 'complex'}
>>> type(d.keys()[0])
<type 'float'>
>>> hash(7)
7
>>> hash(7.0)
7
>>> hash(7+0j)
7

Adding item during iteration
>>> d = {'a': 1}
>>> for i in d:
...     d['new item'] = 123
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
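A common way around this error is to iterate over a snapshot of the keys, so the loop never sees the table while it is changing. A minimal sketch:

```python
d = {'a': 1, 'b': 2}

# list(d) copies the keys before the loop starts, so adding items is safe.
for key in list(d):
    d[key + '_copy'] = d[key]

print(sorted(d))  # ['a', 'a_copy', 'b', 'b_copy']
```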

Delete item
dummy = PyString_FromString("<dummy key>");

Interesting case

Interesting case
ma_fill = 6 > (8 * 2 / 3)
dictresize()

Interesting case
ma_fill = 6 > (8 * 2 / 3)
ma_used = 1, hence minused = 4 * 1 = 4, therefore newsize = 8
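One sequence that leads to this state is: add five items, delete all five, then add one more. The toy model below tracks only the counters, not the table itself; the class SlotCounters is hypothetical, and it assumes each new key lands in an Unused slot rather than reusing a Dummy one.

```python
class SlotCounters(object):
    """Toy model of ma_fill / ma_used bookkeeping (counters only, no table)."""

    def __init__(self, size=8):
        self.size = size
        self.fill = 0   # Active + Dummy slots
        self.used = 0   # Active slots

    def add_new_key(self):
        # Assumption: the key lands in an Unused slot, not a Dummy one.
        self.fill += 1
        self.used += 1
        if self.fill * 3 >= self.size * 2:
            self._resize((2 if self.used > 50000 else 4) * self.used)

    def delete_key(self):
        self.used -= 1  # the slot becomes Dummy, so fill is unchanged

    def _resize(self, minused):
        newsize = 8
        while newsize <= minused:
            newsize <<= 1
        self.size = newsize
        self.fill = self.used  # Dummy slots are dropped on resize


c = SlotCounters()
for _ in range(5):
    c.add_new_key()
for _ in range(5):
    c.delete_key()
c.add_new_key()  # fill reaches 6 with used = 1, so a resize is triggered
print(c.size, c.fill, c.used)  # 8 1 1
```

The resize here does not grow the table at all: it rebuilds a table of the same size and discards the Dummy slots.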

Cache
PyDictEntry ma_smalltable[8];
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
On x86 with 64 bytes per cache line and three 4-byte fields per entry: 64 / (4 * 3) = 5.333 entries
Cache locality and collisions: see “/Objects/dictnotes.txt”
Source      Access time
L1 Cache    1 ns
L2 Cache    4 ns
RAM         100 ns

Open addressing vs separate chaining
Note: the illustration shows linear probing rather than the pseudo-random probing used in CPython

OrderedDict
from collections import OrderedDict
- Internal dict
- Circular doubly linked list
- “/Lib/collections/__init__.py”

Present

Dictionary in CPython 3.5
- PEP 412 - Key-Sharing Dictionary
- The DictObject can be in one of two forms: combined table or split table
- Initial size = 4 (split table) or 8 (combined table)
- Maximum dictionary load = (2*n+1)/3
- Growth rate = used*2 + capacity/2
- “/Objects/dict-common.h”, “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    PyDictKeyEntry dk_entries[1];
};
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;

Combined table vs split table
Combined table
- For explicit dictionaries (dict() and {})
- ma_values = NULL, dk_refcnt = 1
- Never becomes a split-table dictionary
Split table
- For attribute dictionaries (the __dict__ attribute of an object)
- ma_values != NULL, dk_refcnt >= 1
- Only string (unicode) keys are allowed
- Values are stored in the ma_values array
- When a split dictionary is resized, it is converted to a combined table (but if the resize is the result of storing an instance attribute, and there is only one instance of the class, the dictionary is re-split immediately)
- Lookup function = lookdict_split

Dictionary in CPython 3.5
A new kind of slot:
1) Unused
2) Active
3) Dummy
4) Pending (me_key != NULL, me_key != dummy and me_value == NULL)
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

Split table
Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3, i.e. initially ma_keys->dk_usable = 3

Split table
class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
setattr(a, 'd', 4)  # re-split
print(a.__dict__.__sizeof__())  # 168
print({}.__sizeof__())  # 264
Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3
Growth rate = used*2 + capacity/2 = 3*2 + 4/2 = 8, hence minused = 8, therefore newsize = 16 (see dictresize)
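The three formulas on this slide can be checked with integer arithmetic. This is a sketch; the function names are introduced here for illustration, and the minimum of 8 in new_size is an assumption taken from the combined-table initial size listed earlier.

```python
def usable_fraction(n):
    """Maximum dictionary load for a table of n slots in CPython 3.5."""
    return (2 * n + 1) // 3


def growth_rate(used, capacity):
    """Requested minimum size when the table has to grow."""
    return used * 2 + capacity // 2


def new_size(minused, min_size=8):
    """Smallest power of two > minused (min_size = 8 is an assumption)."""
    newsize = min_size
    while newsize <= minused:
        newsize <<= 1
    return newsize


usable = usable_fraction(4)        # 3 attributes fit in the initial split table
minused = growth_rate(usable, 4)   # 3*2 + 4/2 = 8
print(usable, minused, new_size(minused))  # 3 8 16
```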

Split table
class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
b = A()
setattr(a, 'd', 4)  # no re-split because of b
print(a.__dict__.__sizeof__())  # 456
Split table is converted to a combined table

Summary
Key differences between this implementation and CPython 2.x:
- The table can be split into two parts – the keys and the values
- A new kind of slot
- No more ma_smalltable embedded in the dict
- General dictionaries are slightly larger
- All object dictionaries of a single class can share a single key-table, saving about 60% memory for such cases (according to https://github.com/python/cpython/blob/3.5/Objects/dictnotes.txt)
Bugs still happen: unbounded memory growth resizing split-table dicts (https://bugs.python.org/issue28147)

Hash functions in CPython 3.5
SipHash for strings and bytes (>= CPython 3.4)
- PEP 456 – Secure and interchangeable hash algorithm
- Resistant against hash flooding DoS attacks
- Successfully used in many other languages
Slightly modified hash function for float
hash(float("+inf")) == 314159, hash(float("-inf")) == -314159, was -271828
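Both facts can be inspected from a running Python 3 interpreter through sys.hash_info. A small sketch; the algorithm name that is printed depends on the interpreter version and build, so it is shown without an expected value.

```python
import sys

# Name of the hash algorithm used for str and bytes in this build.
print(sys.hash_info.algorithm)

# Hash values of the infinities.
print(hash(float('inf')), hash(float('-inf')))  # 314159 -314159
print(sys.hash_info.inf)                        # 314159
```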

OrderedDict in CPython 3.5
- Doubly-linked-list
- od_fast_nodes hash table that mirrors the od_dict table
- “/Include/odictobject.h”, “/Objects/odictobject.c”

Alternative versions

Dictionary in PyPy
- Starting from PyPy 2.5.0 – ordereddict is used by default
- Initial size = 16
- Load factor up to 2/3
- Growth rate = 4 (up to 30000 items) or 2
- If a lot of items are deleted, compaction is performed
- “/rpython/rtyper/lltypesystem/rordereddict.py”
struct dicttable {
    int num_live_items;
    int num_ever_used_items;
    int resize_counter;
    variable_int *indexes; // byte, short, int, long
    dictentry *entries;
    ...
}
struct dictentry {
    PyObject *key;
    PyObject *value;
    long hash;
    bool valid;
}

Dictionary in PyPy
struct dicttable {
    variable_int *indexes;
    dictentry *entries;
    ...
}
FREE = 0
DELETED = 1
VALID_OFFSET = 2

PyDictionary in Jython
- Based on ConcurrentHashMap
- Separate chaining collision resolution
- Initial size = 16, load factor = 0.75, growth rate = 2
- Segments and thread safety

PythonDictionary in IronPython
- Based on Dictionary (.NET)
- Separate chaining collision resolution
- Initial size = 0, load factor = 1.0
- Rehashing if the number of collisions >= 100
- Growth rate = 2 (the new size is the next higher prime number from a set of primes = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107,… , 4999559, 5999471, 7199369})

Future

Raymond Hettinger is happy

Dictionary in CPython 3.6
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used; /* number of items in the dictionary */
    uint64_t ma_version_tag; /* unique, changes when dict modified */
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;
- ma_version_tag is added (PEP 509 – Add a private version to dict)
- Initial size = 8 (for split table too)
- Maximum dictionary load = (2*n)/3
- Contributed by INADA Naoki in https://bugs.python.org/issue27350
Four kinds of slots in the table:
1) Unused (index == DKIX_EMPTY == -1)
2) Active (index >= 0, me_key != NULL and me_value != NULL)
3) Dummy (index == DKIX_DUMMY == -2, only for combined table)
4) Pending (index >= 0, me_key != NULL and me_value == NULL, only for split table)

Dictionary in CPython 3.6
- Added dk_nentries and dk_indices
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size; /* Size of the hash table (dk_indices) */
    dict_lookup_func dk_lookup; /* Function to lookup in dk_indices */
    Py_ssize_t dk_usable; /* Number of usable entries in dk_entries */
    Py_ssize_t dk_nentries; /* Number of used entries in dk_entries */
    union {
        int8_t as_1[8];
        int16_t as_2[4];
        int32_t as_4[2];
#if SIZEOF_VOID_P > 4
        int64_t as_8[1];
#endif
    } dk_indices;
    PyDictKeyEntry dk_entries[dk_usable]; /* using DK_ENTRIES macro */
};
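The idea behind dk_indices and dk_entries can be sketched in pure Python: a sparse table of small integers points into a dense, append-only list of entries, which is what makes the dictionary both compact and ordered. This is a toy model, not the CPython code; the class CompactDict is hypothetical, it uses a plain list instead of sized integer arrays, and it leaves out deletion (so there is no DKIX_DUMMY) and split tables.

```python
DKIX_EMPTY = -1
PERTURB_SHIFT = 5


class CompactDict(object):
    """Toy insertion-ordered dict: sparse index table + dense entries list."""

    def __init__(self, size=8):
        self.indices = [DKIX_EMPTY] * size  # plays the role of dk_indices
        self.entries = []                   # plays the role of dk_entries

    def _probe(self, hash_value):
        mask = len(self.indices) - 1
        perturb = hash_value & 0xFFFFFFFFFFFFFFFF  # treat the hash as unsigned
        i = perturb & mask
        while True:
            yield i
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + perturb + 1) & mask

    def _find(self, key):
        """Return (slot, entry_index); entry_index is None if key is absent."""
        h = hash(key)
        for slot in self._probe(h):
            ix = self.indices[slot]
            if ix == DKIX_EMPTY:
                return slot, None
            entry = self.entries[ix]
            if entry[0] == h and entry[1] == key:
                return slot, ix

    def __setitem__(self, key, value):
        slot, ix = self._find(key)
        if ix is not None:
            self.entries[ix][2] = value
            return
        # Keep the load at or below (2*n)/3, as in the 3.6 description.
        if (len(self.entries) + 1) * 3 > len(self.indices) * 2:
            self._grow()
            slot, _ = self._find(key)
        self.indices[slot] = len(self.entries)
        self.entries.append([hash(key), key, value])

    def __getitem__(self, key):
        _, ix = self._find(key)
        if ix is None:
            raise KeyError(key)
        return self.entries[ix][2]

    def _grow(self):
        # Only the index table is rebuilt; entries keep their order.
        self.indices = [DKIX_EMPTY] * (len(self.indices) * 2)
        for ix, entry in enumerate(self.entries):
            for slot in self._probe(entry[0]):
                if self.indices[slot] == DKIX_EMPTY:
                    self.indices[slot] = ix
                    break

    def keys(self):
        return [entry[1] for entry in self.entries]


cd = CompactDict()
for word in ['one', 'two', 'three', 'four', 'five', 'six', 'seven']:
    cd[word] = len(word)
print(cd.keys())  # keys come back in insertion order
```

Iteration simply walks the entries list, so ordering costs nothing extra, and a resize only rebuilds the small index table.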

Dictionary in CPython 3.6 (Combined table)

Summary
Key differences between this implementation and CPython 3.5:
- Compact and ordered
- Added dk_indices, whose integer type depends on the size of the dictionary
- Added ma_version_tag (PEP 509)
- Initial size for split table is changed to 8
- Maximum dictionary load changed to (2*n)/3
- Deleting an item converts the dict to a combined table
- Preserving the order of **kwargs in a function (PEP 468) is implemented
- Preserving Class Attribute Definition Order (PEP 520) is implemented
- The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5 (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes)

References
1. The implementation of a dictionary in Python 2.7 https://habrahabr.ru/post/247843/
2. Python hash calculation algorithms http://delimitry.blogspot.com/2014/07/python-hash-calculation-algorithms.html
3. PEP 412 - Key-Sharing Dictionary https://www.python.org/dev/peps/pep-0412/
4. PEP 456 - Secure and interchangeable hash algorithm https://www.python.org/dev/peps/pep-0456/
5. Mirror of the CPython repository https://github.com/python/cpython/
6. Faster, more memory efficient and more ordered dictionaries on PyPy https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-and-more.html
7. PyDictionary (Jython API documentation) http://www.jython.org/javadoc/org/python/core/PyDictionary.html
8. Jython repository https://bitbucket.org/jython/jython
9. Java theory and practice: Building a better HashMap http://www.ibm.com/developerworks/library/j-jtp08223/
10. Back to basics: Dictionary part 2, .NET implementation https://blog.markvincze.com/back-to-basics-dictionary-part-2-net-implementation/
11. http://referencesource.microsoft.com/mscorlib/system/collections/generic/dictionary.cs.html
12. https://github.com/IronLanguages/main/blob/ipy-2.7-maint/Languages/IronPython/IronPython/
13. https://bitbucket.org/pypy/pypy/
14. https://twitter.com/raymondh
15. PEP 509 - Add a private version to dict https://www.python.org/dev/peps/pep-0509/
16. Compact and ordered dict http://bugs.python.org/issue27350
17. What’s New In Python 3.6 https://docs.python.org/3.6/whatsnew/3.6.html
18. PEP 468 - Preserving the order of **kwargs in a function https://www.python.org/dev/peps/pep-0468/
19. PEP 520 - Preserving Class Attribute Definition Order https://www.python.org/dev/peps/pep-0520/
20. https://en.wikipedia.org/
Images from:
http://www.rcreptiles.com/blog/index.php/2008/06/28/read_the_operating_manual_first
http://kiwigamer450.deviantart.com/art/Back-to-The-Past-Logo-567858767
http://beyondplm.com/wp-content/uploads/2014/04/time-paradox-past-future-present.jpg
http://itband.ru/wp-content/uploads/2014/10/Future.jpg
https://en.wikipedia.org/wiki/Hash_table

Q & A @delimitry spbpython.guru SPb Python Interest Group

Additional slides

Separate chaining collision resolution Open addressing collision resolution (pseudo-random probing)
