Python dictionary past, present, future

Delimitry
September 20, 2016

A presentation from the SPb Python Interest Group community meetup.
It covers dictionaries in Python: the dictionary implementation in CPython 2.x and in CPython 3.x, and the recent changes in CPython 3.6. It also reviews dictionaries in the alternative Python implementations PyPy, IronPython and Jython.

Transcript

  1. Python dictionary past, present, future

    Dmitry Alimov, Senior Software Engineer, Zodiac Interactive
    SPb Python Interest Group, 2016
  2. >>> d = {}  # the same as d = dict()

    >>> d['a'] = 123
    >>> d['b'] = 345
    >>> d['c'] = 678
    >>> d
    {'a': 123, 'c': 678, 'b': 345}
    >>> d['b']
    345
    >>> del d['c']
    >>> d
    {'a': 123, 'b': 345}
  3. Dictionary keys must be hashable

    An object is hashable if it has a hash value which never changes during its lifetime.

    >>> d[list()] = 1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'list'
    >>> d[set()] = 2
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'set'
    >>> d[dict()] = 3
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'dict'

    All of Python’s immutable built-in objects are hashable (immutable containers such as tuples only if all of their elements are hashable).
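    A minimal sketch of how hashability can be checked at runtime. The helper name `is_hashable` is hypothetical and not part of the talk; it simply calls the built-in `hash()` and catches the `TypeError` shown on the slide.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable((1, 2)))           # True: a tuple of hashable items
print(is_hashable([1, 2]))           # False: lists are mutable
print(is_hashable((1, [2])))         # False: the tuple contains a list
print(is_hashable(frozenset({1})))   # True: immutable counterpart of set
```

    Only objects for which this check passes can be used as dictionary keys.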
  4. Random hash is a bad idea

    import random

    class A(object):
        def __init__(self, index):
            self.index = index

        def __eq__(self, other):
            return True

        def __hash__(self):
            return random.randint(0, 3)

        def __repr__(self):
            return 'A%d' % self.index

    d = {A(0): 0, A(1): 1, A(2): 2}
    print('keys: %s' % d.keys())
    print('values: %s' % d.values())
    for k in d:
        print('%s = %s' % (k, d.get(k, 'not found')))

    Run 1
        keys: [A1, A2, A0]
        values: [1, 2, 0]
        A1 = 1
        A2 = not found
        A0 = 0

    Run 2
        keys: [A1, A0]
        values: [2, 0]
        A1 = not found
        A0 = not found
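    For contrast, a sketch of a well-behaved key class. The class name `Stable` is hypothetical; the assumption is only that `__hash__` is derived from the same immutable data that `__eq__` compares, so equal objects always hash equally and the hash never changes.

```python
class Stable(object):
    """Key type whose hash is stable and consistent with equality."""

    def __init__(self, index):
        self.index = index

    def __eq__(self, other):
        return isinstance(other, Stable) and self.index == other.index

    def __hash__(self):
        # Derived from the same field that __eq__ compares.
        return hash(self.index)

    def __repr__(self):
        return 'Stable%d' % self.index


d = {Stable(0): 0, Stable(1): 1, Stable(2): 2}
# Fresh but equal key objects find the stored values on every run.
found = [d.get(Stable(i), 'not found') for i in range(3)]
print(found)  # [0, 1, 2]
```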
  5. Dictionary in CPython >2.1

    - Hash table
    - Open addressing collision resolution strategy
    - Initial size = 8
    - Load factor = 2/3
    - Growth rate = 2 or 4 (depending on the number of cells used)
    - “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”

    Three kinds of slots in the table:
    1) Unused
    2) Active
    3) Dummy

    typedef struct {
        Py_ssize_t me_hash;
        PyObject *me_key;
        PyObject *me_value;
    } PyDictEntry;
  6. ma_fill - the number of non-NULL keys (sum of Active and Dummy)

    ma_used - the number of Active items
    ma_mask - the index mask, initially PyDict_MINSIZE - 1
    ma_lookup - the lookup function (lookdict_string by default)

    #define PyDict_MINSIZE 8

    typedef struct _dictobject PyDictObject;
    struct _dictobject {
        PyObject_HEAD
        Py_ssize_t ma_fill;
        Py_ssize_t ma_used;
        Py_ssize_t ma_mask;
        PyDictEntry *ma_table;
        PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
        PyDictEntry ma_smalltable[PyDict_MINSIZE];
    };
  7. Hash functions

    Good hash functions are needed

    >>> map(hash, [0, 1, 2, 3, 4])
    [0, 1, 2, 3, 4]
    >>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
    [1540938117, 1540938118, 1540938119, 1540938112, 1540938113]

    Modified FNV (Fowler–Noll–Vo) hash function for strings

    “-R” option: turns on hash randomization, so that the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value

    >>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
    [-218138032, -218138029, -218138030, -218138027, -218138028]
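    A sketch in Python 3 of the unrandomized CPython 2.x string hash (multiply by 1000003, XOR with each byte, then XOR with the length). The function name `py2_string_hash` and the `bits` parameter are hypothetical; a 32-bit C long is assumed because that is what the values on the slide correspond to.

```python
def py2_string_hash(s, bits=32):
    """Sketch of the CPython 2.x str hash with randomization (-R) off."""
    if not s:
        return 0
    mask = (1 << bits) - 1
    data = s.encode('latin-1')       # a Python 2 str is a byte string
    x = (data[0] << 7) & mask
    for byte in data:
        x = ((1000003 * x) ^ byte) & mask
    x ^= len(data)
    if x >= 1 << (bits - 1):         # reinterpret as a signed C long
        x -= 1 << bits
    if x == -1:                      # -1 is reserved as an error marker
        x = -2
    return x


hashes = [py2_string_hash(s) for s in ['abca', 'abcb', 'abcc', 'abcd', 'abce']]
print(hashes)
# [1540938117, 1540938118, 1540938119, 1540938112, 1540938113]
```

    The strings differ only in the last character, so their hashes differ only in the low bits, which are exactly the bits a small table uses for indexing.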
  8. Collision resolution

    A collision occurs when two distinct pieces of data have the same hash value.
    Probing is a scheme for resolving collisions in a hash table that maintains a collection of key–value pairs and looks up the value associated with a given key.
    CPython uses pseudo-random probing:

    PERTURB_SHIFT = 5
    perturb = hash(key)
    while True:
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT
        index = j % 2**i

    See “/Objects/dictobject.c”
    CPython <2.2 used polynomial-based index computing
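    Once `perturb` has been shifted down to zero, the recurrence reduces to j = (5*j + 1) % 2**i. The sketch below (the helper name `visit_order` is hypothetical) illustrates why that is a safe fallback: it visits every slot of a table of size 2**i exactly once before repeating.

```python
def visit_order(i, start=0):
    """Slots visited by j = (5*j + 1) % 2**i, beginning at `start`."""
    size = 2 ** i
    order = []
    j = start
    for _ in range(size):
        order.append(j)
        j = (5 * j + 1) % size
    return order


order = visit_order(3)
print(order)  # [0, 1, 6, 7, 4, 5, 2, 3]
```

    Because every slot is eventually probed, a lookup in a table that still has a free slot is guaranteed to terminate.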
  9. Index computing

    >>> PyDict_MINSIZE = 8
    >>> key = 123
    >>> hash(key) % PyDict_MINSIZE
    3

    Instead of the modulo operation, use bitwise "AND" with the mask:

    >>> mask = PyDict_MINSIZE - 1
    >>> hash(key) & mask
    3

    This takes the least significant bits of the hash:
    2 ** i = PyDict_MINSIZE, hence i = 3, i.e. the three least significant bits are enough
    hash(123) = 123 = 0b1111011
    mask = PyDict_MINSIZE - 1 = 8 - 1 = 7 = 0b111
    index = hash(123) & mask = 0b1111011 & 0b111 = 0b011 = 3
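    A small check that the mask and the modulo give the same slot whenever the table size is a power of two. The helper names are hypothetical. Note that this equivalence for negative hashes relies on Python's `%`, which is never negative for a positive modulus; in C the remainder of a negative number can be negative.

```python
PyDict_MINSIZE = 8
mask = PyDict_MINSIZE - 1


def slot_by_modulo(h):
    return h % PyDict_MINSIZE


def slot_by_mask(h):
    return h & mask


samples = [123, 0, 7, 8, -1, -1297030748]
for h in samples:
    # Both columns are identical for every sample.
    print(h, slot_by_modulo(h), slot_by_mask(h))
```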
  10. Dictionary in CPython >2.1

    Dictionary initialization
        PyDict_New()
            ma_used = 0
            ma_fill = 0
            ma_mask = PyDict_MINSIZE - 1
            ma_table = ma_smalltable
            ma_lookup = lookdict_string

    Add an item
        PyDict_SetItem()
            insertdict()
                ma_used += 1
                ma_fill += 1
            dictresize() if ma_fill >= 2/3 * size

    Delete an item
        PyDict_DelItem()
            ma_used -= 1
  11. hash('!!!') = -1297030748

    i = -1297030748 & 7 = 4

    perturb = -1297030748
    # i = (i * 5) + 1 + perturb
    i = (4 * 5) + 1 + (-1297030748) = -1297030727
    index = -1297030727 & 7 = 1

    # perturb = perturb >> PERTURB_SHIFT
    perturb = -1297030748 >> 5 = -40532211
    # i = (i * 5) + 1 + perturb
    i = (-1297030727 * 5) + 1 + (-40532211) = -6525685845
    index = -6525685845 & 7 = 3
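    The arithmetic above can be replayed with a short sketch. The function name `probe_sequence` is hypothetical, and it uses signed Python integers exactly as the slide does; the C code keeps `perturb` in an unsigned type, so only the first few probes are modelled faithfully here.

```python
PERTURB_SHIFT = 5


def probe_sequence(hash_value, table_size=8, count=3):
    """Return the first `count` slot indices probed for `hash_value`."""
    mask = table_size - 1
    perturb = hash_value
    i = hash_value & mask
    indices = [i]
    for _ in range(count - 1):
        i = (i * 5) + 1 + perturb
        perturb >>= PERTURB_SHIFT
        indices.append(i & mask)
    return indices


# The hash value of '!!!' is taken from the slide, not recomputed.
seq = probe_sequence(-1297030748)
print(seq)  # [4, 1, 3]
```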
  12. Add item

    >>> d
    {'python': 2, 'article': 4, '!!!': 5, 'dict': 3, 'a key': 1}
    >>> d.__sizeof__()
    248
  13. Hash table resize

    >>> d
    {'!!!': 5, 'python': 2, 'dict': 3, 'a key': 1, 'article': 4, ';)': 6}
    >>> d.__sizeof__()
    1016
  14. Hash table resize

    PyDict_SetItem(...) {
        ...
        dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
        ...
    }

    dictresize(PyDictObject *mp, Py_ssize_t minused) {
        ...
        /* Find the smallest table size > minused. */
        for (newsize = 8; newsize <= minused && newsize > 0; newsize <<= 1)
            ;
        ...
    }

    In the example: ma_fill = 6 > (8 * 2 / 3), ma_used = 6
    Hence minused = 4 * 6 = 24, therefore newsize = 32
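    The two C fragments above can be combined into one Python sketch. The function name `new_table_size` is hypothetical; the `newsize > 0` overflow guard from the C loop is omitted because Python integers do not overflow.

```python
def new_table_size(ma_used, start=8):
    """Table size chosen on resize, following the C fragments on the slide."""
    # Growth factor is 4 for small dicts and 2 above 50000 used items.
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = start
    # Find the smallest power of two strictly greater than minused.
    while newsize <= minused:
        newsize <<= 1
    return newsize


print(new_table_size(6))  # 32: minused = 24
print(new_table_size(1))  # 8:  minused = 4, so the table keeps its size
```

    The second call corresponds to a table that is full of Dummy slots but holds a single Active item: the resize rebuilds the table without growing it.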
  15. Addition order

    >>> d1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
    >>> d2 = {'three': 3, 'two': 2, 'five': 5, 'four': 4, 'one': 1}
    >>> d1 == d2
    True
    >>> d1.keys()
    ['four', 'three', 'five', 'two', 'one']
    >>> d2.keys()
    ['four', 'one', 'five', 'three', 'two']

    The slot an added item ends up in depends on the items already in the dictionary, so the key order depends on the order of addition
  16. int, float, complex

    >>> 7.0 == 7 == (7+0j)
    True
    >>> d = {}
    >>> d[7.0] = 'float'
    >>> d
    {7.0: 'float'}
    >>> d[7] = 'int'
    >>> d
    {7.0: 'int'}
    >>> d[7+0j] = 'complex'
    >>> d
    {7.0: 'complex'}
    >>> type(d.keys()[0])
    <type 'float'>

    >>> hash(7)
    7
    >>> hash(7.0)
    7
    >>> hash(7+0j)
    7
  17. Adding item during iteration

    >>> d = {'a': 1}
    >>> for i in d:
    ...     d['new item'] = 123
    ...
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    RuntimeError: dictionary changed size during iteration
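    A common way around this error, sketched here for Python 3, is to iterate over a snapshot of the keys so that the dictionary itself is free to change. This workaround is an addition for illustration and is not taken from the slides.

```python
d = {'a': 1}

# Iterating over the live dict while adding a key raises RuntimeError.
raised = False
try:
    for key in d:
        d['new item'] = 123
except RuntimeError:
    raised = True
print(raised)  # True

# Iterating over a copy of the keys is safe.
d2 = {'a': 1}
for key in list(d2):
    d2['new item'] = 123
print(d2)  # {'a': 1, 'new item': 123}
```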
  18. Interesting case

    ma_fill = 6 > (8 * 2 / 3), ma_used = 1
    Hence minused = 4 * 1 = 4, therefore newsize = 8
  19. Cache

    Cache locality and collisions

    PyDictEntry ma_smalltable[8];

    typedef struct {
        Py_ssize_t me_hash;
        PyObject *me_key;
        PyObject *me_value;
    } PyDictEntry;

    On x86 with 64 bytes per cache line: 64 / (4 * 3) = 5.333 entries
    See “/Objects/dictnotes.txt”

    Source      Access time
    L1 Cache    1 ns
    L2 Cache    4 ns
    RAM         100 ns
  20. Open addressing vs separate chaining

    Note: the illustration shows linear probing rather than the pseudo-random probing used in CPython
  21. OrderedDict

    from collections import OrderedDict

    - Internal dict
    - Circular doubly linked list
    - “/Lib/collections/__init__.py”
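    A brief usage sketch, assuming Python 3 (`move_to_end` is not available in Python 2). It shows the order-aware operations that the linked list makes cheap, and that equality between two OrderedDict objects is order-sensitive.

```python
from collections import OrderedDict

od = OrderedDict([('one', 1), ('two', 2), ('three', 3)])

od.move_to_end('one')            # relink the node at the tail
print(list(od))                  # ['two', 'three', 'one']

first = od.popitem(last=False)   # pop from the head of the list
print(first)                     # ('two', 2)

a = OrderedDict([('x', 1), ('y', 2)])
b = OrderedDict([('y', 2), ('x', 1)])
print(a == b)                    # False: order matters
print(dict(a) == dict(b))        # True: plain dicts ignore order
```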
  22. Dictionary in CPython 3.5

    - PEP 412 - Key-Sharing Dictionary
    - The DictObject can be in one of two forms: combined table or split table
    - Initial size = 4 (split table) or 8 (combined table)
    - Maximum dictionary load = (2*n+1)/3
    - Growth rate = used*2 + capacity/2
    - “/Objects/dict-common.h”, “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”

    typedef struct {
        Py_hash_t me_hash;
        PyObject *me_key;
        PyObject *me_value; /* only meaningful for combined tables */
    } PyDictKeyEntry;

    struct _dictkeysobject {
        Py_ssize_t dk_refcnt;
        Py_ssize_t dk_size;
        dict_lookup_func dk_lookup;
        Py_ssize_t dk_usable;
        PyDictKeyEntry dk_entries[1];
    };

    typedef struct {
        PyObject_HEAD
        Py_ssize_t ma_used;
        PyDictKeysObject *ma_keys;
        PyObject **ma_values;
    } PyDictObject;
  23. Combined table vs split table

    Combined table
    - For explicit dictionaries (dict() and {})
    - ma_values = NULL, dk_refcnt = 1
    - Never becomes a split-table dictionary

    Split table
    - For attribute dictionaries (the __dict__ attribute of an object)
    - ma_values != NULL, dk_refcnt >= 1
    - Only string (unicode) keys are allowed
    - Values are stored in the ma_values array
    - When a split dictionary is resized it is converted to a combined table (but if the resize is the result of storing an instance attribute, and there is only one instance of the class, the dictionary is re-split immediately)
    - Lookup function = lookdict_split
  24. Dictionary in CPython 3.5

    A new kind of slot:
    1) Unused
    2) Active
    3) Dummy
    4) Pending (me_key != NULL, me_key != dummy and me_value == NULL)

    typedef struct {
        Py_hash_t me_hash;
        PyObject *me_key;
        PyObject *me_value; /* only meaningful for combined tables */
    } PyDictKeyEntry;
  25. Split table

    Initial size = 4
    Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3, i.e. initially ma_keys->dk_usable = 3
  26. Split table

    class A():
        def __init__(self):
            self.a = 1
            self.b = 2
            self.c = 3

    a = A()
    print(a.__dict__.__sizeof__())  # 72
    setattr(a, 'd', 4)  # re-split
    print(a.__dict__.__sizeof__())  # 168
    print({}.__sizeof__())  # 264

    Initial size = 4
    Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3
    Growth rate = used*2 + capacity/2 = 3*2 + 4/2 = 8, hence minused = 8, therefore newsize = 16 (see dictresize)
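    The sizing arithmetic above can be replayed with a short sketch. The function names are hypothetical; the formulas are the ones quoted on the slides for CPython 3.5, with integer division standing in for the C integer arithmetic, and the resize loop is assumed to start from 8 as in the dictresize fragment shown earlier.

```python
def usable_fraction_35(n):
    """Maximum number of entries for a table of size n: (2*n+1)/3."""
    return (2 * n + 1) // 3


def growth_rate_35(used, capacity):
    """Requested minimum size on resize: used*2 + capacity/2."""
    return used * 2 + capacity // 2


def new_size(minused, start=8):
    """Smallest power of two strictly greater than minused."""
    size = start
    while size <= minused:
        size <<= 1
    return size


usable = usable_fraction_35(4)       # 3 usable entries in a split table
minused = growth_rate_35(usable, 4)  # 3*2 + 4/2 = 8
print(usable, minused, new_size(minused))  # 3 8 16
```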
  27. Split table

    class A():
        def __init__(self):
            self.a = 1
            self.b = 2
            self.c = 3

    a = A()
    print(a.__dict__.__sizeof__())  # 72
    b = A()
    setattr(a, 'd', 4)  # no re-split because of b
    print(a.__dict__.__sizeof__())  # 456

    Split table is converted to a combined table
  28. Summary

    Key differences between this implementation and CPython 2.x:
    - The table can be split into two parts: the keys and the values
    - A new kind of slot
    - No more ma_smalltable embedded in the dict
    - General dictionaries are slightly larger
    - All object dictionaries of a single class can share a single key-table, saving about 60% memory for such cases (according to https://github.com/python/cpython/blob/3.5/Objects/dictnotes.txt)

    Bugs still happen: Unbounded memory growth resizing split-table dicts (https://bugs.python.org/issue28147)
  29. Hash functions in CPython 3.5

    SipHash for strings and bytes (>= CPython 3.4)
    - Resistant against hash flooding DoS attacks
    - Successfully used in many other languages
    PEP 456 – Secure and interchangeable hash algorithm

    Slightly modified hash function for float:
    hash(float("+inf")) == 314159, hash(float("-inf")) == -314159 (was -271828)
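    These values can be inspected on a running interpreter, as sketched below for Python 3. `sys.hash_info` is the standard way to query hash parameters; the algorithm name it reports depends on the interpreter version and build, so no particular value is assumed here.

```python
import sys

# Name of the str/bytes hash algorithm used by this interpreter build.
print(sys.hash_info.algorithm)

# Hash values of the float infinities quoted on the slide.
pos_inf = hash(float('+inf'))
neg_inf = hash(float('-inf'))
print(pos_inf, neg_inf)  # 314159 -314159

# Equal numbers of different types still hash equally.
print(hash(7) == hash(7.0) == hash(7 + 0j))  # True
```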
  30. OrderedDict in CPython 3.5

    - Doubly-linked-list
    - od_fast_nodes hash table that mirrors the od_dict table
    - “/Include/odictobject.h”, “/Objects/odictobject.c”
  31. Dictionary in PyPy

    - Starting from PyPy 2.5.0, ordereddict is used by default
    - Initial size = 16
    - Load factor up to 2/3
    - Growth rate = 4 (up to 30000 items) or 2
    - If a lot of items are deleted, compaction is performed
    - “/rpython/rtyper/lltypesystem/rordereddict.py”

    struct dicttable {
        int num_live_items;
        int num_ever_used_items;
        int resize_counter;
        variable_int *indexes; // byte, short, int, long
        dictentry *entries;
        ...
    }

    struct dictentry {
        PyObject *key;
        PyObject *value;
        long hash;
        bool valid;
    }
  32. PyDictionary in Jython

    - Based on ConcurrentHashMap
    - Separate chaining collision resolution
    - Initial size = 16, load factor = 0.75, growth rate = 2
    - Segments and thread safety
  33. PythonDictionary in IronPython

    - Based on Dictionary (.NET)
    - Separate chaining collision resolution
    - Initial size = 0, load factor = 1.0
    - Rehashing if the number of collisions >= 100
    - Growth rate = 2 (the new size is equal to the next higher prime number) from a set of primes = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107,… , 4999559, 5999471, 7199369}
  34. Dictionary in CPython 3.6

    typedef struct {
        Py_hash_t me_hash;
        PyObject *me_key;
        PyObject *me_value; /* only meaningful for combined tables */
    } PyDictKeyEntry;

    typedef struct {
        PyObject_HEAD
        Py_ssize_t ma_used; /* number of items in the dictionary */
        uint64_t ma_version_tag; /* unique, changes when dict modified */
        PyDictKeysObject *ma_keys;
        PyObject **ma_values;
    } PyDictObject;

    - ma_version_tag is added (PEP 509 – Add a private version to dict)
    - Initial size = 8 (for split table too)
    - Maximum dictionary load = (2*n)/3
    - Contributed by INADA Naoki in https://bugs.python.org/issue27350

    Four kinds of slots in the table:
    1) Unused (index == DKIX_EMPTY == -1)
    2) Active (index >= 0, me_key != NULL and me_value != NULL)
    3) Dummy (index == DKIX_DUMMY == -2, only for combined table)
    4) Pending (index >= 0, me_key != NULL and me_value == NULL, only for split table)
  35. Dictionary in CPython 3.6

    - Added dk_nentries and dk_indices

    struct _dictkeysobject {
        Py_ssize_t dk_refcnt;
        Py_ssize_t dk_size; /* Size of the hash table (dk_indices) */
        dict_lookup_func dk_lookup; /* Function to lookup in dk_indices */
        Py_ssize_t dk_usable; /* Number of usable entries in dk_entries */
        Py_ssize_t dk_nentries; /* Number of used entries in dk_entries */
        union {
            int8_t as_1[8];
            int16_t as_2[4];
            int32_t as_4[2];
    #if SIZEOF_VOID_P > 4
            int64_t as_8[1];
    #endif
        } dk_indices;
        PyDictKeyEntry dk_entries[dk_usable]; /* using DK_ENTRIES macro */
    };
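    A toy Python model of the layout above: a sparse index table (standing in for dk_indices) that points into a dense, append-only entry list (standing in for dk_entries). The class `CompactDict` and its methods are hypothetical; deletion, split tables and the exact CPython probe order are left out, so this is a sketch of the idea rather than the real implementation.

```python
DKIX_EMPTY = -1          # marker for an unused index slot, as on the slides
PERTURB_SHIFT = 5


class CompactDict(object):
    """Sparse index table plus dense, insertion-ordered entries."""

    def __init__(self, size=8):
        self.indices = [DKIX_EMPTY] * size  # role of dk_indices
        self.entries = []                   # role of dk_entries: (hash, key, value)

    def _probe(self, h):
        mask = len(self.indices) - 1
        perturb = h & 0xFFFFFFFFFFFFFFFF    # treat the hash as unsigned
        i = perturb & mask
        while True:
            yield i
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + perturb + 1) & mask

    def _find(self, key, h):
        """Return (slot, entry index); entry index is DKIX_EMPTY if absent."""
        for slot in self._probe(h):
            ix = self.indices[slot]
            if ix == DKIX_EMPTY:
                return slot, DKIX_EMPTY
            entry_hash, entry_key, _ = self.entries[ix]
            if entry_hash == h and entry_key == key:
                return slot, ix

    def _resize(self):
        # Only the small index table is rebuilt; entries keep their order.
        self.indices = [DKIX_EMPTY] * (len(self.indices) * 2)
        for ix, (h, _, _) in enumerate(self.entries):
            for slot in self._probe(h):
                if self.indices[slot] == DKIX_EMPTY:
                    self.indices[slot] = ix
                    break

    def __setitem__(self, key, value):
        h = hash(key)
        slot, ix = self._find(key, h)
        if ix != DKIX_EMPTY:                # existing key: update in place
            self.entries[ix] = (h, key, value)
            return
        if (len(self.entries) + 1) * 3 > len(self.indices) * 2:
            self._resize()                  # keep the load at or below 2/3
            slot, _ = self._find(key, h)
        self.indices[slot] = len(self.entries)
        self.entries.append((h, key, value))

    def __getitem__(self, key):
        _, ix = self._find(key, hash(key))
        if ix == DKIX_EMPTY:
            raise KeyError(key)
        return self.entries[ix][2]

    def keys(self):
        return [key for _, key, _ in self.entries]


cd = CompactDict()
for n, name in enumerate(['one', 'two', 'three', 'four', 'five', 'six'], 1):
    cd[name] = n
cd['two'] = 22                              # update keeps the position
print(cd.keys())     # ['one', 'two', 'three', 'four', 'five', 'six']
print(cd['two'])     # 22
```

    Iteration simply walks the dense entry list, which is why the layout is ordered as a side effect of being compact.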
  36. Summary

    Key differences between this implementation and CPython 3.5:
    - Compact and ordered
    - Added dk_indices, with a type that depends on the size of the dictionary
    - Added ma_version_tag (PEP 509)
    - Initial size for split table is changed to 8
    - Maximum dictionary load changed to (2*n)/3
    - Deleting an item converts the dict to a combined table
    - Preserving the order of **kwargs in a function (PEP 468) is implemented
    - Preserving Class Attribute Definition Order (PEP 520) is implemented
    - The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5 (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes)
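    A short sketch of what "ordered" means in practice, assuming CPython 3.6 or any Python 3.7+ interpreter (where insertion order became a language guarantee). The function name `collect` is hypothetical.

```python
d1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
d2 = {'three': 3, 'two': 2, 'five': 5, 'four': 4, 'one': 1}

print(d1 == d2)   # True: equality still ignores order
print(list(d1))   # ['one', 'two', 'three', 'four', 'five']
print(list(d2))   # ['three', 'two', 'five', 'four', 'one']


def collect(**kwargs):
    """Return keyword argument names in the order they were passed (PEP 468)."""
    return list(kwargs)


print(collect(b=1, a=2, c=3))  # ['b', 'a', 'c']
```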
  37. References

    1. The implementation of a dictionary in Python 2.7 https://habrahabr.ru/post/247843/
    2. Python hash calculation algorithms http://delimitry.blogspot.com/2014/07/python-hash-calculation-algorithms.html
    3. PEP 412 - Key-Sharing Dictionary https://www.python.org/dev/peps/pep-0412/
    4. PEP 456 - Secure and interchangeable hash algorithm https://www.python.org/dev/peps/pep-0456/
    5. Mirror of the CPython repository https://github.com/python/cpython/
    6. Faster, more memory efficient and more ordered dictionaries on PyPy https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-and-more.html
    7. PyDictionary (Jython API documentation) http://www.jython.org/javadoc/org/python/core/PyDictionary.html
    8. Jython repository https://bitbucket.org/jython/jython
    9. Java theory and practice: Building a better HashMap http://www.ibm.com/developerworks/library/j-jtp08223/
    10. Back to basics: Dictionary part 2, .NET implementation https://blog.markvincze.com/back-to-basics-dictionary-part-2-net-implementation/
    11. http://referencesource.microsoft.com/mscorlib/system/collections/generic/dictionary.cs.html
    12. https://github.com/IronLanguages/main/blob/ipy-2.7-maint/Languages/IronPython/IronPython/
    13. https://bitbucket.org/pypy/pypy/
    14. https://twitter.com/raymondh
    15. PEP 509 - Add a private version to dict https://www.python.org/dev/peps/pep-0509/
    16. Compact and ordered dict http://bugs.python.org/issue27350
    17. What’s New In Python 3.6 https://docs.python.org/3.6/whatsnew/3.6.html
    18. PEP 468 - Preserving the order of **kwargs in a function https://www.python.org/dev/peps/pep-0468/
    19. PEP 520 - Preserving Class Attribute Definition Order https://www.python.org/dev/peps/pep-0520/
    20. https://en.wikipedia.org/

    Images from:
    http://www.rcreptiles.com/blog/index.php/2008/06/28/read_the_operating_manual_first
    http://kiwigamer450.deviantart.com/art/Back-to-The-Past-Logo-567858767
    http://beyondplm.com/wp-content/uploads/2014/04/time-paradox-past-future-present.jpg
    http://itband.ru/wp-content/uploads/2014/10/Future.jpg
    https://en.wikipedia.org/wiki/Hash_table