
Claudio Freire - Efficient shared memory data structures

Multithreading makes shared memory easy, but true parallelism next to impossible. Multiprocessing gives us true parallelism, but it makes sharing memory difficult and adds a lot of overhead. In this talk, we'll explore techniques to share memory between processes efficiently, with a focus on sharing read-only massive data structures.

https://us.pycon.org/2018/schedule/presentation/140/

PyCon 2018

May 11, 2018

Transcript

  1. Sharing memory... what for?
     • Cache too big to fit in RAM...
       – N times with N processors
     • Multiprocessing: input data
     • Tornado / mod_wsgi: slow-evolving caches
       – When only a small fraction of it is frequently accessed
         • The “working set” fits in memory, but not the whole dataset
         • When disk access for infrequently needed data is acceptable

  2. Sharing memory... what for?
     • When the serialization cost becomes prohibitive
       – Big and complex data structures: you can’t afford the CPU cycles (de)serializing lots of objects
       – Objects with inefficient serialization
         • E.g.: SQLAlchemy objects

  3. Sharing memory... what for?
     • The transition to a “shared buffer”
       – Split the cache in two layers:
         • A read-only layer of slow-evolving data
         • A regular read-write layer with continuous (but infrequent) updates
       – Most data must reside in the static layer

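A tiny sketch of the two-layer split described in slide 3: reads consult a small read-write overlay first and fall back to the read-only shared layer, while writes only ever touch the overlay. The class name and mapping API are assumptions for illustration, not the talk's code.

    class LayeredCache:
        """Read-write overlay on top of a read-only, shared, slow-evolving layer."""
        def __init__(self, static_layer):
            self.static = static_layer   # e.g. a proxy over a shared, mmap'ed buffer
            self.dynamic = {}            # process-private, continuously (but rarely) updated

        def __getitem__(self, key):
            try:
                return self.dynamic[key]
            except KeyError:
                return self.static[key]  # most data lives in the static layer

        def __setitem__(self, key, value):
            self.dynamic[key] = value    # the static layer is never written at runtime
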
  4. Why not multiprocessing?
     • Lock contention
     • Poor support for complex structures:
       “Note: Although it is possible to store a pointer in shared memory remember that this will refer to a location in the address space of a specific process. However, the pointer is quite likely to be invalid in the context of a second process and trying to dereference the pointer from the second process may cause a crash.” (multiprocessing docs)

  5. The shared buffer
     • Getting the best of both worlds
       – Compact and efficient shared memory representation of static or slow-changing data
       – Dynamic and fast-updateable structure for the rest
     [Diagram labels: User, Old stuff, New stuff, Buffer]

  6. How? As simple as:

     fileobj = open("buf", "r+")
     buf = mmap.mmap(
         fileobj.fileno(), 0,
         access=mmap.ACCESS_READ)

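A self-contained variant of the snippet in slide 6, with the file name "buf" and the tiny "if?" payload as placeholders: one process writes the packed data once, and every reader maps it with ACCESS_READ, so the pages come from the shared OS page cache instead of being copied per process.

    import mmap
    import struct

    # Build step (run once, e.g. by a separate builder process): persist packed data.
    with open("buf", "wb") as f:
        f.write(struct.pack("if?", 1, 2.0, True))

    # Reader processes: map the file read-only; pages are shared via the OS page cache.
    with open("buf", "rb") as fileobj:
        buf = mmap.mmap(fileobj.fileno(), 0, access=mmap.ACCESS_READ)
        print(struct.unpack_from("if?", buf, 0))   # (1, 2.0, True)
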
  7. How do I get objects into this thing? Slightly more complex
     • Define a schema that is:
       – Easily manipulated without serialization
       – Efficient in space and access time
     • Build the machinery that allows accessing it...
       – …as if it were an object
       – …without copying it to process-private memory

  8. How do I get objects into this thing? Slightly more complex
     • Define a schema that is: (structs)
       – Easily manipulated without serialization
       – Efficient in space and access time
     • Build the machinery that allows accessing it... (proxies)
       – …as if it were an object
       – …without copying it to process-private memory

  9. Structs

     In C:
       struct { int a; float b; bool c; }

     In Python:
       import struct
       struct.pack("if?", 1, 2.0, True)

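A small follow-on sketch for slide 9 (field names a/b/c mirror the C struct; native sizes and alignment are assumed): once the bytes sit in a buffer, individual fields can be read back by offset without rebuilding a Python object graph.

    import struct

    FMT = "if?"                          # int a, float b, bool c (native layout)
    record = struct.pack(FMT, 1, 2.0, True)

    a, b, c = struct.unpack_from(FMT, record, 0)    # read straight from the bytes
    record_size = struct.calcsize(FMT)              # how much room one record takes
    b_only = struct.unpack_from("f", record, 4)[0]  # or read a single field at its offset
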
  10. Structs
      • Why on earth get C into this?
        – Native machine code can access struct elements natively
        – Widely portable (almost every language can parse C structs in some way or another)
        – Cython

  11. Proxies
      • Classes that know where a struct lies within a buffer
      • They convert attribute access to struct access:

        x = Proxy(buf, offset=10)
        x.a  # reads the int
        x.b  # reads the float
        x.c  # reads the bool

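A minimal sketch of such a proxy for the "if?" layout used earlier. The class, its _fields table and the attribute names are illustrative assumptions, not the API shown in the talk.

    import struct

    class Proxy:
        """Decodes fields of an 'if?' record straight out of a (shared) buffer."""
        _fields = {"a": ("i", 0), "b": ("f", 4), "c": ("?", 8)}  # name -> (format, offset)

        def __init__(self, buf, offset=0):
            self.buf = buf
            self.offset = offset

        def __getattr__(self, name):
            try:
                fmt, rel = self._fields[name]
            except KeyError:
                raise AttributeError(name) from None
            return struct.unpack_from(fmt, self.buf, self.offset + rel)[0]

Attribute reads go straight to the underlying buffer, so nothing beyond small Python scalars is copied into process-private memory.
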
  12. Proxies • Don’t require serialization – It’s enough to know

    where the struct is (ie, have a pointer) • The can easily be “repointed” – Change the offset to switch the proxy to another object – Avoids python object creation overhead • Relativley transparent – They look quite like the original object – They can even quack like the original as well 9
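Continuing the hypothetical Proxy sketch above, repointing looks roughly like this: a single proxy instance walks every fixed-size record, avoiding a Python object allocation per record.

    import struct

    FMT = "if?"
    RECORD_SIZE = struct.calcsize(FMT)

    # A tiny stand-in for the mmap'ed file: three consecutive records.
    buf = b"".join(struct.pack(FMT, i, i * 1.5, bool(i % 2)) for i in range(3))

    p = Proxy(buf)                    # one proxy, reused
    for i in range(3):
        p.offset = i * RECORD_SIZE    # repoint instead of constructing a new proxy
        print(p.a, p.b, p.c)
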
  13. Proxies – adding complexity

      struct ComplexProxy {
          int value;
          int child_left_offset;
          int child_right_offset;
      }

      class ComplexObj:
          def __init__(self, l=None, r=None):
              self.value = 3
              self.left = l
              self.right = r

  14. Proxies – adding complexity

      import struct

      class IntProperty:
          def __init__(self, offset):
              self.offset = offset
          def __get__(self, obj, kls):
              return struct.unpack_from("i", obj.buf, obj.pos + self.offset)[0]

      class ProxyProperty:
          def __init__(self, proxy_class, offset):
              self.proxy_class = proxy_class
              self.offset = offset
          def __get__(self, obj, kls):
              # The stored int is the child's offset within the same buffer
              voffset = struct.unpack_from("i", obj.buf, obj.pos + self.offset)[0]
              return self.proxy_class(obj.buf, voffset)

      class ComplexProxy:
          def __init__(self, buf, pos):
              self.buf = buf
              self.pos = pos

          a = IntProperty(offset=0)

      # Assigned after the class body so ComplexProxy can refer to itself
      ComplexProxy.b = ProxyProperty(ComplexProxy, offset=4)
      ComplexProxy.c = ProxyProperty(ComplexProxy, offset=8)

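A throw-away usage sketch for the descriptors in slide 14, assuming a 12-byte "iii" node (value plus two child offsets): the buffer holds a parent at offset 0 and its children at offsets 12 and 24, and the proxies follow those offsets lazily.

    import struct

    NODE = "iii"                        # value, left child offset, right child offset
    buf = b"".join([
        struct.pack(NODE, 3, 12, 24),   # parent at offset 0
        struct.pack(NODE, 7, 0, 0),     # left child at offset 12
        struct.pack(NODE, 9, 0, 0),     # right child at offset 24
    ])

    root = ComplexProxy(buf, 0)
    print(root.a)      # 3
    print(root.b.a)    # 7, following the left offset into the same buffer
    print(root.c.a)    # 9
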
  15. Proxies – cyclic references – OOPS!
      • It gets tricky when you add cyclic references
        – They need to be recognized when building the buffer
        – They require care, as always
      • A few options available:
        – Forbid them
        – Allow them

  16. Proxies – cyclic references – OOPS! Identity maps
      • id(object) → offset
      • When an object is packed, update the identity map
        – Check it also to detect already-packed objects
      • Compresses the file
        – Unifies repeated references to the same object
      • Breaks cycles

  17. Proxies – cyclic references – OOPS! Identity maps
      • Tricky points
        – If the buffer is built by iterating a generator, you will probably get different objects with the same id()
          • The identity map has to be synchronized with the lifetime of in-memory objects at all times. If an object is destroyed, its entry on the identity map must be removed as well.
        – The identity map can get quite big
          • In particular when packing millions of objects into large buffers

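A condensed sketch of the identity map from slides 16-17, assuming tree nodes shaped like the ComplexObj of slide 13; pack_tree and the "iii" node layout are illustrative, not the real builder.

    import struct

    def pack_tree(node, out, identity_map):
        """Append node to bytearray `out`; identity_map maps id(obj) -> (offset, obj).

        Keeping the object itself in the map pins it in memory, so its id()
        cannot be recycled for a different object while the buffer is built.
        """
        key = id(node)
        if key in identity_map:
            return identity_map[key][0]     # already packed: reuse its offset (breaks cycles)

        offset = len(out)
        identity_map[key] = (offset, node)  # register before recursing into children
        out += struct.pack("iii", node.value, 0, 0)   # 0 doubles as "no child" in this toy
        for slot, child in ((4, node.left), (8, node.right)):
            if child is not None:
                struct.pack_into("i", out, offset + slot, pack_tree(child, out, identity_map))
        return offset
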
  18. Manipulation without serialization
      • Building a buffer is expensive
        – Kinda like serializing, sure
      • But... using it isn’t
        – Open
        – Read
        – Search
        – Even write (up to a point)

  19. Manipulation without serialization – Structure of an object
      Attribute bitmap: present 11010000, nulls 11010000
       a : 4 bytes : int
       b : 4 bytes : float
      *c : 8 bytes : uint

  20. Manipulation without serialization – Structure of an object
      Attribute bitmap: present 11010000, nulls 11010000
       a : 4 bytes : int
       b : 4 bytes : float
      *c : 8 bytes : uint
       c : 12 bytes : str

  21. Manipulation without serialization – Nesting objects
      Outer object:
        Attribute bitmap: present 11010000, nulls 11010000
         a : 4 bytes : int
         b : 4 bytes : float
        *c : 8 bytes : uint
         c : N bytes : object
      Nested object (same layout):
        Attribute bitmap: present 11010000, nulls 11010000
         a : 4 bytes : int
         b : 4 bytes : float
        *c : 8 bytes : uint
         c : N bytes : object

  22. Manipulation without serialization – Dynamic typing
      Attribute bitmap: present 11010000, nulls 11010000
       a : 4 bytes : int
       b : 4 bytes : float
      *c : 8 bytes : uint
       c : N bytes : any
      Boxed “any” value:
        typecode : 4 bytes : int
        value    : 8 bytes : double

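A rough sketch of how the presence/nulls bitmaps of slides 19-22 might be consulted when reading a record. The real format's bit assignment is not spelled out on the slides, so the field order, widths and null handling here are assumptions.

    import struct

    FIELDS = (("a", "i"), ("b", "f"), ("c", "Q"))   # declaration order fixes each field's bit

    def read_record(buf, offset):
        """Decode a toy record: 1-byte presence bitmap, 1-byte nulls bitmap, then payloads."""
        present, nulls = struct.unpack_from("BB", buf, offset)
        pos = offset + 2
        out = {}
        for bit, (name, fmt) in enumerate(FIELDS):
            mask = 1 << bit
            if not (present & mask):
                continue                   # absent attribute: takes no space at all
            if nulls & mask:
                out[name] = None           # present but null: no payload either (assumed)
                continue
            out[name] = struct.unpack_from(fmt, buf, pos)[0]
            pos += struct.calcsize(fmt)
        return out
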
  23. Manipulation without serialization – Writing
      [Diagram: an Index of pointers (*i1 *i2 *i3 *i4) into a Data region holding v1 : 4b, v2 : 4b, v3 : 10b, v4 : 40b]

  24. Manipulation without serialization – Writing
      [Diagram: an Index of pointers (*i1 *i2 *i3 *i4) into a Data region holding v1 : 4b, v2 : 4b, v3 : 10b, v4 : 40b]

  25. Associative maps
      • Compact hash table:
        – Sorted array of tuples <hash, key, value>
        – Binary search optimized for uniform distributions
          • One prediction given the known key distribution (hash)
          • One iteration of exponential search to adjust the prediction
          • Finalize with a regular binary search
      • Approximate hash table:
        – Throw away the key, accept hash collisions as an acceptable error
        – Particularly efficient with long string keys

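A sketch of the lookup described in slide 25, over a sorted list of (hash, key, value) tuples; the interpolation step and the bracketing details are assumptions about the approach rather than the talk's exact code.

    import bisect

    def lookup(table, key, hash_fn):
        """table: list of (h, key, value) tuples sorted by h, with h roughly uniform.

        hash_fn must be deterministic across processes (Python's built-in hash()
        for str is randomized per process, so a persisted buffer needs something
        stable, e.g. a digest from hashlib).
        """
        if not table:
            raise KeyError(key)
        h = hash_fn(key)
        n = len(table)
        lo_h, hi_h = table[0][0], table[-1][0]

        # 1. Predict a position assuming hashes are uniformly spread over [lo_h, hi_h].
        if hi_h > lo_h:
            pos = min(max(int((h - lo_h) * (n - 1) / (hi_h - lo_h)), 0), n - 1)
        else:
            pos = 0

        # 2. One exponential search outward from the prediction to bracket the error.
        lo = hi = pos
        step = 1
        while lo > 0 and table[lo][0] >= h:
            lo = max(lo - step, 0)
            step *= 2
        step = 1
        while hi < n - 1 and table[hi][0] <= h:
            hi = min(hi + step, n - 1)
            step *= 2

        # 3. Finish with a regular binary search inside the bracket, then scan equal hashes.
        i = bisect.bisect_left(table, (h,), lo, hi + 1)
        while i < n and table[i][0] == h:
            if table[i][1] == key:
                return table[i][2]
            i += 1
        raise KeyError(key)

With a good prediction the bracket stays tiny, which is where the 2·log(ε) cost quoted on slide 29 comes from.
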
  26. Associative maps
      [Diagram: Index of entries (*k1 *k2 *k3 *k4, h1 h2 h3 h4, *v1 *v2 *v3 *v4) pointing into a Keys region (e.g. “pedro”) and a Values region (e.g. 2324, 4141)]

  27. Associative maps
      [Diagram: Index entries alice:1, bob:7, cloe:7, pedro:15 with value pointers *v1 *v2 *v3 *v4; Keys region (e.g. “pedro”); Values region (e.g. 2324, 4141)]
      m['bob'] == v2
      m['cloe'] == v3

  28. Approximate associative maps
      [Diagram: Index of hashes only (1, 7, 7, 15) with value pointers *v1 *v2 *v3 *v4; Values region (e.g. 2324, 4141); no Keys region]
      m['bob'] == m['cloe'] == [v2, v3]

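A sketch of the approximate variant from slides 27-28: the keys are dropped from the table, so colliding hashes (bob and cloe both hash to 7 in the slide's example) return every candidate value. The class and the toy hash are made up for illustration.

    import bisect

    class ApproxMap:
        """Sorted (hash, value) pairs; a lookup may return extra values on hash collisions."""
        def __init__(self, items, hash_fn):
            self.hash_fn = hash_fn
            self.table = sorted(((hash_fn(k), v) for k, v in items), key=lambda hv: hv[0])

        def __getitem__(self, key):
            h = self.hash_fn(key)
            i = bisect.bisect_left(self.table, (h,))
            candidates = []
            while i < len(self.table) and self.table[i][0] == h:
                candidates.append(self.table[i][1])
                i += 1
            if not candidates:
                raise KeyError(key)
            return candidates

    # Toy hash reproducing the collision on the slide: bob and cloe both map to 7.
    toy_hash = {"alice": 1, "bob": 7, "cloe": 7, "pedro": 15}.get
    m = ApproxMap([("alice", "v1"), ("bob", "v2"), ("cloe", "v3"), ("pedro", "v4")], toy_hash)
    print(m["bob"])    # ['v2', 'v3']: both candidates, since the key itself was never stored
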
  29. Speed
      • Performance:
        – Only the "hot" data set (most used) needs to fit in RAM
        – Optimized search in 2·log(ε)
          • ε being the error between prediction and actual position
          • ε < n
      • Approximate hash table:
        – Fixed size even with big keys (long strings)
        – Even more efficient access (no need to verify and store keys)

  30. Speed
      • Performance:
        – Good disk access pattern even if it won’t fit in RAM:
          • Exponential search is mostly sequential access
          • Good locality with good predictions – O(1) seeks on average
        – Possibility to preload the index to RAM
          • Much more likely to fit than values or keys

  31. Speed
      • Cython magic:
        – Instead of using struct everywhere
        – Avoids building Python objects for temporary operations
      • Proxy reuse:
        – Instead of building new proxies, repoint a reusable one
        – Type transmutation to change the shape of a proxy
          • proxy.__class__ = new_cls

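A small sketch of the transmutation trick on the last slide, using made-up proxy classes in the spirit of the earlier ones: assigning __class__ changes which layout a live proxy decodes, and repointing reuses the same object, so no new Python object is allocated.

    import struct

    class BaseProxy:
        """Shared plumbing: a buffer plus an offset into it."""
        def __init__(self, buf, offset=0):
            self.buf = buf
            self.offset = offset

    class PointProxy(BaseProxy):
        x = property(lambda self: struct.unpack_from("i", self.buf, self.offset)[0])
        y = property(lambda self: struct.unpack_from("i", self.buf, self.offset + 4)[0])

    class MeasurementProxy(BaseProxy):
        value = property(lambda self: struct.unpack_from("d", self.buf, self.offset)[0])

    buf = struct.pack("ii", 10, 20) + struct.pack("d", 3.5)

    p = PointProxy(buf, 0)
    print(p.x, p.y)                  # 10 20

    # Transmute: same object, new shape; cheaper than constructing another proxy.
    p.__class__ = MeasurementProxy
    p.offset = 8                     # repoint to where the double lives
    print(p.value)                   # 3.5
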