[Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model: CPU, main memory, mechanical disk; (b) the most common current model, adding Level 1 and Level 2 caches; (c) the model expected for the 2010’s, adding a Level 3 cache and a solid state disk. Speed increases toward the CPU, capacity toward the disk.]
• Compression levels: from 0 to 9
• Codecs: “blosclz”, “lz4”, “lz4hc”, “snappy”, “zlib” and “zstd”
• Different filters: “shuffle” and “bitshuffle”
• Number of threads
• Block sizes (the chunk is split into blocks internally)
A nice opportunity for fine tuning for a specific setup! (See the sketch below.)
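A minimal sketch of these knobs using the python-blosc package (the data and parameter values are just placeholders):

# Sketch: exercising the Blosc tuning knobs listed above with python-blosc.
import numpy as np
import blosc

data = np.linspace(0, 100, 1_000_000).tobytes()   # some compressible binary data

blosc.set_nthreads(4)          # number of threads
blosc.set_blocksize(2 ** 18)   # internal block size (0 = automatic)

packed = blosc.compress(data,
                        typesize=8,            # item size, used by the filters
                        clevel=5,              # compression level: 0-9
                        cname='zstd',          # codec: blosclz, lz4, lz4hc, snappy, zlib, zstd
                        shuffle=blosc.SHUFFLE) # filter: NOSHUFFLE, SHUFFLE or BITSHUFFLE

print(len(data) / len(packed))  # compression ratio for this parameter combination
assert blosc.decompress(packed) == data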
The gRPC buffers are compressed with Blosc after being filled up, for improved memory consumption and faster transmission.
[Diagram: Event Field 1 … Event Field N flow through thread-safe queues into Compressed Field 1 … Compressed Field N, which are sent as Blosc-compressed gRPC buffers over gRPC streams.]
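A minimal sketch of the buffering idea, with queue.Queue and a toy event loop standing in for the actual thread-safe queues, protobuf messages and gRPC streams (which are not shown in the slides):

# Sketch: events are accumulated per field; each buffer is Blosc-compressed
# once it is full, then handed to the sender thread via a thread-safe queue.
# The real system ships the compressed buffers as gRPC stream messages.
import queue
import blosc

BUFFER_EVENTS = 10_000
send_queue = queue.Queue()        # thread-safe queue consumed by the gRPC sender

buffer = bytearray()
for event in range(BUFFER_EVENTS):
    buffer += event.to_bytes(8, 'little')   # stand-in for one serialized event field

# compress the filled buffer and enqueue it for transmission
compressed = blosc.compress(bytes(buffer), typesize=8, clevel=5, cname='lz4')
send_queue.put(compressed)
print(f"{len(buffer)} -> {len(compressed)} bytes")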
Compression delivered a ~5x reduction in both transmission and storage.
• It was used throughout the whole project:
  • In gRPC buffers, for improved memory consumption in the Repeater queues and faster transmission
  • In HDF5, so as to greatly reduce disk usage and ingestion time
• The system was able to ingest more than 500K messages/sec (~650K messages/sec in our setup, a single machine with >16 physical cores). Not possible without compression!
• Created back in 1998 at NCSA (with the support of NASA)
• Great adoption in many fields, including science, engineering and finance
• Maintained by The HDF Group, a non-profit corporation
• Two major Python wrappers: h5py and PyTables (see the sketch below)
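As an illustration (not taken from the slides), a minimal PyTables sketch that writes a Blosc-compressed array into an HDF5 file; the file name, node name and data are made up:

# Sketch: write a chunked, Blosc-compressed array to HDF5 with PyTables.
import numpy as np
import tables

data = np.arange(10_000_000, dtype=np.int64)
filters = tables.Filters(complevel=5, complib='blosc:zstd', shuffle=True)

with tables.open_file('events.h5', mode='w') as f:
    f.create_carray(f.root, 'data', obj=data, filters=filters)

# h5py exposes equivalent functionality through its compression filters.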
• Data arranged column-wise: better performance for big tables, as well as for improving the compression ratio.
• Efficient shrinks and appends: you can shrink or append more data at the end of the objects very efficiently. (Sketch below.)
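A hedged sketch, assuming the container being described here is bcolz’s carray/ctable (the slide text does not name the library at this point); the data and parameters are illustrative:

# Sketch assuming a bcolz-style chunked container: a columnar ctable plus
# cheap appends/shrinks at the end of a compressed carray.
import numpy as np
import bcolz

a = np.arange(1_000_000)
ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=5, cname='lz4'))

ca.append(np.arange(10))      # append more data at the end, very cheap
ca.resize(len(ca) - 10)       # shrink back, also cheap

# column-wise table: each column is its own compressed carray
ct = bcolz.ctable(columns=[a, a * 0.5], names=['id', 'value'])
print(ct)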
• Store arrays of any NumPy dtype.
• Chunk arrays along any dimension.
• Compress chunks using Blosc or, alternatively, zlib, BZ2 or LZMA.
• Created by Alistair Miles from the MRC Centre for Genomics and Global Health for handling genomic data in-memory.
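A minimal sketch using the zarr 2.x API together with the Blosc codec from numcodecs; the shapes, chunk sizes and parameters are placeholders:

# Sketch: a chunked, Blosc-compressed zarr array of any NumPy dtype.
import numpy as np
import zarr
from numcodecs import Blosc

data = np.random.randint(0, 1000, size=(10_000, 1_000), dtype=np.int32)
z = zarr.array(data,
               chunks=(1_000, 1_000),                    # chunk along any dimension
               compressor=Blosc(cname='zstd', clevel=5,
                                shuffle=Blosc.SHUFFLE))  # or a zlib/BZ2/LZMA codec
print(z.info)   # shows chunk shape, codec and compression ratio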
• Compression levels: from 0 to 9
• Codecs: “blosclz”, “lz4”, “lz4hc”, “snappy”, “zlib” and “zstd”
• Different filters: “shuffle” and “bitshuffle”
• Number of threads
• Block sizes (the chunk is split into blocks internally)
Question: how to choose the best candidates for maximum speed? For maximum compression? Or for the right balance?
• The user specifies what she prefers:
  • Maximum compression ratio
  • Maximum compression speed
  • Maximum decompression speed
  • A balance between all of the above
• Based on that, and on the characteristics of the data to be compressed, the training step gives hints on the optimal Blosc parameters to be used on new datasets. (A hypothetical sketch of such a search follows.)
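The actual training procedure is not detailed in the slides; the following hypothetical brute-force sketch simply times a few codec/level combinations on a representative chunk and ranks them by the chosen preference:

# Hypothetical "training" sketch: measure ratio and speed for codec/level
# combinations on a sample chunk (not the actual tuning algorithm).
import time
import numpy as np
import blosc

sample = np.linspace(0, 1, 2_000_000).tobytes()   # representative chunk

def score(cname, clevel):
    t0 = time.perf_counter()
    packed = blosc.compress(sample, typesize=8, clevel=clevel,
                            cname=cname, shuffle=blosc.SHUFFLE)
    c_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    blosc.decompress(packed)
    d_time = time.perf_counter() - t0
    return len(sample) / len(packed), c_time, d_time

results = {(c, l): score(c, l)
           for c in ('blosclz', 'lz4', 'lz4hc', 'zlib', 'zstd')
           for l in (1, 5, 9)}

# e.g. user preference = maximum compression ratio
best = max(results, key=lambda k: results[k][0])
print("best for ratio:", best, results[best])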
Given the evolution in computer architecture, compression can be effective for two reasons:
• We can work with more data using the same resources.
• We can reduce the overhead of compression to near zero, and even go beyond that!
We are definitely entering an age where compression will be used much more ubiquitously.
• Not many data libraries focus on chunked data containers nowadays.
• No silver bullet: we won’t be able to find a single container that makes everybody happy; it’s all about tradeoffs.
• With chunked containers we can use persistent media (disk) as if it were ephemeral (memory), and the other way around -> independence of the media!