Sysadmindays #9: MyRocks at Synthesio !

MyRocks, a RocksDB storage engine with MySQL All your data
are belong to us SysadminDays #9 - 18-19 November 2019 - Paris

Whoami Romain Bouvier SRE Team @synthesio 12 years being paid
for rebooting things Xfce developper @_skunnyk_

Synthesio

Cold Storage at Synthesio “Doc” cluster: keep all crawled documents
since 10 years - Around 60 millions new documents by days, dozens of billions of documents stored 4 MySQL clusters, around 50TB of data (without replication): - One cluster for “new” mentions (last ~2 months), on SSD disks, with compressed InnoDB (ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8 ) - 3 clusters for oldest data with spinning disks (slooooow, but lots of space), with MyRocks Engine - New MySQL tables every ~40 millions documents - More than 3500 tables - Tables are automagically oﬄoaded from InnoDB to RocksDB every days.

Cold Storage Architecture docoffload1 docoffload2 docoffload3 Doc (SSD) ~2 months
Crawling Processing chain (ES, Kafka, ScyllaDB, dashboards …) docoffloadXX ~1year Offload process read/write

What is MyRocks - MyRocks is a storage engine that
converts your MySQL requests into requests for the RocksDB engine - Created and maintained by Facebook, based on MySQL 5.6. They use it for “UDB” (User DataBase), their social graph, Messenger - Supported by Percona and MariaDB - Beneﬁt from (almost) all the features of MySQL while using RocksDB as backend storage (you don’t need to change your application) - MyRocks focuses on space and write eﬃciency - MyRocks can be 4x smaller than compressed InnoDB

What is RocksDB ? - An embeddable persistent key-value store,
fork of LevelDB - High Performance, uses a log structured database engine (LSM tree) - Optimized for flash drives and SSD - Write data in “sstables”. Immutable data, once written (sequentially), no update-in-place is done. - New data are written sorted, in new files, read operations can read multiples sstables. - Compactions are made in background to merge sstables in new one and reduce read amplification (but with write amplification) - Features: Secondary indexes, bloom filtering, snapshots… - Created and maintained by Facebook, used in dozen of projects: TiDB, YugaByte, Apache Samza, CockroachDB, TiKV …

Log-structured merge-tree (LSM Tree) - Data structure with performance characteristics
that make it attractive for providing indexed access to files with high insert volum - Designed to provide better write throughput than traditional B+ tree - Sequential writes. No need to “read before write” - Write immediately (in-memory), do the heavy work “sometime” - later - Reads are more costly than writes (but bloom filter helps !) - Compactions are used to reorder/merge/delete data, and spread sstables in “levels” (in leveled implementations) - Concept used in BigTables, HBase, Cassandra/ScyllaDB, RocksDB, InfluxDB, even in Prometheus

Levels ? Compactions ? Source: https://docs.scylladb.com/kb/compaction/

Compression ? - Your data should be in last LSM
level. So use a “fast” compression for all level, except the bottommost, where you can use a slower but better one, like kZSTD. - compression=kLZ4Compression;bottommost_compression=kZSTD

Bloom ﬁlters ? - Bloom ﬁlter: - Test whether an
element is a member of a set. - False positives are possible, but false negatives are not ! - It can determines if a data is in a sstable or not - It helps to limit the number of read

ACID with MyRocks - rocksdb_flush_log_at_trx_commit : Speciﬁes whether to sync
on every transaction commit, similar to innodb_flush_log_at_trx_commit . - We use rocksdb_flush_log_at_trx_commit : do not sync on transaction commit. This provides better performance, but may lead to data inconsistency in case of a crash. - We have replica and failover - Big Data means you can lose a little bit of data and nobody will notice, right ?

“Limitations“ of MyRocks - No online DDL (Data Deﬁnition Language),
but fast index creation! - No foreign key or fulltext index support - No Transportable tablespace - Replication exclusively row-based. - No easy “hot backup” solution - ﬁxed with percona 8 and mariadb ! Not really a limitation: - Must have a primary key on table

InnoDB vs MyRocks - InnoDB is the default MySQL storage
engine - InnoDB favor read at cost of write and space - MyRocks favor write and space at cost of read - Both are ACID compliant - Compressed InnoDB is roughly 2x smaller than uncompressed InnoDB - MyRocks can be 4x smaller to compressed InnoDB ! - Decompression cost on read - No Galera/XtraCluster support for MyRocks

RocksDB Write Architecture Source: https://blog.pythian.com/exposing-myrocks-internals-via-system-variables-part-1-data-writing/

RocksDB Read Architecture Source: https://blog.pythian.com/exposing-myrocks-internals-via-system-variables-part-5-data-reads/

Delete: - Data in sstables are immutable. So when you
do a delete, you *write* a new entry saying ‘this data does not exist anymore’ (this is called tombstone) - Then on next read, the engine will read the data, the tombstone and say “meh, the data is in fact deleted” - On next round of compaction, tombstone and data will merge, to really drop the data. - RocksDB have some mechanisms to avoid theses steps when possible (when you remove a key, named SingleDelete) RocksDB Delete Architecture

How it works with MySQL Data are located in $datadir/.rocksdb
folder (for Percona and Facebook ﬂavors) MyRocks is a loadable MySQL module : INSTALL PLUGIN ROCKSDB SONAME 'ha_rocksdb.so'; INSTALL PLUGIN ROCKSDB_CFSTATS SONAME 'ha_rocksdb.so'; [...] [mysqld] rocksdb

Change RocksDB options using MyRocks rocksdb_default_cf_options="block_based_table_factory={cache_index_and_filter_bloc ks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dy namic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappi ngRatio;compression=kLZ4Compression;bottommost_compression=kZSTD;target_file_size_ base=256m;"

Read the code ! (and the doc)

Create a table create table `myrocks`( ìd` bigint(22) unsigned NOT
NULL AUTO_INCREMENT, ìdTopic` smallint(3) unsigned NOT NULL DEFAULT '0', ìdThread` varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL, ùsername` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, PRIMARY KEY (ìd`) COMMENT 'cf_id_pk', KEY ìd1_topic` (ìd1`,ìdTopic`) COMMENT 'rev:cf_topic' ) ENGINE=ROCKSDB

Comment section RocksDB uses Column Family (“”tables””) to store its
data. PRIMARY KEY (ìd`) COMMENT 'cf_id_pk' means the primary key will be stored in cf_id_pk column family (auto created) KEY ìd1_topic` (ìd1`,ìdTopic`) COMMENT 'rev:cf_topic' Secondary index. Reference the primary index internally. In ASC order by default. Set “rev” to force a DESC order (based on your needs). Each column family have its own config, resources, data files, memory caches. At Synthesio we only use 1 CF because all our MyRocks tables use the same and needs the same options

Data Loading - By default, MyRocks conﬁgurations are optimized for
short transactions, and not for data loading. MyRocks has a couple of special session variables to speed up data loading dramatically. SET session rocksdb_bulk_load=1; - This will force all data to be inserted directly to the bottommost level ! Your data need to be ordered by primary key before insertion (already the case if you have a primary key) - Remove secondary indexes from the schema, re-add them after loading !

Some MyRocks statistics

Some MyRocks metrics

Snapshots - SST ﬁles are Immutables data - Create a
snapshot (hard link), and you are done ! > set rocksdb_create_checkpoint=mysnapshot; This will create a snapshot in $datadir/.rocksdb/mysnapshot/ This is what is used by xtrabackup/mariabackup (with extra step for the global coherence with InnoDB/MyISAM tables in mysql database)

Backup ? - Facebook provides myrocks_hotbackup to create physical hotbackup
- Can’t be used with our Percona ﬂavor because MySQL 5.7 uses innodb for some mysql system tables - MariaDB provides mariabackup (a xtrabackup fork) with myrocks hotbackup support. It uses snapshot feature for the rocksdb part - xtrabackup only support myrocks hot backup starting version 8.0 (and will not backport it to 5.7 branch) - We are using Percona Server 5.7, so, what to do ? Take mariabackup’s MyRocks patch, and backport it to xtrabackup 2.4 ! Surprise, it worked ! Not perfectly, but enough to create a new replica from another instance.

Why choose MyRocks for Synthesio - Heavy use of MySQL
- Legacy codebase with old PHP versions which only speaks MySQL - It’s MySQL ! So no need to change our application ! We know how to operate and deploy it - Once in the “oﬄoad” process, we almost never update the data - BUT, when we need to update backdata, we may need to update millions of mentions (for compliance purpose for example) and it takes ages. So a better write performance is a good deal - Space costs money (remember, TB of data)

- Mysqldump the schema with --no-data - drop secondary indexes
from the schema - s/ENGINE=InnoDB/ENGINE=ROCKSDB/ - Reinject to the MyRocks servers , with rocksdb_bulk_load=1 - Add back secondary indexes with alter table - Update the map for the application - Do this ~ 3500 times during 4 months - ??? - Proﬁt ! Migration process

MyRocks at Synthesio - Used on cold storage - Around
30% of disk space saved ! - From 60TB - To 39TB - Better read/writes performances than InnoDB from what we saw - From 150 updates/seconds - To 600 updates/seconds - Used on spinning disks (!) - Replication is done via traditional MySQL way, so we know how it works - Hot Backup is done with our own fork of Percona Xtrabackup (with patches based on mariabackup) - Tons of rocksdb options, that’s why we need to read the rocksdb source code (and comments) to really understand them (well, not all… yet !)

Today - Work in production since 6 months, no problems
so far - Creation of a new server with the patched xtrabackup works - Next step: upgrade to Percona Server 8.0 to beneﬁt from native xtrabackup support and performance gains.

Links - https://myrocks.io / https://rocksdb.org - https://www.percona.com/doc/percona-server/LATEST/myrocks/index.html - https://www.percona.com/sites/default/ﬁles/presentations/Webinar-2019- MyRocks.pdf
- https://www.slideshare.net/matsunobu/myrocks-deep-dive - https://www.percona.com/blog/2018/02/01/myrocks-engine-things-know-s tart - https://blog.pythian.com/tag/myrocks/ - http://smalldatum.blogspot.com/search/label/rocksdb - https://engineering.fb.com/core-data/migrating-a-database-from-innodb-t o-myrocks/

What are your questions ?

Sysadmindays #9: MyRocks at Synthesio !

Sysadmindays #9: MyRocks at Synthesio !

Skunnyk

Other Decks in Technology

Featured

Transcript

MyRocks, a RocksDB storage engine with MySQL All your data

Whoami Romain Bouvier SRE Team @synthesio 12 years being paid

Synthesio

Synthesio

Cold Storage at Synthesio “Doc” cluster: keep all crawled documents

Cold Storage Architecture docoffload1 docoffload2 docoffload3 Doc (SSD) ~2 months

What is MyRocks - MyRocks is a storage engine that

What is RocksDB ? - An embeddable persistent key-value store,

Log-structured merge-tree (LSM Tree) - Data structure with performance characteristics

Levels ? Compactions ? Source: https://docs.scylladb.com/kb/compaction/

Compression ? - Your data should be in last LSM

Bloom ﬁlters ? - Bloom ﬁlter: - Test whether an

ACID with MyRocks - rocksdb_flush_log_at_trx_commit : Speciﬁes whether to sync

“Limitations“ of MyRocks - No online DDL (Data Deﬁnition Language),

InnoDB vs MyRocks - InnoDB is the default MySQL storage

RocksDB Write Architecture Source: https://blog.pythian.com/exposing-myrocks-internals-via-system-variables-part-1-data-writing/

RocksDB Read Architecture Source: https://blog.pythian.com/exposing-myrocks-internals-via-system-variables-part-5-data-reads/

Delete: - Data in sstables are immutable. So when you

How it works with MySQL Data are located in $datadir/.rocksdb

Read the code ! (and the doc)

Create a table create table `myrocks`( `id` bigint(22) unsigned NOT

Comment section RocksDB uses Column Family (“”tables””) to store its

Data Loading - By default, MyRocks conﬁgurations are optimized for

Some MyRocks statistics

Some MyRocks metrics

Snapshots - SST ﬁles are Immutables data - Create a

Backup ? - Facebook provides myrocks_hotbackup to create physical hotbackup

Why choose MyRocks for Synthesio - Heavy use of MySQL

- Mysqldump the schema with --no-data - drop secondary indexes

MyRocks at Synthesio - Used on cold storage - Around

Today - Work in production since 6 months, no problems

Links - https://myrocks.io / https://rocksdb.org - https://www.percona.com/doc/percona-server/LATEST/myrocks/index.html - https://www.percona.com/sites/default/ﬁles/presentations/Webinar-2019- MyRocks.pdf

What are your questions ?