JUGNsk Meetup #18. Andrey Gura: "Apache Ignite Distributed Database Storage Architecture: Decisions and Tradeoffs"

Apache Ignite Distributed Database Storage Architecture: Decisions and Tradeoffs
Alexey Goncharuk, Andrey Gura

2021 © GridGain Systems

Target audience
• Apache Ignite users and developers
• Persistence data structure enthusiasts
• Distributed systems and database enthusiasts

Apache Ignite overview
• Distributed database for high-performance computing with in-memory speed
  ◦ Distributed SQL
  ◦ Multi-tier storage
  ◦ Co-located compute
  ◦ Transactions
  ◦ Machine learning
  ◦ Continuous queries

Moving from memory to disk
• Initial goals:
  ◦ Fast restarts, serving SQL without data preloading
  ◦ Store more data than available RAM
• Restriction: a single architecture for volatile and persistent caches

Generic approach: ARIES
• Supports arbitrary data structures
• Recoverable durability
• Transactional
https://dl.acm.org/doi/10.1145/128765.128770

Generic approach: ARIES
• On-disk space is split into blocks of equal size (pages)
• Each page has a unique identifier that locates the page on disk

Generic approach: ARIES
• Pages are loaded into in-memory page buffers
• There are fewer page buffers than disk pages
  ◦ Need a mapping from page ID to buffer address
• Page buffer content can be replaced by the content of a different page on demand
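
The buffer mapping and on-demand replacement described above can be sketched in Java as follows. This is a minimal illustration, not Ignite's implementation: `PageReader` and `PageBufferPool` are hypothetical names, least-recently-used replacement is an assumed policy, and the sketch ignores dirty pages, which would have to be written back before their buffer is reused.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for reading a page from disk into a buffer. */
interface PageReader {
    void read(long pageId, ByteBuffer dst);
}

/** A fixed number of in-memory buffers serving a larger set of disk pages. */
final class PageBufferPool {
    private final int capacity;
    private final int pageSize;
    private final PageReader reader;
    // Access-ordered map: page ID -> buffer. The first entry is the least recently used one.
    private final LinkedHashMap<Long, ByteBuffer> table = new LinkedHashMap<>(16, 0.75f, true);
    int loads; // number of simulated disk reads, exposed for illustration

    PageBufferPool(int capacity, int pageSize, PageReader reader) {
        this.capacity = capacity;
        this.pageSize = pageSize;
        this.reader = reader;
    }

    /** Returns the buffer holding the page, loading it and replacing another page if needed. */
    ByteBuffer acquire(long pageId) {
        ByteBuffer buf = table.get(pageId);
        if (buf != null)
            return buf;

        if (table.size() >= capacity) {
            // Reuse the buffer of the least recently used page (assumed policy).
            Map.Entry<Long, ByteBuffer> eldest = table.entrySet().iterator().next();
            buf = eldest.getValue();
            table.remove(eldest.getKey());
            buf.clear();
        } else {
            buf = ByteBuffer.allocateDirect(pageSize);
        }

        reader.read(pageId, buf);
        loads++;
        table.put(pageId, buf);
        return buf;
    }

    boolean isResident(long pageId) {
        return table.containsKey(pageId);
    }
}
```

With a capacity of two buffers, acquiring pages 1, 2, 1 and then 3 replaces page 2, because page 1 was touched more recently.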

Generic approach: ARIES
• All data structures are page-organized

Generic approach: ARIES
• All changes are logged to a single journal (Write-Ahead Log)
  ◦ Logical records (page-independent)
  ◦ Physical records (single-page change)

Generic approach: ARIES
• Modification protocol:
  ◦ Update acquires a shared modification lock
  ◦ Logical record is written
  ◦ Pages are updated (with physical records)
  ◦ Shared lock is released
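
The four steps of the modification protocol can be sketched in Java as follows. This is an illustration under assumptions, not Ignite code: `UpdateProtocolSketch` is a hypothetical name, the log is an in-memory list of strings, each page is a single `long`, and the read side of a `ReentrantReadWriteLock` plays the role of the shared modification lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of the update path: shared lock, logical record, page change with physical record, unlock. */
final class UpdateProtocolSketch {
    // Updates take the read (shared) side; the write side is left for the checkpointer.
    final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();
    // Stand-in for the write-ahead log; a real log is an append-only file.
    final List<String> wal = new ArrayList<>();
    // Each "page" is a single long to keep the sketch short.
    final long[] pages = new long[4];

    private synchronized void log(String record) {
        wal.add(record);
    }

    void put(int pageIdx, long value) {
        checkpointLock.readLock().lock();              // 1. acquire the shared lock
        try {
            log("LOGICAL put value=" + value);         // 2. logical (page-independent) record
            synchronized (pages) {                     // stand-in for a per-page lock
                pages[pageIdx] = value;                // 3. update the page ...
                log("PHYSICAL page=" + pageIdx + " value=" + value); // ... and log the single-page change
            }
        } finally {
            checkpointLock.readLock().unlock();        // 4. release the shared lock
        }
    }
}
```

Because updates hold only the shared side, any number of them can run concurrently; they are excluded only while the exclusive side is held.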

Generic approach: ARIES
• Checkpointing protocol:
  ◦ Exclusive lock is acquired
  ◦ Changed pages are copied*
  ◦ Lock is released
  ◦ Pages are written to disk
* Only a collection of page identifiers is copied; the pages themselves are copied on write (COW)
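
The checkpointing steps can be sketched in Java as follows. This is a simplified illustration with hypothetical names (`CheckpointSketch`, `update`, `checkpoint`): arrays stand in for memory and disk, and the copy-on-write step from the footnote is omitted, so a page written after the lock is released may already contain a later update.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of a sharp checkpoint: snapshot dirty page IDs under an exclusive lock, write afterwards. */
final class CheckpointSketch {
    final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();
    // IDs of pages changed since the last checkpoint.
    private Set<Integer> dirty = new HashSet<>();
    final long[] memory = new long[8]; // stand-in for page buffers
    final long[] disk = new long[8];   // stand-in for the data file

    void update(int pageIdx, long value) {
        checkpointLock.readLock().lock();          // updates hold the shared side
        try {
            synchronized (this) {
                memory[pageIdx] = value;
                dirty.add(pageIdx);
            }
        } finally {
            checkpointLock.readLock().unlock();
        }
    }

    /** Returns the number of pages written by this checkpoint. */
    int checkpoint() {
        Set<Integer> snapshot;
        checkpointLock.writeLock().lock();         // 1. exclusive lock blocks all updates
        try {
            snapshot = dirty;                      // 2. take the collection of dirty page IDs
            dirty = new HashSet<>();
        } finally {
            checkpointLock.writeLock().unlock();   // 3. release the lock, updates resume
        }
        for (int idx : snapshot) {                 // 4. write pages outside the lock
            synchronized (this) {
                disk[idx] = memory[idx];           // no copy-on-write here (simplification)
            }
        }
        return snapshot.size();
    }
}
```

The exclusive section only swaps a reference, which is why the pause for writers stays short even when many pages are dirty.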

Implementation choices and tradeoffs

Page lock: implementation details
• Targeting hundreds of gigabytes of RAM means on the order of 100M pages
• java.util.concurrent.locks.ReentrantReadWriteLock is large; keeping one per page is inefficient
• Creating and destroying a lock for each page on demand is expensive
• Bad cache locality
• Inline the lock in the page buffer instead

Page lock: implementation details
OffheapReadWriteLock
• Not fair, may starve
• Not reentrant
• 8 bytes
• Helps to avoid the ABA problem
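
A lock with these properties can be sketched in Java as a single 64-bit word updated with compare-and-set. This is not the `OffheapReadWriteLock` code: `CompactRwLock` is a hypothetical class, the bit layout (32-bit tag plus 32-bit state) is an assumption made for the sketch, an `AtomicLong` stands in for 8 bytes inside the page buffer, and blocking is replaced by try-methods. The tag shows one way such a lock can guard against ABA: a caller holding a stale reference to a reused buffer presents the old tag and is rejected.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Non-fair, non-reentrant read-write lock packed into one 64-bit word. */
final class CompactRwLock {
    // Assumed layout: high 32 bits = tag, low 32 bits = state.
    // State: 0 = free, -1 = write-locked, n > 0 = n readers.
    private final AtomicLong word;

    CompactRwLock(int tag) {
        word = new AtomicLong(pack(tag, 0));
    }

    private static long pack(int tag, int state) {
        return ((long) tag << 32) | (state & 0xFFFFFFFFL);
    }

    private static int tag(long w) {
        return (int) (w >>> 32);
    }

    private static int state(long w) {
        return (int) w;
    }

    /** Fails if a writer holds the lock or the caller's tag is stale. */
    boolean tryReadLock(int expectedTag) {
        for (;;) {
            long w = word.get();
            if (tag(w) != expectedTag || state(w) < 0)
                return false;
            if (word.compareAndSet(w, pack(expectedTag, state(w) + 1)))
                return true;
        }
    }

    void readUnlock() {
        for (;;) {
            long w = word.get();
            if (word.compareAndSet(w, pack(tag(w), state(w) - 1)))
                return;
        }
    }

    /** Succeeds only if the lock is free and the tag matches. */
    boolean tryWriteLock(int expectedTag) {
        return word.compareAndSet(pack(expectedTag, 0), pack(expectedTag, -1));
    }

    /** Releases the write lock and installs a tag, e.g. a new one when the buffer is reused. */
    void writeUnlock(int newTag) {
        word.set(pack(newTag, 0));
    }
}
```

Nothing here queues waiters, so a steady stream of readers can keep a writer out indefinitely, which matches the "not fair, may starve" tradeoff.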

Sharp checkpoint
Exclusive lock blocks writers
• Latency spikes at higher percentiles
• Simpler code and recovery
  ◦ Can implement arbitrary data structures (e.g. consistent free lists)

File per partition
Data partitioning is reflected in the file structure
• Single data file per partition
  ◦ Page ID = <cache ID, partition ID, page index>
• Number of files is proportional to the number of caches
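
How a page ID of this shape can locate a page on disk is sketched below in Java. `FullPageId` here is a hypothetical class written for this example, the directory and file naming is made up for illustration, and the offset calculation assumes fixed-size pages with no file header.

```java
import java.nio.file.Path;

/** Page identifier made of the three components named on the slide. */
final class FullPageId {
    final int cacheId;
    final int partitionId;
    final int pageIndex;

    FullPageId(int cacheId, int partitionId, int pageIndex) {
        this.cacheId = cacheId;
        this.partitionId = partitionId;
        this.pageIndex = pageIndex;
    }

    /** One file per (cache, partition); the naming scheme is hypothetical. */
    Path file(Path root) {
        return root.resolve("cache-" + cacheId).resolve("part-" + partitionId + ".bin");
    }

    /** Byte offset of the page inside its partition file, assuming no file header. */
    long offset(int pageSize) {
        return (long) pageIndex * pageSize;
    }
}
```

With 4096-byte pages, page index 10 starts at byte 40960 of its partition file.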

Tree scans
• Scans are always index-based
  ◦ Suitable for low-cardinality result sets
  ◦ A full scan results in random access

Relocation table
• Contains the mapping Page ID -> buffer address
• When a partition is evicted, a subset of this mapping must be invalidated
  ◦ Standard map => full scan
• Custom open-addressing hash map
• Robin Hood hashing to maximize load factor
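
Robin Hood hashing on an open-addressing table can be sketched in Java as follows. This is a generic textbook-style illustration, not Ignite's relocation table: `RobinHoodMap` is a hypothetical class, key 0 is reserved as the empty marker, values are assumed to be non-negative addresses so that -1 can mean "absent", and removal and resizing are left out.

```java
/** Open-addressing map from long keys to long values with Robin Hood displacement. */
final class RobinHoodMap {
    private static final long EMPTY = 0L; // key 0 marks an empty slot in this sketch
    private final long[] keys;
    private final long[] vals;
    private final int mask;
    private int size;

    /** Capacity must be a power of two; one slot is always kept free. */
    RobinHoodMap(int capacityPow2) {
        keys = new long[capacityPow2];
        vals = new long[capacityPow2];
        mask = capacityPow2 - 1;
    }

    private int home(long key) {
        return (int) ((key * 0x9E3779B97F4A7C15L) >>> 32) & mask;
    }

    /** How far the key stored in the slot sits from its home slot. */
    private int distance(int slot, long key) {
        return (slot - home(key)) & mask;
    }

    void put(long key, long val) {
        if (key == EMPTY)
            throw new IllegalArgumentException("key 0 is reserved");
        if (get(key) < 0 && size == mask)
            throw new IllegalStateException("table is full");

        int slot = home(key);
        int dist = 0;
        for (;;) {
            if (keys[slot] == EMPTY) {
                keys[slot] = key;
                vals[slot] = val;
                size++;
                return;
            }
            if (keys[slot] == key) {
                vals[slot] = val;
                return;
            }
            int residentDist = distance(slot, keys[slot]);
            if (residentDist < dist) {
                // The resident is closer to home than we are: take its slot and carry it onward.
                long k = keys[slot];
                long v = vals[slot];
                keys[slot] = key;
                vals[slot] = val;
                key = k;
                val = v;
                dist = residentDist;
            }
            slot = (slot + 1) & mask;
            dist++;
        }
    }

    /** Returns the value, or -1 if the key is absent. */
    long get(long key) {
        int slot = home(key);
        for (int dist = 0; dist <= mask; dist++) {
            long k = keys[slot];
            // A resident closer to home than our probe distance proves the key is absent.
            if (k == EMPTY || distance(slot, k) < dist)
                return -1;
            if (k == key)
                return vals[slot];
            slot = (slot + 1) & mask;
        }
        return -1;
    }
}
```

Displacing entries that sit close to their home slot evens out probe lengths, which is what makes high load factors practical.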

Relocation table
• Each partition is assigned a generation
• O(1) map clear for a partition
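
The generation idea can be sketched in Java as follows. The sketch uses a plain `HashMap` instead of a custom open-addressing table to stay short, and all names (`GenerationTable`, `clearPartition`) are hypothetical. Each entry remembers the partition generation it was inserted under; bumping the generation makes every older entry of that partition invisible without visiting it.

```java
import java.util.HashMap;
import java.util.Map;

/** Page ID -> buffer address map where a whole partition can be invalidated in O(1). */
final class GenerationTable {
    private static final class Entry {
        final long address;
        final int generation;

        Entry(long address, int generation) {
            this.address = address;
            this.generation = generation;
        }
    }

    private final Map<Long, Entry> map = new HashMap<>();
    private final int[] generations; // current generation of each partition

    GenerationTable(int partitions) {
        generations = new int[partitions];
    }

    private static long key(int part, int pageIdx) {
        return ((long) part << 32) | (pageIdx & 0xFFFFFFFFL);
    }

    void put(int part, int pageIdx, long address) {
        map.put(key(part, pageIdx), new Entry(address, generations[part]));
    }

    /** Returns the buffer address, or -1 if the entry is absent or belongs to an old generation. */
    long get(int part, int pageIdx) {
        long k = key(part, pageIdx);
        Entry e = map.get(k);
        if (e == null)
            return -1;
        if (e.generation != generations[part]) {
            map.remove(k); // outdated entries are dropped lazily on access
            return -1;
        }
        return e.address;
    }

    /** O(1): no entries are visited, they simply stop matching the current generation. */
    void clearPartition(int part) {
        generations[part]++;
    }
}
```

The cost is that outdated entries keep occupying slots until they are touched or overwritten.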

Summary: native persistence
• The more data fits in RAM, the better the performance
• Prefer SSDs or high-IOPS cloud storage, especially if data volume grows beyond available RAM capacity
• Optimize for OLTP workloads
  ◦ Avoid full scans (scan one partition at a time, preload)
  ◦ Use data collocation

Summary: potential future improvements
• HTAP storage engine (GridGain HydraDragon)
• Write-optimized engine (incremental checkpoints, LSM)
• Support for page-level scans

Welcome to contribute
  ◦ ignite.apache.org
  ◦ github.com/apache/ignite
  ◦ [email protected]
  ◦ t.me/RU_Ignite
  ◦ vk.com/apacheignite
