
JUGNsk Meetup #18. Andrey Gura: "Storage architecture of the Apache Ignite distributed database: decisions and trade-offs"


The design of Apache Ignite's persistent storage follows the approaches used in classical databases built on the ARIES architecture. Nevertheless, the Ignite developers had to adjust that architecture to speed up development and simplify maintenance, including maintenance of the in-memory storage.

Andrey will start with a short overview of the Ignite engine, then cover the trade-offs, the solutions chosen, and the reasons behind those decisions. He will also touch on the difficulties that arose when implementing the storage in Java and how the community overcame them.

jugnsk

June 27, 2021

Transcript

  1. 2021 © GridGain Systems. Target audience
    • Apache Ignite users and developers
    • Persistence data structures enthusiasts
    • Distributed systems and databases enthusiasts
  2. Apache Ignite overview
    • Distributed Database For High-Performance Computing With In-Memory Speed
      ◦ Distributed SQL
      ◦ Multi-Tier Storage
      ◦ Co-located Compute
      ◦ Transactions
      ◦ Machine Learning
      ◦ Continuous Queries
  3. Moving from memory to disk
    • Initial goals:
      ◦ Fast restarts, serving SQL without data preloading
      ◦ Store more data than available RAM
    • Restriction: single architecture for volatile and persistent caches
  4. Generic approach: ARIES
    • Supports arbitrary data structures
    • Recoverable durability
    • Transactional
    https://dl.acm.org/doi/10.1145/128765.128770
  5. Generic approach: ARIES
    • On-disk space is split into blocks of equal size (pages)
    • Each page has a unique identifier locating the page on disk
  6. Generic approach: ARIES
    • Pages are loaded to memory page buffers
    • There are fewer page buffers than disk pages
      ◦ Need mapping from page ID to buffer address
  7. Generic approach: ARIES
    • Pages are loaded to memory page buffers
    • Page buffer content can be replaced by a different page content on demand
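The page buffer idea from the slides above can be sketched as a toy buffer pool. This is only an illustration under stated assumptions: the class name `PageBufferPool`, the 4096-byte page size, and the round-robin replacement policy are hypothetical and are not taken from the talk or from Ignite. Disk I/O is simulated by stamping the page ID into the buffer.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Toy page cache: fixed set of buffers, page ID -> buffer index map, round-robin replacement. */
final class PageBufferPool {
    static final int PAGE_SIZE = 4096;               // assumed page size

    private final ByteBuffer[] buffers;              // fewer buffers than disk pages
    private final long[] residentPageId;             // page currently held by each buffer, -1 if none
    private final Map<Long, Integer> pageToBuffer = new HashMap<>();
    private int nextVictim;                          // naive replacement pointer (assumption)
    int loads;                                       // number of simulated disk reads

    PageBufferPool(int bufferCount) {
        buffers = new ByteBuffer[bufferCount];
        residentPageId = new long[bufferCount];
        for (int i = 0; i < bufferCount; i++) {
            buffers[i] = ByteBuffer.allocateDirect(PAGE_SIZE);
            residentPageId[i] = -1;
        }
    }

    /** Returns the buffer holding the page, replacing another page's content on a miss. */
    ByteBuffer acquire(long pageId) {
        Integer idx = pageToBuffer.get(pageId);
        if (idx == null) {
            idx = nextVictim;
            nextVictim = (nextVictim + 1) % buffers.length;
            if (residentPageId[idx] != -1)
                pageToBuffer.remove(residentPageId[idx]);    // drop the mapping of the evicted page
            // A real engine would flush a dirty victim and read the new page from disk here.
            buffers[idx].putLong(0, pageId);                  // simulated page content
            residentPageId[idx] = pageId;
            pageToBuffer.put(pageId, idx);
            loads++;
        }
        return buffers[idx];
    }

    boolean isResident(long pageId) {
        return pageToBuffer.containsKey(pageId);
    }
}
```

With two buffers, acquiring pages 10, 11 and then 12 replaces the content of the buffer that held page 10, so a later access to page 10 is a miss again.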
  10. Generic approach: ARIES
    • All changes are logged to a single journal (Write-Ahead Log)
      ◦ Logical records (page-independent)
      ◦ Physical records (single-page change)
  11. Generic approach: ARIES
    • Modification protocol:
      ◦ Update acquires a shared modification lock
  12. Generic approach: ARIES
    • Modification protocol:
      ◦ Update acquires a shared modification lock
      ◦ Logical record is written
  13. Generic approach: ARIES
    • Modification protocol:
      ◦ Update acquires a shared modification lock
      ◦ Logical record is written
      ◦ Pages are updated (with physical records)
  14. Generic approach: ARIES
    • Modification protocol:
      ◦ Update acquires a shared modification lock
      ◦ Logical record is written
      ◦ Pages are updated (with physical records)
      ◦ Shared lock is released
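The four steps of the modification protocol can be sketched in a few lines. This is a minimal illustration, not Ignite code: the class `UpdateProtocol`, the string-based log records, and the one-long-per-page "storage" are all assumptions made to keep the example short. The read side of a `ReentrantReadWriteLock` plays the role of the shared modification lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy update path: shared lock -> logical record -> page update with physical record -> unlock. */
final class UpdateProtocol {
    final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();
    final List<String> wal = new ArrayList<>();      // stand-in for the single journal
    final long[] pages = new long[8];                // each "page" holds one long (assumption)

    void put(int key, long value) {
        checkpointLock.readLock().lock();            // 1. shared lock: many updaters may hold it
        try {
            int pageIdx = Math.floorMod(key, pages.length);
            synchronized (wal) {                     // stands in for per-page locking and log ordering
                // 2. logical record: describes the operation, independent of page layout
                wal.add("LOGICAL put key=" + key + " value=" + value);
                // 3. physical record: describes the change to exactly one page, then the page changes
                wal.add("PHYSICAL page=" + pageIdx + " old=" + pages[pageIdx] + " new=" + value);
                pages[pageIdx] = value;
            }
        } finally {
            checkpointLock.readLock().unlock();      // 4. shared lock is released
        }
    }
}
```

Calling `put(3, 42)` appends one logical and one physical record, in that order, and leaves the shared lock released.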
  15. Generic approach: ARIES
    • Checkpointing protocol
      ◦ Exclusive lock is acquired
      ◦ Changed pages are copied*
    * Only a collection of page identifiers is copied; the pages themselves are copied on write (COW)
  16. Generic approach: ARIES
    • Checkpointing protocol
      ◦ Exclusive lock is acquired
      ◦ Changed pages are copied*
      ◦ Lock is released
    * Only a collection of page identifiers is copied; the pages themselves are copied on write (COW)
  17. Generic approach: ARIES
    • Checkpointing protocol
      ◦ Exclusive lock is acquired
      ◦ Changed pages are copied*
      ◦ Lock is released
      ◦ Pages are written to disk
    * Only a collection of page identifiers is copied; the pages themselves are copied on write (COW)
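The checkpointing steps can be sketched the same way. The class `SharpCheckpoint` and its in-memory "disk" map are hypothetical. One deliberate simplification: the copy-on-write handling of pages that change while the checkpoint is writing is left out, so this sketch only shows the locking pattern (exclusive lock held just long enough to capture the set of dirty page IDs).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy checkpoint: capture dirty page IDs under the exclusive lock, write pages after releasing it. */
final class SharpCheckpoint {
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();
    private final Map<Long, byte[]> memory = new HashMap<>();   // page ID -> in-memory page content
    final Map<Long, byte[]> disk = new HashMap<>();             // stand-in for the data files
    private Set<Long> dirty = new HashSet<>();

    void update(long pageId, byte[] content) {
        checkpointLock.readLock().lock();            // updaters share the lock
        try {
            synchronized (this) {
                memory.put(pageId, content.clone());
                dirty.add(pageId);
            }
        } finally {
            checkpointLock.readLock().unlock();
        }
    }

    /** Returns the number of pages written by this checkpoint. */
    int checkpoint() {
        Set<Long> toWrite;
        checkpointLock.writeLock().lock();           // exclusive lock: blocks all updaters
        try {
            synchronized (this) {
                toWrite = dirty;                     // only the collection of page IDs is captured
                dirty = new HashSet<>();
            }
        } finally {
            checkpointLock.writeLock().unlock();     // updaters resume before any disk write
        }
        for (Long id : toWrite) {                    // slow part runs without the exclusive lock
            byte[] content;
            synchronized (this) {
                content = memory.get(id);            // NOTE: no COW here, may see a newer version
            }
            disk.put(id, content.clone());
        }
        return toWrite.size();
    }
}
```

Because the exclusive section only swaps a set reference, the pause seen by updaters does not grow with the number of dirty pages in this sketch.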
  18. Page Lock: implementation details
    • Targeting hundreds of gigabytes of RAM creates around 100M pages
    • j.u.c.l.ReentrantReadWriteLock is large, so keeping one per page is inefficient
    • Creating and destroying a lock for each page on demand is expensive
    • Bad cache locality
    • Inline lock in the page buffer
  19. Page Lock: implementation details. OffheapReadWriteLock
    • Not fair, may starve
    • Not reentrant
    • 8 bytes
    • Helps to avoid ABA problem
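To make the "8 bytes, not fair, not reentrant" properties concrete, here is a sketch of a read-write lock whose whole state is a single 64-bit word. It is not the `OffheapReadWriteLock` implementation: the bit layout (upper 32 bits for a tag, lower 32 bits for the reader count or a writer marker), the try-only API without blocking, and the use of `AtomicLongArray` in place of off-heap memory are all assumptions. The tag illustrates one way such a lock can help with the ABA problem: a caller holding a stale reference to a reused buffer fails to lock it.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Toy 8-byte read-write lock: [tag:32 | state:32], state 0 = free, -1 = writer, n > 0 = n readers. */
final class InlineRwLock {
    private static final int WRITER = -1;
    private final AtomicLongArray words;             // one lock word per page buffer (assumption)

    InlineRwLock(int pages) {
        words = new AtomicLongArray(pages);          // all words start as tag 0, state 0
    }

    private static long pack(int tag, int state) {
        return ((long) tag << 32) | (state & 0xFFFFFFFFL);
    }

    private static int tag(long word) { return (int) (word >>> 32); }

    private static int state(long word) { return (int) word; }

    /** Fails if a writer holds the lock or the buffer now carries a different tag. */
    boolean tryReadLock(int page, int expectedTag) {
        while (true) {
            long w = words.get(page);
            if (tag(w) != expectedTag || state(w) == WRITER)
                return false;
            if (words.compareAndSet(page, w, pack(expectedTag, state(w) + 1)))
                return true;                         // otherwise another thread raced us: retry
        }
    }

    void readUnlock(int page) {
        while (true) {
            long w = words.get(page);
            if (words.compareAndSet(page, w, pack(tag(w), state(w) - 1)))
                return;
        }
    }

    /** Not reentrant: a second attempt by the same thread fails like any other. */
    boolean tryWriteLock(int page, int expectedTag) {
        long w = words.get(page);
        return tag(w) == expectedTag && state(w) == 0
            && words.compareAndSet(page, w, pack(expectedTag, WRITER));
    }

    /** The writer may install a new tag, e.g. when the buffer is reused for another page. */
    void writeUnlock(int page, int newTag) {
        words.set(page, pack(newTag, 0));
    }
}
```

There is no queue of waiters, so nothing prevents a steady stream of readers from starving a writer, which matches the "not fair" bullet.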
  20. Sharp checkpoint
    Exclusive lock blocks writers
    • Latency spikes for higher percentiles
    • Simpler code and recovery
      ◦ Can implement arbitrary data structures (e.g. consistent free lists)
  21. File per partition
    Data partitioning is reflected in file structure
    • Single data file per partition
      ◦ Page ID = <cache ID, partition ID, page index>
    • Number of files is proportional to the number of caches
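A page ID of the form <cache ID, partition ID, page index> is enough to locate a page on disk when there is one file per partition. The sketch below shows that arithmetic. The class name `PageLocation`, the `cache-N/part-M.bin` naming scheme, the 4096-byte page size, and the absence of any file header are assumptions for illustration only.

```java
/** Toy page locator: the triple picks the partition file and the byte offset inside it. */
final class PageLocation {
    static final int PAGE_SIZE = 4096;               // assumed page size

    final int cacheId;
    final int partitionId;
    final int pageIndex;

    PageLocation(int cacheId, int partitionId, int pageIndex) {
        this.cacheId = cacheId;
        this.partitionId = partitionId;
        this.pageIndex = pageIndex;
    }

    /** One data file per (cache, partition); the naming scheme is hypothetical. */
    String fileName() {
        return "cache-" + cacheId + "/part-" + partitionId + ".bin";
    }

    /** Pages have equal size, so the offset is a plain multiplication (no file header assumed). */
    long fileOffset() {
        return (long) pageIndex * PAGE_SIZE;
    }
}
```

For example, `new PageLocation(7, 3, 10)` maps to file `cache-7/part-3.bin` at byte offset 40960.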
  22. File per partition
    Need to clear a partition before dropping the file
  24. Tree scans
    • Scans are always index-based
      ◦ Suitable for low cardinality result sets
      ◦ Random access on full scan
  25. Relocation table
    • Contains mapping Page ID -> Buffer address
    • When partition is evicted, need to invalidate a subset of this mapping
      ◦ Standard map => Full Scan
    • Custom open addressing hash map
    • Robin Hood hashing to maximize load factor
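Robin Hood hashing, named on the slide above, is open addressing with linear probing in which an inserted key displaces any resident entry that sits closer to its home slot than the new key does. That keeps probe lengths even and lets the table run at a high load factor. The sketch below is a generic textbook version with `long` keys and values, fixed capacity, and no removal. It is not the Ignite relocation table, and the hash mixing constant is an arbitrary choice.

```java
/** Toy Robin Hood hash map (long -> long): open addressing, linear probing, fixed capacity. */
final class RobinHoodMap {
    private final long[] keys;
    private final long[] vals;
    private final boolean[] used;
    private final int mask;
    private int size;

    RobinHoodMap(int capacityPow2) {
        if (Integer.bitCount(capacityPow2) != 1)
            throw new IllegalArgumentException("capacity must be a power of two");
        keys = new long[capacityPow2];
        vals = new long[capacityPow2];
        used = new boolean[capacityPow2];
        mask = capacityPow2 - 1;
    }

    private int home(long key) {
        return (int) ((key * 0x9E3779B97F4A7C15L) >>> 32) & mask;
    }

    /** How far the entry stored in this slot is from its home slot. */
    private int distance(int slot, long key) {
        return (slot - home(key)) & mask;
    }

    private int find(long key) {
        int slot = home(key);
        for (int dist = 0; dist <= mask; dist++) {
            // An empty slot or a resident closer to home than we are means the key is absent.
            if (!used[slot] || distance(slot, keys[slot]) < dist)
                return -1;
            if (keys[slot] == key)
                return slot;
            slot = (slot + 1) & mask;
        }
        return -1;
    }

    long get(long key, long defaultValue) {
        int slot = find(key);
        return slot < 0 ? defaultValue : vals[slot];
    }

    void put(long key, long val) {
        int existing = find(key);
        if (existing >= 0) {
            vals[existing] = val;
            return;
        }
        if (size == keys.length)
            throw new IllegalStateException("table is full");
        int slot = home(key);
        int dist = 0;
        while (used[slot]) {
            int residentDist = distance(slot, keys[slot]);
            if (residentDist < dist) {               // resident is "richer": take its slot
                long k = keys[slot];
                long v = vals[slot];
                keys[slot] = key;
                vals[slot] = val;
                key = k;                             // continue inserting the displaced entry
                val = v;
                dist = residentDist;
            }
            slot = (slot + 1) & mask;
            dist++;
        }
        used[slot] = true;
        keys[slot] = key;
        vals[slot] = val;
        size++;
    }

    int size() {
        return size;
    }
}
```

The early exit in `find` is what makes lookups for missing keys cheap even when the table is nearly full.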
  26. Relocation table
    • Each partition is assigned a generation
    • O(1) map clear for a partition
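The generation trick can be sketched as follows: every entry remembers the generation of its partition at insert time, and clearing a partition just increments that partition's counter, which makes all older entries invisible without touching them. The class `GenerationalTable` and its use of `HashMap` are assumptions chosen for brevity. Stale entries stay in memory here until they are overwritten, and how a real table reclaims them is outside this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy relocation table with per-partition generations: clearing a partition is one increment. */
final class GenerationalTable {
    private static final class Entry {
        final long bufferAddr;
        final int generation;

        Entry(long bufferAddr, int generation) {
            this.bufferAddr = bufferAddr;
            this.generation = generation;
        }
    }

    private final Map<Long, Entry> entries = new HashMap<>();
    private final int[] partitionGeneration;

    GenerationalTable(int partitions) {
        partitionGeneration = new int[partitions];
    }

    private static long key(int partition, int pageIndex) {
        return ((long) partition << 32) | (pageIndex & 0xFFFFFFFFL);
    }

    void put(int partition, int pageIndex, long bufferAddr) {
        entries.put(key(partition, pageIndex), new Entry(bufferAddr, partitionGeneration[partition]));
    }

    /** Returns the buffer address, or -1 if the page is absent or belongs to an old generation. */
    long get(int partition, int pageIndex) {
        Entry e = entries.get(key(partition, pageIndex));
        return e != null && e.generation == partitionGeneration[partition] ? e.bufferAddr : -1L;
    }

    /** O(1): no scan over the map, existing entries of the partition simply become stale. */
    void clearPartition(int partition) {
        partitionGeneration[partition]++;
    }
}
```

After `clearPartition(0)`, lookups for partition 0 miss immediately while other partitions are unaffected.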
  27. Summary: native persistence
    • The more data fits in RAM, the better the performance
    • Prefer SSDs or high-IOPS cloud storage, especially if data volume grows beyond available RAM capacity
    • Optimize for OLTP workloads
      ◦ Avoid full scans (scan one partition at a time, preload)
      ◦ Use data collocation
  28. Summary: potential future improvements
    • HTAP Storage Engine (GridGain HydraDragon)
    • Write-optimized engine (incremental checkpoints, LSM)
    • Support for page-level scans
  29. Welcome to contribute
      ◦ ignite.apache.org
      ◦ github.com/apache/ignite
      ◦ [email protected]
      ◦ t.me/RU_Ignite
      ◦ vk.com/apacheignite