JUGNsk Meetup #18. Andrey Gura: "Storage Architecture of the Apache Ignite Distributed Database: Decisions and Trade-offs"

The design of Apache Ignite's persistent storage follows the approaches used in classical databases built on the ARIES architecture. Nevertheless, the Ignite developers had to adjust the architecture to speed up development and simplify maintenance, including maintenance of the in-memory storage.

Andrey will start with a short overview of the Ignite engine, then cover the trade-offs, the solutions chosen, and the reasons behind those decisions. He will also touch on the difficulties that arose when implementing the storage in Java and how the community overcame them.

jugnsk

June 27, 2021

Transcript

  1. Storage Architecture
    of the Apache Ignite
    Distributed Database: Decisions
    and Trade-offs
    Alexey Goncharuk
    Andrey Gura

  2. 2021 © GridGain Systems
    Target audience
    ● Apache Ignite users and developers
    ● Persistence data structures enthusiasts
    ● Distributed systems and database enthusiasts

  3. 2021 © GridGain Systems
    Apache Ignite overview
    ● Distributed Database For High-Performance Computing With
    In-Memory Speed
    ○ Distributed SQL
    ○ Multi-Tier Storage
    ○ Co-located Compute
    ○ Transactions
    ○ Machine Learning
    ○ Continuous Queries

  4. 2021 © GridGain Systems
    Moving from memory to disk
    ● Initial goals:
    ○ Fast restarts, serving SQL without data preloading
    ○ Store more data than available RAM
    ● Restriction: single architecture for volatile and persistent caches

  5. 2021 © GridGain Systems
    Generic approach
    ARIES:
    ● Supports arbitrary data structures
    ● Recoverable durability
    ● Transactional
    https://dl.acm.org/doi/10.1145/128765.128770

  6. 2021 © GridGain Systems
    Generic approach: ARIES
    ● On-disk space is split into blocks of equal size (pages)
    ● Each page has a unique identifier locating the page on disk

  7. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Pages are loaded to memory page buffers

  8. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Pages are loaded to memory page buffers
    ● There are fewer page buffers than disk pages
    ○ Need mapping from page ID to buffer address

  9. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Pages are loaded to memory page buffers
    ● Page buffer content can be replaced by a different page
    content on demand
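
    The buffer-pool idea from the slides above (fewer buffers than disk pages, a page ID to buffer mapping, replacement on demand) can be sketched as follows. This is only an illustration: the class name `BufferPoolSketch`, the 4 KB page size, the FIFO replacement policy, and the simulated disk read are all assumptions, not Ignite's actual page memory.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical buffer pool: a few in-memory buffers cache a larger set of disk pages. */
class BufferPoolSketch {
    static final int PAGE_SIZE = 4096; // assumed page size

    private final ByteBuffer[] buffers;
    private final long[] residentPageId;                              // page held by each buffer
    private final Map<Long, Integer> pageIdToBuffer = new HashMap<>(); // page ID -> buffer index
    private final ArrayDeque<Integer> loadOrder = new ArrayDeque<>();  // FIFO of buffer indexes
    private int diskReads;

    BufferPoolSketch(int bufferCount) {
        buffers = new ByteBuffer[bufferCount];
        residentPageId = new long[bufferCount];
        for (int i = 0; i < bufferCount; i++)
            buffers[i] = ByteBuffer.allocateDirect(PAGE_SIZE);
    }

    /** Returns the buffer holding the page, loading it (and replacing another page) if needed. */
    ByteBuffer getPage(long pageId) {
        Integer idx = pageIdToBuffer.get(pageId);
        if (idx != null)
            return buffers[idx];

        int target;
        if (loadOrder.size() < buffers.length) {
            target = loadOrder.size();      // a buffer that has never been used
        } else {
            target = loadOrder.pollFirst(); // replace the page that was loaded first
            pageIdToBuffer.remove(residentPageId[target]);
            // A real pool would write the replaced page back first if it were dirty.
        }
        readFromDisk(pageId, buffers[target]);
        residentPageId[target] = pageId;
        pageIdToBuffer.put(pageId, target);
        loadOrder.addLast(target);
        return buffers[target];
    }

    /** Simulated I/O: stamps the buffer with the page ID instead of reading a file. */
    private void readFromDisk(long pageId, ByteBuffer buf) {
        diskReads++;
        buf.putLong(0, pageId);
    }

    boolean isResident(long pageId) { return pageIdToBuffer.containsKey(pageId); }

    int diskReads() { return diskReads; }
}
```

    With two buffers, requesting pages 1, 2, 1 costs two simulated reads; requesting page 3 afterwards replaces page 1, the page loaded first.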

  10. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Pages are loaded to memory page buffers
    ● Page buffer content can be replaced by a different page
    content on demand

  11. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Pages are loaded to memory page buffers
    ● Page buffer content can be replaced by a different page
    content on demand

  12. 2021 © GridGain Systems
    Generic approach: ARIES
    ● All data structures are page-organized

  13. 2021 © GridGain Systems
    Generic approach: ARIES
    ● All changes are logged to a single journal (Write-Ahead Log)
    ○ Logical records (page-independent)
    ○ Physical records (single-page change)
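
    To make the two record kinds concrete, here is a minimal sketch of a single journal holding both. The names `WalSketch`, `LogicalPut`, and `PageDelta` are hypothetical and the log is an in-memory list; Ignite's real WAL record classes and on-disk format are not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical write-ahead log holding logical and physical records in one journal. */
class WalSketch {
    interface WalRecord {}

    /** Logical record: describes the operation itself, independent of any page. */
    static final class LogicalPut implements WalRecord {
        final String key;
        final String value;

        LogicalPut(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Physical record: a byte-level change to exactly one page. */
    static final class PageDelta implements WalRecord {
        final long pageId;
        final int offset;
        final byte[] bytes;

        PageDelta(long pageId, int offset, byte[] bytes) {
            this.pageId = pageId;
            this.offset = offset;
            this.bytes = bytes.clone();
        }

        /** Re-applies the change to a page image, as recovery would. */
        void applyTo(byte[] page) {
            System.arraycopy(bytes, 0, page, offset, bytes.length);
        }
    }

    private final List<WalRecord> log = new ArrayList<>();

    /** Appends a record and returns its position in the journal. */
    long append(WalRecord record) {
        log.add(record);
        return log.size() - 1;
    }

    List<WalRecord> records() {
        return Collections.unmodifiableList(log);
    }
}
```

    A logical record is enough to redo an operation at the data-structure level, while a physical record can be replayed against a page image without knowing which operation produced it.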

  14. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Modification protocol:
    ○ Update acquires a
    shared modification lock

  15. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Modification protocol:
    ○ Update acquires a
    shared modification lock
    ○ Logical record is written

  16. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Modification protocol:
    ○ Update acquires a
    shared modification lock
    ○ Logical record is written
    ○ Pages are updated (with
    physical records)

  17. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Modification protocol:
    ○ Update acquires a
    shared modification lock
    ○ Logical record is written
    ○ Pages are updated (with
    physical records)
    ○ Shared lock is released
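
    The four steps of the modification protocol can be sketched roughly as below. The sketch assumes a `java.util.concurrent` read-write lock stands in for the shared modification lock, a list of strings stands in for the journal, and one byte array stands in for a page; `UpdateProtocolSketch` and its members are illustrative names, not Ignite code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical update path: shared lock, logical record, page update with physical record, unlock. */
class UpdateProtocolSketch {
    // Updates take the read (shared) side, so many updates can run concurrently.
    private final ReentrantReadWriteLock modificationLock = new ReentrantReadWriteLock();
    final List<String> wal = Collections.synchronizedList(new ArrayList<>());
    final byte[] page = new byte[64]; // a single page, for brevity

    void put(int offset, byte value) {
        modificationLock.readLock().lock();                       // 1. acquire the shared lock
        try {
            wal.add("LOGICAL put offset=" + offset + " value=" + value); // 2. logical record
            synchronized (page) {                                 // stand-in for a per-page lock
                wal.add("PHYSICAL page=0 offset=" + offset + " value=" + value);
                page[offset] = value;                             // 3. page update with physical record
            }
        } finally {
            modificationLock.readLock().unlock();                 // 4. release the shared lock
        }
    }
}
```

    Because updates only hold the shared side of the lock, they block each other only at page level; the exclusive side is left for the checkpointing protocol.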

  18. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Checkpointing protocol
    ○ Exclusive lock is
    acquired

  19. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Checkpointing protocol
    ○ Exclusive lock is
    acquired
    ○ Changed pages are
    copied*
    *Only a collection of page identifiers is copied, pages are COW-ed

  20. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Checkpointing protocol
    ○ Exclusive lock is
    acquired
    ○ Changed pages are
    copied*
    ○ Lock is released
    *Only a collection of page identifiers is copied, pages are COW-ed

  21. 2021 © GridGain Systems
    Generic approach: ARIES
    ● Checkpointing protocol
    ○ Exclusive lock is
    acquired
    ○ Changed pages are
    copied*
    ○ Lock is released
    ○ Pages are written to disk
    *Only a collection of page identifiers is copied, pages are COW-ed
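
    A rough sketch of the checkpointing steps: the exclusive lock is held only long enough to swap out the set of dirty page IDs, and the pages are written after the lock is released. `CheckpointSketch`, its in-memory "disk" map, and the use of `ReentrantReadWriteLock` are assumptions made for illustration. The copy-on-write handling of pages mentioned in the footnote is deliberately omitted, so here a page updated during the write phase is flushed with its newer content.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sharp checkpoint: brief exclusive lock, then page writes outside the lock. */
class CheckpointSketch {
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();
    private final Map<Long, byte[]> memory = new ConcurrentHashMap<>(); // in-memory pages
    private Set<Long> dirty = ConcurrentHashMap.newKeySet();            // guarded by checkpointLock
    final Map<Long, byte[]> disk = new HashMap<>();                     // simulated page store

    /** Update path: runs under the shared lock and marks the page dirty. */
    void update(long pageId, byte[] content) {
        checkpointLock.readLock().lock();
        try {
            memory.put(pageId, content.clone());
            dirty.add(pageId);
        } finally {
            checkpointLock.readLock().unlock();
        }
    }

    /** Runs one checkpoint and returns the number of pages written. */
    int checkpoint() {
        Set<Long> toWrite;
        checkpointLock.writeLock().lock();      // exclusive lock is acquired
        try {
            toWrite = dirty;                    // only the collection of page IDs is taken
            dirty = ConcurrentHashMap.newKeySet();
        } finally {
            checkpointLock.writeLock().unlock(); // lock is released
        }
        for (Long pageId : toWrite)             // pages are written to disk
            disk.put(pageId, memory.get(pageId).clone());
        return toWrite.size();
    }
}
```

    Writers are blocked only while the dirty set is swapped, which is why the exclusive section stays short even when many pages have changed.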

  22. 2020 © GridGain Systems
    Implementation
    choices and tradeoffs

  23. 2021 © GridGain Systems
    Page Lock: Implementation details
    ● Targeting hundreds of gigabytes of RAM creates 100M pages
    ● j.u.c.l.ReentrantReadWriteLock is large; keeping one for
    each page is inefficient
    ● Creating and destroying a lock for each page on demand is
    expensive
    ● Bad cache locality
    ● Inline lock in the page buffer

  24. 2021 © GridGain Systems
    Page Lock: implementation details
    OffheapReadWriteLock
    ● Not fair, may starve
    ● Not reentrant
    ● 8 bytes
    ● Helps to avoid ABA problem
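
    The properties listed above (unfair, non-reentrant, 8 bytes, ABA protection) can be illustrated with a lock whose whole state fits in one 64-bit word. The sketch below is not Ignite's `OffheapReadWriteLock`: the bit layout (high 32 bits for a tag, low 32 bits for the lock state), the try-lock style API, and the on-heap `AtomicLong` standing in for a word inside the page buffer are all assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical read-write lock packed into a single 8-byte word. */
class InlineRwLockSketch {
    // Assumed layout: high 32 bits = tag, low 32 bits = state
    // (0 = free, N > 0 = N readers, -1 = one writer).
    private final AtomicLong word;

    InlineRwLockSketch(int tag) {
        word = new AtomicLong(pack(tag, 0));
    }

    static long pack(int tag, int state) { return ((long) tag << 32) | (state & 0xFFFFFFFFL); }
    static int tag(long w) { return (int) (w >>> 32); }
    static int state(long w) { return (int) w; }

    /** Fails on a stale tag (buffer now holds another page) or while a writer holds the lock. */
    boolean tryReadLock(int expectedTag) {
        for (;;) {
            long w = word.get();
            if (tag(w) != expectedTag || state(w) < 0)
                return false;
            if (word.compareAndSet(w, pack(expectedTag, state(w) + 1)))
                return true;
        }
    }

    void readUnlock() {
        for (;;) {
            long w = word.get();
            if (word.compareAndSet(w, pack(tag(w), state(w) - 1)))
                return;
        }
    }

    /** Not reentrant: fails if anyone, including the caller, already holds the lock. */
    boolean tryWriteLock(int expectedTag) {
        long w = word.get();
        return tag(w) == expectedTag && state(w) == 0
            && word.compareAndSet(w, pack(expectedTag, -1));
    }

    /** Releases the write lock; passing a new tag invalidates holders of the old one. */
    void writeUnlock(int newTag) {
        word.set(pack(newTag, 0));
    }
}
```

    The tag is what addresses ABA in this sketch: a thread that resolved a buffer for one page and is delayed cannot lock it after the buffer has been reused, because the tag it carries no longer matches. There is no waiter queue, so nothing prevents starvation, in line with the "not fair" bullet.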

  25. 2021 © GridGain Systems
    Sharp checkpoint
    Exclusive lock blocks writers
    ● Latency spikes for higher percentiles
    ● Simpler code and recovery
    ○ Can implement arbitrary data structures (e.g. consistent
    free lists)

  26. 2021 © GridGain Systems
    File per partition
    Data partitioning is reflected in file structure
    ● Single data file per partition
    ○ Page ID =
    ● Number of files is proportional to the number of caches

  27. 2021 © GridGain Systems
    File per partition
    Need to clear a partition before dropping the file

  28. 2021 © GridGain Systems
    File per partition
    Need to clear a partition before dropping the file

  29. 2021 © GridGain Systems
    Tree scans
    ● Scans are always index-based
    ○ Suitable for low cardinality
    result sets
    ○ Random access on full
    scan

  30. 2021 © GridGain Systems
    Relocation table
    ● Contains mapping Page ID -> Buffer address
    ● When partition is evicted, need to invalidate a subset of this
    mapping
    ○ Standard map => Full Scan
    ● Custom open addressing hash map
    ● Robin Hood hashing to maximize load factor
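
    As an illustration of the last two bullets, here is a small open-addressing map from `long` keys to `long` values that uses Robin Hood insertion. It is a generic textbook-style sketch under simplifying assumptions (fixed power-of-two capacity, no removal, no resizing, a multiplicative hash chosen arbitrarily); `RobinHoodMapSketch` is a hypothetical name and this is not Ignite's relocation table.

```java
/** Hypothetical fixed-capacity open-addressing map with Robin Hood insertion. */
class RobinHoodMapSketch {
    private final long[] keys;
    private final long[] vals;
    private final boolean[] used;
    private final int mask;
    private int size;

    /** Capacity must be a power of two. */
    RobinHoodMapSketch(int capacity) {
        keys = new long[capacity];
        vals = new long[capacity];
        used = new boolean[capacity];
        mask = capacity - 1;
    }

    private int idealSlot(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // arbitrary multiplicative hash
        return (int) (h >>> 32) & mask;
    }

    /** How far the key stored in this slot sits from its ideal slot. */
    private int distance(int slot, long key) {
        return (slot - idealSlot(key)) & mask;
    }

    void put(long key, long val) {
        if (size == keys.length)
            throw new IllegalStateException("full (this sketch refuses any put and never resizes)");
        int slot = idealSlot(key);
        int dist = 0;
        for (;;) {
            if (!used[slot]) {
                used[slot] = true;
                keys[slot] = key;
                vals[slot] = val;
                size++;
                return;
            }
            if (keys[slot] == key) {
                vals[slot] = val;
                return;
            }
            int residentDist = distance(slot, keys[slot]);
            if (residentDist < dist) {
                // "Take from the rich": the entry closer to home gives up its slot.
                long k = keys[slot];
                long v = vals[slot];
                keys[slot] = key;
                vals[slot] = val;
                key = k;
                val = v;
                dist = residentDist;
            }
            slot = (slot + 1) & mask;
            dist++;
        }
    }

    long get(long key, long absent) {
        int slot = idealSlot(key);
        for (int dist = 0; dist < keys.length; dist++) {
            if (!used[slot])
                return absent;
            if (keys[slot] == key)
                return vals[slot];
            if (distance(slot, keys[slot]) < dist)
                return absent; // the key would have displaced this entry
            slot = (slot + 1) & mask;
        }
        return absent;
    }

    int size() { return size; }
}
```

    Displacing entries that sit close to their ideal slot evens out probe lengths, which is what makes a high load factor tolerable; the same invariant lets `get` stop early on a miss.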

  31. 2021 © GridGain Systems
    Relocation table
    ● Each partition is assigned a
    generation

  32. 2021 © GridGain Systems
    Relocation table
    ● Each partition is assigned a
    generation
    ● O(1) map clear for a partition
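
    One way to read the generation idea: every mapping entry remembers the generation its partition had when the entry was stored, and a lookup treats entries from an older generation as absent. Clearing a partition then only bumps a counter. The sketch below assumes this reading; `GenerationMapSketch`, the `HashMap` backing store, and the key packing (an arbitrary choice for this example, not Ignite's page ID layout) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical page-to-buffer mapping with O(1) invalidation of a whole partition. */
class GenerationMapSketch {
    // Value layout: {buffer address, generation of the partition at insertion time}.
    private final Map<Long, long[]> entries = new HashMap<>();
    private final int[] partitionGeneration;

    GenerationMapSketch(int partitions) {
        partitionGeneration = new int[partitions];
    }

    // Key packing is arbitrary here and exists only to keep the example short.
    private static long key(int partition, int pageIndex) {
        return ((long) partition << 32) | (pageIndex & 0xFFFFFFFFL);
    }

    void put(int partition, int pageIndex, long bufferAddress) {
        entries.put(key(partition, pageIndex),
            new long[] {bufferAddress, partitionGeneration[partition]});
    }

    /** Returns the buffer address, or -1 if the page is unmapped or its entry is outdated. */
    long get(int partition, int pageIndex) {
        long[] e = entries.get(key(partition, pageIndex));
        if (e == null || e[1] != partitionGeneration[partition])
            return -1;
        return e[0];
    }

    /** O(1): entries of the old generation become invisible without scanning the map. */
    void clearPartition(int partition) {
        partitionGeneration[partition]++;
    }
}
```

    Outdated entries still occupy space after `clearPartition`; in this sketch they are simply overwritten when the same key is stored again, and a real table would need some way to reclaim them lazily.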

  33. 2020 © GridGain Systems
    Summary

  34. 2021 © GridGain Systems
    Summary: native persistence
    ● The more data fit in RAM, the better the performance
    ● Prefer SSDs or high-IOPS cloud storage, especially if data
    volume grows beyond available RAM capacity
    ● Optimize for OLTP workloads
    ○ Avoid full scans (scan partition at a time, preload)
    ○ Use data collocation

  35. 2021 © GridGain Systems
    Summary: potential future improvements
    ● HTAP Storage Engine (GridGain HydraDragon)
    ● Write-optimized engine (incremental checkpoints, LSM)
    ● Support for page-level scans

  36. 2021 © GridGain Systems
    Welcome to contribute
    ○ ignite.apache.org
    ○ github.com/apache/ignite
    [email protected]
    ○ t.me/RU_Ignite
    ○ vk.com/apacheignite