Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SQLite Internals: How The World's Most Used Database Works

SQLite Internals: How The World's Most Used Database Works

Abdur-Rahmaan Janhangeer

December 10, 2022
Tweet

More Decks by Abdur-Rahmaan Janhangeer

Other Decks in Technology

Transcript

  1. SQLite Internals:

    How the World's

    Most Used Database

    Works

    View Slide

  2. ssslides

    View Slide

  3. 3

    View Slide

  4. Abdur-Rahmaan Janhangeer
    Help people break into OpenSource
    People hire me to work on Python projects
    www.compileralchemy.com
    4

    View Slide

  5. SQLite Internals:

    How the World's

    Most Used Database

    Works
    5

    View Slide

  6. THE world's most used piece of sotware (2nd most used
    actually [1])
    6

    View Slide

  7. In the device you are holding most probably
    By the Mil
    On planes
    Space
    7

    View Slide

  8. Presentation meant to be taken home and explored
    8

    View Slide

  9. incomplete - v2 will have more info
    9

    View Slide

  10. The story behind
    10

    View Slide

  11. Author
    Dwayne Richard Hipp (DRH)
    Phd
    11

    View Slide

  12. Inception
    Was working on software for a battleship: DDG-79 Oscar
    Austin [2]
    12

    View Slide

  13. 13

    View Slide

  14. Working on program to solve flow routing, open & close
    valves
    DB already on ship, Informix, now by IBM "For the edge
    and beyond"
    Server down -> blame DRH
    No engines around
    14

    View Slide

  15. one of the guys I was working with says, “Richard, why
    don’t you just write one?” “Okay, I’ll give it a try.
    ...
    all government contracts got shut down, so I was out of
    work for a few months, and I thought, “Well, I’ll just write
    that database engine now.””
    15

    View Slide

  16. so I wrote a byte code engine that would actually run a
    query and then I wrote a compiler that would translate
    SQL into that byte code and voila, SQLite was born.
    16

    View Slide

  17. Milestones
    Posted on internet, picked up speed
    Motorola OS -> (Oh you can monetise OSS) project $80k
    America Online -> Wanted db on CD
    Symbian OS on Nokia phones (tried 10 db, 2 OSS, 7
    prop)
    Symbian needed consortium (bus factor)
    Mozilla Foundation helped set the guidelines
    Android: Changed the phone game, used SQLite,
    contacted by Google
    17

    View Slide

  18. We were going around boasting to everybody naively
    that SQLite didn’t have any bugs in it, or no serious bugs,
    but Android definitely proved us wrong. ... It’s amazing
    how many bugs will crop up when your software
    suddenly gets shipped on millions of devices.
    18

    View Slide

  19. Worked for Rockwell Collins - DO-178B aviation quality
    standard - 100% MCDC test coverage
    Tests better than even postgres which relies on peer
    reviews [3]. Change code fearlessly.
    19

    View Slide

  20. A from scratch / first principle culture
    I need a db: Write it
    Wanted B-tree, implemented from scratch
    Simple enough to write parsers etc
    Build the CVS: Fossil
    Own implementation of printf inside query
    Freedom
    20

    View Slide

  21. ... And it's GPL, and

    so SQLite Version 1 was GPL, it had to be because it

    was linking against the GPL library.

    But GDBM is only key-value, I can't do range queries

    with it. Then I said, “I'm gonna write my own B-tree

    layer”
    21

    View Slide

  22. And so, we can can expect it to be a
    really complex piece of software!
    22

    View Slide

  23. never understood lex because it's so easy to write a
    bunch of c codes faster then lex [1]
    23

    View Slide

  24. Internals
    24

    View Slide

  25. 120K lines, 200k with comments, compiled in one source
    file
    25

    View Slide

  26. Overview
    +----------+ +----------+ +----+

    | Compiler | --> | bytecode | --> | VM |

    +----------+ +----------+ +----+

    26

    View Slide

  27. Better view
    27

    View Slide

  28. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    28

    View Slide

  29. [ parser ] \

    | \ compiler

    v /

    [ code generator ] /

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    29

    View Slide

  30. [ parser ] \

    | \ sqlite3_prepare_v2()

    v /

    [ code generator ] /

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    30

    View Slide

  31. [ parser ] \

    | \ prepared statements

    v /

    [ code generator ] /

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    31

    View Slide

  32. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] \

    | \

    v \

    [ btree ] \

    | \ run the program

    v /

    [ pager ] /

    | /

    v /

    [ shim ] /

    | /

    v /

    [ OS Interface ] /

    32

    View Slide

  33. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] \

    | \

    v \

    [ btree ] \

    | \ sqlite3_step()

    v /

    [ pager ] /

    | /

    v /

    [ shim ] /

    | /

    v /

    [ OS Interface ] /

    33

    View Slide

  34. [ parser ] \

    | \ compiler

    v /

    [ code generator ] /

    |

    v

    [ VM ] \

    | \

    v \

    [ btree ] \

    | \ run the program

    v /

    [ pager ] /

    | /

    v /

    [ shim ] /

    | /

    v /

    [ OS Interface ] /

    34

    View Slide

  35. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] \

    | \

    v \

    [ pager ] \

    | \ storage engine

    v /

    [ shim ] /

    | /

    v /

    [ OS Interface ] /

    35

    View Slide

  36. [ parser ] tokenise.c -> push-down automaton parser

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    36

    View Slide

  37. [ parser ] (reentrant and thread-safe)

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    37

    View Slide

  38. [ parser ] generated by lemon, parse.y, tool/lemon.c

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    38

    View Slide

  39. [ parser ] outputs AST (sqliteInt.h)

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    39

    View Slide

  40. [ parser ]

    |

    v

    [ code generator ] semantic analysis

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    40

    View Slide

  41. [ parser ]

    |

    v

    [ code generator ] AST transformation select.c

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    41

    View Slide

  42. [ parser ]

    |

    v

    [ code generator ] Join order where*.c whereInt.h

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    42

    View Slide

  43. [ parser ]

    |

    v

    [ code generator ] query planning (select.c)

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    43

    View Slide

  44. [ parser ]

    |

    v

    [ code generator ] output bytecodes

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    44

    View Slide

  45. [ parser ]

    |

    v

    [ code generator ] build.c, delete.c, expr.c

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    45

    View Slide

  46. [ parser ]

    |

    v

    [ code generator ] insert.c, update.c

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    46

    View Slide

  47. [ parser ]

    |

    v

    [ code generator ] section w most lines of codes

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    47

    View Slide

  48. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] 2nd section in most lines of code

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    48

    View Slide

  49. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] vdbe.c, vdbe.h, vdbeLnt.h

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    49

    View Slide

  50. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] vdbe*.c, func.c, date.c

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    50

    View Slide

  51. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    | Interface defined by btree.h

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    51

    View Slide

  52. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] has both b- and b+ trees

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    52

    View Slide

  53. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] b- for indexes*

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    53

    View Slide

  54. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] b+ for records*

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    54

    View Slide

  55. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] multiple b-trees per db file

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    55

    View Slide

  56. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] read via cursor

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    56

    View Slide

  57. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] concurrent r/w same table diff cursors

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    57

    View Slide

  58. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] Also called page cache

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    58

    View Slide

  59. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] Protects data from power loss

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    59

    View Slide

  60. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] 1. Roll back mode

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    60

    View Slide

  61. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] 2. Write Ahead Log (WAL) mode

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    61

    View Slide

  62. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] Concurrency control (when trying to access)

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    62

    View Slide

  63. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] In-memory cache

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    63

    View Slide

  64. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] pager.c, pager.h, pcache1.c, pcache.c

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    64

    View Slide

  65. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] pcache.h, wal.c, wal.h

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    65

    View Slide

  66. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ] encryption, compression, logging

    |

    v

    [ OS Interface ]

    66

    View Slide

  67. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ] used to emulate an OS layer

    |

    v

    [ OS Interface ]

    67

    View Slide

  68. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ] used for tests to simulate hardware failures

    |

    v

    [ OS Interface ]

    68

    View Slide

  69. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ] test_multiplex.c, test_vfstrace.c

    |

    v

    [ OS Interface ]

    69

    View Slide

  70. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ] os-specific interfacing

    70

    View Slide

  71. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ] changed at runtime

    71

    View Slide

  72. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ] I/O (test_onefile.c)

    72

    View Slide

  73. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ] os.c, os_unix.c, os_win.c, os*.h

    73

    View Slide

  74. Interesting parts
    74

    View Slide

  75. Recovery
    75

    View Slide

  76. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] Rollback & Write Ahead Log (WAL)

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    76

    View Slide

  77. Rollback
    77

    View Slide

  78. read
    78

    View Slide

  79. | | | |

    | user space | os cache | disk |

    | a | a | x |

    | a | | x |

    | | a | x |

    | | | x |

    shared

    79

    View Slide

  80. shared prevents processes from changing data
    80

    View Slide

  81. write
    81

    View Slide

  82. | | | |

    | user space | os cache | disk |

    | a | a | x |

    | a | | x |

    | | a | x |

    | | | x |

    reserved

    | | | | journal

    | | | |

    | | | |

    | | | |

    | | | |

    82

    View Slide

  83. | | | |

    | user space | os cache | disk |

    | a | a | x |

    | a | | x |

    | | a | x |

    | | | x |

    reserved

    | | | | journal

    | | | |

    | | | |

    | | a | |

    | | a | |

    83

    View Slide

  84. file.database-journal
    84

    View Slide

  85. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    reserved

    | | | | journal

    | | | |

    | | | |

    | | a | |

    | | a | |

    85

    View Slide

  86. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    reserved

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    86

    View Slide

  87. flush rollback jorunal to disk
    takes time, can be turned off
    87

    View Slide

  88. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    88

    View Slide

  89. stops writes
    89

    View Slide

  90. | | | |

    | user space | os cache | disk |

    | b | b | x |

    | b | | x |

    | | b | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    90

    View Slide

  91. | | | |

    | user space | os cache | disk |

    | b | b | b |

    | b | | x |

    | | b | b |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    91

    View Slide

  92. commit deletes rollback journal
    92

    View Slide

  93. | | | |

    | user space | os cache | disk |

    | b | b | b |

    | b | | x |

    | | b | b |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | | |

    | | | |

    93

    View Slide

  94. power loss before commit
    94

    View Slide

  95. | | | |

    | user space | os cache | disk |

    | | | b |

    | | | x |

    | | | x |

    | | | x |



    | | | | journal

    | | | |

    | | | |

    | | | a |

    | | | a |

    95

    View Slide

  96. power restored
    96

    View Slide

  97. | | | |

    | user space | os cache | disk |

    | | | b |

    | | | x |

    | | | x |

    | | | x |

    shared

    | | | | journal

    | | | |

    | | | |

    | | | a |

    | | | a |

    97

    View Slide

  98. | | | |

    | user space | os cache | disk |

    | | | b |

    | | | x |

    | | | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    98

    View Slide

  99. | | | |

    | user space | os cache | disk |

    | | a | b |

    | | a | x |

    | | | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    99

    View Slide

  100. | | | |

    | user space | os cache | disk |

    | | a | a |

    | | a | a |

    | | | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    100

    View Slide

  101. Most reliable, default, but, slow
    101

    View Slide

  102. Write Ahead Log (WAL)
    not default as hash lookup for page in wal in shared memory,
    not work in network from diff pc
    not work with os with dodgy memory mapping
    102

    View Slide

  103. | | | |

    | user space | os cache | disk |

    | a | a | x |

    | a | | x |

    | | a | x |

    | | | x |

    shared

    | | | | file.database-wal

    | | | |

    | | | |

    | | | |

    | | | |

    103

    View Slide

  104. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    shared

    | | | | file.database-wal

    | | | |

    | | | |

    | | | |

    | | | |

    104

    View Slide

  105. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    shared

    | | | | file.database-wal

    | | | |

    | | | |

    | | b | |

    | | b | |

    105

    View Slide

  106. | | | | |

    | user2 | user space | os cache | disk |

    | b | b | a <- x |

    | a | b | | x |

    | | | a <- x |

    | | | | x |

    shared

    | | |

    | | | wal

    | | |

    | b | |

    | b | |

    106

    View Slide

  107. user2: another process

    user2: b from wal cache, a from os cache
    107

    View Slide

  108. | | | | |

    | user2 | user space | os cache | disk |

    | b | b | a | x |

    | c | b | | x |

    | | | a | x |

    | | | | x |

    shared

    | | |

    | | | wal

    | | |

    | b | |

    | c | |

    108

    View Slide

  109. different reads and writes w snapshot isolation
    109

    View Slide

  110. | | | | |

    | user2 | user space | os cache | disk |

    | b | b | a | x |

    | c | b | | x |

    | | | a | x |

    | | | | x |

    shared

    | | |

    | | | wal

    | | |

    | b | b |

    | c | c |

    110

    View Slide

  111. flush to disk occurs w pragma synchronous=full;
    111

    View Slide

  112. | | | |

    | user space | os cache | disk |

    | | c | x |

    | | b | x |

    | | c | x |

    | | a | x |

    shared

    | | |

    | | | wal

    | | |

    | | |

    | | |

    112

    View Slide

  113. checkpoint operation
    113

    View Slide

  114. VM
    Originally stack-based, now switch statement in for loop
    3-address machine
    114

    View Slide

  115. 115

    View Slide

  116. libsql> EXPLAIN SELECT price FROM product WHERE price=100;

    addr opcode p1 p2 p3 p4 p5 comment

    ---- ------------- ---- ---- ---- ------------- -- -------------

    0 Init 0 11 0 0 Start at 11

    1 OpenRead 0 30 0 3 0 root=30 iDb=0; product

    2 Rewind 0 10 0 0

    3 Column 0 2 1 0 r[1]= cursor 0 column 2

    4 RealAffinity 1 0 0 0

    5 Ne 2 9 1 BINARY-8 85 if r[1]!=r[2] goto 9

    6 Column 0 2 3 0 r[3]= cursor 0 column 2

    7 RealAffinity 3 0 0 0

    8 ResultRow 3 1 0 0 output=r[3]

    9 Next 0 3 0 1

    10 Halt 0 0 0 0

    11 Transaction 0 0 25 0 1 usesStmtJournal=0
    12 Integer 100 2 0 0 r[2]=100

    13 Goto 0 1 0 0

    116

    View Slide

  117. Btree
    Deals w binary format for data, k int
    b+ for tables
    b- for indexes
    index | data (in 'record' format)

    [k][ d ] -> called cell

    [k][ d ]

    [k][ d ]

    [k][ d ]
    [k][ d ]

    interpreted by next step
    117

    View Slide

  118. x x x x

    / \

    xx xx <- non-leaf (holds key only)

    [k][ d ][k][ d ] [k][ d ][k][ d ] <- leaf

    k+d in leaves, more than one k in tree (small integers)
    118

    View Slide

  119. btree page
    [ header ][ offset to content ][ free ][ content ]

    content: each cell contains left-child pointer, key, data
    119

    View Slide

  120. Record format [8]
    key and data in record format
    record format
    [ header ][ body ]

    both called payloads [7] but mainly body
    120

    View Slide

  121. VarInt
    1234
    10011010010 - b

    00010011010010 - 2 x 7

    10001001 01010010

    1 beg: more chunk coming

    0 beg 2nd chunk no more chunk coming

    store in two bytes
    121

    View Slide

  122. CREATE TABLE sandwiches (

    id INTEGER PRIMARY KEY,

    name TEXT,
    length REAL,

    count INTEGER

    )

    122

    View Slide

  123. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    |

    V

    h15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

    h15 size of row body
    h01 row id
    both use varint
    123

    View Slide

  124. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    |

    V

    h

    15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

    data & types
    name: 0x1b -> TEXT length of 7 bytes. if type val odd & >=
    13: TEXT calculated using (n*2) + 13 .
    (7*2) + 13 -> 27 -> h1b
    BLOB : (n*2) + 12
    Length floating point always encoded as h07
    124

    View Slide

  125. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    |

    V

    h

    15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

    count field which is an integer. These get packed down
    similar to varints but in a slightly different format. Integers that
    can fit in an
    8-bit integer are represented with a type value of 0x01. 16-bit
    integers are 0x02,

    24-bit integers are 0x03 and so on.
    125

    View Slide

  126. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    |

    V

    h

    15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

    values
    "italian" in utf-8 49 74 61 6c 69 61 6e
    126

    View Slide

  127. length value of 7.5 as `=IEEE-754-2008 floating-point number.
    SQLite can optimize integer floating-point values by storing
    them as pure integer fields but since we have a decimal place
    it is stored with 8-bytes:
    40 1e 00 00 00 00 00 00
    count value of 2:
    02
    127

    View Slide

  128. rows grouped in chunks called pages
    skip it
    128

    View Slide

  129. x x x x

    / \

    xx xx <- non-leaf (holds key only)

    [k][ d ][k][ d ] [k][ d ][k][ d ] <- leaf

    129

    View Slide

  130. [ page 1 ] x x x x - [1]

    [ page 2 ] / \

    [ page 3 ] [4]- xx xx - [6]

    [ page 4 ]

    [ page 5 ] [3] - [k][ d ][k][ d ] [k][ d ][k][ d ] - [8]

    [ page 6 ]

    [ page 7 ] x x x x - [5]

    [ page 8 ] / \

    [ page 9 ] [2] - xx xx - [4]

    [ page 10 ]

    [ page 11 ] [7] - [k][ d ][k][ d ] [k][ d ][k][ d ] - [10]

    130

    View Slide

  131. If table does not fit on page -> overflow to next page
    linkedlist
    first page always b+ tree holding root node
    131

    View Slide

  132. db format
    db: [page1][page2][page3]
    page1
    start byte - offset byte - description

    00 - 16: SQLite format 3

    16 - 02: Page size in bytes

    18 - 01: File format write version

    19 - 01: File format read version

    20 - 01: Bytes reserved at the end of each page

    21 - 01: Max embedded payload fraction

    22 - 01: Max embedded payload fraction

    23 - 01: Min leaf payload fraction

    24 - 04: File change counter

    28 - 04: Size of db in pages

    32 - 04: First freelist page

    36 - 04: Number of freelist pages in the file

    40 - 04: Schema cookie number

    44 - 56: 14 4-byte meta values passed to higher layers

    132

    View Slide

  133. Knowing The Internals
    133

    View Slide

  134. WebSQL
    WebStorage on the web
    Not going to implement an engine from scratch: Use
    SQLite
    "User agents must implement the SQL dialect supported
    by Sqlite 3.6.19"
    Example exploit: Omer Gull - SELECT code execution
    FROM USING SQLite [4]
    Need upated version of SQLite -> conflict with
    requirement of 3.6.19
    Aug 2022 Chrome: Deprecating and Removing webSQL
    [5]
    Memory corruption available from JS
    134

    View Slide

  135. Replaced by the beautiful IndexedDB written by a
    developer from the noble house of Oracle
    135

    View Slide

  136. The future
    136

    View Slide

  137. libSQL
    A great fork
    Real, OpenSource SQLite
    137

    View Slide

  138. LumoSQL
    A clone that is 100% on time
    swappable db engine, btree
    138

    View Slide

  139. lots of distributed clones
    139

    View Slide

  140. With the advent of Wasm, SQL or NoSQL solutions can
    come to the web. One example is DuckDB-Wasm,
    another is absurd-sql. Based on these creations, we feel
    that the developer community can iterate on and create
    new storage solutions faster and better than browser
    vendors.
    libSQL introduced native WASM support to SQLite
    140

    View Slide

  141. Ending Quotes
    141

    View Slide

  142. I had this crazy idea that I’m going to build a database
    engine that does not have a server, that talks directly to
    disk, and ignores the data types, and if you asked any of
    the experts of the day, they would say, “That’s
    impossible. That will never work. That’s a stupid idea.”
    Fortunately, I didn’t know any experts and so I did it
    anyway, so this sort of thing happens. I think, maybe, just
    don’t listen to the experts too much and do what makes
    sense. Solve your problem.
    142

    View Slide

  143. If I'd known how hard it would be I probably

    never would've have written it [3]
    143

    View Slide

  144. Interesting bits
    The customer wanted Informix, so SQLite was not used
    on the Ship
    144

    View Slide

  145. Refs
    [1] SQLite, A Database for the Edge of the Network, DRH,
    Databaseology Lectures, Carnegie Mellon (2015)
    [2] CORECURSIVE Podcast, Episode #066, The Untold
    Story of SQLite
    [3] Richard Hipp Speaks Out on

    SQLite, ACM SIGMOD interviews with DB people,
    Marianne Winslett and Vanessa Braganholo (2019),
    https://sigmodrecord.org/publications/sigmodRecord/1906
    /pdfs/06_Profiles_Hipp.pdf
    [4] DEF CON 27 - Omer Gull - SELECT code execution
    FROM USING SQLite, https://www.youtube.com/watch?
    v=JokZUjwGj4M
    145

    View Slide

  146. [5] Deprecating and removing Web SQL,
    https://developer.chrome.com/blog/deprecating-web-sql/
    [6] Craft vulnerable db
    https://github.com/CheckPointSW/QueryOrientedProgram
    ming/blob/master/qop.py
    [7] https://www.sqlite.org/fileformat.html#record_format
    [8] https://fly.io/blog/sqlite-internals-btree/
    146

    View Slide

  147. Images
    DGG-79: https://news.usni.org/2019/06/04/uss-oscar-
    austin-fire-damage-repairs-will-stretch-into-2022
    147

    View Slide