Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SQLite Internals: How The World's Most Used Database Works

SQLite Internals: How The World's Most Used Database Works

Abdur-Rahmaan Janhangeer

December 10, 2022
Tweet

More Decks by Abdur-Rahmaan Janhangeer

Other Decks in Technology

Transcript

  1. SQLite Internals:

    How the World's

    Most Used Database

    Works

    View full-size slide

  2. Abdur-Rahmaan Janhangeer
    Help people break into OpenSource
    People hire me to work on Python projects
    www.compileralchemy.com
    4

    View full-size slide

  3. SQLite Internals:

    How the World's

    Most Used Database

    Works
    5

    View full-size slide

  4. THE world's most used piece of sotware (2nd most used
    actually [1])
    6

    View full-size slide

  5. In the device you are holding most probably
    By the Mil
    On planes
    Space
    7

    View full-size slide

  6. Presentation meant to be taken home and explored
    8

    View full-size slide

  7. incomplete - v2 will have more info
    9

    View full-size slide

  8. The story behind
    10

    View full-size slide

  9. Author
    Dwayne Richard Hipp (DRH)
    Phd
    11

    View full-size slide

  10. Inception
    Was working on software for a battleship: DDG-79 Oscar
    Austin [2]
    12

    View full-size slide

  11. Working on program to solve flow routing, open & close
    valves
    DB already on ship, Informix, now by IBM "For the edge
    and beyond"
    Server down -> blame DRH
    No engines around
    14

    View full-size slide

  12. one of the guys I was working with says, “Richard, why
    don’t you just write one?” “Okay, I’ll give it a try.
    ...
    all government contracts got shut down, so I was out of
    work for a few months, and I thought, “Well, I’ll just write
    that database engine now.””
    15

    View full-size slide

  13. so I wrote a byte code engine that would actually run a
    query and then I wrote a compiler that would translate
    SQL into that byte code and voila, SQLite was born.
    16

    View full-size slide

  14. Milestones
    Posted on internet, picked up speed
    Motorola OS -> (Oh you can monetise OSS) project $80k
    America Online -> Wanted db on CD
    Symbian OS on Nokia phones (tried 10 db, 2 OSS, 7
    prop)
    Symbian needed consortium (bus factor)
    Mozilla Foundation helped set the guidelines
    Android: Changed the phone game, used SQLite,
    contacted by Google
    17

    View full-size slide

  15. We were going around boasting to everybody naively
    that SQLite didn’t have any bugs in it, or no serious bugs,
    but Android definitely proved us wrong. ... It’s amazing
    how many bugs will crop up when your software
    suddenly gets shipped on millions of devices.
    18

    View full-size slide

  16. Worked for Rockwell Collins - DO-178B aviation quality
    standard - 100% MCDC test coverage
    Tests better than even postgres which relies on peer
    reviews [3]. Change code fearlessly.
    19

    View full-size slide

  17. A from scratch / first principle culture
    I need a db: Write it
    Wanted B-tree, implemented from scratch
    Simple enough to write parsers etc
    Build the CVS: Fossil
    Own implementation of printf inside query
    Freedom
    20

    View full-size slide

  18. ... And it's GPL, and

    so SQLite Version 1 was GPL, it had to be because it

    was linking against the GPL library.

    But GDBM is only key-value, I can't do range queries

    with it. Then I said, “I'm gonna write my own B-tree

    layer”
    21

    View full-size slide

  19. And so, we can can expect it to be a
    really complex piece of software!
    22

    View full-size slide

  20. never understood lex because it's so easy to write a
    bunch of c codes faster then lex [1]
    23

    View full-size slide

  21. 120K lines, 200k with comments, compiled in one source
    file
    25

    View full-size slide

  22. Overview
    +----------+ +----------+ +----+

    | Compiler | --> | bytecode | --> | VM |

    +----------+ +----------+ +----+

    26

    View full-size slide

  23. Better view
    27

    View full-size slide

  24. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    28

    View full-size slide

  25. [ parser ] \

    | \ compiler

    v /

    [ code generator ] /

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    29

    View full-size slide

  26. [ parser ] \

    | \ sqlite3_prepare_v2()

    v /

    [ code generator ] /

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    30

    View full-size slide

  27. [ parser ] \

    | \ prepared statements

    v /

    [ code generator ] /

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    31

    View full-size slide

  28. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] \

    | \

    v \

    [ btree ] \

    | \ run the program

    v /

    [ pager ] /

    | /

    v /

    [ shim ] /

    | /

    v /

    [ OS Interface ] /

    32

    View full-size slide

  29. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] \

    | \

    v \

    [ btree ] \

    | \ sqlite3_step()

    v /

    [ pager ] /

    | /

    v /

    [ shim ] /

    | /

    v /

    [ OS Interface ] /

    33

    View full-size slide

  30. [ parser ] \

    | \ compiler

    v /

    [ code generator ] /

    |

    v

    [ VM ] \

    | \

    v \

    [ btree ] \

    | \ run the program

    v /

    [ pager ] /

    | /

    v /

    [ shim ] /

    | /

    v /

    [ OS Interface ] /

    34

    View full-size slide

  31. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] \

    | \

    v \

    [ pager ] \

    | \ storage engine

    v /

    [ shim ] /

    | /

    v /

    [ OS Interface ] /

    35

    View full-size slide

  32. [ parser ] tokenise.c -> push-down automaton parser

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    36

    View full-size slide

  33. [ parser ] (reentrant and thread-safe)

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    37

    View full-size slide

  34. [ parser ] generated by lemon, parse.y, tool/lemon.c

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    38

    View full-size slide

  35. [ parser ] outputs AST (sqliteInt.h)

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    39

    View full-size slide

  36. [ parser ]

    |

    v

    [ code generator ] semantic analysis

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    40

    View full-size slide

  37. [ parser ]

    |

    v

    [ code generator ] AST transformation select.c

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    41

    View full-size slide

  38. [ parser ]

    |

    v

    [ code generator ] Join order where*.c whereInt.h

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    42

    View full-size slide

  39. [ parser ]

    |

    v

    [ code generator ] query planning (select.c)

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    43

    View full-size slide

  40. [ parser ]

    |

    v

    [ code generator ] output bytecodes

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    44

    View full-size slide

  41. [ parser ]

    |

    v

    [ code generator ] build.c, delete.c, expr.c

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    45

    View full-size slide

  42. [ parser ]

    |

    v

    [ code generator ] insert.c, update.c

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    46

    View full-size slide

  43. [ parser ]

    |

    v

    [ code generator ] section w most lines of codes

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    47

    View full-size slide

  44. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] 2nd section in most lines of code

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    48

    View full-size slide

  45. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] vdbe.c, vdbe.h, vdbeLnt.h

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    49

    View full-size slide

  46. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ] vdbe*.c, func.c, date.c

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    50

    View full-size slide

  47. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    | Interface defined by btree.h

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    51

    View full-size slide

  48. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] has both b- and b+ trees

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    52

    View full-size slide

  49. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] b- for indexes*

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    53

    View full-size slide

  50. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] b+ for records*

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    54

    View full-size slide

  51. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] multiple b-trees per db file

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    55

    View full-size slide

  52. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] read via cursor

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    56

    View full-size slide

  53. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ] concurrent r/w same table diff cursors

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    57

    View full-size slide

  54. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] Also called page cache

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    58

    View full-size slide

  55. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] Protects data from power loss

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    59

    View full-size slide

  56. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] 1. Roll back mode

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    60

    View full-size slide

  57. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] 2. Write Ahead Log (WAL) mode

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    61

    View full-size slide

  58. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] Concurrency control (when trying to access)

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    62

    View full-size slide

  59. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] In-memory cache

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    63

    View full-size slide

  60. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] pager.c, pager.h, pcache1.c, pcache.c

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    64

    View full-size slide

  61. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] pcache.h, wal.c, wal.h

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    65

    View full-size slide

  62. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ] encryption, compression, logging

    |

    v

    [ OS Interface ]

    66

    View full-size slide

  63. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ] used to emulate an OS layer

    |

    v

    [ OS Interface ]

    67

    View full-size slide

  64. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ] used for tests to simulate hardware failures

    |

    v

    [ OS Interface ]

    68

    View full-size slide

  65. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ] test_multiplex.c, test_vfstrace.c

    |

    v

    [ OS Interface ]

    69

    View full-size slide

  66. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ] os-specific interfacing

    70

    View full-size slide

  67. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ] changed at runtime

    71

    View full-size slide

  68. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ] I/O (test_onefile.c)

    72

    View full-size slide

  69. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ]

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ] os.c, os_unix.c, os_win.c, os*.h

    73

    View full-size slide

  70. Interesting parts
    74

    View full-size slide

  71. [ parser ]

    |

    v

    [ code generator ]

    |

    v

    [ VM ]

    |

    v

    [ btree ]

    |

    v

    [ pager ] Rollback & Write Ahead Log (WAL)

    |

    v

    [ shim ]

    |

    v

    [ OS Interface ]

    76

    View full-size slide

  72. | | | |

    | user space | os cache | disk |

    | a | a | x |

    | a | | x |

    | | a | x |

    | | | x |

    shared

    79

    View full-size slide

  73. shared prevents processes from changing data
    80

    View full-size slide

  74. | | | |

    | user space | os cache | disk |

    | a | a | x |

    | a | | x |

    | | a | x |

    | | | x |

    reserved

    | | | | journal

    | | | |

    | | | |

    | | | |

    | | | |

    82

    View full-size slide

  75. | | | |

    | user space | os cache | disk |

    | a | a | x |

    | a | | x |

    | | a | x |

    | | | x |

    reserved

    | | | | journal

    | | | |

    | | | |

    | | a | |

    | | a | |

    83

    View full-size slide

  76. file.database-journal
    84

    View full-size slide

  77. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    reserved

    | | | | journal

    | | | |

    | | | |

    | | a | |

    | | a | |

    85

    View full-size slide

  78. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    reserved

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    86

    View full-size slide

  79. flush rollback jorunal to disk
    takes time, can be turned off
    87

    View full-size slide

  80. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    88

    View full-size slide

  81. stops writes
    89

    View full-size slide

  82. | | | |

    | user space | os cache | disk |

    | b | b | x |

    | b | | x |

    | | b | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    90

    View full-size slide

  83. | | | |

    | user space | os cache | disk |

    | b | b | b |

    | b | | x |

    | | b | b |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    91

    View full-size slide

  84. commit deletes rollback journal
    92

    View full-size slide

  85. | | | |

    | user space | os cache | disk |

    | b | b | b |

    | b | | x |

    | | b | b |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | | |

    | | | |

    93

    View full-size slide

  86. power loss before commit
    94

    View full-size slide

  87. | | | |

    | user space | os cache | disk |

    | | | b |

    | | | x |

    | | | x |

    | | | x |



    | | | | journal

    | | | |

    | | | |

    | | | a |

    | | | a |

    95

    View full-size slide

  88. power restored
    96

    View full-size slide

  89. | | | |

    | user space | os cache | disk |

    | | | b |

    | | | x |

    | | | x |

    | | | x |

    shared

    | | | | journal

    | | | |

    | | | |

    | | | a |

    | | | a |

    97

    View full-size slide

  90. | | | |

    | user space | os cache | disk |

    | | | b |

    | | | x |

    | | | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    98

    View full-size slide

  91. | | | |

    | user space | os cache | disk |

    | | a | b |

    | | a | x |

    | | | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    99

    View full-size slide

  92. | | | |

    | user space | os cache | disk |

    | | a | a |

    | | a | a |

    | | | x |

    | | | x |

    exclusive

    | | | | journal

    | | | |

    | | | |

    | | a | a |

    | | a | a |

    100

    View full-size slide

  93. Most reliable, default, but, slow
    101

    View full-size slide

  94. Write Ahead Log (WAL)
    not default as hash lookup for page in wal in shared memory,
    not work in network from diff pc
    not work with os with dodgy memory mapping
    102

    View full-size slide

  95. | | | |

    | user space | os cache | disk |

    | a | a | x |

    | a | | x |

    | | a | x |

    | | | x |

    shared

    | | | | file.database-wal

    | | | |

    | | | |

    | | | |

    | | | |

    103

    View full-size slide

  96. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    shared

    | | | | file.database-wal

    | | | |

    | | | |

    | | | |

    | | | |

    104

    View full-size slide

  97. | | | |

    | user space | os cache | disk |

    | b | a | x |

    | b | | x |

    | | a | x |

    | | | x |

    shared

    | | | | file.database-wal

    | | | |

    | | | |

    | | b | |

    | | b | |

    105

    View full-size slide

  98. | | | | |

    | user2 | user space | os cache | disk |

    | b | b | a <- x |

    | a | b | | x |

    | | | a <- x |

    | | | | x |

    shared

    | | |

    | | | wal

    | | |

    | b | |

    | b | |

    106

    View full-size slide

  99. user2: another process

    user2: b from wal cache, a from os cache
    107

    View full-size slide

  100. | | | | |

    | user2 | user space | os cache | disk |

    | b | b | a | x |

    | c | b | | x |

    | | | a | x |

    | | | | x |

    shared

    | | |

    | | | wal

    | | |

    | b | |

    | c | |

    108

    View full-size slide

  101. different reads and writes w snapshot isolation
    109

    View full-size slide

  102. | | | | |

    | user2 | user space | os cache | disk |

    | b | b | a | x |

    | c | b | | x |

    | | | a | x |

    | | | | x |

    shared

    | | |

    | | | wal

    | | |

    | b | b |

    | c | c |

    110

    View full-size slide

  103. flush to disk occurs w pragma synchronous=full;
    111

    View full-size slide

  104. | | | |

    | user space | os cache | disk |

    | | c | x |

    | | b | x |

    | | c | x |

    | | a | x |

    shared

    | | |

    | | | wal

    | | |

    | | |

    | | |

    112

    View full-size slide

  105. checkpoint operation
    113

    View full-size slide

  106. VM
    Originally stack-based, now switch statement in for loop
    3-address machine
    114

    View full-size slide

  107. libsql> EXPLAIN SELECT price FROM product WHERE price=100;

    addr opcode p1 p2 p3 p4 p5 comment

    ---- ------------- ---- ---- ---- ------------- -- -------------

    0 Init 0 11 0 0 Start at 11

    1 OpenRead 0 30 0 3 0 root=30 iDb=0; product

    2 Rewind 0 10 0 0

    3 Column 0 2 1 0 r[1]= cursor 0 column 2

    4 RealAffinity 1 0 0 0

    5 Ne 2 9 1 BINARY-8 85 if r[1]!=r[2] goto 9

    6 Column 0 2 3 0 r[3]= cursor 0 column 2

    7 RealAffinity 3 0 0 0

    8 ResultRow 3 1 0 0 output=r[3]

    9 Next 0 3 0 1

    10 Halt 0 0 0 0

    11 Transaction 0 0 25 0 1 usesStmtJournal=0
    12 Integer 100 2 0 0 r[2]=100

    13 Goto 0 1 0 0

    116

    View full-size slide

  108. Btree
    Deals w binary format for data, k int
    b+ for tables
    b- for indexes
    index | data (in 'record' format)

    [k][ d ] -> called cell

    [k][ d ]

    [k][ d ]

    [k][ d ]
    [k][ d ]

    interpreted by next step
    117

    View full-size slide

  109. x x x x

    / \

    xx xx <- non-leaf (holds key only)

    [k][ d ][k][ d ] [k][ d ][k][ d ] <- leaf

    k+d in leaves, more than one k in tree (small integers)
    118

    View full-size slide

  110. btree page
    [ header ][ offset to content ][ free ][ content ]

    content: each cell contains left-child pointer, key, data
    119

    View full-size slide

  111. Record format [8]
    key and data in record format
    record format
    [ header ][ body ]

    both called payloads [7] but mainly body
    120

    View full-size slide

  112. VarInt
    1234
    10011010010 - b

    00010011010010 - 2 x 7

    10001001 01010010

    1 beg: more chunk coming

    0 beg 2nd chunk no more chunk coming

    store in two bytes
    121

    View full-size slide

  113. CREATE TABLE sandwiches (

    id INTEGER PRIMARY KEY,

    name TEXT,
    length REAL,

    count INTEGER

    )

    122

    View full-size slide

  114. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    |

    V

    h15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

    h15 size of row body
    h01 row id
    both use varint
    123

    View full-size slide

  115. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    |

    V

    h

    15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

    data & types
    name: 0x1b -> TEXT length of 7 bytes. if type val odd & >=
    13: TEXT calculated using (n*2) + 13 .
    (7*2) + 13 -> 27 -> h1b
    BLOB : (n*2) + 12
    Length floating point always encoded as h07
    124

    View full-size slide

  116. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    |

    V

    h

    15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

    count field which is an integer. These get packed down
    similar to varints but in a slightly different format. Integers that
    can fit in an
    8-bit integer are represented with a type value of 0x01. 16-bit
    integers are 0x02,

    24-bit integers are 0x03 and so on.
    125

    View full-size slide

  117. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    |

    V

    h

    15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

    values
    "italian" in utf-8 49 74 61 6c 69 61 6e
    126

    View full-size slide

  118. length value of 7.5 as `=IEEE-754-2008 floating-point number.
    SQLite can optimize integer floating-point values by storing
    them as pure integer fields but since we have a decimal place
    it is stored with 8-bytes:
    40 1e 00 00 00 00 00 00
    count value of 2:
    02
    127

    View full-size slide

  119. rows grouped in chunks called pages
    skip it
    128

    View full-size slide

  120. x x x x

    / \

    xx xx <- non-leaf (holds key only)

    [k][ d ][k][ d ] [k][ d ][k][ d ] <- leaf

    129

    View full-size slide

  121. [ page 1 ] x x x x - [1]

    [ page 2 ] / \

    [ page 3 ] [4]- xx xx - [6]

    [ page 4 ]

    [ page 5 ] [3] - [k][ d ][k][ d ] [k][ d ][k][ d ] - [8]

    [ page 6 ]

    [ page 7 ] x x x x - [5]

    [ page 8 ] / \

    [ page 9 ] [2] - xx xx - [4]

    [ page 10 ]

    [ page 11 ] [7] - [k][ d ][k][ d ] [k][ d ][k][ d ] - [10]

    130

    View full-size slide

  122. If table does not fit on page -> overflow to next page
    linkedlist
    first page always b+ tree holding root node
    131

    View full-size slide

  123. db format
    db: [page1][page2][page3]
    page1
    start byte - offset byte - description

    00 - 16: SQLite format 3

    16 - 02: Page size in bytes

    18 - 01: File format write version

    19 - 01: File format read version

    20 - 01: Bytes reserved at the end of each page

    21 - 01: Max embedded payload fraction

    22 - 01: Max embedded payload fraction

    23 - 01: Min leaf payload fraction

    24 - 04: File change counter

    28 - 04: Size of db in pages

    32 - 04: First freelist page

    36 - 04: Number of freelist pages in the file

    40 - 04: Schema cookie number

    44 - 56: 14 4-byte meta values passed to higher layers

    132

    View full-size slide

  124. Knowing The Internals
    133

    View full-size slide

  125. WebSQL
    WebStorage on the web
    Not going to implement an engine from scratch: Use
    SQLite
    "User agents must implement the SQL dialect supported
    by Sqlite 3.6.19"
    Example exploit: Omer Gull - SELECT code execution
    FROM USING SQLite [4]
    Need upated version of SQLite -> conflict with
    requirement of 3.6.19
    Aug 2022 Chrome: Deprecating and Removing webSQL
    [5]
    Memory corruption available from JS
    134

    View full-size slide

  126. Replaced by the beautiful IndexedDB written by a
    developer from the noble house of Oracle
    135

    View full-size slide

  127. The future
    136

    View full-size slide

  128. libSQL
    A great fork
    Real, OpenSource SQLite
    137

    View full-size slide

  129. LumoSQL
    A clone that is 100% on time
    swappable db engine, btree
    138

    View full-size slide

  130. lots of distributed clones
    139

    View full-size slide

  131. With the advent of Wasm, SQL or NoSQL solutions can
    come to the web. One example is DuckDB-Wasm,
    another is absurd-sql. Based on these creations, we feel
    that the developer community can iterate on and create
    new storage solutions faster and better than browser
    vendors.
    libSQL introduced native WASM support to SQLite
    140

    View full-size slide

  132. Ending Quotes
    141

    View full-size slide

  133. I had this crazy idea that I’m going to build a database
    engine that does not have a server, that talks directly to
    disk, and ignores the data types, and if you asked any of
    the experts of the day, they would say, “That’s
    impossible. That will never work. That’s a stupid idea.”
    Fortunately, I didn’t know any experts and so I did it
    anyway, so this sort of thing happens. I think, maybe, just
    don’t listen to the experts too much and do what makes
    sense. Solve your problem.
    142

    View full-size slide

  134. If I'd known how hard it would be I probably

    never would've have written it [3]
    143

    View full-size slide

  135. Interesting bits
    The customer wanted Informix, so SQLite was not used
    on the Ship
    144

    View full-size slide

  136. Refs
    [1] SQLite, A Database for the Edge of the Network, DRH,
    Databaseology Lectures, Carnegie Mellon (2015)
    [2] CORECURSIVE Podcast, Episode #066, The Untold
    Story of SQLite
    [3] Richard Hipp Speaks Out on

    SQLite, ACM SIGMOD interviews with DB people,
    Marianne Winslett and Vanessa Braganholo (2019),
    https://sigmodrecord.org/publications/sigmodRecord/1906
    /pdfs/06_Profiles_Hipp.pdf
    [4] DEF CON 27 - Omer Gull - SELECT code execution
    FROM USING SQLite, https://www.youtube.com/watch?
    v=JokZUjwGj4M
    145

    View full-size slide

  137. [5] Deprecating and removing Web SQL,
    https://developer.chrome.com/blog/deprecating-web-sql/
    [6] Craft vulnerable db
    https://github.com/CheckPointSW/QueryOrientedProgram
    ming/blob/master/qop.py
    [7] https://www.sqlite.org/fileformat.html#record_format
    [8] https://fly.io/blog/sqlite-internals-btree/
    146

    View full-size slide

  138. Images
    DGG-79: https://news.usni.org/2019/06/04/uss-oscar-
    austin-fire-damage-repairs-will-stretch-into-2022
    147

    View full-size slide