Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SQLite Internals: How The World's Most Used Database Works

SQLite Internals: How The World's Most Used Database Works

Abdur-Rahmaan Janhangeer

December 10, 2022
Tweet

More Decks by Abdur-Rahmaan Janhangeer

Other Decks in Technology

Transcript

  1. SQLite Internals: How the World's Most Used Database Works

  2. ssslides

  3. 3

  4. Abdur-Rahmaan Janhangeer Help people break into OpenSource People hire me

    to work on Python projects www.compileralchemy.com 4
  5. SQLite Internals: How the World's Most Used Database Works 5

  6. THE world's most used piece of sotware (2nd most used

    actually [1]) 6
  7. In the device you are holding most probably By the

    Mil On planes Space 7
  8. Presentation meant to be taken home and explored 8

  9. incomplete - v2 will have more info 9

  10. The story behind 10

  11. Author Dwayne Richard Hipp (DRH) Phd 11

  12. Inception Was working on software for a battleship: DDG-79 Oscar

    Austin [2] 12
  13. 13

  14. Working on program to solve flow routing, open & close

    valves DB already on ship, Informix, now by IBM "For the edge and beyond" Server down -> blame DRH No engines around 14
  15. one of the guys I was working with says, “Richard,

    why don’t you just write one?” “Okay, I’ll give it a try. ... all government contracts got shut down, so I was out of work for a few months, and I thought, “Well, I’ll just write that database engine now.”” 15
  16. so I wrote a byte code engine that would actually

    run a query and then I wrote a compiler that would translate SQL into that byte code and voila, SQLite was born. 16
  17. Milestones Posted on internet, picked up speed Motorola OS ->

    (Oh you can monetise OSS) project $80k America Online -> Wanted db on CD Symbian OS on Nokia phones (tried 10 db, 2 OSS, 7 prop) Symbian needed consortium (bus factor) Mozilla Foundation helped set the guidelines Android: Changed the phone game, used SQLite, contacted by Google 17
  18. We were going around boasting to everybody naively that SQLite

    didn’t have any bugs in it, or no serious bugs, but Android definitely proved us wrong. ... It’s amazing how many bugs will crop up when your software suddenly gets shipped on millions of devices. 18
  19. Worked for Rockwell Collins - DO-178B aviation quality standard -

    100% MCDC test coverage Tests better than even postgres which relies on peer reviews [3]. Change code fearlessly. 19
  20. A from scratch / first principle culture I need a

    db: Write it Wanted B-tree, implemented from scratch Simple enough to write parsers etc Build the CVS: Fossil Own implementation of printf inside query Freedom 20
  21. ... And it's GPL, and so SQLite Version 1 was

    GPL, it had to be because it was linking against the GPL library. But GDBM is only key-value, I can't do range queries with it. Then I said, “I'm gonna write my own B-tree layer” 21
  22. And so, we can can expect it to be a

    really complex piece of software! 22
  23. never understood lex because it's so easy to write a

    bunch of c codes faster then lex [1] 23
  24. Internals 24

  25. 120K lines, 200k with comments, compiled in one source file

    25
  26. Overview +----------+ +----------+ +----+ | Compiler | --> | bytecode

    | --> | VM | +----------+ +----------+ +----+ 26
  27. Better view 27

  28. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 28
  29. [ parser ] \ | \ compiler v / [

    code generator ] / | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 29
  30. [ parser ] \ | \ sqlite3_prepare_v2() v / [

    code generator ] / | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 30
  31. [ parser ] \ | \ prepared statements v /

    [ code generator ] / | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 31
  32. [ parser ] | v [ code generator ] |

    v [ VM ] \ | \ v \ [ btree ] \ | \ run the program v / [ pager ] / | / v / [ shim ] / | / v / [ OS Interface ] / 32
  33. [ parser ] | v [ code generator ] |

    v [ VM ] \ | \ v \ [ btree ] \ | \ sqlite3_step() v / [ pager ] / | / v / [ shim ] / | / v / [ OS Interface ] / 33
  34. [ parser ] \ | \ compiler v / [

    code generator ] / | v [ VM ] \ | \ v \ [ btree ] \ | \ run the program v / [ pager ] / | / v / [ shim ] / | / v / [ OS Interface ] / 34
  35. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] \ | \ v \ [ pager ] \ | \ storage engine v / [ shim ] / | / v / [ OS Interface ] / 35
  36. [ parser ] tokenise.c -> push-down automaton parser | v

    [ code generator ] | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 36
  37. [ parser ] (reentrant and thread-safe) | v [ code

    generator ] | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 37
  38. [ parser ] generated by lemon, parse.y, tool/lemon.c | v

    [ code generator ] | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 38
  39. [ parser ] outputs AST (sqliteInt.h) | v [ code

    generator ] | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 39
  40. [ parser ] | v [ code generator ] semantic

    analysis | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 40
  41. [ parser ] | v [ code generator ] AST

    transformation select.c | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 41
  42. [ parser ] | v [ code generator ] Join

    order where*.c whereInt.h | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 42
  43. [ parser ] | v [ code generator ] query

    planning (select.c) | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 43
  44. [ parser ] | v [ code generator ] output

    bytecodes | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 44
  45. [ parser ] | v [ code generator ] build.c,

    delete.c, expr.c | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 45
  46. [ parser ] | v [ code generator ] insert.c,

    update.c | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 46
  47. [ parser ] | v [ code generator ] section

    w most lines of codes | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 47
  48. [ parser ] | v [ code generator ] |

    v [ VM ] 2nd section in most lines of code | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 48
  49. [ parser ] | v [ code generator ] |

    v [ VM ] vdbe.c, vdbe.h, vdbeLnt.h | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 49
  50. [ parser ] | v [ code generator ] |

    v [ VM ] vdbe*.c, func.c, date.c | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 50
  51. [ parser ] | v [ code generator ] |

    v [ VM ] | Interface defined by btree.h v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 51
  52. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] has both b- and b+ trees | v [ pager ] | v [ shim ] | v [ OS Interface ] 52
  53. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] b- for indexes* | v [ pager ] | v [ shim ] | v [ OS Interface ] 53
  54. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] b+ for records* | v [ pager ] | v [ shim ] | v [ OS Interface ] 54
  55. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] multiple b-trees per db file | v [ pager ] | v [ shim ] | v [ OS Interface ] 55
  56. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] read via cursor | v [ pager ] | v [ shim ] | v [ OS Interface ] 56
  57. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] concurrent r/w same table diff cursors | v [ pager ] | v [ shim ] | v [ OS Interface ] 57
  58. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] Also called page cache | v [ shim ] | v [ OS Interface ] 58
  59. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] Protects data from power loss | v [ shim ] | v [ OS Interface ] 59
  60. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] 1. Roll back mode | v [ shim ] | v [ OS Interface ] 60
  61. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] 2. Write Ahead Log (WAL) mode | v [ shim ] | v [ OS Interface ] 61
  62. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] Concurrency control (when trying to access) | v [ shim ] | v [ OS Interface ] 62
  63. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] In-memory cache | v [ shim ] | v [ OS Interface ] 63
  64. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] pager.c, pager.h, pcache1.c, pcache.c | v [ shim ] | v [ OS Interface ] 64
  65. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] pcache.h, wal.c, wal.h | v [ shim ] | v [ OS Interface ] 65
  66. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] encryption, compression, logging | v [ OS Interface ] 66
  67. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] used to emulate an OS layer | v [ OS Interface ] 67
  68. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] used for tests to simulate hardware failures | v [ OS Interface ] 68
  69. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] test_multiplex.c, test_vfstrace.c | v [ OS Interface ] 69
  70. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] os-specific interfacing 70
  71. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] changed at runtime 71
  72. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] I/O (test_onefile.c) 72
  73. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] os.c, os_unix.c, os_win.c, os*.h 73
  74. Interesting parts 74

  75. Recovery 75

  76. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] Rollback & Write Ahead Log (WAL) | v [ shim ] | v [ OS Interface ] 76
  77. Rollback 77

  78. read 78

  79. | | | | | user space | os cache

    | disk | | a | a | x | | a | | x | | | a | x | | | | x | shared 79
  80. shared prevents processes from changing data 80

  81. write 81

  82. | | | | | user space | os cache

    | disk | | a | a | x | | a | | x | | | a | x | | | | x | reserved | | | | journal | | | | | | | | | | | | | | | | 82
  83. | | | | | user space | os cache

    | disk | | a | a | x | | a | | x | | | a | x | | | | x | reserved | | | | journal | | | | | | | | | | a | | | | a | | 83
  84. file.database-journal 84

  85. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | reserved | | | | journal | | | | | | | | | | a | | | | a | | 85
  86. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | reserved | | | | journal | | | | | | | | | | a | a | | | a | a | 86
  87. flush rollback jorunal to disk takes time, can be turned

    off 87
  88. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 88
  89. stops writes 89

  90. | | | | | user space | os cache

    | disk | | b | b | x | | b | | x | | | b | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 90
  91. | | | | | user space | os cache

    | disk | | b | b | b | | b | | x | | | b | b | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 91
  92. commit deletes rollback journal 92

  93. | | | | | user space | os cache

    | disk | | b | b | b | | b | | x | | | b | b | | | | x | exclusive | | | | journal | | | | | | | | | | | | | | | | 93
  94. power loss before commit 94

  95. | | | | | user space | os cache

    | disk | | | | b | | | | x | | | | x | | | | x | | | | | journal | | | | | | | | | | | a | | | | a | 95
  96. power restored 96

  97. | | | | | user space | os cache

    | disk | | | | b | | | | x | | | | x | | | | x | shared | | | | journal | | | | | | | | | | | a | | | | a | 97
  98. | | | | | user space | os cache

    | disk | | | | b | | | | x | | | | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 98
  99. | | | | | user space | os cache

    | disk | | | a | b | | | a | x | | | | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 99
  100. | | | | | user space | os cache

    | disk | | | a | a | | | a | a | | | | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 100
  101. Most reliable, default, but, slow 101

  102. Write Ahead Log (WAL) not default as hash lookup for

    page in wal in shared memory, not work in network from diff pc not work with os with dodgy memory mapping 102
  103. | | | | | user space | os cache

    | disk | | a | a | x | | a | | x | | | a | x | | | | x | shared | | | | file.database-wal | | | | | | | | | | | | | | | | 103
  104. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | shared | | | | file.database-wal | | | | | | | | | | | | | | | | 104
  105. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | shared | | | | file.database-wal | | | | | | | | | | b | | | | b | | 105
  106. | | | | | | user2 | user space

    | os cache | disk | | b | b | a <- x | | a | b | | x | | | | a <- x | | | | | x | shared | | | | | | wal | | | | b | | | b | | 106
  107. user2: another process user2: b from wal cache, a from

    os cache 107
  108. | | | | | | user2 | user space

    | os cache | disk | | b | b | a | x | | c | b | | x | | | | a | x | | | | | x | shared | | | | | | wal | | | | b | | | c | | 108
  109. different reads and writes w snapshot isolation 109

  110. | | | | | | user2 | user space

    | os cache | disk | | b | b | a | x | | c | b | | x | | | | a | x | | | | | x | shared | | | | | | wal | | | | b | b | | c | c | 110
  111. flush to disk occurs w pragma synchronous=full; 111

  112. | | | | | user space | os cache

    | disk | | | c | x | | | b | x | | | c | x | | | a | x | shared | | | | | | wal | | | | | | | | | 112
  113. checkpoint operation 113

  114. VM Originally stack-based, now switch statement in for loop 3-address

    machine 114
  115. 115

  116. libsql> EXPLAIN SELECT price FROM product WHERE price=100; addr opcode

    p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 11 0 0 Start at 11 1 OpenRead 0 30 0 3 0 root=30 iDb=0; product 2 Rewind 0 10 0 0 3 Column 0 2 1 0 r[1]= cursor 0 column 2 4 RealAffinity 1 0 0 0 5 Ne 2 9 1 BINARY-8 85 if r[1]!=r[2] goto 9 6 Column 0 2 3 0 r[3]= cursor 0 column 2 7 RealAffinity 3 0 0 0 8 ResultRow 3 1 0 0 output=r[3] 9 Next 0 3 0 1 10 Halt 0 0 0 0 11 Transaction 0 0 25 0 1 usesStmtJournal=0 12 Integer 100 2 0 0 r[2]=100 13 Goto 0 1 0 0 116
  117. Btree Deals w binary format for data, k int b+

    for tables b- for indexes index | data (in 'record' format) [k][ d ] -> called cell [k][ d ] [k][ d ] [k][ d ] [k][ d ] interpreted by next step 117
  118. x x x x / \ xx xx <- non-leaf

    (holds key only) [k][ d ][k][ d ] [k][ d ][k][ d ] <- leaf k+d in leaves, more than one k in tree (small integers) 118
  119. btree page [ header ][ offset to content ][ free

    ][ content ] content: each cell contains left-child pointer, key, data 119
  120. Record format [8] key and data in record format record

    format [ header ][ body ] both called payloads [7] but mainly body 120
  121. VarInt 1234 10011010010 - b 00010011010010 - 2 x 7

    10001001 01010010 1 beg: more chunk coming 0 beg 2nd chunk no more chunk coming store in two bytes 121
  122. CREATE TABLE sandwiches ( id INTEGER PRIMARY KEY, name TEXT,

    length REAL, count INTEGER ) 122
  123. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    | V h15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02 h15 size of row body h01 row id both use varint 123
  124. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    | V h 15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02 data & types name: 0x1b -> TEXT length of 7 bytes. if type val odd & >= 13: TEXT calculated using (n*2) + 13 . (7*2) + 13 -> 27 -> h1b BLOB : (n*2) + 12 Length floating point always encoded as h07 124
  125. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    | V h 15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02 count field which is an integer. These get packed down similar to varints but in a slightly different format. Integers that can fit in an 8-bit integer are represented with a type value of 0x01. 16-bit integers are 0x02, 24-bit integers are 0x03 and so on. 125
  126. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    | V h 15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02 values "italian" in utf-8 49 74 61 6c 69 61 6e 126
  127. length value of 7.5 as `=IEEE-754-2008 floating-point number. SQLite can

    optimize integer floating-point values by storing them as pure integer fields but since we have a decimal place it is stored with 8-bytes: 40 1e 00 00 00 00 00 00 count value of 2: 02 127
  128. rows grouped in chunks called pages skip it 128

  129. x x x x / \ xx xx <- non-leaf

    (holds key only) [k][ d ][k][ d ] [k][ d ][k][ d ] <- leaf 129
  130. [ page 1 ] x x x x - [1]

    [ page 2 ] / \ [ page 3 ] [4]- xx xx - [6] [ page 4 ] [ page 5 ] [3] - [k][ d ][k][ d ] [k][ d ][k][ d ] - [8] [ page 6 ] [ page 7 ] x x x x - [5] [ page 8 ] / \ [ page 9 ] [2] - xx xx - [4] [ page 10 ] [ page 11 ] [7] - [k][ d ][k][ d ] [k][ d ][k][ d ] - [10] 130
  131. If table does not fit on page -> overflow to

    next page linkedlist first page always b+ tree holding root node 131
  132. db format db: [page1][page2][page3] page1 start byte - offset byte

    - description 00 - 16: SQLite format 3 16 - 02: Page size in bytes 18 - 01: File format write version 19 - 01: File format read version 20 - 01: Bytes reserved at the end of each page 21 - 01: Max embedded payload fraction 22 - 01: Max embedded payload fraction 23 - 01: Min leaf payload fraction 24 - 04: File change counter 28 - 04: Size of db in pages 32 - 04: First freelist page 36 - 04: Number of freelist pages in the file 40 - 04: Schema cookie number 44 - 56: 14 4-byte meta values passed to higher layers 132
  133. Knowing The Internals 133

  134. WebSQL WebStorage on the web Not going to implement an

    engine from scratch: Use SQLite "User agents must implement the SQL dialect supported by Sqlite 3.6.19" Example exploit: Omer Gull - SELECT code execution FROM USING SQLite [4] Need upated version of SQLite -> conflict with requirement of 3.6.19 Aug 2022 Chrome: Deprecating and Removing webSQL [5] Memory corruption available from JS 134
  135. Replaced by the beautiful IndexedDB written by a developer from

    the noble house of Oracle 135
  136. The future 136

  137. libSQL A great fork Real, OpenSource SQLite 137

  138. LumoSQL A clone that is 100% on time swappable db

    engine, btree 138
  139. lots of distributed clones 139

  140. With the advent of Wasm, SQL or NoSQL solutions can

    come to the web. One example is DuckDB-Wasm, another is absurd-sql. Based on these creations, we feel that the developer community can iterate on and create new storage solutions faster and better than browser vendors. libSQL introduced native WASM support to SQLite 140
  141. Ending Quotes 141

  142. I had this crazy idea that I’m going to build

    a database engine that does not have a server, that talks directly to disk, and ignores the data types, and if you asked any of the experts of the day, they would say, “That’s impossible. That will never work. That’s a stupid idea.” Fortunately, I didn’t know any experts and so I did it anyway, so this sort of thing happens. I think, maybe, just don’t listen to the experts too much and do what makes sense. Solve your problem. 142
  143. If I'd known how hard it would be I probably

    never would've have written it [3] 143
  144. Interesting bits The customer wanted Informix, so SQLite was not

    used on the Ship 144
  145. Refs [1] SQLite, A Database for the Edge of the

    Network, DRH, Databaseology Lectures, Carnegie Mellon (2015) [2] CORECURSIVE Podcast, Episode #066, The Untold Story of SQLite [3] Richard Hipp Speaks Out on SQLite, ACM SIGMOD interviews with DB people, Marianne Winslett and Vanessa Braganholo (2019), https://sigmodrecord.org/publications/sigmodRecord/1906 /pdfs/06_Profiles_Hipp.pdf [4] DEF CON 27 - Omer Gull - SELECT code execution FROM USING SQLite, https://www.youtube.com/watch? v=JokZUjwGj4M 145
  146. [5] Deprecating and removing Web SQL, https://developer.chrome.com/blog/deprecating-web-sql/ [6] Craft vulnerable

    db https://github.com/CheckPointSW/QueryOrientedProgram ming/blob/master/qop.py [7] https://www.sqlite.org/fileformat.html#record_format [8] https://fly.io/blog/sqlite-internals-btree/ 146
  147. Images DGG-79: https://news.usni.org/2019/06/04/uss-oscar- austin-fire-damage-repairs-will-stretch-into-2022 147