Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SQLite Internals: How The World's Most Used Database Works

SQLite Internals: How The World's Most Used Database Works

Abdur-Rahmaan Janhangeer

December 10, 2022
Tweet

More Decks by Abdur-Rahmaan Janhangeer

Other Decks in Technology

Transcript

  1. 3

  2. Abdur-Rahmaan Janhangeer Help people break into OpenSource People hire me

    to work on Python projects www.compileralchemy.com 4
  3. 13

  4. Working on program to solve flow routing, open & close

    valves DB already on ship, Informix, now by IBM "For the edge and beyond" Server down -> blame DRH No engines around 14
  5. one of the guys I was working with says, “Richard,

    why don’t you just write one?” “Okay, I’ll give it a try. ... all government contracts got shut down, so I was out of work for a few months, and I thought, “Well, I’ll just write that database engine now.”” 15
  6. so I wrote a byte code engine that would actually

    run a query and then I wrote a compiler that would translate SQL into that byte code and voila, SQLite was born. 16
  7. Milestones Posted on internet, picked up speed Motorola OS ->

    (Oh you can monetise OSS) project $80k America Online -> Wanted db on CD Symbian OS on Nokia phones (tried 10 db, 2 OSS, 7 prop) Symbian needed consortium (bus factor) Mozilla Foundation helped set the guidelines Android: Changed the phone game, used SQLite, contacted by Google 17
  8. We were going around boasting to everybody naively that SQLite

    didn’t have any bugs in it, or no serious bugs, but Android definitely proved us wrong. ... It’s amazing how many bugs will crop up when your software suddenly gets shipped on millions of devices. 18
  9. Worked for Rockwell Collins - DO-178B aviation quality standard -

    100% MCDC test coverage Tests better than even postgres which relies on peer reviews [3]. Change code fearlessly. 19
  10. A from scratch / first principle culture I need a

    db: Write it Wanted B-tree, implemented from scratch Simple enough to write parsers etc Build the CVS: Fossil Own implementation of printf inside query Freedom 20
  11. ... And it's GPL, and so SQLite Version 1 was

    GPL, it had to be because it was linking against the GPL library. But GDBM is only key-value, I can't do range queries with it. Then I said, “I'm gonna write my own B-tree layer” 21
  12. And so, we can can expect it to be a

    really complex piece of software! 22
  13. never understood lex because it's so easy to write a

    bunch of c codes faster then lex [1] 23
  14. Overview +----------+ +----------+ +----+ | Compiler | --> | bytecode

    | --> | VM | +----------+ +----------+ +----+ 26
  15. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 28
  16. [ parser ] \ | \ compiler v / [

    code generator ] / | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 29
  17. [ parser ] \ | \ sqlite3_prepare_v2() v / [

    code generator ] / | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 30
  18. [ parser ] \ | \ prepared statements v /

    [ code generator ] / | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 31
  19. [ parser ] | v [ code generator ] |

    v [ VM ] \ | \ v \ [ btree ] \ | \ run the program v / [ pager ] / | / v / [ shim ] / | / v / [ OS Interface ] / 32
  20. [ parser ] | v [ code generator ] |

    v [ VM ] \ | \ v \ [ btree ] \ | \ sqlite3_step() v / [ pager ] / | / v / [ shim ] / | / v / [ OS Interface ] / 33
  21. [ parser ] \ | \ compiler v / [

    code generator ] / | v [ VM ] \ | \ v \ [ btree ] \ | \ run the program v / [ pager ] / | / v / [ shim ] / | / v / [ OS Interface ] / 34
  22. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] \ | \ v \ [ pager ] \ | \ storage engine v / [ shim ] / | / v / [ OS Interface ] / 35
  23. [ parser ] tokenise.c -> push-down automaton parser | v

    [ code generator ] | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 36
  24. [ parser ] (reentrant and thread-safe) | v [ code

    generator ] | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 37
  25. [ parser ] generated by lemon, parse.y, tool/lemon.c | v

    [ code generator ] | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 38
  26. [ parser ] outputs AST (sqliteInt.h) | v [ code

    generator ] | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 39
  27. [ parser ] | v [ code generator ] semantic

    analysis | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 40
  28. [ parser ] | v [ code generator ] AST

    transformation select.c | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 41
  29. [ parser ] | v [ code generator ] Join

    order where*.c whereInt.h | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 42
  30. [ parser ] | v [ code generator ] query

    planning (select.c) | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 43
  31. [ parser ] | v [ code generator ] output

    bytecodes | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 44
  32. [ parser ] | v [ code generator ] build.c,

    delete.c, expr.c | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 45
  33. [ parser ] | v [ code generator ] insert.c,

    update.c | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 46
  34. [ parser ] | v [ code generator ] section

    w most lines of codes | v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 47
  35. [ parser ] | v [ code generator ] |

    v [ VM ] 2nd section in most lines of code | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 48
  36. [ parser ] | v [ code generator ] |

    v [ VM ] vdbe.c, vdbe.h, vdbeLnt.h | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 49
  37. [ parser ] | v [ code generator ] |

    v [ VM ] vdbe*.c, func.c, date.c | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 50
  38. [ parser ] | v [ code generator ] |

    v [ VM ] | Interface defined by btree.h v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] 51
  39. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] has both b- and b+ trees | v [ pager ] | v [ shim ] | v [ OS Interface ] 52
  40. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] b- for indexes* | v [ pager ] | v [ shim ] | v [ OS Interface ] 53
  41. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] b+ for records* | v [ pager ] | v [ shim ] | v [ OS Interface ] 54
  42. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] multiple b-trees per db file | v [ pager ] | v [ shim ] | v [ OS Interface ] 55
  43. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] read via cursor | v [ pager ] | v [ shim ] | v [ OS Interface ] 56
  44. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] concurrent r/w same table diff cursors | v [ pager ] | v [ shim ] | v [ OS Interface ] 57
  45. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] Also called page cache | v [ shim ] | v [ OS Interface ] 58
  46. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] Protects data from power loss | v [ shim ] | v [ OS Interface ] 59
  47. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] 1. Roll back mode | v [ shim ] | v [ OS Interface ] 60
  48. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] 2. Write Ahead Log (WAL) mode | v [ shim ] | v [ OS Interface ] 61
  49. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] Concurrency control (when trying to access) | v [ shim ] | v [ OS Interface ] 62
  50. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] In-memory cache | v [ shim ] | v [ OS Interface ] 63
  51. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] pager.c, pager.h, pcache1.c, pcache.c | v [ shim ] | v [ OS Interface ] 64
  52. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] pcache.h, wal.c, wal.h | v [ shim ] | v [ OS Interface ] 65
  53. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] encryption, compression, logging | v [ OS Interface ] 66
  54. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] used to emulate an OS layer | v [ OS Interface ] 67
  55. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] used for tests to simulate hardware failures | v [ OS Interface ] 68
  56. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] test_multiplex.c, test_vfstrace.c | v [ OS Interface ] 69
  57. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] os-specific interfacing 70
  58. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] changed at runtime 71
  59. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] I/O (test_onefile.c) 72
  60. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] | v [ shim ] | v [ OS Interface ] os.c, os_unix.c, os_win.c, os*.h 73
  61. [ parser ] | v [ code generator ] |

    v [ VM ] | v [ btree ] | v [ pager ] Rollback & Write Ahead Log (WAL) | v [ shim ] | v [ OS Interface ] 76
  62. | | | | | user space | os cache

    | disk | | a | a | x | | a | | x | | | a | x | | | | x | shared 79
  63. | | | | | user space | os cache

    | disk | | a | a | x | | a | | x | | | a | x | | | | x | reserved | | | | journal | | | | | | | | | | | | | | | | 82
  64. | | | | | user space | os cache

    | disk | | a | a | x | | a | | x | | | a | x | | | | x | reserved | | | | journal | | | | | | | | | | a | | | | a | | 83
  65. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | reserved | | | | journal | | | | | | | | | | a | | | | a | | 85
  66. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | reserved | | | | journal | | | | | | | | | | a | a | | | a | a | 86
  67. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 88
  68. | | | | | user space | os cache

    | disk | | b | b | x | | b | | x | | | b | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 90
  69. | | | | | user space | os cache

    | disk | | b | b | b | | b | | x | | | b | b | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 91
  70. | | | | | user space | os cache

    | disk | | b | b | b | | b | | x | | | b | b | | | | x | exclusive | | | | journal | | | | | | | | | | | | | | | | 93
  71. | | | | | user space | os cache

    | disk | | | | b | | | | x | | | | x | | | | x | | | | | journal | | | | | | | | | | | a | | | | a | 95
  72. | | | | | user space | os cache

    | disk | | | | b | | | | x | | | | x | | | | x | shared | | | | journal | | | | | | | | | | | a | | | | a | 97
  73. | | | | | user space | os cache

    | disk | | | | b | | | | x | | | | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 98
  74. | | | | | user space | os cache

    | disk | | | a | b | | | a | x | | | | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 99
  75. | | | | | user space | os cache

    | disk | | | a | a | | | a | a | | | | x | | | | x | exclusive | | | | journal | | | | | | | | | | a | a | | | a | a | 100
  76. Write Ahead Log (WAL) not default as hash lookup for

    page in wal in shared memory, not work in network from diff pc not work with os with dodgy memory mapping 102
  77. | | | | | user space | os cache

    | disk | | a | a | x | | a | | x | | | a | x | | | | x | shared | | | | file.database-wal | | | | | | | | | | | | | | | | 103
  78. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | shared | | | | file.database-wal | | | | | | | | | | | | | | | | 104
  79. | | | | | user space | os cache

    | disk | | b | a | x | | b | | x | | | a | x | | | | x | shared | | | | file.database-wal | | | | | | | | | | b | | | | b | | 105
  80. | | | | | | user2 | user space

    | os cache | disk | | b | b | a <- x | | a | b | | x | | | | a <- x | | | | | x | shared | | | | | | wal | | | | b | | | b | | 106
  81. | | | | | | user2 | user space

    | os cache | disk | | b | b | a | x | | c | b | | x | | | | a | x | | | | | x | shared | | | | | | wal | | | | b | | | c | | 108
  82. | | | | | | user2 | user space

    | os cache | disk | | b | b | a | x | | c | b | | x | | | | a | x | | | | | x | shared | | | | | | wal | | | | b | b | | c | c | 110
  83. | | | | | user space | os cache

    | disk | | | c | x | | | b | x | | | c | x | | | a | x | shared | | | | | | wal | | | | | | | | | 112
  84. 115

  85. libsql> EXPLAIN SELECT price FROM product WHERE price=100; addr opcode

    p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 11 0 0 Start at 11 1 OpenRead 0 30 0 3 0 root=30 iDb=0; product 2 Rewind 0 10 0 0 3 Column 0 2 1 0 r[1]= cursor 0 column 2 4 RealAffinity 1 0 0 0 5 Ne 2 9 1 BINARY-8 85 if r[1]!=r[2] goto 9 6 Column 0 2 3 0 r[3]= cursor 0 column 2 7 RealAffinity 3 0 0 0 8 ResultRow 3 1 0 0 output=r[3] 9 Next 0 3 0 1 10 Halt 0 0 0 0 11 Transaction 0 0 25 0 1 usesStmtJournal=0 12 Integer 100 2 0 0 r[2]=100 13 Goto 0 1 0 0 116
  86. Btree Deals w binary format for data, k int b+

    for tables b- for indexes index | data (in 'record' format) [k][ d ] -> called cell [k][ d ] [k][ d ] [k][ d ] [k][ d ] interpreted by next step 117
  87. x x x x / \ xx xx <- non-leaf

    (holds key only) [k][ d ][k][ d ] [k][ d ][k][ d ] <- leaf k+d in leaves, more than one k in tree (small integers) 118
  88. btree page [ header ][ offset to content ][ free

    ][ content ] content: each cell contains left-child pointer, key, data 119
  89. Record format [8] key and data in record format record

    format [ header ][ body ] both called payloads [7] but mainly body 120
  90. VarInt 1234 10011010010 - b 00010011010010 - 2 x 7

    10001001 01010010 1 beg: more chunk coming 0 beg 2nd chunk no more chunk coming store in two bytes 121
  91. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    | V h15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02 h15 size of row body h01 row id both use varint 123
  92. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    | V h 15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02 data & types name: 0x1b -> TEXT length of 7 bytes. if type val odd & >= 13: TEXT calculated using (n*2) + 13 . (7*2) + 13 -> 27 -> h1b BLOB : (n*2) + 12 Length floating point always encoded as h07 124
  93. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    | V h 15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02 count field which is an integer. These get packed down similar to varints but in a slightly different format. Integers that can fit in an 8-bit integer are represented with a type value of 0x01. 16-bit integers are 0x02, 24-bit integers are 0x03 and so on. 125
  94. INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

    | V h 15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02 values "italian" in utf-8 49 74 61 6c 69 61 6e 126
  95. length value of 7.5 as `=IEEE-754-2008 floating-point number. SQLite can

    optimize integer floating-point values by storing them as pure integer fields but since we have a decimal place it is stored with 8-bytes: 40 1e 00 00 00 00 00 00 count value of 2: 02 127
  96. x x x x / \ xx xx <- non-leaf

    (holds key only) [k][ d ][k][ d ] [k][ d ][k][ d ] <- leaf 129
  97. [ page 1 ] x x x x - [1]

    [ page 2 ] / \ [ page 3 ] [4]- xx xx - [6] [ page 4 ] [ page 5 ] [3] - [k][ d ][k][ d ] [k][ d ][k][ d ] - [8] [ page 6 ] [ page 7 ] x x x x - [5] [ page 8 ] / \ [ page 9 ] [2] - xx xx - [4] [ page 10 ] [ page 11 ] [7] - [k][ d ][k][ d ] [k][ d ][k][ d ] - [10] 130
  98. If table does not fit on page -> overflow to

    next page linkedlist first page always b+ tree holding root node 131
  99. db format db: [page1][page2][page3] page1 start byte - offset byte

    - description 00 - 16: SQLite format 3 16 - 02: Page size in bytes 18 - 01: File format write version 19 - 01: File format read version 20 - 01: Bytes reserved at the end of each page 21 - 01: Max embedded payload fraction 22 - 01: Max embedded payload fraction 23 - 01: Min leaf payload fraction 24 - 04: File change counter 28 - 04: Size of db in pages 32 - 04: First freelist page 36 - 04: Number of freelist pages in the file 40 - 04: Schema cookie number 44 - 56: 14 4-byte meta values passed to higher layers 132
  100. WebSQL WebStorage on the web Not going to implement an

    engine from scratch: Use SQLite "User agents must implement the SQL dialect supported by Sqlite 3.6.19" Example exploit: Omer Gull - SELECT code execution FROM USING SQLite [4] Need upated version of SQLite -> conflict with requirement of 3.6.19 Aug 2022 Chrome: Deprecating and Removing webSQL [5] Memory corruption available from JS 134
  101. With the advent of Wasm, SQL or NoSQL solutions can

    come to the web. One example is DuckDB-Wasm, another is absurd-sql. Based on these creations, we feel that the developer community can iterate on and create new storage solutions faster and better than browser vendors. libSQL introduced native WASM support to SQLite 140
  102. I had this crazy idea that I’m going to build

    a database engine that does not have a server, that talks directly to disk, and ignores the data types, and if you asked any of the experts of the day, they would say, “That’s impossible. That will never work. That’s a stupid idea.” Fortunately, I didn’t know any experts and so I did it anyway, so this sort of thing happens. I think, maybe, just don’t listen to the experts too much and do what makes sense. Solve your problem. 142
  103. If I'd known how hard it would be I probably

    never would've have written it [3] 143
  104. Refs [1] SQLite, A Database for the Edge of the

    Network, DRH, Databaseology Lectures, Carnegie Mellon (2015) [2] CORECURSIVE Podcast, Episode #066, The Untold Story of SQLite [3] Richard Hipp Speaks Out on SQLite, ACM SIGMOD interviews with DB people, Marianne Winslett and Vanessa Braganholo (2019), https://sigmodrecord.org/publications/sigmodRecord/1906 /pdfs/06_Profiles_Hipp.pdf [4] DEF CON 27 - Omer Gull - SELECT code execution FROM USING SQLite, https://www.youtube.com/watch? v=JokZUjwGj4M 145
  105. [5] Deprecating and removing Web SQL, https://developer.chrome.com/blog/deprecating-web-sql/ [6] Craft vulnerable

    db https://github.com/CheckPointSW/QueryOrientedProgram ming/blob/master/qop.py [7] https://www.sqlite.org/fileformat.html#record_format [8] https://fly.io/blog/sqlite-internals-btree/ 146