
Make and Learn RDBMS

yui-knk
March 22, 2019




Transcript

  1. Self Introduction
      • Yuichiro Kaneko
      • Arm Treasure Data, CDP team (writing a Rails application)
      • CRuby committer since 2015/12
      • GitHub: yui-knk
  2. Today's topic
      • Implementation of an RDBMS, especially PostgreSQL
      • How to create a DB step by step
      • Will not talk about CRuby/Rails!
  3. Not covered
      • How to use PostgreSQL efficiently in your Rails app
      • Which RDBMS is best for your Rails app
      • How to design the perfect database/table schema
  4. Background
      • Our customers can store data and run queries against it with Treasure Data
      • It can be said we are developing each part of a DB
      • Knowledge of how a DB works may be useful to understand/guess what other team members do
      • Creating a subset of a DB helps me understand DBs
  5. Today's Goal
      • Grasp the whole picture of a DB
      • Understand which parts of a DB we can omit when we create one
  6. I'm creating
      • https://github.com/yui-knk/minidb
      • About 3,400 lines, implemented in Rust
      • Uses PostgreSQL as a reference: d06fe6ce2c79420fd19ac89ace81b66579f08493 (tag: REL_11_1)
      • First commit date is 2018/12/31
  7. Supported
      • SELECT/INSERT/DELETE/WHERE/COUNT/ORDER BY
      • Storage Manager, with support for multiple pages
      • Buffer Manager
      • Has a parser
      • Signed 4-byte integer is the only supported type
      • Used as a command-line tool
  8. Not Supported
      • UPDATE/JOIN
      • Projection
      • Aggregation
      • Transactions/Concurrency control
      • Vacuum
      • Optimizer
      • Multiple segments
      • INDEX
      • Launching as a server
      • ...
  9. What can I do with “minidb”?
      $ minidb execute 'insert into db1.table1 (id, age) values (1, 12), (2, 13), (3, 12), (4, 20), (5, 21)'
      $ minidb execute 'select * from db1.table1'
      "1"
      "12"
      "2"
      "13"
      "3"
      "12"
      "4"
      "20"
      "5"
      "21"
      $ minidb execute 'select count() from db1.table1'
      Count: 5
  10. What can I do with “minidb”?
      $ minidb execute 'select * from db1.table1 order by age'
      "1"
      "12"
      "3"
      "12"
      "2"
      "13"
      "4"
      "20"
      "5"
      "21"
      $ minidb execute 'select count() from db1.table1 where age = 12'
      Count: 2
      $ minidb execute 'delete from db1.table1 where age = 12'
      Deleted records: 2
      $ minidb execute 'select count() from db1.table1 where age = 12'
      Count: 0
  11. PostgreSQL process structure (the Postmaster forks Backends and the background processes):
      +---------------+
      | Shared Memory |
      +---------------+
      +------------+  fork   +---------+ +---------+ +---------+
      | Postmaster | ------> | Backend | | Backend | | Backend |
      +------------+         +---------+ +---------+ +---------+
            | fork
            v
      +-------------------+  +----------------------+
      | Background Writer |  | Autovacuum           |
      | Checkpointer      |  | WAL Archiver         |
      | WAL Writer        |  | Statistics Collector |
      | WAL Receiver      |  | System Logger        |
      +-------------------+  +----------------------+
  12. The background processes; their main jobs are logging and writing data to disk:
      +-------------------+  +----------------------+
      | Background Writer |  | Autovacuum           |
      | Checkpointer      |  | WAL Archiver         |
      | WAL Writer        |  | Statistics Collector |
      | WAL Receiver      |  | System Logger        |
      +-------------------+  +----------------------+
      +------------------+
      | main             |
      | * log            |
      | * write to disk  |
      +------------------+
  13. "HFOEB • Part 1: The system catalogs and init command

    • Part 2: Storage Manager and Buffer Manager • Part 3: Query Execution • Part 4: Conclusion
  14. Directory structure of the PostgreSQL database directory
      $ ls -l
      total 104
      -rw-------  1 yuichirokaneko admin     3  1 17 22:19 PG_VERSION
      drwx------  7 yuichirokaneko admin   224  1 24 21:05 base
      drwx------ 61 yuichirokaneko admin  1952  3 10 10:44 global
      drwx------  2 yuichirokaneko admin    64  1 17 22:19 pg_commit_ts
      ...
      drwx------  4 yuichirokaneko admin   128  1 17 22:19 pg_wal
      drwx------  3 yuichirokaneko admin    96  1 17 22:19 pg_xact
      -rw-------  1 yuichirokaneko admin    88  1 17 22:19 postgresql.auto.conf
      -rw-------  1 yuichirokaneko admin 23780  1 17 22:19 postgresql.conf
      -rw-------  1 yuichirokaneko admin    80  1 17 22:20 postmaster.opts
      -rw-------  1 yuichirokaneko admin    90  3 10 10:43 postmaster.pid
  15. Directory structure of the PostgreSQL database directory
      (same listing as the previous slide)
      I'm talking about PostgreSQL
  16. Directory structure of the PostgreSQL database directory
      • Go to the database directory
      • "global": "Subdirectory containing cluster-wide tables, such as pg_database" [1]
      • "base": "Subdirectory containing per-database subdirectories" [1]
      • [1]: https://www.postgresql.org/docs/11/storage-file-layout.html
      $ ls -l
      drwx------  7 224  1 24 21:05 base
      drwx------ 61 1952 3 10 10:44 global
  17. $ ls -l global
      total 1160
      -rw------- 1 yuichirokaneko admin  8192 1 17 22:19 1136
      -rw------- 1 yuichirokaneko admin 24576 1 17 22:19 1136_fsm
      -rw------- 1 yuichirokaneko admin  8192 1 17 22:19 1136_vm
      ...
      -rw------- 1 yuichirokaneko admin  8192 1 24 21:10 1262      <- Oid of the table
      -rw------- 1 yuichirokaneko admin 24576 1 17 22:19 1262_fsm
      -rw------- 1 yuichirokaneko admin  8192 1 17 22:19 1262_vm
      ...
      -rw------- 1 yuichirokaneko admin 22088 3 10 10:44 pg_internal.init

      # select oid, relname from pg_class where oid = 1262 order by oid asc;
       oid  |   relname
      ------+-------------
       1262 | pg_database
      (1 row)
  18. base directory
      • Contains directories whose names are database Oids
      • Directory 16388 is for the lusiadas database
      $ ls -l base/
      total 0
      drwx------ 296 9472 2 10 21:15 1
      drwx------ 296 9472 1 17 22:19 13363
      drwx------ 296 9472 3  9 22:12 13364
      drwx------ 297 9504 2 10 21:15 16384
      drwx------ 299 9568 3 10 10:44 16388

      # select oid, datname from pg_database order by oid asc;
        oid  |  datname
      -------+------------
           1 | template1
       13363 | template0
       13364 | postgres
       16384 | example-db
       16388 | lusiadas
      (5 rows)
  19. base directory
      • The directories contain data files
      • File 16389 is for the films table
      $ ls -l base/16388
      total 15784
      -rw------- 1  8192 1 24 21:05 112
      -rw------- 1  8192 1 24 21:05 113
      -rw------- 1 73728 1 24 21:10 1247
      -rw------- 1 24576 1 24 21:05 1247_fsm
      -rw------- 1  8192 1 24 21:10 1247_vm
      ...
      -rw------- 1  8192 2 23 22:31 16389   <- Oid of the table
      ...
      # select oid, relname from pg_class where oid = 16389 order by oid asc;
        oid  | relname
      -------+---------
       16389 | films
      (1 row)
  20. Implement init command
      $ minidb --base_dir /tmp/minidb init
      /tmp/minidb
      ├── base
      └── global
          ├── mini_attribute   <- Columns
          │   └── data
          ├── mini_class       <- Tables
          │   └── data
          ├── mini_database    <- Databases
          │   └── data
          └── oid
  21. Implement init command
      • Create the system catalogs, which are where an RDBMS stores schema metadata
      • "mini_database": database information
      • "mini_class": table information
      • "mini_attribute": column information
      • Simply creating directories and files
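      A minimal sketch of what the init command boils down to, in Rust. The function
      name and the "10000" starting Oid are illustrative, not minidb's actual API:

      use std::fs;
      use std::io;
      use std::path::Path;

      fn init_catalogs(base_dir: &Path) -> io::Result<()> {
          // "base" holds per-database directories; "global" holds the catalogs.
          fs::create_dir_all(base_dir.join("base"))?;
          for catalog in &["mini_database", "mini_class", "mini_attribute"] {
              let dir = base_dir.join("global").join(catalog);
              fs::create_dir_all(&dir)?;
              fs::write(dir.join("data"), "")?; // empty catalog file
          }
          // "oid" stores the next Oid to hand out.
          fs::write(base_dir.join("global").join("oid"), "10000")
      }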
  22. Implement create database command
      $ minidb --base_dir /tmp/minidb create_db db1
      /tmp/minidb
      ├── base
      │   └── 10000
      └── global
          ├── mini_attribute
          │   └── data
          ├── mini_class
          │   └── data
          ├── mini_database
          │   └── data
          └── oid
      $ cat global/mini_database/data
      # Oid,db_name
      10000,db1
  23. Implement create database command
      • Insert a database record into the "mini_database/data" file
      • Don't use a binary format for the system catalogs, to keep them easy to implement
      • The format of this data file is "Oid,database_name"
      • Create a directory named after the Oid (10000) under the base directory
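      A sketch of the create_db flow described above (the helper names are
      hypothetical; minidb's real code goes through its record manager):

      use std::fs::{self, OpenOptions};
      use std::io::{self, Write};
      use std::path::Path;

      fn create_db(base_dir: &Path, db_name: &str) -> io::Result<()> {
          // Allocate an Oid from the global counter file.
          let oid_path = base_dir.join("global").join("oid");
          let oid: u32 = fs::read_to_string(&oid_path)?.trim().parse().unwrap();
          fs::write(&oid_path, (oid + 1).to_string())?;

          // Append an "Oid,database_name" row to the catalog (plain text, not binary).
          let catalog = base_dir.join("global").join("mini_database").join("data");
          let mut f = OpenOptions::new().append(true).create(true).open(catalog)?;
          writeln!(f, "{},{}", oid, db_name)?;

          // Create the Oid-named directory under base/.
          fs::create_dir_all(base_dir.join("base").join(oid.to_string()))
      }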
  24. Implement create table command
      $ minidb --base_dir /tmp/minidb create_table db1 table1
      /tmp/minidb
      ├── base
      │   └── 10000
      │       └── 10001
      └── global
          ├── mini_attribute
          │   └── data
          ├── mini_class
          │   └── data
          ├── mini_database
          │   └── data
          └── oid
      $ cat /tmp/minidb/global/mini_class/data
      # Oid,table_name,db_Oid
      10001,table1,10000
      $ cat /tmp/minidb/global/mini_attribute/data
      # column_name,db_Oid,table_Oid,type,length
      id,10000,10001,1,4
      age,10000,10001,1,4
      Only `id integer, age integer` is supported as the table schema
  25. Implement create table command
      • Insert a table record into the "mini_class/data" file
      • The format of this data file is "Oid,table_name,database_Oid"
      • Insert column records into the "mini_attribute/data" file
      • The format of this data file is "column_name,database_Oid,table_Oid,type,length"
      • Type 1 means signed 4-byte integer
      • The user cannot specify the schema of the table!
      • Only `id integer, age integer` is supported as the table schema
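      A sketch of a catalog record type for the
      "column_name,db_Oid,table_Oid,type,length" format above. The struct name
      mirrors minidb's MiniAttributeRecord, but the methods are illustrative:

      struct MiniAttributeRecord {
          name: String,
          db_oid: u32,
          table_oid: u32,
          ty: u8,   // 1 = signed 4-byte integer
          len: u32, // 4 bytes
      }

      impl MiniAttributeRecord {
          fn to_csv(&self) -> String {
              format!("{},{},{},{},{}",
                      self.name, self.db_oid, self.table_oid, self.ty, self.len)
          }

          fn from_csv(line: &str) -> Option<MiniAttributeRecord> {
              let f: Vec<&str> = line.split(',').collect();
              if f.len() != 5 { return None; }
              Some(MiniAttributeRecord {
                  name: f[0].to_string(),
                  db_oid: f[1].parse().ok()?,
                  table_oid: f[2].parse().ok()?,
                  ty: f[3].parse().ok()?,
                  len: f[4].parse().ok()?,
              })
          }
      }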
  26. How to interact with files
      • Read the catalog file
      • Load data from the catalog file into an array
      • Add the new catalog data to the array
      • Write the array back to the catalog file
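      The whole cycle above fits in a few lines for a line-oriented text catalog;
      a minimal sketch (the function name is illustrative):

      use std::fs;
      use std::io;
      use std::path::Path;

      fn add_catalog_row(path: &Path, new_row: &str) -> io::Result<()> {
          // Read the catalog file and load its rows into an array.
          let content = fs::read_to_string(path)?;
          let mut rows: Vec<String> = content.lines().map(|l| l.to_string()).collect();

          // Add the new catalog data to the array.
          rows.push(new_row.to_string());

          // Write the array back to the catalog file.
          fs::write(path, rows.join("\n") + "\n")
      }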
  27. Important points
      • Use a simple data format, e.g. JSON or CSV
      • Keep the interaction with the catalog files simple
  28. Data file structure in pg
      • "fork": metadata files use the same Oid with a different suffix
      • "main" (0) / "fsm" (1) / "vm" (2) / "init" (3)
      • fsm = Free Space Map, vm = Visibility Map
      • "segment": a fork is divided into multiple files to avoid the OS' limit on file size (see RELSEG_SIZE)
      • "block": a data file is divided into logical blocks, 8KB by default (see BLCKSZ)
      $ ls -hl base/16388/24576*
      -rw------- 1.0G 3 10 16:38 base/16388/24576     <- fork: 0, segment: 0
      -rw-------  25M 3 10 16:38 base/16388/24576.1   <- fork: 0, segment: 1
      -rw------- 280K 3 10 16:38 base/16388/24576_fsm <- fork: 1, segment: 0
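      The fork/segment/block arithmetic can be made concrete with pg's default
      constants (BLCKSZ = 8KB, RELSEG_SIZE = 131072 blocks = 1GB per segment);
      the function name is illustrative:

      const BLCKSZ: u64 = 8192;
      const RELSEG_SIZE: u64 = 131072; // blocks per segment = 1GB / 8KB

      // Given a block number, return which segment file holds it
      // and the byte offset of the block inside that file.
      fn locate_block(block_num: u64) -> (u64, u64) {
          let segment = block_num / RELSEG_SIZE;           // 24576, 24576.1, ...
          let offset = (block_num % RELSEG_SIZE) * BLCKSZ;
          (segment, offset)
      }

      fn main() {
          // Block 131073 lives in segment 1 ("base/16388/24576.1") at offset 8192.
          assert_eq!(locate_block(131073), (1, 8192));
      }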
  29. Data file structure in pg
      • The internal structure of the "base/16388/24576" file looks like below
      • It consists of fixed-size blocks
      $ ls -hl base/16388/24576*
      -rw------- 1.0G 3 10 16:38 base/16388/24576
      -rw-------  25M 3 10 16:38 base/16388/24576.1
      -rw------- 280K 3 10 16:38 base/16388/24576_fsm

      +---------+
      | Block 0 | <- 8KB
      +---------+
      | Block 1 | <- 8KB
      +---------+
      | Block 2 | <- 8KB
      +---------+
      | Block 3 | <- 8KB
      +---------+
      ...
  30. Block structure in pg (from src/include/storage/bufpage.h)
      /*
       * +----------------+---------------------------------+
       * | PageHeaderData | linp1 linp2 linp3 ...           |
       * +-----------+----+---------------------------------+
       * | ... linpN |                                      |
       * +-----------+--------------------------------------+
       * |           ^ pd_lower                             |
       * |                                                  |
       * |           v pd_upper                             |
       * +-------------+------------------------------------+
       * |             | tupleN ...                         |
       * +-------------+------------------+-----------------+
       * |  ... tuple3 tuple2 tuple1      | "special space" |
       * +--------------------------------+-----------------+
       *                                  ^ pd_special
       */
      • Each linp is a 32-bit header entry (see the next slide)
  31. Block structure in pg
      • PageHeader is a header
      • linp is an ItemIdData, 32 bits. It holds info about each tuple, e.g. the offset to the tuple and the length of the tuple
      • linp probably means "line pointer"
      • tuple is a HeapTupleHeaderData. It holds the data of the row
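      One way to picture ItemIdData in Rust: pack the 15-bit offset, 2-bit flags
      and 15-bit length into a u32, as pg's bit fields do (the accessor names
      follow pg; the packing order here is just one possible layout):

      #[derive(Clone, Copy)]
      struct ItemIdData(u32);

      impl ItemIdData {
          fn new(off: u32, flags: u32, len: u32) -> ItemIdData {
              ItemIdData((off & 0x7FFF) | ((flags & 0x3) << 15) | ((len & 0x7FFF) << 17))
          }
          fn lp_off(&self) -> u32 { self.0 & 0x7FFF }          // offset to the tuple
          fn lp_flags(&self) -> u32 { (self.0 >> 15) & 0x3 }   // state of the pointer
          fn lp_len(&self) -> u32 { (self.0 >> 17) & 0x7FFF }  // length of the tuple
      }

      fn main() {
          let lp = ItemIdData::new(8000, 1, 60);
          assert_eq!((lp.lp_off(), lp.lp_flags(), lp.lp_len()), (8000, 1, 60));
      }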
  32. Implement insert command
      • We should implement insert before select :)
      • Only `id integer, age integer` is supported as the table schema
      • `minidb insert_into db_name table_name id age`
      $ minidb insert_into db1 table1 1 12
      $ minidb insert_into db1 table1 2 13
      $ minidb insert_into db1 table1 3 12
  33. Implement insert command
      • Open the data file
      • Load the data into memory; this in-memory copy is called a 'page'
      • Add the tuple to the page
      • Write the data back to the data file
  34. Implement insert command — open file & load data
      pub fn execute(&self, dbname: &str, table_name: &str, key_values: Vec<KeyValue>) -> Result<(), String> {
          let rm: RecordManeger<MiniAttributeRecord> =
              RecordManeger::build_from_config("mini_attribute".to_string(), self.config).unwrap();
          let attrs = rm.attributes(dbname, table_name);
          let mut slot = TupleTableSlot::new(rm.attributes_clone(dbname, table_name));
          let path = self.config.data_file_path(dbname, table_name);
          // open file & load data
          let mut page = if path.exists() {
              Page::load(&path)
          } else {
              Page::new(DEFAULT_BLOCK_SIZE)
          };

          if attrs.len() != key_values.len() {
              return Err(format!("Length not match. attrs: {}, key_values: {}",
                                 attrs.len(), key_values.len()));
          }
  35. Implement insert command — add tuple to page, write data to file
          for ((i, kv), attr) in key_values.iter().enumerate().zip(attrs.iter()) {
              if kv.key != attr.name {
                  return Err(format!("Name not match. attrs: {}, key_values: {}",
                                     attr.name, kv.key));
              }
              let t = ty::build_ty(&attr.type_name, &kv.value)?;
              slot.set_column(i, t.as_ref());
          }

          page.add_tuple_slot_entry(&slot).unwrap(); // add tuple to page
          let f = File::create(path).unwrap();
          page.write_bytes(f);                       // write data to file
          Ok(())
      }
  36. Implement select command
      • `minidb select_from db_name table_name`
      $ minidb select_from db1 table1
      "1"  # <- id of 1st record
      "12" # <- age of 1st record
      "2"  # <- id of 2nd record
      "13" # <- age of 2nd record
      "3"  # <- id of 3rd record
      "12" # <- age of 3rd record
  37. Implement select command
      • Open the data file
      • Load the data into a page
      • Copy the tuple data into a tuple slot (TupleTableSlot)
      • Iterate over the tuples
      • Print each column
  38. Implement select command — open file & load data
      pub fn execute(&self, dbname: &str, table_name: &str, key: &str, value: &str) -> Result<(), String> {
          let rm: RecordManeger<MiniAttributeRecord> =
              RecordManeger::build_from_config("mini_attribute".to_string(), self.config).unwrap();
          let attrs = rm.attributes_clone(dbname, table_name);
          let attrs_len = attrs.iter().fold(0, |acc, attr| acc + attr.len) as u32;
          let mut slot = TupleTableSlot::new(attrs);
          let path = self.config.data_file_path(dbname, table_name);
          // open file & load data
          let mut page = if path.exists() {
              Page::load(&path)
          } else {
              Page::new(DEFAULT_BLOCK_SIZE)
          };
  39. Implement select command — iterate tuples, print columns
          // iterate over each tuple
          for i in 0..(page.entry_count()) {
              // copy data into the tuple slot
              slot.load_data(page.get_entry_pointer(i).unwrap(), attrs_len);
              println!("attrs_count: {:?}", slot.attrs_count());
              // iterate over each column
              for j in 0..(slot.attrs_count()) {
                  // print each column
                  let ty = slot.get_column(j);
                  println!("{:?}", ty.as_string());
              }
          }
          Ok(())
      }
  40. Important points
      • No need to clean up the code of the select/insert commands at present
      • We will introduce the Query Execution logic at a later step
  41. But
      • Each command reads/writes files
      • Each command allocates memory (a page)
      • Introduce a Buffer Manager and a Storage Manager to extract this logic
  42. Buffer Manager
      +------------------+
      | Other components |
      +------------------+
          ^  |  "Give me block 0 of the films table, lusiadas db, ForkNumber 0"
          |  v  (ReadBufferExtended, etc.) — returns buffer identifiers (Buffer)
      +----------------+
      | Buffer Manager |
      +----------------+
          ^  |  load / dump
          |  v
      +----------------+
      |      Disk      |
      +----------------+
  43. Buffer Manager
      • The Buffer Manager manages data in memory
      • All data in files is read/written via the Buffer Manager
      • In many cases we cannot load all the data into the Buffer Manager (memory size <<<< data file size)
      • A "block" may be called a "page" in the Buffer Manager context
      • A user of the Buffer Manager passes the Relation (database & table) and a BlockNumber to get a page
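      A minimal sketch of the lookup this implies: hash the (relation, block) tag,
      return the cached buffer on a hit, load from disk into a new slot on a miss.
      All types here are simplified stand-ins for minidb's:

      use std::collections::HashMap;

      type Buffer = usize;

      #[derive(PartialEq, Eq, Hash)]
      struct BufferTag { db_oid: u32, table_oid: u32, block_num: u32 }

      struct BufferManager {
          pages: Vec<Vec<u8>>, // one block_size allocation per slot
          buffer_hash: HashMap<BufferTag, Buffer>,
      }

      impl BufferManager {
          fn read_buffer(&mut self, tag: BufferTag) -> Buffer {
              if let Some(&buf) = self.buffer_hash.get(&tag) {
                  return buf; // already cached: return the existing buffer
              }
              // Miss: load the block into a new slot (a real disk read would go
              // here; like the talk's version, there is no eviction).
              self.pages.push(vec![0u8; 8192]);
              let buf = self.pages.len() - 1;
              self.buffer_hash.insert(tag, buf);
              buf
          }
      }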
  44. How to use Buffer Manager
      Request: Get (db1, tbl2, blk0)
      pages:
      +------+------+
      | 8KB  | 8KB  |
      +------+------+
         ^      ^
         |      |
         v      v
      buffer_descriptors:
      +------+------+
      | db1  | db1  |
      | tbl1 | tbl2 |
      | blk0 | blk0 |
      | flag | flag |
      +------+------+
      buffer_hash:
      +------+------+
      | db1  | db1  |
      | tbl1 | tbl2 |
      | blk0 | blk0 |
      +------+------+
         |      |
         v      v
      +------+------+
      |  0   |  1   |
      +------+------+
  45. How to use Buffer Manager
      (same state as the previous slide)
  46. How to use Buffer Manager
      Get (db1, tbl2, blk0) -> the buffer_hash lookup returns Buffer (1)
  47. How to use Buffer Manager
      Get the page of Buffer (1)
  48. How to use Buffer Manager
      A reference to page (1) is returned
  49. How to use Buffer Manager
      New request: Get (db1, tbl1, blk1) — not in buffer_hash (a miss)
      (same two-entry state as slide 44)
  50. How to use Buffer Manager
      Get (db1, tbl1, blk1) — the missing block is loaded into a new slot (index 2):
      pages:
      +------+------+------+
      | 8KB  | 8KB  | 8KB  |
      +------+------+------+
         ^      ^      ^
         |      |      |
         v      v      v
      buffer_descriptors:
      +------+------+------+
      | db1  | db1  | db1  |
      | tbl1 | tbl2 | tbl1 |
      | blk0 | blk0 | blk1 |
      | flag | flag | flag |
      +------+------+------+
      buffer_hash:
      +------+------+------+
      | db1  | db1  | db1  |
      | tbl1 | tbl2 | tbl1 |
      | blk0 | blk0 | blk1 |
      +------+------+------+
         |      |      |
         v      v      v
      +------+------+------+
      |  0   |  1   |  2   |
      +------+------+------+
  51. How to use Buffer Manager
      Get (db1, tbl1, blk1) -> Buffer (2)
  52. Implement Buffer Manager
      • Page: malloc'ed block_size (8KB) memory
      • BufferTag: Relation (database & table) and BlockNumber
      // We malloc c memory for header, lines and tuples.
      pub struct Page {
          header: *mut PageHeaderData,
      }

      struct BufferTag {
          rnode: RelFileNode,
          block_num: BlockNumber,
      }
  53. Implement Buffer Manager
      • pages: array of Page
      • buffer_descriptors: array of info about each Page (BufferTag, dirty flag, etc.)
      • buffer_hash: map from BufferTag to buffer identifier (usize)
      pub struct BufferManager {
          config: Rc<Config>,
          smgr: StorageManager,
          buffer_descriptors: Vec<BufferDesc>,
          pages: Vec<Page>,
          // Hash from BufferTag to index of descriptor and page.
          // See LocalBufHash in pg.
          buffer_hash: HashMap<BufferTag, Buffer>,
      }
  54. Limitations of Buffer Manager
      • No eviction logic
      • Cannot load a file into memory if all slots are used
      • That's acceptable if we implement the DB as a command
      • It may not be acceptable if we implement the DB as a server
      • No lock mechanism
  55. Important points
      • Dump data to files in the "destructor" (Drop trait)
      • Be careful to return the cached page if the page is already cached
      • I introduced a bug because I forgot this…
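      A sketch of the destructor idea: when the command finishes and the
      BufferManager is dropped, dirty pages get written back. The field names are
      illustrative:

      struct BufferManager {
          pages: Vec<Vec<u8>>,
          dirty: Vec<bool>,
      }

      impl Drop for BufferManager {
          fn drop(&mut self) {
              for (i, page) in self.pages.iter().enumerate() {
                  if self.dirty[i] {
                      // A real implementation would write the page back to
                      // disk through the Storage Manager here.
                      println!("flushing buffer {} ({} bytes)", i, page.len());
                  }
              }
          }
      }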
  56. +------------------+
      | Other components |
      +------------------+
          ^  |  (ReadBufferExtended, etc.)
          |  v  "Give me block 0 of the films table, lusiadas db, ForkNumber 0"
      +----------------+
      | Buffer Manager |
      +----------------+
          ^  |  (smgrread, etc.)
          |  v  "Give me block 0 of the films table, lusiadas db, ForkNumber 0"
      +-----------------+
      | Storage Manager |
      +-----------------+
          ^  |  (mdread, etc.)
          |  v  "Read block-size data from the appropriate file into a buf"
      +----------------+
      |      Disk      |
      +----------------+
  57. typedef enum ForkNumber {
          InvalidForkNumber = -1,
          MAIN_FORKNUM = 0,
          FSM_FORKNUM,
          VISIBILITYMAP_FORKNUM,
          INIT_FORKNUM
      } ForkNumber;

      #define MAX_FORKNUM INIT_FORKNUM

      typedef struct _MdfdVec {
          File        mdfd_vfd;    /* fd number in fd.c's pool */
          BlockNumber mdfd_segno;  /* segment number, from 0 */
      } MdfdVec;

      typedef struct SMgrRelationData {
          RelFileNodeBackend smgr_rnode;   /* relation physical identifier */
          struct SMgrRelationData **smgr_owner;
          BlockNumber smgr_targblock;      /* current insertion target block */
          BlockNumber smgr_fsm_nblocks;    /* last known size of fsm fork */
          BlockNumber smgr_vm_nblocks;     /* last known size of vm fork */
          int         smgr_which;          /* storage manager selector */
          int         md_num_open_segs[MAX_FORKNUM + 1];
          struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
          struct SMgrRelationData *next_unowned_reln;
      } SMgrRelationData;

      • md_seg_fds is an array of `_MdfdVec *`
      • see: the `_mdfd_getseg` function
  58. Forks and segments
      md_seg_fds:
            main           fsm            vm             init
      +------------+--------------+--------------+--------------+
      |            |              |              |              |
      +------------+--------------+--------------+--------------+
            |             |
            v             v
      +------------+  +--------------+
      | fd 24576   |  | fd 24576_fsm |
      | seg 0      |  | seg 0        |
      +------------+  +--------------+
      | fd 24576.1 |
      | seg 1      |
      +------------+

      base/16388/24576     <- fork: 0, segment: 0
      base/16388/24576.1   <- fork: 0, segment: 1
      base/16388/24576_fsm <- fork: 1, segment: 0
  59. Implement Storage Manager
      • cache: map from RelFileNode to SMgrRelationData
      • SMgrRelationData: holds the fd
      pub struct SMgrRelationData {
          config: Rc<Config>,
          pub smgr_rnode: RelFileNode,
          file: Option<File>,
          // current insertion target block
          pub smgr_targblock: BlockNumber,
      }

      pub struct StorageManager {
          config: Rc<Config>,
          cache: HashMap<RelFileNode, RefCell<SMgrRelationData>>,
      }
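      What smgr-style block I/O boils down to, as a sketch: seek to
      block_num * block_size in the relation's file and read exactly one block
      (the function name is illustrative):

      use std::fs::File;
      use std::io::{self, Read, Seek, SeekFrom};

      const BLOCK_SIZE: u64 = 8192;

      fn read_block(file: &mut File, block_num: u64, buf: &mut [u8; 8192]) -> io::Result<()> {
          file.seek(SeekFrom::Start(block_num * BLOCK_SIZE))?;
          file.read_exact(buf)
      }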
  60. Limitations of Storage Manager
      • Forks are not supported
      • Segments are not supported
      $ ls -hl base/16388/24576*
      -rw------- 1.0G 3 10 16:38 base/16388/24576     <- fork: 0, segment: 0
      -rw-------  25M 3 10 16:38 base/16388/24576.1   <- fork: 0, segment: 1
      -rw------- 280K 3 10 16:38 base/16388/24576_fsm <- fork: 1, segment: 0
  61. Rewrite insert command
      • The problems were:
      • Each command reads/writes files
      • Each command allocates memory (a page)
      • Now the insert command does not open files directly
      • Now the insert command does not allocate memory directly
  62. pub struct InsertState<'a> {
          slot: &'a TupleTableSlot,
          ss_currentRelation: &'a RelFileNode,
      }

      impl<'a> InsertState<'a> {
          pub fn new(rnode: &'a RelFileNode, slot: &'a TupleTableSlot) -> InsertState<'a> {
              InsertState {
                  ss_currentRelation: rnode,
                  slot: slot,
              }
          }

          pub fn exec_insert(&mut self, bufmrg: &mut BufferManager) {
              // get the page from the buffer manager
              let buf = bufmrg.read_buffer(self.ss_currentRelation.clone(), 0);
              let page = bufmrg.get_page_mut(buf);
              // write the tuple to the page
              page.add_tuple_slot_entry(self.slot).unwrap();
          }
      }
  63. Table example
      =# \d films
                         Table "public.films"
        Column   |          Type           | Collation | Nullable | Default
      -----------+-------------------------+-----------+----------+---------
       code      | character(5)            |           | not null |
       title     | character varying(40)   |           | not null |
       did       | integer                 |           | not null |
       date_prod | date                    |           |          |
       kind      | character varying(10)   |           |          |
       len       | interval hour to minute |           |          |
      Indexes:
          "firstkey" PRIMARY KEY, btree (code)

      =# select * from films;
       code | title | did | date_prod | kind | len
      ------+-------+-----+-----------+------+-----
       c1   | T1    | 101 |           |      |
       c2   | T2    | 102 |           |      |
       c3   | T3    | 103 |           |      |
      (3 rows)
  64. +-----------------------+
      | Query string (char *) |
      +-----------------------+
          | parse (pg_parse_query)
          v
      +---------------------------------+
      | Raw parse tree (struct RawStmt) |
      +---------------------------------+
          | analyze/rewrite (pg_analyze_and_rewrite)
          v
      +---------------------------+
      | Query Tree (struct Query) |
      +---------------------------+
          | optimize/generate tree of plan nodes (pg_plan_queries)
          v
      +-------------------------+
      | Plan tree (struct Plan) |
      +-------------------------+
          | setup query execution (ExecInitNode)
          v
      +-----------------------------------+
      | PlanState tree (struct PlanState) |
      +-----------------------------------+
          | execute query (ExecutorRun)
          v
  65. Parse
      select * from films where did = 1;

      ParseState:
        parsetree_list
          RawStmt: stmt
            SelectStmt:
              targetList
                ResTarget: name: , indirection:
                  val ColumnRef: fields: *
              fromClause
                RangeVar: catalogname: , schemaname: , relname: films
              whereClause
                A_Expr: kind: OP, name: =
                  lexpr ColumnRef: fields: did
                  rexpr A_Const: val: 1
  66. "OBMZ[F3FXSJUF • Generate Query Tree • Replace user input attributes

    to internal representations • Semantic analysis
  67. "OBMZ[F3FXSJUF Query: SELECT(id: 0) rtable (# 1) RangeTblEntry: rtekind: RELATION,

    relid: 16389 jointree FromExpr: quals OpExpr: opno: 96, opfuncid: 65 Var: varno: 1, varattno: 3 Const: consttype: 23, constlen: 4, constvalue: 1 target_list (# 6) TargetEntry: resno: 1, resname: code Var: varno: 1, varattno: 1 TargetEntry: resno: 2, resname: title Var: varno: 1, varattno: 2 TargetEntry: resno: 3, resname: did Var: varno: 1, varattno: 3 … ParseState: parsetree_list RawStmt: stmt SelectStmt: targetList ResTarget: name: , indirection: val ColumnRef: fields: * fromClause RangeVar: relname: films whereClause A_Expr: kind: OP, name: = lexpr ColumnRef: fields: did rexpr A_Const: val: 1 Expanded
  68. "OBMZ[F3FXSJUF Query: SELECT(id: 0) rtable (# 1) RangeTblEntry: rtekind: RELATION,

    relid: 16389 jointree FromExpr: quals OpExpr: opno: 96, opfuncid: 65 Var: varno: 1, varattno: 3 Const: consttype: 23, constlen: 4, constvalue: 1 target_list (# 6) TargetEntry: resno: 1, resname: code Var: varno: 1, varattno: 1 TargetEntry: resno: 2, resname: title Var: varno: 1, varattno: 2 TargetEntry: resno: 3, resname: did Var: varno: 1, varattno: 3 … ParseState: parsetree_list RawStmt: stmt SelectStmt: targetList ResTarget: name: , indirection: val ColumnRef: fields: * fromClause RangeVar: relname: films whereClause A_Expr: kind: OP, name: = lexpr ColumnRef: fields: did rexpr A_Const: val: 1 Internal representations where did = 1
  69. "OBMZ[F3FXSJUF Query: SELECT(id: 0) rtable (# 1) RangeTblEntry: rtekind: RELATION,

    relid: 16389 jointree FromExpr: quals OpExpr: opno: 96, opfuncid: 65 Var: varno: 1, varattno: 3 Const: consttype: 23, constlen: 4, constvalue: 1 target_list (# 6) TargetEntry: resno: 1, resname: code Var: varno: 1, varattno: 1 TargetEntry: resno: 2, resname: title Var: varno: 1, varattno: 2 TargetEntry: resno: 3, resname: did Var: varno: 1, varattno: 3 … # select oid, relname from pg_class where oid = 16389 order by oid asc; oid | relname -------+--------- 16389 | films (1 row) films
  70. "OBMZ[F3FXSJUF Query: SELECT(id: 0) rtable (# 1) RangeTblEntry: rtekind: RELATION,

    relid: 16389 jointree FromExpr: quals OpExpr: opno: 96, opfuncid: 65 Var: varno: 1, varattno: 3 Const: consttype: 23, constlen: 4, constvalue: 1 target_list (# 6) TargetEntry: resno: 1, resname: code Var: varno: 1, varattno: 1 TargetEntry: resno: 2, resname: title Var: varno: 1, varattno: 2 TargetEntry: resno: 3, resname: did Var: varno: 1, varattno: 3 … where did = 1 /* catalog/pg_operator.dat */ { oid => ’96', oid_symbol => ‘Int4EqualOperator' oprname => ‘=', oprleft => ‘int4', oprright => ‘int4', oprresult => 'bool' }, /* catalog/pg_type.dat */ { oid => ’23', descr => '-2 billion to 2 billion integer, 4-byte storage', typname => ‘int4', typlen => '4' }, select * from films where did = 1;
  71. Optimize/Generate tree of plan nodes
      • There can be many ways to compute the result tuples
      • How to get tuples from disk
      • Read files in order (SeqScan) / use an index (IndexScan) / use only an index (IndexOnlyScan)
      • How to aggregate tuples
      • Use an internal hash table / sort tuples before aggregation
      • How to join tables
      • Nested Loop Join / (Sort-)Merge Join / Hash Join
  72. Optimize/Generate tree of plan nodes
      Query: SELECT(id: 0)
        rtable (# 1)
          RangeTblEntry: relid: 16389
        jointree
          FromExpr: quals
            OpExpr: opno: 96, opfuncid: 65
              Var: varno: 1, varattno: 3
              Const: consttype: 23, constlen: 4, constvalue: 1
        target_list (# 6)
          TargetEntry: resno: 1, resname: code
            Var: varno: 1, varattno: 1
          TargetEntry: resno: 2, resname: title
            Var: varno: 1, varattno: 2
          TargetEntry: resno: 3, resname: did
            Var: varno: 1, varattno: 3
          …
          | becomes
          v
      PlannedStmt: command_type: SELECT
        SeqScan:
          targetlist
            TargetEntry: resno: 1, resname: code
              Var: varno: 1, varattno: 1
            TargetEntry: resno: 2, resname: title
              Var: varno: 1, varattno: 2
            TargetEntry: resno: 3, resname: did
              Var: varno: 1, varattno: 3
            ...
          qual
            OpExpr: opno: 96, opfuncid: 65
              Var: varno: 1, varattno: 3
              Const: consttype: 23, constlen: 4, constvalue: 1
  73. • Sort gets tuples from its lefttree, a SeqScan in this case
      select * from films where did = 1 order by code;

      PlannedStmt: command_type: SELECT
        Sort: num_cols: 1, sort_col_idx: [1], sort_operators: [1058]
          targetlist
            TargetEntry: resno: 1, resname: code
              Var: varno: 65001, varattno: 1
            TargetEntry: resno: 2, resname: title
              Var: varno: 65001, varattno: 2
            ...
          qual
          lefttree
            SeqScan:
              targetlist
                TargetEntry: resno: 1, resname:
                  Var: varno: 1, varattno: 1
                TargetEntry: resno: 2, resname:
                  Var: varno: 1, varattno: 2
                ...
              qual
                OpExpr: opno: 96, opfuncid: 65
                  Var: varno: 1, varattno: 3
                  Const: consttype: 23, constlen: 4, constvalue: 1
  74. Implement count command
      • In pg this is implemented as an aggregation
      • But aggregation is difficult…
      $ minidb select_from_count db1 table1
      Count: 5
  75. Implement count command
      +-------------+
      | Count       |
      | * counter   |
      | * scanner   |
      +-------------+
            |
      +---------+
      | SeqScan |
      +---------+
            |
      +------+
      | Disk |
      +------+
  76. (the query-processing pipeline from slide 64)
      • minidb directly creates the Plan tree
  77. impl<'a> CountState<'a> {
          pub fn new(lefttree: ScanState<'a>) -> CountState<'a> {
              CountState {
                  lefttree: lefttree, // the scanner
                  result: 0,          // the counter
              }
          }

          // `ExecAgg` in pg.
          pub fn exec_agg(&mut self, bufmrg: &mut BufferManager) {
              loop {
                  // get a tuple
                  let opt = self.lefttree.exec_scan(bufmrg);
                  match opt {
                      Some(_slot) => {
                          // count up
                          self.result = self.result + 1;
                      },
                      None => break
                  }
              }
          }
      }
  78. impl CountCommnad {
          ...
          pub fn execute(&self, dbname: &str, table_name: &str) -> Result<(), String> {
              ...
              let mut rmgr = RelationManager::new(self.config.clone());
              let relation = rmgr.get_relation(db_oid, table_oid);
              let mut bm = BufferManager::new(1, self.config.clone());
              let scan = ScanState::new(relation, &rm, &mut bm);
              // pass the scan to count
              let mut count = CountState::new(scan);
              // execute
              count.exec_agg(&mut bm);
              // get the result of the count
              println!("Count: {}", count.result);
              Ok(())
          }
      }
  79. Implement parser
      • Use lalrpop, an LR(1) parser generator
      • Rewrite the select_from / select_from_count / insert_into commands into an execute command
      • Before: "minidb select_from db1 table1"
      • After: "minidb execute 'select * from db1.table1'"
      • I wanted some visible progress when I switched to using a parser :)
      • Writing a parser with a parser generator is fun!!!
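      For flavor, this is roughly what a lalrpop rule looks like, in the style of
      minidb's parser.lalrpop (the token and nonterminal names here are assumptions
      based on the diff shown on slide 95):

      pub Statement: Stmt = {
          // DELETE FROM dbname "." tablename [WHERE expr]
          Delete_T From_T <fi: From_Item> <ow: OptWhere> => {
              Stmt::DeleteStmt(fi.0, fi.1, ow)
          },
      };

      OptWhere: Option<Box<Expr>> = {
          Where_T <e: Expression> => Some(Box::new(e)),
          => None,
      };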
  80. (the query-processing pipeline from slide 64)
      • minidb does not analyze
      • minidb does not optimize
  81. Limitations of the query executor
      • The analyzer is not implemented
      • Semantic analysis has to happen while the query is executed
      • The optimizer is not implemented
      • We could not use an index even if indexes were implemented
  82. Implement delete
      $ minidb /tmp/minidb execute 'select count() from db1.table1'
      Count: 5
      $ minidb execute 'delete from db1.table1'
      Deleted records: 5
      $ minidb /tmp/minidb execute 'select count() from db1.table1'
      Count: 0
  83. Implement delete
      • Add a deleted flag to the tuple
      • We should recreate the data files or write a migration command
      • The delete command sets the flag
      • SeqScan checks the flag
      • Count does not need to care about the flag (see the sketch after this slide)
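      A sketch of the delete flow on a simplified in-memory tuple type (all names
      here are stand-ins, not minidb's real types): scan, set the flag on tuples
      the where clause matches, and count them.

      struct Tuple { age: i32, deleted: bool }

      fn exec_delete(tuples: &mut [Tuple], qual: impl Fn(&Tuple) -> bool) -> usize {
          let mut deleted = 0;
          for t in tuples.iter_mut() {
              // Skip already-deleted tuples and tuples the where clause rejects.
              if !t.deleted && qual(t) {
                  t.deleted = true; // just set the flag; vacuum would reclaim later
                  deleted += 1;
              }
          }
          deleted
      }

      fn main() {
          let mut rows = vec![
              Tuple { age: 12, deleted: false },
              Tuple { age: 13, deleted: false },
              Tuple { age: 12, deleted: false },
          ];
          // delete from table1 where age = 12
          assert_eq!(exec_delete(&mut rows, |t| t.age == 12), 2);
      }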
  84. Implement delete
      • Add a deleted flag to the tuple
      (block layout as on slide 30)

      Before:            After:
      +---------+        +---------+
      |         |        | flag(s) |
      |  Data   |        +---------+
      |         |        |         |
      +---------+        |  Data   |
                         |         |
                         +---------+
  85. Check the delete flag in scan
      diff --git a/src/node_seqscan.rs b/src/node_seqscan.rs
      index 80f7cff..2edeb2b 100644
      --- a/src/node_seqscan.rs
      +++ b/src/node_seqscan.rs
      @@ -135,8 +135,19 @@ impl<'a> ScanState<'a> {
               let dp = bufmrg.get_page(scan_desc.rs_cbuf);
               let mut t_self = ItemPointerData::new();
               ::tuple::item_pointer_set(&mut t_self, scan_desc.rs_cblock, lineoff);
      +        debug!("lp_len {}", dp.get_item_ref(lineoff).lp_len());
               scan_desc.rs_ctup.load_without_len(dp.get_entry_pointer(lineoff).unwrap(), t_self);
      -        return
      +
      +        // Skip deleted record
      +        if scan_desc.rs_ctup.t_data.heap_keys_updated_p() {
      +            debug!("Skip deleted tuple {}", lineoff);
      +            lineoff = lineoff + 1;
      +            linesleft = linesleft - 1;
      +            // next
      +        } else {
      +            debug!("Return tuple {}", lineoff);
      +            return
      +        }
           }
  86. • Count does not need to care about the flag
      • The lefttree omits deleted tuples
      (CountState code as on slide 77 — exec_agg just counts whatever the scan returns)
  87. &WBMVBUJOHFYQSFTTJPOT XIFSF DMBVTF select * from films where did =

    1; QueryDesc (SELECT): planstate SeqScanState: qual ExprSate: SCAN_FETCHSOME, SCAN_VAR, FUNCEXPR_STRICT, QUAL, DONE ss_ScanTupleSlot TupleTableSlot: tts_isempty: true, tts_fixedTupleDescriptor: true tts_tupleDescriptor TupleDesc: natts: 6 attrs PgAttribute: attname: code ...
  88. &WBMVBUJOHFYQSFTTJPOT XIFSF DMBVTF select * from films where did =

    1; QueryDesc (SELECT): planstate SeqScanState: qual ExprSate: SCAN_FETCHSOME, SCAN_VAR, FUNCEXPR_STRICT, QUAL, DONE ss_ScanTupleSlot TupleTableSlot: tts_isempty: true, tts_fixedTupleDescriptor: true tts_tupleDescriptor TupleDesc: natts: 6 attrs PgAttribute: attname: code ...
  89. &WBMVBUJOHFYQSFTTJPOT XIFSF DMBVTF select * from films where did =

    1; QueryDesc (SELECT): planstate SeqScanState: qual ExprSate: SCAN_FETCHSOME, SCAN_VAR, FUNCEXPR_STRICT, QUAL, DONE ss_ScanTupleSlot TupleTableSlot: tts_isempty: true, tts_fixedTupleDescriptor: true tts_tupleDescriptor TupleDesc: natts: 6 attrs PgAttribute: attname: code ... Byte codes …
  90. &WBMVBUJOHFYQSFTTJPOT XIFSF DMBVTF • Expression like "where did = 1"

    needed to be evaluated to implement where clause • "direct threaded" or "switch threaded" or “llvmjit" in pg • https://magazine.rubyist.net/articles/0008/0008-YarvManiacs.html • It's overspec, so implement it by tree traverse
  91. Implement where clause
      $ minidb /tmp/minidb execute 'select count() from db1.table1'
      Count: 5
      $ minidb /tmp/minidb execute 'select count() from db1.table1 where true'
      Count: 5
      $ minidb /tmp/minidb execute 'select count() from db1.table1 where false'
      Count: 0
      $ minidb /tmp/minidb execute 'select count() from db1.table1 where true = false'
      Count: 0
      $ minidb /tmp/minidb execute 'select * from db1.table1'
      "1"
      "12"
      "2"
      "13"
      "3"
      "12"
      "4"
      "20"
      "5"
      "21"
      $ minidb /tmp/minidb execute 'select count() from db1.table1 where age = 12'
      Count: 2
  92. Implement where clause
      pub enum Expr {
          Bool(bool),
          Number(i32),
          OpEq(Box<Expr>, Box<Expr>), // "="
          ColumnRef(String),          // column name
      }

      struct ExprEvaluator<'a> {
          currentTuple: &'a TupleTableSlot, // has the current tuple
          expr: &'a Expr,
      }

      where did = 1:
      +------+
      | OpEq |
      +------+
        |
        | +------------------+
        +-| ColumnRef("did") |
        | +------------------+
        |
        | +-----------+
        +-| Number(1) |
          +-----------+
  93. impl<'a> ScanState<'a> {
          fn exec(&mut self) -> Option<&TupleTableSlot> {
              loop {
                  // fetch the next tuple from the page
                  self.seq_next();
                  if self.ss_currentScanDesc.rs_finished {
                      return None;
                  }
                  // check the where clause
                  if self.exec_qual() {
                      return Some(self.ss_ScanTupleSlot.as_ref());
                  }
                  // next tuple
              }
          }

          fn exec_qual(&self) -> bool {
              if self.qual.is_none() {
                  // The condition is always met
                  return true;
              }
              // set the current tuple on the evaluator, then call eval
              let evaluator = ExprEvaluator::new(self.ss_ScanTupleSlot.as_ref(),
                                                 self.qual.as_ref().unwrap().as_ref());
              evaluator.eval()
          }
      }
  94. impl<'a> ExprEvaluator<'a> {
          fn eval_rec(&self, expr: &Expr) -> Box<Value> {
              match expr {
                  // simple cases
                  Expr::Bool(b) => {
                      Box::new(Value::Bool(b.clone()))
                  },
                  Expr::Number(n) => {
                      Box::new(Value::String(n.to_string()))
                  },
                  // eval the operands, then compare
                  Expr::OpEq(e1, e2) => {
                      let v1 = self.eval_rec(e1);
                      let v2 = self.eval_rec(e2);
                      Box::new(Value::Bool(self.op_eq(v1.as_ref(), v2.as_ref())))
                  },
                  // get the value from the current tuple
                  Expr::ColumnRef(col_name) => {
                      let i = self.currentTuple.get_index_from_name(col_name);
                      let c = self.currentTuple.get_column(i);
                      Box::new(Value::String(c.as_string()))
                  },
              }
          }
  95. "EEXIFSFUPEFMFUFDPNNBOE diff --git a/src/ast.rs b/src/ast.rs @@ -4,8 +4,8 @@ pub

    enum Stmt { SelectStmt(Box<Expr>, String, String, Option<Box<Expr>>), // dbname, tablename, keys, values InsertStmt(String, String, Vec<String>, Vec<Vec<String>>), - // dbname, tablename - DeleteStmt(String, String), + // dbname, tablename, where_clause + DeleteStmt(String, String, Option<Box<Expr>>), } diff --git a/src/parser.lalrpop b/src/parser.lalrpop @@ -12,8 +12,8 @@ pub Statement: Stmt = { Stmt::InsertStmt(fi.0, fi.1, li1, li2) }, // DELETE FROM dbname "." tablename - Delete_T From_T <fi: From_Item> => { - Stmt::DeleteStmt(fi.0, fi.1) + Delete_T From_T <fi: From_Item> <ow:OptWhere> => { + Stmt::DeleteStmt(fi.0, fi.1, ow) }, } Add where member to Node Change syntax to accept where
  96. "EEXIFSFUPEFMFUFDPNNBOE diff --git a/src/dml.rs b/src/dml.rs @@ -122,7 +122,7 @@ impl

    DeleteCommnad { } } - pub fn execute(&self, dbname: &str, table_name: &str, cmgr: &CatalogManager) -> Result<(), String> { + pub fn execute(&self, dbname: &str, table_name: &str, cmgr: &CatalogManager, qual: &Option<Box<Expr>>) -> Result<(), String> { let db_oid = cmgr.database_rm.find_mini_database_oid(dbname) .expect(&format!("{} database should be defined.", dbname)); let table_oid = cmgr.class_rm.find_mini_class_oid(db_oid, table_name) @@ -131,7 +131,7 @@ impl DeleteCommnad { let mut rmgr = RelationManager::new(self.config.clone()); let relation = rmgr.get_relation(db_oid, table_oid); let mut bm = BufferManager::new(1, self.config.clone()); - let scan = ScanState::new(relation, &rm, &mut bm, &None); + let scan = ScanState::new(relation, &rm, &mut bm, qual); let mut delete = DeleteState::new(relation, scan); delete.exec_delete(&mut bm); Pass where expressions to scan
  97. Steps to create a DB
      1. Implement the init command
      2. Implement the create database / create table commands
      3. Implement the insert command / select command
      4. Implement the buffer manager / storage manager
         • Rewrite the insert command to use these managers
      5. Implement the count command
      6. Implement the parser
      7. Implement delete
      8. Implement the where clause
         • Add where to delete
  98. Limitations of each component
      • The system catalogs
        • Use a simple data format, e.g. JSON or CSV, not a binary format
      • Create table
        • Signed 4-byte integer is the only supported type
        • Only `id integer, age integer` is supported as the table schema
      • Buffer Manager
        • No eviction logic
        • No lock mechanism
  99. Limitations of each component
      • Storage Manager
        • Forks are not supported
        • Segments are not supported
      • Count command
        • The count logic is not generalized to aggregation
      • Delete command
        • Vacuum is not implemented
      • Where clause
        • Only a few basic operations are supported
  100. Limitations of each component
      • Query Execution
        • The analyzer is not implemented
        • Semantic analysis has to happen while the query is executed
        • The optimizer is not implemented
        • We could not use an index even if indexes were implemented
  101. References
      • Raghu Ramakrishnan, Johannes Gehrke: "Database Management Systems"
      • Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: "Database Systems: The Complete Book (Pearson New International Edition)"
      • Tomonari Katsumata, Masaki Saeki, Toshi Harada: "[改訂新版] 内部構造から学ぶPostgreSQL 設計・運用計画の鉄則" (Learning PostgreSQL from Its Internals: Rules for Design and Operation Planning, revised edition)
      • "The Internals of PostgreSQL" http://www.interdb.jp/pg/index.html
      • "PostgreSQLの覚え書きのインデックス" (Index of PostgreSQL Notes) http://www.nminoru.jp/~nminoru/postgresql/
  102. "DLOPXMFEHNFOUT • @takeshinoda and @hkdnet • Reviewing the slide •

    Oracle Corporation Japan • Giving me the chance to talk about DB in Oracle!!!