
Make and Learn RDBMS


yui-knk

March 22, 2019



Transcript

  1. Self Introduction
     • Yuichiro Kaneko
     • Arm Treasure Data
     • CDP team (writing a Rails application)
     • CRuby Committer 2015/12~
     • GitHub (yui-knk)
  2. Today's topic
     • Implementation of RDBMS, especially PostgreSQL
     • How to create a DB step by step
     • Will not talk about CRuby/Rails!
  3. Not covered
     • How to use PostgreSQL efficiently in your Rails app
     • What RDBMS is best for your Rails app
     • How to design the perfect database/table schema
  4. Background
     • Our customers can store data and run queries against their data with Treasure Data
     • It can be said that we are developing each part of a DB
     • Knowledge of how a DB works may be useful to understand/guess what other team members do
     • Creating a subset of a DB helps me understand DBs
  5. Today's Goal
     • Grasp the whole picture of a DB
     • Understand which parts of a DB we can omit when we create one
  6. I'm creating
     • https://github.com/yui-knk/minidb
     • About 3400 lines
     • Implemented in Rust
     • Uses PostgreSQL as a reference
       • d06fe6ce2c79420fd19ac89ace81b66579f08493 (tag: REL_11_1)
     • First commit date is 2018/12/31
  7. Supported
     • SELECT/INSERT/DELETE/WHERE/COUNT/ORDER BY
     • Storage Manager
       • Supports multiple pages
     • Buffer Manager
     • Has a parser
     • Signed 4-byte integer is the only supported type
     • Used as a command-line tool
  8. Not Supported
     • UPDATE/JOIN
     • Projection
     • Aggregation
     • Transaction/Concurrency control
     • Vacuum
     • Optimizer
     • Multiple segments
     • INDEX
     • Launch as a server
     • ...
  9. What can I do with "minidb"
     $ minidb execute 'insert into db1.table1 (id, age) values (1, 12), (2, 13), (3, 12), (4, 20), (5, 21)'
     $ minidb execute 'select * from db1.table1'
     "1"
     "12"
     "2"
     "13"
     "3"
     "12"
     "4"
     "20"
     "5"
     "21"
     $ minidb execute 'select count() from db1.table1'
     Count: 5
  10. What can I do with "minidb"
      $ minidb execute 'select * from db1.table1 order by age'
      "1"
      "12"
      "3"
      "12"
      "2"
      "13"
      "4"
      "20"
      "5"
      "21"
      $ minidb execute 'select count() from db1.table1 where age = 12'
      Count: 2
      $ minidb execute 'delete from db1.table1 where age = 12'
      Deleted records: 2
      $ minidb execute 'select count() from db1.table1 where age = 12'
      Count: 0
  11. (Diagram: PostgreSQL process architecture)
      +---------------+
      | Shared Memory |
      +---------------+

      +------------+  fork   +---------+
      | Postmaster | ------> | Backend |  (one per connection)
      +------------+         +---------+

      Auxiliary processes (also forked by the Postmaster):
      +-------------------+   +----------------------+
      | Background Writer |   | Autovacuum           |
      | Checkpointer      |   | WAL Archiver         |
      | WAL Writer        |   | Statistics Collector |
      | WAL Receiver      |   | System Logger        |
      +-------------------+   +----------------------+
  12. (Diagram: the auxiliary processes from slide 11 again)
      Their main jobs are:
      • log
      • write to disk
  13. "HFOEB • Part 1: The system catalogs and init command

    • Part 2: Storage Manager and Buffer Manager • Part 3: Query Execution • Part 4: Conclusion
  14. Directory structure of PostgreSQL database directory
      $ ls -l
      total 104
      -rw-------  1 yuichirokaneko admin     3 1 17 22:19 PG_VERSION
      drwx------  7 yuichirokaneko admin   224 1 24 21:05 base
      drwx------ 61 yuichirokaneko admin  1952 3 10 10:44 global
      drwx------  2 yuichirokaneko admin    64 1 17 22:19 pg_commit_ts
      ...
      drwx------  4 yuichirokaneko admin   128 1 17 22:19 pg_wal
      drwx------  3 yuichirokaneko admin    96 1 17 22:19 pg_xact
      -rw-------  1 yuichirokaneko admin    88 1 17 22:19 postgresql.auto.conf
      -rw-------  1 yuichirokaneko admin 23780 1 17 22:19 postgresql.conf
      -rw-------  1 yuichirokaneko admin    80 1 17 22:20 postmaster.opts
      -rw-------  1 yuichirokaneko admin    90 3 10 10:43 postmaster.pid
  15. Directory structure of PostgreSQL database directory
      (Same listing as slide 14.) I'm talking about PostgreSQL
  16. Directory structure of PostgreSQL database directory
      • Go to the database directory
      • "global": "Subdirectory containing cluster-wide tables, such as pg_database" [1]
      • "base": "Subdirectory containing per-database subdirectories" [1]
      • [1]: https://www.postgresql.org/docs/11/storage-file-layout.html
      $ ls -l
      drwx------  7  224 1 24 21:05 base
      drwx------ 61 1952 3 10 10:44 global
  17. $ ls -l global
      total 1160
      -rw------- 1 yuichirokaneko admin  8192 1 17 22:19 1136
      -rw------- 1 yuichirokaneko admin 24576 1 17 22:19 1136_fsm
      -rw------- 1 yuichirokaneko admin  8192 1 17 22:19 1136_vm
      ...
      -rw------- 1 yuichirokaneko admin  8192 1 24 21:10 1262      <- Oid of the table
      -rw------- 1 yuichirokaneko admin 24576 1 17 22:19 1262_fsm
      -rw------- 1 yuichirokaneko admin  8192 1 17 22:19 1262_vm
      ...
      -rw------- 1 yuichirokaneko admin 22088 3 10 10:44 pg_internal.init

      # select oid, relname from pg_class where oid = 1262 order by oid asc;
       oid  |   relname
      ------+-------------
       1262 | pg_database
      (1 row)
  18. base directory
      • Contains directories whose names are the same as database Oids
      • Directory 16388 is for the lusiadas database
      $ ls -l base/
      total 0
      drwx------ 296 9472 2 10 21:15 1
      drwx------ 296 9472 1 17 22:19 13363
      drwx------ 296 9472 3  9 22:12 13364
      drwx------ 297 9504 2 10 21:15 16384
      drwx------ 299 9568 3 10 10:44 16388

      # select oid, datname from pg_database order by oid asc;
        oid  |  datname
      -------+------------
           1 | template1
       13363 | template0
       13364 | postgres
       16384 | example-db
       16388 | lusiadas
      (5 rows)
  19. base directory
      • The directories contain data files
      • File 16389 is for the films table
      $ ls -l base/16388
      total 15784
      -rw------- 1  8192 1 24 21:05 112
      -rw------- 1  8192 1 24 21:05 113
      -rw------- 1 73728 1 24 21:10 1247
      -rw------- 1 24576 1 24 21:05 1247_fsm
      -rw------- 1  8192 1 24 21:10 1247_vm
      ...
      -rw------- 1  8192 2 23 22:31 16389   <- Oid of the table
      ...

      # select oid, relname from pg_class where oid = 16389 order by oid asc;
        oid  | relname
      -------+---------
       16389 | films
      (1 row)
  20. Implement init command
      $ minidb --base_dir /tmp/minidb init
      /tmp/minidb
      ├── base
      └── global
          ├── mini_attribute    <- Columns
          │   └── data
          ├── mini_class        <- Tables
          │   └── data
          ├── mini_database     <- Databases
          │   └── data
          └── oid
  21. Implement init command
      • Create the system catalogs, which are the place where an RDBMS stores schema metadata
      • "mini_database": Database information
      • "mini_class": Table information
      • "mini_attribute": Column information
      • Simply creating directories and files (see the sketch below)
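      Since init only creates directories and empty files, it fits in a few lines of Rust.
      This is a minimal sketch, not minidb's actual code; the initial Oid value and the
      layout of the "oid" file are assumptions based on the tree on slide 20:

      use std::fs;
      use std::path::Path;

      // Minimal sketch of init: the system catalogs are just directories
      // and empty "data" files under the base directory.
      fn init(base_dir: &Path) -> std::io::Result<()> {
          fs::create_dir_all(base_dir.join("base"))?;
          for catalog in &["mini_attribute", "mini_class", "mini_database"] {
              let dir = base_dir.join("global").join(catalog);
              fs::create_dir_all(&dir)?;
              fs::File::create(dir.join("data"))?; // empty catalog file
          }
          // Assumption: the "oid" file holds the next Oid to assign.
          fs::write(base_dir.join("global").join("oid"), b"10000")?;
          Ok(())
      }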
  22. Implement create database command
      $ minidb --base_dir /tmp/minidb create_db db1
      /tmp/minidb
      ├── base
      │   └── 10000
      └── global
          ├── mini_attribute
          │   └── data
          ├── mini_class
          │   └── data
          ├── mini_database
          │   └── data
          └── oid
      $ cat global/mini_database/data
      # Oid,db_name
      10000,db1
  23. Implement create database command
      • Insert a database record into the "mini_database/data" file
      • Do not use a binary format for the system catalogs, to make them easy to implement
      • The format of this data file is "Oid,database_name"
      • Create a directory named after the Oid (10000) under the base directory
  24. Implement create table command
      $ minidb --base_dir /tmp/minidb create_table db1 table1
      /tmp/minidb
      ├── base
      │   └── 10000
      │       └── 10001
      └── global
          ├── mini_attribute
          │   └── data
          ├── mini_class
          │   └── data
          ├── mini_database
          │   └── data
          └── oid
      $ cat /tmp/minidb/global/mini_class/data
      # Oid,table_name,db_Oid
      10001,table1,10000
      $ cat /tmp/minidb/global/mini_attribute/data
      # column_name,db_Oid,table_Oid,type,length
      id,10000,10001,1,4
      age,10000,10001,1,4
      (Only `id integer, age integer` is supported as the table schema)
  25. Implement create table command
      • Insert a table record into the "mini_class/data" file
        • The format of this data file is "Oid,table_name,database_Oid"
      • Insert column records into the "mini_attribute/data" file
        • The format of this data file is "column_name,database_Oid,table_Oid,type,length"
        • Type 1 means signed 4-byte integer
      • Users can not specify the schema of the table!
        • Only `id integer, age integer` is supported as the table schema
  26. How to interact with files
      • Read the catalog file
      • Load data from the catalog file into an array
      • Add the new catalog data to the array
      • Write the array back to the catalog file (see the sketch below)
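      As a rough illustration of this read/modify/write cycle in Rust (the path and
      record here are examples, not minidb's actual API):

      use std::fs;

      // Sketch: append one CSV record to a catalog file,
      // e.g. append_record("global/mini_database/data", "10000,db1").
      fn append_record(path: &str, record: &str) -> std::io::Result<()> {
          // Read the catalog file and load it into an array of lines.
          let contents = fs::read_to_string(path)?;
          let mut lines: Vec<String> = contents.lines().map(String::from).collect();
          // Add the new catalog data to the array.
          lines.push(record.to_string());
          // Write the array back to the catalog file.
          fs::write(path, lines.join("\n") + "\n")
      }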
  27. Important points
      • Use a simple data format, e.g. JSON or CSV
      • Interact with the catalog files in a simple way
  28. Data file structure in pg
      • "fork": metadata files use the same Oid with a different suffix
        • "main" (0) / "fsm" (1) / "vm" (2) / "init" (3)
        • fsm = Free Space Map, vm = Visibility Map
      • "segment": a fork is divided into multiple files to avoid the OS' limit on file size (see: RELSEG_SIZE)
      • "block": a data file is divided into logical blocks, 8KB by default (see: BLCKSZ)
      $ ls -hl base/16388/24576*
      -rw------- 1.0G 3 10 16:38 base/16388/24576     <- fork: 0, segment: 0
      -rw-------  25M 3 10 16:38 base/16388/24576.1   <- fork: 0, segment: 1
      -rw------- 280K 3 10 16:38 base/16388/24576_fsm <- fork: 1, segment: 0
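      The mapping from a block number to a (segment file, byte offset) pair is simple
      arithmetic. A sketch with pg's default constants (8KB blocks, 131072 blocks per
      segment, i.e. 1GB segments, which matches the 1.0G file in the listing above):

      const BLCKSZ: u64 = 8192;
      const RELSEG_SIZE: u64 = 131072; // blocks per segment

      // Which segment file a block lives in, and its byte offset there.
      fn locate(block_num: u64) -> (u64, u64) {
          let segment = block_num / RELSEG_SIZE;           // file: 24576, 24576.1, ...
          let offset = (block_num % RELSEG_SIZE) * BLCKSZ; // offset inside that file
          (segment, offset)
      }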
  29. Data file structure in pg
      • The internal structure of the "base/16388/24576" file looks like below
      • It has fixed-size blocks
      $ ls -hl base/16388/24576*
      -rw------- 1.0G 3 10 16:38 base/16388/24576
      -rw-------  25M 3 10 16:38 base/16388/24576.1
      -rw------- 280K 3 10 16:38 base/16388/24576_fsm

      +---------+
      | Block 0 | <- 8KB
      +---------+
      | Block 1 | <- 8KB
      +---------+
      | Block 2 | <- 8KB
      +---------+
      | Block 3 | <- 8KB
      +---------+
      ...
  30. Block structure in pg (from src/include/storage/bufpage.h)
      /*
       * +----------------+---------------------------------+
       * | PageHeaderData | linp1 linp2 linp3 ...           |
       * +-----------+----+---------------------------------+
       * | ... linpN |                                      |
       * +-----------+--------------------------------------+
       * |           ^ pd_lower                             |
       * |                                                  |
       * |             v pd_upper                           |
       * +-------------+------------------------------------+
       * |             | tupleN ...                         |
       * +-------------+------------------+-----------------+
       * |       ... tuple3 tuple2 tuple1 | "special space" |
       * +--------------------------------+-----------------+
       *                                  ^ pd_special
       */
      (Each 8KB block in the file has this layout; each linp entry in the header is 32 bits.)
  31. Block structure in pg
      • PageHeader is a header
      • linp has the ItemIdData type (32 bits). It holds info about each tuple, e.g. the offset to the tuple and the length of the tuple
      • Maybe linp means "line pointer"
      • tuple has the HeapTupleHeaderData type. It holds the data of the row
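      For reference, pg packs the line pointer into one 32-bit value (lp_off: 15 bits,
      lp_flags: 2 bits, lp_len: 15 bits). A sketch of accessors in Rust; the exact bit
      packing below is chosen for illustration, not copied from pg or minidb:

      // 32-bit line pointer, mirroring pg's ItemIdData field widths:
      // lp_off (15 bits) | lp_flags (2 bits) | lp_len (15 bits).
      #[derive(Copy, Clone)]
      struct ItemIdData(u32);

      impl ItemIdData {
          fn lp_off(&self) -> u32 { self.0 & 0x7FFF }         // offset to the tuple
          fn lp_flags(&self) -> u32 { (self.0 >> 15) & 0x3 }  // tuple state flags
          fn lp_len(&self) -> u32 { (self.0 >> 17) & 0x7FFF } // length of the tuple
      }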
  32. Implement insert command
      • We should implement insert before select :)
      • Only `id integer, age integer` is supported as the table schema
      • `minidb insert_into db_name table_name id age`
      $ minidb insert_into db1 table1 1 12
      $ minidb insert_into db1 table1 2 13
      $ minidb insert_into db1 table1 3 12
  33. Implement insert command
      • Open the data file
      • Load the data into memory; this in-memory copy is called a 'page'
      • Add the tuple to the page
      • Write the data back into the data file
  34. Implement insert command
      pub fn execute(&self, dbname: &str, table_name: &str, key_values: Vec<KeyValue>) -> Result<(), String> {
          let rm: RecordManeger<MiniAttributeRecord> =
              RecordManeger::build_from_config("mini_attribute".to_string(), self.config).unwrap();
          let attrs = rm.attributes(dbname, table_name);
          let mut slot = TupleTableSlot::new(rm.attributes_clone(dbname, table_name));
          let path = self.config.data_file_path(dbname, table_name);

          // open file & load data
          let mut page = if path.exists() {
              Page::load(&path)
          } else {
              Page::new(DEFAULT_BLOCK_SIZE)
          };

          if attrs.len() != key_values.len() {
              return Err(format!("Length not match. attrs: {}, key_values: {}",
                                 attrs.len(), key_values.len()));
          }
  35. Implement insert command
          for ((i, kv), attr) in key_values.iter().enumerate().zip(attrs.iter()) {
              if kv.key != attr.name {
                  return Err(format!("Name not match. attrs: {}, key_values: {}",
                                     attr.name, kv.key));
              }
              let t = ty::build_ty(&attr.type_name, &kv.value)?;
              slot.set_column(i, t.as_ref());
          }

          // add tuple to page
          page.add_tuple_slot_entry(&slot).unwrap();

          // write data to file
          let f = File::create(path).unwrap();
          page.write_bytes(f);
          Ok(())
      }
  36. Implement select command
      • `minidb select_from db_name table_name`
      $ minidb select_from db1 table1
      "1"   # <- id of 1st record
      "12"  # <- age of 1st record
      "2"   # <- id of 2nd record
      "13"  # <- age of 2nd record
      "3"   # <- id of 3rd record
      "12"  # <- age of 3rd record
  37. Implement select command
      • Open the data file
      • Load the data into a page
      • Copy the tuple data into a tuple slot (TupleTableSlot)
      • Print each tuple
      • Print each column
  38. Implement select command
      pub fn execute(&self, dbname: &str, table_name: &str, key: &str, value: &str) -> Result<(), String> {
          let rm: RecordManeger<MiniAttributeRecord> =
              RecordManeger::build_from_config("mini_attribute".to_string(), self.config).unwrap();
          let attrs = rm.attributes_clone(dbname, table_name);
          let attrs_len = attrs.iter().fold(0, |acc, attr| acc + attr.len) as u32;
          let mut slot = TupleTableSlot::new(attrs);
          let path = self.config.data_file_path(dbname, table_name);

          // open file & load data
          let mut page = if path.exists() {
              Page::load(&path)
          } else {
              Page::new(DEFAULT_BLOCK_SIZE)
          };
  39. Implement select command
          // iterate each tuple
          for i in 0..(page.entry_count()) {
              // copy data to the tuple slot
              slot.load_data(page.get_entry_pointer(i).unwrap(), attrs_len);
              println!("attrs_count: {:?}", slot.attrs_count());

              // iterate each column
              for j in 0..(slot.attrs_count()) {
                  let ty = slot.get_column(j);
                  // print each column
                  println!("{:?}", ty.as_string());
              }
          }
          Ok(())
      }
  40. Important points
      • No need to clean up the code of the select/insert commands at present
      • We will introduce the Query Execution logic at a later step
  41. But
      • Each command reads/writes files
      • Each command allocates memory (a page)
      • Introduce a Buffer Manager and a Storage Manager to extract this logic
  42. Buffer Manager
      +------------------+
      | Other components |
      +------------------+
            |  ^   "Give me block 0 of films table, lusiadas db, ForkNumber 0"
            v  |   (ReadBufferExtended, etc.) / returns buffer identifiers (Buffer)
      +----------------+
      | Buffer Manager |
      +----------------+
         load ^  | dump
              |  v
          +--------+
          |  Disk  |
          +--------+
  43. Buffer Manager
      • The Buffer Manager manages data in memory
      • All data in files is read/written via the Buffer Manager
      • In many cases we can not load all data into the Buffer Manager (memory size <<<< data file size)
      • A "Block" may be called a "Page" in the Buffer Manager context
      • A user of the Buffer Manager passes Relation info (database & table) and a BlockNumber to get a page
  44. How to use Buffer Manager
      pages:              +------+------+
                          | 8KB  | 8KB  |
                          +------+------+
                             ^      ^
                             |      |
                             v      v
      buffer_descriptors: +------+------+
                          | db1  | db1  |
                          | tbl1 | tbl2 |
                          | blk0 | blk0 |
                          | flag | flag |
                          +------+------+

      buffer_hash:        +------+------+
                          | db1  | db1  |
                          | tbl1 | tbl2 |
                          | blk0 | blk0 |
                          +------+------+
                             |      |
                             v      v
                          +------+------+
                          |  0   |  1   |
                          +------+------+

      Request: Get (db1, tbl2, blk0)
  45. How to use Buffer Manager
      (Same diagram as slide 44.) Look up (db1, tbl2, blk0) in buffer_hash.
  46. How to use Buffer Manager
      (Same diagram.) buffer_hash maps (db1, tbl2, blk0) to Buffer (1).
  47. How to use Buffer Manager
      (Same diagram.) The caller then asks: Get page of Buffer (1).
  48. How to use Buffer Manager
      (Same diagram.) The Buffer Manager returns a reference to page (1).
  49. How to use Buffer Manager
      (Same two-entry diagram.) Now request Get (db1, tbl1, blk1) — it is not cached yet.
  50. How to use Buffer Manager
      Get (db1, tbl1, blk1): a new page, descriptor, and hash entry are allocated.

      pages:              +------+------+------+
                          | 8KB  | 8KB  | 8KB  |
                          +------+------+------+
                             ^      ^      ^
                             |      |      |
                             v      v      v
      buffer_descriptors: +------+------+------+
                          | db1  | db1  | db1  |
                          | tbl1 | tbl2 | tbl1 |
                          | blk0 | blk0 | blk1 |
                          | flag | flag | flag |
                          +------+------+------+

      buffer_hash:        +------+------+------+
                          | db1  | db1  | db1  |
                          | tbl1 | tbl2 | tbl1 |
                          | blk0 | blk0 | blk1 |
                          +------+------+------+
                             |      |      |
                             v      v      v
                          +------+------+------+
                          |  0   |  1   |  2   |
                          +------+------+------+
  51. How to use Buffer Manager
      (Same diagram as slide 50.) buffer_hash now maps (db1, tbl1, blk1) to Buffer (2).
  52. Implement Buffer Manager
      • Page: malloced block_size (8KB) memory
      • BufferTag: Relation (database & table) and BlockNumber

      // We malloc c memory for header, lines and tuples.
      pub struct Page {
          header: *mut PageHeaderData,
      }

      struct BufferTag {
          rnode: RelFileNode,
          block_num: BlockNumber,
      }
  53. Implement Buffer Manager
      • pages: array of Page
      • buffer_descriptors: array of info about each Page (BufferTag, dirty flag, etc.)
      • buffer_hash: map from BufferTag to buffer identifier (usize)

      pub struct BufferManager {
          config: Rc<Config>,
          smgr: StorageManager,
          buffer_descriptors: Vec<BufferDesc>,
          pages: Vec<Page>,
          // Hash from BufferTag to index of descriptor and page
          // See LocalBufHash in pg.
          buffer_hash: HashMap<BufferTag, Buffer>,
      }
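      Tying the diagrams in slides 44–51 to these fields, the lookup path can be
      sketched as below. This is not minidb's actual code: BufferDesc::new and
      smgr.read_page are assumed helper names, Buffer is assumed to be a usize index,
      BufferTag is assumed to be Clone + Eq + Hash, and eviction is ignored:

      impl BufferManager {
          // Return a cached buffer if the BufferTag is known,
          // otherwise ask the Storage Manager to load the block.
          pub fn read_buffer(&mut self, rnode: RelFileNode, block_num: BlockNumber) -> Buffer {
              let tag = BufferTag { rnode, block_num };
              // Already cached: return the same Buffer (see slide 55!).
              if let Some(buf) = self.buffer_hash.get(&tag) {
                  return *buf;
              }
              // Not cached: load the block from disk.
              // (No eviction: this sketch assumes a free slot exists.)
              let page = self.smgr.read_page(&tag.rnode, tag.block_num);
              self.pages.push(page);
              self.buffer_descriptors.push(BufferDesc::new(tag.clone()));
              let buf = self.pages.len() - 1;
              self.buffer_hash.insert(tag, buf);
              buf
          }
      }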
  54. Limitations of Buffer Manager
      • No eviction logic
        • Can not load a file into memory if all slots are used
        • That is acceptable if we implement the DB as a command
        • It may not be acceptable if we implement the DB as a server
      • No lock mechanism
  55. Important points
      • Dump data to files in the "destructor" (Drop trait)
      • Be careful to return the cached page if the page is already cached
        • I introduced a bug because I forgot this…
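      A sketch of the Drop idea; the is_dirty and tag fields and the smgr.write_page
      method are assumptions made for illustration, not minidb's actual names:

      // Flush dirty pages back to disk when the BufferManager goes away.
      impl Drop for BufferManager {
          fn drop(&mut self) {
              for (i, desc) in self.buffer_descriptors.iter().enumerate() {
                  if desc.is_dirty {
                      // Write the page back to its block on disk.
                      self.smgr.write_page(&desc.tag.rnode, desc.tag.block_num, &self.pages[i]);
                  }
              }
          }
      }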
  56. +------------------+
      | Other components |
      +------------------+
            |  ^   "Give me block 0 of films table, lusiadas db, ForkNumber 0"
            v  |   (ReadBufferExtended, etc.)
      +----------------+
      | Buffer Manager |
      +----------------+
            |  ^   "Give me block 0 of films table, lusiadas db, ForkNumber 0"
            v  |   (smgrread, etc.)
      +-----------------+
      | Storage Manager |
      +-----------------+
            |  ^   "Read block-size data from the appropriate file into a buf"
            v  |   (mdread, etc.)
      +------+
      | Disk |
      +------+
  57. typedef enum ForkNumber {
          InvalidForkNumber = -1,
          MAIN_FORKNUM = 0,
          FSM_FORKNUM,
          VISIBILITYMAP_FORKNUM,
          INIT_FORKNUM
      } ForkNumber;

      #define MAX_FORKNUM INIT_FORKNUM

      typedef struct _MdfdVec {
          File        mdfd_vfd;   /* fd number in fd.c's pool */
          BlockNumber mdfd_segno; /* segment number, from 0 */
      } MdfdVec;

      typedef struct SMgrRelationData {
          RelFileNodeBackend smgr_rnode;    /* relation physical identifier */
          struct SMgrRelationData **smgr_owner;
          BlockNumber smgr_targblock;       /* current insertion target block */
          BlockNumber smgr_fsm_nblocks;     /* last known size of fsm fork */
          BlockNumber smgr_vm_nblocks;      /* last known size of vm fork */
          int         smgr_which;           /* storage manager selector */
          int         md_num_open_segs[MAX_FORKNUM + 1];
          struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
          struct SMgrRelationData *next_unowned_reln;
      } SMgrRelationData;

      • md_seg_fds is an array of `_MdfdVec *`
      • see: the `_mdfd_getseg` function
  58. md_seg_fds:   main           fsm            vm             init       <- Forks
                 +------------+--------------+--------------+--------------+
                 |            |              |              |              |
                 +------------+--------------+--------------+--------------+
                       |             |
                       v             v
                 +------------+--------------+
                 | fd 24576   | fd 24576_fsm |                              <- Segments
                 | seg 0      | seg 0        |
                 +------------+--------------+
                 | fd 24576.1 |
                 | seg 1      |
                 +------------+

      base/16388/24576     <- fork: 0, segment: 0
      base/16388/24576.1   <- fork: 0, segment: 1
      base/16388/24576_fsm <- fork: 1, segment: 0
  59. Implement Storage Manager
      • cache: map from RelFileNode to SMgrRelationData
      • SMgrRelationData: this holds the fd

      pub struct SMgrRelationData {
          config: Rc<Config>,
          pub smgr_rnode: RelFileNode,
          file: Option<File>,
          // current insertion target block
          pub smgr_targblock: BlockNumber,
      }

      pub struct StorageManager {
          config: Rc<Config>,
          cache: HashMap<RelFileNode, RefCell<SMgrRelationData>>,
      }
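      Under the hood, block-level I/O is just seek + read/write at a block-aligned
      offset (cf. mdread/mdwrite in pg). A self-contained sketch, not minidb's actual
      functions:

      use std::fs::File;
      use std::io::{Read, Seek, SeekFrom, Write};

      const BLOCK_SIZE: u64 = 8192;

      // A block is BLOCK_SIZE bytes at offset block_num * BLOCK_SIZE.
      fn read_block(file: &mut File, block_num: u64, buf: &mut [u8]) -> std::io::Result<()> {
          file.seek(SeekFrom::Start(block_num * BLOCK_SIZE))?;
          file.read_exact(buf)
      }

      fn write_block(file: &mut File, block_num: u64, buf: &[u8]) -> std::io::Result<()> {
          file.seek(SeekFrom::Start(block_num * BLOCK_SIZE))?;
          file.write_all(buf)
      }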
  60. Limitations of Storage Manager
      • Forks are not supported
      • Segments are not supported
      $ ls -hl base/16388/24576*
      -rw------- 1.0G 3 10 16:38 base/16388/24576     <- fork: 0, segment: 0
      -rw-------  25M 3 10 16:38 base/16388/24576.1   <- fork: 0, segment: 1
      -rw------- 280K 3 10 16:38 base/16388/24576_fsm <- fork: 1, segment: 0
  61. Rewrite insert command
      • The problems were:
        • Each command reads/writes files
        • Each command allocates memory (a page)
      • Now the insert command does not open files directly
      • Now the insert command does not allocate memory directly
  62. pub struct InsertState<'a> {
          slot: &'a TupleTableSlot,
          ss_currentRelation: &'a RelFileNode,
      }

      impl<'a> InsertState<'a> {
          pub fn new(rnode: &'a RelFileNode, slot: &'a TupleTableSlot) -> InsertState<'a> {
              InsertState {
                  ss_currentRelation: rnode,
                  slot: slot,
              }
          }

          pub fn exec_insert(&mut self, bufmrg: &mut BufferManager) {
              // get page from buffer manager
              let buf = bufmrg.read_buffer(self.ss_currentRelation.clone(), 0);
              let page = bufmrg.get_page_mut(buf);
              // write tuple to page
              page.add_tuple_slot_entry(self.slot).unwrap();
          }
      }
  63. Table example
      =# \d films
                            Table "public.films"
        Column   |          Type           | Collation | Nullable | Default
      -----------+-------------------------+-----------+----------+---------
       code      | character(5)            |           | not null |
       title     | character varying(40)   |           | not null |
       did       | integer                 |           | not null |
       date_prod | date                    |           |          |
       kind      | character varying(10)   |           |          |
       len       | interval hour to minute |           |          |
      Indexes:
          "firstkey" PRIMARY KEY, btree (code)

      =# select * from films;
       code | title | did | date_prod | kind | len
      ------+-------+-----+-----------+------+-----
       c1   | T1    | 101 |           |      |
       c2   | T2    | 102 |           |      |
       c3   | T3    | 103 |           |      |
      (3 rows)
  64. +-----------------------+
      | Query string (char *) |
      +-----------------------+
          | parse (pg_parse_query)
          v
      +---------------------------------+
      | Raw parse tree (struct RawStmt) |
      +---------------------------------+
          | analyze/rewrite (pg_analyze_and_rewrite)
          v
      +---------------------------+
      | Query Tree (struct Query) |
      +---------------------------+
          | optimize/generate tree of plan node (pg_plan_queries)
          v
      +-------------------------+
      | Plan tree (struct Plan) |
      +-------------------------+
          | setup query execution (ExecInitNode)
          v
      +-----------------------------------+
      | PlanState tree (struct PlanState) |
      +-----------------------------------+
          | execute query (ExecutorRun)
          v
  65. Parse
      select * from films where did = 1;

      ParseState:
        parsetree_list
          RawStmt: stmt
            SelectStmt:
              targetList
                ResTarget: name: , indirection:
                  val
                    ColumnRef: fields: *
              fromClause
                RangeVar: catalogname: , schemaname: , relname: films
              whereClause
                A_Expr: kind: OP, name: =
                  lexpr
                    ColumnRef: fields: did
                  rexpr
                    A_Const: val: 1
  66. "OBMZ[F3FXSJUF • Generate Query Tree • Replace user input attributes

    to internal representations • Semantic analysis
  67. "OBMZ[F3FXSJUF Query: SELECT(id: 0) rtable (# 1) RangeTblEntry: rtekind: RELATION,

    relid: 16389 jointree FromExpr: quals OpExpr: opno: 96, opfuncid: 65 Var: varno: 1, varattno: 3 Const: consttype: 23, constlen: 4, constvalue: 1 target_list (# 6) TargetEntry: resno: 1, resname: code Var: varno: 1, varattno: 1 TargetEntry: resno: 2, resname: title Var: varno: 1, varattno: 2 TargetEntry: resno: 3, resname: did Var: varno: 1, varattno: 3 … ParseState: parsetree_list RawStmt: stmt SelectStmt: targetList ResTarget: name: , indirection: val ColumnRef: fields: * fromClause RangeVar: relname: films whereClause A_Expr: kind: OP, name: = lexpr ColumnRef: fields: did rexpr A_Const: val: 1 Expanded
  68. "OBMZ[F3FXSJUF Query: SELECT(id: 0) rtable (# 1) RangeTblEntry: rtekind: RELATION,

    relid: 16389 jointree FromExpr: quals OpExpr: opno: 96, opfuncid: 65 Var: varno: 1, varattno: 3 Const: consttype: 23, constlen: 4, constvalue: 1 target_list (# 6) TargetEntry: resno: 1, resname: code Var: varno: 1, varattno: 1 TargetEntry: resno: 2, resname: title Var: varno: 1, varattno: 2 TargetEntry: resno: 3, resname: did Var: varno: 1, varattno: 3 … ParseState: parsetree_list RawStmt: stmt SelectStmt: targetList ResTarget: name: , indirection: val ColumnRef: fields: * fromClause RangeVar: relname: films whereClause A_Expr: kind: OP, name: = lexpr ColumnRef: fields: did rexpr A_Const: val: 1 Internal representations where did = 1
  69. "OBMZ[F3FXSJUF Query: SELECT(id: 0) rtable (# 1) RangeTblEntry: rtekind: RELATION,

    relid: 16389 jointree FromExpr: quals OpExpr: opno: 96, opfuncid: 65 Var: varno: 1, varattno: 3 Const: consttype: 23, constlen: 4, constvalue: 1 target_list (# 6) TargetEntry: resno: 1, resname: code Var: varno: 1, varattno: 1 TargetEntry: resno: 2, resname: title Var: varno: 1, varattno: 2 TargetEntry: resno: 3, resname: did Var: varno: 1, varattno: 3 … # select oid, relname from pg_class where oid = 16389 order by oid asc; oid | relname -------+--------- 16389 | films (1 row) films
  70. "OBMZ[F3FXSJUF Query: SELECT(id: 0) rtable (# 1) RangeTblEntry: rtekind: RELATION,

    relid: 16389 jointree FromExpr: quals OpExpr: opno: 96, opfuncid: 65 Var: varno: 1, varattno: 3 Const: consttype: 23, constlen: 4, constvalue: 1 target_list (# 6) TargetEntry: resno: 1, resname: code Var: varno: 1, varattno: 1 TargetEntry: resno: 2, resname: title Var: varno: 1, varattno: 2 TargetEntry: resno: 3, resname: did Var: varno: 1, varattno: 3 … where did = 1 /* catalog/pg_operator.dat */ { oid => ’96', oid_symbol => ‘Int4EqualOperator' oprname => ‘=', oprleft => ‘int4', oprright => ‘int4', oprresult => 'bool' }, /* catalog/pg_type.dat */ { oid => ’23', descr => '-2 billion to 2 billion integer, 4-byte storage', typname => ‘int4', typlen => '4' }, select * from films where did = 1;
  71. Optimize / Generate tree of plan node
      • There are many ways to calculate the result tuples
      • How to get tuples from disk
        • Read files in order (SeqScan) / use an index (IndexScan) / use only an index (IndexOnlyScan)
      • How to aggregate tuples
        • Use an internal hash table / sort tuples before aggregation
      • How to join tables
        • Nested Loop Join / (Sort-)Merge Join / Hash Join
  72. Optimize / Generate tree of plan node
      The Query tree from slide 67 becomes this PlannedStmt:

      PlannedStmt: command_type: SELECT
        SeqScan:
          targetlist
            TargetEntry: resno: 1, resname: code
              Var: varno: 1, varattno: 1
            TargetEntry: resno: 2, resname: title
              Var: varno: 1, varattno: 2
            TargetEntry: resno: 3, resname: did
              Var: varno: 1, varattno: 3
            ...
          qual
            OpExpr: opno: 96, opfuncid: 65
              Var: varno: 1, varattno: 3
              Const: consttype: 23, constlen: 4, constvalue: 1
  73. select * from films where did = 1 order by code;
      • Sort gets tuples from its lefttree, a SeqScan in this case

      PlannedStmt: command_type: SELECT
        Sort: num_cols: 1, sort_col_idx: [1], sort_operators: [1058]
          targetlist
            TargetEntry: resno: 1, resname: code
              Var: varno: 65001, varattno: 1
            TargetEntry: resno: 2, resname: title
              Var: varno: 65001, varattno: 2
            ...
          qual
          lefttree
            SeqScan:
              targetlist
                TargetEntry: resno: 1, resname:
                  Var: varno: 1, varattno: 1
                TargetEntry: resno: 2, resname:
                  Var: varno: 1, varattno: 2
                ...
              qual
                OpExpr: opno: 96, opfuncid: 65
                  Var: varno: 1, varattno: 3
                  Const: consttype: 23, constlen: 4, constvalue: 1
  74. Implement count command
      • In pg this is implemented as aggregation
      • But aggregation is difficult…
      $ minidb select_from_count db1 table1
      Count: 5
  75. Implement count command
      +-------------+
      | Count       |
      | * counter   |
      | * scanner   |
      +-------------+
            |
      +---------+
      | SeqScan |
      +---------+
            |
      +------+
      | Disk |
      +------+
  76. (Same pipeline diagram as slide 64.)
      • Directly create the Plan tree
  77. impl<'a> CountState<'a> {
          pub fn new(lefttree: ScanState<'a>) -> CountState<'a> {
              CountState {
                  lefttree: lefttree, // the scanner
                  result: 0,
              }
          }

          // `ExecAgg` in pg.
          pub fn exec_agg(&mut self, bufmrg: &mut BufferManager) {
              loop {
                  // get a tuple
                  let opt = self.lefttree.exec_scan(bufmrg);
                  match opt {
                      Some(_slot) => {
                          // count up the counter
                          self.result = self.result + 1;
                      },
                      None => break
                  }
              }
          }
      }
  78. impl CountCommnad {
          ...
          pub fn execute(&self, dbname: &str, table_name: &str) -> Result<(), String> {
              ...
              let mut rmgr = RelationManager::new(self.config.clone());
              let relation = rmgr.get_relation(db_oid, table_oid);
              let mut bm = BufferManager::new(1, self.config.clone());
              let scan = ScanState::new(relation, &rm, &mut bm);
              // pass the scan to count
              let mut count = CountState::new(scan);
              // execute
              count.exec_agg(&mut bm);
              // get the result of the count
              println!("Count: {}", count.result);
              Ok(())
          }
      }
  79. Implement parser
      • Use lalrpop, an LR(1) parser generator
      • Rewrite the select_from / select_from_count / insert_into commands into a single execute command
        • Before: "minidb select_from db1 table1"
        • After: "minidb execute 'select * from db1.table1'"
      • I wanted some visible progress when I switched to using a parser :)
      • Writing a parser with a parser generator is fun!!!
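      For flavor, a toy lalrpop rule that recognizes the select form above. This is an
      illustrative grammar, not minidb's real one (see slide 95 for real snippets):

      // toy.lalrpop
      grammar;

      pub Statement: (String, String) = {
          // select * from dbname.tablename
          "select" "*" "from" <db:Ident> "." <t:Ident> => (db, t),
      };

      Ident: String = {
          <s:r"[a-zA-Z_][a-zA-Z0-9_]*"> => s.to_string(),
      };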
  80. (Same pipeline diagram as slide 64.)
      • Do not analyze
      • Do not optimize
  81. Limitations of the query executor
      • The analyzer is not implemented
        • Semantic analysis has to be done when the query is executed
      • The optimizer is not implemented
        • Indexes could not be used even if indexes were implemented
  82. Implement delete
      $ minidb /tmp/minidb execute 'select count() from db1.table1'
      Count: 5
      $ minidb execute 'delete from db1.table1'
      Deleted records: 5
      $ minidb /tmp/minidb execute 'select count() from db1.table1'
      Count: 0
  83. Implement delete
      • Add a deleted flag to tuples
        • This requires recreating data files or writing a migration command
      • The delete command sets the flag
      • SeqScan checks the flag
      • Count need not care about the flag
  84. Implement delete
      • Add a deleted flag to tuples (cf. the bufpage.h layout on slide 30)

      Before:         After:
      +---------+     +---------+
      |         |     | flag(s) |
      |  Data   |     +---------+
      |         |     |         |
      +---------+     |  Data   |
                      |         |
                      +---------+
  85. Check the delete flag in scan:

      diff --git a/src/node_seqscan.rs b/src/node_seqscan.rs
      index 80f7cff..2edeb2b 100644
      --- a/src/node_seqscan.rs
      +++ b/src/node_seqscan.rs
      @@ -135,8 +135,19 @@ impl<'a> ScanState<'a> {
                   let dp = bufmrg.get_page(scan_desc.rs_cbuf);
                   let mut t_self = ItemPointerData::new();
                   ::tuple::item_pointer_set(&mut t_self, scan_desc.rs_cblock, lineoff);
      +            debug!("lp_len {}", dp.get_item_ref(lineoff).lp_len());
                   scan_desc.rs_ctup.load_without_len(dp.get_entry_pointer(lineoff).unwrap(), t_self);
      -            return
      +
      +            // Skip deleted record
      +            if scan_desc.rs_ctup.t_data.heap_keys_updated_p() {
      +                debug!("Skip deleted tuple {}", lineoff);
      +                lineoff = lineoff + 1;
      +                linesleft = linesleft - 1;
      +                // next
      +            } else {
      +                debug!("Return tuple {}", lineoff);
      +                return
      +            }
               }
  86. (Same CountState code as slide 77.)
      • Count need not care about the flag
      • lefttree omits deleted tuples
  87. Evaluating expressions (where clause)
      select * from films where did = 1;

      QueryDesc (SELECT):
        planstate
          SeqScanState:
            qual
              ExprState: SCAN_FETCHSOME, SCAN_VAR, FUNCEXPR_STRICT, QUAL, DONE
            ss_ScanTupleSlot
              TupleTableSlot: tts_isempty: true, tts_fixedTupleDescriptor: true
                tts_tupleDescriptor
                  TupleDesc: natts: 6
                    attrs
                      PgAttribute: attname: code
                      ...
  88. Evaluating expressions (where clause)
      (Same QueryDesc tree as slide 87.)
  89. Evaluating expressions (where clause)
      (Same QueryDesc tree.) The ExprState steps (SCAN_FETCHSOME, SCAN_VAR,
      FUNCEXPR_STRICT, QUAL, DONE) are byte codes…
  90. Evaluating expressions (where clause)
      • Expressions like "where did = 1" need to be evaluated to implement the where clause
      • pg uses "direct threaded", "switch threaded" or "llvmjit" dispatch
        • https://magazine.rubyist.net/articles/0008/0008-YarvManiacs.html
      • That is overkill here, so implement it with a tree traversal
  91. Implement where clause
      $ minidb /tmp/minidb execute 'select count() from db1.table1'
      Count: 5
      $ minidb /tmp/minidb execute 'select count() from db1.table1 where true'
      Count: 5
      $ minidb /tmp/minidb execute 'select count() from db1.table1 where false'
      Count: 0
      $ minidb /tmp/minidb execute 'select count() from db1.table1 where true = false'
      Count: 0
      $ minidb /tmp/minidb execute 'select * from db1.table1'
      "1"
      "12"
      "2"
      "13"
      "3"
      "12"
      "4"
      "20"
      "5"
      "21"
      $ minidb /tmp/minidb execute 'select count() from db1.table1 where age = 12'
      Count: 2
  92. Implement where clause
      pub enum Expr {
          Bool(bool),
          Number(i32),
          OpEq(Box<Expr>, Box<Expr>), // "="
          ColumnRef(String),          // column name
      }

      // has the current tuple
      struct ExprEvaluator<'a> {
          currentTuple: &'a TupleTableSlot,
          expr: &'a Expr,
      }

      where did = 1
      +------+
      | OpEq |
      +------+
        |
        | +------------------+
        +-| ColumnRef("did") |
        | +------------------+
        |
        | +-----------+
        +-| Number(1) |
          +-----------+
  93. impl<'a> ScanState<'a> {
          fn exec(&mut self) -> Option<&TupleTableSlot> {
              loop {
                  // fetch the next tuple from the page
                  self.seq_next();
                  if self.ss_currentScanDesc.rs_finished {
                      return None;
                  }
                  // check the where clause
                  if self.exec_qual() {
                      return Some(self.ss_ScanTupleSlot.as_ref());
                  }
                  // next tuple
              }
          }

          fn exec_qual(&self) -> bool {
              if self.qual.is_none() {
                  // The condition is always met
                  return true;
              }
              // set the current tuple on the evaluator, then call eval
              let evaluator = ExprEvaluator::new(self.ss_ScanTupleSlot.as_ref(),
                                                 self.qual.as_ref().unwrap().as_ref());
              evaluator.eval()
          }
      }
  94. impl<'a> ExprEvaluator<'a> {
          fn eval_rec(&self, expr: &Expr) -> Box<Value> {
              match expr {
                  // simple cases
                  Expr::Bool(b) => {
                      Box::new(Value::Bool(b.clone()))
                  },
                  Expr::Number(n) => {
                      Box::new(Value::String(n.to_string()))
                  },
                  // eval the operands, then compare
                  Expr::OpEq(e1, e2) => {
                      let v1 = self.eval_rec(e1);
                      let v2 = self.eval_rec(e2);
                      Box::new(Value::Bool(self.op_eq(v1.as_ref(), v2.as_ref())))
                  },
                  // get the value from the current tuple
                  Expr::ColumnRef(col_name) => {
                      let i = self.currentTuple.get_index_from_name(col_name);
                      let c = self.currentTuple.get_column(i);
                      Box::new(Value::String(c.as_string()))
                  },
              }
          }
  95. "EEXIFSFUPEFMFUFDPNNBOE diff --git a/src/ast.rs b/src/ast.rs @@ -4,8 +4,8 @@ pub

    enum Stmt { SelectStmt(Box<Expr>, String, String, Option<Box<Expr>>), // dbname, tablename, keys, values InsertStmt(String, String, Vec<String>, Vec<Vec<String>>), - // dbname, tablename - DeleteStmt(String, String), + // dbname, tablename, where_clause + DeleteStmt(String, String, Option<Box<Expr>>), } diff --git a/src/parser.lalrpop b/src/parser.lalrpop @@ -12,8 +12,8 @@ pub Statement: Stmt = { Stmt::InsertStmt(fi.0, fi.1, li1, li2) }, // DELETE FROM dbname "." tablename - Delete_T From_T <fi: From_Item> => { - Stmt::DeleteStmt(fi.0, fi.1) + Delete_T From_T <fi: From_Item> <ow:OptWhere> => { + Stmt::DeleteStmt(fi.0, fi.1, ow) }, } Add where member to Node Change syntax to accept where
  96. "EEXIFSFUPEFMFUFDPNNBOE diff --git a/src/dml.rs b/src/dml.rs @@ -122,7 +122,7 @@ impl

    DeleteCommnad { } } - pub fn execute(&self, dbname: &str, table_name: &str, cmgr: &CatalogManager) -> Result<(), String> { + pub fn execute(&self, dbname: &str, table_name: &str, cmgr: &CatalogManager, qual: &Option<Box<Expr>>) -> Result<(), String> { let db_oid = cmgr.database_rm.find_mini_database_oid(dbname) .expect(&format!("{} database should be defined.", dbname)); let table_oid = cmgr.class_rm.find_mini_class_oid(db_oid, table_name) @@ -131,7 +131,7 @@ impl DeleteCommnad { let mut rmgr = RelationManager::new(self.config.clone()); let relation = rmgr.get_relation(db_oid, table_oid); let mut bm = BufferManager::new(1, self.config.clone()); - let scan = ScanState::new(relation, &rm, &mut bm, &None); + let scan = ScanState::new(relation, &rm, &mut bm, qual); let mut delete = DeleteState::new(relation, scan); delete.exec_delete(&mut bm); Pass where expressions to scan
  97. Steps to creating a DB
      1. Implement init command
      2. Implement create database / create table commands
      3. Implement insert command / select command
      4. Implement buffer manager / storage manager
         • Rewrite insert command to use these managers
      5. Implement count command
      6. Implement parser
      7. Implement delete
      8. Implement where clause
         • Add where to delete
  98. Limitations of each component
      • The system catalogs
        • Use a simple data format, e.g. JSON or CSV, not a binary format
      • Create table
        • Signed 4-byte integer is the only supported type
        • Only `id integer, age integer` is supported as the table schema
      • Buffer Manager
        • No eviction logic
        • No lock mechanism
  99. Limitations of each component
      • Storage Manager
        • Forks are not supported
        • Segments are not supported
      • Count command
        • The count logic is not generalized to aggregation
      • Delete command
        • Vacuum is not implemented
      • Where clause
        • Supports only a few basic operations
  100. Limitations of each component
       • Query Execution
         • The analyzer is not implemented
           • Semantic analysis has to be done when the query is executed
         • The optimizer is not implemented
           • Indexes could not be used even if indexes were implemented
  101. References
       • Raghu Ramakrishnan, Johannes Gehrke: "Database Management Systems"
       • Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: "Database Systems: Pearson New International Edition: The Complete Book"
       • Tomonari Katsumata, Masaki Saeki, Toshi Harada: "[Revised New Edition] Iron Rules of PostgreSQL Design and Operation Planning, Learned from Its Internal Structure" (in Japanese)
       • "The Internals of PostgreSQL" http://www.interdb.jp/pg/index.html
       • "An index of PostgreSQL notes" (in Japanese) http://www.nminoru.jp/~nminoru/postgresql/
  102. "DLOPXMFEHNFOUT • @takeshinoda and @hkdnet • Reviewing the slide •

    Oracle Corporation Japan • Giving me the chance to talk about DB in Oracle!!!