Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Make and Learn RDBMS

yui-knk
March 22, 2019

Make and Learn RDBMS

yui-knk

March 22, 2019
Tweet

More Decks by yui-knk

Other Decks in Programming

Transcript

  1. Make and Learn RDBMS
    Yuichiro Kaneko(@yui-knk)
    2019/03/22
    Rails Developers Meetup 2019 Day 1

    View Slide

  2. Thank you !!

    View Slide

  3. 4FMG*OUSPEVDUJPO
    • Yuichiro Kaneko

    • Arm Treasure Data

    • CDP team (Writing Rails application)

    • CRuby Committer 2015/12~

    • GitHub (yui-knk)

    View Slide

  4. https://www.treasuredata.com/company/careers/jobs/?team=Engineering
    8FBSFIJSJOH

    View Slide

  5. https://www.treasuredata.com/company/careers/jobs/?team=Engineering
    8FBSFIJSJOH

    View Slide

  6. 5PEBZTUPQJD
    • Implementation of RDBMS, especially PostgreSQL

    • How to create DB step by step

    • Will not talk about CRuby/Rails!

    View Slide

  7. /PUDPWFSFE
    • How to use PostgreSQL efficiently in your Rails app

    • What RDBMS is best for your Rails app

    • How to design perfect database/table schema

    View Slide

  8. #BDLHSPVOE
    • Our customer can store data and execute queries to their data by
    Treasure Data

    • It can be said we are developing each part of DB

    • Knowledge of how DB works may be useful to understand/guess what
    other team members do

    • Creating a subset of DB helps me understand DB

    View Slide

  9. 5PCFIPOFTU
    +VTUGPS'VO

    View Slide

  10. 5PEBZT(PBM
    • Grasp the whole picture of DB

    • Understand which part of DB we can omit when we create DB

    View Slide

  11. *NDSFBUJOH
    • https://github.com/yui-knk/minidb

    • About 3400 lines

    • Implemented by Rust

    • Use PostgreSQL as reference

    • d06fe6ce2c79420fd19ac89ace81b66579f08493 (tag: REL_11_1)

    • First commit date is 2018/12/31

    View Slide

  12. 4VQQPSUFE
    • SELECT/INSERT/DELETE/WHERE/COUNT/ORDER BY

    • Storage Manager

    • Support multiple pages

    • Buffer Manager

    • Has parser

    • Signed 4 bytes integer is only supported type

    • Use as command line tool

    View Slide

  13. /PU4VQQPSUFE
    • UPDATE/JOIN

    • Projection

    • Aggregation

    • Transaction/Concurrency control

    • Vacuum

    • Optimizer

    • Multiple segments

    • INDEX

    • Launch as server

    • ...

    View Slide

  14. 8IBUDBO*EPXJUIlNJOJECz
    $ minidb execute 'insert into db1.table1 (id, age)
    values (1, 12), (2, 13), (3, 12), (4, 20), (5, 21)'
    $ minidb execute 'select * from db1.table1'
    "1" "12"
    "2" "13"
    "3" "12"
    "4" "20"
    "5" "21"
    $ minidb execute 'select count() from db1.table1'
    Count: 5

    View Slide

  15. 8IBUDBO*EPXJUIlNJOJECz
    $ minidb execute 'select * from db1.table1 order by age'
    "1" "12"
    "3" "12"
    "2" "13"
    "4" "20"
    "5" "21"
    $ minidb execute 'select count() from db1.table1 where age = 12'
    Count: 2
    $ minidb execute 'delete from db1.table1 where age = 12'
    Deleted records: 2
    $ minidb execute 'select count() from db1.table1 where age = 12'
    Count: 0

    View Slide

  16. +---------------+
    | Shared Memory |
    +---------------+
    +------------+
    | Postmaster |
    +------------+
    +---------+
    | Backend |
    +---------+
    +---------+
    | Backend |
    +---------+
    +---------+
    | Backend |
    +---------+
    +-------------------+
    | Background Writer |
    +-------------------+
    | Checkpointer |
    +-------------------+
    | WAL Writer |
    +-------------------+
    | WAL Receiver |
    +-------------------+
    | Checkpointer |
    +-------------------+
    +----------------------+
    | Autovacuum |
    +----------------------+
    | WAL Archiver |
    +----------------------+
    | Statistics Collector |
    +----------------------+
    | System Logger |
    +----------------------+
    fork
    fork

    View Slide

  17. +-------------------+
    | Background Writer |
    +-------------------+
    | Checkpointer |
    +-------------------+
    | WAL Writer |
    +-------------------+
    | WAL Receiver |
    +-------------------+
    | Checkpointer |
    +-------------------+
    +----------------------+
    | Autovacuum |
    +----------------------+
    | WAL Archiver |
    +----------------------+
    | Statistics Collector |
    +----------------------+
    | System Logger |
    +----------------------+
    +------------------+
    | main |
    | * log |
    | * write to disk |
    +------------------+

    View Slide

  18. "HFOEB
    • Part 1: The system catalogs and init command

    • Part 2: Storage Manager and Buffer Manager

    • Part 3: Query Execution

    • Part 4: Conclusion

    View Slide

  19. Part 1: The system catalogs and init
    command

    View Slide

  20. %JSFDUPSZTUSVDUVSFPG
    1PTUHSF42-EBUBCBTFEJSFDUPSZ
    $ ls -l
    total 104
    -rw------- 1 yuichirokaneko admin 3 1 17 22:19 PG_VERSION
    drwx------ 7 yuichirokaneko admin 224 1 24 21:05 base
    drwx------ 61 yuichirokaneko admin 1952 3 10 10:44 global
    drwx------ 2 yuichirokaneko admin 64 1 17 22:19 pg_commit_ts
    ...
    drwx------ 4 yuichirokaneko admin 128 1 17 22:19 pg_wal
    drwx------ 3 yuichirokaneko admin 96 1 17 22:19 pg_xact
    -rw------- 1 yuichirokaneko admin 88 1 17 22:19 postgresql.auto.conf
    -rw------- 1 yuichirokaneko admin 23780 1 17 22:19 postgresql.conf
    -rw------- 1 yuichirokaneko admin 80 1 17 22:20 postmaster.opts
    -rw------- 1 yuichirokaneko admin 90 3 10 10:43 postmaster.pid

    View Slide

  21. %JSFDUPSZTUSVDUVSFPG
    1PTUHSF42-EBUBCBTFEJSFDUPSZ
    $ ls -l
    total 104
    -rw------- 1 yuichirokaneko admin 3 1 17 22:19 PG_VERSION
    drwx------ 7 yuichirokaneko admin 224 1 24 21:05 base
    drwx------ 61 yuichirokaneko admin 1952 3 10 10:44 global
    drwx------ 2 yuichirokaneko admin 64 1 17 22:19 pg_commit_ts
    ...
    drwx------ 4 yuichirokaneko admin 128 1 17 22:19 pg_wal
    drwx------ 3 yuichirokaneko admin 96 1 17 22:19 pg_xact
    -rw------- 1 yuichirokaneko admin 88 1 17 22:19 postgresql.auto.conf
    -rw------- 1 yuichirokaneko admin 23780 1 17 22:19 postgresql.conf
    -rw------- 1 yuichirokaneko admin 80 1 17 22:20 postmaster.opts
    -rw------- 1 yuichirokaneko admin 90 3 10 10:43 postmaster.pid
    I’m talking about PostgreSQL

    View Slide

  22. %JSFDUPSZTUSVDUVSFPG
    1PTUHSF42-EBUBCBTFEJSFDUPSZ
    • Go to database directory

    • "global": "Subdirectory containing cluster-wide tables, such as
    pg_database" [1]

    • "base": "Subdirectory containing per-database subdirectories" [1]

    • [1]: https://www.postgresql.org/docs/11/storage-file-layout.html
    $ ls -l
    drwx------ 7 224 1 24 21:05 base
    drwx------ 61 1952 3 10 10:44 global

    View Slide

  23. 0JE
    • Oid means "Object ID”

    • `typedef unsigned int Oid;`

    View Slide

  24. $ ls -l global
    total 1160
    -rw------- 1 yuichirokaneko admin 8192 1 17 22:19 1136
    -rw------- 1 yuichirokaneko admin 24576 1 17 22:19 1136_fsm
    -rw------- 1 yuichirokaneko admin 8192 1 17 22:19 1136_vm
    ...
    -rw------- 1 yuichirokaneko admin 8192 1 24 21:10 1262
    -rw------- 1 yuichirokaneko admin 24576 1 17 22:19 1262_fsm
    -rw------- 1 yuichirokaneko admin 8192 1 17 22:19 1262_vm
    ...
    -rw------- 1 yuichirokaneko admin 22088 3 10 10:44 pg_internal.init
    # select oid, relname from pg_class where oid = 1262 order by oid asc;
    oid | relname
    ------+-------------
    1262 | pg_database
    (1 row) Oid of table

    View Slide

  25. CBTFEJSFDUPSZ
    • Contains directories whose name is same with Oid

    • Directory 16388 is for lusiadas database
    $ ls -l base/
    total 0
    drwx------ 296 9472 2 10 21:15 1
    drwx------ 296 9472 1 17 22:19 13363
    drwx------ 296 9472 3 9 22:12 13364
    drwx------ 297 9504 2 10 21:15 16384
    drwx------ 299 9568 3 10 10:44 16388
    # select oid, datname
    from pg_database order by oid asc;
    oid | datname
    -------+------------
    1 | template1
    13363 | template0
    13364 | postgres
    16384 | example-db
    16388 | lusiadas
    (5 rows)

    View Slide

  26. CBTFEJSFDUPSZ
    • Directories contain data files

    • File 16389 is for films table
    $ ls -l base/16388
    total 15784
    -rw------- 1 8192 1 24 21:05 112
    -rw------- 1 8192 1 24 21:05 113
    -rw------- 1 73728 1 24 21:10 1247
    -rw------- 1 24576 1 24 21:05 1247_fsm
    -rw------- 1 8192 1 24 21:10 1247_vm
    ...
    -rw------- 1 8192 2 23 22:31 16389
    ...
    # select oid, relname
    from pg_class
    where oid = 16389
    order by oid asc;
    oid | relname
    -------+---------
    16389 | films
    (1 row)
    Oid of table

    View Slide

  27. *NQMFNFOUJOJUDPNNBOE
    $ minidb --base_dir /tmp/minidb init
    /tmp/minidb
    !"" base
    #"" global
    !"" mini_attribute
    $ #"" data
    !"" mini_class
    $ #"" data
    !"" mini_database
    $ #"" data
    #"" oid
    Columns
    Tables
    Databases

    View Slide

  28. *NQMFNFOUJOJUDPNNBOE
    • Creating some system catalogs which are the place RDBMS stores
    schema metadata

    • "mini_database": Database information

    • "mini_class": Table information

    • "mini_attribute": Column information

    • Simply creating directories and files

    View Slide

  29. *NQMFNFOUDSFBUFEBUBCBTF
    DPNNBOE
    $ minidb --base_dir /tmp/minidb create_db db1
    /tmp/minidb
    !"" base
    $ #"" 10000
    #"" global
    !"" mini_attribute
    $ #"" data
    !"" mini_class
    $ #"" data
    !"" mini_database
    $ #"" data
    #"" oid
    $ cat global/mini_database/data
    # Oid,db_name
    10000,db1

    View Slide

  30. *NQMFNFOUDSFBUFEBUBCBTF
    DPNNBOE
    • Insert a database record to "mini_database/data" file

    • Do not use binary format for system catalog to easy to implement

    • The format of this data file is “Oid,database_name"

    • Create directory of Oid name (10000) under the base directory

    View Slide

  31. $ minidb --base_dir /tmp/minidb create_table db1 table1
    /tmp/minidb
    !"" base
    $ #"" 10000
    $ #"" 10001
    #"" global
    !"" mini_attribute
    $ #"" data
    !"" mini_class
    $ #"" data
    !"" mini_database
    $ #"" data
    #"" oid
    *NQMFNFOUDSFBUFUBCMFDPNNBOE
    $ cat /tmp/minidb/global/mini_class/data
    # Oid,table_name,db_Oid
    10001,table1,10000
    $ cat /tmp/minidb/global/mini_attribute/data
    # column_name,db_Oid,table_Oid,type,length
    id,10000,10001,1,4
    age,10000,10001,1,4
    Only support `id integer, age integer` as table schema

    View Slide

  32. *NQMFNFOUDSFBUFUBCMFDPNNBOE
    • Insert a table record to "mini_class/data" file

    • The format of this data file is “Oid,table_name,database_Oid"

    • Insert column records to "mini_attribute/data" file

    • The format of this data file is
    “column_name,database_Oid,table_Oid,type,length"

    • Type 1 means Signed 4 bytes integer

    • User can not specify schema of the table!

    • Only support `id integer, age integer` as table schema

    View Slide

  33. )PXUPJOUFSBDUXJUIpMFT
    • Read the catalog file

    • Load data from the catalog file into an array

    • Add new catalog data to the array

    • Write the array to catalog file

    View Slide

  34. *NQPSUBOUQPJOUT
    • Use simple data format, e.g. JSON, CSV …

    • Simply interact with catalog files

    View Slide

  35. Part 2: Storage Manager and
    Buffer Manager

    View Slide

  36. %BUBpMFTUSVDUVSFJOQH
    • "fork": Use same oid with different suffix for metadata files

    • "main": (0) / "fsm": (1) / "vm": (2) / "init": (3)

    • Free Space Map / Visibility Map

    • "segment": Divide a fork into multiple files to avoid OS' limit on file
    size (see: RELSEG_SIZE)

    • "block": Data file is divided into logical blocks, default 8KB (see:
    BLCKSZ)
    $ ls -hl base/16388/24576*
    -rw------- 1.0G 3 10 16:38 base/16388/24576 <- fork: 0, segment: 0
    -rw------- 25M 3 10 16:38 base/16388/24576.1 <- fork: 0, segment: 1
    -rw------- 280K 3 10 16:38 base/16388/24576_fsm <- fork: 1, segment: 0

    View Slide

  37. %BUBpMFTUSVDUVSFJOQH
    • Internal structure of "base/16388/24576" file looks like below

    • It has fixed-size blocks
    $ ls -hl base/16388/24576*
    -rw------- 1.0G 3 10 16:38 base/16388/24576
    -rw------- 25M 3 10 16:38 base/16388/24576.1
    -rw------- 280K 3 10 16:38 base/16388/24576_fsm
    +---------+
    | Block 0 | <- 8KB
    +---------+
    | Block 1 | <- 8KB
    +---------+
    | Block 2 | <- 8KB
    +---------+
    | Block 3 | <- 8KB
    +---------+
    ...

    View Slide

  38. #MPDLTUSVDUVSFJOQH
    /*
    * +----------------+---------------------------------+
    * | PageHeaderData | linp1 linp2 linp3 ... |
    * +-----------+----+---------------------------------+
    * | ... linpN | |
    * +-----------+--------------------------------------+
    * | ^ pd_lower |
    * | |
    * | v pd_upper |
    * +-------------+------------------------------------+
    * | | tupleN ... |
    * +-------------+------------------+-----------------+
    * | ... tuple3 tuple2 tuple1 | "special space" |
    * +--------------------------------+-----------------+
    * ^ pd_special
    */ 'SPNTSDJODMVEFTUPSBHFCVGQBHFI
    Header 32bit
    +---------+
    | Block 0 | <- 8KB
    +---------+
    | Block 1 | <- 8KB
    +---------+
    | Block 2 | <- 8KB
    +---------+
    | Block 3 | <- 8KB
    +---------+
    ...

    View Slide

  39. #MPDLTUSVDUVSFJOQH
    • PageHeader is a header

    • linp has a ItemIdData type, 32bit. This holds info of each tuple, e.g.
    offset to the tuple, length of the tuple

    • Maybe linp means “line pointer”

    • tuple has a HeapTupleHeaderData type. This holds data of the row

    View Slide

  40. *NQMFNFOUJOTFSUDPNNBOE
    • We should implement insert before select :)

    • Only support `id integer, age integer` as table schema

    • `minidb insert_into db_name table_name id age`
    $ minidb insert_into db1 table1 1 12
    $ minidb insert_into db1 table1 2 13
    $ minidb insert_into db1 table1 3 12

    View Slide

  41. *NQMFNFOUJOTFSUDPNNBOE
    • Open data file

    • Load data into memory, this is called 'page'

    • Add tuple to page

    • Write data into data file

    View Slide

  42. pub fn execute(&self, dbname: &str, table_name: &str, key_values: Vec) -> Result<(),
    String> {
    let rm: RecordManeger =
    RecordManeger::build_from_config("mini_attribute".to_string(), self.config).unwrap();
    let attrs = rm.attributes(dbname, table_name);
    let mut slot = TupleTableSlot::new(rm.attributes_clone(dbname, table_name));
    let path = self.config.data_file_path(dbname, table_name);
    let mut page = if path.exists() {
    Page::load(&path)
    } else {
    Page::new(DEFAULT_BLOCK_SIZE)
    };
    if attrs.len() != key_values.len() {
    return Err(format!("Length not match. attrs: {}, key_values: {}", attrs.len(),
    key_values.len()));
    }
    *NQMFNFOUJOTFSUDPNNBOE
    open file & load data

    View Slide

  43. for ((i, kv), attr) in key_values.iter().enumerate().zip(attrs.iter()) {
    if kv.key != attr.name {
    return Err(format!("Name not match. attrs: {}, key_values: {}", attr.name, kv.key));
    }
    let t = ty::build_ty(&attr.type_name, &kv.value)?;
    slot.set_column(i, t.as_ref());
    }
    page.add_tuple_slot_entry(&slot).unwrap();
    let f = File::create(path).unwrap();
    page.write_bytes(f);
    Ok(())
    }
    *NQMFNFOUJOTFSUDPNNBOE
    add tuple to page
    write data to file

    View Slide

  44. *NQMFNFOUTFMFDUDPNNBOE
    • `minidb select_from db_name table_name`
    $ minidb select_from db1 table1
    "1" # <- id of 1st record
    "12" # <- age of 1st record
    "2" # <- id of 2nd record
    "13" # <- age of 2nd record
    "3" # <- id of 3rd record
    "12" # <- age of 3rd record

    View Slide

  45. *NQMFNFOUTFMFDUDPNNBOE
    • Open data file

    • Load data into page

    • Copy tuple data into tuple slot (TupleTableSlot)

    • Print each tuple

    • Print each column

    View Slide

  46. pub fn execute(&self, dbname: &str, table_name: &str, key: &str, value: &str) -> Result<(),
    String> {
    let rm: RecordManeger =
    RecordManeger::build_from_config("mini_attribute".to_string(), self.config).unwrap();
    let attrs = rm.attributes_clone(dbname, table_name);
    let attrs_len = attrs.iter().fold(0, |acc, attr| acc + attr.len) as u32;
    let mut slot = TupleTableSlot::new(attrs);
    let path = self.config.data_file_path(dbname, table_name);
    let mut page = if path.exists() {
    Page::load(&path)
    } else {
    Page::new(DEFAULT_BLOCK_SIZE)
    };
    *NQMFNFOUTFMFDUDPNNBOE
    open file & load data

    View Slide

  47. for i in 0..(page.entry_count()) {
    slot.load_data(page.get_entry_pointer(i).unwrap(), attrs_len);
    println!("attrs_count: {:?}", slot.attrs_count());
    for j in 0..(slot.attrs_count()) {
    let ty = slot.get_column(j);
    println!("{:?}", ty.as_string());
    }
    }
    Ok(())
    }
    *NQMFNFOUTFMFDUDPNNBOE
    copy data to tuple
    iterate each tuple
    print each column
    iterate each column

    View Slide

  48. *NQPSUBOUQPJOUT
    • Need not to clean code of select/insert commands at present

    • We will introduce Query Execution logic at a later step

    View Slide

  49. #VU
    • Each command read/write files

    • Each command allocate memory (page)

    • Introduce Buffer Manager and Storage Manager to extract these logic

    View Slide

  50. +------------------+
    | Other components |
    +------------------+
    Return buffer identifiers (Buffer)
    ^ | (ReadBufferExtended, etc.)
    | v
    Give me block 0 of films table, lusiadas db ForkNumber 0
    +----------------+
    | Buffer Manager |
    +----------------+
    load
    ^ |
    | v
    dump
    +----------------+
    | Disk |
    +----------------+
    #V⒎FS.BOBHFS

    View Slide

  51. #V⒎FS.BOBHFS
    • Buffer Manager manages data in memory

    • All data in files are read/written via Buffer Manager

    • In many case we can not load all data into Buffer Manager (memory
    size <<<< data file size)

    • “Block" may be called "Page" in Buffer Manager context

    • User of Buffer Manager passes info of Relation(database & table) and
    BlockNumber to get a page

    View Slide

  52. pages:
    +------+------+
    | 8KB | 8KB |
    +------+------+
    ^ ^
    | |
    v v
    buffer_descriptors:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    | flag | flag |
    +------+------+
    buffer_hash:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    +------+------+
    | |
    v v
    +------+------+
    | 0 | 1 |
    +------+------+
    Get (db1, tbl2, blk0)
    )PXUPVTF#V⒎FS.BOBHFS

    View Slide

  53. pages:
    +------+------+
    | 8KB | 8KB |
    +------+------+
    ^ ^
    | |
    v v
    buffer_descriptors:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    | flag | flag |
    +------+------+
    buffer_hash:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    +------+------+
    | |
    v v
    +------+------+
    | 0 | 1 |
    +------+------+
    Get (db1, tbl2, blk0)
    )PXUPVTF#V⒎FS.BOBHFS

    View Slide

  54. pages:
    +------+------+
    | 8KB | 8KB |
    +------+------+
    ^ ^
    | |
    v v
    buffer_descriptors:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    | flag | flag |
    +------+------+
    buffer_hash:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    +------+------+
    | |
    v v
    +------+------+
    | 0 | 1 |
    +------+------+
    Get (db1, tbl2, blk0)
    Buffer (1)
    )PXUPVTF#V⒎FS.BOBHFS

    View Slide

  55. pages:
    +------+------+
    | 8KB | 8KB |
    +------+------+
    ^ ^
    | |
    v v
    buffer_descriptors:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    | flag | flag |
    +------+------+
    buffer_hash:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    +------+------+
    | |
    v v
    +------+------+
    | 0 | 1 |
    +------+------+
    Get (db1, tbl2, blk0)
    Buffer (1)
    Get page of Buffer (1)
    )PXUPVTF#V⒎FS.BOBHFS

    View Slide

  56. pages:
    +------+------+
    | 8KB | 8KB |
    +------+------+
    ^ ^
    | |
    v v
    buffer_descriptors:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    | flag | flag |
    +------+------+
    buffer_hash:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    +------+------+
    | |
    v v
    +------+------+
    | 0 | 1 |
    +------+------+
    Get (db1, tbl2, blk0)
    Buffer (1)
    Get page of Buffer (1)
    Reference of page (1)
    )PXUPVTF#V⒎FS.BOBHFS

    View Slide

  57. Get (db1, tbl1, blk1)
    )PXUPVTF#V⒎FS.BOBHFS
    pages:
    +------+------+
    | 8KB | 8KB |
    +------+------+
    ^ ^
    | |
    v v
    buffer_descriptors:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    | flag | flag |
    +------+------+
    buffer_hash:
    +------+------+
    | db1 | db1 |
    | tbl1 | tbl2 |
    | blk0 | blk0 |
    +------+------+
    | |
    v v
    +------+------+
    | 0 | 1 |
    +------+------+

    View Slide

  58. Get (db1, tbl1, blk1)
    )PXUPVTF#V⒎FS.BOBHFS
    pages:
    +------+------+------+
    | 8KB | 8KB | 8KB |
    +------+------+------+
    ^ ^ ^
    | | |
    v v v
    buffer_descriptors:
    +------+------+------+
    | db1 | db1 | db1 |
    | tbl1 | tbl2 | tbl1 |
    | blk0 | blk0 | blk1 |
    | flag | flag | flag |
    +------+------+------+
    buffer_hash:
    +------+------+------+
    | db1 | db1 | db1 |
    | tbl1 | tbl2 | tbl1 |
    | blk0 | blk0 | blk1 |
    +------+------+------+
    | | |
    v v v
    +------+------+------+
    | 0 | 1 | 2 |
    +------+------+------+

    View Slide

  59. Get (db1, tbl1, blk1)
    )PXUPVTF#V⒎FS.BOBHFS
    pages:
    +------+------+------+
    | 8KB | 8KB | 8KB |
    +------+------+------+
    ^ ^ ^
    | | |
    v v v
    buffer_descriptors:
    +------+------+------+
    | db1 | db1 | db1 |
    | tbl1 | tbl2 | tbl1 |
    | blk0 | blk0 | blk1 |
    | flag | flag | flag |
    +------+------+------+
    buffer_hash:
    +------+------+------+
    | db1 | db1 | db1 |
    | tbl1 | tbl2 | tbl1 |
    | blk0 | blk0 | blk1 |
    +------+------+------+
    | | |
    v v v
    +------+------+------+
    | 0 | 1 | 2 |
    +------+------+------+
    Buffer (2)

    View Slide

  60. *NQMFNFOU#V⒎FS.BOBHFS
    • Page: Malloced block_size memory

    • BufferTag: Relation(database & table) and BlockNumber
    // We malloc c memory for header, lines and tuples.
    pub struct Page {
    header: *mut PageHeaderData,
    }
    struct BufferTag {
    rnode: RelFileNode,
    block_num: BlockNumber,
    }
    8 KB

    View Slide

  61. *NQMFNFOU#V⒎FS.BOBHFS
    • pages: Array of Page

    • buffer_descriptors: Array of info of Page (BufferTag, dirty flag etc.)

    • buffer_hash: Map from BufferTag to buffer identifier (usize)
    pub struct BufferManager {
    config: Rc,
    smgr: StorageManager,
    buffer_descriptors: Vec,
    pages: Vec,
    // Hash from BufferTag to index of descriptor and page
    // See LocalBufHash in pg.
    buffer_hash: HashMap,
    }

    View Slide

  62. -JNJUBUJPOTPG#V⒎FS.BOBHFS
    • No eviction logic

    • Can not load file into memory if all slots are used

    • It’s acceptable if we implement DB as command

    • It’s may not acceptable if we implement DB as server

    • No lock mechanism

    View Slide

  63. *NQPSUBOUQPJOUT
    • Dump data to files by "destructor" (drop trait)

    • Be careful to return cached page if the page is cached

    • I embedded a bug because I forgot this…

    View Slide

  64. 4UPSBHF.BOBHFSJOQH
    • Manages file descriptors

    • "src/backend/storage/smgr/smgr.c"

    • "src/backend/storage/smgr/md.c"

    • Take care of segments

    View Slide

  65. +------------------+
    | Other components |
    +------------------+
    ^ | (ReadBufferExtended, etc.)
    | v
    Give me block 0 of films table, lusiadas db, ForkNumber 0
    +----------------+
    | Buffer Manager |
    +----------------+
    ^ | (smgrread, etc.)
    | v
    Give me block 0 of films table, lusiadas db, ForkNumber 0
    +----------------+
    | Storage Manager|
    +----------------+
    ^ | (mdread, etc.)
    | v
    Read block size data from an appropriate file to a buf
    +----------------+
    | Disk |
    +----------------+

    View Slide

  66. typedef enum ForkNumber
    {
    InvalidForkNumber = -1,
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VISIBILITYMAP_FORKNUM,
    INIT_FORKNUM
    } ForkNumber;
    #define MAX_FORKNUM INIT_FORKNUM
    typedef struct _MdfdVec
    {
    File mdfd_vfd; /* fd number in fd.c's pool */
    BlockNumber mdfd_segno; /* segment number, from 0 */
    } MdfdVec;
    typedef struct SMgrRelationData
    {
    RelFileNodeBackend smgr_rnode; /* relation physical identifier */
    struct SMgrRelationData **smgr_owner;
    BlockNumber smgr_targblock; /* current insertion target block */
    BlockNumber smgr_fsm_nblocks; /* last known size of fsm fork */
    BlockNumber smgr_vm_nblocks; /* last known size of vm fork */
    int smgr_which; /* storage manager selector */
    int md_num_open_segs[MAX_FORKNUM + 1];
    struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
    struct SMgrRelationData *next_unowned_reln;
    } SMgrRelationData;
    • md_seg_fds is an
    array of
    `_MdfdVec *`

    • see:
    `_mdfd_getseg`
    function

    View Slide

  67. md_seg_fds:
    main fsm vm init
    +------------+--------------+--------------+--------------+
    | | | | |
    +------------+--------------+--------------+--------------+
    | | | |
    v v v v
    +------------+--------------+
    | fd 24576 | fd 24576_fsm |
    | seg 0 | seg 0 |
    +------------+--------------+
    | fd 24576.1 |
    | seg 1 |
    +------------+
    base/16388/24576 <- fork: 0, segment: 0
    base/16388/24576.1 <- fork: 0, segment: 1
    base/16388/24576_fsm <- fork: 1, segment: 0
    Forks
    Segments

    View Slide

  68. *NQMFNFOU4UPSBHF.BOBHFS
    • cache: Map from RelFileNode to SMgrRelationData

    • SMgrRelationData: This has fd
    pub struct SMgrRelationData {
    config: Rc,
    pub smgr_rnode: RelFileNode,
    file: Option,
    // current insertion target block
    pub smgr_targblock: BlockNumber,
    }
    pub struct StorageManager {
    config: Rc,
    cache: HashMap>
    }

    View Slide

  69. -JNJUBUJPOTPG4UPSBHF.BOBHFS
    • Fork is not supported

    • Segment is not supported
    $ ls -hl base/16388/24576*
    -rw------- 1.0G 3 10 16:38 base/16388/24576 <- fork: 0, segment: 0
    -rw------- 25M 3 10 16:38 base/16388/24576.1 <- fork: 0, segment: 1
    -rw------- 280K 3 10 16:38 base/16388/24576_fsm <- fork: 1, segment: 0

    View Slide

  70. 3FXSJUFJOTFSUDPNNBOE
    • Problems were

    • Each command read/write files

    • Each command allocate memory (page)

    • Insert command does not open file directly

    • Insert command does not allocate memory directly

    View Slide

  71. pub struct InsertState<'a> {
    slot: &'a TupleTableSlot,
    ss_currentRelation: &'a RelFileNode,
    }
    impl<'a> InsertState<'a> {
    pub fn new(rnode: &'a RelFileNode, slot: &'a TupleTableSlot) -> InsertState<'a> {
    InsertState {
    ss_currentRelation: rnode,
    slot: slot,
    }
    }
    pub fn exec_insert(&mut self, bufmrg: &mut BufferManager) {
    let buf = bufmrg.read_buffer(self.ss_currentRelation.clone(), 0);
    let page = bufmrg.get_page_mut(buf);
    page.add_tuple_slot_entry(self.slot).unwrap();
    }
    }
    get page from buffer manager
    write tuple to page

    View Slide

  72. Part 3: Query Execution

    View Slide

  73. 5BCMFFYBNQMF
    =# \d films
    Table "public.films"
    Column | Type | Collation | Nullable | Default
    -----------+-------------------------+-----------+----------+---------
    code | character(5) | | not null |
    title | character varying(40) | | not null |
    did | integer | | not null |
    date_prod | date | | |
    kind | character varying(10) | | |
    len | interval hour to minute | | |
    Indexes:
    "firstkey" PRIMARY KEY, btree (code)
    =# select * from films;
    code | title | did | date_prod | kind | len
    -------+-------+-----+-----------+------+-----
    c1 | T1 | 101 | | |
    c2 | T2 | 102 | | |
    c3 | T3 | 103 | | |
    (3 rows)

    View Slide

  74. +-----------------------+
    | Query string (char *) |
    +-----------------------+
    parse (pg_parse_query)
    |
    v
    +---------------------------------+
    | Raw parse tree (struct RawStmt) |
    +---------------------------------+
    analyze/rewrite (pg_analyze_and_rewrite)
    |
    v
    +---------------------------+
    | Query Tree (struct Query) |
    +---------------------------+
    optimize/generate tree of plan node
    (pg_plan_queries)
    |
    v
    +-------------------------+
    | Plan tree (struct Plan) |
    +-------------------------+
    setup query execution (ExecInitNode)
    |
    v
    +-----------------------------------+
    | PlanState tree (struct PlanState) |
    +-----------------------------------+
    execute query (ExecutorRun)

    View Slide

  75. 1BSTF
    • Generate AST from input text

    • Each node represents user input

    View Slide

  76. 1BSTF
    select * from films where did = 1;
    ParseState:
    parsetree_list
    RawStmt:
    stmt
    SelectStmt:
    targetList
    ResTarget: name: , indirection:
    val ColumnRef: fields: *
    fromClause
    RangeVar:
    catalogname: , schemaname: , relname: films
    whereClause
    A_Expr: kind: OP, name: =
    lexpr
    ColumnRef: fields: did
    rexpr
    A_Const: val: 1

    View Slide

  77. "OBMZ[F3FXSJUF
    • Generate Query Tree

    • Replace user input attributes to internal representations

    • Semantic analysis

    View Slide

  78. "OBMZ[F3FXSJUF
    Query: SELECT(id: 0)
    rtable (# 1)
    RangeTblEntry: rtekind: RELATION, relid: 16389
    jointree
    FromExpr:
    quals
    OpExpr: opno: 96, opfuncid: 65
    Var: varno: 1, varattno: 3
    Const: consttype: 23, constlen: 4,
    constvalue: 1
    target_list (# 6)
    TargetEntry: resno: 1, resname: code
    Var: varno: 1, varattno: 1
    TargetEntry: resno: 2, resname: title
    Var: varno: 1, varattno: 2
    TargetEntry: resno: 3, resname: did
    Var: varno: 1, varattno: 3

    ParseState:
    parsetree_list
    RawStmt:
    stmt
    SelectStmt:
    targetList
    ResTarget: name: , indirection:
    val ColumnRef: fields: *
    fromClause
    RangeVar:
    relname: films
    whereClause
    A_Expr: kind: OP, name: =
    lexpr
    ColumnRef: fields: did
    rexpr
    A_Const: val: 1
    Expanded

    View Slide

  79. "OBMZ[F3FXSJUF
    Query: SELECT(id: 0)
    rtable (# 1)
    RangeTblEntry: rtekind: RELATION, relid: 16389
    jointree
    FromExpr:
    quals
    OpExpr: opno: 96, opfuncid: 65
    Var: varno: 1, varattno: 3
    Const: consttype: 23, constlen: 4,
    constvalue: 1
    target_list (# 6)
    TargetEntry: resno: 1, resname: code
    Var: varno: 1, varattno: 1
    TargetEntry: resno: 2, resname: title
    Var: varno: 1, varattno: 2
    TargetEntry: resno: 3, resname: did
    Var: varno: 1, varattno: 3

    ParseState:
    parsetree_list
    RawStmt:
    stmt
    SelectStmt:
    targetList
    ResTarget: name: , indirection:
    val ColumnRef: fields: *
    fromClause
    RangeVar:
    relname: films
    whereClause
    A_Expr: kind: OP, name: =
    lexpr
    ColumnRef: fields: did
    rexpr
    A_Const: val: 1
    Internal representations
    where did = 1

    View Slide

  80. "OBMZ[F3FXSJUF
    Query: SELECT(id: 0)
    rtable (# 1)
    RangeTblEntry: rtekind: RELATION, relid: 16389
    jointree
    FromExpr:
    quals
    OpExpr: opno: 96, opfuncid: 65
    Var: varno: 1, varattno: 3
    Const: consttype: 23, constlen: 4,
    constvalue: 1
    target_list (# 6)
    TargetEntry: resno: 1, resname: code
    Var: varno: 1, varattno: 1
    TargetEntry: resno: 2, resname: title
    Var: varno: 1, varattno: 2
    TargetEntry: resno: 3, resname: did
    Var: varno: 1, varattno: 3

    # select oid, relname
    from pg_class
    where oid = 16389
    order by oid asc;
    oid | relname
    -------+---------
    16389 | films
    (1 row)
    films

    View Slide

  81. "OBMZ[F3FXSJUF
    Query: SELECT(id: 0)
    rtable (# 1)
    RangeTblEntry: rtekind: RELATION, relid: 16389
    jointree
    FromExpr:
    quals
    OpExpr: opno: 96, opfuncid: 65
    Var: varno: 1, varattno: 3
    Const: consttype: 23, constlen: 4,
    constvalue: 1
    target_list (# 6)
    TargetEntry: resno: 1, resname: code
    Var: varno: 1, varattno: 1
    TargetEntry: resno: 2, resname: title
    Var: varno: 1, varattno: 2
    TargetEntry: resno: 3, resname: did
    Var: varno: 1, varattno: 3

    where did = 1
    /* catalog/pg_operator.dat */
    { oid => ’96',
    oid_symbol => ‘Int4EqualOperator'
    oprname => ‘=', oprleft => ‘int4',
    oprright => ‘int4',
    oprresult => 'bool' },
    /* catalog/pg_type.dat */
    { oid => ’23',
    descr => '-2 billion to 2 billion
    integer, 4-byte storage',
    typname => ‘int4',
    typlen => '4' },
    select * from films where did = 1;

    View Slide

  82. 0QUJNJ[F(FOFSBUFUSFFPGQMBO
    OPEF
    • We would have many way to calculate result tuples

    • How to get tuples from disk

    • Read files in order (SeqScan) / Use index (IndexScan) / Use only
    index (IndexOnlyScan)

    • How to aggregate tuples

    • Use internal hash table / sort tuples before aggregation

    • How to join tables

    • Nested Loops Join / (Sort-)Merge Join / Hash Join

    View Slide

  83. 0QUJNJ[F(FOFSBUFUSFFPGQMBO
    OPEF
    Query: SELECT(id: 0)
    rtable (# 1)
    RangeTblEntry: relid: 16389
    jointree
    FromExpr:
    quals
    OpExpr: opno: 96, opfuncid: 65
    Var: varno: 1, varattno: 3
    Const: consttype: 23,
    constlen: 4,
    constvalue: 1
    target_list (# 6)
    TargetEntry: resno: 1, resname: code
    Var: varno: 1, varattno: 1
    TargetEntry: resno: 2, resname: title
    Var: varno: 1, varattno: 2
    TargetEntry: resno: 3, resname: did
    Var: varno: 1, varattno: 3

    PlannedStmt: command_type: SELECT
    SeqScan:
    targetlist
    TargetEntry: resno: 1, resname: code
    Var: varno: 1, varattno: 1
    TargetEntry: resno: 2, resname: title
    Var: varno: 1, varattno: 2
    TargetEntry: resno: 3, resname: did
    Var: varno: 1, varattno: 3
    ...
    qual
    OpExpr: opno: 96, opfuncid: 65
    Var: varno: 1, varattno: 3
    Const: consttype: 23,
    constlen: 4,
    constvalue: 1

    View Slide

  84. • Sort gets tuple from
    lefttree, SeqScan in
    this case
    select *
    from films
    where did = 1
    order by code;
    PlannedStmt: command_type: SELECT
    Sort: num_cols: 1, sort_col_idx: [1], sort_operators: [1058]
    targetlist
    TargetEntry: resno: 1, resname: code
    Var: varno: 65001, varattno: 1
    TargetEntry: resno: 2, resname: title
    Var: varno: 65001, varattno: 2
    ...
    qual
    lefttree
    SeqScan:
    targetlist
    TargetEntry: resno: 1, resname:
    Var: varno: 1, varattno: 1
    TargetEntry: resno: 2, resname:
    Var: varno: 1, varattno: 2
    ...
    qual
    OpExpr: opno: 96, opfuncid: 65
    Var: varno: 1, varattno: 3
    Const: consttype: 23, constlen: 4, constvalue: 1

    View Slide

  85. *NQMFNFOUDPVOUDPNNBOE
    • In pg this is implemented as aggregation

    • But aggregation is difficult…
    $ minidb select_from_count db1 table1
    Count: 5

    View Slide

  86. *NQMFNFOUDPVOUDPNNBOE
    +-------------+
    | Count |
    | * counter |
    | * scanner |
    +-------------+
    +---------+
    | SeqScan |
    +---------+
    +------+
    | Disk |
    +------+

    View Slide

  87. +-----------------------+
    | Query string (char *) |
    +-----------------------+
    parse (pg_parse_query)
    |
    v
    +---------------------------------+
    | Raw parse tree (struct RawStmt) |
    +---------------------------------+
    analyze/rewrite (pg_analyze_and_rewrite)
    |
    v
    +---------------------------+
    | Query Tree (struct Query) |
    +---------------------------+
    optimize/generate tree of plan node
    (pg_plan_queries)
    |
    v
    +-------------------------+
    | Plan tree (struct Plan) |
    +-------------------------+
    setup query execution (ExecInitNode)
    |
    v
    +-----------------------------------+
    | PlanState tree (struct PlanState) |
    +-----------------------------------+
    execute query (ExecutorRun)
    • Directly create Plan tree

    View Slide

  88. impl<'a> CountState<'a> {
    pub fn new(lefttree: ScanState<'a>) -> CountState<'a> {
    CountState {
    lefttree: lefttree,
    result: 0,
    }
    }
    // `ExecAgg` in pg.
    pub fn exec_agg(&mut self, bufmrg: &mut BufferManager) {
    loop {
    let opt = self.lefttree.exec_scan(bufmrg);
    match opt {
    Some(_slot) => {
    self.result = self.result + 1;
    },
    None => break
    }
    }
    }
    }
    get tuple
    count up
    counter
    scanner

    View Slide

  89. impl CountCommnad {
    ...
    pub fn execute(&self, dbname: &str, table_name: &str) -> Result<(), String> {
    ...
    let mut rmgr = RelationManager::new(self.config.clone());
    let relation = rmgr.get_relation(db_oid, table_oid);
    let mut bm = BufferManager::new(1, self.config.clone());
    let scan = ScanState::new(relation, &rm, &mut bm);
    let mut count = CountState::new(scan);
    count.exec_agg(&mut bm);
    println!("Count: {}", count.result);
    Ok(())
    }
    }
    pass scan to count
    execute
    get result of the count

    View Slide

  90. -JNJUBUJPOPGDPVOU
    • Count logic is not generalized to aggregation

    View Slide

  91. *NQMFNFOUQBSTFS
    • Use lalrpop, LR(1) parser generator

    • Rewrite select_from / select_from_count / insert_into commands to
    execute command

    • Before: "minidb select_from db1 table1”

    • After: "minidb execute 'select * from db1.table1’"

    • I wanted some progress when I changed to use parser :)

    • Writing parser by parser generator is fun!!!

    View Slide

  92. +-----------------------+
    | Query string (char *) |
    +-----------------------+
    parse (pg_parse_query)
    |
    v
    +---------------------------------+
    | Raw parse tree (struct RawStmt) |
    +---------------------------------+
    analyze/rewrite (pg_analyze_and_rewrite)
    |
    v
    +---------------------------+
    | Query Tree (struct Query) |
    +---------------------------+
    optimize/generate tree of plan node
    (pg_plan_queries)
    |
    v
    +-------------------------+
    | Plan tree (struct Plan) |
    +-------------------------+
    setup query execution (ExecInitNode)
    |
    v
    +-----------------------------------+
    | PlanState tree (struct PlanState) |
    +-----------------------------------+
    execute query (ExecutorRun)
    • Do not analyze

    • Do not optimize

    View Slide

  93. -JNJUBUJPOPGRVFSZFYFDVUPS
    • Do not implement analyzer

    • Should do semantic analysis when the query is executed

    • Do not implement optimizer

    • Can not use index even if index is implemented

    View Slide

  94. *NQMFNFOUEFMFUF
    $ minidb /tmp/minidb execute 'select count() from db1.table1'
    Count: 5
    $ minidb execute 'delete from db1.table1'
    Deleted records: 5
    $ minidb /tmp/minidb execute 'select count() from db1.table1'
    Count: 0

    View Slide

  95. *NQMFNFOUEFMFUF
    • Add deleted flag to tuple

    • Should recreate data files or write migration command

    • delete command sets the flag

    • SeqScan checks the flag

    • Count need not to take care of the flag

    View Slide

  96. *NQMFNFOUEFMFUF
    /*
    * +----------------+---------------------------------+
    * | PageHeaderData | linp1 linp2 linp3 ... |
    * +-----------+----+---------------------------------+
    * | ... linpN | |
    * +-----------+--------------------------------------+
    * | ^ pd_lower |
    * | |
    * | v pd_upper |
    * +-------------+------------------------------------+
    * | | tupleN ... |
    * +-------------+------------------+-----------------+
    * | ... tuple3 tuple2 tuple1 | "special space" |
    * +--------------------------------+-----------------+
    * ^ pd_special
    */
    +---------+
    | |
    | Data |
    | |
    +---------+
    +---------+
    | flag(s) |
    +---------+
    | |
    | Data |
    | |
    +---------+
    Before
    After
    • Add deleted flag to tuple

    View Slide

  97. diff --git a/src/node_seqscan.rs b/src/node_seqscan.rs
    index 80f7cff..2edeb2b 100644
    --- a/src/node_seqscan.rs
    +++ b/src/node_seqscan.rs
    @@ -135,8 +135,19 @@ impl<'a> ScanState<'a> {
    let dp = bufmrg.get_page(scan_desc.rs_cbuf);
    let mut t_self = ItemPointerData::new();
    ::tuple::item_pointer_set(&mut t_self, scan_desc.rs_cblock, lineoff);
    + debug!("lp_len {}", dp.get_item_ref(lineoff).lp_len());
    scan_desc.rs_ctup.load_without_len(dp.get_entry_pointer(lineoff).unwrap(),
    t_self);
    - return
    +
    + // Skip deleted record
    + if scan_desc.rs_ctup.t_data.heap_keys_updated_p() {
    + debug!("Skip deleted tuple {}", lineoff);
    + lineoff = lineoff + 1;
    + linesleft = linesleft - 1;
    + // next
    + } else {
    + debug!("Return tuple {}", lineoff);
    + return
    + }
    }
    Check delete flag in scan

    View Slide

  98. impl<'a> CountState<'a> {
    pub fn new(lefttree: ScanState<'a>) -> CountState<'a> {
    CountState {
    lefttree: lefttree,
    result: 0,
    }
    }
    // `ExecAgg` in pg.
    pub fn exec_agg(&mut self, bufmrg: &mut BufferManager) {
    loop {
    let opt = self.lefttree.exec_scan(bufmrg);
    match opt {
    Some(_slot) => {
    self.result = self.result + 1;
    },
    None => break
    }
    }
    }
    }
    get tuple
    count up
    • Count need not to take
    care of the flag

    • lefttree omit deleted
    tuples

    View Slide

  99. -JNJUBUJPOTPGEFMFUFDPNNBOE
    • delete command deletes all records because we can not pass where
    clause

    • vacuum is not implemented

    View Slide

  100. &WBMVBUJOHFYQSFTTJPOT XIFSF
    DMBVTF

    select * from films where did = 1;
    QueryDesc (SELECT):
    planstate
    SeqScanState:
    qual
    ExprSate: SCAN_FETCHSOME, SCAN_VAR, FUNCEXPR_STRICT, QUAL, DONE
    ss_ScanTupleSlot
    TupleTableSlot: tts_isempty: true, tts_fixedTupleDescriptor: true
    tts_tupleDescriptor
    TupleDesc: natts: 6
    attrs
    PgAttribute: attname: code
    ...

    View Slide

  101. &WBMVBUJOHFYQSFTTJPOT XIFSF
    DMBVTF

    select * from films where did = 1;
    QueryDesc (SELECT):
    planstate
    SeqScanState:
    qual
    ExprSate: SCAN_FETCHSOME, SCAN_VAR, FUNCEXPR_STRICT, QUAL, DONE
    ss_ScanTupleSlot
    TupleTableSlot: tts_isempty: true, tts_fixedTupleDescriptor: true
    tts_tupleDescriptor
    TupleDesc: natts: 6
    attrs
    PgAttribute: attname: code
    ...

    View Slide

  102. &WBMVBUJOHFYQSFTTJPOT XIFSF
    DMBVTF

    select * from films where did = 1;
    QueryDesc (SELECT):
    planstate
    SeqScanState:
    qual
    ExprSate: SCAN_FETCHSOME, SCAN_VAR, FUNCEXPR_STRICT, QUAL, DONE
    ss_ScanTupleSlot
    TupleTableSlot: tts_isempty: true, tts_fixedTupleDescriptor: true
    tts_tupleDescriptor
    TupleDesc: natts: 6
    attrs
    PgAttribute: attname: code
    ...
    Byte codes …

    View Slide

  103. &WBMVBUJOHFYQSFTTJPOT XIFSF
    DMBVTF

    • Expression like "where did = 1" needed to be evaluated to implement
    where clause

    • "direct threaded" or "switch threaded" or “llvmjit" in pg

    • https://magazine.rubyist.net/articles/0008/0008-YarvManiacs.html

    • It's overspec, so implement it by tree traverse

    View Slide

  104. *NQMFNFOUXIFSFDMBVTF
    $ minidb /tmp/minidb execute 'select count() from db1.table1'
    Count: 5
    $ minidb /tmp/minidb execute 'select count() from db1.table1 where true'
    Count: 5
    $ minidb /tmp/minidb execute 'select count() from db1.table1 where false'
    Count: 0
    $ minidb /tmp/minidb execute 'select count() from db1.table1 where true = false'
    Count: 0
    $ minidb /tmp/minidb execute 'select * from db1.table1'
    "1" "12"
    "2" "13"
    "3" "12"
    "4" "20"
    "5" "21"
    $ minidb /tmp/minidb execute 'select count() from db1.table1 where age = 12'
    Count: 2

    View Slide

  105. *NQMFNFOUXIFSFDMBVTF
    pub enum Expr {
    Bool(bool),
    Number(i32),
    OpEq(Box, Box), // "="
    ColumnRef(String), // column name
    }
    struct ExprEvaluator<'a> {
    currentTuple: &'a TupleTableSlot,
    expr: &'a Expr,
    }
    where did = 1
    +------+
    | OpEq |
    +------+
    |
    | +------------------+
    +-| ColumnRef("did") |
    | +------------------+
    |
    | +-----------+
    +-| Number(1) |
    +-----------+
    has current tuple

    View Slide

  106. impl<'a> ScanState<'a> {
    fn exec(&mut self) -> Option<&TupleTableSlot> {
    loop {
    self.seq_next();
    if self.ss_currentScanDesc.rs_finished {
    return None;
    }
    if self.exec_qual() {
    return Some(self.ss_ScanTupleSlot.as_ref());
    }
    // next tuple
    }
    }
    fn exec_qual(&self) -> bool {
    if self.qual.is_none() {
    // Always condition is met
    return true;
    }
    let evaluator = ExprEvaluator::new(self.ss_ScanTupleSlot.as_ref(),
    self.qual.as_ref().unwrap().as_ref());
    evaluator.eval()
    }
    fetch next tuple from page
    check where clause
    set current tuple to evaluator then call eval

    View Slide

  107. impl<'a> ExprEvaluator<'a> {
    fn eval_rec(&self, expr: &Expr) -> Box {
    match expr {
    Expr::Bool(b) => {
    Box::new(Value::Bool(b.clone()))
    },
    Expr::Number(n) => {
    Box::new(Value::String(n.to_string()))
    },
    Expr::OpEq(e1, e2) => {
    let v1 = self.eval_rec(e1);
    let v2 = self.eval_rec(e2);
    Box::new(Value::Bool(self.op_eq(v1.as_ref(), v2.as_ref())))
    },
    Expr::ColumnRef(col_name) => {
    let i = self.currentTuple.get_index_from_name(col_name);
    let c = self.currentTuple.get_column(i);
    Box::new(Value::String(c.as_string()))
    },
    }
    }
    simple case
    eval operands then compare
    get value from current tuple

    View Slide

  108. "EEXIFSFUPEFMFUFDPNNBOE
    diff --git a/src/ast.rs b/src/ast.rs
    @@ -4,8 +4,8 @@ pub enum Stmt {
    SelectStmt(Box, String, String, Option>),
    // dbname, tablename, keys, values
    InsertStmt(String, String, Vec, Vec>),
    - // dbname, tablename
    - DeleteStmt(String, String),
    + // dbname, tablename, where_clause
    + DeleteStmt(String, String, Option>),
    }
    diff --git a/src/parser.lalrpop b/src/parser.lalrpop
    @@ -12,8 +12,8 @@ pub Statement: Stmt = {
    Stmt::InsertStmt(fi.0, fi.1, li1, li2)
    },
    // DELETE FROM dbname "." tablename
    - Delete_T From_T => {
    - Stmt::DeleteStmt(fi.0, fi.1)
    + Delete_T From_T => {
    + Stmt::DeleteStmt(fi.0, fi.1, ow)
    },
    }
    Add where member to Node
    Change syntax to accept where

    View Slide

  109. "EEXIFSFUPEFMFUFDPNNBOE
    diff --git a/src/dml.rs b/src/dml.rs
    @@ -122,7 +122,7 @@ impl DeleteCommnad {
    }
    }
    - pub fn execute(&self, dbname: &str, table_name: &str, cmgr: &CatalogManager) -> Result<(), String> {
    + pub fn execute(&self, dbname: &str, table_name: &str, cmgr: &CatalogManager, qual: &Option>) ->
    Result<(), String> {
    let db_oid = cmgr.database_rm.find_mini_database_oid(dbname)
    .expect(&format!("{} database should be defined.", dbname));
    let table_oid = cmgr.class_rm.find_mini_class_oid(db_oid, table_name)
    @@ -131,7 +131,7 @@ impl DeleteCommnad {
    let mut rmgr = RelationManager::new(self.config.clone());
    let relation = rmgr.get_relation(db_oid, table_oid);
    let mut bm = BufferManager::new(1, self.config.clone());
    - let scan = ScanState::new(relation, &rm, &mut bm, &None);
    + let scan = ScanState::new(relation, &rm, &mut bm, qual);
    let mut delete = DeleteState::new(relation, scan);
    delete.exec_delete(&mut bm);
    Pass where expressions to scan

    View Slide

  110. -JNJUBUJPOTPGXIFSFDMBVTF
    • Support only few basic operations

    View Slide

  111. Part 4: Conclusion

    View Slide

  112. 4UFQTUPDSFBUJOHEC
    1. Implement init command

    2. Implement create database / create table command

    3. Implement insert command / select command

    4. Implement buffer manager / storage manager

    • Rewrite insert command to use these managers

    5. Implement count command

    6. Implement parser

    7. Implement delete

    8. Implement where clause

    • Add where to delete

    View Slide

  113. -JNJUBUJPOTPGFBDIDPNQPOFOU
    • The system catalogs

    • Use simple data format, e.g. JSON, CSV, not binary format

    • Create table

    • Signed 4 bytes integer is only supported type

    • Only support `id integer, age integer` as table schema

    • Buffer Manager

    • No eviction logic

    • No lock mechanism

    View Slide

  114. -JNJUBUJPOTPGFBDIDPNQPOFOU
    • Storage Manager

    • Fork is not supported

    • Segment is not supported

    • Count command

    • Count logic is not generalized to aggregation

    • Delete command

    • Vacuum is not implemented

    • Where clause

    • Support only few basic operations

    View Slide

  115. -JNJUBUJPOTPGFBDIDPNQPOFOU
    • Query Execution

    • Do not implement analyzer

    • Should do semantic analysis when the query is executed

    • Do not implement optimizer

    • Can not use index even if index is implemented

    View Slide

  116. 3FGFSFODFT
    • Raghu Ramakrishnan, Johannes Gehrke: "Database Management
    Systems”

    • Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom: “Database
    Systems: Pearson New International Edition: The Complete Book”

    • উມஐ੒, ࠤഢণथ, ݪాొࢤ: “ʦվగ৽൛ʧ಺෦ߏ଄͔ΒֶͿ
    PostgreSQL ઃܭɾӡ༻ܭըͷమଇ”

    • “The Internals of PostgreSQL” http://www.interdb.jp/pg/index.html

    • “PostgreSQL ͷ֮͑ॻ͖ͷΠϯσοΫε” http://www.nminoru.jp/
    ~nminoru/postgresql/

    View Slide

  117. "DLOPXMFEHNFOUT
    • @takeshinoda and @hkdnet

    • Reviewing the slide

    • Oracle Corporation Japan

    • Giving me the chance to talk about DB in Oracle!!!

    View Slide

  118. Thank you !!

    View Slide