Slide 1

Slide 1 text

Transactional Storage for MySQL FAST. RELIABLE. PROVEN. InnoDB Internals: InnoDB File Formats and Source Code Structure Heikki Tuuri CEO Innobase Oy Vice President, Development Oracle Corporation MySQL Conference, April 2009 Calvin Sun Principal Engineer Oracle Corporation

Slide 2

Slide 2 text

Today’s Topics • Goals of InnoDB • Key Functional Characteristics • InnoDB Design Considerations • InnoDB Architecture • InnoDB On Disk Format • Source Code Structure • Q & A

Slide 3

Slide 3 text

Goals of InnoDB • OLTP oriented • Performance, Reliability, Scalability • Data Protection • Portability

Slide 4

Slide 4 text

InnoDB Key Functional Characteristics • Full transaction support • Row-level locking • MVCC • Crash recovery • Efficient IO

Slide 5

Slide 5 text

Design Considerations • Modeled on Gray & Reuter’s “Transactions Processing: Concepts & Techniques” • Also emulated the Oracle architecture • Added unique subsystems • Doublewrite • Insert buffering • Adaptive hash index • Designed to evolve with changing hardware & requirements

Slide 6

Slide 6 text

InnoDB Architecture IO Buffer File Space Manager Transaction Handler API Embedded InnoDB API Cursor / Row Mini- transaction Lock B-tree Page Server Applications

Slide 7

Slide 7 text

InnoDB On Disk Format • InnoDB Database Files • InnoDB Tablespaces • InnoDB Pages / Extents • InnoDB Rows • InnoDB Indexes • InnoDB Logs • File Format Design Considerations

Slide 8

Slide 8 text

InnoDB Database Files ibdata files System tablespace internal data dictionary MySQL Data Directory InnoDB tables OR innodb_file_per_table .ibd files .frm files undo logs insert buffer

Slide 9

Slide 9 text

InnoDB Tablespaces • A tablespace consists of multiple files and/or raw disk partitions. file_name:file_size[:autoextend[:max:max_file_size]] • A file/partition is a collection of segments. • A segment consists of fixed-length pages. • The page size is always 16KB in uncompressed tablespaces, and 1KB-16KB in compressed tablespaces (for both data and index).

Slide 10

Slide 10 text

System Tablespace • Internal Data Dictionary • Undo • Insert Buffer • Doublewrite Buffer • MySQL Replication Info

Slide 11

Slide 11 text

InnoDB Tablespaces Extent Segment Extent Extent Extent an extent = 64 pages Extent Trx id Row Field 1 Roll pointer Field pointers Field 2 Field n Row Page Row Row Row Row Leaf node segment Tablespace Rollback segment Non-leaf node segment Row Row

Slide 12

Slide 12 text

InnoDB Pages Symbol Value Notes FIL_PAGE_INODE 3 File segment inode FIL_PAGE_INDEX 17855 B-tree node FIL_PAGE_TYPE_BLOB 10 Uncompressed BLOB page FIL_PAGE_TYPE_ZBLOB 11 1st compressed BLOB page FIL_PAGE_TYPE_ZBLOB2 12 Subsequent compressed BLOB page FIL_PAGE_TYPE_SYS 6 System page FIL_PAGE_TYPE_TRX_SYS 7 Transaction system page others i-buf bitmap, I-buf free list, file space header, extent desp page, new allocated page InnoDB Page Types InnoDB Page Types

Slide 13

Slide 13 text

InnoDB Pages A page consists of: a page header, a page trailer, and a page body (rows or other contents). Page header Page trailer row offset array Row Row Row Row Row Row Row Row Row Row Row

Slide 14

Slide 14 text

Page Declares typedef struct /* a space address */ { ulint pageno; /* page number within the file */ ulint boffset; /* byte offset within the page */ } fil_addr_t; typedef struct { ulint checksum; /* checksum of the page (since 4.0.14) */ ulint page_offset; /* page offset inside space */ fil_addr_t previous; /* offset or fil_addr_t */ fil_addr_t next; /* offset or fil_addr_t */ dulint page_lsn; /* lsn of the end of the newest modification log record to the page */ PAGE_TYPE page type; /* file page type */ dulint file_flush_lsn;/* the file has been flushed to disk at least up to this lsn */ int space_id; /* space id of the page */ char data[]; /* will grow */ ulint page_lsn; /* the last 4 bytes of page_lsn */ ulint checksum; /* page checksum, or checksum magic, or 0 */ } PAGE, *PAGE;

Slide 15

Slide 15 text

InnoDB Compressed Pages •InnoDB keeps a “modification log” in each page •Updates & inserts of small records are written to the log w/o page reconstruction; deletes don’t even require uncompression •Log also tells InnoDB if the page will compress to fit page size •When log space runs out, InnoDB uncompresses the page, applies the changes and recompresses the page Page header modification log Page trailer page directory compressed data BLOB pointers empty space

Slide 16

Slide 16 text

InnoDB Rows prefix(768B) … … overflow page COMACT format COMACT format Record hdr Trx ID Roll ptr Fld ptrs overflow-page ptr .. Field values overflow page … … DYNAMIC format DYNAMIC format 20 bytes

Slide 17

Slide 17 text

InnoDB Indexes - Primary ●Data rows are stored in the B-tree leaf nodes of a clustered index ● B-tree is organized by primary key or non-null unique key of table, if defined; else, an internal column with 6-byte ROW_ID is added. xxx xxx xxx xxx - - - - nnn nnn nnn nnn 001 001 001 001 - - - - 275 275 275 275 276 276 276 276 – – – – 500 500 500 500 clustered (primary key) index 501 501 501 501 - - - - 630 630 630 630 631 631 631 631 - - - - 768 768 768 768 769 769 769 769 - - - - 800 800 800 800 801 801 801 801 - - - - 949 949 949 949 950 950 950 950 - - - - xxx xxx xxx xxx 001 001 001 001 – – – – 500 500 500 500 801 801 801 801 – – – – nnn nnn nnn nnn 500 500 500 500 – – – – 800 800 800 800 PK values PK values PK values PK values 001 001 001 001 - - - - nnn nnn nnn nnn Key values Key values Key values Key values 501 501 501 501- - - -630 630 630 630 + data for + data for + data for + data for corresponding rows corresponding rows corresponding rows corresponding rows … … Primary Index

Slide 18

Slide 18 text

InnoDB Indexes - Secondary ● Secondary index B- tree leaf nodes contain, for each key value, the primary keys of the corresponding rows, used to access clustering index to obtain the data clustered (primary key) index clustered (primary key) index Secondary index PK values PK values PK values PK values 001 001 001 001 - - - - nnn nnn nnn nnn B-tree leaf nodes, containing data key values key values key values key values A Z A Z A Z A Z B-tree leaf nodes, containing PKs Secondary index key values key values key values key values A Z A Z A Z A Z B-tree leaf nodes, containing PKs Secondary Index

Slide 19

Slide 19 text

DATA InnoDB Logging Rollback segments Log Buffer Buffer Pool redo log rollback Log File #1 Log File #2 log thread write thread log files ibdata files

Slide 20

Slide 20 text

InnoDB Redo Log Redo log structure: Space id PageNo OpCode Data end of log min LSN start of log last checkpoint

Slide 21

Slide 21 text

File Format Management • Builtin InnoDB format: “Antelope” • New “Barracuda” format enables compression,ROW_FORMAT=DYNAMIC • Fast index creation, other features do not require Barracuda file format • Builtin InnoDB can access “Antelope” databases, but not “Barracuda” databases • Check file format tag in system tablespace on startup • Enable a file format with new dynamic parameter innodb_file_format • Preserves ability to downgrade easily .ibd data files (file per table)

Slide 22

Slide 22 text

InnoDB File Format Design Considerations • Durability • Logging, doublewrite, checksum; • Performance • Insert buffering, table compression • Efficiency • Dynamic row format, table compression • Compatibility • File format management

Slide 23

Slide 23 text

Source Code Structure • 31 subdirectories • Relevant InnoDB source files on file formats • Tablespace: fsp0fsp {.c, .ic, .h} • Page: page0page, page0zip {.c, .ic, .h} • Log: log0log {.c, .ic, .h}

Slide 24

Slide 24 text

Source Code Subdirectories • buf • data • db • dict • dyn • eval • fil • fsp • fut • ha • handler • ibuf • include • lock • log • math • mem • mtr • os • page • pars • que • read • rem • row • srv • sync • thr • trx • usr • ut

Slide 25

Slide 25 text

Summary: Durability, Performance, Compatibility & Efficiency • InnoDB is the leading transactional storage engine for MySQL • InnoDB’s architecture is well-suited to modern, on- line transactional applications; as well as embedded applications. • InnoDB’s file format is designed for high durability, better performance, and easy to manage

Slide 26

Slide 26 text

For More Information … 2009 MySQL User Conference InnoDB Birds of a Feather Wed 7:30pm Ballroom C • Heikki Tuuri: Concurrency Control: How it Really Works, Thurs, 2:50pm Please visit www.innodb.com, blogs.innodb.com and forums.innodb.com

Slide 27

Slide 27 text

Q U E S T I O N S Q U E S T I O N S A N S W E R S A N S W E R S

Slide 28

Slide 28 text

company an Embedded Hot Backup Plugin

Slide 29

Slide 29 text

InnoDB Size Limits • Max # of tables: 4 G • Max size of a table: 32TB • Columns per table: 1000 • Max row size: n*4 GB • 8 kB if stored on the same page • n*4 GB with n BLOBs • Max key length: 3500 • Maximum tablespace size: 64 TB • Max # of concurrent trxs: 1023