Upgrade to Pro — share decks privately, control downloads, hide ads and more …

InnoDB Magic

InnoDB Magic

Sunny Gleason

December 10, 2010
Tweet

More Decks by Sunny Gleason

Other Decks in Technology

Transcript

  1. What’s in this Preso • What is InnoDB? • Relation

    to MySQL & Other Products • InnoDB Model • g414-inno: a Java Access Library for InnoDB
  2. What else is in this Preso • Creating a Voldemort

    Storage Engine with Embedded InnoDB • St8: A REST-based Storage Server • Faban Benchmark Results
  3. What is InnoDB? • High-Performance “guts” of MySQL • Finely

    Tuned B-Tree Storage Engine • MVCC Transactional Store a la Jim Gray (“Transactional Processing Systems”) • Available Stand-Alone as Embedded InnoDB (stagnant) or HailDB (drizzle)
  4. Relation to MySQL • One of many MySQL storage engines

    • Transactional, in contrast to MYISAM • Well-known, Bullet-Proof Backup, Failure & Recovery Modes • Advanced Buffer Pool Management (adaptive hash index, tunable LRU) • Online Backup Support (Xtrabackup / Hot)
  5. Other Products • Tokyo BDB, Oracle BDB & BDB-JE •

    Schema-Free (No Structure / Data Types) • Lower Concurrency (fewer writers) • Performance Degradation in Larger DBs • (TODO: quantify performance gap - in meantime, see Dynamo & Voldemort)
  6. InnoDB Model (Logical) • Database == Tablespace • Tablespace has

    Table(s) and Log(s) • Table has columns (rich datatypes) • Tables have a PRIMARY clustered index • Tables may have SECONDARY indexes • Row == Tuple • Tuples are stored / clustered by index sort • Secondary index stores full Primary Key
  7. InnoDB Model (Txns) • Everything uses a Transaction • Isolation

    Level: Serialized, Read Committed, Read Uncommitted • Locks: Shared (Read-only), Exclusive (Read/Write) • Cursors provide access to tables: Lookup by index, Iteration / Traversal • Secondary index contains partial Tuples • Secondary cursor can access primary (full tuple)
  8. InnoDB Model (Physical) • Tablespace is a collection of pages

    (16K) • Pages organized as a B-Tree: infimum & supremum keys, pointers to children • Pages contain row or index tuple data, or blob overflow data • Pages written to log first and flushed to tablespace based on ‘sync’ policy
  9. Physical Considerations • New pages requested from OS in extend_size

    increments • OS Assigns space from file system / partition “free list” • Temporal Locality (pages close together) • Spatial Locality / Fragmentation from Updates • Prefer “narrow” rows / indexes: faster scan, keeps working set in-memory • Secondary “covering” indexes can save primary index access
  10. !""#$%&$'(')'*+&,-.+* !"#$%$ &!'() !"#$%&'$()*%#+(,% !"#$%"&' (&#& (!)#!*"&%+ -"!./'0($('012%,$32" *++,-. %$"'()

    /0 !""*(,-.!'$-/$%-#&,'$ 0!,( .!'$1 0.%2 .!'$1 3"(* '*41 !"1$%#5 ,3..$%5 *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  11. !""#$%&'()*+,-(.+, !"#$%# &$'($%# !"#$%# !"#$%# !"#$%# )%*$"#$%#*+*,-*.)'$/ !"#$%# 01"*23 456

    72$83*9 4588*.52%#$1 72$83*.52%#$1/ 72$83*: 72$83*% 456 ;)'$ 456 456 456 456 <$)=*%53$*/$'($%# 0)>8$/.)?$ 4588>)?@*/$'($%# A5%B8$)=*%53$*/$'($%# 456 456 *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  12. !""#$%&'#() !"#$%&'()*+, * * #+,-./#( 012, -./0-12$3"456 -./0-12$3"456 !"#$%&'(&%'''')%*'+,''''!$--'./%''''0-&'./%1'''$2"%3-$45.67"'./%'88'09"-&'26-:"1 #+,-./#(

    012, * * 7890/:-2$3"456 7890/:-2$3"456 !"#$%&'( *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  13. !""#$%&!"'()(*&+ ,-./0-1 !$020&-#3*&0-(&*2#-('& ."&24(&%+2-((&5(06& "#'(*&#6&0&758*2(-('& ."'() ! %+2-((&.*&#-90".:('& ;1&<-./0-1&=(1&#-& "#"+"855&8".>8(&=(1&

    #6&20;5(?&.6&'(6."('@& (5*(?&0"&."2(-"05& 7#58/"&3.24&A+;12(& BCDE!$&.*&0''('F [[[ [[[ [[[ [[[     QQQ QQQ QQQ QQQ                ಥ ಥ ಥ ಥ     !"#$%&'&( )*'+,-'./0&.1 +2(&3                                                         [[[ [[[ [[[ [[[    ಥ ಥ ಥ ಥ        ಥ ಥ ಥ ಥ QQQ QQQ QQQ QQQ    ಥ ಥ ಥ ಥ     3.YDOXHV 3.YDOXHV 3.YDOXHV 3.YDOXHV        QQQ QQQ QQQ QQQ .H\YDOXHV .H\YDOXHV .H\YDOXHV .H\YDOXHV           GDWDIRU GDWDIRU GDWDIRU GDWDIRU FRUUHVSRQGLQJURZV FRUUHVSRQGLQJURZV FRUUHVSRQGLQJURZV FRUUHVSRQGLQJURZV ! ! "#$%&#'()*+,- *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  14. !""#$%&!"'()(*&+ ,(-#"'./0 ! ,(-#"'./0&1"'()&%+ 2/((&3(.4&"#'(*& -#"2.1"5&4#/&(.-6&7(0& 8.39(5&26(&:/1;./0& 7(0*&#4&26(& -#//(*:#"'1"<&/#=*5& 9*('&2#&.--(**&

    -39*2(/1"<&1"'()&2#& #>2.1"&26(&'.2. !"#$%&'&( )*'+,-'./0&.1 +2(&3 !"#$%&'&( )*'+,-'./0&.1 +2(&3 4&!52(-'./+2(&3 3.YDOXHV 3.YDOXHV 3.YDOXHV 3.YDOXHV        QQQ QQQ QQQ QQQ 67%'&&/"&-8/25(&$9/!52%-+2+2:/(-%- NH\YDOXHV NH\YDOXHV NH\YDOXHV NH\YDOXHV $= $= $= $= 67%'&&/"&-8/25(&$9/!52%-+2+2: ;<$ 4&!52(-'./+2(&3 NH\YDOXHV NH\YDOXHV NH\YDOXHV NH\YDOXHV $= $= $= $= 67%'&&/"&-8/25(&$9/!52%-+2+2: ;<$ ,(-#"'./0 !"#$% *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  15. How can we use InnoDB? • Download Embedded InnoDB or

    HailDB • Use C-API for access to InnoDB tables • Innostore: Erlang library for InnoDB access (from Basho’s Riak NoSQL project) • g414-inno: Open-Source Java access library for Embedded InnoDB
  16. g414-inno Foundations • Uses JNA (Java Native Access): Like JNI,

    but doesn’t provoke (as much) insanity • JNAerate: creates thin Java Class wrapper from a C-based header file (innodb.h) • But, complex C API’s are super ugly in Java • Need to clean that up a bit...
  17. g414-inno Library • Provides a more Object-Oriented API to mask

    all of the JNA “Pointer” madness • Transaction Objects, Cursors, Table Builder, Tuple Builder, Datatype Validation • Java Enum Types for ‘int’ enums in C API • inTransaction() templates (like Spring, JDBI) • Contains sanity checks to prevent common errors (mostly C API order of operations)
  18. Use Case: Voldemort • Voldemort: High-Performance Key-Value Store (Amazon Dynamo

    clone) • Nokia: good results with Voldemort on MySQL with InnoDB • Typical features of DB (network connectivity, SQL language) not really necessary • Thought: why bother with DB layer? The g414-inno project is born ...
  19. Voldemort Storage Engines • Trivial to integrate new persistence mechanisms

    with Voldemort • 2 Classes: Config & Storage Engine • Trivial InnoDB Table: key_ VARBINARY(200) NOT NULL version_ VARBINARY(200) NOT NULL value_ BLOB PRIMARY KEY (key_, version_) • 3 Operations: put(k, v), get(k), delete(k) • Complication: k is Versioned<Key>
  20. V Storage Engine: put • put(byte[] key, byte[] version, byte[]

    value) • Start transaction, open table cursor • Create search tuple for key • Cursor.find(key) • Foreach row matching key if row.version is below, delete row if row.version is above, throw exception • Cursor.insert(key, version, value)
  21. V Storage Engine: get • get(byte[] key, byte[] version) •

    Start transaction, open table cursor • Create search tuple for key • Cursor.find(key) • Foreach row matching key add to results • Return results
  22. V Storage Engine: delete • delete(byte[] key) • Start transaction,

    open table cursor • Create search tuple for key • Cursor.find(key) • Foreach row matching key delete row
  23. V Storage Engine: TODO • Perform Benchmarks (in EC2, local)

    • Tuning / Optimization • Clarify licenses (GPLv2 + Apache == ouch) • Organize & streamline distribution
  24. St8 • Simple, Open Source REST-based Storage Server • Wraps

    InnoDB with thin “but pleasant” HTTP API • Custom Tables using JSON table definitions • Natural, JSON-based access to tables: CRUD, Index- based Query & Iteration • Under the hood: Jetty, Jersey, Guice, Jackson, g414- inno, Embedded InnoDB
  25. St8 Table Def { "columns":[ {"name":"key1","type":"INT","length":4}, {"name":"key2","type":"VARCHAR","length":50}, {"name":"val","type":"BLOB","length":0} ], "indexes":[

    { "name":"PRIMARY", "clustered":true,"unique":true, "indexColumns":[{"name":"key1"}] }, { "name":"key2", "clustered":false,"unique":false, "indexColumns":[{"name":"key2"}] } ] }
  26. St8 Interface • Operations for Table Management: create, describe, delete,

    truncate • Operations for Data Management: Create, Retrieve, Update, Delete • Influences g414-inno design: template methods for inTransaction(), insert, update, insertOrUpdate, delete, load • Coming Soon: Query & Iteration APIs
  27. St8: Sample Requests SIMPLE GET: curl "http://localhost:8080/d/atable;key1=123" INSERT: curl -X

    PUT "http://localhost:8080/d/atable;key1=123;key2=ABC;val=AVERYLONGDATA" UPDATE: curl -X POST "http://localhost:8080/d/atable;key1=123;key2=CDE;val=NEWDATA" DELETE: curl -X DELETE "http://localhost:8080/d/atable;key1=123"
  28. g414-inno: Faban Benchmark • Row: 4-byte Key, 4096-byte value •

    Insert Sequential, Random • Single disk, 3-disk RAID 0, SSD • TODO: Concurrent Benchmarks, Mixed Read/Write
  29. Benchmark Results 0 5 10 15 20 InsertSeq InsertRnd SelectRnd

    Embedded InnoDB Latency (ms) Single Disk (OS X 1) 3-Disk Raid 0 (OS X 1) SSD (OS X 2) Single-Threaded Benchmarks InsertSeq InsertRnd SelectRnd Single Disk (OS X 1) 9.0 9.3 16 3-Disk Raid 0 (OS X 1) 0.47 1.4 5.2 SSD (OS X 2) 0.51 1.2 0.71
  30. Next Steps / Future Work • Finish St8: Queries &

    Iteration, Benchmark • Package / Qualify Voldemort Storage Engine • Integrate with Xtrabackup (hot backup) • Integrate with Sqoop (hadoop export) • Explore more advanced App-Level Replication Support
  31. References / More Info • Embedded InnoDB, HailDB (drizzle) •

    InnoDB Performance • GitHub: g414-inno, st8, voldemort, xfaban • Java Native Access (JNA) • Tokyo BDB, Oracle BDB & BDB-JE • Amazon Dynamo; Voldemort Project