Upgrade to Pro — share decks privately, control downloads, hide ads and more …

InnoDB Magic

Sponsored · SiteGround - Reliable hosting with speed, security, and support you can count on.

InnoDB Magic

Avatar for Sunny Gleason

Sunny Gleason

December 10, 2010
Tweet

More Decks by Sunny Gleason

Other Decks in Technology

Transcript

  1. What’s in this Preso • What is InnoDB? • Relation

    to MySQL & Other Products • InnoDB Model • g414-inno: a Java Access Library for InnoDB
  2. What else is in this Preso • Creating a Voldemort

    Storage Engine with Embedded InnoDB • St8: A REST-based Storage Server • Faban Benchmark Results
  3. What is InnoDB? • High-Performance “guts” of MySQL • Finely

    Tuned B-Tree Storage Engine • MVCC Transactional Store a la Jim Gray (“Transactional Processing Systems”) • Available Stand-Alone as Embedded InnoDB (stagnant) or HailDB (drizzle)
  4. Relation to MySQL • One of many MySQL storage engines

    • Transactional, in contrast to MYISAM • Well-known, Bullet-Proof Backup, Failure & Recovery Modes • Advanced Buffer Pool Management (adaptive hash index, tunable LRU) • Online Backup Support (Xtrabackup / Hot)
  5. Other Products • Tokyo BDB, Oracle BDB & BDB-JE •

    Schema-Free (No Structure / Data Types) • Lower Concurrency (fewer writers) • Performance Degradation in Larger DBs • (TODO: quantify performance gap - in meantime, see Dynamo & Voldemort)
  6. InnoDB Model (Logical) • Database == Tablespace • Tablespace has

    Table(s) and Log(s) • Table has columns (rich datatypes) • Tables have a PRIMARY clustered index • Tables may have SECONDARY indexes • Row == Tuple • Tuples are stored / clustered by index sort • Secondary index stores full Primary Key
  7. InnoDB Model (Txns) • Everything uses a Transaction • Isolation

    Level: Serialized, Read Committed, Read Uncommitted • Locks: Shared (Read-only), Exclusive (Read/Write) • Cursors provide access to tables: Lookup by index, Iteration / Traversal • Secondary index contains partial Tuples • Secondary cursor can access primary (full tuple)
  8. InnoDB Model (Physical) • Tablespace is a collection of pages

    (16K) • Pages organized as a B-Tree: infimum & supremum keys, pointers to children • Pages contain row or index tuple data, or blob overflow data • Pages written to log first and flushed to tablespace based on ‘sync’ policy
  9. Physical Considerations • New pages requested from OS in extend_size

    increments • OS Assigns space from file system / partition “free list” • Temporal Locality (pages close together) • Spatial Locality / Fragmentation from Updates • Prefer “narrow” rows / indexes: faster scan, keeps working set in-memory • Secondary “covering” indexes can save primary index access
  10. !""#$%&$'(')'*+&,-.+* !"#$%$ &!'() !"#$%&'$()*%#+(,% !"#$%"&' (&#& (!)#!*"&%+ -"!./'0($('012%,$32" *++,-. %$"'()

    /0 !""*(,-.!'$-/$%-#&,'$ 0!,( .!'$1 0.%2 .!'$1 3"(* '*41 !"1$%#5 ,3..$%5 *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  11. !""#$%&'()*+,-(.+, !"#$%# &$'($%# !"#$%# !"#$%# !"#$%# )%*$"#$%#*+*,-*.)'$/ !"#$%# 01"*23 456

    72$83*9 4588*.52%#$1 72$83*.52%#$1/ 72$83*: 72$83*% 456 ;)'$ 456 456 456 456 <$)=*%53$*/$'($%# 0)>8$/.)?$ 4588>)?@*/$'($%# A5%B8$)=*%53$*/$'($%# 456 456 *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  12. !""#$%&'#() !"#$%&'()*+, * * #+,-./#( 012, -./0-12$3"456 -./0-12$3"456 !"#$%&'(&%'''')%*'+,''''!$--'./%''''0-&'./%1'''$2"%3-$45.67"'./%'88'09"-&'26-:"1 #+,-./#(

    012, * * 7890/:-2$3"456 7890/:-2$3"456 !"#$%&'( *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  13. !""#$%&!"'()(*&+ ,-./0-1 !$020&-#3*&0-(&*2#-('& ."&24(&%+2-((&5(06& "#'(*&#6&0&758*2(-('& ."'() ! %+2-((&.*&#-90".:('& ;1&<-./0-1&=(1&#-& "#"+"855&8".>8(&=(1&

    #6&20;5(?&.6&'(6."('@& (5*(?&0"&."2(-"05& 7#58/"&3.24&A+;12(& BCDE!$&.*&0''('F [[[ [[[ [[[ [[[     QQQ QQQ QQQ QQQ                ಥ ಥ ಥ ಥ     !"#$%&'&( )*'+,-'./0&.1 +2(&3                                                         [[[ [[[ [[[ [[[    ಥ ಥ ಥ ಥ        ಥ ಥ ಥ ಥ QQQ QQQ QQQ QQQ    ಥ ಥ ಥ ಥ     3.YDOXHV 3.YDOXHV 3.YDOXHV 3.YDOXHV        QQQ QQQ QQQ QQQ .H\YDOXHV .H\YDOXHV .H\YDOXHV .H\YDOXHV           GDWDIRU GDWDIRU GDWDIRU GDWDIRU FRUUHVSRQGLQJURZV FRUUHVSRQGLQJURZV FRUUHVSRQGLQJURZV FRUUHVSRQGLQJURZV ! ! "#$%&#'()*+,- *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  14. !""#$%&!"'()(*&+ ,(-#"'./0 ! ,(-#"'./0&1"'()&%+ 2/((&3(.4&"#'(*& -#"2.1"5&4#/&(.-6&7(0& 8.39(5&26(&:/1;./0& 7(0*&#4&26(& -#//(*:#"'1"<&/#=*5& 9*('&2#&.--(**&

    -39*2(/1"<&1"'()&2#& #>2.1"&26(&'.2. !"#$%&'&( )*'+,-'./0&.1 +2(&3 !"#$%&'&( )*'+,-'./0&.1 +2(&3 4&!52(-'./+2(&3 3.YDOXHV 3.YDOXHV 3.YDOXHV 3.YDOXHV        QQQ QQQ QQQ QQQ 67%'&&/"&-8/25(&$9/!52%-+2+2:/(-%- NH\YDOXHV NH\YDOXHV NH\YDOXHV NH\YDOXHV $= $= $= $= 67%'&&/"&-8/25(&$9/!52%-+2+2: ;<$ 4&!52(-'./+2(&3 NH\YDOXHV NH\YDOXHV NH\YDOXHV NH\YDOXHV $= $= $= $= 67%'&&/"&-8/25(&$9/!52%-+2+2: ;<$ ,(-#"'./0 !"#$% *source: http://www.mysqlconf.com/mysql2009/public/schedule/detail/7052
  15. How can we use InnoDB? • Download Embedded InnoDB or

    HailDB • Use C-API for access to InnoDB tables • Innostore: Erlang library for InnoDB access (from Basho’s Riak NoSQL project) • g414-inno: Open-Source Java access library for Embedded InnoDB
  16. g414-inno Foundations • Uses JNA (Java Native Access): Like JNI,

    but doesn’t provoke (as much) insanity • JNAerate: creates thin Java Class wrapper from a C-based header file (innodb.h) • But, complex C API’s are super ugly in Java • Need to clean that up a bit...
  17. g414-inno Library • Provides a more Object-Oriented API to mask

    all of the JNA “Pointer” madness • Transaction Objects, Cursors, Table Builder, Tuple Builder, Datatype Validation • Java Enum Types for ‘int’ enums in C API • inTransaction() templates (like Spring, JDBI) • Contains sanity checks to prevent common errors (mostly C API order of operations)
  18. Use Case: Voldemort • Voldemort: High-Performance Key-Value Store (Amazon Dynamo

    clone) • Nokia: good results with Voldemort on MySQL with InnoDB • Typical features of DB (network connectivity, SQL language) not really necessary • Thought: why bother with DB layer? The g414-inno project is born ...
  19. Voldemort Storage Engines • Trivial to integrate new persistence mechanisms

    with Voldemort • 2 Classes: Config & Storage Engine • Trivial InnoDB Table: key_ VARBINARY(200) NOT NULL version_ VARBINARY(200) NOT NULL value_ BLOB PRIMARY KEY (key_, version_) • 3 Operations: put(k, v), get(k), delete(k) • Complication: k is Versioned<Key>
  20. V Storage Engine: put • put(byte[] key, byte[] version, byte[]

    value) • Start transaction, open table cursor • Create search tuple for key • Cursor.find(key) • Foreach row matching key if row.version is below, delete row if row.version is above, throw exception • Cursor.insert(key, version, value)
  21. V Storage Engine: get • get(byte[] key, byte[] version) •

    Start transaction, open table cursor • Create search tuple for key • Cursor.find(key) • Foreach row matching key add to results • Return results
  22. V Storage Engine: delete • delete(byte[] key) • Start transaction,

    open table cursor • Create search tuple for key • Cursor.find(key) • Foreach row matching key delete row
  23. V Storage Engine: TODO • Perform Benchmarks (in EC2, local)

    • Tuning / Optimization • Clarify licenses (GPLv2 + Apache == ouch) • Organize & streamline distribution
  24. St8 • Simple, Open Source REST-based Storage Server • Wraps

    InnoDB with thin “but pleasant” HTTP API • Custom Tables using JSON table definitions • Natural, JSON-based access to tables: CRUD, Index- based Query & Iteration • Under the hood: Jetty, Jersey, Guice, Jackson, g414- inno, Embedded InnoDB
  25. St8 Table Def { "columns":[ {"name":"key1","type":"INT","length":4}, {"name":"key2","type":"VARCHAR","length":50}, {"name":"val","type":"BLOB","length":0} ], "indexes":[

    { "name":"PRIMARY", "clustered":true,"unique":true, "indexColumns":[{"name":"key1"}] }, { "name":"key2", "clustered":false,"unique":false, "indexColumns":[{"name":"key2"}] } ] }
  26. St8 Interface • Operations for Table Management: create, describe, delete,

    truncate • Operations for Data Management: Create, Retrieve, Update, Delete • Influences g414-inno design: template methods for inTransaction(), insert, update, insertOrUpdate, delete, load • Coming Soon: Query & Iteration APIs
  27. St8: Sample Requests SIMPLE GET: curl "http://localhost:8080/d/atable;key1=123" INSERT: curl -X

    PUT "http://localhost:8080/d/atable;key1=123;key2=ABC;val=AVERYLONGDATA" UPDATE: curl -X POST "http://localhost:8080/d/atable;key1=123;key2=CDE;val=NEWDATA" DELETE: curl -X DELETE "http://localhost:8080/d/atable;key1=123"
  28. g414-inno: Faban Benchmark • Row: 4-byte Key, 4096-byte value •

    Insert Sequential, Random • Single disk, 3-disk RAID 0, SSD • TODO: Concurrent Benchmarks, Mixed Read/Write
  29. Benchmark Results 0 5 10 15 20 InsertSeq InsertRnd SelectRnd

    Embedded InnoDB Latency (ms) Single Disk (OS X 1) 3-Disk Raid 0 (OS X 1) SSD (OS X 2) Single-Threaded Benchmarks InsertSeq InsertRnd SelectRnd Single Disk (OS X 1) 9.0 9.3 16 3-Disk Raid 0 (OS X 1) 0.47 1.4 5.2 SSD (OS X 2) 0.51 1.2 0.71
  30. Next Steps / Future Work • Finish St8: Queries &

    Iteration, Benchmark • Package / Qualify Voldemort Storage Engine • Integrate with Xtrabackup (hot backup) • Integrate with Sqoop (hadoop export) • Explore more advanced App-Level Replication Support
  31. References / More Info • Embedded InnoDB, HailDB (drizzle) •

    InnoDB Performance • GitHub: g414-inno, st8, voldemort, xfaban • Java Native Access (JNA) • Tokyo BDB, Oracle BDB & BDB-JE • Amazon Dynamo; Voldemort Project