Index Management in Depth

Index Management in Depth

Every time you choose how to store data in your database, a lot of things happen under the hood.
Making the best choice is even more important in those applications that aim to high performance.
The purpose of the talk is to show how indexes work and how slightly changing their combinations can impact on the performance of your application.

21935bc100d813458e925f4a0bef80db?s=128

Andrea Giuliano

February 20, 2015
Tweet

Transcript

  1. 1.

    I N D E X M A N A G

    E M E N T I N D E P T H A n d re a G i u l i a n o @ b i t _ s h a r k
  2. 2.

    D ATA B A S E M A N A

    G E M E N T S Y S T E M S
  3. 3.

    T H E A R C H I T E

    C T U R E D A TA B A S E M A N A G E M E N T S Y S T E M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser Optimizer Operator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  4. 4.

    S Q L E N G I N E A

    R C H I T E C T U R E O F A D B M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser Optimizer Operator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  5. 5.

    A C C E S S F I L E

    M A N A G E R A R C H I T E C T U R E O F A D B M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser Optimizer Operator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  6. 6.

    B U F F E R M A N A

    G E R A R C H I T E C T U R E O F A D B M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser Optimizer Operator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  7. 7.

    D I S K M A N A G E

    R A R C H I T E C T U R E O F A D B M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser Optimizer Operator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  8. 8.

    U N I T O F I N F O

    R M AT I O N Page size ranges from 2Kb to 64Kb I/O of pages dominates the cost of the operations
  9. 9.

    F I L E O R G A N I

    Z AT I O N method of arranging the records in a file
  10. 10.

    –Sars, Roebuck, and Co, Consumers’ Guide, 1897 “If you don’t

    find it in the index, look very carefully through the entire catalog.”
  11. 11.

    I N D E X E S organize data records

    on disk to optimize retrieval operations
  12. 12.

    I N D E X E S data entry record

    stored in an index file data record record stored in a database file
  13. 13.

    P R O P E R T I E S

    O F A N I N D E X 1. Structure of data entries 2. Clustered/unclustered 3. Primary/secondary 4. Dense/sparse 5. Organization of the index
  14. 14.

    S T R U C T U R E O

    F D ATA E N T R I E S k*: data entry whose search key value is k 1. k* is a data record -extreme case 2. k* is a pair (k, rid) -the index file is independent from the data file 3. k* is a pair <k, rid-list> -the index file is independent from the data file -better use of space but variable-length data entries P R O P E R T I E S O F A N I N D E X
  15. 15.

    C L U S T E R E D /

    U N C L U S T E R E D I N D E X unclustered clustered data record file index file data records data entries data entries index entries data records P R O P E R T I E S O F A N I N D E X
  16. 16.

    P E R F O R M A N C

    E data record file index file data records data entries Clustered only few pages have to be retrieved Unclustered as many data page I/Os as the number of data entries C L U S T E R E D / U N C L U S T E R E D I N D E X
  17. 17.

    P R I M A RY A N D S

    E C O N D A RY I N D E X P R O P E R T I E S O F A N I N D E X
  18. 18.

    10 20 30 40 10 20 30 40 page pointers

    page in index file page in data file P R O P E R T I E S O F A N I N D E X D E N S E I N D E X
  19. 19.

    10 30 50 70 10 20 30 40 50 60

    70 80 page pointers page in index file page in data file P R O P E R T I E S O F A N I N D E X S PA R S E I N D E X
  20. 20.

    P R O P E R T I E S

    O F A N I N D E X O R G A N I Z AT I O N O F T H E I N D E X sorted index the index is a sorted file tree-based index the index is a tree hash-based index the index is a function from search key values to record addresses
  21. 21.

    I S A M A N D B + -

    T R E E T R E E - B A S E D I N D E X ISAM used when the relation is static: no insertion and deletion on the tree b+-tree effective in dynamic situations with insertion and deletion
  22. 22.

    Data Pages Index Pages Overflow Pages Page Allocation in ISAM

    … … Non-leaf pages Leaf pages Primary pages Overflow page I S A M T R E E - B A S E D I N D E X
  23. 23.

    I S A M : O P E R AT

    I O N S T R E E - B A S E D I N D E X search identical to b+-tree (more on this soon). insertion find the right position on the tree and write the key (possible overflow pages). deletion remove the entry and the empty overflow page if needed. Leave the empty primary page as it is.
  24. 24.

    B + - T R E E T R E

    E - B A S E D I N D E X Index entries Data entries Index file
  25. 25.

    12 78 3 9 19 56 86 94 33 44

    Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 B + - T R E E T R E E - B A S E D I N D E X … … … … … … … …
  26. 26.

    12 78 3 9 19 56 86 94 33 44

    Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 … … … … … … … … B + - T R E E : S E A R C H T R E E - B A S E D I N D E X Start search
  27. 27.

    B + - T R E E : S E

    A R C H T R E E - B A S E D I N D E X 12 78 3 9 19 56 86 94 33 44 Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 … … … … … … … … Start search A B L1 L2 L3 find all data entries with 24 < key < 50
  28. 28.

    12 78 3 9 19 56 33 44 Daniel, 22,

    6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 … … … … … … Start search A B L1 L2 L3 S E A R C H : C O S T T R E E - B A S E D I N D E X f: fanout h: height m: leaves (f h) Cost of a search [logF m]
  29. 29.

    T R E E - B A S E D

    I N D E X 12 78 3 9 19 56 33 44 Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 … … … … … … Start search A B L1 L2 L3 S E A R C H : C O S T f = 3 h = 3 m = 27 I/Os [log3 27] = 3
  30. 30.

    B + - T R E E : I N

    S E R T T R E E - B A S E D I N D E X 13 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* Insertion of a data record with search key value 8
  31. 31.

    2* 3* 5* 7* 8* B + - T R

    E E : I N S E R T T R E E - B A S E D I N D E X … 5 data entry is copied up
  32. 32.

    B + - T R E E : I N

    S E R T T R E E - B A S E D I N D E X 17 data entry is pushed up 5 13 24 30
  33. 33.

    B + - T R E E : I N

    S E R T T R E E - B A S E D I N D E X 17 5* 7* 8* 14 * 16 * 19 * 20 * 22 * 24 * 27 * 29 * 33 * 34 * 38 * 39 * 5 13 24 30 2* 3* The resulting tree after the insertion of a data record with search key value 8.
  34. 34.

    B + - T R E E : D E

    L E T E T R E E - B A S E D I N D E X 17 5* 7* 8* 14 * 16 * 22 24 27 29 33 * 34 * 38 * 39 * 5 13 27 30 2* 3* The resulting tree after deleting entries 19* and 20*
  35. 35.
  36. 36.

    - 2.000.000 records - 200.000 pages - 10 data record

    in a page - 200 records with the same value of the attribute cost (on average) - dense unclustered b+-tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E D E N S E U N C L U S T E R E D I N D E X
  37. 37.

    SELECT code, author, publisher FROM Book WHERE cost = %cost%

    - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - dense unclustered b+-tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E D E N S E U N C L U S T E R E D I N D E X
  38. 38.

    SELECT code, author, publisher FROM Book WHERE cost = %cost%

    - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - dense unclustered b+-tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher how many page accesses do we need to answer to the query? E X A M P L E D E N S E U N C L U S T E R E D I N D E X
  39. 39.

    E X A M P L E D E N

    S E U N C L U S T E R E D I N D E X - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - dense unclustered b+-tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher Let’s build the index structure each record has 4 field so in each page there are 40 fields 20 data entries fit in one leaf page of the index the tree has a fan-out of 20
  40. 40.

    E X A M P L E D E N

    S E U N C L U S T E R E D I N D E X fanout: 20 occupancy factor of 67% leads to 13 data entries in the leaves How many leaves are there in the tree? 2.000.000/13 = 153.846
  41. 41.

    log20 (153.846) = 4 I/O page accesses E X A

    M P L E D E N S E U N C L U S T E R E D I N D E X How many I/Os are needed to go to the leaves? leaves: 153.846 fanout: 20
  42. 42.

    E X A M P L E D E N

    S E U N C L U S T E R E D I N D E X - 200 records with the same value of the attribute cost (on average) - 13 data entries in the leaves - dense unclustered b+-tree index with search key cost Book code author cost publisher SELECT code, author, publisher FROM Book WHERE cost = %cost% - 15 pages (200/13) to visit for reaching data records with the same cost value (on average)
  43. 43.

    ~ 3 sec E X A M P L E

    D E N S E U N C L U S T E R E D I N D E X Costs path to the leaves: 4 I/Os leaves access: 15 I/Os data records: 200 I/Os Total cost 4 + 15 + 200 = 219 I/Os
  44. 44.

    - 2.000.000 records - 200.000 pages - 10 data record

    in a page - 200 records with the same value of the attribute cost (on average) - sparse clustered b+-tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E S PA R S E C L U S T E R E D I N D E X
  45. 45.

    - 2.000.000 records - 200.000 pages - 10 data record

    in a page - 200 records with the same value of the attribute cost (on average) - sparse clustered b+-tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E S PA R S E C L U S T E R E D I N D E X SELECT code, author, publisher FROM Book WHERE cost = %cost%
  46. 46.

    - 2.000.000 records - 200.000 pages - 10 data record

    in a page - 200 records with the same value of the attribute cost (on average) - sparse clustered b+-tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E S PA R S E C L U S T E R E D I N D E X SELECT code, author, publisher FROM Book WHERE cost = %cost% how many page accesses do we need to answer to the query?
  47. 47.

    - 2.000.000 records - 200.000 pages - 10 data record

    in a page - 200 records with the same value of the attribute cost (on average) - sparse clustered b+-tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E S PA R S E C L U S T E R E D I N D E X Let’s build the index structure each record has 4 field so in each page there are 40 fields 20 data entries fit in one leaf page of the index the tree has a fan-out of 20
  48. 48.

    E X A M P L E S PA R

    S E C L U S T E R E D I N D E X fanout: 20 How many pages store 2.000.000 data records? 2.000.000/10 = 200.000 each data entry points to a data record page
  49. 49.

    E X A M P L E S PA R

    S E C L U S T E R E D I N D E X fanout: 20 occupancy factor of 67% leads to 13 data entries in the leaves How many leaves are there in the tree? 200.000/13 = 15.384
  50. 50.

    log20 (15.384) = 3 I/O page accesses E X A

    M P L E S PA R S E C L U S T E R E D I N D E X How many I/Os are needed to go to the leaves? leaves: 15.384 fanout: 20 fanout: 20
  51. 51.

    E X A M P L E S PA R

    S E C L U S T E R E D I N D E X - 200 records with the same value of the attribute cost (on average) - 10 data records in a page - sparse clustered b+-tree index with search key cost Book code author cost publisher SELECT code, author, publisher FROM Book WHERE cost = %cost% - 20 pages (200/10) of data records to visit 
 (on average)
  52. 52.

    ~ 0.3 sec E X A M P L E

    S PA R S E C L U S T E R E D I N D E X Costs path to the leaves: 3 I/Os data records: 20 I/Os Total cost 3 + 20 = 23 I/Os
  53. 53.

    E X A M P L E A VA R

    I A N T What if the attributes were part of the search key? SELECT code, author, publisher FROM Book WHERE cost = %cost%
  54. 54.

    E X A M P L E W I T

    H O U T I N D E X In the worst case we have to visit 2.000.000 records SELECT code, author, publisher FROM Book WHERE cost = %cost% ~ 50 min
  55. 55.
  56. 56.
  57. 58.

    Ramakrishan, Gehrke “Database Management Systems” Assets: https://farm4.staticflickr.com/3577/3492185538_a39dbb4511_b_d.jpg https://farm3.staticflickr.com/2852/10740309163_12f6a671cc_k_d.jpg https://farm3.staticflickr.com/2456/3835365695_5e515a3492_b_d.jpg https://farm5.staticflickr.com/4048/4332381194_3cfbed7f8e_b_d.jpg

    https://farm5.staticflickr.com/4117/4800819674_3cf963deaa_b_d.jpg https://farm5.staticflickr.com/4141/4772464179_0672159bbd_b.jpg https://farm9.staticflickr.com/8530/8574154090_bd14f9ccbf_o_d.jpg https://farm8.staticflickr.com/7370/10847923014_2b3fc30cea_k_d.jpg R E F E R E N C E S