April 24, 2019 Context of Today's Lecture Access Methods System Buffers Authorisation Control Integrity Checker Command Processor Program Object Code DDL Compiler File Manager Buffer Manager Recovery Manager Scheduler Query Optimiser Transaction Manager Query Compiler Queries Catalogue Manager DML Preprocessor Database Schema Application Programs Database Manager Data Manager DBMS Programmers Users DB Admins Based on 'Components of a DBMS', Database Systems, T. Connolly and C. Begg, Addison-Wesley 2010 Data, Indices and System Catalogue
April 24, 2019 Basic Index Concepts ▪ An index is used to efficiently access specific data ▪ e.g. index in a textbook ▪ Often database queries reference only a small number of records in a file ▪ indices can be used to enhance the search for these records ▪ most DBMSs automatically create an index for primary keys ▪ A search key is an attribute (or set of attributes) that is used to lookup records in a file ▪ An index file consists of (searchKey, pointer) index entries ▪ normally much smaller than the original file ▪ pointer identifies a disk block and a record offset within the block
April 24, 2019 Basic Index Concepts ... ▪ There are two basic types of indices ▪ ordered indices - based on a sorted ordering of search keys ▪ hash indices - search keys are uniformly distributed to different buckets based on a hash function ▪ Indexing techniques have to be evaluated based on ▪ access types - e.g. records with a given attribute value or an attribute value in a specific range ▪ access time ▪ insertion and deletion time - if the underlying data is changed, all indices have to be updated which may result in a significant overhead for modifications ▪ space overhead
April 24, 2019 Ordered Indices ▪ An ordered index stores the search key values in sorted order ▪ e.g. author catalogue in a library ▪ A primary index (clustering index) is an index whose search key also defines the sequential order of the file ▪ the primary key is often used as a search key of a primary index ▪ files with a clustering index on the search key are called index-sequential-files ▪ A secondary index (non-clustering index) is an index whose search key has a different order than the sequential file order
April 24, 2019 Dense Index ▪ A dense index contains an entry for each search key in the file ▪ An index record of a dense primary index points to the first file record with the given search key ▪ An index record of a dense secondary index stores a list of pointers to the records with the same search key value de Rover Frei Jones Meier Michelin 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich 14 Jones Andrew Paris 53 Meier Beat Zurich 5 Michelin Robert Paris dense index index-sequential file
April 24, 2019 Dense Index Insertion find the search key of the record to be inserted in the index if (index contains search key) { if (index stores pointers to all records) { add a pointer to the new record to the index entry } else { // index stores pointer to the first record only ensure that the record is inserted after the first record with the same search key } } else { // index does not contain the search key insert an index record with the search key at the appropriate position }
April 24, 2019 Dense Index Deletion D ▪ Update can be modelled as a delete followed by an insert look up the record to be deleted if (deleted record is the only one with the search key value) { delete index record } else { // there are other records with the same search key if (index stores pointers to all records) { delete the corresponding pointer from the index record } else { // index stores pointer to the first record only if (deleted record was the first record with the search key value) { update the index record to point to the next record } } }
April 24, 2019 Sparse Index ▪ Contains only some of the search key values ▪ can only be used for primary indices! ▪ To search a given record in a sparse index the following steps have to be performed 1) find the index record ri with the largest search key value that is smaller or equal to the search key value to be found 2) start a linear search at record ri until the record is found de Rover Jones Michelin 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich 14 Jones Andrew Paris 53 Meier Beat Zurich 5 Michelin Robert Paris sparse index
April 24, 2019 Sparse Index ▪ A sparse index consumes less space and produces less maintenance overhead for insertions and deletions ▪ On the other hand it takes more time to locate a record based on a sparse index ▪ A good trade-off is a sparse index with one index record for each block ... sparse index ... block 1 block 2
April 24, 2019 Sparse Index Insertion // assumption: one index entry for each block if (new block is created) { insert first search key value in the new block into the index } else { // no new block if (new record has the smallest search key value in the block) { update the index record pointing to the block } }
April 24, 2019 Sparse Index Deletion ▪ Update can be modelled as a delete followed by an insert if (index contains the search key value of deleted record) { if (deleted record is the only one with the search key value) { if (search key value of the next record is not in the index { update the index record to point to the next record } else { // search key value of next record is already in the index delete index entry } } else { // there are other records with the same search key if (deleted record was the first record with the search key value) { update the index entry to point to the next record with the same search key value } }
April 24, 2019 Multilevel Index ▪ A multilevel index can be used if the index grows too large to be kept in memory ▪ inner index forms the primary index file ▪ outer index is a sparse index on the inner index ... inner index ... data block 1 data block 2 ... ... outer index
April 24, 2019 Multilevel Index ... ▪ If even the outer index grows too large to be kept in memory, the process can be repeated ▪ note that instead of this type of multilevel index it is preferred to use B+-trees ▪ The update of the multilevel index (on insertion and deletion) is an extension of the updates described for the single level indices
April 24, 2019 Secondary Index ▪ A secondary index has to be a dense index and must point to all the records with the same search key value ▪ e.g. indirection via buckets that contain the pointers ▪ The use of a secondary index normally requires more I/O operations than a primary index Brussels Paris Zurich 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich 14 Jones Andrew Paris 53 Meier Beat Zurich 5 Michelin Robert Paris secondary index buckets
April 24, 2019 Index-Sequential File Problems ▪ Index-sequential files have a number of disadvantages ▪ degradation and loss of performance when the file grows since many overflow blocks have to be created ▪ the entire file has to be periodically reorganised ▪ Alternative index structures that do not show a loss of performance after updates can be used ▪ e.g. the B+-tree index file structure is a widely used multilevel index structure that is based on a B+-tree (balanced and sorted )
April 24, 2019 B+-Tree ▪ Properties of a balanced B+-tree ▪ every path from the root to a leaf node has the same length ▪ for a given n, each leaf node has between (n-1)/2 and n-1 values - leaf nodes must always be at least half full ▪ each non-leaf (and non-root) node has between n/2 and n children (fanout) ▪ typical B+-tree node - K1 ,..., Km-1 are the ordered search key values (Ki < Ki+1 ) - a non-leaf node has pointers P1 ,..., Pm to child nodes - a leaf node has pointers P1 ,..., Pm to file records or to buckets of pointers - typically the size of a disk block ▪ if the root node is not a leaf node it has at least 2 children; otherwise it has between 0 and n-1 values P1 K1 P2 ... Pm-1 Km-1 Pm ...
April 24, 2019 B+-Tree Leaf Nodes ▪ Leaf node example (n=3) ▪ for 1 i < n the pointers Pi point to a file record with the search key value Ki or to a bucket of pointers for search keys that are non-ordered non-candidate keys ▪ the pointer Pn points to the next leaf node in search key order de Rover Frei 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich ... next leaf node index-sequential file
April 24, 2019 B+-Tree Non-Leaf (Internal) Nodes ▪ The non-leaf nodes form a multilevel sparse index on the leaf nodes ▪ For a non-leaf node with m pointers the following holds ▪ the values of the search keys in the subtree that P1 points to are all smaller than K1 ▪ for 2 i < m the search key values in the subtree to which Pi points to have values greater or equal than Ki-1 and smaller than Ki ▪ the search keys in the subtree to which Pm points to have values greater than or equal to Km-1 P1 K1 P2 ... Pm-1 Km-1 Pm ...
April 24, 2019 B+-Tree Example ▪ Example B+-tree for n=3 ▪ root node must have at least two children ▪ non-leaf nodes must have between 2 and 3 children (for n=3) ▪ leaf nodes must have between 1 and 2 values (for n=3) ▪ note that the node size is normally defined by the block size and the fanout is typically in the range of 50-100 (e.g. n=100) Bush Frei Jones Meier Otlet Price Jones Otlet Meier
April 24, 2019 B+-Tree Properties ▪ Logically close blocks do not have to be physically close since the inter-node connections are realised via pointers ▪ B+-trees generally have a small height ▪ if there are k search key values, the B+-tree height is not greater than log n/2 (k) - e.g. for 5'000'000 search keys and n = 100 we get a height not greater than 4 (note that the height would be 23 for a binary tree) ▪ search operations can be performed efficiently (without many block accesses)
April 24, 2019 B+-Tree Lookup ▪ Find all records for a given search key v Bush Frei Jones Meier Otlet Price Jones Otlet Meier c = root node while (c is not a leaf node) { let Ki = smallest search key value in c greater than v if (Ki exists) { c = node pointed to by Pi
April 24, 2019 B+-Tree Lookup ... ▪ The lookup requires O(logn k) I/O operations in worst case } else { // no search key greater than v c = node pointed to by Pm // where m is the number of pointers } } if (a key value Ki in c with the value v exists) { pointer Pi leads to the desired record or bucket } else { a record with the given search key value v does not exist }
April 24, 2019 B+-Tree Insertion (1)Find the leaf node in which the search key value would appear using the lookup algorithm presented before and add the new record to the file (2)If the search key is already present in the leaf node ▪ if necessary add a new pointer to the bucket (3)If the search key value is not yet present ▪ if there is room in the leaf node then insert a new index record ▪ if there is no room then split the node - take the n index records (including the one to be inserted) in sorted order and place the first n/2 records in the original node and the rest in a new node - let the new node be p and the smallest value in p be k. Insert (p,k) in the parent node (if there is no space in the parent node, the split is propagated upwards)
April 24, 2019 B+-Tree Insertion ... ▪ In worst case, the root node has to be split and the height of the tree is increased by 1 ▪ The insertion requires O(logn k) I/O operations in worst case
April 24, 2019 B+-Tree Insertion Example ▪ insertion of Codd Bush Frei Jones Meier Otlet Price Jones Otlet Meier Bush Codd Meier Otlet Price Frei Jones Otlet Meier Jones Frei
April 24, 2019 B+-Tree Deletion (1)Find the record to be deleted by using the lookup algorithm and delete it from the file (2)Remove the search key and value from the leaf index node if there is no bucket or if the bucket is empty (3)If the node has no longer enough entries due to the deleted index record and the entries in the node and a sibling fit into a single node then merge the siblings ▪ insert the search key values of the two nodes into the left node and delete the right node ▪ recursively delete the entry for the deleted node from the parent node (propagation)
April 24, 2019 B+-Tree Deletion ... (4)If the node has no longer enough entries due to the deleted index record but the entries in the node and a sibling do not fit into a single node then redistribute the pointers ▪ redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries ▪ update the search key values in the parent node ▪ If the root node has only one pointer after deletion, it is deleted and its child becomes the root ▪ The deletion of a previously located record requires O(logn k) I/O operations in worst case ▪ note that there are variants where leaf nodes are not merged
April 24, 2019 B+-Tree Deletion Example ▪ deletion of Frei Bush Codd Meier Otlet Price Frei Jones Otlet Meier Jones Frei Bush Codd Meier Otlet Price Jones Otlet Meier Jones
April 24, 2019 B+-Tree Deletion Example ... ▪ deletion of Meier Bush Codd Otlet Price Jones Bush Codd Meier Otlet Price Jones Otlet Meier Jones Jones Otlet
April 24, 2019 B+-Tree Deletion Example ... ▪ deletion of Meier Bush Codd Meier Otlet Price Frei Jones Otlet Meier Jones Frei Bush Codd Otlet Price Frei Otlet Jones Jones Frei
April 24, 2019 Features of B+-Tree Index Files ▪ Advantages ▪ automatic reorganisation with small local changes for insertions and deletions ▪ no reorganisation of the file is required to maintain the performance (solves index-sequential file problem) ▪ Disadvantages ▪ insertion and deletion overhead ▪ space overhead ▪ The advantages of B+-trees outweigh the disadvantages which makes them an extensively used data structure
April 24, 2019 B+-Tree File Organisation ▪ We can use the same B+-tree structure to organise the records in a file and thereby solve the problem of data file degradation ▪ In a B+-tree file organisation, the leaves of the tree store records instead of pointers to records ▪ number of records in the leaf nodes normally smaller than the number of pointers in non-leave nodes ▪ leaf nodes still have to be at least half full ▪ two siblings can be used in splits/merges for space optimisation ▪ adjacent nodes in the tree may not be continuous on the disk
April 24, 2019 Indexing Strings ▪ The use of strings as search keys introduces two major problems ▪ the strings can be of variable length which leads to different fanouts ▪ the split and merge operations should no longer be based on the number of search keys but on the fraction of the used space ▪ The fanout of nodes can be increased by using a prefix compression technique ▪ we no longer store the entire search key in non-leaf nodes but only the part that is necessary to distinguish between the key values in the subtrees
April 24, 2019 B+-Trees vs. B-Trees ▪ B-trees store every search key value only once (no redundancy) ▪ additional pointers to the file records have to be added for non- leaf nodes ▪ Advantages of B-trees ▪ may use less nodes than the corresponding B+-trees ▪ may find the search key before reaching the leaf node ▪ Disadvantages of B-trees ▪ non-leaf nodes get larger which results in a deeper trees ▪ insertion and deletion gets more complicated ▪ Normally the advantages of B-trees do not outweigh their disadvantages
April 24, 2019 Multiple-Key Access ▪ For queries that need fast search on the combination of multiple attributes we can define composite search keys ▪ e.g. search key (name, city) for customers ▪ The ordering of composite search keys is the lexicographic ordering ▪ e.g. (a1 , a2 ) < (b1 , b2 ) if - a1 < b1 , or - a1 = b1 and a2 < b2 ▪ An index with composite search keys can be used to efficiently answer queries of the form ▪ WHERE name = 'Max Frisch' AND city = 'Zurich' ▪ WHERE name = 'Max Frisch' AND city < 'Brussels'
April 24, 2019 Static Hashing ▪ A bucket is a storage unit containing one or more records ▪ typically a disk block ▪ In a hash file organisation we get the bucket in a file by using a hash function on the search key value ▪ A hash function h is a function from the set of search key values K to the set of buckets B ▪ distribution should be uniform and random ▪ The hash function can be used to locate, insert and delete records ▪ A bucket may contain records with different search keys ▪ bucket has to be sequentially searched
April 24, 2019 Bucket Overflows ▪ Buckets can overflow due to different reasons ▪ insufficient number of buckets ▪ a skew in the distribution of records - many records with the same search key - non-uniform hash function ▪ The possibility for bucket overflows can be reduced but we cannot prevent overflows ▪ bucket overflows are handled by a linked list of additional overflow buckets ▪ this form of hashing with overflow buckets is called closed hashing ▪ Open hashing does not allow overflow buckets ▪ other existing buckets have to be "misused" (linear probing)
April 24, 2019 Hash Index ▪ Hashing cannot only be used for file organisation but also for the creation of an index ▪ A hash index manages the search keys with their record pointers in a hash file structure
April 24, 2019 Static Hashing Problems ▪ The hash function maps the search key values to a fixed number of buckets B but the database will grow or shrink over time ▪ if the file grows and the initial number of buckets is too small, the performance decreases since overflow buckets are necessary ▪ if space is pre-allocated for future growth, some space will be wasted initially ▪ A possible solution would be to reorganise the file with a new hash function from time to time ▪ too expensive and has to be performed exclusively ▪ An alternative is to dynamically modify the number of buckets (dynamic hashing)
April 24, 2019 Ordered Indexing vs. Hashing ▪ The database designer has to choose which form of index should be used based on different criteria ▪ cost of periodic reorganisation of the index ▪ frequency of update operations ▪ optimised average or worst case access time ▪ expected types of queries - hashing is generally better for retrieving records with a specific attribute value • constant average lookup time • worst case lookup time proportional to the number of attribute values! - ordered indices are preferred for range queries where the attribute value lies within a range of values (e.g. WHERE A > C1 AND A < C2) ▪ In practice ▪ most DBMSs support an ordered B+-tree index but not all of them support hash indices
April 24, 2019 Bitmap Index ▪ A bitmap index can be used to efficiently query on multiple keys/attributes ▪ The following assumptions must hold ▪ records in a relation must be numbered sequentially (e.g. starting from 0) ▪ given a number i, it must be easy to find the ith record - easy if records have fixed size ▪ Bitmaps are applicable on attributes with a relatively small number of distinct values ▪ e.g. gender, city, ... ▪ A bitmap is an array with as many bits as records ▪ the ith bit is 1 if record i has the desired attribute value
April 24, 2019 Bitmap Index ... ▪ Queries can be answered by using bitmap operations ▪ and (intersection), or (union) and not (complementation) - e.g. find all males in Zurich: 101101 AND 011010 = 001000 - e.g. find females not living in Paris: 010010 AND 111010 = 010010 ▪ Retrieve records based on the resulting bit array ▪ can also be used to count the number of records with a given condition 2 de Rover Pieter m Brussels 21 Brown Kathy f Zurich 8 Frei Urs m Zurich 14 Jones Andrew m Paris 9 Jones Lea f Zurich 5 Michelin Robert m Paris 101101 010010 100000 000101 011010 m f Brussel Paris Zurich bitmaps for gender bitmaps for location
April 24, 2019 Index Definition in SQL ▪ The create index command can be used to create an index on a relation ▪ different DBMSs support different types of indices (e.g. B+-tree, hash index, etc.) ▪ Example CREATE INDEX nameIndex ON Customer USING BTREE (name);
April 24, 2019 Summary ▪ Different types of indices can be used to speed up the search of specific records in a file ▪ The number and types of indices have to be designed carefully since each index adds additional maintenance costs on update operations ▪ The structures that are used for the indices (e.g. B+-tree or hash index) can also be used to manage the records in a file and reduce data file degradation