Skip to main content

Understanding database [5] : Tree based indexes

As we have mentioned in the previous post, there exists a variety of indexes, and we are here to discuss Tree based indexes first (others to be posted later), continuing from the end of the previous post.

For the ISAM (Indexed Sequential Access Method) variety, there exists a type of entry being called as the separator entry. Each separator entry is a 2-tuple (k,p) where k is a search key value, meanwhile p is a pointer to a lower level page. The search key value k will separate set of search key values in the 2 subtrees pointed at p and p-1, where p-1 is considered to be located before p according to their ordering.

The leaf pages are initially to be stored sequentially, and they generally contain the data records (primary/integrated/main index). The separator levels will never change once they have been constructed, by that it means that the indexes are static for ISAM. On top of that, the number and position of leaf pages in the file always remains the same. 

As a result of this, it might be necessary for overflow pages to exist in order to hold for new entries to the structure. The chains can be long caused by the overflow pages, therefore it causes one of the major drawbacks of ISAM, which it may be inefficient if the table is dynamic. Nevertheless, its unfair to overlook its advantages including the suitability for equality and range searches, as storing the leaf pages sequentially in file allows range searches to be well supported. Multiple attribute search keys and partial key searches are also supported by that nature.

B+ tree, the other variety, is the most commonly used structure in commercial database systems, where it supports a dynamic multi-level index structure. By that it means that the index structure responds to dynamic changes in the table, and the reorganization of the entire file is not required to maintain the performance. The difference between ISAM and B+ Tree is that the B+ tree is being kept balanced, and the sibling nodes exist a pointer between each other for the leaf nodes (sorted entries). The sibling pointers allows the support for range searches although there may be allocation and deallocation of leaf pages occurring. Similar to ISAM, B+ tree also embeds a separator entry structure, where it is used to direct searches.

Advantages of using a B+ tree includes that it supports equality and range searches with attribute keys and partial key searches, and it could use either a separate file or the basis for an integrated storage structure. Drawbacks include that the insertion and deletion may cause some major overhead , as well as the space it might take up.

Whenever we attempt an insertion or deletion on a B + tree, we are to find the data entry in the leaf, changing it and we might at times need to adjust the parents. The algorithm for searching in a B+ tree is being stated below :

Begin from root , with search key value k-
While no leaf reached
Examine all nodes with smallest search key >= k
If such exists, assume its Ki, then follow Pi to child node
Else it means k >= kn-1, we follow Pn to the child
If for some i, Ki == k,
Follow pointer Pi to desired record
Progress to next i
Else
No record with search-key value k

A B+ tree also has the following properties:
Each path from root to leaf has length h
Each internal node has at least d+1 children, meaning the root is either a leaf or it has at least 2 children.=
Each internal node and the root node have at most 2d +1 children
Leaf node holds between d and 2d index entries, and they form a doubly linked list
All the non root nodes are at least half full

Note that the average number of children per node is called the fanout, and is typically over 100.

As we are aware that the tree is always being kept balanced, we will have to deal with special cases where insertion may break the tree as the way we want it to be. The handling algorithm is stated below :

Find a leaf node of search key values position
If there is room in the leaf, insert the pair at sorted position in the leaf
Otherwise
Take the n pairs in sorted orders, and place the ceil(n/2) nodes in the original, the rest in a new node
Let the new node with (k,p) where k is least key value, be inserted to the parent of the original node
If the parent turns out to be full after the above :
Take n pairs in sorted order, remove middle key then replace the first (n-1)/2 pairs in original, rest in new node
Let new node be p, k the middle key value that is not being stored in the tree, push the pair (k,p) up the tree
The splitting terminates once a not full node is found


Similarly, we need a deletion handling algorithm as below :
Starting from the root, find leaf L where entry belongs
Remove the entry, and check for the leaf
If leaf at least half full, done
Else, try to re-distribute by borrowing from the sibling
If fails, merge L and sibling

Note that the access cost of B+ tree is logarithmic to fanout.

Comments

Popular posts from this blog

Understanding database [9] : Choosing indexes

Understanding database 9 : Choosing indexes When choosing indexes, we choose the best plan that suits for the queries, and look for additional indexes that may potentially upgrade upon that. Before creating, we must also consider the impact on updates in the workload, such that indexes take disk space. For a query, the WHERE clause are the main focus point to make indexes on, where exact matches suggest a hash index and range queries suggest a tree index. Clustering is extremely helpful when it comes to range queries, and may also help with equality queries if there are duplicates. Search keys with multiple attribute should be considered if a WHERE clause contains multiple conditions, and the order of attributes is important for range queries. Searching may become ‘index-only’ with such indexes.

Understanding database [6] : Clustered Index

A clustered index is what is good for a range search over a range of search key values, and the index entries and the rows are ordered the same way, which is different from unclustered indexes (secondary indexes). To use a clustered index, we use the index to locate the first index entry at the start of the range, which where the first row is located at. If the index is clustered, subsequent rows will be stored in successive locations with the ordering , therefore it may minimize the page transfers and maximizes cache hits. For one table, there may only be one clustered index, whereas there can be as many unclustered indexes as you may want  to create. Unclustered indexes aren ’ t as good as clustered, but they might become necessary when it comes to finding for other attributes apart from the primary key. Smaller topics : Dense index & Sparse index When we say sparse index, we mean that there is an index entry for each page of the data file. With this structuring