Understanding database [7] : Hash based indexes
To implement hash-based indexes, the key idea is to partition our data into buckets. Each entry is assigned to a bucket based on its search key, so that every bucket holds entries that share the same property.

To determine what that property is, we introduce a hash function h(v), where v ranges over all the search key values. We consider each bucket as a unit of storage containing one or more records, stored in a page. Hash-based indexes are well suited to equality searches, but they cannot support range searches, because hashing scatters nearby key values across unrelated buckets.

The steps of an equality search with a hash index are as below:
1. Given a value v, compute h(v) = a
2. Retrieve bucket a
3. Search the bucket for the wanted entry
As we may see from the above, the cost of a search is the number of pages in the bucket, which is superior to B+ trees (provided there are no overflow chains).
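The steps above can be sketched in Python. The bucket layout and the hash function here are illustrative assumptions, not any particular DBMS's implementation:

```python
def h(v, num_buckets):
    # toy hash function: Python's built-in hash, modulo the bucket count
    return hash(v) % num_buckets

def equality_search(buckets, v):
    """Return all entries in bucket h(v) whose search key equals v."""
    a = h(v, len(buckets))                    # step 1: compute h(v) = a
    bucket = buckets[a]                       # step 2: retrieve bucket a
    return [e for e in bucket if e[0] == v]   # step 3: scan the bucket

# usage: entries are (search key, record id) pairs
buckets = [[] for _ in range(4)]
for key, rid in [(10, "r1"), (14, "r2"), (10, "r3")]:
    buckets[h(key, 4)].append((key, rid))

print(equality_search(buckets, 10))
```

Note that the search only ever touches the one bucket that h selects, which is why the cost is bounded by that bucket's size.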

One variety of hashing that is often used is static hashing. Bucket allocation is based on a modulo of the number of buckets that exist initially: the bucket id is retrieved via h(k) % m, where m is the number of buckets. The primary pages are fixed in number, sequentially allocated, and never deallocated. As a result of the fixed number of pages, overflow pages may be needed for entries beyond a page's capacity, and long overflow chains can pull the efficiency down.
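A minimal sketch of static hashing with overflow chains, assuming an identity hash (h(k) = k) and a deliberately tiny page capacity so the chaining is visible:

```python
PAGE_CAPACITY = 2  # hypothetical entries per page

class Page:
    def __init__(self):
        self.entries = []
        self.overflow = None  # next page in the overflow chain, if any

def insert(pages, key, rid):
    # bucket id = h(k) % m, with h(k) = k and m fixed at table creation
    page = pages[key % len(pages)]
    while len(page.entries) >= PAGE_CAPACITY:
        if page.overflow is None:
            page.overflow = Page()   # allocate an overflow page
        page = page.overflow
    page.entries.append((key, rid))

def chain_length(page):
    # count the primary page plus its overflow pages
    n = 1
    while page.overflow is not None:
        n += 1
        page = page.overflow
    return n

pages = [Page() for _ in range(2)]   # m = 2 fixed primary pages
for k in [0, 2, 4, 6]:               # all even keys hash to bucket 0
    insert(pages, k, f"r{k}")
print(chain_length(pages[0]))        # the chain has grown past one page
```

Because m never changes, every skewed batch of keys lengthens some chain, and a search must walk the whole chain — exactly the degradation the text describes.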

Extendible hashing is another variety, which deals with the excessive overflow chains that may otherwise occur. When a bucket fills up, it is split and, if necessary, the directory of bucket pointers is doubled, with entries reallocated based on their hashes. A global depth variable indicates the number of bits we currently take from the hash function's output.

For example, when we have a hashed result with binary encoding 101 and our global depth is 2, we take the last 2 bits, 01, as the bucket id and do not look at the entire 101 (at least at this global depth). Each bucket also has a local depth, which denotes the number of bits it uses to identify itself from the others. When a full bucket must split and its local depth already equals the global depth, the directory (whose size indicates the number of bucket pointers) is doubled, and the entries are redistributed using the correspondingly larger number of bits.
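A simplified extendible-hashing insert might look like the following. The identity hash, the bucket capacity, and the single-split handling (no recursive splits) are assumptions made for brevity:

```python
BUCKET_CAPACITY = 2  # hypothetical entries per bucket

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # indexed by low hash bits

    def _index(self, key):
        # identity hash: take the key's last global_depth bits
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # the full bucket already uses all global bits: double the directory
            self.directory += self.directory[:]
            self.global_depth += 1
        # split the full bucket using one more bit (one split; no recursion)
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)
        keys = bucket.keys + [key]
        bucket.keys = [k for k in keys if not k & bit]
        image.keys = [k for k in keys if k & bit]
        # repoint the directory slots whose new bit selects the split image
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image

eh = ExtendibleHash()
for k in [0, 2, 4, 6]:
    eh.insert(k)
print(eh.global_depth)   # the third insert forced one directory doubling
```

The key point the sketch shows: doubling only copies the (small) directory of pointers, while just the one overflowing bucket's entries are redistributed.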

It is worth noting that if the distribution of hash values is skewed, the directory may end up growing quite large. If the directory fits in memory, an equality search can be answered with 1 disk access; otherwise it takes 2.

For deletion in extendible hashing, if the removal of a data entry leaves a bucket empty, that bucket can be merged with its split image. If every directory element then points to the same bucket as its split image, the directory can be reduced to half its size.
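Under the last-bits scheme above, a bucket's split image is found by flipping the highest of its local-depth bits. This is a hypothetical helper, not part of any standard API:

```python
def split_image(bucket_id, local_depth):
    # flip the most significant of the local_depth bits identifying the bucket
    return bucket_id ^ (1 << (local_depth - 1))

# buckets 01 and 11 are each other's split images at local depth 2
print(split_image(0b01, 2))  # -> 3 (binary 11)
```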

