Understanding database [1] : Storage layers

Note: This series of posts is mostly my personal interpretation of material taught at uni; please point out anything that appears incorrect or unclear.

_____


For a database, it's obvious that we have to store data somewhere that lets us access information we recorded earlier. Gamers will often have seen the terms RAM, HDD, and SSD, all of which name data storage media. We usually treat the HDD as secondary storage: relatively cheaper and slower than RAM, but more stable, since it keeps its contents without power. Tertiary storage also exists, in the form of tapes and the like, and is one step cheaper again than secondary storage.

We may classify our secondary storage as disks. Data on a disk is stored, and retrieved, in units called disk pages. We can picture access latency as a pyramid: cache at the top with the fastest access, main memory below it, and then a very large access gap down to disk storage at the bottom. Though disks are significantly cheaper per byte, cache and main memory easily outperform them in access speed and granularity.

A buffer, aka the buffer pool, is where copies of useful pages of data are retained for further use. We may consider the buffer a middle point between our secondary storage and the CPU: the CPU gets fast access to the pages in the pool, while pages are fetched from disk into the pool as the current or further execution needs them.
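The pool-as-middle-point idea can be sketched in a few lines. This is a minimal illustration, not a real DBMS API; the names (`BufferPool`, `get_page`, the `DISK` dict standing in for secondary storage) are all mine:

```python
DISK = {0: b"page-0 data", 1: b"page-1 data"}  # stand-in for secondary storage

class BufferPool:
    def __init__(self):
        self.pool = {}  # page_id -> cached copy of the page

    def get_page(self, page_id):
        if page_id not in self.pool:       # miss: fetch the page from "disk"
            self.pool[page_id] = DISK[page_id]
        return self.pool[page_id]          # hit: served straight from memory

bp = BufferPool()
bp.get_page(0)          # first access goes to DISK
assert 0 in bp.pool     # the copy is now retained for fast re-access
```

After the first access, repeated reads of page 0 never touch `DISK` again, which is exactly the speed benefit described above.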

Given the nature of disks and buffers, we may understand the buffer as holding data under temporary control: we may reserve or update pages within the pool for further use, and accesses to database items benefit from this.

In terms of SQL statements, namely at the logical database level, we may consider records to be stored as the rows and columns of a table created by SQL statements. At the physical layer, however, they may be represented in very different ways according to different needs and designs. For example, records may be stored as tab-separated strings, or as hex-encoded dates separated by the \0 character, which is widely used to terminate strings in C (not directly related here, but it gives an idea of how this character plays a similar separating role in different places).
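To make the logical/physical split concrete, here are the two example encodings applied to one hypothetical row (the field values are made up for illustration):

```python
# One logical row, two possible physical encodings.
row = ("alice", "2024-01-15", "42")

tab_separated = "\t".join(row)   # fields delimited by tab characters
nul_separated = "\0".join(row)   # fields delimited by the C-style '\0' byte

# Either encoding recovers the same logical row when split on its delimiter.
assert tab_separated.split("\t") == list(row)
assert nul_separated.split("\0") == list(row)
```

The logical table sees three columns either way; only the bytes on disk differ.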

Elaborating on how records are stored: records may be of fixed length, or they may not.

For records of fixed length dealt with in a packed way, the slots are contiguous: there is no empty space between records, and one complete, continuous blank region is reserved at the end for later use. If instead the entries are unpacked, we need a bitmap, an array of bits that tracks the status (occupied or free) of each slot. For packed pages we maintain a counter for the number of records; for unpacked pages we instead record the number of slots.
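The unpacked layout with its bitmap can be sketched as follows. This is a toy in-memory version (slot count and record size are assumed, not from any real system); the point is that deletion just clears a bit and insertion reuses the first free slot, with no shifting:

```python
NUM_SLOTS = 4                      # assumed page capacity (fixed-length records)
slots = [None] * NUM_SLOTS         # the record slots themselves
bitmap = [0] * NUM_SLOTS           # 1 = slot occupied, 0 = slot free

def insert(record):
    for i in range(NUM_SLOTS):
        if bitmap[i] == 0:         # first free slot found via the bitmap
            slots[i] = record
            bitmap[i] = 1
            return i
    raise RuntimeError("page full")

def delete(slot_no):
    bitmap[slot_no] = 0            # no shifting: just mark the slot free

a = insert("rec-a")
b = insert("rec-b")
delete(a)
c = insert("rec-c")
assert c == a                      # the freed slot is reused
```

A packed page would instead keep records in slots 0..count-1 and move the last record into any hole on deletion, maintaining only the counter.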

For records of variable length, there are typically three formats for laying out the fields. In the first, as mentioned in the paragraph on database levels, we delimit fields with a special symbol, for instance $ or the \0 mentioned before. In the second, each field is prefixed with its length, so the field's contents may vary within that recorded length. Note that both of these formats need a counter for the number of fields, which makes entries easier to maintain when accessing. In the last format, we prefix all fields with an array of offsets, each relative to the beginning of the record, giving direct access to each field. This format can efficiently support null values, and the directory array (offset array) carries rather small overhead compared with the other two.
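The third format is worth a sketch, since it shows both direct access and the cheap null encoding. This is my own illustrative encoding (a null is stored as two equal consecutive offsets, i.e. zero bytes), not a specific system's on-disk layout:

```python
def encode(fields):
    """Encode variable-length fields as (offset array, body string)."""
    body = ""
    offsets = [0]
    for f in fields:
        body += "" if f is None else f   # a NULL contributes zero bytes
        offsets.append(len(body))        # offsets are relative to record start
    return offsets, body

def field(offsets, body, i):
    """Direct access to field i: no scanning past earlier fields needed."""
    start, end = offsets[i], offsets[i + 1]
    return None if start == end else body[start:end]

offs, body = encode(["bob", None, "2024"])
assert field(offs, body, 0) == "bob"
assert field(offs, body, 1) is None      # NULL recognized from equal offsets
assert field(offs, body, 2) == "2024"
```

Compare with the delimiter format, where reading field 2 would mean scanning for two separators first; here it is a single slice.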

As our lecturer explained it, a tuple id is helpful for locating and moving records on a page. The tuple id consists of two parts, in the format <page id, slot number>, which carries most of the information necessary to retrieve a record.
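Resolving a tuple id is then a two-step lookup, page first, slot second. A toy version over an in-memory stand-in for a paged file (page ids and row values are made up):

```python
# Each key is a page id; each list holds that page's record slots.
pages = {
    3: ["row-A", "row-B"],
    7: ["row-C"],
}

def fetch(tuple_id):
    page_id, slot_no = tuple_id     # the two parts of <page id, slot number>
    return pages[page_id][slot_no]

assert fetch((3, 1)) == "row-B"
```

In a real system the first step goes through the buffer pool described earlier, so the page may already be in memory.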

Files can also be classified into two kinds, heap files and sorted files, which we can roughly consider as unsorted and sorted arrays of records. The access and update patterns mirror the corresponding arrays: binary search may be utilized for a search in sorted files, while insertion into them requires a more careful process, since key order must be maintained. Deletion, however, is identical for both kinds: leave an empty space wherever the entry was. Indexes are another helpful way to organize data, arranging entries in tree structures or into buckets via a specified hashing method. An index is speedier when we look for a subset of records, and it has a much faster updating speed than a sorted file.
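The heap-vs-sorted contrast maps directly onto array operations. A small sketch, using Python's standard `bisect` module to play the role of binary search (the key values are arbitrary):

```python
from bisect import bisect_left

heap_file   = [42, 7, 19, 3]       # unsorted: a lookup must scan entries
sorted_file = sorted(heap_file)    # ordered: binary search applies

def heap_find(key):
    # Linear scan, O(N): check every entry until the key appears.
    return next((i for i, k in enumerate(heap_file) if k == key), None)

def sorted_find(key):
    # Binary search, O(log N): jump straight to the candidate position.
    i = bisect_left(sorted_file, key)
    return i if i < len(sorted_file) and sorted_file[i] == key else None

assert heap_find(19) == 2
assert sorted_find(19) == sorted_file.index(19)
```

For a range query the sorted file does even better: after one binary search, the matching entries are contiguous.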

As mentioned earlier, sorted files perform better for searches. To dig into that: the binary search supported by the ordered nature of sorted files allows equality or range queries to be completed in log N time, by retrieving the page that contains the first row the query asks for. The main cost of sorted files comes from maintaining their order: we make a trade-off such that each insertion may require shifting records to make space for the newcomer. The cost of insertion is therefore about log N + N, as we first binary-search for the insert location and then must shift all pages after that position (and similarly for deletion).
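The log N + N shape of insertion is visible if we write the shift out by hand (treating each list element as a page, purely for illustration):

```python
from bisect import bisect_left

pages = [10, 20, 30, 40]           # a sorted file, one key per "page"

def sorted_insert(key):
    pos = bisect_left(pages, key)  # binary search: ~log N comparisons
    pages.append(None)             # grow by one slot at the end
    for i in range(len(pages) - 1, pos, -1):
        pages[i] = pages[i - 1]    # shift the tail right: up to N moves
    pages[pos] = key               # drop the newcomer into the gap

sorted_insert(25)
assert pages == [10, 20, 25, 30, 40]
```

The search is cheap; the N-sized shift of everything after the insert point is where the real cost lives, which is exactly why indexes trade some structure for faster updates.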
