Understanding database  : Storage layers
Note: This series of posts are more for the personal interpretation of uni taught materials, please make corrections if anything is appearing to be incorrect or unclear.
Understanding database :
For a database, its obvious that we will have to store data within some storing sheds which may allow us to access previous information that we have recorded. For gamers, its really often that we may see the word RAM, or HDD and SSD which are terms that represent data storing mediums. We often HDD as secondary storage, as it is relatively cheaper,slower but more stable than RAM. Moreover, tertiary storage also exists in the form of tapes etc. They are one step more cheaper than secondary storage.
We may classify our secondary storage as disks. The data within disks are stored in units, and at the same time being retrieved in units as well. We call these units the disk pages. We may consider the accessing latency as a pyramid : from the top its cache, with the fastest access, following up with main memory, and at the very end a very large access gap to meet disk storage. Though disks are significantly cheaper, cache and main memory may easily outperform disks in terms of accessing speed and granularity.
A buffer, aka the pool, is a place where copies of useful pages and data are being retained for accessing for further use. We may consider the buffer as a middle point between our secondary storage and the CPU, where the CPU may have fast access to the pages of data in the pool, meanwhile pages in the pool may be fetched from the disk for further or current execution.
According to the nature of disks and buffers, hence we may understand in the way that the buffer is in ‘temporal control’ of data, as we may reserve or update data within the pool for further use, which allows the accessing of database items to benefit from.
In terms of SQL statements, namely at ‘Logical database level’, we may consider that the records are being stored as columns and rows within a table created by SQL statements. In fact, on the physical layer of databases, they may be represented in much different ways according to different needs and designs. For example, records may be stored in the form of tab-separated strings, or hex-encoded dates separated by the \0 character which is being widely used for separating strings in C (which is not quite related in this case, but just to give an idea about the use of this character is quite similar at different places).
Elaborating on how the records are stored, records may be fixed in length, or it may be not.
For records of fixed length, if they are to be dealt with in a ‘packed’ way, the slots of entries are contiguous, where there are no empty space between records and a complete, continuous blank space is reserved for later use. On the other hand, if its not the case, where we have our entries ‘unpacked’, we will need a bitmap to keep an array of bits to keep in mind of the status of our entries. For packed entries, we will have to maintain a counter for the number of records, on the other hand we will have to record the number of slots for the unpacked slots.
For records of variable length, there are often 3 formats of how we deal with the accessing of entries. First format, like being mentioned in the database level paragraph, we delimit entries with a special symbol, for instance ‘$’ or the ‘\0’ mentioned before. Second format, we give a field length for the entry, where the length of the entry may vary within that range. Note that for these 2 formats, a counter for the number of entries is needed, and this might be for the sake of maintaining entries with ease when accessing. Last format, we will use an array of offsets as a prefix to all entries, and we will use the offsets relative to the beginning of the field to achieve direct access to each field. It is possible for this format to efficiently support null values, and the directory array (offset array) allows for rather small overhead compared to the other 2.
In the case how its being explained by our lecturer, a tuple id may be helpful for the navigation and moving of records on a page. The tuple id consists of 2 parts, built in the format of <page id, slot number>, which indicates most of the necessary information for a record to be retrieved.
Files can also be classified as two kinds - Heap files and Sorted files, where we can roughly consider them as sorted and unsorted arrays of files. The accessing and updating pattern is similar to corresponding arrays, where binary search may be utilized for a search in sorted files, meanwhile insertion would require a neater process with keys need to be taken care with. However, the deletion of the two kinds of files are identical - leave an empty space at wherever the entry was at. Indexes are also a helpful way to organize data in the case where data are to be organized in tree structures, or buckets via the specified hashing method. It is more speedy when we are to find for a subset of records, and it has a much faster updating speed compared to sorted files.
As mentioned earlier, sorted files may have better performance when dealing with searches. To dig in to that, the binary search supported by the ordered nature of sorted files allows equality or range queries to be completed with logN time, and it is done by retrieving the page that contains the first row which our query has requested for. The main cost of sorted files comes from maintaining its order, where we have to make a trade off such that each insertion may require shifting of records to make space for the new comer. The performance of insertion is believed to be logN + N, as we first have to binary search for our storing location, and then would need to update all pages according to their own position (same for deletion).