Hashing Concepts in DBMS

Table of Contents

Introduction

Hash File organization method is the one where data is stored at the data blocks whose address is generated by using hash function. The memory location where these records are stored is called as data block or data bucket. This data bucket is capable of storing one or more records.

The hash function can use any of the column value to generate the address. Most of the time, hash function uses primary key to generate the hash index – address of the data block. Hash function can be simple mathematical function to any complex mathematical function. We can even consider primary key itself as address of the data block. That means each row will be stored at the data block whose address will be same as primary key. This implies how simple a hash function can be in database.

Above diagram depicts data block address same as primary key value. This hash function can also be simple mathematical function like mod, sin, cos, exponential etc. Imagine we have hash function as mod (5) to determine the address of the data block. So what happens to the above case? It applies mod (5) on primary keys and generates 3,3,1,4 and 2 respectively and the records are stored in those data block addresses.

From above two diagrams it now clear how hash function works.

There are two types of hash file organizations – Static and Dynamic Hashing.

Static Hashing

In this method of hashing, the resultant data bucket address will be always same. That means, if we want to generate address for EMP_ID = 103 using mod (5) hash function, it always result in the same bucket address 3. There will not be any changes to the bucket address here. Hence number of data buckets in the memory for this static hashing remains constant throughout. In our example, we will have five data buckets in the memory used to store the data.

Searching a record

Using the hash function, data bucket address is generated for the hash key. The record is then retrieved from that location. i.e.; if we want to retrieve whole record for ID 104, and if the hash function is mod (5) on ID, the address generated would be 4. Then we will directly got to address 4 and retrieve the whole record for ID 104. Here ID acts as a hash key.

Inserting a record

When a new record needs to be inserted into the table, we will generate a address for the new record based on its hash key. Once the address is generated, the record is stored in that location.

Delete a record

Using the hash function we will first fetch the record which is supposed to be deleted. Then we will remove the records for that address in memory.

Update a record

Data record marked for update will be searched using static hash function and then record in that address is updated.

Suppose we have to insert some records into the file. But the data bucket address generated by the hash function is full or the data already exists in that address. How do we insert the data? This situation in the static hashing is called bucket overflow. This is one of the critical situations/ drawback in this method. Where will we save the data in this case? We cannot lose the data. There are various methods to overcome this situation. Most commonly used methods are listed below:

Closed hashing

In this method we introduce a new data bucket with same address and link it after the full data bucket. These methods of overcoming the bucket overflow are called closed hashing or overflow chaining.

Consider we have to insert a new record R2 into the tables. The static hash function generates the data bucket address as ‘AACDBF’. But this bucket is full to store the new data. What is done in this case is a new data bucket is added at the end of ‘AACDBF’ data bucket and is linked to it. Then new record R2 is inserted into the new bucket. Thus it maintains the static hashing address. It can add any number of new data buckets, when it is full.

Open Hashing

In this method, next available data block is used to enter the new record, instead of overwriting on the older one. This method is called Open Hashing or linear probing.

In the below example, R2 is a new record which needs to be inserted. But the hash function generates address as 237. But it is already full. So the system searches next available data bucket, 238 and assigns R2 to it.

In the linear probing, the difference between the older bucket and the new bucket is usually fixed and it will be 1 most of the cases.

Quadratic probing

This is similar to linear probing. But here, the difference between old and new bucket is linear. We use quadratic function to determine the new bucket address.

Double Hashing

This is also another method of linear probing. Here the difference is fixed like in linear probing, but this fixed difference is calculated by using another hash function. Hence the name is double hashing.

Dynamic Hashing

This hashing method is used to overcome the problems of static hashing – bucket overflow. In this method of hashing, data buckets grows or shrinks as the records increases or decreases. This method of hashing is also known as extendable hashing method. Let us see an example to understand this method.

Consider there are three records R1, R2 and R4 are in the table. These records generate addresses 100100, 010110 and 110110 respectively. This method of storing considers only part of this address – especially only first one bit to store the data. So it tries to load three of them at address 0 and 1.

What will happen to R3 here? There is no bucket space for R3. The bucket has to grow dynamically to accommodate R3. So it changes the address have 2 bits rather than 1 bit, and then it updates the existing data to have 2 bit address. Then it tries to accommodate R3.

Now we can see that address of R1 and R2 are changed to reflect the new address and R3 is also inserted. As the size of the data increases, it tries to insert in the existing buckets. If no buckets are available, the number of bits is increased to consider larger address, and hence increasing the buckets. If we delete any record and if the datas can be stored with lesser buckets, it shrinks the bucket size. It does the opposite of what we have seen above. This is how a dynamic hashing works. Initially only partial index/address generated by the hash function is considered to store the data. As the number of data increases and there is a need for more bucket, larger part of the index is consider to store the data.

Advantages of Dynamic hashing

Performance does not come down as the data grows in the system. It simply increases the memory size to accommodate the data.
Since it grows and shrinks with the data, memory is well utilized. There will not be any unused memory lying.
Good for dynamic databases where data grows and shrinks frequently.

Disadvantages of Dynamic hashing

As the data size increases, the bucket size is also increased. These addresses will be maintained in bucket address tables. This is because, the address of the data will keep changing as buckets grow and shrink. When there is a huge increase in data, maintaining this bucket address table becomes tedious.
Bucket overflow situation will occur in this case too. But it might take little time to reach this situation than static hashing.

Comparison of Ordered Indexing and Hashing

Ordered Indexing	Hashing
Addresses in the memory are sorted for key value. This key value can be primary key or any other column in the table.	Addresses are generated using hash function on the key value. This key value can be primary key or any other column in the table.
Performance of this method comes down as the data increases in the file. Since it stores the data in a sorted form, when there is insert/delete/update operation, an extra effort to sort the record is needed. This reduces its performance.	Performance of dynamic hashing will be good when there is a frequent addition and deletion of data. But if the database is very huge, maintenance will be costlier. Static hashing will be good for smaller databases where record size id previously known. If there is a growth in data, it results in serious problems like bucket overflow.
There will be unused data blocks due to delete/update operation. These data blocks will not be released for re-use. Hence periodic maintenance of the memory is required. Else, memory is wasted and performance will also degrade. Also it will be cost overhead to maintain memory.	In both static and dynamic hashing, memory is well managed. Bucket overflow is also handled to better extent in static hashing. Data blocks are designed to shrink and grow in dynamic hashing. But there will be an overhead of maintaining the bucket address table in dynamic hashing when there is a huge database growth.
Preferred for range retrieval of data- that means when there is retrieval data for particular range, this method is best suited.	This method is suitable to retrieve a particular record based on the search key. But it will not perform better if the hash function is not on the search key.