Introduction of Indexing
The main goal of designing the database is faster access to any data in the database and quicker insert/delete/update to any data. This is because no one likes waiting. When a database is very huge, even a smallest transaction will take time to perform the action. In order to reduce the time spent in transactions, Indexes are used. Indexes are similar to book catalogues in library or even like an index in a book. What it does? It makes our search simpler and quicker. Same concept is applied here in DBMS to access the files from the memory.
When records are stored in the primary memory like RAM, accessing them is very easy and quick. But records are not limited in numbers to store in RAM. They are very huge and we have to store it in the secondary memories like hard disk. As we have seen already, in memory we cannot store records like we see – tables. They are stored in the form of files in different data blocks. Each block is capable of storing one or more records depending on its size.
When we have to retrieve any required data or perform some transaction on those data, we have to pull them from memory, perform the transaction and save them back to the memory. In order to do all these activities, we need to have a link between the records and the data blocks so that we can know where these records are stored. This link between the records and the data block is called index. It acts like a bridge between the records and the data block.
How do we create these indexes? How these indexes help to access the data? Is linking between records and data block address enough to give better performance? Answers all these questions are learnt in this article.
How do we index in a book? We list main topics first and under that we group different sub-topic right? We do the same thing in the database too. Each table will have unique column or primary key column which uniquely determines each record in the table. Most of the time, we use this primary key to create index. Sometimes, we will have to fetch the records based on other columns in the table which are not primary key. In such cases we create index on those columns. But what is this index? Index in databases is the pointer to the block address in the memory. But these pointers are stored as (column, block_address) format.
Imagine we have a student table with thousands of records, each of which is 10 bytes long. Imagine their IDs start from 1 2, 3… and goes on. And we have to search student with ID 678. In a normal database with no index, it searches the disk block from the beginning till it reaches 678. So the DBMS will reach this record after reading 677*10 = 6770 bytes. But if we have index on ID column, then the address of the location will be stored as each record as (1,200), (2, 201)… (678, 879) and so on. One can imagine it as a smaller table with index column and address column. Now if we want to search record with ID 678, then it will search using indexes. i.e.; here it will traverse only 677*2 = 1354 bytes which very less compared to earlier one. Hence retrieving the record from the disk becomes faster. Most of the cases these indexes are sorted and kept to make searching faster. If the indexes are sorted, then it is called as ordered indices.
If the index is created on the primary key of the table then it is called as Primary Indexing. Since these primary keys are unique to each record and it has 1:1 relation between the records, it is much easier to fetch the record using it. Also, these primary key are kept in sorted form which helps in performance of the transactions. The primary indexing is of two types – Dense Index and Sparse Index.
1. Dense Index
In this case, indexing is created for primary key as well as on the columns on which we perform transactions. That means, user can fire query not only based on primary key column. He can query based on any columns in the table according to his requirement. But creating index only on primary key will not help in this case. Hence index on all the search key columns are stored. This method is called dense index.
For example, Student can be searched based on his ID which is a primary key. In addition, we search for student by his first name, last name, particular age group, residing in some place, opted for some course etc. That means most of the columns in the table can be used for searching the student based on different criteria. But if we have index on his ID, other searches will not be efficient. Hence index on other search columns are also stored to make the fetch faster.
Though it addresses quick search on any search key, the space used for index and address becomes overhead in the memory. Here the (index, address) becomes almost same as (table records, address). Hence more space is consumed to store the indexes as the record size increases.
2. Sparse Index
In order to address the issues of dense indexing, sparse indexing is introduced. In this method of indexing, range of index columns store the same data block address. And when data is to be retrieved, the block address will be fetched linearly till we get the requested data.
Let us see how above example of dense index is converted into sparse index.
In above diagram we can see, we have not stored the indexes for all the records, instead only for 3 records indexes are stored. Now if we have to search a student with ID 102, then the address for the ID less than or equal to 102 is searched – which returns the address of ID 100. This address location is then fetched linearly till we get the records for 102. Hence it makes the searching faster and also reduces the storage space for indexes.
The range of column values to store the index addresses can be increased or decreased depending on the number of record in the table. The main goal of this method should be more efficient search with less memory space.
But if we have very huge table, then if we provide very large range between the columns will not work. We will have to divide the column ranges considerably shorter. In this situation, (index, address) mapping file size grows like we have seen in the dense indexing.
In the sparse indexing, as the table size grows, the (index, address) mapping file size also grows. In the memory, usually these mappings are kept in the primary memory so that address fetch should be faster. And latter the actual data is searched from the secondary memory based on the address got from mapping. If the size of this mapping grows, fetching the address itself becomes slower. Hence sparse index will not be efficient. In order to overcome this problem next version of sparse indexing is introduced i.e.; Secondary Indexing.
In this method, another level of indexing is introduced to reduce the (index, address) mapping size. That means initially huge range for the columns are selected so that first level of mapping size is small. Then each range is further divided into smaller ranges. First level of mapping is stored in the primary memory so that address fetch is faster. Secondary level of mapping and the actual data are stored in the secondary memory – hard disk.
In the above diagram, we can see that columns are divided into groups of 100s first. These groups are stored in the primary memory. In the secondary memory, these groups are further divided into sub-groups. Actual data records are then stored in the physical memory. We can notice that, address index in the first level is pointing to the first address in the secondary level and each secondary index addresses are pointing to the first address in the data block. If we have to search any data in between these values, then it will search the corresponding address from first and second level respectively. Then it will go to the address in the data blocks and perform linear search to get the data.
For example, if it has to search 111 in the above diagram example, it will search the max (111) <= 111 in the first level index. It will get 100 at this level. Then in the secondary index level, again it does max (111) <= 111, and gets 110. Now it goes to data block with address 110 and starts searching each record till it gets 111. This is how a search is done in this method. Inserting/deleting/updating is also done in same manner.
In this method, we can see that index mapping growth is reduced to considerable amount. But this method can also have same problem as the table size increases. In order to overcome this, we can introduce multiple levels between primary memory and secondary memory. This method is also known as multilevel indexing. In this method number of secondary level index is two or more.
In some cases, the index is created on non-primary key columns which may not be unique for each record. In such cases, in order to identify the records faster, we will group two or more columns together to get the unique values and create index out of them. This method is known as clustering index. Basically, records with similar characteristics are grouped together and indexes are created for these groups.
For example, students studying in each semester are grouped together. i.e.; 1st Semester students, 2nd semester students, 3rd semester students etc are grouped.
In above diagram we can see that, indexes are created for each semester in the index file. In the data block, the students of each semester are grouped together to form the cluster. The address in the index file points to the beginning of each cluster. In the data blocks, requested student ID is then search in sequentially.
New records are inserted into the clusters based on their group. In above case, if a new student joins 3rd semester, then his record is inserted into the semester 3 cluster in the secondary memory. Same is done with update and delete.
If there is short of memory in any cluster, new data blocks are added to that cluster.
This method of file organization is better compared to other methods as it provides clean distribution of records, and hence making search easier and faster. But in each cluster, there would be unused space left. Hence it will take more memory compared to other methods.