Table of Contents
Introduction
As we have already seen in previous articles that B+ tree is a (key, value) storage method in a tree like structure. B+ tree has one root, any number of intermediary nodes (usually one) and a leaf node. Here all leaf nodes will have the actual records stored. Intermediary nodes will have only pointers to the leaf nodes; it not has any data. Any node will have only two leaves. This is the basic of any B+ tree.
Consider the STUDENT table below. This can be stored in B+ tree structure as shown below. We can observe here that it divides the records into two and splits into left node and right node. Left node will have all the values less than or equal to root node and the right node will have values greater than root node. The intermediary nodes at level 2 will have only the pointers to the leaf nodes. The values shown in the intermediary nodes are only the pointers to next level. All the leaf nodes will have the actual records in a sorted order.
If we have to search for any record, they are all found at leaf node. Hence searching any record will take same time because of equidistance of the leaf nodes. Also they are all sorted. Hence searching a record is like a sequential search and does not take much time.
Suppose a B+ tree has an order of n (it is the number of branches – above tree structure has 5 branches altogether, hence order is 5), and then it can have n/2 to n intermediary nodes and n/2 to n-1 leaf nodes. In our example above, n= 5 i.e.; it has 5 branches from root. Then it can have intermediary nodes ranging from 3 to 5. And it can have leaf nodes from 3 to 4.
The main goal of B+ tree is:
- Sorted Intermediary and leaf nodes: Since it is a balanced tree, all nodes should be sorted.
- Fast traversal and Quick Search:
One should be able to traverse through the nodes very fast. That means, if we have to search for any particular record, we should be able pass through the intermediary node very easily. This is achieved by sorting the pointers at intermediary nodes and the records in the leaf nodes.
Any record should be fetched very quickly. This is made by maintaining the balance in the tree and keeping all the nodes at same distance.
- No overflow pages: B+ tree allows all the intermediary and leaf nodes to be partially filled – it will have some percentage defined while designing a B+ tree. This percentage up to which nodes are filled is called fill factor. If a node reaches the fill factor limit, then it is called overflow page. If a node is too empty then it is called underflow. In our example above, intermediary node with 108 is underflow. And leaf nodes are not partially filled, hence it is an overflow. In ideal B+ tree, it should not have overflow or underflow except root node.
Searching a record in B+ Tree
Suppose we want to search 65 in the below B+ tree structure. First we will fetch for the intermediary node which will direct to the leaf node that can contain record for 65. So we find branch between 50 and 75 nodes in the intermediary node. Then we will be redirected to the third leaf node at the end. Here DBMS will perform sequential search to find 65. Suppose, instead of 65, we have to search for 60. What will happen in this case? We will not be able to find in the leaf node. No insertions/update/delete is allowed during the search in B+ tree.
Insertion in B+ tree
Suppose we have to insert a record 60 in below structure. It will go to 3rd leaf node after 55. Since it is a balanced tree and that leaf node is already full, we cannot insert the record there. But it should be inserted there without affecting the fill factor, balance and order. So the only option here is to split the leaf node. But how do we split the nodes?
The 3rd leaf node should have values (50, 55, 60, 65, 70) and its current root node is 50. We will split the leaf node in the middle so that its balance is not altered. So we can group (50, 55) and (60, 65, 70) into 2 leaf nodes. If these two has to be leaf nodes, the intermediary node cannot branch from 50. It should have 60 added to it and then we can have pointers to new leaf node.
This is how we insert a new entry when there is overflow. In normal scenario, it is simple to find the node where it fits and place it in that leaf node.
Delete in B+ tree
Suppose we have to delete 60 from the above example. What will happen in this case? We have to remove 60 from 4th leaf node as well as from the intermediary node too. If we remove it from intermediary node, the tree will not satisfy B+ tree rules. So we need to modify it have a balanced tree. After deleting 60 from above B+ tree and re-arranging nodes, it will appear as below.
Suppose we have to delete 15 from above tree. We will traverse to the 1st leaf node and simply delete 15 from that node. There is no need for any re-arrangement as the tree is balanced and 15 do not appear in the intermediary node.
B+ Tree Extensions
As the number of records grows in the database, the intermediary and leaf nodes needs to be split and spread widely to keep the balance of the tree. This is called as B+ tree extensions. As it spreads out widely, the searching of records becomes faster.
The main goal of creating B+ tree is faster traversal of records. As the branches spreads out, it requires less I/O on disk to get the record. Record that needs to be fetched are fetched in logarithmic fraction of time. Suppose we have K search key values – that is the pointers in the intermediary node for n nodes. Then we can fetch any record in the b+ tree in log (n/2) (K).
Suppose each node takes 40bytes to store an index and each disk block is of 40Kbytes. That means we can have 100 nodes (n). Say we have 1million search key values – that means we have 1 million intermediary pointers. Then we can access log 50 (1000000) = 4 nodes are accessed in one go. Hence this costs only 4milliseconds to fetch any node in the tree. Now we can guess the advantage of extending the B+ tree into more intermediary nodes. As intermediary nodes spread out more and more, it is more efficient in fetching the records in B+ tree.
Look at below two diagrams to understand how it makes difference with B+ tree extensions.
B+ Tree index files
Above concept of B+ tree is used to store the records in the secondary memory. If the records are stored using this concept, then those files are called as B+ tree index files. Since this tree is balanced and sorted, all the nodes will be at same distance and only leaf node has the actual value, makes searching for any record easy and quick in B+ tree index files. Even insertion/deletion in B+ tree does not take much time. Hence B+ tree forms an efficient method to store the records.
Searching, inserting and deleting a record is done in the same way we have seen above. Since it is a balance tree, it searches for the position of the records in the file, and then it fetches/inserts /deletes the records. In case it finds that tree will be unbalanced because of insert/delete/update, it does the proper re-arrangement of nodes so that definition of B+ tree is not changed.
Below is the simple example of how student details are stored in B+ tree index files.
Suppose we have a new student Bryan. Where will he fit in the file? He will fit in the 1st leaf node. Since this leaf node is not full, we can easily add him in the node.
But what happens if we want to insert another student Ben to this file? Some re-arrangement to the nodes is needed to maintain the balance of the file.
Same thing happens when we perform delete too.
Benefits of B+ Tree index files
- As the file grows in the database, the performance remains the same. It does not degrade like in ISAM. This is because all the records are maintained at leaf node and all the nodes are at equi-distance from root. In addition, if there is any overflow, it automatically re-organizes the structure.
- Even though insertion and deletion are little complicated, it can be done in fraction of seconds.
- Leaf node allows only partial/ half filled, since records are larger than pointers.
B Tree index Files
B tree index file is similar to B+ tree index files, but it uses binary search concepts. In this method, each root will branch to only two nodes and each intermediary node will also have the data. And leaf node will have lowest level of data. However, in this method also, records will be sorted. Since all intermediary nodes also have records, it reduces the traversing till leaf node for the data. A simple B tree can be represented as below:
See the difference between this tree structure and B+ tree for the same example above. Here there is no repetition or pointers till leaf node. All the records are stored in all the nodes. If we need to insert any record, it will be done as B+ tree index files, but it will make sure that each node will branch only to two nodes. If there is not enough space in any of the node, it will split the node and store the records.
Example of Simple Insert
Example of splitting the nodes while inserting
Difference between B Tree and B+ Tree Index Files
Compare the difference between the examples of B+ tree index files and B tree index files above. You can see that they are almost similar but there is little difference in them. This little difference itself gives greater effect in database performance.
B Tree Index Files | B+ Tree Index Files | |
This is a binary tree structure similar to B+ tree. But here each node will have only two branches and each node will have some records. Hence here no need to traverse till leaf node to get the data. | This is a balanced tree with intermediary nodes and leaf nodes. Intermediary nodes contain only pointers / address to the leaf nodes. All leaf nodes will have records and all are at same distance from the root. | |
It has more height compared to width. | Most width is more compared to height. | |
Number of nodes at any intermediary level ‘l’ is 2l. Each of the intermediary nodes will have only 2 sub nodes. | Each intermediary node can have n/2 to n children. Only root node will have 2 children. | |
Even a leaf node level will have 2l nodes. Hence total nodes in the B Tree are 2 l+1 – 1. | Leaf node stores (n-1)/2 to n-1 values | |
As the number of intermediary nodes increases and hence the leaf nodes i.e. as B+ tree extends, the traversal speed increases log arithmetically log(n/2)(K) | ||
Records are in sorted order | Records are in sorted order | |
Advantages | It might have fewer nodes compared to B+ tree as each node will have data. | Automatically Adjust the nodes to fit the new record. Similarly it re-organizes the nodes in the case of delete, if required. Hence it does not alter the definition of B+ tree. |
Since each node has record, there might not be required to traverse till leaf node. | Reorganization of the nodes does not affect the performance of the file. This is because, even after the rearrangement all the records are still found in leaf nodes and are all at equidistance. There is no change in distance of records from neither root nor the time to traverse till leaf node. | |
No file degradation problem | ||
Good space utilization as intermediary nodes contain only pointer to the records and only leaf nodes contain records. Space needed for pointers are very less compared to records. | ||
Is suitable for partial and range search too | ||
Since all the leaf nodes are at equal distance, the time for I/O fetch is much less. Hence the performance of the tree will also increase. | ||
Disadvantages | If the tree is very big, then we have to traverse through most of the nodes to get the records. Only few records can be fetched at the intermediary nodes or near to the root. Hence this method might be slower. | If there is any rearrangement of nodes while insertion or deletion, then it would be an overhead. It takes little effort, time and space. But this disadvantage can be ignored compared to the speed of traversal |
Since each node has data and can have only two child nodes, the tree will not spread out much. Its depth/height will increase as the number of records increases. But if height of a tree increases, the I/O will also increase and hence the performance will decrease. | ||
Insertion and deletion of nodes will have re-arrangements like in B+ tree. But it will be more complicated as it has to balance the binary nodes. | ||
Implementation of B tree is little difficult compared to B+ tree | ||
All these disadvantages cannot be ignored as they are highly affecting the performance of the file. |
B+ Tree indexing
This is the standard index in the database where primary key or the most frequently used search key column in the table used to index. It has the same feature as discussed above. Hence it is efficient in retrieving the data. These indexes can be stored in different forms in a B+ tree. Depending on the way they are organized, there are 4 types of B+ tree indexes.
- Index-organized tables: – Here data itself acts as a index and whole record is stored in the B+ index file.
- Descending Indexes: – Here index key is stored in the descending order in B+ tree files.
- Reverse key indexes: – In this method of indexing, the index key column value is stored in the reverse order. For example, say index is created on STD_ID in the STUDENT table. Suppose STD_ID has values 100,101,102 and 103. Then the reverse key index would be 001, 101,201 and 301 respectively.
- B+ tree Cluster Index: – Here, cluster key of the table is used in the index. Thus, each index in this method will point to set of records with same cluster keys.