Table of Contents
Cluster File Organization
In all the file organization methods described above, each file contains single table and are all stored in different ways in the memory. In real life situation, retrieving records from single table is comparatively less. Most of the cases, we need to combine/join two or more related tables and retrieve the data. In such cases, above all methods will not be faster to give the result. Those methods have to traverse each table at a time and then combine the results of each to give the requested result. This is obvious that the time taken for this is more. So what could be done to overcome this situation?
Another method of file organization – Cluster File Organization is introduced to handle above situation. In this method two or more table which are frequently used to join and get the results are stored in the same file called clusters. These files will have two or more tables in the same data block and the key columns which map these tables are stored only once. This method hence reduces the cost of searching for various records in different files. All the records are found at one place and hence making search efficient.
For example, we want to see the students who have taken particular course. The tables are shown in below diagram. We can see there are two students who have opted for ‘Database’ and ‘Perl’ course each. Though it is stored in separate tables in logical view, when it is stored in physical view, we have combined them. This can be seen in cluster file below. This is the result of join. So do not have to put any effort or time for joining. Hence it will give faster results.
If we have to insert or update or delete any record, we can directly do so. Here data are sorted based on the primary key or the key with which we are searching the data. Also, clusters are formed based on the join condition. The key with which we are joining the tables is known as cluster key.
Clustering of tables are done when
- There is a frequent need for joining the tables with same condition. Also, these joins will result in only few records from both tables. i.e.; in above example, we are retrieving the records for only particular course; not for the entire course. If all the records from any one of the table is used in the join condition, then this method is not efficient.
- If tables are joined once in a while or full table scan of any one the table in involved in the query, then we do not cluster the tables.
- If there is 1: M relationship between the tables, then we can cluster the tables. In above case for each course, we have many students opted for. Hence we have clustered.
There are two types of cluster file organization
- Indexed Clusters: – Here records are grouped based on the cluster key and stored together. Our example above to illustrate STUDENT-COURSE cluster is an indexed cluster. The records are grouped based on the cluster key – COURSE_ID and all the related records are stored together. This method is followed when there is retrieval of data for range of cluster key values or when there is a huge data growth in the clusters. That means, if we have to select the students who are attending the course with COURSE_ID 230-240 or there is a large number of students attending the same course, say 250.
- Hash Clusters: – This is also similar to indexed cluster. Here instead of storing the records based on the cluster key, we generate the hash key value for the cluster key and store the records with same hash key value together in the memory disk.
Advantages of Clustered File Organization
- This method is best suited when there is frequent request for joining the tables with same joining condition.
- When there is a 1:M mapping between the tables, it results efficiently
Disadvantages of Clustered File Organization
- This method is not suitable for very large databases since the performance of this method on them is low.
- We cannot use this clusters, if there is any change is joining condition. If the joining condition changes, the traversing the file takes lot of time.
- This method is not suitable for less frequently joined tables or tables with 1:1 conditions.