Distributed query processing in DBMS - Distributed query processing in SQL

A DDB can be homogeneous or heterogeneous DDB. That means all the DBs in DDB can of same type with same software, hardware, operating system etc or at least one of them may be different. When a user sends query / request, homogeneous system will be able manage the query easily as there is no difference among DB. But heterogeneous systems will not be able to work same among all the DBs.

In the case of Heterogeneous system, there should be some mechanism to handle queries in different databases. There are two types of mechanisms in heterogeneous system to manage such situations.

Table of Contents

Multi-database

In this method dynamical schema will be created based on the database to be connected to. For example, if the user connects to DB2 database, then a schema will be created dynamically to connect to DB2 database and make the user query flexible with this schema, if he connects to Sybase DB, then schema will be created dynamically to connect and perform Sybase transactions.

Federated

In this method users will access different databases using a centralized or global conceptual schema. That means a common schema is created to manage all the DB requests – which in turn makes the users to access the DB at a common schema. Hence even though the data is fragmented or distributed over DB, user will be accessing the central schema for processing his query. Hence accessing the DB nearest to him is minimal.

When a heterogeneous DDB is using federal method to process the query, there are lot of issues that it needs to deal with.

Data Models

DDB will have different databases distributed over the network. These DBs will have their own data models like relational, documented, network, object oriented, hierarchical etc. The federated schema should be made compatible with all these types of data models to handle all the DB requests.

Constraints

Each DB will have its own mechanism of accessing the data and defining the constraints. For example, in relational model primary-foreign key constraint determines the relationship between the tables, while in hierarchical model parent-child relationship exists. This makes the federal schema complex as it has to handle all these types of constraints.

Query Language

As the DB system varies from location to location, the query language used by it can also change. One system may use SQL while other may use DB2 or Sybase. Hence federated schema should develop a common language which is compatible all the query language as well as should be able to develop a code that can be executed in different DB systems.

Data Transfer Cost

Since this is a DDB, all the tables in the user query may not be present in a single DB or at single location. Hence while processing the query, it may need to access the tables at different DB or at different location. This requires a request and transfer cost for the data over the network. This is the very big cost that really needs some optimization on it.

Data transfer cost can be better understood by considering the example of EMPLOYEE and DEPARTMENT table. Consider an EMPLOYEE table with 1000 records with each record of 100bytes, DEPARTMENT table with 10 records with each record of 20 bytes. Suppose EMPLOYEE table is in DB1 at location 1 and DEPARTMENT table is in DB2 at location 2.

Consider the query to find the Names of the employees and their department names. Suppose each resulting record will have 20 bytes and all the employees and departments are being selected. Suppose this query is being executed at location 4.

∏EMP_NAME, DEPT_NAME (EMPLOYEE ∞ DEPT_ID DEPARTMENT)

Now consider the following scenarios :

Case 1 :

Since location 4 is not having any of the tables, both the tables needs to be transferred to location 3. Hence the cost of data transfer is as below:

Cost of transferring EMPLOYEE data: 1000 records * 100 bytes = 1,00,000 bytes

Cost of transferring DEPARTMENT data: 10 records * 20 bytes = 200 bytes.

Therefore, total cost = 1,00,000 bytes + 200 bytes = 1,00,200 bytes

Here cost of transferring result records will not come as result is request at this location itself.

Case 2 :

Suppose we transfer EMPLOYEE records into location 2 and process the query there. Then transfer the data to the location 3. This transfer needs to consider transfer cost of EMPLOYEE records and transfer cost of result records. Hence

Cost of transferring EMPLOYEE data: 1000 records * 100 bytes = 1,00,000 bytes

Cost of transferring the result: 1000 records * 20 bytes = 20,000 bytes.

Therefore, total cost = 1,00,000 bytes + 200 bytes = 1,20,000 bytes

Case 3 :

Suppose we transfer DEPARTMENT records into location 1 and process the query there. Then transfer the data to the location 3. This transfer needs to consider transfer cost of DEPARTMENT records and transfer cost of result records. Hence

Cost of transferring DEPARTMENT data: 10 records * 20 bytes = 200 bytes

Cost of transferring the result: 1000 records * 20 bytes = 20,000 bytes.

Therefore, total cost = 200 bytes + 20,000 bytes = 20,200 bytes

Hence the case 3 is the best approach for transferring the data which gives the minimal cost. Hence the federated method has to calculate these costs depending on the query, table size, result size, cost processing location etc and determine which method to use for query processing. Like we do in normal DB query processing – reducing number of records, performing the filter condition first etc, we need to apply the same in federated systems too to get the optimal solution.