A transaction that is executing in the system may fail for various reasons: a fault in a system program, a bug in an application program, a user action, or a system crash. These failures can be broadly classified into three categories.
Crash
A crash is a failure of the system caused by a bug in the software or a failure of the system processor. It mainly affects the data in primary memory. If only primary memory is affected, the stored data is not harmed and recovery is easy: primary memory is temporary storage, so the changes held there had not yet been written to the actual database. Hence the database remains in the consistent state it was in before the transaction. But when secondary memory crashes, data is lost and serious action is needed to recover it, because secondary memory contains the actual DB data. Recovering from such a crash is tedious and requires more effort. The DB recovery system provides strong mechanisms to recover the system from a crash and to maintain the atomicity of transactions.
Abort
This failure is caused by the user or by the executing program or transaction. The user may cancel a running transaction by pressing the cancel button, or abort it using DB commands. The transaction may fail because it violates constraints defined on the tables. It can also fail when multiple transactions are processed concurrently and there are not enough resources for all of them, or when a deadlock occurs. All of these cause the transaction to stop in the middle of its execution. A transaction that stops in the middle may have partially changed the DB, so it needs to be rolled back to the previous consistent state. This ensures the atomicity of the transaction and the consistency of the DB. In the ATM withdrawal example, if the user cancels the transaction after step (i), the system should be able to stop further processing of the transaction; if he cancels it after step (ii), the system should be strong enough to update the balance in his account. This guarantees the atomicity (either fully executed or not executed at all) of the transaction and the consistency (no incorrect data) of the DB.
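The rollback of an aborted transaction can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the accounts and withdrawals tables, the CHECK constraint on the balance, and the withdraw function are hypothetical and are not part of the ATM example itself. The transaction first records the withdrawal and then debits the account; if the debit violates the constraint, the partial change is rolled back.

```python
import sqlite3

def withdraw(conn, account_id, amount):
    """Record a withdrawal and debit the account as one transaction.

    Returns True if the transaction committed, False if it was aborted
    and rolled back. Table and column names are hypothetical.
    """
    try:
        # First change: this succeeds and leaves the DB partially updated.
        conn.execute(
            "INSERT INTO withdrawals (account_id, amount) VALUES (?, ?)",
            (account_id, amount),
        )
        # Second change: fails if the balance would become negative.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Constraint violated: undo the partial change made by the INSERT.
        conn.rollback()
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("CREATE TABLE withdrawals (account_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

aborted = withdraw(conn, 1, 500)    # violates the CHECK constraint
committed = withdraw(conn, 1, 30)   # within the balance, commits

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
logged = conn.execute("SELECT COUNT(*) FROM withdrawals").fetchone()[0]
print(aborted, committed, balance, logged)  # False True 70 1
```

After the failed call the withdrawals table is empty again, so the DB is back in its previous consistent state; only the second call leaves a trace.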
Media failure
This is another major failure, in which hard disks fail because of bad sectors, a disk head crash, unavailability of the disk, and so on. Data can even be lost because of fire, flood or theft. Media failure mainly affects secondary memory, where the actual data lies. In these cases we need alternative ways of storing the DB. We can create backups of the DB at regular intervals and store them separately from the storage that holds the DB, or maintain multiple copies of the DB at different network locations, so that the DB can be recovered after a failure.
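As a small sketch of the backup idea, Python's sqlite3 module provides Connection.backup, which copies one database into another. The example below uses two in-memory databases only to stay self-contained; in practice the copy would be kept on separate media or at another network location. The names primary, replica and back_up are assumptions made for this illustration.

```python
import sqlite3

def back_up(source, target):
    """Copy the whole source database into the target database."""
    source.backup(target)

# The primary database holds the actual data.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
primary.execute("INSERT INTO accounts VALUES (1, 100)")
primary.commit()

# The replica stands in for a backup stored away from the primary.
replica = sqlite3.connect(":memory:")
back_up(primary, replica)

# Simulate a media failure: the primary copy is gone.
primary.close()

# The data can still be read from the backup.
restored = replica.execute(
    "SELECT balance FROM accounts WHERE id = 1"
).fetchone()[0]
print(restored)  # 100
```

Only changes made before the last backup survive in the replica, which is why backups are taken at regular intervals.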
In general, a transaction should be either fully executed or not executed at all, to maintain its atomicity. In addition, the system should make sure that the DB is in a consistent state even after the transaction. If there is any failure in the system, the data in the DB should not be lost. Either the failed transaction is aborted as a whole, or all the transactions that were active at the time of the failure are aborted.
For example, suppose transactions T1, T2, T3 and T4 were executing in the DB in sequence, and a crash occurred while T3 was executing.
Now the system should be strong enough to decide which steps to follow to recover from the failure. Here T1 and T2 have already executed and made some changes to the DB, but those changes are complete only if T3 and T4 also execute. Because of the failure, T3 and T4 did not execute. To maintain atomicity, the system should either complete T3 and T4 or roll back T1 and T2. Since this is a crash, executing T3 and T4 is not possible. Reverting T1 and T2 is possible, provided a log is maintained in the system for each of these transactions. The log should record the data values before and after T1 and T2 executed, and also whether each transaction is complete or not. All this information helps the system to roll back T1 and T2 so that it can recover to the previous consistent state. This helps to maintain the atomicity and durability of the transactions.
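The kind of log described above can be sketched as follows. This is a minimal in-memory illustration and not how a real DBMS stores its log: the LogRecord and Database classes, the data items A and B, and their values are all assumptions made for the example. Each write records the value before and after the change, each finished transaction is marked as complete, and rollback scans the log backwards restoring the old values.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogRecord:
    txn: str                 # transaction that made the change
    key: str                 # data item that was changed
    before: Optional[int]    # value before the change (None if absent)
    after: int               # value after the change

@dataclass
class Database:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    completed: set = field(default_factory=set)

    def write(self, txn, key, value):
        # Record the old and new value first, then apply the change.
        self.log.append(LogRecord(txn, key, self.data.get(key), value))
        self.data[key] = value

    def complete(self, txn):
        # Mark the transaction as complete in the log information.
        self.completed.add(txn)

    def rollback(self, txns):
        # Scan the log backwards and restore the value before each change.
        for rec in reversed(self.log):
            if rec.txn in txns:
                if rec.before is None:
                    self.data.pop(rec.key, None)
                else:
                    self.data[rec.key] = rec.before

db = Database(data={"A": 100, "B": 50})
db.write("T1", "A", 90)
db.complete("T1")
db.write("T2", "B", 60)
db.complete("T2")
db.write("T3", "A", 80)   # the crash happens here; T3 never completes

# The log shows which transactions did not complete.
incomplete = {rec.txn for rec in db.log} - db.completed
print(incomplete)  # {'T3'}

# Roll back T1 and T2 together with the partial work of T3.
db.rollback({"T1", "T2", "T3"})
print(db.data)  # {'A': 100, 'B': 50}
```

Scanning the log in reverse matters when several transactions changed the same item: A is first restored from 80 to 90 and then from 90 to 100, which is its value in the previous consistent state.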
In the ATM withdrawal example below, suppose the system fails at the 3rd step. To maintain atomicity, it should either complete the transaction (T3 and T4) or roll back the transactions (T1 and T2). But durability is achieved only by completing T3 and T4: transaction T2 has already given the money to the user, and that cannot be rolled back. Hence the system has to complete T3 and T4 to calculate the updated balance and update the DB. This makes the system consistent, durable and atomic.
This is how a system is recovered from failure. Let us now see exactly how logs and other techniques help in this recovery.