After reading this chapter, the reader will understand:
Today’s computer applications demand high-performing machines that can process large amounts of data in sophisticated ways. Applications such as geographical information systems, real-time decision making, and computer-aided design demand the capability to manage several hundred gigabytes to terabytes of data. Moreover, with the evolution of the Internet, both the number of online users and the size of data have increased. This demand has been the driving force behind the emergence of technologies such as parallel processing and data distribution. Database systems based on parallelism and data distribution are called parallel DBMSs and distributed DBMSs (DDBMSs), respectively.
This chapter begins with an overview of parallel DBMS. The distributed system and the various issues involved in its implementation are then discussed. The chapter ends with a discussion of the client/server system (briefly discussed in Chapter 01), which is considered a special case of the distributed system.
To meet the ever-increasing demand for higher performance, database systems are increasingly required to make use of parallel processing. The simultaneous use of computer resources such as central processing units (CPUs) and disks to perform a specific task is known as parallel processing. In parallel processing, many tasks are performed simultaneously on different CPUs. A parallel DBMS seeks to improve performance by carrying out many operations, such as loading data and processing queries, in parallel.
There are several architectures for building parallel database systems. The three main architectures are shared-memory (all processors share a common memory and the disks), shared-disk (each processor has its own memory but the disks are shared), and shared-nothing (each processor has its own memory and its own disks). All three architectures are shown in Figure 13.1.
Fig. 13.1 Parallel database architecture
There are two important metrics for measuring the efficiency of a parallel database system, namely, speedup and scaleup. Running a given task in less time by increasing the degree of parallelism is called speedup, and handling a larger task in the same amount of time by increasing the degree of parallelism is called scaleup.
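To make the two metrics concrete, here is a minimal Python sketch (with purely hypothetical timing figures) that computes speedup and scaleup as simple ratios.

```python
# A minimal sketch of the two parallelism metrics; the timing figures are
# hypothetical and serve only to illustrate the formulas.

def speedup(time_small_system: float, time_large_system: float) -> float:
    """Speedup: the same task, run with more resources, takes less elapsed time."""
    return time_small_system / time_large_system

def scaleup(time_small_task_small_system: float,
            time_large_task_large_system: float) -> float:
    """Scaleup: task size and resources grow together; the ideal value is 1."""
    return time_small_task_small_system / time_large_task_large_system

# e.g., a query takes 100 s on 1 CPU and 30 s on 4 CPUs
print(speedup(100.0, 30.0))   # 3.33 (ideal would be 4)

# e.g., a 1 GB load takes 80 s on 1 CPU, a 4 GB load takes 90 s on 4 CPUs
print(scaleup(80.0, 90.0))    # 0.89 (ideal would be 1)
```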
In a distributed database system, data is physically stored across several computers. The computers are geographically dispersed and are connected with one another through various communication media (such as high-speed networks or telephone lines). Due to the distribution of data over multiple machines, there exists potential for distributed computing, which means the processing of a single task by several machines in a coordinated manner.
The computers in a distributed system are referred to as sites. In a distributed database system, each site is managed by a DBMS that is capable of running independently of the other sites. The overall distributed system can be regarded as a kind of partnership among the individual local DBMSs at the individual local sites. The necessary partnership functionality is provided by a new software component at each site, which is logically an extension of the local DBMS. The combination of this new software together with the existing DBMSs is called a distributed DBMS (DDBMS). A DDBMS permits the management of the distributed database and makes the distribution transparent to its users.
Fig. 13.2 A distributed database system
The shared-nothing parallel database system resembles a distributed database system. The main difference is that the former is implemented on a tightly coupled multiprocessor and comprises a single database system, while in the latter, the databases are geographically separated across sites that share no physical components and are separately administered. The other difference is in their mode of operation. In the shared-nothing architecture, there is symmetry and homogeneity of sites, whereas in a distributed database system, there may be heterogeneity of hardware and operating systems at the sites.
A special-purpose distributed operating system is used for the distributed system, which logically works as a single operating system. Thus, all relational database operations used for the shared-nothing architecture can be used in a distributed DBMS without much modification.
A system having multiprocessors is itself a distributed system made up of a number of nodes (processors and disks) connected by a fast network within a machine.
A distributed database system in which all sites share a common global schema and run identical DBMS software is known as a homogeneous distributed database system. In such a system, the DBMS software at the different sites is aware of the other sites and agrees to cooperate in processing a user request at any site exactly as if the data were stored at the user’s own site. In addition, the software must agree to exchange information about transactions so that a transaction can be processed across multiple sites. The local sites relinquish a certain degree of control over their right to change schemas or database-management system software.
On the other hand, a distributed database system in which different sites may have different schemas and run different DBMS software essentially autonomously is known as a heterogeneous distributed database system or multidatabase system. In such a system, the sites may not be aware of one another and may provide only limited facilities for cooperation in transaction processing. The differences in schemas also cause problems in query processing, and the differences in software obstruct the processing of transactions that access multiple sites.
Distributed computing is the outcome of using networks to allow computers to communicate efficiently, but it differs from computer networking. Computer networking refers to two or more computers interacting with each other without typically sharing the processing of a single program. The World Wide Web is an example of a network but not an example of distributed computing.
The distributed database system has various advantages ranging from decentralization to autonomy. Some of the advantages are as follows.
In a distributed DBMS, the relations are stored across several sites. Accessing a relation stored at a remote site incurs overhead due to the exchange of messages. This overhead can be reduced by using two different techniques, namely, fragmentation and replication. In fragmentation, a relation is divided into several fragments (or smaller relations), and each fragment can be stored at the site where it is most often accessed. In replication, several identical copies or replicas of a relation are maintained, with each replica stored at a different site. Generally, replicas are stored at the sites where they are in high demand.
Note that a relation can be divided into several fragments and there may be several replicas of each fragment. In other words, both the techniques can be combined. These techniques are used during the process of distributed database design.
Fragmentation of a relation R consists of breaking it into a number of fragments R1, R2, …, Rn such that it is always possible to reconstruct the original relation R from the fragments R1, R2, …, Rn. The fragmentation can be vertical or horizontal (see Figure 13.3).
Fig. 13.3 Horizontal and vertical fragmentation
Horizontal Fragmentation Horizontal fragmentation partitions a relation R by assigning each tuple of R to one or more fragments. In horizontal fragmentation, each fragment is a subset of the tuples in the original relation. Since each tuple of the relation R belongs to at least one of the fragments, the original relation can be reconstructed. Generally, a horizontal fragment is defined as a selection on the relation R. That is, the tuples that belong to a horizontal fragment are specified by a condition on one or more attributes of the relation. For instance, fragment Ri can be constructed as shown here.
Ri = σcondition(i) (R)
The relation R can be reconstructed by applying union on all the fragments, that is,
R = R1 ∪ R2 ∪ … ∪ Rn
For example, the relation BOOK can be divided into several fragments on the basis of the attribute Category as shown here.
BOOK1 = σCategory = “Novel” (BOOK)
BOOK2 = σCategory = “Textbook” (BOOK)
BOOK3 = σCategory = “LanguageBook” (BOOK)
Here, each fragment consists of the tuples of the BOOK relation that belong to a particular category. Further, the fragments are disjoint. However, by changing the selection condition, a particular tuple of R can appear in more than one fragment.
Vertical Fragmentation Vertical fragmentation partitions a relation by decomposing the set of attributes of R. In vertical fragmentation, each fragment is a subset of the attributes of the original relation. A vertical fragment is defined as a projection on the relation R as shown here.
Ri = πLi (R)
where Li is the set of attributes in the fragment Ri.
The relation should be fragmented in such a way that the original relation can be reconstructed by applying natural join on the fragments. That is,
R = R1 ⋈ R2 ⋈ … ⋈ Rn
To ensure the reconstruction of a relation from its fragments, one of the following options can be used: include the primary-key (or some candidate key) attributes of R in each fragment Ri, or add a system-generated tuple identifier to each tuple of R and include it in each fragment Ri.
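To illustrate the two fragmentation schemes, the following Python sketch fragments a small, hypothetical BOOK relation horizontally by Category and vertically by attribute subsets, and then reconstructs the original relation by union and by natural join, respectively; the data and attribute values are assumptions used only for illustration.

```python
# A minimal sketch of horizontal and vertical fragmentation on a small,
# hypothetical BOOK relation; each tuple is a dict keyed by attribute name.

BOOK = [
    {"ISBN": "001", "Title": "A Tale",  "Category": "Novel",        "Price": 200},
    {"ISBN": "002", "Title": "DBMS",    "Category": "Textbook",     "Price": 450},
    {"ISBN": "003", "Title": "Grammar", "Category": "LanguageBook", "Price": 150},
]

# Horizontal fragmentation: each fragment is a selection on Category.
def horizontal_fragments(relation, attr):
    fragments = {}
    for tuple_ in relation:
        fragments.setdefault(tuple_[attr], []).append(tuple_)
    return fragments

# Reconstruction of horizontal fragments is the union of all fragments.
def reconstruct_horizontal(fragments):
    return [t for frag in fragments.values() for t in frag]

# Vertical fragmentation: each fragment is a projection; the key (ISBN) is
# kept in every fragment so that a natural join can rebuild the relation.
def vertical_fragment(relation, attrs):
    return [{a: t[a] for a in attrs} for t in relation]

def reconstruct_vertical(frag1, frag2, key):
    return [{**t1, **t2} for t1 in frag1 for t2 in frag2 if t1[key] == t2[key]]

h = horizontal_fragments(BOOK, "Category")
v1 = vertical_fragment(BOOK, ["ISBN", "Title"])
v2 = vertical_fragment(BOOK, ["ISBN", "Category", "Price"])

assert sorted(reconstruct_horizontal(h), key=lambda t: t["ISBN"]) == BOOK
assert sorted(reconstruct_vertical(v1, v2, "ISBN"), key=lambda t: t["ISBN"]) == BOOK
```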
Replication means storing a copy (or replica) of a relation or its fragments at two or more sites. Storing a copy of an entire relation at all the sites is known as full replication. When only some fragments of a relation are replicated, it is called partial replication. In partial replication, the number of copies of each fragment can range from one to the total number of sites in the system.
Replication is desirable for the following two reasons.
Increased Availability If the site containing relation R fails, the same relation can still be found as long as at least one replica is available at some other site. Thus, the system can continue to process queries involving R despite the failure of one site. Further, if local copies of remote relations are available, the system is less vulnerable to the failure of a remote site.
Increased Parallelism If most accesses to relation R require only reading of the relation, then several sites can process queries involving R in parallel. Moreover, the more replicas of a relation there are, the greater the probability that the needed data is found at the site where the transaction is initiated. Thus, data replication minimizes the overhead due to the movement of data between sites.
The major disadvantage of replication is that update operations incur greater overhead. The system must ensure that all the replicas of a relation are consistent to avoid erroneous results. Thus, whenever R is updated, all the replicas of R must be updated. For example, in an airline reservation system where the information regarding a flight is replicated at different sites, it is necessary that the information agrees at all the sites.
In general, replication improves the performance of read operations in a distributed system and improves its reliability. However, the requirement that all the replicas must be updated makes the distributed database system more complex. In addition, concurrency control is more complicated.
In a distributed system, the users should be able to access the database exactly as if the system were local. Users are not required to know where the data is physically stored or how the data can be accessed at a specific local site. Hiding all such details from the users is referred to as data transparency. A DDBMS provides several types of transparency, such as fragmentation transparency, replication transparency, and location transparency.
Data transparency is desirable because it simplifies application programs and end-user activities. At any time, in response to changing requirements, data can be changed, the number of copies can vary, and these copies can migrate from site to site. Further, data can be refragmented and fragments can be redistributed. Data transparency allows performing all these operations on data without manipulating programs or activities.
There are various issues that are involved in processing and optimizing a given query in a centralized database system as discussed in Chapter 08. In a distributed system, several other factors must also be taken into account. One of them is the communication cost. Since in a distributed system, a query may require data from more than one site, this transmission of data entails communication cost. Another factor is the gain in performance, if parts of a query are processed in parallel at different sites.
The efficiency of the distributed query-processing strategies also depends on the type and topology of the network used in the distributed system.
This section discusses the communication cost reduction techniques for one of the most common relational algebra operations, the join. Further, the gain in performance from having several sites process parts of query in parallel is elaborated.
The join operation is one of the most expensive operations to perform. Thus, choosing a strategy for join is the major decision in the selection of a query-processing strategy. Consider the following relational algebra expression.
πISBN, Price, P_ID, Pname (BOOK ⋈BOOK.P_ID = PUBLISHER.P_ID PUBLISHER)
Suppose the relations BOOK and PUBLISHER are stored at sites S1 and S2, respectively, and the result is to be produced at site S3. Further, suppose that there are 500 tuples in BOOK, where each tuple is 100 bytes long, and 100 tuples in PUBLISHER, where each tuple is 150 bytes long. Thus, the size of the BOOK relation is 500 * 100 = 50,000 bytes and the size of the PUBLISHER relation is 100 * 150 = 15,000 bytes. The result of this query has 500 tuples, where each tuple is 75 bytes long (assuming that the attributes ISBN, Price, P_ID, and Pname are 15, 6, 4, and 50 bytes long, respectively). For simplicity, assume that these two relations are neither replicated nor fragmented. The different strategies that can be employed for processing this query are as follows.
1. Ship copies of both the relations to site S3 and process the query at that site to obtain the result. In this strategy, a total of 50,000 + 15,000 = 65,000 bytes must be transferred.
2. Ship a copy of the relation PUBLISHER to site S1, process the entire query locally at site S1, and then ship the result to site S3. In this strategy, the bytes equivalent to the size of the query result plus the size of the relation PUBLISHER must be transferred, that is, 37,500 + 15,000 = 52,500 bytes.
3. Ship a copy of the relation BOOK to site S2, process the entire query locally at site S2, and then ship the result to site S3. In this case, 50,000 + 37,500 = 87,500 bytes must be transferred.
If minimizing the amount of data transferred is the criterion for optimization, strategy 2 must be chosen. However, in addition to the amount of data transferred, the optimal choice also depends on the communication cost between each pair of sites and the relative processing speed at each site. Thus, none of the above strategies is always the best.
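The arithmetic behind these strategies is simple enough to capture in a few lines; the following Python sketch recomputes the bytes transferred by each strategy, using the figures assumed above.

```python
# A minimal sketch that recomputes the data-transfer cost of the three
# strategies, using the cardinalities and tuple sizes assumed in the text.

BOOK_TUPLES, BOOK_TUPLE_SIZE = 500, 100        # BOOK: 50,000 bytes
PUB_TUPLES, PUB_TUPLE_SIZE = 100, 150          # PUBLISHER: 15,000 bytes
RESULT_TUPLES, RESULT_TUPLE_SIZE = 500, 75     # query result: 37,500 bytes

book_size = BOOK_TUPLES * BOOK_TUPLE_SIZE
pub_size = PUB_TUPLES * PUB_TUPLE_SIZE
result_size = RESULT_TUPLES * RESULT_TUPLE_SIZE

strategies = {
    "1 (ship BOOK and PUBLISHER to S3)": book_size + pub_size,         # 65,000
    "2 (ship PUBLISHER to S1, result to S3)": pub_size + result_size,  # 52,500
    "3 (ship BOOK to S2, result to S3)": book_size + result_size,      # 87,500
}

for name, cost in sorted(strategies.items(), key=lambda kv: kv[1]):
    print(f"Strategy {name}: {cost:,} bytes")   # strategy 2 minimizes transfer
```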
The semijoin strategy is used to reduce the communication cost of performing a join operation. The idea behind the semijoin strategy is to reduce the size of the relation that needs to be transmitted, and hence the communication cost. In this strategy, instead of sending the entire relation, the joining attribute of one relation is sent to the site where the other relation is present. This attribute is joined with that relation, and then the join attribute along with the other required attributes is projected and transferred back to the original site. For example, consider once again the following relational algebra expression.
πISBN, Price, P_ID, Pname (BOOK ⋈BOOK.P_ID = PUBLISHER.P_ID PUBLISHER)
where BOOK and PUBLISHER are stored at different sites S1 and S2, respectively. Further, suppose that the system needs to produce the result at site S2. This relational algebra expression can be evaluated using the following strategy:
1. Project PUBLISHER on the join attribute (P_ID) at site S2 and transfer the result to S1. A total of 100 * 4 = 400 bytes will be transferred.
2. Join the transferred data with the BOOK relation at site S1 and transfer the required attributes from the resultant relation to site S2. A total of 500 * 25 bytes (size of attribute ISBN + size of attribute Price + size of attribute P_ID), that is, 12,500 bytes, is required to be transferred.
3. Join the transferred data with the PUBLISHER relation at site S2.
Here, 12,900 (12,500 + 400) bytes are required to be transferred. This strategy is more advantageous if only a small fraction of the tuples of the BOOK relation participates in the join. It is clear that using this strategy, the network cost can be reduced to a great extent.
Semijoin operation is not commutative.
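The same kind of byte counting applies to the semijoin strategy; the following Python sketch recomputes its transfer cost using the attribute sizes assumed above.

```python
# A minimal sketch of the semijoin cost calculation, using the attribute
# sizes assumed in the text (P_ID = 4, ISBN = 15, Price = 6 bytes).

PUB_TUPLES = 100
BOOK_TUPLES = 500
P_ID_SIZE, ISBN_SIZE, PRICE_SIZE = 4, 15, 6

# Step 1: project PUBLISHER on P_ID at S2 and ship it to S1.
step1 = PUB_TUPLES * P_ID_SIZE                               # 400 bytes

# Step 2: join with BOOK at S1 and ship ISBN, Price, P_ID of the matching
# tuples back to S2 (here every BOOK tuple is assumed to match).
step2 = BOOK_TUPLES * (ISBN_SIZE + PRICE_SIZE + P_ID_SIZE)   # 12,500 bytes

print(step1 + step2)   # 12,900 bytes, versus 52,500 for the best shipping strategy
```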
Consider the evaluation of a query that involves join of four relations as follows:
R ⋈ S ⋈ T ⋈ U
Suppose the relations R, S, T, and U are stored at sites S1, S2, S3, and S4, respectively, and assume that the result is to be obtained at site S1. The query can be evaluated in parallel by using the strategy given here.
Relation R is transferred to site S2, where R ⋈ S can be evaluated. At the same time, relation T is transferred to site S4, where T ⋈ U can be evaluated. Site S2 can ship tuples of R ⋈ S to S1 as they are generated, instead of waiting for the entire join to be computed. Similarly, site S4 can ship tuples of T ⋈ U to S1. Once tuples of R ⋈ S and T ⋈ U reach S1, the computation of (R ⋈ S) ⋈ (T ⋈ U) can start. Hence, the computation of the final join result at S1 can be done in parallel with the computation of R ⋈ S at S2 and T ⋈ U at S4.
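The idea of shipping join tuples as they are produced can be sketched with Python generators; the relation contents below are hypothetical, and real shipping over the network is of course replaced here by in-memory iteration.

```python
# A minimal sketch of pipelined join evaluation: tuples of the first join are
# consumed by the final join as they are produced, without waiting for the
# whole intermediate result to be materialized.

def join(left, right, attr):
    """Nested-loop natural join on one attribute; 'left' may be any iterable
    (e.g., a stream of tuples arriving from another site), 'right' is a list."""
    for l in left:
        for r in right:
            if l[attr] == r[attr]:
                yield {**l, **r}

# Hypothetical contents of the four relations.
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"b": 10, "c": 100}, {"b": 20, "c": 200}]
T = [{"c": 100, "d": 7}, {"c": 200, "d": 8}]
U = [{"d": 7, "e": "x"}, {"d": 8, "e": "y"}]

rs_stream = join(R, S, "b")     # computed at S2, tuples stream to S1 lazily
tu = list(join(T, U, "d"))      # computed at S4 and shipped to S1

# S1 starts the final join on attribute c as the R-S tuples arrive.
final = list(join(rs_stream, tu, "c"))
print(final)
```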
In a distributed system, a transaction initiated at one site can access and update data at several other sites as well. A transaction that accesses and updates data only at the site where it is initiated is known as a local transaction. A transaction that accesses and updates data at several sites, or at a site other than the one at which it is initiated, is known as a global transaction. A transaction that requires data from remote sites is broken into one or more subtransactions, which are then distributed to the appropriate sites for execution. Note that such subtransactions are executed independently at the respective sites. The site at which the transaction is initiated is referred to as the coordinator site, and the sites where the subtransactions are executed are called participating sites.
As in a centralized system, each site has its own local transaction manager (TM) that manages the execution of those transactions (or subtransactions) that access data stored at that site. These transactions may be either local transactions or parts of global transactions. In addition, there is a transaction coordinator at each site that is responsible for coordinating the execution of all the transactions initiated at that site. The overall system architecture is shown in Figure 13.4, in which a transaction Ti initiated at the coordinator site Si requires data from a remote site Sj. As a result, a subtransaction Tij is distributed to the site Sj.
Fig. 13.4 System architecture
As discussed in Chapter 09, a transaction must preserve the ACID properties to ensure the integrity of the data. The ACID properties for local transactions can be assured by using the concurrency control and recovery techniques discussed in Chapter 10 and Chapter 11, respectively. However, for global transactions, ensuring these properties is a complicated task because of data distribution. Thus, both the concurrency control and the recovery techniques need some modification to suit the distributed environment.
In DDBMS, concurrency control has to deal with replicated data items. Variations of the concurrency control techniques of centralized database system are used in distributed environment. Several such possible techniques based on the locking and timestamp are presented in this section.
All the concurrency control techniques require updates to be done on all replicas of a data item. If any site containing a replica of a data item has failed, updates on the data item cannot be performed.
The various locking techniques (discussed in Chapter 10) can be applied in distributed systems. However, they need to be modified in such a way that the lock manager can deal with the replicated data.
Single Lock Manager In single lock manager technique, the system maintains a single lock manager that resides in any single chosen site and all the requests to lock and unlock the data items are sent to that site. When a transaction needs to acquire lock on a specific data item, it sends request to the site where the lock manager resides. The lock manager checks whether the lock can be granted immediately. If the lock can be granted, the lock manager sends a message to the site from which the request for lock was initiated. Otherwise, the request is delayed until it can be granted. In case of a read lock request, data item at any one of the sites (at which replica of data item is present) is locked in the shared mode and then read. In case of a write lock request, all the replicas of the data item are locked in exclusive mode and then modified.
The advantage of this approach is that it is relatively easy to implement, as it is merely a simple extension of the centralized approach. Since all requests to lock and unlock data items are made at one site, deadlock can be detected by applying the directed graph technique (discussed in Chapter 10) directly. This technique has certain disadvantages too. First, sending all locking requests to a single site can overload that site and cause a bottleneck. Second, the failure of the site containing the lock manager disrupts the entire system, as all the locking information is kept at that site.
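A minimal Python sketch of the grant/deny decision such a central lock manager makes is given below; the request format and the transaction and data item names are assumptions used only for illustration.

```python
# A minimal sketch of a single (central) lock manager's grant/deny decision.
# Requests arrive from any site; the lock table lives only at the chosen site.

class SingleLockManager:
    def __init__(self):
        # data item -> (mode, set of holding transactions); mode is "S" or "X"
        self.lock_table = {}

    def request(self, txn, item, mode):
        """Return True if the lock is granted, False if the request must wait."""
        held = self.lock_table.get(item)
        if held is None:
            self.lock_table[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":   # shared locks are compatible
            holders.add(txn)
            return True
        return False                           # incompatible: delay the request

    def release(self, txn, item):
        mode, holders = self.lock_table[item]
        holders.discard(txn)
        if not holders:
            del self.lock_table[item]

lm = SingleLockManager()
print(lm.request("T1", "Q", "S"))   # True: granted
print(lm.request("T2", "Q", "S"))   # True: shared with T1
print(lm.request("T3", "Q", "X"))   # False: must wait until the S locks are released
```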
Distributed Lock Manager In the distributed lock manager technique, each site maintains a local lock manager, which is responsible for handling the lock and unlock requests for those data items that are stored at that site. When a transaction needs to acquire a lock on a data item (assume that the data item is not replicated) that resides at a site, say S, a message is sent to the lock manager at site S requesting the appropriate lock. If the lock request is incompatible with the current state of locking of the requested data item, the request is delayed. Once it has ascertained that the lock can be granted, the lock manager sends a corresponding message to the site at which the request was initiated.
This approach is also simple to implement and to a great extent, it reduces the degree to which the coordinator is a bottleneck. However, in this technique detecting deadlock is more complex. Since lock requests are directed to a number of different sites, there may be inter-site deadlocks even when there is no deadlock within a single site. Thus, the deadlock-handling algorithm needs to be modified to detect global deadlocks (deadlock involving two or more sites). The modified deadlock-handling algorithm will be discussed later in this section.
Primary Copy This technique deals with replicated data by designating one copy of each data item (say, Q) as the primary copy. The site at which the primary copy of Q resides is called the primary site of Q. All requests to lock or unlock Q are handled by the lock manager at the primary site, regardless of the number of copies existing for Q. If the lock can be granted, the lock manager sends a message to the site from which the request for the lock was initiated. Otherwise, the request is delayed until it can be granted.
Using this technique, concurrency control for replicated data can be handled the same way as for unreplicated data. Furthermore, load of processing lock requests is distributed among various sites by having primary copies of different data items stored at different sites. The failure of one site affects only those transactions that need to acquire locks on data items whose primary copies are stored at the failed site. The other transactions are not affected.
Majority Locking In the majority locking technique, a majority of the copies must be locked for both read and write lock requests on a data item. The lock request is sent to more than one-half of the sites that contain a copy of the required data item. The local lock manager at each site determines whether the lock can be granted or not, and acts as in the techniques discussed earlier. In this technique, a transaction cannot operate on a data item until it has acquired a lock on a majority of the replicas of that data item. Further, any number of transactions can simultaneously acquire a read lock on a majority of the replicas of a data item, since a read lock is shareable. However, only one transaction can acquire a write lock on a majority of these replicas, and no transaction can majority lock a data item in shared mode if it is already locked in exclusive mode.
This technique is considered a truly distributed concurrency control method, since most of the sites are involved in the locking decision. However, it is more complicated to implement than the techniques discussed earlier because it generates higher message traffic among the sites. Another problem with this scheme is that, in addition to the global deadlocks that can arise from the use of a distributed lock manager, deadlock can occur even if only one data item is being locked. To illustrate this, consider a system with four sites S1, S2, S3, and S4 having the data replicated at all the sites. Suppose that transactions T1 and T2 need to lock a data item Q in exclusive mode. Further, suppose that transaction T1 succeeds in locking Q at sites S1 and S4, while transaction T2 succeeds in locking Q at sites S2 and S3. Now, to acquire the third lock, each of the transactions must wait. Hence, a deadlock has occurred. However, this type of deadlock can easily be prevented by making all the sites request locks on the replicas of the data item in a predetermined order.
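The majority rule itself can be sketched in a few lines of Python; the replica map, site names, and the try_lock_at_site callback are assumptions, and requesting the replicas in a fixed site order hints at the ordering trick that prevents the deadlock described above.

```python
# A minimal sketch of majority locking: a lock on data item Q is effective
# only when more than half of the sites holding a replica of Q have granted it.

REPLICA_SITES = {"Q": ["S1", "S2", "S3", "S4"]}   # hypothetical replica map

def acquire_majority_lock(item, try_lock_at_site):
    """try_lock_at_site(site, item) -> bool stands in for the local lock managers.
    Replicas are requested in a fixed, predetermined site order."""
    sites = sorted(REPLICA_SITES[item])
    granted = [s for s in sites if try_lock_at_site(s, item)]
    return len(granted) > len(sites) // 2, granted

# Example: the local managers at S1 and S4 grant the lock, S2 and S3 refuse
# (they have already granted it to another transaction).
grants = {"S1": True, "S2": False, "S3": False, "S4": True}
ok, held = acquire_majority_lock("Q", lambda site, item: grants[site])
print(ok, held)   # False ['S1', 'S4'] : only 2 of 4 replicas, not a majority
```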
A variant of majority locking is biased locking, in which the request for a shared lock on a data item (say, Q) goes to the lock manager at only one site containing a copy of Q, whereas the request for an exclusive lock goes to the lock managers at all the sites containing a copy of Q.
We know that the locking technique suffers from two disadvantages: deadlock and a low level of concurrency. The timestamping technique can therefore be used as an alternative to locking. Though it is more complex to implement than locking, it prevents deadlocks by avoiding them in the first place. The main idea behind the timestamping approach is that each transaction in the system is assigned a unique timestamp that determines its serialization order. The various timestamping methods for a centralized system can also be used in the distributed environment. However, extending a centralized technique to the distributed setting requires a method for generating unique global timestamps.
For generating unique timestamps in distributed systems, there are two primary methods, namely, centralized and distributed.
Fig. 13.5 Generation of global unique timestamps
Since the local timestamp is generated using a local clock or a logical counter, some problems can arise. If a logical counter is used, a relatively faster site would generate timestamps at a faster rate than a slower site, so all the timestamps generated by the fast site will always be larger than those generated by the other sites. Similarly, if the local clock is used, the timestamps generated by a site having a fast clock will always be larger than those generated by the other sites. So, some mechanism is required to keep these local timestamp-generating schemes synchronized. This is accomplished by maintaining a logical clock at each site, implemented as a counter that is incremented after generating a new timestamp. The timestamp is included in the messages sent between sites. On receiving a message, a site compares its clock with the message timestamp. If the site finds that the current value of its logical clock is smaller than the message timestamp, it sets its clock to some value greater than the message timestamp.
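A minimal Python sketch of a distributed timestamp generator that follows this scheme is given below: each global timestamp is the pair (local counter, site identifier), and the counter is advanced whenever an incoming message carries a larger timestamp. The site identifiers are hypothetical.

```python
# A minimal sketch of distributed global-timestamp generation: each site
# concatenates its local logical counter with its unique site identifier,
# and advances the counter when it sees a larger timestamp in a message.

class SiteClock:
    def __init__(self, site_id: int):
        self.site_id = site_id
        self.counter = 0

    def new_timestamp(self):
        """Global timestamps are (counter, site_id); ordering is lexicographic,
        so ties on the counter are broken by the site id, making them unique."""
        self.counter += 1
        return (self.counter, self.site_id)

    def on_message(self, message_timestamp):
        """Keep slower sites from falling behind: advance the local counter to
        at least the counter value carried by an incoming message."""
        self.counter = max(self.counter, message_timestamp[0])

s1, s2 = SiteClock(1), SiteClock(2)
for _ in range(5):
    s1.new_timestamp()          # the "fast" site races ahead
ts = s1.new_timestamp()         # (6, 1)
s2.on_message(ts)               # the slow site synchronizes its counter
print(s2.new_timestamp())       # (7, 2): later than every timestamp from s1 so far
```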
As stated earlier, in distributed systems, even when there is no deadlock within a single site, there may be inter-site deadlocks; in other words, a deadlock involving two or more sites can occur. To illustrate this, consider a system having two sites S1 and S2 with the following situation.
At site S1, transaction T1 is waiting to acquire a lock on a data item held by transaction T2, transaction T2 is waiting to acquire a lock on a data item held by transaction T4, and transaction T3 is waiting to acquire locks on data items locked by transactions T1 and T2.
At site S2, transaction T4 is waiting to acquire a lock on a data item locked by transaction T3.
The common deadlock-detection technique is the wait-for graph. This technique requires each site to maintain its own local wait-for graph. The local wait-for graphs [see Figure 13.6(a)] for both the sites S1 and S2 are constructed in the same way as discussed in Chapter 10. The only difference is that the nodes represent all the local as well as non-local transactions that are currently holding or requesting any of the data items local to that site. Note that T3 and T4 are present in both the graphs, indicating that these transactions have requested data items at both the sites.
Fig. 13.6 Distributed deadlock
Observe that in Figure 13.6(a), each local wait-for graph is acyclic, indicating that there is no deadlock within the sites. However, absence of cycle in the local wait-for graph does not imply that there is no deadlock. This means, global deadlock cannot be detected solely on the basis of local wait-for graph.
To detect such deadlocks, the wait-for graph technique needs some extended treatment. In a distributed system, this technique requires generating not only local wait-for graph but also global wait-for graph for the entire system. Global wait-for graph [see Figure 13.6(b)] is constructed simply by the union of the local wait-for graphs and maintained at a single chosen site. Since this site is responsible for handling global deadlocks, it is known as deadlock detection coordinator.
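What the deadlock detection coordinator does can be sketched in a few lines of Python: union the local wait-for graphs received from the sites and search the combined graph for a cycle. The edges below reproduce the situation of Figure 13.6.

```python
# A minimal sketch of global deadlock detection: the coordinator unions the
# local wait-for graphs and searches the combined graph for a cycle.

def union_graphs(local_graphs):
    global_graph = {}
    for g in local_graphs:
        for txn, waits_for in g.items():
            global_graph.setdefault(txn, set()).update(waits_for)
    return global_graph

def has_cycle(graph):
    """Depth-first search keeping track of the nodes on the current path."""
    visited, on_path = set(), set()

    def dfs(node):
        visited.add(node)
        on_path.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_path or (nxt not in visited and dfs(nxt)):
                return True
        on_path.discard(node)
        return False

    return any(dfs(n) for n in graph if n not in visited)

# Local wait-for graphs for the situation described in the text.
site_S1 = {"T1": {"T2"}, "T2": {"T4"}, "T3": {"T1", "T2"}}
site_S2 = {"T4": {"T3"}}

print(has_cycle(site_S1), has_cycle(site_S2))       # False False: no local deadlock
print(has_cycle(union_graphs([site_S1, site_S2])))  # True: a global deadlock exists
```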
The disadvantage of this technique is that it incurs communication overhead, as it requires sending all local wait-for graphs to the coordinator site. The other disadvantage is that delays in propagating local wait-for graphs may cause deadlock detection algorithms to identify deadlocks that do not really exist. Such deadlocks are known as phantom deadlocks. These types of deadlocks may lead to unnecessary rollbacks.
Instead of using one site for deadlock detection, it is possible to detect deadlock in a distributed manner. That is, several sites can take part in detecting deadlocks. In this technique, local wait-for graphs are sent to all the sites to generate only the concerned portion of global wait-for graph. However, such technique is more complicated to implement and sometimes deadlock may not be detected.
One other simple technique that can be used is the time-out based technique. Recall that this technique falls between deadlock prevention, where a deadlock can never happen, and deadlock detection and recovery. In this technique, a transaction is allowed to wait for a required data item for a fixed interval of time. If a transaction waits longer than the specified time interval, it is aborted. Though this algorithm may cause unnecessary restarts, the overhead of deadlock detection is low. Moreover, if the participating sites do not cooperate to the extent of sharing their wait-for graphs, it may be the only option.
NOTE Deadlock prevention techniques can also be used for handling deadlocks but they may result in unnecessary delays and rollbacks. Thus, instead of deadlock prevention techniques, deadlock detection techniques are widely used.
Recovery in a distributed system is more complicated than in a centralized system. In addition to the same type of failures encountered in a centralized system, a distributed system may suffer from new types of failures. The basic types of failures are loss of messages, failure of the site at which subtransaction is executing, and failure of communication links. The recovery system must ensure atomicity property, that is, either all the subtransactions of a given transaction must commit or abort at all sites. This property must be guaranteed despite any kind of failure. This guarantee is achieved using commit protocol. The most widely used commit protocol is the two-phase commit protocol (2PC).
In a distributed system, since a transaction is executed at more than one site, the commit protocol is divided into two phases, namely, the voting phase and the decision phase. To understand the operation of this protocol, consider a transaction T initiated at site Si, where the transaction coordinator is Ci. When T completes its execution, that is, when all the sites at which T has executed inform Ci that T has completed, Ci initiates the 2PC protocol. The two phases of this protocol are as follows:
Phase 1 (voting phase) Ci inserts a prepare record [T, prepare] in the log and force-writes the log onto stable storage. After this, Ci sends the prepare T message to all the participating sites. When a participating site receives the prepare T message, it determines whether it is ready to commit T or not. If the site is ready to commit T, it inserts a [T, ready] record in the log; otherwise, it inserts a [T, not-ready] record. It then force-writes the log onto stable storage and sends the ready T or abort T message to Ci accordingly.
NOTE Any participating site can unilaterally decide to abort T before sending the ready T message to the coordinator.
Phase 2 (decision phase) This phase begins when Ci receives responses to the prepare T message from all the participating sites. The following responses may be received at the coordinator site.
A ready T message from all the participating sites. In this case, Ci decides to commit the transaction and inserts a [T, commit] record in the log. It then force-writes the log onto stable storage and sends the commit T message to all the participating sites.
A not-ready T message, or no response from some participating site within the specified time-out interval. In this case, Ci decides to abort the transaction and inserts a [T, abort] record in the log. It then force-writes the log onto stable storage and sends the abort T message to all the participating sites.
Once the participating sites receive either the commit T or the abort T message from Ci, they insert the corresponding record in the log and force-write the log onto stable storage. After that, each participating site sends an acknowledge T message to Ci. Upon receiving this message from all the sites, Ci force-writes the [T, complete] record to the log.
NOTE As soon as the coordinator receives one abort T message, it decides to abort the transaction without waiting for the other sites or for the time-out period.
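A minimal Python sketch of the coordinator's side of 2PC is given below; the log is modeled as a simple list, and the send and collect_votes functions are placeholders for real messaging, not an actual API.

```python
# A minimal sketch of the coordinator's role in two-phase commit. The log is a
# Python list, and the appends marked "force-write" stand in for writes to
# stable storage.

def two_phase_commit(T, participants, send, collect_votes):
    log = []

    # Phase 1 (voting phase)
    log.append((T, "prepare"))                      # force-write, then send
    for site in participants:
        send(site, ("prepare", T))
    votes = collect_votes(T, participants)          # 'ready' / 'not-ready' / timeout

    # Phase 2 (decision phase)
    if all(v == "ready" for v in votes.values()):
        decision = "commit"
    else:
        decision = "abort"                          # any no-vote or timeout aborts
    log.append((T, decision))                       # force-write before sending
    for site in participants:
        send(site, (decision, T))

    # After every participant acknowledges, the transaction is complete.
    log.append((T, "complete"))
    return decision, log

# Example run with one participant voting 'not-ready'.
outbox = []
decision, log = two_phase_commit(
    "T", ["S1", "S2"],
    send=lambda site, msg: outbox.append((site, msg)),
    collect_votes=lambda T, sites: {"S1": "ready", "S2": "not-ready"},
)
print(decision)   # abort
```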
Now, there may be a case in which a site fails during the execution of the commit protocol. If a participating site fails, there are two possibilities. First, the site fails before sending the ready T message; in this case, the coordinator does not receive any reply within the specified time interval and thus the transaction is aborted. Second, the site fails after sending the ready T message; in this case, the coordinator simply ignores the failure of the participating site and proceeds in the normal way.
When the participating site restarts after failure, the recovery process is invoked. During recovery, the site examines its log to find all those transactions that were executing at the time of the failure. To decide the fate of those transactions, the recovery process proceeds as follows:
If the log contains a [T, commit] or [T, abort] record, the site simply executes redo(T) or undo(T), respectively, and sends the acknowledge T message to the coordinator.
If the log contains a [T, ready] record, the site contacts the coordinator to determine whether to commit or abort T.
If the log contains none of these records, the site executes undo(T), since it could not have voted to commit T before its failure.
Now, consider the case of the coordinator's failure. In this case, all the participating sites communicate with each other to determine the status of T. If any participating site contains the [T, commit] or [T, abort] record in its log, the transaction has to be committed or aborted, respectively. On the other hand, if some participating site does not contain a [T, ready] record in its log, it means that the site has not voted for T, and without that vote the coordinator could not have decided to commit T; thus, in this case, it is preferable to abort T. However, if none of the discussed cases holds, the participating sites have to wait until the coordinator recovers, and meanwhile all the locks acquired by T are held. This leads to a blocking problem at the participating sites, since the transactions that require locks on data items locked by T have to wait.
In order to continue processing immediately after the coordinator’s failure, the system may maintain a backup coordinator that resumes the whole responsibility of the coordinator.
When the coordinator restarts after failure, the recovery process is invoked at the coordinator site. The coordinator examines its log to determine the status of T.
If the log contains a [T, commit] or [T, abort] record, the coordinator executes redo(T) or undo(T), respectively, and periodically resends the commit T or abort T message to the participating sites until it receives the acknowledge T message from each participating site.
If the log contains only the [T, prepare] record, the coordinator unilaterally decides to abort T and conveys this to all the participating sites.
The commit protocol called the three-phase commit protocol (3PC) is an extension of the two-phase commit protocol that avoids blocking even if the coordinator site fails during recovery. To avoid the blocking problem, this protocol requires some conditions, which are as follows.
No network partition occurs, and no more than K sites fail, where K is some predetermined number.
If these conditions hold, the 3PC protocol avoids blocking by introducing a third phase between the voting phase and the decision phase, in which multiple sites are involved in the decision to commit. The basic idea of the 3PC protocol is that when the coordinator has sent the prepare T message and received yes votes from all the participating sites, it sends all the participants a pre-commit message instead of a commit message. After ensuring that at least K other participating sites know about the decision to commit, the coordinator force-writes a commit log record and sends a commit message to all the participants. The coordinator effectively postpones the decision to commit until it is sure that enough sites know about it.
If the coordinator site fails, the other sites choose a new coordinator. This new coordinator communicates with the remaining sites to check whether the old coordinator had decided to commit. This can be detected easily if any one of the K other sites that received the pre-commit message is up; in that case, the new coordinator restarts the third phase. If none of them has received a pre-commit message, the new coordinator aborts the transaction; otherwise, it commits the transaction.
The 3PC protocol incurs an overhead during normal execution and requires that communication link failures do not lead to network partition to ensure freedom from blocking. For these reasons, it is not widely used.
This protocol does not block the participants on failure of sites, except for the case when more than K sites fail.
As we know, client/server refers to an architecture in which the overall system is divided into two parts, the client and the server, which can run on two different machines. Thus, there is a possibility of distributed computing. In fact, a client/server system can be considered a special case of a distributed system in which some sites are server sites and some are client sites; the data resides at the server sites and all applications execute at the client sites.
In the late 1980s and early-to-mid 1990s, a true distributed database system (one that implements all the functionality discussed so far) was not a commercially feasible product. Thus, instead of making the effort to develop a true distributed DBMS product, vendors developed distributed database systems based on the client/server architecture. The two-tier client/server architecture can also be used for a distributed system; however, it has some limitations. So, nowadays, the three-tier client/server architecture is used in developing distributed systems, particularly web applications. The three-tier client/server architecture has already been introduced in Chapter 01. Here, all the three layers of this architecture, namely, the presentation layer, the application layer, and the database server, are explained in the context of web applications. The different technologies related to each layer are given in Figure 13.7.
Fig. 13.7 Three-tier architecture for web applications with respective technologies
There are various approaches that are used to divide the DBMS functionality across client, application server, and database server. The common approach is to include the functionality of centralized database at the database server. The application server formulates SQL queries and connects to the database server when required. The client provides interfaces for user interactions. A number of relational products have adopted this approach.
is some predetermined number.Employee (name, address, salary, branch)
is fragmented horizontally by branch. Assume that each fragment has two replicas: one stored at site A and one stored locally at the branch site. Describe a good strategy for the following queries entered at site B.