Chapter Thirteen. Scalability and Resilience

Introduction

Some inherent scalability and resilience features are present in the Java EE platform, the Application Servers, the JMS Message Servers, and the hardware platforms. To take advantage of these inherent features, and to add explicit features where they are lacking, scalability and resilience must be architected into the solution from its inception.

Java CAPS solution components can be distributed over available infrastructure to facilitate scaling. The approach to distribution will vary depending on the extent to which eInsight Business Processes are involved. Business Processes may need to be refactored to support component distribution for scalability. This is discussed in section 13.2.

Resilience, the ability of a solution to withstand disruptions caused by expected or unexpected exceptions, is greatly influenced by solution design. Exception interception and handling properties of the tools greatly influence the designer’s ability to construct resilient solutions. Some solutions require exception handling that spans multiple components at the business level rather than technical level. These issues are discussed in sections 13.3 and 13.4.

Constructing highly available solutions that take advantage of hardware and infrastructure software is discussed in section 13.5.

Distributing Components

Scalability and resilience can be achieved by carefully factoring the overall solution into smaller components that can be replicated and deployed to multiple physical devices.

Java CAPS solution factoring is dependent on the nature of the solution. Strictly eGate-based solutions typically would be broken up at JMS Destinations, which are the logical break points. Strictly eInsight-based solutions may not use many JMS Destinations, so large Business Processes may need to be broken up into multiple smaller processes to address resilience and scalability requirements. Layered architectures, typically found in service-oriented architecture (SOA) environments, can be broken up at the interlayer interfaces, typically Web Services invocations.

eGate Component Distribution

JMS Destinations give the architect a considerable degree of flexibility in factoring large solutions into multiple solution fragments for replication and distribution. This flexibility, however, makes it more difficult to understand the runtime environment and to monitor and manage the solution. In the presence of multiple JMS Message Servers in the runtime environment, maintainers of the solution must ensure that JMS Destinations are deployed to the correct Message Servers to avoid unintentional solution fragmentation.

Figure 13-1 shows an exclusively eGate-based solution that involves several services. Each service performs some processing on an incoming message before forwarding it to the next component using a JMS Destination. What the components actually do is immaterial. In this case, each component is a Java Collaboration that appends, to a Route JMS user property, the name of the connectivity map service to which it is assigned, thereby recording the route the message takes; see Chapter 10, “System Management,” section 10.3.5.

Figure 13-1. JMS-based message routing solution
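
The route-stamping step just described can be pictured with plain JMS calls. The sketch below is illustrative only: a real Java CAPS collaboration works through generated OTD accessors rather than the javax.jms API directly, and the property name Route and the "->" separator simply follow the convention described above.

import javax.jms.JMSException;
import javax.jms.Message;

public final class RouteStamper {

    /**
     * Copies the "Route" user property from the inbound message, appends the
     * name of the service handling it, and stamps the result onto the outbound
     * copy (JMS does not allow properties to be set on a received message).
     */
    public static void stampRoute(Message inbound, Message outbound, String serviceName)
            throws JMSException {
        String route = inbound.getStringProperty("Route");
        String updated = (route == null || route.length() == 0)
                ? serviceName
                : route + "->" + serviceName;
        outbound.setStringProperty("Route", updated);
    }

    private RouteStamper() { }
}

Inspecting the accumulated Route property on the message that finally arrives at qH (or wherever it stops) makes unintentional solution fragmentation, discussed below, easy to spot.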

The Java CAPS environment to which this solution will be deployed consists of three logical hosts—LH01, LH02, and LH03—each with one Integration Server—IS01, IS02, and IS03 respectively—and each with one Message Server—MS01, MS02, and MS03 respectively. The Java CAPS environment is depicted in Figure 13-2.

Figure 13-2. Java CAPS environment

The single deployment profile assigns components to containers, as shown in Figure 13-3.

Figure 13-3. Component deployment

Note that most JMS queues have two deployed objects: a sending client—for example, qB → svcBC—and a receiving client—for example, qA → svcAB. The receiving client corresponds to a subscription by the service to a JMS Destination. The sending client corresponds to a publication by the service to a JMS Destination.

A JMS Destination, configured in a connectivity map with a particular name, is global within the JMS Message Server instance. The publication and the subscription, deployed to the same JMS Message Server, will operate on the same JMS Destination. If the publication and subscription are deployed to different JMS Message Servers, as is the case with qF and qG, the JMS Destinations are different.

Figure 13-4 depicts deployment of the components assigned in the deployment profile, as shown in Figure 13-3.

Figure 13-4. Component deployment relationships and connectivity

Even though the connectivity map depicts all components connected, and we would expect the message picked up from qA to end up in qH, submitting a message to qA will cause the message to end up in qF and go no further. Similarly, submitting a message to qF deployed in MS02 will cause the message to end up in qG in the same JMS Message Server and go no further. This is because qF in MS01 is different from qF in MS02 and qG in MS02 is different from qG in MS03. In the presence of multiple logical hosts in the environment, care must be taken to ensure publications and subscriptions are deployed where expected.

Components from one or more connectivity maps can be deployed together through one deployment profile or they can be deployed through multiple deployment profiles. Components in each deployment profile can be assigned to one or more logical hosts. Needless to say, components in different deployment profiles can be assigned to different logical hosts as well. Java CAPS allows the designer and the architect to distribute components in a number of ways. The solution architecture can leverage this ability to create a runtime environment with the desired scalability and resiliency properties.

eInsight Component Distribution

When building integration solutions in Java CAPS, designers are often tempted to choose between using multiple Java Collaborations and building a single large eInsight Business Process to accomplish the same thing. There are a couple of major reasons people choose to design an eInsight Business Process to solve an integration problem.

The first and foremost is that by turning on eInsight Persistence, you immediately obtain the benefits of Message (Route) History. Each Business Process instance can be inspected to determine the route each message took, whether it succeeded or where it failed, and whether it completed or is still in flight. The advantages for runtime monitoring are obvious. This kind of functionality is simply not available using just Java Collaborations unless it is explicitly built into the solution. The second is that all the logic of the solution is explicit and visible on a single canvas. The drawback of the approach is that the Business Process is monolithic and its components cannot be distributed. A further disadvantage is that persisting process states and attribute contents to a backing store imposes a significant overhead in terms of performance and storage.

To obtain the advantages of a single process, but also to support component distribution for resilience and scalability, we could take a layered Business Process approach. A main Business Process would contain coarse-grained activities and high-level logic. Each activity would be implemented as a subprocess, a separate process, or a Java Collaboration, and loosely coupled with the main process using remote invocation technologies like Web Services, TCP Sockets, or JMS. The overall process logic would still be visible, but the moving parts could be distributed over multiple platforms.

An eInsight Business Process can invoke a subprocess or a Java Collaboration as an activity. While this kind of invocation is advertised as a Web Services invocation, it requires that the caller and the called be part of the same deployment and be assigned to the same Integration Server. With this tight coupling, components cannot be distributed, so the major advantage of factoring is not obtainable.

An eInsight Business Process can be exposed as a Web Service. This ability can be exploited to break up a large process into a series of subprocesses, exposed as Web Services, and orchestrated using a main process. By decoupling subprocesses from the main process using remote invocation technologies, an architect can achieve not merely a service-oriented design but also a distributed, scalable design.

Exception Handling

While all integration solutions are supposed to be designed to work, work well, and work all the time, the reality is somewhat different. Most messages follow the “happy path.” Unforeseen circumstances may cause exceptions to occur. A resilient solution must account for exceptions and handle them in a way that is compatible with the objectives of the solution. Solution developers must be aware of the inherent capabilities and limitations of the platform to design exception handling strategies that take advantage of the platform’s capabilities and transcend its limitations. The “unhappy path” must be designed and implemented.

Exceptions in Java Collaborations

JMS-Triggered Java Collaborations

JMS-triggered Java Collaborations are special in Java CAPS in that there exists a prebuilt exception and retry handling infrastructure, provided by the JMS Message Server implementation, that will handle exceptions not explicitly handled by such collaborations. In ordinary circumstances, an unhandled exception in a JMS-triggered Java Collaboration will cause a transaction rollback to occur. The JMS Message Server will attempt to redeliver the message to the same collaboration. By default, if the condition that caused the exception persists, redelivery will continue indefinitely, preventing subsequent messages from being delivered and processed. With redelivery handling explicitly configured to change the default behavior, redelivery attempts may stop after some number of retries. Messages that could not be delivered will be discarded or diverted to another JMS Destination, depending on the explicitly configured redelivery handling behavior. Chapter 5, “Messaging Infrastructure,” section 5.13, discusses JMS redelivery handling at length. Solution designers can take advantage of this infrastructure to handle exceptions. Troublesome messages can be diverted to another destination to be handled by a different component. If the condition is transient, the designer can rely on the JMS Message Server to redeliver the message once the exception condition is cleared.

On exception, all transactional resources invoked by the collaboration prior to the exception will be rolled back. Conversely, nontransactional resources will not be rolled back. Side effects caused by such resource invocation will have to be carefully considered and accounted for in exception handling design. To be more specific, database insert or update operations that occur before an exception will be rolled back, but a file write or an HTTP POST will not be rolled back, since the resources involved are nontransactional.
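
To make the distinction concrete, the hypothetical fragment below sketches the body of a JMS-triggered collaboration in plain JDBC and file I/O terms. The connection is assumed to come from an XA-capable DataSource enlisted in the same transaction as the JMS receive; the table name, file path, and validation rule are invented for illustration and are not Java CAPS APIs.

import java.io.FileWriter;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OrderProcessor {

    public void process(String payload, Connection xaConnection)
            throws SQLException, IOException {

        // Transactional work: rolled back if an exception escapes this method.
        PreparedStatement ps =
                xaConnection.prepareStatement("INSERT INTO ORDERS (PAYLOAD) VALUES (?)");
        try {
            ps.setString(1, payload);
            ps.executeUpdate();
        } finally {
            ps.close();
        }

        // Non-transactional side effect: the file stays on disk even if the
        // transaction later rolls back and the JMS message is redelivered.
        FileWriter audit = new FileWriter("/tmp/order-audit-" + System.currentTimeMillis() + ".txt");
        try {
            audit.write(payload);
        } finally {
            audit.close();
        }

        // An exception thrown here and not handled rolls back the INSERT and
        // the JMS receive; the message becomes a candidate for redelivery
        // according to the Message Server's redelivery handling configuration.
        if (payload.length() == 0) {
            throw new IllegalArgumentException("empty payload");
        }
    }
}

If this message is redelivered, the INSERT will be attempted again from a clean slate, but a new audit file will be written on every attempt; that is exactly the kind of side effect the exception handling design must account for.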

This discussion is illustrated with a detailed example in Chapter 11, section 11.2.1.1, “JMS-Triggered Java Collaborations,” in Part II (located on the accompanying CD-ROM).

A solution designer can take advantage of the JMS redelivery handling to handle exceptions at a Java Collaboration component level. The built-in JMS redelivery mechanism can be utilized to overcome transient exception-causing conditions, such as temporary database unavailability, without requiring explicit logic in Java Collaborations. The designer must, however, consider side effects arising out of access to nontransactional resources, to minimize the adverse impact of retry attempts on these resources.

If the collaboration does not throw an exception, the message that triggered it will be consumed and the transaction that spans the collaboration will complete. If the collaboration handles exceptions that arise during its execution and does not rethrow any, the message that triggered it will also be consumed.

Other Java Collaborations

A JMS-triggered Java Collaboration does not “consume” a message if it throws an exception. The message can be redelivered, as discussed in the previous section. A collaboration triggered by other means generally consumes a message regardless of the outcome of its execution. How applicable this statement is depends on the eWay used to trigger the collaboration.

A Batch Inbound eWay, for example, might be invoked at intervals. It will scan the configured directory, and if it finds a file that matches the file name pattern, it will rename the file and deliver the new name and other details to the Java Collaboration. Once this is done, the file will not be renamed back even if the collaboration throws an exception. Effectively, the trigger will be lost and the file will not be processed. This behavior is illustrated in detail by an example in Chapter 11, section 11.2.1.2, “Other Java Collaborations,” in Part II.

Transactionality of the triggering endpoint determines whether the triggering message can be rolled back and retried. In general, the only common triggering endpoint with transactional capability is the JMS client. Just about all other endpoints are nontransactional, so their messages are not automatically rolled back and retried. If a collaboration receives a message from a nontransactional endpoint, experiences an exception, and does not handle it, the message will be lost. For messaging solutions that require guaranteed delivery, it is very important that collaborations are designed to handle exceptions so as not to cause message loss. One of the simplest ways of doing so is to ensure that the message is handled by two consecutive collaborations connected by a JMS Destination. The initial collaboration, triggered by the nontransactional endpoint, performs only the work that is absolutely essential and passes the message via JMS to the subsequent component, as sketched below. This subsequent component can implement whatever additional logic might be required. If it experiences an exception, JMS redelivery handling can be used to work around transient exceptions and to divert unprocessable messages to dead letter queues.
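
A minimal sketch of such a "thin front" component follows, written against the plain JMS 1.1 API rather than the generated Java CAPS wrappers. The class and queue names are assumptions; the point is only that the payload is handed to JMS before the nontransactional caller is acknowledged.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

public class FrontDoor {

    private final ConnectionFactory factory;
    private final Queue handoffQueue;   // the JMS Destination linking the two collaborations

    public FrontDoor(ConnectionFactory factory, Queue handoffQueue) {
        this.factory = factory;
        this.handoffQueue = handoffQueue;
    }

    /**
     * Triggered by a nontransactional endpoint (for example an HTTP request):
     * do the absolute minimum and persist the payload to JMS so that all
     * further processing is protected by redelivery handling.
     */
    public void accept(String payload) throws Exception {
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(handoffQueue);
            producer.send(session.createTextMessage(payload));   // persistent delivery is the JMS default
        } finally {
            connection.close();   // closing the connection also closes the session and producer
        }
        // Only after send() returns is it safe to report success to the caller;
        // the subsequent collaboration now relies on JMS, not on this component.
    }
}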

Within Java Collaborations, exceptions that need handling can be handled using standard Java techniques.

Because a Java Collaboration is invoked from a Stateless Session Bean or a Message-Driven Bean, hosted inside a Java EE container, there exists a class of exceptions that a Java Collaboration cannot handle. One of the things the container does, for example, is manage connections to external resources. The container establishes connections on behalf of objects that need them. If a connectivity issue arises, an exception may be thrown before the collaboration code can execute. The collaboration will not execute and will have no opportunity to intercept and handle the exception. Since these kinds of issues are outside the collaboration's control, collaborations cannot be written to deal with them directly. Some eWays, subject to this behavior, will generate one or more alerts. The optional Alert Agent product can be used to intercept such alerts and provide operator notification or trigger Java CAPS or external components to implement alternate processing logic. The use of the Alert Agent and JMX-based Java CAPS component control is discussed in Chapter 10, sections 10.2.5 and 10.2.9.4.

Faults in Business Processes

JMS-Triggered Business Processes

Chapter 4, “Message Exchange Patterns,” section 4.8.3, and Chapter 5, section 5.14.5.1, discuss eInsight capabilities for guaranteed delivery in the context of distributed transactions. A JMS-triggered eInsight Business Process, enrolled with a JMS Message Server–based distributed transaction, behaves differently from an ordinary eInsight Business Process. Applying XA to the entire Business Process, and marking suitable activities as “participating,” makes it behave in much the same way as a Java Collaboration. Unhandled exceptions will cause a JMS message to be rolled back and redelivered according to the JMS redelivery handling configuration.

Section 13.3.1.1 described the behavior of transactional and nontransactional resources in the context of a transaction. The same behavior applies to Business Processes under JMS-triggered XA. When a Business Process under XA experiences a fault that it does not handle, all transactional resources it invoked to that point will be rolled back, and all nontransactional resources will not be rolled back. Chapter 11, section 11.2.2.1, “JMS-Triggered Business Processes,” in Part II, walks step-by-step through a solution that illustrates this behavior.

Imposing distributed transaction semantics onto an eInsight Business Process changes its behavior with respect to XA-capable and non-XA-capable resources it orchestrates. In the presence of unhandled faults, XA-capable resources are rolled back and non-XA-capable resources are not, as is to be expected.

As of release 5.1.3, only Business Processes triggered by JMS message delivery can be configured to operate as distributed transactions. While the XA property can be enabled for Business Processes triggered by other means, this has no effect on their behavior, as no other resource that can trigger instance creation is XA-capable.

Because the XA Business Process is triggered by a JMS message, imposing distributed transaction semantics on that process effectively serializes it with respect to the JMS Destination. The process will behave the same way as a Java Collaboration receiving from a JMS Destination with Concurrency mode set to Serial in the receiver/subscriber connector on the connectivity map. Because the XA Business Process is serialized, its property Max Concurrent Instances value will be ignored and will have the effective value of 1.

In constructing XA Business Processes, care must be taken to ensure that lockable resources do not remain locked longer than necessary, as this can lead to resource contention and possibly to deadlocks. In particular, this means that XA should not be imposed on Business Processes that use correlations or user activities. Care must also be taken to appropriately design the handling of non-XA-capable resources with side effects, possibly using eInsight Compensation.

Note

XA transactionality in BPEL 1.0 Business Processes is nonstandard.

The foregoing discussion applies to eInsight Business Processes that explicitly or implicitly delegate fault handling to the eInsight invocation framework. Business Processes that explicitly handle faults behave much the same way whether they are XA-enabled or not, since only an unhandled fault triggers JMS rollback that affects XA-capable activities.

Fault Handlers

Exception and error events may occur during execution of business processes. In BPEL 1.0, these are known as faults and can be handled by the Business Process or can be ignored. [BPEL4WS06] has a good introductory discussion of this topic. The Java CAPS eInsight engine implements BPEL 1.0; therefore, it implements standard fault handling. Fault handlers for a set of activities are declared by enclosing the relevant activities in a scope and associating a fault handler with that scope. A fault arising in any activity within the scope will be handled by the associated fault handler. This is similar to the try-catch scope in Java. There can be a number of fault handlers associated with a single scope, one of which can be a catch-all handler. The others must handle specific named faults. Since scopes can be nested, there may be a hierarchy of fault handlers in place. Fault handlers deeper in the hierarchy may rethrow faults to be handled by the enclosing scope's fault handlers, if any. A Last Chance fault handler can be associated with a Business Process. Its implicit scope is the entire process. If present, this handler will handle all faults not handled by handlers associated with specific scopes or rethrown by them.
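
The scope-and-handler hierarchy maps quite directly onto nested Java try-catch blocks. The analogy below is plain Java, not eInsight code: the inner try stands for an inner scope with a named fault handler that rethrows, and the outer catch-all stands for a Last Chance handler on the process.

public class ScopeAnalogy {

    public void run() {
        try {                                          // outer scope / whole process
            try {                                      // inner scope
                invokePartnerService();
            } catch (IllegalStateException named) {    // handler for a specific named fault
                // Handle the fault, or rethrow it to the enclosing scope's handlers.
                throw named;
            }
            // Had the inner handler NOT rethrown, execution would resume here,
            // just as a process resumes after the scope whose fault was handled.
        } catch (RuntimeException any) {               // catch-all / Last Chance handler
            // Log, compensate, or terminate the "process instance" here.
        }
    }

    private void invokePartnerService() {
        throw new IllegalStateException("simulated fault from an invoked service");
    }
}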

Since BPEL orchestrates Web Services invocations, exceptions arising in invoked resources are really SOAP faults.

eInsight User’s Guide release 5.1.3, Chapter 8, “Catching Exceptions within Business Processes,” briefly discusses configuring exception handlers both at the scope and at the process level.

A fault handler, associated with a scope, may terminate the process instance as part of fault handling flow. If this is not the case, upon completion of the fault handler flow, the process will resume at the point following the end of the applicable scope. It is critical to remember this, as unexpected and difficult-to-diagnose side effects can arise if this point is missed. This point is illustrated in detail in Chapter 11, section 11.2.2.2, “Fault Handlers,” in Part II.

When designing fault handling, care must be taken to ensure that activities are correctly scoped and that fault handlers explicitly terminate process instances if they are not to resume execution with the activity following the scope.

Scopes can be nested. Each scope can have zero or more named fault handlers and zero or one catch-all fault handler.

Throwing a fault within a scope, handling it in an attached fault handler, and having the fault handling flow not invoke an explicit terminate activity is equivalent to an unconditional “jump to the first activity following the scope.” This may or may not be a valid design technique, depending on your outlook on life.

Potentially every activity in a Business Process can result in a fault being thrown. An unhandled fault will be rethrown to the eInsight execution framework. This will result in process instance termination. A “detached” fault handler that will handle all faults not otherwise handled can be implemented. This handler, behaving as though associated with an implicit scope encompassing the entire process, can handle some faults and rethrow others as may be required.

Web Services invocations, Java Collaboration invocations, subprocess invocations, eWay invocations, and OTD marshals and unmarshals can all cause a fault. Handling of exceptions thrown by a Java Collaboration is shown in Chapter 13, section 13.2.9, “Create a Business Process,” in Part II. Fault handlers for handling faults from other activities can be implemented in a similar manner.

Higher-Level Exception Handling

Exception handling may involve not just the Java- or BPEL-provided exception interception and handling constructs but also higher-level solution components that implement exception handling and alternate processing logic.

Chapter 9, “Messaging Endpoints,” section 9.4.2, discussed how exceptions arising out of an inability to deliver a payload to an FTP server can be handled. The associated example shows how a failed transfer attempt can be retried, at intervals, until it succeeds or until a set number of retry attempts has been exceeded. Among the exceptions the example handles are temporary unavailability of the FTP server, invalid credentials, and other issues that prevent the Batch FTP eWay from connecting to the host. It is an example of a generic try-wait-retry exception handling pattern. The jcdRetryResubmitter Java Collaboration, featured in the example, is an implementation of a generic JMS-based resubmission handler. At intervals, it polls a JMS queue containing messages to be resubmitted to another queue, inspects metadata carried as JMS message header user-defined properties, and, based on their values, resubmits each message to the queue to which it is supposed to go. If the count of retry attempts exceeds the maximum retry limit, it queues the message to the retries-failed queue, ending the retry cycle.
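
A hypothetical sketch in the spirit of jcdRetryResubmitter follows. The user property names (RetryCount, MaxRetries, ResubmitTo) and the assumption of text payloads are invented for illustration; the actual collaboration reads whatever metadata the failing component stamped onto the message.

import javax.jms.JMSException;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class RetryResubmitter {

    private final Session session;
    private final Queue retriesFailedQueue;

    public RetryResubmitter(Session session, Queue retriesFailedQueue) {
        this.session = session;
        this.retriesFailedQueue = retriesFailedQueue;
    }

    /** Invoked for each message polled from the resubmission queue. */
    public void onResubmitCandidate(TextMessage msg) throws JMSException {
        int retryCount = msg.getIntProperty("RetryCount");
        int maxRetries = msg.getIntProperty("MaxRetries");
        String targetQueueName = msg.getStringProperty("ResubmitTo");

        if (retryCount >= maxRetries) {
            // Give up: park the message on the retries-failed queue, ending the cycle.
            session.createProducer(retriesFailedQueue).send(msg);
            return;
        }

        // Received messages are read-only, so resubmit a copy with the
        // incremented retry count recorded for the next attempt.
        TextMessage copy = session.createTextMessage(msg.getText());
        copy.setIntProperty("RetryCount", retryCount + 1);
        copy.setIntProperty("MaxRetries", maxRetries);
        copy.setStringProperty("ResubmitTo", targetQueueName);
        session.createProducer(session.createQueue(targetQueueName)).send(copy);
    }
}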

Chapter 13, section 13.2, “Web Service, Stored Procedures, and XA,” in Part II, contains an example of handling exceptions from New Web Services Java Collaborations invoked from a Business Process. The process catches the exception, extracts the exception message, formats a SOAP Fault, and passes it on to the Web Services invoker.

Compensation

eInsight User’s Guide [eInsight] for Java CAPS release 5.1.3 briefly discusses the topic of compensation. [BPEL4WS06] elaborates more on the topic of compensation in BPEL4WS. Neither source is particularly illuminating.

For those familiar with transaction-oriented systems, it may come as a surprise that BPEL 1.0’s concept of a transaction is not the same as that to which they are accustomed. See an interesting discussion of BPEL4WS by Jean-Jacques Dubray [BPELDisc], including discussion of the Long Running Transactions (LRTs), which are “undoable” or “compensable.” In compliance with BPEL4WS 1.0, eInsight supports compensation.

eInsight 5.1 implements distributed transaction support, in the XA sense, for Business Processes initiated by a JMS receive activity that use XA-capable resources such as databases. This is discussed, in the context of implementing different patterns, in Chapter 4, section 4.8.3; Chapter 5, section 5.14.5.1; and in this chapter, section 13.3.2.1. For circumstances where a Business Process is short-lived, is triggered by JMS, and orchestrates XA-capable resources, XA may be appropriate. In all other circumstances, it may be necessary to implement compensating activities to handle reversal of side effects when a process is aborted before completion.

It is critical to appreciate that a compensating activity is an independent activity that permanently modifies a resource much as the original activity, which is being compensated, permanently modified the resource in the first place. There is no notion of a transaction with Atomicity, Consistency, Isolation, and Durability (ACID) properties. As soon as an activity modifies the resource, the modification is visible to all other accessors. A lot can happen to the resource between the time it is modified and the time the modification may need to be compensated. This severely limits the kinds of modifications that can be reasonably undone through compensation or greatly increases complexity of the process that may need to perform compensation. Let’s consider a couple of examples to illustrate the point.

Let’s imagine a travel reservation process that orchestrates a number of Web Services to book an airline ticket, book a room in a hotel, or rent a car. Each of these services is binary in effect. The car is reserved or it is not. A seat is booked or it is not. A room is booked or it is not. If our process instance causes a room to be booked, no other process instance, or indeed no other accessor, can book the same room. If the process causes room booking first, and fails to book an airline seat, then to compensate for the failure it simply needs to invoke a cancel room service. Once the booking is cancelled through compensation, the room is as it was, unbooked. Any other accessor can book it from that point. The compensation activity “knows” in what state the resource was and restores it to its original state.

To contrast this, let's imagine a process that causes funds transfer between accounts. The first activity credits the target account with the sum to be transferred. The second activity debits the source account with the sum to be transferred. If the source account does not have the funds, the debit operation fails and the credit operation must be undone. To undo the credit, you would debit the same amount from the target account. Because the credit and debit activities are not part of an atomic transaction, it is possible that funds were withdrawn from the target account, by some other application, before the debit from the source account was actioned. If the compensating activity simply restores the target account's state to that before the credit, then the withdrawal by the other application will be incorrectly overwritten. Someone will have gotten some money out of thin air. If the compensating activity debits the target account to reduce the balance by the amount originally credited, then the account could become overdrawn. Not a desirable state either. The long and the short of this discussion is that you must carefully consider the kinds of resources to which compensation is applied, and the implications and possible side effects of compensation on these resources.
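
To make the contrast explicit, the sketch below, written against an assumed account interface rather than any Java CAPS API, shows why compensation cannot be treated as rollback: restoring a snapshot destroys concurrent updates, while reversing the credit preserves them but may overdraw the account.

import java.math.BigDecimal;

public class TransferCompensation {

    /** Assumed account API; real resources would be reached through eWays or Web Services. */
    interface Account {
        void setBalance(BigDecimal newBalance);   // absolute overwrite
        void debit(BigDecimal amount);            // relative adjustment
    }

    /**
     * WRONG: restoring a snapshot silently overwrites any withdrawal another
     * application made after the credit, conjuring money out of thin air.
     */
    void compensateByRestoringSnapshot(Account target, BigDecimal balanceBeforeCredit) {
        target.setBalance(balanceBeforeCredit);
    }

    /**
     * Better, but still imperfect: reversing the credit as a relative debit
     * preserves concurrent changes, yet may leave the account overdrawn if the
     * credited funds have already been withdrawn. Compensation must be designed
     * around the resource's semantics; it does not behave like an ACID rollback.
     */
    void compensateByReversingCredit(Account target, BigDecimal amountCredited) {
        target.debit(amountCredited);
    }
}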

Travel reservation is perhaps the most often quoted example of use of Web Services. It might be argued that BPEL was invented for these kinds of application. Funds transfer, on the other hand, would be a singularly inappropriate application of BPEL. Chapter 13, section 13.3, “Example Travel Reservation,” in Part II, provides an extensive example of Web Services orchestration, including compensation, based on the simplified travel booking service.

High-Availability Architecture

By Brendan Marry

The Java CAPS high-availability architecture is intended to accommodate requirements of several applications. As different applications have different availability, load, and scalability requirements, the architecture discussed here is general in nature, and it must be reviewed and adapted for specific applications and workload characteristics.

The high-availability solution discussed here is required to provide intrasite high availability and failover, as well as intersite failover for disaster recovery. This section discusses a logical architecture and outlines one of the possible hardware architectures that satisfy these requirements.

Introduction

A number of factors must be considered when architecting highly available, fault-tolerant, and resilient systems:

  • What is connecting externally and internally to the integration layer?

  • What are the system availability requirements?

  • Can multiple hardware and software configurations be used for fault-tolerance, failover, and disaster recovery (intrasite)?

  • Can multiple sites be used for fault-tolerance, failover, and disaster recovery (intersite)?

  • What are security zoning requirements, if any?

  • What inherent resilience can be expected from each of the components?

  • What are the hardware, CPU (threads), RAM, and shared storage requirements?

This section discusses Java CAPS component properties relevant to the discussion, resilience options, and fault-tolerance properties of different component distribution and replication strategies.

Java CAPS Platform Components

Let’s consider components of the Java CAPS environment, design, management, and runtime; their resiliency requirements; and their ability to participate in resilient solutions.

Repository

The Java CAPS development environment uses a proprietary Repository to store design-time solution artifacts in hierarchical project structures. The application build process combines solution artifacts into Enterprise Archives that can be deployed to runtime environments hosted in Java EE-compliant Application Servers. As of Java CAPS release 5.1, runtime execution and management infrastructure do not depend on development infrastructure; consequently, the requirement to host the Repository on a highly available platform is eliminated.

Enterprise Manager

The Enterprise Manager, used for runtime management and monitoring of Java CAPS solution components, consists of the Enterprise Manager Server and a Web-based user interface. The execution environment does not depend on the Enterprise Manager Server; consequently, it is not necessary to host it on a highly available platform. Should the cluster node on which an instance of the Enterprise Manager is running fail, another instance can be readily started on the surviving node.

UDDI Registry

As Java CAPS–exposed Web Services are built, they are registered in the Java CAPS Universal Description, Discovery, and Integration (UDDI) Registry. If the UDDI Registry is not available, WSDL definitions can be exported to the file system instead. In Java CAPS, by default, the UDDI Registry is only used at build time. Through version 5.1.3, there are no runtime service discovery facilities in Java CAPS that use the UDDI Registry. The Java CAPS UDDI Registry does not need to be hosted on a highly available platform unless non-Java CAPS–based frameworks need to use it for dynamic runtime service discovery.

Integration Server

The Integration Server, or one of the supported Application Servers, provides the runtime environment for enterprise applications, whose components may include Java Collaboration, XSLT Collaboration, and BPEL engines, and other application services and components. The ability of enterprise applications to withstand Integration Server failures is highly dependent on how they are designed. The design must incorporate provisions for high availability or must account for the possibility of application failure on one node and resumption on another node. Stateless applications and applications that do not rely on the order of messages are good candidates for load-balancing deployments. They are far more likely to survive failures and node switches gracefully than applications that rely on preservation of context or message order. Applications exposed for external consumption, such as Web Services, are examples of stateless applications. Applications that implement message correlation are examples of context-dependent applications. The former can be load balanced, can run in Active/Active clusters, and can be switched between nodes in failover situations. The latter generally cannot be load balanced, run in Active/Active configurations, or be failed over without potential data loss.

eInsight Business Processes are expected to be long-running and therefore are more likely to be affected by platform failures. As of release 5.1, Java CAPS eInsight Business Process Manager is cluster-aware. An executing Business Process instance can be failed over from node to node in a cluster, as long as the Business Process does not use BPEL correlations and does not implement User Activities. To support instance failover, processes must use a single database that must be accessible to all instances of the BPEL engine. Furthermore, if processes access external resources such as databases or file system objects, the storage devices hosting these objects must be accessible to all nodes in the cluster.

Java Collaborations are expected to be short-lived. They operate within container-managed transactions. If the collaboration is invoked to process a JMS message and a node failure occurs, the JMS message will be rolled back. The message will become a candidate for processing on another node in the cluster. This will occur only if the JMS Message Server itself does not fail at the same time. If a collaboration is invoked by a non-XA-capable resource, such as HTTP GET or POST, the message that triggered it will be consumed and will not be available to be processed on another node. Message loss will occur.

It should be clear by now that the ability to build resilient, highly available, and scalable solutions is totally dependent on the nature and architecture of the applications.

Sun SeeBeyond JMS IQ Manager

The Java CAPS product suite comes bundled with a JMS Message Server, the Sun SeeBeyond IQ Manager. This Message Server is an implementation of the JMS 1.1 specification. The Sun SeeBeyond JMS IQ Manager is not cluster-aware; therefore, solutions that require high availability and fault tolerance will need to employ shared storage for JMS persistence and will have to operate in Active/Passive configurations.

Sun JMS Grid

The Sun JMS Grid, designed specifically for high availability and fault tolerance, is an alternative JMS 1.1 implementation available from Sun. It can be used standalone or in conjunction with the other Java CAPS products to provide reliable asynchronous communication between components in a distributed computing environment. An example Java CAPS solution using JMS Grid is discussed and illustrated in Chapter 5, section 5.7.

A JMS Grid system consists of one or more JMS Grid clusters, each with one or more JMS Servers (daemons), and one or more JMS Grid clients. Each JMS Grid daemon is expected to operate on an independent hardware configuration. Each JMS Grid daemon is responsible for managing client connections, delivery of messages to its clients, and replication of messages to other daemons in the cluster.

Each client is connected to a specific daemon with a socket connection. Messages are routed from the sending client, through one or more daemon processes, to one or more receiving clients. Message routing is managed by the JMS Grid cluster and is automatic. A sender client does not know or care to which daemon the recipient client is connected.

  • A JMS Grid cluster is a tightly coupled collection of daemon processes in which all daemons are interconnected. Client connections are spread across the available daemons, and all message data is replicated to provide fault tolerance. This also gives greater scalability than the JMS IQ Manager, as several JMS message daemons are servicing client requests.

  • A JMS Grid network is a loosely coupled collection of JMS Grid clusters. Clusters in the network are typically located at different sites and are connected over a wide area network. Specific daemons are connected between clusters in the network and forward eligible messages between clusters. Only messages required to be delivered to a client on the remote cluster are sent between clusters. Clusters and networks of clusters together provide scalability and fault tolerance.

Figure 13-5 depicts a JMS Grid cluster with three daemons.

Figure 13-5. JMS Grid cluster with three daemons

The diagram in Figure 13-6 shows a cluster of three daemons in a failure scenario. If one of these daemons fails, then any clients connected to that daemon are automatically and transparently failed over and reconnected to another daemon in the cluster.

Figure 13-6. Automatic client failover on daemon failure

JMS Grid clusters add resilience, fault tolerance, and scalability to the JMS messaging infrastructure.

JMS Grid ensures guaranteed delivery of persistent messages between clients. This is achieved using a synchronous acknowledgment mechanism between clients and daemons. When a message is sent, the sending client waits for an acknowledgment from the daemon. The acknowledgment indicates the message has been persisted in the message store and is safe from system failure. At this point the client discards the message from its in-memory cache.

During the acknowledgment cycle, the daemon persists the message in its recoverable message store and sends replicas out to all other daemons in the cluster. It then waits for all the other daemons to send back acknowledgments that they have received and persisted each message replica. The daemon then sends an acknowledgment back to the sending client. The process is depicted in Figure 13-7.

Figure 13-7. Guaranteed delivery with persistence process

Application Connectivity

Let us consider a solution in which applications and interfaces will be deployed to a Java CAPS cluster with two or more nodes and which does not use JMS Grid. This configuration is depicted in Figure 13-8 and discussed here.

Figure 13-8. Java CAPS and JMS Grid clusters–based architecture

  • Each application will connect to a database that stores application data.

  • The JMS Message Service, which applications use, will persist messages using shared storage (SAN).

  • External Web Service clients will invoke Web Services hosted by Java CAPS cluster using a load balancer.

  • Web Service clients hosted on Java CAPS nodes in the cluster will directly invoke external Web Services.

  • FTP connectivity will be via the internal FTP server.

Intrasite Failover

Intrasite failover addresses availability of applications in the event of an application server instance or hardware instance failure. The design discussed here introduces redundancy, increasing the availability and scalability of the applications at the single site.

An operating system cluster solution will increase horizontal scalability and provide application resilience. As the load on the cluster increases, additional nodes may be added. The cluster will monitor running instances of Java CAPS components. It will restart failed instances, if any, or will start instances on surviving cluster nodes if a node fails.

Note

JMS Grid could be used to address JMS Queue Manager resiliency without the need for shared storage.

If the Sun SeeBeyond IQ Manager process running on a node fails or the node on which the Queue Manager is running becomes unavailable, the cluster will start the IQ Manager on one of the surviving nodes. It will use the same IP address and port number. The use of SAN disks for storage of JMS persistence data introduces redundancy and protects against disk failure. Data stored on SAN disks is accessible from any node in the cluster that is currently running the IQ Manager. This is an example of an Active/Passive automatic failover configuration.

Integration Servers will be arranged in an Active/Active configuration with the same set of applications deployed on both cluster nodes. This will maximize throughput and increase scalability, as both nodes are running applications and therefore twice as many threads are executing application code. Having the same set of services running on each node in the cluster simplifies horizontal scalability and reduces complexity of the integration layer.

The load balancer is configured to deliver HTTP and TCP/IP requests to each of the running applications. Should one of the Integration Servers go down, the load balancer will deliver requests to the remaining Integration Servers. This configuration is depicted in Figure 13-9.

Figure 13-9. High-availability single-site configuration with HTTP and TCP load balancer

Intersite Failover Architecture

Intersite failover configuration improves availability of applications in the case of a site failure. Two sites are deployed: a primary, Site A, and a secondary, Site B. In the event of a disaster, the secondary site's Application Server is activated and begins receiving Web Services, HTTP, and TCP/IP request traffic from the load balancer.

A JMS Grid network or JMS persistence data replication using SAN storage replication would be implemented to ensure JMS messages are available to both sites. If we assume that, after a failover, JMS Destinations are in the same state as they were on the failed site before the failure, processing of transactions will continue uninterrupted. However, implementation of this level of synchronization between sites incurs an unavoidable performance penalty. This configuration is depicted in Figure 13-10.

Figure 13-10. High-availability dual-site configuration with HTTP and TCP load balancers

Queue Failover Options

As discussed, JMS Destination failover can be implemented using the JMS Grid or the JMS IQ Manager Persistence Store replication. The preferred option for a distributed architecture is JMS Grid because of its distributed capabilities.

JMS Grid clusters and networks can be used to implement guaranteed message delivery, replication, and fault tolerance of messages across both Site A and Site B.

JMS Grid network connections can withstand the connection going down for a period of time, as is sometimes the case with a WAN. In such situations, messages are simply stored on the sending cluster, ready to be forwarded when the connection is reestablished.

This configuration gives the best performance but has the disadvantage that messages may be lost. In the event that the connection between the sites goes down prior to a failover, or in the case where failover occurs before all messages are written to the storage on the Disaster Recovery (DR) site, messages not propagated to the alternate site become inaccessible or are lost.

JMS Grid-Based Replication

For a distributed architecture, the preferred approach is to implement two levels of replication, keeping both of the following options available.

Level One

A JMS Grid cluster is distributed between sites connected using a high-speed network. This configuration is appropriate for mission-critical applications that will tolerate no message loss and that require high availability.

Advantages include the following:

  • No message loss.

  • No downtime.

Disadvantages include the following:

  • Increased network latency as each message is written to both sites’ message stores before acknowledgment is sent back to the client.

Level Two

A JMS Grid network has JMS Grid clusters running on both sites. This configuration is suitable for high-throughput applications that have a high-availability requirement and can tolerate some message loss.

Advantages include the following:

  • Faster performance, as replication occurs only at the local site.

Disadvantages include the following:

  • Potential message loss in the event that the connection between the two sites goes down prior to a failover, or in the event that failover occurs before all messages are written to the storage on the alternate site.

Queue Manager Disk-Based Replication

This option relies on disk replication of the queue data from the primary Site A to the secondary Site B. This is the least preferred option and requires additional resources to be implemented.

Summary

For a highly available Java CAPS solution, in situations where costs are not a concern and the availability and fault tolerance of the integration environment are of the highest priority, consider the following suggestions:

  • Java CAPS domains should be deployed in an Active/Active configuration managed by clustering software that will dynamically and horizontally scale the integration layer’s processes.

  • A JMS Grid cluster should be deployed to resiliently store and guarantee delivery of JMS messages among the Java CAPS domains.

  • A load-balancing switch should be deployed to manage requests to the integration layer.

  • A SAN storage configuration should be used for queue data and software component binaries.

To facilitate continuity of operation, it is suggested that the JMS Grid cluster be deployed over both the primary and secondary sites. In the case where availability and performance are the main concerns, it is suggested that a JMS Grid network consisting of two JMS Grid clusters, one on each site, be deployed.

As each environment and suite of applications and services may have different requirements, the high-availability architecture that is appropriate will also vary from one set of circumstances to another.

Chapter Summary

Scalability and resilience are important in business solutions. These properties enable smooth operation as workload fluctuates and exceptional circumstances occur.

For scalability, Java CAPS solution components can be distributed over available infrastructure. This chapter discussed some approaches to distribution, which vary depending on whether and to what extent eInsight Business Processes are involved. Business Process refactoring in support of component distribution, as well as distribution of eGate components, was discussed.

Resilience, the ability of a solution to withstand disruptions caused by expected or unexpected exceptions, is greatly influenced by solution design. Exception interception and handling properties of the tools greatly influence the designer's ability to construct resilient solutions. Some solutions require exception handling that spans multiple components at the business level rather than the technical level. This chapter discussed exception and fault handling features of Java Collaborations and eInsight Business Processes as well as component-spanning exception handling. Constructing highly available solutions that take advantage of hardware and infrastructure software was also discussed.

Together, all the features and options can be used by an enterprise architect to design and implement solutions that have the desired degree of resilience and scalability.
