In this recipe, we are building an Orchestrator cluster and configuring it for HA. Load-balancing is discussed in a separate recipe in this chapter.
The prerequisites for a cluster are not that hard, but they are important:
In this recipe, I will deploy and configure two fresh Orchestrator installations and configure them into a cluster.
Before we come to the main event, we need to prepare some things:
We now prepare the first node of the cluster:
We are now configuring the cluster settings:
Set the Heartbeat interval (in seconds) to 1 and the Number of failover heartbeats to 5. This setting will make sure that the Orchestrator server fails over after 5 seconds. More about these settings in the How it works... section.
Now we are joining an additional Orchestrator to the cluster; with 7.1 this becomes extremely easy:
You can see which Orchestrator is the active node by looking at the state. The local node is the node you are currently connected to.
When you want your Orchestrator cluster to work properly, you also need to make sure that the Orchestrator VMs are correctly configured in vSphere:
SeparateOrchestrator. We will now simulate a cluster failover:
The push configuration is a new feature in vRO 7.1 and makes the synchronization of clusters much easier. Let's have a look at this:
Since vRO 7.1, the configuration of Orchestrator clusters has become a lot easier. The fact that an additional node is automatically synced to the configuration of the cluster is a massive improvement. The other function that was added is the ability to actively push a configuration to all the nodes, which makes changing clusters easier.
The Orchestrator cluster can function in two ways. The first and easiest is HA mode. This means that we have at least two Orchestrator installations; if one fails, the other continues running. When a workflow is running, Orchestrator saves the state of the workflow to the database before executing each workflow element. This is the same behavior that lets us resume failed workflows or debug them (see the recipe Resuming failed workflows in Chapter 4, Programming Skills).
When one server fails, the new active node picks up the last saved state of the workflow execution and continues it. For a purely HA setup, you set the Number of active nodes to one.
The difference between HA and load-balanced mode is that in load-balanced mode, multiple Orchestrator instances can execute workflows at the same time, meaning that each Orchestrator instance does less work. For load balancing, you need to set the Number of active nodes to more than one, and you should configure a load balancer for round-robin.
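The round-robin scheme that the load balancer applies across the active nodes can be sketched as follows. This is only an illustration of the distribution pattern, not Orchestrator code; the node names are invented, and in practice a dedicated load balancer does this for you:

```javascript
// Minimal round-robin selector, as a load balancer would apply it
// to the active Orchestrator nodes. Node names are invented examples.
function makeRoundRobin(nodes) {
  let next = 0;
  return function pick() {
    const node = nodes[next];
    next = (next + 1) % nodes.length; // wrap around to the first node
    return node;
  };
}

const pick = makeRoundRobin(["vro-node1", "vro-node2"]);
// Successive requests alternate between the active nodes:
console.log(pick()); // vro-node1
console.log(pick()); // vro-node2
console.log(pick()); // vro-node1
```

Each incoming request lands on the next active node in turn, which is why every active instance ends up doing less work.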
You can, of course, use both modes at the same time. For example, if you have four Orchestrator nodes and have configured the Number of active nodes to two, two of the Orchestrators are active and two are in standby. If one of the active nodes fails, one of the standby nodes is brought into active mode. When the failed node becomes available again, it rejoins as a standby node.
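The active/standby bookkeeping described above can be sketched like this. It is a simplified illustration only: the node names are invented, and the real promotion logic is internal to Orchestrator:

```javascript
// Sketch of the active/standby behavior described above,
// with the Number of active nodes set to two (node names invented).
const desiredActive = 2;
const nodes = [
  { name: "vro1", state: "active" },
  { name: "vro2", state: "active" },
  { name: "vro3", state: "standby" },
  { name: "vro4", state: "standby" },
];

function nodeFailed(name) {
  nodes.find(n => n.name === name).state = "failed";
  // Promote standby nodes until the desired active count is restored.
  while (nodes.filter(n => n.state === "active").length < desiredActive) {
    const standby = nodes.find(n => n.state === "standby");
    if (!standby) break; // no standby node left to promote
    standby.state = "active";
  }
}

function nodeRecovered(name) {
  // A recovered node rejoins as standby, not as active.
  nodes.find(n => n.name === name).state = "standby";
}

nodeFailed("vro1");    // vro3 is promoted to active
nodeRecovered("vro1"); // vro1 comes back as a standby node
```

Note that the recovered node does not reclaim its active slot; it simply becomes a spare again.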
The Heartbeat interval (in seconds) is the interval at which an Orchestrator node sends keep-alive signals to all other nodes of the cluster.
The Number of failover heartbeats defines how many keep-alive signals can be missed before a node is declared dead by the other members of the cluster.
You determine the failover time by multiplying the Heartbeat interval (in seconds) by the Number of failover heartbeats.
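For example, with the values used in this recipe:

```javascript
// Failover time = Heartbeat interval (seconds) x Number of failover heartbeats.
const heartbeatInterval = 1;  // seconds between keep-alive signals
const failoverHeartbeats = 5; // missed signals before a node is declared dead
const failoverTime = heartbeatInterval * failoverHeartbeats;
console.log(failoverTime + " seconds"); // 5 seconds
```

A shorter interval or fewer allowed missed heartbeats means faster failover, at the price of being more sensitive to transient network hiccups.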
If you want to use local files in a clustered Orchestrator environment you should use NFS or SMB shares. See the recipe Configuring access to the local filesystem in Chapter 2, Optimizing Orchestrator Configuration.
At the time of writing (vRO 7.1.0), when a node joins a cluster, it automatically takes over the certificate of the primary host. If you reconfigure a node with a different certificate, the cluster will be out of sync. If your security policy doesn't allow SAN certificates, you can run with an unsynced cluster. It's not nice, but it works.
VMware has promised to make sure that in the next release the certificates will not be pushed out automatically, allowing you to create a separate machine account for each node.
When you have more than one active Orchestrator, you need to think about Orchestrator Client usage. Officially, this usage is not supported, but it works anyhow. The problem is that it would be possible for two users (one on each Orchestrator) to modify the same resource (for example, a workflow). This can be worked around by not giving users edit or administrator rights (see the recipe User management in Chapter 7, Interacting with Orchestrator) or by using locks (see the recipe Using the Locking System in Chapter 8, Better Workflows and Optimized Working).
The supported best practice, however, is to test a change on a separate Orchestrator installation and then transfer it to the cluster while only one Orchestrator node is running and the workflow that is to be changed is not in use.
When you want to change content, such as workflows, that is stored on the cluster, you must shut down all but one of the Orchestrator services, change the content on the remaining server, and then restart the other Orchestrator services.
If you are adding a new plugin, you will need to install this plugin on all nodes before restarting the Orchestrator services.
When you want to change the Orchestrator server settings, it's best to stop all but one of the Orchestrator nodes, change the settings, and then restart the others. If you don't, you will end up with an unstable cluster, meaning that the cluster fails over from one node to the other all the time. Try it out...
There are a few more things worth knowing.
When you are writing to logs in your workflow while using clusters, you should use the Server log, not the System log, as the System log is written to the localhost while the Server log is written to the database. Check out the example workflow for this recipe.
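As a sketch of the difference, here is the kind of scriptable-task code involved. Note that Server and System are vRO built-ins; the stubs below only stand in for them so that the snippet can run outside Orchestrator, and the log arrays are invented stand-ins for the database and the local log file:

```javascript
// Stand-ins for the vRO built-ins, so this sketch runs outside Orchestrator.
// Inside a real workflow scriptable task, Server and System already exist.
var database = [];     // simulates the cluster-wide database
var localhostLog = []; // simulates the local file log
var Server = { log: function (msg) { database.push(msg); } };
var System = { log: function (msg) { localhostLog.push(msg); } };

// In a clustered setup, prefer Server.log: it is written to the database,
// so the entry survives a failover and is visible from every node.
Server.log("Step 1 done");

// System.log only writes to the local host the element ran on,
// so the entry is lost from view after a failover to another node.
System.log("Step 1 done");
```

In the real product, the Server log entries show up on the Events tab cluster-wide, while System log entries stay with the individual node.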
If you are looking for pure load-balancing, as in trying to run a process on several Orchestrators in parallel, you could also consider using the AMQP plugin. Have a look at the recipe Working with AMQP in Chapter 10, Built-in Plugins.
In the example package, there is a workflow called 03.01 Cluster Test. For it to work, follow these steps:
You will see that the logs show only the entries that were made after the workflow execution switched to the new host (System.log). The Events tab will show all the log entries (Server.log).
The recipe Working with AMQP in Chapter 10, Built-in Plugins, for alternative workload balancing.
The recipe Configuring the Orchestrator service SSL certificate in Chapter 2, Optimizing Orchestrator Configuration, for creating SSL certificates.
The recipe Load-balancing Orchestrator in this chapter to understand and set up load-balancing.