This recipe looks at the ability to resume a failed workflow: when an error occurs, you can continue the workflow instead of starting it again.
You need a working Orchestrator and the rights to create and run new workflows. We will work with the Orchestrator Client.
To make it easier, we reuse the workflow we will create in the Error handling in workflows recipe in Chapter 5, Visual Programming (the 05.03.01 Error Handling example workflow). If you don't have it, please create it as described or use the example package that is supplied with this book.
The following steps showcase the functionality:
Before and After onto the log.
5 (this will result in an error). A window will now pop up and ask whether you would like to Cancel or Resume the workflow.
2 and click on Submit. Notice that the first log messages and the log message from the Error Handling workflow were only written once, so the resume process just reruns the scriptable task and not the whole workflow from the beginning.
The ability to resume a workflow is quite a powerful tool. Instead of rerunning failed workflows and, in some cases, rolling back the previous operations, you are now able to resume at the same element the error occurred in.
Please note that in our little example we used a workflow inside a workflow, and the workflow that failed didn't have the Resume action assigned to it. This means that you don't have to assign the Resume action to all workflows, but just to the main one that calls all the others. Also, you can see that only the failed element is rerun, which in our case is the scriptable task inside the Error Handling workflow, not the whole Error Handling workflow.
For example, you have a workflow that creates a VM, adds a virtual disk, and powers it on. If the workflow fails because you are out of disk space on the datastore, you would normally have to rerun the whole workflow. This is especially painful if some other application triggers the workflow via the Orchestrator API. With the resume function, you can simply add the required disk space to the datastore and resume the workflow, or just use a different datastore.
However, you need to understand that you can only change variables or rerun the same failed element. If the error can't be remedied by a change of the variable content or by rerunning later, the resume function will not help you.
In addition to this, rerunning some failed elements can have very undesirable results. For example, suppose you add two items to a database using one scriptable task and the insertion of the second fails. If you resume the workflow, the whole scriptable task runs again, and the result is that you have added the first item twice. So be careful.
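The double-insert problem can be illustrated with a small Python sketch. This is not Orchestrator code: the list stands in for the database, and every name in it (`make_flaky_insert`, `naive_task`, `guarded_task`, `run_then_resume`) is hypothetical and exists only for this illustration.

```python
class InsertFailed(Exception):
    """Simulated database error."""


def make_flaky_insert(table):
    """Return an insert function whose first attempt at 'item-2' fails."""
    state = {"failed_once": False}

    def insert(item):
        if item == "item-2" and not state["failed_once"]:
            state["failed_once"] = True
            raise InsertFailed("simulated failure on the second insert")
        table.append(item)

    return insert


def naive_task(insert):
    # Both inserts live in one element, so a rerun repeats the first one.
    insert("item-1")
    insert("item-2")


def guarded_task(insert, table):
    # Skip items that are already present, so a rerun is harmless.
    for item in ("item-1", "item-2"):
        if item not in table:
            insert(item)


def run_then_resume(task, *args):
    try:
        task(*args)
    except InsertFailed:
        task(*args)  # "resume": rerun the same failed element once


naive_table = []
run_then_resume(naive_task, make_flaky_insert(naive_table))

guarded_table = []
run_then_resume(guarded_task, make_flaky_insert(guarded_table), guarded_table)

print(naive_table)    # the first item appears twice
print(guarded_table)  # each item appears once
```

The guarded variant shows one possible way to make an element safe to rerun: check what has already been done before doing it again. Splitting the two inserts into separate elements would be another option.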
The secret to the resume feature lies in the way that Orchestrator works. When a workflow is executed, Orchestrator writes checkpoints to its database, one before each step in the workflow is executed. Each checkpoint consists of all variable values. This is why, when you resume a workflow, you are presented with all the variables that exist in the workflow.
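To make the checkpoint idea more tangible, here is a minimal Python sketch of a toy engine that copies all variables before each step and can resume from the failed step with changed values. It only models the behavior described above; it is not how Orchestrator is implemented, and `MiniEngine`, `WorkflowFailed`, and the three step functions are invented for this example.

```python
import copy


class WorkflowFailed(Exception):
    """Raised when a step fails; carries the index of the failed step."""

    def __init__(self, step_index, cause):
        super().__init__(f"step {step_index} failed: {cause}")
        self.step_index = step_index
        self.cause = cause


class MiniEngine:
    """Toy workflow engine that snapshots all variables before each step."""

    def __init__(self, steps):
        self.steps = steps        # list of callables taking the variables dict
        self.checkpoints = {}     # step index -> copy of the variables

    def run(self, variables, start=0):
        for index in range(start, len(self.steps)):
            # Checkpoint: store every variable value before the step executes.
            self.checkpoints[index] = copy.deepcopy(variables)
            try:
                self.steps[index](variables)
            except Exception as exc:
                raise WorkflowFailed(index, exc) from exc
        return variables

    def resume(self, failed_index, overrides=None):
        # Restore the checkpoint taken before the failed step, apply the
        # changed values, and rerun from that step only.
        variables = copy.deepcopy(self.checkpoints[failed_index])
        variables.update(overrides or {})
        return self.run(variables, start=failed_index)


log = []


def log_before(variables):
    log.append("Before")


def divide(variables):
    variables["result"] = 10 / variables["divisor"]  # fails when divisor is 0


def log_after(variables):
    log.append("After")


engine = MiniEngine([log_before, divide, log_after])
try:
    engine.run({"divisor": 0})
except WorkflowFailed as failure:
    # Change the offending variable and continue at the failed step.
    final = engine.resume(failure.step_index, {"divisor": 2})

print(log)    # each message was logged once
print(final)
```

Because the sketch resumes at the failed step, the steps before it are not executed again, which is why each log message is written only once.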
The resume function is, by default, switched off system-wide. You can switch it on system-wide using the com.vmware.vco.engine.execute.resume-from-failed system property and setting it to true. See the Control Center titbits recipe in Chapter 2, Optimizing Orchestrator Configuration.
If you consider using the resume function, it is a good idea to define the timeout. The timeout defines how long a workflow waits in resume mode before failing. This makes sure that workflows don't stay in resume mode indefinitely and that human interaction has to take place within a certain time frame.
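The effect of such a timeout can be sketched in a few lines of Python. Again, this only models the behavior described above and is not Orchestrator's implementation; the `ResumeWindow` class, its state names, and the fake clock are assumptions made for the illustration.

```python
import time


class ResumeWindow:
    """Tracks how long a failed run may wait for a resume decision."""

    def __init__(self, timeout_sec, clock=time.monotonic):
        self.clock = clock
        self.deadline = clock() + timeout_sec

    def state(self):
        # Before the deadline the run can still be resumed; afterwards it
        # is treated as failed for good.
        return "waiting" if self.clock() < self.deadline else "failed"


# A fake clock lets us step through time without actually sleeping.
now = [0.0]
window = ResumeWindow(timeout_sec=300, clock=lambda: now[0])

state_at_start = window.state()

now[0] = 299.0
state_before_deadline = window.state()

now[0] = 300.0
state_at_deadline = window.state()
```

Injecting the clock keeps the sketch deterministic; with the default `time.monotonic`, the same class would measure real elapsed time.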
I personally would urge caution with switching on the resume feature system-wide because, as mentioned, not every workflow can or should be recoverable. Instead, consider writing a good error response and making a general decision on whether you want to roll back or push forward.