Resuming failed workflows

This recipe looks at Orchestrator's ability to resume a failed workflow. When an element fails, you can correct the variables and continue from that element instead of starting the workflow over.

Getting ready

You need a working Orchestrator and the rights to create and run workflows. We will work with the Orchestrator Client.

To keep things simple, we reuse the workflow from the Error handling in workflows recipe in Chapter 5, Visual Programming (the 05.03.01 Error Handling example workflow). If you don't have it, please create it as described there or use the example package that is supplied with this book.

How to do it...

The following steps showcase the functionality:

  1. Create a new workflow.
  2. Drag a Workflow element onto the schema and select the workflow we created in the Error handling in workflows recipe in Chapter 5, Visual Programming.
  3. Bind the in-parameter of the Error Handling workflow element to an in-parameter of the new workflow, so that the value you enter at start is passed on to it.
  4. Add two System log elements, one before and one after the workflow element, and have them write something to the log, such as Before and After.
  5. Drop a Throw exception element directly onto the workflow element from step 2.
  6. Click on General in the main workflow and then select Enable for Resume from failed behavior.

  7. Click on Save and Close.
  8. Run the workflow and enter 5 (this will result in an error). A window will now pop up and ask whether you would like to Cancel or Resume the workflow.

  9. Choose Resume. You can now change all the variables of the workflow. Enter 2 and click on Submit.
  10. The workflow now runs through as if nothing has happened. Check the Logs.

Notice that the first log message and the log message from the Error Handling workflow were each written only once. The resume process reran just the failed scriptable task, not the whole workflow from the beginning.

How it works...

The ability to resume a workflow is quite a powerful tool. Instead of rerunning a failed workflow from the start, and in some cases first rolling back the operations it has already performed, you can resume at the element in which the error occurred.

Please note that in our little example we used a workflow inside a workflow, and the workflow that failed didn't have the resume behavior enabled. This means that you don't have to enable it on every workflow, only on the main one that calls all the others. You can also see that only the failed element is rerun, which in our case is the scriptable task inside the Error Handling workflow, not the whole Error Handling workflow.

For example, say you have a workflow that creates a VM, adds a virtual disk, and powers it on. If the workflow fails because the datastore is out of disk space, then without the resume feature you have to rerun the whole workflow. This is a particular problem if some other application triggered the workflow via the Orchestrator API. With the resume feature, you can simply add the required disk space to the datastore and resume the workflow, or resume it with a different datastore.

However, you need to understand that you can only change variables or rerun the same failed element. If the error can't be remedied by a change of the variable content or by rerunning later, the resume function will not help you.

In addition to this, rerunning some failed elements can have very undesirable results. For example, suppose you add two items to a database in one scriptable task and the insertion of the second fails. If you resume the workflow, the whole scriptable task runs again and the first item is added twice. So be careful.
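One way to guard against this is to write the element so that it is safe to run twice, for instance by checking whether an item already exists before inserting it. The following is only a minimal sketch in plain JavaScript under stated assumptions: the in-memory table object and the functions insertIfMissing and addTwoItems are hypothetical stand-ins for whatever database access your workflow really uses, and the failOnSecond flag merely simulates the error.

```javascript
// Hypothetical in-memory stand-in for a database table. A real workflow
// would call whatever database plug-in or API it actually uses.
var table = {};

// Insert only when the key is absent, so a rerun skips rows already written.
function insertIfMissing(key, value) {
    if (Object.prototype.hasOwnProperty.call(table, key)) {
        return false; // already present from an earlier, partly successful run
    }
    table[key] = value;
    return true;
}

// Body of the (hypothetical) scriptable task: add two items in one element.
// failOnSecond simulates the error that hits the second insert.
function addTwoItems(failOnSecond) {
    var inserted = [];
    if (insertIfMissing("item-1", "first")) {
        inserted.push("item-1");
    }
    if (failOnSecond) {
        throw new Error("insert of item-2 failed");
    }
    if (insertIfMissing("item-2", "second")) {
        inserted.push("item-2");
    }
    return inserted;
}

// First run: item-1 is written, then the element fails.
var firstRunFailed = false;
try {
    addTwoItems(true);
} catch (e) {
    firstRunFailed = true;
}

// Resume: the whole element runs again, but only item-2 is actually inserted.
var insertedOnResume = addTwoItems(false);
```

Because the existence check is part of the element itself, it makes no difference whether the element runs for the first time or is rerun after a resume.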

The secret to the resume feature lies in the way that Orchestrator works. When a workflow is executed, Orchestrator writes checkpoints to its database, one before each step of the workflow is executed. Each checkpoint consists of all variable values. This is why, when you resume a workflow, you are presented with all the variables that exist in the workflow.
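The checkpoint idea can be illustrated with a toy model. This is not Orchestrator's engine, just a sketch in plain JavaScript: the runWorkflow function, the step names, and the rule that an input greater than 3 fails are all assumptions made for illustration. The sketch copies all variables before each step, and on failure it returns that copy along with the index of the failed step so that the run can continue from there.

```javascript
// Toy model only, not Orchestrator's engine. A "workflow" is a list of named
// steps that read and modify a shared variables object.
function runWorkflow(steps, variables, startIndex, log) {
    for (var i = startIndex; i < steps.length; i++) {
        // Checkpoint: copy all variable values before the step executes.
        var checkpoint = JSON.parse(JSON.stringify(variables));
        try {
            steps[i].run(variables);
            log.push(steps[i].name);
        } catch (e) {
            // Report where to resume and the variable values at that point.
            return {
                status: "failed",
                failedIndex: i,
                variables: checkpoint,
                error: e.message
            };
        }
    }
    return { status: "completed", variables: variables };
}

// Hypothetical steps: the middle one fails for inputs greater than 3.
var steps = [
    { name: "Before", run: function (v) { v.beforeDone = true; } },
    { name: "Check", run: function (v) {
        if (v.input > 3) { throw new Error("input too large: " + v.input); }
    } },
    { name: "After", run: function (v) { v.afterDone = true; } }
];

var log = [];
var first = runWorkflow(steps, { input: 5 }, 0, log);

// "Resume": edit the checkpointed variables, then rerun from the failed step.
first.variables.input = 2;
var second = runWorkflow(steps, first.variables, first.failedIndex, log);
```

In this model the Before step is logged only once, because the second call starts at the failed step rather than at the beginning.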

There's more...

The resume function is, by default, switched off system-wide. You can switch it on system-wide by setting the com.vmware.vco.engine.execute.resume-from-failed system property to true. See the Control Center titbits recipe in Chapter 2, Optimizing Orchestrator Configuration.

If you are considering using the resume function, it is a good idea to define the timeout. The timeout defines how long a workflow waits in resume mode before failing. This makes sure that workflows don't stay in resume mode indefinitely and that any human interaction has to take place within a certain time frame.

Tip

I personally would urge caution with switching on the resume feature system-wide because, as mentioned, not every workflow can or should be recoverable. Instead, consider writing a good error response and making a general decision on whether you want to roll back or push forward.

See also

The example workflow 04.04 Resume Workflow.
