Recovery attempts · Documentation

Note: As of March 27th 2023 this documentation is out of date and is no longer being maintained. Up-to-date documentation is now included in the PFTrack and PFClean software installers

If a processing job fails, a number of attempts are made to automatically re-run it.

The first attempt is to simply re-run it immediately.
The second attempt is to wait three minutes before re-running it.
The third attempt is to try running it on each available render node in turn.
The fourth, and final, attempt is another immediate re-run.
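
The sequence above can be sketched in a few lines of Python. This is only an illustration of the order of the attempts, not the software's actual implementation: the function name run_with_recovery, the run_job callback and the node names are all hypothetical, and passing None as the node is assumed to mean "let the scheduler choose".

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical callback: run the job on the given node and return True on success.
# A node of None is assumed to mean "let the scheduler pick a node".
RunJob = Callable[[Optional[str]], bool]


def run_with_recovery(
    run_job: RunJob,
    nodes: Sequence[str],
    wait_seconds: float = 180.0,  # the three-minute wait described above
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Run a job, then work through the four recovery attempts until one succeeds.

    Returns (succeeded, log), where log lists each attempt and its outcome.
    """
    log: List[Tuple[str, bool]] = []

    def attempt(label: str, node: Optional[str] = None) -> bool:
        ok = run_job(node)
        log.append((label, ok))
        return ok

    if attempt("initial run"):
        return True, log
    # First attempt: re-run immediately.
    if attempt("attempt 1: immediate re-run"):
        return True, log
    # Second attempt: wait for a temporary fault to clear, then re-run.
    sleep(wait_seconds)
    if attempt("attempt 2: re-run after wait"):
        return True, log
    # Third attempt: try each available render node in turn.
    for node in nodes:
        if attempt(f"attempt 3: run on {node}", node):
            return True, log
    # Fourth attempt: one final immediate re-run.
    ok = attempt("attempt 4: final re-run")
    return ok, log
```

Passing a no-op in place of sleep (for example sleep=lambda s: None) makes it possible to step through the sequence without actually waiting three minutes.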

The reasoning behind this approach is as follows:

The first re-run is to handle the case where the render node processing the job goes down. Immediately re-running the job, potentially on a different node if the original node is unavailable, will allow the job to complete.

The second re-run is to handle the case where there is a temporary failure, such as the network going down or a machine rebooting. Waiting for three minutes will allow that fault to clear.

The third attempt covers the case where a specific machine is unable to handle the job, for example because the job requires a QuickTime codec that is not present on that machine. By trying each machine in turn, a node that can process the job can be found.

There is no particular reasoning behind the fourth attempt; it is simply one final attempt to run the job.

While going through these various re-run attempts, the batch job is classified as being in a warning phase, i.e. something has gone wrong but the software is trying to recover. Only when all of the re-run attempts have failed is the batch job classified as being in an error phase.

When a processing job is in a warning phase, its state in the Jobs list is colour-coded yellow. When a job is in an error phase, its state is colour-coded red.
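
The relationship between the phases and the colours can be summarised with a small Python sketch. The names Phase, classify and jobs_list_colour are hypothetical and chosen only for illustration. The colour for a job that has not failed is left as None because it is not described here.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    OK = "ok"            # no failure so far
    WARNING = "warning"  # a failure occurred, re-run attempts are still in progress
    ERROR = "error"      # every re-run attempt has failed


def classify(failed: bool, retries_remaining: bool) -> Phase:
    """Map the job's situation to a phase, following the description above."""
    if not failed:
        return Phase.OK
    return Phase.WARNING if retries_remaining else Phase.ERROR


def jobs_list_colour(phase: Phase) -> Optional[str]:
    """Colour used for the job's state in the Jobs list (None where not documented)."""
    return {Phase.WARNING: "yellow", Phase.ERROR: "red"}.get(phase)
```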

The danger of using the Process rest of job error handling setting too early can be illustrated by considering a distributed processing configuration with two render nodes, one of which is unable to process jobs because of a setup issue, for example because bad graphics drivers have been installed. Since jobs will still be processed successfully, you may be lulled into a false sense of security that everything is OK, whereas in fact only one of the two render nodes is being used and the retry strategy is working around the issue.
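
A toy simulation makes the two-node scenario concrete. It is a deliberately simplified sketch: the function simulate_two_node_farm and the node names are hypothetical, jobs are assumed to be handed out to nodes in rotation, and the four recovery attempts are collapsed into "try the assigned node, then try every node in turn".

```python
from collections import Counter
from typing import Sequence, Set, Tuple


def simulate_two_node_farm(
    job_count: int, nodes: Sequence[str], broken: Set[str]
) -> Tuple[Counter, Counter]:
    """Return (jobs completed per node, failed runs per node)."""
    completed_by: Counter = Counter()
    failed_runs: Counter = Counter()
    for i in range(job_count):
        assigned = nodes[i % len(nodes)]  # assumed rotation between nodes
        # Try the assigned node first, then fall back to each node in turn.
        for node in [assigned, *nodes]:
            if node in broken:
                failed_runs[node] += 1
                continue
            completed_by[node] += 1
            break
    return completed_by, failed_runs


completed, failed = simulate_two_node_farm(4, ["node-1", "node-2"], {"node-2"})
# Every job completes, but only node-1 ever does any work.
print(dict(completed))  # {'node-1': 4}
print(dict(failed))     # {'node-2': 2}
```

All four jobs finish, so nothing looks wrong from the outside, yet the per-node counts show that the second node contributed nothing.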