Prune User Processes on PBS/Torque Batch Nodes

All files can be downloaded from this directory.
prune_userprocs - remove stray processes after a batch job completes

Description:

The script removes stray user processes that are left behind on a worker node after a batch job completes, whether inadvertently or deliberately (for example, to escape accounting). It also works on multi-job systems, where several jobs from the same user may run concurrently on the same node.

The script looks up the session ID (sid) of the initial process that pbs_mom started for each job (using momctl -d 0). It then traces the process tree below that process by matching each process's ppid against known pids. User processes with uid > 99 that were not started by ssh* and are not in the process tree of any current batch job are then killed. In other words, every user process that is a (grand)child of init, instead of a (grand)child of a valid pbs_mom job, is killed.
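The selection logic described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the actual prune_userprocs implementation: it reads the process table from the Linux /proc filesystem, assumes the job session IDs have already been extracted from the momctl -d 0 output (that parsing is not shown), and interprets "not started by ssh*" as sparing every process whose command name starts with "ssh" together with all of its descendants. The names Proc, read_proc_table, descendants, select_strays and kill_pids are hypothetical.

```python
import os
import signal
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Proc:
    pid: int
    ppid: int
    uid: int   # real UID
    comm: str  # short command name, as in the "Name:" line of /proc/<pid>/status


def read_proc_table(proc_root: str = "/proc") -> List[Proc]:
    """Snapshot pid, ppid, uid and command name of every process (Linux only)."""
    procs: List[Proc] = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fields: Dict[str, str] = {}
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
            procs.append(Proc(pid=int(entry),
                              ppid=int(fields["PPid"]),
                              uid=int(fields["Uid"].split()[0]),
                              comm=fields["Name"]))
        except (OSError, KeyError, ValueError):
            continue  # the process exited (or was unreadable) while scanning
    return procs


def descendants(roots: Iterable[int], procs: List[Proc]) -> Set[int]:
    """Return the roots plus every process reachable from them via ppid links."""
    children: Dict[int, List[int]] = {}
    for p in procs:
        children.setdefault(p.ppid, []).append(p.pid)
    seen: Set[int] = set()
    stack = list(roots)
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        stack.extend(children.get(pid, []))
    return seen


def select_strays(procs: List[Proc], job_sids: Iterable[int],
                  min_uid: int = 100) -> List[int]:
    """Return pids of user processes outside every job tree and ssh* tree."""
    # A session leader's pid equals its sid, so the job sids serve as tree roots.
    protected = descendants(job_sids, procs)
    # Assumed reading of "not started by ssh*": spare ssh* processes and their children.
    protected |= descendants((p.pid for p in procs if p.comm.startswith("ssh")), procs)
    # uid > 99 is expressed as uid >= min_uid with min_uid = 100.
    return sorted(p.pid for p in procs
                  if p.uid >= min_uid and p.pid not in protected)


def kill_pids(pids: Iterable[int], sig: int = signal.SIGKILL) -> None:
    """Send sig to each pid, ignoring processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

On a node the pieces would be combined as kill_pids(select_strays(read_proc_table(), job_sids)); printing the selected pids before killing anything is a safer way to check the result. Because the table is a single snapshot, processes forked after it is read are not considered, which is one way a fork bomb can slip past a single pass.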

Notes:
- The script does not reliably protect the machine against fork bombs: a race condition in the current version may let them survive the pruning.
- If the farm runs MPI tasks, invoke the script from the epilogue.parallel script instead of the plain epilogue script.
- The script is normally invoked as: prune_userprocs -a -k 9
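Following the notes above, a minimal hook could be installed as the Torque epilogue (or as epilogue.parallel on farms running MPI tasks). Torque epilogues are usually shell scripts; Python is used here only for illustration. The install path of prune_userprocs and the helper names build_command and run_prune are hypothetical, and the meaning of the -a and -k 9 flags is whatever prune_userprocs defines; the sketch only passes them through.

```python
#!/usr/bin/env python3
"""Hypothetical epilogue hook that runs prune_userprocs after a job ends."""
import subprocess
import sys
from typing import List

# Assumed install location of prune_userprocs; adjust for your nodes.
PRUNE_PATH = "/usr/local/sbin/prune_userprocs"


def build_command(path: str = PRUNE_PATH) -> List[str]:
    """Command line using the flags from the documented normal invocation."""
    return [path, "-a", "-k", "9"]


def run_prune(path: str = PRUNE_PATH) -> int:
    """Run prune_userprocs and return its exit status (1 if it cannot start)."""
    try:
        return subprocess.run(build_command(path), check=False).returncode
    except OSError as exc:
        print(f"epilogue: cannot run {path}: {exc}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    # A failure to start is logged to stderr; this sketch itself always
    # exits with status 0. Choose a different policy if your site needs one.
    run_prune()
```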

More system utilities can be found on the NDPF Wiki pages and in the parent directory.