gLExec Operating System Interoperability
Batch systems in the presence of gLExec
gLExec tries hard to be neutral to its OS environment. In particular, gLExec does not break the process tree, and it accumulates the CPU and system usage times of the child processes it spawns. We recognise that this is particularly important in the gLExec-on-WN (worker node) scenario, where the entire process tree (pilot job and target user processes) should be managed as a whole by the node-local batch system daemon.
We have verified that, on the Torque batch system, running a child
process under a different uid does not prevent pbs_mom from killing
any and all of the child processes. We tested this with Torque
version 2.1.6; to verify this with your own batch system you can use
the steps below.
The simple sUTest program used in these steps does exactly the same
uid change that gLExec does, but does not require that you install
anything grid-like on your site. It is a completely stand-alone program
that does a uid change, so you can test how your batch system reacts.
- Download the code for the sUTest program, which you will compile on
your own system: sutest.c. We deliberately do not provide
a pre-compiled binary, as you need to set three pre-defined
constants in the source that are site-specific:
#define UNOBODY 99
#define GNOBODY 99
#define SRCUID 502
These specify the uid and gid numbers to switch to, as well as the numeric uid of the user account you will use for testing (i.e. the account that will run the batch system qsub). This must be a trusted uid, as that user will have effective super-user privileges at any time.
- Compile this program:
cc -o sutest sutest.c
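The numeric values for the three constants can be looked up with the standard id utility. The following is a minimal sketch, not part of sUTest: print_defines is a hypothetical helper name, and "nobody" and the current user are only placeholder accounts, so substitute the target account and test account used at your site.

```shell
# print_defines TARGET_ACCOUNT TEST_ACCOUNT
# Prints #define lines with the numeric ids of the given accounts.
# (Hypothetical helper; the account names below are assumptions.)
print_defines() {
    echo "#define UNOBODY $(id -u "$1")"   # uid to switch to
    echo "#define GNOBODY $(id -g "$1")"   # primary gid to switch to
    echo "#define SRCUID $(id -u "$2")"    # uid of the submitting test account
}

# Example: switch to "nobody", submit as the current user.
print_defines nobody "$(id -un)"
```

Compare the printed lines with the values in sutest.c and edit the source if they differ, then recompile.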
- Copy this program to a worker node, and make it setuid root:
cp sutest /usr/local/bin/
chown root:root /usr/local/bin/sutest
chmod u+s /usr/local/bin/sutest
- Submit a batch job that takes a while:
echo "/usr/local/bin/sutest sleep 600" | qsub
(Make sure your batch job goes to the worker node with the setuid sutest program; please refer to your batch system manual for details.)
- Check on the worker node to see whether the sleep process is running, and under which account (usually the nobody user).
- As soon as the job starts, kill it (in PBS/Torque that would be a "qdel"), and look on the worker node to see whether the sleep process goes away. It should.
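One way to do the check on the worker node is with ps. This is a sketch only: show_procs is a hypothetical helper name, and the pattern 'sleep 600' assumes the job was submitted exactly as shown above.

```shell
# show_procs PATTERN
# Lists owner, pid, parent pid and command line of every process whose
# ps line matches PATTERN, preceded by the ps header line.
# (Hypothetical helper; the awk process itself is filtered out.)
show_procs() {
    ps -eo user,pid,ppid,args |
        awk -v pat="$1" 'NR == 1 || ($0 ~ pat && $4 != "awk")'
}

# Run on the worker node while the job is running, and again after the qdel.
show_procs 'sleep 600'
```

While the job runs, the sleep line should show the account that sutest switched to in the first column. After the job has been deleted, only the header line should remain.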
- Remove the setuid bit from the sutest executable on the WN:
chmod 0755 /usr/local/bin/sutest
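To confirm that the setuid bit is really gone, the standard -u file test can be used. A minimal sketch, assuming the install path used above; has_setuid is a hypothetical helper name.

```shell
# has_setuid FILE
# Succeeds if FILE exists and has its setuid bit set.
has_setuid() {
    [ -u "$1" ]
}

if has_setuid /usr/local/bin/sutest; then
    echo "setuid bit is still set"
else
    echo "setuid bit is not set (or the file is absent)"
fi
```

The same check can be run right after the chmod u+s step, where it should report that the bit is set.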