Using the Condor Batch System with no Shared Filesystem to Run Jobs from the Globus Jobmanager

Dan Bradley, University of Wisconsin
October, 2005

This is a very simple description of how you can have jobs from a Globus gatekeeper run on a Condor batch system that has no shared filesystem between the gatekeeper node and the execution nodes. For example, you might have Condor configured to flock jobs from one Condor pool to another without a common filesystem between the two.

However, please note that this recipe does not provide a general method for handling all of the I/O that a job may want to do. It only takes care of the standard I/O files (stdin, stdout, stderr) and the grid proxy. To be more accurate, this recipe is about avoiding the need for the GASS cache to be on a shared filesystem accessible from the execute node. Other files accessed by the job still need to be made accessible to it in some way. There is no generic way to know from the job description what those files are, so the Condor file-transfer mechanisms alone cannot solve that problem.

To follow this recipe, you need to modify the Globus jobmanager for Condor. It is a Perl script, typically located in globus/lib/perl/Globus/GRAM/JobManager/condor.pm. Here is a patch showing the changes that you can make:


*** condor.pm.orig	2005-07-19 13:28:18.000000000 -0500
--- condor.pm	2005-08-16 15:14:13.000000000 -0500
***************
*** 248,251 ****
--- 248,268 ----
      }
  
+     #Additions by Dan
+     #Turn on file-transfer mode
+     print SCRIPT_FILE "ShouldTransferFiles = true\n";
+     print SCRIPT_FILE "WhenToTransferOutput = ON_EXIT\n";
+ 
+     #Remove the X509_USER_PROXY setting from the environment, so
+     #the Condor jobmanager automatically sets it to the correct
+     #path in the remote scratch directory.
+     $environment_string =~ s|X509_USER_PROXY=([^;]*);|;|g;
+     print SCRIPT_FILE "X509UserProxy = \$ENV(X509_USER_PROXY)\n";
+     #End of additions by Dan
+ 
      print SCRIPT_FILE "Environment = $environment_string\n";
      print SCRIPT_FILE "Arguments = $argument_string\n";
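
If it is not obvious what the X509_USER_PROXY substitution does, here is a small stand-alone Perl sketch of that same regular expression applied to a made-up environment string. The jobmanager builds this string in the old "name=value;" Environment syntax; the variable names and paths below are only examples, not taken from a real condor.pm:

#!/usr/bin/perl -w
use strict;

#A made-up environment string in the old "name=value;" syntax;
#only the X509_USER_PROXY entry matters here.
my $environment_string =
    "GLOBUS_LOCATION=/usr/local/osg;X509_USER_PROXY=/home/someuser/.globus/job/gatekeeper/12345/x509_up;HOME=/home/someuser;";

#The same substitution used in the patch above: strip the
#gatekeeper-local proxy path so Condor's own X509UserProxy
#handling can point the job at the transferred proxy instead.
$environment_string =~ s|X509_USER_PROXY=([^;]*);|;|g;

#Prints: GLOBUS_LOCATION=/usr/local/osg;;HOME=/home/someuser;
print "$environment_string\n";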

NOTE: if the job's working directory is on a shared filesystem accessible to the execute machine, you can add one more line so that the job runs in its correct working directory rather than in a scratch directory on the execute machine. Here is the line, which goes alongside the other additions above:

print SCRIPT_FILE "remote_initialdir = " . $description->directory() . "\n";

If you want some sort of shared filesystem for working files, but NFS is not an option because the execute machine is in a different administrative domain, one option is to use AFS. The changes above are still worthwhile, since they avoid the need to put the GASS cache on AFS, but AFS can still be useful for other working files that the application needs to access.

If you know more about the file I/O needs of your jobs, you may be able to add lines to condor.pm that transfer other files for the job (see the sketch after the example below). You may also want to tweak things like the environment. For example, we have jobs that expect to find GLOBUS_LOCATION in the environment, and we make the grid software available to the worker nodes via AFS rather than keeping a local copy on each machine. Since the gatekeeper does have a local copy of the grid software, the environment points to that copy by default. Here is an example of how to fix it in condor.pm:

#Worker nodes do not have a local installation of grid software,
#so point any grid sw paths to an installation in AFS.
$environment_string =~ s|/usr/local/osg|/afs/hep.wisc.edu/cms/sw/osg|g;
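
Along the same lines, if you know in advance which extra input files your jobs need, you can ask Condor to transfer them as well, using the transfer_input_files submit command. This is only a sketch: the file names are hypothetical, and the lines belong next to the other print SCRIPT_FILE additions in condor.pm:

#Hypothetical list of extra input files that every job at this site needs.
my @extra_input_files = ("analysis.cfg", "calibration.dat");
if (@extra_input_files) {
    print SCRIPT_FILE "transfer_input_files = "
        . join(",", @extra_input_files) . "\n";
}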

Software versions which are known to work with this recipe:

Known bugs

As of this writing (Condor 6.7.12), Condor does not refresh the grid proxy, so whatever proxy is sent over with the original job is never updated during the lifespan of the job. This will be fixed in a future version of Condor, but I don't yet know which one.