Tools for using XSEDE resources from OSG.
These tools are meant to help you acquire and manage XSEDE resources so that you can use them via OSG. The OSG uses HTCondor as its workload management (batch) system. The detailed instructions further below will assume that you're accessing the OSG via OSG Connect.
In this toolset, XSEDE resources are grouped into "annexes", each with a name; an annex may contain one or more machines from one or more XSEDE resources or sites; or one annex may be one part of a resource and another example some other part of the same resource. For example, TACC's Stampede2 supercomputer has both Knight's Landing and Xeon processors, which you may wish to manage independently.
Run osg_annex_status annex-name
to get information about how busy annex-name is.
You may append arguments for condor_status (see the man page) after the annex-name.
Run osg_annex_off annex-name
to turn off annex-name. This will release the corresponding resource,
preserving your allocation. Annexes turn themselves off after they've been idle for three hours.
You may append arguments for condor_off (see the man page) after the annex-name.
Only works for Stampede2 right now
Run osg_annex_generate
to create a SLURM script which annexes (part of) an XSEDE resource. (For now,
you'll have to copy the script to Stampede2 (and perform a few other steps) manually.) Each time you
submit the script there, it will add more resources to the same annex. By default, the max runtime of
the SLURM job is one hour. Use the --duration
command-line argument to specify, as hh:mm:ss,
your preferred duration.
The default queue (partition) is normal
; use the command-line option --queue
to change it. The
default name for the annex (user-name@Stampede2-normal) includes the name of the user who ran
osg_annex_generate
and the queue name; to specify annex's name, use the --name
option. Likewise,
the default name for the generated file (stampede2-normal.slurm) includes the queue name; use the
--target
option to change it.
By default, the SLURM script requests one CPU for one task on one node. Specify --nodes
,
--ntasks
, and/or --cpus-per-task
to change the corresponding numbers. Presently, these numbers aren't passed along to HTCondor, so it will over-report available resources.
These instructions assume you have an OSG Connect account and an XSEDE User Portal account.
You will need to have contacted an OSG administrator and obtained a token to allow your annexes
to join the OSG pool. The following instructions assume that the token is in the file
~/token.txt
at OSG Connect.
The following instructions assume that your user name at the XSEDE User Portal is xsede-sicilian and osg-sicilian at OSG Connect.
- Open two terminal windows.
- Log in to your OSG Connect account in one of them.
- In that terminal,
- get a copy of the tools:
git clone https://github.com/Todd-L-Miller/osg_annex_tools.git
cd osg_annex_tools
- generate the SLURM script:
osg_annex_generate
- copy it and the token to XSEDE:
scp stampede2-normal.slurm ~/token.txt [email protected]:~
- get a copy of the tools:
- Log in to the XSEDE single-sign-on portal (login.xsede.org) in the other.
- In that terminal,
- copy the SLURM script and the token to Stampede2:
gsiscp stampede2-normal.slurm token.txt stampede2:~
- log into Stampede2:
gsissh stampede2
- create a directory that the SLURM script will need:
mkdir ~/osgvo-pilot
- copy the SLURM script and the token to Stampede2:
(These instructions assume you have two terminal windows open, one logged into OSG Connect, and the other logged into Stampede2; if you just finished the first-time set-up instructions, just keep going.)
- In the login.xsede.org terminal,
-
submit the SLURM script:
sbatch ./stampede2-normal.slurm
.
This starts the annexation process. You should see a line likeSubmitted batch job 5438235
at the end of the output.
-
Optionally, you can run
squeue -j 5438235
(replacing 5438235 with your batch job number) to see if the job has started yet; it might take a while. -
Optionally, once it has, you can run watch the HTCondor logs to see what's happening:
tail --follow osgvo-pilot/5438235.out
(replacing 5438235 with your batch job number)
-
- In the OSG Connect terminal,
- run
osg_annex_status osg-sicilian@Stampede2-normal
to see if your resources have joined the pool; it might take a while. If you elected to watch the HTCondor logs, you're waiting for a line like03/27/20 01:48:17 Setting ready state 'Ready' for STARTD
- run
It might take a while for HTCondor to schedule jobs to run on your annex. (By default, the annex is
configured to run only your jobs.) If your jobs won't run on the other resources available to OSG, you
can require that they run on a particular annex by using the requirements command in the submit file,
specifically: requirements = AnnexName == "osg-sicilian@Stampede2-normal"
.
- In the OSG Connect terminal,
- run
osg_annex_off osg-sicilian@stampede2-normal
.
- run