hcswif is a Python script for generating JSON files that describe a swif2 workflow to be run on JLab's ifarm.
The JSON output from swif2 export <workflow> is one long string, but the following command will display it in a pretty format:
swif2 export <workflow> | python -m json.tool > myswifjob_export.json
Visit these links for more details on swif2:
- https://scicomp.jlab.org/docs/swif2
- https://scicomp.jlab.org/cli/swif.html
- https://scicomp.jlab.org/cli/data.html
You'll need to modify a few files so that hcswif points to the correct hcana, raw data, etc.
The following is a list of files and variables that you may need to modify (a sketch with example values follows the list):
1) hcswif.py
- out_dir
- json_dir
- raw_dir
2) setup.sh
- hcana_dir
- hallc_replay_dir
- Version of /site/12gev_physics/production.sh
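As a rough sketch (the paths below are placeholders, not real locations), the setup.sh edits might look like the following; in hcswif.py you would similarly point out_dir, json_dir, and raw_dir at your own /farm_out, JSON output, and raw data locations:

# setup.sh -- illustrative values only; substitute your own paths
hcana_dir=/path/to/your/hcana
hallc_replay_dir=/path/to/your/hallc_replay
# also confirm that the version of /site/12gev_physics/production.sh sourced here
# matches the software environment your hcana was built against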
out_dir must be somewhere under your /farm_out/$USER/ directory. Files with high read/write activity should not be sent to /volatile/ or /work/. You must be on an ifarm machine to access the /farm_out/ directories. Failure to use the /farm_out/$USER directory will result in your jobs being cancelled, followed by an email from CST. If you do not have a /farm_out/ directory, make sure you have a Slurm account. If you have a Slurm account but no /farm_out/ directory, run srun hostname to generate one.
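For example, on an ifarm node you can check for the directory and, if it is missing but you already have a Slurm account, trigger its creation as described above:

ls -d /farm_out/$USER || srun hostname   # per the note above, srun hostname generates the directory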
You must have a symbolic link to hcana in your hallc_replay_dir in order for the setup.sh and setup.csh scripts to work.
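For example, assuming hcana was built under /path/to/hcana (a placeholder), the link could be created like this:

cd /path/to/your/hallc_replay
ln -s /path/to/hcana/hcana hcana   # symbolic link named hcana inside hallc_replay_dir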
Note that hcswif must be run with Python 3, not Python 2. On an ifarm computer, the python3 executable and its dependencies can be loaded with source /site/12gev_phys/softenv.sh 2.4; the softenv version number may increase as the farm gets upgraded. You can only run swif2 on the ifarm computers, and the /farm_out/ out_dir location is likewise only accessible from an ifarm computer.
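For instance, a typical session before submitting might look like this (2.4 stands in for whatever softenv version is current):

source /site/12gev_phys/softenv.sh 2.4   # load python3 and its dependencies
python3 --version                        # confirm a Python 3 interpreter is found
./hcswif.py -h                           # hcswif should now run and print its usage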
usage: hcswif.py [-h] [--mode MODE] [--spectrometer SPECTROMETER]
[--run RUN [RUN ...]] [--events EVENTS] [--name NAME]
[--replay REPLAY] [--command COMMAND] [--filelist FILELIST]
[--account ACCOUNT] [--disk DISK] [--ram RAM] [--cpu CPU]
[--time TIME] [--shell SHELL]
Some parameters are optional, and some are only relevant for one mode. Optional parameters are assigned default values. Parameters may be specified in any order on the command line.
Parameter | replay? | command? | Description |
---|---|---|---|
mode | - | - | Are we replaying runs or running a shell script? |
spectrometer | required | - | Which spectrometer? This selects the default replay script and which raw .dat files are used |
run | required | - | Either (1) a space-separated list of runs or ranges of runs, or (2) file runs.txt, where runs.txt contains one run number per line (see the example after this table) |
events | optional | - | Number of events to use. Default is all events (i.e. -1) |
replay | optional | - | Replay script to be used; path is relative to your hallc_replay directory. Defaults exist for each spectrometer. |
command | - | required | Full path location of the shell script to be run |
filelist | - | optional | List of files needed for your shell script |
name | optional | optional | Name of workflow. Default is hcswifXXXXXXXX, where the X's are a timestamp |
account | optional | optional | Account to which the compute time should be charged. |
disk | optional | optional | How much disk space do you need in bytes? Default = 10GB |
ram | optional | optional | How much RAM do you need in bytes? Default = 2.5GB |
cpu | optional | optional | How many cores? Default = 1 |
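When using the file form of --run, the run list file contains one run number per line (as described in the table above). A hypothetical runs.txt and the corresponding invocation might look like this:

$ cat runs.txt
2187
2188
2190
$ ./hcswif.py --mode replay --spectrometer SHMS_COIN --run file /path/to/runs.txt --name myswifjob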
This will replay 50k SHMS events for COIN runs 2187-2212, 2023-2066, and 2049 using the default script SCRIPTS/SHMS/PRODUCTION/replay_production_shms_coin.C
$ ./hcswif.py --mode replay --spectrometer SHMS_COIN --run 2187-2212 2023-2066 2049 --events 50000 --name myswifjob --account hallc
Wrote: /some/directory/myswifjob.json
$ swif2 import -file myswifjob.json
$ swif2 run myswifjob
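Once imported and running, the workflow's progress (including any of the problem states described in the error table below) can be checked with swif2 status; see the swif2 documentation linked above for the exact argument syntax. For example:

$ swif2 status myswifjob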
This will replay all HMS events (HMS_ALL) for the runs listed in a file some_runs.dat that contains one full path file location per line. The default HMS_ALL script SCRIPTS/HMS/PRODUCTION/replay_production_all_hms.C is used. The requested disk space has been increased since the file size may be large for an HMS_ALL replay.
$ ./hcswif.py --mode replay --spectrometer HMS_ALL --run file /home/runlists/some_runs.dat --name myswifjob --disk 15000000000
Wrote: /some/directory/myswifjob.json
$ swif2 import -file myswifjob.json
$ swif2 run myswifjob
This example will submit a job that runs myscript.sh, which presumably does something more complicated than a "regular" replay. It uses a filelist text file called "myfiles" that contains one full path file location per line. These files will be added to the 'input' list of the shell script's job. Note that instead of specifying a filelist, you may explicitly put appropriate jget calls in your shell script to read your raw data from tape.
$ ./hcswif.py --mode command --command /some/directory/myscript.sh arg1 arg2 --name myswifjob --account hallc --filelist myfiles
Wrote: /some/directory/myswifjob.json
$ swif2 import -file myswifjob.json
$ swif2 run myswifjob
hcswif will generate a workflow with one job per line in the text file myjobs.txt. Note that --command's first argument, file, tells hcswif that this is a file to read and not a shell script as in the previous example.
$ cat myjobs.txt
/some/where/myscript.sh arg1 arg2
/some/where/myscript.sh arg3 arg4
/some/where/myotherscript.sh
$ ./hcswif.py --mode command --command file /some/where/myjobs.txt --name test2
Wrote: /home/jmatter/hcswif/output/test2.json
If some parameters aren't specified (e.g. account, events, filelist), you will be warned and possibly asked if you want to use the default value.
$ ./hcswif.py --mode command --command /some/directory/myscript.sh --name myswifjob
./hcswif.py:160: UserWarning: No account specified.
warnings.warn('No account specified.')
Should I use account=hallc? (y/n): y
Wrote: /some/directory/myswifjob.json
These are descriptions of some errors you may see when you run swif2 status. Thank you to Hall D's wiki for this info.
ERROR | Description | Possible solution |
---|---|---|
SITE_PREP_FAIL | SWIF’s attempt to submit jobs to Slurm failed. Includes server-side problems as well as the user failing to provide valid job parameters (e.g. incorrect account name, too many resources, etc.) | If the requested resources are known to be correct, resubmit. Otherwise, modify job resources using swif2 modify-job. |
SLURM_FAILED | Slurm reports the job FAILED with no specific details. | Verify jobs are valid, then resubmit. This can happen if the /raw/ directory is not set appropriately in hallc_replay. |
SWIF_OUTPUT_FAIL | Failure to copy one or more output files. Can be due to permission problem, quota problem, system error, etc. | Check if output files will exist after job execution and that output directory exists, resubmit jobs. |
SWIF_INPUT_FAIL | SWIF failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape) | Verify input file exists, then resubmit jobs. |
SLURM_TIMEOUT | Job timed out. | If more time is needed for job, add more resources. Also check whether code is hanging. |
SLURM_OUT_OF_MEMORY | Not enough resources, either RAM or disk space. | Add more resources for job. |
SWIF_MISSING_OUTPUT | Output file specified by user was not found. | Check if output file exists at end of job. |
SWIF_USER_NON_ZERO | User script exited with non-zero status code. | Your script exited with non-zero status. Check the code you are running. |
SWIF_SYSTEM_ERROR | Job failed owing to a problem with swif2 (e.g. network connection timeout) | Resubmit jobs. |
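Many of these failures can simply be retried once the underlying cause is fixed (or, for transient system problems, retried as-is). Assuming the swif2 retry-jobs subcommand behaves as described in the swif2 documentation linked above (check there for the exact flags and problem names), something like the following resubmits every job that failed with a given problem type:

$ swif2 retry-jobs myswifjob -problems SWIF_SYSTEM_ERROR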
If you receive a SWIF_USER_NON_ZERO error, or a similar error indicating that the script failed after resources were allocated, look at the out_dir/*.err files to identify the problem. If the script never runs, try copying the command line from the output JSON file and running it on an interactive farm (ifarm) node. Ensure that all the output directories for hallc_replay exist, including ROOTfiles and REPORT_OUTPUT, and make sure these are symlinks to /volatile/ or /work/ or /cache/. Large files should not be saved to /group/ or /home/.
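A minimal post-mortem sketch along those lines, where OUT_DIR is a placeholder for whatever you set out_dir to in hcswif.py:

OUT_DIR=/farm_out/$USER/myswifjob        # placeholder: match out_dir in hcswif.py
ls $OUT_DIR/*.err                        # stderr from each job
grep -iE 'error|segmentation|abort' $OUT_DIR/*.err   # common failure signatures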