Copy all run and out dirs at once, not in for-loop #3025

Merged: 55 commits, Jan 21, 2023
Changes from 52 commits
Commits
1d719d5
move remote copy outside for-loop
Aariq Aug 31, 2022
f3b79bf
put back some mkdir and copy steps
Aariq Aug 31, 2022
ed1d334
go up a directory to copy to outdir/out not outdir/out/out
Aariq Aug 31, 2022
3795f0c
update changelog
Aariq Aug 31, 2022
693e981
only print unique jobids
Aariq Aug 31, 2022
41df050
move copying files back to outside of for loop.
Aariq Aug 31, 2022
5a8f90c
Merged upstream/develop into remote-copy
Aariq Aug 31, 2022
132997f
comments
Aariq Aug 31, 2022
e57dd5d
more todos
Aariq Aug 31, 2022
5678889
Merge branch 'develop' into remote-copy
dlebauer Sep 6, 2022
4286732
added Sys.sleep() b/c settings$host$outdir still not getting copied o…
Aariq Sep 7, 2022
c8b9240
print informative errors
Aariq Sep 7, 2022
7427d16
remove sys.sleep--shouldn't be necessary
Aariq Sep 7, 2022
abbc021
remove non-ascii chars
Aariq Sep 7, 2022
67905b8
Merge branch 'develop' into remote-copy
Aariq Sep 13, 2022
61b50c0
Merge branch 'develop' into remote-copy
Aariq Sep 13, 2022
f45f38e
Merge branch 'develop' into remote-copy
Aariq Sep 15, 2022
b8331a4
add specific comments
Aariq Sep 15, 2022
b73c2f5
copy over just log files
Aariq Sep 15, 2022
f6ba486
comment out custom errors---not quite working. Save for another PR
Aariq Sep 15, 2022
d02858e
fix rsync syntax error
Aariq Sep 15, 2022
17346d9
correct filepath to logs
Aariq Sep 15, 2022
a0baba2
Merged upstream/develop into remote-copy
Aariq Sep 15, 2022
01ac887
Merge branch 'develop' into remote-copy
Aariq Sep 19, 2022
78db328
fix rsync --exclude flag
Aariq Sep 19, 2022
714a698
Merged upstream/remote-copy into remote-copy
Aariq Sep 19, 2022
4bfb1c3
system2 already uses shQuote() internally.
Aariq Sep 19, 2022
3b8a52c
enable informative errors
Aariq Sep 19, 2022
725b991
Merge branch 'develop' into remote-copy
Aariq Sep 26, 2022
dc4a577
Merge branch 'develop' into remote-copy
Aariq Sep 29, 2022
e59188d
copy over only ensemble directories
Aariq Oct 4, 2022
44a1b0b
no need to mkdir, rsync does this
Aariq Oct 4, 2022
a02c00b
try adding a pause??
Aariq Oct 4, 2022
f397c40
Merged upstream/develop into remote-copy
Aariq Oct 13, 2022
b64eb80
Merge branch 'develop' into remote-copy
Aariq Oct 21, 2022
3380b97
Merge branch 'develop' into remote-copy
Aariq Oct 27, 2022
3d4ab9f
Merge branch 'develop' into remote-copy
Aariq Nov 1, 2022
6fa5063
Merge branch 'develop' into remote-copy
Aariq Nov 16, 2022
0d20cc1
copy run dirs first, then out
Aariq Nov 16, 2022
d4b3df2
make rsync errors logger level severe
Aariq Nov 16, 2022
ff1e4af
whoops, fixed prev commit
Aariq Nov 16, 2022
1afd382
wrap initial rsync steps in retry.func()
Aariq Nov 30, 2022
f6f1e34
add sleep to retry
Aariq Nov 30, 2022
93f7155
add more retry
Aariq Dec 1, 2022
780b9f1
document() totally unrelated package
Aariq Dec 12, 2022
76d825b
Merge branch 'develop' into remote-copy
dlebauer Dec 19, 2022
f7d1d0f
Merge branch 'develop' into remote-copy
Aariq Jan 3, 2023
e4289c3
Merge branch 'develop' into remote-copy
Aariq Jan 10, 2023
9141d3b
Merge branch 'develop' into remote-copy
Aariq Jan 10, 2023
c8cda6e
Merge branch 'develop' into remote-copy
Aariq Jan 10, 2023
61dd672
Merge branch 'develop' into remote-copy
Aariq Jan 11, 2023
2b1a3ed
Merge branch 'develop' into remote-copy
robkooper Jan 19, 2023
35eb512
add shQuote(args) back
Aariq Jan 20, 2023
bdf2095
remove rsync status messages
Aariq Jan 20, 2023
aeb0bc0
Merge branch 'develop' into remote-copy
infotroph Jan 21, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -55,6 +55,7 @@ convert data for a single PFT fixed (#1329, #2974, #2981)
Note that both `units` and `udunits2` interface with the same underlying
compiled code, so the `udunits2` *system library* is still required.
(#2989; @nanu1605)
- Occasionally some run directories were not getting copied over to remote hosts. This should be fixed now (#3025)
- Fixed a bug with ED2 where ED2IN tags supplied in `settings` that were not in the ED2IN template file were not getting added to ED2IN config files (#3034, #3033)
- Fixed a bug where warnings were printed for file paths on remote servers even when they did exist (#3020)
- Fixed bug in model2netcdf.SIPNET that caused LE to be overestimated 10^3 (#3036)
34 changes: 33 additions & 1 deletion base/remote/R/remote.copy.from.R
@@ -45,5 +45,37 @@ remote.copy.from <- function(host, src, dst, options = NULL, delete = FALSE, std
}
}
PEcAn.logger::logger.debug("rsync", shQuote(args))
system2("rsync", shQuote(args), stdout = TRUE, stderr = as.logical(stderr))
out <-
system2("rsync", args, stdout = "", stderr = as.logical(stderr))
Member

Is removing the `shQuote()` around `args` intentional?

Collaborator Author

Yes, it was intentional, but now that you bring it up I'm not sure it's correct. system2() always shQuotes command, but I guess it doesn't do that for args. I'll add it back and make sure everything still works.
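
To make the quoting question concrete, here is a minimal sketch, assuming a POSIX shell (hence the explicit `type = "sh"`). The R documentation for `system2()` states that `command` is always quoted with `shQuote()`, while `args` are pasted onto the command line as given, so an argument containing a glob or a space has to be quoted by the caller. The argument values below are illustrative only and are not taken from the PR.

```r
# Example rsync-style arguments; the values are made up for illustration.
args <- c("-a", "--exclude=*.h5", "/path with space/out")

# shQuote() wraps each element in single quotes so that the shell passes it
# through literally (no glob expansion, no word splitting).
quoted <- shQuote(args, type = "sh")
print(quoted)

# The command lines a shell would see in each case.
unquoted_cmd <- paste("rsync", paste(args, collapse = " "))
quoted_cmd <- paste("rsync", paste(quoted, collapse = " "))
cat(unquoted_cmd, quoted_cmd, sep = "\n")
```

One consequence worth checking when `shQuote(args)` is applied: an option that already carries its own quotes would be quoted a second time, so the inner quotes would reach the program as literal characters.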


# Informative errors from rsync man page
Member

I appreciate the desire for clarity, but I'd rather send people to the rsync docs for these error messages. We use whatever rsync version is installed on the host system, so hard-coding values here feels like an invitation for them to get out of sync.

I assume rsync doesn't often change the meaning of exit codes between versions, but if they add a new code that's not on this list then this switch will return an empty string instead of the numeric value.
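
The concern about unlisted exit codes can be sketched as follows. In base R, `switch()` on a character value with no matching name and no default returns `NULL` invisibly, which `paste0()` then drops, so the log message ends up with no status at all. One possible mitigation, shown here with a hypothetical helper `describe_rsync_status()` that is not part of the PR, is an unnamed final argument that acts as the default and reports the numeric code.

```r
# Hypothetical helper: map a few known rsync exit codes to text and fall
# back to the raw code for anything not listed.
describe_rsync_status <- function(status) {
  switch(
    as.character(status),
    "0" = "Success",
    "23" = "Partial transfer due to error",
    # An unnamed last argument is the default branch of switch().
    paste("exit code", status)
  )
}

# Without a default, an unknown code yields NULL and an empty message.
no_default <- switch("99", "0" = "Success")
empty_msg <- paste0("rsync status: ", no_default)

known <- describe_rsync_status(0)
unknown <- describe_rsync_status(99)
```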

Collaborator Author

Yeah, this makes sense. I'm going to keep errors (anything other than 0) as logger.severe() though.
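
Treating "anything other than 0" as an error relies on `system2()` returning the exit status, which it does only when neither `stdout` nor `stderr` is `TRUE`. With `stdout = TRUE` the return value is the captured output, and a non-zero status is reported through a warning and a `"status"` attribute. The sketch below assumes a Unix-alike and uses the running R installation's `Rscript` as a stand-in for `rsync`, purely so the example does not depend on `rsync` being installed.

```r
# Stand-in command that exits with status 3 (an assumption for the demo,
# not something the PR does).
rscript <- file.path(R.home("bin"), "Rscript")
args <- c("-e", shQuote('quit(save = "no", status = 3)'))

# stdout = "" sends output to the console, so the return value is the
# exit status of the command.
status <- system2(rscript, args, stdout = "", stderr = "")

if (status != 0) {
  # The PR logs at "severe" level here; message() keeps the sketch
  # free of package dependencies.
  message("command failed with exit status ", status)
}
```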

msg <-
switch(
as.character(out),
'0' = "Success",
'1' = "Syntax or usage error",
'2' = "Protocol incompatibility",
'3' = "Errors selecting input/output files, dirs",
'4' = "Requested action not supported",
'5' = "Error starting client-server protocol",
'6' = "Daemon unable to append to log-file",
'10' = "Error in socket I/O",
'11' = "Error in file I/O",
'12' = "Error in rsync protocol data stream",
'13' = "Errors with program diagnostics",
'14' = "Error in IPC code",
'20' = "Received SIGUSR1 or SIGINT",
'21' = "Some error returned by waitpid()",
'22' = "Error allocating core memory buffers",
'23' = "Partial transfer due to error",
'24' = "Partial transfer due to vanished source files",
'25' = "The --max-delete limit stopped deletions",
'30' = "Timeout in data send/receive",
'35' = "Timeout waiting for daemon connection"
)
if (out != 0) {
PEcAn.logger::logger.severe(paste0("rsync status: ", msg))
} else {
PEcAn.logger::logger.info(paste0("rsync status: ", msg))
}
} # remote.copy.from
34 changes: 33 additions & 1 deletion base/remote/R/remote.copy.to.R
@@ -47,5 +47,37 @@ remote.copy.to <- function(host, src, dst, options = NULL, delete = FALSE, stder
}
}
PEcAn.logger::logger.debug("rsync", shQuote(args))
system2("rsync", shQuote(args), stdout = TRUE, stderr = as.logical(stderr))
out <-
system2("rsync", args, stdout = "", stderr = as.logical(stderr))

# Informative errors from rsync man page
msg <-
switch(
as.character(out),
'0' = "Success",
'1' = "Syntax or usage error",
'2' = "Protocol incompatibility",
'3' = "Errors selecting input/output files, dirs",
'4' = "Requested action not supported",
'5' = "Error starting client-server protocol",
'6' = "Daemon unable to append to log-file",
'10' = "Error in socket I/O",
'11' = "Error in file I/O",
'12' = "Error in rsync protocol data stream",
'13' = "Errors with program diagnostics",
'14' = "Error in IPC code",
'20' = "Received SIGUSR1 or SIGINT",
'21' = "Some error returned by waitpid()",
'22' = "Error allocating core memory buffers",
'23' = "Partial transfer due to error",
'24' = "Partial transfer due to vanished source files",
'25' = "The --max-delete limit stopped deletions",
'30' = "Timeout in data send/receive",
'35' = "Timeout waiting for daemon connection"
)
if (out != 0) {
PEcAn.logger::logger.severe(paste0("rsync status: ", msg))
} else {
PEcAn.logger::logger.info(paste0("rsync status: ", msg))
}
} # remote.copy.to
112 changes: 73 additions & 39 deletions base/workflow/R/start_model_runs.R
@@ -79,29 +79,40 @@ start_model_runs <- function(settings, write = TRUE, stop.on.error = TRUE) {
jobfile <- NULL
firstrun <- NULL

#Copy all run directories over if not local
if (!is_local) {
# copy over run directories
PEcAn.utils::retry.func(
PEcAn.remote::remote.copy.to(
host = settings$host,
src = settings$rundir,
dst = dirname(settings$host$rundir),
delete = TRUE
),
sleep = 2
)

# copy over out directories
PEcAn.utils::retry.func(
PEcAn.remote::remote.copy.to(
host = settings$host,
src = settings$modeloutdir,
dst = dirname(settings$host$outdir),
#include all directories, exclude all files
options = c("--include='*/'", "--exclude='*'"),
delete = TRUE
),
sleep = 2
)
}
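
The use of `dirname()` for `dst` in the two calls above matches the commit message "go up a directory to copy to outdir/out not outdir/out/out". A minimal sketch of the path arithmetic, assuming the source directory is handed to rsync without a trailing slash (in which case rsync creates that directory inside the destination). The paths are hypothetical; real values come from the PEcAn `settings` object.

```r
# Hypothetical local and remote output directories.
local_out <- "/data/workflow/out"
remote_out <- "/scratch/user/workflow/out"

# Destination = remote_out: the copied directory lands one level too deep.
nested <- file.path(remote_out, basename(local_out))

# Destination = parent of remote_out: the copy lands where intended.
flat <- file.path(dirname(remote_out), basename(local_out))

print(c(nested = nested, flat = flat))
```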

# launch each of the jobs
for (run in run_list) {
run_id_string <- format(run, scientific = FALSE)
# write start time to database
PEcAn.DB::stamp_started(con = dbcon, run = run)

# if running on a remote cluster, create folders and copy any data
# to remote host
if (!is_local) {
PEcAn.remote::remote.execute.cmd(
host = settings$host,
cmd = "mkdir",
args = c(
"-p",
file.path(settings$host$outdir, run_id_string)))
PEcAn.remote::remote.copy.to(
host = settings$host,
src = file.path(settings$rundir, run_id_string),
dst = settings$host$rundir,
delete = TRUE)
}

# check to see if we use the model launcer
# check to see if we use the model launcher
if (is_rabbitmq) {
run_id_string <- format(run, scientific = FALSE)
folder <- file.path(settings$rundir, run_id_string)
@@ -159,10 +170,13 @@ start_model_runs <- function(settings, write = TRUE, stop.on.error = TRUE) {

if (!is_local) {
# copy data back to local
PEcAn.remote::remote.copy.from(
host = settings$host,
src = file.path(settings$host$outdir, run_id_string),
dst = settings$modeloutdir)
PEcAn.utils::retry.func(
PEcAn.remote::remote.copy.from(
host = settings$host,
src = file.path(settings$host$outdir, run_id_string),
dst = settings$modeloutdir),
sleep = 2
)
}

# write finished time to database
@@ -195,14 +209,17 @@ start_model_runs <- function(settings, write = TRUE, stop.on.error = TRUE) {
}

if (!is_local) {
for (run in run_list){
for (run in run_list){ #only re-copy run dirs that have launcher and job list
if (run %in% job_modellauncher) {
# copy launcher and joblist
PEcAn.remote::remote.copy.to(
host = settings$host,
src = file.path(settings$rundir, format(run, scientific = FALSE)),
dst = settings$host$rundir,
delete = TRUE)
PEcAn.utils::retry.func(
PEcAn.remote::remote.copy.to(
host = settings$host,
src = file.path(settings$rundir, format(run, scientific = FALSE)),
dst = settings$host$rundir,
delete = TRUE),
sleep = 2
)

}
}
@@ -251,18 +268,31 @@ start_model_runs <- function(settings, write = TRUE, stop.on.error = TRUE) {
if (length(jobids) > 0) {
PEcAn.logger::logger.debug(
"Waiting for the following jobs:",
unlist(jobids, use.names = FALSE))
unlist(unique(jobids)))
}

#TODO figure out a way to do this while for unique(jobids) instead of jobids
while (length(jobids) > 0) {
Sys.sleep(10)

if (!is_local) {
#Copy over log files to check progress
try(PEcAn.remote::remote.copy.from(
host = settings$host,
src = settings$host$outdir,
dst = dirname(settings$modeloutdir),
options = c('--exclude=*.h5')
))
}

for (run in names(jobids)) {
run_id_string <- format(run, scientific = FALSE)

# check to see if job is done
job_finished <- FALSE
if (is_rabbitmq) {
job_finished <- file.exists(file.path(jobids[run], "rabbitmq.out"))
job_finished <-
file.exists(file.path(settings$modeloutdir, run, "rabbitmq.out"))
} else if (is_qsub) {
job_finished <- PEcAn.remote::qsub_run_finished(
run = jobids[run],
@@ -271,18 +301,10 @@ start_model_runs <- function(settings, write = TRUE, stop.on.error = TRUE) {
}

if (job_finished) {

# Copy data back to local
if (!is_local) {
PEcAn.remote::remote.copy.from(
host = settings$host,
src = file.path(settings$host$outdir, run_id_string),
dst = settings$modeloutdir)
}


# TODO check output log
if (is_rabbitmq) {
data <- readLines(file.path(jobids[run], "rabbitmq.out"))
data <- readLines(file.path(settings$modeloutdir, run, "rabbitmq.out"))
if (data[-1] == "ERROR") {
msg <- paste("Run", run, "has an ERROR executing")
if (stop.on.error) {
@@ -294,6 +316,7 @@ start_model_runs <- function(settings, write = TRUE, stop.on.error = TRUE) {
}

# Write finish time to database
#TODO this repeats for every run in `jobids` writing every run's time stamp every time. This actually takes quite a long time with a lot of ensembles and should either 1) not be a for loop (no `for(x in run_list)`) or 2) if `is_modellauncher`, be done outside of the jobids for loop after all jobs are finished.
if (is_modellauncher) {
for (x in run_list) {
PEcAn.DB::stamp_finished(con = dbcon, run = x)
@@ -320,6 +343,17 @@ start_model_runs <- function(settings, write = TRUE, stop.on.error = TRUE) {
} # end loop over runs
} # end while loop checking runs

# Copy data back to local
if (!is_local) {
PEcAn.utils::retry.func(
PEcAn.remote::remote.copy.from(
host = settings$host,
src = settings$host$outdir,
dst = dirname(settings$modeloutdir)
),
sleep = 2
)
}
} # start_model_runs
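
Several copy steps in this diff are wrapped in `PEcAn.utils::retry.func(..., sleep = 2)`. The general retry-with-pause pattern can be sketched in base R as below. `retry_call()` and `flaky_copy()` are hypothetical names invented for this illustration; the sketch takes a function rather than an expression and is not the implementation of `retry.func()`.

```r
# Hypothetical helper: call `f` until it succeeds or `max_tries` attempts
# have been used, pausing `sleep` seconds between attempts.
retry_call <- function(f, max_tries = 5, sleep = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- try(f(), silent = TRUE)
    if (!inherits(result, "try-error")) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(sleep)
    }
  }
  stop("all ", max_tries, " attempts failed: ",
       conditionMessage(attr(result, "condition")))
}

# Stand-in for a flaky transfer that fails twice and then succeeds.
calls <- 0
flaky_copy <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated transfer failure")
  "copied"
}

outcome <- retry_call(flaky_copy, max_tries = 5, sleep = 0)
```

Retrying only helps if the wrapped call signals an error on failure, which is consistent with the decision in this PR to report non-zero rsync statuses at the "severe" logger level.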

