Cast string "array.id" to integer, makeClusterFunctionTorque with SSH #53

Open: wants to merge 58 commits into base: master

Conversation


@dagola dagola commented Oct 9, 2014

asInt expects a numeric input value, but array.id, as returned by Sys.getenv, is a string.
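
A minimal sketch of the cast in base R. The environment variable name `PBS_ARRAYID` is an assumption about what Torque exports for array jobs, and it is set by hand here only to simulate a running sub-job; this is not the code from the commit.

```r
# Simulate the scheduler exporting the array index (variable name is an assumption).
Sys.setenv(PBS_ARRAYID = "7")

# Sys.getenv() always returns a character vector, even if the value looks numeric.
array.id = Sys.getenv("PBS_ARRAYID")
stopifnot(is.character(array.id))

# Cast explicitly before handing the value to anything that expects a number.
array.id = as.integer(array.id)
if (is.na(array.id))
  stop("array id is not set or is not an integer")
```

An unset variable comes back as an empty string, which `as.integer` turns into `NA`, so the explicit `NA` check catches a script that is started outside of an array job.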

On a multi-node cluster site, only one node may be able to accept Torque
commands. If this node is accessible via SSH, BatchJobs can run on any
other node and tunnel the Torque commands to that node.

Damian Gola added 2 commits October 9, 2014 09:29
In a multi-node cluster site only one node may be able to accept Torque
commands. If this node is accessible via SSH, BatchJobs can run on any
other node tunnelling the Torque command to that node.
@dagola dagola changed the title Cast string "array.id" to integer Cast string "array.id" to integer, makeClusterFunctionTorque with SSH Oct 9, 2014
When array jobs are submitted, each sub-job gets array.id[] as its batch id,
but array.id[] does not appear in the qselect output. waitForJobs therefore
stops after 5 sleeps, because matching the internal batch ids against
listJobs returns an empty set.
@dagola dagola closed this Oct 9, 2014
@dagola dagola reopened this Oct 13, 2014
@mllg
Member

mllg commented Oct 14, 2014

Do you login on the master via ssh to call qselect?

@dagola
Author

dagola commented Oct 14, 2014

Yes, I do. Please see a17396e: in clusterFunctionsTorque all runOSCommandLinux calls are done with the ssh flag.

@mllg
Member

mllg commented Oct 14, 2014

I'm amazed that this works! We have to check this on some more systems, though.

@dagola
Author

dagola commented Oct 14, 2014

Why should it not work? runOSCommandLinux supports running shell commands through ssh "natively".

@dagola dagola closed this Oct 14, 2014
@dagola dagola reopened this Oct 14, 2014
@mllg
Member

mllg commented Oct 14, 2014

> Why should it not work? runOSCommandLinux supports running shell commands through ssh "natively".

I need to double check that exit codes are correctly forwarded and quoting is correct.
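
The two concerns named here can be sketched in a few lines of base R. `runViaShell` is a hypothetical helper, not BatchJobs' `runOSCommandLinux`, and `sh -c` is used as a local stand-in for `ssh <host>`, since both hand a single command string to a shell. The sketch assumes a Unix-alike system.

```r
# Hypothetical helper (not part of BatchJobs): run `cmd` with `args` through a
# launcher that passes one command string to a shell, e.g. "ssh <host>" or "sh -c".
runViaShell = function(launcher, launcher.args, cmd, args = character(0L)) {
  # Quote each argument for the shell that finally parses the command string.
  remote = paste(c(cmd, shQuote(args, type = "sh")), collapse = " ")
  # Quote the whole string once more for the local shell that system2() uses.
  out = suppressWarnings(system2(launcher,
    c(launcher.args, shQuote(remote, type = "sh")),
    stdout = TRUE, stderr = TRUE))
  # With stdout = TRUE, a non-zero exit code is reported in the "status" attribute.
  status = attr(out, "status")
  list(
    exit.code = if (is.null(status)) 0L else as.integer(status),
    output = as.character(out)
  )
}

# Quoting: the double space survives both shell layers.
res = runViaShell("sh", "-c", "echo", "two  words")

# Exit codes: the status of the inner command is what the caller sees.
fail = runViaShell("sh", "-c", "exit", "3")
```

With a real `ssh` launcher the same two checks apply, because `ssh` exits with the status of the remote command and reserves 255 for its own errors.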

@mllg
Member

mllg commented Oct 15, 2014

  1. runOSCommandLinux seems to behave well in my interactive tests.
  2. You must have a shared file system. I've tried sshfs and encountered several problems because my local $HOME differs from my remote $HOME. We currently do not have a mechanism to deal with this situation, and symlinking does not seem to help.

@berndbischl Your opinion?

@dagola
Author

dagola commented Oct 15, 2014

That's true, a shared file system is still needed.
The main reason why I think this is useful, at least at our site, is that only our master node accepts jobs. But the master node has far fewer resources than the other nodes, so sometimes we need to send a job from an R script running on another node through ssh to the master node.
Btw, in our special environment it is even possible to send jobs from a local machine via ssh to the master node. The remote $HOME is mounted not only directly under the remote /, but also at /some/path/home. So it can be mounted locally via sshfs at exactly the same path as on the remote machine. The paths are then identical, and it is possible to work with R and BatchJobs inside the sshfs mount.

@mllg
Member

mllg commented Oct 16, 2014

If Bernd does not have any objections, I'm afraid the SSH stuff will not make it into the next release because I do not have enough time to test this. But I would pull your changes after October 20th and try to generalize it for other cluster functions as well.

@berndbischl
Contributor

@mllg
I trust you on this, and I am also away for the next few days with no time to spare. Let's talk about it in a few hours and decide. But I guess your plan is already the best (the only reasonable) solution.

Damian Gola and others added 30 commits February 8, 2016 15:24
… file systems it can happen that the file is not available instantaneous
… file systems it can happen that the file is not available instantaneous
# Conflicts:
#	R/clusterFunctionsTorque.R
Change to slurm scheduler
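
The commit note above about a file not being available instantaneously describes a common effect on shared or network file systems. One way to cope with it is to poll for the file with a timeout. `waitForFile` below is a hypothetical helper with made-up defaults, not the code from these commits.

```r
# Hypothetical helper (not taken from the BatchJobs sources): poll until `path`
# exists or `timeout` seconds have passed. Returns TRUE if the file showed up.
waitForFile = function(path, timeout = 60, sleep = 1) {
  deadline = Sys.time() + timeout
  repeat {
    if (file.exists(path))
      return(TRUE)
    if (Sys.time() >= deadline)
      return(FALSE)
    Sys.sleep(sleep)
  }
}

# Usage: a file that already exists is reported immediately.
tmp = tempfile()
file.create(tmp)
found = waitForFile(tmp, timeout = 1, sleep = 0.1)
```

Returning `FALSE` instead of raising an error leaves it to the caller to decide whether a missing file is fatal.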

4 participants