Deployment documentation (containerized packaging, virtualized installation, managed deployment, and distributed execution) #437

Closed
aleszamuda opened this issue Oct 11, 2022 · 1 comment · Fixed by #441

Comments

@aleszamuda
Contributor

aleszamuda commented Oct 11, 2022

Documentation on how to deploy DAPHNE is still missing: there is no .md file for it in the doc/ directory.
This issue tracks the deployment documentation for:

  • containerized packaging,
  • virtualized installation,
  • managed deployment, and
  • distributed execution.

#335 already added a set of scripts and experimental data files regarding:

The functionality added in #335 now needs documentation so that it can be merged into the main branch on GitHub. A new branch will be created for this issue to hold the documentation .md file mentioned above.

The planned files to be changed:

  1. doc/Deploy.md - the new file (tutorial-mode linearized explanation for a user).
  2. doc/GettingStarted.md - explanation of the switches from comment of PR 236 copying deploy files to merge #335.
  3. deploy/README.md - a short README file to explain directory structure and point to more documentation in doc/Deploy.md.

doc/Deploy.md will describe the deployment functionality for SLURM (deploy-distributed-on-slurm.sh):

  • compilation of the Singularity image,
  • compilation of the daphne main and worker codes within the Singularity image
  • packaging compiled daphne codes
  • packaging compiled daphne codes with user payload as a payload package
  • uploading the payload package to an HPC platform
  • obtaining the list of PEERS from SLURM
  • executing daphne main and worker binaries on SLURM PEERS
  • collection of logs from daphne execution
  • cleanup of workers and payload deployment
    deploy-distributed-on-slurm.sh packages DAPHNE and executes it on a target HPC platform; it is tailored to the communication required with SLURM and that platform.
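
As an illustration of the "obtaining the list of PEERS from SLURM" step, here is a minimal Python sketch. It is not taken from deploy-distributed-on-slurm.sh; it only assumes the standard Slurm command `scontrol show hostnames` and the `SLURM_JOB_NODELIST` environment variable. The function names, the `host:port` format of PEERS, and the port number 50000 are hypothetical and would need to match what the deployment script actually expects.

```python
import os
import subprocess


def build_peers(hostnames_output: str, port: int = 50000) -> str:
    """Turn one-hostname-per-line text into a comma-separated host:port list.

    ASSUMPTION: the host:port format and the default port are illustrative only.
    """
    hosts = [line.strip() for line in hostnames_output.splitlines() if line.strip()]
    return ",".join(f"{host}:{port}" for host in hosts)


def peers_from_slurm(port: int = 50000) -> str:
    """Ask Slurm for the nodes of the current allocation and build PEERS.

    Must be called from inside a Slurm job, where SLURM_JOB_NODELIST is set.
    """
    nodelist = os.environ["SLURM_JOB_NODELIST"]  # compact form, e.g. node[01-03]
    expanded = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],  # expands to one host per line
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return build_peers(expanded, port)
```

Keeping the string handling in `build_peers` separate from the Slurm call means the formatting can be checked on a machine without Slurm installed.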

The functional differences between deploy-distributed-on-slurm.sh and the basic deployDistributed.sh will also be documented:

  • the daphne main and worker node executables are built through a Singularity container that is itself built on the HPC platform in use, whereas the "deploy" function in deployDistributed.sh sends the sources to each node and builds the executables there, which might cause files to be overwritten if the workers share the same mounted user storage (e.g. distributed storage attached as the home directory)
  • the list of PEERS is not defined by the user but obtained from SLURM (in deployDistributed.sh, the user supplies PEERS as an argument)
  • deployment, run, and cleanup are supported in a single request
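
The "single request" idea in the last point can be sketched as follows. This is a hypothetical Python outline rather than the script's actual logic: `deploy`, `run`, and `cleanup` are placeholder callables standing in for whatever the deployment script does at each stage. The sketch only shows the control flow, in which cleanup still happens when the run fails.

```python
def single_request(deploy, run, cleanup):
    """Deploy, run, and clean up in one call.

    ASSUMPTION: the three callables are hypothetical stand-ins for the
    deployment stages; they are not functions of the DAPHNE scripts.
    """
    deploy()
    try:
        # The result of the run (e.g. an exit status) is passed back to the caller.
        return run()
    finally:
        # Executed whether run() returned normally or raised an exception.
        cleanup()
```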

Additionally, this issue (#437) will address points 1-4 from a comment on pull request #431:

  1. The document explains how to set up the distributed workers, but it should also briefly comment on what to do afterwards. For instance, the distributed workers can be reused for multiple invocations of daphne. Furthermore, how can the workers be shut down?
  2. It would be great to mention explicitly that running on the distributed runtime does not require any changes to the DaphneDSL code, but just to the deployment environment, since daphne figures out automatically what to run in a distributed fashion (that might not be self-evident to users).
  3. It would be great to use an existing DaphneDSL example script (feel free to create a new one in scripts/examples). Can be very simple.
  4. It would be great if there was a README.md in deploy/. This directory is referenced, but when the users navigate there, it is not really clear where to start in that directory. This could be explained in such a readme.
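
Point 2 above (the same DaphneDSL script runs locally or distributed, and only the environment changes) can be illustrated with a small Python sketch. The `--distributed` flag, the `DISTRIBUTED_WORKERS` variable name, the `bin/daphne` path, and the script path used below are all assumptions made for illustration; they should be checked against the documentation in doc/ before use.

```python
import os


def daphne_invocation(script, workers=None):
    """Return (argv, env) for running one script locally or distributed.

    ASSUMPTION: the binary path, flag, and environment variable name are
    illustrative and must be verified against the DAPHNE documentation.
    """
    argv = ["bin/daphne"]
    env = dict(os.environ)
    if workers:
        # Only the invocation and environment differ; the script is unchanged.
        argv.append("--distributed")
        env["DISTRIBUTED_WORKERS"] = ",".join(workers)
    argv.append(script)
    return argv, env


# Hypothetical script path; the same file is used in both cases.
local_argv, _ = daphne_invocation("scripts/examples/example.daphne")
dist_argv, dist_env = daphne_invocation(
    "scripts/examples/example.daphne", ["localhost:5000", "localhost:5001"]
)
```

The returned pair could be handed to `subprocess.run(argv, env=env)`; the sketch stops short of that so it can be inspected without a DAPHNE build.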
@aleszamuda
Contributor Author

Hi @pdamme, I would like to work on this issue, please assign me.

@pdamme pdamme added this to the v0.1 milestone Oct 11, 2022
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…rted.md, linking to /doc/development/BuildingDaphne.md
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…untime Systems" rephrased to "deploy a Daphne System"
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…esses" is now omitted. Mentioning "Daphne scripts" is also omitted. Both are replaced by "handling", which should then be explained in other documentation files.
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…e paragraph now mentions node types, Daphne system parts, and their connections
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…analysis of experimental data obtained through running"->"stopping and cleaning"
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…nology of a "process" to "task" and "coordinator" within the Daphne System Scheme
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
… tasks and containerized environment in distributed starting.
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…oyment now mentions starting a "Daphne system".
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…rement of having deployed workers before starting the Daphne system.
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…ibuted connectivity among coordinator and workers.
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
…istributed-on-slurm.sh mentioned in Deploy.md section title.
aleszamuda pushed a commit to aleszamuda/daphne that referenced this issue Oct 15, 2022
… requirement for both deployment scripts and referencing Slurm.
pdamme pushed a commit that referenced this issue Oct 16, 2022
…ualized installation, managed deployment, and distributed execution) (#441)

The deploy/README.md now explains where to start in the deploy/ directory, along with an initial overview, an explanation of the HPC deployment Computer Architecture Framework, the deployment functionalities, a list of the files in the directory, and links to more documentation in the doc/ directory.

The doc/Deploy.md now explains DAPHNE Packaging, Distributed Deployment, and Management. After the overview, it explains deployment functionalities for SLURM, then lists further usage documentation.

Closes #437.
2 participants