diff --git a/docs/setup-cluster/slurm/_index.rst b/docs/setup-cluster/slurm/_index.rst
index 8c0e08418c7c..5b487dfcfe98 100644
--- a/docs/setup-cluster/slurm/_index.rst
+++ b/docs/setup-cluster/slurm/_index.rst
@@ -24,14 +24,14 @@ Slurm/PBS deployment applies to the Enterprise Edition.
 
-This document describes how Determined can be configured to utilize HPC cluster scheduling systems
+This section describes how Determined can be configured to utilize HPC cluster scheduling systems
 via the Determined HPC launcher. In this type of configuration, Determined delegates all job
 scheduling and prioritization to the HPC workload manager (either Slurm or PBS). This integration
 enables existing HPC workloads and Determined workloads to coexist and Determined workloads to
 access all of the advanced capabilities of the HPC workload manager.
 
-To install Determined on the HPC cluster, ensure that the :ref:`slurm-requirements` are met, then
-follow the steps in the :ref:`install-on-slurm` document.
+To install Determined on the HPC cluster, ensure that the :ref:`hpc-environment-requirements` and
+:ref:`slurm-requirements` are met, then follow the steps in the :ref:`install-on-slurm` document.
 
 ***********
  Reference
 ***********
@@ -52,6 +52,7 @@ follow the steps in the :ref:`install-on-slurm` document.
    :hidden:
 
    slurm-requirements
+   hpc-environment-requirements
    hpc-launching-architecture
    hpc-security-considerations
    install-on-slurm
diff --git a/docs/setup-cluster/slurm/hpc-environment-requirements.rst b/docs/setup-cluster/slurm/hpc-environment-requirements.rst
new file mode 100644
index 000000000000..aad0482783d3
--- /dev/null
+++ b/docs/setup-cluster/slurm/hpc-environment-requirements.rst
@@ -0,0 +1,137 @@
+.. _hpc-environment-requirements:
+
+#################################
+ HPC Environment Requirements
+#################################
+
+This document describes how to prepare your environment for installing Determined on an HPC
+cluster managed by Slurm or PBS workload managers.
+
+.. include:: ../../_shared/tip-keep-install-instructions.txt
+
+***************************
+ Environment Requirements
+***************************
+
+Hardware Requirements
+=====================
+
+The recommended requirements for the admin node are:
+
+- 1 admin node for the master, the database, and the launcher with the following specs:
+
+  - 16 cores
+  - 32 GB of memory
+  - 1 TB of disk space (depends on the database, see "Database Requirements" section below)
+
+The minimal requirements are:
+
+- 1 admin node with 8 cores, 16 GB of memory, and 200 GB of disk space
+
+.. note::
+   While the node can be virtual, a physical one is preferred.
+
+Network Requirements
+====================
+
+The admin node needs to reach the HPC shared area (the scratch file system). The recommended
+requirements are:
+
+- 10 Gbps Ethernet link between the admin node and the HPC worker nodes
+
+The minimal requirements are:
+
+- 1 Gbps Ethernet link
+
+.. important::
+   The admin node must be connected to the Internet to download container images and Python
+   packages. If Internet access is not possible, the local container registry and package
+   repository must be filled manually with external data.
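+
+A quick way to sanity-check outbound connectivity from the admin node is shown below. The
+endpoints are examples only; substitute the container registry and Python package mirror that
+your site actually uses.
+
+.. code:: bash
+
+   # Any HTTP response (including 401) means the endpoint is reachable.
+   curl -sSI https://registry-1.docker.io/v2/ | head -n 1
+   curl -sSI https://pypi.org/simple/ | head -n 1
+
+   # If outbound traffic goes through a proxy, the proxy variables should be visible here.
+   env | grep -i _proxy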
+
+The following ports are used by the product's components and must be open:
+
++------+-------------+------+-----------------------------------------------------------------------------------+
+| Node | Ports       | Type | Description                                                                       |
++======+=============+======+===================================================================================+
+| Admin| 8080, 8443  | TCP  | Provide HTTP(S) access to the master node for web UI access and agent API access |
++------+-------------+------+-----------------------------------------------------------------------------------+
+
+Storage Requirements
+====================
+
+Determined requires shared storage for experiment checkpoints, container images, datasets, and
+pre-trained models. All worker nodes connected to the cluster must be able to access it. The
+storage can be a network file system (like VAST, Ceph FS, Gluster FS, Lustre) or a bucket (on
+cloud or on-prem if it exposes an S3 API).
+
+Space requirements depend on the model complexity/size:
+
+- 10-30 TB of HDD space for small models (up to 1GB in size)
+- 20-60 TB of SSD space for medium to large models (more than 1GB in size)
+
+Software Requirements
+=====================
+
+The following software components are required:
+
++------------------------+----------------------------------+------------------+
+| Component              | Version                          | Installation Node|
++========================+==================================+==================+
+| Operating System       | RHEL 8.5+ or 9.0+                | Admin            |
+|                        | SLES 15 SP3+                     |                  |
+|                        | Ubuntu 22.04+                    |                  |
++------------------------+----------------------------------+------------------+
+| Java                   | >= 1.8                           | Admin            |
++------------------------+----------------------------------+------------------+
+| Python                 | >= 3.8                           | Admin            |
++------------------------+----------------------------------+------------------+
+| Podman                 | >= 4.0.0                         | Admin            |
++------------------------+----------------------------------+------------------+
+| PostgreSQL             | 10 (RHEL 8), 13 (RHEL 9),        | Admin            |
+|                        | 14 (Ubuntu 22.04) or newer       |                  |
++------------------------+----------------------------------+------------------+
+| HPC client packages    | Same as login nodes              | Admin            |
++------------------------+----------------------------------+------------------+
+| Container runtime      | Singularity >= 3.7               | Workers          |
+|                        | (or Apptainer >= 1.0)            |                  |
+|                        | Podman >= 3.3.1                  |                  |
+|                        | Enroot >= 3.4.0                  |                  |
++------------------------+----------------------------------+------------------+
+| HPC scheduler          | Slurm >= 20.02                   | Workers          |
+|                        | (excluding 22.05.5 - 22.05.8)    |                  |
+|                        | PBS >= 2021.1.2                  |                  |
++------------------------+----------------------------------+------------------+
+| NVIDIA drivers         | >= 450.80                        | Workers          |
++------------------------+----------------------------------+------------------+
+
+Database Requirements
+=====================
+
+The solution requires PostgreSQL 10 or newer, which will be installed on the admin node.
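+
+A quick way to confirm the installed PostgreSQL version is shown below; this is a sketch that
+assumes a local installation with the default ``postgres`` superuser account.
+
+.. code:: bash
+
+   # Both client and server should report version 10 or newer
+   # (13 or newer on RHEL 9, 14 or newer on Ubuntu 22.04).
+   psql --version
+   sudo -u postgres psql -c "SHOW server_version;"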
+
+The required disk space for the database is estimated as follows:
+
+- 200 GB on small systems (less than 15 workers) or big systems if the experiment logs are sent
+  to Elasticsearch
+- 16 GB/worker on big systems that store experiment logs inside the database
+
+****************************
+ Installation Prerequisites
+****************************
+
+Before proceeding with the installation, ensure that:
+
+- The operating system is installed along with the HPC client packages (a clone of an existing
+  login node could be made if the OS is the same or similar)
+- The node has Internet connectivity
+- The node has the shared file system mounted on /scratch
+- Java is installed
+- Podman is installed
+
+A dedicated OS user named ``determined`` should be created on the admin node. This user should:
+
+- Belong to the ``determined`` group
+- Be able to run HPC jobs
+- Have sudo permissions for specific commands (see :ref:`hpc-security-considerations` for details)
+
+.. note::
+   All subsequent installation steps assume the use of the ``determined`` user or root access.
+
+For detailed installation steps, including OS-specific instructions and configuration, refer to
+the :ref:`install-on-slurm` document.
+
+Internal Task Gateway
+=====================
+
+As of version 0.34.0, Determined supports the Internal Task Gateway feature for Kubernetes. This
+feature enables Determined tasks running on remote Kubernetes clusters to be exposed to the
+Determined master and proxies. If you're using a hybrid setup with both Slurm/PBS and Kubernetes,
+this feature might be relevant for your configuration.
+
+.. important::
+
+   Enabling this feature exposes Determined tasks to the outside world. Implement appropriate
+   security measures to restrict access to exposed tasks and secure communication between the
+   external cluster and the main cluster.
diff --git a/docs/setup-cluster/slurm/install-on-slurm.rst b/docs/setup-cluster/slurm/install-on-slurm.rst
index b62f69608f91..16082126fe6c 100644
--- a/docs/setup-cluster/slurm/install-on-slurm.rst
+++ b/docs/setup-cluster/slurm/install-on-slurm.rst
@@ -5,7 +5,10 @@
 #################################
 
 This document describes how to deploy Determined on an HPC cluster managed by the Slurm or PBS
-workload managers.
+workload managers. It covers both scenarios where root access is available and where it is not.
+
+For non-root installations, ensure that the prerequisites in :ref:`hpc-environment-requirements`
+have been completed by your system administrator before proceeding.
 
 .. include:: ../../_shared/tip-keep-install-instructions.txt
@@ -123,7 +126,7 @@ configured, install and configure the Determined master:
    |                            | path, you can override the default by updating this value.     |
    +----------------------------+----------------------------------------------------------------+
    | ``gres_supported``         | Indicates that Slurm/PBS identifies available GPUs. The       |
-   |                            | default is ``true``. See :ref:`slurm-config-requirements` or  |
+   |                            | default is ``true``. See :ref:`slurm-requirements` or         |
    |                            | :ref:`pbs-config-requirements` for details.                   |
    +----------------------------+----------------------------------------------------------------+
@@ -163,8 +166,8 @@ configured, install and configure the Determined master:
 #. If the compute nodes of your cluster do not have internet connectivity to download Docker
    images, see :ref:`slurm-image-config`.
 
-#. If internet connectivity requires use of a proxy, make sure the proxy variables are defined as
-   per :ref:`proxy-config-requirements`.
+#. If internet connectivity requires the use of a proxy, make sure the proxy variables are properly
+   configured in your environment.
 
 #. Log into Determined, see :ref:`users`. The Determined user must be linked to a user on the HPC
    cluster. If signed in with a Determined administrator account, the following example creates a
diff --git a/docs/setup-cluster/slurm/slurm-known-issues.rst b/docs/setup-cluster/slurm/slurm-known-issues.rst
index a1e0d2dd8988..9b579a315bb3 100644
--- a/docs/setup-cluster/slurm/slurm-known-issues.rst
+++ b/docs/setup-cluster/slurm/slurm-known-issues.rst
@@ -249,7 +249,7 @@ Some constraints are due to differences in behavior between Docker and Singulari
 *********************
 
 -  Enroot uses ``XDG_RUNTIME_DIR`` which is not provided to the compute jobs by Slurm/PBS by
-   default. The error ``mkdir: cannot create directory ‘/run/enroot’: Permission denied`` indicates
+   default. The error ``mkdir: cannot create directory '/run/enroot': Permission denied`` indicates
    that the environment variable ``XDG_RUNTIME_DIR`` is not defined on the compute nodes. See
    :ref:`podman-config-requirements` for recommendations.
diff --git a/docs/setup-cluster/slurm/slurm-requirements.rst b/docs/setup-cluster/slurm/slurm-requirements.rst
index c17fac844d41..a8e82de7e27a 100644
--- a/docs/setup-cluster/slurm/slurm-requirements.rst
+++ b/docs/setup-cluster/slurm/slurm-requirements.rst
@@ -1,83 +1,13 @@
 .. _slurm-requirements:
 
-###########################
- Installation Requirements
-###########################
+########################
+ Slurm/PBS Requirements
+########################
 
-********************
- Basic Requirements
-********************
-
-To deploy the Determined HPC Launcher on Slurm/PBS, the following requirements must be met.
-
--  The login node, admin node, and compute nodes must be installed and configured with one of the
-   following Linux distributions:
-
-   -  RHEL or Rocky Linux® 8.5, 8.6
-   -  RHEL 9
-   -  SUSE® Linux Enterprise Server (SLES) 12 SP3 , 15 SP3, 15 SP4
-   -  Ubuntu® 20.04, 22.04
-   -  Cray OS (COS) 2.3, 2.4
-
-   Note: More restrictive Linux distribution dependencies may be required by your choice of
-   Slurm/PBS version and container runtime (Singularity/Apptainer®, Podman, or NVIDIA® Enroot).
-
--  Slurm 20.02 or greater (excluding 22.05.5 through at least 22.05.8 - see
-   :ref:`slurm-known-issues`) or PBS 2021.1.2 or greater.
-
--  Apptainer 1.0 or greater, Singularity 3.7 or greater, Enroot 3.4.0 or greater or Podman 3.3.1 or
-   greater.
-
--  A cluster-wide shared filesystem with consistent path names across the HPC cluster.
-
--  User and group configuration must be consistent across all nodes.
-
--  All nodes must be able to resolve the hostnames of all other nodes.
-
--  To run jobs with GPUs, the NVIDIA or AMD drivers must be installed on each compute node.
-   Determined requires a version greater than or equal to 450.80 of the NVIDIA drivers. The NVIDIA
-   drivers can be installed as part of a CUDA installation but the rest of the CUDA toolkit is not
-   required.
-
--  Determined supports the `active Python versions `__.
-
-***********************
- Launcher Requirements
-***********************
+This document describes the specific requirements for deploying Determined on Slurm or PBS workload
+managers.
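+
+You can quickly confirm which workload manager is available and whether it meets the minimum
+version (an illustrative check; it assumes the Slurm or PBS client commands are on ``PATH``):
+
+.. code:: bash
+
+   # Slurm: version 20.02 or greater is required (excluding 22.05.5 - 22.05.8).
+   sinfo --version
+
+   # PBS: version 2021.1.2 or greater is required.
+   qstat --version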
 
-The launcher has the following additional requirements on the installation node:
-
--  Support for an RPM or Debian-based package installer
--  Java 1.8 or greater
--  Sudo is configured to process configuration files present in the ``/etc/sudoers.d`` directory
--  Access to the Slurm or PBS command-line interface for the cluster
--  Access to a cluster-wide file system with a consistent path names across the cluster
-
-.. _proxy-config-requirements:
-
-**********************************
- Proxy Configuration Requirements
-**********************************
-
-If internet connectivity requires a use of a proxy, verify the following requirements:
-
--  Ensure that the proxy variables are defined in ``/etc/environment`` (or ``/etc/sysconfig/proxy``
-   on SLES).
-
--  Ensure that the `no_proxy` setting covers the login and admin nodes. If these nodes may be
-   referenced by short names known only within the cluster, they must explicitly be included in the
-   `no_proxy` setting.
-
--  If your experiment code communicates between compute nodes with a protocol that honors proxy
-   environment variables, you should additionally include the names of all compute nodes in the
-   `no_proxy` variable setting.
-
-The HPC launcher imports `http_proxy`, `https_proxy`, `ftp_proxy`, `rsync_proxy`, `gopher_proxy`,
-`socks_proxy`, `socks5_server`, and `no_proxy` from ``/etc/environment`` and
-``/etc/sysconfig/proxy``. These environment variables are automatically exported in lowercase and
-uppercase into any launched jobs and containers.
-
-.. _slurm-config-requirements:
+For general environment requirements, please refer to :ref:`hpc-environment-requirements`.
 
 ********************
  Slurm Requirements
@@ -194,6 +124,8 @@ interacts with Slurm, we recommend the following steps:
 
 .. _pbs-config-requirements:
 
+.. _pbs-ngpus-config:
+
 ******************
  PBS Requirements
 ******************
@@ -226,8 +158,6 @@ interacts with PBS, we recommend the following steps:
    configure ``CUDA_VISIBLE_DEVICES`` or set the ``pbs.slots_per_node`` setting in your experiment
    configuration file to indicate the desired number of GPU slots for Determined.
 
-.. _pbs-ngpus-config:
-
 -  Ensure the ``ngpus`` resource is defined with the correct values.
 
    To ensure the successful operation of Determined, define the ``ngpus`` resource value for each
@@ -396,11 +326,9 @@ interacts with PBS, we recommend the following steps:
   Apptainer/Singularity Requirements
 ************************************
 
-Apptainer/Singularity is the recommended container runtime for Determined on HPC clusters. Apptainer
-is a fork of Singularity 3.8 and provides both the ``apptainer`` and ``singularity`` commands. For
-purposes of this documentation, you can consider all references to Singularity to also apply to
-Apptainer. The Determined launcher interacts with Apptainer/Singularity using the ``singularity``
-command.
+Determined supports Apptainer (formerly known as Singularity) as a container runtime in HPC
+environments. Ensure that Apptainer or Singularity is properly installed and configured on all
+compute nodes of your cluster.
 
 .. note::
diff --git a/docs/setup-cluster/slurm/upgrade-on-hpc.rst b/docs/setup-cluster/slurm/upgrade-on-hpc.rst
index 38bf077ea7e2..9ba1f19d3d0d 100644
--- a/docs/setup-cluster/slurm/upgrade-on-hpc.rst
+++ b/docs/setup-cluster/slurm/upgrade-on-hpc.rst
@@ -8,7 +8,8 @@ This procedure describes how to upgrade Determined on an HPC cluster managed by
 workload managers. Use this procedure when an earlier version of Determined is installed,
 configured, and functioning properly.
 
-#. Review the latest :ref:`slurm-requirements` and ensure all dependencies have been met.
+#. Review the latest :ref:`hpc-environment-requirements` and :ref:`slurm-requirements` and ensure
+   all dependencies have been met.
 
 #. Upgrade the launcher.