docs: Add cluster overview #9936

Merged
merged 5 commits into from
Sep 19, 2024
Binary file added docs/assets/images/webui-cluster-overview.png
tara-det-ai marked this conversation as resolved.
6 changes: 6 additions & 0 deletions docs/manage/_index.rst
@@ -11,6 +11,12 @@

<div class="landing">
<div class="tiles-flex">
<div class="tile-container">
<a class="tile" href="cluster-overview.html">
<h2 class="tile-title">Cluster Overview</h2>
<p class="tile-description">Discover how to manage your cluster in the WebUI.</p>
</a>
</div>
<div class="tile-container">
<a class="tile" href="historical-cluster-usage-data.html">
<h2 class="tile-title">Historical Cluster Usage Data</h2>
133 changes: 133 additions & 0 deletions docs/manage/cluster-overview.rst
@@ -0,0 +1,133 @@
.. _cluster-overview:

##########################
Cluster Overview (WebUI)
##########################

The Cluster Overview page in the WebUI shows your Determined cluster's status, resource
utilization, and configuration. The page is available to users with the appropriate permissions.

********************
Accessing the Page
********************

To access the Cluster Overview:

#. Sign in to the WebUI.
#. From the left navigation pane, select **Cluster**.
#. The overview is the default view in the Cluster section, so no further navigation is needed.

.. image:: /assets/images/webui-cluster-overview.png
   :alt: A view of the Determined WebUI Cluster Overview tab

*****************
Page Components
*****************

The Cluster Overview page consists of several key components:

Resource Utilization
====================

This section displays real-time information about the cluster's resource usage:

- Connected Agents: The number of agents currently connected to the cluster.
- CUDA Slots Allocated: The number of CUDA (GPU) slots currently in use out of the total available.
- CPU Slots Allocated: The number of CPU slots currently in use out of the total available.
- Aux Containers Running: The number of auxiliary containers currently running out of the total
  capacity.
- Active Searches: The number of active hyperparameter searches.

Slots Allocated Bars
--------------------

The slots allocated bars provide a visual representation of resource utilization across the cluster:

- Compute (CUDA) Slots Allocated: Shows the utilization of GPU resources.
- Compute (CPU) Slots Allocated: Shows the utilization of CPU resources.

Each bar is divided into sections:

- Running (Blue): Currently active slots.
- Pending (Purple): Slots allocated but not yet active.
- Free (Gray): Available slots.

The percentage and fraction of used slots are displayed on the right side of each bar.
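
The relationship between the three bar sections and the displayed label can be sketched as follows.
This is an illustration only, not WebUI code: ``SlotBar`` and its fields are hypothetical names, and
the sketch assumes that "used" means running plus pending slots.

```python
from dataclasses import dataclass


@dataclass
class SlotBar:
    """Hypothetical model of one slots-allocated bar (not a Determined API type)."""

    running: int  # slots that are currently active (blue section)
    pending: int  # slots allocated but not yet active (purple section)
    total: int    # all slots of this type in the cluster

    @property
    def free(self) -> int:
        # Whatever is neither running nor pending is free (gray section).
        return max(self.total - self.running - self.pending, 0)

    def label(self) -> str:
        # Assumption: "used" counts both running and pending slots.
        used = self.running + self.pending
        percent = 100 * used / self.total if self.total else 0.0
        return f"{percent:.0f}% ({used}/{self.total})"


bar = SlotBar(running=3, pending=1, total=8)
print(bar.free, bar.label())
```

With 3 running and 1 pending slot out of 8, the sketch reports 4 free slots and a label showing half of the slots in use.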

Resource Pools
==============

This section lists the configured resource pools, providing detailed information for each:

- Pool Name: The name of the resource pool (e.g., pool1, pool2).

- Slots Allocated: Shows the number of slots in use and the total available.

  - For pools with mixed resource types (both CUDA and CPU), it displays "Unspecified Slots
    Allocated".
  - For pools with a single resource type, it specifies the type (e.g., "CUDA Slots Allocated").

- Aux Containers: The number of auxiliary containers running out of the total capacity.

- Additional Information: Includes details such as Accelerator type, Instance Type, Connected
  Agents, Slots Per Agent, and Scheduler Type.

Note: "Unspecified Slots Allocated" indicates that the pool contains both CUDA and CPU agents.
This configuration is allowed, but it is considered suboptimal and is logged as an error. For
easier management and allocation, separate CUDA and CPU resources into different pools.
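
The labeling rule for a pool can be sketched as a small function. This is a sketch under stated
assumptions, not the WebUI implementation: ``slots_label`` is a hypothetical helper, and it assumes
each agent in the pool reports a single slot type as a string such as ``"cuda"`` or ``"cpu"``.

```python
def slots_label(agent_slot_types):
    """Return the heading shown for a pool, given one slot type per agent.

    Hypothetical helper for illustration; the input format is an assumption.
    """
    # Collapse the per-agent types into the set of distinct types in the pool.
    distinct = {slot_type.upper() for slot_type in agent_slot_types}
    if len(distinct) == 1:
        # A single resource type: name it in the label.
        return f"{next(iter(distinct))} Slots Allocated"
    # Mixed types (or no agents at all, by assumption): the type cannot be named.
    return "Unspecified Slots Allocated"


print(slots_label(["cuda", "cuda"]))
print(slots_label(["cuda", "cpu"]))
```

A pool whose agents are all CUDA gets a typed label, while a pool that mixes CUDA and CPU agents falls back to the unspecified label described in the note.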

For more details on resource pools, visit :ref:`resource-pools`.

Cluster Topology
================

A visual representation of the cluster's node and GPU distribution, showing for each node:

- Its unique identifier
- The number of available and in-use slots
- GPU types (if applicable)

To view detailed topology information:

#. Navigate to Resource Pools from the Cluster section.
#. Select a specific Resource Pool.
#. Look for the **Topology** section in the resource pool details page.

Job Queue
=========

An overview of the current job queue, including:

- Number of queued jobs
- Job priorities
- Estimated start times

For more information on managing the job queue, see :ref:`job-queue`.

Cluster Configuration
=====================

Key configuration settings for the cluster, such as:

- Master node information
- Scheduler type
- Version information

*********
Actions
*********

From the Cluster Overview page, administrators can perform several actions:

- Modify resource pool settings
- Adjust job queue priorities
- Access detailed logs and metrics

For specific instructions on these actions, refer to the respective documentation sections.

*****************
Troubleshooting
*****************

If you encounter issues or need more information about cluster management, visit the
:ref:`troubleshooting` guide or contact your system administrator.