From 768c5e1118c9f912947d387c3117a8513f777883 Mon Sep 17 00:00:00 2001
From: Nicholas Blaskey
Date: Fri, 17 Nov 2023 15:38:26 -0500
Subject: [PATCH 1/4] docs: add release notes for 0.26.4

---
 docs/release-notes.rst                        | 72 +++++++++++++++++++
 docs/release-notes/log-policies.rst           | 18 -----
 .../release-notes/patch_master_config_cli.rst |  7 --
 docs/release-notes/python-39-bump.rst         |  5 --
 .../remote-was-able-to-login.rst              |  7 --
 docs/release-notes/tensorboard-delete.rst     |  6 --
 6 files changed, 72 insertions(+), 43 deletions(-)
 delete mode 100644 docs/release-notes/log-policies.rst
 delete mode 100644 docs/release-notes/patch_master_config_cli.rst
 delete mode 100644 docs/release-notes/python-39-bump.rst
 delete mode 100644 docs/release-notes/remote-was-able-to-login.rst
 delete mode 100644 docs/release-notes/tensorboard-delete.rst

diff --git a/docs/release-notes.rst b/docs/release-notes.rst
index 7a3bedb5328..1c4a216fbd2 100644
--- a/docs/release-notes.rst
+++ b/docs/release-notes.rst
@@ -10,6 +10,78 @@
  Version 0.26
 **************
 
+Version 0.26.4
+==============
+
+**Release Date:** November 17, 2023
+
+**Breaking Changes**
+
+- CLI: The old CLI command to patch master log config has been changed from ``det master config
+  --log --level --color `` to ``det master config set --log.level=
+  --log.color=``.
+
+**New Features**
+
+- Experiments: Add an experiment continue feature via a CLI command ``det e continue
+  ``. This allows users to resume or recover training for an experiment whether it
+  previously succeeded or failed. This is limited to single-searcher experiments and using it may
+  prevent the user from replicating the continued experiment's results.
+
+- Experiments: Add a ``log_policies`` configuration option to define actions when a trial's log
+  matches specified patterns.
+
+  - The ``exclude_node`` action prevents a failed trial's restart attempts (due to its
+    max_restarts policy) from being scheduled on nodes with matched error logs.
This is useful for
+    bypassing nodes with hardware issues like uncorrectable GPU ECC errors.
+
+  - The ``cancel_retries`` action prevents a trial from restarting if a trial reports a log that
+    matches the pattern, even if it has remaining max_restarts. This avoids using resources for
+    retrying a trial that encounters certain failures that won't be fixed by retrying the trial,
+    such as CUDA memory issues. For details, visit :ref:`experiment-config-reference` and
+    :ref:`master-config-reference`.
+
+  This option is also configurable at the cluster or resource pool level via task container
+  defaults.
+
+- Experiments: Add a new experiment config option ``log_policies`` to allow configuring policies to
+  take after a regex is matched. There are two action types a trial can be configured to take
+
+  - ``exclude_node``: If a trial fails and restarts, the trial will not schedule on a node that
+    reported a log that matched the regex provided. This can be used to allow trials to avoid
+    being scheduled on nodes with certain hardware issues like uncorrectable gpu ECC errors.
+
+  - ``cancel_retries``: If a trial reports a log that matches this pattern, the trial will not be
+    restarted. This is useful for certain errors that are not transient such as too large of a
+    model that causes a CUDA out of memory error.
+
+  This can also be configured on a cluster or per resource pool option through task container
+  defaults. Please see :ref:`experiment-config-reference` and :ref:`master-config-reference` for
+  more information.
+
+- CLI: Add a new CLI command ``det e delete-tb-files [Experiment ID]`` to delete local TensorBoard
+  files associated to a given experiment.
+
+**Improvements**
+
+- Update default environment images to Python 3.9 from Python 3.8.
+
+**Bug Fixes**
+
+- Kubernetes: Support enabling and disabling agents to prevent Determined from scheduling jobs on
+  specific nodes.
+
+  Upgrading from a version before this feature to a version after this feature only on Kubernetes
+  will cause queued allocations to be killed on upgrade. Users can pause queued experiments to
+  avoid this.
+
+- Users: Fix an issue where if a user's remote was set to true through ``det user edit
+  --remote=true`` that user could still login through a username and password.
+
+- Users: Fix an issue where if a user's remote status was edited through ``det user edit
+  --remote=true`` that user could still login through their username and password while they were
+  expected to only be able to login through IDP integrations.
+
 Version 0.26.3
 ==============
 
diff --git a/docs/release-notes/log-policies.rst b/docs/release-notes/log-policies.rst
deleted file mode 100644
index bd36df108ce..00000000000
--- a/docs/release-notes/log-policies.rst
+++ /dev/null
@@ -1,18 +0,0 @@
-:orphan:
-
-**New Features**
-
-- Experiments: Add a ``log_policies`` configuration option to define actions when a trial's log
-  matches specified patterns.
-
-  - The ``exclude_node`` action prevents a failed trial's restart attempts (due to its
-    max_restarts policy) from being scheduled on nodes with matched error logs. This is useful for
-    bypassing nodes with hardware issues like uncorrectable GPU ECC errors.
-
-  - The ``cancel_retries`` action prevents a trial from restarting if a trial reports a log that
-    matches the pattern, even if it has remaining max_restarts. This avoids using resources for
-    retrying a trial that encounters certain failures that won't be fixed by retrying the trial,
-    such as CUDA memory issues. For details, visit :ref:`experiment-config-reference` and
-    :ref:`master-config-reference`.
-
-This option is also configurable at the cluster or resource pool level via task container defaults.
diff --git a/docs/release-notes/patch_master_config_cli.rst b/docs/release-notes/patch_master_config_cli.rst
deleted file mode 100644
index 68c8ec6fd1e..00000000000
--- a/docs/release-notes/patch_master_config_cli.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-:orphan:
-
-**Breaking Change**
-
-- CLI: The old CLI command to patch master log config has been changed from ``det master config
-  --log --level --color `` to ``det master config set --log.level=
-  --log.color=``.
diff --git a/docs/release-notes/python-39-bump.rst b/docs/release-notes/python-39-bump.rst
deleted file mode 100644
index 14fe9d4deda..00000000000
--- a/docs/release-notes/python-39-bump.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-:orphan:
-
-**Improvements**
-
-- Update default environment images to Python 3.9 from Python 3.8.
diff --git a/docs/release-notes/remote-was-able-to-login.rst b/docs/release-notes/remote-was-able-to-login.rst
deleted file mode 100644
index 2897c1d8ba7..00000000000
--- a/docs/release-notes/remote-was-able-to-login.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-:orphan:
-
-**Bug Fixes**
-
-- Users: Fix an issue where if a user's remote status was edited through ``det user edit
-  --remote=true`` that user could still login through their username and password while they were
-  expected to only be able to login through IDP integrations.
diff --git a/docs/release-notes/tensorboard-delete.rst b/docs/release-notes/tensorboard-delete.rst
deleted file mode 100644
index 46715d82dd8..00000000000
--- a/docs/release-notes/tensorboard-delete.rst
+++ /dev/null
@@ -1,6 +0,0 @@
-:orphan:
-
-**New Features**
-
-- CLI: Add a new CLI command ``det e delete-tb-files [Experiment ID]`` to delete local TensorBoard
-  files associated to a given experiment.
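
Reviewer note on the ``log_policies`` entry in this patch: the two actions it describes can be pictured with a small simulation. This is only a sketch of the behavior as worded in the release note, not Determined's implementation. The ``LogPolicy`` and ``TrialState`` classes, the ``apply_log_policies`` helper, the node names, and the sample log lines are all hypothetical, and the use of ``re.search`` (match anywhere in the line) is an assumption about how patterns are applied.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LogPolicy:
    """Hypothetical stand-in for one log_policies entry."""
    pattern: str  # regex tested against each trial log line
    action: str   # "exclude_node" or "cancel_retries"


@dataclass
class TrialState:
    """Hypothetical bookkeeping for one trial's restart behavior."""
    max_restarts: int
    restarts_used: int = 0
    excluded_nodes: set = field(default_factory=set)
    retries_cancelled: bool = False

    def can_restart(self) -> bool:
        # cancel_retries wins even when restarts remain.
        return not self.retries_cancelled and self.restarts_used < self.max_restarts


def apply_log_policies(policies, state, node, log_line):
    """Apply every policy whose pattern matches a log line reported from `node`."""
    for policy in policies:
        if re.search(policy.pattern, log_line) is None:
            continue
        if policy.action == "exclude_node":
            # Later restarts of this trial should avoid this node.
            state.excluded_nodes.add(node)
        elif policy.action == "cancel_retries":
            # The failure is treated as non-transient: stop retrying.
            state.retries_cancelled = True


policies = [
    LogPolicy(r"uncorrectable ECC error", "exclude_node"),
    LogPolicy(r"CUDA out of memory", "cancel_retries"),
]
state = TrialState(max_restarts=5)

apply_log_policies(policies, state, "node-a", "GPU 0: uncorrectable ECC error encountered")
restart_allowed_after_ecc = state.can_restart()

apply_log_policies(policies, state, "node-b", "RuntimeError: CUDA out of memory")
restart_allowed_after_oom = state.can_restart()
```

In this sketch the ECC line only removes ``node-a`` from the candidate nodes while the trial keeps its restarts, whereas the out-of-memory line stops further restarts regardless of the remaining ``max_restarts`` budget.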
From 3b062f84c39b8721065c0e1781337d4a17222d40 Mon Sep 17 00:00:00 2001
From: Nicholas Blaskey
Date: Fri, 17 Nov 2023 15:49:43 -0500
Subject: [PATCH 2/4] removed extra notes

---
 docs/release-notes.rst | 30 ------------------------------
 1 file changed, 30 deletions(-)

diff --git a/docs/release-notes.rst b/docs/release-notes.rst
index 1c4a216fbd2..3069910ece4 100644
--- a/docs/release-notes.rst
+++ b/docs/release-notes.rst
@@ -23,11 +23,6 @@ Version 0.26.4
 
 **New Features**
 
-- Experiments: Add an experiment continue feature via a CLI command ``det e continue
-  ``. This allows users to resume or recover training for an experiment whether it
-  previously succeeded or failed. This is limited to single-searcher experiments and using it may
-  prevent the user from replicating the continued experiment's results.
-
 - Experiments: Add a ``log_policies`` configuration option to define actions when a trial's log
   matches specified patterns.
 
@@ -44,21 +39,6 @@ Version 0.26.4
   This option is also configurable at the cluster or resource pool level via task container
   defaults.
 
-- Experiments: Add a new experiment config option ``log_policies`` to allow configuring policies to
-  take after a regex is matched. There are two action types a trial can be configured to take
-
-  - ``exclude_node``: If a trial fails and restarts, the trial will not schedule on a node that
-    reported a log that matched the regex provided. This can be used to allow trials to avoid
-    being scheduled on nodes with certain hardware issues like uncorrectable gpu ECC errors.
-
-  - ``cancel_retries``: If a trial reports a log that matches this pattern, the trial will not be
-    restarted. This is useful for certain errors that are not transient such as too large of a
-    model that causes a CUDA out of memory error.
-
-  This can also be configured on a cluster or per resource pool option through task container
-  defaults.
Please see :ref:`experiment-config-reference` and :ref:`master-config-reference` for
-  more information.
-
 - CLI: Add a new CLI command ``det e delete-tb-files [Experiment ID]`` to delete local TensorBoard
   files associated to a given experiment.
 
@@ -68,16 +48,6 @@ Version 0.26.4
 
 **Bug Fixes**
 
-- Kubernetes: Support enabling and disabling agents to prevent Determined from scheduling jobs on
-  specific nodes.
-
-  Upgrading from a version before this feature to a version after this feature only on Kubernetes
-  will cause queued allocations to be killed on upgrade. Users can pause queued experiments to
-  avoid this.
-
-- Users: Fix an issue where if a user's remote was set to true through ``det user edit
-  --remote=true`` that user could still login through a username and password.
-
 - Users: Fix an issue where if a user's remote status was edited through ``det user edit
   --remote=true`` that user could still login through their username and password while they were
   expected to only be able to login through IDP integrations.
From cc104abcd7e859a94f8eb75fbe53baa39baddbaa Mon Sep 17 00:00:00 2001
From: Nicholas Blaskey
Date: Fri, 17 Nov 2023 16:11:18 -0500
Subject: [PATCH 3/4] Apply suggestions from code review

Co-authored-by: Danny Zhu
---
 docs/release-notes.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/release-notes.rst b/docs/release-notes.rst
index 3069910ece4..8621b728318 100644
--- a/docs/release-notes.rst
+++ b/docs/release-notes.rst
@@ -17,7 +17,7 @@ Version 0.26.4
 
 **Breaking Changes**
 
-- CLI: The old CLI command to patch master log config has been changed from ``det master config
+- CLI: The CLI command to patch the master log config has been changed from ``det master config
   --log --level --color `` to ``det master config set --log.level=
   --log.color=``.
 
@@ -27,11 +27,11 @@ Version 0.26.4
   matches specified patterns.
 
   - The ``exclude_node`` action prevents a failed trial's restart attempts (due to its
-    max_restarts policy) from being scheduled on nodes with matched error logs. This is useful for
+    ``max_restarts`` policy) from being scheduled on nodes with matching error logs. This is useful for
     bypassing nodes with hardware issues like uncorrectable GPU ECC errors.
 
   - The ``cancel_retries`` action prevents a trial from restarting if a trial reports a log that
-    matches the pattern, even if it has remaining max_restarts. This avoids using resources for
+    matches the pattern, even if it has remaining ``max_restarts``. This avoids using resources for
     retrying a trial that encounters certain failures that won't be fixed by retrying the trial,
     such as CUDA memory issues. For details, visit :ref:`experiment-config-reference` and
     :ref:`master-config-reference`.
@@ -40,7 +40,7 @@ Version 0.26.4
   defaults.
 
 - CLI: Add a new CLI command ``det e delete-tb-files [Experiment ID]`` to delete local TensorBoard
-  files associated to a given experiment.
+  files associated with a given experiment.
 
 **Improvements**
 
@@ -49,8 +49,8 @@ Version 0.26.4
 **Bug Fixes**
 
 - Users: Fix an issue where if a user's remote status was edited through ``det user edit
-  --remote=true`` that user could still login through their username and password while they were
-  expected to only be able to login through IDP integrations.
+  --remote=true``, that user could still log in using their username and password; they should only
+  be able to log in through IdP integrations.
 
 Version 0.26.3
 ==============
From 45b3d57f50704e5faa916ae385ef73f4d5a460f7 Mon Sep 17 00:00:00 2001
From: Nicholas Blaskey
Date: Fri, 17 Nov 2023 16:28:02 -0500
Subject: [PATCH 4/4] lint

---
 docs/release-notes.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/release-notes.rst b/docs/release-notes.rst
index 8621b728318..209c2ae4f43 100644
--- a/docs/release-notes.rst
+++ b/docs/release-notes.rst
@@ -27,13 +27,13 @@ Version 0.26.4
   matches specified patterns.
 
   - The ``exclude_node`` action prevents a failed trial's restart attempts (due to its
-    ``max_restarts`` policy) from being scheduled on nodes with matching error logs. This is useful for
-    bypassing nodes with hardware issues like uncorrectable GPU ECC errors.
+    ``max_restarts`` policy) from being scheduled on nodes with matching error logs. This is
+    useful for bypassing nodes with hardware issues like uncorrectable GPU ECC errors.
 
   - The ``cancel_retries`` action prevents a trial from restarting if a trial reports a log that
-    matches the pattern, even if it has remaining ``max_restarts``. This avoids using resources for
-    retrying a trial that encounters certain failures that won't be fixed by retrying the trial,
-    such as CUDA memory issues. For details, visit :ref:`experiment-config-reference` and
+    matches the pattern, even if it has remaining ``max_restarts``. This avoids using resources
+    for retrying a trial that encounters certain failures that won't be fixed by retrying the
+    trial, such as CUDA memory issues. For details, visit :ref:`experiment-config-reference` and
     :ref:`master-config-reference`.
 
   This option is also configurable at the cluster or resource pool level via task container