
docs: docs changes for searcher context removal #10182

Merged: 9 commits merged into main from scr-docs on Nov 1, 2024

Conversation

azhou-determined
Contributor

@azhou-determined azhou-determined commented Oct 31, 2024

Ticket

Description

A few docs updates from the searcher context removal.

Test Plan

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@azhou-determined azhou-determined requested review from a team as code owners October 31, 2024 22:14
@cla-bot cla-bot bot added the cla-signed label Oct 31, 2024
@determined-ci determined-ci requested a review from a team October 31, 2024 22:14
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 31, 2024

netlify bot commented Oct 31, 2024

Deploy Preview for determined-ui canceled.

Name                  Link
🔨 Latest commit       b9d4927
🔍 Latest deploy log   https://app.netlify.com/sites/determined-ui/deploys/67255e339ce80a0008ac3616


codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 41.66667% with 7 lines in your changes missing coverage. Please review.

Project coverage is 54.25%. Comparing base (f9ac6bc) to head (b9d4927).
Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
...determined/pytorch/deepspeed/_deepspeed_context.py 55.55% 4 Missing ⚠️
master/pkg/searcher/searcher.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10182      +/-   ##
==========================================
- Coverage   54.26%   54.25%   -0.01%     
==========================================
  Files        1259     1259              
  Lines      157284   157293       +9     
  Branches     3643     3643              
==========================================
- Hits        85357    85347      -10     
- Misses      71794    71813      +19     
  Partials      133      133              
Flag Coverage Δ
backend 45.90% <0.00%> (-0.03%) ⬇️
harness 71.13% <55.55%> (-0.01%) ⬇️
web 54.30% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
harness/determined/pytorch/_trainer_utils.py 57.50% <ø> (ø)
harness/determined/pytorch/deepspeed/_trainer.py 49.54% <ø> (ø)
master/pkg/searcher/searcher.go 46.72% <0.00%> (ø)
...determined/pytorch/deepspeed/_deepspeed_context.py 62.68% <55.55%> (-0.34%) ⬇️

... and 5 files with indirect coverage changes

@azhou-determined azhou-determined requested review from MikhailKardash and removed request for corban-beaird October 31, 2024 22:34
Comment on lines 477 to 478
+ # Set flag used by internal PyTorch training loop
+ os.environ["DET_MANUAL_INIT_DISTRIBUTED"] = "true"
Member

I'm sorry, what?

If we can't fix this before the release ships, then I vote we remove this section and add it back later when we fix it.

This is basically a bug, and we're documenting it here rather than fixing it.

Contributor Author

well, this is copied from pytorch-ug, where we document similar behavior, which isn't a bug, but a hackily-supported use case.

i'm pretty sure this isn't needed here though: either we're in local mode, where we init the dist backend ourselves and pass it into the ds.init(), or we're on cluster and it's None.

so i removed it, cc: @MikhailKardash
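
For illustration, a minimal sketch of those two cases. This assumes ``det_ds.init()`` accepts a ``distributed`` argument the way ``pytorch.init()`` does, and that ``core.DistributedContext.from_deepspeed()`` is available; treat it as a sketch, not a confirmed API:

    import deepspeed
    from determined import core
    from determined.pytorch import deepspeed as det_ds

    # Case 1, on cluster: Determined supplies the distributed context itself.
    with det_ds.init() as train_context:
        ...

    # Case 2, local mode: initialize the backend ourselves and pass it in.
    deepspeed.init_distributed()  # sets up torch.distributed
    distributed = core.DistributedContext.from_deepspeed()
    with det_ds.init(distributed=distributed) as train_context:
        ...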

Member

If we are documenting internal flags, that is a bug. It's also an oxymoron, since documenting it means it's not really internal, just a weirdly bad public API.

Contributor Author

"it's a feature, not a bug!"

lol, IIRC at the time of landing pytorch trainer, we decided to do this because we wanted to document a shiny new local training capability but didn't have time to figure out a good way to do it.

you're right that it's a bad public API. i've put it on my todo list, and i'll fix it next week. it's not related to this PR or this feature, though.

Comment on lines 586 to 588
If your training code needs to read some values from the experiment configuration,
``pytorch.deepspeed.init()`` accepts an ``exp_conf`` argument which allows calling
``context.get_experiment_config()`` from ``DeepSpeedTrialContext``.
Member

The experiment config isn't safe for users to read from, because we automatically shim it from time to time.

The fact that it's part of the context is like, ancient legacy.

Instead, tell people to access info.trial.user_data for the data field or info.trial.hparams for hyperparameters.
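
For instance, a minimal sketch of the access pattern suggested above (``det.get_cluster_info()`` returns ``None`` when not running on a cluster, hence the guard):

    import determined as det

    info = det.get_cluster_info()
    assert info is not None, "must be run on a Determined cluster"

    hparams = info.trial.hparams      # hyperparameters for this trial
    user_data = info.trial.user_data  # the experiment config's `data` field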

integrates Keras training code with Determined through a single :ref:`Keras Callback
<api-keras-ug>`.

- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that
Contributor

One of the changes to DeepSpeedTrialContext was to remove det.EnvContext from it. This used to be accessible via context.env. I mentioned this as a potential breaking change for some users, who may have to move to the context.get_experiment_config() and context.get_hparams() methods.

Contributor

Actually, should probably just tell users it's gone completely and they should read config from the cluster info.

Member

This shouldn't technically be breaking because it was never documented, so it was not technically public.

But it's so old that it wasn't prefixed with _, because we weren't good about doing that back then.

Member

If it wasn't documented before, we don't need to document that it is gone, imo.

Contributor

The reason I reference it is that our examples had code that used it (gpt_neox before the Trainer API rewrites, for example).

@determined-ci determined-ci requested a review from a team October 31, 2024 22:47
@@ -319,25 +319,6 @@ While debugging, the logger will display lines highlighted in blue for easy iden
Validation Policy
*******************

.. _experiment-config-min-validation-period:
Member

uh, I should probably tell you that @tara-det-ai told me we should mark it as deprecated but not delete it from the docs.

This is contrary to my preference, which is to let users look in old docs if they want to use old features.

But also it makes sense in this case because we don't have any deprecation warnings anywhere in the system for these fields; users would only know by looking at the docs.

Also this means we need to revert some of my docs deletions from the searcher-context-removal branch, which I realize now that I forgot to do 😞. It would be the removals from this file, I think.

That said, the other removals, like "use min_validation_period to control validations", should stand. But the actual reference here should not be removed.

Contributor Author

welp. ok then. i added them all back.

Member

thanks, sorry.

@@ -46,7 +46,7 @@ def __init__(self, context: det_ds.DeepSpeedTrialContext,
        self.discriminator = self.context.wrap_model_engine(discriminator)
        self.fixed_noise = self.context.to_device(
            torch.randn(
                self.context.train_micro_batch_size_per_gpu, self.hparams["noise_length"], 1, 1
Member

was this a bug, or is this a breaking change to the API?

Contributor Author

both? it was a bug because it was a breaking change. added to release notes.

Member

This is the only reason we are breaking existing DeepSpeedTrials? Like, we're deprecating old paths, yes, but we're only actually breaking user code for this?

Can we include a @property-style getter that makes this not a breaking change?

something like:

    @property
    def train_micro_batch_size_per_gpu(self) -> int:
        # Emit a deprecation warning, then delegate to the new getter.
        warnings.warn(
            "DeepSpeedTrialContext.train_micro_batch_size_per_gpu has been deprecated in "
            "Determined 0.38.0; please use the context.get_train_micro_batch_size_per_gpu() "
            "getter instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.get_train_micro_batch_size_per_gpu()

@determined-ci determined-ci requested a review from a team November 1, 2024 22:46
Comment on lines 15 to 17
- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes
on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and
``get_num_micro_batches_per_slot()``.
Member

no longer breaking
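
For reference, a rough before/after sketch of the migration the quoted note describes, where ``context`` is a ``DeepSpeedTrialContext``:

    # Before: attribute access (now deprecated rather than removed).
    mbs = context.train_micro_batch_size_per_gpu
    num_micro = context.num_micro_batches_per_slot

    # After: the new getter methods.
    mbs = context.get_train_micro_batch_size_per_gpu()
    num_micro = context.get_num_micro_batches_per_slot()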

Member
@rb-determined-ai rb-determined-ai left a comment

awesome!

@determined-ci determined-ci requested a review from a team November 1, 2024 23:00
@azhou-determined azhou-determined merged commit f02872a into main Nov 1, 2024
79 of 97 checks passed
@azhou-determined azhou-determined deleted the scr-docs branch November 1, 2024 23:25
github-actions bot pushed a commit that referenced this pull request Nov 1, 2024
update docs and add release note for searcher context removal in 0.38.0

(cherry picked from commit f02872a)
Labels: backport release-0.38.0, cla-signed, documentation