docs: docs changes for searcher context removal #10182
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main   #10182      +/-   ##
==========================================
- Coverage   54.26%   54.25%   -0.01%
==========================================
  Files        1259     1259
  Lines      157284   157293       +9
  Branches     3643     3643
==========================================
- Hits        85357    85347      -10
- Misses      71794    71813      +19
  Partials      133      133
docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst (outdated; resolved)
+ # Set flag used by internal PyTorch training loop
+ os.environ["DET_MANUAL_INIT_DISTRIBUTED"] = "true"
I'm sorry, what?
If we can't fix this before the release ships, then I vote we remove this section and add it back later when we fix it.
This is basically a bug, and we're documenting it here rather than fixing it.
Well, this is copied from pytorch-ug, where we document similar behavior; it isn't a bug, but a hackily-supported use case.
I'm pretty sure this isn't needed here though: either we're in local mode, where we init the dist backend ourselves and pass it into ds.init(), or we're on cluster and it's None.
So I removed it. cc: @MikhailKardash
If we are documenting internal flags, that is a bug. It's also an oxymoron, since documenting it means it's not really internal, just a weirdly bad public API.
"it's a feature, not a bug!"
lol, IIRC at the time of landing pytorch trainer, we decided to do this because we wanted to document a shiny new local training capability but didn't have time to figure out a good way to do it.
you're right that it's a bad public API. i've put it on my todo list, and i'll fix it next week. it's not related to this PR or this feature, though.
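For readers following along: the general pattern being debated above is gating behavior behind an environment-variable flag. As a minimal self-contained sketch of that pattern only (not Determined's actual internal implementation; the variable name is taken from the diff above, everything else is illustrative):

```python
import os

def should_manual_init_distributed() -> bool:
    """Illustrative only: read a hypothetical env-var feature flag.

    Unset or falsy means the framework initializes the distributed
    backend itself; "true" means the user has already called
    torch.distributed.init_process_group() manually.
    """
    return os.environ.get("DET_MANUAL_INIT_DISTRIBUTED", "false").lower() == "true"

os.environ["DET_MANUAL_INIT_DISTRIBUTED"] = "true"
print(should_manual_init_distributed())  # True
```

The reviewers' objection is precisely that a flag like this, once documented, becomes de facto public API even though it was meant to be internal.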
docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst (outdated; resolved)
docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst (outdated; resolved)
If your training code needs to read some values from the experiment configuration,
``pytorch.deepspeed.init()`` accepts an ``exp_conf`` argument which allows calling
``context.get_experiment_config()`` from ``DeepSpeedTrialContext``.
The experiment config isn't safe for users to read from, because we automatically shim it from time to time. The fact that it's part of the context is ancient legacy.
Instead, tell people to access `info.trial.user_data` for the `data` field or `info.trial.hparams` for hyperparameters.
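The suggested access pattern would look roughly like the following. Since this sketch can't assume a live cluster, a mock object stands in for the info returned by `det.get_cluster_info()`; the field values are invented, and only the `trial.user_data` / `trial.hparams` attribute names come from the comment above:

```python
from types import SimpleNamespace

# Mock standing in for the cluster info object; on a real cluster,
# trial.user_data holds the experiment config's `data` field and
# trial.hparams holds the hyperparameters.
info = SimpleNamespace(
    trial=SimpleNamespace(
        user_data={"train_path": "/data/train.csv"},  # invented example value
        hparams={"noise_length": 100, "lr": 1e-3},    # invented example value
    )
)

# Preferred over context.get_experiment_config(): read only the
# stable, user-owned fields.
data_conf = info.trial.user_data
hparams = info.trial.hparams
print(hparams["noise_length"])  # 100
```

The design point is that `user_data` and `hparams` are user-owned and stable, whereas the full experiment config can be rewritten (shimmed) by the system between releases.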
integrates Keras training code with Determined through a single :ref:`Keras Callback
<api-keras-ug>`.

- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that
One of the changes to `DeepSpeedTrialContext` was to remove `det.EnvContext` from it. This used to be accessible via `context.env`. I mentioned this as a potential breaking change for some users, who may have to move to the `context.get_experiment_config()` and `context.get_hparams()` methods.
Actually, should probably just tell users it's gone completely and they should read config from the cluster info.
This shouldn't technically be breaking because it was never documented, so it was not technically public. But it's so old that it wasn't prefixed with `_`, because we didn't use to be good about doing that.
If it wasn't documented before, we don't need to document that it is gone, imo.
The reason I reference it is that our examples had code that used it (`gpt_neox` before the Trainer API rewrites, for example).
@@ -319,25 +319,6 @@ While debugging, the logger will display lines highlighted in blue for easy iden

 Validation Policy
 *******************

 .. _experiment-config-min-validation-period:
uh, I should probably tell you that @tara-det-ai told me we should mark it as deprecated but not delete it from the docs.
This is contrary to my preference, which is to let users look in old docs if they want to use old features.
But also it makes sense in this case because we don't have any deprecation warnings anywhere in the system for these fields; users would only know by looking at the docs.
Also this means we need to revert some of my docs deletions from the searcher-context-removal branch, which I realize now that I forgot to do 😞. It would be the removals from this file, I think.
That said, the other places where you remove things like "use min_validation_period to control validations" should be removed. But the actual reference here should not.
welp. ok then. i added them all back.
thanks, sorry.
@@ -46,7 +46,7 @@ def __init__(self, context: det_ds.DeepSpeedTrialContext,
     self.discriminator = self.context.wrap_model_engine(discriminator)
     self.fixed_noise = self.context.to_device(
         torch.randn(
             self.context.train_micro_batch_size_per_gpu, self.hparams["noise_length"], 1, 1
was this a bug, or is this a breaking change to the API?
both? it was a bug because it was a breaking change. added to release notes.
This is the only reason we are breaking existing DeepSpeedTrials? Like, we're deprecating old paths, yes, but we're only actually breaking user code for this? Can we include a `@property`-style getter that makes this not a breaking change?
something like:

    @property
    def train_micro_batch_size_per_gpu(self) -> int:
        warnings.warn(
            "DeepSpeedTrialContext.train_micro_batch_size_per_gpu has been deprecated in "
            "Determined 0.38.0; please use the context.get_train_micro_batch_size_per_gpu() "
            "getter instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.get_train_micro_batch_size_per_gpu()
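As a self-contained sketch of this deprecation-shim pattern (the class and the return value here are illustrative stand-ins, not the real `DeepSpeedTrialContext`):

```python
import warnings

class _Context:
    """Illustrative stand-in for a context with a deprecated attribute."""

    def get_train_micro_batch_size_per_gpu(self) -> int:
        return 16  # placeholder value for the sketch

    @property
    def train_micro_batch_size_per_gpu(self) -> int:
        # Old attribute access still works, but warns and forwards
        # to the new getter, so existing user code does not break.
        warnings.warn(
            "train_micro_batch_size_per_gpu is deprecated; use "
            "get_train_micro_batch_size_per_gpu() instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.get_train_micro_batch_size_per_gpu()

ctx = _Context()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = ctx.train_micro_batch_size_per_gpu
print(value, caught[0].category.__name__)  # 16 FutureWarning
```

`stacklevel=2` makes the warning point at the caller's attribute access rather than at the property body, which is what makes the deprecation actionable for users.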
Force-pushed from 71cf0a2 to 8a5d862.
docs/model-dev-guide/api-guides/apis-howto/deepspeed/deepspeed.rst (outdated; resolved)
- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes
  on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and
  ``get_num_micro_batches_per_slot()``.
no longer breaking
awesome!
update docs and add release note for searcher context removal in 0.38.0 (cherry picked from commit f02872a)
Ticket

Description

A few docs updates from searcher context removal.

Test Plan

Checklist

- docs/release-notes/: see Release Note for details.