
Make index_add_ error if input source shape is wrong #100321

Closed · wants to merge 23 commits

Conversation

@tfsingh (Contributor) commented Apr 29, 2023

Fixes #92576, checking the following as described in the documentation:

"source.shape[dim] == len(index) and source.shape[i] == self.shape[i] for i != dim"

Would be happy to iterate on this if there are any issues, and would be happy to implement the checking for the CUDA and MPS implementations of index_add_.
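
For concreteness, a minimal sketch of the documented rule (shapes chosen purely for illustration; the error case is only expected to raise once this PR's check is in place):

import torch

x = torch.zeros(5, 3)
index = torch.tensor([0, 4, 2])

ok = torch.randn(3, 3)       # ok.shape[0] == len(index) and ok.shape[1] == x.shape[1]
x.index_add_(0, index, ok)   # satisfies the rule

bad = torch.randn(3, 7)      # wrong size along the non-indexed dimension
# x.index_add_(0, index, bad)  # should raise a RuntimeError with this PR's check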

@pytorch-bot (bot) commented Apr 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100321

Note: Links to docs will display an error until the docs builds have been completed.

✅ 1 Unrelated Failure

As of commit 6f2954e:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tfsingh (Contributor, Author) commented Apr 29, 2023

The test fallback_masked_fill might need to be changed -- its implementation looks roughly like this (excluding the batching via vmap).

import torch

x = torch.randn(7, 11, 13)
dim = 0
index = torch.tensor([0, 4, 2])
values = torch.randn(3, 13)
# requires_grad_() is needed here: without it the error isn't thrown,
# even though the documentation says it should be.
ans = x.index_add(dim, index, values.requires_grad_())

However, when we run the backward pass, we see the same error as described in the original issue (all of this was done in a fresh Colab notebook).

ans.sum().backward()
RuntimeError: expand(torch.FloatTensor{[3, 11, 13]}, size=[3, 13]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (3)
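
For reference, continuing the snippet above: with self of shape (7, 11, 13), dim = 0 and 3 indices, the documented rule requires a source of shape (3, 11, 13), and with that shape both the forward and the backward go through without error:

values_ok = torch.randn(3, 11, 13, requires_grad=True)
ans = x.index_add(dim, index, values_ok)
ans.sum().backward()  # no error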

Since we'd like to do this check in the forward, I think we ought to change this test to expect the runtime error this PR throws. I suspect whatever's failing in the ONNX tests is similar, so I'll look at those too.

Would greatly appreciate your input @lezcano (apologies as well for the previous PR, which was botched, but I'm dedicated to getting this issue fixed).

@lezcano lezcano self-assigned this Apr 30, 2023
@lezcano (Collaborator) commented Apr 30, 2023

Those errors are failing on CUDA. You modified the CPU version of the function; now you just need to modify the CUDA version accordingly. You can either repeat the code or use a function that's shared between the two. Many functions share their checks, but yeah, this one repeats them, which is not great.

@lezcano lezcano self-requested a review April 30, 2023 10:17
@tfsingh (Contributor, Author) commented May 3, 2023

Okay, just added -- hoping this works, but am of course happy to iterate on it if not.

@tfsingh (Contributor, Author) commented May 4, 2023

Re: failing tests. I think the remaining ones still fall under the analysis in my earlier comment, in that they violate the invariant we're checking for. In the case of fallback_masked_fill, we are given a self of shape (7,11,13) and a source of shape (3,13). Omitting self[dim] and source[dim] gives shapes (11,13) and (13), which aren't broadcastable. The same issue exists for the ONNX tests, where we try to broadcast a self of shape (5,4,3) with a source of shape (1,3,3) along dim = 1 in both index_add_dim_size_differ and index_add_dynamic_axes.

I'm not sure if we should rewrite them to throw the expected error, or if there's a better course of action here.

@lezcano (Collaborator) commented May 4, 2023

> gives shapes (11,13) and (13), which aren't broadcastable.

These shapes are broadcastable. This means that the check needs to be a bit more generic to account for broadcasting.
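
A quick illustration of why (not from the thread, just the standard broadcasting rule as torch applies it):

import torch

# The missing leading dimension of the smaller shape is treated as 1,
# so (11, 13) and (13,) broadcast to (11, 13).
print(torch.broadcast_shapes((11, 13), (13,)))   # torch.Size([11, 13])

# A genuinely incompatible pair raises a RuntimeError:
# torch.broadcast_shapes((11, 13), (12,))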

@lezcano (Collaborator) left a comment

The changes should just be done in the meta impl. The meta impl is common to both the CPU and CUDA parts.

@tfsingh (Contributor, Author) commented May 7, 2023

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased add-err onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-err && git pull --rebase)

@tfsingh (Contributor, Author) commented May 7, 2023

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased add-err onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-err && git pull --rebase)

auto source_sizes = source.sizes().vec();
self_sizes.erase(self_sizes.begin() + dim);
source_sizes.erase(source_sizes.begin() + dim);
TORCH_CHECK(!(!self_sizes.empty() && source_sizes.empty()), "source tensor shape must match self tensor shape (excluding the specified dimension)");
Collaborator

This is not the check that we want. This does not check for broadcasting of any kind.

Contributor Author

I may be mistaken, but I believe infer_size_impl in ExpandUtils.cpp checks for broadcasting before we reach TensorAdvancedIndexing. Therefore, I believe this check covers the only remaining dimensionality failure case: if we reach this point in the program flow, we can assume the shapes are broadcastable.

If I'm incorrect, I'll need help with how to vectorize the broadcastability check as you requested earlier (in which case we should also update infer_size_impl, since it checks using a for loop).

Collaborator

This function is called from

TORCH_PRECOMPUTE_META_FUNC(index_add)
(const Tensor& self, int64_t dim, const Tensor& index, const Tensor& source, const Scalar& alpha) {
  dim = maybe_wrap_dim(dim, self.dim());
  index_func_meta_impl(*this, self, dim, index, source, "index_add");
  return TORCH_PRECOMPUTE_STRUCT(index_add)().set_dim(dim);
}

(among other places), so there is no check here that source has the correct size.

You don't need to implement the broadcasting check yourself. There are utility functions to do this. I don't remember the name of them though, but shouldn't be difficult to find.
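
To make the intent concrete, here is a rough Python analogue of the kind of check being discussed (a sketch only: the helper name below is made up for illustration, and the real check would live in the ATen meta function and its C++ shape utilities):

import torch

def check_index_add_shapes(self, dim, index, source):
    # hypothetical helper, for illustration only
    # source must provide one slice per index along `dim`
    if source.shape[dim] != index.numel():
        raise RuntimeError("source.shape[dim] must equal the number of indices")
    # the remaining dims of source must broadcast against the remaining dims of self;
    # broadcast_shapes raises a RuntimeError if they don't
    torch.broadcast_shapes(
        tuple(s for i, s in enumerate(self.shape) if i != dim),
        tuple(s for i, s in enumerate(source.shape) if i != dim),
    )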

test/test_torch.py: outdated review comment (resolved)
@tfsingh (Contributor, Author) commented May 9, 2023

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased add-err onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-err && git pull --rebase)

Comment on lines 344 to 345
auto self_sizes_tensor = at::tensor(self_sizes, index.options());
auto source_sizes_tensor = at::tensor(source_sizes, index.options());
Collaborator

Why create a tensor just to call .sizes()? Pass the sizes directly.

Contributor Author

I tried converting the sizes into IntArrayRefs with a constructor, but this wasn't working well. Is there a better way to go from a vector representation to the one that calling .sizes() on a tensor produces?

Collaborator

If you just pass the vectors, they should be cast automatically to IntArrayRefs via the implicit constructor.

@lezcano (Collaborator) commented May 9, 2023

The changes LGTM so far. Now there are a number of other tests that are failing. They fail because, for any given function, there are many other functions implemented in terms of it (e.g. those associated with vmap(fn), the decomposition associated with torch.compile, and perhaps a few others). The way to go about this is:

  • First, fix the test_ops.py -vk test_errors_index_add tests. That's related to the way you wrote the regex (perhaps it's too strict).
  • Then, try to find the other functions associated with the other tests. In order to find them:
    • git grep [name of function] your way through the code base
    • Read the tests and run through them with a debugger to see which code is executed.
  • Once you find a relevant function (e.g. the batching rule that makes vmap(index_add) work), it should be fairly easy to fix it by making it throw the same error.

Again, run the tests locally and make sure they pass before pushing if possible.

Good luck!

@ezyang added the triaged label ("This issue has been looked at by a team member, and triaged and prioritized into an appropriate module") on May 10, 2023
@tfsingh (Contributor, Author) commented May 10, 2023

Thank you for your review. I've fixed the regex issue, and am having some trouble with the broadcasting check (see comment above).

I've been looking into the batching rule fix, and I've narrowed it down to index_add_batch_rule in BatchRulesScatterOps. I've been erasing the batching dimension of self and other, along with the given dimension, and then checking for broadcastability, but this doesn't seem to work; erasing just the given dimension doesn't trigger the error. I've added an xfail to catch the failures for this CI run, after seeing the same existed for index_put -- perhaps this approach is acceptable. EDIT: This doesn't seem to be working; I think I'll mostly need your input here.

Finally, I might need to go to this Friday's office hours as my run_test.py isn't showing the same errors as the CI -- I'm trying to be mindful of CI runs, but feel as if my feedback loop is faulty without them.
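
For reference, a rough sketch of the kind of call the batching rule handles (shapes copied from the fallback test discussed above; whether this raises in the forward depends on the index_add_batch_rule fix being discussed, so treat it as a repro sketch rather than a statement of current behaviour):

import torch
from torch.func import vmap

x = torch.randn(2, 7, 11, 13)    # a batch of two "self" tensors
index = torch.tensor([0, 4, 2])
source = torch.randn(2, 3, 13)   # per-sample source of shape (3, 13)

# Each per-sample call is x[i].index_add(0, index, source[i]); once the batching
# rule performs the same check, this is expected to raise the same RuntimeError
# as the unbatched op for this mismatched source shape.
out = vmap(lambda s, v: s.index_add(0, index, v))(x, source)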

@pytorchmergebot (Collaborator)

Successfully rebased add-err onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-err && git pull --rebase)

@tfsingh (Contributor, Author) commented Jun 8, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Successfully merging this pull request may close these issues: index_add_ should error if shape of source is wrong

5 participants