
Make index_add_ error if input source shape is wrong #100321

Closed · wants to merge 23 commits

Conversation

@tfsingh (Contributor) commented Apr 29, 2023

Fixes #92576, checking the following as described in the documentation:

"source.shape[dim] == len(index) and source.shape[i] == self.shape[i] for i != dim"

Would be happy to iterate on this if there are any issues, and would be happy to implement the checking for the CUDA and MPS implementations of index_add_.
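
For concreteness, a minimal sketch of the documented rule (shapes chosen purely for illustration; the error case is only expected to raise once this PR's check is in place):

import torch

x = torch.zeros(5, 3)
index = torch.tensor([0, 4, 2])

ok = torch.randn(3, 3)       # ok.shape[0] == len(index) and ok.shape[1] == x.shape[1]
x.index_add_(0, index, ok)   # satisfies the rule

bad = torch.randn(3, 7)      # wrong size along the non-indexed dimension
# x.index_add_(0, index, bad)  # should raise a RuntimeError with this PR's check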

@pytorch-bot (bot) commented Apr 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100321

Note: Links to docs will display an error until the docs builds have been completed.

✅ 1 Unrelated Failure

As of commit 6f2954e:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tfsingh (Contributor, Author) commented Apr 29, 2023

The test fallback_masked_fill might need to be changed -- its implementation looks roughly like this (excluding the batching via vmap).

import torch

x = torch.randn(7, 11, 13)
dim = 0
index = torch.tensor([0, 4, 2])
values = torch.randn(3, 13)
# requires_grad_() is needed here: without it the error isn't thrown,
# even though the documentation says it should be.
ans = x.index_add(dim, index, values.requires_grad_())

However, when we run the backward pass, we see the same error as described in the original issue (all of this was done in a fresh Colab notebook).

ans.sum().backward()
RuntimeError: expand(torch.FloatTensor{[3, 11, 13]}, size=[3, 13]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (3)
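
For reference, continuing the snippet above: with self of shape (7, 11, 13), dim = 0 and 3 indices, the documented rule requires a source of shape (3, 11, 13), and with that shape both the forward and the backward go through without error:

values_ok = torch.randn(3, 11, 13, requires_grad=True)
ans = x.index_add(dim, index, values_ok)
ans.sum().backward()  # no error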

Since we'd like to do this check in the forward, I think we ought to change this test to expect the runtime error this PR throws. I suspect whatever's failing in the ONNX tests is similar, so I'll look at those too.

Would greatly appreciate your input @lezcano (apologies as well for the previous PR, which was botched, but I'm dedicated to getting this issue fixed).

@lezcano lezcano self-assigned this Apr 30, 2023
@lezcano (Collaborator) commented Apr 30, 2023

Those errors are failing on CUDA. You modified the CPU version of the function; now you just need to modify the CUDA version accordingly. You can either repeat the code or use a function that's shared between the two. Many functions share their checks, but yeah, this one repeats them, which is not great.

@lezcano lezcano self-requested a review April 30, 2023 10:17
@tfsingh (Contributor, Author) commented May 3, 2023

Okay, just added -- hoping this works, but am of course happy to iterate on it if not.

@tfsingh (Contributor, Author) commented May 4, 2023

Re: failing tests. I think the remaining ones still fall under the analysis in my earlier comment, in that they violate the invariant we're checking for. In the case of fallback_masked_fill, we are given a self of shape (7,11,13) and a source of shape (3,13). Omitting self[dim] and source[dim] gives shapes (11,13) and (13), which aren't broadcastable. The same issue exists for the ONNX tests, where we try to broadcast a self of shape (5,4,3) with a source of shape (1,3,3) along dim = 1 in both index_add_dim_size_differ and index_add_dynamic_axes.

I'm not sure if we should rewrite them to throw the expected error, or if there's a better course of action here.

@lezcano (Collaborator) commented May 4, 2023

> gives shapes (11,13) and (13), which aren't broadcastable.

These shapes are broadcastable. This means that the check needs to be a bit more generic to account for broadcasting.
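
A quick illustration of why (not from the thread, just the standard broadcasting rule as torch applies it):

import torch

# The missing leading dimension of the smaller shape is treated as 1,
# so (11, 13) and (13,) broadcast to (11, 13).
print(torch.broadcast_shapes((11, 13), (13,)))   # torch.Size([11, 13])

# A genuinely incompatible pair raises a RuntimeError:
# torch.broadcast_shapes((11, 13), (12,))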

@lezcano (Collaborator) left a comment

The changes should just be done in the meta impl. The meta impl is common to both the CPU and CUDA parts.

@tfsingh (Contributor, Author) commented May 7, 2023

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased add-err onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-err && git pull --rebase)

@tfsingh (Contributor, Author) commented May 7, 2023

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased add-err onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-err && git pull --rebase)

auto source_sizes = source.sizes().vec();
self_sizes.erase(self_sizes.begin() + dim);
source_sizes.erase(source_sizes.begin() + dim);
TORCH_CHECK(!(!self_sizes.empty() && source_sizes.empty()), "source tensor shape must match self tensor shape (excluding the specified dimension)");
Collaborator

This is not the check that we want. This does not check for broadcasting of any kind.

Contributor Author

I may be mistaken, but I believe infer_size_impl in ExpandUtils.cpp checks for broadcasting before we reach TensorAdvancedIndexing. Therefore, I believe this check covers the only remaining dimensionality failure case: if we reach this point in the program flow, we can assume the shapes are broadcastable.

If I'm incorrect, I'll need help with how to vectorize the broadcastability check as you requested earlier (in which case we should also update infer_size_impl, since it checks using a for loop).

Collaborator

This function is called from

TORCH_PRECOMPUTE_META_FUNC(index_add)
(const Tensor& self, int64_t dim, const Tensor& index, const Tensor& source, const Scalar& alpha) {
  dim = maybe_wrap_dim(dim, self.dim());
  index_func_meta_impl(*this, self, dim, index, source, "index_add");
  return TORCH_PRECOMPUTE_STRUCT(index_add)().set_dim(dim);
}

(among other places), so there is no check here that source has the correct size.

You don't need to implement the broadcasting check yourself. There are utility functions to do this. I don't remember the name of them though, but shouldn't be difficult to find.
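
To make the intent concrete, here is a rough Python analogue of the kind of check being discussed (a sketch only: the helper name below is made up for illustration, and the real check would live in the ATen meta function and its C++ shape utilities):

import torch

def check_index_add_shapes(self, dim, index, source):
    # hypothetical helper, for illustration only
    # source must provide one slice per index along `dim`
    if source.shape[dim] != index.numel():
        raise RuntimeError("source.shape[dim] must equal the number of indices")
    # the remaining dims of source must broadcast against the remaining dims of self;
    # broadcast_shapes raises a RuntimeError if they don't
    torch.broadcast_shapes(
        tuple(s for i, s in enumerate(self.shape) if i != dim),
        tuple(s for i, s in enumerate(source.shape) if i != dim),
    )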

test/test_torch.py: outdated review comment (resolved)
@tfsingh (Contributor, Author) commented May 9, 2023

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased add-err onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-err && git pull --rebase)

Comment on lines 344 to 345
auto self_sizes_tensor = at::tensor(self_sizes, index.options());
auto source_sizes_tensor = at::tensor(source_sizes, index.options());
Collaborator

Why create a tensor just to call .sizes()? Pass the sizes directly.

Contributor Author

I tried converting the sizes into IntArrayRefs with a constructor, but this wasn't working well. Is there a better way to go from a vector representation to the one that calling .sizes() on a tensor produces?

Collaborator

If you just pass the vectors, they should be cast automatically to IntArrayRefs via the implicit constructor.

@lezcano (Collaborator) commented May 9, 2023

The changes LGTM so far. Now there are a number of other tests that are failing. They fail because, for any given function, there are many other functions implemented in terms of it (e.g. those associated with vmap(fn), the decomposition associated with torch.compile, and perhaps a few others). The way to go about this is:

  • First, fix the test_ops.py -vk test_errors_index_add tests. That's related to the way you wrote the regex (perhaps it's too strict).
  • Then, try to find the other functions associated with the other tests. In order to find them:
    • git grep [name of function] your way through the code base
    • Read the tests and run through them with a debugger to see which code is executed.
  • Once you find a relevant function (e.g. the batching rule that makes vmap(index_add) work), it should be fairly easy to fix it by making it throw the same error.

Again, run the tests locally and make sure they pass before pushing if possible.

Good luck!

@ezyang added the triaged label ("This issue has been looked at by a team member, and triaged and prioritized into an appropriate module") on May 10, 2023
@tfsingh (Contributor, Author) commented May 10, 2023

Thank you for your review. I've fixed the regex issue, and am having some trouble with the broadcasting check (see comment above).

I've been looking into the batching rule fix, and I've narrowed it down to index_add_batch_rule in BatchRulesScatterOps. I've been erasing the batching dimension of self and other, along with the given dimension, and then checking for broadcastability, but this doesn't seem to work; erasing just the given dimension doesn't trigger the error. I've added an xfail to catch the failures for this CI run, after seeing the same existed for index_put -- perhaps this approach is acceptable. EDIT: This doesn't seem to be working; I think I'll mostly need your input here.

Finally, I might need to go to this Friday's office hours as my run_test.py isn't showing the same errors as the CI -- I'm trying to be mindful of CI runs, but feel as if my feedback loop is faulty without them.
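
For reference, a rough sketch of the kind of call the batching rule handles (shapes copied from the fallback test discussed above; whether this raises in the forward depends on the index_add_batch_rule fix being discussed, so treat it as a repro sketch rather than a statement of current behaviour):

import torch
from torch.func import vmap

x = torch.randn(2, 7, 11, 13)    # a batch of two "self" tensors
index = torch.tensor([0, 4, 2])
source = torch.randn(2, 3, 13)   # per-sample source of shape (3, 13)

# Each per-sample call is x[i].index_add(0, index, source[i]); once the batching
# rule performs the same check, this is expected to raise the same RuntimeError
# as the unbatched op for this mismatched source shape.
out = vmap(lambda s, v: s.index_add(0, index, v))(x, source)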

@pytorchmergebot (Collaborator)

Successfully rebased add-err onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout add-err && git pull --rebase)

@tfsingh (Contributor, Author) commented Jun 8, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Successfully merging this pull request may close these issues: index_add_ should error if shape of source is wrong

5 participants