RFC: add support for repeating each element of an array #654

Closed
kgryte opened this issue Jul 13, 2023 · 15 comments · Fixed by #690

@kgryte
Contributor

kgryte commented Jul 13, 2023

This RFC proposes adding support to the array API specification for repeating each element of an array.

Overview

Based on array comparison data, the API is available in most array libraries. The main exception is PyTorch, which deviates in its naming convention (repeat_interleave vs. NumPy et al.'s repeat).

Prior art

Proposal

def repeat(x: array, repeats: Union[int, Sequence[int], array], /, *, axis: Optional[int] = None)
  • repeats: the number of repetitions for each element.

    If axis is not None,

    • if repeats is an array, repeats.shape must broadcast to x.shape[axis].
    • if repeats is a sequence of ints, len(repeats) must broadcast to x.shape[axis].
    • if repeats is an integer, repeats must be broadcast to match the size of the specified axis.

    If axis is None,

    • if repeats is an array, repeats.shape must broadcast to prod(x.shape).
    • if repeats is a sequence of ints, len(repeats) must broadcast to prod(x.shape).
    • if repeats is an integer, repeats must be broadcast to match the size of the flattened array.
  • axis: specifies the axis along which to repeat values. If None, use a flattened input array and return a flat output array.
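
For concreteness, here is a minimal usage sketch of the proposed semantics. It uses NumPy's existing np.repeat, which already behaves as proposed, as a stand-in for the standardized function:

    import numpy as np  # np.repeat already implements the proposed semantics

    x = np.array([[1, 2], [3, 4]])

    # Scalar repeats with axis=None: flatten in row-major order, repeat each element twice.
    np.repeat(x, 2)                       # array([1, 1, 2, 2, 3, 3, 4, 4])

    # Per-element repeats along an axis: len(repeats) matches x.shape[0].
    np.repeat(x, [1, 3], axis=0)          # rows: [1, 2], [3, 4], [3, 4], [3, 4]

    # repeats as an integer array whose length matches the flattened size.
    np.repeat(x, np.array([1, 0, 2, 1]))  # array([1, 3, 3, 4])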

Questions

  • Both PyTorch and JAX support a kwarg for specifying the output size in order to avoid stream synchronization (PyTorch) and to allow compilation (JAX). Without such kwarg support, is this API viable? And what are the reasons for needing this kwarg when other array libraries (TensorFlow) omit such a kwarg?
  • When flattening the input array, flatten in row-major order? (precedent: nonzero)
  • Is PyTorch okay adding a repeat function in its main namespace, given the divergence in behavior for torch.Tensor.repeat, which behaves similarly to np.tile? (A short comparison sketch follows this list.)
  • CuPy only allows int, List, and Tuple for repeats, not an array. PyTorch may prefer a list of ints (see Unnecessary cuda synchronizations that we should remove in PyTorch pytorch/pytorch#108968).
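
To make the third question concrete, a small comparison sketch of current NumPy and PyTorch behaviour (as I understand it):

    import numpy as np
    import torch

    x = np.array([1, 2, 3])
    t = torch.tensor([1, 2, 3])

    np.repeat(x, 2)                 # array([1, 1, 2, 2, 3, 3])   -- per-element repetition
    t.repeat(2)                     # tensor([1, 2, 3, 1, 2, 3])  -- tiles the whole tensor, like np.tile
    torch.repeat_interleave(t, 2)   # tensor([1, 1, 2, 2, 3, 3])  -- matches np.repeat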

Related

@kgryte kgryte added API extension Adds new functions or objects to the API. topic: Manipulation Array manipulation and transformation. labels Jul 13, 2023
@kgryte kgryte added this to the v2023 milestone Jul 13, 2023
@rgommers
Member

rgommers commented Jul 13, 2023

Both PyTorch and JAX support a kwarg for specifying the output size in order to avoid stream synchronization (PyTorch) and to allow compilation (JAX). Without such kwarg support, is this API viable? And what are the reasons for needing this kwarg when other array libraries (TensorFlow) omit such a kwarg?

I think because the repeats and axis keywords may not be int literals but may instead be derived from other variables. At that point, they can't be known statically. EDIT: it looks like one more case of "output shape depends on value of an input parameter".
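
For reference, a sketch of how those keywords look today in JAX and PyTorch (exact signatures may change; both calls produce [1, 2, 2, 3, 3, 3]):

    import jax.numpy as jnp
    import torch

    # JAX: total_repeat_length gives jit a static output shape to trace with.
    jnp.repeat(jnp.array([1, 2, 3]), jnp.array([1, 2, 3]), total_repeat_length=6)

    # PyTorch: output_size avoids a device sync when repeats lives on an accelerator.
    torch.repeat_interleave(torch.tensor([1, 2, 3]),
                            torch.tensor([1, 2, 3]),
                            output_size=6)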

When flattening the input array, flatten in row-major order? (precedent: nonzero)

I think we should always do that. There are more places where flattening happens, and they should always be row-major I'd think.

Is PyTorch okay adding a repeat function in its main namespace, given the divergence in behavior for torch.Tensor.repeat, which behaves similarly to np.tile?

Probably yes, but it's a relatively slow process to do so. The issue for it is pytorch/pytorch#50013, and in the tracker issue for numpy compat linked from there, it says:

torch.Tensor.repeat is actually np.tile, not np.repeat

    TODO: warn that users should use tile instead of repeat
    TODO (Release 2): remove torch.Tensor.repeat
    TODO (Release 3): implement torch.repeat to be compatible with np.repeat

Of course the longer that is not executed on, the less appetite there will be to do it later on.

@rgommers
Member

I think because the repeats and axis keywords may not be int literals but may instead be derived from other variables. At that point, they can't be known statically. EDIT: it looks like one more case of "output shape depends on value of an input parameter".

Strangely enough, PyTorch and JAX don't have the same keyword for tile (xref gh-655).

@rgommers
Member

I looked at the PyTorch situation again; it looks a little annoying. torch.repeat could in principle be added now, with behavior compatible with np.repeat and not with the Tensor.repeat method. I think the equally annoying thing is that it's not supported as of now, so it's just more work and no one is lining up to do that work.

@lezcano do you have any thoughts on this one?

@lezcano
Contributor

lezcano commented Jul 27, 2023

Having a method and a free function that have different behaviour is not a good idea (if I understood the proposal correctly). The whole update process is also a no-go, as we'd need to do it over 3 releases, and the time between releases has now gone up (unofficially) to almost a year.

Now, this should not stop the array API from standardising this, as this function is pretty much repeat_interleave. One should just be careful to cast the tuple into a tensor and populate the output_size kwarg correctly.

nb. it doesn't make any sense that in PyTorch's repeat_interleave the repeats arg is a tensor and not a tuple...
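
As a rough illustration of the approach described above (a sketch only, not PyTorch's actual implementation; edge-case handling is simplified):

    from typing import Optional, Sequence, Union

    import torch

    def repeat(x: torch.Tensor,
               repeats: Union[int, Sequence[int], torch.Tensor],
               /, *, axis: Optional[int] = None) -> torch.Tensor:
        # Precompute output_size so repeat_interleave does not have to
        # synchronize just to learn the length of the result.
        if isinstance(repeats, int):
            n = x.numel() if axis is None else x.shape[axis]
            output_size = repeats * n
        else:
            repeats = torch.as_tensor(repeats, device=x.device)
            output_size = int(repeats.sum())  # forces a sync if repeats is on a GPU
        return torch.repeat_interleave(x, repeats, dim=axis, output_size=output_size)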

@oleksandr-pavlyk
Contributor

I argue we should allow the value of repeats to be an array as well (we should not insist that it be a tuple of ints).

Offloading libraries must convert this vector and transfer data to an accelerator during the implementation anyway. Supporting array input would unlock certain use cases, such as programmatic generation of repeats (random sampling, procedural creation, etc.).

@kgryte
Contributor Author

kgryte commented Sep 7, 2023

@oleksandr-pavlyk I think this is a bug in the proposal. repeats should be allowed to be either an int or an array of ints.

@vadimkantorov

Supporting arrays is also useful for cases like PyTorch, where there might be improvements if the repeats/counts argument is already a tensor obtained from elsewhere, and especially if it's already on the target device.

@lezcano
Contributor

lezcano commented Sep 10, 2023

@vadimkantorov What do you mean by "there might be improvements"? There are no improvements, and it may even be detrimental to pass a tensor to this function if it's in CUDA.
If you think otherwise, it'd be useful to show concrete examples.

@leofang
Contributor

leofang commented Sep 22, 2023

Hi all, this RFC (and PR #690) was briefly discussed in the Array API meeting yesterday. Apologies to @kgryte, @oleksandr-pavlyk, and @seberg for comments I made in the meeting without fully understanding the sticking points here. Below is my updated analysis & response.

As already pointed out earlier in this thread, repeat() is challenging in that the returned shape depends dynamically on the inputs, unlike most APIs that have a predetermined contract amenable to static analysis. So this remains an unaddressed question (already documented in the RFC).

Moreover, with regard to the repeats argument taking an array object, I made a comment yesterday that stream ordering should help. In theory this would be true, but it involves implementation details of an array library. Specifically, array metadata (shape/strides/...) is usually stored on the host (as in CuPy/PyTorch), so in such cases a blocking D2H copy is unavoidable for these libraries to get this info on the host and allocate an output array accordingly, and preserving stream ordering is not sufficient.

However, again, some changes in the implementation can help. For example, if the array library can store array metadata in CUDA unified, pinned, or heterogeneous memory, preserving stream ordering alone is sufficient and no blocking copy is needed. (Consider that the following CUDA operations are ordered: repeats created on device -> array allocated by zero-copy accessing repeats from the host via a stream-wait event & callback.) So it's not the end of the world, and array libraries can evolve to work around implementation hurdles.

API-wise, there are no issues. The semantics are well defined, pending the row-major discussion, which is a separate topic.

CuPy chose not to support taking arrays for repeats for the same reason (cupy/cupy#1918, cupy/cupy#3849); sorry I didn't answer it right yesterday. But, since it's generally not possible to guarantee (despite our best effort) asynchronicity for all APIs, I am not convinced we cannot change this should the standard accept it (cc: @kmaehashi). On the CuPy side, we can always document that a certain combination of arguments may sync the device.

It has been a longstanding, recurrent user request to support repeats taking an array (see the above CuPy issues for example; this RFC and earlier comments in this thread raise similar requests). I believe it is the right choice for user ergonomics. I disagree with @lezcano that

There are no improvements

The UX improvement is large and also anticipated by many downstream developers and users. So, as long as we address the dynamic shape issue (I suspect we'd end up adding a remark, similar to what we did for the unique*() APIs), I would support it.

Since @jpivarski had kindly offered a partial workaround to CuPy, I'd like to have Jim chime in. I'd also be eager to have JAX developers involved; maybe @jakevdp can offer insight?

@lezcano
Contributor

lezcano commented Sep 22, 2023

What about libraries that create compiled code, like JAX or torch.compile, or libraries that use CUDA graphs? In this function, you cannot even give a bound on the size of the output without looking at repeats, so this function cannot be compiled, certainly not into a CUDA graph.

Regarding UX, my point is that if you have a repeats tensor, you can always call tensor.tolist() in PyTorch and pass the result to this function. This is in line with "explicit is better than implicit".
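
A tiny sketch of that explicit pattern (the host transfer in .tolist() is exactly the cost seberg raises below):

    import torch

    x = torch.tensor([10, 20, 30])
    repeats = torch.tensor([1, 2, 3])   # e.g. computed on-device earlier

    counts = repeats.tolist()           # explicit, blocking device-to-host copy: [1, 2, 3]
    out = torch.repeat_interleave(x, torch.tensor(counts, device=x.device))
    # out: tensor([10, 20, 20, 30, 30, 30])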

@leofang
Contributor

leofang commented Sep 22, 2023

I argue that belongs to the dynamic shape issue, which is a separate discussion that we can/should certainly have. But given that the array API has been dealing with it already, repeat() is not the first instance and should stick to the same design principles that we apply throughout the standard.

@vadimkantorov

For these cases, it might also be good to allow the user to pass in out= or some other way to specify the maximum allowable shape (maybe as was done with unique). Often in practice, the user will have set some upper bound.

@seberg
Contributor

seberg commented Sep 22, 2023

Thanks for digging into that, @leofang; so the problem is the synchronization due to the unknown size (and maybe also the fact that the array may be mutated).

From a NumPy perspective, I have to say I find it very strange to ask the user to call .tolist() on a potentially large integer array, something that makes certain use cases tricky/slow (to the point where I wonder how many usages in library contexts become awkward).

@oleksandr-pavlyk
Contributor

Nit: it should probably say "if repeats is an array, repeats.shape must broadcast to (prod(x.shape),)"

@kgryte
Contributor Author

kgryte commented Oct 19, 2023

Okay, I think we are getting to some semblance of consensus here. Namely,

  • repeat should support accepting an int, a sequence of ints, or an integer array. Accepting an array aligns with user expectations, is already supported by most array libraries (NumPy, TF, JAX, et al), and is a requested feature in other libraries lacking support for providing an array.
  • we should add a note in the specification (and in the documentation of array libraries supporting GPUs) that providing an array can result in potential performance degradation due to device synchronization when computing the output array size.

Are we okay with this path forward?

cc @seberg @leofang @lezcano @oleksandr-pavlyk
