Modeling of shape queries #1133

Open
1 task
Tracked by #1039
jjsjann123 opened this issue Sep 10, 2024 · 4 comments
Labels: design (This is a largish feature / design), enhancement (New feature or request)

Comments

@jjsjann123
Collaborator

jjsjann123 commented Sep 10, 2024

🚀 Feature

How to model access to the shape attribute of Tensor/TensorProxy has come up repeatedly in separate PRs & discussions.

The question is whether we want to lift shape inference logic into the prologue trace in Thunder in general.

I.e., for a program

   def foo(...):
      # assume we computed a tensor `t0`
      t1 = t0.reshape(t0.size(0), -1)

If we leave all shape logic in the computation trace, it would be simplified as below:

    # produce t0 from some earlier trace
    (i0, i1) = prims.shape(t0)

    i2 = prims.mul(1, i0)
    i3 = prims.mul(i2, i1)  # this is t0.numel

    i4 = prims.mul(1, i0)
    i5 = prims.div(i3, i4)  # this is the simplified logic in clang.reshape with `-1` in the entry
    t1 = prims.reshape(t0, (i0, i5))
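
For intuition, the i2..i5 arithmetic above (numel divided by the product of the known extents) can be sketched in plain Python. `resolve_reshape` is a hypothetical helper for illustration, not Thunder's actual `clang.reshape`:

```python
import math

def resolve_reshape(shape, target):
    # Hypothetical helper mirroring the i2..i5 arithmetic in the trace:
    # the -1 entry resolves to numel // (product of the known extents).
    numel = math.prod(shape)                          # i3 in the trace
    known = math.prod(d for d in target if d != -1)   # i4 in the trace
    return tuple(numel // known if d == -1 else d for d in target)

print(resolve_reshape((4, 2, 3), (4, -1)))  # -> (4, 6)
```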

One alternative is to lift all shape logic in the prologue trace, so we'll have

@prologue trace:
foo(...):
    # all the shape logic to compute i0 & i1 from original input

    i2 = prims.mul(1, i0)
    i3 = prims.mul(i2, i1)  # this is t0.numel

    i4 = prims.mul(1, i0)
    i5 = prims.div(i3, i4)  # this is the simplified logic in clang.reshape with `-1` in the entry  
    return (..., i0, i5, ...)  # NOTE: we are not necessarily passing i0 / i5, it could be any equivalent symbols.

@compute trace:
foo(..., i0, i5, ...):
    # compute t0
    t1 = prims.reshape(t0, (i0, i5))  # here we cannot see that i0 is equivalent to t0.shape[0]
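
As a toy illustration of this split (plain Python with hypothetical names; list slicing stands in for `prims.reshape`), the prologue does all the shape arithmetic and the compute function receives the extents as opaque scalars:

```python
def prologue_trace(t0_shape):
    # All shape logic lives here; i0/i5 mirror the symbols above.
    i0, i1 = t0_shape
    numel = 1 * i0 * i1        # i2, i3
    i5 = numel // (1 * i0)     # i4, i5: resolves the -1 entry
    return i0, i5

def compute_trace(t0_flat, i0, i5):
    # i0 arrives as a plain scalar; nothing here records that it equals
    # t0's leading extent, so an executor must re-validate the reshape.
    return [t0_flat[r * i5:(r + 1) * i5] for r in range(i0)]

i0, i5 = prologue_trace((2, 3))
print(compute_trace(list(range(6)), i0, i5))  # -> [[0, 1, 2], [3, 4, 5]]
```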

I think the first version, where the shape operations are visible in the computation trace, would simplify the generated kernel, since we do not need to resolve/validate the reshape concretization.

  • Follow up with codegen example on the impact of shape operation vs opaque scalar reshape.
@jjsjann123 jjsjann123 added the enhancement (New feature or request) and design (This is a largish feature / design) labels on Sep 10, 2024
@jjsjann123
Collaborator Author

Logging offline suggestions by @mruberry:

  • It's OK to have prims.shape appear in the trace, which allows overlapping CPU and GPU computation; if it's in the prologue, that overlap won't happen;

I think this justifies #1113. cc'ing @t-vi

  • It'd be nice if each shape query appeared once per tensor it is needed for, at the global scope, and in a place that doesn't disrupt fusions or require fusion executors like nvFuser to reason about the presence of the shape calls

I think this just requires a decent DCE pass to aggregate the shape queries. A shape query in Python is not trivial; it takes on the order of a microsecond.
Meanwhile, I'm uncertain whether we would want to hide shape calls from fusion executors. We need to evaluate the impact on generated kernels as well as the overhead on the cache.
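
To put a rough number on the per-query cost, a microbenchmark sketch (`FakeTensor` is a hypothetical stand-in, not a Thunder proxy; absolute numbers vary by machine):

```python
import timeit

class FakeTensor:
    # Hypothetical stand-in: a shape query is a Python attribute lookup
    # plus tuple construction, typically sub-microsecond to a few
    # microseconds per call, which adds up without aggregation/DCE.
    def __init__(self, shape):
        self._shape = tuple(shape)

    @property
    def shape(self):
        return tuple(self._shape)

t = FakeTensor((4, 2, 3))
n = 100_000
per_call_us = timeit.timeit(lambda: t.shape, number=n) / n * 1e6
print(f"~{per_call_us:.2f} us per shape query")
```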

  • It should rely on direct comparisons of tuples of numbers and symbolic values to determine whether two shapes are the same, and not try to infer that they are the same because they have a common provenance

I agree that this should be the principle for how constraints are inserted. But if provenance can be used to simplify such constraints (e.g. equivalence, non-negativity), I think it should still be leveraged.
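
A sketch of that principle (`Sym` is a hypothetical symbolic extent): two shapes are equal iff their entries compare equal elementwise, regardless of how each entry was produced:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    # Hypothetical symbolic extent; equality is by name, not by the
    # provenance of the expression that produced it.
    name: str

def shapes_equal(a, b):
    # Direct elementwise comparison of numbers and symbolic values.
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

i0 = Sym("i0")
print(shapes_equal((i0, 4), (Sym("i0"), 4)))  # True
print(shapes_equal((i0, 4), (Sym("i1"), 4)))  # False
```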

@jjsjann123 jjsjann123 self-assigned this Sep 10, 2024
@t-vi
Collaborator

t-vi commented Sep 10, 2024

So currently, the prologue performs exactly two things:

  • collecting tensors (and possibly soon other inputs) for the computation trace
  • checking things

I wonder if it would be good to keep things this way, in particular not having the prologue compute things for the computation trace. I'm not saying it should be this way or not, but just that if we change it, it should be a very deliberate choice.

@mruberry
Collaborator

> So currently, the prologue performs exactly two things:
>
>   • collecting tensors (and possibly soon other inputs) for the computation trace
>   • checking things
>
> I wonder if it would be good to keep things this way, in particular not having the prologue compute things for the computation trace. I'm not saying it should be this way or not, but just that if we change it, it should be a very deliberate choice.

Some computations are so closely related to "checks" that it would reduce the total CPU work to compute them in the prologue.

In general I still like thinking about computing everything possible in the prologue. We have a lot of opportunities to optimize the performance of prologues. @jjsjann123 is completely correct that computing everything in the prologue does not take advantage of overlapping CPU and GPU computation, and it's interesting to see if we can do that effectively.

@t-vi
Collaborator

t-vi commented Sep 10, 2024

> Some computations are so closely related to "checks" that it would reduce the total CPU work to compute them in the prologue.
>
> In general I still like thinking about computing everything possible in the prologue. We have a lot of opportunities to optimize the performance of prologues. @jjsjann123 is completely correct that computing everything in the prologue does not take advantage of overlapping CPU and GPU computation, and it's interesting to see if we can do that effectively.

I absolutely agree.

The key property we need to keep here is the "thinness" of Thunder. By this I mean that if a user of Thunder knows they're calling the same computation 50x with controlled inputs (e.g. in the training loop), they can run the prologue once and then rely on the compute function working for them.

Note that if we anticipate power users of Thunder to do this, we also save them computation by putting it in the prologue, which might be even more attractive than overlapping with GPU.
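
The pattern described above can be sketched as follows, where `prologue` and `compute` are hypothetical stand-ins for Thunder's trace split:

```python
def prologue(xs):
    # Checks and unpacking are done once per controlled input set.
    assert isinstance(xs, list) and xs, "expected a non-empty list"
    return (xs,)

def compute(args):
    # Relies on the prologue having validated/unpacked the inputs.
    (xs,) = args
    return [v * 2 for v in xs]

args = prologue([1, 2, 3])
outs = [compute(args) for _ in range(50)]  # training loop reuses validated args
print(outs[0])  # -> [2, 4, 6]
```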
