Inline singleton splats #36169

Closed
wants to merge 1 commit into from
Conversation

@Keno (Member) commented Jun 6, 2020

As noted in #36087 and #29114, splatting integers currently carries an
unexpected performance penalty. For tuples and SimpleVectors, we have
special-purpose inliners that simply inline the tuple/SimpleVector into
the call being splatted. However, for everything else we would have to
run the iteration protocol to find out which values to substitute. This
change does just that, limited to the case of length-1 (and empty)
iterables. Benchmark:

```
f(x) = (x...,)
@code_typed f(1)
@benchmark f(1)
```

Before:
```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64}
└──      return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     209.357 ns (0.00% GC)
  median time:      213.404 ns (0.00% GC)
  mean time:        218.674 ns (0.16% GC)
  maximum time:     1.922 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     540
```

After:
```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = invoke Base.iterate(_2::Int64)::Tuple{Int64,Nothing}
│   %2 = (getfield)(%1, 1)::Int64
│   %3 = (getfield)(%1, 2)::Nothing
│        invoke Base.iterate(_2::Int64, %3::Nothing)::Nothing
│   %5 = Core.tuple(%2)::Tuple{Int64}
└──      return %5
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.044 ns (0.00% GC)
  median time:      3.047 ns (0.00% GC)
  mean time:        3.049 ns (0.00% GC)
  maximum time:     7.700 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Obviously this isn't 100% optimal yet, because the `iterate` calls themselves
don't get inlined, but it's a lot better. Inlining the `iterate` calls is
left for a follow-up commit.
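
To make the rewrite concrete, the hand-written function below mirrors what the new inliner emits for a length-1 splat. This is only an illustrative sketch: the names `f_splat` and `f_manual` are invented here, and the real transformation is performed on typed IR inside the optimizer rather than in source code.

```
# Illustrative sketch only; f_splat/f_manual are made-up names for this example.
f_splat(x) = (x...,)          # lowers to Core._apply_iterate(Base.iterate, Core.tuple, x)

function f_manual(x)
    y = Base.iterate(x)       # (first element, iterator state)
    v = getfield(y, 1)
    s = getfield(y, 2)
    # The rewrite only applies when inference proves that iteration
    # terminates after one element:
    Base.iterate(x, s) === nothing || error("not a length-1 iterable")
    return Core.tuple(v)
end

f_manual(1) == f_splat(1)     # both return (1,)
```

Written this way, the only remaining cost is the two not-yet-inlined `iterate` calls, which is exactly what the "After" IR above shows.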
@JeffBezanson added the compiler:optimizer (Optimization passes, mostly in base/compiler/ssair/) and performance (Must go faster) labels Jun 6, 2020
@martinholters (Member)

I don't know my way around the optimizer well enough to review in detail, but IIUC `analyze_type_for_apply_rewrite` duplicates a bit of work already done in `abstract_iteration` during inference. Is that just due to a lack of infrastructure to pipe the information from the latter to the former, or is there some general need to re-establish the types obtained during iteration? (Not that it matters much now, but if we want to get fancier with the inliner in the future, it would be unfortunate to duplicate more of that logic.)

@Keno (Member, Author) commented Jun 8, 2020

Yes, we don't currently have any infrastructure to forward sideband information from inference to the optimizer (other than the cache for method lookups that we are using here). I'd like to improve that in the future, but for now this is the best we can do.

@martinholters (Member)

Thanks for confirming. (And by all means, that does not mean I disapprove of this PR in its current form in any way.)

Keno added a commit that referenced this pull request Jul 2, 2020
This change attempts to be a solution to the generalized problem
encountered in #36169. In short, we do a whole bunch of analysis
during inference to figure out the final type of an expression,
but sometimes we also need intermediate results that were
computed along the way. So far, we don't really have a great
place to put those results, so we end up having to re-compute
them during the optimization phase. That's what #36169 did,
but that is clearly not a scalable solution.

I encountered the exact same issue while working on a new AD
compiler plugin that needs to do a whole bunch of work during
inference to determine what to do (e.g. call a primitive, recurse,
or increase the derivative level), and optimizations need to have
access to this information.

This PR adds an additional `info` field to `CodeInfo` and `IRCode`
that can be used to forward this kind of information. As a proof
of concept, it forwards method match info from inference to
inlining (we already cache these, so there's little performance
gain from this per se; it's more to exercise the infrastructure).

The plan is to do an alternative fix to #36169 on top of this
as the next step, but I figured I'd open it up for discussion first.
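
To make the idea concrete, here is a deliberately simplified sketch of what such per-statement sideband information could look like. The types and names below (`ToyCallInfo`, `ToyIR`, `inlineable_targets`) are invented for illustration and are not the actual compiler internals; the point is only that inference fills an `info` vector that runs parallel to the statements, and a later pass reads it back instead of re-deriving the same facts.

```
# Invented toy types; not the real Base compiler definitions.
struct ToyCallInfo
    resolved::Vector{Any}    # e.g. the method matches inference found for this call
end

struct ToyIR
    stmts::Vector{Any}                          # the IR statements
    info::Vector{Union{Nothing,ToyCallInfo}}    # one sideband slot per statement
end

# A later pass (e.g. inlining) consults the recorded info rather than
# re-running the analysis:
function inlineable_targets(ir::ToyIR, idx::Int)
    ci = ir.info[idx]
    return ci === nothing ? Any[] : ci.resolved
end
```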
Keno added a commit that referenced this pull request Jul 15, 2020
This supersedes #36169. Rather than re-implementing the iteration
analysis as done there, this uses the new stmtinfo infrastructure
to propagate all the analysis done during inference all the way
to inlining. As a result, it applies not only to splats of
singletons, but also to splats of any other short iterable
that inference can analyze. E.g.:

```
f(x) = (x...,)
@code_typed f(1=>2)
@benchmark f(1=>2)
```

Before:
```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64,Int64}
└──      return %1
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  3
  --------------
  minimum time:     242.659 ns (0.00% GC)
  median time:      246.904 ns (0.00% GC)
  mean time:        255.390 ns (1.08% GC)
  maximum time:     4.415 μs (93.94% GC)
  --------------
  samples:          10000
  evals/sample:     405
```

After:
```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Base.getfield(x, 1)::Int64
│   %2 = Base.getfield(x, 2)::Int64
│   %3 = Core.tuple(%1, %2)::Tuple{Int64,Int64}
└──      return %3
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.701 ns (0.00% GC)
  median time:      1.925 ns (0.00% GC)
  mean time:        1.904 ns (0.00% GC)
  maximum time:     6.941 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

I also implemented the TODO I had left in #36169 to inline
the `iterate` calls themselves, which gives another 3x
improvement over the solution in that PR:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core.tuple(x)::Tuple{Int64}
└──      return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.696 ns (0.00% GC)
  median time:      1.699 ns (0.00% GC)
  mean time:        1.702 ns (0.00% GC)
  maximum time:     5.389 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Fixes #36087
Fixes #29114
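
For anyone who wants to verify the effect locally, a simple allocation check is enough. The snippet below assumes a Julia build that includes this optimization (on older builds the allocation count will be nonzero), and `g` is just a throwaway name for this example.

```
g(p) = (p...,)
g(1 => 2)                        # warm up so compilation isn't measured
n = @allocated g(1 => 2)
println("bytes allocated: ", n)  # expected to print 0 once the splat is inlined
```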
@Keno (Member, Author) commented Jul 15, 2020

Superseded by #36684

@Keno closed this Jul 15, 2020
@DilumAluthge deleted the kf/inline_iterate_singleton branch March 25, 2021 21:57