Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PRE for multi-level field loads in a loop #5110

Merged
merged 5 commits into from
May 12, 2018

Conversation

rajatd
Copy link
Contributor

@rajatd rajatd commented May 7, 2018

This PR enhances the existing Partial Redundancy Elimination (PRE) infrastructure to catch multi-level field loads, i.e., field loads of the form o.x.y.

Challenges:

  1. Loading o.x.y translates to roughly the following IR:

             T1 = o.x
             T2 = T1.y
    

    In order to put a load of T1.y in the loop landing pad, we need T1 to be live there. But, since it’s a temp its not live in the landing pad.

  2. In the prepass, when we create initial values for o.x and T1.y, we give them new symStores (because we aren't sure what symStore will those syms end up with at the end of the loop). Symbol-to-value map entries look something like this:

          symbol -> value -> symstore
               o.x -> v1 -> T3
               T1.y ->v2 -> T4
    

    The PRE code that runs after the loop prepass, then uses these symStores to preload the PRE candidate property syms (o.x and T1.y in the above example). Assuming we could solve the previous problem by inserting T1 = o.x in the landing pad, landing pad would look something like this:

             T3 = o.x
             T1 = o.x
             T4 = T1.y
    

    But, now when we optimize the landing pad, T1.y gets object-pointer copy-propped to T3.y. So, now we have T3.y live coming in to the loop, but T1.y live on the back-edge. Since, these are two different property syms, they don’t produce a value for any .y during merging info on the loop header. This defeats the purpose of putting an initial value for T1.y in the landing pad.

Solutions:

  1. For making T1 live in the landing pad, I rely on it being single-def and go with the premise that if its single def, I should be able to insert its defining instruction in the landing pad, as long as the rhs of the instruction is live in the landing pad; if its not, I try to make the rhs live by either:
    a. Processing it as a PRE candidate (if it was one), or
    b. recursing on the same logic as above for making it live.

  2. Since, T3 is the symstore for o.x on the back-edge, and we insert T3 = o.x in the landing pad, T3 will eventually object-pointer copy-prop into T2 = T1.y (making it T2 = T3.y) during the main pass of the loop. To then field copy prop T3.y, we need T3.y to have a value on the loop header and to have that, we need T3.y live on the back edge.
    And that is exactly what I do - when preloading T1.y in the landing pad, I recognize that T3 is going to be the copy-prop sym for T1 and I make T3.y live in the landing pad and all the back edges.

Perf

Kraken                            Left run time      Right run time     ? Run time  ? Run time %  Comment
--------------------------------  -----------------  -----------------  ----------  ------------  ---------------
Ai-astar                           203.80 ms ±0.90%   205.83 ms ±0.92%     2.03 ms         1.00%
Audio-beat-detection               105.00 ms ±0.95%   102.50 ms ±0.49%    -2.50 ms        -2.38%
Audio-dft                          141.89 ms ±0.99%   140.75 ms ±0.18%    -1.14 ms        -0.80%
Audio-fft                           74.50 ms ±0.70%    75.91 ms ±0.74%     1.41 ms         1.89%
Audio-oscillator                    73.63 ms ±0.43%    69.25 ms ±0.24%    -4.38 ms        -5.94%  Improved
Imaging-darkroom                   177.83 ms ±0.92%   176.33 ms ±0.68%    -1.50 ms        -0.84%
Imaging-desaturate                  75.57 ms ±0.49%    75.43 ms ±0.76%    -0.14 ms        -0.19%
Imaging-gaussian-blur              189.25 ms ±0.13%   165.60 ms ±0.24%   -23.65 ms       -12.50%  Improved
Json-parse-financial                58.50 ms ±0.85%    57.60 ms ±0.89%    -0.90 ms        -1.54%
Json-stringify-tinderbox            32.00 ms ±0.81%    32.00 ms ±0.99%     0.00 ms         0.00%
Stanford-crypto-aes                145.20 ms ±0.96%   144.80 ms ±0.94%    -0.40 ms        -0.28%
Stanford-crypto-ccm                112.00 ms ±0.89%   111.50 ms ±0.45%    -0.50 ms        -0.45%
Stanford-crypto-pbkdf2             246.33 ms ±0.27%   244.80 ms ±0.91%    -1.53 ms        -0.62%
Stanford-crypto-sha256-iterative    63.75 ms ±0.39%    62.83 ms ±0.96%    -0.92 ms        -1.44%
--------------------------------  -----------------  -----------------  ----------  ------------  ---------------
Total                             1699.25 ms ±0.67%  1665.14 ms ±0.66%   -34.11 ms        -2.01%  Likely improved

Octane            Left score       Right score      ∆ Score  ∆ Score %  Comment
----------------  ---------------  ---------------  -------  ---------  ---------------
Box2d             23756.13 ±0.16%  24083.13 ±0.58%   327.00      1.38%
Code-load          8355.50 ±0.25%   8435.50 ±0.15%    80.00      0.96%  Improved
Crypto            25798.93 ±0.24%  26194.50 ±0.18%   395.57      1.53%  Improved
Deltablue         19073.33 ±0.66%  20072.00 ±0.23%   998.67      5.24%  Improved
Earley-boyer      32181.25 ±0.36%  31788.40 ±0.49%  -392.85     -1.22%
Gbemu             36176.00 ±0.20%  36268.00 ±0.38%    92.00      0.25%
Mandreel          20893.25 ±0.32%  20803.50 ±0.27%   -89.75     -0.43%
Mandreel latency  61067.78 ±0.71%  61170.50 ±0.23%   102.72      0.17%
Navier-stokes     29619.50 ±0.30%  31657.71 ±0.30%  2038.21      6.88%  Improved
Pdfjs             13850.75 ±0.41%  13977.63 ±0.44%   126.88      0.92%
Raytrace          31511.67 ±0.38%  31847.71 ±0.30%   336.05      1.07%  Likely improved
Regexp             3627.00 ±0.51%   3586.13 ±0.11%   -40.88     -1.13%
Richards          17123.00 ±0.49%  17286.50 ±0.26%   163.50      0.95%
Splay             16760.20 ±0.49%  16873.75 ±0.32%   113.55      0.68%
Splay latency     34004.50 ±0.58%  34297.50 ±0.44%   293.00      0.86%
Typescript        28073.38 ±0.49%  28221.88 ±0.41%   148.50      0.53%
Zlib              70900.88 ±0.17%  70701.75 ±0.25%  -199.13     -0.28%
----------------  ---------------  ---------------  -------  ---------  ---------------
Total             22911.87 ±0.39%  23154.84 ±0.31%   242.96      1.06%  Likely improved

Follow up work will target catching multi-level field loads that have LdEnv in the load sequence, and loads of and stores to the same field in the same loop

@rajatd rajatd requested review from LouisLaf and a team May 7, 2018 22:39
@rajatd rajatd force-pushed the oxy-no-ldslot branch 3 times, most recently from 079a1d6 to 2f462b3 Compare May 8, 2018 22:37
@rajatd rajatd requested review from Cellule and pleath May 11, 2018 18:58
} NEXT_PREDECESSOR_BLOCK;

Assert(tail);
return tail;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when there are multiple back-edges?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment at the call site of this method explains why is it ok to get any tail block - basically we'll be getting the value of a PRE candidate sym on the block returned by this method. Since, the sym is a PRE candidate sym, its value must be the same on all tail blocks. I'll update the method name to GetAnyTailBlock


if (ldSrc->m_sym != propertySym)
{
// It's possible that the propertySym but have equivalent objPtrs. Verify their values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that the propertySy [](start = 25, length = 19)

Nit: fix comment

Copy link
Collaborator

@LouisLaf LouisLaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@chakrabot chakrabot merged commit 0c15d38 into chakra-core:master May 12, 2018
chakrabot pushed a commit that referenced this pull request May 12, 2018
Merge pull request #5110 from rajatd:oxy-no-ldslot

This PR enhances the existing Partial Redundancy Elimination (PRE) infrastructure to catch multi-level field loads, i.e., field loads of the form o.x.y.

**Challenges:**
1. Loading o.x.y translates to roughly the following IR:

                T1 = o.x
                T2 = T1.y

	In order to put a load of `T1.y` in the loop landing pad, we need `T1` to be live there. But, since it’s a temp its not live in the landing pad.

2. In the prepass, when we create initial values for `o.x` and `T1.y`, we give them new symStores (because we aren't sure what symStore will those syms end up with at the end of the loop). Symbol-to-value map entries look something like this:

             symbol -> value -> symstore
                  o.x -> v1 -> T3
                  T1.y ->v2 -> T4

	The PRE code that runs after the loop prepass, then uses these symStores to preload the PRE candidate property syms (`o.x` and `T1.y` in the above example). Assuming we could solve the previous problem by inserting `T1 = o.x` in the landing pad, landing pad would look something like this:

                T3 = o.x
                T1 = o.x
                T4 = T1.y

	But, now when we optimize the landing pad, `T1.y` gets object-pointer copy-propped to `T3.y`. So, now we have `T3.y` live coming in to the loop, but `T1.y` live on the back-edge. Since, these are two different property syms, they don’t produce a value for any `.y` during merging info on the loop header. This defeats the purpose of putting an initial value for `T1.y` in the landing pad.

**Solutions:**
1. For making `T1` live in the landing pad, I rely on it being single-def and go with the premise that if its single def, I should be able to insert its defining instruction in the landing pad, as long as the rhs of the instruction is live in the landing pad; if its not, I try to make the rhs live by either:
		a. Processing it as a PRE candidate (if it was one), or
		b. recursing on the same logic as above for making it live.

2. Since, `T3` is the symstore for `o.x` on the back-edge, and we insert `T3 = o.x` in the landing pad, `T3` will eventually object-pointer copy-prop into `T2 = T1.y` (making it `T2 = T3.y`) during the main pass of the loop. To then field copy prop `T3.y`, we need `T3.y` to have a value on the loop header and to have that, we need `T3.y` live on the back edge.
	And that is exactly what I do - when preloading `T1.y` in the landing pad, I recognize that `T3` is going to be the copy-prop sym for `T1` and I make `T3.y` live in the landing pad and all the back edges.

**Perf**
```
Kraken                            Left run time      Right run time     ? Run time  ? Run time %  Comment
--------------------------------  -----------------  -----------------  ----------  ------------  ---------------
Ai-astar                           203.80 ms ±0.90%   205.83 ms ±0.92%     2.03 ms         1.00%
Audio-beat-detection               105.00 ms ±0.95%   102.50 ms ±0.49%    -2.50 ms        -2.38%
Audio-dft                          141.89 ms ±0.99%   140.75 ms ±0.18%    -1.14 ms        -0.80%
Audio-fft                           74.50 ms ±0.70%    75.91 ms ±0.74%     1.41 ms         1.89%
Audio-oscillator                    73.63 ms ±0.43%    69.25 ms ±0.24%    -4.38 ms        -5.94%  Improved
Imaging-darkroom                   177.83 ms ±0.92%   176.33 ms ±0.68%    -1.50 ms        -0.84%
Imaging-desaturate                  75.57 ms ±0.49%    75.43 ms ±0.76%    -0.14 ms        -0.19%
Imaging-gaussian-blur              189.25 ms ±0.13%   165.60 ms ±0.24%   -23.65 ms       -12.50%  Improved
Json-parse-financial                58.50 ms ±0.85%    57.60 ms ±0.89%    -0.90 ms        -1.54%
Json-stringify-tinderbox            32.00 ms ±0.81%    32.00 ms ±0.99%     0.00 ms         0.00%
Stanford-crypto-aes                145.20 ms ±0.96%   144.80 ms ±0.94%    -0.40 ms        -0.28%
Stanford-crypto-ccm                112.00 ms ±0.89%   111.50 ms ±0.45%    -0.50 ms        -0.45%
Stanford-crypto-pbkdf2             246.33 ms ±0.27%   244.80 ms ±0.91%    -1.53 ms        -0.62%
Stanford-crypto-sha256-iterative    63.75 ms ±0.39%    62.83 ms ±0.96%    -0.92 ms        -1.44%
--------------------------------  -----------------  -----------------  ----------  ------------  ---------------
Total                             1699.25 ms ±0.67%  1665.14 ms ±0.66%   -34.11 ms        -2.01%  Likely improved

Octane            Left score       Right score      ∆ Score  ∆ Score %  Comment
----------------  ---------------  ---------------  -------  ---------  ---------------
Box2d             23756.13 ±0.16%  24083.13 ±0.58%   327.00      1.38%
Code-load          8355.50 ±0.25%   8435.50 ±0.15%    80.00      0.96%  Improved
Crypto            25798.93 ±0.24%  26194.50 ±0.18%   395.57      1.53%  Improved
Deltablue         19073.33 ±0.66%  20072.00 ±0.23%   998.67      5.24%  Improved
Earley-boyer      32181.25 ±0.36%  31788.40 ±0.49%  -392.85     -1.22%
Gbemu             36176.00 ±0.20%  36268.00 ±0.38%    92.00      0.25%
Mandreel          20893.25 ±0.32%  20803.50 ±0.27%   -89.75     -0.43%
Mandreel latency  61067.78 ±0.71%  61170.50 ±0.23%   102.72      0.17%
Navier-stokes     29619.50 ±0.30%  31657.71 ±0.30%  2038.21      6.88%  Improved
Pdfjs             13850.75 ±0.41%  13977.63 ±0.44%   126.88      0.92%
Raytrace          31511.67 ±0.38%  31847.71 ±0.30%   336.05      1.07%  Likely improved
Regexp             3627.00 ±0.51%   3586.13 ±0.11%   -40.88     -1.13%
Richards          17123.00 ±0.49%  17286.50 ±0.26%   163.50      0.95%
Splay             16760.20 ±0.49%  16873.75 ±0.32%   113.55      0.68%
Splay latency     34004.50 ±0.58%  34297.50 ±0.44%   293.00      0.86%
Typescript        28073.38 ±0.49%  28221.88 ±0.41%   148.50      0.53%
Zlib              70900.88 ±0.17%  70701.75 ±0.25%  -199.13     -0.28%
----------------  ---------------  ---------------  -------  ---------  ---------------
Total             22911.87 ±0.39%  23154.84 ±0.31%   242.96      1.06%  Likely improved

```

Follow up work will target catching multi-level field loads that have LdEnv in the load sequence, and loads of and stores to the same field in the same loop
chakrabot pushed a commit that referenced this pull request Jun 22, 2018
Merge pull request #5300 from rajatd:prepareSeg

If the entire range of a memop operation fits in the head segment, properly check if there were any missing items in the head outside of that range, before setting HasNoMissingValues to true on the dst array.

This is needed only after #5110.
chakrabot pushed a commit that referenced this pull request Jun 22, 2018
… missing values

Merge pull request #5300 from rajatd:prepareSeg

If the entire range of a memop operation fits in the head segment, properly check if there were any missing items in the head outside of that range, before setting HasNoMissingValues to true on the dst array.

This is needed only after #5110.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants