JIT CSE Optimization - Add a gymnasium environment for reinforcement learning #101856
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@leculver skimmed through and this looks awesome! Will need a bit of time to review. Will try and get you some feedback by early next week. FYI @dotnet/jit-contrib @matouskozak @mikelle-rogers Also FYI @NoahIslam @LogarithmicFrog1: you might find the approach Lee is taking here a bit more accessible and/or familiar, if you're still up for some collaboration.
No problem, take the time you need.
Overall this looks great. Happy to merge this as is.
Mostly my comments are about clarification and trying to match up what you have here with what I have done previously.
src/coreclr/scripts/cse_ml/readme.md
I'd like to see a bit more of a writeup about the overall approach, either here or somewhere else. Things like
- are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?
- are the rewards discounted or undiscounted?
- how are you handling the fact that reward magnitudes can vary greatly from one method to the next?
- what sort of neural net topology gets built? Why is this a good choice?
- how are you producing the aggregate score across all the methods?
I'm 100% in agreement about needing to create more writeup and documentation.
I guess I should have been a bit clearer about the intention of this pull request. I consider the code here the absolute minimum starting point that other folks (and I) can play with to make improvements. It's meant as a playground for use over the next couple of months.
When I'm further along in experimenting with different approaches, model architecture, and so on, that's when I plan to write everything up. Some of the techniques will certainly change after I've had more time to experiment in the space, so I didn't write down too much about this base design because I expect a lot of it to be different.
Here are quick answers to your questions:
are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?
This version uses incremental rewards by building up a sequence of decisions.
are the rewards discounted or undiscounted?
Rewards are discounted, but not heavily. Actually, we currently just use the stable-baselines default gamma of 0.99. I intentionally haven't tuned hyperparameters in this checkin. Again trying to keep it as simple as possible.
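For illustration, here is a minimal sketch of what "just use the stable-baselines default" means in code; the environment id below is hypothetical (not the PR's actual construction), and `gamma` is simply left at PPO's default of 0.99 rather than tuned.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("JitCseEnv-v0")          # hypothetical id; the PR builds its env differently
model = PPO("MlpPolicy", env)           # gamma is not passed, so it stays at the default 0.99
model.learn(total_timesteps=100_000)
```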
how are you handling the fact that reward magnitudes can vary greatly from one method to the next?
Currently, we use % change in the perfscore. This keeps rewards relatively within the same magnitude. Obviously some methods are longer than others and the change in perfscore for choosing a CSE likely doesn't scale with method length, so this is a place for improvement.
My overall goal with this checkin was simplicity and being able to understand what it's doing. Since the model trains successfully (though it doesn't beat the current CSE heuristic), I did not try to refine it further yet.
what sort of neural net topology gets built? Why is this a good choice?
Currently, it's the default for stable-baselines. I can give you the topology, but this was also a non-choice so far. The default network trained successfully, so I haven't dug further into the design (yet).
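For reference, here is a sketch of how that otherwise-implicit topology could be written out explicitly. It assumes stable-baselines3's MlpPolicy default of two 64-unit hidden layers for both the policy and value networks; it is not code from the PR.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("JitCseEnv-v0")          # hypothetical id, as in the sketch above
policy_kwargs = dict(net_arch=dict(pi=[64, 64], vf=[64, 64]))   # restates the assumed default
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs)
```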
how are you producing the aggregate score across all the methods?
I'm just averaging the change in perfscore. I like your method better and will update to that next checkin.
src/coreclr/jit/optcse.cpp
Outdated
JIT changes look good.
There is some overlap with things from the other RL heuristic but I think it's ok and probably simpler for now to keep them distinct.
return REWARD_SCALE * (prev - curr) / prev


def _is_valid_action(self, action, method):
    # Terminating is only valid if we have performed a CSE. Doing no CSEs isn't allowed.
Is this because you track the "no cse" cases separately, so when learning you're always doing some cses?
There will certainly be some instances where doing no cses is the best policy.
My overall goal with this checkin is to get something relatively simple and understandable as the baseline for future work. In this case, my (intentionally) simple reward function isn't capable of understanding an initial "no" choice without adding extra complexity.
A more refined version of this project can and will handle the case where we choose no CSEs to perform, but I did not want to overcomplicate the initial version.
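A hypothetical sketch of the rule being discussed (names like `terminate_action`, `cses_chosen`, and `viable_cse_indices` are assumptions, not the PR's actual fields): the terminate action only becomes valid once at least one CSE has been applied.

```python
def _is_valid_action(self, action, method):
    # The "stop" action is only allowed after at least one CSE has been applied,
    # so an episode can never end with zero CSEs performed.
    if action == self.terminate_action:            # assumed name for the stop action
        return len(self.cses_chosen) > 0           # assumed list of CSEs applied so far
    # Otherwise the action must name a viable CSE candidate for this method.
    return action in method.viable_cse_indices     # assumed per-method candidate set
```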
if np.isclose(prev, 0.0):
    return 0.0

return REWARD_SCALE * (prev - curr) / prev
Maybe this answers my question about how the variability in rewards is handled? Is `prev` here some fixed policy result (say no cse or the current heuristic)?
The architecture of this model is to individually choose each CSE one after another until "none" is selected. The `prev` score is the score of the previous decision. For example, let's say the model eventually chooses [3, 1, 5, stop]. In the first iteration, `prev` will be the perfscore of the method with no CSEs and `curr` will be the perfscore with only CSE 3 chosen. On the second iteration, `prev` will be the perfscore with only CSE 3 chosen, and `curr` will be the perfscore with CSEs [3, 1] chosen. And so on.
This isn't the only way to build training, but it's the one I started with.
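A hypothetical walkthrough of that sequence as code: `perfscore_after` stands in for re-JITting the method with the given CSEs applied, and `REWARD_SCALE` is an assumed constant matching the snippet quoted above.

```python
REWARD_SCALE = 10.0                        # assumed value, not taken from the PR

def rollout_rewards(perfscore_after, decisions=(3, 1, 5)):
    """Per-step rewards for applying CSEs one at a time, e.g. [3, 1, 5, stop]."""
    rewards = []
    chosen = []
    prev = perfscore_after(chosen)         # perfscore with no CSEs applied
    for cse in decisions:
        chosen.append(cse)
        curr = perfscore_after(chosen)     # perfscore after this additional CSE
        rewards.append(REWARD_SCALE * (prev - curr) / prev)
        prev = curr                        # the next step compares against this score
    return rewards
```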
no_jit_failure = result[result['failed'] != ModelResult.JIT_FAILED]

# next calculate how often we improved on the heuristic
improved = no_jit_failure[no_jit_failure['model_score'] < no_jit_failure['heuristic_score']]
I think this touches on how the aggregate score is computed.
Generally I like to use the geomean: take the per-method ratio of the model's perfscore to the baseline perfscore and compute the geometric mean of those ratios (here lower is better). I expect the "best possible" improvement to be around 0.99 (my policy gets about 0.994).
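A sketch of that aggregate, assuming a DataFrame like the `result` one quoted earlier with `model_score` and `heuristic_score` columns; not code from the PR.

```python
import numpy as np

def geomean_ratio(results):
    # Per-method ratio of model perfscore to baseline perfscore (lower is better),
    # aggregated with the geometric mean.
    ratios = results["model_score"] / results["heuristic_score"]
    return float(np.exp(np.log(ratios).mean()))
```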
Ah interesting. I will add this to the next update, thanks for the suggestion!
from .method_context import MethodContext

MIN_CSE = 3
MAX_CSE = 16
Is this a starting range for the minimum and maximum number of CSEs, and if it works, will we extend it further?
That's correct. This was the starting point to get something working. We need to think through how to give the model the ability to see and select all CSEs (up to 64, which is the JIT's max). Defining a new architecture is yet another project to work on. I filed that as an issue here: leculver/jitml#8
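A hypothetical illustration of the filter being described (the attribute name is an assumption): only methods whose viable candidate count falls in [MIN_CSE, MAX_CSE] are used for now, with 64 being the JIT-wide ceiling mentioned above.

```python
MIN_CSE = 3
MAX_CSE = 16
JIT_MAX_CSE = 64                                  # the JIT's overall limit mentioned above

def is_trainable(method) -> bool:
    # Keep only methods whose number of viable CSE candidates fits the current range.
    n = len(method.viable_cse_candidates)         # assumed attribute name
    return MIN_CSE <= n <= MAX_CSE
```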
…learning (dotnet#101856)

* Initial code
* Add notes
* Add CSE_HeuristicRLHook
* Move metric print location, double -> int
* Produce non-viable entries, fix output issue
* Shuffle features by type
* Initial JitEnv - not yet working
* Change to snake_case
* Initial RL implementation with stable-baselines3
* Enable parallel processing, fix some errors
* Clean up train.py, allow algorithm selection
* Fix paths
* Fix issue with null result
* Save method indexes
* Check if process is still running
* Up argument count before warning
* Track more statistics on tensorboard
* Fix an issue where we didn't let the model know it shouldn't pick something
* Reward improvements
  - Scale up rewards.
  - Clamp rewards to [-1, 1]
  - Reward/penalize when complete if there are better/worse CSEs (this is very slow)
  - Reward when complete based on whether we beat the heuristic or not
* Update jitenv.py to remove unused import
* Fix inverted graph
* Split data into test/train
* Refactor for clarity
* Use numpy for randomness
* Add open questions
* Fix a couple of model saving issues
* Refactor and cleanup
* Add evaluate.py
* Fix inverted test/train
* Add a way to get the probabilities of actions
* Rename file
* Clean up imports
* Changed action space
  - 0 to action_space.n-2 are now the CSEs to apply instead of adding and subtracting 1 to the action.
  - 0 no longer means terminate, instead the action from the model of n-1 is the terminate signal. This is not passed to the JIT.
* Add field validator for perf_score
  This shouldn't happen but it's important enough to validate
* Update applicability to ensure we have at least enough viable candidates and not more than total
* Fix a few bugs with evaluate
* Fix test/train split, some extra output
* Remove dead code, simplify format
* Rename JitEnv -> JitCseEnv
* More renames
* Try to factor the observation space
* Fix test/train split
* Reward cleanup
  - Split reward function into shallow and deep.
* Remove 0 perfscore check
* Enable deep rewards
* Fix issue where jit failed and produced None method
* Simplify deeper rewards
* Update todo
* Add reward customization
* Clean up __all__
* Fix issue where we would JIT the first CSE candidate in reset
  This was leftover code from the previous design of RLHook.
* Add two new features, emit selected sequence
* Jit one less method per cse chosen in deep rewards
* Use info dictionary instead of a specific state
* Fix segfault due to null variable
* Add superpmi_context
  Getting the code well-factored so it's easy to modify.
* Add tensorboard entry for invalid choices, clear results
* Close the environment
* Add documentation for JIT changes
* Rename method
* Normalize observation
* Set return type hint for clarity
* Add RemoveFeaturesWrapper
* Update docstring
* Rename function
* Move feature normalization to a wrapper
  - Also better wrapper factoring.
* Remove import
* Fix warning
* Fix Windows issue
* Properly log when using A2C
* Add readme
* Change argument name
* Remove whitespace change
* Format fixes
* Fix formatting
* Update readme: Fix grammar, add a note about evaluation
* Fixed incorrect filename in readme
* Save more data to .json in preparation of other model kinds
Implement a gymnasium environment, `JitCseEnv`, to allow us to rapidly iterate on features, reward functions, and model/neural network architecture for JIT CSE optimization. This change implements the bare minimum rewards and features needed to experiment with CSE optimization. The current non-normalized features and simple reward function produce a model that is almost as good as the current hand-written CSE heuristic in the JIT. Further developments and improvements will likely happen offline; this is meant to be the skeleton of the project that's shared.
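As context, a minimal sketch of how a gymnasium environment like this is typically driven; the environment id and construction here are assumptions rather than the PR's actual API.

```python
import gymnasium as gym

env = gym.make("JitCseEnv-v0")                    # hypothetical registered id
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # random policy, just for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```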
More information can be found in the README.md included in this pull request.
Contributes to: #92915.