JIT CSE Optimization - Add a gymnasium environment for reinforcement learning #101856

Merged: 78 commits merged into dotnet:main on May 9, 2024

Conversation

leculver
Contributor

@leculver leculver commented May 3, 2024

Implement a gymnasium environment, JitCseEnv, to allow us to rapidly iterate on features, reward functions, and model/neural network architecture for JIT CSE optimization. This change:

  • Creates a hook in the JIT's common subexpression elimination (CSE) optimization so that CSE decisions can be driven by an environment variable.
  • Uses SuperPMI, with the new CSE hook, to drive CSE decision making in the JIT.
  • Implements a gym environment to manipulate features, rewards, and architecture of the reinforcement learning model to find what works and what doesn't.
  • Provides a mechanism to see live updates of the training process via TensorBoard, and post-training evaluation against the default CSE Heuristic.

This implements the bare minimum rewards and features needed to experiment with CSE optimization. The current non-normalized features and simple reward function create a model that is almost as good as the current hand-written CSE Heuristic in the JIT. Further development and improvement will likely happen offline; this change is meant to be the shared skeleton of the project.
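For orientation, here is a minimal sketch of how the environment might be driven with stable-baselines3. The jitml import path, the JitCseEnv constructor arguments, and the choice of A2C are assumptions for illustration, not the exact API in this change:

# Hypothetical training sketch: learn a CSE policy on JitCseEnv with stable-baselines3.
from stable_baselines3 import A2C
from jitml import JitCseEnv  # assumed module/constructor; see the README for the real entry point

def train(mch_path: str, core_root: str, steps: int = 100_000) -> A2C:
    """Train a policy that chooses CSE candidates one at a time through SuperPMI."""
    env = JitCseEnv(mch_path, core_root)      # SuperPMI plus the new JIT CSE hook underneath
    model = A2C("MlpPolicy", env, tensorboard_log="./tensorboard", verbose=1)
    model.learn(total_timesteps=steps)        # progress can be watched live in TensorBoard
    return model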

More information can be found in the README.md included in this pull request.

Contributes to: #92915.

@leculver leculver added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 3, 2024
@leculver leculver requested review from TIHan and AndyAyersMS May 3, 2024 18:05
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member

@leculver skimmed through and this looks awesome! Will need a bit of time to review. Will try and get you some feedback by early next week.

FYI @dotnet/jit-contrib @matouskozak @mikelle-rogers

Also FYI @NoahIslam @LogarithmicFrog1: you might find the approach Lee is taking here a bit more accessible and/or familiar, if you're still up for some collaboration.

@leculver
Contributor Author

leculver commented May 3, 2024

No problem, take the time you need.

Member

@AndyAyersMS AndyAyersMS left a comment


Overall this looks great. Happy to merge this as is.

Mostly my comments are about clarification and trying to match up what you have here with what I have done previously.

Member


I'd like to see a bit more of a writeup about the overall approach, either here or somewhere else. Things like

  • are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?
  • are the rewards discounted or undiscounted?
  • how are you handling the fact that reward magnitudes can vary greatly from one method to the next?
  • what sort of neural net topology gets built? Why is this a good choice?
  • how are you producing the aggregate score across all the methods?

Contributor Author


I'm 100% in agreement about needing to create more writeup and documentation.

I guess I should have been clearer about the intention of this pull request. I consider the code here the absolute minimum starting point that other folks (and I) can play with to make improvements. The code here is meant as that playground for use over the next couple of months.

When I'm further along in experimenting with different approaches, model architectures, and so on, that's when I plan to write everything up. Some of the techniques will certainly change after I've had more time to experiment in the space, so I didn't write down too much about this base design because I expect a lot of it to be different.

Contributor Author


Here are quick answers to your questions:

are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?

This version uses incremental rewards by building up a sequence of decisions.

are the rewards discounted or undiscounted?

Rewards are discounted, but not heavily. Actually, we currently just use the stable-baselines default gamma of 0.99. I intentionally haven't tuned hyperparameters in this checkin. Again trying to keep it as simple as possible.
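For a sense of scale (this is just the standard discounted-return definition, not code from this change): with $\gamma = 0.99$ the return from step $t$ is

$$ G_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k} $$

and since an episode makes at most a handful of CSE decisions (MAX_CSE is 16 below), the final reward is still weighted by roughly $0.99^{15} \approx 0.86$, so the discounting really is mild.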

how are you handling the fact that reward magnitudes can vary greatly from one method to the next?

Currently, we use % change in the perfscore. This keeps rewards relatively within the same magnitude; for example, a drop from 100.0 to 99.0 and a drop from 10.0 to 9.9 both register as a 1% improvement. Obviously some methods are longer than others and the change in perfscore for choosing a CSE likely doesn't scale with method length, so this is a place for improvement.

My overall goal with this checkin was simplicity and being able to understand what it's doing. Since the model trains successfully (though it doesn't beat the current CSE Heuristic), I did not try to refine the rewards further yet.

what sort of neural net topology gets built? Why is this a good choice?

Currently, it's the default for stable-baselines. I can give you the topology, but this was also a non-choice so far. The default network trained successfully, so I haven't dug further into the design (yet).

how are you producing the aggregate score across all the methods?

I'm just averaging the change in perfscore. I like your method better and will update to that next checkin.

Member


JIT changes look good.

There is some overlap with things from the other RL heuristic but I think it's ok and probably simpler for now to keep them distinct.

return REWARD_SCALE * (prev - curr) / prev

def _is_valid_action(self, action, method):
# Terminating is only valid if we have performed a CSE. Doing no CSEs isn't allowed.
Member


Is this because you track the "no cse" cases separately, so when learning you're always doing some cses?

There will certainly be some instances where doing no cses is the best policy.

Contributor Author


My overall goal with this checkin is to get something relatively simple and understandable as the baseline for future work. In this case, my (intentionally) simple reward function isn't capable of understanding an initial "no" choice without adding extra complexity.

A more refined version of this project can and will handle the case where we choose no CSEs to perform, but I did not want to overcomplicate the initial version.

if np.isclose(prev, 0.0):
return 0.0

return REWARD_SCALE * (prev - curr) / prev
Member


Maybe this answers my question about how the variability in rewards is handled? Is prev here some fixed policy result (say no cse or the current heuristic)?

Contributor Author


The architecture of this model is to individually choose each CSE, one after another, until "none" is selected. The prev score is the score of the previous decision. For example, say the model eventually chooses [3, 1, 5, stop]. In the first iteration, prev is the perfscore of the method with no CSEs and curr is the perfscore with only CSE 3 applied. On the second iteration, prev is the perfscore with only CSE 3 applied, and curr is the perfscore with CSEs [3, 1] applied. And so on.

This isn't the only way to build training, but it's the one I started with.
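A rough sketch of that loop in gymnasium terms (the environment, model, and variable names here are illustrative, not the exact API in this change):

# Each episode picks CSE candidates one at a time until the model chooses "stop".
obs, info = env.reset()                      # state before any CSEs are applied
chosen = []
terminated = truncated = False
while not (terminated or truncated):
    action, _ = model.predict(obs)           # e.g. CSE 3, then 1, then 5, then "stop"
    obs, reward, terminated, truncated, info = env.step(action)
    # reward compares the previous perfscore to the new one:
    # REWARD_SCALE * (prev - curr) / prev
    if not terminated:
        chosen.append(action)                # ends up as [3, 1, 5] in the example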

no_jit_failure = result[result['failed'] != ModelResult.JIT_FAILED]

# next calculate how often we improved on the heuristic
improved = no_jit_failure[no_jit_failure['model_score'] < no_jit_failure['heuristic_score']]
Member


I think this touches on how the aggregate score is computed.

Generally I like to use the geomean. Say we have $N$ methods, each with a base score $b_i$ and a diff score $d_i$; then the aggregate geomean $G$ is

$$ G = e^{\frac{1}{N} \sum_i \log(d_i/b_i)} $$

(here lower is better) and I expect the "best possible" improvement to be around 0.99 (my policy gets about 0.994).
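As a sketch of that aggregate in code (the function name is illustrative, not from this change):

import numpy as np

def geomean_relative_score(base_scores, diff_scores):
    """Geometric mean of diff/base perfscore ratios across methods (lower is better)."""
    ratios = np.asarray(diff_scores, dtype=float) / np.asarray(base_scores, dtype=float)
    return float(np.exp(np.log(ratios).mean()))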

Contributor Author


Ah interesting. I will add this to the next update, thanks for the suggestion!

from .method_context import MethodContext

MIN_CSE = 3
MAX_CSE = 16
Member


Is this a starting number for min and max CSEs and if this works, then we will extend further?

Contributor Author


That's correct. This was the starting point to get something working. We need to think through how to give the model the ability to see and select all CSEs (up to 64, which is the JIT's max). Defining a new architecture is yet another project to work on. I filed that as an issue here: leculver/jitml#8

@leculver leculver merged commit 279dbe1 into dotnet:main May 9, 2024
106 of 108 checks passed
@leculver leculver deleted the jitml branch May 9, 2024 13:44
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
JIT CSE Optimization - Add a gymnasium environment for reinforcement learning (dotnet#101856)

* Initial code

* Add notes

* Add CSE_HeuristicRLHook

* Move metric print location, double -> int

* Produce non-viable entries, fix output issue

* Shuffle features by type

* Initial JitEnv - not yet working

* Change to snake_case

* Initial RL implementation with stable-baselines3

* Enable parallel processing, fix some errors

* Clean up train.py, allow algorithm selection

* Fix paths

* Fix issue with null result

* Save method indexes

* Check if process is still running

* Up argument count before warning

* Track more statistics on tensorboard

* Fix an issue where we didn't let the model know it shouldn't pick something

* Reward improvements

- Scale up rewards.
- Clamp rewards to [-1, 1]
- Reward/penalize when complete if there are better/worse CSEs (this is very slow)
- Reward when complete based on whether we beat the heuristic or not

* Update jitenv.py to remove unused import

* Fix inverted graph

* Split data into test/train

* Refactor for clarity

* Use numpy for randomness

* Add open questions

* Fix a couple of model saving issues

* Refactor and cleanup

* Add evaluate.py

* Fix inverted test/train

* Add a way to get the probabilities of actions

* Rename file

* Clean up imports

* Changed action space

- 0 to action_space.n-2 are now the CSEs to apply instead of adding and subtracting 1 to the action.
- 0 no longer means terminate, instead the action from the model of n-1 is the terminate signal.  This is not passed to the JIT.

* Add field validator for perf_score

This shouldn't happen but it's important enough to validate

* Update applicability to ensure we have at least enough viable candidates and not more than total

* Fix a few bugs with evaluate

* Fix test/train split, some extra output

* Remove dead code, simplify format

* Rename JitEnv -> JitCseEnv

* More renames

* Try to factor the observation space

* Fix test/train split

* Reward cleanup

- Split reward function into shallow and deep.

* Remove 0 perfscore check

* Enable deep rewards

* Fix issue where jit failed and produced None method

* Simplify deeper rewards

* Update todo

* Add reward customization

* Clean up __all__

* Fix issue where we would JIT the first CSE candidate in reset

This was leftover code from the previous design of RLHook.

* Add two new features, emit selected sequence

* Jit one less method per cse chosen in deep rewards

* Use info dictionary instead of a specific state

* Fix segfault due to null variable

* Add superpmi_context

Getting the code well-factored so it's easy to modify.

* Add tensorboard entry for invalid choices, clear results

* Close the environment

* Add documentation for JIT changes

* Rename method

* Normalize observation

* Set return type hint for clarity

* Add RemoveFeaturesWrapper

* Update docstring

* Rename function

* Move feature normalization to a wrapper

- Also better wrapper factoring.

* Remove import

* Fix warning

* Fix Windows issue

* Properly log when using A2C

* Add readme

* Change argument name

* Remove whitespace change

* Format fixes

* Fix formatting

* Update readme:  Fix grammar, add a note about evaluation

* Fixed incorrect filename in readme

* Save more data to .json in preparation of other model kinds
@github-actions github-actions bot locked and limited conversation to collaborators Jun 10, 2024