[RLlib] Cleanup examples folder 22: Add 2 (count-based) curiosity examples. #46737
Merged: sven1977 merged 7 commits into ray-project:master from sven1977:cleanup_examples_folder_22_count_based_curiosity on Jul 23, 2024.
Commits (7, all by sven1977):
32faa1b  wip
31e6230  wip
5ae97ed  wip
f3348fc  wip
00f5085  Learns MountainCar-v0 until -119 reward
4b64182  wip
ede1744  Merge branch 'master' of https://github.com/ray-project/ray into clea…
rllib/examples/connectors/classes/count_based_curiosity.py (92 additions, 0 deletions)
from collections import Counter
from typing import Any, List, Optional

import gymnasium as gym

from ray.rllib.connectors.connector_v2 import ConnectorV2
from ray.rllib.core.rl_module.rl_module import RLModule
from ray.rllib.utils.typing import EpisodeType


class CountBasedCuriosity(ConnectorV2):
    """Learner ConnectorV2 piece to compute intrinsic rewards based on obs counts.

    Add this connector piece to your Learner pipeline, through your algo config:
    ```
    config.training(
        learner_connector=lambda obs_sp, act_sp: CountBasedCuriosity()
    )
    ```

    Intrinsic rewards are computed on the Learner side based on naive observation
    counts, which is why this connector should only be used for simple environments
    with a reasonable number of possible observations. The intrinsic reward for a
    given timestep is:
    r(i) = intrinsic_reward_coeff * (1 / C(obs(i)))
    where C is the total (lifetime) count of the obs at timestep i.

    The intrinsic reward is added to the extrinsic reward and saved back into the
    episode (under the main "rewards" key).

    Note that the computation and saving back to the episode all happens before the
    actual train batch is generated from the episode data. Thus, the Learner and the
    RLModule used do not take notice of the extra reward added.

    If you would like to use a more sophisticated mechanism for intrinsic reward
    computations, take a look at the `EuclidianDistanceBasedCuriosity` connector
    piece at `ray.rllib.examples.connectors.classes.euclidian_distance_based_curiosity`
    """

    def __init__(
        self,
        input_observation_space: Optional[gym.Space] = None,
        input_action_space: Optional[gym.Space] = None,
        *,
        intrinsic_reward_coeff: float = 1.0,
        **kwargs,
    ):
        """Initializes a CountBasedCuriosity instance.

        Args:
            intrinsic_reward_coeff: The weight with which to multiply the intrinsic
                reward before adding (and saving) it back to the main (extrinsic)
                reward of the episode at each timestep.
        """
        super().__init__(input_observation_space, input_action_space)

        # Naive observation counter.
        self._counts = Counter()
        self.intrinsic_reward_coeff = intrinsic_reward_coeff

    def __call__(
        self,
        *,
        rl_module: RLModule,
        data: Any,
        episodes: List[EpisodeType],
        explore: Optional[bool] = None,
        shared_data: Optional[dict] = None,
        **kwargs,
    ) -> Any:
        # Loop through all episodes and change the reward to
        # [reward + intrinsic reward].
        for sa_episode in self.single_agent_episode_iterator(
            episodes=episodes, agents_that_stepped_only=False
        ):
            # Loop through all obs, except the last one.
            observations = sa_episode.get_observations(slice(None, -1))
            # Get all respective (extrinsic) rewards.
            rewards = sa_episode.get_rewards()

            for i, (obs, rew) in enumerate(zip(observations, rewards)):
                obs = tuple(obs)
                # Add 1 to obs counter.
                self._counts[obs] += 1
                # Compute our count-based intrinsic reward and add it to the main
                # (extrinsic) reward.
                rew += self.intrinsic_reward_coeff * (1 / self._counts[obs])
                # Store the new reward back to the episode (under the correct
                # timestep/index).
                sa_episode.set_rewards(new_data=rew, at_indices=i)

        return data
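For orientation only (not part of this PR's diff): a minimal usage sketch wiring CountBasedCuriosity into an algorithm config via the Learner connector, as the class docstring describes. The environment choice, the `api_stack()` call, and the coefficient value are illustrative assumptions and may need adjusting to your Ray version and environment.

```python
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.connectors.classes.count_based_curiosity import (
    CountBasedCuriosity,
)

config = (
    PPOConfig()
    # Illustrative env only: the connector calls tuple(obs), so observations
    # should be array-like; a small env with few distinct observations is the
    # intended use case.
    .environment("CartPole-v1")
    # Learner connector pipelines belong to RLlib's new API stack; the exact
    # enabling call may differ across Ray versions.
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .training(
        # Plug the curiosity piece into the Learner connector pipeline.
        learner_connector=lambda obs_sp, act_sp: CountBasedCuriosity(
            intrinsic_reward_coeff=1.0
        ),
    )
)
# algo = config.build()  # Then call algo.train() in a loop as usual.
```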
rllib/examples/connectors/classes/euclidian_distance_based_curiosity.py (122 additions, 0 deletions)
from collections import deque
from typing import Any, List, Optional

import gymnasium as gym
import numpy as np

from ray.rllib.connectors.connector_v2 import ConnectorV2
from ray.rllib.core.rl_module.rl_module import RLModule
from ray.rllib.utils.typing import EpisodeType


class EuclidianDistanceBasedCuriosity(ConnectorV2):
    """Learner ConnectorV2 piece computing intrinsic rewards with euclidian distance.

    Add this connector piece to your Learner pipeline, through your algo config:
    ```
    config.training(
        learner_connector=lambda obs_sp, act_sp: EuclidianDistanceBasedCuriosity()
    )
    ```

    Intrinsic rewards are computed on the Learner side by comparing the euclidian
    distance of incoming observations vs already seen ones. A configurable number
    of observations is stored in a FIFO buffer and all incoming observations have
    their distance measured against those.

    The minimum distance measured is the intrinsic reward for the incoming obs
    (multiplied by a fixed coefficient and added to the "main" extrinsic reward):
    r(i) = intrinsic_reward_coeff * min(ED(o, o(i)) for o in stored_obs)
    where `ED` is the euclidian distance and `stored_obs` is the buffer.

    The intrinsic reward is then added to the extrinsic reward and saved back into
    the episode (under the main "rewards" key).

    Note that the computation and saving back to the episode all happens before the
    actual train batch is generated from the episode data. Thus, the Learner and the
    RLModule used do not take notice of the extra reward added.

    Only one observation per incoming episode will be stored as a new one in the
    buffer. Thereby, we pick the observation with the largest `min(ED)` value over
    all already stored observations to be stored per episode.

    If you would like to use a simpler, count-based mechanism for intrinsic reward
    computations, take a look at the `CountBasedCuriosity` connector piece
    at `ray.rllib.examples.connectors.classes.count_based_curiosity`
    """

    def __init__(
        self,
        input_observation_space: Optional[gym.Space] = None,
        input_action_space: Optional[gym.Space] = None,
        *,
        intrinsic_reward_coeff: float = 1.0,
        max_buffer_size: int = 100,
        **kwargs,
    ):
        """Initializes an EuclidianDistanceBasedCuriosity instance.

        Args:
            intrinsic_reward_coeff: The weight with which to multiply the intrinsic
                reward before adding (and saving) it back to the main (extrinsic)
                reward of the episode at each timestep.
            max_buffer_size: The maximum number of observations to keep in the FIFO
                buffer against which new observations are compared.
        """
        super().__init__(input_observation_space, input_action_space)

        # Create an observation buffer.
        self.obs_buffer = deque(maxlen=max_buffer_size)
        self.intrinsic_reward_coeff = intrinsic_reward_coeff

        # Counter limiting for how many calls intrinsic rewards are computed
        # (see `__call__`).
        self._test = 0

    def __call__(
        self,
        *,
        rl_module: RLModule,
        data: Any,
        episodes: List[EpisodeType],
        explore: Optional[bool] = None,
        shared_data: Optional[dict] = None,
        **kwargs,
    ) -> Any:
        # After the first few calls, stop computing intrinsic rewards and act as a
        # pass-through.
        if self._test > 10:
            return data
        self._test += 1
        # Loop through all episodes and change the reward to
        # [reward + intrinsic reward].
        for sa_episode in self.single_agent_episode_iterator(
            episodes=episodes, agents_that_stepped_only=False
        ):
            # Loop through all obs, except the last one.
            observations = sa_episode.get_observations(slice(None, -1))
            # Get all respective (extrinsic) rewards.
            rewards = sa_episode.get_rewards()

            max_dist_obs = None
            max_dist = float("-inf")
            for i, (obs, rew) in enumerate(zip(observations, rewards)):
                # Compare obs to all stored observations and compute the euclidian
                # distance.
                min_dist = 0.0
                if self.obs_buffer:
                    min_dist = min(
                        np.sqrt(np.sum((obs - stored_obs) ** 2))
                        for stored_obs in self.obs_buffer
                    )
                if min_dist > max_dist:
                    max_dist = min_dist
                    max_dist_obs = obs

                # Compute our euclidian distance-based intrinsic reward and add it
                # to the main (extrinsic) reward.
                rew += self.intrinsic_reward_coeff * min_dist
                # Store the new reward back to the episode (under the correct
                # timestep/index).
                sa_episode.set_rewards(new_data=rew, at_indices=i)

            # Add the one observation of this episode with the largest (min)
            # euclidian dist to all already stored obs to the buffer (maybe
            # throwing out the oldest obs in there).
            if max_dist_obs is not None:
                self.obs_buffer.append(max_dist_obs)

        return data
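Again for orientation only (not part of the diff), an analogous sketch for the distance-based piece. MountainCar-v0 is the environment mentioned in one of this PR's commit messages; the algorithm choice, coefficient, buffer size, and `api_stack()` call below are assumptions.

```python
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.connectors.classes.euclidian_distance_based_curiosity import (
    EuclidianDistanceBasedCuriosity,
)

config = (
    PPOConfig()
    # MountainCar-v0's per-step -1 reward gives no guidance toward the goal, so
    # novelty bonuses from the distance-based curiosity help exploration.
    .environment("MountainCar-v0")
    # New API stack required for Learner connector pipelines (exact call may
    # differ across Ray versions).
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .training(
        learner_connector=lambda obs_sp, act_sp: EuclidianDistanceBasedCuriosity(
            intrinsic_reward_coeff=1.0,
            max_buffer_size=100,
        ),
    )
)
# algo = config.build()  # Then call algo.train() in a loop as usual.
```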
rllib/examples/connectors/count_based_curiosity.py (14 additions, 0 deletions)
"""Placeholder for training with count-based curiosity. | ||
|
||
The actual script can be found at a different location (see code below). | ||
""" | ||
|
||
if __name__ == "__main__": | ||
import subprocess | ||
import sys | ||
|
||
# Forward to "python ../curiosity/[same script name].py [same options]" | ||
command = [sys.executable, "../curiosity/", sys.argv[0]] + sys.argv[1:] | ||
|
||
# Run the script. | ||
subprocess.run(command, capture_output=True) |
rllib/examples/connectors/euclidian_distance_based_curiosity.py (14 additions, 0 deletions)
"""Placeholder for training with euclidian distance-based curiosity. | ||
|
||
The actual script can be found at a different location (see code below). | ||
""" | ||
|
||
if __name__ == "__main__": | ||
import subprocess | ||
import sys | ||
|
||
# Forward to "python ../curiosity/[same script name].py [same options]" | ||
command = [sys.executable, "../curiosity/", sys.argv[0]] + sys.argv[1:] | ||
|
||
# Run the script. | ||
subprocess.run(command, capture_output=True) |
Review comment: Nice! Great example!