cmd_rl
Control how numeric indifferent preference values in RL rules are updated via reinforcement learning.
- `rl -g|--get <parameter>`
- `rl -s|--set <parameter> <value>`
- `rl -t|--trace <parameter> <value>`
- `rl -S|--stats <statistic>`
Option | Description |
---|---|
`-g, --get` | Print current parameter setting |
`-s, --set` | Set parameter value |
`-t, --trace` | Print, clear, or init traces |
`-S, --stats` | Print statistic summary or specific statistic |
The `rl` command sets parameters and displays information related to reinforcement learning. The `print` and `trace` commands display additional RL-related information not covered by this command.

Due to the large number of parameters, the `rl` command uses the `--get|--set <parameter> <value>` convention rather than individual switches for each parameter. Running `rl` without any switches displays a summary of the parameter settings.
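For example, the following sequence enables learning, sets and reads back a parameter, and then prints the full summary (the value 0.2 is only an illustration):

```
# enable RL, adjust and inspect a parameter, then print the summary
rl --set learning on
rl --set learning-rate 0.2
rl --get learning-rate
rl
```

For orientation, the `learning-rate`, `discount-rate`, and `eligibility-trace-decay-rate` parameters in the table below correspond to the alpha, gamma, and lambda of the temporal-difference learning literature. With the default `sarsa` policy and lambda = 0, the update applied to an RL value follows roughly the familiar textbook form shown here; Soar's exact computation also depends on the decay mode and the other parameters below:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \, Q(s',a') - Q(s,a) \right]$$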
Parameter | Description | Possible values | Default |
---|---|---|---|
`chunk-stop` | If enabled, chunking does not create duplicate RL rules that differ only in numeric-indifferent preference value | `on`, `off` | `on` |
`decay-mode` | How the learning rate changes over time | `normal`, `exponential`, `logarithmic`, `delta-bar-delta` | `normal` |
`discount-rate` | Temporal discount (gamma) | [0, 1] | 0.9 |
`eligibility-trace-decay-rate` | Eligibility trace decay factor (lambda) | [0, 1] | 0 |
`eligibility-trace-tolerance` | Smallest eligibility trace value not considered 0 | (0, inf) | 0.001 |
`hrl-discount` | Discounting of RL updates over time in impassed states | `on`, `off` | `off` |
`learning` | Reinforcement learning enabled | `on`, `off` | `off` |
`learning-rate` | Learning rate (alpha) | [0, 1] | 0.3 |
`step-size-parameter` | Secondary learning rate | [0, 1] | 1 |
`learning-policy` | Value update policy | `sarsa`, `q-learning`, `off-policy-gq-lambda`, `on-policy-gq-lambda` | `sarsa` |
`meta` | Store rule metadata in header string | `on`, `off` | `off` |
`meta-learning-rate` | Delta-Bar-Delta learning parameter | [0, 1] | 0.1 |
`temporal-discount` | Discount RL updates over gaps | `on`, `off` | `on` |
`temporal-extension` | Propagation of RL updates over gaps | `on`, `off` | `on` |
`trace` | Update the trace | `on`, `off` | `off` |
`update-log-path` | File to log information about RL rule updates | `""`, `<filename>` | `""` |
Parameter | Description | Possible values | Default |
---|---|---|---|
`apoptosis` | Automatic excising of productions via base-level decay | `none`, `chunks`, `rl-chunks` | `none` |
`apoptosis-decay` | Base-level decay parameter | [0, 1] | 0.5 |
`apoptosis-thresh` | Base-level threshold parameter (negates supplied value) | (0, inf) | 2 |
Apoptosis is a process that automatically excises chunks via the base-level decay model (where rule firings are the activation events). A value of `chunks` applies this to any chunk, whereas `rl-chunks` means only chunks that are also RL rules can be forgotten.
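For example, a minimal setup that allows only RL chunks to be forgotten might look like the following; the decay and threshold values shown are simply the defaults from the table above:

```
# forget RL chunks via base-level decay
rl --set apoptosis rl-chunks
rl --set apoptosis-decay 0.5
rl --set apoptosis-thresh 2
```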
Soar tracks some RL statistics over the lifetime of the agent. These can be accessed using `rl --stats <statistic>`. Running `rl --stats` without a statistic will list the values of all statistics.
Statistic | Description |
---|---|
`update-error` | Difference between target and current values in last RL update |
`total-reward` | Total accumulated reward in the last update |
`global-reward` | Total accumulated reward since agent initialization |
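For example, to print all statistics and then a single one by name:

```
# print every statistic, then just the reward accumulated since initialization
rl --stats
rl --stats global-reward
```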
The `delta-bar-delta` decay mode is an experimental feature of Soar RL. It is based on the work in Richard S. Sutton's paper "Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta", available online at http://webdocs.cs.ualberta.ca/~sutton/papers/sutton-92a.pdf.
Delta-Bar-Delta (DBD) is implemented in Soar RL as a decay mode. It changes the way all the rules in the eligibility trace get their values updated. To implement this, the agent gets an additional learning parameter, `meta-learning-rate`, and each rule gets two additional decay parameters: beta and h. The meta learning rate is set manually; the per-rule parameters are handled automatically by the DBD algorithm. The key idea is that the meta parameters track how much a rule's RL value has been updated recently; if a rule is updated in the same direction multiple times in a row, then subsequent updates in that direction have more effect. So DBD acts somewhat like momentum for the learning rate.
To enable DBD, use `rl --set decay-mode delta-bar-delta`. To change the meta learning rate, use e.g. `rl --set meta-learning-rate 0.1`. When you execute `rl`, the "Experimental" section of the output shows the current settings for `decay-mode` and `meta-learning-rate`. Also, if a rule is printed concisely (e.g. by executing `p`), the rule is an RL rule, and the decay mode is set to `delta-bar-delta`, then instead of printing the rule name followed by the update count and the RL value, Soar prints the rule name, beta, h, update count, and RL value.
Note that DBD is a different feature from `meta`. The `meta` parameter determines whether metadata about a production is stored in its header string. If `meta` is on and DBD is on, then each rule's beta and h values are stored in the header string in addition to the update count, so you can print out the rule, source it later, and that metadata about the rule will still be in place.
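Putting these settings together (the meta learning rate shown is just the default):

```
# enable Delta-Bar-Delta decay and keep per-rule metadata in rule headers
rl --set decay-mode delta-bar-delta
rl --set meta-learning-rate 0.1
rl --set meta on
```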
Linear GQ(λ) is a gradient-based off-policy temporal-difference learning algorithm, developed by Hamid Maei and described by Adam White and Rich Sutton (https://arxiv.org/pdf/1705.03967.pdf). This option supports off-policy learning, which is a good fit when agent training performance is less important than agent execution performance. GQ(λ) converges despite irreversible actions and other difficulties in approaching the training goal; convergence should be guaranteed for stable environments.
To change the secondary learning rate that applies only when learning with GQ(λ), set the `step-size-parameter`. It controls how fast the secondary set of weights changes, which allows GQ(λ) to improve the rate of convergence to a stable policy. Small learning rates such as 0.01 or even lower seem to be good practice.
`rl --set learning-policy off-policy-gq-lambda` will set Soar to use linear GQ(λ). It is preferable to use GQ(λ) over `sarsa` or `q-learning` when multiple weights are active in parallel and the sequences of actions required for agents to be successful are sufficiently complex that divergence is possible. To take full advantage of GQ(λ), it is important to set `step-size-parameter` to a reasonable value for a secondary learning rate, such as 0.01.
`rl --set learning-policy on-policy-gq-lambda` will set Soar to use a simplification of GQ(λ) that makes it on-policy while otherwise functioning identically. It is still important to set `step-size-parameter` to a reasonable value for a secondary learning rate, such as 0.01.
For more information, please see the relevant slides at http://www-personal.umich.edu/~bazald/b/publications/009-sw35-gql.pdf
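A typical configuration, using the small secondary learning rate suggested above:

```
# switch to off-policy GQ(lambda) with a small secondary learning rate
rl --set learning-policy off-policy-gq-lambda
rl --set step-size-parameter 0.01
```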
The `update-log-path` parameter sets a path to a file that Soar RL writes to whenever a production's RL value is updated. This can be useful for logging these updates without having to capture all of Soar's output and parse it for them. Enable with e.g. `rl --set update-log-path rl_log.txt`. Disable with `rl --set update-log-path ""`, that is, use the empty string `""` as the log path. The current log path appears under the "Experimental" section of the output when you execute `rl`.
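For example (the file name is arbitrary):

```
# start logging RL value updates to a file, then turn logging back off
rl --set update-log-path rl_log.txt
rl --set update-log-path ""
```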
If `rl --set trace on` has been called, then proposed operators will be recorded in the trace for all goal levels. Along with operator names and other attribute-value pairs, transition probabilities derived from their numeric preferences are recorded.
Legal arguments following `rl -t` or `rl --trace` are as follows:
Argument | Description |
---|---|
`print` | Print the trace for the top state |
`clear` | Erase the traces for all goal levels |
`init` | Restart recording from the beginning of the traces for all goal levels |
These may be followed by an optional numeric argument specifying a specific goal level to print, clear, or init. `rl -t init` is called automatically whenever Soar is reinitialized. However, `rl -t clear` is never called automatically.
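For example, assuming the optional goal level follows the argument as described above:

```
# enable trace recording, print the top-state trace,
# then restart recording and erase the trace at goal level 2
rl --set trace on
rl -t print
rl -t init
rl -t clear 2
```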
The format in which the trace is printed is designed to be used by the program `dot`, part of the Graphviz suite. The command `ctf rl.dot rl -t` will print the trace for the top state to the file `rl.dot`. (The default behavior for `rl -t` is to print the trace for the top state.)
Here are some sample `dot` invocations for the top state:

Render the trace | Convert to PDF |
---|---|
`dot -Tps rl.dot -o rl.ps` | `ps2pdf rl.ps` |
`dot -Tsvg rl.dot -o rl.svg` | `inkscape -f rl.svg -A rl.pdf` |
The .svg format works better for large traces.