This file records all major updates and new features, starting from version 0.5. As Tensorforce is still developing, updates and bug fixes for the internal architecture are continuously being implemented, which will not be tracked here in detail.
- Renamed agent argument `reward_preprocessing` to `reward_processing`, and in case of the Tensorforce agent moved it to `reward_estimation[reward_processing]`
- New `categorical` distribution argument `skip_linear` to not add the implicit linear logits layer
- Support for multi-actor parallel environments via new function `Environment.num_actors()`
    - `Runner` uses multi-actor parallelism by default if the environment is multi-actor
- New optional `Environment` function `episode_return()` which returns the true return of the last episode, if the cumulative sum of environment rewards is not a good metric for runner display
- New `vectorized_environment.py` and `multiactor_environment.py` scripts to illustrate how to set up a vectorized/multi-actor environment
- Agent argument `update_frequency` / `update[frequency]` now supports float values > 0.0, which specify the update frequency relative to the batch size (see the sketch below)
- Changed default value for argument `update_frequency` from `1.0` to `0.25` for the DQN, DoubleDQN and DuelingDQN agents
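A minimal sketch of a relative update frequency, assuming a DQN agent and a Gym CartPole environment; the concrete argument values are illustrative only:

```python
from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)

# With batch_size=32 and update_frequency=0.25, an update is performed every
# 0.25 * 32 = 8 timesteps; update_frequency=1.0 would update once per 32 timesteps.
agent = Agent.create(
    agent='dqn', environment=environment,
    memory=10000, batch_size=32, update_frequency=0.25
)
```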
- New arguments `return_processing` and `advantage_processing` (where applicable) for all agent sub-types
- New function `Agent.get_specification()` which returns the agent specification as a dictionary
- New function `Agent.get_architecture()` which returns a string representation of the network layer architecture
- Improved and simplified module specification, for instance: `network=my_module` instead of `network=my_module.TestNetwork`, or `environment=envs.custom_env` instead of `environment=envs.custom_env.CustomEnvironment` (module file needs to be in the same directory or a sub-directory); see the sketch below
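A minimal sketch of the simplified module specification, assuming a working directory that contains `my_module.py` (defining a single network class) and `envs/custom_env.py` (defining a single environment class), mirroring the placeholder names above:

```python
from tensorforce import Agent, Environment

# The module path alone is enough; the class defined inside the module is resolved implicitly.
environment = Environment.create(environment='envs.custom_env', max_episode_timesteps=500)
agent = Agent.create(agent='ppo', environment=environment, network='my_module', batch_size=10)
```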
- New argument `single_output=True` for some policy types which, if `False`, allows the specification of additional network outputs for some/all actions via registered tensors
- `KerasNetwork` argument `model` now supports arbitrary functions, as long as they return a `tf.keras.Model`
- New layer type `SelfAttention` (specification key: `self_attention`)
- Support tracking of non-constant parameter values
- Renamed attribute `episode_rewards` to `episode_returns`, and TQDM status `reward` to `return`
- Extended argument `agent` to support `Agent.load()` keyword arguments, to load an existing agent instead of creating a new one
- Added `action_masking.py` example script to illustrate an environment implementation with built-in action masking
- Fixed customized device placement not being applied to most tensors
- New agent argument `tracking` and corresponding function `tracked_tensors()` to track and retrieve the current value of predefined tensors, similar to `summarizer` for TensorBoard summaries (see the sketch below)
- New experimental values `trace_decay` and `gae_decay` for Tensorforce agent argument `reward_estimation`, soon for other agent types as well
- New options `"early"` and `"late"` for value `estimate_advantage` of Tensorforce agent argument `reward_estimation`
- Changed default value for `Agent.act()` argument `deterministic` from `False` to `True`
- New network type `KerasNetwork` (specification key: `keras`) as wrapper for networks specified as Keras model
- Passing a Keras model class/object as policy/network argument is automatically interpreted as `KerasNetwork` (see the sketch below)
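A minimal sketch of the Keras wrapper; the direct model-object form and the explicit `dict(type='keras', model=...)` form are assumed to be equivalent here:

```python
import tensorflow as tf
from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)

# A plain Keras model used as the agent's policy network.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu')
])

# Passing the model directly is interpreted as KerasNetwork ...
agent = Agent.create(agent='ppo', environment=environment, network=model, batch_size=10)

# ... equivalently, the wrapper can be specified explicitly:
# agent = Agent.create(agent='ppo', environment=environment,
#                      network=dict(type='keras', model=model), batch_size=10)
```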
- Changed `Gaussian` distribution argument `global_stddev=False` to `stddev_mode='predicted'`
- New `Categorical` distribution argument `temperature_mode=None`
- New option for `Function` layer argument `function` to pass a string function expression with argument `x`, e.g. `"(x+1.0)/2.0"` (see the sketch below)
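A minimal sketch of a layer-list network using such a string expression; the layer type key is assumed to be `function`:

```python
from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)

# Rescale the dense layer's tanh output from [-1, 1] to [0, 1] via a string expression.
network = [
    dict(type='dense', size=64, activation='tanh'),
    dict(type='function', function='(x+1.0)/2.0')
]
agent = Agent.create(agent='ppo', environment=environment, network=network, batch_size=10)
```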
- New summary `episode-length` recorded as part of summary label "reward"
- Support for vectorized parallel environments via new function `Environment.is_vectorizable()` and new argument `num_parallel` for `Environment.reset()`
    - See `tensorforce/environments/cartpole.py` for a vectorizable environment example
- `Runner` uses vectorized parallelism by default if `num_parallel > 1`, `remote=None` and the environment supports vectorization
    - See `examples/act_observe_vectorized.py` for more details on the act-observe interaction
- New extended and vectorizable custom CartPole environment via key `custom_cartpole` (work in progress; see the sketch below)
- New environment argument `reward_shaping` to provide a simple way to modify/shape the rewards of an environment; can be specified either as a callable or a string function expression (see the sketch below)
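A minimal sketch of reward shaping via `Environment.create`; both the variable name available to the string expression and the callable signature are assumptions here:

```python
from tensorforce import Environment

# String-expression form: scale every reward by 0.01 (variable name 'reward' assumed).
environment = Environment.create(
    environment='gym', level='CartPole-v1', max_episode_timesteps=500,
    reward_shaping='0.01 * reward'
)

# Callable form with an assumed signature, equivalent to the expression above:
# reward_shaping=(lambda reward: 0.01 * reward)
```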
- New option for command line arguments `--checkpoints` and `--summaries` to add a comma-separated checkpoint/summary filename in addition to the directory
- Added episode lengths to the logging plot besides episode returns
- Fixed temporal horizon handling of RNN layers
- Critical bugfix for late horizon value prediction (including DQN variants and DPG agent) in combination with a baseline RNN
- Fixed GPU problems with scatter operations
- Critical bugfix for DQN variants and DPG agent
- Removed default value `"adam"` for Tensorforce agent argument `optimizer` (since default optimizer argument `learning_rate` removed, see below)
- Removed option `"minimum"` for Tensorforce agent argument `memory`, use `None` instead
- Changed default value for `dqn`/`double_dqn`/`dueling_dqn` agent argument `huber_loss` from `0.0` to `None`
- Removed default value `0.999` for `exponential_normalization` layer argument `decay`
- Added new layer `batch_normalization` (generally should only be used for the agent arguments `reward_processing[return_processing]` and `reward_processing[advantage_processing]`)
- Added `exponential/instance_normalization` layer argument `only_mean` with default `False`
- Added `exponential/instance_normalization` layer argument `min_variance` with default `1e-4`
- Removed default value `1e-3` for optimizer argument `learning_rate`
- Changed default value for optimizer argument `gradient_norm_clipping` from `1.0` to `None` (no gradient clipping)
- Added new optimizer `doublecheck_step` and corresponding argument `doublecheck_update` for the optimizer wrapper
- Removed `linesearch_step` optimizer argument `accept_ratio`
- Removed `natural_gradient` optimizer argument `return_improvement_estimate`
- Added option to specify agent argument `saver` as string, which is interpreted as `saver[directory]` with otherwise default values
- Added default value for agent argument `saver[frequency]` as `10` (save model every 10 updates by default)
- Changed default value of agent argument `saver[max_checkpoints]` from `5` to `10`
- Added option to specify agent argument `summarizer` as string, which is interpreted as `summarizer[directory]` with otherwise default values
- Renamed option of agent argument `summarizer` from `summarizer[labels]` to `summarizer[summaries]` (the term "label" stems from an earlier version and is outdated and confusing by now)
- Changed interpretation of agent argument `summarizer[summaries] = "all"` to include only numerical summaries, so all summaries except "graph"
- Changed default value of agent argument `summarizer[summaries]` from `["graph"]` to `"all"`
- Changed default value of agent argument `summarizer[max_summaries]` from `5` to `7` (number of different colors in TensorBoard)
- Added option `summarizer[filename]` to agent argument `summarizer`
- Added option to specify agent argument `recorder` as string, which is interpreted as `recorder[directory]` with otherwise default values (see the sketch of the string shorthands below)
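A minimal sketch of the string shorthands; each string is interpreted as the respective `[directory]` value with otherwise default values:

```python
from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)

agent = Agent.create(
    agent='ppo', environment=environment, batch_size=10,
    saver='checkpoints',      # same as saver=dict(directory='checkpoints')
    summarizer='summaries',   # same as summarizer=dict(directory='summaries')
    recorder='recordings'     # same as recorder=dict(directory='recordings')
)
```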
- Added `--checkpoints`/`--summaries`/`--recordings` command line arguments to enable saver/summarizer/recorder agent argument specification separate from the core agent configuration
- Added `save_load_agent.py` example script to illustrate regular agent saving and loading
- Fixed problem with optimizer argument `gradient_norm_clipping` not being applied correctly
- Fixed problem with `exponential_normalization` layer not updating moving mean and variance correctly
- Fixed problem with `recent` memory for timestep-based updates sometimes sampling invalid memory indices
- Removed agent arguments `execution`, `buffer_observe`, `seed`
- Renamed agent arguments `baseline_policy`/`baseline_network`/`critic_network` to `baseline`/`critic`
- Renamed agent `reward_estimation` arguments `estimate_horizon` to `predict_horizon_values`, `estimate_actions` to `predict_action_values`, `estimate_terminal` to `predict_terminal_values`
- Renamed agent argument `preprocessing` to `state_preprocessing`
- Default agent preprocessing `linear_normalization`
- Moved agent arguments for reward/return/advantage processing from `preprocessing` to `reward_preprocessing` and `reward_estimation[return_/advantage_processing]`
- New agent argument `config` with values `buffer_observe`, `enable_int_action_masking`, `seed` (see the sketch below)
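A minimal sketch of the consolidated `config` argument; the illustrated values are arbitrary and their exact accepted types are an assumption here:

```python
from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)

agent = Agent.create(
    agent='ppo', environment=environment, batch_size=10,
    config=dict(buffer_observe=100, enable_int_action_masking=False, seed=42)
)
```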
- Renamed PPO/TRPO/DPG argument `critic_network`/`critic_optimizer` to `baseline`/`baseline_optimizer`
- Renamed PPO argument `optimization_steps` to `multi_step`
- New TRPO argument `subsampling_fraction`
- Changed agent argument `use_beta_distribution` default to `False`
- Added double DQN agent (`double_dqn`)
- Removed `Agent.act()` argument `evaluation`
- Removed agent function arguments `query` (functionality removed)
- Agent saver functionality changed (Checkpoint/SavedModel instead of Saver/Protobuf): `save`/`load` functions and `saver` argument changed
- Default behavior when specifying `saver` is not to load the agent, unless the agent is created via `Agent.load`
- Agent summarizer functionality changed: `summarizer` argument changed, some summary labels and other options removed
- Renamed RNN layers `internal_{rnn/lstm/gru}` to `rnn/lstm/gru` and `rnn/lstm/gru` to `input_{rnn/lstm/gru}`
- Renamed `auto` network argument `internal_rnn` to `rnn`
- Renamed `(internal_)rnn/lstm/gru` layer argument `length` to `horizon`
- Renamed `update_modifier_wrapper` to `optimizer_wrapper`
- Renamed `optimizing_step` to `linesearch_step`, and `UpdateModifierWrapper` argument `optimizing_iterations` to `linesearch_iterations`
- Optimizer `subsampling_step` accepts both absolute (int) and relative (float) fractions
- Objective `policy_gradient` argument `ratio_based` renamed to `importance_sampling`
- Added objectives `state_value` and `action_value`
- Added `Gaussian` distribution arguments `global_stddev` and `bounded_transform` (for improved bounded action space handling)
- Changed default memory `device` argument to `CPU:0`
- Renamed rewards summaries
- `Agent.create()` accepts an act-function as `agent` argument for recording
- Singleton states and actions are now consistently handled as singletons
- Major change to policy handling and defaults, in particular `parametrized_distributions`, new default policies `parametrized_state/action_value`
- Combined `long` and `int` type
- Always wrap environment in `EnvironmentWrapper` class
- Changed `tune.py` arguments
- Changed independent mode of `agent.act` to use final values of dynamic hyperparameters and avoid TensorFlow conditions
- Extended `"tensorflow"` format of `agent.save` to include an optimized Protobuf model with an act-only graph as `.pb` file, and `Agent.load` format `"pb-actonly"` to load an act-only agent based on the Protobuf model
- Support for custom summaries via new `summarizer` argument value `custom` to specify the summary type, and `Agent.summarize(...)` to record summary values
- Added min/max-bounds for dynamic hyperparameters to assert a valid range and infer other arguments
- Argument `batch_size` now mandatory for all agent classes
- Removed `Estimator` argument `capacity`, now always automatically inferred
- Internal changes related to agent arguments `memory`, `update` and `reward_estimation`
- Changed the default `bias` and `activation` argument of some layers
- Fixed issues with `sequence` preprocessor
- DQN and dueling DQN properly constrained to `int` actions only
- Added `use_beta_distribution` argument with default `True` to many agents and the `ParametrizedDistributions` policy, so that the default can be changed
- DQN/DuelingDQN/DPG argument `memory` now required to be specified explicitly, plus `update_frequency` default changed
- Removed (temporarily) `conv1d/conv2d_transpose` layers due to TensorFlow gradient problems
- `Agent`, `Environment` and `Runner` can now be imported via `from tensorforce import ...`
- New generic reshape layer available as `reshape`
- Support for batched version of `Agent.act` and `Agent.observe`
- Support for parallelized remote environments based on Python's `multiprocessing` and `socket` (replacing `tensorforce/contrib/socket_remote_env/` and `tensorforce/environments/environment_process_wrapper.py`), available via `Environment.create(...)`, `Runner(...)` and `run.py` (see the sketch below)
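A minimal sketch of parallel execution over remote multiprocessing environments via `Runner`; the argument names `num_parallel` and `remote`, as well as the concrete values, are assumptions based on this entry:

```python
from tensorforce import Runner

# Each of the four environment copies runs in its own process.
runner = Runner(
    agent=dict(agent='ppo', batch_size=10),
    environment=dict(environment='gym', level='CartPole-v1'),
    max_episode_timesteps=500,
    num_parallel=4, remote='multiprocessing'
)
runner.run(num_episodes=200)
runner.close()
```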
- Removed `ParallelRunner` and merged functionality with `Runner`
- Changed `run.py` arguments
- Changed independent mode for `Agent.act`: additional argument `internals` and corresponding return value, initial internals via `Agent.initial_internals()`, `Agent.reset()` not required anymore (see the sketch below)
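A minimal sketch of independent-mode acting with explicit internals handling, assuming an already-created `agent` and `environment`; the `independent=True` flag is assumed to select independent mode:

```python
# Independent acting does not affect the agent's internal training state;
# recurrent internals are passed in and returned explicitly.
internals = agent.initial_internals()
states = environment.reset()

actions, internals = agent.act(states=states, internals=internals, independent=True)
```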
not required anymore - Removed
deterministic
argument forAgent.act
unless independent mode - Added
format
argument tosave
/load
/restore
with supported formatstensorflow
,numpy
andhdf5
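A minimal sketch of saving and loading with an explicit format, assuming an already-created `agent` and `environment`; the `directory` and `environment` arguments are assumptions here:

```python
# Save the agent's variables in NumPy format, then restore the agent later.
agent.save(directory='saved-model', format='numpy')
agent.close()

agent = Agent.load(directory='saved-model', format='numpy', environment=environment)
```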
- Changed `save` argument `append_timestep` to `append` with default `None` (instead of `'timesteps'`)
- Added `get_variable` and `assign_variable` agent functions
- Added optional `memory` argument to various agents
- Improved summary labels, particularly `"entropy"` and `"kl-divergence"`
- `linear` layer now accepts tensors of rank 1 to 3
- Network output / distribution input does not need to be a vector anymore
- Transposed convolution layers (`conv1d/2d_transpose`)
- Parallel execution functionality contributed by @jerabaul29, currently under `tensorforce/contrib/`
- Accept string for runner `save_best_agent` argument to specify a best-model directory different from the `saver` configuration
- `saver` argument `steps` removed and `seconds` renamed to `frequency`
- Moved `Parallel/Runner` argument `max_episode_timesteps` from `run(...)` to the constructor
- New `Environment.create(...)` argument `max_episode_timesteps`
- TensorFlow 2.0 support
- Improved TensorBoard summaries recording
- Summary labels `graph`, `variables` and `variables-histogram` temporarily not working
- TF-optimizers updated to TensorFlow 2.0 Keras optimizers
- Added TensorFlow Addons dependency, and support for TFA optimizers
- Changed unit of `target_sync_frequency` from timesteps to updates for the `dqn` and `dueling_dqn` agents
- Improved unittest performance
- Added `updates` and renamed `timesteps`/`episodes` counters for agents and runners
- Renamed `critic_{network,optimizer}` arguments to `baseline_{network,optimizer}`
- Added Actor-Critic (`ac`), Advantage Actor-Critic (`a2c`) and Dueling DQN (`dueling_dqn`) agents
- Improved "same" baseline optimizer mode and added optional weight specification
- Reuse layer now global for parameter sharing across modules
- New block layer type (`block`) for easier sharing of layer blocks
- Renamed `PolicyAgent/-Model` to `TensorforceAgent/-Model`
- New `Agent.load(...)` function, saving includes the agent specification
- Removed `PolicyAgent` argument `(baseline-)network`
- Added policy argument `temperature`
- Removed `"same"` and `"equal"` options for `baseline_*` arguments and changed internal baseline handling
- Combined `state/action_value` into a `value` objective with argument `value` either `"state"` or `"action"`
- Fixed setup.py packages value
- DQFDAgent removed (temporarily)
- DQNNstepAgent and NAFAgent part of DQNAgent
- Agents need to be initialized via `agent.initialize()` before application
- States/actions of type `int` require an entry `num_values` (instead of `num_actions`)
- `Agent.from_spec()` changed and renamed to `Agent.create()`
- `Agent.act()` argument `fetch_tensors` changed and renamed to `query`, `index` renamed to `parallel`, `buffered` removed
- `Agent.observe()` argument `index` renamed to `parallel`
- `Agent.atomic_observe()` removed
- `Agent.save/restore_model()` renamed to `Agent.save/restore()`
- `update_mode` renamed to `update`
- `states_preprocessing` and `reward_preprocessing` changed and combined to `preprocessing`
- `actions_exploration` changed and renamed to `exploration`
- `execution` entry `num_parallel` replaced by a separate argument `parallel_interactions`
- `batched_observe` and `batching_capacity` replaced by argument `buffer_observe`
- `scope` renamed to `name`
- `update_mode` replaced by `batch_size`, `update_frequency` and `start_updating`
- `optimizer` removed, implicitly defined as `'adam'`, `learning_rate` added
- `memory` defines the capacity of the implicitly defined memory `'replay'`
- `double_q_model` removed (temporarily); see the DQN sketch below
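A minimal sketch of a DQN-style agent reflecting the entries above (timestep-based updates via `batch_size`/`update_frequency`/`start_updating`, memory given as a capacity, `learning_rate` in place of an explicit optimizer), assuming an already-created `environment`; the concrete values are illustrative:

```python
from tensorforce.agents import Agent

agent = Agent.create(
    agent='dqn', environment=environment,
    memory=10000,            # capacity of the implicitly defined 'replay' memory
    batch_size=32, update_frequency=4, start_updating=1000,
    learning_rate=1e-3       # optimizer is implicitly 'adam'
)
```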
- New mandatory argument `max_episode_timesteps`
- `update_mode` replaced by `batch_size` and `update_frequency`
- `memory` removed
- `baseline_mode` removed
- `baseline` argument changed and renamed to `critic_network`
- `baseline_optimizer` renamed to `critic_optimizer`
- `gae_lambda` removed (temporarily)
- `step_optimizer` removed, implicitly defined as `'adam'`, `learning_rate` added
- `cg_*` and `ls_*` arguments removed
- `optimizer` removed, implicitly defined as `'adam'`, `learning_rate` added
- Environment properties `states` and `actions` are now functions `states()` and `actions()`
- States/actions of type `int` require an entry `num_values` (instead of `num_actions`)
- New function `Environment.max_episode_timesteps()` (see the environment sketch below)
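A minimal sketch of a custom environment against the revised interface, with `states()`, `actions()` and `max_episode_timesteps()` as functions and `num_values` for `int` actions; the environment dynamics are placeholders:

```python
from tensorforce.environments import Environment

class CustomEnvironment(Environment):

    def states(self):
        return dict(type='float', shape=(4,))

    def actions(self):
        return dict(type='int', num_values=2)

    def max_episode_timesteps(self):
        return 500

    def reset(self):
        self.timestep = 0
        return [0.0, 0.0, 0.0, 0.0]  # placeholder initial state

    def execute(self, actions):
        self.timestep += 1
        next_state = [0.0, 0.0, 0.0, 0.0]  # placeholder transition
        terminal = self.timestep >= self.max_episode_timesteps()
        reward = 1.0
        return next_state, terminal, reward
```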
- ALE, MazeExp, OpenSim, Gym, Retro, PyGame and ViZDoom environments moved to `tensorforce.environments`
- Other environment implementations removed (may be upgraded in the future)
- Improved `run()` API for `Runner` and `ParallelRunner`
- `ThreadedRunner` removed
- `examples` folder (including `configs`) removed, apart from `quickstart.py`
- New `benchmarks` folder to replace parts of the old `examples` folder