Reinforcement Learning (RL) is "a way of programming agents by reward and punishment without needing to specify how the task is to be achieved" (Kaelbling, Littman, & Moore, 1996).
The basic RL problem involves states (s), actions (a) and rewards (r). In the typical formulation, the goal of RL is to select actions a that move the agent through states s so as to maximize future reward r. Three key additional components in RL are (a minimal interaction-loop sketch in Python follows this list):
- Policy (π): the agent's behaviour, i.e. how it selects an action (a) in a given state (s)
- Value function (V): a prediction of future reward, i.e. how much reward the agent expects from taking action a in state s
- Model: the agent's representation of the environment, learnt from experience
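To make these pieces concrete, here is a minimal sketch of the agent-environment loop, assuming a toy 5-state chain environment invented purely for illustration (the `ChainEnv` class, `random_policy` function and all constants are assumptions, not part of any library): the policy picks action a in state s, the environment returns reward r and the next state, and the accumulated reward is what a value function would try to predict.

```python
import random

# Hypothetical toy environment, invented for illustration: a 5-state chain where
# action 1 moves right, action 0 moves left, and reaching the last state gives
# reward 1 and ends the episode.
class ChainEnv:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state = min(self.state + 1, self.n_states - 1)
        else:
            self.state = max(self.state - 1, 0)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state):
    # pi(a|s): here simply uniform over the two actions, ignoring the state
    return random.choice([0, 1])

env = ChainEnv()
s = env.reset()
total_reward, done = 0.0, False
while not done:
    a = random_policy(s)       # the policy selects action a in state s
    s, r, done = env.step(a)   # the environment returns next state and reward
    total_reward += r          # the return a value function would try to predict
print("episode return:", total_reward)
```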
Depending on which of these components is learnt, three main approaches can be distinguished:
- Value-based RL: estimate the optimal value function Q∗(s, a), the maximum value achievable under any policy (a tabular Q-learning sketch follows below)
- Policy-based RL: search directly for the optimal policy π∗, the policy achieving maximum future reward
- Model-based RL: build a model of the environment and plan (e.g. by lookahead) using that model
There are also actor-critic methods (e.g. DDPG) which learn both a policy and a value function simultaneously.
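As a value-based example, below is a minimal tabular Q-learning sketch on the same toy chain (the `step` function, hyperparameters and episode count are arbitrary assumptions): it estimates Q∗(s, a) from experience and then acts greedily on the learnt values.

```python
import random
from collections import defaultdict

# Tabular Q-learning on the toy 5-state chain (redeclared as a plain step
# function so this sketch runs on its own); alpha, gamma, epsilon are arbitrary.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

Q = defaultdict(float)  # Q[(s, a)] -> estimated action value

def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)               # explore
    return max(ACTIONS, key=lambda a: Q[(s, a)])    # exploit the current estimate

for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)
        s_next, r, done = step(s, a)
        # Q-learning target: bootstrap from the best action in the next state.
        target = r + (0.0 if done else GAMMA * max(Q[(s_next, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s_next

# The greedy policy argmax_a Q(s, a) should now move right along the chain.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])
```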
- Model-free or model-based: From https://www.quora.com/What-is-the-difference-between-model-based-and-model-free-reinforcement-learning:
Model-based learning attempts to model the environment and then, based on that learnt model, chooses the most appropriate policy. Model-free learning attempts to learn the optimal policy directly from experience, without building a model of the environment (a minimal model-based sketch follows below).
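A minimal sketch of the model-based idea, again on the toy chain and with all names and constants as illustrative assumptions: the agent first fits a tabular model of transitions and rewards from logged experience, then plans over that learnt model with value iteration instead of learning values directly from new interaction.

```python
import random
from collections import defaultdict

# Model-based sketch on the toy 5-state chain: learn a tabular model of the
# environment from random experience, then plan on that learnt model with
# value iteration. All names and constants are illustrative assumptions.
N_STATES, ACTIONS, GAMMA = 5, (0, 1), 0.99

def true_step(s, a):  # the real dynamics, unknown to the agent
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

# 1) Learn the model: count observed transitions and average observed rewards.
counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
reward_sum = defaultdict(float)                  # total reward seen for (s, a)
visits = defaultdict(int)                        # visits to (s, a)
for _ in range(5000):
    s, a = random.randrange(N_STATES - 1), random.choice(ACTIONS)
    s_next, r = true_step(s, a)
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def learnt_model(s, a):
    # Estimated transition distribution P(s'|s, a) and mean reward R(s, a).
    probs = {sn: c / visits[(s, a)] for sn, c in counts[(s, a)].items()}
    return probs, reward_sum[(s, a)] / visits[(s, a)]

# 2) Plan by value iteration over the learnt model (pure lookahead, no new data).
V = [0.0] * N_STATES
for _ in range(100):
    for s in range(N_STATES - 1):  # the last state is terminal
        V[s] = max(
            R + GAMMA * sum(p * V[sn] for sn, p in P.items())
            for P, R in (learnt_model(s, a) for a in ACTIONS)
        )
print("planned state values:", [round(v, 2) for v in V])
```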
- On-policy or off-policy: From https://datascience.stackexchange.com/questions/13029/what-are-the-advantages-disadvantages-of-off-policy-rl-vs-on-policy-rl (a SARSA vs Q-learning sketch follows this list):
- On-policy methods:
- attempt to evaluate or improve the policy that is used to make decisions,
- often use soft action choice, i.e. π(s,a) > 0, ∀a,
- commit to always exploring and try to find the best policy that still explores,
- may become trapped in local minima.
- Off-policy methods:
- evaluate one policy while following another, e.g. evaluating the greedy policy while following a more exploratory scheme,
- the policy used for behaviour should be soft (i.e. keep selecting all actions with non-zero probability),
- the behaviour and target policies may not be sufficiently similar,
- may be slower (only the part of a trajectory after the last exploratory action is reliable), but remain more flexible if alternative routes appear.
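To make the distinction concrete, the sketch below compares SARSA (on-policy) and Q-learning (off-policy) on the toy chain; the only difference is whether the update bootstraps from the action the behaviour policy actually takes next or from the greedy action. The environment and all hyperparameters are arbitrary assumptions.

```python
import random
from collections import defaultdict

# On-policy vs off-policy side by side: SARSA bootstraps from the action the
# behaviour policy actually takes next, while Q-learning bootstraps from the
# greedy action regardless of what the behaviour policy does.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def train(off_policy):
    Q = defaultdict(float)
    for _ in range(500):
        s, done = 0, False
        a = epsilon_greedy(Q, s)
        while not done:
            s_next, r, done = step(s, a)
            a_next = epsilon_greedy(Q, s_next)  # what the behaviour policy will do next
            if done:
                bootstrap = 0.0
            elif off_policy:
                bootstrap = max(Q[(s_next, b)] for b in ACTIONS)  # Q-learning target
            else:
                bootstrap = Q[(s_next, a_next)]                   # SARSA target
            Q[(s, a)] += ALPHA * (r + GAMMA * bootstrap - Q[(s, a)])
            s, a = s_next, a_next
    return Q

for name, off in (("SARSA (on-policy)", False), ("Q-learning (off-policy)", True)):
    Q = train(off)
    greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)]
    print(name, "greedy policy per state:", greedy)
```

On this small deterministic chain both methods end up with the same greedy policy; the difference matters more when exploration is costly or when learning from data generated by another policy.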