Merge pull request #83 from gtbook/frank_jul3
Planning and DRL in Ch 6
dellaert authored Jul 9, 2024
2 parents ad88fd0 + 895ed82 commit 09861ac
Showing 14 changed files with 851 additions and 193 deletions.
10 changes: 7 additions & 3 deletions S22_sorter_actions.ipynb
@@ -335,9 +335,13 @@
"| 0 | 0.50 |\n",
"| 3 | 0.05 |\n",
"| 5 | 0.25 |\n",
"| 10 | 0.20 |\n",
"\n",
"\n",
"| 10 | 0.20 |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could apply this same approach for each of the possible actions, and then use the resulting\n",
"PMFs to make decisions about which actions to apply. We will discuss such an approach\n",
"to planning a bit later in this chapter."
30 changes: 24 additions & 6 deletions S27_sorter_summary.ipynb
@@ -69,7 +69,7 @@
"This can be extended to any number of random variables, so that, in theory, all uncertainties\n",
"in the world could be modeled by a single joint probability distribution.\n",
"However, specifying a complete joint probability distribution is exceedingly expensive.\n",
"Consider that if we have $n$ random variables which take on $N_1, N_2,\\dots N_n$ possible values, respectively\n",
"Consider that if we have $n$ random variables which take on $N_1, N_2,\\dots N_n$ possible values, respectively,\n",
"the size of a table to represent the joint probability distribution\n",
"would be $N_1 \\times N_2 \\times \\dots N_n$, i.e., the size of this data structure\n",
"grows exponentially with the number of random variables."
@@ -93,7 +93,15 @@
"This kind of model is sometimes called a forward sensor model, since it models the behavior of the sensor\n",
"given the state of the world.\n",
"Note that the conditional distribution $P(Z | X=x)$ is itself a valid probability distribution.\n",
"For example, if $Z$ is a discrete random variable, we would have $\\sum_i P(Z=z_i | X=x) =1$.\n"
"For example, if $Z$ is a discrete random variable, we would have $\\sum_i P(Z=z_i | X=x) =1$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also saw how probabilistic models can be incorporated into computational solutions via sampling.\n",
"In particular, we developed an algorithm that uses the cumulative distribution function to generate samples for a given particular probability distribution. We used this sampling algorithm to investigate the empirical relationship between outcomes generated from a probability distribution and the expected value of the distribution."
]
},
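For reference, a sampler of the kind described above can be written in a few lines of Python. This is a hedged sketch, not the notebook's own implementation; the PMF below simply echoes the small outcome table shown earlier in this diff.

```python
import random

# Discrete PMF over outcomes (same values as the table in S22_sorter_actions above).
pmf = {0: 0.50, 3: 0.05, 5: 0.25, 10: 0.20}

def sample(pmf):
    """Draw one sample by inverting the cumulative distribution function."""
    u = random.random()            # uniform draw in [0, 1)
    cumulative = 0.0
    for outcome, p in pmf.items():
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome                 # guard against floating-point round-off

# The empirical mean of many samples should approach the expected value.
samples = [sample(pmf) for _ in range(10_000)]
expected = sum(x * p for x, p in pmf.items())
print(sum(samples) / len(samples), "vs expected value", expected)
```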
{
@@ -124,7 +132,7 @@
"$$x^*_{MAP} = \\arg \\max_x P(x|z)$$\n",
"\n",
"Note that this computation requires that we have access to the prior probability distribution $P(X)$.\n",
"In cases where this is not available, we may instead choose to compute the *maximum likelihood* estimate (or *MLE),\n",
"In cases where this is not available, we may instead choose to compute the *maximum likelihood* estimate (or *MLE*),\n",
"which is given by\n",
"\n",
"$$x^*_{MLE} = \\arg \\max_x L(x;z) = \\arg \\max_x P(Z=z|X) $$\n",
@@ -143,16 +151,26 @@
"source": [
"\n",
"## Background and History\n",
"The origins of probability theory can be traced back to games of chance in ancient societies, but the first real attempts to formalize the study of probability came during the Renaissance, in the works of mathematicians such as Cardano, Pascal, and Fermat. These early mathematical approaches still focused mainly on games of chance, with a strong empirical flavor. The line between statistics and probability theory was a blurry one in Renaissance times.\n",
"The origins of probability theory can be traced back to games of chance in ancient societies, but the first real attempts to formalize the study of probability came during the Renaissance, in the works of mathematicians such as Cardano, Pascal, and Fermat. These early mathematical approaches focused mainly on games of chance, with a strong empirical flavor. The line between statistics and probability theory was a blurry one in Renaissance times.\n",
"\n",
"It was Bayes, in the eighteenth century, who pioneered the idea of using evidence, together with ideas from probability theory, to draw inferences. While what we now know as Bayes Theorem is a general result that does not depend on the specific probability distributions under consideration, Bayes studied the specific case of inferring the parameter of a binomial distribution given observed outcomes. The more general development is due largely to Laplace, in the years following the death of Bayes.\n",
"\n",
"What we think of today as probability theory was formalized in the early 1930’s by Kolmogorov. It was Kolmogorov who formulated the three axioms that form the basis for modern probability theory:\n",
"1.\tFor any event $A$, $P(A) \\geq 0$.\n",
"2.\t$P(\\Omega)=1$.\n",
"3.\tFor disjoint events $A$ and $B$, $P(A\\cup B) = P(A) + P(B)$.\n",
"\n",
"Equipped with these three axioms and a background in *real analysis*, one can derive most all of the important results that comprise modern probability theory.\n",
"The Renaissance mathematicians were interested in understanding random phenomena. Kolmogorov, a Russian mathematician, was interested in establishing a rigorous theoretical foundation for probability theory. It was Bayes who pioneered the idea of using evidence, together with ideas from probability theory, to draw inferences. \n",
"The Renaissance mathematicians were interested in understanding random phenomena. Kolmogorov, a Russian mathematician, was interested in establishing a rigorous theoretical foundation for probability theory. \n",
"\n",
"One of the best recent books we have found useful is the book [\"Introduction to Probability for Data Science\"](https://probability4datascience.com/index.html) {cite:p}`Chan23book_prob4ds`.\n"
"One of the best recent books we have found useful is the book [\"Introduction to Probability for Data Science\"](https://probability4datascience.com/index.html) {cite:p}`Chan23book_prob4ds`.\n",
"The classic reference for statistical reasoning, including \n",
"maximum likelihood estimation and Bayesian decision theory \n",
"is {cite:p}`duda2012pattern`.\n",
"Anders Hald has written two volumes on the history of probability theory, \n",
"one that covers the period from Bernoulli, De Moivre and Laplace to the mid-twentieth century,\n",
"with particular attention to estimation problems {cite:p}`HaldBook98`,\n",
"and one that covers developments prior to 1750 {cite:p}`HaldBook03`."
]
},
{
16 changes: 10 additions & 6 deletions S32_vacuum_actions.ipynb
@@ -1118,14 +1118,18 @@
"entries all four CPT tables have for the Bayes net above. \n",
"This is shown in the following table.\n",
"\n",
"| CPT | \\# entries |\n",
"|-|-|\n",
"| *P(Z)* | 9 |\n",
"| CPT | \\# entries |\n",
"|-------------|------------|\n",
"| *P(Z)* | 9 |\n",
"| *P(Y\\|Z)* | 90 |\n",
"| *P(X\\|Y,Z)* | 900 |\n",
"| *P(W\\|X,Z)* | 900 |\n",
"\n",
"\n",
"| *P(W\\|X,Z)* | 900 |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, $P(X|Y,Z)$ has 900 entries, i.e., 9\n",
"(independent) entries for each of 100 possible combinations of $Y$ and\n",
"$Z$. Hence, the total number of parameters we need is only $1,899$,\n",
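The counts in the table above can be reproduced with a few lines of Python; this sketch assumes, as the surrounding text states, 10 possible values per variable and hence 9 independent entries per conditional row.

```python
# Each variable takes 10 values, so each row of a CPT has 10 - 1 = 9 free entries.
num_values = 10
free_per_row = num_values - 1

cpt_entries = {
    "P(Z)":       free_per_row,                     # 9
    "P(Y|Z)":     free_per_row * num_values,        # 90
    "P(X|Y,Z)":   free_per_row * num_values ** 2,   # 900
    "P(W|X,Z)":   free_per_row * num_values ** 2,   # 900
}

total = sum(cpt_entries.values())
print(cpt_entries)
print("total:", total)   # 1899, versus 10**4 - 1 free parameters for the full joint
```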
20 changes: 10 additions & 10 deletions S35_vacuum_decision.ipynb
@@ -85,12 +85,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Previously in this chapter, we described how conditional probability distributions can be used to model uncertainty\n",
"in the effects of actions. We defined the belief state $b_{k+1}$ to be the posterior probability distribution\n",
"for the state at time $k+1$ given the sequence of actions $a_1 \\dots a_k$.\n",
"Previously, in Section 3.2, we described how conditional probability distributions can be used to model uncertainty\n",
"in the effects of actions. We defined the belief state $b_{k+1}$ to be the posterior probability $P(X_{k+1} | a_1 \\dots a_k)$ for the state $X_{k+1}$ at time $k+1$ given the sequence of actions $a_1 \\dots a_k$.\n",
"In every example, the sequence of actions was predetermined, and we merely calculated probabilities\n",
"associated with performing these actions from some specified initial state, described\n",
"by a probability distribution $P(X_1)$.\n",
"associated with performing these actions from some specified initial state, governed\n",
"by the probability distribution $P(X_1)$.\n",
"\n",
"In this section, we consider the problem of choosing which actions to execute.\n",
"Making these decisions requires that we have quantitative criteria for evaluating actions and their effects.\n",
@@ -454,10 +453,11 @@
"and to the term $\\gamma^l R(x_{k+l},a_{k+l},x_{k+l+1})$ as a **discounted reward.**\n",
"Note that for $\\gamma = 1$, there is no discount, and all future rewards are treated with equal weight.\n",
"\n",
"We now use this to define a more general utility function.\n",
"Suppose the robot executes a sequence of actions, $a_1, \\dots, a_n$,\n",
"starting in state $X_1=x_1$, and passes through\n",
"state sequence $x_1,x_2,x_3\\dots x_{n+1}$.\n",
"We define the utility function $U: {\\cal A}^n \\times {\\cal X}^{n+1} \\rightarrow \\mathbb{R}$ as\n",
"We define the **utility function** $U: {\\cal A}^n \\times {\\cal X}^{n+1} \\rightarrow \\mathbb{R}$ as\n",
"\n",
"$$\n",
"U(a_1, \\dots, a_n, x_1, \\dots x_{n+1}) =\n",
@@ -484,7 +484,7 @@
"We can, again, deal with this difficulty by computing the *expected* utility for a\n",
"given sequence of actions, \n",
"$E[U(a_1, \\dots, a_n, X_1, \\dots X_n)]$.\n",
"We can now formulate a slightly more sophisticated version of our planning problem. Choose the sequence of actions $a_{1:n}$ that maximizes the expected utility:\n",
"We can now formulate a slightly more sophisticated version of our planning problem: *choose the sequence of actions $a_{1:n}$ that maximizes the expected utility*:\n",
"\n",
"$$ a_1^*, \\dots a_n^* = \\arg \\max_{a_1 \\dots a_n \\in {\\cal A}^n} E[U(a_1, \\dots, a_n, X_1, \\dots X_{n+1})]\n",
"$$\n",
@@ -493,7 +493,7 @@
"and choosing the sequence that maximizes the expectation.\n",
"Obviously this is not a computationally tractable approach.\n",
"Not only does the number of possible action sequences grow exponentially with the time horizon $n$,\n",
"but the computation of the expectation for a specific action sequence is also computationally heavy.\n",
"but computing the expectation for a specific action sequence is also computationally heavy.\n",
"We can, however, approximate this optimization process using the concept of rollouts, as we will now see."
]
},
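As a rough illustration of the rollout idea introduced above, the following hedged Python sketch estimates the expected utility of a fixed action sequence by Monte Carlo simulation. The transition sampler and reward function are hypothetical placeholders, not the chapter's actual models.

```python
import random

def discounted_utility(rewards, gamma=0.9):
    """Sum of discounted rewards along one trajectory."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def estimate_expected_utility(x1, actions, transition_sample, reward, num_rollouts=1000):
    """Monte Carlo estimate of E[U(a_1..a_n, X_1..X_{n+1})] for a fixed action sequence."""
    total = 0.0
    for _ in range(num_rollouts):
        x, rewards = x1, []
        for a in actions:
            x_next = transition_sample(x, a)     # sample X_{k+1} ~ P(. | a_k, x_k)
            rewards.append(reward(x, a, x_next))
            x = x_next
        total += discounted_utility(rewards)
    return total / num_rollouts

# Tiny made-up example: two states, reward 1 whenever we land in state 1.
def toy_transition(x, a):
    return random.choice([0, 1])

def toy_reward(x, a, x_next):
    return 1.0 if x_next == 1 else 0.0

print(estimate_expected_utility(0, ["A", "A", "A"], toy_transition, toy_reward))
```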
@@ -802,7 +802,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -907,7 +907,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have a given policy, $\\pi$, we can compute a policy rollout in a manner analogous to computing control tape rollouts described above.\n",
"Once we have a given policy, $\\pi$, we can compute a *policy rollout* in a manner analogous to computing control tape rollouts described above.\n",
"In particular, at each state, instead of sampling from the distribution\n",
"$P(X_{k+1} | a_k, x_k)$ we sample from the distribution\n",
"$P(X_{k+1} | \\pi(x_k), x_k)$. In other words, instead of simulating a pre-specified action $a_k$, we choose $a_k = \\pi(x_k)$.\n",
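A policy rollout differs from the control-tape rollout only in where each action comes from; here is a minimal sketch, with a placeholder policy and transition model that are not from the notebook:

```python
import random

def policy_rollout(x1, policy, transition_sample, horizon):
    """Simulate one trajectory, choosing a_k = policy(x_k) at every step."""
    trajectory = [x1]
    x = x1
    for _ in range(horizon):
        a = policy(x)                    # the action comes from the policy, not a fixed tape
        x = transition_sample(x, a)      # sample X_{k+1} ~ P(. | policy(x_k), x_k)
        trajectory.append(x)
    return trajectory

# Made-up two-state example with a trivial policy.
def toy_policy(x):
    return "move" if x == 0 else "stay"

def toy_transition(x, a):
    return random.choice([0, 1]) if a == "move" else x

print(policy_rollout(0, toy_policy, toy_transition, 5))
```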
20 changes: 10 additions & 10 deletions S36_vacuum_RL.ipynb
@@ -88,10 +88,10 @@
"<img src=\"Figures3/S36-iRobot_vacuuming_robot-04.jpg\" alt=\"Splash image with intelligent looking robot\" width=\"40%\" align=center style=\"vertical-align:middle;margin:10px 0px\">\n",
"\n",
"When a Markov Decision Process is fully specified we can *compute* an optimal policy.\n",
"Below we first define optimal value functions and examine its properties, most notably the Bellman equation.\n",
"Below we first define optimal value functions and examine their properties, most notably the Bellman equation.\n",
"We then discuss value iteration and policy iteration, two algorithms to calculate the optimal value function and its associated optimal policy. However, both these algorithms need a fully-defined MDP.\n",
"\n",
"When the MPD is not known in advance, however, we have to *learn* an optimal policy over time. There are two main approaches: model-based and model-free."
"When the MDP is not known in advance, however, we have to *learn* an optimal policy over time. There are two main approaches: model-based and model-free."
]
},
{
@@ -120,7 +120,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This principle enables a key step in the derivation of a recursive formulation for the optimal policy. Indeed, the **optimal value function** $V^*: {\\cal X} \\rightarrow {\\cal A}$\n",
"This principle enables a key step in deriving a recursive formulation for the optimal policy. Indeed, the **optimal value function** $V^*: {\\cal X} \\rightarrow {\\cal A}$\n",
"is merely the value function for the optimal policy.\n",
"This can be written mathematically as\n",
"\n",
@@ -142,7 +142,7 @@
"the corresponding value function at $x'$ will be the optimal value function for $x'$.\n",
"For the fourth line,\n",
"because the value function has been written in recursive form,\n",
"$\\pi$ is only applied to the current state (i.e., when $\\pi$ is evaluated in the optimization,\n",
"$\\pi$ is applied only to the current state (i.e., when $\\pi$ is evaluated in the optimization,\n",
"it always appears as $\\pi(x)$).\n",
"Therefore, we can write the optimization\n",
"as a maximization with respect to the *action* applied in the *current state*, rather than as a\n",
@@ -240,15 +240,15 @@
"The second method, value iteration, iteratively improves an estimate of $V^*$, ultimately converging to the optimal value function.\n",
"Both, however, need access to the MDP's transition probabilities and the reward function.\n",
"\n",
"**Policy Iteration** starts with an initial guess at the optimal policy, and then iteratively improve our guess until no further improvements are possible.\n",
"**Policy Iteration** starts with an initial guess at the optimal policy, and then iteratively improves our guess until no further improvements are possible.\n",
"In particular, policy iteration generates a sequence of policies\n",
"$\\pi^0, \\pi^1, \\dots \\pi^n$, such that $\\pi^{i+1}$ is better than policy $\\pi^i$.\n",
"This process ends when no further improvement is possible, which\n",
"occurs when $\\pi^{i+1} = \\pi^i.$\n",
"\n",
"To improve the policy $\\pi^i$, we update the action chosen *for each state* by applying\n",
"Bellman's equation using $\\pi^i$ in place of $\\pi^*$.\n",
"The can be achieved with the following algorithm:\n",
"This can be achieved with the following algorithm:\n",
"\n",
"Start with a random policy $\\pi^0$ and $i=0$, and repeat until convergence:\n",
"1. Compute the value function $V^{\\pi^i}$\n",
Expand Down Expand Up @@ -319,7 +319,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"On the other hand, if we have a guess for the initial policy, we can intialize\n",
"On the other hand, if we have a guess for the initial policy, we can initialize\n",
"$\\pi^0$ accordingly.\n",
"For example, we can start with a not-so-smart `always_right` policy:"
]
@@ -448,7 +448,7 @@
"Instead, we often use a condition such as $|V^{i+1} - V^i| < \\epsilon$, for some small value of $\\epsilon$\n",
"as the termination condition.\n",
"\n",
"Finally, note that we can once again use the Q-values to obtain a very concise description for the value update:\n",
"Finally, note that we can once again use the Q-values to obtain a concise description for the value update:\n",
"\n",
"$$\n",
"V^{i+1}(x) \\leftarrow \\max_a Q(x, a; V^i).\n",
@@ -779,15 +779,15 @@
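The update $V^{i+1}(x) \leftarrow \max_a Q(x, a; V^i)$ shown above translates almost directly into code. Below is a sketch under the same hypothetical table representation as before (again, not the notebook's implementation):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, epsilon=1e-6):
    """Iterate V^{i+1}(x) = max_a Q(x, a; V^i) until |V^{i+1} - V^i| < epsilon."""
    S, A, _ = T.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (T @ V)              # Q(x, a; V^i) for every state and action
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new
```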
"cell_type": "markdown",
"metadata": {},
"source": [
"A different, model-free approach is **Q_learning**. In the above we tried to *model* the world by trying estimate the (large) transition and reward tables. However, remember from the previous section that there is a much smaller table of Q-values $Q(x,a)$ that also allow us to act optimally. This is because we can calculate the optimal policy $\\pi^*(x)$ from the optimal Q-values $Q^*(x,a) \\doteq Q(x, a; V^*)$:\n",
"A different, model-free approach is **Q-learning**. In the above we tried to *model* the world by trying estimate the (large) transition and reward tables. However, remember from the previous section that there is a much smaller table of Q-values $Q(x,a)$ that also allow us to act optimally. This is because we can calculate the optimal policy $\\pi^*(x)$ from the optimal Q-values $Q^*(x,a) \\doteq Q(x, a; V^*)$:\n",
"\n",
"$$\n",
"\\pi^*(x) = \\arg \\max_a Q^*(x,a).\n",
"$$\n",
"\n",
"This begs the question whether we can simply learn the Q-values instead, which might be more *sample-efficient*. In other words, we would get more accurate values with less training data, as we have less quantities to estimate.\n",
"\n",
"To do this, remember that the Bellman equation can be written as \n",
"To do this, recall that the Bellman equation can be written as \n",
"\n",
"$$\n",
"V^*(x) = \\max_a Q^*(x,a)\n",
Expand Down