---
title: "Dynamic programming and discrete-time optimal control"
bibliography:
- ref_optimal_control.bib
- ref_reinforcement_learning_and_dynamic_programming.bib
format:
html:
html-math-method: katex
code-fold: true
execute:
enabled: false
warning: false
jupyter: julia-1.10
---
In the previous two chapters we explained direct and indirect approaches to discrete-time optimal control. While the former conveniently allows incorporating almost arbitrary constraints, it only provides a control trajectory (a finite sequence of values of the control variable); if feedback is needed, the optimization must be performed in every sampling period (thus implementing the concept of receding horizon or model predictive control, MPC). The latter, in contrast, can lead to a (state) feedback control law, but this only happens in special cases such as regulation of a linear system minimizing a quadratic cost (LQR) while assuming no bound constraints on the control or state variables; in the general case it leads to a two-point boundary value problem, which can only be solved numerically for trajectories.
In this chapter we present yet another approach — dynamic programming (DP). It also allows imposing constraints (in fact, even constraints such as integrality of variables, which are not compatible with the derivative-based optimization toolset exploited so far), and yet it directly leads to feedback controllers.
{{< video https://www.youtube.com/embed/7qRCZ9_wfYI?si=hLh9IPUoSin2VXoY >}}
While in the case of linear systems with a quadratic cost function dynamic programming provides another route to the theoretical results that we already know — the Riccati-equation-based solution to the LQR problem — in the case of general nonlinear dynamical systems with general cost functions the feedback controllers come in the form of look-up tables. This format of a feedback controller already hints at the disadvantages of DP: both the computation and the subsequent use of these look-up tables scale poorly with the dimension of the state space (the so-called curse of dimensionality). Various approximation schemes exist; one promising branch is known as reinforcement learning.
## Bellman's principle of optimality and dynamic programming
We start by considering the following example.
::: {#exm-trip-from-prague-to-ostrava}
## Reusing the plan for a trip from Prague to Ostrava
We are planning a car trip from Prague to Ostrava and are searching for a route that minimizes the total time. Using the online planner we learn that the fastest route from Prague to Ostrava is — as bizarre as it sounds — via (actually around) Brno.
<iframe style="border:none" src="https://frame.mapy.cz/s/munotenatu" width="700" height="466" frameborder="0"></iframe>
Now, is it possible to reuse this plan for our friends from Brno who are also heading for Ostrava?
<iframe style="border:none" src="https://frame.mapy.cz/s/hasebevole" width="700" height="466" frameborder="0"></iframe>
The answer is yes, as the planner confirms. Surely we did not even need the planner to answer such a trivial question. And yet it demonstrates the key wisdom of the whole chapter — Bellman's principle of optimality — which we now state formally.
:::
::: {#thm-bellmans-principle-of-optimality}
## Bellman's principle of optimality
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
:::
We now investigate this idea a bit more quantitatively using a simple computational example of finding a shortest path in a graph.
::: {#exm-shortest-path-in-a-graph}
## Shortest path in a graph
We consider a directed graph with nodes `A` through `I` and edges with the prescribed lengths, as in the figure below.
``` {dot}
digraph G {
size="6,6";
ratio="fill";
rankdir=LR;
node [color=red, fontcolor=white, style=filled, shape=circle];
A -> B [label=3];
A -> D [label=1];
B -> C [label=2];
B -> E [label=1];
D -> E [label=3];
D -> G [label=2];
C -> F [label=3];
E -> F [label=3];
E -> H [label=2];
G -> H [label=4];
F -> I [label=4];
H -> I [label=2];
}
```
The task is now to find the shortest path from `A` to `I`. What are possible solution strategies? We could start enumerating all the possible paths and calculating their costs (by summing the costs of the participating edges). Needless to say, such an enumeration strategy scales very badly as the number of nodes grows.
Alternatively, we solve the problem using dynamic programming, relying on Bellman's principle of optimality. Before we proceed, we need to define the concept of a *stage*. It is perhaps less common and natural when it comes to solving graph problems, but we introduce it in anticipation of discrete-time optimal control problems. By the *k*th stage we understand the set of nodes at which the *k*th decision needs to be made. In our case, starting at `A`, 4 decisions need to be made to reach the final node. But let's agree that we also regard the final node as a stage, the 5th one, even if no decision is to be made there. The total number of stages is then N=5.
The crucial attribute of the dynamic programming strategy is that we proceed *backwards*. We start at the very final *stage*. At this stage, there is just one node and there is nothing we can do, but note that it also makes sense to formulate problems with several possible nodes at the final stage, each with a different (terminal) cost — we will actually use this once we switch to the optimal control setting. Now we proceed backwards to the last but one, that is, the *(N-1)*th stage.
The nodes at this 4th stage are `F` and `H`. At these two nodes there is again no freedom in the choice of action, but for each of them we can record the respective *cost to go*: 4 for the `F` node and 2 for the `H` node. These costs reflect how costly it is to reach the terminal node from them.
Things only get interesting once we proceed to the 3rd stage. We now have to consider three possible nodes: `C`, `E` and `G`. For the `C` and `G` nodes there is still just one action and we can only record their costs to go. The cost for the `C` node is computed as the cost of the immediate transition from `C` to `F` plus the cost to go from the `F` node, which we recorded previously, that is, 3+4=7. We record the value of 7 with the `C` node. Similarly for the `G` node. For the `E` node there are two possible actions — two possible decisions to be made, two possible paths to choose from: either to the left (or, actually, up in our orientation of the graph), which would bring us to the node `F`, or to the right (or down), which would bring us to the node `H`. We compute the costs to go for both decisions and choose the decision with the smaller cost. Here the cost of the decision to go left is composed of the cost of the transition to `F` plus the cost to go from `F`, that is, 3+4=7. The cost to go for the decision to go right is composed of the transition cost from `E` to `H` plus the cost to go from `H`, that is, 2+2=4. Obviously, the optimal decision is to go right, that is, to the node `H`. Here, on top of the value of the optimal (smallest) cost to go from the node, we also record the optimal decision (go to the right/down). We do it by coloring the edge in blue.
Note that in principle we should have also highlighted the edges from `F` to `I`, from `C` to `F`, and from `G` to `H`. It was unnecessary here since these are the only edges emanating from those nodes.
We proceed backwards to the 2nd stage, and we compute the costs to go for the nodes `B` and `D`. Again we record their optimal values and the actual optimal decisions.
One last shift backwards and we are at the initial node `A`, for which we perform the same computation of the cost to go. Note that here, coincidentally, both decisions have the same cost to go, hence both possible decisions/actions are optimal and we can just toss a coin.
``` {dot}
digraph G {
size="6,6";
ratio="fill";
rankdir=LR;
node [color=red, fontcolor=white, style=filled, shape=circle];
A [label="A\n8"];
B [label="B\n5"];
D [label="D\n7"];
C [label="C\n7"];
E [label="E\n4"];
G [label="G\n6"];
F [label="F\n4"];
H [label="H\n2"];
I [label="I\n0"];
A -> B [label="3", color=blue, penwidth=3.0];
A -> D [label="1", color=blue, penwidth=3.0];
B -> C [label="2"];
B -> E [label="1", color=blue, penwidth=3.0];
D -> E [label="3", color=blue, penwidth=3.0];
D -> G [label="2"];
C -> F [label="3"];
E -> F [label="3"];
E -> H [label="2", color=blue, penwidth=3.0];
G -> H [label="4"];
F -> I [label="4"];
H -> I [label="2", color=blue, penwidth=3.0];
}
```
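To connect the verbal procedure with code, the following is a minimal sketch of the backward recursion for this particular graph in Julia. The edge list and the grouping of nodes into stages are transcribed from the figure; the variable and dictionary names are our own.
```{julia}
# Edges of the graph with their lengths and the nodes grouped by stage,
# transcribed from the figure above.
edges = Dict(
    "A" => [("B", 3), ("D", 1)],
    "B" => [("C", 2), ("E", 1)],
    "D" => [("E", 3), ("G", 2)],
    "C" => [("F", 3)],
    "E" => [("F", 3), ("H", 2)],
    "G" => [("H", 4)],
    "F" => [("I", 4)],
    "H" => [("I", 2)],
)
stages = [["A"], ["B", "D"], ["C", "E", "G"], ["F", "H"], ["I"]]

# Backward recursion: cost to go and optimal successor for every node.
J = Dict("I" => 0)              # the terminal node has zero cost to go
policy = Dict{String,String}()
for k in length(stages)-1:-1:1  # proceed backwards over the stages
    for node in stages[k]
        J[node], policy[node] = minimum((c + J[next], next) for (next, c) in edges[node])
    end
end

println(J)       # costs to go, e.g. J["A"] == 8
println(policy)  # optimal next node from every non-terminal node (the blue edges above)
```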
Maybe it is not immediately clear from the graph, but when viewed as an itinerary for a trip, it provides a feedback controller. Even if for whatever reason we find ourselves off the optimal path, we can always have a look at the graph — it will guide us along the path that is optimal from that given node. For example, if we happen to be at node `C`, we do have a plan. Admittedly, the case is misleadingly simple here as there is no decision to be made, but you get the point.
:::
## Bellman's principle of optimality applied to the discrete-time optimal control problem
Let's recapitulate here the problem of optimal control for a discrete-time system. In particular, we consider the system modelled by
$$
\bm x_{k+1} = \mathbf f_k(\bm x_k,\bm u_k),
$$
defined on the discrete time interval $[i,N]$, with the initial state $\bm x_i$ fixed ($\bm x_i = \mathbf x_i$). We aim at minimizing the cost function
$$
J_i^N\left(\bm x_i, \bm u_i, \bm u_{i+1}, \ldots, \bm u_{N-1}\right) = \phi(\bm x_N,N) + \sum_{k=i}^{N-1}L_k(\bm x_k,\bm u_k).
$$
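For concreteness, evaluating this cost for a given initial state and control sequence amounts to simulating the system forward and accumulating the stage costs. A minimal sketch in Julia follows; the arguments `f`, `L`, and `ϕ` are placeholders for user-supplied implementations of $\mathbf f_k$, $L_k$ and $\phi$ (the names and the calling convention are ours, and we suppress the explicit dependence of $\phi$ on $N$ for brevity).
```{julia}
# Evaluate J_i for an initial state x_i and a control sequence us = [u_i, ..., u_{N-1}].
# f, L and ϕ are placeholders for user-supplied functions f_k(x, u, k), L_k(x, u, k), ϕ(x_N).
function cost(f, L, ϕ, x_i, us, i, N)
    x, J = x_i, 0.0
    for k in i:N-1
        J += L(x, us[k-i+1], k)    # accumulate the stage cost L_k(x_k, u_k)
        x  = f(x, us[k-i+1], k)    # advance the state: x_{k+1} = f_k(x_k, u_k)
    end
    return J + ϕ(x)                # add the terminal cost ϕ(x_N)
end
```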
Before we proceed, some comments on the notation are in order. Indeed, a well-tuned and systematically used notation is instrumental in dynamic programming.
::: {.callout-note}
## We omit the final time from the notation for the cost function
While the cost function does depend on the final time too, in most if not all our analyses we assume that it is fixed and understood from the context. Hence we will not explicitly indicate the dependence on the final time. We will write just $J_i(\ldots)$. This may help reduce the notational clutter as we are going to need the upper index for something else soon.
:::
::: {.callout-note}
## We omit the state trajectory from the notation for the cost function and leave just the initial state
The cost function is clearly a function of the full sequence $\bm x_i, \bm x_{i+1},\ldots, \bm x_N$ of the state vectors too. In the previous chapters we handled it systematically (either by considering them as optimization variables in the simultaneous direct approach or by introducing Lagrange multipliers in the indirect approach). But here we want to emphasize the fact that starting with $\bm x_{i+1}$, the whole state trajectory is uniquely determined by the initial state $\bm x_i$ and the corresponding control trajectory $\bm u_i, \bm u_{i+1},\ldots, \bm u_{N-1}$. Therefore, we write the cost function as a function of the initial state, the initial time (we already agreed above not to emphasize the final time), and the sequence of controls.
:::
::: {.callout-note}
## We use the lower index to display dependence on time
The dependence on the discrete time is reflected by the lower indices: not only in $\bm x_k$ and $\bm u_k$ but also in $\mathbf f_k()$, $L_k()$ and $J_k()$. We could perhaps write these as $\mathbf f(\cdot,\cdot,k)$, $L(\cdot,\cdot,k)$ and $J(\cdot,\cdot,k)$ to better indicate that $k$ is really an argument for these functions, but we prefer making it compatible with the way we indicate the time dependence of $\bm x_k$ and $\bm u_k$.
:::
Having introduced the cost function parameterized by the initial state, initial time and the full sequence of controls, we now introduce the *optimal cost function*
$$
\boxed{
J^\star_i(\bm x_i) = \min_{\bm u_i,\ldots, \bm u_{N-1}} J_i\left(\bm x_i, \bm u_i, \bm u_{i+1}, \ldots, \bm u_{N-1}\right).}
$${#eq-optimal_cost_function}
The sequence of controls in the above minimization may be subject to some constraints, but we do not indicate them here for the sake of notational simplicity.
::: {.callout-important}
## Difference between the $J_i$ and $J^\star_i$ functions
Understanding the difference is crucial. While the cost function $J_i$ depends on the (initial) state, the (initial) time and the sequence of controls applied over the whole interval, the optimal cost function $J^\star_i$ only depends on the (initial) state and the (initial) time.
:::
Assume now that we can find an optimal control sequence from any given state $\bm x_{k+1}$ at time $k+1$ on, i.e., we can find $\bm u_{k+1}^\star,\bm u_{k+2}^\star,\ldots, \bm u_{N-1}^\star$ yielding the optimal cost $J_{k+1}^\star(\bm x_{k+1})$. We will soon show how to actually find it, but for the time being we just assume we can have it. We now show how it can be used to find the optimal cost $J_k^\star(\bm x_k)$ at state $\bm x_k$ and time $k$.
Let's now consider the following strategy: with the system at state $\bm x_k$ and time $k$ we apply some control $\bm u_k$, not necessarily an optimal one, which brings the system to the state $\bm x_{k+1}$ at the next time $k+1$. But from then on we use the control sequence $\bm u_{k+1}^\star,\bm u_{k+2}^\star,\ldots, \bm u_{N-1}^\star$ that is optimal from $\bm x_{k+1}$. The corresponding cost is
$$
L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\bm x_{k+1}).
$${#eq-cost_of_the_all_but_one_optimal_controls}
Bellman's principle of optimality states that if we optimize the above expression over $\bm u_k$, we get the optimal cost $J_k^\star(\bm x_k)$ at time $k$
$$
\boxed{J_k^\star(\bm x_k) = \min_{\bm u_k}\left(L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\bm x_{k+1})\right).}
$${#eq-bellman_for_discrete_time_optimal_control}
Hence, at a given state $\bm x_{k}$ and time $k$, the optimization is performed over only one (possibly vector-valued) control $\bm u_k$ and not over the whole trajectory as the definition of the optimal cost in @eq-optimal_cost_function suggests! What a simplification!
::: {.callout-important}
The minimization needs to be performed over the whole sum $L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\bm x_{k+1})$, because $\bm x_{k+1}$ is a function of $\bm u_k$ (recall that $\bm x_{k+1} = \mathbf f_k(\bm x_k,\bm u_k)$). We can also write @eq-bellman_for_discrete_time_optimal_control as
$$
J_k^\star(\bm x_k) = \min_{\bm u_k}\left(L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\mathbf f_k(\bm x_k,\bm u_k))\right),
$$
which makes it more apparent.
:::
Once we have the optimal cost function $J^\star_{k+1}$, the optimal control $\bm u_k^\star(\bm x_k)$ at a given time $k$ and state $\bm x_k$ is obtained by
$$
\boxed{
\bm u_k^\star(\bm x_k) = \arg \min_{\bm u_k}\left(L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\mathbf f_k(\bm x_k,\bm u_k))\right).}
$$
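The two boxed formulas translate almost literally into code. Below is a minimal sketch of the backward recursion on a gridded scalar state space; the particular system $x_{k+1} = x_k + u_k$, the quadratic costs, the horizon, the grids, and the nearest-neighbour treatment of off-grid states are all our own illustrative choices, not prescribed by the text.
```{julia}
# A minimal sketch of the backward (Bellman) recursion on a gridded state space.
# The scalar system, the quadratic costs, the horizon and the grids are illustrative only.
N  = 10                                    # horizon
xs = range(-2.0, 2.0, length = 81)         # grid of states
us = range(-1.0, 1.0, length = 41)         # grid of admissible controls

f(x, u, k) = x + u                         # state equation x_{k+1} = f_k(x_k, u_k)
L(x, u, k) = x^2 + u^2                     # stage cost L_k(x_k, u_k)
ϕ(x)       = 10x^2                         # terminal cost ϕ(x_N)

J = [zeros(length(xs)) for _ in 1:N+1]     # J[k][i] approximates J_k⋆(xs[i])
policy = [zeros(length(xs)) for _ in 1:N]  # policy[k][i] approximates u_k⋆(xs[i])
J[N+1] .= ϕ.(xs)                           # initialize with the terminal cost

# Cost to go at an off-grid state, approximated by the nearest grid point.
Jnext(Jk, x) = Jk[argmin(abs.(xs .- x))]

for k in N:-1:1                            # proceed backwards in time
    for (i, x) in enumerate(xs)
        costs = [L(x, u, k) + Jnext(J[k+1], f(x, u, k)) for u in us]
        J[k][i], jopt = findmin(costs)     # optimal cost to go J_k⋆(xs[i])
        policy[k][i] = us[jopt]            # optimal feedback u_k⋆(xs[i])
    end
end
```
The arrays `J` and `policy` are precisely the look-up tables mentioned in the introduction: at run time, the feedback controller simply looks up (or interpolates) `policy[k]` at the current state.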
### Alternative formulation of dynamic programming using Q-factors
The cost in @eq-cost_of_the_all_but_one_optimal_controls is sometimes called the (optimal) *Q-factor* and we denote it $Q^\star_k(\bm x_k,\bm u_k)$. We write its definition here for convenience
$$
Q^\star_k(\bm x_k,\bm u_k) = L_k(\bm x_k,\bm u_k) + J_{k+1}^\star(\bm x_{k+1}).
$$
The optimal cost function $J_k^\star(\bm x_k)$ can be recovered from the optimal Q-factor $Q_k^\star(\bm x_k,\bm u_k)$ by taking the minimum over $\bm u_k$
$$
J_k^\star(\bm x_k) = \min_{\bm u_k} Q_k^\star(\bm x_k,\bm u_k).
$$
Bellman's principle of optimality can be then expressed using the optimal Q-factor as
$$
\boxed{Q_k^\star(\bm x_k,\bm u_k) = L_k(\bm x_k,\bm u_k) + \min_{\bm u_{k+1}} Q_{k+1}^\star(\bm x_{k+1},\bm u_{k+1})}.
$$
The optimal control is then obtained from the optimal Q-factor as the minimizing control
$$
\boxed{\bm u_k^\star(\bm x_k) = \arg \min_{\bm u_k} Q_k^\star(\bm x_k,\bm u_k).}
$$
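Reusing the grids, the system and the costs from the previous sketch (again purely illustrative), the same recursion expressed in terms of Q-factors differs only in what is tabulated: a value for every state-control pair rather than for every state.
```{julia}
# Q-factor variant of the previous sketch, reusing xs, us, N, f, L and ϕ from above.
Q = [zeros(length(xs), length(us)) for _ in 1:N]   # Q[k][i, j] approximates Q_k⋆(xs[i], us[j])

for k in N:-1:1
    # Cost to go at stage k+1: the terminal cost at the last stage,
    # otherwise the minimum of the next Q-factor over the control.
    Jk1(x) = k == N ? ϕ(x) : minimum(Q[k+1][argmin(abs.(xs .- x)), :])
    for (i, x) in enumerate(xs), (j, u) in enumerate(us)
        Q[k][i, j] = L(x, u, k) + Jk1(f(x, u, k))
    end
end

# The optimal feedback is recovered by minimizing the Q-factor over the control.
u_opt(k, x) = us[argmin(Q[k][argmin(abs.(xs .- x)), :])]
```
The Q-factor table is larger (it is indexed by both the state and the control), but in exchange the optimal control can be extracted from it without evaluating the model $\mathbf f_k$.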