<!DOCTYPE html>
<html>
<head>
<title>Underactuated Robotics: Dynamic Programming</title>
<meta name="Underactuated Robotics: Dynamic Programming" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://underactuated.mit.edu/dp.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('underactuated');">
<div data-type="titlepage">
<header>
<h1><a href="underactuated.html" style="text-decoration:none;">Underactuated Robotics</a></h1>
<p data-type="subtitle">Algorithms for Walking, Running, Swimming, Flying, and Manipulation</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2020<br/>
<a href="tocite.html">How to cite these notes</a><br/>
</p>
</header>
</div>
<p><b>Note:</b> These are working notes used for <a
href="http://underactuated.csail.mit.edu/Spring2020/">a course being taught
at MIT</a>. They will be updated throughout the Spring 2020 semester. <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.</p>
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=stochastic.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=underactuated.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=lqr.html>Next Chapter</a></td>
</tr></table>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 6"><h1>Dynamic Programming</h1>
<p>In chapter 2, we spent some time thinking about the phase portrait of the
simple pendulum, and concluded with a challenge: can we design a nonlinear
controller to <em>reshape</em> the phase portrait, with a very modest amount
of actuation, so that the upright fixed point becomes globally stable? With
unbounded torque, feedback-cancellation solutions (e.g., invert gravity) can
work well, but can also require an unnecessarily large amount of control
effort. The energy-based swing-up control solutions presented for the acrobot
and cart-pole systems are considerably more appealing, but required some
cleverness and might not scale to more complicated systems. Here we
investigate another approach to the problem, using computational optimal
control to synthesize a feedback controller directly.</p>
<section><h1>Formulating control design as an optimization</h1>
<p>In this chapter, we will introduce optimal control - a control design
process using optimization. This approach is powerful for a number of
reasons. First and foremost, it is very general - allowing us to specify
the goal of control equally well for fully- or under-actuated, linear or
nonlinear, deterministic or stochastic, and continuous or discrete
systems. Second, it permits concise descriptions of potentially very
complex desired behaviors, specifying the goal of control as a scalar
objective (plus a list of constraints). Finally, and most importantly,
optimal control is very amenable to numerical solutions.
<elib>Bertsekas00a</elib> is a fantastic reference on this material for
those who want a somewhat rigorous treatment; <elib>Sutton18</elib> is an
excellent (free) reference for those who want something more
approachable.</p>
<p>The fundamental idea in optimal control is to formulate the goal of
control as the <em>long-term</em> optimization of a scalar cost function.
Let's introduce the basic concepts by considering a system that is even
simpler than the simple pendulum.</p>
<example><h1>Optimal Control Formulations for the Double Integrator</h1>
<p>Consider the double integrator system $$\ddot{q} = u, \quad |u| \le
1.$$ If you would like a mechanical analog of the system (I always do),
then you can think about this as a unit mass brick moving along the x-axis
on a frictionless surface, with a control input which provides a
horizontal force, $u$. The task is to design a control system, $u =
\pi(\bx,t)$, $\bx=[q,\dot{q}]^T$ to regulate this brick to $\bx =
[0,0]^T$.</p>
<figure>
<img width="70%" src="figures/double_integrator_brick.svg"/>
<figcaption>The double integrator as a unit-mass brick on a frictionless
surface</figcaption>
</figure>
<p>In order to formulate this control design problem using optimal
control, we must define a scalar objective which scores the long-term
performance of running each candidate control policy, $\pi(\bx,t)$, from
each initial condition, $(\bx_0,t_0)$, and a list of constraints that must
be satisfied. For the task of driving the double integrator to the origin,
one could imagine a number of optimal control formulations which would
accomplish the task, e.g.: <ul> <li> Minimum time: $\min_\pi t_f,$
subject to $\bx(t_0) = \bx_0,$ $\bx(t_f) = {\bf 0}.$ </li> <li> Quadratic
cost: $\min_\pi \int_{t_0}^{\infty} \left[ \bx^T(t) {\bf Q} \bx(t)
\right] dt,$ ${\bf Q}\succ0$.</li> </ul> where the first is a constrained
optimization formulation which optimizes time, and the second accrues a
penalty at every instant according to the distance that the state is away
from the origin (in a metric space defined by the matrix ${\bf Q}$), and
therefore encourages trajectories that go more directly towards the goal,
possibly at the expense of requiring a longer time to reach the goal (in
fact it will result in an exponential approach to the goal, whereas the
minimum-time formulation will arrive at the goal in finite time). Note
that both optimization problems only have well-defined solutions if it is
possible for the system to actually reach the origin, otherwise the
minimum-time problem cannot satisfy the terminal constraint, and the
integral in the quadratic cost would not converge to a finite value as
time approaches infinity (fortunately the double integrator system is
controllable, and therefore can be driven to the goal in finite time).</p>
<p> Note that the input limits, $|u|\le1$ are also required to make this
problem well-posed; otherwise both optimizations would result in the
optimal policy using infinite control input to approach the goal
infinitely fast. Besides input limits, another common approach to limiting
the control effort is to add an additional quadratic cost on the input (or
"effort"), e.g. $\int \left[ \bu^T(t) {\bf R} \bu(t) \right] dt,$ ${\bf
R}\succ0$. This could be added to either formulation above. We will
examine many of these formulations in some detail in the examples worked
out at the end of this chapter. </p>
</example>
<p>Optimal control has a long history in robotics. For instance, there has
been a great deal of work on the minimum-time problem for pick-and-place
robotic manipulators, and the linear quadratic regulator (LQR) and linear
quadratic regulator with Gaussian noise (LQG) have become essential tools
for any practicing controls engineer. With increasingly powerful computers
and algorithms, the popularity of numerical optimal control has grown at an
incredible pace over the last few years.</p>
<example id="minimum_time_double_integrator"><h1>The minimum time problem for the double
integrator</h1>
<p>For more intuition, let's do an informal derivation of the solution to
the minimum time problem for the double integrator with input constraints:
\begin{align*}
\minimize_{\pi} \quad & t_f\\
\subjto \quad & \bx(t_0) = \bx_0, \\
& \bx(t_f) = {\bf 0}, \\
& \ddot{q}(t) = u(t), \\
& |u| \le 1.
\end{align*}
What behavior would you expect an optimal controller to exhibit?</p>
<p> Your intuition might tell you that the best thing that the brick can
do, to reach the goal in minimum time with limited control input, is to
accelerate maximally towards the goal until reaching a critical point,
then hitting the brakes in order to come to a stop exactly at the goal.
This would be called a <em>bang-bang</em> control policy; these are often
optimal for systems with bounded input, and it is in fact optimal for the
double integrator, although we will not prove it until we have developed
more tools. <!-- leave the proof to the pontryagin notes --></p>
<p> Let's work out the details of this bang-bang policy. First, we can
figure out the states from which, when the brakes are fully applied, the
system comes to rest precisely at the origin. Let's start with the case
where $q(0) < 0$, and $\dot{q}(0)>0$, and "hitting the brakes" implies
that $u=-1$ . Integrating the equations, we have \begin{gather*}
\ddot{q}(t) = u = -1 \\\dot{q}(t) = \dot{q}(0) - t \\ q(t) = q(0) +
\dot{q}(0) t - \frac{1}{2} t^2. \end{gather*} Substituting $t =
\dot{q}(0) - \dot{q}$ into the solution for $q$ reveals that the
system orbits are parabolic arcs: \[ q = -\frac{1}{2} \dot{q}^2 + c_{-},
\] with $c_{-} = q(0) + \frac{1}{2}\dot{q}^2(0)$.</p>
<figure>
<img width="80%" src="figures/double_integrator_orbits.svg"/>
<figcaption>Two solutions for the system with $u=-1$</figcaption>
</figure>
<!-- t = qdot - qdot0,
q = q0 + qdot0(qdot-qdot0) + 1/2(qdot-qdot0)^2
= q0 + qdot0*qdot - qdot0^2 + 1/2qdot^2 - qdot*qdot0 + 1/2qdot0^2
= 1/2qdot^2 + (q0 - 1/2 qdot0^2)
-->
<p>Similarly, the solutions for $u=1$ are $q = \frac{1}{2} \dot{q}^2 +
c_{+}$, with $c_{+}=q(0)-\frac{1}{2}\dot{q}^2(0)$.</p>
<p> Perhaps the most important of these orbits are the ones that pass
directly through the origin (e.g., $c_{-}=0$). Following our initial
logic, if the system is going slower than this $\dot{q}$ for any $q$, then
the optimal thing to do is to slam on the accelerator
($u=-\text{sgn}(q)$). If it's going faster than the $\dot{q}$ that we've
solved for, then still the best thing to do is to brake; but inevitably
the system will overshoot the origin and have to come back. We can
summarize this policy with: \[ u = \begin{cases} +1 & \text{if } (\dot{q}
< 0 \text{ and } q \le \frac{1}{2} \dot{q}^2) \text{ or } (\dot{q}\ge 0
\text{ and } q < -\frac{1}{2} \dot{q}^2) \\ 0 & \text{if } q=0 \text{ and
} \dot{q}=0 \\ -1 & \text{otherwise} \end{cases} \]</p> <!--This policy is
cartooned in Figure~\ref{fig:mintime_double_int}. %Trajectories of the
system %executing this policy are also included - the fundamental
%characteristic is that the system is accelerated as quickly as %possible
toward the switching surface, then rides the switching %surface in to the
origin. -->
<figure>
<img width="80%" src="figures/double_integrator_mintime_policy.svg"/>
<figcaption>Candidate optimal (bang-bang) policy for the minimum-time
double integrator problem.</figcaption>
</figure>
<p>Some of the optimal solution trajectories under this policy are illustrated here:</p>
<figure>
<img width="80%" src="figures/double_integrator_mintime_orbits.svg"/>
<figcaption>Solution trajectories of system using the optimal policy</figcaption>
</figure>
<p>And for completeness, we can compute the optimal time to the goal by
solving for the amount of time required to reach the switching surface
plus the amount of time spent moving along the switching surface to the
goal. With a little algebra, you will find that the time to goal,
$T(\bx)$, is given by \[ T(\bx) = \begin{cases}
2\sqrt{\frac{1}{2}\dot{q}^2-q} - \dot{q} & \text{for } u=+1 \text{
regime}, \\ 0 & \text{for } u=0, \\ \dot{q} +
2\sqrt{\frac{1}{2}\dot{q}^2+q} & \text{for } u=-1, \end{cases} \]<!-- call
t_m the time to the surface, then the time on switching surface =
|qdot(t_m)|
for u=1
q0,qdot0 => qm,qdotm with qm = -1/2 qdotm^2
-1/2 qdotm^2 = 1/2 qdotm^2 + c+
qdotm^2 = - c+ , qdotm = sqrt(-c+)
T = (qdotm-qdot0)+qdotm = 2*sqrt(-c+) - qdot0
for u=-1, qdotm^2 = c- , qdotm = -sqrt(c-)
T = (qdot0-qdotm)-qdotm = qdot0 + 2*sqrt(c-)
-->
plotted here:</p>
<figure>
<img width="80%" src="figures/double_integrator_mintime_cost_to_go.svg"/>
<figcaption>Time to the origin using the bang-bang policy</figcaption>
</figure>
<p> Notice that the function is continuous (though not smooth), even
though the policy is discontinuous.</p>
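<p>As a quick sanity check, the policy and the time-to-go above are easy to transcribe into a few lines of code. The sketch below is my own (the function names are made up, and it is not part of the course software); it simply mirrors the case statements above:</p>
<pre><code class="language-python">import numpy as np

def mintime_policy(q, qdot):
    """Candidate optimal (bang-bang) policy for the minimum-time
    double integrator, transcribed from the case statement above."""
    if q == 0 and qdot == 0:
        return 0.0
    if (qdot &lt; 0 and q &lt;= 0.5 * qdot**2) or (qdot &gt;= 0 and q &lt; -0.5 * qdot**2):
        return 1.0
    return -1.0

def time_to_goal(q, qdot):
    """Optimal time-to-go T(x) under the bang-bang policy."""
    u = mintime_policy(q, qdot)
    if u &gt; 0:
        return 2.0 * np.sqrt(0.5 * qdot**2 - q) - qdot
    if u &lt; 0:
        return qdot + 2.0 * np.sqrt(0.5 * qdot**2 + q)
    return 0.0

# e.g. starting at rest one meter behind the origin: accelerate for one
# second, then brake for one second.
print(mintime_policy(-1.0, 0.0), time_to_goal(-1.0, 0.0))  # 1.0 2.0
</code></pre>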
</example> <!-- end of example -->
<subsection><h1>Additive cost</h1>
<p> As we begin to develop theoretical and algorithmic tools for optimal
control, we will see that some formulations are much easier to deal with
than others. One important example is the dramatic simplification that
can come from formulating objective functions using <em>additive
cost</em>, because they often yield recursive solutions. In the additive
cost formulation, the long-term "score" for a trajectory can be written
as $$\int_0^T \ell(x(t),u(t)) dt,$$ where $\ell()$ is the instantaneous cost
(also referred to as the "running cost"), and $T$ can be either a finite
real number or $\infty$. We will call a problem specification with a
finite $T$ a "finite-horizon" problem, and $T=\infty$ an
"infinite-horizon" problem. Problems and solutions for infinite-horizon
problems tend to be more elegant, but care is required to make sure that
the integral converges for the optimal controller (typically by having an
achievable goal that allows the robot to accrue zero-cost).</p>
<p> The quadratic cost function suggested in the double integrator example
above is clearly written as an additive cost. At first glance, our
minimum-time problem formulation doesn't appear to be of this form, but we
actually can write it as an additive cost problem using an infinite
horizon and the instantaneous cost $$\ell(x,u) = \begin{cases} 0 & \text{if
} x=0, \\ 1 & \text{otherwise.} \end{cases}$$</p>
<p> We will examine a number of approaches to solving optimal control
problems throughout the next few chapters. For the remainder of this
chapter, we will focus on additive-cost problems and their solution via
<em>dynamic programming</em>.</p>
</subsection> <!-- end of additive cost -->
</section> <!-- control design as an optimization -->
<section><h1>Optimal control as graph search</h1>
<p> For systems with continuous states and continuous actions, dynamic
programming is a set of theoretical ideas surrounding additive cost optimal
control problems. For systems with a finite, discrete set of states and a
finite, discrete set of actions, dynamic programming also represents a set
of very efficient numerical <em>algorithms</em> which can compute optimal
feedback controllers. Many of you will have learned it before as a tool for
graph search. </p>
<p>Imagine you have a directed graph with states (or nodes) $\{s_1,s_2,...\}
\in S$ and "actions" associated with edges labeled as $\{a_1,a_2,...\} \in
A$, as in the following trivial example:</p>
<figure><img width="70%" src="figures/graph_search.svg"/><figcaption>A
simple directed graph.</figcaption></figure>
<p>Let us also assume that each edge has an associated weight or cost, using
$\ell(s,a)$ to denote the cost of being in state $s$ and taking action $a$.
Furthermore we will denote the transition "dynamics" using \[ s[n+1] =
f(s[n],a[n]). \] For instance, in the graph above, $f(s_1,a_1) = s_2$.</p>
<p>There are many algorithms for finding (or approximating) the optimal
path from a start to a goal on directed graphs. In dynamic programming,
the key insight is that we can find the shortest path from every node by
solving recursively for the optimal <em>cost-to-go</em> (the cost that will
be accumulated when running the optimal controller) from every node to the
goal. One such algorithm starts by initializing an estimate $\hat{J}^*=0$
for all $s_i$, then proceeds with an iterative algorithm which sets
\begin{equation} \hat{J}^*(s_i) \Leftarrow \min_{a \in A} \left[ \ell(s_i,a) +
\hat{J}^*\left({f(s_i,a)}\right) \right]. \label{eq:value_update}
\end{equation} In software, $\hat{J}^*$ can be represented as a vector with
dimension equal to the number of discrete states. This algorithm,
appropriately known as <em>value iteration</em>, is guaranteed to converge
to the optimal cost-to-go up to an additive constant, $\hat{J}^* \rightarrow
J^* + c$ <elib>Bertsekas96</elib>, and in practice does so rapidly.
Typically the update is done in <em>batch</em> -- i.e., the estimate is
updated for all $i$ at once -- but the <em>asynchronous</em> version where
states are updated one at a time is also known to converge, so long as
every state is eventually updated infinitely often. Assuming the graph has
a goal state with a zero-cost self-transition, then this cost-to-go
function represents the weighted shortest distance to the goal. </p>
<p> Value iteration is a remarkably simple algorithm, but it accomplishes
something quite amazing: it efficiently computes the long-term cost of an
optimal policy from <i>every</i> state by iteratively evaluating the
one-step cost. If we know the optimal cost-to-go, then it's easy to extract
the optimal policy, $a = \pi^*(s)$: \begin{equation} \pi^*(s_i) = \argmin_a
\left[ \ell(s_i,a) + J^*\left( f(s_i,a) \right) \right].
\label{eq:policy_update} \end{equation} It's a simple algorithm, but playing
with an example can help our intuition.</p>
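<p>To make the update in Eq. (\ref{eq:value_update}) completely concrete, here is a minimal sketch of batch value iteration on a toy directed graph. The graph, costs, and number of sweeps below are made up for illustration (they are not taken from the figure above):</p>
<pre><code class="language-python"># A toy graph: states 0..3, with state 3 a goal that has a zero-cost
# self-transition.  f[s][a] is the next state; ell[s][a] is the one-step cost.
f = {0: {0: 1, 1: 2}, 1: {0: 3}, 2: {0: 3}, 3: {0: 3}}
ell = {0: {0: 1.0, 1: 3.0}, 1: {0: 2.0}, 2: {0: 1.0}, 3: {0: 0.0}}

J = {s: 0.0 for s in f}  # initialize the estimate of J* to zero
for sweep in range(10):  # batch value-iteration sweeps
    J = {s: min(ell[s][a] + J[f[s][a]] for a in f[s]) for s in f}

policy = {s: min(f[s], key=lambda a: ell[s][a] + J[f[s][a]]) for s in f}
print(J)       # {0: 3.0, 1: 2.0, 2: 1.0, 3: 0.0}
print(policy)  # {0: 0, 1: 0, 2: 0, 3: 0}
</code></pre>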
<example><h1>Grid World</h1>
<p>Imagine a robot living in a grid (finite state) world. It wants to get
to the goal location, and possibly has to negotiate cells with obstacles
along the way. The actions are to move up, down, left, right, or do
nothing. <elib>Sutton98</elib></p>
<figure>
<a href="figures/gridworld_mintime.swf">
<img width="80%" src="figures/gridworld_mintime.svg" />
</a>
<figcaption>The one-step cost for the grid-world minimum-time problem.
The goal state has a cost of zero, the obstacles have a cost of 10, and
every other state has a cost of 1. <em>Click the image to watch the
value iteration algorithm in action.</em></figcaption>
</figure>
<jupyter>examples/grid_world.ipynb</jupyter>
</example> <!-- end grid world -->
<todo>figure/text for graph approximation of a continuous state
space.</todo>
<example><h1>Dynamic Programming for the Double Integrator</h1>
<p>You can run value iteration for the double integrator (using
barycentric interpolation to interpolate between nodes) in <drake></drake>
using: </p>
<jupyter no_colab>examples/double_integrator/value_iteration.ipynb</jupyter>
<p>Please do take some time to try different cost functions by
editing the code yourself.</p>
</example>
<p> Let's take a minute to appreciate how amazing this is. Our solution to
finding the optimal controller for the double integrator wasn't all that
hard, but it required some mechanical intuition and solutions to
differential equations. The resulting policy was non-trivial -- bang-bang
control with a parabolic switching surface. The value iteration algorithm
doesn't use any of this directly -- it's a simple algorithm for graph
search. But remarkably, it can generate effectively the same policy with
just a few moments of computation.</p>
<p>It's important to note that there <em>are</em> some differences between
the computed policy and the optimal policy that we derived, due to
discretization errors. We will ask you to explore these in the
problems.</p>
<p>The real value of this numerical solution, however, is that, unlike our
analytical solution for the double integrator, we can apply this same
algorithm to any number of dynamical systems virtually without modification.
Let's apply it now to the simple pendulum, which was intractable
analytically.</p>
<example><h1>Dynamic Programming for the
Simple Pendulum</h1>
<p>You can run value iteration for the simple pendulum (using barycentric
interpolation to interpolate between nodes) in <drake></drake> using:</p>
<jupyter no_colab>examples/pendulum/value_iteration.ipynb</jupyter>
<p>Again, you can easily try different cost functions by
editing the code yourself.</p>
</example>
</section> <!-- end of graph search -->
<section><h1>Continuous dynamic programming</h1>
<p> I find the graph search algorithm extremely satisfying as a first step,
but I also quickly become frustrated by the limitations of the discretization
required to use it. In many cases, we can do better by coming up with
algorithms which work more natively on continuous dynamical systems. We'll
explore those extensions in this section.</p>
<subsection><h1>The Hamilton-Jacobi-Bellman Equation</h1>
<p> It's important to understand that the value iteration equations,
equations (\ref{eq:value_update}) and (\ref{eq:policy_update}), are more
than just an algorithm. They are also sufficient conditions for
optimality: if we can produce a $J^*$ and $\pi^*$ which satisfy these
equations, then $\pi^*$ must be an optimal controller. There is an
analogous set of conditions for continuous systems. For a system
$\dot{\bx} = f(\bx,\bu)$ and an infinite-horizon additive cost
$\int_0^\infty \ell(\bx,\bu)dt$, we have: \begin{gather} 0 = \min_\bu \left[
\ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu) \right], \label{eq:HJB} \\
\pi^*(\bx) = \argmin_\bu \left[ \ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu)
\right]. \end{gather} Equation \ref{eq:HJB} is known as the
<em>Hamilton-Jacobi-Bellman</em> (HJB) equation. <elib>Bertsekas05</elib>
<sidenote>Technically, a Hamilton-Jacobi equation is a PDE whose <em>time
derivative</em> depends on the first-order partial derivatives over
state, which we get in the finite-time derivation; Eq. \ref{eq:HJB} is
the steady-state solution of the Hamilton-Jacobi equation.</sidenote>
<!-- chapter 3 --> gives an informal derivation of these equations as the
limit of a discrete-time approximation of the dynamics, and also gives
the following sufficiency theorem:</p>
<theorem><h1>HJB Sufficiency Theorem</h1>
<p>The following statement is adapted from Proposition 3.2.1 of
<elib>Bertsekas05</elib>. Consider a system $\dot{\bx}=f(\bx,\bu)$ and
an infinite-horizon additive cost $\int_0^\infty \ell(\bx,\bu)dt$. Suppose
$J(\bx)$ is a solution to the HJB equation; that is $J$ is continuously
differentiable in $\bx$ and is such that \[ 0 = \min_{\bu \in U} \left[
\ell(\bx,\bu) + \pd{J}{\bx}f(\bx,\bu) \right],\quad \text{for all } \bx. \]
Suppose also that $\pi^*(\bx)$ attains the minimum in the equation for
all $\bx$. Further assume that the differential equation described by
$f$ has a unique solution starting from any state $\bx$, that the
control input trajectory, $\bu^*(t)$, obtained by evaluating $\pi^*$
along any solution is piecewise continuous as a function of $t$, and
that there exists at least one fixed point, $\bx_{fp}$, with
$f(\bx_{fp}, \pi^*(\bx_{fp})) = 0$ and $J^*(\bx_{fp}) = 0$. Then $J$ is
equal to the optimal cost-to-go function, $J(\bx)=J^*(\bx)$ for all
$\bx$, and the control trajectories $\bu^*(t)$ are optimal.</p>
</theorem>
<p>As a tool for verifying optimality, the HJB equations are actually
surprisingly easy to work with: we can verify optimality for an
infinite-horizon objective without doing any integration; we simply have
to check a derivative condition on the optimal cost-to-go function $J^*$.
Let's see this play out on the double integrator example.</p>
<example id="hjb_double_integrator"><h1>HJB for the Double
Integrator</h1>
<p>Consider the problem of regulating the double integrator (this time
without input limits) to the origin using a quadratic cost: $$
\ell(\bx,\bu) = q^2 + \dot{q}^2 + u^2. $$ I claim (without derivation)
that the optimal controller for this objective is $$\pi(\bx) = -q -
\sqrt{3}\dot{q}.$$ To convince you that this is indeed optimal, I
have produced the following cost-to-go function: $$J(\bx) = \sqrt{3}
q^2 + 2 q \dot{q} + \sqrt{3} \dot{q}^2.$$</p>
<p>Taking \begin{gather*} \pd{J}{q} = 2\sqrt{3} q + 2\dot{q}, \qquad
\pd{J}{\dot{q}} = 2q + 2\sqrt{3}\dot{q}, \end{gather*} we can write
\begin{align*} \ell(\bx,\bu) + \pd{J}{\bx}f(\bx,\bu) &= q^2 + \dot{q}^2 +
u^2 + (2\sqrt{3} q + 2\dot{q}) \dot{q} + (2q + 2\sqrt{3}\dot{q}) u
\end{align*} This is a convex quadratic function in $u$, so we can
find the minimum with respect to $u$ by finding where the gradient with
respect to $u$ evaluates to zero. \[ \pd{}{u} \left[ \ell(\bx,\bu) +
\pd{J}{\bx} f(\bx,\bu) \right] = 2u + 2q + 2\sqrt{3}\dot{q}. \] Setting
this equal to $0$ and solving for $u$ yields: $$u^* = -q - \sqrt{3}
\dot{q},$$ thereby confirming that our policy $\pi$ is in fact the
minimizer. Substituting $u^*$ back into the HJB reveals that the right
side does in fact simplify to zero. I hope you are convinced!</p>
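<p>If you would rather let the computer do this algebra, a few lines of <a href="https://www.sympy.org">sympy</a> reproduce the check above (this snippet is just my own quick verification, not part of the course code):</p>
<pre><code class="language-python">import sympy as sp

q, qd, u = sp.symbols('q qdot u', real=True)
J = sp.sqrt(3) * q**2 + 2 * q * qd + sp.sqrt(3) * qd**2
ell = q**2 + qd**2 + u**2
hjb = ell + sp.diff(J, q) * qd + sp.diff(J, qd) * u   # xdot = [qdot, u]

u_star = sp.solve(sp.diff(hjb, u), u)[0]
print(u_star)                              # -q - sqrt(3)*qdot
print(sp.simplify(hjb.subs(u, u_star)))    # 0
</code></pre>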
</example>
<p>Note that evaluating the HJB for the time-to-go of the minimum-time
problem for the double integrator will also reveal that the HJB is
satisfied wherever that gradient is well-defined. This is certainly
mounting evidence in support of our bang-bang policy being optimal, but
since $\pd{J}{\bx}$ is not defined everywhere, it does not actually
satisfy the requirements of the sufficiency theorem as stated above.</p>
</subsection> <!-- end HJB -->
<subsection id="hjb_minimizing_control"><h1>Solving for the minimizing control</h1>
<p>We still face a few barriers to actually using the HJB in an algorithm.
The first barrier is the minimization over $u$. When the action set was
discrete, as in the graph search version, we could evaluate the one-step
cost plus cost-to-go for every possible action, and then simply take the
best. For continuous action spaces, in general we cannot rely on the
strategy of evaluating a finite number of possible $\bu$'s to find the
minimizer.</p>
<p>All is not lost. In the quadratic cost double integrator example
above, we were able to solve explicitly for the minimizing $\bu$ in
terms of the cost-to-go. It turns out that this strategy will actually
work for a number of the problems we're interested in, even when the
system (which we are given) or cost function (which we are free to pick,
but which should be expressive) gets more complicated.</p>
<p>Recall that I've already tried to convince you that a majority of the
systems of interest are <em>control affine</em>, i.e., I can write \[
f(\bx,\bu) = f_1(\bx) + f_2(\bx)\bu. \] We can make another dramatic
simplification by restricting ourselves to instantaneous cost functions of
the form \[ \ell(\bx,\bu) = \ell_1(\bx) + \bu^T {\bf R} \bu, \qquad {\bf R}={\bf
R}^T \succ 0. \] In my view, this is not very restrictive - many of the
cost functions that I find myself choosing to write down can be expressed
in this form. Given these assumptions, we can write the HJB as \[ 0 =
\min_{\bu} \left[ \ell_1(\bx) + \bu^T {\bf R} \bu + \pd{J}{\bx} \left[
f_1(\bx) + f_2(\bx)\bu \right]\right]. \] Since this is a positive
quadratic function in $\bu$, if the system does not have any constraints
on $\bu$, then we can solve in closed-form for the minimizing $\bu$ by
taking the gradient of the right-hand side: \[ \pd{}{\bu} = 2\bu^T {\bf R}
+ \pd{J}{\bx} f_2(\bx) = 0, \] and setting it equal to zero to obtain \[
\bu^* = -\frac{1}{2}{\bf R}^{-1}f_2^T(\bx) \pd{J}{\bx}^T.\] If there are
linear constraints on the input, such as torque limits, then more
generally this could be solved (at any particular $\bx$) as a quadratic
program.</p>
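<p>As a small illustration (my own sketch, with hypothetical function names), the closed-form minimizer is just a matrix computation once you have the cost-to-go gradient at a state. With simple box limits on $\bu$ and a diagonal ${\bf R}$, the quadratic program is separable, so clamping the unconstrained minimizer coordinate-wise is exact:</p>
<pre><code class="language-python">import numpy as np

def u_star(dJdx, f2_x, R, u_min=None, u_max=None):
    """u* = -1/2 R^{-1} f2(x)^T (dJ/dx)^T for a control-affine system
    with running cost l1(x) + u^T R u.  With box limits and a diagonal R,
    clamping the unconstrained minimizer is exact."""
    u = -0.5 * np.linalg.solve(R, f2_x.T @ dJdx)
    if u_min is not None:
        u = np.clip(u, u_min, u_max)
    return u

# Check against the quadratic-cost double integrator example above, where
# f2(x) = [0, 1]^T, R = [1], and dJ/dx = [2*sqrt(3)q + 2qd, 2q + 2*sqrt(3)qd]:
q, qd = 1.0, -0.5
dJdx = np.array([2 * np.sqrt(3) * q + 2 * qd, 2 * q + 2 * np.sqrt(3) * qd])
f2 = np.array([[0.0], [1.0]])
print(u_star(dJdx, f2, np.array([[1.0]])), -q - np.sqrt(3) * qd)  # both approx. -0.134
</code></pre>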
<p> What happens in the case where our system is not control affine or if
we really do need to specify an instantaneous cost function on $\bu$
that is not simply quadratic? If the goal is to produce an iterative
algorithm, like value iteration, then one common approach is to make a
(positive-definite) quadratic approximation in $\bu$ of the HJB, and to
update that approximation on every iteration of the algorithm. This
broad approach is often referred to as <em>differential dynamic
programming</em> (c.f. <elib>Jacobson70</elib>). </p>
</subsection> <!-- end solve for u -->
<subsection><h1>Numerical solutions for $J^*$</h1>
<p> The other major barrier to using the HJB in a value iteration
algorithm is that the estimated optimal cost-to-go function, $\hat{J}^*$,
must somehow be represented with a finite set of numbers, but we don't yet
know anything about the potential form it must take. In fact, knowing the
time-to-goal solution for the minimum-time problem with the double integrator,
we see that this function might need to be non-smooth for even very simple
dynamics and objectives.</p>
<p>One natural way to parameterize $\hat{J}^*$ -- a scalar
valued-function defined over the state space -- is to define the values
on a mesh. This approach then admits algorithms with close ties to the
relatively advanced numerical methods used to solve other partial
differential equations (PDEs), such as the ones that appear in finite
element modeling or fluid dynamics. One important difference, however, is
that our PDE lives in the dimension of the state space, while many of the
<a href="https://en.wikipedia.org/wiki/Types_of_mesh">mesh
representations</a> from the other disciplines are optimized for two or
three dimensional space. Also, our PDE may have discontinuities (or at
least discontinuous gradients) at locations in the state space which are
not known a priori.</p>
<p>A slightly more general view of the problem would describe the mesh
(and the associated interpolation functions) as just one form of
representation for <a
href="https://en.wikipedia.org/wiki/Function_approximation">function
approximation</a>. Using a <a
href="https://en.wikipedia.org/wiki/Deep_learning">
neural network</a> to represent the cost-to-go also falls under the
domain of function approximation, perhaps representing the other extreme
in terms of complexity; using deep networks in approximate dynamic
programming is common in <a
href="http://rail.eecs.berkeley.edu/deeprlcourse/">deep reinforcement
learning</a>, which we will discuss more later in the book.</p>
<todo> (see Appendix C for a brief background on function approximation)
</todo>
<subsubsection><h1>Value iteration with function
approximation</h1>
<p>If we approximate $J^*$ with a finitely-parameterized function
$\hat{J}_\balpha^*$, with parameter vector $\balpha$, then this
immediately raises many important questions:
<ul>
<li>What if the true cost-to-go function does not live in the
prescribed function class; e.g., there does not exist an $\balpha$ which
satisfies the sufficiency conditions for all $\bx$? (This seems very
likely to be the case.)</li>
<li>What update should we apply in the iterative algorithm?</li>
<li>What can we say about its convergence?</li>
</ul>
Let us start by considering updates given by a least-squares approximation of the value iteration update.</p>
<p>Using the least-squares solution in a value iteration update is
sometimes referred to as <i>fitted value iteration</i>. Here the $\bx_k$
are a set of sample states taken from the continuous space, and for
discrete-time systems the iterative approximate solution to
\begin{gather*} J^*(\bx_0) = \min_{u[\cdot]} \sum_{n=0}^\infty
\ell(\bx[n],\bu[n]), \\ \text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]),
\bx[0] = \bx_0\end{gather*} becomes \begin{gather} J^d_k = \min_\bu
\left[ \ell(\bx_k,\bu) + \hat{J}^*_\alpha\left({f(\bx_k,\bu)}\right)
\right], \\ \balpha \Leftarrow \argmin_\balpha \sum_k
\left(\hat{J}^*_\balpha(\bx_k) - J^d_k \right)^2.
\label{eq:fitted_value_iteration} \end{gather} Since the desired values $J^d_k$ are only an initial guess of the cost-to-go, we will apply this algorithm iteratively until (hopefully) we achieve some numerical convergence.</p>
<p>Note that the update in \eqref{eq:fitted_value_iteration} is not
<i>quite</i> the same as doing least-squares optimization of $$\sum_k
\left(\hat{J}^*_\balpha(\bx_k) - \min_\bu \left[ \ell(\bx_k,\bu) +
\hat{J}^*_\alpha\left({f(\bx_k,\bu)}\right) \right] \right)^2,$$ because
in this equation $\alpha$ has an effect on both occurrences of
$\hat{J}^*$. In \eqref{eq:fitted_value_iteration}, we cut that
dependence by taking $J_k^d$ as fixed desired values; this version
performs better in practice. Like many things, this is an old idea that
has been given a new name in the deep reinforcement learning literature
-- people think of the $\hat{J}^*_\alpha$ on the right hand side as
being the output from a fixed "target network". For nonlinear function
approximators, the update to $\alpha$ in
\eqref{eq:fitted_value_iteration} is often replaced with just one step
of gradient descent.</p>
<example><h1>Neural Value Iteration</h1>
<p>Let us try reproducing our double-integrator value iteration examples using neural networks in <a href="https://pytorch.org/tutorials/">PyTorch</a>:</p>
<jupyter>examples/double_integrator/neural_value_iteration.ipynb</jupyter>
<!--
<p>In this example, you'll see that we use two different approaches to
minimizing over $\bu$. First, we use discrete actions where we simply
evaluate all possible actions and keep the best. Second, we explore
the case which exploits the quadratic cost and the control-affine
dynamics -- but note that we also have to go to continuous time to
have the closed-form solution. (Can you understand why?)</p>
-->
</example>
<p>In general, the convergence and accuracy guarantees for value
iteration with generic function approximators are quite weak. But we do
have some results for the special case of <em>linear function
approximators</em>. A linear function approximator takes the form: \[
\hat{J}^*_\balpha(\bx) = \sum_i \alpha_i \psi_i(\bx) = \bpsi^T(\bx)
\balpha, \] where $\bpsi(\bx)$ is a vector of potentially nonlinear
features. Common examples of features include polynomials, radial basis
functions, or most interpolation schemes used on a mesh. The
distinguishing feature of a linear function approximator is the ability
to exactly solve for $\balpha$ in order to represent a desired function
optimally, in a least-squares sense. For linear function approximators,
this is simply: \begin{gather*} \balpha \Leftarrow \begin{bmatrix}
\bpsi^T(\bx_1) \\ \vdots \\ \bpsi^T(\bx_K)\end{bmatrix}^+
\begin{bmatrix} J^d_1 \\ \vdots \\ J^d_K \end{bmatrix}, \end{gather*}
where the $^+$ notation refers to a Moore-Penrose pseudoinverse.
Remarkably, for linear function approximators, this update is still
known to converge to the globally optimal $\balpha^*$.</p>
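<p>In code, this least-squares fit is a single pseudoinverse. The feature choice and samples below are made up just to show the mechanics:</p>
<pre><code class="language-python">import numpy as np

def fit_linear_value(psi, x_samples, J_desired):
    """Least-squares fit of alpha for J_hat(x) = psi(x)^T alpha, given
    desired cost-to-go values at the sample states."""
    Psi = np.array([psi(x) for x in x_samples])       # K x (num features)
    return np.linalg.pinv(Psi) @ np.asarray(J_desired)

# e.g. polynomial features in a scalar state:
psi = lambda x: np.array([1.0, x, x**2])
x_samples = np.linspace(-2, 2, 21)
J_desired = x_samples**2    # pretend these J^d_k came from a value-iteration sweep
print(fit_linear_value(psi, x_samples, J_desired))    # approximately [0, 0, 1]
</code></pre>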
<todo> add citations for convergence results </todo>
</subsubsection>
<subsubsection><h1>Value iteration on a mesh</h1>
<p>Imagine that we use a mesh to approximate the cost-to-go function
over the state space with $K$ mesh points $\bx_k$. We would like to
perform the value iteration update: \begin{equation} \forall k,
\hat{J}^*(\bx_k) \Leftarrow \min_\bu \left[ \ell(\bx_k,\bu) +
\hat{J}^*\left({f(\bx_k,\bu)}\right) \right],
\label{eq:mesh_value_iteration} \end{equation} but must deal with the
fact that $f(\bx_k,\bu)$ might not result in a next state that is
directly at a mesh point. Most interpolation schemes for a mesh can be
written as some weighted combination of the values at nearby mesh
points, e.g. \[ \hat{J}^*(\bx) = \sum_i \beta_i(\bx) \hat{J}^*(\bx_i),
\quad \sum_i \beta_i = 1 \] with $\beta_i$ the relative weight of the
$i$th mesh point. In <drake></drake> we have implemented barycentric
interpolation<elib>Munos98</elib>. Taking $\alpha_i = \hat{J}^*(\bx_i)$,
the cost-to-go estimate at mesh point $i$, we can see that this is
precisely an important special case of fitted value iteration with
linear function approximation. Furthermore, assuming $\beta_i(\bx_i) =
1,$ (i.e., the only point contributing to the interpolation <i>at a mesh
point</i> is the value at the mesh point itself), the update in Eq.
(\ref{eq:mesh_value_iteration}) is precisely the least-squares update
(and it achieves zero error). This is the representation used in the value iteration examples that you've already experimented with above.</p>
</subsubsection>
<todo>
<subsubsection><h1>Representing the cost-to-go in a deep network</h1>
</subsubsection>
</todo>
<subsubsection><h1>Continuous-time systems</h1>
<p>For solutions to systems with continuous-time dynamics, I have to
uncover one of the details that I've so far been hiding to keep the
notation simpler. Let us consider a problem with a finite-horizon:
\begin{gather*} \min_{\bu[\cdot]} \sum_{n=0}^N \ell(\bx[n],\bu[n]), \\
\text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]), \bx[0] = \bx_0\end{gather*}
In fact, the way that we compute this is by solving the <em>time-varying
cost-to-go function</em> backwards in time: \begin{gather*}J^*(\bx,N) =
\min_\bu \ell(\bx, \bu) \\ J^*(\bx,n-1) = \min_\bu \left[ \ell(\bx, \bu) +
J^*(f(\bx,\bu), n) \right]. \end{gather*} The convergence of the value
iteration update is equivalent to solving this time-varying cost-to-go
backwards in time until it reaches a steady-state solution (the
infinite-horizon solution), which explains why value iteration only
converges if the optimal cost-to-go is bounded.</p>
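<p>For a discrete-time problem with discrete states and actions, this backward pass is only a few lines. The sketch below is generic; the state and action sets and the functions implementing $f$ and $\ell$ are placeholders to be supplied by the problem:</p>
<pre><code class="language-python">def finite_horizon_cost_to_go(S, A, f, ell, N):
    """Time-varying cost-to-go J[n][s], computed backwards from n = N
    to n = 0.  S and A are iterables of discrete states/actions,
    f(s, a) is the next state, and ell(s, a) is the one-step cost."""
    J = {N: {s: min(ell(s, a) for a in A) for s in S}}
    for n in range(N, 0, -1):
        J[n - 1] = {s: min(ell(s, a) + J[n][f(s, a)] for a in A) for s in S}
    return J
</code></pre>
<p>As $N$ grows, $J[0]$ approaches the infinite-horizon cost-to-go whenever that limit is bounded, which is exactly the convergence statement above.</p>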
<p>Now let's consider the continuous-time version. Again, we have a
time-varying cost-to-go, $J^*(\bx,t)$. Now $$\frac{dJ^*}{dt} =
\pd{J^*}{\bx}f(\bx,\bu) + \pd{J^*}{t},$$ and our sufficiency condition
is $$0 = \min_\bu \left[\ell(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu) +
\pd{J^*}{t} \right].$$ But since $\pd{J^*}{t}$ doesn't depend on $\bu$,
we can pull it out of the $\min$ and write the (true) HJB:
$$-\pd{J^*}{t} = \min_\bu \left[\ell(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu)
\right].$$ The standard numerical recipe <elib>Osher03 </elib> for
solving this is to approximate $\hat{J}^*(\bx,t)$ on a mesh and then
integrate the equations backwards in time (until convergence, if the
goal is to find the infinite-horizon solution). If, for mesh point
$\bx_i$ we have $\alpha_i(t) = \hat{J}^*(\bx_i, t)$, then:
$$-\dot\alpha_i(t) = \min_\bu \left[\ell(\bx_i, \bu) + \pd{J^*(\bx_i,
t)}{\bx}f(\bx_i,\bu) \right],$$ where the partial derivative is
estimated with a suitable finite-difference approximation on the mesh
and often some "viscosity" terms are added to the right-hand side to
provide additional numerical robustness; see the Lax-Friedrichs scheme
<elib>Osher03 </elib> (section 5.3.1) for an example. </p>
<p>Probably the most visible and consistent campaign using numerical HJB
solutions in applied control (at least in robotics) has come from <a
href="http://hybrid.eecs.berkeley.edu/index.html">Claire Tomlin's group
at Berkeley</a>. Their work leverages <a
href="https://www.cs.ubc.ca/~mitchell/ToolboxLS/">Ian Mitchell's Level
Set Toolbox</a>, which solves the Hamilton-Jacobi PDEs on a Cartesian
mesh using the technique cartooned above, and even includes the
minimum-time problem for the double integrator as a tutorial
example<elib>Mitchell05</elib>.</p>
</subsubsection>
</subsection> <!-- end function approx -->
</section> <!-- end continuous time DP -->
<todo> try just running snopt on quartic objective </todo>
<!--
<subsection><h1>A continuous policy iteration algorithm</h1>
<todo> simple (e.g. one-d example) </todo>
</section> -->
<!-- end policy iteration -->
<!--
<subsection><h1>How far can we take this? (Performance and Scaling)</h1>
<todo>
Errors in bang-bang for double integrator due to discretization.
Limited in number of dimensions. Function approximation, but lacks convergence results.
</todo>
</section> --><!-- end scaling -->
<section><h1>Extensions</h1>
<todo> add section on extensions. discounting. finite-horizon / time-varying dynamics or cost. call-out to the stochastic case in the future chapter.</todo>
<p>There are many, many nice extensions to the basic formulations that we've presented so far. I'll try to list a few of the most important ones here. I've also had a number of students in this course explore very interesting extensions; for example, <elib>Yang20</elib> looked at imposing a low-rank structure on the (discrete-state) value function, using ideas from matrix completion to obtain good estimates despite updating only a small fraction of the states.</p>
<subsection><h1>Linear Programming Approach</h1>
<p>For discrete MDPs, value iteration is a magical algorithm because it is
simple, but known to converge to the global optimum, $J^*$. However,
other important algorithms are known; one of the most important is a
solution approach using linear programming. This formulation provides an
alternative view, but may also be more generalizable and even more
efficient for some instances of the problem.</p>
<todo>I've moved the stochastic dp equations, but this text depended on them.</todo>
<p>Recall the optimality conditions from Eq.
(\ref{eq:stochastic_dp_optimality_cond}). If we describe the cost-to-go
function as a vector, $J_i = J(s_i)$, then these optimality conditions can
be rewritten in vector form as \begin{equation} \bJ = \min_a \left[ {\bf \ell}(a)
+ \bT(a) \bJ \right], \label{eq:vector_stochastic_dp} \end{equation} where
$\ell_i(a) = \ell(s_i,a)$ is the cost vector, and $T_{i,j}(a) = \Pr(s_j|s_i,a)$
is the transition probability matrix. Let us take $\bJ$ as the vector of
decision variables, and replace Eq. (\ref{eq:vector_stochastic_dp}) with
the constraints: \begin{equation} \forall a, \bJ \le {\bf \ell}(a) + \bT(a)
\bJ.\end{equation} Clearly, for a finite set of actions $a$, this is a finite list of linear
constraints, and for any vector $\bJ$ satisfying these constraints we have
$\bJ \le \bJ^*$ (elementwise). Now write the linear program:
\begin{gather*} \maximize_\bJ \quad \bc^T \bJ, \\ \subjto \quad \forall a,
\bJ \le
{\bf \ell}(a) + \bT(a) \bJ, \end{gather*} where $\bc$ is any positive vector. The
solution to this problem is $\bJ = \bJ^*$.</p>
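<p>As a concrete sketch of how this LP can be assembled with scipy, here is a toy example of my own, with made-up transition matrices and a discount factor $\gamma=0.9$ added purely so that this small, self-contained instance is bounded:</p>
<pre><code class="language-python">import numpy as np
from scipy.optimize import linprog

# Toy 3-state, 2-action MDP (made-up numbers).  State 2 is absorbing with
# zero cost; gamma is a discount factor used only to keep this toy bounded.
gamma = 0.9
T = {0: np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]]),
     1: np.array([[0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])}
ell = {0: np.array([2.0, 1.0, 0.0]),
       1: np.array([5.0, 3.0, 0.0])}
n = 3

# Constraints (I - gamma T(a)) J &lt;= ell(a) for every a; maximize c^T J by
# minimizing -c^T J (linprog minimizes).
A_ub = np.vstack([np.eye(n) - gamma * T[a] for a in T])
b_ub = np.concatenate([ell[a] for a in T])
res = linprog(-np.ones(n), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
print(res.x)   # approximately [2.9, 1.0, 0.0]
</code></pre>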
<p>Perhaps even more interesting is that this approach can be generalized
to linear function approximators. Taking a vector form of my notation
above, now we have a matrix of features with $\bPsi_{i,j} =
\psi^T_j(\bx_i)$ and we can write the LP \begin{gather} \maximize_\balpha
\quad \bc^T \bPsi \balpha, \\ \subjto \quad \forall a, \bPsi \balpha \le
{\bf \ell}(a) + \bT(a) \bPsi \balpha. \end{gather} This time the solution is not
necessarily optimal, because $\bPsi \balpha^*$ only approximates $\bJ^*$,
and the relative values of the elements of $\bc$ (called the
"state-relevance weights") can determine the relative tightness of the
approximation for different features <elib>Farias02</elib>.</p>
</subsection>
</section> <!-- end of extensions -->
<section><h1>Exercises</h1>
<exercise><h1>Choosing a Cost Function</h1>
<figure>
<img width="50%" src="figures/exercises/choosing_cost.svg"/>
<figcaption>Autonomous car moving at velocity $v$ on a straight road.</figcaption>
</figure>
<p>The figure above shows an autonomous car moving at constant velocity $v>0$ on a straight road. Let $x$ be the (longitudinal) position of the car along the road, $y$ its (transversal) distance from the centerline, and $\theta$ the angle between the centerline and the direction of motion. The only control action is the steering velocity $u$, which is constrained in the interval $[u_{\text{min}}, u_{\text{max}}]$ (where $u_{\text{min}}<0$ and $u_{\text{max}}>0$). We describe the car dynamics with the simple kinematic model \begin{align*}\dot x &= v \cos \theta, \\ \dot y &= v \sin \theta, \\ \dot \theta &= u.\end{align*} Let $\bx = [x, y, \theta]^T$ be the state vector. To optimize the car trajectory we consider a quadratic objective function $$J = \int_{0}^{\infty} [\bx^T(t) \bQ \bx(t) + R u^2(t)] dt,$$ where $\bQ$ is a constant positive-semidefinite (hence symmetric) matrix and $R$ is a constant nonnegative scalar (note that $R=0$ is allowed here).</p>
<ol type="a">
<li>Suppose our only goal is to keep the distance $y$ between the car and the centerline as small as possible, without worrying about anything else. What would be your choice for $\bQ$ and $R$?</li>
<li>How would the behavior of the car change if you were to multiply the weights $\bQ$ and $R$ from the previous point by an arbitrary positive coefficient $\alpha$?</li>
<li>The cost function from point (a) might easily lead to excessively sharp turns. Which entry of $\bQ$ or $R$ would you increase to mitigate this issue?</li>
<li>Country roads are more slippery on the sides than in the center. Is this class of objective functions rich enough to include a penalty on sharp turns that increases with the distance of the car from the centerline?</li>
<li>With this dynamic model and this objective function, would you ever choose a weight matrix $\bQ$ which is strictly positive definite (independent of the task you want the car to accomplish)? Why?</li>
</ol>
</exercise>
<exercise><h1>Ill-Posed Optimal Control Problem</h1>
<p>In this exercise we will see how seemingly simple cost functions can give surprising results. Consider the single-integrator system $\dot x = u$ with initial state $x(0)=0$. We would like to find the control signal $u(t)$ that minimizes the seemingly innocuous cost function $$J = \int_0^T x^2(t) + (u^2(t) - 1)^2 dt,$$ with $T$ finite. To this end, we consider a <a href="https://en.wikipedia.org/wiki/Square_wave">square-wave</a> control parameterized by $\tau>0$: $$u_\tau(t) = \begin{cases} 1 &\text{if} & t \in [0, \tau) \cup [3 \tau, 5 \tau) \cup [7 \tau, 9 \tau) \cup \cdots \\ -1 &\text{if} & t \in [\tau, 3 \tau) \cup [5 \tau, 7 \tau) \cup [9 \tau, 11 \tau) \cup \cdots \end{cases}.$$</p>
<ol type="a">
<li>What are the states $x$ and the inputs $u$ for which the running cost $$\ell(x, u) = x^2 + (u^2 - 1)^2$$ is equal to zero?</li>
<li>Consider now two control signals $u_{\tau_1}(t)$ and $u_{\tau_2}(t)$, with $\tau_1 = \tau_2 / 2$. Which one of the two incurs the lower cost $J$? (Hint: start by sketching how the system state $x(t)$ evolves under these two control signals.)</li>
<li>What happens to the cost $J$ when $\tau$ gets smaller and smaller? What does the optimal control signal look like? Could you implement it with a finite-bandwidth controller?</li>
</ol>
</exercise>
<exercise><h1>A Linear System with Quadratic Cost</h1>
<p>Consider the scalar control differential equation $$\dot{x} = x + u,$$ and the infinite horizon cost function $$J = \int_0^{\infty} [3x^2(t) + u^2(t)] dt.$$ As we will see in the <a href="lqr.html">chapter on linear-quadratic regulation</a>, the optimal cost-to-go for a problem of this kind assumes the form $J^* = S x^2$. It is not hard to see that this, in turn, implies that the optimal controller has the form $u^* = - K x$.</p>
<ol type="a">
<li>Imagine that you plugged the expression $J^* = S x^2$ in the HJB equations for this problem, you solved them for $S$, and you got $S \leq 0$. Would this result ring any alarm bells? Explain your answer.</li>
<li>Use the HJB sufficiency theorem to derive the optimal values of $S$ and $K$.</li>
<li>
Working with digital controllers, we typically have to sample the dynamics of our systems, and approximate them with discrete-time models. Let us introduce a time step $h > 0$ and discretize the dynamics as $$\frac{x[n+1] - x[n]}{h} = x[n] + u[n],$$ and the objective as $$h \sum_{n=0}^{\infty} (3 x^2[n] + u^2[n]).$$ One of the following expressions is the correct cost-to-go $J_h^*(x)$ for this discretized problem. Can you identify it without solving the discrete-time HJB equation? Explain your answer.
<ol type="i">
<li>$J_h^* (x)= S_h x^4$ with $S_h = 3 + h + \sqrt{6}h^2$.</li>
<li>$J_h^* (x)= S_h x^2$ with $S_h = 1 + 2h + 2 \sqrt{h^2 + h +1}$.</li>
<li>$J_h^* (x)= S_h x^2$ with $S_h = 3 + 2h^2 + \sqrt{h^2 + h + 2}$.</li>
</ol>
</li>
</ol>
</exercise>
<exercise><h1>Value Iteration for Minimum-Time Problems</h1>
<p>In this exercise we analyze the performance of the value-iteration algorithm, considering its application to the <a href="#minimum_time_double_integrator">minimum time problem for the double integrator</a>. <a href="https://colab.research.google.com/github/RussTedrake/underactuated/blob/master/exercises/dp/minimum_time/minimum_time.ipynb" target="_blank">In this python notebook</a>, you will find everything you need for this analysis. Take the time to go through the notebook and understand the code in it, then answer the following questions.</p>
<ol type="a">
<li>
At the end of the <a href="https://colab.research.google.com/github/RussTedrake/underactuated/blob/master/exercises/dp/minimum_time/minimum_time.ipynb" target="_blank">notebook</a> section "Performance of the Value-Iteration Policy", we plot the state trajectory of the double integrator in closed loop with the value-iteration policy, as well as the resulting control signal.
<ol type="i">
<li>Does the state-space trajectory follow the theoretically-optimal parabolic arcs we have seen in <a href="#minimum_time_double_integrator">the example</a>?</li>
<li>Is the control policy we get from value iteration a bang-bang policy? In other words, does the control signal take values in the set $\{-1, 0, 1\}$ exclusively?</li>
<li>Explain in a couple of sentences, what is the reason for this behavior.</li>
</ol>
</li>
<li>In the "Value Iteration Algorithm" section of the <a href="https://colab.research.google.com/github/RussTedrake/underactuated/blob/master/exercises/dp/minimum_time/minimum_time.ipynb" target="_blank">notebook</a>, increase the number of knot points on the $q$ and the $\dot q$ axes to refine the state-space mesh used in the value iteration. Does this fix the issues you have seen in point (a)?</li>
<li>In the final section of the <a href="https://colab.research.google.com/github/RussTedrake/underactuated/blob/master/exercises/dp/minimum_time/minimum_time.ipynb" target="_blank">notebook</a>, implement the theoretically-optimal control policy from <a href="#minimum_time_double_integrator">the example</a>, and use the plots to verify that the closed-loop system behaves as expected.</li>
</ol>
</exercise>
</section>
</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=stochastic.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=underactuated.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=lqr.html>Next Chapter</a></td>
</tr></table>
<div id="footer">
<hr>
<table style="width:100%;">
<tr><td><em>Underactuated Robotics</em></td><td align="right">© Russ
Tedrake, 2020</td></tr>
</table>
</div>
</body>
</html>