Months ago, I came across this video of a robot hand rotating a cube in all three dimensions to reach specific target configurations of that cube. This is extremely difficult - the robot has 20 degrees of freedom!

This got me interested in the idea of reinforcement learning, a framework that allows the development of agents for a wide variety of problems. Luckily, this coincided with OpenAI releasing the fantastic Spinning Up in Deep RL guide, written by Josh Achiam, a Ph.D. student at UC Berkeley.

Together with some friends, I tinkered with a few tools and we developed Amca, a Backgammon-playing agent, along with an environment for training and evaluating algorithms. The more I read and tinkered, the more obvious it became to me how important this field will be for robotics (and more) in the future. Grab a cup of coffee and make yourself comfortable before proceeding.

Consider the following model shown in Fig. 1. In this model, we have an agent existing in some
environment. This agent observes the environment, and based on that observation performs an
action (or does nothing - which can be thought of as an action). This is the most basic type of
agent, known as a *simple reflex agent*.

We can model this agent as a function which takes an input (*observation*) \(o_t\) and
returns an output (*action*) \(a_t\). Let's call this function *policy* \(\pi\).
We can now make this agent 'intelligent' by defining a policy that
reacts to observations through a set of \(n\) rules. Let's design such a policy for a party: \[ \begin{equation} \pi(o_t) = \begin{cases} \text{smile} & \text{if
someone smiles at you} \\ \text{dance} & \text{if the music is good} \\ \hfill \vdots \\
a^{(j)}_t & \text{if } o_t \equiv x\\ \hfill \vdots \\ a^{(n)}_t & \text{for all other } o_t
\\ \end{cases} \end{equation} \]
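As a minimal sketch, the party policy above can be written as a plain lookup table. The observation strings and the `"do nothing"` fallback (standing in for \(a^{(n)}_t\)) are illustrative assumptions, not part of any library.

```python
def party_policy(observation: str) -> str:
    """Simple reflex policy: a fixed mapping from observation to action."""
    rules = {
        "someone smiles at you": "smile",
        "the music is good": "dance",
    }
    # Every observation without an explicit rule falls through to a default
    # action (hypothetical choice: doing nothing).
    return rules.get(observation, "do nothing")


print(party_policy("the music is good"))       # dance
print(party_policy("someone spills a drink"))  # do nothing
```

Nothing here is learned: the agent only ever does what the rule table says, which is exactly the limitation discussed next.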

Most high-level decision-making in robotics today follows this framework, albeit with more features to make it more robust and using fancy names like finite-state machines. You can probably see how this limits the robot's adaptability to new situations, but it gets the job done for most purposes.

In the 1980s, ideas from the fields of optimal control and psychology merged into what we today
call reinforcement learning. It can be summarized (with a pinch of salt) as the addition of an
extra *reward* input to the simple reflex agent, as shown in Fig. 2.

This reward signal adds a subtle, but very powerful, property to the agent. It allows the agent's policy \(\pi_{\boldsymbol{\theta}}\) (now denoted with a subscript \(\boldsymbol{\theta}\) that describes its parameters) to be learnable by informing the agent of the advantages and disadvantages of the action it took in a given state.

To understand how RL can be put to use in robotics, let's try to control a mobile robot to move to some goal pose. To summarize one theory behind controlling a mobile robot, the controller assumes:

- the goal to be \(\boldsymbol{g}=[0, 0, 0]^T\), which describes the robot pose in 2D Cartesian coordinates and 1D heading angle in the inertial or world frame,
- the observation to be \(\boldsymbol{o_t}=[\rho, \alpha, \beta]^T\), the robot pose in polar coordinates, and
- the action to be \(\boldsymbol{a_t}=[\dot{\rho}, \dot{\alpha}, \dot{\beta}]^T\), the velocity commanded to the robot in polar coordinates.
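To make the polar observation concrete, here is a rough sketch of the change of coordinates such a controller relies on. It assumes one common convention from the mobile robotics literature (\(\rho\) is the distance to the goal, \(\alpha\) the bearing of the goal relative to the robot's heading, and \(\beta = -\phi - \alpha\)), with the goal at the origin; sign conventions vary between texts, and the function names are made up for this example.

```python
import math


def wrap_angle(angle: float) -> float:
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def cartesian_to_polar(x: float, y: float, phi: float):
    """Convert a world-frame pose (x, y, phi) to (rho, alpha, beta).

    Assumes the goal pose is the origin with zero heading.
    """
    dx, dy = -x, -y                                # vector from robot to goal
    rho = math.hypot(dx, dy)                       # distance to the goal
    alpha = wrap_angle(math.atan2(dy, dx) - phi)   # goal bearing relative to heading
    beta = wrap_angle(-phi - alpha)                # assumed convention for beta
    return rho, alpha, beta


# A robot one metre behind the goal and already facing it.
rho, alpha, beta = cartesian_to_polar(-1.0, 0.0, 0.0)
```

Having to perform this kind of transformation before the controller can even be applied is one of the things an RL-based controller could, in principle, do without.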

RL in the context of robotics is concerned with finding alternatives to control systems or augmenting them where they fail. So if, instead of using this controller, we wanted to develop our own controller using RL, we could feed it lower-level observations and actions directly: there is no need for transformations to and from the polar representation, no need to transform our frame of reference such that the goal is the origin, and we can define a goal heading as required. \[ \begin{equation} \boldsymbol{\pi_\theta} \left(\left[\begin{array}{c} x \\ y \\ \phi \\ \dot{x} \\ \dot{y} \\ \dot{\phi} \end{array}\right]\right) = \left[\begin{array}{c} \dot{x} \\ \dot{y} \\ \dot{\phi} \end{array}\right] \hspace{2em} \text{given } \boldsymbol{g} = [g_x, g_y, g_\phi]^T \end{equation} \]

This may not seem like a very important contribution, but we will see how it can be expanded into domains where the input to the controller is nothing more than an image, or where there are no control-based solutions (at least not without major drawbacks), such as in self-driving cars, where decisions need to be made from very complex and widely varying signals and/or images, or in legged robots, where developing an accurate model of the system dynamics is extremely difficult. Before we can go any further, though, we must delve into the framework of RL environments and their dynamics.

Given

- some set of states \(S = \{s^{(1)}, s^{(2)}, s^{(3)}, s^{(4)}, s^{(5)}, \dots \} \) and
- some set of actions \(A = \{a^{(1)}, a^{(2)}, a^{(3)}, a^{(4)}, a^{(5)}, \dots \} \),

we say that a state is a Markov state if it contains all the relevant information about the entire sequence of states and actions that preceded it. If we can model our environment as a set of Markov states, we can then reliably navigate this environment using just our current state and an action we choose. The sequence of states that we accumulate while moving in this way is called a Markov Process.

Now we can wrap it up by defining a Markov Decision Process to be one where the agent can reliably (because of the Markov property) decide (because of our policy - we'll get to that shortly) on how to navigate through the environment to achieve its goal. This is the framework upon which modern RL, and especially deep RL, is built.

Formally, a Markov Decision Process for an RL problem is usually a 5-tuple, \(\langle S, A, R, P, \rho_0 \rangle\), where

- \(S\) is the set of all valid states,
- \(A\) is the set of all valid actions,
- \(R : S \times A \times S \to \mathbb{R}\) is the reward function (remember the reward signal we mentioned earlier?), depending on the current state, current action, and future state \(R(s_t, a_t, s_{t+1})\),
- \(P : S \times A \to \mathcal{P}(S)\) is the transition probability function, with \(P(s'|s,a)\) being the probability of transitioning into state \(s'\) if you start in state \(s\) and take action \(a\), and
- \(\rho_0\) is the starting state distribution.
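The 5-tuple maps quite directly onto a small data structure. The sketch below is only an illustration: the `MDP` class, its `reset` and `step` methods, and the two-room toy world are all invented for this example and do not correspond to any particular library's API.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MDP:
    states: List[str]                                     # S
    actions: List[str]                                    # A
    reward: Callable[[str, str, str], float]              # R(s, a, s')
    transition: Dict[Tuple[str, str], Dict[str, float]]   # P(s' | s, a)
    rho_0: Dict[str, float]                               # starting state distribution

    def reset(self, rng: random.Random) -> str:
        """Sample the initial state s_0 from rho_0."""
        states, probs = zip(*self.rho_0.items())
        return rng.choices(states, weights=probs)[0]

    def step(self, s: str, a: str, rng: random.Random) -> Tuple[str, float]:
        """Sample s' from P(. | s, a) and return it with the reward."""
        next_states, probs = zip(*self.transition[(s, a)].items())
        s_next = rng.choices(next_states, weights=probs)[0]
        return s_next, self.reward(s, a, s_next)


# Hypothetical two-room world: "move" succeeds 90% of the time,
# and the agent is rewarded whenever it ends up in the right room.
toy = MDP(
    states=["left", "right"],
    actions=["stay", "move"],
    reward=lambda s, a, s_next: 1.0 if s_next == "right" else 0.0,
    transition={
        ("left", "stay"): {"left": 1.0},
        ("left", "move"): {"right": 0.9, "left": 0.1},
        ("right", "stay"): {"right": 1.0},
        ("right", "move"): {"left": 0.9, "right": 0.1},
    },
    rho_0={"left": 1.0},
)

rng = random.Random(0)
s = toy.reset(rng)
s_next, r = toy.step(s, "move", rng)
```

Note that `step` only needs the current state and action, which is precisely what the Markov property buys us.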

In the context of learning, the problem becomes one where we would like to learn some optimal policy \(\pi^{*}\) that will return the optimal action \(a^*_t\) given the current observation \(o_t\) and set of actions \(A_t\). \[ \begin{equation} \pi^{*}(o_t, A_t) = a^*_t \end{equation} \]

Note that \(o_t \not\equiv s_t\); an observation does not (usually) have the Markov
property. The state is the true (usually condensed) representation of the environment,
whereas the observation is the partial information received by the agent about the
environment. The exception to this is when a state is *fully observable*, such as in
chess, checkers, backgammon, etc.

To start defining the RL problem, as seen with the MDP 5-tuple earlier, we must refer to some reward function \(R(s_{t}, a_t,s_{t+1})\) that will send a reward signal to the agent as seen in Fig. 2. The intuition behind the reward function is to, well, reward the agent when it takes a good action, and punish the agent when it takes a bad one.

To define an optimal policy, we need a measure of policy performance; the most popular such
measure is the *expected reward*. Given some *trajectory* (a sequence of
state-action pairs), \(\tau = \langle s_0a_0, s_1a_1, \dots, s_Ta_T \rangle\), we can define
the expected reward as: \[ \begin{equation}
\DeclareMathOperator*{\E}{\mathbb{E}} J(\pi_\theta)=\E_{\tau\sim
\pi_\theta}[{B(\tau)}]=\int_{\tau} P(\tau|\pi_\theta){B(\tau)} \end{equation}
\label{expected-reward} \] where

- \({B(\tau)}=\sum_{t=0}^{T-1}R(s_{t}, a_t,s_{t+1})\) is the sum of the rewards in the trajectory
- \(P(\tau|\pi_\theta) = \rho_0 (s_0) \prod_{t=0}^{T-1} P(s_{t+1} | s_t, a_t) \pi_\theta(a_t | s_t)\) is the probability distribution of the trajectories we will come across while using policy \(\pi_\theta\)
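The two definitions above can be sketched in a few lines for a small discrete problem. The dictionaries `rho_0`, `P` and `pi` below are made-up toy values, and the function names are only illustrative.

```python
def trajectory_return(rewards):
    """B(tau): the plain sum of the rewards collected along the trajectory."""
    return sum(rewards)


def trajectory_probability(states, actions, rho_0, P, pi):
    """P(tau | pi) = rho_0(s_0) * prod_t P(s_{t+1} | s_t, a_t) * pi(a_t | s_t).

    `states` holds s_0..s_T and `actions` holds a_0..a_{T-1}.
    """
    prob = rho_0[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= P[(s, a)][s_next] * pi[s][a]
    return prob


# Toy problem (assumed values): start in "A", "go" reaches "B" 80% of the time.
rho_0 = {"A": 1.0}
P = {
    ("A", "go"): {"A": 0.2, "B": 0.8},
    ("A", "wait"): {"A": 1.0},
    ("B", "go"): {"B": 1.0},
}
pi = {"A": {"go": 0.5, "wait": 0.5}, "B": {"go": 1.0}}

prob = trajectory_probability(["A", "B", "B"], ["go", "go"], rho_0, P, pi)
ret = trajectory_return([0.0, 1.0])
```

Averaging `trajectory_return` over many trajectories sampled with the policy gives a Monte Carlo estimate of \(J(\pi_\theta)\), since sampling already weights each trajectory by its probability.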

As the name indicates, this is just the expected value of the reward we will accumulate over
the trajectories generated by following the policy \(\pi_\theta\). Note that we can also have infinite trajectories, and
adapt our trajectory-reward function to \(B(\tau)=\sum_{t=0}^{\infty} \gamma^t r_t\), where
\(r_t = R(s_t, a_t, s_{t+1})\) and \(\gamma \in (0, 1)\) (this is usually called the *discount factor*).
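For a finite list of rewards, the discounted sum can be computed with a short backward pass. This is a small sketch; the function name is an assumption made for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is just r_t + gamma * (return from t+1).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With a constant reward of 1, the infinite sum converges to \(1/(1-\gamma)\), which is why discounting keeps the return of an infinite trajectory finite.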

Now, we can define the RL problem of finding the optimal policy as an optimization problem, where we would like to maximize our expected reward: \[ \begin{equation} \DeclareMathOperator*{\argmax}{arg\,max} \pi^* = \argmax_{\pi_\theta} J(\pi_\theta) \label{optimal-policy} \end{equation} \]

One of the most common algorithms used to solve this is called Q-Learning; this method maximizes
the state-action value, also known as the *Q-value*, of a policy. The Q-value is
simply an approximation of how good the policy is. \[
\begin{equation} \DeclareMathOperator*{\E}{\mathbb{E}} Q^{\pi_\theta}(s,a) =
\E_{\tau\sim \pi_\theta}[{B(\tau) | s_0 = s, a_0 = a}] \end{equation} \]

Notice here that the Q-value measures the expected value of the return of the policy
__after we have started with state \(\underline{s}\) and action \(\underline{a}\)__. The
assumption is that if we can maximize the *Q-function* seen here, then we would be
able to extract the optimal action at every state simply by looking at the Q-values for each
action that can be taken at that state and selecting the action with the maximum Q-value.
\[ \begin{equation} \DeclareMathOperator*{\argmax}{arg\,max}
a(s) = \argmax_a Q^{\pi_\theta}(s,a) \end{equation} \]
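As a rough sketch of the idea, here is classic tabular Q-learning on a made-up five-cell corridor, where the agent starts in cell 0 and is rewarded for reaching cell 4. The environment, the hyperparameters and the purely random behaviour policy are all assumptions chosen to keep the example short.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(s, a):
    """Deterministic corridor: reward 1 only when the goal cell is reached."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)  # explore with a uniformly random policy
            s_next, r, done = step(s, a)
            # Bootstrapped target: reward plus discounted best next Q-value.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


def greedy_action(Q, s):
    """Extract the action with the maximum Q-value in state s."""
    return max(ACTIONS, key=lambda a: Q[s][a])


Q = q_learning()
policy = [greedy_action(Q, s) for s in range(GOAL)]
```

Even though the agent behaved randomly while collecting data, the greedy policy read off the table heads straight for the goal.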

This is, in fact, the basis of Deep Q-Learning (DQN), introduced in the seminal paper by Mnih et al., which re-launched great interest in the field of RL. DQN, however, differs from classic Q-Learning in some ways.

The first difference is that it uses a deep neural network to represent the Q-function, which allows recent (powerful) advancements in the field of deep learning for optimization to be utilized.

The second major difference is the use of *experience replay*, which allows using the
Q-values (and their state-action pairs) obtained while following several different policies
during learning. The subtle difference is that we attempt to converge to the optimal policy
by using what we learned about the Q-value of the state-action pairs observed using many
policies (*off-policy learning*), instead of just one (*on-policy learning*).
While both the classic and deep variants of Q-learning do off-policy learning, DQN randomly
samples from many previous *experiences*, whereas classic Q-learning at each step
only learns from the last obtained experience. This serves two purposes:

- It helps the optimization algorithm avoid local minima, and
- it eases learning from human experience: so long as the experience is recorded in the same format, you can always add it to the set of experiences to learn from.
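A bare-bones experience replay memory might look like the sketch below. The class name, the tuple layout and the fixed capacity are assumptions made for illustration; real implementations differ in the details.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past experiences, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently drops the oldest experience when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random mini-batch of stored experiences."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

Because the buffer only cares about the format of each experience, transitions recorded from a human demonstrator could be added through the same `add` call.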

Note that the loss function, or performance metric, used to learn the Q-function in DQN is not equivalent to (the negative of) the expected reward, \(J(\pi_\theta)\), seen above, since the optimal Q-function itself is not defined by the largest expected reward. It is defined by the largest expected reward, given an initial state and initial action. In fact, the loss function used in DQN and its optimization is beyond the scope of this preface; suffice it to say that gradient descent with an Iron Man suit is used.

Recently, however, there has been great interest in *Policy Optimization*. Here,
instead of optimizing a Q-function, the policy itself is optimized. Following the trend of
modern AI, the policy function is represented as a neural network. Unlike with Q-Learning,
it is optimized by using the expected reward, \(J(\pi_\theta)\), as a performance objective.
Note that the choice between policy optimization and Q-learning is not binary, and as we will
see later, they can be integrated with each other.

Usually, to optimize (learn) the policy function, the objective performance function used is
also learned using another neural network that measures the *state value*, which is
yet another approximation of how good the policy is. \[
\DeclareMathOperator*{\E}{\mathbb{E}} \begin{equation} V^{\pi_\theta}(s) = \E_{\tau\sim
\pi_\theta}[{B(\tau) | s_0 = s}] \end{equation} \]

Note that the state value and the state-action value (Q-value) are two distinct functions. The difference is that in a state-action value function, the initial action is given, whereas in a state value function, the action is extracted from the policy.
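For a discrete action space, the link between the two can be made explicit: averaging the Q-values over the actions the policy would pick gives the state value, \(V^{\pi_\theta}(s) = \sum_a \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\). A tiny sketch, with made-up numbers:

```python
def state_value(q_values, action_probs):
    """V(s) = sum_a pi(a | s) * Q(s, a) for a discrete action space."""
    return sum(p * q for p, q in zip(action_probs, q_values))


# Two actions with Q-values 1 and 3; the policy picks the second 75% of the time.
v = state_value(q_values=[1.0, 3.0], action_probs=[0.25, 0.75])
print(v)  # 2.5
```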

This way, algorithms using policy optimization usually progress by optimizing the state value neural network, then using the state value approximation generated to optimize the policy neural network, then taking an action based on the policy network and the observation, and repeating the process. Keep this in mind; we will see an example of this in the next section.
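The loop described above can be sketched in a heavily simplified form. The example below swaps the neural networks for small tables and uses a made-up one-step task with two observations, where the rewarded action is the one matching the observation; the learning rates and function names are assumptions, and this is a toy illustration of the value-then-policy update pattern rather than any published algorithm.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reward(state, action):
    # Hypothetical task: the action matching the state index is rewarded.
    return 1.0 if action == state else 0.0


def train(steps=2000, lr_value=0.1, lr_policy=0.5, seed=0):
    rng = random.Random(seed)
    V = [0.0, 0.0]                      # state value table (stand-in for a network)
    theta = [[0.0, 0.0], [0.0, 0.0]]    # policy preferences per state
    for _ in range(steps):
        s = rng.randrange(2)                         # observe
        probs = softmax(theta[s])
        a = rng.choices([0, 1], weights=probs)[0]    # act according to the policy
        r = reward(s, a)
        delta = r - V[s]                 # how much better than expected (one-step episodes)
        V[s] += lr_value * delta         # update the state value estimate
        for b in (0, 1):                 # then use it to update the policy
            grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += lr_policy * delta * grad_log_pi
    return V, theta


V, theta = train()
learned = [softmax(theta[s]) for s in (0, 1)]
```

The value estimate acts as a baseline: an action is reinforced only when it does better than what the agent already expected from that state.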

RL-developed agents have become state-of-the-art agents in several domains, especially very complicated games, such as Backgammon, Chess, Go, Dota, and StarCraft. The stochasticity, enormous size, and abstractness of the state and action spaces, along with the need for real-time responses, make RL a great candidate for a solution. Note that it is not the only alternative, as shown by Libratus, an agent based on good old-fashioned game theory that beat the top human players in Texas Hold 'em, a game of comparable complexity to the ones mentioned earlier.

That being said, research has been going on since the 1990s on ways of applying RL to robotics problems, and there has been some very interesting recent work done to this end. To get a solid hold of how RL relates to robotics and perhaps a peek into the future of robotics, we will take a look at two very recent papers that use RL for robotics problems.

Let's think about how we can train an RL agent to control a robot. Would you let the RL agent train on the robot itself? Probably not; robots are expensive and the agent may attempt to execute very strange, or dangerous, actions which might harm the robot or humans around it. Usually, the preferred method is to fire up a simulation of the robot and train an RL agent to control the robot until it seems robust enough to control the real robot, then continue the training on the robot.

The problem is that the RL agent developed is, often, __too good__! It exploits specific
properties of the simulation and develops a policy that would be undesirable, weird, or
impossible in reality.

In early 2019, ETH Zurich and Intel published a paper that attempted to attack this
problem in an engineering fashion. The researchers attempt to learn a controller for an ANYmal, a popular quadruped
robot research platform. As expected, they note the difficulty of training such an agent in
simulation due to the complexity of rigid-body dynamics of the
robot interacting with its environment, as well as the robot's *series-elastic
actuators* being difficult to describe to a physics simulation.

For the actuators, the authors use supervised machine learning to develop a model of the actuators. More specifically, they trained a 3-layer deep neural network to predict the torque output given the three previous position errors (actual - commanded) and velocities. They use this model in their simulation to get a more accurate actuator response while training. They also note that the model generated this way outperformed a separate model that does not employ learning but assumes ideal properties.
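To give a feel for the shape of such an actuator model, here is an untrained sketch: six inputs (three position errors and three velocities) go in, one torque estimate comes out. The hidden width of 32, the softsign activation and the random initial weights are assumptions made for illustration only; in practice the weights would be fitted to logged actuator data with supervised learning.

```python
import random


def softsign(v):
    return v / (1.0 + abs(v))


def dense(x, weights, biases):
    """One fully connected layer: weights @ x + biases."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_torque(position_errors, velocities, layers):
    """Map the recent position errors and velocities to one torque value."""
    x = list(position_errors) + list(velocities)
    for weights, biases in layers[:-1]:
        x = [softsign(v) for v in dense(x, weights, biases)]
    weights, biases = layers[-1]          # linear output layer
    return dense(x, weights, biases)[0]


rng = random.Random(0)
sizes = [6, 32, 32, 1]                    # assumed layer sizes
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
torque = predict_torque([0.02, 0.015, 0.01], [0.5, 0.4, 0.3], layers)
```

Inside the simulator, a trained version of `predict_torque` would replace the idealized actuator model at every simulation step.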

To account for the rigid-body dynamics complexity, the authors train their policy on several
models of the robot, with each model having different inertial properties. This ensures that
the policy developed does not exploit a modeling error; this is a form of *domain
randomization*, a popular way of mitigating these so-called *sim-to-real*
issues.
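A minimal sketch of this kind of randomization could look as follows. The parameter names, the nominal values and the spread of the perturbations are all made up for the example and are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class InertialParams:
    body_mass: float      # kg
    leg_mass: float       # kg
    com_offset_x: float   # m, centre-of-mass offset along the body axis


# Hypothetical nominal model of the robot.
NOMINAL = InertialParams(body_mass=30.0, leg_mass=2.0, com_offset_x=0.0)


def randomize(nominal, rng, mass_spread=0.15, com_spread=0.02):
    """Return a perturbed copy of the nominal inertial parameters."""
    def scaled(value):
        return value * (1.0 + rng.uniform(-mass_spread, mass_spread))

    return InertialParams(
        body_mass=scaled(nominal.body_mass),
        leg_mass=scaled(nominal.leg_mass),
        com_offset_x=nominal.com_offset_x + rng.uniform(-com_spread, com_spread),
    )


rng = random.Random(0)
training_models = [randomize(NOMINAL, rng) for _ in range(10)]
```

During training, each episode would be run on one of these models, so the policy cannot latch onto the quirks of any single one.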

Finally, they use *Trust Region Policy Optimization* (TRPO), a policy gradient
algorithm, for their training. As an observation, the policy takes in current and previous
joint states (they stress this), along with linear and angular velocities, and outputs joint
position commands.

The results are quite impressive: one policy developed enables the robot to run faster than ever before, and another enables it to recover from falling. An overview of the work done, as well as some results, can be seen below:

In late 2017, Berkeley Artificial Intelligence
Research (BAIR) researchers showed the utility of using the *maximum entropy*
framework to define an RL problem, and then twice in 2018, they expanded on this by
describing the Soft Actor-Critic (SAC), an RL
algorithm family specifically designed for robotics problems.

To begin with, SAC is a model-free (a model being a function that can predict state
transitions or rewards) RL algorithm family developed with robotics applications in mind. It
is specifically designed to be robust, to allow data sampling and training to be run on
separate threads (*asynchronous sampling*) and to allow handling of interruptions in
the data stream.

Most importantly, SAC uses the maximum entropy RL framework, which augments the objective function defined in Eqn. \(\ref{expected-reward}\) to include an entropy term: \[ \DeclareMathOperator*{\E}{\mathbb{E}} \begin{equation} J(\pi_\theta) = \E_{ \tau \sim \pi_\theta} \left[B(\tau) + \alpha \mathcal{H}(\tau)\right] \end{equation} \label{maximum-entropy-objective-function} \]

where \(\mathcal{H}(\tau)\) is the total entropy of the policy along that trajectory, \(\sum_{t=0}^{T-1}\mathcal{H}(\pi_\theta(\cdot | s_t))\), and \(\alpha\) is called the temperature and is a hyperparameter that determines the relative importance of the entropy to the reward. SAC also presents an automatic way of adjusting the temperature \(\alpha\) for each time-step using dynamic programming.
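For a discrete action space, the entropy-augmented return of a single sampled trajectory can be sketched as below. The function names and the toy numbers are assumptions, and SAC itself targets continuous actions, so this only illustrates the shape of the objective.

```python
import math


def entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


def max_entropy_return(rewards, action_distributions, alpha):
    """B(tau) + alpha * H(tau) for one trajectory.

    `action_distributions` holds the policy's action probabilities at each
    visited state; `alpha` is the temperature.
    """
    total_entropy = sum(entropy(p) for p in action_distributions)
    return sum(rewards) + alpha * total_entropy


# Same rewards, but the first policy stays undecided at the first state.
undecided = max_entropy_return([1.0, 1.0], [[0.5, 0.5], [1.0, 0.0]], alpha=0.1)
decided = max_entropy_return([1.0, 1.0], [[1.0, 0.0], [1.0, 0.0]], alpha=0.1)
```

With equal rewards, the more random policy scores higher, and the temperature controls by how much.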

By adding an entropy term, the policy learned will be robust to perturbations. Think of a
legged robot running: the information received is from noisy sensors, and the actions sent
are to noisy actuators, both of which are affected by stochastic physical processes, so pushing
the agent to develop a policy for more chaotic (read: higher entropy) signal streams is
essential. Another way to see this is that by maximizing entropy, the agent maximizes
*exploration* and by maximizing reward, the agent maximizes *exploitation*.

SAC also uses an *actor-critic* framework, meaning it deploys a policy approximator,
called the actor, and a soft Q-function (the softmax of the Q-function; beyond the scope of
this preface) approximator, called the critic, simultaneously. Recall what we earlier
discussed at the end of the RL section! At each time-step, the algorithm alternates between
updating the Q-function (critic), then using it to update the policy (actor), then
repeating.

Crucially, SAC utilizes a memory pool. This allows sampling state-action pairs from multiple policies when calculating the objective function for the soft Q-function approximator. This also allows for sampling multiple states from multiple policies when calculating the objective function for the policy approximator. This results in much greater sample efficiency and is similar to the 'trick' earlier discussed with DQN.

Artificial intelligence, and especially machine learning, has seen a significant rise in effectiveness and applicability in the past decade. There are many reasons for this, but we can safely assume the end of the string of successes will not be seen in the short term. That being said, these successes have been massively hyped - beyond anything that has been seen for at least a decade.

Therefore, it is best to be cautiously optimistic about these new techniques. It is far too soon to judge the utility of these methods, how fast they can be adopted, how far they can be adopted, or if they will be adopted at all. After all, the robotics world is notorious for being very promising and hugely disappointing - mostly due to the economic inviability of many robotics-related ideas.

Personally, I do see RL playing a significant role in the way we think about robot motion and decision-making. What might set RL apart is that it addresses a key aspect that has restricted service robotics to research labs: it provides a promising alternative for robotics control, with a focus on generality and robustness rather than the much more domain-bounded classical methods.