
Abdul Rahman Dabbour

Months ago, I came across this video of a robot hand rotating a cube (in full 3D) to achieve specific configurations of that cube. This is extremely difficult - the robot has 20 degrees of freedom!

This got me interested in the idea of reinforcement learning, a framework that allows the development of agents for a wide variety of problems. Luckily, this coincided with OpenAI releasing the fantastic Spinning Up in Deep RL guide, written by Josh Achiam, a Ph.D. student at UC Berkeley.

Tinkering with some tools along with some friends, we developed Amca, a Backgammon-playing agent, together with an environment for training and evaluating algorithms. The more I read and tinkered, the clearer it became how important this field will be to robotics (and more) in the future. Grab a cup of coffee and make yourself comfortable before proceeding.

Consider the following model shown in Fig. 1. In this model, we have an agent existing in some environment. This agent observes the environment, and based on that observation performs an action (or does nothing - which can be thought of as an action). This is the most basic type of agent, known as a *simple reflex agent*.

We can model this agent as a function that takes an input (*observation*) \(o_t\) and returns an
output (*action*) \(a_t\). Let's call this function the *policy* \(\pi\). We can now make this
agent 'intelligent' by defining \(n\) rules for the policy to react to observations with. Let's design such a policy for a party:
\[
\begin{equation}
\pi(o_t) = \begin{cases}
\text{smile} & \text{if someone smiles at you} \\
\text{dance} & \text{if the music is good} \\
\hfill \vdots \\
a^{(j)}_t & \text{if } o_t \equiv x\\
\hfill \vdots \\
a^{(n)}_t & \text{for all other } o_t \\
\end{cases}
\end{equation}
\]
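As a minimal sketch, such a rule-based policy is just a chain of hand-written conditions. The observation labels and the catch-all action below are made up for illustration:

```python
def party_policy(observation):
    """A hand-written reflex policy: each rule maps an observation to an action.

    Hypothetical sketch of the n-rule party policy above; observations are
    assumed to be simple string labels for illustration.
    """
    if observation == "someone smiles at you":
        return "smile"
    elif observation == "the music is good":
        return "dance"
    else:
        # the catch-all rule a_t^(n) for all other observations
        return "stand by the snacks"
```

Finite-state machines in robotics are, at heart, this same structure with added bookkeeping for which rule set is currently active.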

Most high-level decision-making in robotics today follows this framework, albeit with more features to make it more robust and using fancy names like finite-state machines. You can probably see how this limits the robot's adaptability to new situations, but it gets the job done for most purposes.

In the 1980s, ideas from the fields of optimal
control and psychology merged into what we
today call reinforcement learning. It
can be summarized (with a pinch of salt) as the addition of an extra *reward* input to the simple
reflex agent, as shown in Fig. 2.

This reward signal adds a subtle, but very powerful, property to the agent. It allows the agent's policy \(\pi_{\boldsymbol{\theta}}\) (now denoted with a subscript \(\boldsymbol{\theta}\) that describes its parameters) to be learnable by informing the agent of the advantages and disadvantages of its action relative to that state.

To understand how RL can be put to use in robotics, let's try to control a mobile robot to move to some goal pose. To summarize one theory behind controlling a mobile robot, the controller assumes:

- the goal to be \(\boldsymbol{g}=[0, 0, 0]^T\), which describes the robot pose in 2D Cartesian coordinates and 1D heading angle in the inertial or world frame,
- the observation to be \(\boldsymbol{o_t}=[\rho, \alpha, \beta]^T\), the robot pose in polar coordinates, and
- the action to be \(\boldsymbol{a_t}=[\dot{\rho}, \dot{\alpha}, \dot{\beta}]^T\), the velocity commanded to the robot in polar coordinates.

RL in the context of robotics is concerned with finding alternatives to control systems, or augmenting them where they fail. If we instead wanted to develop our own controller using RL, we could feed it a lower-level observation and action directly: no transformations to and from the polar representation, no shifting of the frame of reference so that the goal is the origin, and a goal heading can be defined as required. \[ \begin{equation} \boldsymbol{\pi_\theta} \left(\left[\begin{array}{c} x \\ y \\ \phi \\ \dot{x} \\ \dot{y} \\ \dot{\phi} \end{array}\right]\right) = \left[\begin{array}{c} \dot{x} \\ \dot{y} \\ \dot{\phi} \end{array}\right] \hspace{2em} \text{given } \boldsymbol{g} = [g_x, g_y, g_\phi]^T \end{equation} \]

This may not seem like a very important contribution, but we will see how it can be expanded into domains where the input to the controller is nothing more than an image, or where there are no control-based solutions (at least not without major drawbacks) - such as self-driving cars, where decisions need to be made from very complex and widely varying signals and/or images, or legged robots, where developing an accurate model of the system dynamics is extremely difficult. Before we can go any further, though, we must delve into the framework of RL environments and their dynamics.

Given

- some set of states \(S = \{s^{(1)}, s^{(2)}, s^{(3)}, s^{(4)}, s^{(5)}, \dots \} \) and
- some set of actions \(A = \{a^{(1)}, a^{(2)}, a^{(3)}, a^{(4)}, a^{(5)}, \dots \} \):

If our state contains all the relevant information about the entire sequence of states and actions that preceded it, it is a Markov state. If we can model our environment as a set of Markov states, we can then reliably navigate this environment using just our current state, and an action we choose. The sequence of states that we accumulate while moving in this way is called a Markov Process.

Now we can wrap it up by defining a Markov Decision Process: one where the agent can reliably (because of the Markov property) decide (because of our policy - we'll get to that shortly) how to navigate through the environment to achieve its goal. This is the framework upon which modern RL, and especially deep RL, is built.

Formally, a Markov Decision Process for an RL problem is usually a 5-tuple, \(\langle S, A, R, P, \rho_0 \rangle\), where

- \(S\) is the set of all valid states,
- \(A\) is the set of all valid actions,
- \(R : S \times A \times S \to \mathbb{R}\) is the reward function (remember the reward signal we mentioned earlier?), depending on the current state, current action, and future state \(R(s_t, a_t, s_{t+1})\),
- \(P : S \times A \to \mathcal{P}(S)\) is the transition probability function, with \(P(s'|s,a)\) being the probability of transitioning into state \(s'\) if you start in state \(s\) and take action \(a\), and
- \(\rho_0\) is the starting state distribution.
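To make the 5-tuple concrete, here is a toy two-state MDP sketched in Python. The states, actions, transition probabilities, and rewards are all invented for illustration; they are not from any real problem:

```python
import random

# A toy MDP <S, A, R, P, rho_0>; every number here is illustrative.
S = ["s0", "s1"]
A = ["stay", "move"]

def R(s, a, s_next):
    # Reward depends on (s_t, a_t, s_{t+1}); reaching s1 pays 1.
    return 1.0 if s_next == "s1" else 0.0

P = {
    # P[(s, a)] is a probability distribution over next states.
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}
rho_0 = {"s0": 1.0, "s1": 0.0}  # always start in s0

def step(s, a, rng=random):
    """Sample s' ~ P(.|s, a) and return (s', r), as the environment would."""
    dist = P[(s, a)]
    s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R(s, a, s_next)
```

Note that the Markov property is baked in: `step` needs only the current state and action, never the history.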

In the context of learning, the problem becomes one where we would like to learn some optimal policy \(\pi^*\) that will return the optimal action \(a^*_t\) given the current observation \(o_t\) and set of actions \(A_t\). \[ \begin{equation} \pi^{*}(o_t, A_t) = a^*_t \end{equation} \]

Note that \(o_t \not\equiv s_t\); an observation does not (usually) have the Markov property.
The state is the true (usually condensed) representation of the environment, whereas the observation is the
partial information received by the agent about the environment. The exception to this is when the environment is
*fully observable*, such as in chess, checkers, backgammon, etc.

To start defining the RL problem, as seen with the MDP 5-tuple earlier, we must refer to some reward function \(R(s_{t}, a_t,s_{t+1})\) that will send a reward signal to the agent as seen in Fig. 2. The intuition behind the reward function is to, well, reward the agent when it does a good action, and punish the agent when it does a bad action.

To define an optimal policy, we need a measure of policy performance; the most popular such measure is the
*expected reward*. Given some *trajectory* (a sequence of state-action pairs), \(\tau =
\langle s_0, a_0, s_1, a_1, \dots, s_T, a_T \rangle\), we can define the expected reward as:
\[
\begin{equation}
\DeclareMathOperator*{\E}{\mathbb{E}}
J(\pi_\theta)=\E_{\tau\sim \pi_\theta}[{B(\tau)}]=\int_{\tau} P(\tau|\pi_\theta){B(\tau)}
\end{equation}
\label{expected-reward}
\]
where

- \({B(\tau)}=\sum_{t=0}^{T-1}R(s_{t}, a_t,s_{t+1})\) is the sum of the rewards in the trajectory
- \(P(\tau|\pi_\theta) = \rho_0 (s_0) \prod_{t=0}^{T-1} P(s_{t+1} | s_t, a_t) \pi_\theta(a_t | s_t)\) is the probability distribution of the trajectories we will come across while using policy \(\pi_\theta\)

As the name indicates, this is just the expected value of the reward we will accumulate over trajectories
sampled while following the policy \(\pi_\theta\).
Note that we can also have infinite trajectories, and adapt our trajectory-reward function to
\(B(\tau)=\sum_{t=0}^{\infty} \gamma^t r_t\), where \(\gamma \in (0, 1)\) (this is usually called the
*discount factor*).
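For a finite prefix of a trajectory, the discounted return is a one-liner; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """B(tau) = sum_t gamma^t * r_t for a sequence of per-step rewards.

    With gamma < 1, rewards far in the future contribute geometrically
    less, which also keeps the infinite-horizon sum finite for bounded rewards.
    """
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, three rewards of 1 with \(\gamma = 0.5\) give \(1 + 0.5 + 0.25 = 1.75\).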

Now, we can define the RL problem of finding the optimal policy as an optimization problem, where we would like to maximize our expected reward: \[ \begin{equation} \DeclareMathOperator*{\argmax}{arg\,max} \pi^* = \argmax_{\pi_\theta} J(\pi_\theta) \label{optimal-policy} \end{equation} \]

One of the most common algorithms used to solve this is called Q-Learning;
this method maximizes the state-action value, also known as the *Q-value*, of a policy. The Q-value
is simply an approximation of how good the policy is.
\[
\begin{equation}
\DeclareMathOperator*{\E}{\mathbb{E}}
Q^{\pi_\theta}(s,a) = \E_{\tau\sim \pi_\theta}[B(\tau) \mid s_0 = s, a_0 = a]
\end{equation}
\]

Notice here that the Q-value measures the expected value of the return of the policy __after we have
started with state \(s\) and action \(a\)__. The assumption is that if we can maximize the
*Q-function*
seen here, then we can extract the optimal action at every state simply by looking at the
Q-values for each action that can be taken at that state and selecting the action with the maximum
Q-value.
\[
\begin{equation}
\DeclareMathOperator*{\argmax}{arg\,max}
a(s) = \argmax_a Q^{\pi_\theta}(s,a)
\end{equation}
\]
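A minimal tabular sketch of both ideas, assuming the Q-function is stored as a plain dict keyed by (state, action). This illustrates classic Q-Learning, not DQN; the learning rate and discount values are arbitrary defaults:

```python
def greedy_action(Q, s, actions):
    """Extract a(s) = argmax_a Q(s, a) from a dict-based Q-table."""
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

The update nudges each visited Q-value toward the reward observed plus the best value reachable from the next state; repeated over many transitions, the table converges toward the optimal Q-function under mild conditions.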

This is, in fact, the basis of Deep Q-Learning (DQN), introduced in the seminal paper by Mnih et al., that re-launched great interest in the field of RL. DQN, however, differs from classic Q-Learning in some ways.

The first difference is that it uses a deep neural network to represent the Q-function, which allows recent (powerful) advancements in the field of deep learning to be utilized for optimization.

The second major difference is the use of *experience replay*, which allows using the Q-values (and
their state-action pairs) obtained while following several different policies during learning. The
subtle difference is that we attempt to converge to the optimal policy by using what we learned about the
Q-value of the state-action pairs observed using many policies (*off-policy learning*), instead of
just one (*on-policy learning*), which is the case of classic Q-Learning. This is highly significant
because it can reduce training time by a very large factor; imagine a class where each person studies only
one chapter and they are all allowed to share information in an exam.
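The experience-replay idea can be sketched as a fixed-capacity buffer of transitions sampled uniformly at random. This is a minimal illustration; the capacity and interface are illustrative, not the DQN paper's exact design:

```python
import random
from collections import deque

class ReplayBuffer:
    """A minimal experience-replay sketch: transitions collected under many
    past policies are stored together and sampled uniformly, which is what
    makes the learning off-policy."""

    def __init__(self, capacity=10000):
        # deque with maxlen evicts the oldest transitions automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation between
        # consecutive transitions, stabilizing the Q-network updates
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each gradient step then trains on a random batch drawn from this pool rather than only on the most recent transition.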

Note that the loss function, or performance metric, used to learn the Q-function in DQN is not equivalent to (the negative of) the expected reward \(J(\pi_\theta)\) seen above, since the optimal Q-function is not defined by the largest expected reward alone, but by the largest expected reward given an initial state and initial action. In fact, the loss function used in DQN and its optimization is beyond the scope of this preface; suffice it to say gradient descent with an Iron Man suit is used.

Recently, however, there has been great interest in *Policy Optimization*. Here, instead of
optimizing a Q-function, the policy itself is optimized. Following the trend of modern AI, the policy
function is represented as a neural network. Unlike with Q-Learning, it is optimized by using the expected
reward, \(J(\pi_\theta)\), as a performance objective. Note that the choice between policy optimization and
Q-learning is not binary, and as we will see later, they can be integrated with each other.

Usually, to optimize (learn) the policy function, the objective performance function used is also learned
using another neural network that measures the *state value*, which is yet another approximation of
how good the policy is.
\[
\DeclareMathOperator*{\E}{\mathbb{E}}
\begin{equation}
V^{\pi_\theta}(s) = \E_{\tau\sim \pi_\theta}[B(\tau) \mid s_0 = s]
\end{equation}
\]

Note that the state value and the state-action value (Q-value) are two distinct functions. The difference is that in a state-action value function, the initial action is given, whereas in a state value function, the action is extracted from the policy.

This way, algorithms using policy optimization usually progress by optimizing the state value neural network, then using the resulting state value approximation to optimize the policy neural network, then taking an action based on the policy network and the observation, and repeating the process. Keep this in mind; we will see an example of this in the next section.
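That alternating loop can be sketched schematically. Everything here (the environment interface, the two update functions) is a hypothetical stand-in for the neural networks and optimizers a real implementation would use:

```python
def train_actor_critic(env, policy, value_fn, num_steps,
                       update_value, update_policy):
    """Schematic of the alternating loop described above.

    Each iteration: (1) fit the value approximator on the observed
    transition (critic step), (2) improve the policy using the value
    estimate (actor step), (3) act and observe again.
    All arguments are illustrative stand-ins, not a real library API.
    """
    obs = env.reset()
    for _ in range(num_steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        update_value(value_fn, obs, reward, next_obs, done)  # critic step
        update_policy(policy, value_fn, obs, action)         # actor step
        obs = env.reset() if done else next_obs
    return policy
```

The point of the sketch is only the ordering: value update, then policy update, then action, repeated.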

RL-developed agents have become state-of-the-art in several domains, especially very complicated games such as Backgammon, Chess, Go, Dota, and StarCraft. The stochasticity, enormous size, and abstractness of the state and action spaces, along with the need for real-time responses, make RL a great candidate for a solution. Note that it is not the only alternative, as shown by Libratus, an agent based on good old-fashioned game theory that beat the top human players in Texas Hold 'em, a game of comparable complexity to the ones mentioned earlier.

That being said, research has been going on since the 1990s on ways of applying RL to robotics problems, and there has been some very interesting recent work done to this end. To get a solid hold of how RL relates to robotics and perhaps a peek into the future of robotics, we will take a look at two very recent papers that use RL for robotics problems.

Let's think about how we can train an RL agent to control a robot. Would you let the RL agent train on the robot itself? Probably not; robots are expensive and the agent may attempt to execute very strange, or dangerous, actions which might harm the robot or humans around it. Usually, the preferred method is to fire up a simulation of the robot and train an RL agent to control the robot until it seems robust enough to control the real robot, then continue the training on the robot.

The problem is that the RL agent developed is, often, __too good__! It exploits specific properties
of the simulation and develops a policy that would be undesirable, weird, or impossible in reality.

In early 2019, ETH Zurich and Intel
published a paper that attempted to attack this problem in an
engineering fashion. The researchers attempt to learn a controller for an ANYmal, a popular quadruped
robot research platform. As expected, they note the difficulty of training such an agent in simulation
due to the complexity of rigid-body dynamics
of the robot interacting with its environment, as well as the robot's *series-elastic actuators*
being
difficult to describe to a physics simulation.

For the actuators, the authors use supervised machine learning to develop a model of the actuators. More specifically, they trained a 3-layer deep neural network to predict the torque output given the three previous position errors (actual - commanded) and velocities. They use this model in their simulation to get a more accurate actuator response while training. They also note that the model generated this way outperformed a separate model that does not employ learning but assumes ideal properties.

To account for the rigid-body dynamics complexity, the authors train their policy on several models of
the robot, with each model having different inertial properties. This ensures that the policy developed
does not exploit a modeling error; this is a form of *domain randomization*, a popular way of
mitigating these so-called *sim-to-real* issues.
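A toy sketch of domain randomization along these lines: each training episode samples slightly different rigid-body parameters, so the policy cannot overfit to one particular model. The ±15% range and the parameter names are illustrative, not taken from the paper:

```python
import random

def randomized_dynamics(nominal_mass, nominal_inertia, rng=random, scale=0.15):
    """Sample one set of perturbed rigid-body parameters for an episode.

    Illustrative domain-randomization sketch: each parameter is scaled by
    a uniform factor in [1 - scale, 1 + scale] around its nominal value.
    """
    jitter = lambda x: x * rng.uniform(1.0 - scale, 1.0 + scale)
    return {"mass": jitter(nominal_mass), "inertia": jitter(nominal_inertia)}
```

A policy that performs well across all sampled variants is more likely to survive the modeling errors of any single simulator.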

Finally, they use *Trust Region Policy Optimization* (TRPO), a policy gradient algorithm, for their
training. As its observation, the policy takes in current and previous joint states (they stress this), along
with linear and angular velocities, and outputs joint position commands.

The results are quite impressive: one policy developed enables the robot to run faster than ever before, and another enables it to recover from falling. An overview of the work done, as well as some results, can be seen below:

In late 2017, Berkeley Artificial Intelligence Research (BAIR)
researchers showed the utility of using the *maximum entropy* framework to define an RL problem, and
then twice in 2018, they expanded on this by describing the Soft
Actor-Critic (SAC), an RL algorithm family specifically designed for robotics problems.

To begin with, SAC is a model-free (a model being a function that can predict state transitions or rewards)
RL algorithm family developed with robotics applications in mind. It is specifically designed to be robust,
to allow data sampling and training to be run on separate threads (*asynchronous sampling*) and to
allow handling of interruptions in the data stream.

Most importantly, SAC uses the maximum entropy RL framework, which augments the objective function defined in Eqn. \(\ref{expected-reward}\) to include an entropy term: \[ \DeclareMathOperator*{\E}{\mathbb{E}} \begin{equation} J(\pi_\theta) = \E_{ \tau \sim \pi_\theta} \left[B(\tau) + \alpha \mathcal{H}(\tau)\right] \end{equation} \label{maximum-entropy-objective-function} \]

where \(\mathcal{H}(\tau)\) is the total entropy of that trajectory, \(\sum_{t}\mathcal{H}(\pi_\theta(\cdot \mid s_t))\), and \(\alpha\), called the temperature, is a hyperparameter that determines the relative importance of the entropy to the reward. SAC also presents an automatic way of adjusting the temperature \(\alpha\) for each time-step using dynamic programming.
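For a sampled trajectory with discrete action distributions, the entropy-augmented objective can be computed directly; a sketch (the default temperature is an arbitrary illustrative value, not from the SAC papers):

```python
import math

def entropy(probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def soft_return(rewards, action_dists, alpha=0.2):
    """B(tau) + alpha * sum_t H(pi(.|s_t)): the maximum-entropy objective
    evaluated on one sampled trajectory.

    rewards: per-step rewards; action_dists: per-step action probability
    vectors under the current policy; alpha: the temperature.
    """
    return sum(rewards) + alpha * sum(entropy(d) for d in action_dists)
```

A deterministic policy (all probability on one action) contributes zero entropy, so under this objective the agent is paid extra for remaining stochastic.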

By adding an entropy term, the policy learned will be robust to perturbations. Think of a legged robot
running: the information received comes from noisy sensors, and the actions are sent to noisy actuators,
both affected by stochastic physical processes, so pushing the agent to develop a policy for more
chaotic (read: higher-entropy) signal streams is essential. Another way to see this is that by maximizing
entropy, the agent maximizes *exploration*, and by maximizing reward, the agent maximizes
*exploitation*.

SAC also uses an *actor-critic* framework, meaning it deploys a policy approximator, called the
actor, and a soft Q-function (the softmax of the Q-function; beyond the scope of this preface) approximator,
called the critic, simultaneously. Recall what we earlier discussed at the end of the RL section! At each
time-step, the algorithm alternates between updating the Q-function (critic), then using it to update the
policy (actor), then repeating.

Crucially, SAC utilizes a memory pool. This allows sampling state-action pairs from multiple policies when calculating the objective function for the soft Q-function approximator. This also allows for sampling multiple states from multiple policies when calculating the objective function for the policy approximator. This results in much greater sample efficiency and is similar to the 'trick' earlier discussed with DQN.

Artificial intelligence, and especially machine learning, has seen a significant rise in effectiveness and applicability in the past decade. There are many reasons for this, but we can safely assume the end of the string of successes will not be seen in the short term. That being said, these successes have been massively hyped - beyond anything that has been seen for at least a decade.

Therefore, it is best to be cautiously optimistic about these new techniques. It is far too soon to judge the utility of these methods, how fast they can be adopted, how far they can be adopted, or if they will be adopted at all. After all, the robotics world is notorious for being very promising and hugely disappointing - mostly due to the economic inviability of many robotics-related ideas.

Personally, I do see RL playing a significant role in the way we think about robot motion and decision-making. What might set RL apart is that it addresses a key limitation that has restricted service robotics to research labs: it offers a genuinely promising alternative for robot control, with a focus on generality and robustness rather than the much more domain-bound classical methods.