Optimization is everywhere in R&D and Engineering – whether that be to find the most efficient aircraft wing design or the optimal conditions for industrial manufacturing. An area I have worked in extensively is industrial chemistry, where scientists try to identify the conditions that achieve the highest yield and quality of the final product of a chemical reaction.

Typically, the input-output relationship of these types of complex problems is unclear: we know what the inputs are (in the case of industrial chemical processes these may be temperature, concentrations, flow rates and other external parameters), but we cannot write down what the outputs will be (yield, quality metrics etc.). These kinds of challenges can be thought of as black-box optimization problems because we cannot see the relationships between inputs and outputs. As a result, we cannot use gradients in the optimization: we only know the output for each input we try, not how the output changes around that point. This immediately limits the techniques which can be applied to these problems.

In many real-world problems, the situation is even more challenging because it is expensive to acquire the output for a new set of inputs. Consider an example where the output requires a destructive test to be performed. Each test is expensive, so any optimisation technique needs to use as few samples as possible, ruling out Monte Carlo techniques or evolutionary algorithms.
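As a rough illustration of this setting (not code from the project), the sketch below wraps a function so that the optimizer can only request outputs, and only a fixed number of times. The class `BudgetedBlackBox`, the function `hidden_yield` and the two input names are hypothetical and exist purely for the example.

```python
import math

class BudgetedBlackBox:
    """Expose only input -> output queries, and only `budget` of them."""

    def __init__(self, func, budget):
        self._func = func      # hidden from the optimizer
        self.budget = budget   # maximum number of "experiments"
        self.history = []      # (inputs, output) pairs observed so far

    def evaluate(self, x):
        if len(self.history) >= self.budget:
            raise RuntimeError("experiment budget exhausted")
        y = self._func(x)
        self.history.append((tuple(x), y))
        return y

    def best(self):
        """Return the (inputs, output) pair with the highest output so far."""
        return max(self.history, key=lambda pair: pair[1])

def hidden_yield(x):
    # Hypothetical stand-in for an unknown process; in a real problem
    # this would be a physical experiment, not a formula.
    temperature, concentration = x
    return math.exp(-((temperature - 0.6) ** 2 + (concentration - 0.3) ** 2) / 0.1)

oracle = BudgetedBlackBox(hidden_yield, budget=3)
for point in [(0.1, 0.1), (0.5, 0.5), (0.6, 0.3)]:
    oracle.evaluate(point)
```

Any optimizer written against `evaluate` has no access to gradients and fails once the budget is spent, which is the constraint that makes sample-hungry methods impractical here.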

Bayesian Optimization is a strategy for solving black-box optimization problems that has had many successes in the past few years. One can find applications in drug development, material design and chemistry, but also in automated machine learning toolboxes for hyperparameter tuning. Bayesian Optimization incorporates some prior belief about the form of the problem, and this requires human insight into the nature of the challenge.

A natural question is whether there is an approach that can learn how to optimise black-box problems better based on previous experience tackling similar challenges. It turns out that this is exactly what recent advances in Meta Reinforcement Learning seek to do. I wanted to find out whether Meta Reinforcement Learning can offer benefits in R&D and Engineering problems and had the chance to spend three months studying the area as part of a project within Tessella’s Cortex programme. My work was inspired by a number of research papers in the area.

“We should build in only meta-methods that can find and capture the complexity of the world. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.”

The Bitter Lesson – Rich Sutton

Reinforcement learning is an area of machine learning concerned with how to automate intelligent decision making to maximize rewards over time for a given task. Over the past few years, Reinforcement Learning techniques have started to see some real-world applications; AstraZeneca used it in the area of molecular design, Amazon applied Reinforcement Learning techniques for warehouse and logistics optimization and are also experimenting with Reinforcement Learning for their delivery drone fleet.

A limitation for the use of RL techniques for real-world applications is that training a Reinforcement Learning algorithm often requires a vast amount of data, usually from simulations. As mentioned above, for the types of problems we are trying to tackle, acquiring lots of data is prohibitively expensive. Moreover, if the underlying mechanism of the problem is not well understood it’s simply impossible to craft a simulator to train an RL agent anyway.

Even if a reinforcement learning agent has been trained to find a good policy for an optimization problem, there is no guarantee that this policy can be applied to new problems, even if they are similar in nature. So, a lot more training will be needed - possibly from scratch - whenever, due to updated business requirements, the optimization problem changes.

Humans are good Reinforcement Learning agents because they can use prior knowledge of the world and hard-gained experience to quickly adapt to new tasks. Meta Reinforcement Learning tries to mimic how humans learn. Practically, a meta-agent will be shown several training tasks that are somehow similar to the actual task we are trying to solve. This way, instead of us injecting the model with our own prior knowledge about the task, we will enable the meta-agent to build this prior knowledge via exposure to other training tasks. The agent will then be able to quickly adapt its policy and solve the new task in only a small number of steps.

*Figure 1: Learning on lots of known chemical reactions which can be simulated and applying to a new unknown one that cannot be simulated and where it is expensive to do experimental runs.*

## My Cortex project

Meta-RL seemed a very interesting approach to apply to black-box optimisation challenges, with the potential to improve on existing methods. However, most of the current research in this area focuses on solving discrete-state tasks like the multi-armed bandit problem or navigating a maze. I wanted to understand whether Meta-RL could solve black-box optimization challenges more efficiently.

### Optimization Problems

Given that this project was just three months long and that I did not have access to specialised equipment (for example, a chemical reactor to run actual experiments), I decided to create some synthetic optimization examples which I could use in place of experimental data. Crucially, I would solve my synthetic examples using Meta-RL as though they were real-world black-box challenges.

I made sure that the surfaces I created were representative of optimization surfaces for an actual physical system and put them to the test. In Figure 2, the plot on the left shows a surface that is similar to surfaces we see in the area of industrial chemistry. The middle surface shows periodic behaviour which is typical of multi-reflector systems with applications in satellites. The plot on the right shows a surface with big spikes which is close to zero everywhere else. This type of behaviour is common in dynamical systems where a resonance effect is observed.
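The exact surfaces used in the project are not given here, so the following is only a sketch of what three such families could look like on the unit square. The functional forms (a Gaussian bump, a product of sinusoids, a Lorentzian spike), the parameter ranges and all function names are assumptions chosen to match the verbal description above.

```python
import math
import random

def smooth_surface(rng):
    """Single broad Gaussian bump with a randomly placed peak."""
    cx, cy = rng.uniform(0.2, 0.8), rng.uniform(0.2, 0.8)
    width = rng.uniform(0.05, 0.2)
    return lambda x, y: math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / width)

def periodic_surface(rng):
    """Product of sinusoids, rescaled so that values lie in [0, 1]."""
    freq = rng.choice([2, 3, 4])
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return lambda x, y: 0.5 * (
        1.0
        + math.sin(2.0 * math.pi * freq * x + phase)
        * math.cos(2.0 * math.pi * freq * y)
    )

def spiky_surface(rng):
    """Narrow Lorentzian spike that is close to zero away from its centre."""
    cx, cy = rng.uniform(0.2, 0.8), rng.uniform(0.2, 0.8)
    return lambda x, y: 1.0 / (1.0 + 1000.0 * ((x - cx) ** 2 + (y - cy) ** 2))

FAMILIES = {
    "smooth": smooth_surface,
    "periodic": periodic_surface,
    "spiky": spiky_surface,
}

def sample_surface(family, seed):
    """Draw one random surface from the named family, reproducibly."""
    return FAMILIES[family](random.Random(seed))
```

Calling `sample_surface("smooth", seed)` with different seeds yields a collection of related but distinct problems, which is the kind of task distribution a meta-agent needs to train on.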

### Approach

So how do we train a meta-RL agent? There are three key ingredients involved:

- **A collection of related optimization problems.** Showing the agent a set of problems that are similar to the problem we are trying to solve helps it build the prior knowledge described above. In Figure 3, I show a collection of optimization surfaces used to train a meta-agent to optimise the surface on the right.
- **A model with memory.** For my project, I used an LSTM network. An LSTM maintains a hidden state that incorporates the information from all the (action, state, reward) triplets it encounters.
- **A meta-learning algorithm.** This refers to the way we update the policy, or in our case the weights of the network. For my project, I used gradient descent.

The training process of a meta-RL agent consists of two nested loops. The outer loop samples a new optimization problem and adjusts the network weights which determine the meta-policy. The inner loop is essentially the standard Reinforcement Learning training, where for a fixed optimization problem the agent interacts with the environment, observes the state and the associated reward and updates its weights.
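The nested-loop structure can be sketched with a deliberately tiny stand-in. This is not the LSTM agent from the project: the "policy" is a hill-climbing rule with a single meta-parameter (the starting point), the inner loop adapts a position rather than network weights, and the outer update is a Reptile-style move of the starting point toward where the inner loop finished, swapped in for gradient descent on network weights. The task family and all names (`sample_task`, `inner_loop`, `meta_train`, `average_reward`) are hypothetical.

```python
import random

def sample_task(rng):
    """Hypothetical task family: 1-D quadratic rewards peaking in [0.6, 0.8]."""
    peak = rng.uniform(0.6, 0.8)
    return lambda x: -(x - peak) ** 2

def inner_loop(task, start, n_experiments=5, step=0.05):
    """Fixed task: query the black box and keep moves that improve the reward."""
    x, best = start, task(start)
    direction = 1.0
    for _ in range(n_experiments - 1):
        candidate = x + direction * step
        reward = task(candidate)
        if reward > best:
            x, best = candidate, reward
        else:
            direction = -direction  # the last move hurt, so try the other way
    return x, best

def meta_train(n_tasks=200, meta_lr=0.1, seed=0):
    """Outer loop: sample a new task, run the inner loop, update the meta-parameter."""
    rng = random.Random(seed)
    start = 0.0  # meta-parameter shared across tasks
    for _ in range(n_tasks):
        task = sample_task(rng)
        x_final, _ = inner_loop(task, start)
        start += meta_lr * (x_final - start)  # Reptile-style update
    return start

def average_reward(start, n_tasks=50, seed=1):
    """Mean best reward over held-out tasks for a given starting point."""
    rng = random.Random(seed)
    total = sum(inner_loop(sample_task(rng), start)[1] for _ in range(n_tasks))
    return total / n_tasks

trained_start = meta_train()
```

Comparing `average_reward(trained_start)` with `average_reward(0.0)` shows the point of the outer loop: with the same five experiments per task, the meta-trained starting point lands much closer to the peak on tasks it has never seen.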

### Results

For my project, I tried to identify the location of the maximum for synthetic surfaces like those in Figure 2. In Figure 4, I show the average performance of the Meta-RL method and compare it against a Bayesian Optimization technique (Kriging) and the Nelder-Mead simplex method, a derivative-free direct search. In all cases, the distance from the maximum decreases with the number of experiments we run.

For the smooth surfaces (left), the performance of Meta-RL is comparable to that of Kriging and, in fact, Meta-RL outperforms it for a small number of experiments: Kriging needed 10 experiments to get within 10% of the true maximum, whereas Meta-RL needed only 5. For the surfaces with periodic behaviour (middle), Meta-RL got to within 7% of the true maximum after 50 experiments, while Kriging got to within 10%. The final example (right) shows the performance on random surfaces with big spikes at random locations. Here Meta-RL cannot handle the problem successfully, as there is not enough structure for the meta-agent to learn.
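The percentages above are not defined precisely in this post. One plausible reading, assumed here, is the gap between the best value observed so far and the true maximum, expressed as a percentage of the true maximum. Under that assumption, a convergence curve and an "experiments needed" figure could be computed as follows; both function names and the sample values are hypothetical.

```python
def percent_gap_curve(observed_values, true_maximum):
    """Gap between best-so-far and the true maximum, in percent, after each experiment."""
    if true_maximum <= 0:
        raise ValueError("this definition assumes a positive true maximum")
    curve = []
    best = float("-inf")
    for value in observed_values:
        best = max(best, value)
        curve.append(100.0 * (true_maximum - best) / true_maximum)
    return curve

def experiments_to_reach(curve, threshold_percent):
    """Number of experiments until the gap first drops to the threshold, or None."""
    for count, gap in enumerate(curve, start=1):
        if gap <= threshold_percent:
            return count
    return None

# Made-up outputs from four experiments on a surface whose maximum is 1.0.
curve = percent_gap_curve([0.25, 0.5, 0.5, 0.9375], true_maximum=1.0)
needed = experiments_to_reach(curve, threshold_percent=10.0)
```

Averaging such curves over many sampled surfaces would give one line per method, comparable to the plots described for Figure 4. If the original percentages instead measure distance in input space, the same structure applies with a different gap formula.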

## Conclusion

Although Meta Reinforcement Learning is still in its infancy, I managed to show that it has real potential, especially in solving problems where there is a clear structure in the optimization surface that a meta-agent can learn and exploit. I expect the technique to find applications in the design of any real-life experiments where finding the optimum in a small number of experiments is critical.