AI is becoming increasingly prevalent in aspects of daily life relating to both security and comfort. Machine learning models are being used in surveillance, facial recognition, medical image analysis, banking, quality control and autonomous vehicles, to name just a few. In many of these safety-critical applications, the consequences of failure are severe. It is therefore crucial to verify that these AI systems are robust.
Software developers have spent years cultivating best practices for ensuring the robustness of their programs; thorough testing is vital to the success of any software project. Unfortunately, these practices are not always suitable for machine learning systems, which are inherently tricky to test due to their scale and lack of structure.
Machine learning algorithms are usually evaluated by testing on a previously unseen dataset, assessing the average-case performance of the model. But it is also essential to test the 'rainy day' scenarios to ensure the model is robust – that it has acceptable performance in the worst case. This is particularly important in high-stakes applications such as autonomous vehicles. Would you trust a self-driving car that is easily tricked into misreading a speed limit?
Vulnerabilities in AI
It is well known that machine learning systems are not robust: they can easily be fooled into giving an incorrect prediction just by making an imperceptibly small modification to the input. The addition of this kind of apparent noise to an input, designed to trick a machine learning system, is known as an adversarial attack. The series of transformations performed in a neural network is very sensitive to these subtle changes, amplifying their effects. These attacks are highly effective and easy to design, leading to security threats in real-world applications of AI.
Imagine printing a cheque that can fool the AI into cashing it for many times its value. Or imagine the consequences of a sticker placed on a 20-mph traffic sign that fools autonomous cars into interpreting the speed limit as 50 mph.
The real world has an inherent randomness and we could equally imagine an 'accidental' adversarial attack in nature: dirt, graffiti or bad weather obscuring a traffic sign. We need to have confidence in the ability of these systems to handle these rare failure cases. The difficulty is that AI learns by example, and it is difficult – and impractical – to train these systems for every possible scenario.
These adversarial attacks need to change the input only a small amount to fool the AI. A popular defence against them is adversarial training, which attempts to improve the generalization of the model by generating adversarial examples during training. For example, we take a 50-mph sign, modify it so that the model mistakes it for a 70-mph sign, and then train the model to classify it as a 50-mph sign anyway.
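As a concrete illustration of how such adversarial examples can be generated, the classic fast gradient sign method (FGSM) nudges each input feature by a small step ε in the direction that most increases the loss. The sketch below applies it to a toy logistic-regression classifier; the weights, input and ε are illustrative placeholders, not the traffic sign model described in this article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_attack(x, y, w, b, eps):
    """Perturb x by eps per feature in the direction that increases the loss."""
    p = sigmoid(w @ x + b)        # predicted probability of class 1
    grad_x = (p - y) * w          # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)

# A toy point correctly classified as class 1 by the linear model:
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.3, 0.1])
y = 1.0

# The attack changes each feature by at most eps, yet flips the prediction:
x_adv = fgsm_attack(x, y, w, b, eps=0.3)
```

Even though every coordinate moves by at most 0.3, the attacked point crosses the decision boundary – the same effect that, at image scale, turns an imperceptible pixel perturbation into a misclassified traffic sign.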
A model trained using this method will be able to defend against the types of attack it is trained on, but may still be vulnerable to other, stronger attacks. It is a game of adversarial cat and mouse: each defence closes some vulnerabilities but leaves others open. We need a way to guarantee that the model is robust to variations in its inputs without having to generate every possible variation of each image for training.
Specifications for AI
A machine learning model is robust if small changes to the input do not change the output. One approach for ensuring robustness is to evaluate whether a system is consistent with a list of specifications describing the intended functionality.
Mathematically, a specification is a relationship that must hold between the inputs and outputs of a system. In the case of image classification – traffic sign recognition, for instance – the specification would be that small changes to the input do not change the output away from the correct classification. Given an image of a 20-mph sign, the network should classify it as a 20-mph sign regardless of small modifications. We can train a model to satisfy this specification; by doing so it will be verifiably robust, guaranteeing that no adversarial attack can succeed within some defined bounds.
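Written formally (the notation here is mine, not from the original), an ℓ∞-robustness specification for a classifier f around an input x with correct label y states:

```latex
\forall\, x' \;:\; \|x' - x\|_{\infty} \le \epsilon
\;\Longrightarrow\;
\operatorname*{arg\,max}_{k} f_k(x') \;=\; y
```

That is, every image inside the ε-box around x – every image in which each pixel is changed by at most ε – must still receive the correct label y.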
Interval bound propagation (IBP) is an example of an algorithm that trains a model to be consistent with a specification. DeepMind have used it successfully to classify handwritten digits and simple images. IBP balances two requirements in training a model: first to fit the data (for accuracy), and second to ensure that no small change to the input will cause misclassification (for robustness).
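One common way to express this balance (notation mine; κ is a weighting hyperparameter between 0 and 1) is a training loss that mixes the usual loss on the clean input with the loss on the worst-case outputs implied by the propagated bounds:

```latex
L \;=\; \kappa\,\ell\big(z(x),\, y\big)
\;+\; (1 - \kappa)\,\ell\big(\hat{z}(x, \epsilon),\, y\big)
```

where ℓ is the cross-entropy loss, z(x) are the logits on the unperturbed input, and ẑ(x, ε) are the worst-case logits over the ε-bounded input region.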
IBP works by creating a bounding box around the set of possible adversarial changes as they propagate through the neural network. This makes it easy to compute the upper bound on the worst-case violation of the specification to be verified. By specifically training models to be verifiable, this method produces models that are inherently easier to verify, much like using test-driven development encourages more testable code.
I used IBP in my Cortex project to build a verifiably robust traffic sign classifier as a practical example to investigate robustness in AI. I trained two networks using IBP: one optimised for small variations and one for large variations. I also built a conventional convolutional neural network – the standard solution for image classification tasks and therefore a good baseline for comparison. All networks were trained on the German Traffic Sign Recognition Benchmark, a dataset of around 40,000 labelled images of 43 different types of traffic sign.
To evaluate the performance of these systems, we can calculate different types of accuracy: the fraction of correctly classified examples in different test sets.
Both the baseline network and the IBP networks achieve similar nominal accuracies of over 90%. Because the IBP networks balance accuracy with robustness, their nominal accuracy is slightly lower than the baseline's. However, the baseline network has little resistance to a strong attack, unlike the IBP networks. As well as effectiveness against adversaries, IBP allows us to determine the verified accuracy: the fraction of examples for which we can guarantee that no possible modification (of a given size) would cause misclassification. This gives a provable lower bound on the model's true robust accuracy.
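The verification condition behind this figure is simple to state: an example counts as verified if the lower bound on the true class's logit exceeds the upper bound of every other class's logit, so no input inside the ε-box can change the winner. A sketch, with the bound values made up purely for illustration:

```python
import numpy as np

def is_verified(logit_lo, logit_hi, true_class):
    """True if the true class provably wins over every bounded alternative."""
    others = [logit_hi[k] for k in range(len(logit_hi)) if k != true_class]
    return bool(logit_lo[true_class] > max(others))

# Hypothetical interval bounds on three class logits (e.g. from IBP):
lo = np.array([2.0, -1.0, 0.5])   # lower bound of each logit
hi = np.array([3.0,  0.0, 1.5])   # upper bound of each logit
```

Here class 0 is verified, since even its worst-case logit (2.0) beats the best case of every rival (at most 1.5). Verified accuracy is then just the fraction of test examples for which this check passes.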
Classifying the traffic sign images is a difficult problem: there are many classes and limited data. Moreover, defending against adversarial examples is challenging, as it requires the AI to produce good outputs for every possible input. This suggests that training a robust classifier demands even more data than the already large datasets required to attain good accuracy.
It is impossible to train a classifier on every possible example but training it to be robust provides some confidence that it will be able to handle unexpected scenarios. By testing the performance of the system when we are intentionally trying to cause it to misbehave, we are simultaneously testing the performance in the worst-case scenarios and rare failure cases.
Robustness in AI is important for three key reasons:
- Defending against adversaries: addressing security concerns and intended harm to the system.
- Rare failure cases: testing the performance in the worst-case scenarios provides evidence that the system will not misbehave due to any unforeseen randomness in real-world environments.
- Measuring progress of AI towards human-level abilities: adversarial attacks demonstrate dramatic differences in decision making between humans and AI. Understanding these differences is crucial to our trust in AI.
When building AI systems, robustness is an important and useful tool in our toolkit. It allows us to verify and validate that our models are behaving in the intended way. Thorough testing of AI is just as important as testing of software, and development of best practices is key to our trust in AI.