You know the scene. In a movie or TV show, the detective is reviewing security footage and the tech guys enhance it on the computer. What was once impossible to discern is now crystal clear, and the villain’s face is revealed. Technology that used to be pure fiction is becoming a reality. This is super-resolution at work.
Super-resolution is a field of AI that has seen a lot of recent advancement, thanks largely to progress in deep learning and, in particular, the Generative Adversarial Network (GAN). You've probably heard of GANs by now, but here's a quick refresher: a GAN is a pair of neural networks set up to work against each other! The Generator tries to come up with a plausible answer to fool the Discriminator, and the Discriminator attempts to tell whether the answers are real or fake. As the model trains, the Generator gets better at fooling the Discriminator and the Discriminator gets better at telling real and fake images apart, eventually leading to a model that can generate highly realistic images.
Unfortunately, GANs have their nefarious uses too – they’re behind some of the recent ‘deepfakes’ that have been circulating on the internet. But research into GANs is still in its infancy, and in my project, I looked at ways in which GANs can be used for super-resolution with no intent to deceive. In particular, I looked at satellite images (or "remote-sensing images") of densely packed vehicles and enhanced them so that the vehicles can be counted more accurately.
Why Did I Choose Super-Resolution?
There has been a lot of research into super-resolution (SR), and most of it treats SR as a standalone task. Success is typically assessed by some measure of the "quality" of the output image. But this leaves a lot of untapped potential for the technology – image processing is a vast field, and the potential for SR to improve the results of other image processing tasks is huge. The other general use for SR is to get the same results from low-resolution images as you could from high-resolution images, thereby reducing the cost of image acquisition.
What makes my project different from most super-resolution research is that, rather than treat it as a standalone task, I optimized it for the task of counting objects in satellite images. By focusing on object counting, I was able to get much more quantifiable (less hand-wavey!) results and use SR as part of a more directly useful and well-defined application than just making stuff look "better".
So, I had two deep neural networks to work with: an SR network and an object counting network. Figure 1 shows the high-level architecture for the project. It's not obvious that applying SR to an image followed by object counting will give better results than just object counting directly, but some previous work, such as Text SR, had given me some hope that it could work. (Spoiler: it did work!)
Figure 1: High-level architecture of task-specific SR
How Did Super-Resolution Achieve This?
Let's take a step back – even with the fanciest of models, how can we fill in finer details in a low-resolution image? Surely the information is just not there? Well, it's to do with model training. A model with the right architecture can learn which high-resolution images are realistic by seeing enough examples of high-resolution images paired with low-resolution images.
For some applications, this will not be enough (e.g. criminal mugshots), but if the details don't have to exactly match the truth, a realistic-looking image may be good enough – and GANs are very good at producing realistic-looking images.
How Good is Good Enough?
The SR literature lacks consensus on how to quantitatively assess the results. Some commonly used metrics include PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity) and FID (Fréchet Inception Distance). It has been pointed out (e.g. in the SRGAN paper) that PSNR and SSIM do not correlate well with human-perceived quality, but there is a bigger issue with all three: they tell you nothing about how useful the output image is, in terms of what you're planning to do with it next. In my case, I want to use the image to count the number of vehicles as viewed from the satellite. This makes the benefit of the SR network directly quantifiable: how accurate does it make the predicted number of objects?
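To make the point about image-quality metrics concrete, here is a minimal sketch of PSNR, one of the metrics mentioned above. It assumes 8-bit images passed as flat sequences of pixel intensities; the function name and the `max_value` parameter are illustrative choices, not taken from the project.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak Signal-to-Noise Ratio (in dB) between two equally sized images.

    Both images are flat sequences of pixel intensities. max_value is the
    largest possible intensity (255 for 8-bit images, an assumption here).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Note that the score depends only on pixel-wise error against the reference image. Nothing in it reflects whether the vehicles in the output are any easier to count.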
I compared the GAN-based solution against bicubic interpolation – this is a classical upscaling technique, which does not use machine learning, and is commonly used as a baseline for comparison in SR research.
Observing the Satellite Image Data
I started with a small set of high-resolution (HR) images that contained densely packed large vehicles, which I augmented using methods that preserved the spatial resolution of the images. The spatial resolution was around 30 to 50 cm. While it is generally cost-effective to obtain satellite imagery at this resolution for small areas (such as a supermarket car park), it can be prohibitively expensive to get for large areas (such as a country). For large-scale geospatial analysis, typically, only low-resolution images will be available.
I downscaled the HR images by 8× using bicubic interpolation (n.b. this technique can be used for downscaling or upscaling) to produce pairs of high- and low-resolution (HR and LR) images for training. The effective spatial resolution of the LR images was around 2.4 to 4 m, which is representative of low-cost satellite imagery.
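The effective resolution of the LR images follows directly from the scale factor: each LR pixel covers 8× the ground distance of an HR pixel along each axis. A quick sketch of the arithmetic (the function name is hypothetical):

```python
def downscaled_resolution_cm(hr_resolution_cm, scale_factor):
    """Ground distance covered by one pixel after downscaling.

    hr_resolution_cm: ground distance per pixel in the HR image, in cm.
    scale_factor: linear downscaling factor applied to each axis.
    """
    return hr_resolution_cm * scale_factor


# The HR images were at roughly 30 to 50 cm per pixel, downscaled by 8x.
for hr_cm in (30, 50):
    lr_cm = downscaled_resolution_cm(hr_cm, 8)
    print(f"{hr_cm} cm/pixel -> {lr_cm / 100:.1f} m/pixel")
```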
You can see a downscaled LR image on the left-hand side of Figure 1. At this resolution, it becomes much harder to count the vehicles on show.
Utilizing the Super-Resolution GAN
ESRGAN (Enhanced Super-Resolution Generative Adversarial Network) is a network with a well-established architecture for performing super-resolution, which can enhance the resolution of an image by 4× (on each axis, so a total of 16× more pixels). I used this as a starting point and modified the Generator to use a deeper architecture to get 8× super-resolution (a total of 64× more pixels).
The Object Counting Model
As SR was the focus of the project, I chose an existing, cutting-edge counting model (ASPD-net) and made some improvements to model selection. This is a deep convolutional neural network, similar to crowd density estimation methods such as CrowdNet, which predicts the object density at each pixel in the image. The output of the counting model is a density map, and summing its pixel values gives the estimated object count.
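The last step, going from a density map to a count, is simple enough to sketch. The toy map below is made up for illustration: each vehicle's density is spread over a few pixels and sums to 1.

```python
def count_from_density_map(density_map):
    """Estimate the object count by summing every pixel of a 2D density map."""
    return sum(sum(row) for row in density_map)


# Hypothetical 4x4 density map containing two vehicles.
toy_density_map = [
    [0.0, 0.25, 0.25, 0.0],
    [0.0, 0.25, 0.25, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
]

print(count_from_density_map(toy_density_map))  # 2.0
```

Because the count is a sum rather than a tally of detections, it can be fractional, and it stays meaningful even when vehicles are packed too tightly to separate individually.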
While I got reasonably good results using ESRGAN out of the box, the results were improved by doing task-specific optimization – i.e., training the model using the counting loss obtained via the object counting model, rather than losses related to human-perceived quality or pixel-wise error.
The architecture of this approach is shown in Figure 2. The Generator is trained using a combination of the counting loss and the adversarial loss from the Discriminator.
Figure 2: Architecture of task-specific super-resolution GAN
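As a rough illustration of how the two losses might be combined for the Generator, here is a minimal sketch. The specific forms are assumptions made for the example, not the project's actual implementation: the counting loss is taken to be the mean squared error between density maps, the adversarial term is the standard non-saturating generator loss (ESRGAN itself uses a relativistic variant), and `adv_weight` is a hypothetical weighting parameter.

```python
import math


def counting_loss(pred_density, true_density):
    """Assumed form: mean squared error between two 2D density maps."""
    squared_errors = [
        (p - t) ** 2
        for pred_row, true_row in zip(pred_density, true_density)
        for p, t in zip(pred_row, true_row)
    ]
    return sum(squared_errors) / len(squared_errors)


def adversarial_loss(disc_prob_real):
    """Non-saturating generator loss, -log D(G(x)).

    disc_prob_real is the Discriminator's probability that the generated
    image is real. The loss is small when the Discriminator is fooled.
    """
    eps = 1e-12  # guards against log(0)
    return -math.log(max(disc_prob_real, eps))


def generator_loss(pred_density, true_density, disc_prob_real, adv_weight=0.01):
    """Weighted sum of the counting loss and the adversarial loss.

    adv_weight is a hypothetical hyperparameter chosen for illustration.
    """
    return (counting_loss(pred_density, true_density)
            + adv_weight * adversarial_loss(disc_prob_real))
```

The point of this structure is that the gradient reaching the Generator rewards images that the counting model reads correctly, while the adversarial term still pushes the output towards looking like a real satellite image.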
Figure 3 shows an example of the output image from my selected SR model compared with the HR and LR images and traditional bicubic upscaling. Below the images are the predicted object density maps from the counting model, with the ground-truth density map at the lower right. From left to right, the images depict low resolution (LR), bicubic, modified ESRGAN (mine), and true high resolution (HR).
Figure 3: SR image results and vehicle density maps
Note that this is not the model that produced the images that were the most similar to the HR image; it's the model that gave the most accurate vehicle count on the validation set.
The final results on the test set for the selected model are shown in Figure 4. The chart shows the mean absolute percentage error in the estimated number of large vehicles. The performance for the selected GAN (middle bar) is remarkably close to what we get by feeding the true HR images into the counting model (right-hand bar), and significantly better than using bicubic interpolation (left-hand bar).
Figure 4: Performance of final model (8× GAN) on test set, compared with 8× bicubic and true HR
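For reference, the metric in Figure 4 (mean absolute percentage error) can be computed as in the sketch below. The function name is mine, and the example counts are made up purely to show the calculation; they are not results from the project.

```python
def mean_absolute_percentage_error(true_counts, predicted_counts):
    """Mean absolute percentage error between true and predicted counts.

    Every true count must be non-zero, since the error is measured
    relative to it.
    """
    if len(true_counts) != len(predicted_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and the same length")
    relative_errors = [
        abs(pred - true) / true
        for true, pred in zip(true_counts, predicted_counts)
    ]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Made-up counts for two images: each prediction is off by 10%.
print(f"{mean_absolute_percentage_error([100, 200], [110, 180]):.1f}%")  # 10.0%
```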
Super-Resolution: Science Fiction Brought to Life?
So, can the tech guy in the movies go from a blurry blob of pixels to a recognizable mugshot using super-resolution? Well, I wouldn't recommend this technique for that application – GANs do, after all, only give us a plausible fake, and you could end up arresting the wrong person. Oh, and even for zoom and enhance on other objects, the level of upscaling seen on TV is still ridiculous!
However, this project has shown that, with the right adaptations and training approach, GAN-based super-resolution can give you quantifiably better results on downstream image processing tasks, and can save cost by producing results from lower-resolution images that are almost as good as those from real high-resolution images.