Combining large diverse data sets, mathematical models and machine learning allows first live risk assessment of disease spread
Tessella has built a new mapping system working alongside researchers in the Spatial Ecology & Epidemiology Group (SEEG) at the University of Oxford. SEEG scientists have developed mathematical models to predict disease risk in all locations worldwide and designed a novel system, built by Tessella, which uses these models to provide continually updated maps of disease risk. The resulting maps are automatically updated as soon as new data becomes available. The ABRAID (Atlas of Baseline Risk Assessment for Infectious Disease) project is funded by The Bill and Melinda Gates Foundation.
Techniques for developing disease risk maps are advanced, but have previously been applied only to creating static maps. This process is time-intensive and has meant that maps were at risk of being out of date by the time they were completed.
The ABRAID project overcomes this by absorbing and filtering vast quantities of continuously collated data, extracted from internet reports by HealthMap and others, to accurately capture disease occurrences. By combining this data with machine learning methods, an interface for easy data validation, and established modelling techniques, the system is able to continually update its disease risk maps.
Dealing with big data
ABRAID absorbs publicly available occurrence data on infectious diseases, such as news articles reporting outbreaks and academic/government reports, primarily via a feed from HealthMap which sweeps the Internet for such data.
To ensure the data is robust enough for maps on which health decisions are made, the project team has built validation techniques to spot and eliminate data which is unreliable.
The first layer of validation identifies anything which is clearly anomalous, such as a report of a disease in a place where conditions make its presence unlikely, or where a report crops up far from any other known incidence.
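As an illustration only, the "far from any other known incidence" check could be sketched as a nearest-neighbour distance test. The function names, the coordinate tuples and the 500 km cut-off below are all assumptions made for this example; the article does not describe how ABRAID implements the check.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_far_from_known(report, known_occurrences, max_km=500.0):
    """Flag a report whose nearest known occurrence is more than max_km away.

    `report` and each entry of `known_occurrences` are (lat, lon) tuples.
    The 500 km default is a hypothetical threshold, not a value from ABRAID.
    """
    if not known_occurrences:
        return True  # nothing to compare against, so treat as anomalous
    nearest = min(
        haversine_km(report[0], report[1], lat, lon)
        for lat, lon in known_occurrences
    )
    return nearest > max_km
```

With known occurrences around Bangkok and Ho Chi Minh City, a new report a few tens of kilometres from Bangkok would pass this check, while a report from Reykjavik would be flagged for closer inspection.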
The software incorporates data on environmental covariates such as temperature and humidity, as well as models of how socio-economic factors affect disease occurrence. It also assigns each country or region a score based on the known range of each disease. Based on this information, the team developed a machine learning approach which assigns each piece of occurrence data a validity score. If the occurrence is very reliable it goes into the model; if it is unreliable it is excluded.
Where the situation is unclear, the data is passed to a team of experts, who provide the reliability score manually. The machine learning programme notes the expert responses, which are fed back into the process, so that it learns the specific signatures of a reliable report.
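The three-way routing described above (accept, exclude, or refer to experts) can be sketched as a simple threshold rule with a store for expert feedback. This is a minimal sketch under stated assumptions: the score range of 0 to 1, the two thresholds and every name below are hypothetical and do not come from the ABRAID system.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds on a validity score assumed to lie in [0, 1].
ACCEPT_AT = 0.8
REJECT_AT = 0.2


def triage(score, accept_at=ACCEPT_AT, reject_at=REJECT_AT):
    """Route an occurrence by its machine-assigned validity score."""
    if score >= accept_at:
        return "accept"         # reliable: goes into the model
    if score <= reject_at:
        return "reject"         # unreliable: excluded
    return "expert_review"      # unclear: passed to human experts


@dataclass
class ExpertFeedback:
    """Collects expert scores so they can later be used as training labels."""
    labelled: list = field(default_factory=list)

    def record(self, report_id, expert_score):
        # Store the (report, score) pair; retraining itself is out of scope here.
        self.labelled.append((report_id, expert_score))
```

In this sketch only the middle band of scores reaches the experts, which keeps manual effort focused on the reports the classifier is least sure about. The recorded pairs stand in for the feedback loop through which the classifier learns from expert responses.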
Mapping disease risk
Once the data is verified, the mapping itself uses statistical models called Species Distribution Models (niche models), developed for disease risk mapping by SEEG. By combining these modelling techniques with up-to-date, verified occurrence data, it has become possible to produce continually updated dynamic maps of disease risk.
The team has also created a display and interface which shows the data output overlaid onto global maps at a 5km² level so users can easily visualise the modelled infection risk. Additionally, it provides access to visualisations of the disease extent and occurrences used in the model, as well as statistical measures of the model.
More maps for more diseases
The project has been completed for maps of dengue and cutaneous leishmaniasis. The goal is now to expand to cover a wide range of diseases for which good data is available – with a focus on those which haven’t already been well mapped and which are of high worldwide significance.
The hope is that up to 30 diseases could be mapped within the next few years. The technology and techniques have been developed and proven. The next steps are to increase the number of data sources feeding in, including more academic publications as well as expanding the use of social media, and to increase the number of experts involved in evaluating data.
Alan Bell, Sector Director Life Sciences, Tessella, says: “New online data sources have provided huge amounts of information which opens new opportunities for mapping diseases. The challenge has been to find effective ways to sift through this vast quantity of new information from disparate sources and of varying reliability, identify what’s useful, and translate it into a consistent format so that it can be fed into proven models. By harnessing advances in machine learning and our broad scientific and mathematical expertise, we are proud to have played an instrumental part in a project which could help save many lives.”
Catherine Moyes, ABRAID Director, University of Oxford, says: “Tessella developers have taken a complex design and built a robust system that functions as specified. They have been a pleasure to work with and have proven highly skilled, flexible and personable.”