The importance of good data
The world has undergone a rapid change since the fight against COVID-19 began. Data models no longer present an accurate view of the world around us, and the data coming in is unpredictable. Knowing how to transform the new data you've captured into data that is effective for modelling is a key aspect of the challenge faced by COVID-19 task forces.
Good data is the foundation of any model. Epidemiologists must be confident that data feeding their models, for example on population density and prevalence of other infections, is accurate and representative. Diagnostics models need to be trained on data that truly represents the full range of disease manifestations. And manufacturers need to be confident that their models of new processes reflect their working environment.
Even if a model is perfect, it will still produce a wrong result if the data going in is incorrect or incomplete. Garbage in, garbage out, as they say.
Accessing that data is a pain point for many modellers. Data is often in different formats, different locations, or labelled according to different systems. Some will be subjectively captured, such as doctors' notes, and reflect human biases. It often requires significant work before it can be used for modelling, and that is time modellers don’t have right now.
If errors in data are missed early on, they will cause problems down the line, leading to sub-optimal, or even incorrect, decisions. So, it's vital to ensure data is correct and available to modellers.
This is not a new problem. But doing this for COVID-19 projects, where new data of varying quality is coming in from poorly understood environments, captured by people under huge pressure, makes it harder than ever.
Time can be saved by being focussed. There is no time to capture all the data you can and see what it tells you. Identify your end objective and be laser-focussed on capturing data that is useful to that, reducing time needed to find and manage data.
But once you have identified useful data, the same rigour is needed as always. Making decisions quickly with data is not about cutting corners; it's about getting things right first time, so you get earlier answers and don’t need to repeat work.
Getting your (data ware)house in order
Good data is FAIR (Findable, Accessible, Interoperable, Reusable). It's stored in a way that makes it easy to identify by anyone who searches for it. It's in formats that can be read by humans and machines. And it's clear about any limitations or rules about how it can be used.
The following four principles, inspired by the Tessella Data Management Maturity Model, should be followed to ensure an organisation’s data can be effectively used for modelling.
1. Data must be of sufficient quality for modellers
Data must be complete, correct, and consistent.
It must come from a trusted source. This may be simple for your own chemical analysis data. But it will be more complicated for open source data, or data from clinical trials or hospitals which may include bias or misreporting, and particularly challenging for public data.
It's never a good idea to download data and start using it without understanding its quality and appropriateness first. Data scientists can do a lot, but they may need to have data reviewed by domain experts who understand what the data represents in the real world.
A very simple example is government-reported death rates, where the total number of deaths is useful, but hides important variations such as time lags in reporting, missed cases, cases where people die ‘with’ rather than ‘from’ COVID-19, and comorbidities which may be critical for understanding both risk and intervention efficacy. We've seen projects derailed when modellers drew invalid conclusions from data sets because key contextual information hadn't been formally recorded.
And data needs to be cleaned to remove missing or confounding elements. Many diagnostic AIs have failed first time because data indicating a positive diagnosis contained labels added by the diagnosing physician. The AI learned to spot the label, not the disease indicator.
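As a minimal sketch of the completeness, correctness, and consistency checks described in this section, the Python below validates records before they reach a model. The field names (`patient_id`, `age`, `test_result`), the age range, and the result vocabulary are assumptions chosen for illustration, not a real schema. Rejected records are returned with their reasons rather than silently dropped, so a domain expert can review them.

```python
# Hypothetical schema for illustration only: these field names,
# ranges, and vocabularies are assumptions, not a real standard.
REQUIRED_FIELDS = ("patient_id", "age", "test_result")
VALID_RESULTS = {"positive", "negative"}


def check_record(record: dict) -> list:
    """Return a list of quality problems found in one record."""
    problems = []

    # Completeness: every required field is present and not empty.
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            problems.append(f"missing {name}")

    # Correctness: numeric values fall within a plausible range.
    age = record.get("age")
    if isinstance(age, (int, float)) and not 0 <= age <= 120:
        problems.append("age out of range")

    # Consistency: categorical values use one agreed vocabulary.
    result = record.get("test_result")
    if result not in (None, "") and result not in VALID_RESULTS:
        problems.append(f"unrecognised test_result {result!r}")

    return problems


def split_by_quality(records):
    """Separate usable records from those needing expert review."""
    clean, rejected = [], []
    for record in records:
        problems = check_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected
```

Checks like these catch only mechanical problems. They would not detect the label leakage described above, which still needs review by someone who understands how the data was produced.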
2. Complete metadata makes data searchable and understandable
Metadata should be added to enhance understanding and usability. This will include descriptions of what the data represents – for example the type of molecule or toxicology, but also provenance, timestamps, usage licences, and more. There must also be a consistent taxonomy for naming things.
Good metadata allows different groups with different interests to find data easily in the system, and allows those reading it – including machines – to make sense of it and easily compare it to other data.
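One way to picture this is a small metadata record attached to every data set, plus a search that relies on a consistent naming scheme. The sketch below is illustrative only: the `DatasetMetadata` fields and the tag normalisation rule are assumptions, not a published metadata standard.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata record; field names are assumptions."""
    name: str
    description: str      # what the data represents
    provenance: str       # where it came from
    collected_on: date    # timestamp of capture
    licence: str          # usage licence
    tags: frozenset = field(default_factory=frozenset)


def normalise_tag(tag: str) -> str:
    """Apply one naming convention so 'Small Molecule' and
    'small molecule' resolve to the same taxonomy term."""
    return tag.strip().lower().replace(" ", "_")


def find_datasets(catalogue, tag):
    """Return every catalogue entry carrying the given tag."""
    wanted = normalise_tag(tag)
    return [
        meta for meta in catalogue
        if wanted in {normalise_tag(t) for t in meta.tags}
    ]
```

Because every tag passes through the same normalisation step, a toxicologist and a data scientist who type the term slightly differently still land on the same data set.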
3. Good privacy and security avoid problems down the line
If models are trained on data which doesn’t meet privacy rules, it could cause big problems down the line. Patient data, for example, will involve meeting GDPR rules as a minimum, but other organisations, such as the MHRA or the FDA, also set requirements. Data should only be used if it was collected by someone authorised to do so, and its provenance and allowable use should be made clear in the metadata. It must also have adequate security in place to protect it where it is stored and used.
Concerns over data security are already causing headaches for the NHS tracking app before it's even deployed, and the associated controversy over ties to Cambridge Analytica is a poignant reminder that using opaquely captured data can permanently destroy trust.
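Recording allowable use in the metadata only helps if something checks it before the data is used. The sketch below shows one possible gate that runs before a training job starts. The `UsagePolicy` fields and purpose names are hypothetical, and a check like this is no substitute for a proper legal review under GDPR or any other regulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical policy fields recorded alongside a data set."""
    collected_by_authorised_party: bool
    allowed_purposes: frozenset


class UseNotPermitted(Exception):
    """Raised when a data set may not be used for the stated purpose."""


def assert_use_permitted(policy: UsagePolicy, purpose: str) -> None:
    """Stop early if the data set cannot be used for this purpose."""
    if not policy.collected_by_authorised_party:
        raise UseNotPermitted("data was not collected by an authorised party")
    if purpose not in policy.allowed_purposes:
        raise UseNotPermitted(
            f"purpose {purpose!r} is not covered by the recorded allowable use"
        )
```

Failing loudly at this point is cheaper than discovering after deployment that a model was trained on data it should never have seen.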
4. Make data consistent, accessible, and traceable
Data stores, lakes, and warehouses need to be set up so that data is accessible to anyone who needs it, whilst restricted from those who don’t. This also includes selecting tools and building integrators that pipe data to the data science teams.
Data must have a single point of truth. It must be linked together in the IT system so that if one instance is changed, all others are updated. Otherwise inconsistencies can be introduced across models using the same data.
Finally, all data must have a data steward, someone who makes decisions about how it's stored and managed, and someone who can be contacted by modellers who need further information.
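The single point of truth and data steward ideas above can be illustrated with a toy in-memory store. Each data set is held once under an identifier, consumers look it up by that identifier instead of keeping private copies, and every entry names a steward. The `DataStore` class and its methods are hypothetical, a sketch of the principle and not a real data warehouse API.

```python
class DataStore:
    """Toy single point of truth: one entry per data set identifier."""

    def __init__(self):
        self._records = {}

    def register(self, dataset_id, values, steward):
        """Add a data set once, with a named steward to contact."""
        self._records[dataset_id] = {
            "values": list(values),
            "steward": steward,
            "version": 1,
        }

    def update(self, dataset_id, values):
        """Change the one stored instance and bump its version."""
        record = self._records[dataset_id]
        record["values"] = list(values)
        record["version"] += 1

    def get(self, dataset_id):
        """Return a read-only snapshot and the version it came from."""
        record = self._records[dataset_id]
        return tuple(record["values"]), record["version"]

    def steward_of(self, dataset_id):
        """Return who to contact with questions about this data set."""
        return self._records[dataset_id]["steward"]
```

Returning the version with each read makes results traceable: two models reporting different versions of the same data set is an immediate signal that one of them needs to be rerun.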
This work can be complex, but these are practical and manageable steps which ensure data makes sense for those using it. We have seen the importance of this in pre-COVID times. We helped a large pharma company explore how to better apply data science to preclinical data, in order to predict late-stage clinical failures.
After speaking to the modellers themselves, it quickly became clear that the main problem was that data was hard to find, hard to understand, laborious to use, and sometimes risky to draw conclusions from. By focusing the project on improving data management, they were able to improve their models and better understand the reasons for failure. Getting the data right means better answers early on, and reduces risk of failure down the line.
We will discuss how they do this in our third article. But before they start, they need to consider the vital issue of trust, to which we will turn next.
The world has changed rapidly, and the way data is being collected and used has changed with it. We can help COVID-19 task forces take control of the situation.