The correct approach
Once you’re confident you have the right data to give you a good foundation, it’s time to use it to build models that work.
There is no rule for which approach is best for a particular problem. The nature and context of the problem, data quality and quantity, computing power needs, speed, and intended use all feed into model choice and design.
That said, many COVID-19 challenges are quickly evolving, with limited, uncertain, or changeable data, which may lend themselves to particular approaches more than others.
Disease spread prediction, for example, involves constantly updated information from apps, hospitals, and potentially less curated sources such as social media and doctors’ notes. Rapid trials of treatments with potential secondary indications involve collecting data at unprecedented speed on patients’ responses, some of it subjective, while much remains unknown about the disease.
Predicting how staff will move in a warehouse involves the complexities of how humans will respond to a completely new working environment.
Doing lots with little data
Popular techniques such as machine learning, which need large volumes of well-understood data, may not be appropriate for such opportunities. ‘Most powerful’ is not the same as ‘most suitable’. A neural network may hold the potential for the best answers, but there’s no point building one if data is limited, or if it needs to run on an embedded medical device with limited compute capability. It just won’t work.
Alternative approaches such as Bayesian uncertainty quantification may be more appropriate for scenarios with limited data. This approach involves updating our knowledge, and its uncertainty, with each new data point, so that each piece of data feeds incrementally into the model and enriches it. These uncertainties gradually reduce as more information becomes available.
For example, we used a Bayesian approach to model patient recruitment and retention for clinical trials. Using demographic and historical recruitment data, we established the uncertainties around who would sign up, and when. These were automatically updated as each new recruit was confirmed, improving predictive power over the course of the trial.
This work saved hundreds of thousands of dollars for our client, just by allowing the right equipment to be delivered to trial locations at the right time, and brought further benefit by reducing over-recruitment, and predicting start-dates for new drug revenue streams. Since the COVID-19 crisis started, this approach has been incredibly powerful in helping understand trial delays and remediation scenarios.
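The core of this kind of Bayesian updating can be sketched in a few lines. The example below is a deliberately simplified illustration, not Tessella’s actual recruitment model: it uses a standard Beta-Binomial model with made-up numbers, where a prior belief about a sign-up rate is refined as each batch of outcomes arrives, and the uncertainty (variance) shrinks with every update.

```python
# Bayesian updating of a sign-up probability with a Beta-Binomial model.
# All figures are hypothetical, for illustration only.
from dataclasses import dataclass


@dataclass
class BetaBelief:
    alpha: float  # prior pseudo-count of sign-ups
    beta: float   # prior pseudo-count of non-sign-ups

    def update(self, signed_up: int, approached: int) -> None:
        """Fold one batch of observed outcomes into the belief."""
        self.alpha += signed_up
        self.beta += approached - signed_up

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1))


# Weakly informative prior: we expect roughly a 20% sign-up rate.
belief = BetaBelief(alpha=2.0, beta=8.0)
for signed_up, approached in [(3, 10), (5, 20), (4, 15)]:
    belief.update(signed_up, approached)
    print(f"estimated sign-up rate {belief.mean:.2f} "
          f"(variance {belief.variance:.4f})")
```

Each batch of data tightens the estimate, which is exactly the property that makes this family of approaches attractive when data arrives slowly.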
How to decide which approach to use
Retraining existing models is likely to be risky in many cases: they were built to model a different situation, and their assumptions may no longer apply. If the new data is sufficiently similar in quality and quantity, there may be a case for reusing existing models, with suitable retraining and validation.
But in many cases the new data will be quite different, and new, bespoke models will need to be built from the ground up. Bespoke models can be more closely targeted to the problem and data at hand, but they also require more time, and possibly skills outside the user’s experience.
Building any particular model requires access to someone with the right skill set for that model. But the real challenge is knowing which model is best to use. Mistakes are often made when decisions are based on what modelling skills are available, rather than what is best for the problem.
The best decisions come from involving a range of data science experts, who can assess the best tools based on extensive experience of similar problems. To help teams with this mission, Tessella developed RAPIDE, distilled from four decades’ experience of designing and building data analytics, advanced statistical modelling, AI, and machine learning solutions for a wide and diverse cross-section of high-tech and R&D-heavy industries.
RAPIDE was crafted to enable data scientists to consistently assess potential value, problem feasibility, and identify and apply the best tools and approaches to meet each specific challenge. Though transparent, complete and accessible, RAPIDE is fundamentally dependent upon individual practitioner skill, experience and judgement to implement correctly. Used well, RAPIDE directs the data scientists’ crucial choices when navigating all phases of the modelling process, and ensures effort is focused on the right areas of the right problems.
The following presents an example of RAPIDE in action.
1. Readiness assessment
Start by defining what you want to do. Then assess what data you need and what's available (this may happen as part of your data identification and gathering process, discussed in our previous article).
Understand the type of analytics problem. Is it classification/regression, supervised/unsupervised, predictive, root-cause analysis, statistical, physics-based? Understand how “dynamic” the problem is – for example, will the nature of the incoming data change over time? If so, periodic retraining will be needed, and must be factored into planning.
If the problem is brand new and there is insufficient proven data available to validate a model, this may limit you to well-understood modelling techniques such as cluster analysis or principal component analysis. If more data is available, or is likely to be available soon, you may be able to deploy more complex self-learning approaches.
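Techniques like principal component analysis are attractive here precisely because they need no training labels and little data. A minimal numpy sketch on synthetic data (all numbers are made up for illustration) shows the idea: when a handful of underlying factors drive many measured features, the first few components capture most of the variance.

```python
import numpy as np

# Synthetic dataset: 100 observations of 4 correlated features,
# generated from just two underlying factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))   # two hidden driving factors
mixing = rng.normal(size=(2, 4))     # how factors map onto features
X = latent @ mixing + 0.1 * rng.normal(size=(100, 4))

# PCA via SVD of the mean-centred data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print("variance explained per component:", np.round(explained, 3))
# Because only two latent factors generated the data, the first two
# components should account for nearly all the variance.
```

Even with very few observations, a plot or printout of the explained-variance ratios gives a defensible first read on how much structure the data really contains.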
2. Advanced Data Screening and Pinpointing Variables
Explore the data using a range of simple techniques to spot the meaningful correlations between events of interest. For example, does contact tracing data suggest transmission happens more frequently in specific situations, such as workspaces under a certain size?
Identify any constraints in the data that might limit model choice without further processing, such as data that’s overly broad and obscures the key variables that dictate behaviour. Early insights help direct your model to be most effective in the context of your objective and required performance.
3. Identify Candidate Algorithms
Based on outputs of the previous analysis, identify candidate modelling techniques (which could be empirical, physical, stochastic, hybrid). Shortlist the most promising candidate algorithms and quickly assess feasibility of each.
4. Develop Powerful Models
Decide on the most suitable model for the problem. Check implementation requirements, such as user interface, required processing speed, and architecture, to ensure it will be a usable solution before you commit. Gather validation data. Build it.
If all these steps are carried out correctly and with confidence, models should rarely fail after deployment. So, although this process demands rare skills and experience, and may take some time upfront, it leads to quicker, real answers and reduces model failure rates and expensive reworking.
Finally, as noted earlier, the technically best model is not necessarily the model that works best in the real world. To ensure a balance of predictive power and successful user uptake, trust needs to be considered throughout this process. We will discuss this in the next article.