Shared resources enable greater collaboration: Big science in the cloud
The vast quantity of data generated by global monitoring initiatives and large-scale research facilities presents both new opportunities and challenges for scientists. Results that can be captured in minutes may take years to fully understand.
To help researchers review and analyze this growing volume of information, cloud-based platforms are now being developed to combine distributed access with shared high-power computing resources.
These tools are opening the door to massively collaborative projects, including citizen science, and are providing a manageable route for making publicly funded research available to the wider world.
Catherine Jones, based at the STFC’s Rutherford Appleton Laboratory in Oxfordshire, UK, leads the software engineering group at the Ada Lovelace Centre, an integrated, cross-disciplinary, data-intensive science center supporting national facilities such as synchrotrons and high-power lasers.
In her role, Jones is closely involved with providing researchers with access to tools and data over the cloud—an approach known as Data Analysis as a Service. “Traditionally, researchers using our facilities would have taken the data with them, but as data volumes increase you need to look at other solutions,” she says.
Using internal cloud facilities at the STFC, Jones and her colleagues offer researchers access to virtual machines designed to simplify working with large amounts of scientific results. “The virtual machines are aimed at a specific scientific technique,” Jones explains. “Whenever a user spins one up, they have access to their data and to the routines that they’ll need for that analysis, together with the right amount of computing resource.”
The cloud-based tools require testing and documentation to make sure that the platform not only meets the researchers’ immediate needs, but also provides a robust solution in the long term. In other words, a product that can be serviced, supported and transferred.
Currently, the system supports scientists conducting research using one specific experimental technique at the STFC’s ISIS neutron spallation facility, with plans to roll it out further. It’s a model that could be applied across different research communities, although each one will have its own specific needs. Detailed requirements gathering is essential to understand the machine learning and AI needs across multiple labs.
The benefits of a cloud-based approach to data analysis include streamlined administration and maintenance. For example, the use of virtual machines makes it easier to roll out software upgrades and apply version control so that scientific models can be re-run, and their results reproduced in the future.
There are advantages too when it comes to configuring the work environment. “It’s easier to match the computing resources to the analysis, as a cloud setup is more flexible,” says Jones. “It’s a more elastic resourcing mechanism.” The hope here is that researchers will gain more time to spend on the analysis, with less to worry about in terms of the hardware under the hood.
Different fields, different requirements
As Jones points out, different scientific fields can have different requirements when it comes to dealing with the demands of big data. John Watkins, who is head of environmental informatics at the Centre for Ecology & Hydrology (CEH), gives an example.
“With particle physics, the challenges are likely to be more in terms of data volume and the analytics of a particular data flow,” he says. “However, with environmental science you are often assessing a very broad variety of data. This needs to be pulled from multiple sources and can be very, very different in nature.”
Watkins’ colleague, Mike Brown—who is head of application development at CEH—refers to the so-called Vs of big data (a list that includes volume, variety, velocity, and veracity) to emphasize the multiple challenges associated with providing scientists with easy access to data and analytical tools.
A key objective for Brown and Watkins is to connect environmental scientists who understand the data with experts in numerical techniques who are developing cutting-edge analytical methods. Once again, the solution has been to provide collaborative facilities in the cloud—this time through a project known as DataLabs, funded by NERC.
“It’s not just about providing easy-to-use interfaces, it’s also about enabling the dialogue between researchers with a shared aim,” Watkins comments. “The provision of collaborative tools such as Jupyter Notebooks or R-Shiny apps are a way of achieving this over time.”
Watkins and Brown worked with experts at Tessella to break down the DataLabs project into user stories, an approach that helped the team to capture the key features of the platform and quickly pilot its ideas. "The aim in the first 12 months was to build a proof-of-concept to show that all the different elements could work together and would be useful for the community," says Jamie Downing, a project manager at Tessella who has been supporting the program's core partners.
Today, the group has the essential elements in place from end to end, and the first case studies show that DataLabs has got off to a flying start. As an example, researchers are now using the cloud-based environment to run much more detailed CEH land-cover models. The leap in performance (a jump from 1 km to 25 m resolution), coupled with significantly reduced execution time, is a huge improvement on what was possible under the previous physical workstation-based approach.
Other fields get to benefit too. The experience in developing DataLabs has provided a springboard for rolling out similarly collaborative platforms such as solutions supporting the Data and Analytics Facility for National Infrastructure (DAFNI). This is a project that aims to integrate advanced research models with established national systems for modeling critical infrastructure.
“Led by Oxford University and funded by the EPSRC, the initiative aspires over the next 10 years to be able to model the UK at a household level, 50 years into the future,” explains Nick Cook, a senior analyst at Tessella. Here, the firm is involved in conceptualizing DAFNI’s capabilities and implementation roadmap.
One of the project's early goals is to create a 'digital twin' of a city such as Exeter - in other words, to create a rigorous, detailed, virtual model of a city with a population of several hundred thousand people together with its transport infrastructure, utility services, and environmental context. This digital twin would, for example, help planners to decide where to invest in new road or rail networks, and to identify the best sites for housing, schools, and doctors' surgeries.
Cook cautions that such a hyperscale systems approach will succeed only if it performs in a reliable, repeatable, and provenanced way. "When users deliver their findings, they need to be able to justify how the results have been generated, in an analogous way to applying the scientific best practices of high-energy physics or life science research - engendering a sense of trust in their outcomes to perhaps skeptical or hostile audiences," he emphasizes.
DAFNI is looking very closely at what DataLabs is doing as a way of providing the interface and the virtual research spaces within its own cloud. Both projects share requirements to store the results in a traceable way that preserves the integrity of the data and protects it against tampering, inadvertent corruption, or malicious use. It's an area that could one day see digital ledgers, or blockchains, playing an important role - particularly when dealing with the sensitive nature of critical national infrastructures.
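To make the idea of tamper-evident storage concrete, here is a minimal sketch of a hash chain, the basic building block behind digital ledgers. It is an illustration only and not a description of how DataLabs or DAFNI store results; the record fields (`model`, `resolution_m`, `run`) and the function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def record_hash(payload: dict, previous_hash: str) -> str:
    """Hash a result record together with the hash of the record before it."""
    body = json.dumps(payload, sort_keys=True) + previous_hash
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(ledger: list, payload: dict) -> None:
    """Append a record whose hash depends on every earlier record."""
    previous_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append({
        "payload": payload,
        "previous_hash": previous_hash,
        "hash": record_hash(payload, previous_hash),
    })


def verify(ledger: list) -> bool:
    """Recompute every hash; an edited record breaks the chain."""
    previous_hash = GENESIS_HASH
    for entry in ledger:
        if entry["previous_hash"] != previous_hash:
            return False
        if entry["hash"] != record_hash(entry["payload"], previous_hash):
            return False
        previous_hash = entry["hash"]
    return True


# Hypothetical result records, for illustration only.
ledger = []
append_record(ledger, {"model": "land_cover", "resolution_m": 25, "run": 1})
append_record(ledger, {"model": "land_cover", "resolution_m": 25, "run": 2})
intact = verify(ledger)

ledger[0]["payload"]["resolution_m"] = 1000  # simulate tampering
tampering_detected = not verify(ledger)
```

Because each hash covers the previous one, changing an early record invalidates everything that follows it. A real ledger would add replication and access control on top of this basic check.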
More food for thought
As well as supporting collaborative number crunching, cloud-based big science solutions make it much easier to reach out and share knowledge and expertise - for example, through webinars and workshops.
Today, more and more of us have experience of operating in the cloud, collaborating on projects at work, and watching movies and sharing photos at home. Popular online platforms have become easier to use and more personalized to our requirements. But as expectations rise, so can our demands in terms of what an interface can do and the features we’d like to see.
“It can be a challenge when you don’t have the resources of giants like Google, but it’s all to the good as our experiences encourage us to think about simple and easy pathways and not to make our solutions overly complicated,” says Jones.
Summing up, the days of doing an experiment and being able to carry the full data set back to your PC on a USB stick are over. And while it's unlikely to surprise many that cloud storage and online data access have risen to the challenge, the devil is in the detail. Get it right and platforms can do so much more for the scientific community: providing scalable computing resource, simplifying maintenance and upgrades, and enabling multidisciplinary collaboration to spur on research progress.
AI gears up for data analysis: Making the most of machine learning
Applying AI know-how to the giant pool of data gathered from the world’s leading and most powerful scientific instruments could accelerate the process of scientific discovery. Developing algorithms that can reduce the burden on experts and run 24 hours a day is an appealing option for shrinking the backlog of experimental results that have yet to be fully analyzed and understood.
Powerful machine learning approaches offer new ways to extract scientific meaning from the raw experimental data, which ultimately could help funders to unlock more value from their investment in research.
Large-scale experimental facilities such as neutron and synchrotron sources have become an essential element of modern scientific research, allowing visiting researchers to probe the structure and properties of many different types of materials. They also generate huge amounts of experimental data, which can make it difficult for visiting scientists without specialist knowledge of the experiment to extract meaningful information from the raw datasets. As a result, some of the data collected during their valuable beamtime is never properly analyzed.
The good news is that this situation has improved dramatically over the last 10 years, with a consortium of leading neutron facilities working together to streamline and standardize the software used to analyze data from neutron scattering and muon spectroscopy experiments. The framework - Mantid - supports a common data structure and shared algorithms to enable visiting scientists to easily process and visualize their experimental results.
“This common framework helps visiting scientists to get to grips with instruments at different facilities,” comments Nick Draper, one of Tessella’s senior project managers. “But it also helps researchers to make use of a different instrument at the same facility.”
Next big challenge
According to Draper, who has long been involved in supporting big science projects, the next major challenge is to make it easier for researchers from different scientific backgrounds to analyze and interpret the complex experimental output that can be produced.
“Often there’s not just one model that you could fit to your data, there could be 20 or 30 options, and sometimes it’s not absolutely clear which model you should be picking,” Draper explains. “At the moment, it takes expert opinion from instrument scientists who really understand the experiments to lead and guide on which approaches to take.”
But with larger and larger volumes of data to get through, this can create a bottleneck that delays results. One option for speeding up the process is to exploit artificial intelligence (AI) to help with model selection. It’s a concept that some researchers might feel uneasy about, but Draper’s colleague Matt Jones—an analyst at Tessella who keeps a watchful eye on the latest industry trends—has some words of reassurance. “AI is there to help the human, it’s not there to govern and provide the answers—it’s there to augment,” he states.
The deep learning revolution
Today, the buzz surrounding artificial intelligence is hard to ignore. We’ve been wowed by computers that can beat grandmasters at chess and Go, and are served by increasingly powerful speech recognition and machine translation tools. To the list of highlights, you can also add breakthroughs in image recognition together with progress in driverless vehicles. But why is it all happening now? After all, many machine learning algorithms have been around for decades.
The crucial factor is the impact of scale, specifically the parallel growth of data and available computing power. And this has transformed the capabilities of one technique in particular—deep learning—which benefits greatly from the availability of large datasets.
While other methods plateau when you feed them with more information, the performance of deep learning’s artificial neural networks keeps climbing. And the larger (or deeper) the neural network, the greater its capacity to absorb the value of its inputs and deliver meaningful outputs.
Combining big data with large amounts of compute makes it possible to create artificial neural networks with many so-called hidden layers. These deep learning systems are giant mathematical functions that comprise multiple layers of nodes, equipped with self-adjusting weights and biases, all sandwiched between a series of inputs and outputs.
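As a rough illustration of that description, the sketch below builds a very small network in plain Python and passes one set of inputs through it. The layer sizes, the `tanh` activation, and the random starting weights are arbitrary choices made for the example; training, the step in which the weights and biases are adjusted, is not shown.

```python
import math
import random


def make_layer(n_inputs: int, n_nodes: int, rng: random.Random) -> dict:
    """Create one layer: a weight per (node, input) pair and a bias per node."""
    return {
        "weights": [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
                    for _ in range(n_nodes)],
        "biases": [0.0] * n_nodes,
    }


def forward(layers: list, inputs: list) -> list:
    """Pass the inputs through each layer in turn and return the outputs."""
    activations = inputs
    for layer in layers:
        activations = [
            # Weighted sum of the previous layer's values, plus bias,
            # squashed into the range (-1, 1) by tanh.
            math.tanh(sum(w * a for w, a in zip(node_weights, activations)) + bias)
            for node_weights, bias in zip(layer["weights"], layer["biases"])
        ]
    return activations


rng = random.Random(0)
sizes = [4, 8, 8, 2]  # 4 inputs, two hidden layers of 8 nodes, 2 outputs
network = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
outputs = forward(network, [0.5, -0.2, 0.1, 0.9])
```

Adding more entries to `sizes` makes the network deeper. Production systems rely on optimized libraries and far larger layers, but the structure is the same.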
The rich combination of data and compute, together with a greater understanding of how to train these powerful multi-layered networks (for example, by backpropagation), is now taking the performance of machine learning techniques to new heights.
Engaging the benefits
The flip-side is that research groups need access to large amounts of data and large amounts of compute to engage the full benefits of deep learning, and they need support from teams who can get these systems up and running.
It’s an issue that Tony Hey, Chief Data Scientist at the STFC, and his team are aware of. To help researchers extract more science, more efficiently, from their experiments, Hey is assembling a Scientific Machine Learning group, working closely with the Alan Turing Institute, the UK’s national institute for data science and artificial intelligence.
Hey is also linked to STFC’s Ada Lovelace Centre, which is being established as an integrated, cross-disciplinary, data-intensive science hub that has the potential to transform research at big science facilities through a multidisciplinary approach to data processing, computer simulation and data analytics.
Objectives for Hey include applying AI and advanced machine learning technologies to the experimental data generated by STFC-supported facilities at the Harwell Campus: the Diamond synchrotron source; the ISIS neutron and muon source; the UK’s Central Laser Facility; and the NERC Centre for Environmental Data Analysis with its JASMIN super data cluster.
“The analysis of huge datasets requires automation and machine help as the volume goes beyond what used to be possible by hand,” Hey comments. “However, there are lots of opportunities to try to help automate the data flow in the pipeline in getting data from a machine to the point where you can do science with the results.”
Building this pipeline requires helping researchers to understand more about the machine learning algorithms. “You need transparency and understandability as to how various methods will get you to an answer, not black boxes,” he points out.
Hey is keen to develop what he describes as machine learning benchmarks. He also wants to leverage existing expertise in communities such as particle physics and astronomy, who have been dealing with petabyte-scale big data challenges for some time.
The goal is to create a broader support structure for machine learning and AI that other disciplines can tap into. It means being able to strip out the jargon and make processes such as data classification models understandable outside a given field.
One way of lowering the barrier to entry is to provide what John Watkins of the CEH calls “teaching labs”, such as C++ routines that have been packaged into an R library, married with a dataset, and then wrapped in a web-based R-Shiny app for convenient access. “They let people look at various algorithms and play with them to learn their particular characteristics and discover how methods may or may not be useful in their work,” he says.
For Watkins and his environmental science colleagues, one size rarely fits all. Researchers in the field commonly need to understand a variety of data from different sources—for example, output from sensors on land and in the atmosphere, as well as oceanographic measurements.
“Ideally you want access to a range of tools to hit a block of data with and compare the results to identify the most efficient method,” he advises. “You don’t want to be in the position where you can only attack it with one method, because that’s the only capacity that you have.”
There are other considerations too beyond stripping out the jargon and providing accessible and benchmarked tools. It’s also important to support the optimal workflow for a given task, which might be running models on an HPC, storing the results on a large-scale data cluster, and then switching to a smaller scale operation once the portion of the data that’s important has been identified.
Clearly, it’s a job for multi-skilled teams who can navigate not just the technology, but also the science that the AI is being targeted at.
Returning to our earlier example, Draper is encouraged by pilot analysis using small-angle neutron scattering data, where AI is now being used to steer users towards using either a spherical model or a cylindrical model to fit the data. Early results are promising, but the next question is whether the approach remains effective when the choice jumps to as many as 40 different models.
Just the beginning
Draper and his Tessella colleague Matt Jones believe this is just the beginning of a trend that could revolutionize the analysis of scientific data, with interest growing among the research community in the possible benefits of AI. “We are just starting to prick the edges of this future now,” says Matt Jones. He anticipates more conversational-type interfaces, as well as visual approaches such as virtual reality, that lend themselves to presenting highly detailed scientific structures and complex data.
“AI is a really interesting place for the future,” adds Draper, who is also well aware of the hurdles. “You need lots of training data,” he points out, “and that data has to be properly tagged.”
But what happens if training data doesn’t exist, or is only available in limited quantities? One idea is to back-generate images that indicate what a particular model would look like. “If you do that lots of times with different parameters, mixing in static and distorting the images to make them as realistic as you can, then you can create training data,” says Draper. “The challenge is to ensure that you are not simply overtraining your dataset to recognize the things that you have created as opposed to actual experimental results.”
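The sketch below shows one way such back-generation could look for the spherical model mentioned earlier. It is a simplified illustration under stated assumptions: it produces one-dimensional intensity curves rather than images, uses the standard form factor of a uniform sphere, and adds plain Gaussian noise as a stand-in for "static". The q range, radius range, and noise level are hypothetical values chosen for the example.

```python
import math
import random


def sphere_intensity(q: float, radius: float) -> float:
    """Normalized form factor of a uniform sphere; tends to 1 as q -> 0."""
    x = q * radius
    amplitude = 3.0 * (math.sin(x) - x * math.cos(x)) / x ** 3
    return amplitude ** 2


def synthetic_curve(radius: float, q_values: list, noise_level: float,
                    rng: random.Random) -> list:
    """One simulated curve: ideal model values plus Gaussian noise."""
    return [sphere_intensity(q, radius) + rng.gauss(0.0, noise_level)
            for q in q_values]


def make_training_set(n_curves: int, rng: random.Random) -> list:
    """Generate labeled curves for randomly drawn sphere radii."""
    q_values = [0.01 + 0.005 * i for i in range(50)]  # hypothetical q range
    dataset = []
    for _ in range(n_curves):
        radius = rng.uniform(20.0, 100.0)  # hypothetical radius range
        curve = synthetic_curve(radius, q_values, noise_level=0.01, rng=rng)
        dataset.append({"label": "sphere", "radius": radius, "curve": curve})
    return dataset


rng = random.Random(42)
training_set = make_training_set(100, rng)
```

A second generator for cylinders would give a classifier two labels to choose between. In line with Draper's warning, any model trained this way should still be checked against held-out experimental measurements, not only against more simulated curves.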
Synthetic data that sums a number of signals has proven useful in enhancing speech recognition—for example, by training systems to overcome background sounds such as in-car noise—so again, it’s possible that knowledge developed in one sector can be transferred across different domains.
Success in deploying AI requires teams with talent across multiple areas: an understanding of the data, knowledge of machine learning algorithms plus statistical methods, and expertise in high-performance or cluster computing. But the potential rewards make the challenges worth conquering and can extend to other areas beyond analyzing experimental results.
Google has reportedly saved a fortune by using deep learning to reduce the costs of running its data centers. Algorithms can alert operators when machinery is close to failure and should be replaced, which minimizes downtime. The output can also inform optimal servicing frequencies to keep equipment in reliable working order for as long as possible.
This predictive power can be applied at big science facilities too, notes Tessella’s Kevin Woods—a senior project manager involved in the update of instrument control systems. “By looking at the long-term patterns [in the signals] you can actually spot imminent failures,” he says. One example could be a gradual increase in motor operating temperature, which may indicate that an actuation unit is on its way to overheating.
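A very simple version of this kind of check can be written without any machine learning at all: fit a straight line to the most recent readings and raise a flag when the slope stays above a threshold. The sketch below does this with an ordinary least-squares slope. The window length, the threshold, and the temperature values are hypothetical, and a real control system would need to account for duty cycles and ambient conditions.

```python
def slope(values: list) -> float:
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_drifting(temperatures: list, window: int, max_slope: float) -> bool:
    """Flag a sustained rise over the most recent `window` readings."""
    if window < 2 or len(temperatures) < window:
        return False  # not enough data to estimate a trend
    return slope(temperatures[-window:]) > max_slope


# Hypothetical motor temperatures in degrees Celsius, one reading per interval.
# Both series carry the same small ripple; only the second has an upward trend.
steady = [40.0 + 0.2 * ((i % 3) - 1) for i in range(60)]
rising = [40.0 + 0.05 * i + 0.2 * ((i % 3) - 1) for i in range(60)]

steady_flag = is_drifting(steady, window=30, max_slope=0.02)
rising_flag = is_drifting(rising, window=30, max_slope=0.02)
```

Here `steady_flag` comes out as `False` and `rising_flag` as `True`. Learned models become useful when the warning signs are subtler than a straight-line trend or are spread across several signals at once.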
Rewards within reach
As we’ve seen, investing in AI puts multiple rewards within reach. Machine learning has the potential to dramatically speed up the analysis of big data across different domains, hopefully allowing research teams to make faster progress in their understanding of increasingly complex phenomena. To succeed, researchers need easy access to extensive data sets, large amounts of compute, and the ability to experiment with and understand which algorithms are best matched to the task.
Mike Brown, CEH
John Watkins, CEH
Tony Hey, STFC
Catherine Jones, STFC
Nick Cook, Tessella
Jamie Downing, Tessella
Nick Draper, Tessella
Matt Jones, Tessella
Kevin Woods, Tessella