To ensure AI delivers, each project must be approached in a way that maximises its chances of success. In our new whitepaper, The Three Stages of Enterprise AI Success, we discuss three interconnected steps for doing so: build the model, prepare the data, and deploy the model correctly into the enterprise.
Here we discuss the second of these.
Don’t Feed Your AI Fake News, Give it Reliable Data so it Can Learn Properly
If you have been taught logical thinking, but you only read fake news, you’ll reach the wrong conclusions about the world. Equally, good AI models are only as good as the data that goes into them. If a model is trained perfectly to recognise that a machine is about to fail, but a key sensor is feeding it the wrong data, the model will reach the wrong conclusions. Garbage in, garbage out.
Making data usable is down to data engineers, who must acquire the data, examine it, understand it, clean it, and prepare it. They must work in close collaboration with the data scientists and share knowledge of the model, business and data. This is vital for two reasons. Firstly, they are accountable for gathering the training data for the data scientists. Secondly, once the model is deployed into the enterprise and processing real-world data, they need to ensure that data is reliable enough for it to reach a meaningful result.
All the Problems with your Enterprise Data that Will Totally Confuse your AI
Data is drawn from disparate places in an enterprise. Data sources can include temperature sensors, machine monitoring devices, customer databases, and mobile health apps. It can be held in different formats, from structured engineering data, to Excel spreadsheets, to images, notes, video and voice recordings. All sorts of problems emerge.
Data often has inconsistent naming conventions. A company may have multiple sites with multiple sensors capturing data on temperature, energy output, vibrations, etc. Different units may be used in different regions (feet vs meters) and sensors are often named inconsistently or even mistyped (‘Temp_1’ vs ‘Temp-1’). The central data team ends up with many streams of inconsistently named data, making it hard to reliably feed them into models.
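As a rough illustration of the naming and unit problem, the sketch below maps variants such as ‘Temp_1’ and ‘Temp-1’ onto one canonical name and converts length readings to metres. The function names and the list of accepted unit spellings are assumptions made for this example, not part of any particular product. It only handles differences in case, spacing and separators; a genuinely mistyped name (for example ‘Tmep_1’) would still need a lookup table or manual review.

```python
import re

FEET_TO_METRES = 0.3048  # exact by international definition

def canonical_name(raw: str) -> str:
    """Normalise case, whitespace and separators in a sensor name.

    'Temp_1', 'Temp-1' and ' temp 1 ' all become 'temp_1'.
    """
    cleaned = re.sub(r"[\s\-_]+", "_", raw.strip().lower())
    return cleaned.strip("_")

def to_metres(value: float, unit: str) -> float:
    """Convert a length reading to metres; the unit spellings are assumed."""
    unit = unit.strip().lower()
    if unit in ("m", "meter", "meters", "metre", "metres"):
        return value
    if unit in ("ft", "foot", "feet"):
        return value * FEET_TO_METRES
    # Fail loudly rather than guess: an unknown unit is a data problem
    # that someone needs to look at.
    raise ValueError(f"unknown length unit: {unit!r}")
```

Raising an error on an unrecognised unit is a deliberate choice here: silently passing the value through would reintroduce exactly the inconsistency the step is meant to remove.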
Data can go missing. Employees sometimes forget to upload important information, or update databases. Sensors malfunction, or machines are taken out of service, creating gaps in the time series. Faulty sensor data can have severe consequences: in the March 2019 Ethiopian Airlines disaster, erroneous readings from a sensor led the plane’s automated system to misunderstand what was happening.
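Gaps in a time series are easier to handle once they have been located. The following is a minimal sketch, assuming readings are expected at a fixed interval; the function name and the tolerance parameter are illustrative choices, not an established API.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval, tolerance=timedelta(0)):
    """Return (last_seen, next_seen) pairs where readings are missing.

    A gap is reported whenever two consecutive readings are further
    apart than expected_interval plus tolerance.
    """
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval + tolerance:
            gaps.append((previous, current))
    return gaps

# Example: one reading per minute, with minutes 3 and 4 missing.
start = datetime(2019, 3, 1, 8, 0)
readings = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]
gaps = find_gaps(readings, expected_interval=timedelta(minutes=1))
```

What to do with a gap once found (interpolate, flag, or exclude the period) depends on the model and the business context, which is one reason data engineers and data scientists need to agree on it together.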
Data collected from past human decisions will reflect human biases. Amazon’s AI recruitment tool was trained on employee CVs, most of them from men, and learned to treat systemic gender differences, from writing style to personal interests, as determinants of a successful hire. The result was an AI that dismissed women as unsuitable for the job.
Tidying up your Messy Data to Make it AI-ready
The data engineering task is to make this data usable by AI models.
Depending on the data source, this will require building systems to access the data, for example APIs which extract data and load it into the desired database, from which the model runs.
The data must be cleaned – removing corrupt or inaccurate records. And it must be properly structured and tagged, so that it conforms to the technical requirements of the target database and the model can interpret it in the correct way.
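A cleaning step of this kind might look like the sketch below. The field names and the plausible value range are hypothetical, chosen only to make the example concrete; a real pipeline would take them from the target database schema and from domain knowledge.

```python
import math

# Hypothetical schema and plausible range (degrees Celsius) for illustration.
REQUIRED_FIELDS = ("sensor_id", "timestamp", "value")
VALID_RANGE = (-50.0, 150.0)

def clean_records(records):
    """Split raw records into (kept, rejected).

    Rejected records are returned with a reason instead of being
    silently dropped, so that recurring problems can be traced.
    """
    kept, rejected = [], []
    for record in records:
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
            rejected.append((record, "missing field"))
            continue
        try:
            value = float(record["value"])
        except (TypeError, ValueError):
            rejected.append((record, "non-numeric value"))
            continue
        if math.isnan(value) or not (VALID_RANGE[0] <= value <= VALID_RANGE[1]):
            rejected.append((record, "out of range"))
            continue
        # Store the value as a number so the target database gets one type.
        kept.append({**record, "value": value})
    return kept, rejected
```

Keeping the rejected records alongside a reason is a design choice: a sensor that is rejected repeatedly for the same reason usually points to a problem at the source rather than in the pipeline.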
Once data is flowing, there is a need for a system for naming things and agreeing consistent data formats. For many organisations, this ideally means considerable changes to existing data collection methods – or a lot of work for data engineers converting it into correct formats.
Supplementary models can be used to automate some of the data curation. Metadata such as geo tagging or time stamps can be proxies for other information, allowing data feeds to be automatically relabelled and fed into the target database. Models can cope with problems such as inconsistent units or missing data, but only if the problem has first been identified and a model trained to deal with it.
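To make the idea of metadata as a proxy concrete, the sketch below relabels an unlabelled feed with a site name based on its geotag. It is a simple nearest-site rule rather than a trained model, and the site names, coordinates and distance threshold are all hypothetical.

```python
import math

# Hypothetical site registry: name -> (latitude, longitude) in degrees.
SITES = {
    "site_a": (51.5074, -0.1278),
    "site_b": (48.8566, 2.3522),
}
EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def label_site(geotag, max_km=5.0):
    """Return the nearest known site, or None if none is within max_km."""
    name, distance = min(
        ((site, haversine_km(geotag, coords)) for site, coords in SITES.items()),
        key=lambda pair: pair[1],
    )
    return name if distance <= max_km else None
```

Returning None when no site is close enough leaves ambiguous feeds for a person to review instead of forcing a label onto them.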
Data scientists rely on data engineers for good data for their models, and data engineers need to work with data scientists to understand likely biases in the data and to evolve models that meet their data curation challenges. Both must be involved in continued oversight after deployment, to spot problems or changes in the data and retrain models as needed.
Once we have a good model and good data, the final challenge is making it work in an enterprise environment. We will discuss this in our third article.
This article is part of our ‘Three Stages of Enterprise AI Success’ series. Download the full whitepaper here