Modern data science is having an identity crisis – although presented as the heartbeat of our digital future, it fails many critical tests, not least the test of protecting us against fake news and election meddling. Old timers (IBM, Microsoft) and newer kids (Amazon, Google) offer freely available “state-of-the-art” AI toolkits, but all too often the applications built on them sit somewhere between slightly misguided and downright dangerous. Data science has big trust issues.
This situation is comparable to the nascent software industry of the 1960s and 70s. The response of that community was to put software engineering on a professional footing – the Software Development Lifecycle¹. Now is the critical time to do the same for data science, articulating and crystallizing best practice. Otherwise, history will repeat, and we’ll make the same mistakes as in previous decades.
Let’s be clear – plenty of data science leaders planning and implementing AI and machine learning projects are far-sighted enough to understand the importance of professionalization and are working to develop their own frameworks. But building professional engineering standards is hard, as we know only too well. Our own data science framework, RAPIDE, has been many years in the making, and we believe it is now mature enough to serve as a firm reference for others.
Artificial Intelligence is set to occupy an essential role in our domestic and professional lives. From the personal world of mobile devices and home assistants to operational systems controlling transportation, utilities and vital elements of our economic infrastructure, daily life is increasingly digital and powered by AI.
For all of this to work effectively on the scale being promised, rapid automation reliant upon AI is essential. However, we hear ever more stories of AI failing dramatically, making choices and recommendations that are plainly nonsensical to the people who work alongside it.
Extravagant marketing claims and media coverage of AI reinforce the belief that data science is a well understood and well-regulated, mature scientific discipline. A predictable matter of routine; just add data and stir. Anyone with hands-on experience knows this is fundamentally untrue. The application of data science, analytics or AI demands a clear understanding of the toolkits, the core principles on which they are based and the strengths and weaknesses of their implementation. Only then can you decide if they are a good fit to the business problem and business context, which must be kept in sharp focus, or you’ll lose your way.
There are many high profile examples of AI and related data sciences making mistakes². Commentators often portray them as a failure of the machine. The more interesting stories and causes of failure lie with the way the data science team designed and built the solution.
It was reported that Amazon abandoned a machine learning powered recruitment application because “the machine” was inherently gender-biased. The machine, however, did exactly what it was trained to accomplish: sort and rank CVs based upon its learned behaviour. The ranking was an entirely accurate reflection of the raw information it was fed, and contained more, but less newsworthy, flaws than screening out female candidates.
The real story is that, taken alone, the CVs of current employees could not be used to train an algorithm to identify new candidates with “the right stuff”. Since the training set reflected the overwhelmingly male-dominated existing workforce, any systemic differences, from writing style to personal interests, reduced the ranking for female applicants. Inherently flawed design was the problem, not the technology. A professional framework is needed precisely to prevent implementation errors like this, which are all too common.
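The mechanism is easy to demonstrate in miniature. The sketch below uses invented CV text and a deliberately naive word-frequency scorer (not Amazon’s actual system) to show how training only on past hires causes a term that merely correlates with an under-represented group to drag down an otherwise identical candidate’s ranking.

```python
# Illustrative sketch with hypothetical data: a naive scorer "trained"
# on the CVs of past hires learns to penalize any term that merely
# correlates with an under-represented group.
from collections import Counter

# Training set: CVs of past hires, overwhelmingly from one group.
past_hires = [
    "software engineering rugby club captain",
    "software architecture rugby team",
    "engineering lead rugby",
    "women's chess society software engineering",  # the rare exception
]

# Count how often each word appears across the past hires' CVs.
hire_words = Counter(w for cv in past_hires for w in cv.split())

def score(cv: str) -> float:
    """Average familiarity of a CV's words to the historical hire set."""
    words = cv.split()
    return sum(hire_words[w] for w in words) / len(words)

# Two equally qualified candidates, differing only in one correlated term.
candidate_a = "software engineering rugby"
candidate_b = "software engineering women's chess"

print(score(candidate_a) > score(candidate_b))  # True: the proxy term lowers B's rank
```

Nothing here is a failure of the scoring code; it faithfully reproduces the skew in its training data, which is exactly the design flaw described above.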
Many businesses are investing heavily into their digital futures, with AI and data science projects at the heart of high-profile digital transformation initiatives. Although every digital journey is unique, with different starting and end-points, in our experience the lack of a common professional framework to guide multi-disciplinary, distributed teams is a repeated factor in failing projects and programs.
The growing series of setbacks and failures puts considerable strain on confidence, then budgets, then operational capability, and ultimately erodes senior-level support for the digital transformation vision. An effective governance framework should incorporate a set of direct tests to help diagnose when an organization needs to strengthen its data science delivery processes to make them fit for purpose.
Consistent failure to transition projects across phases of the lifecycle – for example, not progressing from proof-of-concept to pilot to production deployment in a controlled manner – is one of the strongest signs that you need to take decisive action to professionalize.
Data science is all about difficult choices. At the core of any professional governance framework lies the support and guidance that fast-growing, mixed-experience teams need to make the right judgement at each decision point, so they get the best results data science has to offer as they navigate the project lifecycle from proofs-of-concept to hardened enterprise solutions.
Any data science framework needs to be repeatable and scalable with a focus on ease of exploration and rapid development, incorporating the principles of “fail-fast and fail-early”. This is particularly important for high innovation R&D and engineering environments where agile solution delivery is a commercial imperative.
It must be general enough to work everywhere, despite the ever-increasing variety of systems found within the IT landscape common to most large organizations: a non-intuitive muddle of enterprise applications, inconsistent technology choices, commodity infrastructures and bespoke solutions all serving the same business units. To be actively used by internal teams, it must play to existing strengths while also recognizing and compensating for weaknesses and mitigating risk.
The core principles it is built on must apply equally to ground-up, bespoke solution engineering and to specialist configuration of third-party AI and data science platforms. The digital future will be a subtle, highly personalised and evolving mix of these fundamental approaches.
Effective engineering frameworks were being built long before data science and software became formal disciplines. What they had in common was that they reflected experience and a deep understanding of the complex challenges they were designed to overcome, and they had the awareness to incorporate insight into human as well as technical factors.
Data science is unique as a discipline but no exception to these principles. Best practice for data science must reflect mastery across a broad range of advanced mathematical, statistical and modelling techniques, as well as a deep understanding of people and their relationship to data itself.
Understandably, to date, most best practices applied to data science projects have been adapted from software engineering, including agile frameworks. A good foundation, but not enough. The discipline of software engineering is the controlled assembly of code into robust solutions; the conversion of clearly-defined requirements into software functionality.
The focus of data science is the controlled exploration and discovery of new information and relationships, meaningful insights and rigorous testing of hypotheses. All delivered with the speed, clarity and consistency needed for business to act decisively and with confidence within what may be a limited window of competitive advantage.
Agile software development methods suit high cadence and exploration, but lack the essential specifics: the design of training regimes, the identification and preparation of training data, bias identification and resolution, and post-deployment re-training using operational data.
Understanding that we need to do things differently, the next step is to understand the key principles and characteristics that a data science framework must possess. It makes sense to us to start from an understanding of which software engineering principles require extending, and then clarify those additional dimensions needed in a data science framework.
Keep a clear sight on what you aim to achieve as there are several tensions that must be balanced. For example, data science calls for the controlled exploration of data within the context of its relation to a real-world problem. A framework can however quickly become overly prescriptive to the point of constraining the team, suppressing their natural talent, instinct, and sense of personal ownership of the solution.
The result is a team working for the framework and not the framework working for the team. With the best will in the world, any attempt to narrowly encode good intentions into a fixed process immediately limits and reduces its effectiveness. We always prefer guidance to prescription, trusting in the intelligence and creativity of the individual to use the resources to direct them to the best solution and to seek further assistance when needed.
Not least, the framework must be relevant and accessible to a range of abilities and experience, and have the flexibility to be future-proofed, without which any best practice rapidly becomes obsolete.
RAPIDE is a data science governance framework developed internally by Tessella, distilled from decades spent designing and building data analytics, advanced statistical modelling, AI and machine learning solutions for a wide cross-section of high-tech industries.
RAPIDE has been crafted to enable a broad spectrum of data scientists to consistently identify and apply the best tools and approaches to meet each specific challenge. With only a single data science hammer to wield, no fancy technology makes you immune to the human tendency to treat what are in reality very different data challenges as the same type of nail. RAPIDE is the controlled way to avoid that situation.
RAPIDE, though transparent, complete and accessible, is fundamentally dependent upon individual practitioner skill, judgement and experience to implement correctly. As previously explained, it is explicitly not a predetermined, directed tool for planning and executing individual project tasks or a decision tree-based approach to be followed during development phases. Instead, RAPIDE directs the data scientist’s skill and understanding to inform the crucial choices that need to be made whilst negotiating all phases of data science solution engineering.
Accurate control of the physical properties of materials used in the manufacture of high-performance, high-tolerance components is vital, as operational wear and tear can lead to fatigue and expensive in-service engine failures.
The understandable response of operators is to adopt conservative inspection regimes where human experts physically inspect components for faults, irrespective of whether any problem is suspected.
To reduce the need for manual inspections, a manufacturer of specialist, high-tolerance engine parts attempted to construct a model using component sensor data to predict when a potential issue or failure was due to occur. The diagnostic basis was to spot finely nuanced changes in the sensor data collected from the component while in operation.
Their own attempts at an internally developed solution did not prove fit for purpose. Repeated false positives meant it could not be relied upon to isolate genuine risks of component failure.
We partnered with them to assemble a new collaborative data science unit, with the stronger mix of skills and professionalism needed to build an accurate and trusted AI that correctly predicted high risk of failure, prolonging the life term of the components and reducing the overhead of expensive manual inspections. The unit remained internal to the manufacturer’s organization but was jointly led and staffed.
The combined team followed the phases and approach laid out in Tessella’s RAPIDE governance framework. First steps included a ‘readiness assessment’ – testing the validity of the scientific principles thought to govern behaviour and conducting a detailed review of data quality and completeness. This quickly revealed that the earlier algorithm was fundamentally incapable of incorporating the necessary scientific principles. It also immediately highlighted that their training data was far too sparse and variable in quality for the chosen technique – a very common situation.
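The kind of check a readiness assessment performs on training data can be sketched simply. The snippet below is a hypothetical illustration, not part of RAPIDE itself: it flags sensor channels whose data is too sparse or too variable to train on, with invented channel names and thresholds.

```python
# Hypothetical data-screening check: flag channels whose readings are
# too sparse (many missing values) or too variable (high spread
# relative to the mean) to support model training. Thresholds and
# channel data are illustrative, not drawn from the project described.
import statistics

def screen_channel(readings, max_missing=0.2, max_cv=1.0):
    """Return a list of reasons the channel fails screening (empty = pass)."""
    problems = []
    present = [r for r in readings if r is not None]
    missing_rate = 1 - len(present) / len(readings)
    if missing_rate > max_missing:
        problems.append(f"too sparse: {missing_rate:.0%} missing")
    if len(present) >= 2:
        mean = statistics.mean(present)
        # Coefficient of variation: spread relative to the typical value.
        if mean and statistics.stdev(present) / abs(mean) > max_cv:
            problems.append("too variable: coefficient of variation > 1")
    return problems

vibration = [0.9, None, 1.1, None, None, 1.0, None, None]        # sparse
temperature = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 70.1, 70.0]   # healthy

print(screen_channel(vibration))    # flags sparsity
print(screen_channel(temperature))  # passes: []
```

Running such checks before any modelling begins is what allowed the team to discover, early and cheaply, that the original training data could not support the chosen technique.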
This was followed by an ‘advanced data screening’ phase, which revealed that the required insights could only be obtained by fusing the component sensor data with other engine measurements routinely collected during day-to-day operation. The greater density of data provided enough inherent information content to pinpoint the indicators of the driving factors behind a high predicted failure risk.
The team was now prepared to compile and investigate a super-set of candidate algorithms, assess them, and down-select the most effective solution on its merits for this specific problem. Only the most successful model progressed to full training and validation of its predictions, based upon the upgraded training dataset, which we were now confident would support our ambitious needs.
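The down-selection step can be illustrated in miniature. The sketch below, with invented held-out data and alert rules, scores each candidate on a single held-out set by F1, a metric that balances catching true failures against the false alarms that undermined the earlier system; a real project would use richer models and fuller validation.

```python
# Hypothetical sketch of down-selecting between candidate alert rules
# on held-out data. The readings, labels and thresholds are invented
# for illustration only.

def f1(predictions, labels):
    """F1 score: harmonic mean of precision and recall."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum((not p) and l for p, l in zip(predictions, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Held-out sensor summaries and whether a failure actually followed.
held_out = [(0.2, False), (0.9, True), (0.4, False), (0.8, True), (0.5, False)]
labels = [y for _, y in held_out]

# Each candidate rule produces alert decisions on the held-out set.
candidates = {
    "alert if reading > 0.3": [x > 0.3 for x, _ in held_out],
    "alert if reading > 0.7": [x > 0.7 for x, _ in held_out],
}

best = max(candidates, key=lambda name: f1(candidates[name], labels))
print(best)  # → alert if reading > 0.7 (fewer false alarms, same failures caught)
```

Scoring on data the candidates never saw during fitting is what keeps the comparison honest and guards against repeating the false-positive problem of the earlier attempt.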
The new sensor data-driven solution could flag a heightened risk of failure within hours of the first pre-failure events occurring, instead of the weeks or months typical of manual inspection regimes. Further evolution, following embedding of the solution into the operational business, is providing insight into future design changes that improve engine data collection and boost prediction accuracy.
The partnership of professional data scientists, working seamlessly together with the specialist engineering teams to implement the guidance of the RAPIDE governance phases, demonstrated the power of the framework to deliver a genuine and trusted AI step change in cutting the overhead of ineffective manual inspections.
* To protect client confidentiality, the single narrative in this section is a composite of different projects. This allows us to share more information, without affecting the fundamental nature of our message.
Data science leaders have ready access to powerful AI and data science tools. But applying AI and data science techniques without the guidance of a professional framework means a high risk of project failure – reporting baseless “insights” that are misleading and probably damaging to your business.
An effective governance framework must be clear, accessible, transparent and ensure the multiple stages of data discovery, model building, training, hypothesis testing, prediction validation, etc. are conducted in a rigorous, well-ordered manner.
The focus should be on guidance and support for independent thinking, rather than prescription. Developing a professional framework for data science is not easy. Software engineering frameworks offer a good starting point but need considerable enhancement.
Tessella’s RAPIDE is a world class example of a successfully proven framework that underpins hundreds of data science projects delivered each year across multiple high-tech industries. We consider it to be a strong foundation for those considering building frameworks further specialized to their own needs.
1. Software development process – Wikipedia: https://en.wikipedia.org/wiki/Software_development_process
2. 2018 in Review: 10 AI Failures – Synced Review, Medium: https://medium.com/syncedreview/2018-in-review-10-ai-failures-c18faadf5983