Data science growing pains
This situation is comparable to the nascent software industry of the 1960s and 70s. The response of that community was to put software engineering on a professional footing – the Software Development Lifecycle¹. Now is the critical time to do the same for data science, articulating and crystallizing best practice within a data science framework. Otherwise, history will repeat itself, and we’ll make the same mistakes as in previous decades.
Let’s be clear – plenty of data science leaders planning and implementing AI and machine learning projects are far-sighted enough to understand the importance of professionalization and are working to develop their own frameworks. But building professional engineering standards is hard, as we know only too well. Our own data science framework, RAPIDE, is many years in the making, and we believe it is now mature enough to serve as a firm reference for others.
Data Science Is Not Yet a Well-Behaved Scientific Discipline
From the personal world of mobile devices and home assistants to operational systems controlling transportation, utilities, and vital elements of our economic infrastructure, daily life is increasingly digital and powered by AI. But without a framework for governance, data science is unable to become a well-behaved discipline.
For all of this to work effectively on the scale being promised, rapid automation reliant upon AI is essential. However, we hear ever more stories of AI failing dramatically, making choices and recommendations that are plainly nonsensical to the people who encounter them.
Extravagant marketing claims and media coverage of AI reinforce the belief that data science is a well-understood, well-regulated, and mature scientific discipline. A predictable matter of routine; just add data and stir. But anyone with hands-on experience knows this is fundamentally untrue. The application of data science, analytics, or AI demands a clear understanding of the toolkits, the core principles on which they're based, and the strengths and weaknesses of their implementation. Only then can you decide if they're a good fit for the business problem and business context in question.
When Data Science Gets It Wrong
There are many high-profile examples of AI and related data sciences making mistakes². Commentators often portray these as failures of the machine. The more interesting stories – and the real causes of failure – lie in the way the data science team designed and built the solution.
For example, Amazon abandoned a machine learning-powered recruitment application because “the machine” was inherently gender-biased. The machine, however, did exactly what it was trained to do: sort and rank CVs based upon its learned behavior. The ranking was an entirely accurate reflection of the raw information it was fed, and contained further, less newsworthy flaws beyond screening out female candidates.
The real story is that, taken alone, the CVs of current employees could not be used to train an algorithm to identify new candidates with “the right stuff”. Since the training set reflected the overwhelmingly male-dominated existing workforce, any systemic differences from writing style to personal interests reduced the ranking for female applicants. The inherently flawed design was the problem, not the technology. It is to prevent all too common implementation errors like this that a professional framework is needed.
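The flaw described above – a training set that does not represent the population the model will be applied to – is often detectable with a trivial check before any training begins. The sketch below is purely illustrative: the record structure, field names, and threshold are assumptions, not details from the Amazon case.

```python
from collections import Counter

def representation_report(records, attribute):
    """Return the share of each group in a training set for a given attribute.

    A large imbalance is a warning that a model trained on this data may
    learn proxies (writing style, personal interests, ...) for the
    over-represented group. Illustrative sketch only; the field names
    are hypothetical.
    """
    counts = Counter(record[attribute] for record in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical historical-hire CVs, echoing the article's example:
training_cvs = [
    {"gender": "male"}, {"gender": "male"}, {"gender": "male"},
    {"gender": "male"}, {"gender": "female"},
]
shares = representation_report(training_cvs, "gender")
# An 80/20 split like this should trigger a design review before training.
```

A check this simple costs minutes; the point of a professional framework is to make sure someone is required to run it before the project proceeds.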
Do You Need a Data Science Framework?
Many businesses are investing heavily in their digital futures, with AI and data science projects at the heart of high-profile digital transformation initiatives. Although every digital journey is unique, with different start and end-points, in our experience the lack of a common professional data science framework to guide multi-disciplinary, distributed teams is a repeated factor in failing projects and programs.
A growing series of setbacks and failures puts considerable strain on confidence, then budgets, and operational capability. They ultimately erode senior-level support for the digital transformation vision. An effective governance framework should incorporate a set of direct tests to help diagnose when an organization needs to strengthen its data science delivery processes to make them fit for purpose.
Consistent failure to transition projects across phases of the lifecycle – e.g. not progressing from proof-of-concept to pilot to production deployment in a controlled manner – is one of the strongest signs that you need to take decisive action to professionalize.
3 Signs That You Need a Professional Data Science Framework:
- Multiple underperforming data science projects in your portfolio – running over budget, falling behind schedule, not delivering value, and finally losing sponsorship.
- Data science projects are paused, restarted, or quietly aborted entirely.
- Significant budget and resources are flowing into supporting a never-ending stream of data science proofs-of-concept, of which very few are formally piloted and even fewer are developed into production systems and deployed across the enterprise.
What Challenges Must a Professional Data Science Framework Address?
At the core of any professional governance framework for data science lies the support and guidance that fast-growing, mixed-experience teams need to make sound judgments. This lets them get the best results that data science has to offer at each decision point as they navigate the project lifecycle from proofs-of-concept to hardened enterprise solutions.
Any data science framework needs to be repeatable and scalable with a focus on ease of exploration and rapid development, incorporating the principles of “fail-fast and fail-early”. This is particularly important for high innovation R&D and engineering environments where agile solution delivery is a commercial imperative.
It must be general enough to work everywhere, despite the ever-increasing variety of systems found within the IT landscape common to most large organizations. To be actively used by internal teams, it must play to existing strengths while also recognizing and supporting weaknesses and mitigating risk.
The core principles it's built on must apply equally to ground-up, bespoke solution engineering and to the specialist configuration of third-party AI and data science platforms. The digital future will be a subtle, highly personalized, and evolving mix of these fundamental approaches.
Why Professional Data Science Needs a Dedicated Framework
Effective engineering frameworks were being built long before data science and software became formal disciplines. What they had in common was that they reflected experience and a deep understanding of the complex challenges they were designed to overcome, and they had the awareness to incorporate insight into human as well as technical factors.
Data science is unique as a discipline but is no exception to these principles. Building best practice for data science must reflect a mastery across a broad range of advanced mathematical, statistical, and modeling techniques as well as a deep understanding of the nature of people and their relationship to data itself.
Understandably, to date, best practices applied to data science projects have been adapted from software engineering, including agile frameworks – a good foundation, but not enough. The discipline of software engineering is the controlled assembly of code into robust solutions: the conversion of clearly defined requirements into software functionality.
The focus of data science is the controlled exploration and discovery of new information and relationships, meaningful insights, and rigorous testing of hypotheses. All delivered with the speed, clarity, and consistency needed for the business to act decisively and with confidence within what may be a limited window of competitive advantage.
Agile software development methods suit high cadence and exploration. However, they lack any of the essential specifics that cover the design of training regimes, the identification and preparation of training data, bias identification and resolution, and post-deployment re-training using operational data.
How to Build a Professional Data Science Framework
Understanding that we need to do things differently, the next step is to understand the core principles and characteristics a data science framework must possess. It makes sense to start from an understanding of which software engineering principles require extending, and then clarify those additional dimensions needed in a data science framework.
Keep sight of what you aim to achieve as several tensions must be balanced. For example, data science calls for the controlled exploration of data within the context of its relation to a real-world problem. A data science framework can, however, quickly become overly prescriptive to the point of constraining the team, suppressing their natural talent, instinct, and sense of personal ownership of the solution.
The result is a team working for the framework, not the framework working for the team. With the best will in the world, any attempt to narrowly encode good intentions into a fixed process immediately limits and reduces its effectiveness. We always prefer guidance to prescription, trusting the intelligence and creativity of individuals to use the framework’s resources to reach the best solution and to seek further assistance when needed.
Not least, the framework must be relevant and accessible to a range of abilities and experience, and flexible enough to be future-proofed – without which any best practice rapidly becomes obsolete.
The 5 Characteristics of a Successful Data Science Framework
- Repeatable and agile, so that it's effective in rapid-paced and high innovation environments.
- Embraces a stage-gated, “fail-early” philosophy that rapidly establishes proven foundations for future investment.
- Multi-pass structure to facilitate the refining of rapid-start, quick wins into enterprise-ready, mature solutions.
- Technologically multilateral – providing support for the widest possible range of data science tools and techniques; prioritizing regular reassessment of the most effective and appropriate tools for each job on its merits; embracing the next wave of new technology whilst guarding against shiny and new for the sake of it.
- Equally effective at supporting the design and delivery of predictive, prescriptive, and diagnostic models and solutions.
RAPIDE – Tessella’s Professional Data Science Framework
RAPIDE is a data science governance framework developed internally by Tessella, distilled from decades spent designing and building data analytics, advanced statistical modeling, and AI and machine learning solutions for a wide cross-section of high-tech industries.
RAPIDE was built to enable a broad spectrum of data scientists to consistently identify and apply the best tools and approaches to each specific challenge. No technology, however sophisticated, makes you immune to the human tendency to treat very different data challenges as the same type of nail when you have only one data science hammer to wield. RAPIDE is the controlled way to avoid ever getting into that situation.
RAPIDE, though transparent, complete, and accessible, is fundamentally dependent upon individual practitioner skill, judgment, and experience to implement correctly. As previously explained, it is explicitly not a predetermined, directed tool for planning and executing individual project tasks or a decision tree-based approach to follow during development phases. Instead, RAPIDE directs the data scientist’s skill and understanding to inform crucial decisions while negotiating all stages of data science solution engineering.
- Pragmatic guidance on how to make the right choices during the design, implementation, and industrialization of data science and data-driven solutions.
- Clear, transparent, and accessible, yet requiring individual skill, judgment, and experience to implement correctly.
- Designed around the use of quality-checks at each step to:
- Break down complex challenges into manageable phases.
- Stop poorly defined and underperforming workstreams.
- Ensure correct data science approaches are taken at the right time and in the proper order.
- Fully compatible and integrated with modern software development frameworks, including agile and traditional waterfall approaches.
RAPIDE in Action — the Rigorous Application of AI to Predict Component Failures in Engines*
Accurate control of the physical properties of materials used in the manufacture of high-performance, high-tolerance components is vital, as operational wear and tear can lead to fatigue and expensive in-service engine failures.
The understandable response of operators is to adopt conservative inspection regimes where human experts physically inspect components for faults, irrespective of whether any problem is suspected.
To reduce the need for manual inspections, a manufacturer of high-tolerance engine parts attempted to construct a model using component sensor data to predict when a potential issue or failure was due to occur. The diagnostic approach was to spot finely nuanced changes in the sensor data collected from the component while in operation.
Their attempts at an internally developed solution did not prove fit for purpose. Repeated false positives meant it couldn’t be relied upon to isolate genuine risks of component failure.
Tessella partnered with them to assemble a new collaborative data science unit with the more robust mix of skills and professionalism needed to build an accurate, trusted AI that correctly predicted a high risk of failure. This extended the service life of the components and reduced the overhead of expensive manual inspections. The unit remained internal to the manufacturer’s organization but was jointly led and staffed.
The combined team followed the approach laid out in Tessella’s RAPIDE governance framework. The first steps included a ‘readiness assessment’ – testing the validity of the scientific principles thought to govern behavior and conducting a detailed review of data quality and completeness. This quickly revealed that the earlier algorithm was fundamentally incapable of incorporating the necessary scientific principles. It also immediately highlighted that their training data was far too sparse and variable in quality for the chosen technique – a pervasive situation in AI development.
The project then underwent an ‘advanced data screening’ phase. This revealed that the required insights could only be obtained by fusing component sensor data with other engine measurements routinely collected during day-to-day operations. The greater density of data provided enough inherent information content to pinpoint the indicators of the driving factors behind high failure risk.
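In practice, that kind of data fusion often starts as a join of the two record streams on a shared timestamp. The sketch below is a minimal illustration under assumed field names (`ts`, `vibration`, `rpm`); a real pipeline would align on time windows rather than exact timestamps.

```python
def fuse_on_timestamp(sensor_rows, engine_rows):
    """Join component sensor readings with routine engine measurements
    taken at the same timestamp.

    Minimal sketch of a data-fusion step: keep only the records where
    both streams overlap, merging their fields into one row. All field
    names here are hypothetical.
    """
    engine_by_ts = {row["ts"]: row for row in engine_rows}
    fused = []
    for s in sensor_rows:
        e = engine_by_ts.get(s["ts"])
        if e is not None:  # keep only timestamps present in both streams
            fused.append({**s, **{k: v for k, v in e.items() if k != "ts"}})
    return fused

# Hypothetical readings: sensor ts=2 and engine ts=3 have no counterpart.
sensor = [{"ts": 1, "vibration": 0.12}, {"ts": 2, "vibration": 0.31}]
engine = [{"ts": 1, "rpm": 3400}, {"ts": 3, "rpm": 2900}]
rows = fuse_on_timestamp(sensor, engine)
# → [{"ts": 1, "vibration": 0.12, "rpm": 3400}]
```

The fused rows carry more information per observation than either stream alone, which is what makes the downstream indicators recoverable.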
The team was now prepared to compile and investigate a super-set of candidate algorithms, assess them, and down-select the most effective solution purely on its merits for the problem at hand. Only the most successful model progressed to full training and validation of its predictions, using the upgraded training dataset, which the team was now confident could support its ambitious needs.
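A stage-gated down-selection like this can be sketched as a simple quality gate over candidate scores. The model names, scores, and threshold below are hypothetical, standing in for whatever cross-validated metric a real project would use.

```python
def down_select(candidates, evaluate, gate_score):
    """Score each candidate model, keep only those passing the quality
    gate, and return the best performer (or None if nothing passes,
    honoring the fail-early principle). Illustrative sketch only.
    """
    scored = {name: evaluate(model) for name, model in candidates.items()}
    passed = {name: s for name, s in scored.items() if s >= gate_score}
    if not passed:
        return None, scored  # fail early: no model progresses
    best = max(passed, key=passed.get)
    return best, scored

# Hypothetical candidates; for the sketch, each "model" is just its
# pre-computed validation score, so evaluate is the identity function.
candidates = {"gradient_boosting": 0.91, "random_forest": 0.87, "linear": 0.62}
best, scores = down_select(candidates, evaluate=lambda m: m, gate_score=0.75)
# → best == "gradient_boosting"; "linear" fails the gate and is dropped
```

The point of the gate is that only the surviving model earns the cost of full training and validation – investment follows proven foundations, not hope.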
The new sensor data-driven solution could flag a heightened risk of failure within hours of the first pre-failure events occurring, instead of the weeks or months common to manual inspection regimes. Further evolution, following the embedding of the solution into the operational business, is providing insight into future design changes that improve engine data collection and boost prediction accuracy.
The partnership of professional data scientists, working seamlessly together with the specialist engineering teams to implement the guidance of the RAPIDE governance phases, demonstrated the power of the framework to deliver a genuine and trusted AI step-change in cutting the overhead of ineffective manual inspections.
* To protect client confidentiality, the single narrative in this section is a composite of different projects. This allows us to share more information, without affecting the fundamental nature of our message.
Data science leaders have ready access to powerful AI and data science tools. But applying AI and data science techniques without the guidance of a professional framework means a high risk of project failure – reporting baseless “insights” that are misleading and potentially damaging to your business.
An effective governance framework must be clear, accessible, transparent, and ensure the multiple stages of data discovery, model building, training, hypothesis testing, and prediction validation are conducted in a rigorous, well-ordered manner.
- ¹ Software development process, Wikipedia: https://en.wikipedia.org/wiki/Software_development_process
- ² 2018 in Review: 10 AI Failures, Medium (Synced Review): https://medium.com/syncedreview/2018-in-review-10-ai-failures-c18faadf5983