2 Understanding the basics

2.1 Data projects have a lifecycle

Existing data science guides often include the concept of a data science lifecycle. Examples include the CRoss Industry Standard Process for Data Mining (CRISP-DM) and the Microsoft Team Data Science Process (TDSP).

This guide aligns broadly with the life cycle stages below, but your project may follow a different path. We’ve also included additional considerations around structuring and resourcing a health data science project.

2.2 The data science life cycle stages

Business understanding – What does the health system or business or world need?
Data acquisition and understanding – What data do we have/need? Is it ‘clean’?
Data preparation – How do we organise the data for analysis, including modelling?
Modelling – What modelling techniques should we apply?
Evaluation – Which model best meets the health system or business or world’s objectives?
Deployment – How do users and stakeholders access the results?
Maintenance - How will a model be monitored and refreshed over time?

2.3 Consider the business case

2.3.1 Research vs audit vs ‘business as usual’

People and organisations undertaking health data science may need help clarifying whether what they want to do is “research” or not. There are many definitions of research, but a simple definition is:

Research creates new knowledge, which could include new methods or processes, and could lead to the creation of new guidelines.

Business as usual (BAU) includes activities such as development, sales, and support. If your business includes the creation of new capabilities or features, this portion of your activities can be perceived as research.

Audits are activities that check whether you are following existing processes or guidelines. Testing new processes becomes research, since you are departing from the existing processes.

Quality improvement seeks to improve an existing process that is already of benefit to patients. Machine learning (ML) models incorporated into workflows can cause changes to the standard of care patients receive, including unintended consequences. Translation from a tool to implementation is essentially research to validate and determine the impact of ML on patients.

Research projects can have varying goals, such as ‘analysis only’ versus ‘analyse and model for future use’. Some projects may have a commercial focus on delivering an outcome for use in practice. It is important to define what you are doing as clearly as possible at the start of a project (see Social licence and Consent).

Note

All research undertaken in New Zealand using health data requires ethical review (see Ethics & privacy).

2.4 Clearly articulate the goal

Good data science doesn’t have to be complicated, but it should be clear. Many projects fail to clearly articulate their goal which may lead to people working to a different agenda.

Co-design - a design-led process that uses creative and participatory methods involving all stakeholders to ensure the result meets their needs - is essential at this early stage. It’s important to be explicit in these circumstances; you cannot define a problem, determine the appropriate data and methodology, interpret results or run a successful project without the input and partnership of the end-users and other stakeholders.

Define goals up front, and consider what question is being answered: Do you want to answer a question that is relevant to a specific population (often geographically defined), or to produce a model or other output that can be generalised to other populations? Is it a proof of concept where further work may be required, or does it require implementation to be used in practice?

Often the real goal is masked by sub-goals. A project that has been forced to fit within a call for proposals or programme, but where the actual goals of the researcher and those that the project is set up to address are different, is one example of this. Carefully consider the problem you’re trying to solve and seek external validation to confirm if it’s a real-world problem for end users.

2.5 Consider equity

Equity (the quality of being fair and impartial) is considered throughout this guide. When setting up a project, plan for this in each step of the process. When this is not considered upfront, overall gains come at the expense of differences in equity.

A constant view of equity, also called an “equity lens”, is also addressed throughout this guide, and co-design is central to this concept.

2.6 Choosing the right question - exploring feasibility and delivering value

Choosing the right question is an essential part of any research effort.

Before beginning, get really clear on the problem that is being solved and understand if it is clinically meaningful and adds value. Define the need or problem and then find data/technology to answer it, not the other way around. For model development, define what success looks like e.g how accurate should a model be.

Note

A human-centred design approach is essential and one type of this is co-design which involves seeking information from different stakeholders (clinicians, patients/consumers, systems users) and research from the existing evidence/base or literature on the topic. Simple frameworks like the Five Whys can be helpful to apply when exploring problems to find a root cause or work arounds. The end-users and those impacted by the use of a tool need to understand and believe in the value being delivered to drive engagement and improve translation into real world applications.

Mapping the clinical workflow into which a model fits in dynamic contexts can also useful - do clinicians have current workarounds for problems or barriers that remain useful, or is there a genuine gap ? Can the model or algorithm be responsive to real world complexity?

Often the problem that we start with might originate in the wider health sector and you may need to consider whether there is willingness and need for uptake in this wider context. Another lens for thinking about the problem can be considered throught the value that a solution delivers. Some places where value can be identified is in: improving patient outcomes or experience, improving efficiency, reducing uncertainty, prioritising resource use for those most in need and reducing the use of resource.

In building a business case, the value of data science projects needs to be expressed from the perspectives of each of the different stakeholders. Consider the unique selling points, unintended consequences, risks vs benefits from social, economic, health outcomes and reputational perspectives. Articulate value based on each stakeholder’s perspective. Consider engaging with a health economist for specialist advice. Do not exaggerate or over-sell the potential real-life impact or underplay potential risks. (These are discussed more in the Governance section.)

Note

Work through the use-case end-to-end before starting, getting clear on how the algorithm will be accessed and used, and what decisions will be made as a result.

What level of transparency or interpretability (REF transparency section) is required in order for the results to be useful? Do we only care about predictions or are we looking to drive process change?

Consider how the work will be validated or trialled (REF evaluation section) in a clinical setting, e.g. as a clinical trial or passive background comparison with production data.

Algorithms and models are only one piece of the puzzle in improving health outcomes. What are the interventions in place or available to take action based on machine learning-driven insights? For example, if someone is identified as being at higher risk of readmission to hospital,what can be offered as a result?

Is an algorithm necessarily the best solution? Consider guidance to avoid overuse and misuse of machine learning in clinical research (Volovici et al. 2022).

Volovici, V, Syn, N L, Ercole, A, Zhao, J J, and Liu, N. 2022. “Steps to avoid overuse and misuse of machine learning in clinical research”. Nature Medicine 28.10, 1996–1999. https://www.nature.com/articles/s41591-022-01961-6.

2.7 Funding

Finding funding to pursue a data science project can be difficult. In some cases, data science endeavours may be supported by a specific public or private organisation. Recognising and declaring conflicts of interest is essential to understanding the applicability of the data science results, and any biases that may apply.

In New Zealand, funding tends to be available for conceptual research, or for commercialisation activities. In our experience, there is substantial translational work required between these two phases. Potential sources for funding translative work include Callaghan Innovation and the MedTech Innovation Quarter (MedTech-iQ), as well as private companies who invest in research and development activities. Note that the NZ government’s Research and Development Tax Incentive scheme (RDTI) may incentivise these funding activities where the path to commercialisation is less certain.

It is usually necessary to pitch, propose, or otherwise justify a request for funding data science activities. You might be responding to a third party’s request for proposals (RFP) or open call for a funding round. When considering the project you want to do, think about the points below.

Who is paying for this work to be completed? What is their interest?
Who else has an interest in the outcome, including the researchers? Does their influence/interest in the outcome need to be declared or managed in any particular way?
If this work is successful, what further investment will be required to ensure that the work leads to its full potential?
If the ultimate goal is a commercial product, there is often substantial investment required to take a proof of concept to a maintainable, robust product. Think about:
- Who will provide support to the people using this product, even if they are using it only for research purposes or in informal ways?
- What is the lifecycle of the research output? Who will be able to look after it in 5 years’ time?

2.8 Legal, IP, and regulatory considerations

Legal advice should be sought, and issues related to legal, intellectual property and/or regulatory issues explicitly discussed and documented prior to you starting work. Consider who holds the data you want to use, who owns the models and what may be required for lifecycle management.

Software may be considered a medical device if it is used in the diagnosis, treatment, prevention, cure, or mitigation of diseases or other conditions. Depending on how the intended use is defined, the software may be subject to country dependent regulatory requirements. Specialised regulatory advisors should be engaged early as the process takes time and documentation for approval may need to be generated during the development process.

2.9 Useful resources

Report on Artificial Intelligence for Health in New Zealand: Hauora i te Atamai Iahiko (AI Forum 2019)
A UK guide to good practice for digital and data-driven health technologies
An Assessment Report on Algorithms published by the New Zealand government (Statistics NZ 2018)
Research and engagement considerations described by the NZ Digital Identity Programme

AI Forum. 2019. “Artificial intelligence for health in new zealand: Hauora i te atamai iahiko”. https://aiforum.org.nz/wp-content/uploads/2019/10/AI-For-Health-in-New-Zealand.pdf.

Statistics NZ. 2018. “Algorithm assessment report”. https://www.data.govt.nz/docs/algorithm-assessment-report/.