9  Evaluation

After a model has been created and validated, you should evaluate it to determine whether it can address the question you initially set out to answer. In some cases this can take the form of a formal prospective data study, where an independent dataset is collected, and the performance of the model on this dataset is assessed. Models that affect patient standard of care will be subject to ethical processes which should be considered upfront (see Ethics).

Existing models, such as from other countries or jurisdictions, can also be evaluated against a local dataset, which is a good way to tell how useful someone else’s model may be on your own data.

Lags in data availability can affect evaluation: a model that goes live today may not be able to be effectively evaluated until several months later.

Possible social and business impacts of model deployment should be considered in the governance process. Once a model has been deployed, actual impacts can be assessed through data collection, which can also include interviews with interested parties.

9.1 Model safety

Making a model safe for use in a healthcare setting generally involves ensuring that its false positive and false negative rates are acceptably low.

Note

From a clinical perspective, the main area of concern is often false negatives: individuals classified as low risk who go on to have a positive outcome. You should include clinical experts in discussions on appropriate model thresholds to address this.

When models are implemented, errors in classifying high-risk individuals can often be identified because classification as high risk triggers a follow-up step aimed at reducing that risk. Errors in classifying low-risk individuals, however, usually have two possible explanations:

  1. The outcome was unpredictable

  2. Using additional data could improve the accuracy of the model’s prediction

It is usually difficult for a data scientist to know what additional data could have been used. It’s helpful for your clinical lead to audit a number of false negative results to determine whether the outcome was truly unpredictable or whether additional data would help with the prediction.
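
As a minimal illustration of how such an audit sample might be pulled together for clinical review, the sketch below uses hypothetical column names (‘predicted_risk’, ‘outcome’) and a made-up threshold; it is not a prescribed workflow.

```python
# A minimal sketch, assuming a scored cohort with hypothetical columns
# 'predicted_risk' and 'outcome' (1 = the outcome occurred).
import pandas as pd

def sample_false_negatives(scored: pd.DataFrame, threshold: float, n: int = 20) -> pd.DataFrame:
    """Return a random sample of cases classified as low risk that had the outcome."""
    false_negatives = scored[(scored["predicted_risk"] < threshold) & (scored["outcome"] == 1)]
    return false_negatives.sample(n=min(n, len(false_negatives)), random_state=0)

# Example usage: hand the sampled cases to the clinical lead for audit
# audit_sample = sample_false_negatives(scored_cohort, threshold=0.3)
```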

9.2 Bias in data

Data is collected in a particular context, which leads to bias. This can come from different sources including historical bias, data imbalance, missingness, and human prejudice. There’s no single best definition of bias or fairness that applies equally well for every data science application.

When training machine learning models on historically collected data, or drawing any conclusion from data, you should be mindful of potential bias in the data, particularly regarding sensitive attributes such as age, ethnicity and gender. Bias-related harms can be reinforced by machine learning models and systems.

The Manatū Hauora Ministry of Health’s Emerging Health Technology Introductory Guidance (PDF) section on Bias is a good place to start learning.

Machine learning fairness is a broad topic; relevant reading includes the fairness section of Google’s machine learning crash course, A Marketer’s Guide to Machine Learning Fairness, and AI Fairness 360 (Bellamy et al. 2018).

Bellamy, R K E, Dey, K, Hind, M, Hoffman, S C, Houde, S, Kannan, K, Lohia, P, Martino, J, Mehta, S, Mojsilovic, A, Nagar, S, Ramamurthy, K N, Richards, J, Saha, D, Sattigeri, P, Singh, M, Varshney, K R, and Zhang, Y. 2018. “AI fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias”. http://arxiv.org/abs/1810.01943.

Due to the historical focus of cohort studies, certain population groups - such as some ethnic groups or females - may be underrepresented, making them more vulnerable to bias in any conclusions drawn or models trained from those studies.

In evaluation work, it’s important to measure a model’s goodness of fit, accuracy and other metrics from multiple perspectives rather than relying on overall metrics alone. Consider the following basic measurements with respect to sensitive attributes (for example gender and ethnicity); a sketch of this kind of stratified reporting follows the list:

  1. Differences in actual patient data metrics stratified by sensitive attributes - whether there is any inequity among the stratified groups, and what bias it may bring into the models

  2. Differences in predicted outcome metrics stratified by sensitive attributes - whether there is any inequity in model predictions among the stratified groups, and what downstream consequences this may cause

  3. Differences in model performance metrics among the stratified groups - whether the model is treating the groups equally, and what downstream consequences this may cause.
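
A minimal sketch of stratified reporting is shown below, using made-up data and an illustrative sensitive attribute; the three printed quantities correspond loosely to the three points above.

```python
# A minimal sketch with made-up data; 'ethnicity', 'outcome' and 'pred_prob'
# are illustrative column names, not from any real collection.
import pandas as pd
from sklearn.metrics import recall_score

df = pd.DataFrame({
    "ethnicity": ["A", "A", "B", "B", "B", "A", "B", "A"],
    "outcome":   [1, 0, 1, 0, 1, 1, 0, 0],                   # actual outcome
    "pred_prob": [0.9, 0.2, 0.4, 0.1, 0.7, 0.8, 0.6, 0.3],   # model risk score
})
df["pred_label"] = (df["pred_prob"] >= 0.5).astype(int)

for group, g in df.groupby("ethnicity"):
    print(
        f"{group}: "
        f"actual outcome rate {g['outcome'].mean():.2f}, "                       # point 1
        f"predicted positive rate {g['pred_label'].mean():.2f}, "                # point 2
        f"sensitivity {recall_score(g['outcome'], g['pred_label']):.2f}"         # point 3
    )
```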

Reporting metrics stratified by sensitive attributes, whenever applicable, makes it easier to maintain an equity lens. For an example, see Kurian et al. (2021), which reports the performance of the IBIS/Tyrer-Cuzick model of breast cancer risk by race and ethnicity in the Women’s Health Initiative.

Kurian, A W, Hughes, E, Simmons, T, Bernhisel, R, Probst, B, Meek, S, Caswell‐Jin, J L, John, E M, Lanchbury, J S, Slavin, T P, Wagner, S, Gutin, A, Rohan, T E, Shadyab, A H, Manson, J E, Lane, D, Chlebowski, R T, and Stefanick, M L. 2021. “Performance of the IBIS/Tyrer‐Cuzick model of breast cancer risk by race and ethnicity in the Women’s Health Initiative”. Cancer 127.20, 3742–3750. https://onlinelibrary.wiley.com/doi/10.1002/cncr.33767.

There’s a trade-off between data informativeness, model performance and fairness, and most bias mitigation methods involve navigating this balance. We highly recommend taking the use case and its follow-on impacts into account when deciding which bias mitigation method to use, and how to use it.

Mitigating bias and improving fairness is mostly not a technical challenge but a much broader systemic challenge. We recommend including diverse voices and perspectives in data science work, for example by having Māori researchers involved in your project.

Note

Even if no mitigation can be included, we recommend that the bias itself is analysed and reported where possible; in particular, it is important to identify who is most vulnerable to the bias-related harms.

9.2.1 Useful tools and resources

Wolff, R F, Moons, K G M, Riley, R D, Whiting, P F, Westwood, M, Collins, G S, Reitsma, J B, Kleijnen, J, and Mallett, S. 2019. “PROBAST: A tool to assess the risk of bias and applicability of prediction model studies”. Annals of Internal Medicine 170.1, 51–58. https://www.acpjournals.org/doi/10.7326/M18-1376.
Moons, K G M, Wolff, R F, Riley, R D, Whiting, P F, Westwood, M, Collins, G S, Reitsma, J B, Kleijnen, J, and Mallett, S. 2019. “PROBAST: A tool to assess risk of bias and applicability of prediction model studies: Explanation and elaboration”. Annals of Internal Medicine 170.1, W1. http://annals.org/article.aspx?doi=10.7326/M18-1377.

9.3 Keep subject matter experts involved throughout

It’s very important to continually engage subject matter experts, such as clinicians, throughout your project. This can range from involving clinicians at an advisory level to involving them in the iterative development of models, for example by providing labels.

Tip

It’s incredibly important to keep the perspectives of subject matter experts front of mind throughout your project. If these experts are not involved in the data science process, their acceptance of the outputs will be limited or absent. This may mean your model isn’t used to its full potential - or, in the worst case, isn’t used at all.

9.4 Transparency, interpretability, and explanation

In the context of health, it’s especially important to communicate findings in a way that builds trust in your model development process and outputs. If trust isn’t built in your model, it may not be adopted in practice. However, while robust and transparent explainability may seem like a good goal for making an AI model useful, there can be disadvantages to this approach (Babic et al. 2021).

Babic, B, Gerke, S, Evgeniou, T, and Cohen, I G. 2021. “Beware explanations from AI in health care”. Science 373.6552, 284–286. https://www.science.org/doi/abs/10.1126/science.abg1834.

The requirement for models to be transparent, interpretable and explainable should guide your early modelling decisions, such as which algorithm to use and how to treat inputs. Linear models tend to be more interpretable and explainable; so too are models that are built with inputs that haven’t been subject to significant transformations or weightings.

The choice of a ‘simpler’ model may compromise model performance, meaning better outcomes are sacrificed for the sake of explainability. There is little point however in deploying a deep learning model that has excellent performance, but isn’t trusted or used. You should carefully consider this trade-off.

You should also consider outcome measures that are tangible and meaningful to your end user, and that have some relationship to the project goal, such as hospitalisation, mortality, or rankings.

Finally, you should indicate which inputs matter most to model performance or model outputs (feature importance), how the model inputs are weighted in relation to model outputs, and what that means in practice. For example, if a model includes modifiable inputs, a model user will want to know how changes to those inputs might affect the outcome. From an equity perspective, a model user will also want to know to what extent inputs such as ethnicity, age, gender or deprivation affect the outcome.
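
As one way of surfacing this kind of feature importance, the sketch below uses scikit-learn’s permutation importance on synthetic data with illustrative feature names; other approaches, such as model coefficients or Shapley values, serve the same purpose.

```python
# A minimal sketch with synthetic data and illustrative feature names.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 95, 2000),
    "deprivation_index": rng.integers(1, 11, 2000),
    "smoker": rng.integers(0, 2, 2000),
})
y = (rng.random(2000) < (X["age"] / 200 + 0.03 * X["smoker"])).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# How much does shuffling each input degrade performance?
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, importance in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {importance:.4f}")
```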

These considerations are also relevant to the governance of models. The GDPR specifically emphasises a model’s transparency, accountability and governance (see Kaminski and Malgieri 2021).

Kaminski, M E and Malgieri, G. 2021. “Algorithmic impact assessments under the GDPR: Producing multi-layered explanations”. International Data Privacy Law 11.2, 125–144. https://academic.oup.com/idpl/article/11/2/125/6024963.

9.4.1 Transparency around the inputs

It’s important to provide good data definitions and reasons for how data has been treated. For example:

  • “The model includes age but it has been grouped into 3 categories (18-39, 40-59 and 60+) to simplify the model and handle outliers without compromising performance” (see the snippet after this list); or
  • “The count of regular medications includes medications that have been prescribed in the two years prior to the test positive date for this infection. Regular medications are medications that have been prescribed at least four times over that period. Prescription data rather than dispensing data is used as it has better coverage.”
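
As an illustration of the age grouping described in the first example, the sketch below uses pandas with a hypothetical ‘age’ column; the boundaries are the ones quoted above.

```python
# A minimal sketch of the quoted age grouping, assuming a DataFrame with a
# hypothetical 'age' column.
import pandas as pd

df = pd.DataFrame({"age": [22, 45, 63, 38, 71]})
df["age_group"] = pd.cut(
    df["age"],
    bins=[18, 40, 60, float("inf")],
    labels=["18-39", "40-59", "60+"],
    right=False,  # intervals are [18, 40), [40, 60), [60, inf)
)
print(df)
```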

Providing the provenance of data is also important. For example:

“This data came from a national collection of data that went through a quality assurance process. It is current and relevant to the problem being answered in this way…”

9.4.2 Interpretation of the output

Different algorithms produce different types of output, such as risk scores, predictions, and simulation results. Whatever the output, users of your model need to understand what the value means in the context of how it will be used.

For example, depending on how the model is built and the use case, a risk score of 0.8 might mean an 80% probability of an outcome, or it may be a value that only makes sense when compared with other people’s scores in a cohort. If the risk score has been developed for a ranking use case, your end user needs to understand that the output for a given person has meaning in relation to outputs for other people (to determine who is at higher risk) rather than as a standalone value.
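
For a ranking use case, one simple way to make this relative meaning explicit is to report each person’s score as a percentile within the cohort, as in the sketch below (scores are made up).

```python
# A minimal sketch: converting raw risk scores to cohort percentiles so the
# output reads as a ranking rather than a probability. Scores are made up.
import pandas as pd

cohort = pd.DataFrame({"person_id": [1, 2, 3, 4, 5],
                       "risk_score": [0.80, 0.10, 0.55, 0.92, 0.40]})
cohort["risk_percentile"] = cohort["risk_score"].rank(pct=True) * 100
print(cohort.sort_values("risk_percentile", ascending=False))
```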

9.4.3 Transparency of algorithm development

You should audit the behaviour of an algorithm at the population level. Unexpected behaviours indicate defects or limitations of the model and are roadblocks to gaining trust. For example, the relationship between predicted values and certain covariates (e.g. predicted mortality risk vs. age) in the validation cohort should match empirical evidence or clinical judgement. Interpretation techniques such as partial dependence plots can facilitate such investigation.
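
A minimal sketch of such a population-level check, using synthetic data and illustrative feature names, is shown below: the partial dependence of predicted risk on age should broadly match clinical expectations.

```python
# A minimal sketch with synthetic data and illustrative feature names.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 95, 2000),
    "prior_admissions": rng.poisson(1.0, 2000),
})
y = (rng.random(2000) < (X["age"] / 150 + 0.05 * X["prior_admissions"])).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Partial dependence of predicted risk on age for the cohort; the curve
# should broadly increase with age if the model behaves as expected
PartialDependenceDisplay.from_estimator(model, X, features=["age"])
plt.show()
```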

You should also be able to explain model predictions for specific individuals. For example, a person’s readmission risk score can be attributed to age, previous hospitalisation history, cancer diagnosis, ethnicity and other risk factors. Local surrogate models and Shapley value-based methods are commonly adopted techniques for providing explanations at the individual level. Interpretable Machine Learning (Molnar 2022) provides more detail about model interpretation techniques.

Molnar, C. 2022. “Interpretable machine learning”. https://christophm.github.io/interpretable-ml-book/index.html.
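
As a rough illustration of the local surrogate idea (not code from any of the sources above), the sketch below fits a weighted linear model to a black-box model’s predictions around a single individual; the data and feature names are synthetic.

```python
# A minimal sketch of a local surrogate explanation: a linear model fitted to
# the black-box predictions on points perturbed around one individual,
# weighted by closeness to that individual. Data and features are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
features = ["age", "prior_admissions", "cancer_dx"]
X = np.column_stack([
    rng.integers(18, 90, 1000),   # age
    rng.poisson(1.2, 1000),       # previous hospitalisations
    rng.integers(0, 2, 1000),     # cancer diagnosis flag
]).astype(float)
y = ((X[:, 0] / 90 + 0.4 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.3, 1000)) > 1.2).astype(int)

black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Individual whose risk score we want to explain
x0 = X[0]

# Perturb around the individual and get the black-box risk for each perturbation
perturbed = x0 + rng.normal(0, X.std(axis=0) * 0.5, size=(500, X.shape[1]))
risk = black_box.predict_proba(perturbed)[:, 1]

# Weight perturbations by closeness to the individual, then fit a weighted
# linear surrogate whose coefficients act as local explanations
dist = np.linalg.norm((perturbed - x0) / X.std(axis=0), axis=1)
weights = np.exp(-(dist ** 2))
surrogate = Ridge(alpha=1.0).fit(perturbed, risk, sample_weight=weights)

for name, coef in zip(features, surrogate.coef_):
    print(f"{name}: local effect on risk {coef:+.3f}")
```
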
Tip

Clinicians or other stakeholders need to be involved in population- and individual-level algorithm auditing, with their feedback driving iterations of algorithm development.

Where possible, making the code base and dataset public adds credibility: others can understand the data and how the model was developed, replicate findings themselves, challenge development assumptions, or even suggest improvements.

9.4.4 Understanding the performance of an algorithm

Defining performance metrics for your algorithm in the context of the problem it’s trying to solve is important.

Take, for example, a use case where a model is being used to predict a condition (with prevalence 2%) that requires an intervention. A sensitivity of 0.95 means that, of the people who have the condition, 95% are identified by the model. Putting this into context, if 1000 people a day were assessed, we would expect 20 of them to require the intervention. The model would identify 19 of those people, meaning that around 5 people with the condition would be missed over a five-day week if we relied solely on model outputs.

Information like this can help a clinician understand that while the model performs well, in practice they might like to supplement model outputs with other assessments.

For models that output probabilities, such as logistic regression, such analyses can be useful in helping clinicians quantify and understand the trade-off between false positives and false negatives in order to decide which decision thresholds may be appropriate.
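
A minimal sketch of this kind of threshold analysis, on synthetic data with roughly 2% prevalence, is given below; the thresholds and counts are illustrative only.

```python
# A minimal sketch (synthetic data only) of quantifying the false positive /
# false negative trade-off across candidate decision thresholds for a
# probability-outputting model such as logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)  # ~2% prevalence
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for threshold in (0.1, 0.3, 0.5):
    tn, fp, fn, tp = confusion_matrix(y_te, (probs >= threshold).astype(int)).ravel()
    print(f"threshold {threshold}: {fn} cases missed (false negatives), "
          f"{fp} people flagged unnecessarily (false positives)")
```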

Tip

Plain language and accessible explanations not only help build trust in the model; they help the user understand how to use the model, which helps it to be adopted.

9.4.5 Understanding the impact of an algorithm

Tip

You should evaluate your algorithm’s benefit and cost against realistic scenarios. Classic model performance metrics such as RMSE, accuracy, precision or recall generally won’t go far enough to illustrate the consequences of integrating the algorithm into a healthcare workflow.

How your algorithm can be integrated into a workflow needs to be understood and well documented. It’s worth further evaluating what the potential healthcare outcome could be, especially when clinical capacity is limited.

Here’s an example: a model is developed to classify GP-referred patients into high and low priority using a priority score threshold, in order to deliver timely triaging. However, there are already many people waiting in the triage queue, new referrals arrive each day, and the number of referral requests clinicians can process each day is both limited and uncertain. Looking at the classification metrics of this model alone, without knowing how and at which step of the process it will be used, it’s hard to tell whether integrating the model will bring more benefit than cost in terms of waiting time for patients in urgent need.

Defining proper impact metrics that take the use case and goal of your modelling into consideration, and carefully running through a further evaluation against workflow, will provide more informative insights than classic model performance metrics alone.

Deterministic or stochastic simulation techniques can be used for this evaluation when a working scenario can be described quantitatively; a simplified sketch of such a simulation follows the reference below. An example is the New Zealand business case for a hospital avoidance programme using a readmission risk model, which presented its financial impact on healthcare (Vaithianathan et al. 2012).

Vaithianathan, R, Jiang, N, and Ashton, T. 2012. “A model for predicting readmission risk in New Zealand”. http://hdl.handle.net/10419/242509.
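
As a rough illustration (not the cited study’s method), the sketch below simulates a simplified triage queue with Poisson arrivals and capacity, comparing mean waiting time for urgent referrals with and without the priority model; all arrival rates, capacities and model accuracy figures are assumed.

```python
# A minimal sketch of a stochastic triage-queue simulation; every number
# below (arrival rate, capacity, prevalence, sensitivity, specificity) is an
# assumed, illustrative value.
import numpy as np

DAYS, ARRIVALS_PER_DAY, CAPACITY_PER_DAY = 200, 24, 25
P_URGENT, MODEL_SENSITIVITY, MODEL_SPECIFICITY = 0.2, 0.85, 0.9

def simulate(use_model: bool) -> float:
    rng = np.random.default_rng(1)
    queue = []            # each entry: (arrival_day, is_urgent, flagged_high_priority)
    urgent_waits = []
    for day in range(DAYS):
        # New referrals arrive, each scored by the (imperfect) priority model
        for _ in range(rng.poisson(ARRIVALS_PER_DAY)):
            urgent = rng.random() < P_URGENT
            flagged = (rng.random() < MODEL_SENSITIVITY) if urgent \
                      else (rng.random() > MODEL_SPECIFICITY)
            queue.append((day, urgent, flagged))
        if use_model:     # model-flagged referrals are triaged first, then FIFO
            queue.sort(key=lambda r: (not r[2], r[0]))
        # Clinicians process a limited, uncertain number of referrals per day
        capacity = rng.poisson(CAPACITY_PER_DAY)
        for _ in range(min(capacity, len(queue))):
            arrival, urgent, _ = queue.pop(0)
            if urgent:
                urgent_waits.append(day - arrival)
    return float(np.mean(urgent_waits))

print("mean wait (days) for urgent referrals, FIFO:      ", simulate(False))
print("mean wait (days) for urgent referrals, with model:", simulate(True))
```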