7  Data preparation

7.1 Data wrangling

Data wrangling is the process of cleaning and unifying complex datasets for easy access and analysis. Good data wrangling practice is based on comprehensive knowledge of the data. We recommend the following:

  • For a given data source, learn as much as you can about its collection, storage, how it’s updated and maintained, the definition and dependency of each data item, and its limitations. There are additional upfront ethical considerations when dealing with health data, including consent, ethics approval and understanding the equity and clinical context.

  • For local data collection, consider how to store it (in files, in a database, locally or in the cloud, etc.) to make information retrieval and/or data sharing easier (see Data Management).

  • Consider the size of the data - should it be processed all at once, batch processed or stream processed? Stream processing is the processing of records as a stream of data, for example record by record. Your computing location (on your computer, on a local server, or on a cloud provider) may affect your decision here (see the batch-processing sketch after this list).

  • Consider if the data needs de-identification (see Data identifiability).

  • Maintain an equity lens for any type of data science work, including building and evaluating models (see Bias in data).

  • Involve clinical leads when data wrangling. Within healthcare, the risk associated with some data (for example, laboratory data or vitals) is often distributed at the extreme ends. It’s difficult for data scientists to know if the risk is distributed linearly without the clinical context.
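
A minimal sketch of batch-style processing with pandas, for when a file is too large to load at once; the file name, column name and chunk size are hypothetical:

```python
import pandas as pd

# Process a large CSV in batches of 100,000 records rather than loading it
# all at once, aggregating as we go (file and column names are hypothetical).
chunk_summaries = []
for chunk in pd.read_csv("observations.csv", chunksize=100_000):
    chunk_summaries.append(chunk.groupby("admission_date").size())

# Combine the per-chunk aggregates into a single daily count.
daily_counts = pd.concat(chunk_summaries).groupby(level=0).sum()
```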

7.1.1 Data labelling

Data labelling can have a significant impact on the success of a data science project. A label is a category you might tag a record with for a model to learn. For instance, you might label an ED admission according to whether that patient is high priority for triaging, or you might label a medical alert as containing an adverse drug reaction. When learning, a model finds patterns in the data attributes that map to the labels. In the triage example, attributes that are mapped to the label “is high priority” might include body temperature, blood pressure, heart rate and oximeter reading.
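
As a toy illustration of the triage example, labelled training data might look like the following (all values and column names are hypothetical):

```python
import pandas as pd

# Attribute columns are the inputs a model learns from; is_high_priority is
# the label it learns to predict (assigned by a human labeller).
records = pd.DataFrame({
    "body_temp_c":      [39.4, 36.8, 38.1],
    "systolic_bp":      [85, 120, 110],
    "heart_rate":       [130, 72, 95],
    "oximeter_reading": [88, 98, 96],
    "is_high_priority": [1, 0, 0],
})
```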

Aim for labels that are clearly and appropriately defined, involving multiple qualified labellers.

7.1.1.1 Defining labels

Be precise, and aim to align the definitions with potential use cases. For example, radiology reports might already be labelled according to the nature of the report findings, such as “finding”, “finding that is not of concern” or “no finding”. However, if our goal is to know whether a report requires follow-up, and we would like to use machine learning techniques over those reports and labels to achieve this, the initial labels may not be useful: some reports labelled “finding that is not of concern” may need follow-up and some may not. The labels “follow-up required” and “no follow-up required” are more directly aligned to the use case here.

Be clear. A clear label definition will help both human labellers and model performance. If the boundary between labels is fuzzy, a person will either struggle to choose a label for a data record, or will apply labels inconsistently. If labels are inconsistent in the training set, then we might reasonably expect that the model will also struggle to find the boundary between labels. In the radiology report example, imagine that the use case aligns with categorising reports according to findings. Consider a person with a pre-existing mass considered benign, whose follow-up scan reveals that the mass has not changed. Different human labellers would need clear definitions to label this report consistently. Without a clear definition, one labeller may decide that there is a “finding that is not of concern” based on the presence of a mass of no concern, while another may decide that there is “no finding” given nothing has changed since the last scan.

When there is a period of time between labelling iterations, the same human labeller may make inconsistent labelling choices. Clear label definitions will also help with label consistency over time.

7.1.1.2 Many labellers make models work

Even with clear definitions, it is normal to find that different human labellers may vary in how they apply a labelling strategy with no one person being the “source of truth”. Inter-labeller variation should be expected. To reduce the effects of this variation, involve many labellers so that each record has labels assigned by two or more people. Labellers will disagree, so when training a model, you could either decide to use records with labels that human labellers unanimously agree on, or those that the majority of labellers agree on, excluding the records with label disagreement from the training set.
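
A minimal sketch of resolving labels from multiple labellers, assuming labels are stored as one row per record per labeller (all names and values are hypothetical):

```python
import pandas as pd

# One row per (record, labeller) pair.
labels = pd.DataFrame({
    "record_id": [1, 1, 1, 2, 2, 2],
    "labeller":  ["A", "B", "C", "A", "B", "C"],
    "label":     ["follow-up required", "follow-up required", "no follow-up required",
                  "no follow-up required", "no follow-up required", "no follow-up required"],
})

# Count how many labellers chose each label for each record.
counts = pd.crosstab(labels["record_id"], labels["label"])

# The most common label per record, and the proportion of labellers who chose it.
resolved = pd.DataFrame({
    "label": counts.idxmax(axis=1),
    "agreement": counts.max(axis=1) / counts.sum(axis=1),
})

# Keep only records with unanimous agreement; relax to > 0.5 for a simple majority.
training_labels = resolved[resolved["agreement"] == 1.0]
```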

The reality is that data labelling in a health context often requires highly skilled and experienced practitioners working under time constraints. If you only have one person who can provide the time required (for what can be a tedious task), then investment in clearly defining labels upfront is especially important.

For ongoing labelling efforts, incorporating labelling into a clinical workflow may be an efficient strategy.

7.1.2 Data quality

It’s important to conduct a data quality review as an early step and produce statistical summaries for checking purposes. This process usually includes profiling the fields within a data set, visualising distributions, checking relationships between fields and exploring data completeness (including completeness over time). Involve the data owners and subject matter experts in this process and consider additional rounds of quality reviews if the data is complex.

Data profiling (for example, with Pandas-profiling) is recommended for both data quality checking and data understanding. Data profiling involves providing descriptive summaries and visualisations of the data for review with clinicians and subject matter experts.
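
For example, a minimal profiling sketch (pandas-profiling is now published as ydata-profiling) looks like the following; the file and report names are hypothetical:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # formerly pandas-profiling

# Generate an HTML profiling report to review with clinicians and subject
# matter experts (file and report names are hypothetical).
df = pd.read_csv("lab_results.csv")
profile = ProfileReport(df, title="Lab results data profile")
profile.to_file("lab_results_profile.html")
```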

Basic aspects of a data quality check include (a minimal sketch of some of these checks follows this list):

  • Data availability across time

  • Data distribution change across time

  • Outliers: Identification of extreme/outlier values - do variables with extreme values need to be capped?

  • Data completeness/missingness

  • Duplication in the data

  • Target label distribution (data imbalance)

  • Association between features and the target

  • Data timeliness: The data source provides the data within an appropriate timeframe

  • Scalability: The data can be accessed by many components without losing its meaning

  • Provenance: The data comes from a valid, authoritative source

  • Locality: The location of the data source should be provided so that the relevance of the data can be checked

  • Structure: The form of the data

  • Adoption: The data are useful for the project purpose and adaptable to the requirements, including operational requirements (will the data be refreshed, timely and in the expected format?)

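A minimal sketch of some of these checks with pandas; the file, column and target names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["event_date"])

# Data availability and distribution across time: record counts per month.
print(df.groupby(df["event_date"].dt.to_period("M")).size())

# Completeness/missingness: proportion of missing values per column.
print(df.isna().mean().sort_values(ascending=False))

# Duplication: exact duplicate rows.
print(f"Duplicate rows: {df.duplicated().sum()}")

# Target label distribution (data imbalance).
print(df["target"].value_counts(normalize=True))

# Outliers: cap an extreme-valued variable at its 1st and 99th percentiles.
lower, upper = df["lab_value"].quantile([0.01, 0.99])
df["lab_value_capped"] = df["lab_value"].clip(lower, upper)
```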

7.2 Data linkage

Many projects that involve health data will require linking datasets on certain attributes (keys). For example, we might link datasets by matching records that share the same NHI, by matching records that share a combination of personal details, or by matching locations.

One example is linking data across government departments or agencies, for instance linking hospitalisation data to vaccination data, matching on NHI. Another example is where individual health data is linked to census data based on census meshblock.

Data linkage refers to this joining of datasets. It allows us to get a more detailed picture of a person or entity. For instance, when linking medical records to census records, this might reveal that a patient resides in an area of high deprivation.
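
A minimal sketch of linking two datasets on a shared unique identifier (NHI) with pandas; the dataset and column names are hypothetical:

```python
import pandas as pd

hospitalisations = pd.read_csv("hospitalisations.csv")
vaccinations = pd.read_csv("vaccinations.csv")

# Keep all hospitalisation records, attaching vaccination details where a
# matching NHI exists; validate="m:1" raises an error if an NHI appears more
# than once in the vaccination dataset.
linked = hospitalisations.merge(vaccinations, on="nhi", how="left", validate="m:1")
```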

7.2.1 Unique identifiers

NHI is a unique health identifier that is commonly used to link datasets in health. As a unique identifier, each NHI should identify a single person and each person should have a single NHI. However, in certain cases multiple NHIs are recorded for a single person. We note that whether a person’s records carry a single unique identifier or multiple “unique” identifiers can have equity implications and can indicate the nature of their interaction with the health system.
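
A minimal sketch of checking how the identifier behaves in practice; detecting a single person holding multiple NHIs generally requires comparing other personal details, and all file and column names here are hypothetical:

```python
import pandas as pd

patients = pd.read_csv("patients.csv")

# NHIs linked to more than one distinct date of birth or sex (possible data
# quality issues, or the same NHI recorded against different people).
details = patients.groupby("nhi")[["date_of_birth", "sex"]].nunique()
print(details[(details > 1).any(axis=1)])

# Possible duplicate people: identical personal details under different NHIs.
dupes = patients.groupby(["surname", "date_of_birth", "sex"])["nhi"].nunique()
print(dupes[dupes > 1])
```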

7.2.2 Risk of re-identification

Two datasets that separately will not identify an individual may identify an individual once linked. As a result, the risk of re-identification of an individual should be considered prior to linking datasets. Ideally, datasets are linked together and de-identified prior to delivery from the health provider; however, this is often not possible.
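
One simple heuristic for assessing this risk, sketched below under hypothetical file and column names, is a k-anonymity-style check: count how many records share each combination of quasi-identifiers in the linked dataset, since very small groups are more easily re-identified.

```python
import pandas as pd

linked = pd.read_csv("linked_dataset.csv")

# Count records sharing each combination of quasi-identifiers; groups of
# size 1 are potentially re-identifiable.
quasi_identifiers = ["age_band", "sex", "ethnicity", "meshblock"]
group_sizes = linked.groupby(quasi_identifiers).size()

print(f"Smallest group size (k): {group_sizes.min()}")
print(f"Records in groups smaller than 5: {group_sizes[group_sizes < 5].sum()}")
```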

7.2.3 Approaches

Where data cannot be linked on a unique identifier, there are approaches that can be used to match records. The ‘deterministic’ approach is to match on a combination of attributes that together uniquely identify a person. An issue with this approach is that, in a dataset that is de-identified well, it should be difficult to uniquely identify individuals based on a combination of attributes.

The alternative ‘probabilistic’ approach involves calculating conditional probabilities to determine the likelihood that a given pair of records match.
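
A simplified sketch of both approaches follows, with hypothetical datasets, column names and weights. A real probabilistic linkage (for example, one based on the Fellegi-Sunter framework) would estimate the match weights from the data rather than fixing them by hand.

```python
import pandas as pd

df_a = pd.read_csv("dataset_a.csv")
df_b = pd.read_csv("dataset_b.csv")

# Deterministic: match on an exact combination of attributes.
deterministic = df_a.merge(df_b, on=["surname", "date_of_birth", "sex"], how="inner")

# Probabilistic (very simplified): generate candidate pairs, score them by
# how many fields agree (weighting more discriminating fields more heavily),
# and accept pairs whose score exceeds a threshold.
candidates = df_a.merge(df_b, on="surname", suffixes=("_a", "_b"))
score = (
    2.0 * (candidates["date_of_birth_a"] == candidates["date_of_birth_b"])
    + 1.5 * (candidates["first_name_a"] == candidates["first_name_b"])
    + 1.0 * (candidates["sex_a"] == candidates["sex_b"])
)
probabilistic = candidates[score >= 3.0]
```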

7.3 Missing data

Missing data without context should be interpreted carefully. For example, a missing hospitalisation record might indicate different access to care, that the data was simply not recorded, that you do not have access to it, or that the person is healthy and has not needed to see their GP. Care must be taken in interpreting missing data, as it can be hard to be sure of the context around it. This illustrates why clinical engagement, and stepping back to see the wider picture, is vital in health data science projects.

Missing data is a common problem in most health data sets. When you encounter missing data, you can choose how to manage it by:

  • Removing it
  • Interpreting it as “not applicable”
  • Imputing it (filling it in with other values).

Often, a combination of removing observations and variables with excessive missingness (according to a pre-determined threshold), and then imputing the remaining missing values, is effective.

Which option you choose should be based on whether your method is tolerant of missing values; whether the data appears to be ‘missing completely at random’ (MCAR) in which case it could be ignored; or whether there are some patterns in its missingness.

If you choose to remove the data, you’ll need to decide whether to remove variables or observations (that is, columns or rows).
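
A minimal sketch of threshold-based removal with pandas, assuming a pre-determined threshold of 50% missingness (the file name and threshold are hypothetical):

```python
import pandas as pd

threshold = 0.5
df = pd.read_csv("cohort.csv")

# Drop variables (columns) with more than 50% missing values.
df = df.loc[:, df.isna().mean() <= threshold]

# Drop observations (rows) with more than 50% missing values.
df = df.loc[df.isna().mean(axis=1) <= threshold]
```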

Try adding ‘proxy’ variables or altering data collection processes to avoid missing data as much as possible.

Most data is unlikely to be MCAR. Often, data is more likely to be missing for particular reasons (defined as ‘missing not at random’ (MNAR)). Consider if there’s a reason for the missingness, as this can itself be informative. For example, data may be missing from smaller subgroups within the dataset. This is particularly important when considering stratification for ethnicity.

It’s important to consider the potential bias introduced into a model by removing missing data, or the impact on equity if data is removed. Before continuing with imputation or removal, check for patterns in the data of individuals who have partial missing data to assess the equity and bias implications of removing or imputing it. Look for evidence of non-random missingness by comparing the complete and missing data groups through stratification for important demographic variables such as age and ethnicity.
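
A minimal sketch of comparing missingness across demographic groups, with hypothetical file and column names:

```python
import pandas as pd

df = pd.read_csv("cohort.csv")

# Proportion of missing lab results within each ethnicity and age band;
# large differences between groups suggest non-random missingness.
missing_by_group = (
    df.assign(lab_missing=df["lab_result"].isna())
      .groupby(["ethnicity", "age_band"])["lab_missing"]
      .mean()
)
print(missing_by_group)
```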

Imputation can be a useful technique for overcoming missing data problems, but can be computationally intensive. Through a PDH research project, a guide on multiple imputation was published with information about the application of imputation techniques. Your strategy for handling missing values will have implications for how you handle future unseen data too.
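
A minimal sketch of single imputation with scikit-learn, under hypothetical file and column names; multiple imputation, as covered in the guide referenced above, goes further by generating several plausible imputed datasets and pooling results across them:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("cohort.csv")
numeric_cols = ["age", "systolic_bp", "lab_result"]

# Fit the imputer once, then reuse it on future unseen data so the same
# strategy is applied consistently.
imputer = SimpleImputer(strategy="median")
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```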

7.4 Use of synthetic data

Accessing real-world data can sometimes be difficult, whether due to timing, privacy concerns, access problems, lack of participation in healthcare by certain groups, or rarity of a disease.

Synthetic data (information that’s artificially generated rather than produced by real-world events) can be used if you’re having difficulty accessing real-world datasets. Synthetic data may also satisfy a use case (for example, for data augmentation purposes).

Synthetic data can be used for prototyping or building analytics dashboards that are ready to plug into real-world datasets. It can also be used for machine learning purposes, including to integrate data relating to under-represented conditions and groups of people, and to protect privacy. Recent publications have overviewed the use of synthetic data (see Chen et al. 2021).

Chen, R J, Lu, M Y, Chen, T Y, Williamson, D F K, and Mahmood, F. 2021. “Synthetic data in machine learning for medicine and healthcare”. Nature Biomedical Engineering 5.6, 493–497. https://www.nature.com/articles/s41551-021-00751-8.

However, off-the-shelf models that generate synthetic data (such as Synthea) are not mature and should be carefully assessed before local use. We strongly recommend understanding how the synthetic data is generated and what its limitations are before using it.
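
To make this concrete, the following deliberately naive sketch (with hypothetical file and column names) samples each variable independently from its observed values: it preserves marginal distributions but destroys relationships between variables, which is exactly the kind of limitation you need to understand before using synthetic data.

```python
import pandas as pd

real = pd.read_csv("cohort.csv")
n = 1_000

# Sample each column independently with replacement. Marginal distributions
# are roughly preserved; correlations (e.g. between age and blood pressure)
# are not.
synthetic = pd.DataFrame({
    col: real[col].dropna().sample(n, replace=True).reset_index(drop=True)
    for col in real.columns
})
```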

Off-the-shelf calculators can also be considered in lieu of building or enhancing models with synthetic data, with the caveat that these calculators should ideally be subject to local validation. Tools such as Te Pokapū Hātepe o Aotearoa New Zealand Algorithm Hub provide a shared knowledge base of pre-trained models, algorithms and risk calculators, reviewed by a multidisciplinary governance group.