Topic Group 6: Evaluating diagnostic tests and prediction models

Chairs: Gary Collins, Carl Moons, Ewout Steyerberg
Members: Patrick Bossuyt, Petra Macaskill, David McLernon, Ben van Calster, Andrew Vickers

Methods and measures are well established to evaluate the accuracy of a diagnostic test for classifying individuals according to whether they do, or do not, have a condition of interest (Pepe 2003 [1], Zhou 2002 [2]). A test may be intrinsically binary, ordinal or continuous, with a reference standard ("gold standard") used to define whether patients have the target condition. For a binary test, the most commonly reported measures of test accuracy are sensitivity and specificity. These estimate the probability that the test result is correct, conditional on the disease status of an individual. Other commonly reported measures include likelihood ratios, and positive and negative predictive values of the test. For ordinal and continuous tests, a Receiver Operating Characteristic (ROC) curve is typically used to represent the trade-off between sensitivity and specificity as the test threshold varies, with the area under the ROC curve used as a global measure of test accuracy or discrimination. Sensitivity, specificity and other measures are often reported at chosen cut-point(s) for test positivity. These methods underpin diagnostic test evaluation, but they are not always well applied or interpreted. Indeed, there are concerns about the value of sensitivity and specificity as metrics.
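
As an illustration of the measures described above, the following minimal Python sketch computes them from a 2x2 table of test result against reference standard, and estimates the area under the ROC curve for a continuous test as the probability that a randomly chosen diseased individual scores higher than a randomly chosen non-diseased one. The function names and the example counts are hypothetical and chosen only for illustration. The sketch assumes a perfect reference standard and non-empty cells.

```python
from dataclasses import dataclass


@dataclass
class AccuracyMeasures:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    lr_positive: float
    lr_negative: float


def accuracy_from_2x2(tp: int, fp: int, fn: int, tn: int) -> AccuracyMeasures:
    """Accuracy measures for a binary test against a reference standard.

    tp, fn: diseased individuals testing positive / negative.
    fp, tn: non-diseased individuals testing positive / negative.
    Assumes no margin is zero and specificity is strictly between 0 and 1.
    """
    sensitivity = tp / (tp + fn)  # P(test positive | diseased)
    specificity = tn / (tn + fp)  # P(test negative | not diseased)
    return AccuracyMeasures(
        sensitivity=sensitivity,
        specificity=specificity,
        ppv=tp / (tp + fp),  # P(diseased | test positive)
        npv=tn / (tn + fn),  # P(not diseased | test negative)
        lr_positive=sensitivity / (1 - specificity),
        lr_negative=(1 - sensitivity) / specificity,
    )


def auc(scores_diseased, scores_healthy) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for d in scores_diseased:
        for h in scores_healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(scores_diseased) * len(scores_healthy))


# Hypothetical study: 100 diseased and 200 non-diseased individuals.
measures = accuracy_from_2x2(tp=90, fp=30, fn=10, tn=170)
print(measures)

# Hypothetical continuous test results.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.4]))
```

Note that the predictive values computed this way depend on the prevalence in the study sample, whereas sensitivity, specificity and the likelihood ratios condition on disease status.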

It is widely accepted that test performance is likely to vary according to the context in which the test is used, for instance, where it lies in the clinical pathway. This has clear implications for the (potential) role of the test, e.g. as a replacement for an existing test, as an “add-on” test, or as a triage test (Bossuyt 2006 [3]). Evaluation of the potential role of a test has implications for study design as well as for methods for comparing tests and for assessing the gain from using tests in combination (Hayen 2010 [4]).

In practice, a test is generally used in conjunction with other information such as the age of the patient, their sex, symptoms, clinical signs and possibly the results of other tests when making a diagnosis. Multivariable logistic regression is commonly used to develop a model to predict the presence of disease, thereby utilising all available relevant information for diagnostic decision support. Prediction modelling is also especially important in prognosis, where survival modelling such as Cox regression is frequently used to predict the probability of a given outcome (e.g. mortality) in the future, based on the profile of an individual in terms of prognostic factors, test results, biomarkers etc. Model performance is usually assessed in terms of overall predictive performance, discrimination (ability to classify individuals correctly into two outcome categories), and calibration (agreement between the predicted probabilities and observed outcomes) (Steyerberg 2010 [5]). While these are generally regarded as the key criteria for assessing model performance, the specific methods and measures vary, occasionally producing inconsistent or conflicting results. This is especially evident when assessing the incremental gain of adding a test or biomarker to a model (Steyerberg 2012 [6], Pepe 2013 [7]).
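
The three performance criteria named above can each be summarised by a simple statistic. The sketch below is one possible choice, not a recommended set: the Brier score for overall performance, the concordance (c) statistic for discrimination, and the ratio of observed to expected events as a crude check of calibration. The function names and the small data set are hypothetical, and the predicted probabilities are assumed to come from an already fitted model for a binary outcome.

```python
def brier_score(probs, outcomes) -> float:
    """Overall performance: mean squared difference between prediction and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def c_statistic(probs, outcomes) -> float:
    """Discrimination: proportion of event/non-event pairs in which the
    individual with the event has the higher predicted probability
    (ties count 0.5)."""
    events = [p for p, y in zip(probs, outcomes) if y == 1]
    nonevents = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))


def observed_expected_ratio(probs, outcomes) -> float:
    """Calibration (crude): observed number of events divided by the sum of
    predicted probabilities. A value of 1 means predictions are right on
    average; it does not show whether they are right across the risk range."""
    return sum(outcomes) / sum(probs)


# Hypothetical predicted probabilities and observed binary outcomes.
probs = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 1, 1]

print(brier_score(probs, outcomes))
print(c_statistic(probs, outcomes))
print(observed_expected_ratio(probs, outcomes))
```

A model can score well on one of these summaries and poorly on another, which is one reason the choice of measures matters when models are compared.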

Diagnostic tests and model predictions are imperfect. Thus, there is a potential for harm as well as benefit in terms of decisions regarding (further) investigations, treatment and prognosis for individuals. Even though an evaluation may indicate good diagnostic accuracy or model performance, evaluation of clinical utility requires determining whether decisions based on the test or model improve patient outcomes. This can be done either by decision analysis (net benefit) (Steyerberg 2010 [5]) or by prospective cost-effectiveness analysis (Hunink 2001 [8]).
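
To make the net benefit idea concrete, the sketch below uses the usual decision-analytic definition: at a chosen risk threshold, true positives per individual are credited and false positives per individual are penalised, with the penalty weighted by the odds of the threshold. The function names and data are hypothetical, and the sketch assumes a binary outcome and a threshold strictly between 0 and 1. Comparing the model against a "treat all" strategy (and "treat none", which has net benefit zero) indicates whether using the model adds value at that threshold.

```python
def net_benefit(probs, outcomes, threshold: float) -> float:
    """Net benefit of treating individuals whose predicted risk is at or
    above `threshold`: TP/n - FP/n * threshold / (1 - threshold)."""
    n = len(probs)
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    harm_weight = threshold / (1 - threshold)  # odds of the risk threshold
    return tp / n - harm_weight * fp / n


def net_benefit_treat_all(outcomes, threshold: float) -> float:
    """Net benefit of treating everyone regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical predicted probabilities and observed binary outcomes.
probs = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 1, 1]

for threshold in (0.25, 0.5):
    print(
        threshold,
        net_benefit(probs, outcomes, threshold),
        net_benefit_treat_all(outcomes, threshold),
    )
```

Repeating the calculation over a range of plausible thresholds gives a decision curve, which shows where the model outperforms the default strategies.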

We will review diagnostic test evaluation in terms of methods, measures and study designs that are relevant to the assessment of a test in terms of its intended use. We will also review methods for the evaluation of prediction models for diagnosis and prognosis, with a particular focus on reclassification and approaches that assess clinical utility.

In the longer term, we will consider extending the above framework to address prediction models for differential diagnosis, dealing with missing data (e.g. incomplete verification), assessment of calibration, consistency with economically relevant outcomes (e.g. medical costs and quality-adjusted life years), and the impact of measurement error (e.g. error in the reference standard).