Topic Group 6: Evaluating diagnostic tests and prediction models

Chairs:	Ewout Steyerberg, Ben van Calster
Members:	Patrick Bossuyt, Tom Boyles, Gary Collins, Kathleen Kerr, Petra Macaskill, David McLernon, Carl Moons, Maarten van Smeden, Andrew Vickers, Max Westphal, Laure Wynants

Homepage: Topic Group 6

Methods and measures are well established to evaluate the accuracy of a diagnostic test for classifying individuals according to whether they do, or do not, have a condition of interest (Pepe 2003¹, Zhou 2002²). A test may be intrinsically binary, ordinal or continuous, with a reference standard ("gold standard") used to define whether patients have the target condition. For a binary test, the most commonly reported measures of test accuracy are sensitivity and specificity. These estimate the probability that the test result is correct, conditional on the disease status of an individual. Other commonly reported measures include likelihood ratios, and positive and negative predictive values of the test. For ordinal and continuous tests, a Receiver Operating Characteristic (ROC) curve is typically used to represent the trade-off between sensitivity and specificity as the test threshold varies, with the area under the ROC curve used as a global measure of test accuracy or discrimination. Sensitivity, specificity and other measures are often reported at chosen cut-point(s) for test positivity. These methods underpin diagnostic test evaluation, but they are not always well applied or interpreted. Indeed, there are concerns about the value of sensitivity and specificity as metrics.
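As a minimal sketch of the measures named above, the following Python computes them from a 2x2 table of test result against reference standard, and computes the area under the ROC curve for a continuous test as the proportion of diseased/non-diseased pairs in which the diseased individual has the higher test value. The function names and the counts are hypothetical and chosen only for illustration; no confidence intervals are computed.

```python
def binary_test_accuracy(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 table (test result vs reference standard)."""
    sensitivity = tp / (tp + fn)  # P(test positive | condition present)
    specificity = tn / (tn + fp)  # P(test negative | condition absent)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),  # P(condition present | test positive)
        "npv": tn / (tn + fn),  # P(condition absent | test negative)
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


def roc_auc(values_diseased, values_non_diseased):
    """AUC as the proportion of (diseased, non-diseased) pairs in which the
    diseased individual has the higher test value; ties count one half."""
    score = 0.0
    for d in values_diseased:
        for n in values_non_diseased:
            if d > n:
                score += 1.0
            elif d == n:
                score += 0.5
    return score / (len(values_diseased) * len(values_non_diseased))


# Hypothetical counts: 100 individuals with the condition, 200 without.
measures = binary_test_accuracy(tp=90, fp=30, fn=10, tn=170)

# Hypothetical continuous test values.
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```

Note that the predictive values computed this way depend on the prevalence in the study sample (here 100/300), whereas sensitivity and specificity are conditional on disease status.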

It is widely accepted that test performance is likely to vary according to the context in which the test is used, for instance, where it lies in the clinical pathway. This has clear implications for the (potential) role of the test, e.g. as a replacement for an existing test, an “add-on” test, or a triage test (Bossuyt 2006³). Evaluation of the potential role of a test has implications for study design as well as for methods for test comparisons and for assessing the gain from using tests in combination (Hayen 2010⁴).

In practice, a test is generally used in conjunction with other information such as the age of the patient, their sex, symptoms, clinical signs and possibly the results of other tests when making a diagnosis. Multivariable logistic regression is commonly used to develop a model to predict the presence of disease, thereby utilising all available relevant information for diagnostic decision support. Prediction modelling is also especially important in the area of prognosis, where survival modelling such as Cox regression is frequently used to predict the probability of a given outcome (e.g. mortality) in the future, based on the profile of an individual in terms of prognostic factors, test results, biomarkers etc. Model performance is usually assessed in terms of overall predictive performance, discrimination (ability to classify individuals correctly into two outcome categories), and calibration (agreement between the predicted probabilities and observed outcomes) (Steyerberg 2010⁵). While these are generally regarded as the key criteria for assessing model performance, specific methods and measures vary, occasionally producing inconsistent or conflicting results. This is especially evident when assessing the incremental gain of adding a test or biomarker to a model (Steyerberg 2012⁶, Pepe 2013⁷).
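The three performance criteria above can be illustrated for a binary outcome with one simple measure each. This is a sketch under stated assumptions, not a recommended set of measures: the Brier score stands in for overall performance, the c-statistic for discrimination, and the difference between the observed event rate and the mean predicted probability for calibration (calibration-in-the-large only). The function names and data are hypothetical, and censored survival outcomes would need different estimators.

```python
def brier_score(outcomes, predictions):
    """Overall performance: mean squared difference between the predicted
    probability and the observed 0/1 outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predictions)) / len(outcomes)


def c_statistic(outcomes, predictions):
    """Discrimination: proportion of (event, non-event) pairs in which the
    event has the higher predicted probability; ties count one half."""
    events = [p for y, p in zip(outcomes, predictions) if y == 1]
    non_events = [p for y, p in zip(outcomes, predictions) if y == 0]
    score = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(non_events))


def observed_minus_expected(outcomes, predictions):
    """Calibration-in-the-large: observed event rate minus mean predicted
    probability (zero when predictions are correct on average)."""
    n = len(outcomes)
    return sum(outcomes) / n - sum(predictions) / n


# Hypothetical outcomes (1 = event) and model-predicted probabilities.
y = [1, 1, 0, 1, 0, 0, 0, 0]
p = [0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.1]

brier = brier_score(y, p)
c_stat = c_statistic(y, p)
o_minus_e = observed_minus_expected(y, p)
```

In this toy data the predictions are correct on average (observed rate and mean prediction are both 0.375), which does not by itself imply good calibration across the range of predicted risks.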

Diagnostic tests and model predictions are imperfect. Thus, there is a potential for harm as well as benefit in terms of decisions regarding (further) investigations, treatment and prognosis for individuals. Even though an evaluation may indicate good diagnostic accuracy or model performance, evaluation of clinical utility requires determining whether decisions based on the test or model improve patient outcomes. This can be done either by decision analysis (net benefit) (Steyerberg 2010⁵) or by prospective cost-effectiveness analysis (Hunink 2001⁸).
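For the decision-analytic route, net benefit at a risk threshold t is commonly defined as TP/n - (FP/n) * t/(1 - t), where individuals with predicted risk at or above t are classified as positive and the odds t/(1 - t) express how a false positive is weighed against a true positive. The sketch below assumes that definition and compares a model with the default strategies of treating everyone and treating no one; the function names, data and threshold are hypothetical.

```python
def net_benefit(outcomes, predictions, threshold):
    """Net benefit of treating individuals whose predicted risk is at or
    above the threshold, in units of true positives per individual."""
    n = len(outcomes)
    tp = sum(1 for y, p in zip(outcomes, predictions) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(outcomes, predictions) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of treating everyone, regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = event) and model-predicted probabilities.
y = [1, 1, 0, 1, 0, 0, 0, 0]
p = [0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.1]

threshold = 0.25  # one false positive weighs one third of a true positive
nb_model = net_benefit(y, p, threshold)
nb_all = net_benefit_treat_all(y, threshold)
nb_none = 0.0  # treating no one yields no true or false positives
```

A model is considered useful at a given threshold only if its net benefit exceeds that of both default strategies; repeating the calculation over a range of thresholds gives a decision curve.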

We will review diagnostic test evaluation in terms of methods, measures and study designs that are relevant to the assessment of a test in terms of its intended use. We will also review methods for the evaluation of prediction models for diagnosis and prognosis, with a particular focus on reclassification and approaches that assess clinical utility.

In the longer term, we will consider extending the above framework to address prediction models for differential diagnosis, dealing with missing data (e.g. incomplete verification), assessment of calibration, consistency with economically relevant outcomes (e.g. medical costs and quality-adjusted life years), and the impact of measurement error (e.g. error in the reference standard).

1. Pepe M. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press: Oxford, 2003.
2. Zhou X, Obuchowski N, McClish D. Statistical Methods in Diagnostic Medicine. John Wiley & Sons Ltd: New York, 2002.
3. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006; 332:1089–1092.
4. Hayen A, Macaskill P, Irwig L, Bossuyt P. Appropriate statistical methods are required to assess diagnostic tests for replacement, add-on, and triage. Journal of Clinical Epidemiology 2010; 63(8):883–891.
5. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010; 21(1):128–138.
6. Steyerberg EW, Pencina MJ, Lingsma HF, Kattan MW, Vickers AJ, Van Calster B. Assessing the incremental value of diagnostic and prognostic markers: a review and illustration. European Journal of Clinical Investigation 2012; 42(2):216–228.
7. Pepe MS, Kerr KF, Longton G, Wang Z. Testing for improvement in prediction model performance. Statistics in Medicine 2013; 32(9):1467–1482.
8. Hunink MGM, Glasziou PP, Siegel JE, Weeks JC, Pliskin JS, Elstein AS, Weinstein MC. Decision Making in Health and Medicine: Integrating Evidence and Values. Cambridge University Press: Cambridge, UK, 2001.