Topic Group 2: Selection of variables and functional forms in multivariable analysis

Chairs:	Georg Heinze, Aris Perperoglou, Willi Sauerbrei
Members:	Michal Abrahamowicz, Harald Binder, Daniela Dunkler, Frank Harrell, Marc Henrion, Michael Kammer, Marianna Nold, Matthias Schmid, Theresa Ullmann

Homepage: Topic Group 2

In multivariable analysis, it is common to have a mix of binary, categorical (ordinal or unordered) and continuous variables which may influence an outcome. While TG6: Evaluating Diagnostic Tests and Prediction Models considers the situation where the main task is predicting the outcome as accurately as possible, the main focus of TG2 is to identify influential variables and gain insight into their individual and joint relationship with the outcome. Two of the (interrelated) main challenges are selection of variables for inclusion in a multivariable explanatory model, and choice of the functional forms for continuous variables (Harrell 2001¹, Sauerbrei et al. 2007²).

In practice, multivariable models are usually built through a combination of

a priori inclusion of well-established ‘predictors’ of the outcome of interest, and
a posteriori selection of additional variables, often based on arbitrary, data-dependent procedures and criteria such as statistical significance or goodness-of-fit measures.
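One weakness of purely data-dependent selection can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not a procedure recommended by TG2: all candidate predictors are pure noise, each is screened with a univariable linear regression, and it is "selected" when the absolute t statistic of its slope exceeds 1.96. The function names, sample sizes and the threshold are hypothetical choices made for this example.

```python
import math
import random


def slope_t_statistic(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope / standard_error


def screen_noise_predictors(rng, n_obs=100, n_candidates=20, threshold=1.96):
    """Return indices of noise predictors that pass univariable screening.

    Assumption for illustration: the outcome and every candidate predictor
    are independent standard normal draws, so any selection is spurious.
    """
    y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    selected = []
    for j in range(n_candidates):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if abs(slope_t_statistic(x, y)) > threshold:
            selected.append(j)
    return selected


def false_selection_rate(n_replications=100, n_candidates=20, seed=2024):
    """Average share of noise predictors selected across simulated data sets."""
    rng = random.Random(seed)
    total_selected = sum(
        len(screen_noise_predictors(rng, n_candidates=n_candidates))
        for _ in range(n_replications)
    )
    return total_selected / (n_replications * n_candidates)


if __name__ == "__main__":
    print(f"Share of pure-noise predictors selected: {false_selection_rate():.3f}")
```

With a threshold near the conventional 5% level, roughly one in twenty irrelevant candidates is expected to pass the screen, so the number of spurious 'predictors' grows with the number of candidates considered.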

There is a consensus that all of the many suggested model building strategies have weaknesses (Miller 2002³), but opinions on the relative advantages and disadvantages of particular strategies differ considerably.

The effects of continuous predictors are typically modeled by either categorizing them (which raises such issues as the number of categories, cutpoint values, implausibility of the resulting step-function relationships, local biases, power loss, or invalidity of inference in case of data-dependent cutpoints) (Greenland 1995⁴) or assuming linear relationships with the outcome, possibly after a simple transformation (e.g. logarithmic or quadratic). Often, however, the reasons for choosing such conventional representations of continuous variables are not discussed, and the validity of the underlying assumptions is not assessed.
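The information loss caused by categorization can be made concrete with a minimal sketch. It assumes simulated data in which the outcome truly depends linearly on a continuous predictor, and compares the proportion of variance explained when the predictor is kept continuous with that obtained after a median split. The data-generating values and function names are hypothetical and chosen only for illustration.

```python
import random


def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


def compare_linear_vs_dichotomized(n=500, seed=42):
    """Return (R^2 with continuous x, R^2 with x dichotomized at its median)."""
    rng = random.Random(seed)
    # Assumed truth for this illustration: a linear effect plus normal noise.
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y = [2.0 * xi + rng.gauss(0.0, 3.0) for xi in x]
    # Median split: a data-dependent cutpoint that yields a step function.
    cutpoint = sorted(x)[n // 2]
    x_binary = [1.0 if xi > cutpoint else 0.0 for xi in x]
    _, _, r2_continuous = fit_simple_ols(x, y)
    _, _, r2_binary = fit_simple_ols(x_binary, y)
    return r2_continuous, r2_binary


if __name__ == "__main__":
    r2_continuous, r2_binary = compare_linear_vs_dichotomized()
    print(f"R^2, continuous predictor:   {r2_continuous:.3f}")
    print(f"R^2, dichotomized predictor: {r2_binary:.3f}")
```

In this setting the dichotomized predictor explains noticeably less of the outcome variance than the continuous one. The sketch does not address the other issues listed above, such as the invalidity of inference after data-dependent cutpoint selection.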

To address these limitations, statisticians have developed flexible modeling techniques based on various types of smoothers, including fractional polynomials (Royston and Altman 1994⁵, Royston and Sauerbrei 2008⁶) and several ‘flavours’ of splines. The latter include restricted regression splines (de Boer 2001⁷, Harrell 2001¹), penalized regression splines (Wood 2006⁸) and smoothing splines (Hastie and Tibshirani 1990⁹). For multivariable analysis, these smoothers have been incorporated in generalized additive models.
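To give a flavour of how fractional polynomials work, the following sketch selects a first-degree fractional polynomial (FP1) for a single positive predictor. It tries each power in the conventional set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, where power 0 denotes the natural logarithm, and keeps the power with the smallest residual sum of squares. This is a deliberate simplification: it omits second-degree (FP2) functions and the significance-based function selection procedure described by Royston and Sauerbrei, and the simulated data and function names are assumptions made for this example.

```python
import math
import random

# Conventional FP power set; power 0 stands for log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(x, power):
    """FP1 transformation of a positive value x."""
    return math.log(x) if power == 0 else x ** power


def residual_sum_of_squares(z, y):
    """RSS of the least-squares fit y = intercept + slope * z."""
    n = len(z)
    mean_z = sum(z) / n
    mean_y = sum(y) / n
    szz = sum((zi - mean_z) ** 2 for zi in z)
    szy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    slope = szy / szz
    intercept = mean_y - slope * mean_z
    return sum((yi - intercept - slope * zi) ** 2 for zi, yi in zip(z, y))


def best_fp1_power(x, y):
    """Return (best power, dict of RSS by power) for an FP1 model of y on x."""
    rss_by_power = {
        power: residual_sum_of_squares([fp_transform(xi, power) for xi in x], y)
        for power in FP_POWERS
    }
    best_power = min(rss_by_power, key=rss_by_power.get)
    return best_power, rss_by_power


def simulate_log_data(n=300, seed=7):
    """Hypothetical data in which the true relationship is logarithmic."""
    rng = random.Random(seed)
    x = [rng.uniform(0.5, 10.0) for _ in range(n)]
    y = [1.0 + 3.0 * math.log(xi) + rng.gauss(0.0, 0.1) for xi in x]
    return x, y


if __name__ == "__main__":
    x, y = simulate_log_data()
    power, rss = best_fp1_power(x, y)
    print("Selected FP1 power:", power)
```

Because the simulated relationship is logarithmic, the power 0 transformation should give the best fit here. In real applications the comparison is usually based on deviances and formal tests rather than on the raw residual sum of squares alone.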

Various examples illustrate that such smoothers can yield new insight into the role of continuous variables (Abrahamowicz et al. 1997¹⁰, Royston and Sauerbrei 2008⁶). However, further practical guidance is urgently needed, necessitating extended investigations of analytical properties and systematic comparisons between alternative methods.

TG2 will start with a comprehensive review of the methodological, medical and econometric literature to

(a) identify and assess methods currently used in practice,
(b) find any published guidelines on selection of variables and their functional forms, and
(c) find systematic simulation-based comparisons of alternative techniques, especially in multivariable analyses (Binder et al. 2013¹¹).

Part (c) may lead to new comparative simulation studies and provide building blocks for evaluation of new techniques by simulation.

We aim to develop consensus-based tentative recommendations, initially for level 2 expertise, under some simplifying assumptions about the data structure. Recommendations will address accuracy, efficiency, transportability, ease of implementation and interpretability, in a wide range of applications (Sauerbrei et al. 2007²). Furthermore, we aim to develop systematic guidance for using splines in applications, similar to existing guidelines for fractional polynomials (Royston and Sauerbrei 2008⁶). Longer-term goals include evaluation of and recommendations for computationally intensive variable selection algorithms which incorporate shrinkage and resampling techniques, and collaborations with other TGs to account for such complexities as missing data, measurement errors, time-varying confounding, or issues specific to modeling continuous predictors in survival analyses (Abrahamowicz and MacKenzie 2006¹²).

1. Harrell FE. Regression Modeling Strategies: with applications to linear models, logistic regression, and survival analysis. Springer: New York, 2001.
2. Sauerbrei W; Royston P; Binder H. Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Statistics in Medicine 2007. DOI: 10.1002/sim.3148
3. Miller A. Subset Selection in Regression. Taylor & Francis, 2002.
4. Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology 1995; 6(4): 450–454.
5. Royston P; Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Applied Statistics 1994; 43(3): 429–467.
6. Royston P; Sauerbrei W. Multivariable model-building: a pragmatic approach to regression analysis based on fractional polynomials for continuous variables. Wiley: Chichester, 2008.
7. de Boer C. A Practical Guide to Splines. Revised edn. Springer: New York, 2001.
8. Wood S. Generalized Additive Models. Chapman & Hall/CRC: New York, 2006.
9. Hastie T; Tibshirani R. Generalized Additive Models. Chapman & Hall/CRC: New York, 1990.
10. Abrahamowicz M; Du Berger R; Grover SA. Flexible modeling of the effects of serum cholesterol on coronary heart disease mortality. American Journal of Epidemiology 1997; 145(8): 714–729.
11. Binder H; Sauerbrei W; Royston P. Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response. Statistics in Medicine 2013; 32(13): 2262–2277.
12. Abrahamowicz M; MacKenzie TA. Joint estimation of time-dependent and non-linear effects of continuous covariates on survival. Statistics in Medicine 2006. DOI: 10.1002/sim.2519