Topic Group 2: Selection of variables and functional forms in multivariable analysis

Chairs: Michal Abrahamowicz, Willi Sauerbrei, Aris Perperoglou
Members: Heiko Becher, Harald Binder, Frank Harrell, Georg Heinze, Patrick Royston, Matthias Schmid

In multivariable analysis, it is common to have a mix of binary, categorical (ordinal or unordered) and continuous variables which may influence an outcome. While TG6: Evaluating Diagnostic Tests and Prediction Models considers the situation where the main task is predicting the outcome as accurately as possible, the main focus of TG2 is to identify influential variables and gain insight into their individual and joint relationship with the outcome. Two of the main (and interrelated) challenges are the selection of variables for inclusion in a multivariable explanatory model and the choice of functional forms for continuous variables (Harrell 2001 [1], Sauerbrei et al. 2007 [2]).

In practice, multivariable models are usually built through a combination of

  1. a priori inclusion of well established ‘predictors’ of the outcome of interest, and
  2. a posteriori selection of additional variables, often based on arbitrary, data-dependent procedures and criteria such as statistical significance or goodness-of-fit measures.

There is a consensus that all of the many suggested model building strategies have weaknesses (Miller 2002 [3]), but opinions on the relative advantages and disadvantages of particular strategies differ considerably.


The effects of continuous predictors are typically modeled either by categorizing them or by assuming linear relationships with the outcome, possibly after a simple transformation (e.g. logarithmic or quadratic). Categorization raises issues such as the number of categories, the cutpoint values, the implausibility of the resulting step-function relationships, local biases, power loss, and invalid inference when cutpoints are data-dependent (Greenland 1995 [4]). Often, however, the reasons for choosing such conventional representations of continuous variables are not discussed and the validity of the underlying assumptions is not assessed.
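The information loss from categorization can be illustrated with a small sketch. It is not taken from the cited work: the function names `r_squared` and `compare_linear_vs_median_split` are hypothetical, and the data are an idealized, noise-free linear relationship chosen only to make the contrast visible. The sketch compares the share of outcome variance explained by a straight-line fit with that explained by a dichotomization at the median.

```python
import numpy as np


def r_squared(y, fitted):
    """Proportion of the variance of y explained by the fitted values."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def compare_linear_vs_median_split(x, y):
    """Return (R^2 of a linear fit, R^2 of a two-group step function).

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # (a) Continuous predictor, linear functional form.
    slope, intercept = np.polyfit(x, y, 1)
    r2_linear = r_squared(y, intercept + slope * x)

    # (b) Predictor dichotomized at its median: the fitted value is the
    #     mean outcome within each of the two groups (a step function).
    high = x > np.median(x)
    fitted_step = np.where(high, y[high].mean(), y[~high].mean())
    r2_step = r_squared(y, fitted_step)

    return r2_linear, r2_step


# Assumed toy data: evenly spaced predictor, exactly linear outcome.
x = np.linspace(0.0, 1.0, 400)
y = 2.0 * x
r2_linear, r2_step = compare_linear_vs_median_split(x, y)
print(f"linear fit R^2:   {r2_linear:.3f}")
print(f"median split R^2: {r2_step:.3f}")
```

In this idealized setting the median split recovers only about three quarters of the variance that the linear term explains. With real data the size of the loss depends on the true functional form, the distribution of the predictor and the noise level.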

To address these limitations, statisticians have developed flexible modeling techniques based on various types of smoothers, including fractional polynomials (Royston and Altman 1994 [5], Royston and Sauerbrei 2008 [6]) and several ‘flavours’ of splines. The latter include restricted regression splines (de Boer 2001 [7], Harrell 2001 [1]), penalized regression splines (Wood 2006 [8]) and smoothing splines (Hastie and Tibshirani 1990 [9]). For multivariable analysis, these smoothers have been incorporated in generalized additive models.
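As a rough illustration of the fractional polynomial idea, the sketch below fits every first-degree fractional polynomial (FP1) to a single positive predictor by ordinary least squares and keeps the power with the smallest residual sum of squares. The power set is the conventional one, with power 0 read as the natural logarithm. The names `fp_transform` and `fit_fp1` are hypothetical, and the sketch deliberately omits second-degree terms, the formal function selection tests, and any multivariable selection step, so it should not be read as the full procedure described in the cited references.

```python
import numpy as np

# Conventional FP power set; power 0 denotes log(x).
FP_POWERS = (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0)


def fp_transform(x, power):
    """Apply one fractional polynomial transformation to positive x."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("fractional polynomials require strictly positive x")
    return np.log(x) if power == 0.0 else x ** power


def fit_fp1(x, y):
    """Fit y = b0 + b1 * x^(p) for each candidate power p.

    Returns (best_power, coefficients, residual_sum_of_squares) for the
    power with the smallest residual sum of squares. Hypothetical helper
    for illustration only.
    """
    y = np.asarray(y, dtype=float)
    results = []
    for power in FP_POWERS:
        design = np.column_stack([np.ones_like(y), fp_transform(x, power)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ coef) ** 2))
        results.append((rss, power, coef))
    rss, power, coef = min(results, key=lambda r: r[0])
    return power, coef, rss


# Assumed toy data: a logarithmic relationship with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.5, 10.0, 200)
y = 1.0 + 2.0 * np.log(x) + rng.normal(scale=0.05, size=x.size)
best_power, coef, rss = fit_fp1(x, y)
print("selected power:", best_power)
print("coefficients:", coef)
```

Choosing the power by residual sum of squares alone is a simplification. In practice the gain over a linear term would be tested formally before a non-linear form is accepted.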

Various examples illustrate that such smoothers can yield new insight into the role of continuous variables (Abrahamowicz et al. 1997 [10], Royston and Sauerbrei 2008 [6]). However, further practical guidance is urgently needed, which requires extended investigations of analytical properties and systematic comparisons between alternative methods.

TG2 will start with a comprehensive review of the methodological, medical and econometric literature to

  1. identify and assess methods currently used in practice,
  2. find any published guidelines on selection of variables and their functional forms, and
  3. find systematic simulation-based comparisons of alternative techniques, especially in multivariable analyses (Binder et al. 2013 [11]).

Item 3 may lead to new comparative simulation studies and provide building blocks for the evaluation of new techniques by simulation.


We aim to develop consensus-based tentative recommendations, initially for level 2 expertise, under some simplifying assumptions about the data structure. Recommendations will address accuracy, efficiency, transportability, ease of implementation and interpretability in a wide range of applications (Sauerbrei et al. 2007 [2]). Furthermore, we aim to develop systematic guidance for using splines in applications, similar to existing guidelines for fractional polynomials (Royston and Sauerbrei 2008 [6]). Longer-term goals include the evaluation of, and recommendations for, computationally intensive variable selection algorithms which incorporate shrinkage and resampling techniques, as well as collaborations with other TGs to account for such complexities as missing data, measurement errors, time-varying confounding, or issues specific to modeling continuous predictors in survival analyses (Abrahamowicz and MacKenzie 2006 [12]).

  1. Harrell FE. Regression Modeling Strategies: with applications to linear models, logistic regression, and survival analysis. Springer: New York, 2001.
  2. Sauerbrei W; Royston P; Binder H. Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Statistics in Medicine 2007. DOI: 10.1002/sim.3148.
  3. Miller A. Subset Selection in Regression. Taylor & Francis, 2002.
  4. Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology 1995; 6(4): 450–454.
  5. Royston P; Altman DG. Regression Using Fractional Polynomials of Continuous Covariates: Parsimonious Parametric Modelling. Appl. Statist. 1994; 43(3): 429–467.
  6. Royston P; Sauerbrei W. Multivariable model-building. A pragmatic approach to regression analysis based on fractional polynomials for continuous variables. Wiley: Chichester, 2008.
  7. de Boer C. A practical guide to splines. Revised edn. Springer: New York, 2001.
  8. Wood S. Generalized Additive Models. Chapman & Hall/CRC: New York, 2006.
  9. Hastie T; Tibshirani R. Generalized Additive Models. Chapman & Hall/CRC: New York, 1990.
  10. Abrahamowicz M; Du Berger R; Grover SA. Flexible modeling of the effects of serum cholesterol on coronary heart disease mortality. American Journal of Epidemiology 1997; 145(8): 714–729.
  11. Binder H; Sauerbrei W; Royston P. Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response. Statistics in Medicine 2013; 32(13): 2262–2277.
  12. Abrahamowicz M; MacKenzie TA. Joint estimation of time-dependent and non-linear effects of continuous covariates on survival. Statistics in Medicine 2006. DOI: 10.1002/sim.2519.