Topic Group 3: Initial data analysis

Chairs: Marianne Huebner, Saskia le Cessie, Werner Vach
Members: Dianne Cook, Heike Hofmann, Lara Lusa, Carsten Oliver Schmidt

The initial steps of any data analysis consist of checking the consistency and accuracy of the data, describing and exploring the study sample, and preparing the data for further analyses. It is crucial that this is done before embarking on complex analyses.

The drive to obtain high-quality data should start long before data collection and include a careful database design with variable definitions, plausibility checks, date checks and a well-planned system for identifying likely data errors and resolving inconsistencies. Data cleaning, especially when integrating multiple data sources, should be done systematically and carefully (van den Broeck 2005 [1]). Once one is reasonably confident that the data are free of errors, the next step is to become familiar with the collected data and examine them for consistency of data formats, number and patterns of missing data, distributions of continuous variables (e.g. skewness, variation), and frequencies of categorical variables, including a check of group labels (Chatfield 2002 [2], Cox 2011 [3]). The inclusion and exclusion criteria used to select the subset of data analyzed in the study should be described, along with an overview of missing measurements and follow-up data (von Elm 2007 [4]). Both the raw data and the final dataset need to be saved. As a general rule, a statistical analysis plan is made and agreed upon before data collection starts, and it should not be altered without agreement of the project steering group. This should reduce the extent of data dredging or hypothesis fishing that leads to false positive studies (George 2002 [5]). It is important that the complete initial data analysis process is transparent and that researchers document all steps for reproducibility (Baggerly 2009 [6]).
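As an illustration of the screening steps described above, the following minimal Python sketch counts missing values and missing-data patterns, flags implausible values, and tabulates category labels. The records, variable names and plausibility limits are hypothetical and chosen only for the example; in a real study the limits would come from the variable definitions in the database design.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "sex": "F", "sbp": 118.0},
    {"id": 2, "age": 51, "sex": "M", "sbp": None},
    {"id": 3, "age": 212, "sex": "f", "sbp": 131.0},
    {"id": 4, "age": None, "sex": "M", "sbp": 144.0},
    {"id": 5, "age": 47, "sex": "F", "sbp": 126.0},
]

# Assumed plausibility limits (illustrative only).
PLAUSIBLE_RANGE = {"age": (0, 120), "sbp": (60.0, 260.0)}


def missing_counts(rows):
    """Number of missing values per variable."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if value is None:
                counts[name] += 1
    return dict(counts)


def missing_patterns(rows):
    """How often each combination of missing variables occurs."""
    return Counter(
        tuple(sorted(name for name, value in row.items() if value is None))
        for row in rows
    )


def out_of_range(rows, limits):
    """(id, variable, value) for observed values outside the plausibility limits."""
    flagged = []
    for row in rows:
        for name, (low, high) in limits.items():
            value = row.get(name)
            if value is not None and not (low <= value <= high):
                flagged.append((row["id"], name, value))
    return flagged


def label_frequencies(rows, name):
    """Frequency of each observed label of a categorical variable."""
    return Counter(row[name] for row in rows if row[name] is not None)


n_missing = missing_counts(records)
patterns = missing_patterns(records)
flagged = out_of_range(records, PLAUSIBLE_RANGE)
sex_labels = label_frequencies(records, "sex")

print("missing per variable:", n_missing)
print("missing patterns:", dict(patterns))
print("implausible values:", flagged)
print("labels for 'sex':", dict(sex_labels))
```

In this toy example the age of 212 is flagged for review rather than silently corrected, and the label table exposes the inconsistent coding "f" versus "F". Both are the kind of finding that would be documented and resolved before further analysis.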

One of the aims of the initial data analysis is to provide a clear description of the data in tables and figures. This can be done in many different ways: summary statistics can be reported for the total population or for subgroups; continuous variables can be summarized by means and standard deviations, by medians and percentiles, or by categorizing them. Sparse categories of a categorical variable can be combined. Medical papers often include a descriptive table of the data. In certain instances, for example when data are not missing completely at random, summary statistics of the study sample may give a biased picture of population characteristics. In this case, one has to decide whether the descriptive table should include corrections for the missing data.
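A descriptive table of this kind can be sketched as follows. The data and group names are hypothetical, and the choice of the "inclusive" quantile method is an assumption made for the example; other quantile definitions give slightly different quartiles. The sketch reports complete-case summaries and the number of missing values, without any correction for missingness.

```python
import statistics
from collections import defaultdict

# Hypothetical sample: (group, systolic blood pressure); None marks a missing value.
sample = [
    ("control", 110.0), ("control", 120.0), ("control", 130.0),
    ("control", 140.0), ("control", 150.0),
    ("treated", 100.0), ("treated", 115.0), ("treated", None),
    ("treated", 125.0), ("treated", 180.0),
]


def describe(values):
    """Summary statistics of the observed values, plus the number missing."""
    observed = sorted(v for v in values if v is not None)
    q1, median, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return {
        "n": len(observed),
        "missing": len(values) - len(observed),
        "mean": statistics.mean(observed),
        "sd": statistics.stdev(observed),
        "median": median,
        "q1": q1,
        "q3": q3,
    }


def describe_by_group(rows):
    """One summary row per group, followed by a row for the total sample."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    table = {name: describe(vals) for name, vals in groups.items()}
    table["total"] = describe([value for _, value in rows])
    return table


table = describe_by_group(sample)
for name, row in table.items():
    print(
        f"{name:8s} n={row['n']} missing={row['missing']} "
        f"mean={row['mean']:.1f} sd={row['sd']:.1f} "
        f"median={row['median']:.1f} IQR=({row['q1']:.1f}, {row['q3']:.1f})"
    )
```

Here both groups have the same mean (130.0) but different medians and spreads, which shows why the choice between means with standard deviations and medians with percentiles matters for skewed variables.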

Another step is the preparation of the data for more advanced analyses. In this step, decisions have to be made about how variables are used in further analyses. Variables can be used in their raw form, or they may be transformed, categorized, rescaled or standardized. They may be used as single variables, combined into summary scores, or entered as more complex functions, e.g. as ratios (Vach 2013 [7]). Further, procedures to handle missing values and outliers should be clarified.
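The options listed above can be made concrete with a small Python sketch. All values and variable names are hypothetical, and the cut-points are an illustrative assumption rather than a recommendation. The example shows a log transformation of a skewed variable, standardization, a ratio of two variables (body mass index) and a categorization of that ratio.

```python
import math
import statistics

# Hypothetical measurements for five participants.
weight_kg = [62.0, 85.0, 70.0, 98.0, 55.0]
height_m = [1.68, 1.80, 1.75, 1.82, 1.60]
crp_mg_l = [0.8, 3.2, 1.5, 12.0, 0.4]  # a right-skewed biomarker

# Assumed cut-points and labels, for illustration only.
BMI_CUTS = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]


def standardize(values):
    """Rescale to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def categorize(value, cutpoints, labels):
    """Return the label of the first interval whose upper cut-point exceeds value."""
    for cut, label in zip(cutpoints, labels):
        if value < cut:
            return label
    return labels[-1]


# Transformation: the natural log reduces right skewness.
log_crp = [math.log(v) for v in crp_mg_l]

# Standardization: puts variables on a common scale.
z_weight = standardize(weight_kg)

# Ratio of two raw variables: body mass index in kg/m^2.
bmi = [w / h**2 for w, h in zip(weight_kg, height_m)]

# Categorization of the derived variable.
bmi_class = [categorize(b, BMI_CUTS, BMI_LABELS) for b in bmi]

for b, c in zip(bmi, bmi_class):
    print(f"BMI {b:.1f} -> {c}")
```

Each choice changes what a later regression coefficient means: a coefficient for the standardized weight refers to one standard deviation, while categorization discards information within each category. This is why such decisions are best fixed and documented before the main analysis.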

This topic group aims to provide guidance for all of the above-mentioned steps in the initial data analysis. We will discuss the advantages and limitations of different approaches. Recommendations will be based on an overview of the existing literature and on feedback from experienced researchers.

  1. van den Broeck J, Cunningham SA, Eeckels R, Herbst K. Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Medicine 2005; 2(10):e267.
  2. Chatfield C. Confessions of a pragmatic statistician. Journal of the Royal Statistical Society. Series D (The Statistician) 2002; 51(Part 1):1–20.
  3. Cox D, Donnelly C. Preliminary analysis. In Principles of Applied Statistics. Cambridge University Press: Cambridge, 2011.
  4. Elm E von, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. STROBE initiative. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology (Cambridge, Mass.) 2007; 18(6):800–804.
  5. George DS, Shah E. Data dredging, bias, or confounding. BMJ 2002; 325(7378):1437–1438.
  6. Baggerly KA, Coombes KR. Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics 2009; 3(4):1309–1334.
  7. Vach W. Transformation of covariates. In Regression Models as a Tool in Medical Research. Taylor & Francis Group: Boca Raton, FL, USA, 2013; 264–273.