Topic group 9: High-dimensional data

Chairs: Lisa McShane, Joerg Rahnenfuehrer
Members: Axel Benner, Harald Binder, Anne-Laure Boulesteix, Tomasz Burzykowski, Riccardo De Bin, W. Evan Johnson, Lara Lusa, Stefan Michiels, Sherri Rose, Willi Sauerbrei

The increasing availability and use of “big data”, characterized by large numbers of observations and/or variables per subject, across a range of biomedical applications has created both challenges in data handling and opportunities for the development and novel application of statistical methods and algorithms. In molecular medicine, “omics” data (e.g., genomics, transcriptomics, proteomics, and metabolomics) are ubiquitous and have stimulated extensive collaborations between statisticians, computational scientists, bioinformaticians, biologists, clinicians and other biomedical researchers. Electronic health records contain not only standard demographic, clinical, and laboratory data collected through a patient history, but also information from potentially many different providers involved in a patient’s care. Data may be represented in many different forms and derive from multiple sources. Unique opportunities exist to leverage these large databases to support programs in comparative effectiveness and health outcomes research. Simultaneously, advances in statistical methodology and machine learning methods have contributed to improved approaches for data mining, statistical inference, and prediction in the high-dimensional data setting.

The goal of the ‘high-dimensional data’ topic group of the STRATOS initiative (TG9) is to provide guidance amid the jungle of opportunities and pitfalls inherent in the analysis of high-dimensional biological and medical data. Illustrative examples based on rich high-dimensional data sets, presented together with in-depth evaluation and discussion of various statistical and computational approaches, aim to reinforce concepts and support specific recommendations for best practices. Examples will include data generated in gene expression profiling, genomic sequencing, infectious disease, and electronic health record-based studies. Data pre-processing methods and data analysis pipelines for specific types of omics data will also be demonstrated. Simulation studies, and analytical arguments when possible, will be used to assess method performance and to compare alternative approaches. Didactic material and example data sets generated by this effort will be collected in a freely available resource.

Subtopics receiving special emphasis include the following:

  • Data Pre-processing
  • Exploratory Data Analysis
  • Data Reduction
  • Multiple Testing
  • Prediction Modeling/Algorithms
  • Comparative Effectiveness and Causal Inference
  • Design Considerations
  • Data Simulation Methods
  • Resources for Publicly Available High-dimensional Data Sets 

Many of these subtopics are also relevant to other topic groups and panels of the STRATOS initiative. Thus, in some cases, the work of TG9 will build upon their results, but always with a focus on the high-dimensional data aspect.