I am an Assistant Professor in the Department of Statistics and Actuarial Science at the University of Waterloo. I received my PhD from the Department of Biostatistics at the University of Washington, where I was advised by Daniela Witten.

My research interests are in:

  1. Developing valid inferential procedures that can be applied after "double-dipping",
  2. Using statistical machine learning to solve problems in complex data settings, such as multi-view data, network data, compositional data, and spatial data, and
  3. Applying novel optimization methods to the design of experiments.


1. Valid inference after "double-dipping"

As data sets continue to grow in size, the focus of data collection in many settings has shifted away from testing pre-specified hypotheses and towards hypothesis generation. Researchers are often interested in performing an exploratory data analysis to generate hypotheses, and then testing those hypotheses on the same data; I call this "double-dipping". Unfortunately, double-dipping can lead to a highly inflated Type I error rate. Recently, I have been working on solving the double-dipping problem for hierarchical clustering and regression trees, using techniques from the selective inference and post-selection inference literature.
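The inflation can be seen in a small simulation. The sketch below is a toy illustration only, and its setup is an assumption rather than anything taken from the papers listed here: one-dimensional Gaussian noise with no true clusters, a brute-force two-cluster split standing in for a clustering algorithm, and a Welch-type test with a normal approximation for the p-value. The helper names (`welch_p_value`, `two_means_1d`, `rejection_rates`) are hypothetical.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def variance(v):
    """Sample variance with the usual n - 1 denominator."""
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def two_means_1d(x):
    """Split 1-D data into two groups minimizing within-group sum of squares."""
    xs = sorted(x)
    best_wss, best_split = None, None
    for k in range(2, len(xs) - 1):  # keep at least 2 points per group
        left, right = xs[:k], xs[k:]
        ml, mr = mean(left), mean(right)
        wss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best_wss is None or wss < best_wss:
            best_wss, best_split = wss, (left, right)
    return best_split


def rejection_rates(n=40, reps=500, alpha=0.05, seed=0):
    """Return (double-dipping rate, random-split rate) under a global null."""
    rng = random.Random(seed)
    dd = ctrl = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true clusters
        a, b = two_means_1d(x)  # groups chosen using the data
        dd += welch_p_value(a, b) < alpha
        y = x[:]
        rng.shuffle(y)  # groups chosen without looking at the values
        ctrl += welch_p_value(y[: n // 2], y[n // 2:]) < alpha
    return dd / reps, ctrl / reps


if __name__ == "__main__":
    double_dip, control = rejection_rates()
    print(f"rejection rate, data-driven groups: {double_dip:.2f}")
    print(f"rejection rate, random groups:      {control:.2f}")
```

Because the groups are chosen to be as far apart as possible, the naive test rejects far more often than the nominal 5% level even though every observation comes from the same distribution; the random split serves as a control.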


  • Lucy L. Gao, Jacob Bien, and Daniela Witten (2020+) Selective inference for hierarchical clustering, submitted. [pdf] [website] [github] [code]

2. Learning from Multi-View Data

In the multi-view data setting, multiple data sets (views) are available on a single common set of observations. For example, multivariate clinical and genomic data sets may be available on a single set of tissue samples, or we may have two network data sets that describe physical interactions and co-membership in protein complexes between a single set of proteins.


  • Lucy L. Gao, Jacob Bien and Daniela Witten (2019) Are clusterings of multiple data views independent? To appear in Biostatistics. [pdf] [cran] [code]
    [Received a 2019 ASA Biometrics Section Student Travel Award.]


  • Lucy L. Gao, Daniela Witten and Jacob Bien (2020+) Testing for association in multi-view network data. [pdf] [cran] [code]
    [Received a 2020 ASA Statistical Learning and Data Science Section Student Paper Award.]

3. Optimal Experiment Design

The number of replicates in an experiment limits the amount of information that is available, but we can maximize the information gained by carefully choosing the values of the experimental inputs. This is the central problem of optimal experiment design.
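As a toy illustration (an assumed textbook setting, not the criteria studied in the papers below), consider simple linear regression y = b0 + b1*x + error with six runs and inputs restricted to [-1, 1]. The D-criterion, det(X'X), measures how much information a design carries about the coefficients. The function name `d_criterion` and both designs are hypothetical examples.

```python
def d_criterion(xs):
    """det(X'X) for the model y = b0 + b1*x + error with design points xs.

    X has rows (1, x), so X'X = [[n, sum x], [sum x, sum x^2]].
    """
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    return n * s2 - s1 * s1


# Two candidate designs with the same number of runs on [-1, 1].
equally_spaced = [-1 + 2 * i / 5 for i in range(6)]
endpoints = [-1, -1, -1, 1, 1, 1]

if __name__ == "__main__":
    print("equally spaced:", d_criterion(equally_spaced))
    print("endpoints:     ", d_criterion(endpoints))
```

With the same six runs, placing half of them at each end of the interval gives a larger determinant than spacing them evenly, which is the sense in which a careful choice of inputs extracts more information from a fixed budget.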


  • Pengqi Liu, Lucy L. Gao and Julie Zhou (2020). R-optimal designs for multi-response regression models with multi-factors. To appear in Communications in Statistics - Theory and Methods. [pdf]
  • Lucy L. Gao and Julie Zhou (2020). Minimax D-optimal designs for multivariate regression models with multi-factors. To appear in Journal of Statistical Planning and Inference. [pdf]
  • Lucy L. Gao and Julie Zhou (2017) D-optimal designs based on the second-order least squares estimator. Statistical Papers, 58(2): 77-94.
  • Lucy L. Gao and Julie Zhou (2014) New optimal design criteria for regression models with asymmetric errors. Journal of Statistical Planning and Inference, 149: 140-151.

4. Collaborative Research

During the first year of my PhD, I collaborated with researchers at the Seattle Children’s Research Institute to characterize liver transplantation offers to pediatric patients.


  • Evelyn Hsu, Michele Shaffer, Lucy L. Gao, Christopher Sonnenday, Michael Volk, John Bucuvalas and Jennifer Lai (2017) Analysis of liver offers to pediatric candidates on the transplant wait list. Gastroenterology, 153(4): 988-995.



is an R package that computes valid p-values for a difference in means between estimated clusters in a data set. [website] [paper] [github]


is an R package for learning whether and how clusters defined with respect to different data views are associated. [paper1] [paper2] [cran]

Contact Me

Email: lucy dot gao at uwaterloo dot ca