Current research profile

The core of my research program investigates model validation and inference in the context of statistical machine learning. As scientists fit increasingly complex models to their data, there is an ever-growing need for methods that quantify our uncertainty about the quality of these models, and for methods that formally test the results they suggest. My work seeks to fill this gap, especially in unsupervised learning, where existing methods for validation and inference are limited compared to the supervised learning setting. My research program has expanded our statistical toolkit for using a single data set to fit a supervised or unsupervised learning model and to (1) test hypotheses suggested by that fitted model, (2) assess the stability of that fitted model, or (3) accurately estimate the prediction error of that fitted model. The methods developed in the course of my research program are relevant to many application domains, and have had particular impact in single-cell genomics, where the outputs of unsupervised learning models are routinely used as surrogates for unobserved aspects of cell states.

I have secondary interests in developing theory and methodology for optimal experimental design and for the selection of hyperparameters in statistical machine learning algorithms. These diverse areas are unified by my interest in adapting key ideas from the mathematical optimization community to statistical problems.


In the following, * represents alphabetical author ordering.


Yiqun T. Chen and Lucy L. Gao (2023) Testing for a difference in means of a single feature after clustering. [pdf]

Lucy L. Gao*, Jane J. Ye*, Haian Yin*, Shangzhi Zeng*, and Jin Zhang* (2023) Moreau Envelope Based Difference-of-weakly-Convex Reformulation and Algorithm for Bilevel Programs. [pdf]

Abigail Keller, Lucy L. Gao, Daniela Witten, and Maitreya J. Dunham (2023) Condition-dependent fitness effects of large synthetic chromosome amplifications. [pdf]

Ameer Dharamshi, Anna Neufeld, Keshav Motwani, Lucy L. Gao, Daniela Witten, and Jacob Bien (2023) Generalized data thinning using sufficient statistics. [pdf]


Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, and Daniela Witten (2024+) Data thinning for convolution-closed distributions. To appear in Journal of Machine Learning Research. [pdf] [software]

Lucy L. Gao*, Jane J. Ye*, Shangzhi Zeng*, and Julie Zhou* (2024+) Necessary and sufficient conditions for multiple objective optimal regression designs. To appear in Statistica Sinica. [pdf] [code]


Lucy L. Gao, Jacob Bien, and Daniela Witten (2024) Selective inference for hierarchical clustering. Journal of the American Statistical Association, 119(545), 332-342. [pdf] [software]

Anna Neufeld, Lucy L. Gao, Joshua Popp, Alexis Battle, and Daniela Witten (2024) Inference after latent variable estimation for single-cell RNA-sequencing data. Biostatistics, 25(1), 270-287. [pdf] [software]


Anna Neufeld, Lucy L. Gao, and Daniela Witten (2022) Tree-values: selective inference for regression trees. Journal of Machine Learning Research, 23(305), 1-43. [pdf] [software]

Lucy L. Gao, Daniela Witten and Jacob Bien (2022) Testing for association in multi-view network data. Biometrics, 78(3), 1018-1030. [pdf] [software]

Lucy L. Gao*, Jane J. Ye*, Haian Yin*, Shangzhi Zeng*, and Jin Zhang* (2022) Value function based difference-of-convex algorithm for bilevel hyperparameter selection problems. Proceedings of the International Conference on Machine Learning (ICML) 2022. [pdf] [code]

Pengqi Liu, Lucy L. Gao, and Julie Zhou (2022) R-optimal designs for multi-response regression models with multi-factors. Communications in Statistics - Theory and Methods, 51(2), 340-355. [pdf]


Lucy L. Gao, Jacob Bien, and Daniela Witten (2020) Are clusterings of multiple data views independent? Biostatistics, 21(4), 692-708. [pdf] [software]

Lucy L. Gao and Julie Zhou (2020) Minimax D-optimal designs for multivariate regression models with multi-factors. Journal of Statistical Planning and Inference. [pdf]


Evelyn Hsu, Michele Shaffer, Lucy L. Gao, Christopher Sonnenday, Michael Volk, John Bucuvalas, and Jennifer Lai (2017) Analysis of liver offers to pediatric candidates on the transplant wait list. Gastroenterology, 153(4), 988-995.

Lucy L. Gao* and Julie Zhou* (2017) D-optimal designs based on the second-order least squares estimator. Statistical Papers, 58(2), 77-94. [pdf]

Lucy L. Gao and Julie Zhou (2014) New optimal design criteria for regression models with asymmetric errors. Journal of Statistical Planning and Inference, 149, 140-151.