# STAT 548 PhD Qualifying Course Papers

## Choosing a paper

At the end of this document is a list of papers that I am interested in supervising as Qualifying Papers (QPs). I am happy to discuss any other paper that you are interested in and think might be appropriate.

I am interested in almost all problems involving answering scientific questions using complex data sets in biomedical science. My research often involves statistical learning (especially unsupervised learning) and selective inference. Recently, I have been especially interested in applications in single-cell sequencing.

## General Expectations

If you are interested in doing a QP with me, the first step is to email me to schedule a one-on-one meeting. Please use the words “Qualifying Paper” in the subject line of your email. At our first meeting, please be prepared to discuss:

Your background.

Your long-term research interests (it’s okay if these are not yet well-defined).

Why you are interested in the particular paper/project.

When you will submit your report (typically about four-six weeks after we meet).

The details of the QP project and report.

Any concerns or questions you may have.

In the process of working on the QP project and report, you may find yourself needing guidance or clarification. We can meet again as many times as needed before the report due date for this purpose.

## Project and Report

Structure your report as follows:

**Problem definition**(1 page): Extract mathematical/statistical problems from the paper and organize them.**Significance**(1-2 paragraphs): Why is this an interesting and/or important problem? What can be learned by studying this problem? Why is it exciting for you? Author contribution: How did the author(s) find the solution? What was a novel contribution beyond traditional approaches?**Limitations/challenges**(1-2 paragraphs): What are the assumptions? Are they realistic? What are the technical limitations that the authors acknowledge or not?**Paper-specific project**(3 pages max): I have paired each available QP with specific prompts designed to help inspire related research projects. Details vary and are discussed below. If there are multiple prompts, then pick one. You can also come up with your own prompt - seemingly unrelated ideas inspired by the original QP are fine here - but make sure I approve of the prompt before you start working on it. The aim of this section is to get you thinking creatively about research, and to begin developing the skills necessary for writing research proposals.**Discussion**(1 page): Briefly discuss what you have learned and what you would achieve if you were to develop your paper-specific project into a full paper.

When you are ready to submit, provide me with:

The report in LaTeX (everything needed to generate the pdf report plus the pdf report itself).

All code needed to reproduce all experimental/numerical results in the report. All code should have informative file names (e.g. Figure1.R) and be clearly commented and documented. Code can be in any language you wish, though I strongly prefer R.

## Available Papers

~~1. (TAKEN) Data fission: an alternative to sample splitting~~

**Paper**: Leiner, Duan, Wasserman, and Ramdas. Data fission: splitting a single data point.

**Note**: Since this paper is rather long, focus your attention on a single application. If you’re not sure which to pick, then I suggest the application in Section 4.

**Project Idea 1:** The main idea is to explore other applications of data fission.

Describe a data analysis task (different from the applications explored in Sections 3-5) where you think that data fission would be useful.

Make sure to discuss

*why*you think data fission would be useful. What might go wrong if you didn’t use data fission? It will likely be helpful to conduct some simple simulation studies to demonstrate what problems occur if you don’t use data fission in your setting.Describe how you would go about applying data fission to your chosen setting.

**Project Idea 2:** The main idea is to explore the robustness of data fission.

For the single application you choose to focus on in your QP, evaluate the robustness of data fission against distributional assumptions via simulation. For example, what if we apply data fission assuming that the data are Gaussian, but the data are really Poisson? (You can safely ignore silly cases that would lead to you being unable to apply data fission at all - for example, you couldn’t apply the “Poisson” version of data-fission to Gaussian data because you can’t do the binomial sub-sampling on non-integer data.)

Write a report about your simulation study. This should include a description of the simulation set-up, a description of the results including tables and figures as appropriate, and a summary of what you learned from the simulation study.

### 2. Valid inference after outlier removal

**Paper**: Chen and Bien, Valid inference corrected for outlier removal.

**Note**: You do not need to fully understand Section 3.2. You can also focus your attention on the easiest case of testing an individual regression coefficient, though the paper is more general than that.

**Project Idea:** Linear regression analyses are far from the only analysis where data analysts remove outliers before performing inference. Pick an analysis where you might see analysts remove outliers before performing inference. Then, ask yourself: does this outlier removal step lead to to undesirable properties for the inference step? Explain why or why not. Support your claim with arguments based on statistical reasoning and/or results on simulated data.

### 3. Fast interpretable greedy-tree sums

**Paper**: Tan, Singh, Nasseri, Agarwal, and Yu. Fast Interpretable Greedy-Tree Sums (FIGS)

**Project Idea:** In the discussion, the authors say that “the class of FIGS models could be further extended to include linear terms”. Discuss the potential benefits and drawbacks of this extension. To help inform the potential benefits part of this discussion, you may want to re-visit the diabetes data set and assess the generalization performance of:

A (generalized) linear model that includes BMI, age, and a trichotomized version of the plasma glucose variable into <= 128, 128 < PG < 166, and >= 166

A (generalized) linear model that includes BMI, a dummy variable for Age >= 30, and a trichotomized version of the plasma glucose variable into <= 128, 128 < PG < 166, and >= 166

A (generalized) linear model that includes a dummy variable for BMI >= 30, age, and a trichotomized version of the plasma glucose variable into <= 128, 128 < PG < 166, and >= 166

Other combinations of linear and piecewise constant terms as you see fit.

It might also be helpful to repeat this idea on other data sets used as examples in the paper.

~~4. (TAKEN) Post-prediction inference~~

**Paper**: Wang, McCormick, and Leek. Methods for correcting inference based on outcomes predicted by machine learning

**Project Idea:** The authors focus on what goes wrong when you plug in machine learning predictions as *outcomes* in regression, and propose a way to fix it. What if you plug in machine learning predictions as the *covariates*?

Ask yourself: what (if anything) goes wrong? Support your claim with arguments based on statistical reasoning and/or results on simulated data. There may very well be related literature as well for you to incorporate here in areas like measurement error.

Do you think it would be straightforward to extend the authors’ method to the case where you plug in machine learning predictions as covariates rather than outcomes? If yes, give a high-level overview on how you would go about extending the method. If no, then explain why.

Attribution: Much of this document is copied (or very slightly adapted) from Daniel, Yongjin, Trevor, and Ben. Many thanks to them!