Technical Details

Model

We model n observations of a \(p\)-vector as a matrix Gaussian: \[{\bf X} \sim \mathcal{MN}_{n \times p} (\boldsymbol{\mu}, {\bf I}_n, \sigma^2 {\bf I}_p),\] where \(\boldsymbol{\mu} \in \mathbb{R}^{n \times p}\), with rows \(\mu_i\), is unknown, and \(\sigma^2 > 0\) is known. This says that the rows \(X_i \in \mathbb{R}^p\) of \({\bf X}\) are independent multivariate Gaussian random vectors with mean \(\mu_i\) and covariance matrix \(\sigma^2 {\bf I}_p\).

Problem set-up

Given any realization \({\bf x} \in \mathbb{R}^{n \times p}\) of \({\bf X}\), we consider clustering \({\bf x}\) to obtain \(\mathcal{C}({\bf x})\), a partition of \(\{1, 2, \ldots, n\}\), then using \({\bf x}\) to test, for a pair of clusters \(C_1, C_2 \in \mathcal{C}({\bf x})\): \[ H_0: \bar{\mu}_{C_1} = \bar{\mu}_{C_2} \quad \text{vs.}\quad H_1: \bar{\mu}_{C_1} \neq \bar{\mu}_{C_2}, \] where \(\bar{\mu}_{C_k} = \sum \limits_{i \in C_k} \mu_i/|C_k|\) is the mean of \(\boldsymbol \mu\) in \(C_k\). Let \(\bar{X}_{C_k} = \sum \limits_{i \in C_k} X_i/|C_k|\) be the empirical mean in \({\bf X}\) of \(C_k\). One might be tempted to simply apply a Wald test of \(H_0: \bar{\mu}_{C_1} = \bar{\mu}_{C_2}\), with p-value given by \[ \mathbb{P}_{H_0} (\|\bar{X}_{C_1} - \bar{X}_{C_2}\|_2 \geq \|\bar{x}_{C_1} - \bar{x}_{C_2}\|_2), \] where \(\|\bar{X}_{C_1} - \bar{X}_{C_2}\| \sim \left ( \sigma \sqrt{1/|C_1| + 1/|C_2|} \right ) \cdot \chi_p\). However, since the Wald test does not account for the fact that the clusters \(C_1\) and \(C_2\) were estimated from the data, it is virtually guaranteed to find a statistically significant difference between them. (Somewhat unintuitively, this problem cannot be solved by splitting the data into training and test sets; click here for an illustration.)

Our Solution

Roughly speaking, our framework is a version of the Wald test of \(H_0: \bar{\mu}_{C_1} = \bar{\mu}_{C_2}\) that conditions on the fact that we estimated \(C_1\) and \(C_2\) from the data, and therefore yields valid p-values. The p-values from our framework can be written as \[ \mathbb{P}(\phi \geq \|\bar{x}_{C_1} - \bar{x}_{C_2}\|_2 \mid \phi \in \mathcal{S}),\] where \(\phi \sim \left ( \sigma \sqrt{1/|C_1| + 1/|C_2|} \right ) \cdot \chi_p\), \[\mathcal{S} = \{\phi: \text{Clustering } {\bf x}'(\phi) \text{ yields the clusters } C_1 \text{ and } C_2 \},\] and \({\bf x}'(\phi)\) is a perturbed version of \({\bf x}\), where observations in clusters \(C_1\) and \(C_2\) have been pulled apart (if \(\phi > \|\bar{x}_{C_1} - \bar{x}_{C_2}\|_2\)) or pushed together (if \(\phi < \|\bar{x}_{C_1} - \bar{x}_{C_2}\|_2\)) in the direction of \(\bar{x}_{C_1} - \bar{x}_{C_2}\).

If we can compute the conditioning set \(\mathcal{S}\), then we can compute p-values exactly. Our software currently computes \(\mathcal{S}\) for hierarchical clustering with squared Euclidean distance and single, average, weighted, centroid, median, or Ward linkage; see Section 3 of our paper for a description of the algorithms we use to compute \(\mathcal{S}\). For hierarchical clustering with other linkages or non-hierarchical clustering methods, our software approximates the p-value using Monte Carlo sampling; details are in Section 4.1 of our paper.

Extensions

Our software can also compute p-values under the alternative model \[{\bf X} \sim \mathcal{MN}_{n \times p} (\boldsymbol{\mu}, {\bf I}_n, \boldsymbol{\Sigma}),\] where \(\boldsymbol{\Sigma}\) is a known positive definite matrix; details are in Section 4.2 of our paper.