Sparse Mean Estimation in Adversarial Settings via Incremental Learning

Jianhao Ma
University of Pennsylvania
[email protected]
   Rui Ray Chen
Tsinghua University
[email protected]
   Yinghui He
Princeton University
[email protected]
   Salar Fattahi
University of Michigan
[email protected]
   Wei Hu
University of Michigan
[email protected]
Abstract

In this paper, we study the problem of sparse mean estimation under adversarial corruptions, where the goal is to estimate the kk-sparse mean of a heavy-tailed distribution from samples contaminated by adversarial noise. Existing methods face two key limitations: they require prior knowledge of the sparsity level kk and scale poorly to high-dimensional settings. We propose a simple and scalable estimator that addresses both challenges. Specifically, it learns the kk-sparse mean without knowing kk in advance and operates in near-linear time and memory with respect to the ambient dimension. Under a moderate signal-to-noise ratio, our method achieves the optimal statistical rate, matching the information-theoretic lower bound. Extensive simulations corroborate our theoretical guarantees. At the heart of our approach is an incremental learning phenomenon: we show that a basic subgradient method applied to a nonconvex two-layer formulation with an 1\ell_{1}-loss can incrementally learn the kk nonzero components of the true mean while suppressing the rest. More broadly, our work is the first to reveal the incremental learning phenomenon of the subgradient method in the presence of heavy-tailed distributions and adversarial corruption.

1 Introduction

Almost all statistical methods rely explicitly or implicitly on certain assumptions on the distribution of the data. In practice, however, these assumptions are only approximately satisfied, mainly due to the presence of heavy-tailed distributions and adversarial corruptions [Rou+11]. To resolve these issues, the field of robust statistics has been developed to construct estimators that exhibit “insensitivity to small deviations from the (model) assumptions[Hub11, p.2]. Robust statistics has a long history with the fundamental work of John Tukey [Tuk60, Tuk62], Peter Huber [Hub64, Hub67], and Frank Hampel [Ham71, Ham74]. It has been applied across various domains, such as biology, finance, and computer science [Rou+11].

Nonetheless, in high-dimensional scenarios, robust statistics contend with the curse of dimensionality. Firstly, the majority of estimators in the literature demand exponential runtime with respect to data dimension. To resolve this problem, special attention has been devoted to algorithmic robust statistics, which aims to design efficient algorithms for different tasks in the high-dimensional robust statistics (see the recent book [DK23] and survey paper [DK19]). Secondly, generic high-dimensional robust statistical tasks are often oblivious to the intrinsic structure of the data. As such, they rely on overly conservative sample sizes that have an undesirable dependency on the data dimension.

In this paper, we aim to address these challenges for one of the most fundamental problems in robust statistics, namely robust sparse mean estimation. More specifically, given an ϵ\epsilon-corrupted set of samples from an unknown and possibly heavy-tailed distribution \mathbb{P} with a kk-sparse mean μ=𝔼[X]d\mu^{\star}=\mathbb{E}[X]\in\mathbb{R}^{d}, our goal is to design a computationally and statistically efficient estimator μ^\hat{\mu} of the mean μ\mu^{\star}. Throughout this paper, we focus on the so-called strong contamination model [DK23, Definition 1.6] for the corruption in the data, which encompasses a variety of existing models, such as Huber’s contamination model [Hub64].

Definition 1.1 (Strong contamination model).

Given a corruption parameter ϵ(0,ϵ0)\epsilon\in(0,\epsilon_{0}) and distribution \mathbb{P}, the ϵ\epsilon-corrupted samples are generated as follows: (i) the algorithm specifies the number of samples nn and then nn i.i.d. samples are drawn from \mathbb{P}. (ii) An arbitrarily powerful adversary then inspects the samples, removes ϵn\epsilon n of them, and replaces them with arbitrary points. The resulting ϵ\epsilon-corrupted samples are given to the algorithm.

Designing a statistically and computationally efficient estimator for the mean is highly nontrivial in this setting due to the following reasons. First, contrary to the robust (dense) mean estimation, there is a conjectured computational-statistical tradeoff [DKS17, BB19, BB20] for the robust kk-sparse mean estimation, which asserts that any efficient algorithm needs Ω~(k2)\tilde{\Omega}(k^{2}) samples, while its statistically-optimal (but possibly inefficient) counterpart only requires 𝒪~(k)\tilde{\mathcal{O}}(k) samples. This conjecture has neither been proved nor refuted. Second, most existing mean estimators are designed for light-tailed distributions [Bal+17, Dia+19a, Che+21]. The only two efficient estimators available for heavy-tailed distributions [Dia+22a, Dia+22], however, are impractical for real-world applications, as they rely on computationally intensive techniques such as the ellipsoid algorithm and the sum-of-squares method. A fundamental question thus arises:

Can we design a practically efficient estimator for the robust sparse mean estimation problem that overcomes the conjectured computational-statistical tradeoff?

In this work, we provide an affirmative answer to this question under moderate assumptions. Our proposed approach comprises two stages. In the first stage, we provide a coarse-grained estimation of the mean that is enough to identify the top-kk nonzero elements of the mean. In particular, we show that a simple subgradient method applied to a two-layer diagonal linear neural network with 1\ell_{1}-loss can identify the top-kk nonzero elements of the mean incrementally and sequentially while keeping the zero entries arbitrarily small. After the identification of the top-kk nonzero elements, in the second stage, we provide a finer-grained estimation of the nonzero elements of the mean by employing a generic robust mean estimator—such as those introduced in [DK19, Che+20]—restricted to the top-kk nonzero elements, thereby reducing the effective dimension of the problem from dd to kk. Our proposed approach achieves optimal statistical error, sample complexity, and computational cost under moderate assumptions. Furthermore, we demonstrate that these assumptions do not alter the inherent complexity of the problem, as evidenced by a matching information-theoretic lower bound. Table 1 provides a summary of our results compared to the existing estimators. Our contributions are summarized below:

  • -

    Overcoming the computational-statistical tradeoff. We demonstrate that our algorithm can surpass the conjectured computational-statistical tradeoff under additional conditions. At a high level, we require an ϵ\epsilon-dependent upper bound for the coordinate-wise third moment and a lower bound for the signal-to-noise ratio (SNR). Additionally, we demonstrate that our algorithm matches the information-theoretic lower bound under exactly the same conditions.

  • -

    Near-linear dependency on the dimension. The first stage of our algorithm is coordinate-wise decomposable and fully parallelizable. Therefore, it runs in 𝒪~(d)\tilde{\mathcal{O}}(d) time and memory on a single thread, and in 𝒪~(d/K)\tilde{\mathcal{O}}(d/K) time and 𝒪~(d)\tilde{\mathcal{O}}(d) memory on KK threads. Moreover, the computational cost of the second stage of our algorithm is independent of dd. In contrast, the existing robust sparse mean estimators have a poor dependency on dd (see Table 1).

  • -

    No prior knowledge on the sparsity level. Our method does not require prior knowledge of the sparsity level kk. In contrast, all existing methods for robust sparse mean estimation (in both light- and heavy-tailed settings) require knowledge of the sparsity level kk.

  • -

    Superior practical performance. Through extensive experiments, we show that, despite its simplicity, our method performs well across a broad class of heavy-tailed distributions, including those with unbounded variance.

Algorithm 2\ell_{2}-error Sample complexity Running time
Lower bound Ω(ϵ)\Omega(\sqrt{\epsilon}) Ω~(k/ϵ)\tilde{\Omega}(k/\epsilon) -
[Dep20, PBR20] 𝒪(ϵ)\mathcal{O}(\sqrt{\epsilon}) 𝒪~(k/ϵ)\tilde{\mathcal{O}}(k/\epsilon) exp(d)\exp(d)
[Dia+22a] 𝒪(ϵ)\mathcal{O}(\sqrt{\epsilon}) 𝒪~(k2/ϵ)\tilde{\mathcal{O}}\left(k^{2}/\epsilon\right) poly(d)\operatorname{poly}(d)
[Dia+22] 𝒪(ϵ)\mathcal{O}(\sqrt{\epsilon}) 𝒪~(k𝒪(1)/ϵ)\tilde{\mathcal{O}}(k^{\mathcal{O}(1)}/\epsilon) poly(d)\operatorname{poly}(d)
Ours (Stage 1) 𝒪(kϵ)\mathcal{O}(\sqrt{k\epsilon}) 𝒪~(1/ϵ)\tilde{\mathcal{O}}(1/\epsilon) 𝒪~(d)\tilde{\mathcal{O}}(d)
Ours (full) 𝒪(ϵ)\mathcal{O}(\sqrt{\epsilon}) 𝒪~(k/ϵ)\tilde{\mathcal{O}}(k/\epsilon) 𝒪~(d)\tilde{\mathcal{O}}(d)
Table 1: Comparisons between different algorithms for robust sparse mean estimation. Here, kk represents the sparsity level, dd is the ambient dimension, and ϵ\epsilon denotes the corruption ratio. We use Ω~()\tilde{\Omega}(\cdot) and 𝒪~()\tilde{\mathcal{O}}(\cdot) to hide logarithmic factors. For simplicity, the dependency on the sample size is omitted in the above comparisons. Our algorithms require some mild assumptions as detailed in Theorem 4.1.

2 Related Work

Robust (sparse) mean estimation.

Robust mean estimation is a fundamental problem in statistics, with its earliest work dating back to [Tuk60, Hub64]. However, throughout its extensive history [Yat85, DL88, DG92, Hub11], and even up to recent times [LM19, LM19b, Dep20, PBR20], most statisticians have primarily focused on developing statistically optimal estimators, often overlooking the fact that these estimators can be computationally inefficient. It is only recently, following the seminal work of [LRV16, Dia+19], that researchers have started to develop polynomial-time algorithms for robust mean estimation [Dia+17, SCV17, CDG19] as well as other robust learning tasks, including robust PCA [Bal+17] and robust regression [CCM13].

Robust sparse mean estimation, as a distinct variant, has attracted considerable attention, particularly in extremely high-dimensional settings. However, the situation for robust sparse mean estimation is more nuanced compared to the dense case. Firstly, unlike the dense case, there is a conjectured computational-statistical tradeoff [DKS17, BB19, BB20], suggesting that efficient algorithms demand a qualitatively larger sample complexity than their inefficient counterparts. In particular, there is evidence that such a tradeoff is unavoidable for Stochastic Query (SQ) algorithms [DKS17]. On the other hand, most prior works have primarily concentrated on the light-tailed setting [Bal+17, Dia+19a, Che+21]. Researchers have only recently addressed the heavy-tailed setting using stability-based approaches [Dia+22a] and sum-of-squares methods [Dia+22]. While these algorithms are polynomial-time, they may not be practical when dealing with high-dimensional settings.

Incremental learning.

Over the past few years, it has been shown practically and theoretically that gradient-based methods tend to explore the solution space in an incremental order of complexity, ultimately favoring low-complexity solutions in numerous machine learning tasks [GSD19, MLF25]. This phenomenon is known as incremental learning. Specifically, researchers have investigated incremental learning in various contexts, such as matrix factorization and its variants [LLL20, MGF22, Jin+23], tensor factorization [RMC21, RMC22, MGF22], deep linear networks [Aro+19, GBL19, Li+21, MF22], and general neural networks [Hu+20, Fre+22]. In essence, incremental learning is believed to be crucial for understanding the empirical success of optimization and generalization in contemporary machine learning [GSD19]. However, to the best of our knowledge, its emergence in adversarial settings remains unexplored.

Notation:

We use the notations a(n)b(n)a(n)\lesssim b(n) and a(n)=𝒪(b(n))a(n)=\mathcal{O}(b(n)) to denote a(n)Cb(n)a(n)\leq Cb(n), for a universal constant CC and sufficiently large nn. Similarly, the notations a(n)b(n)a(n)\gtrsim b(n) and a(n)=Ω(b(n))a(n)=\Omega(b(n)) are used to denote a(n)Cb(n)a(n)\geq Cb(n), for a universal constant CC and sufficiently large nn. The notation a=Θ(b)a=\Theta(b) is used to denote a=𝒪(b)a=\mathcal{O}(b) and b=𝒪(a)b=\mathcal{O}(a). Moreover, the notation a(n)=o(b(n))a(n)=o(b(n)) implies that limn+a(n)/b(n)=0\lim_{n\to+\infty}a(n)/b(n)=0. The sign()\operatorname{sign}(\cdot) function is defined as sign(x)=x/|x|\operatorname{sign}(x)=x/|x| if x0x\neq 0, and sign(0)=[1,1]\operatorname{sign}(0)=[-1,1]. We also define sign~(x)=x/|x|\operatorname{\widetilde{sign}}(x)=x/|x| if x0x\neq 0, and sign~(0)=0\operatorname{\widetilde{sign}}(0)=0. Given a set 𝒳\mathcal{X}, the indicator function 𝕀𝒳()\mathbb{I}_{\mathcal{X}}(\cdot) is defined as 𝕀𝒳(x)=1\mathbb{I}_{\mathcal{X}}(x)=1 if x𝒳x\in\mathcal{X}, and 𝕀𝒳(x)=0\mathbb{I}_{\mathcal{X}}(x)=0 otherwise. Similarly, and with a slight abuse of notation, for an event \mathcal{E}, we define the indicator function 𝕀()=1\mathbb{I}(\mathcal{E})=1 if \mathcal{E} occurs, and 𝕀()=0\mathbb{I}(\mathcal{E})=0 otherwise. We denote [n]:={1,2,,n}[n]:=\{1,2,\cdots,n\}. For two functions f,g:df,g:\mathbb{R}^{d}\to\mathbb{R}, we define fg=supxd|f(x)g(x)|\left\lVert f-g\right\rVert_{\infty}=\sup_{x\in\mathbb{R}^{d}}|f(x)-g(x)|. For two vectors x,ydx,y\in\mathbb{R}^{d}, their Hadamard product is defined as xy=[x1y1xdyd]x\odot y=[x_{1}y_{1}\ \cdots\ x_{d}y_{d}]^{\top}. For a vector xdx\in\mathbb{R}^{d}, we define x2=[x12,,xd2]x^{2}=[x_{1}^{2},\cdots,x_{d}^{2}]^{\top}. For a vector xdx\in\mathbb{R}^{d} and index set II with size kk, the notation [x]Ik[x]_{I}\in\mathbb{R}^{k} refers to the projection of xx onto II. Moreover, we define xy=min{x,y}x\wedge y=\min\{x,y\}. We represent mixtures of probability distributions as linear combinations of their corresponding density functions. For example, given two distributions 1\mathbb{P}_{1} and 2\mathbb{P}_{2} and a scalar 0ϵ10\leq\epsilon\leq 1, we define the mixture 3=(1ϵ)1+ϵ2\mathbb{P}_{3}=(1-\epsilon)\mathbb{P}_{1}+\epsilon\mathbb{P}_{2}. A sample from 3\mathbb{P}_{3} is drawn from 1\mathbb{P}_{1} with probability 1ϵ1-\epsilon and from 2\mathbb{P}_{2} with probability ϵ\epsilon.

3 Overview of Our Approach

To lay the groundwork, we begin by introducing the standard median-of-means (MoM) estimator [NY83, JVV86, AMS96] originally designed for estimating the mean of a one-dimensional random variable. MoM estimator serves as a cornerstone for more sophisticated methods as detailed in [LM19b, Pra+20, LL20, Dia+22a].

Definition 3.1 (Median-of-means estimator for one-dimensional case).

Given a set of ϵ\epsilon-corrupted samples S={X1,,Xn}S=\{X_{1},\cdots,X_{n}\}\subset\mathbb{R}, we first partition them into JJ subgroups S1,,SJS_{1},\cdots,S_{J} with equal sizes, where we assume nn is divisible by JJ for simplicity. We then calculate the sample mean for each subgroup, i.e., X¯j=1BiSjXi\bar{X}_{j}=\frac{1}{B}\sum_{i\in S_{j}}X_{i} where B=n/JB=n/J. Subsequently, the median-of-means (MoM) estimator is obtained by taking the median of the sample means X¯1,,X¯J\bar{X}_{1},\cdots,\bar{X}_{J}, i.e., μ^MoM=median{X¯1,,X¯J}\hat{\mu}_{\textsf{MoM}}=\operatorname{median}\left\{\bar{X}_{1},\cdots,\bar{X}_{J}\right\}.

Alternatively, the MoM estimator can be expressed as the minimizer of the following 1\ell_{1}-loss:

μ^MoM=argminμ1Jj=1J|X¯jμ|.\hat{\mu}_{\textsf{MoM}}=\operatorname*{arg\,min}_{\mu\in\mathbb{R}}\frac{1}{J}\sum_{j=1}^{J}\left|\bar{X}_{j}-\mu\right|. (1)

By appropriately selecting the number of subgroups JJ, it can be shown that the MoM estimator matches the information-theoretic lower bound Ω(σϵ)\Omega(\sigma\sqrt{\epsilon}) for heavy-tailed distributions under the strong contamination model (Definition 1.1).

Proposition 3.1 (One-dimensional MoM estimator).

Consider a corruption parameter ϵ\epsilon, a failure probability δ\delta, and a set SS of nn many ϵ\epsilon-corrupted samples from a distribution \mathbb{P} with mean μ\mu^{\star}\in\mathbb{R} and variance 𝔼[(Xμ)2]σ2\mathbb{E}[(X-\mu^{\star})^{2}]\leq\sigma^{2}. Suppose that nlog(1/δ)/ϵn\gtrsim\log(1/\delta)/\epsilon. Then, upon choosing the number of subgroups J=Θ(ϵn+log(1/δ))J=\Theta(\lceil\epsilon n\rceil+\log(1/\delta)), with probability at least 1δ1-\delta over the sample set SS, the MoM estimator μ^MoM\hat{\mu}_{\textsf{MoM}} satisfies |μ^MoMμ|=𝒪(σϵ)|\hat{\mu}_{\textsf{MoM}}-\mu^{\star}|=\mathcal{O}\left(\sigma\sqrt{\epsilon}\right).

A more precise statement of Proposition 3.1 and its proof are presented in Appendix A.

Naively applying MoM estimator to different coordinates of a high-dimensional random variable leads to an undesirable dependency on the dimension dd. More precisely, the coordinate-wise MoM, which corresponds to the solution to the following convex optimization

μ^MoM=argminμdcvx(μ):=1Jj=1JX¯jμ1,\displaystyle\hat{\mu}_{\textsf{MoM}}=\operatorname*{arg\,min}_{\mu\in\mathbb{R}^{d}}\mathcal{L}_{\textsf{cvx}}(\mu):=\frac{1}{J}\sum_{j=1}^{J}\left\lVert\bar{X}_{j}-\mu\right\rVert_{1}, (cvx)

suffers from a suboptimal error rate of μ^MoMμ2=𝒪(σdϵ)\left\lVert\hat{\mu}_{\textsf{MoM}}-\mu^{\star}\right\rVert_{2}=\mathcal{O}(\sigma\sqrt{d\epsilon}) (see Theorem A.1 in Appendix A). This error is unavoidable for the MoM estimator since the coordinate-wise error 𝒪(σϵ)\mathcal{O}(\sigma\sqrt{\epsilon}) is uniformly distributed across each coordinate. An alternative approach, the geometric MoM [Min15], which replaces the 1\left\lVert\cdot\right\rVert_{1} in  cvx by 2\left\lVert\cdot\right\rVert_{2}, also suffers from a similar error.

Two-layer model

To address the above issue, we model the mean μ\mu as a two-layer model u2v2u^{2}-v^{2} for u,vdu,v\in\mathbb{R}^{d}, and obtain (u,v)(u,v) by minimizing the following nonconvex 1\ell_{1}-loss

minu,vdncvx(u,v)=12Jj=1JX¯j(u2v2)1.\displaystyle\min_{u,v\in\mathbb{R}^{d}}\mathcal{L}_{\textsf{ncvx}}(u,v)=\frac{1}{2J}\sum_{j=1}^{J}\left\lVert\bar{X}_{j}-\left(u^{2}-v^{2}\right)\right\rVert_{1}. (ncvx)

To solve this optimization problem, we propose a subgradient method (SubGM) with small initialization u(0)=v(0)=α1u(0)=v(0)=\alpha\vec{1}, where 1=[1,,1]d\vec{1}=[1,\cdots,1]^{\top}\in\mathbb{R}^{d} and α>0\alpha>0 is a sufficiently small factor. At each iteration, SubGM updates the solution as

u(t+1)\displaystyle u(t+1) =u(t)ηg(t)whereg(t)uncvx(u(t),v(t)),\displaystyle=u(t)-\eta g(t)\quad\text{where}\quad g(t)\in\partial_{u}\mathcal{L}_{\textsf{ncvx}}(u(t),v(t)),
v(t+1)\displaystyle v(t+1) =v(t)ηh(t)whereh(t)vncvx(u(t),v(t)).\displaystyle=v(t)-\eta h(t)\quad\text{where}\quad h(t)\in\partial_{v}\mathcal{L}_{\textsf{ncvx}}(u(t),v(t)). (SubGM)

Here, η>0\eta>0 is the stepsize, and uncvx(u,v)\partial_{u}\mathcal{L}_{\textsf{ncvx}}(u,v) and vncvx(u,v)\partial_{v}\mathcal{L}_{\textsf{ncvx}}(u,v) indicate the (Clarke) subdifferentials of ncvx\mathcal{L}_{\textsf{ncvx}}, defined as:

uncvx(u,v)\displaystyle\partial_{u}\mathcal{L}_{\textsf{ncvx}}(u,v) =1Jj=1Jsign(u2v2X¯j)u,\displaystyle=\frac{1}{J}\sum_{j=1}^{J}\operatorname{sign}(u^{2}-v^{2}-\bar{X}_{j})\odot u, (2)
vncvx(u,v)\displaystyle\partial_{v}\mathcal{L}_{\textsf{ncvx}}(u,v) =1Jj=1Jsign(u2v2X¯j)v.\displaystyle=-\frac{1}{J}\sum_{j=1}^{J}\operatorname{sign}(u^{2}-v^{2}-\bar{X}_{j})\odot v. (3)

The detailed implementation of our proposed algorithm is presented in Algorithm 1.

Algorithm 1 Robust sparse mean estimation via incremental learning
1:Input: dataset SS, corruption parameter ϵ\epsilon, failure probability δ\delta, initialization scale α\alpha, stepsize η\eta, and iteration time T[2ηlog(1/α),6ηlog(1/α)]T\in\left[\frac{2}{\eta}\log(1/\alpha),\frac{6}{\eta}\log(1/\alpha)\right].
2:Stage 1 (SubGM):
3:  Pre-processing: Divide the dataset into JJ equal subgroups S1,,SJS_{1},\cdots,S_{J}, where J=100ϵnJ=100\lceil\epsilon n\rceil. Calculate the sample means X¯j=JniSjXi\bar{X}_{j}=\frac{J}{n}\sum_{i\in S_{j}}X_{i}.
4:  Initialization: u(0)=v(0)=α.1u(0)=v(0)=\alpha.\vec{1}
5:  for t=1,,Tt=1,\cdots,T do
6:   Update u(t),v(t)u(t),v(t) via SubGM.
7:  end for
8:  Identification of top-kk elements: Calculate I={i[d]:|ui2(T)vi2(T)|α}I=\left\{i\in[d]:|u^{2}_{i}(T)-v^{2}_{i}(T)|\geq\alpha\right\}.
9:  Return μ^(T)=u2(T)v2(T)\hat{\mu}(T)=u^{2}(T)-v^{2}(T).
10:Stage 2 (optional):
11:  Consider the projected dataset Sk={[Xi]I:XiS}S_{k}=\{[X_{i}]_{I}:X_{i}\in S\} and apply an existing robust mean estimator (e.g. those introduced in [DK19, Che+20]) to SkS_{k}.
Refer to caption
(a)
Refer to caption
(b)
Figure 1: The predicted coefficients for ncvx and cvx. We run the subgradient method with stepsize η=0.07\eta=0.07 in both settings. We initialize ncvx with α=1×1010\alpha=1\times 10^{-10} and use a zero initialization for cvx. We generate the inliers with d=500,n=2000d=500,n=2000 from a lognormal distribution with a variance of 3.33.3 and a kk-sparse mean with k=4k=4 and nonzero elements [5,2,2,1.5][5,2,2,1.5]. The corruption rate is 0.050.05 and the outliers are generated from a Cauchy distribution with a mean of 2020 and variance of 5050.

Our key contribution is to reveal the emergence of incremental learning: we show that SubGM with small initialization learns the nonzero components (signals) long before overfitting the zero components (residuals) to noise. Consequently, there exists a wide range of iterations within which the signals are in the order of Ω(1)\Omega(1) while the residuals remain in the order of 𝒪(α)\mathcal{O}(\alpha) (see Figure 1(a)). Remarkably, we show that this interval only depends on the stepsize η\eta and the initialization scale α\alpha, and it can be widened by reducing these user-defined parameters. In stark contrast, differentiating between the signals and residuals is challenging in the convex setting (cvx) precisely due to the lack of incremental learning, as shown in Figure 1(b). After successfully identifying the locations of the top-kk elements, we can employ existing robust mean estimation techniques [DK19, Che+20] on the dataset projected onto the recovered support to further improve the estimation of the top-kk nonzero elements.

4 Main Result

In this section, we present the theoretical guarantees for Algorithm 1. We begin by analyzing the first stage of the algorithm, which focuses on recovering the support of the true mean.

4.1 Stage 1: Identification of Support via Coarse-grained Estimation

We denote μmax=maxi{|μi|}\mu_{\max}^{\star}=\max_{i}\{|\mu_{i}^{\star}|\} and μmin=mini{|μi|:μi0}\mu_{\min}^{\star}=\min_{i}\{|\mu_{i}^{\star}|:\mu_{i}^{\star}\not=0\}. Our main theorem is presented next.

Theorem 4.1 (Convergence guarantee for SubGM).

Let \mathbb{P} be a distribution on d\mathbb{R}^{d} with an unknown kk-sparse mean μ\mu^{\star}, unknown covariance matrix Σσ2I\Sigma\preceq\sigma^{2}I, and unknown coordinate-wise third moment satisfying 𝔼[|Xiμi|3]σ3/ϵ,1id\mathbb{E}[|X_{i}-\mu_{i}^{\star}|^{3}]\lesssim\sigma^{3}/\sqrt{\epsilon},\forall 1\leq i\leq d. Suppose a sample set of size nlog(d/δ)/ϵn\gtrsim\log(d/\delta)/\epsilon is collected according to the strong contamination model (Definition 1.1) with corruption parameter ϵ\epsilon. Upon setting the stepsize ησϵ/μmax\eta\leq\sigma\sqrt{\epsilon}/\mu_{\max}^{\star} and the initialization scale 0<ασϵ/dμmax50<\alpha\lesssim\sigma\sqrt{\epsilon/d}\wedge\mu_{\max}^{\star-5} in Algorithm 1, with a probability of at least 1δ1-\delta, the following statements hold for any iteration 2ηlog(1/α)T6ηlog(1/α)\frac{2}{\eta}\log(1/\alpha)\leq T\leq\frac{6}{\eta}\log(1/\alpha):

  • 2\ell_{2}-error. The 2\ell_{2}-error is upper-bounded by

    μ^(T)μ2σkϵ.\left\lVert\hat{\mu}(T)-\mu^{\star}\right\rVert_{2}\lesssim\sigma\sqrt{k\epsilon}. (4)
  • Identification of the top-kk elements. If we additionally have ϵμmin2/σ2\epsilon\lesssim\mu_{\min}^{\star 2}/\sigma^{2}, then we obtain

    |μ^i(T)|\displaystyle|\hat{\mu}_{i}(T)| σϵ,\displaystyle\gtrsim\sigma\sqrt{\epsilon},\quad where μi0,\displaystyle\text{where }\mu_{i}^{\star}\neq 0, (5)
    |μ^i(T)|\displaystyle|\hat{\mu}_{i}(T)| α,\displaystyle\lesssim\alpha,\quad where μi=0.\displaystyle\text{where }\mu_{i}^{\star}=0.

Comparison to the existing results. Simply applying coordinate-wise MoM estimator results in an 2\ell_{2}-error rate 𝒪(σdϵ)\mathcal{O}(\sigma\sqrt{d\epsilon}), which is considerably worse than our result when kdk\ll d. On the other hand, to guarantee a correct support recovery, the previous efficient estimators rely on prior knowledge of kk, while the coordinate-wise MoM requires an accurate value of μmin\mu_{\min}^{\star} to separate the signals from residuals (as evidenced by Figure 1(b)). In contrast, our proposed algorithm only requires a lower bound μ^minμmin\hat{\mu}_{\min}\leq\mu_{\min}^{\star} to differentiate the signals from residuals; in fact, this lower bound can be arbitrarily small (i.e., conservative) provided that the initialization scale is chosen as αμ^min\alpha\ll\hat{\mu}_{\min}. We also highlight that, much like other existing estimators under the strong contamination model, our estimator requires prior knowledge of the corruption parameter ϵ\epsilon (or its upper bound).

Proof sketch. We next provide the proof sketch of the above theorem, deferring its details to Section 7. Specifically, we analyze the coordinate-wise dynamic μ^i(t)=ui2(t)vi2(t)\hat{\mu}_{i}(t)=u_{i}^{2}(t)-v_{i}^{2}(t) for some 1id1\leq i\leq d. Without loss of generality, we assume μi0\mu_{i}^{\star}\geq 0. Upon defining βi(t)=1Jj=1Jsign~(X¯j,iμ^i(t))\beta_{i}(t)=\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}(\bar{X}_{j,i}-\hat{\mu}_{i}(t)), the update rules for ui(t)u_{i}(t) and vi(t)v_{i}(t) can be written as

ui(t+1)=(1+ηβi(t))ui(t),vi(t+1)=(1ηβi(t))vi(t).\displaystyle u_{i}(t+1)=\left(1+\eta\beta_{i}(t)\right)u_{i}(t),\quad v_{i}(t+1)=\left(1-\eta\beta_{i}(t)\right)v_{i}(t). (6)

Based on the above update rules, βi(t)\beta_{i}(t) controls the growth rate of the dynamics. Indeed, during the initial iterations, we have μ^i(t)μ^i(0)=ui2(0)vi2(0)=0\hat{\mu}_{i}(t)\approx\hat{\mu}_{i}(0)=u_{i}^{2}(0)-v_{i}^{2}(0)=0, which in turn implies that βi(t)βi(0)\beta_{i}(t)\approx\beta_{i}(0). Consequently, the dynamics of ui(t)u_{i}(t) and vi(t)v_{i}(t) can be well approximated using the following exponential functions

ui(t)(1+ηβi(0))tα,vi(t)(1ηβi(0))tα.u_{i}(t)\approx(1+\eta\beta_{i}(0))^{t}\alpha,\quad v_{i}(t)\approx(1-\eta\beta_{i}(0))^{t}\alpha. (7)

Therefore, to analyze the behaviors of ui(t)u_{i}(t) and vi(t)v_{i}(t), it suffices to characterize the magnitude of βi(0)\beta_{i}(0) for different coordinates. To achieve this, we define 𝒥clean\mathcal{J}_{\text{clean}} as the index set of the subgroups [J][J] that do not contain any outliers, and denote its complement as 𝒥outlier=[J]\𝒥clean\mathcal{J}_{\text{outlier}}=[J]\backslash\mathcal{J}_{\text{clean}}. We have

βi(0)\displaystyle\beta_{i}(0) =1Jj𝒥cleansign~(X¯j,i)±|𝒥outlier|J\displaystyle=\frac{1}{J}\sum_{j\in\mathcal{J}_{\text{clean}}}\operatorname{\widetilde{sign}}(\bar{X}_{j,i})\pm\frac{\left|\mathcal{J}_{\text{outlier}}\right|}{J} (denote δ=|𝒥outlier|J\delta=\frac{\left|\mathcal{J}_{\text{outlier}}\right|}{J})
(1δ)𝔼[sign~(X¯j,i)]±δ\displaystyle\approx(1-\delta)\mathbb{E}\left[\operatorname{\widetilde{sign}}(\bar{X}_{j,i})\right]\pm\delta (for sufficiently large 𝒥clean\mathcal{J}_{\text{clean}})
=(1δ)(12Pr(X¯j,iμiμi))±δ\displaystyle=(1-\delta)\left(1-2\mathrm{Pr}\left(\bar{X}_{j,i}-\mu_{i}^{\star}\leq-\mu_{i}^{\star}\right)\right)\pm\delta
(1δ)(12Φ(μiBVar(X)))±δ.\displaystyle\approx(1-\delta)(1-2\Phi(-\mu_{i}^{\star}\cdot\sqrt{B\mathrm{Var}(X)}))\pm\delta. (due to finite-sample central limit theorem)

Here, B=n/JB=n/J is the size of each subgroup, and Φ()\Phi(\cdot) represents the cumulative distribution function (CDF) of the standard Gaussian distribution. Let us define residual={i:μi=0}\mathcal{I}_{\text{residual}}=\{i:\mu_{i}^{\star}=0\} and signal={i:μi0}\mathcal{I}_{\text{signal}}=\{i:\mu_{i}^{\star}\not=0\}. Based on the above characterization of βi(0)\beta_{i}(0), for all iresiduali\in\mathcal{I}_{\text{residual}}, we have 12Φ(μiBVar(X))=01-2\Phi(-\mu_{i}^{\star}\cdot\sqrt{B\mathrm{Var}(X)})=0, which in turn implies βi(0)±δ\beta_{i}(0)\approx\pm\delta. Furthermore, by setting J=CϵnJ=C\lceil\epsilon n\rceil with a suitably large constant CC, B=n/J1/(Cϵ)B=n/J\geq 1/(C\epsilon) can be made sufficiently large given a sufficiently small ϵ\epsilon. This ensures that βi(0)Ω(1δ)±δ\beta_{i}(0)\approx\Omega(1-\delta)\pm\delta for all isignali\in\mathcal{I}_{\text{signal}}. On the other hand, we have δϵn/J1/C\delta\leq\lceil\epsilon n\rceil/J\leq 1/C since |𝒥outlier|ϵn\left|\mathcal{J}_{\text{outlier}}\right|\leq\lceil\epsilon n\rceil. As a result, |βi(0)||\beta_{i}(0)| can be made arbitrarily small for all iresiduali\in\mathcal{I}_{\text{residual}} and βi(0)=Ω(1)\beta_{i}(0)=\Omega(1) for all isignali\in\mathcal{I}_{\text{signal}}. This discrepancy in the growth rates of ui(t)u_{i}(t) and vi(t)v_{i}(t) enables our algorithm to separate the signals from residuals within just a few iterations. In Section 7, we provide a more delicate analysis of the dynamics, showing that for all T[2ηlog(1/α),6ηlog(1/α)]T\in[\frac{2}{\eta}\log(1/\alpha),\frac{6}{\eta}\log(1/\alpha)] we have

ui2(t)vi2(t)=μi±𝒪(σϵ),\displaystyle\,u_{i}^{2}(t)-v_{i}^{2}(t)=\mu_{i}^{\star}\pm\mathcal{O}(\sigma\sqrt{\epsilon}),\quad for isignal,\displaystyle\text{for }i\in\mathcal{I}_{\text{signal}}, (8)
|ui2(t)vi2(t)|=poly(α)σϵ,\displaystyle\left|u_{i}^{2}(t)-v_{i}^{2}(t)\right|=\operatorname{poly}(\alpha)\ll\sigma\sqrt{\epsilon},\quad for iresidual.\displaystyle\text{for }i\in\mathcal{I}_{\text{residual}}.

The above equation sheds light on the key difference between ncvx and cvx: unlike  cvx where the error is equally distributed across different coordinates, the error in ncvx is primarily distributed among the signals, while the error in the residuals can be kept arbitrarily small by a proper choice of the initialization scale α\alpha. This implies that, if the signals are sufficiently larger than the induced error, i.e., |μi|σϵ,isignal|\mu_{i}^{\star}|\gtrsim\sigma\sqrt{\epsilon},\forall i\in\mathcal{I}_{\text{signal}}, our algorithm can successfully identify the signals.

4.2 Stage 2: Achieving Optimal Rate on the Support via Fine-grained Estimation

As illustrated in Section 4.1, a direct application of SubGM leads to an estimation error of 𝒪(kϵ)\mathcal{O}(\sqrt{k\epsilon}). In this section, we show that this error can be further improved once the support of the mean is identified correctly. Our key insight is that once the support of the mean is recovered, we can reduce the problem to a robust dense mean estimation defined only over the recovered support. Under such a regime, existing estimators designed for robust dense mean estimation [DK19, Che+20] can be employed to further reduce the estimation error.

Proposition 4.1 (Adapted from Proposition 1.6 in [DKP20]).

Let \mathbb{P} be a distribution on k\mathbb{R}^{k} with an unknown mean μ\mu^{\star} and unknown covariance matrix Σσ2I\Sigma\preceq\sigma^{2}I. Suppose a sample set of size nn is collected according to the strong contamination model (Definition 1.1) with corruption parameter ϵ<1/2\epsilon<1/2. Then, there exists an algorithm that runs in 𝒪(kn)\mathcal{O}(kn) time and memory and, with a probability of at least 1δ1-\delta, outputs an estimator μ^\hat{\mu} that satisfies

μ^μ2σϵ+σk/n+σlog(1/δ)/n.\left\lVert\hat{\mu}-\mu^{\star}\right\rVert_{2}\lesssim\sigma\sqrt{\epsilon}+\sigma\sqrt{k/n}+\sigma\sqrt{\log(1/\delta)/n}.

Equipped with the above result, we next provide an end-to-end guarantee for our full algorithm.

Theorem 4.2 (Guarantee for the full algorithm).

Let \mathbb{P} be a distribution on d\mathbb{R}^{d} satisfying the conditions in Theorem 4.1. Suppose a sample set of size n(k+log(d/δ))/ϵn\gtrsim({k+\log(d/\delta)})/{\epsilon} is collected according to the strong contamination model (Definition 1.1) with corruption parameter ϵμmin2/σ2\epsilon\lesssim\mu_{\min}^{2}/\sigma^{2}. Then, with the choice of α=Θ(σϵ/dμmax5)\alpha=\Theta(\sigma\sqrt{\epsilon/d}\wedge\mu_{\max}^{\star-5}) and η=Θ(σϵ/μmax)\eta=\Theta(\sigma\sqrt{\epsilon}/\mu_{\max}^{\star}), our full algorithm runs in 𝒪(ndlog(d))\mathcal{O}(nd\log(d)) time and memory and, with a probability of at least 1δ1-\delta, outputs an estimate μ^\hat{\mu} that satisfies

μ^μ2σϵ.\left\lVert\hat{\mu}-\mu^{\star}\right\rVert_{2}\lesssim\sigma\sqrt{\epsilon}. (9)

Upon setting the sample size n=Θ((k+log(d/δ))/ϵ)n=\Theta\left(({k+\log(d/\delta)})/{\epsilon}\right), our proposed two-stage method runs in 𝒪~(dk)\tilde{\mathcal{O}}(dk) time and memory and returns a solution with an error in the order of 𝒪(σϵ)\mathcal{O}(\sigma\sqrt{\epsilon}). Our next theorem shows that this error is indeed information-theoretically optimal up to a constant factor and thus cannot be improved.

Theorem 4.3 (Information-theoretic lower bound).

There exists a distribution \mathbb{P} with kk-sparse mean μ\mu^{\star}, covariance matrix Σσ2I\Sigma\preceq\sigma^{2}I, and coordinate-wise third moment satisfying 𝔼[|Xiμi|3]σ3/ϵ,1id\mathbb{E}[|X_{i}-\mu_{i}^{\star}|^{3}]\lesssim\sigma^{3}/\sqrt{\epsilon},\forall 1\leq i\leq d such that, given any arbitrarily large sample set collected according to the strong contamination model (Definition 1.1) with corruption parameter ϵ\epsilon, no algorithm can estimate the mean μ\mu^{\star} with 2\ell_{2}-error o(σϵ)o(\sigma\sqrt{\epsilon}).

Comparison to the existing lower bounds. To achieve the optimal error rate, the sample complexity of our method scales linearly with the sparsity level kk. A careful reader may realize that our sample complexity is unexpectedly smaller than the optimal sample complexity Ω((klog(d/k)+log(d/δ))/ϵ)\Omega((k\log(d/k)+\log(d/\delta))/\epsilon) introduced in [LM19a] when kk is sufficiently small. This is due to the additional assumptions we impose on the coordinate-wise third moment of the distribution and the corruption parameter ϵ\epsilon. On the other hand, it is recently shown in [DK19, PBR20] that under the bounded third moment, the dependency of the estimation error on ϵ\epsilon can be improved from ϵ1/2\epsilon^{1/2} to ϵ2/3\epsilon^{2/3}. Our worse dependency on ϵ\epsilon is due to our more relaxed assumption on the third moment: unlike the assumptions made in [DK19, PBR20], our imposed upper bound on the third moment is inversely proportional to ϵ\sqrt{\epsilon}. Consequently, the imposed upper bound can get arbitrarily large with a smaller corruption parameter. In this extreme case where ϵ0\epsilon\to 0, this condition can be dropped all together.

5 Proof Sketch

In this section, we provide a proof sketch for the dynamics of SubGM (Theorem 4.1). To streamline the presentation, we keep our arguments at a high level; a more detailed proof is deferred in the supplementary materials. We analyze the coordinate-wise dynamic μ^i(t)=ui2(t)vi2(t)\hat{\mu}_{i}(t)=u_{i}^{2}(t)-v_{i}^{2}(t) for 1id1\leq i\leq d. Without loss of generality, we assume μi0\mu_{i}^{\star}\geq 0. Upon defining βi(t)=1Jj=1Jsign~(X¯j,iμ^i(t))\beta_{i}(t)=\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}(\bar{X}_{j,i}-\hat{\mu}_{i}(t)), the update rules for ui(t)u_{i}(t) and vi(t)v_{i}(t) can be written as

ui(t+1)=(1+ηβi(t))ui(t),vi(t+1)=(1ηβi(t))vi(t).\displaystyle u_{i}(t+1)=\left(1+\eta\beta_{i}(t)\right)u_{i}(t),\quad v_{i}(t+1)=\left(1-\eta\beta_{i}(t)\right)v_{i}(t). (10)

Based on the above update rules, βi(t)\beta_{i}(t) controls the growth rate of the dynamics. Indeed, during the initial iterations, we have μ^i(t)μ^i(0)=ui2(0)vi2(0)=0\hat{\mu}_{i}(t)\approx\hat{\mu}_{i}(0)=u_{i}^{2}(0)-v_{i}^{2}(0)=0, which in turn implies that βi(t)βi(0)\beta_{i}(t)\approx\beta_{i}(0). Consequently, the dynamics of ui(t)u_{i}(t) and vi(t)v_{i}(t) can be well approximated using the following exponential functions

ui(t)(1+ηβi(0))tα,vi(t)(1ηβi(0))tα.u_{i}(t)\approx(1+\eta\beta_{i}(0))^{t}\alpha,\quad v_{i}(t)\approx(1-\eta\beta_{i}(0))^{t}\alpha. (11)

Therefore, to analyze the behaviors of ui(t)u_{i}(t) and vi(t)v_{i}(t), it suffices to characterize the magnitude of βi(0)\beta_{i}(0) for different coordinates. To achieve this, we define 𝒥clean\mathcal{J}_{\text{clean}} as the index set of the subgroups [J][J] that do not contain any outliers and 𝒥outlier=[J]𝒥clean\mathcal{J}_{\text{outlier}}=[J]-\mathcal{J}_{\text{clean}}. Consequently, we have

βi(0)\displaystyle\beta_{i}(0) =1Jj𝒥cleansign~(X¯j,i)±|𝒥outlier|J\displaystyle=\frac{1}{J}\sum_{j\in\mathcal{J}_{\text{clean}}}\operatorname{\widetilde{sign}}(\bar{X}_{j,i})\pm\frac{\left|\mathcal{J}_{\text{outlier}}\right|}{J} (denote δ=|𝒥outlier|J\delta=\frac{\left|\mathcal{J}_{\text{outlier}}\right|}{J})
(1δ)𝔼[sign~(X¯j,i)]±δ\displaystyle\approx(1-\delta)\mathbb{E}\left[\operatorname{\widetilde{sign}}(\bar{X}_{j,i})\right]\pm\delta (due to concentration bound)
=(1δ)(12Pr(X¯j,iμiμi))±δ\displaystyle=(1-\delta)\left(1-2\mathrm{Pr}\left(\bar{X}_{j,i}-\mu_{i}^{\star}\leq-\mu_{i}^{\star}\right)\right)\pm\delta
(1δ)(12Φ(μiB/Var(X)))±δ.\displaystyle\approx(1-\delta)(1-2\Phi(-\mu_{i}^{\star}\cdot\sqrt{B/\mathrm{Var}(X)}))\pm\delta. (due to finite-sample central limit theorem)

Here Φ()\Phi(\cdot) represents the cumulative distribution function (CDF) of the standard Gaussian distribution. Let us define residual={i:μi=0}\mathcal{I}_{\text{residual}}=\{i:\mu_{i}^{\star}=0\} and signal={i:μi0}\mathcal{I}_{\text{signal}}=\{i:\mu_{i}^{\star}\not=0\}. Based on the above characterization of βi(0)\beta_{i}(0), we have βi(0)±δ\beta_{i}(0)\approx\pm\delta for all iresiduali\in\mathcal{I}_{\text{residual}}. Furthermore, by setting J=CϵnJ=C\lceil\epsilon n\rceil with a suitably large constant CC, we can ensure βi(0)Ω(1δ)±δ\beta_{i}(0)\approx\Omega(1-\delta)\pm\delta for all isignali\in\mathcal{I}_{\text{signal}} because B=n/J1/(Cϵ)B=n/J\geq 1/(C\epsilon) can be made sufficiently large given a sufficiently small ϵ\epsilon. On the other hand, we have δϵn/J1/C\delta\leq\lceil\epsilon n\rceil/J\leq 1/C since |𝒥outlier|ϵn\left|\mathcal{J}_{\text{outlier}}\right|\leq\lceil\epsilon n\rceil. As a result, |βi(0)||\beta_{i}(0)| can be made arbitrarily small for all iresiduali\in\mathcal{I}_{\text{residual}} and βi(0)=Ω(1)\beta_{i}(0)=\Omega(1) for all isignali\in\mathcal{I}_{\text{signal}}. This discrepancy in the growth rates of ui(t)u_{i}(t) and vi(t)v_{i}(t) enables our algorithm to separate the signals from residuals within just a few iterations. In the supplementary materials, we provide a more delicate analysis of the dynamics, showing that for all T[2ηlog(1/α),6ηlog(1/α)]T\in[\frac{2}{\eta}\log(1/\alpha),\frac{6}{\eta}\log(1/\alpha)] we have

ui2(t)vi2(t)=μi±𝒪(σϵ),\displaystyle\,u_{i}^{2}(t)-v_{i}^{2}(t)=\mu_{i}^{\star}\pm\mathcal{O}(\sigma\sqrt{\epsilon}),\quad for isignal,\displaystyle\text{for }i\in\mathcal{I}_{\text{signal}}, (12)
|ui2(t)vi2(t)|=poly(α)σϵ,\displaystyle\left|u_{i}^{2}(t)-v_{i}^{2}(t)\right|=\operatorname{poly}(\alpha)\ll\sigma\sqrt{\epsilon},\quad for iresidual.\displaystyle\text{for }i\in\mathcal{I}_{\text{residual}}.

The above equation sheds light on the key difference between ncvx and cvx: unlike  cvx where the error is equally distributed across different coordinates, the error in ncvx is primarily distributed among the signals, while the error in the residuals can be kept arbitrarily small by a proper choice of the initialization scale α\alpha. This implies that, if the signals are sufficiently larger than the induced error, i.e., |μi|σϵ,isignal|\mu_{i}^{\star}|\gtrsim\sigma\sqrt{\epsilon},\forall i\in\mathcal{I}_{\text{signal}}, our algorithm can successfully identify the signals.

6 Simulation

In this section, we present numerical simulations to corroborate the theoretical results established in Section 4. Further implementation details, together with additional simulation studies, are deferred to the appendix. The complete codebase is publicly accessible at https://github.com/ying-hui-he/Robust_mean_estimation.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 2: The data dimension is fixed at d=100d=100. Unless otherwise specified, the corruption ratio is set to ϵ=0.1\epsilon=0.1 and the sparsity level to k=4k=4. For the first two simulations, the distribution parameters c,b,νc,b,\nu are set to 3.13.1 (see Appendix B for further details). The sample size is n=600n=600 in the first and third simulations, while in the second simulation it is scaled as n=100kn=100k, varying proportionally with the sparsity level.

Simulation setup.

All the experiments are conducted on a MacBook Pro 2021 with the Apple M1 Pro chip and a 1616GB unified memory. We pick three representative heavy-tailed probability distributions: Fisk, Pareto, and Student’s tt. To make a fair comparison, we fix the data dimension at d=100d=100 and use the constant-bias noise model introduced in [Che+21] to generate outliers. Unless otherwise stated, we set the corruption ratio at ϵ=0.1\epsilon=0.1 and the sparsity level at k=4k=4. As for the algorithm in Stage 2, we utilize the filter-based algorithm RME_sp introduced in [Dia+19a]. Furthermore, we compare our algorithms with the Oracle estimator, which uses the coordinate-wise MoM on the clean data with an optimal choice of subgroup numbers. In all of our simulations, we set the number of iterations of SubGM to 200200, which is in line with our theoretical results.

Identification of top-kk elements.

In this experiment, we evaluate the success rate under varying corruption ratios ϵ\epsilon, while keeping all other parameters fixed. Our theoretical result (Theorem 4.1) indicates that provable identification is possible only when ϵμmin2/σ2\epsilon\lesssim\mu_{\min}^{\star 2}/\sigma^{2}, suggesting that the success rate should deteriorate as ϵ\epsilon increases. We define the recovered index set obtained by SubGM as II, and the true index set of the top-kk elements as IkI_{k}. The success rate is then measured as |IIk|/|IIk||I\cap I_{k}|/|I\cup I_{k}|. The results, presented in Figure 2(a), are averaged over 5050 independent trials for each setting. Notably, SubGM achieves exact recovery of the true index set II even when up to 30%30\% of the samples are corrupted, highlighting the robustness and practical effectiveness of our method.

Comparison between Stage 1 and full algorithms.

We evaluate the 2\ell_{2}-error of the Stage 1 and full algorithms across varying sparsity levels kk. Our theoretical results predict a gap in 2\ell_{2}-error between the two algorithms—𝒪(σkϵ)\mathcal{O}(\sigma\sqrt{k\epsilon}) versus 𝒪(σϵ)\mathcal{O}(\sigma\sqrt{\epsilon})—when kk is sufficiently large. To minimize the influence of sample size, we set n=100kn=100k, ensuring a sufficiently large number of samples. As shown in Figure 2(b), the two algorithms perform comparably when kk is small. However, as kk increases, the 2\ell_{2}-error of Stage 1 grows sublinearly, while the full algorithm maintains a stable error level. These empirical findings are fully consistent with our theoretical predictions.

Infinite variance regime.

In this experiment, we evaluate the performance of our algorithm in the infinite variance regime, fixing the sparsity level at k=4k=4 and the sample size at n=600n=600. When the distribution parameters c,b,νc,b,\nu fall within the interval (1,2](1,2], the Fisk, Pareto, and Student’s tt distributions all exhibit infinite variance (see Appendix B for further details). As shown in Figure 2(c), both Stage 1 and the full algorithm maintain strong performance in this setting, suggesting that our theoretical guarantees may extend to the infinite variance regime. Notably, Stage 1 consistently outperforms the full algorithm across all three distributions, implying that SubGM may possess greater robustness than existing estimators under infinite variance.

7 Proofs

The proofs of our main results are organized as follows. Section 7.1 presents preliminary lemmas. Section 7.2 establishes the convergence guarantee of SubGM (Theorem 4.1), and Section 7.3 provides the end-to-end guarantee of the full algorithm (Theorem 4.2). Section 7.4 derives the information-theoretic lower bound (Theorem 4.3), and Appendix A proves a formal variant of Proposition 3.1 to establish the properties of the MoM estimator.

7.1 Preliminaries

This section presents all the technical lemmas that will be used to prove our main results.

Lemma 7.1 (Chebyshev’s inequality [Ver18, Corollary 1.2.5]).

Suppose that XX\sim\mathbb{P} with Var(X)<\mathrm{Var}(X)<\infty. Then, for any δ>0\delta>0, we have

Pr(|X𝔼[X]|δ)Var(X)δ2.\mathrm{Pr}\left(\left|X-\mathbb{E}[X]\right|\geq\delta\right)\leq\frac{\mathrm{Var}(X)}{\delta^{2}}. (13)
Lemma 7.2 (Hoeffding’s inequality [Ver18, Theorem 2.2.6]).

Let X1,,XnX_{1},\cdots,X_{n} be independent random variables such that aiXibia_{i}\leq X_{i}\leq b_{i} almost surely. Then for all δ>0\delta>0, we have

Pr(i=1nXi𝔼[Xi]δ)exp{2δ2i=1n(biai)2}.\mathrm{Pr}\biggl{(}\sum_{i=1}^{n}X_{i}-\mathbb{E}[X_{i}]\geq\delta\biggr{)}\leq\exp\biggl{\{}-\frac{2\delta^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\biggr{\}}. (14)
Lemma 7.3 (Dvoretzky-Kiefer-Wolfowitz Inequality [Mas90]).

Let F(t)=Pr(Xt)F(t)=\mathrm{Pr}(X\leq t) be the CDF of a random variable XX, and let F^n()=1ni=1n𝕀(,](Xi)\hat{F}_{n}(\cdot)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}_{(-\infty,\cdot]}(X_{i}) be the empirical CDF based on nn i.i.d. samples X1,,XnX_{1},\dots,X_{n}\sim\mathbb{P}. We have

Pr(F^nFt)2e2nt2for all t0.\mathrm{Pr}\left(\left\lVert\hat{F}_{n}-F\right\rVert_{\infty}\geq t\right)\leq 2e^{-2nt^{2}}\quad\text{for all }t\geq 0. (15)
Lemma 7.4.

Suppose X1,Xni.i.d.X_{1},\cdots X_{n}\overset{i.i.d.}{\sim}\mathbb{P}. Then, with probability at least 1δ1-\delta and for all aa\in\mathbb{R}, we have

|1ni=1nsign~(Xia)𝔼[sign~(Xa)]|2log(2/δ)n.\biggl{|}\frac{1}{n}\sum_{i=1}^{n}\operatorname{\widetilde{sign}}\left(X_{i}-a\right)-\mathbb{E}\left[\operatorname{\widetilde{sign}}\left(X-a\right)\right]\biggr{|}\leq\sqrt{\frac{2\log(2/\delta)}{n}}. (16)
Proof.

Proof Note that sign~(Xia)=1𝕀(,a](Xi)𝕀(,a)(Xi)\operatorname{\widetilde{sign}}(X_{i}-a)=1-\mathbb{I}_{(-\infty,a]}(X_{i})-\mathbb{I}_{(-\infty,a)}(X_{i}) and 𝔼[sign~(Xa)]=1Pr(Xa)Pr(X<a)\mathbb{E}\left[\operatorname{\widetilde{sign}}\left(X-a\right)\right]=1-\mathrm{Pr}(X\leq a)-\mathrm{Pr}(X<a). Therefore, we have

|1ni=1nsign~(Xia)𝔼[sign~(Xa)]|\displaystyle\biggl{|}\frac{1}{n}\sum_{i=1}^{n}\operatorname{\widetilde{sign}}\left(X_{i}-a\right)-\mathbb{E}\left[\operatorname{\widetilde{sign}}\left(X-a\right)\right]\biggr{|} (17)
|1ni=1n𝕀(,a](Xi)Pr(Xa)|+supb<a|1ni=1n𝕀(,b](Xi)Pr(Xb)|\displaystyle\leq\biggl{|}\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}_{(-\infty,a]}(X_{i})-\mathrm{Pr}(X\leq a)\biggr{|}+\sup_{b<a}\biggl{|}\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}_{(-\infty,b]}(X_{i})-\mathrm{Pr}(X\leq b)\biggr{|}
2F^nF.\displaystyle\leq 2\left\lVert\hat{F}_{n}-F\right\rVert_{\infty}.

Upon setting t=12nlog(2δ)t=\sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)} in the Dvoretzky-Kiefer-Wolfowitz Inequality (Lemma 7.3), with probability at least 1δ1-\delta and for all aa\in\mathbb{R}, we have

|1ni=1nsign~(Xia)𝔼[sign~(Xa)]|2F^nF2log(2/δ)n.\displaystyle\biggl{|}\frac{1}{n}\sum_{i=1}^{n}\operatorname{\widetilde{sign}}\left(X_{i}-a\right)-\mathbb{E}\left[\operatorname{\widetilde{sign}}\left(X-a\right)\right]\biggr{|}\leq 2\left\lVert\hat{F}_{n}-F\right\rVert_{\infty}\leq\sqrt{\frac{2\log(2/\delta)}{n}}.\hfill\square (18)

Lemma 7.5 (Berry-Esseen bound [Ver18, Theorem 2.1.3]).

Suppose X1,Xni.i.d.X_{1},\cdots X_{n}\overset{i.i.d.}{\sim}\mathbb{P}, where \mathbb{P} has zero mean and bounded third moment, i.e., μ=𝔼[X]=0,ρ=𝔼[|X|3]<\mu=\mathbb{E}[X]=0,\rho=\mathbb{E}[|X|^{3}]<\infty. Then, upon denoting Zn=(i=1nXi)/nσ2Z_{n}=(\sum_{i=1}^{n}X_{i})/\sqrt{n\sigma^{2}} where σ2=𝔼[X2]\sigma^{2}=\mathbb{E}[X^{2}], we have

supa|Pr(Zn<a)Φ(a)|0.5ρσ3n.\sup_{a\in\mathbb{R}}\left|\mathrm{Pr}\left(Z_{n}<a\right)-\Phi(a)\right|\leq\frac{0.5\rho}{\sigma^{3}\sqrt{n}}. (19)

Here Φ()\Phi(\cdot) is the CDF of standard Gaussian distribution.

7.2 Proof of Theorem 4.1

To prove this theorem, it is essential to first establish the uniform concentration of 1Jj=1Jsign~(X¯j,i+a)\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right) for all a0a\geq 0.

Lemma 7.6.

Suppose X1,Xni.i.d.X_{1},\cdots X_{n}\overset{i.i.d.}{\sim}\mathbb{P}, where \mathbb{P} has zero mean, variance σ2\sigma^{2}, and coordinate-wise third moment ρ\rho. Moreover, suppose samples are generated according to the strong contamination model (Definition 1.1) with corruption parameter ϵ\epsilon. Suppose n20000log(2d/δ)/ϵn\geq 20000\log(2d/\delta)/\epsilon, J=100ϵnJ=100\lceil\epsilon n\rceil, and ρ0.005σ3/ϵ\rho\leq 0.005\sigma^{3}/\sqrt{\epsilon}. Upon dividing the samples into JJ equal subgroups S1,,SJS_{1},\cdots,S_{J} and denoting the empirical mean of each subgroup by X¯j=1JkSjXk\bar{X}_{j}=\frac{1}{J}\sum_{k\in S_{j}}X_{k}, with probability at least 1δ1-\delta, the following statements hold

  • For all a20σϵa\geq 20\sigma\sqrt{\epsilon} and all 1id1\leq i\leq d, we have: 351Jj=1Jsign~(X¯j,i+a)1.\frac{3}{5}\leq\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right)\leq 1.

  • For all 0a0.001σϵ0\leq a\leq 0.001\sigma\sqrt{\epsilon} and all 1id1\leq i\leq d, we have: 0.081Jj=1Jsign~(X¯j,i+a)0.08.-0.08\leq\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right)\leq 0.08.

Proof.

Proof We prove the two cases separately.

Case 1: a20σϵa\geq 20\sigma\sqrt{\epsilon}. We only need to prove the lower bound since the upper bound is trivial. We partition the index set of subgroups 1,,J{1,\ldots,J} into two disjoint subsets: 𝒥clean\mathcal{J}_{\text{clean}}, containing all subgroups free of outliers, and 𝒥outlier\mathcal{J}_{\text{outlier}}, containing those with at least one outlier. Note that |𝒥outlier|ϵn|\mathcal{J}_{\text{outlier}}|\leq\lceil\epsilon n\rceil. Therefore, we obtain

1Jj=1Jsign~(X¯j,i+a)1Jj𝒥cleansign~(X¯j,i+a)ϵnJ.\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right)\geq\frac{1}{J}\sum_{j\in\mathcal{J}_{\text{clean}}}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right)-\frac{\lceil\epsilon n\rceil}{J}. (20)

Next, applying Lemma 7.4 and a union bound, we obtain that, with probability at least 1δ1-\delta and for all 1id1\leq i\leq d,

1Jj=1Jsign~(X¯j,i+a)\displaystyle\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right) |𝒥clean|J(𝔼[sign~(X¯j,i+a)]2log(2d/δ)|𝒥clean|)ϵnJ\displaystyle\geq\frac{|\mathcal{J}_{\text{clean}}|}{J}\left(\mathbb{E}\left[\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right)\right]-\sqrt{\frac{2\log(2d/\delta)}{|\mathcal{J}_{\text{clean}}|}}\right)-\frac{\lceil\epsilon n\rceil}{J} (21)
=|𝒥clean|J(12Pr(X¯j,ia)2log(2d/δ)|𝒥clean|)ϵnJ.\displaystyle=\frac{|\mathcal{J}_{\text{clean}}|}{J}\left(1-2\mathrm{Pr}\left(\bar{X}_{j,i}\leq-a\right)-\sqrt{\frac{2\log(2d/\delta)}{|\mathcal{J}_{\text{clean}}|}}\right)-\frac{\lceil\epsilon n\rceil}{J}.

To proceed, one can write

Pr(X¯j,ia)\displaystyle\mathrm{Pr}\left(\bar{X}_{j,i}\leq-a\right) =Pr(X¯j,iVar(X)/BaVar(X)/B)\displaystyle=\mathrm{Pr}\biggl{(}\frac{\bar{X}_{j,i}}{\sqrt{\mathrm{Var}(X)/B}}\leq-\frac{a}{\sqrt{\mathrm{Var}(X)/B}}\biggr{)} (22)
(a)Φ(aVar(X)/B)+0.5ρσ3B\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\Phi\biggl{(}-\frac{a}{\sqrt{\mathrm{Var}(X)/B}}\biggr{)}+\frac{0.5\rho}{\sigma^{3}\sqrt{B}}
(b)exp{Ba22Var(X)}+0.5ρσ3B\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\exp\biggl{\{}-\frac{Ba^{2}}{2\mathrm{Var}(X)}\biggr{\}}+\frac{0.5\rho}{\sigma^{3}\sqrt{B}}
exp{Ba22σ2}+0.5ρσ3B.\displaystyle\leq\exp\biggl{\{}-\frac{Ba^{2}}{2\sigma^{2}}\biggr{\}}+\frac{0.5\rho}{\sigma^{3}\sqrt{B}}.

Here, (a)(a) follows from the Berry-Esseen bound (Lemma 7.5). In (b)(b), we use the concentration inequality for standard Gaussian distribution. Combining the above inequalities and recalling our choices of JJ, BB, and nn, we conclude that, with probability at least 1δ1-\delta and for all 1id1\leq i\leq d,

1Jj=1Jsign~(X¯j,i+a)\displaystyle\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right) |𝒥clean|J(12(exp{Ba22σ2}+0.5ρσ3B)2log(2d/δ)|𝒥clean|)ϵnJ\displaystyle\geq\frac{|\mathcal{J}_{\text{clean}}|}{J}\left(1-2\left(\exp\biggl{\{}-\frac{Ba^{2}}{2\sigma^{2}}\biggr{\}}+\frac{0.5\rho}{\sigma^{3}\sqrt{B}}\right)-\sqrt{\frac{2\log(2d/\delta)}{|\mathcal{J}_{\text{clean}}|}}\right)-\frac{\lceil\epsilon n\rceil}{J} (23)
0.99(12(e2+0.025)1009914500)0.01\displaystyle\geq 99\cdot\left(1-2\cdot\left(e^{-2}+0.025\right)-\sqrt{\frac{100}{99}\cdot\frac{1}{4500}}\right)-01
35.\displaystyle\geq\frac{3}{5}.

This completes the proof of the first statement.

Case 2: 0a0.001σϵ0\leq a\leq 0.001\sigma\sqrt{\epsilon}. In this case, it suffices to provide an upper bound for |1Jj=1Jsign~(X¯j,i+a)|\left|\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right)\right|. Following a similar derivation as in Case 1, with probability at least 1δ1-\delta and for all 1id1\leq i\leq d, we have

|1Jj=1Jsign~(X¯j,i+a)|\displaystyle\left|\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}+a\right)\right| (24)
|𝒥clean|J(12Φ(aVar(X)/B)+ρσ3B+2log(2d/δ)|𝒥clean|)+ϵnJ\displaystyle\leq\frac{|\mathcal{J}_{\text{clean}}|}{J}\left(1-2\Phi\left(-\frac{a}{\sqrt{\mathrm{Var}(X)/B}}\right)+\frac{\rho}{\sigma^{3}\sqrt{B}}+\sqrt{\frac{2\log(2d/\delta)}{|\mathcal{J}_{\text{clean}}|}}\right)+\frac{\lceil\epsilon n\rceil}{J}
=|𝒥clean|J(2Φ(0)2Φ(aVar(X)/B)+ρσ3B+2log(2d/δ)|𝒥clean|)+ϵnJ\displaystyle=\frac{|\mathcal{J}_{\text{clean}}|}{J}\left(2\Phi(0)-2\Phi\left(-\frac{a}{\sqrt{\mathrm{Var}(X)/B}}\right)+\frac{\rho}{\sigma^{3}\sqrt{B}}+\sqrt{\frac{2\log(2d/\delta)}{|\mathcal{J}_{\text{clean}}|}}\right)+\frac{\lceil\epsilon n\rceil}{J}
(a)1.98aσ2/B+0.99ρσ3B+1.98log(2d/δ)J+ϵnJ\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\frac{1.98a}{\sqrt{\sigma^{2}/B}}+\frac{0.99\rho}{\sigma^{3}\sqrt{B}}+\sqrt{\frac{1.98\log(2d/\delta)}{J}}+\frac{\lceil\epsilon n\rceil}{J}
1.98aBϵ+0.05+1.989000+0.01\displaystyle\leq 98a\sqrt{B\epsilon}+05+\sqrt{\frac{1.98}{9000}}+01
0.08.\displaystyle\leq 08.

Here, in (a)(a), we use the anti-concentration for the standard Gaussian distribution. This completes the proof of the second statement. ∎

We are now ready to present the proof of Theorem 4.1. To this goal, we first present a more precise version of its statement.

Theorem 7.1 (Convergence guarantee for SubGM).

Let \mathbb{P} be a distribution on d\mathbb{R}^{d} with an unknown kk-sparse mean μ\mu^{\star}, unknown covariance matrix Σσ2I\Sigma\preceq\sigma^{2}I, and unknown coordinate-wise third moment satisfying 𝔼[|Xiμi|3]0.005σ3/ϵ,1id\mathbb{E}[|X_{i}-\mu_{i}^{\star}|^{3}]\leq 0.005\sigma^{3}/\sqrt{\epsilon},\forall 1\leq i\leq d. Suppose a sample set of size n20000log(2d/δ)/ϵn\geq 20000\log(2d/\delta)/\epsilon is collected according to the strong contamination model (Definition 1.1) with corruption parameter ϵ\epsilon. Upon setting the stepsize ησϵ/μmax\eta\leq\sigma\sqrt{\epsilon}/\mu_{\max}^{\star} and the initialization scale 0<α0.001σϵ/dμmax50<\alpha\leq 0.001\sigma\sqrt{\epsilon/d}\wedge\mu_{\max}^{\star-5} in Algorithm 1, with a probability of at least 1δ1-\delta, the following statements hold for any iteration 2ηlog(1/α)T6ηlog(1/α)\frac{2}{\eta}\log(1/\alpha)\leq T\leq\frac{6}{\eta}\log(1/\alpha):

  • Near optimal 2\ell_{2}-error. The 2\ell_{2}-error is upper-bounded by

    μ^(T)μ31σkϵ.\left\lVert\hat{\mu}(T)-\mu^{\star}\right\rVert\leq 31\sigma\sqrt{k\epsilon}. (25)
  • Coordinate-wise error bound. We obtain

    |μ^i(T)μi|\displaystyle|\hat{\mu}_{i}(T)-\mu_{i}^{\star}| 30σϵ,\displaystyle\leq 0\sigma\sqrt{\epsilon},\quad where μi0,\displaystyle\text{where }\mu_{i}^{\star}\neq 0, (26)
    |μ^i(T)|\displaystyle|\hat{\mu}_{i}(T)| α,\displaystyle\leq\alpha,\quad where μi=0.\displaystyle\text{where }\mu_{i}^{\star}=0.

Before proceeding to the proof, we note that second statement of Theorem 7.1 together with the assumption ϵμmin2/(961σ2)\epsilon\leq\mu_{\min}^{\star 2}/(961\sigma^{2}) readily implies μ^i(T)σϵ\hat{\mu}_{i}(T)\geq\sigma\sqrt{\epsilon} for every ii such that μi0\mu^{\star}_{i}\not=0, leading to the second statement of Theorem 4.1.

Proof.

Proof of Theorem 7.1 Let us define residual={i:μi=0}\mathcal{I}_{\text{residual}}=\{i:\mu_{i}^{\star}=0\} and signal={i:μi0}\mathcal{I}_{\text{signal}}=\{i:\mu_{i}^{\star}\not=0\}. We analyze coordinate-wise dynamics μ^i(t):=ui2(t)vi2(t)\hat{\mu}_{i}(t):=u_{i}^{2}(t)-v_{i}^{2}(t) separately for signals signal\mathcal{I}_{\text{signal}} and residuals residual\mathcal{I}_{\text{residual}}.

Signal dynamics.

Without loss of generality, we assume that μi>0\mu_{i}^{\star}>0. Let us first revisit the update rule for SubGM:

ui(t+1)\displaystyle u_{i}(t+1) =(1+η1Jj=1Jsign~(X¯j,iμ^i(t)))ui(t),\displaystyle=\biggl{(}1+\eta\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}-\hat{\mu}_{i}(t)\right)\biggr{)}u_{i}(t), (27)
vi(t+1)\displaystyle v_{i}(t+1) =(1η1Jj=1Jsign~(X¯j,iμ^i(t)))vi(t).\displaystyle=\biggl{(}1-\eta\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}-\hat{\mu}_{i}(t)\right)\biggr{)}v_{i}(t).

We further divide our analysis into two cases depending on the magnitude of |μi|\left|\mu_{i}^{\star}\right|.

Case 1: μi20σϵ\mu_{i}^{\star}\geq 20\sigma\sqrt{\epsilon}. We define Ti={mint:μiμ^i(t)<20σϵ}T_{i}=\left\{\min t:\mu_{i}^{\star}-\hat{\mu}_{i}(t)<20\sigma\sqrt{\epsilon}\right\}. Hence, for all 0tTi0\leq t\leq T_{i}, the first statement of Lemma 7.6 can be invoked to show

0.61Jj=1Jsign~(X¯j,iμ^i(t))1.0.6\leq\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}-\hat{\mu}_{i}(t)\right)\leq 1. (28)

By incorporating this into Equation 27, we obtain

ui2(t+1)\displaystyle u_{i}^{2}(t+1) (1+0.6η)2ui2(t)(1+1.2η)ui2(t),\displaystyle\geq\left(1+0.6\eta\right)^{2}u_{i}^{2}(t)\geq(1+1.2\eta)u_{i}^{2}(t), (29)
vi2(t+1)\displaystyle v_{i}^{2}(t+1) (10.6η)2vi2(t)vi2(t).\displaystyle\leq\left(1-0.6\eta\right)^{2}v_{i}^{2}(t)\leq v_{i}^{2}(t). (30)

Notice that vi(0)=αv_{i}(0)=\alpha at the initialization. We find that vi2(t)α2,0tTiv_{i}^{2}(t)\leq\alpha^{2},\forall 0\leq t\leq T_{i}, which remains adequately small throughout the trajectory. Next, we examine the dynamics of ui2(t)u_{i}^{2}(t). Taking into account that ui2(0)=α2u_{i}^{2}(0)=\alpha^{2} and ui2(t)(1+1.2η)tui2(0)u_{i}^{2}(t)\geq(1+1.2\eta)^{t}u_{i}^{2}(0), we have that within Ti53ηlog(|μi|α)T_{i}\leq\frac{5}{3\eta}\log\left(\frac{|\mu_{i}^{\star}|}{\alpha}\right) iterations, the following holds

ui2(Ti)α2(1+1.2η)Tiμi10σϵ.u_{i}^{2}(T_{i})\geq\alpha^{2}(1+1.2\eta)^{T_{i}}\geq\mu_{i}^{\star}-10\sigma\sqrt{\epsilon}. (31)

This implies

μ^i(Ti)=ui2(Ti)vi2(Ti)μi10σϵα2μi20σϵ.\hat{\mu}_{i}(T_{i})=u_{i}^{2}(T_{i})-v_{i}^{2}(T_{i})\geq\mu_{i}^{\star}-10\sigma\sqrt{\epsilon}-\alpha^{2}\geq\mu_{i}^{\star}-20\sigma\sqrt{\epsilon}. (32)

Next, we show that |μiμ^i(Ti)|20σϵ\left|\mu_{i}^{\star}-\hat{\mu}_{i}(T_{i}^{\star})\right|\leq 20\sigma\sqrt{\epsilon}. To this goal, when t<Tit<T_{i}, we provide an upper bound on the difference between two consecutive iterations as follows

|μ^i(t+1)μ^i(t)|\displaystyle|\hat{\mu}_{i}(t+1)-\hat{\mu}_{i}(t)| |ui2(t+1)ui2(t)|+|vi2(t+1)vi2(t)|\displaystyle\leq\left|u^{2}_{i}(t+1)-u^{2}_{i}(t)\right|+\left|v^{2}_{i}(t+1)-v^{2}_{i}(t)\right| (33)
(a)|(1+η1Jj=1Jsign~(X¯j,iμ^i(t)))21|ui2(t)+α2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\Bigg{|}\biggl{(}1+\eta\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}-\hat{\mu}_{i}(t)\right)\biggr{)}^{2}-1\Bigg{|}\cdot u_{i}^{2}(t)+\alpha^{2}
(b)((1+η)21)ui2(t)+α2\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\left((1+\eta)^{2}-1\right)u_{i}^{2}(t)+\alpha^{2}
(c)3ημi+2α2\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}3\eta\mu_{i}^{\star}+2\alpha^{2}
(d)4σϵ.\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}4\sigma\sqrt{\epsilon}.

Here in (a)(a), we use the fact that |vi2(t+1)vi2(t)|max{vi2(t+1),vi2(t)}α\left|v^{2}_{i}(t+1)-v^{2}_{i}(t)\right|\leq\max\{v^{2}_{i}(t+1),v^{2}_{i}(t)\}\leq\alpha. In (b)(b), we use the estimate 0.61Jj=1Jsign~(X¯j,iμ^i(t))10.6\leq\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}-\hat{\mu}_{i}(t)\right)\leq 1. In (c)(c), we use the fact that ui2(t)μ^i(t)+vi2(t)μi+α2u_{i}^{2}(t)\leq\hat{\mu}_{i}(t)+v_{i}^{2}(t)\leq\mu_{i}^{\star}+\alpha^{2}. Lastly, in (d)(d), we use the condition that ησϵμmax\eta\leq\frac{\sigma\sqrt{\epsilon}}{\mu_{\max}^{\star}} and α20.5σϵ\alpha^{2}\leq 0.5\sigma\sqrt{\epsilon}. Hence, we have

μ^i(Ti)μiμ^i(Ti1)μi0 by definition of Ti+(μ^i(Ti)μ^i(Ti1))4σϵ.\hat{\mu}_{i}(T_{i})-\mu_{i}^{\star}\leq\underbrace{\hat{\mu}_{i}(T_{i}-1)-\mu_{i}^{\star}}_{\leq 0\text{ by definition of }T_{i}}+\left(\hat{\mu}_{i}(T_{i})-\hat{\mu}_{i}(T_{i}-1)\right)\leq 4\sigma\sqrt{\epsilon}. (34)

Combining with the fact that μ^i(Ti)μi20σϵ\hat{\mu}_{i}(T_{i})-\mu_{i}^{\star}\geq-20\sigma\sqrt{\epsilon}, we derive that |μiμ^i(Ti)|20σϵ\left|\mu_{i}^{\star}-\hat{\mu}_{i}(T_{i}^{\star})\right|\leq 20\sigma\sqrt{\epsilon}.

We will now demonstrate that for any tTit\geq T_{i}^{\star}, the condition |μiμ^i(t)|30σϵ\left|\mu_{i}^{\star}-\hat{\mu}_{i}(t)\right|\leq 30\sigma\sqrt{\epsilon} always holds. Using the fact n20000log(2d/δ)/ϵn\geq 20000\log(2d/\delta)/\epsilon and Theorem A.1 (in the appendix), we have |μ^MoMμi|5σϵ|\hat{\mu}_{\textsf{MoM}}-\mu_{i}^{\star}|\leq 5\sigma\sqrt{\epsilon}. Then, the triangle inequality implies

|μiμ^i(t)||μ^MoMμi|+|μ^MoMμ^i(t)|.\left|\mu_{i}^{\star}-\hat{\mu}_{i}(t)\right|\leq\left|\hat{\mu}_{\textsf{MoM}}-\mu_{i}^{\star}\right|+\left|\hat{\mu}_{\textsf{MoM}}-\hat{\mu}_{i}(t)\right|. (35)

Therefore, it suffices to show that |μ^MoMμ^i(t)|25σϵ\left|\hat{\mu}_{\textsf{MoM}}-\hat{\mu}_{i}(t)\right|\leq 25\sigma\sqrt{\epsilon} for every tTit\geq T_{i}^{\star}. To this goal, we use induction on tt. For t=Tit=T_{i}^{\star}, we have

|μ^MoMμ^i(Ti)||μ^MoMμi|+|μiμ^i(Ti)|25σϵ.\left|\hat{\mu}_{\textsf{MoM}}-\hat{\mu}_{i}(T_{i}^{\star})\right|\leq\left|\hat{\mu}_{\textsf{MoM}}-\mu_{i}^{\star}\right|+\left|\mu_{i}^{\star}-\hat{\mu}_{i}(T_{i}^{\star})\right|\leq 25\sigma\sqrt{\epsilon}. (36)

Now, let us assume that at time tTit\geq T_{i}^{\star}, |μ^MoMμ^i(t)|25σϵ\left|\hat{\mu}_{\textsf{MoM}}-\hat{\mu}_{i}(t)\right|\leq 25\sigma\sqrt{\epsilon}. Without loss of generality, we assume μ^i(t)μ^MoM\hat{\mu}_{i}(t)\leq\hat{\mu}_{\textsf{MoM}}. Based on the definition of the MoM estimator, we have

j=1Jsign~(X¯j,iμ^MoM)=0j=1Jsign~(X¯j,iμ^i(t))0.\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}-\hat{\mu}_{\textsf{MoM}}\right)=0\implies\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}-\hat{\mu}_{i}(t)\right)\geq 0. (37)

Let βi(t)=1Jj=1Jsign~(X¯j,iμ^i(t))\beta_{i}(t)=\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}-\hat{\mu}_{i}(t)\right). With this notation, we can derive the following inequality

μ^i(t+1)μ^i(t)\displaystyle\hat{\mu}_{i}(t+1)-\hat{\mu}_{i}(t) =(2ηβi(t)+η2βi2(t))ui2(t)+(2ηβi(t)η2βi2(t))vi2(t)0,\displaystyle=(2\eta\beta_{i}(t)+\eta^{2}\beta_{i}^{2}(t))u_{i}^{2}(t)+(2\eta\beta_{i}(t)-\eta^{2}\beta_{i}^{2}(t))v_{i}^{2}(t)\geq 0, (38)

where in the last inequality, we use the fact that ui2(t)vi2(t)u_{i}^{2}(t)\geq v_{i}^{2}(t). On the other hand, following exactly the same argument in Equation 33, we have

μ^i(t+1)μ^i(t)4σϵ.\displaystyle\hat{\mu}_{i}(t+1)-\hat{\mu}_{i}(t)\leq 4\sigma\sqrt{\epsilon}. (39)

By combining the above two inequalities, we establish that |μ^MoMμ^i(t+1)|25σϵ\left|\hat{\mu}_{\textsf{MoM}}-\hat{\mu}_{i}(t+1)\right|\leq 25\sigma\sqrt{\epsilon}. This completes the proof of induction.

Case 2: |μi|20σϵ|\mu_{i}^{\star}|\leq 20\sigma\sqrt{\epsilon}. Since μ^i(0)=0\hat{\mu}_{i}(0)=0, at iteration t=0t=0 we already have |μiμ^i(t)|20σϵ\left|\mu_{i}^{\star}-\hat{\mu}_{i}(t)\right|\leq 20\sigma\sqrt{\epsilon}. Consequently, the analysis reduces to the last phase of Case 1, from which we can conclude |μiμ^i(t)|30σϵ\left|\mu_{i}^{\star}-\hat{\mu}_{i}(t)\right|\leq 30\sigma\sqrt{\epsilon} for all t0t\geq 0.

Residual dynamics.

In this case, we employ induction on tt to demonstrate that |ui2(t)vi2(t)|α|u_{i}^{2}(t)-v^{2}_{i}(t)|\leq\alpha for all 0tT0\leq t\leq T. For the base case, this relationship is valid as ui2(0)vi2(0)=0u_{i}^{2}(0)-v^{2}_{i}(0)=0. Assuming that this relation holds at time tt, we can refer to Lemma 7.6 and deduce

0.081Jj=1Jsign~(X¯j,i)0.08.-0.08\leq\frac{1}{J}\sum_{j=1}^{J}\operatorname{\widetilde{sign}}\left(\bar{X}_{j,i}\right)\leq 0.08. (40)

Hence, we have

ui2(t+1)\displaystyle u_{i}^{2}(t+1) (1+0.08η)2ui2(t)(1+η/6)ui2(t),\displaystyle\leq(1+08\eta)^{2}u_{i}^{2}(t)\leq\left(1+\eta/6\right)u_{i}^{2}(t), (41)
vi2(t+1)\displaystyle v_{i}^{2}(t+1) (1+0.08η)2vi2(t)(1+η/6)vi2(t).\displaystyle\leq(1+08\eta)^{2}v_{i}^{2}(t)\leq\left(1+\eta/6\right)v_{i}^{2}(t).

Therefore, for all t6ηlog(1α)t\leq\frac{6}{\eta}\log\left(\frac{1}{\alpha}\right), we obtain

|ui2(t)vi2(t)|max{ui2(t),vi2(t)}α2(1+η/6)tα.\left|u_{i}^{2}(t)-v_{i}^{2}(t)\right|\leq\max\left\{u_{i}^{2}(t),v_{i}^{2}(t)\right\}\leq\alpha^{2}(1+\eta/6)^{t}\leq\alpha. (42)

Putting everything together.

Finally, since we set α0.001dσϵμmax5\alpha\leq\frac{0.001}{\sqrt{d}}\sigma\sqrt{\epsilon}\wedge\mu_{\max}^{\star-5}, for any 2ηlog(1α)T6ηlog(1α)\frac{2}{\eta}\log\left(\frac{1}{\alpha}\right)\leq T\leq\frac{6}{\eta}\log\left(\frac{1}{\alpha}\right), we have

μ^(T)μ2\displaystyle\left\lVert\hat{\mu}(T)-\mu^{\star}\right\rVert_{2} k30σϵ+dα31σkϵ.\displaystyle\leq\sqrt{k}\cdot 0\sigma\sqrt{\epsilon}+\sqrt{d}\alpha\leq 1\sigma\sqrt{k\epsilon}. (43)

This completes the proof. ∎

7.3 Proof of Theorem 4.2

The proof follows by combining Theorem 4.1 and Proposition 4.1. First, for the data distribution and corruption model considered in Theorem 4.1, once we set the sample size nlog(d/δ)/ϵn\gtrsim\log(d/\delta)/\epsilon, then with probability at least 1δ/21-\delta/2, we can successfully determine the location of the top-kk nonzero elements. For short, we represent the indices of these top-kk elements as IkI_{k}. Following the successful determination of these indices, we can then narrow our focus to a kk-dimensional subproblem on the dataset Sk:={[Xi]Ik:XiS}S_{k}:=\{[X_{i}]_{I_{k}}:X_{i}\in S\} with the mean [μ]Ik[\mu^{\star}]_{I_{k}}. We can then apply Proposition 4.1 to this reduced dataset. Specifically, once the sample size satisfies n(k+log(d/δ))/ϵn\gtrsim(k+\log(d/\delta))/\epsilon, there exists an estimator such that with probability at least 1δ/21-\delta/2, it can output a μ^\hat{\mu} satisfying μ^[μ]Ikσϵ\left\lVert\hat{\mu}-[\mu^{\star}]_{I_{k}}\right\rVert\lesssim\sigma\sqrt{\epsilon}.

Combining these two steps via a simple union bound, we know that with a probability of at least 1δ1-\delta, our two-stage estimator μ^\hat{\mu} satisfies μ^μσϵ\left\lVert\hat{\mu}-\mu^{\star}\right\rVert\lesssim\sigma\sqrt{\epsilon}. This concludes the proof. \hfill\square

7.4 Proof of Theorem 4.3

Consider two probability distributions 1\mathbb{P}_{1} and 2\mathbb{P}_{2}, where 2=(1ϵ)1+ϵ\mathbb{P}_{2}=(1-\epsilon)\mathbb{P}_{1}+\epsilon\mathbb{Q} for some distribution \mathbb{Q}. Suppose we draw nn i.i.d. samples from 1\mathbb{P}_{1}. Under the strong contamination model (Definition 1.1) with parameter ϵ\epsilon, this same set of samples can be equivalently viewed as ϵ\epsilon-corrupted samples from 2\mathbb{P}_{2}. Consequently, no algorithm can distinguish between the two cases (see [Li19] for details).

Therefore, it suffices to construct two probability distributions satisfying the conditions in Theorem 4.3. Without loss of generality, we focus on the one-dimensional case, since additional coordinates can be set identically. We require two distributions 1,2\mathbb{P}_{1},\mathbb{P}_{2} such that:

  • Both distributions have variance at most σ2\sigma^{2} and third central moment at most σ3/ϵ\sigma^{3}/\sqrt{\epsilon};

  • 2\mathbb{P}_{2} can be written as (1ϵ)1+ϵ(1-\epsilon)\mathbb{P}_{1}+\epsilon\mathbb{Q} for some distribution \mathbb{Q};

  • Their means μ1,μ2\mu_{1},\mu_{2} satisfy |μ1μ2|σϵ|\mu_{1}-\mu_{2}|\geq\sigma\sqrt{\epsilon}.

Following [Li19], we construct 1\mathbb{P}_{1} as the point mass at 0, and let 2=(1ϵ)1+ϵ\mathbb{P}_{2}=(1-\epsilon)\mathbb{P}_{1}+\epsilon\mathbb{Q}, where \mathbb{Q} is the point mass at σ/ϵ\sigma/\sqrt{\epsilon}. It is straightforward to verify that 1\mathbb{P}_{1} and 2\mathbb{P}_{2} satisfy all three conditions, completing the proof. \square

8 Conclusion

Many estimation tasks in statistics become notoriously difficult in the robust setting when certain assumptions on the data are lifted. For instance, almost all statistically optimal robust mean estimators suffer from overwhelmingly high computational costs. While classical results in robust statistics have shed light on the statistical limits of robust estimation, its computational aspects have mostly remained elusive. In this work, we aim to bridge this gap by presenting the first computationally efficient and statistically optimal method for robust sparse mean estimation, thereby overcoming a conjectured computational-statistical barrier under moderate conditions.

Acknowledgements

We thank Jikai Hou for insightful discussions. SF is supported, in part, by NSF Award DMS-2152776, ONR Award N00014-22-1-2127, and MICDE Catalyst Grant.

References

  • [AMS96] Noga Alon, Yossi Matias and Mario Szegedy “The space complexity of approximating the frequency moments” In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 1996, pp. 20–29
  • [Aro+19] Sanjeev Arora, Nadav Cohen, Wei Hu and Yuping Luo “Implicit regularization in deep matrix factorization” In Advances in Neural Information Processing Systems 32, 2019
  • [Bal+17] Sivaraman Balakrishnan, Simon S Du, Jerry Li and Aarti Singh “Computationally efficient robust sparse estimation in high dimensions” In Conference on Learning Theory, 2017, pp. 169–212 PMLR
  • [BB19] Matthew Brennan and Guy Bresler “Average-case lower bounds for learning sparse mixtures, robust estimation and semirandom adversaries” In arXiv preprint arXiv:1908.06130, 2019
  • [BB20] Matthew Brennan and Guy Bresler “Reducibility and statistical-computational gaps from secret leakage” In Conference on Learning Theory, 2020, pp. 648–847 PMLR
  • [CCM13] Yudong Chen, Constantine Caramanis and Shie Mannor “Robust sparse regression under adversarial corruption” In International conference on machine learning, 2013, pp. 774–782 PMLR
  • [CDG19] Yu Cheng, Ilias Diakonikolas and Rong Ge “High-dimensional robust mean estimation in nearly-linear time” In Proceedings of the thirtieth annual ACM-SIAM symposium on discrete algorithms, 2019, pp. 2755–2771 SIAM
  • [Che+21] Yu Cheng, Ilias Diakonikolas, Rong Ge, Shivam Gupta, Daniel M Kane and Mahdi Soltanolkotabi “Outlier-robust sparse estimation via non-convex optimization” In arXiv preprint arXiv:2109.11515, 2021
  • [Che+20] Yu Cheng, Ilias Diakonikolas, Rong Ge and Mahdi Soltanolkotabi “High-dimensional robust mean estimation via gradient descent” In International Conference on Machine Learning, 2020, pp. 1768–1778 PMLR
  • [Dep20] Jules Depersin “Robust subgaussian estimation with vc-dimension” In arXiv preprint arXiv:2004.11734, 2020
  • [Dia+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra and Alistair Stewart “Robust estimators in high-dimensions without the computational intractability” In SIAM Journal on Computing 48.2 SIAM, 2019, pp. 742–864
  • [Dia+17] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra and Alistair Stewart “Being robust (in high dimensions) can be practical” In International Conference on Machine Learning, 2017, pp. 999–1008 PMLR
  • [Dia+19a] Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Eric Price and Alistair Stewart “Outlier-robust high-dimensional sparse estimation via iterative filtering” In Advances in Neural Information Processing Systems 32, 2019
  • [DK19] Ilias Diakonikolas and Daniel M Kane “Recent advances in algorithmic high-dimensional robust statistics” In arXiv preprint arXiv:1911.05911, 2019
  • [DK23] Ilias Diakonikolas and Daniel M Kane “Algorithmic high-dimensional robust statistics” Cambridge university press, 2023
  • [Dia+22] Ilias Diakonikolas, Daniel M Kane, Sushrut Karmalkar, Ankit Pensia and Thanasis Pittas “Robust sparse mean estimation via sum of squares” In Conference on Learning Theory, 2022, pp. 4703–4763 PMLR
  • [Dia+22a] Ilias Diakonikolas, Daniel M Kane, Jasper CH Lee and Ankit Pensia “Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions” In arXiv preprint arXiv:2211.16333, 2022
  • [DKP20] Ilias Diakonikolas, Daniel M Kane and Ankit Pensia “Outlier robust mean estimation with subgaussian rates via stability” In Advances in Neural Information Processing Systems 33, 2020, pp. 1830–1840
  • [DKS17] Ilias Diakonikolas, Daniel M Kane and Alistair Stewart “Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures” In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 73–84 IEEE
  • [DG92] David L Donoho and Miriam Gasko “Breakdown properties of location estimates based on halfspace depth and projected outlyingness” In The Annals of Statistics JSTOR, 1992, pp. 1803–1827
  • [DL88] David L Donoho and Richard C Liu “The” automatic” robustness of minimum distance functionals” In The Annals of Statistics 16.2 Institute of Mathematical Statistics, 1988, pp. 552–586
  • [Fre+22] Spencer Frei, Gal Vardi, Peter L Bartlett, Nathan Srebro and Wei Hu “Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data” In arXiv preprint arXiv:2210.07082, 2022
  • [GBL19] Gauthier Gidel, Francis Bach and Simon Lacoste-Julien “Implicit regularization of discrete gradient dynamics in linear neural networks” In Advances in Neural Information Processing Systems 32, 2019
  • [GSD19] Daniel Gissin, Shai Shalev-Shwartz and Amit Daniely “The implicit bias of depth: How incremental learning drives generalization” In arXiv preprint arXiv:1909.12051, 2019
  • [Ham71] FR Hampel “A general definition of qualitative robustness” In Ann. Math. Stat 42, 1971, pp. 1887–1896
  • [Ham74] Frank R Hampel “The influence curve and its role in robust estimation” In Journal of the american statistical association 69.346 Taylor & Francis, 1974, pp. 383–393
  • [Hu+20] Wei Hu, Lechao Xiao, Ben Adlam and Jeffrey Pennington “The surprising simplicity of the early-time learning dynamics of neural networks” In Advances in Neural Information Processing Systems 33, 2020, pp. 17116–17128
  • [Hub64] Peter J Huber “Robust Estimation of a Location Parameter” In The Annals of Mathematical Statistics JSTOR, 1964, pp. 73–101
  • [Hub67] Peter J Huber “Under nonstandard conditions” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability: Weather Modification; University of California Press: Berkeley, CA, USA, 1967, pp. 221
  • [Hub11] Peter J Huber “Robust statistics” In International encyclopedia of statistical science Springer, 2011, pp. 1248–1251
  • [JVV86] Mark R Jerrum, Leslie G Valiant and Vijay V Vazirani “Random generation of combinatorial structures from a uniform distribution” In Theoretical computer science 43 Elsevier, 1986, pp. 169–188
  • [Jin+23] Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S Du and Jason D Lee “Understanding incremental learning of gradient descent: A fine-grained analysis of matrix sensing” In arXiv preprint arXiv:2301.11500, 2023
  • [LRV16] Kevin A Lai, Anup B Rao and Santosh Vempala “Agnostic estimation of mean and covariance” In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 2016, pp. 665–674 IEEE
  • [LL20] Guillaume Lecué and Matthieu Lerasle “Robust machine learning by median-of-means: theory and practice” In Annals of Statistics, 2020
  • [Li19] Jerry Z. Li “Lecture 2: Total Variation, Statistical Models, and Lower Bounds” https://jerryzli.github.io/robust-ml-fall19/lec2.pdf, Lecture notes, Robust Machine Learning (Fall 2019), 2019
  • [Li+21] Jiangyuan Li, Thanh Nguyen, Chinmay Hegde and Ka Wai Wong “Implicit sparse regularization: The impact of depth and early stopping” In Advances in Neural Information Processing Systems 34, 2021, pp. 28298–28309
  • [LLL20] Zhiyuan Li, Yuping Luo and Kaifeng Lyu “Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning” In arXiv preprint arXiv:2012.09839, 2020
  • [LM19] Gabor Lugosi and Shahar Mendelson “Robust multivariate mean estimation: the optimality of trimmed mean” In arXiv preprint arXiv:1907.11391, 2019
  • [LM19a] Gábor Lugosi and Shahar Mendelson “Near-optimal mean estimators with respect to general norms” In Probability theory and related fields 175.3-4 Springer, 2019, pp. 957–973
  • [LM19b] Gábor Lugosi and Shahar Mendelson “Sub-Gaussian estimators of the mean of a random vector” In The Annals of Statistics 47.2 JSTOR, 2019, pp. 783–794
  • [MF22] Jianhao Ma and Salar Fattahi “Blessing of Depth in Linear Regression: Deeper Models Have Flatter Landscape Around the True Solution” In Advances in Neural Information Processing Systems, 2022
  • [MGF22] Jianhao Ma, Lingjun Guo and Salar Fattahi “Behind the Scenes of Gradient Descent: A Trajectory Analysis via Basis Function Decomposition” In arXiv preprint arXiv:2210.00346, 2022
  • [MLF25] Jianhao Ma, Geyu Liang and Salar Fattahi “Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions” In arXiv preprint arXiv:2505.17304, 2025
  • [Mas90] Pascal Massart “The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality” In The annals of Probability JSTOR, 1990, pp. 1269–1283
  • [Min15] Stanislav Minsker “Geometric median and robust estimation in Banach spaces”, 2015
  • [NY83] Arkadij Semenovič Nemirovskij and David Borisovich Yudin “Problem complexity and method efficiency in optimization” Wiley-Interscience, 1983
  • [PBR20] Adarsh Prasad, Sivaraman Balakrishnan and Pradeep Ravikumar “A robust univariate mean estimator is all you need” In International Conference on Artificial Intelligence and Statistics, 2020, pp. 4034–4044 PMLR
  • [Pra+20] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan and Pradeep Ravikumar “Robust estimation via robust gradient estimation” In Journal of the Royal Statistical Society Series B: Statistical Methodology 82.3 Oxford University Press, 2020, pp. 601–627
  • [RMC21] Noam Razin, Asaf Maman and Nadav Cohen “Implicit regularization in tensor factorization” In International Conference on Machine Learning, 2021, pp. 8913–8924 PMLR
  • [RMC22] Noam Razin, Asaf Maman and Nadav Cohen “Implicit regularization in hierarchical tensor factorization and deep convolutional neural networks” In International Conference on Machine Learning, 2022, pp. 18422–18462 PMLR
  • [Rou+11] Peter J Rousseeuw, Frank R Hampel, Elvezio M Ronchetti and Werner A Stahel “Robust statistics: the approach based on influence functions” John Wiley & Sons, 2011
  • [SCV17] Jacob Steinhardt, Moses Charikar and Gregory Valiant “Resilience: A criterion for learning in the presence of arbitrary outliers” In arXiv preprint arXiv:1703.04940, 2017
  • [Tuk62] John W Tukey “The future of data analysis” In The annals of mathematical statistics 33.1 JSTOR, 1962, pp. 1–67
  • [Tuk60] John Wilder Tukey “A survey of sampling from contaminated distributions” In Contributions to probability and statistics Stanford University Press, 1960, pp. 448–485
  • [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge university press, 2018
  • [Yat85] Yannis G Yatracos “Rates of convergence of minimum distance estimators and Kolmogorov’s entropy” In The Annals of Statistics 13.2 Institute of Mathematical Statistics, 1985, pp. 768–774

Appendix A MoM Estimator under Strong Contamination Model

In this section, we prove the key properties of the 11-dimensional and high-dimensional MoM estimators under the strong contamination model (Definition 1.1). The following is a more precise statement of Proposition 3.1, which is adapted from Fact 2.1. in [Dia+22a]. As the complete proof does not appear in the original source, we include it here for completeness.

Proposition A.1 (One-dimensional MoM estimator).

Consider a corruption parameter ϵ\epsilon, failure probability δ\delta, and a set SS of nn many ϵ\epsilon-corrupted samples from a distribution \mathbb{P} with mean μ\mu^{\star} and variance 𝔼[(Xμ)2]σ2\mathbb{E}[(X-\mu^{\star})^{2}]\leq\sigma^{2}. Then, with probability at least 1δ1-\delta, the MoM estimator μ^MoM\hat{\mu}_{\textsf{MoM}} satisfies |μ^MoMμ|σ(42(ϵ+1/n)+16log(1/δ)/n)|\hat{\mu}_{\textsf{MoM}}-\mu^{\star}|\leq\sigma\left(4\sqrt{2}\left(\sqrt{\epsilon}+\sqrt{1/n}\right)+16\sqrt{\log(1/\delta)/n}\right).

Proof.

Proof We partition the index set of the subgroups {1,,J}\{1,\cdots,J\} into two parts: 𝒥clean\mathcal{J}_{\text{clean}} and 𝒥outlier\mathcal{J}_{\text{outlier}}. Here 𝒥clean\mathcal{J}_{\text{clean}} comprises all the subgroups without outliers, and 𝒥outlier\mathcal{J}_{\text{outlier}} consists of subgroups containing at least one outlier. According to our strong contamination model, we have |𝒥outlier|ϵn|\mathcal{J}_{\text{outlier}}|\leq\lceil\epsilon n\rceil. Subsequently, we observe that

{|μ^MoMμ|ξ}{j𝒥clean𝕀(|X¯jμ|ξ)J2ϵn}.\{|\hat{\mu}_{\textsf{MoM}}-\mu^{\star}|\geq\xi\}\subseteq\biggl{\{}\sum_{j\in\mathcal{J}_{\text{clean}}}\mathbb{I}(|\bar{X}_{j}-\mu^{\star}|\geq\xi)\geq\frac{J}{2}-\lceil\epsilon n\rceil\biggr{\}}. (44)

Here, X¯j=1BiSjXi\bar{X}_{j}=\frac{1}{B}\sum_{i\in S_{j}}X_{i}, where B=n/JB=n/J is the size of each subgroup and SjS_{j} is the subgroup jj. For simplicity, let us denote Zj=𝕀(|X¯jμ|ξ)Z_{j}=\mathbb{I}(|\bar{X}_{j}-\mu^{\star}|\geq\xi) and pξ=Pr(|X¯jμ|ξ)p_{\xi}=\mathrm{Pr}\left(|\bar{X}_{j}-\mu^{\star}|\geq\xi\right). Then, the above inclusion implies

Pr(|μ^MoMμ|ξ)\displaystyle\mathrm{Pr}\left(|\hat{\mu}_{\textsf{MoM}}-\mu^{\star}|\geq\xi\right) Pr(j𝒥cleanZjJ2ϵn)\displaystyle\leq\mathrm{Pr}\biggl{(}\sum_{j\in\mathcal{J}_{\text{clean}}}Z_{j}\geq\frac{J}{2}-\lceil\epsilon n\rceil\biggr{)} (45)
=Pr(1|𝒥clean|j𝒥clean(Zj𝔼[Zj])J/2ϵn|𝒥clean|pξ).\displaystyle=\mathrm{Pr}\biggl{(}\frac{1}{|\mathcal{J}_{\text{clean}}|}\sum_{j\in\mathcal{J}_{\text{clean}}}\left(Z_{j}-\mathbb{E}[Z_{j}]\right)\geq\frac{J/2-\lceil\epsilon n\rceil}{|\mathcal{J}_{\text{clean}}|}-p_{\xi}\biggr{)}.

Since ZjZ_{j} is bounded, we can apply Hoeffding’s inequality (Lemma 7.2) to obtain

Pr(|μ^MoMμ|ξ)exp{2|𝒥clean|(J/2ϵn|𝒥clean|pξ)2}.\mathrm{Pr}\left(|\hat{\mu}_{\textsf{MoM}}-\mu^{\star}|\geq\xi\right)\leq\exp\biggl{\{}-2|\mathcal{J}_{\text{clean}}|\left(\frac{J/2-\lceil\epsilon n\rceil}{|\mathcal{J}_{\text{clean}}|}-p_{\xi}\right)^{2}\biggr{\}}. (46)

Moreover, we can use Chebyshev’s inequality (Lemma 7.1) to establish an upper bound for pξp_{\xi}:

pξ=Pr(|X¯jμ|ξ)σ2Bξ2=Jσ2nξ2.p_{\xi}=\mathrm{Pr}\left(|\bar{X}_{j}-\mu^{\star}|\geq\xi\right)\leq\frac{\sigma^{2}}{B\xi^{2}}=\frac{J\sigma^{2}}{n\xi^{2}}. (47)

Upon defining J=4ϵn+32log(1/δ)J=4\lceil\epsilon n\rceil+32\log(1/\delta) and ξ=σ(42(ϵ+1/n)+16log(1/δ)/n)\xi=\sigma\left(4\sqrt{2}\left(\sqrt{\epsilon}+\sqrt{1/n}\right)+16\sqrt{\log(1/\delta)/n}\right), we have the following estimates

|𝒥clean|\displaystyle|\mathcal{J}_{\text{clean}}| Jϵn32log(1/δ);\displaystyle\geq J-\lceil\epsilon n\rceil\geq 2\log(1/\delta); (48)
J/2ϵn|𝒥clean|\displaystyle\frac{J/2-\lceil\epsilon n\rceil}{|\mathcal{J}_{\text{clean}}|} J/2ϵnJ14;\displaystyle\geq\frac{J/2-\lceil\epsilon n\rceil}{J}\geq\frac{1}{4};
pξ\displaystyle p_{\xi} Jσ2nξ218.\displaystyle\leq\frac{J\sigma^{2}}{n\xi^{2}}\leq\frac{1}{8}.

Combining these bounds with Equation 46, we obtain

Pr(|μ^MoMμ|σ(42(ϵ+1/n)+16log(1/δ)/n))\displaystyle\mathrm{Pr}\left(|\hat{\mu}_{\textsf{MoM}}-\mu^{\star}|\geq\sigma\left(4\sqrt{2}\left(\sqrt{\epsilon}+\sqrt{1/n}\right)+16\sqrt{\log(1/\delta)/n}\right)\right) (49)
exp{232log(1/δ)(1418)2}=δ.\displaystyle\leq\exp\biggl{\{}-2\cdot 2\log(1/\delta)\cdot\left(\frac{1}{4}-\frac{1}{8}\right)^{2}\biggr{\}}=\delta.

This completes the proof. \hfill\square

Directly applying MoM estimator to each coordinate of a dd-dimensional dataset leads to the following proposition.

Theorem A.1 (High dimensional coordinate-wise MoM estimator).

Consider a corruption parameter ϵ\epsilon, failure probability δ\delta, and a set SS of nn many ϵ\epsilon-corrupted samples from a distribution \mathbb{P} with mean μ\mu^{\star} and coordinate-wise variance 𝔼[(Xμ)2]σ2,1id\mathbb{E}[(X-\mu^{\star})^{2}]\leq\sigma^{2},\forall 1\leq i\leq d. Then, with probability at least 1δ1-\delta, the coordinate-wise MoM estimator μ^MoM\hat{\mu}_{\textsf{MoM}} satisfies μ^MoMμσ(42(ϵ+1/n)+16log(d/δ)/n)\left\lVert\hat{\mu}_{\textsf{MoM}}-\mu^{\star}\right\rVert_{\infty}\leq\sigma\left(4\sqrt{2}\left(\sqrt{\epsilon}+\sqrt{1/n}\right)+16\sqrt{\log(d/\delta)/n}\right) and μ^MoMμ2σd(42(ϵ+1/n)+16log(d/δ)/n)\left\lVert\hat{\mu}_{\textsf{MoM}}-\mu^{\star}\right\rVert_{2}\leq\sigma\sqrt{d}\left(4\sqrt{2}\left(\sqrt{\epsilon}+\sqrt{1/n}\right)+16\sqrt{\log(d/\delta)/n}\right).

Proof.

Proof The proof follows directly from Proposition A.1 and a simple union bound. \hfill\square

Appendix B Additional Simulations

B.1 Experimental Details

We run our simulations on three heavy-tailed distributions: Fisk, Pareto, and Student’s tt distributions. In each case, we apply a symmetrization trick to make the density function symmetric around zero. The density function of the Fisk distribution with parameter cc is expressed as follows:

f(x;c)=c|x|c12(1+|x|c)2for x,c>0.f(x;c)=\frac{c|x|^{c-1}}{2(1+|x|^{c})^{2}}\quad\text{for }x\in\mathbb{R},c>0. (50)

The density function of the Pareto distribution with parameters bb is

f(x;b)={b2|x|b+1for|x|1,0for|x|<1.,for x,b>0.f(x;b)=\{\begin{array}[]{cc}\frac{b}{2|x|^{b+1}}&\mbox{for}\quad|x|\geq 1,\\ 0&\mbox{for}\quad|x|<1.\end{array},\quad\text{for }x\in\mathbb{R},b>0. (51)

Lastly, the density function for student tt-distribution is

f(x;ν)=Γ(ν+12)νπΓ(ν/2)(1+x2ν)(ν+1)/2for x,ν>0.f(x;\nu)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\Gamma(\nu/2)}\left(1+\frac{x^{2}}{\nu}\right)^{-(\nu+1)/2}\quad\text{for }x\in\mathbb{R},\nu>0. (52)

Here Γ\Gamma is the gamma function. In all three distributions described above, the parameters c,b,νc,b,\nu correspondingly denote the existence of the c,b,νc,b,\nu-th moment. For instance, when c,b,νc,b,\nu fall within the range of (1,2](1,2], the variances are infinite. Regarding the outliers, we generate them via the constant-bias noise model as introduced in [Che+21].

Furthermore, unless stated otherwise, all simulations are conducted with the following predefined settings: data dimension dd is set to 100100, sparsity level kk is set to 44 with nonzero elements being [10,5,4,2][10,-5,-4,2], sample size mm is set to 600, and the corruption ratio ϵ\epsilon is set at 10%10\%. As for our algorithm, we set the number of subgroups to be J=1.5ϵn+150J=1.5\lceil\epsilon n\rceil+150. Note that, compared to the theoretical choice of J=100ϵnJ=100\lceil\epsilon n\rceil in Algorithm 1, we choose a smaller JJ to make our algorithm work for a larger corruption ratio ϵ\epsilon in practice. Moreover, in SubGM, we set the initialization scale α=105\alpha=10^{-5} and the step-size η=0.05\eta=0.05.

We select sparse_GD [Che+21] and sparse_filter [Dia+19a] as our benchmark algorithms. We note that these algorithms do not come with theoretical assurances in the heavy-tailed setting. Nonetheless, we have empirically found that these two algorithms surpass others in performance, even in the heavy-tailed setting. We also highlight that the polynomial-time algorithms that come equipped with theoretical guarantees for heavy-tailed setting [Dia+22a, Dia+22] are impractical since they rely on time-consuming methods such as sum-of-squares and ellipsoid methods.

We employ both sparse_GD and sparse_filter in the second stage of our algorithm, setting the sparsity parameter to k=|I|k=|I|, where II is the index set identified in the first stage. In total, we evaluate six estimators: oracle (which removes all outliers), sparse_GD, sparse_filter, stage_1, full_GD (our algorithm with sparse_GD in the second stage), and full_filter (our algorithm with sparse_filter in the second stage). In stage_1, we run SubGM for T=600T=600 iterations, whereas in full_GD and full_filter, we reduce the iteration count to T=200T=200 to lower computational cost.

B.2 Sensitivity to Prior Knowledge of kk

We underscore the fact that prior algorithms necessitate a prior knowledge of the exact sparsity level kk. In contrast, our approach can identify the sparsity level automatically. For this simulation, we assign a true sparsity level of k=10k=10 with nonzero components [2,2,2,2,2,2,2,2,2,2][2,2,2,2,2,-2,-2,-2,-2,-2] and assess the performance of the benchmark algorithms, namely sparse_GD and sparse_filter, while varying the input kk^{\prime}, which is an upper bound of kk, within the range of [10, 40]. As illustrated in Figure 3, the performance of these benchmark algorithms is highly sensitive to the choice of kk^{\prime} across all examined distributions. Their performances further destabilize when the underlying distributions start to exhibit heavier tails. In contrast, our algorithm automatically recognizes the sparsity pattern across all scenarios. For all subsequent simulations, we provide the benchmark algorithms with the true sparsity level kk to ensure a fair comparison.

Refer to caption
Refer to caption
Figure 3: Comparison among different algorithms with varying input kk^{\prime}, where k=10k=10 and kkk^{\prime}\geq k is an upper bound of kk. The second row corresponds to distributions with infinite variance.

B.3 Performance with Different kk

In this simulation, we evaluate the performance of various algorithms under different sparsity levels kk. We set all nonzero entries of μ\mu^{\star} to 22. As shown in the first row of Figure 4, all algorithms—except stage_1 (as predicted by Theorem 4.1) and sparse_filter (which underperforms at larger sparsity levels kk)—achieve 2\ell_{2}-error that remains largely independent of sparsity. In more heavy-tailed settings, depicted in the second row of Figure 4, all algorithms display an increase in 2\ell_{2}-error as kk grows. Nevertheless, across nearly all scenarios, our full algorithms (full_GD and full_filter) outperform the benchmarks. We further hypothesize that the weaker performance of full_filter for the Pareto distribution with b=2b=2 arises from the suboptimal performance of sparse_GD and sparse_filter when used in Stage 2.

Refer to caption
Refer to caption
Figure 4: Comparison among different algorithms for varying sparsity levels kk. The second row corresponds to distributions with infinite variance.

B.4 Infinite Variance Regime

In this simulation, we evaluate the performance of the algorithms with respect to the heaviness of the tail distributions. As shown in Figure 5, we vary the parameters c,b,νc,b,\nu over the range [1,3.5][1,3.5]. Smaller parameter values correspond to heavier tails, with values in the interval (1,2](1,2] resulting in distributions of infinite variance. Our algorithms (stage_1, full_GD, and full_filter) demonstrate superior robustness under these heavy-tailed conditions, highlighting the advantage of our approach.

Refer to caption
Figure 5: Comparison among different algorithms in the infinite variance regime.

B.5 Performance with Different ϵ\epsilon

In this simulation, we study the relationship between the 2\ell_{2}-error and the corruption ratio ϵ\epsilon across all six estimators. As shown in Figure 6, apart from the Oracle—whose error remains unaffected by ϵ\epsilon (as expected)—our proposed algorithms (either single-stage or full version) consistently outperform the alternatives. While our theoretical analysis predicts an 2\ell_{2}-error of order 𝒪(ϵ)\mathcal{O}(\sqrt{\epsilon}), the empirical results reveal an approximately linear dependence on ϵ\epsilon. We attribute this discrepancy to the non-adversarial nature of the outlier model used in our experiments.

Refer to caption
Figure 6: Comparison among different algorithms for different corruption rates ϵ\epsilon.

B.6 Running Time

Next, we examine the running time of our algorithms. Specifically, we run 600600 iterations for stage_1, while in the full algorithms we restrict Stage 1 to 200200 iterations. As shown in Figure 7, all algorithms exhibit linear runtime, consistent with our theoretical guarantees.

Refer to caption
Figure 7: Running time of the proposed single-stage and full algorithms as a function of dimension.