Sparse Mean Estimation in Adversarial Settings via Incremental Learning
Abstract
In this paper, we study the problem of sparse mean estimation under adversarial corruptions, where the goal is to estimate the $k$-sparse mean of a heavy-tailed distribution from samples contaminated by adversarial noise. Existing methods face two key limitations: they require prior knowledge of the sparsity level and scale poorly to high-dimensional settings. We propose a simple and scalable estimator that addresses both challenges. Specifically, it learns the $k$-sparse mean without knowing $k$ in advance and operates in near-linear time and memory with respect to the ambient dimension. Under a moderate signal-to-noise ratio, our method achieves the optimal statistical rate, matching the information-theoretic lower bound. Extensive simulations corroborate our theoretical guarantees. At the heart of our approach is an incremental learning phenomenon: we show that a basic subgradient method applied to a nonconvex two-layer formulation with an $\ell_1$-loss can incrementally learn the nonzero components of the true mean while suppressing the rest. More broadly, our work is the first to reveal the incremental learning phenomenon of the subgradient method in the presence of heavy-tailed distributions and adversarial corruption.
1 Introduction
Almost all statistical methods rely explicitly or implicitly on certain assumptions on the distribution of the data. In practice, however, these assumptions are only approximately satisfied, mainly due to the presence of heavy-tailed distributions and adversarial corruptions [Rou+11]. To resolve these issues, the field of robust statistics has been developed to construct estimators that exhibit “insensitivity to small deviations from the (model) assumptions” [Hub11, p.2]. Robust statistics has a long history with the fundamental work of John Tukey [Tuk60, Tuk62], Peter Huber [Hub64, Hub67], and Frank Hampel [Ham71, Ham74]. It has been applied across various domains, such as biology, finance, and computer science [Rou+11].
Nonetheless, in high-dimensional scenarios, robust statistics contends with the curse of dimensionality. First, the majority of estimators in the literature require runtime that is exponential in the data dimension. To resolve this problem, special attention has been devoted to algorithmic robust statistics, which aims to design efficient algorithms for different tasks in high-dimensional robust statistics (see the recent book [DK23] and survey paper [DK19]). Second, generic high-dimensional robust statistical methods are often oblivious to the intrinsic structure of the data. As such, they rely on overly conservative sample sizes that have an undesirable dependency on the data dimension.
In this paper, we aim to address these challenges for one of the most fundamental problems in robust statistics, namely robust sparse mean estimation. More specifically, given an $\epsilon$-corrupted set of samples from an unknown and possibly heavy-tailed distribution with a $k$-sparse mean, our goal is to design a computationally and statistically efficient estimator of this mean. Throughout this paper, we focus on the so-called strong contamination model [DK23, Definition 1.6] for the corruption in the data, which encompasses a variety of existing models, such as Huber’s contamination model [Hub64].
Definition 1.1 (Strong contamination model).
Given a corruption parameter $\epsilon$ and a distribution, the $\epsilon$-corrupted samples are generated as follows: (i) the algorithm specifies the number of samples $n$, and then $n$ i.i.d. samples are drawn from the distribution. (ii) An arbitrarily powerful adversary then inspects the samples, removes $\epsilon n$ of them, and replaces them with arbitrary points. The resulting $\epsilon$-corrupted samples are given to the algorithm.
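To make the two steps of Definition 1.1 concrete, the following is a minimal sketch of how corrupted samples might be simulated. The function name `corrupt_samples` and the `outlier` callback are hypothetical and introduced only for illustration. The sketch also replaces a uniformly random subset, whereas a true adversary may choose which samples to replace after inspecting all of them.

```python
import random

def corrupt_samples(clean, eps, outlier, rng):
    """Replace floor(eps * n) of the clean samples with adversarial points.

    `outlier` is a hypothetical callback standing in for the adversary: it is
    given the full clean sample and returns one replacement point per call.
    """
    n = len(clean)
    n_bad = int(eps * n)  # number of samples the adversary may replace
    # Simplification: indices are chosen at random; a real adversary picks them.
    bad_idx = set(rng.sample(range(n), n_bad))
    return [outlier(clean) if i in bad_idx else x for i, x in enumerate(clean)]

rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# Illustrative adversary: place every outlier far to the right of all clean points.
corrupted = corrupt_samples(clean, eps=0.05, outlier=lambda xs: max(xs) + 10.0, rng=rng)
```

Because the adversary sees the clean sample before acting, the callback receives the whole list rather than a single point.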
Designing a statistically and computationally efficient estimator for the mean is highly nontrivial in this setting for the following reasons. First, in contrast to robust (dense) mean estimation, there is a conjectured computational-statistical tradeoff [DKS17, BB19, BB20] for robust $k$-sparse mean estimation, which asserts that any efficient algorithm needs qualitatively more samples than its statistically optimal (but possibly inefficient) counterpart. This conjecture has neither been proved nor refuted. Second, most existing mean estimators are designed for light-tailed distributions [Bal+17, Dia+19a, Che+21]. The only two efficient estimators available for heavy-tailed distributions [Dia+22a, Dia+22] are impractical for real-world applications, as they rely on computationally intensive techniques such as the ellipsoid algorithm and the sum-of-squares method. A fundamental question thus arises:
Can we design a practically efficient estimator for the robust sparse mean estimation problem that overcomes the conjectured computational-statistical tradeoff?
In this work, we provide an affirmative answer to this question under moderate assumptions. Our proposed approach comprises two stages. In the first stage, we provide a coarse-grained estimation of the mean that is enough to identify the top-$k$ nonzero elements of the mean. In particular, we show that a simple subgradient method applied to a two-layer diagonal linear neural network with $\ell_1$-loss can identify the top-$k$ nonzero elements of the mean incrementally and sequentially while keeping the zero entries arbitrarily small. After the identification of the top-$k$ nonzero elements, in the second stage, we provide a finer-grained estimation of the nonzero elements of the mean by employing a generic robust mean estimator, such as those introduced in [DK19, Che+20], restricted to the top-$k$ nonzero elements, thereby reducing the effective dimension of the problem from $d$ to $k$. Our proposed approach achieves optimal statistical error, sample complexity, and computational cost under moderate assumptions. Furthermore, we demonstrate that these assumptions do not alter the inherent complexity of the problem, as evidenced by a matching information-theoretic lower bound. Table 1 provides a summary of our results compared to the existing estimators. Our contributions are summarized below:
- Overcoming the computational-statistical tradeoff. We demonstrate that our algorithm can surpass the conjectured computational-statistical tradeoff under additional conditions. At a high level, we require an $\epsilon$-dependent upper bound for the coordinate-wise third moment and a lower bound for the signal-to-noise ratio (SNR). Additionally, we demonstrate that our algorithm matches the information-theoretic lower bound under exactly the same conditions.
- Near-linear dependency on the dimension. The first stage of our algorithm is coordinate-wise decomposable and fully parallelizable. Therefore, it runs in time and memory on a single thread, and in time and memory on threads. Moreover, the computational cost of the second stage of our algorithm is independent of $d$. In contrast, the existing robust sparse mean estimators have a poor dependency on $d$ (see Table 1).
- No prior knowledge of the sparsity level. Our method does not require prior knowledge of the sparsity level $k$. In contrast, all existing methods for robust sparse mean estimation (in both light- and heavy-tailed settings) require knowledge of the sparsity level $k$.
- Superior practical performance. Through extensive experiments, we show that, despite its simplicity, our method performs well across a broad class of heavy-tailed distributions, including those with unbounded variance.
| Algorithm | -error | Sample complexity | Running time |
| Lower bound | - | ||
| [Dep20, PBR20] | |||
| [Dia+22a] | |||
| [Dia+22] | |||
| Ours (Stage 1)∗ | |||
| Ours (full)∗ |
2 Related Work
Robust (sparse) mean estimation.
Robust mean estimation is a fundamental problem in statistics, with its earliest work dating back to [Tuk60, Hub64]. However, throughout its extensive history [Yat85, DL88, DG92, Hub11], and even up to recent times [LM19, LM19b, Dep20, PBR20], most statisticians have primarily focused on developing statistically optimal estimators, often overlooking the fact that these estimators can be computationally inefficient. It is only recently, following the seminal work of [LRV16, Dia+19], that researchers have started to develop polynomial-time algorithms for robust mean estimation [Dia+17, SCV17, CDG19] as well as other robust learning tasks, including robust PCA [Bal+17] and robust regression [CCM13].
Robust sparse mean estimation, as a distinct variant, has attracted considerable attention, particularly in extremely high-dimensional settings. However, the situation for robust sparse mean estimation is more nuanced than in the dense case. First, unlike the dense case, there is a conjectured computational-statistical tradeoff [DKS17, BB19, BB20], suggesting that efficient algorithms demand a qualitatively larger sample complexity than their inefficient counterparts. In particular, there is evidence that such a tradeoff is unavoidable for Statistical Query (SQ) algorithms [DKS17]. Second, most prior works have primarily concentrated on the light-tailed setting [Bal+17, Dia+19a, Che+21]. Researchers have only recently addressed the heavy-tailed setting using stability-based approaches [Dia+22a] and sum-of-squares methods [Dia+22]. While these algorithms run in polynomial time, they may not be practical in high-dimensional settings.
Incremental learning.
Over the past few years, it has been shown practically and theoretically that gradient-based methods tend to explore the solution space in an incremental order of complexity, ultimately favoring low-complexity solutions in numerous machine learning tasks [GSD19, MLF25]. This phenomenon is known as incremental learning. Specifically, researchers have investigated incremental learning in various contexts, such as matrix factorization and its variants [LLL20, MGF22, Jin+23], tensor factorization [RMC21, RMC22, MGF22], deep linear networks [Aro+19, GBL19, Li+21, MF22], and general neural networks [Hu+20, Fre+22]. In essence, incremental learning is believed to be crucial for understanding the empirical success of optimization and generalization in contemporary machine learning [GSD19]. However, to the best of our knowledge, its emergence in adversarial settings remains unexplored.
Notation:
We use the notations and to denote , for a universal constant and sufficiently large . Similarly, the notations and are used to denote , for a universal constant and sufficiently large . The notation is used to denote and . Moreover, the notation implies that . The function is defined as if , and . We also define if , and . Given a set , the indicator function is defined as if , and otherwise. Similarly, and with a slight abuse of notation, for an event , we define the indicator function if occurs, and otherwise. We denote . For two functions , we define . For two vectors , their Hadamard product is defined as . For a vector , we define . For a vector and index set with size , the notation refers to the projection of onto . Moreover, we define . We represent mixtures of probability distributions as linear combinations of their corresponding density functions. For example, given two distributions and and a scalar , we define the mixture . A sample from is drawn from with probability and from with probability .
3 Overview of Our Approach
To lay the groundwork, we begin by introducing the standard median-of-means (MoM) estimator [NY83, JVV86, AMS96], originally designed for estimating the mean of a one-dimensional random variable. The MoM estimator serves as a cornerstone for more sophisticated methods, as detailed in [LM19b, Pra+20, LL20, Dia+22a].
Definition 3.1 (Median-of-means estimator for one-dimensional case).
Given a set of -corrupted samples , we first partition them into subgroups with equal sizes, where we assume is divisible by for simplicity. We then calculate the sample mean for each subgroup, i.e., where . Subsequently, the median-of-means (MoM) estimator is obtained by taking the median of the sample means , i.e., .
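The construction in Definition 3.1 can be sketched in a few lines. The function name `median_of_means` and the toy data are illustrative choices, not taken from the paper's codebase.

```python
import statistics

def median_of_means(samples, num_groups):
    """One-dimensional MoM: split into equal groups, average each, take the median."""
    n = len(samples)
    if n % num_groups != 0:
        raise ValueError("sample size must be divisible by the number of groups")
    size = n // num_groups
    group_means = [
        sum(samples[j * size:(j + 1) * size]) / size for j in range(num_groups)
    ]
    return statistics.median(group_means)

# Nine well-behaved points plus one gross outlier, split into five groups of two.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0, 1000.0]
mom = median_of_means(data, num_groups=5)
plain_mean = sum(data) / len(data)
```

In this toy example the outlier contaminates only one of the five group means, so the median of the group means stays close to 1, whereas the plain sample mean is pulled to roughly 100.9.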
Alternatively, the MoM estimator can be expressed as the minimizer of the following $\ell_1$-loss:
| (1) |
By appropriately selecting the number of subgroups , it can be shown that the MoM estimator matches the information-theoretic lower bound for heavy-tailed distributions under the strong contamination model (Definition 1.1).
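The variational characterization of MoM rests on the standard fact that a median minimizes the average absolute deviation. The following small check illustrates this numerically on hypothetical subgroup means; the grid search is used only for illustration and is not how the estimator would be computed in practice.

```python
import statistics

def l1_loss(mu, group_means):
    """Average absolute deviation of a candidate mu from the subgroup means."""
    return sum(abs(mu - g) for g in group_means) / len(group_means)

# Hypothetical subgroup means; the last one is contaminated by an outlier.
group_means = [0.95, 0.95, 1.0, 1.1, 500.5]
med = statistics.median(group_means)

# Scan a grid of candidates; none should achieve a smaller loss than the median.
grid = [i / 100 for i in range(0, 60001)]
best = min(grid, key=lambda mu: l1_loss(mu, group_means))
```

With an odd number of subgroups the minimizer is unique, so the grid search returns the median itself.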
Proposition 3.1 (One-dimensional MoM estimator).
Consider a corruption parameter , a failure probability , and a set of many -corrupted samples from a distribution with mean and variance . Suppose that . Then, upon choosing the number of subgroups , with probability at least over the sample set , the MoM estimator satisfies .
A more precise statement of Proposition 3.1 and its proof are presented in Appendix A.
Naively applying the MoM estimator to the individual coordinates of a high-dimensional random variable leads to an undesirable dependency on the dimension $d$. More precisely, the coordinate-wise MoM, which corresponds to the solution of the following convex optimization problem,
| (cvx) |
suffers from a suboptimal error rate of (see Theorem A.1 in Appendix A). This error is unavoidable for the MoM estimator since the estimation error is spread evenly across all coordinates. An alternative approach, the geometric MoM [Min15], which replaces the $\ell_1$-norm in (cvx) by the $\ell_2$-norm, also suffers from a similar error.
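The dimension dependence of coordinate-wise MoM can be seen in a small simulation. The sketch below uses clean Gaussian data and illustrative values for the dimension, sparsity, and sample size (none of them taken from the paper), and compares the total $\ell_2$ error with the part of the error that comes from the support alone.

```python
import math
import random
import statistics

def coordinatewise_mom(samples, num_groups):
    """Apply one-dimensional median-of-means to each coordinate independently."""
    n, d = len(samples), len(samples[0])
    size = n // num_groups  # assumes n is divisible by num_groups
    estimate = []
    for i in range(d):
        means = [
            sum(samples[r][i] for r in range(j * size, (j + 1) * size)) / size
            for j in range(num_groups)
        ]
        estimate.append(statistics.median(means))
    return estimate

rng = random.Random(1)
d, k, n = 200, 5, 200                      # illustrative sizes
true_mean = [1.0] * k + [0.0] * (d - k)    # k-sparse mean with illustrative values
samples = [[m + rng.gauss(0.0, 1.0) for m in true_mean] for _ in range(n)]

est = coordinatewise_mom(samples, num_groups=10)
err_full = math.sqrt(sum((e - m) ** 2 for e, m in zip(est, true_mean)))
err_support = math.sqrt(sum((est[i] - true_mean[i]) ** 2 for i in range(k)))
```

Every one of the zero coordinates contributes noise of the same order as a signal coordinate, so `err_full` is several times larger than `err_support` here.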
Two-layer model
To address the above issue, we model the mean as a two-layer model for , and obtain by minimizing the following nonconvex -loss
| (ncvx) |
To solve this optimization problem, we propose a subgradient method (SubGM) with small initialization , where and is a sufficiently small factor. At each iteration, SubGM updates the solution as
| (SubGM) |
Here, is the stepsize, and and indicate the (Clarke) subdifferentials of , defined as:
| (2) | ||||
| (3) |
The detailed implementation of our proposed algorithm is presented in Algorithm 1.
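As a rough illustration of the idea, and not a reproduction of Algorithm 1, the sketch below runs a subgradient method on an $\ell_1$-loss over subgroup means. It assumes the common diagonal parameterization in which the estimate is `u*u - v*v` coordinate-wise; this form, the loss normalization, the helper names, and all numerical values (`alpha`, `eta`, the iteration count, the threshold 0.5, the outlier pattern) are assumptions made for this example.

```python
import random

def sign(z):
    """Sign function with sign(0) = 0, a valid subgradient of |z| at zero."""
    return (z > 0) - (z < 0)

def subgm(group_means, alpha, eta, num_iters):
    """Subgradient method on (1/m) * sum_j ||u*u - v*v - xbar_j||_1 (assumed form)."""
    m, d = len(group_means), len(group_means[0])
    u, v = [alpha] * d, [alpha] * d  # small initialization
    for _ in range(num_iters):
        for i in range(d):
            theta = u[i] ** 2 - v[i] ** 2
            # Averaged sign of the residuals: a subgradient of the l1-loss in theta.
            s = sum(sign(theta - g[i]) for g in group_means) / m
            # Chain rule through theta = u^2 - v^2 gives multiplicative updates.
            u[i], v[i] = u[i] * (1 - 2 * eta * s), v[i] * (1 + 2 * eta * s)
    return [u[i] ** 2 - v[i] ** 2 for i in range(d)]

rng = random.Random(0)
d, k, n, m = 20, 3, 500, 50
true_mean = [1.0] * k + [0.0] * (d - k)
samples = [[mu + rng.gauss(0.0, 1.0) for mu in true_mean] for _ in range(n)]
for r in range(0, n, 50):  # 2% of the samples become gross outliers
    samples[r] = [10.0] * d

size = n // m
group_means = [
    [sum(samples[r][i] for r in range(j * size, (j + 1) * size)) / size for i in range(d)]
    for j in range(m)
]
estimate = subgm(group_means, alpha=1e-3, eta=0.01, num_iters=600)
support = [i for i, e in enumerate(estimate) if abs(e) > 0.5]
```

For a signal coordinate, almost all subgroup means lie on one side of zero, so the averaged sign is close to -1 and `u` grows geometrically. For a zero coordinate, the signs largely cancel and growth from the tiny initialization is much slower, which is why thresholding the output separates the two groups in this toy run.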
Our key contribution is to reveal the emergence of incremental learning: we show that SubGM with small initialization learns the nonzero components (signals) long before overfitting the zero components (residuals) to noise. Consequently, there exists a wide range of iterations within which the signals are in the order of while the residuals remain in the order of (see Figure 1(a)). Remarkably, we show that this interval only depends on the stepsize and the initialization scale , and it can be widened by reducing these user-defined parameters. In stark contrast, differentiating between the signals and residuals is challenging in the convex setting (cvx) precisely due to the lack of incremental learning, as shown in Figure 1(b). After successfully identifying the locations of the top- elements, we can employ existing robust mean estimation techniques [DK19, Che+20] on the dataset projected onto the recovered support to further improve the estimation of the top- nonzero elements.
4 Main Result
In this section, we present the theoretical guarantees for Algorithm 1. We begin by analyzing the first stage of the algorithm, which focuses on recovering the support of the true mean.
4.1 Stage 1: Identification of Support via Coarse-grained Estimation
We denote and . Our main theorem is presented next.
Theorem 4.1 (Convergence guarantee for SubGM).
Let be a distribution on with an unknown -sparse mean , unknown covariance matrix , and unknown coordinate-wise third moment satisfying . Suppose a sample set of size is collected according to the strong contamination model (Definition 1.1) with corruption parameter . Upon setting the stepsize and the initialization scale in Algorithm 1, with a probability of at least , the following statements hold for any iteration :
- $\ell_2$-error. The $\ell_2$-error is upper-bounded by
  (4)
- Identification of the top-$k$ elements. If we additionally have , then we obtain
  (5)
Comparison to the existing results. Simply applying coordinate-wise MoM estimator results in an -error rate , which is considerably worse than our result when . On the other hand, to guarantee a correct support recovery, the previous efficient estimators rely on prior knowledge of , while the coordinate-wise MoM requires an accurate value of to separate the signals from residuals (as evidenced by Figure 1(b)). In contrast, our proposed algorithm only requires a lower bound to differentiate the signals from residuals; in fact, this lower bound can be arbitrarily small (i.e., conservative) provided that the initialization scale is chosen as . We also highlight that, much like other existing estimators under the strong contamination model, our estimator requires prior knowledge of the corruption parameter (or its upper bound).
Proof sketch. We next provide the proof sketch of the above theorem, deferring its details to Section 7. Specifically, we analyze the coordinate-wise dynamic for some . Without loss of generality, we assume . Upon defining , the update rules for and can be written as
| (6) |
Based on the above update rules, controls the growth rate of the dynamics. Indeed, during the initial iterations, we have , which in turn implies that . Consequently, the dynamics of and can be well approximated using the following exponential functions
| (7) |
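For readers who want to see the algebra behind such an exponential approximation, the following worked derivation treats a single coordinate. It assumes the parameterization $\theta_t = u_t^2 - v_t^2$, subgroup means $\bar{x}_1, \dots, \bar{x}_m$, stepsize $\eta$, and initialization scale $\alpha$, and it writes $\beta_t$ for the negated averaged sign. These symbols and the normalization are illustrative assumptions and may differ from the paper's notation.

```latex
\begin{align*}
\beta_t &:= -\frac{1}{m}\sum_{j=1}^{m}\operatorname{sign}\bigl(u_t^2 - v_t^2 - \bar{x}_j\bigr), \\
u_{t+1} &= (1 + 2\eta\beta_t)\,u_t, \qquad v_{t+1} = (1 - 2\eta\beta_t)\,v_t, \\
u_t^2 - v_t^2 \approx 0 \;\Longrightarrow\; \beta_t &\approx \beta_0
  = \frac{1}{m}\sum_{j=1}^{m}\operatorname{sign}(\bar{x}_j), \\
u_t &\approx \alpha\,(1 + 2\eta\beta_0)^t, \qquad v_t \approx \alpha\,(1 - 2\eta\beta_0)^t .
\end{align*}
```

Under these assumptions, $\beta_0$ is close to $\pm 1$ when nearly all subgroup means share a sign and close to $0$ when the signs roughly cancel, which is what separates fast-growing from slow-growing coordinates.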
Therefore, to analyze the behaviors of and , it suffices to characterize the magnitude of for different coordinates. To achieve this, we define as the index set of the subgroups that do not contain any outliers, and denote its complement as . We have
| (denote ) | ||||
| (for sufficiently large ) | ||||
| (due to finite-sample central limit theorem) |
Here, is the size of each subgroup, and represents the cumulative distribution function (CDF) of the standard Gaussian distribution. Let us define and . Based on the above characterization of , for all , we have , which in turn implies . Furthermore, by setting with a suitably large constant , can be made sufficiently large given a sufficiently small . This ensures that for all . On the other hand, we have since . As a result, can be made arbitrarily small for all and for all . This discrepancy in the growth rates of and enables our algorithm to separate the signals from residuals within just a few iterations. In Section 7, we provide a more delicate analysis of the dynamics, showing that for all we have
| (8) | ||||||
The above equation sheds light on the key difference between ncvx and cvx: unlike cvx where the error is equally distributed across different coordinates, the error in ncvx is primarily distributed among the signals, while the error in the residuals can be kept arbitrarily small by a proper choice of the initialization scale . This implies that, if the signals are sufficiently larger than the induced error, i.e., , our algorithm can successfully identify the signals.
4.2 Stage 2: Achieving Optimal Rate on the Support via Fine-grained Estimation
As illustrated in Section 4.1, a direct application of SubGM leads to an estimation error of . In this section, we show that this error can be further improved once the support of the mean is identified correctly. Our key insight is that once the support of the mean is recovered, we can reduce the problem to a robust dense mean estimation defined only over the recovered support. Under such a regime, existing estimators designed for robust dense mean estimation [DK19, Che+20] can be employed to further reduce the estimation error.
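The reduction described above can be sketched as a thin wrapper: threshold a coarse estimate to obtain a support, project the samples onto it, and hand the low-dimensional data to a dense robust estimator. The names `two_stage_estimate` and `coordinatewise_median` are hypothetical, and the coordinate-wise median is only a stand-in for illustration; it is not the estimator of [DK19, Che+20].

```python
import statistics

def two_stage_estimate(samples, coarse, threshold, dense_estimator):
    """Threshold the coarse estimate, then re-estimate only on the recovered support."""
    d = len(coarse)
    support = [i for i in range(d) if abs(coarse[i]) > threshold]
    projected = [[row[i] for i in support] for row in samples]  # n x |support| data
    refined = dense_estimator(projected) if support else []
    estimate = [0.0] * d  # coordinates outside the support are set to zero
    for i, value in zip(support, refined):
        estimate[i] = value
    return support, estimate

def coordinatewise_median(rows):
    """Stand-in dense estimator, used here only to keep the example self-contained."""
    return [statistics.median(col) for col in zip(*rows)]

# Toy data: three clean rows and one outlier row; coordinate 1 has zero mean.
samples = [[1.0, 0.1, 2.0], [1.2, -0.1, 1.8], [0.8, 0.0, 2.2], [9.0, 9.0, 9.0]]
coarse = [0.9, 0.01, 2.1]  # hypothetical Stage 1 output
support, estimate = two_stage_estimate(samples, coarse, 0.5, coordinatewise_median)
```

The dense estimator only ever sees data of dimension equal to the support size, so its cost does not depend on the ambient dimension.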
Proposition 4.1 (Adapted from Proposition 1.6 in [DKP20]).
Let be a distribution on with an unknown mean and unknown covariance matrix . Suppose a sample set of size is collected according to the strong contamination model (Definition 1.1) with corruption parameter . Then, there exists an algorithm that runs in time and memory and, with a probability of at least , outputs an estimator that satisfies
Equipped with the above result, we next provide an end-to-end guarantee for our full algorithm.
Theorem 4.2 (Guarantee for the full algorithm).
Let be a distribution on satisfying the conditions in Theorem 4.1. Suppose a sample set of size is collected according to the strong contamination model (Definition 1.1) with corruption parameter . Then, with the choice of and , our full algorithm runs in time and memory and, with a probability of at least , outputs an estimate that satisfies
| (9) |
Upon setting the sample size , our proposed two-stage method runs in time and memory and returns a solution with an error in the order of . Our next theorem shows that this error is indeed information-theoretically optimal up to a constant factor and thus cannot be improved.
Theorem 4.3 (Information-theoretic lower bound).
There exists a distribution with -sparse mean , covariance matrix , and coordinate-wise third moment satisfying such that, given any arbitrarily large sample set collected according to the strong contamination model (Definition 1.1) with corruption parameter , no algorithm can estimate the mean with -error .
Comparison to the existing lower bounds. To achieve the optimal error rate, the sample complexity of our method scales linearly with the sparsity level $k$. A careful reader may realize that our sample complexity is unexpectedly smaller than the optimal sample complexity introduced in [LM19a] when is sufficiently small. This is due to the additional assumptions we impose on the coordinate-wise third moment of the distribution and the corruption parameter $\epsilon$. On the other hand, it was recently shown in [DK19, PBR20] that under a bounded third moment, the dependency of the estimation error on $\epsilon$ can be improved from to . Our worse dependency on $\epsilon$ is due to our more relaxed assumption on the third moment: unlike the assumptions made in [DK19, PBR20], our imposed upper bound on the third moment is inversely proportional to . Consequently, the imposed upper bound can get arbitrarily large with a smaller corruption parameter. In the extreme case where , this condition can be dropped altogether.
5 Proof Sketch
In this section, we provide a proof sketch for the dynamics of SubGM (Theorem 4.1). To streamline the presentation, we keep our arguments at a high level; a more detailed proof is deferred to the supplementary materials. We analyze the coordinate-wise dynamic for . Without loss of generality, we assume . Upon defining , the update rules for and can be written as
| (10) |
Based on the above update rules, controls the growth rate of the dynamics. Indeed, during the initial iterations, we have , which in turn implies that . Consequently, the dynamics of and can be well approximated using the following exponential functions
| (11) |
Therefore, to analyze the behaviors of and , it suffices to characterize the magnitude of for different coordinates. To achieve this, we define as the index set of the subgroups that do not contain any outliers and . Consequently, we have
| (denote ) | ||||
| (due to concentration bound) | ||||
| (due to finite-sample central limit theorem) |
Here represents the cumulative distribution function (CDF) of the standard Gaussian distribution. Let us define and . Based on the above characterization of , we have for all . Furthermore, by setting with a suitably large constant , we can ensure for all because can be made sufficiently large given a sufficiently small . On the other hand, we have since . As a result, can be made arbitrarily small for all and for all . This discrepancy in the growth rates of and enables our algorithm to separate the signals from residuals within just a few iterations. In the supplementary materials, we provide a more delicate analysis of the dynamics, showing that for all we have
| (12) | ||||||
The above equation sheds light on the key difference between ncvx and cvx: unlike cvx where the error is equally distributed across different coordinates, the error in ncvx is primarily distributed among the signals, while the error in the residuals can be kept arbitrarily small by a proper choice of the initialization scale . This implies that, if the signals are sufficiently larger than the induced error, i.e., , our algorithm can successfully identify the signals.
6 Simulation
In this section, we present numerical simulations to corroborate the theoretical results established in Section 4. Further implementation details, together with additional simulation studies, are deferred to the appendix. The complete codebase is publicly accessible at https://github.com/ying-hui-he/Robust_mean_estimation.
Simulation setup.
All the experiments are conducted on a MacBook Pro 2021 with the Apple M1 Pro chip and a GB unified memory. We pick three representative heavy-tailed probability distributions: Fisk, Pareto, and Student’s $t$. To make a fair comparison, we fix the data dimension at and use the constant-bias noise model introduced in [Che+21] to generate outliers. Unless otherwise stated, we set the corruption ratio at and the sparsity level at . As for the algorithm in Stage 2, we utilize the filter-based algorithm RME_sp introduced in [Dia+19a]. Furthermore, we compare our algorithms with the Oracle estimator, which uses the coordinate-wise MoM on the clean data with an optimal choice of subgroup numbers. In all of our simulations, we set the number of iterations of SubGM to , which is in line with our theoretical results.
Identification of top- elements.
In this experiment, we evaluate the success rate under varying corruption ratios , while keeping all other parameters fixed. Our theoretical result (Theorem 4.1) indicates that provable identification is possible only when , suggesting that the success rate should deteriorate as increases. We define the recovered index set obtained by SubGM as , and the true index set of the top- elements as . The success rate is then measured as . The results, presented in Figure 2(a), are averaged over independent trials for each setting. Notably, SubGM achieves exact recovery of the true index set even when up to of the samples are corrupted, highlighting the robustness and practical effectiveness of our method.
Comparison between Stage 1 and full algorithms.
We evaluate the -error of the Stage 1 and full algorithms across varying sparsity levels . Our theoretical results predict a gap in -error between the two algorithms— versus —when is sufficiently large. To minimize the influence of sample size, we set , ensuring a sufficiently large number of samples. As shown in Figure 2(b), the two algorithms perform comparably when is small. However, as increases, the -error of Stage 1 grows sublinearly, while the full algorithm maintains a stable error level. These empirical findings are fully consistent with our theoretical predictions.
Infinite variance regime.
In this experiment, we evaluate the performance of our algorithm in the infinite variance regime, fixing the sparsity level at and the sample size at . When the distribution parameters fall within the interval , the Fisk, Pareto, and Student’s $t$ distributions all exhibit infinite variance (see Appendix B for further details). As shown in Figure 2(c), both Stage 1 and the full algorithm maintain strong performance in this setting, suggesting that our theoretical guarantees may extend to the infinite variance regime. Notably, Stage 1 consistently outperforms the full algorithm across all three distributions, implying that SubGM may possess greater robustness than existing estimators under infinite variance.
7 Proofs
The proofs of our main results are organized as follows. Section 7.1 presents preliminary lemmas. Section 7.2 establishes the convergence guarantee of SubGM (Theorem 4.1), and Section 7.3 provides the end-to-end guarantee of the full algorithm (Theorem 4.2). Section 7.4 derives the information-theoretic lower bound (Theorem 4.3), and Appendix A proves a formal variant of Proposition 3.1 to establish the properties of the MoM estimator.
7.1 Preliminaries
This section presents all the technical lemmas that will be used to prove our main results.
Lemma 7.1 (Chebyshev’s inequality [Ver18, Corollary 1.2.5]).
Suppose that with . Then, for any , we have
| (13) |
Lemma 7.2 (Hoeffding’s inequality [Ver18, Theorem 2.2.6]).
Let be independent random variables such that almost surely. Then for all , we have
| (14) |
Lemma 7.3 (Dvoretzky-Kiefer-Wolfowitz Inequality [Mas90]).
Let be the CDF of a random variable , and let be the empirical CDF based on i.i.d. samples . We have
| (15) |
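As a numerical illustration of this lemma, the sketch below uses the textbook form of the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant, $\Pr(\sup_t |F_n(t) - F(t)| > t_0) \le 2\exp(-2 n t_0^2)$, which may be normalized differently from the statement above. The helper names, the uniform distribution, and the choices of `n` and `delta` are assumptions made for this example.

```python
import math
import random

def sup_cdf_deviation(samples, cdf):
    """Exact sup_t |F_n(t) - F(t)| for a continuous CDF F, checked at the jumps of F_n."""
    xs = sorted(samples)
    n = len(xs)
    dev = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # F_n jumps from i/n to (i+1)/n at x, so both sides of the jump matter.
        dev = max(dev, abs((i + 1) / n - f), abs(i / n - f))
    return dev

def dkw_radius(n, delta):
    """Radius t0 solving 2 * exp(-2 * n * t0**2) = delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

rng = random.Random(0)
n = 2000
samples = [rng.random() for _ in range(n)]  # Uniform(0, 1), so F(t) = t on [0, 1]
deviation = sup_cdf_deviation(samples, cdf=lambda t: min(max(t, 0.0), 1.0))
radius = dkw_radius(n, delta=1e-6)
```

The bound is distribution-free, so the same `dkw_radius` applies to any continuous distribution, including heavy-tailed ones.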
Lemma 7.4.
Suppose . Then, with probability at least and for all , we have
| (16) |
Proof.
Note that and . Therefore, we have
| (17) | ||||
Upon setting in the Dvoretzky-Kiefer-Wolfowitz Inequality (Lemma 7.3), with probability at least and for all , we have
| (18) |
∎
Lemma 7.5 (Berry-Esseen bound [Ver18, Theorem 2.1.3]).
Suppose , where has zero mean and bounded third moment, i.e., . Then, upon denoting where , we have
| (19) |
Here is the CDF of standard Gaussian distribution.
7.2 Proof of Theorem 4.1
To prove this theorem, it is essential to first establish the uniform concentration of for all .
Lemma 7.6.
Suppose , where has zero mean, variance , and coordinate-wise third moment . Moreover, suppose samples are generated according to the strong contamination model (Definition 1.1) with corruption parameter . Suppose , , and . Upon dividing the samples into equal subgroups and denoting the empirical mean of each subgroup by , with probability at least , the following statements hold
- For all and all , we have:
- For all and all , we have:
Proof.
We prove the two cases separately.
Case 1: . We only need to prove the lower bound since the upper bound is trivial. We partition the index set of subgroups into two disjoint subsets: , containing all subgroups free of outliers, and , containing those with at least one outlier. Note that . Therefore, we obtain
| (20) |
Next, applying Lemma 7.4 and a union bound, we obtain that, with probability at least and for all ,
| (21) | ||||
To proceed, one can write
| (22) | ||||
Here, follows from the Berry-Esseen bound (Lemma 7.5). In , we use the concentration inequality for standard Gaussian distribution. Combining the above inequalities and recalling our choices of , , and , we conclude that, with probability at least and for all ,
| (23) | ||||
This completes the proof of the first statement.
Case 2: . In this case, it suffices to provide an upper bound for . Following a similar derivation as in Case 1, with probability at least and for all , we have
| (24) | ||||
Here, in , we use the anti-concentration for the standard Gaussian distribution. This completes the proof of the second statement. ∎
We are now ready to present the proof of Theorem 4.1. To this end, we first present a more precise version of its statement.
Theorem 7.1 (Convergence guarantee for SubGM).
Let be a distribution on with an unknown -sparse mean , unknown covariance matrix , and unknown coordinate-wise third moment satisfying . Suppose a sample set of size is collected according to the strong contamination model (Definition 1.1) with corruption parameter . Upon setting the stepsize and the initialization scale in Algorithm 1, with a probability of at least , the following statements hold for any iteration :
- Near-optimal $\ell_2$-error. The $\ell_2$-error is upper-bounded by
  (25)
- Coordinate-wise error bound. We obtain
  (26)
Before proceeding to the proof, we note that the second statement of Theorem 7.1 together with the assumption readily implies for every such that , leading to the second statement of Theorem 4.1.
Proof of Theorem 7.1. Let us define and . We analyze the coordinate-wise dynamics separately for signals and residuals .
Signal dynamics.
Without loss of generality, we assume that . Let us first revisit the update rule for SubGM:
| (27) | ||||
We further divide our analysis into two cases depending on the magnitude of .
Case 1: . We define . Hence, for all , the first statement of Lemma 7.6 can be invoked to show
| (28) |
By incorporating this into Equation 27, we obtain
(29)
(30)
Notice that at the initialization. We find that , which remains adequately small throughout the trajectory. Next, we examine the dynamics of . Taking into account that and , we have that within iterations, the following holds
(31)
This implies
(32)
Next, we show that . To this end, when , we upper-bound the difference between two consecutive iterations as follows
(33)
Here in , we use the fact that . In , we use the estimate . In , we use the fact that . Lastly, in , we use the condition that and . Hence, we have
(34)
Combining with the fact that , we derive that .
We will now demonstrate that for any , the condition always holds. Using the fact and Theorem A.1 (in the appendix), we have . Then, the triangle inequality implies
(35)
Therefore, it suffices to show that for every . To this end, we use induction on . For , we have
(36)
Now, let us assume that at time , . Without loss of generality, we assume . Based on the definition of the MoM estimator, we have
(37)
Let . With this notation, we can derive the following inequality
(38)
where in the last inequality, we use the fact that . On the other hand, following exactly the same argument in Equation 33, we have
(39)
Combining the above two inequalities, we establish that . This completes the induction.
Case 2: . Since , at iteration we already have . Consequently, the analysis reduces to the last phase of Case 1, from which we can conclude for all .
Residual dynamics.
In this case, we employ induction on to demonstrate that for all . For the base case, this relationship is valid as . Assuming that this relation holds at time , we can refer to Lemma 7.6 and deduce
(40)
Hence, we have
(41)
Therefore, for all , we obtain
(42)
Putting everything together.
Finally, since we set , for any , we have
(43)
This completes the proof. ∎
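The incremental-learning behavior analyzed in this proof can be illustrated numerically. The following is a minimal sketch, not a reproduction of Algorithm 1: it assumes a two-layer parameterization of the form theta = u*u - v*v with both factors initialized at a small scale, and it uses the plain empirical ℓ1-loss subgradient rather than the MoM-based quantities that appear in the proof. The function name `subgm_two_layer`, the stepsize, the initialization scale, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

def subgm_two_layer(X, step=0.05, init_scale=1e-3, n_iters=200):
    """Subgradient method on the empirical l1-loss of theta = u*u - v*v.

    Assumption: this parameterization and loss are illustrative and may
    differ from the paper's Algorithm 1 (no MoM step, no corruption handling).
    """
    n, d = X.shape
    u = np.full(d, init_scale)
    v = np.full(d, init_scale)
    for _ in range(n_iters):
        theta = u * u - v * v
        # Subgradient of (1/n) * sum_i ||theta - x_i||_1 with respect to theta.
        g = np.sign(theta[None, :] - X).mean(axis=0)
        # Chain rule through theta = u*u - v*v (simultaneous update of u and v).
        u, v = u - step * 2.0 * u * g, v + step * 2.0 * v * g
    return u * u - v * v

# Usage: a sparse mean observed through heavy-tailed (Student's t) noise.
rng = np.random.default_rng(0)
mu = np.zeros(50)
mu[:3] = 3.0
mu[3] = -3.0
X = mu + rng.standard_t(3, size=(2000, 50))
theta_hat = subgm_two_layer(X)
```

In this sketch the multiplicative form of the update makes coordinates with a consistently signed subgradient grow geometrically from the small initialization, while the remaining coordinates stay close to their initial scale over the same horizon, which mirrors the separation between signal and residual dynamics used in the proof.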
7.3 Proof of Theorem 4.2
The proof follows by combining Theorem 4.1 and Proposition 4.1. First, for the data distribution and corruption model considered in Theorem 4.1, once the sample size satisfies , we can correctly identify the locations of the top- nonzero elements with probability at least . For brevity, we denote the indices of these top- elements by . Having identified these indices, we can restrict attention to a -dimensional subproblem on the dataset with the mean . We then apply Proposition 4.1 to this reduced dataset. Specifically, once the sample size satisfies , there exists an estimator that, with probability at least , outputs a satisfying .
Combining these two steps via a union bound, we conclude that, with probability at least , our two-stage estimator satisfies . This concludes the proof.
7.4 Proof of Theorem 4.3
Consider two probability distributions and , where for some distribution . Suppose we draw i.i.d. samples from . Under the strong contamination model (Definition 1.1) with parameter , this same set of samples can be equivalently viewed as -corrupted samples from . Consequently, no algorithm can distinguish between the two cases (see [Li19] for details).
Therefore, it suffices to construct two probability distributions satisfying the conditions in Theorem 4.3. Without loss of generality, we focus on the one-dimensional case, since additional coordinates can be set identically. We require two distributions such that:
• Both distributions have variance at most and third central moment at most ;
• can be written as for some distribution ;
• Their means satisfy .
Following [Li19], we construct as the point mass at , and let , where is the point mass at . It is straightforward to verify that and satisfy all three conditions, completing the proof.
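The three conditions can be checked numerically for a two-point mixture. The sketch below assumes that the first distribution is the point mass at 0 and that the second distribution mixes it with a point mass at a hypothetical location `a` with weight `eps`; the specific location used in the proof is not reproduced here, and the helper name `mixture_moments` is illustrative only.

```python
def mixture_moments(eps, a):
    """Mean, variance, and third central moment of (1 - eps) * delta_0 + eps * delta_a.

    Assumption: delta_0 and delta_a stand in for the two point masses in the
    construction; the location `a` is a free parameter of this sketch.
    """
    points = (0.0, a)
    probs = (1.0 - eps, eps)
    mean = sum(p * x for p, x in zip(probs, points))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, points))
    third = sum(p * (x - mean) ** 3 for p, x in zip(probs, points))
    return mean, var, third

# Usage: the mean gap to the point mass at 0 equals the returned mean.
mean_gap, variance, third_moment = mixture_moments(0.1, 2.0)
```

For this mixture the three quantities have the closed forms eps*a, eps*(1-eps)*a^2, and eps*(1-eps)*(1-2*eps)*a^3, so enlarging `a` widens the mean gap while the moment constraints cap how large `a` may be.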
8 Conclusion
Many estimation tasks in statistics become notoriously difficult in the robust setting when certain assumptions on the data are lifted. For instance, almost all statistically optimal robust mean estimators suffer from overwhelmingly high computational costs. While classical results in robust statistics have shed light on the statistical limits of robust estimation, its computational aspects have mostly remained elusive. In this work, we aim to bridge this gap by presenting the first computationally efficient and statistically optimal method for robust sparse mean estimation, thereby overcoming a conjectured computational-statistical barrier under moderate conditions.
Acknowledgements
We thank Jikai Hou for insightful discussions. SF is supported, in part, by NSF Award DMS-2152776, ONR Award N00014-22-1-2127, and MICDE Catalyst Grant.
References
- [AMS96] Noga Alon, Yossi Matias and Mario Szegedy “The space complexity of approximating the frequency moments” In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 1996, pp. 20–29
- [Aro+19] Sanjeev Arora, Nadav Cohen, Wei Hu and Yuping Luo “Implicit regularization in deep matrix factorization” In Advances in Neural Information Processing Systems 32, 2019
- [Bal+17] Sivaraman Balakrishnan, Simon S Du, Jerry Li and Aarti Singh “Computationally efficient robust sparse estimation in high dimensions” In Conference on Learning Theory, 2017, pp. 169–212 PMLR
- [BB19] Matthew Brennan and Guy Bresler “Average-case lower bounds for learning sparse mixtures, robust estimation and semirandom adversaries” In arXiv preprint arXiv:1908.06130, 2019
- [BB20] Matthew Brennan and Guy Bresler “Reducibility and statistical-computational gaps from secret leakage” In Conference on Learning Theory, 2020, pp. 648–847 PMLR
- [CCM13] Yudong Chen, Constantine Caramanis and Shie Mannor “Robust sparse regression under adversarial corruption” In International conference on machine learning, 2013, pp. 774–782 PMLR
- [CDG19] Yu Cheng, Ilias Diakonikolas and Rong Ge “High-dimensional robust mean estimation in nearly-linear time” In Proceedings of the thirtieth annual ACM-SIAM symposium on discrete algorithms, 2019, pp. 2755–2771 SIAM
- [Che+21] Yu Cheng, Ilias Diakonikolas, Rong Ge, Shivam Gupta, Daniel M Kane and Mahdi Soltanolkotabi “Outlier-robust sparse estimation via non-convex optimization” In arXiv preprint arXiv:2109.11515, 2021
- [Che+20] Yu Cheng, Ilias Diakonikolas, Rong Ge and Mahdi Soltanolkotabi “High-dimensional robust mean estimation via gradient descent” In International Conference on Machine Learning, 2020, pp. 1768–1778 PMLR
- [Dep20] Jules Depersin “Robust subgaussian estimation with vc-dimension” In arXiv preprint arXiv:2004.11734, 2020
- [Dia+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra and Alistair Stewart “Robust estimators in high-dimensions without the computational intractability” In SIAM Journal on Computing 48.2 SIAM, 2019, pp. 742–864
- [Dia+17] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra and Alistair Stewart “Being robust (in high dimensions) can be practical” In International Conference on Machine Learning, 2017, pp. 999–1008 PMLR
- [Dia+19a] Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Eric Price and Alistair Stewart “Outlier-robust high-dimensional sparse estimation via iterative filtering” In Advances in Neural Information Processing Systems 32, 2019
- [DK19] Ilias Diakonikolas and Daniel M Kane “Recent advances in algorithmic high-dimensional robust statistics” In arXiv preprint arXiv:1911.05911, 2019
- [DK23] Ilias Diakonikolas and Daniel M Kane “Algorithmic high-dimensional robust statistics” Cambridge university press, 2023
- [Dia+22] Ilias Diakonikolas, Daniel M Kane, Sushrut Karmalkar, Ankit Pensia and Thanasis Pittas “Robust sparse mean estimation via sum of squares” In Conference on Learning Theory, 2022, pp. 4703–4763 PMLR
- [Dia+22a] Ilias Diakonikolas, Daniel M Kane, Jasper CH Lee and Ankit Pensia “Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions” In arXiv preprint arXiv:2211.16333, 2022
- [DKP20] Ilias Diakonikolas, Daniel M Kane and Ankit Pensia “Outlier robust mean estimation with subgaussian rates via stability” In Advances in Neural Information Processing Systems 33, 2020, pp. 1830–1840
- [DKS17] Ilias Diakonikolas, Daniel M Kane and Alistair Stewart “Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures” In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 73–84 IEEE
- [DG92] David L Donoho and Miriam Gasko “Breakdown properties of location estimates based on halfspace depth and projected outlyingness” In The Annals of Statistics JSTOR, 1992, pp. 1803–1827
- [DL88] David L Donoho and Richard C Liu “The “automatic” robustness of minimum distance functionals” In The Annals of Statistics 16.2 Institute of Mathematical Statistics, 1988, pp. 552–586
- [Fre+22] Spencer Frei, Gal Vardi, Peter L Bartlett, Nathan Srebro and Wei Hu “Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data” In arXiv preprint arXiv:2210.07082, 2022
- [GBL19] Gauthier Gidel, Francis Bach and Simon Lacoste-Julien “Implicit regularization of discrete gradient dynamics in linear neural networks” In Advances in Neural Information Processing Systems 32, 2019
- [GSD19] Daniel Gissin, Shai Shalev-Shwartz and Amit Daniely “The implicit bias of depth: How incremental learning drives generalization” In arXiv preprint arXiv:1909.12051, 2019
- [Ham71] FR Hampel “A general definition of qualitative robustness” In Ann. Math. Stat 42, 1971, pp. 1887–1896
- [Ham74] Frank R Hampel “The influence curve and its role in robust estimation” In Journal of the american statistical association 69.346 Taylor & Francis, 1974, pp. 383–393
- [Hu+20] Wei Hu, Lechao Xiao, Ben Adlam and Jeffrey Pennington “The surprising simplicity of the early-time learning dynamics of neural networks” In Advances in Neural Information Processing Systems 33, 2020, pp. 17116–17128
- [Hub64] Peter J Huber “Robust Estimation of a Location Parameter” In The Annals of Mathematical Statistics JSTOR, 1964, pp. 73–101
- [Hub67] Peter J Huber “The behavior of maximum likelihood estimates under nonstandard conditions” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability: Weather Modification; University of California Press: Berkeley, CA, USA, 1967, pp. 221
- [Hub11] Peter J Huber “Robust statistics” In International encyclopedia of statistical science Springer, 2011, pp. 1248–1251
- [JVV86] Mark R Jerrum, Leslie G Valiant and Vijay V Vazirani “Random generation of combinatorial structures from a uniform distribution” In Theoretical computer science 43 Elsevier, 1986, pp. 169–188
- [Jin+23] Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S Du and Jason D Lee “Understanding incremental learning of gradient descent: A fine-grained analysis of matrix sensing” In arXiv preprint arXiv:2301.11500, 2023
- [LRV16] Kevin A Lai, Anup B Rao and Santosh Vempala “Agnostic estimation of mean and covariance” In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 2016, pp. 665–674 IEEE
- [LL20] Guillaume Lecué and Matthieu Lerasle “Robust machine learning by median-of-means: theory and practice” In Annals of Statistics, 2020
- [Li19] Jerry Z. Li “Lecture 2: Total Variation, Statistical Models, and Lower Bounds” https://jerryzli.github.io/robust-ml-fall19/lec2.pdf, Lecture notes, Robust Machine Learning (Fall 2019), 2019
- [Li+21] Jiangyuan Li, Thanh Nguyen, Chinmay Hegde and Ka Wai Wong “Implicit sparse regularization: The impact of depth and early stopping” In Advances in Neural Information Processing Systems 34, 2021, pp. 28298–28309
- [LLL20] Zhiyuan Li, Yuping Luo and Kaifeng Lyu “Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning” In arXiv preprint arXiv:2012.09839, 2020
- [LM19] Gabor Lugosi and Shahar Mendelson “Robust multivariate mean estimation: the optimality of trimmed mean” In arXiv preprint arXiv:1907.11391, 2019
- [LM19a] Gábor Lugosi and Shahar Mendelson “Near-optimal mean estimators with respect to general norms” In Probability theory and related fields 175.3-4 Springer, 2019, pp. 957–973
- [LM19b] Gábor Lugosi and Shahar Mendelson “Sub-Gaussian estimators of the mean of a random vector” In The Annals of Statistics 47.2 JSTOR, 2019, pp. 783–794
- [MF22] Jianhao Ma and Salar Fattahi “Blessing of Depth in Linear Regression: Deeper Models Have Flatter Landscape Around the True Solution” In Advances in Neural Information Processing Systems, 2022
- [MGF22] Jianhao Ma, Lingjun Guo and Salar Fattahi “Behind the Scenes of Gradient Descent: A Trajectory Analysis via Basis Function Decomposition” In arXiv preprint arXiv:2210.00346, 2022
- [MLF25] Jianhao Ma, Geyu Liang and Salar Fattahi “Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions” In arXiv preprint arXiv:2505.17304, 2025
- [Mas90] Pascal Massart “The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality” In The annals of Probability JSTOR, 1990, pp. 1269–1283
- [Min15] Stanislav Minsker “Geometric median and robust estimation in Banach spaces”, 2015
- [NY83] Arkadij Semenovič Nemirovskij and David Borisovich Yudin “Problem complexity and method efficiency in optimization” Wiley-Interscience, 1983
- [PBR20] Adarsh Prasad, Sivaraman Balakrishnan and Pradeep Ravikumar “A robust univariate mean estimator is all you need” In International Conference on Artificial Intelligence and Statistics, 2020, pp. 4034–4044 PMLR
- [Pra+20] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan and Pradeep Ravikumar “Robust estimation via robust gradient estimation” In Journal of the Royal Statistical Society Series B: Statistical Methodology 82.3 Oxford University Press, 2020, pp. 601–627
- [RMC21] Noam Razin, Asaf Maman and Nadav Cohen “Implicit regularization in tensor factorization” In International Conference on Machine Learning, 2021, pp. 8913–8924 PMLR
- [RMC22] Noam Razin, Asaf Maman and Nadav Cohen “Implicit regularization in hierarchical tensor factorization and deep convolutional neural networks” In International Conference on Machine Learning, 2022, pp. 18422–18462 PMLR
- [Rou+11] Peter J Rousseeuw, Frank R Hampel, Elvezio M Ronchetti and Werner A Stahel “Robust statistics: the approach based on influence functions” John Wiley & Sons, 2011
- [SCV17] Jacob Steinhardt, Moses Charikar and Gregory Valiant “Resilience: A criterion for learning in the presence of arbitrary outliers” In arXiv preprint arXiv:1703.04940, 2017
- [Tuk62] John W Tukey “The future of data analysis” In The annals of mathematical statistics 33.1 JSTOR, 1962, pp. 1–67
- [Tuk60] John Wilder Tukey “A survey of sampling from contaminated distributions” In Contributions to probability and statistics Stanford University Press, 1960, pp. 448–485
- [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge university press, 2018
- [Yat85] Yannis G Yatracos “Rates of convergence of minimum distance estimators and Kolmogorov’s entropy” In The Annals of Statistics 13.2 Institute of Mathematical Statistics, 1985, pp. 768–774
Appendix A MoM Estimator under Strong Contamination Model
In this section, we prove the key properties of the -dimensional and high-dimensional MoM estimators under the strong contamination model (Definition 1.1). The following is a more precise statement of Proposition 3.1, which is adapted from Fact 2.1 in [Dia+22a]. Since the complete proof does not appear in the original source, we include it here.
Proposition A.1 (One-dimensional MoM estimator).
Consider a corruption parameter , failure probability , and a set of many -corrupted samples from a distribution with mean and variance . Then, with probability at least , the MoM estimator satisfies .
Proof. We partition the index set of the subgroups into two parts: and . Here comprises all the subgroups without outliers, and consists of subgroups containing at least one outlier. According to our strong contamination model, we have . Subsequently, we observe that
(44)
Here, , where is the size of each subgroup and is the subgroup . For simplicity, let us denote and . Then, the above inclusion implies
(45)
Since is bounded, we can apply Hoeffding’s inequality (Lemma 7.2) to obtain
(46)
Moreover, we can use Chebyshev’s inequality (Lemma 7.1) to establish an upper bound for :
(47)
Upon defining and , we have the following estimates
(48)
Combining these bounds with Equation 46, we obtain
(49)
This completes the proof. ∎
Applying the MoM estimator to each coordinate of a -dimensional dataset leads to the following theorem.
Theorem A.1 (High-dimensional coordinate-wise MoM estimator).
Consider a corruption parameter , failure probability , and a set of many -corrupted samples from a distribution with mean and coordinate-wise variance . Then, with probability at least , the coordinate-wise MoM estimator satisfies and .
Proof. The proof follows directly from Proposition A.1 and a simple union bound. ∎
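To make the estimator in Proposition A.1 and Theorem A.1 concrete, the following is a minimal sketch of a median-of-means computation that handles both the one-dimensional and the coordinate-wise case. It assumes the samples are split into consecutive, nearly equal subgroups; the function name `median_of_means` and the choice of the number of subgroups in the usage example are hypothetical and do not reflect the theoretical choice in Algorithm 1.

```python
import numpy as np

def median_of_means(x, n_groups):
    """Median-of-means estimate along the first axis.

    x: array of shape (n,) or (n, d); for 2-D input the estimator is applied
    coordinate-wise. Assumption: subgroups are consecutive blocks of samples.
    """
    x = np.asarray(x, dtype=float)
    # Split the samples into n_groups blocks along the sample axis.
    groups = np.array_split(x, n_groups)
    # Mean of each subgroup, then the median across subgroups.
    means = np.array([g.mean(axis=0) for g in groups])
    return np.median(means, axis=0)

# Usage: ten gross outliers corrupt the sample mean but not the MoM estimate.
rng = np.random.default_rng(0)
clean = 1.0 + rng.standard_normal((1000, 5))
corrupted = clean.copy()
corrupted[:10] = 1e6
mom_estimate = median_of_means(corrupted, n_groups=50)
naive_mean = corrupted.mean(axis=0)
```

The design mirrors the proof: as long as fewer than half of the subgroups contain an outlier, the median is taken over a majority of clean subgroup means, which is why the number of subgroups must exceed twice the number of corrupted samples.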
Appendix B Additional Simulations
B.1 Experimental Details
We run our simulations on three heavy-tailed distributions: Fisk, Pareto, and Student’s distributions. In each case, we apply a symmetrization trick to make the density function symmetric around zero. The density function of the Fisk distribution with parameter is expressed as follows:
(50)
The density function of the Pareto distribution with parameters is
(51)
Lastly, the density function of Student’s -distribution is
(52)
Here is the gamma function. In all three distributions described above, the parameters govern the existence of the -th moment. For instance, when fall within the range , the variances are infinite. We generate the outliers via the constant-bias noise model introduced in [Che+21].
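As a rough illustration of how such symmetric heavy-tailed samples could be drawn, consider the sketch below. It assumes that symmetrization is done by attaching an independent random sign to a positive Fisk or Pareto draw, with unit scale in both cases; the paper's exact densities are not reproduced, and the function name `sample_symmetric` and its arguments are hypothetical. The outlier model of [Che+21] is not implemented here.

```python
import numpy as np

def sample_symmetric(dist, param, size, rng):
    """Draw symmetric heavy-tailed samples; `param` is the tail parameter.

    Assumption: Fisk and Pareto draws are symmetrized with random signs.
    """
    if dist == "student_t":
        # Student's t is already symmetric around zero.
        return rng.standard_t(param, size=size)
    if dist == "fisk":
        # Inverse-CDF sampling for the Fisk (log-logistic) law with unit scale.
        u = rng.random(size)
        magnitude = (u / (1.0 - u)) ** (1.0 / param)
    elif dist == "pareto":
        # Generator.pareto draws from the Lomax law; adding 1 gives the
        # classical Pareto law with minimum value 1.
        magnitude = 1.0 + rng.pareto(param, size=size)
    else:
        raise ValueError(f"unknown distribution: {dist}")
    signs = rng.choice([-1.0, 1.0], size=size)
    return signs * magnitude

# Usage: 2000 samples in dimension 4 for each distribution.
rng = np.random.default_rng(0)
samples = {name: sample_symmetric(name, 3.0, (2000, 4), rng)
           for name in ("fisk", "pareto", "student_t")}
```

With this convention, smaller values of `param` produce heavier tails in all three cases.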
Unless stated otherwise, all simulations use the following settings: the data dimension is , the sparsity level is with nonzero elements being , the sample size is 600, and the corruption ratio is . For our algorithm, we set the number of subgroups to . Note that, compared to the theoretical choice of in Algorithm 1, we choose a smaller so that our algorithm tolerates a larger corruption ratio in practice. Moreover, in SubGM, we set the initialization scale and the stepsize .
We select sparse_GD [Che+21] and sparse_filter [Dia+19a] as our benchmark algorithms. These algorithms do not come with theoretical guarantees in the heavy-tailed setting. Nonetheless, we have found empirically that they outperform other methods, even in the heavy-tailed setting. We also note that the polynomial-time algorithms with theoretical guarantees for the heavy-tailed setting [Dia+22a, Dia+22] are impractical, since they rely on time-consuming methods such as sum-of-squares and ellipsoid methods.
We employ both sparse_GD and sparse_filter in the second stage of our algorithm, setting the sparsity parameter to , where is the index set identified in the first stage. In total, we evaluate six estimators: oracle (which removes all outliers), sparse_GD, sparse_filter, stage_1, full_GD (our algorithm with sparse_GD in the second stage), and full_filter (our algorithm with sparse_filter in the second stage). In stage_1, we run SubGM for iterations, whereas in full_GD and full_filter, we reduce the iteration count to to lower computational cost.
B.2 Sensitivity to Prior Knowledge of
Existing algorithms require prior knowledge of the exact sparsity level . In contrast, our approach identifies the sparsity level automatically. For this simulation, we set the true sparsity level to with nonzero components and assess the performance of the benchmark algorithms, sparse_GD and sparse_filter, while varying the input , which is an upper bound on , within the range [10, 40]. As illustrated in Figure 3, the performance of these benchmark algorithms is highly sensitive to the choice of across all examined distributions, and it becomes even less stable as the underlying distributions exhibit heavier tails. In contrast, our algorithm automatically recognizes the sparsity pattern across all scenarios. For all subsequent simulations, we provide the benchmark algorithms with the true sparsity level to ensure a fair comparison.


B.3 Performance with Different
In this simulation, we evaluate the performance of various algorithms under different sparsity levels . We set all nonzero entries of to . As shown in the first row of Figure 4, all algorithms—except stage_1 (as predicted by Theorem 4.1) and sparse_filter (which underperforms at larger sparsity levels )—achieve -error that remains largely independent of sparsity. In more heavy-tailed settings, depicted in the second row of Figure 4, all algorithms display an increase in -error as grows. Nevertheless, across nearly all scenarios, our full algorithms (full_GD and full_filter) outperform the benchmarks. We further hypothesize that the weaker performance of full_filter for the Pareto distribution with arises from the suboptimal performance of sparse_GD and sparse_filter when used in Stage 2.


B.4 Infinite Variance Regime
In this simulation, we evaluate the performance of the algorithms with respect to the heaviness of the distribution tails. As shown in Figure 5, we vary the parameters over the range . Smaller parameter values correspond to heavier tails, with values in the interval resulting in distributions of infinite variance. Our algorithms (stage_1, full_GD, and full_filter) demonstrate superior robustness under these heavy-tailed conditions, highlighting the advantage of our approach.
B.5 Performance with Different
In this simulation, we study the relationship between the -error and the corruption ratio across all six estimators. As shown in Figure 6, apart from the Oracle—whose error remains unaffected by (as expected)—our proposed algorithms (either single-stage or full version) consistently outperform the alternatives. While our theoretical analysis predicts an -error of order , the empirical results reveal an approximately linear dependence on . We attribute this discrepancy to the non-adversarial nature of the outlier model used in our experiments.
B.6 Running Time
Next, we examine the running time of our algorithms. Specifically, we run iterations for stage_1, while in the full algorithms we restrict Stage 1 to iterations. As shown in Figure 7, all algorithms exhibit linear runtime, consistent with our theoretical guarantees.