A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers: A Guided Walkthrough


Introduction

In high-dimensional statistical inference, it is common for the number of parameters $p$ to be comparable to or even larger than the sample size $n$. For an estimator $\hat{\theta}$ to be consistent in such a regime, meaning that it converges to the true parameter $\theta^*$, it is necessary to impose additional low-dimensional structure on the model. Well-studied examples of such constraints include linear regression with sparsity constraints, estimation of structured covariance or inverse covariance matrices, graphical model selection, sparse principal component analysis (PCA), low-rank matrix estimation, matrix decomposition problems, and estimation of sparse additive nonparametric models (Negahban et al., 2009).

In recent years, there has been a flurry of work on each of these specific cases. The authors of the paper under discussion ask whether there is a way of unifying these analyses so that all such estimators can be understood within a common framework, and they answer the question in the affirmative. They show that the squared error between any regularized M-estimator and the true parameter can be bounded using two ingredients: (1) decomposability of the regularization function, and (2) restricted strong convexity of the loss function. We will call this the “main theorem” in the remainder of the blog post; it is Theorem 1 in (Negahban et al., 2009).

In the remainder of this post, we will develop the tools necessary to understand and prove this result in depth. The notation will be kept consistent with the original paper for expositional clarity.

Background

In this section, we develop some of the necessary background and notation to build up to the proof.

Regularized M-estimators

M-estimators (M for “maximum likelihood-type”) are estimators that minimize a sum of loss functions $\rho$:

$$\hat{\theta} \in \arg\min_{\theta} \sum_{i=1}^{n} \rho(x_i, \theta). \tag{1}$$

If we add a regularization term $R$ that penalizes the complexity of the model, scaled by a weight $\lambda$, the method is known as a regularized M-estimator:

$$\hat{\theta} \in \arg\min_{\theta} \sum_{i=1}^{n} \rho(x_i, \theta) + \lambda R(\theta). \tag{2}$$

Example (Lasso Program)
The Lasso program is an example of a regularized M-estimator, where an $\ell_1$ regularization penalty is applied:

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^d} \left\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \right\}.$$
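To make this concrete, here is a minimal numerical sketch that solves the Lasso program above with proximal gradient descent (ISTA). This algorithm is not part of the paper's analysis, and the data, step size, and helper names (`soft_threshold`, `lasso_ista`) are illustrative choices:

```python
import numpy as np

def soft_threshold(z, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize (1/(2n)) * ||y - X theta||_2^2 + lam * ||theta||_1 via ISTA."""
    n, d = X.shape
    theta = np.zeros(d)
    # Step size 1/L, where L is the Lipschitz constant of the smooth part's gradient.
    L = np.linalg.norm(X, 2) ** 2 / n
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n          # gradient of the squared-error term
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Illustrative data: a sparse ground truth observed through noisy linear measurements.
rng = np.random.default_rng(0)
n, d, s = 200, 50, 5
X = rng.standard_normal((n, d))
theta_star = np.zeros(d); theta_star[:s] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)

theta_hat = lasso_ista(X, y, lam=0.1)
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```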

Dual Norms

Definition (Dual Norms)
Let $R$ be a norm on $\mathbb{R}^p$ equipped with an inner product $\langle \cdot, \cdot \rangle$. Then the dual norm of $R$ is defined as
$$R^*(v) := \sup_{u \in \mathbb{R}^p \setminus \{0\}} \frac{\langle u, v \rangle}{R(u)} = \sup_{R(u) \le 1} \langle u, v \rangle.$$
Example ($\ell_1$ and $\ell_\infty$ norms are dual norms)
We will show that the dual of the $\ell_1$ norm is the $\ell_\infty$ norm. To see that $R^*(v) \le \|v\|_\infty$, observe that
$$R^*(v) = \sup_{\|u\|_1 \le 1} \langle u, v \rangle \le \sup_{\|u\|_1 \le 1} \sum_{k=1}^p |u_k| |v_k| \le \sup_{\|u\|_1 \le 1} \left( \sum_{k=1}^p |u_k| \right) \|v\|_\infty \le \|v\|_\infty \quad (\text{since } \|u\|_1 \le 1).$$
For the opposite direction, let $j = \arg\max_k |v_k|$ and take $u = \mathrm{sign}(v_j)\, e_j$, so that $\|u\|_1 = 1$ and
$$\sup_{\|u\|_1 \le 1} \langle u, v \rangle \ge \langle \mathrm{sign}(v_j)\, e_j, v \rangle = |v_j| = \|v\|_\infty,$$
hence we have equality.
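A quick numerical sanity check of this duality (a sketch; the random vector and sampling scheme are just illustrative) compares $\|v\|_\infty$ against $\langle u, v \rangle$ for the claimed maximizer and for random feasible $u$:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 10
v = rng.standard_normal(p)

# Claimed maximizer of <u, v> over the l1 ball: a signed standard basis vector.
j = np.argmax(np.abs(v))
u_star = np.zeros(p); u_star[j] = np.sign(v[j])
print("sup attained by signed e_j :", u_star @ v)
print("||v||_inf                  :", np.max(np.abs(v)))

# Random points on the l1 sphere never exceed ||v||_inf.
best = -np.inf
for _ in range(10000):
    u = rng.standard_normal(p)
    u /= np.sum(np.abs(u))          # normalize so that ||u||_1 = 1
    best = max(best, u @ v)
print("best random <u, v>         :", best)   # <= ||v||_inf
```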

Subspace Compatibility Constant

The subspace compatibility constant measures how large the regularizer $R$ can be relative to the error norm $\|\cdot\|$ when restricted to a subspace $M$. This quantity will show up later in the proof of the main theorem.

The subspace compatibility constant is defined as follows:

Definition (Subspace Compatibility Constant)
For any subspace $M$ of $\mathbb{R}^p$, the subspace compatibility constant with respect to the pair $(R, \|\cdot\|)$ is given by
$$\Psi(M) := \sup_{u \in M \setminus \{0\}} \frac{R(u)}{\|u\|}.$$

It can be thought of as the Lipschitz constant of the regularizer with respect to the error norm when restricted to $M$, obtained by considering the direction in $M$ along which the regularizer is largest relative to the error norm.
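For instance, with the $\ell_1$ regularizer, the $\ell_2$ error norm, and the subspace of vectors supported on a set $S$ with $|S| = s$, one can check that $\Psi(M) = \sqrt{s}$. A small sketch (the support and sampling scheme are arbitrary illustrative choices) estimates this supremum by sampling:

```python
import numpy as np

rng = np.random.default_rng(2)
p, S = 20, [0, 1, 2, 3]                            # support of size s = 4, so Psi(M) = sqrt(4) = 2

U = np.zeros((100000, p))
U[:, S] = rng.standard_normal((100000, len(S)))    # random vectors in the model subspace M(S)

# Ratio R(u) / ||u|| with R = l1 norm and ||.|| = l2 norm.
ratios = np.sum(np.abs(U), axis=1) / np.linalg.norm(U, axis=1)
print("sampled sup of R(u)/||u|| :", ratios.max())   # approaches sqrt(s) = 2 from below
print("sqrt(s)                   :", np.sqrt(len(S)))
```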

Projections

Define the projection operator
$$\Pi_M(u) := \arg\min_{v \in M} \|u - v\| \tag{3}$$
to be the projection of $u$ onto the subspace $M$. For notational brevity, we will use the shorthand $u_M = \Pi_M(u)$.

One property of the projection operator is that it is non-expansive, meaning that
$$\|\Pi_M(u) - \Pi_M(v)\| \le \|u - v\| \tag{4}$$
for the error norm $\|\cdot\|$ induced by the inner product. In other words, it has Lipschitz constant 1.
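For coordinate subspaces, such as the sparse-support subspaces used later in this post, the projection simply zeroes out the coordinates outside the support. A small sketch (with an arbitrary support and random points) illustrates Equation (4):

```python
import numpy as np

def project_onto_support(u, S):
    """Euclidean projection onto {v : v_j = 0 for j not in S}: zero out coordinates outside S."""
    v = np.zeros_like(u)
    v[S] = u[S]
    return v

rng = np.random.default_rng(3)
p, S = 10, [2, 5, 7]
u, v = rng.standard_normal(p), rng.standard_normal(p)

lhs = np.linalg.norm(project_onto_support(u, S) - project_onto_support(v, S))
rhs = np.linalg.norm(u - v)
print(lhs <= rhs)   # True: the projection is non-expansive
```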

Problem Formulation

In our setup, we define the following quantities:

  • $Z_1^n := \{Z_1, \ldots, Z_n\}$: $n$ i.i.d. observations drawn from a distribution $\mathbb{P}$ with some parameter $\theta^*$,
  • $L : \mathbb{R}^p \times \mathcal{Z}^n \to \mathbb{R}$: a convex and differentiable loss function, where $L(\theta; Z_1^n)$ is the loss of $\theta$ on the observations $Z_1^n$,
  • $\lambda_n > 0$: a user-defined regularization penalty,
  • $R : \mathbb{R}^p \to \mathbb{R}_+$: a norm-based regularizer.

The purpose of the regularized M-estimator is then to solve the convex optimization problem

$$\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \left\{ L(\theta; Z_1^n) + \lambda_n R(\theta) \right\}, \tag{5}$$

and we are interested in deriving bounds on

$$\|\hat{\theta}_{\lambda_n} - \theta^*\| \tag{6}$$

for some error norm $\|\cdot\|$ induced by an inner product $\langle \cdot, \cdot \rangle$ on $\mathbb{R}^p$.

Decomposability of the Regularizer R

The first key property in the result is decomposability of the norm-based regularizer $R$. Working in the ambient space $\mathbb{R}^p$, define $M \subseteq \mathbb{R}^p$ to be the model subspace that captures the constraints of the model we are working with (e.g. $k$-sparse vectors), and denote by $\bar{M}$ its closure, i.e. the union of $M$ and all of its limit points. In addition, denote by $\bar{M}^\perp$ the orthogonal complement of $\bar{M}$, namely

$$\bar{M}^\perp := \{ v \in \mathbb{R}^p \mid \langle u, v \rangle = 0 \ \text{for all } u \in \bar{M} \}. \tag{7}$$

We call this the perturbation subspace, as its elements represent perturbations away from the model subspace $M$. The reason we work with $\bar{M}^\perp$ rather than $M^\perp$ is that in some special cases, such as low-rank matrices with the nuclear norm, the subspace $M$ can be strictly contained in $\bar{M}$.

Now we can introduce the property of decomposability:

Definition (Regularizer Decomposability)
Given a pair of subspaces $M \subseteq \bar{M}$, a norm-based regularizer $R$ is decomposable with respect to $(M, \bar{M}^\perp)$ if
$$R(\theta + \gamma) = R(\theta) + R(\gamma) \quad \text{for all } \theta \in M \text{ and } \gamma \in \bar{M}^\perp.$$

Since $R$ is a norm-based regularizer, the triangle inequality always gives
$$R(\theta + \gamma) \le R(\theta) + R(\gamma), \tag{8}$$
so decomposability is a stronger condition, requiring this inequality to hold with equality when $\theta$ lies in the model subspace and $\gamma$ lies in the orthogonal complement $\bar{M}^\perp$.

Decomposability of the regularizer is important because it ensures that perturbations $\gamma \in \bar{M}^\perp$ away from the model subspace are penalized to the maximum extent possible. We are usually interested in finding model subspaces that are small, with a large orthogonal complement. We will see in the main theorem that when this is the case, we obtain better rates for estimating $\theta^*$.

There are many natural contexts that admit regularizers which are decomposable with respect to subspaces, and the following example highlights one such case.

Example (s-sparse Vectors)
Consider estimating an $s$-sparse parameter vector in $\mathbb{R}^p$ with $\ell_1$-regularization. For any set $S \subseteq [p]$ with $|S| = s$, we can define our model subspace as
$$M(S) := \{ \theta \in \mathbb{R}^p \mid \theta_j = 0 \ \text{for all } j \notin S \},$$
i.e. all vectors in $\mathbb{R}^p$ whose support is contained in $S$. In this case $\bar{M} = M$, and the orthogonal complement is
$$\bar{M}^\perp(S) = M^\perp(S) := \{ \gamma \in \mathbb{R}^p \mid \gamma_j = 0 \ \text{for all } j \in S \}.$$
Then this setup is decomposable: for any $\theta \in M(S)$ and $\gamma \in M^\perp(S)$,
$$\|\theta + \gamma\|_1 = \|\theta_S + \gamma_{S^c}\|_1 = \|\theta_S\|_1 + \|\gamma_{S^c}\|_1 = \|\theta\|_1 + \|\gamma\|_1,$$
since $\theta$ and $\gamma$ have disjoint supports.
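A minimal numerical check of this decomposability (the support and vectors below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p, S = 10, [0, 1, 2]
Sc = [j for j in range(p) if j not in S]

theta = np.zeros(p); theta[S] = rng.standard_normal(len(S))    # theta in M(S)
gamma = np.zeros(p); gamma[Sc] = rng.standard_normal(len(Sc))  # gamma in M(S)-perp

lhs = np.sum(np.abs(theta + gamma))                  # R(theta + gamma)
rhs = np.sum(np.abs(theta)) + np.sum(np.abs(gamma))  # R(theta) + R(gamma)
print(np.isclose(lhs, rhs))   # True: the l1 norm decomposes over (M(S), M(S)-perp)
```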

Role of Decomposability

Figure 1. A visualization of $C(M, \bar{M}^\perp; \theta^*)$. The shaded area represents the set $C(M, \bar{M}^\perp; \theta^*)$, i.e. all values of $\Delta$ that satisfy the inequality defining the set in Lemma 1.

Decomposability is important because, for a suitable choice of $\lambda_n$, it constrains the error of the estimator to lie in a specific set. This is given in the following result, which is Lemma 1 in (Negahban et al., 2009):

Lemma (Lemma 1 in (Negahban et al., 2009))
Suppose that $L$ is a convex and differentiable function, and consider any optimal solution $\hat{\theta}_{\lambda_n}$ to the optimization problem (5) with a strictly positive regularization parameter satisfying $\lambda_n \ge 2 R^*(\nabla L(\theta^*; Z_1^n))$. Then for any pair $(M, \bar{M}^\perp)$ over which $R$ is decomposable, the error $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$ belongs to the set
$$C(M, \bar{M}^\perp; \theta^*) := \{ \Delta \in \mathbb{R}^p \mid R(\Delta_{\bar{M}^\perp}) \le 3 R(\Delta_{\bar{M}}) + 4 R(\theta^*_{M^\perp}) \}.$$

Recall from the Projections section that $\Delta_{\bar{M}}$ represents the projection of $\Delta$ onto $\bar{M}$, and similarly for the other quantities. Due to space constraints, we do not prove Lemma 1 in this post, but it is very important in the formulation of restricted strong convexity and in proving Theorem 1.

Figure 1 provides a visualization of $C(M, \bar{M}^\perp; \theta^*)$ in $\mathbb{R}^3$ in the sparse vectors setting. In this case, $S = \{3\}$ with $|S| = 1$, so the projection of $\Delta$ onto the model subspace has non-zero values only in the third coordinate, and the orthogonal complement consists of vectors whose third coordinate is zero. Formally,

$$M(S) = \bar{M}(S) = \{ \Delta \in \mathbb{R}^3 \mid \Delta_1 = \Delta_2 = 0 \}, \tag{9}$$
$$\bar{M}^\perp(S) = \{ \Delta \in \mathbb{R}^3 \mid \Delta_3 = 0 \}. \tag{10}$$

The vertical axis of Figure 1 denotes the third coordinate, and the horizontal plane denotes the first two coordinates. The shaded area represents the set $C(M, \bar{M}^\perp; \theta^*)$, i.e. all values of $\Delta$ that satisfy the inequality defining the set in Lemma 1.

Figure 1(a) shows the special case when $\theta^* \in M$. In this scenario, $R(\theta^*_{M^\perp}) = 0$, and so

$$C(M, \bar{M}^\perp; \theta^*) = \{ \Delta \in \mathbb{R}^p \mid R(\Delta_{\bar{M}^\perp}) \le 3 R(\Delta_{\bar{M}}) \},$$

which is a cone.

In the general setting where $\theta^* \notin M$, we have $R(\theta^*_{M^\perp}) > 0$, and the set $C(M, \bar{M}^\perp; \theta^*)$ becomes a star-shaped set, as shown in Figure 1(b).
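Before moving on, here is a numerical illustration of Lemma 1 in the sparse-vector setting. This is a sketch, not from the paper: the data-generating process is arbitrary, and we use scikit-learn's `Lasso`, whose objective matches the Lasso program stated earlier. Since $\theta^*$ is exactly sparse, $\theta^*_{M^\perp} = 0$, so Lemma 1 predicts $\|\hat{\Delta}_{S^c}\|_1 \le 3 \|\hat{\Delta}_S\|_1$ once $\lambda_n \ge 2 R^*(\nabla L(\theta^*))$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p, s = 200, 100, 5

theta_star = np.zeros(p); theta_star[:s] = 1.0            # true parameter lies in M(S)
X = rng.standard_normal((n, p))
eps = 0.5 * rng.standard_normal(n)
y = X @ theta_star + eps

# For least squares, grad L(theta*) = -X^T eps / n, and R* is the l-infinity norm,
# so the condition of Lemma 1 reads lambda_n >= 2 * ||X^T eps / n||_inf.
lam = 2 * np.max(np.abs(X.T @ eps / n))

theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
delta = theta_hat - theta_star

off_support = np.sum(np.abs(delta[s:]))    # R(Delta_{M-perp-bar})
on_support = np.sum(np.abs(delta[:s]))     # R(Delta_{M-bar})
print("||Delta_{S^c}||_1 :", off_support)
print("3 ||Delta_S||_1   :", 3 * on_support)
print("inside the set    :", off_support <= 3 * on_support)
```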

Restricted Strong Convexity (RSC) of the Loss Function

Figure 2. An illustration of the role of curvature in guaranteeing that $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$ is small when $L(\hat{\theta}_{\lambda_n}) - L(\theta^*)$ is small.

In a classical setting, as the number of samples $n$ increases, the difference in loss $dL = |L(\hat{\theta}_{\lambda_n}) - L(\theta^*)|$ converges to zero. However, convergence in loss by itself is insufficient to ensure convergence of the parameters, i.e. that $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$ is small; this also depends on the curvature of the loss function $L$.

Figure 2 illustrates the importance of curvature. In Figure 2(a), $L$ has high curvature, so a small $dL$ also implies a small $\hat{\Delta}$. On the other hand, in Figure 2(b), $L$ has an almost flat landscape near $\hat{\theta}_{\lambda_n}$, so even when $dL$ is small, $\hat{\Delta}$ could still be large.

Consider performing a Taylor expansion of $L$ around $\theta^*$:

$$L(\theta^* + \Delta) = L(\theta^*) + \langle \nabla L(\theta^*), \Delta \rangle + \frac{1}{2} \Delta^T \nabla^2 L(\theta^*) \Delta + \cdots. \tag{11}$$

Rearranging, we can write the error of the first-order Taylor expansion at $\theta^*$ as

$$\delta L(\Delta, \theta^*) = L(\theta^* + \Delta) - L(\theta^*) - \langle \nabla L(\theta^*), \Delta \rangle.$$

The first-order Taylor approximation is a linear approximation, and hence its error $\delta L(\Delta, \theta^*)$, which is dominated by the quadratic term, captures the curvature of $L$ around $\theta^*$.

As such, one way to show that $L$ has good curvature around $\theta^*$ is to show that $\delta L(\Delta, \theta^*) \ge \kappa \|\Delta\|^2$ holds for all $\Delta$ in a neighborhood of $\theta^*$, since this enforces a lower bound on the quadratic growth of the loss.

This leads us to the definition of restricted strong convexity:

Definition (Restricted Strong Convexity)
The loss function satisfies a restricted strong convexity (RSC) condition with curvature $\kappa_L > 0$ and tolerance function $\tau_L$ if
$$\delta L(\Delta, \theta^*) \ge \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) \quad \text{for all } \Delta \in C(M, \bar{M}^\perp; \theta^*).$$

We only need to consider error vectors $\Delta \in C(M, \bar{M}^\perp; \theta^*)$, since Lemma 1 guarantees that the error lies in that set.

In many statistical models, restricted strong convexity holds with $\tau_L = 0$; however, a non-zero tolerance is required in more general settings, such as generalized linear models.
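For the least-squares loss $L(\theta) = \frac{1}{2n}\|y - X\theta\|_2^2$, the first-order Taylor error has the closed form $\delta L(\Delta, \theta^*) = \frac{1}{2n}\|X\Delta\|_2^2$, so curvature along a direction $\Delta$ is governed by the design matrix. A small sketch (with illustrative data only) computes this error numerically and compares it against a candidate $\kappa \|\Delta\|_2^2$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 20
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
theta_star = rng.standard_normal(p)

def loss(theta):
    return np.sum((y - X @ theta) ** 2) / (2 * n)

def grad(theta):
    return -X.T @ (y - X @ theta) / n

delta = rng.standard_normal(p)

# First-order Taylor error: delta_L = L(theta* + Delta) - L(theta*) - <grad L(theta*), Delta>.
taylor_error = loss(theta_star + delta) - loss(theta_star) - grad(theta_star) @ delta
print(taylor_error)                              # equals ||X Delta||^2 / (2n) for least squares
print(np.sum((X @ delta) ** 2) / (2 * n))

# A candidate curvature constant: the smallest eigenvalue of X^T X / (2n) bounds the error below.
kappa = np.linalg.eigvalsh(X.T @ X / (2 * n)).min()
print(taylor_error >= kappa * np.sum(delta ** 2) - 1e-10)   # True (here tau_L = 0)
```

When $p > n$, this smallest eigenvalue is zero, which is exactly why the curvature condition is only required over the restricted set $C(M, \bar{M}^\perp; \theta^*)$.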

Proof of Theorem 1

We can now state and prove the main result of the paper. This will hold under the decomposability of the regularizer (G1), and the restricted strong convexity of the loss function (G2).

  • (G1) The regularizer $R$ is a norm and is decomposable with respect to the subspace pair $(M, \bar{M}^\perp)$, where $M \subseteq \bar{M}$.

  • (G2) The loss function $L$ is convex and differentiable, and satisfies restricted strong convexity with curvature $\kappa_L$ and tolerance $\tau_L$.

Theorem 1 in (Negahban et al., 2009) (Bounds for General Models)
Under conditions (G1) and (G2), consider the convex optimization problem (5) with a strictly positive regularization constant $\lambda_n \ge 2 R^*(\nabla L(\theta^*))$. Then any optimal solution $\hat{\theta}_{\lambda_n}$ to the convex program (5) satisfies the bound
$$\|\hat{\theta}_{\lambda_n} - \theta^*\|^2 \le 9 \frac{\lambda_n^2}{\kappa_L^2} \Psi^2(\bar{M}) + \frac{\lambda_n}{\kappa_L} \left( 2 \tau_L^2(\theta^*) + 4 R(\theta^*_{M^\perp}) \right).$$
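To see the bound in action, here is a rough numerical sketch, not taken from the paper: it uses a well-conditioned setting with $n > p$ so that RSC holds globally with $\kappa_L = \lambda_{\min}(X^T X)/(2n)$ and $\tau_L = 0$, takes $\theta^*$ exactly sparse so that $R(\theta^*_{M^\perp}) = 0$, uses $\Psi(\bar{M}) = \sqrt{s}$ for the $\ell_1$/$\ell_2$ pair, and again relies on scikit-learn's `Lasso`:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p, s = 500, 50, 5

theta_star = np.zeros(p); theta_star[:s] = 1.0
X = rng.standard_normal((n, p))
eps = 0.5 * rng.standard_normal(n)
y = X @ theta_star + eps

lam = 2 * np.max(np.abs(X.T @ eps / n))               # lambda_n >= 2 R*(grad L(theta*))
theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

kappa = np.linalg.eigvalsh(X.T @ X).min() / (2 * n)   # curvature constant (valid since n > p)
bound = 9 * lam**2 * s / kappa**2                     # Theorem 1 with tau_L = 0, theta*_{M-perp} = 0

print("actual squared error :", np.sum((theta_hat - theta_star) ** 2))
print("Theorem 1 bound      :", bound)
```

The bound is conservative in absolute terms; its value lies in the rate it gives, $\|\hat{\Delta}\|^2 \lesssim \lambda_n^2 s / \kappa_L^2$, which the paper's Lasso corollary turns into the familiar $s \log p / n$ scaling.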

We will rely on the following lemmas that will be stated without proof due to space constraints:

Lemma 3 in (Negahban et al., 2009) (Deviation Inequalities)
For any decomposable regularizer and $p$-dimensional vectors $\theta^*$ and $\Delta$, we have
$$R(\theta^* + \Delta) - R(\theta^*) \ge R(\Delta_{\bar{M}^\perp}) - R(\Delta_{\bar{M}}) - 2 R(\theta^*_{M^\perp}).$$
Moreover, as long as $\lambda_n \ge 2 R^*(\nabla L(\theta^*))$ and $L$ is convex, we have
$$L(\theta^* + \Delta) - L(\theta^*) \ge -\frac{\lambda_n}{2} \left[ R(\Delta_{\bar{M}}) + R(\Delta_{\bar{M}^\perp}) \right].$$
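For the $\ell_1$/sparse setting, the first inequality of Lemma 3 can be checked directly. A small sketch with an arbitrary support $S$ and random vectors (here $\bar{M} = M(S)$, $\bar{M}^\perp = M^\perp(S)$, and $\theta^*$ need not be sparse):

```python
import numpy as np

rng = np.random.default_rng(9)
p, S = 12, [0, 1, 2]
Sc = [j for j in range(p) if j not in S]
l1 = lambda v: np.sum(np.abs(v))

for _ in range(1000):
    theta_star = rng.standard_normal(p)     # an arbitrary (not necessarily sparse) theta*
    delta = rng.standard_normal(p)
    lhs = l1(theta_star + delta) - l1(theta_star)
    rhs = l1(delta[Sc]) - l1(delta[S]) - 2 * l1(theta_star[Sc])
    assert lhs >= rhs - 1e-12
print("Lemma 3 deviation inequality verified on random instances")
```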
Lemma 4 in (Negahban et al., 2009)
If $F(\Delta) > 0$ for all vectors $\Delta \in K(\delta)$, then $\|\hat{\Delta}\| \le \delta$.

Here the function $F$ and the set $K(\delta)$ are defined below. Note that this is similar to our earlier discussion of restricted strong convexity, where we only needed to consider error vectors restricted to $C(M, \bar{M}^\perp; \theta^*)$ thanks to Lemma 1. Therefore, it suffices to show that $F(\Delta) > 0$ on $K(\delta)$ for a suitable $\delta$ to obtain a bound on $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$, which will complete the proof of Theorem 1.

Define $F : \mathbb{R}^p \to \mathbb{R}$ by

$$F(\Delta) := L(\theta^* + \Delta) - L(\theta^*) + \lambda_n \{ R(\theta^* + \Delta) - R(\theta^*) \}, \tag{12}$$

and define the set

$$K(\delta) := C(M, \bar{M}^\perp; \theta^*) \cap \{ \|\Delta\| = \delta \}. \tag{13}$$

Take any $\Delta \in K(\delta)$. Then

$$\begin{aligned}
F(\Delta) &= L(\theta^* + \Delta) - L(\theta^*) + \lambda_n \{ R(\theta^* + \Delta) - R(\theta^*) \} && \text{(by definition)} \\
&\ge \langle \nabla L(\theta^*), \Delta \rangle + \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) + \lambda_n \{ R(\theta^* + \Delta) - R(\theta^*) \} && (14) \\
&\qquad \text{(by restricted strong convexity, } \delta L(\Delta, \theta^*) \ge \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) \text{,} \\
&\qquad \text{ with } \delta L(\Delta, \theta^*) = L(\theta^* + \Delta) - L(\theta^*) - \langle \nabla L(\theta^*), \Delta \rangle \text{)} \\
&\ge \langle \nabla L(\theta^*), \Delta \rangle + \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) + \lambda_n \{ R(\Delta_{\bar{M}^\perp}) - R(\Delta_{\bar{M}}) - 2 R(\theta^*_{M^\perp}) \} && (18) \\
&\qquad \text{(by Lemma 3)}.
\end{aligned}$$

We lower bound the first term as $\langle \nabla L(\theta^*), \Delta \rangle \ge -\frac{\lambda_n}{2} R(\Delta)$:

$$|\langle \nabla L(\theta^*), \Delta \rangle| \le R^*(\nabla L(\theta^*)) \, R(\Delta) \le \frac{\lambda_n}{2} R(\Delta), \tag{19}$$

where the first inequality is the (generalized) Cauchy–Schwarz inequality for the dual pair $R$ and $R^*$, and the second follows from the Theorem 1 assumption $\lambda_n \ge 2 R^*(\nabla L(\theta^*))$. Hence,

$$\langle \nabla L(\theta^*), \Delta \rangle \ge -\frac{\lambda_n}{2} R(\Delta). \tag{21}$$

Applying this to (18),

$$\begin{aligned}
F(\Delta) &\ge \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) + \lambda_n \{ R(\Delta_{\bar{M}^\perp}) - R(\Delta_{\bar{M}}) - 2 R(\theta^*_{M^\perp}) \} - \frac{\lambda_n}{2} R(\Delta) && (22) \\
&\ge \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) + \lambda_n \{ R(\Delta_{\bar{M}^\perp}) - R(\Delta_{\bar{M}}) - 2 R(\theta^*_{M^\perp}) \} - \frac{\lambda_n}{2} \left( R(\Delta_{\bar{M}^\perp}) + R(\Delta_{\bar{M}}) \right) && (23) \\
&\qquad \text{(triangle inequality: } R(\Delta) \le R(\Delta_{\bar{M}^\perp}) + R(\Delta_{\bar{M}}) \text{)} \\
&= \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) + \lambda_n \left\{ \tfrac{1}{2} R(\Delta_{\bar{M}^\perp}) - \tfrac{3}{2} R(\Delta_{\bar{M}}) - 2 R(\theta^*_{M^\perp}) \right\} && (25) \\
&\ge \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) + \lambda_n \left\{ - \tfrac{3}{2} R(\Delta_{\bar{M}}) - 2 R(\theta^*_{M^\perp}) \right\} && (27) \\
&\qquad \text{(norms are non-negative)} \\
&= \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) - \frac{\lambda_n}{2} \left\{ 3 R(\Delta_{\bar{M}}) + 4 R(\theta^*_{M^\perp}) \right\}. && (29)
\end{aligned}$$

To bound the term $R(\Delta_{\bar{M}})$, recall the definition of the subspace compatibility constant:

$$\Psi(\bar{M}) := \sup_{u \in \bar{M} \setminus \{0\}} \frac{R(u)}{\|u\|}, \tag{30}$$

and hence

$$R(\Delta_{\bar{M}}) \le \Psi(\bar{M}) \|\Delta_{\bar{M}}\|. \tag{31}$$

To upper bound $\|\Delta_{\bar{M}}\|$, we have

$$\begin{aligned}
\|\Delta_{\bar{M}}\| &= \|\Pi_{\bar{M}}(\Delta) - \Pi_{\bar{M}}(0)\| && \text{(since } 0 \in \bar{M} \text{, } \Pi_{\bar{M}}(0) = 0 \text{)} \\
&\le \|\Delta - 0\| && \text{(the projection operator is non-expansive, see Equation (4))} \\
&= \|\Delta\|,
\end{aligned} \tag{32}$$

which, substituting into Equation (31), gives

$$R(\Delta_{\bar{M}}) \le \Psi(\bar{M}) \|\Delta\|. \tag{35}$$

Now we can use this result to further lower bound Equation (29):

$$F(\Delta) \ge \kappa_L \|\Delta\|^2 - \tau_L^2(\theta^*) - \frac{\lambda_n}{2} \left\{ 3 \Psi(\bar{M}) \|\Delta\| + 4 R(\theta^*_{M^\perp}) \right\}. \tag{36}$$

The RHS of Equation (36) is a quadratic in $\|\Delta\|$ with a strictly positive leading coefficient $\kappa_L$, and hence it becomes strictly positive once $\|\Delta\|$ is large enough. To find how large $\|\Delta\| = \delta$ needs to be, write

$$a = \kappa_L, \qquad b = \frac{3 \lambda_n}{2} \Psi(\bar{M}), \qquad c = \tau_L^2(\theta^*) + 2 \lambda_n R(\theta^*_{M^\perp}), \tag{37}$$

so that we have

$$F(\Delta) \ge a \|\Delta\|^2 - b \|\Delta\| - c. \tag{40}$$

Then the square of the largest root (the rightmost intercept) of the quadratic $a\delta^2 - b\delta - c$ is given by the quadratic formula:

$$\begin{aligned}
\delta^2 &= \left( \frac{-(-b) + \sqrt{(-b)^2 - 4a(-c)}}{2a} \right)^2 && (41) \\
&= \left( \frac{b + \sqrt{b^2 + 4ac}}{2a} \right)^2 && (42) \\
&\le \left( \frac{\sqrt{b^2 + 4ac}}{a} \right)^2 && \text{(since } b \le \sqrt{b^2 + 4ac} \text{)} \quad (43) \\
&= \frac{b^2 + 4ac}{a^2} && (44) \\
&= \frac{9 \lambda_n^2 \Psi^2(\bar{M})}{4 \kappa_L^2} + \frac{4 \tau_L^2(\theta^*) + 8 \lambda_n R(\theta^*_{M^\perp})}{\kappa_L}. && \text{(substituting in } a, b, c \text{)} \quad (45)
\end{aligned}$$
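As a quick sanity check of the algebra in (41)–(45), here is a sketch with arbitrary positive coefficients verifying that the positive root of $a\delta^2 - b\delta - c$ indeed satisfies $\delta^2 \le (b^2 + 4ac)/a^2$:

```python
import numpy as np

rng = np.random.default_rng(8)
for _ in range(1000):
    a, b, c = rng.uniform(0.1, 10.0, size=3)              # arbitrary positive coefficients
    delta = (b + np.sqrt(b**2 + 4 * a * c)) / (2 * a)      # positive root of a*d^2 - b*d - c = 0
    assert abs(a * delta**2 - b * delta - c) < 1e-9        # it is indeed a root
    assert delta**2 <= (b**2 + 4 * a * c) / a**2 + 1e-12   # the bound from (43)-(44)
print("bound verified on 1000 random instances")
```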

In (Negahban et al., 2009), the authors instead arrive at the upper bound

$$\delta^2 \le 9 \frac{\lambda_n^2 \Psi^2(\bar{M})}{\kappa_L^2} + \frac{\lambda_n}{\kappa_L} \left\{ 2 \tau_L^2(\theta^*) + 4 R(\theta^*_{M^\perp}) \right\}, \tag{46}$$

but I did not manage to figure out how they produce a $\lambda_n$ factor next to the $\tau_L^2(\theta^*)$ term. All other differences are just constant factors. It may be due to an overly coarse bound applied in Equation (43) on my end, but it is unclear to me how the $\lambda_n$ factor can multiply only the $\tau_L^2(\theta^*)$ term without also affecting the $R(\theta^*_{M^\perp})$ term.

With Equation (46), we can then apply Lemma 4 in (Negahban et al., 2009) to obtain the desired result that

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|^2 \le 9 \frac{\lambda_n^2}{\kappa_L^2} \Psi^2(\bar{M}) + \frac{\lambda_n}{\kappa_L} \left( 2 \tau_L^2(\theta^*) + 4 R(\theta^*_{M^\perp}) \right). \tag{47}$$

This concludes the proof.

Conclusion

In the proof of Theorem 1, we saw how the bound is derived from two key ingredients: decomposability of the regularizer and restricted strong convexity of the loss function. Decomposability of the regularizer ensures, via Lemma 1, that the error vector $\hat{\Delta}$ stays in the set $C(M, \bar{M}^\perp; \theta^*)$. This is then used in Lemma 4 of (Negahban et al., 2009), which allows us to bound $\|\hat{\Delta}\|$ once we show that $F(\Delta) > 0$ on $K(\delta)$. In the step where we lower bounded $F(\Delta)$, we made use of the restricted strong convexity of the loss.

Theorem 1 provides a family of bounds for each decomposable regularizer, one for each choice of subspace pair $(M, \bar{M}^\perp)$. The authors of (Negahban et al., 2009) used Theorem 1 both to rederive existing results and to derive new results on low-rank matrix estimation with the nuclear norm, minimax-optimal rates for noisy matrix completion, and noisy matrix decomposition. The reader is encouraged to refer to (Negahban et al., 2009) for more details on the large number of corollaries of Theorem 1.

Acknowledgments

I would like to thank my dear friend Josh Abrams for helping to review and provide valuable suggestions for this post!

Citations

  1. Negahban, S., Yu, B., Wainwright, M. J., and Ravikumar, P. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A. (eds.), Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009. URL https://proceedings.neurips.cc/paper_files/paper/2009/file/dc58e3a306451c9d670adcd37004f48f-Paper.pdf.


