Bounding Mixing Times of Markov Chains via the Spectral Gap

A Markov chain that is aperiodic and irreducible will eventually converge to a stationary distribution. This fact is widely used in machine learning, for example in Markov Chain Monte Carlo (MCMC) methods, where random walks on Markov chains are used to estimate the log-likelihood of a model; computing the log-likelihood directly requires the partition function, which is hard to compute as it is #P-hard (even harder than NP-hard). However, one major issue is that it is unclear how many steps we should take before we are guaranteed that the Markov chain has converged to its stationary distribution. In this post, we will see how the spectral gap of the transition matrix of a Markov chain relates to its mixing time.
Mixing Times
Our goal is to develop methods for understanding how long it takes a Markov chain to approximate its stationary distribution.
Aside: Coupling
Coupling is one general technique for bounding how long it takes a Markov chain to converge to its stationary distribution. It is based on running two copies of the original Markov chain simultaneously, one of them started at stationarity, and showing that they can be made to coincide (i.e. have small total variation distance) after some time, known as the coupling time.
We will not discuss coupling in this post, but will instead develop how spectral gaps can be used, as this is more useful for other concepts.
The Spectral Gap Method
The main idea of the spectral gap method is that the mixing time is bounded by the inverse of the spectral gap, the difference between the largest eigenvalue of the transition matrix (which is always $1$) and its second-largest eigenvalue in absolute value.
Before we can talk about one distribution approximating another, we first need to define what closeness between two distributions means. The formulation that we will use is the total variation distance:

$$d_{TV}(p, q) = \max_{S \subseteq \Omega} |p(S) - q(S)| = \frac{1}{2} \sum_{x \in \Omega} |p(x) - q(x)|.$$

The equality between the two expressions can be observed from the fact that the maximum is attained by the set $S = \{x : p(x) > q(x)\}$, and, since both $p$ and $q$ sum to $1$, the excess of $p$ over $q$ on this set equals the excess of $q$ over $p$ on its complement; each accounts for half of the $L_1$ distance.
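As a quick numerical sketch of this equality (the function names here are ours, not from any particular library), we can compute the total variation distance both ways and check that they agree:

```python
import numpy as np

def tv_half_l1(p, q):
    """Total variation distance as half the L1 distance."""
    return 0.5 * np.abs(p - q).sum()

def tv_worst_event(p, q):
    """Equivalent form: the largest discrepancy in probability over any
    event, attained by the set of outcomes where p exceeds q."""
    return np.maximum(p - q, 0.0).sum()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

# Both formulations give the same value on this example.
assert np.isclose(tv_half_l1(p, q), tv_worst_event(p, q))
```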

Intuition for Mixing Times
We consider how long it takes to converge on some special graphs to build up intuition.
Random Walks on Path Graphs
The path graph is a line graph on $n$ vertices: vertex $i$ is adjacent to vertices $i - 1$ and $i + 1$. A random walk on the path moves diffusively: after $t$ steps it typically wanders only about $\sqrt{t}$ away from where it started, so it takes on the order of $n^2$ steps before the walk forgets its starting point and approaches the stationary distribution.
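The diffusive behaviour can be seen in a small simulation (a sketch; the graph size and step counts are arbitrary choices of ours). We use the lazy walk, which stays put with probability $1/2$, to sidestep periodicity:

```python
import numpy as np

def lazy_walk_path(n):
    """Lazy random walk matrix on the path graph with n vertices."""
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    deg = A.sum(axis=1)
    W = A / deg[:, None]          # simple walk: each row is a distribution
    return 0.5 * (np.eye(n) + W)  # laziness removes periodicity issues

n = 20
W = lazy_walk_path(n)
pi = np.array([1, *([2] * (n - 2)), 1], dtype=float)
pi /= pi.sum()                    # stationary distribution, proportional to degree

p = np.zeros(n)
p[0] = 1.0                        # start at one end of the path
tv_after = {}
for t in [n, n * n]:
    pt = p @ np.linalg.matrix_power(W, t)
    tv_after[t] = 0.5 * np.abs(pt - pi).sum()

# After n steps the walk is still far from stationary;
# after about n^2 steps it is close.
```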

Random Walks on the Complete Graph
The complete graph $K_n$ has an edge between every pair of distinct vertices, so every vertex has degree $n - 1$.
This takes essentially one step to mix, since after a single step we can reach any vertex: from any start, the walk is uniform over the remaining $n - 1$ vertices, which is already within total variation distance $1/n$ of stationary (and exactly uniform if we add self-loops).
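A sketch of this in code (we add a self-loop at every vertex, so a single step lands exactly on the uniform distribution):

```python
import numpy as np

n = 10
# Walk matrix on the complete graph with a self-loop at every vertex:
# from any vertex, each of the n vertices (including itself) is equally likely.
W = np.full((n, n), 1.0 / n)

p = np.zeros(n)
p[0] = 1.0                        # start concentrated on one vertex
p1 = p @ W                        # one step of the walk
pi = np.full(n, 1.0 / n)          # uniform stationary distribution

tv = 0.5 * np.abs(p1 - pi).sum()  # zero: mixed in a single step
```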

This short analysis tells us that if our graph looks like a line graph then we should expect poor mixing times; whereas if it looks more like a complete graph then we can expect the opposite.
Defining Mixing Times
We now formally introduce the concept of mixing times.

Definition (mixing time). For a Markov chain with transition matrix $W$ and stationary distribution $\pi$, the $\epsilon$-mixing time is $$t_{mix}(\epsilon) = \min\{ t : d_{TV}(W^t p, \pi) \le \epsilon \text{ for every starting distribution } p \}.$$

We claim that the choice of $\epsilon$ is not critical: once the distance to stationarity is below a constant, it keeps shrinking geometrically, so changing $\epsilon$ changes the mixing time by only an $O(\log(1/\epsilon))$ factor. It is therefore customary to fix a convenient constant such as $\epsilon = 1/4$.
To bound mixing times, we consider random walks on undirected, regular graphs.

Consider random walks on an undirected, $d$-regular graph $G = (V, E)$ on $n$ vertices. One step of the walk moves from the current vertex to a uniformly random neighbour, so the walk matrix is $$W = \frac{1}{d} A,$$ where $A$ is the adjacency matrix of $G$.

The stationary distribution for $W$ is the uniform distribution $\pi = (1/n, \ldots, 1/n)$. This can be seen from the following: if the current distribution is $\pi$, then $$(W\pi)_i = \sum_j W_{ij} \pi_j = \frac{1}{n} \cdot \frac{1}{d} \sum_j A_{ij} = \frac{1}{n},$$ where the last equality holds because every row of $A$ sums to $d$ by regularity, so $W\pi = \pi$.
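A quick check of this fact on the cycle graph, a simple 2-regular example (a minimal sketch):

```python
import numpy as np

n = 8
# Adjacency matrix of the n-cycle, a 2-regular graph.
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0

d = 2
W = A / d                      # walk matrix of a d-regular graph
pi = np.full(n, 1.0 / n)       # the uniform distribution

# Uniform is stationary: W is doubly stochastic (rows and columns sum to 1).
assert np.allclose(pi @ W, pi)
```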
Spectral Graph Theory
Spectral graph theory is the study of how the eigenvalues and eigenvectors of matrices associated with a graph reveal properties of the graph, for instance how well-connected it is.
Lemma 1. Let $G$ be a $d$-regular graph with walk matrix $W = \frac{1}{d}A$, and let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of $W$. Then:

- $\lambda_1 = 1$ and $|\lambda_i| \le 1$ for all $i$,
- $\lambda_2 < 1$ if and only if $G$ is connected, and
- $\lambda_n > -1$ if and only if $G$ does not have a bipartite connected component.
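Before the proof, the three claims of Lemma 1 can be sanity-checked numerically on small regular graphs (cycles are just convenient examples of ours):

```python
import numpy as np

def cycle(n):
    """Adjacency matrix of the n-cycle (2-regular)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
    return A

def walk_eigs(A):
    """Eigenvalues of the walk matrix W = A/d, sorted in decreasing order."""
    d = A.sum(axis=1)[0]
    return np.sort(np.linalg.eigvalsh(A / d))[::-1]

e_connected = walk_eigs(cycle(5))   # C5: connected odd cycle (not bipartite)

two = np.zeros((10, 10))            # two disjoint copies of C5: disconnected
two[:5, :5] = cycle(5)
two[5:, 5:] = cycle(5)
e_disconnected = walk_eigs(two)

e_bipartite = walk_eigs(cycle(4))   # C4: connected but bipartite

assert np.abs(e_connected).max() <= 1 + 1e-9   # all |lambda_i| <= 1
assert e_connected[1] < 1 - 1e-9               # connected => lambda_2 < 1
assert np.isclose(e_disconnected[1], 1.0)      # disconnected => lambda_2 = 1
assert np.isclose(e_bipartite[-1], -1.0)       # bipartite component => lambda_n = -1
```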
We prove each of the claims in Lemma 1 in order.
Let $v$ be an eigenvector of $W$ with eigenvalue $\lambda$, and let $i$ be a coordinate at which $|v_i|$ is maximal. Then $$|\lambda| \, |v_i| = |(Wv)_i| = \frac{1}{d} \Big| \sum_{j : (i,j) \in E} v_j \Big| \le \frac{1}{d} \cdot d \cdot |v_i| = |v_i|.$$
This shows that $|\lambda| \le 1$ for every eigenvalue.
It remains to show that $1$ is itself an eigenvalue: since every row of $W$ sums to $1$, the all-ones vector $\mathbf{1}$ satisfies $W\mathbf{1} = \mathbf{1}$, so $\lambda_1 = 1$.
WLOG, assume that the graph has two distinct connected components; the proof easily extends to having more components.
Let $S_1$ and $S_2$ denote the vertex sets of the two components.
Define $v^{(1)} = \mathbf{1}_{S_1}$, the indicator vector of $S_1$.
Then $Wv^{(1)} = v^{(1)}$, since every neighbour of a vertex in $S_1$ also lies in $S_1$.
We can show the same for $v^{(2)} = \mathbf{1}_{S_2}$.
Since by our disconnectedness assumption $S_1$ and $S_2$ are disjoint, $v^{(1)}$ and $v^{(2)}$ are linearly independent eigenvectors with eigenvalue $1$, so the eigenvalue $1$ has multiplicity at least two and $\lambda_2 = 1$.
This shows the backwards direction.
We will show that for any eigenvector $v$ with $Wv = v$, the vector $v$ must be constant on each connected component; if $G$ is connected, the eigenspace of $1$ is then one-dimensional, so $\lambda_2 < 1$.
Let $i$ be a vertex at which $v_i$ is maximal. Since $v_i = (Wv)_i$ is the average of $v$ over the neighbours of $i$, every neighbour $j$ of $i$ must also satisfy $v_j = v_i$; repeating this argument spreads the maximum value across the entire component of $i$, so $v$ is constant there.
Note that this argument shows that the multiplicity of the eigenvalue $1$ equals the number of connected components of $G$; in particular, $\lambda_2 = 1$ if and only if $G$ is disconnected.
Suppose that $\lambda_n = -1$, and let $v$ be a corresponding eigenvector, so that $Wv = -v$.
Let $i$ be a vertex at which $|v_i|$ is maximal; WLOG $v_i > 0$ (otherwise replace $v$ by $-v$).
Define the vector $u$ by $u_j = |v_j|$.
Since $$u_i = |v_i| = |(Wv)_i| \le \frac{1}{d} \sum_{j : (i,j) \in E} |v_j| = (Wu)_i \le u_i,$$ equality must hold throughout, which forces $v_j = -v_i$ for every neighbour $j$ of $i$.
Similarly to the backwards direction of Claim 2, we can see that this can only hold if it propagates through each element of the connected component containing $i$: the sign of $v$ alternates along every edge, so partitioning the component by sign exhibits it as bipartite. Conversely, if some component is bipartite with parts $L$ and $R$, the vector that is $+1$ on $L$, $-1$ on $R$, and $0$ elsewhere is an eigenvector with eigenvalue $-1$.
This shows how we can glean useful information about a graph just from its eigenvalues.
Recall how we previously showed that a unique stationary distribution exists if the graph is connected and not bipartite. Now we have another characterization of the same property, this time in terms of the eigenvalues of its transition matrix: the walk converges to the uniform distribution from every starting point exactly when $\lambda_2 < 1$ and $\lambda_n > -1$, i.e. when every eigenvalue other than $\lambda_1 = 1$ is strictly smaller than $1$ in absolute value.
Our goal now is to formulate a robust version of this corollary, where we can bound the mixing time of approaching the stationary distribution.
Bounding the Mixing Time via the Spectral Gap
We define the spectral gap as $$\gamma = 1 - \lambda, \quad \text{where } \lambda = \max(|\lambda_2|, |\lambda_n|)$$ is the largest absolute value among the eigenvalues other than $\lambda_1 = 1$.
We now finally introduce a lemma that shows that the mixing time is proportional to the inverse of the spectral gap multiplied by a log factor:

Lemma 2. For a random walk on a connected, non-bipartite, $d$-regular graph on $n$ vertices with spectral gap $\gamma$, $$t_{mix}(\epsilon) \le \frac{1}{\gamma} \ln \frac{\sqrt n}{2\epsilon} = O\!\left(\frac{1}{\gamma} \log \frac{n}{\epsilon}\right).$$

This shows that if your spectral gap is bounded below by a constant, your mixing time is in $O(\log(n/\epsilon))$.
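A small sketch contrasting the two examples from earlier (the function names and the explicit constant $\frac{1}{\gamma}\ln\frac{\sqrt n}{2\epsilon}$, which falls out of the standard $L_2$ argument, are our choices):

```python
import numpy as np

def abs_spectral_gap(W):
    """gamma = 1 - max(|lambda_2|, ..., |lambda_n|) for a symmetric walk matrix."""
    lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - lam[1]

def mixing_time_bound(W, eps):
    """Spectral-gap upper bound on the eps-mixing time: (1/gamma) * ln(sqrt(n)/(2*eps))."""
    n = W.shape[0]
    return np.log(np.sqrt(n) / (2 * eps)) / abs_spectral_gap(W)

n = 20
K = np.full((n, n), 1.0 / n)            # complete graph with self-loops: gap = 1

A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
C = 0.5 * (np.eye(n) + A / 2)           # lazy walk on the cycle: tiny gap

bound_complete = mixing_time_bound(K, 0.01)
bound_cycle = mixing_time_bound(C, 0.01)
# The poorly-connected cycle needs vastly more steps than the complete graph.
```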
Proof
We now prove Lemma 2.
Let $W = \frac{1}{d}A$ be the walk matrix, with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$ and corresponding eigenvectors $v_1, \ldots, v_n$.
Since this is a symmetric matrix, the eigenvectors can be chosen to be pairwise orthogonal and of unit norm.
We can perform an eigenvalue decomposition of $W$: $$W = \sum_{i=1}^{n} \lambda_i v_i v_i^\top.$$
It follows from this decomposition, together with the orthonormality of the $v_i$, that $$W^t = \sum_{i=1}^{n} \lambda_i^t v_i v_i^\top.$$
Let $p$ be an arbitrary starting distribution, and write $\lambda = \max(|\lambda_2|, |\lambda_n|)$, so that $\gamma = 1 - \lambda$.
After $t$ steps, the distribution of the walk is $W^t p$.
We can re-write $$p = \pi + \sum_{i \ge 2} \langle p, v_i \rangle v_i,$$
since we previously showed that the all-ones vector is always an eigenvector with eigenvalue $1$, where here it is re-scaled to have unit norm: $v_1 = \frac{1}{\sqrt n}\mathbf{1}$, and the component of $p$ along it is $\langle p, v_1 \rangle v_1 = \frac{1}{\sqrt n} \cdot \frac{1}{\sqrt n} \mathbf{1} = \pi$, because the entries of $p$ sum to $1$.
Then $$W^t p = \pi + \sum_{i \ge 2} \lambda_i^t \langle p, v_i \rangle v_i,$$
where the last step follows from the fact that $W^t v_i = \lambda_i^t v_i$ for every $i$, and in particular $W^t \pi = \pi$.
Rearranging and moving to work in the L2 (Euclidean) norm, we obtain $$\| W^t p - \pi \|_2^2 = \sum_{i \ge 2} \lambda_i^{2t} \langle p, v_i \rangle^2 \le \lambda^{2t} \sum_{i \ge 2} \langle p, v_i \rangle^2 \le \lambda^{2t} \| p \|_2^2 \le \lambda^{2t},$$
where the last step comes from the fact that $\|p\|_2 \le \|p\|_1 = 1$ for a probability vector $p$.
Since $\lambda = 1 - \gamma \le e^{-\gamma}$, this gives $\| W^t p - \pi \|_2 \le \lambda^t \le e^{-\gamma t}$.
However, what we really care about is the total variation distance, which is the quantity $\frac{1}{2} \| W^t p - \pi \|_1$.
Recall that for any vector $x \in \mathbb{R}^n$, Cauchy–Schwarz gives $\|x\|_1 \le \sqrt{n} \, \|x\|_2$.
To relate the L2 distance to the L1 distance, we can apply the above inequality to get $$d_{TV}(W^t p, \pi) = \frac{1}{2} \| W^t p - \pi \|_1 \le \frac{\sqrt n}{2} \| W^t p - \pi \|_2 \le \frac{\sqrt n}{2} e^{-\gamma t}.$$
So we set $t = \frac{1}{\gamma} \ln \frac{\sqrt n}{2\epsilon}$, which makes the right-hand side at most $\epsilon$ and completes the proof of Lemma 2.
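The bound from this proof can be checked numerically. A sketch on the lazy walk on a 9-cycle (an arbitrary small example of ours): the total variation distance at every step should sit below $\frac{\sqrt n}{2} \lambda^t$.

```python
import numpy as np

# Lazy walk on the 9-cycle (connected, and aperiodic thanks to laziness).
n = 9
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
W = 0.5 * (np.eye(n) + A / 2)

# lambda = largest absolute eigenvalue other than the top eigenvalue 1.
lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1][1]
pi = np.full(n, 1.0 / n)
p = np.zeros(n)
p[0] = 1.0

ok = True
pt = p
for t in range(1, 51):
    pt = pt @ W
    tv = 0.5 * np.abs(pt - pi).sum()
    # The spectral bound: d_TV <= (sqrt(n)/2) * lambda^t.
    ok = ok and tv <= 0.5 * np.sqrt(n) * lam ** t + 1e-12
```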
We say that a Markov chain is fast mixing if its mixing time is polylogarithmic in the size of its state space.
Expander Graphs
Lemma 2 motivates the following definition of expander graphs:

Definition (expander). A family of $d$-regular graphs (with $d$ fixed and $n$ growing) is a family of expander graphs if the spectral gap $\gamma$ of every graph in the family is bounded below by some constant $\gamma_0 > 0$, independent of $n$.
From what we have learnt so far, we know that an expander has to be well-connected in order to have a large spectral gap. Expander graphs can be used for derandomization, helping to reduce the number of random bits required by algorithms.
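As a concrete illustration (the Petersen graph, a 3-regular graph on 10 vertices; this specific example is our choice, not from the text above), we can compute the spectral gap directly:

```python
import numpy as np
from itertools import combinations

# The Petersen graph as the Kneser graph K(5,2): vertices are the 2-element
# subsets of {0,...,4}, with an edge whenever two subsets are disjoint.
verts = list(combinations(range(5), 2))
n = len(verts)                                # 10 vertices, each of degree 3
A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if not set(verts[i]) & set(verts[j]):
            A[i, j] = 1.0

W = A / 3
lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
gap = 1.0 - lam[1]                            # spectral gap = 1 - 2/3 = 1/3
```

The walk eigenvalues of the Petersen graph are $1$, $1/3$, and $-2/3$, so its spectral gap is $1/3$, a constant well away from zero.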