In Principle

In the previous post, we discussed the general concepts behind manifold learning and some key assumptions in their development. We then looked at the Swiss roll example and showed how PCA, a linear method, was unable to successfully unroll the sheet. Now we will talk about diffusion maps specifically, and see if it does a better job at unrolling the sheet. Most of my discussion below follows this publication (1). Both the original diffusion mapping method proposed by Coifman and Lafon (2) and later developments are a bit more general than what we will be covering. My focus again is on the approach as it pertains to the analysis of trajectories from molecular simulation, and the description below should suffice.

The objective of diffusion mapping, as with other manifold learning algorithms, is to construct the best low-dimensional embedding of high-dimensional input data. Given some measure of similarity between observations which offers a good representation of local connectivity, the diffusion map reconstructs a global representation of the intrinsic manifold. For something like the Swiss roll, similarity can be described using the Euclidean distance. For molecular trajectories, this may be the rotationally and translationally minimized root mean square deviation (RMSD) – we will discuss this in our next post.

Once a similarity metric is chosen, pairwise distances between all observations are computed and soft-thresholded using a Gaussian kernel. Given a pairwise distance function d(i,j), we can construct a matrix \mathbf{A} with elements,

    \[ A_{ij} = \exp\left(-\frac{d(i,j)^2}{2\epsilon}\right)\ \  i,j=1,\ldots,N. \]
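As a concrete illustration, here is a minimal NumPy sketch of this construction; the function name is my own, and d is assumed to be an N \times N array of pairwise distances:

import numpy as np

def kernel_matrix(d, eps):
    # Apply the Gaussian kernel elementwise to the pairwise distance matrix.
    return np.exp(-d**2 / (2. * eps))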

Applying the Gaussian function has the effect of retaining only points that are “close”, where the length scale of “close” is set by \sqrt{\epsilon}. This parameter \epsilon is sometimes referred to as the “kernel bandwidth”. Recall from the previous post our discussion of the “locally Euclidean” assumption. Here, setting the value of \epsilon and applying the Gaussian kernel precisely controls the maximum distance along the yet-to-be-discovered manifold which is well captured by our distance measure d(i,j). Points separated by more than this distance are deemed too distant to meaningfully characterize the manifold, and are effectively discarded. To drive this point home, I’ve plotted below the Gaussian kernel as a function of distance for a few values of \epsilon.
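If you would like to reproduce the plot, a quick matplotlib sketch along these lines will do; the distance range and bandwidth values are my own choices:

import numpy as np
import matplotlib.pyplot as plt

r = np.linspace(0., 10., 500)
for eps in [0.25, 1., 4.]:
    # Kernel value as a function of distance for this bandwidth.
    plt.plot(r, np.exp(-r**2 / (2. * eps)), label=r'$\epsilon = %g$' % eps)
plt.xlabel('d(i, j)')
plt.ylabel('kernel value')
plt.legend()
plt.show()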

[Figure: the Gaussian kernel as a function of distance for a few values of \epsilon]

As you can see, the length at which the kernel decays is approximately \sqrt{\epsilon}. Now the first question that comes up is: how do we choose \epsilon? Conveniently, Coifman et al. proposed a heuristic: generate a plot of \log\left(\sum_{ij} A_{ij}\right) vs. \log(\epsilon) and choose a value of \epsilon from within the linear regime. We will look at how to do this once we tackle the Swiss roll example. This process of selection ensures that each data point is well connected by a series of “small hops” of \mathcal{O}(\sqrt{\epsilon}) in the distance metric. Ideally, each data point or snapshot can be connected to every other snapshot through hops along entries of \mathbf{A}, which allows diffusion mapping to synthesize a single global approximation of the underlying manifold. If the data is not well connected, only isolated embeddings of each distinct region will be produced.

The next step is to normalize each row of the \mathbf{A} matrix to yield a new \mathbf{M} matrix, which is a right-stochastic Markov transition matrix. The elements of \mathbf{M} are then,

    \[ M_{ij} = \frac{A_{ij}}{\sum_{k=1}^{N}{A_{ik}}}\ \ i,j=1,\ldots,N. \]
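In code this normalization is a one-liner; here is a sketch, with A the kernel matrix from above:

# Row-normalize A so that each row of M sums to one.
M = A / A.sum(axis=1, keepdims=True)

# Sanity check: every row should be a probability distribution.
assert np.allclose(M.sum(axis=1), 1.)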

By doing this we are basically saying that the process we are trying to describe is Markovian. The last step is to obtain the eigenvalues and eigenvectors of the \mathbf{M} matrix. Note that the first eigenpair will be trivial, with eigenvalue \lambda_1 = 1 and eigenvector \vec{\mathbf{\Psi}}_1 = \vec{1}. The remaining top eigenvectors represent the slowest diffusive modes of the system, or the principal components of the manifold along which the long-time dynamics of the system evolve. There are a few interesting questions that we have not answered in this brief overview of diffusion maps, such as how to estimate the intrinsic dimensionality of the manifold. In the case of the Swiss roll we know the sheet is two-dimensional, but in general we do not have any a priori knowledge of how many dimensions we should consider. I will let the answers to these questions emerge naturally from the more complex example we will address in the next post. For now, this concludes our basic outline of diffusion maps.
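As a quick aside before moving on to practice, the trivial eigenpair is easy to verify numerically. Here is a short sketch; note that np.linalg.eig does not sort its output, so we order the eigenpairs ourselves:

# Eigendecompose M and sort the eigenpairs by decreasing eigenvalue.
W, V = np.linalg.eig(M)
idx = np.argsort(-W.real)
W, V = W.real[idx], V.real[:, idx]

# The leading eigenvalue is 1 and its eigenvector is constant.
assert np.isclose(W[0], 1.)
assert np.allclose(V[:, 0], V[0, 0])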

In Practice

It turns out that a basic implementation of diffusion maps is not very complicated. We will pick up where we left off in the last example where we attempted to use PCA to unroll the sheet without success. Our first step towards diffusion mapping will be to compute the pairwise distances between the points in our data matrix \mathbf{X}. There’s a convenient function in scikit-learn that we can use:

import numpy as np
from sklearn import metrics

# Pairwise Euclidean distances between the rows of X.
d = metrics.euclidean_distances(X)

Our next task will be figuring out the optimal choice of \epsilon. Let’s generate a plot of \log\left(\sum_{ij} A_{ij}\right) vs. \log(\epsilon). Here I will use another built-in function in scikit-learn which applies a Gaussian kernel (also called a radial basis function kernel) to the pairwise distances, and vary the value of \epsilon.

# Values of epsilon (powers of 2) to scan.
eps = np.power(2., np.arange(-10., 14., 1.))

# Pre-allocate array containing sum(Aij) for each epsilon.
Aij = np.zeros(eps.shape)

# Loop through values of epsilon and evaluate the matrix sum.
for i, e in enumerate(eps):
    A = metrics.pairwise.rbf_kernel(X, gamma=1./(2.*e))
    Aij[i] = A.sum()
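To actually render the log-log plot, a short matplotlib snippet suffices; the axis labels are my own:

import matplotlib.pyplot as plt

plt.loglog(eps, Aij, 'o-')
plt.xlabel(r'$\epsilon$')
plt.ylabel(r'$\sum_{ij} A_{ij}$')
plt.show()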

[Figure: log-log plot of \sum_{ij} A_{ij} vs. \epsilon]

We can see that the linear region of the plot above starts to turn over around \epsilon \approx 8, but to be conservative we take \epsilon = 4. In my limited experience, it is generally better to err on the side of a smaller value. Next, let’s generate the Markov matrix \mathbf{M}, compute the top three eigenvectors, and plot the second and third (recall that the first is trivial).

# From the plot above we see that 4 is a good choice.
eps = 4.

# Generate the final matrix A, and the row-normalized matrix M.
A = metrics.pairwise.rbf_kernel(X, gamma=1./(2.*eps))
M = A/A.sum(axis=1, keepdims=True)

# Get the eigenvalues/vectors of M. Note that np.linalg.eig does not
# sort its output, so we sort the eigenpairs by decreasing eigenvalue.
W, V = np.linalg.eig(M)
idx = np.argsort(-W.real)
W, V = W.real[idx], V.real[:, idx]

# Normalize by the first (trivial) eigenvector.
V = V/V[:, [0]]
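To produce the plot, we can scatter the second eigenvector against the third. Here I color the points by the arc-length parameter t returned by the Swiss roll generator in the previous post; treat that variable name as an assumption:

import matplotlib.pyplot as plt

# Scatter the second and third diffusion coordinates, colored by the
# position of each point along the roll.
plt.scatter(V[:, 1], V[:, 2], c=t, s=5)
plt.xlabel(r'$\Psi_2$')
plt.ylabel(r'$\Psi_3$')
plt.show()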

[Figure: diffusion map embedding of the Swiss roll, \epsilon = 4]

What?! I bet you expected to see a nicely unwrapped sheet, didn’t you? Maybe we chose a poor value of \epsilon? Let’s scan through some values of \epsilon and see what we get.
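One way to run such a scan is to simply repeat the steps above for several bandwidths; a sketch, where the particular \epsilon values are my own choice:

# Recompute and plot the embedding for several kernel bandwidths.
for eps in [0.25, 1., 4., 16., 64.]:
    A = metrics.pairwise.rbf_kernel(X, gamma=1./(2.*eps))
    M = A / A.sum(axis=1, keepdims=True)
    W, V = np.linalg.eig(M)
    idx = np.argsort(-W.real)
    V = V.real[:, idx]
    plt.scatter(V[:, 1], V[:, 2], c=t, s=5)
    plt.title(r'$\epsilon = %g$' % eps)
    plt.show()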

[Figure: panel of diffusion map embeddings of the Swiss roll for several values of \epsilon]

Something very interesting is going on here. First of all, notice how for small values of \epsilon the color follows the “arc” perfectly. This means that the arc parameterizes the distance along the sheet quite well. However, at small values we are effectively getting a one-dimensional arc embedded in two dimensions. Yet, as \epsilon increases and the arc starts to spread out into a sheet, it also curls over and overlaps rather than completely unrolling. So what gives?! Well, because our Swiss roll is long and narrow (L/H \approx 3), by the time \epsilon is on the proper scale to capture the sheet height, we start including points that are too distant for our local approximation to remain valid. In other words, for a long sheet there is such a disparity in length scales that diffusion mapping extracts only the slowest mode (along the length of the sheet); the other mode, along the height of the sheet, cannot be resolved. I don’t see this as a shortcoming of diffusion mapping, but more a matter of understanding and interpretation. Diffusion mapping did its job; it just doesn’t match our initial expectation, and that doesn’t mean it’s wrong.

If we generate a sheet with a larger height relative to its length (L/H \approx 2), we get a much better result.
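One simple way to generate such a roll by hand is sketched below; the function and its parameters are my own construction, not necessarily how the sheet used here was produced:

import numpy as np

def make_roll(n=2000, height=30., seed=42):
    # Swiss roll with adjustable height; a larger height lowers L/H.
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1. + 2. * rng.random(n))  # angle along the spiral
    y = height * rng.random(n)                   # coordinate along the height
    X = np.column_stack([t * np.cos(t), y, t * np.sin(t)])
    return X, t

Running the same diffusion mapping procedure on this roll gives the following: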

[Figure: diffusion map embedding of the L/H \approx 2 Swiss roll]

A much nicer sheet indeed. Of course, the choice of \epsilon could be refined further, but I think this proves the point. That sums up the idea behind diffusion mapping. As we have seen, it is quite powerful, and so long as we know what to expect from it, it can be a very useful tool. I have put up an IPython notebook for the exercises in both this post and the previous one, which can be found here. Next time we will use diffusion maps to analyze trajectories from molecular simulation and attempt to reproduce results from a paper. I hope you found this useful! Until then…


1. Ferguson AL, Panagiotopoulos AZ, Kevrekidis IG, Debenedetti PG (2011) Nonlinear dimensionality reduction in molecular simulation: The diffusion map approach. Chemical Physics Letters 509(1–3):1–11. [Source]
2. Coifman RR, Lafon S (2006) Diffusion maps. Applied and Computational Harmonic Analysis 21(1):5–30. [Source]