Diffusion maps is one of many nonlinear dimensionality reduction techniques that aim to find a meaningful low-dimensional representation of a high-dimensional data set. My interest in diffusion maps in particular is that in recent years it has been applied (1) to the analysis (2) of molecular simulation trajectories with very interesting results (3) – we will specifically work through such an example in a later post. The idea behind dimensionality reduction in general is to take a data set containing points in $n$-dimensional space and find the “best” representation, or embedding, of the data in $m$-dimensional space, where $m < n$. In other words, if each data point is an $n$-dimensional vector, then we are seeking a way to construct a new representation of that same data point using only an $m$-dimensional vector which captures the most relevant information. This is why dimensionality reduction has also been termed “feature extraction”. It is also a type of unsupervised learning: we are not attempting to learn a relationship between input and output variables, but rather to find a way to re-represent the same inputs.
Now, depending on how one chooses to define the “best representation” in $m$ dimensions, we get the various different methods that exist. In fact, this is a non-trivial detail: the assumptions behind this “best” representation can dictate whether a method will be effective for our data. One of the earliest methods, which arguably sparked the growth of this entire field, is known as isometric feature mapping, or isomap for short. I highly recommend giving the original paper (4) a read; it’s short and very neat! An important aspect of diffusion maps, isomap, and other nonlinear dimensionality reduction methods is that the resulting low dimensional representation is a nonlinear function of the original coordinates. This is in stark contrast to PCA, which finds new coordinates that are linear combinations of the original coordinates. We will see, in a short while, how linear dimensionality reduction can limit our ability to extract useful dimensions.
Central to almost all nonlinear dimensionality reduction (also called manifold learning) methods is the so-called manifold hypothesis. The idea is that even though the data we have in hand can be very high dimensional (think of a molecular trajectory with $3N$ coordinates, or an image with $N \times M$ pixels), the region of that space containing meaningful data is much smaller, and there is in fact a low dimensional manifold along which this data lies. To help illustrate this I’ve included an image below.
If we are interested in digit recognition, and we have a black-and-white image consisting of, let’s say, $N \times M$ pixels, then there are $2^{NM}$ possible images of that size. Surely, however, the number of images that contain meaningful information such as numeric digits (left exhibit) is far smaller than the number of possible images out there. In fact, it is most likely that the majority of the possible images are noisy garbage (right exhibit). Thus we can conclude that there is some lower dimensional manifold along which images consisting of digits lie, and that variation along this manifold does a good job of characterizing differences in the images we visualize. Finally, for this to be the case, meaningful images must occupy only a sparse region of the high dimensional space. This is, in essence, the manifold hypothesis.
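To get a feel for just how large that space is, here is a quick back-of-the-envelope check. The $28 \times 28$ image size is an assumption chosen purely for illustration (it happens to be the size of the MNIST digit images):

```python
# Count the possible black-and-white (binary) images of a given size.
# 28x28 is an illustrative assumption, not a value from the text above.
n_pixels = 28 * 28
n_images = 2 ** n_pixels  # Each pixel is either on or off.

# The count is astronomically large -- a 237-digit number.
print(len(str(n_images)))  # -> 237
```

Only a vanishingly small fraction of those images could ever be mistaken for a handwritten digit.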
The Swiss Roll
A second important assumption made by most manifold learning methods (we’ve upgraded our terminology – hooray!) is local Euclidean-ness. This means that even though the data lies along some globally nonlinear manifold, Euclidean distance is a good measure of distance along that manifold within the local vicinity of a point. We are essentially saying that if we zoom into the manifold closely enough, it looks approximately like a hyperplane, so distances along it can be measured using the Euclidean norm. Because of this assumption, if we have two data points that are sufficiently “close” (more on closeness later), then we can use this local positional information as a starting point for reconstructing a global nonlinear picture. Think of it another way: if two data points are very close along the manifold, they are also very close in Euclidean distance, but the converse is not necessarily true – two points can be close in Euclidean distance yet far apart along the manifold.
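A quick sanity check of this idea on the simplest curved manifold there is, the unit circle, where the true manifold (arc) distance between two points is known exactly:

```python
import numpy as np

# On the unit circle, the manifold (arc) distance between angles 0 and theta
# is exactly theta, while the Euclidean (chord) distance is 2*sin(theta/2).
# As theta shrinks, the chord/arc ratio approaches 1: locally, Euclidean
# distance is a good stand-in for distance along the manifold.
for theta in [1.0, 0.1, 0.001]:
    chord = 2 * np.sin(theta / 2)
    print(f"theta = {theta}: chord/arc = {chord / theta:.6f}")
```

At `theta = 1.0` the chord already undershoots the arc by about 4%, while at `theta = 0.001` the two are indistinguishable to six decimal places.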
A toy benchmark commonly used to test out manifold learning algorithms is known as the “Swiss roll”, which we’ll be playing with moving forward. The Swiss roll is basically a two dimensional sheet rolled up and embedded in three dimensions. A good manifold learning algorithm should be able to “unroll” the sheet in two dimensions. We can generate the sheet programmatically (5) by sampling points according to $(x, y, z) = (t\cos t,\ h,\ t\sin t)$, where $t \in [\frac{3\pi}{2}, \frac{9\pi}{2}]$ and $h \in [0, h_{\max}]$. This yields an unfolded roll of length $L$ and height $h_{\max}$, where $L = \int_{3\pi/2}^{9\pi/2} \sqrt{1 + t^2}\,dt \approx 89$. This is what it looks like in Python:
```python
import numpy as np

n = 2000      # Number of points to generate.
noise = 0.1   # Amount of noise.
h = 30 * np.random.rand(n, 1)  # Height of the sheet.
t = (3 * np.pi / 2) * (1 + 2 * np.random.rand(n, 1))  # Parameter along the roll.
X = np.hstack((t * np.cos(t), h, t * np.sin(t))) + noise * np.random.rand(n, 3)
```
Here we choose to generate two thousand points, add a bit of noise, and go for a height of $h_{\max} = 30$. This puts our length-to-height ratio at $L/h_{\max} \approx 3$, a factor that will turn out to be important later on. Visualizing this as a 3D scatter plot we get the image below.
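In case you’d like to reproduce the figure yourself, here is a minimal matplotlib sketch; it regenerates `X` and `t` exactly as in the snippet above, and the colormap choice is just a cosmetic assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

# Regenerate the Swiss roll from the snippet above.
n, noise = 2000, 0.1
h = 30 * np.random.rand(n, 1)
t = (3 * np.pi / 2) * (1 + 2 * np.random.rand(n, 1))
X = np.hstack((t * np.cos(t), h, t * np.sin(t))) + noise * np.random.rand(n, 3)

# 3D scatter, colored by the parameter t along the roll.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t.ravel(), cmap="viridis", s=5)
plt.show()
```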
As we can see, the color indicates the path along the length of the sheet. Let’s highlight a few key points. Each point on the Swiss roll is described using three values $(x, y, z)$, so our data is 3-dimensional. However, though the data is represented in three dimensions, we know that it in fact lies along a two dimensional sheet – our manifold – which is nonlinear and embedded in a third dimension. We can also see from the figure that the Euclidean distance between a point in dark blue and one in medium green is not a good measure of the true distance along the rolled up sheet. The distance along the true manifold – the Swiss roll in our case – is known as the geodesic distance. On the other hand, for points that are very close together, the Euclidean distance is approximately equal to the geodesic distance. Thus the Swiss roll is an ideal benchmark for dimensionality reduction methods: it satisfies all the necessary assumptions and is easy to generate and visualize.
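We can make this concrete. Ignoring the noise, a point on the roll is $(t\cos t,\ h,\ t\sin t)$, and because the sheet is just a flat plane rolled up, the geodesic distance between two points is the Euclidean distance in the unrolled $(s, h)$ coordinates, where $s(t) = \int_0^t \sqrt{1+u^2}\,du$ is the arc length along the spiral. A sketch comparing the two distances (the function names are my own):

```python
import numpy as np

def arc_length(t):
    # Closed form of s(t) = integral of sqrt(1 + u^2) du from 0 to t.
    return 0.5 * (t * np.sqrt(1 + t**2) + np.arcsinh(t))

def euclidean(t1, h1, t2, h2):
    # Straight-line distance between two noiseless points on the roll.
    p1 = np.array([t1 * np.cos(t1), h1, t1 * np.sin(t1)])
    p2 = np.array([t2 * np.cos(t2), h2, t2 * np.sin(t2)])
    return np.linalg.norm(p1 - p2)

def geodesic(t1, h1, t2, h2):
    # Distance measured on the unrolled sheet.
    return np.hypot(arc_length(t2) - arc_length(t1), h2 - h1)

# Two points one full turn apart at the same height: the straight line
# cuts through empty space, while the geodesic winds along the sheet.
t1, t2 = 3 * np.pi / 2, 3 * np.pi / 2 + 2 * np.pi
print(euclidean(t1, 0, t2, 0), geodesic(t1, 0, t2, 0))

# Two nearby points: the two distances nearly agree.
print(euclidean(t1, 0, t1 + 0.01, 0), geodesic(t1, 0, t1 + 0.01, 0))
```

The first pair of numbers differs by nearly an order of magnitude; the second pair matches to several decimal places, exactly the local Euclidean-ness we assumed earlier.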
Principal Component Analysis
Before we get into diffusion mapping, let’s look at what PCA gives us. Recall that PCA seeks to find a projection direction of the original data such that the variance in the data is maximized. This represents the first principal component. The next principal component is orthogonal to the first and captures the second largest amount of variance, and so on. In practice, the principal components can be computed as the eigenvalue/eigenvector pairs of the covariance matrix, sorted by descending eigenvalue, or through a singular value decomposition of the data matrix. Let’s perform PCA on the Swiss roll and see what we get. We are interested in the first two principal components. The Python code below uses scikit-learn.
```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

Y = PCA(n_components=2).fit_transform(X)
plt.scatter(Y[:, 0], Y[:, 1], c=t.ravel())
plt.show()
```
Well, that was easy. What does our plot look like?
Notice how the first and second principal components correspond approximately to the height axis and one of the original rolled-up axes. This is because PCA is only capable of performing linear projections of the data. In that sense, PCA cannot extract relationships that lie along nonlinear manifolds, and as a result it convolutes data that should be spread out in low dimensional space.
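For completeness, here is the eigendecomposition route described above as a from-scratch sketch. The function name and the random test data are my own; on any data set it should agree with scikit-learn’s result up to the sign of each component:

```python
import numpy as np

def pca(X, n_components=2):
    # Center the data, then diagonalize the covariance matrix.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order.
    order = np.argsort(eigvals)[::-1]       # Sort descending by variance.
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                  # Project onto the top components.

# Illustrative stand-in data: 3-D points with unequal spread per axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.5])
Y = pca(X, n_components=2)
print(Y.shape)  # -> (100, 2)
```

The first returned column carries the most variance, the second the next most, mirroring what `sklearn.decomposition.PCA` does internally via SVD.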
This is where diffusion maps and other manifold learning methods come into play. Our focus is primarily on diffusion maps for the reasons stated previously. But before we can map diffusion we must first understand the principles behind it, now that we have a better understanding of dimensionality reduction in general. We shall pick this up in a follow-up post. Until next time…