
There is one detail left to introduce before going into the mechanics of computing the principal components. Notice that in figure 3-a the data set is not shown at any particular position in 3D space; that is, the data set could be centered at any point in space, not necessarily around the origin (0, 0, 0). However, in order to compute the principal components, compute the low-dimensional projections, and synthesize new points, the data set is assumed to be centered in every direction. In other words, the data set needs to have its centroid removed before the computations. If new points need to be synthesized using the PCs, the centroid can be added back to place the newly synthesized points in the correct region of space.
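As a minimal sketch of this centering step (in Python with NumPy; the example data and variable names are illustrative assumptions, not part of the original text):

    import numpy as np

    # Hypothetical example: n = 100 points in M = 3 dimensions, stored in
    # the columns of X (shape M x n), centered nowhere near the origin.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 100)) + np.array([[5.0], [2.0], [-1.0]])

    centroid = X.mean(axis=1, keepdims=True)   # M x 1 centroid
    X_centered = X - centroid                  # remove the centroid before PCA

    # After the PCs and projections are computed from X_centered, a newly
    # synthesized point is placed back in the correct region of space by
    # adding the centroid, e.g.:
    #   x_new = P[:, :d] @ y + centroid        # P, d, y as defined below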

To compute the principal components, let $X$ be an $M \times n$ matrix that contains $n$ $M$-dimensional data points in its columns. Furthermore, assume that the mean along every dimension is zero, i.e., the data are centered. The goal is to find an $M \times M$ orthonormal transformation matrix $P$ containing the PCs in its columns, such that:

  • $Y = P^T X$, where the columns of $Y$ are the projections onto the PCs.
  • $P P^T = I$, that is, $P$ is orthonormal.
  • $Y Y^T = D$, the covariance matrix of the projected points $Y$, is a diagonal matrix, so that the resulting projections are uncorrelated.
The resulting covariance matrix $Y Y^T$ can be re-written as:

$$Y Y^T = (P^T X)(P^T X)^T = P^T X X^T P$$
We want $Y Y^T$ to be a diagonal matrix $D$, so we can write:

$$P^T X X^T P = D$$
So then, multiplying by $P$ on the left and by $P^T$ on the right, and using the fact that $P P^T = I$:

$$X X^T = P D P^T$$

Note that the Singular Value Decomposition (SVD) of $X X^T$ yields:

$$X X^T = V S W^T$$
where $V$ and $W$ contain the left and right eigenvectors of $X X^T$, and $S$ is a diagonal matrix holding the eigenvalues. But since $X X^T$ is a symmetric matrix by construction, the left and right eigenvectors coincide, so $W = V$ and we can write:

$$X X^T = V S V^T$$
This fact, together with the fact that $P$ and $V$ are both orthonormal, means that $P = V$ and $D = S$ (because both $D$ and $S$ are diagonal). In other words, the principal components (PCs) of the data set $X$ are given by the eigenvectors of the covariance matrix $X X^T$ of the original (centered) data. Moreover, the diagonal matrix of eigenvalues, $S$, is equal to the matrix $D$, the covariance matrix of the projected points $Y$; its diagonal therefore contains the variance of the projected data set along each PC. By retrieving the eigenvectors (or PCs) in order of decreasing eigenvalue, the PCs that explain the most variance are automatically chosen first. Since most SVD computations return the eigenvectors and eigenvalues in this order anyway, the resulting PCs are already sorted by variance.
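The derivation above translates directly into a few lines of code. The following sketch (Python with NumPy; the function and variable names are our own assumptions, not from the original text) uses a symmetric eigendecomposition of $X X^T$, which coincides with its SVD as noted above:

    import numpy as np

    def pca(X):
        # X: M x n matrix with n M-dimensional points in its columns.
        centroid = X.mean(axis=1, keepdims=True)
        Xc = X - centroid                    # center the data first
        C = Xc @ Xc.T                        # covariance matrix X X^T (M x M)
        # eigh handles symmetric matrices and returns eigenvalues in
        # ascending order, so flip to get decreasing variance.
        eigvals, eigvecs = np.linalg.eigh(C)
        order = np.argsort(eigvals)[::-1]
        S = eigvals[order]                   # variances along each PC
        P = eigvecs[:, order]                # PCs in the columns of P
        return P, S, centroid

    # Usage: project onto the PCs and check the projections are uncorrelated.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 200))
    P, S, centroid = pca(X)
    Y = P.T @ (X - centroid)                 # Y = P^T X
    assert np.allclose(Y @ Y.T, np.diag(S), atol=1e-8)   # Y Y^T = D is diagonal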

It is important to note at this point that the eigenvalues correspond to the variance along each PC. By computing the ratio of each eigenvalue $s_i$ to the total sum, one obtains the fraction of the total variance explained by each PC when the data are projected onto it. Subtracting the sum of the variance fractions of the first $d$ PCs from 1, we obtain what is known as the residual variance $r_d$, that is, the amount of variance in the original data left unexplained by discarding the PCs corresponding to the smallest $M - d$ eigenvalues:

$$r_d = 1 - \frac{\sum_{i=1}^{d} s_i}{\sum_{i=1}^{M} s_i}$$
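A short sketch of this computation (again Python with NumPy, continuing from the pca function above; the names and example eigenvalues are illustrative assumptions):

    import numpy as np

    def residual_variance(S, d):
        # S: eigenvalues (variances along the PCs) sorted in decreasing order.
        S = np.asarray(S, dtype=float)
        explained = S[:d].sum() / S.sum()   # variance fraction of first d PCs
        return 1.0 - explained              # r_d

    # Example: eigenvalues of a hypothetical 3D data set.
    S = np.array([9.0, 0.8, 0.2])
    print(residual_variance(S, d=1))        # ~0.10: first PC explains ~90%
    print(residual_variance(S, d=2))        # ~0.02: first two PCs explain ~98%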
Source: OpenStax, Geometric methods in structural computational biology. OpenStax CNX, Jun 11, 2007. Download for free at http://cnx.org/content/col10344/1.6