There is one detail left to introduce before going into the mechanics of computing the principal components. Notice that in figure 3-a the data set is not shown at any particular position in 3D space; that is, the data set could be centered at any point in space, not necessarily around the origin. However, in order to compute the principal components, compute the low-dimensional projections, and synthesize new points, the data set is assumed to be centered in every direction. In other words, the data set needs to have its centroid removed before the computations. If new points need to be synthesized using the PCs, the centroid can be added back to place the newly synthesized points in the correct region of space.
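As a concrete illustration of the centering step, the following is a minimal NumPy sketch; the array name `points`, the $M \times n$ layout (one point per column), and the random test data are assumptions made here for illustration only.

```python
import numpy as np

# Hypothetical data: n = 100 points in M = 3 dimensions, one point
# per column (M x n), deliberately centered away from the origin.
rng = np.random.default_rng(0)
points = rng.normal(loc=5.0, scale=2.0, size=(3, 100))

# Remove the centroid so the data set is centered in every direction.
centroid = points.mean(axis=1, keepdims=True)   # M x 1 column vector
centered = points - centroid

# ... compute PCs, project, and synthesize new points here ...

# Add the centroid back to place synthesized points in the correct
# region of space (here a few centered points stand in for new ones).
synthesized = centered[:, :5] + centroid
```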
To compute the principal components, let $X$ be an $M \times n$ matrix that contains $n$ $M$-dimensional data points in its columns. Furthermore, assume that the mean along every dimension is zero, i.e., the data are centered. The goal is to find an $M \times M$ orthonormal transformation matrix $W$, with the PCs as its columns, such that:

$$Y = W^T X$$
We want the covariance matrix of the projected data, $C_Y$, to be a diagonal matrix, so we can write:

$$C_Y = \frac{1}{n} Y Y^T = \frac{1}{n} (W^T X)(W^T X)^T = W^T \left(\frac{1}{n} X X^T\right) W = W^T C_X W$$

where $C_X = \frac{1}{n} X X^T$ is the covariance matrix of the centered data.
So then, by multiplying by $W$ on the left and by $W^T$ on the right:

$$W C_Y W^T = W W^T C_X W W^T = C_X$$
This holds since $W W^T = I$, because $W$ is orthonormal. Note that the Singular Value Decomposition (SVD) of $C_X$ yields:

$$C_X = U \Sigma V^T$$
where $U$ and $V$ contain the left and right eigenvectors of $C_X$, and $\Sigma$ is a diagonal matrix holding the eigenvalues. But since $C_X$ is a symmetric matrix by construction, the left and right eigenvectors coincide, so $U = V$. So we can write:

$$C_X = U \Sigma U^T$$
This fact, together with the fact that both $W$ and $U$ are orthonormal, means that $W = U$ and $C_Y = \Sigma$ (because both $C_Y$ and $\Sigma$ are diagonal). In other words, the Principal Components, or PCs, of the data set are given by the eigenvectors of the covariance matrix of the original (centered) data. Moreover, the diagonal matrix of eigenvalues, $\Sigma$, is equal to the matrix $C_Y$, the covariance of the projected points $Y$. Since it is a diagonal matrix, its diagonal contains the variance of the projected data set along each dimension. By retrieving the eigenvectors (or PCs) in order of decreasing eigenvalue, the PCs that explain the most variance are automatically chosen before the rest. Since most SVD implementations return the eigenvectors and eigenvalues in this order anyway, the resulting PCs are already sorted by variance.
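The derivation above can be checked numerically. The sketch below, which assumes the $M \times n$ layout and illustrative variable names used here, forms the covariance matrix of centered data, takes its SVD, and verifies that the covariance of the projected points is diagonal with the eigenvalues on its diagonal.

```python
import numpy as np

# Hypothetical centered data set: M = 3 dimensions, n = 500 points.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)   # center the data

C_X = (X @ X.T) / X.shape[1]            # M x M covariance matrix

# SVD of the symmetric covariance matrix: C_X = U @ diag(s) @ Vt.
# The columns of U are the eigenvectors (the PCs) and s holds the
# eigenvalues, already sorted in decreasing order.
U, s, Vt = np.linalg.svd(C_X)

W = U                                   # PCs in the columns of W
Y = W.T @ X                             # projected points

# The covariance of the projections is diagonal and equals diag(s).
C_Y = (Y @ Y.T) / Y.shape[1]
assert np.allclose(C_Y, np.diag(s))
```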
It is important to note at this point that the eigenvalues correspond to the variance along each PC. By computing the ratio of each eigenvalue to the total sum of eigenvalues, one can obtain the fraction of the total variance explained by each PC when the data are projected onto it. Subtracting the sum of the variance fractions of the first $m$ PCs from 1, we obtain what is known as the residual variance, that is, the amount of variance in the original data left unexplained by discarding the PCs corresponding to the lowest $M - m$ eigenvalues:

$$R_m = 1 - \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{M} \lambda_i}$$
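Continuing in the same illustrative vein, the residual variance is a one-liner once the eigenvalues are available; the eigenvalue array below is made-up example data, not taken from the text.

```python
import numpy as np

# Hypothetical eigenvalues (variances along each PC), largest first,
# e.g. the array s returned by np.linalg.svd above.
eigenvalues = np.array([4.2, 1.1, 0.4, 0.2, 0.1])

explained = eigenvalues / eigenvalues.sum()   # variance fraction per PC

m = 2                                         # number of PCs retained
residual = 1.0 - explained[:m].sum()          # residual variance R_m
print(f"Residual variance after keeping {m} PCs: {residual:.4f}")
```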