After all conformations have been aligned, the PCA procedure can be used exactly as detailed above. The first step is to determine the average vector of the conformational vector set, so that it can be subtracted from all conformations to build the centered data matrix. This is done by computing an average conformation that contains the average of each of the 3N dimensions of the data set. It is important to note that this "average conformation" no longer represents a physically feasible molecule, since the Cartesian coordinates of all atoms are averaged over the entire data set. When this "average conformation" is subtracted from the aligned input conformations, the "centered" data become atomic displacements, not positions, and so building a covariance matrix for these data makes sense.
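The centering step described above can be sketched in a few lines. This is only an illustration in plain Python, under some assumptions that are not in the text: each conformation is stored as a flat list of 3N coordinates, the function names `center_conformations` and `covariance_matrix` are hypothetical, and the covariance is normalized by M - 1 (the sample covariance), which may differ from the normalization used earlier in this module. In practice a numerical library would be used instead of nested lists.

```python
def center_conformations(conformations):
    """Subtract the average conformation from each aligned conformation.

    `conformations` is a list of M flattened coordinate vectors, each of
    length 3N (x1, y1, z1, x2, ...). Returns (average, displacements).
    """
    m = len(conformations)
    dim = len(conformations[0])
    # Average each of the 3N coordinates over the whole data set.
    average = [sum(c[k] for c in conformations) / m for k in range(dim)]
    # Centered data: atomic displacements from the average conformation.
    displacements = [[c[k] - average[k] for k in range(dim)] for c in conformations]
    return average, displacements


def covariance_matrix(displacements):
    """Build the 3N x 3N covariance matrix of the centered data.

    Assumption: normalization by (M - 1), i.e. the sample covariance.
    """
    m = len(displacements)
    dim = len(displacements[0])
    return [
        [sum(d[i] * d[j] for d in displacements) / (m - 1) for j in range(dim)]
        for i in range(dim)
    ]


# Toy data (hypothetical values): three "conformations" of one atom, so 3N = 3.
toy = [[0.0, 1.0, 2.0], [2.0, 1.0, 2.0], [4.0, 1.0, 2.0]]
average, displacements = center_conformations(toy)
cov = covariance_matrix(displacements)
```

In this toy set only the x coordinate varies, so the only nonzero entry of the covariance matrix is the x-x term. The principal components would then be obtained as the eigenvectors of this matrix.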
The PCs for molecular conformation data can be used for the two purposes explained before, i.e., to obtain a low-dimensional representation of each point and to synthesize (or interpolate) new conformations by following the PCs. Now the PCs have a physical meaning: they represent the "main directions" followed by the molecule's 3N degrees of freedom or, in other words, the directions followed collectively by the atoms. Interpolating along a PC makes each atom follow a linear trajectory; the first PC corresponds to the direction of motion that explains the most data variance. For this reason, the PCs are often called Main Modes of Motion or Collective Modes of Motion when computed from molecular motion data. Interpolating along only the first few PCs has the effect of removing atomic "vibrations" that are normally irrelevant to the molecule's larger-scale motions, since the vibration directions carry little data variance and correspond to the last (least important) modes of motion. It is now possible to define a lower-dimensional subspace of protein motion spanned by the first few principal components and to project the initial high-dimensional data onto this subspace. Since the PCs are displacements (and not positions), interpolating conformations along the main modes of motion requires starting from one of the known structures and adding a multiple of the PCs as a perturbation. In mathematical terms, in order to produce conformations interpolated along the first PC one can compute:
$$\mathbf{c}_{\text{new}} = \mathbf{c} + \alpha\,\mathbf{PC}_1$$

where $\mathbf{c}$ is a molecular conformation from the (aligned) data set, $\mathbf{PC}_1$ is the first principal component, and the interpolating parameter $\alpha$ adds a deviation from the structure $\mathbf{c}$ along the main direction of motion. The parameter can be either positive or negative. However, it is important to keep in mind that large values of the interpolating parameter will stretch the molecule beyond physically acceptable shapes: because the PCs make all atoms follow straight lines, they fairly quickly destroy the molecule's topology. A typical way of improving the physical feasibility of interpolated conformations is to subject them to a few iterations of energy minimization using an empirical force field.
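As a rough illustration of this interpolation, the sketch below perturbs a reference conformation along a single principal component for a few values of the interpolating parameter. The function name `interpolate_along_pc`, the two-atom reference structure and the unit-length direction `pc1` are all made up for the example; a real PC would come from the eigenvectors of the covariance matrix.

```python
def interpolate_along_pc(conformation, pc, alpha):
    """Return conformation + alpha * pc, coordinate by coordinate.

    `conformation` and `pc` are flat lists of length 3N; `alpha` may be
    positive or negative.
    """
    if len(conformation) != len(pc):
        raise ValueError("conformation and PC must both have 3N entries")
    return [x + alpha * p for x, p in zip(conformation, pc)]


# Hypothetical two-atom structure (3N = 6) and a made-up unit-length PC
# that moves the two atoms in opposite directions along x.
reference = [0.0, 1.0, 2.0, 1.5, 1.0, 2.0]
pc1 = [0.6, 0.0, 0.0, -0.8, 0.0, 0.0]

# Sample the motion on both sides of the reference structure.
frames = [interpolate_along_pc(reference, pc1, alpha) for alpha in (-1.0, 0.0, 1.0)]
```

Each atom moves along a straight line as the parameter changes, which is why large parameter values distort bond lengths and why the resulting frames would usually be energy-minimized before further use.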
Although it is possible to determine as many principal components as there are original variables (3N), PCA is typically used to determine the smallest number of uncorrelated principal components that explain a large percentage of the total variation in the data, as quantified by the residual variance. The exact number of principal components to keep is application-dependent; the components that are kept constitute a truncated basis of representation.
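One common way to pick the number of components is to keep adding PCs until the explained fraction of the total variance passes a chosen threshold. The sketch below assumes the eigenvalues of the covariance matrix are already available; the function name `components_needed`, the 85% threshold and the eigenvalue spectrum are hypothetical choices for illustration, not values taken from this module.

```python
def components_needed(eigenvalues, target_fraction=0.85):
    """Smallest number of leading PCs explaining at least `target_fraction`
    of the total variance, together with the residual variance fraction.
    """
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    explained = 0.0
    for k, value in enumerate(ordered, start=1):
        explained += value
        if explained / total >= target_fraction:
            return k, 1.0 - explained / total
    # Only reached if the target cannot be met (e.g. target_fraction > 1).
    return len(ordered), 0.0


# Hypothetical eigenvalue spectrum: two dominant modes and a small tail.
spectrum = [6.0, 3.0, 0.5, 0.3, 0.2]
k, residual = components_needed(spectrum, target_fraction=0.85)
```

With this made-up spectrum the first two components already account for 90% of the variance, so two PCs are kept and about 10% of the variance is left as residual.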