<< Chapter < Page | Chapter >> Page > |
If the distance measure between points has the metric properties as explained above, then n Euclidean coordinates can always be found for a set of n points, such that the Euclidean distance between them matches the original distances. In order to obtain these Euclidean coordinates, such coordinates are assumed to have a mean of zero, i.e., they are centered. In this way, the matrix of pairwise distances can be converted into a matrix of dot products by squaring the distances and performing a "double centering" on them to produce the matrix of pairwise dot products:
Let's assume that n Euclidean coordinates exist for each data point and that such coordinates can be placed in a matrix . Then, multiplying by its transpose should equal the matrix of dot products computed before. Finally, in order to retrieve the coordinates in the unknown matrix , we can perform an SVD of which can be expressed as:
Where the left and right singular vectors coincide because is a symmetric matrix. The diagonal matrix of eigenvalues can be split into two identical matrices, each having the square root of the eigenvalues, and by doing this a solution for can be found:
Since has been computed through an SVD, the coordinates are ordered by data variance like in PCA, so the first component of is the most important, and so on. Keeping this in mind, the dimensionality reduction can be performed by ignoring the higher dimensions in and only keeping the first few, like in PCA.
Geodesic distances. The notion of "geodesic" distance was originally defined as the length of a geodesic path, where the geodesic path between two points on the surface of the Earth is considered the "shortest path" since one is constrained to travel on the Earth's surface. The concept of geodesic distance can be generalized to any mathematical surface, and defined as "the length of the shortest path between two points that lie on a surface, when the path is constrained to lie on the surface." Figure 10 shows a schematic of this generalized concept of geodesic path and distance.
The Isomap algorithm. The Isomap method augments classical MDS with the notion of geodesic distance, in order to capture the non-linearity of a data set. To achieve its goal, Isomap uses MDS to compute few Euclidean coordinates that best preserve pairwise geodesic distances , rather than direct distances . Since the coordinates computed by MDS are Euclidean, these can be plotted on a Cartesian set of axes. The effect of Isomap is similar to "unrolling" the non-linear surface into a natural parameterization, as depicted in figure 11. Isomap approximates the geodesic distances from the data itself, by first building a neighborhood graph for the data. A neighborhood graph consists of the original set of points, together with a connection between "neighboring" points (neighboring points are defined as the closest points to a given point, as determined by the distance measure used) as shown in figure 11-b. After the neighborhood graph has been built, it can be used to approximate the geodesic distance between all pairs of points as the shortest path distance along the graph (red path in figure 11-b). Naturally, the sampling of the data set has to be enough to capture the inherent topology of the non-linear space for this approximation to work. MDS then takes these geodesic distances and produces the Euclidean coordinates for the set. Figure 11-c shows two Euclidean coordinates for each point in the example.
Notification Switch
Would you like to follow the 'Geometric methods in structural computational biology' conversation and receive update notifications?