
## Projection Pursuit: A Brief Overview

When examining data sets of any dimensionality, researchers are generally looking for subsets of the data that are "interesting", i.e., that display some measure of structure or departure from a normal distribution. This structure can take the form of trends, clusters, hypersurfaces, or anomalies. With high-dimensional data, however, the search is made difficult by the fact that high-dimensional space is typically very sparse; Huber (9) describes this as the "curse of dimensionality". In addition, structure may span any arbitrary subspace of the data, which increases the computational cost of locating it algorithmically and hampers the development of the visualization techniques needed to support the search.
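The sparsity of high-dimensional space can be made concrete with a small calculation. The sketch below assumes, purely for illustration, data spread uniformly over a unit hypercube; the function names are hypothetical. It shows how little of the volume a fixed-size neighborhood covers as the dimension grows, and how wide a neighborhood must become to capture a fixed share of the data.

```python
def fraction_in_subcube(edge, n_dims):
    """Fraction of the unit hypercube's volume inside an axis-aligned sub-cube of side `edge`."""
    return edge ** n_dims


def edge_for_fraction(fraction, n_dims):
    """Side length a sub-cube needs in order to enclose `fraction` of the unit hypercube."""
    return fraction ** (1.0 / n_dims)


# Assumption: uniformly distributed data, so volume fraction equals data fraction.
for n_dims in (1, 2, 5, 10, 20):
    covered = fraction_in_subcube(0.5, n_dims)
    needed = edge_for_fraction(0.01, n_dims)
    print(f"dims={n_dims:2d}  half-width cube covers {covered:.6f}  "
          f"edge needed for 1% of data: {needed:.3f}")
```

In ten dimensions, a cube spanning half of every axis holds less than a thousandth of the volume, while a cube holding just 1% of the data must span more than 60% of every axis.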

The traditional approach to examining these high-dimensional data sets is to reduce their dimensionality, usually by linear and/or nonlinear mapping or projection strategies (see Crawford and Fall (10)). Humans are very good at visual pattern recognition, and projecting the data set down to one-, two-, or even three-dimensional space allows this ability to be utilized. However, projection is a data-smoothing operation, since existing structure can only be obscured by a projection and never enhanced by it (see Friedman (11)).
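A minimal sketch of how a projection can obscure structure, using synthetic data invented for this illustration: two clusters that differ only along one axis remain separated when projected onto that axis and merge completely when projected onto the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example data: two clusters in 2-D that differ only along the x-axis.
left = rng.normal(loc=[-5.0, 0.0], scale=1.0, size=(500, 2))
right = rng.normal(loc=[5.0, 0.0], scale=1.0, size=(500, 2))
data = np.vstack([left, right])


def project(points, direction):
    """Project points onto the 1-D subspace spanned by `direction`."""
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)
    return points @ unit


def gap_between_cluster_means(projected):
    """Distance between the two cluster means after projection."""
    return abs(projected[:500].mean() - projected[500:].mean())


gap_x = gap_between_cluster_means(project(data, [1.0, 0.0]))  # clusters stay apart
gap_y = gap_between_cluster_means(project(data, [0.0, 1.0]))  # clusters merge
print(f"gap along x: {gap_x:.2f}, gap along y: {gap_y:.2f}")
```

The choice of projection direction therefore decides whether the clusters are visible at all, which is the motivation for searching over projections.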

The idea of projection pursuit is to locate the projection or projections from high- to low-dimensional space that reveal the most details about the structure of the data set. Once an interesting set of projections has been found, existing structures (clusters, surfaces, etc.) can be extracted and analyzed separately. There are two general approaches taken to projection pursuit: manual and automatic.

The most basic form of manual projection pursuit is the scatterplot, which in its simplest form is a two-dimensional display of the data over two selected dimensions at a time. It is quite simple to produce all $\binom{N}{2}$ pairwise scatterplots for an N-dimensional space and to analyze them. Unfortunately, this method only allows structure across the two plotted dimensions to be discovered. When the number of dimensions to analyze grows very large, other projection methods must be considered. See Crawford and Fall (10) for more details. Variations on the use of scatterplots include the PRIM-9 system (Tukey et al. (12)), PRIM-H (Donoho et al. (13)), and Orion (McDonald (14)).
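The number of distinct axis pairs in N dimensions is N(N-1)/2, so the set of pairwise scatterplots grows quadratically with N. A short sketch that enumerates the pairs (the function name is hypothetical):

```python
from itertools import combinations
from math import comb


def scatterplot_pairs(n_dims):
    """Return every unordered pair of axis indices for an n_dims-dimensional data set."""
    return list(combinations(range(n_dims), 2))


# Each pair corresponds to one two-dimensional scatterplot.
print(scatterplot_pairs(4))

# The count grows quadratically with the number of dimensions.
for n_dims in (4, 10, 50):
    print(n_dims, "dimensions ->", comb(n_dims, 2), "scatterplots")
```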

The main limitation of manual projection pursuit is the amount of time it takes to explore a given space exhaustively. Asimov's Grand Tour concept (15) calls for presenting projections of the data set in a sequence in which successive views differ so slightly that watching the sequence is similar to watching a movie. A complete search conducted this way would take approximately three hours for a four-dimensional space (see Huber (9)). Clearly, touring spaces of even higher dimensionality would be out of the question, unless perhaps one has some sense, based on a given projection, of how the view should change next to produce the desired result.

Friedman and Tukey devised a method (16) to automate the task of projection pursuit. They characterize a given projection by a numerical index that indicates the amount of structure present in it. This index can then be used as the basis for a heuristic search to locate the "interesting" projections. Different types of heuristic searches are suggested in Friedman and Tukey (16) and in Tukey and Tukey (17).
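The index-plus-search idea can be sketched as follows. This is not the index or search of Friedman and Tukey: as a stand-in, the sketch assumes absolute excess kurtosis as the projection index (close to zero for Gaussian data) and a plain random search over one-dimensional directions. The data set and all names are invented for illustration.

```python
import numpy as np


def projection_index(projected):
    """Assumed index: absolute excess kurtosis of a 1-D projection.

    It is close to 0 for Gaussian data and grows as the projection
    departs from normality.
    """
    z = (projected - projected.mean()) / projected.std()
    return abs(np.mean(z ** 4) - 3.0)


def random_search(data, n_trials=2000, seed=0):
    """Heuristic search: score random unit directions and keep the best one."""
    rng = np.random.default_rng(seed)
    best_dir, best_score = None, -np.inf
    for _ in range(n_trials):
        direction = rng.normal(size=data.shape[1])
        direction /= np.linalg.norm(direction)
        score = projection_index(data @ direction)
        if score > best_score:
            best_dir, best_score = direction, score
    return best_dir, best_score


# Synthetic 5-D data: Gaussian noise everywhere except axis 2,
# which holds two clusters centered at -3 and +3.
rng = np.random.default_rng(1)
n_points = 1000
data = rng.normal(size=(n_points, 5))
data[:, 2] = rng.choice([-3.0, 3.0], size=n_points) + rng.normal(scale=0.5, size=n_points)

best_dir, best_score = random_search(data)
print("best direction:", np.round(best_dir, 2))
print("index value:", round(best_score, 2))
```

A random search is the crudest possible heuristic; it is used here only because it keeps the sketch short. Any optimizer that can maximize the index over directions could take its place.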

Once structure has been found, it is then removed from the data. The data are then examined for further structure, which, if found, is also removed. This process continues until there is no remaining structure detectable within the data. A variety of ways to remove structure have been suggested; the interested reader should see Huber (9) and Friedman (11) for some significant examples.
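The find-remove-repeat loop can be sketched under strong simplifying assumptions. Structure removal is approximated here by deflation, i.e. subtracting each point's component along the direction just found; this is a simple stand-in and not one of the removal schemes described by Huber (9) or Friedman (11). The index (absolute excess kurtosis), the random search, the threshold, and all names are likewise assumptions made for illustration.

```python
import numpy as np


def projection_index(projected):
    """Assumed index: absolute excess kurtosis (about 0 for Gaussian data)."""
    z = (projected - projected.mean()) / projected.std()
    return abs(np.mean(z ** 4) - 3.0)


def random_direction(rng, n_dims, removed):
    """Random unit vector orthogonal to every previously removed direction."""
    direction = rng.normal(size=n_dims)
    for basis in removed:
        direction -= (direction @ basis) * basis
    return direction / np.linalg.norm(direction)


def best_direction(data, rng, removed, n_trials=3000):
    """Random search for the direction with the highest index value."""
    best_dir, best_score = None, -np.inf
    for _ in range(n_trials):
        direction = random_direction(rng, data.shape[1], removed)
        score = projection_index(data @ direction)
        if score > best_score:
            best_dir, best_score = direction, score
    return best_dir, best_score


def pursue_and_remove(data, threshold=1.0, seed=0):
    """Repeat: find the most structured direction, record it, deflate it."""
    rng = np.random.default_rng(seed)
    residual = data - data.mean(axis=0)
    removed, scores = [], []
    # Stop one short of the full dimensionality so a free direction always remains.
    for _ in range(data.shape[1] - 1):
        direction, score = best_direction(residual, rng, removed)
        if score < threshold:
            break  # nothing left that the index counts as structure
        removed.append(direction)
        scores.append(score)
        # Deflation: subtract each point's component along the found direction.
        residual = residual - np.outer(residual @ direction, direction)
    return removed, scores, residual


# Synthetic 3-D data: two clusters along axis 0, Gaussian noise elsewhere.
rng = np.random.default_rng(1)
n_points = 2000
data = rng.normal(size=(n_points, 3))
data[:, 0] = rng.choice([-3.0, 3.0], size=n_points) + rng.normal(scale=0.5, size=n_points)

removed, scores, residual = pursue_and_remove(data)
print("directions removed:", len(removed))
print("index values:", [round(s, 2) for s in scores])
```

The stopping threshold is arbitrary here. Choosing it well is an instance of the broader difficulty of deciding what counts as structure.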

Projection pursuit methods are a great step forward in high-dimensional data analysis, although according to Crawford and Fall (10) they have many limitations. One of the most common problems is the difficulty of determining what the results of automatic projection pursuit methods (typically projection index values) actually mean. Also, most projection pursuit software cannot make inferences, and so can be fooled by false structure. Finally, it is difficult, if not impossible, to specify algorithmically what constitutes structure in data in general.

Matthew Ward
1999-02-23