Scatterplots


Figure 1

Scatterplots are one of the oldest and most commonly used methods for projecting high-dimensional data to two dimensions. For a data set with N dimensions, N * (N - 1)/2 pairwise parallel projections are generated, each giving the viewer a general impression of the relationship between one pair of dimensions. The projections are generally arranged in a grid to help the user remember which dimensions each projection shows. Many variations on the scatterplot have been developed to increase the information content of the image and to provide tools that facilitate data exploration. These include rotating the data cloud [TUK:88], using different symbols to distinguish classes of data and occurrences of overlapping points, and using color or shading to provide a third dimension within each projection.
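The pairwise projection step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function name pairwise_projections, the dimension labels, and the sample values are all hypothetical, and no drawing is done.

```python
from itertools import combinations


def pairwise_projections(rows, names):
    """Return one 2-D projection per unordered pair of dimensions.

    rows  -- list of records, each a sequence with len(names) values
    names -- dimension labels, used to key the resulting panels
    """
    panels = {}
    for i, j in combinations(range(len(names)), 2):
        # Each panel keeps only columns i and j of every record.
        panels[(names[i], names[j])] = [(r[i], r[j]) for r in rows]
    return panels


# Illustrative values only (hypothetical, not the data shown in Figure 1).
names = ["a", "b", "c", "d"]
rows = [(1, 2, 3, 4), (5, 6, 7, 8)]
panels = pairwise_projections(rows, names)
print(len(panels))  # N * (N - 1) / 2 panels for N = 4
```

A grid display would place the panel for dimensions (i, j) at row i, column j, which is what lets the viewer read the dimension pair off the panel's position.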

Figure 1 presents a seven-dimensional data set using scatterplots. Note that plotting each dimension against itself along the diagonal provides distribution information on the individual dimensions. The data set contains statistics on crime in Detroit between 1961 and 1973, and consists of 13 data points. It was obtained via anonymous ftp from unix.hensa.ac.uk in the directory /pub/statlib/datasets. Some dimensions of the original set have been eliminated to facilitate display using scatterplots. Linear structures within several of the projections indicate some correlation between the two dimensions involved. For example, the number of full-time police, the number of homicides, and the number of government workers are correlated with one another (with a corresponding negative correlation with the percentage of cleared homicides).
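The linear structure visible in a projection can be quantified with the Pearson correlation coefficient of the two dimensions involved. The sketch below is one way to compute it; the helper name pearson and the sample values are hypothetical and are not taken from the Detroit data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant dimension")
    return cov / (spread_x * spread_y)


# Illustrative values only (hypothetical).
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # points fall on an upward line
falling = [10, 8, 6, 4, 2]   # points fall on a downward line
r_up = pearson(xs, rising)
r_down = pearson(xs, falling)
print(round(r_up, 6), round(r_down, 6))
```

Values near +1 or -1 correspond to the tight linear bands seen in a projection; values near 0 correspond to a diffuse cloud.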

One major limitation of scatterplots is that they are most effective with small numbers of dimensions, because increasing the dimensionality decreases the screen space available for each projection. Strategies for addressing this limitation include using three dimensions per plot or providing panning or zooming mechanisms. Other limitations include being generally restricted to orthogonal views and the difficulty of discovering relationships that span more than two dimensions. Advantages of scatterplots include ease of interpretation and relative insensitivity to the size of the data set.
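The screen-space limitation is simple arithmetic: in an N x N grid, each panel's side length shrinks in proportion to 1/N. The sketch below assumes a square window and a hypothetical 800-pixel width; panel_side and gap_px are illustrative names only.

```python
def panel_side(window_px, n_dims, gap_px=0):
    """Side length in pixels of each panel in an n_dims x n_dims grid."""
    usable = window_px - gap_px * (n_dims - 1)  # space left after the gaps
    return usable // n_dims


# Hypothetical 800-pixel window at three dimensionalities.
sides = {n: panel_side(800, n) for n in (4, 7, 20)}
for n, side in sides.items():
    print(n, "dimensions:", side, "pixels per panel side")
```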

We have extended flat scatterplots to hierarchical scatterplots. In hierarchical scatterplots, clusters are displayed instead of individual data points. In each plot, a cluster is represented by a point and a colored band around it. The point indicates the mean of the cluster and the band indicates its extent. Movie 1 is a multiresolution cluster display of hierarchical scatterplots.
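The per-cluster summary drawn in each plot can be sketched as follows. This is an assumption-laden illustration: the text does not say how the extent is defined, so the sketch assumes it is the minimum and maximum of the cluster's members along each axis, and the name cluster_summary and the sample points are hypothetical.

```python
def cluster_summary(points):
    """Summarize one cluster within a single 2-D projection.

    points -- list of (x, y) pairs belonging to the cluster
    Returns (mean, extent), where extent is ((x_min, x_max), (y_min, y_max)).
    Assumption: extent is the axis-aligned min/max range of the members.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean = (sum(xs) / len(xs), sum(ys) / len(ys))
    extent = ((min(xs), max(xs)), (min(ys), max(ys)))
    return mean, extent


# Illustrative values only (hypothetical).
mean, extent = cluster_summary([(1, 2), (3, 6), (5, 4)])
print(mean, extent)
```

A display would draw the point at mean and the band over the region described by extent, repeating this for every cluster in every projection.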
 

References

[TUK:88]:  Tukey, J.W., Fisherkeller, M.S., Friedman, J.H.  PRIM-9, an interactive multidimensional data display and analysis system.  Dynamic Graphics for Statistics (W.S. Cleveland and M.E. McGill, eds.), Wadsworth and Brooks, 1988.

[ward:94]:  Ward, M.  XmdvTool: Integrating multiple methods for visualizing multivariate data.  Proc. of Visualization '94, pp. 326-333, 1994.