Let E = \{e_1, \ldots, e_k\} be a set of k N-dimensional objects, where each e_i is the N-vector e_i = (e_{i1}, \ldots, e_{iN}). An m-partition P of E breaks E into m subsets P_1, \ldots, P_m satisfying the following: each P_i is non-empty, P_i \cap P_j = \emptyset for i \neq j, and P_1 \cup \cdots \cup P_m = E.
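The three partition conditions above can be checked mechanically; the following is a minimal sketch (the function name is our own, not from the paper), representing objects by hashable ids and subsets by Python sets:

```python
from itertools import combinations

def is_m_partition(E, parts):
    """Check the three m-partition conditions on a set E:
    non-empty subsets, pairwise disjointness, and union equal to E."""
    if any(len(p) == 0 for p in parts):
        return False                      # some P_i is empty
    if any(a & b for a, b in combinations(parts, 2)):
        return False                      # some P_i and P_j overlap
    return set().union(*parts) == set(E)  # the P_i must cover E

# Example: a valid 2-partition of six object ids, then an invalid one.
E = {0, 1, 2, 3, 4, 5}
print(is_m_partition(E, [{0, 1, 2}, {3, 4, 5}]))    # True
print(is_m_partition(E, [{0, 1}, {1, 2, 3, 4, 5}]))  # False: object 1 repeats
```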
A hierarchical clustering may be organized as a tree structure: let P_i be a component of P, instantiated by a tree node T_i, and let Q be an m-partition of P_i. The components of Q then form the children of T_i. In particular, if m is always 2, the resulting structure is a strictly binary tree.
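The node-per-component construction above can be sketched as follows; the class and method names are illustrative, not the paper's:

```python
class ClusterNode:
    """Tree node T_i for a partition component P_i. Its children are
    created from the components of an m-partition Q of P_i."""

    def __init__(self, objects):
        self.objects = objects   # the subset P_i this node represents
        self.children = []       # one child node per component of Q

    def split(self, partition):
        """Attach a child node for each component of the given
        m-partition of this node's objects."""
        # The components must exactly cover this node's objects.
        assert set().union(*partition) == set(self.objects)
        self.children = [ClusterNode(p) for p in partition]
        return self.children

# With m = 2 at every node, the tree is strictly binary.
root = ClusterNode({0, 1, 2, 3})
left, right = root.split([{0, 1}, {2, 3}])
left.split([{0}, {1}])
```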
There is a large body of literature dealing with hierarchical cluster construction. The actual method of tree construction is, however, not relevant to this paper: any method that builds a tree abiding by the above definitions could in principle serve as the tree construction scheme in our system.
However, most clustering algorithms are not appropriate for large datasets because they do not consider the case where the dataset can be too large to fit into memory. In such cases, there is a need to work with limited resources to perform clustering as accurately as possible while keeping I/O costs low. In recent years, a number of algorithms for clustering large datasets have been proposed [2,10,27]. We have adopted the Birch clustering algorithm [27] as our primary clustering technique, although our visualization would work equally well with other methods.
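Birch's memory efficiency rests on summarizing each cluster by a Clustering Feature (CF) triple (N, LS, SS): the point count, the linear sum of the points, and the sum of squared norms. The sketch below, a pure-Python illustration rather than the paper's implementation, shows the triple and its additivity property, which lets Birch merge clusters without revisiting the raw points:

```python
def cf(points):
    """Clustering Feature triple (N, LS, SS) for a list of equal-length
    point tuples: count, per-dimension linear sum, sum of squared norms."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[d] for p in points) for d in range(dim)]
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def merge(cf1, cf2):
    """CF additivity: the CF of a merged cluster is the
    component-wise sum of the two clusters' CF triples."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

a = cf([(1.0, 2.0), (3.0, 4.0)])
b = cf([(5.0, 6.0)])
# Merging the summaries equals summarizing the merged points.
assert merge(a, b) == cf([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
```

Because I/O touches each point once to build these summaries, the tree of CF triples fits in memory even when the dataset does not.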