Let E = \{e_1, \ldots, e_k\} be a set of k N-dimensional objects, where each e_i is the N-vector e_i = (e_{i1}, \ldots, e_{iN}). An m-partition P of E breaks E into m subsets P_1, \ldots, P_m satisfying the following: each P_i is non-empty, P_i \cap P_j = \emptyset for i \neq j, and P_1 \cup \cdots \cup P_m = E.
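The three partition conditions above can be checked mechanically; the following is a minimal sketch (the function name is our own, not from the paper), representing objects by hashable ids and subsets by Python sets:

```python
from itertools import combinations

def is_m_partition(E, parts):
    """Check the three m-partition conditions on a set E:
    non-empty subsets, pairwise disjointness, and union equal to E."""
    if any(len(p) == 0 for p in parts):
        return False                      # some P_i is empty
    if any(a & b for a, b in combinations(parts, 2)):
        return False                      # some P_i and P_j overlap
    return set().union(*parts) == set(E)  # the P_i must cover E

# Example: a valid 2-partition of six object ids, then an invalid one.
E = {0, 1, 2, 3, 4, 5}
print(is_m_partition(E, [{0, 1, 2}, {3, 4, 5}]))    # True
print(is_m_partition(E, [{0, 1}, {1, 2, 3, 4, 5}]))  # False: object 1 repeats
```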
A hierarchical clustering may be organized as a tree structure: let P_i be a component of P, instantiated by a tree node T_i, and let Q be an m-partition of P_i. The components of Q then form the children of T_i. In particular, if m is always 2, the resulting structure is a strictly binary tree.
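The node-per-component construction above can be sketched as follows; the class and method names are illustrative, not the paper's:

```python
class ClusterNode:
    """Tree node T_i for a partition component P_i. Its children are
    created from the components of an m-partition Q of P_i."""

    def __init__(self, objects):
        self.objects = objects   # the subset P_i this node represents
        self.children = []       # one child node per component of Q

    def split(self, partition):
        """Attach a child node for each component of the given
        m-partition of this node's objects."""
        # The components must exactly cover this node's objects.
        assert set().union(*partition) == set(self.objects)
        self.children = [ClusterNode(p) for p in partition]
        return self.children

# With m = 2 at every node, the tree is strictly binary.
root = ClusterNode({0, 1, 2, 3})
left, right = root.split([{0, 1}, {2, 3}])
left.split([{0}, {1}])
```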
There is a large body of literature dealing with hierarchical cluster construction. The actual method of tree construction is, however, not relevant to this paper: any method that builds a tree abiding by the above definitions could in principle serve as the tree construction scheme in our system.
However, most clustering algorithms are not appropriate for large datasets because they do not consider the case where the dataset can be too large to fit into memory. In such cases, there is a need to work with limited resources to perform clustering as accurately as possible while keeping I/O costs low. In recent years, a number of algorithms for clustering large datasets have been proposed [2,10,27]. We have adopted the Birch clustering algorithm [27] as our primary clustering technique, although our visualization would work equally well with other methods.
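Birch's memory efficiency rests on summarizing each cluster by a Clustering Feature (CF) triple (N, LS, SS): the point count, the linear sum of the points, and the sum of squared norms. The sketch below, a pure-Python illustration rather than the paper's implementation, shows the triple and its additivity property, which lets Birch merge clusters without revisiting the raw points:

```python
def cf(points):
    """Clustering Feature triple (N, LS, SS) for a list of equal-length
    point tuples: count, per-dimension linear sum, sum of squared norms."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[d] for p in points) for d in range(dim)]
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def merge(cf1, cf2):
    """CF additivity: the CF of a merged cluster is the
    component-wise sum of the two clusters' CF triples."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

a = cf([(1.0, 2.0), (3.0, 4.0)])
b = cf([(5.0, 6.0)])
# Merging the summaries equals summarizing the merged points.
assert merge(a, b) == cf([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
```

Because I/O touches each point once to build these summaries, the tree of CF triples fits in memory even when the dataset does not.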