fastcluster: Fast hierarchical clustering routines for R and Python
fastcluster: Fast hierarchical clustering routines for R and Python
Contents
1 Introduction
A common task in unsupervised machine learning and data analysis in general is clustering. This means a method to partition a discrete metric space into sensible subsets. The exact setup and procedures may vary, but the general idea is to group data points with similar features together. This might reveal some structure in the data set, or one can simplify the data set by dealing with entire clusters instead of many individual data points.
An important subset of clustering techniques are hierarchical clustering schemes. These do not produce a prescribed number of clusters but a so-called dendrogram, which allows the user to decide on a reasonable number of clusters based on the algorithmic output and then partition the data accordingly. This workflow also complies with the paradigm in topological data analysis that often an overview over the data on all scales is helpful to detect features which might be present at different scales.
From a practical point of view, clustering is used by many people in different jobs, ranging from basic research to commerce. Nevertheless, the hierarchical clustering schemes were implemented in a largely sub-optimal way in the standard software, to say the least. Even R, which is the most widely used statistical software, does not use the most efficient algorithms in the several packages that have been made for hierarchical clustering.
The fastcluster package implements the seven common hierarchical clustering schemes efficiently. The package is made with two interfaces to standard software: R and Python, which should already cover a big part of the scientific community. (A Matlab interface should pose very little to no problems at all—any help or contribution with that is appreciated.)
For a first overview over what performance improvements to expect, look at the graphs below. Note the log-log scale, which compresses improvements by orders of magnitude into a single diagram.
2 Technical key facts
- The fastcluster package is a C++ library for hierarchical (agglomerative) clustering on data with a dissimilarity index. It efficiently implements the seven most widely used clustering schemes: single, complete, average, weighted, Ward, centroid and median linkage. (The “weighted” distance update scheme (Matlab, SciPy) is also called “mcquitty” in R.)
- The fastcluster library currently has interfaces to two languages: R and Python/SciPy.
- The interfaces are designed as drop-in replacements for the existing routines. Once the fastcluster library is loaded at the beginning of the code, every program that uses hierarchical clustering can benefit immediately and effortlessly from the performance gain.
- The fastcluster package is licensed under the GPLv3 and published on CRAN.
- The following comparison chart shows the asymptotic worst-case time complexity of various implementations in terms of the number of data points N. The input is a matrix of pairwise dissimilarities of size N⋅(N−1)/2.
Compared to the empirical figure below, no theoretical bounds are available for Matlab Release 2010b. The reason is that the algorithm has apparently changed in that release, at least for single linkage, but in contrast to Release 2010a the source code is not available any more.
Method fastcluster R:
agnesR:
flashClustR:
hclustSciPy Matlab
R2010asingle Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) complete Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) average Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) weighted Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) Ward Θ(N²) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) centroid Θ(N³) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) median Θ(N³) Θ(N³) Θ(N³) Θ(N³) Θ(N³) Ω(N³) - Regarding the best-case complexity, only the “fastcluster“ and “flashClust” packages achieve the theoretically optimal bound of Θ(N²). SciPy, “agnes” and “hclust” have a best-case complexity of only Θ(N³). In Matlab Release 2010a, the situation is more complicated: While the clustering step itself has a best-case complexity of Θ(N²), the “Ward”, “centroid” and “median“ methods ensure beforehand that the N data points can be embedded in N-dimensional Euclidean space. This sanity check involves finding the eigenvalues of a dense (N×N) matrix, which outweighs the actual clustering step. Also, the theoretical performance of single linkage is seriously hampered in Matlab Release 2010a by an unfortunate caching strategy. See the performance tests below on how this affects a variety of data sets.
- All of the above implementations use Θ(N²) memory, except the single linkage routine in the fastcluster package: Apart from the Θ(N²) input data, which is not overwritten, it uses only Θ(N) temporary memory.
- The figure below shows performance comparisons between the various packages. Note that the fastcluster package is not only faster for big data sets but the overhead for small data sets is also smaller. Hence, the improved asymptotic and worst-case complexity does not come with a trade-off for small input sizes here.
Click on the links below to see the comparisons for all seven methods.
← Previous: Median linkage Single linkage
Next: Complete linkage → The colored bands show maximum and minimum time over a variety of data sets. The synthetic data sets are samples drawn in an iid. manner from various mixtures of Gaussian distributions in Euclidean space of various dimensions. The results were obtained on a PC with an Intel dual-core CPU T7500 with 2.2 GHz clock speed and 4GB of RAM. The operating system was Ubuntu 10.10 64-bit (Ubuntu 10.04 64-bit for Matlab R2010a). R version: 2.12.2, flashClust version: 1.01, Python version: 2.6.6, NumPy version 1.3.0, SciPy version: 0.7.2.
3 Download and installation
Installation procedures were tested by the author under 64-bit Ubuntu. CRAN also hosts precompiled binaries (of the R library, not the Python module) for Windows and MacOS X. In principle, it should be possible to install the fastcluster package on any system that has a C++ compiler and R respectively Python with NumPy. There are no unusual libraries needed to compile the package, only the STL library, which every C++ compiler should have by default.
Please send me feedback if you accomplish to install the fastcluster package on a certain platform but needed to tweak the configuration! I will update the installation instructions and modify the package if needed (eg. include the right compiler flags for various operating systems).
Installation for R
Enter the command install.packages("fastcluster") in R, and R will download the package automatically, then install it. That's it!
If this does not work, please consult R's help function by typing ?INSTALL from within R or read the “R installation and administration” manual.
For manual download, you can get the fastcluster package from here: Download page at CRAN.
You may need to start R with administrator rights to be able to install packages. There are ways to install R packages without administrator privileges in your user directories. See this help page for example.
Installation for Python
Make sure that you have Python and NumPy installed. Download the fastcluster package from the R archive here: Download page at CRAN. Open a terminal, go to the directory with the downloaded file and extract the contents of the archive with:
tar -xvf fastcluster_(version).tar.gzAlternatively, use your favorite archive manager for unpacking, eg. on Windows. This will generate a new directory fastcluster. Switch to its python subdirectory:cd fastcluster/python/Now compile the library and install the Python module by:python setup.py installYou may need to precede this command with sudo or install the package in your home directory. For example, the linepython setup.py install --home=~installs the package in the path ~/lib/python/. See the chapter “Installing Python modules” in the Python documentation for further help.
4 Usage
4.1 R
In R, load the package with the following command:
library('fastcluster')The package overwrites the function hclust from the “stats” package (in the same way as the flashClust package does). Please remove any references to the flashClust package in your R files to not accidentally overwrite the hclust function with the flashClust version.The new hclust function has exactly the same calling conventions as the old one. You may just load the package and immediately and effortlessly enjoy the performance improvements. The function is also an improvement to the flashClust function from the “flashClust” package. Just replace every call to flashClust by hclust and expect your code to work as before, only faster. (If you are using flashClust prior to version 1.01, update it! See the change log for flashClust.)
The agnes function from the “cluster“ package does a bit more than hclust. If you do not need the extra functionality, you may also wish to replace agnes by fastcluster's hclust for higher speed.
If you need to access the old function or make sure that the right function is called, specify the package as follows:
stats::hclust(…) fastcluster::hclust(…) flashClust::hclust(…)WARNING
R and Matlab/SciPy use different conventions for the “Ward”, “centroid” and “median” methods. R assumes that the dissimilarity matrix consists of squared Euclidean distances, while Matlab and SciPy expect non-squared Euclidean distances. The fastcluster package respects these conventions and uses different formulas in the two interfaces.
If you want the same results in both interfaces, then feed R with the entry-wise square of the distance matrix, d^2, for the “Ward”, “centroid” and “median” methods and later take the square root of the height field in the dendrogram. For the “average” and “weighted” alias “mcquitty” methods, you must still take the same distance matrix d as in the Python interface for the same results. The “single” and “complete” methods only depend on the relative order of the distances, hence it does not make a difference whether one operates on the distances or the squared distances.
The code example in the R documentation (enter ?hclust or example(hclust) in R) contains an instance where the squared distance matrix is generated from Euclidean data.
4.2 Python
The fastcluster package is imported as usual by
import fastclusterIt provides the following functions:linkage(X, method='single', metric='euclidean', preserve_input=True) single(X) complete(X) average(X) weighted(X) ward(X) centroid(X) median(X)The argument X is either a compressed distance matrix or a collection of N observation vectors in D dimensions as an ( N×D) array. Apart from the argument preserve_input, the methods have the same input and output as the functions of the same name in the package scipy.cluster.hierarchy. Therefore, I do not duplicate the documentation and refer to the SciPy documentation here and here for further details.The additional, optional argument preserve_input specifies whether the fastcluster package first copies the distance matrix or writes into the existing array. If you generate the distance matrix only for the clustering step and do not need it afterwards, you may save half the memory by saying preserve_input=False.
Note that the input array X contains unspecified values after this procedure. You may want to write
linkage(X, method="…", preserve_distance=False) del Xto make sure that you do not accidentally use the matrix X after it has been used as scratch memory.The single linkage algorithm does not write to the distance matrix or its copy anyway, so the preserve_distance flag has no effect in this case.