Abstract
Molecular dynamics simulation methods produce trajectories of atomic positions (and optionally velocities and energies) as a function of time and provide a representation of the sampling of a given molecule's energetically accessible conformational ensemble. As simulations on the 10-100 ns time scale become routine, with sampled configurations stored on the picosecond time scale, such trajectories contain large amounts of data. Data-mining techniques, such as clustering, provide one means to group and make sense of the information in the trajectory. In this work, several clustering algorithms were implemented, compared, and utilized to understand MD trajectory data. The development of the algorithms into a freely available C code library and their application to a simple test example of random (or systematically placed) points in a 2D plane (where the pairwise metric is the distance between points) provide a means to understand the relative performance. Eleven different clustering algorithms were developed, ranging from top-down splitting (hierarchical) and bottom-up aggregating (including single-linkage edge joining, centroid-linkage, average-linkage, complete-linkage, centripetal, and centripetal-complete) to various refinement (means, Bayesian, and self-organizing maps) and tree (COBWEB) algorithms. Systematic testing in the context of MD simulations of various DNA systems (including DNA single strands and the interaction of a minor groove binding drug, DB226, with a DNA hairpin) allows a more direct assessment of the relative merits of the distinct clustering algorithms. Additionally, means to assess the relative performance of and differences between the algorithms, to dynamically select the initial cluster count, and to achieve faster data mining by "sieved clustering" were evaluated. Overall, it was found that there is no one perfect "one size fits all" algorithm for clustering MD trajectories and that the results strongly depend on the choice of atoms used for the pairwise comparison. Some algorithms tend to produce homogeneously sized clusters, whereas others have a tendency to produce singleton clusters. Issues related to the choice of pairwise metric, clustering metrics, the atom selection used for the comparison, and the relative performance of the algorithms are discussed. Overall, the best performance was observed with the average-linkage, means, and SOM algorithms. If the cluster count is not known in advance, the hierarchical or average-linkage clustering algorithms are recommended. Although these algorithms perform well, it is important to be aware of the limitations or weaknesses of each algorithm, specifically the high sensitivity to outliers with hierarchical clustering, the tendency to generate homogeneously sized clusters with means, and the tendency to produce small or singleton clusters with average-linkage.
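As a rough illustration of the bottom-up (agglomerative) approach the abstract describes, the sketch below applies average-linkage clustering to a handful of 2D points, with the pairwise metric simply the Euclidean distance between points, mirroring the paper's 2D test case. This is a hypothetical toy written in C (the language of the library mentioned above), not the actual library code; the point coordinates, point count, and target cluster count are made up for the example.

```c
/* Toy average-linkage (bottom-up) clustering of 2D points.
 * Illustrative O(N^3) sketch only; not the paper's library implementation. */
#include <stdio.h>
#include <math.h>

#define NPOINTS 12          /* made-up example size        */
#define TARGET_CLUSTERS 3   /* made-up target cluster count */

static double dist(const double *p, const double *q)
{
    double dx = p[0] - q[0], dy = p[1] - q[1];
    return sqrt(dx * dx + dy * dy);
}

/* Average pairwise distance between clusters ca and cb, given point->cluster labels. */
static double avg_linkage(double pts[][2], const int label[], int n, int ca, int cb)
{
    double sum = 0.0;
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (label[i] != ca) continue;
        for (int j = 0; j < n; j++) {
            if (label[j] != cb) continue;
            sum += dist(pts[i], pts[j]);
            count++;
        }
    }
    return count ? sum / count : HUGE_VAL;  /* empty clusters are never merged */
}

int main(void)
{
    /* Arbitrary coordinates forming three loose groups. */
    double pts[NPOINTS][2] = {
        {0.1, 0.2}, {0.3, 0.1}, {0.2, 0.4}, {0.4, 0.3},
        {5.0, 5.1}, {5.3, 4.9}, {4.8, 5.4}, {5.1, 5.2},
        {9.0, 0.5}, {9.3, 0.2}, {8.8, 0.7}, {9.1, 0.4}
    };
    int label[NPOINTS];
    int nclusters = NPOINTS;

    for (int i = 0; i < NPOINTS; i++)   /* start with every point as its own cluster */
        label[i] = i;

    while (nclusters > TARGET_CLUSTERS) {
        double best = HUGE_VAL;
        int best_a = -1, best_b = -1;
        /* find the pair of clusters with the smallest average linkage */
        for (int a = 0; a < NPOINTS; a++)
            for (int b = a + 1; b < NPOINTS; b++) {
                double d = avg_linkage(pts, label, NPOINTS, a, b);
                if (d < best) { best = d; best_a = a; best_b = b; }
            }
        /* merge cluster best_b into cluster best_a */
        for (int i = 0; i < NPOINTS; i++)
            if (label[i] == best_b) label[i] = best_a;
        nclusters--;
    }

    for (int i = 0; i < NPOINTS; i++)
        printf("point %2d (%.1f, %.1f) -> cluster %d\n", i, pts[i][0], pts[i][1], label[i]);
    return 0;
}
```

Complete-linkage and single-linkage differ only in the merge criterion (the maximum or minimum pairwise distance instead of the average), and the "sieved" variant mentioned in the abstract would cluster only a subset of frames this way and then assign the remaining frames to the nearest existing cluster.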
Publication Info
- Year: 2007
- Type: Article
- Volume: 3
- Issue: 6
- Pages: 2312-2334
- Citations: 850
- Access: Closed
Identifiers
- DOI: 10.1021/ct700119m