Size-and-Shape Space Gaussian Mixture Models for Structural Clustering of Molecular Dynamics Trajectories

H Klem and GM Hocky and M McCullagh, JOURNAL OF CHEMICAL THEORY AND COMPUTATION, 18, 3218-3230 (2022).

DOI: 10.1021/acs.jctc.1c01290

Determining the optimal number and identity of structural clusters from an ensemble of molecular configurations continues to be a challenge. Recent structural clustering methods have focused on the use of internal coordinates due to the innate rotational and translational invariance of these features. The vast number of possible internal coordinates necessitates a feature space supervision step to make clustering tractable but yields a protocol that can be system type-specific. Particle positions offer an appealing alternative to internal coordinates but suffer from a lack of rotational and translational invariance, as well as a perceived insensitivity to regions of structural dissimilarity. Here, we present a method, denoted shape-GMM, that overcomes the shortcomings of particle positions using a weighted maximum likelihood alignment procedure. This alignment strategy is then built into an expectation maximization Gaussian mixture model (GMM) procedure to capture metastable states in the free-energy landscape. The resulting algorithm distinguishes between a variety of different structures, including those indistinguishable by root-mean-square displacement and pairwise distances, as demonstrated on several model systems. Shape-GMM results on an extensive simulation of the fast-folding HP35 Nle/Nle mutant protein support a four-state folding/unfolding mechanism, which is consistent with previous experimental results and provides kinetic details comparable to previous state-of-the-art clustering approaches, as measured by the VAMP-2 score. Currently, training of shape-GMMs is recommended for systems (or subsystems) that can be represented by ≲200 particles and ≲100k configurations to estimate high-dimensional covariance matrices and balance computational expense. Once a shape-GMM is trained, it can be used to predict the cluster identities of millions of configurations.
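The idea of removing rotation and translation before clustering on particle positions can be illustrated with a minimal Python sketch. It is not the shape-GMM implementation: it swaps in a uniform-weight Kabsch superposition for the weighted maximum likelihood alignment described in the abstract, and a hard nearest-mean assignment for the expectation maximization GMM. The function names (`kabsch_align`, `assign_clusters`, `random_rotation`) and the synthetic two-state data are hypothetical and chosen only for illustration.

```python
import numpy as np


def kabsch_align(mobile, reference):
    """Rigidly superpose mobile (N x 3) onto reference (N x 3), uniform weights.

    Returns the centered, rotated copy of mobile and the centered reference.
    Assumption: uniform weights, not the weighted alignment of shape-GMM.
    """
    mob_c = mobile - mobile.mean(axis=0)      # remove translation
    ref_c = reference - reference.mean(axis=0)
    h = mob_c.T @ ref_c                       # 3 x 3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))    # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return mob_c @ rot.T, ref_c


def assign_clusters(frames, means):
    """Hard-assign each frame to the mean with the smallest aligned residual."""
    labels = []
    for frame in frames:
        residuals = []
        for mean in means:
            aligned, mean_c = kabsch_align(frame, mean)
            residuals.append(np.sum((aligned - mean_c) ** 2))
        labels.append(int(np.argmin(residuals)))
    return labels


def random_rotation(rng):
    """Draw a proper rotation matrix (det = +1) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0
    return q


# Synthetic data: two hypothetical "states", each a random 10-particle shape.
rng = np.random.default_rng(0)
means = [rng.normal(size=(10, 3)) for _ in range(2)]
frames, truth = [], []
for k in (0, 1, 1, 0):
    noisy = means[k] + 0.01 * rng.normal(size=(10, 3))
    # Each frame is arbitrarily rotated and translated, as in a raw trajectory.
    frames.append(noisy @ random_rotation(rng).T + rng.normal(size=3))
    truth.append(k)

print(assign_clusters(frames, means) == truth)
```

Assigning a frame to the mean with the smallest aligned residual corresponds to the maximum likelihood choice only under equally weighted, spherical, equal-variance Gaussians. The method in the abstract instead estimates covariances and component weights during training.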
