The research I conducted this summer as a 2015 Data Science Initiative Fellow focused on developing algorithms for calculating the rates of transitions among spatial configurations of complex biomolecules in molecular dynamics computer simulations. Many chemical properties of biomolecules, such as DNA and proteins, are determined by their configurations, i.e., the relative spatial arrangement of their atoms. The first project I completed (figure 1) was an algorithm for calculating time correlation functions from Milestoning calculations. Milestoning is a preexisting method that calculates transition rates between configurations by measuring how long the system takes to move between hyperplane interfaces in the space of possible configurations. In this context, time correlation functions quantify how long it takes a molecule to “forget” the configuration in which it started. This quantity is key for comparing configurational data from molecular dynamics simulations with experimental data produced using NMR spectroscopy, one of the best experimental methods for measuring the rates of configurational transitions. The algorithm carries out random walks on a discrete configuration space graph and uses those walks to generate the time-dependent probability of being at each vertex. This is the first algorithm devised for calculating time correlation functions from Milestoning data.
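The random-walk idea described above can be illustrated with a minimal sketch. It is not the implementation from this project: the three-vertex graph, the jump probabilities `K`, the mean waiting times `mean_wait`, the observable `A`, and the choice of exponentially distributed waiting times are all assumptions made for illustration. In an actual Milestoning calculation the transition probabilities and lifetimes would be estimated from the simulation data.

```python
import random


def simulate_walk(K, mean_wait, start, t_grid, rng):
    """Return the vertex occupied at each observation time for one walk.

    K[i][j] is the probability of jumping from vertex i to vertex j and
    mean_wait[i] is the mean time spent at vertex i before jumping.
    Exponential waiting times are an assumption of this sketch.
    """
    state = start
    next_jump = rng.expovariate(1.0 / mean_wait[state])
    occupied = []
    for t_obs in t_grid:  # t_grid must be sorted in increasing order
        while next_jump <= t_obs:
            # Pick the next vertex according to row `state` of K.
            state = rng.choices(range(len(K)), weights=K[state])[0]
            next_jump += rng.expovariate(1.0 / mean_wait[state])
        occupied.append(state)
    return occupied


def time_correlation(K, mean_wait, p0, A, t_grid, n_walks=2000, seed=0):
    """Estimate C(t) = <A(0) A(t)> and the vertex occupancy P_i(t).

    p0 is the initial probability of each vertex and A[i] is the value
    of the observable at vertex i. Both estimates are averages over
    n_walks independent random walks.
    """
    rng = random.Random(seed)
    n = len(K)
    corr = [0.0] * len(t_grid)
    occupancy = [[0.0] * n for _ in t_grid]
    for _ in range(n_walks):
        start = rng.choices(range(n), weights=p0)[0]
        path = simulate_walk(K, mean_wait, start, t_grid, rng)
        for k, vertex in enumerate(path):
            corr[k] += A[start] * A[vertex]
            occupancy[k][vertex] += 1.0
    corr = [c / n_walks for c in corr]
    occupancy = [[c / n_walks for c in row] for row in occupancy]
    return corr, occupancy


# Hypothetical toy system: a linear chain of three configurations.
K = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
mean_wait = [1.0, 1.0, 1.0]   # arbitrary time units
p0 = [1.0, 0.0, 0.0]          # every walk starts at vertex 0
A = [1.0, 0.0, -1.0]          # illustrative observable
t_grid = [0.0, 0.5, 1.0, 2.0, 5.0, 20.0]

corr, occupancy = time_correlation(K, mean_wait, p0, A, t_grid)
```

For this toy chain with exponential waiting times, the exact correlation function is exp(-t), so the estimate in `corr` should decay from 1 toward 0 as the walks lose memory of their starting vertex, up to sampling noise.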
The second project was a fully automated method for both finding the physically important regions of a molecule’s configuration space and defining the milestone hyperplanes that best bound these regions. Often only one configuration of a biomolecule of interest is known, typically obtained from X-ray crystallography experiments. This known configuration can be used as a starting point for molecular dynamics simulations that determine other possible configurations of the molecule, but the molecular motions in these simulations are often too complex for a human to recognize the important configurational changes, let alone define the milestone hyperplanes that best capture the transitions.
Motivated by this, an algorithm was devised that consists of two subroutines: first, a search of the configuration space (figure 2), in which mutually repulsive clones of the system explore the space; and second, a milestone designation step (figure 3), in which the pattern recognition capabilities of machine learning are used to define clusters of the computer-generated configurations and then to define hyperplane interfaces between these clusters to serve as milestones. Figure 2 shows the configuration search step, where the search in configuration space is modeled as motion on a two-dimensional potential. This is akin to randomly kicking a ball around a landscape, where raising the temperature is analogous to increasing the strength of the kicks. The mutually repulsive clone exploration outperformed both unassisted sampling and sampling run at an artificially high temperature: it explored the entire space without blurring the distinct configurational subspaces, i.e., regions of similar configuration (latitude and longitude) and energy (elevation). Figure 3 shows the milestone designation step applied to both our two-dimensional model system and our molecular system, alanine dipeptide. While a clustering algorithm was able to distinguish the 10 different configurational subspaces (top left) that define the configuration space graph shown top right, the elevated-temperature configuration data yielded just three subspaces (not shown). For the molecular system, a three-dimensional configuration space was devised, composed of the distances between the carbon atoms shown in blue, green, and red (bottom left). The molecular dynamics governing the configurational fluctuations of this system were then simulated, and these pairwise distances were recorded. Shown bottom right is the result of using a clustering algorithm to group the points into distinct sets, the interfaces between which could then be used in Milestoning simulations.
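The milestone designation step can be sketched roughly as follows. The clustering algorithm used in the project is not specified here, so this example substitutes plain k-means with a farthest-point initialization and takes the perpendicular bisector between two cluster centroids as the hyperplane interface. Both choices are assumptions for illustration, as are the helper names and the synthetic three-dimensional points, which merely stand in for sampled pairwise distances.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))


def kmeans(points, k, n_iter=50):
    """Plain k-means (Lloyd's algorithm) with farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Add the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def bisecting_hyperplane(ci, cj):
    """Hyperplane {x : normal . x = offset} midway between two centroids."""
    normal = tuple(b - a for a, b in zip(ci, cj))
    midpoint = tuple((a + b) / 2.0 for a, b in zip(ci, cj))
    offset = sum(n * m for n, m in zip(normal, midpoint))
    return normal, offset


def signed_side(x, normal, offset):
    """Negative on the ci side of the hyperplane, positive on the cj side."""
    return sum(n * xi for n, xi in zip(normal, x)) - offset


# Synthetic stand-in data: two well-separated blobs in three dimensions.
rng = random.Random(1)


def blob(center, n):
    return [tuple(rng.gauss(c, 0.3) for c in center) for _ in range(n)]


points = blob((1.0, 1.0, 1.0), 100) + blob((4.0, 4.0, 4.0), 100)
centroids, labels = kmeans(points, 2)
normal, offset = bisecting_hyperplane(centroids[0], centroids[1])
```

A change in the sign of `signed_side` along a trajectory would then mark a crossing of the candidate milestone. With more than two clusters, one such hyperplane could be defined for each pair of neighboring clusters.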
This project has laid some theoretical groundwork and provided a proof of concept for a fully automated, machine learning-based method for calculating configurational kinetics from molecular dynamics simulations. The next step is to develop a full software implementation suitable for large-scale simulations of complex biomolecules composed of thousands of atoms. I am incredibly grateful to Prof. Andricioaei and Prof. Butts for their guidance on this project, and to the Data Science Initiative for giving me the opportunity to begin this line of scientific inquiry, which I plan to pursue further in the near future.