Friday, January 19, 2018

Tutorial 1: ubiquitin unfolding analysis

If CONAN is a bit too daunting to use first, the following tutorial can hopefully help you. Do the following two steps:


  1. Download CONAN, version 1.0, from: https://github.com/HITS-MBM/conan/archive/v1.0.tar.gz . Unzip and move all files inside to a folder of your choice. This is all the installation you need (assuming you have Python 3 with SciPy installed, e.g., through installing Anaconda).
  2. Test your distribution by:
    cd test
    ./test.py
    The test script will run 8 very quick runs, checking that everything is OK. Ideally, it will exit with the message "All tests passed! Everything looks good."
    If there are serious issues, please check your Python installation.
If everything was in order, we can move on to example applications. Right now, there is one: ubiquitin unfolding. First things first, let us go to the application folder:

cd applications/1_ubi_unfold/
We will first start with the file fine.inp (included in the folder):

TRAJ pull_500ps.xtc
COORD pull.tpr
NLEVEL 1001
DT 500
TRUNC  1.0
TRUNC_INTER 0.5
TRUNC_DR 1.0
DR_MODE init
GNUS_PATH ../../gnuplot_inputscripts
RUN_MDMAT yes
DOMAINS domains.txt
COORD_PDB ubi.pdb
K_TRAJ_CLUSTERS 1-5

This file contains the trajectory file, the coordinates file, fixes the number of levels in the distance measurements (in this case, 1001 levels between 0 and 1.0 mean that we'll have a resolution of 0.001 nm). We will perform cluster analysis as well, trying a number of clusters between 1-5. Before we begin, we can also check out the 3D movie of the unfolding:



We are ready to run CONAN!
../../conan.py fine.inp

When prompted by gmx mdmat to choose the group of residues to measure, select 2:
Group     2 (      Protein-H) has   602 elements

CONAN is a very fast analysis tool, except for the creation of the framewise contact maps (.png files for the movie). After about a minute, all the aggregate plots should be done. While the movie is being made, let us check out the aggregate plots. For now, I'll show the more interesting ones (domain-annotated, due to our file domains.txt):

  1. The average contact map (aggregate/avg_mdmat.domains.png) shows us the average distance between residues. This already suggests to us which parts of the structure are lost at which point. The text version of this information is the third column in mdmat_average_rmsf.dat
  2. The lifetime contact map (aggregate/interaction_lifetime.domains.png) shows the same trend as the average contact map, although this time we can be sure that residues at long distances indeed were interacting during the trajectory.
  3. The "last encounter" plot (aggregate/timeline_last_encounter.domains.png) shows similar data, but this time time encoded.
  4. The "local interaction time" plot (aggregate/local_interactions.domains.png) averages all the nonzero interaction times per residue. This means that high values mean relatively stable regions (i.e., unfolding later). Clearly, the helix alpha1 is the most stable. 
By now, the two major animations should be ready (if you have mencoder installed). If not, just check out the ones here and check your copies later :)

This is how the contact map evolves (or, in this case, "devolves") in time (CONAN.domains.mp4):


And the corollary, the change with respect to the initial frame (CONAN.dr_init.domains.mp4):

 

Blue values signify broken contacts, orange signify formed ones.

A very important question in analyzing an (un)folding trajectory is whether or not there have been intermediates. We can do this through two methods:

First, we can just look at the contact-based RMSD:


This suggests that there is an intermediate state somewhere between 50-90 ns. We can also see this in the videos above.

Second, we can form a hierarchy of our 472 frames, which we have asked for (using the keyword K_TRAJ_CLUSTERS). In your folder cluster_trj/, you will find two plots:

The inter-frame RMSD:

The very left side (or bottom side) of the 2D plot is the same quantity plotted above, i.e., the difference between each frame with the initial one (t=0 ns). In fact, this shows that there is still some clustering going on that we'd have missed in the previous plot (at t>100 ns).

And the dendrogram:

We could reasonably choose 2, 3, or 5 clusters. If we choose 3, we get the following framewise assignments (cluster_trj/3/frames_cluster_assignment.png):

Indeed, there is a cluster of intermediatesbetween frames 80-220, i.e., 40-110 ns.

To understand what these clusters mean, we can see the interaction lifetimes (also the average contact maps and the central frames are shown, as well as all the frames are symbolically linked).

Here they are:

Cluster 0: this is obviously the native state.

Cluster 1: the strands  β1- β2 lost contact with the rest of the protein and with each other. Several other contacts are only partially populated, which means that a further division of this cluster might make sense (which we will not do here).

Cluster2: This cluster only has some residual helicity.

(NB: it is not a coincidence that the clusters are ordered according to the evolution of the contact map: they are ordered by the temporal position of the central frames.)

Finally, we might want to identify "events" in the trajectory. If we do so, we obviously want to find contacts maps that are very different than their neighbours. However, if we look at the RMSD w.r.t. the previous frame, it is quite noisy (aggregate/rmsd_contact_map_previous.png).

So we will change the sampling rate to 5 ns instead. Look at the input file coarse.inp:

DT 5000
TRUNC  1.0
TRUNC_INTER 0.5
TRUNC_DR 0.5
DR_MODE prev
GNUS_PATH ../../gnuplot_inputscripts
RUN_MDMAT no
DOMAINS domains.txt

The differences are: RUN_MDMAT no (the dmf.xpm file by gmx mdmat will be reread), DT 5000 (reducing frequency), and TRUNC_DR 0.5 (for better visibility, we set our colormap to show changes of up to 0.5 nm as maximal).

We can rerun CONAN as:
../../conan.py coarse.inp
(Don't worry, the old output files will be stored in a backup folder.)

This time, the run will be much faster: there is 10 times less data to process.

The RMSD plot is now like (aggregate/rmsd_contact_map_previous.png):

We can also check for the two main peaks through:
sort -rnk 2 time_mdmat_rmsd.dat  | head
 115.0000    0.1135    0.1867
  35.0000    0.1039    0.1259
  40.0000    0.0952    0.1329
 105.0000    0.0949    0.1954
  85.0000    0.0851    0.1659
  55.0000    0.0720    0.1384
  45.0000    0.0706    0.1433
  95.0000    0.0698    0.1679
 135.0000    0.0671    0.1920
 125.0000    0.0645    0.1855
To understand what they mean, let us look at the corresponding frames:

frames/00007_dr.prev.domains.png
frames/00008_dr.prev.domains.png
frames/00023_dr.prev.domains.png



In fact, we found, without knowing anything about the trajectory, transitions between the same three clusters that we have found before.

I hope you enjoyed this tutorial. Look out for more.



Thursday, March 23, 2017

Correlations with the contact map

One of the most powerful features of ConAn is its ability to correlate features of interest or external perturbations with the contact map. This can allow a precise identification of key interactions (between residues or groups of residues) responsible for some macroscopic quantity. One "macroscopic quantity" is time. If we identify parts of our protein that have a significant time drift, that might mean that the reference structure is a poor one, or that our simulation needs to be run longer!

The 50 ns equilibrium simulation of ubiquitin, for example, shows the following time correlation:

Oops! There is significant rearrangement of the linker region between β3 and β4 (observed also in the previous section as a "tug of war" between R42 and D58). If we want to faithfully simulate ubiquitin, we would need to simulate "longer" (perhaps several milliseconds as we recently learned! but I digress). This kind of analysis would be very difficult with "fitting" methods, perhaps one could look at time correlations with pricipal components, but here we have direct access to a very large amount of information.

At constant velocity pulling experiments, the very first part of the unfolding trajectory sees a linear increase in force (as a "force ramp"). In that case, this Pearson correlation with time represents the force response of the protein. We examine this for the first 10 ns (of a total of 250) of "unfolding" of ubiquitin in which only minor changes in the contact map can be observed:

Residue pairs in orange get closer together as the external force increases (and could therefore be considered "force-bearing") and blue ones farther apart ("force-compliant"). More sampling than 10ns on a single frame would definitely be helpful, of course!

How about some other macroscopic variable? Let us try to explain the radius of gyration of ubiquitin. If we check how the contact map correlates with this observable, we get:

If we identify the positively correlating interactions, we find out that it is the final loop with everything else and the flexible loop with β3 that are mainly responsible for changes in the radius of gyration. This makes a lot of sense, of course.



Contact map-based alternatives for common measures

Contact maps can provide alternatives for widely used statistics on MD simulation. These can be defined in several different ways, here I present the way ConAn implements them.




  1. RMSF: This is commonly defined as the fluctuation of a residue around its average position, which means one fits the whole protein on a reference structure and measures distance to those of the reference structure. However, if large-scale conformational changes happen, these average positions can be meaningless and the fits particularly unreliable. If we measure root-mean-square fluctuations of inter-residue distances, we are using a more robust metric unaffected by any fitting. A further plus is that we (possibly) gain more insight into the relevant motions (which distances fluctuate and by how much?). This is a comparison of ubiquitin RMSF with fluctuations of the contact map:
    In this case, we learn very similar information to the "old way" (the linker of the final few residues dominate the RMSF), although we know the most important distances that change (roughly, linker - β3)
  2. RMSD: the root-mean-squared displacement shows how far our structure is from some reference (often, the initial frame). This is again done by a fit and a way around this fitting is again possible by getting individual differences in inter-residue distances. This way, we ignore large collective motions and catch changes in small distances (this is a trade-off, not an obvious plus!). This is our RMSD on the contact map for our 50 ns simulation of ubiquitin:

    These values are quite a bit smaller than normal RMSD's since we "fit" every pair separately (i.e., we don't fit them at all!).
  3. Inter-residue cross-correlation: getting the cross-correlation of pairs with other pairs would leave us with a 4D object, which is a bit too much to visualize. However, we can project distances on residues using some formula. In the current implementation of ConAn, we use the simple formula:

    If we then correlate the time series of this measure for various residue pairs, we can get an idea of where the protein is more or less folded.

    This measure could be thought of an alternative with the relatively well-known cross-correlation analysis that correlates the time series of deviation from average positions (there is that concept again!).

    If two residues correlate in our measure, it is likely the case that they are directly linked or their environments change concomitantly. Anti-correlation would be characteristic to competitive binding to a third residue (which would then correlate to both of the external ones). Without further ado, this is the inter-residue cross-correlation of our old friend ubiquitin:



    Most of the correlation is indeed where we would expect it to be (along the main diagonal), but there is also some interesting anti-correlations. If we look at Arg42, for example, we see that it correlates with its direct neighborhood, then anti-correlates strongly with Asp58, then correlates again:

    Now that we know what we are looking for, we can easily identify a major shift in interaction networks: Arg42 interacted with Gln48 (orange) but after the linker changed conformation, it is Asp58 (blue) that gains an interaction partner in Lys49. Similar anti-correlations can often yield interesting insight into the dynamics of the protein (not necessarily time-evolution, but possible interaction networks).
    The initial conformation: Arg42 interacts with Gln48.



    The final conformation: Asp58 interacts with Lys49,
    making the previous interaction impossible.

  4. Cluster analysis on residues: this can be an interesting way of finding clusters of interacting partners. (soon to be described/expanded...) For example, this is what three clusters of residues of ubiquitin look like, where each cluster is characterized by a high interaction lifetime (described in the previous post) to a central residue (represented here by opaque licorice). This is done by a k-medoid method...
    The inter-residue "distance metric" in this
    case is a % of interaction time. For cluster analysis,
    we will use 100%-(interaction time).
    The three identified clusters. along with their central residues (in opaque).

  5. Cluster analysis on the trajectory: the same as normal cluster analysis, but (as usual) without fitting structures! The distance metric in this case is inter-frame RMSD. This is what the inter-frame RMSD looks like in this case:


    and the clusters identified:



    There is again the same tradeoff as mentioned in the RMSD section. Deviations in the contact map could be more interesting for specific, important interactions, while "normal RMSD" will detect large-scale conformational changes. Cluster analysis based on inter-residue distances can be useful if one is interested in intra-domain motion and perhaps wants to ignore relatively large "macro-changes" that do not change the energy too much (e.g., if the system contains a flexible linker).
  6. Principal component analysis: in this case, we do not try to decompose deviations from an average structure as principal components, but rather deviations from an average contact map. Average contact maps can be a better measure than an average structure, particularly if the conformation of the protein drastically changes. These are the first three principal components of the dynamical contact map of ubiquitin (always ordered to have a positive time correlation):





    And the projections are:


Equilibrium simulations: how can contact maps help?

In equilibrium, the animation of the contact maps can still be interesting, if for no other reason than to see that nothing much is happening (this is a 50 ns simulation of ubiquitin):

Again, we can check if we have gained or lost any contacts along the way:


Not much there! "encounter" is defined (in this case) as beginning if two residues are within 6 ångström of each other (defined as any two heavy atoms) and ending if two residues are farther than 7.5 ångström. This two-cutoff scheme avoids the detection of spurious encounters (say, if a distance keeps oscillating between 0.59-0.61, a single cutoff would yield many encounters while this one just one).

ConAn can also observe the average encounter time as well as number of encounters, although you should make sure that what you observe is actually a physical interaction.

However, this can be a quick "sanity check" or an initial screening. Furthermore, these measures can be very useful for unstructured proteins. In those cases, the above plots would be much more colorful! (Example analyses on IDP's are coming soon.)

Unfolding ubiquitin: what breaks when?

The off-diagonal parts of the contact map contain all the non-trivial information of the folding. Therefore, if the protein unfolds (or we unfold it ourselves), all the off-diagonal parts will disappear. These animations can be fun to look at:



Or we can look at the unfolding, the force profile, and the contact maps together:


Animations are always really useful, but if we want to classify trajectories, or have a quick overview, it is better to summarize all the frames in one plot. ConAn has several time-encoded contact maps, for example, on the total interaction time:

This shows that there are roughly speaking three groups of contact losses:

  1. Around ~15-25%: residues 0-10 losing contact with 10-20 and 60-70.
  2. Around ~40%: residues 25-70 losing contact with each other.
  3. Around ~70%: residues 20-40 losing helicity.
We can also encode the initial secondary structure to easier interpret this plot (ubiquitin is formed of an helix and four beta strands):

Another option is a time-encoded plot of last encounters:

The contacts are more widespread in this case as short-lived intermediates form residues 25-70 appear.

In case you are studying folding instead (or non-native intermediates during unfolding), you will also find the opposite plot interesting, i.e., time of the first encounter:

Strand β4 forms a short-lived helical intermediate around 150 ns.

If we had several different trajectories, any of these 3 statistics (although probably the total interaction time would be the best) could be a good basis for clustering them, either "by eye" or by some more sophisticated method.

Basics: what are we plotting?

The first output from ConAn!
This contact map shows inter-residue distances of a protein. The distance is defined (here) as the shortest distance between any two heavy atoms from the two residues. Both the x and the y-axes show the sequence of the protein. The main diagonal contains "trivial cases" (any residue is at d=0 from itself), the off-diagonal elements show the secondary, or tertiary, structure of the protein. If you are very familiar with contact maps, perhaps you have already identified that we are working with ubiquitin here (as we will always do unless otherwise mentioned). Any distance longer to 1nm is truncated to this value for clarity (and all the statistical analyses that follow will use this truncated value). Here it is in all its glory (PDB ID: 1UB1):
This contact map can tell a good structural biologist more than a simple 3D structure. For example, by combining sequence data with the contacts, you can identify probable physical interactions (this is also obtained by ConAn):
In the following, we will show further analyses with ConAn (all on ubiquitin), first the "cooler" case of unfolding simulations, and then more subtle analyses on equilibrium simulations.

Wednesday, March 22, 2017

Introduction to ConAn

ConAn (which stands for Contact Analysis) is a tool to analyze MD trajectories through the use of contact maps and related quantities. It can be very useful for a first, exploratory analysis, i.e., identifying key contacts to be analyzed further, but can also offer alternative, possibly more precise, measures to the ones commonly used (RMSD, RMSF, etc...).

In the following few posts we will go through most of the functionalities of ConAn and its possible uses.