Friday, January 19, 2018

Tutorial 1: ubiquitin unfolding analysis

If CONAN is a bit too daunting to use first, the following tutorial can hopefully help you. Do the following two steps:


  1. Download CONAN, version 1.0, from: https://github.com/HITS-MBM/conan/archive/v1.0.tar.gz . Unzip and move all files inside to a folder of your choice. This is all the installation you need (assuming you have Python 3 with SciPy installed, e.g., through installing Anaconda).
  2. Test your distribution by:
    cd test
    ./test.py
    The test script will run 8 very quick runs, checking that everything is OK. Ideally, it will exit with the message "All tests passed! Everything looks good."
    If there are serious issues, please check your Python installation.
If everything was in order, we can move on to example applications. Right now, there is one: ubiquitin unfolding. First things first, let us go to the application folder:

cd applications/1_ubi_unfold/
We will first start with the file fine.inp (included in the folder):

TRAJ pull_500ps.xtc
COORD pull.tpr
NLEVEL 1001
DT 500
TRUNC  1.0
TRUNC_INTER 0.5
TRUNC_DR 1.0
DR_MODE init
GNUS_PATH ../../gnuplot_inputscripts
RUN_MDMAT yes
DOMAINS domains.txt
COORD_PDB ubi.pdb
K_TRAJ_CLUSTERS 1-5

This file contains the trajectory file, the coordinates file, fixes the number of levels in the distance measurements (in this case, 1001 levels between 0 and 1.0 mean that we'll have a resolution of 0.001 nm). We will perform cluster analysis as well, trying a number of clusters between 1-5. Before we begin, we can also check out the 3D movie of the unfolding:



We are ready to run CONAN!
../../conan.py fine.inp

When prompted by gmx mdmat to choose the group of residues to measure, select 2:
Group     2 (      Protein-H) has   602 elements

CONAN is a very fast analysis tool, except for the creation of the framewise contact maps (.png files for the movie). After about a minute, all the aggregate plots should be done. While the movie is being made, let us check out the aggregate plots. For now, I'll show the more interesting ones (domain-annotated, due to our file domains.txt):

  1. The average contact map (aggregate/avg_mdmat.domains.png) shows us the average distance between residues. This already suggests to us which parts of the structure are lost at which point. The text version of this information is the third column in mdmat_average_rmsf.dat
  2. The lifetime contact map (aggregate/interaction_lifetime.domains.png) shows the same trend as the average contact map, although this time we can be sure that residues at long distances indeed were interacting during the trajectory.
  3. The "last encounter" plot (aggregate/timeline_last_encounter.domains.png) shows similar data, but this time time encoded.
  4. The "local interaction time" plot (aggregate/local_interactions.domains.png) averages all the nonzero interaction times per residue. This means that high values mean relatively stable regions (i.e., unfolding later). Clearly, the helix alpha1 is the most stable. 
By now, the two major animations should be ready (if you have mencoder installed). If not, just check out the ones here and check your copies later :)

This is how the contact map evolves (or, in this case, "devolves") in time (CONAN.domains.mp4):


And the corollary, the change with respect to the initial frame (CONAN.dr_init.domains.mp4):

 

Blue values signify broken contacts, orange signify formed ones.

A very important question in analyzing an (un)folding trajectory is whether or not there have been intermediates. We can do this through two methods:

First, we can just look at the contact-based RMSD:


This suggests that there is an intermediate state somewhere between 50-90 ns. We can also see this in the videos above.

Second, we can form a hierarchy of our 472 frames, which we have asked for (using the keyword K_TRAJ_CLUSTERS). In your folder cluster_trj/, you will find two plots:

The inter-frame RMSD:

The very left side (or bottom side) of the 2D plot is the same quantity plotted above, i.e., the difference between each frame with the initial one (t=0 ns). In fact, this shows that there is still some clustering going on that we'd have missed in the previous plot (at t>100 ns).

And the dendrogram:

We could reasonably choose 2, 3, or 5 clusters. If we choose 3, we get the following framewise assignments (cluster_trj/3/frames_cluster_assignment.png):

Indeed, there is a cluster of intermediatesbetween frames 80-220, i.e., 40-110 ns.

To understand what these clusters mean, we can see the interaction lifetimes (also the average contact maps and the central frames are shown, as well as all the frames are symbolically linked).

Here they are:

Cluster 0: this is obviously the native state.

Cluster 1: the strands  β1- β2 lost contact with the rest of the protein and with each other. Several other contacts are only partially populated, which means that a further division of this cluster might make sense (which we will not do here).

Cluster2: This cluster only has some residual helicity.

(NB: it is not a coincidence that the clusters are ordered according to the evolution of the contact map: they are ordered by the temporal position of the central frames.)

Finally, we might want to identify "events" in the trajectory. If we do so, we obviously want to find contacts maps that are very different than their neighbours. However, if we look at the RMSD w.r.t. the previous frame, it is quite noisy (aggregate/rmsd_contact_map_previous.png).

So we will change the sampling rate to 5 ns instead. Look at the input file coarse.inp:

DT 5000
TRUNC  1.0
TRUNC_INTER 0.5
TRUNC_DR 0.5
DR_MODE prev
GNUS_PATH ../../gnuplot_inputscripts
RUN_MDMAT no
DOMAINS domains.txt

The differences are: RUN_MDMAT no (the dmf.xpm file by gmx mdmat will be reread), DT 5000 (reducing frequency), and TRUNC_DR 0.5 (for better visibility, we set our colormap to show changes of up to 0.5 nm as maximal).

We can rerun CONAN as:
../../conan.py coarse.inp
(Don't worry, the old output files will be stored in a backup folder.)

This time, the run will be much faster: there is 10 times less data to process.

The RMSD plot is now like (aggregate/rmsd_contact_map_previous.png):

We can also check for the two main peaks through:
sort -rnk 2 time_mdmat_rmsd.dat  | head
 115.0000    0.1135    0.1867
  35.0000    0.1039    0.1259
  40.0000    0.0952    0.1329
 105.0000    0.0949    0.1954
  85.0000    0.0851    0.1659
  55.0000    0.0720    0.1384
  45.0000    0.0706    0.1433
  95.0000    0.0698    0.1679
 135.0000    0.0671    0.1920
 125.0000    0.0645    0.1855
To understand what they mean, let us look at the corresponding frames:

frames/00007_dr.prev.domains.png
frames/00008_dr.prev.domains.png
frames/00023_dr.prev.domains.png



In fact, we found, without knowing anything about the trajectory, transitions between the same three clusters that we have found before.

I hope you enjoyed this tutorial. Look out for more.