Measuring the Similarity of Protein Structures by Means of the Universal Similarity Metric (Auxiliary Programs and Scripts)

The protocol described in the paper "Measuring the Similarity of Protein Structures by Means of the Universal Similarity Metric" can be implemented with the programs and scripts described here.

Extracting chain A from the PDB files

To build the working data set, first download the selected PDB files.

Then run extractModel.pl on each PDB file.

The arguments to the script are:

extractModel.pl inputPDBFileName outputPDBFileName

This reads the PDB file inputPDBFileName and extracts *only* chain A into a file named outputPDBFileName.
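
As an illustration of what this step does, the following is a minimal Python sketch, not the extractModel.pl script itself. It assumes standard fixed-column PDB records, where the chain identifier sits in column 22, and it assumes that only ATOM records of the first model are wanted; the function name extract_chain is hypothetical.

```python
def extract_chain(pdb_lines, chain_id="A"):
    """Return the ATOM records of chain `chain_id` from the first model.

    Assumptions (not taken from extractModel.pl): only ATOM records are
    kept, and reading stops at the first ENDMDL record.
    """
    kept = []
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # ignore any further models in a multi-model file
        # Column 22 of a PDB coordinate record (index 21) is the chain ID.
        if record == "ATOM" and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    return kept
```

Reading the lines of an input file, passing them to extract_chain, and writing the result to a second file would mirror the two-argument usage shown above.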

Building the contact maps from the PDB files

To produce the contact maps from the models extracted in the previous step, use the program BuildContactMapFromPDB. Its arguments are:

java -mx1600000000 BuildContactMapFromPDB PDBFileName threshold boolean

PDBFileName is the name of a model extracted in the previous step.

The threshold is the distance cutoff, in Angstroms, used to define a contact in the contact map.

If boolean = true, the contact map is plotted on the screen; otherwise it is only saved.

The output file name (containing the contact map) will be PDBFileName.cm

The Java source files and class files are in a tar file called contactmapsources.tar, and you can recompile them with compileBuildContactMapFromPDB.
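
To make the role of the threshold concrete, here is a minimal Python sketch of a contact map computation. It is not the BuildContactMapFromPDB program: it assumes that contacts are measured between C-alpha atoms and that two residues are in contact when their distance is at most the threshold. The function names are hypothetical, and the layout of the .cm file is not reproduced.

```python
import math


def ca_coordinates(pdb_lines):
    """Collect (x, y, z) of every C-alpha atom from PDB ATOM records."""
    coords = []
    for line in pdb_lines:
        # Columns 13-16 hold the atom name, columns 31-54 the coordinates.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def contact_map(coords, threshold):
    """Return an n x n 0/1 matrix; 1 means distance <= threshold (Angstroms)."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= threshold else 0
             for j in range(n)]
            for i in range(n)]
```

The resulting matrix is symmetric with ones on the diagonal; a larger threshold yields a denser map.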

Estimating the Kolmogorov-Chaitin-Solomonoff complexity of the contact map files

To estimate the Kolmogorov-Chaitin-Solomonoff complexity of the individual and concatenated contact maps, simply use *any* compression algorithm (e.g. compress, gzip, zip, etc.) and take the compressed size as the estimate. Store the list of complexities in the format explained below.
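
The following Python sketch shows one way to produce both lists of complexities. It is only an illustration: it uses zlib as the compressor (any other would do, as stated above), takes the compressed length in bytes as the complexity estimate, and the function names are hypothetical.

```python
import zlib


def complexity(data: bytes) -> int:
    """Estimate the complexity of `data` as its zlib-compressed size in bytes."""
    return len(zlib.compress(data, 9))


def complexity_tables(maps):
    """Build the lines of the two size files from contact map contents.

    `maps` maps each protein id to the raw bytes of its contact map file.
    Returns (single_lines, pair_lines): one "number protId" line per
    protein and one "number protIdA-protIdB" line per ordered pair,
    with pairs listed row by row.
    """
    ids = list(maps)
    single_lines = [f"{complexity(maps[p])} {p}" for p in ids]
    pair_lines = [f"{complexity(maps[a] + maps[b])} {a}-{b}"
                  for a in ids for b in ids]
    return single_lines, pair_lines
```

For n proteins this gives n lines for the first file and n*n lines for the second, including the pairs of each protein with itself.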

Computing the USM distance matrix

The Java source file USM.java computes the Universal Similarity Metric for a set of protein structure pairs. The arguments to USM are:

USM c_1-sizes c_1-c_2-sizes distancesFile

c_1-sizes is the name of the file with the estimated Kolmogorov complexity of each protein in the data set. The format for this file should be:

number_1 protId_1
number_2 protId_2
....     ...
number_n protId_n

number_i is the complexity of protein protId_i and n is the size of the data set.

c_1-c_2-sizes is the name of the file with the estimated Kolmogorov complexity of each *concatenated* protein pair in the data set. The format for this file should be:

number_1    protId_1-protId_1
number_2    protId_1-protId_2
...         ...
number_n    protId_1-protId_n
number_n+1  protId_2-protId_1
number_n+2  protId_2-protId_2
number_n+3  protId_2-protId_3
...         ...
number_n*n  protId_n-protId_n

distancesFile is the name of the output file where the USM values, computed according to Eq. 4 in the paper, will be written. The format is self-evident, i.e. a square matrix with all the n*n similarity values.

To compile USM.java, simply execute compileUSM.
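
As an illustration of how the two input files combine into the matrix, here is a minimal Python sketch. It is not USM.java, and Eq. 4 is not reproduced on this page, so the formula is an assumption: the usual compression-based approximation K(a|b) ~ K(ab) - K(b), which gives d(a,b) = max(K(ab) - K(b), K(ba) - K(a)) / max(K(a), K(b)). The function names are hypothetical.

```python
def read_sizes(lines):
    """Parse "number identifier" lines; return (ids in file order, id -> number)."""
    order, sizes = [], {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        sizes[parts[1]] = float(parts[0])
        order.append(parts[1])
    return order, sizes


def usm_matrix(single_lines, pair_lines):
    """Return (ids, n x n matrix) under the assumed USM approximation."""
    ids, c = read_sizes(single_lines)      # K(a) estimates
    _, cc = read_sizes(pair_lines)         # K(ab) estimates, keyed "a-b"
    matrix = []
    for a in ids:
        row = []
        for b in ids:
            # K(a|b) ~ K(ab) - K(b) and K(b|a) ~ K(ba) - K(a)  (assumption)
            numerator = max(cc[f"{a}-{b}"] - c[b], cc[f"{b}-{a}"] - c[a])
            row.append(numerator / max(c[a], c[b]))
        matrix.append(row)
    return ids, matrix
```

Pair keys are rebuilt from the protein ids rather than split on "-", so ids that themselves contain a hyphen are still looked up correctly.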
NOTE1: all the Java programs were developed for JDK 1.1.8.
NOTE2: if you are using a mixed Unix/DOS environment you may need to run dos2unix on these files to get rid of unwanted control characters.
NOTE3: you must set the $JAVA_HOME variable.