Abstract- Protein structural class prediction can play a vital role in protein 3-D structure prediction by reducing the search space of 3-D structure prediction algorithms. In this paper we use a support vector machine to predict the structural class of a protein solely from its amino acid sequence, i.e. mainly α, mainly β, α-β and fss from the CATH protein structure database, and all-α, all-β, α/β and α+β from the SCOP protein structure database. Four different datasets were used, two of which were constructed using a method called Representative Protein Extraction. During the training phase, an accuracy of 99.91% was achieved for the binary classification fss vs. others. During the testing phase, the overall prediction accuracy was 97.14% for the SCOP database and 96% for the CATH database. These results are encouraging and indicate that the method can be used as a complementary approach to many existing methods for protein structural class prediction.
Keywords- Protein structural class, Support vector machine, CATH database, SCOP database.
Introduction
Protein 3-D (three-dimensional) structure prediction is very important in bioinformatics research, since the interactions between different proteins or between proteins and their ligands (and therefore the functions carried out through these interactions) are all determined by the 3-D structure of proteins. Currently there are 60769 structures in the Protein Data Bank (PDB) [1], whereas the number of genome sequences available in public databases is about a hundred times larger, and more genomes are being sequenced every day. The PDB, on the other hand, is not growing as fast, because structure determination depends heavily on X-ray crystallography and NMR, which are very expensive and time-consuming. The structures of some proteins, such as trans-membrane proteins and some larger proteins, cannot be determined by these methods at all. Because of this, protein structure determination based solely on sequence using computational methods is emerging as an alternative approach.
Protein structural class prediction can play a vital role in protein 3-D structure prediction ([2], [5]). Prior knowledge of the protein structural class can significantly improve the quality and performance of algorithms that predict protein secondary and tertiary structure from amino acid sequences, by reducing the search space of the structure prediction process. The concept of protein structural class was first introduced by Levitt and Chothia [8], based on a visual inspection of polypeptide chain topologies in a dataset of 31 globular proteins. They classified the proteins whose structures were known into four main structural classes: (i) all-α, composed mainly of alpha helices; (ii) all-β, composed mainly of beta strands; (iii) α/β, containing proteins in which helices and strands are mixed and the beta strands are parallel; and (iv) α+β, consisting of proteins in which helices and strands are mixed and the beta strands are anti-parallel.
As the number of protein 3-D structures deposited in the Protein Data Bank has grown, the number of structural classes has been extended to seven by adding the multi-domain protein class (μ), the membrane and cell surface protein class (ρ) and the small protein class (σ).
Many research groups currently maintain web-accessible hierarchical classifications of protein structures. The most widely used are the SCOP and CATH databases. The Structural Classification of Proteins (SCOP) database [12] provides a comprehensive description of protein structures and evolutionary relationships based on a hierarchical classification scheme. At the top of the hierarchy are classes (α, β, α/β, α+β, μ, ρ, σ) that are subsequently subdivided into folds, superfamilies, families, protein domains, and then individual PDB protein structure entries. CATH is a hierarchical classification system developed by Janet Thornton, Christine Orengo, and colleagues [14] that describes all known protein domain structures. CATH clusters proteins at four major levels: class (C), architecture (A), topology (T), and homologous superfamily (H). At the highest level (class) the CATH database describes main folds based on secondary-structure content: mainly α, mixed α and β, and mainly β, as well as a category of few secondary structures (fss). Assignment at this level resembles the SCOP database system. The architecture (A) level of CATH describes the shape of the domain structure as determined by the orientations of the secondary structures. The topology (T) level of CATH describes fold families. Proteins that belong to the same T-level not only have a similar number and arrangement of secondary structures, but also the same connectivity linking their secondary structure elements. The fourth level, homologous superfamily (H), clusters proteins that are likely to share homology (i.e., descent from a common ancestor).
A common and simple practice for classifying an unknown protein is to find a homologous protein with known structure and then classify the query protein accordingly. The situation is more complicated when there are no clear homologues with known structure, and structural class assignment is then especially difficult. At the same time, the gap between the numbers of proteins with known and unknown structure is widening exponentially. Over the years several computational methods have been applied to protein structural class prediction from amino acid sequences ([3], [6], [9], [10], [15], [16]). This paper presents a protein structural class prediction method based on the hierarchical classifications of the CATH and SCOP databases using a support vector machine (SVM). We consider the impact of the relationship between neighbouring amino acids in the sequence, i.e. the effect of the relative frequencies of tetrapeptides on the training results. Similar research has been carried out by other researchers, such as Markowetz et al. (2003), Isik et al. (2004), Zhang et al. (2006) and Sun et al. (2006) ([7], [10], [15], [16]). However, their datasets differ from those used in this research, and they used dipeptides or tripeptides whereas we use tetrapeptides. To the best of our knowledge, this is the first report in which both the CATH and SCOP databases are used for structural class prediction by SVM. Furthermore, the training datasets are selected using a method called Representative Protein Extraction [11].
Materials and Methods
Dataset
In this study we used four working datasets: three from the SCOP database and one from the CATH database. The first dataset contains 277 domains, of which 70 are all-alpha, 61 are all-beta, 81 are alpha/beta and 65 are alpha+beta [17]. The second dataset contains 498 domains, of which 107 are all-alpha, 126 are all-beta, 136 are alpha/beta and 129 are alpha+beta [17]. Although both of these datasets have been used extensively by many researchers, they are small and suffer from redundancy [2]. The third and fourth datasets were obtained using a method called Representative Protein Extraction, described in [11]. Using this method we extracted 1195 representative sequences from the SCOP database, representing 1195 fold families, of which 284 are all-alpha, 174 are all-beta, 147 are alpha/beta, 376 are alpha+beta, 66 are multi-domain, 58 are membrane and cell surface, and 90 are small proteins [12]. We also extracted 1110 protein sequences from the CATH database, representing 1110 topologies, of which 310 are mainly alpha, 196 are mainly beta, 512 are alpha-beta and 92 are few secondary structures [14].
Approach
1) Feature Extraction: Protein sequences are classified into four structural classes in the CATH database and into seven structural classes in the SCOP database. After obtaining the training datasets as described in the previous section, we extracted features from these protein sequences to train the SVM classifier. First, we marked the sequences from a particular class as the 'Current' list (e.g. mainly α for the CATH database) and the sequences from all other classes as the 'Other' list (e.g. mainly β, α-β and Fss). We then constructed the list of all possible sequences of length 4 over the 20 amino acids. This list contains 160,000 (20 × 20 × 20 × 20) tetrapeptides. Next, we calculated the frequency of each of these tetrapeptides in both lists. Let fci be the total frequency of the ith tetrapeptide in all the sequences from the 'Current' list and foi be the total frequency of the ith tetrapeptide in all the sequences from the 'Other' list. For each of the 160,000 possible tetrapeptides, a difference Di is computed as the absolute difference between its numbers of occurrences in the 'Current' and 'Other' lists.
Di = | fci - foi | (1)
where Di is the absolute difference in the number of occurrences of the ith tetrapeptide. The Di values are sorted in descending order and the 4000 tetrapeptides with the largest differences are selected. We then calculated the frequency of these 4000 tetrapeptides in every sequence of the 'Current' and 'Other' lists. Let fi be the frequency of the ith tetrapeptide in a protein sequence. A normalization parameter can be obtained according to the following equation.
(2)
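The selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, tetrapeptides are assumed to be counted with overlapping windows, ties in Di are assumed to be broken alphabetically, and the per-sequence values are raw counts because the normalization of Eq. (2) is not applied here.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def tetrapeptide_counts(sequences, k=4):
    """Total count of every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def select_tetrapeptides(current, other, top_n=4000, k=4):
    """Rank all 20**k peptides by D_i = |fc_i - fo_i| and keep the top_n."""
    fc = tetrapeptide_counts(current, k)  # frequencies in the 'Current' list
    fo = tetrapeptide_counts(other, k)    # frequencies in the 'Other' list
    all_peptides = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    diffs = {p: abs(fc[p] - fo[p]) for p in all_peptides}
    # Descending D_i; alphabetical tie-break is an assumption of this sketch.
    ranked = sorted(diffs, key=lambda p: (-diffs[p], p))
    return ranked[:top_n]


def sequence_features(seq, selected, k=4):
    """Raw count f_i of each selected peptide in one sequence (unnormalized)."""
    counts = tetrapeptide_counts([seq], k)
    return [counts[p] for p in selected]
```

With the default `top_n=4000`, `select_tetrapeptides` returns the peptide list used to build the first 4000 components of each protein's feature vector.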
We then calculated the transition probabilities (the probability that a particular amino acid 'X' is followed by another amino acid 'Y' in a given protein sequence) for each pair of amino acids in a protein sequence. Thus a protein can be represented as follows:
(3)
where pi,j is the transition probability from the ith amino acid to the jth amino acid in the protein sequence. Defined in this way, each protein sequence in both the 'Current' and 'Other' lists corresponds to a 4400-D vector (4000 tetrapeptide frequencies plus 20 × 20 = 400 transition probabilities).
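As an illustration of the transition-probability part of the vector, the sketch below estimates pi,j from adjacent residue pairs. It assumes that pi,j is the conditional probability of residue j given that the preceding residue is i (each row normalized by the number of transitions leaving i); the paper does not state the normalization, so this choice and the function names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def transition_probabilities(seq):
    """Return the 400 values p[i][j], flattened row by row."""
    n = len(AMINO_ACIDS)
    counts = [[0] * n for _ in range(n)]
    # Count each adjacent pair (X followed by Y); skip non-standard residues.
    for x, y in zip(seq, seq[1:]):
        if x in INDEX and y in INDEX:
            counts[INDEX[x]][INDEX[y]] += 1
    flat = []
    for row in counts:
        total = sum(row)
        # Rows with no outgoing transitions are left as zeros.
        flat.extend(c / total if total else 0.0 for c in row)
    return flat


def protein_vector(tetra_features, seq):
    """Concatenate tetrapeptide features with the 400 transition values."""
    return list(tetra_features) + transition_probabilities(seq)
```

With 4000 tetrapeptide features, `protein_vector` yields the 4400 components described in the text.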
2) SVM Training: LIBSVM (version 2.89) is used for SVM training in the proposed method [22]. LIBSVM is a freely available, simple, easy-to-use and efficient software package for SVM classification and regression. Training is done using the aforementioned four datasets from the CATH and SCOP databases. Five main parameters are used for training: SVM type (s), kernel type (t), gamma (g), cost (c) and n-fold cross-validation (v). To evaluate the accuracy of the proposed method with the SVM classifier we used the jackknife test for cross-validation. A C-SVC type SVM (s=0), an RBF kernel (t=2) and 5-fold cross-validation (v=5) were used on the training data. After training, the SVM classifier was tested with a different dataset.
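The sketch below shows how such a training run might be prepared for the LIBSVM command-line tool. It writes feature vectors in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with 1-based indices) and assembles an `svm-train` argument list using the options named above (`-s`, `-t`, `-g`, `-c`, `-v`). The helper names, the +1/-1 labelling of 'Current' and 'Other', and the file name are assumptions made for illustration.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' (1-based, zeros omitted)."""
    parts = [str(label)]
    for idx, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{idx}:{value:g}")
    return " ".join(parts)


def svm_train_command(train_path, gamma, cost, folds=5):
    """Argument list for svm-train: C-SVC, RBF kernel, n-fold cross-validation."""
    return [
        "svm-train",
        "-s", "0",           # C-SVC
        "-t", "2",           # RBF kernel
        "-g", str(gamma),    # kernel gamma
        "-c", str(cost),     # cost parameter C
        "-v", str(folds),    # n-fold cross-validation
        train_path,
    ]


# Example: one 'Current' sample (+1) and one 'Other' sample (-1).
lines = [to_libsvm_line(1, [0, 0.5, 0, 2]), to_libsvm_line(-1, [1, 0, 0, 0])]
command = svm_train_command("train.txt", gamma=0.03125, cost=2)
```

The gamma and cost values in the example are the optimized values reported with the classification results; the resulting list could be passed to `subprocess.run` on a machine where LIBSVM is installed.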
The test dataset was composed by selecting 50 protein sequences for each of the protein structural classes from both the CATH and SCOP databases. The training and testing of the proposed method are done in two ways: one-against-one classification and one-against-others classification, or multi-class prediction [13].
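The two schemes can be illustrated with a short sketch. One-against-one enumerates every unordered pair of classes, while one-against-others relabels the whole dataset once per class, with that class as 'Current' and everything else as 'Other'. The function names and the +1/-1 label encoding are assumptions of this sketch.

```python
from itertools import combinations

CATH_CLASSES = ["α", "β", "α-β", "Fss"]
SCOP_CLASSES = ["all-α", "all-β", "α/β", "α+β", "μ", "ρ", "σ"]


def one_against_one(classes):
    """All unordered class pairs; each pair defines one binary classifier."""
    return list(combinations(classes, 2))


def one_against_others(classes, labels):
    """For each class, +1 where a sample belongs to it ('Current'), else -1 ('Other')."""
    return {c: [1 if lab == c else -1 for lab in labels] for c in classes}


cath_pairs = one_against_one(CATH_CLASSES)        # 6 binary classifiers
scop_pairs = one_against_one(SCOP_CLASSES[:4])    # 6 classifiers over the four main SCOP classes
scop_binary = one_against_others(SCOP_CLASSES, ["all-α", "σ", "μ"])  # 7 labelings
```

Four classes give six pairs, which matches the six one-against-one classifiers built per database in this study.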
Results and Discussion
One-Against-One Classification
In the one-against-one classification scheme six binary classifiers are built for the CATH database: α vs. β, α vs. α-β, α vs. Fss, β vs. α-β, β vs. Fss and α-β vs. Fss; and six binary classifiers are built for the SCOP database: all-α vs. all-β, all-α vs. α/β, all-α vs. α+β, all-β vs. α/β, all-β vs. α+β and α/β vs. α+β. The prediction accuracy of the jackknife test for each one-against-one classification using the proposed method is shown in Table I. The optimized SVM parameter values g=0.03125 and c=2 are used for both the one-against-one and one-against-others tests. The prediction accuracy is promising for most of the classifiers. It is apparent from Table I that, on the two representative datasets, the proposed method performs better with the CATH database (1110 sequences, 95.20% average) than with the SCOP database (1195 sequences, 86.61% average). These results also suggest that classifiers separating two clearly distinguished structural classes perform better than classifiers involving a 'mixed structural class', such as α vs. α-β and β vs. α-β. This is probably because a 'mixed structural class' shares some structural similarities with the class it is compared against, as in α vs. α-β and β vs. α-β.
TABLE I
Prediction Accuracy of One-Against-One Classification
| Dataset | α vs. β / all-α vs. all-β | α vs. α-β / all-α vs. α/β | α vs. Fss / all-α vs. α+β | β vs. α-β / all-β vs. α/β | β vs. Fss / all-β vs. α+β | α-β vs. Fss / α/β vs. α+β | Average accuracy |
|---|---|---|---|---|---|---|---|
| 277 | 96.09% | 87.16% | 94.70% | 97.18% | 99.20% | 100% | 95.72% |
| 498 | 98.70% | 96.70% | 99.15% | 96.17% | 99.21% | 99.62% | 98.26% |
| 1195 | 85.59% | 85.15% | 85.15% | 83.18% | 90.36% | 90.25% | 86.61% |
| 1110 | 89.53% | 88.32% | 99.75% | 94.63% | 98.96% | 100.00% | 95.20% |
TABLE II
The Binary Classification Accuracies of Jackknife Test (One-Against-Others)
Cross Validation Accuracy in %
| Dataset | α vs. others | β vs. others | α/β vs. others | α+β vs. others | μ vs. others | ρ vs. others | σ vs. others | α-β vs. others | Fss vs. others | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| 277 | 93.43 | 96.35 | 94.16 | 94.89 | - | - | - | - | - | 94.71 |
| 498 | 97.79 | 97.59 | 98.59 | 98.39 | - | - | - | - | - | 98.09 |
| 1195 | 94.81 | 98.74 | 98.24 | 91.80 | 99.17 | 98.66 | 99.75 | - | - | 97.31 |
| 1110 | 92.52 | 97.75 | - | - | - | - | - | 81.7 | 99.91 | 92.97 |
One-Against-Others Classification
Four one-against-others classifiers for the CATH database and seven one-against-others classifiers for the SCOP database are used in the proposed method, and the prediction accuracy is promising for most of them. The results are presented in Table II.
For the SCOP database (datasets 1-3) the classifier 'σ vs. others' gives the highest prediction accuracy, 99.75%. For the CATH database (dataset 4) the classifier 'Fss vs. others' gives the highest prediction accuracy, 99.91%, and 'β vs. others' also gives very good accuracy, 97.75%. It is interesting that the classifiers involving μ, ρ, σ or Fss tend to have higher accuracy than the classifiers involving α or β. This may be because our training datasets contain fewer μ, ρ, σ or Fss protein sequences than α or β protein sequences.
In addition, 50 random protein sequences from each structural class were collected from both the SCOP and CATH databases and used to test the prediction accuracy of the proposed model. Table III shows the rate of correct prediction for each structural class; every class except mainly α (84%) reaches 90% or higher.
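TABLE III
Rate of Correct Prediction for Each Class
| SCOP database | Rate of correct prediction | CATH database | Rate of correct prediction |
|---|---|---|---|
| all-α | 94% (47/50) | Mainly α | 84% (42/50) |
| all-β | 100% (50/50) | Mainly β | 100% (50/50) |
| α/β | 100% (50/50) | α-β | 100% (50/50) |
| α+β | 90% (45/50) | Fss | 100% (50/50) |
| μ | 100% (50/50) | | |
| ρ | 96% (48/50) | | |
| σ | 100% (50/50) | | |
| Overall | 97.14% (340/350) | Overall | 96% (192/200) |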
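The overall rates in Table III follow directly from the per-class counts, as the short check below illustrates. The counts are copied from the table; the dictionary and function names are hypothetical.

```python
# Correctly predicted sequences per class, out of 50 test sequences each.
SCOP_CORRECT = {"all-α": 47, "all-β": 50, "α/β": 50, "α+β": 45, "μ": 50, "ρ": 48, "σ": 50}
CATH_CORRECT = {"Mainly α": 42, "Mainly β": 50, "α-β": 50, "Fss": 50}
PER_CLASS = 50


def overall_rate(correct_by_class, per_class=PER_CLASS):
    """Return (correct, total, percentage rounded to two decimals)."""
    correct = sum(correct_by_class.values())
    total = per_class * len(correct_by_class)
    return correct, total, round(100 * correct / total, 2)


scop_overall = overall_rate(SCOP_CORRECT)  # (340, 350, 97.14)
cath_overall = overall_rate(CATH_CORRECT)  # (192, 200, 96.0)
```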
Conclusion
The number of experimentally determined protein structures is increasing rapidly, which requires continuous updating of the existing protein structure classification databases. Manual classification of all these protein sequences is an extensive job, whereas an automated classification model can significantly reduce the human effort involved. In this paper we presented a very simple but effective method using a support vector machine for protein structural class classification. Our experimental results showed a high success rate compared with most of the existing methods.