REDUCED INITIAL DATASET FROM 33,869 TO 26,011 SEQUENCES
The protein sequences were retrieved from the Ensembl FTP website. The DISOPRED2 server, which was trained on about 750 non-redundant sequences, was used to predict disordered regions in the protein sequences. To keep the accuracy high, the false positive rate threshold was set to 2%. The output was obtained in the form of PSI-BLAST hits and alignments. CD-HIT, a program that clusters a protein sequence database and removes redundant sequences, was then run with a cut-off of 90% identity. The resulting non-redundant dataset contained 26,011 protein sequences.
DEFINITION OF A DISORDERED REGION AND ITS ASSOCIATED PROBLEM
A disordered region is defined as a contiguous stretch of at least 30 predicted disordered residues. Protein structures determined by X-ray crystallography and NMR spectroscopy contain numerous disordered regions (Kissinger et al.; 1999). Distinct disordered regions were those separated from one another by ordered regions. The problem is that transmembrane regions tend to be wrongly predicted as disordered regions, because they are biased in conformation and amino acid composition.
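The 30-residue definition above can be illustrated with a minimal sketch. It assumes a hypothetical per-residue prediction string in which 'D' marks a predicted disordered residue and 'O' an ordered one; this is an illustrative encoding, not the actual DISOPRED2 output format.

```python
from itertools import groupby

MIN_REGION_LENGTH = 30  # threshold taken from the definition in the text


def find_disordered_regions(states, min_length=MIN_REGION_LENGTH):
    """Return (start, end) pairs, 0-based and end-exclusive, for every run of
    consecutive 'D' characters that is at least min_length residues long."""
    regions = []
    position = 0
    for state, run in groupby(states):
        run_length = len(list(run))
        if state == "D" and run_length >= min_length:
            regions.append((position, position + run_length))
        position += run_length
    return regions


# Hypothetical toy prediction: 5 ordered, 30 disordered, 10 ordered, 12 disordered.
example = "O" * 5 + "D" * 30 + "O" * 10 + "D" * 12
print(find_disordered_regions(example))
```

Only the 30-residue run qualifies here; the trailing 12-residue run is too short to count as a disordered region under this definition.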
INPUT SEQUENCE PSIPRED PAGE
USAGE OF MEMSAT3 TO REMOVE THE OVERLAP
MEMSAT3 is a program that predicts the structure and topology of membrane proteins (Jones et al.; 1994). Its statistical tables help to determine compositional bias, which in turn helps to locate and eliminate overlaps between transmembrane regions and disordered regions. In this study, only a small number of overlapping disordered-transmembrane hits were found. MEMSAT3 can be run from the PSIPRED server, or it can be downloaded and run locally; in either case the input sequences are supplied and the prediction and filtering options are specified.
SUBMIT THE JOB
FILTERING OPTIONS
FURTHER DECREMENT TO 11,477 DISORDERED PROTEINS USING PFILT
PFILT was used to filter the disorder predictions. Coiled-coil regions, like transmembrane regions, have an amino acid composition similar to that of disordered hits. PFILT helped to refine the set of disordered regions by excluding the transmembrane and coiled-coil regions. This filtered dataset of disordered domains revealed functional significance (Mészáros et al.; 2009). The final set contained 11,477 disordered proteins and was subjected to further analysis.
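The exclusion step can be pictured as an interval-overlap filter. The sketch below is not how MEMSAT3 or PFILT work internally. It assumes regions are given as half-open (start, end) residue intervals and that a disordered region is dropped whole if it overlaps an excluded segment; the text does not say whether overlapping regions were dropped or trimmed, and all coordinates are invented for illustration.

```python
def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least one residue."""
    return a[0] < b[1] and b[0] < a[1]


def drop_overlapping(disordered, excluded):
    """Keep only disordered regions that overlap none of the excluded segments."""
    return [d for d in disordered if not any(overlaps(d, e) for e in excluded)]


# Hypothetical coordinates for one protein.
disordered = [(10, 45), (100, 140), (200, 260)]
transmembrane = [(30, 52)]    # assumed transmembrane helix
coiled_coil = [(250, 280)]    # assumed coiled-coil segment

print(drop_overlapping(disordered, transmembrane + coiled_coil))
```

In this toy case only the middle region survives, because the first overlaps the transmembrane helix and the last overlaps the coiled coil.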
OCCURRENCE OF DISORDER WITHIN THE PROTEOME
FIGURE 1
PERCENTAGE OF PREDICTED DISORDERED RESIDUES WITHIN PROTEINS
By definition, the intrinsically unstructured proteins in the dataset contained at least one disordered region of 30 residues or more. When the percentage of disordered residues per protein was calculated in bins of 10%, the majority of proteins had about 30% of their residues predicted to be unstructured. When the genomes of complex species were compared with those of simpler species, disordered regions were more numerous in the former (Cheng et al.; 2006).
200546331 BIOL5202M
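The 10% binning of disordered-residue percentages could be sketched as follows. The protein identifiers and the 'D'/'O' per-residue strings are hypothetical placeholders, and putting a fully disordered protein in the top bin is an assumption rather than something the source specifies.

```python
from collections import Counter


def percent_disordered(states):
    """Percentage of residues marked 'D' in a per-residue state string."""
    return 100.0 * states.count("D") / len(states)


def bin_label(percent, width=10):
    """Label of the width-% bin containing percent; 100% joins the top bin."""
    low = min(int(percent // width) * width, 100 - width)
    return f"{low}-{low + width}%"


# Hypothetical proteins encoded as per-residue states.
proteins = {
    "P1": "D" * 30 + "O" * 70,   # 30% disordered
    "P2": "D" * 35 + "O" * 65,   # 35% disordered
    "P3": "D" * 100,             # fully disordered
}

histogram = Counter(bin_label(percent_disordered(s)) for s in proteins.values())
print(dict(histogram))
```

Counting proteins per bin in this way gives the kind of histogram that the figure summarises.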
FIGURE 2
INTRINSICALLY UNSTRUCTURED REGIONS FOUND WITHIN N- AND C-TERMINI
The reported result is that almost 30% of the 11,477 protein hits consisted of intrinsically unstructured regions. When the analysis was extended to the non-redundant dataset, almost 97% of the proteins contained disordered residues once shorter disordered stretches were included, and most proteins (95%) had fewer than five disordered regions. Twenty-one proteins were disordered along their entire length. Their homologues were found using BLAST and then analysed against DisProt, which gave 9 hits; the remaining 12 were HMG (High Mobility Group) proteins involved in distinct functions.
SELECT THE TOOL FOR CELL COMPONENT IN THE GO DATABASE
INFORMATION ON DISORDERED REGIONS IN THE CHROMOSOMES
When the chromosomal location information was retrieved from the Ensembl SQL database, 44% of the proteins assigned to a chromosome were found to contain disorder. Specifically, the figure was 38% on chromosome 21 and 50% on chromosomes 12 and X. The most striking detail was that the 13 proteins encoded on the mitochondrial chromosome did not contain any disordered region; the authors note that the longest disordered stretch on the mitochondrial chromosome was only 10 residues.
INPUT QUERY NAME OR GO TERMS OR PROTEIN ID
TABLE 1
OCCURRENCE OF DISORDERED REGIONS IN THE CELLULAR COMPONENTS
Gene Ontology was used to find the occurrence of disordered regions in the cellular components (11,477-protein dataset). The aim was to determine any cellular-location bias for the intrinsically unstructured proteins. BioMart was used to extract the identifiers for the filtered dataset, and the most disordered regions were found in the nucleus, followed by the membrane. Table 1 also shows that disorder occurs in the majority of the cell organelles.
PROTEIN DOMAINS AND DISORDERED REGIONS
DISORDERED REGIONS IN PFAM DOMAINS AND THEIR FUNCTIONAL SIGNIFICANCE
The co-occurrence of disordered regions and Pfam domains was examined. Specifically, 13% of the Pfam domains preceded or followed a disordered region; 57 pairs of domains, in about 421 proteins, occurred with a disordered region sandwiched between them, and 90 pairs occurred without any disordered region. Interestingly, 163 Pfam domains appeared in the same protein as a disordered hit. Functional differences were assessed using pfam2go, which maps Pfam domains to GO terms; some differences between ordered and disordered proteins were found, but there was very little evidence to support them.
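The "sandwiched" arrangement can be made concrete with a small sketch. It assumes domains are given as (name, start, end) tuples and disordered regions as (start, end) pairs, all half-open residue intervals on one protein. The domain names and coordinates are hypothetical, and the test used here (the region lies entirely in the gap between two consecutive domains) is one plausible reading of the description, not the authors' stated procedure.

```python
def sandwiched_pairs(domains, disordered):
    """Return (domain_a, region, domain_b) for each disordered region that lies
    entirely in the gap between two consecutive domains."""
    ordered = sorted(domains, key=lambda d: d[1])  # sort domains by start
    hits = []
    for (name_a, _, end_a), (name_b, start_b, _) in zip(ordered, ordered[1:]):
        for start, end in disordered:
            if end_a <= start and end <= start_b:
                hits.append((name_a, (start, end), name_b))
    return hits


# Hypothetical annotation of a single protein.
domains = [("PF_A", 0, 80), ("PF_B", 130, 220)]
disordered = [(85, 125), (230, 270)]

print(sandwiched_pairs(domains, disordered))
```

Here the first disordered region sits between the two domains, while the second follows the last domain and so is not counted as sandwiched.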
TABLE 2
COMPARISON OF PROTEINS THAT ARE DISORDERED AND SPLICED AND NOT DISORDERED AND NON SPLICED
ALTERNATIVE SPLICING AND DISORDERED REGIONS
Alternative splicing of the 7,179 genes yielded 18,830 splice variants. The connection between disordered regions and splice variants was tested using the Wilcoxon rank test, and the result was that the number of spliced disordered regions was double that of non-spliced disordered regions. Most of the proteins that were not spliced did not contain any disordered regions.
FIGURE 3
LENGTH DIFFERENCE BETWEEN SPLICED AND NON-SPLICED, BETWEEN DISORDERED AND ORDERED PROTEINS
As mentioned earlier, disordered proteins were longer than non-disordered proteins. To test this hypothesis, the authors grouped the original dataset into length bins of 50 residues and applied cross-tabulation analysis. Combined splicing and disorder events proved to be length-dependent, with significant P values at different lengths.
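The length binning and cross-tabulation can be sketched in a few lines. The records below are invented for illustration, and the sketch only counts proteins per bin and category; it does not reproduce the significance testing behind the reported P values.

```python
from collections import defaultdict


def length_bin(length, width=50):
    """Lower edge of the width-residue bin that contains a protein length."""
    return (length // width) * width


def cross_tabulate(records, width=50):
    """Count proteins per length bin for each (spliced, disordered) combination."""
    table = defaultdict(lambda: defaultdict(int))
    for length, spliced, disordered in records:
        table[length_bin(length, width)][(spliced, disordered)] += 1
    return table


# Hypothetical records: (length in residues, is_spliced, is_disordered)
records = [
    (120, True, True),
    (130, False, False),
    (149, True, True),
    (410, True, False),
]

table = cross_tabulate(records)
for low in sorted(table):
    print(f"{low}-{low + 49}", dict(table[low]))
```

Each row of the resulting table is a 2x2 contingency count for one length bin, which is the form a per-bin association test would take as input.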
TCOFFEE PASTE THE SEQUENCE
EVOLUTIONARY CONSERVATION OF DISORDERED REGIONS
RUN THE ALIGNMENT
SET THE PARAMETERS
FIGURE 4
PERCENTAGE OF DISORDERED RESIDUES CONSERVED OR SPLICED
The evolutionary conservation of disordered regions was analysed using T-COFFEE alignments. Splice variants with disorder were aligned, and the percentage of disordered residues conserved was taken from the longest disordered region. The weakness of this figure is that the graph is not well defined. Conserved disordered and spliced disordered residues are plotted as separate categories, yet spliced disordered hits may also be conserved. In addition, the aligned protein sequences came from the same gene identifiers, which seems questionable because different variants may differ in their percentage similarity. Hence the evolutionary conservation is not clearly explained.
TABLE 3
COMPARISON OF CONSERVATION OF PROTEINS IN HUMANS WITH OTHER TAXA
The main premise is that there are evolutionary constraints on disordered residues. The 163 Pfam domains in human were compared with those in other eukaryotic taxa, and 62 domains were inferred to be unique across all the taxa. Human had the most disordered residues, followed by Drosophila, when compared with the other taxa. These 62 domains were then mapped to GO terms to identify their functions. From this functional analysis, a relationship with evolutionary conservation was drawn.
DEVELOPED DATABASE OF PREDICTED DISORDERED PROTEINS
DisoDB: DATABASE OF PREDICTED DISORDERED PROTEINS
DisoDB was developed as a repository of disorder information covering 39 fully sequenced eukaryotic proteomes. It is presented as the largest and most widely used database of predicted intrinsically unstructured proteins. DisProt is an annotated database that encompasses 523 proteins and 1195 disordered regions.