The Gene Name Dictionary And Interaction Word List Biology Essay

Published: November 2, 2015 Words: 1822

A comprehensive gene name dictionary is essential for resolving gene name conflicts and ambiguities, which arise because different symbols are used to denote the same gene. Each gene has a single approved name and symbol assigned by the HGNC [68]. Other symbols in use are listed in HGNC as aliases alongside the approved symbol. The dictionary database built from HGNC holds 33343 approved symbols and 45148 other names or aliases. Even so, some symbols that appear in the literature are not found in HGNC. In those cases the GeneCards [69] service was consulted and the missing aliases were added to the dictionary database. Although GeneCards has a very comprehensive gene name alias list, it limits downloads to 100 genes per day.
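The lookup that such a dictionary supports can be sketched in Python as follows. This is a minimal illustration and not the project's actual database: the record layout, the function names `build_dictionary` and `resolve`, and the two sample rows are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical rows in the shape (approved_symbol, [other names]).
# They reuse symbols mentioned in this essay; they are not an HGNC export.
RECORDS = [
    ("VWF", ["von Willebrand factor"]),
    ("IL6", ["IL-6", "interleukin 6"]),
]

def build_dictionary(records):
    """Map every lower-cased name or alias to the set of approved symbols."""
    lookup = defaultdict(set)
    for symbol, names in records:
        for name in [symbol, *names]:
            lookup[name.lower()].add(symbol)
    return lookup

def resolve(name, lookup):
    """Return the approved symbol, or None if the name is unknown or ambiguous."""
    hits = lookup.get(name.lower(), set())
    return next(iter(hits)) if len(hits) == 1 else None

lookup = build_dictionary(RECORDS)
print(resolve("il-6", lookup))
```

Storing a set of symbols per name keeps ambiguous aliases visible instead of silently letting the last entry win.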

Interaction words were collected from the literature and from WordNet. First, 53 base words used to define interactions were collected. These were then expanded into a list of 197 words, which covers the words most commonly used to describe an interaction between biological entities.
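The essay does not state how the base words were expanded. One plausible way, shown here only as a sketch, is to generate the regular inflected forms of each base verb. The function name `expand` and the three base words are assumptions chosen for illustration; irregular verbs (for example "bind", "bound") and noun forms would need manual entries.

```python
def expand(base):
    """Return simple inflected forms of a regular English verb."""
    if base.endswith("e"):
        stem = base[:-1]
        return {base, base + "s", base + "d", stem + "ing"}
    return {base, base + "s", base + "ed", base + "ing"}

BASE_WORDS = ["activate", "inhibit", "regulate"]  # illustrative subset only
WORD_LIST = set().union(*(expand(word) for word in BASE_WORDS))
print(sorted(WORD_LIST))
```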

7.2. Information retrieval

PubMed is the world's largest literature resource for biomedical research and contains over 2 million articles, so it was selected as the main literature source. PubMed was queried with four different types of query, with and without limits, to select a sample of articles that gives the best balance between the accuracy of the final output and the workload of the methodology. Table 06 shows the query strings and the number of articles retrieved on 2010/11/1. Terms entered in the query box are automatically mapped to Medical Subject Headings (MeSH) by PubMed and were not modified further. Selecting too many articles (24931 articles for the term preeclampsia) would be computationally expensive. Selecting too few (712 articles for the term genetic AND preeclampsia with limits) could affect the final result. Processing 1230 article abstracts is justifiable, as it should give the best results with the available computational power. Once the processing pipeline is optimized, it will be possible to process all abstracts and extract any interactions they contain. The date 2000/01/01 was selected as the lower limit because the genetic aspects of diseases have been well studied and described during and after the year 2000, in parallel with the Human Genome Project and advances in technology.
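A date-limited query of this kind could also be issued programmatically through the NCBI E-utilities ESearch service. The sketch below only builds the request URL and sends nothing. It assumes that a publication-date range is the only limit applied, and the function name `build_query` and the `retmax` value are choices made for this example. The essay itself describes using the PubMed query box, not this interface.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_query(term, mindate=None, maxdate=None, retmax=100):
    """Build an ESearch URL for PubMed; no request is sent here."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    if mindate and maxdate:
        # ESearch expects mindate and maxdate together, plus a date type.
        params.update({"datetype": "pdat", "mindate": mindate, "maxdate": maxdate})
    return ESEARCH + "?" + urlencode(params)

url = build_query("genetic AND preeclampsia", "2000/01/01", "2010/11/01")
print(url)
```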

7.3. Text mining

The first step in text mining is sentence boundary detection (SBD). According to the results, the GeniaSS tool performed remarkably well on the given dataset, with an error rate of only 0.53%. Most of the errors are due to ambiguous use of the decimal point. This is expected, as the decimal point is responsible for most SBD errors in any corpus. Some of the errors are:

* A single word with an initial capital letter followed by a dot (PMID:20818961)

* An abbreviation followed by a dot, a space and a number (PMID:20044877), or "e.g." followed by a space and a word starting with a capital letter (PMID:19321927)

* Citations (PMID:16427138, 15123623)

However, these errors have no significant effect on the other steps or on the final outcome.
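The error classes listed above are easy to reproduce with a naive rule-based splitter. The sketch below is not how GeniaSS works. It is a deliberately simple regular expression, and the sample text is invented for illustration, showing why an abbreviation followed by a dot, a space and a number produces a false sentence boundary.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter or digit."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text)

# Invented example: two real sentences, one abbreviation followed by a number.
text = "Levels rose by approx. 2 fold. Results were significant."
pieces = naive_split(text)
print(pieces)
```

The splitter returns three pieces for two sentences, because it cannot tell the dot after the abbreviation from a full stop.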

Once SBD was done, tokenizing and named entity recognition (NER) were performed with GeniaTagger. GeniaTagger is trained on the GENIA corpus, which is considered the gold standard for biological text mining tasks. Like any other text mining tool, it is not perfect. Table 13 shows the NER performance of GeniaTagger as reported by its creators.

Table 13 : Performance of GeniaTagger in NER (taken from GeniaTagger web site)

| Entity Type | Recall | Precision | F-score |
| --- | --- | --- | --- |
| Protein | 81.41 | 65.82 | 72.79 |
| DNA | 66.76 | 65.64 | 66.20 |
| RNA | 68.64 | 60.45 | 64.29 |
| Cell Line | 59.60 | 56.12 | 57.81 |
| Cell Type | 70.54 | 78.51 | 74.31 |
| Overall | 75.78 | 67.45 | 71.37 |
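The F-score values in Table 13 are consistent with the balanced F-score, the harmonic mean of recall and precision. A short check, with the function name `f_score` chosen for this example:

```python
def f_score(recall, precision):
    """Balanced F-score: the harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

protein = f_score(81.41, 65.82)   # Protein row of Table 13
overall = f_score(75.78, 67.45)   # Overall row of Table 13
print(round(protein, 2), round(overall, 2))
```

Because the harmonic mean is pulled toward the lower of the two values, the lower precision for proteins weighs on the F-score despite the high recall.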

These values indicate that GeniaTagger suffers from some of the difficulties inherent in biomedical text. As a result, it can miss some important entities, which could affect the final output. Examination of the GeniaTagger output showed that most of these problems start at the tokenizing step and propagate into the subsequent steps.

Genes always have an official symbol and in most cases also have an alias or a set of aliases. Some authors use the full name of a gene or protein rather than the symbol. Schuemie et al. (2004) analyzed the use of gene names and symbols in abstracts and full text and found that 30% of the gene symbols in abstracts and 18% in full text are accompanied by their gene names [66]. According to Armstrong C, aliases are used more often than full names. Aliases are highly ambiguous, and current attempts at disambiguation vary from 77 to 100 percent in accuracy, depending on the details of the model and the species to which the gene belongs [67].

Most symbols are either approved abbreviations or aliases, whereas a full name generally consists of more than one word. At the tokenizing step, a full name is broken into its constituent parts and loses its meaning. Most of the time these parts are not symbols by themselves and cannot be identified by comparison with the gene name dictionary. The result is a loss of information. For example, von Willebrand factor is broken into 3 tokens: von, Willebrand, factor (PMID:20939248). Taken as a single unit it is a gene with the approved symbol VWF (hgnc_id=12726), but as three separate words the information is lost completely. In some cases a token may be wrongly recognized as a gene symbol. For example, angiotensin II is broken into 2 tokens: angiotensin and II. Here the Roman numeral II is an alias for the gene GCNT2. This is a completely wrong identification and has a definite effect on the final outcome.
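The difference between single-token lookup and phrase-level lookup can be illustrated with a longest-match scan over the token sequence. This is a sketch of one possible workaround and not a step of the pipeline described here. The mini-dictionary `GENE_NAMES`, the function name `tag_longest_match` and the sample sentence are assumptions built from the examples in the text, and lower-casing the tokens is a simplification.

```python
# Hypothetical mini-dictionary: token tuples mapped to approved symbols.
GENE_NAMES = {
    ("von", "willebrand", "factor"): "VWF",
    ("angiotensin", "ii"): "AGT",
    ("ii",): "GCNT2",
}
MAX_LEN = max(len(key) for key in GENE_NAMES)

def tag_longest_match(tokens):
    """Scan left to right, preferring the longest dictionary phrase at each position."""
    lowered = [token.lower() for token in tokens]
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            symbol = GENE_NAMES.get(tuple(lowered[i:i + n]))
            if symbol:
                found.append((" ".join(tokens[i:i + n]), symbol))
                i += n  # skip the tokens consumed by this phrase
                break
        else:
            i += 1  # no phrase starts here; move to the next token
    return found

result = tag_longest_match("Plasma angiotensin II and von Willebrand factor".split())
print(result)
```

Matched as a phrase, angiotensin II resolves to AGT. The token II on its own still resolves to GCNT2, which is the misidentification described above.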

Sometimes names are confusing and cannot be disambiguated without considering the surrounding words. Some authors use abbreviations or symbols that are not common aliases for the gene they refer to. The problem is aggravated when such a symbol is an approved symbol or an alias for a completely different gene. Inconsistent use of Roman and Arabic numerals also results in name ambiguity. Table 14 illustrates this problem for angiotensin II (AngII) receptor, taken from a PubMed article (PMID: 20923405).

Table 14 : Gene name ambiguity

| Word(s) | Approved symbol |
| --- | --- |
| angiotensin | - |
| II | GCNT2 |
| AngII | AGT |
| receptor | - |
| angiotensin II | AGT |
| angiotensin II receptor | AGTR2 |
| Ang2 | ANGPT2 |

Disambiguating names of this sort is still a challenge in text mining and relies on manual curation.

Spelling variation also complicates gene name identification. Table 15 shows the variations of interleukin 6 found in articles, how each is tokenized by GeniaTagger, and the effect on name recognition.

Table 15 : Spelling variations and their effect on gene name recognition

| Spelling variation | Tokens | Approved symbol |
| --- | --- | --- |
| interleukin 6 | Interleukin, 6 | - |
| IL 6 | IL, 6 | - |
| IL-6 | IL-6 | IL6 |
| IL - 6 | IL, -, 6 | - |
| IL- 6 | IL-, 6 | - |
| IL -6 | IL, -6 | - |
| IL6 | IL6 | IL6 |

This example shows how tokenizing affects gene name recognition and how it can lead to loss of information. These issues are recognized in the current project, and they could have some effect on the final output.
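One way to reduce this loss, sketched here as an assumption and not as part of the current pipeline, is to normalize symbol-like strings before tokenizing. The function name `normalise_symbol` and the rule itself (two to five capital letters, optional spaces and hyphen, then digits) are illustrative choices. Applied to running text, such a rule could also merge unrelated words and numbers, so it would need to be restricted to candidate gene mentions.

```python
import re

def normalise_symbol(text):
    """Collapse 'CAPITALS [spaces/hyphen] digits' into one token, e.g. 'IL - 6' to 'IL6'."""
    return re.sub(r"\b([A-Z]{2,5})\s*-?\s*(\d+)\b", r"\1\2", text)

variants = ["IL 6", "IL-6", "IL - 6", "IL- 6", "IL -6", "IL6"]
normalised = {variant: normalise_symbol(variant) for variant in variants}
print(normalised)
```

The full name interleukin 6 is left unchanged by this rule and would still have to be matched as a phrase.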

According to the results in Table 08, GeniaTagger tagged more words as protein than as DNA or RNA. This has no major effect on the final output, because in most cases the same approved symbol is used to identify the DNA, RNA and protein. An example is given in Table 16.

Table 16 : Use of symbols across Proteins, DNA and RNA

| Gene symbol (HGNC) | Protein symbol (UniProt) | RNA (NCBI) |
| --- | --- | --- |
| SOD1 | SOD1 | SOD1 |
| FLT1 | FLT1 | |
| TLR3 | TLR3 | |

In some cases splice variants of genes appear in the literature. They have no approved name other than the symbol of their base gene. These names are also difficult to recognize during the NER and name normalization steps. One example is sFlt-1 or sVEGFR-1, which needs manual curation.

It was assumed that a sentence containing more than two unique gene symbols and at least one interaction word describes some sort of interaction between those genes. All sentences were filtered according to this rule, and 457 out of 11172 sentences satisfied it. The rule also effectively removed sentences that describe some action related to preeclampsia itself: 210 such sentences were removed from the candidate sentence list.
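The filtering rule can be sketched as a small predicate. The function name `is_candidate`, the placeholder symbols and the three interaction words are hypothetical. The gene threshold is exposed as the parameter `min_genes`: the text says "more than two", while two is the minimum needed for a pairwise interaction, and the default of 2 used here is an assumption.

```python
def is_candidate(tokens, gene_lookup, interaction_words, min_genes=2):
    """Keep a sentence with at least `min_genes` unique symbols and an interaction word."""
    symbols = {gene_lookup[token] for token in tokens if token in gene_lookup}
    has_interaction = any(token.lower() in interaction_words for token in tokens)
    return len(symbols) >= min_genes and has_interaction

# Hypothetical inputs: placeholder symbols, with GA as an alias of GENE_A.
GENES = {"GENE_A": "GENE_A", "GENE_B": "GENE_B", "GA": "GENE_A"}
WORDS = {"activates", "inhibits", "binds"}

print(is_candidate("GENE_A activates GENE_B".split(), GENES, WORDS))
print(is_candidate("GENE_A inhibits GA".split(), GENES, WORDS))
```

Counting normalized symbols, not raw tokens, means that a gene mentioned twice under different aliases is counted once.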

7.4. Information Extraction

Parsing and feature extraction are still ongoing at the time of writing. Even without this step, however, the number of sentences that need to be examined manually has been reduced to 457 out of 11172. This is a reduction of nearly 96% and is a remarkable achievement. Of those 457 sentences, 42 describe an interaction between genes. These interactions were extracted by reading each sentence. Manual reading is essential to measure the recall and precision of the complete process.
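The figures quoted above can be checked with a few lines of arithmetic. The variable names are chosen for this example.

```python
total, candidates, confirmed = 11172, 457, 42

reduction = 1 - candidates / total   # share of sentences filtered out
hit_rate = confirmed / candidates    # share of candidates confirmed by manual reading

print(f"reduction: {reduction:.1%}, confirmed among candidates: {hit_rate:.1%}")
```

The second figure describes only the candidate sentences. Estimating recall would also require reading a sample of the sentences that the filter discarded.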

7.5. Gene-gene Network

In total, 59 genes were identified, with 51 interactions between them. The network graph is shown in Figure 22. It can be described as an undirected graph with 59 nodes and 51 edges. The network is not fully connected and consists of 12 components. The diameter of the network (the largest distance between any two nodes) is 6, and the average path length (the average distance between any two nodes) is 2.5. The network was analyzed according to node degree, and Figure 23 shows the degree distribution of the nodes. In the network obtained by this method, the FLT1 and TNF genes have the most interactions with other genes. They could therefore play a more important role in the development of the disease than the other genes in the network.

Figure 25 : Distribution of degree of gene nodes
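The network measures reported in this section (number of components, diameter, average path length and node degree) can be computed from an edge list with a breadth-first search. The sketch below uses a toy edge list with placeholder node names, not the 59-gene network, and the function names are chosen for this example. For a disconnected graph it assumes that path lengths are averaged only over pairs of nodes in the same component. The essay does not state which convention was used.

```python
from collections import defaultdict, deque

def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def bfs_distances(graph, start):
    """Shortest path length (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def summarise(edges):
    graph = build_graph(edges)
    seen, components, lengths = set(), 0, []
    for node in graph:
        dist = bfs_distances(graph, node)
        if node not in seen:  # first node met from a new component
            components += 1
            seen.update(dist)
        lengths.extend(d for d in dist.values() if d > 0)
    return {
        "nodes": len(graph),
        "edges": len(edges),
        "components": components,
        "diameter": max(lengths),
        "average_path_length": sum(lengths) / len(lengths),
        "degree": {n: len(neighbours) for n, neighbours in graph.items()},
    }

# Toy edge list with placeholder node names (not the essay's data).
TOY_EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E"), ("X", "Y")]
summary = summarise(TOY_EDGES)
print(summary)
```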

8. Conclusions and Future work

The current approach of using text mining methods to extract gene-gene interactions was found to be very effective, although not perfect. It reduced the number of sentences that need to be read manually to 457 out of 11172, i.e. only about 4% of the total work. It identified 59 genes participating in 51 interactions. The genes TNF and FLT1 have the largest number of connections with others. This generates the hypothesis that they may make a major contribution to the pathogenesis of preeclampsia, which could be tested in the laboratory. GeniaSS is a very good tool for SBD, and the output of GeniaTagger could be used to improve gene name identification. Once automatic information extraction is completed, manual reading will no longer be required.

This method can be improved in several ways. Many more words are used to indicate an interaction, so the interaction word list can be expanded by adding more verbs, nouns and spelling variations (such as localize/localise). The gene name dictionary needs to be expanded to include more aliases. Another important aspect that needs to be handled is tokenizing and NER. Automatic information extraction is currently under development. The ultimate goal of this approach is to develop a pipeline that can extract gene or protein interactions for any given disease from free-text articles.