Studying The Developments To Data Records Information Technology Essay

Published: November 30, 2015 Words: 1583

Most of the content of T3SEDb records comes from NCBI Entrez Protein, hereafter referred to as NCBI Protein. Beyond NCBI Protein, individual T3SEDb records can be cross-referenced to a wider range of resource databases by linking relevant data to them. Such resources include nucleotide sequence databases and genomic and/or proteomic collections (see table X). These links would provide additional or complementary information useful to researchers working on T3SE.

Integrating information from different resources not only ensures that the data provided are comprehensive but, when paired with regular updates, also enhances the credibility of the displayed information. UniProtKB, for example, updates its cross-references on a monthly basis.

To better serve the scientific community and increase the value of this database, entries should encompass more meaningful, usable data for researchers working on T3SE; examples are listed in table X.

Information | Source Database(s)
Gene location/properties | Ensembl, EMBL, Entrez Gene, RefSeq DNA, UniGene
Domains | SMART, InterPro, Pfam
Structures | PDB, HSSP, PSSH
Related Diseases | OMIM
Protein function | Gene Ontology
Pathways | KEGG, Panther, Reactome
Literature | CiteXplore, UK PubMed Central

Table X.

Records can be enriched through the curation process. If T3SEDb expands the information each record holds, each record should still be kept to a single web page, as this makes the information easier to explore and minimizes the hassle of redirection.

Expanding curation efforts

This can be done with regular data and literature mining, with data mining extending beyond keyword searches and literature mining extending beyond the publications attached to NCBI Protein records.

Data mining

Besides keyword and sequence searches and the submission of new sequences by researchers, published literature documenting novel protein sequences can be searched regularly and incorporated into T3SEDb as well.

New data can be imported through manual or automated means. OpenFluDB, a database for human and animal influenza viruses, imports its data daily from GenBank. Although automated import is desirable, checks must be in place for cases where the automated parsing process set up by T3SEDb administrators fails. OpenFluDB handles such cases by setting the affected entries aside for manual curation.
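The idea of diverting failed entries to manual curation can be illustrated with a minimal Python sketch. It is not how OpenFluDB or T3SEDb is implemented; the record fields (`accession`, `organism`, `sequence`), the set of accepted residue letters and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the fields a raw record must carry to be imported.
REQUIRED_FIELDS = ("accession", "organism", "sequence")
# Assumed alphabet: 20 standard amino acids plus common ambiguity codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO")


@dataclass
class ImportResult:
    accepted: list = field(default_factory=list)      # parsed records
    manual_queue: list = field(default_factory=list)  # (raw record, reason)


def parse_record(raw):
    """Validate one raw record; raise ValueError if it cannot be parsed."""
    for name in REQUIRED_FIELDS:
        if not raw.get(name):
            raise ValueError(f"missing field: {name}")
    sequence = raw["sequence"].upper()
    invalid = set(sequence) - VALID_RESIDUES
    if invalid:
        raise ValueError(f"unexpected residues: {sorted(invalid)}")
    return {
        "accession": raw["accession"],
        "organism": raw["organism"],
        "sequence": sequence,
    }


def import_records(raw_records):
    """Import what parses cleanly; set everything else aside for a curator."""
    result = ImportResult()
    for raw in raw_records:
        try:
            result.accepted.append(parse_record(raw))
        except ValueError as err:
            result.manual_queue.append((raw, str(err)))
    return result
```

Keeping the failure reason next to each diverted record gives the curator a starting point, and a single bad entry does not stop the rest of the daily import.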

T3SEDb allows its users to submit new T3SE sequences. However, the submission format suits only small numbers of sequences, as sequences have to be submitted one at a time. Where batch upload is required, providing a template would be useful. OpenFluDB uses a combination of a Microsoft Excel file in a fixed format and a text file holding the nucleotide sequences in FASTA format. OpenFluDB also has a system in place to deal with batch upload failures caused by inconsistent annotation or improper sequences; when a failure occurs, the user is notified.
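A batch upload check of this kind might look like the following sketch. It assumes protein sequences submitted as FASTA text (OpenFluDB's own template holds nucleotide sequences), and the function names and error messages are hypothetical. Each rejected record is returned with a reason, which could then be sent back to the user.

```python
# Assumed alphabet: 20 standard amino acids plus common ambiguity codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBXZUO")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data before first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def validate_batch(text):
    """Split a batch into accepted records and (header, reason) problems."""
    accepted, problems = [], []
    seen = set()
    for header, seq in parse_fasta(text):
        if not header:
            problems.append((header, "empty header"))
        elif header in seen:
            problems.append((header, "duplicate header"))
        elif not seq:
            problems.append((header, "empty sequence"))
        elif set(seq) - AMINO_ACIDS:
            problems.append((header, "non-amino-acid characters"))
        else:
            accepted.append((header, seq))
        seen.add(header)
    return accepted, problems
```

Annotation consistency checks (for example, comparing the FASTA headers against rows in an accompanying spreadsheet) would sit on top of this and are not shown.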

After new sequences are added, the next step is attaching meaning to them. Generating meaningful data requires information from databases, which can be obtained by either automated or manual means. This process is known as curation. UniProt describes it as "information in each entry is annotated and reviewed by a curator".

Manual curation as practised by UniProtKB can be summarized in six major steps: (i) sequence curation, (ii) sequence analysis, (iii) literature curation, (iv) family-based curation, (v) evidence attribution and (vi) quality assurance and integration of completed entries. Unsurprisingly, manual curation is demanding in both labor and time. Given the speed at which new sequences are discovered every day, manual curation alone cannot keep pace; moreover, experimental data for the encoded proteins are not always available, which adds to the difficulty. However, the high quality of information drawn from the literature and manually verified makes manual curation essential for providing accurate data. To offset some of these difficulties, automated curation methods can be used. These involve applying information from known proteins to uncharacterized ones. With both measures in place, it is possible to incorporate as much of the available information as possible into the records.

Information added during curation can be linked to its source so that users can assess the credibility and accuracy of annotations, while also reducing the time researchers spend browsing the different databases available.

Once the information is comprehensive, accurate and up to date, the next step is reducing redundancy in the database. This can be achieved by running BLAST searches of the sequences to be manually curated against T3SEDb, which picks up homologs of the same gene. Thereafter, sequences from the same gene and organism can be compared and combined into a single record.
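The final merging step can be sketched in Python as follows. The BLAST search itself is assumed to have been run already, so each record arrives with a gene name and organism assigned. The field names, and the rule of keeping the longest sequence as the representative, are assumptions for illustration rather than T3SEDb policy.

```python
from collections import defaultdict


def merge_redundant(records):
    """Combine records that share a gene and organism into single records.

    Assumed rule: the longest sequence represents the group, while all
    accessions and literature references from the group are retained.
    """
    groups = defaultdict(list)
    for rec in records:
        # Case-insensitive key so "EffA" and "effA" fall into one group.
        key = (rec["gene"].lower(), rec["organism"].lower())
        groups[key].append(rec)

    merged = []
    for recs in groups.values():
        representative = max(recs, key=lambda r: len(r["sequence"]))
        merged.append({
            "gene": representative["gene"],
            "organism": representative["organism"],
            "sequence": representative["sequence"],
            "accessions": sorted({r["accession"] for r in recs}),
            "references": sorted(
                {ref for r in recs for ref in r.get("references", [])}
            ),
        })
    return merged
```

In practice a curator would still want to compare the grouped sequences before accepting the merge, since the text above calls for comparison as well as combination.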

Literature mining

Currently, the only publications linked in T3SEDb are those reported within NCBI Protein records. As journal articles are the main source of experimental data, publications hold a wealth of information, and many different annotations can potentially be identified from them, including protein/gene names, protein-protein interactions and post-translational modifications. T3SEDb can therefore consider expanding its literature mining efforts. Relevant publications can be searched for in literature databases (see table X) using appropriate tools.

T3SEDb entries are currently simple, so standardization of terms and vocabularies is not a major concern. However, if T3SEDb decides to expand the information provided in its records, standardized vocabularies would maintain consistency across records and simplify data access.

Customizing format of displayed data

T3SEDb has a general output of 11 fields: Entry ID (the T3SEDb identifier), effector name, NCBI Entrez Protein accession number, source organism of the effector, sequence length, experimental status, last sequence update, name and accession of the primary/source database from which the effector was retrieved, sequence data, literature references (hyperlinked PubMed IDs) and T3SEDb curation comments. The 11 fields are fixed; there is no option to toggle any of them on or off.

Users looking to work with data from T3SEDb would benefit from more fields being listed, with the option of filtering which fields are displayed. The option to export data in a convenient format, such as Excel or plain text, would also help users manipulate the data they have shortlisted. UniProt, for example, allows configuration of both individual entries and returned result sets. For individual entries, the order of information sections can be customized, while returned result sets can be arranged into the desired rows and columns of summarized information. These customizations can also be saved, so that future queries or new data releases require no recustomization. UniProt also allows data export in different formats, such as plain text, XML, RDF and GFF for data files, and FASTA for sequences.
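Field filtering and export could be as simple as the sketch below, which writes only the user-selected fields to delimited text that spreadsheet software such as Excel can open. The field names used here are hypothetical stand-ins for T3SEDb's fields, and the function is an illustration, not an existing T3SEDb or UniProt feature.

```python
import csv
import io


def export_records(records, fields, delimiter="\t"):
    """Return the chosen fields of each record as delimited text.

    `fields` sets both which columns appear and their order; a field
    missing from a record is written as an empty cell.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        writer.writerow({name: rec.get(name, "") for name in fields})
    return buffer.getvalue()
```

A saved customization, in this picture, is just the stored `fields` list, which can be reapplied to any later query result.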

Improving flexibility of Search

T3SEDb allows on-site sequence similarity searches using BLAST (BLAST tab), as well as text queries through the 'Search' tab using an NCBI accession number, domain or general keyword. Searches can be restricted to T3SE sequences of a certain experimental status. More often than not, however, users have identifiers other than those of NCBI Protein. For example, users starting with UniProt identifiers have to visit the corresponding record in the NCBI Protein database and obtain the accession number before searching T3SEDb.

It would be useful if the search options were expanded. For example, UniProt allows for full and field-based text searches, sequence similarity searching, multiple sequence alignments, batch retrieval and database identifier mapping. Additionally, UniProt provides a general guide to keyword search procedures (see http://www.uniprot.org/help/text-search). This is something T3SEDb can emulate.

Researchers often work with a group of sequences rather than a single one. In this case, a batch retrieval tool would greatly facilitate data mining. NCBI Entrez and UniProt are examples of databases with batch retrieval tools.

Identifier mapping is UniProt's way of integrating the different identifiers used in major databases into its own system of identifiers. It allows more than one identifier to be mapped at a time, making it essentially a form of batch retrieval in itself.
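For T3SEDb, identifier mapping with batch input could be sketched as a lookup table from external identifiers to T3SEDb entry IDs. The record layout (`entry_id`, `ncbi_accession`, `uniprot_id`) and both function names are assumptions; this is not UniProt's mapping service, only an illustration of the idea.

```python
def build_id_index(records, id_fields=("ncbi_accession", "uniprot_id")):
    """Map every known external identifier to its T3SEDb entry ID."""
    index = {}
    for rec in records:
        for field_name in id_fields:
            identifier = rec.get(field_name)
            if identifier:
                index[identifier.upper()] = rec["entry_id"]
    return index


def map_identifiers(index, queries):
    """Resolve a batch of identifiers; report the ones that are unknown."""
    found, missing = {}, []
    for query in queries:
        entry_id = index.get(query.strip().upper())
        if entry_id is None:
            missing.append(query)
        else:
            found[query] = entry_id
    return found, missing
```

With such an index, a user holding UniProt identifiers could query T3SEDb directly, and a list of identifiers pasted into a form would be resolved in one pass.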

On-site analytical tools

Currently, T3SEDb offers an on-site prediction tool for T3SE proteins, as well as links to other analysis tools via ProtScale (http://www.expasy.ch/cgi-bin/protscale.pl?1). To go one step further, T3SEDb can offer a wider range of analytical tools relevant to the study of T3SE, for example tools similar to those available on WEMBOSS (e.g. hydropathy plots, global alignment). Human-gpDB, for instance, provides analytical tools such as domain architecture analysis, protein-protein/protein-chemical interactions and pathway enrichment to help its users integrate data.

However, development of analytical tools may be too technologically demanding. As such, providing links to tools that are relevant to the study of T3SE is another feasible approach.

Dedicated section for updates

To keep users of T3SEDb informed, updates made to T3SEDb can be given a dedicated section on the home page, in addition to the current brief introduction to what T3SE is.

Feedback page

Feedback is invaluable. It allows possible improvements to be highlighted and errors (e.g. incorrect sequences or annotation) to be corrected as quickly as possible. As such, adding a page to T3SEDb dedicated to user feedback would be a valuable investment. Currently, T3SEDb provides only the unlinked email addresses of the database administrators.

T3SEDb interface

The interface of T3SEDb can be more standardized. The type and size of fonts should be uniform across the site. Yet some pages (e.g. Field Description, Annotation and Update Policies, and individual entries) use different fonts in varying sizes. The content of a database is what matters most, but a standardized interface will help make navigating T3SEDb pleasant.