Importance Of Data Integration In Bioinformatics Information Technology Essay

Published: November 30, 2015 Words: 2376

In general, bioinformatics is the science of analyzing and managing omics data using advanced computational techniques. It helps to gather, analyze, integrate and store biological data such as protein structures, protein domains, and DNA and RNA sequences effectively, and to investigate the relations between different datasets [1] through the development of new statistical methods and algorithms.

Data integration refers to combining data residing in different sources and providing the user with a single, unified view of that data. With recent advances in bioinformatics, many tools are available on the internet free of charge, alongside relatively inexpensive commercial software for researchers. Biologists nevertheless spend an enormous amount of time and effort accessing multiple data sources and running data analysis tools to obtain results. For this reason, integrating data from different sources has become critically important.

In this paper we describe a variety of data integration techniques used in bioinformatics, such as data warehousing, web services and controlled vocabularies, with emphasis on the web services approach. We then present an application for data integration in bioinformatics, the Bio-Extract Server, and discuss possible improvements.

LITERATURE REVIEW:

Importance of data integration in bioinformatics:

In recent years the biological sciences have generated enormous amounts of data in the form of genomes, proteomes and glycomes. These information resources are of great importance to biology, medicine and drug discovery. Integrating the various heterogeneous data sources into a single entity is one of the biggest challenges in bioinformatics research. The need for integration arises because biological information is now spread across the internet in various formats such as XML, flat files and relational databases. These formats can be brought together by different data integration techniques. The development of tools and services for integration has made it more efficient to gather information from various data sources. There are many bioinformatics databases around the globe, and the need to integrate them has greatly increased. Integration supports data analysis across a broad area, which is both powerful and a continuing challenge in bioinformatics. Research on the problem has been going on for more than two decades. Federated databases were one of the earlier approaches to data integration [2]. With the development of new technologies, new approaches to data integration have evolved. In this paper we discuss these newer techniques.

Earlier, HTML was used for browsing, publishing and gathering data. It has several advantages: it is user friendly and requires minimal learning effort, because its syntax is simple and its tags are self-explanatory. It works well for simple documents but has limitations. For example, HTML provides only six levels of heading, and workarounds are cumbersome if a document needs more than six. HTML also describes nothing about the semantics of a document, so HTML pages carry little machine-readable information, which makes it difficult to extract data from them programmatically. To overcome these limitations, XML has replaced HTML for integration with bioinformatics databases [3].

Data Integration Techniques:

Bio-warehousing:

Bio-data warehousing creates a single repository, or warehouse, that integrates data from various heterogeneous sources. The data extracted from the various databases are often structured differently at each source. Advanced computational techniques are required to integrate, interpret and exchange data among different databases [5]. Biological databases are heterogeneous, and the same kind of data is presented differently across databases. Data integration is needed to obtain good results even when the sources share similar semantics. To do this, bio-data warehousing presents the data in a single framework [6]. The formats of the source databases need to be understood so that information can be exchanged in an appropriate format.

XML in Bio-Informatics:

XML was developed from the Standard Generalized Markup Language (SGML), an international standard for defining the content and structure of different types of electronic documents. SGML lets users define their own tags and is therefore a meta-language for designing markup languages. XML differs little from SGML in this respect. Like SGML [3], XML lets a set of tags be defined for one or more documents, and these tags identify the different types of elements in a document. XML is used to define strongly structured documents, so that a program can easily follow the structure and extract the relevant data. Both commercial and academic users rely on XML for easier writing and exchange of scientific data. In the recent past, most biology-related data has come to be represented using XML as its framework. Biological sequence markup language [3] is used for representing DNA, RNA and protein sequences and their functions.
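The point that a program can follow the tags and extract relevant data can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree` module. The record layout, the tag names (`sequences`, `sequence`, `residues`) and the function `read_sequences` are assumptions made for illustration; they are not taken from any real sequence markup schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: the tag and attribute names below are
# illustrative only and do not follow any published schema.
RECORD = """
<sequences>
  <sequence id="seq1" molecule="dna">
    <description>example fragment</description>
    <residues>ATGGCCATTG</residues>
  </sequence>
  <sequence id="seq2" molecule="protein">
    <description>example peptide</description>
    <residues>MAIVK</residues>
  </sequence>
</sequences>
"""


def read_sequences(xml_text):
    """Return one dict per <sequence> element found in the document."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall("sequence"):
        records.append({
            "id": node.get("id"),
            "molecule": node.get("molecule"),
            "description": node.findtext("description", default=""),
            "residues": node.findtext("residues", default="").strip(),
        })
    return records


if __name__ == "__main__":
    for rec in read_sequences(RECORD):
        print(rec["id"], rec["molecule"], len(rec["residues"]))
```

Because every field is reached by its tag name, the same function keeps working if a source adds further elements to a record, which is the kind of flexibility attributed to XML here.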

Advantages of XML in Bio-Informatics:

Highly Flexible

Internet oriented

Open framework

Disadvantages of XML [3] in Bio-Informatics:

Overhead of text based format needs to be evaluated before adopting XML.

Technical scalability problems exist, which can lead to poor performance.

No inheritance

No support for numerical values, tables and matrices.

Controlled Vocabulary:

In this technique, heterogeneous data integration is based on a shared ontology. Many researchers aim to use controlled vocabularies. Protein ontology is a standard way to represent protein-related data such as structure and function [7], which helps data integration. Controlled vocabularies are also needed for genetic information, i.e. in -omics databases. Ontologies enable biological researchers to specify entities, their attributes and the relationships among them. Gene Ontology supports data integration by letting researchers annotate gene products with shared terms, so that results can be stored and reports generated consistently. Vocabularies improve the effectiveness of data storage and retrieval from programs. Most bioinformatics applications use controlled vocabularies in order to keep integrating with the public databases. The main purpose of a controlled vocabulary or ontology is consistency, which allows fast understanding [8] and accurate retrieval of information.
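How a controlled vocabulary gives consistent retrieval can be sketched as follows. The vocabulary entries, the synonyms and the function names `build_lookup` and `normalize` are hypothetical and chosen only for illustration; a real application would load its terms from a published ontology.

```python
# Hypothetical vocabulary: preferred term -> synonyms that different
# data sources might use for the same concept.
VOCABULARY = {
    "protein kinase activity": ["kinase activity, protein", "protein kinase"],
    "dna binding": ["dna-binding", "binds dna"],
}


def build_lookup(vocabulary):
    """Map every preferred term and synonym (lower-cased) to the preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for synonym in synonyms:
            lookup[synonym.lower()] = preferred
    return lookup


def normalize(annotation, lookup):
    """Return the preferred term for an annotation, or None if it is unknown."""
    return lookup.get(annotation.strip().lower())


if __name__ == "__main__":
    lookup = build_lookup(VOCABULARY)
    for raw in ["DNA-binding", "Protein kinase", "unclassified"]:
        print(raw, "->", normalize(raw, lookup))
```

Once annotations from every source are normalized to the same preferred term, records from different databases can be grouped and retrieved under a single query term.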

Web services technologies:

Web services are a standard way of integration using XML, SOAP, WSDL and UDDI [9]. XML tags the data, SOAP transfers the data, WSDL describes the available services and UDDI lists the available services. A web service is a program that resides on a server and is accessible to other programs, which call it to perform their tasks. Once the request has been processed, the server sends a response back to the client. Unlike the traditional client/server model, a web service does not provide a GUI; instead it exposes business logic and data through an API over the network. Developers add a GUI on top of a web service as their requirements dictate. Different applications use web services to communicate with databases without much delay, since most of the communication is carried in a common format, XML.
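To make the role of SOAP as an XML message wrapper concrete, the following minimal sketch builds a SOAP 1.1 request envelope with Python's standard library. The service namespace `http://example.org/sequence-service`, the operation name `getSequence` and the accession value are placeholders assumed for illustration; they do not refer to any real service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/sequence-service"     # hypothetical service namespace


def build_request(operation, params):
    """Return a SOAP envelope (as text) that calls `operation` with `params`."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(call, f"{{{SERVICE_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # "X00001" is a placeholder accession, not a real record.
    print(build_request("getSequence", {"accession": "X00001"}))
```

In a real deployment the operation and parameter names would come from the service's WSDL description, and the envelope would be sent to the service endpoint over HTTP.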

Web services Architecture [9]:

Figure 1

Web services architecture mainly consists of three components:

Service Provider: Registers its services so that they are made available.

Service Agent: The intermediary between the service provider and the service requester, used for service exchange.

Service Requester: Requests services from the service agent and uses them to build applications.

Web services have three main operations:

Publish/Unpublish: The service provider publishes a service by registering it with the service agent, or withdraws that registration.

Find: The service requester describes the service it needs, and the service agent searches for matching implementations and returns the matched results.

Bind: This operation binds the service requester to the provider so that the requester can access and use the provider's services.
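The interplay of the three roles and three operations can be sketched with a small in-memory registry. This is only an illustration under simplifying assumptions: the class `ServiceRegistry` and the service `reverse_complement` are hypothetical, and a plain Python callable stands in for a remote endpoint.

```python
class ServiceRegistry:
    """Plays the role of the service agent: it stores and matches service descriptions."""

    def __init__(self):
        self._services = {}

    def publish(self, name, description, endpoint):
        """Called by a provider to register a service."""
        self._services[name] = {"description": description, "endpoint": endpoint}

    def unpublish(self, name):
        """Called by a provider to withdraw a registration."""
        self._services.pop(name, None)

    def find(self, keyword):
        """Called by a requester: return names whose description mentions the keyword."""
        keyword = keyword.lower()
        return sorted(
            name for name, entry in self._services.items()
            if keyword in entry["description"].lower()
        )

    def bind(self, name):
        """Give the requester a handle on the provider's service."""
        return self._services[name]["endpoint"]


def reverse_complement(dna):
    """Example service implementation offered by a provider."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(dna))


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.publish("revcomp", "Reverse complement of a DNA sequence", reverse_complement)
    matches = registry.find("dna")
    service = registry.bind(matches[0])
    print(service("ATGC"))
```

The requester never refers to the provider directly; it only knows the description it searched for, which is what keeps the two sides loosely coupled.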

Advantages of Web services:

Flexible, and can easily adapt to changes in the system.

True cross-platform operation, which solves the interoperability problem.

Data is loosely coupled.

Data can be used at any time.

Applications can be developed more quickly, more easily and at lower cost.

A mass of genetic information has been generated by human genome projects around the world. There are a great number of different kinds of bioinformatics databases, and unifying them is a major challenge in bioinformatics.

Integration of heterogeneous bioinformatics databases:

It is the process of integrating or sharing several heterogeneous bioinformatics databases [9] in order to build more useful applications.

Heterogeneous Data Integration architecture for Bioinformatics:

Figure 2

Heterogeneous database layer: It consists of a wide variety of bioinformatics databases with different semantics, operations and data formats.

Data Integration Layer: This is also called the web services layer. It downloads the biological information coming from the different databases in XML format and maps it onto the public database of the data sharing layer. The data integration layer thus links the distributed heterogeneous databases with the shared database. Synchronization of data, i.e. data communication between the heterogeneous databases and the public database, is achieved using web services. Because the layer follows the middleware design ideas of XML and web services, it can be highly reusable.

Steps of data integration:

Automatically configure and extract the data to be shared from the heterogeneous database.

Convert the extracted data into an XML document.

Send the XML document to the data integration server via a web service, where it is mapped to the shared database [10].
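The three steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: in-memory SQLite databases stand in for both the heterogeneous source and the shared database, the table and column names (`local_seq`, `shared_sequence`) are hypothetical, and the web service call is replaced by passing the XML text directly between two functions.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_to_xml(source_conn):
    """Steps 1 and 2: read the rows to be shared and serialize them as XML."""
    root = ET.Element("records")
    for acc, seq in source_conn.execute("SELECT acc, seq FROM local_seq ORDER BY acc"):
        record = ET.SubElement(root, "record", {"accession": acc})
        record.text = seq
    return ET.tostring(root, encoding="unicode")


def import_from_xml(xml_text, shared_conn, source_name):
    """Step 3: map each XML record onto the shared table's columns."""
    root = ET.fromstring(xml_text)
    rows = [
        (record.get("accession"), source_name, record.text or "")
        for record in root.findall("record")
    ]
    shared_conn.executemany(
        "INSERT OR REPLACE INTO shared_sequence (accession, source, residues) "
        "VALUES (?, ?, ?)",
        rows,
    )
    shared_conn.commit()
    return len(rows)


def demo():
    """Build a toy source and shared database, then run the three steps."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE local_seq (acc TEXT, seq TEXT)")
    source.executemany(
        "INSERT INTO local_seq VALUES (?, ?)", [("A1", "ATGC"), ("A2", "GGCC")]
    )
    shared = sqlite3.connect(":memory:")
    shared.execute(
        "CREATE TABLE shared_sequence ("
        "accession TEXT, source TEXT, residues TEXT, "
        "PRIMARY KEY (accession, source))"
    )
    xml_text = export_to_xml(source)  # in practice this text would travel via a web service
    import_from_xml(xml_text, shared, "source_a")
    return shared.execute(
        "SELECT accession, source, residues FROM shared_sequence ORDER BY accession"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

Keeping the source name as part of the shared table's key lets records from several heterogeneous databases coexist, and re-running the import updates existing rows instead of duplicating them.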

Public database or data sharing layer: A SQL Server database is adopted as the public database. The data to be shared or integrated is stored in this public database. Many kinds of biological information exist, and their biological meaning can be confusing.

Uniform Application Layer: A wide variety of applications can be provided on top of the bioinformatics databases, for example for protein, RNA and DNA sequences.

APPLICATION OF BIOINFORMATICS IN DATA INTEGRATION:

Bio-Extract Server is a web-based bioinformatics application created to retrieve, consolidate and analyze data from different biomolecular data sources in the form of a mash-up. The application is limited to the retrieval of protein and DNA sequences. It offers researchers access to a wide variety of heterogeneous biomolecular data and analytical tools. Instead of requiring users to access multiple data sources, it provides a centralized access point for data and analytical tool integration, which reduces the effort required of the user. It also offers flexible query capabilities, so researchers do not need to be familiar with any query language. Bio-Extract Server combines features of the data warehouse, federation and mediator approaches [11], and its integration of data with analytical tools is implemented mainly by a mediator service.

Bio-Extract Server Functionality: The basic operations of Bio-Extract Server let researchers specify data sources, query those sources, store the results for later reuse and apply analytical tools to the selected data. Only researchers who have logged in with a password can extract and save data, whereas all public data sources are available to "guests" [11].

Data Sources: Most of the data sources store molecular biology data and provide their own access methods and query interfaces. Data queries are performed on the specific databases registered with Bio-Extract Server. Its data sources can take the form of data sources hosted as web services, relational databases or data warehouses. Each of these has its own advantages, so all of them are supported by Bio-Extract Server.

Querying Data Sources: Using the web-based GUI, user queries are constructed in a single, uniform query form and return uniformly formatted results. A query is constructed by first selecting the search field, then the operator to apply, and finally entering the value to search for. The operator generally used for string searching is "=". For a single query to work across databases, each source's searchable fields must be mapped to a set of common terms. If multiple data sources are selected, only those data sources that support the search terms are queried.
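The query model described above (field, operator, value, with a mapping from common terms to each source's own fields) can be sketched as follows. This is not Bio-Extract Server's implementation; the source names, field maps, records and the function `run_query` are all hypothetical and serve only to show how unsupported sources are skipped.

```python
import operator

# Hypothetical sources: each maps the common search terms it supports to
# its own local field names and holds a few toy records.
SOURCES = {
    "source_a": {
        "field_map": {"accession": "acc", "organism": "org_name"},
        "records": [
            {"acc": "A1", "org_name": "Escherichia coli"},
            {"acc": "A2", "org_name": "Homo sapiens"},
        ],
    },
    "source_b": {
        "field_map": {"accession": "id"},
        "records": [{"id": "A2"}, {"id": "B7"}],
    },
}

OPERATORS = {
    "=": operator.eq,
    "contains": lambda field_value, search_value: search_value in field_value,
}


def run_query(field, op, value, selected_sources):
    """Apply one (field, operator, value) query to every source that supports the field."""
    compare = OPERATORS[op]
    results = []
    for name in selected_sources:
        source = SOURCES[name]
        local_field = source["field_map"].get(field)
        if local_field is None:
            continue  # this source does not support the search term, so skip it
        for record in source["records"]:
            if compare(record[local_field], value):
                results.append((name, record))
    return results


if __name__ == "__main__":
    print(run_query("accession", "=", "A2", ["source_a", "source_b"]))
    print(run_query("organism", "contains", "coli", ["source_a", "source_b"]))
```

The second query is answered by `source_a` only, because `source_b` has no mapping for the common term `organism`.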

Viewing Results: The results of a query executed by the server are the combination of the results from each selected data source. The user receives a mash-up list of records. Each record has three fields. The first is a unique identifier that links to the data source. The second is an additional identifier that links to a local web page displaying the record details. The third is a short summary of the record. If the data source of a particular record is a URL, a web-based source or a web service call, a detailed record page is constructed using a format specifier, with format templates defining how the data is displayed in detail.

Applying Analytical Tools: The analytical tools may be web-based applications, standalone programs or web services. The two main steps are adding analytical tools and executing them [11].

Bio-Extract Server administration: A web-based administrative module was implemented to integrate the data sources and analytical tools into a single system. The integration process starts with identifying the data sources, their entities and the relationships between those entities. Analysis processes at different levels of abstraction are captured in Bio-Extract workflows [11]; this functionality resides within the server, which also generates workflow reports.

Advantages:

Researchers can save extracts of data and query them subsequently.

This system is flexible and modular.

Integration of additional data sources and analytical tools is easy with the ability to save, execute and modify.

Disadvantages:

Analytical tool records are not clearly defined.

File formatting options are limited.

POSSIBLE IMPROVEMENT:

An algorithm for web source composition needs to be devised to check whether two data sources are composable, which would make integration easier to achieve. Another possible improvement to Bio-Extract Server would be to improve the execution of analytical tools. The application could also be used in the integration of heterogeneous databases based on web services, in order to achieve faster and more efficient data integration. In the near future, a more robust file format converter should be included in Bio-Extract Server to automatically convert the output of a previously executed tool into the required format.