Epigenetic Biomarker Screening Workbench On Histone Modification Biology Essay

Published: November 2, 2015

Epigenetic mechanisms regulate heritable gene expression that is faithfully propagated through multiple cell divisions without alteration of the DNA sequence. Epigenetic modifications such as histone modification, DNA methylation and the binding of non-histone proteins regulate DNA packaging and influence gene expression. However, abnormal epigenetic alterations may lead to common diseases such as cancer, autoimmune disease and mental disorders. DNA is wrapped around an octamer of histones to form the nucleosome, which acts as the structural unit of chromatin. The most common experimental technique for studying histone modifications is chromatin immunoprecipitation (ChIP), which is tedious, time consuming and laborious. Computational approaches have therefore been introduced to address these limitations.

CHAPTER 1

INTRODUCTION

1.1 Overview

In this new century, epigenetics is one of the most rapidly expanding areas of biology. Epigenetic mechanisms are influenced by several factors, such as development in utero and during childhood, environmental chemicals, drugs, aging and diet. Besides DNA methylation and histone modification, other mechanisms such as nucleosome positioning, chromatin accessibility, binding of non-histone proteins and non-coding RNA have been discovered, which reflects the accelerating progress of epigenomics. With today's advanced technology, it has become possible to undertake epigenomic studies in depth. Thus, a basic understanding of epigenetic mechanisms, their interactions and their alterations in health and disease has become important in biomedical research.

In general, epigenetics is defined as the study of heritable changes in gene regulation across the genome that do not directly affect the DNA sequence. Epigenetic mechanisms regulate gene expression by propagating patterns of epigenetic information over multiple cell divisions, and this regulation contributes to cell differentiation and cell fate decisions (Dillon, 2006). Generally, the more loosely a gene's DNA is packaged, the more likely it is to be actively transcribed. Conversely, a gene is more likely to be switched off if its DNA is tightly wrapped. Epigenetic alterations are reversible modifications of the DNA strand or histones that influence gene expression without affecting the DNA sequence.

However, the term "epigenomics" is defined differently from "epigenetics". In general, epigenomics refers to the study of epigenetic modifications across the entire genome, analogous to genomics and proteomics. In contrast, epigenetics usually refers to the study of the regulation of gene expression at single genes.

DNA methylation and chromatin structure are important in normal mammalian development (Bradley E. Bernstein, 2007). These modifications play several roles, including imprinting, developmental control, gene transcription, heterochromatin maintenance, tissue-specific expression control and genomic stability. However, incorrect or abnormal modifications may lead to a wide range of disorders such as cancer, Rett syndrome and immunodeficiency (Christine Ladd-Acosta, 2009).

Figure 1.1 Epigenetic mechanism

Due to the increasing amount of epigenetic data available, computational epigenetics techniques play an important role in predictive epigenomics, especially in the diagnosis and treatment of cancer. Epigenetic prediction from genomic sequence is advantageous because prediction results can stand in for experimental data, especially for recently discovered epigenomic mechanisms (Christoph Bock T. L., 2008). In addition, epigenomic prediction algorithms derive statistical information from training data and act as models of epigenomic mechanisms (Christoph Bock T. L., 2008). Therefore, epigenomic prediction is helpful in the clinical field, as it offers a time-efficient way to provide early prediction of several diseases.

With the improvement of microarray and DNA sequencing techniques, epigenetic differences between healthy and diseased cells, as well as between different cell types, have been identified (Yuan, 2011). Chromatin immunoprecipitation (ChIP) was developed to analyze histone modifications by enriching DNA fragments from the target sites (Nair, 2010). When ChIP is combined with microarray technology, a technique known as ChIP-on-chip, protein binding at specific loci can be detected. However, this technique has low sequence resolution and low coverage, and it is expensive. With the advance of ultrahigh-throughput sequencing technologies, ChIP-seq has become the main approach; it uses high-throughput sequencing instead of a tiling array to compare sample and control DNA. Compared with ChIP-on-chip, ChIP-seq requires less data normalization and is more cost efficient. Therefore, these techniques are useful for epigenomic prediction, as they can process huge amounts of data with high accuracy.

Currently, several computational epigenetics tools are available, such as ChIP-Seq Web Server and EpiGRAPH. These tools are useful for analyzing large amounts of genomic and epigenomic data. ChIP-Seq Web Server is an online tool for analyzing ChIP-seq data, including positional correlation analysis and peak detection for locating transcription factor binding sites. Moreover, various types of data sets are available on this server, such as ChIP-seq data, DNA methylation data and sequence-derived features including CpG score.

Figure 1.2: ChIP-Seq web server (http://ccg.vital-it.ch/chipseq/)

EpiGRAPH, on the other hand, is a tool mainly for genome and epigenome analysis. It was implemented to help researchers study the large-scale datasets, such as ChIP-on-chip, tiling microarray and sequencing data, that are now widely generated by advanced technologies. For occasional users, EpiGRAPH supplies a default analysis workflow that suits most datasets. It applies statistical analyses and machine learning algorithms based on a large amount of genome and epigenome information. Advanced users have full access to its documentation system and standardized XML-based analysis.

Figure 1.3 EpiGRAPH software

Due to the rapid progress in generating epigenetic data sets, adequate computational tools are required for data processing and epigenomic prediction. Understanding the genomic distribution of epigenetic information can provide a basis for epigenetic prediction in cancer. To develop epigenetic prediction software, the support vector machine (SVM) is very useful for analysing data and recognizing patterns, which helps in classification and regression analysis. An SVM classifies data into two possible classes based on the features of the data. Therefore, to obtain highly accurate prediction results, I will apply an SVM in the implementation of the histone modification prediction application.

1.2 Problem Statement

Epigenetic modifications have been shown to be associated with disease progression and therapeutic effect. Epigenomic prediction plays a crucial role in the diagnosis and prognosis of disease. However, some problems need to be addressed. Currently, most methods for identifying histone modifications are experimental; they are tedious, laborious and time consuming, and they produce noisy data. Moreover, the reagents and materials needed for the experiments are expensive.

In addition, experimental methods require more manpower, whereas a computational technique can be carried out independently. Another limitation is location: experiments can only be carried out in a laboratory, whereas computational software can be used anywhere.

Therefore, a computational predictive application is useful for improving efficiency and reducing cost and experimental time.

1.3 Project Objective

As there is a lack of computational epigenetic prediction tools, I will implement a web-based application for predicting histone modification, to improve efficiency compared with experimental methods.

In this project, my objectives are:

To survey common patterns or features of DNA sequences that are involved in histone modification

To develop a bioinformatics application to predict histone modification in DNA sequences

1.4 Project Scope

The project scope is closely tied to the objectives. To develop an application that predicts histone modification, the process is divided into three parts: feature selection, design and implementation. By studying existing journals and software, the features most strongly associated with histone modification are selected to improve the accuracy of the results. The design and implementation tasks include analyzing the algorithm and writing the code for the application.

CHAPTER 2

LITERATURE REVIEW

2.1 Histone Modification

The nucleosome is the fundamental unit of human chromatin. It consists of 146 bp of DNA wrapped around an octamer of the histones H2A, H2B, H3 and H4. Histones are subject to various post-translational modifications at multiple sites, such as acetylation and methylation of lysine and arginine, phosphorylation, glycosylation, ubiquitylation and ADP-ribosylation (Yuan, 2011). These alterations mostly take place on the N-terminal tails and play a crucial role in gene regulation and chromatin processes. In general, histone modification is important in transcriptional control because it determines chromatin status and recruits non-histone protein complexes, thus dictating higher-order chromatin structure.

In the medical field, histone modification patterns help in the prognosis and diagnosis of cancer. Commonly, histone acetylation and methylation states characterize methylated sequences that are repressed in normal cells. DNA methylation may be precipitated by histone modifications at histone H3 Lys9 (H3K9), histone H3 Lys27 (H3K27) and histone H4 Lys20 (H4K20), which characterize repressive chromatin structure. For example, increased histone H3 Lys18 (H3K18) acetylation and histone H3 Lys4 (H3K4) dimethylation are associated with poor prognosis (Konvo Y, 2009).

There is a large number of histone modifications, each with its own global distribution, so histone modification patterns are complicated to investigate and not yet well studied. Investigation of three histone marks, H3K4me3, H3K27me3 and H3K9me3, and the mechanisms that recruit them shows that function varies between modifications. H3K4me3 marks, which are located at transcription start sites, are associated with gene activation. In contrast, H3K27me3 and H3K9me3 may lead to gene silencing (Yuan, 2012). In spite of their different functions, H3K4me3 and H3K27me3 colocalize to form bivalent domains, which play a crucial role by marking genes that are poised for activation during cell differentiation. Another factor that increases the complexity of histone methylation is that it comes in three flavors, mono-, di- and tri-methylation, each with different functions (Yuan, 2011). For example, H3K4me1 is enriched at tissue-specific enhancers, whereas H3K4me3 is highly associated with promoters. Histone-modifying enzymes are recruited to specific regions by interacting with transcription factors, non-coding RNAs and DNA-interacting regulators (Yuan, 2011). Several lines of evidence show that different histone modifications are distributed differently along the DNA sequence. For example, H3K4me3 is highly associated with CpG density (Bernstein, 2006), whereas H3K9me3 is strongly associated with repetitive elements.

Due to the high degree of overlap between H3K4me3 and CpG islands, the genome-wide distribution of H3K4me3 can be predicted from DNA sequence with high accuracy in different cell types (Yuan, 2009). On the other hand, computational research on histone modifications has mainly focused on H3K27me3, through the investigation of Polycomb group (PcG) protein targeting (Yuan, 2011). PcG proteins were originally discovered in Drosophila as regulators of homeotic (Hox) genes; they repress the expression of their target genes through H3K27me3. Polycomb response elements (PREs) are well-characterized DNA elements that play an important role in PcG recruitment. Computational prediction of PREs mainly focuses on transcription factor motifs. A single transcription factor motif is not sufficient to distinguish PREs from non-PREs (Ringrose L, 2003). However, PREs and non-PREs are easier to discriminate when different motifs are combined. In that study, the investigators tested the prediction accuracy by experimental validation, and 29 of 43 predicted PREs were verified (Ringrose L, 2003), which indicates that the accuracy of the prediction is satisfactory. Compared with H3K27me3 and H3K4me3, the analysis of H3K9me3 is more complex and difficult because of its strong association with regions that cannot be mapped to the reference genome (Yuan, 2012). Thus, the prediction accuracy for the genome-wide distribution of H3K9me3 is slightly lower (Yuan, 2009).

2.1.1 In Normal Cells

In normal cells, covalent modifications of histones convert compact, inactive heterochromatin into open, active euchromatin, or vice versa. The reversible modifications involved are methylation, phosphorylation, acetylation and ubiquitinylation, which normally take place at the N-terminal and C-terminal tails of the core histones (Wong JJ, 2007). Histone alterations are carried out by various families of enzymes, such as histone deacetylases and histone methyltransferases. Regulation of these enzymes is crucial for normal cellular function, and any alteration in their function may lead to disease. Global changes in histone modification patterns can affect the integrity and structure of the genome and thus disrupt normal gene expression.

Figure 2.1.1: Epigenetic alterations and their effects on transcription. Modification of the histone tails can lead to either activation or repression of transcription.

Generally, increased histone acetylation in active chromatin leads to increased transcriptional activity. P300 is one of the histone acetyltransferases that catalyze lysine acetylation of histones H3 and H4. In contrast, histone deacetylases act antagonistically and produce transcriptional repression through interaction with methyl-CpG binding proteins (Ng H H, 2000). Methylation and phosphorylation of histones also take part in chromatin activation. For example, a gene can be activated by methylation of lysine 4 and lysine 14, whereas methylation of lysine 9 may lead to gene silencing.

2.1.2 In Cancer Cells

Compared with DNA methylation, research on histone modification in cancer is less advanced. The development of colon cancer involves various alterations, such as changes in the structure of histone tails and the replacement of histones by variants. Generally, abnormal methylation of tumor-suppressor genes is associated with deacetylation and methylation of lysine 9 on histone H3. Acetylation and methylation affect the same residue and are therefore mutually exclusive. Methylation of the lysine 9 residue may lead to gene silencing by recruiting heterochromatin-associated proteins (Kondo Y, 2003). A study found that global alteration of histone H4, namely loss of acetylation at lysine 16 and of trimethylation at lysine 20, is a marker of malignant transformation (Fraga M F, 2005).

The most common histone modifications in both normal and cancer cells occur on histone H3: histone H3 lysine 4 trimethylation (H3K4me3) is found at the promoters of active genes, and histone H3 lysine 27 trimethylation (H3K27me3) is associated with the promoters of inactive genes (T, 2007). SMYD3, a histone methyltransferase specific for H3K4, is overexpressed in colorectal cancer and breast cancer, which suggests that hypermethylation occurs at the promoters of oncogenes. In short, increased SMYD3 activity may enhance the transcription of oncogenes and cell cycle regulatory genes. Beyond colon cancer, these alterations in histone modification also act as markers for other human tumors such as breast cancer and ovarian cancer.

Individual histones can be replaced by variants, for example H2A by the variant H2A.Z. H2A.Z is important in embryogenesis and prevents the spread of heterochromatin into euchromatic regions. Abnormal variant replacement disturbs the boundaries between heterochromatin and euchromatin, which may cause cancer (Raisner R M, 2005).

Although histone modification and chromatin remodeling are strongly associated with cancer, it remains unclear to what extent these changes occur and how histone modification can be predicted. However, several new approaches for studying histone marks on a genome scale have been developed recently.

2.2 Epigenomic Prediction

Epigenomic prediction is based on selected features of the genome sequence that are associated with particular modifications, commonly DNA methylation and histone modification. Epigenomics helps us understand biological processes and has potential for therapeutic application. In tumour pathogenesis, epigenomics provides information on patterns of epigenetic modification that act as signs of cell malignancy.

2.2.1 Computational Histone Modification Prediction

Computational epigenetic methods can serve as prognostic and predictive biomarkers by predicting the occurrence or probability of an epigenetic event. Histone modifications have been mapped experimentally by combining chromatin immunoprecipitation with DNA microarrays (Tho Hoan Pham D. H., 2007), and by bisulfite modification combined with the polymerase chain reaction (Ze-Jun Liu, Masato Maekawa, 2003). However, computational methods are more time and cost efficient, as experimental techniques are laborious and expensive.

The most common computational technique is supervised learning (Friedman, 2001). To build a predictive model, the sequence must first be converted into numerical values, a process known as vectorization. If the relevant sequence features are known, their frequencies can serve as numerical predictors. However, the discriminative sequence features are not known in most cases, so various types of sequence features have been introduced. Examples include word counts (HE, 2007), repetitive elements (Yuan, Linking genome to epigenome, 2012), poly A-T tracts and TF motifs.

Feature selection techniques, from either traditional or machine learning approaches, are used to overcome overfitting (Yuan, Linking genome to epigenome, 2012). Overfitting may occur because the huge number of sequence features increases the complexity of the model. An overfitted model predicts poorly because it fits minor fluctuations in the training data. Traditional methods include principal component analysis, stepwise regression and penalized regression, whereas machine learning techniques include the support vector machine (SVM) and decision trees. Moreover, a data mining approach, Bayesian additive regression trees (BART), helps to improve prediction accuracy. These techniques can be applied to the prediction of different epigenetic modifications, including histone modification, DNA methylation and nucleosome positioning.

A group of researchers from Japan developed a computational approach to qualitatively predict histone occupancy, acetylation and methylation locations in DNA sequences (Tho Hoan Pham D. H., 2007). The approach uses a support vector machine (SVM) to learn, from a data set, a model that discriminates areas with high and low levels of histone occupancy, methylation and acetylation in a DNA sequence. The features used to determine the histone occupancy and modification state at each position are subsequences of length L extending equally on both sides of the position, and genetic elements such as the promoter and the start and end of a gene. The paper does not state why these two features were selected. The features are converted into vectors before being passed to the SVM.
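As an illustration of how a window around a position might be turned into a vector, the following is a minimal sketch. The helper names (window, one_hot), the padding symbol "N" and the one-hot encoding are assumptions made for this example; the cited paper's exact encoding is not described here.

```python
# Hypothetical helpers: extract a window that extends half_len bases on each
# side of a position, then encode it as a numeric vector for a classifier.

def window(seq, pos, half_len, pad="N"):
    """Return the subsequence centred on pos, padded at the sequence ends."""
    left = pos - half_len
    right = pos + half_len + 1
    prefix = pad * max(0, -left)
    suffix = pad * max(0, right - len(seq))
    return prefix + seq[max(0, left):min(len(seq), right)] + suffix

# One-hot code per base (an assumed encoding); unknown bases map to zeros.
_ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def one_hot(subsequence):
    """Flatten a subsequence into a list of 0/1 values, four per base."""
    vector = []
    for base in subsequence.upper():
        vector.extend(_ONE_HOT.get(base, [0, 0, 0, 0]))
    return vector
```

Each position then yields a vector of fixed length 4 × (2 × half_len + 1), which is the kind of input an SVM expects.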

Figure 2.2 The accuracy and coverage of the prediction at different confidence levels for different datasets (Tho Hoan Pham D. H., 2007).

Compared with the experimental technique, this approach cannot determine the modified areas quantitatively, but it can accurately determine whether nucleosomes and modifications are present in a DNA sequence. The reported prediction accuracies for histone occupancy and modification are 93.19% and 24.7%, respectively, at a confidence level greater than 0.75.

CHAPTER 3

METHODOLOGY

This chapter briefly describes the algorithms, system and software needed to develop an application that predicts histone modification.

3.1 System Development Life Cycle (SDLC)

System development is the process of analyzing, designing and implementing a new application or improving an existing one. It covers all the phases of information system development, from initial feasibility studies to maintenance of the application.

The System Development Life Cycle (SDLC) is a process designed for analysts and engineers, used when developing a new system or improving an existing one (OSQA, 2009). The objective of the SDLC is to develop a system that meets the owner's expectations and requirements within cost and time constraints.

3.1.2 Waterfall Model

The waterfall model is a sequential design model that is widely used in software development. In this approach, development is divided into phases. The phases cascade into one another, so that the next phase starts only when the goals of the previous phase have been achieved.

Figure 3.1 SDLC phases

Generally, the waterfall model is divided into five phases. The first is requirement analysis, in which the information, behavior and requirements of the system are analyzed to check that they can validly be incorporated into the system. This is followed by the system and software design phase, which establishes the system architecture by planning and selecting the best solution. The third phase is implementation, in which the solution is built to meet the specification and requirements. Next, system verification tests whether the system meets the requirements. The last phase is maintenance, which resolves problems that arise over time and enhances the developed system.

3.2 Feature Selection

Generally, two features are highly associated with histone modification: Alu repeats and CpG islands. Experimentally, 81% of histone H3 lysine 9 methylation involves repetitive elements, which include short interspersed transposable elements (SINEs, such as Alu elements), long interspersed transposable elements (LINEs) and long terminal repeats (LTRs) (Yutaka Kondo, 2003). Further characterization of the repetitive elements revealed that 68% were Alu elements. Thus, H3K9 methylation is highly associated with human repetitive elements, especially Alu elements. Therefore, Alu repeats are selected as a feature for predicting histone modification.

CpG islands are formed by clustered CpG dinucleotides. It is found that 60% to 90% of CpG dinucleotides in the genome are methylated, whereas many CpG islands remain unmethylated. Unmethylated CpG islands are found to contain highly acetylated histones H3 and H4. Therefore, CpG islands are selected as one of the features for predicting histone modification. The features of CpG islands are sequence length, %G+C and CpG ratio (Fang Fang S. F., 2006). %G+C and CpG ratio are the features used for classification (Fang Fang S. F., 2006). The values of these two features differ distinctly between the methylation-prone class and the methylation-resistant class.

Feature      Methylation-prone               Methylation-resistant
             Mean     Standard Deviation     Mean
%G+C         55.26    6.32                   64.34
CpG ratio    0.672    0.132                  0.779

Table 3.2. Comparison of the distributions of %G+C and CpG ratio of CpG islands in the methylation-prone class and the methylation-resistant class with 400 bp (Fang Fang S. F., 2006).
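The two classification features can be computed directly from a DNA sequence. The sketch below assumes that the CpG ratio is the commonly used observed/expected ratio, (number of CpG × sequence length) / (number of C × number of G); the source does not spell out the definition, and the function names are chosen for this example only.

```python
# Hypothetical helpers for the two CpG island features discussed above.

def gc_percent(seq):
    """Percentage of G and C bases in the sequence (%G+C)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def cpg_ratio(seq):
    """Observed/expected CpG ratio (assumed definition, see lead-in)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed_cpg = seq.count("CG")  # "CG" cannot overlap itself
    return observed_cpg * len(seq) / (c_count * g_count)
```

For a candidate 400 bp window, the two values can then be compared with the class means in Table 3.2 or passed to a classifier as features.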

3.3 N-score Model

The N-score model was developed to predict nucleosome positions and histone modification locations in humans (Yuan, 2009). In my project, the N-score model is applied to calculate the N-score associated with Alu repeats for different histone modifications. The N-score is computed from xl(S), yl(S) and zl(S), which are the wavelet energies, word counts and structural features of sequence S, respectively (Yuan, 2009).

To determine the word count, the frequencies of overlapping k-mers for k = 1 to 6 including the complementary words are enumerated (Heather E. Peckham, 2007). The structure feature is also included in the model (William Lee, 2007).
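The word-count step can be sketched as follows. This is only an illustration: it assumes that "including the complementary words" means a word and its reverse complement are counted together under one canonical key, and the function names are hypothetical.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(_COMPLEMENT)[::-1]

def kmer_counts(seq, k_min=1, k_max=6):
    """Count overlapping k-mers for k_min <= k <= k_max.

    A word and its reverse complement share one key, the alphabetically
    smaller of the two (an assumption of this sketch).
    """
    seq = seq.upper()
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for start in range(len(seq) - k + 1):
            word = seq[start:start + k]
            if set(word) <= set("ACGT"):  # skip words with ambiguous bases
                counts[min(word, reverse_complement(word))] += 1
    return counts
```

For example, kmer_counts("ACGT", 1, 2) counts "AC" twice, once for "AC" itself and once for its reverse complement "GT".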

This model is implemented in Python. The N-score values differ between types of histone modification, which enables us to predict histone modification targets from the N-score.

3.4 Support Vector Machine (SVM)

The support vector machine (SVM), also known as the support vector network, is a supervised machine learning method whose learning algorithm can be applied to regression and classification analysis (Tho Hoan Pham T. B., 2007). There are various techniques for separating two classes linearly, such as single-layer and multi-layer perceptrons (Nikolay Stanevski, 2005). The basic procedure for binary classification of DNA data is described below (Tho Hoan Pham D. H., 2007).

Map each input vector x_i to a vector φ(x_i), linearly or non-linearly, in a richer feature space; the mapping is determined by the selected kernel function.

Based on the first step, determine the optimal linear separation within the feature space: construct the optimal-margin hyperplane w^T φ(x) + b = 0 that provides the maximum margin between the two classes.

Let (x_i, y_i), i = 1, ..., N be a training dataset, where x_i denotes a vector of features and y_i = ±1 is the class attribute. The classification function is:

f(x) = Σ_{i=1..N} α_i y_i K(x, x_i) + b

where K(x_i, x_j) is a symmetric positive definite kernel function that measures the similarity between samples x_i and x_j. The α_i and b are chosen during training to minimize the prediction error.
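To make the classification function concrete, the sketch below evaluates f(x) for given support vectors, labels, coefficients α_i and bias b. The radial basis function kernel and its gamma value are assumptions chosen for illustration, and the coefficients in any real model would come from training rather than being set by hand.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def decision_function(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x, x_i) + b."""
    total = 0.0
    for alpha, y, sv in zip(alphas, labels, support_vectors):
        total += alpha * y * kernel(x, sv)
    return total + b

def classify(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Assign class +1 if f(x) >= 0, otherwise class -1."""
    score = decision_function(x, support_vectors, labels, alphas, b, kernel)
    return 1 if score >= 0 else -1
```

With one support vector per class, for example [0, 0] labelled -1 and [2, 2] labelled +1 with equal coefficients and b = 0, a query point is assigned to the class of the support vector it is closer to.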

Figure 3.2: Illustration of binary classification in SVM

CHAPTER 4

IMPLEMENTATION

Project Workflow and Gantt Chart

Project planning is important to ensure that the project can be completed within the time given. With proper time management, tasks and milestones can be completed more efficiently and with better results. A Gantt chart is a useful technique for illustrating project tasks and schedule.

Task                                Week (1 to 12)
Title selection and registration    1
Literature review                   2 to 5
Collection of information           5 to 7
Project analysis                    7 to 8
Design and implementation           from 9
Completion of interim report        last three weeks

Figure 4.1 Gantt chart

The Gantt chart above illustrates the project schedule for phase 1. The final year project title was registered in the 1st week of the Delta first semester. The literature review, which surveyed existing techniques, was carried out from the 2nd week to the 5th week, a duration of four weeks. Information was collected alongside the literature review, from the 5th week to the 7th week. This was followed by project analysis for two weeks from the 7th week. Design and implementation of the software began in the 9th week. The interim report was written in the last three weeks before submission.

Project Pipeline

Input training dataset → Feature selection → Support vector machine (training) → Prediction results

Figure 4.2 Flowchart of project pipeline

As shown in the flowchart above, the project pipeline is divided into four steps. The process starts with the input training dataset, which consists of two sets of data. The first set represents the positive cases and the second set represents the negative cases.

The second step is feature selection. By studying existing sources, the features most likely to be associated with histone modification are chosen to ensure good predictive performance.

This is followed by classification with an SVM, a supervised learning model. The SVM classifies data into positive and negative classes by analyzing the data and recognizing patterns. The last step is the generation of prediction results.

To determine the accuracy, the prediction results will be compared with existing results obtained by other researchers. These existing results can come from either experimental or computational studies.
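The comparison with existing results can be sketched as a simple accuracy calculation over matched positions. The function name and the +1/-1 label convention are assumptions for this example; the reference labels would come from published experimental or computational results.

```python
def accuracy(predicted, reference):
    """Fraction of positions where the predicted label matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one label is required")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)
```

For example, if three of four predicted labels agree with the reference labels, the function returns 0.75.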

CONCLUSION

In this study, an ambitious project was initiated to develop a computational approach for predicting histone modification. Recent studies reveal that advances in epigenomics have led to the identification of epigenetic biomarkers that help in the prognosis and diagnosis of diseases such as colon cancer and gastrointestinal cancer. A computational predictive application is cost and time efficient compared with experimental techniques, as it is not laborious and does not require expensive reagents or materials. From this project, I have gained knowledge about the mechanisms of epigenetic alteration, especially histone modification. In addition, I have been exposed to different methods of predicting histone modification, both experimental and computational.