The CG island is a stretch of DNA (usually longer than 200 bases) in which the frequency of the CG sequence is higher than in other regions. It is also called the CpG island, where "p" simply indicates that "C" and "G" are connected by a phosphodiester bond.
CpG islands are often located around the promoters of housekeeping genes (which are essential for general cell functions) or other genes frequently expressed in a cell. At these locations, the CG sequence is not methylated. By contrast, the CG sequences in inactive genes are usually methylated to suppress their expression. The methylated cytosine may be converted to thymine by accidental deamination. Unlike the cytosine-to-uracil mutation, which is efficiently repaired, the cytosine-to-thymine mutation can be corrected only by mismatch repair, which is very inefficient. Hence, over evolutionary time scales, the methylated CG sequence will be converted to the TG sequence.
This explains the deficiency of the CG sequence in inactive genes.
The following example was taken from Chromosome #1 in the Human Genome. Notice the higher concentration of G+C in the CpG island near Base #2000. In this case, the minimum length was set to 200 nucleotides, so the second peak around Base #5200 was not recognized.
CpG islands occur near the beginning of a gene in the human genome and can be used for gene finding. In these islands, the concentrations of C and G nucleotides are more nearly equal. In inactive regions, there will be a higher concentration of T nucleotides than C nucleotides.
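To get a feel for the G+C signal described above, the fraction of G and C bases can be computed in a sliding window. The following is only a rough sketch: the function name gc_fraction_windows and the default window of 200 (borrowed from the minimum island length mentioned above) are assumptions for illustration, not part of the assignment.

```python
def gc_fraction_windows(seq, window=200):
    """Return the G+C fraction of every full window of length `window`."""
    seq = seq.lower()
    fractions = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        fractions.append((chunk.count("g") + chunk.count("c")) / window)
    return fractions


# Tiny illustrative call with a made-up sequence and a short window.
print(gc_fraction_windows("aacgcg", window=4))
```

Peaks in the returned list correspond to G+C-rich stretches; a real island finder would also need a length requirement such as the 200-nucleotide minimum mentioned above.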
In this assignment, you will write code for a Hidden Markov Model to find CpG islands. You will implement the Viterbi algorithm to find the most probable sequence of hidden states for the following model. Then you will use annotated data to find model parameters for your HMM and evaluate your accuracy.
We will use the following general definition of a Hidden Markov Model for this description.
It is easy to see how you can generate the probability of a given sequence of hidden states, using this construct.
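As a minimal sketch of that calculation, the joint probability of a state path and the characters it emits is the start probability times one transition and one emission per step. The numbers for a and c, and the B->B and I->I transitions, are the ones quoted in the worked example later in this lab; B->I and I->B are their complements (assuming only two states), and the g and t emission values are made-up placeholders. The function name joint_probability is hypothetical.

```python
# States: B = background (normal DNA), I = island.
start = {"B": 0.5, "I": 0.5}
trans = {
    "B": {"B": 0.7, "I": 0.3},  # B->I is the complement of B->B
    "I": {"B": 0.5, "I": 0.5},  # I->B is the complement of I->I
}
emit = {
    "B": {"a": 0.251, "c": 0.098, "g": 0.098, "t": 0.553},  # g, t assumed
    "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},      # g, t assumed
}


def joint_probability(states, observations):
    """P(states, observations) for one specific path through the model."""
    p = start[states[0]] * emit[states[0]][observations[0]]
    for prev, cur, sym in zip(states, states[1:], observations[1:]):
        p *= trans[prev][cur] * emit[cur][sym]
    return p


# Path B,B emitting "ac": 0.5 * 0.251 * 0.7 * 0.098
print(joint_probability("BB", "ac"))
```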
In finding a CpG island, we would like to determine the most likely path through the model.
There are two related computations. If you are interested in the overall probability of the observed sequence, then you use the forward algorithm, which requires you to sum the probabilities over all possible previous states. The Viterbi algorithm instead uses only the previous state that generates the most probable arrival at the current state, and takes the product of these max probabilities to select the most probable path. The following example will illustrate the algorithm.
In this example, both states have an initial probability of 0.5, so the probability of being in state B at time=1 is (0.5 * 0.251), with 0.251 being the probability of emitting an a (this is the observed first character) while in state B.
The probability of being in state I at time=1 is (0.5 * 0.25), with 0.25 being the probability of emitting an a (this is the observed first character) while in state I.
In order to calculate the probability of being in state B at time=2, Viterbi calculates the max probability of coming from states I or B in the previous state.
Since the probability is higher for coming from state B, we will use this to calculate the probability of being in state B at time=2. Since the probability of emitting a c in state B is 0.098, the total probability of being in state B at time=2 is (0.5*0.251*0.7)*0.098.
In order to calculate the probability of being in state I at time=2, Viterbi calculates the max probability of coming from states I or B in the previous state.
Since the probability is higher for coming from state I, we will use this to calculate the probability of being in state I at time=2. Since the probability of emitting a c in state I is 0.25, the total probability of being in state I at time=2 is (0.5*0.25*0.5)*0.25.
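The two steps of this example can be checked with a few lines of code. This is a sketch, not the required implementation: it uses only the numbers quoted above, takes the B->I and I->B transitions as the complements of B->B = 0.7 and I->I = 0.5 (assuming only two states), and the variable names v1, v2 and best_prev are chosen here for illustration.

```python
states = ["B", "I"]
start = {"B": 0.5, "I": 0.5}
trans = {
    "B": {"B": 0.7, "I": 0.3},  # B->I assumed as complement of B->B
    "I": {"B": 0.5, "I": 0.5},  # I->B assumed as complement of I->I
}
emit = {"B": {"a": 0.251, "c": 0.098}, "I": {"a": 0.25, "c": 0.25}}

# time=1: start probability times the emission of the first character, a
v1 = {s: start[s] * emit[s]["a"] for s in states}

# time=2: keep only the best previous state, then emit the second character, c
v2, best_prev = {}, {}
for s in states:
    prev = max(states, key=lambda r: v1[r] * trans[r][s])
    best_prev[s] = prev
    v2[s] = v1[prev] * trans[prev][s] * emit[s]["c"]

print(best_prev)  # which previous state each time=2 state came from
print(v2)         # Viterbi probabilities at time=2
```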
This same procedure can be used to calculate the probabilities of being in each state until the whole sequence has been analyzed. You can compute the log likelihood by using the logarithm of the probabilities instead of the actual values. You then substitute addition for multiplication. Since the probability of the state at the previous time step is already in log form, you just add it, instead of taking the log of it.
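One possible shape for the full procedure in log space, including the traceback that recovers the state path, is sketched below. It assumes every probability in the model is nonzero (math.log fails on 0), and the function name viterbi_log, the example_* parameter tables, and the g and t emission values are placeholders for illustration rather than values given in this lab.

```python
import math


def viterbi_log(obs, states, start, trans, emit):
    """Return (most probable state path, its log probability)."""
    # score[s]: log probability of the best path that ends in state s
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # one dict per step: best predecessor of each state
    for sym in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            pointers[s] = prev
            # The previous score is already a log, so it is simply added.
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][sym]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state to the beginning.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return "".join(path), score[last]


example_states = ["B", "I"]
example_start = {"B": 0.5, "I": 0.5}
example_trans = {"B": {"B": 0.7, "I": 0.3}, "I": {"B": 0.5, "I": 0.5}}
example_emit = {
    "B": {"a": 0.251, "c": 0.098, "g": 0.098, "t": 0.553},  # g, t assumed
    "I": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},      # g, t assumed
}
path, log_p = viterbi_log("ac", example_states, example_start,
                          example_trans, example_emit)
print(path, log_p)
```

Working with sums of logs avoids the numerical underflow that a long product of small probabilities would otherwise cause.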
You should experiment with different values for emission and state transition probabilities to determine values that categorize the data most accurately.
In order to find an island, slide a window across your most probable state list looking for a threshold of Island states. You will have to determine the threshold values. Plot the number of I states in the window and turn it in as part of your writeup.
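The windowing step might look like the sketch below. The function name island_windows and the defaults (a window of 200 and a threshold of 80% I states) are arbitrary starting points chosen for illustration; choosing good values is part of the assignment.

```python
def island_windows(path, window=200, threshold=0.8):
    """Count I states in each sliding window over a state path.

    Returns (counts, hits): counts[k] is the number of I states in
    path[k:k+window], and hits lists the window start positions whose
    count reaches threshold * window.
    """
    if len(path) < window:
        return [], []
    n_i = sum(1 for s in path[:window] if s == "I")
    counts = [n_i]
    for k in range(window, len(path)):
        # Slide by one: add the entering state, drop the leaving one.
        n_i += (path[k] == "I") - (path[k - window] == "I")
        counts.append(n_i)
    hits = [pos for pos, c in enumerate(counts) if c >= threshold * window]
    return counts, hits


counts, hits = island_windows("BBIIIIBB", window=4, threshold=0.75)
print(counts)
print(hits)
```

The counts list is what you would plot for the writeup; consecutive hits can be merged into a single reported island.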
You will probably find that the model parameters given are not very good at discriminating between islands and normal DNA. The next part of the assignment will give you the opportunity to calculate model parameters. Although sophisticated algorithms can be used in unsupervised training, we are going to use a very simple approach.
Using annotated sequence data, estimate the probability of each emission or transition from its observed frequency. For example, if you observe that when you are in state B, you transition to state I 100 times and stay in state B 200 times, then the probability of the B->I transition should be 0.33 and the probability of B->B should be 0.67. Count up the emissions from each state. If the totals from state B are a->20, c->30, t->40, g->60, then the probability of emitting a t in state B is 40/(60+40+30+20) = 0.27. Use this model based on supervised training to find islands and compare it to your initial results. You can use this annotated data, along with any you find in other places, to train your model.
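The counting approach can be sketched as follows. It assumes the annotation is available as a string of state labels (B or I) aligned character by character with the sequence, which is an assumption about the data format; the function names normalize_counts and estimate_parameters are hypothetical.

```python
from collections import Counter, defaultdict


def normalize_counts(counts):
    """Turn a dict of counts into a dict of relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def estimate_parameters(sequence, labels):
    """Estimate transition and emission probabilities from labeled data."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sym, state in zip(sequence, labels):
        emit_counts[state][sym] += 1
    for prev, cur in zip(labels, labels[1:]):
        trans_counts[prev][cur] += 1
    trans = {s: normalize_counts(c) for s, c in trans_counts.items()}
    emit = {s: normalize_counts(c) for s, c in emit_counts.items()}
    return trans, emit


# The emission counts quoted above for state B.
print(normalize_counts({"a": 20, "c": 30, "t": 40, "g": 60})["t"])

# A made-up, very short labeled sequence.
trans, emit = estimate_parameters("acgtac", "BBBIIB")
print(trans)
print(emit)
```

Events that never occur in the training data get no entry at all here, so a real run on sparse data may need a small pseudocount before taking logarithms.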
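When probabilities are stored as logarithms, the sum of two probabilities (as the forward algorithm requires) can be computed without leaving log space:
log(a + b) = f(log a, log b) = log a + log(1 + e^(log b - log a))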
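A small sketch of this identity in code is given below. The function name log_add is hypothetical. Swapping the arguments so that the larger logarithm comes first keeps the exponent non-positive, which is a common precaution against overflow rather than something stated in this lab.

```python
import math


def log_add(log_a, log_b):
    """Return log(a + b) given log a and log b."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a  # make log_a the larger term
    if log_b == -math.inf:
        return log_a  # adding a zero probability changes nothing
    return log_a + math.log1p(math.exp(log_b - log_a))


# 0.2 + 0.3 = 0.5, computed entirely from logarithms.
print(math.exp(log_add(math.log(0.2), math.log(0.3))))
```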
Source: http://dna.cs.byu.edu/bio465/Labs/hmm.shtml
Reference: http://en.wikipedia.org/wiki/Hidden_Markov_model