Previous Up Next

5  Looking for CpG islands



In mammal genomes, methylation of cytosine in dinucleotides CG (called CpG) entails low frequency of such dinucleotides, excepted in small segments, called CpG islands (see [PDM01]). A usual way to detect CpG islands is to compute the ratio number of observed CpG/number of expected CpG on sliding windows.

To goal of this part is to segment a mouse sequence to reveal CpG islands. The model used is 1-length HMM with two states: model of CpG island vs model of methylated sequence. Both models are learned from the proportions of words in sets of sequences, respectively in files mus_cpg.fa and mus_tem.fa.

Moreover, we use the information that in mouse genome, CpG islands are in average 1000 bases long, and are spaced by in average 125,000 bases.
  1. We count the 1|1-proportions on the Lsequence built from both sets of sequences.
    import lsequence
    ls_cpg=lsequence.Lsequence(fic="mus_cpg.fa")
    ls_tem=lsequence.Lsequence(fic="mus_tem.fa")
    import compte
    c_cpg=compte.Compte()
    for s in ls_cpg:
        c_cpg.add_seq(s,2)
    
    c_tem=compte.Compte()
    for s in ls_tem:
        c_tem.add_seq(s,2)
    
    p_cpg=c_cpg.strip().prop(lprior=1,lpost=1)
    print p_cpg
    p_tem=c_tem.strip().prop(lprior=1,lpost=1)
    print p_tem
    


  2. We have to remove the N letters:
    p_cpg=c_cpg.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
    print p_cpg
    p_tem=c_tem.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
    print p_tem
    


  3. Then we build a Lproportion for our HMM:
    import lcompte
    lp=lcompte.Lproportion()
    lp[0]=p_cpg       # state 0 for cpg islands
    lp[1]=p_tem       # state 1 for sequences between of cpg islands
    lp.g_inter(0,0,1-1/1000.0)
    lp.g_inter(0,1,1/1000.0)
    lp.g_inter(1,0,1/125000.0)
    lp.g_inter(1,1,1-1/125000.0)
    print lp
    


  4. To build partitions of the studied sequence, we have use a Lexique.
    import sequence
    s_mus=sequence.Sequence(fic="mus2.fa")
    import lexique
    lx_hmm=lexique.Lexique()
    lx_hmm.read_Lprop(lp)
    


  5. The first partition is computed via Viterbi algorithm.
    import partition
    p_vit=partition.Partition()
    p_vit.viterbi(s_mus,lx_hmm)
    print p_vit
    


  6. The second partition is built via forward-backward algorithm, by the most probable state on each position. Then, before, the Matrice of the log-probabilities of the states must be computed.
    import matrice
    m_fb=matrice.Matrice()
    m_fb.fb(s_mus,lx_hmm)
    p_fb=partition.Partition()
    p_fb.read_Matrice(m_fb)
    print p_fb
    


  7. We want to draw these partitions such that the height of each arc is proportional to the density in C+G.

    We use a Lexique with the descriptor C+G.
    lx_cg=lexique.Lexique(str="+(CG)")
    
    And we draw the partitions:
    p_vit.draw_nf("mus_vit_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
    p_fb.draw_nf("mus_fb_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
    


  8. We want to draw these partitions such that the height of each arc is proportional to the value:
    number of observed CpG
    number of expected CpG
    =
    number of CpG.length of the segment
    number of C.number of G

    We use a Lexique with 3 descriptors: C G and word CG, and define the function on a lexique, a sequence, and a segment, that returns the ratio.
    lx_oe=lexique.Lexique(str="0:C 1:G 2:`CG'")
    def height(l,sq, sg):
        dl=l.ls_evalue(sq,deb=sg.deb(),fin=sg.fin())
        return float(dl[2]*len(sg))/(dl[1]*dl[0])
    
    
    And we draw the partitions:
    p_vit.draw_nf("mus_vit_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
    p_fb.draw_nf("mus_fb_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
    


  9. Now we compute the MPP corresponding to the Lexique; and we draw it, with the height of each arc proportional with the G+C density.
    import parti_simp
    ps=parti_simp.Parti_simp()
    ps.mpp(s_mus, lx_hmm, 50)
    ps.draw_nf("mus_ps_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
    


  10. We want to compare those methods with some specifically made for CpG islands detection: CpGproD [PM01], CpGplot [LGLP92] and CpG Island Searcher [TJ02]. As the outputs of these programs have different syntaxes, they are converted in several files. Those files are, respectively, cpgprod.part, cpgplot.part, and cpgis.part.

    They are all put un a Parti_simp, with the ones computed with HMM algorithms:
    pscmp=parti_simp.Parti_simp()
    pscmp.append(partition.Partition(fic="cpgprod.part"))
    pscmp[-1].s_name("cpgprod")
    pscmp.append(partition.Partition(fic="cpgplot.part"))
    pscmp[-1].s_name("cpgplot")
    pscmp.append(partition.Partition(fic="cpgis.part"))
    pscmp[-1].s_name("cpgis")
    pscmp.append(p_fb)
    pscmp[-1].s_name("fb")
    pscmp.append(p_vit)
    pscmp[-1].s_name("vit")
    


  11. We compare these partitions with the ones computed with MPP, and put the closest ones together, with the criterium of the number of same predictions.
    lmax=[]
    for p in pscmp:
        max=0
        imax=0
        for i in range(len(ps)):
            v=ps[i].pts_comm(p)
            if v>max:
                  imax=i
                  max=v
        lmax.append(imax)
    
    j=1
    for i in lmax:
        pscmp.insert(j,ps[i])
        j+=2
    
    pscmp.draw_nf("tous_cpg.ps",seg=[0])
    

Previous Up Next