找最大重复字串

找最大重复字串

http://www.chinaunix.net/jh/24/464277.html

http://www.chinaunix.net 作者:bioinfor  发表于:2007-03-27 19:16:25
【发表评论】【查看原文】【Shell讨论区】【关闭】 

1) Write a program to identify all the repetitive patterns in a string of
charaters (INPUT).  The string is only composed of A,C,G,T characters.  The
maximum length of string is 10000.  The minimum length of repeat is 10
characters.   Output: position, size, and patterns.  Here is an example:
1)写一个程序,识别字符串中所有的重复片段(重复模式),字符串由A,C,G,T组成,字符串最长为10000,随机产生。重复的片段最小是10个符串。输出:位置,大小,和片段。如下:
String:
TAAAAACTCGGGGT AAAAACTCGGGGAAAA
Repeat:
Repeat: AAAAACTCGGGG, Size: 12, Start Positions: 2, 15

解释:如这就就有两个重复(空格格开了):T  AAAAACTCGGGG  T AAAAACTCGGGG AAAA
这两个重复位置分别在字符串的2和15位,大小为12

 

2) Write a program to identify all the INVERTED repetitive patterns (e.g.
TAACCG => GCCAAT) in a string of character (INPUT).  The string is only
composed of A,C,G,T characters.  The maximum length of string is 10000.  The
minimum length of repeat is 10 characters.   Output: position, size, and
patterns.   Here is an example:

写一个程序识别所有的反向重复,如TAACCG => GCCAAT,也就是前面的反过来就是后面的字符串。和上面一样,字符串最长为10000,随机产生,最小反向重复片段为10,输出位置,大小,和片段,如下:
String:
CAAAAACGAGGGGTTTGGGGAGCAAAAA
Inverted Repeat:
Inverted Repeat: AAAAACGAGGGG, Size: 12, Start Positions: 17, 2

解释:如上面,C  AAAAACGAGGGG  TTT   GGGGAGCAAAAA
AAAAACGAGGGG和GGGGAGCAAAAA分别是反向重复,分别在2和17位上,大小为12。
 
 
哈, 学生物的吧。 去 perl.com 有 total solution, 如 bioperl.

或用 awk 自己写, 要讲效率的话, 还是 C 好。

 lightspeed 回复于:2004-12-12 12:26:33

注意: 下面的例子中的 STR_MAX 及 STR_MIN 可根据需要设定。 

 

#!/bin/awk -f
#
# A script can be used to check any repeat pieces of nucleotide sequences.
#
# Design: lighspeed
# Date: Dec. 12, 2004
#
# Usage::  $0 datafile
#
# Variables:
#
# left - a repeat string to be matched
# right - the right side string used to match any left in it
# rev_left - reverse string of left
# rev_right - the right side string used to match any rev_left in it
# flag[position] - an array element which will be set if the string in the position is matched
#

{
  L=length($0)
  STR_MIN=10
#  STR_MAX=int(L / 2)
  STR_MAX=20
  print "------------------- Line# "NR" --------------------\n"

  for ( Str_Len=STR_MIN; Str_Len <= STR_MAX; Str_Len ++ ) {

    for ( i in flag )
      delete flag

    for ( Position=1; Position <= L - 2 * Str_Len + 1; Position ++ ) {

      if ( Position in flag )
        continue

      count=0
      pos=Position
      offset=Position
      rev_offset=offset
      rev_count=0
      rev_pos=""
      rev_left=""
      left=substr($0,Position,Str_Len)
      if (index(left,"A")==0 || index(left,"C")==0 || index(left,"G")==0 || index(left,"T")==0 )
        continue

      right=substr($0, Position + 1)

      for ( i=length(left); i>=1; i-- ) {
        rev_left=rev_left""substr(left,i,1)
      }

      rev_right=right

      while ( Str_Len <= length(right) ) {
        i=index(right,left)
        if ( i > 0 ) {
   count ++
          j=offset + i
   pos=pos";"j
          flag[j]=1
          right=substr(right, i + 1)
          offset+=i
        }
        else
          break
      }

      while ( Str_Len <= length(rev_right) ) {
        i=index(rev_right,rev_left)
        if ( i > 0 ) {
   rev_count ++
          j=rev_offset + i
          if ( rev_pos == "" ) {
     rev_pos=j
          }
          else
     rev_pos=rev_pos";"j
          flag[j]=1
          rev_right=substr(rev_right, i + 1)
          rev_offset+=i
        }
        else
          break
      }

      if (count > 0 )
        match_number[Str_Len] ++

      if (rev_count > 0 )
        rev_number[Str_Len] ++

      if (count > 0 || rev_count > 0) {
        print  left, "Length="Str_Len, "Position="pos, (rev_count >0) ? "Rev_Position="rev_pos : ""
      }
     
    }
  }
  print ""
  print "Summary of Line# "NR
  print "------------------"
   for ( i in match_number ) {
    print "String length :: " i "  Matched Strings :: " match_number
    delete match_number
  }
  print ""
  for ( i in rev_number ) {
    print "String length :: " i "  Matched Reverse Strings :: " rev_number
    delete rev_number
  }                       

}
 

 


数据文件来自 DNA  NCBI35 的片段。 处理成 10000 个字符的单行文件。


# cat datafile

TAAAATGTGTAATCAACTAATACAAAGCAAGTTTTGTACTTTTTGTTGAATTTATTACTAAGTAT
TCTTTTTGATGCAATTGTAAGTAGAAATATTTATTTATTAAGAGATAGGGTCTTACTGTGTGGCC
CAGTATGGCCTTGAACTCCTGGGCTTAAGACATCCTCCTGCTGCAGCCTCCTGAGTAACTGAGAT
TACAGGTGTGCACCACCTCGCCTGGCTCAGAATGGTTTTCTTAACTTCATTTTTAGATTGTTCAC
TGTGAATATATCGAATTACAATAGTTTAGGCTGGGCATGGTGGCTCACGCCTGTAATCCTAGCAC
TTGGGGAGGCTGAGGTGGGTGGATAACTTGAGGCCAGGAGTTTCAGATCAGCCTGGCCATCACAG
AGAAACCTTGTCTTTACCAAAATCACAACAAATTAATTAGCTGGTTGTGGTGGTGCATGCTTGCA
ATCCCAGCTACTGGGGAGGCTGAGGTACGAGAATTACCTGAACCCAGGAGGTGGAGGTTGCAGTG
AACCGAGATAGTTCCACTGCACTCCAGCCTGGGCGACAGAACGGTTTTTGTATGCTTCAACCTTA
CTGAACTCATTTATTCATTCTGATATTTACTTTAGTGGATTCTGTATGATTTTCTATATGCAAGA
TGCTGTCATTTGCAAATAGAGATAGTTTTTCTTTTTTTGTTTCCAATCTGAATGTGTTTTATTTC
ATTTTCTTGCCTAATGCTCCTCATTAGTTTTCAATGTTGCATAGTATTTTATTGCATGGACGCAC
CATAATTACTTTTACCAATCTCTTATTGATGGACATGAAGGTTATTTCCAAACTCTTGTGGTTAT
AACAATGCTGTAATAGATAACCTATTACAAAGAACAGTTCTCAACTCTTTTGGTCTTGGGACTAC
TTTACCTATTTATGTATAAGTTTCAAGTTTGGGCTTAGAAAGAATTTAATAATCATGCTAATTTT
GTTTTGTTTTCTTTTTTTTTACTCCTGGACCCAAGCGGTCTTCCCACCTCAACCTCCCAAGTAGC
TGAGACTACAAGGGTGAACCATCACCCTGGGTAATTTTTAAATTGTTGGCTGGGCACAGTGGCTC
ACGCCTGTAATCCTAGCACTTTGGGAGGCTGAGACAGGCGGATTACCTGAGGTTGGGAGTTCAAG
ACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTACTAAAAACACAAAAAATGAGCCAGGTGCAG
TGGTGCGTGCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAATTCAGGAG
GTGGATGTTGCGGTGAGCTGAGATCGTGCCACTGAACTCCAGCCTGGGCGACAGAGCAAGATTTC
ATTTCAAAAAACAAAAAGAAAAAAATTTTTAAAAATTGTTTTGAAGAGATACGGTTTCCCTATGT
TGCCTAGGCTGGTCTCATGCGATTCTCCTGCCTTGGCCTCCCAAAGTGTTGGGATTATAGACATG
AGACACCACAAATTTAAACAAGGACTTTTTTTATTTTTTAAAGAGATTACTTTTTCTGAGTAAAC
AAGGACTTTTAAAACAAGGTACTAAAAATCTGGCTGGGCGTGGTGGCTCGCTCCTGTAATCCCAG
CACTTTGGGAGGCTGAGGTGGGCGGATCACGAGGTCAAGAGATCAAGACCATTCTGGCCAACATA
GTGAAACCCCGTCTCTGCTAAAAATACAAAAATTAGCTAGGTGTGGTGGTGCACGCCTGTAGTCC
CAGCTGCTCAGGAGGCTAAGGCAGGAGAATCACTTGAACCCGGGAGGCAGAGGTTGTAGTGAGCC
GAGATCTCACCACTGCACTCCAGCCTGGCAACAGAGTGAGATTCCGTCTCAAAAAAAAAATTTTT
TTTAATAAATAAATAAATAAAAATCTTGAAATTTTTATTAGGTCCTGGTGTTTCTAATTTTAATA
TGATTTAGTTCTCAAGTGCTAGTTAATACTTCATTAATCAGCCAGATGGAAGTGGGGATACTATG
GAAACAGCATAGGCAAAGCTTAAAGATAAATGAGACCATGGTTTGAAAATATAGGGTGGCATGCG
CTTTGGTTCAAGGCAATTTGATCATCACAACAATTTGGCTTAAACAGCACTTTGGTTGAAAATGA
ATATCCCCTAGTTATGTGTTTTTCAAGTATTGGTCATTTTGGTATATCATGAGTTGTTTTGCAAA
CTTTTGTGCCAAAGTTTTCAGGAAAACTTTCTAATATTTGCTTTTGTGTTTCTAACTGATTTTCA
GAGAAGTTGTAATTTTGATGTTTTTTCCTTTTAGTGAGCATGCTTTAACAAAAAACAATAACAGA
AACTGTGTCAAAGAAAAGGACCTGTAATCTTCAGGGTTTGTAGTCTTTTTCCTCTTAAAAAACCC
TTTTCCTAATTAATGGCAGTTACATCTGCATGGCTGGTTTGGGTAAGTCTTCATTTTGTTGTATT
GCTGAGTAACAGTCAACAAAGGTTTATCAACTCTTGGTTAAGGGTTCCTTTCATGTTGTGAGTAA
ACATGAACAATATAGGATCTTATCCTTTTAAGCTATCATGCAAGAAACATGTGAGGTCTCTTAAA
AATTCACTGTGCTGGCCGGGCATGGTGGCTCACGCTTGTAATCCTAGCACTTTGGGAGGCTGAGG
TGGGTGGATCACTTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCT
ACAAAAATACAAAAATCAGCCGGGCATGATGGCGGGCAGGTGCTTGTAATCCCAGCTACTTGGGA
GGCTGAGACAGGAGACTCGCTTGAACCCGGGAGGCGGAGGTTGTAGTGAGCCGAGATTGTGCCAC
TGCACTCCAGCCTGGATGACAGAGCAAGACTCCATCTCAAAAAAGAAAAAAAAAAAAAATTGTGC
TGGCTGGGCTCAGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCAC
CTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACAAAAATACAA
AAATTAGCCAGGCATAATGGCGAGTGCCTGTAATCCAAGCTACTTGGGAGGCTGAGGCAGGAGAA
TCGCTTGAACCCGGGAGTGAGCCGAGATGGCGCCACTGCACTCTAGCTTGGGTGACAACAGCAAG
ATTCTGTCTCAGAAAAAAAAAAAAAATTAACTGTGCTTATAAATGGGAGCTAAATTAGGAAAAAA
ATAAAAAGTAAAAAGAAAATGAAAATAAAAATTTAAAAAATATATTAACAAATTACCTGTCCTAA
GGTAAAATTCTTTTTTTTTTTCTTGAGACGGAGTCTCGCTCTGTCGCCCACTCGGAAAGGAGTGC
CAATCTCGGCGTGAAAATGTGTCTGATGCGTATGCACCTGAGCTAGAAAGCCCAAAGACTGCTAA
GAAGCATGTGAGGGCTCAGAAACAAACATGTTTGGGCTTCGAAAGCCTGTTTTTGGAACCACTTT
CCCTTGTCTGCAAGGCAGAGGGAGGGAGGTACTCTGTTATTTCTAAGTCTCTCTTGAGCTCTTAC
ACTGTGCAAGCCCATGAACGTATTTAATCGTGCATTAGACAATTGTTTTTAATCTATGCCCTGCC
TCTCCCAAGATCAACCTTTCCCTGAGATCGGGGCCCCCTCTGGGTGCACAGGGATATTTTTATTT
TTTGAGTTGGAGTTTTGCTCTTGTCACCCAGGCTGGAGTGCAATGGCATGATCTTGACTCACTGA
AACCTCTTCCTCCCGGCTTCCAGTGATTCTTCTGCCTCAGCCTCCCAAGCAGCTGAGATTACAGG
CATGCACCACCACACTTCGGTTAATTTTTGTATTTTTAGGAGAGATGGAGATTCACCATGTTGGC
CAGGCTGGTCTTGAACTCCTGACCTCAGGTGATCCTCCCGCCTTGGCCTCCCAAAATGCTGGGAT
TATAGGCGTGAGGCACCGTGCCCAGCCCATAGGGATATTTTTATATACTTTCCTGCCCCATGGGT
CAACTGTTCTTGAACCAAAGAAACAAGAGGCGGGGAAGTTATAGGAAGCTTTTAAAATATGCTTC
TGTGCAGCACTGCTCGCAGCGTGTCACAGATGTGCGGTATTGGAAGACGAAGGTGAAACTGCATG
GAGATGATTGTGTGGGGGATGAGGAGGTGGTGGGTAGGGGACTTGGCTTTCTTCACACAAAGACA
TCCAGGCAAATGGTAAGTCCAAAAGCCCTGTGACAGATAATGGCCATTGTTCCTGCAGGGTGACT
CTTTTCTCTTCTTTTTTTTCTTTTTGAGGCGGAGTCTCACTCTGTCATCTATGCTGGAGTGCAAT
GGTGCGATCTTGGCTCACTGCAACTTCCGCTTCCCGGGTTCAAAGTGATTTTTCTGCCTCAGCCC
TCCCGAGTAGCTGGGACTACAGGTGCGCGCCACCATGGCCAGCTAATTTTTATATTTTTAGTAGA
GACGGGGTTTCTCCATGTTAGCCAGGATGGTCTCGATCTCTTGATCTCGTGATCCACCCGCCTCA
GCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGGCGCCCGGCCCTATACACATGATTTTG
AACATACTGACAGATGGAGAAAACCACTTTGGAAAAGATACTTCACATGTTCTAGAGACGATTTA
AACCATTAAGCATTCTATGAAGCTTCTGAAGGTCTGTCAGATTTTAAATGACAACAGTGAAATTT
TAAAACAAGAACAGAAGTCAGCACCAAAGCTAGTTTAACATTAATAATAAGTGAGCCAATAAATA
GGTCTATGTTTGCCCAGGCAGGTTTTGCTTATTATGTCAGTTGGAAAGCCAGAAGGAAACTGGTT
TTAACTCTTAATATAACCTGTATCATGACACCATCACTTTACCAGAAATGTAGCTGATGTCAGCA
TAAGACTGAGACAGTTTACATTTAAAACTGTTGTTTCCTTTCCAACTATTTTCATAATTCATTCA
TGGTATAGGATTGAGACTATTTCCTTAAACAGAAAAAAATGGGTAATTAACATTGAGAACTTTCC
ATGTGCCAGATACTGTATGAACTGTCTTAATTTTCATAGCCACCCTGCAAGATATTATCCTCATC
TTTTTAGAGGAAGAAACAAGTTTCAAGAAATGAAGTAGGTTTTCTAAGGCCACAGCTATAGTAAA
GAGGTGGAGCTGACATTCAAGCTTGGATATGAATTATTATAATTTCCACAGCACTACACAGCTGT
CATTTTCTCTACCTGCAAAACTAAATAAATACTGTTAAAAATAAAAGATGATCTCCAAGATCTCT
AAACATTAAAATTTTACAATAAACTGGTTGAGGTGACACATGCCTATATTTTCAGCTACTCAGGA
GTTTGAGACTGGCCTGGACAACATAGCAAGACCCTGTCTCTAAATTTAAAAAACAAATTACAATG
AGATAATCTTAGACCAGAGAAAGGAAAGTGAAATAGCTATTTGGATTATAAACTGTTTTAGTAAC
TCAAATGTAATGTGTGGTGGTGACAATATCTTTGATTCCTGGGAAGGTCATTGTGAAAGGGAATA
GAAAATGCCTTGAAGTCAAAATATAAGGCTCTCAAATAGAAAAATAAATATAACATTTAAGTATT
ATCAACAGAGAACCAAGTTAGAAAAAACTAGTTATAGTCTGAAACAATGCTGTTTAAAAGACTGC
AGTCACCAGTGTAAACTGACTCAGGCAACACTTCCCAGGGTCCATGCCGTGGACAACTGACTAAT
CTCTCTATAAACAATTCTTGACACTAGATAGGCCTTTACTAAGAGCAACCAGAGACAGAAATTAG
TATCGACAGTGGAGTTTTAAAATCACACTTAAAAAAATATTATTGGCTGGGCACAGTGGCTCACG
CCTGTAATCCCAGCACTTTGGGAGGCTGAGGCAGGCAGATCATGAGGTCAGGAGATCAAGGCTAT
CCTGGCCAACATGGTGAAACCCCGTCTCTATTAAAAATACAAAAATTAGCCGGGCGTGGCGGTGA
GTGCCTGTAGCCCCAGCTACTTGGGAGGGTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCGGAG
GCTACAGTAAGCCGAGTTCGTGCCACTGCACTCCAGCCTCGGCGACGGAGCGAGACTCCCTCTCA
AAAAAAGAAAAAAAAAATGTAGATTATATTCTGTGAATATTACATCACAGAATAAAACTCTGGAT
ATAATACATGGGAGAGTTAATATCCAGAAAGACATTGTGCATTTTTGGTCTAAGTTTCATGAGAC
AAAATATTATTTTCTTTTCTGAGACTCAATTCTTTCCCAAAGGGATCAGTTCTCTTAAGTGGACC
TTTTTACAGCCTTTCAGCTGGCTCAAAAGATGAGTTTTGGCGAACAAGATTATCGATACTCACTG
AGCAAGTGGTAGTTAGAATCCCTTTCATATTTGAAGGTCAAACGGCCATAGCTGACATGATTTAG
ATTCTTCAGCCACTCAAAGTAAGATACTGTCACTCCTCCAGCATTCAAGTAGAGATCCTATGCAC
AAAAATAAGACAAAGAAATTAGAAGATGATGGTTTTCGTAAAAGCTGAAAATGAACCTAAGACCT
TAATTTCAATACCAAGGTAGACTGGACTTCAAATATCGCAAATATATTTTAGCCAGTATCAGGAA
TTTCACACTTAATTAACACTCCTTCCCATCCCACCCAATTCCATCTAAGGCTTTTTCTATTTAGG
AAAAAAAAAAATCATTTTTTGGCTTATTAATCAAGGAAAGTTAATAATCTTTGGTTAGAGCCTCT
TCTCTACCAGAAGTTAGTTCTCAGACTAAATGGCTTGCCCACCAACCAGTTGGACTGGACTGTCC
ACAGGGCCTCTCAGAAGACAGGATTCTTTCTCCTATTACCTAAGGGTAGCCCATTTCAGTTACAT
TAAATGTCTTAAGTGCTTTCAGCAAAGGGGGTTCTTTAAAATATATTCCAAGCCCACATTAATTT
CTAGTAACTTTTTGGTGTAGACTCATTTTTACTTGCTAAAAAACCTGAGCACGTGTTCCTCATAT
TAGTTTTCTGAGTAAAGCTGGAAAGGGCACTTGAAATGCATAAGGTTAGGAATCATATAGAAAAT
CTTAAGAGCTTTAGTTAGAATAGTGTTTCTAACACAGTACACATTTATATAACCAGACTCTTAGA
AGGCTAAAGACATTCAGTAAAGAGCCTGAAATTGGAATAAATGTTTCGATCAAAGTGAAAATTAA
CAGGCTTAGAATTAACCATGCTTCTACTATATTTTCTCAAAAGTGAAAAAGATGAATTCACTAGA
GCTTGGAGACTAATAATTCCTCTCTTCCTCCAAATTCCTTGCAAAAGACTATTATGATTCTAAGT
ACATATAAAGCCTAATAAATATAGATGACTTACTGGAATAACCATAATGTTTCTCTCCAGGAAGA
TCTTGTCAGCTTCTGGAGTTGTTGGCCCATTGGCACCTTCAGCAATGATCTGCAAGAGAGTCAGG
AACATAGAGAAATGCGAACACCACCGTCAAATCCCCTCCACTGAGGGCAAGAGATGTGCATATAT
GAACAAGGGGCTGTGGGGAGAAGCACAGTTTCAGTTAAAGTTAAATAGAGGTTATTTTTCTCTGC
CAAGTGTATAAAACTACCTTTCACTTTTCTATTTATCTAGGTTTTTTTTTTGTTTGCTTGTTTTT
TTTTTTTACAGGAGTGTCAGGCAGATGCGTTTGTTTTGGTAATGGTTGCACAACTCTGTGCCGGT
AGCTAAAAGCCATTAAATTATGTACCTTAAATGGGGGAACTGTATGGTATGTGCAGTATGTGCCA
ATAAAGCTGCTAAAAAGAAAGAGAAAACTCAATCAGACTCTTCTATGACCCCCCTAACGTCATTC
ACATTGATAATGTTGGTTCTGGTTTCTATAATGTTGTCACCTTGGCTTTGACTCTGGGTGCGTTG
GATTTGGTCAACTGCTTCTCACTGGCAGCTGGGATCAGTATGTCACAGTCGGCCTCCAAGATGCT
TCCTTCATAGGGCTTTGCCTTGGGGAAGCCCAGAATGGACCCATGTTGCTGCCATTGATTGAAAA
TCACAATTAATAGCTGCACCAGAGTTTTAAATATTTATATTTAGTGTCTATGCTATAAAAATGTA
TTAATACCAATTTGAAGTCTTCCAGTTCCTTTGGGTCAATACCATCTGGATTCCATATACTCCCA
TCAGACTCACCAACAGCAATACATTTAGCACCAAAACGATGTAAATATCTCATAGAGTGCAGGCC
CACATTACCAAATCCCTGTGAAGAACAATTACCCATAACACAAAAATTAAAGTCCTGGTATAGAC
AGCAAGAGTCATATTTTGACCAATGTAAATCCATACTCTGTTTATTTAGAAGCAATTCTCAAAAT
TCTTTTGCCAAAAGAAACAATGTACTAACTGGTTTCTCTTCAACAATAAAATTCTCTGTTTAAGA
ATGTGATGAGCGGGCATGGTGGCTCACGCTTGTAATTCCAGCACTTTGGGAGGCTGAGGCAGGTG
GATCACTTGAGGTCAGGAGTTCGAGATCAGCCATGGCCAACATGGTGAAACCCCGTCTCTACTAA
AAATACAAAAATTAGGCATGGGGTCCGTGCCTGTAATCCCAGCTACTTGGGAGAATGAGGTAGGA
GAATCACTTGAACCTGGGAGGTGGAGGTTGCAGTGAGCCGAGACTGCACTCCAGCCTGGGCAATA
GGGTGAGACTCCATCTCAATCAATCAATAAATGGCAGTGGTGTTAAGTACACCACCACTTTTTGC
TTTTTTTTTTTTTTTTTTTTTGATGGAGTTTTGCTCTTGTTGCCCAGGCTGGAGTGCAATAGCGG
AATCTCGGCTCACCACAATCTCTGCCTCCCAGGTTCAAGCAATTCTCCTGCCTCAGCCTCCTAAG
TAGCTGGGATTACAGGCATGCGCCACCTCGCCTGGCTAATTTTGTATTTTTAGTAGAGACAGGGT
TTCTCCATGTTGGTCAGGCTGGTCTCGAACTCCTGACCTCAGGTGATCCGCCACTTCAGTCTGCT
AAAGTGCTGGGATTACAGCTGTGAGCCACAGTGCCCGGACTTTCTTTTTTTTTTTTTTTTGCCAA
TTTGTATTTTATTTTTGCTAATTTAAAAAATAGTTAATAGAATATCAGAAATACTGAACATTATC
ATTTCCATAAATGCAAGAGTGTATACATTTTCCACACACTGAGTACTAGCTAGATTTTCTATGAT
AAACTCTGACCACTTCTTCAGGCAATTCATGTACTTACTTCAGCATTATCATTAATGATGAAGGT
TCTAGAACCATCATGAACAAGGGTCCCATCTTCACACAAGTTACTTAACTGCTGGGAGGCTCTAT
TTCATCTTATGTAAACTACAGATAATACCTACTCACCTCAAGGGTATATCAAGAGTTTATGTAAG
CTAAGTTTGTAGAAAGTAGTTAGCACAGTGCCAGGAAGGGTCCAAGAAGAAATGGTACTTACTAT
GAAATATTTGTACGTATATATGTATGCATGTTAATGAGCTCTTATTAGCTGTGTTCATTAAAGGT
TTTCTCTATCCTGTGATTTGCTTTTAGATTTTGGAATACATTTCATTGTGCACATTCCATTTGTA
TTATTAATATAACAATATTTATTACTATTATTATTATCATCATCAATTCAATCACATCTACTGTA
TCCCTGATGATGACCATGATTCCTTTAATAATCATAAAACTCTCTTCCCTTCATCACGGGGTAAA
TAACCTATCACAATGCTGTAAGTCTCCATCAGCACCCCAGGCTGCCCCTGCTGACTTACCAGATC
TGCTACTTCAGCCAGATCAATCATCATGTTAATGTCACACCCACATTCAATAATGGTGAGTCGGA
GCTTTTTACCTGTAAGTGAAAAAGATAAAAATTTTACTTTAAAAAGACCCTGAAA

 

 


运行:

# time ./1 datafile > report

real    1m14.81s
user    0m50.65s
sys     0m0.13s

# wc -l report

    3355 report


下面是 report 文件中的部分内容:

 

# cat report

------------------- Line# 1 --------------------

TAATCAACTA Length=10 Position=10 Rev_Position=12
TTTATTACTA Length=10 Position=51;9703
GGAGGCTGAG Length=10 Position=330;470;1129;1265;1633;2655;2793;3102;5936;8564
AAAAAAAAAA Length=10 Position=1871;2906;2907;2908;2909;2910;3198;3199;3200;3201;3202;6183;6761;6762 Rev_Position=2906;2907;2908;2909;2910;3198;3199;3200;3201;3202;6183;6761;6762
GTAGTGAGCCGA Length=12 Position=1811;2838
AATAAATAAATA Length=12 Position=1889;1893 Rev_Position=1890;1894
CCACTGCACTCCA Length=13 Position=534;1830;2857;6133
GTTTTTCTTTTTT Length=13 Position=675 Rev_Position=4304
GTAATCCCAGCTAC Length=14 Position=1248;2776;8678
CCTGTAATCCTAGCA Length=15 Position=310;1109
GGATCACTTGAGGTCA Length=16 Position=2671;8580
CGGGCATGGTGGCTCAC Length=17 Position=2617;8526
CACTTGAGGTCAGGAGTT Length=18 Position=2675;8584
CCAGCCTGGCCAACATGGT Length=19 Position=1172;2698;3011
CTTTGGGAGGCTGAGGCAGG Length=20 Position=5931;8559
TTTTTTTTTTTTTTTTTTTT Length=20 Position=8841;8842 Rev_Position=8842

Summary of Line# 1
------------------
String length :: 10  Matched Strings :: 541
String length :: 11  Matched Strings :: 438
String length :: 12  Matched Strings :: 372
String length :: 13  Matched Strings :: 329
String length :: 14  Matched Strings :: 293
String length :: 15  Matched Strings :: 258
String length :: 16  Matched Strings :: 226
String length :: 17  Matched Strings :: 200
String length :: 18  Matched Strings :: 171
String length :: 19  Matched Strings :: 146
String length :: 20  Matched Strings :: 124

String length :: 10  Matched Reverse Strings :: 142
String length :: 11  Matched Reverse Strings :: 69
String length :: 12  Matched Reverse Strings :: 40
String length :: 13  Matched Reverse Strings :: 25
String length :: 14  Matched Reverse Strings :: 15
String length :: 15  Matched Reverse Strings :: 7
String length :: 16  Matched Reverse Strings :: 5
String length :: 17  Matched Reverse Strings :: 2
String length :: 18  Matched Reverse Strings :: 1
String length :: 19  Matched Reverse Strings :: 1
String length :: 20  Matched Reverse Strings :: 1

 lightspeed 回复于:2004-12-16 23:39:29

这里是 final version.

1. 增加了二分查找最长匹配串并自动设置 STR_MAX 的代码。
2. 改进了 reverse string 算法, 大大提高了运行效率。

# time ./1 datafile > report1

real    2m57.43s
user    2m29.33s
sys     0m0.08s

$ time ./1 -v r=1 datafile > report2

real    0m53.28s
user    0m46.03s
sys     0m0.03s    

 

#!/bin/awk -f
#
# A script can be used to check any repeat pieces of nucleotide sequences.
#
# Design: lighspeed
# Date: Dec. 16, 2004
#
# Repeat Match Usage::  $0 datafile
# Reverted Repeat Match Usage::  $0 -v r=1 datafile
#

# function is_overlap : check if a string (position=p, length=l) is overlap with
# matched strings which are stored in array record (position=i, length=record)

function is_overlap(p, l) {
  e = p + l - 1
  for (i in record) {
   a = i + record - 1
   if (( i >= p && i <= e ) || ( a >= p && a <= e ) || ( p >= i && p <= a ) || ( e >= i && e <= a ))
      return 1
  }
  return 0
}

# flag=0 find the longest matched string and set STR_MAX
# flag=1 find all other matched strings

function find_string(STR_MIN, STR_MAX, flag) {
 
  for (i in record)
    delete record

  if ( flag == 1 ) {
    if ( r == 1 )
      print "---------------Reverted Repeat Match Line# "NR" -----------------\n"
    else
      print "------------------Repeat Match Line# "NR" --------------------\n"
  }

  for ( Str_Len=STR_MAX; Str_Len >= STR_MIN; Str_Len -- ) {

    for ( Position=1; Position <= L - 2 * Str_Len + 1; Position ++ ) {
      if ( is_overlap(Position,Str_Len) == 1 )
        continue

      count=0
      pos=Position
      offset=Position + Str_Len - 1
      left=substr($0,Position,Str_Len)

      if (index(left,"A")==0 || index(left,"C")==0 || index(left,"G")==0 || index(left,"T")==0 )
        continue

      right=substr($0, Position + Str_Len)

# Reverse string left. The start position in reverse string is L -Position - Str_Len + 2

      if ( r == 1 ) {
        old_left=left
        left=substr(REVERSE_STRING, L - Position - Str_Len + 2, Str_Len)
      }
      while ( Str_Len <= length(right) ) {
        i=index(right,left)
        if ( i > 0 ) {
          if ( flag == 0 )
            return 1
          j=offset + i
          if ( is_overlap(j,Str_Len) == 0 ) {
     count ++
            record[Position]=Str_Len
            record[j]=Str_Len
     pos=pos","j
          }
          right=substr(right, i + Str_Len)
          offset+=(i + Str_Len - 1)
        }
        else
          break
      }


      if (count > 0 && flag == 1) {
        match_number[Str_Len] ++
        if (r == 1)
          print  "Reverted Repeat: " old_left",", "Size: "Str_Len",", "Start Positions: "pos
        else
          print  "Repeat: " left",", "Size: "Str_Len",", "Start Positions: "pos
      }
     
    }
  }
  if ( flag == 0 )
    return 0
}

{

  L=length($0)
  if ( r == 1 ) {
    REVERSE_STRING=""
# split($0, aaa, "") :  Use null string "" to split every characters into array aaa
    split($0, aaa, "")
    for (i=L; i >= 1; i -- ) {
      REVERSE_STRING=REVERSE_STRING""aaa
      delete aaa
    }
  }

  STR_MIN=10
  STR_MAX=0

# Find max matched string and set STR_MAX by binary search algorithm
  low=STR_MIN - 1
  high=int(L / 2) + 2
  while ( low < high ) {
    if ( (high - low) == 1 ) {
      if ( low >= STR_MIN )
        STR_MAX=low
      break
    }
    mid=int((high + low) / 2)
    status=find_string(mid, mid, 0)
    if (status)
      low=mid
    else
      high=mid
  }
  if ( STR_MAX >= STR_MIN )
    find_string(STR_MIN, STR_MAX, 1)
}

 


3. data2 (100000 字符) 的结果

当长度增加到 100000 时, 速度变的很慢, 我只测了正序, 而且没有完全
完成, 但可以估计时间。

由于存在的重复匹配串的最大长度很小, 我采用二分查找首先找到重复匹配串的最大长度
然后再循环至最小长度。 下面结果前面的

9 50002 0
9 25005 0
.

就是显示查找的过程。 找到重复匹配串的最大长度为 43 用时 67 min.
然后, 从 43 循环至 10. 每个长度用时约 5 min

总时间大约为 67 + 34 * 5 = 237 min 也就是大约 4 小时。


9 50002 0
9 25005 0
9 12507 0
9 6258 0
9 3133 0
9 1571 0
9 790 0
9 399 0
9 204 0
9 106 0
9 57 1
33 57 0
33 45 1
39 45 1
42 45 1
43 45 0


Repeat: ATTGGATCATTGATCTAATCCAACCACATAACTATAATTACAG, Size: 43, Start Positions: 25161,25204
Repeat: TCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCG, Size: 41, Start Positions: 26357,75092
Repeat: AGTCTTGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCG, Size: 39, Start Positions: 31148,69805
.
.

产生随机字符串
$ cat rand.pl
#!/usr/bin/perl -w
@array=("A","T","C","G");
for ($a=0;$a<1000;$a++) {
        $rand=$array[int rand scalar @array];
        $out.=$rand;
}
print "$out\n";

$ perl rand.pl>datafile
$ cat datafile
ATACTCGTTGACTCTACTCATAACAGCATTAAAGACCGTAGGGTAGGCT...
riverfor 回复于:2004-12-18 03:18:35


=======================repeat.py===========================
# !/usr/bin/pyton
# Just enjoy the programming, share your excellent code
# report bugs to email/ msn: [email protected]
#
import string
import commands
import sys
## data for deamon
data = "1234567890112345678901234567890 1234567890989901234567890"
## call this functions to get the real data which was saved inthe filePath
def getSrcDataFromFile(filePath):
    data = commands.getoutput("cat %s" % filePath)
##
result = []
## get all substrs used to check whether be fit for
def getAllSubStr(splitCode = " ", minDataLen = 10, maxDataLen = 10000):
    dataLen = len(data) + 1
    for x in range(dataLen):
        for j in range(x, dataLen):
            seg, ilen = data[x:j], j - x
            # check whether to be fit for these conditions
            if seg.count(splitCode) == 0 :
                if ilen >= minDataLen and ilen <= maxDataLen:
                    if result.count(seg) < 1:
                        result.append(seg)
                   ## process
def process(filePath = ""):
    if filePath != "":
        getSrcDataFromFile(filePath)
    getAllSubStr()
    print "sub data\t repeat \t indexs"
    for item in result:
        counts, itemlen = data.count(item), len(item)
        if counts > 1:
            indexs, start, seg = [], 0, data
            for x in range(counts):
                if seg != "":
                    start += seg.index(item)
                indexs.append(start)
                start += itemlen
                seg = seg[start:]
            print item, "\t" ,counts, "\t" ,indexs
                  
if __name__ == "__main__":
    print "source data: %s " % data
    print "========result:========"
    if len(sys.argv) < 2:
        process()
    else:
        proccess(sys.argv[1])
 
================================================
This script only print the sub data repeats twice or more, which was with the condition of splitCode = " ", minDataLen = 10, maxDataLen = 10000.
Usage: python repeay.py [data file] [| awk / sed ....]

第一次发贴,嘻嘻,没时间写注释啦,但python本来就是如此易读(不要笑话我程序中的英文注解的语法错误哦,因为很多系统的中文支持不好,习惯了).很纳闷为什么cu连php版面都有,却连python的论坛都没有呢,我用过php, java, c, python. 我觉得python绝对是值得我们关注的一门语言,它可以用做shell(远比sed awk易读), 网站脚本(mod_py), application语言(网络的twisted框架, tk的GUI, 各种数据库的支持)......

用awk写了一下第一题,
用了最笨得办法,效率超低。处理这个datafile用了10s,而且只适用处理短一些的串。。。

 

BEGIN {
        _MIN_LEN=10;
}
function find_max_str()
{
        for(i=length($0)/2;i>=_MIN_LEN;i--) {
                for(j=0;j                         k=substr($0,j,i);

                        if((index(k," ")>0)||!((index(k,"A")>0 && index(k,"C")>0 && index(k,"G")>0 && index(k,"T")>0))) {
                                continue;
                        }

                        if((a[k]++)==2) {
                                printf("%d:%d:%d:%d:%s\n",NR,b[k],j,length(k),k);
                                return 0;
                        }
                        else {
                                b[k]=j;
                        }
                }
                for(idx1 in a) {
                        delete a[idx1];
                }
                for(idx2 in b) {
                        delete b[idx2];
                }
        }
        return 1;
}
{
        find_max_str();
}

 

 

 zhl1979 回复于:2007-03-27 19:16:25

awk '{ for (l=10;l<50;l++){for(i=1;i<(10001-l);i++){a=substr($0,i,l) ;hash[a]++  ;   t=hashb[a]; t=i" "t; hashb[a]=t }}}END{for(a  in hash ){ if (hash[a]>1){ print a ,hash[a], hashb[a] }}}

 

 

你可能感兴趣的:(找最大重复字串)