找最大重复字串
http://www.chinaunix.net/jh/24/464277.html
http://www.chinaunix.net 作者:bioinfor 发表于:2007-03-27 19:16:25
【发表评论】【查看原文】【Shell讨论区】【关闭】
1) Write a program to identify all the repetitive patterns in a string of
charaters (INPUT). The string is only composed of A,C,G,T characters. The
maximum length of string is 10000. The minimum length of repeat is 10
characters. Output: position, size, and patterns. Here is an example:
1)写一个程序,识别字符串中所有的重复片段(重复模式),字符串由A,C,G,T组成,字符串最长为10000,随机产生。重复的片段最小是10个符串。输出:位置,大小,和片段。如下:
String:
TAAAAACTCGGGGT AAAAACTCGGGGAAAA
Repeat:
Repeat: AAAAACTCGGGG, Size: 12, Start Positions: 2, 15
解释:如这就就有两个重复(空格格开了):T AAAAACTCGGGG T AAAAACTCGGGG AAAA
这两个重复位置分别在字符串的2和15位,大小为12
2) Write a program to identify all the INVERTED repetitive patterns (e.g.
TAACCG => GCCAAT) in a string of character (INPUT). The string is only
composed of A,C,G,T characters. The maximum length of string is 10000. The
minimum length of repeat is 10 characters. Output: position, size, and
patterns. Here is an example:
写一个程序识别所有的反向重复,如TAACCG => GCCAAT,也就是前面的反过来就是后面的字符串。和上面一样,字符串最长为10000,随机产生,最小反向重复片段为10,输出位置,大小,和片段,如下:
String:
CAAAAACGAGGGGTTTGGGGAGCAAAAA
Inverted Repeat:
Inverted Repeat: AAAAACGAGGGG, Size: 12, Start Positions: 17, 2
解释:如上面,C AAAAACGAGGGG TTT GGGGAGCAAAAA
AAAAACGAGGGG和GGGGAGCAAAAA分别是反向重复,分别在2和17位上,大小为12。
哈, 学生物的吧。 去 perl.com 有 total solution, 如 bioperl.
或用 awk 自己写, 要讲效率的话, 还是 C 好。
lightspeed 回复于:2004-12-12 12:26:33
注意: 下面的例子中的 STR_MAX 及 STR_MIN 可根据需要设定。
#!/bin/awk -f
#
# A script can be used to check any repeat pieces of nucleotide sequences.
#
# Design: lighspeed
# Date: Dec. 12, 2004
#
# Usage:: $0 datafile
#
# Variables:
#
# left - a repeat string to be matched
# right - the right side string used to match any left in it
# rev_left - reverse string of left
# rev_right - the right side string used to match any rev_left in it
# flag[position] - an array element which will be set if the string in the position is matched
#
{
L=length($0)
STR_MIN=10
# STR_MAX=int(L / 2)
STR_MAX=20
print "------------------- Line# "NR" --------------------\n"
for ( Str_Len=STR_MIN; Str_Len <= STR_MAX; Str_Len ++ ) {
for ( i in flag )
delete flag
for ( Position=1; Position <= L - 2 * Str_Len + 1; Position ++ ) {
if ( Position in flag )
continue
count=0
pos=Position
offset=Position
rev_offset=offset
rev_count=0
rev_pos=""
rev_left=""
left=substr($0,Position,Str_Len)
if (index(left,"A")==0 || index(left,"C")==0 || index(left,"G")==0 || index(left,"T")==0 )
continue
right=substr($0, Position + 1)
for ( i=length(left); i>=1; i-- ) {
rev_left=rev_left""substr(left,i,1)
}
rev_right=right
while ( Str_Len <= length(right) ) {
i=index(right,left)
if ( i > 0 ) {
count ++
j=offset + i
pos=pos";"j
flag[j]=1
right=substr(right, i + 1)
offset+=i
}
else
break
}
while ( Str_Len <= length(rev_right) ) {
i=index(rev_right,rev_left)
if ( i > 0 ) {
rev_count ++
j=rev_offset + i
if ( rev_pos == "" ) {
rev_pos=j
}
else
rev_pos=rev_pos";"j
flag[j]=1
rev_right=substr(rev_right, i + 1)
rev_offset+=i
}
else
break
}
if (count > 0 )
match_number[Str_Len] ++
if (rev_count > 0 )
rev_number[Str_Len] ++
if (count > 0 || rev_count > 0) {
print left, "Length="Str_Len, "Position="pos, (rev_count >0) ? "Rev_Position="rev_pos : ""
}
}
}
print ""
print "Summary of Line# "NR
print "------------------"
for ( i in match_number ) {
print "String length :: " i " Matched Strings :: " match_number
delete match_number
}
print ""
for ( i in rev_number ) {
print "String length :: " i " Matched Reverse Strings :: " rev_number
delete rev_number
}
}
数据文件来自 DNA NCBI35 的片段。 处理成 10000 个字符的单行文件。
# cat datafile
TAAAATGTGTAATCAACTAATACAAAGCAAGTTTTGTACTTTTTGTTGAATTTATTACTAAGTAT
TCTTTTTGATGCAATTGTAAGTAGAAATATTTATTTATTAAGAGATAGGGTCTTACTGTGTGGCC
CAGTATGGCCTTGAACTCCTGGGCTTAAGACATCCTCCTGCTGCAGCCTCCTGAGTAACTGAGAT
TACAGGTGTGCACCACCTCGCCTGGCTCAGAATGGTTTTCTTAACTTCATTTTTAGATTGTTCAC
TGTGAATATATCGAATTACAATAGTTTAGGCTGGGCATGGTGGCTCACGCCTGTAATCCTAGCAC
TTGGGGAGGCTGAGGTGGGTGGATAACTTGAGGCCAGGAGTTTCAGATCAGCCTGGCCATCACAG
AGAAACCTTGTCTTTACCAAAATCACAACAAATTAATTAGCTGGTTGTGGTGGTGCATGCTTGCA
ATCCCAGCTACTGGGGAGGCTGAGGTACGAGAATTACCTGAACCCAGGAGGTGGAGGTTGCAGTG
AACCGAGATAGTTCCACTGCACTCCAGCCTGGGCGACAGAACGGTTTTTGTATGCTTCAACCTTA
CTGAACTCATTTATTCATTCTGATATTTACTTTAGTGGATTCTGTATGATTTTCTATATGCAAGA
TGCTGTCATTTGCAAATAGAGATAGTTTTTCTTTTTTTGTTTCCAATCTGAATGTGTTTTATTTC
ATTTTCTTGCCTAATGCTCCTCATTAGTTTTCAATGTTGCATAGTATTTTATTGCATGGACGCAC
CATAATTACTTTTACCAATCTCTTATTGATGGACATGAAGGTTATTTCCAAACTCTTGTGGTTAT
AACAATGCTGTAATAGATAACCTATTACAAAGAACAGTTCTCAACTCTTTTGGTCTTGGGACTAC
TTTACCTATTTATGTATAAGTTTCAAGTTTGGGCTTAGAAAGAATTTAATAATCATGCTAATTTT
GTTTTGTTTTCTTTTTTTTTACTCCTGGACCCAAGCGGTCTTCCCACCTCAACCTCCCAAGTAGC
TGAGACTACAAGGGTGAACCATCACCCTGGGTAATTTTTAAATTGTTGGCTGGGCACAGTGGCTC
ACGCCTGTAATCCTAGCACTTTGGGAGGCTGAGACAGGCGGATTACCTGAGGTTGGGAGTTCAAG
ACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTACTAAAAACACAAAAAATGAGCCAGGTGCAG
TGGTGCGTGCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAATTCAGGAG
GTGGATGTTGCGGTGAGCTGAGATCGTGCCACTGAACTCCAGCCTGGGCGACAGAGCAAGATTTC
ATTTCAAAAAACAAAAAGAAAAAAATTTTTAAAAATTGTTTTGAAGAGATACGGTTTCCCTATGT
TGCCTAGGCTGGTCTCATGCGATTCTCCTGCCTTGGCCTCCCAAAGTGTTGGGATTATAGACATG
AGACACCACAAATTTAAACAAGGACTTTTTTTATTTTTTAAAGAGATTACTTTTTCTGAGTAAAC
AAGGACTTTTAAAACAAGGTACTAAAAATCTGGCTGGGCGTGGTGGCTCGCTCCTGTAATCCCAG
CACTTTGGGAGGCTGAGGTGGGCGGATCACGAGGTCAAGAGATCAAGACCATTCTGGCCAACATA
GTGAAACCCCGTCTCTGCTAAAAATACAAAAATTAGCTAGGTGTGGTGGTGCACGCCTGTAGTCC
CAGCTGCTCAGGAGGCTAAGGCAGGAGAATCACTTGAACCCGGGAGGCAGAGGTTGTAGTGAGCC
GAGATCTCACCACTGCACTCCAGCCTGGCAACAGAGTGAGATTCCGTCTCAAAAAAAAAATTTTT
TTTAATAAATAAATAAATAAAAATCTTGAAATTTTTATTAGGTCCTGGTGTTTCTAATTTTAATA
TGATTTAGTTCTCAAGTGCTAGTTAATACTTCATTAATCAGCCAGATGGAAGTGGGGATACTATG
GAAACAGCATAGGCAAAGCTTAAAGATAAATGAGACCATGGTTTGAAAATATAGGGTGGCATGCG
CTTTGGTTCAAGGCAATTTGATCATCACAACAATTTGGCTTAAACAGCACTTTGGTTGAAAATGA
ATATCCCCTAGTTATGTGTTTTTCAAGTATTGGTCATTTTGGTATATCATGAGTTGTTTTGCAAA
CTTTTGTGCCAAAGTTTTCAGGAAAACTTTCTAATATTTGCTTTTGTGTTTCTAACTGATTTTCA
GAGAAGTTGTAATTTTGATGTTTTTTCCTTTTAGTGAGCATGCTTTAACAAAAAACAATAACAGA
AACTGTGTCAAAGAAAAGGACCTGTAATCTTCAGGGTTTGTAGTCTTTTTCCTCTTAAAAAACCC
TTTTCCTAATTAATGGCAGTTACATCTGCATGGCTGGTTTGGGTAAGTCTTCATTTTGTTGTATT
GCTGAGTAACAGTCAACAAAGGTTTATCAACTCTTGGTTAAGGGTTCCTTTCATGTTGTGAGTAA
ACATGAACAATATAGGATCTTATCCTTTTAAGCTATCATGCAAGAAACATGTGAGGTCTCTTAAA
AATTCACTGTGCTGGCCGGGCATGGTGGCTCACGCTTGTAATCCTAGCACTTTGGGAGGCTGAGG
TGGGTGGATCACTTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCT
ACAAAAATACAAAAATCAGCCGGGCATGATGGCGGGCAGGTGCTTGTAATCCCAGCTACTTGGGA
GGCTGAGACAGGAGACTCGCTTGAACCCGGGAGGCGGAGGTTGTAGTGAGCCGAGATTGTGCCAC
TGCACTCCAGCCTGGATGACAGAGCAAGACTCCATCTCAAAAAAGAAAAAAAAAAAAAATTGTGC
TGGCTGGGCTCAGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCAC
CTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACAAAAATACAA
AAATTAGCCAGGCATAATGGCGAGTGCCTGTAATCCAAGCTACTTGGGAGGCTGAGGCAGGAGAA
TCGCTTGAACCCGGGAGTGAGCCGAGATGGCGCCACTGCACTCTAGCTTGGGTGACAACAGCAAG
ATTCTGTCTCAGAAAAAAAAAAAAAATTAACTGTGCTTATAAATGGGAGCTAAATTAGGAAAAAA
ATAAAAAGTAAAAAGAAAATGAAAATAAAAATTTAAAAAATATATTAACAAATTACCTGTCCTAA
GGTAAAATTCTTTTTTTTTTTCTTGAGACGGAGTCTCGCTCTGTCGCCCACTCGGAAAGGAGTGC
CAATCTCGGCGTGAAAATGTGTCTGATGCGTATGCACCTGAGCTAGAAAGCCCAAAGACTGCTAA
GAAGCATGTGAGGGCTCAGAAACAAACATGTTTGGGCTTCGAAAGCCTGTTTTTGGAACCACTTT
CCCTTGTCTGCAAGGCAGAGGGAGGGAGGTACTCTGTTATTTCTAAGTCTCTCTTGAGCTCTTAC
ACTGTGCAAGCCCATGAACGTATTTAATCGTGCATTAGACAATTGTTTTTAATCTATGCCCTGCC
TCTCCCAAGATCAACCTTTCCCTGAGATCGGGGCCCCCTCTGGGTGCACAGGGATATTTTTATTT
TTTGAGTTGGAGTTTTGCTCTTGTCACCCAGGCTGGAGTGCAATGGCATGATCTTGACTCACTGA
AACCTCTTCCTCCCGGCTTCCAGTGATTCTTCTGCCTCAGCCTCCCAAGCAGCTGAGATTACAGG
CATGCACCACCACACTTCGGTTAATTTTTGTATTTTTAGGAGAGATGGAGATTCACCATGTTGGC
CAGGCTGGTCTTGAACTCCTGACCTCAGGTGATCCTCCCGCCTTGGCCTCCCAAAATGCTGGGAT
TATAGGCGTGAGGCACCGTGCCCAGCCCATAGGGATATTTTTATATACTTTCCTGCCCCATGGGT
CAACTGTTCTTGAACCAAAGAAACAAGAGGCGGGGAAGTTATAGGAAGCTTTTAAAATATGCTTC
TGTGCAGCACTGCTCGCAGCGTGTCACAGATGTGCGGTATTGGAAGACGAAGGTGAAACTGCATG
GAGATGATTGTGTGGGGGATGAGGAGGTGGTGGGTAGGGGACTTGGCTTTCTTCACACAAAGACA
TCCAGGCAAATGGTAAGTCCAAAAGCCCTGTGACAGATAATGGCCATTGTTCCTGCAGGGTGACT
CTTTTCTCTTCTTTTTTTTCTTTTTGAGGCGGAGTCTCACTCTGTCATCTATGCTGGAGTGCAAT
GGTGCGATCTTGGCTCACTGCAACTTCCGCTTCCCGGGTTCAAAGTGATTTTTCTGCCTCAGCCC
TCCCGAGTAGCTGGGACTACAGGTGCGCGCCACCATGGCCAGCTAATTTTTATATTTTTAGTAGA
GACGGGGTTTCTCCATGTTAGCCAGGATGGTCTCGATCTCTTGATCTCGTGATCCACCCGCCTCA
GCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGGCGCCCGGCCCTATACACATGATTTTG
AACATACTGACAGATGGAGAAAACCACTTTGGAAAAGATACTTCACATGTTCTAGAGACGATTTA
AACCATTAAGCATTCTATGAAGCTTCTGAAGGTCTGTCAGATTTTAAATGACAACAGTGAAATTT
TAAAACAAGAACAGAAGTCAGCACCAAAGCTAGTTTAACATTAATAATAAGTGAGCCAATAAATA
GGTCTATGTTTGCCCAGGCAGGTTTTGCTTATTATGTCAGTTGGAAAGCCAGAAGGAAACTGGTT
TTAACTCTTAATATAACCTGTATCATGACACCATCACTTTACCAGAAATGTAGCTGATGTCAGCA
TAAGACTGAGACAGTTTACATTTAAAACTGTTGTTTCCTTTCCAACTATTTTCATAATTCATTCA
TGGTATAGGATTGAGACTATTTCCTTAAACAGAAAAAAATGGGTAATTAACATTGAGAACTTTCC
ATGTGCCAGATACTGTATGAACTGTCTTAATTTTCATAGCCACCCTGCAAGATATTATCCTCATC
TTTTTAGAGGAAGAAACAAGTTTCAAGAAATGAAGTAGGTTTTCTAAGGCCACAGCTATAGTAAA
GAGGTGGAGCTGACATTCAAGCTTGGATATGAATTATTATAATTTCCACAGCACTACACAGCTGT
CATTTTCTCTACCTGCAAAACTAAATAAATACTGTTAAAAATAAAAGATGATCTCCAAGATCTCT
AAACATTAAAATTTTACAATAAACTGGTTGAGGTGACACATGCCTATATTTTCAGCTACTCAGGA
GTTTGAGACTGGCCTGGACAACATAGCAAGACCCTGTCTCTAAATTTAAAAAACAAATTACAATG
AGATAATCTTAGACCAGAGAAAGGAAAGTGAAATAGCTATTTGGATTATAAACTGTTTTAGTAAC
TCAAATGTAATGTGTGGTGGTGACAATATCTTTGATTCCTGGGAAGGTCATTGTGAAAGGGAATA
GAAAATGCCTTGAAGTCAAAATATAAGGCTCTCAAATAGAAAAATAAATATAACATTTAAGTATT
ATCAACAGAGAACCAAGTTAGAAAAAACTAGTTATAGTCTGAAACAATGCTGTTTAAAAGACTGC
AGTCACCAGTGTAAACTGACTCAGGCAACACTTCCCAGGGTCCATGCCGTGGACAACTGACTAAT
CTCTCTATAAACAATTCTTGACACTAGATAGGCCTTTACTAAGAGCAACCAGAGACAGAAATTAG
TATCGACAGTGGAGTTTTAAAATCACACTTAAAAAAATATTATTGGCTGGGCACAGTGGCTCACG
CCTGTAATCCCAGCACTTTGGGAGGCTGAGGCAGGCAGATCATGAGGTCAGGAGATCAAGGCTAT
CCTGGCCAACATGGTGAAACCCCGTCTCTATTAAAAATACAAAAATTAGCCGGGCGTGGCGGTGA
GTGCCTGTAGCCCCAGCTACTTGGGAGGGTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCGGAG
GCTACAGTAAGCCGAGTTCGTGCCACTGCACTCCAGCCTCGGCGACGGAGCGAGACTCCCTCTCA
AAAAAAGAAAAAAAAAATGTAGATTATATTCTGTGAATATTACATCACAGAATAAAACTCTGGAT
ATAATACATGGGAGAGTTAATATCCAGAAAGACATTGTGCATTTTTGGTCTAAGTTTCATGAGAC
AAAATATTATTTTCTTTTCTGAGACTCAATTCTTTCCCAAAGGGATCAGTTCTCTTAAGTGGACC
TTTTTACAGCCTTTCAGCTGGCTCAAAAGATGAGTTTTGGCGAACAAGATTATCGATACTCACTG
AGCAAGTGGTAGTTAGAATCCCTTTCATATTTGAAGGTCAAACGGCCATAGCTGACATGATTTAG
ATTCTTCAGCCACTCAAAGTAAGATACTGTCACTCCTCCAGCATTCAAGTAGAGATCCTATGCAC
AAAAATAAGACAAAGAAATTAGAAGATGATGGTTTTCGTAAAAGCTGAAAATGAACCTAAGACCT
TAATTTCAATACCAAGGTAGACTGGACTTCAAATATCGCAAATATATTTTAGCCAGTATCAGGAA
TTTCACACTTAATTAACACTCCTTCCCATCCCACCCAATTCCATCTAAGGCTTTTTCTATTTAGG
AAAAAAAAAAATCATTTTTTGGCTTATTAATCAAGGAAAGTTAATAATCTTTGGTTAGAGCCTCT
TCTCTACCAGAAGTTAGTTCTCAGACTAAATGGCTTGCCCACCAACCAGTTGGACTGGACTGTCC
ACAGGGCCTCTCAGAAGACAGGATTCTTTCTCCTATTACCTAAGGGTAGCCCATTTCAGTTACAT
TAAATGTCTTAAGTGCTTTCAGCAAAGGGGGTTCTTTAAAATATATTCCAAGCCCACATTAATTT
CTAGTAACTTTTTGGTGTAGACTCATTTTTACTTGCTAAAAAACCTGAGCACGTGTTCCTCATAT
TAGTTTTCTGAGTAAAGCTGGAAAGGGCACTTGAAATGCATAAGGTTAGGAATCATATAGAAAAT
CTTAAGAGCTTTAGTTAGAATAGTGTTTCTAACACAGTACACATTTATATAACCAGACTCTTAGA
AGGCTAAAGACATTCAGTAAAGAGCCTGAAATTGGAATAAATGTTTCGATCAAAGTGAAAATTAA
CAGGCTTAGAATTAACCATGCTTCTACTATATTTTCTCAAAAGTGAAAAAGATGAATTCACTAGA
GCTTGGAGACTAATAATTCCTCTCTTCCTCCAAATTCCTTGCAAAAGACTATTATGATTCTAAGT
ACATATAAAGCCTAATAAATATAGATGACTTACTGGAATAACCATAATGTTTCTCTCCAGGAAGA
TCTTGTCAGCTTCTGGAGTTGTTGGCCCATTGGCACCTTCAGCAATGATCTGCAAGAGAGTCAGG
AACATAGAGAAATGCGAACACCACCGTCAAATCCCCTCCACTGAGGGCAAGAGATGTGCATATAT
GAACAAGGGGCTGTGGGGAGAAGCACAGTTTCAGTTAAAGTTAAATAGAGGTTATTTTTCTCTGC
CAAGTGTATAAAACTACCTTTCACTTTTCTATTTATCTAGGTTTTTTTTTTGTTTGCTTGTTTTT
TTTTTTTACAGGAGTGTCAGGCAGATGCGTTTGTTTTGGTAATGGTTGCACAACTCTGTGCCGGT
AGCTAAAAGCCATTAAATTATGTACCTTAAATGGGGGAACTGTATGGTATGTGCAGTATGTGCCA
ATAAAGCTGCTAAAAAGAAAGAGAAAACTCAATCAGACTCTTCTATGACCCCCCTAACGTCATTC
ACATTGATAATGTTGGTTCTGGTTTCTATAATGTTGTCACCTTGGCTTTGACTCTGGGTGCGTTG
GATTTGGTCAACTGCTTCTCACTGGCAGCTGGGATCAGTATGTCACAGTCGGCCTCCAAGATGCT
TCCTTCATAGGGCTTTGCCTTGGGGAAGCCCAGAATGGACCCATGTTGCTGCCATTGATTGAAAA
TCACAATTAATAGCTGCACCAGAGTTTTAAATATTTATATTTAGTGTCTATGCTATAAAAATGTA
TTAATACCAATTTGAAGTCTTCCAGTTCCTTTGGGTCAATACCATCTGGATTCCATATACTCCCA
TCAGACTCACCAACAGCAATACATTTAGCACCAAAACGATGTAAATATCTCATAGAGTGCAGGCC
CACATTACCAAATCCCTGTGAAGAACAATTACCCATAACACAAAAATTAAAGTCCTGGTATAGAC
AGCAAGAGTCATATTTTGACCAATGTAAATCCATACTCTGTTTATTTAGAAGCAATTCTCAAAAT
TCTTTTGCCAAAAGAAACAATGTACTAACTGGTTTCTCTTCAACAATAAAATTCTCTGTTTAAGA
ATGTGATGAGCGGGCATGGTGGCTCACGCTTGTAATTCCAGCACTTTGGGAGGCTGAGGCAGGTG
GATCACTTGAGGTCAGGAGTTCGAGATCAGCCATGGCCAACATGGTGAAACCCCGTCTCTACTAA
AAATACAAAAATTAGGCATGGGGTCCGTGCCTGTAATCCCAGCTACTTGGGAGAATGAGGTAGGA
GAATCACTTGAACCTGGGAGGTGGAGGTTGCAGTGAGCCGAGACTGCACTCCAGCCTGGGCAATA
GGGTGAGACTCCATCTCAATCAATCAATAAATGGCAGTGGTGTTAAGTACACCACCACTTTTTGC
TTTTTTTTTTTTTTTTTTTTTGATGGAGTTTTGCTCTTGTTGCCCAGGCTGGAGTGCAATAGCGG
AATCTCGGCTCACCACAATCTCTGCCTCCCAGGTTCAAGCAATTCTCCTGCCTCAGCCTCCTAAG
TAGCTGGGATTACAGGCATGCGCCACCTCGCCTGGCTAATTTTGTATTTTTAGTAGAGACAGGGT
TTCTCCATGTTGGTCAGGCTGGTCTCGAACTCCTGACCTCAGGTGATCCGCCACTTCAGTCTGCT
AAAGTGCTGGGATTACAGCTGTGAGCCACAGTGCCCGGACTTTCTTTTTTTTTTTTTTTTGCCAA
TTTGTATTTTATTTTTGCTAATTTAAAAAATAGTTAATAGAATATCAGAAATACTGAACATTATC
ATTTCCATAAATGCAAGAGTGTATACATTTTCCACACACTGAGTACTAGCTAGATTTTCTATGAT
AAACTCTGACCACTTCTTCAGGCAATTCATGTACTTACTTCAGCATTATCATTAATGATGAAGGT
TCTAGAACCATCATGAACAAGGGTCCCATCTTCACACAAGTTACTTAACTGCTGGGAGGCTCTAT
TTCATCTTATGTAAACTACAGATAATACCTACTCACCTCAAGGGTATATCAAGAGTTTATGTAAG
CTAAGTTTGTAGAAAGTAGTTAGCACAGTGCCAGGAAGGGTCCAAGAAGAAATGGTACTTACTAT
GAAATATTTGTACGTATATATGTATGCATGTTAATGAGCTCTTATTAGCTGTGTTCATTAAAGGT
TTTCTCTATCCTGTGATTTGCTTTTAGATTTTGGAATACATTTCATTGTGCACATTCCATTTGTA
TTATTAATATAACAATATTTATTACTATTATTATTATCATCATCAATTCAATCACATCTACTGTA
TCCCTGATGATGACCATGATTCCTTTAATAATCATAAAACTCTCTTCCCTTCATCACGGGGTAAA
TAACCTATCACAATGCTGTAAGTCTCCATCAGCACCCCAGGCTGCCCCTGCTGACTTACCAGATC
TGCTACTTCAGCCAGATCAATCATCATGTTAATGTCACACCCACATTCAATAATGGTGAGTCGGA
GCTTTTTACCTGTAAGTGAAAAAGATAAAAATTTTACTTTAAAAAGACCCTGAAA
运行:
# time ./1 datafile > report
real 1m14.81s
user 0m50.65s
sys 0m0.13s
# wc -l report
3355 report
下面是 report 文件中的部分内容:
# cat report
------------------- Line# 1 --------------------
TAATCAACTA Length=10 Position=10 Rev_Position=12
TTTATTACTA Length=10 Position=51;9703
GGAGGCTGAG Length=10 Position=330;470;1129;1265;1633;2655;2793;3102;5936;8564
AAAAAAAAAA Length=10 Position=1871;2906;2907;2908;2909;2910;3198;3199;3200;3201;3202;6183;6761;6762 Rev_Position=2906;2907;2908;2909;2910;3198;3199;3200;3201;3202;6183;6761;6762
GTAGTGAGCCGA Length=12 Position=1811;2838
AATAAATAAATA Length=12 Position=1889;1893 Rev_Position=1890;1894
CCACTGCACTCCA Length=13 Position=534;1830;2857;6133
GTTTTTCTTTTTT Length=13 Position=675 Rev_Position=4304
GTAATCCCAGCTAC Length=14 Position=1248;2776;8678
CCTGTAATCCTAGCA Length=15 Position=310;1109
GGATCACTTGAGGTCA Length=16 Position=2671;8580
CGGGCATGGTGGCTCAC Length=17 Position=2617;8526
CACTTGAGGTCAGGAGTT Length=18 Position=2675;8584
CCAGCCTGGCCAACATGGT Length=19 Position=1172;2698;3011
CTTTGGGAGGCTGAGGCAGG Length=20 Position=5931;8559
TTTTTTTTTTTTTTTTTTTT Length=20 Position=8841;8842 Rev_Position=8842
Summary of Line# 1
------------------
String length :: 10 Matched Strings :: 541
String length :: 11 Matched Strings :: 438
String length :: 12 Matched Strings :: 372
String length :: 13 Matched Strings :: 329
String length :: 14 Matched Strings :: 293
String length :: 15 Matched Strings :: 258
String length :: 16 Matched Strings :: 226
String length :: 17 Matched Strings :: 200
String length :: 18 Matched Strings :: 171
String length :: 19 Matched Strings :: 146
String length :: 20 Matched Strings :: 124
String length :: 10 Matched Reverse Strings :: 142
String length :: 11 Matched Reverse Strings :: 69
String length :: 12 Matched Reverse Strings :: 40
String length :: 13 Matched Reverse Strings :: 25
String length :: 14 Matched Reverse Strings :: 15
String length :: 15 Matched Reverse Strings :: 7
String length :: 16 Matched Reverse Strings :: 5
String length :: 17 Matched Reverse Strings :: 2
String length :: 18 Matched Reverse Strings :: 1
String length :: 19 Matched Reverse Strings :: 1
String length :: 20 Matched Reverse Strings :: 1
lightspeed 回复于:2004-12-16 23:39:29
这里是 final version.
1. 增加了二分查找最长匹配串并自动设置 STR_MAX 的代码。
2. 改进了 reverse string 算法, 大大提高了运行效率。
# time ./1 datafile > report1
real 2m57.43s
user 2m29.33s
sys 0m0.08s
$ time ./1 -v r=1 datafile > report2
real 0m53.28s
user 0m46.03s
sys 0m0.03s
#!/bin/awk -f
#
# A script can be used to check any repeat pieces of nucleotide sequences.
#
# Design: lighspeed
# Date: Dec. 16, 2004
#
# Repeat Match Usage:: $0 datafile
# Reverted Repeat Match Usage:: $0 -v r=1 datafile
#
# function is_overlap : check if a string (position=p, length=l) is overlap with
# matched strings which are stored in array record (position=i, length=record)
function is_overlap(p, l) {
e = p + l - 1
for (i in record) {
a = i + record - 1
if (( i >= p && i <= e ) || ( a >= p && a <= e ) || ( p >= i && p <= a ) || ( e >= i && e <= a ))
return 1
}
return 0
}
# flag=0 find the longest matched string and set STR_MAX
# flag=1 find all other matched strings
function find_string(STR_MIN, STR_MAX, flag) {
for (i in record)
delete record
if ( flag == 1 ) {
if ( r == 1 )
print "---------------Reverted Repeat Match Line# "NR" -----------------\n"
else
print "------------------Repeat Match Line# "NR" --------------------\n"
}
for ( Str_Len=STR_MAX; Str_Len >= STR_MIN; Str_Len -- ) {
for ( Position=1; Position <= L - 2 * Str_Len + 1; Position ++ ) {
if ( is_overlap(Position,Str_Len) == 1 )
continue
count=0
pos=Position
offset=Position + Str_Len - 1
left=substr($0,Position,Str_Len)
if (index(left,"A")==0 || index(left,"C")==0 || index(left,"G")==0 || index(left,"T")==0 )
continue
right=substr($0, Position + Str_Len)
# Reverse string left. The start position in reverse string is L -Position - Str_Len + 2
if ( r == 1 ) {
old_left=left
left=substr(REVERSE_STRING, L - Position - Str_Len + 2, Str_Len)
}
while ( Str_Len <= length(right) ) {
i=index(right,left)
if ( i > 0 ) {
if ( flag == 0 )
return 1
j=offset + i
if ( is_overlap(j,Str_Len) == 0 ) {
count ++
record[Position]=Str_Len
record[j]=Str_Len
pos=pos","j
}
right=substr(right, i + Str_Len)
offset+=(i + Str_Len - 1)
}
else
break
}
if (count > 0 && flag == 1) {
match_number[Str_Len] ++
if (r == 1)
print "Reverted Repeat: " old_left",", "Size: "Str_Len",", "Start Positions: "pos
else
print "Repeat: " left",", "Size: "Str_Len",", "Start Positions: "pos
}
}
}
if ( flag == 0 )
return 0
}
{
L=length($0)
if ( r == 1 ) {
REVERSE_STRING=""
# split($0, aaa, "") : Use null string "" to split every characters into array aaa
split($0, aaa, "")
for (i=L; i >= 1; i -- ) {
REVERSE_STRING=REVERSE_STRING""aaa
delete aaa
}
}
STR_MIN=10
STR_MAX=0
# Find max matched string and set STR_MAX by binary search algorithm
low=STR_MIN - 1
high=int(L / 2) + 2
while ( low < high ) {
if ( (high - low) == 1 ) {
if ( low >= STR_MIN )
STR_MAX=low
break
}
mid=int((high + low) / 2)
status=find_string(mid, mid, 0)
if (status)
low=mid
else
high=mid
}
if ( STR_MAX >= STR_MIN )
find_string(STR_MIN, STR_MAX, 1)
}
3. data2 (100000 字符) 的结果
当长度增加到 100000 时, 速度变的很慢, 我只测了正序, 而且没有完全
完成, 但可以估计时间。
由于存在的重复匹配串的最大长度很小, 我采用二分查找首先找到重复匹配串的最大长度
然后再循环至最小长度。 下面结果前面的
9 50002 0
9 25005 0
.
就是显示查找的过程。 找到重复匹配串的最大长度为 43 用时 67 min.
然后, 从 43 循环至 10. 每个长度用时约 5 min
总时间大约为 67 + 34 * 5 = 237 min 也就是大约 4 小时。
9 50002 0
9 25005 0
9 12507 0
9 6258 0
9 3133 0
9 1571 0
9 790 0
9 399 0
9 204 0
9 106 0
9 57 1
33 57 0
33 45 1
39 45 1
42 45 1
43 45 0
Repeat: ATTGGATCATTGATCTAATCCAACCACATAACTATAATTACAG, Size: 43, Start Positions: 25161,25204
Repeat: TCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCG, Size: 41, Start Positions: 26357,75092
Repeat: AGTCTTGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCG, Size: 39, Start Positions: 31148,69805
.
.
产生随机字符串
$ cat rand.pl
#!/usr/bin/perl -w
@array=("A","T","C","G");
for ($a=0;$a<1000;$a++) {
$rand=$array[int rand scalar @array];
$out.=$rand;
}
print "$out\n";
$ perl rand.pl>datafile
$ cat datafile
ATACTCGTTGACTCTACTCATAACAGCATTAAAGACCGTAGGGTAGGCT...
riverfor 回复于:2004-12-18 03:18:35
=======================repeat.py===========================
# !/usr/bin/pyton
# Just enjoy the programming, share your excellent code
# report bugs to email/ msn: [email protected]
#
import string
import commands
import sys
## data for deamon
data = "1234567890112345678901234567890 1234567890989901234567890"
## call this functions to get the real data which was saved inthe filePath
def getSrcDataFromFile(filePath):
data = commands.getoutput("cat %s" % filePath)
##
result = []
## get all substrs used to check whether be fit for
def getAllSubStr(splitCode = " ", minDataLen = 10, maxDataLen = 10000):
dataLen = len(data) + 1
for x in range(dataLen):
for j in range(x, dataLen):
seg, ilen = data[x:j], j - x
# check whether to be fit for these conditions
if seg.count(splitCode) == 0 :
if ilen >= minDataLen and ilen <= maxDataLen:
if result.count(seg) < 1:
result.append(seg)
## process
def process(filePath = ""):
if filePath != "":
getSrcDataFromFile(filePath)
getAllSubStr()
print "sub data\t repeat \t indexs"
for item in result:
counts, itemlen = data.count(item), len(item)
if counts > 1:
indexs, start, seg = [], 0, data
for x in range(counts):
if seg != "":
start += seg.index(item)
indexs.append(start)
start += itemlen
seg = seg[start:]
print item, "\t" ,counts, "\t" ,indexs
if __name__ == "__main__":
print "source data: %s " % data
print "========result:========"
if len(sys.argv) < 2:
process()
else:
proccess(sys.argv[1])
================================================
This script only print the sub data repeats twice or more, which was with the condition of splitCode = " ", minDataLen = 10, maxDataLen = 10000.
Usage: python repeay.py [data file] [| awk / sed ....]
第一次发贴,嘻嘻,没时间写注释啦,但python本来就是如此易读(不要笑话我程序中的英文注解的语法错误哦,因为很多系统的中文支持不好,习惯了).很纳闷为什么cu连php版面都有,却连python的论坛都没有呢,我用过php, java, c, python. 我觉得python绝对是值得我们关注的一门语言,它可以用做shell(远比sed awk易读), 网站脚本(mod_py), application语言(网络的twisted框架, tk的GUI, 各种数据库的支持)......
用awk写了一下第一题,
用了最笨得办法,效率超低。处理这个datafile用了10s,而且只适用处理短一些的串。。。
BEGIN {
_MIN_LEN=10;
}
function find_max_str()
{
for(i=length($0)/2;i>=_MIN_LEN;i--) {
for(j=0;j
if((index(k," ")>0)||!((index(k,"A")>0 && index(k,"C")>0 && index(k,"G")>0 && index(k,"T")>0))) {
continue;
}
if((a[k]++)==2) {
printf("%d:%d:%d:%d:%s\n",NR,b[k],j,length(k),k);
return 0;
}
else {
b[k]=j;
}
}
for(idx1 in a) {
delete a[idx1];
}
for(idx2 in b) {
delete b[idx2];
}
}
return 1;
}
{
find_max_str();
}
zhl1979 回复于:2007-03-27 19:16:25
awk '{ for (l=10;l<50;l++){for(i=1;i<(10001-l);i++){a=substr($0,i,l) ;hash[a]++ ; t=hashb[a]; t=i" "t; hashb[a]=t }}}END{for(a in hash ){ if (hash[a]>1){ print a ,hash[a], hashb[a] }}}