Calculating amino acid (aa) frequencies in a sequence

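  I have actually been reading about Python syntax for quite a while without ever putting it into practice. After today's class on multiple sequence alignment I suddenly wanted to try Python on a real task, and I was surprised by how quickly I got going.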

  First, make up a string that looks like a protein sequence

>>> proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

Indexing in Python

>>> 'MEafw' [1]

'E'

>>> 'MEafw' [0]

'M'

>>> 'MEafw' [-1]

'w'

>>> proseq [1]

'W'

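Note that Python counts from 0, so index 0 refers to the first character. A negative index counts from the end of the string, so -1 is the last character.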
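A minimal sketch of the indexing rules just described. The string `seq` is a made-up example, not a real protein, and the variable names are only for illustration.

```python
# Hypothetical example string, chosen only to illustrate indexing.
seq = 'MEAFW'

first = seq[0]                 # index 0 is the first character
last = seq[-1]                 # index -1 is the last character
same_last = seq[len(seq) - 1]  # the equivalent non-negative index

print(first, last, same_last)

# An index at or past the end raises IndexError instead of returning nothing.
try:
    seq[len(seq)]
except IndexError as err:
    print('IndexError:', err)
```

In other words, a string of length n has valid indices 0 to n-1, or equivalently -n to -1.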

Slicing

>>> proseq [0:3]

'AWJ'

>>> proseq [3:]

'FBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

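Slicing works like indexing, except that it extracts a range of characters. The ":" separates the start and end positions, and the end position itself is not included, which is why [0:3] returns three characters. If the end position is omitted, the slice runs to the end of the string.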
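A few more slices of the same `proseq` string, as a sketch of how the start and end positions behave. The names `head`, `middle` and `tail` are just illustrative.

```python
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

head = proseq[:3]     # start omitted: from the beginning up to, not including, index 3
middle = proseq[3:6]  # indices 3, 4 and 5
tail = proseq[-3:]    # negative start: the last three characters

print(head, middle, tail)

# A slice [a:b] has length b - a, and the two halves of a split rejoin exactly.
print(len(proseq[3:6]) == 6 - 3)
print(proseq[:3] + proseq[3:] == proseq)
```

Because the end position is excluded, `proseq[:n]` and `proseq[n:]` never overlap, which makes it easy to cut a sequence at any position.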

String operations

>>> 'pro' * 2

'propro'

>>> 'pro' + 'pro'

'propro'

Interestingly, operators in Python are not limited to numbers: a string can be repeated with * and two strings can be joined with +.

>>> len(proseq)

49

The len() function returns the length of a string.

>>> proseq.count('A')

7

The .count() method counts how many times a given character (or substring) occurs in the string.

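Why does len() come before its argument, while .count() comes after the string?

The difference is whether it is a built-in function or not. len() is a built-in function that works on many kinds of objects, whereas count() is a method that belongs to particular data types, such as strings.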
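To make the function/method distinction concrete, here is a small sketch. The list `residues` and the example dictionary are made up for illustration.

```python
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'
residues = list(proseq)   # the same data as a list of one-character strings

# len() is a built-in function: one name that works on many container types.
print(len(proseq), len(residues), len({'A': 7, 'K': 7}))

# count() is a method: str and list each define their own version of it.
print(proseq.count('A'), residues.count('A'))

# A dict has no count() method, so the attribute lookup fails.
print(hasattr({'A': 7}, 'count'))
```

The method is looked up on the object itself, so whether `.count()` exists depends on the type of the value to the left of the dot.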

for loops

Basic syntax:

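for variable in sequence:

    statements

You have to know this syntax, or nothing else will work.

Here sequence can be a string or any collection of objects, and variable is the name that receives each element in turn: the first pass through the loop gets the first element, the next pass gets the second, and so on. The loop body is marked by indenting it four spaces, and the loop exits once the body has run for the last element.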

>>> for amino_acid in 'ABCDEFGHIKLMNPQRSTVWY':

...    number = proseq.count(amino_acid)

...    print(amino_acid , number)

...

A 7

B 3

C 0

D 3

E 0

F 6

G 0

H 1

I 0

K 7

L 2

M 2

N 4

P 0

Q 2

R 0

S 2

T 0

V 0

W 2

Y 0

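Note that in Python 3 print is a function, so it must be called as print(). It also accepts several comma-separated arguments and prints them on one line, separated by spaces.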
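The loop above scans the whole string once per letter. As an alternative sketch (not the approach used in this post), `collections.Counter` from the standard library tallies every character in a single pass:

```python
from collections import Counter

proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

counts = Counter(proseq)   # maps each character to its number of occurrences

# Only characters that actually occur are stored.
for letter in sorted(counts):
    print(letter, counts[letter])

# Looking up a letter that never occurs gives 0 instead of raising an error.
print('C', counts['C'])
```

A side effect of counting every character is that letters missing from the loop string also show up: the made-up `proseq` contains several J characters, which the loop never reported.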


Practice: which amino acid occurs most often in telomerase reverse transcriptase?

>>> telomerase = '''MSITDLSPTLGILRSLYPHVQVLVDFADDIVFREGHKATLIEESDTSHFKSFVRGIFVCF

... HKELQQVPSCNQICTLPELLAFVLNSVKRKRKRNVLAHGYNFQSLAQEERDADQFKLQGD

... VTQSAAYVHGSDLWRKVSMRLGTDITRYLFESCSVFVAVPPSCLFQVCGIPIYDCFSLAT

... ASLGFSLQSRGCRERCLGVNSMKRRAFNVKRYLRKRKTETDQKDEARVCSGKRRRVMEED

... KVSCETMQDGESGKTTLVQKQPGSKKRSEMEATLLPLEGGPSWRSGTFPPLPPSQSFMRT

... LGFLYGGRGMRSFLLNRKKKTAEGFRKIQGRDLIRIVFFEGVLYLNGLERKPKKLPRRFF

... NMVPLFSQLLRQHRRCPYSRLLQKTCPLVGIKDAGQAELSSFLPQHCGSHRVYLFVRECL

... LAVIPQELWGSEHNRLLYFARVRFFLRSGKFERLSVAELMWKIKVNNCDWLKISKTGRVP

... PSELSYRTQILGQFLAWLLDGFVVGLVRACFYATESMGQKNAIRFYRQEVWAKLQDLAFR

... SHISKGQMVELTPDQVAALPKSTIISRLRFIPKTDGMRPITRVIGADAKTRLYQSHVRDL

... LDMLRACVCSTPSLLGSTVWGMTDIHKVLSSIAPAQKEKPQPLYFVKMDVSGAYESLPHN

... KLIEVINQVLTPVLNEVFTIRRFAKIWADSHEGLKKAFIRQADFLEANMGSINMKQFLTS

... LQKKGKLHHSVLVEQIFSSDLEGKDALQFFTQILKGSVIQFGKKTYRQCQGVPQGSAVSS

... VLCCLCYGHMENVLFKDIINKKSCLMRLVDDFLLITPNLHDAQTFLKILLAGVPQYGLVV

... NPQKVVVNFEDYGSTDSCPGLRVLPLRCLFPWCGLLLDTHTLDIYKDYSSYADLSLRYSL

... TLGSCHSAGHQMKRKLMGILRLKCHALFLDLKTNSLEAIYKNIYKLLLLHALRFHVCAQS

... LPFGQSVAKNPAYFLLMIWDMVEYTNYLIRLSNNGLISGSTSQTGSVQYEAVELLFCLSF

... LLVLSKHRRLYKDLLLHLHKRKRRLEQCLGDLRLARVRQAANPRNPLDFLAIKT'''

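(Note that the triple quotes ''' ... ''' are needed here: a string in ordinary single quotes cannot span this many lines unless every line ends with a backslash.)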

>>> telomerase

'MSITDLSPTLGILRSLYPHVQVLVDFADDIVFREGHKATLIEESDTSHFKSFVRGIFVCF\nHKELQQVPSCNQICTLPELLAFVLNSVKRKRKRNVLAHGYNFQSLAQEERDADQFKLQGD\nVTQSAAYVHGSDLWRKVSMRLGTDITRYLFESCSVFVAVPPSCLFQVCGIPIYDCFSLAT\nASLGFSLQSRGCRERCLGVNSMKRRAFNVKRYLRKRKTETDQKDEARVCSGKRRRVMEED\nKVSCETMQDGESGKTTLVQKQPGSKKRSEMEATLLPLEGGPSWRSGTFPPLPPSQSFMRT\nLGFLYGGRGMRSFLLNRKKKTAEGFRKIQGRDLIRIVFFEGVLYLNGLERKPKKLPRRFF\nNMVPLFSQLLRQHRRCPYSRLLQKTCPLVGIKDAGQAELSSFLPQHCGSHRVYLFVRECL\nLAVIPQELWGSEHNRLLYFARVRFFLRSGKFERLSVAELMWKIKVNNCDWLKISKTGRVP\nPSELSYRTQILGQFLAWLLDGFVVGLVRACFYATESMGQKNAIRFYRQEVWAKLQDLAFR\nSHISKGQMVELTPDQVAALPKSTIISRLRFIPKTDGMRPITRVIGADAKTRLYQSHVRDL\nLDMLRACVCSTPSLLGSTVWGMTDIHKVLSSIAPAQKEKPQPLYFVKMDVSGAYESLPHN\nKLIEVINQVLTPVLNEVFTIRRFAKIWADSHEGLKKAFIRQADFLEANMGSINMKQFLTS\nLQKKGKLHHSVLVEQIFSSDLEGKDALQFFTQILKGSVIQFGKKTYRQCQGVPQGSAVSS\nVLCCLCYGHMENVLFKDIINKKSCLMRLVDDFLLITPNLHDAQTFLKILLAGVPQYGLVV\nNPQKVVVNFEDYGSTDSCPGLRVLPLRCLFPWCGLLLDTHTLDIYKDYSSYADLSLRYSL\nTLGSCHSAGHQMKRKLMGILRLKCHALFLDLKTNSLEAIYKNIYKLLLLHALRFHVCAQS\nLPFGQSVAKNPAYFLLMIWDMVEYTNYLIRLSNNGLISGSTSQTGSVQYEAVELLFCLSF\nLLVLSKHRRLYKDLLLHLHKRKRRLEQCLGDLRLARVRQAANPRNPLDFLAIKT'
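The \n in the echoed value are newline characters, which a triple-quoted string keeps. Counting a single letter is unaffected, but len() includes them. A short sketch using only the first two lines of the sequence; the names `fragment` and `clean` are illustrative:

```python
# First two lines of the sequence above, written the same way (triple quotes).
fragment = '''MSITDLSPTLGILRSLYPHVQVLVDFADDIVFREGHKATLIEESDTSHFKSFVRGIFVCF
HKELQQVPSCNQICTLPELLAFVLNSVKRKRKRNVLAHGYNFQSLAQEERDADQFKLQGD'''

print(len(fragment))                # counts the newline as a character too

clean = fragment.replace('\n', '')  # drop the newlines, keep the residues
print(len(clean))

# Per-letter counts are the same with or without the newlines.
print(fragment.count('L') == clean.count('L'))
```

Stripping the newlines first matters as soon as the sequence length is used, for example when turning counts into fractions.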

>>>

>>> for amino in "ABCDEFGHIKLMNPQRSTVWY":

...    number = telomerase.count(amino)

...    print(amino , number) 

...

A 56

B 0

C 32

D 46

E 46

F 59

G 64

H 28

I 47

K 73

L 146

M 24

N 31

P 44

Q 54

R 79

S 83

T 46

V 73

W 11

Y 32

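OK, now it is obvious that L (Leu, leucine) is the most frequent, and B is the least frequent, with none at all. But something suddenly looks off, so let's take a look with len():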

>>> len('ABCDEFGHIKLMNPQRSTVWY')

21

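Now it makes sense: the loop string has 21 letters because B is not one of the 20 standard amino acids (it is the ambiguity code for Asp or Asn). Next time I get the one-letter aa codes wrong, I am sending myself to the corner to think it over!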
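With B removed, the loop string becomes the 20 standard one-letter codes. The sketch below turns counts into frequencies with that corrected alphabet. The function name `aa_frequencies` and the constant `AMINO_ACIDS` are hypothetical, and the input is only the first line of the telomerase sequence, so the result says nothing about the full protein.

```python
# The 20 standard amino acids in one-letter code (no B).
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

def aa_frequencies(seq):
    """Return {letter: fraction} over the 20 standard amino acids in seq."""
    seq = seq.replace('\n', '').upper()   # drop newlines from triple quotes
    counts = {aa: seq.count(aa) for aa in AMINO_ACIDS}
    total = sum(counts.values())
    if total == 0:                        # avoid dividing by zero on empty input
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: n / total for aa, n in counts.items()}

# First line of the telomerase sequence, used as a small test input.
fragment = 'MSITDLSPTLGILRSLYPHVQVLVDFADDIVFREGHKATLIEESDTSHFKSFVRGIFVCF'

freqs = aa_frequencies(fragment)

# max() returns the first letter in alphabet order if several letters tie.
most_common = max(freqs, key=freqs.get)
print(most_common, round(freqs[most_common], 3))

# The three highest frequencies, largest first.
for aa in sorted(freqs, key=freqs.get, reverse=True)[:3]:
    print(aa, round(freqs[aa], 3))
```

Passing the full `telomerase` string to `aa_frequencies` would give fractions for the whole protein.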
