Reading FASTA files with Python

Format of the AMPs.fasta file:

>AP00001 |antibacterial |anticancer/tumor |antifungal
GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV

>AP00004 |antibacterial |antifungal
NLCERASLTWTGNCGNTGHCDTQCRNWESAKHGACHKRGNWKCFCYFDC

>AP00005 |antibacterial
VFIDILDKVENAIHNAAQVGIGFAKPFEKLINPK

>AP00006 |antibacterial
GNNRPVYIPQPRPPHPRI

>AP00008 |antibacterial
RLCRIVVIRVCR

>AP00020 |antibacterial |anticancer/tumor |antifungal
GLFDIVKKIAGHIAGSI

>AP00026 |antibacterial |anticancer/tumor |antifungal |anti-HIV |antiviral
FKCRRWQWRMKKLGAPSITCVRRAF

>AP00027 |antibacterial
ITPATPFTPAIITEITAAVIA

>AP00035 |antibacterial |anticancer/tumor
KSSAYSLQMGATAIKQVKKLFKKWGW

>AP00050 |antibacterial
GIGASILSAGKSALKGLAKGLAEHFAN
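
Each record in the listing above has a header line starting with ">" that holds an identifier followed by "|"-separated activity labels, then the sequence on the next line. A minimal sketch of splitting such records into their parts might look like the following. The helper names parse_fasta_records and split_header are hypothetical and not part of the original script, and the sketch assumes every header follows the "ID |label |label" layout shown here.

```python
import io


def split_header(header):
    # "AP00001 |antibacterial |antifungal" -> ("AP00001", ["antibacterial", "antifungal"])
    fields = [field.strip() for field in header.split('|')]
    return fields[0], fields[1:]


def parse_fasta_records(handle):
    """Yield (identifier, labels, sequence) tuples from a FASTA-style text handle."""
    header, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # blank lines only separate records in this layout
        if line.startswith('>'):
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        # Emit the last record, which has no header after it.
        yield split_header(header) + ("".join(chunks),)


# Two records copied from the listing above, used as in-memory test data.
example = """>AP00001 |antibacterial |anticancer/tumor |antifungal
GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV

>AP00008 |antibacterial
RLCRIVVIRVCR
"""

records = list(parse_fasta_records(io.StringIO(example)))
for identifier, labels, sequence in records:
    print(identifier, labels, len(sequence))
```

Keeping the labels alongside each sequence is optional for the length statistics, but it would make it possible to group lengths by activity later.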

 

Task: read the sequences and compute the distribution of sequence lengths

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.set_style('dark')

train_pos_data_path = r"./AMPs.fasta"
train_neg_data_path = r"./nonAMPs.fasta"
test_pos_data_path = r'./amp920.fasta'
test_neg_data_path = r'./nonamp920.fasta'

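def read_from_file_with_enter(filename):
    # For FASTA files whose records are separated by blank lines.
    sample = ""
    samples = []
    with open(filename, 'r') as fr:
        for line in fr:
            line = line.strip()
            if line.startswith('>') or not line:
                # A header or a blank line ends the current record.
                if sample != "":
                    samples.append(sample)
                    sample = ""
                continue
            sample += line
    # Keep the last record if the file does not end with a blank line.
    if sample != "":
        samples.append(sample)
    return samples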

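def read_from_file(filename):
    # For FASTA files without blank lines between records.
    sample = ""
    samples = []
    with open(filename, 'r') as fr:
        for line in fr:
            if line.startswith('>'):
                if sample != "":
                    samples.append(sample)
                    sample = ""
                continue
            # strip() removes the newline without cutting a final character
            sample += line.strip()
    # Append the final record, which is not followed by another header.
    if sample != "":
        samples.append(sample)
    return samples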

def statistics_length(samples):
    lengths = []
    for sample in samples:
        lengths.append(len(sample))
    return pd.DataFrame(lengths,columns=['Length'])


train_pos_data = read_from_file_with_enter(train_pos_data_path)
train_neg_data = read_from_file(train_neg_data_path)
test_pos_data = read_from_file(test_pos_data_path)
test_neg_data = read_from_file(test_neg_data_path)


train_pos_len = statistics_length(train_pos_data)
train_neg_len = statistics_length(train_neg_data)

plt.figure(figsize=(6,4))
# histplot replaces distplot, which is deprecated since seaborn 0.11
g = sns.histplot(train_pos_len.Length,label="Train_Pos",bins=20)
g = sns.histplot(train_neg_len.Length,label="Train_Neg",bins=20)
plt.ylabel("Number")
plt.title("Length of sample")
plt.legend(loc='best')

plt.figure(figsize=(6,4))
# fill=True replaces the deprecated shade=True
g = sns.kdeplot(data=train_pos_len.Length,fill=True)
g = sns.kdeplot(data=train_neg_len.Length,fill=True)
plt.title("Kernel density estimation")

# plt.show() does not take an axes argument; it displays all open figures
plt.show()
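
The plots show the shape of the length distribution, but a numeric summary can be handy for checking them. The following is a minimal sketch using only the standard library. The function name length_summary and the bin_width parameter are assumptions for illustration and do not appear in the original script, and the four sequences are copied from the AMPs.fasta excerpt.

```python
from collections import Counter
import statistics


def length_summary(samples, bin_width=10):
    """Summarise sequence lengths: basic statistics plus counts per length bin."""
    lengths = [len(sample) for sample in samples]
    # Map each length to the lower edge of its bin, e.g. 18 -> 10 for bin_width=10.
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "bins": dict(sorted(bins.items())),
    }


samples = [
    "GNNRPVYIPQPRPPHPRI",
    "RLCRIVVIRVCR",
    "GLFDIVKKIAGHIAGSI",
    "FKCRRWQWRMKKLGAPSITCVRRAF",
]
summary = length_summary(samples)
print(summary)
```

The same function could be applied to the lists returned by read_from_file to compare the positive and negative sets without plotting.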

 

Results

Histogram:

[Figure 1: histogram of sequence lengths]

 

Kernel density estimate plot:

[Figure 2: kernel density estimate of sequence lengths]
