When working with DNA or protein sequences, the sequences usually have to be converted into numbers before they can be assembled into matrices and fed to a model for training. In general, there are three common ways to encode a sequence: ordinal encoding, one-hot encoding, and k-mer encoding.
The first method, ordinal encoding, maps each base to a concrete number, for example encoding ACGT as [0.25, 0.5, 0.75, 1.0] and any other character, such as N, as 0.
First, process the raw sequence string into a character array:
# function to convert a DNA sequence string to a numpy array
# converts to lower case, changes any non-'acgt' character to 'z'
import numpy as np
import re
def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array
# create a label encoder with the 'acgtz' alphabet
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','c','g','t','z']))
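Because LabelEncoder orders its classes alphabetically, the five characters map to the integers 0 through 4 (a=0, c=1, g=2, t=3, z=4); the ordinal encoder below relies on this ordering. A quick check:
# sanity check: classes are assigned alphabetically
print(label_encoder.transform(np.array(['a', 'c', 'g', 't', 'z'])))  # [0 1 2 3 4]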
Define the numeric values used for the encoding:
# function to encode a DNA sequence string as an ordinal vector
# returns a numpy vector with a=0.25, c=0.50, g=0.75, t=1.00, n=0.00
def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded
Test it:
>>> test_sequence = 'AACGCGCTTNN'
>>> ordinal_encoder(string_to_array(test_sequence))
array([0.25, 0.25, 0.5 , 0.75, 0.5 , 0.75, 0.5 , 1. , 1. , 0. , 0. ])
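Once each sequence is a numeric vector, sequences of equal length can be stacked into a single matrix for model input. A minimal sketch, using a few made-up, equal-length sequences purely for illustration:
# stack ordinal-encoded sequences of equal length into one training matrix
# (these example sequences are invented for illustration only)
seqs = ['AACGCGCTT', 'TTGCACGTA', 'CGNNTACGA']
X_ordinal = np.vstack([ordinal_encoder(string_to_array(s)) for s in seqs])
print(X_ordinal.shape)  # (3, 9): three sequences, nine positions each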
Researchers Allen Chieng Hoon Choong and Nung Kion Lee found that this method trains fairly well; see their paper "Evaluation of Convolutionary Neural Networks Modeling of DNA Sequences using Ordinal versus one-hot Encoding Method".
The second method, one-hot encoding, represents the bases A, C, G, T as [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1] respectively, so a 1000 bp sequence becomes a 1000 x 4 matrix.
# function to one-hot encode a DNA sequence string
# non-'acgt' bases (n -> z) come out as [0, 0, 0, 0]
# returns an L x 4 numpy array
from sklearn.preprocessing import OneHotEncoder
def one_hot_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    # five fixed categories (a, c, g, t, z); sparse_output requires scikit-learn >= 1.2
    onehot_encoder = OneHotEncoder(categories=[np.arange(5)], sparse_output=False, dtype=int)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    # drop the last column so 'z' encodes as all zeros
    onehot_encoded = np.delete(onehot_encoded, -1, 1)
    return onehot_encoded
Test it on a sequence:
>>> test_sequence = 'AACGCGGTTNN'
>>> one_hot_encoder(string_to_array(test_sequence))
array([[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 1, 0],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 0],
[0, 0, 0, 0]], dtype=int64)
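If the installed scikit-learn version complains about the OneHotEncoder arguments, the same L x 4 matrix can be built with plain numpy. A minimal alternative sketch (one_hot_encoder_np is a name introduced here, not part of the code above):
# version-independent alternative: index rows of an identity matrix by the integer
# labels, then drop the fifth column so 'z' (any non-acgt character) becomes all zeros
def one_hot_encoder_np(my_array):
    integer_encoded = label_encoder.transform(my_array)
    onehot = np.eye(5, dtype=int)[integer_encoded]
    return onehot[:, :4]
Applied to the test sequence above, it returns the same matrix.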
The third method, k-mer encoding, slides along the sequence and cuts it into overlapping pieces of a fixed length (k-mers), which are then treated the way words are treated in natural language processing. For example, breaking 'ATGCATGCA' into 6 bp pieces gives 'ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA'.
First, generate the k-mers:
def getKmers(sequence, size):
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
# test the function
>>> mySeq = 'CATGGCCATCCCCCCCCGAGCGGGGGGGGGG'
>>> getKmers(mySeq, size=6)
['catggc',
'atggcc',
'tggcca',
'ggccat',
'gccatc',
'ccatcc',
'catccc',
'atcccc',
'tccccc',
'cccccc',
'cccccc',
'cccccc',
'cccccg',
'ccccga',
'cccgag',
'ccgagc',
'cgagcg',
'gagcgg',
'agcggg',
'gcgggg',
'cggggg',
'gggggg',
'gggggg',
'gggggg',
'gggggg',
'gggggg']
Join the k-mers into a sentence:
>>> words = getKmers(mySeq, size=6)
>>> sentence = ' '.join(words)
>>> sentence
'catggc atggcc tggcca ggccat gccatc ccatcc catccc atcccc tccccc cccccc cccccc cccccc cccccg ccccga cccgag ccgagc cgagcg gagcgg agcggg gcgggg cggggg gggggg gggggg gggggg gggggg gggggg'
>>> mySeq2 = 'GATGGCCATCCCCGCCCGAGCGGGGGGGG'
>>> mySeq3 = 'CATGGCCATCCCCGCCCGAGCGGGCGGGG'
>>> sentence2 = ' '.join(getKmers(mySeq2, size=6))
>>> sentence3 = ' '.join(getKmers(mySeq3, size=6))
Encode the sentences into a bag of words:
# Creating the Bag of Words model
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer()
>>> X = cv.fit_transform([sentence, sentence2, sentence3]).toarray()
>>> X
array([[1, 1, 1, 1, 1, 1, 3, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
0, 1, 1, 0, 0, 5, 1, 0, 1],
[1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
0, 1, 1, 0, 0, 3, 0, 1, 1],
[1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
1, 1, 1, 1, 1, 0, 0, 1, 1]], dtype=int64)