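Once a column of a pandas DataFrame is accessed through the .str accessor, the usual Python string methods can be applied to every element of that column.
First, build a DataFrame: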
import pandas as pd
d={'gene':{'a':'gene1','b':'gene2','c':'gene3','d':'gene4'},'expression':{'a':'low:0','b':'mid:3','c':'mid:4','d':'high:9'},'description':{'a':'transposon element','b':'nuclear genes','c':'retrotransposon','d':'unknown'}}
df=pd.DataFrame(d)
print(df)
gene expression description
a gene1 low:0 transposon element
b gene2 mid:3 nuclear genes
c gene3 mid:4 retrotransposon
d gene4 high:9 unknown
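As a small sketch of how the nested dict above is interpreted: the outer keys become column names and the inner keys become the row index. The two-row dict below is a shortened, illustrative copy of the data, not a new dataset.

```python
import pandas as pd

# Shortened illustrative copy of the dict used above
d = {
    'gene': {'a': 'gene1', 'b': 'gene2'},
    'expression': {'a': 'low:0', 'b': 'mid:3'},
}
df = pd.DataFrame(d)

print(df.columns.tolist())          # outer keys -> columns
print(df.index.tolist())            # inner keys -> index labels
print(df.loc['b', 'expression'])    # look up a single cell by label
```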
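str.contains returns a boolean mask, which can be used to keep only the rows whose description contains 'transposon':
df1=df[df['description'].str.contains('transposon')]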
print(df1)
gene expression description
a gene1 low:0 transposon element
c gene3 mid:4 retrotransposon
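str.contains also takes a few optional parameters that are often useful in practice. The sketch below uses a made-up Series (mixed case and one missing value, neither of which appears in the data above) to show case, na and regex.

```python
import pandas as pd

# Hypothetical data: mixed case and one missing value
desc = pd.Series(
    ['Transposon element', 'nuclear genes', None, 'retrotransposon'],
    index=['a', 'b', 'c', 'd'],
)

# case=False ignores case; na=False treats missing values as "no match"
mask = desc.str.contains('transposon', case=False, na=False)
print(mask.tolist())

# regex=False matches the pattern literally, which matters for characters like '.'
literal = pd.Series(['a.b', 'axb']).str.contains('.', regex=False)
print(literal.tolist())
```

Without na=False, the missing value would produce a missing entry in the mask, which cannot be used directly for boolean indexing.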
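str.split splits each string on a separator; with expand=True the pieces are returned as the columns of a new DataFrame:
df1=df['expression'].str.split(':',expand=True)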
print(df1)
0 1
a low 0
b mid 3
c mid 4
d high 9
These two columns can also be added to df directly:
df[['exp1','exp2']]=df['expression'].str.split(':',expand=True)
print(df)
gene expression description exp1 exp2
a gene1 low:0 transposon element low 0
b gene2 mid:3 nuclear genes mid 3
c gene3 mid:4 retrotransposon mid 4
d gene4 high:9 unknown high 9
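Assigning the split result to exactly two columns only works if every value splits into two pieces. As a hedged sketch, the n parameter of str.split can cap the number of splits; the value 'mid:3:extra' below is invented for illustration.

```python
import pandas as pd

# Hypothetical values: the second one contains the separator twice
s = pd.Series(['low:0', 'mid:3:extra'], index=['a', 'b'])

# Without n, the result has as many columns as the longest split
unlimited = s.str.split(':', expand=True)
print(unlimited.shape)

# n=1 splits at the first ':' only, so there are always at most two columns
parts = s.str.split(':', n=1, expand=True)
print(parts)
```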
Note 1:
The exp2 column produced by the split has dtype object (its values are Python str), not int. It can be converted to int with astype:
print(df['exp2'].dtype)
object
df['exp2']=df['exp2'].astype(int)#the width of int is platform dependent (int32 or int64)
print(df['exp2'].dtype)
int32
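astype(int) raises an error if any value cannot be parsed as an integer. One possible alternative, sketched here with an invented 'n/a' entry, is pd.to_numeric with errors='coerce', which turns unparseable values into NaN (and therefore yields a float column).

```python
import pandas as pd

# Hypothetical column with one value that is not a number
raw = pd.Series(['0', '3', 'n/a', '9'])

# errors='coerce' replaces unparseable entries with NaN instead of raising
nums = pd.to_numeric(raw, errors='coerce')
print(nums)
print(nums.dtype)
```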
Note 2:
Without expand=True, the result has a single column: it is a Series whose elements are lists.
df1=df['expression'].str.split(':')
print(df1)
a [low, 0]
b [mid, 3]
c [mid, 4]
d [high, 9]
Name: expression, dtype: object
print(type(df1))
<class 'pandas.core.series.Series'>
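A Series of lists like this one can still be indexed element-wise through .str. The following is a minimal sketch using the same expression values; the variable names level and value are chosen here for illustration.

```python
import pandas as pd

expr = pd.Series(['low:0', 'mid:3', 'mid:4', 'high:9'], index=list('abcd'))
parts = expr.str.split(':')          # Series whose elements are lists

# .str[i] and .str.get(i) pick the i-th item of each list
level = parts.str[0]
value = parts.str.get(1).astype(int)

print(level.tolist())
print(value.tolist())
```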
print(df)
gene expression description exp1 exp2
a gene1 low:0 transposon element low 0
b gene2 mid:3 nuclear genes mid 3
c gene3 mid:4 retrotransposon mid 4
d gene4 high:9 unknown high 9
str.replace substitutes a substring in every element of the column:
df['gene']=df['gene'].str.replace('gene','Gene')
print(df)
gene expression description exp1 exp2
a Gene1 low:0 transposon element low 0
b Gene2 mid:3 nuclear genes mid 3
c Gene3 mid:4 retrotransposon mid 4
d Gene4 high:9 unknown high 9
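str.replace can also work with regular expressions. The default of the regex parameter differs between pandas versions (it is False from pandas 2.0 onward), so passing it explicitly is the safer habit. The sketch below, with an invented 'gene10' entry, uses a capture group to keep the number while changing the prefix.

```python
import pandas as pd

genes = pd.Series(['gene1', 'gene2', 'gene10'])  # 'gene10' is a made-up extra value

# Literal replacement: regex=False treats 'gene' as plain text
literal = genes.str.replace('gene', 'Gene', regex=False)

# Regex replacement: \1 refers to the digits captured by (\d+)
renamed = genes.str.replace(r'^gene(\d+)$', r'G\1', regex=True)

print(literal.tolist())
print(renamed.tolist())
```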
str.startswith selects the rows whose value begins with a given prefix:
df1=df[df['expression'].str.startswith('m')]
print(df1)
gene expression description exp1 exp2
b Gene2 mid:3 nuclear genes mid 3
c Gene3 mid:4 retrotransposon mid 4
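Because startswith and endswith return boolean masks, they can be negated with ~ and combined with | or &. A short sketch on the same expression values:

```python
import pandas as pd

expr = pd.Series(['low:0', 'mid:3', 'mid:4', 'high:9'], index=list('abcd'))

# ~ inverts the mask: everything that does NOT start with 'mid'
not_mid = expr[~expr.str.startswith('mid')]

# endswith works the same way on the end of each string
ends_nine = expr[expr.str.endswith('9')]

# Masks combine element-wise with | (or) and & (and)
low_or_high = expr[expr.str.startswith('l') | expr.str.startswith('h')]

print(not_mid.index.tolist())
print(ends_nine.index.tolist())
print(low_or_high.index.tolist())
```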
str.findall returns, for each element, the list of all matches of a regular expression:
s=df['expression'].str.findall('[a-z]+')
print(s)
a [low]
b [mid]
c [mid]
d [high]
Name: expression, dtype: object
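When only the first match is needed, str.extract may be more convenient than findall, since it returns plain strings instead of lists. This is an alternative sketch, not the method used above; the group names level and value are chosen here for illustration.

```python
import pandas as pd

expr = pd.Series(['low:0', 'mid:3', 'mid:4', 'high:9'], index=list('abcd'))

# One capture group with expand=False gives a Series of strings
level = expr.str.extract(r'([a-z]+)', expand=False)
print(level.tolist())

# Named groups give a DataFrame with one column per group
both = expr.str.extract(r'(?P<level>[a-z]+):(?P<value>\d+)')
print(both)
```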
print(df1)
gene expression description exp1 exp2
b Gene2 mid:3 nuclear genes mid 3
c Gene3 mid:4 retrotransposon mid 4
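str.lstrip removes leading characters. Its argument is a set of characters to strip, not a literal prefix; it works here only because the digits are not in that set:
df1=df1.copy()#work on a copy of the slice to avoid SettingWithCopyWarning
df1['expression']=df1['expression'].str.lstrip('mid:')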
print(df1)
gene expression description exp1 exp2
b Gene2 3 nuclear genes mid 3
c Gene3 4 retrotransposon mid 4
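Since lstrip treats its argument as a set of characters, it can remove more than intended. The sketch below uses an invented value 'mid:dim' to show the difference from removing the literal prefix with an anchored regular expression.

```python
import pandas as pd

# Hypothetical values; 'mid:dim' consists only of characters from the set 'mid:'
s = pd.Series(['mid:3', 'mid:dim'])

# lstrip strips any leading run of the characters m, i, d and ':'
stripped = s.str.lstrip('mid:')
print(stripped.tolist())

# An anchored regex removes only the literal prefix 'mid:'
prefix_removed = s.str.replace(r'^mid:', '', regex=True)
print(prefix_removed.tolist())
```

With lstrip the second value is stripped down to an empty string, while the anchored replacement keeps 'dim'.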