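Using pandas str methods

The pandas str accessor

Once you access .str on a DataFrame column, most of the common Python string methods become available and are applied to every element of that column.
First, build a DataFrame: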

import pandas as pd
d={'gene':{'a':'gene1','b':'gene2','c':'gene3','d':'gene4'},'expression':{'a':'low:0','b':'mid:3','c':'mid:4','d':'high:9'},'description':{'a':'transposon element','b':'nuclear genes','c':'retrotransposon','d':'unknown'}}
df=pd.DataFrame(d)
print(df)
    gene expression         description
a  gene1      low:0  transposon element
b  gene2      mid:3       nuclear genes
c  gene3      mid:4     retrotransposon
d  gene4     high:9             unknown

Some common str methods are shown below.

Selecting rows that contain a given string: the contains() method

df1=df[df['description'].str.contains('transposon')]
print(df1)
    gene expression         description
a  gene1      low:0  transposon element
c  gene3      mid:4     retrotransposon
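contains() also takes a few optional parameters that matter on real data. The sketch below uses a hypothetical Series (not the df above) with mixed case and one missing value, and assumes you want a literal, case-insensitive match where missing values count as no match.

```python
import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing description.
desc = pd.Series(
    ['transposon element', 'nuclear genes', 'Retrotransposon', None],
    index=['a', 'b', 'c', 'd'],
)

# case=False ignores letter case, regex=False treats the pattern as a literal
# string, and na=False makes missing values count as "no match".
mask = desc.str.contains('transposon', case=False, regex=False, na=False)
print(mask.tolist())
```

Without na=False, the missing entry would produce a missing value in the mask, which cannot be used directly for boolean indexing.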

Splitting strings (take one column, split it on a given character, and build a new DataFrame from the pieces)

df1=df['expression'].str.split(':',expand=True)
print(df1)
      0  1
a   low  0
b   mid  3
c   mid  4
d  high  9

Of course, you can also add the two resulting columns directly to df:

df[['exp1','exp2']]=df['expression'].str.split(':',expand=True)
print(df)
    gene expression         description  exp1 exp2
a  gene1      low:0  transposon element   low    0
b  gene2      mid:3       nuclear genes   mid    3
c  gene3      mid:4     retrotransposon   mid    4
d  gene4     high:9             unknown  high    9
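If a value may contain the separator more than once, the n parameter of split() limits the number of splits. The following is a minimal sketch on hypothetical data; the column names level and rest are made up for illustration.

```python
import pandas as pd

# Hypothetical values; the last one contains a second ':' inside the payload.
expr = pd.Series(['low:0', 'mid:3', 'high:9:verified'])

# n=1 splits only at the first ':' so the remainder stays in one column.
parts = expr.str.split(':', n=1, expand=True)
parts.columns = ['level', 'rest']
print(parts)
```

Without n=1, the same call would produce a third column that is empty for most rows.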

Note 1:
At this point the exp2 column has dtype object, meaning it holds Python str values rather than int. Use astype to convert it to int.

print(df['exp2'].dtype)
object

df['exp2']=df['exp2'].astype(int)

print(df['exp2'].dtype)
int32  # platform dependent; int64 on many systems
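astype(int) raises an error as soon as one value is not a valid integer. As a hedged alternative, pd.to_numeric with errors='coerce' turns unparseable entries into NaN instead. The data below is hypothetical and only meant to illustrate the behavior.

```python
import pandas as pd

# Hypothetical data: the last entry has no usable number after the ':'.
raw = pd.Series(['low:0', 'mid:3', 'high:n/a'])

# Take the part after the ':' as a Series of strings.
values = raw.str.split(':', expand=True)[1]

# errors='coerce' converts anything that is not a number into NaN.
numbers = pd.to_numeric(values, errors='coerce')
print(numbers)
```

Because NaN is a floating-point value, the result here is a float column rather than an int column.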

Note 2:
Without expand=True, the result has only one column. It is a Series in which each element is a list.

df1=df['expression'].str.split(':')
print(df1)
a     [low, 0]
b     [mid, 3]
c     [mid, 4]
d    [high, 9]
Name: expression, dtype: object
print(type(df1))
<class 'pandas.core.series.Series'>
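When the split result is kept as a Series of lists, individual pieces can still be pulled out by indexing through .str. A small sketch, using a standalone Series with the same values as the expression column:

```python
import pandas as pd

expr = pd.Series(['low:0', 'mid:3', 'mid:4', 'high:9'], index=list('abcd'))

# Each element of parts is a list such as ['low', '0'].
parts = expr.str.split(':')

# .str[i] indexes into every list element-wise.
level = parts.str[0]
count = parts.str[-1].astype(int)
print(level.tolist(), count.tolist())
```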

Replacing strings

print(df)
    gene expression         description  exp1 exp2
a  gene1      low:0  transposon element   low    0
b  gene2      mid:3       nuclear genes   mid    3
c  gene3      mid:4     retrotransposon   mid    4
d  gene4     high:9             unknown  high    9

df['gene']=df['gene'].str.replace('gene','Gene')

print(df)
    gene expression         description  exp1 exp2
a  Gene1      low:0  transposon element   low    0
b  Gene2      mid:3       nuclear genes   mid    3
c  Gene3      mid:4     retrotransposon   mid    4
d  Gene4     high:9             unknown  high    9
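replace() can also work with regular expressions. Since pandas 2.0 the pattern is treated as a literal string unless regex=True is passed, so it is safest to state the intent explicitly. The sketch below uses hypothetical gene names and an illustrative replacement format.

```python
import pandas as pd

genes = pd.Series(['gene1', 'gene2', 'pseudogene3'])

# Regex replacement: only names that start with 'gene' are rewritten,
# and the captured digits are reused through the \1 backreference.
renamed = genes.str.replace(r'^gene(\d+)$', r'Gene_\1', regex=True)
print(renamed.tolist())

# Literal replacement: regex=False keeps '.' from meaning "any character".
dotted = pd.Series(['a.b', 'c.d']).str.replace('.', '-', regex=False)
print(dotted.tolist())
```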

Checking how a string starts or ends: startswith and endswith

df1=df[df['expression'].str.startswith('m')]
print(df1)
    gene expression      description exp1 exp2
b  Gene2      mid:3    nuclear genes  mid    3
c  Gene3      mid:4  retrotransposon  mid    4
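endswith works the same way, and either mask can be inverted with ~ to keep the rows that do not match. A minimal sketch on a small standalone DataFrame that mirrors the columns used above:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'gene': ['gene1', 'gene2', 'gene3', 'gene4'],
    'description': ['transposon element', 'nuclear genes',
                    'retrotransposon', 'unknown'],
})

# Rows whose description ends with 'transposon'.
ends = df_demo[df_demo['description'].str.endswith('transposon')]

# Rows whose description does NOT start with 'transposon'.
not_starts = df_demo[~df_demo['description'].str.startswith('transposon')]

print(ends['gene'].tolist())
print(not_starts['gene'].tolist())
```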

Regular expressions: using findall

s=df['expression'].str.findall('[a-z]+')
print(s)
a     [low]
b     [mid]
c     [mid]
d    [high]
Name: expression, dtype: object
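findall returns a list per row. When each row is expected to contain exactly one match, str.extract may be more convenient because it returns the captured groups as columns. The sketch below assumes values of the form level:count; the group names level and count are chosen only for illustration.

```python
import pandas as pd

expr = pd.Series(['low:0', 'mid:3', 'mid:4', 'high:9'], index=list('abcd'))

# Named groups become the column names of the resulting DataFrame.
parsed = expr.str.extract(r'(?P<level>[a-z]+):(?P<count>\d+)')
parsed['count'] = parsed['count'].astype(int)
print(parsed)
```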

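Removing specific characters: strip (including lstrip and rstrip)

Note that the argument is treated as a set of characters to strip, not as a literal prefix or suffix. The call below works only because what remains after 'mid:' consists of digits.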

print(df1)
    gene expression      description exp1 exp2
b  Gene2      mid:3    nuclear genes  mid    3
c  Gene3      mid:4  retrotransposon  mid    4

df1=df1.copy()  # work on a copy to avoid SettingWithCopyWarning
df1['expression']=df1['expression'].str.lstrip('mid:')
print(df1)
    gene expression      description exp1 exp2
b  Gene2          3    nuclear genes  mid    3
c  Gene3          4  retrotransposon  mid    4
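To remove an exact prefix rather than a set of characters, str.removeprefix (available in pandas 1.4 and later) is a safer choice. The hypothetical values below show where the two approaches differ.

```python
import pandas as pd

# Hypothetical values; the second one is made only of the characters m, i, d, ':'.
s = pd.Series(['mid:3', 'mid:dim'])

# lstrip removes any leading characters found in the set {'m', 'i', 'd', ':'}.
stripped = s.str.lstrip('mid:')

# removeprefix removes the literal prefix 'mid:' once and nothing else.
prefixed = s.str.removeprefix('mid:')

print(stripped.tolist())
print(prefixed.tolist())
```

str.removesuffix is the counterpart for the end of the string.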
