10 Minutes to pandas (a translation of the tutorial on the official pandas website)

This article is based mainly on the official pandas introduction, with some of my own understanding mixed in; please bear with me if anything is wrong!

Introduction to the pandas module

# The following modules are customarily imported
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

I. Object Creation

1. Creating a Series from a list of values

  A Series is the one-dimensional array provided by pandas, similar to NumPy's array. pandas creates a default integer index, but the index can also be string labels (see the sketch at the end of this subsection).
s = pd.Series([1,3,5,np.nan,6,8])

s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
# A Series is strictly one-dimensional; passing data with more than one dimension raises an error.
# Note also that [1:6] below is not valid Python for an index (an explicit index could be written as e.g. range(1, 7)):
s = pd.Series([1,3,5,np.nan,6,8],[1:6])
  File "", line 1
    s = pd.Series([1,3,5,np.nan,6,8],[1:6])
                                       ^
SyntaxError: invalid syntax
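As mentioned above, the index does not have to be the default integers; a minimal sketch with an explicit string index (the values are arbitrary):

s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # explicit string index instead of the default 0..N-1
s2
# a    10
# b    20
# c    30
# dtype: int64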

2. Creating a DataFrame from a NumPy array, with a datetime index and labeled columns

date_range is a very commonly used pandas function, especially when working with time series data. Its job is to produce a DatetimeIndex, i.e. the index for time series data.
Signature: pandas.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None, **kwargs)

Parameters:
(1) start: string or datetime-like, default None; the start of the date range.
(2) end: string or datetime-like, default None; the end of the date range.
(3) periods: integer or None, default None; how many index values to generate. If it is None, then start and end must both be given.
(4) freq: string or DateOffset, default 'D' (calendar day); the step between index values, e.g. '5H' means every 5 hours.
(5) tz: string or None; the time zone, e.g. 'Asia/Hong_Kong'.
(6) normalize: bool, default False; if True, start and end are normalized to midnight before the index is generated.
(7) name: str, default None; a name for the returned index.
(8) closed: string or None, default None; which end of the start/end interval is closed: 'left' means closed on the left and open on the right, 'right' means open on the left and closed on the right, and None means closed on both sides.

Return value: a DatetimeIndex. A short sketch of a few of these parameters follows below.
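The sketch below exercises periods, freq, tz, and name (the dates are chosen only for illustration):

pd.date_range(start='20130101', periods=4, freq='5H')   # four timestamps, one every 5 hours
# DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 05:00:00',
#                '2013-01-01 10:00:00', '2013-01-01 15:00:00'],
#               dtype='datetime64[ns]', freq='5H')
pd.date_range(start='20130101', end='20130105', tz='Asia/Hong_Kong', name='day')  # a time-zone-aware index named 'day'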

pd.date_range(start="20130104",end="20130110")  # a DatetimeIndex running from the start date to the end date
DatetimeIndex(['2013-01-04', '2013-01-05', '2013-01-06', '2013-01-07',
               '2013-01-08', '2013-01-09', '2013-01-10'],
              dtype='datetime64[ns]', freq='D')

dates = pd.date_range('20130101', periods=6)  # 6 daily timestamps starting on 2013-01-01
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) # a DataFrame is a table-like structure made up of several Series; inside a DataFrame those Series are its columns
df
  A B C D
2013-01-01 0.765232 0.692670 1.141776 2.540531
2013-01-02 -0.898543 -0.659491 -0.430778 0.570982
2013-01-03 -0.025247 0.015063 -1.915272 0.372160
2013-01-04 -0.139174 1.516186 -1.151047 -0.389001
2013-01-05 0.663521 -0.280017 -0.995703 3.404915
2013-01-06 -0.203886 0.695235 1.311637 0.568774

3. Creating a DataFrame from a dict of objects that can be converted to something Series-like

  The rows of the DataFrame are labeled by the index and the columns by the column labels
df2 = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
df2
  A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
df2.dtypes  # the dtype of each column of the DataFrame
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
df2.<TAB> # in IPython, typing df2. and pressing the TAB key shows completions such as:
# df2.A                  df2.bool
# df2.abs                df2.boxplot
# df2.add                df2.C
# df2.add_prefix         df2.clip
# df2.add_suffix         df2.clip_lower
# df2.align              df2.clip_upper
# df2.all                df2.columns
# df2.any                df2.combine
# df2.append             df2.combine_first
# df2.apply              df2.compound
# df2.applymap           df2.consolidate
# df2.D

II. Viewing Data

1. Viewing the top and bottom rows of the DataFrame

df.head()  # the first 5 rows by default
  A B C D
2013-01-01 -0.087393 0.872594 0.251184 1.149018
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-03 0.042210 1.488178 -0.983630 0.323413
2013-01-04 0.271114 -0.088969 0.567894 0.928066
2013-01-05 2.147626 0.291387 0.489159 0.445913
df.head(6) # the number of rows can be given explicitly
  A B C D
2013-01-01 -0.087393 0.872594 0.251184 1.149018
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-03 0.042210 1.488178 -0.983630 0.323413
2013-01-04 0.271114 -0.088969 0.567894 0.928066
2013-01-05 2.147626 0.291387 0.489159 0.445913
2013-01-06 0.131625 0.264920 -1.441035 -1.163547
df.tail(3) # the last 3 rows of the DataFrame
  A B C D
2013-01-04 0.271114 -0.088969 0.567894 0.928066
2013-01-05 2.147626 0.291387 0.489159 0.445913
2013-01-06 0.131625 0.264920 -1.441035 -1.163547

2. Viewing the index, the columns, and the underlying data

df.index  # the index of the DataFrame
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns  # the columns of the DataFrame
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values # the underlying values of df
array([[-0.0873935 ,  0.87259398,  0.25118408,  1.14901843],
       [ 1.65526787,  0.61616926, -0.37998566,  1.32703857],
       [ 0.04220978,  1.48817811, -0.98362978,  0.32341307],
       [ 0.27111352, -0.08896946,  0.56789422,  0.92806564],
       [ 2.14762554,  0.29138675,  0.48915928,  0.44591301],
       [ 0.13162548,  0.2649196 , -1.44103542, -1.16354691]])
df.describe() # summary statistics for each column: count, mean, standard deviation, minimum, the 25%/50% (median)/75% quantiles, and maximum
  A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.693408 0.574046 -0.249402 0.501650
std 0.955779 0.555263 0.829478 0.904427
min -0.087393 -0.088969 -1.441035 -1.163547
25% 0.064564 0.271536 -0.832719 0.354038
50% 0.201369 0.453778 -0.064401 0.686989
75% 1.309229 0.808488 0.429665 1.093780
max 2.147626 1.488178 0.567894 1.327039
df.T # transpose the DataFrame
  2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00
A -0.087393 1.655268 0.042210 0.271114 2.147626 0.131625
B 0.872594 0.616169 1.488178 -0.088969 0.291387 0.264920
C 0.251184 -0.379986 -0.983630 0.567894 0.489159 -1.441035
D 1.149018 1.327039 0.323413 0.928066 0.445913 -1.163547

3. Sorting

(1) Sorting by index

df.sort_index() # sort by index; the defaults are axis=0 and ascending=True (ascending order)
  A B C D
2013-01-01 -0.087393 0.872594 0.251184 1.149018
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-03 0.042210 1.488178 -0.983630 0.323413
2013-01-04 0.271114 -0.088969 0.567894 0.928066
2013-01-05 2.147626 0.291387 0.489159 0.445913
2013-01-06 0.131625 0.264920 -1.441035 -1.163547
df.sort_index(axis=1, ascending=False) # sort the columns instead; ascending=False means descending order
  D C B A
2013-01-01 1.149018 0.251184 0.872594 -0.087393
2013-01-02 1.327039 -0.379986 0.616169 1.655268
2013-01-03 0.323413 -0.983630 1.488178 0.042210
2013-01-04 0.928066 0.567894 -0.088969 0.271114
2013-01-05 0.445913 0.489159 0.291387 2.147626
2013-01-06 -1.163547 -1.441035 0.264920 0.131625

(2) Sorting by values

df.sort_values(by='B') # sort by the values of column B, ascending by default
  A B C D
2013-01-02 -0.898543 -0.659491 -0.430778 0.570982
2013-01-05 0.663521 -0.280017 -0.995703 3.404915
2013-01-03 -0.025247 0.015063 -1.915272 0.372160
2013-01-01 0.765232 0.692670 1.141776 2.540531
2013-01-06 -0.203886 0.695235 1.311637 0.568774
2013-01-04 -0.139174 1.516186 -1.151047 -0.389001
df.sort_values(by='B',ascending=False)   # sort by column B in descending order
  A B C D
2013-01-04 -0.139174 1.516186 -1.151047 -0.389001
2013-01-06 -0.203886 0.695235 1.311637 0.568774
2013-01-01 0.765232 0.692670 1.141776 2.540531
2013-01-03 -0.025247 0.015063 -1.915272 0.372160
2013-01-05 0.663521 -0.280017 -0.995703 3.404915
2013-01-02 -0.898543 -0.659491 -0.430778 0.570982
df.sort_values(by=["A","C"]) # sort by the values of columns A and C
  A B C D
2013-01-01 -0.087393 0.872594 0.251184 1.149018
2013-01-03 0.042210 1.488178 -0.983630 0.323413
2013-01-06 0.131625 0.264920 -1.441035 -1.163547
2013-01-04 0.271114 -0.088969 0.567894 0.928066
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-05 2.147626 0.291387 0.489159 0.445913

III. Selecting Data

1. Selection via the index with []

df['A']  # select a single column, which yields a Series; equivalent to df.A
2013-01-01   -0.087393
2013-01-02    1.655268
2013-01-03    0.042210
2013-01-04    0.271114
2013-01-05    2.147626
2013-01-06    0.131625
Freq: D, Name: A, dtype: float64
df[0:3] # slicing the rows with []; this selects the first 3 rows
  A B C D
2013-01-01 -0.087393 0.872594 0.251184 1.149018
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-03 0.042210 1.488178 -0.983630 0.323413
df['20130102':'20130104'] # slicing the rows by index label
  A B C D
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-03 0.042210 1.488178 -0.983630 0.323413
2013-01-04 0.271114 -0.088969 0.567894 0.928066

2. Selecting by label with DataFrame.loc

It takes the following forms:
df.loc[row_labels, column_labels]
df.loc['a':'b']  # select the rows labeled 'a' and 'b'
df.loc[:,'one']  # select the column 'one'
The first argument of df.loc is the row label(s) and the second argument is the column label(s) (optional, defaulting to all columns). Each argument can be a list or a single label: if both are lists the result is a DataFrame; otherwise it is a Series (or a scalar when both are single labels), as the sketch right below illustrates.
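This sketch uses a small hypothetical DataFrame (the name demo and the columns 'one'/'two' are made up for illustration):

demo = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])
demo.loc['a':'b']                     # rows 'a' through 'b' (label slices include both ends) -> DataFrame
demo.loc[:, 'one']                    # the single column 'one' -> Series
demo.loc['a', 'one']                  # a single row label and column label -> the scalar 1
demo.loc[['a', 'c'], ['one', 'two']]  # lists for both arguments -> DataFrame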

dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
dates[0]  # the first element of dates
Timestamp('2013-01-01 00:00:00', freq='D')
df.loc[dates[0]] # the row whose index label is 2013-01-01
A   -0.087393
B    0.872594
C    0.251184
D    1.149018
Name: 2013-01-01 00:00:00, dtype: float64
df.loc[:,['A','B']]  # select the columns labeled A and B; the result is still a DataFrame
  A B
2013-01-01 -0.087393 0.872594
2013-01-02 1.655268 0.616169
2013-01-03 0.042210 1.488178
2013-01-04 0.271114 -0.088969
2013-01-05 2.147626 0.291387
2013-01-06 0.131625 0.264920
df.loc['20130102',['A','B']]   # columns A and B of the row labeled 20130102
A    1.655268
B    0.616169
Name: 2013-01-02 00:00:00, dtype: float64
df.loc[dates[0],'A'] # getting a scalar value: the first row of column A
-0.08739349829740166

3. Selecting by position with iloc

df.iloc[3]  # select by integer position; a single integer selects a row
A    0.271114
B   -0.088969
C    0.567894
D    0.928066
Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5,0:2] # integer slices, which behave like in numpy/python
  A B
2013-01-04 0.271114 -0.088969
2013-01-05 2.147626 0.291387
df.iloc[[1,2,4],[0,2]] # lists of integer positions, which behave like in numpy/python
  A C
2013-01-02 1.655268 -0.379986
2013-01-03 0.042210 -0.983630
2013-01-05 2.147626 0.489159
df.iloc[1:3,:] # slicing the rows; the trailing : can be omitted
  A B C D
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-03 0.042210 1.488178 -0.983630 0.323413
df.iloc[1:3,]  # the same with the trailing colon omitted
  A B C D
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-03 0.042210 1.488178 -0.983630 0.323413
df.iloc[:,1:3]  # slicing the columns
  B C
2013-01-01 0.872594 0.251184
2013-01-02 0.616169 -0.379986
2013-01-03 1.488178 -0.983630
2013-01-04 -0.088969 0.567894
2013-01-05 0.291387 0.489159
2013-01-06 0.264920 -1.441035
df.iloc[,1:3]   # the leading : cannot be omitted; this raises a SyntaxError
  File "", line 1
    df.iloc[,1:3]
            ^
SyntaxError: invalid syntax
df.iloc[1,1] # getting a specific value
0.6161692607864357
df.iat[1,1] # fast scalar access, equivalent to the line above
0.6161692607864357

4. Boolean indexing

df[df.A>0]   # select rows using the values of a single column
  A B C D
2013-01-02 1.655268 0.616169 -0.379986 1.327039
2013-01-03 0.042210 1.488178 -0.983630 0.323413
2013-01-04 0.271114 -0.088969 0.567894 0.928066
2013-01-05 2.147626 0.291387 0.489159 0.445913
2013-01-06 0.131625 0.264920 -1.441035 -1.163547
df[df>0]  # select values from the whole DataFrame where the condition is True; values that do not satisfy it become NaN
  A B C D
2013-01-01 NaN 0.872594 0.251184 1.149018
2013-01-02 1.655268 0.616169 NaN 1.327039
2013-01-03 0.042210 1.488178 NaN 0.323413
2013-01-04 0.271114 NaN 0.567894 0.928066
2013-01-05 2.147626 0.291387 0.489159 0.445913
2013-01-06 0.131625 0.264920 NaN NaN
df2 = df.copy()  # use the isin() method for filtering

df2['E'] = ['one','one','two','three','four','three']

df2
  A B C D E
2013-01-01 -0.087393 0.872594 0.251184 1.149018 one
2013-01-02 1.655268 0.616169 -0.379986 1.327039 one
2013-01-03 0.042210 1.488178 -0.983630 0.323413 two
2013-01-04 0.271114 -0.088969 0.567894 0.928066 three
2013-01-05 2.147626 0.291387 0.489159 0.445913 four
2013-01-06 0.131625 0.264920 -1.441035 -1.163547 three
df2[df2['E'].isin(['two','four'])] # select the rows whose E value is 'two' or 'four'
  A B C D E
2013-01-03 0.042210 1.488178 -0.983630 0.323413 two
2013-01-05 2.147626 0.291387 0.489159 0.445913 four
df2['E'].isin(['two','four'])  # this by itself returns a boolean Series
2013-01-01    False
2013-01-02    False
2013-01-03     True
2013-01-04    False
2013-01-05     True
2013-01-06    False
Freq: D, Name: E, dtype: bool

5. Setting

(1) Setting a new column automatically aligns the data by the index.
A Series is the one-dimensional array provided by pandas; it is similar to NumPy's array but with some extra functionality. Every pandas data structure carries an index, which can be numeric or string labels.

s1 = pd.Series([1,2,3,4,5,6],index = pd.date_range('20130102',periods=6))
s1     # a new Series
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
df['F'] = s1 # assign s1 to column F of df (aligned on the index)
df
  A B C D F
2013-01-01 -0.087393 0.872594 0.251184 1.149018 NaN
2013-01-02 1.655268 0.616169 -0.379986 1.327039 1.0
2013-01-03 0.042210 1.488178 -0.983630 0.323413 2.0
2013-01-04 0.271114 -0.088969 0.567894 0.928066 3.0
2013-01-05 2.147626 0.291387 0.489159 0.445913 4.0
2013-01-06 0.131625 0.264920 -1.441035 -1.163547 5.0

(2) Setting values by label

df.at[dates[0],'A'] = 0  # set the value in the first row of column A to 0
df
  A B C D F
2013-01-01 0.000000 0.872594 0.251184 1.149018 NaN
2013-01-02 1.655268 0.616169 -0.379986 1.327039 1.0
2013-01-03 0.042210 1.488178 -0.983630 0.323413 2.0
2013-01-04 0.271114 -0.088969 0.567894 0.928066 3.0
2013-01-05 2.147626 0.291387 0.489159 0.445913 4.0
2013-01-06 0.131625 0.264920 -1.441035 -1.163547 5.0

(3) Setting a column by assigning a NumPy array

df.loc[:,'D'] = np.array([5]*len(df)) # set every value of column D to 5
df
  A B C D F
2013-01-01 0.000000 0.872594 0.251184 5 NaN
2013-01-02 1.655268 0.616169 -0.379986 5 1.0
2013-01-03 0.042210 1.488178 -0.983630 5 2.0
2013-01-04 0.271114 -0.088969 0.567894 5 3.0
2013-01-05 2.147626 0.291387 0.489159 5 4.0
2013-01-06 0.131625 0.264920 -1.441035 5 5.0

(4) Setting values with a where operation

df2 = df.copy()
df2[df2>0] = -df2  # negate every value that is greater than 0
df2
  A B C D F
2013-01-01 0.000000 -0.872594 -0.251184 -5 NaN
2013-01-02 -1.655268 -0.616169 -0.379986 -5 -1.0
2013-01-03 -0.042210 -1.488178 -0.983630 -5 -2.0
2013-01-04 -0.271114 -0.088969 -0.567894 -5 -3.0
2013-01-05 -2.147626 -0.291387 -0.489159 -5 -4.0
2013-01-06 -0.131625 -0.264920 -1.441035 -5 -5.0

IV. Missing Data

pandas primarily uses the value np.nan to represent missing data. By default it is excluded from computations, as in the short sketch below.
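A minimal sketch of this default behaviour (the Series nums is made up for illustration):

nums = pd.Series([1.0, np.nan, 3.0])
nums.sum()               # 4.0 -- the NaN is skipped
nums.mean()              # 2.0
nums.sum(skipna=False)   # nan -- unless skipping is disabled explicitly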

1. reindex() lets you change/add/delete the index on a specified axis; it returns a copy of the data

df1 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df1
  A B C D F E
2013-01-01 0.000000 0.872594 0.251184 5 NaN NaN
2013-01-02 1.655268 0.616169 -0.379986 5 1.0 NaN
2013-01-03 0.042210 1.488178 -0.983630 5 2.0 NaN
2013-01-04 0.271114 -0.088969 0.567894 5 3.0 NaN
df1.loc[dates[0]:dates[1],'E'] = 1
df1
  A B C D F E
2013-01-01 0.000000 0.872594 0.251184 5 NaN 1.0
2013-01-02 1.655268 0.616169 -0.379986 5 1.0 1.0
2013-01-03 0.042210 1.488178 -0.983630 5 2.0 NaN
2013-01-04 0.271114 -0.088969 0.567894 5 3.0 NaN

2. Dropping rows that have missing values

df1.dropna(how='any')  # drop every row that has any missing value
  A B C D F E
2013-01-02 1.655268 0.616169 -0.379986 5 1.0 1.0
df1.fillna(value=5)  # replace all missing values with 5
  A B C D F E
2013-01-01 0.000000 0.872594 0.251184 5 5.0 1.0
2013-01-02 1.655268 0.616169 -0.379986 5 1.0 1.0
2013-01-03 0.042210 1.488178 -0.983630 5 2.0 5.0
2013-01-04 0.271114 -0.088969 0.567894 5 3.0 5.0

3. Getting a boolean mask: True where a value is NaN, False otherwise

pd.isna(df1)
  A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True

V. Operations

1. Statistics

Operations in general exclude missing data.

(1) Performing a descriptive statistic

df.mean()  # mean of each column; the axis argument can be 0 or 1 and defaults to 0, i.e. operate column by column
A    0.707974
B    0.574046
C   -0.249402
D    5.000000
F    3.000000
dtype: float64

(2) The same operation on the other axis

df.mean(1) # mean of each row
2013-01-01    1.530945
2013-01-02    1.578290
2013-01-03    1.509352
2013-01-04    1.750008
2013-01-05    2.385634
2013-01-06    1.791102
Freq: D, dtype: float64

(3) Operating with objects that have different dimensionality requires alignment. In addition, pandas automatically broadcasts along the specified dimension.

s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(-2) # shift() moves the values along the index; a negative argument shifts them up, toward earlier labels
s
2013-01-01    5.0
2013-01-02    NaN
2013-01-03    6.0
2013-01-04    8.0
2013-01-05    NaN
2013-01-06    NaN
Freq: D, dtype: float64
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2) # shift the values down by 2 positions; the gaps are filled with NaN
s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
df.sub(s,axis='index') # subtract s from df row by row, aligning on the index; rows where s is NaN become NaN
  A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.957790 0.488178 -1.983630 4.0 1.0
2013-01-04 -2.728886 -3.088969 -2.432106 2.0 0.0
2013-01-05 -2.852374 -4.708613 -4.510841 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN

2. Apply

Applying functions to the data
df
  A B C D F
2013-01-01 0.000000 0.872594 0.251184 5 NaN
2013-01-02 1.655268 0.616169 -0.379986 5 1.0
2013-01-03 0.042210 1.488178 -0.983630 5 2.0
2013-01-04 0.271114 -0.088969 0.567894 5 3.0
2013-01-05 2.147626 0.291387 0.489159 5 4.0
2013-01-06 0.131625 0.264920 -1.441035 5 5.0
df.apply(np.cumsum)  # cumulative sum of each column of df
  A B C D F
2013-01-01 0.000000 0.872594 0.251184 5 NaN
2013-01-02 1.655268 1.488763 -0.128802 10 1.0
2013-01-03 1.697478 2.976941 -1.112431 15 3.0
2013-01-04 1.968591 2.887972 -0.544537 20 6.0
2013-01-05 4.116217 3.179359 -0.055378 25 10.0
2013-01-06 4.247842 3.444278 -1.496413 30 15.0
df.apply(lambda x:x.max()-x.min()) # in a lambda, the parameters (there may be several, separated by commas) come before the colon and the returned expression after it; here it is the max minus the min of each column
A    2.147626
B    1.577148
C    2.008930
D    0.000000
F    4.000000
dtype: float64
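The same idea works with ordinary named functions and across rows instead of columns; a brief sketch (the function name value_range is made up):

def value_range(x):
    # return the spread (max minus min) of a Series
    return x.max() - x.min()

df.apply(value_range)          # column by column, same result as the lambda above
df.apply(value_range, axis=1)  # row by row instead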

3. Histogramming

s = pd.Series(np.random.randint(0,7,size=10))  # draw 10 random integers from 0 to 6 (the upper bound 7 is exclusive)
s
0    5
1    6
2    5
3    3
4    0
5    2
6    3
7    6
8    4
9    0
dtype: int32
s.value_counts()  # count how often each value occurs in s
6    2
5    2
3    2
0    2
4    1
2    1
dtype: int64

4. String methods

Series is equipped with a set of string processing methods in its str attribute that make it easy to operate on each element of the array.

s = pd.Series(['A','B','C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower() # convert every string in s to lower case
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
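A few more of the vectorized string methods, shown as a sketch on the same Series (missing values propagate as NaN):

s.str.upper()            # upper-case each string
s.str.len()              # length of each string
s.str.contains('a')      # boolean test: does the string contain 'a'
s.str.replace('a', 'X')  # replace 'a' with 'X' in each string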

VI. Merging

pandas provides various facilities for easily combining Series, DataFrames, and other pandas objects.

1. Concat

df = pd.DataFrame(np.random.randn(10,4))  # create a DataFrame with 10 rows and 4 columns
df
  0 1 2 3
0 0.943826 -0.051977 -2.032401 -0.287685
1 -1.176138 -0.978212 -1.072027 -0.627176
2 -1.099693 0.850744 0.659844 0.172439
3 -0.288210 0.289878 0.252331 0.833933
4 0.541648 -2.114519 0.211821 0.277398
5 2.538557 -1.699267 -0.454330 -0.490725
6 1.042810 -0.078370 0.274850 -1.200096
7 0.365216 0.187428 -0.469872 0.046218
8 -0.525191 -0.998904 0.156138 -0.797593
9 0.771242 -0.763656 -0.822907 0.409141
pieces = [df[:3],df[3:7],df[7:]] #break it into pieces
pieces
[          0         1         2         3
 0  0.943826 -0.051977 -2.032401 -0.287685
 1 -1.176138 -0.978212 -1.072027 -0.627176
 2 -1.099693  0.850744  0.659844  0.172439,
           0         1         2         3
 3 -0.288210  0.289878  0.252331  0.833933
 4  0.541648 -2.114519  0.211821  0.277398
 5  2.538557 -1.699267 -0.454330 -0.490725
 6  1.042810 -0.078370  0.274850 -1.200096,
           0         1         2         3
 7  0.365216  0.187428 -0.469872  0.046218
 8 -0.525191 -0.998904  0.156138 -0.797593
 9  0.771242 -0.763656 -0.822907  0.409141]
pd.concat(pieces)  # concatenate the pieces back together
  0 1 2 3
0 0.943826 -0.051977 -2.032401 -0.287685
1 -1.176138 -0.978212 -1.072027 -0.627176
2 -1.099693 0.850744 0.659844 0.172439
3 -0.288210 0.289878 0.252331 0.833933
4 0.541648 -2.114519 0.211821 0.277398
5 2.538557 -1.699267 -0.454330 -0.490725
6 1.042810 -0.078370 0.274850 -1.200096
7 0.365216 0.187428 -0.469872 0.046218
8 -0.525191 -0.998904 0.156138 -0.797593
9 0.771242 -0.763656 -0.822907 0.409141

2. Join

(1) SQL-style merges

left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
right = pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})
left
  key lval
0 foo 1
1 foo 2
right
  key rval
0 foo 4
1 foo 5
pd.merge(left,right,on='key') # merge on the key column; since 'foo' appears twice in each frame, the result is the Cartesian product of the matching rows
  key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

(2) Another example

 left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
left
  key lval
0 foo 1
1 bar 2
right
  key rval
0 foo 4
1 bar 5
pd.merge(left, right, on='key')
  key lval rval
0 foo 1 4
1 bar 2 5
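merge() also supports the usual SQL join types through its how parameter. A sketch with hypothetical frames left2/right2 that share only one key:

left2 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right2 = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
pd.merge(left2, right2, on='key', how='inner')  # only keys present in both ('foo')
pd.merge(left2, right2, on='key', how='left')   # all keys from left2; missing rval becomes NaN
pd.merge(left2, right2, on='key', how='outer')  # the union of the keys; gaps become NaN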

3. Append

Appending rows to a DataFrame (see the note on pd.concat at the end of this subsection).
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
 df
  A B C D
0 -0.207538 0.571509 -0.017933 -0.439429
1 1.151596 -1.270133 0.173458 1.943956
2 0.358885 0.405927 0.178656 1.300998
3 -1.430587 -0.713583 1.068798 0.615146
4 0.829949 -0.737999 0.445106 -1.736728
5 0.333546 -0.333385 -0.190337 -0.719699
6 0.311432 0.031742 0.132947 1.233933
7 -1.584257 1.283583 -1.006611 0.643552
s = df.iloc[3]  # select the row at position 3 of the DataFrame
s
A   -1.430587
B   -0.713583
C    1.068798
D    0.615146
Name: 3, dtype: float64
df.append(s, ignore_index=True)   # append s to the end of df, ignoring its original index
  A B C D
0 -0.207538 0.571509 -0.017933 -0.439429
1 1.151596 -1.270133 0.173458 1.943956
2 0.358885 0.405927 0.178656 1.300998
3 -1.430587 -0.713583 1.068798 0.615146
4 0.829949 -0.737999 0.445106 -1.736728
5 0.333546 -0.333385 -0.190337 -0.719699
6 0.311432 0.031742 0.132947 1.233933
7 -1.584257 1.283583 -1.006611 0.643552
8 -1.430587 -0.713583 1.068798 0.615146
df.append(s, ignore_index=False)  # keep the appended row's original index
  A B C D
0 -0.207538 0.571509 -0.017933 -0.439429
1 1.151596 -1.270133 0.173458 1.943956
2 0.358885 0.405927 0.178656 1.300998
3 -1.430587 -0.713583 1.068798 0.615146
4 0.829949 -0.737999 0.445106 -1.736728
5 0.333546 -0.333385 -0.190337 -0.719699
6 0.311432 0.031742 0.132947 1.233933
7 -1.584257 1.283583 -1.006611 0.643552
3 -1.430587 -0.713583 1.068798 0.615146
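Note that DataFrame.append was deprecated and later removed in pandas 2.0; on newer versions the same result can be obtained with pd.concat. A sketch of the equivalent call:

pd.concat([df, s.to_frame().T], ignore_index=True)  # turn the row s into a 1-row frame and concatenate it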

VII. Grouping

By "group by" we usually mean a process that involves one or more of the following steps:

Splitting the data into groups based on some criteria;

Applying a function to each group independently;

Combining the results into a data structure.
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
  A B C D
0 foo one 0.034416 -0.343473
1 bar one 1.332412 -0.627173
2 foo two 0.525722 2.446132
3 bar three -1.877499 1.777156
4 foo two -0.210871 -0.358429
5 bar two 1.045275 -0.873375
6 foo one 0.333061 0.951301
7 foo three 0.412129 0.125475

1. Grouping and then applying sum() to each group

df.groupby('A').sum()  # group by the values of column A, then sum each group
  C D
A    
bar 0.500188 0.276609
foo 1.094456 2.821007

2. Grouping by multiple columns forms a hierarchical index, to which we can again apply the function (see also the agg() sketch at the end of this subsection)

df.groupby(['A','B']).sum()
    C D
A B    
bar one 1.332412 -0.627173
three -1.877499 1.777156
two 1.045275 -0.873375
foo one 0.367477 0.607829
three 0.412129 0.125475
two 0.314851 2.087703
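The applied function does not have to be sum(); several aggregations can be computed at once with agg(). A brief sketch, restricting to the numeric columns:

df.groupby('A')[['C', 'D']].agg(['sum', 'mean'])   # sum and mean of C and D per group
df.groupby(['A', 'B'])['C'].agg(['min', 'max'])    # several statistics of a single column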

VIII. Reshaping

1. Stack

The built-in zip() function takes iterables as arguments, packs their corresponding elements into tuples, and returns those tuples (as an iterator in Python 3). If the iterables have different lengths, the result is as long as the shortest one. With the * operator, a zipped sequence can be unpacked again.

a = [1,2,3]
b = [4,5,6]
c = [4,5,6,7,8]
zipped = zip(a,b)
list(zipped)  # the corresponding elements packed into tuples
[(1, 4), (2, 5), (3, 6)]
list(zip(a,c))  # with iterables of different lengths, the result is as long as the shortest one
[(1, 4), (2, 5), (3, 6)]
list(zip(*zipped))  # the opposite of zip: unpack the pairs back into separate sequences
# Note: zipped is an iterator and was already exhausted by list(zipped) above, so it would
# have to be recreated (zipped = zip(a,b)) before unpacking it. Iterating over it also needs
# valid Python 3 syntax:
for i in zip(a,b):
    print(i)
# (1, 4)
# (2, 5)
# (3, 6)
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))  # pair the two lists element by element
tuples # the list of paired tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second']) # MultiIndex.from_tuples builds the corresponding hierarchical index from the tuples
 index 
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
    A B
first second    
bar one 0.339034 0.414004
two -0.518862 1.997310
baz one -0.730239 -0.277486
two -0.139502 -1.313450
foo one 0.759715 1.829978
two -0.257618 -0.189971
qux one -0.520739 2.427057
two 0.887668 0.852152
df2 = df[:4]
 df2
    A B
first second    
bar one 0.339034 0.414004
two -0.518862 1.997310
baz one -0.730239 -0.277486
two -0.139502 -1.313450

(1) The stack() method "compresses" the column labels into a new level of the index

stacked = df2.stack()
stacked
first  second   
bar    one     A    0.339034
               B    0.414004
       two     A   -0.518862
               B    1.997310
baz    one     A   -0.730239
               B   -0.277486
       two     A   -0.139502
               B   -1.313450
dtype: float64

(2) The inverse operation of stack() is unstack(), which by default unstacks the last level

stacked.unstack() # unstack the last level
    A B
first second    
bar one 0.339034 0.414004
two -0.518862 1.997310
baz one -0.730239 -0.277486
two -0.139502 -1.313450
stacked.unstack(1) # unstack the second level
  second one two
first      
bar A 0.339034 -0.518862
B 0.414004 1.997310
baz A -0.730239 -0.139502
B -0.277486 -1.313450
stacked.unstack(2) # unstack the third (last) level, the same as the default
    A B
first second    
bar one 0.339034 0.414004
two -0.518862 1.997310
baz one -0.730239 -0.277486
two -0.139502 -1.313450
stacked.unstack(0) # unstack the first level
  first bar baz
second      
one A 0.339034 -0.730239
B 0.414004 -0.277486
two A -0.518862 -0.139502
B 1.997310 -1.313450

2. Pivot Tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
  A B C D E
0 one A foo 1.004031 0.301297
1 one B foo 0.343626 1.474257
2 two C foo -0.273272 0.164824
3 three A bar 1.621494 -0.127404
4 one B bar -0.485918 -0.441188
5 one C bar 1.384366 0.495152
6 two A foo -0.756043 -1.061713
7 three B foo 0.948366 -0.006379
8 one C foo -0.985923 1.927168
9 one A bar -0.941727 0.728585
10 two B bar -0.638742 -0.710529
11 three C bar 0.566670 0.896022
pd.pivot_table(df,values='D',index=['A','B'],columns=['C']) # pivot with A and B as the index, C as the columns, and the values taken from D
  C bar foo
A B    
one A -0.941727 1.004031
B -0.485918 0.343626
C 1.384366 -0.985923
three A 1.621494 NaN
B NaN 0.948366
C 0.566670 NaN
two A NaN -0.756043
B -0.638742 NaN
C NaN -0.273272
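pivot_table() also accepts an explicit aggregation function and can add row/column totals; a short sketch on the same df:

pd.pivot_table(df, values='D', index=['A'], columns=['C'],
               aggfunc='mean', margins=True)   # mean of D per cell, plus an 'All' row and column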

IX. Time Series

pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting secondly data into 5-minute data).

 rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()
2012-01-01    26372
Freq: 5T, dtype: int32
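Other aggregations can be applied to the same resampler; a brief sketch:

ts.resample('5Min').mean()   # average value per 5-minute bin
ts.resample('5Min').ohlc()   # open/high/low/close of each bin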

1. Time zone representation

 rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06   -0.028910
2012-03-07    0.482453
2012-03-08   -0.936729
2012-03-09   -0.027259
2012-03-10    1.124380
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00   -0.028910
2012-03-07 00:00:00+00:00    0.482453
2012-03-08 00:00:00+00:00   -0.936729
2012-03-09 00:00:00+00:00   -0.027259
2012-03-10 00:00:00+00:00    1.124380
Freq: D, dtype: float64

2. Converting to another time zone

ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00   -0.028910
2012-03-06 19:00:00-05:00    0.482453
2012-03-07 19:00:00-05:00   -0.936729
2012-03-08 19:00:00-05:00   -0.027259
2012-03-09 19:00:00-05:00    1.124380
Freq: D, dtype: float64

3. Converting between time span representations (timestamps and periods)

rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31    0.899926
2012-02-29    0.865466
2012-03-31   -0.323191
2012-04-30    0.694425
2012-05-31   -1.996379
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01    0.899926
2012-02    0.865466
2012-03   -0.323191
2012-04    0.694425
2012-05   -1.996379
Freq: M, dtype: float64
 ps.to_timestamp()
2012-01-01    0.899926
2012-02-01    0.865466
2012-03-01   -0.323191
2012-04-01    0.694425
2012-05-01   -1.996379
Freq: MS, dtype: float64

4. Converting between period and timestamp makes some convenient arithmetic functions available. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
1990-03-01 09:00   -0.496062
1990-06-01 09:00   -0.197552
1990-09-01 09:00    0.301589
1990-12-01 09:00   -0.236359
1991-03-01 09:00   -0.647946
Freq: H, dtype: float64

X. Categoricals

A pandas DataFrame can contain columns of categorical data.
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
  id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
df['grade'] = df['raw_grade'].astype('category') # convert the raw grades to a categorical dtype
df['grade']
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

1. Renaming the categories to more meaningful names

df['grade'].cat.categories = ['very good','good','very bad'] # rename the categories of grade (see the rename_categories note below)
df
  id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad
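Assigning to Series.cat.categories was deprecated and later removed in newer pandas versions; the supported spelling there is rename_categories. A sketch of the equivalent call (starting from the original categories ['a', 'b', 'e']):

df['grade'] = df['grade'].cat.rename_categories(['very good', 'good', 'very bad'])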

2. Reordering the categories and simultaneously adding the missing categories

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

3. Sorting is per the order of the categories, not lexical order

df.sort_values(by='grade')
  id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

4. Grouping by a categorical column also shows empty categories

df.groupby('grade').size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

XI. Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
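When this code is run as a plain script rather than in IPython/Jupyter, the figure also has to be shown explicitly:

plt.show()  # render the figure window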

XII. Getting Data In/Out

1. CSV

(1) Writing to a CSV file

df.to_csv('foo.csv') # write the DataFrame df to a CSV file named foo.csv

(2) Reading from a CSV file

foo = pd.read_csv('foo.csv') # read a CSV file from disk
foo.head(6) # the first 6 rows of foo
  Unnamed: 0 A B C D
0 2000-01-01 -2.086601 -1.177304 1.121419 -0.685302
1 2000-01-02 -1.780664 -0.778726 1.859689 -0.426140
2 2000-01-03 -1.043800 0.331660 2.781106 0.173191
3 2000-01-04 -1.785721 0.553241 3.027602 -0.776087
4 2000-01-05 -2.017765 1.972538 3.681418 -1.735131
5 2000-01-06 -2.058775 2.031710 3.448706 -1.738626
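Because to_csv wrote the DatetimeIndex out as an ordinary first column, it comes back as 'Unnamed: 0' above. A sketch for recovering the original index when reading:

foo = pd.read_csv('foo.csv', index_col=0, parse_dates=True)  # treat the first column as a DatetimeIndex again
foo.head()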

2. HDF5

(1) Writing to an HDF5 store

df.to_hdf('foo.h5','df')  # writing HDF5 requires the PyTables package (tables) to be installed

(2) Reading from an HDF5 store

foo= pd.read_hdf('foo.h5','df')
foo.head()
  A B C D
2000-01-01 -2.086601 -1.177304 1.121419 -0.685302
2000-01-02 -1.780664 -0.778726 1.859689 -0.426140
2000-01-03 -1.043800 0.331660 2.781106 0.173191
2000-01-04 -1.785721 0.553241 3.027602 -0.776087
2000-01-05 -2.017765 1.972538 3.681418 -1.735131

3. Excel

(1) Writing to an Excel file

df.to_excel('foo.xlsx', sheet_name='Sheet1') # foo.xlsx is the workbook name, Sheet1 the sheet name

(2) Reading from an Excel file

foo = pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
foo.head() 
  A B C D
2000-01-01 -2.086601 -1.177304 1.121419 -0.685302
2000-01-02 -1.780664 -0.778726 1.859689 -0.426140
2000-01-03 -1.043800 0.331660 2.781106 0.173191
2000-01-04 -1.785721 0.553241 3.027602 -0.776087
2000-01-05 -2.017765 1.972538 3.681418 -1.735131
