Basic pandas functions
Objective
Become proficient with the basic pandas functions and how to use them.
Background
Column-to-row methods
stack: pandas.DataFrame.stack(self, level=-1, dropna=True). For an ordinary DataFrame, it moves the column index into the innermost level of the row index and produces a Series.
For a DataFrame with a hierarchical column index, the specified column level can be moved onto the rows; by default the innermost column level becomes the innermost row level.
unstack: pandas.DataFrame.unstack(self, level=-1, fill_value=None). For an ordinary DataFrame, it moves the column index to the outermost level of the row index and produces a Series. For a hierarchically indexed DataFrame it behaves much like stack, appearing to treat the index levels as a whole, and it raises an error when level is passed as a list.
melt: pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None). id_vars are the original columns to keep in the result, and value_vars are the columns to unpivot from columns into rows;
var_name renames the variable column produced by the unpivot (default 'variable'), and value_name names the column that holds the corresponding values.
Row-to-column methods
unstack: pandas.DataFrame.unstack(self, level=-1, fill_value=None) (see the sketch below).
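A minimal sketch of the three functions on a tiny DataFrame (not taken from the lab itself; the names demo and s and the column labels are only illustrative):
import pandas as pd

demo = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])

# stack: columns move into the innermost row level, producing a Series
s = demo.stack()            # index becomes (r1, 'x'), (r1, 'y'), (r2, 'x'), (r2, 'y')

# unstack: the innermost row level moves back out to the columns
s.unstack()                 # recovers the original 2x2 DataFrame

# melt: keep 'x' as an identifier column and unpivot 'y' into variable/value pairs
pd.melt(demo, id_vars=['x'], value_vars=['y'], var_name='var', value_name='val')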
Environment
Python 3.6.1
Jupyter
Tasks
Practice using the main basic pandas functions.
Code
import pandas as pd
import numpy as np
1. Create a DataFrame named df
df = pd.DataFrame({'A': np.random.randint(1, 100, 4),
                   'B': pd.date_range(start='20130101', periods=4, freq='D'),
                   'C': pd.Series([1, 2, 3, 4], index=['zhang', 'li', 'zhou', 'wang'], dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
df
        A          B    C  D      E    F
zhang  12 2013-01-01  1.0  3   test  foo
li     40 2013-01-02  2.0  3  train  foo
zhou   26 2013-01-03  3.0  3   test  foo
wang   27 2013-01-04  4.0  3  train  foo
2. Viewing the data
df.head()
        A          B    C  D      E    F
zhang  12 2013-01-01  1.0  3   test  foo
li     40 2013-01-02  2.0  3  train  foo
zhou   26 2013-01-03  3.0  3   test  foo
wang   27 2013-01-04  4.0  3  train  foo
df.head(3)
        A          B    C  D      E    F
zhang  12 2013-01-01  1.0  3   test  foo
li     40 2013-01-02  2.0  3  train  foo
zhou   26 2013-01-03  3.0  3   test  foo
df.tail(2)
        A          B    C  D      E    F
zhou   26 2013-01-03  3.0  3   test  foo
wang   27 2013-01-04  4.0  3  train  foo
3. Viewing the index, column names, and values
df.index
Index(['zhang', 'li', 'zhou', 'wang'], dtype='object')
df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
df.values
array([[12, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
[40, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'foo'],
[26, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'foo'],
[27, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'foo']],
dtype=object)
4. Summary statistics
df.describe()
               A         C    D
count   4.000000  4.000000  4.0
mean   26.250000  2.500000  3.0
std    11.441882  1.290994  0.0
min    12.000000  1.000000  3.0
25%    22.500000  1.750000  3.0
50%    26.500000  2.500000  3.0
75%    30.250000  3.250000  3.0
max    40.000000  4.000000  3.0
5. Transposing the data
df.T
                 zhang                   li                 zhou                 wang
A                   12                   40                   26                   27
B  2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00  2013-01-04 00:00:00
C                    1                    2                    3                    4
D                    3                    3                    3                    3
E                 test                train                 test                train
F                  foo                  foo                  foo                  foo
6. Sorting
df.sort_index(axis=0, ascending=False)
        A          B    C  D      E    F
zhou   26 2013-01-03  3.0  3   test  foo
zhang  12 2013-01-01  1.0  3   test  foo
wang   27 2013-01-04  4.0  3  train  foo
li     40 2013-01-02  2.0  3  train  foo
df.sort_index(axis=0, ascending=True)
        A          B    C  D      E    F
li     40 2013-01-02  2.0  3  train  foo
wang   27 2013-01-04  4.0  3  train  foo
zhang  12 2013-01-01  1.0  3   test  foo
zhou   26 2013-01-03  3.0  3   test  foo
df.sort_index(axis=1, ascending=False)
         F      E  D    C          B   A
zhang  foo   test  3  1.0 2013-01-01  12
li     foo  train  3  2.0 2013-01-02  40
zhou   foo   test  3  3.0 2013-01-03  26
wang   foo  train  3  4.0 2013-01-04  27
df.sort_values(by='A')
        A          B    C  D      E    F
zhang  12 2013-01-01  1.0  3   test  foo
zhou   26 2013-01-03  3.0  3   test  foo
wang   27 2013-01-04  4.0  3  train  foo
li     40 2013-01-02  2.0  3  train  foo
7. Handling duplicate values
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4, 'k2':[1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
1) Detect duplicate rows
data.duplicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
2) Return a new DataFrame with the duplicate rows removed
data.drop_duplicates()
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
3) Drop rows with duplicate values in column k1, keeping the last occurrence of each value.
data.drop_duplicates(['k1'],keep='last')
8. Mapping
1) Map with a function: convert the values of column k1 to uppercase
data['k1']=data['k1'].map(str.upper)
data
    k1  k2
0  ONE   1
1  ONE   1
2  ONE   2
3  TWO   3
4  TWO   3
5  TWO   4
6  TWO   4
2) Map with a dict: convert the values of column k1 back to lowercase.
data['k1']=data['k1'].map({'ONE':'one','TWO':'two'})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
3) Map with a lambda expression: add 5 to the values of column k2.
data['k2'] = data['k2'].map(lambda x:x+5)
data
    k1  k2
0  one   6
1  one   6
2  one   7
3  two   8
4  two   8
5  two   9
6  two   9
4) Map with a lambda expression: add 5 to the index values.
data.index = data.index.map(lambda x:x+5)
data
     k1  k2
5   one   6
6   one   6
7   one   7
8   two   8
9   two   8
10  two   9
11  two   9
5) Map the column names with str.upper to convert them to uppercase
data.columns=data.columns.map(str.upper)
data
     K1  K2
5   one   6
6   one   6
7   one   7
8   two   8
9   two   8
10  two   9
11  two   9
9. Discretizing data
from random import randrange
data=[randrange(100) for _ in range(10)]
category=[0,25,50,100]
pd.cut(data,category)
[(50, 100], (50, 100], (0, 25], (50, 100], (50, 100], (50, 100], (0, 25], (50, 100], (50, 100], (50, 100]]
Categories (3, interval[int64]): [(0, 25] < (25, 50] < (50, 100]]
1) Cut data by category, passing right=False so that the bins are closed on the left and open on the right
pd.cut(data,category,right=False)
[[50, 100), [50, 100), [0, 25), [50, 100), [50, 100), [50, 100), [0, 25), [50, 100), [50, 100), [50, 100)]
Categories (3, interval[int64]): [[0, 25) < [25, 50) < [50, 100)]
2) Cut data by category with right=False (left-closed, right-open bins) and attach a label to each bin
labels = ['low', 'middle', 'high']
pd.cut(data,category,right=False,labels=labels)
['high', 'high', 'low', 'high', 'high', 'high', 'low', 'high', 'high', 'high']
Categories (3, object): ['low' < 'middle' < 'high']
3) Cut data into 4 equal-width bins.
data
pd.cut(data,4)
[(45.5, 66.75], (66.75, 88.0], (2.915, 24.25], (66.75, 88.0], (45.5, 66.75], (66.75, 88.0], (2.915, 24.25], (45.5, 66.75], (66.75, 88.0], (45.5, 66.75]]
Categories (4, interval[float64]): [(2.915, 24.25] < (24.25, 45.5] < (45.5, 66.75] < (66.75, 88.0]]
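If bins based on actual sample quartiles are wanted instead of the equal-width bins produced by pd.cut(data, 4), pandas also provides qcut; a minimal sketch reusing the data list defined above (the quartile labels are only illustrative):
# qcut splits at sample quantiles, so each bin holds (roughly) the same number of values
quartiles = pd.qcut(data, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
pd.Series(quartiles).value_counts()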
10. Frequency counts and shifting
1) Copy df to df1 with the copy method, then use shift to move the data in df1 down by one row (a negative argument shifts it up).
df1=df.copy()
df1.shift(1)
          A          B    C    D      E    F
zhang   NaN        NaT  NaN  NaN    NaN  NaN
li     12.0 2013-01-01  1.0  3.0   test  foo
zhou   40.0 2013-01-02  2.0  3.0  train  foo
wang   26.0 2013-01-03  3.0  3.0   test  foo
2) Count how often each value occurs in column D of df1
df1['D'].value_counts()
3 4
Name: D, dtype: int64
11. Pivoting
df = pd.DataFrame({'a':[1,2,3,4],'b':[2,3,4,5],'c':[3,4,5,6], 'd':[3,3,3,3]})
df
   a  b  c  d
0  1  2  3  3
1  2  3  4  3
2  3  4  5  3
3  4  5  6  3
1) Build a pivot table with column a as the index, column b as the columns, and column c as the values.
df.pivot(index='a', columns='b', values='c')
b    2    3    4    5
a
1  3.0  NaN  NaN  NaN
2  NaN  4.0  NaN  NaN
3  NaN  NaN  5.0  NaN
4  NaN  NaN  NaN  6.0
2) Build a pivot table with column a as the index, column b as the columns, and column d as the values
df.pivot(index='a', columns='b', values='d')
b    2    3    4    5
a
1  3.0  NaN  NaN  NaN
2  NaN  3.0  NaN  NaN
3  NaN  NaN  3.0  NaN
4  NaN  NaN  NaN  3.0
12. Differencing
1) Create a new DataFrame named df
df = pd.DataFrame({'a':np.random.randint(1, 100, 10),'b':np.random.randint(1, 100, 10)},index=map(str, range(10)))
df
    a   b
0  26  26
1  16   6
2  48  30
3   2  52
4  21  86
5  81  63
6  99   2
7   9  73
8  38   2
9  51  59
2) Take the first-order difference along the rows
df.diff()
      a     b
0   NaN   NaN
1  24.0  24.0
2 -86.0  -3.0
3   5.0  26.0
4  56.0  15.0
5   7.0  -9.0
6 -40.0 -21.0
7  -6.0 -42.0
8  22.0  61.0
9  21.0 -13.0
3) Take the first-order difference along the columns
df.diff(axis=1)
    a     b
0 NaN -58.0
1 NaN -58.0
2 NaN  25.0
3 NaN  46.0
4 NaN   5.0
5 NaN -11.0
6 NaN   8.0
7 NaN -28.0
8 NaN  11.0
9 NaN -23.0
4) Difference along the rows with periods=2, i.e. each value minus the value two rows earlier.
df.diff(periods=2)
      a     b
0   NaN   NaN
1   NaN   NaN
2 -62.0  21.0
3 -81.0  23.0
4  61.0  41.0
5  63.0   6.0
6 -33.0 -30.0
7 -46.0 -63.0
8  16.0  19.0
9  43.0  48.0
13. Correlation coefficients
1) Create a new DataFrame named df
df = pd.DataFrame({'A':np.random.randint(1, 100, 10),'B':np.random.randint(1, 100, 10),'C':np.random.randint(1, 100, 10)})
df
    A   B   C
0   5  69  16
1   1  59  17
2  90  92  62
3  22  37  72
4  23  49  80
5  26  10  89
6  10  13   7
7  45  45  65
8  17  43  56
9  73  48  44
2) Compute the correlation matrix of df using the Pearson correlation coefficient (the default)
df.corr()
          A         B         C
A  1.000000  0.429682  0.357246
B  0.429682  1.000000 -0.127607
C  0.357246 -0.127607  1.000000
3) Compute the correlation matrix of df using the Kendall correlation coefficient
df.corr('kendall')
          A         B         C
A  1.000000  0.066667  0.333333
B  0.066667  1.000000 -0.244444
C  0.333333 -0.244444  1.000000
4) Compute the correlation matrix of df using the Spearman rank correlation
df.corr('spearman')
          A         B         C
A  1.000000  0.042424  0.527273
B  0.042424  1.000000 -0.272727
C  0.527273 -0.272727  1.000000
14. Reshaping
1) Create a new DataFrame named df and assign its first 4 rows to df2.
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['A', 'B'])
index
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['A', 'B'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
                A         B
A   B
bar one  0.090815  0.776030
    two  1.217861 -0.726099
baz one  0.801053 -1.271264
    two  0.785470  0.514176
foo one -0.066081 -0.724912
    two -1.927954  0.247929
qux one  0.030065  1.202135
    two -0.169432 -1.266427
df2 = df[:4]
df2
                A         B
A   B
bar one  0.090815  0.776030
    two  1.217861 -0.726099
baz one  0.801053 -1.271264
    two  0.785470  0.514176
2) Use stack to move the columns of df2 into the rows and assign the result to stacked.
stacked = df2.stack()
stacked
A    B
bar  one  A    0.090815
          B    0.776030
     two  A    1.217861
          B   -0.726099
baz  one  A    0.801053
          B   -1.271264
     two  A    0.785470
          B    0.514176
dtype: float64
3) Use unstack to move rows back to columns; by default the innermost index level (level=-1, here level 2) is unstacked
stacked.unstack()
                A         B
A   B
bar one  0.090815  0.776030
    two  1.217861 -0.726099
baz one  0.801053 -1.271264
    two  0.785470  0.514176
4) Unstack level 1 of stacked, moving the middle index level to the columns
stacked.unstack(1)
B             one       two
A
bar A    0.090815  1.217861
    B    0.776030 -0.726099
baz A    0.801053  0.785470
    B   -1.271264  0.514176
5) Unstack level 0 of stacked, moving the outermost index level to the columns.
stacked.unstack(0)
A           bar       baz
B
one A  0.090815  0.801053
    B  0.776030 -1.271264
two A  1.217861  0.785470
    B -0.726099  0.514176
6) Unstack stacked by level name 'A', which likewise moves the outermost level to the columns
stacked.unstack('A')
A           bar       baz
B
one A  0.090815  0.801053
    B  0.776030 -1.271264
two A  1.217861  0.785470
    B -0.726099  0.514176
15. melt: unpivot DataFrame columns into rows.
1) Create a new DataFrame named df
df=pd.DataFrame(np.arange(8).reshape(2,4),index=['AA','BB'],columns=['A','B','C','D'])
df
    A  B  C  D
AA  0  1  2  3
BB  4  5  6  7
2) Use melt to unpivot df: keep the original columns A and C, unpivot columns B and D, rename the variable column to B|D, and name the value column (B|D)_value.
pd.melt(df,id_vars=['A','C'],value_vars=['B','D'],var_name='B|D',value_name='(B|D)_value')
   A  C B|D  (B|D)_value
0  0  2   B            1
1  4  6   B            5
2  0  2   D            3
3  4  6   D            7
16. sub: subtract a row or column from a DataFrame.
1) Create a new DataFrame named df
df = pd.DataFrame({'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                   'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                   'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df
        one       two     three
a -0.553244  0.030308       NaN
b  0.827353  0.143289 -0.986206
c -0.313427 -0.875161  0.074300
d       NaN -0.164881  0.229687
2) Take the row of df at position 1 and assign it to row, take the column named two and assign it to column, then use sub to subtract row from df with axis='columns' or axis=1.
row=df.iloc[1]
row
one 0.827353
two 0.143289
three -0.986206
Name: b, dtype: float64
column=df['two']
column
a 0.030308
b 0.143289
c -0.875161
d -0.164881
Name: two, dtype: float64
df.sub(row,axis='columns')
        one       two     three
a -1.380597 -0.112981       NaN
b  0.000000  0.000000  0.000000
c -1.140780 -1.018450  1.060507
d       NaN -0.308170  1.215894
df.sub(row,axis=1)
        one       two     three
a -1.380597 -0.112981       NaN
b  0.000000  0.000000  0.000000
c -1.140780 -1.018450  1.060507
d       NaN -0.308170  1.215894
3) Use sub to subtract column from df with axis='index' or axis=0.
df.sub(column,axis='index')
        one  two     three
a -0.583553  0.0       NaN
b  0.684063  0.0 -1.129495
c  0.561734  0.0  0.949461
d       NaN  0.0  0.394568
df.sub(column,axis=0)
        one  two     three
a -0.583553  0.0       NaN
b  0.684063  0.0 -1.129495
c  0.561734  0.0  0.949461
d       NaN  0.0  0.394568
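A note on the design: df.sub(row, axis='columns') is the method form of the broadcasting subtraction df - row (the operator aligns the Series on the column labels), while subtracting along the index (axis='index') has no operator shorthand. A minimal check, reusing df and row from above:
# Both align row on the column labels, so the two results are identical (including the NaN positions)
df.sub(row, axis='columns').equals(df - row)   # True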
17. Deletion
1) Drop a given row, returning a new DataFrame and leaving the original unchanged.
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,'k2':[1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
data.drop(5,axis=0)
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
6  two   4
2) Drop a given row in place, modifying data itself.
data.drop(3,inplace=True)
data
    k1  k2
0  one   1
1  one   1
2  one   2
4  two   3
5  two   4
6  two   4
3) Drop a given column, returning a new DataFrame and leaving the original unchanged.
data.drop('k1',axis=1)
   k2
0   1
1   1
2   2
4   3
5   4
6   4