Basic pandas Functions

Objective
Become proficient with the basic pandas functions.

Background

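Column-to-row methods
stack: pandas.DataFrame.stack(self, level=-1, dropna=True). For a DataFrame with a single-level column index, it moves the column labels into the innermost level of the row index and returns a Series.
For a DataFrame with hierarchical (MultiIndex) columns, it moves the specified column level into the rows; by default the innermost column level becomes the innermost row level.
unstack: pandas.DataFrame.unstack(self, level=-1, fill_value=None). Applied to a DataFrame with a single-level row index, it moves the column labels into the outermost level of the row index and returns a Series. With a hierarchical row index it is the inverse of stack: it moves the specified row level (the innermost by default) into the innermost column level.
melt: pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None). id_vars lists the original columns to keep as identifiers; value_vars lists the columns to unpivot into rows.
var_name names the column that holds the former column labels (default: variable); value_name names the column that holds their values (default: value).

Row-to-column methods
unstack: pandas.DataFrame.unstack(self, level=-1, fill_value=None)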
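
The three reshaping functions described above can be tried on a tiny frame first. The sketch below uses made-up data (the names wide, stacked, restored and long_form are illustrative only) and relies on the default arguments; recent pandas versions may print a FutureWarning for stack.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
wide = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])

# stack: the column labels become the innermost row-index level -> Series
stacked = wide.stack()
print(stacked)

# unstack: moves that innermost row level back into the columns
restored = stacked.unstack()
print(restored)

# melt: keep 'x' as an identifier column, unpivot 'y' into (variable, value) rows
long_form = pd.melt(wide, id_vars=['x'], value_vars=['y'])
print(long_form)
```

stack and unstack work on index levels, while melt ignores the row index and produces a fresh 0..n-1 index.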

Environment

Python 3.6.1
Jupyter

Lab Content
Practice using the main basic pandas functions.

Code

import pandas as pd
import numpy as np

1. Create a DataFrame named df

df = pd.DataFrame({'A':np.random.randint(1, 100, 4),'B':pd.date_range(start='20130101', periods=4, freq='D'),'C':pd.Series([1, 2, 3, 4],index=['zhang', 'li', 'zhou', 'wang'],dtype='float32'),'D':np.array([3] * 4,dtype='int32'), 'E':pd.Categorical(["test","train","test","train"]),'F':'foo'})  
df
A B C D E F
zhang 12 2013-01-01 1.0 3 test foo
li 40 2013-01-02 2.0 3 train foo
zhou 26 2013-01-03 3.0 3 test foo
wang 27 2013-01-04 4.0 3 train foo

2. Viewing the data

df.head()  # first five rows by default
A B C D E F
zhang 12 2013-01-01 1.0 3 test foo
li 40 2013-01-02 2.0 3 train foo
zhou 26 2013-01-03 3.0 3 test foo
wang 27 2013-01-04 4.0 3 train foo
df.head(3)  # first three rows
A B C D E F
zhang 12 2013-01-01 1.0 3 test foo
li 40 2013-01-02 2.0 3 train foo
zhou 26 2013-01-03 3.0 3 test foo
df.tail(2)  # last two rows
A B C D E F
zhou 26 2013-01-03 3.0 3 test foo
wang 27 2013-01-04 4.0 3 train foo

3. Viewing the index, column names, and values

df.index
Index(['zhang', 'li', 'zhou', 'wang'], dtype='object')
df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
df.values
array([[12, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [40, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'foo'],
       [26, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'foo'],
       [27, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'foo']],
      dtype=object)

4. Summary statistics

df.describe()
A C D
count 4.000000 4.000000 4.0
mean 26.250000 2.500000 3.0
std 11.441882 1.290994 0.0
min 12.000000 1.000000 3.0
25% 22.500000 1.750000 3.0
50% 26.500000 2.500000 3.0
75% 30.250000 3.250000 3.0
max 40.000000 4.000000 3.0

5. Transposing

df.T
zhang li zhou wang
A 12 40 26 27
B 2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00
C 1 2 3 4
D 3 3 3 3
E test train test train
F foo foo foo foo

6. Sorting

df.sort_index(axis=0, ascending=False)  # sort by row index, descending
A B C D E F
zhou 26 2013-01-03 3.0 3 test foo
zhang 12 2013-01-01 1.0 3 test foo
wang 27 2013-01-04 4.0 3 train foo
li 40 2013-01-02 2.0 3 train foo
df.sort_index(axis=0, ascending=True)  # sort by row index, ascending
A B C D E F
li 40 2013-01-02 2.0 3 train foo
wang 27 2013-01-04 4.0 3 train foo
zhang 12 2013-01-01 1.0 3 test foo
zhou 26 2013-01-03 3.0 3 test foo
df.sort_index(axis=1, ascending=False)  # sort by column name, descending
F E D C B A
zhang foo test 3 1.0 2013-01-01 12
li foo train 3 2.0 2013-01-02 40
zhou foo test 3 3.0 2013-01-03 26
wang foo train 3 4.0 2013-01-04 27
df.sort_values(by='A')  # sort by the values in column A, ascending
A B C D E F
zhang 12 2013-01-01 1.0 3 test foo
zhou 26 2013-01-03 3.0 3 test foo
wang 27 2013-01-04 4.0 3 train foo
li 40 2013-01-02 2.0 3 train foo
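
sort_values also accepts several keys with a separate direction for each. The following is a minimal sketch on made-up data (scores and ordered are illustrative names, not part of the lab).

```python
import pandas as pd

# Hypothetical data: two groups with distinct scores.
scores = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                       'score': [1, 3, 2, 4]},
                      index=['w', 'x', 'y', 'z'])

# Sort by group ascending, then by score descending within each group.
ordered = scores.sort_values(by=['group', 'score'], ascending=[True, False])
print(ordered)
```

Like the calls above, this returns a new DataFrame and leaves scores unchanged.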

7. Handling duplicates

data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4, 'k2':[1, 1, 2, 3, 3, 4, 4]})  
data
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4

1) Detect duplicate rows

data.duplicated()
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

2) Drop duplicate rows, returning a new DataFrame

data.drop_duplicates() 
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4

3) Drop rows with duplicate values in column k1, keeping the last occurrence of each value (keep='last').

data.drop_duplicates(['k1'],keep='last')  
k1 k2
2 one 2
6 two 4
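
The keep argument decides which occurrence of each duplicate survives. A short comparison on the same data, as a sketch (first, last and none are illustrative names):

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

first = data.drop_duplicates(['k1'], keep='first')  # keeps rows 0 and 3
last = data.drop_duplicates(['k1'], keep='last')    # keeps rows 2 and 6
none = data.drop_duplicates(['k1'], keep=False)     # drops every duplicated row

print(first.index.tolist(), last.index.tolist(), len(none))
```

Every k1 value occurs more than once here, so keep=False leaves an empty frame.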

8. Mapping

1) Map with a function: convert the values in column k1 of data to uppercase

data['k1']=data['k1'].map(str.upper)  
data  
k1 k2
0 ONE 1
1 ONE 1
2 ONE 2
3 TWO 3
4 TWO 3
5 TWO 4
6 TWO 4

2) Map with a dictionary: convert the values in column k1 of data back to lowercase.

data['k1']=data['k1'].map({'ONE':'one','TWO':'two'})  
data 
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4

3) Map with a lambda expression: add 5 to each value in column k2 of data.

data['k2'] = data['k2'].map(lambda x:x+5)  
data
k1 k2
0 one 6
1 one 6
2 one 7
3 two 8
4 two 8
5 two 9
6 two 9

4) Map with a lambda expression: add 5 to each index value of data.

data.index = data.index.map(lambda x:x+5)  
data 
k1 k2
5 one 6
6 one 6
7 one 7
8 two 8
9 two 8
10 two 9
11 two 9

5) Map with a function (str.upper): convert the column names of data to uppercase

data.columns=data.columns.map(str.upper)  
data  
K1 K2
5 one 6
6 one 6
7 one 7
8 two 8
9 two 8
10 two 9
11 two 9
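
One thing to watch with dictionary mapping: a value with no matching key becomes NaN rather than staying unchanged. The sketch below uses a made-up Series s to show this, next to the vectorized str.lower accessor as an alternative for pure case conversion.

```python
import pandas as pd

# Hypothetical Series; 'THREE' is deliberately absent from the dictionary.
s = pd.Series(['ONE', 'TWO', 'THREE'])

mapped = s.map({'ONE': 'one', 'TWO': 'two'})  # 'THREE' has no key -> NaN
lowered = s.str.lower()                       # converts every value

print(mapped)
print(lowered)
```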

9. Discretization

from random import randrange  
data=[randrange(100) for _ in range(10)]  
category=[0,25,50,100]  
pd.cut(data,category) 
[(50, 100], (50, 100], (0, 25], (50, 100], (50, 100], (50, 100], (0, 25], (50, 100], (50, 100], (50, 100]]
Categories (3, interval[int64]): [(0, 25] < (25, 50] < (50, 100]]

1) Cut data by category with right=False, which gives left-closed, right-open intervals

pd.cut(data,category,right=False) 
[[50, 100), [50, 100), [0, 25), [50, 100), [50, 100), [50, 100), [0, 25), [50, 100), [50, 100), [50, 100)]
Categories (3, interval[int64]): [[0, 25) < [25, 50) < [50, 100)]

2) Cut data by category with right=False (left-closed, right-open intervals) and label each interval

labels = ['low', 'middle', 'high']  
pd.cut(data,category,right=False,labels=labels)
['high', 'high', 'low', 'high', 'high', 'high', 'low', 'high', 'high', 'high']
Categories (3, object): ['low' < 'middle' < 'high']

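3) Cut data into 4 equal-width bins. (Passing an integer to pd.cut splits the value range evenly; it does not split by quartiles, which is what pd.qcut does.)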

data  
pd.cut(data,4) 
[(45.5, 66.75], (66.75, 88.0], (2.915, 24.25], (66.75, 88.0], (45.5, 66.75], (66.75, 88.0], (2.915, 24.25], (45.5, 66.75], (66.75, 88.0], (45.5, 66.75]]
Categories (4, interval[float64]): [(2.915, 24.25] < (24.25, 45.5] < (45.5, 66.75] < (66.75, 88.0]]
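
The difference between equal-width bins (pd.cut) and equal-frequency bins (pd.qcut) is easiest to see on skewed data. A minimal sketch with made-up values; labels=False is used so that each call returns integer bin codes that can be counted directly.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: seven small values and one outlier.
values = [1, 2, 3, 4, 5, 6, 7, 100]

width_codes = pd.cut(values, 4, labels=False)   # 4 equal-width bins over [1, 100]
freq_codes = pd.qcut(values, 4, labels=False)   # 4 bins split at the quartiles

width_counts = np.bincount(width_codes, minlength=4).tolist()
freq_counts = np.bincount(freq_codes, minlength=4).tolist()

print(width_counts)  # almost everything falls into the first bin
print(freq_counts)   # each bin receives the same number of values
```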

10. Frequency counts and shifting

1) Copy df to df1 with the copy method, then shift the data in df1 down by one row with shift (a negative argument shifts up).

df1=df.copy()  
df1.shift(1)  
A B C D E F
zhang NaN NaT NaN NaN NaN NaN
li 12.0 2013-01-01 1.0 3.0 test foo
zhou 40.0 2013-01-02 2.0 3.0 train foo
wang 26.0 2013-01-03 3.0 3.0 test foo
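
The direction of shift is easier to check on a small Series. The sketch below uses made-up numbers and also shows that subtracting the shifted Series gives the row-to-row change.

```python
import pandas as pd

# Hypothetical Series used only to illustrate the direction of shift.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

down = s.shift(1)       # values move down one row; the first row becomes NaN
up = s.shift(-1)        # values move up one row; the last row becomes NaN
change = s - s.shift(1) # change relative to the previous row

print(down.tolist(), up.tolist(), change.tolist())
```

The index labels stay in place; only the values move.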

2) Count how often each value occurs in column D of df1

df1['D'].value_counts()
3    4
Name: D, dtype: int64

11. Pivoting

df = pd.DataFrame({'a':[1,2,3,4],'b':[2,3,4,5],'c':[3,4,5,6], 'd':[3,3,3,3]}) 
df
a b c d
0 1 2 3 3
1 2 3 4 3
2 3 4 5 3
3 4 5 6 3

1) Build a pivot table from df with column a as the index, column b as the column labels, and column c as the values.

df.pivot(index='a', columns='b', values='c')
b    2    3    4    5
a
1  3.0  NaN  NaN  NaN
2  NaN  4.0  NaN  NaN
3  NaN  NaN  5.0  NaN
4  NaN  NaN  NaN  6.0

2) Build a pivot table from df with column a as the index, column b as the column labels, and column d as the values

df.pivot(index='a', columns='b', values='d')
b    2    3    4    5
a
1  3.0  NaN  NaN  NaN
2  NaN  3.0  NaN  NaN
3  NaN  NaN  3.0  NaN
4  NaN  NaN  NaN  3.0
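
pivot requires every (index, columns) pair to be unique. When pairs repeat, pivot_table with an aggregation function is one common way around that. The sketch below uses made-up data; the names sales, pivot_failed and table are illustrative only.

```python
import pandas as pd

# Hypothetical data in which the pair ('mon', 'x') appears twice.
sales = pd.DataFrame({'day': ['mon', 'mon', 'tue'],
                      'item': ['x', 'x', 'y'],
                      'qty': [1, 2, 5]})

try:
    sales.pivot(index='day', columns='item', values='qty')
    pivot_failed = False
except ValueError:
    # pivot cannot decide which of the duplicate entries to keep
    pivot_failed = True

# pivot_table aggregates duplicates instead (here: sum)
table = sales.pivot_table(index='day', columns='item', values='qty', aggfunc='sum')
print(pivot_failed)
print(table)
```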

12. Differencing

1) Create a new DataFrame named df

df = pd.DataFrame({'a':np.random.randint(1, 100, 10),'b':np.random.randint(1, 100, 10)},index=map(str, range(10)))  
df 
a b
0 26 26
1 16 6
2 48 30
3 2 52
4 21 86
5 81 63
6 99 2
7 9 73
8 38 2
9 51 59

2) First-order difference along the rows of df (each row minus the previous row)

df.diff()
      a     b
0   NaN   NaN
1 -10.0 -20.0
2  32.0  24.0
3 -46.0  22.0
4  19.0  34.0
5  60.0 -23.0
6  18.0 -61.0
7 -90.0  71.0
8  29.0 -71.0
9  13.0  57.0

3) First-order difference along the columns of df (each column minus the column to its left)

df.diff(axis=1)
    a     b
0 NaN   0.0
1 NaN -10.0
2 NaN -18.0
3 NaN  50.0
4 NaN  65.0
5 NaN -18.0
6 NaN -97.0
7 NaN  64.0
8 NaN -36.0
9 NaN   8.0

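4) Difference each row of df against the row two positions earlier (periods=2). This is a lag-2 difference, not a second-order difference.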

df.diff(periods=2)
      a     b
0   NaN   NaN
1   NaN   NaN
2  22.0   4.0
3 -14.0  46.0
4 -27.0  56.0
5  79.0  11.0
6  78.0 -84.0
7 -72.0  10.0
8 -61.0   0.0
9  42.0 -14.0
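
A lag-2 difference and a second-order difference are easy to confuse. The sketch below contrasts them on a made-up Series of squares, where the second-order difference is constant.

```python
import pandas as pd

# Hypothetical Series: the squares 1, 4, 9, 16, 25.
s = pd.Series([1, 4, 9, 16, 25])

lag2 = s.diff(periods=2)        # s[i] - s[i-2]
second_order = s.diff().diff()  # difference of the first differences

print(lag2.tolist())
print(second_order.tolist())
```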

13. Correlation coefficients

1) Create a new DataFrame named df

df = pd.DataFrame({'A':np.random.randint(1, 100, 10),'B':np.random.randint(1, 100, 10),'C':np.random.randint(1, 100, 10)})
df
A B C
0 5 69 16
1 1 59 17
2 90 92 62
3 22 37 72
4 23 49 80
5 26 10 89
6 10 13 7
7 45 45 65
8 17 43 56
9 73 48 44

2) Compute the Pearson correlation matrix of df (the default method)

df.corr()
A B C
A 1.000000 0.429682 0.357246
B 0.429682 1.000000 -0.127607
C 0.357246 -0.127607 1.000000

3) Compute the Kendall correlation matrix of df

df.corr('kendall')
A B C
A 1.000000 0.066667 0.333333
B 0.066667 1.000000 -0.244444
C 0.333333 -0.244444 1.000000

4) Compute the Spearman rank correlation matrix of df

df.corr('spearman')
A B C
A 1.000000 0.042424 0.527273
B 0.042424 1.000000 -0.272727
C 0.527273 -0.272727 1.000000
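
The methods can disagree because Pearson measures linear association while Spearman works on ranks. A small sketch on made-up data where y is a monotonic but non-linear function of x (df_demo is an illustrative name):

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly.
df_demo = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df_demo['y'] = df_demo['x'] ** 3

pearson = df_demo.corr(method='pearson').loc['x', 'y']
spearman = df_demo.corr(method='spearman').loc['x', 'y']

print(pearson)   # below 1, because the relationship is not a straight line
print(spearman)  # 1, because the rank order of x and y is identical
```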

14. Reshaping

1) Create a DataFrame named df and assign its first 4 rows to df2.

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))  
tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['A', 'B'])  
index
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['A', 'B'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])  
df
                A         B
A   B
bar one  0.090815  0.776030
    two  1.217861 -0.726099
baz one  0.801053 -1.271264
    two  0.785470  0.514176
foo one -0.066081 -0.724912
    two -1.927954  0.247929
qux one  0.030065  1.202135
    two -0.169432 -1.266427
df2 = df[:4]  
df2  
                A         B
A   B
bar one  0.090815  0.776030
    two  1.217861 -0.726099
baz one  0.801053 -1.271264
    two  0.785470  0.514176

2) Use stack to move the columns of df2 into the rows and assign the result to stacked.

stacked = df2.stack()
stacked
A    B     
bar  one  A    0.090815
          B    0.776030
     two  A    1.217861
          B   -0.726099
baz  one  A    0.801053
          B   -1.271264
     two  A    0.785470
          B    0.514176
dtype: float64

3) Use unstack to move a row level of stacked back into the columns; the default level=-1 unstacks the innermost level (level 2 here)

stacked.unstack()
                A         B
A   B
bar one  0.090815  0.776030
    two  1.217861 -0.726099
baz one  0.801053 -1.271264
    two  0.785470  0.514176

4) Use unstack on stacked with level=1 to move the middle level into the columns

stacked.unstack(1) 
B           one       two
A
bar A  0.090815  1.217861
    B  0.776030 -0.726099
baz A  0.801053  0.785470
    B -1.271264  0.514176

5) Use unstack on stacked with level=0 to move the outermost level into the columns.

stacked.unstack(0)  
A           bar       baz
B
one A  0.090815  0.801053
    B  0.776030 -1.271264
two A  1.217861  0.785470
    B -0.726099  0.514176

6) Use unstack on stacked with level='A', which selects the outermost level by name

stacked.unstack('A')
A           bar       baz
B
one A  0.090815  0.801053
    B  0.776030 -1.271264
two A  1.217861  0.785470
    B -0.726099  0.514176
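
Selecting a level by position and by name gives the same result, and fill_value can replace the gaps that appear when some index combinations are missing. A sketch with a made-up Series (the level names first and second are illustrative):

```python
import pandas as pd

# Hypothetical Series; the combination ('baz', 'two') is deliberately missing.
idx = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one')],
    names=['first', 'second'])
s = pd.Series([1, 2, 3], index=idx)

by_pos = s.unstack(0)          # outermost level, selected by position
by_name = s.unstack('first')   # the same level, selected by name
filled = s.unstack(fill_value=0)  # innermost level; the missing cell becomes 0

print(by_pos.equals(by_name))
print(filled)
```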

15. melt: unpivot the columns of a DataFrame into rows.

1) Create a new DataFrame named df

df=pd.DataFrame(np.arange(8).reshape(2,4),index=['AA','BB'],columns=['A','B','C','D'])  
df 
A B C D
AA 0 1 2 3
BB 4 5 6 7

2) Use melt to unpivot df: keep the original columns A and C, unpivot columns B and D into rows, name the variable column B|D, and name the value column (B|D)_value.

pd.melt(df,id_vars=['A','C'],value_vars=['B','D'],var_name='B|D',value_name='(B|D)_value')
A C B|D (B|D)_value
0 0 2 B 1
1 4 6 B 5
2 0 2 D 3
3 4 6 D 7

16. sub: subtract a Series from a DataFrame, aligned on the rows or on the columns.

1) Create a new DataFrame named df

df=pd.DataFrame({ 'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])}) 
df
one two three
a -0.553244 0.030308 NaN
b 0.827353 0.143289 -0.986206
c -0.313427 -0.875161 0.074300
d NaN -0.164881 0.229687

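2) Take the row at position 1 of df as row and the column named two as column, then use sub to subtract row from every row of df (axis='columns' or 1).

# Take the row at position 1 of df and assign it to row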
row=df.iloc[1]  
row
one      0.827353
two      0.143289
three   -0.986206
Name: b, dtype: float64
# Take the column named two and assign it to column
column=df['two']  
column
a    0.030308
b    0.143289
c   -0.875161
d   -0.164881
Name: two, dtype: float64
# Subtract row from every row of df with sub; axis='columns' or 1.
df.sub(row,axis='columns') 
one two three
a -1.380597 -0.112981 NaN
b 0.000000 0.000000 0.000000
c -1.140780 -1.018450 1.060507
d NaN -0.308170 1.215894
df.sub(row,axis=1) 
one two three
a -1.380597 -0.112981 NaN
b 0.000000 0.000000 0.000000
c -1.140780 -1.018450 1.060507
d NaN -0.308170 1.215894

3) Use sub to subtract column from every column of df (axis='index' or 0).

# Subtract column from every column of df with sub; axis='index' or 0.
df.sub(column,axis='index') 
one two three
a -0.583553 0.0 NaN
b 0.684063 0.0 -1.129495
c 0.561734 0.0 0.949461
d NaN 0.0 0.394568
df.sub(column,axis=0) 
one two three
a -0.583553 0.0 NaN
b 0.684063 0.0 -1.129495
c 0.561734 0.0 0.949461
d NaN 0.0 0.394568
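
With round numbers the effect of the axis argument is easier to follow. The sketch below uses a small made-up frame (df_demo, by_row and by_col are illustrative names).

```python
import pandas as pd

# Hypothetical frame with easy-to-check numbers.
df_demo = pd.DataFrame({'one': [1.0, 2.0, 3.0],
                        'two': [10.0, 20.0, 30.0]},
                       index=['a', 'b', 'c'])

row = df_demo.iloc[1]      # row 'b': one=2.0, two=20.0
column = df_demo['two']    # column 'two'

# axis='columns': row is aligned on the column labels and subtracted from each row
by_row = df_demo.sub(row, axis='columns')

# axis='index': column is aligned on the row labels and subtracted from each column
by_col = df_demo.sub(column, axis='index')

print(by_row)
print(by_col)
```

df_demo - row gives the same result as the axis='columns' call, since the minus operator aligns a Series on the column labels by default.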

17. Dropping rows and columns

1) Drop a given row; this returns a new DataFrame and leaves the original unchanged.

data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,'k2':[1, 1, 2, 3, 3, 4, 4]})  
data  
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
data.drop(5,axis=0)  
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
6 two 4

2) Drop a given row in place, modifying data itself.

data.drop(3,inplace=True)  
data  
k1 k2
0 one 1
1 one 1
2 one 2
4 two 3
5 two 4
6 two 4

3) Drop a given column; this returns a new DataFrame and leaves the original unchanged.

data.drop('k1',axis=1)
k2
0 1
1 1
2 2
4 3
5 4
6 4
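
drop also accepts the index= and columns= keywords, which avoid having to remember the axis number, and a list of labels removes several at once. A minimal sketch on made-up data (demo, rows_dropped and cols_dropped are illustrative names):

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({'k1': ['one', 'two', 'three'], 'k2': [1, 2, 3]})

rows_dropped = demo.drop(index=[0, 2])     # same as demo.drop([0, 2], axis=0)
cols_dropped = demo.drop(columns=['k1'])   # same as demo.drop('k1', axis=1)

print(rows_dropped)
print(cols_dropped)
print(demo.shape)  # the original is unchanged because inplace was not set
```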
