Pandas is a library built on top of NumPy that provides the efficient DataFrame data structure. A DataFrame is essentially a two-dimensional array with row and column labels, and one of its most important features is its support for handling missing values. NumPy must be installed before Pandas; the scientific distribution Anaconda ships with both NumPy and Pandas preinstalled.
A common import convention is as follows:
import numpy as np
import pandas as pd
A Pandas Series object is essentially an indexed one-dimensional array, while a DataFrame is two-dimensional. A Series can be built from an array; pass the index parameter to specify the index values, or omit it and Pandas will generate a default integer index. A Series can also be created from a scalar (the value is repeated for every index label) or from a Python dict (the keys become the index; older Pandas versions sorted the keys, while modern versions preserve insertion order).
data=pd.Series(np.arange(5),index=['a','b','c','d','e'])
data
a 0
b 1
c 2
d 3
e 4
dtype: int64
data.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
data.values
array([0, 1, 2, 3, 4])
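Besides an array, the scalar and dict forms mentioned above work like this (a short illustrative sketch):

```python
import pandas as pd

# From a scalar: the value is broadcast to every index label
s1 = pd.Series(5, index=['a', 'b', 'c'])

# From a dict: the keys become the index
s2 = pd.Series({'b': 1, 'a': 0, 'c': 2})

print(s1)
print(s2)
```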
The following creates a DataFrame object from two Series objects, population and area.
population_dict={'California':38, 'Texas':26, 'New York':19, 'Florida':19, 'Illinois':12}
population=pd.Series(population_dict)
population
California 38
Florida 19
Illinois 12
New York 19
Texas 26
dtype: int64
area_dict={'California':423, 'Texas':170, 'New York':150, 'Florida':141, 'Illinois':695}
area=pd.Series(area_dict)
area
states=pd.DataFrame({'population':population, 'area':area})
states
states.index
states.columns
The Index objects of Pandas Series and DataFrame are very useful in their own right: an Index can be viewed as an immutable array used for indexing, and Index objects support set operations such as intersection, union, and symmetric difference (historically spelled with the & | ^ operators).
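For example, set operations on two Index objects can be sketched as follows (newer pandas versions deprecated the & | ^ operator spellings in favor of the methods shown here):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB))          # intersection
print(indA.union(indB))                 # union
print(indA.symmetric_difference(indB))  # symmetric difference (xor)
```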
When indexing a Pandas object you can use dict-style lookup or array-style slicing and indexing, or the dedicated indexers loc (explicit, referring to index labels) and iloc (implicit, referring to integer positions); the hybrid ix indexer existed in older versions but has since been removed.
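A small sketch of the difference between loc and iloc, using a Series whose explicit index is itself made of integers (the case where the distinction matters most):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: explicit, uses the index labels; slices include the endpoint
print(data.loc[1])
print(data.loc[1:3])

# iloc: implicit, uses integer positions; slices exclude the endpoint
print(data.iloc[1])
print(data.iloc[1:3])
```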
Pandas inherits NumPy's basic computation machinery and extends it. In short, unary operations (ordinary functions, trigonometric functions, and so on) preserve index and column labels in the output, while binary operations (addition, subtraction, multiplication, division, etc.) automatically align on the indices before computing, which is one of Pandas' most distinctive features.
rng=np.random.RandomState(42)
ser=pd.Series(rng.randint(0,10,4))
ser
df=pd.DataFrame(rng.randint(0,10,(3,4)),columns=['A','B','C','D'])
df
np.sin(df*np.pi/4)
# Index-alignment example: Pandas fills unaligned positions with NaN by
# default, but the fill value can also be specified explicitly. For example:
A=pd.Series([1,5,8],index=[0,1,2])
B=pd.Series([3,6,9],index=[1,2,3])
A+B
A.add(B,fill_value=0) # fill the missing positions with 0 before computing
fill=B.mean() # fill with B's mean (6); for a 2-D object, use B.stack().mean() to flatten it first
A.add(B,fill_value=fill)
Pandas represents missing values in two main ways: a global mask array, or a sentinel value. The missing values involved come in roughly three forms: null, NaN, and NA.
vals1=np.array([1,None,3,4]) # None is a Python-level missing value; operations such as sum cannot be applied to this array
vals1
vals2=np.array([1,np.nan,2,3]) # operations work, but may not give a useful result, since anything computed with nan yields nan
vals2
Pandas offers several common methods for handling missing values: isnull() and notnull() detect them, dropna() removes them, and fillna() replaces them.
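A minimal sketch of the standard missing-value methods isnull(), dropna(), and fillna() on a Series and a DataFrame (the sample data here is made up for illustration):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

print(data.isnull())   # boolean mask marking the missing entries
print(data.dropna())   # drop the missing entries
print(data.fillna(0))  # replace the missing entries with 0

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
print(df.dropna(axis='columns'))  # drop every column that contains a NaN
print(df.ffill())                 # forward-fill down each column
```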
index=[('California',2000),('California',2001),
('New York',2000),('New York',2001),
('Texas',2000),('Texas',2001)]
populations=[33,37,19,19,20,25]
index=pd.MultiIndex.from_tuples(index)
index
pop=pd.Series(populations,index=index) # if an index was assigned earlier, pop.reindex(index) can also reset it
pop
pop[:,2000] # slice the multi-level index to query the needed values quickly
pop_df=pop.unstack() # unstack converts the multi-level index into an ordinary DataFrame
pop_df
pop_df.stack() # conversely, stack converts the DataFrame back to a multi-level index
To create a multi-level index explicitly, use pd.MultiIndex.from_arrays (from arrays), from_tuples (from tuples), or from_product (from the Cartesian product of two indices), or construct one directly as pd.MultiIndex(levels=[…], codes=[…]). Updating an index uses the reindex method, and resetting it uses reset_index. You can also name the levels with index.names=['','']. Column indices can be multi-level too (columns=pd.MultiIndex…); with hierarchical indices on both rows and columns, data summarization becomes much more convenient.
df1=pd.DataFrame({'employee':['Bob','Jake','Lisa','Sue'], 'group':['Accounting','Engineering','Engineering','HR']})
df2=pd.DataFrame({'employee':['Bob','Jake','Lisa','Sue'], 'hire_date':[2004,2008,2012,2014]})
df3=pd.merge(df1,df2) # one-to-one join
df4=pd.DataFrame({'group':['Accounting','Engineering','HR'], 'Supervisor':['Carly','Guido','Steve']})
df5=pd.merge(df3,df4) # many-to-one join; duplicated values are preserved automatically
df6=pd.DataFrame({'group':['Accounting','Accounting', 'Engineering','Engineering','HR','HR'], 'skills':['math','spreadsheets','coding','linux', 'spreadsheets','organization']})
df7=pd.merge(df5,df6) # many-to-many join
df8=pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'], 'salary':[70,80,120,90]})
# When the key column names match, the on parameter may be given explicitly or
# omitted; when they differ, merge as follows:
df9=pd.merge(df7,df8,left_on='employee',right_on='name').drop('name',axis=1)
'''
left_on and right_on name the columns to merge on; drop then removes the duplicated column.
Merging on the index is also possible with left_index=True, right_index=True,
and left_on can even be mixed with right_index.
When duplicated column names must all be kept, set the suffixes parameter to append a suffix to each.
'''
print(df1)
print(df2)
print(df3)
print(df4)
print(df5)
print(df6)
print(df7)
print(df8)
print(df9)
(only row 0 of each printed table survived in the transcript)
  employee       group
0      Bob  Accounting

  employee  hire_date
0      Bob       2004

  employee       group  hire_date
0      Bob  Accounting       2004

  Supervisor       group
0      Carly  Accounting

  employee       group  hire_date Supervisor
0      Bob  Accounting       2004      Carly

        group skills
0  Accounting   math

  employee       group  hire_date Supervisor skills
0      Bob  Accounting       2004      Carly   math

  name  salary
0  Bob      70

  employee       group  hire_date Supervisor skills  salary
0      Bob  Accounting       2004      Carly   math      70

import seaborn as sns
planets=sns.load_dataset('planets') # use the planets table from the Seaborn sample datasets as an example
print(planets.shape)
print(planets.head())
(1035, 6)
            method  number  orbital_period  mass  distance  year
0  Radial Velocity       1         269.300  7.10     77.40  2006
print(planets.groupby('method')['orbital_period'].max())
method
planets.groupby('method')['year'].describe().unstack()
method
count Astrometry 2.000000
Eclipse Timing Variations 9.000000
Imaging 38.000000
Microlensing 23.000000
Orbital Brightness Modulation 3.000000
Pulsar Timing 5.000000
Pulsation Timing Variations 1.000000
Radial Velocity 553.000000
Transit 397.000000
Transit Timing Variations 4.000000
mean Astrometry 2011.500000
Eclipse Timing Variations 2010.000000
Imaging 2009.131579
Microlensing 2009.782609
Orbital Brightness Modulation 2011.666667
Pulsar Timing 1998.400000
Pulsation Timing Variations 2007.000000
Radial Velocity 2007.518987
Transit 2011.236776
Transit Timing Variations 2012.500000
std Astrometry 2.121320
Eclipse Timing Variations 1.414214
Imaging 2.781901
Microlensing 2.859697
Orbital Brightness Modulation 1.154701
Pulsar Timing 8.384510
Pulsation Timing Variations NaN
Radial Velocity 4.249052
Transit 2.077867
Transit Timing Variations 1.290994
...
50% Astrometry 2011.500000
Eclipse Timing Variations 2010.000000
Imaging 2009.000000
Microlensing 2010.000000
Orbital Brightness Modulation 2011.000000
Pulsar Timing 1994.000000
Pulsation Timing Variations 2007.000000
Radial Velocity 2009.000000
Transit 2012.000000
Transit Timing Variations 2012.500000
75% Astrometry 2012.250000
Eclipse Timing Variations 2011.000000
Imaging 2011.000000
Microlensing 2012.000000
Orbital Brightness Modulation 2012.000000
Pulsar Timing 2003.000000
Pulsation Timing Variations 2007.000000
Radial Velocity 2011.000000
Transit 2013.000000
Transit Timing Variations 2013.250000
max Astrometry 2013.000000
Eclipse Timing Variations 2012.000000
Imaging 2013.000000
Microlensing 2013.000000
Orbital Brightness Modulation 2013.000000
Pulsar Timing 2011.000000
Pulsation Timing Variations 2007.000000
Radial Velocity 2014.000000
Transit 2014.000000
Transit Timing Variations 2014.000000
Length: 80, dtype: float64
titanic=sns.load_dataset('titanic')
print(titanic.head())
titanic.pivot_table('survived',index='sex',columns='class')
   survived  pclass   sex   age  sibsp  parch    fare embarked  class  \
0         0       3  male  22.0      1      0  7.2500        S  Third

   who  adult_male deck  embark_town alive  alone
0  man        True  NaN  Southampton    no  False

class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
age=pd.cut(titanic['age'],[0,18,30,60,80])
titanic.pivot_table('survived',['sex',age],'class') # the result shows that young women survived at about 95.8% and young men at about 14.7%, which is why Rose lived and Jack died
class                First    Second     Third
sex    age
female (0, 18]    0.909091  1.000000  0.511628
       (18, 30]   0.958333  0.900000  0.500000
       (30, 60]   0.979167  0.900000  0.272727
       (60, 80]   1.000000       NaN  1.000000
male   (0, 18]    0.800000  0.600000  0.215686
       (18, 30]   0.428571  0.027027  0.147541
       (30, 60]   0.412698  0.090909  0.118421
       (60, 80]   0.083333  0.333333  0.000000
fare=pd.qcut(titanic['fare'],3) # the fares can also be cut into quantile bins for a further breakdown
titanic.pivot_table('survived',['sex',age],[fare,'class'])
fare             (-0.001, 8.662]                  (8.662, 26.0]            \
female (18, 30]      NaN  0.611111   NaN   0.880000  0.411765
       (30, 60]      NaN  0.000000   1.0   0.875000  0.416667
       (60, 80]      NaN       NaN   NaN        NaN  1.000000
male   (0, 18]       NaN  0.166667   NaN   0.500000  0.500000
       (18, 30]      NaN  0.139785   NaN   0.033333  0.148148
       (30, 60]      0.0  0.116667   0.0   0.111111  0.000000
       (60, 80]      NaN  0.000000   0.0   0.333333       NaN

fare             (26.0, 512.329]
female (18, 30]         0.958333  1.0  0.000000
       (30, 60]         0.978261  1.0  0.142857
       (60, 80]         1.000000  NaN       NaN
male   (0, 18]          0.800000  0.8  0.052632
       (18, 30]         0.428571  0.0  0.500000
       (30, 60]         0.448276  0.0  0.500000
       (60, 80]         0.090909  NaN       NaN