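Contents
References:
Introduction to Pandas
Main data structures
Series
DataFrame
Index objects
Basic functionality
Data import and export
pandas Chinese-language site: https://www.pypandas.cn
Python for Data Analysis (book)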
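Pandas is a core Python library for data analysis, built on top of NumPy arrays. The biggest difference between the two is that pandas is designed for tabular and heterogeneous data (rows and columns can be given labels), whereas NumPy is better suited to homogeneous numerical arrays. The difference between a list and a dict is a useful analogy. In practice the two libraries are usually used together.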
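To make the list/dict analogy concrete, here is a minimal sketch that stores the same values in a NumPy array and in a pandas Series. The names heights_array and heights_series and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the same three values in both containers.
heights_array = np.array([1.62, 1.75, 1.80])
heights_series = pd.Series([1.62, 1.75, 1.80], index=['ann', 'bob', 'cat'])

# A NumPy array is addressed by integer position, much like a list.
second_by_position = heights_array[1]

# A Series keeps a label next to each value, so it can be addressed by name,
# much like a dict.
second_by_label = heights_series['bob']

print(second_by_position, second_by_label)
```

Both lookups return the same value; only the way of addressing it differs.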
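Definition:
A Series is a one-dimensional array-like object.
Composition: it consists of a sequence of values (of any NumPy data type) and an associated sequence of data labels, called its index. If no index is specified, a default integer index starting at 0 is assigned.
Example: create a simple Series.
import numpy as np
import pandas as pd
series_1 = pd.Series([3, 4, 5, 6, 5])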
series_1
0 3
1 4
2 5
3 6
4 5
dtype: int64
Example: get the array representation and the index of a Series through its values and index attributes:
series_1.values
array([3, 4, 5, 6, 5], dtype=int64)
series_1.index
RangeIndex(start=0, stop=5, step=1)
Example: specify an index for a Series:
series_2= pd.Series([3,4,5,6,5],index=['a','b','c','d','e'])
series_2
a 3
b 4
c 5
d 6
e 5
dtype: int64
Example: access values in a Series by a single index label or by a list of labels:
series_2['a']
3
series_2[['b','d','e']]
b 4
d 6
e 5
dtype: int64
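Example: NumPy functions and NumPy-like operations (filtering with a boolean array, scalar multiplication, applying math functions, and so on) preserve the link between each index label and its value: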
series_2*3
a 9
b 12
c 15
d 18
e 15
dtype: int64
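Scalar multiplication is only one of the operations mentioned. As a small sketch, the other two (boolean filtering and a NumPy math function) can be tried on the same data; np.sqrt is chosen here only as an illustration.

```python
import numpy as np
import pandas as pd

series_2 = pd.Series([3, 4, 5, 6, 5], index=['a', 'b', 'c', 'd', 'e'])

# Boolean filtering keeps only the entries whose value is greater than 4,
# together with their original labels.
filtered = series_2[series_2 > 4]

# A NumPy universal function returns a Series that carries the same index.
roots = np.sqrt(series_2)

print(filtered)
print(roots)
```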
Example: a Series can be created directly from a dict with pd.Series(dict); its index consists of the dict's keys:
dict_1={'name':'tom','age':3,'color':'blue'}
series_3=pd.Series(dict_1)
series_3
name tom
age 3
color blue
dtype: object
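Example: passing an explicit index list along with the dict controls the order of the resulting Series; a label with no matching key (here 'weigh') gets NaN: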
index_new=['age','name','weigh']
series_4=pd.Series(dict_1,index=index_new)
series_4
age 3
name tom
weigh NaN
dtype: object
Note: in pandas, NaN ("not a number") marks missing or NA values.
Example: the pandas functions isnull and notnull detect missing data; both are also available as Series methods:
pd.isnull(series_4)
age False
name False
weigh True
dtype: bool
pd.notnull(series_4)
age True
name True
weigh False
dtype: bool
series_4.isnull()
age False
name False
weigh True
dtype: bool
series_4.notnull()
age True
name True
weigh False
dtype: bool
Example: in arithmetic, Series are aligned automatically by index label; a label present in only one operand yields NaN:
series_5=pd.Series([1,2,3],index=['a','b','c'])
series_6=pd.Series([4,5,6],index=['b','c','d'])
series_5+series_6
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
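If the NaN results for unmatched labels are not wanted, one option is the add method with a fill_value. This is a sketch of that alternative, not something the example above requires.

```python
import pandas as pd

series_5 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_6 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# add() aligns on the index just like +, but fill_value=0 substitutes 0
# for a label that is missing on one side before adding.
total = series_5.add(series_6, fill_value=0)
print(total)
```

With fill_value=0, labels 'a' and 'd' keep the value from the one Series that contains them instead of becoming NaN.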
Example: both the Series object itself and its index have a name attribute:
series_5.name='value'
series_5.index.name='index'
series_5
index
a 1
b 2
c 3
Name: value, dtype: int64
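A DataFrame is a tabular data structure made up of an ordered collection of columns, each of which can hold a different value type.
A DataFrame has both a column index and a row index.
Internally, the data in a DataFrame is stored as one or more two-dimensional blocks.
Example: the most common way to create a DataFrame is to pass in a dict of equal-length lists or NumPy arrays: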
data={'food':['potato','rice','beef','dumpling'],'car':['BMW','ford','toyota','benz'],'fruit':['peach','apple','watermelon','pineapple']}
dataframe_1=pd.DataFrame(data)
dataframe_1
food car fruit
0 potato BMW peach
1 rice ford apple
2 beef toyota watermelon
3 dumpling benz pineapple
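The DataFrame automatically adds a row index to the data, by default integers starting at 0.
For a very large DataFrame, head shows only the first five rows: dataframe_1.head().
Example: display the columns in a specified order: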
pd.DataFrame(data,columns=['car','fruit','food'])
car fruit food
0 BMW peach potato
1 ford apple rice
2 toyota watermelon beef
3 benz pineapple dumpling
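Example: if a requested column is not found in the data, it appears in the result filled with missing values (the length of index must equal the actual number of rows):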
dataframe_2=pd.DataFrame(data,columns=['car','fruit','food','beautiful girl'],index=['one','two','three','four'])
dataframe_2
car fruit food beautiful girl
one BMW peach potato NaN
two ford apple rice NaN
three toyota watermelon beef NaN
four benz pineapple dumpling NaN
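Example: retrieve a DataFrame column as a Series, either with dict-like notation or as an attribute (attribute access works only when the column name is a valid Python identifier):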
dataframe_2['car']
one BMW
two ford
three toyota
four benz
Name: car, dtype: object
dataframe_2.food
one potato
two rice
three beef
four dumpling
Name: food, dtype: object
Example: retrieve a DataFrame row with the loc attribute:
dataframe_2.loc['two']
car ford
fruit apple
food rice
beautiful girl NaN
Name: two, dtype: object
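Example: assign values to a DataFrame column (a scalar is broadcast to every row; an array must have the same length as the DataFrame):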
dataframe_2['beautiful girl']=0
dataframe_2
car fruit food beautiful girl
one BMW peach potato 0
two ford apple rice 0
three toyota watermelon beef 0
four benz pineapple dumpling 0
# dataframe_2['beautiful girl'] = np.arange(6.)  # raises ValueError: 6 values do not match the 4 rows
dataframe_2['beautiful girl'] = np.arange(4.)
dataframe_2
car fruit food beautiful girl
one BMW peach potato 0.0
two ford apple rice 1.0
three toyota watermelon beef 2.0
four benz pineapple dumpling 3.0
Example: assigning a Series to a DataFrame column aligns on the index; rows whose label has no match in the Series become NaN:
dataframe_2
car fruit food beautiful girl
one BMW peach potato 0.0
two ford apple rice 1.0
three toyota watermelon beef 2.0
four benz pineapple dumpling 3.0
series_1=pd.Series(['diaochan','xishi','dongshi','xiaoqiao'],index=['1','two','3','four'])
dataframe_2['beautiful girl']=series_1
dataframe_2
car fruit food beautiful girl
one BMW peach potato NaN
two ford apple rice xishi
three toyota watermelon beef NaN
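four benz pineapple dumpling xiaoqiao
Example: delete a column from a DataFrame with del (here dataframe_2 is assumed to also hold an extra, all-NaN column named beautifulgirl):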
dataframe_2
car fruit food beautiful girl beautifulgirl
one BMW peach potato NaN NaN
two ford apple rice xishi NaN
three toyota watermelon beef NaN NaN
four benz pineapple dumpling xiaoqiao NaN
del dataframe_2['beautifulgirl']
dataframe_2
car fruit food beautiful girl
one BMW peach potato NaN
two ford apple rice xishi
three toyota watermelon beef NaN
four benz pineapple dumpling xiaoqiao
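A column returned by indexing is a view of the underlying data, not a copy, so any in-place modification of the returned Series is reflected in the source DataFrame. This describes pandas without Copy-on-Write; with Copy-on-Write enabled, modifying the returned Series leaves the source DataFrame unchanged. To copy a column explicitly, use the Series copy method.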
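As a minimal sketch of the copy method, the snippet below edits an explicit copy of a column and checks that the source is untouched. The small frame and the replacement value 'audi' are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration.
frame = pd.DataFrame({'car': ['BMW', 'ford'], 'fruit': ['peach', 'apple']},
                     index=['one', 'two'])

# copy() returns an independent Series, so the edit below stays local.
car_copy = frame['car'].copy()
car_copy['one'] = 'audi'

print(car_copy['one'])           # value held by the copy
print(frame.loc['one', 'car'])   # value still held by the source DataFrame
```

An explicit copy behaves the same way regardless of the view-or-copy rules of the installed pandas version.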
Example: a nested dict (a dict of dicts):
dict_dict={'pets':{'cat':'Tom','dog':'Tiger'},'plants':{'flower':'rose','tree':'pine','grass':'vanilla'}}
Example: a nested dict can be passed to DataFrame; the outer keys become the column names and the inner keys become the row index:
dataframe_3 = pd.DataFrame(dict_dict)
dataframe_3
pets plants
cat Tom NaN
dog Tiger NaN
flower NaN rose
tree NaN pine
grass NaN vanilla
Example: transpose a DataFrame (swap rows and columns):
dataframe_3.T
cat dog flower tree grass
pets Tom Tiger NaN NaN NaN
plants NaN NaN rose pine vanilla
Which data types does the DataFrame constructor accept?
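Besides a dict of lists and a dict of dicts, the constructor also takes several other inputs. The sketch below shows three common ones; it is a partial illustration, not an exhaustive list, and all variable names are made up.

```python
import numpy as np
import pandas as pd

# 1. A 2-D NumPy array, with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['x', 'y', 'z'])

# 2. A dict of Series: the indexes are unioned to form the row index,
#    and positions without a value become NaN.
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['r1', 'r2']),
                            'y': pd.Series([3], index=['r2'])})

# 3. A list of dicts: each dict becomes one row, the keys become columns.
from_records = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3, 'z': 4}])

print(from_array, from_series, from_records, sep='\n\n')
```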
Example: the values attribute of a DataFrame returns its data as a two-dimensional ndarray:
dataframe_3.values
array([['Tom', nan],
['Tiger', nan],
[nan, 'rose'],
[nan, 'pine'],
[nan, 'vanilla']], dtype=object)
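A pandas Index object manages the axis labels and other metadata (such as the axis name). When a Series or DataFrame is built, any array or other sequence of labels that is used is converted to an Index: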
series_1 = pd.Series(range(4),index=['a','b','c','d'])
index = series_1.index
index
Index(['a', 'b', 'c', 'd'], dtype='object')
An Index object is immutable; users cannot modify it in place.
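A short sketch of what immutability means in practice: item assignment on an Index raises TypeError, while replacing the whole index of a Series is allowed. The flag assignment_failed is introduced here only to record the outcome.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c', 'd'])

# Item assignment on an Index raises TypeError.
try:
    index[1] = 'z'
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print('TypeError:', error)

# To change labels, assign a complete new set of labels to the object instead.
series = pd.Series(range(4), index=index)
series.index = ['w', 'x', 'y', 'z']
print(series.index)
```

Because an Index cannot be changed in place, it is safe to share one Index between several data structures.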
Index methods and properties:
To be added.
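Until this part is filled in, the sketch below tries a few commonly used Index methods and properties. The selection is an assumption about what is worth covering, and the names left and right are made up.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print('b' in left)                   # membership test on labels
print(left.union(right))             # labels found in either Index
print(left.intersection(right))      # labels found in both
print(left.difference(right))        # labels found only in left
print(left.isin(['a', 'c']))         # boolean array, one flag per label
print(left.append(right).is_unique)  # False: 'b' and 'c' now appear twice
```

Each set-like method returns a new Index rather than changing left or right.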
Example: define a DataFrame, export it to a CSV file, read the CSV file back in, change the row index, and export it again:
import pandas as pd
data_1 = {'V1': [1, 0, 1, 1, 0], 'V2': [0, 1, 0, 1, 1], 'V3': [1, 1, 1, 0, 0], 'V4': [0, 0, 0, 0, 0], 'V5': [0, 0, 1, 1, 1] }
dataFrame_1 = pd.DataFrame(data_1, index=['a', 'b', 'c', 'd', 'e'])
print(dataFrame_1)
dataFrame_1.to_csv(r'E:\coding\codes\practice\test_1.csv', index=False)
# index=False: do not write the row index to test_1.csv
dataFrame_2 = pd.read_csv(r'E:\coding\codes\practice\test_1.csv')
# read_csv treats every column in the CSV file as data and adds a default row index (integers starting at 0)
print(dataFrame_2)
dataFrame_2.index = pd.Series(['one', 'two', 'three', 'four', 'five'])
dataFrame_2.to_csv(r'E:\coding\codes\practice\test_2.csv')
# Write dataFrame_2, including its new row index, to test_2.csv
print(dataFrame_2)
Printed output:
V1 V2 V3 V4 V5
a 1 0 1 0 0
b 0 1 1 0 0
c 1 0 1 0 1
d 1 1 0 0 1
e 0 1 0 0 1
V1 V2 V3 V4 V5
0 1 0 1 0 0
1 0 1 1 0 0
2 1 0 1 0 1
3 1 1 0 0 1
4 0 1 0 0 1
V1 V2 V3 V4 V5
one 1 0 1 0 0
two 0 1 1 0 0
three 1 0 1 0 1
four 1 1 0 0 1
five 0 1 0 0 1
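The example above drops the row labels on export and rebuilds them by hand. As an alternative sketch, the labels can be written to the file and read back with index_col. The temporary directory and the file name with_index.csv are hypothetical, chosen so that the snippet runs anywhere.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'V1': [1, 0, 1], 'V2': [0, 1, 0]}, index=['a', 'b', 'c'])

# Hypothetical file inside a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'with_index.csv')

    # The default index=True writes the row labels as the first column.
    frame.to_csv(csv_path)

    # index_col=0 tells read_csv to use that first column as the row index.
    restored = pd.read_csv(csv_path, index_col=0)

print(restored)
```

This keeps the labels 'a', 'b', 'c' across the round trip without reassigning the index afterwards.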