Pandas Study Notes

Contents

References:

Introduction to Pandas

Main Data Structures

Series

DataFrame

Index Objects

Basic Functionality

Data Import and Export



References:

Pandas Chinese documentation site: https://www.pypandas.cn

Python for Data Analysis

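Introduction to Pandas

Pandas is a core Python library for data analysis, built on top of NumPy arrays. The main difference between the two is that pandas is designed for tabular and heterogeneous data (rows and columns can be labeled), whereas NumPy is better suited to homogeneous numerical arrays. The difference between a list and a dictionary is a useful analogy. In practice the two are usually used together.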

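Main Data Structures

Series

Definition:

A Series is a one-dimensional array-like object.

Composition: it consists of a sequence of values (of any NumPy data type) and an associated sequence of data labels, called its index. If no index is specified, a default integer index starting from 0 is assigned.

Example: create a simple Series.

import numpy as np
import pandas as pd
series_1 = pd.Series([3,4,5,6,5])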
series_1
0    3
1    4
2    5
3    6
4    5
dtype: int64

Example: use the values and index attributes of a Series to get its underlying data array and its index:

series_1.values
array([3, 4, 5, 6, 5], dtype=int64)
series_1.index
RangeIndex(start=0, stop=5, step=1)

Example: specify an index for a Series:

series_2= pd.Series([3,4,5,6,5],index=['a','b','c','d','e'])
series_2
a    3
b    4
c    5
d    6
e    5
dtype: int64

Example: access values in a Series by a single index label or by a list of labels:

series_2['a']
3

series_2[['b','d','e']]
b    4
d    6
e    5
dtype: int64

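Example: NumPy functions and NumPy-like operations (such as filtering with a boolean array, scalar multiplication, or applying math functions) preserve the link between each index label and its value: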

series_2*3
a     9
b    12
c    15
d    18
e    15
dtype: int64
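The scalar multiplication above is only one of the operations mentioned. As a minimal sketch (the variable names s, above_four and roots are chosen here for illustration), boolean filtering and a NumPy function behave the same way and keep the labels attached to the values:

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 4, 5, 6, 5], index=['a', 'b', 'c', 'd', 'e'])

# Boolean filtering keeps the labels of the values that pass the test.
above_four = s[s > 4]

# A NumPy function applied to a Series returns a Series with the same index.
roots = np.sqrt(s)

print(above_four)
print(roots)
```

In both results the labels travel with their values, so a filtered or transformed Series can still be looked up by label.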

Example: a Series can be created directly from a dictionary with pd.Series(dict); the dictionary keys become the index:

dict_1={'name':'tom','age':3,'color':'blue'}
series_3=pd.Series(dict_1)
series_3
name      tom
age         3
color    blue
dtype: object

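Example: pass an explicit index list along with the dictionary to control the order of the resulting Series. Keys that are not in the list are dropped, and labels with no matching key get NaN: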

index_new=['age','name','weigh']
series_4=pd.Series(dict_1,index=index_new)
series_4
age        3
name     tom
weigh    NaN
dtype: object

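Note: in pandas, NaN ("not a number") is used to mark missing or NA values.

Example: the pandas functions isnull and notnull detect missing data; Series also provides both as instance methods: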

pd.isnull(series_4)
age      False
name     False
weigh     True
dtype: bool

pd.notnull(series_4)
age       True
name      True
weigh    False
dtype: bool

series_4.isnull()
age      False
name     False
weigh     True
dtype: bool

series_4.notnull()
age       True
name      True
weigh    False
dtype: bool

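Example: arithmetic between two Series aligns automatically on index labels; a label present in only one of them produces NaN: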

series_5=pd.Series([1,2,3],index=['a','b','c'])
series_6=pd.Series([4,5,6],index=['b','c','d'])
series_5+series_6
a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64
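If NaN is not wanted for labels that appear in only one Series, one option is the add method with fill_value. This is a small sketch of that alternative, not something the original notes use; the names left, right and total are illustrative:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# add() aligns on labels just like +, but fill_value=0 stands in for
# a label that is missing from one of the two Series.
total = left.add(right, fill_value=0)

print(total)
```

The result still has the union of both indexes, but 'a' and 'd' now hold the single value that exists for them instead of NaN.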

Example: both the Series object and its index have a name attribute:

series_5.name='value'
series_5.index.name='index'
series_5
index
a    1
b    2
c    3
Name: value, dtype: int64

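DataFrame

A DataFrame is a tabular data structure made up of an ordered collection of columns, each of which can hold a different value type.

A DataFrame has both a column index and a row index.

Internally, the data is stored as one or more two-dimensional blocks.

Example: the most common way to create a DataFrame is to pass a dictionary of equal-length lists or NumPy arrays: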

data={'food':['potato','rice','beef','dumpling'],'car':['BMW','ford','toyota','benz'],'fruit':['peach','apple','watermelon','pineapple']}
dataframe_1=pd.DataFrame(data)
dataframe_1
       food     car       fruit
0    potato     BMW       peach
1      rice    ford       apple
2      beef  toyota  watermelon
3  dumpling    benz   pineapple

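The DataFrame automatically gets a row index, by default integers starting from 0.

For a very large DataFrame, head shows only the first five rows: dataframe_1.head().

Example: display the columns in a specified order: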

pd.DataFrame(data,columns=['car','fruit','food'])
      car       fruit      food
0     BMW       peach    potato
1    ford       apple      rice
2  toyota  watermelon      beef
3    benz   pineapple  dumpling

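Example: if a requested column is not found in the data, it appears in the result filled with missing values (the length of index must match the actual number of rows):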

dataframe_2=pd.DataFrame(data,columns=['car','fruit','food','beautiful girl'],index=['one','two','three','four'])
dataframe_2
          car       fruit      food beautiful girl
one       BMW       peach    potato            NaN
two      ford       apple      rice            NaN
three  toyota  watermelon      beef            NaN
four     benz   pineapple  dumpling            NaN

Example: retrieve a DataFrame column as a Series, using either dictionary-style indexing or attribute access:

dataframe_2['car']
one         BMW
two        ford
three    toyota
four       benz
Name: car, dtype: object
dataframe_2.food
one        potato
two          rice
three        beef
four     dumpling
Name: food, dtype: object

Example: retrieve a row of a DataFrame with the loc attribute:

dataframe_2.loc['two']
car                ford
fruit             apple
food               rice
beautiful girl      NaN
Name: two, dtype: object
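loc selects by label. As a complementary sketch (iloc is not covered in the original notes, and the small frame below is a cut-down stand-in for dataframe_2), the same row can also be reached by integer position, and loc accepts lists of row and column labels:

```python
import pandas as pd

df = pd.DataFrame(
    {'car': ['BMW', 'ford', 'toyota', 'benz'],
     'food': ['potato', 'rice', 'beef', 'dumpling']},
    index=['one', 'two', 'three', 'four'],
)

row_by_label = df.loc['two']                 # select a row by its label
row_by_position = df.iloc[1]                 # select the same row by position
subset = df.loc[['one', 'four'], ['car']]    # several rows, chosen columns

print(row_by_label.equals(row_by_position))
print(subset)
```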

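Example: assign values to a DataFrame column. A scalar is broadcast to every row, while an array must have the same length as the DataFrame: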

dataframe_2['beautiful girl']=0
dataframe_2
          car       fruit      food  beautiful girl
one       BMW       peach    potato               0
two      ford       apple      rice               0
three  toyota  watermelon      beef               0
four     benz   pineapple  dumpling               0
# dataframe_2['beautiful girl']=np.arange(6.)  # raises ValueError: 6 values do not match the 4 rows

dataframe_2['beautiful girl']=np.arange(4.)
dataframe_2
          car       fruit      food  beautiful girl
one       BMW       peach    potato             0.0
two      ford       apple      rice             1.0
three  toyota  watermelon      beef             2.0
four     benz   pineapple  dumpling             3.0
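The length requirement can be checked directly. The following is a minimal sketch on a smaller frame (the column name score and the variable error_message are illustrative) that catches the error pandas raises when the array length does not match the number of rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'car': ['BMW', 'ford', 'toyota', 'benz']},
                  index=['one', 'two', 'three', 'four'])

error_message = None
try:
    df['score'] = np.arange(6.)      # 6 values for a 4-row DataFrame
except ValueError as exc:
    error_message = str(exc)
    print('ValueError:', error_message)

df['score'] = np.arange(4.)          # a matching length is accepted
print(df)
```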

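Example: when a Series is assigned to a DataFrame column, its values are matched by index label; row labels missing from the Series get NaN: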

dataframe_2
          car       fruit      food  beautiful girl
one       BMW       peach    potato             0.0
two      ford       apple      rice             1.0
three  toyota  watermelon      beef             2.0
four     benz   pineapple  dumpling             3.0

series_1=pd.Series(['diaochan','xishi','dongshi','xiaoqiao'],index=['1','two','3','four'])
dataframe_2['beautiful girl']=series_1
dataframe_2
          car       fruit      food beautiful girl
one       BMW       peach    potato            NaN
two      ford       apple      rice          xishi
three  toyota  watermelon      beef            NaN
four     benz   pineapple  dumpling       xiaoqiao

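Example: delete a column from a DataFrame with del (here dataframe_2 carries an extra beautifulgirl column whose creation is not shown in these notes):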

dataframe_2
          car       fruit      food beautiful girl beautifulgirl
one       BMW       peach    potato            NaN           NaN
two      ford       apple      rice          xishi           NaN
three  toyota  watermelon      beef            NaN           NaN
four     benz   pineapple  dumpling       xiaoqiao           NaN

del dataframe_2['beautifulgirl']
dataframe_2
          car       fruit      food beautiful girl
one       BMW       peach    potato            NaN
two      ford       apple      rice          xishi
three  toyota  watermelon      beef            NaN
four     benz   pineapple  dumpling       xiaoqiao

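A column returned by indexing is a view on the underlying data, not a copy, so any in-place modification of the returned Series is reflected in the source DataFrame. To work on an independent copy of the column, call the Series copy method explicitly. (The view behavior depends on the pandas version and settings: with Copy-on-Write enabled, the returned Series behaves like a copy.)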
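A short sketch of the explicit copy (the frame and the name price_copy are illustrative): changes made to the copied column never reach the source DataFrame, regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame({'car': ['BMW', 'ford'], 'price': [10, 20]})

price_copy = df['price'].copy()   # explicit, independent copy of the column
price_copy.iloc[0] = 999          # modifies only the copy

print(price_copy.tolist())
print(df['price'].tolist())       # the source column is unchanged
```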

Example: a nested dictionary

dict_dict={'pets':{'cat':'Tom','dog':'Tiger'},'plants':{'flower':'rose','tree':'pine','grass':'vanilla'}}

Example: a nested dictionary can be passed to DataFrame; the outer keys become the column names and the inner keys become the row index:

dataframe_3 = pd.DataFrame(dict_dict) 
dataframe_3
         pets   plants
cat       Tom      NaN
dog     Tiger      NaN
flower    NaN     rose
tree      NaN     pine
grass     NaN  vanilla

Example: transpose a DataFrame:

dataframe_3.T
        cat    dog flower  tree    grass
pets    Tom  Tiger    NaN   NaN      NaN
plants  NaN    NaN   rose  pine  vanilla

Which data types does the DataFrame constructor accept?

[Figure: table of the data types accepted by the DataFrame constructor]
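Since the original table is not available as text, here is a sketch of a few inputs the constructor is known to accept. It is not a complete list, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Dict of equal-length lists: the keys become column names.
from_lists = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# 2D ndarray: row and column labels can be supplied explicitly.
from_array = pd.DataFrame(np.arange(4).reshape(2, 2),
                          index=['r1', 'r2'], columns=['x', 'y'])

# List of dicts: each dict becomes one row, the keys become columns.
from_records = pd.DataFrame([{'x': 1, 'y': 3}, {'x': 2, 'y': 4}])

# Dict of Series: the Series indexes are aligned on the union of labels,
# so a label missing from one Series yields NaN in that column.
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['a', 'b']),
                            'y': pd.Series([3], index=['a'])})

print(from_lists, from_array, from_records, from_series, sep='\n\n')
```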

Example: the values attribute of a DataFrame returns its data as a two-dimensional ndarray:

dataframe_3.values
array([['Tom', nan],
       ['Tiger', nan],
       [nan, 'rose'],
       [nan, 'pine'],
       [nan, 'vanilla']], dtype=object)

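Index Objects

The index objects in pandas hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index internally: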

series_1 = pd.Series(range(4),index=['a','b','c','d'])
index = series_1.index 
index
Index(['a', 'b', 'c', 'd'], dtype='object')

Index objects are immutable: their elements cannot be modified in place.
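A minimal sketch of what immutability means in practice (the names s and rejected are illustrative): assigning to a single element fails, while replacing the whole index of a Series is allowed.

```python
import pandas as pd

s = pd.Series(range(4), index=['a', 'b', 'c', 'd'])

rejected = False
try:
    s.index[1] = 'x'              # item assignment on an Index is rejected
except TypeError as exc:
    rejected = True
    print('TypeError:', exc)

# Assigning a complete new set of labels replaces the Index object instead.
s.index = ['a', 'x', 'c', 'd']
print(s)
```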

Methods and attributes of Index:

[Figure: table of Index methods and attributes]
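A few commonly used Index methods and attributes can be sketched as follows. This is a small selection, not a reproduction of the original table, and idx_1 and idx_2 are illustrative names:

```python
import pandas as pd

idx_1 = pd.Index(['a', 'b', 'c'])
idx_2 = pd.Index(['b', 'c', 'd'])

print('b' in idx_1)                  # membership test, like a set
print(idx_1.union(idx_2))            # labels found in either Index
print(idx_1.intersection(idx_2))     # labels found in both
print(idx_1.difference(idx_2))       # labels only in idx_1
print(idx_1.isin(['a', 'c']))        # boolean array, one entry per label
print(idx_1.is_unique)               # True when no label is repeated
```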

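Basic Functionality

To be added.

Data Import and Export

Example: define a DataFrame, export it to a CSV file, read the CSV file back in, change the row index, and export it again: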

import pandas as pd
data_1 = {'V1': [1, 0, 1, 1, 0], 'V2': [0, 1, 0, 1, 1], 'V3': [1, 1, 1, 0, 0], 'V4': [0, 0, 0, 0, 0], 'V5': [0, 0, 1, 1, 1] }
dataFrame_1 = pd.DataFrame(data_1, index=['a', 'b', 'c', 'd', 'e'])
print(dataFrame_1)

# Raw strings (r'...') keep the backslashes of the Windows path literal
dataFrame_1.to_csv(r'E:\coding\codes\practice\test_1.csv',index=False)
# index=False: do not write the index column to test_1.csv

dataFrame_2 = pd.read_csv(r'E:\coding\codes\practice\test_1.csv')
# read_csv treats every column in the CSV file as data and, by default, adds a row index (integers starting from 0)

print(dataFrame_2)
dataFrame_2.index = pd.Series(['one', 'two', 'three', 'four', 'five'])
dataFrame_2.to_csv(r'E:\coding\codes\practice\test_2.csv')
# write dataFrame_2 to test_2.csv (the index is included this time)
print(dataFrame_2)

Printed output:

   V1  V2  V3  V4  V5
a   1   0   1   0   0
b   0   1   1   0   0
c   1   0   1   0   1
d   1   1   0   0   1
e   0   1   0   0   1
   V1  V2  V3  V4  V5
0   1   0   1   0   0
1   0   1   1   0   0
2   1   0   1   0   1
3   1   1   0   0   1
4   0   1   0   0   1
       V1  V2  V3  V4  V5
one     1   0   1   0   0
two     0   1   1   0   0
three   1   0   1   0   1
four    1   1   0   0   1
five    0   1   0   0   1
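In the example above the original labels a to e are lost because the index is not written. As an alternative sketch, the index can be written to the file and restored on import with index_col. The temporary directory and the file name test_index.csv are assumptions made here so the snippet runs anywhere:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'V1': [1, 0, 1], 'V2': [0, 1, 0]}, index=['a', 'b', 'c'])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory.
    csv_path = os.path.join(tmp_dir, 'test_index.csv')
    df.to_csv(csv_path)                             # index is written as the first column
    restored = pd.read_csv(csv_path, index_col=0)   # use that column as the row index again

print(restored)
```

With index_col=0 the first CSV column becomes the row index again, so the labels survive the round trip.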

