Python Syntax Basics: DataFrame

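Data analysis in Python starts with cleaning and preparing the data, which relies heavily on DataFrame and Series. This post collects the commonly used methods, mainly covering adding and removing data, changing indexes, and replacing values. Not every parameter of each function is covered; for the rest, consult the pandas documentation or type the method name followed by ? in an IPython or Jupyter session (for example df.reindex?). Practice is the best way to test what you know.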

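Contents:

    • Changing the index
    • Deleting data
      • Deleting columns
      • Deleting rows
      • Deleting duplicate rows
      • Deleting rows/columns with missing values
    • Replacing data
      • Replacing missing values
      • Replacing other values
    • Indexing data
      • Label-based indexing: loc
      • Position-based indexing: iloc
      • Boolean selection
    • Sorting data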

import pandas as pd
import numpy as np

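# Pass columns explicitly so the column order matches the printed output on every pandas version
df = pd.DataFrame({'name':['James','Curry','James','Kobe','Wade'],
                   'age':[31,30,31,35,38],
                   'score':[18,25,18,17,15],
                   'block':[5,2,5,3,2]},
                  index = ['player1','player2','player3','player4','player5'],
                  columns = ['age','block','name','score'])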
print(df)
         age  block   name  score
player1   31      5  James     18
player2   30      2  Curry     25
player3   31      5  James     18
player4   35      3   Kobe     17
player5   38      2   Wade     15

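Changing the index

  • Build a new index (rows and/or columns) with reindex: the data is rearranged to follow the new index. Labels that do not exist in the original get missing values (NaN), and values under existing labels are unchanged. Because NaN is a float, the integer columns in the output below are shown as floats.
# reindex always returns a new object and has no inplace parameter; the original DataFrame is not modified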
df_reindex = df.reindex(columns = ['name','age','block','score','reb'],
           index = ['player1','player2','player3','player4','player5','player6'])
print(df_reindex)
          name   age  block  score  reb
player1  James  31.0    5.0   18.0  NaN
player2  Curry  30.0    2.0   25.0  NaN
player3  James  31.0    5.0   18.0  NaN
player4   Kobe  35.0    3.0   17.0  NaN
player5   Wade  38.0    2.0   15.0  NaN
player6    NaN   NaN    NaN    NaN  NaN
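If NaN is not wanted for the new labels, reindex also accepts a fill_value argument. A minimal sketch, assuming a two-row stand-in for the df above (the name df_filled is only for illustration):

```python
import pandas as pd

# Small stand-in for the df defined at the top of this post
df = pd.DataFrame({'name': ['James', 'Curry'],
                   'age': [31, 30]},
                  index=['player1', 'player2'])

# fill_value replaces the NaN that reindex would otherwise insert
df_filled = df.reindex(index=['player1', 'player2', 'player6'],
                       fill_value=0)
print(df_filled)

# reindex returned a new object, so df still has its two original rows
print(df.index.tolist())
```

A single fill_value applies to every column, so here the name column also receives 0 for player6; whether that is acceptable depends on the data.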
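  • Rename index labels with rename: pass a dict to rename some of the labels, or pass a function to transform all of them.
# Rename with dicts
new_index = {'player1':'PLAYER1'}
new_col = {'name':'Name','age':'Age'}
df_rename_dict = df.rename(index = new_index,columns = new_col) # returns a modified copy by default; pass inplace = True to modify the original DataFrame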
print(df_rename_dict)
         Age  block   Name  score
PLAYER1   31      5  James     18
player2   30      2  Curry     25
player3   31      5  James     18
player4   35      3   Kobe     17
player5   38      2   Wade     15
# Rename with a function
df_rename_fun = df.rename(columns = str.title)
print(df_rename_fun)
         Age  Block   Name  Score
player1   31      5  James     18
player2   30      2  Curry     25
player3   31      5  James     18
player4   35      3   Kobe     17
player5   38      2   Wade     15
# Rename with map; assigning the result back modifies the original DataFrame directly
df.columns = df.columns.map(str.title)
df.index = df.index.map(str.upper)
print(df)
         Age  Block   Name  Score
PLAYER1   31      5  James     18
PLAYER2   30      2  Curry     25
PLAYER3   31      5  James     18
PLAYER4   35      3   Kobe     17
PLAYER5   38      2   Wade     15
  • Turn one or more columns into the row index with set_index
# Use a single column as the index
df_set1index = df.set_index(['Name'])
print(df_set1index)
       Age  Block  Score
Name                    
James   31      5     18
Curry   30      2     25
James   31      5     18
Kobe    35      3     17
Wade    38      2     15
# Use two columns as the index. By default the columns used as the index are removed from the DataFrame; pass drop = False to keep them
df_set2index = df.set_index(['Name','Block'],drop = False)
print(df_set2index)
             Age  Block   Name  Score
Name  Block                          
James 5       31      5  James     18
Curry 2       30      2  Curry     25
James 5       31      5  James     18
Kobe  3       35      3   Kobe     17
Wade  2       38      2   Wade     15
  • Turn the row index back into a column with reset_index
df_rIdx = df.reset_index() # by default (drop = False) the old index is kept as a column; pass drop = True to discard it
print(df_rIdx)
     index  Age  Block   Name  Score
0  PLAYER1   31      5  James     18
1  PLAYER2   30      2  Curry     25
2  PLAYER3   31      5  James     18
3  PLAYER4   35      3   Kobe     17
4  PLAYER5   38      2   Wade     15
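The difference between the default and drop = True is easiest to see side by side. A short sketch on a two-row stand-in for df (the names kept and dropped are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Curry'],
                   'Score': [18, 25]},
                  index=['PLAYER1', 'PLAYER2'])

# Default: the old index becomes a regular column named 'index'
kept = df.reset_index()
print(kept.columns.tolist())

# drop=True: the old index is discarded and replaced by 0, 1, 2, ...
dropped = df.reset_index(drop=True)
print(dropped.columns.tolist())
print(dropped.index.tolist())
```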

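Deleting data

This section covers the main ways of deleting rows and columns.

Deleting columns

  • del removes a column in place, modifying the original DataFrame
  • drop returns a copy with the columns removed and leaves the original DataFrame unchanged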
df2 = df.copy()
print(df2)
         Age  Block   Name  Score
PLAYER1   31      5  James     18
PLAYER2   30      2  Curry     25
PLAYER3   31      5  James     18
PLAYER4   35      3   Kobe     17
PLAYER5   38      2   Wade     15
# Delete with del
del df2['Age']
print(df2)
         Block   Name  Score
PLAYER1      5  James     18
PLAYER2      2  Curry     25
PLAYER3      5  James     18
PLAYER4      3   Kobe     17
PLAYER5      2   Wade     15
# Delete with drop; the default is axis = 0 (rows), so pass axis = 1 to delete columns
df2_drop = df2.drop(['Block','Score'],axis = 1)
print(df2_drop)
          Name
PLAYER1  James
PLAYER2  Curry
PLAYER3  James
PLAYER4   Kobe
PLAYER5   Wade

Deleting rows

df3 = df.copy()
print(df3)
         Age  Block   Name  Score
PLAYER1   31      5  James     18
PLAYER2   30      2  Curry     25
PLAYER3   31      5  James     18
PLAYER4   35      3   Kobe     17
PLAYER5   38      2   Wade     15
# drop defaults to axis = 0, which deletes rows
df3_drop = df3.drop(['PLAYER1'])
print(df3_drop)
         Age  Block   Name  Score
PLAYER2   30      2  Curry     25
PLAYER3   31      5  James     18
PLAYER4   35      3   Kobe     17
PLAYER5   38      2   Wade     15
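The drop calls in the two sections above choose rows or columns through axis. drop also accepts index= and columns= keywords, which some readers find easier to scan. A sketch on a three-row stand-in for df; the label 'Rebound' is a made-up column that does not exist, used only to show errors='ignore':

```python
import pandas as pd

df = pd.DataFrame({'Age': [31, 30, 35],
                   'Block': [5, 2, 3],
                   'Name': ['James', 'Curry', 'Kobe']},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER4'])

no_block = df.drop(columns=['Block'])    # same as df.drop(['Block'], axis=1)
no_player1 = df.drop(index=['PLAYER1'])  # same as df.drop(['PLAYER1'])

# errors='ignore' skips labels that do not exist instead of raising KeyError
safe = df.drop(columns=['Block', 'Rebound'], errors='ignore')

print(no_block.columns.tolist())
print(no_player1.index.tolist())
print(safe.columns.tolist())
```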

Deleting duplicate rows

  • Detect duplicates with duplicated(): it returns a boolean Series marking whether each row repeats an earlier row (True if it does)
df.duplicated()
PLAYER1    False
PLAYER2    False
PLAYER3     True
PLAYER4    False
PLAYER5    False
dtype: bool

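Pass subset = ['column name'] to judge duplicates on one or more columns only. With keep = 'last', the last occurrence in each group of duplicates is marked False and the earlier ones True.
Combined with sum, this quickly tells you whether the DataFrame contains duplicate rows: the sum is the number of duplicate rows.

df.duplicated().sum()
1
  • To check a single column for duplicates, use the is_unique attribute of a Series: it is True when no value is repeated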
df['Name'].is_unique
False
df3_drop['Name'].is_unique
True
  • Remove duplicates with drop_duplicates, which returns a DataFrame without the duplicate rows and leaves the original DataFrame unchanged
# Judge on all columns
df_d = df.drop_duplicates()
print(df_d)
         Age  Block   Name  Score
PLAYER1   31      5  James     18
PLAYER2   30      2  Curry     25
PLAYER4   35      3   Kobe     17
PLAYER5   38      2   Wade     15
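# Judge on the given column or columns. By default the first occurrence of each value combination is kept; keep = 'last' keeps the last one, and inplace = True modifies the original DataFrame instead of returning a copy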
df_d2 = df.drop_duplicates(subset = ['Block'],keep = 'last')
print(df_d2)
         Age  Block   Name  Score
PLAYER3   31      5  James     18
PLAYER4   35      3   Kobe     17
PLAYER5   38      2   Wade     15
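The subset and keep options of duplicated described above can be compared directly. A small sketch using only the Name and Block columns of df (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Curry', 'James', 'Kobe', 'Wade'],
                   'Block': [5, 2, 5, 3, 2]},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4', 'PLAYER5'])

# Default keep='first': the first row of each Block value is not a duplicate
dup_first = df.duplicated(subset=['Block'])
# keep='last': the last row of each Block value is not a duplicate
dup_last = df.duplicated(subset=['Block'], keep='last')
# keep=False: every row whose Block value occurs more than once is marked
dup_all = df.duplicated(subset=['Block'], keep=False)

print(dup_first.tolist())
print(dup_last.tolist())
print(dup_all.tolist())
```

keep = False is handy for inspecting all members of each duplicate group before deciding which one to keep.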

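Deleting rows/columns with missing values

  • Missing data is usually filtered out with dropna, which returns a copy with the rows or columns removed and leaves the original DataFrame unchanged
df_data = df.copy()
df_data.loc['PLAYER2','Name'] = np.nan # select the cell by label so the result does not depend on column order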
df_data.iloc[2] = np.nan
print(df_data)
          Age  Block   Name  Score
PLAYER1  31.0    5.0  James   18.0
PLAYER2  30.0    2.0    NaN   25.0
PLAYER3   NaN    NaN    NaN    NaN
PLAYER4  35.0    3.0   Kobe   17.0
PLAYER5  38.0    2.0   Wade   15.0
# By default a row is dropped as soon as it contains a single NaN
data_drop = df_data.dropna()
print(data_drop)
          Age  Block   Name  Score
PLAYER1  31.0    5.0  James   18.0
PLAYER4  35.0    3.0   Kobe   17.0
PLAYER5  38.0    2.0   Wade   15.0
# To drop only the rows that are entirely NaN, pass how = 'all'
data_drop2 = df_data.dropna(how = 'all')
print(data_drop2)
          Age  Block   Name  Score
PLAYER1  31.0    5.0  James   18.0
PLAYER2  30.0    2.0    NaN   25.0
PLAYER4  35.0    3.0   Kobe   17.0
PLAYER5  38.0    2.0   Wade   15.0
# To drop columns instead, pass axis = 1
data_drop3 = data_drop2.dropna(axis = 1)
print(data_drop3)
          Age  Block  Score
PLAYER1  31.0    5.0   18.0
PLAYER2  30.0    2.0   25.0
PLAYER4  35.0    3.0   17.0
PLAYER5  38.0    2.0   15.0

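dropna also takes three more parameters (thresh=None, subset=None, inplace=False): thresh keeps only the rows (or columns) that have at least that many non-missing values, subset restricts the missing-value check to the listed labels, and inplace controls whether the original DataFrame is modified.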
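As a rough illustration of thresh and subset, the sketch below rebuilds a four-row version of df_data by hand (the names at_least_3 and named are only for this example):

```python
import numpy as np
import pandas as pd

df_data = pd.DataFrame({'Age': [31, 30, np.nan, 35],
                        'Block': [5, 2, np.nan, 3],
                        'Name': ['James', np.nan, np.nan, 'Kobe'],
                        'Score': [18, 25, np.nan, 17]},
                       index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4'])

# Keep rows with at least 3 non-missing values: PLAYER2 (3 values) survives,
# PLAYER3 (0 values) is dropped
at_least_3 = df_data.dropna(thresh=3)

# Only look at the Name column when deciding which rows to drop
named = df_data.dropna(subset=['Name'])

print(at_least_3.index.tolist())
print(named.index.tolist())
```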

Replacing data

Replacing missing values

  • Missing values can be replaced with fillna
# Replace every missing value with the same value
df_1v = df_data.fillna(0)
print(df_1v)
          Age  Block   Name  Score
PLAYER1  31.0    5.0  James   18.0
PLAYER2  30.0    2.0      0   25.0
PLAYER3   0.0    0.0      0    0.0
PLAYER4  35.0    3.0   Kobe   17.0
PLAYER5  38.0    2.0   Wade   15.0
# Pass a dict keyed by column name to use a different fill value for each column
df_dict = df_data.fillna({'Age':30,'Block':10})
print(df_dict)
          Age  Block   Name  Score
PLAYER1  31.0    5.0  James   18.0
PLAYER2  30.0    2.0    NaN   25.0
PLAYER3  30.0   10.0    NaN    NaN
PLAYER4  35.0    3.0   Kobe   17.0
PLAYER5  38.0    2.0   Wade   15.0
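# Fill from the previous or the next valid value with ffill() or bfill(); older code writes fillna(method = 'ffill') or fillna(method = 'bfill'), which recent pandas versions deprecate
df_m = df_data.ffill()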
print(df_m)
          Age  Block   Name  Score
PLAYER1  31.0    5.0  James   18.0
PLAYER2  30.0    2.0  James   25.0
PLAYER3  30.0    2.0  James   25.0
PLAYER4  35.0    3.0   Kobe   17.0
PLAYER5  38.0    2.0   Wade   15.0

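fillna also accepts axis, limit, and inplace, which set the axis to fill along, the maximum number of missing values to fill, and whether to modify the original DataFrame in place.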
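The effect of limit is easiest to see on a single column with two consecutive gaps. A minimal sketch on a hand-made Series (score, forward_1, and backward are illustrative names):

```python
import numpy as np
import pandas as pd

score = pd.Series([18, np.nan, np.nan, 17],
                  index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4'])

# Forward fill, but fill at most one consecutive missing value per gap
forward_1 = score.ffill(limit=1)
# Backward fill without a limit: both gaps take the next valid value
backward = score.bfill()

print(forward_1)
print(backward)
```

With limit=1 the second missing value in the gap stays NaN, which can be useful when a long run of missing data should not be papered over.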

Replacing other values

  • Replace values with replace
# Use a list to replace several different values with the same value
# Replace Curry and Kobe with Stephen
df_replace = df.replace(['Curry','Kobe'],'Stephen')
print(df_replace)
         Age  Block     Name  Score
PLAYER1   31      5    James     18
PLAYER2   30      2  Stephen     25
PLAYER3   31      5    James     18
PLAYER4   35      3  Stephen     17
PLAYER5   38      2     Wade     15
# Use a dict to give each value its own replacement
# Replace Curry with Stephen and Kobe with Bryant
df_reDict = df.replace({'Curry':'Stephen','Kobe':'Bryant'})
print(df_reDict)
         Age  Block     Name  Score
PLAYER1   31      5    James     18
PLAYER2   30      2  Stephen     25
PLAYER3   31      5    James     18
PLAYER4   35      3   Bryant     17
PLAYER5   38      2     Wade     15
# Use two lists of equal length to give each value its own replacement
# Replace Curry with Stephen and Kobe with Bryant
df_reList = df.replace(['Curry','Kobe'],['Stephen','Bryant'])
print(df_reList)
         Age  Block     Name  Score
PLAYER1   31      5    James     18
PLAYER2   30      2  Stephen     25
PLAYER3   31      5    James     18
PLAYER4   35      3   Bryant     17
PLAYER5   38      2     Wade     15
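The replace calls above search every column. To limit a replacement to one column, a nested dict of the form {column: {old: new}} can be passed, or replace can be called on the column itself. A sketch on a three-row stand-in for df (by_column and name_only are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Curry', 'Kobe'],
                   'Score': [18, 25, 17]},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER4'])

# Nested dict: only the Name column is searched
by_column = df.replace({'Name': {'Curry': 'Stephen', 'Kobe': 'Bryant'}})

# Series.replace works on a single column and returns a new Series
name_only = df['Name'].replace('Curry', 'Stephen')

print(by_column)
print(name_only.tolist())
```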

Indexing data

Label-based indexing: loc

  • Selecting rows
print(df)
         Age  Block   Name  Score
PLAYER1   31      5  James     18
PLAYER2   30      2  Curry     25
PLAYER3   31      5  James     18
PLAYER4   35      3   Kobe     17
PLAYER5   38      2   Wade     15
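# Single row
df.loc['PLAYER1']
# Multiple rows
df.loc[['PLAYER1','PLAYER3']] # note that a list is passed, hence the inner brackets
# A contiguous range of rows
df.loc['PLAYER1':'PLAYER3'] # note that the label to the right of the colon is included
  • Selecting columns
# Single column
df['Age']
# Multiple columns
df[['Age','Name']] # note that a list is passed, hence the inner brackets
# Selecting with loc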
df.loc[:,'Name'] 
PLAYER1    James
PLAYER2    Curry
PLAYER3    James
PLAYER4     Kobe
PLAYER5     Wade
Name: Name, dtype: object
  • Selecting rows and columns together
df.loc['PLAYER2',['Age','Name']]
Age        30
Name    Curry
Name: PLAYER2, dtype: object

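Position-based indexing: iloc

  • Selecting rows
# Single row
df.iloc[1]
# A contiguous range of rows
df.iloc[1:3] # the position to the right of the colon is excluded
  • Selecting columns
# Single column
df.iloc[:,1]
# A contiguous range of columns
df.iloc[:,1:3] # the position to the right of the colon is excluded
  • Selecting rows and columns together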
df.iloc[0:2,2:4]
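Because loc slices include the end label while iloc slices exclude the end position, the two are easy to mix up. The sketch below puts them next to each other on a copy of df built with an explicit column order (by_label, by_position, and the cell names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [31, 30, 31, 35, 38],
                   'Block': [5, 2, 5, 3, 2],
                   'Name': ['James', 'Curry', 'James', 'Kobe', 'Wade']},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4', 'PLAYER5'])

by_label = df.loc['PLAYER1':'PLAYER3']  # end label included: 3 rows
by_position = df.iloc[0:2]              # end position excluded: 2 rows

# The same cell addressed by label and by position
cell_by_label = df.loc['PLAYER2', 'Name']
cell_by_position = df.iloc[1, 2]

print(len(by_label), len(by_position))
print(cell_by_label, cell_by_position)
```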

Boolean selection

df_logic = df[df['Score']>17]
print(df_logic)
         Age  Block   Name  Score
PLAYER1   31      5  James     18
PLAYER2   30      2  Curry     25
PLAYER3   31      5  James     18
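Several conditions can be combined with & (and) and | (or); each condition needs its own parentheses because these operators bind more tightly than comparisons. isin tests membership in a list, and loc accepts a boolean mask together with a column label. A sketch on a copy of df (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [31, 30, 31, 35, 38],
                   'Block': [5, 2, 5, 3, 2],
                   'Name': ['James', 'Curry', 'James', 'Kobe', 'Wade'],
                   'Score': [18, 25, 18, 17, 15]},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4', 'PLAYER5'])

# Both conditions must hold
young_scorers = df[(df['Score'] > 17) & (df['Age'] < 31)]
# Either condition may hold
either = df[(df['Score'] > 20) | (df['Block'] >= 5)]
# Membership test
veterans = df[df['Name'].isin(['Kobe', 'Wade'])]
# Boolean mask for rows plus a column label
names = df.loc[df['Score'] > 17, 'Name']

print(young_scorers.index.tolist())
print(either.index.tolist())
print(veterans.index.tolist())
print(names.tolist())
```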

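Sorting data

  • Sort by index labels with sort_index
# axis chooses whether to sort the row labels or the column labels; ascending chooses ascending or descending order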
df_sort = df.sort_index(axis = 1,ascending = False)
print(df_sort)
         Score   Name  Block  Age
PLAYER1     18  James      5   31
PLAYER2     25  Curry      2   30
PLAYER3     18  James      5   31
PLAYER4     17   Kobe      3   35
PLAYER5     15   Wade      2   38
  • Sort by values with sort_values
# ascending chooses ascending or descending order
df_sort2 = df.sort_values(by = ['Age','Score'])
print(df_sort2)
         Age  Block   Name  Score
PLAYER2   30      2  Curry     25
PLAYER1   31      5  James     18
PLAYER3   31      5  James     18
PLAYER4   35      3   Kobe     17
PLAYER5   38      2   Wade     15
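When sorting by several columns, ascending can also be a list with one entry per column. The sketch below uses a four-row subset of df without ties so that the resulting order is unambiguous (by_block_then_score and reversed_rows are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'Block': [5, 2, 3, 2],
                   'Name': ['James', 'Curry', 'Kobe', 'Wade'],
                   'Score': [18, 25, 17, 15]},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER4', 'PLAYER5'])

# Block ascending, then Score descending within equal Block values
by_block_then_score = df.sort_values(by=['Block', 'Score'],
                                     ascending=[True, False])
# Row labels in descending order
reversed_rows = df.sort_index(ascending=False)

print(by_block_then_score)
print(reversed_rows.index.tolist())
```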
