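Data analysis in Python starts with cleaning the data, which draws on a good deal of DataFrame and Series knowledge. This post collects the commonly used methods involved, mainly adding and removing data, changing indexes, and replacing values. Not every parameter of every function is covered here; for the rest, consult the pandas documentation or type the method name followed by ? in the editor (for example df.reindex?). Practice is the best way to test what you know.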
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['James','Curry','James','Kobe','Wade'],
'age':[31,30,31,35,38],
'score':[18,25,18,17,15],
'block':[5,2,5,3,2]},index = ['player1','player2','player3','player4','player5'])
print(df)
age block name score
player1 31 5 James 18
player2 30 2 Curry 25
player3 31 5 James 18
player4 35 3 Kobe 17
player5 38 2 Wade 15
# reindex has no inplace parameter: it always returns a new DataFrame, and labels that are missing from the original are filled with NaN
df_reindex = df.reindex(columns = ['name','age','block','score','reb'],
index = ['player1','player2','player3','player4','player5','player6'])
print(df_reindex)
name age block score reb
player1 James 31.0 5.0 18.0 NaN
player2 Curry 30.0 2.0 25.0 NaN
player3 James 31.0 5.0 18.0 NaN
player4 Kobe 35.0 3.0 17.0 NaN
player5 Wade 38.0 2.0 15.0 NaN
player6 NaN NaN NaN NaN NaN
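The NaN values that reindex introduces also turn the integer columns into floats, as the output above shows. A minimal sketch, using a cut-down two-row version of the example data (an assumption made for brevity), of how the fill_value parameter of reindex can supply a different placeholder:

```python
import pandas as pd

# Cut-down version of the example data (illustrative only)
df = pd.DataFrame(
    {'name': ['James', 'Curry'], 'age': [31, 30]},
    index=['player1', 'player2'],
)

# Labels that do not exist in df are filled with 0 instead of NaN
filled = df.reindex(index=['player1', 'player2', 'player6'],
                    columns=['name', 'age', 'reb'],
                    fill_value=0)
print(filled)

# reindex returned a new object, so df itself still has two rows
print(df.shape)
```

Whether 0 is a sensible placeholder depends on the column; for a text column such as name, a string like 'unknown' may read better.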
# Rename with dictionaries
new_index = {'player1':'PLAYER1'}
new_col = {'name':'Name','age':'Age'}
df_rename_dict = df.rename(index = new_index,columns = new_col) # pass inplace = True to modify the original DataFrame; otherwise a modified copy is returned
print(df_rename_dict)
Age block Name score
PLAYER1 31 5 James 18
player2 30 2 Curry 25
player3 31 5 James 18
player4 35 3 Kobe 17
player5 38 2 Wade 15
# Rename with a function
df_rename_fun = df.rename(columns = str.title)
print(df_rename_fun)
Age Block Name Score
player1 31 5 James 18
player2 30 2 Curry 25
player3 31 5 James 18
player4 35 3 Kobe 17
player5 38 2 Wade 15
# Rename with map; assigning the result back modifies the original DataFrame directly
df.columns = df.columns.map(str.title)
df.index = df.index.map(str.upper)
print(df)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Curry 25
PLAYER3 31 5 James 18
PLAYER4 35 3 Kobe 17
PLAYER5 38 2 Wade 15
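The difference between the two approaches above is easy to miss: rename returns a modified copy, while assigning to df.columns changes the DataFrame itself. A small sketch on a cut-down two-row frame (assumed here for brevity) that makes the difference visible:

```python
import pandas as pd

df = pd.DataFrame({'name': ['James', 'Curry'], 'age': [31, 30]},
                  index=['player1', 'player2'])

# rename accepts any callable, including a lambda, and returns a copy
renamed = df.rename(columns=str.title, index=lambda label: label.upper())
print(list(df.columns))       # df still has its original labels here
print(list(renamed.columns))

# Assigning the mapped labels back changes df itself
df.columns = df.columns.map(str.title)
print(list(df.columns))
```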
# Use one column as the index
df_set1index = df.set_index(['Name'])
print(df_set1index)
Age Block Score
Name
James 31 5 18
Curry 30 2 25
James 31 5 18
Kobe 35 3 17
Wade 38 2 15
# Use two columns as the index; by default the columns used as the index are removed from the DataFrame, and drop = False keeps them
df_set2index = df.set_index(['Name','Block'],drop = False)
print(df_set2index)
Age Block Name Score
Name Block
James 5 31 5 James 18
Curry 2 30 2 Curry 25
James 5 31 5 James 18
Kobe 3 35 3 Kobe 17
Wade 2 38 2 Wade 15
df_rIdx = df.reset_index() # by default (drop = False) the old index is kept as a column; pass drop = True to discard it
print(df_rIdx)
index Age Block Name Score
0 PLAYER1 31 5 James 18
1 PLAYER2 30 2 Curry 25
2 PLAYER3 31 5 James 18
3 PLAYER4 35 3 Kobe 17
4 PLAYER5 38 2 Wade 15
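As a quick check of how drop behaves in reset_index, and of how set_index and reset_index undo each other, here is a short sketch on a cut-down two-row frame (an assumption for brevity):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Curry'], 'Age': [31, 30]},
                  index=['PLAYER1', 'PLAYER2'])

kept = df.reset_index()              # old labels move into a column named 'index'
dropped = df.reset_index(drop=True)  # old labels are discarded
print(list(kept.columns))
print(list(dropped.columns))

# set_index followed by reset_index turns 'Name' back into an ordinary column
by_name = df.set_index('Name')
restored = by_name.reset_index()
print(list(restored.columns))
```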
This part mainly covers the different ways of deleting rows and columns.
df2 = df.copy()
print(df2)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Curry 25
PLAYER3 31 5 James 18
PLAYER4 35 3 Kobe 17
PLAYER5 38 2 Wade 15
# Delete a column with del
del df2['Age']
print(df2)
Block Name Score
PLAYER1 5 James 18
PLAYER2 2 Curry 25
PLAYER3 5 James 18
PLAYER4 3 Kobe 17
PLAYER5 2 Wade 15
# Delete with the drop method; axis defaults to 0 (rows), so set axis = 1 to delete columns
df2_drop = df2.drop(['Block','Score'],axis = 1)
print(df2_drop)
Name
PLAYER1 James
PLAYER2 Curry
PLAYER3 James
PLAYER4 Kobe
PLAYER5 Wade
df3 = df.copy()
print(df3)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Curry 25
PLAYER3 31 5 James 18
PLAYER4 35 3 Kobe 17
PLAYER5 38 2 Wade 15
# With the default axis = 0, drop deletes rows
df3_drop = df3.drop(['PLAYER1'])
print(df3_drop)
Age Block Name Score
PLAYER2 30 2 Curry 25
PLAYER3 31 5 James 18
PLAYER4 35 3 Kobe 17
PLAYER5 38 2 Wade 15
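Instead of remembering which axis number means what, drop also accepts the index and columns keywords. A minimal sketch on a reduced version of the example data; the column name 'Reb' is a hypothetical missing column, used only to show errors='ignore':

```python
import pandas as pd

df = pd.DataFrame({'Age': [31, 30, 35],
                   'Block': [5, 2, 3],
                   'Name': ['James', 'Curry', 'Kobe']},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER4'])

only_name = df.drop(columns=['Age', 'Block'])    # same as axis=1
without_first = df.drop(index=['PLAYER1'])       # same as axis=0

# By default a missing label raises KeyError; errors='ignore' skips it
unchanged = df.drop(columns=['Reb'], errors='ignore')

print(only_name)
print(without_first)
```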
df.duplicated()
PLAYER1 False
PLAYER2 False
PLAYER3 True
PLAYER4 False
PLAYER5 False
dtype: bool
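duplicated accepts subset = ['column name'] to judge duplicates on one or more columns only. Setting keep = 'last' marks the last occurrence of each duplicate group as False and the earlier occurrences as True.
Combined with sum, it quickly shows whether the DataFrame contains duplicate rows: the value sum returns is the number of duplicated rows.
df.duplicated().sum()
1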
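The subset and keep options described above can be sketched as follows, using only the Name and Block columns of the example data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Curry', 'James', 'Kobe', 'Wade'],
                   'Block': [5, 2, 5, 3, 2]},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4', 'PLAYER5'])

first = df.duplicated()                     # later repeats are marked True
last = df.duplicated(keep='last')           # earlier repeats are marked True
by_block = df.duplicated(subset=['Block'])  # compare the Block column only

print(first)
print(last)
print(by_block.sum())  # number of rows whose Block value appeared earlier
```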
df['Name'].is_unique
False
df3_drop['Name'].is_unique
True
# Judge duplicates by all columns
df_d = df.drop_duplicates()
print(df_d)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Curry 25
PLAYER4 35 3 Kobe 17
PLAYER5 38 2 Wade 15
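# Judge by one or more chosen columns; the first occurrence of each value combination is kept by default, keep = 'last' keeps the last one, and inplace = True replaces the original DataFrame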
df_d2 = df.drop_duplicates(subset = ['Block'],keep = 'last')
print(df_d2)
Age Block Name Score
PLAYER3 31 5 James 18
PLAYER4 35 3 Kobe 17
PLAYER5 38 2 Wade 15
df_data = df.copy()
df_data.loc['PLAYER2','Name'] = np.nan # label-based, so the result does not depend on the column order
df_data.iloc[2] = np.nan
print(df_data)
Age Block Name Score
PLAYER1 31.0 5.0 James 18.0
PLAYER2 30.0 2.0 NaN 25.0
PLAYER3 NaN NaN NaN NaN
PLAYER4 35.0 3.0 Kobe 17.0
PLAYER5 38.0 2.0 Wade 15.0
# By default a row is dropped as soon as it contains a single NaN
data_drop = df_data.dropna()
print(data_drop)
Age Block Name Score
PLAYER1 31.0 5.0 James 18.0
PLAYER4 35.0 3.0 Kobe 17.0
PLAYER5 38.0 2.0 Wade 15.0
# To drop only the rows that are entirely NaN, pass how = 'all'
data_drop2 = df_data.dropna(how = 'all')
print(data_drop2)
Age Block Name Score
PLAYER1 31.0 5.0 James 18.0
PLAYER2 30.0 2.0 NaN 25.0
PLAYER4 35.0 3.0 Kobe 17.0
PLAYER5 38.0 2.0 Wade 15.0
# To drop columns instead, pass axis = 1
data_drop3 = data_drop2.dropna(axis = 1)
print(data_drop3)
Age Block Score
PLAYER1 31.0 5.0 18.0
PLAYER2 30.0 2.0 25.0
PLAYER4 35.0 3.0 17.0
PLAYER5 38.0 2.0 15.0
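dropna also has three more parameters, (thresh=None, subset=None, inplace=False): thresh is the minimum number of non-NaN values a row (or column) must contain to be kept, subset restricts the NaN check to the listed column names when dropping rows, and inplace controls whether the original DataFrame is replaced.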
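A short sketch of thresh and subset, rebuilding the df_data frame from above by hand so that the block runs on its own:

```python
import numpy as np
import pandas as pd

df_data = pd.DataFrame(
    {'Age': [31, 30, np.nan, 35, 38],
     'Block': [5, 2, np.nan, 3, 2],
     'Name': ['James', np.nan, np.nan, 'Kobe', 'Wade'],
     'Score': [18, 25, np.nan, 17, 15]},
    index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4', 'PLAYER5'],
)

# Keep rows with at least 3 non-NaN values: only the all-NaN PLAYER3 is dropped
at_least_3 = df_data.dropna(thresh=3)

# Only look at the Name column: PLAYER2 and PLAYER3 are dropped
named_only = df_data.dropna(subset=['Name'])

print(at_least_3)
print(named_only)
```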
# Replace every NaN with the same value
df_1v = df_data.fillna(0)
print(df_1v)
Age Block Name Score
PLAYER1 31.0 5.0 James 18.0
PLAYER2 30.0 2.0 0 25.0
PLAYER3 0.0 0.0 0 0.0
PLAYER4 35.0 3.0 Kobe 17.0
PLAYER5 38.0 2.0 Wade 15.0
# Or pass a dictionary keyed by column name to use a different fill value per column
df_dict = df_data.fillna({'Age':30,'Block':10})
print(df_dict)
Age Block Name Score
PLAYER1 31.0 5.0 James 18.0
PLAYER2 30.0 2.0 NaN 25.0
PLAYER3 30.0 10.0 NaN NaN
PLAYER4 35.0 3.0 Kobe 17.0
PLAYER5 38.0 2.0 Wade 15.0
# Fill with the previous or next valid value using ffill() or bfill(); the older spelling fillna(method = 'ffill') is deprecated in recent pandas releases
df_m = df_data.ffill()
print(df_m)
Age Block Name Score
PLAYER1 31.0 5.0 James 18.0
PLAYER2 30.0 2.0 James 25.0
PLAYER3 30.0 2.0 James 25.0
PLAYER4 35.0 3.0 Kobe 17.0
PLAYER5 38.0 2.0 Wade 15.0
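fillna also has axis, limit, and inplace parameters, which set the axis to fill along, the maximum number of consecutive NaN values to fill forward or backward, and whether to replace the original DataFrame.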
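To see what limit does, a small Series with a run of three NaN values is enough. The sketch below uses made-up numbers, and it also shows filling with the mean, a common choice that is not part of the examples above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# limit=1 fills only the first NaN of the run; the other two stay NaN
one_step = s.ffill(limit=1)

# Fill every NaN with the mean of the non-NaN values, here (1 + 5) / 2
mean_filled = s.fillna(s.mean())

print(one_step)
print(mean_filled)
```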
# Use a list to replace several different values with the same value
# Replace Curry and Kobe with Stephen
df_replace = df.replace(['Curry','Kobe'],'Stephen')
print(df_replace)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Stephen 25
PLAYER3 31 5 James 18
PLAYER4 35 3 Stephen 17
PLAYER5 38 2 Wade 15
# Use a dictionary to replace different values with different values
# Replace Curry with Stephen and Kobe with Bryant
df_reDict = df.replace({'Curry':'Stephen','Kobe':'Bryant'})
print(df_reDict)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Stephen 25
PLAYER3 31 5 James 18
PLAYER4 35 3 Bryant 17
PLAYER5 38 2 Wade 15
# Use two lists to replace different values with different values
# Replace Curry with Stephen and Kobe with Bryant
df_reList = df.replace(['Curry','Kobe'],['Stephen','Bryant'])
print(df_reList)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Stephen 25
PLAYER3 31 5 James 18
PLAYER4 35 3 Bryant 17
PLAYER5 38 2 Wade 15
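All of the replace calls above search every column. When the same value appears in several columns and only one of them should change, a nested dictionary of the form {column: {old: new}} limits the replacement. A minimal sketch with made-up numbers chosen so that 2 appears in both columns:

```python
import pandas as pd

# Illustrative data: the value 2 occurs in both Block and Score
df = pd.DataFrame({'Block': [5, 2, 3], 'Score': [2, 25, 17]},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER4'])

everywhere = df.replace(2, 0)             # changes Block and Score
block_only = df.replace({'Block': {2: 0}})  # changes Block only

print(everywhere)
print(block_only)
```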
print(df)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Curry 25
PLAYER3 31 5 James 18
PLAYER4 35 3 Kobe 17
PLAYER5 38 2 Wade 15
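# A single row
df.loc['PLAYER1']
# Several rows
df.loc[['PLAYER1','PLAYER3']] # note that a list is passed, hence the inner square brackets
# A contiguous range of rows
df.loc['PLAYER1':'PLAYER3'] # note that the label to the right of the colon is included
# A single column
df['Age']
# Several columns
df[['Age','Name']] # note that a list is passed, hence the inner square brackets
# Selecting a column with loc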
df.loc[:,'Name']
PLAYER1 James
PLAYER2 Curry
PLAYER3 James
PLAYER4 Kobe
PLAYER5 Wade
Name: Name, dtype: object
df.loc['PLAYER2',['Age','Name']]
Age 30
Name Curry
Name: PLAYER2, dtype: object
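# A single row
df.iloc[1]
# A contiguous range of rows
df.iloc[1:3] # the position to the right of the colon is excluded
# A single column
df.iloc[:,1]
# A contiguous range of columns
df.iloc[:,1:3] # the position to the right of the colon is excluded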
df.iloc[0:2,2:4]
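The one difference between loc and iloc that most often causes off-by-one mistakes is the slice end: label slices include it, position slices do not. A short sketch on a two-column version of the example data:

```python
import pandas as pd

df = pd.DataFrame({'Age': [31, 30, 31, 35, 38],
                   'Name': ['James', 'Curry', 'James', 'Kobe', 'Wade']},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4', 'PLAYER5'])

by_label = df.loc['PLAYER1':'PLAYER3']  # PLAYER3 is included
by_position = df.iloc[0:2]              # position 2 is excluded

print(len(by_label), len(by_position))

# A single cell can be read either way
age_by_label = df.loc['PLAYER2', 'Age']
age_by_position = df.iloc[1, df.columns.get_loc('Age')]
print(age_by_label, age_by_position)
```

Looking the column position up with get_loc keeps the iloc call correct even if the column order changes.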
df_logic = df[df['Score']>17]
print(df_logic)
Age Block Name Score
PLAYER1 31 5 James 18
PLAYER2 30 2 Curry 25
PLAYER3 31 5 James 18
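Boolean conditions can also be combined. The sketch below goes slightly beyond the single comparison shown above: & means and, ~ means not, each condition needs its own parentheses, and isin tests membership in a list.

```python
import pandas as pd

df = pd.DataFrame({'Age': [31, 30, 31, 35, 38],
                   'Name': ['James', 'Curry', 'James', 'Kobe', 'Wade'],
                   'Score': [18, 25, 18, 17, 15]},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4', 'PLAYER5'])

# Both conditions must hold; the parentheses are required
young_scorers = df[(df['Score'] > 17) & (df['Age'] < 31)]

# Negate a condition with ~
low_scorers = df[~(df['Score'] > 17)]

# Keep rows whose Name is one of the listed values
veterans = df[df['Name'].isin(['Kobe', 'Wade'])]

print(young_scorers)
print(low_scorers)
print(veterans)
```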
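# sort_index sorts by labels: axis chooses the row labels (0) or the column labels (1), and ascending chooses ascending or descending order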
df_sort = df.sort_index(axis = 1,ascending = False)
print(df_sort)
Score Name Block Age
PLAYER1 18 James 5 31
PLAYER2 25 Curry 2 30
PLAYER3 18 James 5 31
PLAYER4 17 Kobe 3 35
PLAYER5 15 Wade 2 38
# sort_values sorts by the values in the given columns; ascending chooses ascending or descending order
df_sort2 = df.sort_values(by = ['Age','Score'])
print(df_sort2)
Age Block Name Score
PLAYER2 30 2 Curry 25
PLAYER1 31 5 James 18
PLAYER3 31 5 James 18
PLAYER4 35 3 Kobe 17
PLAYER5 38 2 Wade 15
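When sorting by several columns, ascending can also be a list with one entry per column. A minimal sketch that sorts the example data by Block in descending order and breaks ties by Score in ascending order:

```python
import pandas as pd

df = pd.DataFrame({'Block': [5, 2, 5, 3, 2],
                   'Name': ['James', 'Curry', 'James', 'Kobe', 'Wade'],
                   'Score': [18, 25, 18, 17, 15]},
                  index=['PLAYER1', 'PLAYER2', 'PLAYER3', 'PLAYER4', 'PLAYER5'])

# Block from high to low; within the same Block, Score from low to high
result = df.sort_values(by=['Block', 'Score'], ascending=[False, True])
print(result)

# Row labels in reverse order
reversed_rows = df.sort_index(ascending=False)
print(reversed_rows)
```

Among the two players with Block equal to 2, Wade (Score 15) comes before Curry (Score 25) because the second sort key is ascending.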