1. drop_duplicates() returns a DataFrame with duplicate rows removed, optionally considering only certain columns. Indexes, including time indexes, are ignored.
pandas.DataFrame.drop_duplicates official documentation
Method:
DataFrame.drop_duplicates(self, subset=None, keep='first', inplace=False)
Parameters:
subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates; by default, use all of the columns.
keep : {'first', 'last', False}, default 'first'
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
inplace : boolean, default False
Whether to drop duplicates in place or to return a copy. By default, a copy is returned.
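The examples in this section all compare whole rows, so the subset parameter never appears in them. The following is a minimal sketch of how subset might be used on the same sample data; the variable names by_gender and by_pair are chosen here for illustration only.

```python
import pandas as pd

# Same sample data as the examples in this section.
data = pd.DataFrame({
    'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single'],
})

# Compare rows on 'Gender' only: the first row of each gender is kept.
by_gender = data.drop_duplicates(subset='Gender')

# Compare rows on two columns and keep the last occurrence of each pair.
by_pair = data.drop_duplicates(subset=['Gender', 'MaritalStatus'], keep='last')

print(by_gender.index.tolist())  # [0, 1]
print(by_pair.index.tolist())    # [5, 6, 8]
```

Only the columns named in subset are compared; the rows that survive still carry all of their columns.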
Examples:
1.1 Create a DataFrame
import pandas as pd
data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married','Divorced', 'Single', 'Single']})
Output of the original data; the rows at index 1, 4, and 7 are duplicates of one another:
Age Gender MaritalStatus
0 37 Male Divorced
1 54 Female Single
2 38 Male Married
3 24 Male Married
4 54 Female Single
5 33 Male Married
6 54 Male Divorced
7 54 Female Single
8 18 Female Single
1.2 keep='first': drop duplicate rows, keeping only the first occurrence
data_first = data.drop_duplicates(keep='first')
print(data_first)
The duplicate rows at index 4 and 7 are removed and the one at index 1 is kept. Output:
Age Gender MaritalStatus
0 37 Male Divorced
1 54 Female Single
2 38 Male Married
3 24 Male Married
5 33 Male Married
6 54 Male Divorced
8 18 Female Single
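The output above keeps the original index labels, so 4 and 7 are simply missing. If consecutive labels are wanted, one option is to call reset_index(drop=True) on the result. The sketch below assumes the same sample data; data_renumbered is a name introduced here for illustration. Newer pandas releases also accept ignore_index=True on drop_duplicates itself, which is not part of the signature quoted in this article.

```python
import pandas as pd

# Same sample data as in 1.1.
data = pd.DataFrame({
    'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single'],
})

data_first = data.drop_duplicates(keep='first')

# The surviving rows keep their original labels, leaving gaps at 4 and 7.
print(data_first.index.tolist())       # [0, 1, 2, 3, 5, 6, 8]

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
data_renumbered = data_first.reset_index(drop=True)
print(data_renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```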
1.3 keep='last': drop duplicate rows, keeping only the last occurrence
data_last = data.drop_duplicates(keep='last')
print(data_last)
The duplicate rows at index 1 and 4 are removed and the one at index 7 is kept. Output:
Age Gender MaritalStatus
0 37 Male Divorced
2 38 Male Married
3 24 Male Married
5 33 Male Married
6 54 Male Divorced
7 54 Female Single
8 18 Female Single
1.4 keep=False: drop all duplicate rows
data_false = data.drop_duplicates(keep=False)
print(data_false)
All of the duplicate rows at index 1, 4, and 7 are removed. Output:
Age Gender MaritalStatus
0 37 Male Divorced
2 38 Male Married
3 24 Male Married
5 33 Male Married
6 54 Male Divorced
8 18 Female Single
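1.5 Default inplace=False: return a copy
data.drop_duplicates(inplace=False, keep='last')
print(data)
The call returns a new DataFrame with the duplicates removed, but data itself is left unchanged. Output: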
Age Gender MaritalStatus
0 37 Male Divorced
1 54 Female Single
2 38 Male Married
3 24 Male Married
4 54 Female Single
5 33 Male Married
6 54 Male Divorced
7 54 Female Single
8 18 Female Single
1.6 inplace=True: modify the original data directly
data.drop_duplicates(inplace=True, keep='last')
print(data)
The duplicate rows at index 1 and 4 are removed from data itself and the one at index 7 is kept. Output:
Age Gender MaritalStatus
0 37 Male Divorced
2 38 Male Married
3 24 Male Married
5 33 Male Married
6 54 Male Divorced
7 54 Female Single
8 18 Female Single
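As a cross-check on 1.5 and 1.6, the sketch below compares the in-place call with the common alternative of reassigning the returned copy. It assumes the same sample data; data_a and data_b are names introduced here for illustration.

```python
import pandas as pd

# Same sample data as in 1.1, built twice so the two approaches are independent.
columns = {
    'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single'],
}
data_a = pd.DataFrame(columns)
data_b = pd.DataFrame(columns)

# Approach 1: modify data_a in place.
data_a.drop_duplicates(inplace=True, keep='last')

# Approach 2: keep the default inplace=False and rebind the name to the copy.
data_b = data_b.drop_duplicates(keep='last')

print(data_a.index.tolist())  # [0, 2, 3, 5, 6, 7, 8]
print(data_a.equals(data_b))  # True
```

Both approaches end with the same rows; the reassignment form makes it explicit which object holds the deduplicated data.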
2. duplicated() returns a Boolean Series (False or True) that marks duplicate rows, optionally considering only certain columns.
pandas.DataFrame.duplicated official documentation
Method:
DataFrame.duplicated(self, subset=None, keep='first')
Parameters:
subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates; by default, use all of the columns.
keep : {'first', 'last', False}, default 'first'
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
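The examples in this section call duplicated() on whole rows only. As a minimal sketch, the same method can be restricted with subset, and the resulting Boolean Series can be summed to count the marked rows; the variable names mask and n_marked are illustrative.

```python
import pandas as pd

# Same sample data as the examples in this section.
data = pd.DataFrame({
    'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single'],
})

# Compare rows on two columns only; the first occurrence of each pair stays False.
mask = data.duplicated(subset=['Gender', 'MaritalStatus'])
print(mask.tolist())
# [False, False, False, True, True, True, True, True, True]

# True counts as 1, so the sum is the number of rows marked as duplicates.
n_marked = int(mask.sum())
print(n_marked)  # 6
```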
Examples:
2.1 Create a DataFrame
import pandas as pd
data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married','Divorced', 'Single', 'Single']})
Output of the original data; the rows at index 1, 4, and 7 are duplicates of one another:
Age Gender MaritalStatus
0 37 Male Divorced
1 54 Female Single
2 38 Male Married
3 24 Male Married
4 54 Female Single
5 33 Male Married
6 54 Male Divorced
7 54 Female Single
8 18 Female Single
2.2 keep='first': mark duplicates as True except for the first occurrence
data_first = data.duplicated(keep='first')
print(data_first)
The duplicate rows at index 4 and 7 are marked True and the one at index 1 is marked False. Output:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
dtype: bool
2.3 keep='last': mark duplicates as True except for the last occurrence
data_last = data.duplicated(keep='last')
print(data_last)
The duplicate rows at index 1 and 4 are marked True and the one at index 7 is marked False. Output:
0 False
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 False
dtype: bool
2.4 keep=False: mark all duplicates as True
data_false = data.duplicated(keep=False)
print(data_false)
The duplicate rows at index 1, 4, and 7 are all marked True. Output:
0 False
1 True
2 False
3 False
4 True
5 False
6 False
7 True
8 False
dtype: bool
Select the entries marked as duplicates:
duplicated_rows = data_false[data_false]
print(duplicated_rows)
Only the entries for index 1, 4, and 7 remain. Output:
1 True
4 True
7 True
dtype: bool
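The selection above filters the Boolean Series itself, so it yields index labels and True values, not the data. As a sketch under the same sample data, the mask can instead index the DataFrame to retrieve the duplicated rows, and negating the keep='first' mask gives the same result as drop_duplicates(). The names mask_all, duplicate_rows, and unique_rows are illustrative.

```python
import pandas as pd

# Same sample data as in 2.1.
data = pd.DataFrame({
    'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single'],
})

# Index the DataFrame with the mask to get the duplicated rows themselves.
mask_all = data.duplicated(keep=False)
duplicate_rows = data[mask_all]
print(duplicate_rows.index.tolist())  # [1, 4, 7]

# Negating the keep='first' mask keeps exactly the rows drop_duplicates() keeps.
unique_rows = data[~data.duplicated(keep='first')]
print(unique_rows.equals(data.drop_duplicates(keep='first')))  # True
```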