pandas: removing duplicates with drop_duplicates() and flagging duplicate rows with duplicated()

 

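1. drop_duplicates(): returns a DataFrame with duplicate rows removed, optionally considering only certain columns. Indexes, including time indexes, are ignored when rows are compared.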

Official documentation: pandas.DataFrame.drop_duplicates

Signature:

DataFrame.drop_duplicates(self, subset=None, keep='first', inplace=False)

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates; by default all columns are used.

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates, so that no copy of a repeated row is kept.

inplace : boolean, default False

Whether to drop duplicates in place or to return a copy. With inplace=True the original DataFrame is modified and the method returns None.
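The subset parameter is described above but not exercised in the examples that follow. As a minimal sketch, assuming the same sample data used throughout this post (the variable names by_two_cols and by_age are only illustrative), restricting the comparison to some columns could look like this:

```python
import pandas as pd

data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
                     'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                     'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single']})

# Compare rows on 'Gender' and 'MaritalStatus' only; 'Age' is ignored.
# The first row of each (Gender, MaritalStatus) combination is kept.
by_two_cols = data.drop_duplicates(subset=['Gender', 'MaritalStatus'])
print(by_two_cols)

# A single column label is accepted as well; here the last row per 'Age' is kept.
by_age = data.drop_duplicates(subset='Age', keep='last')
print(by_age)
```

Only three distinct (Gender, MaritalStatus) combinations exist in this data, so the first result keeps the rows with index 0, 1 and 2.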

Examples:

1.1 Create a DataFrame

import pandas as pd  

data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
                     'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                     'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married','Divorced', 'Single', 'Single']})

Original data. The rows with index 1, 4 and 7 are identical:

   Age  Gender MaritalStatus
0   37    Male      Divorced
1   54  Female        Single
2   38    Male       Married
3   24    Male       Married
4   54  Female        Single
5   33    Male       Married
6   54    Male      Divorced
7   54  Female        Single
8   18  Female        Single

1.2 keep='first': drop duplicates, keeping only the first occurrence

data_first = data.drop_duplicates(keep='first')
print(data_first)

Rows 4 and 7 are dropped and row 1, the first occurrence, is kept. Output:

   Age  Gender MaritalStatus
0   37    Male      Divorced
1   54  Female        Single
2   38    Male       Married
3   24    Male       Married
5   33    Male       Married
6   54    Male      Divorced
8   18  Female        Single

1.3 keep='last': drop duplicates, keeping only the last occurrence

data_last = data.drop_duplicates(keep='last')
print(data_last)

Rows 1 and 4 are dropped and row 7, the last occurrence, is kept. Output:

   Age  Gender MaritalStatus
0   37    Male      Divorced
2   38    Male       Married
3   24    Male       Married
5   33    Male       Married
6   54    Male      Divorced
7   54  Female        Single
8   18  Female        Single

1.4 keep=False: drop all duplicated rows

data_false = data.drop_duplicates(keep=False)
print(data_false)

Rows 1, 4 and 7 are all dropped. Output:

   Age  Gender MaritalStatus
0   37    Male      Divorced
2   38    Male       Married
3   24    Male       Married
5   33    Male       Married
6   54    Male      Divorced
8   18  Female        Single

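1.5 inplace=False (the default) returns a copy

data.drop_duplicates(inplace=False, keep='last')
print(data)

The call returns a de-duplicated copy, which is discarded here because it is not assigned to anything. data itself is unchanged, so printing it still shows all nine rows: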

   Age  Gender MaritalStatus
0   37    Male      Divorced
1   54  Female        Single
2   38    Male       Married
3   24    Male       Married
4   54  Female        Single
5   33    Male       Married
6   54    Male      Divorced
7   54  Female        Single
8   18  Female        Single

1.6 inplace=True modifies the original data

data.drop_duplicates(inplace=True, keep='last')
print(data)

Rows 1 and 4 are dropped and row 7, the last occurrence, is kept. Output:

   Age  Gender MaritalStatus
0   37    Male      Divorced
2   38    Male       Married
3   24    Male       Married
5   33    Male       Married
6   54    Male      Divorced
7   54  Female        Single
8   18  Female        Single
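To make the difference between the two inplace settings concrete, here is a small sketch on the same sample data. The names deduped, rows_before, rows_in_copy and result are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
                     'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                     'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single']})

# inplace=False (the default): the result is a new DataFrame, `data` is untouched.
deduped = data.drop_duplicates(keep='last')
rows_before = len(data)       # still the full original
rows_in_copy = len(deduped)   # duplicates removed in the copy only

# inplace=True: `data` itself is modified and the call returns None.
result = data.drop_duplicates(keep='last', inplace=True)

print(rows_before, rows_in_copy, len(data), result)
```

Because the in-place call returns None, writing data = data.drop_duplicates(inplace=True) would replace the DataFrame with None. Either assign the returned copy or use inplace=True, not both.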

 

2. duplicated(): returns a boolean Series (True or False) that marks duplicate rows, optionally considering only certain columns.

Official documentation: pandas.DataFrame.duplicated

Signature:

DataFrame.duplicated(self, subset=None, keep='first')

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates; by default all columns are used.

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
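As with drop_duplicates(), subset limits which columns are compared. The sketch below, on the same sample data, flags rows whose Age has already appeared and counts them by summing the boolean Series. The name age_dupes is illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
                     'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                     'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single']})

# Flag every row whose 'Age' value was already seen in an earlier row
# (keep='first' is the default, so the first occurrence stays False).
age_dupes = data.duplicated(subset='Age')
print(age_dupes)

# True counts as 1, so the sum is the number of flagged rows.
print(age_dupes.sum())
```

Age 54 appears four times, so with the default keep='first' three of those rows are flagged.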

Examples:

2.1 Create a DataFrame

import pandas as pd  

data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
                     'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                     'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married','Divorced', 'Single', 'Single']})

Original data. The rows with index 1, 4 and 7 are identical:

   Age  Gender MaritalStatus
0   37    Male      Divorced
1   54  Female        Single
2   38    Male       Married
3   24    Male       Married
4   54  Female        Single
5   33    Male       Married
6   54    Male      Divorced
7   54  Female        Single
8   18  Female        Single

2.2 keep='first': mark duplicates as True except for the first occurrence

data_first = data.duplicated(keep='first')
print(data_first)

Rows 4 and 7 are marked True, while row 1, the first occurrence, stays False. Output:

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7     True
8    False
dtype: bool

2.3 keep='last': mark duplicates as True except for the last occurrence

data_last = data.duplicated(keep='last')
print(data_last)

Rows 1 and 4 are marked True, while row 7, the last occurrence, stays False. Output:

0    False
1     True
2    False
3    False
4     True
5    False
6    False
7    False
8    False
dtype: bool

2.4 keep=False: mark all duplicates as True

data_false = data.duplicated(keep=False)
print(data_false)

Rows 1, 4 and 7 are all marked True. Output:

0    False
1     True
2    False
3    False
4     True
5    False
6    False
7     True
8    False
dtype: bool
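The two functions are closely related: inverting the mask from duplicated() and using it to index the DataFrame gives the same rows as drop_duplicates() with the same keep value. The following is a rough check of that on the sample data, with a hypothetical results dictionary used only to collect the comparisons.

```python
import pandas as pd

data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
                     'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                     'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single']})

results = {}
for keep in ('first', 'last', False):
    mask = data.duplicated(keep=keep)            # True marks rows to discard
    via_mask = data[~mask]                       # keep the rows that are NOT flagged
    via_drop = data.drop_duplicates(keep=keep)   # built-in equivalent
    results[keep] = via_mask.equals(via_drop)

print(results)
```

Working with the mask directly is useful when the flagged rows themselves are needed, for example to inspect or log them before discarding.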

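Select the duplicated entries:

duplicated_rows = data_false[data_false]
print(duplicated_rows)

This keeps only the True entries of the boolean Series, so the result gives the index labels of the duplicate rows (1, 4 and 7), not the row contents. Output:

1    True
4    True
7    True
dtype: bool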
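Filtering the boolean Series by itself yields only index labels and True values. To retrieve the duplicated rows with all their columns, the mask can be used to index the DataFrame instead. A minimal sketch on the same data follows; mask and duplicate_rows are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({'Age': [37, 54, 38, 24, 54, 33, 54, 54, 18],
                     'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
                     'MaritalStatus': ['Divorced', 'Single', 'Married', 'Married', 'Single', 'Married', 'Divorced', 'Single', 'Single']})

# keep=False flags every member of a duplicate group, including the first one.
mask = data.duplicated(keep=False)

# Boolean indexing on the DataFrame returns the full rows where mask is True.
duplicate_rows = data[mask]
print(duplicate_rows)
```

Here duplicate_rows contains the three identical rows (54, Female, Single) with their original index labels 1, 4 and 7.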
