DataFrame.duplicated(subset=None, keep='first')
Return a boolean Series denoting duplicate rows, optionally considering only certain columns.
Parameters:
    subset : column label or sequence of labels, optional
    keep : {'first', 'last', False}, default 'first'
Returns:
    duplicated : Series
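As a quick illustration of how `subset` and `keep` change the result, here is a minimal sketch. The DataFrame `df` and its columns are hypothetical toy data, not taken from the post.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the parameters.
df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'a'],
    'year': [2016, 2016, 2016, 2017],
})

# No subset: all columns are compared, so only row 1 repeats an earlier row.
all_cols = df.duplicated()

# subset='name': only 'name' is compared, so rows 1 and 3 repeat row 0.
by_name = df.duplicated(subset='name')

# keep=False: every member of a duplicated group is flagged, including the first.
every_dup = df.duplicated(subset=['name', 'year'], keep=False)

print(all_cols.tolist())
print(by_name.tolist())
print(every_dup.tolist())
```

With `keep='first'` or `keep='last'`, one row of each duplicated group is left unflagged; `keep=False` is the option to reach for when every copy needs to be inspected.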
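pandas does not require loading the data into a database first, so it looks like handling the PCI confusion problem with pandas is faster.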
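The post does not show the PCI data itself. The following is only a sketch, assuming "PCI confusion" refers to a serving cell having two or more neighbor cells with the same PCI. The table `neighbors`, its column names, and its values are all hypothetical.

```python
import pandas as pd

# Hypothetical neighbor-relation table; names and values are made up for illustration.
neighbors = pd.DataFrame({
    'serving_cell': ['A', 'A', 'A', 'B', 'B'],
    'neighbor_cell': ['N1', 'N2', 'N3', 'N1', 'N4'],
    'neighbor_pci': [101, 101, 205, 101, 300],
})

# keep=False flags every neighbor that shares a PCI with another
# neighbor of the same serving cell, so all conflicting rows are returned.
mask = neighbors.duplicated(subset=['serving_cell', 'neighbor_pci'], keep=False)
confused = neighbors[mask]
print(confused)
```

Under these assumptions the check runs entirely in memory, which is the advantage over a database round trip that the sentence above points to.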
https://blog.csdn.net/hguo11/article/details/82556171
The code is as follows:
import pandas as pd

# Sample salary table; the 'Bonus' column contains repeated values.
salaries = pd.DataFrame({
    'name': ['BOSS', 'Lilei', 'Lilei', 'Han', 'BOSS', 'BOSS', 'Han', 'BOSS'],
    'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
    'Salary': [1, 2, 3, 4, 5, 6, 7, 8],
    'Bonus': [2, 2, 2, 2, 3, 4, 5, 6]
})
print(salaries)

# keep='first': the first occurrence of each value is not flagged; later repeats are True.
print(salaries['Bonus'].duplicated(keep='first'))
# Index labels of the flagged rows.
print(salaries[salaries['Bonus'].duplicated(keep='first')].index)
# The flagged rows themselves.
print(salaries[salaries['Bonus'].duplicated(keep='first')])

# keep='last': the last occurrence of each value is not flagged; earlier repeats are True.
print(salaries['Bonus'].duplicated(keep='last'))
print(salaries[salaries['Bonus'].duplicated(keep='last')].index)
print(salaries[salaries['Bonus'].duplicated(keep='last')])
The output is as follows:
Bonus Salary Year name
0 2 1 2016 BOSS
1 2 2 2016 Lilei
2 2 3 2016 Lilei
3 2 4 2016 Han
4 3 5 2017 BOSS
5 4 6 2017 BOSS
6 5 7 2017 Han
7 6 8 2017 BOSS
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 False
Name: Bonus, dtype: bool
Int64Index([1, 2, 3], dtype='int64')
Bonus Salary Year name
1 2 2 2016 Lilei
2 2 3 2016 Lilei
3 2 4 2016 Han
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 False
Name: Bonus, dtype: bool
Int64Index([0, 1, 2], dtype='int64')
Bonus Salary Year name
0 2 1 2016 BOSS
1 2 2 2016 Lilei
2 2 3 2016 Lilei
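The output above covers `keep='first'` and `keep='last'`. As a complementary sketch on the same `salaries` data, `keep=False` flags every repeated row, and inverting the mask with `~` keeps one row per distinct `Bonus` value. The variable names `all_repeats`, `first_only`, and `deduped` are introduced here for illustration only.

```python
import pandas as pd

salaries = pd.DataFrame({
    'name': ['BOSS', 'Lilei', 'Lilei', 'Han', 'BOSS', 'BOSS', 'Han', 'BOSS'],
    'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
    'Salary': [1, 2, 3, 4, 5, 6, 7, 8],
    'Bonus': [2, 2, 2, 2, 3, 4, 5, 6]
})

# keep=False flags every row whose Bonus value occurs more than once.
all_repeats = salaries[salaries['Bonus'].duplicated(keep=False)]

# Inverting the keep='first' mask keeps one row per distinct Bonus value.
first_only = salaries[~salaries['Bonus'].duplicated(keep='first')]

# drop_duplicates with the same subset and keep selects the same rows.
deduped = salaries.drop_duplicates(subset='Bonus', keep='first')

print(all_repeats.index.tolist())
print(first_only.index.tolist())
print(first_only.equals(deduped))
```

The boolean mask is the more flexible form because it can be combined with other conditions; `drop_duplicates` is the shorter choice when only the deduplicated table is needed.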
---------------------
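Author: 耗子来啦
Source: CSDN
Original article: https://blog.csdn.net/hguo11/article/details/82556171
Copyright notice: This is an original article by the blogger. Please include a link to the post when reposting.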