I'm not sure whether I have answered your question correctly, so I'll offer a few different approaches.
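If you want to remove all rows that share the same id, Type, and Time values, you can use the following:
import pandas as pd
# read_excel already returns a DataFrame, so no extra conversion is needed
df = pd.read_excel(io=r"D:\xxxxxx\test.xlsx")
# keep the first row of each (id, Type, Time) group and drop the later repeats
drop_dup = df.drop_duplicates(subset=["id", "Type", "Time"])
print(drop_dup)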
The result is:
[output not preserved]
This means that 7 rows have exactly the same Type, id, and Time as another row.
If you want to remove only rows that are completely identical (comparing all columns), this gives the desired result:
df = df.drop_duplicates()
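To make the difference between the two calls concrete, here is a minimal sketch on a small in-memory DataFrame. The data are made up for illustration and only stand in for the Excel file; the column names follow the question.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data (values invented for illustration).
df = pd.DataFrame({
    "id": [1, 2, 3, 3, 3],
    "Type": ["Tea", "Tea", "Coffee", "Tea", "Coffee"],
    "Time": ["18:08", "18:08", "21:04", "07:38", "21:04"],
    "Attribute": [None, None, "No", "No", None],
})

# Rows 2 and 4 agree on id, Type and Time, so row 4 is dropped
# (keep="first" is the default).
by_subset = df.drop_duplicates(subset=["id", "Type", "Time"])

# Comparing all columns: rows 2 and 4 differ in Attribute, so nothing is dropped.
by_all_columns = df.drop_duplicates()

print(len(df) - len(by_subset))       # rows removed when comparing the subset
print(len(df) - len(by_all_columns))  # rows removed when comparing every column
```

Comparing the row counts before and after is a quick way to see how many rows each variant treats as duplicates.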
In addition:
dup = df.duplicated(subset=["id", "Type", "Time"])
returns a boolean Series (True/False) that indicates whether each row is a duplicate of an earlier row:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
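Which rows get flagged depends on the keep parameter of duplicated. The following sketch, again on invented data, compares the three settings side by side.

```python
import pandas as pd

# Hypothetical data: rows 2 and 4 share the same id, Type and Time.
df = pd.DataFrame({
    "id": [1, 2, 3, 3, 3],
    "Type": ["Tea", "Tea", "Coffee", "Tea", "Coffee"],
    "Time": ["18:08", "18:08", "21:04", "07:38", "21:04"],
})
cols = ["id", "Type", "Time"]

first = df.duplicated(subset=cols)              # default keep="first": flags later repeats only
last = df.duplicated(subset=cols, keep="last")  # flags earlier repeats only
every = df.duplicated(subset=cols, keep=False)  # flags every member of a duplicate group

print(pd.DataFrame({"first": first, "last": last, "every": every}))
```

With the default, the first occurrence is never flagged, which is why only one row is True in the output above.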
If you want to know which individual values of the DataFrame are duplicated, use:
dupl_val = df.apply(pd.Series.duplicated, axis=1)
id Duplicate 1 Duplicate 2 Total Duplicates Time Type Attribute
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False True False False False
5 False False True True False False False
6 False False False True False False False
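pd.Series.duplicated is used here because apply passes the function one pandas Series at a time. Note that with axis=1 each Series is a row of the DataFrame (axis=0 would pass each column instead), so values are compared within a row. This is why, in the output above, Total Duplicates is flagged when it repeats the value of Duplicate 1 in the same row.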
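The effect of the axis argument is easy to check on a tiny frame. This is only a sketch with invented numbers and placeholder column names a, b, c.

```python
import pandas as pd

# Hypothetical numeric data chosen so that rows and columns contain different repeats.
df = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [1, 5, 6],
    "c": [7, 5, 2],
})

# axis=1: each row is passed as a Series, so repeats are looked for within a row.
by_row = df.apply(pd.Series.duplicated, axis=1)

# axis=0: each column is passed as a Series, so repeats are looked for within a column.
by_col = df.apply(pd.Series.duplicated, axis=0)

print(by_row)
print(by_col)
```

Both results have the same shape as df, but the True values sit in different places because the comparison runs in a different direction.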
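If you don't want to drop rows and only want to mark which values are duplicates, use:
dupl_val = df.apply(pd.Series.duplicated, axis=1)
# keep values where the mask is False, replace the rest with the marker string
df = df.where(~dupl_val, "duplicate")
print(df)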
id Duplicate 1 Duplicate 2 Total Duplicates \
0 121349100 NaN NaN NaN
1 121350610 NaN NaN NaN
2 124426041 NaN NaN NaN
3 124436734 NaN NaN NaN
4 124451775 1 NaN duplicate
5 124451775 1 duplicate duplicate
6 124451775 NaN 1 duplicate
Time Type Attribute
0 2017-04-19 18:08:00 Tea NaN
1 2017-04-19 18:08:00 Tea NaN
2 2017-05-05 12:21:00 Tea NaN
3 2017-04-25 15:20:00 Coffee NaN
4 2017-04-05 21:04:00 Coffee No
5 2017-06-05 07:38:00 Tea No
6 2017-04-05 21:04:00 Coffee NaN
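As a small illustration of the where call, the sketch below applies the same idea to invented numeric data. The cast to object is an assumption of this sketch, added so that integer columns can hold the marker string without a dtype conflict.

```python
import pandas as pd

# Hypothetical data with placeholder column names.
df = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [1, 5, 6],
    "c": [7, 5, 2],
})

# True where a value repeats an earlier value in the same row.
mask = df.apply(pd.Series.duplicated, axis=1)

# where keeps entries whose condition is True, so the mask is inverted;
# everything flagged by the mask is replaced with the marker string.
flagged = df.astype(object).where(~mask, "duplicate")

print(flagged)
```

DataFrame.mask(mask, "duplicate") expresses the same replacement without the inversion.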
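Edit:
If you only want to set the Attribute column to a special value (I chose "duplicate") when a row's "id", "Type", and "Time" values repeat those of another row, and you don't want to change the values in the other columns, this code should give the desired result:
import pandas as pd
df = pd.read_excel(io=r"D:\xxxxx\test.xlsx")
dup = df.duplicated(subset=["id", "Type", "Time"])
duplicate = "duplicate"
# dup is a boolean Series; this loop assumes the default 0..n-1 index
for i in range(len(dup)):
    if dup[i]:
        df.loc[i, "Attribute"] = duplicate
print(df)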
id Duplicate 1 Duplicate 2 Total Duplicates \
0 121349100 NaN NaN NaN
1 121350610 NaN NaN NaN
2 124426041 NaN NaN NaN
3 124436734 NaN NaN NaN
4 124451775 1.0 NaN 1.0
5 124451775 1.0 1.0 1.0
6 124451775 NaN 1.0 1.0
7 124463136 NaN NaN NaN
Time Type Attribute
0 2017-04-19 18:08:00 Tea NaN
1 2017-04-19 18:08:00 Tea NaN
2 2017-05-05 12:21:00 Tea NaN
3 2017-04-25 15:20:00 Coffee NaN
4 2017-04-05 21:04:00 Coffee No
5 2017-06-05 07:38:00 Tea No
6 2017-04-05 21:04:00 Coffee duplicate
7 2017-06-05 05:40:00 Coffee NaN
[85 rows x 7 columns]
As you can see, row 6 (= row 8 in the original Excel file) contains the first duplicate. In this example, it is a duplicate of row 6 in the Excel file.
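The loop can also be written as a single boolean-mask assignment. This is a sketch of an alternative, not the code used to produce the output above, and the data are invented.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data.
df = pd.DataFrame({
    "id": [1, 2, 3, 3, 3],
    "Type": ["Tea", "Tea", "Coffee", "Tea", "Coffee"],
    "Time": ["18:08", "18:08", "21:04", "07:38", "21:04"],
    "Attribute": [None, None, "No", "No", None],
})

# True for every row that repeats an earlier (id, Type, Time) combination.
dup = df.duplicated(subset=["id", "Type", "Time"])

# Assign the marker to all flagged rows at once; other columns are untouched.
df.loc[dup, "Attribute"] = "duplicate"

print(df)
```

Because the mask is aligned on the index, this form also works when the DataFrame does not have the default 0..n-1 index.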
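Edit 2
In this second edit, the code marks all duplicates (including the first occurrence) as "duplicate". In addition, the code no longer compares all three columns (id, Time, Type) at once; it looks for matches on (id and Time) or (id and Type) or (Time and Type), that is, every pairwise combination of the three columns:
# keep=False marks every member of a duplicate group, including the first
dup = [df.duplicated(subset=list(cols), keep=False) for cols in [("id", "Type"), ("id", "Time"), ("Time", "Type")]]
duplicate = "duplicate"
for i in range(len(dup)):
    for j in range(len(dup[i])):
        if dup[i][j]:
            df.loc[j, "Attribute"] = duplicate
print(df)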
id Duplicate 1 Duplicate 2 Total Duplicates \
0 121349100 NaN NaN NaN
1 121350610 NaN NaN NaN
2 124426041 NaN NaN NaN
3 124436734 NaN NaN NaN
4 124451775 1.0 NaN 1.0
5 124451775 1.0 1.0 1.0
6 124451775 NaN 1.0 1.0
Time Type Attribute
0 2017-04-19 18:08:00 Tea duplicate
1 2017-04-19 18:08:00 Tea duplicate
2 2017-05-05 12:21:00 Tea NaN
3 2017-04-25 15:20:00 Coffee NaN
4 2017-04-05 21:04:00 Coffee duplicate
5 2017-06-05 07:38:00 Tea No
6 2017-04-05 21:04:00 Coffee duplicate
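The pairwise check can also be expressed by OR-ing the masks together and assigning once. The sketch below assumes invented data and uses itertools.combinations to generate the column pairs; the name any_pair is only illustrative.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-in for the Excel data.
df = pd.DataFrame({
    "id": [1, 2, 3, 3, 3],
    "Type": ["Tea", "Tea", "Coffee", "Tea", "Coffee"],
    "Time": ["18:08", "18:08", "21:04", "07:38", "21:04"],
    "Attribute": [None, None, "No", "No", None],
})

# Start with an all-False mask and OR in the result for each pair of columns.
any_pair = pd.Series(False, index=df.index)
for pair in combinations(["id", "Type", "Time"], 2):
    # keep=False flags every row of a duplicate group, including the first.
    any_pair |= df.duplicated(subset=list(pair), keep=False)

# Mark every row that matched another row on at least one pair of columns.
df.loc[any_pair, "Attribute"] = "duplicate"

print(df)
```

Row 3 keeps its original Attribute because it does not match any other row on two columns at once, which mirrors row 5 in the output above.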
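For more information on these functions, see the documentation for drop_duplicates and duplicated, which exist for both Series and DataFrames (the main difference is that for a Series the functions operate on individual values, while for a DataFrame they operate on rows, restricted to the specified columns).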