Python DataFrame deduplication: how to identify the first occurrence of duplicate rows in a Python Pandas DataFrame

I have a pandas DataFrame with duplicate values for a set of columns. For example:

df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10})

In [2]: df

Out[2]:
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10

The first and third rows (index 0 and 2) have the same values in Column2, Column3 and Column4, so the third row is a duplicate of the first.

I am looking for an output with two additional columns:

Is_Duplicate: whether the row is a duplicate (this can be accomplished with the "duplicated" method of the DataFrame on Column2, Column3 and Column4).

Dup_Index: the index of the original row that this row duplicates (the row's own index if it is not a duplicate).

In [3]: df

Out[3]:
   Column1 Column2 Column3  Column4  Is_Duplicate  Dup_Index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0

Solution

The DataFrame method duplicated takes care of the first column, Is_Duplicate:

In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])

Out[11]:
0    False
1    False
2     True
dtype: bool

In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
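
As a side note, duplicated accepts a keep argument that controls which occurrence is treated as the original. A minimal sketch on the same data, where the names key_cols, first_kept and all_marked are only illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})

# Columns that define what counts as "the same row" (illustrative name).
key_cols = ['Column2', 'Column3', 'Column4']

# Default keep='first': the first occurrence is not flagged, later ones are.
first_kept = df.duplicated(key_cols)

# keep=False: every row that has at least one twin is flagged, including the first.
all_marked = df.duplicated(key_cols, keep=False)

print(first_kept.tolist())
print(all_marked.tolist())
```

The default behaviour matches the Is_Duplicate column asked for in the question; keep=False is useful only if the first occurrence should be flagged as well.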

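For the second column, Dup_Index, you could try something like this. It groups the rows by the same three columns and, for each row, looks up the position of the first row in its group: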

In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])

In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])

In [15]: df1.index.map(lambda ind: g.indices[ind][0])

Out[15]: array([0, 1, 0])

In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])

In [17]: df

Out[17]:
   Column1 Column2 Column3  Column4  is_duplicated  dup_index
0        1     ABC     DEF       10          False          0
1        2     XYZ     DEF       40          False          1
2        3     ABC     DEF       10           True          0
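
Note that g.indices holds row positions rather than index labels, so the approach above matches the index only when the DataFrame has the default 0..n-1 index. A possible alternative, sketched here under the assumption that the index labels themselves are wanted, is to group the index by the key columns and broadcast the first label of each group with transform. The names key_cols and first_label are illustrative, and this is a swapped-in technique, not the original answer's method:

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})

# Columns that define a duplicate (illustrative name).
key_cols = ['Column2', 'Column3', 'Column4']

# Flag every occurrence after the first one.
df['Is_Duplicate'] = df.duplicated(key_cols)

# Turn the index into a Series, group it by the key columns, and give
# every row the first index label seen in its group.
first_label = (
    df.index.to_series()
      .groupby([df[c] for c in key_cols])
      .transform('first')
)
df['Dup_Index'] = first_label

print(df)
```

One caveat: groupby drops rows whose key columns contain NaN by default, so such rows may need separate handling.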
