【python】A detailed look at the pandas DataFrame deduplication function pandas.DataFrame.drop_duplicates

- 1. First, look at the docstring directly:

df.drop_duplicates?
Signature: df.drop_duplicates(subset=None, keep='first', inplace=False)
Docstring:
Return DataFrame with duplicate rows removed, optionally only
considering certain columns

Parameters
----------
subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates, by
    default use all of the columns
keep : {'first', 'last', False}, default 'first'
    - ``first`` : Drop duplicates except for the first occurrence.
    - ``last`` : Drop duplicates except for the last occurrence.
    - False : Drop all duplicates.
inplace : boolean, default False
    Whether to drop duplicates in place or to return a copy

Returns
-------
deduplicated : DataFrame
File:      c:\users\yingfei-wjc\anaconda3\envs\tensorflow\lib\site-packages\pandas\core\frame.py
Type:      method
  • subset: a single column label or a list of column labels to consider when identifying duplicates; by default all columns are used, i.e. a row is only treated as a duplicate (and dropped) when every column matches another row.

  • keep: one of {'first', 'last', False}, default 'first', meaning the first occurrence is kept and all later duplicates are dropped.

  • inplace: whether to drop duplicates in place; default False, i.e. the original DataFrame is left untouched and a deduplicated copy is returned.
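
To make the three keep options concrete, here is a minimal sketch with made-up data (not the CSV used in the example below):

import pandas as pd

# hypothetical toy DataFrame: rows 0 and 1 are full duplicates
df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

print(df.drop_duplicates(keep='first'))   # keeps row 0 and row 2
print(df.drop_duplicates(keep='last'))    # keeps row 1 and row 2
print(df.drop_duplicates(keep=False))     # keeps only row 2; both duplicates are dropped
print(df.drop_duplicates(subset=['a']))   # compares only column 'a' when looking for duplicates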

- 2. Example:

import pandas as pd

# load the K-chart CSV, using the first column (the date) as the index
df = pd.read_csv('c:/Users/yingfei-wjc/Desktop/index_kchart.csv', index_col=0)
df.sort_index(inplace=True)
# df.drop_duplicates(inplace=True)   # deduplication deliberately skipped for now
print(df.head(5))

The output contains duplicate rows:

      pre    open    high    low   close  change_price  \
time                                                             
1990/12/19  96.05   96.05   99.98  95.79   99.98          3.93   
1990/12/19  96.05   96.05   99.98  95.79   99.98          3.93   
1990/12/19  96.05   96.05   99.98  95.79   99.98          3.93   
1990/12/20  99.98  104.30  104.39  99.98  104.39          4.41   
1990/12/20  99.98  104.30  104.39  99.98  104.39          4.41   

            change_percent  volume    amount  
time                                          
1990/12/19          4.0916  1260.0  494000.0  
1990/12/19          4.0916  1260.0  494000.0  
1990/12/19          4.0916  1260.0  494000.0  
1990/12/20          4.4109   197.0   84000.0  
1990/12/20          4.4109   197.0   84000.0  
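
Before dropping anything, it can be useful to count how many of these rows are exact duplicates of an earlier one; a small sketch, assuming the same df as loaded above:

# number of rows that duplicate an earlier row across all columns
print(df.duplicated().sum())
# show every member of each duplicate group, not just the later occurrences
print(df[df.duplicated(keep=False)].head())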

Deduplicating with pandas.DataFrame.drop_duplicates:

import pandas as pd

df = pd.read_csv('c:/Users/yingfei-wjc/Desktop/index_kchart.csv', index_col=0)
df.sort_index(inplace=True)
df.drop_duplicates(inplace=True)   # drop rows that are duplicated across all columns
print(df.head(5))
               pre    open    high     low   close  change_price  \
time                                                               
1990/12/19   96.05   96.05   99.98   95.79   99.98          3.93   
1990/12/20   99.98  104.30  104.39   99.98  104.39          4.41   
1990/12/21  104.39  109.07  109.13  103.73  109.13          4.74   
1990/12/21   96.05   96.05  109.13   95.79  109.13         13.08   
1990/12/24  109.13  113.57  114.55  109.13  114.55          5.42   

            change_percent  volume    amount  
time                                          
1990/12/19          4.0916  1260.0  494000.0  
1990/12/20          4.4109   197.0   84000.0  
1990/12/21          4.5407    28.0   16000.0  
1990/12/21         13.6179  1485.0  594000.0  
1990/12/24          4.9666    32.0   31000.0  
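
To check programmatically which dates still appear more than once after full-row deduplication, you can inspect the index (a sketch, still assuming 'time' is the index):

# dates that still appear more than once in the index
print(df.index.duplicated().sum())
# inspect all rows that share a repeated date
print(df[df.index.duplicated(keep=False)].head())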

- 3. Solving the real problem

The output above still contains duplicate dates (1990/12/21 appears twice because the two rows differ in other columns), so further handling is needed:
1. When reading the data, do not set the date column as the index
2. Deduplicate on the date column only, by passing subset='time'
3. Set the date as the row index
4. Save the result as new.csv

import pandas as pd

# 1. read the CSV without setting the date column as the index
df = pd.read_csv('c:/Users/yingfei-wjc/Desktop/index_kchart.csv')
df.sort_values(by=['time'], inplace=True)
# 2. deduplicate on the 'time' column only
df.drop_duplicates(subset='time', inplace=True)
# 3. use the dates as the row index and drop the now-redundant 'time' column
df.index = df['time'].values
df.drop(['time'], axis=1, inplace=True)
print(df.head(5))
# 4. save the deduplicated data
df.to_csv('new.csv')

The result is as follows:

         pre    open    high     low   close  change_price  \
1990/12/19   96.05   96.05   99.98   95.79   99.98          3.93   
1990/12/20   99.98  104.30  104.39   99.98  104.39          4.41   
1990/12/21  104.39  109.07  109.13  103.73  109.13          4.74   
1990/12/24  109.13  113.57  114.55  109.13  114.55          5.42   
1990/12/25  114.55  120.09  120.25  114.55  120.25          5.70   

            change_percent  volume    amount  
1990/12/19          4.0916  1260.0  494000.0  
1990/12/20          4.4109   197.0   84000.0  
1990/12/21          4.5407    28.0   16000.0  
1990/12/24          4.9666    32.0   31000.0  
1990/12/25          4.9760    15.0    6000.0  
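
As a side note, steps 3 and 4 above (manually assigning the index and then dropping the 'time' column) can also be written with DataFrame.set_index, which does both in one call; a sketch under the same assumptions (same file path and column names):

import pandas as pd

df = pd.read_csv('c:/Users/yingfei-wjc/Desktop/index_kchart.csv')
df.sort_values(by='time', inplace=True)
df.drop_duplicates(subset='time', inplace=True)   # keep the first row for each date
df.set_index('time', inplace=True)                # 'time' becomes the index and leaves the columns
df.to_csv('new.csv')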
