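Goal: check whether two datasets are exactly the same
Datasets: df1, df2
Method 1: pandas
Subtract one dataset from the other
# df2 minus df1: rows of df2 that do not appear in df1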
import pandas as pd
set_diff_df = pd.concat([df1, df2, df1]).drop_duplicates(keep=False)
print(set_diff_df)
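Result
Empty DataFrame
An empty result means that df2 contains no rows that are missing from df1. The two datasets hold the same rows only if the reverse difference (df1 minus df2, i.e. pd.concat([df2, df1, df1]).drop_duplicates(keep=False)) is empty as well. Note that this check ignores the index and the number of times a row is repeated.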
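As a minimal sketch of the two-way check, the concat trick above can be wrapped in a helper and applied in both directions. The names rows_only_in, same_rows, big and small are hypothetical and exist only for this illustration; the sketch assumes both frames have the same columns with the same dtypes.

```python
import pandas as pd


def rows_only_in(left, right):
    """Distinct rows of `left` that never occur in `right` (index is ignored)."""
    # `right` is appended twice, so each of its rows occurs at least twice and is
    # removed by keep=False; de-duplicating `left` first keeps its unmatched rows
    # from cancelling each other out.
    stacked = pd.concat([left.drop_duplicates(), right, right], ignore_index=True)
    return stacked.drop_duplicates(keep=False)


def same_rows(a, b):
    """True when neither frame contains a row that the other lacks."""
    return rows_only_in(a, b).empty and rows_only_in(b, a).empty


# Hypothetical sample data: `small` holds a subset of the rows of `big`.
big = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
small = pd.DataFrame({"id": [1, 2], "value": [0.5, 1.5]})

print(rows_only_in(small, big).empty)  # True, although the frames differ
print(same_rows(small, big))           # False, the reverse difference is not empty
```

This still compares the frames as sets of rows. For a strict comparison that also takes row order and index values into account, pandas offers df1.equals(df2).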
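Method 2: the datacompy package
The package has to be installed first.
Installation details for the package: https://capitalone.github.io/datacompy/install.html
Installed with Anaconda on Windows 10, Python 3:
conda install datacompy
After a successful install, run: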
import datacompy
compare = datacompy.Compare(df1, df2, abs_tol=0.000001)
print(compare.report())
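This raised an error:
TypeError: 'NoneType' object is not iterable
The cause turned out to be that no join column had been specified for the comparison.
The following code uses the index as the join key: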
import datacompy
compare = datacompy.Compare(df1, df2, abs_tol=0.000001, on_index=True)
print(compare.report())
Success!
The report is as follows:
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
  DataFrame  Columns  Rows
0       df1       12  2000
1       df2       12  2000
Column Summary
--------------
Number of columns in common: 12
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: index
Any duplicates on match values: No
Absolute Tolerance: 1e-06
Relative Tolerance: 0
Number of rows in common: 2,000
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 0
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 2,000
Column Comparison
-----------------
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 12
Total number of values which compare unequal: 0
A column of the data can also be used as the join key.
Replace on_index=True in the call with:
join_columns=['column_name']