ydata-quality is to data quality what sklearn is to machine learning: a library that evaluates a dataset's quality through a multi-stage pipeline. As long as you have a usable dataset, running DataQuality(df=my_df).evaluate() yields a comprehensive, detailed quality summary. The evaluation covers the following aspects:
- Duplicates
- Missing Values
- Labelling
- Erroneous Data
- Drift Analysis
- Data Relations
- Bias & Fairness
- Data Expectations
[Note] The above is only a brief introduction to each module; detailed usage is covered in the official documentation on GitHub.
from ydata_quality import DataQuality
import pandas as pd
df = pd.read_csv('../datasets/transformed/census_10k.csv') # load data
dq = DataQuality(df=df) # create the main class that holds all quality modules
results = dq.evaluate() # run the tests
Warnings:
TOTAL: 5 warning(s)
Priority 1: 1 warning(s)
Priority 2: 4 warning(s)
Priority 1 - heavy impact expected:
* [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
* [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 3 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove the highest VIF variables.
* [ERRONEOUS DATA - PREDEFINED ERRONEOUS DATA] Found 1960 ED values in the dataset.
* [DATA RELATIONS - HIGH COLLINEARITY - CATEGORICAL] Found 10 categorical variables with significant collinearity (p-value < 0.05). The variables listed in results are highly collinear with other variables in the dataset and sorted descending according to propensity. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove variables following the provided order.
* [DUPLICATES - EXACT DUPLICATES] Found 3 instances with exact duplicate feature values.
The report yields warnings at several priority levels; these come from the default checks of each module above. More results can be obtained by passing additional arguments; the signature of DataQuality() is as follows:
DataQuality(df: DataFrame,
label: str = None,
random_state: Optional[int] = None,
entities: Optional[List[Union[str, List[str]]]] = None,
is_close: bool = False,
ed_extensions: Optional[list] = None,
sample: Optional[DataFrame] = None,
model: Callable = None,
results_json_path: str = None,
error_tol: int = 0,
rel_error_tol: Optional[float] = None,
minimum_coverage: Optional[float] = 0.75,
sensitive_features: Optional[List[str]] = None,
dtypes: Optional[dict] = None,
corr_th: float = 0.8,
vif_th: float = 5,
p_th: float = 0.05,
plot: bool = False,
severity: str = 'ERROR')
"""
Args:
df (DataFrame): reference DataFrame used to run the DataQuality analysis.
label (str, optional): [MISSINGS, LABELLING, DRIFT ANALYSIS] target feature to be predicted.
If not specified, LABELLING is skipped.
random_state (int, optional): Integer seed for random reproducibility. Default is None.
Set to None for fully random behavior, no reproducibility.
entities: [DUPLICATES] entities relevant for duplicate analysis.
is_close: [DUPLICATES] Pass True to use numpy.isclose instead of pandas.equals in column comparison.
ed_extensions: [ERRONEOUS DATA] A list of user provided erroneous data values to append to defaults.
sample: [DRIFT ANALYSIS] data against which drift is tested.
model: [DRIFT ANALYSIS] model wrapped by ModelWrapper used to test concept drift.
results_json_path (str): [EXPECTATIONS] A path to the json output from a Great Expectations validation run.
error_tol (int): [EXPECTATIONS] Defines how many failed expectations are tolerated.
rel_error_tol (float): [EXPECTATIONS] Defines the maximum fraction of failed expectations, \
overrides error_tol.
minimum_coverage (float): [EXPECTATIONS] Minimum expected fraction of DataFrame columns covered by the \
expectation suite.
sensitive_features (List[str]): [BIAS & FAIRNESS] features deemed as sensitive attributes
dtypes (Optional[dict]): Maps names of the columns of the dataframe to supported dtypes. Columns not \
specified are automatically inferred.
corr_th (float): [DATA RELATIONS] Absolute threshold for high correlation detection. Defaults to 0.8.
vif_th (float): [DATA RELATIONS] Variance Inflation Factor threshold for numerical independence test, \
typically 5-10 is recommended. Defaults to 5.
p_th (float): [DATA RELATIONS] Fraction of the right tail of the chi squared CDF defining threshold for \
categorical independence test. Defaults to 0.05.
plot (bool): Pass True to produce all available graphical outputs, False to suppress all graphical output.
severity (str): Sets the logger warning threshold.
Valid levels are: [DEBUG, INFO, WARNING, ERROR, CRITICAL]
"""
By default, five analyses are run: Duplicates, Missing Values, Erroneous Data, Drift Analysis, and Data Relations. If the label argument is set, a Labelling analysis also runs; if sensitive_features is a non-empty list, Bias & Fairness runs; and if results_json_path is set, Data Expectations runs, as sketched below.
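For example, all three optional analyses can be switched on at once. In the sketch below, the label column, the sensitive feature columns, and the Great Expectations results path are hypothetical placeholders; the parameter names come from the signature above:
dq = DataQuality(
    df=df,
    label='income',                          # hypothetical target column, enables Labelling
    sensitive_features=['sex', 'race'],      # hypothetical sensitive columns, enables Bias & Fairness
    results_json_path='ge_validation.json',  # hypothetical Great Expectations output, enables Data Expectations
)
results = dq.evaluate()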
The report above only gives a rough summary of the warnings under each priority; to retrieve the details, call get_warnings():
dq.get_warnings(test='Duplicate Columns')
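get_warnings() returns the matching warning objects, which carry the data behind each report line. A minimal sketch of inspecting them (the priority, description, and data attribute names are assumptions about the returned QualityWarning objects):
for warning in dq.get_warnings(test='Duplicate Columns'):
    print(warning.priority, warning.description)  # the summary line shown in the report
    print(warning.data)                           # the offending columns behind the warning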
The Duplicates module detects three kinds of duplication: duplicate columns, duplicate samples, and duplicate samples within groups formed by a groupby on selected features.
① Duplicate columns: checks whether any feature columns of the DataFrame carry identical values; below, Col2 and Col3 are duplicates.
index | Col1 | Col2 | Col3 |
---|---|---|---|
1 | 1 | 3 | 3 |
2 | 7 | 4 | 4 |
3 | 3 | 8 | 8 |
② Duplicate samples: checks whether any rows of the DataFrame are exact duplicates; below, the rows with index 1 and 2 are duplicates.
index | Col1 | Col2 | Col3 |
---|---|---|---|
1 | 1 | 2 | 3 |
2 | 1 | 2 | 3 |
3 | 3 | 8 | 8 |
③ Duplicate samples after a groupby: group by the specified features, then check whether the remaining features repeat within each group. Below, grouping by Col1 and comparing Col2 and Col3, the rows with index 1 and 2 are duplicates inside the Col1=1 group; the rows with index 4 and 5 also share Col2 and Col3 values, but they belong to different Col1 groups, so they are not flagged (see the pandas sketch after this table).
index | Col1 | Col2 | Col3 |
---|---|---|---|
1 | 1 | 2 | 3 |
2 | 1 | 2 | 3 |
3 | 1 | 5 | 8 |
4 | 2 | 2 | 3 |
5 | 3 | 2 | 3 |
6 | 2 | 5 | 8 |
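All three checks can be reproduced with plain pandas; a minimal sketch on the toy table above:
import pandas as pd
df = pd.DataFrame({'Col1': [1, 1, 1, 2, 3, 2],
                   'Col2': [2, 2, 5, 2, 2, 5],
                   'Col3': [3, 3, 8, 3, 3, 8]})
# ① duplicate columns: pairwise equality of column values (empty for this table, cf. the first example)
dup_cols = [(a, b) for i, a in enumerate(df.columns)
            for b in df.columns[i + 1:] if df[a].equals(df[b])]
# ② duplicate samples: rows whose feature values all repeat (the first two rows)
dup_rows = df[df.duplicated(keep=False)]
# ③ duplicates within groups: same group key (Col1) and same remaining features (Col2, Col3),
#    equivalent here to marking duplicates over all three columns at once
dup_in_group = df[df.duplicated(['Col1', 'Col2', 'Col3'], keep=False)]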
With the library itself, the same checks run through DuplicateChecker:
import pandas as pd
from ydata_quality.duplicates import DuplicateChecker
df = pd.read_csv('../datasets/transformed/guerry_histdata.csv') # load data
dc = DuplicateChecker(df=df, entities=['Region', 'MainCity']) # entities: the groupby features for check ③
results = dc.evaluate() # run all duplicate checks
The Missing Values module mainly checks whether each feature's missing rate exceeds the 20% threshold; it also computes correlations between features' missingness and estimates how much the missing values contribute to model prediction (by training a model).
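This module can also be run on its own; a minimal sketch, assuming the MissingsProfiler class from the project's missings module (the exact signature may differ):
from ydata_quality.missings import MissingsProfiler
mp = MissingsProfiler(df=df, label='income')  # 'income' is a hypothetical target column
results = mp.evaluate()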
The Labelling module analyses the label column: missing labels, label distribution, outlier label detection, and one-vs-rest (OvR) performance analysis for multiclass labels.
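A minimal standalone sketch, assuming the LabelInspector class from the labelling module (the exact signature may differ):
from ydata_quality.labelling import LabelInspector
li = LabelInspector(df=df, label='income')  # 'income' is a hypothetical target column
results = li.evaluate()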
The Erroneous Data module scans the dataset for predefined erroneous values; the defaults are "?", "UNK", "Unknown", "N/A", "NA", "", and "(blank)". Custom values can be appended via the ed_extensions parameter.
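A minimal standalone sketch, assuming the ErroneousDataIdentifier class from the erroneous_data module (the exact signature may differ):
from ydata_quality.erroneous_data import ErroneousDataIdentifier
edi = ErroneousDataIdentifier(df=df, ed_extensions=['-999', 'NULL'])  # hypothetical extra error values
results = edi.evaluate()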
The Drift Analysis module covers three types of data drift: sample drift, label drift, and concept drift.
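A minimal standalone sketch, assuming the DriftAnalyser class from the drift module; the ref/sample parameter names mirror the DataQuality docstring above but are assumptions for this class:
from ydata_quality.drift import DriftAnalyser
da = DriftAnalyser(ref=df, sample=new_df, label='income')  # new_df is a hypothetical held-out sample
results = da.evaluate()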
The Bias & Fairness module examines bias and fairness in the data features: a model is trained to predict each sensitive feature from the remaining features, and the better it performs, the more strongly those features leak the sensitive attribute.
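A minimal standalone sketch, assuming the BiasFairness class from the bias_fairness module (the exact signature may differ):
from ydata_quality.bias_fairness import BiasFairness
bf = BiasFairness(df=df, sensitive_features=['sex', 'race'], label='income')  # hypothetical columns
results = bf.evaluate()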
The Data Relations module evaluates the correlations among the dataset's features.
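A minimal standalone sketch, assuming the DataRelationsDetector class from the data_relations module; passing the DataFrame to evaluate() rather than the constructor is also an assumption:
from ydata_quality.data_relations import DataRelationsDetector
drd = DataRelationsDetector()
results = drd.evaluate(df, dtypes=None, label='income')  # 'income' is a hypothetical target column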
In summary, the library applies statistics and machine learning to assess a dataset from the several angles above. Its processing is built on pandas and sklearn, which makes it very slow on large datasets.