


  1. Duplicates
  2. Missing Values
  3. Labelling
  4. Erroneous Data
  5. Drift Analysis
  6. Data Expectations
  7. Bias & Fairness
  8. Data Relations


from ydata_quality import DataQuality
import pandas as pd

df = pd.read_csv('../datasets/transformed/census_10k.csv') # load data
dq = DataQuality(df=df) # create the main class that holds all quality modules
results = dq.evaluate() # run the tests

         TOTAL: 5 warning(s)
         Priority 1: 1 warning(s)
         Priority 2: 4 warning(s)

Priority 1 - heavy impact expected:
         * [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
         * [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 3 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove the highest VIF variables.
         * [ERRONEOUS DATA - PREDEFINED ERRONEOUS DATA] Found 1960 ED values in the dataset.
         * [DATA RELATIONS - HIGH COLLINEARITY - CATEGORICAL] Found 10 categorical variables with significant collinearity (p-value < 0.05). The variables listed in results are highly collinear with other variables in the dataset and sorted descending according to propensity. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove variables following the provided order.
         * [DUPLICATES - EXACT DUPLICATES] Found 3 instances with exact duplicate feature values.
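
The Variance Inflation Factor mentioned in the numerical collinearity warning can be worked out by hand. The sketch below is a minimal NumPy illustration of the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on all remaining columns. The helper name `vif` and the synthetic data are assumptions made for illustration; this is not necessarily how ydata_quality computes the value internally.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Return the Variance Inflation Factor of each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = np.empty(n_cols)
    for j in range(n_cols):
        y = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        # A perfect fit means the column is an exact linear combination of the others.
        factors[j] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return factors

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b
d = rng.normal(size=500)                     # independent of the other columns
vif_values = vif(np.column_stack([a, b, c, d]))
flagged = vif_values > 5.0                   # same value as the default vif_th
```

In this toy data the first three columns exceed the threshold because each can be almost fully predicted from the other two, while the independent column stays close to 1.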


DataQuality(df: DataFrame,
            label: str = None,
            random_state: Optional[int] = None,
            entities: Optional[List[Union[str, List[str]]]] = None,
            is_close: bool = False,
            ed_extensions: Optional[list] = None,
            sample: Optional[DataFrame] = None,
            model: Callable = None,
            results_json_path: str = None,
            error_tol: int = 0,
            rel_error_tol: Optional[float] = None,
            minimum_coverage: Optional[float] = 0.75,
            sensitive_features: Optional[List[str]] = None,
            dtypes: Optional[dict] = None,
            corr_th: float = 0.8,
            vif_th: float = 5,
            p_th: float = 0.05,
            plot: bool = False,
            severity: str = 'ERROR')
    df (DataFrame): reference DataFrame used to run the DataQuality analysis.
    label (str, optional): [MISSINGS, LABELLING, DRIFT ANALYSIS] target feature to be predicted.
                            If not specified, LABELLING is skipped.
    random_state (int, optional): Integer seed for random reproducibility. Default is None.
        Set to None for fully random behavior, no reproducibility.
    entities: [DUPLICATES] entities relevant for duplicate analysis.
    is_close: [DUPLICATES] Pass True to use numpy.isclose instead of pandas.equals in column comparison.
    ed_extensions: [ERRONEOUS DATA] A list of user provided erroneous data values to append to defaults.
    sample: [DRIFT ANALYSIS] data against which drift is tested.
    model: [DRIFT ANALYSIS] model wrapped by ModelWrapper used to test concept drift.
    results_json_path (str): [EXPECTATIONS] A path to the json output from a Great Expectations validation run.
    error_tol (int): [EXPECTATIONS] Defines how many failed expectations are tolerated.
    rel_error_tol (float): [EXPECTATIONS] Defines the maximum fraction of failed expectations, \
        overrides error_tol.
    minimum_coverage (float): [EXPECTATIONS] Minimum expected fraction of DataFrame columns covered by the \
        expectation suite.
    sensitive_features (List[str]): [BIAS & FAIRNESS] features deemed as sensitive attributes
    dtypes (Optional[dict]): Maps names of the columns of the dataframe to supported dtypes. Columns not \
        specified are automatically inferred.
    corr_th (float): [DATA RELATIONS] Absolute threshold for high correlation detection. Defaults to 0.8.
    vif_th (float): [DATA RELATIONS] Variance Inflation Factor threshold for numerical independence test, \
        typically 5-10 is recommended. Defaults to 5.
    p_th (float): [DATA RELATIONS] Fraction of the right tail of the chi squared CDF defining threshold for \
        categorical independence test. Defaults to 0.05.
    plot (bool): Pass True to produce all available graphical outputs, False to suppress all graphical output.
    severity (str): Sets the logger warning threshold.
        Valid levels are: [DEBUG, INFO, WARNING, ERROR, CRITICAL]

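By default, five analyses are run: Duplicates, Missing Values, Erroneous Data, Drift Analysis, and Data Relations. The remaining three are switched on by their parameters: if label is not empty, the Labelling analysis is run; if the sensitive_features list has a length greater than 0, the Bias & Fairness analysis is run; and if results_json_path is not empty, the Data Expectations analysis is run.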
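
The selection rules in the paragraph above can be written down as a few lines of plain Python. This is only a sketch of the described behaviour: `enabled_modules` is a hypothetical helper, not part of ydata_quality, and the column names and path in the usage lines are made up.

```python
from typing import List, Optional

def enabled_modules(label: Optional[str] = None,
                    sensitive_features: Optional[List[str]] = None,
                    results_json_path: Optional[str] = None) -> List[str]:
    """Hypothetical helper mirroring the module selection rules described above."""
    # These five analyses are always part of the run.
    modules = ["Duplicates", "Missing Values", "Erroneous Data",
               "Drift Analysis", "Data Relations"]
    if label is not None:
        modules.append("Labelling")
    if sensitive_features:  # None or an empty list leaves the module off
        modules.append("Bias & Fairness")
    if results_json_path is not None:
        modules.append("Data Expectations")
    return modules

default_run = enabled_modules()
full_run = enabled_modules(label="income",
                           sensitive_features=["sex", "race"],
                           results_json_path="validation_results.json")
```

With no optional arguments only the five default analyses are listed; supplying all three arguments lists all eight.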


dq.get_warnings(test='Duplicate Columns')


index  Col1  Col2  Col3
1      1     3     3
2      7     4     4
3      3     8     8


index  Col1  Col2  Col3
1      1     2     3
2      1     2     3
3      3     8     8


index  Col1  Col2  Col3
1      1     2     3
2      1     2     3
3      1     5     8
4      2     2     3
5      3     2     3
6      2     5     8

import pandas as pd
from ydata_quality.duplicates import DuplicateChecker

df = pd.read_csv('../datasets/transformed/guerry_histdata.csv') # load data
# entities: columns relevant for duplicate analysis
dc = DuplicateChecker(df=df, entities=['Region', 'MainCity'])
results = dc.evaluate() # run the duplicate tests
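
The first two small tables above appear to illustrate duplicate columns (Col2 and Col3 hold the same values) and exact duplicate rows (rows 1 and 2 are identical). Both cases can be reproduced with plain pandas, as in the sketch below. It is meant for intuition only and is not the DuplicateChecker implementation; the variable names are assumptions.

```python
import pandas as pd

# Toy frame copied from the first table: Col2 and Col3 hold identical values.
dup_cols = pd.DataFrame({"Col1": [1, 7, 3], "Col2": [3, 4, 8], "Col3": [3, 4, 8]},
                        index=[1, 2, 3])
# Transposing turns columns into rows, so duplicated() flags repeated columns.
column_mask = dup_cols.T.duplicated().to_numpy()
duplicate_columns = dup_cols.columns[column_mask].tolist()

# Toy frame copied from the second table: rows 1 and 2 are identical.
dup_rows = pd.DataFrame({"Col1": [1, 1, 3], "Col2": [2, 2, 8], "Col3": [3, 3, 8]},
                        index=[1, 2, 3])
# keep=False marks every member of a duplicate group, not only the repeats.
exact_duplicates = dup_rows[dup_rows.duplicated(keep=False)]
```

Here `duplicate_columns` names the later copy of a repeated column, and `exact_duplicates` holds every row that has an identical twin.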
3、Missing Values




5、Erroneous Data

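This module analyzes the erroneous data (ED) values contained in the dataset. Erroneous data values are predefined placeholders; the defaults are "?", "UNK", "Unknown", "N/A", "NA", "" and "(blank)". Custom values can be appended to the defaults through the ed_extensions parameter.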
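
A rough pandas sketch of the idea: count how many cells in each column match one of the predefined placeholder values, optionally extended with user-supplied ones. The helper `count_erroneous`, the toy data, and the choice of exact, case-sensitive string matching are all assumptions made for illustration; the module itself may match values differently.

```python
import pandas as pd

# Default placeholder values listed above.
DEFAULT_ED = ["?", "UNK", "Unknown", "N/A", "NA", "", "(blank)"]

def count_erroneous(df: pd.DataFrame, extensions=None) -> pd.Series:
    """Count predefined erroneous values per column (hypothetical helper)."""
    values = DEFAULT_ED + list(extensions or [])
    # Every cell is compared as a string; isin() performs exact matching.
    return df.astype(str).isin(values).sum()

toy = pd.DataFrame({
    "workclass": ["Private", "?", "UNK", "Private"],
    "age": ["39", "N/A", "50", "-1"],
})
default_counts = count_erroneous(toy)                      # defaults only
extended_counts = count_erroneous(toy, extensions=["-1"])  # "-1" also counts as erroneous
```

Passing `extensions=["-1"]` plays the role that ed_extensions plays in the real module: the extra value is appended to the defaults, not substituted for them.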

6、Drift Analysis

This module analyzes three types of data drift: data sample drift, label drift, and concept drift.
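
Of the three drift types, data sample drift is the easiest to sketch: compare the distribution of one numerical feature in the reference data with the same feature in a new sample. The example below uses the two-sample Kolmogorov-Smirnov statistic as one common choice. That choice is an assumption for illustration, since the text above does not say which statistical test the module applies, and `ks_statistic` is a hypothetical helper.

```python
from bisect import bisect_right
import random

def ks_statistic(reference, sample):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    smp_sorted = sorted(sample)
    gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for x in ref_sorted + smp_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_smp = bisect_right(smp_sorted, x) / len(smp_sorted)
        gap = max(gap, abs(cdf_ref - cdf_smp))
    return gap

random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(1000)]
same_dist = [random.gauss(0.0, 1.0) for _ in range(1000)]  # no drift
shifted = [random.gauss(1.0, 1.0) for _ in range(1000)]    # mean shifted by 1

stat_same = ks_statistic(reference, same_dist)
stat_shifted = ks_statistic(reference, shifted)
```

A statistic near 0 means the two samples look alike, while a large value, as for the shifted sample here, points to drift in that feature. Label drift can be checked the same way on the target column, whereas concept drift needs a trained model.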

7、Data Expectations
8、Bias & Fairness


9、Data Relations



