sklearn: 'fit_transform() takes 2 positional arguments but 3 were given' when using LabelBinarizer in a Pipeline

While working through Hands-On Machine Learning with Scikit-Learn and TensorFlow (O'Reilly, 2017), I found that running the following code raises an error:


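from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, LabelBinarizer, StandardScaler

# DataFrameSelector and CombinedAttributesAdder are the custom transformers
# defined earlier in the book; they must already be in scope.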
num_attribs = ['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']

cat_attribs = ["ocean_proximity"]
# Under sklearn 0.19 this code fails because LabelBinarizer.fit_transform was overridden
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)

This raises TypeError: fit_transform() takes 2 positional arguments but 3 were given

The cause is this part of the code:

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

sklearn 0.19 overrides fit_transform in the LabelBinarizer class so that it accepts only two arguments (self and y). Compare that with the traceback:

Xt, fit_params = self._fit(X, y, **fit_params)
    282         if hasattr(last_step, 'fit_transform'):
--> 283             return last_step.fit_transform(Xt, y, **fit_params)
    284         elif last_step is None:
    285             return Xt

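The line marked by the arrow calls fit_transform with Xt, y, and **fit_params. Counting the implicit self, that is three positional arguments, one more than LabelBinarizer.fit_transform accepts.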
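
The mismatch can be reproduced without scikit-learn at all. The sketch below is only an illustration: ToyLabelBinarizer and pipeline_fit_transform are hypothetical stand-ins written for this post, not scikit-learn code. They mimic a transformer whose fit_transform takes only (self, y) and a pipeline that always forwards both Xt and y.

```python
class ToyLabelBinarizer:
    """Hypothetical stand-in for a transformer whose fit_transform takes only (self, y)."""

    def fit_transform(self, y):
        # One-hot encode the labels against the sorted set of classes.
        classes = sorted(set(y))
        return [[int(label == c) for c in classes] for label in y]


def pipeline_fit_transform(last_step, Xt, y=None, **fit_params):
    """Mimics the call highlighted in the traceback: Xt and y are both forwarded."""
    return last_step.fit_transform(Xt, y, **fit_params)


def reproduce():
    """Return the TypeError message produced by the mismatched call."""
    try:
        pipeline_fit_transform(ToyLabelBinarizer(), ["a", "b", "a"])
    except TypeError as exc:
        return str(exc)
    return None


message = reproduce()
print(message)
```

Calling ToyLabelBinarizer().fit_transform(["a", "b", "a"]) directly works, which matches the observation that LabelBinarizer itself is fine and only the call made by the pipeline fails.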

Below is my modified code. It processes the text and numeric attributes separately, then merges the two arrays with numpy's concatenate.

import numpy as np

def processed_data(data):
    # Names of the numeric and categorical attributes
    num_attribs = ['longitude',
                   'latitude',
                   'housing_median_age',
                   'total_rooms',
                   'total_bedrooms',
                   'population',
                   'households',
                   'median_income']

    cat_attribs = ["ocean_proximity"]

    # Process the numeric attributes
    num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

    num_prepared = num_pipeline.fit_transform(data)

    # Process the text/categorical attribute by calling LabelBinarizer directly,
    # outside any Pipeline, so only one argument is passed to fit_transform
    label_binarizer = LabelBinarizer()
    cat_prepared = label_binarizer.fit_transform(data[cat_attribs])

    # Merge the two arrays column-wise with numpy
    housing_prepared = np.concatenate((num_prepared, cat_prepared), axis=1)

    return housing_prepared
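
If you would rather keep the categorical step inside a Pipeline, one possible alternative is a thin wrapper that exposes the (X, y=None) signature the pipeline calls. This is a minimal sketch, not the book's code: the class name PipelineLabelBinarizer and the toy input are assumptions made up for illustration, and it assumes the step receives a single categorical column.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: gives LabelBinarizer the (X, y=None) signature."""

    def fit(self, X, y=None):
        # Flatten an (n_samples, 1) column into the 1-D label array
        # that LabelBinarizer expects.
        self.encoder_ = LabelBinarizer()
        self.encoder_.fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        return self.encoder_.transform(np.asarray(X).ravel())


# Toy single-column input standing in for the ocean_proximity column.
toy_column = np.array([["INLAND"], ["NEAR BAY"], ["ISLAND"], ["INLAND"]])
toy_pipeline = Pipeline([("label_binarizer", PipelineLabelBinarizer())])
encoded = toy_pipeline.fit_transform(toy_column)
print(encoded.shape)
```

Because fit_transform here is inherited from TransformerMixin, which accepts X and an optional y, the pipeline's call no longer passes more arguments than the step can take. With such a wrapper in place of LabelBinarizer() in cat_pipeline, the original FeatureUnion layout should not need the manual concatenate step.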
