Background: the decision tree (Decision Tree, DT) chapter of 《统计学习方法》 (Statistical Learning Methods)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
df = pd.read_csv('./ad.data', header=None)
df.head(10)
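The first 3 columns are the ad image's width, height, and aspect ratio; the remaining columns are encoded features for how often text terms occur.
The last column is the label: whether or not the image is an advertisement.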
y = df[len(df.columns)-1]
y
0 ad.
1 ad.
2 ad.
3 ad.
4 ad.
...
3274 nonad.
3275 nonad.
3276 nonad.
3277 nonad.
3278 nonad.
Name: 1558, Length: 3279, dtype: object
y = [1 if e == 'ad.' else 0 for e in y]
X = df.drop(df.columns[len(df.columns)-1], axis=1)
X
A "?" marks invalid (missing) data; replace every such cell with -1:
X.replace(to_replace=r' *\?', value=-1, regex=True, inplace=True)
X
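To make the cleaning step concrete, here is a minimal sketch of what the pattern ` *\?` matches, using only the standard `re` module rather than pandas. The helper name `clean_cell` and the sample cells are assumptions made up for illustration; they are not part of the original notebook.

```python
import re

# Pattern from the replace call: zero or more spaces followed by a literal '?'.
MISSING = re.compile(r' *\?')

def clean_cell(cell):
    """Hypothetical helper: map a missing-value marker to -1, else parse the number."""
    if MISSING.fullmatch(cell):
        return -1.0
    return float(cell)  # float() tolerates surrounding whitespace

# Sample cells are invented for illustration.
samples = ['   ?', '?', ' 125', '468', '3.9']
cleaned = [clean_cell(s) for s in samples]
print(cleaned)
```

Using -1 as a sentinel is workable for a tree model, since no valid width, height, or ratio is negative, but it should not be read as a real measurement.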
X_train, X_test, y_train, y_test = train_test_split(X, y)
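Because no size is passed, the split falls back to scikit-learn's defaults. The sketch below works out the resulting set sizes, assuming the default test fraction of 0.25 and that the test count is rounded up; both are assumptions about the installed scikit-learn version rather than something stated in the notebook.

```python
import math

n_samples = 3279       # length of y reported in the notebook output
test_fraction = 0.25   # assumed: default when neither test_size nor train_size is given

n_test = math.ceil(test_fraction * n_samples)  # test count is rounded up
n_train = n_samples - n_test
print(n_train, n_test)
```

The 820 test samples agree with the total support in the classification report. Since no `random_state` is set, the rows that land in each set, and therefore the scores, can differ from run to run.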
pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])
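The classifier is configured with `criterion='entropy'`, which means splits are chosen by how much they reduce label entropy. The following is a minimal sketch of that quantity on toy labels; the function names and the data are illustrative assumptions, and it is not scikit-learn's actual implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy labels, made up for illustration: 1 = ad, 0 = non-ad.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

print(round(entropy(parent), 3))
print(round(information_gain(parent, left, right), 3))
```

A split that separated the two classes perfectly would have a gain equal to the parent entropy; the mixed split here recovers only a small part of it.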
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3)
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)
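To give a sense of how much work the grid search does, the sketch below enumerates the candidate settings with `itertools.product` instead of scikit-learn itself. The fold count of 5 is an assumption (the default in recent scikit-learn versions when `cv` is not passed); the `clf__` prefix is how a `Pipeline` routes a parameter to its step named `clf`.

```python
from itertools import product

# Same grid as in the notebook.
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

# Build every combination of the listed values, one dict per candidate.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

cv_folds = 5  # assumed default number of cross-validation folds
print(len(candidates), len(candidates) * cv_folds)
print(candidates[0])
```

Each candidate is fitted once per fold, so the total number of fits grows multiplicatively with every value added to the grid.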
best_parameters = grid_search.best_estimator_.get_params()
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))
Best score: 0.890
Best parameters set:
	clf__max_depth: 155
	clf__min_samples_leaf: 2
	clf__min_samples_split: 2
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       716
           1       0.94      0.82      0.88       104

    accuracy                           0.97       820
   macro avg       0.96      0.91      0.93       820
weighted avg       0.97      0.97      0.97       820
For the ad class (label 1), precision (0.94) and recall (0.82) are both fairly good.
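As a quick check on how the report's numbers relate, the sketch below recomputes the F1 scores and their averages from the printed precision, recall, and support values. It works from the rounded figures in the report, so it can only be expected to agree to two decimal places.

```python
# Per-class values copied from the classification report.
precision = {0: 0.97, 1: 0.94}
recall = {0: 0.99, 1: 0.82}
support = {0: 716, 1: 104}

def f1(p, r):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

f1_scores = {c: f1(precision[c], recall[c]) for c in precision}

# Macro average weights both classes equally; weighted average uses class support.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
weighted_f1 = sum(f1_scores[c] * support[c] for c in support) / sum(support.values())

print({c: round(v, 2) for c, v in f1_scores.items()})
print(round(macro_f1, 2), round(weighted_f1, 2))
```

With only 104 ads among 820 test samples, the weighted average is dominated by the non-ad class, so the per-class row for label 1 is the more informative one for ad detection.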
Advantages:
Disadvantages: