When I first started entering data competitions, I would often begin with code following this pattern:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import KFold

train = read_file('data/train.tsv')
train_y = extract_targets(train)
train_essays = extract_essays(train)
train_tokens = get_tokens(train_essays)
train_features = extract_features(train_tokens)
classifier = MultinomialNB()

scores = []
for train_idx, cv_idx in KFold(n_splits=5).split(train_features):
    classifier.fit(train_features[train_idx], train_y[train_idx])
    scores.append(classifier.score(train_features[cv_idx], train_y[cv_idx]))

print("Score: {}".format(np.mean(scores)))
This usually gets a decent score on the first submission, but to push it higher I try to extract more features. For example, I want n-gram counts over the text, weighted with tf-idf; I also want the essay length as a feature, and ideally the number of misspelled words as well. So I end up packing all of this into a single extract_features function that builds three feature matrices from the essays and concatenates them along axis 1.
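Concretely, that hand-rolled extract_features might look something like the sketch below (count_misspellings is a hypothetical stand-in for whatever spell-check helper is used):

import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def extract_features(essays):
    # n-gram counts, re-weighted with tf-idf
    counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(essays)
    tf_idf = TfidfTransformer().fit_transform(counts)
    # essay length as a single-column feature
    lengths = np.array([[len(essay)] for essay in essays])
    # count_misspellings is a hypothetical spell-check helper
    misspellings = np.array([[count_misspellings(essay)] for essay in essays])
    # concatenate the three matrices along axis 1
    return hstack([tf_idf, lengths, misspellings])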
After extracting the features, I also end up doing a lot of normalizing and scaling, and the code grows very long, even though I can still understand it myself. This is exactly what scikit-learn's Pipeline is for, chaining all of these steps into a single estimator:
pipeline = Pipeline([
('extract_essays', EssayExractor()),
('counts', CountVectorizer()),
('tf_idf', TfidfTransformer()),
('classifier', MultinomialNB())
])
With the pipeline in place, my cross-validation code shrinks to:

train = read_file('data/train.tsv')
train_y = extract_targets(train)

scores = []
for train_idx, cv_idx in KFold(n_splits=5).split(train):
    pipeline.fit(train[train_idx], train_y[train_idx])
    scores.append(pipeline.score(train[cv_idx], train_y[cv_idx]))

print("Score: {}".format(np.mean(scores)))
To fold the extra features (essay length and misspelling counts) back in, I drop a FeatureUnion into the pipeline:

pipeline = Pipeline([
('extract_essays', EssayExractor()),
('features', FeatureUnion([
('ngram_tf_idf', Pipeline([
('counts', CountVectorizer()),
('tf_idf', TfidfTransformer())
])),
('essay_length', LengthTransformer()),
('misspellings', MispellingCountTransformer())
])),
('classifier', MultinomialNB())
])
In this example, the extracted essays flow into three parallel branches (ngram_tf_idf, essay_length, and misspellings), and the three outputs are concatenated and fed into the classifier.
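The essay_length and misspellings steps are small custom transformers. A minimal sketch of what LengthTransformer could look like (the actual implementation may differ):

from pandas import DataFrame
from sklearn.base import TransformerMixin

class LengthTransformer(TransformerMixin):
    # emit a single column holding the character length of each essay
    def transform(self, X, **transform_params):
        return DataFrame([len(essay) for essay in X])

    def fit(self, X, y=None, **fit_params):
        return self

Pipelines and FeatureUnions nest to arbitrary depth. Here is a more involved pipeline that scales continuous fields, one-hot encodes categorical and date-derived features, and even uses the predictions of several models as features for a final estimator: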
pipeline = Pipeline([
('features', FeatureUnion([
('continuous', Pipeline([
('extract', ColumnExtractor(CONTINUOUS_FIELDS)),
('scale', Normalizer())
])),
('factors', Pipeline([
('extract', ColumnExtractor(FACTOR_FIELDS)),
('one_hot', OneHotEncoder(n_values=5)),
('to_dense', DenseTransformer())
])),
('weekday', Pipeline([
('extract', DayOfWeekTransformer()),
('one_hot', OneHotEncoder()),
('to_dense', DenseTransformer())
])),
('hour_of_day', HourOfDayTransformer()),
('month', Pipeline([
('extract', ColumnExtractor(['datetime'])),
('to_month', DateTransformer()),
('one_hot', OneHotEncoder()),
('to_dense', DenseTransformer())
])),
('growth', Pipeline([
('datetime', ColumnExtractor(['datetime'])),
('to_numeric', MatrixConversion(int)),
('regression', ModelTransformer(LinearRegression()))
]))
])),
('estimators', FeatureUnion([
('knn', ModelTransformer(KNeighborsRegressor(n_neighbors=5))),
('gbr', ModelTransformer(GradientBoostingRegressor())),
('dtr', ModelTransformer(DecisionTreeRegressor())),
('etr', ModelTransformer(ExtraTreesRegressor())),
('rfr', ModelTransformer(RandomForestRegressor())),
('par', ModelTransformer(PassiveAggressiveRegressor())),
('en', ModelTransformer(ElasticNet())),
('cluster', ModelTransformer(KMeans(n_clusters=2)))
])),
('estimator', KNeighborsRegressor())
])
Several of the steps above, such as HourOfDayTransformer and ModelTransformer, are transformers I wrote myself:

from pandas import DataFrame
from sklearn.base import TransformerMixin

class HourOfDayTransformer(TransformerMixin):
    def transform(self, X, **transform_params):
        # pull the hour out of the datetime column as a one-column frame
        hours = DataFrame(X['datetime'].apply(lambda x: x.hour))
        return hours

    def fit(self, X, y=None, **fit_params):
        return self
class ModelTransformer(TransformerMixin):
    # wraps an estimator so that its predictions can be used as features
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return DataFrame(self.model.predict(X))
The pipeline treats these hand-written transformers exactly like the built-in ones.
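As a quick sanity check (the tiny data frame here is made up purely for illustration), a custom transformer drops into a pipeline next to standard estimators without any extra ceremony:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# a tiny made-up frame, just to exercise HourOfDayTransformer
df = pd.DataFrame({'datetime': pd.to_datetime(['2015-01-01 06:00', '2015-01-01 18:00'])})
y = [1.0, 3.0]

toy = Pipeline([
    ('hour', HourOfDayTransformer()),    # the custom transformer defined above
    ('regression', LinearRegression())   # a built-in estimator
])
toy.fit(df, y)
print(toy.predict(df))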