sklearn - Basic Usage

Data cleaning can be done with pandas; when it comes to prediction, the famous sklearn takes over. It bundles many of the basic algorithms and helps a data scientist solve a lot of problems.
(a) data normalization

from sklearn import preprocessing
# normalize the data attributes
normalized_X = preprocessing.normalize(X)   # rescale each sample (row) to unit norm
# standardize the data attributes
standardized_X = preprocessing.scale(X)     # each feature gets mean 0 and variance 1
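
As a quick illustration of the difference (the array X_demo below is made up just for this example), normalize rescales each row to unit length, while scale standardizes each column:

import numpy as np
from sklearn import preprocessing

# toy data: 3 samples, 2 features (hypothetical values, for illustration only)
X_demo = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [4.0, 8.0]])

print(preprocessing.normalize(X_demo))  # every row divided by its L2 norm
print(preprocessing.scale(X_demo))      # every column shifted to mean 0, scaled to unit variance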

(b) feature selection

from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()  # tree-based: features used close to the root of the trees come out as more important
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)   # recursively drops the weakest feature until 3 remain
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
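
For a self-contained run, the same RFE call can be tried on a concrete dataset (iris is only a stand-in here for whatever X and y you are working with):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X_iris, y_iris)
print(rfe.support_)   # boolean mask: True for the 3 selected features
print(rfe.ranking_)   # selected features get rank 1
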
The feature_selection module

Univariate feature selection
The idea of univariate feature selection is to compute some statistical measure for each variable separately and use that measure to decide which variables matter, dropping the unimportant ones.

The sklearn.feature_selection module mainly provides the following methods:
SelectKBest and SelectPercentile are very similar: the former keeps the top n ranked variables, the latter keeps the top n%. Which statistic is used to rank the variables has to be specified separately.
For regression problems you can use f_regression; for classification problems, chi2 or f_classif.
Example usage:
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=10)
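
A minimal sketch of actually applying the selector (iris is used here only as a placeholder dataset, and percentile=50 is an arbitrary choice for the demo):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

X_iris, y_iris = load_iris(return_X_y=True)
selector = SelectPercentile(f_classif, percentile=50)
X_reduced = selector.fit_transform(X_iris, y_iris)   # keeps the top 50% of features
print(X_iris.shape, '->', X_reduced.shape)
print(selector.scores_)                              # the per-feature F scores used for ranking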

There are a few other methods that select variables using other common univariate statistical tests for each feature: false positive rate (SelectFpr), false discovery rate (SelectFdr), or family-wise error (SelectFwe).

The documentation says that with sparse matrices only chi2 can be used and everything else has to be converted to a dense matrix, but in my own use f_classif also worked on sparse matrices.

Recursive feature elimination
Instead of testing each variable in isolation, this approach evaluates features together with an external estimator (an SVM, for instance): the estimator is fitted on the current feature set, the features are ranked by the weights the model assigns them, the least important ones are pruned, and the procedure repeats recursively until the desired number of features remains. (Exhaustively evaluating the validation error of all 2^d - 1 non-empty subsets of d features would quickly become infeasible; the recursive elimination is the practical shortcut.)
It is implemented by sklearn.feature_selection.RFE and sklearn.feature_selection.RFECV, the latter choosing the number of features by cross-validation.
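
RFECV adds cross-validation on top of RFE to pick the number of features automatically; a minimal sketch (the estimator and parameters are only illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
# RFECV needs an estimator that exposes feature weights, e.g. a linear SVM
selector = RFECV(SVC(kernel='linear'), step=1, cv=5).fit(X_iris, y_iris)
print(selector.n_features_)   # number of features chosen by cross-validation
print(selector.support_)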

L1-based feature selection:
The idea: with an L1-regularized linear model you often obtain a sparse solution, meaning that the coefficients in front of many variables are exactly zero or close to zero. Those variables contribute little and can be removed.
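
One common way to exploit this in scikit-learn is SelectFromModel wrapped around an L1-penalized linear model; a hedged sketch (the penalty strength C below is arbitrary):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X_iris, y_iris = load_iris(return_X_y=True)
# the L1 penalty drives some coefficients to exactly zero
l1_model = LinearSVC(C=0.01, penalty='l1', dual=False, max_iter=10000).fit(X_iris, y_iris)
X_selected = SelectFromModel(l1_model, prefit=True).transform(X_iris)
print(X_iris.shape, '->', X_selected.shape)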

Tree-based feature selection
Feature selection driven by the feature importances of decision-tree (ensemble) models.
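
For example, the importances of the ExtraTreesClassifier shown earlier can drive the selection directly (a sketch; the 'mean' threshold is just one reasonable choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X_iris, y_iris = load_iris(return_X_y=True)
forest = ExtraTreesClassifier(n_estimators=100).fit(X_iris, y_iris)
# keep only the features whose importance exceeds the mean importance
X_selected = SelectFromModel(forest, prefit=True, threshold='mean').transform(X_iris)
print(forest.feature_importances_)
print(X_iris.shape, '->', X_selected.shape)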

(c) Algorithm Development

As I have said, Scikit-Learn has implemented all the basic algorithms of machine learning. Let’s take a look at some of them.

Logistic Regression

It is most often used for binary classification tasks, but multiclass classification (via the so-called one-vs-all scheme) is also supported. The advantage of this algorithm is that for every object it outputs the probability of belonging to each class.

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
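
Since the paragraph above mentions class membership probabilities, here is a small addition: continuing from the model just fitted, predict_proba exposes them (the row slice is only to keep the output short):

# one row per object, one column per class; each row sums to 1
probabilities = model.predict_proba(X)
print(probabilities[:5])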

Naive Bayes

Naive Bayes is also one of the best-known machine learning algorithms; its main task is to restore the density of the data distribution of the training sample. This method often gives good quality in multiclass classification problems.

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

k-Nearest Neighbours

The kNN (k-Nearest Neighbors) method is often used as part of a more complex classification algorithm. For instance, we can use its estimate as an object’s feature. Sometimes, a simple kNN provides great quality on well-chosen features. When parameters (metrics mostly) are set well, the algorithm often gives good quality in regression problems.

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
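
To make the remark about parameters concrete, the number of neighbours and the distance metric can be set explicitly (the values below are purely illustrative):

# e.g. 10 neighbours with Manhattan distance instead of the defaults
model = KNeighborsClassifier(n_neighbors=10, metric='manhattan').fit(X, y)
print(model.score(X, y))   # mean accuracy on the training data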

Decision Trees

Classification and Regression Trees (CART) are often used in problems where the objects have categorical features, and they work for both regression and classification. Trees are very well suited to multiclass classification.

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Support Vector Machines

SVM (Support Vector Machines) is one of the most popular machine learning algorithms, used mainly for classification problems. Like logistic regression, SVM allows multiclass classification with the help of the one-vs-all method.

from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

In addition to classification and regression algorithms, Scikit-Learn has a huge number of more complex algorithms, including clustering, and also implemented techniques to create compositions of algorithms, including Bagging and Boosting.
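
As a small illustration of such compositions (a sketch; the estimator counts are arbitrary and X, y are the same data used above):

from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
# bagging: many trees fitted on bootstrap samples, predictions averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)
# boosting: trees added one after another, each correcting its predecessors
boosting = GradientBoostingClassifier(n_estimators=100).fit(X, y)
print(bagging.score(X, y), boosting.score(X, y))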

How to Optimize Algorithm Parameters

One of the most difficult stages in creating really efficient algorithms is choosing correct parameters. It’s usually easier with experience, but one way or another, we have to do the search. Fortunately, Scikit-Learn provides many implemented functions for this purpose.

As an example, let’s take a look at the selection of the regularization parameter, in which several values are searched in turn:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV   # in older releases this lived in sklearn.grid_search
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Sometimes it is more efficient to randomly select a parameter from the given range, estimate the algorithm quality for this parameter and choose the best one.

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV   # in older releases this lived in sklearn.grid_search
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
(d) cross-validation
The key function in sklearn's cross-validation module is the following:
sklearn.cross_validation.cross_val_score. It is called as scores = cross_validation.cross_val_score(clf, raw_data, raw_target, cv=5, score_func=None)
Parameters:
clf is the classifier and can be any classifier, for example a support vector machine: clf = svm.SVC(kernel='linear', C=1)
The cv parameter selects the cross-validation scheme. If cv is an integer and raw_target is provided, StratifiedKFold splitting is used; if raw_target is not provided, plain KFold splitting is used.
The return value of cross_val_score is the classification accuracy obtained on the test data for each of the different splits of raw_data. How the accuracy is computed can be specified through score_func; if it is not given, the classifier's own default scorer is used.
The remaining parameters are less important.
A concrete example of using cross_val_score:
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
...    clf, raw_data, raw_target, cv=5)
...
>>> scores
array([ 1.  ...,  0.96...,  0.9 ...,  0.96...,  1.        ])
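
Note that in recent scikit-learn versions the cross_validation module has been replaced by model_selection, so the equivalent call today looks roughly like this (a sketch, with iris standing in for raw_data and raw_target):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

raw_data, raw_target = load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)
print(cross_val_score(clf, raw_data, raw_target, cv=5))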

Besides the KFold and StratifiedKFold splitting schemes just mentioned, there are many other ways of splitting the raw data. The other splitters are invoked a little differently from those two (although the overall usage is the same); ShuffleSplit below serves as an example:
>>> n_samples = raw_data.shape[0]
>>> cv = cross_validation.ShuffleSplit(n_samples, n_iter=3,
...     test_size=0.3, random_state=0)

>>> cross_validation.cross_val_score(clf, raw_data, raw_target, cv=cv)
...
array([ 0.97...,  0.97...,  1.        ])

Other available splitting methods include:
cross_validation.Bootstrap
cross_validation.LeaveOneLabelOut
cross_validation.LeaveOneOut
cross_validation.LeavePLabelOut
cross_validation.LeavePOut
cross_validation.StratifiedShuffleSplit

They are called the same way as ShuffleSplit, but each has its own parameters. For what these methods actually mean, see a machine learning textbook.
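
As one example, StratifiedShuffleSplit keeps the class proportions in every random split and is passed to cross_val_score just like ShuffleSplit (a sketch reusing the clf, raw_data and raw_target from above; in current scikit-learn it lives in model_selection):

from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

cv = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
print(cross_val_score(clf, raw_data, raw_target, cv=cv))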
