
A Gentle Introduction to Machine Learning with Python and Scikit-learn

Guillermo Moncecchi, Diego Garat, Raúl Garreta

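This article demonstrates the basic use of machine learning methods in scikit-learn. It covers classification, regression, and clustering. The classification and clustering examples use the Iris dataset introduced by Sir Ronald Fisher in 1936; the regression example uses the Boston housing data.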


%pylab inline
Populating the interactive namespace from numpy and matplotlib
Import scikit-learn, numpy, scipy and pyplot

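# np and matplotlib come from %pylab inline; the remaining names need explicit imports
import platform
import IPython
import sklearn
print ('Python version:', platform.python_version())
print ('IPython version:', IPython.__version__)
print ('numpy version:', np.__version__)
print ('scikit-learn version:', sklearn.__version__)
print ('matplotlib version:', matplotlib.__version__)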

from sklearn import datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

print (X_iris.shape, y_iris.shape)
print ('Feature names:{0}'.format(iris.feature_names))
print ('Target classes:{0}'.format(iris.target_names))
print ('First instance features:{0}'.format(X_iris[0]))
(150, 4) (150,)
Feature names:['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target classes:['setosa' 'versicolor' 'virginica']
First instance features:[ 5.1  3.5  1.4  0.2]

colormarkers= [ ['red','s'], ['greenyellow','o'], ['blue','x']]
for i in range(len(colormarkers)):
    px = X_iris[:, 0][y_iris == i]
    py = X_iris[:, 1][y_iris == i]
    plt.scatter(px, py, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Iris Dataset: Sepal width vs sepal length')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

for i in range(len(colormarkers)):
    px = X_iris[:, 2][y_iris == i]
    py = X_iris[:, 3][y_iris == i]
    plt.scatter(px, py, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Iris Dataset: petal width vs petal length')
plt.xlabel('Petal length')
plt.ylabel('Petal width')


Supervised Learning: Classification

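In 1936, Ronald Fisher introduced the Iris dataset and used it to build a linear classification model, that is, a linear combination of the features; with two features this amounts to finding a straight line.
Separate training and testing sets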

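# Note: sklearn.cross_validation is the module name in older scikit-learn releases;
# from version 0.18 on, train_test_split lives in sklearn.model_selection.
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing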

# Create dataset with only the first two attributes
X, y = X_iris[:, [0,1]], y_iris
# Test set will be the 25% taken randomly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Standardize the features, using statistics estimated on the training set only
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Check that, after scaling, the mean is 0 and the standard deviation is 1 (this should be exact in the training set, but only approximate in the testing set, because we used the training set mean and standard deviation):
print ('Training set mean:{:.2f} and standard deviation:{:.2f}'.format(np.average(X_train),np.std(X_train)))
print ('Testing set mean:{:.2f} and standard deviation:{:.2f}'.format(np.average(X_test),np.std(X_test)))
Training set mean:0.00 and standard deviation:1.00
Testing set mean:0.13 and standard deviation:0.71
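Why the testing set statistics are only approximate can be seen in a small sketch in plain Python. The values below are made up for illustration (they are not Iris measurements), and the helper name standardize is ours; the point is only that the mean and standard deviation are estimated on the training values and then reused unchanged for the testing values.

```python
import statistics

# Hypothetical one-feature samples (illustrative values only).
train = [4.0, 5.0, 6.0, 7.0]
test = [5.0, 6.5]

# Fit: estimate the mean and the population standard deviation on the training set only.
mu = statistics.fmean(train)
sigma = statistics.pstdev(train)

def standardize(values):
    # Transform: reuse the training statistics for any data set.
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train)
test_scaled = standardize(test)

# The training set ends up with mean 0 and standard deviation 1;
# the testing set only gets close to those values.
print(statistics.fmean(train_scaled), statistics.pstdev(train_scaled))
print(statistics.fmean(test_scaled), statistics.pstdev(test_scaled))
```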
Display the training data, after scaling.
colormarkers= [ ['red','s'], ['greenyellow','o'], ['blue','x']]
plt.figure('Training Data')
for i in range(len(colormarkers)):
    xs = X_train[:, 0][y_train == i]
    ys = X_train[:, 1][y_train == i]
    plt.scatter(xs, ys, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Training instances, after scaling')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

A linear, binary classifier

import copy
y_train_setosa = copy.copy(y_train)
# Every 1 and 2 classes in the training set will become just 1
y_train_setosa[y_train_setosa > 0] = 1
y_test_setosa = copy.copy(y_test)
y_test_setosa[y_test_setosa > 0] = 1

print ('New training target classes:\n{0}'.format(y_train_setosa))
New training target classes:
[1 0 1 1 1 0 0 1 0 1 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 0
 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 1 1 1 0 1 0 1 1
 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1

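The gradient descent method was proposed by Louis Cauchy in 1847 to solve systems of linear equations. The idea is that a multivariable function decreases fastest in the direction of its negative gradient. If we want to find its minimum, we can move step by step in the direction of the negative gradient.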
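As a minimal sketch of that idea, the following plain-Python loop minimizes a simple two-variable function. The function f(x, y) = (x - 3)^2 + (y + 1)^2, the learning rate, and the number of steps are illustrative choices of ours, not anything used by scikit-learn; the stochastic variant used later differs in that it estimates the gradient from one training instance at a time.

```python
def gradient(x, y):
    # Gradient of f(x, y) = (x - 3)**2 + (y + 1)**2.
    return 2 * (x - 3), 2 * (y + 1)

def gradient_descent(start, learning_rate=0.1, steps=100):
    x, y = start
    for _ in range(steps):
        gx, gy = gradient(x, y)
        # Move against the gradient, the direction of steepest descent.
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

x_opt, y_opt = gradient_descent(start=(0.0, 0.0))
# The result approaches the true minimum at (3, -1).
print(round(x_opt, 4), round(y_opt, 4))
```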
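Every classifier in scikit-learn follows the same pattern: we create the classifier by calling its constructor with configurable parameters. In this example we use linear_model.SGDClassifier and tell scikit-learn to use a logarithmic loss function.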
from sklearn import linear_model
# loss='log' selects the logistic loss; recent scikit-learn releases name it 'log_loss'
clf = linear_model.SGDClassifier(loss='log', random_state=42)
print (clf)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,

clf.fit(X_train, y_train_setosa)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,

print (clf.coef_,clf.intercept_)
[[ 30.97129662 -17.82969037]] [ 17.34844577]

x_min, x_max = X_train[:, 0].min()-.5, X_train[:, 0].max()+.5
y_min, y_max = X_train[:, 1].min()-.5, X_train[:, 1].max()+.5
xs= np.arange(x_min, x_max, 0.5)
fig,axes= plt.subplots()
axes.set_title('Setosa classification')
axes.set_xlabel('Sepal length')
axes.set_ylabel('Sepal width')
axes.set_xlim(x_min, x_max)
axes.set_ylim(y_min, y_max)
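# Plot with the binary labels the classifier was trained on: setosa (0) versus the rest (1)
plt.scatter(X_train[:, 0][y_train_setosa == 0], X_train[:, 1][y_train_setosa == 0], c='red', marker='s')
plt.scatter(X_train[:, 0][y_train_setosa == 1], X_train[:, 1][y_train_setosa == 1], c='black', marker='x')
# Decision boundary: the points where coef_[0,0]*x + coef_[0,1]*y + intercept_ equals zero
ys = (-clf.intercept_[0] - xs * clf.coef_[0, 0]) / clf.coef_[0, 1]
plt.plot(xs, ys)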

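The blue line is our decision boundary. Every time 30.97 × sepal_length − 17.83 × sepal_width + 17.35 (evaluated on the scaled features, with the coefficients and intercept printed above) is less than zero, the classifier predicts an iris setosa (class 0); when it is greater than zero, it predicts class 1 (not setosa).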
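The decision rule can be written out by hand. The sketch below uses the rounded values of clf.coef_ and clf.intercept_ printed above and follows scikit-learn's convention for binary linear classifiers, where a positive score maps to the second class (here class 1, not setosa). The function names and the two input instances are hypothetical; the inputs stand for already standardized features and are not taken from the dataset.

```python
# Rounded values of clf.coef_ and clf.intercept_ as printed above.
W_LENGTH, W_WIDTH, INTERCEPT = 30.97, -17.83, 17.35

def decision_value(sepal_length_scaled, sepal_width_scaled):
    # Signed score; it is zero exactly on the boundary line.
    return W_LENGTH * sepal_length_scaled + W_WIDTH * sepal_width_scaled + INTERCEPT

def predict_class(sepal_length_scaled, sepal_width_scaled):
    # Positive scores map to class 1 (not setosa), the rest to class 0 (setosa).
    return 1 if decision_value(sepal_length_scaled, sepal_width_scaled) > 0 else 0

# Hypothetical, already standardized instances.
print(predict_class(-1.0, 1.0))   # short, wide sepals
print(predict_class(1.0, -0.5))   # long, narrow sepals
```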

print ('If the flower has 4.7 sepal length and 3.1 sepal width it is a {}'.format(
    iris.target_names[clf.predict(scaler.transform([[4.7, 3.1]]))]))

If the flower has 4.7 sepal length and 3.1 sepal width it is a ['setosa']
Note that we first scaled the new instance, then applied the predict method, and used the result to look up the class name in the iris target names array.

Back to the original three-class problem

Now, do the training using the three original classes. Using scikit-learn this is simple: we do exactly the same procedure, using the original three target classes:
clf2= linear_model.SGDClassifier(loss='log', random_state=33)
clf2.fit(X_train, y_train) 
print (len(clf2.coef_))

x_min, x_max = X_train[:, 0].min()-.5, X_train[:, 0].max()+.5
y_min, y_max = X_train[:, 1].min()-.5, X_train[:, 1].max()+.5
xs= np.arange(x_min,x_max,0.5)
fig, axes = plt.subplots(1,3)
for i in [0,1,2]:
    axes[i].set_title('Class '+ iris.target_names[i]+' versus the rest')
    axes[i].set_xlabel('Sepal length')
    axes[i].set_ylabel('Sepal width')
    axes[i].set_xlim(x_min, x_max)
    axes[i].set_ylim(y_min, y_max)
    for j in [0,1,2]:
        px = X_train[:, 0][y_train== j]
        py = X_train[:, 1][y_train== j]
        color = colormarkers[j][0] if j == i else 'black'
        marker = 'o' if j == i else 'x'
        # Draw on the i-th subplot rather than on the current (last) axes
        axes[i].scatter(px, py, c=color, marker=marker)



Let us evaluate on the previous instance to find the three-class prediction. Scikit-learn tries the three classifiers.
[[ 15.45793755  -1.60852842 -37.65225636]]
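Assuming the three numbers above are the per-class decision scores for the instance (the call that produced them is not shown), the three-class prediction is simply the class whose "versus the rest" classifier gives the highest score. A minimal sketch in plain Python:

```python
# Decision scores for the instance, one per "class versus the rest" classifier,
# copied from the output above.
scores = [15.45793755, -1.60852842, -37.65225636]
target_names = ['setosa', 'versicolor', 'virginica']

# The predicted class is the one whose classifier gives the highest score.
best = max(range(len(scores)), key=lambda i: scores[i])
print(best, target_names[best])  # prints: 0 setosa
```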



Evaluating the classifier

from sklearn import metrics
y_train_pred= clf2.predict(X_train)
print ('Accuracy on the training set:{:.2f}'.format(metrics.accuracy_score(y_train, y_train_pred)))
Accuracy on the training set:0.83

This means that our classifier correctly predicts 83% of the instances in the training set. But this is actually a bad idea. The problem with evaluating on the training set is that you have built your model using this data, and it is possible that your model adjusts very well to it but performs poorly on previously unseen data (which is its ultimate purpose). This phenomenon is called overfitting, and you will see it again and again while you read this book. If you measure on your training data, you will never detect overfitting. So, never ever measure on your training data.
Remember we set aside a portion of the data as a testing set? Now it is time to use it: since it was not used for training, we expect it to give us an idea of how well our classifier performs on previously unseen data.
y_pred= clf2.predict(X_test)
print ('Accuracy on the testing set:{:.2f}'.format(metrics.accuracy_score(y_test, y_pred)))
Accuracy on the testing set:0.68

print (metrics.confusion_matrix(y_test, y_pred))
[[ 8  0  0]
 [ 0  3  8]
 [ 0  4 15]]
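The accuracy reported earlier can be recovered from this matrix. In scikit-learn's confusion matrix the rows are the true classes and the columns are the predicted classes, so the diagonal holds the correctly classified instances. A short sketch in plain Python, with the matrix copied from the output above:

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
confusion = [
    [8, 0, 0],
    [0, 3, 8],
    [0, 4, 15],
]

# Correct predictions sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print('{:.2f}'.format(accuracy))  # prints: 0.68
```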

Accuracy on the test set is a good performance measure when the number of instances of each class is similar, i.e., we have a uniform distribution of classes. However, consider that 99 percent of your instances belong to just one class (you have a skewed distribution): a classifier that always predicts this majority class will have an excellent performance in terms of accuracy, despite the fact that it is an extremely naive method (and that it will surely fail in the "difficult" 1% cases).
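The effect is easy to reproduce with made-up labels. In the sketch below (hypothetical data, plain Python) a classifier that always answers with the majority class reaches 99% accuracy while finding none of the minority instances.

```python
# Hypothetical skewed test set: 99 instances of class 0 and one of class 1.
y_true = [0] * 99 + [1]
# A naive classifier that always predicts the majority class.
naive_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, naive_pred)) / len(y_true)

# Fraction of the minority instances that the classifier actually finds.
minority_found = sum(1 for t, p in zip(y_true, naive_pred) if t == 1 and p == 1)
minority_rate = minority_found / y_true.count(1)

print(accuracy, minority_rate)  # prints: 0.99 0.0
```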
Within scikit-learn, there are several evaluation functions; we will show three popular ones: precision, recall, and F1-score (or f-measure).
print (metrics.classification_report(y_test, y_pred, target_names=iris.target_names))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.43      0.27      0.33        11
  virginica       0.65      0.79      0.71        19

avg / total       0.66      0.68      0.66        38

· Precision computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).
· Recall counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).
· F1-score is the harmonic mean of precision and recall, and tries to combine both in a single number.
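These three definitions can be checked against the two-attribute report above by computing them from the confusion matrix printed earlier. The sketch below is plain Python; the helper name class_metrics is ours, and it assumes every class is predicted at least once (otherwise the precision would divide by zero).

```python
# Confusion matrix of the two-attribute classifier: rows are true classes,
# columns are predicted classes.
confusion = [
    [8, 0, 0],
    [0, 3, 8],
    [0, 4, 15],
]
names = ['setosa', 'versicolor', 'virginica']

def class_metrics(matrix, k):
    true_positives = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum: predicted as class k
    actual_k = sum(matrix[k])                    # row sum: truly class k
    precision = true_positives / predicted_k
    recall = true_positives / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k, name in enumerate(names):
    p, r, f = class_metrics(confusion, k)
    print('{:>10} {:.2f} {:.2f} {:.2f}'.format(name, p, r, f))
```

For versicolor this gives 0.43, 0.27 and 0.33, and for virginica 0.65, 0.79 and 0.71, the same values as in the classification report.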
Using the four flower attributes

# Test set will be the 25% taken randomly
X_train4, X_test4, y_train4, y_test4 = train_test_split(X_iris, y_iris, test_size=0.25, random_state=33)

# Standardize the features
scaler = preprocessing.StandardScaler().fit(X_train4)
X_train4 = scaler.transform(X_train4)
X_test4 = scaler.transform(X_test4)

# Build the classifier
clf3= linear_model.SGDClassifier(loss='log', random_state=33)
clf3.fit(X_train4, y_train4) 

# Evaluate the classifier on the evaluation set
y_pred4= clf3.predict(X_test4)
print (metrics.classification_report(y_test4, y_pred4, target_names=iris.target_names))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.78      0.64      0.70        11
  virginica       0.81      0.89      0.85        19

avg / total       0.84      0.84      0.84        38

Unsupervised Learning: Clustering


from sklearn import cluster
clf_sepal = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
# Fit on the two sepal columns (sepal length, sepal width)
clf_sepal.fit(X_train4[:, 0:2])
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=33, tol=0.0001,
We can show the label assigned to each instance (note that this label is a cluster name; it has nothing to do with our original target classes. In fact, when you are doing clustering you have no target class!).
print (clf_sepal.labels_)
[1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 2 1 2 1 0 0 1 1 0 0 2 0 1 2 2 1 1 0 0 2 1 0
 1 1 2 1 0 2 0 1 0 2 2 0 2 1 0 0 1 0 0 0 2 1 0 1 0 1 0 1 2 1 1 1 0 1 0 2 1
 0 0 0 0 2 2 0 1 1 2 1 0 0 1 1 1 0 1 1 0 2 1 2 1 2 0 2 0 0 0 1 1 2 1 1 1 2
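To make clear where such labels come from, here is a minimal sketch of the k-means iteration in plain Python: assign every point to its nearest centroid, then move each centroid to the mean of its points. The points and the fixed initial centroids are hypothetical choices for reproducibility; the KMeans call above asks for the k-means++ initialization instead, so this is not scikit-learn's implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations on 2-D points."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming three loose groups (not Iris data).
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (9.0, 0.0), (9.2, 0.3)]
# Simple fixed initialization: one starting centroid taken from each group.
initial = [points[0], points[2], points[4]]
labels, centroids = kmeans(points, initial)
print(labels)  # prints: [0, 0, 1, 1, 2, 2]
```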

Using NumPy's indexing capabilities, we can display the actual target classes for each cluster, just to compare the built clusters with our flower type classes:
print (y_train4[clf_sepal.labels_==0])
[0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0]

print (y_train4[clf_sepal.labels_==1])
[1 1 1 1 1 1 2 1 0 2 1 2 2 1 1 2 2 1 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
 2 1 2 1 1 2 1]
print (y_train4[clf_sepal.labels_==2])
[2 2 1 2 2 2 2 1 1 2 2 1 2 2 1 1 2 2 2 2 2 2 1 2 2]
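One way to summarize such a comparison is to count the true classes inside a cluster and look at the share of the majority class. The sketch below does this in plain Python for the last cluster, with the array copied from the output above; the variable names are ours.

```python
from collections import Counter

# True classes of the training instances assigned to cluster 2, copied from the output above.
cluster_2_classes = [2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2]

counts = Counter(cluster_2_classes)
majority_class, majority_count = counts.most_common(1)[0]
# Share of the cluster that belongs to its most frequent class.
majority_share = majority_count / len(cluster_2_classes)
print(dict(counts), majority_class, '{:.2f}'.format(majority_share))
```

Here 18 of the 25 instances are virginica (class 2) and 7 are versicolor (class 1), which matches the visual impression that sepal measures alone do not separate these two species well.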
As usual, it is a good idea to display our instances and the clusters they belong to, to get a first approximation of how well our algorithm is behaving on our data:
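colormarkers = [ ['red','s'], ['greenyellow','o'], ['blue','x']]
step = 0.01   # grid resolution (assumed value; not given in the source)
margin = 0.1  # padding around the data (assumed value; not given in the source)
sl_min, sl_max = X_train4[:, 0].min() - margin, X_train4[:, 0].max() + margin
sw_min, sw_max = X_train4[:, 1].min() - margin, X_train4[:, 1].max() + margin
sl, sw = np.meshgrid(
    np.arange(sl_min, sl_max, step),
    np.arange(sw_min, sw_max, step)
)
# Predict the cluster of every grid point to colour the cluster regions
Zs = clf_sepal.predict(np.c_[sl.ravel(), sw.ravel()]).reshape(sl.shape)
centroids_s = clf_sepal.cluster_centers_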
Display the data points and the calculated regions
plt.imshow(Zs, interpolation='nearest', extent=(sl.min(), sl.max(), sw.min(), sw.max()), cmap= plt.cm.Pastel1, aspect='auto', origin='lower')
for j in [0,1,2]:
    px = X_train4[:, 0][y_train== j]
    py = X_train4[:, 1][y_train== j]
    plt.scatter(px, py, c=colormarkers[j][0], marker= colormarkers[j][1])
plt.scatter(centroids_s[:,0], centroids_s[:, 1],marker='*',linewidths=3, color='black', zorder=10)
plt.title('K-means clustering on the Iris dataset using Sepal dimensions\nCentroids are marked with stars')
plt.xlim(sl_min, sl_max)
plt.ylim(sw_min, sw_max)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")

**Repeat the experiment, using petal dimensions**
clf_petal = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
clf_petal.fit(X_train4[:, 2:4])  # fit on the two petal columns
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=33, tol=0.0001,
print (y_train4[clf_petal.labels_==0])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0]
print (y_train4[clf_petal.labels_==1])
[1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
print (y_train4[clf_petal.labels_==2])
[2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2]


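colormarkers = [ ['red','s'], ['greenyellow','o'], ['blue','x']]
step = 0.01   # grid resolution (assumed value; not given in the source)
margin = 0.1  # padding around the data (assumed value; not given in the source)
# The sl_/sw_ names are reused here for petal length and petal width
sl_min, sl_max = X_train4[:, 2].min() - margin, X_train4[:, 2].max() + margin
sw_min, sw_max = X_train4[:, 3].min() - margin, X_train4[:, 3].max() + margin
sl, sw = np.meshgrid(
    np.arange(sl_min, sl_max, step),
    np.arange(sw_min, sw_max, step)
)
Zs = clf_petal.predict(np.c_[sl.ravel(), sw.ravel()]).reshape(sl.shape)
centroids_s = clf_petal.cluster_centers_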
plt.imshow(Zs, interpolation='nearest', extent=(sl.min(), sl.max(), sw.min(), sw.max()), cmap= plt.cm.Pastel1, aspect='auto', origin='lower')
for j in [0,1,2]:
    px = X_train4[:, 2][y_train4== j]
    py = X_train4[:, 3][y_train4== j]
    plt.scatter(px, py, c=colormarkers[j][0], marker= colormarkers[j][1])
plt.scatter(centroids_s[:,0], centroids_s[:, 1],marker='*',linewidths=3, color='black', zorder=10)
plt.title('K-means clustering on the Iris dataset using Petal dimensions\nCentroids are marked with stars')
plt.xlim(sl_min, sl_max)
plt.ylim(sw_min, sw_max)
plt.xlabel("Petal length")
plt.ylabel("Petal width")

clf = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
clf.fit(X_train4)  # fit on all four attributes
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=33, tol=0.0001,

print (y_train[clf.labels_==0])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0]

print (y_train[clf.labels_==1])
[1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1]

print (y_train[clf.labels_==2])
[2 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2]

Measure precision and recall in the testing set, using all attributes, and using only petal measures
y_pred = clf.predict(X_test4)  # cluster ids; here they happen to line up with the class ids
print (metrics.classification_report(y_test, y_pred, target_names=['setosa','versicolor','virginica']))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.64      0.64      0.64        11
  virginica       0.79      0.79      0.79        19

avg / total       0.79      0.79      0.79        38

y_pred_petal = clf_petal.predict(X_test4[:, 2:4])  # petal columns only
print (metrics.classification_report(y_test, y_pred_petal, target_names=['setosa','versicolor','virginica']))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.85      1.00      0.92        11
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

Wait, every performance measure is better using just two attributes. Is it possible that fewer features give better results? Although at first glance this seems contradictory, we will see in future notebooks that selecting the right subset of features, a process called feature selection, can actually improve the performance of our algorithms.

Supervised Learning: Regression


# Note: load_boston was removed in scikit-learn 1.2; this code needs an older release.
from sklearn.datasets import load_boston
boston = load_boston()
print ('Boston dataset shape:{}'.format(boston.data.shape))
Boston dataset shape:(506, 13)
print (boston.feature_names)
 'B' 'LSTAT']
Create training and testing sets, and scale values, as usual
X_train_boston= preprocessing.StandardScaler().fit_transform(X_train_boston)
y_train_boston= preprocessing.StandardScaler().fit_transform(y_train_boston)

def train_and_evaluate(clf, X_train, y_train, folds):
    clf.fit(X_train, y_train)
    print ('Score on training set: {:.2f}'.format(clf.score(X_train, y_train)))
    # create a k-fold cross-validation iterator with `folds` folds
    cv = sklearn.cross_validation.KFold(X_train.shape[0], folds, shuffle=True, random_state=33)
    scores = sklearn.cross_validation.cross_val_score(clf, X_train, y_train, cv=cv)
    print ('Average score using {}-fold crossvalidation:{:.2f}'.format(folds,np.mean(scores)))
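What a k-fold iterator produces can be sketched in a few lines of plain Python: the instance indices are cut into k folds, and each fold is used once for testing while the remaining folds are used for training. The function k_fold_indices is a hypothetical helper of ours; unlike the KFold call above, which is given shuffle=True, it keeps the indices in their original order.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    fold_sizes = [base + 1] * extra + [base] * (n_folds - extra)
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# With 10 instances and 5 folds, every instance is tested exactly once.
for train, test in k_fold_indices(10, 5):
    print(test, len(train))
```

The cross-validation score is then the average of the scores obtained on the k test folds, each with a model trained on the corresponding training indices.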

scikit-learn has a linear model called linear_model.SGDRegressor, which uses stochastic gradient descent to minimize the squared loss.

from sklearn import linear_model
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=33)
train_and_evaluate(clf_sgd, X_train_boston, y_train_boston,5)
Score on training set: 0.73
Average score using 5-fold crossvalidation:0.70

[-0.06777406  0.06767528 -0.04290825  0.08828856 -0.11797833  0.3394894
 -0.01969258 -0.23195707  0.09594823 -0.05271866 -0.19913907  0.10355794

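In the linear model above we set the penalty parameter to None: clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=33). A penalty term can be introduced to avoid overfitting; it works by penalizing hyperplanes whose coefficients are too large. This parameter is usually L2 (the default) or L1; below we show an example using L2.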
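The effect of an L2 penalty can be illustrated with a small sketch in plain Python. The data, the two weight vectors, and the strength alpha are hypothetical, and the penalty is written in a generic form (alpha times the sum of squared coefficients), which is not necessarily scikit-learn's exact scaling. Both weight vectors fit the data perfectly, but the penalized loss prefers the one with small coefficients.

```python
def squared_loss(weights, X, y):
    # Mean squared error of the linear prediction w . x.
    errors = [sum(w * xi for w, xi in zip(weights, row)) - target
              for row, target in zip(X, y)]
    return sum(e * e for e in errors) / len(errors)

def l2_penalized_loss(weights, X, y, alpha):
    # Data term plus alpha times the sum of squared coefficients.
    return squared_loss(weights, X, y) + alpha * sum(w * w for w in weights)

# Hypothetical data: two identical features, target equal to twice the feature value.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

small_weights = [1.0, 1.0]    # fits the data with small coefficients
large_weights = [10.0, -8.0]  # fits the data equally well with large coefficients

for w in (small_weights, large_weights):
    print(w, squared_loss(w, X, y), l2_penalized_loss(w, X, y, alpha=0.1))
```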

clf_sgd1= linear_model.SGDRegressor(loss='squared_loss', penalty='l2', random_state=33)
train_and_evaluate(clf_sgd1, X_train_boston, y_train_boston,folds=5)
Score on training set: 0.73
Average score using 5-fold crossvalidation:0.70


(5) Evaluate the model on the test set with predict(T), where T is the test set.
