Machine Learning Basics: Wiki Translations on Isotonic Regression, Random Forests, and Pipelines, with Simple sklearn Examples. Category: Machine Learning, Sklearn

Isotonic regression
In numerical analysis, isotonic regression (IR) involves finding a weighted
least-squares fit x in R^n to a vector a in R^n, with weights vector w in R^n,
subject to a set of non-contradictory constraints of the kind x_i >= x_j
(the components of x respect the given order).
Such constraints define a partial or total order and can be represented
as a directed graph G = (N, E) (N: nodes, E: edges between nodes),
where N is the set of variables involved and E is the set of pairs (i, j),
one for each constraint x_i >= x_j. Thus, the IR problem corresponds to the following quadratic program (QP):

    min_x  sum_i w_i * (x_i - a_i)^2    subject to  x_i >= x_j  for all (i, j) in E

Implementation code:
import numpy as np
from sklearn.utils import check_random_state
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection


n = 100
x = np.arange(n)
rs = check_random_state(0)
# noisy but roughly increasing data: uniform noise around 50 * log(1 + x)
y = rs.randint(-50, 50, size = (n,)) + 50. * np.log(1 + np.arange(n))

# isotonic fit: the best non-decreasing approximation to y
ir = IsotonicRegression()
y_ = ir.fit_transform(x, y)

# ordinary least-squares line for comparison; fit() expects a 2-D X
lr = LinearRegression()
# print(x)                 # shape (n,)
# print(x[:, np.newaxis])  # x reshaped to a column, shape (n, 1)
lr.fit(x[:, np.newaxis], y)

# vertical segments joining each observation to its isotonic fit
segments = [[[i, y[i]], [i, y_[i]]] for i in range(n)]
lc = LineCollection(segments, zorder = 0)
lc.set_array(np.ones(len(y)))
lc.set_linewidths(0.5 * np.ones(n))

fig = plt.figure()
plt.plot(x, y, "r.", markersize = 12)
plt.plot(x, y_, "g.-", markersize = 12)
plt.plot(x, lr.predict(x[:, np.newaxis]), "b-")
plt.gca().add_collection(lc)
plt.legend(("Data", "Isotonic Fit", "Linear Fit"), loc = "lower right")
plt.title("Isotonic regression")
plt.show()
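Since the isotonic fit is non-decreasing by construction (IsotonicRegression defaults to increasing = True), a quick sanity check on the y_ computed above:

# every fitted value is >= the previous one
assert np.all(np.diff(y_) >= 0)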
Random forest
Random forests (random decision forests) are an ensemble learning
method for classification, regression and other tasks. They operate by
constructing a multitude of decision trees at training time and outputting
the class that is the mode of the classes (classification) or the mean prediction
(regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set
(random forests correct the decision tree's tendency to overfit the training data).
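A minimal classification sketch (my own illustration, not part of the translated text): sklearn's RandomForestClassifier builds the trees on bootstrap samples and aggregates their votes internally.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# 10 trees, each grown on a bootstrap sample of the training data;
# predict() returns the class most voted for by the individual trees
clf = RandomForestClassifier(n_estimators = 10, random_state = 0)
clf.fit(iris.data, iris.target)
print(clf.predict(iris.data[:3]))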

Decision tree

Decision tree learning uses a decision tree as a predictive model
which maps observations about an item to conclusions about the item's
target value. It is one of the predictive modelling approaches used in
statistics, data mining and machine learning. Tree models where the target
variable can take a finite set of values are called classification trees. In
these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels. Decision trees
where the target variable can take continuous values (typically real numbers)
are called regression trees.
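A small sketch (mine, assuming sklearn's DecisionTreeRegressor) showing that a regression tree predicts the mean target value within each leaf:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(10)[:, np.newaxis]   # one continuous feature
y = np.array([0., 0., 0., 1., 1., 1., 5., 5., 5., 5.])
tree = DecisionTreeRegressor(max_depth = 2)
tree.fit(X, y)
print(tree.predict(X))   # piecewise constant: the mean of y in each leaf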

In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree
describes data but not decisions; rather, the resulting classification tree
can be an input for decision making. This page deals with decision trees in
data mining.

Bootstrap aggregating (bagging)
Given a standard training set D of size n, bagging generates m new training
sets Di, each of size n', by sampling from D uniformly and with replacement
(sampling with replacement).
By sampling with replacement, some observations may be repeated in each Di.
If n' = n, then for large n each Di is expected to contain the fraction (1 - 1/e), about 63.2%, of the unique examples of D, the rest being duplicates. This kind of sample is known as a bootstrap sample.
The m models are fitted using the above m bootstrap samples and combined by
averaging the outputs (for regression) or voting (for classification)
(that is, fit a model on each resampled set, then combine by voting or averaging).
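The (1 - 1/e), roughly 63.2%, fraction is easy to verify empirically (a sketch of mine, not from the source text):

import numpy as np

n = 10000
rng = np.random.RandomState(0)
bootstrap = rng.randint(0, n, n)              # n index draws with replacement
print(len(np.unique(bootstrap)) / float(n))   # close to 1 - 1/e = 0.632...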

From bagging to random forests
The above procedure describes the original bagging algorithm for trees.
Random forests differ in only one way from this general scheme:
they use a modified tree learning algorithm that selects, at each candidate
split in the learning process, a random subset of the features. This
process is sometimes called "feature bagging"
(a random subset of the features is selected, and bagging is applied).
The reason for doing this is the correlation of the trees in an
ordinary bootstrap sample: if one or a few features are very strong
predictors for the response variable (target output), these features
will be selected in many of the B trees, causing them to become correlated.
An analysis of how bagging and random subspace projection contribute
to accuracy gains under different conditions is given by Ho
(random forest: combine randomly drawn samples with random feature projections).
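In sklearn this per-split feature subsampling is exposed as the max_features parameter (a hedged sketch; the default value differs across versions and between classifiers and regressors):

from sklearn.ensemble import RandomForestRegressor

# consider only sqrt(n_features) randomly chosen features at each split,
# which decorrelates the individual trees ("feature bagging")
rf = RandomForestRegressor(n_estimators = 100, max_features = "sqrt",
                           random_state = 0)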

The basic idea of a regression tree is to partition the data set with a decision tree (the partition uses only the features), fit a regression on each resulting subset, and average the regression results to obtain the solution. This carries over directly to the random forest setting.

sklearn.ensemble::RandomForestRegressor
The n_estimators parameter gives the number of trees used in the random forest.

numpy.random::shuffle shuffles an array in place.
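For example:

import numpy as np

a = np.arange(5)
np.random.shuffle(a)   # shuffles in place and returns None
print(a)               # e.g. [2 0 1 4 3]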

sklearn.preprocessing::Imputer (imputes missing values)

Below is a random forest program that compares imputation against simply dropping the incomplete samples:
import numpy as np
from sklearn.datasets import load_boston

rng = np.random.RandomState(0)
dataset = load_boston()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

# baseline: cross-validated score on the complete data
estimator = RandomForestRegressor(random_state = 0, n_estimators = 100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)

# mark 75% of the samples as having one missing feature each
missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))  # cast to int for use as a count
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples, dtype = bool),
                             np.ones(n_missing_samples, dtype = bool)))

rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)

# strategy 1: drop the samples that contain missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]

estimator = RandomForestRegressor(random_state = 0, n_estimators = 100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# strategy 2: impute the missing values (encoded here as 0) with the column mean
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer  # SimpleImputer in newer versions

estimator = Pipeline([("imputer", Imputer(missing_values = 0,
                                          strategy = "mean", axis = 0)),
                      ("forest",
                       RandomForestRegressor(random_state = 0, n_estimators = 100))])

score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)

The conclusion here is that using imputation generally gives better results.

matplotlib.pyplot::figure 
The figsize argument specifies the width and height of the figure (in inches).

plt.axes([.2, .2, .7, .7])
Specifies, as fractions of the figure, the rectangle [left, bottom, width, height] that the axes occupy.

plt.clf()
Clears the current figure.

np.logspace(start, stop, num = 50)
Returns num values spaced evenly on a logarithmic scale: 10 raised to the powers of the corresponding np.linspace(start, stop, num) sequence.
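For instance:

import numpy as np

print(np.logspace(-4, 4, 3))         # [1e-04, 1e+00, 1e+04]
print(10 ** np.linspace(-4, 4, 3))   # the same three values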

Logistic regression includes a penalty term, similar to the one in support vector machines, that penalizes the size (norm) of the parameter vector.
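In sklearn this penalty strength is controlled by the C parameter of LogisticRegression (the inverse of the regularization strength, so a smaller C means a heavier penalty); a quick sketch:

from sklearn.linear_model import LogisticRegression

weak = LogisticRegression(C = 1e4)     # large C: almost no regularization
strong = LogisticRegression(C = 1e-4)  # small C: coefficients shrunk hard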

Solving for the hyperparameters by grid search:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

# pipeline: reduce dimensionality with PCA, then classify with logistic regression
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps = [('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

pca.fit(X_digits)

# scree plot: variance explained by each principal component
plt.figure(1, figsize = (4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth = 2)
plt.axis("tight")
plt.xlabel("n_components")
plt.ylabel("explained_variance_")

# hyperparameter grid: number of PCA components and logistic C
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)

estimator = GridSearchCV(pipe, dict(pca__n_components = n_components,
                                    logistic__C = Cs))
estimator.fit(X_digits, y_digits)

# mark the number of components chosen by the grid search
plt.axvline(estimator.best_estimator_.named_steps["pca"].n_components,
            linestyle = ":", label = "n_components chosen")
plt.legend(prop = dict(size = 12))

plt.show()
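To inspect which grid point won (using the estimator fitted above):

print(estimator.best_params_)   # e.g. {'pca__n_components': 40, 'logistic__C': 1.0}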

Source: https://blog.csdn.net/sinat_30665603/article/details/51926732
