Isotonic regression
In numerical analysis, isotonic regression (IR) involves finding a weighted
least-squares fit x ∈ R^n to the observed data, with weight vector w ∈ R^n,
subject to a set of non-contradictory constraints of the kind xi >= xj; that is,
the components of x must preserve a given order.
Such constraints define a partial order or a total order and can be represented
as a directed graph G = (N, E), where N is the set of variables involved (the
nodes) and E is the set of pairs (i, j), one for each constraint xi >= xj (the
edges). Thus, the IR problem corresponds to a quadratic program (QP).
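Written out, the quadratic program has the standard weighted least-squares form below. The symbol a (the vector of observed values being fitted) is an assumption introduced here; the text above does not name it.

```latex
\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^{n} w_i \,(x_i - a_i)^2
\quad \text{subject to} \quad x_i \ge x_j \;\; \text{for all } (i, j) \in E
```

When the order is total, E is a simple chain and the constraints just require the fitted values to form a monotone sequence.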
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression
from sklearn.utils import check_random_state

# Noisy samples around a logarithmic (hence increasing) trend.
n = 100
x = np.arange(n)
rs = check_random_state(0)
y = rs.randint(-50, 50, size=(n,)) + 50. * np.log(1 + np.arange(n))

# Fit a non-decreasing function to the data.
ir = IsotonicRegression()
y_ = ir.fit_transform(x, y)

# LinearRegression expects a 2-D feature array, hence x[:, np.newaxis].
print(x)                 # shape (n,)
print(x[:, np.newaxis])  # shape (n, 1): a single feature column
lr = LinearRegression()
lr.fit(x[:, np.newaxis], y)

# Vertical segments joining each data point to its isotonic fit.
segments = [[[i, y[i]], [i, y_[i]]] for i in range(n)]
lc = LineCollection(segments, zorder=0)
lc.set_linewidths(0.5 * np.ones(n))

fig = plt.figure()
plt.plot(x, y, "r.", markersize=12)
plt.plot(x, y_, "g.-", markersize=12)
plt.plot(x, lr.predict(x[:, np.newaxis]), "b-")
plt.gca().add_collection(lc)  # the segments are not drawn unless added to the axes
plt.legend(("Data", "Isotonic Fit", "Linear Fit"), loc="lower right")
plt.title("Isotonic regression")
plt.show()
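For the total-order case (x1 <= x2 <= ... <= xn), the fit can be computed with the pool-adjacent-violators algorithm (PAVA). The following is a minimal pure-Python sketch, assuming unit weights unless others are passed; the function name pava is hypothetical and this is not scikit-learn's implementation.

```python
def pava(y, w=None):
    """Non-decreasing weighted least-squares fit to y (pool adjacent violators)."""
    if w is None:
        w = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the blocks back to one fitted value per input point.
    fit = []
    for block_mean, _, count in blocks:
        fit.extend([block_mean] * count)
    return fit


fitted = pava([1, 3, 2, 4, 3, 5])
print(fitted)
```

Each out-of-order pair (3, 2) and (4, 3) is replaced by its average, which is the least-squares choice once the two values are forced to be equal.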
Random forest
Random forests, or random decision forests, are an ensemble learning
method for classification, regression and other tasks. They operate by
constructing a multitude of decision trees at training time and outputting
the class that is the mode of the classes (classification) or the mean prediction
(regression) of the individual trees. Random decision forests correct for
decision trees' habit of overfitting to their training set.
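The aggregation step described above can be sketched in a few lines. The per-tree predictions below are made-up illustrative values, not the output of a real model, and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the most common class among the trees' predictions (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_outputs)


# Hypothetical predictions from five trees for a single input.
label = forest_classify(["cat", "dog", "cat", "cat", "dog"])
value = forest_regress([2.0, 3.0, 4.0, 3.0, 3.0])
print(label, value)
```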
Decision tree
Decision tree learning uses a decision tree as a predictive model
which maps observations about an item to conclusions about the item's
target value. It is one of the predictive modeling approaches used in
statistics, data mining and machine learning. Tree models where the target
variable can take a finite set of values are called classification trees. In
these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels. Decision trees where
the target variable can take continuous values (typically real numbers) are
called regression trees.
In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree
describes data but not decisions; rather, the resulting classification tree
can be an input for decision making. This page deals with decision trees in
data mining.
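To make the idea of branches as conjunctions of feature tests concrete, here is a small hand-built classification tree in plain Python. The features (outlook, windy) and the class labels are invented for illustration, and no learning takes place; the tree is written by hand.

```python
# A tiny hand-written classification tree: internal nodes test one feature,
# leaves (plain strings) hold class labels.
TREE = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "windy",
            "branches": {True: "stay in", False: "play"},
        },
        "rainy": "stay in",
    },
}


def predict(node, item):
    """Follow the branch matching the item's feature values down to a leaf."""
    while isinstance(node, dict):
        node = node["branches"][item[node["feature"]]]
    return node


def rules(node, conditions=()):
    """Yield (conjunction of feature tests, class label) for every leaf."""
    if not isinstance(node, dict):
        yield conditions, node
        return
    for value, child in node["branches"].items():
        yield from rules(child, conditions + ((node["feature"], value),))


prediction = predict(TREE, {"outlook": "sunny", "windy": False})
all_rules = list(rules(TREE))
print(prediction)
for conds, label in all_rules:
    print(" AND ".join(f"{f} == {v!r}" for f, v in conds), "->", label)
```

Each printed rule is one root-to-leaf path: the tests along the path are joined with AND, and the leaf supplies the class label.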
Bootstrap aggregating (bagging)
Given a standard training set D of size n, bagging generates m new training
sets Di, each of size n', by sampling from D uniformly and with replacement.
Because sampling is done with replacement, some observations may be repeated in each Di.
If n' = n, then for large n each Di is expected to contain the fraction (1 - 1/e), about 63.2%, of the unique examples of D, the rest being duplicates. This kind of sample is known as a bootstrap sample.
The m models are fitted using the above m bootstrap samples and combined by
averaging the output (for regression) or voting (for classification).
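The (1 - 1/e) figure is easy to check empirically. The sketch below draws a single bootstrap sample with n' = n using only the standard library; the seed and the dataset size of 10,000 are arbitrary choices, and bootstrap_sample is a hypothetical helper name.

```python
import math
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(0)  # fixed seed so the run is repeatable
data = list(range(10_000))
sample = bootstrap_sample(data, rng)

# Fraction of the original examples that appear at least once in the sample.
unique_fraction = len(set(sample)) / len(data)
expected = 1 - 1 / math.e
print(f"unique fraction: {unique_fraction:.3f}, expected about {expected:.3f}")
```

Bagging repeats this draw m times and fits one model per sample.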
From bagging to random forests
The above procedure describes the original bagging algorithm for trees.
Random forests differ in only one way from this general scheme:
they use a modified tree learning algorithm that selects, at each candidate
split in the learning process, a random subset of the features. This
process is sometimes called "feature bagging".
The reason for doing this is the correlation of the trees in an
ordinary bootstrap sample: if one or a few features are very strong
predictors for the response variable (target output), these features
will be selected in many of the trees, causing them to become correlated.
An analysis of how bagging and random subspace projection contribute
to accuracy gains under different conditions is given by Ho.
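A sketch of the feature-sampling step only (not a full tree learner): at each candidate split, a fresh random subset of feature indices is drawn and only those features are considered. The subset size of roughly sqrt(p) is a common rule of thumb and is an assumption here, as are the function name and the feature count of 13.

```python
import math
import random


def candidate_features(n_features, rng, k=None):
    """Pick the feature indices that one split is allowed to consider."""
    if k is None:
        # Assumed default: roughly sqrt(p) features per split.
        k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
n_features = 13
# Each split draws its own subset, so a dominant feature is often left out.
subsets = [candidate_features(n_features, rng) for _ in range(5)]
for split_id, subset in enumerate(subsets):
    print(f"split {split_id}: consider features {subset}")
```

In scikit-learn's random forest estimators, the max_features parameter controls how many features each split may consider.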
The n_estimators parameter specifies the number of trees used in the random forest.
numpy.random.shuffle shuffles an array in place.
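import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
# sklearn.cross_validation was removed; cross_val_score now lives in model_selection.
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# These notes originally used load_boston(), which was removed in scikit-learn 1.2;
# load_diabetes() is substituted here so that the example still runs.
dataset = load_diabetes()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_feature = X_full.shape[1]

# Baseline: cross-validated score on the complete data.
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)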
# Mark 75% of the samples as containing a missing value.
missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))  # array sizes must be ints
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=bool),
                             np.ones(n_missing_samples, dtype=bool)))
# Shuffle so that the affected samples are spread over the dataset.
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_feature, n_missing_samples)

# Score using only the samples without missing values.
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)
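# Simulate missing entries by zeroing one feature in each affected sample.
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()

from sklearn.pipeline import Pipeline
# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
# and always imputes column by column (the old axis=0 behaviour).
from sklearn.impute import SimpleImputer

# Replace the zeros with the column mean, then fit the forest.
estimator = Pipeline([("imputer", SimpleImputer(missing_values=0, strategy="mean")),
                      ("forest", RandomForestRegressor(random_state=0, n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)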
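plt.axes([.2, .2, .7, .7]) adds an axes at [left, bottom, width, height], given as fractions of the figure size.
plt.clf() clears the current figure.
np.logspace(start, stop, num=50) returns num values evenly spaced on a log scale between 10**start and 10**stop.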
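np.logspace is easiest to read from a concrete call. The short sketch below assumes NumPy is installed and uses the same arguments as the grid of C values in the pipeline example that follows.

```python
import numpy as np

# Three values spaced evenly on a log scale from 10**-4 to 10**4.
Cs = np.logspace(-4, 4, 3)
print(Cs)
```

The exponents -4, 0 and 4 are evenly spaced, so the three values are 10**-4, 10**0 and 10**4.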
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
# sklearn.grid_search was removed; GridSearchCV now lives in model_selection.
from sklearn.model_selection import GridSearchCV

# Chain PCA (dimensionality reduction) with logistic regression.
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
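# Plot the PCA spectrum; PCA must be fitted before explained_variance_ exists.
pca.fit(X_digits)
plt.figure(1, figsize=(4, 3))
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)

# Search over the number of PCA components and the regularization strength C.
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
estimator = GridSearchCV(pipe, dict(pca__n_components=n_components,
                                    logistic__C=Cs))
estimator.fit(X_digits, y_digits)

# Mark the number of components selected by the grid search.
plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=":", label="n_components chosen")
plt.legend(prop=dict(size=12))
plt.show()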