[Hands-on Machine Learning, from Getting Started to Giving Up] Random Forest && Extremely Randomized Trees: A Complete Guide to the Parameters

Random Forest && Extremely Randomized Trees

Growing many trees helps improve classification accuracy.

Random Forest builds on decision trees by adding two sources of randomness:

  1. Each tree is trained on a randomly drawn subset of the dataset.
  2. At each split, a randomly chosen subset of the features (possibly all of them) forms the candidate pool.

Extremely Randomized Trees

  1. On top of the random forest's randomness, the split thresholds themselves are also chosen at random (see the toy sketch below).
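
To make the threshold randomization concrete, here is a toy, self-contained sketch. Every name and number in it is made up for illustration, and it is not how scikit-learn is implemented internally: a random-forest-style node searches all candidate thresholds for the lowest impurity, while an extra-trees-style node just draws a threshold at random.

import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=200)      # a single feature
y_toy = (x > 6.3).astype(int)         # true decision boundary at 6.3

def gini(labels):
    """Gini impurity of a binary label array."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def split_impurity(threshold):
    """Weighted Gini impurity after splitting x at threshold."""
    left, right = y_toy[x <= threshold], y_toy[x > threshold]
    n = len(y_toy)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# random-forest style: exhaustively pick the best threshold
best = min(np.unique(x), key=split_impurity)

# extra-trees style: draw one threshold uniformly at random
# (the real algorithm draws one per candidate feature, then keeps the
# best of those random splits)
random_t = rng.uniform(x.min(), x.max())

print("best threshold:   %.2f, impurity %.3f" % (best, split_impurity(best)))
print("random threshold: %.2f, impurity %.3f" % (random_t, split_impurity(random_t)))

The random split is usually worse for a single tree, but it is much cheaper to compute and decorrelates the trees, which is exactly what averaging over an ensemble rewards.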

Contents

  • Random Forest && Extremely Randomized Trees
  • Training without tuning
    • Using the random forest algorithm
    • Using Extremely Randomized Trees
  • Random forest parameters
  • Extremely Randomized Trees parameters
  • Effect of the number of trees on the two algorithms
  • Effect of tree depth on the two algorithms
  • Effect of the oob_score parameter
  • Effect of min_samples_split
  • Effect of min_samples_leaf
  • Summary

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics

Training without tuning

As before, we experiment on the US income dataset. First, a run without any parameter tuning.

X = pd.read_csv('american_salary_feture.csv')
y = pd.read_csv('american_salary_label.csv', header=None)
y = np.array(y).ravel()
print(X.shape, y.shape)
(32561, 106) (32561,)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Using the random forest algorithm

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)
print("train_score:",clf.score(X_train, y_train))
print("test_score:", clf.score(X_test, y_test))

print("train_f1_score:", metrics.f1_score(clf.predict(X_train), y_train))
print("test_f1_score:", metrics.f1_score(clf.predict(X_test), y_test))
train_score: 0.987018837018837
test_score: 0.8449821889202801
train_f1_score: 0.9723602755253291
test_f1_score: 0.6437041219649915

Clearly the model is overfitting: its performance on the training set is far better than on the test set.

Using Extremely Randomized Trees

clf_e = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)
clf_e.fit(X_train, y_train)
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
                     oob_score=False, random_state=0, verbose=0,
                     warm_start=False)
print("train_score:",clf_e.score(X_train, y_train))
print("test_score:", clf_e.score(X_test, y_test))

print("train_f1_score:", metrics.f1_score(clf_e.predict(X_train), y_train))
print("test_f1_score:", metrics.f1_score(clf_e.predict(X_test), y_test))
train_score: 1.0
test_score: 0.8259427588748312
train_f1_score: 1.0
test_f1_score: 0.6084553744128214

Still an overfit model, and here the overfitting is even more pronounced than the random forest's: a perfect training score paired with a lower test score.

Random forest parameters

  • n_estimators : integer, optional (default=10)

    The number of trees in the forest.

    In plain terms: how many trees the forest grows.

    Changed in version 0.20: The default value of n_estimators will change from 10 in version 0.20 to 100 in version 0.22.

  • criterion : string, optional (default=”gini”)

    The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.

    The split-quality criterion (impurity measure): either "gini" or "entropy" (information gain).

  • max_depth : integer or None, optional (default=None)

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

    Maximum tree depth. If unspecified, trees keep splitting until every leaf is pure or holds fewer than min_samples_split samples.

  • min_samples_split : int, float, optional (default=2)

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number.

    If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

    Changed in version 0.18: Added float values for fractions.

    If an int, the minimum number of samples an internal node must hold before it may be split; if a float, that minimum is ceil(min_samples_split * n_samples).

  • min_samples_leaf : int, float, optional (default=1)

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number.

    If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

    Changed in version 0.18: Added float values for fractions.

    If an int, the minimum number of samples every leaf must hold; if a float, the minimum is ceil(min_samples_leaf * n_samples). A candidate split that would leave fewer samples than this in either child is not made.

  • min_weight_fraction_leaf : float, optional (default=0.)

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

    The minimum fraction of the total sample weight that must sit in a leaf; when sample_weight is not provided, every sample carries equal weight.

  • max_features : int, float, string or None, optional (default=”auto”)

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split.

    If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    If “auto”, then max_features=sqrt(n_features).

    If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

    If “log2”, then max_features=log2(n_features).

    If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

    The maximum number of features considered at each split. An int is used directly; a float means int(max_features * n_features) features; "auto" and "sqrt" both mean the square root of the feature count; "log2" means log2(n_features); None means all features. Note that the search does not simply stop at max_features: if no valid split has been found yet, the tree keeps inspecting features until at least one valid partition exists.

  • max_leaf_nodes : int or None, optional (default=None)

    Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

    Caps the total number of leaves. Trees are grown best-first, where the "best" nodes are those with the largest relative impurity reduction; if None, the number of leaves is unlimited.

  • min_impurity_decrease : float, optional (default=0.)

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

    The weighted impurity decrease equation is the following:

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

    New in version 0.19.

    The minimum impurity decrease for a split: a node is split only if the split reduces the weighted impurity by at least this value, computed with the formula above. A worked numeric example follows this parameter list.

  • min_impurity_split : float, (default=1e-7)

    Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

    Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split will change from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.

    An early-stopping threshold: a node is split only while its impurity is above this value; otherwise it becomes a leaf. Deprecated in favor of min_impurity_decrease.

  • bootstrap : boolean, optional (default=True)

    Whether bootstrap samples are used when building trees. If False, the whole datset is used to build each tree.

    Whether to use bootstrap sampling (sampling with replacement) when building each tree. It suits smaller datasets, though resampling introduces some error of its own, and is arguably less necessary on very large datasets. If False, every tree is built from the whole dataset.

  • oob_score : bool (default=False)

    Whether to use out-of-bag samples to estimate the generalization accuracy.

    With bootstrap sampling, roughly 36.8% of the samples are left out of each tree's sample; scoring on these out-of-bag samples gives an estimate of generalization accuracy, which is handy for spotting overfitting.

  • n_jobs : int or None, optional (default=None)

    The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

    How many jobs to run in parallel for fit and predict.

  • random_state : int, RandomState instance or None, optional (default=None)

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

    Controls the randomness of the bootstrapping and the feature sampling; if None, np.random is used.

  • verbose : int, optional (default=0)

    Controls the verbosity when fitting and predicting.

    Controls how much progress information is printed while fitting and predicting.

  • warm_start : bool, optional (default=False)

    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.

    Continue training from a previously fitted model, adding more estimators to the ensemble.

  • class_weight : dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)

    Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

    Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

    The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

    The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.

    For multi-output, the weights of each column of y will be multiplied.

    Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

    Per-class weights, in the form shown above. For multi-output labels, the weights of each column of y are multiplied together, and if sample_weight is passed to fit, the class weights are multiplied with it as well. With "balanced_subsample" (and bootstrap enabled), the weights are recomputed on the bootstrap sample drawn for each tree.
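
As promised under min_impurity_decrease, a small worked example of the weighted impurity-decrease formula. All the numbers are hypothetical, chosen only to make the arithmetic easy to follow:

# Hypothetical node: 200 of 1000 (unweighted) samples, splitting 120/80.
N, N_t, N_t_L, N_t_R = 1000, 200, 120, 80
impurity, left_impurity, right_impurity = 0.48, 0.30, 0.20

decrease = N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)
print(decrease)  # 0.2 * (0.48 - 0.08 - 0.18) = 0.044

The split is made only when min_impurity_decrease is at most this value, 0.044 here.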

Extremely Randomized Trees parameters


The two classifiers accept exactly the same parameters, so the list above applies verbatim. The one notable difference is the default for bootstrap: True for RandomForestClassifier but False for ExtraTreesClassifier, as the model repr printed earlier shows. With that settled, we compare the two algorithms side by side below.
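
One quick way to check this claim, sketched with the estimators' standard get_params() method (the exact diff may vary across scikit-learn versions):

rf_defaults = RandomForestClassifier().get_params()
et_defaults = ExtraTreesClassifier().get_params()
# keep only the defaults where the two classifiers disagree
print({k: (rf_defaults[k], et_defaults[k])
       for k in rf_defaults if rf_defaults[k] != et_defaults[k]})
# expected on this scikit-learn version: {'bootstrap': (True, False)}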

Effect of the number of trees on the two algorithms

estimators = [10 * x for x in range(1, 11)]
estimators
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
f1_score_random_forest_train = np.zeros(len(estimators))
f1_score_random_forest_test = np.zeros(len(estimators))
f1_score_extreme_random_train = np.zeros(len(estimators))
f1_score_extreme_random_test = np.zeros(len(estimators))
# bootstrap expects a boolean: the string 'False' would be truthy and silently
# keep bootstrapping on. With bootstrap=False, class_weight="balanced_subsample"
# falls back to plain "balanced", since there is no per-tree bootstrap sample.
for i, estimator in enumerate(estimators):
    clf1 = RandomForestClassifier(n_estimators=estimator, random_state=0,
                                  bootstrap=False, class_weight="balanced_subsample")
    clf1.fit(X_train, y_train)
    f1_score_random_forest_test[i] = metrics.f1_score(clf1.predict(X_test), y_test)
    f1_score_random_forest_train[i] = metrics.f1_score(clf1.predict(X_train), y_train)

    clf2 = ExtraTreesClassifier(n_estimators=estimator, random_state=0,
                                bootstrap=False, class_weight="balanced_subsample")
    clf2.fit(X_train, y_train)
    f1_score_extreme_random_test[i] = metrics.f1_score(clf2.predict(X_test), y_test)
    f1_score_extreme_random_train[i] = metrics.f1_score(clf2.predict(X_train), y_train)
plt.figure(figsize=(10,6))
sns.set(style="whitegrid")
data = pd.DataFrame({"f1_score_random_forest_train":f1_score_random_forest_train,
                     "f1_score_random_forest_test": f1_score_random_forest_test,
                     "f1_score_extreme_random_train": f1_score_extreme_random_train,
                     "f1_score_extreme_random_test": f1_score_extreme_random_test},
                   index=estimators)
sns.lineplot(data=data)
plt.xlabel("estimators")
plt.ylabel("score")
plt.title("scores varies with number of estimators")
Text(0.5, 1.0, 'scores varies with number of estimators')

[Figure 1: f1 scores vs. number of estimators]

The models peak at roughly 60 trees, and on the test set the random forest scores better than the extra trees.

Effect of tree depth on the two algorithms

depths = range(3, 50)
f1_score_random_forest_train = np.zeros(len(depths))
f1_score_random_forest_test = np.zeros(len(depths))
f1_score_extreme_random_train = np.zeros(len(depths))
f1_score_extreme_random_test = np.zeros(len(depths))
for i, depth in enumerate(depths):
    clf1 = RandomForestClassifier(n_estimators=60, max_depth=depth, random_state=0,
                                  bootstrap=False, class_weight="balanced_subsample")
    clf1.fit(X_train, y_train)
    f1_score_random_forest_test[i] = metrics.f1_score(clf1.predict(X_test), y_test)
    f1_score_random_forest_train[i] = metrics.f1_score(clf1.predict(X_train), y_train)

    clf2 = ExtraTreesClassifier(n_estimators=60, max_depth=depth, random_state=0,
                                bootstrap=False, class_weight="balanced_subsample")
    clf2.fit(X_train, y_train)
    f1_score_extreme_random_test[i] = metrics.f1_score(clf2.predict(X_test), y_test)
    f1_score_extreme_random_train[i] = metrics.f1_score(clf2.predict(X_train), y_train)
plt.figure(figsize=(10,6))
sns.set(style="whitegrid")
data = pd.DataFrame({"f1_score_random_forest_train":f1_score_random_forest_train,
                     "f1_score_random_forest_test": f1_score_random_forest_test,
                     "f1_score_extreme_random_train": f1_score_extreme_random_train,
                     "f1_score_extreme_random_test": f1_score_extreme_random_test})
sns.lineplot(data=data)
plt.xlabel("depths")
plt.ylabel("score")
plt.title("scores varies with number of depths")
Text(0.5, 1.0, 'scores varies with number of depths')

[Figure 2: f1 scores vs. tree depth]

From the curves, a depth of about 10 balances the training and test scores: the model is not badly overfit and the score is still reasonably high.

Effect of the oob_score parameter

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf = clf.fit(X_train, y_train)

print("train_score:",clf.score(X_train, y_train))
print("test_score:", clf.score(X_test, y_test))

print("train_f1_score:", metrics.f1_score(clf.predict(X_train), y_train))
print("test_f1_score:", metrics.f1_score(clf.predict(X_test), y_test))
train_score: 1.0
test_score: 0.8514924456454981
train_f1_score: 1.0
test_f1_score: 0.670482420278005
clf = RandomForestClassifier(n_estimators=100, random_state=0, oob_score=True)
clf = clf.fit(X_train, y_train)

print("train_score:",clf.score(X_train, y_train))
print("test_score:", clf.score(X_test, y_test))

print("train_f1_score:", metrics.f1_score(clf.predict(X_train), y_train))
print("test_f1_score:", metrics.f1_score(clf.predict(X_test), y_test))
train_score: 1.0
test_score: 0.8514924456454981
train_f1_score: 1.0
test_f1_score: 0.670482420278005

On its own, oob_score does nothing to curb overfitting, and the identical scores are expected: it does not change how the forest is trained, it merely computes an out-of-bag estimate of the generalization accuracy, exposed after fitting as clf.oob_score_.
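
A short sketch of what the option actually buys you (oob_score_ is the standard scikit-learn attribute; it requires bootstrap=True, the random forest default):

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_train, y_train)
# out-of-bag accuracy estimate, computed from the samples each tree never saw
print("oob_score_:", clf.oob_score_)
print("test score:", clf.score(X_test, y_test))

If the two numbers track each other, the out-of-bag estimate can stand in for a held-out validation set when data is scarce.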

Effect of min_samples_split

This parameter is the minimum number of samples a node must hold before it may be split (default 2). We try a series of values to see how much it helps against overfitting.

min_sample = range(2, 51, 2)
f1_score_random_forest_train = np.zeros(len(min_sample))
f1_score_random_forest_test = np.zeros(len(min_sample))
f1_score_extreme_random_train = np.zeros(len(min_sample))
f1_score_extreme_random_test = np.zeros(len(min_sample))
for i, sample in enumerate(min_sample):
    clf1 = RandomForestClassifier(n_estimators=60, random_state=0, min_samples_split=sample,
                                  bootstrap=False, class_weight="balanced_subsample")
    clf1.fit(X_train, y_train)
    f1_score_random_forest_test[i] = metrics.f1_score(clf1.predict(X_test), y_test)
    f1_score_random_forest_train[i] = metrics.f1_score(clf1.predict(X_train), y_train)

    clf2 = ExtraTreesClassifier(n_estimators=60, random_state=0, min_samples_split=sample,
                                bootstrap=False, class_weight="balanced_subsample")
    clf2.fit(X_train, y_train)
    f1_score_extreme_random_test[i] = metrics.f1_score(clf2.predict(X_test), y_test)
    f1_score_extreme_random_train[i] = metrics.f1_score(clf2.predict(X_train), y_train)
plt.figure(figsize=(10,6))
sns.set(style="darkgrid")
data = pd.DataFrame({"f1_score_random_forest_train":f1_score_random_forest_train,
                     "f1_score_random_forest_test": f1_score_random_forest_test,
                     "f1_score_extreme_random_train": f1_score_extreme_random_train,
                     "f1_score_extreme_random_test": f1_score_extreme_random_test},
                   index = min_sample)
sns.lineplot(data=data)
plt.xlabel("min_samples_split")
plt.ylabel("score")
plt.title("scores varies with number of min_samples_split")
Text(0.5, 1.0, 'scores varies with number of min_samples_split')

[Figure 3: f1 scores vs. min_samples_split (integer values)]

Here the random forest's test f1_score climbs past 0.70, a solid result, and noticeably better than reining in overfitting with the blunt instrument of tree depth. Next we tune the same parameter with fractional values.

min_sample = np.linspace(0.001, 0.02, 20)
f1_score_random_forest_train = np.zeros(len(min_sample))
f1_score_random_forest_test = np.zeros(len(min_sample))
f1_score_extreme_random_train = np.zeros(len(min_sample))
f1_score_extreme_random_test = np.zeros(len(min_sample))
for i, sample in enumerate(min_sample):
    clf1 = RandomForestClassifier(n_estimators=60, random_state=0, min_samples_split=sample,
                                  bootstrap=False, class_weight="balanced_subsample")
    clf1.fit(X_train, y_train)
    f1_score_random_forest_test[i] = metrics.f1_score(clf1.predict(X_test), y_test)
    f1_score_random_forest_train[i] = metrics.f1_score(clf1.predict(X_train), y_train)

    clf2 = ExtraTreesClassifier(n_estimators=60, random_state=0, min_samples_split=sample,
                                bootstrap=False, class_weight="balanced_subsample")
    clf2.fit(X_train, y_train)
    f1_score_extreme_random_test[i] = metrics.f1_score(clf2.predict(X_test), y_test)
    f1_score_extreme_random_train[i] = metrics.f1_score(clf2.predict(X_train), y_train)
plt.figure(figsize=(10,6))
sns.set(style="darkgrid")
data = pd.DataFrame({"f1_score_random_forest_train":f1_score_random_forest_train,
                     "f1_score_random_forest_test": f1_score_random_forest_test,
                     "f1_score_extreme_random_train": f1_score_extreme_random_train,
                     "f1_score_extreme_random_test": f1_score_extreme_random_test},
                   index = min_sample)
sns.lineplot(data=data)
plt.xlabel("min_samples_split(fraction)")
plt.ylabel("score")
plt.title("scores varies with number of min_samples_split")
Text(0.5, 1.0, 'scores varies with number of min_samples_split')

[Figure 4: f1 scores vs. min_samples_split (fractions)]

As the minimum-split fraction grows, the overfitting gradually disappears but underfitting starts to set in, and the test-set curve is visibly jagged, i.e. the score oscillates as the parameter changes. For this dataset a min_samples_split of about 0.009 looks like a good choice.
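
For intuition, the fraction can be converted to an absolute sample count with the ceil rule quoted in the parameter docs. The 24420 figure below assumes the default 75/25 split produced by train_test_split above:

import math

n_train = len(X_train)             # 24420 rows with the default split
print(math.ceil(0.009 * n_train))  # ceil(219.78) -> a node needs ~220 samples to split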

Effect of min_samples_leaf

This parameter is the minimum number of samples every leaf must hold: if a candidate split would leave fewer than min_samples_leaf samples in the left or right child, the split is not made.

min_sample_leaf = range(1, 31)
f1_score_random_forest_train = np.zeros(len(min_sample_leaf))
f1_score_random_forest_test = np.zeros(len(min_sample_leaf))
f1_score_extreme_random_train = np.zeros(len(min_sample_leaf))
f1_score_extreme_random_test = np.zeros(len(min_sample_leaf))
for i, sample in enumerate(min_sample_leaf):
    clf1 = RandomForestClassifier(n_estimators=60, random_state=0, min_samples_leaf=sample,
                                  bootstrap=False, class_weight="balanced_subsample")
    clf1.fit(X_train, y_train)
    f1_score_random_forest_test[i] = metrics.f1_score(clf1.predict(X_test), y_test)
    f1_score_random_forest_train[i] = metrics.f1_score(clf1.predict(X_train), y_train)

    clf2 = ExtraTreesClassifier(n_estimators=60, random_state=0, min_samples_leaf=sample,
                                bootstrap=False, class_weight="balanced_subsample")
    clf2.fit(X_train, y_train)
    f1_score_extreme_random_test[i] = metrics.f1_score(clf2.predict(X_test), y_test)
    f1_score_extreme_random_train[i] = metrics.f1_score(clf2.predict(X_train), y_train)
plt.figure(figsize=(10,6))
sns.set(style="darkgrid")
data = pd.DataFrame({"f1_score_random_forest_train":f1_score_random_forest_train,
                     "f1_score_random_forest_test": f1_score_random_forest_test,
                     "f1_score_extreme_random_train": f1_score_extreme_random_train,
                     "f1_score_extreme_random_test": f1_score_extreme_random_test},
                   index = min_sample_leaf)
sns.lineplot(data=data)
plt.xlabel("min_samples_leaf")
plt.ylabel("score")
plt.title("scores varies with number of min_samples_leaf")
Text(0.5, 1.0, 'scores varies with number of min_samples_leaf')

[Figure 5: f1 scores vs. min_samples_leaf]

Judging from the results, this parameter reins in overfitting well, though it also costs some accuracy.

Summary

In this article we looked at five parameters:

  • n_estimators: more trees are not automatically better. The score improves as trees are added, but past a certain point it plateaus and can even drift slightly downward.

  • max_depth: reducing the depth can cure overfitting, but it is a coarse knob, and it is easy to overshoot and drag the model's performance down too far.

  • oob_score: no visible effect on its own, since it only adds an out-of-bag estimate of generalization; treat it as a diagnostic to use alongside other parameters.

  • min_samples_split: controls overfitting effectively and with fine granularity; recommended.

  • min_samples_leaf: also handles overfitting well, and works even better combined with min_samples_split, as in the sketch below.
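
To close, a sketch of a forest combining the two recommended knobs. The values are illustrative picks based on the curves above, not the result of a systematic search:

clf_final = RandomForestClassifier(n_estimators=60,
                                   min_samples_split=0.009,  # from the fraction sweep above
                                   min_samples_leaf=5,       # illustrative value
                                   class_weight="balanced_subsample",
                                   random_state=0)
clf_final.fit(X_train, y_train)
print("train_f1_score:", metrics.f1_score(clf_final.predict(X_train), y_train))
print("test_f1_score:", metrics.f1_score(clf_final.predict(X_test), y_test))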
