sklearn ships with a number of small datasets that can be loaded with the load_* functions of the datasets module; larger datasets are downloaded on demand with the fetch_* functions. The code below loads the Boston house-price data and downloads the California housing data.
from sklearn import datasets
boston = datasets.load_boston()
print(boston.DESCR)
california = datasets.fetch_california_housing('./temp')
# print(california.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
The sklearn.preprocessing module provides several utilities for preprocessing data, summarized in the table below:
Class or function | Purpose |
---|---|
StandardScaler | subtract the mean, then divide by the standard deviation |
MinMaxScaler | subtract the minimum, then divide by (max - min) |
normalize | scale each sample to unit norm (each row is divided by its L2 norm by default) |
Binarizer | binarize against a threshold s: x > s ? 1 : 0 |
A usage example follows. For the two scalers the statistics are computed per column of the 2-D array (axis 0); normalize, by contrast, works per row by default.
import numpy as np
from sklearn import preprocessing
a = np.array([[4., 2.], [2., 4.], [2., -2.]], dtype=float)
print(a)
scaler = preprocessing.StandardScaler()
r = scaler.fit_transform(a)
print(r)
scaler = preprocessing.MinMaxScaler()
r = scaler.fit_transform(a)
print(r)
r = preprocessing.normalize(a)
print(r)
binarizer = preprocessing.Binarizer(threshold=3.5)  # values > 3.5 become 1, the rest 0
r = binarizer.fit_transform(a)
print(r)
[[ 4. 2.]
[ 2. 4.]
[ 2. -2.]]
[[ 1.41421356 0.26726124]
[-0.70710678 1.06904497]
[-0.70710678 -1.33630621]]
[[1. 0.66666667]
[0. 1. ]
[0. 0. ]]
[[ 0.89442719 0.4472136 ]
[ 0.4472136 0.89442719]
[ 0.70710678 -0.70710678]]
[[1. 0.]
[0. 1.]
[0. 0.]]
### 1.3 Categorical Encoding
Categorical data must be converted to numbers so that it can take part in vector operations.
For numeric category codes, OneHotEncoder from the preprocessing package can be used; for string-valued categories, the feature_extraction module (e.g. DictVectorizer) helps.
from sklearn import preprocessing
from sklearn.feature_extraction import DictVectorizer
labels = [[1], [2], [3], [2]]
onehot = preprocessing.OneHotEncoder()
y = onehot.fit_transform(labels)
print(y.toarray())
labels = [{'kind':'apple'}, {'kind':'orange'}]
dv = DictVectorizer()
y = dv.fit_transform(labels)
print(y.toarray())
labels = [1,2,3,3,2,1]
lb = preprocessing.LabelBinarizer()
vec = lb.fit_transform(labels)
print(vec)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]
[0. 1. 0.]]
[[1. 0.]
[0. 1.]]
[[1 0 0]
[0 1 0]
[0 0 1]
[0 0 1]
[0 1 0]
[1 0 0]]
Missing values can be represented as NaN, but most computations cannot use them directly, so they are usually filled with a suitable value. Both sklearn and pandas can handle missing values.
import numpy as np
import pandas as pd
from sklearn import preprocessing
data = np.array([[1, 2], [np.nan, 4]])
print('origin:\n', data)
imputer = preprocessing.Imputer(strategy='mean')  # fill NaN with the column mean
r = imputer.fit_transform(data)
print('sklearn:\n', r)
data_df = pd.DataFrame(data)
df = data_df.fillna(data_df.mean())
print('pandas\n',df)
origin:
[[ 1. 2.]
[nan 4.]]
sklearn:
[[1. 2.]
[1. 4.]]
pandas
0 1
0 1.0 2.0
1 1.0 4.0
PCA lives in sklearn's decomposition module and can be used for dimensionality reduction.
The code below applies PCA to the iris features. Looking at the explained-variance ratios, well over 96% of the variance is captured by the first two principal components, so the data can be projected onto those two dimensions; the n_components parameter of PCA accepts either a number of components or a variance ratio. Plotting the two-dimensional data then gives a visual check of class separability.
Another option is the FactorAnalysis class, whose usage is similar to PCA. (KernelPCA is a further variant; the kernels linear, poly, rbf, sigmoid and cosine belong to it rather than to FactorAnalysis.)
Finally, a truncated SVD of the data matrix also reduces dimensionality. Example code for these methods and its output follow:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.decomposition import TruncatedSVD
iris = datasets.load_iris()
pca = PCA()
dt = pca.fit_transform(iris.data)
print(pca.explained_variance_ratio_)
'''
expected output (see below): [0.92461621 0.05301557 0.01718514 0.00518309]
'''
fig, axes = plt.subplots(1,3)
pca = PCA(n_components=2)
dt = pca.fit_transform(iris.data)
axes[0].scatter(dt[:,0], dt[:,1], c=iris.target)
fa = FactorAnalysis(n_components=2)
dt = fa.fit_transform(iris.data)
axes[1].scatter(dt[:,0], dt[:,1], c=iris.target)
svd = TruncatedSVD()
dt = svd.fit_transform(iris.data)
axes[2].scatter(dt[:,0], dt[:,1], c=iris.target)
[0.92461621 0.05301557 0.01718514 0.00518309]
For multi-step processing, Pipeline offers a convenient way to organize the code, as in the following example:
from sklearn import pipeline, preprocessing, decomposition, datasets
iris = datasets.load_iris()
imputer = preprocessing.Imputer()
pca = decomposition.PCA(n_components=2)
line = [('imputer', imputer), ('pca', pca)]
pipe = pipeline.Pipeline(line)
dt = pipe.fit_transform(iris.data)
print(dt.shape)  # (150, 2)
If we assume that the target, given the features, follows a Gaussian (normal) distribution, a Gaussian process can be used for regression.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.gaussian_process import GaussianProcess  # deprecated; see the note below
boston = datasets.load_boston()
sel = np.random.choice([True, False], len(boston.data), p=[0.75, 0.25])
gp = GaussianProcess()
gp.fit(boston.data[sel], boston.target[sel])
pred = gp.predict(boston.data[~sel])
diff = pred - boston.target[~sel]
xtick = range(len(pred))
fig, axes = plt.subplots(2,1)
axes[0].plot(xtick, pred, c='red',label='predict')
axes[0].plot(xtick, boston.target[~sel], c='blue', label='real')
axes[1].plot(xtick, diff)
plt.show()
Running this emits a series of DeprecationWarnings: GaussianProcess and its helper functions were deprecated in scikit-learn 0.18/0.19 and are scheduled for removal; GaussianProcessRegressor should be used instead.
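As the warnings suggest, the newer GaussianProcessRegressor class can do the same job. A minimal sketch reusing the boston data and the sel mask from the block above (the constant * RBF kernel choice here is illustrative, not taken from the original code):
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
# normalize_y helps when the targets are not centered around zero
gpr = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
gpr.fit(boston.data[sel], boston.target[sel])
pred = gpr.predict(boston.data[~sel])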
# linear regression fitted with stochastic gradient descent on a synthetic dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import SGDRegressor
X, y = datasets.make_regression(1000)
sel = np.random.choice([True, False], len(X), p=[0.75, 0.25])
sgd = SGDRegressor(max_iter=10, tol=0.1)
sgd.fit(X[sel], y[sel])
pred = sgd.predict(X[~sel])
diff = pred - y[~sel]
xtick = range(len(pred))
fig, axes = plt.subplots(2,1)
axes[0].plot(xtick, pred, c='red',label='predict')
axes[0].plot(xtick, y[~sel], c='blue', label='real')
axes[1].plot(xtick, diff)
plt.show()
# ordinary least-squares linear regression on the Boston data
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import linear_model
boston = datasets.load_boston()
model = linear_model.LinearRegression()
model.fit(boston.data, boston.target)
print(model.coef_)
pred = model.predict(boston.data)
diff = pred - boston.target
xtick = range(len(pred))
fig, axes = plt.subplots(2,1)
axes[0].plot(xtick, pred, c='red',label='predict')
axes[0].plot(xtick, boston.target, c='blue', label='real')
axes[1].plot(xtick, diff)
plt.show()
[-1.07170557e-01 4.63952195e-02 2.08602395e-02 2.68856140e+00
-1.77957587e+01 3.80475246e+00 7.51061703e-04 -1.47575880e+00
3.05655038e-01 -1.23293463e-02 -9.53463555e-01 9.39251272e-03
-5.25466633e-01]
Ridge regression is useful when the design matrix is not of full rank (for example, with collinear features); the L2 penalty shrinks the coefficients and stabilizes the solution.
To find a suitable regularization strength, cross-validation can be used (RidgeCV).
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
X, y = make_regression(n_samples=100, n_features=3, effective_rank=2, noise=5)
lr = LinearRegression()
lr.fit(X, y)
ridge = Ridge()
ridge.fit(X,y)
_, ax = plt.subplots(1,1)
ax.plot(range(len(lr.coef_)), lr.coef_)
ax.plot(range(len(ridge.coef_)), ridge.coef_)
plt.show()
ridge_cv = RidgeCV(alphas=[0.05, 0.08 ,0.1, 0.2, 0.8])
ridge_cv.fit(X, y)
print(ridge_cv.alpha_)
0.05
Lasso adds an L1 penalty, which drives uninformative coefficients to exactly zero.
In the example below, plain linear regression leaves all 500 coefficients nonzero, whereas after Lasso only a handful remain (6 in this run, close to the 5 informative features).
To find a suitable strength for the Lasso penalty, cross-validation is used (LassoCV).
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=500, n_informative=5, noise=10)
lr = LinearRegression()
lr.fit(X, y)
print(np.sum(lr.coef_ !=0))
lasso = Lasso()
lasso.fit(X, y)
print(np.sum(lasso.coef_ !=0))
from sklearn.linear_model import LassoCV
lasso_cv = LassoCV()
lasso_cv.fit(X, y)
print(lasso_cv.alpha_)
500
6
0.7173410510859349
LARS (least-angle regression) is a regression technique aimed at high-dimensional problems, i.e. where the number of features far exceeds the number of samples. A benefit of LARS is that the number of nonzero coefficients can be capped at a small value, which guards against overfitting. LarsCV can be used to search for good parameters.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lars
X, y = make_regression(n_samples=200, n_features=500, n_informative=10, noise=2)
lars = Lars(n_nonzero_coefs=10)
lars.fit(X, y)
print(np.sum(lars.coef_ != 0 ))
10
Logistic regression is used in much the same way as the linear models above. (Note that the example below trains only on the first 100 iris samples, i.e. classes 0 and 1, and then predicts samples of class 2, so the predictions can never be correct; the output shows them all off by one.)
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
iris = datasets.load_iris()
lr = LogisticRegression()
lr.fit(iris.data[:100], iris.target[:100])
pred = lr.predict(iris.data[100:])
diff = pred - iris.target[100:]
print(diff)
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1]
from sklearn import datasets
from sklearn.linear_model import BayesianRidge
import matplotlib.pyplot as plt
boston = datasets.load_boston()
bayes = BayesianRidge()
bayes.fit(boston.data, boston.target)
print(bayes.coef_)
# stronger Gamma hyperpriors on the noise and weight precisions
bayes2 = BayesianRidge(alpha_1=10, lambda_1=10)
bayes2.fit(boston.data, boston.target)
print(bayes2.coef_)
[-0.10035603 0.04970825 -0.04362901 1.89550379 -2.14222918 3.66953449
-0.01058388 -1.24482568 0.27964471 -0.01405975 -0.79764042 0.01011661
-0.56264033]
[-0.10035603 0.04970825 -0.04362901 1.89550379 -2.14222918 3.66953449
-0.01058388 -1.24482568 0.27964471 -0.01405975 -0.79764042 0.01011661
-0.56264033]
Gradient boosting (GradientBoosting) is an ensemble learning method. The ensemble module provides GradientBoostingRegressor (GBR); the code below shows how to use it and compares its errors with those of a linear model.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
boston = datasets.load_boston()
lr = LinearRegression()
lr.fit(boston.data, boston.target)
pred = lr.predict(boston.data)
diff = pred - boston.target
plt.hist(diff, color='g')
gbr = GBR(n_estimators=100, max_depth=3, loss='ls')
gbr.fit(boston.data, boston.target)
pred = gbr.predict(boston.data)
diff = pred - boston.target
plt.hist(diff, color='r')
(array([ 2., 2., 12., 36., 98., 125., 126., 74., 21., 10.]),
array([-5.64276266, -4.66928597, -3.69580928, -2.72233259, -1.7488559 ,
-0.77537921, 0.19809748, 1.17157417, 2.14505086, 3.11852755,
4.09200424]),
)
KMeans is the most commonly used clustering method; the KMeans class lives in the cluster module. Given a number of centers, the model clusters the data without supervision.
The labels_ attribute of a fitted KMeans instance holds the cluster label of every point, so points belonging to the same cluster can be inspected through it.
When the dataset is large, KMeans becomes slow; MiniBatchKMeans can then be used to speed up the computation.
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
rgb=np.array(['r', 'g', 'b'])
blobs, classes = make_blobs(500, centers=3)
_, ax = plt.subplots(figsize=(8,8))
ax.scatter(blobs[:,0], blobs[:, 1], color=rgb[classes])
kmean = KMeans(n_clusters=3)
kmean.fit(blobs)
print(kmean.cluster_centers_)
#for kind in range(len(classes)):
# print(blobs[kmean.labels_== kind])
ax.scatter(kmean.cluster_centers_[:,0], kmean.cluster_centers_[:,1], marker='*', color='black')
mb_kmean = MiniBatchKMeans(n_clusters=3, batch_size=100)
mb_kmean.fit(blobs)
print(mb_kmean.cluster_centers_)
ax.scatter(mb_kmean.cluster_centers_[:,0], mb_kmean.cluster_centers_[:,1], marker='o', color='yellow')
[[ 2.62505603 4.87281732]
[-4.10154624 -7.5409836 ]
[-4.5489481 4.00556156]]
[[ 2.72677398 4.8618203 ]
[-4.10933407 -7.50669163]
[-4.56130824 4.04383432]]
The mean Silhouette coefficient can be used to gauge clustering quality; the silhouette_score function in the metrics module computes it.
Scoring a range of cluster counts and picking a value near the highest score usually gives a reasonable number of clusters.
from sklearn import metrics
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
blobs, classes = make_blobs(500, centers=3)
scores = []
for k in range(2, 30):
    kmean = KMeans(n_clusters=k)
    kmean.fit(blobs)
    scores.append(metrics.silhouette_score(blobs, kmean.labels_))
print(scores)
plt.plot(range(2, 30), scores)  # x axis: number of clusters
[0.7886309932105221, 0.7529271814274128, 0.6146973574146535, 0.4302651403684647, 0.3285318660943046, 0.32560763252948505, 0.33836744321304496, 0.3452056475433186, 0.3466567774597919, 0.344592536284487, 0.34523240219690204, 0.34265770815544033, 0.3403885093830953, 0.3442014721691386, 0.34323575309881027, 0.3418006941853705, 0.35367134774799686, 0.34764025436969426, 0.34505365612553296, 0.3517117523350353, 0.3577407169626914, 0.3458147106597954, 0.34080496590913045, 0.33990321947115615, 0.35098916233043775, 0.344594071344537, 0.34477844590350915, 0.34794461122938564]
Given some points, we often want to compute their distances to other points under a chosen metric; sklearn's pairwise module provides this.
pairwise_distances computes all pairwise distances with the specified metric; supported metrics include hamming, cosine, euclidean, manhattan, l1, l2, cityblock, and others.
from sklearn.metrics import pairwise
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
points, labels = make_blobs(n_samples=20, centers=3)
plt.scatter(points[:, 0], points[:, 1], c=labels)
dist = pairwise.pairwise_distances(points, metric='euclidean')
print(dist.shape)
print(dist[0])
(20, 20)
[ 0. 22.89308807 2.56555945 13.24114319 11.02938882 20.82368041
11.51752616 22.43691943 22.86857453 21.34544019 11.56896063 1.69132639
20.90143498 12.3970374 11.6082638 0.34855675 10.33852078 1.10400601
1.53791614 0.53539892]
Sometimes outliers should be treated as anomalous data, identified, and removed from the dataset. A simple approach is to find the cluster centroid and flag the points farthest from it as potential outliers.
The distances could be computed with pairwise from the previous section, but the transform method of a fitted KMeans object already returns each point's distance to the cluster centers, saving us the work.
from sklearn.metrics import pairwise
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
points, labels = make_blobs(n_samples=20, centers=1)
print(points.shape)
plt.scatter(points[:, 0], points[:, 1])
kmean = KMeans(n_clusters=1)
kmean.fit(points)
plt.scatter(kmean.cluster_centers_[:,0], kmean.cluster_centers_[:, 1], color='green')
all_points = np.vstack([kmean.cluster_centers_, points])  # centroid first, then the data points
dist1 = pairwise.pairwise_distances(all_points)
dist2 = kmean.transform(points)
print(np.argsort(dist1[0])[1:] - 1)
sort_idx = np.argsort(dist2.ravel())
print(sort_idx)
out_p = points[sort_idx][-5:]
plt.scatter(out_p[:,0], out_p[:, 1], color='red')
(20, 2)
[ 3 17 1 5 19 2 14 8 11 10 0 4 16 18 12 7 13 15 9 6]
[ 3 17 1 5 19 2 14 8 11 10 0 4 16 18 12 7 13 15 9 6]
KNN is a supervised method: it builds the regression from the K nearest points in feature space rather than from the whole feature space, as ordinary regression does.
The principle is very simple: find the K points closest to the query point and average their target values.
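To make that concrete, here is a small NumPy sketch of a single KNN prediction (an illustration of the principle only, not how KNeighborsRegressor is implemented; X_train, y_train and the query point x are assumed to be given):
import numpy as np
def knn_predict(x, X_train, y_train, k=10):
    # Euclidean distance from the query point to every training point
    dist = np.linalg.norm(X_train - x, axis=1)
    # indices of the k nearest training points
    nearest = np.argsort(dist)[:k]
    # the prediction is the mean target value of those neighbours
    return y_train[nearest].mean()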
from sklearn import datasets
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
boston = datasets.load_boston()
knn = KNeighborsRegressor(n_neighbors=10)
train_num = 150
knn.fit(boston.data[:train_num], boston.target[:train_num])
pred = knn.predict(boston.data[train_num:])
print(mean_squared_error(boston.target[train_num:], pred))
print(r2_score(boston.target[train_num:], pred))
90.88220533707864
0.14269935079131102
A decision tree classifies by testing a sequence of branching conditions. sklearn's tree module provides the implementation; when building a tree you can specify the split criterion and the maximum depth.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
iris = datasets.load_iris()
clf = DecisionTreeClassifier(criterion='entropy',max_depth=2)
sel = np.random.choice([True, False], len(iris.data), p=[0.75, 0.25])
clf.fit(iris.data[sel], iris.target[sel])
pred = clf.predict(iris.data[~sel])
print((pred == iris.target[~sel]).mean())
0.972972972972973
Random forests are quite robust against overfitting: they build a large number of shallow trees, let each tree vote for a class, and take the majority vote. A random forest is therefore also an ensemble method.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=10, max_depth=2)
sel = np.random.choice([True, False], len(iris.data), p=[0.75, 0.25])
clf.fit(iris.data[sel], iris.target[sel])
pred = clf.predict(iris.data[~sel])
print((pred == iris.target[~sel]).mean())
0.975
The idea behind SVM is to find a hyperplane that separates the data into groups.
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn import datasets
iris = datasets.load_iris()
clf = SVC(kernel='rbf', gamma=0.1)
sel = np.random.choice([True, False], len(iris.data), p=[0.75, 0.25])
clf.fit(iris.data[sel], iris.target[sel])
pred = clf.predict(iris.data[~sel])
print((pred == iris.target[~sel]).mean())
clf = LinearSVC()
clf.fit(iris.data[sel], iris.target[sel])
pred = clf.predict(iris.data[~sel])
print((pred == iris.target[~sel]).mean())
1.0
0.9142857142857143
For models that are inherently binary, such as the linear classifiers above, multiclass problems can be handled with OneVsRestClassifier. It creates one classifier per class; each classifier predicts a single class, and among all classifiers the one with the highest output score determines the predicted class.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
iris = datasets.load_iris()
clf = OneVsRestClassifier(LogisticRegression(), n_jobs=2)
sel = np.random.choice([True, False], len(iris.data), p=[0.75, 0.25])
clf.fit(iris.data[sel], iris.target[sel])
pred = clf.predict(iris.data[~sel])
print((pred == iris.target[~sel]).mean())
0.975609756097561
LDA stands for linear discriminant analysis; it can be used when the classes have similar covariance matrices. QDA (quadratic discriminant analysis), also used below, drops that assumption and fits a covariance per class.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import classification_report
from sklearn import datasets
iris = datasets.load_iris()
lda = LinearDiscriminantAnalysis()
lda.fit(iris.data, iris.target)
pred = lda.predict(iris.data)
print(classification_report(iris.target, pred))
qda = QuadraticDiscriminantAnalysis()
qda.fit(iris.data, iris.target)
pred = qda.predict(iris.data)
print(classification_report(iris.target, pred))
precision recall f1-score support
0 1.00 1.00 1.00 50
1 0.98 0.96 0.97 50
2 0.96 0.98 0.97 50
avg / total 0.98 0.98 0.98 150
precision recall f1-score support
0 1.00 1.00 1.00 50
1 0.98 0.96 0.97 50
2 0.96 0.98 0.97 50
avg / total 0.98 0.98 0.98 150
from sklearn import linear_model
from sklearn import datasets
from sklearn.metrics import classification_report
iris = datasets.load_iris()
sgd = linear_model.SGDClassifier()
sgd.fit(iris.data, iris.target)
pred = sgd.predict(iris.data)
print(classification_report(iris.target, pred))
precision recall f1-score support
0 0.83 1.00 0.91 50
1 0.71 0.80 0.75 50
2 1.00 0.68 0.81 50
avg / total 0.85 0.83 0.82 150
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
"and default tol will be 1e-3." % type(self), FutureWarning)
The following example uses naive Bayes for a simple text-classification task. The corpus is the built-in 20 newsgroups dataset, from which a few categories are selected.
For features, sklearn's word-count vectors are used, and the effect of TF-IDF features on the classifier is compared.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
categories = ['rec.autos', 'talk.politics.guns', 'rec.motorcycles',]
newsgroups = fetch_20newsgroups(categories=categories)
# print('\n'.join(newsgroups.data[0:2]))
# print(newsgroups.target[0:2])
print(len(newsgroups.data)) # 1738
def train(vector, target):
    train_num = 1700
    bayes = GaussianNB()
    bayes.fit(vector[:train_num], target[:train_num])
    pred = bayes.predict(vector[train_num:])
    print(classification_report(target[train_num:], pred))
    return bayes
cv = CountVectorizer()
vector = cv.fit_transform(newsgroups.data).todense()
# print(vector[0])
print('train clf with CountVectorizer')
train(vector, newsgroups.target)
cv2 = TfidfVectorizer()
vector2 = cv2.fit_transform(newsgroups.data).todense()
print('train clf with TfidfVectorizer')
bayes = train(vector2, newsgroups.target)
print(newsgroups.data[-1])
print(vector2[-1])
print(bayes.predict(vector2[-1]))
print(newsgroups.target[-1])
kmean = KMeans(n_clusters=3)
kmean.fit(vector2)
ratio = np.sum(kmean.labels_ == newsgroups.target) / len(newsgroups.target)  # cluster ids are arbitrary, so this raw agreement is only a rough indication
print(ratio)
1738
train clf with CountVectorizer
precision recall f1-score support
0 0.93 1.00 0.96 13
1 1.00 0.86 0.92 7
2 1.00 1.00 1.00 18
avg / total 0.98 0.97 0.97 38
train clf with TfidfVectorizer
precision recall f1-score support
0 0.93 1.00 0.96 13
1 1.00 0.86 0.92 7
2 1.00 1.00 1.00 18
avg / total 0.98 0.97 0.97 38
Subject: thanks to poster of NY Times article on ATF in Texas
From: [email protected] (John Kim)
Distribution: world
Organization: Harvard University Science Center
Nntp-Posting-Host: scws8.harvard.edu
Lines: 12
good job to whoever posted the article. I'd
been saving that NYTimes edition for a while, planning to ytpe it
in myself, but now I don't have to.
For all of those people who were worried about whether or not
the media would even question the raid, we owe it to the
NY Times (despite their rabidly anti-gun editorials) for
being willing to talk to these 4 BATF agents.
-Case Kim
[[0. 0. 0. ... 0. 0. 0.]]
[2]
2
0.3003452243958573
If only part of a dataset is labeled and the rest is unlabeled, label propagation / label spreading algorithms can be tried.
The code below artificially marks some iris samples as unlabeled, trains with label propagation, and finally compares the predictions with the original labels.
from sklearn import datasets
import numpy as np
from sklearn import semi_supervised
from sklearn.metrics import classification_report
iris = datasets.load_iris()
X = iris.data.copy()
y = iris.target.copy()
names = iris.target_names.copy()
names = np.append(names, ['unlabeled'])
unlabel = np.random.choice([True, False], len(y))
y[unlabel] = -1
print(y[:10])
train = np.random.choice([True, False], len(y), p=[0.8, 0.2])
def run(clf, X, y, train, target):
    clf.fit(X[train], y[train])
    pred = clf.predict(X[~train])
    print(classification_report(target[~train], pred))
lp = semi_supervised.LabelPropagation()
ls = semi_supervised.LabelSpreading()
run(lp, X, y, train, iris.target)
run(ls, X, y, train, iris.target)
[-1 -1 -1 -1 0 0 -1 0 -1 0]
precision recall f1-score support
0 1.00 1.00 1.00 11
1 0.82 0.90 0.86 10
2 0.92 0.85 0.88 13
avg / total 0.91 0.91 0.91 34
precision recall f1-score support
0 1.00 1.00 1.00 11
1 0.82 0.90 0.86 10
2 0.92 0.85 0.88 13
avg / total 0.91 0.91 0.91 34
K-Fold splits the dataset into K parts and uses each part in turn as the test set, with the remaining data as the training set, until every part has served as the test set once.
sklearn's KFold returns a generator that yields a pair of train and test index arrays on each iteration.
import numpy as np
from sklearn.model_selection import KFold
X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 1, 0, 0, 0])
kfold = KFold(n_splits=3)
for train_index, test_index in kfold.split(X):
    print(train_index, test_index)
    # X_train, X_test = X[train_index], X[test_index]
    # y_train, y_test = y[train_index], y[test_index]
[2 3 4 5] [0 1]
[0 1 4 5] [2 3]
[0 1 2 3] [4 5]
The whole procedure from the previous section can be automated by a single function, cross_val_score.
from sklearn import ensemble
from sklearn import datasets
from sklearn import model_selection
boston = datasets.load_boston()
clf = ensemble.RandomForestRegressor(max_features='auto')
scores = model_selection.cross_val_score(clf, boston.data, boston.target)
print(scores)
[0.80401839 0.55784815 0.25571493]
ShuffleSplit is another cross-validation technique. It shuffles the dataset, randomly picks the specified number of elements as the test set, and uses the rest as the training set. Like KFold, it returns a generator that yields several pairs of index sets.
import numpy as np
from sklearn.model_selection import ShuffleSplit
X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 1, 0, 0, 0])
sp = ShuffleSplit(n_splits=4, train_size=4, test_size=2)
for train, test in sp.split(X):
    print(train, test)
[4 3 5 0] [2 1]
[3 2 4 0] [5 1]
[1 3 0 4] [2 5]
[3 5 4 1] [2 0]
Sometimes the classes in a dataset occur in a certain ratio, and we want each fold to preserve that ratio; StratifiedKFold generates training and test sets with the same class proportions.
In the code below, the ratio of positive to negative samples is 2:1, and the stratified folds keep that 2:1 ratio.
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 1, 1, 0, 0])
kfold = StratifiedKFold(n_splits=2)
for train, test in kfold.split(X, y):
    print(train, test)
[2 3 5] [0 1 4]
[0 1 4] [2 3 5]
Grid search automates the search over parameters. For a decision tree, for example, the parameters to tune include the split criterion and the maximum depth. We could write our own code to train models with every parameter combination and pick the best one, but sklearn has grid search built in.
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn import grid_search  # deprecated module; GridSearchCV now lives in sklearn.model_selection
iris = datasets.load_iris()
lr = LogisticRegression()
lr.fit(iris.data, iris.target)
search_params = {
    'penalty': ['l1', 'l2'],
    'C': [1, 2, 3, 4]
}
gs = grid_search.GridSearchCV(lr, search_params)
gs.fit(iris.data, iris.target)
print(gs.grid_scores_)
print(max(gs.grid_scores_, key=lambda x: x[1]))
[mean: 0.94000, std: 0.03207, params: {'C': 1, 'penalty': 'l1'}, mean: 0.94667, std: 0.01794, params: {'C': 1, 'penalty': 'l2'}, mean: 0.96000, std: 0.04217, params: {'C': 2, 'penalty': 'l1'}, mean: 0.95333, std: 0.02426, params: {'C': 2, 'penalty': 'l2'}, mean: 0.96000, std: 0.04217, params: {'C': 3, 'penalty': 'l1'}, mean: 0.95333, std: 0.02426, params: {'C': 3, 'penalty': 'l2'}, mean: 0.96000, std: 0.04217, params: {'C': 4, 'penalty': 'l1'}, mean: 0.96000, std: 0.02745, params: {'C': 4, 'penalty': 'l2'}]
mean: 0.96000, std: 0.04217, params: {'C': 2, 'penalty': 'l1'}
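Since the grid_search module is deprecated, the same search can also be written against model_selection. A minimal sketch reusing search_params and iris from above (note that the result attributes differ: best_score_ and best_params_ instead of grid_scores_):
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(LogisticRegression(), search_params)
gs.fit(iris.data, iris.target)
print(gs.best_score_, gs.best_params_)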
The dummy module provides DummyRegressor and DummyClassifier, which return regression or classification outputs according to a fixed strategy and are mainly useful as baselines; a small sketch follows.
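A minimal sketch of these baseline estimators on the iris and boston data (the strategy values 'most_frequent' and 'mean' are standard options):
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn import datasets
iris = datasets.load_iris()
boston = datasets.load_boston()
# always predicts the most frequent class, ignoring the features
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(iris.data, iris.target)
print(dummy_clf.score(iris.data, iris.target))
# always predicts the mean of the training targets
dummy_reg = DummyRegressor(strategy='mean')
dummy_reg.fit(boston.data, boston.target)
print(dummy_reg.score(boston.data, boston.target))  # R^2 of a constant prediction is 0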
The metrics module provides a variety of evaluation functions, with different ones for classification, regression, and clustering. A few regression metrics are listed below, followed by a short usage sketch:
Metric | Description |
---|---|
metrics.mean_squared_error | mean squared error (MSE) |
mean_squared_log_error | mean squared logarithmic error (MSLE) |
r2_score | R^2 score (coefficient of determination) |
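A small sketch of these metric functions on a made-up pair of target and prediction arrays:
import numpy as np
from sklearn import metrics
y_true = np.array([3.0, 2.5, 4.0, 7.0])
y_pred = np.array([2.5, 3.0, 4.0, 8.0])
print(metrics.mean_squared_error(y_true, y_pred))      # MSE
print(metrics.mean_squared_log_error(y_true, y_pred))  # MSLE (requires non-negative values)
print(metrics.r2_score(y_true, y_pred))                # coefficient of determination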
from sklearn import feature_selection
from sklearn import datasets
import matplotlib.pyplot as plt
boston = datasets.load_boston()
f, p = feature_selection.f_regression(boston.data, boston.target)  # univariate F-test score and p-value for each feature against the target
print(f)
print(p)
plt.bar(range(len(p)), p)
[ 88.15124178 75.2576423 153.95488314 15.97151242 112.59148028
471.84673988 83.47745922 33.57957033 85.91427767 141.76135658
175.10554288 63.05422911 601.61787111]
[2.08355011e-19 5.71358415e-17 4.90025998e-31 7.39062317e-05
7.06504159e-24 2.48722887e-74 1.56998221e-18 1.20661173e-08
5.46593257e-19 5.63773363e-29 1.60950948e-34 1.31811273e-14
5.08110339e-88]
# compare a linear model on all Boston features with one trained only on features selected by LassoCV
from sklearn import feature_selection
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import ShuffleSplit
import numpy as np
import matplotlib.pyplot as plt
boston = datasets.load_boston()
sp = ShuffleSplit(n_splits=4)
train, test = next(sp.split(boston.data))
#print(train, test)
lr = linear_model.LinearRegression()
lr.fit(boston.data[train], boston.target[train])
pred = lr.predict(boston.data[test])
mse = metrics.mean_squared_error(boston.target[test], pred)
print(mse)
la = linear_model.LassoCV()
la.fit(boston.data, boston.target)
print(la.coef_)
columns = np.arange(boston.data.shape[1])[la.coef_ > 0]  # keeps only features with positive coefficients; la.coef_ != 0 would be the more usual nonzero test
print(columns)
X = boston.data[:, columns]
lr.fit(X[train], boston.target[train])
pred = lr.predict(X[test])
mse = metrics.mean_squared_error(boston.target[test], pred)
print(mse)
23.782395846128534
[-0.07391859 0.04944576 -0. 0. -0. 1.80092396
0.01135702 -0.81333654 0.27206588 -0.01542027 -0.74314538 0.00898036
-0.70409988]
[ 1 5 6 8 11]
41.908110910553106
from sklearn.externals import joblib
joblib.dump(lr, 'model.clf')  # persist the fitted model to disk
m = joblib.load('model.clf')  # reload it later for prediction