FavoriteStar

Hands-On Machine Learning with sklearn

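Contents:

- LinearRegression
  - Introduction to linear regression
    - Generating the data
    - Defining the model
    - Testing and comparing the model
  - Polynomial regression
    - Implementation
- LogisticRegression
  - A brief sketch of the algorithm
  - Implementation
- Decision Tree
- MLP
- SVM
  - Linear SVM
  - Polynomial kernel
  - Gaussian kernel
  - Comparing kernels on MNIST
    - Loading the data
    - Gaussian kernel
    - Polynomial kernel
    - Linear kernel
- NBayes
- Bagging and random forests
- AdaBoost
- The k-means algorithm
- KNN
- PCA
- HMM
- visualizetion_report
  - Loading the datasets
    - Handwritten digits dataset
    - Breast cancer dataset
    - Boston housing dataset
  - Performance visualization
    - Cross-validation plots
    - Feature importance plots
  - Machine learning metrics
    - Confusion matrix
    - ROC and AUC curves
    - PR curve
    - Silhouette analysis
    - Reliability curve
    - KS statistic
    - Cumulative gains curve
    - Lift curve
  - Clustering
    - Elbow method
  - Dimensionality reduction
    - PCA
    - 2-D Projection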

LinearRegression

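Introduction to linear regression

Generating the data

To make the idea behind the algorithm easy to see, we first generate some two-dimensional data.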

import numpy as np
import matplotlib.pyplot as plt 

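def true_fun(X):  # the true function we choose, i.e. the ground-truth model
    return 1.5*X + 0.2

np.random.seed(0)  # fix the random seed
n_samples = 30  # number of sampled data points

# Generate random data as the training set and add some noise
X_train = np.sort(np.random.rand(n_samples))
y_train = (true_fun(X_train) + np.random.randn(n_samples) * 0.05).reshape(n_samples,1)
# the training targets include Gaussian noise with standard deviation 0.05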

Defining the model

We can simply call LinearRegression from sklearn:

from sklearn.linear_model import LinearRegression
model = LinearRegression()  # this is our model
model.fit(X_train[:, np.newaxis], y_train)  # train the model
print("Learned weight w:", model.coef_)
print("Learned bias b:", model.intercept_)

Learned weight w: [[1.4474774]]
Learned bias b: [0.22557542]

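Note the np.newaxis in the code above. X_train is a one-dimensional vector, and np.newaxis turns it into an N×1 two-dimensional matrix. Writing X_train[:, None] has the same effect.

As for why this is needed: if you skip it, fit raises an error that contains:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In short, sklearn expects the training input to be a 2-D array of shape (n_samples, n_features), not a one-dimensional vector.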
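To make the reshaping concrete, here is a small sketch using only NumPy. The array `x` and the names `col_a`, `col_b`, `col_c` are made up for illustration; the point is only that the three spellings produce the same (n_samples, 1) column.

```python
import numpy as np

x = np.arange(5, dtype=float)   # shape (5,): a 1-D vector of 5 samples

col_a = x[:, np.newaxis]        # shape (5, 1)
col_b = x[:, None]              # np.newaxis is simply an alias for None
col_c = x.reshape(-1, 1)        # -1 lets NumPy infer the number of rows

print(x.shape, col_a.shape, col_b.shape, col_c.shape)
```

Any of the three forms can be passed to fit; which one to use is a matter of taste.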

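Testing and comparing the model

The learned parameters are about 1.45 and 0.23, quite close to the true values 1.5 and 0.2. Next we take a batch of test points and compare the fitted line with the true function:

X_test = np.linspace(0,1,100)  # 100 evenly spaced points between 0 and 1
plt.plot(X_test, model.predict(X_test[:, np.newaxis]), label = "Model")  # plot the fitted line
plt.plot(X_test, true_fun(X_test), label = "True function")  # the ground truth
plt.scatter(X_train, y_train)  # plot the training points
plt.legend(loc="best")  # place the legend in the best location
plt.show()

This is the simplest case. When the relationship between x and y is not linear, a straight line is no longer enough and we need polynomial regression.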

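Polynomial regression

Implementation

Polynomial regression is usually solved as a linear regression on the powers of x, $y=b_0+\sum_{i=1}^m b_i \times x^i$ , so the code is as follows: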

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures  # class that computes polynomial features
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score  # cross-validation

def true_fun(X):  # the true function
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 30

X = np.sort(np.random.rand(n_samples))  # sample at random, then sort
y = true_fun(X) + np.random.randn(n_samples) * 0.1

degrees = [1, 4, 15]  # highest polynomial degree; we try degrees 1, 4 and 15
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i+1)  # three subplots in total; get the axes of subplot i+1
    plt.setp(ax, xticks = (), yticks = ())  # set properties of ax (here: hide the ticks)
    polynomial_features = PolynomialFeatures(degree=degrees[i],include_bias=False)
    # feature builder: the first argument is the highest degree, the second whether to add a bias column
    linear_regression = LinearRegression()  # linear regression
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])  # chain the steps with a pipeline
    pipeline.fit(X[:, np.newaxis], y)
    scores = cross_val_score(pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10)
    # cross-validation arguments: the model, the inputs, the targets, the scoring metric, the number of folds
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")    
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(degrees[i], -scores.mean(), scores.std()))
plt.show()

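Two things deserve an explanation:

PolynomialFeatures: this class builds features. Our original X is a one-dimensional vector, i.e. only the first power of x. To fit a polynomial we need $X^1,X^2,...,X^m$ (this is the single-variable case; with several variables the cross terms are generated too). The class performs exactly this expansion and produces m features.
pipeline: a convenience that chains several steps so that we do not have to run each one by hand. Here it joins PolynomialFeatures and the linear regression: X is first expanded into polynomial features and then passed to the linear regression, so fitting the pipeline is all we need.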
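The two steps can be imitated by hand with NumPy. This is only a sketch of the idea for a single input variable, not sklearn's implementation: the helper `polynomial_features` and the toy data are made up for illustration, and the least-squares fit uses `np.linalg.lstsq` in place of LinearRegression.

```python
import numpy as np

def polynomial_features(x, degree):
    """Return the columns [x, x^2, ..., x^degree] for a 1-D input (no bias column)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** k for k in range(1, degree + 1)])

# Step 1: feature expansion, one row per sample and one column per power.
features = polynomial_features(np.array([2.0, 3.0]), degree=3)
print(features)

# Step 2: ordinary least squares on the expanded features.
x_train = np.linspace(0, 1, 20)
y_train = 1.0 + 2.0 * x_train + 3.0 * x_train ** 2        # noiseless quadratic
A = np.column_stack([np.ones_like(x_train),               # intercept column
                     polynomial_features(x_train, 2)])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
print(coef)   # intercept, then the coefficients of x and x^2
```

Because the toy targets are an exact quadratic, the fit recovers the generating coefficients up to rounding.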

We also used cross-validation here; it is a standard technique, so we do not go into it.

LogisticRegression

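A brief sketch of the algorithm

Logistic regression is mostly used for binary classification. Given data $X=\{x_1,x_2,...,\}，Y=\{y_1,y_2,...,\}$ with $y_i\in\{0,1\}$,
consider the binary classification task. The hypothesis function is:
$h_{\theta}(x) = g(\theta^Tx)=g(w^Tx+b)=\frac{1}{1+e^{-(w^Tx+b)}}$
which gives the probability that x belongs to class 1; the probability of class 0 is $1-h_{\theta}(x)$.

The loss function is usually derived by maximum likelihood estimation:
$L(\theta)=\prod_{i=1}p(y_i\mid x_i)=\prod_{i=1}h_{\theta}(x_i)^{y_i}(1-h_{\theta}(x_i))^{1-y_i}$
For example, with $y_1=1,y_2=0$ the product begins $h_{\theta}(x_1)(1-h_{\theta}(x_2))...$ . We want to maximize this function, which simplifies to:
$\theta^{*}=arg\max_{\theta}L(\theta)=arg\min_{\theta}-\ln(L(\theta))\\ =arg\min_{\theta}\sum_{i=1}(-y_i\theta^Tx_i+\ln(1+e^{\theta^Tx_i}))$
where the bias b is absorbed into $\theta$. The minimum is then found with gradient descent.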
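The gradient-descent step can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions of my own (toy Gaussian data, a fixed learning rate of 0.1, 1000 iterations, the bias handled by an appended column of ones); it is not how sklearn's solvers work internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    """Mean negative log-likelihood: -y * z + ln(1 + e^z), with z = theta^T x."""
    z = X @ theta
    return np.mean(-y * z + np.logaddexp(0.0, z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of the mean NLL
        theta -= lr * grad
    return theta

# Toy data (hypothetical): two Gaussian blobs, 100 points each.
rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(-2.0, 1.0, size=(n, 2)),
               rng.normal(2.0, 1.0, size=(n, 2))])
X = np.hstack([X, np.ones((2 * n, 1))])                  # bias column
y = np.concatenate([np.zeros(n), np.ones(n)])

theta = fit_logistic(X, y)
accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == y)
print(nll(np.zeros(3), X, y), nll(theta, X, y), accuracy)
```

The loss starts at ln 2 (all probabilities 0.5) and decreases as theta is updated.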

算法实现

# sklearn version
import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False)  # load MNIST as NumPy arrays (as_frame needs scikit-learn >= 0.22)
X, y = mnist['data'], mnist['target']
X_train = np.array(X[:60000], dtype = float)
y_train = np.array(y[:60000], dtype = float)
X_test = np.array(X[60000:], dtype = float)
y_test = np.array(y[60000:], dtype = float)  # build the training and test sets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(60000, 784)
(60000,)
(10000, 784)
(10000,)

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l1', solver='saga', tol=0.1)
# penalty selects l1 or l2 regularization, tol is the stopping tolerance, and solver is the optimization algorithm
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Test score with L1 penalty: %.4f" % score)

Test score with L1 penalty: 0.9245

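I wondered why this works, since logistic regression is a binary classifier and we are handing it a multiclass dataset. It turns out that sklearn's LogisticRegression handles the multiclass case internally (with a one-vs-rest or multinomial scheme, depending on the solver and the version).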

# PyTorch version
from torch.utils.data import DataLoader
from torchvision import datasets
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np
# load the datasets; p_parent_path is not defined in this snippet and must be set beforehand
train_dataset = datasets.MNIST(root = p_parent_path+'/datasets/', train = True,transform = transforms.ToTensor(), download = False)
test_dataset = datasets.MNIST(root = p_parent_path+'/datasets/', train = False, transform = transforms.ToTensor(), download = False)
# data loaders; a single batch holds the whole training set
batch_size = len(train_dataset)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)
X_train,y_train = next(iter(train_loader))
X_test,y_test = next(iter(test_loader))
# show the first 100 images
images, labels= X_train[:100], y_train[:100]
# arrange the images in a grid that is 10 images wide
img = torchvision.utils.make_grid(images, nrow=10)
# plt.imshow() expects (height, width, channels) while img is (channels, height, width),
# so use .transpose() to move the colour channel to the third axis
img = img.numpy().transpose(1,2,0)
print(images.shape)
print(labels.reshape(10,10))
print(img.shape)
plt.imshow(img)
plt.show()

torch.Size([100, 1, 28, 28])
tensor([[4, 7, 0, 9, 3, 6, 1, 7, 7, 8],
        [8, 3, 2, 7, 2, 4, 4, 3, 8, 0],
        [5, 6, 4, 9, 0, 6, 1, 2, 3, 3],
        [6, 0, 4, 3, 7, 0, 7, 6, 5, 1],
        [4, 3, 4, 8, 5, 3, 1, 5, 2, 4],
        [5, 4, 8, 5, 5, 1, 1, 6, 0, 4],
        [5, 4, 5, 1, 4, 4, 8, 2, 7, 3],
        [8, 1, 8, 6, 3, 7, 7, 9, 5, 9],
        [8, 4, 7, 0, 3, 6, 6, 2, 5, 3],
        [2, 0, 6, 5, 1, 7, 2, 7, 1, 2]])
(302, 302, 3)

X_train,y_train = X_train.cpu().numpy(),y_train.cpu().numpy()  # convert tensors to arrays
X_test,y_test = X_test.cpu().numpy(),y_test.cpu().numpy()  # convert tensors to arrays
X_train = X_train.reshape(X_train.shape[0],784)  # flatten each image into a vector of length 28*28 = 784
X_test = X_test.reshape(X_test.shape[0],784)
model = LogisticRegression(solver='lbfgs', max_iter = 400)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred)) # 打印报告

              precision    recall  f1-score   support

           0       0.50      0.75      0.60         4
           1       0.71      1.00      0.83        10
           2       0.79      0.85      0.81        13
           3       0.79      0.69      0.73        16
           4       0.83      0.91      0.87        11
           5       0.60      0.23      0.33        13
           6       1.00      1.00      1.00         5
           7       0.88      1.00      0.93         7
           8       0.67      0.83      0.74        12
           9       0.71      0.56      0.63         9

    accuracy                           0.75       100
   macro avg       0.75      0.78      0.75       100
weighted avg       0.74      0.75      0.73       100

Decision Tree

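First, a word about the dataset. The iris dataset contains 150 records in 3 classes, 50 per class. Each record has 4 features: sepal length, sepal width, petal length and petal width. From these 4 features we predict which of the three species (iris-setosa, iris-versicolour, iris-virginica) a flower belongs to.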

import seaborn as sns
from pandas import plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

# load the dataset
data = load_iris()
# convert it to a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Species'] = data.target  # add the species column
# inspect the dataset; df.info() prints its report itself and returns None
print("Dataset info:")
df.info()
# look at the first 5 rows
print(f"First 5 rows:\n{df.head()}")
df.describe()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   Species            150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB
First 5 rows:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   Species  
0        0  
1        0  
2        0  
3        0  
4        0

The above is a first look at the data. Now for the algorithm itself:

# replace the numeric class codes with the species names
target = np.unique(data.target)  # the distinct class codes
print(target)
target_names = data.target_names  # names in class-code order (np.unique would sort them alphabetically)
print(target_names)
targets = dict(zip(target, target_names))
print(targets)
df['Species'] = df['Species'].replace(targets)

# extract the features and the labels
X =df.drop(columns = 'Species')  # dropping the label column leaves the features
y = df['Species']
feature_names = X.columns
labels = y.unique()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)
# split into training and test sets: test fraction 0.4, random seed 42
model = DecisionTreeClassifier(max_depth = 3, random_state = 42)  # maximum tree depth 3
model.fit(X_train, y_train)
# print the tree as text
text_representation = tree.export_text(model)
print(text_representation)

# draw the tree as a figure
plt.figure(figsize=(30, 10), facecolor='w')
a = tree.plot_tree(model,
                  feature_names = feature_names,
                  class_names = labels,
                  rounded = True,
                  filled = True,
                  fontsize = 14)
plt.show()

[0 1 2]
['setosa' 'versicolor' 'virginica']
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}
|--- feature_2 <= 2.45
|   |--- class: setosa
|--- feature_2 >  2.45
|   |--- feature_3 <= 1.75
|   |   |--- feature_2 <= 5.35
|   |   |   |--- class: versicolor
|   |   |--- feature_2 >  5.35
|   |   |   |--- class: virginica
|   |--- feature_3 >  1.75
|   |   |--- feature_2 <= 4.85
|   |   |   |--- class: virginica
|   |   |--- feature_2 >  4.85
|   |   |   |--- class: virginica

MLP

For an introduction to multilayer perceptrons, see my blog post on neural networks.

Here we focus on the implementation.

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml("mnist_784", as_frame=False)  # load the dataset as NumPy arrays
X, y = mnist['data'], mnist['target']
X_train = np.array(X[:60000], dtype=float)
y_train = np.array(y[:60000], dtype=float)
X_test = np.array(X[60000:], dtype=float)
y_test = np.array(y[60000:], dtype=float)

clf = MLPClassifier(alpha = 1e-5, hidden_layer_sizes = (15,15), random_state=1)
# alpha is the L2 regularization strength; hidden_layer_sizes gives the units per hidden layer, here 2 layers of 15 units

clf.fit(X_train, y_train)

score = clf.score(X_test, y_test)
score

0.9124

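Some other parameters worth noting:

activation: the activation function, one of {'identity', 'logistic', 'tanh', 'relu'}; the default is 'relu'
solver: the weight optimizer, one of {'lbfgs', 'sgd', 'adam'}; the default is 'adam'
learning_rate_init: the initial learning rate, used only when solver is 'sgd' or 'adam'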

SVM

Again we focus on how to use the SVM implementation.

The kernel is chosen through the kernel parameter of svm.SVC.

Linear SVM

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
data = np.array([
    [0.1, 0.7],
    [0.3, 0.6],
    [0.4, 0.1],
    [0.5, 0.4],
    [0.8, 0.04],
    [0.42, 0.6],
    [0.9, 0.4],
    [0.6, 0.5],
    [0.7, 0.2],
    [0.7, 0.67],
    [0.27, 0.8],
    [0.5, 0.72]
])  # build the dataset
label = [1] * 6 + [0] * 6  # the first 6 points have label 1, the last 6 have label 0
x_min, x_max = data[:,0].min() - 0.2, data[:,0].max() + 0.2
y_min, y_max = data[:,1].min() - 0.2, data[:,1].max() + 0.2  # the y range comes from column 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.002),
                    np.arange(y_min, y_max, 0.002))  # build the grid of points
model_linear = svm.SVC(kernel = 'linear', C = 0.001)  # linear SVM model
model_linear.fit(data, label)
Z = model_linear.predict(np.c_[xx.ravel(), yy.ravel()])
# flatten xx and yy to 1-D, stack them as the two columns of a matrix, and predict on that matrix
Z = Z.reshape(xx.shape)
plt.contourf(xx,yy, Z, cmap=plt.cm.ocean, alpha=0.6)
# draws filled contours: xx holds the x coordinates, yy the y coordinates, Z the value at each grid point, cmap the colour map
plt.scatter(data[:6,0],data[:6,1], marker='o', color='r', s=100, lw=3)
plt.scatter(data[6:,0],data[6:,1], marker='x', color='k', s=100, lw=3)
plt.title("Linear SVM")
plt.show()
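The grid trick used for the decision-region plot is easier to follow on a tiny grid. The sketch below uses only NumPy; the 2×3 grid and the stand-in rule that plays the role of model.predict are made up for illustration.

```python
import numpy as np

xx, yy = np.meshgrid(np.arange(0.0, 1.0, 0.5),    # x values: 0.0, 0.5
                     np.arange(0.0, 1.5, 0.5))    # y values: 0.0, 0.5, 1.0
print(xx.shape, yy.shape)                         # both have one row per y value

# One row per grid point, columns (x, y): the layout a classifier's predict expects.
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points)

# Stand-in for model.predict: label 1 where x > y (hypothetical rule).
Z = (grid_points[:, 0] > grid_points[:, 1]).astype(int)
Z = Z.reshape(xx.shape)                           # back to the grid layout for contourf
print(Z)
```

Reshaping the predictions to xx.shape is what lets contourf colour each grid cell by its predicted class.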

Polynomial kernel

plt.figure(figsize=(16,15))

# compare several polynomial degrees
for i, degree in enumerate([1,3,5,7,9,12]):
    model_poly = svm.SVC(C=0.001, kernel='poly', degree = degree)  # polynomial kernel
    model_poly.fit(data, label)
    Z = model_poly.predict(np.c_[xx.ravel(), yy.ravel()])  # predict on the grid
    Z = Z.reshape(xx.shape)
    
    plt.subplot(3,2, i+1)
    plt.subplots_adjust(wspace=0.2, hspace=0.2)  # adjust the spacing between subplots
    plt.contourf(xx,yy, Z, cmap=plt.cm.ocean, alpha=0.6)
    
    plt.scatter(data[:6, 0], data[:6, 1], marker='o', color='r', s=100, lw=3)
    plt.scatter(data[6:, 0], data[6:, 1], marker='x', color='k', s=100, lw=3)
    plt.title('Poly SVM with degree=' + str(degree))
    
plt.show()

Gaussian kernel

plt.figure(figsize=(16,15))

for i, gamma in enumerate([1,5,15,35,45,55]):
    model_rbf = svm.SVC(kernel='rbf', gamma=gamma, C = 0.001)  # Gaussian (RBF) kernel
    model_rbf.fit(data, label)
    Z = model_rbf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.subplot(3, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    plt.contourf(xx, yy, Z, cmap=plt.cm.ocean, alpha=0.6)
 
    # plot the training points
    plt.scatter(data[:6, 0], data[:6, 1], marker='o', color='r', s=100, lw=3)
    plt.scatter(data[6:, 0], data[6:, 1], marker='x', color='k', s=100, lw=3)
    plt.title(r'RBF SVM with $\gamma=$' + str(gamma))
plt.show()
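To see what the gamma values in the loop above do, it helps to compute the RBF kernel by hand. The sketch below assumes the usual definition K(x, x') = exp(-gamma * ||x - x'||^2); the helper `rbf_kernel` and the three sample points are made up for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Pairwise K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))   # clamp tiny negative round-off

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_small = rbf_kernel(points, points, gamma=1)
K_large = rbf_kernel(points, points, gamma=55)
print(K_small.round(4))
print(K_large.round(4))
```

With a large gamma the similarity between distinct points drops to almost zero, so each training point only influences its immediate surroundings.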

Comparing kernels on MNIST

Loading the data

from sklearn import svm
import numpy as np
from time import time
from sklearn.metrics import accuracy_score
from struct import unpack
from sklearn.model_selection import GridSearchCV

def readimage(path):
    with open(path, 'rb') as f:
        magic, num, rows, cols = unpack('>4I', f.read(16))
        img = np.fromfile(f, dtype=np.uint8).reshape(num, 784)
    return img

def readlabel(path):
    with open(path, 'rb') as f:
        magic, num = unpack('>2I', f.read(8))
        lab = np.fromfile(f, dtype=np.uint8)
    return lab

train_data  = readimage("../../datasets/MNIST/raw/train-images-idx3-ubyte")  # read the data
train_label = readlabel("../../datasets/MNIST/raw/train-labels-idx1-ubyte")
test_data   = readimage("../../datasets/MNIST/raw/t10k-images-idx3-ubyte")
test_label  = readlabel("../../datasets/MNIST/raw/t10k-labels-idx1-ubyte")
print(train_data.shape)
print(train_label.shape)

(60000, 784)
(60000,)

Gaussian kernel

# The full dataset is large, so to save time we train on the first 4000 images and test on the first 400
train_data=train_data[:4000]
train_label=train_label[:4000]
test_data=test_data[:400]
test_label=test_label[:400]

svc=svm.SVC()
parameters = {"kernel":['rbf'], "C":[1]}
print("Train....")
clf = GridSearchCV(svc, parameters, n_jobs=-1)  # grid search over the parameters (here a single combination)
start = time()
clf.fit(train_data, train_label)
end = time()
t = end - start
print("Training time: %dmin%.3fsec" % (t//60, t-60 * (t//60)))
prediction = clf.predict(test_data)
print("accuracy:",accuracy_score(test_label, prediction))
accurate = [0] * 10
sumall = [0] * 10

i = 0
j = 0
while i < len(test_label):  # count the correct predictions by hand
    sumall[test_label[i]] += 1
    if prediction[i] == test_label[i]:
        j += 1
    i += 1
print("Test accuracy:", j/400)

Train....
Training time: 0min7.548sec
accuracy: 0.955
Test accuracy: 0.955
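The lists `sumall` and `accurate` suggest a per-digit breakdown, although the loop only fills `sumall`. A possible way to compute such a breakdown with NumPy is sketched below; the helper `per_class_accuracy` and the toy labels are made up for illustration and are not part of the original code.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    totals = np.bincount(y_true, minlength=n_classes)                     # samples per class
    correct = np.bincount(y_true[y_true == y_pred], minlength=n_classes)  # hits per class
    with np.errstate(divide="ignore", invalid="ignore"):
        acc = np.where(totals > 0, correct / totals, np.nan)              # NaN for absent classes
    return totals, correct, acc

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])
totals, correct, acc = per_class_accuracy(y_true, y_pred, n_classes=3)
overall = correct.sum() / totals.sum()
print(totals, correct, acc, overall)
```

On the real data one would call it with test_label, prediction and n_classes=10.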

Polynomial kernel

parameters = {'kernel':['poly'], 'C':[1]}  # use the polynomial kernel
print("Train...")
clf=GridSearchCV(svc,parameters,n_jobs=-1)
start = time()
clf.fit(train_data, train_label)
end = time()
t = end - start
print('Train: %dmin%.3fsec' % (t//60, t - 60 * (t//60)))
prediction = clf.predict(test_data)
print("accuracy: ", accuracy_score(test_label, prediction))
accurate=[0]*10
sumall=[0]*10
i=0
j=0
while i<len(test_label):  # compute the test accuracy by hand
    sumall[test_label[i]]+=1
    if prediction[i]==test_label[i]:
        j+=1
    i+=1
print("Test accuracy:",j/400)

Train...
Train: 0min6.438sec
accuracy:  0.93
Test accuracy: 0.93

Linear kernel

parameters = {'kernel':['linear'], 'C':[1]}  # use the linear kernel
print("Train...")
clf=GridSearchCV(svc,parameters,n_jobs=-1)
start = time()
clf.fit(train_data, train_label)
end = time()
t = end - start
print('Train: %dmin%.3fsec' % (t//60, t - 60 * (t//60)))
prediction = clf.predict(test_data)
print("accuracy: ", accuracy_score(test_label, prediction))
accurate=[0]*10
sumall=[0]*10
i=0
j=0
while i<len(test_label):  # compute the test accuracy by hand
    sumall[test_label[i]]+=1
    if prediction[i]==test_label[i]:
        j+=1
    i+=1
print("Test accuracy:",j/400)

Train...
Train: 0min3.712sec
accuracy:  0.9175
Test accuracy: 0.9175

NBayes

Using this algorithm is comparatively simple:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import make_blobs
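# make_blobs: generates a dataset of Gaussian blobs for clustering
# n_samples: number of points; n_features: dimensionality of the data; centers: the centres to generate, default 3
# cluster_std: standard deviation of the clusters, a float or a sequence of floats, default 1.0; random_state: random seed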
X, y = make_blobs(n_samples = 100, n_features = 2, centers = 2, random_state = 2, cluster_std = 1.5)
plt.scatter(X[:,0], X[:,1], c = y, s = 50, cmap = 'RdBu')
plt.show()

The code above first draws a scatter plot of the training set.

Next we build our own test set to see how the model behaves:

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()  # Gaussian naive Bayes
model.fit(X, y)
rng = np.random.RandomState(0)
X_test = [-6, -14] + [14,18] * rng.rand(5000,2)  # generate the test set
y_pred = model.predict(X_test)
# draw the training and test sets together: large dark points are training data, small pale points are test data
plt.scatter(X[:,0],X[:,1], c = y, s = 50, cmap = 'RdBu')
lim = plt.axis()  # get the current axis limits
plt.scatter(X_test[:,0], X_test[:,1], c = y_pred, s = 20, cmap='RdBu', alpha = 0.1)
plt.axis(lim)
plt.show()

The boundary between the two classes is quite clear.

We can also look at the predicted probabilities:

yprob = model.predict_proba(X_test)
yprob[:20].round(2)

array([[0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.94, 0.06],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [0.01, 0.99],
       [0.  , 1.  ],
       [0.  , 1.  ]])
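Where these probabilities come from can be sketched by hand: Gaussian naive Bayes fits a mean and variance per class and feature, multiplies the per-feature Gaussian densities with the class prior, and normalizes. The NumPy sketch below follows that recipe on made-up blobs; the helper names are hypothetical, and it leaves out the small variance smoothing that sklearn's GaussianNB applies, so the numbers are not guaranteed to match exactly.

```python
import numpy as np

def gaussian_nb_fit(X, y):
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, variances, priors

def gaussian_nb_predict_proba(X, means, variances, priors):
    # log p(x | c): sum of independent per-feature Gaussian log-densities
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + (X[:, None, :] - means[None, :, :]) ** 2
                      / variances[None, :, :]).sum(axis=2)
    log_post = log_lik + np.log(priors)[None, :]
    log_post -= log_post.max(axis=1, keepdims=True)     # for numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)       # normalize over the classes

rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-3.0, 1.0, size=(50, 2)),
                   rng.normal(3.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

classes, means, variances, priors = gaussian_nb_fit(X_toy, y_toy)
proba = gaussian_nb_predict_proba(np.array([[-3.0, -3.0], [3.0, 3.0]]),
                                  means, variances, priors)
print(proba.round(2))
```

Points far from the boundary get probabilities very close to 0 or 1, which matches the pattern in the output above.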

Bagging and random forests

For background on these methods, see my blog post.

Here we again focus on the implementation:

First, load the dataset:

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

wine = load_wine()  # use the wine dataset
print(f"All features: {wine.feature_names}")
X = pd.DataFrame(wine.data, columns = wine.feature_names)
y = pd.Series(wine.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

All features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

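Next, let's see what a single decision tree achieves:

# build and train the decision tree classifier
base_model = DecisionTreeClassifier(max_depth = 1, criterion='gini', random_state = 1)
# the Gini index is the split criterion
base_model.fit(X_train, y_train)
y_pred = base_model.predict(X_test)
print(f"Decision tree accuracy: {accuracy_score(y_test, y_pred):.3f}")

Decision tree accuracy: 0.694

As we can see, a single shallow decision tree (a depth-1 stump) is not accurate enough.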
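The criterion='gini' setting means the tree scores each candidate split by Gini impurity. A minimal sketch of that computation follows, using the standard definition 1 - sum(p_k^2); the helper names and the toy feature are made up for illustration, and sklearn's own split search is more elaborate.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(feature, labels, threshold):
    """Weighted impurity of the two children produced by feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

feature = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 0, 1, 1])
print(gini(labels))                        # impurity before splitting
print(split_gini(feature, labels, 2.5))    # a perfect split
print(split_gini(feature, labels, 1.5))    # a worse split
```

A depth-1 tree can only pick the single best such threshold, which is why it struggles with three classes.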

So let's try a bagging ensemble with this decision tree as the base classifier and see how much it helps:

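from sklearn.ensemble import BaggingClassifier
# The base classifier is the decision tree built above. It has already been fitted once, but that
# does not matter: BaggingClassifier fits fresh clones of it.
# Note: in scikit-learn >= 1.2 the base_estimator parameter is named estimator.
model = BaggingClassifier(base_estimator = base_model,
                         n_estimators = 50,  # number of base learners in the ensemble
                         random_state = 1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predict
print(f"BaggingClassifier accuracy: {accuracy_score(y_test,y_pred):.3f}")

BaggingClassifier accuracy: 0.917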

The improvement is substantial! Next we look at how an important parameter, the number of base classifiers, affects the result:

# test the effect of the number of base classifiers
x = list(range(2,102,2))  # the even numbers from 2 to 100
y = []
for i in x:
    model = BaggingClassifier(base_estimator = base_model,
                             n_estimators = i,
                             random_state = 1)
    model.fit(X_train, y_train)
    model_test_sc = accuracy_score(y_test, model.predict(X_test))
    y.append(model_test_sc)  # store the score
    
plt.style.use('ggplot')  # set the plot style
plt.title("Effect of n_estimators", pad = 20)
plt.xlabel("Number of base estimators")
plt.ylabel("Test accuracy of BaggingClassifier")
plt.plot(x,y)
plt.show()

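The curve shows that, on this test set, more base classifiers is not always better. Keep in mind that the test set has only 36 samples, so each misclassified sample moves the accuracy by almost 3 percentage points; beyond a certain point additional estimators are largely redundant, and part of the fluctuation is probably noise.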
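The two ingredients of bagging, bootstrap sampling and vote aggregation, can be sketched with NumPy alone. The sizes, the `votes` table and the helper `majority_vote` below are made up for illustration; sklearn's BaggingClassifier may aggregate predicted probabilities rather than hard votes, so this shows the idea and not the library's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_estimators = 10, 5

# Each base learner trains on a bootstrap sample: n_samples indices drawn with replacement.
bootstrap_indices = rng.integers(0, n_samples, size=(n_estimators, n_samples))
print(bootstrap_indices[0])   # some indices repeat, others are left out

def majority_vote(predictions):
    """predictions: (n_estimators, n_points) array of integer class labels."""
    n_classes = predictions.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)   # most frequent label per point

# Hypothetical predictions of 5 base learners for 3 test points.
votes = np.array([[0, 1, 2],
                  [0, 1, 1],
                  [1, 1, 2],
                  [0, 2, 2],
                  [0, 1, 2]])
final = majority_vote(votes)
print(final)
```

Individual learners disagree on every point here, yet the vote settles on the label most of them chose.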

Next, the improved algorithm, random forest:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier( n_estimators = 50,
                              random_state = 1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
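print(f"RandomForestClassifier accuracy: {accuracy_score(y_test,y_pred):.3f}")

RandomForestClassifier accuracy: 0.972

Because random forests also add randomness over the features, the base classifiers are more diverse, and the accuracy improves further.

Again we look at the effect of the number of base classifiers:

x = list(range(2, 102, 2))  # n_estimators takes the even numbers from 2 to 100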
y = []

for i in x:
    model = RandomForestClassifier(
                              n_estimators=i,
                              random_state=1)

    model.fit(X_train, y_train)
    model_test_sc = accuracy_score(y_test, model.predict(X_test))
    y.append(model_test_sc)

plt.style.use('ggplot')
plt.title("Effect of n_estimators", pad=20)
plt.xlabel("Number of base estimators")
plt.ylabel("Test accuracy of RandomForestClassifier")
plt.plot(x, y)
plt.show()

For random forests, I believe the added feature randomness is what makes the result less sensitive to the number of estimators.

AdaBoost

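For an introduction to AdaBoost, see my blog post as well.

Again we focus on the implementation:

As before, we first load the data and check how a single decision tree performs: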

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()  # use the wine dataset
print(f"All features: {wine.feature_names}")
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

base_model = DecisionTreeClassifier(max_depth = 1, criterion='gini', random_state = 1)
base_model.fit(X_train, y_train)
y_pred = base_model.predict(X_test)
print(f"Decision tree accuracy: {accuracy_score(y_test,y_pred):.3f}")

Decision tree accuracy: 0.694

This matches the earlier result.

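Next we fit the data with AdaBoost:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
# Note: in scikit-learn >= 1.2 the base_estimator parameter is named estimator.
model = AdaBoostClassifier(base_estimator=base_model, n_estimators=50, learning_rate = 0.8)
# n_estimators and learning_rate are the parameters to tune; learning_rate shrinks the weight of each weak learner
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = metrics.accuracy_score(y_test, y_pred)  # accuracy
print(f"Accuracy: {acc:.2}")

Accuracy: 0.97

A big improvement! But these parameter values were picked by hand, so let's use grid search on the training set to look for the best ones: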

hyperparameter_space = {"n_estimators":list(range(2,102,2)),
                       "learning_rate":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}
# algorithm='SAMME.R' belongs to older scikit-learn releases; recent releases deprecate it
gs = GridSearchCV(AdaBoostClassifier(algorithm='SAMME.R', random_state = 1),
                 param_grid = hyperparameter_space,
                 scoring = 'accuracy', n_jobs = -1, cv = 5)
gs.fit(X_train, y_train)
print("Best parameters:",gs.best_params_)
print("Best score:",gs.best_score_)

Best parameters: {'learning_rate': 0.8, 'n_estimators': 42}
Best score: 0.9857142857142858

Now its score on the test set:

model = AdaBoostClassifier(base_estimator=base_model, n_estimators=42, learning_rate = 0.8)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
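acc = metrics.accuracy_score(y_test, y_pred)  # accuracy
print(f"Accuracy: {acc:.2}")

Accuracy: 0.94

Surprisingly, this is worse than our hand-picked parameters. Note that grid search scores every parameter combination by K-fold cross-validation on the training data (cv = 5 here). I first assumed it simply looked for the parameters that fit the whole training set best, which is not the case. The best cross-validation score also does not guarantee the best score on the held-out test set.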
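How K-fold cross-validation partitions the training data can be sketched with NumPy. The generator `k_fold_indices` below is a made-up helper that produces plain, unshuffled folds; for a classifier with an integer cv, GridSearchCV uses a stratified variant, so treat this as an illustration of the principle only.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    folds = np.array_split(np.arange(n_samples), n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

A parameter combination's score is the average of the validation scores over these folds, which is why it can differ from the accuracy on a separate test set.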

The k-means algorithm

For a detailed introduction to clustering algorithms, see my blog post.

Here we again focus on the implementation.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
from sklearn import datasets
%matplotlib inline

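# before clustering
X = np.random.rand(1000,2)
plt.scatter(X[:,0], X[:,1], marker='o')

# initialize the centroids by picking k distinct points from the data
# (shown for illustration only; KMeans below does its own initialization)
def InitCentroids(X, k):
    index = np.random.choice(len(X), k, replace=False)
    return X[index]

# after clustering
kmeans = KMeans(n_clusters = 4)  # split into 4 clusters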
kmeans.fit(X)
label_pred = kmeans.labels_
plt.scatter(X[:,0], X[:,1], c= label_pred)
plt.show()

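The tutorial gives little code, so I looked up the other parameters of KMeans:

n_clusters: int, the number of clusters to form; default 8
max_iter: int, the maximum number of iterations; default 300
n_init: the number of runs with different initial centroids; the best run is kept
init: three possible values
- k-means++: the default; picks the initial centroids with a special scheme that speeds up convergence
- random: picks the initial centroids at random from the training data
- an ndarray: lets you specify the centroids yourself
n_jobs: the number of parallel jobs (older scikit-learn versions only; the parameter was removed in version 1.0)
random_state: the random seed used for centroid initialization

Its main attributes are:

cluster_centers_: the final cluster centres
labels_: the cluster assigned to each sample
inertia_: the sum of squared distances from each sample to its nearest centre; smaller means tighter clusters, and it is used to choose the number of clusters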

print("Cluster centres:",kmeans.cluster_centers_)
print("Inertia:",kmeans.inertia_)

Cluster centres: [[0.79862048 0.71591318]
 [0.22582347 0.26005466]
 [0.73845863 0.23886344]
 [0.29972473 0.76998545]]
Inertia: 41.37635968102986
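What the centres and the inertia value mean becomes clearer with a hand-written version of the basic k-means loop (Lloyd's algorithm). The sketch below is an assumption-laden illustration: the function `lloyd_kmeans`, the fixed 20 iterations, the two toy blobs and the choice of initial centroids are all made up, and sklearn's KMeans uses a more refined initialization and stopping rule.

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, n_iter=20):
    centroids = init_centroids.astype(float).copy()
    for _ in range(n_iter):
        # assignment step: squared distance of every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists.min(axis=1).sum()   # sum of squared distances to the nearest centroid
    return centroids, labels, inertia

rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(10.0, 0.5, size=(50, 2))])
centroids, labels, inertia = lloyd_kmeans(X_toy, init_centroids=X_toy[[0, -1]])
print(centroids.round(2), inertia.round(2))
```

Running it for several values of k and plotting the inertia gives the curve used by the elbow method.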

KNN

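For an introduction to this algorithm, see my blog post, which explains it along with a detailed Python implementation.

Here we focus on the sklearn implementation.

The neighbors module of sklearn implements the KNN algorithms:

KNeighborsClassifier handles classification
KNeighborsRegressor handles regression (put simply, the prediction for a point is the average of the target values of its nearest neighbours)

The two constructors are nearly identical. We mainly describe KNeighborsClassifier, whose signature is: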

KNeighborsClassifier(
    n_neighbors=5, 
    weights='uniform', 
    algorithm='auto', 
    leaf_size=30, 
    p=2, 
    metric='minkowski', 
    metric_params=None, 
    n_jobs=None, 
    **kwargs)

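The parameters to focus on:

n_neighbors: the K in KNN; the default value 5 is commonly used.
weights: how the neighbours are weighted; three options:
- weights=uniform: all neighbours have equal weight
- weights=distance: the weight is the inverse of the distance
- a custom function that maps distances to weights; rarely needed
algorithm: the algorithm used to find the neighbours; four options:
- algorithm=auto: choose a suitable algorithm based on the data
- algorithm=kd_tree: use a KD tree
  - KD trees suit low-dimensional data, generally no more than 20 dimensions; beyond that they become less efficient
- algorithm=ball_tree: use a ball tree
  - ball trees are better suited to higher-dimensional data
- algorithm=brute: brute-force search
  - unlike a KD tree, it scans all points linearly instead of building a tree for fast lookup
  - the drawback is that it is slow on large training sets
leaf_size: the leaf size used when building the KD tree or ball tree, i.e. the number of points below which a node is no longer split; default 30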
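The effect of n_neighbors and weights can be seen in a brute-force sketch written with NumPy. The function `knn_predict` and the toy data are made up for illustration; it uses Euclidean distance (p=2) and clamps zero distances to a tiny value, which is a simplification of how the library treats exact matches.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=5, weights="uniform"):
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))   # Euclidean distance to every point
    nearest = np.argsort(dists)[:k]                         # indices of the k closest points
    if weights == "uniform":
        w = np.ones(len(nearest))
    else:  # "distance": each vote is weighted by the inverse of the distance
        w = 1.0 / np.maximum(dists[nearest], 1e-12)
    scores = np.bincount(train_y[nearest], weights=w)       # total weight per class
    return int(scores.argmax())

train_x = np.array([[0.0], [1.0], [2.0], [2.1], [2.2]])
train_y = np.array([0, 0, 1, 1, 1])
query = np.array([0.9])

pred_uniform = knn_predict(train_x, train_y, query, k=5, weights="uniform")
pred_distance = knn_predict(train_x, train_y, query, k=5, weights="distance")
print(pred_uniform, pred_distance)
```

With uniform weights the three farther points outvote the two close ones; with distance weights the very close neighbour dominates, so the two settings disagree on this query.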

Now for the code:

from sklearn.datasets import load_digits
import pandas as pd
digits = load_digits()
data = digits.data     # features
target = digits.target # targets
data_pd = pd.DataFrame(data)
data_pd

The data has 64 dimensions, so each sample is a point in a 64-dimensional space.

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(
    data, target, test_size=0.25, random_state=33)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train_x, train_y)
predict_y = knn.predict(test_x)
from sklearn.metrics import accuracy_score
score = accuracy_score(test_y, predict_y)
score

0.9844444444444445

PCA

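For a detailed explanation of PCA, see my blog post, which covers the algorithm and a NumPy implementation.

Here we again focus on the implementation:

# first we generate random data and visualize it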
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
from sklearn.datasets import make_blobs
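# X holds the sample features and y the cluster labels: 10000 samples, 3 features each, 4 clusters
X, y = make_blobs(n_samples=10000, n_features=3, centers=[[3,3, 3], [0,0,0], [1,1,1], [2,2,2]], 
                  cluster_std=[0.2, 0.1, 0.2, 0.2], random_state =9)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev = 20, azim = 10)
# elev is the elevation (up/down) viewing angle, azim the azimuth (left/right) viewing angle
ax.scatter(X[:,0],X[:,1], X[:,2], marker='o')  # use the 3-D axes so the third array is the z coordinate

Because the key quantity in PCA is the variance that is retained, we first project the data without reducing the dimension and look at how the variance is distributed over the three projected dimensions: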

from sklearn.decomposition import PCA
pca = PCA(n_components = 3)
pca.fit(X)
print(pca.explained_variance_ratio_)  # fraction of the variance explained by each principal component
print(pca.explained_variance_)  # variance of each principal component

[0.98318212 0.00850037 0.00831751]
[3.78521638 0.03272613 0.03202212]

The first component accounts for about 98% of the variance.

Next we reduce to 2 dimensions:

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

[0.98318212 0.00850037]
[3.78521638 0.03272613]

When keeping two dimensions, PCA retains the two components with the largest variance and drops the third.

Let's plot the data after the reduction:

X_new = pca.transform(X)
plt.scatter(X_new[:, 0], X_new[:, 1],marker='o')
plt.show()

So far we specified the number of dimensions to keep. We can also specify the fraction of variance to retain:

pca = PCA(n_components=0.95)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)

[0.98318212]
[3.78521638]
1

Since the first component alone explains 98% of the variance, retaining 95% only requires the first component.
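The selection rule can be reproduced from the explained variance ratios printed earlier. This is a sketch of the idea (keep the smallest number of components whose cumulative ratio reaches the target); the exact tie-breaking at equality inside sklearn may differ.

```python
import numpy as np

ratios = np.array([0.98318212, 0.00850037, 0.00831751])   # values printed above
cumulative = np.cumsum(ratios)

# Smallest k such that the first k components explain at least the target fraction.
n_keep_95 = int(np.searchsorted(cumulative, 0.95) + 1)
n_keep_99 = int(np.searchsorted(cumulative, 0.99) + 1)
print(cumulative, n_keep_95, n_keep_99)
```

Asking for 99% instead of 95% would require a second component, because the first one stops just above 98%.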

We can also let the MLE algorithm choose the number of dimensions:

pca = PCA(n_components='mle')
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)

[0.98318212]
[3.78521638]
1

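The MLE algorithm also keeps only the first component.

For reference, the main parameters of the class:

n_components: the number of components to keep, or the fraction of variance to retain; it can also be set to 'mle' to choose automatically
copy: bool, whether to copy the original training data
whiten: bool, whether to whiten, so that every component has the same variance

Its attributes include:

n_components_: the number of components kept
explained_variance_ratio_: the fraction of variance explained by each kept component
explained_variance_: the variance of each kept component

Commonly used methods:

fit_transform(X): fit and return the reduced data
inverse_transform(newData): map the reduced data newData back to the original space; the result may differ somewhat from the original
transform(X): transform X into the reduced space

Let's try to restore the reduced data: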

new_Data = pca.transform(X)
X_regan = pca.inverse_transform(new_Data)
X-X_regan

array([[ 0.14364008, -0.1352249 , -0.00781994],
       [ 0.05135552, -0.01316744, -0.03802959],
       [-0.03610653,  0.07254754, -0.03665018],
       ...,
       [ 0.18537785, -0.0907325 , -0.09400653],
       [-0.2618617 ,  0.20035984,  0.06048799],
       [-0.02015389,  0.12283753, -0.10292754]])

The differences are noticeable (up to about 0.26 in the rows shown), because only one component was kept.
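The whole round trip (fit, transform, inverse_transform) can be sketched with NumPy's SVD. The toy data and the helpers `project` and `reconstruct` are made up for illustration; the sketch follows the textbook definition of PCA and is not sklearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(9)
latent = rng.normal(size=(200, 1))
# Toy data: points spread along the direction (1, 1, 1) plus a little noise.
X_toy = latent @ np.array([[1.0, 1.0, 1.0]]) + 0.1 * rng.normal(size=(200, 3))

mean = X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(X_toy - mean, full_matrices=False)   # rows of Vt: principal axes
explained_variance = S ** 2 / (len(X_toy) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

def project(X, k):
    """Coordinates of X on the first k principal axes."""
    return (X - mean) @ Vt[:k].T

def reconstruct(Z, k):
    """Map k-dimensional coordinates back to the original space."""
    return Z @ Vt[:k] + mean

X_rec_1 = reconstruct(project(X_toy, 1), 1)   # lossy: the two small components are dropped
X_rec_3 = reconstruct(project(X_toy, 3), 3)   # lossless: all components kept
print(explained_variance_ratio.round(4))
print(np.abs(X_toy - X_rec_1).max(), np.abs(X_toy - X_rec_3).max())
```

The reconstruction error is exactly the part of the data that lies along the discarded components, which is why it vanishes only when all components are kept.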

HMM

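For the theory behind HMMs I strongly recommend this video; it explains things really well!

We again focus on the code.

hmmlearn implements three HMM classes, which fall into two groups depending on whether the observations are continuous or discrete. GaussianHMM and GMMHMM model continuous observations, while MultinomialHMM models discrete observations. Let's try it out: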

#pip install hmmlearn

import numpy as np
import matplotlib.pyplot as plt

from hmmlearn import hmm

# Prepare parameters for a 4-components HMM
# Initial population probability
startprob = np.array([0.6, 0.3, 0.1, 0.0])
# The transition matrix, note that there are no transitions possible
# between component 1 and 3
transmat = np.array([[0.7, 0.2, 0.0, 0.1],
                     [0.3, 0.5, 0.2, 0.0],
                     [0.0, 0.3, 0.5, 0.2],
                     [0.2, 0.0, 0.2, 0.6]])
# The means of each component
means = np.array([[0.0, 0.0],
                  [0.0, 11.0],
                  [9.0, 10.0],
                  [11.0, -1.0]])
# The covariance of each component
covars = .5 * np.tile(np.identity(2), (4, 1, 1))

# Build an HMM instance and set parameters
gen_model = hmm.GaussianHMM(n_components=4, covariance_type="full")

# Instead of fitting it from the data, we directly set the estimated
# parameters, the means and covariance of the components
gen_model.startprob_ = startprob
gen_model.transmat_ = transmat
gen_model.means_ = means
gen_model.covars_ = covars

# Generate samples
X, Z = gen_model.sample(500)

# Plot the sampled data
fig, ax = plt.subplots()
ax.plot(X[:, 0], X[:, 1], ".-", label="observations", ms=6,
        mfc="orange", alpha=0.7)

# Indicate the component numbers
for i, m in enumerate(means):
    ax.text(m[0], m[1], 'Component %i' % (i + 1),
            size=17, horizontalalignment='center',
            bbox=dict(alpha=.7, facecolor='w'))
ax.legend(loc='best')
fig.show()

scores = list()
models = list()
for n_components in (3, 4, 5):
    # define our hidden Markov model
    model = hmm.GaussianHMM(n_components=n_components,
                            covariance_type='full', n_iter=10)
    model.fit(X[:X.shape[0] // 2])  # 50/50 train/validate
    models.append(model)
    scores.append(model.score(X[X.shape[0] // 2:]))
    print(f'Converged: {model.monitor_.converged}'
          f'\tScore: {scores[-1]}')

# get the best model
model = models[np.argmax(scores)]
n_states = model.n_components
print(f'The best model had a score of {max(scores)} and {n_states} '
      'states')

# use the Viterbi algorithm to predict the most likely sequence of states
# given the model
states = model.predict(X)

Converged: True	Score: -1065.5259488089373
Converged: True	Score: -904.2908933008515
Converged: True	Score: -905.5449538166446
The best model had a score of -904.2908933008515 and 4 states

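# Compare the recovered states with the generated states, and the recovered
# transition matrix with the generating one, to assess the model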
# plot model states over time
fig, ax = plt.subplots()
ax.plot(Z, states)
ax.set_title('States compared to generated')
ax.set_xlabel('Generated State')
ax.set_ylabel('Recovered State')
fig.show()

# plot the transition matrix
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 5))
ax1.imshow(gen_model.transmat_, aspect='auto', cmap='spring')
ax1.set_title('Generated Transition Matrix')
ax2.imshow(model.transmat_, aspect='auto', cmap='spring')
ax2.set_title('Recovered Transition Matrix')
for ax in (ax1, ax2):
    ax.set_xlabel('State To')
    ax.set_ylabel('State From')

fig.tight_layout()
fig.show()
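
One caveat when reading the plots above: the state indices learned by a fitted HMM are arbitrary, so recovered state k need not correspond to generated state k. Below is a minimal sketch of aligning the two labelings with a contingency table. The toy arrays stand in for `Z` and `states`, and the helper `align_states` is a hypothetical name for illustration, not part of hmmlearn. It also assumes both labelings use the same number of states.

```python
import numpy as np

def align_states(true_states, pred_states, n_states):
    """Map each predicted state to the true state it overlaps with most."""
    # counts[i, j] = number of time steps with true state i and predicted state j
    counts = np.zeros((n_states, n_states), dtype=int)
    np.add.at(counts, (true_states, pred_states), 1)
    mapping = counts.argmax(axis=0)  # best-matching true state for each predicted state
    return counts, mapping

# Toy stand-ins for Z (generated) and states (recovered); the labels are permuted.
Z_demo = np.array([0, 0, 1, 1, 2, 2, 3, 3, 0, 1])
states_demo = np.array([2, 2, 0, 0, 3, 3, 1, 1, 2, 0])

counts, mapping = align_states(Z_demo, states_demo, n_states=4)
relabelled = mapping[states_demo]          # recovered states in generated-state numbering
agreement = (relabelled == Z_demo).mean()  # fraction of time steps that agree
print(mapping, agreement)
```

The per-column argmax is a greedy match and is not guaranteed to be one-to-one; for noisy results an assignment solver would be a safer choice.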

visualizetion_report

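This chapter covers visualization for machine learning with Scikit-Plot. It consists of four parts:

estimators: plots for estimators, such as learning curves and feature importances
metrics: plots for evaluation metrics, such as the confusion matrix, ROC/AUC curves, and precision-recall curves
cluster: plots for clustering
decomposition: plots for dimensionality reduction with PCA

First, load the required modules: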

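# Load the required modules
import scikitplot as skplt

import sklearn
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_digits, load_boston, load_breast_cancer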
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

If scikit-plot is not installed, install it with:

pip install scikit-plot

Loading the datasets

Handwritten digits dataset

digits = load_digits()
X_digits, Y_digits = digits.data, digits.target

print("Digits Dataset Size : ", X_digits.shape, Y_digits.shape)

X_digits_train, X_digits_test, Y_digits_train, Y_digits_test = train_test_split(X_digits, Y_digits,
                                                                                train_size=0.8,
                                                                                stratify=Y_digits,
                                                                                random_state=1)

print("Digits Train/Test Sizes : ",X_digits_train.shape, X_digits_test.shape, Y_digits_train.shape, Y_digits_test.shape)

Digits Dataset Size :  (1797, 64) (1797,)
Digits Train/Test Sizes :  (1437, 64) (360, 64) (1437,) (360,)

Breast cancer dataset

cancer = load_breast_cancer()
X_cancer, Y_cancer = cancer.data, cancer.target

print("Feature Names : ", cancer.feature_names)
print("Cancer Dataset Size : ", X_cancer.shape, Y_cancer.shape)
X_cancer_train, X_cancer_test, Y_cancer_train, Y_cancer_test = train_test_split(X_cancer, Y_cancer,
                                                                                train_size=0.8,
                                                                                stratify=Y_cancer,
                                                                                random_state=1)

print("Cancer Train/Test Sizes : ",X_cancer_train.shape, X_cancer_test.shape, Y_cancer_train.shape, Y_cancer_test.shape)

Feature Names :  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Cancer Dataset Size :  (569, 30) (569,)
Cancer Train/Test Sizes :  (455, 30) (114, 30) (455,) (114,)

Boston housing dataset

boston = load_boston()
X_boston, Y_boston = boston.data, boston.target

print("Boston Dataset Size : ", X_boston.shape, Y_boston.shape)

print("Boston Dataset Features : ", boston.feature_names)
X_boston_train, X_boston_test, Y_boston_train, Y_boston_test = train_test_split(X_boston, Y_boston,
                                                                                train_size=0.8,
                                                                                random_state=1)

print("Boston Train/Test Sizes : ",X_boston_train.shape, X_boston_test.shape, Y_boston_train.shape, Y_boston_test.shape)

Boston Dataset Size :  (506, 13) (506,)
Boston Dataset Features :  ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Boston Train/Test Sizes :  (404, 13) (102, 13) (404,) (102,)

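Performance visualization

Cross-validation plots

We plot cross-validated learning curves, first for logistic regression on the digits data and then for linear regression on the Boston data: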

skplt.estimators.plot_learning_curve(LogisticRegression(), X_digits, Y_digits,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Digits Classification Learning Curve")
plt.show()

skplt.estimators.plot_learning_curve(LinearRegression(), X_boston, Y_boston,
                                     cv=7, shuffle=True, scoring="r2", n_jobs=-1,
                                     figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Boston Regression Learning Curve ");
plt.show()

Note that the second plot is scored with R² rather than accuracy, so its values are not directly comparable with those in the first plot.
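
To see the numbers behind such a plot, the scores can be computed directly with scikit-learn's `learning_curve`. This is a minimal sketch, not a description of scikit-plot's internals. The settings `cv=3`, the three training sizes, and `max_iter=2000` are assumptions chosen to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# max_iter is raised only to avoid convergence warnings on unscaled pixel values.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    cv=3, scoring="accuracy", train_sizes=[0.3, 0.6, 1.0],
    shuffle=True, random_state=1,
)

# Each row is one training-set size; each column is one CV fold.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```

Swapping in `LinearRegression()` with `scoring="r2"` gives the regression counterpart, which is why the two plots sit on different scales.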

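Feature importance plots

A good feature has the following properties:

It is discriminative and not redundant with other features
It is independent of the other features
It is simple and easy to understand

A feature importance plot lets us see at a glance which features the model considers most important.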

rf_reg = RandomForestRegressor()  # random forest
rf_reg.fit(X_boston_train, Y_boston_train)
print(rf_reg.score(X_boston_test, Y_boston_test))
gb_classif = GradientBoostingClassifier()  # gradient boosting
gb_classif.fit(X_cancer_train, Y_cancer_train)
print(gb_classif.score(X_cancer_test, Y_cancer_test))
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)  # two panels side by side; ax1 draws on the first
skplt.estimators.plot_feature_importances(rf_reg, feature_names=boston.feature_names,
                                         title = "Random Forest Regressor Feature Importance",
                                         x_tick_rotation = 90, order="ascending", ax=ax1)
# x_tick_rotation rotates the x-axis tick labels by 90 degrees
ax2 = fig.add_subplot(122)
skplt.estimators.plot_feature_importances(gb_classif, feature_names=cancer.feature_names,
                                         title="Gradient Boosting Classifier Feature Importance",
                                         x_tick_rotation=90,
                                         ax=ax2);

plt.tight_layout()  # adjusts subplot parameters automatically so the panels fill the figure
plt.show()
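
Tree ensembles in scikit-learn expose their importances through the `feature_importances_` attribute, so the ranking can also be inspected without a plot. The sketch below uses a random forest classifier on the breast cancer data. The choice of model and `n_estimators=100` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(data.data, data.target)

importances = forest.feature_importances_  # one non-negative value per feature, summing to 1
order = np.argsort(importances)[::-1]      # feature indices, most important first

for idx in order[:5]:
    print(f"{data.feature_names[idx]:25s} {importances[idx]:.3f}")
```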

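Machine learning metrics

Confusion matrix

For binary classification, the confusion matrix is a 2×2 table that counts the true positives, false positives, false negatives, and true negatives.

It is commonly used to compute precision, recall, and the F1 score. For multi-class problems the matrix simply grows to one row and one column per class.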

log_reg = LogisticRegression()
log_reg.fit(X_digits_train, Y_digits_train)
log_reg.score(X_digits_test, Y_digits_test)
Y_test_pred = log_reg.predict(X_digits_test)

fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(1,2,1)
skplt.metrics.plot_confusion_matrix(Y_digits_test, Y_test_pred, title="Confusion Matrix", cmap="Oranges", ax=ax1)
ax2 = fig.add_subplot(1,2,2)
skplt.metrics.plot_confusion_matrix(Y_digits_test, Y_test_pred,
                                    normalize=True,  # show fractions instead of raw counts
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    ax=ax2);
plt.show()

The second plot sets normalize=True, so the counts are converted to fractions between 0 and 1.
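
The relationship between the matrix and the derived metrics can be sketched with a small hand-made example. The label arrays below are invented for illustration, and the normalization shown divides each row by its sum, which is one common convention.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)               # rows: true class, columns: predicted class
cm_normalized = cm / cm.sum(axis=1, keepdims=True)  # each row now sums to 1

tp = np.diag(cm)                 # correct predictions per class
precision = tp / cm.sum(axis=0)  # TP / (TP + FP), column-wise
recall = tp / cm.sum(axis=1)     # TP / (TP + FN), row-wise
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(np.round(precision, 3), np.round(recall, 3), np.round(f1, 3))
```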

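ROC curve and AUC

To understand the ROC curve, start from the confusion matrix, whose four cells are:

TP: predicted 1, actually 1 (true positives)
FP: predicted 1, actually 0 (false positives)
TN: predicted 0, actually 0 (true negatives)
FN: predicted 0, actually 1 (false negatives)

These are counts, not rates. The total number of actual positives is TP+FN, so the fraction of actual positives that are predicted correctly (the true positive rate) is:
$TPR=\frac{TP}{TP+FN}$
Likewise, the total number of actual negatives is FP+TN, so the fraction of actual negatives that are wrongly predicted as positive (the false positive rate) is:
$FPR=\frac{FP}{TN+FP}$
The other concept is the threshold t: a sample is classified as positive when the model's predicted probability exceeds t, and as negative otherwise.

The ROC curve is the two-dimensional curve traced by the (FPR, TPR) pairs as t takes different values on a dataset.

AUC is the area under the ROC curve.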

Y_test_probs = log_reg.predict_proba(X_digits_test)
skplt.metrics.plot_roc_curve(Y_digits_test, Y_test_probs, title="Digits ROC Curve", figsize=(12,6))
plt.show()
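
The definitions above can be checked on a tiny binary example. The labels and scores are made up, and the helper `rates_at` is a hypothetical name for illustration. It treats scores at or above the threshold as positive, which is the convention scikit-learn's `roc_curve` uses.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

def rates_at(threshold):
    """Return (TPR, FPR) when scores >= threshold are labelled positive."""
    y_hat = y_score >= threshold
    tp = np.sum(y_hat & (y_true == 1))
    fp = np.sum(y_hat & (y_true == 0))
    fn = np.sum(~y_hat & (y_true == 1))
    tn = np.sum(~y_hat & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

print(rates_at(0.5))  # one point on the ROC curve

# Sweep all thresholds, then integrate the curve to get the AUC.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)
print(area, roc_auc_score(y_true, y_score))
```

For a multi-class problem such as the digits data, a one-vs-rest curve per class is the usual extension of this binary definition.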

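PR curve

The PR curve is drawn in the same way as the ROC curve, but the two quantities plotted are precision and recall:
$precision=\frac{TP}{TP+FP}$
$recall=\frac{TP}{TP+FN}$
Both are again evaluated at different thresholds and plotted against each other.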

skplt.metrics.plot_precision_recall_curve(Y_digits_test, Y_test_probs, title="Digits Precision-Recall Curve", figsize=(12,6))
plt.show()
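
As a rough illustration of how the curve is built, scikit-learn's `precision_recall_curve` returns the precision and recall at every distinct score threshold. The binary labels and scores below are invented for the example.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds: a final
# point (precision=1, recall=0) is appended by convention.
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")

# Average precision summarizes the curve as a single number.
ap = average_precision_score(y_true, y_score)
print(round(ap, 3))
```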

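Silhouette analysis

Put simply, silhouette analysis judges how good a clustering is: each sample's silhouette coefficient compares how close it is to its own cluster with how close it is to the nearest other cluster.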

kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_digits_train, Y_digits_train)
cluster_labels = kmeans.predict(X_digits_test)
skplt.metrics.plot_silhouette(X_digits_test, cluster_labels,figsize=(8,6))
plt.show()
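
The values behind a silhouette plot can be obtained from `silhouette_samples` and `silhouette_score` in scikit-learn. The sketch below uses synthetic blobs rather than the digits data, an assumption made so that the clusters are easy to separate.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data with three blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# s(i) = (b - a) / max(a, b), where a is the mean distance to the sample's own
# cluster and b is the mean distance to the nearest other cluster.
per_sample = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)  # mean of the per-sample values
print(round(overall, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned samples.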

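Reliability (calibration) curve

This curve checks how reliable a model's predicted probabilities are: for a well-calibrated model, the predicted probability matches the observed fraction of positives.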

lr_probas = LogisticRegression().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
rf_probas = RandomForestClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
gb_probas = GradientBoostingClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)
et_scores = ExtraTreesClassifier().fit(X_cancer_train, Y_cancer_train).predict_proba(X_cancer_test)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'Extra Trees Classifier']
skplt.metrics.plot_calibration_curve(Y_cancer_test,
                                     probas_list,
                                     clf_names, n_bins=15,
                                     figsize=(12,6)
                                     )
plt.show()
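
The points on such a curve can be computed with scikit-learn's `calibration_curve`. This is a minimal sketch for a single classifier. The split mirrors the one used earlier, while `n_bins=5` and `max_iter=10000` are assumptions chosen for a small test set and clean convergence.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, stratify=y, random_state=1)

# Probability of the positive class for each test sample.
probs = LogisticRegression(max_iter=10000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# For each non-empty probability bin: observed fraction of positives
# and mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=5)
for fp, mp in zip(frac_pos, mean_pred):
    print(f"mean predicted={mp:.2f}  observed positive rate={fp:.2f}")
```

Points near the diagonal (observed rate equal to mean predicted probability) indicate good calibration.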

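KS test

The KS (Kolmogorov-Smirnov) test checks whether two samples come from the same distribution. Here the two samples are the predicted scores of the positive and negative classes; the larger the KS statistic, the better the model separates them.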

rf = RandomForestClassifier()
rf.fit(X_cancer_train, Y_cancer_train)
Y_cancer_probas = rf.predict_proba(X_cancer_test)

skplt.metrics.plot_ks_statistic(Y_cancer_test, Y_cancer_probas, figsize=(10,6))
plt.show()
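
The statistic itself is the largest gap between the two empirical cumulative distributions. The sketch below computes it with `scipy.stats.ks_2samp` and again by hand. The beta-distributed scores are synthetic stand-ins for a classifier's output, not results from the model above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical classifier scores: positives tend to score higher than negatives.
scores_neg = rng.beta(2, 5, size=200)
scores_pos = rng.beta(5, 2, size=200)

result = ks_2samp(scores_pos, scores_neg)

# Same statistic by hand: largest gap between the two empirical CDFs,
# evaluated at every observed score.
grid = np.sort(np.concatenate([scores_pos, scores_neg]))
cdf_pos = np.searchsorted(np.sort(scores_pos), grid, side="right") / scores_pos.size
cdf_neg = np.searchsorted(np.sort(scores_neg), grid, side="right") / scores_neg.size
ks_manual = np.max(np.abs(cdf_pos - cdf_neg))

print(result.statistic, ks_manual)
```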

Cumulative gains curve

skplt.metrics.plot_cumulative_gain(Y_cancer_test, Y_cancer_probas, figsize=(10,6))
plt.show()

Lift curve

skplt.metrics.plot_lift_curve(Y_cancer_test, Y_cancer_probas, figsize=(10,6))
plt.show()
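
Neither curve is defined in the text, so the sketch below spells out the usual definitions with NumPy: samples are ranked by predicted score, the cumulative gain is the share of all positives captured in the top fraction, and the lift is that gain divided by the fraction targeted. The labels and scores are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

order = np.argsort(-y_score)     # highest score first
hits = np.cumsum(y_true[order])  # positives found so far
fraction_targeted = np.arange(1, y_true.size + 1) / y_true.size
gain = hits / y_true.sum()       # share of all positives captured
lift = gain / fraction_targeted  # improvement over random targeting

for f, g, l in zip(fraction_targeted, gain, lift):
    print(f"top {f:4.0%}: gain={g:.2f}  lift={l:.2f}")
```

A model no better than random would give a gain equal to the fraction targeted and a lift of 1 throughout.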

Clustering methods

Elbow method

The elbow method is used to choose the number of clusters:

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
                               X_digits,
                               cluster_ranges=range(2, 20),
                               figsize=(8,6))
plt.show()
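
The idea can be sketched by fitting KMeans for a range of k and recording the within-cluster sum of squares, available as `inertia_`. The synthetic four-blob data and the range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=1)

ks = list(range(1, 9))
# inertia_ is the sum of squared distances from each sample to its nearest centroid.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_ for k in ks]

for k, inertia in zip(ks, inertias):
    print(f"k={k}  inertia={inertia:.1f}")
```

The "elbow" is the value of k after which the inertia stops dropping sharply; adding more clusters beyond that point buys little.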

Dimensionality reduction

PCA

This plot shows the proportion of variance explained by the first n principal components:

pca = PCA(random_state=1)
pca.fit(X_digits)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6))
plt.show()
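
The same information is available numerically from `explained_variance_ratio_`. The sketch below finds how many components are needed to reach a target share of the variance; the 95% target is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(random_state=1).fit(X)

# Running total of the variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_95, round(cumulative[n_95 - 1], 4))
```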

2-D Projection

Project the data onto the first two principal components:

skplt.decomposition.plot_pca_2d_projection(pca, X_digits, Y_digits,
                                           figsize=(10,10),
                                           cmap="tab10")
plt.show()
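
The coordinates drawn in such a scatter plot are simply the first two columns of the PCA-transformed data. Here is a minimal sketch that computes them and prints where each digit class sits on average; summarizing by class mean is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
X_2d = PCA(n_components=2, random_state=1).fit_transform(X)  # shape (n_samples, 2)

# Mean position of each digit class in the projected plane.
for digit in np.unique(y):
    center = X_2d[y == digit].mean(axis=0)
    print(digit, np.round(center, 2))
```

Classes whose centers lie far apart are the ones that separate cleanly in the 2D plot.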
