狐狐的鹿鹿

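[Hands-On Machine Learning, From Getting Started to Giving Up] SVM: Support Vector Machines

Support Vector Machines

In a battle, only the people standing at the very front are actually fighting

A support vector machine (SVM) is another tool for classification problems. Unlike logistic regression, in an SVM only the samples closest to the opposing class determine the decision boundary; samples far from the boundary play no part in fixing it. This mechanism makes an SVM robust to outliers that lie far from the boundary on the correct side, since their influence on the result is essentially nil.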
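The idea that only the samples nearest the boundary matter can be checked on a toy problem. The sketch below is not part of the original walkthrough: it assumes eight made-up 2-D points, fits a linear SVC, then drops the two points farthest from the opposing class and refits. If the idea holds, the fitted weights should barely move.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data (not taken from the Adult dataset)
X_toy = np.array([[0, 0], [1, 0], [0, 1], [-2, -2],   # class 0
                  [3, 3], [4, 3], [3, 4], [6, 6]],    # class 1
                 dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

full = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# Drop the two points that sit farthest from the other class (indices 3 and 7)
keep = [i for i in range(len(X_toy)) if i not in (3, 7)]
reduced = SVC(kernel="linear", C=1.0).fit(X_toy[keep], y_toy[keep])

print("support vector indices:", full.support_)
print("weights with all points:   ", full.coef_, full.intercept_)
print("weights without far points:", reduced.coef_, reduced.intercept_)
```

The two far points never show up among the support vectors, which is why removing them leaves the boundary where it was.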

This demonstration uses the US Adult census income dataset.
The fields are described below:

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

There are two labels: >50K and <=50K.

import pandas as pd 
import numpy as np
from sklearn import svm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

1. A first look at the data

# read_csv replaces the deprecated read_table; the file is comma-separated with no header row
dataframe = pd.read_csv('datasets/Adult/adult.data', sep=',', header=None)
dataframe.columns=["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", 
                   "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", 
                   "hours-per-week", "native-country","salary"]
dataframe.head(3)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K

2. Handling missing values

dataframe.workclass.unique()

The unique values show that missing entries in the raw data are marked with "?" (stored with a leading space, as " ?").

dataframe.shape

(32561, 15)

(dataframe==" ?").sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
salary               0
dtype: int64
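As an alternative to comparing against " ?" by hand, the markers can be turned into real NaN values at load time. This is a minimal sketch rather than what the post does: it assumes a few made-up rows in the same comma-plus-space format, and relies on the `skipinitialspace` and `na_values` parameters of `pd.read_csv`.

```python
import io
import pandas as pd

# Made-up rows in the same "value, value" layout as adult.data (illustration only)
sample = io.StringIO(
    "39, State-gov, Adm-clerical, United-States, <=50K\n"
    "54, ?, ?, United-States, >50K\n"
    "32, Private, ?, ?, <=50K\n"
)
cols = ["age", "workclass", "occupation", "native-country", "salary"]

# skipinitialspace strips the blank after each comma; na_values maps "?" to NaN
df = pd.read_csv(sample, header=None, names=cols,
                 skipinitialspace=True, na_values="?")

missing_counts = df.isna().sum()
print(missing_counts)
```

Note that loading the file this way also removes the leading space from every string, so later comparisons such as `" <=50K"` would have to be written without it.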

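Relative to the sample size, there are not many missing values. Because every affected column is categorical rather than numeric, we fill the gaps according to each column's distribution.

Work class (workclass)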

dataframe.workclass.value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

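Private clearly accounts for the large majority here, so we fill the missing values with Private.

# Assign the result back; calling replace(..., inplace=True) on a column selection may not update the DataFrame in newer pandas
dataframe["workclass"] = dataframe["workclass"].replace(" ?", " Private")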

dataframe.workclass.value_counts()

 Private             24532
 Self-emp-not-inc     2541
 Local-gov            2093
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
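The same "fill with the most frequent category" step can be written without hard-coding the category name. A small sketch under stated assumptions: the five-row frame is made up, and `most_common` is just an illustrative variable name.

```python
import pandas as pd

# Made-up miniature of the workclass column, including the " ?" marker
toy = pd.DataFrame({"workclass": [" Private", " ?", " Private", " State-gov", " ?"]})

# Find the most frequent value among the entries that are actually observed
observed = toy.loc[toy["workclass"] != " ?", "workclass"]
most_common = observed.value_counts().idxmax()

# Replace the marker with that value
toy["workclass"] = toy["workclass"].replace(" ?", most_common)
print(toy["workclass"].value_counts())
```

Excluding the marker before taking the mode matters when " ?" itself is one of the more frequent values in a column.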

Occupation (occupation)

dataframe.occupation.value_counts()

 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 ?                    1843
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name: occupation, dtype: int64

The distribution here is fairly even, so we put the "?" entries into a category of their own, Other.

dataframe["occupation"] = dataframe["occupation"].replace(" ?", " Other")

dataframe.occupation.value_counts()

 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 Other                1843
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name: occupation, dtype: int64

Country of origin (native-country)

dataframe["native-country"].value_counts()

 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 Greece                           29
 France                           29
 Ecuador                          28
 Ireland                          24
 Hong                             20
 Trinadad&Tobago                  19
 Cambodia                         19
 Thailand                         18
 Laos                             18
 Yugoslavia                       16
 Outlying-US(Guam-USVI-etc)       14
 Hungary                          13
 Honduras                         13
 Scotland                         12
 Holand-Netherlands                1
Name: native-country, dtype: int64

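United-States accounts for the vast majority of samples here, so for now we fill the missing values with United-States.

dataframe["native-country"] = dataframe["native-country"].replace(" ?", " United-States")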

dataframe["native-country"].value_counts()

 United-States                 29753
 Mexico                          643
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 France                           29
 Greece                           29
 Ecuador                          28
 Ireland                          24
 Hong                             20
 Cambodia                         19
 Trinadad&Tobago                  19
 Laos                             18
 Thailand                         18
 Yugoslavia                       16
 Outlying-US(Guam-USVI-etc)       14
 Honduras                         13
 Hungary                          13
 Scotland                         12
 Holand-Netherlands                1
Name: native-country, dtype: int64

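3. Exploring the data distributions

Comparing the number of positive and negative samples

plt.figure(figsize=(8,5))
# sns.set(style="whitegrid")
# Name the column explicitly; recent seaborn versions expect x= as a keyword argument
sns.countplot(x="salary", data=dataframe, palette="rocket")
plt.title("distribution of salary")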

Text(0.5, 1.0, 'distribution of salary')

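The labels are not well balanced (roughly three <=50K samples for every >50K sample, judging by the test-set support reported in the evaluation section), but an SVM can usually cope with this, for instance through the class_weight option.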
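To put a number on the imbalance, the sketch below computes the class shares and the weights that `class_weight="balanced"` would assign, using the formula quoted in the `SVC` help text, `n_samples / (n_classes * np.bincount(y))`. As an assumption for illustration, it plugs in the test-set class counts from the classification reports at the end of this post; in a real fit the weights are derived from the training labels instead.

```python
import numpy as np

# Class counts taken from the test-set "support" column in the reports (assumption:
# the training split has a similar ratio)
counts = np.array([6181, 1960])           # [<=50K, >50K]
n_samples, n_classes = counts.sum(), len(counts)

share = counts / n_samples
# Formula from the SVC help text for class_weight="balanced"
balanced_weights = n_samples / (n_classes * counts)

print("class shares:    ", share.round(3))
print("balanced weights:", balanced_weights.round(3))
```

The minority class ends up with roughly three times the weight of the majority class, which is what lets the penalty term C push back against always predicting <=50K.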

Education level versus salary

plt.figure(figsize=(20,10))
#sns.set(style="whitegrid")
sns.set()
# Figure-level title (plt.title at this point would not be attached to either panel)
plt.suptitle("countgrid of education level over salary")
# Left panel: education counts for the <=50K group
plt.subplot(121)
sns.countplot(x="education", data=dataframe[dataframe["salary"]==" <=50K"], palette="rocket")
plt.xlabel("<=50K")
plt.xticks(rotation=45)
# Right panel: education counts for the >50K group
plt.subplot(122)
sns.countplot(x="education", data=dataframe[dataframe["salary"]==" >50K"], palette="rocket")
plt.xlabel(">50K")
plt.xticks(rotation=45)
# plt.legend(loc="upper right")


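The plots show that among people earning <=50K, HS-grad and Some-college are the most common education levels, while among those earning >50K, Bachelors is the most common.

Age versus salary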

plt.figure(figsize=(20,15))
sns.kdeplot(dataframe[dataframe["salary"]==" <=50K"].age, shade=True)
sns.kdeplot(dataframe[dataframe["salary"]==" >50K"].age, shade=True)
plt.legend(["<=50K",">50K"])
plt.title("age distribution over salary")

Text(0.5, 1.0, 'age distribution over salary')

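The higher-income group is older on average, centered a little above 40, while the lower-income group is concentrated in the twenties. This matches our expectations.

Gender versus salary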

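plt.figure(figsize=(20,10))
#sns.set(style="whitegrid")
sns.set()
# Figure-level title (plt.title at this point would not be attached to either panel)
plt.suptitle("countgrid of gender over salary")
# Left panel: gender counts for the <=50K group
plt.subplot(121)
sns.countplot(x="sex", data=dataframe[dataframe["salary"]==" <=50K"], palette="rocket")
plt.xlabel("<=50K")
# Right panel: gender counts for the >50K group
plt.subplot(122)
sns.countplot(x="sex", data=dataframe[dataframe["salary"]==" >50K"], palette="rocket")
plt.xlabel(">50K")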

Text(0.5, 0, '>50K')

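In the lower-income group the male-to-female ratio is roughly 5:3, while in the higher-income group it is roughly 5.5:1. It seems that in high-end occupations, gender discrimination in the US workplace is also quite pronounced.

Working hours of the high-income and low-income groups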

plt.figure(figsize=(20,15))
sns.set()
sns.distplot(dataframe[dataframe["salary"]==" <=50K"]["hours-per-week"],vertical=True)
sns.distplot(dataframe[dataframe["salary"]==" >50K"]["hours-per-week"],vertical=True)
plt.legend(["<=50K",">50K"])
# The plotted variable is hours-per-week, not age, so the title says so
plt.title("hours-per-week distribution over salary")

Text(0.5, 1.0, 'hours-per-week distribution over salary')

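Generally speaking, the lower-income group works fewer hours per week, while the higher-income group does indeed work longer hours. The capable end up doing more of the work; it seems that is true everywhere.

Age versus working hours (this has nothing to do with the model; it is purely to satisfy my own curiosity)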

plt.figure(figsize=(15,8))
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True)
ax = sns.scatterplot(x="age", y="hours-per-week",
                     hue="salary",
                     data=dataframe)

[Figure: scatter plot of hours-per-week against age, colored by salary (output_42_0.png)]

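After age 60, working hours clearly decrease with age. The higher-income group's working hours cluster at 40 to 60 hours per week, with working ages between 30 and 60, while the lower-income group's working hours are spread across the whole range. One can infer that most high earners have office jobs, and that nearly all of them have to work overtime.

4. Building the model input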

# Convert the categorical variables to dummy variables with pd.get_dummies,
# and encode salary as <=50K: 0, >50K: 1
categorical_cols = ["workclass", "education", "occupation", "marital-status", "relationship",
                    "race", "sex", "native-country"]
X = dataframe.join(pd.get_dummies(dataframe.loc[:, categorical_cols]))
X = X.drop(categorical_cols, axis=1)
# Assign the encoded labels back instead of calling replace(..., inplace=True) on a column selection
X["salary"] = X["salary"].map({" <=50K": 0, " >50K": 1})
X.head(3)

	age	fnlwgt	education-num	capital-gain	hours-per-week	...	native-country_ United-States
0	39	77516	13	2174	40	...	1
1	50	83311	13	0	13	...	1
2	38	215646	9	0	40	...	1

3 rows × 107 columns
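To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up three-row frame (not the real data). Each distinct category becomes its own 0/1 column, named `<column>_<value>`; the leading space in the values carries over into the column names, which is why the real output has names like `native-country_ United-States`.

```python
import pandas as pd

# Made-up rows: one numeric column and one categorical column
toy = pd.DataFrame({"age": [39, 50, 38],
                    "sex": [" Male", " Female", " Male"]})

# One indicator column per category; astype(int) gives 0/1 regardless of pandas version
dummies = pd.get_dummies(toy[["sex"]]).astype(int)
encoded = toy.join(dummies).drop(columns=["sex"])
print(encoded)
```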

y=X.salary
X=X.drop(["salary"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X,y)

X_train.head(3)

	age	fnlwgt	education-num	hours-per-week	workclass_ Private	...	native-country_ United-States
16834	26	131686	13	40	1	...	1
14311	33	136331	9	50	1	...	1
32421	31	298995	9	35	1	...	1

3 rows × 106 columns
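The split above uses the defaults of `train_test_split`, so it holds out 25% of the rows and is different on every run. If repeatable numbers are wanted, a seed and a stratified split are one option. The sketch below is an illustration on made-up arrays with a 75/25 label ratio; `random_state=0` is an arbitrary choice, not a value from the original.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 75 of class 0 and 25 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 75 + [1] * 25)

# random_state makes the split repeatable; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

print("train size:", len(y_tr), "test size:", len(y_te))
print("share of class 1 in test:", y_te.mean())
```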

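5. Training the model

The classifiers in sklearn's SVM module come with four built-in kernel functions (readers who are unsure what a kernel function is are kindly asked to look it up), namely: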

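Linear kernel (linear): $\langle X, X' \rangle$
Polynomial kernel (poly): $(\gamma \langle X, X' \rangle + r)^d$
Gaussian kernel (rbf): $\exp(-\gamma \|X - X'\|^2)$
Sigmoid kernel (sigmoid): $\tanh(\gamma \langle X, X' \rangle + r)$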
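For readers who prefer code to formulas, the four kernels can be written out in a few lines of NumPy. This is an illustrative sketch for a single pair of vectors, not sklearn's implementation; in `SVC` the symbols correspond to the `gamma`, `coef0` (r) and `degree` (d) parameters.

```python
import numpy as np

def linear_kernel(x, y):
    # Plain inner product <x, y>
    return np.dot(x, y)

def poly_kernel(x, y, gamma=1.0, r=0.0, d=3):
    # (gamma * <x, y> + r) ** d
    return (gamma * np.dot(x, y) + r) ** d

def rbf_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||^2)
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-gamma * np.dot(diff, diff))

def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    # tanh(gamma * <x, y> + r)
    return np.tanh(gamma * np.dot(x, y) + r)

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
print(linear_kernel(a, b))                        # 1*3 + 2*4 = 11
print(poly_kernel(a, b, gamma=0.5, r=1.0, d=2))   # (0.5*11 + 1)^2 = 42.25
print(rbf_kernel(a, b, gamma=0.5))                # exp(-0.5 * 8)
print(sigmoid_kernel(a, b, gamma=0.5, r=1.0))     # tanh(6.5)
```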

Custom kernel functions can also be used; see the official site for details: https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html#sphx-glr-auto-examples-svm-plot-custom-kernel-py
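As the help text for `SVC` notes, `kernel` may be a callable that computes the kernel matrix from two data matrices. A minimal sketch, assuming a hypothetical function `my_linear_kernel` that simply reimplements the linear kernel, and using the four toy points from the `SVC` docstring example:

```python
import numpy as np
from sklearn.svm import SVC

def my_linear_kernel(A, B):
    """Hypothetical custom kernel: the Gram matrix of plain inner products."""
    return A @ B.T

# Toy points from the SVC docstring example
X_toy = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]], dtype=float)
y_toy = np.array([1, 1, 2, 2])

custom = SVC(kernel=my_linear_kernel).fit(X_toy, y_toy)
builtin = SVC(kernel="linear").fit(X_toy, y_toy)

X_new = np.array([[-0.8, -1.0], [0.9, 1.0]])
print("custom kernel: ", custom.predict(X_new))
print("built-in linear:", builtin.predict(X_new))
```

Because the callable computes the same quantity as the built-in linear kernel, both models should agree; swapping in a different formula is where a custom kernel becomes useful.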

Below we train with each of the four kernels in turn.

1. Linear kernel

help(svm.SVC)

Help on class SVC in module sklearn.svm.classes:

class SVC(sklearn.svm.base.BaseSVC)
 |  SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
 |  
 |  C-Support Vector Classification.
 |  
 |  The implementation is based on libsvm. The fit time scales at least
 |  quadratically with the number of samples and may be impractical
 |  beyond tens of thousands of samples. For large datasets
 |  consider using :class:`sklearn.linear_model.LinearSVC` or
 |  :class:`sklearn.linear_model.SGDClassifier` instead, possibly after a
 |  :class:`sklearn.kernel_approximation.Nystroem` transformer.
 |  
 |  The multiclass support is handled according to a one-vs-one scheme.
 |  
 |  For details on the precise mathematical formulation of the provided
 |  kernel functions and how `gamma`, `coef0` and `degree` affect each
 |  other, see the corresponding section in the narrative documentation:
 |  :ref:`svm_kernels`.
 |  
 |  Read more in the :ref:`User Guide <svm_classification>`.
 |  
 |  Parameters
 |  ----------
 |  C : float, optional (default=1.0)
 |      Penalty parameter C of the error term.
 |  
 |  kernel : string, optional (default='rbf')
 |      Specifies the kernel type to be used in the algorithm.
 |      It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or
 |      a callable.
 |      If none is given, 'rbf' will be used. If a callable is given it is
 |      used to pre-compute the kernel matrix from data matrices; that matrix
 |      should be an array of shape ``(n_samples, n_samples)``.
 |  
 |  degree : int, optional (default=3)
 |      Degree of the polynomial kernel function ('poly').
 |      Ignored by all other kernels.
 |  
 |  gamma : float, optional (default='auto')
 |      Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
 |  
 |      Current default is 'auto' which uses 1 / n_features,
 |      if ``gamma='scale'`` is passed then it uses 1 / (n_features * X.var())
 |      as value of gamma. The current default of gamma, 'auto', will change
 |      to 'scale' in version 0.22. 'auto_deprecated', a deprecated version of
 |      'auto' is used as a default indicating that no explicit value of gamma
 |      was passed.
 |  
 |  coef0 : float, optional (default=0.0)
 |      Independent term in kernel function.
 |      It is only significant in 'poly' and 'sigmoid'.
 |  
 |  shrinking : boolean, optional (default=True)
 |      Whether to use the shrinking heuristic.
 |  
 |  probability : boolean, optional (default=False)
 |      Whether to enable probability estimates. This must be enabled prior
 |      to calling `fit`, and will slow down that method.
 |  
 |  tol : float, optional (default=1e-3)
 |      Tolerance for stopping criterion.
 |  
 |  cache_size : float, optional
 |      Specify the size of the kernel cache (in MB).
 |  
 |  class_weight : {dict, 'balanced'}, optional
 |      Set the parameter C of class i to class_weight[i]*C for
 |      SVC. If not given, all classes are supposed to have
 |      weight one.
 |      The "balanced" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data
 |      as ``n_samples / (n_classes * np.bincount(y))``
 |  
 |  verbose : bool, default: False
 |      Enable verbose output. Note that this setting takes advantage of a
 |      per-process runtime setting in libsvm that, if enabled, may not work
 |      properly in a multithreaded context.
 |  
 |  max_iter : int, optional (default=-1)
 |      Hard limit on iterations within solver, or -1 for no limit.
 |  
 |  decision_function_shape : 'ovo', 'ovr', default='ovr'
 |      Whether to return a one-vs-rest ('ovr') decision function of shape
 |      (n_samples, n_classes) as all other classifiers, or the original
 |      one-vs-one ('ovo') decision function of libsvm which has shape
 |      (n_samples, n_classes * (n_classes - 1) / 2). However, one-vs-one
 |      ('ovo') is always used as multi-class strategy.
 |  
 |      .. versionchanged:: 0.19
 |          decision_function_shape is 'ovr' by default.
 |  
 |      .. versionadded:: 0.17
 |         *decision_function_shape='ovr'* is recommended.
 |  
 |      .. versionchanged:: 0.17
 |         Deprecated *decision_function_shape='ovo' and None*.
 |  
 |  random_state : int, RandomState instance or None, optional (default=None)
 |      The seed of the pseudo random number generator used when shuffling
 |      the data for probability estimates. If int, random_state is the
 |      seed used by the random number generator; If RandomState instance,
 |      random_state is the random number generator; If None, the random
 |      number generator is the RandomState instance used by `np.random`.
 |  
 |  Attributes
 |  ----------
 |  support_ : array-like, shape = [n_SV]
 |      Indices of support vectors.
 |  
 |  support_vectors_ : array-like, shape = [n_SV, n_features]
 |      Support vectors.
 |  
 |  n_support_ : array-like, dtype=int32, shape = [n_class]
 |      Number of support vectors for each class.
 |  
 |  dual_coef_ : array, shape = [n_class-1, n_SV]
 |      Coefficients of the support vector in the decision function.
 |      For multiclass, coefficient for all 1-vs-1 classifiers.
 |      The layout of the coefficients in the multiclass case is somewhat
 |      non-trivial. See the section about multi-class classification in the
 |      SVM section of the User Guide for details.
 |  
 |  coef_ : array, shape = [n_class * (n_class-1) / 2, n_features]
 |      Weights assigned to the features (coefficients in the primal
 |      problem). This is only available in the case of a linear kernel.
 |  
 |      `coef_` is a readonly property derived from `dual_coef_` and
 |      `support_vectors_`.
 |  
 |  intercept_ : array, shape = [n_class * (n_class-1) / 2]
 |      Constants in decision function.
 |  
 |  fit_status_ : int
 |      0 if correctly fitted, 1 otherwise (will raise warning)
 |  
 |  probA_ : array, shape = [n_class * (n_class-1) / 2]
 |  probB_ : array, shape = [n_class * (n_class-1) / 2]
 |      If probability=True, the parameters learned in Platt scaling to
 |      produce probability estimates from decision values. If
 |      probability=False, an empty array. Platt scaling uses the logistic
 |      function
 |      ``1 / (1 + exp(decision_value * probA_ + probB_))``
 |      where ``probA_`` and ``probB_`` are learned from the dataset [2]_. For
 |      more information on the multiclass case and training procedure see
 |      section 8 of [1]_.
 |  
 |  Examples
 |  --------
 |  >>> import numpy as np
 |  >>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
 |  >>> y = np.array([1, 1, 2, 2])
 |  >>> from sklearn.svm import SVC
 |  >>> clf = SVC(gamma='auto')
 |  >>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE
 |  SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
 |      decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
 |      max_iter=-1, probability=False, random_state=None, shrinking=True,
 |      tol=0.001, verbose=False)
 |  >>> print(clf.predict([[-0.8, -1]]))
 |  [1]
 |  
 |  See also
 |  --------
 |  SVR
 |      Support Vector Machine for Regression implemented using libsvm.
 |  
 |  LinearSVC
 |      Scalable Linear Support Vector Machine for classification
 |      implemented using liblinear. Check the See also section of
 |      LinearSVC for more comparison element.
 |  
 |  References
 |  ----------
 |  .. [1] LIBSVM: A Library for Support Vector Machines
 |  
 |  .. [2] Platt, John (1999). "Probabilistic outputs for support vector
 |      machines and comparison to regularized likelihood methods."
 |  
 |  Method resolution order:
 |      SVC
 |      sklearn.svm.base.BaseSVC
 |      sklearn.svm.base.BaseLibSVM
 |      sklearn.base.BaseEstimator
 |      sklearn.base.ClassifierMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.svm.base.BaseSVC:
 |  
 |  decision_function(self, X)
 |      Evaluates the decision function for the samples in X.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_samples, n_features)
 |      
 |      Returns
 |      -------
 |      X : array-like, shape (n_samples, n_classes * (n_classes-1) / 2)
 |          Returns the decision function of the sample for each class
 |          in the model.
 |          If decision_function_shape='ovr', the shape is (n_samples,
 |          n_classes).
 |      
 |      Notes
 |      -----
 |      If decision_function_shape='ovo', the function values are proportional
 |      to the distance of the samples X to the separating hyperplane. If the
 |      exact distances are required, divide the function values by the norm of
 |      the weight vector (``coef_``). See also `this question
 |      `_ for further details.
 |      If decision_function_shape='ovr', the decision function is a monotonic
 |      transformation of ovo decision function.
 |  
 |  predict(self, X)
 |      Perform classification on samples in X.
 |      
 |      For an one-class model, +1 or -1 is returned.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape (n_samples, n_features)
 |          For kernel="precomputed", the expected shape of X is
 |          [n_samples_test, n_samples_train]
 |      
 |      Returns
 |      -------
 |      y_pred : array, shape (n_samples,)
 |          Class labels for samples in X.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.svm.base.BaseSVC:
 |  
 |  predict_log_proba
 |      Compute log probabilities of possible outcomes for samples in X.
 |      
 |      The model need to have probability information computed at training
 |      time: fit with attribute `probability` set to True.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_samples, n_features)
 |          For kernel="precomputed", the expected shape of X is
 |          [n_samples_test, n_samples_train]
 |      
 |      Returns
 |      -------
 |      T : array-like, shape (n_samples, n_classes)
 |          Returns the log-probabilities of the sample for each class in
 |          the model. The columns correspond to the classes in sorted
 |          order, as they appear in the attribute `classes_`.
 |      
 |      Notes
 |      -----
 |      The probability model is created using cross validation, so
 |      the results can be slightly different than those obtained by
 |      predict. Also, it will produce meaningless results on very small
 |      datasets.
 |  
 |  predict_proba
 |      Compute probabilities of possible outcomes for samples in X.
 |      
 |      The model need to have probability information computed at training
 |      time: fit with attribute `probability` set to True.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape (n_samples, n_features)
 |          For kernel="precomputed", the expected shape of X is
 |          [n_samples_test, n_samples_train]
 |      
 |      Returns
 |      -------
 |      T : array-like, shape (n_samples, n_classes)
 |          Returns the probability of the sample for each class in
 |          the model. The columns correspond to the classes in sorted
 |          order, as they appear in the attribute `classes_`.
 |      
 |      Notes
 |      -----
 |      The probability model is created using cross validation, so
 |      the results can be slightly different than those obtained by
 |      predict. Also, it will produce meaningless results on very small
 |      datasets.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.svm.base.BaseLibSVM:
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Fit the SVM model according to the given training data.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape (n_samples, n_features)
 |          Training vectors, where n_samples is the number of samples
 |          and n_features is the number of features.
 |          For kernel="precomputed", the expected shape of X is
 |          (n_samples, n_samples).
 |      
 |      y : array-like, shape (n_samples,)
 |          Target values (class labels in classification, real numbers in
 |          regression)
 |      
 |      sample_weight : array-like, shape (n_samples,)
 |          Per-sample weights. Rescale C per sample. Higher weights
 |          force the classifier to put more emphasis on these points.
 |      
 |      Returns
 |      -------
 |      self : object
 |      
 |      Notes
 |      -----
 |      If X and y are not C-ordered and contiguous arrays of np.float64 and
 |      X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
 |      
 |      If X is a dense array, then the other methods will not support sparse
 |      matrices as input.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.svm.base.BaseLibSVM:
 |  
 |  coef_
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.

clf = svm.SVC(C=1.0, kernel="linear", verbose=True, max_iter=10000, cache_size=500,class_weight="balanced")
clf.fit(X_train, y_train)

[LibSVM]

SVC(C=1.0, cache_size=500, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=10000, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=True)

# Make predictions on the test set
y_pred_linear = clf.predict(X_test)

# Check the score (mean accuracy)
clf.score(X_test,y_test)

0.7221471563689964

2. Polynomial kernel

clf = svm.SVC(C=1.0, kernel="poly", degree=2, verbose=True, max_iter=10000, cache_size=500,class_weight="balanced")
clf.fit(X_train, y_train)

[LibSVM]
SVC(C=1.0, cache_size=500, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma='auto_deprecated',
    kernel='poly', max_iter=10000, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=True)

y_pred_poly = clf.predict(X_test)

clf.score(X_test,y_test)

0.7419235966097532

The score is about two percentage points higher than with the linear kernel (0.742 versus 0.722).

3. Gaussian kernel

clf = svm.SVC(C=1.0, kernel="rbf", verbose=True, gamma='auto', max_iter=10000, cache_size=1000)
clf.fit(X_train, y_train)

[LibSVM]

SVC(C=1.0, cache_size=1000, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=10000, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=True)

y_pred_rbf = clf.predict(X_test)

clf.score(X_test,y_test)

0.7531015845719199

4. Sigmoid kernel

clf = svm.SVC(C=1.0, kernel="sigmoid", verbose=True, max_iter=10000, cache_size=500)
clf.fit(X_train, y_train)

[LibSVM]
SVC(C=1.0, cache_size=500, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='sigmoid', max_iter=10000, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=True)

y_pred_sigmoid = clf.predict(X_test)

clf.score(X_test,y_test)

0.7592433361994841

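With all four runs capped at 10,000 iterations, the linear kernel performs worst (accuracy 0.722), while the other three kernels perform about the same (0.742 to 0.759).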
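One thing the runs above do not do is rescale the features. Kernel SVMs work with distances and inner products, so a column such as fnlwgt, with values in the tens or hundreds of thousands, can swamp the 0/1 dummy columns. The sketch below is a suggestion rather than the author's method: it assumes synthetic data with one small informative feature and one large uninformative one, and wraps `SVC` in a pipeline with `StandardScaler`. Whether this helps on the Adult data is not tested here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption for illustration): the label depends only on `signal`
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)                # small-scale feature that decides the label
noise = rng.normal(scale=1e5, size=n)      # large-scale feature unrelated to the label
X_syn = np.column_stack([signal, noise])
y_syn = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Same classifier, with and without standardization in front of it
raw = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale")).fit(X_tr, y_tr)

raw_score = raw.score(X_te, y_te)
scaled_score = scaled.score(X_te, y_te)
print("raw features:   ", raw_score)
print("scaled features:", scaled_score)
```

On this synthetic setup the raw-feature model is expected to do little better than guessing, because the distance computation is dominated by the large-scale column, while the scaled pipeline can recover the simple rule.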

Next, let us compare how quickly the four kernels converge.

iter_num=[x for x in range(1000,11000,1000)]

linear=[]
poly=[]
rbf=[]
sigmoid=[]
for iters in iter_num:
    clf1 = svm.SVC(C=1.0, kernel="linear", max_iter=iters, cache_size=500)
    clf1.fit(X_train, y_train)
    linear.append(clf1.score(X_test,y_test))
    
    clf2 = svm.SVC(C=1.0, kernel="poly", degree=2, max_iter=iters, cache_size=500)
    clf2.fit(X_train, y_train)
    poly.append(clf2.score(X_test,y_test))
    
    clf3 = svm.SVC(C=0.1, kernel="rbf", gamma='scale', max_iter=iters, cache_size=500)
    clf3.fit(X_train,y_train)
    rbf.append(clf3.score(X_test,y_test))
    
    clf4 = svm.SVC(C=1.0, kernel="sigmoid", max_iter=iters, cache_size=500)
    clf4.fit(X_train,y_train)
    sigmoid.append(clf4.score(X_test,y_test))
    print(iters)

ax=plt.figure(figsize=(15,8))
data = pd.DataFrame({"linear":linear,"poly":poly, "rbf": rbf, "sigmoid": sigmoid},index=iter_num)
sns.lineplot(data=data, palette="tab10", linewidth=2.5)
plt.xlabel("rounds")
plt.ylabel("score")
plt.xticks(iter_num)
plt.title("score variation over rounds of four kinds of kernels")

Text(0.5, 1.0, 'score variation over rounds of four kinds of kernels')

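The linear and polynomial kernels converge fairly quickly, having already settled before 1,000 rounds, although the linear kernel's score starts to oscillate as the number of rounds grows. The Gaussian and sigmoid kernels converge more slowly, settling somewhere between 5,000 and 6,000 rounds. To spare the author's computer, no finer-grained analysis is done here.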
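When `max_iter` is used as a cap, it helps to know whether the solver actually finished or was cut off. The help text lists a `fit_status_` attribute for this ("0 if correctly fitted, 1 otherwise"). A small sketch on made-up random data, assuming a deliberately tiny cap of one iteration to force early termination:

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVC

# Made-up noisy data, only to give the solver some work
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = (X_syn[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

statuses = {}
for cap in (1, -1):   # -1 means no iteration limit
    with warnings.catch_warnings():
        # Silence the warning sklearn raises when the cap is hit
        warnings.simplefilter("ignore", ConvergenceWarning)
        clf_demo = SVC(kernel="rbf", gamma="scale", max_iter=cap).fit(X_syn, y_syn)
    statuses[cap] = clf_demo.fit_status_
    print(f"max_iter={cap}: fit_status_={clf_demo.fit_status_}")
```

A status of 1 means the scores were measured on a model that had not converged, which is worth keeping in mind when comparing kernels at a fixed iteration budget.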

6. Model evaluation

from sklearn import metrics
# Report for the linear kernel
print(metrics.classification_report(y_test, y_pred_linear))

              precision    recall  f1-score   support

           0       0.76      0.93      0.84      6181
           1       0.21      0.06      0.09      1960

    accuracy                           0.72      8141
   macro avg       0.48      0.49      0.46      8141
weighted avg       0.63      0.72      0.66      8141

# Report for the polynomial kernel
print(metrics.classification_report(y_test, y_pred_poly))

              precision    recall  f1-score   support

           0       0.76      0.97      0.85      6181
           1       0.20      0.02      0.04      1960

    accuracy                           0.74      8141
   macro avg       0.48      0.50      0.45      8141
weighted avg       0.62      0.74      0.66      8141

# Report for the RBF (Gaussian) kernel
print(metrics.classification_report(y_test, y_pred_rbf))

              precision    recall  f1-score   support

           0       0.77      0.97      0.86      6181
           1       0.42      0.07      0.12      1960

    accuracy                           0.75      8141
   macro avg       0.59      0.52      0.49      8141
weighted avg       0.68      0.75      0.68      8141

# Report for the sigmoid kernel
print(metrics.classification_report(y_test, y_pred_sigmoid))

              precision    recall  f1-score   support

           0       0.76      1.00      0.86      6181
           1       0.00      0.00      0.00      1960

    accuracy                           0.76      8141
   macro avg       0.38      0.50      0.43      8141
weighted avg       0.58      0.76      0.66      8141
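Accuracy alone is hard to interpret here because the two classes are imbalanced, so it helps to compare against a classifier that always predicts class 0. The sketch below recomputes that baseline from the support counts printed in the reports (6181 and 1960), and also recomputes one F1 score from a report's rounded precision and recall to show how the columns relate. The helper name `f1` is my own.

```python
# Support counts copied from the classification reports above
n_class0 = 6181
n_class1 = 1960
n_total = n_class0 + n_class1

# Accuracy of a classifier that always predicts the majority class (0)
baseline_accuracy = n_class0 / n_total
print(round(baseline_accuracy, 4))


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# RBF kernel, class 1: precision 0.42 and recall 0.07 as printed in its report
print(round(f1(0.42, 0.07), 2))
```

The baseline comes out to about 0.7592, the same as the sigmoid kernel's score. Together with its report (recall 1.00 for class 0, 0.00 for class 1), this suggests the sigmoid model predicts class 0 for essentially every test sample. The other kernels also sit close to this baseline, and none of them recalls more than 7% of class 1.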

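To summarize, SVM is a method that projects features into a higher-dimensional space and separates the classes there with a hyperplane. With the RBF kernel, it can carve out highly curved decision boundaries in the original feature space. It is a commonly used classifier in data analysis.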
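The kernel is what stands in for the projection: it returns the similarity two samples would have in the higher-dimensional space without constructing that space explicitly. As a rough sketch, the four kernels tried in this post have the standard forms below. The function names and the default `gamma=1.0` are illustrative choices of mine, not scikit-learn's API or its defaults.

```python
import math


def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return dot(x, y)


def poly_kernel(x, y, degree=2, gamma=1.0, coef0=0.0):
    # K(x, y) = (gamma * <x, y> + coef0) ** degree
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # K(x, y) = tanh(gamma * <x, y> + coef0)
    return math.tanh(gamma * dot(x, y) + coef0)


a = [1.0, 2.0]
b = [2.0, 0.0]
print(linear_kernel(a, b), poly_kernel(a, b), rbf_kernel(a, b), sigmoid_kernel(a, b))
```

The RBF value depends only on the distance between the two samples and decays toward 0 as they move apart, which is why its decision boundary can bend around local clusters of points.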
