Machine learning is changing the world. Google uses machine learning to suggest search results to users. Netflix uses it to recommend movies for you to watch. Facebook uses machine learning to suggest people you may know.
Machine learning has never been more important. At the same time, understanding machine learning is hard. The field is full of jargon. And the number of different ML algorithms grows each year.
This article will introduce you to the fundamental concepts within the field of machine learning. More specifically, we will discuss the basic concepts behind the 9 most important machine learning algorithms today.
Recommendation systems are used to find similar entries in a data set.
Perhaps the most common real-world example of a recommendation system exists inside of Netflix. More specifically, its video streaming service will recommend movies and TV shows based on content that you’ve already watched.
Another recommendation system is Facebook’s “People You May Know” feature, which suggests possible friends for you based on your existing friends list.
Fully developed and deployed recommendation systems are extremely sophisticated. They are also very resource-intensive.
Fully-fledged recommendation systems require a deep background in linear algebra to build from scratch.
Because of this, there might be concepts in this section that you do not understand if you’ve never studied linear algebra before.
Don’t worry, though – the scikit-learn Python library makes it very easy to build recommendation systems. So you don't need much of a linear algebra background to build real-world recommendation systems.
There are two main types of recommendation systems:
Content-based recommendation systems give you recommendations based on items’ similarity to items that you’ve already used. They behave exactly how you’d expect a recommendation system to behave.
Collaborative filtering recommendation systems produce recommendations based on knowledge of the user’s interactions with items. Said differently, they use the wisdom of the crowds. (Hence the term “collaborative” in its name.)
In the real world, collaborative filtering recommendation systems are much more common than content-based systems. This is primarily because they typically give better results. Some practitioners also find collaborative filtering recommendation systems easier to understand.
Collaborative filtering recommendation systems also have a unique feature that content-based systems are missing. Namely, they have the ability to learn features on their own.
This means that they can even start identifying similarities between items based on attributes that you haven’t even told them to consider.
There are two subcategories within collaborative filtering:
You don’t need to know the differences between these two types of collaborative filtering recommendation systems to be successful in machine learning. It is enough to recognize that multiple types exist.
Here is a brief summary of what we discussed about recommendation systems in this tutorial:
Linear regression is used to predict some y values based on the value of another set of x values.
Linear regression was created in the 1800s by Francis Galton.
Galton was a scientist studying the relationship between parents and children. More specifically, Galton was investigating the relationship between the heights of fathers and the heights of their sons.
Galton’s first discovery was that sons tended to be roughly as tall as their fathers. This is not surprising.
Later on, Galton discovered something much more interesting. The son’s height tended to be closer to the overall average height of all people than it was to his own father’s height.
Galton gave this phenomenon a name: regression. Specifically, he said “A father’s son’s height tends to regress (or drift towards) the mean (average) height”.
This led to an entire field in statistics and machine learning called regression.
When creating a regression model, all that we are trying to do is draw a line that is as close as possible to each point in a data set.
The typical example of this is the “least squares method” of linear regression, which only calculates the closeness of a line in the up-and-down direction.
Here is an example to help illustrate this:
When you create a regression model, your end product is an equation that you can use to predict the y-value of an x-value, without actually knowing the y-value in advance.
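To make this concrete, here is a minimal sketch of fitting a linear regression model with the scikit-learn library mentioned earlier. The x and y values are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: x values and the y values we observed for them
x = np.array([[1], [2], [3], [4], [5]])   # features must be a 2D array
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])   # roughly y = 2x

model = LinearRegression()
model.fit(x, y)  # find the line of best fit using least squares

# The "equation" we can now use to predict a y-value for a new x-value
print(model.coef_[0], model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[6]]))               # predicted y for x = 6
```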
Logistic regression is similar to linear regression except that instead of calculating a numerical y value, it estimates which category a data point belongs to.
Logistic regression is a machine learning model that is used to solve classification problems.
Here are a few examples of machine learning classification problems:
Each of these classification problems has exactly two categories, which makes them examples of binary classification problems.
Logistic regression is well-suited for solving binary classification problems – we just assign the different categories a value of 0 and 1 respectively.
Why do we need logistic regression? Because you can’t use a linear regression model to make binary classification predictions. It wouldn't lead to a good fit, since you’re trying to fit a straight line through a dataset with only two possible values.
This image may help you understand why linear regression models are poorly suited for binary classification problems:
In this image, the y-axis represents the probability that a tumor is malignant. Conversely, the value 1-y represents the probability that a tumor is not malignant. As you can see, the linear regression model does a poor job of predicting this probability for most of the observations in the data set.
This is why logistic regression models are useful. They have a bend to their line of best fit, which makes them much better-suited for predicting categorical data.
Here is an example that compares a linear regression model to a logistic regression model using the same training data:
The reason why the logistic regression model has a bend in its curve is because it is not calculated using a linear equation. Instead, logistic regression models are built using the Sigmoid Function (also called the Logistic Function because of its use in logistic regression).
You will not have to memorize the Sigmoid Function to be successful in machine learning. With that said, having some understanding of its appearance is useful.
The equation is shown below:
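The standard form of the Sigmoid Function is:

$$S(x) = \frac{1}{1 + e^{-x}}$$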
The main characteristic of the Sigmoid Function worth understanding is this: no matter what value you pass into it, it will always generate an output somewhere between 0 and 1.
To use the logistic regression model to make predictions, you generally need to specify a cutoff point. This cutoff point is typically 0.5.
Let’s use our cancer diagnosis example from our earlier image to see this principle in practice. If the logistic regression model outputs a value below 0.5, then the data point is categorized as a non-malignant tumor. Similarly, if the Sigmoid Function outputs a value above 0.5, then the tumor would be classified as malignant.
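Here is a minimal sketch of that workflow using scikit-learn's LogisticRegression. The tumor-size feature and labels are hypothetical, chosen only to illustrate the 0.5 cutoff point:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumor size (cm) and whether it was malignant (1) or not (0)
tumor_size = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
is_malignant = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(tumor_size, is_malignant)

# predict_proba returns values from the Sigmoid curve, always between 0 and 1
probability = model.predict_proba([[2.8]])[0][1]  # probability of the malignant class
prediction = 1 if probability > 0.5 else 0        # apply the 0.5 cutoff point
print(probability, prediction)
```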
A confusion matrix can be used as a tool to compare true positives, true negatives, false positives, and false negatives in machine learning.
Confusion matrices are particularly useful when used to measure the performance of logistic regression models. Here is an example of how we could use a confusion matrix:
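As a rough sketch, here is how you could build a confusion matrix with scikit-learn, using hypothetical actual and predicted labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = malignant, 0 = not malignant
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the actual classes, columns are the predicted classes
matrix = confusion_matrix(actual, predicted)
print(matrix)
# [[3 1]    3 true negatives, 1 false positive
#  [1 3]]   1 false negative, 3 true positives
```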
A confusion matrix is useful for assessing whether your model is particularly weak in a specific quadrant of the confusion matrix. As an example, it might have an abnormally high number of false positives.
It can also be helpful in certain applications to make sure that your model performs well in an especially dangerous zone of the confusion matrix.
In this cancer example, for instance, you’d want to be very sure that your model does not have a very high rate of false negatives, as this would indicate that someone has a malignant tumor that you incorrectly classified as non-malignant.
In this section, you had your first exposure to logistic regression machine learning models.
Here is a brief summary of what you learned about logistic regression:
The K-nearest neighbors algorithm can help you solve classification problems where there are more than two categories.
The K-nearest neighbors algorithm is a classification algorithm that is based on a simple principle. In fact, the principle is so simple that it is best understood through example.
Imagine that you had data on the height and weight of football players and basketball players. The K-nearest neighbors algorithm can be used to predict whether a new athlete is either a football player or a basketball player.
To do this, the K-nearest neighbors algorithm identifies the K data points that are closest to the new observation.
The following image visualizes this, with a K value of 3:
In this image, the football players are labeled as blue data points and the basketball players are labeled as orange dots. The data point that we are attempting to classify is labeled as green.
Since the majority (2 out of 3) of the closest data points to the new data point are blue football players, the K-nearest neighbors algorithm will predict that the new data point is also a football player.
The general steps for building a K-nearest neighbors algorithm are:
Calculate the Euclidean distance from the new data point x to all the other points in the data set
Sort the points in the data set in order of increasing distance from x
Predict using the same category as the majority of the K closest data points to x
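Here is a minimal sketch of those three steps in plain NumPy; the athlete data points and the K value of 3 are made up for illustration:

```python
import numpy as np
from collections import Counter

# Hypothetical (height, weight) data points and their labels
points = np.array([[190, 95], [185, 90], [180, 88], [175, 80], [170, 78]])
labels = ["basketball", "basketball", "football", "football", "football"]
new_point = np.array([178, 85])
K = 3

# Step 1: Euclidean distance from the new data point to every point in the data set
distances = np.linalg.norm(points - new_point, axis=1)

# Step 2: sort the points in order of increasing distance
nearest_indices = np.argsort(distances)[:K]

# Step 3: predict using the majority category among the K closest data points
nearest_labels = [labels[i] for i in nearest_indices]
prediction = Counter(nearest_labels).most_common(1)[0][0]
print(prediction)
```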
Although it might not be obvious from the start, changing the value of K in a K-nearest neighbors algorithm will change which category a new point is assigned to.
More specifically, having a very low K value will cause your model to perfectly predict your training data and poorly predict your test data. Similarly, having too high of a K value will make your model unnecessarily complex.
The following visualization does an excellent job of illustrating this:
To conclude this introduction to the K-nearest neighbors algorithm, I wanted to briefly discuss some pros and cons of using this model.
Here are some main advantages to the K-nearest neighbors algorithm:
The model accepts only two parameters: K and the distance metric you’d like to use (usually Euclidean distance)
Similarly, here are a few of the algorithm’s main disadvantages:
Here is a brief summary of what you just learned about the k-nearest neighbors algorithm:
Why the value of K matters for making predictions
Decision trees and random forests are both examples of tree methods.
More specifically, decision trees are machine learning models used to make predictions by cycling through every feature in a data set, one-by-one. Random forests are ensembles of decision trees that use random samples of the features in the data set.
Before we dig into the theoretical underpinnings of tree methods in machine learning, it is helpful to start with an example.
Imagine that you play basketball every Monday. Moreover, you always invite the same friend to come play with you.
Sometimes the friend actually comes. Sometimes they don't.
The decision on whether or not to come depends on numerous factors, like weather, temperature, wind, and fatigue. You start to notice these features and begin tracking them alongside your friend's decision whether to play or not.
You can use this data to predict whether or not your friend will show up to play basketball. One technique you could use is a decision tree. Here’s what this decision tree would look like:
Every decision tree has two types of elements:
Nodes: locations where the tree splits according to the value of some attribute
Edges: the outcome of a split to the next node
You can see in the image above that there are nodes for outlook, humidity and windy. There is an edge for each potential value of each of those attributes.
Here are two other pieces of decision tree terminology that you should understand before proceeding:
Root: the node that performs the first split
Leaves: terminal nodes that predict the final outcome
You now have a basic understanding of what decision trees are. We will learn about how to build decision trees from scratch in the next section.
Building decision trees is harder than you might imagine. This is because deciding which features to split your data on (which is a topic that belongs to the fields of Entropy and Information Gain) is a mathematically complex problem.
To address this, machine learning practitioners typically build many decision trees, using a random sample of features to choose each split.
Said differently, a new random sample of features is chosen for every single tree at every single split. This technique is called random forests.
In general, practitioners typically choose the size of the random sample of features (denoted m) to be the square root of the number of total features in the data set (denoted p). To be succinct, m is the square root of p, and then a specific feature is randomly selected from those m features.
If this does not make complete sense right now, do not worry. It will be more clear when you eventually build your first random forest model.
Imagine that you’re working with a data set that has one very strong feature. Said differently, the data set has one feature that is much more predictive of the final outcome than the other features in the data set.
If you’re building your decision trees manually, then it makes sense to use this feature as the top split of the decision tree. This means that you’ll have multiple trees whose predictions are highly correlated.
We want to avoid this since taking the average of highly correlated variables does not significantly reduce variance. By randomly selecting features for each tree in a random forest, the trees become decorrelated and the variance of the resulting model is reduced. This decorrelation is the main advantage of using random forests over handmade decision trees.
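Here is a minimal sketch of this idea using scikit-learn's RandomForestClassifier. Setting max_features to "sqrt" makes each split consider a random sample of roughly the square root of the total number of features; the data is randomly generated for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Randomly generated data: 100 observations, 16 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
y = rng.integers(0, 2, size=100)

# 100 decorrelated trees; each split considers sqrt(16) = 4 randomly chosen features
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X, y)

print(model.predict(X[:5]))  # predictions for the first five observations
```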
Here is a brief summary of what you learned about decision trees and random forests in this article:
The elements of a decision tree: nodes, edges, roots, and leaves
Support vector machines are classification algorithms (although, technically speaking, they could also be used to solve regression problems) that divide a data set into categories by slicing through the widest gap between categories. This concept will be made more clear through visualizations in a moment.
Support vector machines – or SVMs for short – are supervised machine learning models with associated learning algorithms that analyze data and recognize patterns.
Support vector machines can be used for both classification problems and regression problems. In this article, we will specifically be looking at the use of support vector machines for solving classification problems.
Let’s dig into how support vector machines really work.
Given a set of training examples – each of which is marked for belonging to one of two categories – a support vector machine training algorithm builds a model. This model assigns new examples into one of the two categories. This makes the support vector machine a non-probabilistic binary linear classifier.
The SVM uses geometry to make categorical predictions.
More specifically, an SVM model maps the data points as points in space and divides the separate categories so that they are divided by an open gap that is as wide as possible. New data points are predicted to belong to a category based on which side of the gap they belong to.
Here is an example visualization that can help you understand the intuition behind support vector machines:
As you can see, if a new data point falls on the left side of the green line, it will be labeled with the red category. Similarly, if a new data point falls on the right side of the green line, it will get labelled as belonging to the blue category.
This green line is called a hyperplane, which is an important piece of vocabulary for support vector machine algorithms.
Let’s take a look at a different visual representation of a support vector machine:
In this diagram, the hyperplane is labelled as the optimal hyperplane. Support vector machine theory defines the optimal hyperplane as the one that maximizes the margin between the closest data points from each category.
As you can see, the margin line actually touches three data points – two from the red category and one from the blue category. These data points which touch the margin lines are called support vectors and are where support vector machines get their name from.
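Here is a minimal sketch using scikit-learn's SVC with a linear kernel; the two small groups of points are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up categories of points, separated by a gap
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = "red" category, 1 = "blue" category

# A linear kernel finds the hyperplane that maximizes the margin between the categories
model = SVC(kernel="linear")
model.fit(X, y)

print(model.support_vectors_)           # the data points that touch the margin lines
print(model.predict([[3, 3], [7, 6]]))  # which side of the hyperplane new points fall on
```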
Here is a brief summary of what you just learned about support vector machines:
How support vector machines categorize data points using a hyperplane that maximizes the margin between categories in a data set
That the data points that touch margin lines in a support vector machine are called support vectors. These data points are where support vector machines derive their name from.
K-means clustering is a machine learning algorithm that allows you to identify segments of similar data within a data set.
K-means clustering is an unsupervised machine learning algorithm.
This means that it takes in unlabelled data and will attempt to group similar clusters of observations together within your data.
K-means clustering algorithms are highly useful for solving real-world problems. Here are a few use cases for this machine learning model:
The primary goal of a K-means clustering algorithm is to divide a data set into distinct groups such that the observations within each group are similar to each other.
Here is a visual representation of what this looks like in practice:
We will explore the mathematics behind a K-means clustering in the next section of this tutorial.
The first step in running a K-means clustering algorithm is to select the number of clusters you'd like to divide your data into. This number of clusters is the K value that is referenced in the algorithm’s name.
Choosing the K value within a K-means clustering algorithm is an important choice. We will talk more about how to choose a proper value of K later in this article.
Next, you must randomly assign each point in your data set to a random cluster. This gives our initial assignment which you then run the following iteration on until the clusters stop changing:
Here is an animation of how this works in practice for a K-means clustering algorithm with a K value of 3. You can see the centroid of each cluster represented by a black + character.
As you can see, this iteration continues until the clusters stop changing – meaning data points are no longer being assigned to new clusters.
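Here is a minimal sketch of running K-means with scikit-learn, using randomly generated data and a K value of 3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Randomly generated, unlabelled data: 150 observations with 2 features
rng = np.random.default_rng(0)
data = rng.normal(size=(150, 2))

# K = 3 clusters; the algorithm iterates until the cluster assignments stop changing
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(data)

print(model.labels_[:10])      # the cluster each of the first ten points was assigned to
print(model.cluster_centers_)  # the centroid of each of the three clusters
```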
Choosing a proper K value for a K-means clustering algorithm is actually quite difficult. There is no “right” answer for choosing the “best” K value.
One method that machine learning practitioners often use is called the elbow method.
To use the elbow method, the first thing you need to do is compute the sum of squared errors (SSE) for your K-means clustering algorithm for a group of K values. SSE in a K-means clustering algorithm is defined as the sum of the squared distance between each data point in a cluster and that cluster’s centroid.
As an example of this step, you might compute the SSE for K values of 2, 4, 6, 8, and 10.
Next, you will want to generate a plot of the SSE against these different K values. You will see that the error decreases as the K value increases.
This makes sense – the more categories you create within a data set, the more likely it is that each data point is close to the center of its specific cluster.
With that said, the idea behind the elbow method is to choose a value of K at which the SSE slows its rate of decline abruptly. This abrupt decrease produces an elbow in the graph.
As an example, here is a graph of SSE against K. In this case, the elbow method would suggest using a K value of approximately 6.
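As a rough sketch, here is how you could compute the SSE for several K values with scikit-learn (KMeans stores it in its inertia_ attribute) and plot the resulting elbow curve; the data is randomly generated for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))

k_values = [2, 4, 6, 8, 10]
sse = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sse.append(model.inertia_)  # sum of squared distances to the nearest centroid

# Look for the K value where the curve bends (the "elbow")
plt.plot(k_values, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()
```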
Importantly, 6 is just an estimate for a good value of K to use. There is never a “best” K value in a K-means clustering algorithm. As with many things in the field of machine learning, this is a highly situation-dependent decision.
Here is a brief summary of what you learned in this article:
How to use the elbow method to select an appropriate value of K in a K-means clustering model
Principal component analysis is used to transform a many-featured data set into a transformed data set with fewer features where each new feature is a linear combination of the preexisting features. This transformed data set aims to explain most of the variance of the original data set with far more simplicity.
Principal component analysis is a machine learning technique that is used to examine the interrelations between sets of variables.
Said differently, principal component analysis studies sets of variables in order to identify the underlying structure of those variables.
Principal component analysis is sometimes called factor analysis.
Based on this description, you might think that principal component analysis is quite similar to linear regression.
That is not the case. In fact, these two techniques have some important differences.
Linear regression determines a line of best fit through a data set. Principal component analysis determines several orthogonal lines of best fit for the data set.
If you’re unfamiliar with the term orthogonal, it just means that the lines are at right angles (90 degrees) to each other – like North, East, South, and West are on a map.
Let’s consider an example to help you understand this better.
Take a look at the axis labels in this image.
In this image, the x-axis principal component explains 73% of the variance in the data set. The y-axis principal component explains about 23% of the variance in the data set.
This means that 4% of the variance in the data set remains unexplained. You could reduce this number further by adding more principal components to your analysis.
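Here is a minimal sketch with scikit-learn's PCA. Its explained_variance_ratio_ attribute reports the share of the original variance that each principal component explains; the data is randomly generated for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Randomly generated data set with 10 features
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))

# Transform the data set down to 2 principal components
pca = PCA(n_components=2)
transformed = pca.fit_transform(data)

print(transformed.shape)              # (100, 2) - fewer features than the original
print(pca.explained_variance_ratio_)  # share of the variance each component explains
```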
Here’s a brief summary of what you learned about principal component analysis in this tutorial:
Translated from: https://www.freecodecamp.org/news/a-no-code-intro-to-the-9-most-important-machine-learning-algorithms-today/