Machine learning is changing the world. Google uses machine learning to suggest search results to users. Netflix uses it to recommend movies for you to watch. Facebook uses machine learning to suggest people you may know.
Machine learning has never been more important. At the same time, understanding machine learning is hard. The field is full of jargon. And the number of different ML algorithms grows each year.
This article will introduce you to the fundamental concepts within the field of machine learning. More specifically, we will discuss the basic concepts behind the 9 most important machine learning algorithms today.
Recommendation systems are used to find similar entries in a data set.
Perhaps the most common real-world example of a recommendation system exists inside of Netflix. More specifically, its video streaming service will recommend movies and TV shows based on content that you’ve already watched.
Another recommendation system is Facebook’s “People You May Know” feature, which suggests possible friends for you based on your existing friends list.
Fully developed and deployed recommendation systems are extremely sophisticated. They are also very resource-intensive.
Fully-fledged recommendation systems require a deep background in linear algebra to build from scratch.
Because of this, there might be concepts in this section that you do not understand if you’ve never studied linear algebra before.
Don’t worry, though – the scikit-learn Python library makes it very easy to build recommendation systems. So you don't need much of a linear algebra background to build real-world recommendation systems.
There are two main types of recommendation systems:
Content-based recommendation systems give you recommendations based on items’ similarity to items that you’ve already used. They behave exactly how you’d expect a recommendation system to behave.
Collaborative filtering recommendation systems produce recommendations based on knowledge of the user’s interactions with items. Said differently, they use the wisdom of the crowds. (Hence the term “collaborative” in its name.)
In the real world, collaborative filtering recommendation systems are much more common than content-based systems. This is primarily because they typically give better results. Some practitioners also find collaborative filtering recommendation systems easier to understand.
Collaborative filtering recommendation systems also have a unique feature that content-based systems are missing. Namely, they have the ability to learn features on their own.
This means that they can even start identifying similarities between items based on attributes that you haven’t even told them to consider.
There are two subcategories within collaborative filtering:
You don’t need to know the differences between these two types of collaborative filtering recommendation systems to be successful in machine learning. It is enough to recognize that multiple types exist.
Here is a brief summary of what we discussed about recommendation systems in this tutorial:
Linear regression is used to predict some y values based on the value of another set of x values.
Linear regression was created in the 1800s by Francis Galton.
Galton was a scientist studying the relationship between parents and children. More specifically, Galton was investigating the relationship between the heights of fathers and the heights of their sons.
Galton’s first discovery was that sons tended to be roughly as tall as their fathers. This is not surprising.
Later on, Galton discovered something much more interesting. The son’s height tended to be closer to the overall average height of all people than it was to his own father’s height.
Galton gave this phenomenon a name: regression. Specifically, he said “A father’s son’s height tends to regress (or drift towards) the mean (average) height”.
This led to an entire field in statistics and machine learning called regression.
When creating a regression model, all that we are trying to do is draw a line that is as close as possible to each point in a data set.
The typical example of this is the “least squares method” of linear regression, which only calculates the closeness of a line in the up-and-down direction.
Here is an example to help illustrate this:
When you create a regression model, your end product is an equation that you can use to predict the y-value of an x-value, without actually knowing the y-value in advance.
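To make this concrete, here is a minimal sketch of fitting a linear regression model with the scikit-learn library mentioned earlier. The x and y values are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: x values and the y values we observed for them
x = np.array([[1], [2], [3], [4], [5]])   # features must be a 2D array
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])   # roughly y = 2x

model = LinearRegression()
model.fit(x, y)  # find the line of best fit using least squares

# The "equation" we can now use to predict a y-value for a new x-value
print(model.coef_[0], model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[6]]))               # predicted y for x = 6
```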
Logistic regression is similar to linear regression except that instead of calculating a numerical y value, it estimates which category a data point belongs to.
Logistic regression is a machine learning model that is used to solve classification problems.
Here are a few examples of machine learning classification problems:
Each of these classification problems has exactly two categories, which makes them examples of binary classification problems.
Logistic regression is well-suited for solving binary classification problems – we just assign the different categories a value of 0 and 1 respectively.
Why do we need logistic regression? Because you can’t use a linear regression model to make binary classification predictions. It wouldn't lead to a good fit, since you’re trying to fit a straight line through a dataset with only two possible values.
This image may help you understand why linear regression models are poorly suited for binary classification problems:
In this image, the y-axis represents the probability that a tumor is malignant. Conversely, the value 1-y represents the probability that a tumor is not malignant. As you can see, the linear regression model does a poor job of predicting this probability for most of the observations in the data set.
This is why logistic regression models are useful. They have a bend to their line of best fit, which makes them much better-suited for predicting categorical data.
Here is an example that compares a linear regression model to a logistic regression model using the same training data:
The reason why the logistic regression model has a bend in its curve is because it is not calculated using a linear equation. Instead, logistic regression models are built using the Sigmoid Function (also called the Logistic Function because of its use in logistic regression).
You will not have to memorize the Sigmoid Function to be successful in machine learning. With that said, having some understanding of its appearance is useful.
The equation is shown below:
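The standard form of the Sigmoid Function is:

$$S(x) = \frac{1}{1 + e^{-x}}$$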
The main characteristic of the Sigmoid Function worth understanding is this: no matter what value you pass into it, it will always generate an output somewhere between 0 and 1.
To use the logistic regression model to make predictions, you generally need to specify a cutoff point. This cutoff point is typically 0.5.
Let’s use our cancer diagnosis example from our earlier image to see this principle in practice. If the logistic regression model outputs a value below 0.5, then the data point is categorized as a non-malignant tumor. Similarly, if the Sigmoid Function outputs a value above 0.5, then the tumor would be classified as malignant.
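Here is a minimal sketch of that workflow using scikit-learn's LogisticRegression. The tumor-size feature and labels are hypothetical, chosen only to illustrate the 0.5 cutoff point:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumor size (cm) and whether it was malignant (1) or not (0)
tumor_size = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
is_malignant = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(tumor_size, is_malignant)

# predict_proba returns values from the Sigmoid curve, always between 0 and 1
probability = model.predict_proba([[2.8]])[0][1]  # probability of the malignant class
prediction = 1 if probability > 0.5 else 0        # apply the 0.5 cutoff point
print(probability, prediction)
```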
A confusion matrix can be used as a tool to compare true positives, true negatives, false positives, and false negatives in machine learning.
Confusion matrices are particularly useful when used to measure the performance of logistic regression models. Here is an example of how we could use a confusion matrix:
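As a rough sketch, here is how you could build a confusion matrix with scikit-learn, using hypothetical actual and predicted labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = malignant, 0 = not malignant
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the actual classes, columns are the predicted classes
matrix = confusion_matrix(actual, predicted)
print(matrix)
# [[3 1]    3 true negatives, 1 false positive
#  [1 3]]   1 false negative, 3 true positives
```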
A confusion matrix is useful for assessing whether your model is particularly weak in a specific quadrant of the confusion matrix. As an example, it might have an abnormally high number of false positives.
It can also be helpful in certain applications to make sure that your model performs well in an especially dangerous zone of the confusion matrix.
In this cancer example, for instance, you’d want to be very sure that your model does not have a very high rate of false negatives, as this would indicate that someone has a malignant tumor that you incorrectly classified as non-malignant.
In this section, you had your first exposure to logistic regression machine learning models.
Here is a brief summary of what you learned about logistic regression:
The K-nearest neighbors algorithm can help you solve classification problems where there are more than two categories.
The K-nearest neighbors algorithm is a classification algorithm that is based on a simple principle. In fact, the principle is so simple that it is best understood through example.
Imagine that you had data on the height and weight of football players and basketball players. The K-nearest neighbors algorithm can be used to predict whether a new athlete is either a football player or a basketball player.
To do this, the K-nearest neighbors algorithm identifies the K data points that are closest to the new observation.
The following image visualizes this, with a K value of 3:
In this image, the football players are labeled as blue data points and the basketball players are labeled as orange dots. The data point that we are attempting to classify is labeled as green.
Since the majority (2 out of 3) of the closest data points to the new data point are blue football players, the K-nearest neighbors algorithm will predict that the new data point is also a football player.
The general steps for building a K-nearest neighbors algorithm are:
Calculate the Euclidean distance from the new data point x to all the other points in the data set
Sort the points in the data set in order of increasing distance from x
Predict using the same category as the majority of the K closest data points to x
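Here is a minimal sketch of those three steps in plain NumPy; the athlete data points and the K value of 3 are made up for illustration:

```python
import numpy as np
from collections import Counter

# Hypothetical (height, weight) data points and their labels
points = np.array([[190, 95], [185, 90], [180, 88], [175, 80], [170, 78]])
labels = ["basketball", "basketball", "football", "football", "football"]
new_point = np.array([178, 85])
K = 3

# Step 1: Euclidean distance from the new data point to every point in the data set
distances = np.linalg.norm(points - new_point, axis=1)

# Step 2: sort the points in order of increasing distance
nearest_indices = np.argsort(distances)[:K]

# Step 3: predict using the majority category among the K closest data points
nearest_labels = [labels[i] for i in nearest_indices]
prediction = Counter(nearest_labels).most_common(1)[0][0]
print(prediction)
```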
Although it might not be obvious from the start, changing the value of K in a K-nearest neighbors algorithm will change which category a new point is assigned to.
More specifically, having a very low K value will cause your model to perfectly predict your training data and poorly predict your test data. Similarly, having too high of a K value will make your model unnecessarily complex.
The following visualization does an excellent job of illustrating this:
To conclude this introduction to the K-nearest neighbors algorithm, I wanted to briefly discuss some pros and cons of using this model.
Here are some main advantages to the K-nearest neighbors algorithm:
The model accepts only two parameters: K and the distance metric you’d like to use (usually Euclidean distance)
Similarly, here are a few of the algorithm’s main disadvantages:
Here is a brief summary of what you just learned about the k-nearest neighbors algorithm:
Why the value of K matters for making predictions
Decision trees and random forests are both examples of tree methods.
More specifically, decision trees are machine learning models used to make predictions by cycling through every feature in a data set, one-by-one. Random forests are ensembles of decision trees that use random samples of the features in the data set.
Before we dig into the theoretical underpinnings of tree methods in machine learning, it is helpful to start with an example.
Imagine that you play basketball every Monday. Moreover, you always invite the same friend to come play with you.
Sometimes the friend actually comes. Sometimes they don't.
The decision on whether or not to come depends on numerous factors, like weather, temperature, wind, and fatigue. You start to notice these features and begin tracking them alongside your friend's decision whether to play or not.
You can use this data to predict whether or not your friend will show up to play basketball. One technique you could use is a decision tree. Here’s what this decision tree would look like:
Every decision tree has two types of elements:
Nodes: locations where the tree splits according to the value of some attribute
Edges: the outcome of a split to the next node
You can see in the image above that there are nodes for outlook, humidity and windy. There is an edge for each potential value of each of those attributes.
Here are two other pieces of decision tree terminology that you should understand before proceeding:
Root: the node that performs the first split
Leaves: terminal nodes that predict the final outcome
You now have a basic understanding of what decision trees are. We will learn about how to build decision trees from scratch in the next section.
Building decision trees is harder than you might imagine. This is because deciding which features to split your data on (which is a topic that belongs to the fields of Entropy and Information Gain) is a mathematically complex problem.
To address this, machine learning practitioners typically build many decision trees, using a random sample of features to choose each split.
Said differently, a new random sample of features is chosen for every single tree at every single split. This technique is called random forests.
In general, practitioners typically choose the size of the random sample of features (denoted m) to be the square root of the number of total features in the data set (denoted p). To be succinct, m is the square root of p, and then a specific feature is randomly selected from those m features.
If this does not make complete sense right now, do not worry. It will be more clear when you eventually build your first random forest model.
Imagine that you’re working with a data set that has one very strong feature. Said differently, the data set has one feature that is much more predictive of the final outcome than the other features in the data set.
If you’re building your decision trees manually, then it makes sense to use this feature as the top split of the decision tree. This means that you’ll have multiple trees whose predictions are highly correlated.
We want to avoid this since taking the average of highly correlated variables does not significantly reduce variance. By randomly selecting features for each tree in a random forest, the trees become decorrelated and the variance of the resulting model is reduced. This decorrelation is the main advantage of using random forests over handmade decision trees.
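Here is a minimal sketch of this idea using scikit-learn's RandomForestClassifier. Setting max_features to "sqrt" makes each split consider a random sample of roughly the square root of the total number of features; the data is randomly generated for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Randomly generated data: 100 observations, 16 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
y = rng.integers(0, 2, size=100)

# 100 decorrelated trees; each split considers sqrt(16) = 4 randomly chosen features
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X, y)

print(model.predict(X[:5]))  # predictions for the first five observations
```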
Here is a brief summary of what you learned about decision trees and random forests in this article:
The elements of a decision tree: nodes, edges, roots, and leaves
Support vector machines are classification algorithms (although, technically speaking, they could also be used to solve regression problems) that divide a data set into categories by slicing through the widest gap between categories. This concept will be made more clear through visualizations in a moment.
Support vector machines – or SVMs for short – are supervised machine learning models with associated learning algorithms that analyze data and recognize patterns.
Support vector machines can be used for both classification problems and regression problems. In this article, we will specifically be looking at the use of support vector machines for solving classification problems.
Let’s dig into how support vector machines really work.
Given a set of training examples – each of which is marked for belonging to one of two categories – a support vector machine training algorithm builds a model. This model assigns new examples into one of the two categories. This makes the support vector machine a non-probabilistic binary linear classifier.
The SVM uses geometry to make categorical predictions.
More specifically, an SVM model maps the data points as points in space and divides the separate categories so that they are divided by an open gap that is as wide as possible. New data points are predicted to belong to a category based on which side of the gap they belong to.
Here is an example visualization that can help you understand the intuition behind support vector machines:
As you can see, if a new data point falls on the left side of the green line, it will be labeled with the red category. Similarly, if a new data point falls on the right side of the green line, it will get labelled as belonging to the blue category.
This green line is called a hyperplane, which is an important piece of vocabulary for support vector machine algorithms.
Let’s take a look at a different visual representation of a support vector machine:
In this diagram, the hyperplane is labelled as the optimal hyperplane. Support vector machine theory defines the optimal hyperplane as the one that maximizes the margin between the closest data points from each category.
As you can see, the margin line actually touches three data points – two from the red category and one from the blue category. These data points which touch the margin lines are called support vectors and are where support vector machines get their name from.
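Here is a minimal sketch using scikit-learn's SVC with a linear kernel; the two small groups of points are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up categories of points, separated by a gap
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = "red" category, 1 = "blue" category

# A linear kernel finds the hyperplane that maximizes the margin between the categories
model = SVC(kernel="linear")
model.fit(X, y)

print(model.support_vectors_)           # the data points that touch the margin lines
print(model.predict([[3, 3], [7, 6]]))  # which side of the hyperplane new points fall on
```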
Here is a brief summary of what you just learned about support vector machines:
How support vector machines categorize data points using a hyperplane that maximizes the margin between categories in a data set
That the data points that touch margin lines in a support vector machine are called support vectors. These data points are where support vector machines derive their name from.
K-means clustering is a machine learning algorithm that allows you to identify segments of similar data within a data set.
K-means clustering is an unsupervised machine learning algorithm.
This means that it takes in unlabelled data and will attempt to group similar clusters of observations together within your data.
K-means clustering algorithms are highly useful for solving real-world problems. Here are a few use cases for this machine learning model:
The primary goal of a K-means clustering algorithm is to divide a data set into distinct groups such that the observations within each group are similar to each other.
Here is a visual representation of what this looks like in practice:
We will explore the mathematics behind a K-means clustering in the next section of this tutorial.
The first step in running a K-means clustering algorithm is to select the number of clusters you'd like to divide your data into. This number of clusters is the K value that is referenced in the algorithm’s name.
Choosing the K value within a K-means clustering algorithm is an important choice. We will talk more about how to choose a proper value of K later in this article.
Next, you must randomly assign each point in your data set to a random cluster. This gives our initial assignment which you then run the following iteration on until the clusters stop changing:
Here is an animation of how this works in practice for a K-means clustering algorithm with a K value of 3. You can see the centroid of each cluster represented by a black + character.
As you can see, this iteration continues until the clusters stop changing – meaning data points are no longer being assigned to new clusters.
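Here is a minimal sketch of running K-means with scikit-learn, using randomly generated data and a K value of 3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Randomly generated, unlabelled data: 150 observations with 2 features
rng = np.random.default_rng(0)
data = rng.normal(size=(150, 2))

# K = 3 clusters; the algorithm iterates until the cluster assignments stop changing
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(data)

print(model.labels_[:10])      # the cluster each of the first ten points was assigned to
print(model.cluster_centers_)  # the centroid of each of the three clusters
```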
Choosing a proper K value for a K-means clustering algorithm is actually quite difficult. There is no “right” answer for choosing the “best” K value.
One method that machine learning practitioners often use is called the elbow method.
To use the elbow method, the first thing you need to do is compute the sum of squared errors (SSE) for your K-means clustering algorithm for a group of K values. SSE in a K-means clustering algorithm is defined as the sum of the squared distance between each data point in a cluster and that cluster’s centroid.
As an example of this step, you might compute the SSE for K values of 2, 4, 6, 8, and 10.
Next, you will want to generate a plot of the SSE against these different K values. You will see that the error decreases as the K value increases.
This makes sense – the more categories you create within a data set, the more likely it is that each data point is close to the center of its specific cluster.
With that said, the idea behind the elbow method is to choose a value of K at which the SSE slows its rate of decline abruptly. This abrupt decrease produces an elbow in the graph.
As an example, here is a graph of SSE against K. In this case, the elbow method would suggest using a K value of approximately 6.
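As a rough sketch, here is how you could compute the SSE for several K values with scikit-learn (KMeans stores it in its inertia_ attribute) and plot the resulting elbow curve; the data is randomly generated for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))

k_values = [2, 4, 6, 8, 10]
sse = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sse.append(model.inertia_)  # sum of squared distances to the nearest centroid

# Look for the K value where the curve bends (the "elbow")
plt.plot(k_values, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()
```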
Importantly, 6 is just an estimate for a good value of K to use. There is never a “best” K value in a K-means clustering algorithm. As with many things in the field of machine learning, this is a highly situation-dependent decision.
Here is a brief summary of what you learned in this article:
How to use the elbow method to select an appropriate value of K in a K-means clustering model
Principal component analysis is used to transform a many-featured data set into a transformed data set with fewer features where each new feature is a linear combination of the preexisting features. This transformed data set aims to explain most of the variance of the original data set with far more simplicity.
Principal component analysis is a machine learning technique that is used to examine the interrelations between sets of variables.
Said differently, principal component analysis studies sets of variables in order to identify the underlying structure of those variables.
Principal component analysis is sometimes called factor analysis.
Based on this description, you might think that principal component analysis is quite similar to linear regression.
That is not the case. In fact, these two techniques have some important differences.
Linear regression determines a line of best fit through a data set. Principal component analysis determines several orthogonal lines of best fit for the data set.
If you’re unfamiliar with the term orthogonal, it just means that the lines are at right angles (90 degrees) to each other – like North, East, South, and West are on a map.
Let’s consider an example to help you understand this better.
Take a look at the axis labels in this image.
In this image, the x-axis principal component explains 73% of the variance in the data set. The y-axis principal component explains about 23% of the variance in the data set.
This means that 4% of the variance in the data set remains unexplained. You could reduce this number further by adding more principal components to your analysis.
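Here is a minimal sketch with scikit-learn's PCA. Its explained_variance_ratio_ attribute reports the share of the original variance that each principal component explains; the data is randomly generated for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Randomly generated data set with 10 features
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))

# Transform the data set down to 2 principal components
pca = PCA(n_components=2)
transformed = pca.fit_transform(data)

print(transformed.shape)              # (100, 2) - fewer features than the original
print(pca.explained_variance_ratio_)  # share of the variance each component explains
```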
Here’s a brief summary of what you learned about principal component analysis in this tutorial:
Translated from: https://www.freecodecamp.org/news/a-no-code-intro-to-the-9-most-important-machine-learning-algorithms-today/