What Is a Correlation Coefficient? The r Value in Statistics Explained

Correlations are a great tool for learning about how one thing changes with another. After reading this, you should understand what correlation is, how to think about correlations in your own work, and code up a minimal implementation to calculate correlations.

A correlation is about how two things change with each other

Correlation is an abstract math concept, but you probably already have an idea about what it means. Here are some examples of the three general categories of correlation.

As you eat more food, you will probably end up feeling more full. This is a case of two things changing together in the same way. One goes up (eating more food), and the other also goes up (feeling full). This is a positive correlation.

When you're in a car and it goes faster, you will probably get to your destination faster and your total travel time will be less. This is a case of two things changing in the opposite direction (more speed, but less time). This is a negative correlation.

There is also a third possible way two things can "change". Or rather, not change. For example, if you gained weight and looked at how your test scores changed, there probably wouldn't be any general pattern of change in your test scores. This means there's no correlation.

Knowing about how two things change together is the first step to prediction

Being able to describe what is going on in our previous examples is great and all. But what's the point? The reason is to apply this knowledge in a meaningful way to help predict what will happen next.

In our eating example, we may record how much we eat for a whole week and then make a note of how full we feel afterwards. As we found before, the more we eat, the more full we feel.

After collecting all of this information, we can ask more questions about why this happens to better understand this relationship. Here, we may start to ask what kind of foods make us more full, or whether the time of day affects how full we feel as well.

Similar thinking can be applied to your job or business as well. If you notice sales or other important metrics going up or down with other measures of your business (in other words, things are positively or negatively correlated), it may be worth exploring and learning more about that relationship to improve your business.

Correlations can have different levels of strength

We've covered some general correlations as either

  • positive,
  • negative, or
  • non-existent

Although those descriptions are okay, not all positive and negative correlations are the same.

These descriptions can also be translated to numbers. A correlation value can take on any decimal value between negative one, \(-1\), and positive one, \(+1\).

Decimal values between \(-1\) and \(0\) are negative correlations, like \(-0.32\).

Decimal values between \(0\) and \(+1\) are positive correlations, like \(+0.63\).

A correlation of exactly zero means there is no linear correlation.

For each type of correlation, there is a range of strong correlations and weak correlations. Correlation values closer to zero are weaker correlations, while values closer to positive or negative one are stronger correlations.

Strong correlations show more obvious trends in the data, while weak ones look messier. For example, when plotted, a strong positive correlation looks more like a line than a weak positive correlation does.

Similarly, strong negative correlations show a more obvious trend than weak negative correlations.

Where does the r value come from? And what values can it take?

The "r value" is a common way to indicate a correlation value. More specifically, it refers to the (sample) Pearson correlation, or Pearson's r. The "sample" note is to emphasize that you can only claim the correlation for the data you have, and you must be cautious in making larger claims beyond your data.

The table below summarizes what we've covered about correlations so far.

Pearson's r value | Correlation between two things is... | Example
r = -1 | Perfectly negative | Hour of the day and number of hours left in the day
r < 0 | Negative | Faster car speeds and lower travel time
r = 0 | Uncorrelated (no linear relationship) | Weight gain and test scores
r > 0 | Positive | More food eaten and feeling more full
r = 1 | Perfectly positive | Increase in my age and increase in your age

In the next few sections, we will

  • Break down the math equation to calculate correlations
  • Work through this correlation equation with example numbers
  • Code up the math equation in Python and JavaScript

Breaking down the math to calculate correlations

As a reminder, correlations can only be between \(-1\) and \(1\). Why is that?

The quick answer is that we adjust the amount of change in both variables to a common scale. In more technical terms, we normalize how much the two variables change together by how much each of the two variables changes by itself.

From Wikipedia, we can grab the math definition of the Pearson correlation coefficient. It looks very complicated, but let's break it down together.

\[ \textcolor{lime}{r} _{ \textcolor{#4466ff}{x} \textcolor{fuchsia}{y} } = \frac{ \sum_{i=1}^{n} (x_i - \textcolor{green}{\bar{x}})(y_i - \textcolor{olive}{\bar{y}}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \textcolor{green}{\bar{x}})^2 \sum_{i=1}^{n} (y_i - \textcolor{olive}{\bar{y}})^2 } }\]

From this equation, to find the \(\textcolor{lime}{\text{correlation}}\) between an \( \textcolor{#4466ff}{\text{x variable}} \) and a \( \textcolor{fuchsia}{\text{y variable}} \), we first need to calculate the \( \textcolor{green}{\text{average value for all the } x \text{ values}} \) and the \( \textcolor{olive}{ \text{average value for all the } y \text{ values}} \).

Let's focus on the top of the equation, also known as the numerator. For each of the \( x\) and \(y\) variables, we'll then need to find the distance of the \(x\) values from the average of \(x\), and do the same subtraction with \(y\).

Intuitively, comparing all these values to the average gives us a target point to see how much change there is in one of the variables.

This is seen in the math form \(\textcolor{#800080}{\sum_{i=1}^{n}}(\textcolor{#000080}{x_i - \overline{x}})\), which \(\textcolor{#800080}{\text{adds up all}}\) the \(\textcolor{#000080}{\text{differences between}}\) your values and the average value for your \(x\) variable. In the numerator, each \(x\) difference is first multiplied by its paired \(y\) difference before everything is added up.

In the bottom of the equation, also known as the denominator, we do a similar calculation. However, before we add up all of the distances between our values and their averages, we multiply each distance by itself (that's what the \((\ldots)^2\) is doing).

This denominator is what "adjusts" the correlation so that the values are between \(-1\) and \(1\).
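One way to see why the denominator keeps the result between \(-1\) and \(1\) is the Cauchy-Schwarz inequality, a standard result that is not part of the walkthrough above. Writing \(a_i = x_i - \bar{x}\) and \(b_i = y_i - \bar{y}\) (shorthand introduced here only for this sketch), it says:

\[ \left( \sum_{i=1}^{n} a_i b_i \right)^2 \le \left( \sum_{i=1}^{n} a_i^2 \right) \left( \sum_{i=1}^{n} b_i^2 \right) \]

Taking square roots of both sides shows that the numerator can never be larger in size than the denominator:

\[ \left| \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \right| \le \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 } \]

Dividing both sides by the denominator (assuming it is not zero) gives \(-1 \le r_{xy} \le 1\).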

Using numbers in our equation to make it real

To demonstrate the math, let's find the correlation between the ages of you and your siblings last year \([1, 2, 6]\) and your ages for this year \([2, 3, 7]\). Note that this is a small example. Typically you would want many more than three samples to have more confidence in your correlation being true.

Looking at the numbers, they appear to increase by the same amount. You may also notice they are the same sequence of numbers, but with one added to each number in the second set. This is a perfect positive correlation. In other words, we should get \(r = 1\).

First we need to calculate the averages of each. The average of \([1, 2, 6]\) is \((1+2+6)/3 = 3\) and the average of \([2, 3, 7]\) is \((2+3+7)/3 = 4\). Filling in our equation, we get

\[ r _{ x y } = \frac{ \sum_{i=1}^{n} (x_i - 3)(y_i - 4) }{ \sqrt{ \sum_{i=1}^{n} (x_i - 3)^2 \sum_{i=1}^{n} (y_i - 4)^2 } }\]

Looking at the top of the equation, we need to multiply the paired differences of \(x\) and \(y\) and add up the results. Remember, \(\sum\) is the symbol for adding. The top then just becomes

\[ (1-3)(2-4) + (2-3)(3-4) + (6-3)(7-4) \]

\[= (-2)(-2) + (-1)(-1) + (3)(3) \]

\[= 4 + 1 + 9 = 14\]

So the top becomes 14.

\[ r _{ x y } = \frac{ 14 }{ \sqrt{ \sum_{i=1}^{n} (x_i - 3)^2 \sum_{i=1}^{n} (y_i - 4)^2 } }\]

In the bottom of the equation, we need to do some very similar calculations, except focusing on just the \(x\) and \(y\) values separately before multiplying.

Let's focus on just \( \sum_{i=1}^n (x_i - 3)^2 \) first. Remember, \(3\) here is the average of all the \(x\) values. This number will change depending on your particular data.

\[ (1-3)^2 + (2-3)^2 + (6-3)^2 \]

\[= (-2)^2 + (-1)^2 + (3)^2 = 4 + 1 + 9 = 14 \]

And now for the \(y\) values.

\[ (2-4)^2 + (3-4)^2 + (7-4)^2 \]

\[= (-2)^2 + (-1)^2 + (3)^2 = 4 + 1 + 9 = 14\]

With those numbers filled in, we can put them back in our equation and solve for our correlation.

\[ r _{ x y } = \frac{ 14 }{ \sqrt{ 14 \times 14 }} = \frac{14}{\sqrt{ 14^2}} = \frac{14}{14} = 1\]

We've successfully confirmed that we get \(r = 1\).
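The same arithmetic can be reproduced step by step in a few lines of Python. This is only a sketch that mirrors the hand calculation above; the variable names are chosen here for illustration.

```python
import math

# Ages last year (x) and this year (y), as in the worked example.
x = [1, 2, 6]
y = [2, 3, 7]

# Step 1: the averages.
x_bar = sum(x) / len(x)  # 3.0
y_bar = sum(y) / len(y)  # 4.0

# Step 2: the numerator, the sum of the paired differences multiplied together.
numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 14.0

# Step 3: the two sums of squared differences in the denominator.
sum_sq_x = sum((xi - x_bar) ** 2 for xi in x)  # 14.0
sum_sq_y = sum((yi - y_bar) ** 2 for yi in y)  # 14.0

# Step 4: divide the numerator by the square root of their product.
r = numerator / math.sqrt(sum_sq_x * sum_sq_y)
print(r)  # 1.0
```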

Although this was a simple example, it is always best to use simple examples for demonstration purposes. It shows our equation does indeed work, which will be important when coding it up in the next section.

Python and JavaScript code for the Pearson correlation coefficient

Math can sometimes be too abstract, so let's code this up for you to experiment with. As a reminder, here is the equation we are going to code up.

\[ r _{ x y } = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 } }\]

After going through the math above and reading the code below, it should be a bit clearer how everything works together.

Below is the Python version of the Pearson correlation.
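A minimal sketch of such a `pearson` function, written directly from the equation above, might look like the following. It is one possible implementation rather than the only one: the function name matches the `pearson(x, y)` call used in this article's examples, while the length check and the zero-variation check are added assumptions about how to handle those edge cases.

```python
import math


def pearson(x, y):
    """Return the sample Pearson correlation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Averages of each variable
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)

    # Numerator: how much x and y change together
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Denominator: how much x and y each change on their own
    sum_sq_x = sum((xi - x_bar) ** 2 for xi in x)
    sum_sq_y = sum((yi - y_bar) ** 2 for yi in y)
    denominator = math.sqrt(sum_sq_x * sum_sq_y)

    if denominator == 0:
        # Assumption: treat a variable with no variation as an error,
        # because the correlation is undefined in that case.
        raise ValueError("correlation is undefined when a variable never changes")

    return numerator / denominator


print(pearson([1, 2, 6], [2, 3, 7]))  # 1.0
```

The function uses only plain Python loops so that each line maps onto one piece of the equation; it also accepts NumPy arrays, since it only relies on `len`, `sum`, and iteration.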

Here's an example of our Python code at work, and we can double check our work using a Pearson correlation function from the SciPy package.

import numpy as np
import scipy.stats

# Create fake data
x = np.arange(5, 15)  # array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
y = np.array([24, 0, 58, 26, 82, 89, 90, 90, 36, 56])

# Use a package to calculate Pearson's r
# Note: the p variable below is the p-value for the Pearson's r. It tests
#   the null hypothesis that the true correlation is zero (no linear trend).
r, p = scipy.stats.pearsonr(x, y)
r  # 0.506862548805646

# Use our own function
pearson(x, y)  # 0.506862548805646

Below is the JavaScript version of the Pearson correlation.

Here's an example of our JavaScript code at work to double check our work.

x = Array.from({length: 10}, (x, i) => i + 5)
// Array(10) [ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ]

y = [24, 0, 58, 26, 82, 89, 90, 90, 36, 56]

pearson(x, y)
// 0.506862548805646

Feel free to translate the formula into either Python or JavaScript to better understand how it works.

In conclusion

Correlations are a helpful and accessible tool to better understand the relationship between any two numerical measures. They can be thought of as a starting point for predictive problems or simply a way to better understand your business.

Correlation values, most commonly reported as Pearson's r, range from \(-1\) to \(+1\) and can be categorized into negative correlation (\(-1 \le r \lt 0\)), positive correlation (\(0 \lt r \le 1\)), and no correlation (\(r = 0\)).

A glimpse into the larger world of correlations

There is more than one way to calculate a correlation. Here we have touched on the case where both variables change in the same way, along a straight line. There are other cases where one variable may change at a different rate, but still have a clear relationship. This gives rise to what are called non-linear relationships.

Note, correlation does not imply causation. If you need quick examples of why, look no further.

Below is a list of other articles I came across that helped me better understand the correlation coefficient.

  • If you want to explore a great interactive visualization on correlation, take a look at this simple and fantastic site.

  • Using Python, there are multiple ways to implement a correlation and there are multiple types of correlation. This excellent tutorial shows great examples of Python code to experiment with yourself.

  • A blog post by Sabatian Sauer goes over correlations using "average deviation rectangles", where each point forms a rectangle with the means of the two variables, illustrated using the R programming language.

  • And for the deeply curious people out there, take a look at this paper showing 13 ways to look at the correlation coefficient (PDF).

Follow me on Twitter and check out my personal blog where I share some other insights and helpful resources for programming, statistics, and machine learning.

Thanks for reading!

Translated from: https://www.freecodecamp.org/news/what-is-a-correlation-coefficient-r-value-in-statistics-explains/
