Introduction to Survival Analysis: the Kaplan-Meier Estimator

In my previous article, I described the potential use-cases of survival analysis and introduced all the building blocks required to understand the techniques used for analyzing time-to-event data.

I continue the series by explaining perhaps the simplest, yet very insightful, approach to survival analysis: the Kaplan-Meier estimator. After a theoretical introduction, I will show you how to carry out the analysis in Python using the popular lifelines library.

1. The Kaplan-Meier Estimator

The Kaplan-Meier estimator (also known as the product-limit estimator, you will see why later on) is a non-parametric technique for estimating and plotting the survival probability as a function of time. It is often the first step in a survival analysis, as it is the simplest approach and requires the fewest assumptions. To carry out the analysis using the Kaplan-Meier approach, we assume the following:

  • The event of interest is unambiguous and happens at a clearly specified time.
  • All observations share the same survival probability, regardless of exactly when they entered the study.
  • Censored observations have the same survival prospects as observations that continue to be followed.

In real-life cases, we never know the true survival function, which is why we approximate it from the collected data with the Kaplan-Meier estimator. The estimator is the fraction of observations that survived for a certain amount of time under the same circumstances, and is given by the following formula:

$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

where:

  • t_i is a time when at least one event happened,
  • d_i is the number of events that happened at time t_i,
  • n_i is the number of individuals known to have survived up to time t_i, that is, those who have neither experienced the event nor been censored before t_i. Put differently, it is the number of observations at risk at time t_i.

The product symbol in the formula explains the method's other name, the product-limit estimator. The survival probability at time t is the product of the conditional probabilities of surviving past each event time up to and including t.
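To make the product concrete, here is a minimal sketch that computes the estimator by hand in plain Python. The function name `kaplan_meier` and the six toy observations are made up for illustration; this is not how lifelines implements the estimator internally.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t_i, S(t_i))] for every distinct time with at least one event."""
    # d_i: number of events observed at each distinct time
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # n_i: observations still at risk at t (no event, not censored before t)
        n_i = sum(1 for d in durations if d >= t)
        d_i = deaths[t]
        survival *= 1 - d_i / n_i  # one factor of the product
        curve.append((t, survival))
    return curve


# Hypothetical toy data: durations in months, 1 = event observed, 0 = censored
durations = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored observations never add a factor to the product, but they still count towards n_i until the time at which they are censored.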

What we most often associate with this approach to survival analysis, and what we generally see in practice, are the Kaplan-Meier curves: plots of the Kaplan-Meier estimator over time. We can use those curves as an exploratory tool to compare the survival function between cohorts, groups that did or did not receive some kind of treatment, behavioral clusters, etc.

The survival line is actually a series of decreasing horizontal steps, which approaches the shape of the population's true survival function given a large enough sample size. In practice, the plot is often accompanied by confidence intervals to show how uncertain we are about the point estimates. Wide confidence intervals indicate high uncertainty, usually because only a few observations remain at risk, as earlier ones have either experienced the event or been censored. For more details on the calculation of the confidence intervals using the Greenwood method, please see [2].
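As a rough illustration of where such intervals come from, the sketch below applies the plain Greenwood variance formula, Var[S(t)] ≈ S(t)² · Σ d_i / (n_i (n_i − d_i)), and a normal approximation. The function name `km_with_greenwood` and the toy data are hypothetical, and libraries may use a transformed variant (for example the exponential Greenwood interval discussed in [2]), so their numbers can differ from these.

```python
import math
from collections import Counter


def km_with_greenwood(durations, events, z=1.96):
    """Return [(t, S(t), lower, upper)] using the plain Greenwood formula."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    survival, var_sum, rows = 1.0, 0.0, []
    for t in sorted(deaths):
        n_i = sum(1 for d in durations if d >= t)  # at risk at t
        d_i = deaths[t]
        survival *= 1 - d_i / n_i
        if n_i > d_i:  # skip the term when everyone at risk has the event
            var_sum += d_i / (n_i * (n_i - d_i))
        se = survival * math.sqrt(var_sum)
        lower = max(0.0, survival - z * se)  # clip to the valid range [0, 1]
        upper = min(1.0, survival + z * se)
        rows.append((t, survival, lower, upper))
    return rows


# Hypothetical toy data: 1 = event observed, 0 = censored
rows = km_with_greenwood([2, 3, 3, 5, 6, 8], [1, 1, 0, 1, 0, 1])
for t, s, lo, hi in rows:
    print(f"t={t}: S={s:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

With only six observations, the intervals are very wide, which is exactly the kind of uncertainty the plot is meant to communicate.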

Image provided by the author

The interpretation of the survival curve is quite simple: the y-axis represents the probability that a subject has still not experienced the event of interest by time t, which is shown on the x-axis. Each drop in the survival function (approximated by the Kaplan-Meier estimator) is caused by the event of interest happening for at least one observation.

The actual length of the vertical line reflects the fraction of observations at risk that experienced the event at time t. This means that a single observation experiencing the event (any one observation, not the same one twice) can produce drops of different sizes at two different times, depending on the number of observations at risk. This way, the height of the drop can also inform us about the number of observations at risk (even when it is unreported and/or there are no confidence intervals).
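A quick arithmetic sketch of this effect: from the formula, the absolute drop at an event time equals the current survival level multiplied by d_i / n_i. The helper `drop_size` and the numbers are hypothetical.

```python
def drop_size(current_survival, n_at_risk, n_events=1):
    """Absolute drop of the Kaplan-Meier curve at a single event time."""
    return current_survival * n_events / n_at_risk


# The same single event, early in the study (100 at risk) vs. late (10 at risk)
early_drop = drop_size(1.0, 100)
late_drop = drop_size(1.0, 10)
print(early_drop, late_drop)
```

With everything else equal, the drop is ten times larger when only a tenth of the observations remain at risk.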

When no observation experiences the event of interest, there is no drop in the survival curve, even if some observations are censored at that time. Censoring only reduces the number at risk for the later steps.


2. The log-rank test

We have learned how to use the Kaplan-Meier estimator to approximate the true survival function of a population. We also know that we can plot multiple curves to compare their shapes, for example, one curve per operating system used by the users of our mobile app. However, we still do not have a tool that actually allows for a comparison, or at least one more rigorous than eyeballing the curves.

That is when the log-rank test comes into play. It is a statistical test that compares the survival probabilities between two groups (for more than two groups, please see the Python implementation). The null hypothesis of the test states that there is no difference between the survival functions of the considered groups.

The log-rank test relies on the same assumptions as the Kaplan-Meier estimator. Additionally, there is the proportional hazards assumption: the hazard ratio (please see the previous article for a reminder about the hazard rate) should be constant throughout the study period. In practice, this means that the log-rank test might not be an appropriate test if the survival curves cross. However, this is still a topic of active debate; please see [4] and [5].

For brevity, we do not cover the maths behind the test. If you are interested, please see this article or [3].

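For readers who still want a feel for what the test computes, here is a minimal sketch of the standard two-group log-rank statistic in plain Python. At every event time it compares the observed number of events in group A with the number expected if both groups shared one survival function. The function name `logrank_two_groups` and the toy data are hypothetical, and the sketch assumes at least one event time at which both groups are still at risk (otherwise the variance is zero).

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e == 1)     # events, both groups
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of d_a at this time
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    stat = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical toy data: group A fails early, group B fails late (no censoring)
stat, p_value = logrank_two_groups([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(f"chi2={stat:.3f}, p={p_value:.4f}")
```

A large statistic (small p-value) means group A had noticeably more or fewer events than expected under the null hypothesis of identical survival functions.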
3. Common mistakes with Kaplan-Meier

In this part, I want to mention some of the common mistakes that can occur while working with the Kaplan-Meier estimator.

Removing censored data

It might be tempting to remove censored data, as it can significantly alter the shape of the Kaplan-Meier curve. However, doing so can lead to severe bias, so we should always include censored observations when fitting the model.
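The bias is easy to reproduce on a toy example. The sketch below evaluates the Kaplan-Meier estimate at one time point twice: once on all observations and once after dropping the censored ones. The helper `km_survival_at` and the data are hypothetical.

```python
def km_survival_at(durations, events, t_query):
    """Kaplan-Meier estimate of S(t_query), computed by hand."""
    survival = 1.0
    event_times = sorted({d for d, e in zip(durations, events) if e == 1})
    for t in event_times:
        if t > t_query:
            break
        n_i = sum(1 for d in durations if d >= t)  # at risk at t
        d_i = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1 - d_i / n_i
    return survival


# Hypothetical toy data: 1 = event observed, 0 = censored
durations = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]

# Keep only the observations whose event was actually observed
kept = [(d, e) for d, e in zip(durations, events) if e == 1]
durations_dropped = [d for d, _ in kept]
events_dropped = [e for _, e in kept]

full = km_survival_at(durations, events, 5)
dropped = km_survival_at(durations_dropped, events_dropped, 5)
print(f"S(5) with censored data: {full:.3f}, without: {dropped:.3f}")
```

Dropping the censored observations shrinks the risk sets, so every event removes a larger share of the sample and survival is underestimated.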

Interpreting the ends of the curves

Pay special attention when interpreting the ends of the survival curves, as any big drops close to the end of the study can be explained by only a few observations reaching that point in time (this should also be indicated by wider confidence intervals).

Dichotomizing continuous variables

By dichotomizing, I mean using the median or an "optimal" cut-off point to split any continuous metric into groups such as "low" and "high". This approach can create multiple problems:

  • Finding an "optimal" cut-off point can be very dataset-dependent and impossible to replicate in different studies. Also, by doing multiple comparisons, we increase the chance of false positives (finding a difference in the survival functions when there is actually none).
  • Dichotomizing decreases the power of the statistical test by forcing all measurements to a binary value, which in turn means that a much larger sample size may be required to detect an effect. It is also worth mentioning that in survival analysis, the required sample size refers to the number of observations with the event of interest.
  • When dichotomizing, we make poor assumptions about the distribution of risk among observations. Suppose we use the age of 50 as the split between young and old patients. We then assume that an 18-year-old is in the same risk group as a 49-year-old, which is not true in most cases.
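To put a number on the multiple-comparisons risk from the first point, the sketch below computes the chance of at least one false positive when trying k cut-off points, each tested at a significance level of 0.05. It assumes the tests are independent, which tests on overlapping splits of the same data are not, so treat it as a rough intuition rather than an exact figure. The function name `familywise_error_rate` is made up.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} cut-offs tried: {familywise_error_rate(k):.3f}")
```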

Accounting for only one predictor

The Kaplan-Meier estimator is a univariable method, as it approximates the survival function using at most one variable/predictor. Consequently, the results can easily be biased, either exaggerating or missing the signal. This is the so-called omitted-variable bias: the analysis attributes the potential effects of multiple predictors to the single one we take into account. When several predictors matter, multivariable methods such as the Cox regression should be used instead.

4. Example in Python

It is time to put what we have learned into practice. We start by importing all the required libraries.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from lifelines import KaplanMeierFitter 
from lifelines.statistics import (logrank_test, 
                                  pairwise_logrank_test, 
                                  multivariate_logrank_test, 
                                  survival_difference_at_fixed_point_in_time_test)


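# The 'seaborn' matplotlib style name was deprecated in matplotlib 3.6 and later removed,
# so we apply the seaborn theme directly instead.
sns.set_theme()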

Then, we load the dataset and do some light wrangling to make it work nicely with the lifelines library. For the analysis, we use the popular Telco Customer Churn dataset (available here or on my GitHub). The dataset contains client information from a telephone/internet provider, including the clients' tenure, the services they use, some demographic data, and ultimately the flag indicating churn.

df = pd.read_csv('../data/telco_customer_churn.csv')
df['churn'] = [1 if x == 'Yes' else 0 for x in df['Churn']]

For this analysis, we use the following columns:

  • tenure: the number of months the customer has stayed with the company,
  • churn: whether the customer churned (binary encoded: 1 if the event happened, 0 otherwise),
  • PaymentMethod: the payment method the customer used.

For the most basic scenario, we only need the time-to-event and the flag indicating whether the event of interest happened.

T = df['tenure']
E = df['churn']


kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)


kmf.plot(at_risk_counts=True)
plt.title('Kaplan-Meier Curve');

The KaplanMeierFitter works similarly to the classes known from scikit-learn: we first instantiate the object of the class and then use the fit method to fit the model to our data. While plotting, we specify at_risk_counts=True to additionally display information about the number of observations at risk at certain points of time.

Normally, we would be interested in the median survival time, that is, the point in time by which 50% of the population has already died or, in this case, churned. We can access it using the following line:

kmf.median_survival_time_

However, in this case, the command returns inf. As the survival curve shows, the estimated survival probability never drops to 0.5 within the observed period, so the median is not reached in our data.
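The inf result follows directly from how a median is read off a step curve. The sketch below uses a common convention (the first time at which the estimated survival is at or below 0.5); the helper `median_survival_time` and both toy curves are hypothetical and not taken from the Telco data.

```python
import math


def median_survival_time(curve):
    """curve: list of (time, survival) pairs sorted by time."""
    for t, s in curve:
        if s <= 0.5:
            return t
    # The curve never reaches 0.5, so the median is not observed
    return math.inf


reaches_half = [(2, 0.83), (3, 0.67), (5, 0.44), (8, 0.0)]
never_reaches_half = [(10, 0.9), (40, 0.75), (72, 0.6)]

print(median_survival_time(reaches_half))
print(median_survival_time(never_reaches_half))
```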

We have seen the basic use-case. Now let's make the analysis a bit more complex and plot the survival curves for each variant of the payment method. We can do so by running the following code:

ax = plt.subplot(111)


kmf = KaplanMeierFitter()


for payment_method in df['PaymentMethod'].unique():
    
    flag = df['PaymentMethod'] == payment_method
    
    kmf.fit(T[flag], event_observed=E[flag], label=payment_method)
    kmf.plot(ax=ax)


plt.title("Survival curves by payment methods");

Running the block of code generates the following plot:

We can see that the probability of survival is clearly the lowest for the electronic check, while the curves for the automatic bank transfer and credit card payments are very similar. This is a perfect time to use the log-rank test to see if they are actually different.

credit_card_flag = df['PaymentMethod'] == 'Credit card (automatic)'
bank_transfer_flag = df['PaymentMethod'] == 'Bank transfer (automatic)'


results = logrank_test(T[credit_card_flag], 
                       T[bank_transfer_flag], 
                       E[credit_card_flag], 
                       E[bank_transfer_flag])
results.print_summary()

The following table presents the results.

With a p-value of 0.35, we have no reason to reject the null hypothesis that the survival functions are identical. In this example, we only compared two methods of payment, but there are more combinations we could test. There is a handy function called pairwise_logrank_test, which makes the comparison very easy.

results = pairwise_logrank_test(df['tenure'], df['PaymentMethod'], df['churn'])
results.print_summary()

In the table, we see the previous comparison, as well as all the other combinations. Bank transfer vs. credit card is the only case in which we should not reject the null hypothesis. We should also be cautious when interpreting the results of the log-rank test: the plot above shows that the curves for the bank transfer and credit card payments actually cross, so the assumption of proportional hazards is violated.

There are two more things we can easily test using the lifelines library. The first one is the multivariate log-rank test, in which the null hypothesis states that all the groups have the same “death” generating process, so their survival curves are identical.

# A single test across all payment methods at once
results = multivariate_logrank_test(df['tenure'], df['PaymentMethod'], df['churn'])
results.print_summary()

The results of the test indicate that we should reject the null hypothesis, so the survival curves are not identical, which we have already seen in the plot.

Lastly, we can test the survival difference at a specific point in time. Coming back to the example, in the plot, we can see that the curves are furthest apart around t = 60. Let’s see if that difference is statistically significant.

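# Recent lifelines releases (0.25+) expect two fitted models here;
# older releases took the raw durations and event flags instead.
kmf_cc = KaplanMeierFitter().fit(T[credit_card_flag], event_observed=E[credit_card_flag])
kmf_bt = KaplanMeierFitter().fit(T[bank_transfer_flag], event_observed=E[bank_transfer_flag])

results = survival_difference_at_fixed_point_in_time_test(60, kmf_cc, kmf_bt)
results.print_summary()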

Judging by the test's p-value, there is no reason to reject the null hypothesis that the survival probabilities of the two groups do not differ at that point in time.

5. Conclusions

In this article, I described a very popular tool for conducting survival analysis: the Kaplan-Meier estimator. We also covered the log-rank test for comparing two or more survival functions. The described approach is very popular, but it is not without flaws. Before concluding, let's take a look at the pros and cons of the Kaplan-Meier estimator/curves.

Advantages:

  • Gives the average view of the population, overall and per group.
  • Does not require a lot of features, only the time-to-event and whether the event actually occurred. Additionally, we can use any categorical features describing groups.
  • Automatically handles class imbalance, as virtually any proportion of events to censored observations is acceptable.
  • As it is a non-parametric method, few assumptions are made about the underlying distribution of the data.

Disadvantages:

  • We cannot evaluate the magnitude of the predictor's impact on survival probability.
  • We cannot simultaneously account for multiple factors for observations, for example, the country of origin and the phone's operating system.
  • The assumption of independence between censoring and survival (at time t, censored observations should have the same prognosis as the ones without censoring) can be inapplicable/unrealistic.
  • When the underlying data distribution is (to some extent) known, the approach is not as accurate as some competing techniques.

Summing up, even with a few disadvantages, the Kaplan-Meier survival curves are a great place to start when conducting survival analysis. They can give us valuable insights into the potential predictors of survival and accelerate our progress with some more advanced techniques (which I will describe in future articles).

You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.

6. References

[1] Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481.

[2] Sawyer, S. (2003). The Greenwood and Exponential Greenwood Confidence Intervals in Survival Analysis.

[3] Kaplan-Meier Survival Curves and the Log-Rank Test.

[4] Non-proportional hazards - so what?

[5] Bouliotis, G., & Billingham, L. (2011). Crossing survival curves: alternatives to the log-rank test. Trials, 12(S1), A137.

Translated from: https://towardsdatascience.com/introduction-to-survival-analysis-the-kaplan-meier-estimator-94ec5812a97a
