Translation of Andrew Ng's Nature Medicine Paper

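Before we start: this was a winter-break assignment from my advisor. The translation is entirely my own rough work, so if you have suggestions or spot passages that miss the meaning, please raise them at any time or correct them directly in the comments.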

Paper link: https://www.nature.com/articles/s41591-018-0268-3

OK, let's begin:

Paper title: Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network

Cardiologist-level detection and classification of arrhythmias in ambulatory ECGs with a deep neural network

Abstract: Computerized electrocardiogram (ECG) interpretation plays a critical role in the clinical ECG workflow. Widely available digital ECG data and the algorithmic paradigm of deep learning present an opportunity to substantially improve the accuracy and scalability of automated ECG analysis. However, a comprehensive evaluation of an end-to-end deep learning approach for ECG analysis across a wide variety of diagnostic classes has not been previously reported. Here, we develop a deep neural network (DNN) to classify 12 rhythm classes using 91,232 single-lead ECGs from 53,549 patients who used a single-lead ambulatory ECG monitoring device. When validated against an independent test dataset annotated by a consensus committee of board-certified practicing cardiologists, the DNN achieved an average area under the receiver operating characteristic curve (ROC) of 0.97. The average F1 score, which is the harmonic mean of the positive predictive value and sensitivity, for the DNN (0.837) exceeded that of average cardiologists (0.780). With specificity fixed at the average specificity achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes. These findings demonstrate that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with high diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, this approach could reduce the rate of misdiagnosed computerized ECG interpretations and improve the efficiency of expert human ECG interpretation by accurately triaging or prioritizing the most urgent conditions.

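Computerized ECG interpretation plays an important role in the clinical ECG workflow. Widely available digital ECG data, together with the deep learning paradigm, offer a chance to substantially improve the accuracy and scalability of automated ECG analysis. So far, however, no comprehensive evaluation of an end-to-end deep learning approach to ECG analysis across many diagnostic classes has been reported. Here, using 91,232 single-lead ECGs from 53,549 patients recorded with a single-lead ambulatory ECG monitor, we develop a deep neural network (DNN) that distinguishes 12 rhythm classes. Validated against an independent test dataset annotated by a consensus committee of board-certified practicing cardiologists, the DNN reached an average area under the ROC curve of 0.97. Its average F1 score, the harmonic mean of positive predictive value and sensitivity, was 0.837, above the cardiologists' average of 0.780. With specificity fixed at the average specificity achieved by cardiologists, the DNN's sensitivity exceeded the average cardiologist sensitivity for every rhythm class. These findings show that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, the approach could lower the misdiagnosis rate of computerized ECG interpretation and make expert interpretation more efficient by accurately triaging or prioritizing the most urgent conditions.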

The electrocardiogram is a fundamental tool in the everyday practice of clinical medicine, with more than 300 million ECGs obtained annually worldwide. The ECG is pivotal for diagnosing a wide spectrum of abnormalities from arrhythmias to acute coronary syndrome. Computer-aided interpretation has become increasingly important in the clinical ECG workflow since its introduction over 50 years ago, serving as a crucial adjunct to physician interpretation in many clinical settings. However, existing commercial ECG interpretation algorithms still show substantial rates of misdiagnosis. The combination of widespread digitization of ECG data and the development of algorithmic paradigms that can benefit from large-scale processing of raw data presents an opportunity to reexamine the standard approach to algorithmic ECG analysis and may provide substantial improvements to automated ECG interpretation.

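The ECG is a basic tool in everyday clinical practice, with more than 300 million recorded worldwide each year. It is central to diagnosing a wide range of abnormalities, from arrhythmias to acute coronary syndrome. Since its introduction more than 50 years ago, computer-aided interpretation has grown steadily more important in the clinical ECG workflow and now serves as a key aid to physician interpretation in many clinical settings. Even so, existing commercial ECG interpretation algorithms still have substantial misdiagnosis rates. Widespread digitization of ECG data, combined with algorithmic paradigms that benefit from large-scale processing of raw data, offers a chance to re-examine the standard approach to algorithmic ECG analysis and may substantially improve automated interpretation.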

Substantial algorithmic advances in the past five years have been driven largely by a specific class of models known as deep neural networks. DNNs are computational models consisting of multiple processing layers, with each layer being able to learn increasingly abstract, higher-level representations of the input data relevant to perform specific tasks. They have dramatically improved the state of the art in speech recognition, image recognition, strategy games such as Go, and in medical applications. The ability of DNNs to recognize patterns and learn useful features from raw input data without requiring extensive data preprocessing, feature engineering or handcrafted rules makes them particularly well suited to interpret ECG data. Furthermore, since DNN performance tends to increase as the amount of training data increases, this approach is well positioned to take advantage of the widespread digitization of ECG data.

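Over the past five years, algorithmic progress has been driven largely by DNNs. A DNN is built from multiple processing layers, each able to learn increasingly abstract, higher-level representations of the input that are relevant to the task at hand. DNNs have markedly advanced the state of the art in speech recognition, image recognition, strategy games such as Go, and medical applications. Because they can recognize patterns and learn useful features from raw input without extensive preprocessing, feature engineering or handcrafted rules, they are particularly well suited to interpreting ECG data. Moreover, since DNN performance tends to improve as the training data grow, the approach is well placed to benefit from the widespread digitization of ECG data.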

A comprehensive evaluation of whether an end-to-end deep learning approach can be used to analyze raw ECG data to classify a broad range of diagnoses remains lacking. Much of the previous work to employ DNNs toward ECG interpretation has focused on single aspects of the ECG processing pipeline, such as noise reduction or feature extraction, or has approached limited diagnostic tasks, detecting only a handful of heartbeat types (normal, ventricular or supraventricular ectopic, fusion, and so on) or rhythm diagnoses (most commonly atrial fibrillation or ventricular tachycardia). Lack of appropriate data has limited many efforts beyond these applications. Most prior efforts used data from the MIT-BIH Arrhythmia database (PhysioNet), which is limited by the small number of patients and rhythm episodes present in the dataset.

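A comprehensive evaluation of whether end-to-end deep learning can analyze raw ECG data and classify a broad range of diagnoses is still missing. Most earlier work applying DNNs to ECG interpretation focused on a single part of the processing pipeline, such as noise reduction or feature extraction, or addressed limited diagnostic tasks: detecting a handful of heartbeat types (normal, ventricular or supraventricular ectopic, fusion, and so on) or rhythm diagnoses (most often atrial fibrillation or ventricular tachycardia). A lack of suitable data has held back many attempts to go beyond these applications. Most prior studies used the MIT-BIH Arrhythmia database (PhysioNet), which contains only a small number of patients and rhythm episodes.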

In this study, we constructed a large, novel ECG dataset that underwent expert annotation for a broad range of ECG rhythm classes. We developed a DNN to detect 12 rhythm classes from raw single-lead ECG inputs using a training dataset consisting of 91,232 ECG records from 53,549 patients. The DNN was designed to classify 10 arrhythmias as well as sinus rhythm and noise for a total of 12 output rhythm classes (Extended Data Fig. 1). ECG data were recorded by the Zio monitor, which is a Food and Drug Administration (FDA)-cleared, single-lead, patch-based ambulatory ECG monitor that continuously records data from a single vector (modified Lead II) at 200 Hz. The mean and median wear time of the Zio monitor in our dataset was 10.6 and 13.0 days, respectively. Mean age was 69±16 years and 43% were women. We validated the DNN on a test dataset that consisted of 328 ECG records collected from 328 unique patients, which was annotated by a consensus committee of expert cardiologists (see Methods). Mean age on the test dataset was 70±17 years and 38% were women. The mean inter-annotator agreement on the test dataset was 72.8%.

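We built a large, new ECG dataset annotated by experts across a broad range of rhythm classes. We developed a DNN that detects 12 rhythm classes from raw single-lead ECG input, trained on 91,232 ECG records from 53,549 patients. The model outputs 12 classes in total: 10 arrhythmias, sinus rhythm, and noise (Extended Data Fig. 1). The ECGs were recorded with the Zio monitor, an FDA-cleared, single-lead, patch-based ambulatory ECG monitor that records continuously from a single vector (modified Lead II) at 200 Hz. In our dataset the mean and median wear times were 10.6 and 13.0 days, respectively; mean age was 69±16 years and 43% of patients were women. The test dataset used for validation consisted of 328 ECG records from 328 unique patients, labeled by a consensus committee of expert cardiologists (see Methods). Mean age in the test dataset was 70±17 years and 38% were women. The mean inter-annotator agreement on the test dataset was 72.8%.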

Supplementary Table 1 shows the number of unique patients exhibiting each rhythm class.

Supplementary Table 1 lists the number of unique patients exhibiting each rhythm class.

We first compared the performance of the DNN against the gold standard cardiologist consensus committee diagnoses by calculating the AUC (Table 1a). Since the DNN algorithm was designed to make a rhythm class prediction approximately once per second (see Methods), we report performance both as assessed once every second — which we call “sequence-level” and consists of one rhythm class per interval—and once per record, which we call “set-level” and consists of the group of unique diagnoses present in the record. Sequence-level metrics help capture the duration of an arrhythmia, such as its onset and offset within a record, whereas set-level metrics focus only on the existence of a rhythm class within a record. The DNN achieved an AUC of greater than 0.91 for all rhythm classes; at the sequence-level all but one AUC was above 0.97. The class-weighted average AUC was 0.978 at the sequence-level and 0.977 at the set-level. The model demonstrated high AUCs for arrhythmias of greater clinical significance such as AF, atrio-ventricular block, and ventricular tachycardia. The sequence and set-level results were similar, though sequence-level AUC was higher in the majority of cases. In sensitivity analyses, we calculated multi-class AUC using the method described by Hand and Till and results were materially unchanged. Supplementary Table 2 shows the maximum sensitivity achieved by the DNN with specificity >90%, and vice versa. With one exception, all sensitivity and specificity pairs were >90%.

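We first compared the DNN with the gold-standard diagnoses of the cardiologist consensus committee by computing the AUC (Table 1a). Because the DNN was designed to predict a rhythm class roughly once per second (see Methods), we report performance in two ways: "sequence-level", assessed once per second with one rhythm class per interval, and "set-level", assessed once per record using the set of unique diagnoses present in that record. Sequence-level metrics capture the duration of an arrhythmia, such as its onset and offset within a record, while set-level metrics consider only whether a rhythm class occurs in the record. The DNN achieved an AUC above 0.91 for every rhythm class; at the sequence level, all but one AUC exceeded 0.97. The class-weighted average AUC was 0.978 at the sequence level and 0.977 at the set level. The model had high AUCs for the more clinically significant arrhythmias, such as atrial fibrillation, atrioventricular block, and ventricular tachycardia. Sequence-level and set-level results were similar, although the sequence-level AUC was higher in most cases. As a sensitivity analysis, we also computed a multi-class AUC with the method of Hand and Till, and the results were materially unchanged. Supplementary Table 2 gives the maximum sensitivity the DNN achieved with specificity above 90%, and vice versa. With one exception, all sensitivity and specificity pairs were above 90%.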
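To make the two reporting levels concrete, here is a minimal Python sketch. It is not the paper's evaluation code: the label names, the toy record, and the helper names are hypothetical, and plain per-interval agreement stands in for the AUC and F1 metrics the paper actually reports.

```python
from typing import Dict, List, Set


def to_set_level(sequence: List[str]) -> Set[str]:
    """Collapse a per-interval label sequence into the set of unique labels."""
    return set(sequence)


def sequence_level_agreement(pred: List[str], truth: List[str]) -> float:
    """Fraction of intervals where the predicted label equals the reference label."""
    if len(pred) != len(truth):
        raise ValueError("sequences must have the same length")
    matches = sum(p == t for p, t in zip(pred, truth))
    return matches / len(truth)


def class_weighted_average(scores: Dict[str, float], counts: Dict[str, int]) -> float:
    """Average per-class scores, weighting each class by its frequency."""
    total = sum(counts[c] for c in scores)
    return sum(scores[c] * counts[c] for c in scores) / total


# Hypothetical record with one label per interval (names are illustrative only).
truth = ["SINUS", "SINUS", "AF", "AF", "AF", "SINUS"]
pred = ["SINUS", "AF", "AF", "AF", "SINUS", "SINUS"]

# Sequence level: onset/offset errors are penalized interval by interval.
print(sequence_level_agreement(pred, truth))
# Set level: only the presence of each rhythm class in the record matters.
print(to_set_level(pred) == to_set_level(truth))
```

In this toy record the onset and offset of the AF episode are both misplaced by one interval, which lowers the sequence-level figure, while the set-level comparison still matches because the same two rhythm classes are present.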

In addition to a cardiologist consensus committee annotation, each ECG record in the test dataset received annotations from six separate individual cardiologists who were not part of the committee (see Methods). Using the committee labels as the gold standard, we compared the DNN algorithm F1 score to the average individual cardiologist F1 score, which is the harmonic mean of the positive predictive value (PPV; precision) and sensitivity (recall) (Table 1). Cardiologist F1 scores were averaged over six individual cardiologists. The trend of DNN F1 scores tended to follow that of the averaged cardiologist F1 scores: both had lower F1 on similar classes, such as ventricular tachycardia and ectopic atrial rhythm (EAR). The set-level average F1 scores weighted by the frequency of each class for the DNN (0.837) exceeded those for the averaged cardiologist (0.780). We performed multiple sensitivity analyses, all of which were consistent with our main results: both AUC and F1 scores on the 10% development dataset (n=8,761) were materially unchanged from the test dataset results, although they were slightly higher (Supplementary Tables 3 and 4). In addition, we retrained the DNN holding out an additional 10% of the training dataset as a second held-out test dataset (n=8,768); the AUC and F1 scores for all rhythms were materially unchanged (Supplementary Tables 5 and 6). We note that unlike the primary test dataset, which has gold-standard annotations from a committee of cardiologists, both sensitivity analysis datasets are annotated by certified ECG technicians.

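Besides the committee annotation, every ECG record in the test dataset was annotated by six individual cardiologists who were not committee members (see Methods). Taking the committee labels as the gold standard, we compared the DNN's F1 score with the average F1 score of the individual cardiologists; F1 is the harmonic mean of the positive predictive value (precision) and sensitivity (recall) (Table 1). Cardiologist F1 scores were averaged over the six cardiologists. The DNN's F1 scores tended to track the averaged cardiologist scores: both were lower on the same classes, such as ventricular tachycardia and ectopic atrial rhythm (EAR). The set-level average F1 weighted by class frequency was higher for the DNN (0.837) than for the averaged cardiologist (0.780). We ran several sensitivity analyses, all consistent with the main results: on the 10% development dataset (n=8,761), AUC and F1 scores were slightly higher than on the test dataset but materially unchanged (Supplementary Tables 3 and 4). We also retrained the DNN while holding out a further 10% of the training dataset as a second held-out test dataset (n=8,768); AUC and F1 scores for all rhythms were again materially unchanged (Supplementary Tables 5 and 6). Note that, unlike the primary test dataset with its committee gold-standard labels, both sensitivity-analysis datasets were annotated by certified ECG technicians.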
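The F1 definition used above (harmonic mean of PPV and sensitivity) and the frequency-weighted average can be written out as a small Python sketch. The function names and example counts are hypothetical and are not taken from the paper's tables.

```python
from typing import Dict


def f1_score(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of positive predictive value (precision) and sensitivity (recall)."""
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Equivalent F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def frequency_weighted_f1(per_class_f1: Dict[str, float], class_counts: Dict[str, int]) -> float:
    """Average per-class F1 scores, weighting each class by how often it occurs."""
    total = sum(class_counts[c] for c in per_class_f1)
    return sum(per_class_f1[c] * class_counts[c] for c in per_class_f1) / total


# Hypothetical counts for one rhythm class: PPV = 8/10, sensitivity = 8/10.
print(f1_from_counts(tp=8, fp=2, fn=2))
print(f1_score(ppv=0.8, sensitivity=0.8))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a class with high sensitivity but low PPV can still end up with a modest F1, which is the situation the text later describes for ventricular tachycardia.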

We plotted receiver operating characteristic curves (ROCs) and precision-recall curves for the sequence-level analyses of three example classes: atrial fibrillation; trigeminy; and AVB (Fig. 1a, b). Individual cardiologist performance and averaged cardiologist performance are plotted on the same figure. Extended Data Fig. 2 presents ROCs for all classes, showing that the model met or exceeded the averaged cardiologist performance for all rhythm classes. Fixing the specificity at the average specificity level achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes (Table 2). We used confusion matrices to illustrate the discordance between the DNN’s predictions (Fig. 2a) or averaged cardiologist predictions (Fig. 2b) and the committee consensus. The two confusion matrices exhibit a similar pattern, highlighting those rhythm classes that were generally more problematic to classify (that is, supraventricular tachycardia (SVT) versus atrial fibrillation, junctional versus sinus rhythm, and EAR versus sinus rhythm).

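We plotted ROC curves and precision-recall curves for the sequence-level analysis of three example classes: atrial fibrillation, trigeminy, and AVB (Fig. 1a, b). Individual and averaged cardiologist performance are shown on the same figure. Extended Data Fig. 2 gives ROC curves for all classes and shows that the model met or exceeded averaged cardiologist performance for every rhythm class. With specificity fixed at the average level achieved by cardiologists, the DNN's sensitivity exceeded the average cardiologist sensitivity for all rhythm classes (Table 2). We used confusion matrices to show where the DNN's predictions (Fig. 2a) and the averaged cardiologist predictions (Fig. 2b) disagreed with the committee consensus. The two matrices show a similar pattern and highlight the rhythm classes that were generally harder to classify: SVT versus atrial fibrillation, junctional versus sinus rhythm, and EAR versus sinus rhythm.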
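One way to read "sensitivity at a fixed specificity" is as a threshold search on the model's per-class output score. The sketch below is an assumption about how such a comparison can be done, not the authors' procedure; the scores, labels, and function name are made up for illustration.

```python
from typing import List, Tuple


def sensitivity_at_specificity(
    scores: List[float], labels: List[int], target_specificity: float
) -> Tuple[float, float]:
    """Return (threshold, sensitivity) for the lowest threshold whose
    specificity is at least target_specificity.

    A sample is called positive when score >= threshold.
    labels: 1 for the rhythm class of interest, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    # Candidate thresholds: every observed score, plus one above the maximum.
    candidates = sorted(set(scores)) + [float("inf")]
    for threshold in candidates:
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            # Specificity only rises with the threshold, so the first hit
            # is also the one with the highest sensitivity.
            sensitivity = sum(s >= threshold for s in positives) / len(positives)
            return threshold, sensitivity
    raise ValueError("target_specificity must be at most 1.0")


# Hypothetical scores for one class; 0.75 plays the role of the
# average cardiologist specificity.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(sensitivity_at_specificity(scores, labels, target_specificity=0.75))
```

The resulting sensitivity can then be compared directly with the cardiologists' average sensitivity, since both are measured at the same specificity.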

Finally, to demonstrate the generalizability of our DNN architecture to external data, we applied our DNN to the 2017 PhysioNet Challenge data (https://physionet.org/challenge/2017/), which contained four rhythm classes: sinus rhythm; atrial fibrillation; noise; and other. Keeping our DNN architecture fixed and without any other hyper-parameter tuning, we trained our DNN on the publicly available training dataset (n=8,528), holding out a 10% development dataset for early stopping. DNN performance on the hidden test dataset (n=3,658) demonstrated overall F1 scores that were among those of the best performers from the competition (Supplementary Table 7), with a class average F1 of 0.83. This demonstrates the ability of our end-to-end DNN-based approach to generalize to a new set of rhythm labels on a different dataset.

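Finally, to test how well our DNN architecture generalizes to external data, we applied it to the 2017 PhysioNet Challenge data (https://physionet.org/challenge/2017/), which has four rhythm classes: sinus rhythm, atrial fibrillation, noise, and other. Keeping the architecture fixed and doing no other hyper-parameter tuning, we trained the DNN on the public training dataset (n=8,528), holding out 10% of it as a development dataset for early stopping. On the hidden test dataset (n=3,658), the DNN's overall F1 scores were among those of the best performers in the competition (Supplementary Table 7), with a class-average F1 of 0.83. This shows that our end-to-end DNN approach can generalize to a new set of rhythm labels on a different dataset.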
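Holding out 10% of the training records for early stopping can be sketched as follows. The paper does not state the stopping rule, so the patience-based criterion, the random seed, and the loss values below are assumptions chosen only to illustrate the idea.

```python
import random
from typing import List, Sequence, Tuple


def split_train_dev(
    n_records: int, dev_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle record indices and hold out dev_fraction of them as a development set."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_dev = int(round(n_records * dev_fraction))
    return indices[n_dev:], indices[:n_dev]


def early_stopping_epoch(dev_losses: Sequence[float], patience: int = 3) -> int:
    """Return the epoch with the best development loss, giving up once
    `patience` consecutive epochs fail to improve on it."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


train_idx, dev_idx = split_train_dev(8528)  # size of the public training set
print(len(train_idx), len(dev_idx))

# Hypothetical development-set losses, one per epoch.
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.5]))
```

Note that with a patience of 3 the scan stops before reaching the final, lower loss; how long to wait is a design choice that trades training time against the risk of stopping too early.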

Our study is the first comprehensive demonstration of a deep learning approach to perform classification across a broad range of the most common and important ECG rhythm diagnoses. Our DNN had an average class-weighted AUC of 0.97, with higher average F1 scores and sensitivities than cardiologists. These findings demonstrate that an end-to-end DNN approach has the potential to be used to improve the accuracy of algorithmic ECG interpretation. Recent algorithmic and computational advances compel us to revisit the standard approaches to automated ECG interpretation. Furthermore, algorithmic approaches whose performance improves as more data become available, such as deep learning, can leverage the widespread digitization of ECG data and provide clear opportunities to bring us closer to the ideal of a learning health care system. We emphasize our use in this study of a dataset large enough to evaluate an end-to-end deep learning approach to predict multiple diagnostic ECG classes, and our validation against the high standard of a cardiologist consensus committee. (Most cardiologists were subspecialized in rhythm abnormalities.) We believe this is the most clinically relevant gold standard, since cardiologists perform the final ECG diagnosis in nearly all clinical settings.

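Ours is the first comprehensive demonstration of a deep learning approach that classifies a broad range of the most common and important ECG rhythm diagnoses. The DNN had an average class-weighted AUC of 0.97, with higher average F1 scores and sensitivities than cardiologists. These findings indicate that an end-to-end DNN approach has the potential to improve the accuracy of algorithmic ECG interpretation. Recent algorithmic and computational advances compel us to revisit the standard approaches to automated ECG interpretation. In addition, approaches such as deep learning, whose performance improves as more data become available, can take advantage of the widespread digitization of ECG data and bring us closer to the ideal of a learning health care system. We stress two points: the dataset used here is large enough to evaluate an end-to-end deep learning approach that predicts multiple diagnostic ECG classes, and the model was validated against the high standard of a cardiologist consensus committee (most of whose members were subspecialized in rhythm abnormalities). We believe this is the most clinically relevant gold standard, because cardiologists make the final ECG diagnosis in nearly all clinical settings.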

Our study demonstrates that the paradigm shift represented by end-to-end deep learning may enable a new approach to automated ECG analysis. The standard approach to automated ECG interpretation employs various techniques across a series of steps that include signal preprocessing, feature extraction, feature selection/reduction, and classification. At each step, hand-engineered heuristics and derivations of the raw ECG data are developed with the ultimate aim to improve classification for a given rhythm, such as atrial fibrillation. In contrast, DNNs enable an approach that is fundamentally different since a single algorithm can accomplish all of these steps ‘end-to-end’ without requiring class-specific feature extraction; in other words, the DNN can accept the raw ECG data as input and output diagnostic probabilities. With sufficient training data, using a DNN in this manner has the potential to learn all of the important previously manually derived features, along with as-yet-unrecognized features, in a data-driven way, and may learn shared features useful in predicting multiple classes. These properties of DNNs can serve to improve prediction performance, particularly since there is ample evidence to suggest that the currently recognized, manually derived ECG features represent only a fraction of the informative features for any diagnosis.

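Our study shows that the paradigm shift represented by end-to-end deep learning may enable a new approach to automated ECG analysis. The standard approach applies different techniques across a series of steps: signal preprocessing, feature extraction, feature selection/reduction, and classification. At each step, hand-engineered heuristics and quantities derived from the raw ECG are developed with the aim of improving classification of a given rhythm, such as atrial fibrillation. A DNN is fundamentally different: a single algorithm carries out all of these steps end to end, with no class-specific feature extraction. In other words, it takes raw ECG data as input and outputs diagnostic probabilities. Given enough training data, a DNN used this way can potentially learn, in a data-driven manner, all of the important features that were previously derived by hand along with features not yet recognized, and may learn shared features that help predict several classes. These properties can improve prediction performance, especially since there is ample evidence that the manually derived ECG features recognized today are only a fraction of the informative features for any diagnosis.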

While artificial neural networks were first applied toward the interpretation of ECGs as early as two decades ago, until recently they only contained several layers and were constrained by algorithmic and computational limitations. More recent studies have employed deeper networks, although some only use DNNs to perform certain steps in the ECG processing pipeline, such as feature extraction or classification. End-to-end DNN approaches have been used more recently showing good performance for a limited set of ECG rhythms, such as atrial fibrillation, ventricular arrhythmias, or individual heartbeat classes. While these prior efforts demonstrated promising performance for specific rhythms, they do not provide a comprehensive evaluation of whether an end-to-end approach can perform well across a wide range of rhythm classes, in a manner similar to that encountered clinically. Our approach is unique in using a 34-layer network in an end-to-end manner to simultaneously output probabilities for a wide range of distinct rhythm diagnoses, all of which is enabled by our dataset, which is orders of magnitude larger than most other datasets of its kind. Distinct from some other recent DNN approaches, no substantial preprocessing of ECG data, such as Fourier or wavelet transforms, is needed to achieve strong classification performance.

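Artificial neural networks were applied to ECG interpretation as early as two decades ago, but until recently they had only a few layers and were constrained by algorithmic and computational limits. More recent studies use deeper networks, though some apply DNNs only to certain steps of the ECG processing pipeline, such as feature extraction or classification. End-to-end DNN approaches have appeared more recently and perform well on a limited set of ECG rhythms, such as atrial fibrillation, ventricular arrhythmias, or individual heartbeat classes. While these efforts showed promising performance on specific rhythms, they do not give a comprehensive evaluation of whether an end-to-end approach performs well across a wide range of rhythm classes, as encountered clinically. Our approach is unique in using a 34-layer network end to end to output probabilities for a wide range of distinct rhythm diagnoses simultaneously. This is made possible by our dataset, which is orders of magnitude larger than most other datasets of its kind. Unlike some other recent DNN approaches, ours needs no substantial preprocessing of the ECG data, such as Fourier or wavelet transforms, to achieve strong classification performance.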

Since arrhythmia detection is one of the most problematic tasks for existing ECG algorithms, if validated in clinical settings through clinical trials, our approach has the potential for substantial clinical impact. Paired with properly annotated digital ECG data, our approach has the potential to increase the overall accuracy of preliminary computerized ECG interpretations and can also be used to customize predictions to institution- or population-specific applications by additional training on institution-specific data. While expert provider confirmation will probably be appropriate in many clinical settings, the DNN could expand the capability of an expert over-reader in the clinical workflow, for example, by triaging urgent conditions or those for which the DNN has the least ‘confidence’. Since ECG data collected from different clinical applications range in duration from 10s (standard 12-lead ECGs) to multiple days (single-lead ambulatory ECGs), the application of any algorithm, including ours, must ultimately be tailored to the target clinical application. For example, even at the performance characteristics we report, applying our algorithm sequentially across an ECG record of long duration would result in nontrivial false-positive diagnoses. Faced with a similar problem, cardiologists probably incorporate additional mechanisms to improve their diagnostic performance, such as taking advantage of the increased context or knowledge about arrhythmia epidemiology. Similarly, additional algorithmic steps or post-processing heuristics may be important before clinical application.

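Arrhythmia detection is one of the most problematic tasks for existing ECG algorithms, so if our approach is validated in clinical settings through clinical trials, it could have substantial clinical impact. Paired with properly annotated digital ECG data, it could raise the overall accuracy of preliminary computerized ECG interpretations, and with additional training on institution-specific data it could be customized to a particular institution or population. Expert confirmation will probably remain appropriate in many clinical settings, but the DNN could extend what an expert over-reader can do in the clinical workflow, for example by triaging urgent conditions or those for which the DNN has the least "confidence". ECG recordings range from 10 s (standard 12-lead ECGs) to several days (single-lead ambulatory ECGs) depending on the clinical application, so any algorithm, ours included, must ultimately be tailored to its target application. For example, even with the performance we report, applying our algorithm sequentially across a long ECG record would produce a nontrivial number of false-positive diagnoses. Cardiologists facing a similar problem probably rely on additional mechanisms to improve their diagnostic performance, such as using the wider context or their knowledge of arrhythmia epidemiology. Likewise, additional algorithmic steps or post-processing heuristics may be important before clinical use.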
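The remark about false positives on long recordings follows from simple arithmetic, sketched here in Python. The one-prediction-per-second rate comes from the text; the 99% specificity is a hypothetical value, and the calculation assumes every interval is truly negative for the class in question.

```python
def expected_false_positive_intervals(
    duration_hours: float, specificity: float, predictions_per_second: float = 1.0
) -> float:
    """Expected number of intervals wrongly flagged for one rhythm class,
    assuming the class is truly absent throughout the recording."""
    n_intervals = duration_hours * 3600 * predictions_per_second
    return n_intervals * (1.0 - specificity)


# One day of monitoring at a hypothetical 99% per-interval specificity.
print(expected_false_positive_intervals(duration_hours=24, specificity=0.99))

# The mean wear time reported for the dataset was 10.6 days.
print(expected_false_positive_intervals(duration_hours=10.6 * 24, specificity=0.99))
```

Under these assumptions a single day already yields several hundred falsely flagged intervals, which is why post-processing or added context would matter for multi-day ambulatory recordings.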

An important finding from our study is that the DNN appears to recapitulate the misclassifications made by individual cardiologists, as demonstrated by the similarity in the confusion matrices for the model and cardiologists. Manual review of the discordances revealed that the DNN misclassifications overall appear very reasonable. In many cases, the lack of context, limited signal duration, or having a single lead limited the conclusions that could reasonably be drawn from the data, making it difficult to definitively ascertain whether the committee and/or the algorithm was correct. Similar factors, as well as human error, may explain the inter-annotator agreement of 72.8%.

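An important finding is that the DNN appears to reproduce the misclassifications made by individual cardiologists, as shown by the similarity of the model's and the cardiologists' confusion matrices. Manual review of the disagreements showed that the DNN's misclassifications were, on the whole, very reasonable. In many cases the lack of context, the limited signal duration, or the single lead restricted what could reasonably be concluded from the data, making it hard to establish definitively whether the committee and/or the algorithm was correct. Similar factors, together with human error, may explain the inter-annotator agreement of 72.8%.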
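A confusion matrix of the kind compared above can be tallied from paired labels. This is a generic sketch with invented labels, not a reconstruction of the paper's Fig. 2; the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (reference label, predicted label)


def confusion_counts(reference: List[str], predicted: List[str]) -> Dict[Pair, int]:
    """Count how often each reference label was assigned each predicted label."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    return dict(Counter(zip(reference, predicted)))


def top_confusions(counts: Dict[Pair, int], k: int = 3) -> List[Tuple[Pair, int]]:
    """Return the k most frequent off-diagonal (misclassified) pairs."""
    off_diagonal = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    return sorted(off_diagonal, key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical committee labels and model predictions.
reference = ["SVT", "SVT", "AF", "SINUS", "EAR", "EAR", "SINUS"]
predicted = ["AF", "AF", "AF", "SINUS", "SINUS", "EAR", "SINUS"]

counts = confusion_counts(reference, predicted)
print(top_confusions(counts))
```

Running the same tally once for the model and once for the averaged cardiologists, then comparing the largest off-diagonal cells, is one simple way to check whether both make the same kinds of errors.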

Of the rhythm classes we examined, ventricular tachycardia is a clinically important rhythm for which the model had a lower F1 score than cardiologists, but interestingly had higher sensitivity (94.1%) than the averaged cardiologist (78.4%). Manual review of the 16 records misclassified by the DNN as ventricular tachycardia showed that ‘mistakes’ made by the algorithm were very reasonable. For example, ventricular tachycardia and idioventricular rhythm (IVR) differ only in the heart rate being above or below 100 beats per minute (b.p.m.), respectively. In 7 of the committee-labeled IVR cases, the record contained periods of heart rate ≥100 b.p.m., making ventricular tachycardia a reasonable classification by the DNN; the remaining 3 committee-labeled IVR records had rates close to 100 b.p.m. Of the 5 cases where the committee label was atrial fibrillation (4) or SVT (1), all but one displayed aberrant conduction, resulting in wide QRS complexes (the ECG waveform corresponding to ventricular activation) with a similar appearance to ventricular tachycardia. If we recategorize the 7 IVR records with a rate ≥100 b.p.m. as ventricular tachycardia, overall DNN performance on ventricular tachycardia exceeds that of cardiologists by F1 score, with a set-level F1 score of 0.82 (versus 0.77).

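Among the rhythm classes we examined, ventricular tachycardia is clinically important. The model's F1 score for it was lower than the cardiologists', yet its sensitivity (94.1%) was higher than that of the averaged cardiologist (78.4%). Manual review of the 16 records that the DNN misclassified as ventricular tachycardia showed that these "mistakes" were very reasonable. For example, ventricular tachycardia and idioventricular rhythm (IVR) differ only in whether the heart rate is above or below 100 b.p.m. In 7 of the records the committee labeled IVR, the heart rate was ≥100 b.p.m. for part of the record, which makes ventricular tachycardia a reasonable classification; the other 3 committee-labeled IVR records had rates close to 100 b.p.m. In the 5 cases where the committee label was atrial fibrillation (4) or SVT (1), all but one showed aberrant conduction, which produces wide QRS complexes that resemble ventricular tachycardia. If the 7 IVR records with a rate ≥100 b.p.m. are recategorized as ventricular tachycardia, the DNN's set-level F1 score on ventricular tachycardia exceeds that of cardiologists (0.82 versus 0.77).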
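The rate cut-off that separates the two ventricular rhythms can be expressed as a tiny rule. This is only an illustration of the 100 b.p.m. boundary quoted in the text, with hypothetical function names and rates; it assumes the rhythm has already been identified as ventricular and is not a diagnostic procedure.

```python
from typing import List

RATE_THRESHOLD_BPM = 100.0  # boundary quoted in the text


def ventricular_rhythm_label(rate_bpm: float) -> str:
    """Label a ventricular rhythm by rate: VT at or above the threshold, IVR below.

    Treating exactly 100 b.p.m. as VT follows the ">=100 b.p.m." wording above.
    """
    return "VT" if rate_bpm >= RATE_THRESHOLD_BPM else "IVR"


def record_reaches_vt_rate(segment_rates_bpm: List[float]) -> bool:
    """True if any segment of the record reaches the VT rate range."""
    return any(rate >= RATE_THRESHOLD_BPM for rate in segment_rates_bpm)


# Hypothetical record whose rate drifts across the boundary.
segment_rates = [92.0, 97.0, 104.0, 98.0]
print([ventricular_rhythm_label(r) for r in segment_rates])
print(record_reaches_vt_rate(segment_rates))
```

A record like this one shows why a single label can be ambiguous: most segments fall in the IVR range, yet one segment crosses into the VT range.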

This study has several important limitations. Our input dataset is limited to single-lead ECG records obtained from an ambulatory monitor, which provides limited signal compared to a standard 12-lead ECG; it remains to be determined if our algorithm performance would be similar in 12-lead ECGs. However, it may be in applications such as this, which have lower signal-to-noise ratio and where the current standard of care leaves more room for improvement, that approaches such as deep learning may provide the greatest impact. As discussed earlier, a limitation facing this, or any algorithm, before clinical application would be tailoring it to the target application, which may require additional training or post-processing steps. Additionally, systematic differences in the way technicians versus cardiologists labeled records in our dataset could have decreased DNN performance, although we took precautions to limit this by establishing standard operating protocols for annotation. In addition, as revealed in our manual review of discordant predictions, in some cases there remains uncertainty in the correct label. Given the resource-intensive nature of cardiologist committee ECG annotation, our test dataset was limited to records from 328 patients; confidence intervals (CIs) with our test dataset size were acceptably narrow, as we report in Table 1, although our ability to perform subgroup analysis (such as by age/sex) is limited. Finally, we also note that to obtain a sufficient quantity of rare rhythms in our training and test datasets, we targeted patients exhibiting these rhythms during data extraction. This implies that prevalence-dependent metrics such as the F1 score would not be expected to generalize to the broader population.

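This study has several important limitations. First, our input data are single-lead ECG records from an ambulatory monitor, which carry less signal than a standard 12-lead ECG; whether our algorithm would perform similarly on 12-lead ECGs remains to be determined. However, it may be precisely in applications like this one, with a lower signal-to-noise ratio and more room for improvement over the current standard of care, that deep learning has the greatest impact. As discussed above, this or any algorithm would need to be tailored to its target application before clinical use, which may require additional training or post-processing steps. In addition, systematic differences between how technicians and cardiologists labeled records in our dataset could have lowered DNN performance, although we tried to limit this by establishing standard operating protocols for annotation. As our manual review of discordant predictions showed, the correct label also remains uncertain in some cases. Because committee annotation by cardiologists is resource-intensive, our test dataset was limited to records from 328 patients; the confidence intervals at this test-set size were acceptably narrow, as reported in Table 1, but our ability to perform subgroup analysis (for example by age or sex) is limited. Finally, to obtain enough examples of rare rhythms in the training and test datasets, we targeted patients exhibiting these rhythms during data extraction. Prevalence-dependent metrics such as the F1 score would therefore not be expected to generalize to the broader population.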
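Why F1 depends on prevalence can be shown with Bayes' rule: for fixed sensitivity and specificity, the positive predictive value falls as the rhythm becomes rarer, and F1 falls with it. The numbers in this sketch are hypothetical and are not taken from the paper.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value implied by sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def f1_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """F1 (harmonic mean of PPV and sensitivity) at a given prevalence."""
    precision = ppv(sensitivity, specificity, prevalence)
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# Same hypothetical classifier (90% sensitivity, 90% specificity),
# evaluated on an enriched sample and on a population where the rhythm is rare.
for prevalence in (0.5, 0.1, 0.01):
    print(prevalence, round(f1_at_prevalence(0.9, 0.9, prevalence), 3))
```

Sensitivity, specificity and AUC do not change with prevalence in this calculation, which is why they transfer more readily from an enriched test set to the general population than F1 does.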

In summary, we demonstrate that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with high diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, this approach has the potential to improve the accuracy, efficiency, and scalability of ECG interpretation.

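In summary, we show that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, the approach has the potential to improve the accuracy, efficiency, and scalability of ECG interpretation.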
