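Before starting: this is a winter-break assignment completed at my advisor's request. The entire text is my own rough translation. If you have comments, or spot places where the translation misses the meaning, please raise them at any time or correct them directly in the comments.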
Paper title: Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network
Abstract: Computerized electrocardiogram (ECG) interpretation plays a critical role in the clinical ECG workflow. Widely available digital ECG data and the algorithmic paradigm of deep learning present an opportunity to substantially improve the accuracy and scalability of automated ECG analysis. However, a comprehensive evaluation of an end-to-end deep learning approach for ECG analysis across a wide variety of diagnostic classes has not been previously reported. Here, we develop a deep neural network (DNN) to classify 12 rhythm classes using 91,232 single-lead ECGs from 53,549 patients who used a single-lead ambulatory ECG monitoring device. When validated against an independent test dataset annotated by a consensus committee of board-certified practicing cardiologists, the DNN achieved an average area under the receiver operating characteristic curve (ROC) of 0.97. The average F1 score, which is the harmonic mean of the positive predictive value and sensitivity, for the DNN (0.837) exceeded that of average cardiologists (0.780). With specificity fixed at the average specificity achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes. These findings demonstrate that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with high diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, this approach could reduce the rate of misdiagnosed computerized ECG interpretations and improve the efficiency of expert human ECG interpretation by accurately triaging or prioritizing the most urgent conditions.
The electrocardiogram is a fundamental tool in the everyday practice of clinical medicine, with more than 300 million ECGs obtained annually worldwide. The ECG is pivotal for diagnosing a wide spectrum of abnormalities from arrhythmias to acute coronary syndrome. Computer-aided interpretation has become increasingly important in the clinical ECG workflow since its introduction over 50 years ago, serving as a crucial adjunct to physician interpretation in many clinical settings. However, existing commercial ECG interpretation algorithms still show substantial rates of misdiagnosis. The combination of widespread digitization of ECG data and the development of algorithmic paradigms that can benefit from large-scale processing of raw data presents an opportunity to reexamine the standard approach to algorithmic ECG analysis and may provide substantial improvements to automated ECG interpretation.
Substantial algorithmic advances in the past five years have been driven largely by a specific class of models known as deep neural networks. DNNs are computational models consisting of multiple processing layers, with each layer being able to learn increasingly abstract, higher-level representations of the input data relevant to perform specific tasks. They have dramatically improved the state of the art in speech recognition, image recognition, strategy games such as Go, and in medical applications. The ability of DNNs to recognize patterns and learn useful features from raw input data without requiring extensive data preprocessing, feature engineering or handcrafted rules makes them particularly well suited to interpret ECG data. Furthermore, since DNN performance tends to increase as the amount of training data increases, this approach is well positioned to take advantage of the widespread digitization of ECG data.
A comprehensive evaluation of whether an end-to-end deep learning approach can be used to analyze raw ECG data to classify a broad range of diagnoses remains lacking. Much of the previous work to employ DNNs toward ECG interpretation has focused on single aspects of the ECG processing pipeline, such as noise reduction or feature extraction, or has approached limited diagnostic tasks, detecting only a handful of heartbeat types (normal, ventricular or supraventricular ectopic, fusion, and so on) or rhythm diagnoses (most commonly atrial fibrillation or ventricular tachycardia). Lack of appropriate data has limited many efforts beyond these applications. Most prior efforts used data from the MIT-BIH Arrhythmia database (PhysioNet), which is limited by the small number of patients and rhythm episodes present in the dataset.
In this study, we constructed a large, novel ECG dataset that underwent expert annotation for a broad range of ECG rhythm classes. We developed a DNN to detect 12 rhythm classes from raw single-lead ECG inputs using a training dataset consisting of 91,232 ECG records from 53,549 patients. The DNN was designed to classify 10 arrhythmias as well as sinus rhythm and noise for a total of 12 output rhythm classes (Extended Data Fig. 1). ECG data were recorded by the Zio monitor, which is a Food and Drug Administration (FDA)-cleared, single-lead, patch-based ambulatory ECG monitor that continuously records data from a single vector (modified Lead II) at 200 Hz. The mean and median wear times of the Zio monitor in our dataset were 10.6 and 13.0 days, respectively. Mean age was 69±16 years and 43% were women. We validated the DNN on a test dataset that consisted of 328 ECG records collected from 328 unique patients, which was annotated by a consensus committee of expert cardiologists (see Methods). Mean age on the test dataset was 70±17 years and 38% were women. The mean inter-annotator agreement on the test dataset was 72.8%.
Supplementary Table 1 shows the number of unique patients exhibiting each rhythm class.
We first compared the performance of the DNN against the gold standard cardiologist consensus committee diagnoses by calculating the AUC (Table 1a). Since the DNN algorithm was designed to make a rhythm class prediction approximately once per second (see Methods), we report performance both as assessed once every second, which we call “sequence-level” and consists of one rhythm class per interval, and once per record, which we call “set-level” and consists of the group of unique diagnoses present in the record. Sequence-level metrics help capture the duration of an arrhythmia, such as its onset and offset within a record, whereas set-level metrics focus only on the existence of a rhythm class within a record. The DNN achieved an AUC of greater than 0.91 for all rhythm classes; at the sequence-level all but one AUC was above 0.97. The class-weighted average AUC was 0.978 at the sequence-level and 0.977 at the set-level. The model demonstrated high AUCs for arrhythmias of greater clinical significance such as AF, atrio-ventricular block, and ventricular tachycardia. The sequence and set-level results were similar, though sequence-level AUC was higher in the majority of cases. In sensitivity analyses, we calculated multi-class AUC using the method described by Hand and Till and results were materially unchanged. Supplementary Table 2 shows the maximum sensitivity achieved by the DNN with specificity >90%, and vice versa. With one exception, all sensitivity and specificity pairs were >90%.
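We first compared the DNN's results with the diagnoses of the cardiologist consensus committee, which serve as the gold standard, by computing the AUC (Table 1a). Because the DNN was designed to predict a rhythm label roughly once per second, performance is reported in two ways: once per second (sequence-level, one rhythm label per interval) and once per record (set-level, the set of distinct diagnoses present in the record). Sequence-level metrics capture the duration of an arrhythmia, such as its onset and offset, whereas set-level metrics consider only whether a rhythm label is present in the record. The DNN's AUC exceeded 0.91 for every rhythm label, and at the sequence level all but one AUC was above 0.97. The class-weighted average AUC was 0.978 at the sequence level and 0.977 at the set level. The model reached high AUCs for the more clinically significant arrhythmias, such as atrial fibrillation, atrioventricular block, and ventricular tachycardia. The two sets of results were broadly similar, although the sequence-level AUC was higher for most labels. As a sensitivity analysis, multi-class AUC was also computed with the method described by Hand and Till, and the results did not change materially. Supplementary Table 2 lists the maximum sensitivity the DNN achieved with specificity above 90%, and vice versa; with one exception, every sensitivity and specificity pair was above 90%.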
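The difference between the two evaluation levels can be sketched in a few lines of Python. This is only an illustration, not the authors' evaluation code: the label strings, the six-interval example record, and the helper names `sequence_level_pairs` and `set_level_labels` are all assumptions made for this example, and simple agreement stands in for the AUC and F1 metrics the paper actually reports.

```python
def sequence_level_pairs(truth, pred):
    """Pair up per-interval labels: one (truth, prediction) pair per interval."""
    if len(truth) != len(pred):
        raise ValueError("label sequences must have the same length")
    return list(zip(truth, pred))


def set_level_labels(labels):
    """Collapse a per-interval label sequence into the set of rhythms present."""
    return set(labels)


# Hypothetical record: six consecutive intervals, AF starting at the third.
committee = ["SINUS", "SINUS", "AF", "AF", "AF", "AF"]
# Hypothetical model output that detects the AF onset one interval late.
model = ["SINUS", "SINUS", "SINUS", "AF", "AF", "AF"]

pairs = sequence_level_pairs(committee, model)
seq_agreement = sum(t == p for t, p in pairs) / len(pairs)

# Sequence level: the late onset costs one of six intervals.
print(seq_agreement)
# Set level: both sequences contain the same rhythms, so they agree fully.
print(set_level_labels(committee) == set_level_labels(model))
```

In this toy record the late onset is penalized at the sequence level but is invisible at the set level, which is why the paper reports both.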
In addition to a cardiologist consensus committee annotation, each ECG record in the test dataset received annotations from six separate individual cardiologists who were not part of the committee (see Methods). Using the committee labels as the gold standard, we compared the DNN algorithm F1 score to the average individual cardiologist F1 score, which is the harmonic mean of the positive predictive value (PPV; precision) and sensitivity (recall) (Table 1). Cardiologist F1 scores were averaged over six individual cardiologists. The trend of DNN F1 scores tended to follow that of the averaged cardiologist F1 scores: both had lower F1 on similar classes, such as ventricular tachycardia and ectopic atrial rhythm (EAR). The set-level average F1 scores weighted by the frequency of each class for the DNN (0.837) exceeded those for the averaged cardiologist (0.780). We performed multiple sensitivity analyses, all of which were consistent with our main results: both AUC and F1 scores on the 10% development dataset (n=8,761) were materially unchanged from the test dataset results, although they were slightly higher (Supplementary Tables 3 and 4). In addition, we retrained the DNN holding out an additional 10% of the training dataset as a second held-out test dataset (n=8,768); the AUC and F1 scores for all rhythms were materially unchanged (Supplementary Tables 5 and 6). We note that unlike the primary test dataset, which has gold-standard annotations from a committee of cardiologists, both sensitivity analysis datasets are annotated by certified ECG technicians.
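The F1 score and the frequency-weighted average described above can be written out as a short Python sketch. The per-class PPV and sensitivity values and the class counts below are made-up numbers for illustration only, and the helper names `f1_score` and `weighted_average` are assumptions; they are not taken from the paper or from any particular library.

```python
def f1_score(ppv, sensitivity):
    """Harmonic mean of positive predictive value (precision) and sensitivity (recall)."""
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def weighted_average(scores, counts):
    """Average per-class scores, weighting each class by its frequency."""
    total = sum(counts[c] for c in scores)
    return sum(scores[c] * counts[c] for c in scores) / total


# Hypothetical per-class (PPV, sensitivity) pairs and class counts.
per_class = {"AF": (0.80, 0.90), "SINUS": (0.95, 0.85)}
counts = {"AF": 100, "SINUS": 300}

scores = {c: f1_score(p, s) for c, (p, s) in per_class.items()}
print(round(weighted_average(scores, counts), 3))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a class cannot reach a high F1 by trading away either precision or sensitivity.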
We plotted receiver operating characteristic curves (ROCs) and precision-recall curves for the sequence-level analyses of three example classes: atrial fibrillation; trigeminy; and AVB (Fig. 1a, b). Individual cardiologist performance and averaged cardiologist performance are plotted on the same figure. Extended Data Fig. 2 presents ROCs for all classes, showing that the model met or exceeded the averaged cardiologist performance for all rhythm classes. Fixing the specificity at the average specificity level achieved by cardiologists, the sensitivity of the DNN exceeded the average cardiologist sensitivity for all rhythm classes (Table 2). We used confusion matrices to illustrate the discordance between the DNN’s predictions (Fig. 2a) or averaged cardiologist predictions (Fig. 2b) and the committee consensus. The two confusion matrices exhibit a similar pattern, highlighting those rhythm classes that were generally more problematic to classify (that is, supraventricular tachycardia (SVT) versus atrial fibrillation, junctional versus sinus rhythm, and EAR versus sinus rhythm).
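We plotted the ROC and precision-recall curves of three rhythm labels (atrial fibrillation, trigeminy, and AVB) at the sequence level as examples, shown in Fig. 1a, b. The performance of each individual cardiologist and their averaged performance are drawn on the same figure. Extended Data Fig. 2 gives the ROC curves for all labels and shows that the model matched or exceeded the averaged cardiologist performance for every rhythm label. As Table 2 shows, with specificity fixed at the cardiologists' average specificity, the DNN's sensitivity exceeded the average cardiologist sensitivity for every label. We used confusion matrices to illustrate the disagreement between the committee consensus and the DNN's predictions (Fig. 2a) or the averaged cardiologist predictions (Fig. 2b). The two matrices show a similar pattern, highlighting the rhythm labels that were generally harder to classify (for example, SVT versus atrial fibrillation, junctional versus sinus rhythm, and EAR versus sinus rhythm).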
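Comparing sensitivities at a fixed specificity amounts to choosing a point on the model's ROC curve. The sketch below shows one way to do that for a single rhythm class. It is an assumption-laden illustration rather than the paper's procedure: the function name `sensitivity_at_specificity`, the rule "predict positive when score >= threshold", and the eight example scores are all hypothetical.

```python
def sensitivity_at_specificity(scores, labels, target_specificity):
    """Best sensitivity over thresholds whose specificity meets the target.

    scores: model probabilities for the positive class.
    labels: 1 for positive, 0 for negative (reference labels).
    A sample is predicted positive when its score >= threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    best = 0.0
    # Every observed score is a candidate threshold, plus one above the maximum.
    for threshold in sorted(set(scores)) + [float("inf")]:
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        if specificity >= target_specificity:
            best = max(best, sensitivity)
    return best


# Hypothetical scores and reference labels for one rhythm class.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

# Demanding a higher specificity leaves fewer usable thresholds,
# so the achievable sensitivity can only stay the same or drop.
print(sensitivity_at_specificity(scores, labels, 0.75))
print(sensitivity_at_specificity(scores, labels, 1.0))
```

Setting `target_specificity` to the cardiologists' average specificity for a class would give a model sensitivity that can be compared directly with the cardiologists' average sensitivity.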
Finally, to demonstrate the generalizability of our DNN architecture to external data, we applied our DNN to the 2017 PhysioNet Challenge data, which contained four rhythm classes: sinus rhythm; atrial fibrillation; noise; and other. Keeping our DNN architecture fixed and without any other hyper-parameter tuning, we trained our DNN on the publicly available training dataset (n=8,528), holding out a 10% development dataset for early stopping. DNN performance on the hidden test dataset (n=3,658) demonstrated overall F1 scores that were among those of the best performers from the competition (Supplementary Table 7), with a class average F1 of 0.83. This demonstrates the ability of our end-to-end DNN-based approach to generalize to a new set of rhythm labels on a different dataset.
Our study is the first comprehensive demonstration of a deep learning approach to perform classification across a broad range of the most common and important ECG rhythm diagnoses. Our DNN had an average class-weighted AUC of 0.97, with higher average F1 scores and sensitivities than cardiologists. These findings demonstrate that an end-to-end DNN approach has the potential to be used to improve the accuracy of algorithmic ECG interpretation. Recent algorithmic and computational advances compel us to revisit the standard approaches to automated ECG interpretation. Furthermore, algorithmic approaches whose performance improves as more data become available, such as deep learning, can leverage the widespread digitization of ECG data and provide clear opportunities to bring us closer to the ideal of a learning health care system. We emphasize our use in this study of a dataset large enough to evaluate an end-to-end deep learning approach to predict multiple diagnostic ECG classes, and our validation against the high standard of a cardiologist consensus committee. (Most cardiologists were subspecialized in rhythm abnormalities.) We believe this is the most clinically relevant gold standard, since cardiologists perform the final ECG diagnosis in nearly all clinical settings.
Our study demonstrates that the paradigm shift represented by end-to-end deep learning may enable a new approach to automated ECG analysis. The standard approach to automated ECG interpretation employs various techniques across a series of steps that include signal preprocessing, feature extraction, feature selection/reduction, and classification. At each step, hand-engineered heuristics and derivations of the raw ECG data are developed with the ultimate aim to improve classification for a given rhythm, such as atrial fibrillation. In contrast, DNNs enable an approach that is fundamentally different since a single algorithm can accomplish all of these steps ‘end-to-end’ without requiring class-specific feature extraction; in other words, the DNN can accept the raw ECG data as input and output diagnostic probabilities. With sufficient training data, using a DNN in this manner has the potential to learn all of the important previously manually derived features, along with as-yet-unrecognized features, in a data-driven way, and may learn shared features useful in predicting multiple classes. These properties of DNNs can serve to improve prediction performance, particularly since there is ample evidence to suggest that the currently recognized, manually derived ECG features represent only a fraction of the informative features for any diagnosis.
While artificial neural networks were first applied toward the interpretation of ECGs as early as two decades ago, until recently they only contained several layers and were constrained by algorithmic and computational limitations. More recent studies have employed deeper networks, although some only use DNNs to perform certain steps in the ECG processing pipeline, such as feature extraction or classification. End-to-end DNN approaches have been used more recently showing good performance for a limited set of ECG rhythms, such as atrial fibrillation, ventricular arrhythmias, or individual heartbeat classes. While these prior efforts demonstrated promising performance for specific rhythms, they do not provide a comprehensive evaluation of whether an end-to-end approach can perform well across a wide range of rhythm classes, in a manner similar to that encountered clinically. Our approach is unique in using a 34-layer network in an end-to-end manner to simultaneously output probabilities for a wide range of distinct rhythm diagnoses, all of which is enabled by our dataset, which is orders of magnitude larger than most other datasets of its kind. Distinct from some other recent DNN approaches, no substantial preprocessing of ECG data, such as Fourier or wavelet transforms, is needed to achieve strong classification performance.
Since arrhythmia detection is one of the most problematic tasks for existing ECG algorithms, if validated in clinical settings through clinical trials, our approach has the potential for substantial clinical impact. Paired with properly annotated digital ECG data, our approach has the potential to increase the overall accuracy of preliminary computerized ECG interpretations and can also be used to customize predictions to institution- or population-specific applications by additional training on institution-specific data. While expert provider confirmation will probably be appropriate in many clinical settings, the DNN could expand the capability of an expert over-reader in the clinical workflow, for example, by triaging urgent conditions or those for which the DNN has the least ‘confidence’. Since ECG data collected from different clinical applications range in duration from 10 s (standard 12-lead ECGs) to multiple days (single-lead ambulatory ECGs), the application of any algorithm, including ours, must ultimately be tailored to the target clinical application. For example, even at the performance characteristics we report, applying our algorithm sequentially across an ECG record of long duration would result in nontrivial false-positive diagnoses. Faced with a similar problem, cardiologists probably incorporate additional mechanisms to improve their diagnostic performance, such as taking advantage of the increased context or knowledge about arrhythmia epidemiology. Similarly, additional algorithmic steps or post-processing heuristics may be important before clinical application.
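The point about long recordings can be made concrete with a back-of-the-envelope calculation. The Python sketch below is purely illustrative: the 99% specificity is an assumed round number rather than a figure from the paper, predictions are treated as independent (which real consecutive ECG intervals are not), and `expected_false_positives` is a hypothetical helper name.

```python
def expected_false_positives(duration_s, predictions_per_s, specificity, prevalence=0.0):
    """Expected number of false-positive predictions over a recording.

    Assumes every prediction is independent and that a fraction
    `prevalence` of the intervals truly contain the rhythm.
    """
    n_predictions = duration_s * predictions_per_s
    n_negative = n_predictions * (1.0 - prevalence)
    return n_negative * (1.0 - specificity)


# One prediction per second, assumed specificity of 99% for a single rhythm class.
ten_seconds = expected_false_positives(10, 1, 0.99)            # standard 12-lead duration
ten_days = expected_false_positives(10 * 24 * 3600, 1, 0.99)   # multi-day ambulatory record

print(ten_seconds, ten_days)
```

Under these assumptions the same per-interval error rate yields roughly 0.1 expected false positives in a 10 s recording but thousands across a 10-day recording, which is why post-processing or added context matters for ambulatory use.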
An important finding from our study is that the DNN appears to recapitulate the misclassifications made by individual cardiologists, as demonstrated by the similarity in the confusion matrices for the model and cardiologists. Manual review of the discordances revealed that the DNN misclassifications overall appear very reasonable. In many cases, the lack of context, limited signal duration, or having a single lead limited the conclusions that could reasonably be drawn from the data, making it difficult to definitively ascertain whether the committee and/or the algorithm was correct. Similar factors, as well as human error, may explain the inter-annotator agreement of 72.8%.
Of the rhythm classes we examined, ventricular tachycardia is a clinically important rhythm for which the model had a lower F1 score than cardiologists, but interestingly had higher sensitivity (94.1%) than the averaged cardiologist (78.4%). Manual review of the 16 records misclassified by the DNN as ventricular tachycardia showed that ‘mistakes’ made by the algorithm were very reasonable. For example, ventricular tachycardia and idioventricular rhythm (IVR) differ only in the heart rate being above or below 100 beats per minute (b.p.m.), respectively. In 7 of the committee-labeled IVR cases, the record contained periods of heart rate ≥100 b.p.m., making ventricular tachycardia a reasonable classification by the DNN; the remaining 3 committee-labeled IVR records had rates close to 100 b.p.m. Of the 5 cases where the committee label was atrial fibrillation (4) or SVT (1), all but one displayed aberrant conduction, resulting in wide QRS complexes (the ECG waveform corresponding to ventricular activation) with a similar appearance to ventricular tachycardia. If we recategorize the 7 IVR records with a rate ≥100 b.p.m. as ventricular tachycardia, overall DNN performance on ventricular tachycardia exceeds that of cardiologists by F1 score, with a set-level F1 score of 0.82 (versus 0.77).
This study has several important limitations. Our input dataset is limited to single-lead ECG records obtained from an ambulatory monitor, which provides limited signal compared to a standard 12-lead ECG; it remains to be determined if our algorithm performance would be similar in 12-lead ECGs. However, it may be in applications such as this, which have lower signal-to-noise ratio and where the current standard of care leaves more room for improvement, that approaches such as deep learning may provide the greatest impact. As discussed earlier, a limitation facing this, or any algorithm, before clinical application would be tailoring it to the target application, which may require additional training or post-processing steps. Additionally, systematic differences in the way technicians versus cardiologists labeled records in our dataset could have decreased DNN performance, although we took precautions to limit this by establishing standard operating protocols for annotation. In addition, as revealed in our manual review of discordant predictions, in some cases there remains uncertainty in the correct label. Given the resource-intensive nature of cardiologist committee ECG annotation, our test dataset was limited to records from 328 patients; confidence intervals (CIs) with our test dataset size were acceptably narrow, as we report in Table 1, although our ability to perform subgroup analysis (such as by age/sex) is limited. Finally, we also note that to obtain a sufficient quantity of rare rhythms in our training and test datasets, we targeted patients exhibiting these rhythms during data extraction. This implies that prevalence-dependent metrics such as the F1 score would not be expected to generalize to the broader population.
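The last limitation, that F1 depends on prevalence, follows from Bayes' rule: sensitivity and specificity are properties of the classifier, but PPV also depends on how common the rhythm is. The Python sketch below illustrates this with an assumed operating point (sensitivity 0.90, specificity 0.95) and assumed prevalences; none of these numbers come from the paper, and the helper names `ppv` and `f1` are hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


def f1(sensitivity, specificity, prevalence):
    """F1 score implied by a fixed sensitivity/specificity at a given prevalence."""
    precision = ppv(sensitivity, specificity, prevalence)
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical operating point; only the prevalence changes between rows.
for prevalence in (0.20, 0.05, 0.01):
    print(prevalence, round(f1(0.90, 0.95, prevalence), 3))
```

With the classifier held fixed, F1 falls as the rhythm becomes rarer, so F1 values measured on a dataset enriched for rare rhythms would be expected to overstate F1 in an unselected population.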
In summary, we demonstrate that an end-to-end deep learning approach can classify a broad range of distinct arrhythmias from single-lead ECGs with high diagnostic performance similar to that of cardiologists. If confirmed in clinical settings, this approach has the potential to improve the accuracy, efficiency, and scalability of ECG interpretation.