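I previously reposted someone else's introduction to the TCGA barcode. I ran into a related problem over the last couple of days, reread that post, and found it not very clear, so here is another pass at the topic.
Anyone who has worked with TCGA data will often have handled the first 15 characters of a TCGA barcode (sometimes the first 12). As the figure above shows, the full barcode design is actually 28 characters long.
Each hyphen-separated field carries a different meaning, as shown in the figure below.
The fields are explained in the following table: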
Label | Identifier for | Value | Value Description | Possible Values |
---|---|---|---|---|
Analyte | Molecular type of analyte for analysis | D | The analyte is a DNA sample | See Code Tables Report |
Plate | Order of plate in a sequence of 96-well plates | 182 | The 182nd plate | 4-digit alphanumeric value |
Portion | Order of portion in a sequence of 100 - 120 mg sample portions | 1 | The first portion of the sample | 01-99 |
Vial | Order of sample in a sequence of samples | C | The third vial | A to Z |
Project | Project name | TCGA | TCGA project | TCGA |
Sample | Sample type | 1 | A solid tumor | Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29. See Code Tables Report for a complete list of sample codes |
Center | Sequencing or characterization center that will receive the aliquot for analysis | 1 | The Broad Institute GCC | See Code Tables Report |
Participant | Study participant | 1 | The first participant from MD Anderson for GBM study | Any alpha-numeric value |
TSS | Tissue source site | 2 | GBM (brain tumor) sample from MD Anderson | See Code Tables Report |
Among these, the field that matters most for distinguishing sample types is Sample.
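As a minimal sketch of how the table maps onto character positions, the helper below splits a full 28-character barcode with base R and classifies the Sample code using the ranges from the table (01-09 tumor, 10-19 normal, 20-29 control). The name `parse_tcga_barcode` is hypothetical and not part of any package. The first example barcode is consistent with the values in the table; the second is made up to show a normal sample.

```r
# Hypothetical helper for illustration; assumes full 28-character barcodes.
parse_tcga_barcode = function(barcode) {
  sample_code = as.integer(substr(barcode, 14, 15))
  data.frame(
    barcode     = barcode,
    project     = substr(barcode, 1, 4),
    tss         = substr(barcode, 6, 7),
    participant = substr(barcode, 9, 12),
    sample      = substr(barcode, 14, 15),
    vial        = substr(barcode, 16, 16),
    portion     = substr(barcode, 18, 19),
    analyte     = substr(barcode, 20, 20),
    plate       = substr(barcode, 22, 25),
    center      = substr(barcode, 27, 28),
    # ranges follow the "Possible Values" column of the table
    sample_type = ifelse(sample_code <= 9, "tumor",
                  ifelse(sample_code <= 19, "normal", "control")),
    stringsAsFactors = FALSE
  )
}

parsed = parse_tcga_barcode(c("TCGA-02-0001-01C-01D-0182-01",
                              "TCGA-02-0001-10A-01D-0182-01"))
parsed[, c("sample", "vial", "analyte", "plate", "sample_type")]
```

The positions used here (analyte at 20, portion at 18-19, plate at 22-25) are the same ones the filtering function later in this post uses as defaults.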
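Viewed as a hierarchy (a tree), the components of the barcode look like this:
Reference: https://docs.gdc.cancer.gov/Encyclopedia/pages/images/TCGA-TCGAbarcode-080518-1750-4378.pdf
As you can see, a single sample (one tissue block from one patient) is split into many analytical aliquots during experimental processing, most notably at the plate level. As a result, several barcodes can map to the same sample in a real analysis (that is, their first 15 characters are identical). Which one should be used?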
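A small sketch of how such replicates can be spotted with base R: group the barcodes by their first 15 characters and keep the groups with more than one member. The three barcodes are borrowed from the plate-number test near the end of this post; the variable names are only illustrative.

```r
tsb = c("TCGA-A6-2684-01A-01R-1410-07",
        "TCGA-A6-2684-01A-01R-A278-07",
        "TCGA-A6-6650-01A-11R-A278-07")

sample_id = substr(tsb, 1, 15)            # project-TSS-participant-sample
groups = split(tsb, sample_id)            # one list entry per sample
replicates = groups[lengths(groups) > 1]  # samples with several aliquots
replicates
```

Here only `TCGA-A6-2684-01` has two aliquots, and they differ only in the plate field.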
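My earlier TCGA analyses mostly used data from UCSC Xena and the Broad Institute (already level 4), which handle this problem reasonably well. Recently, however, I ran into exactly this problem with data downloaded from GDC and had to solve it myself.
A Google search turned up a Biostars thread discussing this problem, and following the link given there I found the Broad Institute's strategy for de-duplicating barcodes:
The main content is as follows: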
In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
Analyte Replicate Filter
The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
Sort Replicate Filter
The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
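In short, this comes down to roughly three points:
- For RNA analyses, analyte precedence is H > R > T
- For DNA analyses, analyte precedence is D > G, W, X (unless the G, W, or X aliquot has a higher plate number)
- If replicates still remain after the filters above, compare the portion and plate fields and keep the larger one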
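The last point (the Sort Replicate Filter) can be sketched on its own as follows. This is only an illustration of the "highest lexicographical sort value" rule, not Broad's implementation, and it deliberately ignores analyte precedence. The function name `sort_replicate_filter` is hypothetical, the plate code `2240` in the third barcode is made up, and `method = "radix"` is used so that comparison follows the C locale regardless of the system locale.

```r
# Sketch of the sort-replicate rule only: per 15-character sample ID,
# keep the barcode that sorts highest.
sort_replicate_filter = function(tsb) {
  tsb = unique(tsb)
  picked = tapply(tsb, substr(tsb, 1, 15),
                  function(b) sort(b, decreasing = TRUE, method = "radix")[1])
  unname(as.character(picked))
}

kept = sort_replicate_filter(c("TCGA-55-7913-01B-09T-2237-01",
                               "TCGA-55-7913-01B-11T-2237-01",
                               "TCGA-55-7913-01B-11T-2240-01"))
kept
```

Because portion comes before plate in the barcode, sorting the whole string compares portion first and plate second, which is why a single sort covers both fields.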
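In addition, formalin-treated (FFPE) samples are not used in the analysis (their DNA and RNA data are distorted, though TCGA has already taken this into account).
So I wrote a function to handle this problem: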
tcgaReplicateFilter = function(tsb, analyte_target=c("DNA","RNA"), decreasing=TRUE, analyte_position=20, plate=c(22,25), portion=c(18,19), filter_FFPE=FALSE, full_barcode=FALSE){
  # Usually supplying tsb and analyte_target is enough. To filter FFPE
  # samples, set both filter_FFPE and full_barcode to TRUE; tsb must
  # then contain full 28-character barcodes.
  analyte_target = match.arg(analyte_target)
  # Character comparison in R is largely lexicographic
  # see ?base::Comparison
  # filter FFPE samples
  # provide by
  if(full_barcode && filter_FFPE){
ffpe = c("TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01",
"TCGA-44-2656-01B-06D-A276-05", "TCGA-44-2656-01B-06D-A27C-26",
"TCGA-44-2656-01B-06R-A277-07", "TCGA-44-2662-01B-02D-A271-08",
"TCGA-44-2662-01B-02D-A273-01", "TCGA-44-2662-01B-02R-A277-07",
"TCGA-44-2665-01B-06D-A271-08", "TCGA-44-2665-01B-06D-A273-01",
"TCGA-44-2665-01B-06D-A276-05", "TCGA-44-2665-01B-06R-A277-07",
"TCGA-44-2666-01B-02D-A271-08", "TCGA-44-2666-01B-02D-A273-01",
"TCGA-44-2666-01B-02D-A276-05", "TCGA-44-2666-01B-02D-A27C-26",
"TCGA-44-2666-01B-02R-A277-07", "TCGA-44-2668-01B-02D-A271-08",
"TCGA-44-2668-01B-02D-A273-01", "TCGA-44-2668-01B-02D-A276-05",
"TCGA-44-2668-01B-02D-A27C-26", "TCGA-44-2668-01B-02R-A277-07",
"TCGA-44-3917-01B-02D-A271-08", "TCGA-44-3917-01B-02D-A273-01",
"TCGA-44-3917-01B-02D-A276-05", "TCGA-44-3917-01B-02D-A27C-26",
"TCGA-44-3917-01B-02R-A277-07", "TCGA-44-3918-01B-02D-A271-08",
"TCGA-44-3918-01B-02D-A273-01", "TCGA-44-3918-01B-02D-A276-05",
"TCGA-44-3918-01B-02D-A27C-26", "TCGA-44-3918-01B-02R-A277-07",
"TCGA-44-4112-01B-06D-A271-08", "TCGA-44-4112-01B-06D-A273-01",
"TCGA-44-4112-01B-06D-A276-05", "TCGA-44-4112-01B-06D-A27C-26",
"TCGA-44-4112-01B-06R-A277-07", "TCGA-44-5645-01B-04D-A271-08",
"TCGA-44-5645-01B-04D-A273-01", "TCGA-44-5645-01B-04D-A276-05",
"TCGA-44-5645-01B-04D-A27C-26", "TCGA-44-5645-01B-04R-A277-07",
"TCGA-44-6146-01B-04D-A271-08", "TCGA-44-6146-01B-04D-A273-01",
"TCGA-44-6146-01B-04D-A276-05", "TCGA-44-6146-01B-04D-A27C-26",
"TCGA-44-6146-01B-04R-A277-07", "TCGA-44-6146-01B-04R-A27D-13",
"TCGA-44-6147-01B-06D-A271-08", "TCGA-44-6147-01B-06D-A273-01",
"TCGA-44-6147-01B-06D-A276-05", "TCGA-44-6147-01B-06D-A27C-26",
"TCGA-44-6147-01B-06R-A277-07", "TCGA-44-6147-01B-06R-A27D-13",
"TCGA-44-6775-01C-02D-A271-08", "TCGA-44-6775-01C-02D-A273-01",
"TCGA-44-6775-01C-02D-A276-05", "TCGA-44-6775-01C-02D-A27C-26",
"TCGA-44-6775-01C-02R-A277-07", "TCGA-44-6775-01C-02R-A27D-13",
"TCGA-A6-2674-01B-04D-A270-10", "TCGA-A6-2674-01B-04R-A277-07",
"TCGA-A6-2677-01B-02D-A270-10", "TCGA-A6-2677-01B-02D-A274-01",
"TCGA-A6-2677-01B-02D-A27A-05", "TCGA-A6-2677-01B-02D-A27E-26",
"TCGA-A6-2677-01B-02R-A277-07", "TCGA-A6-2684-01C-08D-A270-10",
"TCGA-A6-2684-01C-08D-A274-01", "TCGA-A6-2684-01C-08D-A27A-05",
"TCGA-A6-2684-01C-08D-A27E-26", "TCGA-A6-2684-01C-08R-A277-07",
"TCGA-A6-3809-01B-04D-A270-10", "TCGA-A6-3809-01B-04D-A274-01",
"TCGA-A6-3809-01B-04D-A27A-05", "TCGA-A6-3809-01B-04D-A27E-26",
"TCGA-A6-3809-01B-04R-A277-07", "TCGA-A6-3810-01B-04D-A270-10",
"TCGA-A6-3810-01B-04D-A274-01", "TCGA-A6-3810-01B-04D-A27A-05",
"TCGA-A6-3810-01B-04D-A27E-26", "TCGA-A6-3810-01B-04R-A277-07",
"TCGA-A6-5656-01B-02D-A270-10", "TCGA-A6-5656-01B-02D-A274-01",
"TCGA-A6-5656-01B-02D-A27A-05", "TCGA-A6-5656-01B-02D-A27E-26",
"TCGA-A6-5656-01B-02R-A277-07", "TCGA-A6-5656-01B-02R-A27D-13",
"TCGA-A6-5659-01B-04D-A270-10", "TCGA-A6-5659-01B-04D-A274-01",
"TCGA-A6-5659-01B-04D-A27A-05", "TCGA-A6-5659-01B-04D-A27E-26",
"TCGA-A6-5659-01B-04R-A277-07", "TCGA-A6-6650-01B-02D-A270-10",
"TCGA-A6-6650-01B-02D-A274-01", "TCGA-A6-6650-01B-02D-A27A-05",
"TCGA-A6-6650-01B-02D-A27E-26", "TCGA-A6-6650-01B-02R-A277-07",
"TCGA-A6-6650-01B-02R-A27D-13", "TCGA-A6-6780-01B-04D-A270-10",
"TCGA-A6-6780-01B-04D-A274-01", "TCGA-A6-6780-01B-04D-A27A-05",
"TCGA-A6-6780-01B-04D-A27E-26", "TCGA-A6-6780-01B-04R-A277-07",
"TCGA-A6-6780-01B-04R-A27D-13", "TCGA-A6-6781-01B-06D-A270-10",
"TCGA-A6-6781-01B-06D-A274-01", "TCGA-A6-6781-01B-06D-A27A-05",
"TCGA-A6-6781-01B-06R-A277-07", "TCGA-A6-6781-01B-06R-A27D-13",
"TCGA-A7-A0DB-01C-02D-A272-09", "TCGA-A7-A0DB-01C-02R-A277-07",
"TCGA-A7-A0DB-01C-02R-A27D-13", "TCGA-A7-A13D-01B-04D-A272-09",
"TCGA-A7-A13D-01B-04R-A277-07", "TCGA-A7-A13D-01B-04R-A27D-13",
"TCGA-A7-A13E-01B-06D-A272-09", "TCGA-A7-A13E-01B-06R-A277-07",
"TCGA-A7-A13E-01B-06R-A27D-13", "TCGA-A7-A26E-01B-06D-A272-09",
"TCGA-A7-A26E-01B-06D-A275-01", "TCGA-A7-A26E-01B-06D-A27B-05",
"TCGA-A7-A26E-01B-06R-A277-07", "TCGA-A7-A26E-01B-06R-A27D-13",
"TCGA-A7-A26J-01B-02D-A272-09", "TCGA-A7-A26J-01B-02D-A275-01",
"TCGA-A7-A26J-01B-02D-A27B-05", "TCGA-A7-A26J-01B-02D-A27F-26",
"TCGA-A7-A26J-01B-02R-A277-07", "TCGA-A7-A26J-01B-02R-A27D-13",
"TCGA-B2-3923-01B-10D-A270-10", "TCGA-B2-3923-01B-10R-A277-07",
"TCGA-B2-3923-01B-10R-A27D-13", "TCGA-B2-3924-01B-03D-A270-10",
"TCGA-B2-3924-01B-03D-A274-01", "TCGA-B2-3924-01B-03D-A27A-05",
"TCGA-B2-3924-01B-03D-A27E-26", "TCGA-B2-3924-01B-03R-A277-07",
"TCGA-B2-3924-01B-03R-A27D-13", "TCGA-B2-5633-01B-04D-A270-10",
"TCGA-B2-5633-01B-04D-A274-01", "TCGA-B2-5633-01B-04D-A27A-05",
"TCGA-B2-5633-01B-04D-A27E-26", "TCGA-B2-5633-01B-04R-A277-07",
"TCGA-B2-5633-01B-04R-A27D-13", "TCGA-B2-5635-01B-04D-A270-10",
"TCGA-B2-5635-01B-04D-A274-01", "TCGA-B2-5635-01B-04D-A27A-05",
"TCGA-B2-5635-01B-04D-A27E-26", "TCGA-B2-5635-01B-04R-A277-07",
"TCGA-B2-5635-01B-04R-A27D-13", "TCGA-BK-A0CA-01B-02D-A272-09",
"TCGA-BK-A0CA-01B-02D-A275-01", "TCGA-BK-A0CA-01B-02D-A27B-05",
"TCGA-BK-A0CA-01B-02D-A27F-26", "TCGA-BK-A0CA-01B-02R-A277-07",
"TCGA-BK-A0CA-01B-02R-A27D-13", "TCGA-BK-A0CC-01B-04D-A272-09",
"TCGA-BK-A0CC-01B-04D-A275-01", "TCGA-BK-A0CC-01B-04D-A27B-05",
"TCGA-BK-A0CC-01B-04R-A277-07", "TCGA-BK-A0CC-01B-04R-A27D-13",
"TCGA-BK-A139-01C-08D-A272-09", "TCGA-BK-A139-01C-08D-A275-01",
"TCGA-BK-A139-01C-08D-A27B-05", "TCGA-BK-A139-01C-08D-A27F-26",
"TCGA-BK-A139-01C-08R-A277-07", "TCGA-BK-A139-01C-08R-A27D-13",
"TCGA-BK-A26L-01C-04D-A272-09", "TCGA-BK-A26L-01C-04D-A275-01",
"TCGA-BK-A26L-01C-04D-A27B-05", "TCGA-BK-A26L-01C-04D-A27F-26",
"TCGA-BK-A26L-01C-04R-A277-07", "TCGA-BK-A26L-01C-04R-A27D-13",
"TCGA-BL-A0C8-01B-04D-A271-08", "TCGA-BL-A0C8-01B-04D-A273-01",
"TCGA-BL-A0C8-01B-04D-A276-05", "TCGA-BL-A0C8-01B-04D-A27C-26",
"TCGA-BL-A0C8-01B-04R-A277-07", "TCGA-BL-A0C8-01B-04R-A27D-13",
"TCGA-BL-A13I-01B-04D-A271-08", "TCGA-BL-A13I-01B-04D-A276-05",
"TCGA-BL-A13I-01B-04R-A277-07", "TCGA-BL-A13I-01B-04R-A27D-13",
"TCGA-BL-A13J-01B-04D-A271-08", "TCGA-BL-A13J-01B-04D-A273-01",
"TCGA-BL-A13J-01B-04D-A276-05", "TCGA-BL-A13J-01B-04D-A27C-26",
"TCGA-BL-A13J-01B-04R-A277-07", "TCGA-BL-A13J-01B-04R-A27D-13")
    tsb = setdiff(tsb, tsb[which(tsb %in% ffpe)])
  }

  # find repeated samples
  sampleID = substr(tsb, start = 1, stop = 15)
  dp_samples = unique(sampleID[duplicated(sampleID)])

  if (length(dp_samples) == 0) {
    message("ooo No duplicated barcodes found, returning original input..")
    return(tsb)
  }

  uniq_tsb = tsb[!sampleID %in% dp_samples]
  dp_tsb = setdiff(tsb, uniq_tsb)
  add_tsb = c()

  # analyte filter. DNA: D > G,W,X; RNA: H > R > T
  analyte_order = if (analyte_target == "DNA") "D" else c("H", "R", "T")
  for (x in dp_samples) {
    mulaliquots = dp_tsb[substr(dp_tsb, 1, 15) == x]
    analytes = substr(mulaliquots, start = analyte_position, stop = analyte_position)
    for (a in analyte_order) {
      if (any(analytes == a) && !all(analytes == a)) {
        aliquot = mulaliquots[analytes == a]
        if (length(aliquot) == 1) {
          # exactly one aliquot with the preferred analyte: keep it
          add_tsb = c(add_tsb, aliquot)
          dp_tsb = setdiff(dp_tsb, mulaliquots)
        } else {
          # several aliquots share the preferred analyte: drop the others
          # and let the portion/plate filters decide among them
          dp_tsb = setdiff(dp_tsb, mulaliquots[analytes != a])
        }
        break
      }
    }
  }

  if (length(dp_tsb) == 0) {
    message("ooo Filter barcodes successfully!")
    return(c(uniq_tsb, add_tsb))
  }

  # filter according to portion number
  sampleID_res = substr(dp_tsb, start = 1, stop = 15)
  dp_samples_res = unique(sampleID_res[duplicated(sampleID_res)])
  for (x in dp_samples_res) {
    mulaliquots = dp_tsb[substr(dp_tsb, 1, 15) == x]
    portion_codes = substr(mulaliquots, start = portion[1], stop = portion[2])
    portion_keep = sort(portion_codes, decreasing = decreasing)[1]
    if (!all(portion_codes == portion_keep)) {
      if (sum(portion_codes == portion_keep) == 1) {
        add_tsb = c(add_tsb, mulaliquots[portion_codes == portion_keep])
        dp_tsb = setdiff(dp_tsb, mulaliquots)
      } else {
        dp_tsb = setdiff(dp_tsb, mulaliquots[portion_codes != portion_keep])
      }
    }
  }

  if (length(dp_tsb) == 0) {
    message("ooo Filter barcodes successfully!")
    return(c(uniq_tsb, add_tsb))
  }

  # filter according to plate number; if several aliquots share the chosen
  # plate code, the full barcode is sorted so that one aliquot per sample
  # remains. Every remaining sample is visited, including those left with
  # a single barcode (e.g. exact duplicates in the input).
  dp_samples_res = unique(substr(dp_tsb, start = 1, stop = 15))
  for (x in dp_samples_res) {
    mulaliquots = dp_tsb[substr(dp_tsb, 1, 15) == x]
    plate_codes = substr(mulaliquots, start = plate[1], stop = plate[2])
    plate_keep = sort(plate_codes, decreasing = decreasing)[1]
    candidates = mulaliquots[plate_codes == plate_keep]
    add_tsb = c(add_tsb, sort(candidates, decreasing = decreasing)[1])
    dp_tsb = setdiff(dp_tsb, mulaliquots)
  }

  if (length(dp_tsb) == 0) {
    message("ooo Filter barcodes successfully!")
  } else {
    message("ooo barcodes ", paste(dp_tsb, collapse = ", "), " failed in filter process, other barcodes will be returned.")
  }
  c(uniq_tsb, add_tsb)
}
The function has been debugged and revised, and checked with the following tests:
DNA and RNA replicate tests:
#> # DNA analyte replicate filter
#> tcgaReplicateFilter(tsb = c("TCGA-55-7913-01B-11D-2237-01", "TCGA-55-7913-01B-11X-2237-01", "TCGA-55-7913-01B-11D-2237-01"))
#ooo Filter barcodes successfully!
#[1] "TCGA-55-7913-01B-11D-2237-01"
#> # RNA analyte replicate filter
#> tcgaReplicateFilter(tsb = c("TCGA-55-7913-01B-11H-2237-01", "TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11T-2237-01"), analyte_target = "RNA")
#ooo Filter barcodes successfully!
#[1] "TCGA-55-7913-01B-11H-2237-01"
#> tcgaReplicateFilter(tsb = c("TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11T-2237-01"), analyte_target = "RNA")
#ooo Filter barcodes successfully!
#[1] "TCGA-55-7913-01B-11R-2237-01"
#> tcgaReplicateFilter(tsb = c("TCGA-55-7913-01B-11T-2237-01", "TCGA-55-7913-01B-11T-2237-01", "TCGA-55-7913-01B-11D-2237-01"), analyte_target = "RNA")
#ooo Filter barcodes successfully!
#[1] "TCGA-55-7913-01B-11T-2237-01"
Sample portion/plate order tests:
# plate number
library(readr)  # provides read_tsv
test_data = read_tsv("
TCGA-A6-2684-01A-01R-1410-07
TCGA-A6-2684-01A-01R-A278-07
TCGA-A6-6650-01A-11R-A278-07
TCGA-A6-6650-01A-11R-1774-07
TCGA-A6-2674-01A-02R-A278-07
TCGA-A6-2674-01A-02R-0821-07
TCGA-A6-3809-01A-01R-A278-07
TCGA-A6-3809-01A-01R-1022-07
TCGA-A6-6780-01A-11R-A278-07
TCGA-A6-6780-01A-11R-1839-07
TCGA-A6-3810-01A-01R-1022-07
TCGA-A6-3810-01A-01R-A278-07
TCGA-A6-5659-01A-01R-1653-07
TCGA-A6-5659-01A-01R-A278-07
TCGA-A6-6781-01A-22R-1928-07
TCGA-A6-6781-01A-22R-A278-07
TCGA-A6-5656-01A-21R-A278-07
TCGA-A6-5656-01A-21R-1839-07", col_names=FALSE)
> tcgaReplicateFilter(test_data$X1)
ooo Filter barcodes successfully!
[1] "TCGA-A6-2684-01A-01R-A278-07" "TCGA-A6-6650-01A-11R-A278-07" "TCGA-A6-2674-01A-02R-A278-07"
[4] "TCGA-A6-3809-01A-01R-A278-07" "TCGA-A6-6780-01A-11R-A278-07" "TCGA-A6-3810-01A-01R-A278-07"
[7] "TCGA-A6-5659-01A-01R-A278-07" "TCGA-A6-6781-01A-22R-A278-07" "TCGA-A6-5656-01A-21R-A278-07"
# portion number
>tcgaReplicateFilter(tsb = c("TCGA-55-7913-01B-11T-2237-01", "TCGA-55-7913-01B-09T-2237-01"), analyte_target = "RNA")
ooo Filter barcodes successfully!
[1] "TCGA-55-7913-01B-11T-2237-01"
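The function is published on Gist, and I have tested it on the sample barcodes I need to process myself.
If you find any problems, feedback and fixes are welcome.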