Paper: Wirbel, J., Zych, K., Essex, M. et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol 22, 93 (2021). https://doi.org/10.1186/s13059-021-02306-1
GitHub code: https://github.com/zellerlab/siamcat_paper
Data: https://doi.org/10.5281/zenodo.4454489
Key points of the paper:
How are spurious associations and reproducibility issues addressed? — control augmentation.
Model introspection: are microbiome changes disease-specific, or do they reflect a more general dysbiosis?
Test directory: F:\Zhaolab2020\gut-brain-axis\metaAD\2021GB_SIAMCAT\testwork
Note: R version 4 is recommended.
Label creation:
label.crc.zeller <- create.label(meta=meta.crc.zeller,
label='Group', case='CRC')
## function signature
create.label(label, case,
meta=NULL, control=NULL,
p.lab = NULL, n.lab = NULL,
remove.meta.column=FALSE,
verbose=1)
# label: name of the metadata column (or a named vector) used to create the label.
# verbose: integer controlling the output: 0 = no output; 1 = only progress and success messages; 2 = normal level of information; 3 = debug information. Defaults to 1.
Importing the data into SIAMCAT:
sc.obj <- siamcat(feat=feat.crc.zeller,
label=label.crc.zeller,
meta=meta.crc.zeller)
##SIAMCAT constructor function
siamcat(..., feat=NULL, label=NULL, meta=NULL,
phyloseq=NULL, validate=TRUE, verbose=3)
Feature filtering
sc.obj <- filter.features(sc.obj,
filter.method = 'abundance',
cutoff = 0.001)
###Perform unsupervised feature filtering.
filter.features(siamcat, filter.method = "abundance",
cutoff = 0.001, rm.unmapped = TRUE,
feature.type='original', verbose = 1)
### filter.method: one of c('abundance', 'cum.abundance', 'prevalence', 'variance'); defaults to 'abundance'.
### cutoff: float, abundance, prevalence, or variance cutoff; defaults to 0.001.
### rm.unmapped: boolean, whether to remove unmapped reads; defaults to TRUE.
### feature.type: "original", "filtered", or "normalized".
Feature selection methods:
Association testing:
sc.obj <- check.associations(
sc.obj,
sort.by = 'fc',
alpha = 0.05,
mult.corr = "fdr",
detect.lim = 10 ^-6,
plot.type = "quantile.box",
panels = c("fc", "prevalence", "auroc"))
##Check and visualize associations between features and classes
check.associations(siamcat, fn.plot=NULL, color.scheme = "RdYlBu",
alpha =0.05, mult.corr = "fdr", sort.by = "fc",
detect.lim = 1e-06, pr.cutoff = 1e-6, max.show = 50,
plot.type = "quantile.box",
panels = c("fc","auroc"), prompt = TRUE,
feature.type = 'filtered', verbose = 1)
##fn.plot: string, filename for the pdf-plot.
##alpha:float, significance level, defaults to 0.05
##mult.corr: string, multiple hypothesis correction method, see p.adjust, defaults to "fdr"
##sort.by: string, sort features by p-value ("p.val"), by fold change ("fc") or by prevalence shift ("pr.shift")
## detect.lim: pseudocount added before log transformation.
## pr.cutoff: float, cutoff for the prevalence computation; defaults to 1e-06
## max.show: maximum number of associated features to display.
## plot.type: how the abundances are plotted; one of c("bean", "box", "quantile.box", "quantile.rect")
## panels: which panels to show next to the abundance plot; c("fc", "auroc", "prevalence")
Points worth noting about association testing:
What is the rationale behind multiple-testing correction? What is the difference between Bonferroni (controls the family-wise error rate: reject when p ≤ α/n, i.e., each p-value is effectively multiplied by n) and FDR correction (Benjamini-Hochberg: reject all hypotheses up to the largest k with p_(k) ≤ α·k/m)? See https://zhuanlan.zhihu.com/p/51546651 and the sketch after this list.
What do the plot types c("bean", "box", "quantile.box", "quantile.rect") look like? See the test results under F:\Zhaolab2020\gut-brain-axis\metaAD\2021GB_SIAMCAT\testwork.
What do the panel names next to the abundance plot, c("fc", "auroc", "prevalence"), mean, and how are they drawn?
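A minimal base-R sketch contrasting the two corrections (hypothetical p-values; stats::p.adjust is what the mult.corr argument refers to, per the docs above):
p <- c(0.001, 0.008, 0.02, 0.04, 0.20)   # hypothetical raw p-values
p.adjust(p, method = 'bonferroni')        # each p multiplied by n (capped at 1)
p.adjust(p, method = 'BH')                # p_(k) * m/k, with monotonicity enforced
Bonferroni controls the family-wise error rate and is the more conservative of the two; BH controls the expected proportion of false discoveries among the rejected hypotheses.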
Confounder testing:
sc.obj <- check.confounders(
sc.obj,
fn.plot = 'confounder_plots.pdf',
meta.in = NULL,
feature.type = 'filtered'
)
##Check for potential confounders in the metadata
check.confounders(siamcat, fn.plot, meta.in = NULL, verbose = 1)
##meta.in: vector, specific metadata variable names to analyze, defaults to NULL (all metadata variables will be analyzed)
## Details: this function tests the classification label against potential confounders in the metadata (e.g., age, sex, or BMI). Statistical testing uses Fisher's exact test or the Wilcoxon test, and associations are visualized with barplots or Q-Q plots, depending on the type of the metadata variable.
## In addition, it evaluates associations among metadata variables via conditional entropy, and associations between the label and the metadata variables via generalized linear models, providing an association heatmap and appropriate quantitative boxplots. A small sketch of the underlying tests follows.
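To make the two test types concrete, here is a minimal base-R sketch (with a hypothetical label factor and metadata data.frame; this illustrates the idea, not SIAMCAT's internal code):
label <- factor(c('CTR', 'CTR', 'CRC', 'CRC', 'CRC', 'CTR'))   # hypothetical labels
meta <- data.frame(Gender = c('F', 'M', 'F', 'F', 'M', 'M'),
                   Age = c(54, 61, 67, 58, 72, 49))            # hypothetical metadata
# categorical confounder: Fisher's exact test on the contingency table
fisher.test(table(label, meta$Gender))
# continuous confounder: Wilcoxon rank-sum test between the two classes
wilcox.test(meta$Age ~ label)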
Related questions:
Data normalization:
sc.obj <- normalize.features(
sc.obj,
norm.method = "log.unit",
norm.param = list(
log.n0 = 1e-06,
n.p = 2,
norm.margin = 1
)
)
##Perform feature normalization
normalize.features(siamcat,
norm.method = c("rank.unit", "rank.std",
"log.std", "log.unit", "log.clr"),
norm.param = list(log.n0 = 1e-06, sd.min.q = 0.1,
n.p = 2, norm.margin = 1),
feature.type='filtered',
verbose = 1)
##norm.method: c('rank.unit', 'rank.std', 'log.std', 'log.unit', 'log.clr')
## norm.param: list of parameters for the different normalization methods.
## feature.type: "original", "filtered", or "normalized".
Notes on the normalization methods:
rank.unit: converts features to ranks, then normalizes each column (sample).
rank.std: converts features to ranks, then applies z-score standardization.
log.clr: centered log-ratio transformation.
log.std: log transformation followed by z-score standardization.
log.unit: log transformation followed by normalization.
rank.unit does not require any additional parameters; rank.std requires sd.min.q (the minimal standard deviation added during standardization); log.clr requires log.n0 (the pseudocount added before log transformation); log.std requires log.n0 and sd.min.q; log.unit requires log.n0, n.p, and norm.margin, where n.p specifies which vector norm to use and norm.margin specifies the margin over which to normalize (1 = over features, 2 = over samples, 3 = by the global maximum).
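To make log.std concrete, here is a minimal sketch in plain R of the idea behind it (hypothetical feat matrix with features in rows; an illustration, not SIAMCAT's internal implementation):
# hypothetical features-by-samples matrix
feat <- matrix(runif(20), nrow = 4,
               dimnames = list(paste0('sp', 1:4), paste0('s', 1:5)))
log.n0 <- 1e-06
feat.log <- log10(feat + log.n0)      # log transform with pseudocount
m <- rowMeans(feat.log)               # per-feature mean
s <- apply(feat.log, 1, sd)           # per-feature standard deviation
feat.norm <- (feat.log - m) / s       # z-score standardization per feature
# SIAMCAT additionally adds a minimal sd (controlled by sd.min.q) to the
# denominator to stabilize features with near-zero variance.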
Prepare cross-validation: perform two repetitions of 5-fold cross-validation.
sc.obj <- create.data.split(
sc.obj,
num.folds = 5,
num.resample = 2
)
##Split a dataset into training and test sets.
create.data.split(siamcat, num.folds = 2, num.resample = 1,
stratify = TRUE, inseparable = NULL, verbose = 1)
## inseparable: name of the metadata variable that must stay inseparable during the split.
Model training:
sc.obj <- train.model(
sc.obj,
method = "lasso"
)
##This function trains a machine learning model on the training data
train.model(siamcat,
method = c("lasso", "enet", "ridge", "lasso_ll",
"ridge_ll", "randomForest"),
stratify = TRUE, modsel.crit = list("auc"),
min.nonzero.coeff = 1, param.set = NULL,
perform.fs = FALSE,
param.fs = list(thres.fs = 100, method.fs = "AUC", direction='absolute'),
feature.type='normalized',
verbose = 1)
## method: which method to use for model training; one of c('lasso', 'enet', 'ridge', 'lasso_ll', 'ridge_ll', 'randomForest').
## modsel.crit: criterion for model selection; one of c('auc', 'f1', 'acc', 'pr').
## min.nonzero.coeff: integer, minimum number of nonzero coefficients the model should contain (only for 'lasso', 'ridge', and 'enet'); defaults to 1.
## param.set: hyperparameter settings, which may include: cost (for lasso_ll and ridge_ll); alpha (for enet); ntree and mtry (for randomForest).
## perform.fs: whether to perform feature selection.
## param.fs: parameters for feature selection; must contain thres.fs (the threshold for feature selection), method.fs (the selection method: AUC, gFC, or Wilcoxon), and direction (for AUC and gFC, the direction of the top features: absolute, positive, or negative).
The machine-learning models and parameters come from the mlr package. For methods that require additional hyperparameters, the optimal hyperparameters can be tuned with the tuneParams function of the mlr package.
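As an illustration of what such tuning looks like in mlr directly, a minimal sketch on a built-in task (independent of SIAMCAT; the learner, parameter, and ranges here are examples, not what SIAMCAT uses internally, and the randomForest package must be installed):
library(mlr)
# toy classification task and a random-forest learner with a tunable mtry
task <- makeClassifTask(data = iris, target = "Species")
lrn <- makeLearner("classif.randomForest")
ps <- makeParamSet(makeIntegerParam("mtry", lower = 1, upper = 4))
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc("CV", iters = 3)
res <- tuneParams(lrn, task = task, resampling = rdesc,
                  par.set = ps, control = ctrl)
res$x   # best hyperparameter values found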
Functions used for feature selection:
‘AUC’ - computes the Area Under the Receiver Operating Characteristics Curve for each single feature and selects the top param.fs$thres.fs, e.g. 100 features
‘gFC’ - computes the generalized Fold Change (see check.associations) for each feature and likewise selects the top param.fs$thres.fs, e.g. 100 features
Wilcoxon - computes the p-Value for each single feature with the Wilcoxon test and selects features with a p-value smaller than param.fs$thres.fs
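A minimal sketch of how a single-feature AUC and a generalized fold change could be computed in plain R (hypothetical feature vector x and label y; SIAMCAT's exact implementation may differ, e.g., in the set of quantiles used for gFC):
x <- c(0.01, 0.20, 0.05, 0.30, 0.02, 0.25)        # hypothetical abundances
y <- factor(c('CTR', 'CD', 'CTR', 'CD', 'CTR', 'CD'))
# single-feature AUROC from the Wilcoxon rank-sum statistic
n1 <- sum(y == 'CD'); n0 <- sum(y == 'CTR')
w <- wilcox.test(x[y == 'CD'], x[y == 'CTR'])$statistic
auc <- as.numeric(w) / (n1 * n0)
# generalized fold change: mean difference of log-abundance quantiles
q <- seq(0.1, 0.9, by = 0.1)
gfc <- mean(quantile(log10(x[y == 'CD'] + 1e-06), probs = q) -
            quantile(log10(x[y == 'CTR'] + 1e-06), probs = q))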
Related questions:
Making predictions:
sc.obj <- make.predictions(sc.obj)
pred_matrix <- pred_matrix(sc.obj)
head(pred_matrix)
##Make predictions on a test set
make.predictions(siamcat, siamcat.holdout = NULL, normalize.holdout = TRUE, verbose = 1)
## siamcat.holdout: siamcat-class object on which to make predictions; defaults to NULL.
## normalize.holdout: boolean, whether to normalize the features of the hold-out set.
##Retrieve the prediction matrix from a SIAMCAT object
pred_matrix(siamcat, verbose=1)
Evaluation plot:
sc.obj <- evaluate.predictions(sc.obj)
model.evaluation.plot(sc.obj)
##Evaluate prediction results
evaluate.predictions(siamcat, verbose = 1) ## evaluates the predictions using AUROC and precision-recall (PR).
##Model Evaluation Plot
model.evaluation.plot(..., fn.plot = NULL, colours=NULL, verbose = 1)
Interpretation plot:
model.interpretation.plot(
sc.obj,
fn.plot = 'interpretation.pdf',
consens.thres = 0.5,
limits = c(-3, 3),
heatmap.type = 'zscore'
)
##This function produces a plot for model interpretation, displaying: 1) the feature weights, 2) the robustness of feature weights, 3) the feature scores across samples, 4) the distribution of metadata across samples, and 5) the proportion of model weights shown.
model.interpretation.plot(siamcat, fn.plot = NULL,
color.scheme = "BrBG",
consens.thres = 0.5,
heatmap.type = "zscore",
limits = c(-3, 3), detect.lim = 1e-06,
max.show = 50, prompt=TRUE, verbose = 1)
## consens.thres: minimal fraction of models that must contain a feature for it to be shown; defaults to 0.5. For random forest, the threshold is instead applied to the median Gini coefficient of a feature and should therefore be much lower, e.g., 0.01.
## heatmap.type: type of heatmap, either 'fc' or 'zscore'; defaults to 'zscore'.
## limits: extreme-value limits for the heatmap; defaults to c(-3, 3)
## detect.lim: detection limit, the pseudocount added before log transformation; defaults to 1e-06.
Function details: the figure produced by model.interpretation.plot() displays the components listed above for the features passing the consens.thres filter. Questions:
The test dataset comes from Nielsen et al., Nat Biotechnol 2014, available through the curatedMetagenomicData package.
Note: check for repeated samples from the same subject:
print(length(unique(meta.nielsen.full$subjectID)))
print(nrow(meta.nielsen.full))
Take the first sampling of each subject as its representative:
meta.nielsen <- meta.nielsen.full %>%
select(sampleID, subjectID, study_condition, disease_subtype,
disease, age, country, number_reads, median_read_length, BMI) %>%
mutate(visit=str_extract(sampleID, '_[0-9]+$')) %>%
mutate(visit=str_remove(visit, '_')) %>%
mutate(visit=as.numeric(visit)) %>%
mutate(visit=case_when(is.na(visit)~0, TRUE~visit)) %>%
group_by(subjectID) %>%
filter(visit==min(visit)) %>%
ungroup() %>%
mutate(Sample_ID=sampleID) %>%
mutate(Group=case_when(disease=='healthy'~'CTR',
TRUE~disease_subtype))
Keep only samples from ulcerative colitis (UC) patients and controls:
meta.nielsen <- meta.nielsen %>%
filter(Group %in% c('UC', 'CTR'))
Taxonomic profiles:
x <- 'NielsenHB_2014.metaphlan_bugs_list.stool'
feat <- curatedMetagenomicData(x=x, dryrun=FALSE)
feat <- feat[[x]]@assayData$exprs
Extract species-level abundances and convert them to relative abundances:
feat <- feat[grep(x=rownames(feat), pattern='s__'),]
feat <- feat[grep(x=rownames(feat),pattern='t__', invert = TRUE),]
feat <- t(t(feat)/100)
Shorten the full lineages to species names:
rownames(feat) <- str_extract(rownames(feat), 's__.*$')
mOTUs2 profiles: both metadata and features can be obtained from the EMBL cluster at https://www.embl.de/download/zeller/metaHIT/
# base url for data download
data.location <- 'https://www.embl.de/download/zeller/metaHIT/'
## metadata
meta.nielsen <- read_tsv(paste0(data.location, 'meta_Nielsen.tsv'))
## Rows: 396 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): Sample_ID, Individual_ID, Country, Gender, Group
## dbl (4): Sampling_day, Age, BMI, Library_Size
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# also here, we have to remove repeated samplings and CD samples
meta.nielsen <- meta.nielsen %>%
filter(Group %in% c('CTR', 'UC')) %>%
group_by(Individual_ID) %>%
filter(Sampling_day==min(Sampling_day)) %>%
ungroup() %>%
as.data.frame()
rownames(meta.nielsen) <- meta.nielsen$Sample_ID
## features
feat <- read.table(paste0(data.location, 'metaHIT_motus.tsv'),
stringsAsFactors = FALSE, sep='\t',
check.names = FALSE, quote = '', comment.char = '')
feat <- feat[,colSums(feat) > 0]
feat <- prop.table(as.matrix(feat), 2)
Select the samples from country ESP (Spain; see https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) and load them into a SIAMCAT object:
# remove Danish samples
meta.nielsen.esp <- meta.nielsen[meta.nielsen$Country == 'ESP',]
sc.obj <- siamcat(feat=feat, meta=meta.nielsen.esp, label='Group', case='UC')
Filter the features based on abundance and prevalence:
sc.obj <- filter.features(sc.obj, cutoff=1e-04,
filter.method = 'abundance')
## Features successfully filtered
sc.obj <- filter.features(sc.obj, cutoff=0.05,
filter.method='prevalence',
feature.type = 'filtered')
## Features successfully filtered
The check.associations function computes the significance of enrichment and metrics of association (such as the generalized fold change and single-feature AUROC):
sc.obj <- check.associations(sc.obj, detect.lim = 1e-06, alpha=0.1,
max.show = 20,plot.type = 'quantile.rect',
panels = c('fc'),
fn.plot = './association_plot_nielsen.pdf')
Check the supplied meta-variables for potential confounding:
check.confounders(sc.obj, fn.plot = './confounders_nielsen.pdf')
Machine learning workflow, consisting of the following steps:
sc.obj <- normalize.features(sc.obj, norm.method = 'log.std',
norm.param = list(log.n0=1e-06, sd.min.q=0))
## Features normalized successfully.
sc.obj <- create.data.split(sc.obj, num.folds = 5, num.resample = 5)
## Features splitted for cross-validation successfully.
sc.obj <- train.model(sc.obj, method='lasso')
## Trained lasso models successfully.
sc.obj <- make.predictions(sc.obj)
## Made predictions successfully.
sc.obj <- evaluate.predictions(sc.obj)
## Evaluated predictions successfully.
The model evaluation plot shows the ROC curve and the precision-recall curve:
model.evaluation.plot(sc.obj, fn.plot = './eval_plot_nielsen.pdf')
The model interpretation plot provides additional information about the trained machine-learning model:
model.interpretation.plot(sc.obj, consens.thres = 0.8,
fn.plot = './interpret_nielsen.pdf')
## Successfully plotted model interpretation plot to: ./interpret_nielsen.pdf
How are the samples distributed between Spain and Denmark?
table(meta.nielsen$Group, meta.nielsen$Country)
##
## DNK ESP
## CTR 177 59
## UC 0 69
Create a SIAMCAT object that includes the Danish healthy controls:
## create the SIAMCAT object
sc.obj.full <- siamcat(feat=feat, meta=meta.nielsen,
label='Group', case='UC')
sc.obj.full <- filter.features(sc.obj.full, cutoff=1e-04,
filter.method = 'abundance')
## Features successfully filtered
sc.obj.full <- filter.features(sc.obj.full, cutoff=0.05,
filter.method='prevalence',
feature.type = 'filtered')
## Features successfully filtered
The confounder plot shows that the meta-variable country could be problematic:
check.confounders(sc.obj.full, fn.plot = './confounders_dnk.pdf')
Test for associations with SIAMCAT when the Danish samples are included:
sc.obj.full <- check.associations(sc.obj.full, detect.lim = 1e-06, alpha=0.1,
max.show = 20,
plot.type = 'quantile.rect',
fn.plot = './association_plot_dnk.pdf')
Confounders can bias association testing. We use SIAMCAT to test for associations in the two setups (with and without the Danish samples), extract the association metrics from both SIAMCAT objects, and compare them in a scatter plot.
assoc.sp <- associations(sc.obj)
assoc.sp$species <- rownames(assoc.sp)
assoc.sp_dnk <- associations(sc.obj.full)
assoc.sp_dnk$species <- rownames(assoc.sp_dnk)
df.plot <- full_join(assoc.sp, assoc.sp_dnk, by='species')
df.plot %>%
mutate(highlight=str_detect(species, 'formicigenerans')) %>%
ggplot(aes(x=-log10(p.adj.x), y=-log10(p.adj.y), col=highlight)) +
geom_point(alpha=0.3) +
xlab('Spanish samples only\n-log10(q)') +
ylab('Spanish and Danish samples only\n-log10(q)') +
theme_classic() +
theme(panel.grid.major = element_line(colour='lightgrey'),
aspect.ratio = 1.3) +
scale_colour_manual(values=c('darkgrey', '#D41645'), guide=FALSE) +
annotate('text', x=0.7, y=8, label='Dorea formicigenerans')
Several species turn out to be significant when the Danish healthy controls are included but not when only the Spanish samples are considered, for example Dorea formicigenerans.
# extract information out of the siamcat object
feat.dnk <- get.filt_feat.matrix(sc.obj.full)
label.dnk <- label(sc.obj.full)$label
country <- meta(sc.obj.full)$Country
names(country) <- rownames(meta(sc.obj.full))
df.plot <- tibble(dorea=log10(feat.dnk[
str_detect(rownames(feat.dnk),'formicigenerans'),
names(label.dnk)] + 1e-05),
label=label.dnk, country=country) %>%
mutate(label=case_when(label=='-1'~'CTR', TRUE~"UC")) %>%
mutate(x_value=paste0(country, '_', label))
df.plot %>%
ggplot(aes(x=x_value, y=dorea)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.08, stroke=0, alpha=0.2) +
theme_classic() +
xlab('') +
ylab("log10(Dorea formicigenerans)") +
stat_compare_means(comparisons = list(c('DNK_CTR', 'ESP_CTR'),
c('DNK_CTR', 'ESP_UC'),
c('ESP_CTR', 'ESP_UC')))
df.plot is a data frame with 291 rows and 4 columns.
Machine learning:
The results of the machine-learning workflow are also affected by the differences between countries, leading to inflated performance estimates.
sc.obj.full <- normalize.features(sc.obj.full, norm.method = 'log.std',
norm.param = list(log.n0=1e-06, sd.min.q=0))
## Features normalized successfully.
sc.obj.full <- create.data.split(sc.obj.full, num.folds = 5, num.resample = 5)
## Features splitted for cross-validation successfully.
sc.obj.full <- train.model(sc.obj.full, method='lasso')
## Trained lasso models successfully.
sc.obj.full <- make.predictions(sc.obj.full)
## Made predictions successfully.
sc.obj.full <- evaluate.predictions(sc.obj.full)
## Evaluated predictions successfully.
When we compare the performance of the two models, the model containing both the Danish and Spanish samples seems to perform better (higher AUROC). However, the previous analyses suggest that this performance estimate is biased and inflated, because the differences between the Spanish and Danish samples are so large.
model.evaluation.plot("Spanish samples only"=sc.obj,
"Danish and Spanish samples"=sc.obj.full,
fn.plot = './eval_plot_dnk.pdf')
## Plotted evaluation of predictions successfully to: ./eval_plot_dnk.pdf
To demonstrate how a machine-learning model can exploit this confounder, we can train a model to distinguish the Spanish from the Danish control samples. As you can see, this model separates the two countries with near-perfect accuracy.
meta.nielsen.country <- meta.nielsen[meta.nielsen$Group=='CTR',]
sc.obj.country <- siamcat(feat=feat, meta=meta.nielsen.country,
label='Country', case='ESP')
sc.obj.country <- filter.features(sc.obj.country, cutoff=1e-04,
filter.method = 'abundance')
sc.obj.country <- filter.features(sc.obj.country, cutoff=0.05,
filter.method='prevalence',
feature.type = 'filtered')
sc.obj.country <- normalize.features(sc.obj.country, norm.method = 'log.std',
norm.param = list(log.n0=1e-06,
sd.min.q=0))
sc.obj.country <- create.data.split(sc.obj.country,
num.folds = 5, num.resample = 5)
sc.obj.country <- train.model(sc.obj.country, method='lasso')
sc.obj.country <- make.predictions(sc.obj.country)
sc.obj.country <- evaluate.predictions(sc.obj.country)
print(eval_data(sc.obj.country)$auroc)
## Area under the curve: 0.9701
Useful references:
Holdout testing with a SIAMCAT object for follow-up analyses; see: https://bioconductor.org/packages/release/bioc/vignettes/SIAMCAT/inst/doc/SIAMCAT_holdout.html
One functionality of the SIAMCAT package is training statistical machine-learning models on metagenomic data. In this tutorial, we show how a model trained on one dataset can be applied to an independent dataset (the holdout dataset). We use two colorectal cancer studies: the first dataset is from France (Zeller et al.) and the second from China (Yu et al.); both were processed with mOTUs2.
The datasets can be found in the Zeller group's web resource of public metagenomic datasets.
library(SIAMCAT)
# this is data from Zeller et al., Mol. Syst. Biol. 2014
fn.feat.fr <-
'https://www.embl.de/download/zeller/FR-CRC/FR-CRC-N141_tax-ab-specI.tsv'
fn.meta.fr <-
'https://www.embl.de/download/zeller/FR-CRC/FR-CRC-N141_metadata.tsv'
# this is the external dataset from Yu et al., Gut 2017
fn.feat.cn <-
'https://www.embl.de/download/zeller/CN-CRC/CN-CRC-N128_tax-ab-specI.tsv'
fn.meta.cn <-
'https://www.embl.de/download/zeller/CN-CRC/CN-CRC-N128_metadata.tsv'
First, we build a SIAMCAT object from the French study.
# features
# be wary of the defaults in R!!!
feat.fr <- read.table(fn.feat.fr, sep='\t', quote="",
check.names = FALSE, stringsAsFactors = FALSE)
# the features are counts, but we want to work with relative abundances
feat.fr.rel <- prop.table(as.matrix(feat.fr), 2)
# metadata
meta.fr <- read.table(fn.meta.fr, sep='\t', quote="",
check.names=FALSE, stringsAsFactors=FALSE)
# create SIAMCAT object
siamcat.fr <- siamcat(feat=feat.fr.rel, meta=meta.fr,
label='Group', case='CRC')
Then we load the Chinese study and create a SIAMCAT object to serve as the holdout dataset.
# features
feat.cn <- read.table(fn.feat.cn, sep='\t', quote="",
check.names = FALSE)
feat.cn.rel <- prop.table(as.matrix(feat.cn), 2)
# metadata
meta.cn <- read.table(fn.meta.cn, sep='\t', quote="",
check.names=FALSE, stringsAsFactors = FALSE)
# SIAMCAT object
siamcat.cn <- siamcat(feat=feat.cn.rel, meta=meta.cn,
label='Group', case='CRC')
Data preprocessing (including data validation, filtering, and normalization):
## feature filtering
siamcat.fr <- filter.features(
siamcat.fr,
filter.method = 'abundance',
cutoff = 0.001,
rm.unmapped = TRUE,
verbose=2
)
## feature normalization
siamcat.fr <- normalize.features(
siamcat.fr,
norm.method = "log.std",
norm.param = list(log.n0 = 1e-06, sd.min.q = 0.1),
verbose = 2
)
Model training:
## cross-validation data split
siamcat.fr <- create.data.split(
siamcat.fr,
num.folds = 5,
num.resample = 2
)
## train the model
siamcat.fr <- train.model(
siamcat.fr,
method = "lasso"
)
Prediction: make predictions on each cross-validation fold and evaluate the predictions:
## make predictions
siamcat.fr <- make.predictions(siamcat.fr)
## evaluate predictions
siamcat.fr <- evaluate.predictions(siamcat.fr)
Now that we have successfully built the model for the French dataset, we can apply it to the Chinese holdout dataset. First, we normalize the Chinese dataset with the same parameters as the French one so that the data are comparable. For this step, we can use the frozen normalization functionality of SIAMCAT's normalize.features function: we supply all the normalization parameters stored in the siamcat.fr object, which can be accessed with the norm_params accessor.
Frozen normalization:
## note: the function is still normalize.features; only the norm.param argument now comes from the French dataset.
siamcat.cn <- normalize.features(siamcat.cn,
norm.param=norm_params(siamcat.fr),
feature.type='original',
verbose = 2)
Apply the trained model to the holdout dataset.
## the external dataset is passed via the siamcat.holdout argument
siamcat.cn <- make.predictions(
siamcat = siamcat.fr,
siamcat.holdout = siamcat.cn,
normalize.holdout = FALSE)
## Warning in make.predictions(siamcat = siamcat.fr, siamcat.holdout =
## siamcat.cn, : WARNING: holdout set is not being normalized!
Note that make.predictions should only be applied to a holdout dataset that has been normalized; alternatively, the normalization can be done inside make.predictions:
## Alternative Code, not run here
siamcat.cn <- siamcat(feat=feat.cn.rel, meta=meta.cn,
label='Group', case='CRC')
siamcat.cn <- make.predictions(siamcat = siamcat.fr,
siamcat.holdout = siamcat.cn,
normalize.holdout = TRUE)
Evaluate the predictions again:
siamcat.cn <- evaluate.predictions(siamcat.cn)
Now we can compare the performance of the original classifier with its performance on the holdout dataset using model.evaluation.plot. Several SIAMCAT objects can be supplied here, and the model evaluations will be plotted in the same figure. Note that the objects can be passed as named arguments so that the names appear in the legend.
model.evaluation.plot('FR-CRC'=siamcat.fr,
'CN-CRC'=siamcat.cn,
colours=c('dimgrey', 'orange'))
Reference: https://bioconductor.org/packages/release/bioc/vignettes/SIAMCAT/inst/doc/SIAMCAT_read-in.html
This tutorial shows how to read your own data into the SIAMCAT package, covering reading text files from disk, formatting the data, and creating a siamcat-class object from them.
The siamcat-class is the core of the package; all input data and results are stored in it. The structure of the object is described in the siamcat-class object section below.
Overall, SIAMCAT takes three types of input: features, metadata, and a label.
Features: a matrix, data.frame, or otu_table of the form features (in rows) x samples (in columns).
Metadata: a matrix or data.frame of the form samples (in rows) x metadata (in columns).
library(SIAMCAT)
## read the feature table
fn.in.feat <- system.file(
"extdata",
"feat_crc_zeller_msb_mocat_specI.tsv",
package = "SIAMCAT"
)
feat <- read.table(fn.in.feat, sep='\t',
header=TRUE, quote='',
stringsAsFactors = FALSE, check.names = FALSE)
# look at some features
feat[110:114, 1:2]
## read the metadata table
fn.in.meta <- system.file(
"extdata",
"num_metadata_crc_zeller_msb_mocat_specI.tsv",
package = "SIAMCAT"
)
meta <- read.table(fn.in.meta, sep='\t',
header=TRUE, quote='',
stringsAsFactors = FALSE, check.names = FALSE)
head(meta)
##Finds the full file names of files in packages etc.
system.file(..., package = "base", lib.loc = NULL,
mustWork = FALSE)
## ...: character vectors specifying subdirectories and files within some package.
## package: the name of a single package.
## mustWork: logical; if TRUE, an error is thrown when no file matches (defaults to FALSE, per the signature above).
Specify the label:
label <- create.label(meta=meta, label="diagnosis",
case = 1, control=0)
label <- create.label(meta=meta, label="diagnosis",
case = 1, control=0,
p.lab = 'cancer', n.lab = 'healthy')
When the input file is in LEfSe format, it is read in as follows:
fn.in.lefse<- system.file(
"extdata",
"LEfSe_crc_zeller_msb_mocat_specI.tsv",
package = "SIAMCAT"
)
## read the LEfSe file
meta.and.features <- read.lefse(fn.in.lefse,
rows.meta = 1:6, row.samples = 7)
meta <- meta.and.features$meta
feat <- meta.and.features$feat
## create the label
label <- create.label(meta=meta, label="label", case = "cancer")
## function details
##read an input file in a LEfSe input format
read.lefse(filename = "data.txt", rows.meta = 1, row.samples = 2)
## filename: name of the input file in LEfSe input format.
## rows.meta: specifies which rows store the metadata variables.
## row.samples: specifies which row stores the sample names.
metagenomeSeq-format files: read them in with the read.table function or via the BIOM format.
BIOM-format files: brought into SIAMCAT via phyloseq. The file is first imported with phyloseq's import_biom function; the resulting phyloseq object can then be converted into a siamcat object.
Creating a siamcat object from a phyloseq object: create the siamcat object directly from a phyloseq object via the siamcat constructor function.
data("GlobalPatterns") ## phyloseq example data
label <- create.label(meta=sample_data(GlobalPatterns),
label = "SampleType",
case = c("Freshwater", "Freshwater (creek)", "Ocean"))
# run the constructor function
siamcat <- siamcat(phyloseq=GlobalPatterns, label=label)
The siamcat-class is the core of the whole package.
In the figure above (not reproduced here), rectangles depict the slots of the object, and the class of the object stored in each slot is given in an ellipse. Two slots are mandatory: phyloseq (containing the metadata as sample_data and the original features as an otu_table) and label. These two slots are marked with a thicker border.
A siamcat object is constructed with the siamcat function and can be initialized either from a feature table or from a phyloseq object.
siamcat <- siamcat(feat=feat, label=label, meta=meta)
siamcat <- siamcat(phyloseq=phyloseq, label=label)
3. Accessing and assigning slots
Every slot in a siamcat object can be accessed as follows:
slot_name(siamcat)
For example, to access the eval_data slot, you would type:
eval_data(siamcat)
## special case: the phyloseq slot is accessed with physeq()
physeq(siamcat)
## assignment
slot_name(siamcat) <- object_to_assign
label(siamcat) <- new_label
Two slots contain slots themselves. First, the model_list slot has a models slot that contains the actual list of mlr models, accessible via models(siamcat); the method used to train the models can be retrieved with model_type(siamcat).
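A short usage sketch of these accessors (assuming sc.obj is a SIAMCAT object on which train.model has already been run):
model.list <- models(sc.obj)   # list of trained mlr models
model.list[[1]]                # inspect the first model
model_type(sc.obj)             # e.g., 'lasso'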
The phyloseq slot has a complex structure. However, unless the phyloseq object was created outside of the SIAMCAT object, only two of its slots are occupied: the otu_table slot containing the feature table and the sam_data slot containing the metadata. They can be accessed with features(siamcat) and meta(siamcat), respectively.
The other slots within the phyloseq slot have no dedicated accessors, but they are easy to reach after extracting the phyloseq object from the siamcat object.
phyloseq <- physeq(siamcat)
tax_tab <- tax_table(phyloseq)
head(tax_tab)
If you want to learn more about the phyloseq data structure, see the phyloseq BioConductor page.
In this tutorial, we show how SIAMCAT facilitates metagenomic meta-analyses, focusing on association testing and the ML workflow. As an example, we use five different Crohn's disease (CD) studies for which we have metagenomic datasets. The datasets are listed in the datasets vector below.
Setup
library("tidyverse")
library("SIAMCAT")
First, we load the data from all studies, which can be downloaded from the EMBL cluster. The raw data were preprocessed and taxonomically profiled with mOTUs2, then aggregated at the genus level.
# base url for data download
data.location <- 'https://www.embl.de/download/zeller/'
# datasets
datasets <- c('metaHIT', 'Lewis_2015', 'He_2017', 'Franzosa_2019', 'HMP2')
# metadata
meta.all <- read_tsv(paste0(data.location, 'CD_meta/meta_all.tsv'))
# features
feat <- read.table(paste0(data.location, 'CD_meta/feat_genus.tsv'),
check.names = FALSE, stringsAsFactors = FALSE, quote = '',
sep='\t')
feat <- as.matrix(feat)
# check that metadata and features agree
stopifnot(all(colnames(feat) == meta.all$Sample_ID))
Let's check the distribution of groups across the studies:
table(meta.all$Study, meta.all$Group)
##
## CD CTR
## Franzosa_2019 88 56
## HMP2 583 357
## He_2017 49 53
## Lewis_2015 294 25
## metaHIT 21 71
Some studies include multiple samples from the same subject; for example, HMP2 focused on the longitudinal dimension of CD. We therefore have to take this into account when training and evaluating machine-learning models (see the Machine learning pitfalls tutorial) and when performing association analyses. It is convenient to create a second metadata table containing a single entry per individual.
meta.ind <- meta.all %>%
group_by(Individual_ID) %>%
filter(Timepoint==min(Timepoint)) %>%
ungroup()
## group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data)): group by one or more variables.
## filter(.data, ..., .preserve = FALSE): subset rows based on column values.
## ungroup(x, ...): remove the grouping.
Compute associations with SIAMCAT. To test for associations, we wrap each dataset into its own SIAMCAT object and use the check.associations function:
assoc.list <- list()
for (d in datasets){
# filter metadata and convert to dataframe
meta.train <- meta.ind %>%
filter(Study==d) %>%
as.data.frame()
rownames(meta.train) <- meta.train$Sample_ID
# create SIAMCAT object
sc.obj <- siamcat(feat=feat, meta=meta.train, label='Group', case='CD')
# test for associations
sc.obj <- check.associations(sc.obj, detect.lim = 1e-05,
feature.type = 'original',fn.plot = paste0('./assoc_plot_', d, '.pdf'))
# extract the associations and save them in the assoc.list
temp <- associations(sc.obj)
temp$genus <- rownames(temp)
assoc.list[[d]] <- temp %>%
select(genus, fc, auc, p.adj) %>%
mutate(Study=d)
}
# combine all associations
df.assoc <- bind_rows(assoc.list)
df.assoc <- df.assoc %>% filter(genus!='unclassified')
head(df.assoc)
## genus fc auc p.adj Study
## 159730 Thermovenabulum...1 159730 Thermovenabulum 0 0.5 NaN metaHIT
## 42447 Anaerobranca...2 42447 Anaerobranca 0 0.5 NaN metaHIT
## 1562 Desulfotomaculum...3 1562 Desulfotomaculum 0 0.5 NaN metaHIT
## 60919 Sanguibacter...4 60919 Sanguibacter 0 0.5 NaN metaHIT
## 357 Agrobacterium...5 357 Agrobacterium 0 0.5 NaN metaHIT
## 392332 Geoalkalibacter...6 392332 Geoalkalibacter 0 0.5 NaN metaHIT
Plot heatmap for interesting genera. Compare the associations stored in df.assoc. For example, we can extract the features that are strongly associated with the label in at least one dataset (single-feature AUROC > 0.75 or < 0.25) and plot their generalized fold changes as a heatmap.
genera.of.interest <- df.assoc %>%
group_by(genus) %>%
summarise(m=mean(auc), n.filt=any(auc < 0.25 | auc > 0.75),
.groups='keep') %>%
filter(n.filt) %>%
arrange(m)
Having extracted the genera of interest, we plot them:
df.assoc %>%
# take only genera of interest
filter(genus %in% genera.of.interest$genus) %>%
# convert to factor to enforce an ordering by mean AUC
mutate(genus=factor(genus, levels = rev(genera.of.interest$genus))) %>%
# convert to factor to enforce ordering again
mutate(Study=factor(Study, levels = datasets)) %>%
# annotate the cells in the heatmap with stars
mutate(l=case_when(p.adj < 0.01~'*', TRUE~'')) %>%
ggplot(aes(y=genus, x=Study, fill=fc)) +
geom_tile() +
scale_fill_gradient2(low = '#3B6FB6', high='#D41645', mid = 'white',
limits=c(-2.7, 2.7), name='Generalized\nfold change') +
theme_minimal() +
geom_text(aes(label=l)) +
theme(panel.grid = element_blank()) +
xlab('') + ylab('') +
theme(axis.text = element_text(size=6))
Reflection: this kind of display could be used to compare associated features across NC-AD, NC-MCI, NC-SCS, and NC-SCD.
We can also check whether differences between studies affect the variance of specific genera. To do so, we create a single SIAMCAT object holding the complete dataset and run the check.confounders function.
df.meta <- meta.ind %>%
as.data.frame()
rownames(df.meta) <- df.meta$Sample_ID
sc.obj <- siamcat(feat=feat, meta=df.meta, label='Group', case='CD')
## + starting create.label
## Label used as case:
## CD
## Label used as control:
## CTR
## + finished create.label.from.metadata in 0.001 s
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 504 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.06 s
check.confounders(sc.obj, fn.plot = './confounder_plot_cd_meta.pdf',
feature.type='original')
## Finished checking metadata for confounders, results plotted to: ./confounder_plot_cd_meta.pdf
The resulting variance plot shows that some genera are strongly affected by the study effect while others are not. Note that the genera varying most with the label (CD vs. controls) do not vary much across studies.
Train LASSO models. Finally, we can carry out the machine-learning meta-analysis: we first train a model on each dataset and then apply it to all the other datasets using SIAMCAT's holdout testing functionality. For datasets with repeated samples per subject, we block the cross-validation by subject to avoid biased results (see Machine learning pitfalls).
# create tibble to store all the predictions
auroc.all <- tibble(study.train=character(0),
study.test=character(0),
AUC=double(0))
# and a list to save the trained SIAMCAT objects
sc.list <- list()
for (i in datasets){
# restrict to a single study
meta.train <- meta.all %>%
filter(Study==i) %>%
as.data.frame()
rownames(meta.train) <- meta.train$Sample_ID
## take into account repeated sampling by including a parameters
## in the create.data.split function
## For studies with repeated samples, we want to block the
## cross validation by the column 'Individual_ID'
block <- NULL
if (i %in% c('metaHIT', 'Lewis_2015', 'HMP2')){
block <- 'Individual_ID'
if (i == 'HMP2'){
# for the HMP2 dataset, the number of repeated sample per subject
# need to be reduced, because some subjects have been sampled
# 20 times, other only 5 times
meta.train <- meta.all %>%
filter(Study=='HMP2') %>%
group_by(Individual_ID) %>%
sample_n(5, replace = TRUE) %>%
distinct() %>%
as.data.frame()
rownames(meta.train) <- meta.train$Sample_ID
}
}
# create SIAMCAT object
sc.obj.train <- siamcat(feat=feat, meta=meta.train,
label='Group', case='CD')
# normalize features
sc.obj.train <- normalize.features(sc.obj.train, norm.method = 'log.std',
norm.param=list(log.n0=1e-05, sd.min.q=0),feature.type = 'original')
# Create data split
sc.obj.train <- create.data.split(sc.obj.train,
num.folds = 10, num.resample = 10, inseparable = block)
# train LASSO model
sc.obj.train <- train.model(sc.obj.train, method='lasso')
## apply trained models to other datasets
# loop through datasets again
for (i2 in datasets){
if (i == i2){
# make and evaluate cross-validation predictions (same dataset)
sc.obj.train <- make.predictions(sc.obj.train)
sc.obj.train <- evaluate.predictions(sc.obj.train)
auroc.all <- auroc.all %>%
add_row(study.train=i, study.test=i,
AUC=eval_data(sc.obj.train)$auroc %>% as.double())
} else {
# make and evaluate on the external datasets
# use meta.ind here, since we want only one sample per subject!
meta.test <- meta.ind %>%
filter(Study==i2) %>%
as.data.frame()
rownames(meta.test) <- meta.test$Sample_ID
sc.obj.test <- siamcat(feat=feat, meta=meta.test,
label='Group', case='CD')
# make holdout predictions
sc.obj.test <- make.predictions(sc.obj.train,
siamcat.holdout = sc.obj.test)
sc.obj.test <- evaluate.predictions(sc.obj.test)
auroc.all <- auroc.all %>%
add_row(study.train=i, study.test=i2,
AUC=eval_data(sc.obj.test)$auroc %>% as.double())
}
}
# save the trained model
sc.list[[i]] <- sc.obj.train
}
Having trained all the models, we compute, for each dataset, the average AUROC across the models trained on the other datasets.
test.average <- auroc.all %>%
filter(study.train!=study.test) %>%
group_by(study.test) %>%
summarise(AUC=mean(AUC), .groups='drop') %>%
mutate(study.train="Average")
Now that we have the AUROC values, we can plot them as a heatmap:
# combine AUROC values with test average
bind_rows(auroc.all, test.average) %>%
# highlight cross validation versus transfer results
mutate(CV=study.train == study.test) %>%
# for facetting later
mutate(split=case_when(study.train=='Average'~'Average', TRUE~'none')) %>%
mutate(split=factor(split, levels = c('none', 'Average'))) %>%
# convert to factor to enforce ordering
mutate(study.train=factor(study.train, levels=c(datasets, 'Average'))) %>%
mutate(study.test=factor(study.test, levels=c(rev(datasets),'Average'))) %>%
ggplot(aes(y=study.test, x=study.train, fill=AUC, size=CV, color=CV)) +
geom_tile() + theme_minimal() +
# text in tiles
geom_text(aes_string(label="format(AUC, digits=2)"),
col='white', size=2)+
# color scheme
scale_fill_gradientn(colours=rev(c('darkgreen','forestgreen',
'chartreuse3','lawngreen',
'yellow')), limits=c(0.5, 1)) +
# axis position/remove boxes/ticks/facet background/etc.
scale_x_discrete(position='top') +
theme(axis.line=element_blank(),
axis.ticks = element_blank(),
axis.text.x.top = element_text(angle=45, hjust=.1),
panel.grid=element_blank(),
panel.border=element_blank(),
strip.background = element_blank(),
strip.text = element_blank()) +
xlab('Training Set') + ylab('Test Set') +
scale_color_manual(values=c('#FFFFFF00', 'grey'), guide=FALSE) +
scale_size_manual(values=c(0, 1), guide=FALSE) +
facet_grid(~split, scales = 'free', space = 'free')
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
The result is shown in the figure below.
Investigate feature weights. Now that we have trained the models (saved in the sc.list object), we can use SIAMCAT to extract the model weights and compare them with the associations we computed above.
weight.list <- list()
for (d in datasets){
sc.obj.train <- sc.list[[d]]
# extract the feature weights out of the SIAMCAT object
temp <- feature_weights(sc.obj.train)
temp$genus <- rownames(temp)
# save selected info in the weight.list
weight.list[[d]] <- temp %>%
select(genus, median.rel.weight, mean.rel.weight, percentage) %>%
mutate(Study=d) %>%
mutate(r.med=rank(-abs(median.rel.weight)),
r.mean=rank(-abs(mean.rel.weight)))
}
# combine all feature weights into a single tibble
df.weights <- bind_rows(weight.list)
df.weights <- df.weights %>% filter(genus!='unclassified')
Based on this, we can plot another heatmap, focusing on the genera highlighted in the association heatmap above.
# compute absolute feature weights
abs.weights <- df.weights %>%
group_by(Study) %>%
summarise(sum.median=sum(abs(median.rel.weight)),
sum.mean=sum(abs(mean.rel.weight)),
.groups='drop')
df.weights %>%
full_join(abs.weights) %>%
# normalize by the absolute model size
mutate(median.rel.weight=median.rel.weight/sum.median) %>%
# only include genera of interest
filter(genus %in% genera.of.interest$genus) %>%
# highlight feature rank for the top 20 features
mutate(r.med=case_when(r.med > 20~NA_real_, TRUE~r.med)) %>%
# enforce the correct ordering by converting to factors again
mutate(genus=factor(genus, levels = rev(genera.of.interest$genus))) %>%
mutate(Study=factor(Study, levels = datasets)) %>%
ggplot(aes(y=genus, x=Study, fill=median.rel.weight)) +
geom_tile() +
scale_fill_gradientn(colours=rev(
c('#007A53', '#009F4D', "#6CC24A", 'white',
"#EFC06E", "#FFA300", '#BE5400')),
limits=c(-0.15, 0.15)) +
theme_minimal() +
geom_text(aes(label=r.med), col='black', size= 2) +
theme(panel.grid = element_blank()) +
xlab('') + ylab('') +
theme(axis.text = element_text(size=6))
## Joining, by = "Study"
Reference tutorial: https://www.bioconductor.org/packages/release/bioc/vignettes/SIAMCAT/inst/doc/SIAMCAT_ml_pitfalls.html
In this tutorial, we explore two pitfalls in machine-learning analyses that can lead to overly optimistic performance estimates (overfitting).
Cross-validation workflows are usually set up to estimate how a trained model will perform on external data, which is especially important for biomarker discovery. However, more complex workflows involving feature selection or time-course data can be challenging to set up correctly. Flawed workflows in which information leaks from the training data into the test data can lead to overfitting and poor generalization to external datasets.
Here, we focus on supervised feature selection and on naive splitting of dependent data.
Setup. First, we load the packages needed for the analysis.
library("tidyverse")
library("SIAMCAT")
Supervised feature selection means that the label information is taken into account before the cross-validation split. In this procedure, features associated with the label are selected (e.g., after differential abundance testing) using the complete dataset, so no data are set aside for an unbiased estimate of model performance.
The correct way is to nest the feature selection step within the cross-validation, so that feature associations are computed independently within each training fold (a toy demonstration follows).
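To see why this matters, here is a small self-contained toy in base R (hypothetical random data, not part of the tutorial): with pure-noise features, the true AUROC is 0.5, yet supervised pre-selection on the full data makes cross-validation look far better than chance. (glm may warn about near-perfect separation; that is part of the point.)
set.seed(42)
n <- 60; p <- 500
X <- matrix(rnorm(n * p), nrow = n)   # pure noise features
y <- rep(0:1, each = n / 2)           # hypothetical labels
# supervised selection on ALL samples (information leak)
pvals <- apply(X, 2, function(f) t.test(f[y == 1], f[y == 0])$p.value)
sel <- order(pvals)[1:10]
# simple leave-one-out CV on the pre-selected features
scores <- sapply(seq_len(n), function(i) {
  fit <- glm(y[-i] ~ ., data = data.frame(X[-i, sel]), family = binomial)
  predict(fit, newdata = data.frame(X[, sel])[i, , drop = FALSE])
})
# rank-based AUROC estimate; well above 0.5 despite pure noise
mean(outer(scores[y == 1], scores[y == 0], '>'))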
Load the data. As an example, we use two colorectal cancer (CRC) datasets from the curatedMetagenomicData package. Since the model-training workflow takes a long time, this tutorial is not evaluated during the package build, but you should get similar results if you execute the code chunks yourself.
library("curatedMetagenomicData")
First, we load the dataset from Thomas et al. as the training set.
x <- 'ThomasAM_2018a.metaphlan_bugs_list.stool'
feat.t <- curatedMetagenomicData(x=x, dryrun=FALSE)
feat.t <- feat.t[[x]]@assayData$exprs
# clean up metaphlan profiles to contain only species-level abundances
feat.t <- feat.t[grep(x=rownames(feat.t), pattern='s__'),]
feat.t <- feat.t[grep(x=rownames(feat.t),pattern='t__', invert = TRUE),]
stopifnot(all(colSums(feat.t) != 0))
feat.t <- t(t(feat.t)/100)
Possible error: Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "c('tbl_SQLiteConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"
Fix: the R version is too old and needs to be upgraded to R 4; see https://www.cnblogs.com/chenwenyan/p/15064291.html
We use the dataset from Zeller et al. as the external dataset:
x <- 'ZellerG_2014.metaphlan_bugs_list.stool'
feat.z <- curatedMetagenomicData(x=x, dryrun=FALSE)
feat.z <- feat.z[[x]]@assayData$exprs
# clean up metaphlan profiles to contain only species-level abundances
feat.z <- feat.z[grep(x=rownames(feat.z), pattern='s__'),]
feat.z <- feat.z[grep(x=rownames(feat.z),pattern='t__', invert = TRUE),]
stopifnot(all(colSums(feat.z) != 0))
feat.z <- t(t(feat.z)/100)
We can extract the corresponding metadata from combined_metadata, which is part of the curatedMetagenomicData package.
meta.t <- combined_metadata %>%
filter(dataset_name == 'ThomasAM_2018a') %>%
filter(study_condition %in% c('control', 'CRC'))
rownames(meta.t) <- meta.t$sampleID
meta.z <- combined_metadata %>%
filter(dataset_name == 'ZellerG_2014') %>%
filter(study_condition %in% c('control', 'CRC'))
rownames(meta.z) <- meta.z$sampleID
The MetaPhlAn2 profiler reports only the species detected in a dataset. Therefore, some species in the ThomasAM_2018 matrix may be missing from the ZellerG_2014 matrix, and vice versa. To use one as the training set and the other as the external test set in SIAMCAT, we first have to make the feature sets of the two datasets overlap completely (see the Holdout Testing with SIAMCAT tutorial).
species.union <- union(rownames(feat.t), rownames(feat.z))
# add Zeller_2014-only species to the Thomas_2018 matrix
add.species <- setdiff(species.union, rownames(feat.t))
feat.t <- rbind(feat.t,
matrix(0, nrow=length(add.species), ncol=ncol(feat.t),
dimnames = list(add.species, colnames(feat.t))))
# add Thomas_2018-only species to the Zeller_2014 matrix
add.species <- setdiff(species.union, rownames(feat.z))
feat.z <- rbind(feat.z,
matrix(0, nrow=length(add.species), ncol=ncol(feat.z),
dimnames = list(add.species, colnames(feat.z))))
# note: union() and setdiff() are what align the two feature spaces here
Now we prepare the model-training process. We choose several feature-selection cutoffs and prepare a tibble to hold the results.
fs.cutoff <- c(20, 100, 250)
auroc.all <- tibble(cutoff=character(0), type=character(0),
study.test=character(0), AUC=double(0))
First, we train a model without any feature selection, using all available features. We add the result under both the 'correct' and 'incorrect' labels to the results tibble for later plotting.
sc.obj.t <- siamcat(feat=feat.t, meta=meta.t,
label='study_condition', case='CRC')
sc.obj.t <- filter.features(sc.obj.t, filter.method = 'prevalence',
cutoff = 0.01)
sc.obj.t <- normalize.features(sc.obj.t, norm.method = 'log.std',
norm.param=list(log.n0=1e-05, sd.min.q=0))
sc.obj.t <- create.data.split(sc.obj.t,
num.folds = 10, num.resample = 10)
sc.obj.t <- train.model(sc.obj.t, method='lasso')
sc.obj.t <- make.predictions(sc.obj.t)
sc.obj.t <- evaluate.predictions(sc.obj.t)
auroc.all <- auroc.all %>%
add_row(cutoff='full', type='correct',
study.test='Thomas_2018',
AUC=as.numeric(sc.obj.t@eval_data$auroc)) %>%
add_row(cutoff='full', type='incorrect', study.test='Thomas_2018',
AUC=as.numeric(sc.obj.t@eval_data$auroc))
Next, we apply the model to the external dataset and record how well it generalizes to the other dataset.
sc.obj.z <- siamcat(feat=feat.z, meta=meta.z,
label='study_condition', case='CRC')
sc.obj.z <- make.predictions(sc.obj.t, sc.obj.z)
sc.obj.z <- evaluate.predictions(sc.obj.z)
auroc.all <- auroc.all %>%
add_row(cutoff='full', type='correct',
study.test='Zeller_2014',
AUC=as.numeric(sc.obj.z@eval_data$auroc)) %>%
add_row(cutoff='full', type='incorrect',
study.test='Zeller_2014',
AUC=as.numeric(sc.obj.z@eval_data$auroc))
Incorrect procedure: training with supervised feature selection. In the incorrect feature-selection procedure, we test the features for differential abundance on the complete dataset and then select the most strongly associated ones.
sc.obj.t <- check.associations(sc.obj.t, detect.lim = 1e-05,
fn.plot = 'assoc_plot.pdf')
mat.assoc <- associations(sc.obj.t)
mat.assoc$species <- rownames(mat.assoc)
# sort by p-value
mat.assoc <- mat.assoc %>% as_tibble() %>% arrange(p.val)
Based on the p-values from the check.associations function, we select the top X features for model training.
for (x in fs.cutoff){
# select x number of features based on p-value ranking
feat.train.red <- feat.t[mat.assoc %>%
slice(seq_len(x)) %>%
pull(species),]
sc.obj.t.fs <- siamcat(feat=feat.train.red, meta=meta.t,
label='study_condition', case='CRC')
# normalize the features without filtering
sc.obj.t.fs <- normalize.features(sc.obj.t.fs, norm.method = 'log.std',
norm.param=list(log.n0=1e-05,sd.min.q=0),feature.type = 'original')
# take the same cross validation split as before
data_split(sc.obj.t.fs) <- data_split(sc.obj.t)
# train
sc.obj.t.fs <- train.model(sc.obj.t.fs, method = 'lasso')
# make predictions
sc.obj.t.fs <- make.predictions(sc.obj.t.fs)
# evaluate predictions and record the result
sc.obj.t.fs <- evaluate.predictions(sc.obj.t.fs)
auroc.all <- auroc.all %>%
add_row(cutoff=as.character(x), type='incorrect',
study.test='Thomas_2018',
AUC=as.numeric(sc.obj.t.fs@eval_data$auroc))
# apply to the external dataset and record the result
sc.obj.z <- siamcat(feat=feat.z, meta=meta.z,
label='study_condition', case='CRC')
sc.obj.z <- make.predictions(sc.obj.t.fs, sc.obj.z)
sc.obj.z <- evaluate.predictions(sc.obj.z)
auroc.all <- auroc.all %>%
add_row(cutoff=as.character(x), type='incorrect',
study.test='Zeller_2014',
AUC=as.numeric(sc.obj.z@eval_data$auroc))
}
Correct procedure: training with nested feature selection. Nesting the feature selection within the cross-validation is the correct way to do feature selection. The nested procedure is enabled via the perform.fs parameter of the train.model function in the SIAMCAT package.
for (x in fs.cutoff){
# train using the original SIAMCAT object
# with correct version of feature selection
sc.obj.t.fs <- train.model(sc.obj.t, method = 'lasso', perform.fs = TRUE,
param.fs = list(thres.fs = x,method.fs = "AUC",direction='absolute'))
# make predictions
sc.obj.t.fs <- make.predictions(sc.obj.t.fs)
# evaluate predictions and record the result
sc.obj.t.fs <- evaluate.predictions(sc.obj.t.fs)
auroc.all <- auroc.all %>%
add_row(cutoff=as.character(x), type='correct',
study.test='Thomas_2018',
AUC=as.numeric(sc.obj.t.fs@eval_data$auroc))
# apply to the external dataset and record the result
sc.obj.z <- siamcat(feat=feat.z, meta=meta.z,
label='study_condition', case='CRC')
sc.obj.z <- make.predictions(sc.obj.t.fs, sc.obj.z)
sc.obj.z <- evaluate.predictions(sc.obj.z)
auroc.all <- auroc.all %>%
add_row(cutoff=as.character(x), type='correct',
study.test='Zeller_2014',
AUC=as.numeric(sc.obj.z@eval_data$auroc))
}
Plot the results. Now we plot the resulting performance estimates to compare cross-validation and external validation.
auroc.all %>%
# facetting for plotting
mutate(split=case_when(study.test=="Thomas_2018"~
'Cross validation (Thomas 2018)',
TRUE~"External validation (Zeller 2014)")) %>%
# convert to factor to enforce ordering
mutate(cutoff=factor(cutoff, levels = c(fs.cutoff, 'full'))) %>%
ggplot(aes(x=cutoff, y=AUC, col=type)) +
geom_point() + geom_line(aes(group=type)) +
facet_grid(~split) +
scale_y_continuous(limits = c(0.5, 1), expand = c(0,0)) +
xlab('Features selected') +
ylab('AUROC') +
theme_bw() +
scale_colour_manual(values = c('correct'='blue', 'incorrect'='red'),
name='Feature selection procedure') +
theme(panel.grid.minor = element_blank(), legend.position = 'bottom')
As you can see, the incorrect feature-selection procedure leads to inflated AUROC values but poorer generalization to a truly external dataset, especially when few features are selected. In contrast, the correct procedure gives lower cross-validation estimates but a better estimate of how the model performs on external data.
Another problem can arise in machine-learning workflows when samples are not independent. For example, microbiome samples taken from the same individual at different time points are usually more similar to each other than samples from different individuals. If such samples are split randomly in a naive cross-validation, samples from the same individual can end up in both the training and the test folds. In that case, the model learns to generalize across time points of the same individual, rather than learning to distinguish the labels as intended. To avoid this problem, dependent measurements need to be blocked during cross-validation, ensuring that samples within the same block stay in the same fold (for both training and testing); see the toy sketch below.
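A minimal base-R sketch of the blocking idea (hypothetical subject IDs; SIAMCAT implements this via the inseparable argument of create.data.split, shown later):
# hypothetical data: 4 subjects, 3 samples each
subject <- rep(paste0('subj', 1:4), each = 3)
# naive split: folds assigned per sample -> a subject can span folds
naive.fold <- sample(rep(1:2, length.out = length(subject)))
# blocked split: folds assigned per subject -> all samples of a subject stay together
subj.fold <- sample(rep(1:2, length.out = 4))
names(subj.fold) <- paste0('subj', 1:4)
blocked.fold <- subj.fold[subject]
table(subject, blocked.fold)   # each subject appears in exactly one fold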
Load the data. As an example, we use several Crohn's disease (CD) datasets from the EMBL cluster, which have already been filtered and cleaned. Since model training takes a long time, this part of the tutorial is not executed here; you can try running it yourself.
data.location <- 'https://www.embl.de/download/zeller/'
# metadata
meta.all <- read_tsv(paste0(data.location, 'CD_meta/meta_all.tsv'))
## Rows: 1597 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): Sample_ID, Group, Individual_ID, Study
## dbl (2): Library_Size, Timepoint
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# features
feat.motus <- read.table(paste0(data.location, 'CD_meta/feat_rel_filt.tsv'),
sep='\t', stringsAsFactors = FALSE,
check.names = FALSE)
When we compare the number of samples with the number of subjects, we find that the HMP2 study contains multiple samples per subject.
x <- meta.all %>%
group_by(Study, Group) %>%
summarise(n.all=n(), .groups='drop')
y <- meta.all %>%
select(Study, Group, Individual_ID) %>%
distinct() %>%
group_by(Study, Group) %>%
summarize(n.indi=n(), .groups='drop')
full_join(x,y)
## Joining, by = c("Study", "Group")
## # A tibble: 10 × 4
## Study Group n.all n.indi
##
## 1 Franzosa_2019 CD 88 88
## 2 Franzosa_2019 CTR 56 56
## 3 HMP2 CD 583 50
## 4 HMP2 CTR 357 26
## 5 He_2017 CD 49 49
## 6 He_2017 CTR 53 53
## 7 Lewis_2015 CD 294 85
## 8 Lewis_2015 CTR 25 25
## 9 metaHIT CD 21 13
## 10 metaHIT CTR 71 59
Therefore, we will train the model on HMP2. Since the number of samples per subject varies a lot, we randomly select up to 5 samples per subject. [Questions: why is it acceptable to take 5 samples per subject? After taking 5 samples, how should testing and cross-validation be performed correctly?]
meta.all %>%
filter(Study=='HMP2') %>%
group_by(Individual_ID) %>%
summarise(n=n(), .groups='drop') %>%
pull(n) %>% hist(20)
# sample 5 samples per individual
meta.train <- meta.all %>%
filter(Study=='HMP2') %>%
group_by(Individual_ID) %>%
sample_n(5, replace = TRUE) %>%
distinct() %>%
as.data.frame()
rownames(meta.train) <- meta.train$Sample_ID
For evaluation, we want only one sample per subject, so we create a new metadata table that removes the repeated samples from the other studies.
meta.ind <- meta.all %>%
group_by(Individual_ID) %>%
filter(Timepoint==min(Timepoint)) %>%
ungroup()
Finally, we prepare a tibble to store all the AUROC values.
auroc.all <- tibble(type=character(0), study.test=character(0), AUC=double(0))
Train with naive cross-validation. A naive cross-validation split does not take dependencies between samples into account. The workflow looks like this:
sc.obj <- siamcat(feat=feat.motus, meta=meta.train,
label='Group', case='CD')
sc.obj <- normalize.features(sc.obj, norm.method = 'log.std',
norm.param=list(log.n0=1e-05,sd.min.q=1),feature.type = 'original')
sc.obj.naive <- create.data.split(sc.obj, num.folds = 10, num.resample = 10)
sc.obj.naive <- train.model(sc.obj.naive, method='lasso')
sc.obj.naive <- make.predictions(sc.obj.naive)
sc.obj.naive <- evaluate.predictions(sc.obj.naive)
auroc.all <- auroc.all %>%
add_row(type='naive', study.test='HMP2',
AUC=as.numeric(eval_data(sc.obj.naive)$auroc))
Train with blocked cross-validation. The correct approach takes the repeated samples from the same subject into account by blocking the cross-validation, so that samples from the same subject always end up in the same fold. This is achieved via the inseparable parameter of the create.data.split function in the SIAMCAT package.
sc.obj.block <- create.data.split(sc.obj, num.folds = 10, num.resample = 10,
inseparable = 'Individual_ID')
sc.obj.block <- train.model(sc.obj.block, method='lasso')
sc.obj.block <- make.predictions(sc.obj.block)
sc.obj.block <- evaluate.predictions(sc.obj.block)
auroc.all <- auroc.all %>%
add_row(type='blocked', study.test='HMP2',
AUC=as.numeric(eval_data(sc.obj.block)$auroc))
## Split a dataset into training and test sets.
## create.data.split(siamcat, num.folds = 2, num.resample = 1, stratify = TRUE, inseparable = NULL, verbose = 1)
## If the inseparable parameter is provided, the data split takes this metadata variable into account. For example, if the data contain multiple samples from the same subject, it makes sense to keep samples from the same subject within the same fold. If inseparable is given, the stratify parameter is ignored.
Apply to external datasets. Now we can apply the models to the external datasets and record the resulting performance.
for (i in setdiff(unique(meta.all$Study), 'HMP2')){
meta.test <- meta.ind %>%
filter(Study==i) %>%
as.data.frame()
rownames(meta.test) <- meta.test$Sample_ID
# apply naive model
sc.obj.test <- siamcat(feat=feat.motus, meta=meta.test,
label='Group', case='CD')
sc.obj.test <- make.predictions(sc.obj.naive, sc.obj.test)
sc.obj.test <- evaluate.predictions(sc.obj.test)
auroc.all <- auroc.all %>%
add_row(type='naive', study.test=i,
AUC=as.numeric(eval_data(sc.obj.test)$auroc))
# apply blocked model
sc.obj.test <- siamcat(feat=feat.motus, meta=meta.test,
label='Group', case='CD')
sc.obj.test <- make.predictions(sc.obj.block, sc.obj.test)
sc.obj.test <- evaluate.predictions(sc.obj.test)
auroc.all <- auroc.all %>%
add_row(type='blocked', study.test=i,
AUC=as.numeric(eval_data(sc.obj.test)$auroc))
}
Plot the results. Now we compare the AUROC values obtained with the two approaches.
auroc.all %>%
# convert to factor to enforce ordering
mutate(type=factor(type, levels = c('naive', 'blocked'))) %>%
# facetting for plotting
mutate(CV=case_when(study.test=='HMP2'~'CV',
TRUE~'External validation')) %>%
ggplot(aes(x=study.test, y=AUC, fill=type)) +
geom_bar(stat='identity', position = position_dodge(), col='black') +
theme_bw() +
coord_cartesian(ylim=c(0.5, 1)) +
scale_fill_manual(values=c('red', 'blue'), name='') +
facet_grid(~CV, space = 'free', scales = 'free') +
xlab('') + ylab('AUROC') +
theme(legend.position = c(0.8, 0.8))
As you can see, the naive cross-validation procedure yields inflated performance relative to blocked cross-validation. However, when generalization to external datasets is evaluated, the blocked procedure leads to better performance.