Evaluating single-cell data integration results

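The goal of integration is to remove batch effects from the data, but removing batch effects can also remove biological differences that genuinely exist between datasets. scib evaluates an integration result from both sides: how well batch effects are removed, and how well biological variation is preserved.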

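The scib paper also benchmarks the mainstream integration methods. Besides ranking them, it gives recommendations on which method to use in different application scenarios. It is a very good paper.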

Benchmarking atlas-level data integration in single-cell genomics

Below is the evaluation function I use most often.

def compute_scib_metrics(adata_post, emb_key, label_key, batch_key, model_name):
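    """
    Compute scib metrics for one integrated AnnData object.

    Run neighbors on adata_post first (the graph-based LISI metrics need
    the neighbor graph). This function takes a long time to run.

    :param adata_post: integrated AnnData object
    :param emb_key: key in adata_post.obsm holding the integrated embedding
    :param label_key: column in adata_post.obs with cell-type labels
    :param batch_key: column in adata_post.obs with batch labels
    :param model_name: name of the integration method, used as the row index
    :return: one-row pandas DataFrame with the six metrics and overall_score
    """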

    print('-' * 10 + 'start to compute scib metrics' + '-' * 10)

    from scib.metrics.lisi import lisi_graph
    from scib.metrics.silhouette import silhouette, silhouette_batch
    from scib.metrics.isolated_labels import isolated_labels
    from scib.metrics.kbet import kBET
    import timeit
    import numpy as np
    import pandas as pd

    start = timeit.default_timer()

    order = ['clisi', 'sil_labels', 'isolated_labels', 'ilisi', 'sil_batch', 'kBET']
    df = pd.DataFrame(index=[model_name], columns=order)
    df["ilisi"], df["clisi"] = lisi_graph(adata_post, batch_key=batch_key, label_key=label_key)
    df["sil_labels"] = silhouette(adata_post, group_key=label_key, embed=emb_key)
    # if "dpt_pseudotime" in adata_pre.obs.columns:
    #     df["trajectory_conservation"] = trajectory_conservation(adata_pre, adata_post, label_key=label_key)
    # else:
    #     df["trajectory_conservation"] = 'None'
    df["isolated_labels"] = isolated_labels(adata_post, label_key=label_key, batch_key=batch_key, embed=emb_key)
    df["sil_batch"] = silhouette_batch(adata_post, batch_key=batch_key, group_key=label_key, embed=emb_key)
    df['kBET'] = kBET(adata_post, batch_key=batch_key, label_key=label_key, embed=emb_key)

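    # Select metrics by name so the grouping does not depend on column order
    l_bio = df.iloc[0][['clisi', 'sil_labels', 'isolated_labels']].astype(float)
    l_batch = df.iloc[0][['ilisi', 'sil_batch', 'kBET']].astype(float)
    # Overall score: 0.6 * biological conservation + 0.4 * batch removal
    overall_score = 0.6 * np.mean(l_bio) + 0.4 * np.mean(l_batch)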
    df['overall_score'] = overall_score

    end = timeit.default_timer()
    print(str(end - start) + ' sec')

    return df

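The function uses six metrics: 'clisi', 'sil_labels', 'isolated_labels', 'ilisi', 'sil_batch', and 'kBET'.
The first three measure biological conservation; the last three measure batch-effect removal.
All metrics are scaled to the range 0 to 1, and a higher value means a better result.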
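
As a worked example of how the six scores combine, the sketch below applies the same 0.6/0.4 weighting that compute_scib_metrics uses, here to made-up values. The helper name weighted_overall_score and the numbers in example_scores are hypothetical and serve only as illustration; they are not results from any real dataset.

```python
from statistics import mean

# Metric groups, matching the column order used in compute_scib_metrics
BIO_METRICS = ("clisi", "sil_labels", "isolated_labels")
BATCH_METRICS = ("ilisi", "sil_batch", "kBET")


def weighted_overall_score(scores, bio_weight=0.6):
    """Combine the mean bio-conservation and mean batch-removal scores.

    `scores` maps metric name -> value in [0, 1]. This helper is a
    hypothetical illustration, not part of the scib package.
    """
    bio = mean(scores[m] for m in BIO_METRICS)
    batch = mean(scores[m] for m in BATCH_METRICS)
    return bio_weight * bio + (1 - bio_weight) * batch


# Hypothetical values, chosen only to make the arithmetic easy to follow
example_scores = {
    "clisi": 0.9, "sil_labels": 0.6, "isolated_labels": 0.6,  # mean 0.7
    "ilisi": 0.4, "sil_batch": 0.8, "kBET": 0.6,              # mean 0.6
}

# 0.6 * 0.7 + 0.4 * 0.6 = 0.66
print(round(weighted_overall_score(example_scores), 3))
```

Because the biological group carries the larger weight, a method that mixes batches well but blurs cell types will score lower than one with the opposite profile.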

Required environment:
Linux or UNIX system

Python >= 3.7

R >= 3.6

Required package:
pip install scib
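
Before starting a long run, it can help to check these requirements quickly. The sketch below uses only the Python standard library; the function name check_environment is hypothetical, and it assumes that R, if installed, exposes the Rscript executable on PATH. It does not verify the R version itself.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a dict of simple yes/no checks for the requirements above."""
    return {
        # Python >= 3.7
        "python_ok": sys.version_info >= (3, 7),
        # scib importable in the current environment (pip install scib)
        "scib_installed": importlib.util.find_spec("scib") is not None,
        # Rscript on PATH (assumption: this is how R is located here)
        "r_found": shutil.which("Rscript") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```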

For more detailed usage and documentation, see
scib
