ACM TechNews Digest (3): Manifold Learning Software for Cellular Gene-Expression Data

Keywords: high-dimensional data processing, manifold learning



[Figure: cell clustering map obtained with a manifold learning algorithm (tSNE)]

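Researchers at Helmholtz Zentrum München (the German Research Center for Environmental Health) have developed Scanpy (presumably "scan" + "Python"), a machine learning software package for handling extremely large datasets. It is also one of the candidate analysis tools for the Human Cell Atlas project.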

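Fabian Theis, a professor at the University of Munich, says: "For this project, and in a growing number of other projects in which databases are combined, it is important to have scalable software." In his view it is therefore no surprise that Scanpy is a candidate for helping to analyze the Human Cell Atlas. The publication of Scanpy marks the first time software has enabled comprehensive analysis of large gene-expression datasets with a broad range of machine learning and statistical methods. On the architecture side, traditional biostatistics programs are written in R, whereas Scanpy is built on Python, the dominant language in the machine learning community, and relies on graph-based algorithms. Unlike the usual approach, which treats cells as points in gene-expression space, Scanpy works in a graph-like coordinate system: it characterizes each cell by its nearest neighbors rather than directly by its expression values for thousands of genes. To identify cell types, it uses the same algorithms that Facebook uses to identify communities in its social network.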
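The graph-based idea can be illustrated with a small sketch. This is not Scanpy's implementation: it uses only the Python standard library, the toy data and the function names (`make_cells`, `knn_graph`, `connected_components`) are made up for illustration, and connected components stand in for the community-detection step, which is a much simpler technique than the community-detection algorithms the article refers to.

```python
import math
import random


def make_cells(n_per_type=20, n_genes=5, seed=0):
    """Generate toy expression profiles for two hypothetical cell types."""
    rng = random.Random(seed)
    cells = []
    for center in (0.0, 10.0):  # two well-separated expression levels
        for _ in range(n_per_type):
            cells.append([center + rng.gauss(0.0, 0.5) for _ in range(n_genes)])
    return cells


def knn_graph(cells, k=10):
    """Build an undirected k-nearest-neighbor graph as an adjacency dict."""
    n = len(cells)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        # Sort all other cells by Euclidean distance in expression space.
        dists = sorted(
            (math.dist(cells[i], cells[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Label each node with the index of its connected component.

    Simplified stand-in for community detection: it only separates
    groups that share no edges at all.
    """
    labels = {}
    current = 0
    for start in adj:
        if start in labels:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in labels:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return labels


cells = make_cells()
graph = knn_graph(cells, k=10)
labels = connected_components(graph)
n_groups = len(set(labels.values()))
print("groups found:", n_groups)
```

Once the graph is built, the grouping step never looks at expression values again; each cell is described only by which cells it is connected to, which is the point made in the paragraph above. On real data the groups are not cleanly disconnected, which is why an actual community-detection algorithm is needed instead of connected components.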

"For this project, and in a growing number of other projects in which databases are combined, it is important to have scalable software," says University of Munich professor Fabian Theis. He notes it is therefore no surprise that Scanpy is a candidate for helping to analyze the Human Cell Atlas. Theis says the publication of Scanpy represents the first time software has been developed to enable comprehensive analysis of large gene-expression datasets with a broad range of machine learning and statistical methods. Scanpy is based on the Python language, the dominant language in the machine-learning community. In addition, Theis says Scanpy relies on graph-based algorithms, differentiating the system from other biostatistics programs, which are traditionally written in the R programming language. Unlike the usual approach of regarding cells as points in a coordinate system within gene-expression space, the algorithms use a graph-like coordinate system. Instead of characterizing a single cell by the expression value for thousands of genes, the system simply characterizes cells by identifying their closest neighbors - very much like the connections in social networks. In fact, to identify cell types, Scanpy uses the same algorithms as Facebook does for identifying communities.
