SubMap: notes on the official usage instructions
Input data sets should have common identifiers. The intersection of these data sets is automatically extracted.
Requirements for the training-set and test-set expression matrices:
datasetA (file): Input dataset A (gct). It should share gene IDs with dataset B. Note: remove spaces from sample names.
datasetB (file): Input dataset B (gct). It should share gene IDs with dataset A. Note: remove spaces from sample names.
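The sketch below illustrates one way to prepare such a file. It assumes the standard GCT v1.2 layout (a `#1.2` line, a row/column count line, then a `Name`/`Description` header) and uses made-up gene and sample names. The helper names `clean_sample_name`, `restrict_to_common_genes`, and `write_gct` are hypothetical and not part of GenePattern. Restricting to shared genes is optional, since SubMap extracts the intersection itself, but doing it locally shows how much overlap the two data sets have.

```python
import io

def clean_sample_name(name):
    # SubMap asks for sample names without spaces; joining with "_" is an arbitrary choice.
    return "_".join(name.split())

def restrict_to_common_genes(genes_a, matrix_a, genes_b, matrix_b):
    # Keep only the gene IDs present in both data sets, in sorted order.
    common = sorted(set(genes_a) & set(genes_b))
    idx_a = {g: i for i, g in enumerate(genes_a)}
    idx_b = {g: i for i, g in enumerate(genes_b)}
    rows_a = [matrix_a[idx_a[g]] for g in common]
    rows_b = [matrix_b[idx_b[g]] for g in common]
    return common, rows_a, rows_b

def write_gct(handle, genes, samples, matrix):
    # GCT v1.2: version line, "<rows>\t<columns>", header line, then one row per gene.
    handle.write("#1.2\n")
    handle.write(f"{len(genes)}\t{len(samples)}\n")
    header = ["Name", "Description"] + [clean_sample_name(s) for s in samples]
    handle.write("\t".join(header) + "\n")
    for gene, row in zip(genes, matrix):
        # The gene ID is reused as the description for simplicity.
        fields = [gene, gene] + [f"{value:g}" for value in row]
        handle.write("\t".join(fields) + "\n")

# Toy data (hypothetical values).
genes_a, samples_a = ["TP53", "MYC", "EGFR"], ["tumor 1", "tumor 2"]
matrix_a = [[5.1, 4.8], [7.2, 6.9], [3.3, 3.0]]
genes_b = ["MYC", "TP53", "BRCA1"]
matrix_b = [[6.5, 6.1], [4.9, 5.0], [2.2, 2.4]]

common, rows_a, rows_b = restrict_to_common_genes(genes_a, matrix_a, genes_b, matrix_b)
buffer = io.StringIO()
write_gct(buffer, common, samples_a, rows_a)
gct_text = buffer.getvalue()
```

To write a real file, pass `open("datasetA.gct", "w")` instead of the in-memory buffer.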
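Subtype (class) labels for the two data sets, supplied as cls files: the third line should contain numbers. Note: class labels must be consecutive integers starting from 1. If the labels in a cls file start from 0, the SubMap module automatically adds 1 to every label.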
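As a rough illustration, a cls file with 1-based labels could be generated as follows. This assumes the usual three-line categorical CLS layout (counts, class names after `#`, then one label per sample). The function name `make_cls` and the subtype names are hypothetical.

```python
def make_cls(subtypes):
    # Collect class names in order of first appearance; names must not contain spaces.
    class_names = []
    for subtype in subtypes:
        if subtype not in class_names:
            class_names.append(subtype)
    # SubMap expects consecutive integer labels starting from 1.
    labels = [class_names.index(subtype) + 1 for subtype in subtypes]
    line1 = f"{len(subtypes)} {len(class_names)} 1"
    line2 = "# " + " ".join(class_names)
    line3 = " ".join(str(label) for label in labels)
    return "\n".join([line1, line2, line3]) + "\n"

# One subtype per sample, in the same order as the columns of the gct file.
cls_text = make_cls(["basal", "luminal", "basal", "her2"])
```

The sample order in the cls file has to match the column order of the corresponding gct file, because labels are assigned by position.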
The remaining parameters can be left at their default values.
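1. Log in, click to select a module, and type submap.
2. Prepare the files locally as required, then upload each one to the corresponding input field.
When the job finishes, click "job results" in the top-left corner and select SubMap_SubMapResult.txt to download the results for further processing.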
Reference: SubMap (genepattern.org)
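Supplement: when running SubMap online in GenePattern, the two input expression matrices should contain logTPM (log-transformed TPM) values. The log transformation compresses the dynamic range of gene expression, which makes the two data sets more comparable and easier to interpret.

There is no specific requirement on the number or type of genes in the input matrices. For more accurate results, however, choose data sets that cover the same or a closely related set of genes. Rows of the expression matrix should correspond to genes and columns to samples.

The workflow for running SubMap online in GenePattern is as follows:
1. Open the GenePattern website (https://genepattern.broadinstitute.org/) and log in.
2. In the "Modules" tab, select the "SubMap" module.
3. Under "Input Files", click "Select File" and choose the expression matrix file for the first input data set.
4. Under "Input Files", click "Select File" and choose the expression matrix file for the second input data set.
5. Under "Parameters", adjust the settings if needed, or keep the defaults.
6. Click the "Run" button at the bottom of the page to start the analysis.
7. When the analysis finishes, view the results, such as the subclass mapping plots and the association matrix.

This is only the basic workflow, and the actual steps may differ. Before running a real analysis, read the GenePattern documentation or tutorials for more detailed guidance.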
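The note above does not say which logarithm or pseudocount to use for logTPM. A common convention, assumed here, is log2(TPM + 1), which maps a TPM of 0 to 0. The function name `log_tpm` is hypothetical.

```python
import math

def log_tpm(matrix, pseudocount=1.0):
    # Assumption: base-2 logarithm with a pseudocount of 1, i.e. log2(TPM + 1).
    return [[math.log2(value + pseudocount) for value in row] for row in matrix]

# Rows are genes, columns are samples (toy TPM values).
tpm = [[0.0, 3.0], [7.0, 1023.0]]
log_matrix = log_tpm(tpm)
```

Whatever convention is chosen, it is safest to apply the same one to both data sets.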
Data preprocessing as described in the paper: Subclass Mapping: Identifying Common Subtypes in Independent Disease Data Sets - PMC (nih.gov)
We started from data sets that were already normalized for their respective study without any additional normalization procedure to account for different platform derivation. For the signal intensity data generated by one-channel oligonucleotide microarrays, Affymetrix's GeneChip, we applied a lower threshold of 20U and an upper threshold of 16,000U. For the log2 transformed ratio data generated by cDNA microarrays, we first removed genes whose values were missing in more than 5% of the samples, and then imputed the missing values for the rest of the genes using a k-nearest neighbor algorithm (ImputeMissingValues.KNN, in the GenePattern software package, GenePattern).
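The thresholding and missing-value filter described in this passage can be sketched as below. This is only an illustration under stated assumptions: missing values are represented as `None`, the function names are hypothetical, and the k-nearest-neighbor imputation step (done with ImputeMissingValues.KNN in the paper) is not reproduced here.

```python
def clip_intensities(matrix, floor=20.0, ceiling=16000.0):
    # Raise values below the floor to the floor and cap values above the ceiling.
    return [[min(max(value, floor), ceiling) for value in row] for row in matrix]

def drop_genes_with_missing(genes, matrix, max_missing_fraction=0.05):
    # Keep a gene only if at most 5% of its sample values are missing (None).
    kept_genes, kept_rows = [], []
    for gene, row in zip(genes, matrix):
        missing = sum(1 for value in row if value is None)
        if missing / len(row) <= max_missing_fraction:
            kept_genes.append(gene)
            kept_rows.append(row)
    return kept_genes, kept_rows

# Oligonucleotide-style intensities (toy values).
clipped = clip_intensities([[5.0, 150.0, 20000.0]])

# cDNA-style log2 ratios with missing entries (toy values, 10 samples per gene).
ratio_genes = ["geneA", "geneB"]
ratio_rows = [[0.1] * 10, [0.2] * 9 + [None]]
kept_genes, kept_rows = drop_genes_with_missing(ratio_genes, ratio_rows)
```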
Before marker gene selection, we used the following gene filtering. For the oligonucleotide array data, only genes exhibiting at least 3-fold differential expression and an absolute difference of at least 100 units across the samples in the experiment were included. For the cDNA array data, only genes with an absolute log2 ratio greater than one and whose difference in log2 ratio across all the samples in the data set was greater than one were included.
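One possible reading of these two filters is sketched below. For the oligonucleotide data, fold change is taken as max/min and the absolute difference as max minus min across samples. For the cDNA data, the code assumes "absolute log2 ratio greater than one" means that at least one sample exceeds 1 in absolute value; the paper does not spell this out, so treat it as an interpretation. The function names are hypothetical.

```python
def passes_oligo_filter(row, min_fold=3.0, min_delta=100.0):
    # Require max/min >= 3 and max - min >= 100 across samples.
    lowest, highest = min(row), max(row)
    return lowest > 0 and highest / lowest >= min_fold and highest - lowest >= min_delta

def passes_cdna_filter(row, min_abs=1.0, min_range=1.0):
    # Interpretation: some |log2 ratio| > 1, and the spread across samples > 1.
    largest_abs = max(abs(value) for value in row)
    return largest_abs > min_abs and max(row) - min(row) > min_range

oligo_rows = {
    "variable": [20.0, 50.0, 400.0],     # 20-fold, difference 380
    "low_fold": [1000.0, 1050.0, 1090.0],  # large values, small fold change
    "low_delta": [20.0, 60.0, 90.0],     # 4.5-fold, difference only 70
}
oligo_kept = [gene for gene, row in oligo_rows.items() if passes_oligo_filter(row)]

cdna_rows = {"changing": [-0.2, 0.4, 1.6], "flat": [0.1, 0.3, 0.5]}
cdna_kept = [gene for gene, row in cdna_rows.items() if passes_cdna_filter(row)]
```

The `lowest > 0` check guards against division by zero; after the 20U lower threshold is applied, every intensity is positive anyway.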
Before applying the SubMap, each microarray probe ID was converted into its corresponding HUGO gene symbol (http://www.gene.ucl.ac.uk/nomenclature/), and multiple probe data corresponding to a single gene symbol was averaged. The number of genes remaining for our analyses of multiple tissue types, DLBCL, breast cancer, and DLBCL (with survival data) data sets were 5565, 661, 1213, and 3795, respectively.
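Collapsing probes to gene symbols by averaging, as described above, might look like the following sketch. The probe IDs, the mapping, and the function name `collapse_probes` are made up for illustration; a real mapping would come from the platform annotation.

```python
from collections import defaultdict

def collapse_probes(probe_rows, probe_to_symbol):
    # Group probe-level rows by gene symbol, skipping probes without a symbol.
    grouped = defaultdict(list)
    for probe, row in probe_rows.items():
        symbol = probe_to_symbol.get(probe)
        if symbol:
            grouped[symbol].append(row)
    # Average the probes of each gene sample by sample.
    return {
        symbol: [sum(column) / len(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }

# Hypothetical probe IDs and values (two samples per probe).
probe_rows = {
    "probe_1": [2.0, 4.0],
    "probe_2": [4.0, 8.0],
    "probe_3": [1.0, 1.0],
    "probe_4": [9.0, 9.0],  # no gene symbol, so it is dropped
}
probe_to_symbol = {"probe_1": "TP53", "probe_2": "TP53", "probe_3": "MYC"}
gene_rows = collapse_probes(probe_rows, probe_to_symbol)
```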