无重复RNAseq样本Gfold值的计算

使用Gfold进行无重复RNAseq数据基因差异分析

1. 准备资料

输入的资料为read_count，要对两个样本进行差异分析则必需准备两个文件，每个文件格式如下：

文件1:Sample2.read_cnt

hsa-miR-146a-3p 20  48  20  20
novel_mir217    45  20  20  20
hsa-miR-548o-3p 20  40  20  20
hsa-miR-378f    20  17703   20  20
novel_mir857    20  38  20  20
novel_mir1083   20  37  20  20
novel_mir790    20  36  20  20
novel_mir20 20  31  20  20
novel_mir1517   20  25  20  20

文件2:Sample1.read_cnt

hsa-miR-146a-3p 20  0   20  20
novel_mir217    20  0   20  20
hsa-miR-548o-3p 20  0   20  20
hsa-miR-378f    20  16  20  20
novel_mir857    20  0   20  20
novel_mir1083   20  0   20  20
novel_mir790    20  0   20  20
novel_mir20 20  0   20  20
novel_mir1517   20  0   20  20

在不同版本中输入文件的要求不同，一般第一列表头为GeneName，必需要有5列（如果只想计算gfold值，就只需要保证 Read Count所属的那列正确，其他列可以随意编造，如需要计算RPKM值则Gene exon length所属的那一列不能随意编造）程序才会正确运行，并且列分隔符必须为tab键（'\t'）

在2015-05-24更新的V1.1.4版本中，列依次为GeneSymbol、GeneName、Read Count、Gene exon length、RPKM

R语言提取得到文件1和文件2代码：

注意两个文件的第一列必须完全相同

# raw data input
file = "H358_Vector.gene.FPKM.xls"
raw_data_Vec <- read.delim(file, stringsAsFactors=FALSE)
file <- "H358_LSH.gene.FPKM.xls"
raw_data_LSH <- read.delim(file, stringsAsFactors=FALSE)

## merge data into one table
data_merged <- merge(raw_data_Vec[, c(1, 4)], raw_data_LSH[, c(1, 4)], by.x=1, by.y=1, all=TRUE)
data_merged[is.na(data_merged)] <- 0

## prepare the input data for gfold
data1 <- data.frame(data_merged[, c(1, 1, 2, 1, 2)])
data2 <- data.frame(data_merged[, c(1, 1, 3, 1, 3)])

colnames(data1) <- c("GeneSymbol", "GeneName1", "Read Count1", "Gene exon length1", "RPKM1")
colnames(data2) <- c("GeneSymbol", "GeneName2", "Read Count2", "Gene exon length2", "RPKM2")

write.table(data1, file="Sample1Vec.read_cnt", row.names=F, col.names=F, quote=F, sep="\t")

write.table(data2, file="Sample2LSH.read_cnt", row.names=F, col.names=F, quote=F, sep="\t")

## gfold diff -s1 Sample1Vec.read_cnt -s2 Sample2LSH.read_cnt -o Sample1VSSample2.diff

2. 软件安装（只能在linux上使用）

2.1 安装 gsl-1.15

# /mnt/e/win_linux/gsl 替换为自己的安装目录
cd /mnt/e/win_linux/gsl

vi ~/.bashrc
export CXXFLAGS="-g -O3 -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib"
export LD_LIBRARY_PATH="/mnt/e/win_linux/gsl/lib:"$LD_LIBRARY_PATH 

source ~/.bashrc

wget ftp://ftp.gnu.org/gnu/gsl/gsl-2.2.tar.gz   
tar zxf gsl-1.15.tar.gz
cd gsl-1.15
./configure --prefix=/mnt/e/win_linux/gsl
make
make install

2.2 安装gfold

cd ..
## 在这里https://bitbucket.org/feeldead/gfold/downloads/?tab=tags选择一个版本下载，
## 注意不要选择最上面tag为tip的（安装不了）
wget https://bitbucket.org/feeldead/gfold/get/V1.1.4.zip
unzip V1.1.4.zip
cd feeldead-gfold-1921fd6dc668/
make

## 报错
## g++ -O3 -Wall -g main.cc -o gfold -lgsl -lgslcblas 
## In file included from GeneInfo.hpp:29:0,
##                  from main.cc:24:
## Utility.hpp:69:36: fatal error: gsl/gsl_statistics_int.h: No such file or directory
## compilation terminated.
## Makefile:11: recipe for target 'program' failed
## make: *** [program] Error 1

## 运行以下命令
export CXXFLAGS="-g -O3 -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib"  
export LD_LIBRARY_PATH="/mnt/e/win_linux/gsl/lib:"$LD_LIBRARY_PATH
g++ -O3 -Wall -g main.cc -o gfold -lgsl -lgslcblas -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib
## 虽然显示warning 但是已经安装好了，有可执行文件gfold了
./gfold -h

##      =============================================================================== 
##          gfold    :   Generalized fold change for ranking differentially expressed   
##                       genes from RNA-seq data.
##
##          Author   :   Jianxing Feng ([email protected])
##            Date   :   Sun May 24 07:42:36 CST 2015
##          Version  :   V1.1.4
##      =============================================================================== 
##
##      USAGE:   Please go to the source code directory and use command 'man doc/gfold.man' 
##               or open doc/gfold.html to find documentation.
##
##      Quick Examples: 
##               gfold count -ann hg19Ref.gtf -tag sample1.sam -o sample1.read_cnt          
##               gfold count -ann hg19Ref.gtf -tag sample2.sam -o sample2.read_cnt          
##               gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff 

rm -rf *.zip
mv feeldead-gfold-1921fd6dc668/ gfold.V1.1.4
echo "alias gfold='/mnt/e/win_linux/gfold.V1.1.4/gfold'" >> ~/.profile
source ~/.profile
gfold -h

3. 运行gfold筛选差异基因

# s1为样本1，s2为样本2，o为output，suf为后缀名（省略则默认表示为空）
gfold diff -s1 Sample1.read_cnt -s2 Sample2.read_cnt -o Sample1VSSample2.diff
## 与该命令等价的命令：
## gfold diff -s1 Sample1 -s2 Sample2 -suf .read_cnt -o Sample1VSSample2.diff

## -VL1 Loading read count for each gene from file Vector.read_cnt ...
## -VL1 1935 genes have been scanned
## -VL1 Loading read count for each gene from file LSH.read_cnt ...
## -VL1 1935 genes have been scanned
## -VL1 Calculating normalization constant ...
## -VL1 Normalization constant is: 
##      Vector.read_cnt    2.3238e+07    1.04573
##     LSH.read_cnt    2.80085e+07    1
## -VL1 Calculate RPKM ...
## -VL1 Calculate accurate GFOLD value ...
## 等待一段时间
## Job diff is DONE!

4. 结果解释

# This file is generated by gfold V1.1.4 on Fri Jan 19 21:19:05 2018
# Normalization constants :
#    Sample1N.read_cnt  6802895 1.35542
#    Sample2N.read_cnt  12483845    1
# The GFOLD value could be considered as a reliable log2 fold change.
# It is positive/negative if the gene is up/down regulated.
# A gene with zero GFOLD value should never be considered as 
# differentially expressed. For a comprehensive description of 
# GFOLD, please refer to the manual.
#GeneSymbol GeneName    GFOLD(0.01) E-FDR   log2fdc 1stRPKM 2ndRPKM
novel_mir1585   NA  6.45001 1   9.19028 0   3444.45
hsa-miR-4664-5p NA  5.70336 1   8.44996 0   2058.66
novel_mir252    NA  4.13781 1   6.91447 0   704.911
novel_mir1060   NA  3.64411 1   6.43874 0   504.652

4.1 GFOLD

A gene with zero GFOLD value should never be considered as differentially expressed.

The GFOLD value could be considered as a reliable log2 fold change.

It is positive/negative if the gene is up/down regulated.

注：如果GFOLD大于0说明Sample2的表达高于sample1，基因上调（表示为Sample1 VS Sample2）

4.2 E-FDR

Empirical FDR based on replicates. It is always 1 when no replicates are available

因此无重复样本的经验FDR均为1

4.3 log2fdc

log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.

-acc参数在无重复样本时使用，表示计算Gfold的校正方法，如果-acc值为T（默认值为T），表示根据测序深度进行校正（熟读较慢），如果-acc值为F，表示使用马尔科夫蒙特卡洛方法，-acc参数仅在diff功能下指定。

log2fdc表示log2(fold change)，该值默认是经校正的count值改变倍数取log2

4.3 RPKM

由于我输入的Gene exon length是假的，因此最后两列没有意义

参考：
https://kevinzjy.github.io/2017/05/20/Bioinfo-Differential-Expression/
http://www.biotrainee.com/thread-752-1-1.html
http://www.chenlianfu.com/?p=1085
https://bitbucket.org/feeldead/gfold
https://web.archive.org/web/20161125133249/http://compbio.tongji.edu.cn:80/~fengjx/GFOLD/gfold.html