miRNA array笔记1

1. 对gtf文件进行处理：

cat GSE52313_transcripts.gtf | grep "gene_name" |sed 's/$.*$gene_id$.*$; transcript_id$.*$; gene_name$.*$; oId$.*$/\2\4/g' |sed 's/"//g' | awk '{print $1"="$2}' | sort -n | uniq -c > GSE52313_transcripts_sort.gtf

awk -F" *" '{print $3}' GSE52313_transcripts_sort.gtf | awk -F"=" '{print $1,$2}' > GSE52313_transcripts_renew.gtf

2. 提取lncRNA:

awk 'NR==FNR{a[$1]=$2}NR!=FNR{if ($1 in a){print a[$1],$2,$3,$4,$5,$6,$7,$8,$9}}' GSE52313_transcripts_renew.gtf GSE52313_lnRNA

3. 对lncRNA表达量文件进行基因名替换

awk 'NR==FNR{a[$1]=$2}NR!=FNR{if ($1 in a){print a[$1],$2,$3,$4,$5,$6,$7,$8,$9}}' GSE52313_transcripts_renew.gtf GSE52313_lnRNA_gene_counts > GSE52313_lnRNA_gene_counts_renew

4.根据table S1的信息：sharm: 8R 3RL 12RL 7LR

MI:5O 7R 11L 2L

5. 对lncRNA进行差异表达分析：

首先需要读入一个数据框，列代表每个sample，行代表每个gene

database_all <- read.table(file = "GSE52313_lnRNA_gene_counts_renew", sep = "\t", header = T, row.names=1)

这里主要对于两两比较的数据，因此我取了数据的前6列，分别是两组样品，每组3个生物学重复

设定分组信息，也就是样本分组的名称

type <- factor(c(rep("LC_1",4), rep("LC_2",4)))

我这里是样品1是LC_1，样品2是LC_2

由于DESeq包要求接下来的count data必须要整数型，因此我们需要对数据进行取整,然后将数据database和分组信息type读入到cds对象中

database <- round(as.matrix(database_all))

source("https://bioconductor.org/biocLite.R")

biocLite('BiocInstaller')

biocLite("DESeq2")

coldata <- data.frame(row.names = colnames(database), type)

dds <- DESeqDataSetFromMatrix(countData=database, colData=coldata, design=~type)

DESeq2的使用方法：

输入矩阵数据，行名为sample，列名为gene；DESeq2不支持无生物学重复的数据，因此我选择了2个样本，3个生物学重复的数据；并对count data取整（经大神指点，这里需要说明下，我的测试数据readcount是RSEM定量的结果，并不是常见的htseq-count的结果，所以count值会有小数点，而DESeq2包不支持count数有小数点，所以这里需要round取整）。

设置分组信息以及构建dds对象

dds <- DESeqDataSetFromMatrix(countData=database, colData=coldata, design=~condition)

使用DESeq函数进行估计离散度，然后进行标准的差异表达分析，得到res对象结果

前过滤:

keep<-rowSums(counts(dds)) >= 10

dds<-dds[keep,]

注：并不是必须，不影响计算结果，这样做的优点是dds对象占用的内存小点，后续的计算耗时小点。

dds <- DESeq(dds)

res <- results(dds)

最后设定阈值，筛选差异基因，导出数据

table(res$padj <0.05)

res <- res[order(res$padj),]

resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)),by="row.names",sort=FALSE)

write.csv(resdata,file = "LC_1_vs_LC_2.csv")

#从LC_1_vs_LC_2.csv中筛选lncRNA和mRNA，

miRNA array笔记1

你可能感兴趣的:(miRNA array笔记1)