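Reference article:
RNA-seq(6): counting reads, merging the matrices, and annotating
I have been tinkering with R and related operations for almost two weeks now. I read tutorials on several websites and borrowed a few textbooks, only to find that they all offer scattered fragments of knowledge. So I will stick to my earlier approach: learn by doing, and fill in the details of each topic as it comes up.
Tutorials for installing R and RStudio on Windows 10 are plentiful online, and the process is simple. One reminder: do not chase the newest version the way I, a perfectionist, tend to; it causes trouble when some packages have not been updated yet. For the record, I installed R 3.6.3, which is relatively stable.
1. Use the wc command to summarize the results:
# Command usage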
Usage: wc [OPTION]... [FILE]...
or: wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. A word is a non-zero-length sequence of
characters delimited by white space.
The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-L, --max-line-length print the maximum display width
-w, --words print the word counts
--help display this help and exit
--version output version information and exit
# Session record
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll *.count
-rw-rw-r-- 1 zexing zexing 237K 6月 5 00:26 m3108.count
-rw-rw-r-- 1 zexing zexing 244K 6月 5 01:29 m3110.count
-rw-rw-r-- 1 zexing zexing 222K 6月 5 01:29 m3111.count
-rw-rw-r-- 1 zexing zexing 244K 6月 5 02:26 m3112.count
-rw-rw-r-- 1 zexing zexing 244K 6月 5 03:24 m3113.count
-rw-rw-r-- 1 zexing zexing 244K 6月 5 04:22 m3114.count
-rw-rw-r-- 1 zexing zexing 245K 6月 4 23:04 msh1.count
-rw-rw-r-- 1 zexing zexing 245K 6月 5 00:12 msh2.count
-rw-rw-r-- 1 zexing zexing 244K 6月 4 21:57 Scr.count
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ wc -l *.count
24426 m3108.count
24426 m3110.count
24426 m3111.count
24426 m3112.count
24426 m3113.count
24426 m3114.count
24426 msh1.count
24426 msh2.count
24426 Scr.count
219834 total
# Result: every file from the same sequencing batch has the same number of lines
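The visual comparison above can also be scripted, which helps when there are more files than fit on one screen. The following is a minimal sketch, assuming a POSIX shell with standard coreutils; the function name `distinct_line_counts` and the tiny demo files are made up for illustration and are not part of the original workflow.

```shell
# Sketch: confirm that every .count file has the same number of lines.
# The demo files below are small stand-ins (hypothetical data), not the real results.
demo_dir=$(mktemp -d)
printf 'GeneA\t1\nGeneB\t2\n__no_feature\t9\n' > "$demo_dir/s1.count"
printf 'GeneA\t4\nGeneB\t0\n__no_feature\t7\n' > "$demo_dir/s2.count"

# Print how many distinct line counts occur among the given files.
distinct_line_counts() {
    for f in "$@"; do
        wc -l < "$f"
    done | tr -d ' ' | sort -u | wc -l | tr -d ' '
}

n_distinct=$(distinct_line_counts "$demo_dir"/*.count)
if [ "$n_distinct" -eq 1 ]; then
    echo "OK: all files have the same number of lines"
else
    echo "WARNING: line counts differ between files"
fi
```

On the real data the function would be called as `distinct_line_counts *.count` from the directory that holds the count files.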
2. Use the head/tail commands to inspect the first and last lines of the results:
# head command usage
Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
Mandatory arguments to long options are mandatory for short options too.
-c, --bytes=[-]NUM print the first NUM bytes of each file;
with the leading '-', print all but the last
NUM bytes of each file
-n, --lines=[-]NUM print the first NUM lines instead of the first 10;
with the leading '-', print all but the last
NUM lines of each file
-q, --quiet, --silent never print headers giving file names
-v, --verbose always print headers giving file names
-z, --zero-terminated line delimiter is NUL, not newline
--help display this help and exit
--version output version information and exit
# Session record
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ head -n 14 Scr.count
0610005C13Rik 0
0610007P14Rik 485
0610009B22Rik 213
0610009L18Rik 13
0610009O20Rik 510
0610010B08Rik 0
0610010F05Rik 114
0610010K14Rik 452
0610011F06Rik 397
0610012G03Rik 490
0610030E20Rik 172
0610031O16Rik 0
0610037L13Rik 344
0610038B21Rik 30
# tail command usage
Usage: tail [OPTION]... [FILE]...
Print the last 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
Mandatory arguments to long options are mandatory for short options too.
-c, --bytes=[+]NUM output the last NUM bytes; or use -c +NUM to
output starting with byte NUM of each file
-f, --follow[={name|descriptor}]
output appended data as the file grows;
an absent option argument means 'descriptor'
-F same as --follow=name --retry
-n, --lines=[+]NUM output the last NUM lines, instead of the last 10;
or use -n +NUM to output starting with line NUM
--max-unchanged-stats=N
with --follow=name, reopen a FILE which has not
changed size after N (default 5) iterations
to see if it has been unlinked or renamed
(this is the usual case of rotated log files);
with inotify, this option is rarely useful
--pid=PID with -f, terminate after process ID, PID dies
-q, --quiet, --silent never output headers giving file names
--retry keep trying to open a file if it is inaccessible
-s, --sleep-interval=N with -f, sleep for approximately N seconds
(default 1.0) between iterations;
with inotify and --pid=P, check process P at
least once every N seconds
-v, --verbose always output headers giving file names
-z, --zero-terminated line delimiter is NUL, not newline
--help display this help and exit
--version output version information and exit
# Session record
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ tail -n 14 Scr.count
Zxdb 51
Zxdc 609
Zyg11a 0
Zyg11b 354
Zyx 1642
Zzef1 326
Zzz3 502
a 0
l7Rn6 328
__no_feature 10717770
__ambiguous 211131
__too_low_aQual 579837
__not_aligned 312551
__alignment_not_unique 1244960
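The output shows that the last five lines (the rows whose names start with __) are summary counters rather than genes, so they should be removed.
How to remove them:
# Each file has 24426 lines in total; delete the last 5, i.e. lines 24422-24426, with sed -i.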
#Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned/count$ sed -i '24422,$d' *.count
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned/count$ tail -n 10 *.count
==> m3108.count <==
Zxda 20
Zxdb 41
Zxdc 781
Zyg11a 0
Zyg11b 301
Zyx 2073
Zzef1 395
Zzz3 305
a 0
l7Rn6 389
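The line-number approach works only while every file has exactly 24426 lines. A pattern-based alternative is sketched below; it assumes that every summary row starts with `__` and that no gene name does, which holds for the output shown here but is an assumption for other data. The demo file and its name are made up for illustration.

```shell
# Sketch: drop the summary rows by pattern instead of by line number.
# Assumption: summary rows start with "__" and no gene name does.
demo_dir=$(mktemp -d)
printf 'Zzz3\t502\na\t0\nl7Rn6\t328\n__no_feature\t10717770\n__ambiguous\t211131\n' \
    > "$demo_dir/demo.count"

# Write a filtered copy instead of editing in place,
# so the raw output is still available if something goes wrong.
grep -v '^__' "$demo_dir/demo.count" > "$demo_dir/demo.clean.count"

cat "$demo_dir/demo.clean.count"
```

Keeping the original files untouched also makes the step repeatable, which `sed -i` does not guarantee if the line counts ever change.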
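As the output above shows, each .count file has two columns, the gene name and the read count, but no column names. To make the later merge possible, give each file column names on import and assign each sample to its own variable.
For the read.table() function, see: "read.table function explained", "Reading data files with read.table()", and "R read.table table-reading parameters explained".
# First set the string option. In R 3.6.3, read.table() converts character columns to factors by default; this option turns that off.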
> options(stringsAsFactors = FALSE)
# Set the global option options(stringsAsFactors = FALSE) inside a parent function and restore the
# option after the parent function exits
# Usage and description of the read.table() function
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown")
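header: logical; whether the first line of the file contains the variable (column) names.
na.strings: the strings to be interpreted as missing values.
skip: the number of lines to skip before reading data.
nrows: the maximum number of rows to read.
dec: the character used for the decimal point.
sep: the field separator character.
row.names and col.names: the row names and column names to assign to the data.
# Import the data, set the group information, and add column names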
> Control_1 <- read.table("D:/zhaoxiujuan/m3108.count", sep = "\t", col.names = c("gene_id","control_1"))
> Control_2 <- read.table("D:/zhaoxiujuan/m3111.count", sep = "\t", col.names = c("gene_id","control_2"))
> sh_1_1 <- read.table("D:/zhaoxiujuan/m3112.count", sep = "\t", col.names = c("gene_id","sh_1_1"))
> sh_1_2 <- read.table("D:/zhaoxiujuan/m3113.count", sep = "\t", col.names = c("gene_id","sh_1_2"))
> sh_2_3 <- read.table("D:/zhaoxiujuan/m3114.count", sep = "\t", col.names = c("gene_id","sh_2_3"))
> sh_2_2 <- read.table("D:/zhaoxiujuan/m3110.count", sep = "\t", col.names = c("gene_id","sh_2_2"))
# Check the first and last rows of the imported data
> head(Control_1)
gene_id control_1
1 0610005C13Rik 0
2 0610007P14Rik 230
3 0610009B22Rik 46
4 0610009L18Rik 3
5 0610009O20Rik 157
6 0610010B08Rik 0
> tail(Control_1)
gene_id control_1
24416 Zyg11b 73
24417 Zyx 492
24418 Zzef1 94
24419 Zzz3 65
24420 a 0
24421 l7Rn6 104
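Reference article: Merging data with the merge() function in R
# Usage of the merge() function
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all)
by, by.x, by.y: names of the columns to match on in the two data frames.
all, all.x, all.y: logical values specifying the type of merge.
# merge() joins two data frames on the columns or row names they have in common.
# The simplest case is two data frames
> raw_count <- merge(Control_1, Control_2, by="gene_id")
> head(raw_count)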
gene_id control_1 control_2
1 0610005C13Rik 0 0
2 0610007P14Rik 230 0
3 0610009B22Rik 46 0
4 0610009L18Rik 3 0
5 0610009O20Rik 157 1
6 0610010B08Rik 0 0
# Two merge() calls can also be nested inside a third one to merge 4 samples at once
> raw_count <- merge(merge(Control_1, Control_2, by="gene_id"), merge(sh_1_1, sh_1_2, by="gene_id"))
> head(raw_count)
gene_id control_1 control_2 sh_1_1 sh_1_2
1 0610005C13Rik 0 0 4 5
2 0610007P14Rik 230 0 1119 1197
3 0610009B22Rik 46 0 225 272
4 0610009L18Rik 3 0 12 12
5 0610009O20Rik 157 1 684 702
6 0610010B08Rik 0 0 0 0
# Each merge() call joins exactly two data frames, so more samples can be combined by nesting further calls.
> raw_count <- merge(merge(merge(Control_1, Control_2, by="gene_id"), merge(sh_1_1, sh_1_2, by="gene_id")), merge(sh_2_2, sh_2_3, by="gene_id"))
> head(raw_count)
gene_id control_1 control_2 sh_1_1 sh_1_2 sh_2_2 sh_2_3
1 0610005C13Rik 0 0 4 5 1 0
2 0610007P14Rik 230 0 1119 1197 1868 1439
3 0610009B22Rik 46 0 225 272 285 228
4 0610009L18Rik 3 0 12 12 16 16
5 0610009O20Rik 157 1 684 702 499 636
6 0610010B08Rik 0 0 0 0 0 0
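For a quick cross-check outside R, the same key-based merge can be approximated on the command line with `join`. This is only a sketch under stated assumptions: the files are tab-separated, have no header line, and are sorted by gene name in byte (C locale) order. The demo files and their contents are made up for illustration.

```shell
# Sketch: merge two count files on the gene name (column 1) with join.
# Assumptions: tab-separated, no header line, sorted in C-locale (byte) order.
demo_dir=$(mktemp -d)
printf 'GeneA\t0\nGeneB\t230\nGeneC\t46\n' > "$demo_dir/control_1.count"
printf 'GeneA\t4\nGeneB\t1119\nGeneC\t225\n' > "$demo_dir/sh_1_1.count"

tab=$(printf '\t')
# LC_ALL=C makes join expect the same byte ordering that the files are sorted in.
LC_ALL=C join -t "$tab" "$demo_dir/control_1.count" "$demo_dir/sh_1_1.count" \
    > "$demo_dir/merged.tsv"

cat "$demo_dir/merged.tsv"
```

Like merge() with its default all = FALSE, `join` keeps only the gene names present in both files, so comparing the row count of the result with the row counts of the inputs is a cheap way to spot mismatched gene lists.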
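This old rookie learns too slowly; time to learn how to save the current result.
# Usage and description of the write.table() function
write.table(x, file = "", sep = " ", row.names = TRUE, col.names = TRUE, quote = TRUE)
x: the data to export
file: the path of the output file
sep: the separator; the default is a space (" "), i.e. columns are separated by spaces
row.names: whether to write the row names; the default is TRUE, i.e. row names are written
col.names: whether to write the column names; the default is TRUE, i.e. column names are written
quote: whether character strings are wrapped in quotes; the default is TRUE, i.e. quotes are used
# Session record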
> write.table(raw_count, "G:/raw_count_file", row.names = F, col.names = T, sep = "\t", quote = F)
Annotation information was already added in an earlier step (for details, see: ), so the downstream analysis can proceed directly.
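Before moving on, the exported table can be sanity-checked from a shell, assuming the file is reachable from a Unix-like environment. The sketch below counts the rows whose number of tab-separated fields differs from the header line; the function name `count_ragged_rows` and the demo file are made up for illustration.

```shell
# Sketch: check that every row of a tab-separated matrix has as many fields as its header.
# The demo file mimics the layout of the exported table (hypothetical data).
demo_dir=$(mktemp -d)
printf 'gene_id\tcontrol_1\tcontrol_2\nGeneA\t0\t0\nGeneB\t230\t0\n' \
    > "$demo_dir/raw_count_demo.tsv"

# Print the number of rows whose field count differs from the header's.
count_ragged_rows() {
    awk -F '\t' 'NR == 1 { n = NF } NF != n { bad++ } END { print bad + 0 }' "$1"
}

bad_rows=$(count_ragged_rows "$demo_dir/raw_count_demo.tsv")
echo "rows with an unexpected number of fields: $bad_rows"
```

A result of 0 means the table is rectangular; anything else points to a separator or quoting problem in the export.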