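RNA-seq workflow study notes (14): merging expression matrices, setting experimental group information and column names, and importing/exporting data with R on Windows 10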

Reference article:
RNA-seq (6): read counting, merging matrices, and annotation

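Since I started learning R and its basic operations, I have been fumbling around for almost two weeks. I read tutorials on several websites and borrowed a few textbooks, only to find scattered bits of knowledge. So I will stick to my earlier approach: learn by doing, and fill in the details of each topic when I actually need it.
There are plenty of simple online tutorials for installing R and RStudio on Windows 10. One word of advice: don't chase the newest version the way my perfectionist self did, because some packages may not have been updated for it yet, which causes trouble. For the record, I installed R 3.6.3, which seems relatively stable.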

1. Inspect the read-count files (*.count)

1. Use the wc command to count the lines in each result file:

# Command usage
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.
The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit
# Session log
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll *.count
-rw-rw-r-- 1 zexing zexing 237K 6月   5 00:26 m3108.count
-rw-rw-r-- 1 zexing zexing 244K 6月   5 01:29 m3110.count
-rw-rw-r-- 1 zexing zexing 222K 6月   5 01:29 m3111.count
-rw-rw-r-- 1 zexing zexing 244K 6月   5 02:26 m3112.count
-rw-rw-r-- 1 zexing zexing 244K 6月   5 03:24 m3113.count
-rw-rw-r-- 1 zexing zexing 244K 6月   5 04:22 m3114.count
-rw-rw-r-- 1 zexing zexing 245K 6月   4 23:04 msh1.count
-rw-rw-r-- 1 zexing zexing 245K 6月   5 00:12 msh2.count
-rw-rw-r-- 1 zexing zexing 244K 6月   4 21:57 Scr.count
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ wc -l *.count
  24426 m3108.count
  24426 m3110.count
  24426 m3111.count
  24426 m3112.count
  24426 m3113.count
  24426 m3114.count
  24426 msh1.count
  24426 msh2.count
  24426 Scr.count
 219834 total
# Result: every file from the same sequencing batch has the same number of lines
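The same line-count check can also be done from R, which is handy on Windows where wc is not available by default. The following is only a minimal sketch: the directory and the two tiny files are hypothetical stand-ins created in a temporary folder, not the real *.count files.

```r
# Hypothetical demo files standing in for the real *.count files
count_dir <- file.path(tempdir(), "wc_demo")
dir.create(count_dir, showWarnings = FALSE)
writeLines(c("GeneA\t1", "GeneB\t2", "GeneC\t3"), file.path(count_dir, "s1.count"))
writeLines(c("GeneA\t4", "GeneB\t5", "GeneC\t6"), file.path(count_dir, "s2.count"))

# List every .count file and count its lines, similar to `wc -l *.count`
count_files <- list.files(count_dir, pattern = "\\.count$", full.names = TRUE)
n_lines <- vapply(count_files, function(f) length(readLines(f)), integer(1))
names(n_lines) <- basename(count_files)
print(n_lines)

# TRUE when every file has the same number of lines
same_length <- length(unique(n_lines)) == 1
print(same_length)
```

To use it on real data, point count_dir at the folder that holds the count files and skip the writeLines() calls.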

2. Use the head/tail commands to view the first and last lines of a result file:

# head command usage
Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
Mandatory arguments to long options are mandatory for short options too.
  -c, --bytes=[-]NUM       print the first NUM bytes of each file;
                             with the leading '-', print all but the last
                             NUM bytes of each file
  -n, --lines=[-]NUM       print the first NUM lines instead of the first 10;
                             with the leading '-', print all but the last
                             NUM lines of each file
  -q, --quiet, --silent    never print headers giving file names
  -v, --verbose            always print headers giving file names
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit
# Session log
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ head -n 14 Scr.count
0610005C13Rik   0
0610007P14Rik   485
0610009B22Rik   213
0610009L18Rik   13
0610009O20Rik   510
0610010B08Rik   0
0610010F05Rik   114
0610010K14Rik   452
0610011F06Rik   397
0610012G03Rik   490
0610030E20Rik   172
0610031O16Rik   0
0610037L13Rik   344
0610038B21Rik   30
# tail command usage
Usage: tail [OPTION]... [FILE]...
Print the last 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
Mandatory arguments to long options are mandatory for short options too.
  -c, --bytes=[+]NUM       output the last NUM bytes; or use -c +NUM to
                             output starting with byte NUM of each file
  -f, --follow[={name|descriptor}]
                           output appended data as the file grows;
                             an absent option argument means 'descriptor'
  -F                       same as --follow=name --retry
  -n, --lines=[+]NUM       output the last NUM lines, instead of the last 10;
                             or use -n +NUM to output starting with line NUM
      --max-unchanged-stats=N
                           with --follow=name, reopen a FILE which has not
                             changed size after N (default 5) iterations
                             to see if it has been unlinked or renamed
                             (this is the usual case of rotated log files);
                             with inotify, this option is rarely useful
      --pid=PID            with -f, terminate after process ID, PID dies
  -q, --quiet, --silent    never output headers giving file names
      --retry              keep trying to open a file if it is inaccessible
  -s, --sleep-interval=N   with -f, sleep for approximately N seconds
                             (default 1.0) between iterations;
                             with inotify and --pid=P, check process P at
                             least once every N seconds
  -v, --verbose            always output headers giving file names
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit
# Session log
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ tail -n 14 Scr.count
Zxdb    51
Zxdc    609
Zyg11a  0
Zyg11b  354
Zyx     1642
Zzef1   326
Zzz3    502
a       0
l7Rn6   328
__no_feature    10717770
__ambiguous     211131
__too_low_aQual 579837
__not_aligned   312551
__alignment_not_unique  1244960

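The output shows that the last five lines of the file (the rows whose names start with "__", such as __no_feature) are summary counters rather than genes, so they should be deleted.
Ways to delete them:

  • Download Notepad++, open each file with it, and delete the unwanted lines. # This is the method I used this time.
  • Use the Linux sed command (see 鸟哥的Linux私房菜, "VBird's Linux Private Kitchen")
# Each file has 24426 lines in total; to delete the last 5 (lines 24422-24426), sed -i is enough.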
#Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned/count$ sed -i '24422,$d' *.count
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned/count$ tail -n 10 *.count
==> m3108.count <==
Zxda    20
Zxdb    41
Zxdc    781
Zyg11a  0
Zyg11b  301
Zyx     2073
Zzef1   395
Zzz3    305
a       0
l7Rn6   389
  • Delete them in R
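For the "delete in R" option, one possible approach is to drop every row whose name starts with "__" after reading the file, instead of relying on fixed line numbers. This is a minimal sketch under stated assumptions: the file below is a hypothetical miniature written to a temporary location, and the column name "count" is made up for the illustration.

```r
# Hypothetical miniature count file with the five summary rows at the end
demo_file <- tempfile(fileext = ".count")
writeLines(c("Zzz3\t502", "a\t0", "l7Rn6\t328",
             "__no_feature\t10717770", "__ambiguous\t211131",
             "__too_low_aQual\t579837", "__not_aligned\t312551",
             "__alignment_not_unique\t1244960"), demo_file)

# Read the two-column, tab-separated file; keep gene names as character
counts <- read.table(demo_file, sep = "\t",
                     col.names = c("gene_id", "count"),
                     stringsAsFactors = FALSE)

# Keep only rows whose gene_id does not start with "__"
genes_only <- counts[!startsWith(counts$gene_id, "__"), ]
print(genes_only)
```

Filtering on the "__" prefix keeps working if the number of genes changes, whereas a hard-coded line range such as 24422,$ has to be recomputed for every annotation. If the summary rows are known to be exactly the last five, head(counts, -5) is a shorter alternative.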

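2. Import the data into R, set the group information, and add column names

As the output above shows, each .count file has two columns, the gene name and the read count, but no column names. To prepare for the later merge, add column names to every file and assign each sample to its own variable.
For the read.table() function, see: "read.table函数详解" (read.table explained in detail), "read.table()读取数据文件" (reading data files with read.table()), and "R read.table 读取表格参数详解" (R read.table parameters explained in detail).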

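# First set the string option (I will study the details later). Under R 3.6.x, read.table() converts character columns to factors by default; this option keeps gene_id as plain character strings.
> options(stringsAsFactors = FALSE)
# Set the global option options(stringsAsFactors = FALSE) inside a parent function and restore the
# option after the parent function exits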

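# Usage and description of the read.table() function
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", row.names, col.names,
           as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown")
header: logical; whether the first line of the file holds the variable (column) names.
na.strings: the strings to be interpreted as missing values (NA).
skip: the number of lines to skip before reading data.
nrows: the maximum number of rows to read.
dec: the character used as the decimal point.
sep: the field separator character.
row.names and col.names: the row names and column names to assign to the data.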

# Import the data, set the group information, and add column names
> Control_1 <- read.table("D:/zhaoxiujuan/m3108.count", sep = "\t", col.names = c("gene_id","control_1"))
> Control_2 <- read.table("D:/zhaoxiujuan/m3111.count", sep = "\t", col.names = c("gene_id","control_2"))
> sh_1_1 <- read.table("D:/zhaoxiujuan/m3112.count", sep = "\t", col.names = c("gene_id","sh_1_1"))
> sh_1_2 <- read.table("D:/zhaoxiujuan/m3113.count", sep = "\t", col.names = c("gene_id","sh_1_2"))
> sh_2_3 <- read.table("D:/zhaoxiujuan/m3114.count", sep = "\t", col.names = c("gene_id","sh_2_3"))
> sh_2_2 <- read.table("D:/zhaoxiujuan/m3110.count", sep = "\t", col.names = c("gene_id","sh_2_2"))

# Check the first and last rows of the imported data
> head(Control_1)
        gene_id control_1
1 0610005C13Rik         0
2 0610007P14Rik       230
3 0610009B22Rik        46
4 0610009L18Rik         3
5 0610009O20Rik       157
6 0610010B08Rik         0
> tail(Control_1)
      gene_id control_1
24416  Zyg11b        73
24417     Zyx       492
24418   Zzef1        94
24419    Zzz3        65
24420       a         0
24421   l7Rn6       104
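When there are many samples, the six near-identical read.table() calls can be replaced by a loop over a file-to-sample mapping. The following is a minimal sketch, not the method used in this post: the helper read_count(), the samples vector, and the two tiny demo files are all hypothetical, and stringsAsFactors = FALSE is passed directly so the result does not depend on the global option.

```r
# Hypothetical demo directory with two tiny count files
data_dir <- file.path(tempdir(), "read_demo")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("GeneA\t0", "GeneB\t230"), file.path(data_dir, "m3108.count"))
writeLines(c("GeneA\t0", "GeneB\t1"), file.path(data_dir, "m3111.count"))

# Map each file name to the sample (column) name it should receive
samples <- c(m3108.count = "control_1", m3111.count = "control_2")

# Hypothetical helper: read one count file and name its two columns
read_count <- function(file_name, sample_name) {
  read.table(file.path(data_dir, file_name), sep = "\t",
             col.names = c("gene_id", sample_name),
             stringsAsFactors = FALSE)
}

# One data frame per file, stored in a list named by file name
count_list <- Map(read_count, names(samples), samples)
str(count_list)
```

Keeping the mapping in one named vector makes it easier to see which file belongs to which group than scanning a series of separate assignments.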

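3. Combine the experimental groups with merge()

Reference article: "使用R中merge()函数合并数据" (merging data with the merge() function in R)

# Usage of the merge() function
  merge(x, y, by, by.x, by.y, all, all.x, all.y)
  by, by.x, by.y: the names of the columns used to match the two data frames
  all, all.x, all.y: logical values specifying the type of merge
# merge() joins two different data frames on the columns or row names they have in common.
# The simplest case is two data frames
> raw_count <- merge(Control_1, Control_2, by="gene_id")
> head(raw_count)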
        gene_id control_1 control_2 
1 0610005C13Rik         0         0  
2 0610007P14Rik       230         0  
3 0610009B22Rik        46         0 
4 0610009L18Rik         3         0 
5 0610009O20Rik       157         1 
6 0610010B08Rik         0         0 
# Two merge() calls can also be nested inside a third to combine four data sets at once
> raw_count <- merge(merge(Control_1, Control_2, by="gene_id"), merge(sh_1_1, sh_1_2, by="gene_id"))
> head(raw_count)
        gene_id control_1 control_2 sh_1_1 sh_1_2
1 0610005C13Rik         0         0      4      5
2 0610007P14Rik       230         0   1119   1197
3 0610009B22Rik        46         0    225    272
4 0610009L18Rik         3         0     12     12
5 0610009O20Rik       157         1    684    702
6 0610010B08Rik         0         0      0      0

# As long as the data frames passed to merge() come in pairs, even more can be combined the same way.
> raw_count <- merge(merge(merge(Control_1, Control_2, by="gene_id"), merge(sh_1_1, sh_1_2, by="gene_id")), merge(sh_2_2, sh_2_3, by="gene_id"))
> head(raw_count)
        gene_id control_1 control_2 sh_1_1 sh_1_2 sh_2_2 sh_2_3
1 0610005C13Rik         0         0      4      5      1      0
2 0610007P14Rik       230         0   1119   1197   1868   1439
3 0610009B22Rik        46         0    225    272    285    228
4 0610009L18Rik         3         0     12     12     16     16
5 0610009O20Rik       157         1    684    702    499    636
6 0610010B08Rik         0         0      0      0      0      0
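Nested merge() calls get hard to read as the number of samples grows. One alternative is to fold a list of data frames with Reduce(). This is a minimal sketch with made-up toy data (GeneA to GeneC and their counts are hypothetical); it also shows how the all argument changes which genes are kept.

```r
# Toy data frames; sh_1_1 deliberately lacks GeneC
control_1 <- data.frame(gene_id = c("GeneA", "GeneB", "GeneC"),
                        control_1 = c(0, 230, 46), stringsAsFactors = FALSE)
control_2 <- data.frame(gene_id = c("GeneA", "GeneB", "GeneC"),
                        control_2 = c(0, 0, 1), stringsAsFactors = FALSE)
sh_1_1 <- data.frame(gene_id = c("GeneA", "GeneB"),
                     sh_1_1 = c(4, 1119), stringsAsFactors = FALSE)
frames <- list(control_1, control_2, sh_1_1)

# Default merge: only genes present in every data frame are kept
merged_inner <- Reduce(function(x, y) merge(x, y, by = "gene_id"), frames)

# all = TRUE: keep every gene and fill missing counts with NA
merged_all <- Reduce(function(x, y) merge(x, y, by = "gene_id", all = TRUE), frames)

print(merged_inner)
print(merged_all)
```

Because every count file in this post has the same genes, the default merge loses nothing here; the all = TRUE variant only matters when files differ in their gene lists.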

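4. Save the merged file

This old rookie learns slowly, so let's at least learn how to save the current result.

# Usage and description of the write.table() function
write.table(x, file = "", sep = " ", row.names = TRUE, col.names = TRUE, quote = TRUE)
x: the data to export
file: the path of the output file
sep: the field separator; the default is a space (" "), so columns are separated by spaces
row.names: whether to export the row names (row numbers); the default is TRUE
col.names: whether to export the column names; the default is TRUE
quote: whether to wrap strings in quotes; the default is TRUE
# Session log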
> write.table(raw_count, "G:/raw_count_file", row.names = F, col.names = T, sep = "\t", quote = F)
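A saved table can be read back with read.table() to confirm that nothing was lost on the way. The following is a minimal sketch, assuming a small made-up data frame and a temporary file in place of raw_count and G:/raw_count_file.

```r
# Hypothetical stand-in for the merged raw_count table
raw_count_demo <- data.frame(gene_id = c("GeneA", "GeneB"),
                             control_1 = c(0L, 230L),
                             sh_1_1 = c(4L, 1119L),
                             stringsAsFactors = FALSE)

# Write a tab-separated file with a header line and no row names or quotes
out_file <- tempfile(fileext = ".txt")
write.table(raw_count_demo, out_file, sep = "\t",
            row.names = FALSE, col.names = TRUE, quote = FALSE)

# Read it back; header = TRUE because the first line holds the column names
reloaded <- read.table(out_file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

# Compare the reloaded table with the original
round_trip_ok <- isTRUE(all.equal(reloaded, raw_count_demo))
print(round_trip_ok)
```

Note that header = TRUE is needed on the way back in: unlike the original .count files, the exported file starts with a line of column names.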

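5. Gene annotation: getting gene_symbol

Annotation information was already added in an earlier step (for details, see: ), so the downstream steps can proceed directly.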
