Linux Day 2

Homework

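Use the less command to view example.gtf or the Y chromosome GFF file,
and explore the -S and -N options.
Solutions: 1. Decompress the file, then view it with less -NS Homo_sapiens.GRCh38.102.chromosome.Y.gff3
2. View the compressed file directly with zless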

Compressed files

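tar works on archives: a single archive can hold multiple files and directories. gzip is generally used to compress a single file.
From now on, when you see a tar.gz file, handle it with tar; when you see a file that ends in gz without tar, handle it with gzip (or gunzip).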

gunzip decompresses the file, and the original compressed file is removed

$ ls
bashrc_bk  example.fa  example.fq  example.gtf  Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz  readme.txt
Last2 20:47:54 ~/Data
$ gunzip Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz 
Last2 20:50:05 ~/Data
$ ls
bashrc_bk  example.fa  example.fq  example.gtf  Homo_sapiens.GRCh38.102.chromosome.Y.gff3  readme.txt

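tar extraction keeps the original compressed file, and tab completion does not fill in the file name (tar -zxvf only works here if the .gz file is actually a gzip-compressed tar archive)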

$ ls
bashrc_bk  example.fa  example.fq  example.gtf  Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz  readme.txt
Last2 20:53:04 ~/Data
$ tar -zxvf Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz # tab completion does not fill in the file name
Homo_sapiens.GRCh38.102.chromosome.Y.gff3
Last2 20:53:36 ~/Data
$ ls
bashrc_bk   example.fq   Homo_sapiens.GRCh38.102.chromosome.Y.gff3     readme.txt
example.fa  example.gtf  Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz
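
The difference between the two tools can be tried safely in a scratch directory. This is only a minimal sketch: the names demo.txt, notes and bundle.tar.gz are made up for illustration and are not part of the course data.

```shell
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
printf 'line1\nline2\n' > "$workdir/demo.txt"

# Single file: gzip and gunzip replace the file in place.
gzip "$workdir/demo.txt"            # demo.txt becomes demo.txt.gz
after_gzip=$(ls "$workdir")
gunzip "$workdir/demo.txt.gz"       # demo.txt.gz becomes demo.txt again
after_gunzip=$(ls "$workdir")

# Several files plus a directory: tar with gzip compression.
mkdir "$workdir/notes"
printf 'a\n' > "$workdir/notes/a.txt"
# c = create, z = gzip, f = archive file; -C switches directory before adding files
tar -czf "$workdir/bundle.tar.gz" -C "$workdir" demo.txt notes
# t = list the contents without extracting anything
archive_listing=$(tar -tzf "$workdir/bundle.tar.gz")
mkdir "$workdir/extracted"
# x = extract; the archive itself is left in place
tar -xzf "$workdir/bundle.tar.gz" -C "$workdir/extracted"

extracted_files=$(cd "$workdir/extracted" && find . -type f | sort)
archive_still_there=$(ls "$workdir" | grep -c 'bundle.tar.gz')

echo "$after_gzip -> $after_gunzip"
echo "$archive_listing"
echo "$extracted_files"
rm -r "$workdir"
```

Listing an archive with -t before extracting it is a cheap way to check what it will write and where.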

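cat (concatenate): prints the contents of text files to the screen (standard output)

Watch out for flooding the screen with a large file!
Common options:
-A ## show everything, including special characters such as tabs
-n ## number every line; -b numbers only non-blank lines
Common usage:
Other:
zcat: view compressed text files  tac: print the lines in reverse order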


$ cat >file # redirection; the space after > is optional
Welcome to
^C # press Enter first, then Ctrl+C; otherwise the text on the current line is lost
Last2 10:07:55 ~

Exit with Ctrl+C (Ctrl+D on an empty line also ends the input); delete with Ctrl+Backspace

A path can be used

$ cat Data/example.fa
>gi|556503834|ref|NC_000913.3|:c3317526-3316039 Escherichia coli str. K-12 substr. MG1655, complete genome
ATGAACAAAGAAATTTTGGCTGTAGTTGAAGCCGTATCCAATGAAAAGGCGCTACCTCGCGAGAAGATTT
TCGAAGCATTGGAAAGCGCGCTGGCGACAGCAACAAAGAAAAAATATGAACAAGAGATCGACGTCCGCGT
ACAGATCGATCGCAAAAGCGGTGATTTTGACACTTTCCGTCGCTGGTTAGTTGTTGATGAAGTCACCCAG
CCGACCAAGGAAATCACCCTTGAAGCCGCACGTTATGAAGATGAAAGCCTGAACCTGGGCGATTACGTTG
AAGATCAGATTGAGTCTGTTACCTTTGACCGTATCACTACCCAGACGGCAAAACAGGTTATCGTGCAGAA
AGTGCGTGAAGCCGAACGTGCGATGGTGGTTGATCAGTTCCGTGAACACGAAGGTGAAATCATCACCGGC
GTGGTGAAAAAAGTAAACCGCGACAACATCTCTCTGGATCTGGGCAACAACGCTGAAGCCGTGATCCTGC
GCGAAGATATGCTGCCGCGTGAAAACTTCCGCCCTGGCGACCGCGTTCGTGGCGTGCTCTATTCCGTTCG
CCCGGAAGCGCGTGGCGCGCAACTGTTCGTCACTCGTTCCAAGCCGGAAATGCTGATCGAACTGTTCCGT
ATTGAAGTGCCAGAAATCGGCGAAGAAGTGATTGAAATTAAAGCAGCGGCTCGCGATCCGGGTTCTCGTG
CGAAAATCGCGGTGAAAACCAACGATAAACGTATCGATCCGGTAGGTGCTTGCGTAGGTATGCGTGGCGC
GCGTGTTCAGGCGGTGTCTACTGAACTGGGTGGCGAGCGTATCGATATCGTCCTGTGGGATGATAACCCG
GCGCAGTTCGTGATTAACGCAATGGCACCGGCAGACGTTGCTTCTATCGTGGTGGATGAAGATAAACACA
CCATGGATATCGCCGTTGAAGCCGGTAACCTGGCGCAGGCGATTGGCCGTAACGGTCAGAACGTGCGTCT
GGCTTCGCAGCTGAGCGGTTGGGAACTCAACGTGATGACCGTTGACGACCTGCAGGCTAAGCATCAGGCG
GAAGCGCACGCAGCGATCGACACCTTCACCAAATATCTCGACATCGACGAAGACTTCGCGACTGTTCTGG
TAGAAGAAGGCTTCTCGACGCTGGAAGAATTGGCCTATGTGCCGATGAAAGAGCTGTTGGAAATCGAAGG
CCTTGATGAGCCGACCGTTGAAGCACTGCGCGAGCGTGCTAAAAATGCACTGGCCACCATTGCACAGGCC
CAGGAAGAAAGCCTCGGTGATAACAAACCGGCTGACGATCTGCTGAACCTTGAAGGGGTAGATCGTGATT
TGGCATTCAAACTGGCCGCCCGTGGCGTTTGTACGCTGGAAGATCTCGCCGAACAGGGCATTGATGATCT
GGCTGATATCGAAGGGTTGACCGACGAAAAAGCCGGAGCACTGATTATGGCTGCCCGTAATATTTGCTGG
TTCGGTGACGAAGCGTAA

Multiple files can be printed

$ cat file1 file2
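
The options listed for cat can be compared on a tiny file. The sketch below assumes GNU coreutils (the -A option is a GNU extension) and uses a made-up file demo.txt with one blank line in the middle.

```shell
workdir=$(mktemp -d)
printf 'gene\tA\n\ngene\tB\n' > "$workdir/demo.txt"

cat -n "$workdir/demo.txt"     # number every line, including the blank one
cat -b "$workdir/demo.txt"     # number only the non-blank lines
cat -A "$workdir/demo.txt"     # GNU cat: tabs appear as ^I, line ends as $
tac "$workdir/demo.txt"        # print the lines in reverse order

gzip -c "$workdir/demo.txt" > "$workdir/demo.txt.gz"
zcat "$workdir/demo.txt.gz"    # read a gzip-compressed file without unpacking it

# Values kept for a quick sanity check.
numbered_all=$(cat -n "$workdir/demo.txt" | wc -l)
numbered_nonblank=$(cat -b "$workdir/demo.txt" | grep -c '[0-9]')
reversed_first=$(tac "$workdir/demo.txt" | head -n 1)
unzipped_lines=$(zcat "$workdir/demo.txt.gz" | wc -l)
rm -r "$workdir"
```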

echo

Text inside a pair of quotes is printed as a whole. echo does not open files.

$ echo "example.gtf  | cut -f 9"
example.gtf  | cut -f 9
Last2 16:51:53 ~/myDir
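
To make the contrast concrete, the sketch below prints a quoted string literally and then runs a real pipeline; the three-field sample line is invented for illustration.

```shell
# Quoted: the whole string, including | and cut, is printed literally.
literal=$(echo "example.gtf | cut -f 9")
echo "$literal"

# Unquoted pipe: the shell connects the two commands, and cut keeps field 2.
piped=$(printf 'a\tb\tc\n' | cut -f 2)
echo "$piped"
```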

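wc: count text

Common options: -l counts lines
-w counts words (whitespace-separated strings)
-c counts bytes

How the -w word count arises: a word is any string with whitespace on both sides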

$ wc readme.txt
  5  29 206 readme.txt
Last2 09:19:39 ~
$ cat readme.txt | tr ' ' '\n'
Welcome
to
Biotrainee()
!
This
is
your
personal
account
in
our
Cloud.
Have
a
fun
with
it.
Please
feel
free
to
contact
with
me(
email
to
[email protected]
)
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 09:21:23 ~
$ cat readme.txt | tr ' ' '\n' | wc -l
29
Last2 09:22:15 ~
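
The same check can be repeated on any text. The sketch below uses a made-up two-line sample and adds tr -s so that runs of spaces do not produce empty lines, which would otherwise inflate the line count.

```shell
tmpfile=$(mktemp)
printf 'Welcome to Biotrainee() !\nThis is your account.\n' > "$tmpfile"

line_count=$(wc -l < "$tmpfile")   # newline characters
word_count=$(wc -w < "$tmpfile")   # whitespace-separated strings
byte_count=$(wc -c < "$tmpfile")   # bytes

# Put one word per line; -s squeezes repeated newlines so no empty lines remain.
# grep -c . then counts the non-empty lines.
split_count=$(tr -s ' ' '\n' < "$tmpfile" | grep -c .)

echo "lines=$line_count words=$word_count bytes=$byte_count split=$split_count"
rm "$tmpfile"
```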

sort

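sort command: https://man.linuxde.net/sort
ASCII code values: https://baike.baidu.com/item/ASCII/309296

sort command (file filtering, splitting, and merging)
The sort command is very useful on Linux. It sorts the lines of its input and writes the result to standard output. It can read either from named files or from stdin.

Syntax
sort (options) (arguments)
Options
-b: ignore leading blanks on each line;
-c: check whether the file is already sorted;
-d: consider only letters, digits, and blanks when sorting, ignoring other characters;
-f: treat lowercase letters as uppercase when sorting;
-i: consider only ASCII characters between octal 040 and 176 when sorting, ignoring other characters;
-m: merge several already-sorted files;
-M: sort by the month abbreviation formed by the first three letters;
-n: sort by numeric value;
-o<output file>: write the sorted result to the specified file;
-r: sort in reverse order;
-t<separator>: specify the field separator used for sorting;
+<start field>-<end field>: sort on the given fields, from the start field up to the field before the end field (obsolete syntax; current versions use -k<start>[,<end>]).
Arguments
File: the list of files to sort.

Examples
sort treats each line of the file or text as one unit and compares lines character by character from the first character onward, then prints them in ascending order. The comparison follows ASCII code values in the C locale; other locales apply their own collation rules.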

[root@mail text]# cat sort.txt
aaa:10:1.1
ccc:30:3.3
ddd:40:4.4
bbb:20:2.2
eee:50:5.5
eee:50:5.5

[root@mail text]# sort sort.txt
aaa:10:1.1
bbb:20:2.2
ccc:30:3.3
ddd:40:4.4
eee:50:5.5
eee:50:5.5
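
The sample above is colon-separated, so it also suits the field options. A minimal sketch, reusing the same six lines and assuming a sort that supports the -k key syntax (GNU and POSIX sort both do):

```shell
records=$(printf 'aaa:10:1.1\nccc:30:3.3\nddd:40:4.4\nbbb:20:2.2\neee:50:5.5\neee:50:5.5\n')

# -t sets the field separator, -k2,2 restricts the key to field 2,
# n compares numerically and r reverses the order.
by_number_desc=$(printf '%s\n' "$records" | sort -t ':' -k2,2nr)

# -u drops duplicate lines while sorting.
unique_sorted=$(printf '%s\n' "$records" | sort -u)

# LC_ALL=C forces plain byte (ASCII) order regardless of the locale,
# so uppercase letters sort before lowercase ones.
ascii_order=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)

printf '%s\n' "$by_number_desc"
printf '%s\n' "$unique_sorted"
printf '%s\n' "$ascii_order"
```

Setting LC_ALL=C is worth remembering whenever two machines sort the same file differently.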

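uniq: remove duplicate lines

Only adjacent duplicate lines are removed
Common options: -c: count how many consecutive lines each string occupies

Run sort first, then uniq; treat the two as one unit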

$ cd Data/
Last2 22:34:12 ~/Data
$ ls
bashrc_bk  example.fa  example.fq  example.gtf  Homo_sapiens.GRCh38.102.chromosome.Y.gff3  Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz  readme.txt
Last2 22:34:15 ~/Data
$ cat example.gtf | cut -f 3 | head
UTR
exon
transcript
gene
exon
transcript
exon
exon
UTR
exon
Last2 22:34:53 ~/Data
$ cat example.gtf | cut -f 3 | uniq | less -S
Last2 22:38:07 ~/Data
$ cat example.gtf | cut -f 3 | uniq -c | less -S
      1 UTR
      1 exon
      1 transcript
      1 gene
      1 exon
      1 transcript
      2 exon
      1 UTR
      3 exon
      1 UTR
      2 exon
      1 start_codon
      1 CDS
      2 UTR
      2 exon
      2 transcript
      1 gene
      1 stop_codon
      1 UTR
      1 exon
      1 transcript
      1 stop_codon
      2 CDS
      1 UTR
      3 exon
      1 CDS
      1 exon
      1 UTR
      2 exon
      1 CDS
      1 exon
      1 CDS
      1 UTR
      3 exon
      1 CDS
      1 exon
      1 UTR
      2 exon
      1 UTR
      1 CDS
      3 exon
      1 stop_codon
      1 CDS
      1 UTR
      1 stop_codon
      1 exon
      1 transcript
      2 CDS
      1 exon
      1 CDS
      2 exon
      3 CDS
      4 exon
      2 CDS
      3 exon
      1 UTR
      1 CDS
      1 exon
      1 start_codon
      1 UTR
      1 CDS
      2 UTR
      4 exon
      1 start_codon
      2 UTR
      1 exon
      2 UTR
      4 exon
      1 transcript
      1 gene
      1 exon
      1 transcript
      1 exon
      1 gene
      1 transcript
      3 exon
      1 UTR
      1 exon
      1 gene
      1 transcript
      1 stop_codon
      1 CDS
      1 exon
      1 transcript
      1 CDS
      1 exon
      1 CDS
      2 exon
      1 start_codon
      1 UTR
      1 CDS
      1 exon
      1 gene
      1 transcript
      1 CDS
      1 exon
      1 UTR
      1 stop_codon
      1 exon
      1 gene
      1 transcript
      1 UTR
      1 exon
      1 gene
      1 transcript
      1 start_codon
      1 CDS
      1 stop_codon
      1 UTR
      1 exon
      1 transcript
      1 gene
      1 exon
      1 gene
      1 transcript
      3 exon
      1 transcript
      1 exon
      1 transcript
      8 exon
      1 transcript
      1 exon
      1 gene
      1 transcript
      2 exon
      1 transcript
      1 gene
      1 UTR
      1 stop_codon
      1 exon
      1 transcript
      1 CDS
      1 exon
      1 CDS
      1 exon
      1 start_codon
      1 exon
      1 gene
      1 transcript
      2 exon
      1 gene
      1 transcript
      1 exon
      1 transcript
      4 exon
      1 gene
      1 transcript
      1 exon
      1 gene
      1 transcript
      1 exon
      1 gene
      1 transcript
      2 exon
      1 gene
      1 transcript
      2 exon
      1 transcript
      6 exon
      1 UTR
      1 exon
      1 transcript
      1 gene
      1 exon
      1 transcript
      1 exon
      1 transcript
      1 stop_codon
      1 CDS
      1 exon
      1 CDS
      1 exon
      1 start_codon
      2 exon
      1 gene
      1 transcript
      1 exon
(END)
$ cat example.gtf | cut -f 3 | sort | uniq -c | less -S  # run sort first, then uniq; treat the two as one unit
     29 CDS
    111 exon
     20 gene
      7 start_codon
      9 stop_codon
     34 transcript
     27 UTR
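
The adjacency rule is easy to see on a handful of lines. The feature list below is invented for illustration; the final sort -k1,1nr is an optional extra that ranks the counts from largest to smallest.

```shell
features=$(printf 'exon\nUTR\nexon\nexon\ngene\nUTR\n')

# Without sort: only adjacent repeats collapse, so exon and UTR each appear twice.
adjacent_only=$(printf '%s\n' "$features" | uniq | wc -l)

# With sort first: identical lines become neighbours, one row per distinct value.
distinct=$(printf '%s\n' "$features" | sort | uniq | wc -l)

# -c prefixes each value with its count; sort -k1,1nr puts the largest count first.
printf '%s\n' "$features" | sort | uniq -c | sort -k1,1nr
top=$(printf '%s\n' "$features" | sort | uniq -c | sort -k1,1nr | head -n 1 | awk '{print $2}')
echo "adjacent_only=$adjacent_only distinct=$distinct top=$top"
```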

paste: merge text, column-wise by default

Common options: -d: specify the delimiter
-s: merge each file into a single row
Common usage: paste file1 file2; seq 20 | paste - -

$ cat > file1
1
2
3
4
5
^C
Last2 22:49:58 ~/Data
$ cat  >file2
gene    
gene
gene

gene
gene
gene
^C
Last2 22:51:13 ~/Data
$ cat file1
1
2
3
4
5
Last2 22:51:20 ~/Data
$ cat file2
gene    
gene
gene

gene
gene
gene
Last2 22:51:24 ~/Data
$ paste file1 file2 # merges column-wise by default
1   gene    
2   gene
3   gene
4   
5   gene
    gene
    gene
Last2 22:52:10 ~/Data
$ paste -s file1 file2  # -s: each file becomes one row
1   2   3   4   5
gene        gene    gene        gene    gene    gene
Last2 22:55:17 ~/Data
$ cat file1 file2 # prints one file after the other
1
2
3
4
5
gene    
gene
gene

gene
gene
gene
Last2 22:55:31 ~/Data
$ cat file1 file2 > file3 # redirect file1 and file2 into file3, merging the two files into one
Last2 22:59:17 ~/Data
$ cat file3
1
2
3
4
5
gene    
gene
gene

gene
gene
gene
Last2 22:59:24 ~/Data

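seq 20 | paste - - : each - takes the next line from standard input, so every two lines are joined into one row; this is handy for gene files where some repeated lines need to be collapsed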

$ seq 20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Last2 23:01:38 ~/Data
$ seq 20 | paste - - 
1   2
3   4
5   6
7   8
9   10
11  12
13  14
15  16
17  18
19  20
Last2 23:02:38 ~/Data
$ seq 20 | paste - - - -
1   2   3   4
5   6   7   8
9   10  11  12
13  14  15  16
17  18  19  20
Last2 23:02:45 ~/Data
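
The -d option was listed but not demonstrated. A short sketch with made-up files ids.txt and names.txt:

```shell
workdir=$(mktemp -d)
seq 3 > "$workdir/ids.txt"
printf 'geneA\ngeneB\ngeneC\n' > "$workdir/names.txt"

# -d replaces the default tab delimiter.
comma_joined=$(paste -d ',' "$workdir/ids.txt" "$workdir/names.txt")

# -s turns one file into a single row; together with -d it works as a quick joiner.
one_row=$(paste -s -d '+' "$workdir/ids.txt")

# Each - reads the next line of standard input, so three dashes give three columns.
three_cols=$(seq 6 | paste - - -)

printf '%s\n' "$comma_joined" "$one_row" "$three_cols"
rm -r "$workdir"
```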

tr: character substitution

Common options: -d: delete the specified characters
-s: squeeze runs of repeated characters

$ cd Data
Last2 09:51:41 ~/Data
$ cat example.gtf | cut -f 1,2 | tr "\t" ":" | head # replace tabs with :
chr1:ENSEMBL
chr1:ENSEMBL
chr1:ENSEMBL
chr1:HAVANA
chr1:HAVANA
chr1:HAVANA
chr1:HAVANA
chr1:HAVANA
chr1:ENSEMBL
chr1:ENSEMBL
Last2 09:52:43 ~/Data
$ cat > test1
1   2  3   4      5
^C                 
$ cat test1
1   2  3   4      5
Last2 09:54:01 ~/Data
$ cat test1 | tr " " ":"
1:::2::3:::4::::::5
Last2 09:55:06 ~/Data
$ cat test1 | tr -s " " ":" # squeeze runs of repeated characters
1:2:3:4:5
Last2 09:55:22 ~/Data
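
The -d option and character ranges were not shown in the session above. A minimal sketch on invented input strings:

```shell
# Map a range of characters onto another range: lowercase to uppercase.
upper=$(echo 'acgtn' | tr 'a-z' 'A-Z')

# -d deletes every occurrence of the listed characters, here the double quotes.
no_quotes=$(echo 'gene_id "ENSG00000223972"' | tr -d '"')

# -s with a single set squeezes runs of that character without replacing it.
squeezed=$(echo 'a    b  c' | tr -s ' ')

printf '%s\n' "$upper" "$no_quotes" "$squeezed"
```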

Homework: extract column 9 of example.gtf

The output is expected to be long, so view it with less -NS

$ less example.gtf  | cut -f 9  | less -NS

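3. Building on step 2, use a semicolon as the delimiter and extract field 1
Remember to add -f; otherwise cut does not know which field to output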

$ less example.gtf  | cut -f 9 | cut -d ";" -f 1
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"

4. Building on step 3, sort, remove duplicates, and count the lines
5. Building on step 4, replace spaces with colons

$ less example.gtf  | cut -f 9 | cut -d ";" -f 1 | sort | uniq -c
      8 gene_id "ENSG00000177693"
     15 gene_id "ENSG00000184731"
      3 gene_id "ENSG00000221311"
      3 gene_id "ENSG00000222623"
     19 gene_id "ENSG00000223972"
      4 gene_id "ENSG00000227061"
     83 gene_id "ENSG00000227232"
      8 gene_id "ENSG00000233004"
      3 gene_id "ENSG00000233750"
     15 gene_id "ENSG00000237613"
     12 gene_id "ENSG00000237683"
     18 gene_id "ENSG00000238009"
     12 gene_id "ENSG00000239368"
      4 gene_id "ENSG00000239906"
      4 gene_id "ENSG00000239945"
      3 gene_id "ENSG00000240361"
      3 gene_id "ENSG00000240786"
      4 gene_id "ENSG00000241599"
      8 gene_id "ENSG00000241860"
      8 gene_id "ENSG00000243485"
Last2 10:29:22 ~/Data
$ less example.gtf  | cut -f 9 | cut -d ";" -f 1 | sort | uniq -c | tr " " ":"
::::::8:gene_id:"ENSG00000177693"
:::::15:gene_id:"ENSG00000184731"
::::::3:gene_id:"ENSG00000221311"
::::::3:gene_id:"ENSG00000222623"
:::::19:gene_id:"ENSG00000223972"
::::::4:gene_id:"ENSG00000227061"
:::::83:gene_id:"ENSG00000227232"
::::::8:gene_id:"ENSG00000233004"
::::::3:gene_id:"ENSG00000233750"
:::::15:gene_id:"ENSG00000237613"
:::::12:gene_id:"ENSG00000237683"
:::::18:gene_id:"ENSG00000238009"
:::::12:gene_id:"ENSG00000239368"
::::::4:gene_id:"ENSG00000239906"
::::::4:gene_id:"ENSG00000239945"
::::::3:gene_id:"ENSG00000240361"
::::::3:gene_id:"ENSG00000240786"
::::::4:gene_id:"ENSG00000241599"
::::::8:gene_id:"ENSG00000241860"
::::::8:gene_id:"ENSG00000243485"
Last2 10:30:07 ~/Data
$ less example.gtf  | cut -f 9 | cut -d ";" -f 1 | sort | uniq -c | tr -s " " ":"
:8:gene_id:"ENSG00000177693"
:15:gene_id:"ENSG00000184731"
:3:gene_id:"ENSG00000221311"
:3:gene_id:"ENSG00000222623"
:19:gene_id:"ENSG00000223972"
:4:gene_id:"ENSG00000227061"
:83:gene_id:"ENSG00000227232"
:8:gene_id:"ENSG00000233004"
:3:gene_id:"ENSG00000233750"
:15:gene_id:"ENSG00000237613"
:12:gene_id:"ENSG00000237683"
:18:gene_id:"ENSG00000238009"
:12:gene_id:"ENSG00000239368"
:4:gene_id:"ENSG00000239906"
:4:gene_id:"ENSG00000239945"
:3:gene_id:"ENSG00000240361"
:3:gene_id:"ENSG00000240786"
:4:gene_id:"ENSG00000241599"
:8:gene_id:"ENSG00000241860"
:8:gene_id:"ENSG00000243485"
Last2 10:30:31 ~/Data
$ less example.gtf  | cut -f 9 | cut -d ";" -f 1 | sort | uniq -c | tr -s " " "\t"
    8   gene_id "ENSG00000177693"
    15  gene_id "ENSG00000184731"
    3   gene_id "ENSG00000221311"
    3   gene_id "ENSG00000222623"
    19  gene_id "ENSG00000223972"
    4   gene_id "ENSG00000227061"
    83  gene_id "ENSG00000227232"
    8   gene_id "ENSG00000233004"
    3   gene_id "ENSG00000233750"
    15  gene_id "ENSG00000237613"
    12  gene_id "ENSG00000237683"
    18  gene_id "ENSG00000238009"
    12  gene_id "ENSG00000239368"
    4   gene_id "ENSG00000239906"
    4   gene_id "ENSG00000239945"
    3   gene_id "ENSG00000240361"
    3   gene_id "ENSG00000240786"
    4   gene_id "ENSG00000241599"
    8   gene_id "ENSG00000241860"
    8   gene_id "ENSG00000243485"
Last2 10:31:06 ~/Data
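
The leading colon or tab in the results above comes from the padding that uniq -c puts in front of each count. One possible way to avoid it, sketched on invented gene_id lines rather than the real example.gtf, is to strip the padding before tr runs, or to let awk rebuild the line:

```shell
counts=$(printf 'gene_id "A"\ngene_id "B"\ngene_id "A"\n' | sort | uniq -c)

# sed removes the leading spaces first, so tr has no padding left to convert.
with_sed=$(printf '%s\n' "$counts" | sed 's/^ *//' | tr ' ' ':')

# awk splits on whitespace by itself; assigning $1 = $1 rebuilds the line
# using the output field separator OFS.
with_awk=$(printf '%s\n' "$counts" | awk -v OFS=':' '{ $1 = $1; print }')

printf '%s\n' "$with_sed"
printf '%s\n' "$with_awk"
```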

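The three workhorses of text processing: grep

When using grep or awk, wrap the outermost expression in single quotes ''; when something inside needs quoting, use double quotes "" there
Further reading: the difference between single and double quotes in Linux and how to use them
https://blog.csdn.net/hs6605015/article/details/109111568

Regular expression: a logical formula for operating on strings. Predefined special characters, and combinations of them, form a "rule string", and this rule string expresses a filtering logic for strings.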



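[] and | both express alternatives, and several can be listed: [bBc] matches any one of the listed characters, while b|B|c chooses between whole patterns and needs grep -E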

$ cat readme.txt  | grep '[bBc]'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 17:36:31 ~
$ cat readme.txt  | grep "b|B|c"
Last2 17:37:22 ~
$ cat readme.txt  | grep -E "b|B|c"
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 17:37:38 ~
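
The empty result of grep "b|B|c" follows from grep's default basic regular expressions, where | is an ordinary character. A small sketch on three invented lines shows the three variants side by side; the || true only keeps a zero-match exit status from stopping a strict script.

```shell
text=$(printf 'Welcome to Biotrainee\nThis is your personal account\nfinal line\n')

# Basic regex (default): | is literal, so this searches for the text "b|B|c".
basic=$(printf '%s\n' "$text" | grep -c 'b|B|c' || true)

# Extended regex (-E): | means "or".
extended=$(printf '%s\n' "$text" | grep -c -E 'b|B|c' || true)

# Bracket expression: any one of the listed characters; works in both modes.
bracket=$(printf '%s\n' "$text" | grep -c '[bBc]' || true)

echo "basic=$basic extended=$extended bracket=$bracket"
```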

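Exercise 1:

1. How many genes are on the human Y chromosome?
Even with -w, other lines that contain gene are still matched. grep treats a string as a word whenever the characters before and after it are not letters, digits, or underscores, so punctuation counts as a boundary just like a space does; the gene in ID=gene: is therefore a word. The notion of a word here differs from the everyday one. You can also require whitespace on both sides (self-study).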

$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | grep -w "gene" | less -NS # also matches other lines that contain gene
 1 Y       ensembl ncRNA_gene      2784749 2784853 .       +       .       ID=gene:ENSG00000251841;Name=RNU6-1334P;biotype=snRNA;description=RNA%2C U6 small nuclear 1334%2C pseudogene [Source:HGNC Symbol%3BAcc:HGNC:48297];gene_id=ENSG00
      2 Y       ensembl snRNA   2784749 2784853 .       +       .       ID=transcript:ENST00000516032;Parent=gene:ENSG00000251841;Name=RNU6-1334P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000516032;transcript_support_level=NA;version=1
      3 Y       ensembl_havana  gene    2786855 2787682 .       -       .       ID=gene:ENSG00000184895;Name=SRY;biotype=protein_coding;description=sex determining region Y [Source:HGNC Symbol%3BAcc:HGNC:11311];gene_id=ENSG00000184895;logic_
      4 Y       ensembl_havana  mRNA    2786855 2787682 .       -       .       ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_id=ENST00000383070;transcri
      5 Y       havana  pseudogene      2789827 2790328 .       +       .       ID=gene:ENSG00000237659;Name=RNASEH2CP1;biotype=processed_pseudogene;description=ribonuclease H2 subunit C pseudogene 1 [Source:HGNC Symbol%3BAcc:HGNC:24117];gen
      6 Y       havana  pseudogenic_transcript  2789827 2790328 .       +       .       ID=transcript:ENST00000454281;Parent=gene:ENSG00000237659;Name=RNASEH2CP1-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000454281;transc
      7 Y       havana  pseudogene      2827982 2828218 .       +       .       ID=gene:ENSG00000232195;Name=TOMM22P2;biotype=processed_pseudogene;description=TOMM22 pseudogene 2 [Source:HGNC Symbol%3BAcc:HGNC:38737];gene_id=ENSG00000232195;
      8 Y       havana  pseudogenic_transcript  2827982 2828218 .       +       .       ID=transcript:ENST00000430735;Parent=gene:ENSG00000232195;Name=TOMM22P2-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000430735;transcri
      9 Y       havana  ncRNA_gene      2828192 2840851 .       -       .       ID=gene:ENSG00000286130;Name=AC006040.1;biotype=lncRNA;description=novel transcript;gene_id=ENSG00000286130;logic_name=havana_homo_sapiens;version=1
     10 Y       havana  lnc_RNA 2828192 2840851 .       -       .       ID=transcript:ENST00000651710;Parent=gene:ENSG00000286130;Name=AC006040.1-201;biotype=lncRNA;tag=basic;transcript_id=ENST00000651710;version=1
     11 Y       ensembl_havana  gene    2841602 2932000 .       +       .       ID=gene:ENSG00000129824;Name=RPS4Y1;biotype=protein_coding;description=ribosomal protein S4 Y-linked 1 [Source:HGNC Symbol%3BAcc:HGNC:10425];gene_id=ENSG00000129
     12 Y       ensembl_havana  mRNA    2841602 2867268 .       +       .       ID=transcript:ENST00000250784;Parent=gene:ENSG00000129824;Name=RPS4Y1-201;biotype=protein_coding;ccdsid=CCDS14773.1;tag=basic;transcript_id=ENST00000250784;trans
     13 Y       havana  mRNA    2841920 2866862 .       +       .       ID=transcript:ENST00000430575;Parent=gene:ENSG00000129824;Name=RPS4Y1-202;biotype=protein_coding;transcript_id=ENST00000430575;transcript_support_level=3;version=1
     14 Y       havana  lnc_RNA 2854096 2866956 .       +       .       ID=transcript:ENST00000477725;Parent=gene:ENSG00000129824;Name=RPS4Y1-203;biotype=processed_transcript;transcript_id=ENST00000477725;transcript_support_level=2;version=1
     15 Y       havana  lnc_RNA 2854730 2932000 .       +       .       ID=transcript:ENST00000515575;Parent=gene:ENSG00000129824;Name=RPS4Y1-204;biotype=processed_transcript;transcript_id=ENST00000515575;transcript_support_level=3;version=1
     16 Y       havana  pseudogene      2881683 2890551 .       -       .       ID=gene:ENSG00000227289;Name=HSFY3P;biotype=transcribed_unprocessed_pseudogene;description=heat shock transcription factor Y-linked 3%2C pseudogene [Source:HGNC 
     17 Y       havana  pseudogenic_transcript  2881683 2883652 .       -       .       ID=transcript:ENST00000444242;Parent=gene:ENSG00000227289;Name=HSFY3P-201;biotype=transcribed_unprocessed_pseudogene;tag=basic;transcript_id=ENST00000444
     18 Y       havana  lnc_RNA 2883110 2890551 .       -       .       ID=transcript:ENST00000652562;Parent=gene:ENSG00000227289;Name=HSFY3P-202;biotype=processed_transcript;tag=basic;transcript_id=ENST00000652562;version=1
     19 Y       havana  pseudogene      2929001 2931120 .       -       .       ID=gene:ENSG00000229163;Name=NAP1L1P2;biotype=processed_pseudogene;description=nucleosome assembly protein 1 like 1 pseudogene 2 [Source:HGNC Symbol%3BAcc:HGNC:3
     20 Y       havana  pseudogenic_transcript  2929001 2931120 .       -       .       ID=transcript:ENST00000414182;Parent=gene:ENSG00000229163;Name=NAP1L1P2-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000414182;transcri
     21 Y       havana  ncRNA_gene      2934406 2934771 .       -       .       ID=gene:ENSG00000278847;Name=AC006157.1;biotype=lncRNA;description=novel transcript;gene_id=ENSG00000278847;logic_name=havana_homo_sapiens;version=1
     22 Y       havana  lnc_RNA 2934406 2934771 .       -       .       ID=transcript:ENST00000611750;Parent=gene:ENSG00000278847;Name=AC006157.1-201;biotype=lncRNA;tag=basic;transcript_id=ENST00000611750;transcript_support_level=NA;version=
     23 Y       ensembl_havana  gene    2935281 2982506 .       +       .       ID=gene:ENSG00000067646;Name=ZFY;biotype=protein_coding;description=zinc finger protein Y-linked [Source:HGNC Symbol%3BAcc:HGNC:12870];gene_id=ENSG00000067646;lo
     24 Y       havana  mRNA    2935281 2982506 .       +       .       ID=transcript:ENST00000383052;Parent=gene:ENSG00000067646;Name=ZFY-202;biotype=protein_coding;ccdsid=CCDS14774.1;tag=basic;transcript_id=ENST00000383052;transcript_suppo
     25 Y       ensembl_havana  mRNA    2935381 2982506 .       +       .       ID=transcript:ENST00000155093;Parent=gene:ENSG00000067646;Name=ZFY-201;biotype=protein_coding;ccdsid=CCDS14774.1;tag=basic;transcript_id=ENST00000155093;transcri
     26 Y       ensembl_havana  mRNA    2935389 2980347 .       +       .       ID=transcript:ENST00000625061;Parent=gene:ENSG00000067646;Name=ZFY-207;biotype=protein_coding;ccdsid=CCDS48200.1;tag=basic;transcript_id=ENST00000625061;transcri
     27 Y       havana  lnc_RNA 2935500 2978053 .       +       .       ID=transcript:ENST00000469869;Parent=gene:ENSG00000067646;Name=ZFY-205;biotype=processed_transcript;transcript_id=ENST00000469869;transcript_support_level=3;version=1
     28 Y       havana  mRNA    2935505 2961286 .       +       .       ID=transcript:ENST00000443793;Parent=gene:ENSG00000067646;Name=ZFY-203;biotype=protein_coding;transcript_id=ENST00000443793;transcript_support_level=3;version=1

Extract column 3 on its own to exclude interference from the other columns, then match lines that begin with gene

$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | cut -f 3 | grep -w "^gene" | less -S
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | cut -f 3 | grep -w -c "^gene"
47
Last2 18:17:58 ~/Data
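
Two other ways to require an exact match on column 3 are sketched below on four invented GFF3-like lines (not the real Y chromosome file): an awk comparison on the third tab-separated field, and grep -x on the extracted column.

```shell
gff=$(printf '##gff-version 3\nY\thavana\tgene\t1\t9\t.\t+\t.\tID=gene:G1\nY\thavana\tmRNA\t1\t9\t.\t+\t.\tParent=gene:G1\nY\thavana\tncRNA_gene\t1\t9\t.\t+\t.\tID=gene:G2\n')

# -w alone: the gene in "ID=gene:" and "Parent=gene:" sits between = and :,
# so all three records match.
word_hits=$(printf '%s\n' "$gff" | grep -c -w 'gene')

# Exact string comparison on the tab-separated third field.
exact_hits=$(printf '%s\n' "$gff" | awk -F '\t' '$3 == "gene"' | wc -l)

# -x requires the whole line of the extracted column to equal the pattern.
column_hits=$(printf '%s\n' "$gff" | cut -f 3 | grep -c -x 'gene')

echo "word_hits=$word_hits exact_hits=$exact_hits column_hits=$column_hits"
```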

2. Which feature types appear in column 3 of the Y chromosome annotation file?

$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | cut -f 3 | sort | uniq -c
    522 ###
    290 biological_region
   1568 CDS
      1 chromosome
   4285 exon
    222 five_prime_UTR
     47 gene
      1 #!genebuild-last-updated 2020-09
      1 #!genome-build-accession NCBI:GCA_000001405.28
      1 #!genome-build  GRCh38.p13
      1 #!genome-date 2013-12
      1 #!genome-version GRCh38
      1 ##gff-version 3
    258 lnc_RNA
    149 mRNA
      7 ncRNA
     92 ncRNA_gene
    382 pseudogene
    382 pseudogenic_transcript
      1 ##sequence-region   Y 2781480 56887902
      3 snoRNA
     17 snRNA
    196 three_prime_UTR
Last2 18:19:50 ~/Data

Use grep -v to filter out the # comment lines

$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | cut -f 3 | grep -v "#" | sort | uniq -c
    290 biological_region
   1568 CDS
      1 chromosome
   4285 exon
    222 five_prime_UTR
     47 gene
    258 lnc_RNA
    149 mRNA
      7 ncRNA
     92 ncRNA_gene
    382 pseudogene
    382 pseudogenic_transcript
      3 snoRNA
     17 snRNA
    196 three_prime_UTR
Last2 18:22:45 ~/Data
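
A slightly stricter variant anchors the # to the start of the line and filters before cut, so a # inside a data field cannot remove a real record. The sketch uses a few invented lines in place of the real annotation file.

```shell
gff=$(printf '##gff-version 3\n#!genome-version GRCh38\nY\thavana\tgene\t1\t9\nY\thavana\texon\t1\t5\nY\thavana\texon\t6\t9\n###\n')

# Drop header and separator lines first, then count the feature types.
type_counts=$(printf '%s\n' "$gff" | grep -v '^#' | cut -f 3 | sort | uniq -c)
printf '%s\n' "$type_counts"

# Number of distinct feature types left after filtering.
type_total=$(printf '%s\n' "$type_counts" | wc -l)
```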

3. Match the lines that contain exon, then invert the match

$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | grep -v 'exon'

4. Match the lines that contain CDS or UTR

$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | grep -e "CDS" -e 'UTR' | less -SN
    1 Y       ensembl_havana  mRNA    2786855 2787682 .       -       .       ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_id=ENST00000383070;transcri
      2 Y       ensembl_havana  three_prime_UTR 2786855 2786988 .       -       .       Parent=transcript:ENST00000383070
      3 Y       ensembl_havana  CDS     2786989 2787603 .       -       0       ID=CDS:ENSP00000372547;Parent=transcript:ENST00000383070;protein_id=ENSP00000372547
      4 Y       ensembl_havana  five_prime_UTR  2787604 2787682 .       -       .       Parent=transcript:ENST00000383070
      5 Y       ensembl_havana  mRNA    2841602 2867268 .       +       .       ID=transcript:ENST00000250784;Parent=gene:ENSG00000129824;Name=RPS4Y1-201;biotype=protein_coding;ccdsid=CCDS14773.1;tag=basic;transcript_id=ENST00000250784;trans
      6 Y       ensembl_havana  five_prime_UTR  2841602 2841624 .       +       .       Parent=transcript:ENST00000250784
      7 Y       ensembl_havana  CDS     2841625 2841627 .       +       0       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
      8 Y       ensembl_havana  CDS     2842165 2842242 .       +       0       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
      9 Y       ensembl_havana  CDS     2844077 2844257 .       +       0       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
     10 Y       ensembl_havana  CDS     2845646 2845743 .       +       2       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
     11 Y       ensembl_havana  CDS     2854600 2854771 .       +       0       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
     12 Y       ensembl_havana  CDS     2865088 2865245 .       +       2       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
     13 Y       ensembl_havana  CDS     2866793 2866894 .       +       0       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
     14 Y       ensembl_havana  three_prime_UTR 2866895 2867268 .       +       .       Parent=transcript:ENST00000250784
     15 Y       havana  five_prime_UTR  2841920 2841943 .       +       .       Parent=transcript:ENST00000430575
     16 Y       havana  CDS     2841944 2841973 .       +       0       ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
     17 Y       havana  CDS     2842165 2842242 .       +       0       ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
     18 Y       havana  CDS     2844077 2844257 .       +       0       ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
     19 Y       havana  CDS     2845646 2845743 .       +       2       ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
     20 Y       havana  CDS     2854600 2854771 .       +       0       ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
     21 Y       havana  CDS     2865088 2865245 .       +       2       ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
     22 Y       havana  CDS     2866793 2866862 .       +       0       ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
     23 Y       havana  mRNA    2935281 2982506 .       +       .       ID=transcript:ENST00000383052;Parent=gene:ENSG00000067646;Name=ZFY-202;biotype=protein_coding;ccdsid=CCDS14774.1;tag=basic;transcript_id=ENST00000383052;transcript_suppo
     24 Y       havana  five_prime_UTR  2935281 2935446 .       +       .       Parent=transcript:ENST00000383052
     25 Y       havana  five_prime_UTR  2953909 2953936 .       +       .       Parent=transcript:ENST00000383052
     26 Y       havana  CDS     2953937 2953997 .       +       0       ID=CDS:ENSP00000372525;Parent=transcript:ENST00000383052;protein_id=ENSP00000372525
     27 Y       havana  CDS     2961074 2961646 .       +       2       ID=CDS:ENSP00000372525;Parent=transcript:ENST00000383052;protein_id=ENSP00000372525
     28 Y       havana  CDS     2975095 2975244 .       +       2       ID=CDS:ENSP00000372525;Parent=transcript:ENST00000383052;protein_id=ENSP00000372525
     29 Y       havana  CDS     2975511 2975654 .       +       2       ID=CDS:ENSP00000372525;Parent=transcript:ENST00000383052;protein_id=ENSP00000372525

5. Find the lines in example.fq that contain @ and count them
6. Find the lines in example.fq that start with @ and count them

$ cat example.fq | grep -c "@" 
1502
Last2 18:39:50 ~/Data
$ cat example.fq | grep -c "^@" 
1006
Last2 18:40:21 ~/Data
$ cat example.fq | wc -l
4000
Last2 18:41:48 ~/Data

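A FASTQ record spans 4 lines, so 4000 lines hold 1000 records. Why is the count 1006, 6 too many? Because some quality-score lines also start with @. Inspecting the output shows that every ID line starts with @ERR, so matching ^@ERR leaves out the quality-score lines.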

$ cat example.fq | grep "^@ERR"
@ERR329499.1 HWUSI-EAS697:8:115:13414:19955#ACAGTG/1
@ERR329499.2 HWUSI-EAS697:8:116:12001:8002#ACAGTG/1
@ERR329499.3 HWUSI-EAS697:8:109:15856:9893#ACAGTG/1
@ERR329499.4 HWUSI-EAS697:8:112:11677:17310#ACAGTG/1
@ERR329499.5 HWUSI-EAS697:8:107:15127:3214#ACAGTG/1
@ERR329499.6 HWUSI-EAS697:8:107:2618:15051#ACAGTG/1
@ERR329499.7 HWUSI-EAS697:8:115:16789:7248#ACAGTG/1
@ERR329499.8 HWUSI-EAS697:8:109:5676:19198#ACAGTG/1
@ERR329499.9 HWUSI-EAS697:8:118:11989:2132#ACAGTG/1
@ERR329499.10 HWUSI-EAS697:8:109:2951:9799#ACAGTG/1
@ERR329499.11 HWUSI-EAS697:8:113:8258:4189#ACAGTG/1
@ERR329499.12 HWUSI-EAS697:8:111:6447:4906#ACAGTG/1
@ERR329499.13 HWUSI-EAS697:8:111:7152:13520#ACAGTG/1
@ERR329499.14 HWUSI-EAS697:8:117:3796:12057#ACAGTG/1
@ERR329499.15 HWUSI-EAS697:8:115:8447:15115#ACAGTG/1
@ERR329499.16 HWUSI-EAS697:8:108:13238:6838#ACAGTG/1
@ERR329499.17 HWUSI-EAS697:8:115:3731:19054#ACAGTG/1
@ERR329499.18 HWUSI-EAS697:8:110:16526:5227#ACAGTG/1
@ERR329499.19 HWUSI-EAS697:8:107:14430:6946#ACAGTG/1
@ERR329499.20 HWUSI-EAS697:8:108:12737:6667#ACAGTG/1
@ERR329499.21 HWUSI-EAS697:8:113:10105:12246#ACAGTG/1
@ERR329499.22 HWUSI-EAS697:8:114:4407:7477#ACAGTG/1
@ERR329499.23 HWUSI-EAS697:8:112:5156:18014#ACAGTG/1
@ERR329499.24 HWUSI-EAS697:8:117:14629:11219#ACAGTG/1
@ERR329499.25 HWUSI-EAS697:8:105:12596:7464#ACAGTG/1
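
When the read IDs do not share a convenient prefix, the 4-line layout itself can be used instead. A minimal sketch on two invented records, assuming a strictly 4-line-per-record FASTQ file; the second record's quality line starts with @ on purpose:

```shell
fastq=$(printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n@III\n')

# Counts the quality line that happens to start with @ as well.
at_lines=$(printf '%s\n' "$fastq" | grep -c '^@')

# Keep only lines 1, 5, 9, ... which are the header lines.
records=$(printf '%s\n' "$fastq" | awk 'NR % 4 == 1' | wc -l)

# Total number of lines divided by 4.
by_lines=$(( $(printf '%s\n' "$fastq" | wc -l) / 4 ))

echo "at_lines=$at_lines records=$records by_lines=$by_lines"
```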

Supplement: mounted directories, which explains why the home directory is not under /home but under trainee1

Do not add spaces carelessly

$ ls last 2 /  # split on spaces, so last, 2 and / are treated as three separate arguments
ls: cannot access 'last': No such file or directory
ls: cannot access '2': No such file or directory
/:
bin  boot  data  dev  etc  home  initrd.img  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  teach  tmp  trainee1  trainee2  usr  var  vmlinuz
Last2 10:23:09 ~
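$ sudo rm -rf Last 2 / # an extremely dangerous command
# sudo runs it with administrator privileges; because of the stray spaces, / is its own argument, so this asks rm to force a recursive delete of the root directory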
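
If a name really does contain a space, quoting or escaping keeps it as one argument. The sketch below stays inside a temporary directory, and the directory name "Last 2" is only an illustration.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/Last 2"            # one directory whose name contains a space
touch "$workdir/Last 2/file.txt"

# Quoted: the name stays a single argument.
quoted_listing=$(ls "$workdir/Last 2")

# A backslash before the space has the same effect.
escaped_listing=$(ls "$workdir"/Last\ 2)

# Putting echo in front of a risky command prints the arguments the shell
# would pass, without running the command itself.
preview=$(echo rm -r "$workdir/Last 2")
echo "$preview"

rm -r "$workdir"
```

Previewing with echo before any rm that involves variables or unusual names costs a few seconds and shows exactly what would be deleted.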
