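Homework
Use the less command to view example.gtf or the Y chromosome GFF file,
and explore the -S and -N options.
Solution: 1. Decompress the file, then view it with less -NS Homo_sapiens.GRCh38.102.chromosome.Y.gff3
(-N shows line numbers; -S truncates long lines instead of wrapping them)
2. View the compressed file directly with zless
Compressed files
tar works on archives: a single archive can hold many files and directories. gzip is generally used to compress a single file.
So when a file name ends in tar.gz, handle it with tar; when it ends in gz with no tar, handle it with gzip or a similar tool.
gunzip decompresses the file and removes the original compressed file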
$ ls
bashrc_bk example.fa example.fq example.gtf Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz readme.txt
Last2 20:47:54 ~/Data
$ gunzip Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz
Last2 20:50:05 ~/Data
$ ls
bashrc_bk example.fa example.fq example.gtf Homo_sapiens.GRCh38.102.chromosome.Y.gff3 readme.txt
Extracting with tar keeps the original compressed file, and tab completion of the file name does not work
$ ls
bashrc_bk example.fa example.fq example.gtf Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz readme.txt
Last2 20:53:04 ~/Data
$ tar -zxvf Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz # tab completion of the file name does not work
Homo_sapiens.GRCh38.102.chromosome.Y.gff3
Last2 20:53:36 ~/Data
$ ls
bashrc_bk example.fq Homo_sapiens.GRCh38.102.chromosome.Y.gff3 readme.txt
example.fa example.gtf Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz
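The difference between gzip (one file) and tar (an archive of many files) can be tried safely in a scratch directory. This is a minimal sketch, not part of the course material: demo.gff3, copy.gff3, and the project directory are hypothetical names, and gunzip -c is used here as one way to keep the .gz file while still reading its contents.

```shell
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'chrY\tdemo\tgene\n' > demo.gff3      # hypothetical one-line file

# gzip replaces demo.gff3 with demo.gff3.gz ...
gzip demo.gff3
# ... while "gunzip -c" writes the decompressed text to stdout,
# so the .gz file stays in place.
gunzip -c demo.gff3.gz > copy.gff3

# tar bundles several files into one archive; -z adds gzip compression.
mkdir project
cp copy.gff3 project/a.gff3
cp copy.gff3 project/b.gff3
tar -czf project.tar.gz project

# List the archive contents without extracting anything.
archive_listing=$(tar -tzf project.tar.gz | sort)
echo "$archive_listing"
```

Listing with tar -tzf before extracting is a cheap way to confirm that a file really is a tar archive and to see what it will unpack.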
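cat: concatenate. Prints the contents of text files to the screen (standard output)
Beware of flooding the screen when the file is large!
Common options:
-A ## show all content, including special characters such as tabs
-n ## number every line; the -b option numbers only the non-blank lines
Others:
zcat: view compressed text files    tac: print the lines in reverse order
Common usage:
$ cat >file # redirection; a space after > is optional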
Welcome to
^C # press Enter first, then Ctrl+C; otherwise the text on the current line is lost
Last2 10:07:55 ~
Exit with Ctrl+C (Ctrl+D on an empty line also ends the input); delete with Ctrl+Backspace
A path can be given
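For scripts, a here-document is a common alternative to typing into cat and interrupting it. The sketch below is an illustration under assumptions: the file name and its three lines are made up, and it also shows how -n and -b differ on a blank line.

```shell
workdir=$(mktemp -d)

# The here-document ends at the EOF marker, so no Ctrl+C is needed.
cat > "$workdir/file" <<'EOF'
Welcome to

Biotrainee
EOF

# -n numbers every line, including the blank one.
cat -n "$workdir/file"
# -b numbers only the non-blank lines.
cat -b "$workdir/file"

total_lines=$(wc -l < "$workdir/file" | tr -d ' ')
numbered_lines=$(cat -b "$workdir/file" | grep -c '[0-9]')
echo "$numbered_lines of $total_lines lines are numbered by -b"
```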
$ cat Data/example.fa
>gi|556503834|ref|NC_000913.3|:c3317526-3316039 Escherichia coli str. K-12 substr. MG1655, complete genome
ATGAACAAAGAAATTTTGGCTGTAGTTGAAGCCGTATCCAATGAAAAGGCGCTACCTCGCGAGAAGATTT
TCGAAGCATTGGAAAGCGCGCTGGCGACAGCAACAAAGAAAAAATATGAACAAGAGATCGACGTCCGCGT
ACAGATCGATCGCAAAAGCGGTGATTTTGACACTTTCCGTCGCTGGTTAGTTGTTGATGAAGTCACCCAG
CCGACCAAGGAAATCACCCTTGAAGCCGCACGTTATGAAGATGAAAGCCTGAACCTGGGCGATTACGTTG
AAGATCAGATTGAGTCTGTTACCTTTGACCGTATCACTACCCAGACGGCAAAACAGGTTATCGTGCAGAA
AGTGCGTGAAGCCGAACGTGCGATGGTGGTTGATCAGTTCCGTGAACACGAAGGTGAAATCATCACCGGC
GTGGTGAAAAAAGTAAACCGCGACAACATCTCTCTGGATCTGGGCAACAACGCTGAAGCCGTGATCCTGC
GCGAAGATATGCTGCCGCGTGAAAACTTCCGCCCTGGCGACCGCGTTCGTGGCGTGCTCTATTCCGTTCG
CCCGGAAGCGCGTGGCGCGCAACTGTTCGTCACTCGTTCCAAGCCGGAAATGCTGATCGAACTGTTCCGT
ATTGAAGTGCCAGAAATCGGCGAAGAAGTGATTGAAATTAAAGCAGCGGCTCGCGATCCGGGTTCTCGTG
CGAAAATCGCGGTGAAAACCAACGATAAACGTATCGATCCGGTAGGTGCTTGCGTAGGTATGCGTGGCGC
GCGTGTTCAGGCGGTGTCTACTGAACTGGGTGGCGAGCGTATCGATATCGTCCTGTGGGATGATAACCCG
GCGCAGTTCGTGATTAACGCAATGGCACCGGCAGACGTTGCTTCTATCGTGGTGGATGAAGATAAACACA
CCATGGATATCGCCGTTGAAGCCGGTAACCTGGCGCAGGCGATTGGCCGTAACGGTCAGAACGTGCGTCT
GGCTTCGCAGCTGAGCGGTTGGGAACTCAACGTGATGACCGTTGACGACCTGCAGGCTAAGCATCAGGCG
GAAGCGCACGCAGCGATCGACACCTTCACCAAATATCTCGACATCGACGAAGACTTCGCGACTGTTCTGG
TAGAAGAAGGCTTCTCGACGCTGGAAGAATTGGCCTATGTGCCGATGAAAGAGCTGTTGGAAATCGAAGG
CCTTGATGAGCCGACCGTTGAAGCACTGCGCGAGCGTGCTAAAAATGCACTGGCCACCATTGCACAGGCC
CAGGAAGAAAGCCTCGGTGATAACAAACCGGCTGACGATCTGCTGAACCTTGAAGGGGTAGATCGTGATT
TGGCATTCAAACTGGCCGCCCGTGGCGTTTGTACGCTGGAAGATCTCGCCGAACAGGGCATTGATGATCT
GGCTGATATCGAAGGGTTGACCGACGAAAAAGCCGGAGCACTGATTATGGCTGCCCGTAATATTTGCTGG
TTCGGTGACGAAGCGTAA
Several files can be printed at once
$ cat file1 file2
echo
Wrap the text in a matching pair of quotes and echo prints it as one string. echo does not open files.
$ echo "example.gtf | cut -f 9"
example.gtf | cut -f 9
Last2 16:51:53 ~/myDir
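The contrast between a quoted pipe character and a real pipe can be seen side by side. A small sketch with made-up strings:

```shell
# Quoted: the whole string, pipe character included, is just text to print.
quoted=$(echo "example.gtf | cut -f 9")

# Unquoted pipe: the shell connects echo's output to cut's input.
piped=$(echo "a b c" | cut -d ' ' -f 2)

echo "$quoted"
echo "$piped"
```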
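wc: text statistics
Common options: -l counts lines
-w counts words
-c counts bytes
How the -w word count is obtained: a word is a string with whitespace on both sides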
$ wc readme.txt
5 29 206 readme.txt
Last2 09:19:39 ~
$ cat readme.txt | tr ' ' '\n'
Welcome
to
Biotrainee()
!
This
is
your
personal
account
in
our
Cloud.
Have
a
fun
with
it.
Please
feel
free
to
contact
with
me(
email
to
[email protected]
)
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 09:21:23 ~
$ cat readme.txt | tr ' ' '\n' | wc -l
29
Last2 09:22:15 ~
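The same check can be reproduced on a tiny file. This is a sketch with a hypothetical note.txt; it adds tr -s, which is an assumption beyond the transcript above, so that several spaces in a row do not produce empty lines that would inflate the count.

```shell
workdir=$(mktemp -d)
printf 'Welcome to   Biotrainee\nHave a fun\n' > "$workdir/note.txt"

# Reading from stdin (<) keeps the file name out of wc's output.
line_count=$(wc -l < "$workdir/note.txt" | tr -d ' ')
word_count=$(wc -w < "$workdir/note.txt" | tr -d ' ')

# tr turns spaces into newlines and -s squeezes repeated newlines,
# so every word lands on its own line and wc -l can count them.
words_via_tr=$(tr -s ' ' '\n' < "$workdir/note.txt" | wc -l | tr -d ' ')

echo "$line_count lines, $word_count words, $words_via_tr words via tr"
```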
sort
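sort command https://man.linuxde.net/sort
ASCII code values https://baike.baidu.com/item/ASCII/309296
The sort command (file filtering, splitting, and merging)
sort is very useful on Linux: it sorts the lines of a file and writes the result to standard output. It can read its input from named files or from stdin.
Syntax
sort (options) (arguments)
Options
-b: ignore leading blanks on each line;
-c: check whether the file is already sorted;
-d: when sorting, consider only letters, digits, and blanks, and ignore all other characters;
-f: when sorting, treat lowercase letters as uppercase;
-i: when sorting, ignore characters outside the ASCII range 040 to 176 (octal);
-m: merge several files that are already sorted;
-M: sort by the first three letters, read as month abbreviations;
-n: sort by numeric value;
-o<output file>: write the sorted result to the specified file;
-r: sort in reverse order;
-t<separator>: specify the field separator used for sorting;
+<start field>-<end field>: sort on the given fields, from the start field up to the field before the end field (legacy syntax; current versions of sort use -k instead).
Arguments
Files: the list of files to sort.
Examples
sort treats each line of the file or text as one unit and compares lines with each other, character by character from the first one, by ASCII code value, then prints them in ascending order. (Strict ASCII order applies in the C locale; other locales may collate differently.)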
[root@mail text]# cat sort.txt
aaa:10:1.1
ccc:30:3.3
ddd:40:4.4
bbb:20:2.2
eee:50:5.5
eee:50:5.5
[root@mail text]# sort sort.txt
aaa:10:1.1
bbb:20:2.2
ccc:30:3.3
ddd:40:4.4
eee:50:5.5
eee:50:5.5
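Sorting on one field of the same colon-separated data can be sketched as follows. The -k and -u options are not covered in the list above and are brought in here as an assumption about what a current sort offers; LC_ALL=C is set only to make the plain byte-order comparison explicit.

```shell
workdir=$(mktemp -d)
cat > "$workdir/sort.txt" <<'EOF'
aaa:10:1.1
ccc:30:3.3
ddd:40:4.4
bbb:20:2.2
eee:50:5.5
eee:50:5.5
EOF

# -t: splits fields on ":"; -k2,2 sorts on field 2 only;
# n compares numerically and r reverses the order.
by_number_desc=$(sort -t: -k2,2nr "$workdir/sort.txt")

# -u keeps a single copy of each identical line.
unique_sorted=$(LC_ALL=C sort -u "$workdir/sort.txt")

echo "$by_number_desc" | head -n 1
echo "$unique_sorted"
```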
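uniq: remove duplicate lines
Only adjacent duplicate lines are removed
Common options: -c: prefix each line with the number of consecutive times it occurs
Run sort first and then uniq; treat the two as a single unit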
$ cd Data/
Last2 22:34:12 ~/Data
$ ls
bashrc_bk example.fa example.fq example.gtf Homo_sapiens.GRCh38.102.chromosome.Y.gff3 Homo_sapiens.GRCh38.102.chromosome.Y.gff3.gz readme.txt
Last2 22:34:15 ~/Data
$ cat example.gtf | cut -f 3 | head
UTR
exon
transcript
gene
exon
transcript
exon
exon
UTR
exon
Last2 22:34:53 ~/Data
$ cat example.gtf | cut -f 3 | uniq | less -S
Last2 22:38:07 ~/Data
$ cat example.gtf | cut -f 3 | uniq -c | less -S
1 UTR
1 exon
1 transcript
1 gene
1 exon
1 transcript
2 exon
1 UTR
3 exon
1 UTR
2 exon
1 start_codon
1 CDS
2 UTR
2 exon
2 transcript
1 gene
1 stop_codon
1 UTR
1 exon
1 transcript
1 stop_codon
2 CDS
1 UTR
3 exon
1 CDS
1 exon
1 UTR
2 exon
1 CDS
1 exon
1 CDS
1 UTR
3 exon
1 CDS
1 exon
1 UTR
2 exon
1 UTR
1 CDS
3 exon
1 stop_codon
1 CDS
1 UTR
1 stop_codon
1 exon
1 transcript
2 CDS
1 exon
1 CDS
2 exon
3 CDS
4 exon
2 CDS
3 exon
1 UTR
1 CDS
1 exon
1 start_codon
1 UTR
1 CDS
2 UTR
4 exon
1 start_codon
2 UTR
1 exon
2 UTR
4 exon
1 transcript
1 gene
1 exon
1 transcript
1 exon
1 gene
1 transcript
3 exon
1 UTR
1 exon
1 gene
1 transcript
1 stop_codon
1 CDS
1 exon
1 transcript
1 CDS
1 exon
1 CDS
2 exon
1 start_codon
1 UTR
1 CDS
1 exon
1 gene
1 transcript
1 CDS
1 exon
1 UTR
1 stop_codon
1 exon
1 gene
1 transcript
1 UTR
1 exon
1 gene
1 transcript
1 start_codon
1 CDS
1 stop_codon
1 UTR
1 exon
1 transcript
1 gene
1 exon
1 gene
1 transcript
3 exon
1 transcript
1 exon
1 transcript
8 exon
1 transcript
1 exon
1 gene
1 transcript
2 exon
1 transcript
1 gene
1 UTR
1 stop_codon
1 exon
1 transcript
1 CDS
1 exon
1 CDS
1 exon
1 start_codon
1 exon
1 gene
1 transcript
2 exon
1 gene
1 transcript
1 exon
1 transcript
4 exon
1 gene
1 transcript
1 exon
1 gene
1 transcript
1 exon
1 gene
1 transcript
2 exon
1 gene
1 transcript
2 exon
1 transcript
6 exon
1 UTR
1 exon
1 transcript
1 gene
1 exon
1 transcript
1 exon
1 transcript
1 stop_codon
1 CDS
1 exon
1 CDS
1 exon
1 start_codon
2 exon
1 gene
1 transcript
1 exon
(END)
$ cat example.gtf | cut -f 3 | sort | uniq -c | less -S # sort first, then uniq; treat the two as a single unit
29 CDS
111 exon
20 gene
7 start_codon
9 stop_codon
34 transcript
27 UTR
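Why sort has to come first, and how to rank the counts afterwards, can be shown on a six-line list. This is a sketch with made-up feature names; the trailing sort -k1,1nr is an extra step that is not in the transcript above.

```shell
features=$(printf '%s\n' exon UTR exon gene exon UTR)

# Without sort, only adjacent duplicates collapse, so nothing merges here.
unsorted_groups=$(echo "$features" | uniq | wc -l | tr -d ' ')

# sort groups identical lines; uniq -c counts each group;
# the final sort -k1,1nr ranks the groups from most to least frequent.
ranked=$(echo "$features" | sort | uniq -c | sort -k1,1nr)
top_feature=$(echo "$ranked" | head -n 1 | awk '{print $2}')

echo "$unsorted_groups groups before sorting; most frequent: $top_feature"
```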
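paste: merge text; by default it merges column-wise
Common options: -d: specify the delimiter
-s: merge row-wise (each file becomes one row)
Common usage: paste file1 file2    seq 20 | paste - -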
$ cat > file1
1
2
3
4
5
^C
Last2 22:49:58 ~/Data
$ cat >file2
gene
gene
gene
gene
gene
gene
^C
Last2 22:51:13 ~/Data
$ cat file1
1
2
3
4
5
Last2 22:51:20 ~/Data
$ cat file2
gene
gene
gene
gene
gene
gene
Last2 22:51:24 ~/Data
$ paste file1 file2 # merges column-wise by default
1 gene
2 gene
3 gene
4
5 gene
gene
gene
Last2 22:52:10 ~/Data
$ paste -s file1 file2 # -s: merge row-wise
1 2 3 4 5
gene gene gene gene gene gene
Last2 22:55:17 ~/Data
$ cat file1 file2 # prints one file after the other
1
2
3
4
5
gene
gene
gene
gene
gene
gene
Last2 22:55:31 ~/Data
$ cat file1 file2 > file3 # redirect file1 and file2 into file3, merging the two files into one
Last2 22:59:17 ~/Data
$ cat file3
1
2
3
4
5
gene
gene
gene
gene
gene
gene
Last2 22:59:24 ~/Data
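seq 20 | paste - - : each - reads one line from standard input, so consecutive lines are joined into one row. With gene files this helps to get rid of some of the repeated lines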
$ seq 20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Last2 23:01:38 ~/Data
$ seq 20 | paste - -
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
17 18
19 20
Last2 23:02:38 ~/Data
$ seq 20 | paste - - - -
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
Last2 23:02:45 ~/Data
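One place where the paste - - - - pattern is handy is FASTQ data, where each record spans four lines. The sketch below uses two invented records; treating FASTQ this way is an illustration, not something the notes above do.

```shell
# Two hypothetical 4-line FASTQ records.
records=$(printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'TTGA' '+' '@III')

# Each "-" reads one line from stdin, so four of them fold every
# 4-line record into a single tab-separated row.
rows=$(echo "$records" | paste - - - -)

# -d sets the delimiter; -s joins all input lines into one row.
csv=$(seq 5 | paste -s -d ',' -)

echo "$rows"
echo "$csv"
```

With one record per row, tools such as cut can then pick out the ID or the sequence column directly.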
tr: character replacement
Common options: -d: delete the specified characters
-s: squeeze runs of a repeated character into one
$ cd Data
Last2 09:51:41 ~/Data
$ cat example.gtf | cut -f 1,2 | tr "\t" ":" | head # replace tabs with :
chr1:ENSEMBL
chr1:ENSEMBL
chr1:ENSEMBL
chr1:HAVANA
chr1:HAVANA
chr1:HAVANA
chr1:HAVANA
chr1:HAVANA
chr1:ENSEMBL
chr1:ENSEMBL
Last2 09:52:43 ~/Data
$ cat > test1
1   2  3   4      5
^C
$ cat test1
1   2  3   4      5
Last2 09:54:01 ~/Data
$ cat test1 | tr " " ":"
1:::2::3:::4::::::5
Last2 09:55:06 ~/Data
$ cat test1 | tr -s " " ":" # squeeze runs of the repeated character
1:2:3:4:5
Last2 09:55:22 ~/Data
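The three modes of tr (translate, squeeze, delete) can be compared on one input string. A minimal sketch; the case-conversion line is an extra example that is not in the notes above.

```shell
spaced='1   2  3   4      5'

# Plain translation keeps one ":" per space.
one_to_one=$(echo "$spaced" | tr ' ' ':')
# -s squeezes each run of the translated character down to one.
squeezed=$(echo "$spaced" | tr -s ' ' ':')
# -d deletes every listed character instead of translating it.
deleted=$(echo "$spaced" | tr -d ' ')
# Ranges translate whole sets of characters, e.g. lower to upper case.
upper=$(echo 'acgt' | tr 'a-z' 'A-Z')

printf '%s\n' "$one_to_one" "$squeezed" "$deleted" "$upper"
```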
Homework: extract column 9 of example.gtf
The output is expected to be long, so view it with less -NS
$ less example.gtf | cut -f 9 | less -NS
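3. Building on step 2, use the semicolon as the delimiter and extract field 1
Remember to add -f; otherwise cut does not know which field to output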
$ less example.gtf | cut -f 9 | cut -d ";" -f 1
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000223972"
gene_id "ENSG00000223972"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
gene_id "ENSG00000227232"
4. Building on step 3, sort, remove duplicates, and count the lines
5. Building on step 4, replace the spaces with colons
$ less example.gtf | cut -f 9 | cut -d ";" -f 1 | sort | uniq -c
8 gene_id "ENSG00000177693"
15 gene_id "ENSG00000184731"
3 gene_id "ENSG00000221311"
3 gene_id "ENSG00000222623"
19 gene_id "ENSG00000223972"
4 gene_id "ENSG00000227061"
83 gene_id "ENSG00000227232"
8 gene_id "ENSG00000233004"
3 gene_id "ENSG00000233750"
15 gene_id "ENSG00000237613"
12 gene_id "ENSG00000237683"
18 gene_id "ENSG00000238009"
12 gene_id "ENSG00000239368"
4 gene_id "ENSG00000239906"
4 gene_id "ENSG00000239945"
3 gene_id "ENSG00000240361"
3 gene_id "ENSG00000240786"
4 gene_id "ENSG00000241599"
8 gene_id "ENSG00000241860"
8 gene_id "ENSG00000243485"
Last2 10:29:22 ~/Data
$ less example.gtf | cut -f 9 | cut -d ";" -f 1 | sort | uniq -c | tr " " ":"
::::::8:gene_id:"ENSG00000177693"
:::::15:gene_id:"ENSG00000184731"
::::::3:gene_id:"ENSG00000221311"
::::::3:gene_id:"ENSG00000222623"
:::::19:gene_id:"ENSG00000223972"
::::::4:gene_id:"ENSG00000227061"
:::::83:gene_id:"ENSG00000227232"
::::::8:gene_id:"ENSG00000233004"
::::::3:gene_id:"ENSG00000233750"
:::::15:gene_id:"ENSG00000237613"
:::::12:gene_id:"ENSG00000237683"
:::::18:gene_id:"ENSG00000238009"
:::::12:gene_id:"ENSG00000239368"
::::::4:gene_id:"ENSG00000239906"
::::::4:gene_id:"ENSG00000239945"
::::::3:gene_id:"ENSG00000240361"
::::::3:gene_id:"ENSG00000240786"
::::::4:gene_id:"ENSG00000241599"
::::::8:gene_id:"ENSG00000241860"
::::::8:gene_id:"ENSG00000243485"
Last2 10:30:07 ~/Data
$ less example.gtf | cut -f 9 | cut -d ";" -f 1 | sort | uniq -c | tr -s " " ":"
:8:gene_id:"ENSG00000177693"
:15:gene_id:"ENSG00000184731"
:3:gene_id:"ENSG00000221311"
:3:gene_id:"ENSG00000222623"
:19:gene_id:"ENSG00000223972"
:4:gene_id:"ENSG00000227061"
:83:gene_id:"ENSG00000227232"
:8:gene_id:"ENSG00000233004"
:3:gene_id:"ENSG00000233750"
:15:gene_id:"ENSG00000237613"
:12:gene_id:"ENSG00000237683"
:18:gene_id:"ENSG00000238009"
:12:gene_id:"ENSG00000239368"
:4:gene_id:"ENSG00000239906"
:4:gene_id:"ENSG00000239945"
:3:gene_id:"ENSG00000240361"
:3:gene_id:"ENSG00000240786"
:4:gene_id:"ENSG00000241599"
:8:gene_id:"ENSG00000241860"
:8:gene_id:"ENSG00000243485"
Last2 10:30:31 ~/Data
$ less example.gtf | cut -f 9 | cut -d ";" -f 1 | sort | uniq -c | tr -s " " "\t"
8 gene_id "ENSG00000177693"
15 gene_id "ENSG00000184731"
3 gene_id "ENSG00000221311"
3 gene_id "ENSG00000222623"
19 gene_id "ENSG00000223972"
4 gene_id "ENSG00000227061"
83 gene_id "ENSG00000227232"
8 gene_id "ENSG00000233004"
3 gene_id "ENSG00000233750"
15 gene_id "ENSG00000237613"
12 gene_id "ENSG00000237683"
18 gene_id "ENSG00000238009"
12 gene_id "ENSG00000239368"
4 gene_id "ENSG00000239906"
4 gene_id "ENSG00000239945"
3 gene_id "ENSG00000240361"
3 gene_id "ENSG00000240786"
4 gene_id "ENSG00000241599"
8 gene_id "ENSG00000241860"
8 gene_id "ENSG00000243485"
Last2 10:31:06 ~/Data
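The leading colon in the tr -s output above comes from the spaces that uniq -c uses to right-align its counts. One way to avoid it, sketched here as an assumption rather than the course's own answer, is to strip that leading padding with sed before translating. The three gene_id lines are invented.

```shell
ids=$(printf 'gene_id "A"\ngene_id "B"\ngene_id "A"\n')

# uniq -c pads the count with leading spaces; sed removes them first,
# so tr has no leading run of spaces to turn into a separator.
table=$(echo "$ids" | sort | uniq -c | sed 's/^ *//' | tr -s ' ' '\t')

echo "$table"
```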
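The text-processing troika: grep
With grep and awk, put single quotes ('') around the outermost layer, and use double quotes ("") where quotes are needed inside
Further reading: the difference between single and double quotes on Linux and how to use them
https://blog.csdn.net/hs6605015/article/details/109111568
Regular expression: a logical formula for operating on strings. Predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings.
[] and | do similar jobs here, and several alternatives can be listed: [] matches any one of the listed characters, while | separates alternative patterns and needs grep -E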
$ cat readme.txt | grep [bBc]
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 17:36:31 ~
$ cat readme.txt | grep "b|B|c"
Last2 17:37:22 ~
$ cat readme.txt | grep -E "b|B|c"
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 17:37:38 ~
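The three grep calls above can be reduced to match counts on a tiny input. This sketch uses three invented lines and quotes every pattern, which keeps the shell from expanding [bBc] as a file name pattern.

```shell
text=$(printf '%s\n' 'Welcome to Biotrainee' 'This is your account' 'Have fun')

# Bracket expression: matches any ONE of the characters b, B, or c.
bracket_hits=$(echo "$text" | grep -c '[bBc]')

# Without -E, "|" is an ordinary character, so this searches for the
# literal text "b|B|c". grep exits non-zero on no match, hence "|| true".
literal_hits=$(echo "$text" | grep -c 'b|B|c' || true)

# With -E, "|" separates alternative patterns.
alternation_hits=$(echo "$text" | grep -c -E 'b|B|c')

echo "bracket=$bracket_hits literal=$literal_hits alternation=$alternation_hits"
```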
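Exercise 1:
1. How many genes are there on the human Y chromosome?
Even with -w, lines that contain gene elsewhere are still matched. The reason is that a word does not have to be surrounded by spaces: punctuation before or after it also counts as a word boundary. "Word" here is broader than in everyday use; any neighboring character that is not a letter, digit, or underscore ends the word. You can instead require whitespace on both sides (left for self-study).
$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | grep -w "gene" | less -NS # also matches other lines that contain gene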
1 Y ensembl ncRNA_gene 2784749 2784853 . + . ID=gene:ENSG00000251841;Name=RNU6-1334P;biotype=snRNA;description=RNA%2C U6 small nuclear 1334%2C pseudogene [Source:HGNC Symbol%3BAcc:HGNC:48297];gene_id=ENSG00
2 Y ensembl snRNA 2784749 2784853 . + . ID=transcript:ENST00000516032;Parent=gene:ENSG00000251841;Name=RNU6-1334P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000516032;transcript_support_level=NA;version=1
3 Y ensembl_havana gene 2786855 2787682 . - . ID=gene:ENSG00000184895;Name=SRY;biotype=protein_coding;description=sex determining region Y [Source:HGNC Symbol%3BAcc:HGNC:11311];gene_id=ENSG00000184895;logic_
4 Y ensembl_havana mRNA 2786855 2787682 . - . ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_id=ENST00000383070;transcri
5 Y havana pseudogene 2789827 2790328 . + . ID=gene:ENSG00000237659;Name=RNASEH2CP1;biotype=processed_pseudogene;description=ribonuclease H2 subunit C pseudogene 1 [Source:HGNC Symbol%3BAcc:HGNC:24117];gen
6 Y havana pseudogenic_transcript 2789827 2790328 . + . ID=transcript:ENST00000454281;Parent=gene:ENSG00000237659;Name=RNASEH2CP1-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000454281;transc
7 Y havana pseudogene 2827982 2828218 . + . ID=gene:ENSG00000232195;Name=TOMM22P2;biotype=processed_pseudogene;description=TOMM22 pseudogene 2 [Source:HGNC Symbol%3BAcc:HGNC:38737];gene_id=ENSG00000232195;
8 Y havana pseudogenic_transcript 2827982 2828218 . + . ID=transcript:ENST00000430735;Parent=gene:ENSG00000232195;Name=TOMM22P2-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000430735;transcri
9 Y havana ncRNA_gene 2828192 2840851 . - . ID=gene:ENSG00000286130;Name=AC006040.1;biotype=lncRNA;description=novel transcript;gene_id=ENSG00000286130;logic_name=havana_homo_sapiens;version=1
10 Y havana lnc_RNA 2828192 2840851 . - . ID=transcript:ENST00000651710;Parent=gene:ENSG00000286130;Name=AC006040.1-201;biotype=lncRNA;tag=basic;transcript_id=ENST00000651710;version=1
11 Y ensembl_havana gene 2841602 2932000 . + . ID=gene:ENSG00000129824;Name=RPS4Y1;biotype=protein_coding;description=ribosomal protein S4 Y-linked 1 [Source:HGNC Symbol%3BAcc:HGNC:10425];gene_id=ENSG00000129
12 Y ensembl_havana mRNA 2841602 2867268 . + . ID=transcript:ENST00000250784;Parent=gene:ENSG00000129824;Name=RPS4Y1-201;biotype=protein_coding;ccdsid=CCDS14773.1;tag=basic;transcript_id=ENST00000250784;trans
13 Y havana mRNA 2841920 2866862 . + . ID=transcript:ENST00000430575;Parent=gene:ENSG00000129824;Name=RPS4Y1-202;biotype=protein_coding;transcript_id=ENST00000430575;transcript_support_level=3;version=1
14 Y havana lnc_RNA 2854096 2866956 . + . ID=transcript:ENST00000477725;Parent=gene:ENSG00000129824;Name=RPS4Y1-203;biotype=processed_transcript;transcript_id=ENST00000477725;transcript_support_level=2;version=1
15 Y havana lnc_RNA 2854730 2932000 . + . ID=transcript:ENST00000515575;Parent=gene:ENSG00000129824;Name=RPS4Y1-204;biotype=processed_transcript;transcript_id=ENST00000515575;transcript_support_level=3;version=1
16 Y havana pseudogene 2881683 2890551 . - . ID=gene:ENSG00000227289;Name=HSFY3P;biotype=transcribed_unprocessed_pseudogene;description=heat shock transcription factor Y-linked 3%2C pseudogene [Source:HGNC
17 Y havana pseudogenic_transcript 2881683 2883652 . - . ID=transcript:ENST00000444242;Parent=gene:ENSG00000227289;Name=HSFY3P-201;biotype=transcribed_unprocessed_pseudogene;tag=basic;transcript_id=ENST00000444
18 Y havana lnc_RNA 2883110 2890551 . - . ID=transcript:ENST00000652562;Parent=gene:ENSG00000227289;Name=HSFY3P-202;biotype=processed_transcript;tag=basic;transcript_id=ENST00000652562;version=1
19 Y havana pseudogene 2929001 2931120 . - . ID=gene:ENSG00000229163;Name=NAP1L1P2;biotype=processed_pseudogene;description=nucleosome assembly protein 1 like 1 pseudogene 2 [Source:HGNC Symbol%3BAcc:HGNC:3
20 Y havana pseudogenic_transcript 2929001 2931120 . - . ID=transcript:ENST00000414182;Parent=gene:ENSG00000229163;Name=NAP1L1P2-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000414182;transcri
21 Y havana ncRNA_gene 2934406 2934771 . - . ID=gene:ENSG00000278847;Name=AC006157.1;biotype=lncRNA;description=novel transcript;gene_id=ENSG00000278847;logic_name=havana_homo_sapiens;version=1
22 Y havana lnc_RNA 2934406 2934771 . - . ID=transcript:ENST00000611750;Parent=gene:ENSG00000278847;Name=AC006157.1-201;biotype=lncRNA;tag=basic;transcript_id=ENST00000611750;transcript_support_level=NA;version=
23 Y ensembl_havana gene 2935281 2982506 . + . ID=gene:ENSG00000067646;Name=ZFY;biotype=protein_coding;description=zinc finger protein Y-linked [Source:HGNC Symbol%3BAcc:HGNC:12870];gene_id=ENSG00000067646;lo
24 Y havana mRNA 2935281 2982506 . + . ID=transcript:ENST00000383052;Parent=gene:ENSG00000067646;Name=ZFY-202;biotype=protein_coding;ccdsid=CCDS14774.1;tag=basic;transcript_id=ENST00000383052;transcript_suppo
25 Y ensembl_havana mRNA 2935381 2982506 . + . ID=transcript:ENST00000155093;Parent=gene:ENSG00000067646;Name=ZFY-201;biotype=protein_coding;ccdsid=CCDS14774.1;tag=basic;transcript_id=ENST00000155093;transcri
26 Y ensembl_havana mRNA 2935389 2980347 . + . ID=transcript:ENST00000625061;Parent=gene:ENSG00000067646;Name=ZFY-207;biotype=protein_coding;ccdsid=CCDS48200.1;tag=basic;transcript_id=ENST00000625061;transcri
27 Y havana lnc_RNA 2935500 2978053 . + . ID=transcript:ENST00000469869;Parent=gene:ENSG00000067646;Name=ZFY-205;biotype=processed_transcript;transcript_id=ENST00000469869;transcript_support_level=3;version=1
28 Y havana mRNA 2935505 2961286 . + . ID=transcript:ENST00000443793;Parent=gene:ENSG00000067646;Name=ZFY-203;biotype=protein_coding;transcript_id=ENST00000443793;transcript_support_level=3;version=1
Extract column 3 on its own to exclude interference from the other columns, then match the lines that start with gene
$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | cut -f 3 | grep -w "^gene" | less -S
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
gene
$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | cut -f 3 | grep -w -c "^gene"
47
Last2 18:17:58 ~/Data
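An exact comparison on column 3 avoids the word-boundary problem altogether. The sketch below is run on a three-line, tab-separated sample that only imitates the GFF3 layout (the IDs G1 and G2 are invented), and it brings in awk and grep -x as alternatives that the exercise itself does not use.

```shell
# Hypothetical three-line, tab-separated GFF3-like sample.
sample=$(printf 'Y\thavana\tgene\t1\t9\t.\t+\t.\tID=gene:G1\nY\thavana\tncRNA_gene\t1\t9\t.\t+\t.\tID=gene:G2\nY\thavana\tmRNA\t1\t9\t.\t+\t.\tParent=gene:G1\n')

# grep -w on whole lines also matches "gene" inside the attributes column.
whole_line_hits=$(echo "$sample" | grep -c -w 'gene')

# awk compares column 3 with the exact string, so only true gene rows count.
exact_hits=$(echo "$sample" | awk -F '\t' '$3 == "gene"' | wc -l | tr -d ' ')

# grep -x requires the pattern to match the entire (cut-down) line.
cut_hits=$(echo "$sample" | cut -f 3 | grep -c -x 'gene')

echo "whole-line=$whole_line_hits awk=$exact_hits cut+grep=$cut_hits"
```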
2. Which feature types appear in column 3 of the Y chromosome annotation file?
$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | cut -f 3 | sort | uniq -c
522 ###
290 biological_region
1568 CDS
1 chromosome
4285 exon
222 five_prime_UTR
47 gene
1 #!genebuild-last-updated 2020-09
1 #!genome-build-accession NCBI:GCA_000001405.28
1 #!genome-build GRCh38.p13
1 #!genome-date 2013-12
1 #!genome-version GRCh38
1 ##gff-version 3
258 lnc_RNA
149 mRNA
7 ncRNA
92 ncRNA_gene
382 pseudogene
382 pseudogenic_transcript
1 ##sequence-region Y 2781480 56887902
3 snoRNA
17 snRNA
196 three_prime_UTR
Last2 18:19:50 ~/Data
Use grep -v to filter out the # comment lines
$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | cut -f 3 | grep -v "#" | sort | uniq -c
290 biological_region
1568 CDS
1 chromosome
4285 exon
222 five_prime_UTR
47 gene
258 lnc_RNA
149 mRNA
7 ncRNA
92 ncRNA_gene
382 pseudogene
382 pseudogenic_transcript
3 snoRNA
17 snRNA
196 three_prime_UTR
Last2 18:22:45 ~/Data
3. Match the lines containing exon, then invert the match
$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | grep -v 'exon'
4. Match the lines containing CDS or UTR
$ cat Homo_sapiens.GRCh38.102.chromosome.Y.gff3 | grep -e "CDS" -e 'UTR' | less -SN
1 Y ensembl_havana mRNA 2786855 2787682 . - . ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_id=ENST00000383070;transcri
2 Y ensembl_havana three_prime_UTR 2786855 2786988 . - . Parent=transcript:ENST00000383070
3 Y ensembl_havana CDS 2786989 2787603 . - 0 ID=CDS:ENSP00000372547;Parent=transcript:ENST00000383070;protein_id=ENSP00000372547
4 Y ensembl_havana five_prime_UTR 2787604 2787682 . - . Parent=transcript:ENST00000383070
5 Y ensembl_havana mRNA 2841602 2867268 . + . ID=transcript:ENST00000250784;Parent=gene:ENSG00000129824;Name=RPS4Y1-201;biotype=protein_coding;ccdsid=CCDS14773.1;tag=basic;transcript_id=ENST00000250784;trans
6 Y ensembl_havana five_prime_UTR 2841602 2841624 . + . Parent=transcript:ENST00000250784
7 Y ensembl_havana CDS 2841625 2841627 . + 0 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
8 Y ensembl_havana CDS 2842165 2842242 . + 0 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
9 Y ensembl_havana CDS 2844077 2844257 . + 0 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
10 Y ensembl_havana CDS 2845646 2845743 . + 2 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
11 Y ensembl_havana CDS 2854600 2854771 . + 0 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
12 Y ensembl_havana CDS 2865088 2865245 . + 2 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
13 Y ensembl_havana CDS 2866793 2866894 . + 0 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
14 Y ensembl_havana three_prime_UTR 2866895 2867268 . + . Parent=transcript:ENST00000250784
15 Y havana five_prime_UTR 2841920 2841943 . + . Parent=transcript:ENST00000430575
16 Y havana CDS 2841944 2841973 . + 0 ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
17 Y havana CDS 2842165 2842242 . + 0 ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
18 Y havana CDS 2844077 2844257 . + 0 ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
19 Y havana CDS 2845646 2845743 . + 2 ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
20 Y havana CDS 2854600 2854771 . + 0 ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
21 Y havana CDS 2865088 2865245 . + 2 ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
22 Y havana CDS 2866793 2866862 . + 0 ID=CDS:ENSP00000415317;Parent=transcript:ENST00000430575;protein_id=ENSP00000415317
23 Y havana mRNA 2935281 2982506 . + . ID=transcript:ENST00000383052;Parent=gene:ENSG00000067646;Name=ZFY-202;biotype=protein_coding;ccdsid=CCDS14774.1;tag=basic;transcript_id=ENST00000383052;transcript_suppo
24 Y havana five_prime_UTR 2935281 2935446 . + . Parent=transcript:ENST00000383052
25 Y havana five_prime_UTR 2953909 2953936 . + . Parent=transcript:ENST00000383052
26 Y havana CDS 2953937 2953997 . + 0 ID=CDS:ENSP00000372525;Parent=transcript:ENST00000383052;protein_id=ENSP00000372525
27 Y havana CDS 2961074 2961646 . + 2 ID=CDS:ENSP00000372525;Parent=transcript:ENST00000383052;protein_id=ENSP00000372525
28 Y havana CDS 2975095 2975244 . + 2 ID=CDS:ENSP00000372525;Parent=transcript:ENST00000383052;protein_id=ENSP00000372525
29 Y havana CDS 2975511 2975654 . + 2 ID=CDS:ENSP00000372525;Parent=transcript:ENST00000383052;protein_id=ENSP00000372525
5. Find the lines in example.fq that contain @ and count them
6. Find the lines in example.fq that start with @ and count them
$ cat example.fq | grep -c "@"
1502
Last2 18:39:50 ~/Data
$ cat example.fq | grep -c "^@"
1006
Last2 18:40:21 ~/Data
$ cat example.fq | wc -l
4000
Last2 18:41:48 ~/Data
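A FASTQ record takes 4 lines, so 4000 lines should give 1000 records. Why 1006, which is 6 too many? Because some lines that start with @ are base quality lines. Inspecting the output shows that every ID line starts with @ERR, so matching that prefix excludes the quality lines.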
$ cat example.fq | grep "^@ERR"
@ERR329499.1 HWUSI-EAS697:8:115:13414:19955#ACAGTG/1
@ERR329499.2 HWUSI-EAS697:8:116:12001:8002#ACAGTG/1
@ERR329499.3 HWUSI-EAS697:8:109:15856:9893#ACAGTG/1
@ERR329499.4 HWUSI-EAS697:8:112:11677:17310#ACAGTG/1
@ERR329499.5 HWUSI-EAS697:8:107:15127:3214#ACAGTG/1
@ERR329499.6 HWUSI-EAS697:8:107:2618:15051#ACAGTG/1
@ERR329499.7 HWUSI-EAS697:8:115:16789:7248#ACAGTG/1
@ERR329499.8 HWUSI-EAS697:8:109:5676:19198#ACAGTG/1
@ERR329499.9 HWUSI-EAS697:8:118:11989:2132#ACAGTG/1
@ERR329499.10 HWUSI-EAS697:8:109:2951:9799#ACAGTG/1
@ERR329499.11 HWUSI-EAS697:8:113:8258:4189#ACAGTG/1
@ERR329499.12 HWUSI-EAS697:8:111:6447:4906#ACAGTG/1
@ERR329499.13 HWUSI-EAS697:8:111:7152:13520#ACAGTG/1
@ERR329499.14 HWUSI-EAS697:8:117:3796:12057#ACAGTG/1
@ERR329499.15 HWUSI-EAS697:8:115:8447:15115#ACAGTG/1
@ERR329499.16 HWUSI-EAS697:8:108:13238:6838#ACAGTG/1
@ERR329499.17 HWUSI-EAS697:8:115:3731:19054#ACAGTG/1
@ERR329499.18 HWUSI-EAS697:8:110:16526:5227#ACAGTG/1
@ERR329499.19 HWUSI-EAS697:8:107:14430:6946#ACAGTG/1
@ERR329499.20 HWUSI-EAS697:8:108:12737:6667#ACAGTG/1
@ERR329499.21 HWUSI-EAS697:8:113:10105:12246#ACAGTG/1
@ERR329499.22 HWUSI-EAS697:8:114:4407:7477#ACAGTG/1
@ERR329499.23 HWUSI-EAS697:8:112:5156:18014#ACAGTG/1
@ERR329499.24 HWUSI-EAS697:8:117:14629:11219#ACAGTG/1
@ERR329499.25 HWUSI-EAS697:8:105:12596:7464#ACAGTG/1
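When the ID prefix is not known in advance, the record count can also be taken from the line positions. This is a sketch under the assumption that the file keeps strictly four lines per record (no wrapped sequences); the two records are invented, and the second quality line deliberately starts with @.

```shell
# Two hypothetical records; the second quality line starts with "@".
fastq=$(printf '%s\n' '@ERR1.1 demo' 'ACGT' '+' 'IIII' '@ERR1.2 demo' 'TTGA' '+' '@III')

# Counting "^@" overcounts, because a quality line may begin with "@".
at_lines=$(echo "$fastq" | grep -c '^@')

# In a strict 4-line-per-record file, headers are lines 1, 5, 9, ...
header_lines=$(echo "$fastq" | awk 'NR % 4 == 1' | wc -l | tr -d ' ')

# Equivalent check: total line count divided by 4.
records=$(( $(echo "$fastq" | wc -l) / 4 ))

echo "lines starting with @: $at_lines; records: $header_lines ($records by line count)"
```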
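Addendum: mounted directories explain why the home directory is under trainee1 rather than under /home
Do not add spaces carelessly
$ ls last 2 / # separated by spaces, so last, 2, and / are treated as three arguments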
ls: cannot access 'last': No such file or directory
ls: cannot access '2': No such file or directory
/:
bin boot data dev etc home initrd.img lib lib64 lost+found media mnt opt proc root run sbin srv sys teach tmp trainee1 trainee2 usr var vmlinuz
Last2 10:23:09 ~
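$ sudo rm -rf Last 2 / # an extremely dangerous command
# sudo grants administrator privileges, so this would force a recursive delete of the root directory (recent GNU rm refuses to operate on / unless --no-preserve-root is given, but never rely on that)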
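Quoting is what keeps a name with a space in one piece. The sketch below works only inside a scratch directory created by mktemp; the directory name "Last 2" is made up for the demonstration.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/Last 2"          # one directory whose name contains a space

# Quoted, the name stays a single path.
quoted_ok=no
if ls "$workdir/Last 2" > /dev/null 2>&1; then quoted_ok=yes; fi

# Unquoted, the shell splits it into ".../Last" and "2", which do not exist.
unquoted_ok=yes
if ! ls $workdir/Last 2 > /dev/null 2>&1; then unquoted_ok=no; fi

# Remove only the quoted path inside the scratch directory;
# "--" marks the end of the options.
rm -r -- "$workdir/Last 2"

echo "quoted: $quoted_ok, unquoted: $unquoted_ok"
```

Running ls on the same arguments before any rm is a cheap way to see exactly which paths a command would act on.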