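3. vim, the text editor on Linux
Command mode:
Undoing actions (the "regret pill"):
• u: undo the previous action. It can be pressed repeatedly and undoes changes from the most recent to the oldest.
• ctrl+r: redo, that is, reverse the last undo. ctrl+r only acts on changes that were undone with u.
Redo is the undo of an undo: if you press u too many times, press CTRL-R to restore the change you just undid.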
Last-line mode:
Recovering after an unexpected exit without saving
When the file is opened again:
$ vim readme.txt
E325: ATTENTION
Found a swap file by the name ".readme.txt.swp"
owned by: Last2 dated: Fri Jan 22 10:56:04 2021
file name: ~Last2/readme.txt
modified: YES
user name: Last2 host name: VM-0-17-ubuntu
process ID: 21878
While opening file "readme.txt"
dated: Sun Jan 17 16:51:09 2021
(1) Another program may be editing the same file. If this is the case,
be careful not to end up with two different instances of the same
file when making changes. Quit, or continue with caution.
(2) An edit session for this file crashed.
If this is the case, use ":recover" or "vim -r readme.txt"
to recover the changes (see ":help recovery").
If you did this already, delete the swap file ".readme.txt.swp"
to avoid this message.
Swap file ".readme.txt.swp" already exists!
[O]pen Read-Only, (E)dit anyway, (R)ecover, (D)elete it, (Q)uit, (A)bort:
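While editing, vim keeps a swap file (here .readme.txt.swp) that records unsaved changes. On a normal exit the swap file is removed; after an unexpected exit it is left behind. Press R to recover the unsaved changes from it, or press D to delete it and edit the original file as usual (the unsaved changes are then discarded).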
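Last-line mode:
Searching (no need to type ":"):
• Type /KEYWORD to search downward; ?KEYWORD searches upward
• Press n to repeat the search in the same direction
• Press N to repeat the search in the opposite direction
Type /www to jump to the URL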
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html) #the cursor jumps to the URL
~
~
~
/www
search hit BOTTOM, continuing at TOP
:?www
search hit TOP, continuing at BOTTOM
/gene searches for the keyword gene
$ vim Data/Homo_sapiens.GRCh38.102.chromosome.Y.gff3
1 ##gff-version 3
2 ##sequence-region Y 2781480 56887902
3 #!genome-build GRCh38.p13
4 #!genome-version GRCh38
5 #!genome-date 2013-12
6 #!genome-build-accession NCBI:GCA_000001405.28
7 #!genebuild-last-updated 2020-09
8 Y GRCh38 chromosome 2781480 56887902 . . . ID=chromosome:Y;Alias=CM000686.2,chrY,NC_000024.10
9 ###
10 Y ensembl ncRNA_gene 2784749 2784853 . + . ID=gene:ENSG00000251841;Name=RNU6-1334P;biotype=snRNA;description=RNA%2C U6 small nuclear 1334%2C pseudogene [Source:HGNC Symbol%3BAcc:HGNC:48297];gene_id=ENSG0000 0251841;logic_name=ncrna_homo_sapiens;version=1
11 Y ensembl snRNA 2784749 2784853 . + . ID=transcript:ENST00000516032;Parent=gene:ENSG00000251841;Name=RNU6-1334P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000516032;transcript_support_level=NA;version=1
12 Y ensembl exon 2784749 2784853 . + . Parent=transcript:ENST00000516032;Name=ENSE00002088309;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002088309;rank=1;version=1
13 ###
14 Y ensembl_havana gene 2786855 2787682 . - . ID=gene:ENSG00000184895;Name=SRY;biotype=protein_coding;description=sex determining region Y [Source:HGNC Symbol%3BAcc:HGNC:11311];gene_id=ENSG00000184895;logic_na me=ensembl_havana_gene_homo_sapiens;version=8
15 Y ensembl_havana mRNA 2786855 2787682 . - . ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_id=ENST00000383070;transcript _support_level=NA (assigned to previous version 1);version=2
16 Y ensembl_havana three_prime_UTR 2786855 2786988 . - . Parent=transcript:ENST00000383070
17 Y ensembl_havana exon 2786855 2787682 . - . Parent=transcript:ENST00000383070;Name=ENSE00001494622;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001494622;rank=1;version=2
18 Y ensembl_havana CDS 2786989 2787603 . - 0 ID=CDS:ENSP00000372547;Parent=transcript:ENST00000383070;protein_id=ENSP00000372547
19 Y ensembl_havana five_prime_UTR 2787604 2787682 . - . Parent=transcript:ENST00000383070
20 ###
21 Y . biological_region 2789532 2789711 0.997 - . external_name=rank %3D 1;logic_name=firstef
22 Y havana pseudogene 2789827 2790328 . + . ID=gene:ENSG00000237659;Name=RNASEH2CP1;biotype=processed_pseudogene;description=ribonuclease H2 subunit C pseudogene 1 [Source:HGNC Symbol%3BAcc:HGNC:24117];gene_ id=ENSG00000237659;logic_name=havana_homo_sapiens;version=1
23 Y havana pseudogenic_transcript 2789827 2790328 . + . ID=transcript:ENST00000454281;Parent=gene:ENSG00000237659;Name=RNASEH2CP1-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000454281;transcri pt_support_level=NA;version=1
24 Y havana exon 2789827 2790328 . + . Parent=transcript:ENST00000454281;Name=ENSE00001772499;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001772499;rank=1;version=1
25 ###
26 Y . biological_region 2790002 2790063 0.657 + . external_name=rank %3D 1;logic_name=firstef
27 Y havana pseudogene 2827982 2828218 . + . ID=gene:ENSG00000232195;Name=TOMM22P2;biotype=processed_pseudogene;description=TOMM22 pseudogene 2 [Source:HGNC Symbol%3BAcc:HGNC:38737];gene_id=ENSG00000232195;lo gic_name=havana_homo_sapiens;version=1
28 Y havana pseudogenic_transcript 2827982 2828218 . + . ID=transcript:ENST00000430735;Parent=gene:ENSG00000232195;Name=TOMM22P2-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000430735;transcript _support_level=NA;version=1
29 Y havana exon 2827982 2828218 . + . Parent=transcript:ENST00000430735;Name=ENSE00001614266;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001614266;rank=1;version=1
30 ###
31 Y havana ncRNA_gene 2828192 2840851 . - . ID=gene:ENSG00000286130;Name=AC006040.1;biotype=lncRNA;description=novel transcript;gene_id=ENSG00000286130;logic_name=havana_homo_sapiens;version=1
32 Y havana lnc_RNA 2828192 2840851 . - . ID=transcript:ENST00000651710;Parent=gene:ENSG00000286130;Name=AC006040.1-201;biotype=lncRNA;tag=basic;transcript_id=ENST00000651710;version=1
33 Y havana exon 2828192 2828735 . - . Parent=transcript:ENST00000651710;Name=ENSE00003843322;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003843322;rank=3;version=1
34 Y havana exon 2829526 2829751 . - . Parent=transcript:ENST00000651710;Name=ENSE00003846102;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003846102;rank=2;version=1
35 Y havana exon 2840471 2840851 . - . Parent=transcript:ENST00000651710;Name=ENSE00003844499;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003844499;rank=1;version=1
36 ###
37 Y ensembl_havana gene 2841602 2932000 . + . ID=gene:ENSG00000129824;Name=RPS4Y1;biotype=protein_coding;description=ribosomal protein S4 Y-linked 1 [Source:HGNC Symbol%3BAcc:HGNC:10425];gene_id=ENSG0000012982 4;logic_name=ensembl_havana_gene_homo_sapiens;version=16
38 Y ensembl_havana mRNA 2841602 2867268 . + . ID=transcript:ENST00000250784;Parent=gene:ENSG00000129824;Name=RPS4Y1-201;biotype=protein_coding;ccdsid=CCDS14773.1;tag=basic;transcript_id=ENST00000250784;transcr ipt_support_level=1 (assigned to previous version 12);version=13
39 Y ensembl_havana five_prime_UTR 2841602 2841624 . + . Parent=transcript:ENST00000250784
40 Y ensembl_havana exon 2841602 2841627 . + . Parent=transcript:ENST00000250784;Name=ENSE00002490412;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=ENSE00002490412;rank=1;version=2
41 Y ensembl_havana CDS 2841625 2841627 . + 0 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
42 Y ensembl_havana exon 2842165 2842242 . + . Parent=transcript:ENST00000250784;Name=ENSE00001709586;constitutive=0;ensembl_end_phase=0;ensembl_phase=0;exon_id=ENSE00001709586;rank=2;version=1
43 Y ensembl_havana CDS 2842165 2842242 . + 0 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
44 Y ensembl_havana exon 2844077 2844257 . + . Parent=transcript:ENST00000250784;Name=ENSE00001738202;constitutive=0;ensembl_end_phase=1;ensembl_phase=0;exon_id=ENSE00001738202;rank=3;version=1
45 Y ensembl_havana CDS 2844077 2844257 . + 0 ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
46 Y ensembl_havana exon 2845646 2845743 . + . Parent=transcript:ENST00000250784;Name=ENSE00001602849;constitutive=0;ensembl_end_phase=0;ensembl_phase=1;exon_id=ENSE00001602849;rank=4;version=1
/gene
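Substitution
• :s/old/new/ replaces only the first match on the current line
• :s/old/new/g replaces every match on the current line
• :%s/old/new/ replaces the first match on every line of the file
(A trailing number means something different in vim than in sed: in vim, :s/old/new/2 is a line count, not "the second match on the line".)
Syntax: :[addr]s/source string/target string/[option]
Global substitution: :%s/source string/target string/g
[addr] is the range to search; when omitted, only the current line is searched.
For example, "1,20": lines 1 through 20;
"%": the whole file, same as "1,$"; ".,$": from the current line to the end of the file;
s: the substitute operation
:10,20s/mRNA/MRNA/g #replace every match on lines 10 to 20
:%s/mRNA/MRNA/g #replace every match in the whole file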
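The three workhorses -- sed
sed: a stream editor, generally used to add, delete, modify, and look up text. It edits the stream and writes the result to standard output; the original file is not changed, so redirect the output if you want to keep it.
Usage: sed [-options] 'script' file(s)
Common options: -n: suppress the default output and show only the lines that sed processed or matched (commonly used)
-e: give the sed editing command directly on the command line; repeat -e to run several commands
-i: modify the file in place instead of printing to standard output.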
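The three options can be compared side by side on a throwaway file. This is only a sketch: the temporary file and its contents (alpha, beta, gamma) are made up for illustration, and -i.bak is used so that a backup copy is kept.

```shell
# Hypothetical sample file; its name and contents are assumptions for illustration.
tmp=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$tmp"

# -n suppresses the default output, so only lines printed explicitly with p appear.
only_beta=$(sed -n '/beta/p' "$tmp")
echo "$only_beta"

# -e chains several commands in one invocation.
chained=$(sed -e 's/alpha/ALPHA/' -e 's/gamma/GAMMA/' "$tmp")
echo "$chained"

# Up to this point the file on disk has not been modified.
before=$(cat "$tmp")

# -i edits the file in place; the .bak suffix keeps a backup of the original.
sed -i.bak 's/beta/BETA/' "$tmp"
after=$(cat "$tmp")
echo "$after"

rm -f "$tmp" "$tmp.bak"
```

Keeping a backup suffix with -i is a cautious habit, since an in-place edit cannot be undone otherwise.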
$ cat readme.txt | sed '1 a Hi'
Welcome to Biotrainee() !
Hi
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 14:01:29 ~
$ cat readme.txt #the original file is unchanged
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 14:02:33 ~
$ cat readme.txt | sed '1 a Hi'>tmp #redirect the output to keep the result
Last2 14:03:03 ~
$ cat tmp
Welcome to Biotrainee() !
Hi
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 14:03:07 ~
Replace lines 2 through 4 with **********
$ cat readme.txt | sed '2,4c **********'
Welcome to Biotrainee() !
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:40:55 ~
$ cat readme.txt | sed -e '2c **********' -e '3c **********' -e '4c **********'
Welcome to Biotrainee() !
**********
**********
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:41:56 ~
$ cat readme.txt | sed '2,4c **********\n**********\n**********' #with embedded newlines
Welcome to Biotrainee() !
**********
**********
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:46:16 ~
$ cat readme.txt | sed -e '2,4i**********' -e '2,4d'
Welcome to Biotrainee() !
**********
**********
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:50:05 ~
$ cat readme.txt | sed -e '2,4a**********' -e '2,4d'
Welcome to Biotrainee() !
**********
**********
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:50:16 ~
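As the transcripts above show, c with a range replaces the whole range with a single line, which is why the workarounds were needed. One more possible alternative, sketched here on made-up input (line1 to line5), is to put the address on an s command so that each line in the range is masked individually. The second command shows plain deletion of the same range for comparison.

```shell
# Made-up five-line input, for illustration only.
input=$(printf 'line1\nline2\nline3\nline4\nline5\n')

# Replace the full content of each of lines 2-4 with ten asterisks.
masked=$(printf '%s\n' "$input" | sed '2,4s/.*/**********/')
echo "$masked"

# Delete lines 2-4 entirely.
deleted=$(printf '%s\n' "$input" | sed '2,4d')
echo "$deleted"
```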
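s: substitute, in the form 's/pattern/new/[flags]'. It replaces pattern with new; by default only the first match on each line is replaced, and flags can change that.
With no flag, the first match on each line is replaced; with 2, the second match on each line; with g, every match.
With no address, sed applies the command to every line, unlike vim, where :s without a range only acts on the current line.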
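A small sketch of the three flag choices, using a made-up sentence that contains "ee" four times:

```shell
# Made-up sample line with four occurrences of "ee".
line='Please feel free to see the tree'

first=$(printf '%s\n' "$line" | sed 's/ee/EE/')    # no flag: first match only
second=$(printf '%s\n' "$line" | sed 's/ee/EE/2')  # 2: second match only
all=$(printf '%s\n' "$line" | sed 's/ee/EE/g')     # g: every match

echo "$first"
echo "$second"
echo "$all"
```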
$ cat readme.txt | sed 's/ee/EE/'
Welcome to BiotrainEE() !
This is your personal account in our Cloud.
Have a fun with it.
Please fEEl free to contact with me( email to [email protected] )
(http://www.biotrainEE.com/thread-1376-1-1.html)
Last2 15:57:20 ~
$ cat readme.txt | sed 's/ee/EE/g'
Welcome to BiotrainEE() !
This is your personal account in our Cloud.
Have a fun with it.
Please fEEl frEE to contact with me( email to [email protected] )
(http://www.biotrainEE.com/thread-1376-1-1.html)
Last2 15:57:53 ~
By default sed prints every line to standard output; with -n, only the lines that sed processed or matched are shown.
$ cat readme.txt | sed '/ee/p' #matching lines are printed twice
Welcome to Biotrainee() !
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:09:56 ~
$ cat readme.txt | sed -n '/ee/p' #print only the matching lines
Welcome to Biotrainee() !
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:10:12 ~
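The p command also accepts line-number addresses and negated addresses. A short sketch on made-up input (one, two, three, four):

```shell
# Made-up four-line input, for illustration only.
sample=$(printf 'one\ntwo\nthree\nfour\n')

doubled=$(printf '%s\n' "$sample" | sed '2p')        # without -n, line 2 appears twice
range=$(printf '%s\n' "$sample" | sed -n '2,3p')     # with -n, only lines 2 and 3
inverse=$(printf '%s\n' "$sample" | sed -n '/o/!p')  # ! negates: lines without "o"

echo "$doubled"
echo "$range"
echo "$inverse"
```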
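y: transliterate, a one-to-one character mapping, in the form 'y/inchars/outchars/'. inchars and outchars must be the same length. Often used for base complementation.
The transliteration is global: every instance of the listed characters on the line is replaced.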
$ cat readme.txt | sed 'y/abcd/ABCD/'
WelCome to BiotrAinee() !
This is your personAl ACCount in our ClouD.
HAve A fun with it.
PleAse feel free to ContACt with me( emAil to [email protected] )
(http://www.BiotrAinee.Com/threAD-1376-1-1.html)
Last2 17:34:25 ~
$ cat readme.txt | sed 'y/abc/AB/' #different lengths, so sed reports an error
sed: -e expression #1, char 9: strings for `y' command are different lengths
Last2 17:34:48 ~
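Base complementation example. Note that the FASTA header line is transliterated too (NC becomes NG, MG becomes MC):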
$ cat Data/example.fa | head
>gi|556503834|ref|NC_000913.3|:c3317526-3316039 Escherichia coli str. K-12 substr. MG1655, complete genome
ATGAACAAAGAAATTTTGGCTGTAGTTGAAGCCGTATCCAATGAAAAGGCGCTACCTCGCGAGAAGATTT
TCGAAGCATTGGAAAGCGCGCTGGCGACAGCAACAAAGAAAAAATATGAACAAGAGATCGACGTCCGCGT
ACAGATCGATCGCAAAAGCGGTGATTTTGACACTTTCCGTCGCTGGTTAGTTGTTGATGAAGTCACCCAG
CCGACCAAGGAAATCACCCTTGAAGCCGCACGTTATGAAGATGAAAGCCTGAACCTGGGCGATTACGTTG
AAGATCAGATTGAGTCTGTTACCTTTGACCGTATCACTACCCAGACGGCAAAACAGGTTATCGTGCAGAA
AGTGCGTGAAGCCGAACGTGCGATGGTGGTTGATCAGTTCCGTGAACACGAAGGTGAAATCATCACCGGC
GTGGTGAAAAAAGTAAACCGCGACAACATCTCTCTGGATCTGGGCAACAACGCTGAAGCCGTGATCCTGC
GCGAAGATATGCTGCCGCGTGAAAACTTCCGCCCTGGCGACCGCGTTCGTGGCGTGCTCTATTCCGTTCG
CCCGGAAGCGCGTGGCGCGCAACTGTTCGTCACTCGTTCCAAGCCGGAAATGCTGATCGAACTGTTCCGT
Last2 17:41:35 ~
$ cat Data/example.fa | head | sed 'y/ATCG/TAGC/'
>gi|556503834|ref|NG_000913.3|:c3317526-3316039 Escherichia coli str. K-12 substr. MC1655, complete genome
TACTTGTTTCTTTAAAACCGACATCAACTTCGGCATAGGTTACTTTTCCGCGATGGAGCGCTCTTCTAAA
AGCTTCGTAACCTTTCGCGCGACCGCTGTCGTTGTTTCTTTTTTATACTTGTTCTCTAGCTGCAGGCGCA
TGTCTAGCTAGCGTTTTCGCCACTAAAACTGTGAAAGGCAGCGACCAATCAACAACTACTTCAGTGGGTC
GGCTGGTTCCTTTAGTGGGAACTTCGGCGTGCAATACTTCTACTTTCGGACTTGGACCCGCTAATGCAAC
TTCTAGTCTAACTCAGACAATGGAAACTGGCATAGTGATGGGTCTGCCGTTTTGTCCAATAGCACGTCTT
TCACGCACTTCGGCTTGCACGCTACCACCAACTAGTCAAGGCACTTGTGCTTCCACTTTAGTAGTGGCCG
CACCACTTTTTTCATTTGGCGCTGTTGTAGAGAGACCTAGACCCGTTGTTGCGACTTCGGCACTAGGACG
CGCTTCTATACGACGGCGCACTTTTGAAGGCGGGACCGCTGGCGCAAGCACCGCACGAGATAAGGCAAGC
GGGCCTTCGCGCACCGCGCGTTGACAAGCAGTGAGCAAGGTTCGGCCTTTACGACTAGCTTGACAAGGCA
Last2 17:42:32 ~
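In the output above the header line was changed along with the sequence. One way to avoid that, sketched here on a made-up two-line record, is to negate a header address so that y only runs on lines that do not start with ">". This gives the complement only; a reverse complement would additionally need the sequence reversed.

```shell
# Made-up FASTA record; the header deliberately contains the letters A, T, C, G.
fasta=$(printf '>seq1 GATC test\nATGC\nGGCA\n')

# /^>/! restricts y to non-header lines.
complement=$(printf '%s\n' "$fasta" | sed '/^>/!y/ATCG/TAGC/')
echo "$complement"
```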
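Thought questions
(Search if you cannot work them out; not covered in class)
1. How do you convert between upper and lower case?
2. How do you replace the first 4 characters of each line?
3. How do you operate on odd-numbered lines?
Capitalize the first lowercase letter of each word: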
$ cat readme.txt | sed 's/\b[a-z]/\u&/g'
Welcome To Biotrainee() !
This Is Your Personal Account In Our Cloud.
Have A Fun With It.
Please Feel Free To Contact With Me( Email To [email protected] )
(Http://Www.Biotrainee.Com/Thread-1376-1-1.Html)
Last2 17:50:00 ~
Convert all lowercase letters to uppercase:
$ cat readme.txt | sed 's/[a-z]/\u&/g'
WELCOME TO BIOTRAINEE() !
THIS IS YOUR PERSONAL ACCOUNT IN OUR CLOUD.
HAVE A FUN WITH IT.
PLEASE FEEL FREE TO CONTACT WITH ME( EMAIL TO [email protected] )
(HTTP://WWW.BIOTRAINEE.COM/THREAD-1376-1-1.HTML)
Last2 17:50:53 ~
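The \u used above is a GNU sed extension (GNU sed also has \U for a whole match). Possible answers to the three questions that avoid extensions are sketched below on made-up input; the replacement text "####" and the "odd: " prefix are arbitrary choices for illustration.

```shell
# Made-up three-line input, for illustration only.
text=$(printf 'abcdef\nghijkl\nmnopqr\n')

# 1. Case conversion with y and the full alphabet.
upper=$(printf '%s\n' "$text" | sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

# 2. Replace the first 4 characters of each line (four dots match any four characters).
first4=$(printf '%s\n' "$text" | sed 's/^..../####/')

# 3. Act on odd lines only: run the command, then n prints the line and
#    loads the next (even) line, which reaches the end of the script untouched.
odd=$(printf '%s\n' "$text" | sed 's/^/odd: /;n')

echo "$upper"
echo "$first4"
echo "$odd"
```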
The three workhorses -- awk
How awk works
awk 'BEGIN{ commands } pattern{ commands } END{ commands }'
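Step 1: run the statements in the BEGIN{ commands } block.
Step 2: read one line from the file or from standard input (stdin) and run the pattern{ commands } block on it. awk scans the input line by line, repeating this from the first line to the last until everything has been read.
Step 3: when the end of the input is reached, run the END{ commands } block.
The BEGIN block runs before awk reads any line from the input. It is optional; statements such as variable initialization or printing a table header usually go here.
The END block runs after awk has read every line from the input. Summaries such as printing the overall result of the analysis are done here. It is optional as well.
The commands in the pattern block are the most important part, and they are optional too. If no action is given, awk runs { print } by default, that is, it prints each line it reads. Every line awk reads is passed through this block.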
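The three steps can be seen in one small program. This is a sketch on made-up, shortened GTF-like rows (five tab-separated columns only); the header text and the counter name n are arbitrary.

```shell
# Made-up, shortened rows: chromosome, source, feature, start, end.
gtf=$(printf 'chr1\tHAVANA\tgene\t100\t500\nchr1\tHAVANA\texon\t100\t200\nchr1\tHAVANA\texon\t300\t500\n')

summary=$(printf '%s\n' "$gtf" | awk -F '\t' '
    BEGIN { print "feature\tstart\tend" }           # runs once, before any input
    $3 == "exon" { print $3 "\t" $4 "\t" $5; n++ }  # runs for each matching line
    END { print "exons: " n }                       # runs once, after all input
')
echo "$summary"
```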
The Linux which command locates an executable by searching the directories listed in PATH.
Example
Use which to show the absolute path of bash:
$ which bash
Telling tabs from spaces
1. cat -A: ^I marks a tab, spaces are shown as they are, and $ marks the end of the line
$ cat -A Data/example.gtf | head -3
chr1^IENSEMBL^IUTR^I1737^I2090^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IENSEMBL^Iexon^I1737^I2090^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IENSEMBL^Itranscript^I1737^I4275^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
Last2 19:21:28 ~
2. :set list in vim
$ vim Data/example.gtf
:set list
chr1^IENSEMBL^IUTR^I1737^I2090^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IENSEMBL^Iexon^I1737^I2090^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IENSEMBL^Itranscript^I1737^I4275^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IHAVANA^Igene^I1737^I4275^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENSG00000223972"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1"; level 2; havana_gene "OTTHUMG00000000961";$
chr1^IHAVANA^Iexon^I1873^I1920^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000450305"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "RP11-34P13-001"; level 2; havana_gene "OTTHUMG00000000961"; havana_transcript "OTTHUMT00000002844"; ont "PGO:0000005";$
chr1^IHAVANA^Itranscript^I1873^I3533^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000450305"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "RP11-34P13-001"; level 2; havana_gene "OTTHUMG00000000961"; havana_transcript "OTTHUMT00000002844"; ont "PGO:0000005";$
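A third option, offered here only as a possible alternative, is the l command of sed, which prints the line unambiguously: a tab appears as \t and the end of the line as $. The input string is made up.

```shell
# Made-up line: "a", a tab, "b", a space, "c".
visible=$(printf 'a\tb c\n' | sed -n 'l')
echo "$visible"
```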
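In an awk print, items separated by commas are printed as separate columns, in exactly the order written (awk does not reorder them).
The output format can be customized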
$ cat Data/example.gtf | awk '{print$10,$10,$9}' | head -3
"ENSG00000223972"; "ENSG00000223972"; gene_id
"ENSG00000223972"; "ENSG00000223972"; gene_id
"ENSG00000223972"; "ENSG00000223972"; gene_id
(base) Last2 09:30:54 ~
$ cat Data/example.gtf | awk '{print$10,$3,$4,$5}' | head -3
"ENSG00000223972"; UTR 1737 2090
"ENSG00000223972"; exon 1737 2090
"ENSG00000223972"; transcript 1737 4275
(base) Last2 09:33:52 ~
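By default the output columns are separated by a space; a different separator can be specified.
Write "\t" with double quotes. The awk program itself is wrapped in single quotes, so a single quote inside it would end the program early; make sure quotes are paired.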
$ cat Data/example.gtf | awk '{print$10"\t"$3,$4,$5}' | head -3
"ENSG00000223972"; UTR 1737 2090
"ENSG00000223972"; exon 1737 2090
"ENSG00000223972"; transcript 1737 4275
(base) Last2 09:42:07 ~
Fields written with nothing between them are concatenated
$ cat Data/example.gtf | awk '{print$10"\t"$3$4,$5}' | head -3
"ENSG00000223972"; UTR1737 2090
"ENSG00000223972"; exon1737 2090
"ENSG00000223972"; transcript1737 4275
(base) Last2 09:42:33 ~
$ cat Data/example.gtf | awk '{print$10"\t"$3"\t"$4"\t"$5}' | head -3
"ENSG00000223972"; UTR 1737 2090
"ENSG00000223972"; exon 1737 2090
"ENSG00000223972"; transcript 1737 4275
(base) Last2 09:43:13 ~
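Instead of writing "\t" between every pair of fields, the output field separator OFS can be set once, and the commas in print then produce tabs. A sketch on one made-up, shortened row:

```shell
# Made-up, shortened row: chromosome, source, feature, start, end.
row=$(printf 'chr1\tENSEMBL\texon\t1737\t2090\n')

# -F sets the input separator; OFS sets what each comma in print emits.
tabbed=$(printf '%s\n' "$row" | awk -F '\t' 'BEGIN { OFS = "\t" } { print $1, $3, $4, $5 }')
echo "$tabbed"
```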
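-F is a command-line option and goes outside the quotes; built-in variables are set inside the quoted program.
RS: the record separator. The default is \n, so each line is one record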
$ cat readme.txt
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:05:30 ~/Data
$ cat readme.txt | awk 'BEGIN{RS="\n"}{print $0}'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:03:21 ~/Data
$ cat readme.txt | awk 'BEGIN{RS=" "}{print $0}'
Welcome
to
Biotrainee()
!
This
is
your
personal
account
in
our
Cloud.
Have
a
fun
with
it.
Please
feel
free
to
contact
with
me(
email
to
[email protected]
)
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:04:28 ~/Data
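RS tells awk how to recognize records in the input; ORS sets how each record is terminated in the output. With ORS set to "***" the output has no final newline, so the shell prompt follows on the same line.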
$ cat readme.txt | awk 'BEGIN{RS="\n";ORS="***"}{print $0}'
Welcome to Biotrainee() !***This is your personal account in our Cloud.***Have a fun with it.***Please feel free to contact with me( email to [email protected] )***(http://www.biotrainee.com/thread-1376-1-1.html)***Last2 16:14:38 ~/Data
$ cat readme.txt | awk 'BEGIN{RS="\n";ORS="***"} {print $0}'
Welcome to Biotrainee() !***This is your personal account in our Cloud.***Have a fun with it.***Please feel free to contact with me( email to [email protected] )***(http://www.biotrainee.com/thread-1376-1-1.html)***Last2 16:15:34 ~/Data
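With ORS the separator also follows the last record. If that trailing "***" is unwanted, one possible approach is to print the separator before every record except the first and finish with a newline in END. A sketch on made-up input (a, b, c):

```shell
# ORS version: the separator is written after every record, including the last.
with_ors=$(printf 'a\nb\nc\n' | awk 'BEGIN { ORS = "***" } { print }')

# printf version: separator only between records, newline added at the end.
joined=$(printf 'a\nb\nc\n' | awk '{ printf "%s%s", (NR > 1 ? "***" : ""), $0 } END { print "" }')

echo "$with_ors"
echo "$joined"
```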
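awk conditionals and loops:
if: conditional
awk ' { if (condition) {yes} else {no} } '
for: loop
awk ' { for (loop condition) {loop body} } '
Conditions go in (), actions go in {}
The example below shows that awk scans the file line by line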
$ cat Data/example.gtf | awk '{for(i=1;i<4;i++){print $i}}' | less -S
chr1
ENSEMBL
UTR
chr1
ENSEMBL
exon
chr1
ENSEMBL
transcript
chr1
HAVANA
gene
chr1
HAVANA
exon
chr1
HAVANA
transcript
chr1
HAVANA
exon
chr1
HAVANA
exon
chr1
ENSEMBL
UTR
chr1
ENSEMBL
exon
chr1
HAVANA
exon
chr1
HAVANA
exon
chr1
ENSEMBL
UTR
chr1
ENSEMBL
exon
chr1
HAVANA
exon
chr1
ENSEMBL
start_codon
chr1
ENSEMBL
CDS
chr1
ENSEMBL
UTR
$ cat Data/example.gtf | awk '{for(i=1;i<4;i++){print $i}}' | paste - - - | less -S
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 ENSEMBL transcript
chr1 HAVANA gene
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL start_codon
chr1 ENSEMBL CDS
chr1 ENSEMBL UTR
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 ENSEMBL transcript
chr1 ENSEMBL transcript
chr1 HAVANA gene
chr1 ENSEMBL stop_codon
chr1 ENSEMBL UTR
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 ENSEMBL stop_codon
chr1 ENSEMBL CDS
chr1 ENSEMBL CDS
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL CDS
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL UTR
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL stop_codon
chr1 ENSEMBL CDS
chr1 ENSEMBL UTR
chr1 ENSEMBL stop_codon
chr1 ENSEMBL exon
chr1 ENSEMBL transcript
chr1 ENSEMBL CDS
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL CDS
chr1 ENSEMBL CDS
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL CDS
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL UTR
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL start_codon
chr1 ENSEMBL UTR
chr1 ENSEMBL CDS
chr1 ENSEMBL UTR
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 ENSEMBL start_codon
chr1 ENSEMBL UTR
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 ENSEMBL UTR
chr1 ENSEMBL UTR
chr1 ENSEMBL exon
chr1 ENSEMBL exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA gene
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 ENSEMBL exon
chr1 ENSEMBL gene
chr1 ENSEMBL transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA UTR
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA stop_codon
chr1 HAVANA CDS
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA CDS
chr1 HAVANA exon
chr1 HAVANA CDS
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA start_codon
chr1 HAVANA UTR
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL gene
chr1 ENSEMBL transcript
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL UTR
chr1 ENSEMBL stop_codon
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA UTR
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA start_codon
chr1 HAVANA CDS
chr1 HAVANA stop_codon
chr1 HAVANA UTR
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA gene
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA gene
chr1 ENSEMBL UTR
chr1 ENSEMBL stop_codon
chr1 ENSEMBL exon
chr1 ENSEMBL transcript
chr1 ENSEMBL CDS
chr1 HAVANA exon
chr1 ENSEMBL CDS
chr1 ENSEMBL exon
chr1 ENSEMBL start_codon
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 ENSEMBL exon
chr1 ENSEMBL gene
chr1 ENSEMBL transcript
chr1 ENSEMBL exon
chr1 ENSEMBL gene
chr1 ENSEMBL transcript
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA gene
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA transcript
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr1 HAVANA exon
chr2 HAVANA UTR
chr2 HAVANA exon
chr2 HAVANA transcript
chr2 HAVANA gene
chr2 HAVANA exon
chr2 HAVANA transcript
chr2 HAVANA exon
chr2 HAVANA transcript
chr2 HAVANA stop_codon
chr2 HAVANA CDS
chr2 HAVANA exon
chr2 HAVANA CDS
chr2 HAVANA exon
chr2 HAVANA start_codon
chr2 HAVANA exon
chr2 HAVANA exon
chr2 HAVANA gene
chr2 HAVANA transcript
chr2 HAVANA exon
(base) Last2 10:27:35 ~
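The transcripts above exercise the for loop; the if/else form can be sketched in the same way. The three-column input and the 100-base threshold are made up for illustration.

```shell
# Made-up rows: feature, start, end.
labels=$(printf 'exon\t100\t150\nexon\t200\t900\n' | awk -F '\t' '{
    len = $3 - $2 + 1    # inclusive length
    if (len > 100) { print $1, len, "long" } else { print $1, len, "short" }
}')
echo "$labels"
```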
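awk arithmetic:
+ (add), - (subtract), * (multiply), ^ (power)
/ (divide), ** (power, same as ^ in gawk), % (remainder)
int(x): the integer part of x, truncated toward zero
log(x): the natural logarithm of x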
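A quick sketch of these operators in a BEGIN block, which needs no input file. The second command is a related note: GTF coordinates are 1-based and inclusive, so a feature's length in bases is end - start + 1 (the sample row is made up and shortened to five columns).

```shell
# Operators and functions from the list above.
ops=$(awk 'BEGIN { print 7 + 2, 7 - 2, 7 * 2, 7 / 2, 7 % 3, 2 ^ 10, int(-3.7), log(1) }')
echo "$ops"

# Inclusive feature length: end - start + 1.
len=$(printf 'chr1\tENSEMBL\texon\t1737\t2090\n' | awk -F '\t' '{ print $5 - $4 + 1 }')
echo "$len"
```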
$ cat Data/example.gtf | awk '/exon/{print $10"\t"$3"\t"$5-$4}' | less -S
"ENSG00000223972"; exon 353
"ENSG00000223972"; exon 47
"ENSG00000223972"; exon 48
"ENSG00000223972"; exon 84
"ENSG00000223972"; exon 108
"ENSG00000223972"; exon 77
"ENSG00000223972"; exon 153
"ENSG00000223972"; exon 1191
"ENSG00000223972"; exon 217
"ENSG00000227232"; exon 466
"ENSG00000227232"; exon 466
"ENSG00000227232"; exon 97
"ENSG00000227232"; exon 68
"ENSG00000227232"; exon 68
"ENSG00000227232"; exon 33
"ENSG00000227232"; exon 105
"ENSG00000227232"; exon 151
"ENSG00000227232"; exon 151
"ENSG00000227232"; exon 43
"ENSG00000227232"; exon 158
"ENSG00000227232"; exon 158
"ENSG00000227232"; exon 158
"ENSG00000227232"; exon 201
"ENSG00000227232"; exon 197
"ENSG00000227232"; exon 197
"ENSG00000227232"; exon 131
"ENSG00000227232"; exon 135
"ENSG00000227232"; exon 135
"ENSG00000227232"; exon 191
"ENSG00000227232"; exon 140
"ENSG00000227232"; exon 136
"ENSG00000227232"; exon 136
"ENSG00000227232"; exon 146
"ENSG00000227232"; exon 146
"ENSG00000227232"; exon 146
"ENSG00000227232"; exon 146
"ENSG00000227232"; exon 98
"ENSG00000227232"; exon 98
"ENSG00000227232"; exon 98
"ENSG00000227232"; exon 162
"ENSG00000227232"; exon 153
"ENSG00000227232"; exon 153
"ENSG00000227232"; exon 153
"ENSG00000227232"; exon 153
"ENSG00000227232"; exon 22
"ENSG00000227232"; exon 49
"ENSG00000227232"; exon 49
"ENSG00000227232"; exon 36
"ENSG00000243485"; exon 485
"ENSG00000243485"; exon 400
"ENSG00000221311"; exon 137
"ENSG00000243485"; exon 103
"ENSG00000243485"; exon 121
"ENSG00000243485"; exon 133
"ENSG00000237613"; exon 620
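Exercise 3:
1. Pick any 4 of the commands above and type them yourself
2. Use head to view the example.gtf file
3. Pipe the result of step 2 to awk and print the lines containing ENSEMBL
4. Using what you have learned, produce the following output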
$ head example.gtf | awk '/ENSEMBL/{print $0}'
chr1 ENSEMBL UTR 1737 2090 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL exon 1737 2090 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL transcript 1737 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL UTR 2476 2584 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL exon 2476 2584 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
Last2 14:25:15 ~/Data
$ head example.gtf | awk -F '\t' '/ENSEMBL/{print $9}' #the attributes are in column 9, so narrow down first
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
Last2 14:29:07 ~/Data
$ head example.gtf | awk -F '\t' '/ENSEMBL/{print $9}' | awk -F '"' '{print $2,$4,$6}' #use " as the field separator
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
Last2 14:32:24 ~/Data
sed can also do this by substitution. Replacing with an empty string is effectively a deletion.
$ head example.gtf | awk -F '\t' '/ENSEMBL/{print $9}' | awk '{print $2,$4,$6}' |sed -e '1,$ s/"//g' -e '1,$ s/;//g'
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
Last2 14:52:50 ~/Data
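The same result can probably be reached in a single awk call, by stripping the quotes and semicolons from column 9 with gsub and then splitting it on whitespace. This is a sketch on one made-up, shortened record; the array name attr is arbitrary, and it assumes the first three attributes are gene_id, transcript_id and gene_type in that order.

```shell
# Made-up record with nine tab-separated columns and three attributes.
record=$(printf 'chr1\tENSEMBL\tUTR\t1737\t2090\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding";\n')

ids=$(printf '%s\n' "$record" | awk -F '\t' '$2 == "ENSEMBL" {
    gsub(/[";]/, "", $9)     # drop quotes and semicolons from the attribute column
    split($9, attr, " ")     # attr[1] is a key, attr[2] its value, and so on
    print attr[2], attr[4], attr[6]
}')
echo "$ids"
```

Matching on $2 == "ENSEMBL" instead of /ENSEMBL/ restricts the test to the source column, so a line that merely mentions ENSEMBL elsewhere is not selected.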
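Summary
grep has a sharp eye but idle hands: it only searches text and cannot act on what it finds
sed is a stream editor: add, delete, modify, look up, with addresses
awk seems to exist just to print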