Linux Day 3

3. vim, the text editor on Linux

Command mode:

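Undo (the "regret pill"):
• u: undo the previous action. It can be repeated, undoing changes from the most recent backwards.
• ctrl+r: redo the last undone action, i.e. cancel an undo. ctrl+r only acts on changes that were undone with u.
ctrl+r confused me at first. Redo is the undo of an undo: if you undo too much, press CTRL-R (redo) to reverse the preceding undo.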

Last-line mode:

After an unexpected exit without saving,
opening the file again shows:

$ vim readme.txt
E325: ATTENTION
Found a swap file by the name ".readme.txt.swp"
          owned by: Last2   dated: Fri Jan 22 10:56:04 2021
         file name: ~Last2/readme.txt
          modified: YES
         user name: Last2   host name: VM-0-17-ubuntu
        process ID: 21878
While opening file "readme.txt"
             dated: Sun Jan 17 16:51:09 2021

(1) Another program may be editing the same file.  If this is the case,
    be careful not to end up with two different instances of the same
    file when making changes.  Quit, or continue with caution.
(2) An edit session for this file crashed.
    If this is the case, use ":recover" or "vim -r readme.txt"
    to recover the changes (see ":help recovery").
    If you did this already, delete the swap file ".readme.txt.swp"
    to avoid this message.

Swap file ".readme.txt.swp" already exists!
[O]pen Read-Only, (E)dit anyway, (R)ecover, (D)elete it, (Q)uit, (A)bort: 

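While editing, vim keeps a temporary swap file (here .readme.txt.swp) that records unsaved changes. On a normal save and quit, the changes are written to the original file and the swap file disappears. After an unexpected exit, the swap file is left behind. Press D to delete it and edit the original file normally; this discards the unsaved changes, so choose R (recover) first if you want them back.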

Last-line mode:

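Searching (no need to type : first):
• Type /KEYWORD to search downward; ?KEYWORD searches upward
• Press n to jump to the next match in the same direction
• Press N to jump to the next match in the opposite direction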

Typing /www jumps to the URL:

Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html) # the cursor jumps to the URL
~
/www
search hit BOTTOM, continuing at TOP                                                        
:?www
search hit TOP, continuing at BOTTOM 

/gene searches for the keyword gene:

$ vim Data/Homo_sapiens.GRCh38.102.chromosome.Y.gff3

   1 ##gff-version 3
   2 ##sequence-region   Y 2781480 56887902
   3 #!genome-build  GRCh38.p13
   4 #!genome-version GRCh38
   5 #!genome-date 2013-12
   6 #!genome-build-accession NCBI:GCA_000001405.28
   7 #!genebuild-last-updated 2020-09
   8 Y       GRCh38  chromosome      2781480 56887902        .       .       .       ID=chromosome:Y;Alias=CM000686.2,chrY,NC_000024.10
   9 ###
  10 Y       ensembl ncRNA_gene      2784749 2784853 .       +       .       ID=gene:ENSG00000251841;Name=RNU6-1334P;biotype=snRNA;description=RNA%2C U6 small nuclear 1334%2C pseudogene [Source:HGNC Symbol%3BAcc:HGNC:48297];gene_id=ENSG0000     0251841;logic_name=ncrna_homo_sapiens;version=1
  11 Y       ensembl snRNA   2784749 2784853 .       +       .       ID=transcript:ENST00000516032;Parent=gene:ENSG00000251841;Name=RNU6-1334P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000516032;transcript_support_level=NA;version=1
  12 Y       ensembl exon    2784749 2784853 .       +       .       Parent=transcript:ENST00000516032;Name=ENSE00002088309;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002088309;rank=1;version=1
  13 ###
  14 Y       ensembl_havana  gene    2786855 2787682 .       -       .       ID=gene:ENSG00000184895;Name=SRY;biotype=protein_coding;description=sex determining region Y [Source:HGNC Symbol%3BAcc:HGNC:11311];gene_id=ENSG00000184895;logic_na     me=ensembl_havana_gene_homo_sapiens;version=8
  15 Y       ensembl_havana  mRNA    2786855 2787682 .       -       .       ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_id=ENST00000383070;transcript     _support_level=NA (assigned to previous version 1);version=2
  16 Y       ensembl_havana  three_prime_UTR 2786855 2786988 .       -       .       Parent=transcript:ENST00000383070
  17 Y       ensembl_havana  exon    2786855 2787682 .       -       .       Parent=transcript:ENST00000383070;Name=ENSE00001494622;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001494622;rank=1;version=2
  18 Y       ensembl_havana  CDS     2786989 2787603 .       -       0       ID=CDS:ENSP00000372547;Parent=transcript:ENST00000383070;protein_id=ENSP00000372547
  19 Y       ensembl_havana  five_prime_UTR  2787604 2787682 .       -       .       Parent=transcript:ENST00000383070
  20 ###
  21 Y       .       biological_region       2789532 2789711 0.997   -       .       external_name=rank %3D 1;logic_name=firstef
  22 Y       havana  pseudogene      2789827 2790328 .       +       .       ID=gene:ENSG00000237659;Name=RNASEH2CP1;biotype=processed_pseudogene;description=ribonuclease H2 subunit C pseudogene 1 [Source:HGNC Symbol%3BAcc:HGNC:24117];gene_     id=ENSG00000237659;logic_name=havana_homo_sapiens;version=1
  23 Y       havana  pseudogenic_transcript  2789827 2790328 .       +       .       ID=transcript:ENST00000454281;Parent=gene:ENSG00000237659;Name=RNASEH2CP1-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000454281;transcri     pt_support_level=NA;version=1
  24 Y       havana  exon    2789827 2790328 .       +       .       Parent=transcript:ENST00000454281;Name=ENSE00001772499;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001772499;rank=1;version=1
  25 ###
  26 Y       .       biological_region       2790002 2790063 0.657   +       .       external_name=rank %3D 1;logic_name=firstef
  27 Y       havana  pseudogene      2827982 2828218 .       +       .       ID=gene:ENSG00000232195;Name=TOMM22P2;biotype=processed_pseudogene;description=TOMM22 pseudogene 2 [Source:HGNC Symbol%3BAcc:HGNC:38737];gene_id=ENSG00000232195;lo     gic_name=havana_homo_sapiens;version=1
  28 Y       havana  pseudogenic_transcript  2827982 2828218 .       +       .       ID=transcript:ENST00000430735;Parent=gene:ENSG00000232195;Name=TOMM22P2-201;biotype=processed_pseudogene;tag=basic;transcript_id=ENST00000430735;transcript     _support_level=NA;version=1
  29 Y       havana  exon    2827982 2828218 .       +       .       Parent=transcript:ENST00000430735;Name=ENSE00001614266;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001614266;rank=1;version=1
  30 ###
  31 Y       havana  ncRNA_gene      2828192 2840851 .       -       .       ID=gene:ENSG00000286130;Name=AC006040.1;biotype=lncRNA;description=novel transcript;gene_id=ENSG00000286130;logic_name=havana_homo_sapiens;version=1
  32 Y       havana  lnc_RNA 2828192 2840851 .       -       .       ID=transcript:ENST00000651710;Parent=gene:ENSG00000286130;Name=AC006040.1-201;biotype=lncRNA;tag=basic;transcript_id=ENST00000651710;version=1
  33 Y       havana  exon    2828192 2828735 .       -       .       Parent=transcript:ENST00000651710;Name=ENSE00003843322;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003843322;rank=3;version=1
  34 Y       havana  exon    2829526 2829751 .       -       .       Parent=transcript:ENST00000651710;Name=ENSE00003846102;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003846102;rank=2;version=1
  35 Y       havana  exon    2840471 2840851 .       -       .       Parent=transcript:ENST00000651710;Name=ENSE00003844499;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003844499;rank=1;version=1
  36 ###
  37 Y       ensembl_havana  gene    2841602 2932000 .       +       .       ID=gene:ENSG00000129824;Name=RPS4Y1;biotype=protein_coding;description=ribosomal protein S4 Y-linked 1 [Source:HGNC Symbol%3BAcc:HGNC:10425];gene_id=ENSG0000012982     4;logic_name=ensembl_havana_gene_homo_sapiens;version=16
  38 Y       ensembl_havana  mRNA    2841602 2867268 .       +       .       ID=transcript:ENST00000250784;Parent=gene:ENSG00000129824;Name=RPS4Y1-201;biotype=protein_coding;ccdsid=CCDS14773.1;tag=basic;transcript_id=ENST00000250784;transcr     ipt_support_level=1 (assigned to previous version 12);version=13
  39 Y       ensembl_havana  five_prime_UTR  2841602 2841624 .       +       .       Parent=transcript:ENST00000250784
  40 Y       ensembl_havana  exon    2841602 2841627 .       +       .       Parent=transcript:ENST00000250784;Name=ENSE00002490412;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=ENSE00002490412;rank=1;version=2
  41 Y       ensembl_havana  CDS     2841625 2841627 .       +       0       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
  42 Y       ensembl_havana  exon    2842165 2842242 .       +       .       Parent=transcript:ENST00000250784;Name=ENSE00001709586;constitutive=0;ensembl_end_phase=0;ensembl_phase=0;exon_id=ENSE00001709586;rank=2;version=1
  43 Y       ensembl_havana  CDS     2842165 2842242 .       +       0       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
  44 Y       ensembl_havana  exon    2844077 2844257 .       +       .       Parent=transcript:ENST00000250784;Name=ENSE00001738202;constitutive=0;ensembl_end_phase=1;ensembl_phase=0;exon_id=ENSE00001738202;rank=3;version=1
  45 Y       ensembl_havana  CDS     2844077 2844257 .       +       0       ID=CDS:ENSP00000250784;Parent=transcript:ENST00000250784;protein_id=ENSP00000250784
  46 Y       ensembl_havana  exon    2845646 2845743 .       +       .       Parent=transcript:ENST00000250784;Name=ENSE00001602849;constitutive=0;ensembl_end_phase=0;ensembl_phase=1;exon_id=ENSE00001602849;rank=4;version=1
/gene                                                        

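Substitution
• :s/old/new/g replaces every occurrence on a line
• :s/old/new/ replaces only the first occurrence on a line
• s/old/new/2 replaces only the 2nd occurrence on each line (this numeric flag is sed syntax; in vim, a number after the flags is a line count, not an occurrence index)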

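The syntax is :[addr]s/source string/target string/[option]
The global substitution command is :%s/source string/target string/g
[addr] is the range of lines to search; when omitted, only the current line is used.
For example, "1,20" means lines 1 through 20;
"%" means the whole file, the same as "1,$"; ".,$" means from the current line to the end of the file;
s is the substitute command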

:10,20s/mRNA/MRNA/g  # replace every occurrence on lines 10 to 20
:%s/mRNA/MRNA/g  # replace every occurrence in the whole file

The three workhorses -- sed

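sed is a stream editor, generally used to insert, delete, modify, and look up text. It edits the stream and writes the result to standard output; the original file is not changed, so redirect the output if you want to keep it.
Usage: sed [-options] 'script' file(s)
Common options:
-n : suppress the default output and print only the lines that sed is explicitly told to print (commonly used)
-e : give the sed script on the command line; repeat -e to run several commands
-i : modify the file in place instead of printing to standard output.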

$ cat readme.txt | sed '1 a Hi'
Welcome to Biotrainee() !
Hi
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 14:01:29 ~
$ cat readme.txt  # the original file is unchanged
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 14:02:33 ~
$ cat readme.txt | sed '1 a Hi' > tmp  # redirect the output to save it
Last2 14:03:03 ~
$ cat tmp
Welcome to Biotrainee() !
Hi
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 14:03:07 ~
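
As an alternative to redirecting, the -i option listed above can rewrite the file directly. The sketch below assumes GNU sed and uses a hypothetical throwaway file (demo.txt in a temporary directory) rather than the real readme.txt; the .bak suffix asks sed to keep a copy of the original.

```shell
# Work in a temporary directory so no real file is touched.
tmpdir=$(mktemp -d)
printf 'line one\nline two\n' > "$tmpdir/demo.txt"   # hypothetical sample file

# -i.bak edits demo.txt in place and saves the original as demo.txt.bak.
# The one-line form of the a command is a GNU sed convenience.
sed -i.bak '1 a Hi' "$tmpdir/demo.txt"

edited=$(cat "$tmpdir/demo.txt")
backup=$(cat "$tmpdir/demo.txt.bak")
printf 'edited:\n%s\nbackup:\n%s\n' "$edited" "$backup"

rm -rf "$tmpdir"
```

Keeping a backup suffix is a cautious habit: a mistake in the script cannot be undone once -i has overwritten the file.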

Replace lines 2 to 4 with **********

$ cat readme.txt | sed '2,4c **********'
Welcome to Biotrainee() !
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:40:55 ~
$ cat readme.txt | sed -e '2c **********' -e '3c **********' -e '4c **********'
Welcome to Biotrainee() !
**********
**********
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:41:56 ~
$ cat readme.txt | sed '2,4c **********\n**********\n**********' # embed newlines in the replacement text
Welcome to Biotrainee() !
**********
**********
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:46:16 ~
$ cat readme.txt | sed -e '2,4i**********' -e '2,4d'
Welcome to Biotrainee() !
**********
**********
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:50:05 ~
$ cat readme.txt | sed -e '2,4a**********' -e '2,4d'
Welcome to Biotrainee() !
**********
**********
**********
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 15:50:16 ~
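
The first attempt above printed only one row of stars because, with a range address, GNU sed's c command replaces the whole range with a single copy of the text. A possible shorter way to rewrite each line of the range separately is the s command with the same range. The five-line sample here is hypothetical and GNU sed is assumed for the one-line c form.

```shell
# Hypothetical five-line sample standing in for readme.txt.
sample=$(printf 'a\nb\nc\nd\ne')

# c on a range: lines 2-4 are replaced by ONE copy of the text.
once=$(printf '%s\n' "$sample" | sed '2,4c **********')

# s on a range: each of lines 2-4 is rewritten individually.
each=$(printf '%s\n' "$sample" | sed '2,4s/.*/**********/')

printf '%s\n---\n%s\n' "$once" "$each"
```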

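s: substitute. The format is 's/pattern/new/[flags]', which replaces pattern with new. By default only the first match on each line is replaced; flags change that.
With no flag, the first match on each line is replaced; 2 replaces the second match on each line; g replaces every match.
With no address, sed applies the command to every line, unlike vim, where :s without a range only affects the current line.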

$ cat readme.txt | sed 's/ee/EE/'
Welcome to BiotrainEE() !
This is your personal account in our Cloud.
Have a fun with it.
Please fEEl free to contact with me( email to [email protected] )
(http://www.biotrainEE.com/thread-1376-1-1.html)
Last2 15:57:20 ~
$ cat readme.txt | sed 's/ee/EE/g'
Welcome to BiotrainEE() !
This is your personal account in our Cloud.
Have a fun with it.
Please fEEl frEE to contact with me( email to [email protected] )
(http://www.biotrainEE.com/thread-1376-1-1.html)
Last2 15:57:53 ~
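
The transcripts above cover the empty flag and g; the numeric flag can be sketched the same way. The input line is a shortened, hypothetical variant of the readme.txt sentence.

```shell
line='Please feel free to contact me'

first=$(printf '%s\n' "$line" | sed 's/ee/EE/')     # first match on the line
second=$(printf '%s\n' "$line" | sed 's/ee/EE/2')   # second match on the line
all=$(printf '%s\n' "$line" | sed 's/ee/EE/g')      # every match on the line

printf '%s\n%s\n%s\n' "$first" "$second" "$all"
```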

By default sed prints every line to standard output; with -n, only the lines explicitly printed (here by p) are shown.

$ cat readme.txt | sed '/ee/p' # matching lines are printed twice
Welcome to Biotrainee() !
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:09:56 ~
$ cat readme.txt | sed -n '/ee/p' # print only the matching lines
Welcome to Biotrainee() !
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:10:12 ~

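y: transliterate, mapping characters one to one. The format is 'y/inchars/outchars/', and inchars and outchars must be the same length.
The y command is global: it automatically converts every instance of the listed characters found in each line.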

$ cat readme.txt | sed 'y/abcd/ABCD/'
WelCome to BiotrAinee() !
This is your personAl ACCount in our ClouD.
HAve A fun with it.
PleAse feel free to ContACt with me( emAil to [email protected] )
(http://www.BiotrAinee.Com/threAD-1376-1-1.html)
Last2 17:34:25 ~
$ cat readme.txt | sed 'y/abc/AB/'  # different lengths, so sed reports an error
sed: -e expression #1, char 9: strings for `y' command are different lengths
Last2 17:34:48 ~

It is commonly used for complementary base pairing:

$ cat Data/example.fa | head
>gi|556503834|ref|NC_000913.3|:c3317526-3316039 Escherichia coli str. K-12 substr. MG1655, complete genome
ATGAACAAAGAAATTTTGGCTGTAGTTGAAGCCGTATCCAATGAAAAGGCGCTACCTCGCGAGAAGATTT
TCGAAGCATTGGAAAGCGCGCTGGCGACAGCAACAAAGAAAAAATATGAACAAGAGATCGACGTCCGCGT
ACAGATCGATCGCAAAAGCGGTGATTTTGACACTTTCCGTCGCTGGTTAGTTGTTGATGAAGTCACCCAG
CCGACCAAGGAAATCACCCTTGAAGCCGCACGTTATGAAGATGAAAGCCTGAACCTGGGCGATTACGTTG
AAGATCAGATTGAGTCTGTTACCTTTGACCGTATCACTACCCAGACGGCAAAACAGGTTATCGTGCAGAA
AGTGCGTGAAGCCGAACGTGCGATGGTGGTTGATCAGTTCCGTGAACACGAAGGTGAAATCATCACCGGC
GTGGTGAAAAAAGTAAACCGCGACAACATCTCTCTGGATCTGGGCAACAACGCTGAAGCCGTGATCCTGC
GCGAAGATATGCTGCCGCGTGAAAACTTCCGCCCTGGCGACCGCGTTCGTGGCGTGCTCTATTCCGTTCG
CCCGGAAGCGCGTGGCGCGCAACTGTTCGTCACTCGTTCCAAGCCGGAAATGCTGATCGAACTGTTCCGT
Last2 17:41:35 ~
$ cat Data/example.fa | head | sed 'y/ATCG/TAGC/'
>gi|556503834|ref|NG_000913.3|:c3317526-3316039 Escherichia coli str. K-12 substr. MC1655, complete genome
TACTTGTTTCTTTAAAACCGACATCAACTTCGGCATAGGTTACTTTTCCGCGATGGAGCGCTCTTCTAAA
AGCTTCGTAACCTTTCGCGCGACCGCTGTCGTTGTTTCTTTTTTATACTTGTTCTCTAGCTGCAGGCGCA
TGTCTAGCTAGCGTTTTCGCCACTAAAACTGTGAAAGGCAGCGACCAATCAACAACTACTTCAGTGGGTC
GGCTGGTTCCTTTAGTGGGAACTTCGGCGTGCAATACTTCTACTTTCGGACTTGGACCCGCTAATGCAAC
TTCTAGTCTAACTCAGACAATGGAAACTGGCATAGTGATGGGTCTGCCGTTTTGTCCAATAGCACGTCTT
TCACGCACTTCGGCTTGCACGCTACCACCAACTAGTCAAGGCACTTGTGCTTCCACTTTAGTAGTGGCCG
CACCACTTTTTTCATTTGGCGCTGTTGTAGAGAGACCTAGACCCGTTGTTGCGACTTCGGCACTAGGACG
CGCTTCTATACGACGGCGCACTTTTGAAGGCGGGACCGCTGGCGCAAGCACCGCACGAGATAAGGCAAGC
GGGCCTTCGCGCACCGCGCGTTGACAAGCAGTGAGCAAGGTTCGGCCTTTACGACTAGCTTGACAAGGCA
Last2 17:42:32 ~
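
Note that in the output above the FASTA header was transliterated too (NC_000913.3 became NG_000913.3 and MG1655 became MC1655), because y runs on every line. One way to protect the header, sketched here on a hypothetical two-line record, is to negate an address so that y only runs on lines that do not start with >.

```shell
# Hypothetical two-line FASTA record.
fasta=$(printf '>seq1 NC_000913.3 MG1655\nATGAAC')

# /^>/! selects every line that does NOT start with ">".
comp=$(printf '%s\n' "$fasta" | sed '/^>/!y/ATCG/TAGC/')
printf '%s\n' "$comp"
```

This gives the complement only; producing the reverse complement would additionally require reversing each sequence.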

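Questions to think about

(Search online if you get stuck; these are not covered in class.)
1. How do you convert between uppercase and lowercase?
2. How do you replace the first 4 characters of every line?
3. How do you operate only on odd-numbered lines?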

Capitalize the first lowercase letter of every word:

$ cat readme.txt | sed 's/\b[a-z]/\u&/g'
Welcome To Biotrainee() !
This Is Your Personal Account In Our Cloud.
Have A Fun With It.
Please Feel Free To Contact With Me( Email To [email protected] )
(Http://Www.Biotrainee.Com/Thread-1376-1-1.Html)
Last2 17:50:00 ~

Convert every lowercase letter to uppercase:

$ cat readme.txt | sed 's/[a-z]/\u&/g'
WELCOME TO BIOTRAINEE() !
THIS IS YOUR PERSONAL ACCOUNT IN OUR CLOUD.
HAVE A FUN WITH IT.
PLEASE FEEL FREE TO CONTACT WITH ME( EMAIL TO [email protected] )
(HTTP://WWW.BIOTRAINEE.COM/THREAD-1376-1-1.HTML)
Last2 17:50:53 ~
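
For the remaining questions, here is one possible set of answers, offered as a sketch rather than the course's official solution. The three-line sample and the XXXX replacement are made up; the 1~2 address is a GNU sed extension, and tr is shown as a portable alternative for case conversion.

```shell
sample=$(printf 'abcdefgh\nijklmnop\nqrstuvwx')

# Question 1 (alternative): tr converts case without GNU sed extensions.
upper=$(printf '%s\n' "$sample" | tr 'a-z' 'A-Z')

# Question 2: ^.... matches the first four characters of each line.
first4=$(printf '%s\n' "$sample" | sed 's/^..../XXXX/')

# Question 3: 1~2 (GNU sed) means "line 1, then every 2nd line after it".
odd=$(printf '%s\n' "$sample" | sed -n '1~2p')

printf '%s\n---\n%s\n---\n%s\n' "$upper" "$first4" "$odd"
```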

The three workhorses -- awk

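How awk works
awk 'BEGIN{ commands } pattern{ commands } END{ commands }'
Step 1: run the statements in the BEGIN{ commands } block.
Step 2: read one line from the file or standard input (stdin) and run the pattern{ commands } block on it. awk scans the input line by line, repeating this from the first line to the last until everything has been read.
Step 3: on reaching the end of the input stream, run the END{ commands } block.
The BEGIN block runs before awk reads any line from the input stream. It is optional; statements such as variable initialization or printing a table header usually go here.

The END block runs after awk has read every line from the input stream. Summaries, such as printing the result of analyzing all lines, are done here. It is also optional.

The commands in the pattern block are the most important part, and this block is optional too. If no pattern block is given, { print } runs by default, printing every line read; awk runs this block for each line it reads.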
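
A minimal sketch of the three blocks working together, on a hypothetical three-column input (name, feature type, length): BEGIN prints a header, the pattern block handles only exon lines, and END prints a total.

```shell
# Hypothetical input: name, feature type, length.
data=$(printf 'g1 exon 100\ng2 gene 500\ng3 exon 250')

report=$(printf '%s\n' "$data" | awk '
BEGIN        { print "name length"; total = 0 }   # runs once, before any input
$2 == "exon" { print $1, $3; total += $3 }        # runs for each matching line
END          { print "total", total }             # runs once, after the last line
')
printf '%s\n' "$report"
```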

The Linux which command locates an executable by searching the directories listed in PATH.

Example
Use which to show the absolute path of the bash command:

$ which bash

Telling tabs from spaces

1. cat -A: ^I marks a tab, spaces are shown unchanged, and $ marks the end of each line (the newline)

$ cat -A Data/example.gtf | head -3
chr1^IENSEMBL^IUTR^I1737^I2090^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IENSEMBL^Iexon^I1737^I2090^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IENSEMBL^Itranscript^I1737^I4275^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
Last2 19:21:28 ~

2. :set list in vim

$ vim Data/example.gtf
:set list 
chr1^IENSEMBL^IUTR^I1737^I2090^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$ 
chr1^IENSEMBL^Iexon^I1737^I2090^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IENSEMBL^Itranscript^I1737^I4275^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";$
chr1^IHAVANA^Igene^I1737^I4275^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENSG00000223972"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1"; level 2; havana_gene "OTTHUMG00000000961";$ 
chr1^IHAVANA^Iexon^I1873^I1920^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000450305"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "RP11-34P13-001"; level 2; havana_gene "OTTHUMG00000000961"; havana_transcript "OTTHUMT00000002844"; ont "PGO:0000005";$
chr1^IHAVANA^Itranscript^I1873^I3533^I.^I+^I.^Igene_id "ENSG00000223972"; transcript_id "ENST00000450305"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "RP11-34P13-001"; level 2; havana_gene "OTTHUMG00000000961"; havana_transcript "OTTHUMT00000002844"; ont "PGO:0000005";$

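To print several fields in awk, separate them with commas. Each field is independent and is printed exactly in the order written (awk does not reorder them).

The output format can be customized.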

$ cat Data/example.gtf | awk '{print$10,$10,$9}' | head -3
"ENSG00000223972"; "ENSG00000223972"; gene_id
"ENSG00000223972"; "ENSG00000223972"; gene_id
"ENSG00000223972"; "ENSG00000223972"; gene_id
(base) Last2 09:30:54 ~
$ cat Data/example.gtf | awk '{print$10,$3,$4,$5}' | head -3
"ENSG00000223972"; UTR 1737 2090
"ENSG00000223972"; exon 1737 2090
"ENSG00000223972"; transcript 1737 4275
(base) Last2 09:33:52 ~

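By default the printed fields are separated by a space; a different separator can be specified.
Write "\t" in double quotes: a single quote would end the single-quoted awk program early, because the shell matches quotes in pairs.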

$ cat Data/example.gtf | awk '{print$10"\t"$3,$4,$5}' | head -3
"ENSG00000223972";  UTR 1737 2090
"ENSG00000223972";  exon 1737 2090
"ENSG00000223972";  transcript 1737 4275
(base) Last2 09:42:07 ~

With nothing between two fields, they are joined together:

$ cat Data/example.gtf | awk '{print$10"\t"$3$4,$5}' | head -3
"ENSG00000223972";  UTR1737 2090
"ENSG00000223972";  exon1737 2090
"ENSG00000223972";  transcript1737 4275
(base) Last2 09:42:33 ~
$ cat Data/example.gtf | awk '{print$10"\t"$3"\t"$4"\t"$5}' | head -3
"ENSG00000223972";  UTR 1737    2090
"ENSG00000223972";  exon    1737    2090
"ENSG00000223972";  transcript  1737    4275
(base) Last2 09:43:13 ~
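
Rather than typing "\t" between every pair of fields, the separator that print uses for commas can be set once through the built-in variable OFS. This is a minimal sketch; the two sample records are made up to resemble the GTF lines above, not taken from Data/example.gtf.

```shell
# Two made-up tab-separated records modeled on the GTF lines above.
gtf=$(printf 'chr1\tENSEMBL\tUTR\t1737\t2090\nchr1\tENSEMBL\texon\t1737\t2090\n')

# OFS is what print writes wherever the program has a comma.
ofs_result=$(printf '%s\n' "$gtf" | awk 'BEGIN{OFS="\t"}{print $3,$4,$5}')
printf '%s\n' "$ofs_result"
```

Each output line should hold the three fields joined by tabs, the same as writing $3"\t"$4"\t"$5 by hand.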

-F is a command-line option and goes outside the quotes; built-in variables are set inside the ''.
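
The two placements can be compared side by side. This sketch assumes a single made-up record with three tab-separated columns; the variable names are only for illustration.

```shell
# One made-up record: three tab-separated columns, the last one containing spaces.
line=$(printf 'chr1\tHAVANA\tgene_id "ENSG00000223972";')

# Separator passed as a command-line option, outside the quoted program.
with_option=$(printf '%s\n' "$line" | awk -F '\t' '{print $3}')

# Same separator set through the built-in variable FS, inside the program.
with_variable=$(printf '%s\n' "$line" | awk 'BEGIN{FS="\t"}{print $3}')

printf '%s\n%s\n' "$with_option" "$with_variable"
```

Both commands should print the whole third column, spaces included, because only tabs split the fields.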

RS (the input record separator) defaults to \n, so each line is one record.

$ cat readme.txt
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:05:30 ~/Data
$ cat readme.txt | awk 'BEGIN{RS="\n"}{print $0}'
Welcome to Biotrainee() !
This is your personal account in our Cloud.
Have a fun with it.
Please feel free to contact with me( email to [email protected] )
(http://www.biotrainee.com/thread-1376-1-1.html)
Last2 16:03:21 ~/Data
$ cat readme.txt | awk 'BEGIN{RS=" "}{print $0}'
Welcome
to
Biotrainee()
!
This
is
your
personal
account
in
our
Cloud.
Have
a
fun
with
it.
Please
feel
free
to
contact
with
me(
email
to
[email protected]
)
(http://www.biotrainee.com/thread-1376-1-1.html)

Last2 16:04:28 ~/Data

RS controls how input records are recognized; ORS customizes how the recognized records are written out.

$ cat readme.txt | awk 'BEGIN{RS="\n";ORS="***"}{print $0}'
Welcome to Biotrainee() !***This is your personal account in our Cloud.***Have a fun with it.***Please feel free to contact with me( email to [email protected] )***(http://www.biotrainee.com/thread-1376-1-1.html)***Last2 16:14:38 ~/Data
$ cat readme.txt | awk 'BEGIN{RS="\n";ORS="***"} {print $0}'
Welcome to Biotrainee() !***This is your personal account in our Cloud.***Have a fun with it.***Please feel free to contact with me( email to [email protected] )***(http://www.biotrainee.com/thread-1376-1-1.html)***Last2 16:15:34 ~/Data
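
In the output above the shell prompt is glued to the end of the line because ORS replaced every newline, including the last one. One possible fix, sketched here with three made-up lines standing in for readme.txt, is to print a final newline from an END block.

```shell
# Three short lines standing in for readme.txt.
joined=$(printf 'one\ntwo\nthree\n' |
  awk 'BEGIN{ORS="***"} {print $0} END{printf "\n"}')
printf '%s\n' "$joined"
```

The END block runs once after the last record, so the joined line ends with a newline and the prompt starts on its own line.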

awk conditionals and loops:

if: conditional test
awk ' { if (condition) {yes} else {no} } '
for: loop
awk ' { for (loop condition) {loop body} } '
Conditions go in (), actions go in {}
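
The if template can be filled in as follows. This is only a sketch: the rows (feature type, start, end) are made up, and the threshold of 100 is an arbitrary choice for illustration.

```shell
# Made-up feature rows: type, start, end.
classified=$(printf 'exon 1737 2090\nexon 1873 1920\n' |
  awk '{ if ($3 - $2 > 100) {print $1, "long"} else {print $1, "short"} }')
printf '%s\n' "$classified"
```

The condition sits in the parentheses and each branch has its own braces, matching the template above.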

This shows how awk works: it scans the file line by line.

$ cat Data/example.gtf | awk '{for(i=1;i<4;i++){print $i}}' | less -S
chr1
ENSEMBL
UTR
chr1
ENSEMBL
exon
chr1
ENSEMBL
transcript
chr1
HAVANA
gene
chr1
HAVANA
exon
chr1
HAVANA
transcript
chr1
HAVANA
exon
chr1
HAVANA
exon
chr1
ENSEMBL
UTR
chr1
ENSEMBL
exon
chr1
HAVANA
exon
chr1
HAVANA
exon
chr1
ENSEMBL
UTR
chr1
ENSEMBL
exon
chr1
HAVANA
exon
chr1
ENSEMBL
start_codon
chr1
ENSEMBL
CDS
chr1
ENSEMBL
UTR
$ cat Data/example.gtf | awk '{for(i=1;i<4;i++){print $i}}' | paste - - - | less -S
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    ENSEMBL transcript
chr1    HAVANA  gene
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL start_codon
chr1    ENSEMBL CDS
chr1    ENSEMBL UTR
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    ENSEMBL transcript
chr1    ENSEMBL transcript
chr1    HAVANA  gene
chr1    ENSEMBL stop_codon
chr1    ENSEMBL UTR
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    ENSEMBL stop_codon
chr1    ENSEMBL CDS
chr1    ENSEMBL CDS
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL CDS
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL UTR
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL stop_codon
chr1    ENSEMBL CDS
chr1    ENSEMBL UTR
chr1    ENSEMBL stop_codon
chr1    ENSEMBL exon
chr1    ENSEMBL transcript
chr1    ENSEMBL CDS
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL CDS
chr1    ENSEMBL CDS
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL CDS
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL UTR
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL start_codon
chr1    ENSEMBL UTR
chr1    ENSEMBL CDS
chr1    ENSEMBL UTR
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    ENSEMBL start_codon
chr1    ENSEMBL UTR
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    ENSEMBL UTR
chr1    ENSEMBL UTR
chr1    ENSEMBL exon
chr1    ENSEMBL exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  gene
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    ENSEMBL exon
chr1    ENSEMBL gene
chr1    ENSEMBL transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  UTR
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  stop_codon
chr1    HAVANA  CDS
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  CDS
chr1    HAVANA  exon
chr1    HAVANA  CDS
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  start_codon
chr1    HAVANA  UTR
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL gene
chr1    ENSEMBL transcript
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL UTR
chr1    ENSEMBL stop_codon
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  UTR
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  start_codon
chr1    HAVANA  CDS
chr1    HAVANA  stop_codon
chr1    HAVANA  UTR
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  gene
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  gene
chr1    ENSEMBL UTR
chr1    ENSEMBL stop_codon
chr1    ENSEMBL exon
chr1    ENSEMBL transcript
chr1    ENSEMBL CDS
chr1    HAVANA  exon
chr1    ENSEMBL CDS
chr1    ENSEMBL exon
chr1    ENSEMBL start_codon
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    ENSEMBL exon
chr1    ENSEMBL gene
chr1    ENSEMBL transcript
chr1    ENSEMBL exon
chr1    ENSEMBL gene
chr1    ENSEMBL transcript
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  gene
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  transcript
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr1    HAVANA  exon
chr2    HAVANA  UTR
chr2    HAVANA  exon
chr2    HAVANA  transcript
chr2    HAVANA  gene
chr2    HAVANA  exon
chr2    HAVANA  transcript
chr2    HAVANA  exon
chr2    HAVANA  transcript
chr2    HAVANA  stop_codon
chr2    HAVANA  CDS
chr2    HAVANA  exon
chr2    HAVANA  CDS
chr2    HAVANA  exon
chr2    HAVANA  start_codon
chr2    HAVANA  exon
chr2    HAVANA  exon
chr2    HAVANA  gene
chr2    HAVANA  transcript
chr2    HAVANA  exon
(base) Last2 10:27:35 ~
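
The loop prints one field per line and paste - - - folds every three lines back into one row. As a sketch of an equivalent route, the three fields can be printed in one go with a tab as OFS. The two sample rows below are made up to resemble the file.

```shell
rows=$(printf 'chr1\tENSEMBL\tUTR\t1737\nchr1\tHAVANA\tgene\t1737\n')

# Loop prints one field per line; paste - - - regroups every three lines.
via_paste=$(printf '%s\n' "$rows" | awk '{for(i=1;i<4;i++){print $i}}' | paste - - -)

# Printing the three fields at once, joined by tabs, gives the same rows.
direct=$(printf '%s\n' "$rows" | awk 'BEGIN{OFS="\t"}{print $1,$2,$3}')

printf '%s\n' "$direct"
```

The loop version is still useful when the number of fields to print is not known in advance.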

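awk arithmetic:

  + (addition), - (subtraction), * (multiplication), ^ (exponentiation)
  / (division), ** (exponentiation, same as ^ where the awk version supports it), % (remainder)
  int(x): the integer part of x, truncated toward zero
  log(x): the natural logarithm of x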
$ cat Data/example.gtf | awk '/exon/{print $10"\t"$3"\t"$5-$4}' | less -S
"ENSG00000223972";      exon    353
"ENSG00000223972";      exon    47
"ENSG00000223972";      exon    48
"ENSG00000223972";      exon    84
"ENSG00000223972";      exon    108
"ENSG00000223972";      exon    77
"ENSG00000223972";      exon    153
"ENSG00000223972";      exon    1191
"ENSG00000223972";      exon    217
"ENSG00000227232";      exon    466
"ENSG00000227232";      exon    466
"ENSG00000227232";      exon    97
"ENSG00000227232";      exon    68
"ENSG00000227232";      exon    68
"ENSG00000227232";      exon    33
"ENSG00000227232";      exon    105
"ENSG00000227232";      exon    151
"ENSG00000227232";      exon    151
"ENSG00000227232";      exon    43
"ENSG00000227232";      exon    158
"ENSG00000227232";      exon    158
"ENSG00000227232";      exon    158
"ENSG00000227232";      exon    201
"ENSG00000227232";      exon    197
"ENSG00000227232";      exon    197
"ENSG00000227232";      exon    131
"ENSG00000227232";      exon    135
"ENSG00000227232";      exon    135
"ENSG00000227232";      exon    191
"ENSG00000227232";      exon    140
"ENSG00000227232";      exon    136
"ENSG00000227232";      exon    136
"ENSG00000227232";      exon    146
"ENSG00000227232";      exon    146
"ENSG00000227232";      exon    146
"ENSG00000227232";      exon    146
"ENSG00000227232";      exon    98
"ENSG00000227232";      exon    98
"ENSG00000227232";      exon    98
"ENSG00000227232";      exon    162
"ENSG00000227232";      exon    153
"ENSG00000227232";      exon    153
"ENSG00000227232";      exon    153
"ENSG00000227232";      exon    153
"ENSG00000227232";      exon    22
"ENSG00000227232";      exon    49
"ENSG00000227232";      exon    49
"ENSG00000227232";      exon    36
"ENSG00000243485";      exon    485
"ENSG00000243485";      exon    400
"ENSG00000221311";      exon    137
"ENSG00000243485";      exon    103
"ENSG00000243485";      exon    121
"ENSG00000243485";      exon    133
"ENSG00000237613";      exon    620

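A short sketch of the operators and functions listed above, followed by a variant of the exon-length command. Two assumptions to flag: the rows are made up (type, start, end), and since GTF coordinates are 1-based and inclusive, the number of bases would be end - start + 1, whereas the listing above prints the plain difference. Testing the feature column with == also avoids matching "exon" elsewhere in the line.

```shell
# int() truncates toward zero, % is the remainder, ^ is exponentiation.
ops=$(awk 'BEGIN{print int(3.7), int(-3.7), 7 % 3, 2 ^ 10, log(1)}')
printf '%s\n' "$ops"

# Made-up rows (type, start, end); $3 - $2 + 1 counts both end coordinates.
lengths=$(printf 'exon\t1737\t2090\nstart_codon\t1800\t1802\n' |
  awk '$1 == "exon" {print $1, $3 - $2 + 1}')
printf '%s\n' "$lengths"
```
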
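Exercise 3:

1. Pick any four of the commands above and type them yourself
2. View example.gtf with head
3. Pipe the result of step 2 to awk and print the lines containing ENSEMBL
4. Using what you have learned, reduce those lines to three columns: the gene_id, transcript_id, and gene_type values, without quotes or semicolons
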
$ head example.gtf | awk '/ENSEMBL/{print $0}'
chr1    ENSEMBL UTR 1737    2090    .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL exon    1737    2090    .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL transcript  1737    4275    .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL UTR 2476    2584    .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL exon    2476    2584    .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
Last2 14:25:15 ~/Data
$ head example.gtf | awk -F '\t' '/ENSEMBL/{print $9}' # the attributes are in column 9, so narrow the output down first
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
Last2 14:29:07 ~/Data
$ head example.gtf | awk -F '\t' '/ENSEMBL/{print $9}' | awk -F '"' '{print $2,$4,$6}'  # use " as the field separator
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
Last2 14:32:24 ~/Data

sed can also do this by substitution. Replacing a pattern with an empty string is effectively a deletion.

$ head example.gtf | awk -F '\t'  '/ENSEMBL/{print $9}' | awk '{print $2,$4,$6}' |sed -e '1,$ s/"//g' -e '1,$ s/;//g'
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
ENSG00000223972 ENST00000456328 protein_coding
Last2 14:52:50 ~/Data
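
The two pipelines above each chain several commands. As a sketch of a third route, a single awk call can split column 9 on the double quote with the built-in split() function. The record below is made up to resemble one ENSEMBL line, with the attribute column shortened.

```shell
# One made-up GTF-like record: eight columns plus a shortened attribute column.
record=$(printf 'chr1\tENSEMBL\texon\t1737\t2090\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding";')

# split() cuts column 9 at every double quote; the values land in the even slots.
ids=$(printf '%s\n' "$record" |
  awk -F '\t' '/ENSEMBL/ { split($9, part, "\""); print part[2], part[4], part[6] }')
printf '%s\n' "$ids"
```

Which form to prefer is a matter of taste: the chained version is easier to build up step by step, while the single call keeps everything in one place.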

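Summary

grep aims high but does little: it only searches text and cannot act on what it finds
sed is a stream editor: insert, delete, substitute, and print, with addresses to select lines
awk seems made for print: it picks out and rearranges fields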
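
To make the comparison concrete, here is a rough sketch that runs all three tools on the same two made-up lines. The replacement text "chromosome1" is only an illustration.

```shell
sample=$(printf 'chr1 exon 1737 2090\nchr1 gene 1737 4275\n')

# grep: find the matching line, nothing more.
found=$(printf '%s\n' "$sample" | grep 'exon')

# sed: select lines by address (/exon/) and edit them.
edited=$(printf '%s\n' "$sample" | sed -n '/exon/ s/chr1/chromosome1/p')

# awk: pick fields from matching lines and compute with them.
picked=$(printf '%s\n' "$sample" | awk '/exon/ {print $2, $4 - $3}')

printf '%s\n%s\n%s\n' "$found" "$edited" "$picked"
```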
