Extracting regex-matched strings with Perl/R/Python

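This task comes up in many situations.

For example, I wanted to pull out the gene IDs that follow previous_id in the annotation column. At first glance it looked as if the cut command could slice them out piece by piece, but previous_id does not always come right after the ID field, and the IDs that follow derived_from need to be extracted as well. At that point regular-expression matching is the only practical option.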
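To make the problem concrete, here is a minimal Python sketch. The three attribute strings are made up for illustration (a real GFF3 file may order or name its fields differently), and second_field and old_ids are hypothetical helper names. It only shows that a fixed-position cut picks up the wrong field as soon as the order changes, while a regex finds the old ID wherever it sits.

```python
import re

# Made-up attribute strings (column 9 of a GFF3 line); real files may differ.
attrs = [
    "ID=TraesCS1A02G000100;previous_id=TraesCS1A01G000100;primconf=HC",
    "ID=TraesCS1A02G000200;primconf=HC;previous_id=TraesCS1A01G000200",
    "ID=TraesCS1A02G000300;derived_from=TraesCS1A01G000300",
]

# Same pattern as in the scripts in this post: old-style (01G) gene IDs.
OLD_ID = re.compile(r"TraesCS[1-7][ABD]01G[0-9]{6}[CHL]{0,2}")


def second_field(attr):
    """Mimic `cut -d';' -f2`: return whatever sits in the second ';' field."""
    fields = attr.split(";")
    return fields[1] if len(fields) > 1 else ""


def old_ids(attr):
    """Return every old-style ID in the string, regardless of position."""
    return OLD_ID.findall(attr)


for a in attrs:
    print(second_field(a), old_ids(a), sep="\t")
```

For the second string the fixed-position approach returns primconf=HC, while the regex still returns the old ID.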

Perl

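$ cat test.pl
#!/usr/bin/perl
use warnings;
use strict;

# Path to the GFF3 file, given as the first command-line argument
my $file_name = $ARGV[0];
open my $in_fh, "<", $file_name or die "Cannot open $file_name: $!";
while (<$in_fh>) {
    chomp $_;
    my @one_line = split("\t", $_);
    next if @one_line < 9;    # skip lines without an attributes column (e.g. comments)
    # List context plus /g returns every matched string found in column 9
    my @old_name = ( $one_line[8] =~ /TraesCS[1-7][A-D]01G[0-9]{6}[CHL]{0,2}/g );
    print "@old_name\n";
}
close $in_fh;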

$ perl test.pl test.gff3 | head -n 5
TraesCS1A01G000100
TraesCS1A01G000200
TraesCS1A01G000300
TraesCS1A01G000400
TraesCS1A01G000500

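The line to pay attention to is: my @old_name = ( $one_line[8] =~ /TraesCS[1-7][A-D]01G[0-9]{6}[CHL]{0,2}/g ); My current understanding is that the result has to be received in an array and the g modifier has to be present, whether the pattern matches once or several times. The reason is that this pattern has no capture groups: in list context, /g makes the match return every matched string, whereas without /g a successful match would only return 1.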

R

This is easy to do in R with the stringr package.

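library(tidyverse)  # loads stringr
a <- read.table("test.gff3", header = FALSE, sep = "\t")
b <- as.character(a$V9)
# Extract the first match in each string
c <- str_extract(b, "TraesCS[1-7][ABD]01G[0-9]{6}[CHL]{0,2}")
# Extract all matches
# simplify = TRUE returns the matches as a character matrix; rows with fewer matches are padded with "" so that every row has the same length
d <- str_extract_all(b, "TraesCS[1-7][ABD]0[12]G[0-9]{6}[CHL]{0,2}", simplify = TRUE)  # the regex is changed slightly here (0[12]G) to show a case with several matches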
> head(c)
[1] "TraesCS1A01G000100" "TraesCS1A01G000200" "TraesCS1A01G000300" "TraesCS1A01G000400" "TraesCS1A01G000500" "TraesCS1A01G000600"
> head(d)
     [,1]                 [,2]                 [,3]                
[1,] "TraesCS1A02G000100" "TraesCS1A01G000100" "TraesCS1A02G000100"
[2,] "TraesCS1A02G000200" "TraesCS1A01G000200" "TraesCS1A02G000200"
[3,] "TraesCS1A02G000300" "TraesCS1A01G000300" "TraesCS1A02G000300"
[4,] "TraesCS1A02G000400" "TraesCS1A01G000400" "TraesCS1A02G000400"
[5,] "TraesCS1A02G000500" "TraesCS1A01G000500" "TraesCS1A02G000500"
[6,] "TraesCS1A02G000600" "TraesCS1A01G000600" "TraesCS1A02G000600"

Python

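$ cat test.py
import re

# Compile the pattern once and reuse it for every line
pattern = re.compile(r"TraesCS[1-7][A-D]0[12]G[0-9]{6}[CHL]{0,2}")

with open('./test.gff3') as gff:
    for line in gff:
        # findall returns every non-overlapping match in the line
        ids = pattern.findall(line)
        for i in ids:
            print(i, end="\t")
        print()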
$ python3 test.py
TraesCS1A02G000100  TraesCS1A01G000100  TraesCS1A01G000200  TraesCS1A02G000100  
TraesCS1A02G000200  TraesCS1A01G000200  TraesCS1A02G000200  
TraesCS1A02G000300  TraesCS1A01G000300  TraesCS1A02G000300  
TraesCS1A02G000400  TraesCS1A01G000400  TraesCS1A02G000400  
TraesCS1A02G000500  TraesCS1A01G000500  TraesCS1A02G000500
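As a rough Python counterpart to str_extract versus str_extract_all in the R section, the sketch below contrasts re.search (first match only) with re.findall (all matches). It also shows one possible way to keep only the IDs that follow previous_id or derived_from by adding capture groups. The sample line is made up, and the names GENE and KEYED are mine rather than anything from the scripts above.

```python
import re

# Gene-ID pattern used in this post, accepting both 01G and 02G IDs.
GENE = r"TraesCS[1-7][ABD]0[12]G[0-9]{6}[CHL]{0,2}"

# Assumption: attributes are written as key=value, e.g. previous_id=<gene ID>.
KEYED = re.compile(r"(previous_id|derived_from)=(" + GENE + r")")

# Made-up attribute string for illustration only.
line = "ID=TraesCS1A02G000100;previous_id=TraesCS1A01G000100;Name=TraesCS1A02G000100"

# First match only (comparable to str_extract in R).
first = re.search(GENE, line)
first_id = first.group(0) if first else None

# All matches (comparable to str_extract_all in R).
all_ids = re.findall(GENE, line)

# With capture groups, findall returns one tuple of groups per match.
keyed = KEYED.findall(line)

print(first_id)
print(all_ids)
print(keyed)
```

Here keyed contains a single (key, ID) pair, because only one ID in the sample line is preceded by previous_id or derived_from.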
