Sharing handy one-liners for bioinformatics

Recorded by 刘小泽 on 2019.5.10
Sharing some handy one-liners written by Turner; the original has many more that are worth a look
https://github.com/stephenturner/oneliners

About Fastq/fasta

fastq sequence length distribution => tabulate the read lengths in a fastq file

$ zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'

reverse complement => reverse-complement a sequence

$ echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
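
The one-liner above complements only uppercase A/C/G/T. A minimal sketch that also handles lowercase bases is shown here; the helper name `revcomp` is hypothetical and not part of the original collection, and any other character (such as N) passes through unchanged.

```shell
# revcomp: hypothetical helper; reads sequences on stdin, one per line.
# Characters outside ACGTacgt (for example N) are left as they are.
revcomp() {
  rev | tr 'ACGTacgt' 'TGCAtgca'
}

echo 'ATTGCTATGCTNNNT' | revcomp   # prints ANNNAGCATAGCAAT
echo 'attgcNNt' | revcomp          # prints aNNgcaat
```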

fastq2fasta

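# split on tabs so that headers containing spaces stay intact; substr drops the leading @
$ zcat file.fastq.gz | paste - - - - | perl -F'\t' -ane 'print ">", substr($F[0], 1), "\n$F[1]\n";' | gzip -c > file.fasta.gz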

split a multifasta file into single ones with csplit => split a fasta file at each >

# {*} repeats the split at every match; use {N} instead to repeat it only N more times
$ csplit -z -q -n 4 -f test test.fa '/>/' '{*}'
# OR use awk to split at every > in one pass
$ awk '/^>/{s=++d".fa"} {print > s}' multi.fa

single-line fasta to multi-line with 50 characters per line => wrap a single-line fasta

$ awk -v FS= '/^>/{print;next}{for (i=0;i<=NF/50;i++) {for (j=1;j<=50;j++) printf "%s", $(i*50 +j); print ""}}' file

# or, more simply (note that fold also wraps header lines longer than 50 characters)
# fold -w 50 file
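
`fold` wraps every line longer than the width, headers included, and an empty `FS` is not supported by every awk. As an alternative, here is a minimal sketch that wraps only the sequence lines using `substr`; the helper name `wrap_fasta` and its optional width argument are hypothetical.

```shell
# wrap_fasta FILE [WIDTH]: hypothetical helper, WIDTH defaults to 50.
wrap_fasta() {
  awk -v w="${2:-50}" '
    /^>/ { print; next }            # header lines pass through untouched
    {
      while (length($0) > w) {      # emit full-width chunks
        print substr($0, 1, w)
        $0 = substr($0, w + 1)
      }
      print                         # remainder, possibly shorter than w
    }' "$1"
}

# usage (file name is a placeholder): wrap_fasta file.fa 50 > wrapped.fa
```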

multi-line fasta to one-line => linearize a multi-line fasta

# Method 1:
$ awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa
# Method 2:
$ cat file.fasta | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'

Number of reads in a fastq file => count the reads in a fastq file (4 lines per read)

$ cat file.fq | echo $((`wc -l`/4))

print length of each entry in a multifasta file => print the length of every sequence in a fasta file

$ awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa

subsample fastq => take a subset of a fastq file (0.01 keeps about 1% of the reads)

$ cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq
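
For paired-end data, the same subsampling can be applied to both mate files. Reusing the same fraction and seed selects the same record positions, provided both files list the reads in the same order and the same awk is used for both. A minimal sketch follows; `subsample_fq` and its argument order are hypothetical, and it assumes 4-line records with no tab characters.

```shell
# subsample_fq FILE FRACTION [SEED]: hypothetical helper, not from the original.
# Assumes 4-line fastq records and no tab characters inside the file.
subsample_fq() {
  paste - - - - < "$1" |
    awk -v frac="$2" -v seed="${3:-1234}" 'BEGIN { srand(seed) } rand() < frac' |
    tr '\t' '\n'
}

# usage (file names are placeholders):
#   subsample_fq R1.fq 0.01 1234 > R1.sub.fq
#   subsample_fq R2.fq 0.01 1234 > R2.sub.fq
```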

About Sam/Bam

bam2bed

samtools view file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=1;$tmp=$F[5];$tmp =~ s/(\d+)[MD]/$length+=$1/eg;print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";' > file.bed

bam2wig

samtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print "fixedStep chrom=$c start=$start step=1 span=1\n";}$_ = $depth."\n";($lastC, $lastStart) = ($c, $start);' | gzip -c > file.wig.gz

Basic Linux

get all folders' size in the current folder => size of every directory under the current one

$ du -h --max-depth=1

exit a dead ssh session => leave a frozen ssh session

# not a shell command: press Enter, then type this escape sequence
~.

copy large folders fast => copy large folders quickly

# copy every file in the folder
rsync -av from_dir/ to_dir
# -h prints human-readable sizes; -P shows progress and keeps partial files so an interrupted copy can resume
rsync -avhP from_dir/ to_dir

find bam in the current folder recursively and copy them to a new dir with 5 parallel processes => copy large files (such as bam) to another folder, 5 at a time

find . -name "*bam" | xargs -P5 -I{} rsync -av {} dest_dir

group files by extension => sort files by extension

# ll is usually an alias for a long-format ls; -X is a GNU ls option
ls -lX

loop through all the names => for loop

for i in {1..22} X Y 
do
  echo $i
done
# {01..22} yields zero-padded values: 01 02 ...

GREP

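grep fastq reads containing a pattern but maintain the fastq format => match a sequence in a fastq file and print whole records

# For example, to pull the fastq records containing this sequence out of SP1.fq
# With several matches, grep separates the groups with --, so sed is used to remove those lines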
$ grep -A 2 -B 1 'TGAGACAACATCT' SP1.fq | sed '/^--$/d' > out.fq

SED

delete with sed => delete lines

# delete blank lines
sed '/^$/d'
# delete the last line (quoted so the shell does not expand $d as a variable)
sed '$d'
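
A quick check of the two deletions on made-up input. The sed expressions are single-quoted here so that the shell passes `$d` to sed literally instead of expanding it as a variable.

```shell
# blank lines removed: prints a, b, c on separate lines
printf 'a\n\nb\n\nc\n' | sed '/^$/d'

# last line removed: prints a, b on separate lines
printf 'a\nb\nc\n' | sed '$d'
```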

AWK

awk join two files with common columns => join files that share columns (similar to merge in R)

# http://stackoverflow.com/questions/13258604/join-two-files-using-awk
# file_a.bed:
chr1    123 aa  b   c   d
chr1    234 a   b   c   d
chr1    345 aa  b   c   d
chr1    456 a   b   c   d
# file_b.bed
xxxx    abcd    chr1    123 aa  c   d   e
yyyy    defg    chr1    345 aa  e   f   g
# Goal: starting from a, add the extra fields from b for rows where the shared columns match
$ awk 'NR==FNR{a[$3,$4,$5]=$1OFS$2;next}{$6=a[$1,$2,$3];print}' OFS='\t' \
file_b.bed file_a.bed

# Result
chr1    123 aa  b   c   xxxx    abcd
chr1    234 a   b   c   
chr1    345 aa  b   c   yyyy    defg
chr1    456 a   b   c

Explanation:

NR==FNR NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.

OFS sets awk's output field separator; the input field separator is set with -F.

next skips to the next input line instead of executing the following { } code block
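
To make the `NR==FNR` idiom concrete, here is a small sketch on two throwaway files. The helper name `show_nr_fnr` and the file contents are made up for illustration.

```shell
# show_nr_fnr FILE1 FILE2: hypothetical helper that prints NR, FNR and
# whether awk is still reading the first file.
show_nr_fnr() {
  awk '{ print NR, FNR, (NR == FNR ? "first file" : "later file") }' "$1" "$2"
}

tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/a.txt"
printf 'p\nq\nr\n' > "$tmp/b.txt"
show_nr_fnr "$tmp/a.txt" "$tmp/b.txt"
# 1 1 first file
# 2 2 first file
# 3 1 later file
# 4 2 later file
# 5 3 later file
```

FNR restarts at 1 for the second file while NR keeps counting, which is why the test is true only for the first file.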

awk to compare two different files and print if matches => compare chosen columns of two files and print the matching lines

# https://unix.stackexchange.com/questions/134829/compare-two-columns-of-different-files-and-print-if-it-matches
# e.g. file1
abc|123|BNY|apple|
cab|234|cyx|orange|
def|kumar|pki|bird|
# file2
abc|123|
kumar|pki|
cab|234
# expected
abc|123|BNY|apple|
cab|234|cyx|orange|

$  awk -F'|' 'NR==FNR{a[$1$2]++;next};a[$1$2] > 0' file2 file1

conditional operator => ternary conditional

Basic form: var = condition ? value_if_true : value_if_false

For example:

# Given this file, test
a1  ACTGTCTGTCACTGTGTTGTGATGTTG
a2  ACTTTATATAT
a3  ACTTATATATATATA
a4  ACTTATATATATATA
a5  ACTTTATATATT    
# Check whether the sequence on each line is longer than 14 bases
$ awk '{print ((length($2)>14) ? $0">14" : $0"<=14")}' test

getline => add new content on top of the original content

$ awk 'BEGIN{while((getline k <"test")>0) print "NEW:"k}{print}' test

merge multi-fasta into one single fasta => merge several fasta files into one

# create an awk script called linearize.awk
$ cat > linearize.awk
# then paste the line below and press Ctrl-D
/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}

# run the awk script
$ paste <(awk -f linearize.awk file1.fa ) <(awk -f linearize.awk file2.fa  )| tr "\t" "\n" > multi.fa

extract sequences by ID

while read -r line; do awk -v pattern="$line" -v RS=">" '$0 ~ pattern { printf(">%s", $0); }'  Seq.fasta; done < id.txt > output.fa

Seq.fasta and id.txt must be supplied
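
The loop above rereads the fasta once per ID and treats each ID as a regular expression matched against the whole record, so an ID such as `seq2` also matches `seq20`. As an alternative, here is a minimal single-pass sketch that reuses the `NR==FNR` idiom and matches the first word of the header exactly. The helper name `extract_by_id` is hypothetical, and it assumes one ID per line in the ID file.

```shell
# extract_by_id ID_FILE FASTA: hypothetical helper.
# Assumes one ID per line; an ID must equal the first word of a header (without >).
extract_by_id() {
  awk 'NR == FNR { ids[$1]; next }                       # first file: remember the IDs
       /^>/      { keep = (substr($1, 2) in ids) }       # header: decide whether to keep the record
       keep' "$1" "$2"                                   # print lines of kept records
}

# usage (file names as in the text above): extract_by_id id.txt Seq.fasta > output.fa
```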
