Source: https://bioinf.shenwei.me/seqkit/usage/
seqkit
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Version: 0.9.1
Author: Wei Shen
Documents : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962
Usage:
  seqkit [command]

Available Commands:
  common            find common sequences of multiple files by id/name/sequence
  concat            concatenate sequences with same ID from multiple files
  convert           convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate         duplicate sequences N times
  faidx             create FASTA index file and extract subsequence
  fq2fa             convert FASTQ to FASTA
  fx2tab            convert FASTA/Q to tabular format (with length/GC content/GC skew)
  genautocomplete   generate shell autocompletion script
  grep              search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head              print first N FASTA/Q records
  help              Help about any command
  locate            locate subsequences/motifs, mismatch allowed
  range             print FASTA/Q records in a range (start:end)
  rename            rename duplicated IDs
  replace           replace name/sequence by regular expression
  restart           reset start position for circular genome
  rmdup             remove duplicated sequences by id/name/sequence
  sample            sample sequences by number or proportion
  seq               transform sequences (reverse, complement, extract ID...)
  shuffle           shuffle sequences
  sliding           sliding sequences, circular genome supported
  sort              sort sequences by id/name/sequence/length
  split             split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2            split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats             simple statistics of FASTA/Q files
  subseq            get subsequences by region/gtf/bed, including flanking sequences
  tab2fx            convert tabular format to FASTA/Q format
  translate         translate DNA/RNA to protein sequence
  version           print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it is detected automatically from the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

1. Sequence operations:
seqkit seq reads.fq.gz | less -S    # view the FASTQ file
seqkit seq reads.fq.gz -n -i        # print only the read IDs
seqkit seq reads.fq.gz -s -w 0      # print only the sequences, unwrapped (one line per read)
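
To try these flags without a real data set, the sketch below writes a two-record FASTQ file and runs the same commands on it. The file name demo.fq and its reads are made up for illustration. The last call adds -r -p (reverse and complement), which the commands above do not use. The seqkit calls are skipped if the tool is not on PATH.

```shell
# Hypothetical toy input: two 4-line FASTQ records.
printf '@read1 first\nACGTTG\n+\nIIIIII\n@read2 second\nGGCA\n+\nIIII\n' > demo.fq

if command -v seqkit >/dev/null 2>&1; then
  seqkit seq demo.fq -n -i             # IDs only (text up to the first space)
  seqkit seq demo.fq -s -w 0           # sequences only, no line wrapping
  seqkit seq demo.fq -s -r -p -t dna   # reverse complement of each sequence
else
  echo "seqkit not found on PATH; skipping the demo" >&2
fi
```

Dropping -i from the first call should print the full header line (ID plus description) instead of the ID alone.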
2. Statistics
seqkit stats reads.fq.gz    # output shown below
file         format  type  num_seqs     sum_len  min_len  avg_len  max_len
reads.fq.gz  FASTQ   DNA      7,644  63,970,956      252  8,368.8  116,842
seqkit stats reads.fq.gz -a    # -a adds quartiles, gap count, N50 and Q20/Q30 percentages
file         format  type  num_seqs     sum_len  min_len  avg_len  max_len     Q1     Q2      Q3  sum_gap     N50  Q20(%)  Q30(%)
reads.fq.gz  FASTQ   DNA      7,644  63,970,956      252  8,368.8  116,842  3,117  5,837  10,725        0  12,804   16.15       0
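
For readers unfamiliar with the N50 column, here is a minimal sketch of the usual definition: sort the lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The helper name n50 is hypothetical, and this is not taken from seqkit's source, so its handling of ties or edge cases may differ.

```shell
# n50: read one sequence length per line on stdin, print the N50.
# Assumes the common definition; not verified against seqkit's implementation.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }          # store lengths (longest first) and sum them
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running * 2 >= total) {        # reached half of all bases
          print len[i]
          exit
        }
      }
    }'
}

# Toy lengths: total is 20, and 6 + 5 = 11 is the first running sum >= 10.
printf '%s\n' 2 3 4 5 6 | n50
```

On real data the lengths could be taken from the last tab-separated column of `seqkit fx2tab -n -l reads.fq.gz`, for example by piping it through `awk -F'\t' '{ print $NF }'` before n50.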
3. Convert FASTQ to FASTA
seqkit fq2fa reads_1.fq.gz -o reads_1.fa.gz
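
To make clear what the conversion keeps and what it drops, the sketch below does the same thing in plain awk for the simplest case. It assumes strict 4-line FASTQ records (no wrapped sequences). The function name fq2fa_sketch is hypothetical. It illustrates the idea and is not how seqkit implements fq2fa.

```shell
# fq2fa_sketch: convert strict 4-line FASTQ on stdin to FASTA on stdout.
fq2fa_sketch() {
  awk '
    NR % 4 == 1 { print ">" substr($0, 2) }   # header: replace leading "@" with ">"
    NR % 4 == 2 { print }                     # sequence line is kept unchanged
    # lines 3 ("+") and 4 (qualities) are dropped
  '
}

printf '@r1 demo read\nACGT\n+\nIIII\n' | fq2fa_sketch
```

The quality line is discarded, so the conversion cannot be reversed. Keep the original FASTQ if the qualities are needed later.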
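4. Extract sequences by read ID
zcat reads.fq.gz | seqkit grep -f list > new.fq    # list: a file of read IDs, one per line
zcat hairpin.fa.gz | seqkit grep -s -i -p aggcg    # keep records whose sequence contains aggcg (case-insensitive)
seqkit grep --threads 4 -f fastq.uniq.list reads.fq.gz > uniq.fastq    # same ID lookup, using 4 threads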
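
The ID-list workflow can be tried end to end on toy data. The sketch below uses made-up file names (demo_grep.fq, ids.txt, kept.fq, dropped.fq) and adds -v, seqkit grep's invert-match flag, to collect the reads that are not on the list. The seqkit calls are skipped if the tool is not installed.

```shell
# Hypothetical toy input: three reads, of which two are listed in ids.txt.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n@read3\nTTAA\n+\nIIII\n' > demo_grep.fq
printf 'read1\nread3\n' > ids.txt    # one read ID per line, without the leading "@"

if command -v seqkit >/dev/null 2>&1; then
  seqkit grep -f ids.txt demo_grep.fq > kept.fq          # reads whose ID is in the list
  seqkit grep -v -f ids.txt demo_grep.fq > dropped.fq    # everything else
  seqkit stats kept.fq dropped.fq                        # quick check of the record counts
else
  echo "seqkit not found on PATH; skipping the demo" >&2
fi
```

By default the patterns are compared with the parsed ID (the header text up to the first space, per the --id-regexp default shown in the help above), so the list should contain IDs only, not full header lines.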
5. Splitting files
seqkit split2 -s 20000 -O pass -f reads.fastq    # write 20,000 records (not lines) per output file into directory pass; -f overwrites it
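
split2 is also described in the help text as handling paired-end FASTQ. The sketch below shows one way that might look, using the -1/-2 flags for the two mate files. The file names demo_R1.fq and demo_R2.fq and the output directory split_out are made up, and the exact names of the generated part files are not asserted here. The seqkit call is skipped if the tool is not installed.

```shell
# Hypothetical toy input: four read pairs in two mate files.
: > demo_R1.fq
: > demo_R2.fq
for i in 1 2 3 4; do
  printf '@pair%s/1\nACGT\n+\nIIII\n' "$i" >> demo_R1.fq
  printf '@pair%s/2\nTGCA\n+\nIIII\n' "$i" >> demo_R2.fq
done

if command -v seqkit >/dev/null 2>&1; then
  # -s 2: two records per output file; -O: output directory; -f: overwrite it if present
  seqkit split2 -1 demo_R1.fq -2 demo_R2.fq -s 2 -O split_out -f
  ls split_out
else
  echo "seqkit not found on PATH; skipping the demo" >&2
fi
```

Because -s counts records, each FASTQ part written with -s 2 should contain 8 lines, and the two mate files should be cut at the same record boundaries so that pairs stay in sync.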
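6. Duplicating and deduplicating
zcat reads.fq.gz | seqkit head -n 1 | seqkit duplicate -n 2    # print the first read twice
zcat reads.fq.gz | seqkit head -n 1 | seqkit duplicate -n 2 | seqkit rename    # same, then rename the copies so the IDs are unique
zcat reads_1.fq.gz | seqkit rmdup -s -o clean.fq.gz    # remove records with duplicated sequences (output stays FASTQ)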
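
When deduplicating, it is often useful to see what was removed. The sketch below runs rmdup on a toy FASTA file with -i (ignore case when comparing), -d (save the removed records) and -D (save a list of duplicated IDs). All file names are made up for illustration, and the seqkit call is skipped if the tool is not installed.

```shell
# Hypothetical toy input: records a and b carry the same sequence in different case.
printf '>a\nACGT\n>b\nacgt\n>c\nGGCC\n' > demo_dup.fa

if command -v seqkit >/dev/null 2>&1; then
  # -s: compare by sequence; -i: ignore case
  # -d: file for the removed records; -D: file listing the duplicated IDs
  seqkit rmdup -s -i demo_dup.fa -o demo_clean.fa -d demo_removed.fa -D demo_dup_ids.txt
  seqkit seq -n demo_clean.fa    # IDs of the records that survived
else
  echo "seqkit not found on PATH; skipping the demo" >&2
fi
```

Without -i, the upper-case and lower-case copies would be treated as different sequences and both would be kept.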