seqkit: fast FASTQ/FASTA processing

Source: https://bioinf.shenwei.me/seqkit/usage/

seqkit
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.9.1

Author: Wei Shen 

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Usage:
  seqkit [command]

Available Commands:
  common          find common sequences of multiple files by id/name/sequence
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  faidx           create FASTA index file and extract subsequence
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (with length/GC content/GC skew)
  genautocomplete generate shell autocompletion script
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head            print first N FASTA/Q records
  help            Help about any command
  locate          locate subsequences/motifs, mismatch allowed
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by id/name/sequence
  sample          sample sequences by number or proportion
  seq             transform sequences (revserse, complement, extract ID...)
  shuffle         shuffle sequences
  sliding         sliding sequences, circular genome supported
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  translate       translate DNA/RNA to protein sequence
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

1. Sequence operations:

seqkit seq reads.fq.gz | less -S  ## view the FASTQ file

seqkit seq reads.fq.gz -n -i  ## print only the read IDs

seqkit seq reads.fq.gz -s -w 0  ## print only the sequences, unwrapped (one line per read)

2. Statistics

seqkit stats reads.fq.gz  # output:

file         format  type  num_seqs     sum_len  min_len  avg_len  max_len
reads.fq.gz  FASTQ   DNA      7,644  63,970,956      252  8,368.8  116,842

seqkit stats reads.fq.gz -a  # all statistics: adds length quartiles (Q1/Q2/Q3), sum_gap, N50, and Q20/Q30 percentages

file         format  type  num_seqs     sum_len  min_len  avg_len  max_len     Q1     Q2      Q3  sum_gap     N50  Q20(%)  Q30(%)
reads.fq.gz  FASTQ   DNA      7,644  63,970,956      252  8,368.8  116,842  3,117  5,837  10,725        0  12,804   16.15       0
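To make the length columns above concrete, here is a minimal sketch of how num_seqs, sum_len, min_len, max_len, avg_len and N50 can be computed with plain awk. The function name fastq_len_stats is hypothetical and not part of seqkit. It assumes an uncompressed FASTQ file with exactly four lines per record, and it uses the usual N50 definition; that seqkit resolves edge cases the same way is an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of seqkit.
# Assumes an uncompressed FASTQ file with exactly four lines per record.
fastq_len_stats() {
    # Sequence lines are lines 2, 6, 10, ...; emit one length per line and
    # sort in descending order so N50 can be found with a running sum.
    awk 'NR % 4 == 2 { print length($0) }' "$1" | sort -rn | awk '
        { len[NR] = $1; sum += $1 }
        END {
            if (NR == 0) { print "num_seqs=0"; exit }
            # N50: the length at which the running sum of the
            # descending-sorted lengths first reaches half of sum_len.
            run = 0
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run * 2 >= sum) { n50 = len[i]; break }
            }
            printf "num_seqs=%d sum_len=%d min_len=%d max_len=%d avg_len=%.1f N50=%d\n", NR, sum, len[NR], len[1], sum / NR, n50
        }'
}

# Usage (decompress first, e.g. with zcat):
#   zcat reads.fq.gz > reads.fq && fastq_len_stats reads.fq
```

For real data, seqkit stats remains the better choice: it reads gzipped input directly and also handles FASTA and multi-line records, which this sketch does not.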

3. Convert FASTQ to FASTA

seqkit fq2fa reads_1.fq.gz -o reads_1.fa.gz
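As a rough illustration of what this conversion involves, the sketch below keeps the header and sequence line of each record and drops the separator and quality lines. The function name fastq_to_fasta is hypothetical. It assumes uncompressed, strictly four-line FASTQ, and unlike seqkit (default line width 60, per the -w flag above) it does not wrap sequence lines.

```shell
#!/bin/sh
# Hypothetical helper, not part of seqkit.
# Assumes an uncompressed FASTQ file with exactly four lines per record.
fastq_to_fasta() {
    awk '
        # Line 1 of each record: replace the leading "@" with ">".
        NR % 4 == 1 { print ">" substr($0, 2) }
        # Line 2 of each record: the sequence, printed unwrapped.
        NR % 4 == 2 { print }
        # Lines 3 ("+") and 4 (qualities) are dropped.
    ' "$1"
}

# Usage:
#   fastq_to_fasta reads_1.fq > reads_1.fa
```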

4. Extract sequences by read ID

zcat reads.fq.gz | seqkit grep -f list > new.fq  # list is a file of read IDs, one per line

zcat hairpin.fa.gz | seqkit grep -s -i -p aggcg  # select records whose sequence contains aggcg (case-insensitive)

seqkit grep --threads 4 -f fastq.uniq.list reads.fq.gz > uniq.fastq
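The ID-list lookup can be sketched in awk as follows. This is an illustration under stated assumptions, not seqkit's implementation: the function name fastq_grep_ids is hypothetical, the input must be uncompressed four-line FASTQ, and the read ID is taken to be the first whitespace-delimited token after "@", which roughly mirrors the default --id-regexp shown in the help text above.

```shell
#!/bin/sh
# Hypothetical helper, not part of seqkit.
# $1: file with one read ID per line; $2: uncompressed four-line FASTQ file.
fastq_grep_ids() {
    awk '
        # First file: load the wanted IDs into a lookup table.
        NR == FNR { if (NF) want[$1] = 1; next }
        # Header line of each record: decide whether to keep the record.
        FNR % 4 == 1 { id = substr($1, 2); keep = (id in want) }
        # Print all four lines of a kept record.
        keep { print }
    ' "$1" "$2"
}

# Usage:
#   fastq_grep_ids list reads.fq > new.fq
```

Records are selected by position (every fourth line is a header), so a quality line that happens to start with "@" is not mistaken for a header.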

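5. Splitting

seqkit split2 -s 20000 -O pass -f reads.fastq  # write 20000 records (not lines) per output file into the directory pass; -f overwrites an existing output directory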
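Splitting by record count rather than line count can be sketched as below. The function name fastq_split_by_records and the part_NNN.fastq naming are hypothetical and differ from the file names seqkit produces. The input is assumed to be uncompressed four-line FASTQ, and the records-per-file argument must be a positive integer.

```shell
#!/bin/sh
# Hypothetical helper, not part of seqkit.
# $1: uncompressed four-line FASTQ file
# $2: records per output file (positive integer)
# $3: output directory (created if missing)
fastq_split_by_records() {
    mkdir -p "$3" || return 1
    awk -v n="$2" -v dir="$3" '
        # Every n records (4 * n lines), switch to a new output file.
        (NR - 1) % (4 * n) == 0 {
            if (out != "") close(out)
            part++
            out = sprintf("%s/part_%03d.fastq", dir, part)
        }
        { print > out }
    ' "$1"
}

# Usage:
#   fastq_split_by_records reads.fastq 20000 pass
```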

6. Duplicating and deduplicating

zcat reads.fq.gz | seqkit head -n 1 | seqkit duplicate -n 2  # take the first read and output it twice

zcat reads.fq.gz | seqkit head -n 1 | seqkit duplicate -n 2 | seqkit rename  # rename the duplicated read IDs so they are unique

zcat reads_1.fq.gz | seqkit rmdup -s -o clean.fq.gz  # remove records with duplicated sequences (the output stays in FASTQ format)
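Deduplication by sequence amounts to keeping the first record seen for each distinct sequence. A minimal sketch, assuming uncompressed four-line FASTQ and exact, case-sensitive string comparison (reverse complements are not treated as duplicates here; the function name fastq_rmdup_by_seq is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper, not part of seqkit.
# Assumes an uncompressed FASTQ file with exactly four lines per record.
fastq_rmdup_by_seq() {
    awk '
        # Buffer the four lines of the current record.
        { rec[(NR - 1) % 4] = $0 }
        # At the end of each record, print it only if its sequence is new.
        NR % 4 == 0 {
            key = rec[1]
            if (!(key in seen)) {
                seen[key] = 1
                print rec[0]; print rec[1]; print rec[2]; print rec[3]
            }
        }
    ' "$1"
}

# Usage:
#   fastq_rmdup_by_seq reads_1.fq > clean.fq
```

This holds every distinct sequence in memory, which is fine for small files but can be costly for large read sets.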
