【不定期更新】Bedtools大全--42个子命令介绍

参考链接:

用法大全:

官方说明:https://bedtools.readthedocs.io/en/latest/content/overview.html
Bedtools使用:http://blog.genesino.com/2018/04/bedtools/
bedtools 用法大全:https://www.jianshu.com/p/6c3b87301491
2019-和刘小泽一起跟着官网学bedtools:https://www.jianshu.com/p/2efcb6f8f55d
bedtools使用教程详解:

安装:
wget https://github.com/arq5x/bedtools2/archive/v2.25.0.tar.gz
tar zxvf v2.25.0 
cd bedtools2-2.25.0/
make
cd bin/
export PATH=$PWD:$PATH
特定功能:

使用bedtools划分各种条件的bin dplyr-pandas-bedtools 分组变量处理
dplyr-pandas-bedtools 分组变量处理





bedtools开发的目的是为了快速,灵活的比较大量的基因组特征(genomic features)。而genomic features通常使用Browser Extensible Data (BED) 或者 General Feature Format (GFF)文件表示,用UCSC Genome Browser进行可视化比较。

例如,bedtools可以进行取intersect(交集), merge(并集), count(计数), complement(补集),以及用来对广泛使用的基因组文件格式,例如BAM, BED, GFF/GTF, VCF等进行基因组区间的转换。单个的工具设计的目的是应对简单的任务,复杂的分析能通过组合多个bedtools工具操作实现。同时,该工具允许控制输出结果的呈现形式。最初的bedtools版本支持单独的6列BED文件。但是,如今增加了对序列比对BAM文件的支持。以及GFF文件的特征,BED文件。以及VCF文件。这些工具是相当快速的,并且即使是大的数据集也可以在数秒内完成任务。

BEDTools主要使用BED格式的前三列,即:

  • chrom: 染色体信息

  • start: genome feature的起始位点,从0开始

  • end: genome feature的终止位点,至少为1

一般常用物种的genome file在BEDTools安装目录的/genome里面

BEDPE格式是其自定义的一种新的格式,为了简洁的描述不连续的genome features,例如结构变异和双端测序比对

注意:

  • start1和start2起始坐标第一个碱基都为0,所以start=9, end=20表示碱基跨度是从第10位到第20位

  • chrom1或者chrom2用.表示unknown;start1,end1,start2,end2用-1表示unknown



bedtools 的39个子命令列表(按照字母排序)

Utility Description
annotate Annotate coverage of features from multiple files.
bamtobed Convert BAM alignments to BED (& other) formats.
bamtofastq Convert BAM records to FASTQ records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
bedpetobam Convert BEDPE intervals to BAM records.
bedtobam Convert intervals to BAM records.
closest Find the closest, potentially non-overlapping interval.
cluster Cluster (but don’t merge) overlapping/nearby intervals.
complement Extract intervals not represented by an interval file.
coverage Compute the coverage over defined intervals.
expand Replicate lines based on lists of values in columns.
fisher Calculate Fisher statistic b/w two feature files.
flank Create new intervals from the flanks of existing intervals.
genomecov Compute the coverage over an entire genome.
getfasta Use intervals to extract sequences from a FASTA file.
groupby Group by common cols. & summarize oth. cols. (~ SQL “groupBy”)
igv Create an IGV snapshot batch script.
intersect Find overlapping intervals in various ways.
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
links Create a HTML page of links to UCSC locations.
makewindows Make interval “windows” across a genome.
map Apply a function to a column for each overlapping interval.
maskfasta Use intervals to mask sequences from a FASTA file.
merge Combine overlapping/nearby intervals into a single interval.
multicov Counts coverage from multiple BAMs at specific intervals.
multiinter Identifies common intervals among multiple interval files.
nuc Profile the nucleotide content of intervals in a FASTA file.
overlap Computes the amount of overlap from two intervals.
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
random Generate random intervals in a genome.
reldist Calculate the distribution of relative distances b/w two files.
tools/sample Sample random records from file using reservoir sampling.
shift Adjust the position of intervals.
shuffle Randomly redistribute intervals in a genome.
slop Adjust the size of intervals.
sort Order the intervals in a file.
tools/spacing Sample random records from file using reservoir sampling.
tools/split Split a file into multiple files with equal records or base pairs.
subtract Remove intervals based on overlaps b/w two files.
tag Tag BAM alignments based on overlaps with interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
window Find overlapping intervals within a window around an interval.




bedtools 的41个子命令列表(按照功能排序)

  1. 区域注释,如peak注释,peak分布分析,peak与调控元件交集等。

  2. 区域合并,如求算多样品peak合集,或合并重叠区域

  3. 区域互补,如得到非基因区

  4. 利用比对结果对测序广度和深度评估

  5. 多样品peak相似性计算,评估ChIP类区域结果的样品相似性。

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
​
usage:    bedtools  [options]
​
The bedtools sub-commands include:
​
​
[ Genome arithmetic ]
​
 intersect     Find overlapping intervals in various ways.
​
 求区域之间的交集,可以用来注释peak,计算reads比对到的基因组区域
 不同样品的peak之间的peak重叠情况。
​
 window        Find overlapping intervals within a window around an interval.
 closest       Find the closest, potentially non-overlapping interval.
​
 寻找最近但可能不重叠的区域
​
 coverage      Compute the coverage over defined intervals.
​
 计算区域覆盖度
​
 map           Apply a function to a column for each overlapping interval.
 genomecov     Compute the coverage over an entire genome.
 merge         Combine overlapping/nearby intervals into a single interval.
​
 合并重叠或相接的区域
​
 cluster       Cluster (but don't merge) overlapping/nearby intervals.
 complement    Extract intervals _not_ represented by an interval file.
​
 获得互补区域
​
 subtract      Remove intervals based on overlaps b/w two files.
​
 计算区域差集
​
 slop          Adjust the size of intervals.
​
 调整区域大小,如获得转录起始位点上下游3 K的区域
​
 flank         Create new intervals from the flanks of existing intervals.
​
 sort          Order the intervals in a file.
​
 排序,部分命令需要排序过的bed文件
​
 random        Generate random intervals in a genome.
​
 获得随机区域,作为背景集
​
 shuffle       Randomly redistrubute intervals in a genome.
​
 根据给定的bed文件获得随机区域,作为背景集
​
 sample        Sample random records from file using reservoir sampling.
 spacing       Report the gap lengths between intervals in a file.
 annotate      Annotate coverage of features from multiple files.
​
[ Multi-way file comparisons ]
​
 multiinter    Identifies common intervals among multiple interval files.
 unionbedg     Combines coverage intervals from multiple BEDGRAPH files.
​
[ Paired-end manipulation ]
​
 pairtobed     Find pairs that overlap intervals in various ways.
 pairtopair    Find pairs that overlap other pairs in various ways.
​
[ Format conversion ]
​
 bamtobed      Convert BAM alignments to BED (& other) formats.
 bedtobam      Convert intervals to BAM records.
 bamtofastq    Convert BAM records to FASTQ records.
 bedpetobam    Convert BEDPE intervals to BAM records.
 bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.
​
[ Fasta manipulation ]
​
 getfasta      Use intervals to extract sequences from a FASTA file.
​
 提取给定位置的FASTA序列
​
 maskfasta     Use intervals to mask sequences from a FASTA file.
 nuc           Profile the nucleotide content of intervals in a FASTA file.
​
[ BAM focused tools ]
​
 multicov      Counts coverage from multiple BAMs at specific intervals.
 tag           Tag BAM alignments based on overlaps with interval files.
​
[ Statistical relationships ]
​
 jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
​
 计算数据集相似性
​
 reldist       Calculate the distribution of relative distances b/w two files.
 fisher        Calculate Fisher statistic b/w two feature files.
​
[ Miscellaneous tools ]
​
 overlap       Computes the amount of overlap from two intervals.
 igv           Create an IGV snapshot batch script.
​
 用于生成一个脚本,批量捕获IGV截图
​
 links         Create a HTML page of links to UCSC locations.
​
 makewindows   Make interval "windows" across a genome.
​
 把给定区域划分成指定大小和间隔的小区间 (bin)
​
 groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
​
 分组结算,不只可以用于bed文件。
​
 expand        Replicate lines based on lists of values in columns.
 split         Split a file into multiple files with equal records or base pairs
  • stdin 和 - 用法和xargs 不一样
    image.png





Part1:Genome arithmetic

1.intersect

可以计算两个或者多个BED/BAM/VCF/GFF文件中基因组坐标位置的交集(overlap),根据参数不同,可以得到不同的结果。

求区域之间的交集,可以用来注释peak,计算reads比对到的基因组区域, 不同样品的peak之间的peak重叠情况

image
  • Usage:
    bedtools intersect [OPTIONS] -a  \
                             -b 
  • 计算两个bed交集区域
$ cat A.bed
chr1  10  20
chr1  30  40

$ cat B.bed
chr1  15   20

$ bedtools intersect -a A.bed -b B.bed
chr1  15   20

2.window

与bedtools intersect相似,窗口搜索A和B中的重叠特性。然而,窗口添加了A中每个特性上游和下游的指定大小(默认为1000)的碱基对。

image
  • Usage:
bedtools window [OPTIONS] [-a|-abam] -b 
  • 相当于将A区域进行扩增一定区间,再和B比较。
    $ cat A.bed
    chr1  100  200
    ​
    $ cat B.bed
    chr1  500  1000
    chr1  1300 2000
    ​
    $ bedtools window -a A.bed -b B.bed
    chr1  100  200  chr1  500  1000

3.closest

类似于相交,最接近的搜索是在A和B中重叠的特征。如果B中没有一个特征与A中的当前特征重叠,则最接近将报告最接近的特征(即距A的起点或终点最小的基因组距离)。 例如,人们可能想找到哪个是与显着GWAS多态性最接近的基因。 请注意,最接近将报告重叠特征为最接近-即,它不限于最接近的非重叠特征。 以下标志性的“备忘单”总结了最接近的工具提供的各种选项提供的功能。

寻找最近但可能不重叠的区域: 简单来说就是从A.bed 中找到B.bed 中最近的区间,比如修饰一个基因,最近的enhancer 位置.

image
  • Usage:
    bedtools closest [OPTIONS] -a  \
     -b 
  • 比如:
   $ cat a.bed
    chr1  10  20  a1  1 -
    ​
    $ cat b.bed
    chr1  7   8   b1  1 -
    chr1  15  25  b2  2 +
    ​
    $ bedtools closest -a a.bed -b b.bed
    chr1  10  20  a1  1 - chr1  15  25  b2  2 +

4.coverage

计算区域覆盖度:计算一个bed下,每一个region 和另一个bed 文件的交集。

Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
​
BED FILE A  ***************     ***************     ******    **************
​
BED File B  ^^^^ ^^^^              ^^             ^^^^^^^^^    ^^^ ^^ ^^^^
 ^^^^^^^^                                      ^^^^^ ^^^^^ ^^
​
Result      [  N=3, 10/15 ]     [  N=1, 2/15 ]     [N=1,6/6]   [N=6, 12/14 ]
  • Usage:
   bedtools coverage [OPTIONS] -a  \
     -b 
  • 例子:统计A区间每一个元素和B文件交集情况
$ cat A.bed
    chr1  0   100
    chr1  100 200
    chr2  0   100
    ​
    $ cat B.bed
    chr1  10  20
    chr1  20  30
    chr1  30  40
    chr1  100 200
    ​
    $ bedtools coverage -a A.bed -b B.bed
    chr1  0   100  3  30  100 0.3000000
    chr1  100 200  1  100 100 1.0000000
    chr2  0   100  0  0   100 0.0000000

5.map

为每个重叠间隔对列应用一个函数。

image
  • Usage:
bedtools map [OPTIONS] -a  -b 
  • 例子:和A 区间有重叠的B中元素作为一个分组,运用函数计算最大值或者平均值等等。
   $ cat a.bed
    chr1        10      20      a1      1       +
    chr1        50      60      a2      2       -
    chr1        80      90      a3      3       -
    ​
    $ cat b.bed
    chr1        12      14      b1      2       +
    chr1        13      15      b2      5       -
    chr1        16      18      b3      5       +
    chr1        82      85      b4      2       -
    chr1        85      87      b5      3       +
    ​
    $ bedtools map -a a.bed -b b.bed
    chr1        10      20      a1      1       +       12
    chr1        50      60      a2      2       -       .
    chr1        80      90      a3      3       -       5

6.genomecov

染色体和全基因组覆盖度计算

image
  • Usage:
    bedtools genomecov [OPTIONS] [-i|-ibam] -g (iff. -i)
  • example:
    $ cat A.bed
    chr1  10  20
    chr1  20  30
    chr2  0   500
    ​
    $ cat my.genome
    chr1  1000
    chr2  500
    ​
    $ bedtools genomecov -i A.bed -g my.genome
    chr1   0  980  1000  0.98
    chr1   1  20   1000  0.02
    chr2   1  500  500   1
    genome 0  980  1500  0.653333
    genome 1  520  1500  0.346667
    # name 覆盖次数 覆盖碱基数 总碱基数 覆盖度
    # 同时计算单染色体和全基因组覆盖度

7.merge

https://www.jianshu.com/p/2efcb6f8f55d

许多数据集的基因组feature坐标经常是连续的(比如ChIPseq的结果),就像下图的蓝色部分

于是可以把这些连续的基因组小区间连接起来,拼成一个连续的大区间

image
  • Usage:
bedtools merge [OPTIONS] -i 
  • 例子:
需要拼接的输入文件(bed/gff/vcf)必须是排序(sort)过的【不sort会报错‍♂️】

先按染色体,再按起始位点,这样保证merge的算法执行起来非常顺畅,而基本不需要消耗内存再次加工

以exon.bed为例,展示merge的作用

注意看:第3行和第4行的区间是有重叠的,因此它们可以进行merge

image.png
bedtools merge -i exons.bed | head -10

可以看到,merge之后原来的第3行(13220 - 14409)和第4行(14361 - 14829)坐标合并成了13220 - 14829

[图片上传失败...(image-de7f16-1586285151489)]

8.cluster

类似merge 功能,合并重叠及其靠近的区间

image
  • Usage:
bedtools cluster [OPTIONS] -i 
  • example: 默认情况下合并有重叠的区域(1bp)
$ cat A.bed
 chr1  100  200
 chr1  180  250
 chr1  250  500
 chr1  501  1000
 ​
 $ bedtools cluster -i A.bed
 chr1  100     200     1
 chr1  180     250     1
 chr1  250     500     1
 chr1  501     1000    2

9.complement

https://www.jianshu.com/p/2efcb6f8f55d

bedtools complement 实现反选.
给定一个feature坐标信息文件,我们如果不关心其中标记的区间,而是想看看有哪些区间不在这个文件中.

例如,有一个ChIP-seq的peaks信息,现在想知道有哪些区域没有被抗体结合,就可以用complement

image
  • Usage:
bedtools complement -i  -g 
  • example:

    现在有了外显子的bed文件,通过反选,我们就能获得内含子或基因间区的坐标。那么既然是反选,除了exon.bed,还要有一个总体的范围(也就是基因组各个染色体的长度信息)【用-g指定】

 # 首先要确定整体的操作范围
  head -10 genome.txt
  ​
  chr1    249250621
  chr10   135534747
  chr11   135006516
  chr11_gl000202_random   40103
  chr12   133851895
  chr13   115169878
  chr14   107349540
  chr15   102531392
  chr16   90354753
  chr17   81195210
  ​
  # 实现反选,选出不属于exon的区域
  bedtools complement -i exons.bed -g genome.txt | head -10
  ​
  chr1    0   11873
  chr1    12227   12612
  chr1    12721   13220
  chr1    14829   14969
  chr1    15038   15795
  chr1    15947   16606
  chr1    16765   16857
  chr1    17055   17232
  chr1    17368   17605
  chr1    17742   17914

10 .subtract

https://www.jianshu.com/p/cb079a393661

从A中去掉B

image
  • Usage:
bedtools subtract [OPTIONS] -a  -b 
  • example:
    $ cat A.bed
    chr1  10   20
    chr1  100  200
    ​
    $ cat B.bed
    chr1  0    30
    chr1  180  300
    ​
    $ bedtools subtract -a A.bed -b B.bed
    chr1  100  180
  • 比如提取TSS ±2kb 区域外,peak 区间.为计算ROC做准备.
    具体代码:https://github.com/Helab-bioinformatics/itChIP/blob/master/07_ROC.R

11.slop

awk '{OFS="\t" print $1,$2-,$3+}'类似.
bedtools slop将限制染色体的大小(即没有不能小于 0和结束不能大于染色体长度)。

image
  • Usage:
bedtools slop [OPTIONS] -i  -g  [-b or (-l and -r)]
  • example :
    $ cat A.bed
    chr1 5 100
    chr1 800 980
    ​
    $ cat my.genome
    chr1 1000
    ​
    $ bedtools slop -i A.bed -g my.genome -b 5
    chr1 0 105
    chr1 795 985
    ​
    $ bedtools slop -i A.bed -g my.genome -l 2 -r 3
    chr1 3 103
    chr1 798 983

12.flank

将区间进行左右两侧扩充.

image
  • Usage:
bedtools flank [OPTIONS] -i  -g  [-b or (-l and -r)]
  • example:
$ cat A.bed
    chr1 100 200
    chr1 500 600
    ​
    $ cat my.genome
    chr1 1000
    ​
    $ bedtools flank -i A.bed -g my.genome -b 5
    chr1  95      100
    chr1  200     205
    chr1  495     500
    chr1  600     605
    ​
    $ bedtools flank -i A.bed -g my.genome -l 2 -r 3
    chr1  98      100
    chr1  200     203
    chr1  498     500
    chr1  600     603

13. sort

排序,部分命令需要排序过的bed文件

  • Usage:
bedtools sort [OPTIONS] -i 
  • example:
cat A.bed
   chr1 800 1000
   chr1 80  180
   chr1 1   10
   chr1 750 10000
   ​
   sortBed -i A.bed
   chr1 1   10
   chr1 80  180
   chr1 750 10000
   chr1 800 1000

14.random

bedtools random将以BED6格式生成一组随机的间隔。可以指定应该生成的间隔的数目(-n)和大小(-l)。

image
  • Usage:
bedtools random [OPTIONS] -g 
  • example:
$ bedtools random -g hg19.genome
 chr2  87536758        87536858        1       100     -
 chrX  46051735        46051835        2       100     +
 chr18 5237041 5237141 3       100     -
 chr12 45809998        45810098        4       100     +
 chrX  42034890        42034990        5       100     -
 chr10 77510935        77511035        6       100     -
 chr3  39844278        39844378        7       100     -
 chr6  101012700       101012800       8       100     +
 chr12 38123482        38123582        9       100     +
 chr7  88508598        88508698        10      100     -
 ​
 $ bedtools random -g hg19.genome
 chr3  141987850       141987950       1       100     +
 chr5  137643331       137643431       2       100     +
 chr2  155523858       155523958       3       100     -
 chr5  147874094       147874194       4       100     +
 chr1  71838335        71838435        5       100     -
 chr8  71154323        71154423        6       100     -
 chr2  133240474       133240574       7       100     +
 chr9  131495427       131495527       8       100     +
 chrX  125952943       125953043       9       100     +
 chr3  59685545        59685645        10      100     +

15.shuffle

bedtools shuffle将在基因组文件中定义的基因组中,随机排列bed文件中区域在基因组位置。

根据给定的bed文件获得随机区域,作为背景集

image
  • Usage:
bedtools shuffle [OPTIONS] -i  -g 
  • example: 默认情况下,bedtools shuffle将在随机染色体上的随机位置上重新定位输入BED文件中的每个特性。每个特征的大小和链被保留。
$ cat A.bed
  chr1  0  100  a1  1  +
  chr1  0  1000 a2  2  -

  $ cat my.genome
  chr1  10000
  chr2  8000
  chr3  5000
  chr4  2000

  $ bedtools shuffle -i A.bed -g my.genome
  chr4  1498  1598  a1  1  +
  chr3  2156  3156  a2  2  -

16.sample

从文件中随机取样记录.

Summary: Take sample of input file(s) using reservoir sampling algorithm.

Usage: bedtools sample [OPTIONS] -i 

WARNING:  The current sample algorithm will hold all requested sample records in memory prior to output.  The user must ensure that there is adequate memory for this.

17.spacing

报告文件中间隔之间的间隔长度

Summary: Report (last col.) the gap lengths between intervals in a file.

Usage: bedtools spacing [OPTIONS] -i 

-bed  If using BAM input, write output as BED.

-header  Print the header from the A file prior to results.

-nobuf  Disable buffered output. Using this option will cause each line  of output to be printed as it is generated, rather than saved  in a buffer. This will make printing large output files noticeably slower, but can be useful in conjunction with  other software tools and scripts that need to process one  line of bedtools output at a time.

-iobuf  Specify amount of memory to use for input buffer.  Takes an integer argument. Optional suffixes K/M/G supported.  Note: currently has no effect with compressed files.

Notes: (1) Input must be sorted by chrom,start (sort -k1,1 -k2,2n for BED).  (2) The 1st element for each chrom will have NULL distance. (".").  (3) Distance for overlapping intervals is -1 and 0 for adjacent intervals.

Example:


$ cat test.bed 
    chr1    0   10 
    chr1    10  20 
    chr1    19  30 
    chr1    35  45 
    chr1    100 200 
​ $ bedtools spacing -i test.bed ​ chr1 0 10 . ​ chr1 10 20 0 ​ chr1 19 30 -1 ​ chr1 35 45 5 ​ chr1 100 200 55

18.annotate

bedtools可以对一个BED / VCF / GFF文件进行注释,并具有从多个其他BED / VCF / GFF文件中观察到的覆盖范围和重叠数。 通过这种方式,它允许人们通过一个命令询问一个feature与其他多个feature类型的重合程度。

  • Usage
bedtools annotate [OPTIONS] -i  -files FILE1 FILE2 FILE3 ... FILEn
  • 计算输入的bed文件和其他多个文件交集个数.
chr1 100  200   nasty 1  -
 chr2 500  1000  ugly  2  +
 chr3 1000 5000  big   3  -

 $ cat genes.bed
 chr1 150  200   geneA 1  +
 chr1 175  250   geneB 2  +
 chr3 0    10000 geneC 3  -

 $ cat conserve.bed
 chr1 0    10000 cons1 1  +
 chr2 700  10000 cons2 2  -
 chr3 4000 10000 cons3 3  +

 $ cat known_var.bed
 chr1 0    120   known1   -
 chr1 150  160   known2   -
 chr2 0    10000 known3   +

 $ bedtools annotate -counts -i variants.bed -files genes.bed conserve.bed known_var.bed
 chr1  100     200     nasty   1       -       2       1       2
 chr2  500     1000    ugly    2       +       0       1       1
 chr3  1000    5000    big     3       -       1       1       0





Part2:Multi-way file comparisons

19. multiinter

标识多个bed文件之间的公共区间。

Summary: Identifies common intervals among multiple  BED/GFF/VCF files.

Usage: bedtools multiinter [OPTIONS] -i FILE1 FILE2 .. FILEn  Requires that each interval file is sorted by chrom/start.

20.unionbedg

unionbedg将多个BEDGRAPH文件组合成单个文件,这样就可以直接比较多个样本的覆盖率(如基因型)

  • Usage:
bedtools unionbedg [OPTIONS] -i FILE1 FILE2 FILE3 ... FILEn
  • example:
cat 1.bg
  chr1 1000 1500 10
  chr1 2000 2100 20

  cat 2.bg
  chr1 900 1600 60
  chr1 1700 2050 50

  cat 3.bg
  chr1 1980 2070 80
  chr1 2090 2100 20

  cat sizes.txt
  chr1 5000

  bedtools unionbedg -i 1.bg 2.bg 3.bg
  chr1 900  1000 0  60 0
  chr1 1000 1500 10 60 0
  chr1 1500 1600 0  60 0
  chr1 1700 1980 0  50 0
  chr1 1980 2000 0  50 80
  chr1 2000 2050 20 50 80
  chr1 2050 2070 20 0  80
  chr1 2070 2090 20 0  0
  chr1 2090 2100 20 0  20




Part3 : Paired-end manipulation

21. pairtobed

找出以各种方式重叠区间的对

Summary: Report overlaps between a BEDPE file and a BED/GFF/VCF file.

Usage: bedtools pairtobed [OPTIONS] -a  -b 

22.pairtopair

找出以各种方式重叠的配对。

pairToPair比较两个BEDPE文件以查找重叠,其中A中BEDPE特征的每个末端与B中特征的末端重叠。例如,使用pairToPair,可以在两个文件中筛选出完全相同的不一致双末端对齐方式。 这可能表明(除其他事项外)不一致的对表明每个文件/样本中的结构都相同。

  • Usage:
pairToPair [OPTIONS] -a  -b 
  • example: 默认情况下,如果两端都与BEDPE B文件中的特征重叠,则将报告A中的BEDPE特征。 如果存在两个BEDPE文件的链信息,则进一步要求两端的重叠都在同一链上。 这样,原本重叠的(就基因组位置而言)F / R比对将不与R / R比对匹配。
  Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    BEDPE A         >>>>>.................................>>>>>

    BEDPE B            <<<<<.............................>>>>>

    Result

    BEDPE A         >>>>>.................................>>>>>

    BEDPE B            >>>>>.............................>>>>>

    Result          >>>>>.................................>>>>>




Part 4 : Format conversion

23.bamtobed

bedtools bamtobed是一种比较实用转换程序,可将BAM格式的序列比对转换为BED,BED12和/或BEDPE记录。保存了位置信息,同时节约了空间.

  • Usage
bedtools bamtobed [OPTIONS] -i 
  • 比如将bam 转换成bed6 格式
    $ bedtools bamtobed -i reads.bam | head -3
    chr7   118970079   118970129   TUPAC_0001:3:1:0:1452#0/1   37   -
    chr7   118965072   118965122   TUPAC_0001:3:1:0:1452#0/2   37   +
    chr11  46769934    46769984    TUPAC_0001:3:1:0:1472#0/1   37   -

24.bedtobam

将bed 转换成bam 文件

  • Usage:
bedToBam [OPTIONS] -i  -g  > 
  • example:
head -5 rmsk.hg18.chr21.bed
chr21 9719768  9721892  ALR/Alpha  1004  +
chr21 9721905  9725582  ALR/Alpha  1010  +
chr21 9725582  9725977  L1PA3 3288 +
chr21 9726021  9729309  ALR/Alpha  1051  +
chr21 9729320  9729809  L1PA3 3897 -

bedToBam -i rmsk.hg18.chr21.bed -g human.hg18.genome > rmsk.hg18.chr21.bam

samtools view rmsk.hg18.chr21.bam | head -5
ALR/Alpha  0   chr21 9719769  255  2124M *  0  0  *  *
ALR/Alpha  0   chr21 9721906  255  3677M *  0  0  *  *
L1PA3      0   chr21 9725583  255  395M  *  0  0  *  *
ALR/Alpha  0   chr21 9726022  255  3288M *  0  0  *  *
L1PA3      16  chr21 9729321  255  489M  *  0  0  *  *

25.bamtofastq

bedtools bamtofastq是一个转换工具,用于从BAM格式的序列比对中提取FASTQ记录。

  • Usage
bedtools bamtofastq [OPTIONS] -i  -fq 
  • 比如下面例子
$ bedtools bamtofastq -i NA18152.bam -fq NA18152.fq

    $ head -8 NA18152.fq
    @NA18152-SRR007381.35051
    GGAGACATATCATATAAGTAATGCTAGGGTGAGTGGTAGGAAGTTTTTTCATAGGAGGTGTATGAGTTGGTCGTAGCGGAATCGGGGGTATGCTGTTCGAATTCATAAGAACAGGGAGGTTAGAAGTAGGGTCTTGGTGACAAAATATGTTGTATAGAGTTCAGGGGAGAGTGCGTCATATGTTGTTCCTAGGAAGATTGTAGTGGTGAGGGTGTTTATTATAATAATGTTTGTGTATTCGGCTATGAAGAATAGGGCGAAGGGGCCTGCGGCGTATTCGATGTTGAAGCCTGAGACTAGTTCGGACTCCCCTTCGGCAAGGTCGAA
    +
    <<<;;<;<;;<;;;;;;;;;;;;<<<:;;;;;;;;;;;;;;;;::::::;;;;<<;;;;;;;;;;;;;;;;;;;;;;;;;;;;<<<<<;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;<<;;;;;:;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;<<<;;;;;;;;;;<<<<<<<<;;;;;;;;;:;;;;;;;;;;;;;;;;;;;:;;;;8;;8888;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;8966689666666299866669:899
    @NA18152-SRR007381.637219
    AATGCTAGGGTGAGTGGTAGGAAGTTTTTTCATAGGAGGTGTATGAGTTGGTCGTAGCGGAATCGGGGGTATGCTGTTCGAATTCATAAGAACAGGGAGGTTAGAAGTAGGGTCTTGGTGACAAAATATGTTGTATAGAGTTCAGGGGAGAGTGCGTCATATGTTGTTCCTAGGAAGATTGTAGTGGTGAGGGTGTTTATTATAATAATGTTTGTGTATTCGGCTATGAAGAATAGGGCGAAGGGGCCTGCGGCGTATTCGATGTTGAAGCCTGAGACTAGTTCGGACTCCCCTTCCGGCAAGGTCGAA
    +
    <<<<<<<<<<;;<;<;;;;<<;<888888899<;;;;;;<;;;;;;;;;;;;;;;;;;;;;;;;<<<<<;;;;;;;;;<;<<<<<;;;;;;;;;;;;;<<<<;;;;;;;:::;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;<<<<;;;;;;;;;;;;;;;;;;;;;;;<;;;;;;;;;;;;;;;;;;;;;;<888<;<<;;;;<<<<<<;;;;;<<<<<<<<;;;;;;;;;:;;;;888888899:::;;8;;;;;;;;;;;;;;;;;;;99;;99666896666966666600;96666669966

26.bedpetobam

Summary: Converts feature records to BAM format.

Usage: bedpetobam [OPTIONS] -i  -g 

Options: -mapq  Set the mappinq quality for the BAM records.  (INT) Default: 255

-ubam  Write uncompressed BAM output. Default writes compressed BAM.

Notes: (1) BED files must be at least BED4 to create BAM (needs name field).

27.bed12tobed6

bed12ToBed6是一种方便的工具,它可以将BED12中的bed特征(即堆叠型的bed特征,如基因)转换为离散的bed特征。例如,对于一个有六个外显子的基因,bed12ToBed6会产生六个单独的BED6特征(即,每个外显子对应一个)。

  • Usage:
   bed12ToBed6 [OPTIONS] -i 
  • 基因多个外显子转换成每一个外显子单独一行
  head data/knownGene.hg18.chr21.bed | tail -n 3
    chr21 10079666  10120808   uc002yiv.1  0  -  10081686  1 0 1 2 0 6 0 8  0     4   528,91,101,215, 0,1930,39750,40927,
    chr21 10080031  10081687   uc002yiw.1  0  -  10080031  1 0 0 8 0 0 3 1  0     2   200,91,    0,1565,
    chr21 10081660  10120796   uc002yix.2  0  -  10081660  1 0 0 8 1 6 6 0  0     3   27,101,223,0,37756,38913,

    head data/knownGene.hg18.chr21.bed | tail -n 3 | bed12ToBed6 -i stdin
    chr21 10079666  10080194  uc002yiv.1 0  -
    chr21 10081596  10081687  uc002yiv.1 0  -
    chr21 10119416  10119517  uc002yiv.1 0  -
    chr21 10120593  10120808  uc002yiv.1 0  -
    chr21 10080031  10080231  uc002yiw.1 0  -
    chr21 10081596  10081687  uc002yiw.1 0  -
    chr21 10081660  10081687  uc002yix.2 0  -
    chr21 10119416  10119517  uc002yix.2 0  -
    chr21 10120573  10120796  uc002yix.2 0  -




Part5 : Fasta manipulation

28.getfasta

https://www.jianshu.com/p/6c3b87301491

根据坐标区域来从基因组里面提取fasta序列

image
  • Usage
$ bedtools getfasta [OPTIONS] -fi  -bed 
  • example:

参考:# BED/GFF/VCF +reference --> fasta

bedtools getfasta -fi ~/biosoft/bowtie/hg19_index/hg19.fa  -bed ../macs14_results/highQuality_summits.bed  -fo highQuality.fa
bedtools getfasta -fi ~/biosoft/bowtie/hg19_index/hg19.fa  -bed ../macs14_results/highQuality_peaks.bed  -fo highQuality.fa

脚本里面用的是bed格式来记录坐标区域,参考基因组用-fi参数指定具体位置,输出的fasta序列文件用-fo参数指定

tips: 有用的三个参数
image.png
-s  Force strandedness. If the feature occupies the antisense,
        strand, the sequence will be reverse complemented.
        - By default, strand information is ignored.
-name   Use the name field for the FASTA header
-tab    Write output in TAB delimited format.
        - Default is FASTA format.

-s : 当提取数据,区分正负链时候,需要添加-s 参数,自动提取此区间负链反向
的序列(方向互补)
-name : 提取bed文件第四列区间name,为fasta 文件输出注释行.
-tab : fasta文件名称和序列以tab 分割。

29.maskfasta

和getfasta 相反,屏蔽区间.

image
  • Usage
$ bedtools maskfasta [OPTIONS] -fi  -bed  -fo 
  • example:
$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10

$ bedtools maskfasta -fi test.fa -bed test.bed -fo test.fa.out

$ cat test.fa.out
>chr1
AAAAANNNNNCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

30.nuc

分析FASTA文件中,bed 区间对应的核苷酸含量

Summary: Profiles the nucleotide content of intervals in a fasta file.

Usage: bedtools nuc [OPTIONS] -fi  -bed 





Part6 : BAM focused tools

31.multicov

https://www.jianshu.com/p/6c3b87301491

提供的每个bed间隔,它报告来自每个BAM文件的重叠对齐的单独计数。类似功能 featurecount/deeptools multiBamSummary

  • Usage:
bedtools multicov [OPTIONS] -bams BAM1 BAM2 BAM3 ... BAMn -bed  
  • example:

对RNA-seq的比对文件中的比对到各个基因的reads进行计数。**

# 例子:
bedtools multicov -bams aln1.bam aln2.bam aln3.bam -bed ivls-of-interest.bed
# ivls-of-interest.bed这个文件是必须的,可能需要自己制作,其实用gtf文件也可以的,如下:
chr1 0   10000   ivl1
chr1 10000   20000   ivl2
chr1 20000   30000   ivl3
chr1 30000   40000   ivl4

输出结果前三列是坐标,第四列是基因名,跟我们的bed文件一样,只是最后三列是三个样本的计数,是添加上来的!

chr1 0       10000   ivl1    100 2234    0
chr1 10000   20000   ivl2    123 3245    1000
chr1 20000   30000   ivl3    213 2332    2034
chr1 30000   40000   ivl4    335 7654    0

32. tag

Tag BAM alignments based on overlaps with interval files.

Summary: Annotates a BAM file based on overlaps with multiple BED/GFF/VCF files  on the intervals in -i.

Usage: bedtools tag [OPTIONS] -i  -files FILE1 .. FILEn -labels LAB1 .. LABn




Part 7 :Statistical relationships

33.jaccard

检测两个数据之间的相关性

引入一个新的bedtools工具jaccard,它会计算一个杰卡德相似性系数

结果是0.0 to 1. 0的值,数越小相关性越小

# 检测同一个样本的不同数据【系数是0.50637】
bedtools jaccard \
    -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
    -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed

intersection    union-intersection  jaccard n_intersections
81269248    160493950   0.50637 130852

# 再看不同的样本的不同数据【系数是0.170995】
bedtools jaccard \
    -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
    -b fSkin_fibro_bicep_R-DS19745.hg19.hotspot.twopass.fdr0.05.merge.bed

intersection    union-intersection  jaccard n_intersections
28076951    164197278   0.170995    73261
##### 另外,还能分析更多的样本之间相关性 这个就看(官网[http://quinlanlab.org/tutorials/bedtools/bedtools.html](https://links.jianshu.com/go?to=http%3A%2F%2Fquinlanlab.org%2Ftutorials%2Fbedtools%2Fbedtools.html))翻到最底部 ![img](https://upload-images.jianshu.io/upload_images/9376801-806a9711daa5ebfe.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/878/format/webp)

34.reldist

计算两个文件的相对距离

image

总结两组基因组区间之间相似性的传统方法是基于相交区间的数量或比例。 但是,这种测量很大程度上看不到两组之间的空间相关性,尽管间距或邻近度一致,但相交很少见(例如,增强子和转录起始位点很少重叠,但与两组随机数相比,它们彼此之间的距离更近 间隔)。 Favorov等人[1]提出了一种相对距离度量标准,该度量描述了一组中每个间隔与另一组中两个最近间隔之间的相对距离分布(请参见上图)。 如果两组之间没有空间相关性,则可以期望相对距离在0到0.5的相对距离之间均匀分布。 但是,如果间隔趋于比偶然预期的要近得多,则观察到的相对距离的分布将朝较低的相对距离值(例如,下图)移动。

image
[1] Exploring Massive, Genome Scale Datasets with the GenometriCorr Package.
Favorov A, Mularoni L, Cope LM, Medvedeva Y, Mironov AA, et al. (2012)
PLoS Comput Biol 8(5): e1002529\. doi:10.1371/journal.pcbi.1002529

Usage:

bedtools reldist [OPTIONS] -a  -b 

example:

$ bedtools reldist \
    -a data/refseq.chr1.exons.bed.gz \
    -b data/
    aluY.chr1.bed.gz
0.00  164 43408 0.004
0.01  551 43408 0.013
0.02  598 43408 0.014
0.03  637 43408 0.015
0.04  793 43408 0.018
0.05  688 43408 0.016
0.06  874 43408 0.020
0.07  765 43408 0.018
0.08  685 43408 0.016
0.09  929 43408 0.021
0.10  876 43408 0.020
0.11  959 43408 0.022
0.12  860 43408 0.020
0.13  851 43408 0.020
0.14  903 43408 0.021
0.15  893 43408 0.021
0.16  883 43408 0.020
0.17  828 43408 0.019
0.18  917 43408 0.021
0.19  875 43408 0.020
0.20  897 43408 0.021
0.21  986 43408 0.023
0.22  903 43408 0.021
0.23  944 43408 0.022
0.24  904 43408 0.021
0.25  867 43408 0.020
0.26  943 43408 0.022
0.27  933 43408 0.021
0.28  1132  43408 0.026
0.29  881 43408 0.020
0.30  851 43408 0.020
0.31  963 43408 0.022
0.32  950 43408 0.022
0.33  965 43408 0.022
0.34  907 43408 0.021
0.35  884 43408 0.020
0.36  965 43408 0.022
0.37  944 43408 0.022
0.38  911 43408 0.021
0.39  939 43408 0.022
0.40  921 43408 0.021
0.41  950 43408 0.022
0.42  935 43408 0.022
0.43  919 43408 0.021
0.44  915 43408 0.021
0.45  934 43408 0.022
0.46  843 43408 0.019
0.47  850 43408 0.020
0.48  1006  43408 0.023
0.49  937 43408 0.022

35.fisher

对2个文件之间的重叠/唯一区间进行费舍尔的精确测试。

Given a pair of input files -a and -b in the usual BedTools parlance:

$ cat a.bed
chr1  10  20
chr1  30  40
chr1  51      52

$ cat b.bed
chr1  15   25
chr1  51      52
And a genome of 500 bases: $ echo -e "chr1\t500" > t.genome We may wish to know **if the amount of overlap between the 2 sets of intervals is more than we would expect given their coverage and the size of the genome**. We can do this with `fisher` as: $ bedtools fisher -a a.bed -b b.bed -g t.genome # Number of query intervals: 3 # Number of db intervals: 2 # Number of overlaps: 2 # Number of possible intervals (estimated): 37 # phyper(2 - 1, 3, 37 - 3, 2, lower.tail=F) # Contingency Table Of Counts #_________________________________________ # | in -b | not in -b | # in -a | 2 | 1 | # not in -a | 0 | 34 | #_________________________________________ # p-values for fisher's exact test left right two-tail ratio 1 0.0045045 0.0045045 inf

Part8 : Miscellaneous tools

36.overlap

和intersect 功能类似,将两个输入文件合并了,但是通过参数来指定那几列进行比较

Usage:

overlap [OPTIONS] -i  -cols s1,e1,s2,e2
| Option | Description | | --- | --- | | **-i** | Input file. Use “stdin” for pipes. | | **-cols** | Specify the columns (1-based) for the starts and ends of the features for which you’d like to compute the overlap/distance. The columns must be listed in the following order: *start1,end1,start2,end2* . | example: windowBed -a A.bed -b B.bed -w 10 chr1 10 20 A chr1 15 25 B chr1 10 20 C chr1 25 35 D ## 指定2,3 列与6,7 列比较 windowBed -a A.bed -b B.bed -w 10 | overlap -i stdin -cols 2,3,6,7 chr1 10 20 A chr1 15 25 B 5 chr1 10 20 C chr1 25 35 D -5

37.igv

用于生成一个脚本,批量捕获IGV截图

Summary: Creates a batch script to create IGV images at each interval defined in a BED/GFF/VCF file.

Usage: bedtools igv [OPTIONS] -i 

38. links

创建一个链接到UCSC的HTML页面

Usage:

linksBed [OPTIONS] -i  > 
| Option | Description | | --- | --- | | **-base** | The “basename” for the UCSC browser. *Default: [http://genome.ucsc.edu](http://genome.ucsc.edu)* | | **-org** | The organism (e.g. mouse, human). *Default: human* | | **-db** | The genome build. *Default: hg18* | example: **linksBed** creates links to the public UCSC Genome Browser. head -3 genes.bed chr21 9928613 10012791 uc002yip.1 0 - chr21 9928613 10012791 uc002yiq.1 0 - linksBed -i genes.bed -base http://mirror.uni.edu -org mouse -db mm9 > genes.html

39.makewindows

https://www.jianshu.com/p/7d47d8074bba

把给定区域划分成指定大小和间隔的小区间 (bin)

  • 参考染色体大小文件chrom.size
chr1    100
chr2    150
  • 将染色体划分为20bp为一个bin的区间
$ bedtools makewindows -g  chrom.size -w 20
chr1    0   20
chr1    20  40
chr1    40  60
chr1    60  80
chr1    80  100
chr2    0   20
chr2    20  40
chr2    40  60
chr2    60  80
chr2    80  100
chr2    100 120
chr2    120 140
chr2    140 150

40.groupby

https://www.jianshu.com/p/548d370b75a4

分组结算,不只可以用于bed文件。

以某一列进行分组,运用不同的函数

Usage

bedtools groupby [OPTIONS] -i  -g  -c  -o 
img

41. expand

将某一列是逗号分隔的,差分成多行显示.

Summary: Replicate lines in a file based on columns of comma-separated values.

Usage:  bedtools expand -c [COLS] Options: -i  Input file. Assumes "stdin" if omitted.

-c Specify the column (1-based) that should be summarized.

*   Examples:

   $ cat test.txt
      chr1  10  20  1,2,3   10,20,30
      chr1  40  50  4,5,6   40,50,60

    $ bedtools expand test.txt -c 5
    chr1    10  20  1,2,3   10
    chr1    10  20  1,2,3   20
    chr1    10  20  1,2,3   30
    chr1    40  50  4,5,6   40
    chr1    40  50  4,5,6   50
    chr1    40  50  4,5,6   60

    $ bedtools expand test.txt -c 4,5
    chr1    10  20  1   10
    chr1    10  20  2   20
    chr1    10  20  3   30
    chr1    40  50  4   40
    chr1    40  50  5   50
    chr1    40  50  6   60

42. split

将一个文件分割成多个具有相同记录或基对的文件

Summary: Split a Bed file.

Usage: bedtools split [OPTIONS] -i  -n number-of-files

Options: 
-i|--input (file)  BED input file (req'd). 
-n|--number (int)  Number of files to create (req'd).  
-p|--prefix (string)  Output BED file prefix.  
-a|--algorithm (string) Algorithm used to split data.

      so all files contain the ~ same number of bases
    * simple : route records such that each split file has
      approximately equal records (like Unix split).
-h|--help Print help (this screen). -v|--version Print version. Note: This programs stores the input BED records in memory.

小结:

bedtools 很强大,后面用到了再对细节进行补充

有很多不足地方,希望大家留言指正~~~

你可能感兴趣的:(【不定期更新】Bedtools大全--42个子命令介绍)