bam deal

pibase tools for validational and comparative analysis of BAM filespibase is an open-source package of linux command line tools for validating next-generation sequencing loci (SNPs and loci of interest where no SNPs are known) and for comparative analyses using Fisher's exact test.Acknowledgement: The development of pibase was partly funded by: The German Ministry of Education and Research (BMBF); the National Genome Research Network (NGFN); the Deutsche Forschungsgemeinschaft (DFG) Cluster of Excellence 'Inflammation at Interfaces'; the EU Seventh Framework Programme [FP7/2007-2013, grant numbers 201418, READNA and 262055, ESGI].Disclaimer: pibase is provided free of charge for non-commercial use but you are required to read our disclaimer and to cite us when publishing results.Download: pibase 1.4.7 example data (12GB) example output only (130kb)Overviewpibase Acronym for: get Position Information at BASE position of interest.Interoperability Input and output file types.Work flows Preparing BAM-files and using the pibase tools.QuickStart Tutorial Prerequisites, installation, and pibase examples using BAM-files from the 1000 Genomes project.Essentials:pibase_bamref Extract information from a BAM-file and a reference sequence file and table this information into a tab-separated text file.pibase_consensus Infer 'best' genotypes and their 'quality' classification, and optionally merge multiple pibase_bamref files (e.g. a control panel or several runs of the same patient) into a single file.pibase_fisherdiff Compare two pibase_consensus files using Fisher's exact test on original data (aligned reads) rather than comparing processed data (SNP-calls or genotypes).Annotate:pibase_tosnpacts Works with our unpublished annotation pipeline. Contact us if you wish us to annotate your pibase files with our annotations.pibase_annot Works with our unpublished annotation pipeline. Contact us if you wish us to annotate your pibase files with our annotations.pibase_tag Add primer region tags to a pibase_bamref, pibase_consensus, pibase_fisherdiff, or pibase_annot file.Phylogenetics:pibase_to_rdf Generate rdf file (for phylogenetic network analysis) from set of pibase_fisherdiff files.pibase_rdf_ref Generate a reference sample file from a pibase_consensus file, prior to generating the set of pibase_fisherdiff files required for pibase_to_rdf.pibase_chrm_to_crs Extract Cambridge Reference Sequence (Anderson 1981) variants from reads mapped to chrM (hg18, hg19, or NCBI36).Utilities:pibase_to_vcf Convert a single-sample pibase file into VCFv4.1 format.pibase_c_to_contig Convert pibase-contig-numbers into contig names (e.g. 25 -> chrM).pibase_flagsnp Flag non-reference genotypes in a pibase_consensus file as potential mismatches ("SNPs" in NGS-parlance).pibase_diff Compare two pibase_consensus files using (BestGen) genotypes and a (BestQual) quality threshold.pibase_gen_from_snpacts Works with our unpublished annotation pipeline. Contact us if you wish us to merge SNPs from SNP-callers, SNP-chips, or Sanger sequences into your pibase files for a genotype-comparison.pibase_ref Gets 500 flanking nucleotides of reference sequence around each coordinate (of the input list) and outputs a pseudo-FASTA file. For example for manual Sanger-sequencing primer design.Interoperabilitypibase reads genomic coordinates of interest from a VCF*, samtools pileup, SOLiD Bioscope gff3, or a tab-separated file.pibase extracts data at the coordinates of interest from an indexed FASTA reference and from a BAM-file** generated by BFAST, BWA, SSAHA2, samtools, SOAP (after conversion using soap2sam.pl), and SOLiD Bioscope. To extract the most complete information (including homologous region information and low-coverage genotypes), please use the raw unfiltered BAM-file (which includes non-uniquely mapped reads and duplicate reads).pibase outputs tab-separated text files which can then be used in popular spreadsheet software, or filtered from the linux command line using grep, awk, and cut. pibase can also output variants into VCF, rdf, and snpActs formats.*VCF is the variant list format accepted by European Nucleotide Archive. (VCF files are e.g. generated by GATK and samtools.)**BAM is the sequence format preferred by the European Nucleotide Archive.Work flows"Essentials" work flowPrepare an indexed sorted BAM-file with MD tags using samtools: Create BAM file with MD tag. If you have several BAM-Files for the same sample, and some BAM-files were generated from the same library: merge all BAM-files from the same library into one BAM-file (because pibase_consensus counts the number of unique start points independently per BAM-file).[samtools sort -o unsorted.bam > bamfile.bam]samtools calmd -b bamfile.bam referencefile.fasta > bamfile.md.bam[samtools merge outfile.bam infile1.bam infile2.bam [...]]samtools index bamfile.md.bampibase_bamref : extract position info from BAM file and reference sequence file.pibase_consensus over single run: infer multi-filter-level genotypes from a single pibase_bamref-file and classify the genotypes into stable or dubious genotypes (BestQual flag).pibase_consensus over multiple runs: infer multi-filter-level ''consensus'' genotypes from pibase_bamref-files from multiple runs and classify the genotypes into stable or dubious genotypes (BestQual flag).[Optional: pibase_fisherdiff : compare two samples by unique start point counts (Fisher's exact test 2x4), using the pibase_consensus-files]"Annotate" work flowFirst, carry out the "Essentials" workflow, in order to get a pibase_consensus file or a pibase_fisherdiff file.[Optional: pibase_tosnpacts : export genotypes from pibase_consensus or pibase_fisherdiff or pibase_diff to snpact-format (snpact -ft 'own' -sp 1).[Optional: pibase_annot : annotate pibase_diff files with rsID, gene name, and other information][Optional: pibase_tag : for PCR enriched samples, tag snps which are in PCR primer regions +- 1 base, using pibase_consensus-files or pibase_diff-files or pibase_annot files]"Phylogenetics" work flowFirst, carry out the "Essentials" workflow, in order to get a set of pibase_consensus files.[Optional: pibase_rdf_ref : create a reference sample from one of the pibase_consensus files.]Create a tab-separated text file detailing the sample files in the group, and the sample names which should be displayed in the phylogenetic network (see pibase_to_rdf). The first sample in this text file can either be one of the group of samples, or the reference sample created by pibase_rdf_ref.pibase_to_rdf : create an rdf file from a set of pibase_fisherdiff files for subsequent phylogenetic network analysis, e.g. evolutionary analysis, or sample mix-up (or confusion) analysis.[Optional: pibase_chrm_to_crs : extract Cambridge Reference Sequence (Anderson et al., 1981) variants from reads mapped to chrM (hg18, hg19, or NCBI36)]"Utilities" (Tools for special interests)First, carry out the "Essentials" workflow, in order to get a pibase_consensus file or optionally a pibase_fisherdiff file.[Optional: pibase_to_vcf : From a pibase_consensus file, create a VCF file that can e.g. be submitted to the European Nucleotide Archive or processed with VCFtools.][Optional: pibase_c_to_contig : Convert pibase-contig-numbering (e.g. in a pibase_consensus file or a pibase_fisherdiff file) to contig names, e.g. 25 to chrM.[Optional: pibase_flagsnp : Flag non-reference genotypes in a pibase_consensus file as potential mismatches ("SNPs" in NGS-parlance), using the pibase_consensus-files, pibase_fisherdiff-files or annotated versions of these files.][Optional: pibase_diff : compare two samples (conventionally) by genotypes, using the pibase_consensus-files. This is much less accurate than pibase_fisherdiff and only included for those users interested in a conventional comparison!][Optional: pibase_gen_from_snpacts : Merge SNPs from SNP-callers, SNP-chips, or Sanger sequences into pibase_consensus, pibase_fisherdiff, or annotated versions of these files. The idea is to generate a side-by-side genotype and nucleotide signals comparison table.]QuickStart TutorialPrerequisitesLinux operating system (we use CentOS 5.5 / linux 2.6.18-194.32.1.el5 on our linux cluster and Ubuntu 8, 9, or 10 on our PCs.)Python v2.4.3 or v2.6.5 or v2.7.2 (recommended) (usually already installed on linux clusters or linux PCs)pysam v0.6 (if using other versions, test for pysam bugs using large data sets)GNU Fortran (usually already installed on linux clusters, or installable using the Synaptics package manager under Ubuntu PCs)1GB of RAM (2GB for pibase_fisherdiff)Bash command line, or a linux cluster job scheduler such as PBS.InstallationDownload pibase.v1.4.7.tar.gz and unzip.From the linux command line, navigate into the pibase directory and enter ./f.sh to compile the pibase_643 subprogram.Finally, include the pibase directory in your linux PATH variable.Test runTest whether the installations were successful using our example data (12GB size!): Download the 12GB tar.gz file and unzip. From the linux command line, navigate into the example data folder and then run the example shell script by entering./pibase_test.sh(If problems occur, check the pre-requisites).When pibase_test.sh runs successfully, it creates a directory called output, into which the results files are written.Next, run the phylogenetic example shell script by entering./pibase_to_rdf.shTo validate the test runs, compare directories output and output_validated from the linux command line:diff output output_validatedWindows usersWindows users (and impatient Linux users) can download just the small zipped output_validated folder (130kb) for a quick impression of the output files. Import these tab-separated text files into Excel for viewing, filtering, or sorting. Note that there are floating point and floating comma versions of the files, the latter of which are generated using simple linux commands in the example shell script pibase_test.sh.Detailed examplesPlease look into the shell script pibase_test.sh and pibase_to_rdf.sh web pages which give several examples how to start the pibase tools.Essentials | pibase_bamrefpibase_bamref extracts information at genomic coordinates of interest from a BAM file and a reference sequence file and tables this information into a tab-spearated text file. Optionally, information can be displayed on the command line. A single genomic coordinate can be specified on the command line, or a list of coordinates can be specified in a file. Note that BAM-files must include MD-tags.Usage:pibase_bamref [-[v][d][p chr]] {table or poi} ref bam out LR QVmin [[[MLmin] [[MMmax] [RQVmin] [maxr]]]] [] optional -vdp [v]erbose output into terminal, [d]etailed list of reads, at single [p]osition: chr poi chr: chromosome name (or contig name) poi: position of interest (chromosomal coordinate, 1-based) table: file of poi's: vcf, gff3, pileup, or plain tab-file: chr TAB poi LINEBREAK The format is detected by pibase_bamref. If the plain tab-file is used, the lines must be sorted by chr-names, and the chr-names must be ordered in the sequence of the reference sequence file or BAM-file-header. ref: reference sequence file bam: sorted bam file, bam index file must also exist out: name for (tab-separated) output file LR: max length of reads (e.g. 50) QVmin: optional min base quality (default: 20) MLmin: optional min mapped read length (default: 49) MMmax: optional max number of base space mismatches in read (default: 1) RQVmin: optional min read mapping quality (default: 20) maxr: optional max number of reads considered at a poi (default: -1) (default: unlimited. For ultra-high coverage files: use an integer value of e.g. 1000 as the limit).Examples:pibase_bamref -vp chrM 16189 hg20.fasta na12752.bam na12752_chrM_16189.txtpibase_bamref list.txt hg18.fasta na12752.bam na12752_list.txtpibase_bamref pileupsnps.txt hg18.fasta na12752.bam na12752_pileupsnps.txtpibase_bamref gatksnps.vcf hg18.fasta na12752.bam na12752_gatksnps.txtNotes:If you are generating your own input lists with positions of interest: The genomic coordinates must be 1-based (the UCSC browser, VCF SNPlists, samtools pileup SNPlists, and Bioscope SNPlists use 1-based coordinates). SNPlists from Bioscope 1.2.1 or samtools pileup can be used without checking - these are ok.pibase_bamref speed depends on depth of coverage and read length, and on python/pysam version. For high coverage BAM-files (50x-100x) and 50bp read lengths, pibase_bamref v1.4.5 processes about 50,000-100,000 genotypes per hour on a single core using 2GB RAM . At least python v2.7 and at least pysam v0.5 are required for this speed, otherwise pibase_bamref is up to 100x slower. As there are critical but silent bugs in other pysam versions, please do not use other pysam versions unless you test that version thoroughly (e.g. by running our chr22 and exome example data sets and performing a diff)!For the most complete information: give pibase the original BAM file (which should include the ambiguously mapped reads, and which can also include "duplicate" reads)Essentials | pibase_consensuspibase_consensus adds information to a pibase_bamref file, inferring 'best' genotypes and 'best qualities', and optionally merging read counts from multiple pibase_bamref files (e.g. a control panel or several runs of the same patient) into a single pibase_consensus file. A BestQual question mark ("?") indicates that the BestGen is instable with respect to filtering, and therefore should not be used without further validations.Usage:pibase_consensus [-t[h][v [ver]] t1 t2 t3 t4 t5 {h1 h2}] in1 [in2 ... inN] out []=optional {}=only for "-th" option [-t[h][v] (optional) minimal thresholds for allele calling (A, C, G, or T), (optionally) followed by "h" to override h1 and h2 defaults, (optionally) followed by "v" to revert to an older version [ver] (optional) older pibase_consensus version to revert to. To list all available older versions, specify "-tv" without ver (older version 1.4.3: arguments h1 and h2 not available) t1 min fraction of reads indicating allele (default: 2.2%, i.e. 0.022) t2 min number of reads indicating allele (default: 8) t3 min fraction of unique start points indicating allele (default: 0.04) t4 min number of unique start points indicating allele (default: 4) t5 both-stranded confirmation only if, for at least one filter level, min-strand-counts >= t5 * max-strand-counts (default t5: 0.2) {h1 min fraction of unique start points that need to be different between filter F2-F3 resp. F3-F4 in order to flag a hypervariable resp. homologous locus (default: 2.2%, i.e. 0.022) h2} ] min number of unique start points difference between F2-F3 resp. F3-F4 in order to flag a hypervariable resp. homologous locus (default: 2) inX input file[s] = pibase_bamref output file[s] must have same reference sequence (not checked!) and same positions of interest out name for (tab-separated) output fileExamples:# Single filepibase_consensus na12752_pileupsnps.txt genotypes_na12752_pileupsnps.txt# Two filespibase_consensus na12752_1.txt na12752_2.txt gen_na12752_1to2.txt# Decreased sensitivity to sequencing or mapping errors or contaminationpibase_consensus -t 0.1 8 0.15 8 0.01 na12752_pileupsnps.txt genotypes_na12752_pileupsnps.txtNotes:What pibase_consensus does:Reads the pibase_bamref file and adds further columns of information: For each genomic coordinate of interest: 10 rule-based genotype decisions and one "BestGen" (best genotype) from these 10 genotypes, strand support for the A and B alleles of the "BestGen", and summary allele counts for each of the 10 genotypes.If multiple pibase_bamref files are specified, consensus genotypes are computed for pooled allele counts over all files (e.g. for a library which was sequenced in multiple runs or multiple lanes). The output file format is identical to the single-file output except that all specified input files are listed the file header.Low quality genotypes are annotated with BestQual tags ?1 to ?8; where ?1 denotes not too bad quality and genotypes with higher values should be rejected or inspected further. If there is no tag, the genotype is assumed to be validated by all 10 filter methods at a minimal coverage of t2 (default: 8) per allele of which there are at least t4 (default: 4) unique starting points per allele. The quality grading aims to help you understand the underlying mechanisms of each problem.Both-strandedness confirmation must be applied using the additional tags A+- and B+-. (Because some enrichment methods are strand-biased and occasional failure of both-stranded support is common in next-generation sequencing, we did not bundle the strandedness criterion into the BestQual tag).For slightly higher specificity, the SNVs near indels can be filtered using the additional tag Ign.Regions of simple repeats can be filtered using the Class tag (>1), regions of segmental duplications can be filtered using the Homologous tag ("H"), and hypervariable regions can be filtered using the Hypervariable tag ("V").Best Quality tag:?1 : Slightly poor genotype quality. Mapping stringency versus reference sequence context class looks ok. Not all 10 genotyping filter stages lead to the same genotype. However, for the high mapping stringency filter stages, at least t4 (default: 4) unique start points and at least t2 (default: 8) reads support this genotype.?2 : Somewhat poor genotype quality. Mapping stringency versus reference sequence context class looks ok. This genotype is supported by less than 5 filter stages, but by at least 2 filter stages, of which one stage is in the unique start points category, and the other stage is in the coverage category.?3: Really poor quality. Reference sequence context looks difficult (homopolymeric run>4, or STRs) and mapping stringency was low. But at least one stringent filter supports this genotype.?4: Very poor quality. Reference sequence context looks difficult (homopolymeric run>4, or STRs) and mapping stringency was low. But at least one of the unique-start-point filters supports this genotype.?5: Highly problematic quality. The best unique-start-point derived genotype is in conflict to the best coverage-derived genotype.?6: Highly problematic quality. The best unique-start-point derived genotype is in conflict to the best coverage-derived genotype, and the best coverage-derived genotype is "senior" to the best usp-derived genotype.?7: Low-coverage guess. The coverage is below t2 (default: 8) (can be as low as 1).?8: Low-coverage guess. The coverage is below t2 (default: 8) (can be as low as 1). Reference sequence context looks difficult (homopolymeric run>4, or STRs), and there are no stringently mappable reads.Poor quality genotypes and single-strand genotypes can be filtered using linux commands as follows:# copy header lines into new file:grep '^#' pibase_consensus_file > validated_pibase_consensus_file# append filtered table lines (indel-reads < 3, tags A+- and B+- and BestQual) to the new file:awk 'BEGIN {FS="\t"};($7<3 10="='+-')&&($11=='+-')'" pibase_consensus_file="" grep="" -v="" validated_pibase_consensus_file="" potentially="" problematic="" regions="" can="" be="" filtered="" or="" counted="" using="" linux="" commands="" as="" follows:="" count="" number="" of="" genotypes="" in="" homologous="" loci:="" grep="" -v="" pibase_consensus_file="" grep="" h="" wc="" -l="" count="" number="" of="" genotypes="" in="" hypervariable="" loci:="" grep="" -v="" pibase_consensus_file="" grep="" v="" wc="" -l="" count="" number="" of="" genotypes="" near="" indels:="" awk="" begin="" fs="\t" 7="">0)' pibase_consensus_file | wc -lEssentials | pibase_fisherdiffpibase_fisherdiff merges two pibase_consensus files and tags differences between these two files (e.g. control sample and case sample) based on Fisher's exact test for five different filter levels. (See example.)Usage:pibase_fisherdiff control case out [[cov] [[p] [[fac]]] control: pibase_consensus file of control (healthy) sample case: pibase_consensus file of case (diseased) sample out: output file [] optional [cov]: min coverage threshold (optional. Default: 100) [p]: max median p-value threshold (optional. Default: 0.01) [fac]: max factor (optional. Default: 3.5).Example:# min coverage 50, p <= 0.01, fac <= 10pibase_fisherdiff normal.txt tumor.txt diff_out.txt 50 0.01 10Notes:Eight leading columns are added to the pibase_consensus output, and the two samples are merged into one file which can then be filtered using grep or awk or python/perl/java/etc scripts, or imported into Excel by Windows users and viewed/filtered further in Excel:In the leading column, significant similarity at a genomic coordinate is denoted by "=-" (control sample) or "=+" (case sample).Significant difference at a genomic coordinate is denoted by "-" (control) and "+" (case).Low coverage in one or both samples is denoted by "?-" and "?+".For those who want to understand the underlying mechanism: The control and case samples are computed to be significantly different at a genomic coordinate if the median p-value <= p and the coverage in each sample >= cov and the factor <= fac. The factor is defined as follows: If the major alleles are identical in case and control, major allele counts / minor allele counts must be <= fac; if the major alleles are different, the factor criterion is not used. The coverage and factor criteria must be applied because the p-value changes quite a lot at coverages of about 50 or lower, when just 1 read is shifted from the major allele to the minor allele (i.e. when a normal sequencing error occurs). In other words, for special cases the p-value alone is not a sufficiently accurate indicator of similarity or difference.In the second column, the tag D_error indicates whether the (default or user-specified) thresholds for the applicability of Fisher's exact test have been violated.In detail, if the coverage falls below the threshold, D_error at the genomic coordinates displays an '!' and bit 0 (control) or 1 (case) is set. If the major/minor allele counts threshold is exceeded, D_error is output with an '!' and bit 2 (control) or 3 (case) is set. For example '!1' indicates low coverage in the control sample, and '!12' indicates the major/minor allele threshold violation in both samples.Fisher's exact test p-values (two-tailed) are given in column 3 (median p-value) and columns 5-8 (for each of the 5 filter levels, computed for ACGT counts of unique start points).Both-strand-support testing is not included in the first 3 columns, so explicit filtering using the A+- and B+- tags should be performed where considered, see pibase_consensus filtering.Mathematical detail: The p-values are computed with ACM algorithm 643 (http://portal.acm.org/citation.cfm?id=214326) for a 2x4 matrix using a hybrid approximation. This network method overcomes numeric overflow problems associated with online web calculators.Compute speed: about 20,000-50,000 genomic coordinates per hour on a single core using 2GB RAM.For a posteriori more stringent filtering of differences than in the original pibase_fisherdiff run, it may be of interest to use the linux awk command. Single-strand genotypes and more stringent p-values can be filtered using linux commands as follows:# copy header lines (i.e. starting with '#') into new file:grep '^#' pibase_fisherdiff_file > filtered_file# append filtered table lines to the new file (p-value <= 0.005, and tags A+- and B-):awk 'BEGIN {FS="\t"};($3<=0.005)&&($18=="+-")&&($19=="+-")' pibase_fisherdiff_file >> filtered_file# count differences:grep '^+' filtered_file | wc -lAnnotate | pibase_tosnpactspibase_tosnpacts reads the "BestGen" genotypes from a pibase_consensus file or a pibase_fisherdiff file and creates a file (or file pair, if pibase_fisherdiff) for import into snpacts. After annotation within snpacts, the snpacts-annotations are exported and added to the pibase_consensus or pibase_fisherdiff file using pibase_annot.Usage:pibase_tospnacts in out1 [out2] in: file from pibase_consensus or pibase_fisherdiff out1: output file (for input into snpacts, in 'own' format) out2: 2nd output file (case sample) if pibase_fisherdiffExamples:# pibase_consensus file:pibase_tosnpacts geno_na12752.txt for_snpacts_na12752.txt# pibase_fisherdiff file:pibase_tosnpacts diff.txt for_snpacts_control.txt for_snpsacts_case.txtNotes:How to import the pibase_tosnpacts file into snpacts:snpact -ft own -sp 1 -pchr 0 -psnp 1 -prb 2 -pbb 3 -psb 4 -pmt 5 -all -hg hg18 -t tablename filenameRemove problematic EOLs from BLOBs in annotated table:tableupdate -eol tablenameExport the annotated, de-BLOBBed table to csv:outputact -t tablename -based 1 -csv -worst -hg hg18 inputfilename outputfilenamewhere inputfilename is the file from pibase_tosnpacts, and outputfilename is the file for pibase_annotAnnotate | pibase_annotpibase_annot annotates a pibase_consensus or pibase_fisherdiff file with snp information from an exported snpacts table (see above section on pibase_tosnpacts).Usage:pibase_annot type an1 [an2] in out type: 1=pibase_consensus; 2=pibase_fisherdiff an1: snpacts-annotations file 1 (control sample, created by outputact) an2: snpacts-annotations file 2 (case sample, created by outputact) annot2 is not specified for pibase_consensus (type=1) in: tab-separated file from pibase_consensus or pibase_fisherdiff out: results file with merging of annotations and infileExamples:# pibase_consensus file:pibase_annot 1 annot.txt geno_na12752.txt anno_geno_na12752.txt# pibase_fisherdiff file:pibase_annot 2 anno_control.txt anno_tumor.txt diff.txt anno_diff.txtAnnotate | pibase_tagpibase_tag adds primer region tags to a pibase_consensus, pibase_fisherdiff or pibase_annot file. The pibase file is annotated using information from a primer region file.Usage:pibase_tag typ pr in chr pos out [tag] typ: type of input file (always enter 0) pr: primer region file, tab-separated: chr pos start end strand (int, 1-based) in: tab-separated file, e.g. from pibase_diff or pibase_annot chr: column number (1-based) in in-file with chromosome number pos: column number (1-based) in in-file with position out: output file ( = tagged input file) tag: (optional, default=column 2) column number, at which the 4 primer tag columns are inserted.Example:# pibase_annot file:pibase_tag 0 primer_regions.txt pibase_annot.txt 4 5 pibase_tag_annot.txt 2Notes:The primer region file must have a special table header (see ## in example below) and is tab-separated:chromosome, start, end, strand (integer, 1-based).#dummy test file for pibase_tag ##chr start end strand 1 152644520 152644530 1 1 152644740 152644750 -1Phylogenetics | pibase_to_rdfpibase_to_rdf generates an rdf file (for phylogenetic network analysis) from a set of pibase_fisherdiff files.Usage:pibase_to_rdf list rdf [[pmax] [both] [elim]] list: text file which lists the set of pibase_fisherdiff files: file_name TAB sample_name EOL rdf: name of output file (for analysis using Network) []: optional arguments: [pmax]: max p-value to consider (default=0.01) [both]: bothstranded confirmation (y/n, default=y) [elim]: eliminate "N"-columns and invariable columns (y/n, default=n)Example: See pibase_to_rdf.shNote:The example above refers to our pibase paper. Please consult this paper if you are unfamiliar with phylogenetic network analyses.Phylogenetics | pibase_rdf_refpibase_rdf_ref generates a reference-base pibase_consensus file which may be used as the reference sample for pibase_fisherdiff comparisons in a group of samples, e.g. for a pibase_to_rdf run and subsequent phylogenetic network analysis.Usage:pibase_rdf_ref in out [cov] in: pibase_consensus file (any sample from the group of samples) out: pibase_consensus file with genotypes taken from Ref column of in-file cov: coverage (optional. default=50. Divisible by two.)Example: See pibase_to_rdf.shNote:The example above refers to our pibase paper. Please consult this paper if you are unfamiliar with phylogenetic network analyses.Phylogenetics | pibase_chrm_to_crspibase_chrm_to_crs extracts Cambridge Reference Sequence (CRS)* variants from chrM** alignments. The CRS numbering is identical to the numbering of the revised CRS (rCRS)***. The reference sequence bases within the mtDNA control region are identical for the CRS and the rCRS.(In hg18, hg19, and NCBI36, chrM is used as the mitochondrial reference genome; in NCBI build 37 rCRS is used and in hg20, rCRS will be used; most mtDNA databases are based on the CRS/rCRS genomic coordinates.)Usage:pibase_chrm_to_crs in out own pro sam [[t] [min]] in: pibase_consensus genotype file (MUST be chrM ONLY!) out: base name of output files. The following files are output: out_f4maj.txt: major variants for highest-stringency filter (f4) out_f4min.txt: minor variants for highest-stringency filter (f4) out_f0maj.txt: major variants for lowest-stringency filter (f0) out_f0maj.txt: minor variants for lowest-stringency filter (f0) out_all.txt: complete pibase_consensus file in CRS notation out_sum.txt: summary file in CRS notation own: abbreviation of owner/PrincipalInvestigator of the project pro: for summary: project abbreviation sam: for summary: sample abbreviation []: optional arguments: [t]: threshold for minor allele cut-off (default: 0.015, i.e. 1.5% of total read counts at a given coordinate) [min]: min number of read counts required for an allele-call (default: 2)Example:pibase_chrm_to_crs pat123_chrM.txt pat123_chrs.txt FRA fro Pat123Note:The pibase_consensus file must contain only chrM lines. If there are other lines in the file, the header lines and the chrM lines must be extracted before using pibase_chrm_to_crs:# copy header lines into new file:grep '^#' pibase_consensus_file > chrM_pibase_consensus_file# append lines with "chromosome 25" (i.e. chrM) to the new file:grep '^25' pibase_consensus_file >> chrM_pibase_consensus_fileReferences:* Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., Schreier, P.H., Smith, A.J., Staden, R., Young, I.G. (1981) Sequence and organization of the human mitochondrial genome. Nature, 290, 457-465.** Ingman, M., Kaessmann, H., Pääbo, S., Gyllensten, U. (2000) Mitochondrial genome variation and the origin of modern humans. Nature, 7, 708-713.*** Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N., Turnbull, D.M., Howell, N. (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet., 23, 147.Utilities | pibase_to_vcfpibase_to_vcf converts a single-sample pibase file into VCFv4.1 format.Usage:pibase_to_vcf [-h] [-v] [-d] [-u s s s s s s s s s s] in out nam bam in: tab-separated file (pibase_consensus) out: VCF file nam: name of sample in VCF file (column header) bam: name of bam file from which in file was generatedExample:pibase_c_to_contig in.txt out.txt mysample mysample.bamUtilities | pibase_c_to_contigpibase_c_to_contig converts pibase-contig-numbers into contig names (e.g. 25 -> chrM).Usage:pibase_c_to_contig in [in2] out in: file to be converted (from pibase_consensus, pibase_fisherdiff, ...) Note: pibase_fisherdiff has a shortened file head, so a second input file with the conversion table must be given. in2: file with contig numbering/name table in header (required, if the first input file is a pibase_fisherdiff file) out: output fileExamples:pibase_c_to_contig gen.txt gen_contigs.txtpibase_c_to_contig diff.txt table.txt diff_contigs.txtUtilities | pibase_flagsnppibase_flagsnp flags a BestGen genotype as being an "NGS SNP" (i.e. a mismatch) if it is different to the reference base. Three columns are inserted before the "BestGen" column: "IsSnp" (1/0 flag), "SnpReads" (non-reference-allele coverage at filter level 0), and "CovAll" (coverage at filter level 0, consisting of all reads indicating A, C, G, or T but not N)To sort the pibase_consensus "BestGen" genotypes into Non-Reference (i.e. SNP) and Reference.Usage:pibase_flagsnp in out in: tab-separated file (pibase_consensus, pibase_fisherdiff, or pibase_diff) out: results file with merging of annotations and infileExample:pibase_flagsnp gen_id100.txt snp_gen_id100.txtNote: Outside the NGS world, "SNPs" include reference genotypes which are polymorphisms occurring in 1% or more of a population - this is often a matter of confusion between NGS and non-NGS specialists.Utilities | pibase_diffpibase_diff compares two pibase_consensus files using (BestGen) genotypes and a (BestQual) quality threshold. (Accuracy can be much lower than pibase_fisherdiff, i.e. up 10x or more false positives and false negatives, but compute time is fast.)Usage:pibase_diff control case out worst control: pibase_consensus file of control (healthy) sample case: pibase_consensus file of case (diseased) sample out: output file worst: worst BestQual level to consider (0-8)Examples:# Consider BestGens up to a worst BestQual of ?1 (i.e. ignore ?2 to ?8):pibase_diff normal.txt tumor.txt diff_out.txt 1Notes:One leading column is added to the pibase_consensus output, and the two samples are merged into one file which can then be filtered using grep or python/perl/java/etc scripts, or imported into Excel by Windows users and viewed/filtered further in Excel:In the leading column, identical BestGen genotypes are denoted by "=-" (control sample) or "=+" (case sample).Different BestGen genotypes at a genomic coordinate are denoted by "-" (control) and "+" (case).Differences and single-strand genotypes can be filtered using linux commands as follows:# copy header lines into new file:grep '^#' pibase_diff_file > filtered_diff_file# append filtered lines to the new file (filtering by first column and tags A+- and B+-):awk 'BEGIN {FS="\t"};($11=="+-")&&($12=="+-")' pibase_diff_file | grep -v '^=' >> filtered_diff_file# count differences:grep '^+' filtered_diff_file | wc -lUtilities | pibase_gen_from_snpactspibase_gen_from_snpacts inserts snpacts genotypes into a pibase file before the "BestGen" column. Multiple runs of this script can be performed to insert genotypes from several different sources (e.g. samtools, Bioscope, GATK). Only chromosomes chr1-chr22, chrX, chrY are recognised in snpacts files.To compare genotypes from different sources with signals and genotypes from pibase_consensus.Usage:pibase_gen_from_snpacts typ s1 [s2] in out hdr typ: type of input file (1=pibase_consensus or pibase_gen_from_snpacts) s1: snpacts file1, created by outputact s2: snpacts file2, (not specified for typ=1) in: tab-separated file, from pibase_consensus or pibase_gen_from_snpacts out: results file with merging of annotations and infile hdr: header of inserted snpacts-column (e.g. samtools, Bioscope, GATK)Example:# add samtools SNPs into pibase_consensus file:pibase_gen_from_snpacts 1 id100_samtools.txt gen_id100.txt sam_gen_id100.txt samtools# add Sanger SNPs into samtools-pibase_consensus file:pibase_gen_from_snpacts 1 id100_sanger.txt sam_gen_id100.txt san_sam_gen_id100.txt sangerUtilities | pibase_refpibase_ref gets 500 flanking nucleotides of reference sequence around each coordinate (of the input list) and outputs a pseudo-FASTA file.Primary use: manual Sanger-sequencing primer design.Usage:pibase_ref in ref out in: file with list of genomic coordinates (e.g. a VCF file): chromosome_name TAB 1_based_coordinate_in_chromosome ref: indexed reference sequence file (fai must exist) out: FASTA-formatted output fileExample:pibase_ref snplist.vcf hg19.fa flanked_snps.fasta原文地址:http://www.ikmb.uni-kiel.de/pibase/

你可能感兴趣的:(bioinformatics)