BLAST Command Line Applications User Manual

http://www.ncbi.nlm.nih.gov/books/NBK1763/

Christiam Camacho, Thomas Madden, George Coulouris, Ning Ma, Tao Tao, and Richa Agarwala.

Created: June 23, 2008; Last Update: January 30, 2012.

1. Introduction

This manual documents the BLAST (Basic Local Alignment Search Tool) command line applications developed at the National Center for Biotechnology Information (NCBI). These applications have been revamped to provide an improved user interface, new features, and performance improvements compared to their counterparts in the NCBI C Toolkit. Hereafter we shall distinguish the C Toolkit BLAST command line applications from these command line applications by referring to the latter as the BLAST+ applications, which have been developed using the NCBI C++ Toolkit (http://www.ncbi.nlm.nih.gov/books/NBK7160/).

Please feel free to contact us with any questions, feedback, or bug reports at [email protected].

2. Installation

The BLAST+ applications are distributed in executable and source code format. For the executable formats we provide installers as well as tarballs; the source code is only provided as a tarball. These are freely available at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/. Please be sure to use the most recent available version; this will be indicated in the file name (for instance, in the sections below, version 2.2.18 is listed, but this should be replaced accordingly).

2.1 Windows

Download the executable installer ncbi-blast-2.2.18+.exe and double click on it. After accepting the license agreement, select the install location, click “Install”, and then click “Close”.

2.2 MacOSX

For users without administrator privileges: Download the ncbi-blast-2.2.18+-universal-macosx.tar.gz tarball and follow the procedure described in Other Unix platforms.

For users with administrator privileges and machines running MacOSX version 10.5 or higher: Download the ncbi-blast-2.2.18+.dmg installer and double click on it. Double click the newly mounted ncbi-blast-2.2.18+ volume, double click on ncbi-blast-2.2.18+.pkg and follow the instructions in the installer. By default the BLAST+ applications are installed in /usr/local/ncbi/blast, overwriting its previous contents (an uninstaller is provided and its use is recommended when upgrading a BLAST+ installation).

2.3 RedHat Linux

Download the appropriate *.rpm file for your platform and either install or upgrade the ncbi-blast+ package as appropriate using the commands:

Install:
rpm -ivh ncbi-blast-2.2.18-1.x86_64.rpm
Upgrade:
rpm -Uvh ncbi-blast-2.2.18-1.x86_64.rpm

Note: one must have root privileges to run these commands. If you do not have root privileges, please use the procedure described in Other Unix platforms.

2.4 Other Unix platforms

Download the tarball and expand it in the location of your choice.

2.5 Source tarball

Use this approach if you would like to build the BLAST+ applications yourself. Download the tarball, expand it, and in the expanded directory type the following commands:

cd c++
./configure --without-debug --with-strip --with-mt --with-build-root=ReleaseMT
cd ReleaseMT/build
make all_r

The compiled executables will be found in c++/ReleaseMT/bin.

In Windows, extract the tarball and open the appropriate MSVC solution or project file (e.g.: c++\compilers\msvc800_prj\static\build), build the -CONFIGURE- project, click on “Reload” when prompted by the development environment, and then build the -BUILD-ALL- project. The compiled executables will be found in the directory corresponding to the build configuration selected (e.g.: c++\compilers\msvc800_prj\static\bin\debugdll).

3. Quick start

3.1 For users of NCBI C Toolkit BLAST

The easiest way to get started using these command line applications is by means of the legacy_blast.pl Perl script, which is bundled with the BLAST+ applications. To use this script, simply prefix it to the invocation of the C Toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications. For example, instead of using

    blastall -i query -d nr -o blast.out

use

    legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin

For more details, refer to the section titled Backwards compatibility script.

3.2 For users of Web BLAST (`http://blast.ncbi.nlm.nih.gov`)

Users of Web BLAST can take advantage of the search strategies to quickly get started using the BLAST+ applications, as these are intended to allow seamless integration between the Web and command line BLAST tools. For more details, refer to the section on BLAST search strategies.

3.3 For new users of BLAST

An introduction to BLAST is outside the scope of this manual; more information on this subject can be found at http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs. Nonetheless, new users will benefit from the examples in the cookbook as well as from reading the user manual.

3.4 Downloading BLAST databases

The BLAST databases are required to run BLAST locally and to support automatic resolution of sequence identifiers. Documentation about these can be found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/README. These databases may be retrieved automatically with the update_blastdb.pl Perl script, which is included as part of this distribution.

This script will download multiple tar files for each BLAST database volume if necessary, without having to designate each volume. For example:

./update_blastdb.pl htgs

will download all the relevant HTGS tar files (htgs.00.tar.gz, …, htgs.N.tar.gz).

The script can also compare your local copy of the database tar file(s) with the copy on the server and download tar files only if the date stamp has changed, reflecting a newer version of the database. This allows the script to run on a schedule and download tar files only when needed. Documentation for the update_blastdb.pl script can be obtained by running the script without any arguments (Perl is required).

4. User manual

4.1 Functionality offered by BLAST+ applications

The functionality offered by the BLAST+ applications has been organized by program type, so as to more closely resemble Web BLAST. The following graph depicts a correspondence between the NCBI C Toolkit BLAST command line applications and the BLAST+ applications:

As an example, to run a search of a nucleotide query (translated “on the fly” by BLAST) against a protein database one would use the blastx application instead of blastall. The blastx application will also work in “Blast2Sequences” mode (i.e.: accept FASTA sequences instead of a BLAST database as targets) and can also send BLAST searches over the network to the public NCBI server if desired.

The blastn, blastp, blastx, tblastx, tblastn, psiblast, rpsblast, and rpstblastn applications are considered search applications, as they execute a BLAST search, whereas makeblastdb, blastdb_aliastool, and blastdbcmd are considered BLAST database applications, as they either create or examine BLAST databases.

There is also a new set of sequence filtering applications described in the section Sequence filtering applications and an application to build database indices that greatly speed up megablast in some cases (see section titled Megablast indexed searches).

Please note that the NCBI C Toolkit applications seedtop and blastclust are not available in this release.

4.2 Common options

The following is a listing of options that are common to the majority of BLAST+ applications, followed by a brief description of what they do:

4.2.1 best_hit_overhang: Overhang value for Best-Hit algorithm. For more details, see the section Best-Hits filtering algorithm.

4.2.2 best_hit_score_edge: Score edge value for Best-Hit algorithm. For more details, see the section Best-Hits filtering algorithm.

4.2.3 db: File name of BLAST database to search the query against. Unless an absolute path is used, the database will be searched relative to the current working directory first, then relative to the value specified by the BLASTDB environment variable, then relative to the BLASTDB configuration value specified in the configuration file. Multiple databases may be provided as an argument, and they must be separated by a space. Many operating systems now allow spaces in file names and paths, so it is necessary to use quotes. See section 5.15 for details.

4.2.4 dbsize: Effective length of the database.

4.2.5 dbtype: Molecule type stored or to store in a BLAST database.

4.2.6 db_soft_mask: Filtering algorithm ID to apply to the database as soft masking for subject sequences. The algorithm IDs for a given BLAST database can be obtained by invoking blastdbcmd with its -info flag (only shown if such filtering in the BLAST database is available). For more details see the section Masking in BLAST databases.

4.2.7 culling_limit: Ensures that no more than the specified number of HSPs are aligned to the same part of the query. This option was designed for searches with a lot of repetitive matches, but if possible it is probably more efficient to mask the query to remove the repetitive sequences.

4.2.8 entrez_query: Restrict the search of the BLAST database to the results of the Entrez query provided.

4.2.9 evalue: Expectation value threshold for saving hits.

4.2.10 export_search_strategy: Name of the file where to save the search strategy (see section titled BLAST search strategies).

4.2.11 gapextend: Cost to extend a gap.

4.2.12 gapopen: Cost to open a gap.

4.2.13 gilist: File containing a list of GIs to restrict the BLAST database to search. The expect values in the BLAST results are based upon the sequences actually searched and not on the underlying database.

4.2.14 h: Displays the application’s brief documentation.

4.2.15 help: Displays the application’s detailed documentation.

4.2.16 html: Enables the generation of HTML output suitable for viewing in a web browser.

4.2.17 import_search_strategy: Name of the file where to read the search strategy to execute (see section titled BLAST search strategies).

4.2.18 lcase_masking: Interpret lowercase letters in query sequence(s) as masked.

4.2.19 matrix: Name of the scoring matrix to use.

4.2.20 max_target_seqs: Maximum number of aligned sequences to keep from the BLAST database. This option should only be used with formats that do not have separate descriptions and alignments sections, such as XML, tabular, ASN.1 or BLAST archive.

4.2.21 negative_gilist: File containing a list of GIs to exclude from the BLAST database.

4.2.22 num_alignments: Number of alignments to show in the BLAST output. This option should only be used with formats that have a separate alignments section, such as the standard BLAST report, including pairwise and any query-anchored flavor. This option may not work as expected with formats such as XML, tabular, etc. that do not have a separate alignment section. The max_target_seqs option should be used in that case.

4.2.23 num_descriptions: Number of one-line descriptions to show in the BLAST output. This option should be used with output formats that have a separate descriptions section, such as the standard BLAST report, including pairwise and any query-anchored flavor. This option may not work as expected with formats such as XML, tabular, etc. that do not have a separate descriptions section. The max_target_seqs option should be used in that case.

4.2.24 num_threads: Number of threads to use during the search.

4.2.25 out: Name of the file to write the application’s output. Defaults to stdout.

4.2.26 outfmt: Allows for the specification of the search application’s output format. A listing of the possible format types is available via the search application’s -help option. If a custom output format is desired, this can be specified by providing a quoted string composed of the desired output format (tabular, tabular with comments, or comma-separated value), a space, and a space delimited list of output specifiers. The list of supported output specifiers is available via the -help command line option. Unsupported output specifiers will be ignored. This should be specified using double quotes if there are spaces in the output format specification (e.g.: -outfmt "7 sseqid sacc qstart qend sstart send qseq evalue bitscore").

4.2.27 parse_deflines: Parse the query and subject deflines.

4.2.28 query: Name of the file containing the query sequence(s), or ‘-’ if these are provided on standard input.

4.2.29 query_loc: Location of the first query sequence to search in 1-based offsets (Format: start-stop).

4.2.30 remote: Instructs the application to submit the search to NCBI for remote execution.

4.2.31 searchsp: Effective length of the search space.

4.2.32 seg: Arguments to SEG filtering algorithm (use ‘no’ to disable).

4.2.33 show_gis: Show NCBI GIs in deflines in the BLAST output.

4.2.34 soft_masking: Apply filtering locations as soft masks (i.e.: only when finding alignment seeds).

4.2.35 subject: Name of the file containing the subject sequence(s) to search.

4.2.36 subject_loc: Location of the first subject sequence to search in 1-based offsets (Format: start-stop).

4.2.37 strand: Strand(s) of the query sequence to search.

4.2.38 threshold: Minimum word score such that the word is added to the BLAST lookup table.

4.2.39 ungapped: Perform ungapped alignments only.

4.2.40 version: Displays the application’s version.

4.2.41 window_size: Size of the window for multiple hits algorithm, use 0 to specify 1-hit algorithm.

4.2.42 word_size: Word size for word finder algorithm.

4.2.43 xdrop_gap: X-dropoff value (in bits) for preliminary gapped extensions.

4.2.44 xdrop_gap_final: X-dropoff value (in bits) for final gapped alignment.

4.2.45 xdrop_ungap: X-dropoff value (in bits) for ungapped extensions.
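
Several of the options above expect a single argument that itself contains spaces (for instance -db with multiple databases, or a custom -outfmt). The following is a minimal sketch, in Python, of how such an invocation might be assembled as an argument list. The function name, the database names (nt, est_human) and the file names are hypothetical and chosen only for illustration; the options themselves are the ones documented in this section.

```python
import shlex


def build_blastn_args(query, databases, out, outfmt_fields, query_loc=None):
    """Assemble a blastn argument list (hypothetical helper, not part of BLAST+)."""
    args = ["blastn", "-query", query]
    # Multiple databases are passed as ONE argument, separated by spaces.
    args += ["-db", " ".join(databases)]
    # Custom format: the format number (7 = tabular with comments),
    # a space, and a space-delimited list of output specifiers.
    args += ["-outfmt", "7 " + " ".join(outfmt_fields)]
    if query_loc is not None:
        start, stop = query_loc  # 1-based offsets, as -query_loc expects
        if start < 1 or stop < start:
            raise ValueError("query_loc must satisfy 1 <= start <= stop")
        args += ["-query_loc", f"{start}-{stop}"]
    args += ["-out", out]
    return args


args = build_blastn_args(
    "query.fsa",
    ["nt", "est_human"],  # hypothetical database names
    "blast.out",
    ["sseqid", "qstart", "qend", "evalue", "bitscore"],
    query_loc=(1, 100),
)
# shlex.join shows the equivalent shell command, with quoting added
# around the arguments that contain spaces.
print(shlex.join(args))
```

When the list is handed directly to subprocess.run, no additional quoting is needed, because each list element reaches the application as one argument; the quotes described above are only required when the command is typed into a shell.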

4.3 Backwards compatibility script

The purpose of the legacy_blast.pl Perl script is to help users make the transition from the C Toolkit BLAST command line applications to the BLAST+ applications. Invoking the script without any arguments displays its documentation.

The legacy_blast.pl script supports two modes of operation: one in which the C Toolkit BLAST command line invocation is converted and executed on behalf of the user, and another which solely displays the BLAST+ application equivalent to what was provided, without executing the command.

The first mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and, if the installation path for the BLAST+ applications differs from the default (available by invoking the script without arguments), providing the --path argument after the command line to convert. See the example in the first section of the Quick start.

The second mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and appending the --print_only command line option as follows:

$ ./legacy_blast.pl megablast -i query.fsa -d nt -o mb.out --print_only
/opt/ncbi/blast/bin/blastn -query query.fsa -db "nt" -out mb.out 
$

4.4 Exit codes

All BLAST+ applications have consistent exit codes to signify the exit status of the application. The possible exit codes along with their meaning are detailed in the table below:

Exit Code	Meaning
0	Success
1	Error in query sequence(s) or BLAST options
2	Error in BLAST database
3	Error in BLAST engine
4	Out of memory
255	Unknown error

In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success) and 1 (indicating failure).
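
A script that wraps a BLAST+ search application may want to translate these exit codes into messages. The following is a minimal sketch in Python; the dictionary and function names are hypothetical, and the meanings are copied from the table above (they apply to the search applications, not to the database applications).

```python
# Exit codes of the BLAST+ search applications, as listed in the table above.
BLAST_EXIT_CODES = {
    0: "Success",
    1: "Error in query sequence(s) or BLAST options",
    2: "Error in BLAST database",
    3: "Error in BLAST engine",
    4: "Out of memory",
    255: "Unknown error",
}


def describe_exit_code(code):
    """Return the documented meaning of an exit code (hypothetical helper)."""
    return BLAST_EXIT_CODES.get(code, f"Undocumented exit code {code}")


print(describe_exit_code(2))
```

In practice the code would typically come from the returncode attribute of the object returned by subprocess.run.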

4.5 Improvements over C Toolkit BLAST command line applications

4.5.1 Query splitting

This new feature in the BLAST+ applications provides substantial performance improvements, particularly for blastx searches, and it is automatically enabled by the software when deemed appropriate. Below is a graph comparing the runtime of blastall and blastx when searching different size excerpts of NC_007113 (varying from 10 kbases to about 10 Mbases) against the human genome database (experiments performed in July 2008):

4.5.2 Tasks

The concept of tasks has been added to support the notion of commonly performed tasks via the -task command line option in blastn and blastp. These tasks resemble the “Program Selection” section of the BLAST web pages and do not preclude the user from setting other options to override those specified by the task. The following tasks are currently available:

Program	Task Name	Description
blastp	blastp	Traditional BLASTP to compare a protein query to a protein database
blastp	blastp-short	BLASTP optimized for queries shorter than 30 residues
blastn	blastn	Traditional BLASTN requiring an exact match of 11
blastn	blastn-short	BLASTN program optimized for sequences shorter than 50 bases
blastn	megablast	Traditional megablast used to find very similar (e.g., intraspecies or closely related species) sequences
blastn	dc-megablast	Discontiguous megablast used to find more distant (e.g., interspecies) sequences

4.5.3 Megablast indexed searches

Indexed searches for megablast are available and are faster than regular megablast. The application to generate the database indices is called makembindex, which is included in this distribution. More information about it can be found at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/README.usage.

4.5.4 Partial sequence retrieval from BLAST databases

Improvements to the BLAST database reading module allow it to fetch only the relevant portions of the subject sequence that are needed in the gapped alignment stage, providing a substantial improvement in runtime. The following example, which compares 103 mouse EST sequences against the human genome, shows this (example run in July 2008 after the database had already been loaded into memory):

$ time megablast -d 9606_genomic -i est.500.in -r 2 -q -3 -F mL -m 9 -o old.out -W 11 -t 18 -G 5 -E 2
341.455u 65.242s 6:47.99 99.6% 0+0k 0+0io 0pf+0w
$ time blastn -task dc-megablast -db 9606_genomic -query est.500.in -outfmt 7 -out new.out
218.540u 11.632s 3:50.53 99.8% 0+0k 0+0io 0pf+0w

Similar gains in performance should be expected when searching many very short queries against BLAST databases which contain very large sequences.

4.5.5 BLAST search strategies

BLAST search strategies are files which encode the inputs necessary to perform a BLAST search. The purpose of these files is to be able to seamlessly reproduce a BLAST search in various environments (Web BLAST, command line applications, etc).

4.5.5.1 Exporting search strategies on Web BLAST

Click on "download" next to the RID/saved strategy in the "Recent Results" or "Saved Strategies" tabs.

4.5.5.2 Exporting search strategies with BLAST+ applications

Add the -export_search_strategy option along with a file name to the command line options.

4.5.5.3 Importing search strategies on Web BLAST

Go to the "Saved Strategies" tab, click on "Browse" to select your search strategy file, then click on "View" to load it into the submission page.

4.5.5.4 Importing search strategies with BLAST+ applications

Add the -import_search_strategy option along with the name of the search strategy file. Note that, if provided, the -query, -db, -use_index, and -index_name command line options will override the specifications of the search strategy file provided (no other command line options will override the contents of the search strategy file).

4.5.6 Negative GI lists

Negative GI lists are available on search applications and they provide a means to exclude GIs from a BLAST database search. The expect values in the BLAST results are based upon the sequences actually searched and not on the underlying database. For an example, see the cookbook.

4.5.7 Masking in BLAST databases

It is now possible to create BLAST databases which contain filtered sequences (also known as masking information or masks). This filtering information can be used as soft masking for the subject sequences. For instructions on creating masked BLAST databases, please see the cookbook.

4.5.8 Custom output formats for BLAST searches

The BLAST+ search command line applications support custom output formats for the tabular and comma-separated value output formats. For more details see the common options as well as the cookbook.
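
Custom tabular output is convenient to post-process in a script. The following is a minimal sketch in Python that turns tabular output into dictionaries keyed by output specifier. It assumes that the columns are tab-delimited and appear in the order in which the specifiers were given to -outfmt, and that comment lines (tabular with comments) start with ‘#’. The function name and the sample rows are made up for illustration and are not real search results.

```python
def parse_tabular(text, fields):
    """Parse tabular BLAST output into a list of dicts (hypothetical helper).

    `fields` must list the output specifiers in the same order as they
    were passed to -outfmt.
    """
    rows = []
    for line in text.splitlines():
        # Skip blank lines and the '#' comment lines of the commented format.
        if not line.strip() or line.startswith("#"):
            continue
        values = line.split("\t")
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
        rows.append(dict(zip(fields, values)))
    return rows


# Made-up sample in the shape produced by -outfmt "7 sseqid evalue bitscore".
sample = (
    "# Fields: subject id, evalue, bit score\n"
    "subject_1\t1e-50\t200\n"
    "subject_2\t3e-10\t65.5\n"
)
hits = parse_tabular(sample, ["sseqid", "evalue", "bitscore"])
best = min(hits, key=lambda h: float(h["evalue"]))
print(best["sseqid"])
```

All values are kept as strings; converting them (for example with float for evalue and bitscore) is left to the caller, since the appropriate type depends on the specifier.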

4.5.9 Custom output formats to extract BLAST database data

blastdbcmd supports custom output formats to extract data from BLAST databases via the -outfmt command line option. For more details see the blastdbcmd options as well as the cookbook.

4.5.10 Improved software installation packages

The BLAST+ applications are available via Windows and MacOSX installers as well as RPMs (source and binary) and unix tarballs. For more details about these, refer to the installation section.

4.5.11 Sequence filtering applications

The BLAST+ applications include a new set of sequence filtering applications, namely segmasker, dustmasker, and windowmasker. segmasker is an application that identifies and masks low complexity regions of protein sequences. The dustmasker and windowmasker applications provide similar functionality for nucleotide sequences (see ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/dustmasker/README.dustmasker and ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/README.windowmasker for more information).

4.5.12 Best-Hits filtering algorithm

The Best-Hit filtering algorithm is designed for use in applications that are searching for only the best matches for each query region that reports matches. Its -best_hit_overhang parameter, H, controls when an HSP is considered short enough to be filtered due to the presence of another HSP. For each HSP A that is filtered, there exists another HSP B such that the query region of HSP A extends each end of the query region of HSP B by at most H times the length of the query region for B.

Additional requirements that must also be met in order to filter A on account of B are:

i.: evalue(A) >= evalue(B)
ii.: score(A)/length(A) < (1.0 - score_edge) * score(B)/length(B)

We consider 0.1 to 0.25 to be an acceptable range for the -best_hit_overhang parameter and 0.05 to 0.25 to be an acceptable range for the -best_hit_score_edge parameter. Increasing the value of the overhang parameter eliminates a higher number of matches, but increases the running time; increasing the score_edge parameter removes a smaller number of hits.
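
The filtering conditions above can be sketched as a small predicate. The following Python code is an illustration of one reading of the description and not the BLAST+ implementation: it assumes that length() refers to the length of the HSP's query region, that query coordinates are inclusive, and that "extends each end by at most H times the length" means A may protrude beyond either end of B by no more than H * length(B). The class and function names, the default parameter values (taken from within the ranges given above), and the sample HSPs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    q_start: int   # first query position covered (inclusive)
    q_end: int     # last query position covered (inclusive)
    score: float
    evalue: float

    @property
    def length(self):
        # Assumption: "length" is the length of the query region.
        return self.q_end - self.q_start + 1


def is_filtered_by(a, b, overhang=0.1, score_edge=0.1):
    """Return True if HSP a would be filtered on account of HSP b."""
    allowed = overhang * b.length
    # a may extend past each end of b by at most overhang * length(b).
    within = (a.q_start >= b.q_start - allowed
              and a.q_end <= b.q_end + allowed)
    # Condition i: a's expect value is no better than b's.
    worse_evalue = a.evalue >= b.evalue
    # Condition ii: a's score per unit length is clearly lower than b's.
    lower_density = (a.score / a.length
                     < (1.0 - score_edge) * b.score / b.length)
    return within and worse_evalue and lower_density


strong = HSP(q_start=100, q_end=299, score=400.0, evalue=1e-50)
weak = HSP(q_start=90, q_end=189, score=120.0, evalue=1e-10)
print(is_filtered_by(weak, strong), is_filtered_by(strong, weak))
```

Here the weak HSP lies almost entirely inside the query region of the strong one, has a worse expect value and a lower score per position, so it satisfies all three conditions; the reverse comparison fails already on the overhang test.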

4.5.13 Automatic resolution of sequence identifiers

The BLAST+ search applications support automatic resolution of query and subject sequence identifiers specified as GIs or accessions (see the cookbook section for an example). This feature enables the user to specify one or more sequence identifiers (GIs and/or accessions, one per line) in a file as the input to the -query and -subject command line options.

Upon encountering this type of input, by default the BLAST+ search applications will try to resolve these sequence identifiers in locally available BLAST databases first, then in the BLAST databases at NCBI, and finally in GenBank (the latter two data sources require a properly configured internet connection). These data sources can be configured via the DATA_LOADERS configuration option and the BLAST databases to search can be configured via the BLASTDB_PROT_DATA_LOADER and BLASTDB_NUCL_DATA_LOADER configuration options (see the section on Configuring BLAST).

4.5.14 BLAST-WindowMasker integration in BLAST+ search applications

The BLAST+ search applications support integration with the windowmasker files via the -window_masker_taxid command line option and the WINDOW_MASKER_PATH configuration parameter (see Configuring BLAST) or via the -window_masker_db command line option.

In the first case, the WINDOW_MASKER_PATH configuration parameter should refer to a directory which contains subdirectories named after NCBI taxonomy IDs (e.g.: 9606 for human, 10090 for mouse), where the windowmasker unit counts data files should be placed with the following naming convention: wmasker.obinary (for files generated with the obinary format) and/or wmasker.oascii (for files generated with the oascii format). For an example on how to create these files, please see the Cookbook. Once these windowmasker files and the configuration file are in place, this feature can be invoked by providing the taxonomy ID to the -window_masker_taxid command line option.

Alternatively, this feature can also be invoked by providing the path to the windowmasker unit counts data file via the -window_masker_db command line option.

Please see the Cookbook for a usage example of this feature.
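
The directory layout expected under WINDOW_MASKER_PATH can be illustrated with a short Python sketch that computes where a unit counts file for a given taxonomy ID should be placed. The function name and the base directory /data/wmasker are hypothetical; the subdirectory and file naming follow the convention described above.

```python
from pathlib import PurePosixPath


def wmasker_file(window_masker_path, taxid, fmt="obinary"):
    """Return the expected location of a windowmasker unit counts file.

    Layout: <WINDOW_MASKER_PATH>/<taxid>/wmasker.<fmt>
    """
    if fmt not in ("obinary", "oascii"):
        raise ValueError("fmt must be 'obinary' or 'oascii'")
    return PurePosixPath(window_masker_path) / str(int(taxid)) / f"wmasker.{fmt}"


# /data/wmasker is a hypothetical value of WINDOW_MASKER_PATH.
human = wmasker_file("/data/wmasker", 9606)
mouse = wmasker_file("/data/wmasker", 10090, fmt="oascii")
print(human)
print(mouse)
```

PurePosixPath is used so that the sketch produces the same result on any platform; on Windows installations the real path separators will differ.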

4.5.15 DELTA-BLAST: A tool for sensitive protein sequence search

DELTA-BLAST uses RPS-BLAST to search for conserved domains matching a query, constructs a PSSM from the matching domains, and searches a sequence database. Its sensitivity is comparable to that of PSI-BLAST, and it does not require several iterations of searches against a large sequence database.

4.6 Options by program type

4.6.1 blastp

4.6.1.1 task: Specify the task to execute. For more details, refer to the section on tasks.

4.6.1.2 comp_based_stats: Select the appropriate composition based statistics mode (applicable only to blastp and tblastn). Available choices and references are available by invoking the application with the -help option.

4.6.1.3 use_sw_tback: Instead of using the X-dropoff gapped alignment algorithm, use Smith-Waterman to compute locally optimal alignments.

4.6.2 blastn

4.6.2.1 task: Specify the task to execute. For more details, refer to the section on tasks.

4.6.2.2 penalty: Penalty for a nucleotide mismatch.

4.6.2.3 reward: Reward for a nucleotide match.

4.6.2.4 use_index: Use a megablast database index.

4.6.2.5 index_name: Name of the megablast database index.

4.6.2.6 perc_identity: Minimum percent identity of matches to report.

4.6.2.7 dust: Arguments to DUST filtering algorithm (use ‘no’ to disable).

4.6.2.8 filtering_db: Name of BLAST database containing filtering elements (i.e.: repeats).

4.6.2.9 window_masker_taxid: Enable windowmasker filtering using an NCBI taxonomy ID. Windowmasker masking files are required; see Search applications’ integration with windowmasker files.

4.6.2.10 window_masker_db: Enable windowmasker filtering using this windowmasker masking file.

4.6.2.11 no_greedy: Use non-greedy dynamic programming extension.

4.6.2.12 min_raw_gapped_score: Minimum raw gapped score to keep an alignment in the preliminary gapped and traceback stages.

4.6.2.13 template_type: Discontiguous megablast template type.

4.6.2.14 template_length: Discontiguous megablast template length.

4.6.3 blastx

4.6.3.1 query_gencode: Genetic code to use to translate the query sequence(s).

4.6.3.2 frame_shift_penalty: Frame shift penalty for use with out-of-frame gapped alignments.

4.6.3.3 max_intron_length: Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).

4.6.4 tblastx

4.6.4.1 db_gencode: Genetic code to use to translate database/subjects.

4.6.4.2 max_intron_length: Identical to blastx.

4.6.5 tblastn

4.6.5.1 db_gencode: Identical to tblastx.

4.6.5.2 frame_shift_penalty: Identical to blastx.

4.6.5.3 max_intron_length: Identical to blastx.

4.6.5.4 comp_based_stats: Identical to blastp.

4.6.5.5 use_sw_tback: Identical to blastp.

4.6.5.6 in_pssm: Checkpoint file to initiate PSI-TBLASTN (currently unimplemented).

4.6.6 psiblast

4.6.6.1 comp_based_stats: Identical to blastp with the exception that only composition based statistics mode 1 is valid when a PSSM is the input (either when restarting from a checkpoint file or when performing multiple PSI-BLAST iterations).

4.6.6.2 gap_trigger: Number of bits to trigger gapping.

4.6.6.3 use_sw_tback: Identical to blastp.

4.6.6.4 num_iterations: Number of iterations to perform.

4.6.6.5 out_pssm: Name of the file to store checkpoint file containing a PSSM.

4.6.6.6 out_ascii_pssm: Name of the file to store ASCII version of PSSM.

4.6.6.7 in_msa: Name of the file containing multiple sequence alignment to restart PSI-BLAST. For input format, please see the Input formats to BLAST section.

4.6.6.8 msa_master_idx: Ordinal number (i.e.: 1-based index) of the sequence in the multiple sequence alignment file to use as master sequence. For the rationale and sample usage of this option, please see its Cookbook entry.

4.6.6.9 ignore_msa_master: Ignore the first sequence (usually the master sequence) in the multiple sequence alignment file when creating the PSSM. For the rationale and sample usage of this option, please see its Cookbook entry.

4.6.6.10 in_pssm: Checkpoint file to re-start PSI-BLAST.

4.6.6.11 pseudocount: Pseudo-count value used when constructing the PSSM.

4.6.6.12 inclusion_ethresh: E-value inclusion threshold for pairwise alignments to be considered to build the PSSM.

4.6.6.13 phi_pattern: Name of the file containing a PHI-BLAST pattern to search.

4.6.7 rpstblastn

4.6.7.1 query_gencode: Identical to blastx.

4.6.8 makeblastdb

This application serves as a replacement for formatdb.

4.6.8.1 in: Input file or BLAST database name to use as source; the data type is automatically detected. Multiple input files/BLAST databases can be provided, each separated by a space in a string with quotation marks. See section 5.15 for specifics that vary between Windows and UNIX/LINUX/Mac OS X.

4.6.8.2 title: Title for the BLAST database to create.

4.6.8.3 parse_seqids: Parse the Seq-id(s) in the FASTA input provided. Please note that this option should be provided consistently among the various applications involved in creating BLAST databases. For instance, the filtering applications as well as convert2blastmask should use this option if makeblastdb uses it also. Please see section 5.13 for details about the format of the Seq-ids.

4.6.8.4 hash_index: Enables the creation of sequence hash values. These hash values can then be used to quickly determine if a given sequence data exists in this BLAST database.

4.6.8.5 mask_data: Comma-separated list of input files containing masking data to apply to the sequences being added to the BLAST database being created. For more information, see Masking in BLAST databases and the examples.

4.6.8.6 out: Name of the BLAST database to create.

4.6.8.7 max_file_sz: Maximum file size for any of the BLAST database files created.

4.6.8.8 logfile: Name of the file to which the program log should be redirected (stdout by default).

4.6.8.9 taxid: Taxonomy ID to assign to all sequences.

4.6.8.10 taxid_map: Name of file which provides a mapping of sequence IDs to NCBI taxonomy IDs. This file should contain one line per sequence ID, and each line should contain a sequence ID followed by a taxonomy ID (an integer) separated by a single space.
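
As an illustration of the taxid_map file format described above, the following Python sketch renders a mapping of sequence IDs to taxonomy IDs as one "ID taxid" pair per line, separated by a single space. The function name and the sequence IDs (seq1, seq2) are hypothetical; the taxonomy IDs 9606 and 10090 are the human and mouse IDs mentioned elsewhere in this manual.

```python
def format_taxid_map(mapping):
    """Render {sequence ID: taxonomy ID} in the taxid_map file format."""
    lines = []
    for seq_id, taxid in mapping.items():
        # A space inside the ID would make the line ambiguous.
        if not seq_id or any(ch.isspace() for ch in seq_id):
            raise ValueError(f"invalid sequence ID: {seq_id!r}")
        # One line per sequence ID: the ID, a single space, an integer taxid.
        lines.append(f"{seq_id} {int(taxid)}")
    return "\n".join(lines) + "\n"


content = format_taxid_map({"seq1": 9606, "seq2": 10090})
print(content, end="")
```

The resulting text can be written to a file whose name is then passed to the -taxid_map option of makeblastdb.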

4.6.9 blastdb_aliastool

This application replaces part of the functionality offered by formatdb. When formatting a large input FASTA sequence file into a BLAST database, makeblastdb breaks up the resulting database into optimal sized volumes and links the volumes into a large virtual database through an automatically created BLAST database alias file.

We can use BLAST database alias files under different scenarios to manage the collection of BLAST databases and facilitate BLAST searches. For example, we can create an alias file to combine an existing BLAST database with newly generated ones while leaving the original one undisturbed. Also, for an existing BLAST database, we can create a BLAST database alias file based on a GI list so we can search a subset of it, eliminating the need to create a new database. For examples of how to use this application, please see the cookbook section.

This application supports three modes of operation:

1) GI file conversion:

Converts a text file containing GIs (one per line) to a more efficient binary format. This can be provided as an argument to the -gilist option of the BLAST search command line binaries or to the -gilist option of this program to create an alias file for a BLAST database (see below).

2) Alias file creation (restricting with GI list):

Creates an alias for a BLAST database and a GI list which restricts this database. This is useful if one often searches a subset of a database (e.g., based on organism or a curated list). The alias file makes the search appear as if one were searching a regular BLAST database rather than the subset of one.

3) Alias file creation (aggregating BLAST databases):

Creates an alias for multiple BLAST databases. All databases must be of the same molecule type (no validation is done). The relevant options are -dblist and -num_volumes.

4.6.9.1 gi_file_in: Text file to convert, should contain one GI per line.

4.6.9.2 gi_file_out: File name of converted GI file

4.6.9.3 title: Title for BLAST database.

4.6.9.4 gilist: Name of the file containing the GIs to restrict the database provided in -db.

4.6.9.5 out: Identical to makeblastdb.

4.6.9.6 logfile: Identical to makeblastdb.

Please note that when using GI lists, the expect values in the BLAST results are based upon the sequences actually searched and not on the underlying database.

4.6.10 blastdbcmd

This application is the successor to fastacmd. The following are its supported options:

4.6.10.1 entry: A comma-delimited search string of sequence identifiers, or the keyword ‘all’ to select all sequences in the database.

4.6.10.2 entry_batch: Input file for batch processing; entries must be provided one per line. If input is provided on standard input, a ‘-’ should be used to indicate this.

4.6.10.3 pig: PIG (Protein Identity Group) to retrieve.

4.6.10.4 info: Print BLAST database information (overrides all other options).

4.6.10.5 range: Selects the range of a sequence to extract in 1-based offsets (Format: start-stop).

4.6.10.6 strand: Strand of nucleotide sequence to extract.

4.6.10.7 outfmt: Output format string. For a list of available format specifiers, invoke the application with its -help option. Note that for all format specifiers except %f, each line of output will correspond to a single sequence. This should be specified using double quotes if there are spaces in the output format specification (e.g.: -outfmt "%g %t").

4.6.10.8 target_only: The definition line of the sequence should contain target GI only.

4.6.10.9 get_dups: Retrieve duplicate accessions.

4.6.10.10 line_length: Line length for output (applicable only with FASTA output format).

4.6.10.11 ctrl_a: Use Ctrl-A as the non-redundant defline separator (applicable only with FASTA output format).

4.6.10.12 mask_sequence_with: Allows the specification of a filtering algorithm ID from the BLAST database to apply to the sequence data. Applicable only with FASTA and sequence data output formats (%f and %s respectively).

4.6.10.13 list: Display the BLAST databases available in the directory provided as an argument to this option.

4.6.10.14 list_outfmt: Allows for the specification of the output format for the -list option; a listing of the possible format types is available via the application’s -help option. Unsupported output specifiers will be ignored. This option’s argument should be specified using double quotes if there are spaces in the output format specification.

4.6.10.15 recursive: Recursively traverse the directory provided to the -list option to find and display available BLAST databases.

4.6.10.16 show_blastdb_search_path: Displays the default BLAST database search paths (separated by colons).

4.6.11 convert2blastmask

This application extracts the lower-case masks from its FASTA input and converts them to a file format suitable for specifying masking information to makeblastdb. The following are its supported options:

4.6.11.1 masking_algorithm: The name of the masking algorithm used to create the masks (e.g.: dust, seg, windowmasker, repeat).

4.6.11.2 masking_options: The options used to configure the masking algorithm.

4.6.11.3 in: Name of the input file (standard input by default).

4.6.11.4 output: Name of the output file (standard output by default).

4.6.11.5 outfmt: Output file format.

4.6.11.6 parse_seqids: Identical to makeblastdb.

4.6.12 blastdbcheck

This application performs tests on BLAST databases to check their integrity. The following are its supported options:

4.6.12.1 dir: Name of the directory in which to look for BLAST databases.

4.6.12.2 recursive: Flag to specify whether to recursively search for BLAST databases in the directory specified above.

4.6.12.3 full: Check every database sequence.

4.6.12.4 stride: Check every Nth database sequence.

4.6.12.5 samples: Check a randomly selected set of N sequences.

4.6.12.6 ends: Check the beginning and ending N sequences in the database.

4.6.12.7 isam: Set to true to perform ISAM file checking on each of the selected sequences.

4.6.13 blast_formatter

This application formats both local and remote BLAST results. An RID is required to format remote BLAST results. The RID may be obtained either from a search submitted to the NCBI BLAST web page or by using the -remote switch with one of the applications mentioned above. The blast_formatter accepts the BLAST archive format for stand-alone formatting. The BLAST archive format can be produced by using the “-outfmt 11” argument with the stand-alone applications. For an example of how to use this application, please see its cookbook entry.

4.6.13.1 rid: BLAST RID of the report to be formatted.

4.6.13.2 archive: File produced by a BLAST application using -outfmt 11.

4.6.14 deltablast

4.6.14.1 domain_inclusion_ethresh: E-value inclusion threshold for alignments between the query sequence and a conserved domain to be included in PSSM construction.

4.6.14.2 show_domain_hits: Include in the output the matching domains that were used to construct a PSSM.

4.6.14.3 rpsdb: Use a database other than the standard conserved domain database (cdd_delta) for PSSM construction.

4.7 Configuring BLAST

The BLAST+ search applications can be configured by means of a configuration file named .ncbirc (on Unix-like platforms) or ncbi.ini (on Windows). This is a plain text file which contains sections and key-value pairs to specify configuration parameters. Lines starting with a semicolon are considered comments. This file will be searched for in the following locations, in this order:

1. Current working directory
2. User's HOME directory
3. Directory specified by the NCBI environment variable

The search for this file will stop at the first location where it is found and the configuration settings from that file will be applied. If the configuration file is not found, default values will apply. The following are the possible configuration parameters that impact the BLAST+ applications:

Configuration Parameter	Specifies	Default value
BLASTDB	Path to BLAST databases.	Current working directory
DATA_LOADERS	Data loaders to use for automatic sequence identifier resolution. This is a comma separated list of the following keywords: blastdb, genbank, and none. The none keyword disables this feature and takes precedence over any other keywords specified.	blastdb,genbank
BLASTDB_PROT_DATA_LOADER	Locally available BLAST database name to search when resolving protein sequences using BLAST databases. Ignored if DATA_LOADERS does not include the blastdb keyword.	nr
BLASTDB_NUCL_DATA_LOADER	Locally available BLAST database name to search when resolving nucleotide sequences using BLAST databases. Ignored if DATA_LOADERS does not include the blastdb keyword.	nt
GENE_INFO_PATH	Path to gene information files (NCBI only).	Current working directory
WINDOW_MASKER_PATH	Path to windowmasker directory hierarchy.	Current working directory

The following is an example with comments describing the available parameters for configuration:

; Start the section for BLAST configuration
[BLAST]
; Specifies the path where BLAST databases are installed
BLASTDB=/home/guest/blast/db
; Specifies the data sources to use for automatic resolution 
; for sequence identifiers 
DATA_LOADERS=blastdb 
; Specifies the BLAST database to use to resolve protein sequences
BLASTDB_PROT_DATA_LOADER=custom_protein_database
; Specifies the BLAST database to use to resolve nucleotide sequences
BLASTDB_NUCL_DATA_LOADER=/home/some_user/my_nucleotide_db


; Windowmasker settings
[WINDOW_MASKER]
WINDOW_MASKER_PATH=/home/guest/blast/db/windowmasker
; end of file
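
The search order and the file layout described above can be illustrated with a short shell sketch. It is only an approximation for Unix-like platforms: the real lookup is performed inside the BLAST+ applications, and the helper names find_ncbirc and ini_value are hypothetical.

```shell
#!/bin/sh
# Print the path of the first .ncbirc found, searching in the documented
# order: current working directory, $HOME, then the directory named by $NCBI.
find_ncbirc() {
  local dir
  for dir in "$PWD" "$HOME" "$NCBI"; do
    if [ -n "$dir" ] && [ -f "$dir/.ncbirc" ]; then
      printf '%s\n' "$dir/.ncbirc"
      return 0
    fi
  done
  return 1
}

# Print the value of KEY inside [SECTION] of an INI-style file,
# ignoring comment lines that start with a semicolon.
ini_value() {  # usage: ini_value FILE SECTION KEY
  awk -v section="[$2]" -v key="$3" '
    { sub(/[ \t\r]+$/, "") }                      # drop trailing whitespace
    /^;/  { next }                                # comment line
    /^\[/ { in_section = ($0 == section); next }  # section header
    in_section {
      eq = index($0, "=")
      if (eq > 0 && substr($0, 1, eq - 1) == key) {
        print substr($0, eq + 1)
        exit
      }
    }' "$1"
}

if rc=$(find_ncbirc); then
  echo "BLASTDB=$(ini_value "$rc" BLAST BLASTDB)"
else
  echo "no .ncbirc found; default values apply"
fi
```

With the example file above saved as .ncbirc in the current directory, the last command would report the BLASTDB path /home/guest/blast/db.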

4.7.1 Memory usage

The BLAST search programs can exhaust all memory on a machine if the input is too large or if there are too many hits to the BLAST database. If this is the case, please see your operating system documentation to limit the memory used by a program (e.g.: ulimit on Unix-like platforms).
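
As one possible way of applying ulimit, the sketch below runs a command under a virtual-memory cap inside a subshell, so the limit does not persist in the calling shell. The helper name run_with_mem_limit, the 8 GiB figure, and the file names in the commented example are assumptions for illustration; whether the -v limit is honoured depends on the operating system and shell.

```shell
#!/bin/sh
# Run a command with a cap on virtual memory, given in KiB.
# The subshell keeps the limit from affecting the calling shell.
run_with_mem_limit() {  # usage: run_with_mem_limit KIB COMMAND [ARGS...]
  ( ulimit -v "$1" && shift && exec "$@" )
}

# Example (hypothetical file names): cap a search at about 8 GiB,
# i.e. 8 * 1024 * 1024 KiB.
# run_with_mem_limit 8388608 blastn -db nt -query query.fa -out results.txt
```

If the search exceeds the cap it is terminated with an allocation failure instead of exhausting the memory of the whole machine.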

4.8 Input formats to BLAST

4.8.1 Multiple sequence alignment

The -in_msa psiblast option provides a way to jump start psiblast from a master-slave multiple sequence alignment computed outside psiblast. The multiple sequence alignment must contain the query sequence as one of its sequences, but it need not be the first sequence. The multiple sequence alignment must be specified in a format that is derived from Clustal, but without some headers and trailers (see example below).

The rules can also be stated in words. Suppose the multiple sequence alignment has N sequences. It may be presented in one or more blocks, where each block presents a range of columns from the multiple sequence alignment. E.g., the first block might have columns 1-60, the second block might have columns 61-95, the third block might have columns 96-128. Each block should have N rows, one row per sequence. The sequences should be in the same order in every block. Blocks are separated by one or more blank lines. Within a block there are no blank lines, and each line consists of one sequence identifier followed by some whitespace followed by characters (and gaps) for that sequence in the multiple sequence alignment. In each column, all letters must be in upper case, or all letters must be in lower case.

# Example multiple sequence alignment file
 align1
------
26SPS9_Hs     IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgkla
F57B9_Ce      LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsakl
YDL097c_Sc    ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvkniln
YMJ5_Ce       LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaake
FUS6_ARATH    KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrc
COS41.8_Ci    SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrr
644879        KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqt
YPR108w_Sc    IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspe
eif-3p110_Hs  SKAMKMGDWKTCHSFIINEKMNGkvw----------------------------------
T23D8.4_Ce    SKAMLNGDWKKCQDYIVNDKMNQkvw----------------------------------
YD95_Sp       IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspe
KIAA0107_Hs   LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgae
F49C12.8_Hs   LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircne
Int-6_Mm      KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlket


26SPS9_Hs     lryagrqtealkcvaqasknrsladfekaltdy---------------------------
F57B9_Ce      alkyngsdldamkaiaaaaqkrslkdfqvafgsf--------------------------
YDL097c_Sc    akytketyqsrgidamkavaeaynnrslldfntalkqy----------------------
YMJ5_Ce       ivayqkspriiairsmadafrkrslkdfvkalaeh-------------------------
FUS6_ARATH    asglahlelkkyklaarkfldvnpelgnsyneviapqdiatygglcalasfdrselkqkv
COS41.8_Ci    kfleaaqrynelsyksaiheteqtkalekalncailapagqqrsrmlatlfkdercqllp
644879        qailtklkcaaglaelaarkykqaakclllasfdhcdfpellspsnvaiygglcalatfd
YPR108w_Sc    llslisttaalqsissltislyasdyasyfpyllety-----------------------
eif-3p110_Hs  ------------------------------------------------------------
T23D8.4_Ce    ------------------------------------------------------------
YD95_Sp       vlavlpqnesmssleacinslylcdysgffrtladve-----------------------
KIAA0107_Hs   ilevlhslpavrqylfslyecrysvffqslavv---------------------------
F49C12.8_Hs   vqeqltggglngtlipvreylesyydchydrffiqlaale--------------------
Int-6_Mm      idnnsvssplqslqqrtwlihwslfvffnhpkgrdniidlflyqpqylnaiqtmcphilr


26SPS9_Hs     ------------------------------------------------------------
F57B9_Ce      ------------------------------------------------------------
YDL097c_Sc    ------------------------------------------------------------
YMJ5_Ce       ------------------------------------------------------------
FUS6_ARATH    idninfrnflelvpdvrelindfyssryascleylasl----------------------
COS41.8_Ci    sfgilekmfldriiksdemeefar------------------------------------
644879        rqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem--------------
YPR108w_Sc    ------------------------------------------------------------
eif-3p110_Hs  ------------------------------------------------------------
T23D8.4_Ce    ------------------------------------------------------------
YD95_Sp       ------------------------------------------------------------
KIAA0107_Hs   ------------------------------------------------------------
F49C12.8_Hs   ------------------------------------------------------------
Int-6_Mm      ylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkklrec


26SPS9_Hs     RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMI
F57B9_Ce      PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMI
YDL097c_Sc    EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMI
YMJ5_Ce       KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMI
FUS6_ARATH    KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALI
COS41.8_Ci    QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMI
644879        KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLI
YPR108w_Sc    ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFI
eif-3p110_Hs  DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMI
T23D8.4_Ce    NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMI
YD95_Sp       VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFI
KIAA0107_Hs   EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFI
F49C12.8_Hs   SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLI
Int-6_Mm      ESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLI


26SPS9_Hs     LDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce      LDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc    LDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce       LDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH    TDNQIQARIDSHNKILYARHADQ
COS41.8_Ci    CESRMKGHIDQIDGIVFFERRET
644879        LEGLISARVDSHSKILYARDVDQ
YPR108w_Sc    PNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs  INEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce    IQEELSATLDEPTDCLIMHRVEP
YD95_Sp       PDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs   AAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs   ATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm      RNARLDAKIDSKLGHVVMGNNAV
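
The block rules stated above can be checked mechanically before handing a file to psiblast. The following is a rough sketch, not a tool shipped with BLAST+: the helper check_msa and the file demo_msa.txt are hypothetical, the requirement that rows within a block have equal length is an added assumption, and the input is expected to contain only the alignment blocks (no title or comment lines).

```shell
#!/bin/sh
# Check that every block has the same identifiers in the same order and
# that no column mixes upper-case and lower-case letters.
check_msa() {  # usage: check_msa FILE
  LC_ALL=C awk '
    function fail(msg) { printf "line %d: %s\n", NR, msg; bad = 1 }

    # Validate the rows collected for the block that just ended.
    function end_block(    r, c, len, ch, kind, seen) {
      if (row == 0) return
      if (blocks == 0) nseq = row
      else if (row != nseq) fail("block has " row " rows, expected " nseq)
      len = length(seg[1])
      for (r = 2; r <= row; r++)
        if (length(seg[r]) != len) fail("rows in block differ in length")
      for (c = 1; c <= len; c++) {
        seen = ""
        for (r = 1; r <= row; r++) {
          ch = substr(seg[r], c, 1)
          if (ch ~ /[A-Z]/) kind = "upper"
          else if (ch ~ /[a-z]/) kind = "lower"
          else continue                       # gap or other symbol
          if (seen == "") seen = kind
          else if (seen != kind) { fail("a column mixes upper and lower case"); break }
        }
      }
      blocks++
      row = 0
    }

    NF == 0 { end_block(); next }             # blank line separates blocks
    NF != 2 { fail("expected: identifier, whitespace, sequence"); next }
    {
      row++
      if (blocks == 0) id[row] = $1           # first block fixes the row order
      else if (id[row] != $1) fail("identifier " $1 " is out of order")
      seg[row] = $2
    }
    END { end_block(); exit bad + 0 }
  ' "$1"
}

# Tiny two-block demonstration alignment (made-up sequences).
cat > demo_msa.txt <<'EOF'
seqA  MKVLAagt
seqB  MRVLSa-t

seqA  LLKE
seqB  LIRE
EOF
check_msa demo_msa.txt && echo "demo_msa.txt follows the block rules"
```

The function prints one message per violation and returns a non-zero status if any rule is broken, so it can guard a later psiblast -in_msa run.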

5. Cookbook

5.1 Query a BLAST database with a GI, but exclude that GI from the results

Extract a GI from the ecoli database:
$ blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | \
  tee exclude_me 
1786181
Run the restricted database search, which shows there are no self-hits:
$ blastn -db ecoli -negative_gilist exclude_me -show_gis -num_alignments 0 \
  -query exclude_me | grep `cat exclude_me`
Query= gi|1786181|gb|AE000111.1|AE000111 
$

5.2 Create a masked BLAST database

Creating a masked BLAST database is a two-step process:

a. Generate the masking data using a sequence filtering utility like windowmasker or dustmasker
b. Generate the actual BLAST database using makeblastdb

For both steps, the input file can be a text file containing sequences in FASTA format, or an existing BLAST database created using makeblastdb. We will provide examples for both scenarios.

5.2.1 Collect mask information files

For nucleotide sequence data in FASTA files or BLAST database format, we can generate the mask information files using windowmasker or dustmasker. Windowmasker masks the over-represented sequence data and it can also mask the low complexity sequence data using the built-in dust algorithm (through the -dust option). To mask low-complexity sequences only, we will need to use dustmasker.

For protein sequence data in FASTA files or BLAST database format, we need to use segmasker to generate the mask information file.

The following examples assume that the BLAST databases listed in 5.2.3 are available in the current working directory. Note that sequence id parsing should be used consistently. In all our examples, we enable this function by including “-parse_seqids” in the command line arguments.

5.2.1.1 Create masking information using dustmasker

We can generate the masking information with dustmasker using a single command line:

$ dustmasker -in hs_chr -infmt blastdb -parse_seqids \
  -outfmt maskinfo_asn1_bin -out hs_chr_dust.asnb

Here we specify that the input is a BLAST database named hs_chr (-in hs_chr -infmt blastdb), enable sequence id parsing (-parse_seqids), request the mask data in binary asn.1 format (-outfmt maskinfo_asn1_bin), and name the output file hs_chr_dust.asnb (-out hs_chr_dust.asnb).

If the input is the original FASTA file, hs_chr.fa, we need to change the -in and -infmt options as follows:

$ dustmasker -in hs_chr.fa -infmt fasta -parse_seqids \
  -outfmt maskinfo_asn1_bin -out hs_chr_dust.asnb

5.2.1.2 Create masking information using windowmasker

To generate the masking information using windowmasker from the BLAST database hs_chr, we first need to generate a counts file:

$ windowmasker -in hs_chr -infmt blastdb -mk_counts \
  -parse_seqids -out hs_chr_mask.counts

Here we specify the input BLAST database (-in hs_chr -infmt blastdb), request it to generate the counts (-mk_counts) with sequence id parsing (-parse_seqids), and save the output to a file named hs_chr_mask.counts (-out hs_chr_mask.counts).

To use the FASTA file hs_chr.fa to generate the counts, we need to change the input file name and format:

$ windowmasker -in hs_chr.fa -infmt fasta -mk_counts \
  -parse_seqids -out hs_chr_mask.counts

With the counts file we can then proceed to create the file containing the masking information as follows:

$ windowmasker -in hs_chr -infmt blastdb -ustat hs_chr_mask.counts \
  -outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb

Here we need to use the same input (-in hs_chr -infmt blastdb) and the counts file produced in the first step (-ustat hs_chr_mask.counts). We set the mask file format to binary asn.1 (-outfmt maskinfo_asn1_bin), enable sequence id parsing (-parse_seqids), and save the masking data to hs_chr_mask.asnb (-out hs_chr_mask.asnb).

To use the FASTA file hs_chr.fa, we change the input file name and file type:

$ windowmasker -in hs_chr.fa -infmt fasta -ustat hs_chr_mask.counts \
  -outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb

5.2.1.3 Create masking information using segmasker

We can generate the masking information with segmasker using a single command line:

$ segmasker -in refseq_protein -infmt blastdb -parse_seqids \
  -outfmt maskinfo_asn1_bin -out refseq_seg.asnb

Here we specify the refseq_protein BLAST database (-in refseq_protein -infmt blastdb), enable sequence id parsing (-parse_seqids), request the mask data in binary asn.1 format (-outfmt maskinfo_asn1_bin), and name the output file refseq_seg.asnb (-out refseq_seg.asnb).

If the input is a FASTA file, we need to change the command line to specify the input format:

$ segmasker -in refseq_protein.fa -infmt fasta -parse_seqids \
  -outfmt maskinfo_asn1_bin -out refseq_seg.asnb

5.2.1.4 Extract masking information from FASTA sequences with lowercase masking

We can also extract the masking information from a FASTA sequence file with lowercase masking (generated by various means) using the convert2blastmask utility. An example command line follows:

$ convert2blastmask -in hs_chr.mfa -parse_seqids -masking_algorithm repeat \
 -masking_options "repeatmasker, default" -outfmt maskinfo_asn1_bin \
 -out hs_chr_mfa.asnb

Here we specify the input file (-in hs_chr.mfa), enable parsing of sequence ids (-parse_seqids), specify the masking algorithm name (-masking_algorithm repeat) and its parameters (-masking_options "repeatmasker, default"), and ask for binary asn.1 output (-outfmt maskinfo_asn1_bin) to be saved in the specified file (-out hs_chr_mfa.asnb).

5.2.2 Create BLAST database with the masking information

Using the masking information data files generated in steps 5.2.1.1, 5.2.1.2, 5.2.1.3, and 5.2.1.4, we can create BLAST databases with masking information incorporated.

Note: we should use “-parse_seqids” in a consistent manner – either use it in both steps or not use it at all.

5.2.2.1 Create BLAST database with masking information using an existing BLAST database or FASTA sequence file as input

For example, we can use the following command line to apply the masking information, created in step 5.2.1.2, to the existing BLAST database generated in 5.2.3:

$ makeblastdb -in hs_chr -dbtype nucl -parse_seqids \
 -mask_data hs_chr_mask.asnb -out hs_chr -title \
 "Human Chromosome, Ref B37.1"

Here, we use the existing BLAST database as the input file (-in hs_chr), specify its type (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data from step 5.2.1.2 (-mask_data hs_chr_mask.asnb), and name the output database with the same base name (-out hs_chr), overwriting the existing one.

To use the original FASTA sequence file (hs_chr.fa) as the input, we need to use “-in hs_chr.fa” to instruct makeblastdb to use that FASTA file instead.

We can check the “re-created” database to find out if the masking information was added properly, using blastdbcmd with the following command line:

$ blastdbcmd -db hs_chr -info

This command prints out a summary of the target database:

Database: human chromosomes, Ref B37.1
        24 sequences; 3,095,677,412 total bases


Date: Aug 13, 2009  3:02 PM     Longest sequence: 249,250,621 bases


Available filtering algorithms applied to database sequences:


Algorithm ID  Algorithm name      Algorithm options                       
    30        windowmasker                                                


Volumes:
        /export/home/tao/blast_test/hs_chr

The extra lines under “Available filtering algorithms …” describe the masking algorithms available. The “Algorithm ID” field, 30 in our case, is what we need to use if we want to invoke database soft masking during an actual search through the “-db_soft_mask” parameter.

We can apply additional masking data to an existing BLAST database that already has one type of masking information added. For example, to apply the dust masking generated in step 5.2.1.1 to the database generated in step 5.2.2.1, we can use this command line:

$ makeblastdb -in hs_chr -dbtype nucl -parse_seqids -mask_data \
  hs_chr_dust.asnb -out hs_chr -title "Human Chromosome, Ref B37.1"

Here, we use the existing database as the input file (-in hs_chr), specify its type (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data from step 5.2.1.1 (-mask_data hs_chr_dust.asnb), and name the database with the same base name (-out hs_chr), overwriting the existing one.

Checking the “re-generated” database with blastdbcmd:

$ blastdbcmd -db hs_chr -info

we can see that both sets of masking information are available:

Database: Human Chromosome, Ref B37.1
        24 sequences; 3,095,677,412 total bases


Date: Aug 25, 2009  4:43 PM     Longest sequence: 249,250,621 bases


Available filtering algorithms applied to database sequences:


Algorithm ID  Algorithm name      Algorithm options                       
    11        dust                window=64; level=20; linker=1           
    30        windowmasker                                                


Volumes:
        /net/gizmo4/export/home/tao/blast_test/hs_chr

A more straightforward approach is to apply multiple sets of masking information in a single makeblastdb run by providing the masking data files in a comma-delimited list:

$ makeblastdb -in hs_chr -dbtype nucl -parse_seqids \
  -mask_data hs_chr_dust.asnb,hs_chr_mask.asnb -out hs_chr

5.2.2.2 Create a protein BLAST database with masking information

We can use the masking data file generated in step 5.2.1.3 to create a protein BLAST database:

$ makeblastdb -in refseq_protein -dbtype prot -parse_seqids \
 -mask_data refseq_seg.asnb -out refseq_protein -title \
 "RefSeq Protein Database"

Using blastdbcmd, we can check the database thus generated:

$ blastdbcmd -db refseq_protein -info

This produces the following summary, which includes the masking information:

Database: RefSeq Protein Database
        7,044,477 sequences; 2,469,203,411 total residues


Date: Sep 1, 2009  10:50 AM     Longest sequence: 36,805 residues


Available filtering algorithms applied to database sequences:


Algorithm ID  Algorithm name      Algorithm options                       
    21        seg                 window=12; locut=2.2; hicut=2.5         


Volumes:
        /export/home/tao/blast_test/refseq_protein2.00
        /export/home/tao/blast_test/refseq_protein2.01
        /export/home/tao/blast_test/refseq_protein2.02

5.2.2.3 Create a nucleotide BLAST database using the masking information extracted from lower case masked FASTA file

We use the following command line, which is very similar to that given in 5.2.2.1.

$ makeblastdb -in hs_chr.mfa -dbtype nucl -parse_seqids -mask_data \
  hs_chr_mfa.asnb -out hs_chr_mfa -title "Human chromosomes (mfa)"

Here we use the lowercase masked FASTA sequence file as input (-in hs_chr.mfa), specify the database as nucleotide (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data (-mask_data hs_chr_mfa.asnb), and name the resulting database hs_chr_mfa (-out hs_chr_mfa).

Checking the database thus generated using blastdbcmd, we have:

Database: Human chromosomes (mfa)
        24 sequences; 3,095,677,412 total bases


Date: Aug 26, 2009  11:41 AM    Longest sequence: 249,250,621 bases


Available filtering algorithms applied to database sequences:


Algorithm ID  Algorithm name      Algorithm options                       
    40        repeat              repeatmasker lowercase                  


Volumes:
        /export/home/tao/hs_chr_mfa

The algorithm name and algorithm options are the values we provided in step 5.2.1.4.

5.2.3 Obtaining Sample data for this cookbook entry

For input nucleotide sequences, we use the BLAST database generated from a FASTA input file hs_chr.fa, containing complete human chromosomes from BUILD37.1, generated by inflating and combining the hs_ref_*.fa.gz files located at:

ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/

We use this command line to create the BLAST database from the input nucleotide sequences:

$ makeblastdb -in hs_chr.fa -dbtype nucl -parse_seqids -out hs_chr \
  -title "Human chromosomes, Ref B37.1"

For input nucleotide sequences with lowercase masking, we use the FASTA file hs_chr.mfa, containing the complete human chromosomes from BUILD37.1, generated by inflating and combining the hs_ref_*.mfa.gz files located in the same ftp directory.

For input protein sequences, we use the preformatted refseq_protein database from the NCBI blast/db/ ftp directory:

ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.00.tar.gz

ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.01.tar.gz

ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.02.tar.gz

5.3 Search the database with database soft masking information

To enable the database masking during a BLAST search, we need to get the Algorithm ID using the -info parameter of blastdbcmd. For the database generated in step 5.2.2.1, we can use the following command line to activate the database soft masking created by windowmasker:

$ blastn -query HTT_gene -task megablast -db hs_chr -db_soft_mask 30 \
  -outfmt 7 -out HTT_megablast_mask.out -num_threads 4

Here, we use the blastn program to search a nucleotide query HTT_gene* (-query HTT_gene) with the megablast algorithm (-task megablast) against the database created in step 5.2.2.1 (-db hs_chr). We invoke the soft database masking (-db_soft_mask 30), set the result format to tabular output (-outfmt 7), and save the result to a file named HTT_megablast_mask.out (-out HTT_megablast_mask.out). We also activate the multi-thread feature of blastn to speed up the search by using 4 CPUs** (-num_threads 4).

*This is a genomic fragment containing the HTT gene from human, including 5 kb up- and down-stream of the transcribed region. It is represented by NG_009378.

**The number to use in your run will depend on the number of CPUs your system has.

In a test run on a 64-bit Linux machine, the above search takes 9.828 seconds real time, while the same run without database soft masking invoked takes 31 minutes 44.651 seconds.

5.4 Extract all human sequences from the nr database

Although one cannot select GIs by taxonomy from a database, a combination of unix command line tools will accomplish this:

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt

The first blastdbcmd invocation prints two fields per sequence (GI and taxonomy ID), the awk command selects from that output the sequences which have a taxonomy ID of 9606 (human) and prints their GIs, and finally the second blastdbcmd invocation uses those GIs to print the sequence data for the human sequences in the nr database.
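
The filtering stage of this pipeline can be tried on its own, without a database. The sketch below wraps the awk step in a hypothetical helper, select_gis_by_taxid, and feeds it a few hand-written lines in the same "GI taxonomy-ID" layout; the GIs shown are placeholders, not real identifiers.

```shell
#!/bin/sh
# Read "GI TAXID" lines on standard input and print the GIs whose
# taxonomy ID equals the one given as the first argument.
select_gis_by_taxid() {  # usage: select_gis_by_taxid TAXID
  awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Placeholder input standing in for: blastdbcmd -db nr -entry all -outfmt "%g %T"
printf '%s\n' '111 9606' '222 10090' '333 9606' | select_gis_by_taxid 9606
```

This prints 111 and 333, one per line. Passing a different taxonomy ID (for example 10090 for mouse) selects another organism without editing the awk program.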

5.5 Custom data extraction and formatting from a BLAST database

The following examples show how to extract selected information from a BLAST database and how to format it:

Extract the accession, sequence length, and masked locations for GI 71022837:
$ blastdbcmd -entry 71022837 -db Test/mask-data-db  -outfmt "%a %l %m"
XP_761648.1 1292 119-139;140-144;147-152;154-160;161-216;


Extract the masked FASTA for GI 71022837:
$ blastdbcmd -entry 71022837 -db Test/mask-data-db \
-mask_sequence_with 20 -target_only
>gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1
MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPP
PSTHGGPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavqklslaNEaaw
vsiGsaaetmedydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdtWGSMGHCY
LMMDDLQRAYTAYQQALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQ
QNKFPASLECFRYILDNPPRPLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAG
FNNQERAIQFLTKSLESDPNDAQSWYLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDA
LDAYSRAIRLNPYISEVWFDLGSLYEACNNQISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVP
QDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLVGPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGT
DDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPGPPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPP
HVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAGPGGPPPPLDHYGRPMGGPMSEREREMEWER
EREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVYGRDEGRDYYNNSHPGSGPGGPRGGY
ERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFRGPPPGLGRPTPDWERTRAGE
YGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSPWSAKGGVAAPGKNGK
ASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLAARAPSSRMVD
EDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPEDSYKRA
KSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD
GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIH
EDPIDAPAKRADEDGAK

5.6 Display BLAST search results with custom output format

The -outfmt option permits formatting arbitrary fields from the BLAST tabular format. Use the -help option on the command-line application (e.g., blastn) to see the supported fields. The -max_target_seqs option should be used with any tabular output to control the number of matches reported.

5.6.1 Example of custom output format

The following example shows how to display the results of a BLAST search using a custom output format. The tabular output format with comments is used, but only the query accession, subject accession, evalue, query start, query stop, subject start, and subject stop are requested. For brevity, only the first 10 lines of output are shown:

$ echo 1786181 | ./blastn -db ecoli \
  -outfmt "7 qacc sacc evalue qstart qend sstart send"
# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111 
# Database: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 85 hits found
AE000111        AE000111        0.0     1       10596   1       10596
AE000111        AE000174        8e-30   5565    5671    6928    6821
AE000111        AE000394        1e-27   5587    5671    135     219
AE000111        AE000425        6e-26   5587    5671    8552    8468
AE000111        AE000171        3e-24   5587    5671    2214    2130
$
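
The commented tabular output above is straightforward to post-process. The following is a minimal sketch, not part of the BLAST+ package: the function name parse_tabular and the FIELDS list are illustrative, and it assumes that none of the requested columns contains whitespace, which holds for the seven fields in this example.

```python
from typing import Dict, List

# Column keys, in the same order as the -outfmt string used in the example.
FIELDS = ["qacc", "sacc", "evalue", "qstart", "qend", "sstart", "send"]


def parse_tabular(text: str, fields: List[str] = FIELDS) -> List[Dict[str, str]]:
    """Turn commented tabular output into one dict per hit line."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, '#' comment lines, and a stray shell prompt.
        if not line or line.startswith("#") or line == "$":
            continue
        values = line.split()
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} columns: {line!r}")
        hits.append(dict(zip(fields, values)))
    return hits


SAMPLE = """\
# BLASTN 2.2.18+
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
AE000111        AE000111        0.0     1       10596   1       10596
AE000111        AE000174        8e-30   5565    5671    6928    6821
"""

hits = parse_tabular(SAMPLE)
# Keep only hits to other sequences with an evalue of at most 1e-25.
strong = [h for h in hits if h["qacc"] != h["sacc"] and float(h["evalue"]) <= 1e-25]
print(strong)
```

Values are kept as strings so that nothing is lost in conversion; convert the coordinate and evalue columns with int() and float() where needed.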

5.6.2 Trace-back operations (BTOP)

The “Blast trace-back operations” (BTOP) string describes the alignment produced by BLAST. This string is similar to the CIGAR string produced in SAM format, but there are important differences. BTOP is a more flexible format that lists not only the aligned region but also matches and mismatches. BTOP operations consist of 1.) a number with a count of matching letters, 2.) two letters showing a mismatch (e.g., “AG” means A in the query was replaced by G in the subject), or 3.) a dash (“-”) and a letter showing a gap. The box below shows a blastn run first with BTOP output and then the same run with the BLAST report showing the alignments.

$ blastn -query test_q.fa -subject test_s.fa -dust no -outfmt "6 qseqid sseqid btop" -parse_deflines
query1  q_multi 7AG39
query1  q_multi 7A-39
query1  q_multi 6-G-A41
$ blastn -query test_q.fa -subject test_s.fa -dust no -parse_deflines
BLASTN 2.2.24+


Query= query1 
Length=47


Subject=  
Length=142


 Score = 82.4 bits (44),  Expect = 9e-22
 Identities = 46/47 (97%), Gaps = 0/47 (0%)
 Strand=Plus/Plus


Query  1   ACGTCCGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
           ||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct  47  ACGTCCGGGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  93




 Score = 80.5 bits (43),  Expect = 3e-21
 Identities = 46/47 (97%), Gaps = 1/47 (2%)
 Strand=Plus/Plus


Query  1   ACGTCCGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
           ||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct  1   ACGTCCG-GACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  46




 Score = 78.7 bits (42),  Expect = 1e-20
 Identities = 47/49 (95%), Gaps = 2/49 (4%)
 Strand=Plus/Plus


Query  1    ACGTCC--GAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  47
            ||||||  |||||||||||||||||||||||||||||||||||||||||
Sbjct  94   ACGTCCGAGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA  142
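
The three BTOP strings in the box can be decoded mechanically. The sketch below is an illustration only, assuming the nucleotide case shown above, where every operation is either a run of digits or a two-character pair with the query letter first; parse_btop and summarize are hypothetical helper names, not BLAST+ functions.

```python
import re
from typing import Dict, List, Tuple

# A BTOP token is a run of digits (match count) or a two-character pair
# (query letter followed by subject letter, with '-' marking a gap).
_TOKEN = re.compile(r"(\d+)|([A-Za-z*-]{2})")


def parse_btop(btop: str) -> List[Tuple[str, str]]:
    """Split a BTOP string into (kind, value) operations."""
    ops = []
    pos = 0
    while pos < len(btop):
        match = _TOKEN.match(btop, pos)
        if match is None:
            raise ValueError(f"unexpected BTOP text at offset {pos}: {btop[pos:]!r}")
        count, pair = match.groups()
        if count is not None:
            ops.append(("match", count))
        elif pair[0] == "-":
            ops.append(("gap_in_query", pair))
        elif pair[1] == "-":
            ops.append(("gap_in_subject", pair))
        else:
            ops.append(("mismatch", pair))
        pos = match.end()
    return ops


def summarize(btop: str) -> Dict[str, int]:
    """Count identities, alignment columns, and gap columns."""
    identities = columns = gaps = 0
    for kind, value in parse_btop(btop):
        if kind == "match":
            identities += int(value)
            columns += int(value)
        else:
            columns += 1
            if kind != "mismatch":
                gaps += 1
    return {"identities": identities, "columns": columns, "gaps": gaps}


for btop in ("7AG39", "7A-39", "6-G-A41"):
    print(btop, summarize(btop))
```

The counts derived this way agree with the Identities and Gaps lines of the three alignments in the report (46/47 with no gaps, 46/47 with one gap, and 47/49 with two gaps).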

5.7 Use blastdb_aliastool to manage the BLAST databases

Often we need to search multiple databases together or wish to search a specific subset of sequences within an existing database. At the BLAST search level, we can provide multiple database names to the “-db” parameter, or provide a GI file specifying the desired subset to the “-gilist” parameter. However, a more convenient way to conduct these types of searches is to create virtual BLAST databases for them. Note: When combining BLAST databases, all the databases must be of the same molecule type. The following examples assume that the two databases as well as the GI file are in the current working directory.

5.7.1 Aggregate existing BLAST databases

To combine the two nematode nucleotide databases, named “nematode_mrna” and “nematode_genomic”, we use the following command line:

$ blastdb_aliastool -dblist "nematode_mrna nematode_genomic" -dbtype nucl \
  -out nematode_all -title "Nematode RefSeq mRNA + Genomic"

5.7.2 Create a subset of a BLAST database

The nematode_mrna database contains RefSeq mRNAs for several species of roundworms. The best subset is from C. elegans. In most cases, we want to search this subset instead of the complete collection. Since the database entries are from NCBI nucleotide databases and the database is formatted with “-parse_seqids”, we can use the “-gilist c_elegance_mrna.gi” parameter/value pair to limit the search to the subset of interest. Alternatively, we can create a subset of the nematode_mrna database as follows:

$ blastdb_aliastool -db nematode_mrna -gilist c_elegance_mrna.gi \
  -dbtype nucl -out c_elegance_mrna -title "C. elegans refseq mRNA entries"

Note: one can also specify multiple databases using the -db parameter of blastdb_aliastool.
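
When these commands are issued from a script rather than typed at a prompt, it can help to build the argument lists programmatically. The sketch below only assembles the arguments shown in sections 5.7.1 and 5.7.2; the function names aggregate_args and subset_args are hypothetical, and running the result requires blastdb_aliastool to be on the PATH.

```python
from typing import List, Sequence


def aggregate_args(databases: Sequence[str], out: str, title: str,
                   dbtype: str = "nucl") -> List[str]:
    """Arguments for combining several databases into one virtual database."""
    return ["blastdb_aliastool",
            "-dblist", " ".join(databases),  # one argument, space-separated names
            "-dbtype", dbtype, "-out", out, "-title", title]


def subset_args(database: str, gilist: str, out: str, title: str,
                dbtype: str = "nucl") -> List[str]:
    """Arguments for restricting a database to the GIs listed in a file."""
    return ["blastdb_aliastool", "-db", database, "-gilist", gilist,
            "-dbtype", dbtype, "-out", out, "-title", title]


combined = aggregate_args(["nematode_mrna", "nematode_genomic"],
                          "nematode_all", "Nematode RefSeq mRNA + Genomic")
subset = subset_args("nematode_mrna", "c_elegance_mrna.gi",
                     "c_elegance_mrna", "C. elegans refseq mRNA entries")
print(combined)
print(subset)
# To execute: import subprocess; subprocess.run(combined, check=True)
```

Passing a list to subprocess.run bypasses the shell, so the title and the database list need no extra quoting.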

5.8 Reformat BLAST reports with blast_formatter

It may be helpful to view the same BLAST results in different formats. A user may first parse the tabular format looking for matches meeting certain criteria, then go back and examine the relevant alignments in the full BLAST report. The user may also first look at pair-wise alignments, then decide to use a query-anchored view. Viewing a BLAST report in different formats has been possible on the NCBI BLAST web site since 2000, but has not been possible with stand-alone BLAST runs. The blast_formatter allows this, if the original search produced blast archive format using the -outfmt 11 switch. The query sequence, the BLAST options, the masking information, the name of the database, and the alignment are written out as ASN.1 (a structured format similar to XML). The -max_target_seqs option should be used to control the number of matches recorded in the alignment. The blast_formatter reads this information and formats a report. The BLAST database used for the original search must be available, or the sequences need to be fetched from the NCBI, assuming the database contains sequences in the public dataset. The box below illustrates the procedure. A blastn run first produces the BLAST archive format, and the blast_formatter then reads the file and produces tabular output.

Blast_formatter will format stand-alone searches performed with an earlier version of a database if both the search and formatting databases are prepared so that fetching by sequence ID is possible. To enable fetching by sequence ID use the -parse_seqids flag when running makeblastdb, or (if available) download preformatted BLAST databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ using update_blastdb.pl (provided as part of the BLAST+ package). Currently the blast archive format and blast_formatter do not work with database-free searches (i.e., -subject rather than -db was used for the original search).

$ echo 1786181 | blastn -db ecoli -outfmt 11 -out out.1786181.asn
$ blast_formatter -archive out.1786181.asn -outfmt "7 qacc sacc evalue qstart qend sstart send"
# BLASTN 2.2.24+
# Query: gi|1786181|gb|AE000111.1|AE000111 Escherichia coli K-12 MG1655 section 1 of 400
# Database: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 85 hits found
AE000111        AE000111        0.0     1       10596   1       10596
AE000111        AE000174        8e-30   5565    5671    6928    6821
AE000111        AE000394        1e-27   5587    5671    135     219
AE000111        AE000425        6e-26   5587    5671    8552    8468
AE000111        AE000171        3e-24   5587    5671    2214    2130
AE000111        AE000171        1e-23   5587    5670    10559   10642
AE000111        AE000376        1e-22   5587    5675    129     42
AE000111        AE000268        1e-22   5587    5671    6174    6090
AE000111        AE000112        1e-22   10539   10596   1       58
AE000111        AE000447        5e-22   5587    5670    681     598
AE000111        AE000344        6e-21   5587    5671    4112    4196
AE000111        AE000490        2e-20   5584    5671    4921    4835
AE000111        AE000280        2e-20   5587    5670    12930   12847
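
The search-once, format-many workflow described in this section can be scripted. The following sketch only builds the command lines; the helper names are hypothetical, the query is read from a file with -query rather than from standard input as in the box above, and the query file name used here is a placeholder.

```python
from typing import List, Sequence, Tuple


def archive_search_args(query_file: str, db: str, archive_file: str) -> List[str]:
    """blastn run that stores its results in BLAST archive format (-outfmt 11)."""
    return ["blastn", "-query", query_file, "-db", db,
            "-outfmt", "11", "-out", archive_file]


def formatter_args(archive_file: str, outfmt: str, out_file: str) -> List[str]:
    """blast_formatter run that renders an existing archive in another format."""
    return ["blast_formatter", "-archive", archive_file,
            "-outfmt", outfmt, "-out", out_file]


def plan(query_file: str, db: str, archive_file: str,
         reports: Sequence[Tuple[str, str]]) -> List[List[str]]:
    """One search followed by one formatting step per (outfmt, out_file) pair."""
    commands = [archive_search_args(query_file, db, archive_file)]
    commands += [formatter_args(archive_file, fmt, out) for fmt, out in reports]
    return commands


commands = plan("query.fsa", "ecoli", "out.1786181.asn",
                [("7 qacc sacc evalue qstart qend sstart send", "hits.tsv"),
                 ("0", "report.txt")])
for args in commands:
    print(args)
# To execute: import subprocess; [subprocess.run(a, check=True) for a in commands]
```

The search itself runs only once; each additional report costs only a blast_formatter call against the saved archive.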

5.9 Extract lowercase masked FASTA from a BLAST database with masking information

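If a BLAST database contains masking information, the available masking algorithms can be listed with the blastdbcmd -info option, and a sequence with the chosen masking applied in lowercase can be extracted with the -mask_sequence_with option, as follows: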

$ blastdbcmd -info -db mask-data-db
Database: Mask data test
        10 sequences; 12,609 total residues


Date: Feb 17, 2009  5:10 PM     Longest sequence: 1,694 residues


Available filtering algorithms applied to database sequences:


Algorithm ID  Algorithm name      Algorithm options                       
    20        seg                 default options used                    
    40        repeat              -species Desmodus_rotundus              


Volumes:
        mask-data-db
$ blastdbcmd -db mask-data-db -mask_sequence_with 20 -entry 71022837
>gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1 [Ustilago maydis 521]
MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPPPSTHG
GPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavQklslANEaawvsIGsaaetm
EdydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdTWGSMGHCYLMMDDLQRAYTAYQQ
ALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQQNKFPASLECFRYILDNPPR
PLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAGFNNQERAIQFLTKSLESDPNDAQSW
YLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDALDAYSRAIRLNPYISEVWFDLGSLYEACNN
QISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVPQDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLV
GPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGTDDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPG
PPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPPHVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAG
PGGPPPPLDHYGRPMGGPMSEREREMEWEREREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVY
GRDEGRDYYNNSHPGSGPGGPRGGYERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFR
GPPPGLGRPTPDWERTRAGEYGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSP
WSAKGGVAAPGKNGKASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLA
ARAPSSRMVDEDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPED
SYKRAKSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD
GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIHEDPID
APAKRADEDGAK
$
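
Because the masked residues are reported in lowercase, the masked regions can be recovered from the FASTA text itself. The sketch below is one way to do so; the helper names are illustrative, it handles a single FASTA record, and the short demo record is invented for the example rather than taken from the database above.

```python
import re
from typing import List, Tuple


def read_single_fasta(text: str) -> Tuple[str, str]:
    """Return (definition line, sequence) for a one-record FASTA string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not start with a FASTA definition line")
    return lines[0][1:], "".join(lines[1:])


def masked_intervals(sequence: str) -> List[Tuple[int, int]]:
    """1-based, inclusive (start, stop) of every lowercase run."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", sequence)]


DEMO = """\
>demo hypothetical record
MPPSaggpagQKLSL
ANEaawvsIG
"""

defline, sequence = read_single_fasta(DEMO)
intervals = masked_intervals(sequence)
print(defline, intervals)
```

Calling sequence.upper() restores the unmasked sequence if it is needed afterwards.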

5.10 Display the locations where BLAST will search for BLAST databases

This is accomplished by using the -show_blastdb_search_path option in blastdbcmd:

$ blastdbcmd -show_blastdb_search_path
:/net/nabl000/vol/blast/db/blast1:/net/nabl000/vol/blast/db/blast2:
$
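
The search path is printed as a single colon-separated string. A minimal sketch for splitting it into directories follows; split_search_path is a hypothetical helper, the colon separator is taken from the output shown above, and empty entries (such as the one before the leading colon) are simply dropped because the text does not say what they stand for.

```python
from typing import List


def split_search_path(raw: str) -> List[str]:
    """Split the colon-separated search path, dropping empty entries."""
    return [part for part in raw.strip().split(":") if part]


directories = split_search_path(
    ":/net/nabl000/vol/blast/db/blast1:/net/nabl000/vol/blast/db/blast2:")
print(directories)
```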

5.11 Display the available BLAST databases in a given directory

This is accomplished by using the -list option in blastdbcmd:

$ blastdbcmd -list repeat  -recursive
repeat/repeat_3055 Nucleotide
repeat/repeat_31032 Nucleotide
repeat/repeat_35128 Nucleotide
repeat/repeat_3702 Nucleotide
repeat/repeat_40674 Nucleotide
repeat/repeat_4530 Nucleotide
repeat/repeat_4751 Nucleotide
repeat/repeat_6238 Nucleotide
repeat/repeat_6239 Nucleotide
repeat/repeat_7165 Nucleotide
repeat/repeat_7227 Nucleotide
repeat/repeat_7719 Nucleotide
repeat/repeat_7955 Nucleotide
repeat/repeat_9606 Nucleotide
repeat/repeat_9989 Nucleotide
$

The first column of the default output is the file name of the BLAST database (usually provided as the -db argument to other BLAST+ applications); the second column represents the molecule type of the BLAST database. This output is configurable via the -list_outfmt command line option.
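
The default two-column listing can be read into a script as follows. This is a sketch under the assumption that the molecule type is always the last whitespace-separated token on the line, so that database paths containing spaces are still handled; parse_db_list is a hypothetical name.

```python
from typing import List, Tuple


def parse_db_list(text: str) -> List[Tuple[str, str]]:
    """Return (database name, molecule type) pairs from the default -list output."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "$":  # skip blanks and a stray shell prompt
            continue
        name, molecule = line.rsplit(None, 1)  # split on the last whitespace run
        entries.append((name, molecule))
    return entries


LISTING = """\
repeat/repeat_3055 Nucleotide
repeat/repeat_9606 Nucleotide
"""

entries = parse_db_list(LISTING)
nucleotide_dbs = [name for name, molecule in entries if molecule == "Nucleotide"]
print(nucleotide_dbs)
```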

5.12 Use Windowmasker to filter your BLAST search

The following example shows the steps required to use BLAST and Windowmasker. In this example, we start with a FASTA file containing the sequences from the human build 36.3 (hs.36.3.fsa). The output is removed for brevity. Please note that steps 1-4 need to be performed only once to generate the windowmasker unit counts files and that, similarly, step 5 needs to be done only once to configure BLAST+.

1. Create the unit counts data (WinMask Stage 1)
$ windowmasker -in hs.36.3.fsa -infmt fasta -mk_counts true \
-sformat obinary -out wmasker.obinary


2. Create a masked version of the original FASTA file using dust
and the previously created unit counts data file (WinMask Stage 2)
$ windowmasker -in hs.36.3.fsa -out hs.36.3.masked.fsa -dust true \
-ustat wmasker.obinary -outfmt fasta


3. Create a masked BLAST database
$ makeblastdb -out human.36.3.masked -in hs.36.3.masked.fsa \
-dbtype nucl


4. Make masked index
$ makembindex -input hs.36.3.masked.fsa -iformat fasta -volsize 1024 \
-output human.36.3.masked


5. Install and configure BLAST databases
$ mkdir -p /usr/local/ncbi/blast/db /usr/local/ncbi/blast/windowmasker/9606
$ cp human.36.3.masked.* /usr/local/ncbi/blast/db
$ cp wmasker.obinary /usr/local/ncbi/blast/windowmasker/9606
$ echo "[BLAST]" > .ncbirc
$ echo "BLASTDB=/usr/local/ncbi/blast/db" >> .ncbirc
$ echo "[WINDOW_MASKER]" >> .ncbirc
$ echo "WINDOW_MASKER_PATH=/usr/local/ncbi/blast/windowmasker" >> .ncbirc


6. Run BLAST search using Windowmasker for sequence filtering
$ blastn -query input -db database -window_masker_taxid 9606 -out results.txt
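
The configuration written in step 5 follows a fixed pattern: a BLASTDB entry, a WINDOW_MASKER_PATH entry, and one subdirectory per taxonomy ID holding the unit counts file. The sketch below reproduces that layout as text; the function names are hypothetical and the paths are the ones used in the steps above.

```python
from pathlib import PurePosixPath


def ncbirc_text(blastdb_dir: str, window_masker_dir: str) -> str:
    """Contents of a .ncbirc file equivalent to the echo commands in step 5."""
    return (
        "[BLAST]\n"
        f"BLASTDB={blastdb_dir}\n"
        "[WINDOW_MASKER]\n"
        f"WINDOW_MASKER_PATH={window_masker_dir}\n"
    )


def counts_path(window_masker_dir: str, taxid: int,
                filename: str = "wmasker.obinary") -> str:
    """Location of the unit counts file for one taxonomy ID, as laid out in step 5."""
    return str(PurePosixPath(window_masker_dir) / str(taxid) / filename)


config = ncbirc_text("/usr/local/ncbi/blast/db", "/usr/local/ncbi/blast/windowmasker")
human_counts = counts_path("/usr/local/ncbi/blast/windowmasker", 9606)
print(config)
print(human_counts)
```

The taxonomy ID in the directory name (9606 here) is the value later passed to -window_masker_taxid in step 6.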

5.13 Building a BLAST database with local sequences

The makeblastdb application produces BLAST databases from FASTA files. In the simplest case the FASTA definition lines are not parsed by makeblastdb and may be completely unstructured. The text in the definition line will be stored in the BLAST database and displayed in the BLAST report, but it will not be possible to fetch individual sequences using blastdbcmd or to limit the search with the -seqidlist option. Use the -parse_seqids flag when invoking makeblastdb to enable retrieval of sequences based upon sequence identifiers. In this case, each sequence must have a unique identifier, and that identifier must have a specific format. The identifier should begin right after the “>” sign on the definition line, contain no spaces, and follow the formats described in http://www.ncbi.nlm.nih.gov/books/NBK7183/?rendertype=table&id=ch_demo.T5. User supplied sequences should make use of the local or general identifiers described in the above table. A FASTA file with general IDs would look like:

$ cat mydb.fsa
>gnl|MYDB|1 this is sequence 1
GAATTCCCGCTACAGGGGGGGCCTGAGGCACTGCAGAAAGTGGGCCTGAGCCTCGAGGATGACGGTGCTGCAGGAACCCG
TCCAGGCTGCTATATGGCAAGCACTAAACCACTATGCTTACCGAGATGCGGTTTTCCTCGCAGAACGCCTTTATGCAGAA
GTACACTCAGAAGAAGCCTTGTTTTTACTGGCAACCTGTTATTACCGCTCAGGAAAGGCATATAAAGCATATAGACTCTT
GAAAGGACACAGTTGTACTACACCGCAATGCAAATACCTGCTTGCAAAATGTTGTGTTGATCTCAGCAAGCTTGCAGAAG
GGGAACAAATCTTATCTGGTGGAGTGTTTAATAAGCAGAAAAGCCATGATGATATTGTTACTGAGTTTGGTGATTCAGCT
TGCTTTACTCTTTCATTGTTGGGACATGTATATTGCAAGACAGATCGGCTTGCCAAAGGATCAGAATGTTACCAAAAGAG
CCTTAGTTTAAATCCTTTCCTCTGGTCTCCCTTTGAATCATTATGTGAAATAGGTGAAAAGCCAGATCCTGACCAAACAT
TTAAATTCACATCTTTACAGAACTTTAGCAACTGTCTGCCCAACTCTTGCACAACACAAGTACCTAATCATAGTTTATCT
CACAGACAGCCTGAGACAGTTCTTACGGAAACACCCCAGGACACAATTGAATTAAACAGATTGAATTTAGAATCTTCCAA
>gnl|MYDB|2 this is sequence 2
GAATTCCCGCTACAGGGGGGGCCTGAGGCACTGCAGAAAGTGGGCCTGAGCCTCGAGGATGACGGTGCTGCAGGAACCCG
TCCAGGCTGCTATATGGCAAGCACTAAACCACTATGCTTACCGAGATGCGGTTTTCCTCGCAGAACGCCTTTATGCAGAA
GTACACTCAGAAGAAGCCTTGTTTTTACTGGCAACCTGTTATTACCGCTCAGGAAAGGCATATAAAGCATATAGACTCTT
GAAAGGACACAGTTGTACTACACCGCAATGCAAATACCTGCTTGCAAAATGTTGTGTTGATCTCAGCAAGCTTGCAGAAG
GGGAACAAATCTTATCTGGTGGAGTGTTTAATAAGCAGAAAAGCCATGATGATATTGTTACTGAGTTTGGTGATTCAGCT
TGCTTTACTCTTTCATTGTTGGGACATGTATATTGCAAGACAGATCGGCTTGCCAAAGGATCAGAATGTTACCAAAAGAG
CCTTAGTTTAAATCCTTTCCTCTGGTCTCCCTTTGAATCATTATGTGAAATAGGTGAAAAGCCAGATCCTGACCAAACAT
TTAAATTCACATCTTTACAGAACTTTAGCAACTGTCTGCCCAACTCTTGCACAACACAAGTACCTAATCATAGTTTATCT
CACAGACAGCCTGAGACAGTTCTTACGGAAACACCCCAGGACACAATTGAATTAAACAGATTGAATTTAGAATCTTCCAA
>gnl|MYDB|3 this is sequence 3
GAATTCCCGCTACAGGGGGGGCCTGAGGCACTGCAGAAAGTGGGCCTGAGCCTCGAGGATGACGGTGCTGCAGGAACCCG
TCCAGGCTGCTATATGGCAAGCACTAAACCACTATGCTTACCGAGATGCGGTTTTCCTCGCAGAACGCCTTTATGCAGAA
GTACACTCAGAAGAAGCCTTGTTTTTACTGGCAACCTGTTATTACCGCTCAGGAAAGGCATATAAAGCATATAGACTCTT
GAAAGGACACAGTTGTACTACACCGCAATGCAAATACCTGCTTGCAAAATGTTGTGTTGATCTCAGCAAGCTTGCAGAAG
GGGAACAAATCTTATCTGGTGGAGTGTTTAATAAGCAGAAAAGCCATGATGATATTGTTACTGAGTTTGGTGATTCAGCT
TGCTTTACTCTTTCATTGTTGGGACATGTATATTGCAAGACAGATCGGCTTGCCAAAGGATCAGAATGTTACCAAAAGAG
CCTTAGTTTAAATCCTTTCCTCTGGTCTCCCTTTGAATCATTATGTGAAATAGGTGAAAAGCCAGATCCTGACCAAACAT
TTAAATTCACATCTTTACAGAACTTTAGCAACTGTCTGCCCAACTCTTGCACAACACAAGTACCTAATCATAGTTTATCT
$

Makeblastdb can be invoked for this file as below.

$ makeblastdb -in mydb.fsa -parse_seqids -dbtype nucl


Building a new DB, current time: 01/28/2011 13:39:37
New DB name:   mydb.fsa
New DB title:  mydb.fsa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 3 sequences in 0.00206995 seconds.
$

The FASTA file has three entries. All entries are part of the “MYDB” database, with entry numbers 1, 2, and 3. Makeblastdb will store this information properly and produce an index, so that the sequences can be retrieved by these identifiers. Makeblastdb stores the title portion of the definition line (e.g., “this is sequence 1”), but will not parse it. If the first token after the “>” does not contain a bar (“|”) it will be parsed as a local ID. Use the full identifier string (e.g., “gnl|MYDB|2”) to retrieve sequences with a general ID.
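
A FASTA file with general IDs like mydb.fsa can be generated from in-house sequences. The sketch below numbers the records 1, 2, 3, ... within a database tag and wraps the sequence at 80 characters, as in the example file; general_id_fasta is a hypothetical helper, and the sequential numbering scheme is an assumption made for the illustration.

```python
from typing import Iterable, Tuple


def general_id_fasta(db_tag: str, records: Iterable[Tuple[str, str]],
                     width: int = 80) -> str:
    """Build FASTA text whose definition lines start with gnl|<db_tag>|<number>."""
    if not db_tag or any(ch.isspace() or ch == "|" for ch in db_tag):
        raise ValueError("database tag must be non-empty, without spaces or bars")
    lines = []
    for number, (title, sequence) in enumerate(records, start=1):
        # The identifier starts right after '>' and contains no spaces.
        lines.append(f">gnl|{db_tag}|{number} {title}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in lines)


fasta = general_id_fasta(
    "MYDB",
    [("this is sequence 1", "GAATTC"), ("this is sequence 2", "ACGT")],
    width=4,  # tiny width so the wrapping is visible in the demo
)
print(fasta, end="")
```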

The NCBI makes databases that are searchable on the NCBI web site (such as nr, refseq_rna, and swissprot) available on its FTP site. It is better to download the preformatted databases rather than starting with FASTA. The databases on the FTP site contain taxonomic information for each sequence, include the identifier indices for lookups, and can be up to four times smaller than the FASTA. The original FASTA can be generated from the BLAST database using blastdbcmd.

5.14 Limiting a Search with a List of Identifiers

BLAST can now limit a database search by a list of text identifiers, which are specified as whitespace-separated strings in a text file. These identifiers, referencing the sequences to include in the BLAST search, should not contain any whitespace and must be resolvable through the BLAST database ID lookup. In some cases this means that the entire bar-delimited format (specified in http://www.ncbi.nlm.nih.gov/books/NBK7183/?rendertype=table&id=ch_demo.T5) must be used. In other cases it is enough to simply specify an accession. For the “general” example from section 5.13 a valid ID would be “gnl|MYDB|2”. On the other hand, if the identifier is “gi|15674171|ref|NP_268346.1”, any one of the following strings is sufficient:

“gi|15674171|ref|NP_268346.1”, “15674171”, “ref|NP_268346”, “NP_268346”, “NP_268346.1”, etc.

When the search is limited by a list of IDs, the statistics of the BLAST database are re-calculated to reflect the actual number of sequences and residues/bases included in the search.

BLAST has been able to limit a search by a list of GIs for a number of years. It is important to note that the performance of a binary list of GIs will always be superior to a list of text IDs. The binary list of GIs can be formatted to require minimal conversion at run time. If all the sequences in the database have been assigned a GI, a binary list of GIs should be used rather than a list of accessions.
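
Before handing an identifier file to a search (for example through the -seqidlist option mentioned in section 5.13), it can be worth checking that no identifier contains whitespace. The sketch below writes one identifier per line and removes duplicates; format_seqid_list is a hypothetical helper, and it does not check that the identifiers actually resolve in a given database.

```python
from typing import Iterable


def format_seqid_list(identifiers: Iterable[str]) -> str:
    """Return file contents with one unique, whitespace-free identifier per line."""
    seen = set()
    lines = []
    for ident in identifiers:
        if not ident or any(ch.isspace() for ch in ident):
            raise ValueError(f"identifier is empty or contains whitespace: {ident!r}")
        if ident not in seen:  # keep the first occurrence, preserve input order
            seen.add(ident)
            lines.append(ident)
    return "".join(line + "\n" for line in lines)


contents = format_seqid_list(["gnl|MYDB|2", "NP_268346.1", "gnl|MYDB|2"])
print(contents, end="")
# To save: open("ids.txt", "w").write(contents)  # "ids.txt" is a placeholder name
```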

5.15 Multiple databases vs. spaces in filenames and paths

BLAST has been able to search multiple databases since 1997. The databases can be listed after the “-db” argument or in an alias file (see section on blastdb_aliastool), separated by spaces. Many operating systems now allow spaces in filenames and directory paths, so some care is required. In short, any path containing a space needs two sets of quotes. Blastdbcmd is used as an example below, but the same rules apply to makeblastdb as well as the search programs like blastn or blastp.

To access a BLAST database whose path contains spaces under Microsoft Windows it is necessary to use two sets of double-quotes, escaping the innermost quotes with a backslash. For example, a database named mydb in Users\joeuser\My Documents\Downloads would be accessed by:

blastdbcmd -db "\"Users\joeuser\My Documents\Downloads\mydb\"" -info

The first backslash escapes the beginning inner quote, and the backslash following “mydb” escapes the ending inner quote.

A second database can be added to this command by including it within the outer pair of quotes:

blastdbcmd -db "\"Users\joeuser\My Documents\Downloads\mydb\" myotherdb" -info

If the second database had contained a space, it would have been necessary to surround it by quotes escaped by a backslash.

Under UNIX systems (including Linux and Mac OS X) it is preferable to use a single quote (') in place of the escaped double quote:

blastdbcmd -db ' "path with spaces/mydb" ' -info

Multiple databases can also be listed within the single quotes, similar to the procedure described for Microsoft Windows.
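
When the command is launched from a script without going through a shell, only the inner pair of quotes is needed, because the outer pair exists to protect the value from the shell. The sketch below builds such a -db value; db_argument is a hypothetical helper, and it assumes that wrapping each path containing a space in double quotes is sufficient, as in the examples above.

```python
from typing import List, Sequence


def db_argument(paths: Sequence[str]) -> str:
    """Join database paths with spaces, double-quoting any path that contains one."""
    parts = []
    for path in paths:
        if '"' in path:
            raise ValueError(f"path must not contain a double quote: {path!r}")
        parts.append(f'"{path}"' if " " in path else path)
    return " ".join(parts)


value = db_argument(["path with spaces/mydb", "myotherdb"])
args: List[str] = ["blastdbcmd", "-db", value, "-info"]
print(args)
# To execute without a shell: import subprocess; subprocess.run(args, check=True)
```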

5.16 Specifying a sequence as the multiple sequence alignment master in psiblast

The -in_msa psiblast option, unlike blastpgp, does not support the specification of a master sequence via the -query option, so if one wants to specify a sequence (other than the first one) in the multiple sequence alignment file to be the master sequence, this has to be specified via the -msa_master_idx option. For instance, in the example below, the third sequence in the multiple sequence alignment would be used as the master sequence:

psiblast -in_msa align1 -db pataa -msa_master_idx 3

5.17 Ignoring the consensus sequence in the multiple sequence alignment in psiblast

Often a consensus sequence is added to a multiple sequence alignment to be used as the master sequence in a PSI-BLAST search. The consensus sequence provides a good option to display the query-subject alignment in the output and to define which MSA columns are to be converted to PSSM. At the same time, adding the consensus sequence changes the statistical properties of the original alignment. To avoid this, the -ignore_msa_master option can be used:

psiblast -in_msa align1 -db pataa -ignore_msa_master

In this case the master sequence is displayed in the output but ignored when the PSSM scores are calculated.

5.18 Performing a DELTA-BLAST search

DELTA-BLAST searches a protein sequence database using a PSSM constructed from conserved domains matching a query. It first searches the NCBI CDD database to construct the PSSM.

5.18.1 Download the cdd_delta database

Obtain this database from ftp://ftp.ncbi.nlm.nih.gov/blast/db using the update_blastdb.pl tool (provided as part of the BLAST+ package). Note that the cdd_delta database must be downloaded and installed to the standard BLAST database directory (see Configuring BLAST) or in the current working directory.

5.18.2 Execute the deltablast search

$ deltablast -query query.fsa -db pataa

Copyright Notice. BLAST is a registered trademark of the National Library of Medicine.

C# Ajax上传图片同时生成微缩图(附Demo) liyonghui160com
1.Ajax无刷新上传图片,详情请阅我的这篇文章。（jquery + c# ashx） 2.C#位图处理 System.Drawing。 3.最新demo支持IE7,IE8,Fir
Java list三种遍历方法性能比较 pda158 java
从c/c++语言转向java开发，学习java语言list遍历的三种方法，顺便测试各种遍历方法的性能，测试方法为在ArrayList中插入1千万条记录，然后遍历ArrayList，发现了一个奇怪的现象，测试代码例如以下： package com.hisense.tiger.list; import java.util.ArrayList; import java.util.Iterator;
300个涵盖IT各方面的免费资源（上）——商业与市场篇 shoothao seo 商业与市场 IT资源免费资源
A.网站模板+logo+服务器主机+发票生成 HTML5 UP:响应式的HTML5和CSS3网站模板。 Bootswatch:免费的Bootstrap主题。 Templated:收集了845个免费的CSS和HTML5网站模板。 Wordpress.org|Wordpress.com:可免费创建你的新网站。 Strikingly:关注领域中免费无限的移动优
localStorage、sessionStorage uule localStorage
W3School 例子 HTML5 提供了两种在客户端存储数据的新方法： localStorage - 没有时间限制的数据存储 sessionStorage - 针对一个 session 的数据存储之前，这些都是由 cookie 完成的。但是 cookie 不适合大量数据的存储，因为它们由每个对服务器的请求来传递，这使得 cookie 速度很慢而且效率也不

按字母分类： A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 其他

1. Introduction

2. Installation

2.1 Windows

2.2 MacOSX

2.3 RedHat Linux

2.4 Other Unix platforms

2.5 Source tarball

3. Quick start

3.1 For users of NCBI C Toolkit BLAST

3.2 For users of Web BLAST (http://blast.ncbi.nlm.nih.gov)

3.3 For new users of BLAST

3.4 Downloading BLAST databases

4. User manual

4.1 Functionality offered by BLAST+ applications

4.2 Common options

4.3 Backwards compatibility script

4.4 Exit codes

4.5 Improvements over C Toolkit BLAST command line applications

4.5.1 Query splitting

4.5.2 Tasks

4.5.3 Megablast indexed searches

4.5.4 Partial sequence retrieval from BLAST databases

4.5.5 BLAST search strategies

4.5.6 Negative GI lists

4.5.7 Masking in BLAST databases

4.5.8 Custom output formats for BLAST searches

4.5.9 Custom output formats to extract BLAST database data

4.5.10 Improved software installation packages

4.5.11 Sequence filtering applications

4.5.12 Best-Hits filtering algorithm

4.5.13 Automatic resolution of sequence identifiers

4.5.14 BLAST-WindowMasker integration in BLAST+ search applications

4.5.15 DELTA-BLAST: A tool for sensitive protein sequence search

4.6 Options by program type

4.6.1 blastp

4.6.2 blastn

4.6.3 blastx

4.6.4 tblastx

4.6.5 tblastn

4.6.6 psiblast

4.6.7 rpstblastn

4.6.8 makeblastdb

4.6.9 blastdb_aliastool

4.6.10 blastdbcmd

4.6.11 convert2blastmask

4.6.12 blastdbcheck

4.6.13 blast_formatter

4.6.14 deltablast

4.7 Configuring BLAST

4.7.1 Memory usage

4.8 Input formats to BLAST

4.8.1 Multiple sequence alignment

5. Cookbook

5.1 Query a BLAST database with a GI, but exclude that GI from the results

5.2 Create a masked BLAST database

5.2.1 Collect mask information files

5.2.1.1 Create masking information using dustmasker

5.2.1.2 Create masking information using windowmasker

5.2.1.3 Create masking information using segmasker

5.2.1.4 Extract masking information from FASTA sequences with lowercase masking

5.2.2 Create BLAST database with the masking information

5.2.2.1 Create BLAST database with masking information using an existing BLAST database or FASTA sequence file as input

5.2.2.2 Create a protein BLAST database with masking information

5.2.2.3 Create a nucleotide BLAST database using the masking information extracted from lower case masked FASTA file

5.2.3 Obtaining Sample data for this cookbook entry

5.3 Search the database with database soft masking information

5.4 Extract all human sequences from the nr database

5.5 Custom data extraction and formatting from a BLAST database

5.6 Display BLAST search results with custom output format

5.6.1 Example of custom output format

5.6.2 Trace-back operations (BTOP)

5.7 Use blastdb_aliastool to manage the BLAST databases

5.7.1 Aggregate existing BLAST databases

5.7.2 Create a subset of a BLAST database

5.8 Reformat BLAST reports with blast_formatter

5.9 Extract lowercase masked FASTA from a BLAST database with masking information

5.10 Display the locations where BLAST will search for BLAST databases

3.2 For users of Web BLAST (`http://blast.ncbi.nlm.nih.gov`)