Biostrings Package Test 3: Pairwise Sequence Alignments (20200131F)
1. Check the current working directory
getwd()
2. Load the R package
library(Biostrings)
3. Test: Pairwise Sequence Alignments
3.1 Introduction
In this document we illustrate how to perform pairwise sequence alignments using the Biostrings package
through the use of the pairwiseAlignment function. This function aligns a set of pattern strings to a subject
string in a global, local, or overlap (ends-free) fashion with or without affine gaps using either a fixed or
quality-based substitution scoring scheme. This function’s computation time is proportional to the product
of the two string lengths being aligned.
3.2 Pairwise Sequence Alignment Problems
The (Needleman-Wunsch) global, the (Smith-Waterman) local, and (ends-free) overlap pairwise sequence
alignment problems are described as follows. Let string $S_i$ have $n_i$ characters $c_{(i,j)}$ with $j \in \{1, \ldots, n_i\}$. A
pairwise sequence alignment is a mapping of strings $S_1$ and $S_2$ to gapped substrings $S'_1$ and $S'_2$ that are
defined by
$S'_1 = g_{(1,a_1)} c_{(1,a_1)} \cdots g_{(1,b_1)} c_{(1,b_1)} g_{(1,b_1+1)}$
$S'_2 = g_{(2,a_2)} c_{(2,a_2)} \cdots g_{(2,b_2)} c_{(2,b_2)} g_{(2,b_2+1)}$
where
$a_i, b_i \in \{1, \ldots, n_i\}$ with $a_i \le b_i$
$g_{(i,j)}$ = 0 or more gaps at the specified position $j$ for aligned string $i$
$\mathrm{length}(S'_1) = \mathrm{length}(S'_2)$
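As a small check of the definition above, the following sketch takes gapped pairs for the strings "succeed" and "supersede" and tests the two stated properties: both gapped strings have the same length, and removing the gaps leaves a contiguous substring of the original string. Writing a gap as "-" is a notational assumption, and the helper name is_valid_alignment is hypothetical.

```r
# Check that a gapped pair satisfies the definition of a pairwise alignment.
# Assumption: "-" marks a gap; is_valid_alignment is an illustrative helper.
is_valid_alignment <- function(gapped1, gapped2, s1, s2) {
  same_length <- nchar(gapped1) == nchar(gapped2)
  # Removing the gaps must leave a contiguous substring c(i,a_i) ... c(i,b_i).
  core1 <- gsub("-", "", gapped1, fixed = TRUE)
  core2 <- gsub("-", "", gapped2, fixed = TRUE)
  same_length && grepl(core1, s1, fixed = TRUE) && grepl(core2, s2, fixed = TRUE)
}

ok_global <- is_valid_alignment("succ--eed", "supersede", "succeed", "supersede")
ok_local  <- is_valid_alignment("su", "su", "succeed", "supersede")
bad       <- is_valid_alignment("succ-eed", "supersede", "succeed", "supersede")
print(c(ok_global, ok_local, bad))  # TRUE TRUE FALSE
```

The third pair fails only because the gapped strings differ in length (8 versus 9 characters).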
Each of these pairwise sequence alignment problems is solved by maximizing the alignment score. An
alignment score is determined by the type of pairwise sequence alignment (global, local, overlap), which sets
the [ai; bi] ranges for the substrings; the substitution scoring scheme, which sets the distance between aligned
characters; and the gap penalties, which are divided into opening and extension components. The optimal
pairwise sequence alignment is the pairwise sequence alignment with the largest score for the specified
alignment type, substitution scoring scheme, and gap penalties. The pairwise sequence alignment types,
substitution scoring schemes, and gap penalties influence alignment scores in the following manner:
#@ Pairwise Sequence Alignment Types:
The type of pairwise sequence alignment determines the substring ranges to apply the substitution scoring and gap penalty schemes. For the three primary (global, local, overlap) and two derivative (subject overlap, pattern overlap) pairwise sequence alignment types, the resulting substring ranges are as follows:
Global - $[a_1, b_1] = [1, n_1]$ and $[a_2, b_2] = [1, n_2]$
Local - $[a_1, b_1]$ and $[a_2, b_2]$
Overlap - $\{[a_1, b_1] = [a_1, n_1],\ [a_2, b_2] = [1, b_2]\}$ or $\{[a_1, b_1] = [1, b_1],\ [a_2, b_2] = [a_2, n_2]\}$
Subject Overlap - $[a_1, b_1] = [1, n_1]$ and $[a_2, b_2]$
Pattern Overlap - $[a_1, b_1]$ and $[a_2, b_2] = [1, n_2]$
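The effect of the alignment type on the substring ranges can be sketched with a small dynamic-programming routine in base R. This is a minimal illustration, not the Biostrings implementation: it assumes fixed match/mismatch scores and a linear gap penalty (gap opening of zero), and the function name align_score and its parameters are hypothetical. The overlap branch is the usual ends-free recurrence, which also admits the case where one string lies entirely inside the other, a slightly broader reading than the two overlap cases listed above.

```r
# Alignment score by dynamic programming for three alignment types.
# Assumptions: fixed match/mismatch scores, linear gap penalty (opening = 0);
# align_score is an illustrative name, not a Biostrings function.
align_score <- function(pattern, subject, type = c("global", "local", "overlap"),
                        match_score = 1, mismatch_score = -1, gap = 1) {
  type <- match.arg(type)
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n1 <- length(p)
  n2 <- length(s)
  # (n1 + 1) x (n2 + 1) table: the work grows with the product of the lengths.
  score <- matrix(0, n1 + 1, n2 + 1)
  if (type == "global") {
    # Leading gaps are charged only when both ranges must start at position 1.
    score[, 1] <- -gap * (0:n1)
    score[1, ] <- -gap * (0:n2)
  }
  for (i in seq_len(n1)) {
    for (j in seq_len(n2)) {
      sub <- if (p[i] == s[j]) match_score else mismatch_score
      best <- max(score[i, j] + sub,      # p[i] aligned with s[j]
                  score[i, j + 1] - gap,  # p[i] aligned with a gap
                  score[i + 1, j] - gap)  # s[j] aligned with a gap
      # A local alignment may restart anywhere, so cells never drop below 0.
      score[i + 1, j + 1] <- if (type == "local") max(best, 0) else best
    }
  }
  switch(type,
         global  = score[n1 + 1, n2 + 1],                   # both ranges end at n_i
         local   = max(score),                              # ranges may end anywhere
         overlap = max(score[n1 + 1, ], score[, n2 + 1]))   # at least one range ends at n_i
}

# Match = 0, mismatch = -1, gap = 1 (the fixed scheme used later in this document).
print(align_score("succeed", "supersede", "global", match_score = 0))  # -5
print(align_score("abc", "xxabcxx", "local"))                          # 3
print(align_score("xxabc", "abcyy", "overlap"))                        # 3
```

Because a local alignment may drop any prefix or suffix and an ends-free alignment only waives end gaps, the scores satisfy local >= overlap >= global for the same inputs and scoring.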
#@ Substitution Scoring Schemes:
The substitution scoring scheme sets the values for the aligned character pairings within the substring ranges determined by the type of pairwise sequence alignment. This scoring scheme can be fixed for character pairings or quality-dependent for character pairings. (Characters that align with a gap are penalized according to the "Gap Penalty" framework.)
Fixed substitution scoring - Fixed substitution scoring schemes associate each aligned character
pairing with a value. These schemes are very common and include awarding one value for a match
and another for a mismatch, Point Accepted Mutation (PAM) matrices, and Blocks Substitution
Matrix (BLOSUM) matrices.
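A fixed scheme of the simplest kind (one value for a match, another for a mismatch) can be written as a named matrix and looked up pair by pair. The sketch below is base R only; the scores of +1 and -1, the restriction to the four unambiguous DNA bases, and the helper name score_ungapped are all illustrative assumptions.

```r
# Build a simple match/mismatch matrix for the four DNA bases.
# Assumption: +1 / -1 are illustrative values; only A, C, G, T are covered.
bases <- c("A", "C", "G", "T")
fixed_scores <- matrix(-1, nrow = 4, ncol = 4, dimnames = list(bases, bases))
diag(fixed_scores) <- 1

# Score two equal-length, ungapped strings column by column.
score_ungapped <- function(x, y, scores) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  stopifnot(length(xs) == length(ys))
  # A two-column character matrix indexes the score matrix by its dimnames.
  sum(scores[cbind(xs, ys)])
}

total <- score_ungapped("ACGT", "ACCT", fixed_scores)
print(total)  # 2: three matches (+3) and one mismatch (-1)
```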
Quality-based substitution scoring - Quality-based substitution scoring schemes derive the value for
the aligned character pairing based on the probabilities of character recording errors [3]. Let $\epsilon_i$
be the probability of a character recording error. Assuming independence within and between
recordings and a uniform background frequency of the different characters, the combined error
probability of a mismatch when the underlying characters do match is
$\epsilon_c = \epsilon_1 + \epsilon_2 - (n/(n-1)) \cdot \epsilon_1 \cdot \epsilon_2$, where $n$ is the number of characters in the underlying alphabet (e.g. in DNA and RNA, $n = 4$).
Using $\epsilon_c$, the substitution score is given by
$b \cdot \log_2(\gamma_{(x,y)} \cdot (1 - \epsilon_c) \cdot n + (1 - \gamma_{(x,y)}) \cdot \epsilon_c \cdot (n/(n-1)))$,
where $b$ is the bit-scaling for the scoring and $\gamma_{(x,y)}$ is the probability that characters $x$ and $y$
represent the same underlying letter (e.g. using IUPAC, $\gamma_{(A,A)} = 1$ and $\gamma_{(A,N)} = 1/4$).
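The two formulas above can be evaluated directly in base R. The sketch below assumes that both characters carry the same error probability, taken from a Phred quality of 22, and an alphabet size of n = 7 (the number of distinct letters in "succeed", "precede" and "supersede"). Both values are assumptions chosen for illustration, and the function names are hypothetical.

```r
# Quality-based substitution scores from the two formulas above.
# Assumptions: equal error probability for both characters (Phred quality 22)
# and an alphabet size n chosen by the caller.
combined_error <- function(eps1, eps2, n) {
  eps1 + eps2 - (n / (n - 1)) * eps1 * eps2
}

quality_score <- function(gamma, eps_c, n, b = 1) {
  b * log2(gamma * (1 - eps_c) * n + (1 - gamma) * eps_c * (n / (n - 1)))
}

eps   <- 10^(-22 / 10)   # Phred quality 22 converted to an error probability
n     <- 7               # assumed alphabet size for the example words
eps_c <- combined_error(eps, eps, n)

match_score    <- quality_score(gamma = 1, eps_c = eps_c, n = n)  # same letter
mismatch_score <- quality_score(gamma = 0, eps_c = eps_c, n = n)  # different letters
print(c(match = match_score, mismatch = mismatch_score))
```

Under these assumptions a match scores about 2.789 and a mismatch about -6.091, so two matches give roughly 5.578. That agrees with the score shown in this document for the local alignment of "su" with "su", which suggests, without proving it, that comparable defaults are in effect there.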
#@ Gap Penalties:
Gap penalties are the values associated with the gaps within the substring ranges determined by the type of pairwise sequence alignment. These penalties are divided into gap opening and gap extension components, where the gap opening penalty is the cost for adding a new gap and the gap extension penalty is the incremental cost incurred along the length of the gap. A constant gap
penalty occurs when there is a cost associated with opening a gap, but no cost for the length of a gap
(i.e. gap extension is zero). A linear gap penalty occurs when there is no cost associated with opening
a gap (i.e. gap opening is zero), but there is a cost for the length of the gap. An affine gap penalty
occurs when both the gap opening and gap extension have a non-zero associated cost.
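The difference between linear and affine penalties can be seen by scoring an already-gapped alignment by hand. The sketch below is base R only and rests on stated assumptions: "-" marks a gap, a gap run of length k costs gapOpening + k * gapExtension, substitution scores are a fixed match/mismatch pair, and score_gapped is a hypothetical helper name.

```r
# Score an already-gapped alignment under fixed scoring and affine gap costs.
# Assumptions: "-" marks a gap; a gap run of length k costs
# gapOpening + k * gapExtension; score_gapped is an illustrative helper.
score_gapped <- function(pattern, subject, match_score, mismatch_score,
                         gapOpening, gapExtension) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  stopifnot(length(p) == length(s))
  paired <- p != "-" & s != "-"
  # Substitution part: only columns where two characters face each other.
  subs <- sum(ifelse(p[paired] == s[paired], match_score, mismatch_score))
  # Gap part: runs are counted per string, each opened once and then extended.
  gap_runs <- function(x) {
    r <- rle(x == "-")
    r$lengths[r$values]
  }
  gap_lengths <- c(gap_runs(p), gap_runs(s))
  subs - sum(gapOpening + gapExtension * gap_lengths)
}

# Linear penalty: two single gaps cost 1 each, three mismatches cost 1 each.
linear <- score_gapped("succe-ed-", "supersede", 0, -1,
                       gapOpening = 0, gapExtension = 1)
# Affine penalty: one gap of length 2 costs 10 + 2 * 4 = 18, plus four mismatches.
affine <- score_gapped("succ--eed", "supersede", 0, -1,
                       gapOpening = 10, gapExtension = 4)
print(c(linear = linear, affine = affine))  # -5 and -22
```

Counting gap runs separately for each string matters once gapOpening is non-zero, because a gap in the pattern that is immediately followed by a gap in the subject is two openings, not one.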
3.3 Main Pairwise Sequence Alignment Function
The pairwiseAlignment function solves the pairwise sequence alignment problems mentioned above. It
aligns one or more strings specified in the pattern argument with a single string specified in the subject
argument.
pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede")
Global PairwiseAlignmentsSingleSubject (1 of 2)
pattern: succ--eed
subject: supersede
score: -33.99738
The type of pairwise sequence alignment is set by specifying the type argument to be one of "global", "local", "overlap", "global-local", or "local-global".
pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", type = "local")
Local PairwiseAlignmentsSingleSubject (1 of 2)
pattern: [1] su
subject: [1] su
score: 5.578203
The gap penalties are regulated by the gapOpening and gapExtension arguments.
pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", gapOpening = 0, gapExtension = 1)
Global PairwiseAlignmentsSingleSubject (1 of 2)
pattern: su-cce--ed-
subject: sup--ersede
score: 7.945507
The substitution scoring scheme is set using three arguments, two of which are quality-based
(patternQuality, subjectQuality) and one of which is for fixed substitution scoring (substitutionMatrix). When the substitution scores are fixed by character pairing, the substitutionMatrix argument takes a matrix with the
appropriate alphabets as dimension names. The nucleotideSubstitutionMatrix function translates simple
match and mismatch scores to the full spectrum of IUPAC nucleotide codes.
submat <- matrix(-1, nrow = 26, ncol = 26, dimnames = list(letters, letters))
diag(submat) <- 0
pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", substitutionMatrix = submat, gapOpening = 0, gapExtension = 1)
Global PairwiseAlignmentsSingleSubject (1 of 2)
pattern: succe-ed-
subject: supersede
score: -5
When the substitution scores are quality-based, the patternQuality and subjectQuality arguments represent the equivalent of [0-99] numeric quality values for the respective strings, and the optional fuzzyMatrix
argument represents how closely two characters match on a [0, 1] scale. The patternQuality and subjectQuality arguments accept quality measures in either a PhredQuality, SolexaQuality, or IlluminaQuality
scaling. For PhredQuality and IlluminaQuality measures $Q \in [0, 99]$, the probability of an error in the base
read is given by $10^{-Q/10}$, and for SolexaQuality measures $Q \in [-5, 99]$, it is given by $1 - 1/(1 + 10^{-Q/10})$.
The qualitySubstitutionMatrices function maps the patternQuality and subjectQuality scores to match
and mismatch penalties. These three arguments will be demonstrated in later sections.
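The two quality-to-error conversions stated above are easy to tabulate. The sketch below is base R only; the function names phred_error and solexa_error are hypothetical, and the quality values are arbitrary examples.

```r
# Convert quality values to base-read error probabilities.
# Assumption: phred_error and solexa_error are illustrative names.
phred_error  <- function(q) 10^(-q / 10)                 # Phred / Illumina scaling
solexa_error <- function(q) 1 - 1 / (1 + 10^(-q / 10))   # Solexa scaling

q <- c(0, 10, 20, 30)
error_table <- data.frame(Q = q, phred = phred_error(q), solexa = solexa_error(q))
print(error_table)
```

The two scalings differ most at low quality (at Q = 0 the Phred formula gives 1 and the Solexa formula gives 0.5) and converge as Q grows.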
The final argument, scoreOnly, to the pairwiseAlignment function accepts a logical value to specify
whether or not to return just the pairwise sequence alignment score. If scoreOnly is FALSE, the pairwise
alignment with the maximum alignment score is returned. If more than one pairwise alignment has the
maximum alignment score, the first alignment along the subject is returned. If there are multiple
pairwise alignments with the maximum alignment score at the chosen subject location, then at each location
along the alignment mismatches are given preference over insertions/deletions. For example, pattern: [1]
ATTA; subject: [1] AT-A is chosen over pattern: [1] ATTA; subject: [1] A-TA if they both have
the maximum alignment score.
submat <- matrix(-1, nrow = 26, ncol = 26, dimnames = list(letters, letters))
diag(submat) <- 0
pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", substitutionMatrix = submat, gapOpening = 0, gapExtension = 1, scoreOnly = TRUE)
[1] -5 -5
4. End
sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936 LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.54.0 XVector_0.26.0 IRanges_2.20.2 S4Vectors_0.24.2 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] Seurat_3.1.2 TH.data_1.0-10 Rtsne_0.15 colorspace_1.4-1 seqinr_3.6-1 pryr_0.1.4
[7] ggridges_0.5.2 rstudioapi_0.10 leiden_0.3.2 listenv_0.8.0 npsurv_0.4-0 ggrepel_0.8.1
[13] alakazam_0.3.0 mvtnorm_1.0-12 codetools_0.2-16 splines_3.6.2 R.methodsS3_1.7.1 mnormt_1.5-5
[19] lsei_1.2-0 TFisher_0.2.0 zeallot_0.1.0 ade4_1.7-13 jsonlite_1.6 packrat_0.5.0
[25] ica_1.0-2 cluster_2.1.0 png_0.1-7 R.oo_1.23.0 uwot_0.1.5 sctransform_0.2.1
[31] readr_1.3.1 compiler_3.6.2 httr_1.4.1 backports_1.1.5 assertthat_0.2.1 Matrix_1.2-18
[37] lazyeval_0.2.2 htmltools_0.4.0 prettyunits_1.1.0 tools_3.6.2 rsvd_1.0.2 igraph_1.2.4.2
[43] gtable_0.3.0 glue_1.3.1 RANN_2.6.1 reshape2_1.4.3 dplyr_0.8.3 Rcpp_1.0.3
[49] Biobase_2.46.0 vctrs_0.2.1 multtest_2.42.0 gdata_2.18.0 ape_5.3 nlme_3.1-142
[55] gbRd_0.4-11 lmtest_0.9-37 stringr_1.4.0 globals_0.12.5 lifecycle_0.1.0 irlba_2.3.3
[61] gtools_3.8.1 future_1.16.0 zlibbioc_1.32.0 MASS_7.3-51.4 zoo_1.8-7 scales_1.1.0
[67] hms_0.5.3 sandwich_2.5-1 RColorBrewer_1.1-2 reticulate_1.14 pbapply_1.4-2 gridExtra_2.3
[73] ggplot2_3.2.1 stringi_1.4.3 mutoss_0.1-12 plotrix_3.7-7 caTools_1.17.1.4 bibtex_0.4.2.2
[79] Rdpack_0.11-1 SDMTools_1.1-221.2 rlang_0.4.2 pkgconfig_2.0.3 bitops_1.0-6 lattice_0.20-38
[85] ROCR_1.0-7 purrr_0.3.3 htmlwidgets_1.5.1 cowplot_1.0.0 tidyselect_0.2.5 RcppAnnoy_0.0.14
[91] plyr_1.8.5 magrittr_1.5 R6_2.4.1 gplots_3.0.1.2 multcomp_1.4-12 pillar_1.4.3
[97] sn_1.5-4 fitdistrplus_1.0-14 survival_3.1-8 tsne_0.1-3 tibble_2.1.3 future.apply_1.4.0
[103] crayon_1.3.4 KernSmooth_2.23-16 plotly_4.9.1 progress_1.2.2 grid_3.6.2 data.table_1.12.8
[109] metap_1.2 digest_0.6.23 tidyr_1.0.0 numDeriv_2016.8-1.1 R.utils_2.9.2 RcppParallel_4.4.4
[115] munsell_0.5.0 viridisLite_0.3.0