Biostrings Package Test 3: Pairwise Sequence Alignments (2020-01-31)


1. Check the current working directory

getwd()

2. Load the R package

library(Biostrings)

3. Test: Pairwise Sequence Alignments

3.1 Introduction

In this document we illustrate how to perform pairwise sequence alignments using the Biostrings package

through the use of the pairwiseAlignment function. This function aligns a set of pattern strings to a subject

string in a global, local, or overlap (ends-free) fashion with or without affine gaps using either a fixed or

quality-based substitution scoring scheme. This function’s computation time is proportional to the product

of the two string lengths being aligned.

3.2 Pairwise Sequence Alignment Problems

The (Needleman-Wunsch) global, the (Smith-Waterman) local, and (ends-free) overlap pairwise sequence

alignment problems are described as follows. Let string $S_i$ have $n_i$ characters $c_{(i,j)}$ with $j \in \{1, \ldots, n_i\}$. A

pairwise sequence alignment is a mapping of strings $S_1$ and $S_2$ to gapped substrings $S'_1$ and $S'_2$ that are

defined by

$S'_1 = g_{(1,a_1)} c_{(1,a_1)} \cdots g_{(1,b_1)} c_{(1,b_1)} g_{(1,b_1+1)}$

$S'_2 = g_{(2,a_2)} c_{(2,a_2)} \cdots g_{(2,b_2)} c_{(2,b_2)} g_{(2,b_2+1)}$

where

$a_i, b_i \in \{1, \ldots, n_i\}$ with $a_i \le b_i$

$g_{(i,j)}$ = 0 or more gaps at the specified position $j$ for aligned string $i$

$\mathrm{length}(S'_1) = \mathrm{length}(S'_2)$

Each of these pairwise sequence alignment problems is solved by maximizing the alignment score. An

alignment score is determined by the type of pairwise sequence alignment (global, local, overlap), which sets

the $[a_i, b_i]$ ranges for the substrings; the substitution scoring scheme, which sets the distance between aligned

characters; and the gap penalties, which are divided into opening and extension components. The optimal

pairwise sequence alignment is the pairwise sequence alignment with the largest score for the specified

alignment type, substitution scoring scheme, and gap penalties. The pairwise sequence alignment types,

substitution scoring schemes, and gap penalties influence alignment scores in the following manner:

#@ Pairwise Sequence Alignment Types:

The type of pairwise sequence alignment determines the substring ranges to apply the substitution scoring and gap penalty schemes. For the three primary (global, local, overlap) and two derivative (subject overlap, pattern overlap) pairwise sequence alignment types, the resulting substring ranges are as follows:

Global - $[a_1, b_1] = [1, n_1]$ and $[a_2, b_2] = [1, n_2]$

Local - $[a_1, b_1]$ and $[a_2, b_2]$

Overlap - $\{[a_1, b_1] = [a_1, n_1], [a_2, b_2] = [1, b_2]\}$ or $\{[a_1, b_1] = [1, b_1], [a_2, b_2] = [a_2, n_2]\}$

Subject Overlap - $[a_1, b_1] = [1, n_1]$ and $[a_2, b_2]$

Pattern Overlap - $[a_1, b_1]$ and $[a_2, b_2] = [1, n_2]$
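The difference between the global and local ranges can be illustrated with a small dynamic-programming sketch in base R. This is not how Biostrings implements pairwiseAlignment; the function name align_score, the match/mismatch scores (+1 / -1), and the linear gap cost of 1 per gapped position are all assumptions chosen for illustration. The score matrix has (n1 + 1) x (n2 + 1) cells, which is why the computation time grows with the product of the two string lengths.

```r
# Minimal sketch of global vs. local alignment scores (hypothetical helper).
# Assumptions: match = 1, mismatch = -1, linear gap cost of 1 per position.
align_score <- function(s1, s2, type = c("global", "local"),
                        match = 1, mismatch = -1, gap = 1) {
  type <- match.arg(type)
  x <- strsplit(s1, "")[[1]]
  y <- strsplit(s2, "")[[1]]
  n1 <- length(x)
  n2 <- length(y)
  # H[i + 1, j + 1] holds the best score for the prefixes x[1:i] and y[1:j]
  H <- matrix(0, nrow = n1 + 1, ncol = n2 + 1)
  if (type == "global") {
    # Global: leading gaps are charged, so both ranges must start at 1
    H[, 1] <- -gap * (0:n1)
    H[1, ] <- -gap * (0:n2)
  }
  for (i in seq_len(n1)) {
    for (j in seq_len(n2)) {
      s <- if (x[i] == y[j]) match else mismatch
      best <- max(H[i, j] + s,        # align x[i] with y[j]
                  H[i, j + 1] - gap,  # x[i] against a gap
                  H[i + 1, j] - gap)  # y[j] against a gap
      # Local: the score may restart at 0, which frees the start positions
      H[i + 1, j + 1] <- if (type == "local") max(best, 0) else best
    }
  }
  # Global must end at (n1, n2); local may end anywhere in the matrix
  if (type == "global") H[n1 + 1, n2 + 1] else max(H)
}

align_score("succeed", "supersede", type = "global")
align_score("succeed", "supersede", type = "local")
```

Under these assumptions the local score can never be lower than the global score for the same pair of strings, because the local problem is free to choose any substring ranges, including the full ones.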

#@ Substitution Scoring Schemes:

The substitution scoring scheme sets the values for the aligned character

pairings within the substring ranges determined by the type of pairwise sequence alignment. This scoring scheme can be fixed for character pairings or quality-dependent for character pairings. (Characters that align with a gap are penalized according to the "Gap Penalty" framework.)

Fixed substitution scoring - Fixed substitution scoring schemes associate each aligned character

pairing with a value. These schemes are very common and include awarding one value for a match

and another for a mismatch, Point Accepted Mutation (PAM) matrices, and Block Substitution

Matrix (BLOSUM) matrices.

Quality-based substitution scoring - Quality-based substitution scoring schemes derive the value for

the aligned character pairing based on the probabilities of character recording errors [3]. Let $\epsilon_i$

be the probability of a character recording error. Assuming independence within and between

recordings and a uniform background frequency of the different characters, the combined error

probability of a mismatch when the underlying characters do match is

$\epsilon_c = \epsilon_1 + \epsilon_2 - (n/(n-1)) \cdot \epsilon_1 \cdot \epsilon_2$, where $n$ is the number of characters in the underlying alphabet (e.g. in DNA and RNA, $n = 4$).

Using $\epsilon_c$, the substitution score is given by $b \cdot \log_2(\gamma_{(x,y)} \cdot (1 - \epsilon_c) \cdot n + (1 - \gamma_{(x,y)}) \cdot \epsilon_c \cdot (n/(n-1)))$,

where $b$ is the bit-scaling for the scoring and $\gamma_{(x,y)}$ is the probability that characters $x$ and $y$

represent the same underlying letter (e.g. using IUPAC, $\gamma_{(A,A)} = 1$ and $\gamma_{(A,N)} = 1/4$).
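Both scoring schemes can be sketched in a few lines of base R. The helper names (make_fixed_matrix, combined_error, quality_score) and the example scores are assumptions for illustration and are not Biostrings functions. The quality-based part simply transcribes the two formulas above: the combined error probability e1 + e2 - (n/(n-1)) * e1 * e2, and the bit-scaled log2 score.

```r
# Fixed scheme: one value for a match, another for a mismatch.
# The +1 / -1 defaults are illustrative assumptions.
make_fixed_matrix <- function(alphabet, match = 1, mismatch = -1) {
  m <- matrix(mismatch, nrow = length(alphabet), ncol = length(alphabet),
              dimnames = list(alphabet, alphabet))
  diag(m) <- match
  m
}

dna <- make_fixed_matrix(c("A", "C", "G", "T"))
dna["A", "A"]  # score for a match
dna["A", "G"]  # score for a mismatch

# Quality-based scheme: combined error probability of the two recordings
combined_error <- function(e1, e2, n = 4) {
  e1 + e2 - (n / (n - 1)) * e1 * e2
}

# Substitution score; gamma is the probability that the two characters
# represent the same underlying letter, b is the bit-scaling
quality_score <- function(e1, e2, gamma, n = 4, b = 1) {
  ec <- combined_error(e1, e2, n)
  b * log2(gamma * (1 - ec) * n + (1 - gamma) * ec * (n / (n - 1)))
}

# Two DNA bases, each recorded with a 1% error probability
quality_score(0.01, 0.01, gamma = 1)    # identical letters (e.g. A vs A)
quality_score(0.01, 0.01, gamma = 0)    # different letters (e.g. A vs C)
quality_score(0.01, 0.01, gamma = 1/4)  # ambiguous pairing (e.g. A vs N)
```

With error-free recordings and identical letters the score reduces to b * log2(n), so the scheme rewards a confident match more strongly as the alphabet grows.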

#@ Gap Penalties:

Gap penalties are the values associated with the gaps within the substring ranges determined by the type of pairwise sequence alignment. These penalties are divided into gap opening and gap extension components, where the gap opening penalty is the cost for adding a new gap and the gap extension penalty is the incremental cost incurred along the length of the gap. A constant gap

penalty occurs when there is a cost associated with opening a gap, but no cost for the length of a gap

(i.e. gap extension is zero). A linear gap penalty occurs when there is no cost associated with opening

a gap (i.e. gap opening is zero), but there is a cost for the length of the gap. An affine gap penalty

occurs when both the gap opening and gap extension have a non-zero associated cost.
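The three gap penalty variants can be compared with a short base R sketch. The helper gap_cost is hypothetical, and it assumes that a single gap of length k costs opening + extension * k, which follows the description above of the extension penalty as an incremental cost along the length of the gap.

```r
# Cost of one gap of the given length (hypothetical helper).
# Assumption: cost = opening + extension * length, and 0 when there is no gap.
gap_cost <- function(len, opening, extension) {
  ifelse(len > 0, opening + extension * len, 0)
}

lens <- 1:4
gap_cost(lens, opening = 5, extension = 0)  # constant: 5 5 5 5
gap_cost(lens, opening = 0, extension = 1)  # linear:   1 2 3 4
gap_cost(lens, opening = 5, extension = 1)  # affine:   6 7 8 9
```

An affine penalty makes one long gap cheaper than several short gaps of the same total length, because the opening cost is paid only once.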

3.3 Main Pairwise Sequence Alignment Function

The pairwiseAlignment function solves the pairwise sequence alignment problems mentioned above. It

aligns one or more strings specified in the pattern argument with a single string specified in the subject

argument.

pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede")

Global PairwiseAlignmentsSingleSubject (1 of 2)

pattern: succ--eed

subject: supersede

score: -33.99738

The type of pairwise sequence alignment is set by specifying the type argument to be one of "global", "local", "overlap", "global-local", or "local-global".

pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", type = "local")

Local PairwiseAlignmentsSingleSubject (1 of 2)

pattern: [1] su

subject: [1] su

score: 5.578203

The gap penalties are regulated by the gapOpening and gapExtension arguments.

pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", gapOpening = 0, gapExtension = 1)

Global PairwiseAlignmentsSingleSubject (1 of 2)

pattern: su-cce--ed-

subject: sup--ersede

score: 7.945507

The substitution scoring scheme is set using three arguments, two of which are quality-based related

(patternQuality, subjectQuality) and one is fixed substitution related (substitutionMatrix). When the substitution scores are fixed by character pairing, the substitutionMatrix argument takes a matrix with the

appropriate alphabets as dimension names. The nucleotideSubstitutionMatrix function translates simple

match and mismatch scores to the full spectrum of IUPAC nucleotide codes.

submat <- matrix(-1, nrow = 26, ncol = 26, dimnames = list(letters, letters))
diag(submat) <- 0

pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", substitutionMatrix = submat, gapOpening = 0, gapExtension = 1)

Global PairwiseAlignmentsSingleSubject (1 of 2)

pattern: succe-ed-

subject: supersede

score: -5
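The nucleotideSubstitutionMatrix function mentioned above is not demonstrated in this test, so here is a minimal sketch of how it might be used for DNA strings. It assumes a Biostrings version in which both nucleotideSubstitutionMatrix and pairwiseAlignment are exported by Biostrings (such as the 2.54.0 release listed in the session info at the end); the example sequences and the +1 / -1 scores are made up for illustration.

```r
library(Biostrings)

# Expand simple match/mismatch scores into a nucleotide substitution matrix.
# baseOnly = TRUE restricts the matrix to the four bases A, C, G, T.
dna_mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1, baseOnly = TRUE)
dna_mat

# Use the matrix as the fixed substitution scheme for two short DNA strings
# (illustrative sequences, not taken from the original test)
dna_score <- pairwiseAlignment(pattern = DNAString("ACGTT"),
                               subject = DNAString("ACGAT"),
                               substitutionMatrix = dna_mat,
                               gapOpening = 0, gapExtension = 1,
                               scoreOnly = TRUE)
dna_score
```

Setting baseOnly = FALSE would instead produce scores for the full set of IUPAC ambiguity codes, which is the case described in the text.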

When the substitution scores are quality-based, the patternQuality and subjectQuality arguments represent the equivalent of [x − 99] numeric quality values for the respective strings, and the optional fuzzyMatrix

argument represents how closely two characters match on a $[0, 1]$ scale. The patternQuality and subjectQuality arguments accept quality measures in PhredQuality, SolexaQuality, or IlluminaQuality

scaling. For PhredQuality and IlluminaQuality measures $Q \in [0, 99]$, the probability of an error in the base

read is given by $10^{-Q/10}$, and for SolexaQuality measures $Q \in [-5, 99]$, it is given by $1 - 1/(1 + 10^{-Q/10})$.

The qualitySubstitutionMatrices function maps the patternQuality and subjectQuality scores to match

and mismatch penalties. These three arguments will be demonstrated in later sections.
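The two quality-to-error conversions described above are easy to check numerically. The sketch below uses hypothetical helper names (phred_error, solexa_error) that are not part of Biostrings; it only transcribes the formulas 10^(-Q/10) and 1 - 1/(1 + 10^(-Q/10)).

```r
# Error probability from a Phred-style (or Illumina-style) quality value
phred_error <- function(q) 10^(-q / 10)

# Error probability from a Solexa-style quality value
solexa_error <- function(q) 1 - 1 / (1 + 10^(-q / 10))

phred_error(c(10, 20, 30))
solexa_error(c(-5, 0, 20))
```

For high quality values the two scales give nearly the same error probability; they differ mainly at low quality, where the Solexa scale is allowed to go below zero.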

The final argument, scoreOnly, to the pairwiseAlignment function accepts a logical value to specify

whether or not to return just the pairwise sequence alignment score. If scoreOnly is FALSE, the pairwise

alignment with the maximum alignment score is returned. If more than one pairwise alignment with the

maximum alignment score exists, the first alignment along the subject is returned. If there are multiple

pairwise alignments with the maximum alignment score at the chosen subject location, then at each location

along the alignment mismatches are given preference over insertions/deletions. For example, pattern: [1]

ATTA; subject: [1] AT-A is chosen above pattern: [1] ATTA; subject: [1] A-TA if they both have

the maximum alignment score.

submat <- matrix(-1, nrow = 26, ncol = 26, dimnames = list(letters, letters))
diag(submat) <- 0
pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", substitutionMatrix = submat, gapOpening = 0, gapExtension = 1, scoreOnly = TRUE)

[1] -5 -5

4. End

sessionInfo()

R version 3.6.2 (2019-12-12)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:

[1] LC_COLLATE=Chinese (Simplified)_China.936 LC_CTYPE=Chinese (Simplified)_China.936

[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C

[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:

[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:

[1] Biostrings_2.54.0 XVector_0.26.0 IRanges_2.20.2 S4Vectors_0.24.2 BiocGenerics_0.32.0

loaded via a namespace (and not attached):

[1] Seurat_3.1.2 TH.data_1.0-10 Rtsne_0.15 colorspace_1.4-1 seqinr_3.6-1 pryr_0.1.4

[7] ggridges_0.5.2 rstudioapi_0.10 leiden_0.3.2 listenv_0.8.0 npsurv_0.4-0 ggrepel_0.8.1

[13] alakazam_0.3.0 mvtnorm_1.0-12 codetools_0.2-16 splines_3.6.2 R.methodsS3_1.7.1 mnormt_1.5-5

[19] lsei_1.2-0 TFisher_0.2.0 zeallot_0.1.0 ade4_1.7-13 jsonlite_1.6 packrat_0.5.0

[25] ica_1.0-2 cluster_2.1.0 png_0.1-7 R.oo_1.23.0 uwot_0.1.5 sctransform_0.2.1

[31] readr_1.3.1 compiler_3.6.2 httr_1.4.1 backports_1.1.5 assertthat_0.2.1 Matrix_1.2-18

[37] lazyeval_0.2.2 htmltools_0.4.0 prettyunits_1.1.0 tools_3.6.2 rsvd_1.0.2 igraph_1.2.4.2

[43] gtable_0.3.0 glue_1.3.1 RANN_2.6.1 reshape2_1.4.3 dplyr_0.8.3 Rcpp_1.0.3

[49] Biobase_2.46.0 vctrs_0.2.1 multtest_2.42.0 gdata_2.18.0 ape_5.3 nlme_3.1-142

[55] gbRd_0.4-11 lmtest_0.9-37 stringr_1.4.0 globals_0.12.5 lifecycle_0.1.0 irlba_2.3.3

[61] gtools_3.8.1 future_1.16.0 zlibbioc_1.32.0 MASS_7.3-51.4 zoo_1.8-7 scales_1.1.0

[67] hms_0.5.3 sandwich_2.5-1 RColorBrewer_1.1-2 reticulate_1.14 pbapply_1.4-2 gridExtra_2.3

[73] ggplot2_3.2.1 stringi_1.4.3 mutoss_0.1-12 plotrix_3.7-7 caTools_1.17.1.4 bibtex_0.4.2.2

[79] Rdpack_0.11-1 SDMTools_1.1-221.2 rlang_0.4.2 pkgconfig_2.0.3 bitops_1.0-6 lattice_0.20-38

[85] ROCR_1.0-7 purrr_0.3.3 htmlwidgets_1.5.1 cowplot_1.0.0 tidyselect_0.2.5 RcppAnnoy_0.0.14

[91] plyr_1.8.5 magrittr_1.5 R6_2.4.1 gplots_3.0.1.2 multcomp_1.4-12 pillar_1.4.3

[97] sn_1.5-4 fitdistrplus_1.0-14 survival_3.1-8 tsne_0.1-3 tibble_2.1.3 future.apply_1.4.0

[103] crayon_1.3.4 KernSmooth_2.23-16 plotly_4.9.1 progress_1.2.2 grid_3.6.2 data.table_1.12.8

[109] metap_1.2 digest_0.6.23 tidyr_1.0.0 numDeriv_2016.8-1.1 R.utils_2.9.2 RcppParallel_4.4.4

[115] munsell_0.5.0 viridisLite_0.3.0
