Biostrings package test 1_20200129Wednesday
1. Set the current working directory
setwd("Biostrings/")
2. Load the R package
library(Biostrings)
3. Brief package information
3.1 Description
Package: Biostrings
Title: Efficient manipulation of biological strings
Description: Memory efficient string containers, string matching
algorithms, and other utilities, for fast manipulation of large
biological sequences or sets of sequences.
Version: 2.54.0
Encoding: UTF-8
Author: H. Pagès, P. Aboyoun, R. Gentleman, and S. DebRoy
biocViews: SequenceMatching, Alignment, Sequencing, Genetics,
DataImport, DataRepresentation, Infrastructure
Depends: R (>= 3.5.0), methods, BiocGenerics (>= 0.31.5), S4Vectors (>=
0.21.13), IRanges, XVector (>= 0.23.2)
Imports: graphics, methods, stats, utils
LinkingTo: S4Vectors, IRanges, XVector
Enhances: Rmpi
Suggests: BSgenome (>= 1.13.14), BSgenome.Celegans.UCSC.ce2 (>=
1.3.11), BSgenome.Dmelanogaster.UCSC.dm3 (>= 1.3.11),
BSgenome.Hsapiens.UCSC.hg18, drosophila2probe, hgu95av2probe,
hgu133aprobe, GenomicFeatures (>= 1.3.14), hgu95av2cdf, affy
(>= 1.41.3), affydata (>= 1.11.5), RUnit
License: Artistic-2.0
LazyLoad: yes
Collate: 00datacache.R utils.R IUPAC_CODE_MAP.R AMINO_ACID_CODE.R
GENETIC_CODE.R XStringCodec-class.R seqtype.R XString-class.R
XStringSet-class.R XStringSet-comparison.R XStringViews-class.R
MaskedXString-class.R XStringSetList-class.R xscat.R
XStringSet-io.R letter.R getSeq.R letterFrequency.R
dinucleotideFrequencyTest.R chartr.R reverseComplement.R
translate.R toComplex.R replaceAt.R replaceLetterAt.R
injectHardMask.R padAndClip.R strsplit-methods.R misc.R
SparseList-class.R MIndex-class.R lowlevel-matching.R
match-utils.R matchPattern.R maskMotif.R matchLRPatterns.R
trimLRPatterns.R matchProbePair.R matchPWM.R findPalindromes.R
PDict-class.R matchPDict.R XStringPartialMatches-class.R
XStringQuality-class.R QualityScaledXStringSet.R InDel-class.R
AlignedXStringSet-class.R PairwiseAlignments-class.R
PairwiseAlignmentsSingleSubject-class.R PairwiseAlignments-io.R
align-utils.R pmatchPattern.R pairwiseAlignment.R stringDist.R
needwunsQS.R MultipleAlignment.R matchprobes.R zzz.R
git_url: https://git.bioconductor.org/packages/Biostrings
git_branch: RELEASE_3_10
git_last_commit: b8982e7
git_last_commit_date: 2019-10-29
Date/Publication: 2019-10-29
NeedsCompilation: yes
Packaged: 2019-10-30 01:22:38 UTC; biocbuild
Built: R 3.6.1; i386-w64-mingw32; 2019-10-30 12:46:43 UTC; windows
Archs: i386, x64
3.2 Main functions
ls("package:Biostrings")
[1] “%in%”
[2] “AA_ALPHABET”
[3] “AA_PROTEINOGENIC”
[4] “AA_STANDARD”
[5] “AAMultipleAlignment”
[6] “AAString”
[7] “AAStringSet”
[8] “AAStringSetList”
[9] “aligned”
[10] “alignedPattern”
[11] “alignedSubject”
[12] “alphabet”
[13] “alphabetFrequency”
[14] “AMINO_ACID_CODE”
[15] “as.data.frame”
[16] “as.list”
[17] “as.matrix”
[18] “BString”
[19] “BStringSet”
[20] “BStringSetList”
[21] “chartr”
[22] “codons”
[23] “coerce”
[24] “collapse”
[25] “colmask”
[26] “colmask<-”
[27] “compareStrings”
[28] “complement”
[29] “computeAllFlinks”
[30] “consensusMatrix”
[31] “consensusString”
[32] “consensusViews”
[33] “countPattern”
[34] “countPDict”
[35] “countPWM”
[36] “coverage”
[37] “deletion”
[38] “detail”
[39] “dinucleotideFrequency”
[40] “dinucleotideFrequencyTest”
[41] “DNA_ALPHABET”
[42] “DNA_BASES”
[43] “DNAMultipleAlignment”
[44] “DNAString”
[45] “DNAStringSet”
[46] “DNAStringSetList”
[47] “duplicated”
[48] “encoding”
[49] “end”
[50] “endIndex”
[51] “errorSubstitutionMatrices”
[52] “extract_character_from_XString_by_positions”
[53] “extract_character_from_XString_by_ranges”
[54] “extractAllMatches”
[55] “extractAt”
[56] “fasta.index”
[57] “fasta.seqlengths”
[58] “fastq.geometry”
[59] “fastq.seqlengths”
[60] “findPalindromes”
[61] “gaps”
[62] “GENETIC_CODE”
[63] “GENETIC_CODE_TABLE”
[64] “get_seqtype_conversion_lookup”
[65] “getGeneticCode”
[66] “getSeq”
[67] “gregexpr2”
[68] “hasAllFlinks”
[69] “hasLetterAt”
[70] “hasOnlyBaseLetters”
[71] “head”
[72] “IlluminaQuality”
[73] “indel”
[74] “initialize”
[75] “injectHardMask”
[76] “insertion”
[77] “intersect”
[78] “is.unsorted”
[79] “isMatchingAt”
[80] “isMatchingEndingAt”
[81] “isMatchingStartingAt”
[82] “IUPAC_CODE_MAP”
[83] “lcprefix”
[84] “lcsubstr”
[85] “lcsuffix”
[86] “letter”
[87] “letterFrequency”
[88] “letterFrequencyInSlidingView”
[89] “longestConsecutive”
[90] “make_XString_from_string”
[91] “make_XStringSet_from_strings”
[92] “mask”
[93] “maskeddim”
[94] “maskedncol”
[95] “maskednrow”
[96] “maskedratio”
[97] “maskedwidth”
[98] “maskGaps”
[99] “maskMotif”
[100] “masks”
[101] “masks<-”
[102] “match”
[103] “matchLRPatterns”
[104] “matchPattern”
[105] “matchPDict”
[106] “matchProbePair”
[107] “matchprobes”
[108] “matchPWM”
[109] “maxScore”
[110] “maxWeights”
[111] “mergeIUPACLetters”
[112] “minScore”
[113] “minWeights”
[114] “mismatch”
[115] “mismatchSummary”
[116] “mismatchTable”
[117] “mkAllStrings”
[118] “N50”
[119] “nchar”
[120] “nedit”
[121] “neditAt”
[122] “neditEndingAt”
[123] “neditStartingAt”
[124] “needwunsQS”
[125] “nindel”
[126] “nmatch”
[127] “nmismatch”
[128] “nnodes”
[129] “nucleotideFrequencyAt”
[130] “nucleotideSubstitutionMatrix”
[131] “oligonucleotideFrequency”
[132] “oligonucleotideTransitions”
[133] “order”
[134] “padAndClip”
[135] “pairwiseAlignment”
[136] “PairwiseAlignments”
[137] “PairwiseAlignmentsSingleSubject”
[138] “palindromeArmLength”
[139] “palindromeLeftArm”
[140] “palindromeRightArm”
[141] “parallelSlotNames”
[142] “parallelVectorNames”
[143] “pattern”
[144] “patternFrequency”
[145] “pcompare”
[146] “PDict”
[147] “PhredQuality”
[148] “pid”
[149] “pmatchPattern”
[150] “PWM”
[151] “PWMscoreStartingAt”
[152] “quality”
[153] “QualityScaledAAStringSet”
[154] “QualityScaledBStringSet”
[155] “QualityScaledDNAStringSet”
[156] “QualityScaledRNAStringSet”
[157] “qualitySubstitutionMatrices”
[158] “rank”
[159] “readAAMultipleAlignment”
[160] “readAAStringSet”
[161] “readBStringSet”
[162] “readDNAMultipleAlignment”
[163] “readDNAStringSet”
[164] “readQualityScaledDNAStringSet”
[165] “readRNAMultipleAlignment”
[166] “readRNAStringSet”
[167] “relistToClass”
[168] “replaceAmbiguities”
[169] “replaceAt”
[170] “replaceLetterAt”
[171] “reverse”
[172] “reverseComplement”
[173] “RNA_ALPHABET”
[174] “RNA_BASES”
[175] “RNA_GENETIC_CODE”
[176] “RNAMultipleAlignment”
[177] “RNAString”
[178] “RNAStringSet”
[179] “RNAStringSetList”
[180] “rowmask”
[181] “rowmask<-”
[182] “saveXStringSet”
[183] “score”
[184] “seqtype”
[185] “seqtype<-”
[186] “setdiff”
[187] “setequal”
[188] “show”
[189] “showAsCell”
[190] “SolexaQuality”
[191] “sort”
[192] “stackStrings”
[193] “start”
[194] “startIndex”
[195] “stringDist”
[196] “strsplit”
[197] “subject”
[198] “subpatterns”
[199] “subseq”
[200] “subseq<-”
[201] “substr”
[202] “substring”
[203] “summary”
[204] “tail”
[205] “tb”
[206] “tb.width”
[207] “threebands”
[208] “toComplex”
[209] “toString”
[210] “translate”
[211] “trimLRPatterns”
[212] “trinucleotideFrequency”
[213] “twoWayAlphabetFrequency”
[214] “type”
[215] “unaligned”
[216] “union”
[217] “uniqueLetters”
[218] “unitScale”
[219] “unmasked”
[220] “unstrsplit”
[221] “updateObject”
[222] “vcountPattern”
[223] “vcountPDict”
[224] “Views”
[225] “vmatchPattern”
[226] “vmatchPDict”
[227] “vwhichPDict”
[228] “which.isMatchingAt”
[229] “which.isMatchingEndingAt”
[230] “which.isMatchingStartingAt”
[231] “whichPDict”
[232] “width”
[233] “width0”
[234] “windows”
[235] “write.phylip”
[236] “writePairwiseAlignments”
[237] “writeQualityScaledXStringSet”
[238] “writeXStringSet”
[239] “xscat”
[240] “xscodes”
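With 240 exported names, the listing is easier to use when filtered. A minimal sketch, assuming the package is attached; the two patterns are chosen only for illustration:

```r
library(Biostrings)

# All names on the package's search-path entry (the exported objects).
exports <- ls("package:Biostrings")
length(exports)

# Narrow the list down, e.g. to the file readers and the matching functions.
grep("^read", exports, value = TRUE)
grep("^match", exports, value = TRUE)
```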
3.3 Introduction
The package is designed to:
(1) use R external pointers to store the string data,
(2) use bit patterns to encode the string data,
(3) provide the user with a convenient class of objects where each instance can store a set of views on the same big string (these views being typically the matches returned by a search algorithm).
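Point (3) can be illustrated with matchPattern, one of the exported search functions listed in 3.2. This is a small sketch; the subject sequence is borrowed from section 4.4 and the pattern is an arbitrary choice:

```r
library(Biostrings)

# Illustrative subject; the same sequence is used for the views in section 4.4.
subject <- DNAString("TAATAATG")

# matchPattern() returns the hits as an XStringViews object on the subject.
m <- matchPattern("TAA", subject)

class(m)   # class of the result
start(m)   # start positions of the matches
end(m)     # end positions of the matches
```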
4. Tests
4.1 The XString class and its subsetting operator [
b <- BString("I am a BString object")
#@ Contents of b
b
21-letter "BString" instance
seq: I am a BString object
#@ Length of b
length(b)
[1] 21
#@ A DNAString object:
d <- DNAString("TTGAAAA-CTC-N")
d
13-letter "DNAString" instance
seq: TTGAAAA-CTC-N
#@ Length of d
length(d)
[1] 13
#@ The differences with a BString object are: (1) only letters from the IUPAC extended genetic alphabet + the gap letter (-) are allowed and (2) each letter in the argument passed to the DNAString function is encoded in a special way before it’s stored in the DNAString object
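The alphabet restriction is easy to check directly. The sketch below assumes that "Z" is not an accepted DNA letter (it is absent from DNA_ALPHABET) and wraps the constructor in tryCatch so that the error does not stop the script; the test string is made up for illustration:

```r
library(Biostrings)

# Letters accepted by DNAString().
DNA_ALPHABET

# "Z" is not in the DNA alphabet, so the constructor signals an error;
# tryCatch() turns that error into NULL so the script keeps running.
bad <- tryCatch(DNAString("TTGAZAA"), error = function(e) NULL)
is.null(bad)

# A BString accepts the same text.
ok <- BString("TTGAZAA")
length(ok)
```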
Access to the individual letters:
#@ The third letter of d
d[3]
1-letter "DNAString" instance
seq: G
#@ Letters 7 to 12 of d
d[7:12]
6-letter "DNAString" instance
seq: A-CTC-
#@ Letters 1 to 3 of d
d[1:3]
3-letter "DNAString" instance
seq: TTG
#@ All letters of d
d[]
13-letter "DNAString" instance
seq: TTGAAAA-CTC-N
#@ Compare b reversed with b in its original order
b[length(b):1]
21-letter "BString" instance
seq: tcejbo gnirtSB a ma I
b
21-letter "BString" instance
seq: I am a BString object
#@ Only in-bounds positive numeric subscripts are supported. In fact, the subsetting operator for XString objects is not efficient, and one should always use the subseq method to extract a substring from a big string:
bb <- subseq(b, 3, 6)
bb
4-letter "BString" instance
seq: am a
dd1 <- subseq(d, end=7)
dd1
7-letter "DNAString" instance
seq: TTGAAAA
dd2 <- subseq(d, start=8)
dd2
6-letter "DNAString" instance
seq: -CTC-N
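subseq also accepts a width in place of one of the two endpoints. A short sketch, rebuilding d so that it runs on its own, that checks the result against the subsetting operator:

```r
library(Biostrings)

d <- DNAString("TTGAAAA-CTC-N")

# start + width instead of start + end: letters 3 to 6.
s1 <- subseq(d, start = 3, width = 4)

# Same letters through the subsetting operator, for comparison.
s2 <- d[3:6]

toString(s1)
s1 == s2
```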
#@ To dump an XString object as a character vector (of length 1), use the toString method:
toString(dd2)
[1] "-CTC-N"
Note that length(dd2) is equivalent to nchar(toString(dd2)) but the latter would be very inefficient on a big DNAString object.
[TODO: Make a generic of the substr() function to work with XString objects. It will be essentially doing toString(subseq()).]
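The relationship between length and nchar(toString()) noted above can be checked on a small object, where the cost of the conversion does not matter. A minimal sketch that rebuilds dd2 from scratch:

```r
library(Biostrings)

dd2 <- subseq(DNAString("TTGAAAA-CTC-N"), start = 8)

# toString() gives a plain character vector of length 1.
s <- toString(dd2)
is.character(s)
length(s)

# Both give the number of letters; length() does not need the conversion.
length(dd2)
nchar(s)
```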
4.2 The == binary operator for XString objects
#@ The 2 following comparisons are TRUE:
bb == "am a"
[1] TRUE
bb
4-letter "BString" instance
seq: am a
dd2 != DNAString("TG")
[1] TRUE
dd2
6-letter "DNAString" instance
seq: -CTC-N
#@ When the 2 sides of == don't belong to the same class, the side belonging to the "lowest" class is first converted to an object of the class of the other side (the "highest" class).
#@ The class (pseudo-)order is character < BString < DNAString. When both sides are XString objects of the same subtype (e.g. both are DNAString objects), the comparison is very fast because it only has to call the C standard function memcmp() and no memory allocation or string encoding/decoding is required.
#@ The 2 following expressions raise an error because the right member can't be "upgraded" (converted) to an object of the same class as the left member:
bb == ""
Error in bb == "" :
comparison between a "BString" object and a character vector of length != 1 or an empty string or an NA is not supported
d == bb
Error in d == bb :
comparison between a "DNAString" instance and a "BString" instance is not supported
#@ When comparing an RNAString object with a DNAString object, U and T are considered equal:
r <- RNAString(d)
r
13-letter "RNAString" instance
seq: UUGAAAA-CUC-N
r == d
[1] TRUE
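The comparison rules of this section can be collected in one self-contained sketch. The unsupported DNAString/BString case is wrapped in tryCatch so that the error is captured as an object instead of stopping the script:

```r
library(Biostrings)

d <- DNAString("TTGAAAA-CTC-N")
r <- RNAString(d)

# RNAString vs DNAString: U on one side matches T on the other.
r == d

# A length-1 character string is converted to the class of the other side.
d == "TTGAAAA-CTC-N"

# DNAString vs BString is not supported; capture the error instead of stopping.
res <- tryCatch(d == BString("TTGAAAA-CTC-N"), error = function(e) e)
inherits(res, "error")
```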
4.3 The XStringViews class and its subsetting operators [ and [[
#@ An XStringViews object contains a set of views on the same XString object called the subject string. Here is an XStringViews object with 4 views:
v4 <- Views(dd2, start=3:0, end=5:8)
class(v4)
[1] "XStringViews"
attr(,"package")
[1] "Biostrings"
v4
Views on a 6-letter DNAString subject
subject: -CTC-N
views:
start end width
[1] 3 5 3 [TC-]
[2] 2 6 5 [CTC-N]
[3] 1 7 7 [-CTC-N ]
[4] 0 8 9 [ -CTC-N ]
length(v4)
[1] 4
test_v <- Views(dd2, start = 4:1, end = 5:8)
class(test_v)
[1] "XStringViews"
attr(,"package")
[1] "Biostrings"
test_v
Views on a 6-letter DNAString subject
subject: -CTC-N
views:
start end width
[1] 4 5 2 [C-]
[2] 3 6 4 [TC-N]
[3] 2 7 6 [CTC-N ]
[4] 1 8 8 [-CTC-N ]
#@ Note that the last 2 views are out of limits: they extend beyond the 6-letter subject.
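Which views are out of limits can be worked out from start, end and the subject length. The sketch below rebuilds v4 and flags the offending views; the helper variable names (out, inside) are illustrative only:

```r
library(Biostrings)

dd2 <- DNAString("-CTC-N")
v4 <- Views(dd2, start = 3:0, end = 5:8)

# A view is out of limits when it starts before position 1
# or ends after the last letter of the subject.
out <- start(v4) < 1 | end(v4) > length(subject(v4))
out

# Keep only the views that lie fully inside the subject.
inside <- v4[!out]
length(inside)
```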
#@ You can select a subset of views from an XStringViews object:
v4[4:2]
Views on a 6-letter DNAString subject
subject: -CTC-N
views:
start end width
[1] 0 8 9 [ -CTC-N ]
[2] 1 7 7 [-CTC-N ]
[3] 2 6 5 [CTC-N]
#@ The returned object is still an XStringViews object, even if we select only one element.
#@ You need to use double-brackets to extract a given view as an XString object:
v4[[2]]
5-letter "DNAString" instance
seq: CTC-N
#@ You can't extract a view that is out of limits:
v4[[3]]
Error in getListElement(x, i, ...) : view is out of limits
#@ Note that, when start and end are numeric vectors and i is a single integer, Views(b, start, end)[[i]] is equivalent to subseq(b, start[i], end[i]).
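The equivalence between the double-bracket operator and subseq can be verified on in-limits views. A minimal sketch; the start and end positions are arbitrary values chosen for illustration:

```r
library(Biostrings)

b <- BString("I am a BString object")

# Arbitrary in-limits positions, chosen only for illustration.
starts <- c(1, 6, 8)
ends <- c(4, 14, 21)
v <- Views(b, start = starts, end = ends)

# Each extracted view equals the corresponding subseq() result.
same <- vapply(seq_along(starts),
               function(i) v[[i]] == subseq(b, starts[i], ends[i]),
               logical(1))
same
```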
#@ Subsetting also works with negative or logical values, with the expected semantics (the same as for R built-in vectors):
v4[-3]
Views on a 6-letter DNAString subject
subject: -CTC-N
views:
start end width
[1] 3 5 3 [TC-]
[2] 2 6 5 [CTC-N]
[3] 0 8 9 [ -CTC-N ]
v4[c(TRUE, FALSE)]
Views on a 6-letter DNAString subject
subject: -CTC-N
views:
start end width
[1] 3 5 3 [TC-]
[2] 1 7 7 [-CTC-N ]
#@ Note that the logical vector is recycled to the length of v4.
4.4 A few more XStringViews objects
12 views (all of the same width):
v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11)
v12
Views on a 8-letter DNAString subject
subject: TAATAATG
views:
start end width
[1] -2 0 3 [ ]
[2] -1 1 3 [ T]
[3] 0 2 3 [ TA]
[4] 1 3 3 [TAA]
[5] 2 4 3 [AAT]
... ... ... ... ...
[8] 5 7 3 [AAT]
[9] 6 8 3 [ATG]
[10] 7 9 3 [TG ]
[11] 8 10 3 [G ]
[12] 9 11 3 [ ]
This is the same as doing Views(d, start=1, end=length(d)):
as(d, "Views")
Views on a 13-letter DNAString subject
subject: TTGAAAA-CTC-N
views:
start end width
[1] 1 13 13 [TTGAAAA-CTC-N]
#@ Hence the following will always return the d object itself:
as(d, "Views")[[1]]
13-letter "DNAString" instance
seq: TTGAAAA-CTC-N
#@ 3 XStringViews objects with no view:
v12[0]
Views on a 8-letter DNAString subject
subject: TAATAATG
views: NONE
v12[FALSE]
Views on a 8-letter DNAString subject
subject: TAATAATG
views: NONE
Views(d)
Views on a 13-letter DNAString subject
subject: TTGAAAA-CTC-N
views: NONE
4.5 The == binary operator for XStringViews objects
#@ This operator is the vectorized version of the == operator defined previously for XString objects:
v12 == DNAString("TAA")
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
[11] FALSE FALSE
v12
Views on a 8-letter DNAString subject
subject: TAATAATG
views:
start end width
[1] -2 0 3 [ ]
[2] -1 1 3 [ T]
[3] 0 2 3 [ TA]
[4] 1 3 3 [TAA]
[5] 2 4 3 [AAT]
... ... ... ... ...
[8] 5 7 3 [AAT]
[9] 6 8 3 [ATG]
[10] 7 9 3 [TG ]
[11] 8 10 3 [G ]
[12] 9 11 3 [ ]
v12 == DNAString("ATG")
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[11] FALSE FALSE
v12 == DNAString("ATGA")
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11] FALSE FALSE
#@ To display all the views in v12 that are equal to a given view, you can type R one-liners like:
v12[v12 == v12[4]]
Views on a 8-letter DNAString subject
subject: TAATAATG
views:
start end width
[1] 1 3 3 [TAA]
[2] 4 6 3 [TAA]
v12[v12 == v12[1]]
Views on a 8-letter DNAString subject
subject: TAATAATG
views:
start end width
[1] -2 0 3 [ ]
[2] 9 11 3 [ ]
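The logical vector returned by == can also be turned into positions with which. The sketch below uses only in-limits views on the same subject as v12 (an assumption made here to keep the example independent of how out-of-limits views are compared):

```r
library(Biostrings)

# Six in-limits views of width 3 on the same subject as v12.
v <- Views(DNAString("TAATAATG"), start = 1:6, width = 3)

# Element-wise comparison against a single DNAString.
hits <- v == DNAString("TAA")
which(hits)

# Use the logical vector to keep only the matching views.
v[hits]
```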
#@ This comparison is described as TRUE, but in this session (Biostrings 2.54.0) it returned FALSE:
v12[3] == Views(RNAString("AU"), start=0, end=2)
[1] FALSE
4.6 The start, end and width methods
start(v4)
[1] 3 2 1 0
end(v4)
[1] 5 6 7 8
width(v4)
[1] 3 5 7 9
#@ Note that start(v4)[i] is equivalent to start(v4[i]), except that the former will not issue an error if i is out of bounds (same for end and width methods).
#@ Also, when i is a single integer, width(v4)[i] is equivalent to length(v4[[i]]) except that the former will not issue an error if i is out of bounds or if view v4[i] is out of limits.
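Two of these points can be checked quickly: the three accessors are tied together by width = end - start + 1, and indexing the returned integer vector out of bounds gives NA without raising an error. A minimal sketch that rebuilds v4:

```r
library(Biostrings)

v4 <- Views(DNAString("-CTC-N"), start = 3:0, end = 5:8)

# width is end - start + 1 for every view, including out-of-limits ones.
all(width(v4) == end(v4) - start(v4) + 1L)

# Indexing the plain integer vector past its end gives NA, not an error.
start(v4)[5]
```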
5. End
sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936
attached base packages:
[1] stats4 parallel stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] Biostrings_2.54.0 XVector_0.26.0 IRanges_2.20.2
[4] S4Vectors_0.24.2 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] Seurat_3.1.2 TH.data_1.0-10
[3] Rtsne_0.15 colorspace_1.4-1
[5] seqinr_3.6-1 pryr_0.1.4
[7] ggridges_0.5.2 rstudioapi_0.10
[9] leiden_0.3.2 listenv_0.8.0
[11] npsurv_0.4-0 ggrepel_0.8.1
[13] alakazam_0.3.0 mvtnorm_1.0-12
[15] codetools_0.2-16 splines_3.6.2
[17] R.methodsS3_1.7.1 mnormt_1.5-5
[19] lsei_1.2-0 TFisher_0.2.0
[21] zeallot_0.1.0 ade4_1.7-13
[23] jsonlite_1.6 packrat_0.5.0
[25] ica_1.0-2 cluster_2.1.0
[27] png_0.1-7 R.oo_1.23.0
[29] uwot_0.1.5 sctransform_0.2.1
[31] readr_1.3.1 compiler_3.6.2
[33] httr_1.4.1 backports_1.1.5
[35] assertthat_0.2.1 Matrix_1.2-18
[37] lazyeval_0.2.2 htmltools_0.4.0
[39] prettyunits_1.1.0 tools_3.6.2
[41] rsvd_1.0.2 igraph_1.2.4.2
[43] gtable_0.3.0 glue_1.3.1
[45] RANN_2.6.1 reshape2_1.4.3
[47] dplyr_0.8.3 Rcpp_1.0.3
[49] Biobase_2.46.0 vctrs_0.2.1
[51] multtest_2.42.0 gdata_2.18.0
[53] ape_5.3 nlme_3.1-142
[55] gbRd_0.4-11 lmtest_0.9-37
[57] stringr_1.4.0 globals_0.12.5
[59] lifecycle_0.1.0 irlba_2.3.3
[61] gtools_3.8.1 future_1.16.0
[63] zlibbioc_1.32.0 MASS_7.3-51.4
[65] zoo_1.8-7 scales_1.1.0
[67] hms_0.5.3 sandwich_2.5-1
[69] RColorBrewer_1.1-2 reticulate_1.14
[71] pbapply_1.4-2 gridExtra_2.3
[73] ggplot2_3.2.1 stringi_1.4.3
[75] mutoss_0.1-12 plotrix_3.7-7
[77] caTools_1.17.1.4 bibtex_0.4.2.2
[79] Rdpack_0.11-1 SDMTools_1.1-221.2
[81] rlang_0.4.2 pkgconfig_2.0.3
[83] bitops_1.0-6 lattice_0.20-38
[85] ROCR_1.0-7 purrr_0.3.3
[87] htmlwidgets_1.5.1 cowplot_1.0.0
[89] tidyselect_0.2.5 RcppAnnoy_0.0.14
[91] plyr_1.8.5 magrittr_1.5
[93] R6_2.4.1 gplots_3.0.1.2
[95] multcomp_1.4-12 pillar_1.4.3
[97] sn_1.5-4 fitdistrplus_1.0-14
[99] survival_3.1-8 tibble_2.1.3
[101] future.apply_1.4.0 tsne_0.1-3
[103] crayon_1.3.4 KernSmooth_2.23-16
[105] plotly_4.9.1 progress_1.2.2
[107] grid_3.6.2 data.table_1.12.8
[109] metap_1.2 digest_0.6.23
[111] tidyr_1.0.0 numDeriv_2016.8-1.1
[113] R.utils_2.9.2 RcppParallel_4.4.4
[115] munsell_0.5.0 viridisLite_0.3.0