pythonic生物人

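NGS Data Formats 02: A Detailed Guide to the SAM/BAM Format

These are my study notes on SAM and the SAM tags. In case I have misunderstood something and mislead others, I have cited my sources throughout (the numbers in "[ ]" refer to the references at the end). Corrections are welcome.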

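Introduction to the SAM/BAM format [1]

Terms and concepts [3][4]

Header section

Alignment section

Overview of the alignment section

The alignment section in detail [4]

Column 1: QNAME

Column 2: FLAG

Column 3: RNAME

Column 4: POS

Column 5: MAPQ

Column 6: CIGAR

Column 7: RNEXT

Column 8: PNEXT

Column 9: TLEN

Column 10: SEQ

Column 11: QUAL

Column 12 onward: optional fields

1.1 Additional Template and Mapping data (alignment-related information)

1.2 Metadata (related to the SAM header section; describes how the read was sequenced)

1.3 Barcodes (UMI / single-cell cell barcode)

1.4 Original data

1.5 Annotation and Padding

1.6 Technology-specific data

2 Locally-defined tags

References

My WeChat official account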

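Introduction to the SAM/BAM format [1]

SAM was designed so that data from different sequencing platforms, processed by different aligners, end up in one common storage format.
SAM (Sequence Alignment/Map format) files store the alignment of sequencing reads against a reference genome. The fields on each line are separated by tabs, and the file consists of a header section and an alignment section (see the figure below).
BAM (Binary Alignment/Map format) is the binary form of SAM, compressed with the BGZF library.

Figure 1. Overview of the SAM/BAM format [2]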

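Terms and concepts [3][4]

The concepts below recur throughout the rest of this post and make the SAM format easier to follow.

Template: a DNA/RNA sequence, part of which is sequenced on a sequencing machine or assembled from raw sequences. (In other words, what the sequencer reads, or a longer sequence assembled from raw reads, is part of a template. Judging from later sections, for Illumina paired-end sequencing the template is the insert fragment.)
Segment: a contiguous sequence or subsequence. (Judging from context, a segment may be a whole read or a part of a read.)
Read: a raw sequence that comes off a sequencing machine. A read may consist of multiple segments (during alignment a read may be split into pieces that map to different places on the reference; each piece is a segment). For sequencing data, reads are indexed by the order in which they are sequenced.
Linear alignment: an alignment of a read to the reference that may include insertions, deletions, skips and clipping, but no change of direction (for example, one part of the read aligning to the forward strand and another part to the reverse strand). A linear alignment can be represented in a single SAM record. (As I read it, one SAM record holds exactly one linear alignment.)
Chimeric alignment: an alignment that is not linear. A chimeric alignment is represented as a set of linear alignments that do not overlap extensively (each piece is itself a linear alignment; the remark about large overlaps is what separates this from multiple mapping). Typically, one of the linear alignments is treated as the "representative" alignment and the others are called "supplementary", distinguished by the supplementary alignment flag (representative and supplementary form a pair that belongs to chimeric alignments). All SAM records of a chimeric alignment share the same QNAME and the same values for bits 0x40 and 0x80 of FLAG (see Section 1.4 of the spec). (This puzzled me at first, since 0x40 and 0x80 mark the first and the last segment of the template. The bits say which read of the template a record belongs to, and every linear alignment of a chimeric alignment comes from the same read, so they all carry the same values.) Which linear alignment is considered representative is arbitrary. (So the segments of a chimeric alignment are fairly independent of each other and may even lie on different strands. Likewise, if different parts of a read align to different chromosomes, the alignment has to be chimeric, because direction cannot be compared across chromosomes.)
Read alignment: a linear alignment or a chimeric alignment that is the complete representation of the alignment of a read.
Multiple mapping: because of repeats and similar features, the correct placement of a read on the reference genome may be ambiguous. In that case a read may have several alignments. One of them is treated as primary, and the SAM records of all the others carry the "secondary alignment" flag. All these records share the same QNAME and the same values for bits 0x40 and 0x80 of FLAG. The alignment designated primary is normally the best one; if several are equally good, one is chosen arbitrarily (primary and secondary form a pair that belongs to multiple mapping). (Note from the spec: chimeric alignments are mainly caused by structural variation, gene fusions, misassemblies, RNA-seq or experimental protocols, and are more frequent with longer reads. Long reads make chimeric alignments easier to detect, which is why third-generation sequencing is a stronger tool for finding structural variants. The linear alignments in a chimeric alignment have no large overlaps, each has a high mapping quality, and they can be used for SNP/INDEL calling. Multiple mappings are mainly caused by repeats and are less frequent with longer reads. When a read has multiple mappings, all of them overlap each other almost entirely, and all except the single best one get a mapping quality below Q3 and are ignored by most SNP/INDEL callers.)
1-based coordinate system: a coordinate system in which the first base of a sequence is numbered 1. A region is written as a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3,7]. The SAM, VCF, GFF and Wiggle formats use the 1-based coordinate system.
0-based coordinate system: a coordinate system in which the first base of a sequence is numbered 0. A region is written as a half-closed, half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2,7). (My first thought was that this should be [3,8), but that is wrong: counting from 0, the 3rd base has index 2 and the 7th base has index 6, so the index range [2,7) covers exactly the 3rd to 7th bases.) The BAM, BCFv2, BED and PSL formats use the 0-based coordinate system.
Phred scale: given a probability 0 < p ≤ 1, the Phred scale of p is −10 log10 p, rounded to the closest integer.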
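
The relationship between the two coordinate systems defined above can be checked with a short Python sketch. The function names are my own and purely illustrative; they are not part of any SAM library.

```python
def one_based_to_zero_based(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start-1, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_to_one_based(start, end):
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start+1, end]."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# Bases 3 to 7 inclusive: [3,7] in SAM-style coordinates, [2,7) in BED-style coordinates.
print(one_based_to_zero_based(3, 7))
print(zero_based_to_one_based(2, 7))
```

Only the start changes between the two conventions; the end value stays the same, and the interval length is end - start in the 0-based form.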

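Header section

This section holds the annotations of a SAM/BAM file. It is optional and may be omitted. Each line begins with @ followed by a two-letter record type code. Fields are separated by \t, and every field has the form TAG:VALUE (except on @CO lines). Each line matches one of the regular expressions /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. The two letters after @ are one of HD, SQ, RG, PG and CO. The first four are the common ones and are listed in the table below, where a tag marked with * is required: for example, if an @HD line is present, its VN tag must be present too [3][4].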
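
As an illustration, the two regular expressions above can be used directly in Python to validate and split a header line. This is a minimal sketch; `parse_header_line` is a hypothetical helper of mine, not a function from samtools or pysam, and it does not check which tags are required for each record type.

```python
import re

# The two patterns quoted above, written as Python regular expressions.
HEADER_RE = re.compile(r"^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_RE = re.compile(r"^@CO\t.*")


def parse_header_line(line):
    """Return (record_type, tags) for one SAM header line (illustrative helper)."""
    line = line.rstrip("\n")
    if COMMENT_RE.match(line):
        # @CO lines carry free text instead of TAG:VALUE fields.
        return "CO", {"text": line[4:]}
    if not HEADER_RE.match(line):
        raise ValueError("not a valid SAM header line: %r" % line)
    record_type, *fields = line.split("\t")
    # Split on the first colon only, because values may contain colons.
    tags = dict(field.split(":", 1) for field in fields)
    return record_type[1:], tags


print(parse_header_line("@SQ\tSN:chr22\tLN:51304566"))
```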

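@HD	File-level metadata: the SAM format version and the sort order
	VN*	Format version of the SAM file.
	SO	Sorting order of alignments. Valid values: unknown (default), unsorted, queryname and coordinate. Coordinate sorting uses column 3 (RNAME) as the major key, in the order of the SN values of the @SQ lines, and column 4 (POS) as the minor key. Records whose RNAME is * are placed after all other records, in arbitrary order.
	GO	Grouping of alignments: similar alignments are placed together. Valid values: none (default), query (grouped by QNAME) and reference (grouped by RNAME/POS).
	SS	Sub-sorting order of alignments. It describes an order more specific than the four SO values, for example @HD SO:unsorted SS:unsorted:MI:coordinate.
@SQ	Reference sequence used for the alignment. The order of the @SQ lines defines the sort order of the reference sequences. Example: @SQ SN:chr22 LN:51304566
	SN*	Reference sequence name. Each @SQ line must have a unique SN. RNAME and RNEXT refer to this value.
	LN*	Reference sequence length, in the range [1,2^31-1].
	AH	Alternate locus. Marks this sequence as an alternate locus (alt scaffold) of a region that is also represented on the primary assembly.
	AN	Alternative reference sequence names.
	AS	Genome assembly identifier.
	DS	Description. UTF-8 encoding may be used.
	M5	MD5 checksum of the reference sequence. It allows reference sequences to be uniquely identified through the MD5 digest of the sequence itself.
	SP	Species.
	TP	Molecule topology: linear (default) or circular.
	UR	URI of the sequence, i.e. where the reference genome can be found. The value may start with 'http:' or 'ftp:'.
@RG	Read group. Describes how the reads in this SAM file were sequenced.
	ID*	Read group identifier. Each @RG line must have a unique ID.
	CN	Name of the sequencing center that produced the reads.
	DS	Description.
	DT	Date the run was produced.
	FO	Flow order.
	KS	The array of nucleotide bases that correspond to the key sequence of each read.
	LB	Library name.
	PG	Programs used for processing the read group.
	PI	Predicted median insert size.
	PL	Sequencing platform, e.g. ILLUMINA.
	PM	Platform model: free-form text giving further details of the platform or technology.
	PU	Platform unit (e.g. flowcell and lane for Illumina).
	SM	Sample name. Use the pool name when a pool is being sequenced.
@PG	Program. Records which software (such as the aligner) produced the current SAM file, and with which command line.
	ID*	Unique identifier of the program record.
	PN	Program name, e.g. bwa.
	CL	Command line, e.g. bwa samse hg19.fasta sample.fastq -f sample.sam
	PP	Identifier of the previous program in the chain.
	DS	Description.
	VN	Program version.
@CO	Free-text comment.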

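Alignment section

Overview of the alignment section

This is the core of a SAM file. Each line represents the linear alignment of a segment. A line has 11 mandatory fields, followed by any number of optional fields from column 12 onward, all separated by tabs. When the information for a mandatory field is unavailable, the field is set to '*' if it is a string and to '0' if it is an integer. The table below summarizes the 11 mandatory fields [4]: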

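Column	Field	Type	Regexp/Range	Brief description
1	QNAME	String	[!-?A-~]{1,254}	Name of the query template (first line of the FASTQ record)
2	FLAG	Int	[0,2^16-1]	Bitwise FLAG
3	RNAME	String	\*|[!-()+-<>-~][!-~]*	Name of the reference sequence
4	POS	Int	[0,2^31-1]	1-based leftmost position of the alignment on the reference
5	MAPQ	Int	[0,2^8-1]	Mapping quality
6	CIGAR	String	\*|([0-9]+[MIDNSHPX=])+	CIGAR string
7	RNEXT	String	\*|=|[!-()+-<>-~][!-~]*	Reference sequence name of the mate/next read
8	PNEXT	Int	[0,2^31-1]	Position of the mate/next read on the reference
9	TLEN	Int	[-2^31+1,2^31-1]	Observed template length
10	SEQ	String	\*|[A-Za-z=.]+	Bases of the read (second line of the FASTQ record)
11	QUAL	String	[!-~]+	ASCII-encoded base qualities for SEQ (fourth line of the FASTQ record)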
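
To make the column layout concrete, here is a minimal Python sketch that splits one alignment line into the 11 mandatory fields and keeps the rest as optional fields. The `Alignment` tuple, the `parse_alignment` helper and the example record are all made up for illustration; real pipelines would normally rely on a dedicated library instead.

```python
from collections import namedtuple

FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# Hypothetical container: 11 mandatory fields plus a list of raw optional fields.
Alignment = namedtuple("Alignment", FIELDS + ("optional",))


def parse_alignment(line):
    """Split one tab-separated SAM alignment line into typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(FIELDS):
        raise ValueError("an alignment line needs at least 11 columns")
    values = [int(col) if name in INT_FIELDS else col
              for name, col in zip(FIELDS, cols)]
    return Alignment(*values, optional=cols[len(FIELDS):])


# A made-up record, used only to exercise the parser.
example = "read1\t99\tchr1\t100\t60\t5M\t=\t300\t205\tACGTA\tIIIII\tNM:i:0"
record = parse_alignment(example)
print(record.QNAME, record.FLAG, record.POS, record.optional)
```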

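The alignment section in detail [4]
Column 1: QNAME

The query template name. Reads/segments that share the same QNAME are regarded as coming from the same template. '*' means the information is unavailable. In practice this field is usually taken from the first line of the FASTQ record. A chimeric alignment or a multiple mapping makes the same QNAME appear on several lines of the SAM file.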

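Column 2: FLAG

The value shown in SAM is either one of the values listed below or the sum of several of them. A single value has the meaning given in its description; a sum combines the meanings of all the values that were added together. The common values are:

Meaning of each value

1: the read comes from paired-end sequencing (the bit is 0 for single-end reads);

2: the read and its mate are properly aligned as a pair, according to the aligner;

4: the read is unmapped;

8: the mate of the read (read1 or read2) is unmapped;

16: the read is aligned to the reverse strand of the reference (SEQ is stored reverse-complemented);

32: the mate of the read is aligned to the reverse strand of the reference;

64: read1 of a pair (the first segment in the template);

128: read2 of a pair (the last segment in the template);

256: secondary alignment. A read may align to several places on the reference; only one alignment is the primary one and the others are secondary;

512: the read did not pass quality filters (base quality, platform/vendor checks, etc.);

1024: the read is a duplicate, caused either by PCR (during library preparation) or by the instrument (optical duplicates during sequencing);

2048: supplementary alignment. The read has a chimeric alignment, and this record covers only part of the read;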

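If a FLAG value is not one of the values listed above, the following two websites can decode it:

Website 1: https://broadinstitute.github.io/picard/explain-flags.html

For example, FLAG 88 = 8 (0x8) + 16 (0x10) + 64 (0x40), so its meaning is the combination of those three descriptions.

Website 2: https://www.samformat.info/sam-format-flag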
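
The same decomposition can be done offline with a bitwise AND against each flag bit. The sketch below is my own illustration; the short names in `FLAG_BITS` are labels I chose for readability, not official identifiers.

```python
# Bit value -> short label (labels are illustrative, not official names).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_template",
    0x80: "last_in_template",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(88))   # bits 8, 16 and 64
print(decode_flag(99))   # bits 1, 2, 32 and 64
```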

Other commonly seen FLAG values

One of the reads is unmapped (only one read of the pair aligned):

73, 133, 89, 121, 165, 181, 101, 117, 153, 185, 69, 137

Both reads are unmapped:

77, 141

Mapped within the insert size and in correct orientation:

99, 147, 83, 163

Mapped within the insert size but in wrong orientation:

67, 131, 115, 179

Mapped uniquely, but with wrong insert size:

81, 161, 97, 145, 65, 129, 113, 177

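Column 3: RNAME

Reference sequence NAME of the alignment, usually a chromosome name (for human: chr1 to chr22, chrX, chrY, chrM). If RNAME is not '*', it must appear as the SN tag of an @SQ line in the header section. A read that did not align to the reference genome has RNAME '*', and in that case POS and CIGAR carry no information either.

Column 4: POS

The 1-based leftmost position at which the read aligns to the reference, so the smallest value for a mapped read is 1. For an unmapped read POS is 0, and RNAME and CIGAR carry no information either.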

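Column 5: MAPQ

MAPping Quality. The higher the score, the more reliable the position at which the read was placed on the reference genome. It equals −10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer. A value of 255 means the mapping quality is not available. For example, a MAPQ of 20 (Q20) corresponds to an error probability of 0.01: 20 = −10 × log10(0.01) = −10 × (−2).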
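
The conversion in both directions is a one-liner. The sketch below only restates the formula above in Python; the function names are mine.

```python
import math


def error_prob_to_phred(p):
    """Phred-scale a probability: -10 * log10(p), rounded to an integer."""
    if not 0 < p <= 1:
        raise ValueError("probability must satisfy 0 < p <= 1")
    return round(-10 * math.log10(p))


def phred_to_error_prob(q):
    """Inverse transform: the error probability implied by a Phred score."""
    return 10 ** (-q / 10)


print(error_prob_to_phred(0.01))   # the Q20 example from the text
print(phred_to_error_prob(30))
```

Note that Python's `round` resolves exact .5 ties to the nearest even integer, which can differ from other rounding conventions at the tie points.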

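Column 6: CIGAR

Short for Compact Idiosyncratic Gapped Alignment Representation. It describes how the read aligns to the reference. Each number in a CIGAR string is a count of bases, and each letter is an operation, as listed below:

Op	BAM	Description	Present in read	Present in reference
M	0	Alignment match (the bases may match or mismatch)	yes	yes
I	1	Insertion relative to the reference	yes	no
D	2	Deletion relative to the reference	no	yes
N	3	Skipped region of the reference; in transcriptome data it represents an intron	no	yes
S	4	Soft clipping: part of the read is left out of the alignment, but the clipped bases stay in SEQ, so the read length is unchanged	yes	no
H	5	Hard clipping: part of the read is cut off and is not kept in SEQ, so the stored read is shorter; it may only occur at the start or end of the alignment	no	no
P	6	Padding: a silent deletion from the padded reference	no	no
=	7	Sequence match	yes	yes
X	8	Sequence mismatch	yes	yes

Example: 3M1D2M1I1M means 3 aligned bases (3M), then 1 deleted base (1D), then 2 aligned bases (2M), then 1 inserted base (1I), then 1 aligned base (1M), as shown in the figure below: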
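
The two yes/no columns of the table determine how many read bases and how many reference bases a CIGAR string spans. A minimal Python sketch, with helper names of my own choosing:

```python
import re

CIGAR_RE = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_OP_RE = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations marked "yes" in the read column / the reference column of the table.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs; '*' means no CIGAR is available."""
    if cigar == "*":
        return []
    if not CIGAR_RE.fullmatch(cigar):
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return [(int(length), op) for length, op in CIGAR_OP_RE.findall(cigar)]


def cigar_lengths(cigar):
    """Return (read bases in SEQ, reference bases spanned) for a CIGAR string."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in CONSUMES_QUERY)
    reference = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return query, reference


print(parse_cigar("3M1D2M1I1M"))
print(cigar_lengths("3M1D2M1I1M"))
```

For the example 3M1D2M1I1M, the read contributes 3 + 2 + 1 + 1 = 7 bases and the alignment spans 3 + 1 + 2 + 1 = 7 reference bases, so a record starting at POS covers POS through POS + 6.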

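Column 7: RNEXT

The reference sequence name of the other read of the pair (the mate). The value is '*' when the information is unavailable, for example in single-end data, and '=' when the mate aligns to the same reference sequence as this read (RNEXT equals RNAME). Any other value must appear as the SN tag of an @SQ line in the header section. Columns 3 and 7 together show whether a read aligned to the reference and whether read1 and read2 landed on the same chromosome.

Column 8: PNEXT

In paired-end data, the 1-based leftmost position at which the mate aligns to the reference genome. The value is 0 when the information is unavailable.

Column 9: TLEN

The observed template length, which roughly corresponds to the insert size of the library fragment. The value is signed: positive for the leftmost read of the pair, negative for the rightmost, and 0 when it is unavailable.

Column 10: SEQ

The bases of the read, i.e. the second line of the FASTQ record.

Column 11: QUAL

The base qualities, i.e. the fourth line of the FASTQ record, stored as ASCII characters of Phred score + 33.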
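
Decoding a QUAL string back into Phred scores only requires subtracting the offset of 33 from each character code. A small sketch (the function name is mine):

```python
def decode_qual(qual, offset=33):
    """Turn a QUAL string into a list of integer Phred scores ('*' means unavailable)."""
    if qual == "*":
        return []
    return [ord(char) - offset for char in qual]


# 'I' is ASCII 73, '5' is ASCII 53 and '!' is ASCII 33.
print(decode_qual("I5!"))
```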

Column 12 onward: optional fields

Optional fields may span several columns, separated by \t, and not every line has them.

XT:A:R NM:i:0 X0:i:4 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 XA:Z:chr1,+102573964,50M,0

XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

# this line has no optional fields

XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

Each column has the form TAG:TYPE:VALUE, where

TAG is a two-character string matching [A-Za-z][A-Za-z0-9];
TYPE is one of A (character), B (general array), f (real number), H (byte array in hexadecimal), i (integer), or Z (string);
VALUE depends on TYPE: when TYPE is i, VALUE is an integer, and so on.
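
The TAG:TYPE:VALUE layout can be turned into typed Python values with a few lines of code. This is a hedged sketch: `parse_optional_field` is a hypothetical helper, it assumes the fields are tab-separated as in a real SAM file, and it performs no validation beyond the type conversion.

```python
def parse_optional_field(field):
    """Convert one TAG:TYPE:VALUE field into a (tag, python_value) pair."""
    # Split at most twice, because a Z value may itself contain colons.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    if type_code == "B":
        # First item is the element subtype (c, C, s, S, i, I or f).
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(item) for item in items]
    if type_code == "H":
        return tag, bytes.fromhex(value)
    return tag, value  # A (single character) and Z (string) stay as text


# First example line from above, with tabs between the fields.
line = "XT:A:R\tNM:i:0\tX0:i:4\tXM:i:0\tXO:i:0\tXG:i:0\tMD:Z:50\tXA:Z:chr1,+102573964,50M,0"
tags = dict(parse_optional_field(field) for field in line.split("\t"))
print(tags["XT"], tags["NM"], tags["X0"], tags["XA"])
```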

The tags in detail

The predefined tags fall into 6 categories, described below [6]:

1.1 Additional Template and Mapping data (alignment-related information)

AM:i:score The smallest template-independent mapping quality of any segment in the same template as this read. (See also SM.)

AS:i:score Alignment score generated by aligner.

BQ:Z:qualities Offset to base alignment quality (BAQ), of the same length as the read sequence. At the i-th read base, BAQ_i = Q_i − (BQ_i − 64) where Q_i is the i-th base quality.

CC:Z:rname Reference name of the next hit; ‘=’ for the same chromosome.

CG:B:I,encodedCigar Real CIGAR in its binary form if (and only if) it contains >65535 operations. This is a BAM file only tag as a workaround of BAM’s incapability to store long CIGARs in the standard way. SAM and CRAM files created with updated tools aware of the workaround are not expected to contain this tag. See also the footnote in Section 4.2 of the SAM spec for details.

CP:i:pos Leftmost coordinate of the next hit.

E2:Z:bases The 2nd most likely base calls. Same encoding and same length as SEQ. See also U2 for associated quality values.

FI:i:int The index of segment in the template.

FS:Z:str Segment suffix.

H0:i:count Number of perfect hits.

H1:i:count Number of 1-difference hits (see also NM).

H2:i:count Number of 2-difference hits.

HI:i:i Query hit index, indicating the alignment record is the i-th one stored in SAM.

IH:i:count Number of alignments stored in the file that contain the query in the current record.

MC:Z:cigar CIGAR string for mate/next segment.

MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* String for mismatching positions.

The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.
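
The MD example above can be walked through programmatically. The following Python sketch tokenizes an MD value into match runs, mismatched reference bases and deleted reference bases; `parse_md` and the event labels are my own, and the sketch assumes an upper-case MD string as described by the pattern above.

```python
import re

# Overall shape of an MD value, as given in the tag description above.
MD_RE = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")
# One token: a run of matches, a deletion (^ plus bases), or a single mismatched base.
MD_TOKEN_RE = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")


def parse_md(md):
    """Split an MD string into ('match', n), ('mismatch', base) and ('deletion', bases) events."""
    if not MD_RE.fullmatch(md):
        raise ValueError("malformed MD string: %r" % md)
    events = []
    for token in MD_TOKEN_RE.finditer(md):
        number, deletion, mismatch = token.groups()
        if number is not None:
            if int(number) > 0:  # zero-length runs only separate adjacent events
                events.append(("match", int(number)))
        elif deletion is not None:
            events.append(("deletion", deletion))
        else:
            events.append(("mismatch", mismatch))
    return events


print(parse_md("10A5^AC6"))
```

For ‘10A5^AC6’ this yields 10 matches, a mismatch where the reference has A, 5 matches, a deletion of AC from the reference, and 6 matches, which mirrors the explanation in the text.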

MQ:i:score Mapping quality of the mate/next segment.

NH:i:count Number of reported alignments that contain the query in the current record.

NM:i:count Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch (author's note: combined with the CIGAR field, this can be used to work out the number of mismatched bases). Note this means that ambiguity codes in both sequence and reference that match each other, such as ‘N’ in both, or compatible codes such as ‘A’ and ‘R’, are still counted as mismatches. The special sequence base ‘=’ will always be considered to be a match, even if the reference is ambiguous at that point. Alignment reference skips, padding, soft and hard clipping (‘N’, ‘P’, ‘S’ and ‘H’ CIGAR operations) do not count as mismatches, but insertions and deletions count as one mismatch per base. Note that historically this has been ill-defined and both data and tools exist that disagree with this definition.

PQ:i:score Phred likelihood of the template, conditional on the mapping locations of both/all segments being correct.

Q2:Z:qualities Phred quality of the mate/next segment sequence in the R2 tag. Same encoding as QUAL.

R2:Z:bases Sequence of the mate/next segment in the template. See also Q2 for any associated quality values.

SA:Z:(rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+ Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list. Each element in the list represents a part of the chimeric alignment. Conventionally, at a supplementary line, the first element points to the primary line. Strand is either ‘+’ or ‘-’, indicating forward/reverse strand, corresponding to FLAG bit 0x10. Pos is a 1-based coordinate.

SM:i:score Template-independent mapping quality, i.e., the mapping quality if the read were mapped as a single read rather than as part of a read pair or template.

TC:i: The number of segments in the template.

TS:A:strand Strand (‘+’ or ‘-’) of the transcript to which the read has been mapped.

U2:Z: Phred probability of the 2nd call being wrong conditional on the best being wrong. The same encoding and length as QUAL. See also E2 for associated base calls.

UQ:i: Phred likelihood of the segment, conditional on the mapping being correct.

1.2 Metadata (related to the SAM header section; describes how the read was sequenced)

RG:Z:readgroup The read group to which the read belongs. If @RG headers are present, then readgroup must match the RG-ID field of one of the headers.

LB:Z:library The library from which the read has been sequenced. If @RG headers are present, then library must match the RG-LB field of one of the headers.

PG:Z:program id Program. Value matches the header PG-ID tag if @PG is present.

PU:Z:platformunit The platform unit in which the read was sequenced. If @RG headers are present, then platformunit must match the RG-PU field of one of the headers.

CO:Z:text Free-text comments.

1.3 Barcodes (UMI / single-cell cell barcode)

DNA barcodes can be used to identify the provenance of the underlying reads. There are currently three varieties of barcodes that may co-exist: Sample Barcode, Cell Barcode, and Unique Molecular Identifier (UMI).

• Despite its name, the Sample Barcode identifies the Library and allows multiple libraries to be combined and sequenced together. After sequencing, the reads can be separated according to this barcode and placed in different “read groups” each corresponding to a library. Since the library was generated from a sample, knowing the library should inform of the sample. The barcode itself can be included in the PU field in the RG header line. Since the PU field should be globally unique, it is advisable to include specific information such as flowcell barcode and lane. It is not recommended to use the barcode as the ID field of the RG header line, as some tools modify this field (e.g., when merging files).

• The Cell Barcode is similar to the sample barcode but there is (normally) no control over the assignment of cells to barcodes (whose sequence could be random or predetermined). The Cell Barcode can help identify when reads come from different cells in a “single-cell” sequencing experiment. (Author's note: in single-cell sequencing, this is the tag that traces a read back to its cell of origin.)

• The UMI is intended to identify the (single- or double-stranded) molecule at the time that the barcode was introduced. This can be used to inform duplicate marking and make consensus calling in ultra-deep sequencing. Additionally, the UMI can be used to (informatically) link reads that were generated from the same long molecule, enabling long-range phasing and better informed mapping. In some experimental setups opposite strands of the same double-stranded DNA molecule get related barcodes. These templates can also be considered duplicates even though technically they may have different UMIs. Multiple UMIs can be added by a protocol, possibly at different time-points, which means that specific knowledge of the protocol may be needed in order to analyze the resulting data correctly. (Author's note: in RNA-seq, UMIs allow "absolute quantification" of the original RNA molecules.)

BC:Z:sequence Barcode sequence (Identifying the sample/library), with any quality scores (optionally) stored in the QT tag. The BC tag should match the QT tag in length. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes and places a hyphen (‘-’) between the barcodes from the same template.

QT:Z:qualities Phred quality of the sample barcode sequence in the BC tag. Same encoding as QUAL, i.e., Phred score + 33. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with spaces (‘ ’) between the different strings from the same template.

CB:Z:str Cell identifier, consisting of the optionally-corrected cellular barcode sequence and an optional suffix. The sequence part is similar to the CR tag, but may have had sequencing errors etc corrected. This may be followed by a suffix consisting of a hyphen (‘-’) and one or more alphanumeric characters to form an identifier. In the case of the cellular barcode (CR) being based on multiple barcode sequences the recommended implementation concatenates all the (corrected or uncorrected) barcodes with a hyphen (‘-’) between the different barcodes. Sequencing errors etc aside, all reads from a single cell are expected to have the same CB tag.

CR:Z:sequence+ Cellular barcode. The uncorrected sequence bases of the cellular barcode as reported by the sequencing machine, with the corresponding base quality scores (optionally) stored in CY. Sequencing errors etc aside, all reads with the same CR tag likely derive from the same cell. In the case of the cellular barcode being based on multiple barcode sequences the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes.

CY:Z:qualities+ Phred quality of the cellular barcode sequence in the CR tag. Same encoding as QUAL, i.e., Phred score + 33. The lengths of the CY and CR tags must match. In the case of the cellular barcode being based on multiple barcode sequences the recommended implementation concatenates all the quality strings with spaces (‘ ’) between the different strings.

MI:Z:str Molecular Identifier. A unique ID within the SAM file for the source molecule from which this read is derived. All reads with the same MI tag represent the group of reads derived from the same source molecule.

OX:Z:sequence+ Raw (uncorrected) unique molecular identifier bases, with any quality scores (optionally) stored in the BZ tag. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes.

BZ:Z:qualities+ Phred quality of the (uncorrected) unique molecular identifier sequence in the OX tag. Same encoding as QUAL, i.e., Phred score + 33. The OX tags should match the BZ tag in length. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (‘ ’) between the different strings.

RX:Z:sequence+ Sequence bases from the unique molecular identifier. These could be either corrected or uncorrected. Unlike MI, the value may be non-unique in the file. Should be comprised of a sequence of bases. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes. If the bases represent corrected bases, the original sequence can be stored in OX (similar to OQ storing the original qualities of bases.)

QX:Z:qualities+ Phred quality of the unique molecular identifier sequence in the RX tag. Same encoding as QUAL, i.e., Phred score + 33. The qualities here may have been corrected (Raw bases and qualities can be stored in OX and BZ respectively.) The lengths of the QX and the RX tags must match. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (‘ ’) between the different strings.

1.4 Original data

OA:Z:(RNAME,POS,strand,CIGAR,MAPQ,NM ;)+ The original alignment information of the record prior to realignment or unalignment by a subsequent tool. Each original alignment entry contains the following six field values from the original record, generally in their textual SAM representations, separated by commas (‘,’) and terminated by a semicolon (‘;’): RNAME, which must be explicit (unlike RNEXT, ‘=’ may not be used here); 1-based POS; ‘+’ or ‘-’, indicating forward/reverse strand respectively (as per bit 0x10 of FLAG); CIGAR; MAPQ; NM tag value, which may be omitted (though the preceding comma must be retained).

In the presence of an existing OA tag, a subsequent tool may append another original alignment entry after the semicolon, adding to, rather than replacing, the existing OA information.

The OA field is designed to provide record-level information that can be useful for understanding the provenance of the information in a record. It is not designed to provide a complete history of the template alignment information. In particular, realignments resulting in the removal of Secondary or Supplementary records will cause the loss of all tags associated with those records, and may also leave the SA tag in an invalid state.

OC:Z:cigar Original CIGAR, usually before realignment. Deprecated in favour of the more general OA.

OP:i:pos Original 1-based POS, usually before realignment. Deprecated in favour of the more general OA.

OQ:Z:qualities Original base quality, usually before recalibration. Same encoding as QUAL.

1.5 Annotation and Padding

The SAM format can be used to represent de novo assemblies, generally by using padded reference sequences and the annotation tags described here. See the Guide for Describing Assembly Sequences in the SAM Format Specification for full details of this representation.

CT:Z:strand;type(;key(=value)?)*

Complete read annotation tag, used for consensus annotation dummy features.

The CT tag is intended primarily for annotation dummy reads, and consists of a strand, type and zero or more key=value pairs, each separated with semicolons. The strand field has four values as in GFF3, and supplements FLAG bit 0x10 to allow unstranded (‘.’), and stranded but unknown strand (‘?’) annotation. For these and annotation on the forward strand (strand set to ‘+’), do not set FLAG bit 0x10. For annotation on the reverse strand, set the strand to ‘-’ and set FLAG bit 0x10.

The type and any keys and their optional values are all percent encoded according to RFC3986 to escape meta-characters ‘=’, ‘%’, ‘;’, ‘|’ or non-printable characters not matched by the isprint() macro (with the C locale). For example a percent sign becomes ‘%25’.

PT:Z:annotag(\|annotag)*

where each annotag matches start;end;strand;type(;key(=value)?)* Read annotations for parts of the padded read sequence. The PT tag value has the format of a series of annotation tags separated by ‘|’, each annotating a sub-region of the read. Each tag consists of start, end, strand, type and zero or more key=value pairs, each separated with semicolons. Start and end are 1-based positions between one and the sum of the M/I/D/P/S/=/X CIGAR operators, i.e., SEQ length plus any pads. Note any editing of the CIGAR string may require updating the PT tag coordinates, or even invalidate them. As in GFF3, strand is one of ‘+’ for forward strand tags, ‘-’ for reverse strand, ‘.’ for unstranded or ‘?’ for stranded but unknown strand. The type and any keys and their optional values are all percent encoded as in the CT tag.

1.6 Technology-specific data

FZ:B:S,intensities Flow signal intensities (author's note: the signal intensity data recorded during sequencing) on the original strand of the read, stored as (uint16_t) round(value * 100.0).

1.6.1 Color space

CM:i:distance Edit distance between the color sequence and the color reference (see also NM).

CS:Z:sequence Color read sequence on the original strand of the read. The primer base must be included.

CQ:Z:qualities Color read quality on the original strand of the read. Same encoding as QUAL; same length as CS.

2 Locally-defined tags

You can freely add new tags. Note that tags starting with ‘X’, ‘Y’, or ‘Z’ and tags containing lowercase letters in either position are reserved for local use and will not be formally defined in any future version of this specification. If a new tag may be of general interest, it may be useful to have it added to this specification. Additions can be proposed by opening a new issue at https://github.com/samtools/hts-specs/issues and/or by sending email to [email protected].

References

[1] Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools[J]. Bioinformatics, 2009, 25(16): 2078-2079.

[2] https://www.samformat.info/sam-format-flag

[3] http://note.youdao.com/share/?id=312fa04209cb87f7674de9a9544f329a&type=note#/

[4] https://samtools.github.io/hts-specs/SAMv1.pdf

[5] https://yulijia.net/slides/bioinfomatcis_for_medical_students/2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II.html#1

[6] http://samtools.github.io/hts-specs/SAMtags.pdf

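My WeChat official account

I try to cover bioinformatics techniques in a detailed and professional way. You are welcome to learn along with me!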

你可能感兴趣的:(生物信息)

在生信分析中，处理vcf 比较好用的python包推荐
在生物信息学分析中，处理VCF（VariantCallFormat）文件的Python包有很多，以下是一些常用且好用的Python包，适合不同的分析需求：PyVCF（推荐）简介：PyVCF是一个专门为解析和操作VCF文件设计的Python库，支持读取、过滤和修改VCF文件。优点：简单易用，API直观。支持VCF4.0及以上版本。可以轻松访问变体的信息（如染色体、位置、参考碱基、变异碱基等）。安装：
Conda安装与使用
目录一、软件安装及conda管理1.conda下载2.miniconda安装二、环境配置1.配置镜像：2.创建环境、移除环境：3.查看小环境4.进入、退出小环境5.查找并安装软件三、一步到位其他：参考资料：一、软件安装及conda管理conda可以来管理大量的生物信息学软件，或者想要复现一些文章中的实验结果需要不同环境的切换。1.conda下载（1）anacondaanaconda|镜像站使用帮助
富集分析——GO、KEGG ersanshi055 生信小菜鸟富集分析 GO kegg
一、富集分析的基础认知在生物信息学研究领域，基因功能解析及通路阐释是众多分析流程中的关键环节，富集分析（EnrichmentAnalysis）是将基因或蛋白列表按照功能进行分类的统计方法，目的是找出在特定基因集中显著富集的功能类别或通路。通过这种方法，研究人员可以理解一组基因（如差异表达基因）在哪些生物学过程、分子功能或通路中代表。1.富集分析分类基因本体论富集分析（GeneOntologyEnr
Rstudio：强大的R语言集成开发环境（IDE）简说基因-专业生信合作伙伴 r语言开发语言
Rstudio应该是R语言使用的标配，尽管Rstudio的母公司Posit推出了新一代的集成开发环境Positron，但其还处于开发阶段。作为用户不妨让其成熟后再使用，现阶段还是Rstudio更稳定。如果你在生物信息学或统计学领域工作，R语言几乎是必备的工具之一。而RStudio，作为R语言最流行的集成开发环境（IDE），为数据分析、可视化和编程提供了非常友好的平台。今天我们来介绍一下RStudi
python做生物信息学分析_Python从零开始第五章生物信息学①提取差异基因吴敬欣 python做生物信息学分析
目前来说，做生物信息学的人越来越多，但是我觉得目前而言做生信的主要有三类人：老本行是做实验的，做生信可能是为了辅助研究或者是为了发paper(有非常多的临床生选择趟生信这波水)主要是做生信的，主要涵盖高通量测序数据分析，组学数据分析等等，专门从事生物学数据分析的这群人，其大部分也是本科生物狗作为强大的生力军，以调包写R，python为主。那么这群人就要熟悉看各种包的tutorial以及如何进行常规
用Python实现生信分析——功能预测详解写代码的M教授生信分析 python 开发语言
功能预测是生物信息学中的一项重要任务，通过分析基因或蛋白质序列的特征，推测它们的生物学功能。功能预测通常涉及多种方法，包括序列比对、基序识别、机器学习模型等。这些方法可以帮助科学家推断未知基因的功能，从而加速生物学研究的进展。1.功能预测的主要方法（1）同源性比对：通过将未知基因或蛋白质序列与数据库中的已知序列进行比对，识别出同源序列，并推测它们的功能。常用工具包括BLAST、HMMER等。（2）
用Python实现生信分析——序列搜索和比对工具详解写代码的M教授生信分析 python
1.什么是序列搜索和比对工具？序列搜索和比对工具在生物信息学中用于在大型序列数据库中搜索与查询序列相似的序列，并进行比对分析。这些工具可以帮助研究人员识别与目标序列相关的已知序列，从而推测其功能、结构和进化关系。常见的序列搜索和比对工具包括：BLAST（BasicLocalAlignmentSearchTool）：最常用的序列搜索工具，能够快速找到与查询序列相似的序列。FASTA：另一个常用的序列
大模型在生物信息学中的应用前景 AI天才研究院 AI人工智能与大数据 ChatGPT java python javascript kotlin golang 架构人工智能大厂程序员硅基计算碳基计算认知计算生物计算深度学习神经网络大数据 AIGC AGI LLM 系统架构设计软件哲学 Agent 程序员实现财富自由
大模型在生物信息学中的应用前景关键词：大模型、生物信息学、基因组学、蛋白质组学、应用前景摘要：本文将深入探讨大模型在生物信息学中的应用前景。首先，我们将介绍大模型的基础知识，包括其定义、特点和优势。接着，我们将分析大模型在生物信息学中的问题背景和具体应用场景。然后，我们将详细讲解大模型在生物信息学中的数据处理与分析方法，以及其在基因组学和蛋白质组学中的应用案例。最后，我们将讨论大模型在生物信息学中
【深度学习】条件随机场（CRF）深度解析：原理、应用与前沿白熊188 深度学习深度学习人工智能
条件随机场（CRF）深度解析：原理、应用与前沿一、算法背景知识1.1序列标注的挑战1.2概率图模型演进二、算法理论与结构2.1基本定义2.2特征函数设计状态特征（节点特征）转移特征（边特征）2.3线性链CRF结构2.4训练与解码2.5前向-后向算法三、模型评估3.1评估指标3.2评估方法对比3.3性能基准（CoNLL-2003NER）四、应用案例4.1自然语言处理4.2生物信息学4.3计算机视觉五
最新期刊影响因子，基本包含全部期刊 Bioinfo科研生信筆記影响因子 2024年期刊影响因子期刊因子因子 IF
原文链接：2024年期刊最新影响因子（IF）2024年期刊最新影响因子（IF）BioinfoR生信筆記，注于分享生物信息学相关知识和R语言绘图教程。
向量检索中的 ANN（Approximate Nearest Neighbor）技术 XiaoQiong.Zhang AI 人工智能
向量检索中的ANN（ApproximateNearestNeighbor）技术是一种在高维空间中高效查找与查询向量q最相似的Top-K个向量的方法，其核心在于牺牲一定的精度（召回率）以换取比精确最近邻搜索（ExactNN）高数个数量级的查询速度。它广泛应用于图像/视频检索、自然语言处理（如语义搜索、问答）、推荐系统、生物信息学等场景。⸻一、基本问题定义目标：给定一个查询向量q，在一个庞大的向量集合
cd-hit安装与使用-cd-hit v4.8.1（bioinfomatics tools-005）让学习成为一种生活方式基因组多组学序列比对 github linux 论文阅读数据挖掘
01背景介绍CD-HIT(ClusterDatabaseatHighIdentitywithTolerance)是一种广泛使用的生物信息学工具，主要用于快速聚类生物序列数据，如蛋白质或核酸序列，以减少数据冗余和简化数据分析。其基本原理涉及比较序列之间的相似性，将高度相似的序列分组到同一个聚类中，从而减少数据集的复杂性。1.1算法原理CD-HIT的算法原理主要包括以下几个方面：序列比较和相似性评分：
基于 Java 的大数据分布式计算在基因编辑数据分析与精准医疗中的应用进展知识产权13937636601 计算机 java 分布式计算基因编辑
随着基因测序成本断崖式下降（单人类全基因组低于100）和CRISPR基因编辑技术成熟，全球日均产生超20PB基因数据。传统单机生物信息学工具难以应对海量多组学数据的整合、分析与临床转化。本文将系统阐述**Java技术栈如何构建新一代基因大数据计算中枢**：基于Hadoop+Spark的分布式架构实现千倍加速的基因组比对；通过Flink流式计算引擎支撑CRISPR脱靶效应实时预测；利用ApacheA
PostgreSQL 在生物信息学中的应用 belldeep PostgreSQL 生物信息学 postgresql 数据库生物信息学
PostgreSQL（简称PG）是一种强大的开源关系型数据库管理系统，因其高可靠性、扩展性和支持复杂查询的特性，在生物信息学领域得到广泛应用。以下是其核心应用场景及优势分析：一、生物数据存储与管理生物信息学涉及海量异构数据，PG的结构化存储能力和可扩展性使其成为理想选择。1.多类型数据存储基因组数据：存储DNA/RNA序列、基因注释（如GTF/GFF文件）、变异数据（VCF格式）等。例：将基因组序
一款适合程序员的流程图/思维导图利器 qq_21478261 #Python可视化 python 运维思维导图图论机器学习
首发地址：程序员必备流程图/思维导图利器本文介绍graphviz在Python中的接口。graphviz是在复杂网络、生物信息学、软件工程、数据库和网页设计、机器学习等领域使用广泛的图（Graph）可视化利器。graphviz支持Linux、Windows、Mac、Solaris等多个系统，拥有多种编程语言的API(perl、python、ruby、C#等)。graphviz功能先看看graphv
支持向量机SVM：从数学原理到实际应用代码很孬写支持向量机算法机器学习语言模型自然语言处理 ai 人工智能
前言本篇文章全面深入地探讨了支持向量机（SVM）的各个方面，从基本概念、数学背景到Python和PyTorch的代码实现。文章还涵盖了SVM在文本分类、图像识别、生物信息学、金融预测等多个实际应用场景中的用法。一、引言背景支持向量机（SVM,SupportVectorMachines）是一种广泛应用于分类、回归、甚至是异常检测的监督学习算法。自从Vapnik和Chervonenkis在1995年首
7天掌握！MySQL vs 图数据库：混合架构下的复杂关系分析全揭秘墨瑾轩数据库学习数据库 mysql 架构