[Repost] From 塞外野猪's blog: Some notes on processing Affymetrix data

Today I talked with an expert who works with Affymetrix arrays, and it cleared up several questions I had never quite understood.

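Affymetrix Expression Arrays report a detection call for each probe (strictly speaking, each probe set): absent or present. If a probe's signal is very low, Affymetrix considers the corresponding transcript not expressed, so the signal may simply be noise. Absent means no reliable signal was detected; present means a signal was detected.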

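So when processing the arrays, if no replicates were run, probes called absent should be removed and not used. If there are replicates, a statistical model can be used to judge whether a probe's value is reliable.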
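
As a rough illustration, the rule above could be sketched in Python as follows. The function names, the dictionary layout, and the "present in at least half of the replicates" cutoff are all assumptions made for this example; the text only says that some statistical model can be used when replicates exist, and the simple vote below is a stand-in for one, not a recommendation.

```python
def filter_single_array(signals, calls):
    """No replicates: keep only probe sets whose detection call is 'P'.

    Anything other than 'P' (including 'M', marginal) is dropped here.
    """
    return {probe: value for probe, value in signals.items()
            if calls.get(probe) == "P"}


def reliable_with_replicates(replicate_calls, min_present_fraction=0.5):
    """Replicates available: keep probe sets called 'P' in enough replicates.

    replicate_calls maps probe set ID -> list of calls, one per replicate.
    The 0.5 cutoff is an illustrative assumption, not an Affymetrix rule.
    """
    kept = []
    for probe, calls in replicate_calls.items():
        present = sum(1 for c in calls if c == "P")
        if calls and present / len(calls) >= min_present_fraction:
            kept.append(probe)
    return kept


# Example probe set IDs and values, for illustration only.
signals = {"1007_s_at": 812.4, "1053_at": 35.1}
calls = {"1007_s_at": "P", "1053_at": "A"}
print(filter_single_array(signals, calls))
print(reliable_with_replicates({"1007_s_at": ["P", "P", "A"],
                                "1053_at": ["A", "A", "P"]}))
```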

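There are two possible reasons for a probe's signal to be called absent:
1. The gene corresponding to the probe is not expressed in the sample, or is expressed only at a low level, so its signal is naturally very low and cannot be distinguished from background noise.
2. The array itself is of poor quality.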

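As a rule of thumb, a probe's signal should be >100 to be usable; this of course applies specifically to Affymetrix Expression Arrays. In addition, the probe signal for housekeeping genes should be >15000 for the array quality to be considered acceptable.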
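
A minimal sketch of these two rules of thumb is shown below. The thresholds (100 and 15000) come from the text; the function names, the placeholder probe IDs, and the choice to require every housekeeping probe to pass (rather than, say, most of them) are assumptions made for illustration.

```python
MIN_SIGNAL = 100          # per-probe usability threshold quoted in the text
MIN_HOUSEKEEPING = 15000  # housekeeping-gene threshold quoted in the text


def usable_probes(signals, min_signal=MIN_SIGNAL):
    """Keep only probes whose signal is above the usability threshold."""
    return {p: v for p, v in signals.items() if v > min_signal}


def array_passes_qc(signals, housekeeping_ids, min_hk=MIN_HOUSEKEEPING):
    """True if every listed housekeeping probe exceeds the threshold.

    Requiring all of them to pass is an assumption; a missing probe counts
    as a failure.
    """
    return all(signals.get(p, 0) > min_hk for p in housekeeping_ids)


# Placeholder IDs and values, for illustration only.
signals = {"hk_probe_1": 18250.0, "hk_probe_2": 16100.0,
           "gene_x": 87.0, "gene_y": 450.0}
print(usable_probes(signals))
print(array_passes_qc(signals, ["hk_probe_1", "hk_probe_2"]))
```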

I looked up some related material online:

Filtering of low intensity data
1. Why? Since subsequent analysis will examine ratios of expression levels between two tissues, we don't want to spend our energy looking at expression intensities of genes that might change from close to zero to a larger intensity still close to zero (or vice versa).
2. You have two choices for filtering low intensity data, so do either A or B:
  ◦ A. filtering by Affymetrix Absent/Present calls, or
  ◦ B. filtering by dropping values similar to background
3. A. Using Affymetrix Absent/Present calls
  ◦ Affymetrix performs their Absent/Present calls from p-values associated with detection calls using all probe data for a probeset. See the Statistical Algorithms Description Document for the details.
  ◦ Look at the sheet called "ap", showing Absent/Present calls as determined by Affymetrix.
  ◦ On a new sheet ("norm_filt") of the same file, set the level of each "A" probeset to 1. We won't use 0 because this could cause subsequent problems with division and logarithms.
    ■ Use the Excel "IF" function to show the normalized value if the gene is "Present" or use 1 otherwise. Ex: =IF(ap!B2="P",norm!B2,1)
    ■ The "IF" function takes three arguments: the statement to test, what to do if it's true, what to do if it's false.
4. B. Dropping values similar to background [an alternative to method (A)]
  ◦ One probably wants to discount any expression values which are similar to the background intensity of the chip.
  ◦ "Similar" can be defined as less than two times the standard deviation.
  ◦ Standard deviation of the background is not available for these chips, but the standard deviation of the negative control probesets has been calculated to be 20 (for the data with the median adjusted to 100).
  ◦ On a new sheet ("norm_filt") of the same file, set to 1 the level of each probeset with an expression value < 40. We won't use 0 because this could cause subsequent problems with division and logarithms.
    ■ Use the Excel "IF" function to show the normalized value if the level is greater than 40 or use 1 otherwise. Ex: =IF(norm!B2>40,norm!B2,1)
5. Using your chosen method, what fraction of genes on each chip are present?
  ◦ Copy one column of data and "Paste Special > Values" into another file. "Data > Sort" and count the rows that are greater than 1 divided by the total rows.
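
For readers who prefer a script to a spreadsheet, the Excel steps above can be mirrored in a few lines of Python. This is only a sketch: the function names and the sample values are made up for illustration, while the floor of 1 and the cutoff of 40 (two times the standard deviation of 20) are taken from the steps above.

```python
def filter_by_calls(norm, calls, floor=1):
    """Method A: same as =IF(ap!B2="P",norm!B2,1), applied to a whole column."""
    return [v if c == "P" else floor for v, c in zip(norm, calls)]


def filter_by_background(norm, cutoff=40, floor=1):
    """Method B: same as =IF(norm!B2>40,norm!B2,1), applied to a whole column."""
    return [v if v > cutoff else floor for v in norm]


def fraction_present(filtered, floor=1):
    """Step 5: rows greater than the floor, divided by the total rows."""
    if not filtered:
        return 0.0
    return sum(1 for v in filtered if v > floor) / len(filtered)


# One column of normalized values and its Absent/Present calls (made-up data).
norm = [250.0, 12.5, 38.0, 41.0]
calls = ["P", "A", "P", "A"]

by_calls = filter_by_calls(norm, calls)
by_background = filter_by_background(norm)
print(by_calls, fraction_present(by_calls))
print(by_background, fraction_present(by_background))
```

The floor stays at 1 rather than 0 for the reason given in the steps: a later ratio or logarithm would fail on 0.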

 

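Another thing I had never understood is why one probe so often corresponds to multiple genes. It turns out that on Affymetrix expression arrays, the probes are all designed against the 3' end of the gene. So once alternative splicing is taken into account, a gene may have several transcripts, and these transcripts may all contain the same 3' exon, which is how one probe ends up corresponding to multiple transcripts.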

Why are there "_a" probe sets on the HG-U133 2.0 Arrays but not on the previous generation HG-U133 Arrays?

At the time of the HG-U133 Set design the "_a" probe set was not defined. The non-unique probe set type, "_a", was introduced with the Mouse Expression Set 430 to indicate probe sets that recognize multiple alternative transcripts from the same gene. Probe sets with common probes among multiple transcripts from separate genes are annotated with a "_s" suffix. Another way to describe this is that "_a" probe sets have a gene cluster count = 1 and transcript count >1, and "_s" probe sets have a gene cluster count > 1 and transcript count >1. The gene cluster count and transcript counts are available through the NetAffx™ Analysis Center. For consistency, the names of existing probe sets with the "_s" suffixes were not changed between the HG-U133 Set and the HG-U133 Plus 2.0 and HG-U133A 2.0 Arrays. The new content on the HG-U133 Plus 2.0 Array incorporates both the "_a" and "_s" probe sets.

 

What does this mean?

Sometimes a probe can bind multiple transcripts of a single gene; such probes are marked with _a to distinguish them from the best-match probe.

Likewise, a probe may also bind transcripts from other genes; such probes are marked with _s to distinguish them from the best-match probe and from the probes marked _a.
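
The naming rule quoted above can be written down as a small Python sketch. The count-based rule follows the FAQ text directly; the function names, the "other" label for everything else, and the simple string parsing of probe set IDs are assumptions for illustration and do not cover every suffix Affymetrix uses.

```python
def classify_by_counts(gene_cluster_count, transcript_count):
    """Apply the FAQ rule: '_a' = one gene cluster, several transcripts;
    '_s' = several gene clusters, several transcripts."""
    if transcript_count > 1 and gene_cluster_count == 1:
        return "_a"
    if transcript_count > 1 and gene_cluster_count > 1:
        return "_s"
    return "other"


def suffix_type(probe_set_id):
    """Read the '_a' or '_s' flag from an ID such as '1007_s_at'.

    Any ID without one of these two flags is reported as 'other'.
    """
    parts = probe_set_id.split("_")
    if len(parts) >= 3 and parts[-1] == "at" and parts[-2] in ("a", "s"):
        return "_" + parts[-2]
    return "other"


print(classify_by_counts(gene_cluster_count=1, transcript_count=3))
print(classify_by_counts(gene_cluster_count=2, transcript_count=4))
print(suffix_type("1007_s_at"), suffix_type("1053_at"))
```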
