Semi-enzymatic searches and wide mass tolerances benefit target-decoy-based filtering

Experts agree: use semi-enzymatic search for best sensitivity and specificity

Three of the world’s leading experts on MS-MS protein identification came together recently at Sage-N Research’s annual user group meeting, and presented methods and results for the techniques and tools with which they are associated:

  • Jimmy Eng, co-inventor of Sequest and developer of many proteomics tools, presented tips for Sequest analysis
  • Josh Elias, who pioneered the systematic use of decoy databases for FDR estimation, gave a talk on how to use that technique to address Peptide ID signal-to-noise.
  • Alexey Nesvizhskii spoke about the tools he co-authored, in “Peptide identification and protein inference using PeptideProphet and ProteinProphet”

Their talks were very wide-ranging and full of practical insights for the proteomics user community, and they explored the different research interests, data sets, analysis methods and workflows in the individual labs.  However, they all had this in common: they had kept a careful eye on their search settings, monitored sensitivity and error rates, and come to a common, if perhaps not entirely intuitive, conclusion: the most sensitive search and the lowest error rates for shotgun proteomics are achieved when using semi-enzymatic searches — that is, when one end, but not both, of the peptide is allowed to diverge from the expected cleavage site.

Thus, in the case of the commonly used enzyme trypsin, which is expected to cleave only after K or R residues, a match is allowed even if one end or the other of the peptide varies from this. This works better than either forcing both ends to the K/R constraint (“fully enzymatic”) or imposing no constraints on either end (“no enzyme”). As we will see later, the reason for the better performance has to do with better statistics rather than whether the trypsin actually cleaved in the proper place.
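To make the three categories concrete, here is a minimal Python sketch of how a candidate peptide could be labelled by its termini. The function name `classify_termini` and the example sequence are made up for illustration, and the rule is simplified to "cleave after K or R" as described above (it ignores the proline exception that many search engines apply). It is not how Sequest implements the constraint.

```python
def classify_termini(protein: str, start: int, end: int) -> str:
    """Label the peptide protein[start:end] by how many ends obey the K/R rule.

    Hypothetical helper for illustration only. An end counts as "tryptic"
    if it coincides with a protein terminus or follows a K or R residue.
    """
    # N-terminal side: the residue just before the peptide must be K or R.
    n_term_ok = start == 0 or protein[start - 1] in "KR"
    # C-terminal side: the last residue of the peptide must be K or R.
    c_term_ok = end == len(protein) or protein[end - 1] in "KR"

    tryptic_ends = int(n_term_ok) + int(c_term_ok)
    labels = {2: "fully tryptic", 1: "semi-tryptic", 0: "non-tryptic"}
    return labels[tryptic_ends]


protein = "MKTAYIAKQRQISFVK"  # made-up sequence
print(classify_termini(protein, 2, 8))  # TAYIAK
print(classify_termini(protein, 3, 8))  # AYIAK
print(classify_termini(protein, 3, 7))  # AYIA
```

A fully tryptic search would only consider candidates in the first category, a semi-tryptic search the first two, and a no-enzyme search all three.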

Jimmy Eng used a complex, 6-file yeast proteomics shotgun dataset run on an Orbitrap and a Sequest/PeptideProphet pipeline to analyse the effect of several search settings — including mass tolerance, monoisotopic vs. average masses, and the effect of using various scores — on the number of hits found for a given false discovery rate (FDR).

This slide from his talk compares the number of hits found for a given FDR when using a semi-tryptic, no-enzyme and tryptic search respectively.

It is quite apparent that there is a significant advantage, up to 15% or so, to using semitryptic conditions rather than fully tryptic ones at all target FDRs. Semitryptic also behaves better than no-enzyme, ranging from a modest improvement at very low FDRs to considerable differences in the mid- to high-range. These data validate Jimmy Eng’s standard practice of using semi-tryptic searches as the method of choice.

Josh Elias from Stanford agrees. He says, “Partial tryptic search, or semi-tryptic searching, is ideal. And the reason for this is that it helps to distract away all the incorrect identifications”. 

He showed this slide, which shows numbers of hits for a proteomics experiment against a target database and against a decoy database of assumed noise, using the target-decoy strategy presented in his talk and published in Elias & Gygi, “Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry”, Nature Methods 4, 207-214 (2007).

What is noticeable is that the number of partial tryptic hits against the decoy database is pretty much the same as against the target database. This suggests that they are all due to chance, and that in fact few represent true hits. In contrast, there are many more target hits that are fully tryptic than there are decoy hits, so most of those are likely to be real.

So if all the real hits are in fact tryptic, why is it better to do a semitryptic search? Why not just look for tryptic hits in the first place? The reason has to do with the statistics of the noise distribution, or the ability, as Dr. Elias puts it, to “distract away the incorrect identifications”. If the noise distribution can be more clearly identified, and teased apart from the true hit distribution, it increases the sensitivity of the analysis at a given FDR. Once the search is complete, semitryptic hits can be downgraded — as they are in the PeptideProphet analysis — or filtered out entirely, as they are in the Elias-Gygi method.
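The FDR bookkeeping behind this argument can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact procedure from the talks: the function name `threshold_at_fdr` and the toy scores are hypothetical, the FDR is estimated as decoy hits divided by target hits above a score cutoff (one common convention; the cited paper and individual tools may use a different formula), and tied scores are not handled specially.

```python
from typing import List, Optional, Tuple


def threshold_at_fdr(
    psms: List[Tuple[float, bool]], max_fdr: float
) -> Tuple[Optional[float], int]:
    """Find the loosest score cutoff whose estimated FDR stays within max_fdr.

    psms: (score, is_decoy) pairs, higher score = better match.
    Returns (score_cutoff, number_of_target_hits), or (None, 0) if no
    cutoff qualifies. FDR estimate (an assumption): decoys / targets.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best: Tuple[Optional[float], int] = (None, 0)
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Accept this cutoff if the estimated FDR is still within the limit.
        if targets and decoys / targets <= max_fdr:
            best = (score, targets)
    return best


# Toy data: (score, is_decoy)
psms = [(9.0, False), (8.0, False), (7.0, False), (6.0, True),
        (5.0, False), (4.0, True), (3.0, True)]
print(threshold_at_fdr(psms, 0.25))
```

Comparing the number of target hits returned at the same `max_fdr` for tryptic, semitryptic and no-enzyme searches is the kind of comparison the slides in these talks report.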

This also suggests why no-enzyme searching is not an improvement. It overdoes the noise, and specificity is impaired. Furthermore, it is computationally much more complex and may require a large and counterproductive increase in search time.

A slide from Alexey Nesvizhskii’s talk bears this out. He compared — using yet another dataset, this time from an LTQ-FT and analyzed with Sequest and TPP — the effects of applying tryptic, semitryptic or no enzyme constraints, and of using a narrow (NW) or large (LW) mass tolerance window.

The answer at the back of the book is highlighted in yellow! The data points show that, for a given mass tolerance (“NW” or “LW”), one gets considerably more hits when using semitryptic conditions rather than no-enzyme (“unconst”) or, particularly, fully tryptic. Dr. Nesvizhskii would also recommend using a larger mass tolerance, even when the mass accuracy of peptide measurements is expected to be high (as for this LTQ-FT data), for similar reasons.

To sum up, the insights from these three expert talks offer some very practical conclusions for the analysis of shotgun proteomic data from high-resolution MS-MS instruments such as the Orbitrap:

  1. Use semitryptic settings for database searches as the default to get the best performance
  2. Consider using a somewhat larger mass tolerance than the mass accuracy you expect from your experiments
  3. Postprocess your peptide IDs with proper statistics and filtering tools, such as PeptideProphet, DTASelect or target-decoy analysis
  4. Keep an eye on the FDR and enjoy the extra good IDs you get!

Tags: Decoy, Elias, Eng, Nesvizhskii, PeptideProphet, semienzyme, semitryptic, sensitivity, SEQUEST, specificity, TPP
