Paper reading (26): Supervised Machine Learning for Population Genetics: A New Paradigm

Paper title: Supervised Machine Learning for Population Genetics: A New Paradigm

Google Scholar citations: 65

Pages: 12

Published: 2017.12

Journal: Trends in Genetics

Authors: Daniel R. Schrider and Andrew D. Kern

Highlights:

  • ML methods are powerful approaches that have revolutionized many fields, but their use in population genetics inference is only beginning.
  • These methods are able to take advantage of high dimensional input - an important asset for population genetics inference - and are often more robust than other statistical approaches.
  • The early applications of ML to population genetics demonstrate that they outperform traditional approaches.
  • In this review we introduce ML to a biology audience, discuss examples of their application to evolutionary and population genetics, and lay out future directions that we view as promising.

Abstract:

As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.

Conclusions:

  • Deep learning should be an important area of future research.
  • A general challenge: making more structured population genetics inferences beyond simple parameter estimation or classification.
  • It is not clear to what extent the supervised ML techniques discussed above could be used to infer genealogies or other tree-like structures.

Outline of the main text:

1. Machine Learning for population genetics

2. An introduction to machine learning

3. Why use machine learning?

4. Supervised ML in population genetics by training on real data: finding purifying selection

5. Finding selective sweeps in the genome

6. Inferring demography and recombination

7. Coestimation of selection and demography

8. Concluding remarks and future direction

9. Outstanding questions

Excerpts from the main text:

1. Machine Learning for population genetics

  • Classical statistical estimation relies on a convenient probabilistic model, or on an approximation to that model.
  • ML methods can teach us something about nature.
  • An equally important advantage of the ML paradigm is that it enables the efficient use of high-dimensional inputs which act as dependent variables, without specific knowledge of the joint probability distribution of these variables.

2. An introduction to machine learning

  • Machine learning is generally divided into supervised learning and unsupervised learning.
  • Figure II. An Example Application of Supervised ML to Demographic Model Selection. The code is provided on GitHub: https://github.com/kern-lab/popGenMachineLearningExamples.
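
To make the supervised workflow concrete (simulate labeled examples, train, then predict on held-out data), here is a minimal sketch in plain Python. It is not the code from the repository above: the two model labels, the Gaussian "simulator", and the nearest-centroid classifier are all assumptions chosen to keep the example self-contained, standing in for a real coalescent simulator and a real classifier.

```python
import random

def simulate(model, rng, n_stats=3):
    """Toy stand-in for a population genetic simulator (an assumption):
    returns a vector of summary statistics whose mean depends on the model."""
    shift = {"constant": 0.0, "bottleneck": 1.5}[model]
    return [rng.gauss(shift, 1.0) for _ in range(n_stats)]

def make_dataset(n_per_class, rng):
    """Build labeled (features, label) pairs by simulating under each model."""
    data = []
    for label in ("constant", "bottleneck"):
        for _ in range(n_per_class):
            data.append((simulate(label, rng), label))
    return data

def train_centroids(data):
    """Training step: average the feature vectors of each class."""
    by_label = {}
    for features, label in data:
        by_label.setdefault(label, []).append(features)
    return {
        label: [sum(col) / len(col) for col in zip(*rows)]
        for label, rows in by_label.items()
    }

def predict(centroids, features):
    """Prediction step: pick the class whose centroid is closest."""
    def sqdist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

rng = random.Random(0)
train = make_dataset(500, rng)   # labeled training simulations
test = make_dataset(200, rng)    # held-out simulations for evaluation
centroids = train_centroids(train)
accuracy = sum(predict(centroids, x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In practice the classifier would be something like a random forest or neural network and the features would be real summary statistics, but the train/evaluate split shown here has the same shape.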

3. Why use machine learning?

  • computational efficiency

4. Supervised ML in population genetics by training on real data: finding purifying selection

  • When empirically derived training data are available, supervised ML can be used to make accurate predictions in datasets that cannot be adequately modeled with a reasonable number of parameters.

5. Finding selective sweeps in the genome

  • One population genetic question that has received recent attention using ML approaches is that of detecting selective sweeps: the signature left by an adaptive mutation that rapidly increases in allele frequency until reaching fixation.
  • The methods listed above have two commonalities: they use ML to perform classification on multidimensional input, and they handily outperform more traditional univariate methods.
  • approximate Bayesian computation (ABC)
  • ABC has some important drawbacks that ML overcomes: ABC is susceptible to the curse of dimensionality, and it carries a heavy computational burden.
  • A third difference between ML and ABC is that of interpretability. In the realm of ABC it is not clear which summaries are responsible for a signal. By contrast, many ML methods allow direct measurement of the contribution of each feature.
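
The curse of dimensionality noted above for ABC can be illustrated with a small rejection-ABC sketch. Everything here is a toy assumption (the uniform prior, the Gaussian "simulator", the tolerance, and the all-zero observed statistics); the point is only that, at a fixed tolerance, the fraction of accepted simulations shrinks quickly as more summary statistics are added.

```python
import random

def abc_acceptance_rate(n_stats, n_draws=20000, tol=0.5, seed=0):
    """Fraction of prior draws accepted by rejection ABC when the simulated
    summary vector must lie within Euclidean distance tol of the observed one."""
    rng = random.Random(seed)
    observed = [0.0] * n_stats            # toy "observed" summary statistics
    accepted = 0
    for _ in range(n_draws):
        theta = rng.uniform(-1.0, 1.0)    # draw a parameter from the prior
        # Toy simulator (an assumption): each summary is theta plus unit noise.
        simulated = [rng.gauss(theta, 1.0) for _ in range(n_stats)]
        dist = sum((s - o) ** 2 for s, o in zip(simulated, observed)) ** 0.5
        if dist <= tol:                   # keep draws close to the observation
            accepted += 1
    return accepted / n_draws

# Acceptance rate for an increasing number of summary statistics.
rates = {d: abc_acceptance_rate(d) for d in (1, 2, 5, 10)}
for d, rate in rates.items():
    print(f"{d:2d} summary statistics: acceptance rate {rate:.4f}")
```

Keeping the acceptance rate usable with many statistics requires either a looser tolerance (a worse approximation) or many more simulations, which is the computational burden mentioned in the same bullet.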

6. Inferring demography and recombination

  • Another emerging use of supervised ML in population genetics has been for inference of demographic history and recombination rates.
  • demographic model selection: random forests outperform ABC
  • Supervised ML has also been applied to characterize the rates and patterns of recombination in the genome.

7. Coestimation of selection and demography

  • It is well known that demographic events can mimic the effects of selection, and conversely that selection can confound demographic estimation.

9. Outstanding questions

  • In what scenarios would either ML or ABC be preferable?
  • To what extent can ML methods be made more robust to these assumptions?
  • How feasible will parameter estimation be in more complex evolutionary models using ML tools such as deep neural networks?
  • Can we do better than standard population genetic statistics?
  • How best can we encode population genetic data?
  • Can we use ML to infer structured output in population genetics such as genealogies or ancestral recombination graphs?
  • Can such methods be used as a substitute for population genetic simulation, perhaps to generate very large samples and chromosomes that are computationally costly to simulate?
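
As a reference point for the questions about standard statistics and data encoding, the sketch below computes three classical summaries from a 0/1 haplotype matrix: the number of segregating sites S, nucleotide diversity π (mean pairwise differences), and Watterson's estimator θ_W = S / a_n with a_n = 1 + 1/2 + ... + 1/(n-1). The matrix encoding (rows are sampled chromosomes, columns are sites, 1 marks the derived allele) and the tiny example data are assumptions for illustration, not something specified in the paper.

```python
from itertools import combinations

def summary_stats(haplotypes):
    """Compute S, pi, and Watterson's theta from a list of 0/1 haplotypes."""
    n = len(haplotypes)
    columns = list(zip(*haplotypes))
    # A site is segregating if both alleles are present in the sample.
    seg_sites = sum(1 for col in columns if 0 < sum(col) < n)
    # pi: average number of differences over all pairs of haplotypes.
    pairs = list(combinations(haplotypes, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs
    )
    pi = total_diffs / len(pairs)
    # Watterson's theta: S scaled by the harmonic number a_n.
    a_n = sum(1.0 / i for i in range(1, n))
    theta_w = seg_sites / a_n
    return {"S": seg_sites, "pi": pi, "theta_w": theta_w}

# Hypothetical sample: 4 chromosomes, 4 sites.
haplotypes = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
stats = summary_stats(haplotypes)
print(stats)
```

A vector of such statistics is one common input encoding for supervised methods; feeding the haplotype matrix itself to a model is the alternative that the encoding question points toward.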
