Bayes Nets for Genetic Inheritance
基因遗传的贝叶斯网络
1.构建基因遗传的贝叶斯网络
本章要求构建如下图所示的贝叶斯网络:
图中,变量1、2、3分别表示父母及子女的基因型(Genotype),变量4、5、6分别表示父母及子女基因型所对应的性状(Phenotype)。同时,基因型本身由等位基因(Allele)决定。图中的三个虚线框标记了组成整个贝叶斯网络的三个基本模板因子(Template),也就是本节编程作业的2-4题。最后我们可以通过组合这些模板因子构建出完整的贝叶斯网络。
phenotypeGivenGenotypeMendelianFactor.m 基因型决定性状的孟德尔因子
输入:
isDominant = 1;
genotypeVar = 1;
phenotypeVar = 3;
期望输出:
phenotypeFactor = struct('var', [3,1], 'card', [2,3], 'val', [1,0,1,0,0,1]);
因子phenotypeFactor的结构及意义如下:
var:表示变量(variables)的名称及它们之间的关系,这里('var', [3,1])表示因子描述的是phenotypeVar = 3的性状对genotypeVar = 1的基因型的条件概率分布。
card:是基数(cardinalities)的缩写,描述了各变量的状态数量,这里表示性状可能的状态数为2(有或没有此性状),基因型可能的状态数为3(AA,Aa,aa三种)。
val:表示因子中各变量可能组合对应的概率值(values)。由于这个问题是基于孟德尔遗传理论,同时isDominant = 1,表示该等位基因为显性基因,因此当基因型为AA、Aa的时候,表现出对应性状,而基因型为aa时不表现出对应性状。
参考代码如下:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %INSERT YOUR CODE HERE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% phenotypeFactor.var = [phenotypeVar, genotypeVar]; % phenotypeFactor.card(1) : number of possible phenotypes % phenotypeFactor.card(2) : number of genotypes phenotypeFactor.card = [2, 3]; assignments = IndexToAssignment(1 : prod(phenotypeFactor.card), phenotypeFactor.card); % trait = 1, no trait = 2 | AA = 1, Aa = 2, aa = 3 Index = [find(assignments(:, 1) == 1 & assignments(:, 2) == 1); find(assignments(:, 1) == 1 & assignments(:, 2) == 2); find(assignments(:, 1) == 2 & assignments(:, 2) == 3)]'; phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card)); phenotypeFactor.val(Index) = 1; if isDominant == 0 phenotypeFactor.val = ~phenotypeFactor.val; end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
phenotypeGivenGenotypeFactor.m 基因型决定性状的因子
很多基因型与性状之间的关系并不严格遵守孟德尔遗传定律,而是仅仅影响体现某性状的可能性。这一节就是在孟德尔遗传定律的基础上,对这种更为合理的情况进行建模。
输入:
alphaList = [0.8; 0.6; 0.1];
genotypeVar = 1;
phenotypeVar = 3;
期望输出:
phenotypeFactorAlpha = struct('var', [3,1], 'card', [2,3], 'val', [0.8,0.2,0.6,0.4,0.1,0.9]);
参考代码如下:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %INSERT YOUR CODE HERE % The number of genotypes is the length of alphaList. % The number of phenotypes is 2. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% phenotypeFactor.var = [phenotypeVar, genotypeVar]; % phenotypeFactor.card(1) : number of possible phenotypes % phenotypeFactor.card(2) : number of genotypes phenotypeFactor.card = [2, length(alphaList)]; assignments = IndexToAssignment(1 : prod(phenotypeFactor.card), phenotypeFactor.card); phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card)); for ii = 1 : length(alphaList) phenotypeFactor.val(find(assignments(:, 1) == 1 & assignments(:, 2) == ii)) = alphaList(ii); phenotypeFactor.val(find(assignments(:, 1) == 2 & assignments(:, 2) == ii)) = 1 - alphaList(ii); end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
genotypeGivenAlleleFreqsFactor.m 等位基因决定基因型的因子
本节要求对给定等位基因概率的情况下,计算基因型的概率分布。注意下基因型Aa与aA相同就好。
输入:
alleleFreqs = [0.1; 0.9];
genotypeVar = 1;
期望输出:
genotypeFactor = struct('var', [1], 'card', [3], 'val', [0.01,0.18,0.81]);
参考代码如下:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %INSERT YOUR CODE HERE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% genotypeFactor.var = genotypeVar; genotypeFactor.card = size(genotypesToAlleles, 1); genotypeFactor.val = zeros(1, prod(genotypeFactor.card)); for ii = 1 : genotypeFactor.card if genotypesToAlleles(ii, 1) == genotypesToAlleles(ii, 2) genotypeFactor.val(ii) = alleleFreqs(genotypesToAlleles(ii, 1)) ^ 2; else genotypeFactor.val(ii) = 2 * alleleFreqs(genotypesToAlleles(ii, 1)) * ... alleleFreqs(genotypesToAlleles(ii, 2)); end end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
genotypeGivenParentsGenotypesFactor.m 父母基因型决定子女基因型的因子
这里var中存在3个参数,表示[子女的基因型编号,父母1的基因型编号,父母2的基因型编号]。
输入:
numAlleles = 2;
genotypeVarChild = 3;
genotypeVarParentOne = 1;
genotypeVarParentTwo = 2;
期望输出:
genotypeFactorPar = struct('var', [3,1,2], 'card', [3,3,3], 'val', [1,0,0,0.5,0.5,0,0,1,0,0.5,0.5,0,0.25,0.5,0.25,0,0.5,0.5,0,1,0,0,0.5,0.5,0,0,1]);
当然,numAlleles也可能等于3、4、5之类的,合理利用题目中提供的generateAlleleGenotypeMappers函数即可。
参考代码如下:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %INSERT YOUR CODE HERE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% genotypeFactor.var = genotypeVar; genotypeFactor.card = size(genotypesToAlleles, 1); genotypeFactor.val = zeros(1, prod(genotypeFactor.card)); for ii = 1 : genotypeFactor.card if genotypesToAlleles(ii, 1) == genotypesToAlleles(ii, 2) genotypeFactor.val(ii) = alleleFreqs(genotypesToAlleles(ii, 1)) ^ 2; else genotypeFactor.val(ii) = 2 * alleleFreqs(genotypesToAlleles(ii, 1)) * ... alleleFreqs(genotypesToAlleles(ii, 2)); end end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
constructGeneticNetwork.m 构建基因贝叶斯网络
对比之前的贝叶斯网络结构图,我们可以看到构成网络的三个模板因子已经完成了,接下来的工作就是把它们拼起来……
输入:
pedigree = struct('parents', [0,0;1,3;0,0]);
pedigree.names = {'Ira','James','Robin'};
alleleFreqs = [0.1; 0.9];
alphaList = [0.8; 0.6; 0.1];
期望输出:
sampleFactorList = load('sampleFactorList.mat'); % 这个输出太多了,跑下程序就好,我们主要看输入结构哈~
这里,pedigree描述了参数直接的结构关系,就是谁是谁的爸爸谁是谁的儿子,他们这几个人叫什么……其它的参考上述练习即可。
参考代码如下:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %INSERT YOUR CODE HERE % Variable numbers: % 1 - numPeople: genotype variables % numPeople+1 - 2*numPeople: phenotype variables %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% for ii = 1 : numPeople if sum(pedigree.parents(ii, :)) == 0 factorList(ii) = genotypeGivenAlleleFreqsFactor(alleleFreqs, ii); else factorList(ii) = genotypeGivenParentsGenotypesFactor(numAlleles, ii, pedigree.parents(ii, 1), pedigree.parents(ii, 2)); end end for ii = numPeople + 1 : 2 * numPeople factorList(ii) = phenotypeGivenGenotypeFactor(alphaList, ii - numPeople, ii); end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2.构建解耦贝叶斯网络
前面我们构建的贝叶斯网络,并没有考虑到染色体对的影响。这里我们在考虑基因成对出现的情况下,构建新的贝叶斯网络如下图所示:
可以看到,在加入基因对的概念后,整个网络各变量之间存在比较复杂的相互影响关系,也就是耦合关系。这大概也就是题目称之为"解耦网络"的原因——我们要通过新的模板因子将网络进行简化,以实现对这种复杂关系的建模。
phenotypeGivenCopiesFactor.m 成对基因决定性状的因子
输入:
alphaListThree = [0.8; 0.6; 0.1; 0.5; 0.05; 0.01];
numAllelesThree = 3;
genotypeVarMotherCopy = 1;
genotypeVarFatherCopy = 2;
phenotypeVar = 3;
期望输出:
phenotypeFactorPar = struct('var', [3,1,2], 'card', [2,3,3], 'val', [0.8,0.2,0.6,0.4,0.1,0.9,0.6,0.4,0.5,0.5,0.05,0.95,0.1,0.9,0.05,0.95,0.01,0.99]);
注意alphaList的长度等于numAlleles的组合数,也就是genotypesToAlleles的列数。
参考代码如下:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %INSERT YOUR CODE HERE % The number of phenotypes is 2 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% phenotypeFactor.var = [phenotypeVar, geneCopyVarOne, geneCopyVarTwo]; phenotypeFactor.card = [2, numAlleles, numAlleles]; assignments = IndexToAssignment(1 : prod(phenotypeFactor.card), phenotypeFactor.card); phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card)); allelesCombination = sort(assignments(:, [2, 3]), 2); for ii = 1 : size(genotypesToAlleles, 1) phenotypeFactor.val(find(sum(allelesCombination == genotypesToAlleles(ii, :), 2) == 2 & assignments(:, 1) == 1)) = alphaList(ii); phenotypeFactor.val(find(sum(allelesCombination == genotypesToAlleles(ii, :), 2) == 2 & assignments(:, 1) == 2)) = 1 - alphaList(ii); end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
constructDecoupledGeneticNetwork.m 构建解耦基因网络
之后,我们只要用成对基因决定性状的因子替代基因型决定性状的因子,之后调整下之前网络的输入输出结构就好~
输入:
pedigree = struct('parents', [0,0;1,3;0,0]);
pedigree.names = {'Ira','James','Robin'};
alleleFreqsThree = [0.1; 0.7; 0.2];
alleleListThree = {'F', 'f', 'n'};
alphaListThree = [0.8; 0.6; 0.1; 0.5; 0.05; 0.01];
期望输出:
sampleFactorListDecoupled = load('sampleFactorListDecoupled.mat'); % 这个输出也太多了……
参考代码如下:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % INSERT YOUR CODE HERE % Variable numbers: % 1 - numPeople: first parent copy of gene variables % numPeople+1 - 2*numPeople: second parent copy of gene variables % 2*numPeople+1 - 3*numPeople: phenotype variables %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% for ii = 1 : numPeople if sum(pedigree.parents(ii, :)) == 0 factorList(ii) = childCopyGivenFreqsFactor(alleleFreqs, ii); else factorList(ii) = childCopyGivenParentalsFactor(numAlleles, ii, pedigree.parents(ii, 1), pedigree.parents(ii, 1) + numPeople); end end for ii = numPeople + 1 : 2 * numPeople if sum(pedigree.parents(ii - numPeople, :)) == 0 factorList(ii) = childCopyGivenFreqsFactor(alleleFreqs, ii); else factorList(ii) = childCopyGivenParentalsFactor(numAlleles, ii, pedigree.parents(ii - numPeople, 2), pedigree.parents(ii - numPeople, 2) + numPeople); end end for ii = 2 * numPeople + 1 : 3 * numPeople factorList(ii) = phenotypeGivenCopiesFactor(alphaList, numAlleles, ii - 2 * numPeople, ii - numPeople, ii); end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
可喜可贺,到这里我们就只剩下一个sigmoid贝叶斯网络啦~
3.构建性状由多基因决定的贝叶斯网络
说到sigmoid,我想听了机器学习相关课程的同学应该都很熟,这里就不累述了,总之原理和逻辑回归(或者是单sigmoid核的感知机)一模一样,理解起来也没啥难度的说。
上边第二个图出自林轩田老师的机器学习基石课程ppt,模型是一模一样的。
constructSigmoidPhenotypeFactor.m 构建sigmoid性状因子
输入:
alleleWeights = {[3, -3], [0.9, -0.8]};
geneCopyVarParentOneList = [1; 2];
geneCopyVarParentTwoList = [4; 5];
phenotypeVar = 3;
期望输出:
phenotypeFactorSigmoid = struct('var', [3,1,2,4,5], 'card', [2,2,2,2,2], 'val', [0.999590432835014,0.000409567164986080,0.858148935099512,0.141851064900488,0.997762151478724,0.00223784852127629,0.524979187478940,0.475020812521060,0.858148935099512,0.141851064900488,0.0147740316932731,0.985225968306727,0.524979187478940,0.475020812521060,0.00273196076301106,0.997268039236989,0.997762151478724,0.00223784852127629,0.524979187478940,0.475020812521060,0.987871565015726,0.0121284349842742,0.167981614866076,0.832018385133925,0.524979187478940,0.475020812521060,0.00273196076301106,0.997268039236989,0.167981614866076,0.832018385133925,0.000500201107079564,0.999499798892920]);
嘛嘛,注意下基因排列的顺序就好啦~(我才不会说我看错顺序了……)
参考代码如下:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % INSERT YOUR CODE HERE % Note that computeSigmoid.m will be useful for this function. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% phenotypeFactor.var = [phenotypeVar, geneCopyVarOneList', geneCopyVarTwoList']; phenotypeFactor.card = [2, repmat(length(alleleWeights{1}), [1, length(geneCopyVarOneList)]), ... repmat(length(alleleWeights{2}), [1, length(geneCopyVarTwoList)])]; phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card)); assignments = IndexToAssignment(1 : prod(phenotypeFactor.card), phenotypeFactor.card); for ii = 1 : prod(phenotypeFactor.card) sumWeights = 0; for k1 = 1 : length(geneCopyVarOneList') sumWeights = sumWeights + alleleWeights{k1}(assignments(ii, k1 + 1)); end for k2 = 1 : length(geneCopyVarTwoList') sumWeights = sumWeights + alleleWeights{k2}(assignments(ii, k2 + length(geneCopyVarOneList') + 1)); end if assignments(ii, 1) == 1 phenotypeFactor.val(ii) = computeSigmoid(sumWeights); else phenotypeFactor.val(ii) = 1 - computeSigmoid(sumWeights); end end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
最后附上成绩截图~哦哈哈哈哈哈……好像也没啥了不起(0.0'),继续加油~