讲解:Stat 27850、Java,c/c++、PythonPython|Matlab

Stat 27850/30850: Problem set 11. Consider a multiple testing scenario where we are testing the global null. Suppose that you have n p-valuesP1, . . . , Pn. A proportion π of these p-values are true signals, distributed asP1, . . . , Pπn ∼ Unif[0, τ ].The rest of them are nulls,Pπ·n+1, . . . , Pn ∼ Unif[0, 1].And all the p-values are independent. For simplicity assume τ ≥ α/n.(a) Write an exact expression for the probability of rejecting the global null, by using the Bonferroni correctionat level α.(b) Calculate the probability of rejecting the global null using Fisher’s test, using the normal approximationto the χ2 distribution (note that χ2m has expected value m and variance 2m, and the normal approximationis very accurate for large m due to the central limit theorem). You should show that this probability is(approximately) equal to Φ(Φ−1(α) −√n · π log(τ )), where Φ is the CDF of a standard normal.(c) Now let π =1n1/3 and τ =12. Show that the probability of rejecting the global null tends to 1 for Fisher’stest, but tends to α for Bonferroni’s (i.e. Bonferroni’s test is no better than random guessing. It’s trivial tosee that the probability of rejection is at least α, i.e., at least as good as random guessing, so you only needto show an upper bound, i.e., the probability of rejection is upper bounded by a quantity that is ≈ α; thiswill be easiest to prove using a union bound).(d) Finally let π =1n2/3 and τ =1n. Show that the probability of rejecting the global null tends to 1 forBonferroni’s test, but tends to α for Fisher’s (i.e. Fisher’s test is no better than random guessing).2. Next let’s examine the difference between Bonferroni’s and Fisher’s methods numerically. We’ll choose α =0.05, n = 230, π = 2−i, and τ = 2−j, for i, j ∈ {1, . . . , 30}. We are not running simulations, just doingcalculations. For each value of i & j, based on your work in the previous problem, setBonferroni[i,j]to be the exact probability that Bonferroni’s method rejects the global null (with this choice of n, π, τ ), and setFisher[i,j]as the same calculation for Fisher (using the normal approximation as before). Now run this code to visualizeyour results:par(mfrow=c(1,2))image(1:30,1:30,Bonferroni,xlab=-log_2(pi),ylab=-log_2(tau),main=Bonferroni,col=gray((0:10)/10))image(1:30,1:30,Fisher,xlab=-log_2(pi),ylab=-log_2(tau),main=Fisher,col=gray((0:10)/10))These types of figures are known as “phase transition diagrams” where we see a sharp transition from a highchance of success to a high chance of failure. The grayscale corresponds to the chance of rejection: white= 100% chance of rejecting the global null, and black = 0% chance. You should see that higher values of π(i.e. lower i), and lower values of τ (i.e. higher j), improve the chances for both methods. However, the regionsof π, τ values where the methods are successful, are different for the two methods. Describe and explain whatyou see.3. Gene expression / COPD & statins. In class we saw a gene expression data set where for each patient, weobserve• A label: whether the patient takes statin代写Stat 27850作业、代做Java,c/c++程序语言作业、代写Python课程设计作业 代写Python编程|s, or not• Covariates: disease/healthy, age, sex• Gene expression levels (log transformed) for each of n = 12381 genes (X[k,i] is the log-transformedgene expression level for person k and gene i)1Copy and paste the following code to download and organize the data:download.file(ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE71nnn/GSE71220/matrix/GSE71220_series_matrix.txt.gz,GSE71220_series_matrix.txt.gz)gene_expr = t(read.table(GSE71220_series_matrix.txt.gz,skip=66,nrows=12381)[,-1])nsample = dim(gene_expr)[1]; n = dim(gene_expr)[2]statin = read.table(GSE71220_series_matrix.txt.gz,skip=26,nrows=1)[-1]statin = (strsplit(toString(unlist(statin)),_)[[1]][2+3*(0:616)] == statin)# which patients take statinsdisease = read.table(GSE71220_series_matrix.txt.gz,skip=37,nrows=1)[-1]disease = (unlist(strsplit(strsplit(toString(unlist(disease)),: )[[1]],,))[2*(1:nsample)] == COPD)# patients with COPD disease or healthy patientsage = read.table(GSE71220_series_matrix.txt.gz,skip=38,nrows=1)[-1]age = as.numeric(unlist(strsplit(strsplit(toString(unlist(age)),: )[[1]],,))[2*(1:nsample)])# age of patientsex = read.table(GSE71220_series_matrix.txt.gz,skip=39,nrows=1)[-1]sex = (unlist(strsplit(strsplit(toString(unlist(sex)),: )[[1]],,))[2*(1:nsample)])# sex of patient (M or F)Let’s look only at patients with COPD disease, and only the first 200 genes to save computation time. We willignore the age & sex covariates for now. We want to see if there is a true association between statin use and geneexpression levels (overall, across the n genes—not necessarily making a claim for any specific gene). Let’s testthis with the statisticwhere Xiis the vector of gene expression levels for gene #i, and Y is the binary vector indicating if the patienttakes statins.Next, we will try the following two variants of a permutation test, to compute a p-value based on T.(a) For each i = 1, . . . , n — permute the vector Xi at random and computerperm.(b) Permute the vector Y at random. Then for each i = 1, . . . , n, For each scheme, run it with 500 permutations and compute a p-value for the observed value of T.Which scheme is a more valid way of testing the global null hypothesis (i.e., testing if there is any associationbetween statin use and gene expression levels)? To explore this more, let’s run the following simulation. First,sample the vector Y (i.e., statins or no statins) randomly. Then using this simulated Y vector, run procedure (a)to produce a p-value via a permutation test, and do the same for procedure (b) (for each procedure, you can doa smaller number of permutations, like 50 or 100, to save time).Now repeat this experiment a large number of times (at least 300 to see a clear trend — be aware this may takesome time to run depending on how you implement it). Create a histogram of the p-values from scheme (a), andanother for scheme (b). If the permutation test is valid, you should see (approximately) a uniform distribution,since we are simulating the procedure with a random vector Y .Explain what you see, and which scheme is valid or invalid and why.2转自:http://www.daixie0.com/contents/9/4673.html

你可能感兴趣的:(讲解:Stat 27850、Java,c/c++、PythonPython|Matlab)