GWAS 操作流程1

参考github GWAS教学:https://github.com/MareesAT/GWA_tutorial/

总流程分为4部分

1. QC

2. Population Stratification,

3. Association Analyses

4. Polygenic Risk Score (PRS) analyses.

首先是下载数据集以及相关文件


mkdir GWAS_test

cd GWAS_test

git clone https://github.com/MareesAT/GWA_tutorial.git

unzip 1_QC_GWAS.zip

cd 1_QC_GWAS

1. QC

qc的内容可以参考文献

Marees AT, de Kluiver H, Stringer S, et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res. 2018;27(2):e1608. doi:10.1002/mpr.1608

QC一般分为7步

(1) snp 缺失

(2) 性别筛选

(3) MAF 频率筛选

(4) 哈代温伯格平衡测试

(5) 杂合率

(6) 相关性

(7) 人口分层

1.1 查看是否有SNP,individual缺失。

plink --bfile HapMap_3_r3_1 --missing 

Rscript --no-save hist_miss.R

截屏2020-05-13 下午12.34.04.png
截屏2020-05-13 下午12.34.17.png

根据SNPs,individuals缺失值设置阈值进行筛选。


# Delete SNPs with missingness >0.2.

plink --bfile HapMap_3_r3_1 --geno 0.2 --make-bed --out HapMap_3_r3_2

# Delete individuals with missingness >0.2.

plink --bfile HapMap_3_r3_2 --mind 0.2 --make-bed --out HapMap_3_r3_3

# Delete SNPs with missingness >0.02.

plink --bfile HapMap_3_r3_3 --geno 0.02 --make-bed --out HapMap_3_r3_4

# Delete individuals with missingness >0.02.

plink --bfile HapMap_3_r3_4 --mind 0.02 --make-bed --out HapMap_3_r3_5

1.2 性别差异检验

根据男性和女性在X染色体上SNPs分布频率进行性别信息检验,女性F值<0.2;男性F值>0.8将不符合的样本标记为“PROBLEM”.


plink --bfile HapMap_3_r3_5 --check-sex

Rscript --no-save gender_check.R

截屏2020-05-13 下午12.45.26.png
截屏2020-05-13 下午12.45.33.png

从结果看,女性样本中有一个F值>0.2.

处理性别异常个体有两个方法:删除或根据F值校正。这里选择删除的方式。

删除


grep "PROBLEM" plink.sexcheck| awk '{print$1,$2}'> sex_discrepancy.txt

# This command generates a list of individuals with the status ìPROBLEMî.

plink --bfile HapMap_3_r3_5 --remove sex_discrepancy.txt --make-bed --out HapMap_3_r3_6

# This command removes the list of individuals with the status ìPROBLEMî.

校正


plink --bfile HapMap_3_r3_5 --impute-sex --make-bed --out HapMap_3_r3_6

# This imputes the sex based on the genotype information into your data set.

1.3 生成仅包含常染色体且去除low MAF的bfile


# Select autosomal SNPs only (i.e., from chromosomes 1 to 22).

awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt

plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7

# Generate a plot of the MAF distribution.

plink --bfile HapMap_3_r3_7 --freq --out MAF_check

Rscript --no-save MAF_check.R

截屏2020-05-13 下午12.58.13.png

# Remove SNPs with a low MAF frequency.

plink --bfile HapMap_3_r3_7 --maf 0.05 --make-bed --out HapMap_3_r3_8

# 1073226 SNPs are left

# A conventional MAF threshold for a regular GWAS is between 0.01 or 0.05, depending on sample size.

1.4 去除不符合哈代温伯格平衡(HWE)的SNPs


# Check the distribution of HWE p-values of all SNPs.

plink --bfile HapMap_3_r3_8 --hardy

# Selecting SNPs with HWE p-value below 0.00001, required for one of the two plot generated by the next Rscript, allows to zoom in on strongly deviating SNPs.

awk '{ if ($9 <0.00001) print $0 }' plink.hwe>plinkzoomhwe.hwe

Rscript --no-save hwe.R

截屏2020-05-13 下午1.06.09.png
截屏2020-05-13 下午1.06.17.png

# By default the --hwe option in plink only filters for controls.

# Therefore, we use two steps, first we use a stringent HWE threshold for controls, followed by a less stringent threshold for the case data.

plink --bfile HapMap_3_r3_8 --hwe 1e-6 --make-bed --out HapMap_hwe_filter_step1

# The HWE threshold for the cases filters out only SNPs which deviate extremely from HWE.

# This second HWE step only focusses on cases because in the controls all SNPs with a HWE p-value < hwe 1e-6 were already removed

plink --bfile HapMap_hwe_filter_step1 --hwe 1e-10 --hwe-all --make-bed --out HapMap_3_r3_9

# Theoretical background for this step is given in our accompanying article: https://www.ncbi.nlm.nih.gov/pubmed/29484742 .

1.5 计算杂合率

进行杂合率检验,删除不在mean±3SD范围内的样本


# Checks for heterozygosity are performed on a set of SNPs which are not highly correlated.

# Therefore, to generate a list of non-(highly)correlated SNPs, we exclude high inversion regions (inversion.txt [High LD regions]) and prune the SNPs using the command --indep-pairwiseí.

# The parameters ë50 5 0.2í stand respectively for: the window size, the number of SNPs to shift the window at each step, and the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously.

plink --bfile HapMap_3_r3_9 --exclude inversion.txt --range --indep-pairwise 50 5 0.2 --out indepSNP

# Note, don't delete the file indepSNP.prune.in, we will use this file in later steps of the tutorial.

plink --bfile HapMap_3_r3_9 --extract indepSNP.prune.in --het --out R_check

# This file contains your pruned data set.

# Plot of the heterozygosity rate distribution

Rscript --no-save check_heterozygosity_rate.R

截屏2020-05-13 下午1.17.28.png

# The following code generates a list of individuals who deviate more than 3 standard deviations from the heterozygosity rate mean.

# For data manipulation we recommend using UNIX. However, when performing statistical calculations R might be more convenient, hence the use of the Rscript for this step:

Rscript --no-save heterozygosity_outliers_list.R

# Output of the command above: fail-het-qc.txt .

# When using our example data/the HapMap data this list contains 2 individuals (i.e., two individuals have a heterozygosity rate deviating more than 3 SD's from the mean).

# Adapt this file to make it compatible for PLINK, by removing all quotation marks from the file and selecting only the first two columns.

sed 's/"// g' fail-het-qc.txt | awk '{print$1, $2}'> het_fail_ind.txt

# Remove heterozygosity rate outliers.

plink --bfile HapMap_3_r3_9 --remove het_fail_ind.txt --make-bed --out HapMap_3_r3_10

1.6相关性检验

Assuming a random population sample we are going to exclude all individuals above the pihat threshold of 0.2 in this tutorial.


# Check for relationships between individuals with a pihat > 0.2.

plink --bfile HapMap_3_r3_10 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2

# The HapMap dataset is known to contain parent-offspring relations.

# The following commands will visualize specifically these parent-offspring relations, using the z values.

awk '{ if ($8 >0.9) print $0 }' pihat_min0.2.genome>zoom_pihat.genome

# Generate a plot to assess the type of relationship.

Rscript --no-save Relatedness.R

截屏2020-05-13 下午1.30.11.png

# The generated plots show a considerable amount of related individuals (explentation plot; PO = parent-offspring, UN = unrelated individuals) in the Hapmap data, this is expected since the dataset was constructed as such.

# Normally, family based data should be analyzed using specific family based methods. In this tutorial, for demonstrative purposes, we treat the relatedness as cryptic relatedness in a random population sample.

# In this tutorial, we aim to remove all 'relatedness' from our dataset.

# To demonstrate that the majority of the relatedness was due to parent-offspring we only include founders (individuals without parents in the dataset).

plink --bfile HapMap_3_r3_10 --filter-founders --make-bed --out HapMap_3_r3_11

# Now we will look again for individuals with a pihat >0.2.

plink --bfile HapMap_3_r3_11 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2_in_founders

# The file 'pihat_min0.2_in_founders.genome' shows that, after exclusion of all non-founders, only 1 individual pair with a pihat greater than 0.2 remains in the HapMap data.

# This is likely to be a full sib or DZ twin pair based on the Z values. Noteworthy, they were not given the same family identity (FID) in the HapMap data.

# For each pair of 'related' individuals with a pihat > 0.2, we recommend to remove the individual with the lowest call rate.

plink --bfile HapMap_3_r3_11 --missing

# Use an UNIX text editor (e.g., vi(m) ) to check which individual has the highest call rate in the 'related pair'.

# Generate a list of FID and IID of the individual(s) with a Pihat above 0.2, to check who had the lower call rate of the pair.

# In our dataset the individual 13291  NA07045 had the lower call rate.

vi 0.2_low_call_rate_pihat.txt

I

13291  NA07045

# Press esc on keyboard!

:x

# Press enter on keyboard

# In case of multiple 'related' pairs, the list generated above can be extended using the same method as for our lone 'related' pair.

# Delete the individuals with the lowest call rate in 'related' pairs with a pihat > 0.2

plink --bfile HapMap_3_r3_11 --remove 0.2_low_call_rate_pihat.txt --make-bed --out HapMap_3_r3_12

################################################################################################################################

# CONGRATULATIONS!! You've just succesfully completed the first tutorial! You are now able to conduct a proper genetic QC.

# For the next tutorial, using the script: 2_Main_script_MDS.txt, you need the following files:

# - The bfile HapMap_3_r3_12 (i.e., HapMap_3_r3_12.fam,HapMap_3_r3_12.bed, and HapMap_3_r3_12.bim

# - indepSNP.prune.in

你可能感兴趣的:(GWAS 操作流程1)