GWAS 操作流程1

参考github GWAS教学：https://github.com/MareesAT/GWA_tutorial/

总流程分为4部分

1. QC

2. Population Stratification,

3. Association Analyses

4. Polygenic Risk Score (PRS) analyses.

首先是下载数据集以及相关文件


mkdir GWAS_test

cd GWAS_test

git clone https://github.com/MareesAT/GWA_tutorial.git

unzip 1_QC_GWAS.zip

cd 1_QC_GWAS

1. QC

qc的内容可以参考文献

Marees AT, de Kluiver H, Stringer S, et al. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int J Methods Psychiatr Res. 2018;27(2):e1608. doi:10.1002/mpr.1608

QC一般分为7步

(1) snp 缺失

(2) 性别筛选

(3) MAF 频率筛选

(4) 哈代温伯格平衡测试

(5) 杂合率

(6) 相关性

(7) 人口分层

1.1 查看是否有SNP，individual缺失。


plink --bfile HapMap_3_r3_1 --missing 

Rscript --no-save hist_miss.R

截屏2020-05-13 下午12.34.04.png

截屏2020-05-13 下午12.34.17.png

根据SNPs，individuals缺失值设置阈值进行筛选。


# Delete SNPs with missingness >0.2.

plink --bfile HapMap_3_r3_1 --geno 0.2 --make-bed --out HapMap_3_r3_2

# Delete individuals with missingness >0.2.

plink --bfile HapMap_3_r3_2 --mind 0.2 --make-bed --out HapMap_3_r3_3

# Delete SNPs with missingness >0.02.

plink --bfile HapMap_3_r3_3 --geno 0.02 --make-bed --out HapMap_3_r3_4

# Delete individuals with missingness >0.02.

plink --bfile HapMap_3_r3_4 --mind 0.02 --make-bed --out HapMap_3_r3_5

1.2 性别差异检验

根据男性和女性在X染色体上SNPs分布频率进行性别信息检验，女性F值<0.2；男性F值>0.8将不符合的样本标记为“PROBLEM”.


plink --bfile HapMap_3_r3_5 --check-sex

Rscript --no-save gender_check.R

截屏2020-05-13 下午12.45.26.png

截屏2020-05-13 下午12.45.33.png

从结果看，女性样本中有一个F值>0.2.

处理性别异常个体有两个方法：删除或根据F值校正。这里选择删除的方式。

删除


grep "PROBLEM" plink.sexcheck| awk '{print$1,$2}'> sex_discrepancy.txt

# This command generates a list of individuals with the status ìPROBLEMî.

plink --bfile HapMap_3_r3_5 --remove sex_discrepancy.txt --make-bed --out HapMap_3_r3_6

# This command removes the list of individuals with the status ìPROBLEMî.

校正


plink --bfile HapMap_3_r3_5 --impute-sex --make-bed --out HapMap_3_r3_6

# This imputes the sex based on the genotype information into your data set.

1.3 生成仅包含常染色体且去除low MAF的bfile


# Select autosomal SNPs only (i.e., from chromosomes 1 to 22).

awk '{ if ($1 >= 1 && $1 <= 22) print $2 }' HapMap_3_r3_6.bim > snp_1_22.txt

plink --bfile HapMap_3_r3_6 --extract snp_1_22.txt --make-bed --out HapMap_3_r3_7

# Generate a plot of the MAF distribution.

plink --bfile HapMap_3_r3_7 --freq --out MAF_check

Rscript --no-save MAF_check.R

截屏2020-05-13 下午12.58.13.png


# Remove SNPs with a low MAF frequency.

plink --bfile HapMap_3_r3_7 --maf 0.05 --make-bed --out HapMap_3_r3_8

# 1073226 SNPs are left

# A conventional MAF threshold for a regular GWAS is between 0.01 or 0.05, depending on sample size.

1.4 去除不符合哈代温伯格平衡（HWE）的SNPs


# Check the distribution of HWE p-values of all SNPs.

plink --bfile HapMap_3_r3_8 --hardy

# Selecting SNPs with HWE p-value below 0.00001, required for one of the two plot generated by the next Rscript, allows to zoom in on strongly deviating SNPs.

awk '{ if ($9 <0.00001) print $0 }' plink.hwe>plinkzoomhwe.hwe

Rscript --no-save hwe.R

截屏2020-05-13 下午1.06.09.png

截屏2020-05-13 下午1.06.17.png


# By default the --hwe option in plink only filters for controls.

# Therefore, we use two steps, first we use a stringent HWE threshold for controls, followed by a less stringent threshold for the case data.

plink --bfile HapMap_3_r3_8 --hwe 1e-6 --make-bed --out HapMap_hwe_filter_step1

# The HWE threshold for the cases filters out only SNPs which deviate extremely from HWE.

# This second HWE step only focusses on cases because in the controls all SNPs with a HWE p-value < hwe 1e-6 were already removed

plink --bfile HapMap_hwe_filter_step1 --hwe 1e-10 --hwe-all --make-bed --out HapMap_3_r3_9

# Theoretical background for this step is given in our accompanying article: https://www.ncbi.nlm.nih.gov/pubmed/29484742 .

1.5 计算杂合率

进行杂合率检验，删除不在mean±3SD范围内的样本


# Checks for heterozygosity are performed on a set of SNPs which are not highly correlated.

# Therefore, to generate a list of non-(highly)correlated SNPs, we exclude high inversion regions (inversion.txt [High LD regions]) and prune the SNPs using the command --indep-pairwiseí.

# The parameters ë50 5 0.2í stand respectively for: the window size, the number of SNPs to shift the window at each step, and the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously.

plink --bfile HapMap_3_r3_9 --exclude inversion.txt --range --indep-pairwise 50 5 0.2 --out indepSNP

# Note, don't delete the file indepSNP.prune.in, we will use this file in later steps of the tutorial.

plink --bfile HapMap_3_r3_9 --extract indepSNP.prune.in --het --out R_check

# This file contains your pruned data set.

# Plot of the heterozygosity rate distribution

Rscript --no-save check_heterozygosity_rate.R

截屏2020-05-13 下午1.17.28.png


# The following code generates a list of individuals who deviate more than 3 standard deviations from the heterozygosity rate mean.

# For data manipulation we recommend using UNIX. However, when performing statistical calculations R might be more convenient, hence the use of the Rscript for this step:

Rscript --no-save heterozygosity_outliers_list.R

# Output of the command above: fail-het-qc.txt .

# When using our example data/the HapMap data this list contains 2 individuals (i.e., two individuals have a heterozygosity rate deviating more than 3 SD's from the mean).

# Adapt this file to make it compatible for PLINK, by removing all quotation marks from the file and selecting only the first two columns.

sed 's/"// g' fail-het-qc.txt | awk '{print$1, $2}'> het_fail_ind.txt

# Remove heterozygosity rate outliers.

plink --bfile HapMap_3_r3_9 --remove het_fail_ind.txt --make-bed --out HapMap_3_r3_10

1.6相关性检验

Assuming a random population sample we are going to exclude all individuals above the pihat threshold of 0.2 in this tutorial.


# Check for relationships between individuals with a pihat > 0.2.

plink --bfile HapMap_3_r3_10 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2

# The HapMap dataset is known to contain parent-offspring relations.

# The following commands will visualize specifically these parent-offspring relations, using the z values.

awk '{ if ($8 >0.9) print $0 }' pihat_min0.2.genome>zoom_pihat.genome

# Generate a plot to assess the type of relationship.

Rscript --no-save Relatedness.R

截屏2020-05-13 下午1.30.11.png


# The generated plots show a considerable amount of related individuals (explentation plot; PO = parent-offspring, UN = unrelated individuals) in the Hapmap data, this is expected since the dataset was constructed as such.

# Normally, family based data should be analyzed using specific family based methods. In this tutorial, for demonstrative purposes, we treat the relatedness as cryptic relatedness in a random population sample.

# In this tutorial, we aim to remove all 'relatedness' from our dataset.

# To demonstrate that the majority of the relatedness was due to parent-offspring we only include founders (individuals without parents in the dataset).

plink --bfile HapMap_3_r3_10 --filter-founders --make-bed --out HapMap_3_r3_11

# Now we will look again for individuals with a pihat >0.2.

plink --bfile HapMap_3_r3_11 --extract indepSNP.prune.in --genome --min 0.2 --out pihat_min0.2_in_founders

# The file 'pihat_min0.2_in_founders.genome' shows that, after exclusion of all non-founders, only 1 individual pair with a pihat greater than 0.2 remains in the HapMap data.

# This is likely to be a full sib or DZ twin pair based on the Z values. Noteworthy, they were not given the same family identity (FID) in the HapMap data.

# For each pair of 'related' individuals with a pihat > 0.2, we recommend to remove the individual with the lowest call rate.

plink --bfile HapMap_3_r3_11 --missing

# Use an UNIX text editor (e.g., vi(m) ) to check which individual has the highest call rate in the 'related pair'.

# Generate a list of FID and IID of the individual(s) with a Pihat above 0.2, to check who had the lower call rate of the pair.

# In our dataset the individual 13291  NA07045 had the lower call rate.

vi 0.2_low_call_rate_pihat.txt

I

13291  NA07045

# Press esc on keyboard!

:x

# Press enter on keyboard

# In case of multiple 'related' pairs, the list generated above can be extended using the same method as for our lone 'related' pair.

# Delete the individuals with the lowest call rate in 'related' pairs with a pihat > 0.2

plink --bfile HapMap_3_r3_11 --remove 0.2_low_call_rate_pihat.txt --make-bed --out HapMap_3_r3_12

################################################################################################################################

# CONGRATULATIONS!! You've just succesfully completed the first tutorial! You are now able to conduct a proper genetic QC.

# For the next tutorial, using the script: 2_Main_script_MDS.txt, you need the following files:

# - The bfile HapMap_3_r3_12 (i.e., HapMap_3_r3_12.fam,HapMap_3_r3_12.bed, and HapMap_3_r3_12.bim

# - indepSNP.prune.in

GWAS 操作流程1

总流程分为4部分

1. QC

2. Population Stratification,

3. Association Analyses

4. Polygenic Risk Score (PRS) analyses.

1. QC

1.1 查看是否有SNP，individual缺失。

1.2 性别差异检验

1.3 生成仅包含常染色体且去除low MAF的bfile

1.4 去除不符合哈代温伯格平衡（HWE）的SNPs

1.5 计算杂合率

1.6相关性检验

你可能感兴趣的:(GWAS 操作流程1)