Population stratification

1.4 Population stratification

PCA analysis: use plink to compute the first 10 principal components with the --pca 10 option.

$ plink --bfile 1kg_hm3_prunedf --pca 10 --out 1kg_pca
# This mainly produces two files: 1kg_pca.eigenval and 1kg_pca.eigenvec.
# The .eigenvec file lists the principal components per sample and can be used in further analysis.
$ head -4 1kg_pca.eigenval
54.1464
40.0338
6.96377
3.375
$ head -4 1kg_pca.eigenvec
0 HG00096 0.0149253 -0.0329941 0.0157409 0.00171199 0.00178966 -0.00704659 -0.00461685 -0.00735375 -0.00169564 0.0100253
0 HG00097 0.0146554 -0.0330726 0.0168457 -0.00070785 -0.000456348 -0.00860046 -0.00610165 -0.00293391 0.00189605 0.00350145
0 HG00099 0.0147324 -0.0333974 0.0160621 0.00243107 0.000503637 -0.00195932 -0.00130626 -0.00384657 0.00205159 -0.000813858
0 HG00100 0.0146498 -0.0329754 0.0158382 -0.00275797 0.00202298 0.00228241 -0.000977904 -0.00151248 0.00244192 0.00711757
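
The eigenvalues above can be turned into relative shares to see how quickly the components fall off. The following is a minimal shell sketch, assuming awk is available; eigenval_share is a hypothetical helper name, not part of plink. The shares are relative to the eigenvalues listed in the file only, so they should not be read as the proportion of total variance in the data.

```shell
# Hypothetical helper: print each eigenvalue's share of the sum of the
# eigenvalues listed in the given file (one eigenvalue per line).
eigenval_share() {
  awk '{ v[NR] = $1; sum += $1 }
       END {
         for (i = 1; i <= NR; i++)
           printf "PC%d\t%.4f\n", i, v[i] / sum
       }' "$1"
}

# Run on the plink output if it exists in the current directory.
if [ -f 1kg_pca.eigenval ]; then
  eigenval_share 1kg_pca.eigenval
fi
```

A steep drop after the first few components, as in the head output above, suggests that PC1 and PC2 carry most of the structure captured by these 10 components.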
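# Panel A: PC1 vs. PC2 for all samples
library(ggplot2)
library(patchwork)  # provides the a / b layout operator used below
columns <- c("fid", "Sample.name", "pca1", "pca2", "pca3",
             "pca4", "pca5", "pca6", "pca7", "pca8", "pca9", "pca10")
# Read the eigenvectors and drop the family ID column (column 1)
pca <- read.table(file = "1kg_pca.eigenvec", sep = "",
                  header = FALSE, col.names = columns)[, 2:12]
a <- ggplot(pca, aes(x = pca1, y = pca2)) + geom_point() +
  theme_bw() + xlab("PC1") + ylab("PC2")

# Panel B: the same plot, colored by superpopulation
# Keep columns 1, 4, 5, 6 and 7 of the sample table
geo <- read.table(file = "1kg_samples.txt",
                  sep = "\t", header = TRUE)[, c(1, 4, 5, 6, 7)]
# Join the sample metadata to the principal components by sample ID
data <- merge(geo, pca, by = "Sample.name")
b <- ggplot(data, aes(x = pca1, y = pca2, col = Superpopulation.name)) +
  geom_point() + theme_bw() +
  xlab("PC1") + ylab("PC2") +
  labs(col = "")
a / b  # stack Panel A above Panel B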

[Figure 1: PC1 vs. PC2 scatter plots. Panel A: all samples; Panel B: samples colored by superpopulation.]
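
Note that merge() keeps only samples present in both tables, so any sample without metadata drops out of Panel B without a warning. A quick check can be sketched in shell, under the assumption that the sample ID is the first tab-separated column of 1kg_samples.txt (the R code selects column 1 and merges on Sample.name, but the file layout itself is not shown here). missing_samples is a hypothetical helper name.

```shell
# Hypothetical helper: list sample IDs found in the .eigenvec file ($1)
# but absent from the tab-separated sample table ($2).
# Assumes the table has a header row and the sample ID in column 1.
missing_samples() {
  awk 'NR == FNR {
         # First file: the sample table. Skip the header, record each ID.
         if (FNR > 1) {
           split($0, f, "\t")
           known[f[1]] = 1
         }
         next
       }
       # Second file: the .eigenvec file, with the sample ID in column 2.
       !($2 in known) { print $2 }' "$2" "$1"
}

# Run on the files used above if both exist in the current directory.
if [ -f 1kg_pca.eigenvec ] && [ -f 1kg_samples.txt ]; then
  missing_samples 1kg_pca.eigenvec 1kg_samples.txt
fi
```

An empty result means every sample in the PCA output has a matching metadata row.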
Reference:
An Introduction to Statistical Genetic Data Analysis.