1. Convert the file format with vcftools or plink
vcftools --vcf myfile.vcf --plink --out myfile
plink --vcf myfile.vcf --recode --out myfile
Either command generates a .ped file and a .map file.
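A quick sanity check on the converted files can catch format problems before moving on. The sketch below assumes the standard PLINK text layout: six leading columns in each .ped row followed by two allele columns per marker, and one marker per .map line. The function name check_ped_map and the in-memory example lines are illustrative only and are not part of vcftools or plink.

```python
def check_ped_map(ped_lines, map_lines):
    """Return (n_samples, n_snps) if every .ped row matches the .map marker count.

    Assumes whitespace-delimited PLINK text files: 6 leading columns per .ped
    row, then 2 allele columns for each marker listed in the .map file.
    """
    n_snps = sum(1 for line in map_lines if line.strip())
    expected = 6 + 2 * n_snps
    n_samples = 0
    for row_number, line in enumerate(ped_lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        n_fields = len(line.split())
        if n_fields != expected:
            raise ValueError(
                f".ped row {row_number}: {n_fields} fields, expected {expected}"
            )
        n_samples += 1
    return n_samples, n_snps


# Tiny made-up example: 2 samples, 2 markers.
example_map = ["1 snp1 0 100", "1 snp2 0 200"]
example_ped = [
    "s1 s1 0 0 0 -9 A A G T",
    "s2 s2 0 0 0 -9 A C G G",
]
print(check_ped_map(example_ped, example_map))
```

On real data the same function can be given open file handles, for example check_ped_map(open("myfile.ped"), open("myfile.map")).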
2. Use convertf, which ships with EIGENSOFT, to convert the files into smartpca input:
convertf -p transfer.conf
This requires a parameter file, transfer.conf:
##config file
genotypename: myfile.ped
snpname: myfile.map
indivname: myfile.ped
outputformat: EIGENSTRAT
genotypeoutname: myfile.geno
snpoutname: myfile.snp
indivoutname: myfile.ind
familynames: NO
This step should have generated three files (myfile.geno, myfile.snp, and myfile.ind), but it failed with an error.
The error output was:
parameter file: transfer.conf
genotypename: myfile.ped
snpname: myfile.map
indivname: myfile.ped
outputformat: EIGENSTRAT
genotypeoutname: myfile.eigenstratgeno
snpoutname: myfile.snp
indivoutname: myfile.ind
familynames: NO
warning (mapfile): bad chrom: 0 Chr01:1074 0 1074
warning (mapfile): bad chrom: 0 Chr01:1194 0 1194
warning (mapfile): bad chrom: 0 Chr01:1644 0 1644
warning (mapfile): bad chrom: 0 Chr01:1645 0 1645
warning (mapfile): bad chrom: 0 Chr01:1825 0 1825
warning (mapfile): bad chrom: 0 Chr01:2917 0 2917
warning (mapfile): bad chrom: 0 Chr01:3190 0 3190
warning (mapfile): bad chrom: 0 Chr01:3193 0 3193
warning (mapfile): bad chrom: 0 Chr01:3219 0 3219
warning (mapfile): bad chrom: 0 Chr01:3248 0 3248
genetic distance set from physical distance
PLINK input. No check on SNP order
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 0 12
zzz 0 392203
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 0 16
zzz 1 392204
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 0 30
zzz 2 392205
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 0 36
zzz 3 391687
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 0 38
zzz 4 393019
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 0 40
zzz 5 392591
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 0 44
zzz 6 392592
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 1 52
zzz 7 392206
snp order check fail; snp list not ordered: myfile.map (processing continues) 99 1 57
zzz 8 392684
fatalx:
no valid snps
Aborted (core dumped)
Inspecting the map file shows that the first column of the map file generated by vcftools is 0, which means the chromosome number is missing.
less myfile.map
0 Chr01:1074 0 1074
0 Chr01:1194 0 1194
0 Chr01:1645 0 1645
0 Chr01:1825 0 1825
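Column 1: chromosome, 0 if unknown
Column 2: SNP name
Column 3: genetic distance (in morgans), 0 if unknown
Column 4: coordinate on the chromosome
For the conversion to succeed, the first column of the map file must be changed to a purely numeric chromosome ID greater than 0. Alternatively, generate the map and ped files with a recent version of plink where possible.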
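The map file edit can be scripted. A minimal sketch, assuming the SNP IDs in column 2 have the form ChrNN:position, as in the map lines shown above, so that the chromosome number can be recovered from the ID. The names fix_map_line and fix_map_file, and the output filename used in the note below, are hypothetical.

```python
import re

# Assumption: SNP IDs look like "Chr01:1074" (letters, digits, colon, position).
SNP_ID_PATTERN = re.compile(r"^[A-Za-z]*0*(\d+):\d+$")


def fix_map_line(line):
    """Replace a chromosome code of 0 with the number embedded in the SNP ID."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    chrom, snp_id, genetic_dist, position = fields
    if chrom == "0":
        match = SNP_ID_PATTERN.match(snp_id)
        if match is None or int(match.group(1)) == 0:
            raise ValueError(f"cannot infer chromosome from SNP ID {snp_id!r}")
        chrom = match.group(1)
    return "\t".join([chrom, snp_id, genetic_dist, position])


def fix_map_file(in_path, out_path):
    """Write a copy of the .map file with chromosome codes filled in."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(fix_map_line(line) + "\n")


print(fix_map_line("0 Chr01:1074 0 1074"))
```

Calling fix_map_file("myfile.map", "myfile.fixed.map") would write a corrected copy, after which snpname in transfer.conf can point at the new file. Only the first column changes, so the .ped file stays in sync.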
3. Run the PCA with smartpca.perl
smartpca.perl -i myfile.geno -a myfile.snp -b myfile.ind -o myfile.PCA -p myfile.plot -e myfile.eval -l myfile.log
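4. Edit the .evec file format
In the generated .evec file, put the column names (ID, PC1, PC2, PC3, and so on) on the first line, then add a group column, here named species, that reflects the populations in your own data. The format is as follows: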
ID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 species
1 -0.1483 0.1830 0.0523 -0.1494 0.0892 0.1653 0.0039 0.0159 0.0156 0.0674 A
2 -0.1479 0.1879 0.0535 -0.1410 0.1137 0.1446 -0.0395 0.0298 -0.0603 0.0878 A
3 -0.1544 -0.4085 0.1840 -0.0174 0.0287 0.0416 -0.0450 -0.0013 -0.0594 -0.0693 A
4 -0.1494 0.1231 0.0810 0.0086 -0.1877 -0.3575 0.0804 -0.0122 0.1235 0.1087 A
5 -0.1469 0.1346 0.0778 0.0291 -0.1284 -0.4283 -0.0247 -0.0017 -0.1375 -0.0921 AS
6 -0.1452 -0.0403 -0.3729 0.2533 0.0519 0.0024 -0.0944 -0.0284 -0.2040 -0.0471 AS
7 -0.1449 -0.0407 -0.3708 0.2483 0.0457 -0.0060 -0.0820 -0.0092 -0.1719 -0.1247 AS
8 -0.1461 0.1868 0.0546 -0.1281 0.1268 0.1065 -0.0901 0.0044 -0.1849 -0.2082 AS
9 -0.1469 0.1270 0.0788 0.0298 -0.1532 -0.3933 0.0091 -0.0117 -0.0387 -0.1510 AS
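The header edit can also be scripted rather than done by hand. A minimal sketch, assuming the .evec file written by smartpca starts with a comment line beginning with # and that each following row holds the sample ID, the PC values, and a trailing population label. The function name evec_to_table and the species_by_id mapping are hypothetical; the mapping has to be filled in from your own sample sheet.

```python
def evec_to_table(evec_lines, species_by_id):
    """Turn smartpca .evec rows into a headed, space-delimited table.

    Assumes: lines starting with '#' are comments (the eigenvalue line), and
    each data row is: sample ID, PC values..., population label (dropped).
    """
    rows = []
    for line in evec_lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        sample_id, pcs = fields[0], fields[1:-1]
        rows.append([sample_id, *pcs, species_by_id[sample_id]])
    if not rows:
        raise ValueError("no data rows found")
    n_pcs = len(rows[0]) - 2
    header = ["ID"] + [f"PC{i}" for i in range(1, n_pcs + 1)] + ["species"]
    return [" ".join(header)] + [" ".join(row) for row in rows]


# Made-up two-sample example with two PCs and a placeholder label column.
example_evec = [
    "#eigvals: 3.2 1.1",
    "  1  -0.1483  0.1830  ???",
    "  2  -0.1479  0.1879  ???",
]
for out_line in evec_to_table(example_evec, {"1": "A", "2": "AS"}):
    print(out_line)
```

The returned lines can be written to a new file and read in R with header=TRUE.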
5. Plot with R
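library(ggplot2)
# Read the edited .evec table from step 4; the header row supplies the column names
a <- read.table("PCA.evec", header = TRUE)
# Scatter plots of each pair of the first three PCs, colored and shaped by species
ggplot(a, aes(PC1, PC2, color = species, shape = species)) + geom_point(alpha = 0.8, size = 4)
ggplot(a, aes(PC1, PC3, color = species, shape = species)) + geom_point(alpha = 0.8, size = 4)
ggplot(a, aes(PC2, PC3, color = species, shape = species)) + geom_point(alpha = 0.8, size = 4)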