2019-07-26 Principal component analysis with smartpca from EIGENSOFT

1. Convert the file format with vcftools or plink

Either of the following commands does this:

vcftools --vcf myfile.vcf --plink --out myfile

plink --vcf myfile.vcf --recode --out myfile

Both produce a .ped file and a .map file.

2. Use convertf, which ships with EIGENSOFT, to convert these files into smartpca input files:

convertf -p transfer.conf

This requires a parameter file, transfer.conf:

##config file

genotypename:    myfile.ped

snpname:        myfile.map

indivname:      myfile.ped

outputformat:    EIGENSTRAT

genotypeoutname: myfile.geno

snpoutname:      myfile.snp

indivoutname:    myfile.ind

familynames: NO

This step should have produced three files: myfile.geno, myfile.snp, and myfile.ind. Instead, it failed.

The error output was as follows:

parameter file: transfer.conf

genotypename: myfile.ped

snpname: myfile.map

indivname: myfile.ped

outputformat: EIGENSTRAT

genotypeoutname: myfile.eigenstratgeno

snpoutname: myfile.snp

indivoutname: myfile.ind

familynames: NO

warning (mapfile): bad chrom: 0 Chr01:1074      0      1074

warning (mapfile): bad chrom: 0 Chr01:1194      0      1194

warning (mapfile): bad chrom: 0 Chr01:1644      0      1644

warning (mapfile): bad chrom: 0 Chr01:1645      0      1645

warning (mapfile): bad chrom: 0 Chr01:1825      0      1825

warning (mapfile): bad chrom: 0 Chr01:2917      0      2917

warning (mapfile): bad chrom: 0 Chr01:3190      0      3190

warning (mapfile): bad chrom: 0 Chr01:3193      0      3193

warning (mapfile): bad chrom: 0 Chr01:3219      0      3219

warning (mapfile): bad chrom: 0 Chr01:3248      0      3248

genetic distance set from physical distance

PLINK input. No check on SNP order

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    0    12

zzz 0 392203

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    0    16

zzz 1 392204

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    0    30

zzz 2 392205

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    0    36

zzz 3 391687

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    0    38

zzz 4 393019

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    0    40

zzz 5 392591

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    0    44

zzz 6 392592

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    1    52

zzz 7 392206

snp order check fail; snp list not ordered: myfile.map (processing continues)  99    1    57

zzz 8 392684

fatalx:

no valid snps

Aborted (core dumped)

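Inspecting the .map file shows that the first column of the file generated by vcftools is 0 on every line, which means the chromosome information is missing.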

less myfile.map

0      Chr01:1074      0      1074

0      Chr01:1194      0      1194

0      Chr01:1645      0      1645

0      Chr01:1825      0      1825

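Column 1 is the chromosome; 0 means unknown.

Column 2 is the SNP name.

Column 3 is the genetic distance in Morgans; 0 means unknown.

Column 4 is the physical position on the chromosome.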

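For the conversion to succeed, the first column of the .map file must be changed to a purely numeric chromosome code greater than 0. Alternatively, generate the .map and .ped files with a recent version of plink where possible.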
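One way to make that change is to derive the chromosome number from the SNP name in column 2. The sketch below assumes every SNP name has the form ChrNN:position, as in the excerpt above. The function name fix_map_chrom and the output file name are hypothetical and are not part of EIGENSOFT, vcftools, or plink.

```shell
# Hypothetical helper: rewrite column 1 of a .map file using the
# chromosome number embedded in SNP names such as "Chr01:1074".
# Usage: fix_map_chrom input.map output.map
fix_map_chrom() {
    awk 'BEGIN { OFS = "\t" }
    {
        chrom = $2
        sub(/:.*/, "", chrom)        # drop everything from the colon on: Chr01:1074 becomes Chr01
        gsub(/[^0-9]/, "", chrom)    # keep only the digits: Chr01 becomes 01
        chrom += 0                   # force a number: 01 becomes 1
        if (chrom < 1) {
            print "no chromosome number in SNP name: " $2 > "/dev/stderr"
            exit 1
        }
        $1 = chrom                   # replace the 0 placeholder
        print
    }' "$1" > "$2"
}
```

After running, for example, fix_map_chrom myfile.map myfile.fixed.map, point snpname in transfer.conf at the new file. The .ped file can stay as it is, because only the chromosome column of the .map file is rewritten.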

3. Run the PCA with smartpca.perl

smartpca.perl -i myfile.geno -a myfile.snp -b myfile.ind -o myfile.PCA -p myfile.plot -e myfile.eval -l myfile.log

4. Reformat the .evec file

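In the generated .evec file, put column names (ID, PC1, PC2, PC3, and so on) on the first line. Then add a group column named species that reflects how your own samples are grouped. The result looks like this: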

ID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 species

1 -0.1483 0.1830 0.0523 -0.1494 0.0892 0.1653 0.0039 0.0159 0.0156 0.0674 A

2 -0.1479 0.1879 0.0535 -0.1410 0.1137 0.1446 -0.0395 0.0298 -0.0603 0.0878 A

3 -0.1544 -0.4085 0.1840 -0.0174 0.0287 0.0416 -0.0450 -0.0013 -0.0594 -0.0693 A

4 -0.1494 0.1231 0.0810 0.0086 -0.1877 -0.3575 0.0804 -0.0122 0.1235 0.1087 A

5 -0.1469 0.1346 0.0778 0.0291 -0.1284 -0.4283 -0.0247 -0.0017 -0.1375 -0.0921 AS

6 -0.1452 -0.0403 -0.3729 0.2533 0.0519 0.0024 -0.0944 -0.0284 -0.2040 -0.0471 AS

7 -0.1449 -0.0407 -0.3708 0.2483 0.0457 -0.0060 -0.0820 -0.0092 -0.1719 -0.1247 AS

8 -0.1461 0.1868 0.0546 -0.1281 0.1268 0.1065 -0.0901 0.0044 -0.1849 -0.2082 AS

9 -0.1469 0.1270 0.0788 0.0298 -0.1532 -0.3933 0.0091 -0.0117 -0.0387 -0.1510 AS
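This reformatting can also be scripted. The sketch below assumes the .evec file begins with a single line starting with "#eigvals:", and that each following line holds a sample ID, the PC coordinates, and a trailing population label. It also assumes you have written a two-column lookup file (sample ID, species) by hand. The function name evec_to_table and the lookup file are hypothetical.

```shell
# Hypothetical helper: turn a smartpca .evec file into a table with a header row.
# Usage: evec_to_table file.evec species.txt > table.txt
#   species.txt: one "sampleID species" pair per line (a file you create yourself)
evec_to_table() {
    awk '
    NR == FNR { species[$1] = $2; next }   # first file: build the ID-to-species lookup
    $1 ~ /^#/ { next }                     # skip the "#eigvals:" line
    {
        if (!header_done) {
            line = "ID"
            for (i = 1; i <= NF - 2; i++) line = line " PC" i
            print line " species"
            header_done = 1
        }
        line = $1
        for (i = 2; i <= NF - 1; i++) line = line " " $i   # PC columns only
        label = ($1 in species) ? species[$1] : "NA"       # NA when the ID is not in the lookup
        print line " " label
    }' "$2" "$1"
}
```

The trailing label that smartpca writes is dropped and replaced by the value from the lookup file, so the output has the same layout as the table above and can be read directly with read.table(..., header = TRUE).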

5. Plot in R

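library(ggplot2)

# Read the reformatted table; its first line holds the column names.
a <- read.table("PCA.evec", header = TRUE)

# Plot each pair of the first three PCs, with color and point shape set by species.
ggplot(a, aes(PC1, PC2, color = species, shape = species)) + geom_point(alpha = 0.8, size = 4)

ggplot(a, aes(PC1, PC3, color = species, shape = species)) + geom_point(alpha = 0.8, size = 4)

ggplot(a, aes(PC2, PC3, color = species, shape = species)) + geom_point(alpha = 0.8, size = 4)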
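To label the axes with the share of variance each PC accounts for, the eigenvalue file written through the -e option can be summarised. This is only a sketch: it assumes that file holds one eigenvalue per line, and pc_variance is a hypothetical helper name.

```shell
# Hypothetical helper: print each PC's share of the eigenvalue total, in percent.
# Usage: pc_variance file.eval
pc_variance() {
    awk 'NF { ev[++n] = $1; total += $1 }   # collect eigenvalues, skipping blank lines
         END {
             if (total <= 0) exit 1         # nothing usable in the file
             for (i = 1; i <= n; i++) printf "PC%d %.2f\n", i, 100 * ev[i] / total
         }' "$1"
}
```

The percentages are relative to the eigenvalues present in the file, so they are only meaningful if the file lists all of them and not just the top few.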
