PCA原理解读和绘制方法

⚠️该教程的PCA绘图基于ggplot2，可以根据ggplot2语法对图片进行额外的修改和保存。

1. 基础

PCA：全称Principal Component Methods，也就是主成分分析。

主成分分析是一种通过协方差分析来对数据进行降维处理的统计方法。首先利用线性变换，将数据变换到一个新的坐标系统中；然后再利用降维的思想，使得任何数据投影的第一大方差在第一个坐标(称为第一主成分)上，第二大方差在第二个坐标(第二主成分)上。这种降维的思想首先减少数据集的维数，同时还保持数据集的对方差贡献最大的特征，最终使数据直观呈现在二维坐标系。主要适用于以下情况：已获得一定数目的变量的观测值，并希望能够构造出少数几个综合变量，在最大程度上反映原始数据的消息。

主要目的：
发现数据的内在模式
对高维数据进行降维，去除数据的噪音和冗余
识别相关变量

2. 计算

2.1 R包

很多R包中的functions可以用来计算PCA：
· prcomp() 和 princomp() [built-in R stats package],
· PCA() [FactoMineR package],
· dudi.pca() [ade4 package],
· epPCA() [ExPosition package]

在这个教程里主要是用两个R包：FactoMineR (for the analysis) 和factoextra (for ggplot2-based visualization)。

安装和加载

install.packages(c("FactoMineR", "factoextra"))
library("FactoMineR")
library("factoextra")

2.2 数据格式

载入演示数据集

data(decathlon2)
View(decathlon2)

在这个表中，1:23行是Active individuals，也就是主成分分析需要是用的Individuals；24:27行是Supplementary individuals；1:10列是Active variables，也就是主成分分析需要是用的variables；11:12行是Supplementary variables。

把主成分分析所需要用的Active individuals和Active variables提取出来

decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6], 4)

2.3 数据标准化

在主成分分析中，变量通常需要scaled(比如standardized)，尤其当变量是使用不同的尺度的单位来衡量的时候。
标准化的目标是使变量之间具有可比性。通常scale后的数据要求i)标准差为1 ii)均值为0。
在基因表达数据的分析中，数据标准化是在PCA和聚类前的一个常用方法。当变量的均值和/或标准差差异非常大的时候，我们通常需要去做scale。
当对变量进行scale的时候，数据可以做这样的变换：

R的基础功能scale()可以被用于对数据进行标准化。这个函数输入的是一个numeric matrix，并且会对列进行标准化。

注意：在使用函数PCA() [in FactoMineR]的时候，数据会自动进行标准化，无需额外标准化处理。

2.4 R code

函数PCA()[in FactoMineR]的用法如下：

PCA(X, scale.unit = TRUE, ncp = 5, ind.sup = NULL, 
    quanti.sup = NULL, quali.sup = NULL,  graph = TRUE)

X: a data frame. Rows are individuals and columns are numeric variables

scale.unit: a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to the same scale avoids some variables to become dominant just because of their large measurement units. It makes variable comparable.

ncp: number of dimensions kept in the final results.

graph: a logical value. If TRUE a graph is displayed.

ind.sup : 指定Supplementary individuals所在的列

quanti.sup, quali.sup : 指定定性和定量观测值所在的列

实操：

library("FactoMineR")
res.pca <- PCA(decathlon2.active, graph = FALSE)

PCA()返回的是一个list，包含"$eig"， "$var"，"$ind"和"$call"四类数据。四类数据的取法将在下一个部分介绍。

print(res.pca)
# **Results for the Principal Component Analysis (PCA)**
# The analysis was performed on 23 individuals, described by 10 variables
# *The results are available in the following objects:

#    name               description                          
# 1  "$eig"             "eigenvalues"                        
# 2  "$var"             "results for the variables"          
# 3  "$var$coord"       "coord. for the variables"           
# 4  "$var$cor"         "correlations variables - dimensions"
# 5  "$var$cos2"        "cos2 for the variables"             
# 6  "$var$contrib"     "contributions of the variables"     
# 7  "$ind"             "results for the individuals"        
# 8  "$ind$coord"       "coord. for the individuals"         
# 9  "$ind$cos2"        "cos2 for the individuals"           
# 10 "$ind$contrib"     "contributions of the individuals"   
# 11 "$call"            "summary statistics"                 
# 12 "$call$centre"     "mean of the variables"              
# 13 "$call$ecart.type" "standard error of the variables"    
# 14 "$call$row.w"      "weights for the individuals"        
# 15 "$call$col.w"      "weights for the variables"

3. Visualization and Interpretation

3.1 Eigenvalues(特征值) / Variances ("`$eig`")

定义：参考主成分分析PCA以及特征值和特征向量的意义

PC1这个轴是原点到每个变量在轴上投影的距离平方和最大时的轴，每个变量在轴上的投影点到原点距离平方和就是这个PC的特征值。Eigenvalue for PC1开方=Singular value for PC1（图片出自StatQuest: Principal Component Analysis (PCA), Step-by-Step）

每个PC的Variation是特征值/n-1。该PC的Variation/所有PC的Variation的和=variance.percent

The eigenvalues measure the amount of variation retained by each principal component. 靠前的PCs的Eigenvalues比较大，后面的PCs的Eigenvalues比较小。That is, the first PCs corresponds to the directions with the maximum amount of variation in the data set.

#调取Eigenvalues信息
library("factoextra")
eig.val <- get_eigenvalue(res.pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1       4.124            41.24                        41.2
## Dim.2       1.839            18.39                        59.6
## Dim.3       1.239            12.39                        72.0
## Dim.4       0.819             8.19                        80.2
## Dim.5       0.702             7.02                        87.2
## Dim.6       0.423             4.23                        91.5
## Dim.7       0.303             3.03                        94.5
## Dim.8       0.274             2.74                        97.2
## Dim.9       0.155             1.55                        98.8
## Dim.10      0.122             1.22                       100.0

Eigenvalues的总和是10(total variance)。The proportion of variation explained by each eigenvalue is given in the second column.

Eigenvalues可用于决定我们后续分析是用多少个PC数（Kaiser 1961）。

但不幸的是，没有公认的客观的方法去决定多少PC是足够的。不同研究领域和不同数据集的情况是不同的。 In practice, we tend to look at the first few principal components in order to find interesting patterns in the data.

在上面的数据中，前三个PC可以解释72%的variation。这个数值是可以接受的。
另一个替代的方法来决定要使用多少PC是查看碎石图Scree Plot（fviz_eig() or fviz_screeplot() [factoextra package]）。

fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))

87%的信息(variances)被前五个PC保留

3.2 Graph of variables ("`$var`")

3.2.1 结果查看
我们可以使用get_pca_var() [factoextra package]取出variables的结果。

var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

get_pca_var()的结果包含如下变量（可以被用于 plot）
· var$coord: 绘制scatter plot的变量坐标
· var$cos2: 是var$coord值的平方。represents the quality of representation for variables on the factor map.
· var$contrib: 是cos2/所有cos2的和再乘100（因为单位是%）。contains the contributions (in percentage) of the variables to the principal components.

# Coordinates
head(var$coord)
#                   Dim.1       Dim.2      Dim.3       Dim.4      Dim.5
# X100m        -0.8506257 -0.17939806  0.3015564  0.03357320 -0.1944440
# Long.jump     0.7941806  0.28085695 -0.1905465 -0.11538956  0.2331567
# Shot.put      0.7339127  0.08540412  0.5175978  0.12846837 -0.2488129
# High.jump     0.6100840 -0.46521415  0.3300852  0.14455012  0.4027002
# X400m        -0.7016034  0.29017826  0.2835329  0.43082552  0.1039085
# X110m.hurdle -0.7641252 -0.02474081  0.4488873 -0.01689589  0.2242200

# Cos2: quality on the factore map
head(var$cos2)
#                  Dim.1        Dim.2      Dim.3        Dim.4      Dim.5
# X100m        0.7235641 0.0321836641 0.09093628 0.0011271597 0.03780845
# Long.jump    0.6307229 0.0788806285 0.03630798 0.0133147506 0.05436203
# Shot.put     0.5386279 0.0072938636 0.26790749 0.0165041211 0.06190783
# High.jump    0.3722025 0.2164242070 0.10895622 0.0208947375 0.16216747
# X400m        0.4922473 0.0842034209 0.08039091 0.1856106269 0.01079698
# X110m.hurdle 0.5838873 0.0006121077 0.20149984 0.0002854712 0.05027463

# Contributions to the principal components
head(var$contrib)
#                  Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
# X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
# Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
# Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
# High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
# X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
# X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193

3.2.2 Correlation circle
变量和主成分(PC)之间的correlation被用于每个PC上变量的坐标。
The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: The observations are represented by their projections, but the variables are represented by their correlations (Abdi and Williams 2010).

# Coordinates of variables
head(var$coord, 4)
##            Dim.1   Dim.2  Dim.3   Dim.4  Dim.5
## X100m     -0.851 -0.1794  0.302  0.0336 -0.194
## Long.jump  0.794  0.2809 -0.191 -0.1154  0.233
## Shot.put   0.734  0.0854  0.518  0.1285 -0.249
## High.jump  0.610 -0.4652  0.330  0.1446  0.403

fviz_pca_var(res.pca, col.var = "black")

这个图就是variable correlation plots，它展示出所有变量间的关系。它的解读如下：
· 正相关的变量会聚在一起
· 负相关的变量会处于相反的方向，位于相反的象限。
· 变量和原点之间的距离衡量factor map上变量的质量。离原点越远的变量在factor map上被展示的更好。

3.2.3 Quality of representation

head(var$cos2, 4)
##           Dim.1   Dim.2  Dim.3   Dim.4  Dim.5
## X100m     0.724 0.03218 0.0909 0.00113 0.0378
## Long.jump 0.631 0.07888 0.0363 0.01331 0.0544
## Shot.put  0.539 0.00729 0.2679 0.01650 0.0619
## High.jump 0.372 0.21642 0.1090 0.02089 0.1622

可以使用corrplot包可视化

library("corrplot")
corrplot(var$cos2, is.corr=FALSE)

也可以使用fviz_cos2()[in factoextra]来画条图

# Total cos2 of variables on Dim.1 and Dim.2
fviz_cos2(res.pca, choice = "var", axes = 1:2)

对于给定的变量，所有PCs的cos2之和=1。如果一个变量可以完美的被两个主成分所代表(如Dim.1 & Dim.2)，这两个PCs的cos2之和=1。这样的话这个变量就会被画在variable factor map的圈上。对于一些变量来说，想要比较好的代表数据，需要超过两个PCs，这样的话变量就会被画在圈内。

# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE # Avoid text overlapping
             )

3.2.4 Contributions of variables to PCs

library("corrplot")
corrplot(var$contrib, is.corr=FALSE)

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

红线是expected average contribution。如果变量的贡献是一致的，那么贡献值应该是1/length(variables) = 1/10 = 10%。对于给定的PC，贡献度超过红线的variable 被认为对这个PC贡献较大。

fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)

变量对某个PC的贡献度可以通过颜色深浅展示出来

fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )

根据贡献度改变透明度

fviz_pca_var(res.pca, alpha.var = "contrib")

自定义颜色

# Create a random continuous variable of length 10
set.seed(123)
my.cont.var <- rnorm(10)
# Color variables by the continuous variable
fviz_pca_var(res.pca, col.var = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")

根据分组绘制颜色

# Create a grouping variable using kmeans
# Create 3 groups of variables (centers = 3)
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)
# Color variables by groups
fviz_pca_var(res.pca, col.var = grp, 
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             legend.title = "Cluster")

3.3 Dimension description

res.desc <- dimdesc(res.pca, axes = c(1,2), proba = 0.05)
# Description of dimension 1
res.desc$Dim.1
## $quanti
##              correlation  p.value
## Long.jump          0.794 6.06e-06
## Discus             0.743 4.84e-05
## Shot.put           0.734 6.72e-05
## High.jump          0.610 1.99e-03
## Javeline           0.428 4.15e-02
## X400m             -0.702 1.91e-04
## X110m.hurdle      -0.764 2.20e-05
## X100m             -0.851 2.73e-07

# Description of dimension 2
res.desc$Dim.2
## $quanti
##            correlation  p.value
## Pole.vault       0.807 3.21e-06
## X1500m           0.784 9.38e-06
## High.jump       -0.465 2.53e-02

## $quanti means results for quantitative variables. Note that, variables are sorted by the p-value of the correlation.

3.4 Graph of individuals ("`$ind`")

3.4.1 Results

ind <- get_pca_ind(res.pca)
ind
## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"

To get access to the different components, use this:

# Coordinates of individuals
head(ind$coord)
# Quality of individuals
head(ind$cos2)
# Contributions of individuals
head(ind$contrib)

3.4.2 Plots: quality and contribution

# fviz_pca_ind(res.pca)
fviz_pca_ind(res.pca, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

相似的individuals会被聚在一起

改变点的大小

fviz_pca_ind(res.pca, pointsize = "cos2", 
             pointshape = 21, fill = "#E7B800",
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

#同时改变点的大小和颜色by cos2
fviz_pca_ind(res.pca, col.ind = "cos2", pointsize = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

画条图

fviz_cos2(res.pca, choice = "ind")
# Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2)

3.4.3 Color by a custom continuous variable

# Create a random continuous variable of length 23,
# Same length as the number of active individuals in the PCA
set.seed(123)
my.cont.var <- rnorm(23)
# Color individuals by the continuous variable
fviz_pca_ind(res.pca, col.ind = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")

3.4.4 Color by groups

head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris[,-5], graph = FALSE)

fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (nbut not "text")
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Groups"
             )

# Add confidence ellipses
fviz_pca_ind(iris.pca, geom.ind = "point", col.ind = iris$Species, 
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             legend.title = "Groups"
             )

3.5 Graph customization

更多图形个性化绘制

3.5.1 Dimensions
默认情况下， variables/individuals代表的是PC1和PC2，如果想看其他PC比如PC2和PC3也是可以的。

# Variables on dimensions 2 and 3
fviz_pca_var(res.pca, axes = c(2, 3))
# Individuals on dimensions 2 and 3
fviz_pca_ind(res.pca, axes = c(2, 3))

3.5.2 Plot elements: point, text, arrow

geom.var参数：
· geom.var = "point", to show only points;
· geom.var = "text" to show only text labels;
· geom.var = c("point", "text") to show both points and text labels
· geom.var = c("arrow", "text") to show arrows and labels (default).

geom.ind参数：
· geom.ind = "point", to show only points;
· geom.ind = "text" to show only text labels;
· geom.ind = c("point", "text") to show both point and text labels (default)

# Show variable points and text labels
fviz_pca_var(res.pca, geom.var = c("point", "text"))
# Show individuals text labels only
fviz_pca_ind(res.pca, geom.ind =  "text")

3.5.3 Size and shape of plot elements
· labelsize: font size for the text labels, e.g.: labelsize = 4.
· pointsize: the size of points, e.g.: pointsize = 1.5.
· arrowsize: the size of arrows. Controls the thickness of arrows, e.g.: arrowsize = 0.5.
· pointshape: the shape of points, pointshape = 21. Type ggpubr::show_point_shapes() to see available point shapes.

# Change the size of arrows an labels
fviz_pca_var(res.pca, arrowsize = 1, labelsize = 5, 
             repel = TRUE)
# Change points size, shape and fill color
# Change labelsize
fviz_pca_ind(res.pca, 
             pointsize = 3, pointshape = 21, fill = "lightblue",
             labelsize = 5, repel = TRUE)

3.5.4 Ellipses
添加Ellipses：addEllipses = TRUE
ellipse.type可以设置为"convex"，"confidence"，"t"，"norm"，"euclid"
ellipse.level：设置椭圆的集中度。如设置ellipse.level = 0.95 or ellipse.level = 0.66。
3.5.5 Group mean points

fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (but not "text")
             group.ind = iris$Species, # color by groups
             legend.title = "Groups",
             mean.point = FALSE)

3.5.6 Axis lines

fviz_pca_var(res.pca, axes.linetype = "blank")

3.5.7 Graphical parameters
使用ggpar()[ggpubr package]函数进行更精细的调整

ind.p <- fviz_pca_ind(iris.pca, geom = "point", col.ind = iris$Species)
ggpubr::ggpar(ind.p,
              title = "Principal Component Analysis",
              subtitle = "Iris data set",
              caption = "Source: factoextra",
              xlab = "PC1", ylab = "PC2",
              legend.title = "Species", legend.position = "top",
              ggtheme = theme_gray(), palette = "jco"
              )

3.6 Biplot双标图

绘制一张简单的双标图

fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )

fviz_pca_biplot(iris.pca, 
                col.ind = iris$Species, palette = "jco", 
                addEllipses = TRUE, label = "var",
                col.var = "black", repel = TRUE,
                legend.title = "Species")

fviz_pca_biplot(iris.pca, 
                # Fill individuals by groups
                geom.ind = "point",
                pointshape = 21,
                pointsize = 2.5,
                fill.ind = iris$Species,
                col.ind = "black",
                # Color variable by groups
                col.var = factor(c("sepal", "sepal", "petal", "petal")),
                
                legend.title = list(fill = "Species", color = "Clusters"),
                repel = TRUE        # Avoid label overplotting
             )+
  ggpubr::fill_palette("jco")+      # Indiviual fill color
  ggpubr::color_palette("npg")      # Variable colors

fviz_pca_biplot(iris.pca, 
                # Individuals
                geom.ind = "point",
                fill.ind = iris$Species, col.ind = "black",
                pointshape = 21, pointsize = 2,
                palette = "jco",
                addEllipses = TRUE,
                # Variables
                alpha.var ="contrib", col.var = "contrib",
                gradient.cols = "RdYlBu",
                
                legend.title = list(fill = "Species", color = "Contrib",
                                    alpha = "Contrib")
                )

4. Filtering results

# Visualize variable with cos2 >= 0.6
fviz_pca_var(res.pca, select.var = list(cos2 = 0.6))
# Top 5 active variables with the highest cos2
fviz_pca_var(res.pca, select.var= list(cos2 = 5))
# Select by names
name <- list(name = c("Long.jump", "High.jump", "X100m"))
fviz_pca_var(res.pca, select.var = name)
# top 5 contributing individuals and variable
fviz_pca_biplot(res.pca, select.ind = list(contrib = 5), 
               select.var = list(contrib = 5),
               ggtheme = theme_minimal())

5. Exporting results

5.1 Export plots to PDF/PNG files

# Print the plot to a pdf file
pdf("myplot.pdf")
print(myplot)
dev.off()

# Scree plot
scree.plot <- fviz_eig(res.pca)
# Plot of individuals
ind.plot <- fviz_pca_ind(res.pca)
# Plot of variables
var.plot <- fviz_pca_var(res.pca)
pdf("PCA.pdf") # Create a new pdf device
print(scree.plot)
print(ind.plot)
print(var.plot)
dev.off() # Close the pdf device

# Print scree plot to a png file
png("pca-scree-plot.png")
print(scree.plot)
dev.off()
# Print individuals plot to a png file
png("pca-variables.png")
print(var.plot)
dev.off()
# Print variables plot to a png file
png("pca-individuals.png")
print(ind.plot)
dev.off()

使用ggexport()

library(ggpubr)
ggexport(plotlist = list(scree.plot, ind.plot, var.plot), 
         filename = "PCA.pdf")

ggexport(plotlist = list(scree.plot, ind.plot, var.plot), 
         nrow = 2, ncol = 2,
         filename = "PCA.pdf")

ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         filename = "PCA.png")

5.2 Export results to txt/csv files

write.infile(res.pca, "pca.txt", sep = "\t")
# Export into a CSV file
write.infile(res.pca, "pca.csv", sep = ";")

6. 总结

进行PCA的方法
Using prcomp() [stats]

res.pca <- prcomp(iris[, -5], scale. = TRUE)

Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp

Using princomp() [stats]

res.pca <- princomp(iris[, -5], cor = TRUE)