We will not be able to uniquely determine 3 million parameters with only 100k observations. In this case, we call the model unidentifiable.
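To make "unidentifiable" concrete at a size small enough to run, here is a minimal NumPy sketch. It assumes a hypothetical linear model with 5 observations and 8 parameters; the sizes, seed and variable names are illustrative and not from the original notes. Two different coefficient vectors reproduce the observed responses exactly, so the data alone cannot tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_params = 5, 8            # illustrative sizes: more parameters than observations
X = rng.normal(size=(n_obs, n_params))
y = rng.normal(size=n_obs)

# Minimum-norm least-squares solution; it fits y exactly because rank(X) = n_obs.
beta_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any vector in the null space of X can be added without changing the fit.
_, _, vt = np.linalg.svd(X)       # full_matrices=True by default, so vt is 8 x 8
null_vec = vt[-1]                 # a right singular vector orthogonal to the rows of X
beta_b = beta_a + 3.0 * null_vec

print(np.allclose(X @ beta_a, y), np.allclose(X @ beta_b, y))  # both fit the data
print(np.allclose(beta_a, beta_b))                             # yet they differ
```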
To handle this in practice, we can:
Interaction terms make the feature dimension grow sharply, so we need some way to deal with the resulting high dimensionality.
High dimensionality arises when the number of parameters exceeds or is close to the number of observations.
To reduce the dimensionality, we can create a new, smaller set of predictors by taking linear combinations of the original predictors.
We choose $Z_1, Z_2, \dots, Z_m$, where $m < p$ and where each $Z_i$ is a linear combination of the original $p$ predictors:
$$Z_i=\sum_{j=1}^{p}\phi_{ij}X_j$$
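In other words, we pick $m$ new predictors (the $Z_i$) with $m < p$. Each new predictor is a linear combination of all the original predictors (the $X_j$), which is how the dimensionality is reduced.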
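The formula above is just a matrix product. The NumPy sketch below is a minimal illustration, assuming arbitrary random weights `phi` and toy sizes (both are placeholders, not values from the original). PCA, discussed next, is one particular way of choosing these weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 6, 4, 2                      # illustrative sizes with m < p
X = rng.normal(size=(n, p))            # rows are observations, columns are X_1..X_p
phi = rng.normal(size=(m, p))          # phi[i, j] is the weight of X_j in Z_i (arbitrary here)

# Z_i = sum_j phi_ij * X_j, computed for every observation at once.
Z = X @ phi.T                          # shape (n, m)

# The same quantity written out as the explicit sum, for the first new predictor.
Z_first = sum(phi[0, j] * X[:, j] for j in range(p))
print(Z.shape, np.allclose(Z[:, 0], Z_first))
```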
Principal Components Analysis (PCA) is a method to identify a new set of predictors, as linear combinations of the original ones, that captures the “maximum amount” of variance in the observed data
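Principal Components Analysis (PCA) produces a list of $p$ principal components $Z_1, \dots, Z_p$ such that each $Z_i$ is a linear combination of the original predictors and the $Z_i$ are ordered by decreasing variance of the observed data along them.
That is, the observed data shows more variance in the direction of $Z_1$ than in the direction of $Z_2$.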
To perform dimensionality reduction, we select the top m principal components of PCA as our new predictors and express our observed data in terms of these predictors.
Transforming our observed data means projecting our dataset onto the space defined by the top m PCA components; these components are our new predictors.
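The projection step can be sketched with a singular value decomposition of the centered data. This is one common way to compute it, not necessarily the one the original notes had in mind; the toy data, sizes and variable names below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 200, 5, 2                                      # illustrative sizes
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))    # correlated toy predictors

Xc = X - X.mean(axis=0)                                  # center each predictor
# Rows of vt are the principal directions, ordered by decreasing singular value.
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
components = vt[:m]                                      # top m directions, shape (m, p)
Z = Xc @ components.T                                    # data in the new predictors, shape (n, m)

var_along = Z.var(axis=0, ddof=1)                        # variance captured by each new predictor
print(Z.shape, var_along)
```

The variances in `var_along` come out in decreasing order, which matches the statement that the data varies more along $Z_1$ than along $Z_2$.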
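For example, suppose we have a two-dimensional feature (x, y) (as shown in the figure) and want to reduce it to one dimension.
Staying with the two-dimensional example: to simplify the computation, we move the origin to the mean of all the x and y values, i.e. $(\overline{x}, \overline{y})$. We then rotate a line that passes through this origin.
Therefore, for dimensionality reduction in higher dimensions, we need to use the covariance matrix.
(The math is skipped here, since calling a library function takes only a few lines... search online if you are interested.)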
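For readers who want to see the covariance-matrix route without the full derivation, here is a minimal NumPy sketch of the 2-D to 1-D case. It assumes a toy (x, y) cloud generated on the spot; the data and names are illustrative only. The line we keep is the eigenvector of the covariance matrix with the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(scale=0.5, size=300)    # toy cloud stretched along one direction
data = np.column_stack([x, y])                   # shape (300, 2)

centered = data - data.mean(axis=0)              # move the origin to (x-bar, y-bar)
cov = np.cov(centered, rowvar=False)             # 2 x 2 covariance matrix

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(cov)
direction = eigvecs[:, -1]                       # unit vector along the largest variance
z = centered @ direction                         # 1-D coordinates along that line

print(direction, z.var(ddof=1), eigvals[-1])
```

The variance of the projected points equals the largest eigenvalue, which is why the eigendecomposition replaces the "rotate the line and compare" search in higher dimensions.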
If we use all $p$ of the new $Z_j$, then we have not improved the dimensionality. Instead, we select the first $M$ PCA variables, $Z_1, \dots, Z_M$, to use as predictors in a regression model.
Cross-validation is the best way to choose M for a specific problem.
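A rough sketch of choosing M by cross-validation is shown below. It uses plain NumPy, a hand-written k-fold loop and a hypothetical helper `pcr_cv_mse`; the toy data, the fold count and the squared-error metric are all assumptions made for illustration rather than details from the original notes.

```python
import numpy as np

def pcr_cv_mse(X, y, n_components, n_folds=5, seed=0):
    """Mean squared error of principal components regression, estimated by k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Fit the centering and the PCA directions on the training fold only.
        mean = X[train].mean(axis=0)
        _, _, vt = np.linalg.svd(X[train] - mean, full_matrices=False)
        V = vt[:n_components].T
        Z_train = (X[train] - mean) @ V
        Z_test = (X[test] - mean) @ V
        # Ordinary least squares of y on the M new predictors plus an intercept.
        A_train = np.column_stack([np.ones(len(train)), Z_train])
        A_test = np.column_stack([np.ones(len(test)), Z_test])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((y[test] - A_test @ coef) ** 2))
    return float(np.mean(errors))

# Toy data (illustrative): y depends on a 2-dimensional signal hidden in 10 predictors.
rng = np.random.default_rng(4)
latent = rng.normal(size=(150, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(150, 10))
y = latent @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=150)

scores = {M: pcr_cv_mse(X, y, M) for M in range(1, 11)}
best_M = min(scores, key=scores.get)
print(best_M, scores[best_M])
```

The PCA directions are refit inside each fold so that the held-out rows never influence the components used to predict them.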
PCA is an unsupervised algorithm. It is done independent of the outcome variable.
PCA is not so good because:
PCA is great for:
we want our imputations to take into account:
This is the idea behind the iterative PCA algorithm for imputation.
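The notes do not spell out the steps, so the following is only a sketch of one common formulation of iterative PCA imputation: fill the gaps with column means, fit a low-rank PCA reconstruction, overwrite the missing entries with the reconstructed values, and repeat until the filled-in values stop changing. The function name `iterative_pca_impute`, its parameters and the toy data are hypothetical.

```python
import numpy as np

def iterative_pca_impute(X, n_components, n_iter=100, tol=1e-8):
    """Fill NaNs in X by alternating a rank-`n_components` PCA fit and re-imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means computed on the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - mean, full_matrices=False)
        # Low-rank reconstruction from the top components.
        approx = (u[:, :n_components] * s[:n_components]) @ vt[:n_components] + mean
        new_filled = np.where(missing, approx, X)   # observed entries are never changed
        change = np.max(np.abs(new_filled - filled))
        filled = new_filled
        if change < tol:
            break
    return filled

# Toy data (illustrative): three strongly correlated columns, some values hidden.
rng = np.random.default_rng(5)
latent = rng.normal(size=(100, 1))
X_true = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(100, 3))
X_obs = X_true.copy()
X_obs[::10, 1] = np.nan                             # hide every tenth value of column 1

X_hat = iterative_pca_impute(X_obs, n_components=1)
mask = np.isnan(X_obs)
err_iter = np.abs(X_hat[mask] - X_true[mask]).mean()
err_mean = np.abs(np.nanmean(X_obs, axis=0)[1] - X_true[mask]).mean()
print(err_iter, err_mean)
```

On this toy example the iterative fill uses the correlation between columns, so its error is much smaller than that of plain mean imputation.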