Understand your data and discover underlying patterns with Principal Component Analysis (PCA)


Save time and resources, and stay healthy, with data exploration that goes beyond means, distributions and correlations: leverage PCA to see through the surface of variables. It saves time and resources because it uncovers data issues before hours of model training, and it is good for a programmer’s health, since she trades data worries for something more enjoyable. For example, a well-proven machine learning model might fail because the data is effectively one-dimensional with insufficient variance, or because of other related issues. PCA offers valuable insights that make you confident about your data’s properties and its hidden dimensions.


This article shows how to leverage PCA to understand key properties of a dataset, saving time and resources down the road, which ultimately leads to a happier, more fulfilled coding life. I hope this post helps you apply PCA in a consistent way and understand its results.


TL;DR

PCA provides valuable insights that reach beyond descriptive statistics and help to discover underlying patterns. Two PCA metrics indicate (1) how many components capture the largest share of variance (explained variance) and (2) which features correlate with the most important components (factor loadings). These metrics cross-check previous steps in the project workflow, such as data collection, which can then be adjusted. As a shortcut and ready-to-use tool, I provide the function do_pca(), which conducts a PCA on a prepared dataset so you can inspect its results within seconds in this notebook or this script.

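For orientation, here is a minimal sketch of what such a helper could look like. The name do_pca() comes from this article, but the signature and internals below are my assumptions, built on scikit-learn's PCA applied to a numeric pandas DataFrame:

```python
# A minimal sketch of a do_pca() helper, assuming a numeric pandas
# DataFrame as input. The name do_pca comes from the article; the
# signature and internals here are illustrative assumptions.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def do_pca(df, n_components=None):
    """Standardize features, fit a PCA and report both metrics:
    explained variance per component and factor loadings."""
    X = StandardScaler().fit_transform(df)        # PCA is scale-sensitive
    pca = PCA(n_components=n_components).fit(X)

    # 1. Explained variance: share of total variance per component
    explained = pd.Series(
        pca.explained_variance_ratio_,
        index=[f"PC{i + 1}" for i in range(pca.n_components_)],
        name="explained_variance_ratio",
    )

    # 2. Factor loadings: eigenvectors scaled by the square root of the
    #    eigenvalues, i.e. correlations between features and components
    #    for standardized data
    loadings = pd.DataFrame(
        pca.components_.T * pca.explained_variance_ ** 0.5,
        index=df.columns,
        columns=explained.index,
    )
    return pca, explained, loadings
```

Called on a prepared dataset, explained answers how many components capture the bulk of the variance, while loadings shows which original features drive each important component.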

Data exploration as a safety net

When a project structure resembles the one below, the prepared dataset comes under scrutiny in step 4 by looking at descriptive statistics. Among the most common ones are means, distributions and correlations taken across all observations or subgroups.

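For reference, a few lines of pandas cover these step-4 descriptives; the file name and the grouping column used here are hypothetical placeholders:

```python
# Typical step-4 descriptives with pandas; the dataset path and the
# "group" column are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("prepared_dataset.csv")        # assumed prepared dataset
num = df.select_dtypes("number")                # numeric features only

print(num.describe())                           # means, spread, quantiles
print(num.corr())                               # pairwise correlations
print(df.groupby("group")[num.columns].mean())  # subgroup means
num.hist(bins=30)                               # per-feature distributions
```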

Common project structure


  1. Collection: gather, retrieve or load data
  2. Processing: format raw data, handle missing entries
  3. Engineering: construct and select features
  4. Exploration: inspect descriptives, properties
  5. Modelling: train, validate and test models
  6. Evaluation: inspect results, compare models

When the moment finally arrives of having a clean dataset after hours of work, many glances already go towards the exciting step of applying models to the data. At this stage, around 80–90% of the project’s workload is done, assuming the data did not fall out of the sky already cleaned and processed. Of course, the urge to model is strong, but here are two reasons why a thorough data exploration saves time down the road:


  1. catch coding errors → revise feature engineering (step 3)
  2. identify underlying properties → rethink data collection (step 1), preprocessing (step 2) or feature engineering (step 3)

Wondering about underperforming models due to underlying data issues a few hours into training, validating and testing is like being a photographer on set who does not know what their models look like. Therefore, the key message is to see data exploration as an opportunity to get to know your data and understand its strengths and weaknesses.


Descriptive statistics often reveal coding errors. However, detecting underlying issues likely requires more than that. Decomposition methods such as PCA help to identify them and make it possible to revise previous steps. This ensures a smooth transition to model building.


Photo by Harrison Haines from Pexels

Look beneath the surface with PCA

Large datasets often require PCA for dimensionality reduction anyway. The method captures the maximum possible variance across features and projects observations onto mutually uncorrelated vectors, called components. Still, PCA serves purposes other than dimensionality reduction: it also helps to discover underlying patterns across features.

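To make that concrete, here is a small illustrative sketch on toy data: it projects observations onto the components with scikit-learn and verifies that the resulting component scores are mutually uncorrelated.

```python
# Illustrative sketch: project observations onto principal components
# and verify that the component scores are mutually uncorrelated.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))      # toy data: 500 observations, 5 features
X[:, 1] += 0.8 * X[:, 0]           # inject correlation between two features

scores = PCA().fit_transform(X)    # observations projected onto components

# Off-diagonal correlations between component scores are ~0
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 3))
```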

To focus on the implementation in Python instead of the methodology, I will skip describing how PCA works. Many great resources about it exist, and I refer to those instead:


  • Animations showing PCA in action
