PCA: Principal Component Analysis
class pyspark.ml.feature.PCA(k=None, inputCol=None, outputCol=None)
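Principal component analysis (PCA) takes a large set of correlated indicators (say, P of them) and recombines them into a new set of mutually uncorrelated composite indicators that stand in for the originals.
PCA trains a model that projects vectors onto the low-dimensional space spanned by the top k principal components.
model.explainedVariance: returns a vector holding the proportion of variance explained by each principal component.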
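To make the description above concrete, here is a minimal NumPy sketch of the same idea applied to the three sample vectors used in this walkthrough. It is an illustration only, not Spark's internal code: it assumes an eigendecomposition of the sample covariance matrix, and it assumes the rows are projected without subtracting the mean first. The variable names are made up for this sketch.

```python
import numpy as np

# Same three 5-dimensional samples as the walkthrough (the sparse row written densely).
X = np.array([
    [0.0, 1.0, 0.0, 7.0, 0.0],
    [2.0, 0.0, 3.0, 4.0, 5.0],
    [4.0, 0.0, 0.0, 6.0, 7.0],
])
k = 2

# Sample covariance matrix of the features (np.cov centers the columns internally).
cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder them to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of the total variance carried by each of the top k components.
explained = eigvals[:k] / eigvals.sum()

# Project the raw rows onto the top k components.
# Assumption: no mean subtraction at projection time.
# The sign of each component is arbitrary, so only magnitudes are printed.
projected = X @ eigvecs[:, :k]

print(np.round(explained, 4))
print(np.round(np.abs(projected), 4))
```

The explained-variance ratios and the absolute values of the projected rows can be compared with the Spark output in steps 05 and 06. The signs may differ, because an eigenvector is only defined up to sign.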
01. Create the data
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
.config("spark.ui.showConsoleProgress","false")\
.appName("PCA").master("local[*]").getOrCreate()
#%%
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
df.show()
Output:
+--------------------+
| features|
+--------------------+
| (5,[1,3],[1.0,7.0])|
|[2.0,0.0,3.0,4.0,...|
|[4.0,0.0,0.0,6.0,...|
+--------------------+
02. Detailed view
df.head(3)
Output:
[Row(features=SparseVector(5, {1: 1.0, 3: 7.0})),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]))]
03. Inspect the schema; sparse and dense vectors coexist in the same column
df.printSchema()
Output:
root
|-- features: vector (nullable = true)
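Both kinds of vector fit in one column because a sparse vector is just a compact encoding of a dense one: the printed form (5,[1,3],[1.0,7.0]) reads as size 5, non-zero indices 1 and 3, and values 1.0 and 7.0. The pure-Python sketch below expands that triple into the equivalent dense list. The helper name sparse_to_dense is hypothetical and is not part of pyspark.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats."""
    dense = [0.0] * size              # every position not listed is zero
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in the stored non-zero entries
    return dense

# The first row of the sample data, written out densely.
print(sparse_to_dense(5, [1, 3], [1.0, 7.0]))  # [0.0, 1.0, 0.0, 7.0, 0.0]
```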
04. Apply PCA
from pyspark.ml.feature import PCA
pca = PCA(k=2,inputCol="features",outputCol="res")
model = pca.fit(df)
model.transform(df).show()
Output:
+--------------------+--------------------+
| features| res|
+--------------------+--------------------+
| (5,[1,3],[1.0,7.0])|[1.64857282308838...|
|[2.0,0.0,3.0,4.0,...|[-4.6451043317815...|
|[4.0,0.0,0.0,6.0,...|[-6.4288805356764...|
+--------------------+--------------------+
05. Detailed view
model.transform(df).head(3)
Output:
[Row(features=SparseVector(5, {1: 1.0, 3: 7.0}), res=DenseVector([1.6486, -4.0133])),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), res=DenseVector([-4.6451, -1.1168])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), res=DenseVector([-6.4289, -5.338]))]
06. Explained variance
model.explainedVariance
Output:
DenseVector([0.7944, 0.2056])
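The two ratios sum to 1.0, so the two components together retain all of the variance in this tiny data set. One common way to use these ratios is to pick the smallest k whose cumulative share reaches a chosen threshold. The sketch below shows that rule in plain Python. The function name smallest_k and the thresholds are assumptions made for illustration; in practice the ratios would come from model.explainedVariance (for example via its toArray() method).

```python
def smallest_k(ratios, threshold):
    """Return the smallest k whose cumulative explained variance reaches threshold."""
    total = 0.0
    for k, r in enumerate(ratios, start=1):
        total += r
        if total >= threshold:
            return k
    # Threshold never reached: fall back to keeping every component.
    return len(ratios)

ratios = [0.7944, 0.2056]          # values printed in step 06
print(smallest_k(ratios, 0.75))    # 1
print(smallest_k(ratios, 0.95))    # 2
```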