PySpark Feature Engineering: PCA

PCA: Principal Component Analysis

class pyspark.ml.feature.PCA(k=None, inputCol=None, outputCol=None)

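Principal component analysis takes a set of original variables that are correlated with one another (say, P indicators) and recombines them into a new set of mutually uncorrelated composite indicators that replace the originals.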

PCA trains a model that projects vectors onto the lower-dimensional space spanned by the top k principal components.
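To make the projection concrete, here is a minimal NumPy sketch of the same idea outside Spark. It is not Spark's implementation: the helper name top_k_components is hypothetical, and the sketch assumes the raw rows are projected without subtracting the mean. Under that assumption it should reproduce the values shown in step 05 up to the sign of each column, since the sign of an eigenvector is arbitrary.

```python
import numpy as np

# The same three 5-dimensional rows used as example data in step 01.
X = np.array([
    [0.0, 1.0, 0.0, 7.0, 0.0],
    [2.0, 0.0, 3.0, 4.0, 5.0],
    [4.0, 0.0, 0.0, 6.0, 7.0],
])

def top_k_components(X, k):
    """Hypothetical helper: return an (n_features, k) matrix of principal axes."""
    cov = np.cov(X, rowvar=False)            # sample covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    return eigvecs[:, order]

pc = top_k_components(X, k=2)

# Assumption: project the raw (uncentered) rows onto the principal axes.
projected = X @ pc

# Print absolute values because each column's sign is arbitrary.
print(np.abs(projected).round(4))
```

The columns of pc are orthonormal, so the projection is a change of basis followed by dropping all but the first k coordinates.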

model.explainedVariance: returns a vector holding the proportion of variance explained by each principal component.
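One way to read this vector is as the eigenvalues of the covariance matrix divided by their sum. The NumPy sketch below illustrates that reading on the example data from step 01; it is an assumption about how the proportions arise, not a copy of Spark's code, and the variable names are illustrative. The result should come out close to the DenseVector([0.7944, 0.2056]) shown in step 06.

```python
import numpy as np

# The same three 5-dimensional rows used as example data in step 01.
X = np.array([
    [0.0, 1.0, 0.0, 7.0, 0.0],
    [2.0, 0.0, 3.0, 4.0, 5.0],
    [4.0, 0.0, 0.0, 6.0, 7.0],
])

cov = np.cov(X, rowvar=False)              # 5 x 5 sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigenvalues, largest first
eigvals = np.clip(eigvals, 0.0, None)      # drop tiny negative round-off values

ratios = eigvals / eigvals.sum()           # share of total variance per component
explained = ratios[:2]                     # keep the top k = 2 shares
print(explained.round(4))
```

With only three rows the covariance matrix has rank at most 2, which is why two components already account for essentially all of the variance here.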

01. Create the data

from pyspark.sql import SparkSession
# Spark config keys are case-sensitive: the key is "spark.driver.host"
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
    .config("spark.ui.showConsoleProgress","false")\
    .appName("PCA").master("local[*]").getOrCreate()
#%%
from pyspark.ml.linalg import Vectors
# One sparse row and two dense rows, each of dimension 5
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
df.show()

Output:

+--------------------+
|            features|
+--------------------+
| (5,[1,3],[1.0,7.0])|
|[2.0,0.0,3.0,4.0,...|
|[4.0,0.0,0.0,6.0,...|
+--------------------+

02. Inspect the rows in detail

df.head(3)

Output:

[Row(features=SparseVector(5, {1: 1.0, 3: 7.0})),
 Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])),
 Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]))]

03. Check the schema; note that sparse and dense vectors coexist in the same column

df.printSchema()

Output:

root
 |-- features: vector (nullable = true)
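Sparse and dense vectors can share one column because they are two encodings of the same kind of value: Vectors.sparse(5, [(1, 1.0), (3, 7.0)]) stores only the non-zero entries of a length-5 vector. The plain-Python sketch below expands such (index, value) pairs into the dense form; sparse_to_dense is a hypothetical helper written for illustration, not part of PySpark.

```python
def sparse_to_dense(size, pairs):
    """Hypothetical helper: expand (index, value) pairs into a list of length `size`."""
    dense = [0.0] * size              # every position not listed is zero
    for index, value in pairs:
        dense[index] = value          # fill in the stored non-zero entries
    return dense

# Same arguments as the sparse row in the example data.
row = sparse_to_dense(5, [(1, 1.0), (3, 7.0)])
print(row)  # [0.0, 1.0, 0.0, 7.0, 0.0]
```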

04. Apply PCA

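from pyspark.ml.feature import PCA
# Keep the top 2 principal components and write them to the "res" column
pca = PCA(k=2,inputCol="features",outputCol="res")
# fit() learns the components; transform() appends the projected vectors
model = pca.fit(df)
model.transform(df).show()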

Output:

+--------------------+--------------------+
|            features|                 res|
+--------------------+--------------------+
| (5,[1,3],[1.0,7.0])|[1.64857282308838...|
|[2.0,0.0,3.0,4.0,...|[-4.6451043317815...|
|[4.0,0.0,0.0,6.0,...|[-6.4288805356764...|
+--------------------+--------------------+

05. Inspect the result in detail

model.transform(df).head(3)

Output:

[Row(features=SparseVector(5, {1: 1.0, 3: 7.0}), res=DenseVector([1.6486, -4.0133])),
 Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), res=DenseVector([-4.6451, -1.1168])),
 Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), res=DenseVector([-6.4289, -5.338]))]

06. Explained variance

model.explainedVariance

Output:

DenseVector([0.7944, 0.2056])
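A common use of these proportions is to decide how many components are worth keeping. The sketch below, in plain Python, returns the smallest k whose cumulative share reaches a chosen threshold. The helper name smallest_k and the thresholds 0.75 and 0.95 are illustrative assumptions, not values from the original post. Since explainedVariance only has k entries, this would normally be run on a model fitted with a generous k first.

```python
def smallest_k(explained, threshold):
    """Hypothetical helper: smallest k whose cumulative variance share >= threshold."""
    total = 0.0
    for k, ratio in enumerate(explained, start=1):
        total += ratio                # running sum of explained variance
        if total >= threshold:
            return k
    return len(explained)             # threshold never reached: keep everything

# Proportions reported by model.explainedVariance above.
explained = [0.7944, 0.2056]
print(smallest_k(explained, 0.75))  # 1
print(smallest_k(explained, 0.95))  # 2
```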
