Learning "Applied Predictive Modeling" with PySpark (Part 1): Environment Setup

"Applied Predictive Modeling" is an excellent machine learning book by Max Kuhn, with all of its code written in R. I haven't written much lately, and after some thought I decided to work through the book's examples with Spark and Python.

Since we'll be using the book's own datasets, there's no need for a full-blown Hadoop environment; a simple local setup is enough.

First, download and install AdoptOpenJDK 11. I picked Java 11 mainly because Java 8 is getting old, and I may want other Java libraries later, such as OptaPlanner. Then download spark-3.3.0-bin-hadoop3 from the official site and unpack it on the C drive. Copy the pyspark folder from spark-3.3.0-bin-hadoop3\python into Python\Python38\Lib\site-packages.

If you want to use Spark SQL, you also need to download Hadoop 3 and winutils, and copy the files from the winutils bin directory into Hadoop's bin.

The main environment variables to set are: SPARK_HOME, the location of your Spark installation; PYSPARK_PYTHON, whose value is simply python; and HADOOP_HOME, the location of your Hadoop installation.
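If you'd rather not edit the system settings, you can also set them from Python before creating the SparkSession. A minimal sketch, assuming the install locations described above (adjust the paths to your machine):

import os

# assumed install locations based on the setup above; change as needed
os.environ['SPARK_HOME'] = r'C:\spark-3.3.0-bin-hadoop3'
os.environ['HADOOP_HOME'] = r'C:\hadoop3'
os.environ['PYSPARK_PYTHON'] = 'python'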

I'll work through the book's examples in jupyter-notebook. For models Spark doesn't support, I'll substitute other Python libraries.

Let's start with part of Chapter 3, on data preprocessing.

First we need the segmentationOriginal dataset from the R package AppliedPredictiveModeling: 2019 samples and 119 columns, 116 of which are features.

The book filters out the categorical features and keeps the numeric ones, which means dropping every column whose name contains 'Status':

import pandas as pd
# the segmentationOriginal data, saved out of R as a CSV file
segmentationOriginal = pd.read_csv('Documents/segmentationOriginal.csv')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# keep only the training rows, as the book does
df = spark.createDataFrame(segmentationOriginal).filter("Case='Train'")
# drop columns whose name contains 'Status', then skip the first three
# identifier columns (Cell, Case, Class)
new_col = [c for c in df.columns if c.find('Status') == -1][3:]
df = df.select(new_col)
df.show()
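As a quick sanity check, you can count what survives the filter (I won't paste the output here; run it and compare against the book's description of the data):

# number of numeric feature columns kept, and number of training rows
print(len(new_col), df.count())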

Next, the Box-Cox transformation, aimed at numeric features whose distributions are markedly skewed. Spark doesn't ship a function that applies it directly; if the data volume is small, scipy can estimate λ and transform the feature in one call:

# restrict the pandas copy to the training rows as well
py_segmentationOriginal = segmentationOriginal[segmentationOriginal['Case'] == 'Train']

from scipy import stats
# boxcox returns the transformed values and the maximum-likelihood estimate of lambda
transformed, lam = stats.boxcox(py_segmentationOriginal['AreaCh1'])
print(lam)

The λ estimated for the feature AreaCh1 is -0.856.
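For λ ≠ 0 the Box-Cox transform of a feature x is

x* = (x^λ - 1) / λ

(and log(x) when λ = 0). So with λ = -0.856 the transformed feature is (AreaCh1^-0.856 - 1) / -0.856, which is exactly what the SQL below computes by hand.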

Let's compare the skewness before and after the transformation:

# register the DataFrame as a temporary view so we can query it with SQL
df.createOrReplaceTempView('seg_table')
spark.sql("""
select skewness(AreaCh1), skewness(log(AreaCh1)) log_skew, max(AreaCh1), min(AreaCh1),
  skewness((pow(AreaCh1,-0.856)-1)/-0.856) boxcox_skew
from seg_table
""").show()
+------------------+-----------------+------------+------------+------------------+
| skewness(AreaCh1)|         log_skew|max(AreaCh1)|min(AreaCh1)|       boxcox_skew|
+------------------+-----------------+------------+------------+------------------+
|3.5303544460710077|1.006965880202868|        2186|         150|0.1299175715463478|
+------------------+-----------------+------------+------------+------------------+

The raw skewness is 3.53; taking the log brings it down to about 1.01, and the Box-Cox transform brings it to 0.13, so the improvement is clear.
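The same three skewness figures can also be computed through the DataFrame API instead of SQL; an equivalent sketch of the query above:

from pyspark.sql import functions as F

df.select(
    F.skewness('AreaCh1'),
    F.skewness(F.log('AreaCh1')).alias('log_skew'),
    F.skewness((F.pow('AreaCh1', -0.856) - 1) / -0.856).alias('boxcox_skew'),
).show()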

Next, centering, scaling, and PCA, chained together in a single Pipeline:

from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA
from pyspark.ml import Pipeline

# assemble the numeric columns into a single vector column
vecAssembler = VectorAssembler(inputCols=new_col, outputCol="features")
# center (withMean) and scale (withStd) every feature
scaler = StandardScaler(inputCol=vecAssembler.getOutputCol(), outputCol="scaledFeatures",
                        withStd=True, withMean=True)
pca = PCA(k=3, inputCol=scaler.getOutputCol(), outputCol="pcaFeatures")
pipeline = Pipeline(stages=[vecAssembler, scaler, pca])

model = pipeline.fit(df)  # note: fit on df, the filtered training DataFrame
model.transform(df).select("pcaFeatures").show(20, False)
+-------------------------------------------------------------+
|pcaFeatures                                                  |
+-------------------------------------------------------------+
|[-5.09857487957507,-4.551380424818033,-0.033451546824984385] |
|[0.25462611056364975,-1.1980325587763472,-1.0205956889549515]|
|[-1.2928941300643262,1.8639347739164245,-1.2511046080904566] |
|[1.46466126382203,1.565832714432854,0.4696208782319596]      |
|[0.8762771486842043,1.2790054972780156,-1.337942609697476]   |
|[0.86154158288239,0.3286841715474364,-0.15546722869534157]   |
|[0.6861966008243924,2.0315180301256714,-2.1213172305186343]  |
|[2.6685158438981094,-5.455060970009097,-1.613799687152413]   |
|[3.2087047621995595,-2.1620798853429206,1.2675300698692626]  |
|[4.482997251320509,-3.7361518598726473,0.3919473957090797]   |
|[-6.952385041796879,-3.498570397548539,-0.33866193744210416] |
|[2.1058023937734838,-6.975440635580999,0.5537636192889748]   |
|[3.1924732973062926,3.4003777021252253,6.461132307309103]    |
|[0.8066060414203505,2.9004554138205094,-1.5943263249501203]  |
|[-3.9557076056689247,-0.8682775317372208,-2.8108580775542396]|
|[4.056033087233065,-2.0842603380353477,-0.9923054535937571]  |
|[-1.2730831064417794,0.9364917475502689,-2.414565477896586]  |
|[-8.550771612733588,-0.27922734365141305,-2.0796917771587964]|
|[-1.2190655829135115,1.1630355784452027,0.7602668076619262]  |
|[-0.07846479937885478,1.2954184356389247,-2.403163315445853] |
+-------------------------------------------------------------+
only showing top 20 rows
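The fitted PCAModel is the last stage of the pipeline model, and it reports how much variance each of the three components explains. A quick way to inspect it:

# the PCA stage sits at the end of the fitted pipeline
pca_model = model.stages[-1]
print(pca_model.explainedVariance)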

That's it for this installment.
