Applied Predictive Modeling by Max Kuhn is an excellent book on machine learning, with all of its code written in R. I haven't written much lately, so after some thought I decided to work through the book's examples using Spark and Python.
Since I'll be using the datasets from Applied Predictive Modeling, there's no need for a full Hadoop cluster; a simple local setup will do.
First, download and install AdoptOpenJDK 11. I chose Java 11 mainly because Java 8 is quite old, and I may want other Java libraries later, such as OptaPlanner. Then download spark-3.3.0-bin-hadoop3 from the official site and unpack it to the C: drive. Copy the pyspark folder from spark-3.3.0-bin-hadoop3\python into Python\Python38\Lib\site-packages.
To use Spark SQL on Windows, you also need Hadoop 3 and winutils: copy the files from the winutils bin directory into Hadoop's bin directory.
The main environment variables to set are: SPARK_HOME, pointing to your Spark directory; PYSPARK_PYTHON, set to python; and HADOOP_HOME, pointing to your Hadoop directory.
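For reference, here is a minimal sketch of setting these from Python before building the SparkSession; the paths are my assumptions based on the layout above, so adjust them to your machine:
import os
# assumed install locations, matching the setup steps above; adjust to your paths
os.environ["SPARK_HOME"] = r"C:\spark-3.3.0-bin-hadoop3"
os.environ["HADOOP_HOME"] = r"C:\hadoop3"
os.environ["PYSPARK_PYTHON"] = "python"
# sanity check that the copied pyspark package is importable
import pyspark
print(pyspark.__version__)  # expect 3.3.0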
I'll work through the examples in jupyter-notebook. For models Spark doesn't support, I'll substitute other Python libraries.
Let's start with part of Chapter 3, on data preprocessing.
The first dataset is segmentationOriginal from the R package AppliedPredictiveModeling: 2019 samples and 119 columns, 116 of which are features.
The book keeps only the numeric features and drops the categorical ones, i.e. every column whose name contains Status:
import pandas as pd
segmentationOriginal = pd.read_csv('Documents/segmentationOriginal.csv')
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(segmentationOriginal).filter("Case='Train'")  # keep only the training samples
# drop the *Status columns, then the first 3 columns (Cell, Case, Class), keeping the 58 numeric features
new_col = [c for c in df.columns if c.find('Status')==-1][3:]
df = df.select(new_col)
df.show()
The Box-Cox transformation targets numeric features whose distributions are strongly skewed. Spark provides no built-in function for it, but if the data fits in memory, scipy will do:
py_segmentationOriginal = segmentationOriginal[segmentationOriginal['Case']=='Train']
from scipy import stats
# boxcox fits lambda by maximum likelihood and returns (transformed values, lambda)
transformed, lam = stats.boxcox(py_segmentationOriginal['AreaCh1'])
lam
scipy estimates λ = -0.856 for AreaCh1. For λ ≠ 0, the Box-Cox transform is (x^λ - 1)/λ, which is exactly what the SQL below applies by hand.
Let's compare the skewness before and after the transform:
df.createOrReplaceTempView('seg_table')
spark.sql("""
select skewness(AreaCh1),skewness(log(AreaCh1)) log_skew, max(AreaCh1), min(AreaCh1),
skewness((pow(AreaCh1,-0.856)-1)/-0.856) boxcox_skew
from seg_table
""").show()
+------------------+-----------------+------------+------------+------------------+
| skewness(AreaCh1)|         log_skew|max(AreaCh1)|min(AreaCh1)|       boxcox_skew|
+------------------+-----------------+------------+------------+------------------+
|3.5303544460710077|1.006965880202868|        2186|         150|0.1299175715463478|
+------------------+-----------------+------------+------------+------------------+
The raw skewness is 3.53; the log transform brings it down to 1.01, and Box-Cox brings it to 0.13, a clear improvement.
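The book runs this check over every predictor, not just AreaCh1. As a rough sketch, here is one way to scan all the numeric columns for skewness in a single Spark pass (new_col is the feature list built earlier):
from pyspark.sql import functions as F
# compute the skewness of every numeric feature at once
skew_row = df.select([F.skewness(c).alias(c) for c in new_col]).first()
# rank features by absolute skewness; the most skewed ones are Box-Cox candidates
sorted(skew_row.asDict().items(), key=lambda kv: -abs(kv[1] or 0))[:10]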
Centering, scaling, and PCA
from pyspark.ml.feature import VectorAssembler,StandardScaler,PCA
from pyspark.ml import Pipeline
# assemble the 58 numeric columns into a single vector column
vecAssembler = VectorAssembler(inputCols=new_col,outputCol="features")
# withMean=True centers each feature; withStd=True scales it to unit variance
scaler = StandardScaler(inputCol=vecAssembler.getOutputCol(), outputCol="scaledFeatures",
                        withStd=True, withMean=True)
# project the standardized features onto the first 3 principal components
pca = PCA(k=3, inputCol=scaler.getOutputCol(), outputCol="pcaFeatures")
pipeline = Pipeline(stages=[vecAssembler, scaler, pca])
model = pipeline.fit(df)
model.transform(df).select("pcaFeatures").show(20,False)
+-------------------------------------------------------------+
|pcaFeatures                                                  |
+-------------------------------------------------------------+
|[-5.09857487957507,-4.551380424818033,-0.033451546824984385] |
|[0.25462611056364975,-1.1980325587763472,-1.0205956889549515]|
|[-1.2928941300643262,1.8639347739164245,-1.2511046080904566] |
|[1.46466126382203,1.565832714432854,0.4696208782319596]      |
|[0.8762771486842043,1.2790054972780156,-1.337942609697476]   |
|[0.86154158288239,0.3286841715474364,-0.15546722869534157]   |
|[0.6861966008243924,2.0315180301256714,-2.1213172305186343]  |
|[2.6685158438981094,-5.455060970009097,-1.613799687152413]   |
|[3.2087047621995595,-2.1620798853429206,1.2675300698692626]  |
|[4.482997251320509,-3.7361518598726473,0.3919473957090797]   |
|[-6.952385041796879,-3.498570397548539,-0.33866193744210416] |
|[2.1058023937734838,-6.975440635580999,0.5537636192889748]   |
|[3.1924732973062926,3.4003777021252253,6.461132307309103]    |
|[0.8066060414203505,2.9004554138205094,-1.5943263249501203]  |
|[-3.9557076056689247,-0.8682775317372208,-2.8108580775542396]|
|[4.056033087233065,-2.0842603380353477,-0.9923054535937571]  |
|[-1.2730831064417794,0.9364917475502689,-2.414565477896586]  |
|[-8.550771612733588,-0.27922734365141305,-2.0796917771587964]|
|[-1.2190655829135115,1.1630355784452027,0.7602668076619262]  |
|[-0.07846479937885478,1.2954184356389247,-2.403163315445853] |
+-------------------------------------------------------------+
only showing top 20 rows
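As a follow-up, the fitted PCAModel reports how much variance each component captures; model.stages[-1] is the PCA stage of the pipeline fitted above:
pca_model = model.stages[-1]        # the fitted PCAModel
print(pca_model.explainedVariance)  # fraction of variance per principal component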
That's all for this post.