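This post uses the AdultBase dataset from Kaggle: Machine-Learning-Databases
Although the dataset is fairly old, its format is easy to work with and it covers a broad range of attributes, which makes it a good fit for getting started with data processing.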
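This post mainly shows how to use StringIndexer and IndexToString (from Spark ML) together and how to assemble them with a Pipeline. It also shows how to load and save the dataset with a fixed schema through the read and write operations of SparkSession.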
GitHub repository: Truedick23 - AdultBase
Let's start with the official Spark documentation for StringIndexer:
Spark 2.3.2 ScalaDoc - StringIndexer
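It has four parameters:
Parameter | Purpose |
---|---|
handleInvalid | How to handle invalid data (NULL values, or labels not seen during fitting). Options: "skip" - drop the row; "error" - throw an error (the default); "keep" - put invalid data into one extra index |
inputCol | Name of the column to index |
outputCol | Name of the column that receives the index |
stringOrderType | How the indices are ordered. Options: "frequencyDesc" - descending frequency, so the most frequent label gets index 0 (the default); "frequencyAsc" - ascending frequency, so the least frequent label gets index 0; "alphabetDesc" - descending alphabetical order; "alphabetAsc" - ascending alphabetical order |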
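To make the four stringOrderType values concrete, here is a minimal plain-Scala sketch of how each ordering turns a column of strings into indices. It is an illustration, not Spark's implementation: `StringOrderSketch` and `orderLabels` are hypothetical names, the sample values are made up, and breaking frequency ties alphabetically is an assumption of this sketch.

```scala
// Plain-Scala illustration of the ordering rules; this is not Spark's implementation.
object StringOrderSketch {

  /** Returns the distinct labels ordered so that a label's position is its index. */
  def orderLabels(values: Seq[String], orderType: String): Seq[String] = {
    // Count how often each label occurs.
    val counts: Seq[(String, Int)] =
      values.groupBy(identity).map { case (label, xs) => (label, xs.size) }.toSeq

    orderType match {
      // Ties are broken alphabetically here; that is an assumption of this sketch.
      case "frequencyDesc" => counts.sortBy { case (label, n) => (-n, label) }.map(_._1)
      case "frequencyAsc"  => counts.sortBy { case (label, n) => (n, label) }.map(_._1)
      case "alphabetDesc"  => counts.map(_._1).sorted.reverse
      case "alphabetAsc"   => counts.map(_._1).sorted
      case other           => throw new IllegalArgumentException(s"Unknown order type: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample: "Private" x3, "Local-gov" x2, "State-gov" x1.
    val workclass = Seq("Private", "Private", "Private", "Local-gov", "Local-gov", "State-gov")
    Seq("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc").foreach { orderType =>
      val indexed = orderLabels(workclass, orderType).zipWithIndex
        .map { case (label, i) => s"$label -> ${i.toDouble}" }
      println(s"$orderType: ${indexed.mkString(", ")}")
    }
  }
}
```

With "frequencyDesc" the most common value ("Private" in this sample) ends up at index 0, while "frequencyAsc" puts it last.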
The indexer for the workclass column is defined as follows:
import org.apache.spark.ml.feature.StringIndexer
def getWorkclassIndexer: StringIndexer = new StringIndexer()
  .setInputCol("workclass").setOutputCol("workclassIndex").setHandleInvalid("keep")
  .setStringOrderType("frequencyAsc")
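This code sets up the workclass indexer indirectly, through a function. Its inputCol and outputCol are "workclass" and "workclassIndex". With handleInvalid set to "keep", invalid values (NULLs or labels unseen during fitting) receive an extra index instead of raising an error. Because stringOrderType is "frequencyAsc", the least frequent label gets index 0. Note that "?" is an ordinary string in this dataset, so it is indexed like any other label.
The result is: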
+----------------+--------------+
| workclass|workclassIndex|
+----------------+--------------+
| Never-worked| 0.0|
| Without-pay| 1.0|
| Federal-gov| 2.0|
| Self-emp-inc| 3.0|
| State-gov| 4.0|
| ?| 5.0|
| Local-gov| 6.0|
|Self-emp-not-inc| 7.0|
| Private| 8.0|
+----------------+--------------+
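One very useful trick: calling fit on a dataset returns a StringIndexerModel, and its labels field holds the strings in index order, so labels(i) is the string that was assigned index i. We will use this trick below.
Next, the official Spark documentation for IndexToString: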
Spark 2.3.2 ScalaDoc - IndexToString
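It has three parameters:
Parameter | Purpose |
---|---|
inputCol | Name of the index column to convert back |
outputCol | Name of the column that receives the strings |
labels | A StringArrayParam giving the strings in index order; if it is not set, the metadata of inputCol is used |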
Example code:
import org.apache.spark.ml.feature.IndexToString
new IndexToString()
.setInputCol("workclassIndex").setOutputCol("workclass")
.setLabels(getWorkclassIndexer.fit(getTrainingData).labels)
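This code creates an IndexToString instance whose inputCol is "workclassIndex" and whose outputCol is "workclass". Its labels are the original, pre-indexing strings, taken from the workclass indexer after fitting it on the training data.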
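The round trip from strings to indices and back can be sketched on a small in-memory DataFrame. This is a minimal sketch that assumes Spark 2.3 or later on the classpath and a local master; `IndexRoundTrip`, the toy rows, and the column name workclassRestored are made up for illustration and are not part of the original project.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object IndexRoundTrip {

  /** Indexes a toy "workclass" column, then maps the indices back to strings. */
  def roundTrip(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Toy rows standing in for the real dataset (made up for illustration).
    val df = Seq("Private", "Private", "Private", "Local-gov", "Local-gov", "State-gov")
      .toDF("workclass")

    // fit returns a StringIndexerModel; labels(i) is the string assigned index i.
    val indexerModel = new StringIndexer()
      .setInputCol("workclass").setOutputCol("workclassIndex")
      .setStringOrderType("frequencyAsc")
      .fit(df)
    println(indexerModel.labels.mkString(", "))

    // IndexToString is a plain Transformer, so it needs no fit step.
    val indexed = indexerModel.transform(df)
    new IndexToString()
      .setInputCol("workclassIndex").setOutputCol("workclassRestored")
      .setLabels(indexerModel.labels)
      .transform(indexed)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("IndexRoundTrip").getOrCreate()
    roundTrip(spark).show()
    spark.stop()
  }
}
```

Every row of the result should carry the same string in workclass and workclassRestored, which is an easy sanity check when wiring up the real columns.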
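Pipeline is a handy class in Spark ML that chains several stages (Transformers and Estimators) into a single unit, which cuts down on the amount of code. First, its documentation:
Spark 2.3.2 ScalaDoc - Pipeline
It has a single parameter, stages, whose value is an Array of PipelineStage objects. In the example code, one Pipeline combines a series of IndexToString stages, and a function returns that Pipeline: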
import org.apache.spark.ml.Pipeline
// One IndexToString stage per categorical column; each stage gets its labels
// from the matching StringIndexer fitted on the training data.
private val converter_pipeline = new Pipeline().setStages(Array(
  new IndexToString()
    .setInputCol("workclassIndex").setOutputCol("workclass")
    .setLabels(getWorkclassIndexer.fit(getTrainingData).labels),
  new IndexToString()
    .setInputCol("educationIndex").setOutputCol("education")
    .setLabels(getEduIndexer.fit(getTrainingData).labels),
  new IndexToString()
    .setInputCol("maritial_statusIndex").setOutputCol("maritial_status")
    .setLabels(getMaritalIndexer.fit(getTrainingData).labels),
  new IndexToString()
    .setInputCol("occupationIndex").setOutputCol("occupation")
    .setLabels(getOccupationIndexer.fit(getTrainingData).labels),
  new IndexToString()
    .setInputCol("relationshipIndex").setOutputCol("relationship")
    .setLabels(getRelationshipIndexer.fit(getTrainingData).labels),
  new IndexToString()
    .setInputCol("raceIndex").setOutputCol("race")
    .setLabels(getRaceIndexer.fit(getTrainingData).labels),
  new IndexToString()
    .setInputCol("sexIndex").setOutputCol("sex")
    .setLabels(getSexIndexer.fit(getTrainingData).labels)
))
def getConverterPipline: Pipeline = converter_pipeline
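A Pipeline is applied to a dataset through fit and transform: fit returns a PipelineModel, and calling transform on that model produces the converted DataFrame. For example: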
val cluster_info_split_table = getConverterPipline
.fit(split_table).transform(split_table)
The result looks like this:
+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
| age|workclassIndex|educationIndex|maritial_statusIndex|occupationIndex|relationshipIndex|raceIndex|sexIndex|native_countryIndex| workclass|education|maritial_status| occupation| relationship| race| sex|
+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
|37.0| 7.0| 13.0| 5.0| 10.0| 4.0| 4.0| 1.0| 41.0|Self-emp-not-inc|Bachelors| Never-married| Sales|Not-in-family|White|Male|
|57.0| 7.0| 12.0| 5.0| 10.0| 4.0| 4.0| 1.0| 41.0|Self-emp-not-inc| Masters| Never-married| Sales|Not-in-family|White|Male|
|46.0| 7.0| 13.0| 5.0| 10.0| 4.0| 4.0| 1.0| 41.0|Self-emp-not-inc|Bachelors| Never-married| Sales|Not-in-family|White|Male|
|21.0| 7.0| 13.0| 5.0| 9.0| 3.0| 4.0| 1.0| 41.0|Self-emp-not-inc|Bachelors| Never-married|Other-service| Own-child|White|Male|
|49.0| 7.0| 11.0| 5.0| 10.0| 4.0| 3.0| 1.0| 20.0|Self-emp-not-inc|Assoc-voc| Never-married| Sales|Not-in-family|Black|Male|
|70.0| 6.0| 11.0| 5.0| 9.0| 4.0| 4.0| 1.0| 40.0| Local-gov|Assoc-voc| Never-married|Other-service|Not-in-family|White|Male|
|29.0| 7.0| 13.0| 5.0| 10.0| 4.0| 4.0| 1.0| 41.0|Self-emp-not-inc|Bachelors| Never-married| Sales|Not-in-family|White|Male|
|29.0| 7.0| 12.0| 5.0| 10.0| 3.0| 3.0| 1.0| 19.0|Self-emp-not-inc| Masters| Never-married| Sales| Own-child|Black|Male|
+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
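A Pipeline can also mix Estimators and Transformers, in which case fit actually learns something (here, the label-to-index mappings). The following is a minimal sketch under the same assumptions as before (Spark 2.3 or later, local master); `PipelineSketch`, the toy rows, and the workclassRestored column are hypothetical. It leaves labels unset on IndexToString, so the stage reads the labels from the metadata that StringIndexer attaches to the index column.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {

  /** Indexes two toy columns and restores one of them, all in a single Pipeline. */
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Toy rows (made up for illustration).
    val df = Seq(("Private", "Male"), ("Private", "Female"), ("Local-gov", "Male"))
      .toDF("workclass", "sex")

    val workclassIndexer = new StringIndexer()
      .setInputCol("workclass").setOutputCol("workclassIndex")
    val sexIndexer = new StringIndexer()
      .setInputCol("sex").setOutputCol("sexIndex")
    // No setLabels: IndexToString falls back to the metadata on "workclassIndex".
    val workclassBack = new IndexToString()
      .setInputCol("workclassIndex").setOutputCol("workclassRestored")

    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](workclassIndexer, sexIndexer, workclassBack))

    // fit trains the two StringIndexer stages and returns a PipelineModel.
    val model: PipelineModel = pipeline.fit(df)
    model.transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("PipelineSketch").getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```

Keeping the indexers and the converter in one Pipeline avoids fitting each StringIndexer a second time just to obtain its labels, at the cost of tying the converter to DataFrames that still carry the index metadata.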
StructField describes a single field (column) of a schema. Its documentation:
Spark 2.3.2 ScalaDoc - StructField
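It has four attributes:
Attribute | Purpose |
---|---|
name | Name of the field |
dataType | Data type of the field |
nullable | Whether the field may contain null values |
metadata | Metadata attached to the field |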
A StructType can be thought of as a collection of StructFields. The documentation gives the following example:
val struct =
StructType(
StructField("a", IntegerType, true) ::
StructField("b", LongType, false) ::
StructField("c", BooleanType, false) :: Nil)
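Encoders is a factory for Encoder instances, and an Encoder converts between JVM objects and Spark SQL's internal representation.
We define an Adult case class, then a getAdultSchema function that derives a StructType from it: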
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType
// One field per column of the dataset, in file order.
case class Adult (age: Int, workclass: String, fnlwgt: Int,
                  education: String, education_num: Int,
                  maritial_status: String, occupation: String,
                  relationship: String, race: String,
                  sex: String, capital_gain: Int, capital_loss: Int,
                  hours_per_week: Int, native_country: String)
def getAdultSchema: StructType = Encoders.product[Adult].schema
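To see what Encoders.product derives, it helps to compare it with a hand-written StructType for a smaller case class. This is a minimal sketch: `Person` and `SchemaSketch` are hypothetical names used only for illustration, and it assumes spark-sql on the classpath (no SparkSession is needed just to derive a schema).

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical two-field record, used only for illustration.
case class Person(name: String, age: Int)

object SchemaSketch {
  // Schema derived from the case class: field names and types come from the constructor.
  val derived: StructType = Encoders.product[Person].schema

  // The hand-written counterpart: a String field is nullable, a primitive Int is not.
  val manual: StructType = StructType(
    StructField("name", StringType, nullable = true) ::
    StructField("age", IntegerType, nullable = false) :: Nil)

  def main(args: Array[String]): Unit = {
    derived.printTreeString()
    println(s"same field names: ${derived.fieldNames.sameElements(manual.fieldNames)}")
  }
}
```

Deriving the schema from a case class keeps the column names and types in one place, which is convenient when the same class is later used for a typed Dataset.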
When reading the data, we can then call getAdultSchema to give the DataFrame a fixed schema:
val df_training = spark.read.format("csv")
.option("sep", ",")
.option("header", "true")
.schema(getAdultSchema)
.load("./data/adult_training")
.persist()
This produces the following output:
+---+----------------+------+------------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
|age| workclass|fnlwgt| education|education_num| maritial_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|
+---+----------------+------+------------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
| 50| Private|283676|Some-college| 10|Married-civ-spouse| Craft-repair| Husband|White| Male| 0| 0| 40| United-States|
| 34| Local-gov|105540| Bachelors| 13|Married-civ-spouse| Prof-specialty| Husband|White| Male| 0| 2051| 40| United-States|
| 44| Private|408717| HS-grad| 9| Divorced| Sales|Not-in-family|White| Male| 3674| 0| 50| United-States|
| 21| Private| 57916| HS-grad| 9| Never-married| Protective-serv| Own-child|White| Male| 0| 0| 40| United-States|
| 37| Private|177974|Some-college| 10|Married-civ-spouse|Machine-op-inspct| Husband|White| Male| 0| 0| 70| United-States|
| 34| ?|177304| 10th| 6| Divorced| ?|Not-in-family|White| Male| 0| 0| 40| Columbia|
| 18| Private|115839| 12th| 8| Never-married| Adm-clerical|Not-in-family|White|Female| 0| 0| 30| United-States|
| 34| ?|205256| HS-grad| 9|Married-civ-spouse| ?| Husband|White| Male| 2885| 0| 80| United-States|
| 38| Private|117802| HS-grad| 9|Married-civ-spouse|Machine-op-inspct| Husband|White| Male| 0| 0| 65| United-States|
| 19| Private|211355| HS-grad| 9| Never-married| Adm-clerical| Own-child|White| Male| 0| 0| 12| United-States|
| 46| Private|173243| HS-grad| 9|Married-civ-spouse| Craft-repair| Husband|White| Male| 0| 0| 40| United-States|
| 19| Private|343200|Some-college| 10| Never-married| Sales| Own-child|White|Female| 0| 0| 25| United-States|
| 22| Private|401690|Some-college| 10| Never-married| Sales| Own-child|White|Female| 0| 0| 20| Mexico|
| 38| Private|196123| HS-grad| 9|Married-civ-spouse| Craft-repair| Husband|White| Male| 0| 0| 50| United-States|
| 33| Private|168981| Masters| 14| Divorced| Exec-managerial| Own-child|White|Female| 14084| 0| 50| United-States|
| 83|Self-emp-not-inc|213866| HS-grad| 9| Widowed| Exec-managerial|Not-in-family|White| Male| 0| 0| 8| United-States|
| 34| Private| 55176|Some-college| 10|Married-civ-spouse| Tech-support| Husband|White| Male| 0| 0| 40| United-States|
| 38| Private|153976| HS-grad| 9|Married-civ-spouse| Craft-repair| Husband|White| Male| 0| 0| 40| United-States|
| 33| Private|119176|Some-college| 10| Widowed| Adm-clerical| Unmarried|White|Female| 0| 0| 40| United-States|
| 27| Private|169117| HS-grad| 9|Married-civ-spouse| Adm-clerical| Wife|Black|Female| 0| 1887| 40| United-States|
+---+----------------+------+------------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
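Saving works the same way in the other direction, through the DataFrame's write operation. The following is a minimal sketch, not the project's actual code: `AdultLite`, `WriteReadSketch`, and the temporary output directory are hypothetical, and it assumes Spark 2.3 or later with a local master. It writes a two-row DataFrame as CSV with a header, then reads it back with a schema derived from the case class.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, Encoders, SaveMode, SparkSession}

// Hypothetical cut-down record with two of the dataset's columns.
case class AdultLite(age: Int, workclass: String)

object WriteReadSketch {

  /** Writes a toy DataFrame as CSV into `dir`, then loads it back with a fixed schema. */
  def roundTrip(spark: SparkSession, dir: String): DataFrame = {
    import spark.implicits._

    val df = Seq(AdultLite(50, "Private"), AdultLite(34, "Local-gov")).toDF()

    // Same options as on the read side; Overwrite replaces earlier output in `dir`.
    df.write.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .mode(SaveMode.Overwrite)
      .save(dir)

    spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .schema(Encoders.product[AdultLite].schema)
      .load(dir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("WriteReadSketch").getOrCreate()
    val dir = Files.createTempDirectory("adult_lite").toString
    roundTrip(spark, dir).show()
    spark.stop()
  }
}
```

Note that save writes a directory of part files rather than a single CSV file, which is why load is pointed at the directory.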