Using StringIndexer and IndexToString in Spark SQL to Index and Reverse-Index String Data

Introduction

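This post uses the AdultBase dataset from Kaggle: Machine-Learning-Databases
Although the dataset is fairly old, its format is easy to process and its information is fairly complete, which makes it a good fit for getting started with data processing.
This post uses Spark SQL to implement the following: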

  1. Indexing text values with StringIndexer
  2. Reversing the index with IndexToString and the labels value of a fitted StringIndexer
  3. Using StructType and a schema to read and write data in a fixed format

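This post mainly covers how to use StringIndexer and IndexToString together and how to assemble them with a Pipeline. It also covers formatted reading and writing of the dataset through the read and write operations of SparkSession.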

GitHub repository: Truedick23 - AdultBase

Configuration

  • Language: Scala 2.11
  • Spark version: Spark 2.3.1

Overview of the Types

StringIndexer

First, look at how the official Spark documentation describes StringIndexer:
Spark 2.3.2 ScalaDoc - StringIndexer
It has four parameters:

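Parameter | Description
handleInvalid | How invalid data (labels not seen during fitting, or null values) is handled. Options: "skip" filters out the row; "error" throws an error (the default); "keep" assigns the invalid data an extra index of its own
inputCol | Name of the column to index
outputCol | Name of the column that stores the index
stringOrderType | How labels are ordered when indices are assigned. Options: "frequencyDesc" orders by descending frequency, so the most frequent label gets index 0 (the default); "frequencyAsc" orders by ascending frequency, so the least frequent label gets index 0; "alphabetDesc" orders by descending alphabetical order; "alphabetAsc" orders by ascending alphabetical order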
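To make the four stringOrderType values concrete, here is a minimal plain-Scala sketch (no Spark required) that imitates how labels could be ordered before indices are assigned. The object OrderTypeSketch and its helpers are hypothetical names for illustration, and the alphabetical tie-breaking for equal frequencies is an assumption of this sketch, not something taken from the Spark source.

```scala
// Plain-Scala imitation of label ordering; this is NOT Spark's implementation.
object OrderTypeSketch {
  // Returns the distinct labels in the order in which indices 0, 1, 2, ... are assigned.
  def labelsFor(values: Seq[String], orderType: String): Array[String] = {
    // Count how often each distinct label occurs.
    val counts: Seq[(String, Int)] =
      values.groupBy(identity).map { case (label, hits) => (label, hits.size) }.toSeq
    val ordered = orderType match {
      // Ties are broken alphabetically here (an assumption of this sketch).
      case "frequencyDesc" => counts.sortBy { case (label, n) => (-n, label) }
      case "frequencyAsc"  => counts.sortBy { case (label, n) => (n, label) }
      case "alphabetDesc"  => counts.sortBy(_._1).reverse
      case "alphabetAsc"   => counts.sortBy(_._1)
      case other           => throw new IllegalArgumentException(s"Unknown order type: $other")
    }
    ordered.map(_._1).toArray
  }

  // Maps each label to its index, as a Double like the output column of StringIndexer.
  def indexOf(labels: Array[String]): Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

  def main(args: Array[String]): Unit = {
    // Made-up sample: Private x3, State-gov x2, Never-worked x1.
    val sample = Seq("Private", "Private", "Private", "State-gov", "State-gov", "Never-worked")
    println(labelsFor(sample, "frequencyDesc").mkString(", ")) // Private, State-gov, Never-worked
    println(labelsFor(sample, "frequencyAsc").mkString(", "))  // Never-worked, State-gov, Private
  }
}
```

With "frequencyDesc" the most common value ends up at index 0, while "frequencyAsc" reverses that, which matches the result table further down where "Private" receives the largest index.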

Code example:

  import org.apache.spark.ml.feature.StringIndexer
  def getWorkclassIndexer: StringIndexer = new StringIndexer()
    .setInputCol("workclass").setOutputCol("workclassIndex").setHandleInvalid("keep")
    .setStringOrderType("frequencyAsc")

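This code defines the workclass indexer through a function. Its inputCol and outputCol are "workclass" and "workclassIndex", invalid values are handled by assigning them a new index, and because stringOrderType is "frequencyAsc", rarer values receive smaller indices.
The result is as follows: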

+----------------+--------------+
|       workclass|workclassIndex|
+----------------+--------------+
|    Never-worked|           0.0|
|     Without-pay|           1.0|
|     Federal-gov|           2.0|
|    Self-emp-inc|           3.0|
|       State-gov|           4.0|
|               ?|           5.0|
|       Local-gov|           6.0|
|Self-emp-not-inc|           7.0|
|         Private|           8.0|
+----------------+--------------+

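A very useful trick: after fitting the indexer on a dataset, the resulting StringIndexerModel exposes labels, the array of strings in index order, so labels(i) is the string that received index i. We will use this trick later.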
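The labels trick can be sketched as follows. This is a minimal illustration, not the repository's code: the object LabelsSketch, the local SparkSession, and the six made-up rows are assumptions, while the indexer settings mirror getWorkclassIndexer above.

```scala
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object LabelsSketch {
  // Fit an indexer configured like getWorkclassIndexer on the given DataFrame.
  def fitWorkclass(df: DataFrame): StringIndexerModel =
    new StringIndexer()
      .setInputCol("workclass").setOutputCol("workclassIndex")
      .setHandleInvalid("keep")
      .setStringOrderType("frequencyAsc")
      .fit(df) // fit returns a StringIndexerModel

  // Made-up rows standing in for the workclass column of the real data.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq("Private", "Private", "Private", "State-gov", "State-gov", "Never-worked")
      .toDF("workclass")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("LabelsSketch").getOrCreate()
    val df = sample(spark)
    val model = fitWorkclass(df)
    // labels(i) is the string that received index i.
    model.labels.zipWithIndex.foreach { case (label, i) => println(s"$i -> $label") }
    model.transform(df).show()
    spark.stop()
  }
}
```

Keeping the fitted model around, rather than refitting, means the same labels array can later be handed to IndexToString.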

IndexToString

First, look at how the official Spark documentation describes IndexToString:
Spark 2.3.2 ScalaDoc - IndexToString
It has three parameters:

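Parameter | Description
inputCol | Name of the column to reverse-index
outputCol | Name of the column that stores the restored strings
labels | A StringArrayParam holding the strings that indices map back to, in index order; if it is not set, the metadata of inputCol is used instead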

Example code:

    import org.apache.spark.ml.feature.IndexToString
    new IndexToString()
      .setInputCol("workclassIndex").setOutputCol("workclass")
      .setLabels(getWorkclassIndexer.fit(getTrainingData).labels)

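This code creates an IndexToString instance whose inputCol is "workclassIndex" and whose outputCol is "workclass". Its labels come from the workclass indexer fitted on the training data, that is, the original strings in index order.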
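A round trip from strings to indices and back can be sketched like this. The object RoundTripSketch, the output column name "workclassRestored", and the sample rows are hypothetical; the restored column gets a different name here so it does not clash with the original "workclass" column that is still present.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object RoundTripSketch {
  // Index the workclass column, then map the indices back to strings.
  def roundTrip(df: DataFrame): DataFrame = {
    val model = new StringIndexer()
      .setInputCol("workclass").setOutputCol("workclassIndex")
      .fit(df)
    val indexed = model.transform(df)
    new IndexToString()
      .setInputCol("workclassIndex").setOutputCol("workclassRestored")
      .setLabels(model.labels) // same array the indexer used, so index i maps back to labels(i)
      .transform(indexed)
  }

  def main(args: Array[String]): Unit = {
    // Local session and made-up rows, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("RoundTripSketch").getOrCreate()
    import spark.implicits._
    val df = Seq("Private", "Local-gov", "Private", "State-gov").toDF("workclass")
    roundTrip(df).show()
    spark.stop()
  }
}
```

IndexToString is a Transformer, so it can be applied directly with transform and needs no fitting of its own.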

Pipeline

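Pipeline is a very handy class in Spark ML: it chains several stages together and can cut down the amount of code considerably. First, its documentation:
Spark 2.3.2 ScalaDoc - Pipeline
It has a single parameter, stages, an Array of the stages (transformers and estimators) to run in order. In the example code we combine a series of IndexToString stages into one Pipeline and then define a function that returns that Pipeline: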

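  import org.apache.spark.ml.Pipeline

  // Each IndexToString stage takes its labels from the matching indexer fitted on the training data
  private val converter_pipeline = new Pipeline().setStages(Array(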
    new IndexToString()
      .setInputCol("workclassIndex").setOutputCol("workclass")
      .setLabels(getWorkclassIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("educationIndex").setOutputCol("education")
      .setLabels(getEduIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("maritial_statusIndex").setOutputCol("maritial_status")
      .setLabels(getMaritalIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("occupationIndex").setOutputCol("occupation")
      .setLabels(getOccupationIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("relationshipIndex").setOutputCol("relationship")
      .setLabels(getRelationshipIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("raceIndex").setOutputCol("race")
      .setLabels(getRaceIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("sexIndex").setOutputCol("sex")
      .setLabels(getSexIndexer.fit(getTrainingData).labels)
  ))
  
  def getConverterPipline: Pipeline = converter_pipeline

Calling fit on a Pipeline returns a PipelineModel, whose transform applies every stage to a dataset. For example:

      val cluster_info_split_table = getConverterPipline
        .fit(split_table).transform(split_table)

The result looks like this:

+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
| age|workclassIndex|educationIndex|maritial_statusIndex|occupationIndex|relationshipIndex|raceIndex|sexIndex|native_countryIndex|       workclass|education|maritial_status|   occupation| relationship| race| sex|
+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
|37.0|           7.0|          13.0|                 5.0|           10.0|              4.0|      4.0|     1.0|               41.0|Self-emp-not-inc|Bachelors|  Never-married|        Sales|Not-in-family|White|Male|
|57.0|           7.0|          12.0|                 5.0|           10.0|              4.0|      4.0|     1.0|               41.0|Self-emp-not-inc|  Masters|  Never-married|        Sales|Not-in-family|White|Male|
|46.0|           7.0|          13.0|                 5.0|           10.0|              4.0|      4.0|     1.0|               41.0|Self-emp-not-inc|Bachelors|  Never-married|        Sales|Not-in-family|White|Male|
|21.0|           7.0|          13.0|                 5.0|            9.0|              3.0|      4.0|     1.0|               41.0|Self-emp-not-inc|Bachelors|  Never-married|Other-service|    Own-child|White|Male|
|49.0|           7.0|          11.0|                 5.0|           10.0|              4.0|      3.0|     1.0|               20.0|Self-emp-not-inc|Assoc-voc|  Never-married|        Sales|Not-in-family|Black|Male|
|70.0|           6.0|          11.0|                 5.0|            9.0|              4.0|      4.0|     1.0|               40.0|       Local-gov|Assoc-voc|  Never-married|Other-service|Not-in-family|White|Male|
|29.0|           7.0|          13.0|                 5.0|           10.0|              4.0|      4.0|     1.0|               41.0|Self-emp-not-inc|Bachelors|  Never-married|        Sales|Not-in-family|White|Male|
|29.0|           7.0|          12.0|                 5.0|           10.0|              3.0|      3.0|     1.0|               19.0|Self-emp-not-inc|  Masters|  Never-married|        Sales|    Own-child|Black|Male|
+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
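The fit and transform steps of a Pipeline can also be illustrated in the indexing direction. The following is a minimal sketch under assumptions: the object PipelineSketch, the helper indexerFor, and the three sample rows are hypothetical, and only two columns are indexed to keep it short.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // One StringIndexer per column, writing to "<column>Index".
  def indexerFor(column: String): StringIndexer =
    new StringIndexer().setInputCol(column).setOutputCol(column + "Index")

  // fit runs every estimator stage on df and returns a PipelineModel of fitted stages.
  def fitIndexers(df: DataFrame): PipelineModel =
    new Pipeline()
      .setStages(Array(indexerFor("workclass"), indexerFor("sex")))
      .fit(df)

  def main(args: Array[String]): Unit = {
    // Local session and made-up rows, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("PipelineSketch").getOrCreate()
    import spark.implicits._
    val df = Seq(("Private", "Male"), ("Private", "Female"), ("State-gov", "Male"))
      .toDF("workclass", "sex")
    val model = fitIndexers(df)
    // transform applies both fitted indexers in order and adds both index columns.
    model.transform(df).show()
    spark.stop()
  }
}
```

Fitting once and reusing the PipelineModel avoids repeating the fit for every dataset that needs the same index mapping.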

StructField and StructType

StructField describes a single field of a StructType. Its documentation:
Spark 2.3.2 ScalaDoc - StructField
It has four attributes:

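Attribute | Description
name | Name of the field
dataType | Data type of the field
nullable | Whether values of the field may be null
metadata | Metadata of the field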

StructType can be thought of as a collection of StructFields. The documentation gives the following example:

import org.apache.spark.sql.types._

val struct =
  StructType(
    StructField("a", IntegerType, true) ::
    StructField("b", LongType, false) ::
    StructField("c", BooleanType, false) :: Nil)
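Once a StructType exists, its fields can be inspected by name. The sketch below rebuilds the same three-field schema with the add method and reads back a few attributes; the object name StructTypeSketch is hypothetical, and no SparkSession is needed for this.

```scala
import org.apache.spark.sql.types._

object StructTypeSketch {
  // Same three fields as the documentation example, built with add instead of ::.
  val struct: StructType = new StructType()
    .add("a", IntegerType, nullable = true)
    .add("b", LongType, nullable = false)
    .add("c", BooleanType, nullable = false)

  def main(args: Array[String]): Unit = {
    println(struct.fieldNames.mkString(", ")) // names in declaration order
    val b: StructField = struct("b")          // look a field up by name
    println(s"${b.name}: ${b.dataType}, nullable = ${b.nullable}")
    struct.printTreeString()                  // tree view of the whole schema
  }
}
```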

Encoders

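Encoders provides factory methods for creating Encoder instances; an Encoder converts JVM objects to and from the Spark SQL representation.

We define an Adult case class and a getAdultSchema function that derives a StructType from it: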

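  import org.apache.spark.sql.Encoders
  import org.apache.spark.sql.types.StructType

  // One field per column of the data set, in column order
  case class Adult (age: Int, workclass: String, fnlwgt: Int,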
                     education: String, education_num: Int,
                     maritial_status: String, occupation: String,
                     relationship: String, race: String,
                     sex: String, capital_gain: Int, capital_loss: Int,
                     hours_per_week: Int, native_country: String)
                     
  def getAdultSchema: StructType = Encoders.product[Adult].schema
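To see what Encoders.product derives, it can help to compare the result with a schema written by hand. This is a small sketch on a hypothetical two-field case class Person (not part of the original project); the case class is declared at the top level so that Spark can obtain its type information.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types._

// Hypothetical case class used only for this illustration.
case class Person(name: String, age: Int)

object EncoderSchemaSketch {
  // Schema derived from the case class, the same way getAdultSchema does it.
  val derived: StructType = Encoders.product[Person].schema

  // Hand-written counterpart: String fields are nullable, primitive Int fields are not.
  val manual: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = false)
  ))

  def main(args: Array[String]): Unit = {
    derived.printTreeString()
    manual.printTreeString()
  }
}
```

Deriving the schema from a case class keeps the field list in one place, at the cost of tying column names to Scala identifiers.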

When reading the data, we can then call this function to give the DataFrame its schema:

    val df_training = spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .schema(getAdultSchema)
      .load("./data/adult_training")
      .persist()

This produces the following output:

+---+----------------+------+------------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
|age|       workclass|fnlwgt|   education|education_num|   maritial_status|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|
+---+----------------+------+------------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
| 50|         Private|283676|Some-college|           10|Married-civ-spouse|     Craft-repair|      Husband|White|  Male|           0|           0|            40| United-States|
| 34|       Local-gov|105540|   Bachelors|           13|Married-civ-spouse|   Prof-specialty|      Husband|White|  Male|           0|        2051|            40| United-States|
| 44|         Private|408717|     HS-grad|            9|          Divorced|            Sales|Not-in-family|White|  Male|        3674|           0|            50| United-States|
| 21|         Private| 57916|     HS-grad|            9|     Never-married|  Protective-serv|    Own-child|White|  Male|           0|           0|            40| United-States|
| 37|         Private|177974|Some-college|           10|Married-civ-spouse|Machine-op-inspct|      Husband|White|  Male|           0|           0|            70| United-States|
| 34|               ?|177304|        10th|            6|          Divorced|                ?|Not-in-family|White|  Male|           0|           0|            40|      Columbia|
| 18|         Private|115839|        12th|            8|     Never-married|     Adm-clerical|Not-in-family|White|Female|           0|           0|            30| United-States|
| 34|               ?|205256|     HS-grad|            9|Married-civ-spouse|                ?|      Husband|White|  Male|        2885|           0|            80| United-States|
| 38|         Private|117802|     HS-grad|            9|Married-civ-spouse|Machine-op-inspct|      Husband|White|  Male|           0|           0|            65| United-States|
| 19|         Private|211355|     HS-grad|            9|     Never-married|     Adm-clerical|    Own-child|White|  Male|           0|           0|            12| United-States|
| 46|         Private|173243|     HS-grad|            9|Married-civ-spouse|     Craft-repair|      Husband|White|  Male|           0|           0|            40| United-States|
| 19|         Private|343200|Some-college|           10|     Never-married|            Sales|    Own-child|White|Female|           0|           0|            25| United-States|
| 22|         Private|401690|Some-college|           10|     Never-married|            Sales|    Own-child|White|Female|           0|           0|            20|        Mexico|
| 38|         Private|196123|     HS-grad|            9|Married-civ-spouse|     Craft-repair|      Husband|White|  Male|           0|           0|            50| United-States|
| 33|         Private|168981|     Masters|           14|          Divorced|  Exec-managerial|    Own-child|White|Female|       14084|           0|            50| United-States|
| 83|Self-emp-not-inc|213866|     HS-grad|            9|           Widowed|  Exec-managerial|Not-in-family|White|  Male|           0|           0|             8| United-States|
| 34|         Private| 55176|Some-college|           10|Married-civ-spouse|     Tech-support|      Husband|White|  Male|           0|           0|            40| United-States|
| 38|         Private|153976|     HS-grad|            9|Married-civ-spouse|     Craft-repair|      Husband|White|  Male|           0|           0|            40| United-States|
| 33|         Private|119176|Some-college|           10|           Widowed|     Adm-clerical|    Unmarried|White|Female|           0|           0|            40| United-States|
| 27|         Private|169117|     HS-grad|            9|Married-civ-spouse|     Adm-clerical|         Wife|Black|Female|           0|        1887|            40| United-States|
+---+----------------+------+------------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+
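The introduction also mentions the write side of SparkSession. A minimal sketch of writing a DataFrame as CSV and reading it back with an explicit schema could look like this; the object CsvRoundTripSketch, its helper names, and the output path are hypothetical, and the options simply mirror the read call above.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

object CsvRoundTripSketch {
  // Write df as comma-separated files with a header row, replacing any earlier output.
  def writeCsv(df: DataFrame, path: String): Unit =
    df.write.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .mode(SaveMode.Overwrite)
      .save(path)

  // Read the files back, applying the given schema instead of inferring one.
  def readCsv(spark: SparkSession, schema: StructType, path: String): DataFrame =
    spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .schema(schema)
      .load(path)

  def main(args: Array[String]): Unit = {
    // Local session, made-up rows, and a hypothetical output path.
    val spark = SparkSession.builder().master("local[*]").appName("CsvRoundTripSketch").getOrCreate()
    import spark.implicits._
    val df = Seq((50, "Private"), (34, "Local-gov")).toDF("age", "workclass")
    val path = "./data/adult_sketch_output"
    writeCsv(df, path)
    readCsv(spark, df.schema, path).show()
    spark.stop()
  }
}
```

Note that save produces a directory of part files rather than a single CSV file, and load accepts that directory as its path.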

References

  • Spark 2.3.2 ScalaDoc - StringIndexer
  • Spark 2.3.2 ScalaDoc - IndexToString
  • Spark 2.3.2 ScalaDoc - Pipeline
  • Spark 2.3.2 ScalaDoc - StructField
