When the input data is text (sentences), we usually want to split it into individual words before further processing, and this is where the Tokenizer class comes in. Tokenization is the process of breaking text (such as a sentence) into individual units (such as words). A simple Tokenizer class provides exactly this functionality. The example below shows how to split sentences into sequences of words.
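As a minimal sketch of just this step (assuming a SparkSession named spark already exists; the full runnable example follows below), the basic Tokenizer takes a string column and produces an array-of-strings column:

import org.apache.spark.ml.feature.Tokenizer

// "spark" is assumed to be an existing SparkSession
val df = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark")
)).toDF("id", "sentence")

// Tokenizer splits the input string on whitespace and lowercases the tokens
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer.transform(df).select("words").show(false)   // [hi, i, heard, about, spark]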
RegexTokenizer provides more advanced tokenization based on regular-expression matching. By default, the pattern parameter (default: \s+) is used as the delimiter to split the input text. Alternatively, users can set the gaps parameter to false, indicating that the regex pattern denotes the tokens themselves rather than the gaps between them; in that case the tokenizer finds all matches of the pattern and returns them as the result. The example below shows how it is invoked.
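As a rough sketch of this token-matching mode (reusing the sentenceDataFrame defined in the full example below), setting gaps to false makes the pattern describe the tokens themselves, so \w+ extracts each run of word characters:

import org.apache.spark.ml.feature.RegexTokenizer

// gaps = false: the pattern matches the tokens, not the separators between them
val matchingTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\w+")
  .setGaps(false)

matchingTokenizer.transform(sentenceDataFrame).select("words").show(false)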
import org.apache.spark.SparkConf
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TokenizerExample {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setMaster("local[*]")
      .setAppName(this.getClass.getSimpleName)

    val spark = SparkSession
      .builder
      .config(sparkConf)
      .appName("TokenizerExample")
      .getOrCreate()

    val sentenceDataFrame = spark.createDataFrame(Seq(
      (0, "Hi I heard about Spark"),
      (1, "I wish Java could use case classes"),
      (2, "Logistic,regression,models,are,neat")
    )).toDF("id", "sentence")

    // Plain Tokenizer: splits on whitespace and lowercases the tokens
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

    // RegexTokenizer: splits on non-word characters (gaps = true by default)
    val regexTokenizer = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)

    // UDF that counts the number of tokens in each row
    val countTokens = udf { (words: Seq[String]) => words.length }

    val tokenized = tokenizer.transform(sentenceDataFrame)
    tokenized.select("sentence", "words")
      .withColumn("tokens", countTokens(col("words"))).show(false)

    val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
    regexTokenized.select("sentence", "words")
      .withColumn("tokens", countTokens(col("words"))).show(false)

    spark.stop()
  }
}
Output (first the plain Tokenizer, then the RegexTokenizer):
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+-----------------------------------+------------------------------------------+------+
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+
Note the third sentence: the plain Tokenizer splits only on whitespace, so the comma-separated sentence stays a single token (tokens = 1), while the RegexTokenizer with pattern \W also breaks on the commas and yields five tokens.