A pitfall with monotonically_increasing_id in Spark

In my daily work the user column of the interaction matrix comes in as a string, so it has to be mapped to a unique long/int id. I thought I had found the perfect function for this, monotonically_increasing_id: generate the ids once, then join them back onto the original DataFrame, roughly like this:

import org.apache.spark.sql.functions.monotonically_increasing_id

val userdf = df.select("user").dropDuplicates().withColumn("userid", monotonically_increasing_id())
val newdf = df.join(userdf, "user")

However, when this is fed into Spark's built-in ALS for training, the userid has to be cast to Int, and that is where the trap is buried.

import org.apache.spark.mllib.recommendation.Rating

val ratingRdd = newdf.rdd.map(r =>
  Rating(
    r.getAs[Long]("userid").toInt,          // Long -> Int cast: this is where things go wrong
    r.getAs[String]("productid").toInt,
    r.getAs[Long]("rating").toDouble)
  ).cache()

The problem: monotonically_increasing_id returns a Long, and each partition counts up without duplicates from a different starting Long value. From the official doc:

  A column expression that generates monotonically increasing 64-bit integers.

  The generated ID is guaranteed to be monotonically increasing and unique, but not
  consecutive. The current implementation puts the partition ID in the upper 31 bits,
  and the record number within each partition in the lower 33 bits. The assumption is
  that the data frame has less than 1 billion partitions, and each partition has less
  than 8 billion records.

  As an example, consider a DataFrame with two partitions, each with 3 records.
  This expression would return the following IDs:
  0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
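
A minimal sketch (not from the original post) of what that bit layout means in practice; the value here just follows the example ids in the doc quote above:

// Decode a generated id according to the documented layout:
// upper 31 bits = partition id, lower 33 bits = record number within that partition.
val id = 8589934593L                        // second record of partition 1, i.e. (1L << 33) + 1
val partitionId  = id >>> 33                // 1
val recordNumber = id & ((1L << 33) - 1)    // 1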

So once the data is big enough to span more than one partition, every id coming from a partition other than the first is at least 1L << 33, far beyond Int.MaxValue (2^31 - 1). The Long-to-Int cast keeps only the lower 32 bits, so distinct users collapse onto the same Int, and the number of users in my ratingRdd ended up much smaller than in the original df.
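
A quick illustration of the collision (a sketch, not the original code):

val idPart0 = 1L                 // record 1 in partition 0
val idPart1 = (1L << 33) + 1L    // record 1 in partition 1 -> 8589934593
idPart0.toInt                    // 1
idPart1.toInt                    // also 1: two distinct users become the same Int id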

Fixes:
1. Stop using monotonically_increasing_id and build the ids the honest way with zipWithIndex followed by toDF (see the sketch after this list).
2. Or first repartition the de-duplicated users into a single partition, so the generated ids become consecutive from 0. But this also breaks down when the number of users is very large, because you still cannot avoid the Long needing more bits than an Int can hold.
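
A minimal sketch of option 1, assuming the same df with a string column "user" and a SparkSession in scope as spark ("uid" is just an illustrative column name):

import spark.implicits._

val userIdDF = df.select("user").distinct().rdd
  .map(_.getString(0))
  .zipWithIndex()                                  // (user, 0-based Long index), consecutive across partitions
  .map { case (user, idx) => (user, idx.toInt) }   // safe as long as the user count stays below Int.MaxValue
  .toDF("user", "uid")

val newDF = df.join(userIdDF, "user")

Unlike monotonically_increasing_id, zipWithIndex assigns consecutive indices across all partitions, so the values stay small and the Int cast only fails if there really are more than about 2.1 billion users.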
