Spark: parsing CSV cells that contain multi-line values

Sample CSV data

[hadoop@ip-10-0-52-52 ~]$ cat test.csv 
id,name,address
1,zhang san,china shanghai
2,li si,"china
beijing"
3,tom,china shanghai
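The quoted address for li si spans two physical lines, but per the CSV convention (RFC 4180) it is a single logical record. A quick sanity check with Python's standard `csv` module, with the sample above inlined as a string (`raw` is just this post's data, not a file read):

```python
import csv
import io

# The same sample as above; the quoted field contains a real newline.
raw = (
    'id,name,address\n'
    '1,zhang san,china shanghai\n'
    '2,li si,"china\nbeijing"\n'
    '3,tom,china shanghai\n'
)

# A compliant CSV parser yields 3 records, keeping the newline inside the field.
rows = list(csv.DictReader(io.StringIO(raw)))
```

A compliant parser reports three records, and `rows[1]['address']` keeps the embedded newline.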

Reading CSV with Spark versions below 2.2

These versions mis-parse the quoted multi-line field:

scala> val df1 = spark.read.option("header", true).csv("file:///home/hadoop/test.csv")
df1: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]

scala> df1.count
res4: Long = 4

scala> df1.show
+--------+---------+--------------+
|      id|     name|       address|
+--------+---------+--------------+
|       1|zhang san|china shanghai|
|       2|    li si|         china|
|beijing"|     null|          null|
|       3|      tom|china shanghai|
+--------+---------+--------------+
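The older reader splits the input into physical lines before handing them to the CSV parser, which tears the quoted field apart. A minimal sketch of that line-based behavior, again using Python's stdlib `csv` module on the same inlined sample:

```python
import csv

raw = (
    'id,name,address\n'
    '1,zhang san,china shanghai\n'
    '2,li si,"china\nbeijing"\n'
    '3,tom,china shanghai\n'
)

# Split into physical lines first, as a line-based reader does,
# then parse each line independently (skipping the header).
lines = raw.splitlines()
naive_rows = [next(csv.reader([ln])) for ln in lines[1:]]
# The quoted field is torn apart: 4 "rows" instead of 3, and the
# fragment `beijing"` becomes a bogus one-column record.
```

This reproduces the df1 output above: four rows, with `china` and `beijing"` landing in separate records.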

One workaround is to read each file as a single binary blob, so the CSV parser sees complete records; this works but is not a good solution (each file is parsed on a single task, losing parallelism). A PySpark implementation:

import csv
import io

def spark_read_csv_bf(spark, path, schema=None, encoding='utf8'):
    """Read CSV via binaryFiles so quoted multi-line fields stay intact.

    :param spark: SparkSession (Spark 2.0+)
    :param path: csv path
    :param schema: optional schema for the resulting DataFrame
    :param encoding: file encoding
    :return: DataFrame
    """
    # Each file arrives whole as bytes, so csv.DictReader parses
    # complete records, including fields with embedded newlines.
    rdd = spark.sparkContext.binaryFiles(path).values() \
        .flatMap(lambda x: csv.DictReader(
            io.TextIOWrapper(io.BytesIO(x), encoding=encoding)))
    if schema:
        return spark.createDataFrame(rdd, schema)
    else:
        return rdd.toDF()

Reading CSV with Spark 2.2 and later

Spark 2.2 fixed this bug (the linked pull requests show the implementation); passing the multiLine option at read time solves the problem. References:

[SPARK-19610][SQL] Support parsing multiline CSV files

[SPARK-20980] [SQL] Rename wholeFile to multiLine for both CSV and JSON

scala> val df2 = spark.read.option("header", true).option("multiLine", true).csv("file:///home/hadoop/test.csv")
df2: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]

scala> df2.count
res6: Long = 3

scala> df2.show
+---+---------+--------------+
| id|     name|       address|
+---+---------+--------------+
|  1|zhang san|china shanghai|
|  2|    li si| china
beijing|
|  3|      tom|china shanghai|
+---+---------+--------------+
