Newline characters when saving a DataFrame as a Hive table

When a PySpark DataFrame is saved directly as a Hive table, strings that contain newline characters break the row layout. Take Spark 3.0.0 as an example: we save a single row whose string contains a newline, yet counting the rows returns 2:

>>> df = spark.createDataFrame([(1,'hello\nworld')], ('id','msg'))
>>> df.write.format('hive').saveAsTable('test.newline0')
>>> spark.sql('SELECT COUNT(1) FROM test.newline0').show()
+--------+
|count(1)|
+--------+
|       2|
+--------+
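
The split is visible if you read the rows back. This is a quick check I am adding for illustration; what the second, malformed row looks like depends on how the leftover 'world' fragment is parsed against the table schema:

>>> spark.sql('SELECT * FROM test.newline0').show()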

I searched for a long time before finding the relevant documentation, in the "Specifying storage format for Hive tables" section of the Spark SQL guide. When you save with the plain hive format, the underlying file format is 'textfile' and the default line delimiter is '\n', so a newline inside a string value inevitably splits the record. This is easy to confirm: Spark refuses any attempt to set a different line delimiter:

>>> df.write.format('hive').option('fileFormat', 'textfile').option('lineDelim', '\x13').saveAsTable('test.newline1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/share/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 868, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/share/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/share/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 137, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.IllegalArgumentException: Hive data source only support newline '\n' as line delimiter, but given: �
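
If you are stuck with the textfile format (for example, downstream tools expect it), one lossy workaround is to replace the newlines before writing. The sketch below is my addition, not part of the original post; it assumes replacing '\n' with a space is acceptable, and the table name test.newline2 is hypothetical:

>>> from pyspark.sql import functions as F
>>> # Replace embedded newlines so each record stays on one physical line.
>>> cleaned = df.withColumn('msg', F.regexp_replace('msg', '\n', ' '))
>>> cleaned.write.format('hive').option('fileFormat', 'textfile').saveAsTable('test.newline2')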

The cleaner fix, though, is simple: save with a different file format:

>>> df.write.format('hive').option('fileFormat', 'parquet').saveAsTable('test.newline1')
>>> spark.sql('SELECT COUNT(1) FROM test.newline1').show()
+--------+
|count(1)|
+--------+
|       1|
+--------+
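
As a final sanity check (my addition), reading the value back from the parquet-backed table should return the original string with the newline intact:

>>> spark.table('test.newline1').first().msg
'hello\nworld'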
