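A while ago, a colleague on the data mining team reported that a piece of pyspark code was running extremely slowly. The code itself is trivial: it queries a Hive view and caps the result with limit 10.
Without further ado, here is the code: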
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
hc = HiveContext(sc)
hc.setConf("hive.exec.orc.split.strategy", "ETL")
hc.setConf("hive.security.authorization.enabled", "false")
zj_sql = 'select * from silver_ep.zj_v limit 10'
zj_df = hc.sql(zj_sql)
zj_df.collect()
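The SQL statement only fetches 10 rows from a view, so it should return almost immediately. Instead, the job ran for 30 minutes without finishing. The table behind the view is stored as Parquet files.
Possible cause 1: are we using an outdated Python API? We run Spark 2.1.0, and the pyspark API specification for 2.1.0 says: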
class pyspark.sql.HiveContext(sparkContext, jhiveContext=None)
A variant of Spark SQL that integrates with data stored in Hive.
Configuration for Hive is read from hive-site.xml on the classpath. It supports running both SQL and HiveQL commands.
Parameters:
sparkContext – The SparkContext to wrap.
jhiveContext – An optional JVM Scala HiveContext. If set, we do not instantiate a new HiveContext in the JVM, instead we make all calls to this object.
Note Deprecated in 2.0.0. Use SparkSession.builder.enableHiveSupport().getOrCreate().
and
class pyspark.sql.SQLContext(sparkContext, sparkSession=None, jsqlContext=None)
The entry point for working with structured data (rows and columns) in Spark, in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
A SQLContext can be used create DataFrame, register DataFrame as tables, execute SQL over tables, cache tables, and read parquet files.
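Spark 2.0+ no longer recommends SQLContext or HiveContext. My first guess was that this was unlikely to be the culprit: even though we call the old API, the Spark runtime underneath is the latest version, and the old API surely does not run on an old Spark server. Still, I tried the SparkSession entry point that newer Spark recommends. As expected, performance did not change, so this cause is ruled out.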
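For reference, a SparkSession version of the snippet above might look roughly like the following. This is a minimal sketch, not the exact code that was run: the helper names build_limit_query, create_hive_session and fetch_preview are made up for illustration, and the two settings are simply carried over from the original script.

```python
def build_limit_query(view_name: str, limit: int) -> str:
    """Build the same 'select * ... limit N' statement used in the post."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return "select * from {} limit {}".format(view_name, limit)


def create_hive_session(app_name: str = "limit-check"):
    """Create a Hive-enabled SparkSession, the 2.x replacement for HiveContext."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )
    # Same settings the original script applied through HiveContext.setConf.
    spark.conf.set("hive.exec.orc.split.strategy", "ETL")
    spark.conf.set("hive.security.authorization.enabled", "false")
    return spark


def fetch_preview(spark, view_name: str, limit: int = 10):
    """Run the limited query and collect the rows on the driver."""
    return spark.sql(build_limit_query(view_name, limit)).collect()
```

In a pyspark shell this would be called as fetch_preview(create_hive_session(), "silver_ep.zj_v"). Only the entry point changes; the query and the collect() call are the same as before, which fits the observation that the run time did not improve.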
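Possible cause 2: the query targets a view. Unlike an ordinary table, querying a view involves extra work: the view's defining query has to be evaluated first, and the outer query then runs on its result.
So I suspected that the view definition contained joins or similar operations that kept the subquery from finishing. If that guess were right, the same SQL would also be very slow when run directly in Hive. I ran it through beeline and it came back very quickly. Moreover, here is the view's definition: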
CREATE VIEW `zj_v` AS SELECT `zj`.`hdate`,
MD5(`zj`.`firmid`) AS `FIRM_ID`,
`zj`.`allenablemoney`,
`zj`.`alloutmoney`,
`zj`.`zcmoney`,
`zj`.`netzcmoney`,
`zj`.`rzmoney`,
`zj`.`rhmoney`,
`zj`.`minmoney`
FROM `SILVER_SILVER_NJSSEL`.`ZJ`
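There is no join or anything of the sort, just a simple projection. This cause is ruled out as well.
Possible cause 3: a problem in Spark's own SQL engine
beeline parses the files and runs the query on Hadoop's MapReduce engine, whereas Spark uses its own SQL engine. Could it be that Spark's execution engine is missing some optimization? I ran the same query in spark-sql and it was fast, returning in about 2 s, so the engine itself is not the problem.
Possible cause 4: is the limit keyword not taking effect? That is, does Spark ship all the data to the driver and only then apply the limit, so that the entire table is scanned before the collect() action completes? Let's look at Spark's execution plan with explain: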
>>> zj_df.explain(True)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
+- 'Project [*]
+- 'UnresolvedRelation `silver_ep`.`zj_v`
== Analyzed Logical Plan ==
hdate: string, FIRM_ID: string, allenablemoney: string, alloutmoney: string, zcmoney: string, netzcmoney: string, rz
GlobalLimit 10
+- LocalLimit 10
+- Project [hdate#38, FIRM_ID#37, allenablemoney#40, alloutmoney#41, zcmoney#42, netzcmoney#43, rzmoney#44, rhmon
+- SubqueryAlias zj_v
+- Project [hdate#38, md5(cast(firmid#39 as binary)) AS FIRM_ID#37, allenablemoney#40, alloutmoney#41, zcmo
+- SubqueryAlias zj
+- Relation[hdate#38,firmid#39,allenablemoney#40,alloutmoney#41,zcmoney#42,netzcmoney#43,rzmoney#44,r
== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
+- Project [hdate#38, md5(cast(firmid#39 as binary)) AS FIRM_ID#37, allenablemoney#40, alloutmoney#41, zcmoney#42
+- Relation[hdate#38,firmid#39,allenablemoney#40,alloutmoney#41,zcmoney#42,netzcmoney#43,rzmoney#44,rhmoney#45
== Physical Plan ==
CollectLimit 10
+- *Project [hdate#38, md5(cast(firmid#39 as binary)) AS FIRM_ID#37, allenablemoney#40, alloutmoney#41, zcmoney#42,
+- *BatchedScan parquet silver_silver_njssel.zj[hdate#38,firmid#39,allenablemoney#40,alloutmoney#41,zcmoney#42,ner/hive/warehouse/silver_silver_njssel.db/zj, PushedFilters: [], ReadSchema: struct>>
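The explain output shows that once the driver receives our SQL, it derives from "limit 10" the operator
GlobalLimit 10
and then, from that global limit, the per-machine operator
LocalLimit 10
(a "machine" here is one or more executor processes; in an interactive pyspark session they are really many executor threads inside a single pyspark process). So by the time an executor produces its result it has already applied limit 10, and what it hands back is not the full data set.
So where is the time actually going?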
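The relationship between the two limit operators can be illustrated with a small pure-Python model. This is a toy sketch of the idea only, not Spark's implementation; local_limit and global_limit are hypothetical helper names and the partitions are made-up ranges.

```python
from itertools import islice
from typing import Iterable, List, Sequence


def local_limit(partition: Iterable[int], n: int) -> List[int]:
    """LocalLimit: keep at most n rows of one partition, reading lazily."""
    return list(islice(partition, n))


def global_limit(partitions: Sequence[Iterable[int]], n: int) -> List[int]:
    """GlobalLimit: merge the per-partition results and keep n rows overall."""
    merged: List[int] = []
    for part in partitions:
        # Every partition is still visited, even though each returns few rows.
        merged.extend(local_limit(part, n))
    return merged[:n]


if __name__ == "__main__":
    # Three hypothetical partitions of 1000 rows each.
    parts = [range(i * 1000, (i + 1) * 1000) for i in range(3)]
    print(global_limit(parts, 10))
```

The point of the model is that a local limit bounds how many rows each partition returns, not how many partitions are touched, so the plan alone does not tell us how much data each task has to read before it can produce its first rows.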
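Possible cause 5: collect() itself inherently takes this long
To observe more precisely how Spark executes our Hive query, we used
sc.setLogLevel("INFO")
to change pyspark's log level (editing log4j had no visible effect), lowering it from WARN to INFO, and then re-ran the query from before: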
2017-02-21 20:46:52,757 INFO [Executor task launch worker-8] datasources.FileScanRDD: Reading File path: hdfs://datah row]
2017-02-21 20:46:52,757 INFO [Executor task launch worker-28] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO [Executor task launch worker-11] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,757 INFO [Executor task launch worker-6] datasources.FileScanRDD: Reading File path: hdfs://datah row]
2017-02-21 20:46:52,757 INFO [Executor task launch worker-21] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO [Executor task launch worker-18] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO [Executor task launch worker-16] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO [Executor task launch worker-0] datasources.FileScanRDD: Reading File path: hdfs://datah row]
2017-02-21 20:46:52,756 INFO [Executor task launch worker-15] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO [Executor task launch worker-31] datasources.FileScanRDD: Reading File path: hdfs://datay row]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further d
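The log shows that Spark's execution strategy starts a worker thread for each data file it reads. Because our pyspark shell runs in local mode, all executors are threads inside one process, so the whole job actually runs on a single machine and shares one JVM heap.
In addition, the run frequently (though not every time) failed with java.lang.OutOfMemoryError: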
2017-02-22 12:07:37,724 ERROR [dag-scheduler-event-loop] scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ShuffleMapTask,ExceptionFailure(java.lang.OutOfMemoryError,Java heap space,[Ljava.lang.StackTraceElement;@394278bc,java.lang.OutOfMemoryError: Java heap space
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
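The cause is clear: an OOM occurred while a Parquet file was being read into memory. But that raises another question. I only need 10 rows, so each executor should be able to stop after reading at most 10 rows. Why load the whole Parquet file into memory? I then checked the sizes of these Parquet files: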
-rwxr-xr-x 2 appuser supergroup 126287896 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000000_0
-rwxr-xr-x 2 appuser supergroup 179992288 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000001_0
-rwxr-xr-x 2 appuser supergroup 155053353 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000002_0
-rwxr-xr-x 2 appuser supergroup 163026985 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000003_0
-rwxr-xr-x 2 appuser supergroup 155736832 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000004_0
-rwxr-xr-x 2 appuser supergroup 157311028 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000005_0
-rwxr-xr-x 2 appuser supergroup 150175977 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000006_0
-rwxr-xr-x 2 appuser supergroup 184228405 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000007_0
-rwxr-xr-x 2 appuser supergroup 162361165 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000008_0
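Each file is roughly 150 MB (between about 126 MB and 184 MB). By design, Parquet is not read row by row; it is read and decoded one row group at a time. Inspecting the files with parquet-tools showed that almost all of them contain at most 2 row groups, which means each row group is very large.
So when Spark creates many tasks at once to read these files, each task only needs to load one row group, but since all tasks belong to the same process, together they can fill up the heap.
How much memory was this process given at startup? Let's look at the process details: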
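A quick back-of-the-envelope calculation makes the memory pressure concrete. This is a rough sketch under stated assumptions: every file is treated as two equally sized row groups, only the on-disk (compressed) bytes are counted, and the 1 GiB heap is an assumed figure used purely for illustration. The function names are hypothetical.

```python
from typing import Sequence

# File sizes in bytes, copied from the HDFS listing above.
FILE_SIZES = [
    126287896, 179992288, 155053353, 163026985, 155736832,
    157311028, 150175977, 184228405, 162361165,
]
MIB = 1024 * 1024


def row_group_estimate(file_size: int, groups_per_file: int = 2) -> int:
    """Rough size of one row group if the file splits into equal groups."""
    return file_size // groups_per_file


def concurrent_buffer_bytes(file_sizes: Sequence[int], groups_per_file: int = 2) -> int:
    """Bytes held at once if every file has one row group buffered simultaneously."""
    return sum(row_group_estimate(size, groups_per_file) for size in file_sizes)


def heap_fraction(file_sizes: Sequence[int], heap_bytes: int, groups_per_file: int = 2) -> float:
    """Share of the heap taken by those buffers alone."""
    return concurrent_buffer_bytes(file_sizes, groups_per_file) / heap_bytes


if __name__ == "__main__":
    heap = 1024 * MIB  # assumed heap size, for illustration only
    needed = concurrent_buffer_bytes(FILE_SIZES)
    print("buffered row groups: %.0f MiB" % (needed / MIB))
    print("share of heap: %.0f%%" % (100 * heap_fraction(FILE_SIZES, heap)))
```

Under these assumptions the raw buffers alone come to roughly 684 MiB, about two thirds of a 1 GiB heap, before any decompression or decoding overhead and before Spark's own memory use is counted. That is consistent with an OOM that shows up often but not on every run.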
appuser 20739 1 3 Feb14 ? 06:00:46 /home/jdk/bin/java -cp /home/hbase/conf/:/home/spark/hadooplib/*:/home/spark/hivelib/*:/home/spark/hbaselib/*:/home/spark/kafkalib/*:/home/spark/extlib/*:/home/spark/conf/:/home/spark/jars/*:/home/hadoop/etc/hadoop/ -Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://datahdfsmaster/spark/history -Xmx1024m org.apache.spark.deploy.history.HistoryServer
As -Xmx1024m shows, the process was given 1 GB of heap. Where is that configured? We can follow the launch scripts to find out:
pyspark:
export PYSPARK_DRIVER_PYTHON
export PYSPARK_DRIVER_PYTHON_OPTS
exec "${SPARK_HOME}"/bin/spark-submit pyspark-shell-main --name "PySparkShell" "$@"
spark-submit:
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
spark-class:
build_command() {
"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
printf "%d\0" $?
}
CMD=()
while IFS= read -d '' -r ARG; do
CMD+=("$ARG")
done < <(build_command "$@")
COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}
if [ $LAUNCHER_EXIT_CODE != 0 ]; then
exit $LAUNCHER_EXIT_CODE
fi
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
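The while/read loop in spark-class can be hard to follow at first glance: the launcher prints each argument of the final command followed by a NUL byte, and build_command appends the launcher's exit code as one last NUL-terminated field. The Python sketch below mirrors that parsing step for well-formed output. It is an illustration only; parse_launcher_output is a hypothetical name and the sample bytes are invented.

```python
from typing import List, Tuple


def parse_launcher_output(raw: bytes) -> Tuple[List[str], int]:
    """Split NUL-delimited launcher output into (command, exit_code).

    Every NUL-terminated field becomes one element, and the last element
    is the launcher's exit code, as in the shell loop.
    """
    fields = raw.split(b"\0")
    # The output ends with a NUL, so split() leaves a trailing empty field.
    if fields and fields[-1] == b"":
        fields.pop()
    if not fields:
        raise ValueError("launcher produced no output")
    *cmd, code = (field.decode("utf-8") for field in fields)
    return cmd, int(code)


if __name__ == "__main__":
    # Invented sample output: three command words plus exit code 0.
    sample = b"java\0-Xmx1g\0org.apache.spark.deploy.SparkSubmit\0" + b"0\0"
    command, exit_code = parse_launcher_output(sample)
    print(command, exit_code)
```

Using NUL as the separator lets arguments contain spaces or newlines without any quoting, which is presumably why the script reads with read -d '' instead of splitting on whitespace.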
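These three scripts run in sequence when pyspark is launched, ending in spark-class. There, the command
"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
assembles the command line for a separate Java process, and finally
exec "${CMD[@]}"
starts that process, which is the one responsible for running our distributed job. $RUNNER is ${JAVA_HOME}/bin/java. The -Xmx128m above applies only to the short-lived launcher JVM; the real -Xmx value is decided inside
org.apache.spark.launcher.Main
along the following call chain: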
org.apache.spark.launcher.Main.main()[line 86]
->
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(Map env) [line 151]
->
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(Map env)
Here is the code that finally sets it:
String tsMemory =
isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : null;
String memory = firstNonEmpty(tsMemory, config.get(SparkLauncher.DRIVER_MEMORY),
System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM);
cmd.add("-Xmx" + memory);
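As you can see, the -Xmx value for the process is the first non-empty value among: the environment variable SPARK_DAEMON_MEMORY (consulted only when the main class is the Thrift server), the spark.driver.memory setting from the Spark configuration, the environment variable SPARK_DRIVER_MEMORY, the environment variable SPARK_MEM, and the default DEFAULT_MEM (1g). None of the others was set in our case, so the default applied and the JVM started with a 1g heap.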
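The precedence can be restated as a few lines of Python. This is a sketch that mirrors the Java snippet above, not part of Spark; resolve_driver_xmx and its conf/env dictionary parameters are hypothetical stand-ins for the launcher's configuration map and the process environment.

```python
from typing import Mapping, Optional

DEFAULT_MEM = "1g"


def first_non_empty(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is neither None nor an empty string."""
    for candidate in candidates:
        if candidate:
            return candidate
    return None


def resolve_driver_xmx(
    conf: Mapping[str, str],
    env: Mapping[str, str],
    is_thrift_server: bool = False,
) -> str:
    """Pick the heap size in the same order as buildSparkSubmitCommand."""
    # SPARK_DAEMON_MEMORY only counts when the main class is the Thrift server.
    ts_memory = env.get("SPARK_DAEMON_MEMORY") if is_thrift_server else None
    memory = first_non_empty(
        ts_memory,
        conf.get("spark.driver.memory"),
        env.get("SPARK_DRIVER_MEMORY"),
        env.get("SPARK_MEM"),
        DEFAULT_MEM,
    )
    return "-Xmx" + str(memory)


if __name__ == "__main__":
    print(resolve_driver_xmx({}, {}))                              # nothing set
    print(resolve_driver_xmx({"spark.driver.memory": "4g"}, {}))   # config wins
```

With nothing configured the function falls through to the 1g default, which matches what we observed. Setting spark.driver.memory, or exporting SPARK_DRIVER_MEMORY before starting pyspark, should therefore be enough to raise the heap.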
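So is today's most popular distributed processing system really unable to handle such a simple operation in such a simple scenario?
All we need is 10 rows. As soon as one executor has those 10 rows the goal is met; there is no need to wait for every executor to return.
So I switched to calling show() on the DataFrame, and it was very fast, returning almost instantly.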
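A quick comparison of show() and collect() reveals the difference. show() is built on take() rather than on a full collect(). Both collect() and take() ultimately submit the job and gather the results through SparkContext.runJob(), but runJob is an overloaded method, and the two call different overloads. Here is the RDD source of collect() and take():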
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
def take(num: Int): Array[T] = withScope {
  val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
  if (num == 0) { // nothing requested: return an empty array
    new Array[T](0)
  } else {
    val buf = new ArrayBuffer[T]
    val totalParts = this.partitions.length // number of partitions in this RDD
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) { // still short of rows, and unscanned partitions remain
      // The number of partitions to try in this iteration. It is ok for this number to be
      // greater than totalParts because we actually cap it at totalParts in runJob.
      var numPartsToTry = 1L
      if (partsScanned > 0) {
        // If we didn't find any rows after the previous iteration, quadruple and retry.
        // Otherwise, interpolate the number of partitions we need to try, but overestimate
        // it by 50%. We also cap the estimation in the end.
        if (buf.isEmpty) {
          numPartsToTry = partsScanned * scaleUpFactor // nothing found so far: scan more partitions in the next round
        } else {
          // the left side of max is >=1 whenever partsScanned >= 2
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
        }
      }
      val left = num - buf.size
      // Partition range for this round: it starts where the previous round stopped and ends at
      // the smaller of (partsScanned + numPartsToTry) and the total number of partitions.
      val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
      val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p) // run the job on this partition range only
      res.foreach(buf ++= _.take(num - buf.size))
      partsScanned += p.size
    }
    buf.toArray
  }
}
This is the runJob() that collect() calls:
/**
* Run a job on all partitions in an RDD and return the results in an array.
*
* @param rdd target RDD to run tasks on
* @param func a function to run on each partition of the RDD
* @return in-memory collection with a result of the job (each collection element will contain
* a result from one partition)
*/
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
runJob(rdd, func, 0 until rdd.partitions.length)
}
And this is the runJob() that take() calls:
/**
* Run a function on a given set of partitions in an RDD and return the results as an array.
*
* @param rdd target RDD to run tasks on
* @param func a function to run on each partition of the RDD
* @param partitions set of partitions to run on; some jobs may not want to compute on all
* partitions of the target RDD, e.g. for operations like first()
* @return in-memory collection with a result of the job (each collection element will contain
* a result from one partition)
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: Iterator[T] => U,
partitions: Seq[Int]): Array[U] = {
val cleanedFunc = clean(func)
runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
}
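The two runJob() overloads differ in the partitions they are given: the former runs the job on every partition, while the latter runs it on a subset. The source and comments of take() make it clear how it keeps launching jobs until the number of rows collected reaches the requested count or, if every partition has been processed and there are still not enough rows, returns whatever it has gathered.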
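To see the difference in partitions touched, the logic of collect() and take() can be replayed on a toy model. The sketch below ports the partition-scanning loop from the Scala source above to plain Python; MiniRDD is a hypothetical stand-in that only records which partitions a job visits, and it says nothing about how Spark schedules or executes real tasks.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, Sequence


class MiniRDD:
    """Toy stand-in for an RDD: a list of partitions plus a scan log."""

    def __init__(self, partitions: Sequence[Iterable[int]]):
        self.partitions = [list(p) for p in partitions]
        self.scanned: List[int] = []  # indices of partitions visited by jobs

    def run_job(
        self,
        func: Callable[[Iterator[int]], List[int]],
        indices: Optional[Iterable[int]] = None,
    ) -> List[List[int]]:
        """Apply func to the chosen partitions (all of them by default)."""
        if indices is None:
            indices = range(len(self.partitions))
        results = []
        for i in indices:
            self.scanned.append(i)
            results.append(func(iter(self.partitions[i])))
        return results

    def collect(self) -> List[int]:
        """Like RDD.collect(): one job over every partition."""
        out: List[int] = []
        for chunk in self.run_job(list):
            out.extend(chunk)
        return out

    def take(self, num: int, scale_up_factor: int = 4) -> List[int]:
        """Like RDD.take(): scan a growing range of partitions until done."""
        if num == 0:
            return []
        buf: List[int] = []
        total = len(self.partitions)
        parts_scanned = 0
        while len(buf) < num and parts_scanned < total:
            num_parts_to_try = 1
            if parts_scanned > 0:
                if not buf:
                    # Nothing found yet: widen the next scan.
                    num_parts_to_try = parts_scanned * scale_up_factor
                else:
                    # Interpolate from the hit rate so far, overestimating by 50%.
                    estimate = int(1.5 * num * parts_scanned / len(buf))
                    num_parts_to_try = max(estimate - parts_scanned, 1)
                    num_parts_to_try = min(num_parts_to_try, parts_scanned * scale_up_factor)
            left = num - len(buf)
            p = range(parts_scanned, min(parts_scanned + num_parts_to_try, total))
            for chunk in self.run_job(lambda it: list(islice(it, left)), p):
                buf.extend(chunk[: num - len(buf)])
            parts_scanned += len(p)
        return buf


if __name__ == "__main__":
    rdd = MiniRDD([range(i * 100, (i + 1) * 100) for i in range(9)])
    rdd.take(10)
    print("take(10) visited partitions:", rdd.scanned)
    rdd.scanned.clear()
    rdd.collect()
    print("collect() visited partitions:", rdd.scanned)
```

With nine partitions of 100 rows each, take(10) is satisfied by the first partition alone, while collect() visits all nine. In the real job, each visited partition means buffering at least one large Parquet row group, which is where the time and memory went.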