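For a business requirement, I needed Spark to pull data from HDFS and run an analysis over 40 GB of data. After setting up the Spark environment following the official documentation, I started submitting jobs. The environment is roughly as follows: 4 servers, each with 64 GB of memory, a gigabit NIC, and a single 400 GB SSD. The main workload is matching one 20 GB HadoopRDD against another 20 GB HadoopRDD.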
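A single job took about 60 s. I ran it several times and the number barely changed, which felt very slow. In theory, if HadoopRDD data locality works well, all data is loaded from the local disk. A physical SSD can deliver a peak of roughly 400 MB/s, so with 40 GB in total, 10 GB per server, and 400 MB/s of read throughput, loading the data should take only about 25 s. That is far from the observed 60 s. On top of that, the Spark UI showed that the Locality Level of every task was ANY instead of the expected NODE_LOCAL.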
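The back-of-the-envelope estimate above can be written out as a small calculation. This is only a sketch: the object and function names are made up for illustration, and it assumes decimal units (1 GB = 1000 MB), an even split of the data across nodes, and disks that sustain their peak throughput.

```scala
object LoadEstimate {
  // Estimated seconds to load the data when every node reads only its own
  // share from its local disk. Decimal units (1 GB = 1000 MB) are an assumption.
  def localLoadSeconds(totalGB: Double, nodes: Int, diskMBPerSec: Double): Double = {
    val perNodeMB = totalGB * 1000.0 / nodes
    perNodeMB / diskMBPerSec
  }

  def main(args: Array[String]): Unit = {
    // 40 GB in total, 4 servers, 400 MB/s peak per SSD
    val seconds = localLoadSeconds(totalGB = 40.0, nodes = 4, diskMBPerSec = 400.0)
    println(f"estimated local load time: $seconds%.1f s")
  }
}
```

If the data had to cross a gigabit link instead, the per-node throughput would be much lower than the SSD figure, which is why the Locality Level matters so much here.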
In other words, the data was not being loaded locally at all. Something had to be wrong in task scheduling, so I read through TaskSetManager carefully:
private[spark] class TaskSetManager(
    sched: TaskSchedulerImpl,
    val taskSet: TaskSet,
    val maxTaskFailures: Int,
    clock: Clock = new SystemClock())
  extends Schedulable with Logging {

  // Pending tasks per executor; the key identifies the executor.
  private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]

  // Set of pending tasks for each host. Similar to pendingTasksForExecutor,
  // but at host level. The key identifies the host.
  private val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]]

  // Set of pending tasks for each rack -- similar to the above.
  private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]]

  // Set containing pending tasks with no locality preferences.
  var pendingTasksWithNoPrefs = new ArrayBuffer[Int]

  // Set containing all pending tasks (also used as a stack, as above).
  val allPendingTasks = new ArrayBuffer[Int]

  // Tasks that can be speculated. Since these will be a small fraction of total
  // tasks, we'll just hold them in a HashSet.
  val speculatableTasks = new HashSet[Int]
}
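The pendingTasksForHost map in TaskSetManager holds the tasks that can load their data locally, and its key is Spark's internal identifier for the data node. The partition location of a HadoopRDD, on the other hand, is identified by BlockLocation.host, where host is the hostname of the machine that stores the block. So I decided to print the keys of pendingTasksForHost together with the return value of HadoopRDD's getPreferredLocations.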
The output was:
Key of pendingTasksForHost: 172.25.3.160
Return value of HadoopRDD's getPreferredLocations: 172-25-3-160
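Now it is obvious: Spark identifies the worker's location by IP address rather than hostname, while HadoopRDD identifies data locality by hostname rather than IP address. So where exactly does this go wrong?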
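To see why this mismatch pushes every task to ANY, here is a deliberately simplified model of the comparison. It is not Spark's scheduler code: the object and function names are hypothetical, and it ignores PROCESS_LOCAL, RACK_LOCAL and the locality wait timers. It only assumes that the host identifiers are compared as plain strings.

```scala
object LocalitySketch {
  // Simplified model: a task counts as NODE_LOCAL on an executor only when the
  // executor's host string is exactly one of the task's preferred locations.
  def localityLevel(executorHost: String, preferredLocations: Seq[String]): String =
    if (preferredLocations.contains(executorHost)) "NODE_LOCAL" else "ANY"

  def main(args: Array[String]): Unit = {
    // HDFS reports the block location by hostname
    val preferred = Seq("172-25-3-160")
    // Worker registered with its IP address: the strings never match
    println(localityLevel("172.25.3.160", preferred)) // prints ANY
    // Worker registered with its hostname: the strings match
    println(localityLevel("172-25-3-160", preferred)) // prints NODE_LOCAL
  }
}
```

The IP and the hostname refer to the same machine, but nothing in a string comparison knows that, so both sides have to use the same form of identifier.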
The next step was to trace the worker's startup arguments:
hadoop 10932 1 1 13:33 ? 00:00:16 /usr/dahua/jdk/bin/java -cp /usr/dahua/spark/executelib/hbase-protocol-0.98.3-hadoop2.jar:/usr/dahua/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/usr/dahua/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar:/usr/dahua/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/usr/dahua/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/dahua/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/usr/dahua/hadoop/etc/hadoop/ -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=172.25.3.160:2181,172.25.3.161:2181,172.25.3.162:2181 -Dspark.deploy.zookeeper.dir=/spark -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=7200 -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.worker.Worker --webui-port 7078 spark://172-25-3-160:7077,172-25-3-161:7077,172-25-3-162:7077
The worker's startup arguments carry neither an IP address nor a hostname, so I looked at how org.apache.spark.deploy.worker.Worker is loaded:

private[deploy] object Worker extends Logging {
  def main(argStrings: Array[String]) {
    SignalLogger.register(log)
    val conf = new SparkConf
    val args = new WorkerArguments(argStrings, conf)
    // The host is generated internally. Is it an IP address or a hostname?
    val (actorSystem, _) = startSystemAndActor(args.host, args.port, args.webUiPort, args.cores,
      args.memory, args.masters, args.workDir)
    actorSystem.awaitTermination()
  }
}

private[worker] class WorkerArguments(args: Array[String], conf: SparkConf) {
  var host = Utils.localHostName()
  …
}

// From Utils:
def localHostName(): String = {
  // Prefer the custom hostname; fall back to the IP address if it is not set
  customHostname.getOrElse(localIpAddress.getHostAddress)
}

// Read from the Spark environment variable
private var customHostname: Option[String] = sys.env.get("SPARK_LOCAL_HOSTNAME")

/**
 * Get the local host's IP address in dotted-quad format (e.g. 1.2.3.4).
 * Note, this is typically not used from within core spark.
 */
private lazy val localIpAddress: InetAddress = findLocalInetAddress() // IP address of the host

So if the SPARK_LOCAL_HOSTNAME environment variable is not set, Spark uses the IP address; otherwise it uses the configured hostname. I checked my own configuration right away and, sure enough, the hostname was not set. Damn.
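The fallback can be reproduced in isolation to check what identity a node would report. The sketch below is standalone and does not call into Spark: `HostIdentity` and `effectiveHost` are names invented here, and `InetAddress.getLocalHost` is only a stand-in for Spark's own `findLocalInetAddress()`, which may pick a different address on machines with several interfaces.

```scala
import java.net.InetAddress

object HostIdentity {
  // Mirrors the fallback in the excerpt: prefer SPARK_LOCAL_HOSTNAME if it is
  // present in the environment, otherwise use the IP address string.
  def effectiveHost(env: Map[String, String], ipAddress: String): String =
    env.get("SPARK_LOCAL_HOSTNAME").getOrElse(ipAddress)

  def main(args: Array[String]): Unit = {
    // Assumption: getLocalHost approximates the address Spark would find
    val ip = InetAddress.getLocalHost.getHostAddress
    println(s"this node would identify itself as: ${effectiveHost(sys.env, ip)}")
  }
}
```

Running it once with the variable unset and once with it exported shows the switch from the dotted IP to the hostname form that HDFS reports.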
I immediately set SPARK_LOCAL_HOSTNAME in the spark-env.sh configuration file:
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
#
export SPARK_LOCAL_HOSTNAME="172-25-3-161"
export SPARK_LOCAL_DIRS="/home/hadoop/sdc/sparktmp"
#export SPARK_LOCAL_DIRS="/home/hadoop/sparktmp"
export HADOOP_CONF_DIR="/usr/dahua/hadoop/etc/hadoop"
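After restarting Spark and rerunning the job, the run time dropped to roughly 32 s. That is still a little above the theoretical 25 s, because it also includes computation time and the network transfer time of the shuffle. The Locality Level of the tasks now looks like this: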
With that, the environment was finally deployed properly.