Problems Encountered in Spark Development and How to Solve Them

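1. On Windows, running Spark throws the exception "Failed to locate the winutils binary in the hadoop binary path"
Solution:

  1. Download a Windows build of winutils
  A Windows build of winutils is available on GitHub at https://github.com/srccodes/hadoop-common-2.2.0-bin. Download the project's zip package (the file is named hadoop-common-2.2.0-bin-master.zip) and extract it to any directory.
  2. Configure the environment variables
  Add a user variable HADOOP_HOME whose value is the directory where the zip was extracted, then add %HADOOP_HOME%\bin to the system variable Path.
  Run the program again and it executes normally.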
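
If changing the environment variables is inconvenient (for example when running from an IDE), Hadoop also reads the hadoop.home.dir Java system property before falling back to HADOOP_HOME. The following is a minimal sketch of that alternative, not the original author's method. The class name WinutilsCheck and the path C:\hadoop-common-2.2.0-bin-master are hypothetical; substitute the directory you actually extracted the zip into.

```java
import java.io.File;

public class WinutilsCheck {

    /** Builds the location where Hadoop expects winutils.exe: <hadoopHome>/bin/winutils.exe. */
    static File winutilsPath(String hadoopHome) {
        return new File(new File(hadoopHome, "bin"), "winutils.exe");
    }

    /** Resolves the Hadoop home: the hadoop.home.dir system property first, then HADOOP_HOME. */
    static String resolveHadoopHome() {
        String home = System.getProperty("hadoop.home.dir");
        if (home == null || home.isEmpty()) {
            home = System.getenv("HADOOP_HOME");
        }
        return home;
    }

    public static void main(String[] args) {
        if (resolveHadoopHome() == null) {
            // Hypothetical path: replace with your own extraction directory.
            // This must run before any Spark or Hadoop class is loaded.
            System.setProperty("hadoop.home.dir", "C:\\hadoop-common-2.2.0-bin-master");
        }
        File winutils = winutilsPath(resolveHadoopHome());
        System.out.println("Expecting winutils at: " + winutils.getPath());
        System.out.println("Found: " + winutils.isFile());
    }
}
```

Running this check before creating the SparkContext makes it easy to tell whether the exception comes from a wrong directory or from a missing winutils.exe.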

Original article: http://blog.csdn.net/shawnhu007/article/details/51518879

2. To access files on HDFS from Spark, copy Hadoop's core-site.xml and hdfs-site.xml into Spark's conf directory so that Spark picks up the cluster's HDFS settings.

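3. Including third-party JARs in Spark
① Use Maven's assembly plugin to bundle all third-party JARs into the generated JAR. The drawbacks are that the build is slow and the resulting JAR is large.
② Pass the --jars option to spark-submit. The drawback is that the command line becomes long when many JARs are involved.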

spark-submit --jars ~/lib/hanlp-1.5.3.jar  --class "www.bdqn.cn.MyTest" --master spark://hadoop000:7077 ~/lib/SparkTechCount-1.0.jar

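③ Set the third-party JAR directory in Spark's spark-defaults.conf. In this case, however, every machine in the cluster must be configured and must have the JARs uploaded.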

spark.executor.extraClassPath=/home/hadoop/app/spark-1.6.3-bin-hadoop2.6/external_jars/*
spark.driver.extraClassPath=/home/hadoop/app/spark-1.6.3-bin-hadoop2.6/external_jars/*

④ In Spark on YARN (cluster) mode, the JARs can be placed on HDFS. I have not tested this myself and only note it here.

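4. In a distributed environment Spark also uses HDFS as its file system. After leaving my test machine alone over a weekend, I logged in to the server with Xshell and found that even typing at the command line lagged. According to what I found online, the cause was HDFS. Here is the workaround:
① At this point the stop-all.sh script can no longer stop HDFS.
② Find the process IDs of the Hadoop processes from the command line: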

ps -ef | grep java | grep hadoop

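Then kill the corresponding processes one by one with kill xxxx. I killed them from top to bottom in the order listed. One source I found online says the processes should be stopped in the following order, for reference: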

Stop order: job, task, namenode, datanode, secondarynode

③ After the processes are killed, start-all.sh and stop-all.sh work again.
④ A permanent fix found online, which I have not yet verified:

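The most common cause of this problem is that Hadoop's stop scripts rely on the PIDs of the mapred and dfs processes on the datanode. By default the PID files are stored under /tmp, and Linux periodically deletes files in that directory (typically every month or about every 7 days). Once the two files hadoop-hadoop-jobtracker.pid and hadoop-hadoop-namenode.pid have been deleted, the namenode can no longer find these two processes on the datanode.
Two other causes can also trigger this problem:
①: The environment variable $HADOOP_PID_DIR was changed after Hadoop was started
②: stop-all was run as a different user
Solution:
①: Permanent fix: in $HADOOP_HOME/conf/hadoop-env.sh, uncomment the line export HADOOP_PID_DIR=/var/hadoop/pids by removing the leading #, then create /var/hadoop/pids or a directory of your own choice
Workaround once the problem has occurred:
①: The scripts can no longer stop the processes at this point, but they can be stopped by hand. On each master and each datanode, run ps -ef | grep java | grep hadoop to find the process IDs and kill them forcibly, then run the start-all script on the master to restart. After that, starting and stopping work normally again.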

I set HADOOP_PID_DIR to:

export HADOOP_PID_DIR=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/pids

5. This exception came up during Spark development:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 2) had a not serializable result: java.util.ArrayList$SubList
Serialization stack:
    - object not serializable (class: java.util.ArrayList$SubList, value: [d])
    - field (class: scala.Tuple3, name: _2, type: class java.lang.Object)
    - object (class scala.Tuple3, ([a, b],[d],0.5))
    - writeObject data (class: java.util.ArrayList)
    - object (class java.util.ArrayList, [([a, b],[d],0.5), ([a, b],[c],0.5)])
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 11)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:339)
    at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:46)
    at org.dataalgorithms.MyImplementation.MarketBasketAnalyzeDriver.main(MarketBasketAnalyzeDriver.java:106)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

The cause is that my Spark code called subList to obtain a sub-list of a List. The fix is as follows:
At some point you're using something like: x = myArrayList.subList(a, b);

After this, x will not be serializable, because the sublist object returned from subList() does not implement Serializable. Try x = new ArrayList<>(myArrayList.subList(a, b)); instead.
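
The difference can be reproduced outside Spark with plain Java serialization, which is what the stack trace above is exercising. The sketch below is a standalone illustration, not the code from MarketBasketAnalyzeDriver. The class name SubListSerializationDemo, the helper isJavaSerializable, and the sample data are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubListSerializationDemo {

    /** Returns true if the object can be written with standard Java serialization. */
    static boolean isJavaSerializable(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            // Thrown when the object, or something it references, is not Serializable.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException("Unexpected I/O failure", e);
        }
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));

        // subList returns a view backed by the original list (ArrayList$SubList).
        List<String> view = items.subList(2, 4);

        // Copying the view into a new ArrayList yields an independent, serializable list.
        List<String> copy = new ArrayList<>(items.subList(2, 4));

        System.out.println("view serializable: " + isJavaSerializable(view));
        System.out.println("copy serializable: " + isJavaSerializable(copy));
    }
}
```

The view fails the check and the copy passes, so any list that ends up in an RDD element or in a result returned by collect() should be a copy rather than a subList view.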
