1. Error when running locally, and how to fix it
Running the following command:
./bin/spark-submit \
  --class org.apache.spark.examples.mllib.JavaALS \
  --master local[*] \
  /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-yarn/lib/spark-examples_2.10-1.0.0-cdh5.1.2.jar \
  /user/data/netflix_rating 10 10 /user/data/result
produces the following error:
Exception in thread "main" java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: hdfs
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:657)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
    at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
    at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:145)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:167)
    at org.apache.spark.mllib.recommendation.ALS$.train(ALS.scala:599)
    at org.apache.spark.mllib.recommendation.ALS.train(ALS.scala)
    at org.apache.spark.examples.mllib.JavaALS.main(JavaALS.java:80)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: No FileSystem for scheme: hdfs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
    ... 34 more
This error occurs because the hadoop-hdfs jar is missing from the classpath while Spark runs. Passing the jar to spark-submit with either --jars or --driver-class-path resolves it. Once hadoop-hdfs is available, the input and output paths are interpreted as HDFS paths.
The correct invocation is as follows:
./bin/spark-submit \
  --class org.apache.spark.examples.mllib.JavaALS \
  --driver-class-path /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-hdfs/hadoop-hdfs-2.3.0-cdh5.1.2.jar \
  --master local[*] \
  /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-yarn/lib/spark-examples_2.10-1.0.0-cdh5.1.2.jar \
  /user/data/netflix_rating 10 10 /user/data/result
or:
./bin/spark-submit \
  --class org.apache.spark.examples.mllib.JavaALS \
  --jars /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-hdfs/hadoop-hdfs-2.3.0-cdh5.1.2.jar \
  --master local[*] \
  /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-yarn/lib/spark-examples_2.10-1.0.0-cdh5.1.2.jar \
  /user/data/netflix_rating 10 10 /user/data/result
2. Error when running Spark on YARN, and how to fix it
Running the following command:
./bin/spark-submit \
  --class org.apache.spark.examples.mllib.JavaALS \
  --master yarn-cluster \
  /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-yarn/lib/spark-examples_2.10-1.0.0-cdh5.1.2.jar \
  /user/data/netflix_rating 10 10 /user/data/result
produces the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/client/api/impl/YarnClientImpl
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.spark.util.Utils$$anonfun$classIsLoadable$1.apply(Utils.scala:143)
    at org.apache.spark.util.Utils$$anonfun$classIsLoadable$1.apply(Utils.scala:143)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.util.Utils$.classIsLoadable(Utils.scala:143)
    at org.apache.spark.deploy.SparkSubmit$.createLaunchEnv(SparkSubmit.scala:158)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:54)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.client.api.impl.YarnClientImpl
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 21 more
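This error occurs because the jars under the hadoop-yarn directory are missing from the classpath. Here the only fix is the --driver-class-path option: when running Spark on YARN, the hadoop-yarn jars must be loaded up front. The stack trace shows why: the missing class is looked up inside SparkSubmit itself (createLaunchEnv), before the application is launched.
The correct invocation is as follows:
./bin/spark-submit \
  --class org.apache.spark.examples.mllib.JavaALS \
  --master yarn-cluster \
  --driver-class-path $(echo /opt/cloudera/parcels/CDH/lib/hadoop-yarn/*.jar | sed 's/ /:/g'):/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-hdfs/hadoop-hdfs-2.3.0-cdh5.1.2.jar \
  /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-yarn/lib/spark-examples_2.10-1.0.0-cdh5.1.2.jar \
  /user/data/netflix_rating 10 10 /user/data/result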
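The --driver-class-path value in the YARN command is built by letting the shell expand a glob and then replacing the spaces with colons. The sketch below isolates that step so it can be inspected on its own. The function name jar_classpath is a hypothetical helper, not part of Spark or CDH, and it assumes no jar path contains spaces.

```shell
#!/bin/sh
# jar_classpath (hypothetical helper): print every .jar directly under the
# given directory as one colon-separated classpath string.
jar_classpath() {
    # The glob expands to a space-separated, alphabetically sorted list of
    # jar paths; sed then turns each space into a colon.
    # Assumption: no path contains a space.
    echo "$1"/*.jar | sed 's/ /:/g'
}

# Example with the directory used in the command above:
#   jar_classpath /opt/cloudera/parcels/CDH/lib/hadoop-yarn
```

Printing the result before passing it to spark-submit is a quick way to confirm that the glob matched something: if the directory holds no jars, the output is the unexpanded pattern ending in *.jar.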
The result set is shown below.
The data in /user/data/result/productFeatures/part-00000 has the following format:
22,[5.561720883259194, 1.8295046510786157, 1.456597387276617, 0.8851233321058966, -0.6750794769961516, 0.2105431165110079, 1.868136268816477, -0.7426684616337039, -0.5856268982634872, 2.2788288132587358]
31,[0.9093801231293616, 0.31519780093777366, 1.1875509370524693, 0.40381375438624073, 2.518833489342341, 1.4242427194658087, 2.0950977044322574, 0.9012256614215569, 1.1604700989497398, 0.15791920617498328]
76,[1.8285525546730474, 0.6330058247735413, 2.5686801366906984, 1.4128062599776998, 1.401816974160943, 0.1596137900376602, 1.5625150218484072, -0.9678843308247949, 2.682242352514027, 1.0599465865866935]
152,[0.014905493368344078, 0.43308346940343456, 0.2351848253710811, 0.26220235713374834, 0.055210836978533295, 0.21723689234341548, 0.09391052568889097, 0.7231946368850907, 0.02497671848923523, 0.5022350772242716]
206,[-0.5501117679008718, 0.4105849318486638, 1.0876481291363873, 2.233025299808942, 2.1038565118723387, 1.662798954470802, 1.575332336431819, 0.8167712158963146, 1.4536436809654083, -0.5224582242822096]
The data in /user/data/result/userFeatures/part-00000 has the following format:
22,[0.18595332423070562, 0.26223861694267697, 0.2220917583718615, 0.015729079507204886, 0.4450456773474982, 0.12287125816024044, 0.4644319181495295, 0.38377345920108646, 0.28428991637647794, 0.17875507467819415]
31,[0.15133710263843259, -0.02354886937021699, 0.10618787396390789, 0.03258147800653979, 0.3556889855610244, 1.021110467423965, 0.3701959855785832, 0.1524124835894395, 0.23381646690418442, -0.012011907243505829]
76,[0.2344438657777155, 0.03821305024729112, 0.230093903321136, 0.48888224387617607, 0.30121869825786685, 0.48198504753122795, 0.29543641416718835, 0.39299434584620146, 0.27798068299013984, 0.15611605797193095]
121,[0.2038917971256244, 0.7576071991072084, 0.30603993855416245, 0.41995044224403344, 0.06550681386608997, 0.20395370870960078, 0.3444359097858106, 0.4935457123179016, 0.2041119263872145, 0.3518582534508109]
130,[0.042995762604581524, -0.21177745644812881, 0.7047019111940551, 0.44978429350262916, 0.18912686527984246, 0.6349887274906566, 0.29651737861710675, 0.49758500548973844, 0.02699514514764544, 0.39330900998421187]
152,[1.9989336762046868, 1.2185456627280438, -0.14465791504370654, 0.32972894935630664, -0.6316151112173617, -0.5568528040594881, 0.007477525352213408, -0.012087520291972442, 0.4184613236246099, -0.24669307203702268]
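Each record in the two files is an ID followed by a bracketed feature vector. In an ALS matrix factorization model, the predicted rating for a user and product pair is the dot product of the two vectors. As a rough sketch of how the records can be consumed, the hypothetical helper dot_product below takes one userFeatures line and one productFeatures line and prints that dot product. It assumes both lines use the "id,[v1, v2, ...]" layout shown above and have the same number of features.

```shell
#!/bin/sh
# dot_product (hypothetical helper): given two records shaped like
# "id,[v1, v2, ...]", print the dot product of their feature vectors.
dot_product() {
    # Strip brackets and spaces so each record becomes "id,v1,v2,...".
    printf '%s\n%s\n' "$1" "$2" | tr -d '[] ' | awk -F, '
        # Field 1 is the ID, so the features start at field 2.
        NR == 1 { for (i = 2; i <= NF; i++) u[i] = $i }
        NR == 2 { for (i = 2; i <= NF; i++) sum += u[i] * $i }
        END     { printf "%.4f\n", sum }
    '
}

# Example with made-up two-feature records (not taken from the output above):
#   dot_product '1,[1.0, 2.0]' '7,[3.0, 4.0]'
```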
3. For problems that arise while running Spark on Hadoop, check the logs of the corresponding job in Hadoop YARN.
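One way to get at those logs from the command line is `yarn logs -applicationId`, which needs the YARN application ID. The sketch below is only an illustration: extract_app_id is a hypothetical helper, submit.log is an assumed file holding saved spark-submit console output, and the approach assumes that output contains the application ID and that YARN log aggregation is enabled on the cluster. If not, the ID can be looked up in the ResourceManager web UI instead.

```shell
#!/bin/sh
# extract_app_id (hypothetical helper): read text on stdin and print the first
# YARN application ID found, in the form application_<clusterTimestamp>_<sequence>.
extract_app_id() {
    grep -oE 'application_[0-9]+_[0-9]+' | head -n 1
}

# Typical use, assuming the spark-submit output was saved to submit.log:
#   app_id=$(extract_app_id < submit.log)
#   yarn logs -applicationId "$app_id"
```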