(1) Start Hadoop
Processes on the master node after a successful start:
Processes on the slave nodes:
(2) Start Spark (note the path)
After a successful start:
The processes on slave1 and slave2 are as follows:
Note that the following paths must be consistent:
Why do we need to set the execution environment?
First, we run the SparkPi program directly by right-clicking it.
You can see that it fails:
The reason is that the master on which the program should run cannot be found, so we need to configure Spark's execution environment.
Depending on the cluster mode, Spark's execution environment (the master setting) can take one of the following forms (a short code sketch follows this list):
local              run locally with a single thread
local[K]           run locally with K threads (K cores)
local[*]           run locally with as many threads as there are available cores
spark://HOST:PORT  connect to the given Spark standalone cluster master; the port must be specified
mesos://HOST:PORT  connect to the given Mesos cluster; the port must be specified
yarn-client        connect to a YARN cluster in client mode; requires HADOOP_CONF_DIR to be set
yarn-cluster       connect to a YARN cluster in cluster mode; requires HADOOP_CONF_DIR to be set
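For reference, whichever master value you pick eventually ends up on a SparkConf. A minimal, illustrative Java sketch of that (the class name and the empty body are assumptions for illustration, not code from this post):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class MasterUrlSketch {
  public static void main(String[] args) {
    // Any master form from the list above works here, e.g. "local",
    // "local[4]", "local[*]" or "spark://192.168.189.130:7077".
    SparkConf conf = new SparkConf()
        .setAppName("MasterUrlSketch")
        .setMaster("local[*]"); // run locally on all available cores
    JavaSparkContext sc = new JavaSparkContext(conf);
    sc.stop();
  }
}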
Now let's configure the Spark execution environment.
In the SparkPi run configuration drop-down menu, select "Edit Configurations" and add the following VM option:
-Dspark.master=spark://192.168.189.130:7077
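A note on why this works: a SparkConf created with its default constructor picks up any Java system property starting with "spark.", so the -Dspark.master VM option reaches the driver without touching the code. A minimal hedged sketch (the class name is an illustrative assumption):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class SystemPropertyMasterSketch {
  public static void main(String[] args) {
    // No setMaster call here: spark.master comes from the
    // -Dspark.master=spark://192.168.189.130:7077 VM option set in
    // "Edit Configurations".
    SparkConf conf = new SparkConf().setAppName("Spark Pi");
    System.out.println("spark.master = " + conf.get("spark.master"));
    JavaSparkContext sc = new JavaSparkContext(conf);
    sc.stop();
  }
}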
Running the program again, we hit the following error:
Cause: the jar containing the program is not shipped to the Spark workers when the job runs, so the workers cannot find the classes being invoked.
Solution: package the program into a jar, then register that jar via JavaSparkContext's addJar method (or SparkConf's setJars, as shown below) so it is submitted to the Spark cluster; the Spark master will distribute the jar to each worker.
Package the program:
Since Scala and Spark are already installed on every machine, the Scala and Spark libraries can be excluded from the artifact here.
Next, we set the program up to load the jar.
When the build completes, the project's jar is generated.
Load the jar in the program:
Change: val conf = new SparkConf().setAppName("Spark Pi")
to:     val conf = new SparkConf().setAppName("SparkPi").setJars(List("/root/IdeaProjects/SparkExampleWorkspace/out/artifacts/SparkExampleWorkspace_jar/SparkExampleWorkspace.jar"))
Run the program to get the result:
Cause of the error: the Scala version in use is too new; downgrade Scala to 2.10.x.
The same jar-distribution issue comes up when running Spark's JavaWordCount example from Eclipse. Let's start with the code:
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {

    if (args.length < 3) {
      System.err.println("Usage: JavaWordCount <master> <input file> <output dir>");
      System.exit(1);
    }

    JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
        System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(JavaWordCount.class));
    // Ship the application jar to the workers so that the anonymous function
    // classes (JavaWordCount$1, ...) can be loaded on the executor side.
    ctx.addJar("/home/hadoop/Desktop/JavaSparkT.jar");
    JavaRDD<String> lines = ctx.textFile(args[1], 1);

    // Split each line into words.
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(SPACE.split(s));
      }
    });

    // Map each word to a (word, 1) pair.
    JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    // Sum the counts per word.
    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });
    counts.saveAsTextFile(args[2]);
    // counts.s
    /*List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?, ?> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }*/
    System.exit(0);
  }
}
This is one of Spark's bundled examples. Previously I could only package the code into a jar and run it from Spark's bin directory with spark-class, which makes it hard to integrate Spark programs into an existing system, so I wanted to run this program through an ordinary Java method call instead. After some experimentation and guidance from my advisor, it became clear from the error message that the jar had not been submitted to the Spark workers, so the workers could not find the classes being invoked and reported the following error:
14/07/07 10:26:10 INFO TaskSetManager: Serialized task 1.0:0 as 2194 bytes in 104 ms
14/07/07 10:26:11 WARN TaskSetManager: Lost TID 0 (task 1.0:0)
14/07/07 10:26:11 WARN TaskSetManager: Loss was due to java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: JavaWordCount$1
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.spark.serializer.JavaDeserializationStream$anon$1.resolveClass(JavaSerializer.scala:37)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
Solution: package the program into a jar, then call JavaSparkContext's addJar method to submit that jar to the Spark cluster; the Spark master will distribute it to each worker.
The relevant line in the code above is the call ctx.addJar("/home/hadoop/Desktop/JavaSparkT.jar");
With the jar registered this way, the java.lang.ClassNotFoundException: JavaWordCount$1 error no longer occurs at runtime.
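As an aside, the same effect can be achieved by registering the jar on the SparkConf before the context is created. A hedged sketch of that variant (the driver class name is an assumption; the jar path and master are simply the ones used in this post):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class JavaWordCountConfSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("JavaWordCount")
        .setMaster("spark://localhost:7077")
        // Equivalent to ctx.addJar(...): ship the application jar to every
        // worker so classes like JavaWordCount$1 can be loaded there.
        .setJars(new String[] {"/home/hadoop/Desktop/JavaSparkT.jar"});
    JavaSparkContext ctx = new JavaSparkContext(conf);
    // ...build and run the word-count RDDs exactly as in JavaWordCount above...
    ctx.stop();
  }
}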
Run the program with the following arguments:
spark://localhost:7077 hdfs://localhost:9000/input/test.txt hdfs://localhost:9000/input/result.txt
The Eclipse console then shows a log like the following:
14/07/08 16:03:06 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/08 16:03:06 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.200.233 instead (on interface eth0)
14/07/08 16:03:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/07/08 16:03:07 INFO Slf4jLogger: Slf4jLogger started
14/07/08 16:03:07 INFO Remoting: Starting remoting
14/07/08 16:03:07 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:52469]
14/07/08 16:03:07 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:52469]
14/07/08 16:03:07 INFO SparkEnv: Registering BlockManagerMaster
14/07/08 16:03:07 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140708160307-0a89
14/07/08 16:03:07 INFO MemoryStore: MemoryStore started with capacity 484.2 MB.
14/07/08 16:03:08 INFO ConnectionManager: Bound socket to port 47731 with id = ConnectionManagerId(192.168.200.233,47731)
14/07/08 16:03:08 INFO BlockManagerMaster: Trying to register BlockManager
14/07/08 16:03:08 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.200.233:47731 with 484.2 MB RAM
14/07/08 16:03:08 INFO BlockManagerMaster: Registered BlockManager
14/07/08 16:03:08 INFO HttpServer: Starting HTTP Server
14/07/08 16:03:08 INFO HttpBroadcast: Broadcast server started at http://192.168.200.233:58077
14/07/08 16:03:08 INFO SparkEnv: Registering MapOutputTracker
14/07/08 16:03:08 INFO HttpFileServer: HTTP File server directory is /tmp/spark-86439c44-9a36-4bda-b8c7-063c5c2e15b2
14/07/08 16:03:08 INFO HttpServer: Starting HTTP Server
14/07/08 16:03:08 INFO SparkUI: Started Spark Web UI at http://192.168.200.233:4040
14/07/08 16:03:08 INFO AppClient$ClientActor: Connecting to master spark://localhost:7077...
14/07/08 16:03:09 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140708160309-0000
14/07/08 16:03:09 INFO AppClient$ClientActor: Executor added: app-20140708160309-0000/0 on worker-20140708160246-localhost-34775 (localhost:34775) with 4 cores
14/07/08 16:03:09 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708160309-0000/0 on hostPort localhost:34775 with 4 cores, 512.0 MB RAM
14/07/08 16:03:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/08 16:03:09 INFO AppClient$ClientActor: Executor updated: app-20140708160309-0000/0 is now RUNNING
14/07/08 16:03:10 INFO SparkContext: Added JAR /home/hadoop/Desktop/JavaSparkT.jar at http://192.168.200.233:52827/jars/JavaSparkT.jar with timestamp 1404806590353
14/07/08 16:03:10 INFO MemoryStore: ensureFreeSpace(138763) called with curMem=0, maxMem=507720499
14/07/08 16:03:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 484.1 MB)
14/07/08 16:03:12 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@localhost:42090/user/Executor#-1434031133] with ID 0
14/07/08 16:03:13 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager localhost:56831 with 294.9 MB RAM
14/07/08 16:03:13 INFO FileInputFormat: Total input paths to process : 1
14/07/08 16:03:13 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/07/08 16:03:13 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/07/08 16:03:13 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/07/08 16:03:13 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/07/08 16:03:13 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/07/08 16:03:13 INFO SparkContext: Starting job: saveAsTextFile at JavaWordCount.java:66
14/07/08 16:03:13 INFO DAGScheduler: Registering RDD 4 (reduceByKey at JavaWordCount.java:60)
14/07/08 16:03:13 INFO DAGScheduler: Got job 0 (saveAsTextFile at JavaWordCount.java:66) with 1 output partitions (allowLocal=false)
14/07/08 16:03:13 INFO DAGScheduler: Final stage: Stage 0 (saveAsTextFile at JavaWordCount.java:66)
14/07/08 16:03:13 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/07/08 16:03:13 INFO DAGScheduler: Missing parents: List(Stage 1)
14/07/08 16:03:13 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at JavaWordCount.java:60), which has no missing parents
14/07/08 16:03:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at JavaWordCount.java:60)
14/07/08 16:03:13 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
14/07/08 16:03:13 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor 0: localhost (PROCESS_LOCAL)
14/07/08 16:03:13 INFO TaskSetManager: Serialized task 1.0:0 as 2252 bytes in 39 ms
14/07/08 16:03:17 INFO TaskSetManager: Finished TID 0 in 3310 ms on localhost (progress: 1/1)
14/07/08 16:03:17 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)
14/07/08 16:03:17 INFO DAGScheduler: Stage 1 (reduceByKey at JavaWordCount.java:60) finished in 3.319 s
14/07/08 16:03:17 INFO DAGScheduler: looking for newly runnable stages
14/07/08 16:03:17 INFO DAGScheduler: running: Set()
14/07/08 16:03:17 INFO DAGScheduler: waiting: Set(Stage 0)
14/07/08 16:03:17 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/07/08 16:03:17 INFO DAGScheduler: failed: Set()
14/07/08 16:03:17 INFO DAGScheduler: Missing parents for Stage 0: List()
14/07/08 16:03:17 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at saveAsTextFile at JavaWordCount.java:66), which is now runnable
14/07/08 16:03:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[7] at saveAsTextFile at JavaWordCount.java:66)
14/07/08 16:03:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/07/08 16:03:17 INFO TaskSetManager: Starting task 0.0:0 as TID 1 on executor 0: localhost (PROCESS_LOCAL)
14/07/08 16:03:17 INFO TaskSetManager: Serialized task 0.0:0 as 11717 bytes in 0 ms
14/07/08 16:03:17 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@localhost:37990
14/07/08 16:03:17 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 127 bytes
14/07/08 16:03:18 INFO DAGScheduler: Completed ResultTask(0, 0)
14/07/08 16:03:18 INFO TaskSetManager: Finished TID 1 in 1074 ms on localhost (progress: 1/1)
14/07/08 16:03:18 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/07/08 16:03:18 INFO DAGScheduler: Stage 0 (saveAsTextFile at JavaWordCount.java:66) finished in 1.076 s
14/07/08 16:03:18 INFO SparkContext: Job finished: saveAsTextFile at JavaWordCount.java:66, took 4.719158065 s
The program output is as follows:
[hadoop@localhost sbin]$ hadoop fs -ls hdfs://localhost:9000/input/result.txt
14/07/08 16:04:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2014-07-08 16:03 hdfs://localhost:9000/input/result.txt/_SUCCESS
-rw-r--r--   3 hadoop supergroup         56 2014-07-08 16:03 hdfs://localhost:9000/input/result.txt/part-00000
[hadoop@localhost sbin]$ hadoop fs -cat hdfs://localhost:9000/input/result.txt/part-00000
14/07/08 16:04:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(caozw,1)
(hello,3)
(hadoop,1)
(2.2.0,1)
(world,1)
[hadoop@localhost sbin]$