Step 1: (1) My machine runs 64-bit Windows 10; download the Eclipse build that matches your own system from the Eclipse website.
(2) Search for Eclipse online and download the package; simply extract the archive and Eclipse is ready to use, with no installer needed.
Step 2: Download the Spark and Hadoop files.
(1) Download and extract the Hadoop files.
Hadoop download: https://download.csdn.net/download/qq_30993409/10561014
(2) Download and extract the Spark files. I chose Spark 1.6, mainly because the 1.x line still ships a single spark-assembly jar (it appears in the run log below), which is convenient to add to the build path.
Spark download: https://pan.baidu.com/s/1WBrp-_boqwlPLNJST9yD_Q password: obo3
Step 3: Configure the corresponding environment variables: set HADOOP_HOME and SPARK_HOME to the directories extracted in step 2, and add their bin subdirectories to the Path variable. The small check after this paragraph can confirm the variables are visible.
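A minimal sketch, not from the original post, for verifying the variables from Java; the variable names are the conventional ones and your actual paths will differ:

// Prints the environment variables this guide relies on, so a wrong or
// missing entry is caught before running Spark itself.
public class EnvCheck {
    public static void main(String[] args) {
        for (String name : new String[] { "JAVA_HOME", "HADOOP_HOME", "SPARK_HOME" }) {
            System.out.println(name + " = " + System.getenv(name));
        }
    }
}

If any line prints null, the variable is not set for the current session; restart Eclipse or the command prompt after editing the system settings.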
Step 4: Test from the Windows command prompt (example invocations follow this list).
(1) SparkPi test
(2) spark-shell test
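A sketch of both tests, assuming %SPARK_HOME% points at the extracted Spark 1.6 directory (the .cmd launch scripts ship in its bin folder):

cd %SPARK_HOME%
bin\run-example SparkPi 10
bin\spark-shell

run-example should end with a line like "Pi is roughly 3.14...", and spark-shell should drop you into a Scala REPL with a ready-made SparkContext bound to the variable sc; exit it with :quit.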
Step 5: Write and run the WordCount program in Eclipse.
Edit the pom.xml file as follows (the _2.10 suffix on the Spark artifacts denotes the Scala 2.10 build, which matches Spark 1.6.0):
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>cn.spark</groupId>
  <artifactId>spark-study-java</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>spark-study-java</name>
  <packaging>jar</packaging>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-launcher_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
</project>
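An optional addition, not in the original pom: if Eclipse builds the project at an old source level, a standard maven-compiler-plugin entry pins the Java version (adjust 1.8 to whatever JDK you use):

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>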
Java code:
package cn.spark.study.java;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class WordCountLocal {
    public static void main(String[] args) {
        // Run locally in a single JVM, no cluster required
        SparkConf conf = new SparkConf()
                .setAppName("wordCountLocal")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file as an RDD of lines
        JavaRDD<String> lines = sc.textFile("C://Users//Yuan//Desktop//spark.txt");

        // Split each line into words
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterable<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" "));
            }
        });

        // Map each word to a (word, 1) pair
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts for each word
        JavaPairRDD<String, Integer> wordcount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Integer call(Integer valA, Integer valB) throws Exception {
                return valA + valB;
            }
        });

        // Print each (word, count) pair
        wordcount.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public void call(Tuple2<String, Integer> wordCount) throws Exception {
                System.out.println("[" + wordCount._1 + "," + wordCount._2 + "]");
            }
        });

        sc.close();
    }
}
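For reference, the same pipeline in Java 8 lambda form; this is a sketch, not from the original post, and it assumes a Java 8 JDK (note that in Spark 1.6 flatMap still expects an Iterable, rather than the Iterator required from Spark 2.x onward):

package cn.spark.study.java;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountLocalLambda {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordCountLocal").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("C://Users//Yuan//Desktop//spark.txt");
        // flatMap: line -> words; mapToPair: word -> (word, 1); reduceByKey: sum counts
        JavaPairRDD<String, Integer> wordCount = lines
                .flatMap(line -> Arrays.asList(line.split(" ")))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        wordCount.foreach(t -> System.out.println("[" + t._1 + "," + t._2 + "]"));

        sc.close();
    }
}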
The program's run output is as follows:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/Yuan/Desktop/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/Yuan/.m2/repository/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/07/24 14:53:06 INFO SparkContext: Running Spark version 1.6.0
18/07/24 14:53:07 INFO SecurityManager: Changing view acls to: Yuan
18/07/24 14:53:07 INFO SecurityManager: Changing modify acls to: Yuan
18/07/24 14:53:07 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Yuan); users with modify permissions: Set(Yuan)
18/07/24 14:53:09 INFO Utils: Successfully started service 'sparkDriver' on port 54107.
18/07/24 14:53:11 INFO Slf4jLogger: Slf4jLogger started
18/07/24 14:53:12 INFO Remoting: Starting remoting
18/07/24 14:53:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:54120]
18/07/24 14:53:12 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54120.
18/07/24 14:53:12 INFO SparkEnv: Registering MapOutputTracker
18/07/24 14:53:12 INFO SparkEnv: Registering BlockManagerMaster
18/07/24 14:53:12 INFO DiskBlockManager: Created local directory at C:\Users\Yuan\AppData\Local\Temp\blockmgr-93c9430a-27aa-418c-a75c-19d9a946866f
18/07/24 14:53:13 INFO MemoryStore: MemoryStore started with capacity 444.4 MB
18/07/24 14:53:13 INFO SparkEnv: Registering OutputCommitCoordinator
18/07/24 14:53:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/07/24 14:53:14 INFO SparkUI: Started SparkUI at http://169.254.210.125:4040
18/07/24 14:53:14 INFO Executor: Starting executor ID driver on host localhost
18/07/24 14:53:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54127.
18/07/24 14:53:14 INFO NettyBlockTransferService: Server created on 54127
18/07/24 14:53:14 INFO BlockManagerMaster: Trying to register BlockManager
18/07/24 14:53:14 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54127 with 444.4 MB RAM, BlockManagerId(driver, localhost, 54127)
18/07/24 14:53:14 INFO BlockManagerMaster: Registered BlockManager
18/07/24 14:53:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
18/07/24 14:53:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
18/07/24 14:53:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54127 (size: 13.9 KB, free: 444.4 MB)
18/07/24 14:53:17 INFO SparkContext: Created broadcast 0 from textFile at WordCountLocal.java:23
18/07/24 14:53:19 WARN : Your hostname, DESKTOP-359QINH resolves to a loopback/non-reachable address: fe80:0:0:0:0:5efe:a9fe:82e%net10, but we couldn't find any external IP address!
18/07/24 14:53:21 INFO FileInputFormat: Total input paths to process : 1
18/07/24 14:53:21 INFO SparkContext: Starting job: foreach at WordCountLocal.java:42
18/07/24 14:53:21 INFO DAGScheduler: Registering RDD 3 (mapToPair at WordCountLocal.java:30)
18/07/24 14:53:21 INFO DAGScheduler: Got job 0 (foreach at WordCountLocal.java:42) with 1 output partitions
18/07/24 14:53:21 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at WordCountLocal.java:42)
18/07/24 14:53:21 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/07/24 14:53:21 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/07/24 14:53:21 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at WordCountLocal.java:30), which has no missing parents
18/07/24 14:53:22 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 146.1 KB)
18/07/24 14:53:22 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.6 KB, free 148.7 KB)
18/07/24 14:53:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54127 (size: 2.6 KB, free: 444.4 MB)
18/07/24 14:53:22 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
18/07/24 14:53:22 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at WordCountLocal.java:30)
18/07/24 14:53:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
18/07/24 14:53:22 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2128 bytes)
18/07/24 14:53:22 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/07/24 14:53:22 INFO HadoopRDD: Input split: file:/C:/Users/Yuan/Desktop/spark.txt:0+171
18/07/24 14:53:22 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/07/24 14:53:22 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/07/24 14:53:22 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/07/24 14:53:22 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/07/24 14:53:22 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/07/24 14:53:23 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver
18/07/24 14:53:23 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 993 ms on localhost (1/1)
18/07/24 14:53:23 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/07/24 14:53:23 INFO DAGScheduler: ShuffleMapStage 0 (mapToPair at WordCountLocal.java:30) finished in 1.105 s
18/07/24 14:53:23 INFO DAGScheduler: looking for newly runnable stages
18/07/24 14:53:23 INFO DAGScheduler: running: Set()
18/07/24 14:53:23 INFO DAGScheduler: waiting: Set(ResultStage 1)
18/07/24 14:53:23 INFO DAGScheduler: failed: Set()
18/07/24 14:53:23 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCountLocal.java:36), which has no missing parents
18/07/24 14:53:23 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 151.7 KB)
18/07/24 14:53:23 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1785.0 B, free 153.4 KB)
18/07/24 14:53:23 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54127 (size: 1785.0 B, free: 444.4 MB)
18/07/24 14:53:23 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/07/24 14:53:23 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCountLocal.java:36)
18/07/24 14:53:23 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
18/07/24 14:53:23 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes)
18/07/24 14:53:23 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
18/07/24 14:53:23 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
18/07/24 14:53:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
[spark,7]
[hive,5]
[hadoop,6]
[core,2]
[streaming,1]
[sql,3]
[hbase,4]
18/07/24 14:53:23 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
18/07/24 14:53:23 INFO DAGScheduler: ResultStage 1 (foreach at WordCountLocal.java:42) finished in 0.404 s
18/07/24 14:53:23 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 404 ms on localhost (1/1)
18/07/24 14:53:23 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/07/24 14:53:23 INFO DAGScheduler: Job 0 finished: foreach at WordCountLocal.java:42, took 2.197408 s
18/07/24 14:53:24 INFO SparkUI: Stopped Spark web UI at http://169.254.210.125:4040
18/07/24 14:53:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/24 14:53:24 INFO MemoryStore: MemoryStore cleared
18/07/24 14:53:24 INFO BlockManager: BlockManager stopped
18/07/24 14:53:24 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/24 14:53:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/24 14:53:24 INFO SparkContext: Successfully stopped SparkContext
18/07/24 14:53:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
18/07/24 14:53:24 INFO ShutdownHookManager: Shutdown hook called
18/07/24 14:53:24 INFO ShutdownHookManager: Deleting directory C:\Users\Yuan\AppData\Local\Temp\spark-0540df48-4e2a-44b9-adf6-be22f4e361f4
18/07/24 14:53:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
Within that output, the following lines are the word-frequency result:
[spark,7]
[hive,5]
[hadoop,6]
[core,2]
[streaming,1]
[sql,3]
[hbase,4]