Counting people per district
- Extract the fourth field; the rest is a word-count program;
- Code:
package io.github.sparktrain

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by sunyonggang on 16-4-12.
 */
class PersonCount {
}

object PersonCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PersonCountByDistrict").setMaster("local")
    val sc = new SparkContext(conf)
    val raw = sc.textFile("/home/sunyonggang/sparkdata/users.txt")
    // Word count on the district: map to (district, 1) pairs, then reduce by key
    val rawMapReduce = raw.map(line => splitGet3(line)).map(a3 => (a3, 1)).reduceByKey(_ + _)
    rawMapReduce.saveAsTextFile("/home/sunyonggang/sparkdata/resultPerson")
  }

  // Split the comma-separated line and return the fourth field (index 3): the district
  def splitGet3(line: String): String = {
    val a = line.split(",")
    a(3)
  }
}
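For context, each line of users.txt is assumed to be comma-separated with the district as the fourth field. A minimal, standalone check of the splitGet3 logic against a made-up record (the sample line below is hypothetical, not taken from the real data):

object SplitGet3Check {
  def main(args: Array[String]): Unit = {
    // Hypothetical record layout: id,name,phone,district
    val sample = "10001,张三,13712345678,吉首市"
    println(sample.split(",")(3)) // prints: 吉首市
  }
}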
3. As for Chinese characters needing special handling: I did not run into any such problem.
Group by phone number (3rd field): group by the first three digits, count each group, and sort by those three digits
- Extract the third field, take a substring of its first three digits; the rest is a word count + sortByKey;
- Code:
package io.github.sparktrain

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by sunyonggang on 16-4-12.
 */
class PhoneGroup {
}

object PhoneGroup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PhoneGroupSorted").setMaster("local")
    val sc = new SparkContext(conf)
    // sc.addJar("sunyonggang@gg01:/home/sunyonggang/IdeaProjects/SparkTest/target/scala-2.1.0/sparktest_2.10-1.0.jar")
    val raw = sc.textFile("/home/sunyonggang/sparkdata/users.txt")
    // Word count on the 3-digit prefix, then sort by key; since every key is
    // exactly three digits, lexicographic order matches numeric order
    val rmr = raw.map(line => splitLineWithPhone(line)).map(a => (a, 1)).reduceByKey(_ + _).sortByKey()
    rmr.saveAsTextFile("/home/sunyonggang/sparkdata/resultPhone")
  }

  // Split the comma-separated line and return the first three digits of the phone number (third field)
  def splitLineWithPhone(line: String): String = {
    val a = line.split(",")
    a(2).substring(0, 3)
  }
}
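One detail worth calling out: sortByKey on String keys sorts lexicographically, which coincides with numeric order here only because every prefix is exactly three characters long. A quick standalone sketch of the distinction (no Spark needed):

object SortOrderNote {
  def main(args: Array[String]): Unit = {
    // Equal-length digit strings: lexicographic order == numeric order
    println(List("157", "134", "188").sorted) // List(134, 157, 188)
    // Mixed lengths would break that assumption
    println(List("9", "10", "100").sorted)    // List(10, 100, 9)
  }
}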
3. No special issues to handle.
Merged results
- The two outputs merged and printed together (a sketch of the merge follows the listing below):
(龙山县,12445)
(永顺县,12146)
(花垣县,10453)
(保靖县,7258)
(吉首市,22435)
(凤凰县,10548)
(泸溪县,7102)
(古丈县,3721)
(134,5364)
(135,8059)
(136,1902)
(137,12438)
(139,7744)
(147,2921)
(150,11100)
(151,9072)
(152,8147)
(157,712)
(158,7192)
(159,4850)
(182,1740)
(183,1)
(187,4735)
(188,131)
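A minimal sketch of how the two result directories could be read back, merged, and printed. MergeResults is a hypothetical helper, not part of the submitted code; the paths match the jobs above:

package io.github.sparktrain

import org.apache.spark.{SparkConf, SparkContext}

object MergeResults {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MergeResults").setMaster("local")
    val sc = new SparkContext(conf)
    // saveAsTextFile wrote part-* files under each directory; textFile reads them back
    val person = sc.textFile("/home/sunyonggang/sparkdata/resultPerson")
    val phone = sc.textFile("/home/sunyonggang/sparkdata/resultPhone")
    person.union(phone).collect().foreach(println)
    sc.stop()
  }
}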
Submitting the code and the runtime environment
- Code as above
- Run output (spark-submit on a single node):
sunyonggang@gg01:~$ spark-submit --class io.github.sparktrain.PersonCount /home/sunyonggang/IdeaProjects/SparkTest/target/scala-2.10/sparktest_2.10-1.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/04/18 07:41:01 INFO SparkContext: Running Spark version 1.4.0
16/04/18 07:41:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/18 07:41:02 INFO SecurityManager: Changing view acls to: sunyonggang
16/04/18 07:41:02 INFO SecurityManager: Changing modify acls to: sunyonggang
16/04/18 07:41:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sunyonggang); users with modify permissions: Set(sunyonggang)
16/04/18 07:41:02 INFO Slf4jLogger: Slf4jLogger started
16/04/18 07:41:02 INFO Remoting: Starting remoting
16/04/18 07:41:02 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:58325]
16/04/18 07:41:02 INFO Utils: Successfully started service 'sparkDriver' on port 58325.
16/04/18 07:41:03 INFO SparkEnv: Registering MapOutputTracker
16/04/18 07:41:03 INFO SparkEnv: Registering BlockManagerMaster
16/04/18 07:41:03 INFO DiskBlockManager: Created local directory at /tmp/spark-d67e031f-2814-44ff-9d61-5c1236b97014/blockmgr-230dc1be-f663-4df9-bed4-20b5687a0925
16/04/18 07:41:03 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
16/04/18 07:41:03 INFO HttpFileServer: HTTP File server directory is /tmp/spark-d67e031f-2814-44ff-9d61-5c1236b97014/httpd-120951ce-1a2c-4aa0-b459-0fe29a319045
16/04/18 07:41:03 INFO HttpServer: Starting HTTP Server
16/04/18 07:41:03 INFO Utils: Successfully started service 'HTTP file server' on port 39301.
16/04/18 07:41:03 INFO SparkEnv: Registering OutputCommitCoordinator
16/04/18 07:41:04 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/18 07:41:04 INFO SparkUI: Started SparkUI at http://192.168.199.150:4040
16/04/18 07:41:04 INFO SparkContext: Added JAR file:/home/sunyonggang/IdeaProjects/SparkTest/target/scala-2.10/sparktest_2.10-1.0.jar at http://192.168.199.150:39301/jars/sparktest_2.10-1.0.jar with timestamp 1460936464170
16/04/18 07:41:04 INFO Executor: Starting executor ID driver on host localhost
16/04/18 07:41:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43564.
16/04/18 07:41:05 INFO NettyBlockTransferService: Server created on 43564
16/04/18 07:41:05 INFO BlockManagerMaster: Trying to register BlockManager
16/04/18 07:41:05 INFO BlockManagerMasterEndpoint: Registering block manager localhost:43564 with 267.3 MB RAM, BlockManagerId(driver, localhost, 43564)
16/04/18 07:41:05 INFO BlockManagerMaster: Registered BlockManager
16/04/18 07:41:06 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0, maxMem=280248975
16/04/18 07:41:06 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 267.1 MB)
16/04/18 07:41:06 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=157248, maxMem=280248975
16/04/18 07:41:06 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 267.1 MB)
16/04/18 07:41:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:43564 (size: 13.9 KB, free: 267.3 MB)
16/04/18 07:41:06 INFO SparkContext: Created broadcast 0 from textFile at PersonCount.scala:17
16/04/18 07:41:06 INFO FileInputFormat: Total input paths to process : 1
16/04/18 07:41:07 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/04/18 07:41:07 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/04/18 07:41:07 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/04/18 07:41:07 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/04/18 07:41:07 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/04/18 07:41:07 INFO SparkContext: Starting job: saveAsTextFile at PersonCount.scala:19
16/04/18 07:41:07 INFO DAGScheduler: Registering RDD 3 (map at PersonCount.scala:18)
16/04/18 07:41:07 INFO DAGScheduler: Got job 0 (saveAsTextFile at PersonCount.scala:19) with 1 output partitions (allowLocal=false)
16/04/18 07:41:07 INFO DAGScheduler: Final stage: ResultStage 1(saveAsTextFile at PersonCount.scala:19)
16/04/18 07:41:07 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/04/18 07:41:07 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/04/18 07:41:07 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at PersonCount.scala:18), which has no missing parents
16/04/18 07:41:07 INFO MemoryStore: ensureFreeSpace(3968) called with curMem=171505, maxMem=280248975
16/04/18 07:41:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.9 KB, free 267.1 MB)
16/04/18 07:41:07 INFO MemoryStore: ensureFreeSpace(2281) called with curMem=175473, maxMem=280248975
16/04/18 07:41:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB)
16/04/18 07:41:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:43564 (size: 2.2 KB, free: 267.3 MB)
16/04/18 07:41:07 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
16/04/18 07:41:07 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at PersonCount.scala:18)
16/04/18 07:41:07 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/04/18 07:41:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1470 bytes)
16/04/18 07:41:07 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/04/18 07:41:07 INFO Executor: Fetching http://192.168.199.150:39301/jars/sparktest_2.10-1.0.jar with timestamp 1460936464170
16/04/18 07:41:08 INFO Utils: Fetching http://192.168.199.150:39301/jars/sparktest_2.10-1.0.jar to /tmp/spark-d67e031f-2814-44ff-9d61-5c1236b97014/userFiles-d96c1324-1abf-46df-b152-5c1d2c4058cf/fetchFileTemp5786246817305865613.tmp
16/04/18 07:41:08 INFO Executor: Adding file:/tmp/spark-d67e031f-2814-44ff-9d61-5c1236b97014/userFiles-d96c1324-1abf-46df-b152-5c1d2c4058cf/sparktest_2.10-1.0.jar to class loader
16/04/18 07:41:08 INFO HadoopRDD: Input split: file:/home/sunyonggang/sparkdata/users.txt:0+5793569
16/04/18 07:41:09 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2001 bytes result sent to driver
16/04/18 07:41:09 INFO DAGScheduler: ShuffleMapStage 0 (map at PersonCount.scala:18) finished in 1.467 s
16/04/18 07:41:09 INFO DAGScheduler: looking for newly runnable stages
16/04/18 07:41:09 INFO DAGScheduler: running: Set()
16/04/18 07:41:09 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/04/18 07:41:09 INFO DAGScheduler: failed: Set()
16/04/18 07:41:09 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1400 ms on localhost (1/1)
16/04/18 07:41:09 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/04/18 07:41:09 INFO DAGScheduler: Missing parents for ResultStage 1: List()
16/04/18 07:41:09 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at PersonCount.scala:19), which is now runnable
16/04/18 07:41:09 INFO MemoryStore: ensureFreeSpace(127560) called with curMem=177754, maxMem=280248975
16/04/18 07:41:09 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 124.6 KB, free 267.0 MB)
16/04/18 07:41:09 INFO MemoryStore: ensureFreeSpace(42784) called with curMem=305314, maxMem=280248975
16/04/18 07:41:09 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 41.8 KB, free 266.9 MB)
16/04/18 07:41:09 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:43564 (size: 41.8 KB, free: 267.2 MB)
16/04/18 07:41:09 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
16/04/18 07:41:09 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at PersonCount.scala:19)
16/04/18 07:41:09 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/04/18 07:41:09 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1231 bytes)
16/04/18 07:41:09 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/04/18 07:41:09 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/04/18 07:41:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
16/04/18 07:41:09 INFO FileOutputCommitter: Saved output of task 'attempt_201604180741_0001_m_000000_1' to file:/home/sunyonggang/sparkdata/resultPerson/_temporary/0/task_201604180741_0001_m_000000
16/04/18 07:41:09 INFO SparkHadoopMapRedUtil: attempt_201604180741_0001_m_000000_1: Committed
16/04/18 07:41:09 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1828 bytes result sent to driver
16/04/18 07:41:10 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at PersonCount.scala:19) finished in 0.373 s
16/04/18 07:41:10 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 403 ms on localhost (1/1)
16/04/18 07:41:10 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/04/18 07:41:10 INFO DAGScheduler: Job 0 finished: saveAsTextFile at PersonCount.scala:19, took 2.532049 s
16/04/18 07:41:10 INFO SparkContext: Invoking stop() from shutdown hook
16/04/18 07:41:10 INFO SparkUI: Stopped Spark web UI at http://192.168.199.150:4040
16/04/18 07:41:10 INFO DAGScheduler: Stopping DAGScheduler
16/04/18 07:41:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/04/18 07:41:10 INFO Utils: path = /tmp/spark-d67e031f-2814-44ff-9d61-5c1236b97014/blockmgr-230dc1be-f663-4df9-bed4-20b5687a0925, already present as root for deletion.
16/04/18 07:41:10 INFO MemoryStore: MemoryStore cleared
16/04/18 07:41:10 INFO BlockManager: BlockManager stopped
16/04/18 07:41:10 INFO BlockManagerMaster: BlockManagerMaster stopped
16/04/18 07:41:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/04/18 07:41:10 INFO SparkContext: Successfully stopped SparkContext
16/04/18 07:41:10 INFO Utils: Shutdown hook called
16/04/18 07:41:10 INFO Utils: Deleting directory /tmp/spark-d67e031f-2814-44ff-9d61-5c1236b97014
Problems
- All the runs above were executed on a single machine of the cluster;
- Running on the whole cluster fails.
After changing to setMaster("spark://gg01:7077"), the run produced:
java.lang.ClassNotFoundException: io.github.sparktrain.PhoneGroup$$anonfun$2
I don't know how to solve this problem.
Addendum: reading from the local filesystem is what fails; if the file is put on HDFS and read from there, everything works.
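A likely cause (an assumption on my part, not verified against this cluster) is that with a hard-coded setMaster the application jar is never shipped to the executors, so compiler-generated classes like PhoneGroup$$anonfun$2 cannot be loaded on the worker side. One way to ship it explicitly is SparkConf.setJars; alternatively, leave the master out of the code and pass --master spark://gg01:7077 to spark-submit, which distributes the jar automatically. A sketch of the setJars variant:

import org.apache.spark.{SparkConf, SparkContext}

object PhoneGroupCluster {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("PhoneGroupSorted")
      .setMaster("spark://gg01:7077")
      // Ship the application jar to every executor so its classes,
      // including the anonymous function classes, resolve remotely
      .setJars(Seq("/home/sunyonggang/IdeaProjects/SparkTest/target/scala-2.10/sparktest_2.10-1.0.jar"))
    val sc = new SparkContext(conf)
    // ... same job body as PhoneGroup above ...
    sc.stop()
  }
}

This would also fit the addendum: a local file path must exist on every worker that reads it, whereas an HDFS path is visible cluster-wide, which would explain why reading from HDFS works.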