Writing and Running a Spark Application in Java

Let's start with a simple requirement:
We want to analyze a website's access logs, count the number of visits coming from each distinct IP address, and then use Geo information to determine the country/region distribution of the visitors. Here are a few sample lines from my site's log:

121.205.198.92 - - [21/Feb/2014:00:00:07 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
121.205.198.92 - - [21/Feb/2014:00:00:11 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 26 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
121.205.198.92 - - [21/Feb/2014:00:00:12 +0800] "GET /archives/417.html/ HTTP/1.1" 301 26 "http://shiyanjun.cn/archives/417.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
121.205.198.92 - - [21/Feb/2014:00:00:12 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 "http://shiyanjun.cn/archives/417.html" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
121.205.241.229 - - [21/Feb/2014:00:00:13 +0800] "GET /archives/526.html HTTP/1.1" 200 12080 "http://shiyanjun.cn/archives/526.html/" "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
121.205.241.229 - - [21/Feb/2014:00:00:15 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 26 "http://shiyanjun.cn/archives/526.html/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
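As a quick aside, a minimal standalone sketch (not part of the job itself) shows that extracting the IP is just a matter of splitting each line on spaces and taking the first field:

public class ExtractIpDemo {
     public static void main(String[] args) {
          String line = "121.205.198.92 - - [21/Feb/2014:00:00:07 +0800] "
                    + "\"GET /archives/417.html HTTP/1.1\" 200 11465";
          // Combined-log-format fields are space-separated; field 0 is the client IP.
          String ip = line.split(" ")[0];
          System.out.println(ip); // prints: 121.205.198.92
     }
}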

Implementing the Spark Application in Java

Our statistics program covers the following functional points:

  • Read the log data file from HDFS
  • Extract the first field of each line (the IP address)
  • Count the occurrences of each IP address
  • Sort the IP addresses by occurrence count in descending order
  • Look up the country each IP address belongs to via the GeoIP library
  • Print the results, one line per entry, in the format: [country code] IP address count

The statistics application implemented in Java is shown below:

package org.shirdrn.spark.job;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.shirdrn.spark.job.maxmind.Country;
import org.shirdrn.spark.job.maxmind.LookupService;

import scala.Serializable;
import scala.Tuple2;

public class IPAddressStats implements Serializable {

     private static final long serialVersionUID = 8533489548835413763L;
     private static final Log LOG = LogFactory.getLog(IPAddressStats.class);
     private static final Pattern SPACE = Pattern.compile(" ");
     private transient LookupService lookupService;
     private transient final String geoIPFile;

     public IPAddressStats(String geoIPFile) {
          this.geoIPFile = geoIPFile;
          try {
               // lookupService: get the country code for an IP address
               File file = new File(this.geoIPFile);
               LOG.info("GeoIP file: " + file.getAbsolutePath());
               lookupService = new LookupService(file, LookupService.GEOIP_MEMORY_CACHE);
          } catch (IOException e) {
               throw new RuntimeException(e);
          }
     }

     @SuppressWarnings("serial")
     public void stat(String[] args) {
          JavaSparkContext ctx = new JavaSparkContext(args[0], "IPAddressStats",
                    System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(IPAddressStats.class));
          JavaRDD<String> lines = ctx.textFile(args[1], 1);

          // split each line and extract the IP address field
          JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
               @Override
               public Iterable<String> call(String s) {
                    // e.g. 121.205.198.92 - - [21/Feb/2014:00:00:07 +0800] "GET /archives/417.html HTTP/1.1" 200 11465 ...
                    // the IP address is the first space-separated field
                    return Arrays.asList(SPACE.split(s)[0]);
               }
          });

          // map: emit an (ip, 1) pair for each occurrence
          JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
               @Override
               public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
               }
          });

          // reduce: sum the counts for each IP address
          JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
               @Override
               public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
               }
          });

          List<Tuple2<String, Integer>> output = counts.collect();

          // sort the statistics by count, descending
          Collections.sort(output, new Comparator<Tuple2<String, Integer>>() {
               @Override
               public int compare(Tuple2<String, Integer> t1, Tuple2<String, Integer> t2) {
                    if (t1._2 < t2._2) {
                         return 1;
                    } else if (t1._2 > t2._2) {
                         return -1;
                    }
                    return 0;
               }
          });

          writeTo(args, output);
     }

     private void writeTo(String[] args, List<Tuple2<String, Integer>> output) {
          for (Tuple2<String, Integer> tuple : output) {
               Country country = lookupService.getCountry(tuple._1);
               LOG.info("[" + country.getCode() + "] " + tuple._1 + "\t" + tuple._2);
          }
     }

     public static void main(String[] args) {
          // e.g. ./bin/run-my-java-example org.shirdrn.spark.job.IPAddressStats spark://m1:7077 hdfs://m1:9000/user/shirdrn/wwwlog20140222.log /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat
          if (args.length < 3) {
               System.err.println("Usage: IPAddressStats <master> <inputFile> <GeoIPFile>");
               System.err.println("    Example: org.shirdrn.spark.job.IPAddressStats spark://m1:7077 hdfs://m1:9000/user/shirdrn/wwwlog20140222.log /home/shirdrn/cloud/programs/spark-0.9.0-incubating-bin-hadoop1/java-examples/GeoIP_DATABASE.dat");
               System.exit(1);
          }

          String geoIPFile = args[2];
          IPAddressStats stats = new IPAddressStats(geoIPFile);
          stats.stat(args);

          System.exit(0);
     }
}
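A note on the design: the program collects all (IP, count) pairs to the driver and sorts them locally with Collections.sort, which is fine when the number of distinct IPs is modest. If the key space were large, the sort could stay on the cluster instead. A hedged sketch against the Spark 0.9 Java API (not part of the original program) would swap each pair to (count, IP) and use sortByKey:

// Sketch only: 'counts' is the JavaPairRDD<String, Integer> built above.
JavaPairRDD<Integer, String> swapped = counts.map(
          new PairFunction<Tuple2<String, Integer>, Integer, String>() {
               @Override
               public Tuple2<Integer, String> call(Tuple2<String, Integer> t) {
                    return new Tuple2<Integer, String>(t._2, t._1); // (count, ip)
               }
          });
// false = descending order; collect only after the distributed sort
List<Tuple2<Integer, String>> sorted = swapped.sortByKey(false).collect();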

The implementation logic is explained in the comments in the code. We use Maven to manage and build the Java program; first, here are the dependencies declared in my pom configuration:

<dependencies>
     <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.10</artifactId>
          <version>0.9.0-incubating</version>
     </dependency>
     <dependency>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
          <version>1.2.16</version>
     </dependency>
     <dependency>
          <groupId>dnsjava</groupId>
          <artifactId>dnsjava</artifactId>
          <version>2.1.1</version>
     </dependency>
     <dependency>
          <groupId>commons-net</groupId>
          <artifactId>commons-net</artifactId>
          <version>3.1</version>
     </dependency>
     <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>1.2.1</version>
     </dependency>
</dependencies>

One thing to note: when the program runs on a Spark cluster, the Job we write must be serializable. If a field doesn't need to be serialized, or cannot be, simply mark it transient. In the code above, lookupService does not implement the Serializable interface, so we mark it transient to exclude it from serialization; otherwise you may see an error like the one below (a small sketch of the transient pattern follows the stack trace):

14/03/10 22:34:06 INFO scheduler.DAGScheduler: Failed to run collect at IPAddressStats.java:76
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.shirdrn.spark.job.IPAddressStats
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:741)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:740)
     at scala.collection.immutable.List.foreach(List.scala:318)
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:740)
     at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
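To make the transient pattern concrete, here is a minimal hedged sketch (the class name is hypothetical, not the author's code): the non-serializable handle is excluded from serialization and re-created lazily on whichever JVM actually needs it. In the program above this isn't even necessary, since lookupService is only used in writeTo on the driver after collect(), so the instance created in the constructor is the one that gets used.

import java.io.File;
import java.io.IOException;
import java.io.Serializable;

import org.shirdrn.spark.job.maxmind.LookupService;

// Hypothetical illustration of the transient + lazy re-initialization pattern.
public class GeoLookupHolder implements Serializable {
     private static final long serialVersionUID = 1L;
     private final String geoIPFile;
     // transient: skipped during serialization, so it is null after deserialization
     private transient LookupService lookupService;

     public GeoLookupHolder(String geoIPFile) {
          this.geoIPFile = geoIPFile;
     }

     private LookupService lookup() throws IOException {
          if (lookupService == null) { // re-open lazily after deserialization
               lookupService = new LookupService(new File(geoIPFile),
                         LookupService.GEOIP_MEMORY_CACHE);
          }
          return lookupService;
     }
}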
