Running WordCount in IntelliJ IDEA (Detailed Version)

Act with measure in all things, and you will not go far wrong;
enough food and clothing for a lifetime is contentment enough.

Related Links

HDFS Basics

  • Hadoop Distributed File System (HDFS) Quick Start
  • Hadoop Distributed File System (HDFS) Knowledge Roundup (Very Detailed)

Connecting to a Hadoop Cluster

  • Connecting Eclipse to a Hadoop Cluster
  • Connecting IntelliJ IDEA to a Hadoop Cluster

HDFS Java API

Hadoop Distributed File System (HDFS) Java Interface (HDFS Java API), Detailed Version

WordCount Program Analysis

Writing a WordCount Program with the Java API

Running WordCount in IntelliJ IDEA

File Download

  • WordCount.java (extraction code: 2kwo)

Detailed Steps

Note: complete all of the steps in Connecting IntelliJ IDEA to a Hadoop Cluster before starting the steps below.

  1. Under the java directory, create a package cn.neu (any package name works), create a sub-package connection.test inside it (again, any name works), and copy the downloaded WordCount.java file into that package
    (Figure 1)
  2. WordCount.java reads two argument values in the code below, so the program arguments need to be configured

FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
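
These otherArgs values come from GenericOptionsParser in the main method: it strips any generic Hadoop options (such as -D key=value) from the command line, and getRemainingArgs() returns whatever is left, so the two paths you set in Program arguments arrive as otherArgs[0] (the input path) and otherArgs[1] (the output path).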

Click "Edit Configurations" in the upper-right corner of IDEA
Click the "+" in the upper-left corner of the pop-up window and choose "Application"
(Figure 2)
In Main class, enter the fully qualified name of the WordCount class; in Program arguments, enter the program's two arguments, i.e. the input path and the output path
Taking my Program arguments as an example: "/input/data.txt" "/output/temp"; the temp directory must not already exist under my output directory
(Figure 3)
Note: the file in the input path must already exist on HDFS, and the last-level directory of the output path must NOT exist, otherwise the job fails with a "directory already exists" error!
Example of the error:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master:9000/output/temp already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:562)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
	at cn.neu.connection.test.WordCount.main(WordCount.java:63)
  3. Click Run. If an org.apache.hadoop.security.AccessControlException: Permission denied: user=... error appears, add System.setProperty("HADOOP_USER_NAME", "root"); as the first line of the main method (see the sketch after the log excerpt below). Here root is the user that Hadoop runs as on the remote virtual machine, so fill in whatever matches your own setup
    Example of a successful run:

DEBUG - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2

DEBUG - IPC Client (2141179775) connection to master/172.16.29.94:9000 from Lenovo: closed
DEBUG - IPC Client (2141179775) connection to master/172.16.29.94:9000 from Lenovo: stopped, remaining connections 0
Process finished with exit code 0
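
For reference, here is a minimal, self-contained sketch of the permission workaround: it sets HADOOP_USER_NAME before any HDFS access and then simply lists the root directory to confirm the connection works. The class name PermissionCheck, the hdfs://master:9000 address, and the root user are assumptions carried over from this walkthrough; adjust them to your own cluster.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class, not part of the downloaded WordCount.java
public class PermissionCheck {
	public static void main(String[] args) throws Exception {
		// Must be set before the FileSystem is created; "root" is the HDFS user
		// on the remote virtual machine in this walkthrough -- use your own value
		System.setProperty("HADOOP_USER_NAME", "root");

		Configuration conf = new Configuration();
		// Cluster address used elsewhere in this article; change it if yours differs
		FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), conf);
		for (FileStatus status : fs.listStatus(new Path("/"))) {
			System.out.println(status.getPath());
		}
		fs.close();
	}
}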

  4. Use XShell or another terminal emulator to connect to the virtual machine hosting the Hadoop cluster and check the program's output
[root@master ~]# hadoop fs -ls /output/temp
Found 2 items
-rw-r--r--   3 Lenovo supergroup          0 2019-10-17 15:48 /output/temp/_SUCCESS
-rw-r--r--   3 Lenovo supergroup     356409 2019-10-17 15:48 /output/temp/part-r-00000
[root@master ~]# hadoop fs -cat /output/temp/part-r-00000
...
zone	1
zur	2
zwaggered	1

You can also run hadoop fs -get /output/temp/part-r-00000 result.txt to copy the result file onto the remote virtual machine and then transfer it to your local machine with Xftp or another SFTP tool; alternatively, read the output directly from IDEA with the HDFS Java API, as sketched below
(Figure 4)
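
If you would rather check the result without leaving IDEA, the HDFS Java API from the link above can read the output file directly. The sketch below is a minimal example under the same assumptions as before (user root, address hdfs://master:9000, output at /output/temp); it is a separate helper class, not part of WordCount.java.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical helper class for inspecting the job output from the IDE
public class ReadResult {
	public static void main(String[] args) throws Exception {
		System.setProperty("HADOOP_USER_NAME", "root"); // same assumption as in step 3

		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), conf);
		// The single reducer writes its result to part-r-00000 under the output path
		try (FSDataInputStream in = fs.open(new Path("/output/temp/part-r-00000"))) {
			IOUtils.copyBytes(in, System.out, 4096, false);
		}
		fs.close();
	}
}
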
  5. To run the job again, either change the output directory in the run configuration or delete the previous output with hadoop fs -rm -r /output/temp
A once-and-for-all alternative is a small change to the main method: before each run, check whether the output path exists and delete it if it does
Before the change:

		Configuration conf = new Configuration();
		String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
		if(otherArgs.length != 2){
			System.err.println("Usage WordCount  ");
			System.exit(2);
		}

After the change:

		Configuration conf = new Configuration();
		// Requires an additional import: org.apache.hadoop.fs.FileSystem
		FileSystem fs = FileSystem.get(conf);
		String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
		if(otherArgs.length != 2){
			System.err.println("Usage: WordCount <in> <out>");
			System.exit(2);
		}
		Path outPath = new Path(otherArgs[1]);
		// Delete the output path (recursively) if it already exists,
		// so the job no longer fails with FileAlreadyExistsException
		if(fs.exists(outPath)) {
			fs.delete(outPath, true);
		}
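
Note that the second argument to fs.delete is the recursive flag, so the entire output directory (including part-r-00000 and _SUCCESS) is removed on every run; make sure otherArgs[1] never points at a directory you want to keep.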

If you have any questions, leave a comment below or send me a private message and I will reply as soon as I can.
Comments and pointers from veterans and newcomers alike are welcome!
Please follow, like, and bookmark!
