http://archive.cloudera.com/cdh5/cdh/5/
From the CDH archive above, pick the version used throughout this course: hadoop-2.6.0-cdh5.7.0.
Extract the JDK: tar -zxvf jdk-7u79-linux-x64.tar.gz -C ~/app/
cd into the extracted directory and run pwd; use that absolute path for the environment variable below.
Configure the environment variables: vim ~/.bash_profile
export JAVA_HOME=.../jdk1.7.0_79
export PATH=$JAVA_HOME/bin:$PATH
Apply the changes: source ~/.bash_profile
echo $JAVA_HOME
Check that the JDK is configured: java -version
yum install -y ssh
ssh-keygen -t rsa
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
ssh localhost
or: ssh <hostname>
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.0/hadoop-project-dist/hadoop-common/SingleCluster.html
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.0.tar.gz
Extract Hadoop: tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C ~/app/
cd into the extracted hadoop directory, then into its configuration directory:
cd etc/hadoop
vim hadoop-env.sh
Set the JAVA_HOME variable in it.
vim core-site.xml
# set the NameNode hostname and port
<property>
<name>fs.default.name</name>
<value>hdfs://<namenode-hostname>:port</value> // e.g. 8020
</property>
# temporary data directory
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/app/tmp</value>
</property>
vim hdfs-site.xml
Set the replication factor to 1 (the default is 3):
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
vim slaves
List one hostname per DataNode here; for this single-node setup the file can stay as it is for now.
Start HDFS. First cd into the hadoop directory:
bin/hdfs namenode -format
sbin/start-dfs.sh
Verify that it started:
Option 1:
jps
DataNode
SecondaryNameNode
NameNode
Option 2: open http://ip:50070 in a browser
Stop HDFS:
sbin/stop-dfs.sh
Add Hadoop's bin directory to the environment variables, using the same procedure as for the JDK:
vim ~/.bash_profile
export HADOOP_HOME=/.../hadoop...cdh5.7.0
export PATH=$HADOOP_HOME/bin:$PATH
source ~/.bash_profile
hdfs dfs
or: hadoop fs
These are the commands for working with HDFS; run them without arguments to see the usage and help text.
Create a hello.txt file under /data: vim hello.txt
List the HDFS root directory: hadoop fs -ls /
Upload hello.txt from /data to HDFS: hadoop fs -put hello.txt /
View the file content on HDFS: hadoop fs -text /hello.txt
Create a directory on HDFS: hadoop fs -mkdir /test
Create directories recursively: hadoop fs -mkdir -p /test/a/b
List a directory recursively: hadoop fs -ls -R /
(or use the -lsr option)
Another way to copy a local file to HDFS: hadoop fs -copyFromLocal hello.txt /test/a/b/h.txt
View the copied file: hadoop fs -cat /test/a/b/h.txt
Download a file from HDFS to the local machine: hadoop fs -get /test/a/b/h.txt
Delete a file: hadoop fs -rm /hello.txt
Delete a directory recursively: hadoop fs -rm -R /test
Browse the files from the web UI at http://ip:50070,
under the Utilities menu.
Upload a large file (e.g. the Hadoop tarball) and look at its size and block information.
Manage the dependency version in one place and add the Hadoop dependency:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.imooc.hadoop</groupId>
<artifactId>hadoop-train</artifactId>
<version>1.0</version>
<packaging>jar</packaging>
<name>hadoop-train</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
</properties>
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
Work with HDFS from Java, driven by JUnit tests:
// HDFSApp.java
package com.imooc.hadoop.hdfs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
/**
 * Hadoop HDFS Java API operations
*/
public class HDFSApp {
public static final String HDFS_PATH = "hdfs://ip:8020";
FileSystem fileSystem = null;
Configuration configuration = null;
/**
 * Create an HDFS directory
*/
@Test
public void mkdir() throws Exception {
fileSystem.mkdirs(new Path("/hdfsapi/test"));
}
/**
 * Create a file
*/
@Test
public void create() throws Exception {
FSDataOutputStream output = fileSystem.create(new Path("/hdfsapi/test/a.txt"));
output.write("hello hadoop".getBytes());
output.flush();
output.close();
}
/**
 * View the content of an HDFS file
*/
@Test
public void cat() throws Exception {
FSDataInputStream in = fileSystem.open(new Path("/hdfsapi/test/a.txt"));
IOUtils.copyBytes(in, System.out, 1024);
in.close();
}
/**
 * Rename a file
*/
@Test
public void rename() throws Exception {
Path oldPath = new Path("/hdfsapi/test/a.txt");
Path newPath = new Path("/hdfsapi/test/b.txt");
fileSystem.rename(oldPath, newPath);
}
/**
 * Upload a file to HDFS
*
* @throws Exception
*/
@Test
public void copyFromLocalFile() throws Exception {
Path localPath = new Path("/Users/rocky/data/hello.txt");
Path hdfsPath = new Path("/hdfsapi/test");
fileSystem.copyFromLocalFile(localPath, hdfsPath);
}
/**
 * Upload a file to HDFS with progress output
*/
@Test
public void copyFromLocalFileWithProgress() throws Exception {
InputStream in = new BufferedInputStream(
new FileInputStream(
new File("/Users/rocky/source/spark-1.6.1/spark-1.6.1-bin-2.6.0-cdh5.5.0.tgz")));
FSDataOutputStream output = fileSystem.create(new Path("/hdfsapi/test/spark-1.6.1.tgz"), new Progressable() {
public void progress() {
System.out.print("."); // progress indicator
}});
IOUtils.copyBytes(in, output, 4096);
}
/**
 * Download an HDFS file
*/
@Test
public void copyToLocalFile() throws Exception {
Path localPath = new Path("/Users/rocky/tmp/h.txt");
Path hdfsPath = new Path("/hdfsapi/test/hello.txt");
fileSystem.copyToLocalFile(hdfsPath, localPath);
}
/**
 * List all files under a directory
*/
@Test
public void listFiles() throws Exception {
FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/"));
for(FileStatus fileStatus : fileStatuses) {
String isDir = fileStatus.isDirectory() ? "directory" : "file";
short replication = fileStatus.getReplication();
long len = fileStatus.getLen();
String path = fileStatus.getPath().toString();
System.out.println(isDir + "\t" + replication + "\t" + len + "\t" + path);
}
}
/**
 * Delete (recursively)
*/
@Test
public void delete() throws Exception{
fileSystem.delete(new Path("/hdfsapi/test"), true); // deleting "/" here would wipe the whole filesystem
}
@Before
public void setUp() throws Exception {
System.out.println("HDFSApp.setUp");
configuration = new Configuration();
fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "hadoop");
}
@After
public void tearDown() throws Exception {
configuration = null;
fileSystem = null;
System.out.println("HDFSApp.tearDown");
}
}
HDFS architecture
1 master (NameNode/NN) with N slaves (DataNode/DN)
The same master/slaves pattern is used by HDFS, YARN and HBase.
A file is split into multiple blocks
blocksize: 128M
130M ==> 2 blocks: 128M and 2M
NN:
1) responds to client requests
2) manages the metadata (file names, replication factor, which DNs hold each block)
DN:
1) stores the data blocks that make up the users' files
2) periodically sends heartbeats to the NN, reporting itself, all of its blocks and its health
A typical deployment has a dedicated machine that runs only the NameNode software.
Each of the other machines in the cluster runs one instance of the DataNode software.
The architecture does not preclude running multiple DataNodes on the same machine
but in a real deployment that is rarely the case.
NameNode + N DataNodes
Recommendation: deploy the NN and the DNs on different nodes
replication factor
All blocks in a file except the last block are the same size
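To see that metadata from the client side, a test like the one below could be added to the HDFSApp class shown earlier; this is a sketch rather than course code, and it assumes /hello.txt already exists on HDFS (it reuses the class's fileSystem field and the org.apache.hadoop.fs.* import):
/**
 * Inspect block size, replication and block locations of an HDFS file
 */
@Test
public void blockInfo() throws Exception {
    FileStatus status = fileSystem.getFileStatus(new Path("/hello.txt"));
    System.out.println("block size : " + status.getBlockSize());   // 128M by default
    System.out.println("replication: " + status.getReplication());
    // one entry per block, listing the DataNodes that hold a replica of it
    for (BlockLocation block : fileSystem.getFileBlockLocations(status, 0, status.getLen())) {
        System.out.println(block.getOffset() + " len=" + block.getLength()
                + " hosts=" + java.util.Arrays.toString(block.getHosts()));
    }
}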
Hadoop pseudo-distributed installation steps
1) Install the JDK
Extract: tar -zxvf jdk-7u79-linux-x64.tar.gz -C ~/app
Add it to the system environment variables: ~/.bash_profile
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_79
export PATH=$JAVA_HOME/bin:$PATH
Make the variables take effect: source ~/.bash_profile
Verify the Java setup: java -version
2) Install ssh
sudo yum install ssh
ssh-keygen -t rsa
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
3) Download and extract Hadoop
Download: from the CDH site
Extract: tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C ~/app
4) Edit the Hadoop configuration files (hadoop_home/etc/hadoop)
hadoop-env.sh
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_79
core-site.xml
fs.defaultFS
hdfs://ip:8020
hadoop.tmp.dir
/home/hadoop/app/tmp
hdfs-site.xml
dfs.replication
1
slaves
5) Start HDFS
Format the filesystem (only the very first time, never again): hdfs namenode -format
Start HDFS: sbin/start-dfs.sh
Verify that it started:
jps
DataNode
SecondaryNameNode
NameNode
Browser: http://ip:50070
6) Stop HDFS
sbin/stop-dfs.sh
Basic usage of the Hadoop shell
hdfs dfs
hadoop fs
Working with HDFS files through the Java API
file      1  311585484  hdfs://hadoop000:8020/hadoop-2.6.0-cdh5.7.0.tar.gz
directory 0  0          hdfs://hadoop000:8020/hdfsapi
file      1  49         hdfs://hadoop000:8020/hello.txt
file      1  40762      hdfs://hadoop000:8020/install.log
Question: we already set the replication factor to 1 in hdfs-site.xml, so why do some files show a factor of 3?
Files put via the HDFS shell use the replication factor configured on the server, which is 1 here.
Files uploaded through the Java API, where we never set a replication factor on the client, use the client-side default instead, which is 3.
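If you want Java-API uploads to use a single replica as well, one option is to set dfs.replication on the client-side Configuration before obtaining the FileSystem, for example by adjusting the setUp() of the HDFSApp class above. This is a sketch, not the course's original code:
@Before
public void setUp() throws Exception {
    System.out.println("HDFSApp.setUp");
    configuration = new Configuration();
    // client-side setting: files written through this FileSystem get 1 replica
    configuration.set("dfs.replication", "1");
    fileSystem = FileSystem.get(new URI(HDFS_PATH), configuration, "hadoop");
}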
1) vim mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
2) vim yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
3) Start the YARN processes
sbin/start-yarn.sh
4) Verify
1. jps
ResourceManager
NodeManager
2. http://ip:8088
5) Stop the YARN processes
sbin/stop-yarn.sh
hadoop jar
Submit a job: hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar pi 2 3
(estimates pi)
In Hadoop 1.x:
MapReduce used a Master/Slave architecture: 1 JobTracker with many TaskTrackers
JobTracker: responsible for both resource management and job scheduling
TaskTracker:
periodically reports its health, resource usage and task status to the JT;
receives commands from the JT: start a task / kill a task
YARN: different compute frameworks can share the data on one HDFS cluster and benefit from cluster-wide resource scheduling
Benefits of running XXX on YARN:
cluster resources are shared with other frameworks and allocated on demand, which improves overall cluster utilization
XXX: Spark/MapReduce/Storm/Flink
YARN architecture:
1) ResourceManager: RM
only one RM serves the cluster at any given time; it manages and schedules the cluster's resources
handles client requests: submit a job, kill a job
monitors the NMs; if an NM dies, the RM tells the relevant AM how to handle the tasks that were running on it
2) NodeManager: NM
many per cluster; manages and uses the resources of its own node
periodically reports its node's resource usage to the RM
receives and handles commands from the RM: e.g. start a Container
handles commands from AMs
resource management for a single node
3) ApplicationMaster: AM
one per application (an MR job, a Spark application, ...); manages that application
requests resources (cores, memory) from the RM on behalf of the application and hands them to its tasks
talks to the NMs to start/stop tasks; the tasks run inside containers, and so does the AM itself
4) Container
encapsulates resources such as CPU and memory
an abstraction of a task's runtime environment
5) Client
submits jobs
queries job progress
kills jobs
Setting up the YARN environment
1) mapred-site.xml
mapreduce.framework.name
yarn
2) yarn-site.xml
yarn.nodemanager.aux-services
mapreduce_shuffle
3) Start the YARN processes
sbin/start-yarn.sh
4) Verify
jps
ResourceManager
NodeManager
http://ip:8088
5) Stop the YARN processes
sbin/stop-yarn.sh
Submit an MR job to YARN:
/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar
hadoop jar
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar pi 2 3
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
1)JobTracker: JT
2)TaskTracker: TT
3)MapTask
4)ReduceTask
package com.imooc.hadoop.mapreduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
 * A WordCount application built with MapReduce
 */
public class WordCountApp {
/**
 * Map: read the input file
 */
public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
LongWritable one = new LongWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// each incoming line of data
String line = value.toString();
// split on the given delimiter
String[] words = line.split(" ");
for(String word : words) {
// emit the map output through the context
context.write(new Text(word), one);
}
}
}
/**
 * Reduce: merge and aggregate
 */
public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long sum = 0;
for(LongWritable value : values) {
// sum up the occurrences of this key
sum += value.get();
}
// emit the final count
context.write(key, new LongWritable(sum));
}
}
/**
 * Driver: wires together all the information of the MapReduce job
 */
public static void main(String[] args) throws Exception{
// create the Configuration
Configuration configuration = new Configuration();
// create the Job
Job job = Job.getInstance(configuration, "wordcount");
// set the job's main class
job.setJarByClass(WordCountApp.class);
// set the input path
FileInputFormat.setInputPaths(job, new Path(args[0]));
// map settings
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
// reduce settings
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
// set the output path
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
mvn clean package -DskipTests
scp target/hadoop-train-1.0.jar hadoop@hadoop000:~/lib
hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar com.imooc.hadoop.mapreduce.WordCountApp hdfs://hadoop000:8020/hello.txt hdfs://hadoop000:8020/output/wc
com.imooc.hadoop.mapreduce.WordCountApp is the main class to run;
hdfs://hadoop000:8020/hello.txt is the input path (new Path(args[0]) in the code);
hdfs://hadoop000:8020/output/wc is the output path (new Path(args[1]) in the code).
Running the same code and script a second time fails with:
security.UserGroupInformation:
PriviledgedActionException as:hadoop (auth:SIMPLE) cause:
org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://hadoop000:8020/output/wc already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://hadoop000:8020/output/wc already exists
In MapReduce the output directory must not already exist.
1) Delete the output directory by hand via the shell before each run: hadoop fs -rm -r /output/wc
(this can be wrapped in a shell script that deletes and then submits)
2) Or have the code delete it automatically:
Path outputPath = new Path(args[1]);
FileSystem fileSystem = FileSystem.get(configuration);
if(fileSystem.exists(outputPath)){
fileSystem.delete(outputPath, true);
System.out.println("output file exists, but is has deleted");
}
Code:
package com.imooc.hadoop.mapreduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
 * A WordCount application built with MapReduce
 */
public class WordCount2App {
/**
 * Map: read the input file
 */
public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
LongWritable one = new LongWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// each incoming line of data
String line = value.toString();
// split on the given delimiter
String[] words = line.split(" ");
for(String word : words) {
// emit the map output through the context
context.write(new Text(word), one);
}
}
}
/**
 * Reduce: merge and aggregate
 */
public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long sum = 0;
for(LongWritable value : values) {
// sum up the occurrences of this key
sum += value.get();
}
// emit the final count
context.write(key, new LongWritable(sum));
}
}
/**
 * Driver: wires together all the information of the MapReduce job
 */
public static void main(String[] args) throws Exception{
// create the Configuration
Configuration configuration = new Configuration();
// remove the output directory if it already exists
Path outputPath = new Path(args[1]);
FileSystem fileSystem = FileSystem.get(configuration);
if(fileSystem.exists(outputPath)){
fileSystem.delete(outputPath, true);
System.out.println("output path exists, but it has been deleted");
}
// create the Job
Job job = Job.getInstance(configuration, "wordcount");
// set the job's main class
job.setJarByClass(WordCount2App.class);
// set the input path
FileInputFormat.setInputPaths(job, new Path(args[0]));
// map settings
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
// reduce settings
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
// set the output path
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar com.imooc.hadoop.mapreduce.CombinerApp hdfs://hadoop000:8020/hello.txt hdfs://hadoop000:8020/output/wc
package com.imooc.hadoop.mapreduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
 * A WordCount application built with MapReduce, extended with a combiner
 */
public class CombinerApp {
/**
 * Map: read the input file
 */
public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
LongWritable one = new LongWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// each incoming line of data
String line = value.toString();
// split on the given delimiter
String[] words = line.split(" ");
for(String word : words) {
// emit the map output through the context
context.write(new Text(word), one);
}
}
}
/**
 * Reduce: merge and aggregate
 */
public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long sum = 0;
for(LongWritable value : values) {
// sum up the occurrences of this key
sum += value.get();
}
// emit the final count
context.write(key, new LongWritable(sum));
}
}
/**
 * Driver: wires together all the information of the MapReduce job
 */
public static void main(String[] args) throws Exception{
// create the Configuration
Configuration configuration = new Configuration();
// remove the output directory if it already exists
Path outputPath = new Path(args[1]);
FileSystem fileSystem = FileSystem.get(configuration);
if(fileSystem.exists(outputPath)){
fileSystem.delete(outputPath, true);
System.out.println("output path exists, but it has been deleted");
}
// create the Job
Job job = Job.getInstance(configuration, "wordcount");
// set the job's main class
job.setJarByClass(CombinerApp.class);
// set the input path
FileInputFormat.setInputPaths(job, new Path(args[0]));
// map settings
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
// reduce settings
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
// set the combiner class on the job; logically it is exactly the same as our reducer
job.setCombinerClass(MyReducer.class);
// set the output path
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The Partitioner decides which reduce task gets each record emitted by the map tasks.
Default implementation: hash of the key modulo the number of reduce tasks — roughly the logic sketched below.
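For reference, Hadoop's default partitioner (HashPartitioner) boils down to the following; this is a sketch of its logic rather than code you need to add:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    // non-negative hash of the key, modulo the number of reduce tasks
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}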
Example: count two days of sales per phone brand.
Run: hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar com.imooc.hadoop.mapreduce.ParititonerApp hdfs://hadoop000:8020/partitioner hdfs://hadoop000:8020/output/partitioner
Code:
package com.imooc.hadoop.mapreduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class ParititonerApp {
/**
 * Map: read the input file
 */
public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// each incoming line: a brand and a count separated by a space
String line = value.toString();
// split on the given delimiter
String[] words = line.split(" ");
context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));
}
}
/**
 * Reduce: merge and aggregate
 */
public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long sum = 0;
for(LongWritable value : values) {
// sum up the values for this key
sum += value.get();
}
// emit the final count
context.write(key, new LongWritable(sum));
}
}
public static class MyPartitioner extends Partitioner<Text, LongWritable> {
@Override
public int getPartition(Text key, LongWritable value, int numPartitions) {
if(key.toString().equals("xiaomi")) {
return 0;
}
if(key.toString().equals("huawei")) {
return 1;
}
if(key.toString().equals("iphone7")) {
return 2;
}
return 3;
}
}
/**
 * Driver: wires together all the information of the MapReduce job
 */
public static void main(String[] args) throws Exception{
// create the Configuration
Configuration configuration = new Configuration();
// remove the output directory if it already exists
Path outputPath = new Path(args[1]);
FileSystem fileSystem = FileSystem.get(configuration);
if(fileSystem.exists(outputPath)){
fileSystem.delete(outputPath, true);
System.out.println("output path exists, but it has been deleted");
}
// create the Job
Job job = Job.getInstance(configuration, "wordcount");
// set the job's main class
job.setJarByClass(ParititonerApp.class);
// set the input path
FileInputFormat.setInputPaths(job, new Path(args[0]));
// map settings
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
// reduce settings
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
// set the job's partitioner
job.setPartitionerClass(MyPartitioner.class);
// 4 reducers, one per partition
job.setNumReduceTasks(4);
// set the output path
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
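For reference, the mapper above expects each input line to be a brand and a count separated by a single space. The lines below are made-up sample data, only to illustrate the expected format of the /partitioner input file:
xiaomi 200
huawei 300
xiaomi 100
iphone7 300
nokia 50
With the custom partitioner and 4 reduce tasks, xiaomi, huawei and iphone7 each land in their own output file (part-r-00000 to part-r-00002), and every other brand, e.g. nokia, ends up in part-r-00003.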
JobHistory server: records information about finished MapReduce jobs in a given HDFS directory.
It is not enabled by default.
Configure it: vim mapred-site.xml
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/history/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/history/done_intermediate</value>
</property>
Configure log aggregation for YARN: vim yarn-site.xml
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
Restart HDFS and YARN.
Start the history server: ./mr-jobhistory-daemon.sh start historyserver
After that, the History link in the YARN web UI works.
wordcount: count how many times each word appears in a file
Requirement: word count (wc)
1) small input: a shell one-liner is enough
2) very large input: GB, TB ... how do we run the count at that scale?
==> URL TOP-N <== an extension of wc
Many real-world jobs are built by adapting wordcount.
Solved with a distributed computing framework: MapReduce
divide and conquer
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Core concepts
Split: the chunk of data handed to a MapReduce job; the smallest unit of computation in MapReduce
HDFS blocksize: the smallest unit of storage in HDFS, 128M
By default they map one-to-one; the relationship can be changed by hand, though that is not recommended (see the sketch after the InputFormat notes below)
InputFormat:
splits the input data: InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
TextInputFormat: handles plain-text input
OutputFormat: writes the output
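If you ever do need to decouple splits from blocks, the standard MapReduce API lets the driver bound the split size. A minimal sketch, not part of the course configuration, that could be added to any of the driver main() methods below:
// with a 128M block size, this yields 64M splits (bounded between 32M and 64M)
FileInputFormat.setMinInputSplitSize(job, 32 * 1024 * 1024L);
FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);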
The MapReduce 1.x architecture
1) JobTracker: JT
the manager of jobs
splits a job into a set of tasks (MapTask and ReduceTask)
assigns the tasks to TaskTrackers to run
monitors jobs and handles faults (if a task dies, there is a mechanism to restart it)
if the JT receives no heartbeat from a TT within a certain interval, the TT is assumed dead and its tasks are reassigned to other TTs
2) TaskTracker: TT
the worker that executes tasks
runs the Tasks (MapTask and ReduceTask) on its node
interacts with the JT: starts/stops tasks as instructed and sends heartbeats to the JT
3) MapTask
runs the map code we wrote
parses each input record and passes it to our map method
writes the map output to local disk (jobs with only a map phase and no reduce write to HDFS instead)
4) ReduceTask
reads the output of the MapTasks
groups the data by key and passes it to our reduce method
writes the result to HDFS
Developing wc with IDEA + Maven:
1) write the code
2) build: mvn clean package -DskipTests
3) copy to the server: scp target/hadoop-train-1.0.jar hadoop@hadoop000:~/lib
4) run
hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar com.imooc.hadoop.mapreduce.WordCountApp hdfs://hadoop000:8020/hello.txt hdfs://hadoop000:8020/output/wc
Running the same code and script again fails with:
security.UserGroupInformation:
PriviledgedActionException as:hadoop (auth:SIMPLE) cause:
org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://hadoop000:8020/output/wc already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://hadoop000:8020/output/wc already exists
In MapReduce the output directory must not already exist.
1) Delete the output directory by hand via the shell first:
hadoop fs -rm -r /output/wc
2) Have the code delete it automatically — the recommended approach:
Path outputPath = new Path(args[1]);
FileSystem fileSystem = FileSystem.get(configuration);
if(fileSystem.exists(outputPath)){
fileSystem.delete(outputPath, true);
System.out.println("output file exists, but is has deleted");
}
Combiner
hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar com.imooc.hadoop.mapreduce.CombinerApp hdfs://hadoop000:8020/hello.txt hdfs://hadoop000:8020/output/wc
Use cases for a combiner:
sums and counts: yes (+)
averages: no (X) — see the worked example below
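A quick worked example of why averages break: say one map task emits 1, 2, 3 for a key and another emits 4, 5. The true average is 15 / 5 = 3, but averaging per map task in a combiner and then averaging again in the reducer gives (2 + 4.5) / 2 = 3.25. Sums are safe because addition is associative. A tiny illustration (illustrative arithmetic only, not course code):
double trueAvg      = (1 + 2 + 3 + 4 + 5) / 5.0;                  // 3.0
double combinerAvg  = ((1 + 2 + 3) / 3.0 + (4 + 5) / 2.0) / 2.0;  // 3.25 -- wrong
long   sumEitherWay = (1 + 2 + 3) + (4 + 5);                      // 15, same with or without a combiner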
Partitioner
hadoop jar /home/hadoop/lib/hadoop-train-1.0.jar com.imooc.hadoop.mapreduce.ParititonerApp hdfs://hadoop000:8020/partitioner hdfs://hadoop000:8020/output/partitioner
User behavior logs:
why record users' access behavior at all
how the logs are produced
what a user behavior log contains
Log record contents:
1) system attributes of the visit: operating system, browser, ...
2) access characteristics: the clicked URL, the referring URL (referer), time spent on the page, ...
3) access information: session_id, client IP (and hence city), ...
2013-05-19 13:00:00 http://www.taobao.com/17/?tracker_u=1624169&type=1 B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1 http://hao.360.cn/ 1.196.34.243
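To get a feel for those fields, here is a throwaway way to pull them out of the sample line above. It assumes the fields are tab-separated; if your file uses plain spaces you need a smarter split, since the timestamp itself contains a space:
String log = "2013-05-19 13:00:00\thttp://www.taobao.com/17/?tracker_u=1624169&type=1\tB58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1\thttp://hao.360.cn/\t1.196.34.243";
String[] fields = log.split("\t");
String time    = fields[0]; // access time
String url     = fields[1]; // clicked url
String session = fields[2]; // session id
String referer = fields[3]; // where the visitor came from
String ip      = fields[4]; // client ip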
Why user-behavior log analysis matters
1) data collection
2) data cleansing
3) data processing
4) storing the results
5) data visualization
Extract the browser from each log record
and produce per-browser statistics.
We use the open-source User Agent Parser
from github.com/yammer/user_agent
mvn clean package -DskipTests
mvn clean install -DskipTests (installs it into the local Maven repository)
Add the parser dependency to pom.xml:
<dependency>
<groupId>com.kumkee</groupId>
<artifactId>UserAgentParser</artifactId>
<version>0.0.1</version>
</dependency>
Write a test class:
import com.kumkee.userAgent.UserAgent;
import com.kumkee.userAgent.UserAgentParser;
import org.junit.Test;
/**
 * UserAgent test class
 */
public class UserAgentTest {
// unit test: basic use of the UserAgent parser
@Test
public void testUserAgentParser() {
String source = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36";
UserAgentParser userAgentParser = new UserAgentParser();
UserAgent agent = userAgentParser.parse(source);
// browser name
String browser = agent.getBrowser();
// the agent object exposes more fields that can be printed as needed
}
}
The data set is large, so take just the first 100 lines: head -n 100 10000_access.log > 100_access.log
Check the line count: wc -l 100_access.log
First, a standalone (non-Hadoop) Java version of the analysis:
import com.kumkee.userAgent.UserAgent;
import com.kumkee.userAgent.UserAgentParser;
import org.apache.commons.lang.StringUtils; // any StringUtils with isNotBlank works; commons-lang is pulled in transitively by hadoop-client
import org.junit.Test;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserAgentTest {
@Test
public void testReadFile() throws Exception {
String path = "/users/rocky/data/imooc/100_access.log";
BufferedReader reader = new BufferedReader(
new InputStreamReader(new FileInputStream(new File(path)))
);
String line = "";
int i = 0;
Map<String, Integer> browserMap = new HashMap<String, Integer>();
UserAgentParser userAgentParser = new UserAgentParser();
while (line != null) {
line = reader.readLine(); // read one line at a time
i++;
if (StringUtils.isNotBlank(line)) {
// the user-agent field starts right after the 7th double quote
String source = line.substring(getCharacterPosition(line, "\"", 7) + 1);
UserAgent agent = userAgentParser.parse(source);
// browser name; the agent exposes more fields that can be printed as needed
String browser = agent.getBrowser();
Integer browserValue = browserMap.get(browser);
if (browserValue != null) {
browserMap.put(browser, browserValue + 1);
} else {
browserMap.put(browser, 1);
}
}
}
System.out.println("lines read: " + i);
for (Map.Entry<String, Integer> entry : browserMap.entrySet()) {
System.out.println(entry.getKey() + ":" + entry.getValue());
}
}
// test for the helper getCharacterPosition
@Test
public void testGetCharacterPosition() {
String value = "......";
// position of the 7th double quote in the string
int index = getCharacterPosition(value, "\"", 7);
System.out.println(index);
}
// index of the index-th occurrence of the given token in the string
private int getCharacterPosition(String value, String operator, int index) {
Matcher slashMatcher = Pattern.compile(operator).matcher(value);
int mIdx = 0;
while (slashMatcher.find()) {
mIdx++;
if (mIdx == index) {
break;
}
}
return slashMatcher.start();
}
}
Bundle the UserAgentParser into the job jar with the assembly plugin: mvn assembly:assembly
Add the plugin to pom.xml, after the </dependencies> element:
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.imooc.hadoop.project.LogApp</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</build>
Run: hadoop jar /home/hadoop/lib/hadoop-train-1.0-jar-with-dependencies.jar com.imooc.hadoop.project.LogApp /10000_access.log /browserout
Code:
package com.imooc.hadoop.project;
import com.kumkee.userAgent.UserAgent;
import com.kumkee.userAgent.UserAgentParser;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * Count log records per browser, using UserAgentParser to parse the user-agent field
 */
public class LogApp {
public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
LongWritable one = new LongWritable(1);
private UserAgentParser userAgentParser;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
userAgentParser = new UserAgentParser();
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
// the user-agent field starts right after the 7th double quote
String source = line.substring(getCharacterPosition(line, "\"", 7) + 1);
UserAgent agent = userAgentParser.parse(source);
String browser = agent.getBrowser();
context.write(new Text(browser), one);
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
userAgentParser = null;
}
}
public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
long sum = 0;
for (LongWritable value : values) {
sum += value.get();
}
context.write(key, new LongWritable(sum));
}
}
// index of the index-th occurrence of the given token in the string
private static int getCharacterPosition(String value, String operator, int index) {
Matcher slashMatcher = Pattern.compile(operator).matcher(value);
int mIdx = 0;
while (slashMatcher.find()) {
mIdx++;
if (mIdx == index) {
break;
}
}
return slashMatcher.start();
}
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
// remove the output directory if it already exists
Path outputPath = new Path(args[1]);
FileSystem fileSystem = FileSystem.get(configuration);
if (fileSystem.exists(outputPath)) {
fileSystem.delete(outputPath, true);
System.out.println("output path exists, but it has been deleted");
}
Job job = Job.getInstance(configuration, "LogApp");
job.setJarByClass(LogApp.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
User behavior log: every action a user performs on each visit to the site (views, browsing, searches, clicks, ...)
Also called the user behavior trail or traffic log
Log record contents:
1) system attributes of the visit: operating system, browser, ...
2) access characteristics: the clicked URL, the referring URL (referer), time spent on the page, ...
3) access information: session_id, client IP (and hence city), ...
2013-05-19 13:00:00 http://www.taobao.com/17/?tracker_u=1624169&type=1 B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1 http://hao.360.cn/ 1.196.34.243
Data processing pipeline
1) data collection
Flume: write the web logs into HDFS
2) data cleansing
dirty data
Spark, Hive, MapReduce or some other distributed computing framework
the cleansed data can be stored back into HDFS (for Hive/Spark SQL)
3) data processing
run the statistics and analysis the business needs
Spark, Hive, MapReduce or some other distributed computing framework
4) storing the results
the results can go into an RDBMS or a NoSQL store
5) data visualization
present the results graphically: pie charts, bar charts, maps, line charts
ECharts, HUE, Zeppelin
UserAgent
hadoop jar /home/hadoop/lib/hadoop-train-1.0-jar-with-dependencies.jar com.imooc.hadoop.project.LogApp /10000_access.log /browserout
Setting up a distributed Hadoop environment
hostname setup: sudo vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop001
hostname-to-IP mapping: sudo vi /etc/hosts
Role assignment per node:
Run on every machine: ssh-keygen -t rsa
With hadoop000 acting as the main node:
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop000
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop001
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop002
1) Install Hadoop
# hadoop-env.sh
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_79
# core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop000:8020</value>
</property>
# hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/app/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/app/tmp/dfs/data</value>
</property>
# yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop000</value>
</property>
# mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
# slaves
hadoop000
hadoop001
hadoop002
2) Distribute the installation to the hadoop001 and hadoop002 nodes
# the hadoop and jdk installation directories
scp -r ~/app hadoop@hadoop001:~/
scp -r ~/app hadoop@hadoop002:~/
# the environment variable file
scp ~/.bash_profile hadoop@hadoop001:~/
scp ~/.bash_profile hadoop@hadoop002:~/
source .bash_profile
3) Format the NN (run this on hadoop000 only):
bin/hdfs namenode -format
Start the cluster (on hadoop000 only):
sbin/start-all.sh
Verify:
jps
hadoop000:
SecondaryNameNode
DataNode
NodeManager
NameNode
ResourceManager
hadoop001:
NodeManager
DataNode
hadoop002:
NodeManager
DataNode
stop-all.sh
hadoop fs
Running the Hadoop project on the cluster
1) upload the data to the data directory on hadoop000
2) upload the built jar to the lib directory on hadoop000
3) upload the data to HDFS
4) run our MR program on the distributed cluster
hadoop jar ~/lib/hadoop-train-1.0-jar-with-dependencies.jar com.imooc.hadoop.project.LogApp /10000_access.log /browserout
Setting up a distributed Hadoop environment
hadoop000: 192.168.199.102
hadoop001: 192.168.199.247
hadoop002: 192.168.199.138
hostname setup: sudo vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop001
hostname-to-IP mapping: sudo vi /etc/hosts
192.168.199.102 hadoop000
192.168.199.247 hadoop001
192.168.199.138 hadoop002
Role assignment per node:
hadoop000: NameNode/DataNode ResourceManager/NodeManager
hadoop001: DataNode NodeManager
hadoop002: DataNode NodeManager
Prerequisites
1) passwordless ssh login
Run on every machine: ssh-keygen -t rsa
With hadoop000 acting as the main node:
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop000
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop001
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop002
2) Install the JDK
Extract the JDK package on hadoop000 and point JAVA_HOME at it in the system environment variables
Cluster installation
1) Install Hadoop
Extract the Hadoop package on hadoop000 and point HADOOP_HOME at it in the system environment variables
hadoop-env.sh
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_79
core-site.xml
fs.default.name
hdfs://hadoop000:8020
hdfs-site.xml
dfs.namenode.name.dir
/home/hadoop/app/tmp/dfs/name
dfs.datanode.data.dir
/home/hadoop/app/tmp/dfs/data
yarn-site.xml
yarn.nodemanager.aux-services
mapreduce_shuffle
yarn.resourcemanager.hostname
hadoop000
mapred-site.xml
mapreduce.framework.name
yarn
slaves
hadoop000
hadoop001
hadoop002
2) Distribute the installation to the hadoop001 and hadoop002 nodes
scp -r ~/app hadoop@hadoop001:~/
scp -r ~/app hadoop@hadoop002:~/
scp ~/.bash_profile hadoop@hadoop001:~/
scp ~/.bash_profile hadoop@hadoop002:~/
Source .bash_profile on hadoop001 and hadoop002 so the variables take effect
3) Format the NN (run this on hadoop000 only):
bin/hdfs namenode -format
4) Start the cluster (on hadoop000 only):
sbin/start-all.sh
5) Verify
jps
hadoop000:
SecondaryNameNode
DataNode
NodeManager
NameNode
ResourceManager
hadoop001:
NodeManager
DataNode
hadoop002:
NodeManager
DataNode
webui: hadoop000:50070 hadoop000:8088
6) Stop the cluster: stop-all.sh
Running the Hadoop project on the cluster
1) upload the data to the data directory on hadoop000
2) upload the built jar to the lib directory on hadoop000
3) upload the data to HDFS
4) run our MR program on the distributed cluster
hadoop jar ~/lib/hadoop-train-1.0-jar-with-dependencies.jar com.imooc.hadoop.project.LogApp /10000_access.log /browserout
https://projects.spring.io/spring-hadoop/
Add the Maven dependency to the pom file:
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-hadoop</artifactId>
<version>2.5.0.RELEASE</version>
</dependency>
Create a beans.xml file under resources:
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:hdp="http://www.springframework.org/schema/hadoop"
xsi:schemaLocation="
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/hadoop
http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">
<hdp:configuration id="hadoopConfiguration">
fs.defaultFS=hdfs://hadoop000:8020
</hdp:configuration>
<hdp:file-system id="fileSystem" user="hadoop" configuration-ref="hadoopConfiguration"/>
</beans>
Test class:
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

// access HDFS through Spring Hadoop
public class SpringHadoopHDFSApp {
private ApplicationContext ctx;
private FileSystem fileSystem;
// create an HDFS directory
@Test
public void testMkdir() throws Exception {
fileSystem.mkdirs(new Path("/springhdfs/"));
}
// read file content
@Test
public void cat() throws Exception {
FSDataInputStream in = fileSystem.open(new Path("/springhdfs/hello.txt"));
IOUtils.copyBytes(in, System.out, 1024);
in.close();
}
@Before
public void setUp() {
ctx = new ClassPathXmlApplicationContext("beans.xml");
fileSystem = (FileSystem) ctx.getBean("fileSystem");
}
@After
public void tearDown() throws Exception {
ctx = null;
fileSystem.close();
}
}
Create an application.properties file under resources:
spring.hadoop.fsUri=hdfs://hadoop000:8020
and reference it from beans.xml:
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:hdp="http://www.springframework.org/schema/hadoop"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:util="http://www.springframework.org/schema/util"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util.xsd
http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">
<hdp:configuration id="hadoopConfiguration">
fs.defaultFS=${spring.hadoop.fsUri}
</hdp:configuration>
<context:property-placeholder location="application.properties"/>
<hdp:file-system id="fileSystem" user="hadoop" configuration-ref="hadoopConfiguration"/>
</beans>
Add the Spring Boot dependency:
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-hadoop-boot</artifactId>
<version>2.5.0.RELEASE</version>
</dependency>
Test class:
import org.apache.hadoop.fs.FileStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.data.hadoop.fs.FsShell;

// access HDFS through Spring Boot
@SpringBootApplication
public class SpringBootHDFSApp implements CommandLineRunner {
@Autowired
FsShell fsShell;
public void run(String... strings) throws Exception {
for (FileStatus fileStatus : fsShell.lsr("/springhdfs")) {
System.out.println(">" + fileStatus.getPath());
}
}
public static void main(String[] args) {
SpringApplication.run(SpringBootHDFSApp.class, args);
}
}
spark.apache.org
Prerequisites: Scala and Maven (check with mvn -version)
Environment setup: download the Spark 2.1.0 source package from the official site; how to build it is described at http://spark.apache.org/docs/2.1.0/building-spark.html
Extract the built package into the app directory.
Start the shell from Spark's bin directory:
spark-shell --help
spark-shell --master local[2]
(2 means run with two worker threads; see http://spark.apache.org/docs/2.1.0/submitting-applications.html#master-urls)
How to implement wordcount, in the scala shell:
var file = sc.textFile("file:///home/hadoop/data/hello.txt")
file.collect shows the file's contents, file.count the number of lines
Wordcount in Spark:
val file = sc.textFile("file:///home/hadoop/data/hello.txt")
val a = file.flatMap(line => line.split(" "))
// map each word to a (word, 1) pair
val b = a.map(word => (word,1))
// the data now looks like: Array((hadoop,1), (welcome,1), (hadoop,1), (hdfs,1), (mapreduce,1), (hadoop,1), (hdfs,1))
// add up the values of identical keys
val c = b.reduceByKey(_ + _)
// c contains: Array((mapreduce,1), (welcome,1), (hadoop,3), (hdfs,2))
The whole thing fits in one line: sc.textFile("file:///home/hadoop/data/hello.txt").flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + _).collect
Jobs can be monitored in the browser at the address printed when spark-shell starts: http://ip:4040
How do Spark and MapReduce compare in ease of use?
flink.apache.org
Setting up Flink
http://flink.apache.org/
Extract it into the app directory and start a local cluster: ./start-local.sh
Web UI: http://ip:8081
https://github.com/apache/flink
How to use it: https://ci.apache.org/projects/flink/flink-docs-release-1.4/quickstart/setup_quickstart.html
The WordCount example source can also be read in the GitHub repo.
./bin/flink run ./examples/batch/WordCount.jar \
--input file:///home/hadoop/data/hello.txt --output file:///home/hadoop/tmp/flink_wc_output
beam.apache.org
Getting started: https://beam.apache.org/get-started/quickstart-java/
Run the WordCount example.
Running Beam:
# run with the direct runner
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=/home/hadoop/data/hello.txt --output=counts" \
-Pdirect-runner
# run with the Spark runner
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=SparkRunner --inputFile=/home/hadoop/data/hello.txt --output=counts" -Pspark-runner
Start Spark: spark-shell --master local[2]
Wordcount in Spark:
val file = sc.textFile("file:///home/hadoop/data/hello.txt")
val a = file.flatMap(line => line.split(" "))
val b = a.map(word => (word,1))
// Array((hadoop,1), (welcome,1), (hadoop,1), (hdfs,1), (mapreduce,1), (hadoop,1), (hdfs,1))
val c = b.reduceByKey(_ + _)
// Array((mapreduce,1), (welcome,1), (hadoop,3), (hdfs,2))
sc.textFile("file:///home/hadoop/data/hello.txt").flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + _).collect
Running Flink:
./bin/flink run ./examples/batch/WordCount.jar \
--input file:///home/hadoop/data/hello.txt --output file:///home/hadoop/tmp/flink_wc_output
Running Beam:
# run with the direct runner
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=/home/hadoop/data/hello.txt --output=counts" \
-Pdirect-runner
# run with the Spark runner
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=SparkRunner --inputFile=/home/hadoop/data/hello.txt --output=counts" -Pspark-runner
# run with the Flink runner
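Going by the Beam quickstart linked above and the direct/spark commands, the Flink variant should look roughly like this (not verified here):
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=FlinkRunner --inputFile=/home/hadoop/data/hello.txt --output=counts" -Pflink-runner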