Preface
Sticking with learning something is genuinely hard; with Hadoop and Spark I have gone from "getting started" to "giving up" more times than I can count. It started with Scala, which I studied for nearly half a year.
Learning materials
- Two books
The first book uses a fairly recent Hadoop version (Hadoop 2); the second is based on a much older one, so for the code I recommend following the first book. The second book's treatment of Hadoop fundamentals is still fine, though, and it helps with understanding the source code in the first book, which concentrates on the algorithms and carries relatively few code comments. Reading the official documentation alongside also helps.
Chapter 1: Secondary Sort
Tools: IDEA + Maven
- First, the Maven dependencies (pom.xml)
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>kean.learn</groupId>
    <artifactId>hadoop_spark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>
</project>
It's best to keep these versions consistent with the Hadoop cluster you have installed.
- Secondary sort comes down to a single MapReduce job; I won't cover the basic theory here.
- The entity (composite key) code
package org.dataalgorithms.chap01.mapreduce;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
* The DateTemperaturePair class enables us to represent a
* composite type of (yearMonth, day, temperature). To persist
* a composite type (actually any data type) in Hadoop, it has
* to implement the org.apache.hadoop.io.Writable interface.
*
* To compare composite types in Hadoop, it has to implement
* the org.apache.hadoop.io.WritableComparable interface.
*
* @author Mahmoud Parsian
*/
public class DateTemperaturePair implements Writable, WritableComparable<DateTemperaturePair> {
// a custom data type must implement the Writable interface to be persisted by Hadoop
// and the WritableComparable interface if it also needs to be compared (sorted)
private final Text yearMonth = new Text();
private final Text day = new Text();
private final IntWritable temperature = new IntWritable();
public DateTemperaturePair() {
}
public DateTemperaturePair(String yearMonth, String day, int temperature) {
this.yearMonth.set(yearMonth);
this.day.set(day);
this.temperature.set(temperature);
}
public static DateTemperaturePair read(DataInput in) throws IOException {
DateTemperaturePair pair = new DateTemperaturePair();
pair.readFields(in);
return pair;
}
@Override
public void write(DataOutput out) throws IOException {
yearMonth.write(out);
day.write(out);
temperature.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
yearMonth.readFields(in);
day.readFields(in);
temperature.readFields(in);
}
@Override
public int compareTo(DateTemperaturePair pair) {
// compare by the natural key (yearMonth) first
int compareValue = this.yearMonth.compareTo(pair.getYearMonth());
if (compareValue == 0) {
// then compare by temperature
compareValue = temperature.compareTo(pair.getTemperature());
}
//return compareValue; // to sort ascending
// to sort descending
return -1 * compareValue;
}
public Text getYearMonthDay() {
return new Text(yearMonth.toString() + day.toString());
}
public Text getYearMonth() {
return yearMonth;
}
public Text getDay() {
return day;
}
public IntWritable getTemperature() {
return temperature;
}
public void setYearMonth(String yearMonthAsString) {
yearMonth.set(yearMonthAsString);
}
public void setDay(String dayAsString) {
day.set(dayAsString);
}
public void setTemperature(int temp) {
temperature.set(temp);
}
@Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
// two pairs are considered equal when both temperature and yearMonth are equal
DateTemperaturePair that = (DateTemperaturePair) o;
if (temperature != null ? !temperature.equals(that.temperature) : that.temperature != null) {
return false;
}
if (yearMonth != null ? !yearMonth.equals(that.yearMonth) : that.yearMonth != null) {
return false;
}
return true;
}
@Override
public int hashCode() {
int result = yearMonth != null ? yearMonth.hashCode() : 0;
result = 31 * result + (temperature != null ? temperature.hashCode() : 0);
return result;
}
@Override
public String toString() {
StringBuilder builder = new StringBuilder();
builder.append("DateTemperaturePair{yearMonth=");
builder.append(yearMonth);
builder.append(", day=");
builder.append(day);
builder.append(", temperature=");
builder.append(temperature);
builder.append("}");
return builder.toString();
}
}
Each input line carries (year-month), day, and temperature, which is why the class defines these three fields.
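To make the Writable contract concrete, here is a small standalone sketch of my own (not from the book; the class name DateTemperaturePairDemo is made up) that round-trips a composite key through the same write()/readFields() methods Hadoop uses during the shuffle:
package org.dataalgorithms.chap01.mapreduce;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class, not part of the book's code.
public class DateTemperaturePairDemo {
    public static void main(String[] args) throws IOException {
        // serialize the composite key to bytes, exactly what Hadoop does when it shuffles the key
        DateTemperaturePair original = new DateTemperaturePair("201901", "12", 25);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // read it back through the static read() helper, which calls readFields()
        DateTemperaturePair copy = DateTemperaturePair.read(
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy);
        // prints: DateTemperaturePair{yearMonth=201901, day=12, temperature=25}
    }
}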
- Custom partitioner
The partitioner looks at the mapper's output key and decides which reducer each map output record is sent to.
package org.dataalgorithms.chap01.mapreduce;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* The DateTemperaturePartitioner is a custom partitioner class,
* which partitions data by the natural key only (using the yearMonth).
* Without a custom partitioner, Hadoop will partition your mapped data
* based on a hash code.
*
* In Hadoop, the partitioning phase takes place after the map() phase
* and before the reduce() phase
*
* @author Mahmoud Parsian
*/
public class DateTemperaturePartitioner extends Partitioner<DateTemperaturePair, Text> {
@Override
public int getPartition(DateTemperaturePair pair, Text text, int numberOfPartitions) {
// make sure that partitions are non-negative
// partition by the hash of the natural key (yearMonth) only
return Math.abs(pair.getYearMonth().hashCode() % numberOfPartitions);
}
}
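As a quick sanity check, a sketch of my own (the PartitionerDemo class name is made up, it is not book code): two records with the same yearMonth always get the same partition number, no matter what their day or temperature is.
package org.dataalgorithms.chap01.mapreduce;

// Hypothetical demo class, not part of the book's code.
public class PartitionerDemo {
    public static void main(String[] args) {
        DateTemperaturePartitioner partitioner = new DateTemperaturePartitioner();
        int numReducers = 3;
        DateTemperaturePair a = new DateTemperaturePair("201901", "12", 25);
        DateTemperaturePair b = new DateTemperaturePair("201901", "13", 24);
        // the Text value argument is ignored by getPartition(), so null is fine for this demo
        System.out.println(partitioner.getPartition(a, null, numReducers));
        System.out.println(partitioner.getPartition(b, null, numReducers));
        // both lines print the same partition number, because only yearMonth is hashed
    }
}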
- Custom grouping comparator
The grouping comparator controls which keys end up grouped together into a single call to reducer.reduce().
package org.dataalgorithms.chap01.mapreduce;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
/**
* The DateTemperatureGroupingComparator is a custom comparator that
* compares composite keys by the natural key (yearMonth) only, so that
* all records of one yearMonth are handed to a single reduce() call.
*
* @author Mahmoud Parsian
*/
public class DateTemperatureGroupingComparator extends WritableComparator {
public DateTemperatureGroupingComparator() {
// register the composite key class and ask the framework to create instances of it
super(DateTemperaturePair.class, true);
}
@Override
public int compare(WritableComparable wc1, WritableComparable wc2) {
DateTemperaturePair pair = (DateTemperaturePair) wc1;
DateTemperaturePair pair2 = (DateTemperaturePair) wc2;
// group by the natural key (yearMonth) only
return pair.getYearMonth().compareTo(pair2.getYearMonth());
}
}
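The contrast with the composite key's own compareTo() can be seen in another small sketch of my own (the GroupingDemo class name is made up): the grouping comparator treats two records of the same yearMonth as equal, while compareTo() still distinguishes them by temperature.
package org.dataalgorithms.chap01.mapreduce;

// Hypothetical demo class, not part of the book's code.
public class GroupingDemo {
    public static void main(String[] args) {
        DateTemperaturePair a = new DateTemperaturePair("201901", "12", 25);
        DateTemperaturePair b = new DateTemperaturePair("201901", "13", 24);

        // the grouping comparator looks only at yearMonth: 0 means "same group",
        // so both records end up in one reduce() call
        System.out.println(new DateTemperatureGroupingComparator().compare(a, b)); // 0

        // the key's own compareTo() also looks at temperature (descending here),
        // which is what orders the values inside that group
        System.out.println(a.compareTo(b)); // non-zero
    }
}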
Understanding the partitioner and the grouping comparator
How the partitioner and the grouping comparator fit together was not obvious to me at first. Before explaining, let me restate the goal: aggregate the temperatures by year-month and, within each year-month, output the daily temperatures in ascending or descending order. Example input:
2019 1 12 25
2019 2 13 25
2019 1 12 24
...
To compare the temperatures of one year-month with each other, all records of that year-month must land in the same partition; if they were spread across two or more partitions, a single secondary-sort pass could not produce the desired result. That is exactly what the custom partitioner does: it sends every record of one year-month to the same partition. Once in a partition the records still need to be compared, and since the entity already defines equals()/compareTo(), it was not obvious to me at first why a separate comparator is needed. The difference, as the small sketch above already hints, is that compareTo() also takes the temperature into account (it orders records within a group), while the grouping comparator looks only at yearMonth, which is what makes all records of one year-month arrive in a single reduce() call.
- Define the Mapper
package org.dataalgorithms.chap01.mapreduce;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* SecondarySortMapper implements the map() function for
* the secondary sort design pattern.
*
* @author Mahmoud Parsian
*
*/
public class SecondarySortMapper extends Mapper<LongWritable, Text, DateTemperaturePair, Text> {
// the four type parameters are the mapper's <input key, input value, output key, output value>
// output value
private final Text theTemperature = new Text();
// output key
private final DateTemperaturePair pair = new DateTemperaturePair();
/**
* @param key is generated by Hadoop (ignored here)
* @param value has this format: "YYYY,MM,DD,temperature"
*/
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// the first two parameters are the mapper's input key and value; the Context is where the output
// is written. The input key (the byte offset of the line within the file) is of no real use here.
String line = value.toString();
String[] tokens = line.split(",");
// YYYY = tokens[0]
// MM = tokens[1]
// DD = tokens[2]
// temperature = tokens[3]
String yearMonth = tokens[0] + tokens[1];
String day = tokens[2];
int temperature = Integer.parseInt(tokens[3]);
pair.setYearMonth(yearMonth);
pair.setDay(day);
pair.setTemperature(temperature);
theTemperature.set(tokens[3]);
context.write(pair, theTemperature);
}
}
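To make the key format concrete, here is a trace of my own (MapperTraceDemo is a made-up name, not book code) showing what map() builds from one line of the sample file used later; note that year and month are simply concatenated, which is why the job output below shows keys such as yearMonth=2019p.
package org.dataalgorithms.chap01.mapreduce;

// Hypothetical demo class, not part of the book's code.
public class MapperTraceDemo {
    public static void main(String[] args) {
        String line = "2019,p,4,40";                    // YYYY,MM,DD,temperature
        String[] tokens = line.split(",");
        String yearMonth = tokens[0] + tokens[1];       // "2019p" -> the natural key
        String day = tokens[2];                         // "4"
        int temperature = Integer.parseInt(tokens[3]);  // 40
        DateTemperaturePair key = new DateTemperaturePair(yearMonth, day, temperature);
        System.out.println(key + " -> value " + temperature);
        // prints: DateTemperaturePair{yearMonth=2019p, day=4, temperature=40} -> value 40
    }
}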
- Reducer
The reducer's input key and value types must match the mapper's output types.
package org.dataalgorithms.chap01.mapreduce;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* SecondarySortReducer implements the reduce() function for
* the secondary sort design pattern.
*
* @author Mahmoud Parsian
*
*/
// public class SecondarySortReducer extends Reducer<DateTemperaturePair, Text, Text, Text> {
// // reducer <input key type, input value type, output key type, output value type>
//
// @Override
// protected void reduce(DateTemperaturePair key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
// StringBuilder builder = new StringBuilder();
// for (Text value : values) {
// builder.append(value.toString());
// builder.append(",");
// }
// context.write(key.getYearMonth(), new Text(builder.toString()));
// }
// }
public class SecondarySortReducer extends Reducer<DateTemperaturePair, Text, DateTemperaturePair, Text> {
// reducer <input key type, input value type, output key type, output value type>
@Override
protected void reduce(DateTemperaturePair key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
// values: all temperature values of this group (one yearMonth), already ordered by the composite key
StringBuilder builder = new StringBuilder();
for (Text value : values) {
builder.append(value.toString());
builder.append(",");
}
context.write(key, new Text(builder.toString()));
}
}
Compared with the commented-out version above, reduce() here writes the composite key itself as the output key, so the reducer's output key type stays consistent with the mapper's output key type (DateTemperaturePair). Note that Hadoop reuses the key object while iterating over a group's values, so the key written after the loop carries the fields of the group's last record; this is why each output line shown later displays the day and temperature of the last (lowest, given the descending sort) record of its group.
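As an illustration of my own (ReduceGroupDemo is a made-up name, not book code), this is all a single reduce() call has to do for the "2019p" group of the sample data: by the time the values arrive they are already ordered by descending temperature (40, 20, 10), so the reducer only concatenates them.
package org.dataalgorithms.chap01.mapreduce;

import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the book's code.
public class ReduceGroupDemo {
    public static void main(String[] args) {
        // the shuffle has already sorted the "2019p" values by descending temperature
        List<String> valuesForGroup2019p = Arrays.asList("40", "20", "10");
        StringBuilder builder = new StringBuilder();
        for (String value : valuesForGroup2019p) {
            builder.append(value);
            builder.append(",");
        }
        System.out.println(builder);
        // prints: 40,20,10,  -- which matches the 2019p line of the job output below
    }
}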
- Define the Driver
package org.dataalgorithms.chap01.mapreduce;
import org.apache.log4j.Logger;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
/**
* SecondarySortDriver is the driver class for submitting the secondary sort job to Hadoop.
*
* @author Mahmoud Parsian
*/
public class SecondarySortDriver extends Configured implements Tool {
// this class configures and submits the job that runs the mapper and the reducer
private static Logger theLogger = Logger.getLogger(SecondarySortDriver.class);
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = new Job(conf);
job.setJarByClass(SecondarySortDriver.class);
job.setJobName("SecondarySortDriver");
// args[0] = input directory
// args[1] = output directory
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// the job's output key/value types; they must match what reduce() actually writes
job.setOutputKeyClass(DateTemperaturePair.class);
job.setOutputValueClass(Text.class);
// the map output key/value types only need to be set when they differ from the job output
// types; here they are identical, so these calls can be omitted
// job.setMapOutputKeyClass(DateTemperaturePair.class);
// job.setMapOutputValueClass(Text.class);
job.setMapperClass(SecondarySortMapper.class);
job.setReducerClass(SecondarySortReducer.class);
job.setPartitionerClass(DateTemperaturePartitioner.class);
job.setGroupingComparatorClass(DateTemperatureGroupingComparator.class);
// submit the job and wait for it to complete
boolean status = job.waitForCompletion(true);
theLogger.info("run(): status=" + status);
return status ? 0 : 1;
}
/**
* The main driver for the secondary sort map/reduce program.
* Invoke this method to submit the map/reduce job.
*
* @throws Exception When there are communication problems with the job tracker.
*/
public static void main(String[] args) throws Exception {
// Make sure there are exactly 2 parameters
if (args.length != 2) {
theLogger.warn("SecondarySortDriver ");
throw new IllegalArgumentException("SecondarySortDriver ");
}
//String inputDir = args[0];
//String outputDir = args[1];
int returnStatus = submitJob(args);
theLogger.info("returnStatus=" + returnStatus);
System.exit(returnStatus);
}
/**
* The main driver for the secondary sort map/reduce program.
* Invoke this method to submit the map/reduce job.
*
* @throws Exception When there are communication problems with the job tracker.
*/
public static int submitJob(String[] args) throws Exception {
//String[] args = new String[2];
//args[0] = inputDir;
//args[1] = outputDir;
return ToolRunner.run(new SecondarySortDriver(), args);
}
}
- Submit the jar and run it on the cluster
First start the cluster; an earlier post explains how to set one up on virtual machines.
Package the jar with Maven (mvn clean package); the jar ends up under the project's target/ directory.
- Upload the data to HDFS
[root@master chapter1]# cat sample_input.txt
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
[root@master bin]# ./hadoop fs -ls /data_algorithms
[root@master bin]# ./hadoop fs -mkdir -p /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -ls /data_algorithms/chapter1
Found 1 items
drwxr-xr-x - root supergroup 0 2019-04-14 16:13 /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -put /root/Data/data_algorithms/chapter1/sample_input.txt /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -cat /data_algorithms/chapter1/input/sample_input.txt
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
- Run the jar
[root@master bin]# ./hadoop jar /root/Data/data_algorithms/chapter1/hadoop_spark-1.0-SNAPSHOT.jar org.dataalgorithms.chap01.mapreduce.SecondarySortDriver /data_algorithms/chapter1/input /data_algorithms/chapter1/output
19/04/14 16:27:04 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.21.220:8032
19/04/14 16:27:05 INFO input.FileInputFormat: Total input paths to process : 1
19/04/14 16:27:06 INFO mapreduce.JobSubmitter: number of splits:1
19/04/14 16:27:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555221334170_0003
19/04/14 16:27:07 INFO impl.YarnClientImpl: Submitted application application_1555221334170_0003
19/04/14 16:27:07 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1555221334170_0003/
19/04/14 16:27:07 INFO mapreduce.Job: Running job: job_1555221334170_0003
19/04/14 16:27:17 INFO mapreduce.Job: Job job_1555221334170_0003 running in uber mode : false
19/04/14 16:27:17 INFO mapreduce.Job: map 0% reduce 0%
19/04/14 16:27:24 INFO mapreduce.Job: map 100% reduce 0%
19/04/14 16:27:32 INFO mapreduce.Job: map 100% reduce 100%
19/04/14 16:27:33 INFO mapreduce.Job: Job job_1555221334170_0003 completed successfully
19/04/14 16:27:33 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=234
FILE: Number of bytes written=238407
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=289
HDFS: Number of bytes written=334
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4958
Total time spent by all reduces in occupied slots (ms)=5274
Total time spent by all map tasks (ms)=4958
Total time spent by all reduce tasks (ms)=5274
Total vcore-milliseconds taken by all map tasks=4958
Total vcore-milliseconds taken by all reduce tasks=5274
Total megabyte-milliseconds taken by all map tasks=5076992
Total megabyte-milliseconds taken by all reduce tasks=5400576
Map-Reduce Framework
Map input records=14
Map output records=14
Map output bytes=200
Map output materialized bytes=234
Input split bytes=131
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=234
Reduce input records=14
Reduce output records=5
Spilled Records=28
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=151
CPU time spent (ms)=1920
Physical memory (bytes) snapshot=309817344
Virtual memory (bytes) snapshot=4159598592
Total committed heap usage (bytes)=165810176
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=158
File Output Format Counters
Bytes Written=334
19/04/14 16:27:33 INFO mapreduce.SecondarySortDriver: run(): status=true
19/04/14 16:27:33 INFO mapreduce.SecondarySortDriver: returnStatus=0
[root@master bin]# ./hadoop fs -ls /data_algorithms/chapter1/output/
Found 2 items
-rw-r--r-- 3 root supergroup 0 2019-04-14 16:27 /data_algorithms/chapter1/output/_SUCCESS
-rw-r--r-- 3 root supergroup 334 2019-04-14 16:27 /data_algorithms/chapter1/output/part-r-00000
[root@master bin]# ./hadoop fs -cat /data_algorithms/chapter1/output/p*
DateTemperaturePair{yearMonth=2019z, day=4, temperature=0} 8,7,4,0,
DateTemperaturePair{yearMonth=2019y, day=3, temperature=1} 7,5,1,
DateTemperaturePair{yearMonth=2019x, day=1, temperature=3} 9,6,3,
DateTemperaturePair{yearMonth=2019r, day=3, temperature=60} 60,
DateTemperaturePair{yearMonth=2019p, day=1, temperature=10} 40,20,10,
- Rewriting the commands as a script
[root@master chapter1]# ./run.sh
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: `/data_algorithms/chapter1/output': No such file or directory
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
./run.sh: line 7: org.dataalgorithms.chap01.mapreduce.SecondarySortDriver: command not found
Exception in thread "main" java.lang.ClassNotFoundException: /data_algorithms/chapter1/input
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
[root@master chapter1]# cat run.sh
# run.sh
export APP_JAR=/root/Data/data_algorithms/chapter1/hadoop_spark-1.0-SNAPSHOT.jar
INPUT=/data_algorithms/chapter1/input
OUTPUT=/data_algorithms/chapter1/output
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT
$HADOOP_HOME/bin/hadoop fs -cat $INPUT/sam*
PROG=package org.dataalgorithms.chap01.mapreduce.SecondarySortDriver
$HADOOP_HOME/bin/hadoop jar $APP_JAR $PROG $INPUT $OUTPUT
The cat does print the input file, so the paths themselves are fine; the failure comes from line 7 of the script. PROG=package org.dataalgorithms.chap01.mapreduce.SecondarySortDriver does not assign the class name to PROG: bash treats PROG=package as a temporary environment assignment for the "command" org.dataalgorithms.chap01.mapreduce.SecondarySortDriver, which is not an executable (hence the "command not found" for line 7), and PROG is left unset in the script. The next line, hadoop jar $APP_JAR $PROG $INPUT $OUTPUT, therefore expands without a class name, so Hadoop takes /data_algorithms/chapter1/input as the main class and throws exactly the ClassNotFoundException shown above. Removing the stray word, i.e. writing PROG=org.dataalgorithms.chap01.mapreduce.SecondarySortDriver, fixes the script.