hadoop-spark Big Data Processing Techniques, Chapter 1 (Part 1)

Preface

Sticking with learning something is genuinely hard; I have gone from "getting started" to "giving up" on Hadoop and Spark more times than I can count. I began with Scala, studied it for about half a year, and read that book at least three times, only to find that my company's production code is basically all Java; after briefly abandoning Java, I ended up learning it on the job anyway. Meanwhile I noticed that the big companies generally expect some Hadoop and Spark, so with no way around it, I am learning it again. I hope I can keep it up this time: I don't use it at work yet, and this kind of material is hard to understand deeply without real practice. Enough rambling, let's get started; I hope you will cheer me on, and that I can soon put this to use at work.

Study materials

  • Two books


    [Images: covers of the two books]

The first book uses a reasonably recent Hadoop version (Hadoop 2), while the second is rather dated, so I recommend following the first book for the code. For basic Hadoop theory, though, the second book is perfectly sound and helps in reading the first book's source code, because the first book concentrates on algorithms and its code carries few comments. Beyond that, read the official documentation alongside.

Chapter 1: Secondary Sort

Tools: IntelliJ IDEA + Maven

  1. First, post the Maven dependencies (pom.xml):


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>kean.learn</groupId>
    <artifactId>hadoop_spark</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>
</project>


It is best to keep these dependency versions consistent with the Hadoop cluster you have installed.

  2. Secondary sort is mainly an exercise in a single MapReduce job; I will skip the basic theory.
  • The entity (composite key) class
package org.dataalgorithms.chap01.mapreduce;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * The DateTemperaturePair class enables us to represent a
 * composite type of (yearMonth, day, temperature). To persist
 * a composite type (actually any data type) in Hadoop, it has
 * to implement the org.apache.hadoop.io.Writable interface.
 *
 * To compare composite types in Hadoop, it has to implement
 * the org.apache.hadoop.io.WritableComparable interface.
 *
 * @author Mahmoud Parsian
 */
public class DateTemperaturePair
        implements Writable, WritableComparable<DateTemperaturePair> {
    // a custom data type must implement Writable to be persisted,
    // and also WritableComparable if it needs to be compared

    private final Text yearMonth = new Text();
    private final Text day = new Text();
    private final IntWritable temperature = new IntWritable();

    public DateTemperaturePair() {
    }

    public DateTemperaturePair(String yearMonth, String day, int temperature) {
        this.yearMonth.set(yearMonth);
        this.day.set(day);
        this.temperature.set(temperature);
    }

    public static DateTemperaturePair read(DataInput in) throws IOException {
        DateTemperaturePair pair = new DateTemperaturePair();
        pair.readFields(in);
        return pair;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        yearMonth.write(out);
        day.write(out);
        temperature.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        yearMonth.readFields(in);
        day.readFields(in);
        temperature.readFields(in);
    }

    @Override
    public int compareTo(DateTemperaturePair pair) {
        // sort by yearMonth first
        int compareValue = this.yearMonth.compareTo(pair.getYearMonth());
        if (compareValue == 0) {
            // then by temperature
            compareValue = temperature.compareTo(pair.getTemperature());
        }
        //return compareValue; // to sort ascending
        // to sort descending
        return -1 * compareValue;
    }

    public Text getYearMonthDay() {
        return new Text(yearMonth.toString() + day.toString());
    }

    public Text getYearMonth() {
        return yearMonth;
    }

    public Text getDay() {
        return day;
    }

    public IntWritable getTemperature() {
        return temperature;
    }

    public void setYearMonth(String yearMonthAsString) {
        yearMonth.set(yearMonthAsString);
    }

    public void setDay(String dayAsString) {
        day.set(dayAsString);
    }

    public void setTemperature(int temp) {
        temperature.set(temp);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        // two pairs are considered equal when temperature and yearMonth are equal
        DateTemperaturePair that = (DateTemperaturePair) o;
        if (temperature != null ? !temperature.equals(that.temperature) : that.temperature != null) {
            return false;
        }
        if (yearMonth != null ? !yearMonth.equals(that.yearMonth) : that.yearMonth != null) {
            return false;
        }
        return true;
    }

    @Override
    public int hashCode() {
        int result = yearMonth != null ? yearMonth.hashCode() : 0;
        result = 31 * result + (temperature != null ? temperature.hashCode() : 0);
        return result;
    }

    @Override
    public String toString() {
        StringBuilder builder = new StringBuilder();
        builder.append("DateTemperaturePair{yearMonth=");
        builder.append(yearMonth);
        builder.append(", day=");
        builder.append(day);
        builder.append(", temperature=");
        builder.append(temperature);
        builder.append("}");
        return builder.toString();
    }
}

Each input record carries (year-month), day, and temperature, so the class defines those three fields.
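Note that compareTo() negates the combined comparison, so both yearMonth and temperature end up descending (which is why the job output later in this post lists 2019z before 2019p). A minimal Hadoop-free sketch of that ordering; the class and method names here are illustrative, not from the book:

```java
import java.util.Arrays;

public class SecondarySortSketch {

    // A plain stand-in for DateTemperaturePair: (yearMonth, temperature).
    static final class Pair {
        final String yearMonth;
        final int temperature;

        Pair(String yearMonth, int temperature) {
            this.yearMonth = yearMonth;
            this.temperature = temperature;
        }
    }

    // Sorts with the same rule as DateTemperaturePair.compareTo():
    // yearMonth first, then temperature, then negate the result.
    static String summary(Pair[] pairs) {
        Arrays.sort(pairs, (a, b) -> {
            int c = a.yearMonth.compareTo(b.yearMonth);
            if (c == 0) {
                c = Integer.compare(a.temperature, b.temperature);
            }
            return -1 * c; // the same negation as in the entity class
        });
        StringBuilder sb = new StringBuilder();
        for (Pair p : pairs) {
            sb.append(p.yearMonth).append('=').append(p.temperature).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Pair[] pairs = {
            new Pair("2019p", 40), new Pair("2019x", 9),
            new Pair("2019p", 20), new Pair("2019x", 3),
        };
        // descending yearMonth, and descending temperature within it
        System.out.println(summary(pairs));
        // prints: 2019x=9 2019x=3 2019p=40 2019p=20
    }
}
```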

  3. Define a custom partitioner
     The partitioner decides, based on the mapper's output key, which reducer each record is sent to.
package org.dataalgorithms.chap01.mapreduce;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * The DateTemperaturePartitioner is a custom partitioner class,
 * which partitions data by the natural key only (using the yearMonth).
 * Without a custom partitioner, Hadoop will partition your mapped data
 * based on a hash code.
 *
 * In Hadoop, the partitioning phase takes place after the map() phase
 * and before the reduce() phase.
 *
 * @author Mahmoud Parsian
 */
public class DateTemperaturePartitioner
        extends Partitioner<DateTemperaturePair, Text> {

    @Override
    public int getPartition(DateTemperaturePair pair, Text text, int numberOfPartitions) {
        // make sure that partitions are non-negative
        // partition by the hash of the natural key: yearMonth
        return Math.abs(pair.getYearMonth().hashCode() % numberOfPartitions);
    }
}

  4. Define a grouping comparator
     The grouping comparator controls which keys are grouped together into a single call to reducer.reduce(). This is the DateTemperatureGroupingComparator class registered in the driver below.
package org.dataalgorithms.chap01.mapreduce;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * The DateTemperatureGroupingComparator is a custom comparator class,
 * which groups data by the natural key only (using the yearMonth).
 *
 * In Hadoop, the grouping comparator decides which of the sorted keys
 * end up in the same call to reduce().
 *
 * @author Mahmoud Parsian
 */
public class DateTemperatureGroupingComparator extends WritableComparator {

    public DateTemperatureGroupingComparator() {
        super(DateTemperaturePair.class, true);
    }

    /**
     * This comparator controls which keys are grouped together
     * into a single call to the reduce() method.
     */
    @Override
    public int compare(WritableComparable wc1, WritableComparable wc2) {
        DateTemperaturePair pair = (DateTemperaturePair) wc1;
        DateTemperaturePair pair2 = (DateTemperaturePair) wc2;
        // compare the natural key (yearMonth) only
        return pair.getYearMonth().compareTo(pair2.getYearMonth());
    }
}

  5. Understanding the partitioner and the grouping comparator
     How the partitioner and the grouping comparator relate can be confusing. To restate the goal first: we want to aggregate records by year-month and output each month's daily temperatures in ascending or descending order. Given input such as:
     2019 1 12 25
     2019 2 13 25
     2019 1 12 24
     ...
     comparing the temperatures within one year-month requires that all of that year-month's records land in a single partition; if they were spread over two or more partitions, no single secondary sort could produce the desired result. That is exactly what the custom partitioner is for: it routes all records of one year-month to one partition. Within a partition, the framework then sorts the composite keys with DateTemperaturePair.compareTo() and uses the grouping comparator to decide which consecutive sorted keys share one reduce() call. The entity's equals()/hashCode() express plain object equality and are never consulted during the shuffle; that is why a separate comparator is still needed even though the entity already defines equals().
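To make that division of labor concrete, here is a small Hadoop-free simulation (all names illustrative): keys are assumed to arrive already sorted by the composite compareTo(), and the "grouping comparator" step merges consecutive keys with the same natural key (yearMonth) into one reduce() group, without ever calling equals() on the composite key:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingSketch {

    // Each key is {yearMonth, temperature}; the input is assumed to be
    // already sorted by the composite key, as it is after the shuffle sort.
    static List<String> reduceGroups(String[][] sortedKeys) {
        List<String> groups = new ArrayList<>();
        String currentNaturalKey = null;
        StringBuilder current = null;
        for (String[] key : sortedKeys) {
            // the grouping comparator compares the natural key ONLY
            if (!key[0].equals(currentNaturalKey)) {
                if (current != null) {
                    groups.add(current.toString());
                }
                currentNaturalKey = key[0];
                current = new StringBuilder(key[0] + " -> ");
            }
            current.append(key[1]).append(','); // one value list per group
        }
        if (current != null) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        String[][] sorted = {
            {"2019x", "9"}, {"2019x", "3"},
            {"2019p", "40"}, {"2019p", "20"},
        };
        // Each printed line corresponds to one reduce() call, with the
        // temperatures already in descending order.
        for (String group : reduceGroups(sorted)) {
            System.out.println(group);
        }
    }
}
```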

  6. Define the Mapper

package org.dataalgorithms.chap01.mapreduce;


import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/** 
 * SecondarySortMapper implements the map() function for 
 * the secondary sort design pattern.
 *
 * @author Mahmoud Parsian
 *
 */
public class SecondarySortMapper
        extends Mapper<LongWritable, Text, DateTemperaturePair, Text> {
    // the four type parameters are <input key, input value, output key, output value>

    // value
    private final Text theTemperature = new Text();
    // key
    private final DateTemperaturePair pair = new DateTemperaturePair();

    /**
     * @param key is generated by Hadoop (ignored here)
     * @param value has this format: "YYYY,MM,DD,temperature"
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // the first two parameters are the mapper's input key and value, and the context
        // collects the output; the input key is the byte offset of the line and is unused here
        String line = value.toString();
        String[] tokens = line.split(",");
        // YYYY = tokens[0]
        // MM = tokens[1]
        // DD = tokens[2]
        // temperature = tokens[3]
        String yearMonth = tokens[0] + tokens[1];
        String day = tokens[2];
        int temperature = Integer.parseInt(tokens[3]);

        pair.setYearMonth(yearMonth);
        pair.setDay(day);
        pair.setTemperature(temperature);
        theTemperature.set(tokens[3]);

        context.write(pair, theTemperature);
    }
}
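For a single input line, map() above does nothing more than string surgery. A standalone illustration of what one record becomes (no Hadoop types; the class name is mine, not the book's):

```java
public class MapStepSketch {

    // Mirrors the parsing in SecondarySortMapper.map() for one input line.
    // Returns "key -> value" in the shape the mapper would emit.
    static String mapOneLine(String line) {
        String[] tokens = line.split(",");
        String yearMonth = tokens[0] + tokens[1]; // e.g. "2019" + "p" = "2019p"
        String day = tokens[2];
        int temperature = Integer.parseInt(tokens[3]);
        return "(" + yearMonth + ", " + day + ", " + temperature + ") -> " + tokens[3];
    }

    public static void main(String[] args) {
        // a line from the sample input used later in this post
        System.out.println(mapOneLine("2019,p,4,40"));
        // prints: (2019p, 4, 40) -> 40
    }
}
```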

  7. The Reducer
     The reducer's input key and value types must match the mapper's output types.
package org.dataalgorithms.chap01.mapreduce;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/** 
 * SecondarySortReducer implements the reduce() function for 
 * the secondary sort design pattern.
 *
 * @author Mahmoud Parsian
 *
 */
// The book's original reducer, which emits only the natural key:
// public class SecondarySortReducer extends Reducer<DateTemperaturePair, Text, Text, Text> {
//     // reducer type parameters: <input key, input value, output key, output value>
//
//     @Override
//     protected void reduce(DateTemperaturePair key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//         StringBuilder builder = new StringBuilder();
//         for (Text value : values) {
//             builder.append(value.toString());
//             builder.append(",");
//         }
//         context.write(key.getYearMonth(), new Text(builder.toString()));
//     }
// }


public class SecondarySortReducer
        extends Reducer<DateTemperaturePair, Text, DateTemperaturePair, Text> {
    // reducer type parameters: <input key, input value, output key, output value>

    @Override
    protected void reduce(DateTemperaturePair key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // values: all of the mapper outputs grouped under this key
        StringBuilder builder = new StringBuilder();
        for (Text value : values) {
            builder.append(value.toString());
            builder.append(",");
        }
        context.write(key, new Text(builder.toString()));
    }
}

Here the output key of reduce() is changed slightly from the book's version (it writes the whole composite key rather than just the yearMonth), keeping it consistent with the mapper's output.
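The reduce() body itself is just string concatenation over the grouped values. A quick sanity check of the output format (illustrative, no Hadoop types), reproducing the value list seen in the job output below:

```java
public class ReduceStepSketch {

    // Mirrors SecondarySortReducer.reduce(): join the grouped temperature
    // values with trailing commas, as in the final job output.
    static String reduceOneGroup(String[] values) {
        StringBuilder builder = new StringBuilder();
        for (String value : values) {
            builder.append(value);
            builder.append(",");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // the values for the 2019z group from the sample run
        System.out.println(reduceOneGroup(new String[] {"8", "7", "4", "0"}));
        // prints: 8,7,4,0,
    }
}
```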

  8. Define the Driver
package org.dataalgorithms.chap01.mapreduce;

import org.apache.log4j.Logger;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

/**
 * SecondarySortDriver is driver class for submitting secondary sort job to Hadoop.
 *
 * @author Mahmoud Parsian
 */
public class SecondarySortDriver extends Configured implements Tool {
    // this class wires up and runs the mapper and reducer

    private static Logger theLogger = Logger.getLogger(SecondarySortDriver.class);

    @Override
    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SecondarySortDriver.class);
        job.setJobName("SecondarySortDriver");

        // args[0] = input directory
        // args[1] = output directory
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // the reducer's output key/value types; these must match what reduce() writes
        job.setOutputKeyClass(DateTemperaturePair.class);
        job.setOutputValueClass(Text.class);

        // the mapper's output types; when they are identical to the
        // reducer's output types, these calls may be omitted
        // job.setMapOutputKeyClass(DateTemperaturePair.class);
        // job.setMapOutputValueClass(Text.class);


        job.setMapperClass(SecondarySortMapper.class);
        job.setReducerClass(SecondarySortReducer.class);
        job.setPartitionerClass(DateTemperaturePartitioner.class);
        job.setGroupingComparatorClass(DateTemperatureGroupingComparator.class);


        // submit the job and wait for it to complete
        boolean status = job.waitForCompletion(true);
        theLogger.info("run(): status=" + status);
        return status ? 0 : 1;
    }

    /**
     * The main driver for the secondary sort map/reduce program.
     * Invoke this method to submit the map/reduce job.
     *
     * @throws Exception when there are communication problems with the job tracker.
     */
    public static void main(String[] args) throws Exception {
        // Make sure there are exactly 2 parameters
        if (args.length != 2) {
            theLogger.warn("usage: SecondarySortDriver <input-dir> <output-dir>");
            throw new IllegalArgumentException("usage: SecondarySortDriver <input-dir> <output-dir>");
        }

        //String inputDir = args[0];
        //String outputDir = args[1];
        int returnStatus = submitJob(args);
        theLogger.info("returnStatus=" + returnStatus);

        System.exit(returnStatus);
    }


    /**
     * Submits the secondary sort map/reduce job via ToolRunner.
     *
     * @throws Exception when there are communication problems with the job tracker.
     */
    public static int submitJob(String[] args) throws Exception {
        //String[] args = new String[2];
        //args[0] = inputDir;
        //args[1] = outputDir;
        return ToolRunner.run(new SecondarySortDriver(), args);
    }
}
  9. Submitting the jar to the cluster
  1. First start the cluster (an earlier post walks through installing a cluster on virtual machines):

     [Screenshot: cluster processes up and running]

  2. Package the jar with Maven:

     [Screenshot: mvn package build output]

  3. Upload the data to HDFS:
[root@master chapter1]# cat sample_input.txt 
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
[root@master bin]# ./hadoop fs -ls /data_algorithms
[root@master bin]# ./hadoop fs -mkdir -p /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -ls /data_algorithms/chapter1
Found 1 items
drwxr-xr-x   - root supergroup          0 2019-04-14 16:13 /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -put /root/Data/data_algorithms/chapter1/sample_input.txt /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -cat /data_algorithms/chapter1/input/sample_input.txt
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
  10. Run the jar
root@master bin]# ./hadoop jar /root/Data/data_algorithms/chapter1/hadoop_spark-1.0-SNAPSHOT.jar org.dataalgorithms.chap01.mapreduce.SecondarySortDriver /data_algorithms/chapter1/input /data_algorithms/chapter1/output
19/04/14 16:27:04 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.21.220:8032
19/04/14 16:27:05 INFO input.FileInputFormat: Total input paths to process : 1
19/04/14 16:27:06 INFO mapreduce.JobSubmitter: number of splits:1
19/04/14 16:27:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555221334170_0003
19/04/14 16:27:07 INFO impl.YarnClientImpl: Submitted application application_1555221334170_0003
19/04/14 16:27:07 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1555221334170_0003/
19/04/14 16:27:07 INFO mapreduce.Job: Running job: job_1555221334170_0003
19/04/14 16:27:17 INFO mapreduce.Job: Job job_1555221334170_0003 running in uber mode : false
19/04/14 16:27:17 INFO mapreduce.Job:  map 0% reduce 0%
19/04/14 16:27:24 INFO mapreduce.Job:  map 100% reduce 0%
19/04/14 16:27:32 INFO mapreduce.Job:  map 100% reduce 100%
19/04/14 16:27:33 INFO mapreduce.Job: Job job_1555221334170_0003 completed successfully
19/04/14 16:27:33 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=234
        FILE: Number of bytes written=238407
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=289
        HDFS: Number of bytes written=334
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4958
        Total time spent by all reduces in occupied slots (ms)=5274
        Total time spent by all map tasks (ms)=4958
        Total time spent by all reduce tasks (ms)=5274
        Total vcore-milliseconds taken by all map tasks=4958
        Total vcore-milliseconds taken by all reduce tasks=5274
        Total megabyte-milliseconds taken by all map tasks=5076992
        Total megabyte-milliseconds taken by all reduce tasks=5400576
    Map-Reduce Framework
        Map input records=14
        Map output records=14
        Map output bytes=200
        Map output materialized bytes=234
        Input split bytes=131
        Combine input records=0
        Combine output records=0
        Reduce input groups=5
        Reduce shuffle bytes=234
        Reduce input records=14
        Reduce output records=5
        Spilled Records=28
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=151
        CPU time spent (ms)=1920
        Physical memory (bytes) snapshot=309817344
        Virtual memory (bytes) snapshot=4159598592
        Total committed heap usage (bytes)=165810176
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=158
    File Output Format Counters 
        Bytes Written=334
19/04/14 16:27:33 INFO mapreduce.SecondarySortDriver: run(): status=true
19/04/14 16:27:33 INFO mapreduce.SecondarySortDriver: returnStatus=0
[root@master bin]# ./hadoop fs -ls /data_algorithms/chapter1/output/
Found 2 items
-rw-r--r--   3 root supergroup          0 2019-04-14 16:27 /data_algorithms/chapter1/output/_SUCCESS
-rw-r--r--   3 root supergroup        334 2019-04-14 16:27 /data_algorithms/chapter1/output/part-r-00000
[root@master bin]# ./hadoop fs -cat /data_algorithms/chapter1/output/p*
DateTemperaturePair{yearMonth=2019z, day=4, temperature=0}  8,7,4,0,
DateTemperaturePair{yearMonth=2019y, day=3, temperature=1}  7,5,1,
DateTemperaturePair{yearMonth=2019x, day=1, temperature=3}  9,6,3,
DateTemperaturePair{yearMonth=2019r, day=3, temperature=60} 60,
DateTemperaturePair{yearMonth=2019p, day=1, temperature=10} 40,20,10,
  11. Rewriting the commands as a script
[root@master chapter1]# ./run.sh 
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: `/data_algorithms/chapter1/output': No such file or directory
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
./run.sh: line 7: org.dataalgorithms.chap01.mapreduce.SecondarySortDriver: command not found
Exception in thread "main" java.lang.ClassNotFoundException: /data_algorithms/chapter1/input
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
[root@master chapter1]# cat run.sh 
# run.sh
export APP_JAR=/root/Data/data_algorithms/chapter1/hadoop_spark-1.0-SNAPSHOT.jar
INPUT=/data_algorithms/chapter1/input
OUTPUT=/data_algorithms/chapter1/output
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT
$HADOOP_HOME/bin/hadoop fs -cat $INPUT/sam*
PROG=package org.dataalgorithms.chap01.mapreduce.SecondarySortDriver
$HADOOP_HOME/bin/hadoop jar $APP_JAR $PROG $INPUT $OUTPUT

The `cat` does print the input file, so the HDFS paths are fine; both errors trace back to line 7 of the script, `PROG=package org.dataalgorithms.chap01.mapreduce.SecondarySortDriver`. In shell, the form `VAR=value command` runs `command` with a temporary assignment `VAR=value`, so the class name is executed as a command (hence "command not found") and `$PROG` remains empty afterwards. With `$PROG` empty, `hadoop jar` then takes the next argument, the input path, as the main class name, which produces `ClassNotFoundException: /data_algorithms/chapter1/input`. The fix is to delete the stray word `package`, so that the line reads `PROG=org.dataalgorithms.chap01.mapreduce.SecondarySortDriver`.
