hadoop-spark Big Data Processing Techniques, Chapter 1 (Part 1)

Preface

Sticking with learning something is genuinely hard; I have gone from "getting started" to "giving up" on Hadoop and Spark more times than I can count. I began with Scala, studied it for about half a year, and read that book at least three times, only to find that my company's production code is basically all Java; after briefly abandoning Java, I ended up learning it on the job anyway. Meanwhile I noticed that the big companies generally expect some Hadoop and Spark, so with no way around it, I am learning it again. I hope I can keep it up this time: I don't use it at work yet, and this kind of material is hard to understand deeply without real practice. Enough rambling, let's get started; I hope you will cheer me on, and that I can soon put this to use at work.

Study materials

  • Two books


    [Images: covers of the two books]

The first book uses a reasonably recent Hadoop version (Hadoop 2), while the second is rather dated, so I recommend following the first book for the code. For basic Hadoop theory, though, the second book is perfectly sound and helps in reading the first book's source code, because the first book concentrates on algorithms and its code carries few comments. Beyond that, read the official documentation alongside.

Chapter 1: Secondary Sort

Tools: IntelliJ IDEA + Maven

  1. First, post the Maven dependencies (pom.xml):


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>kean.learn</groupId>
    <artifactId>hadoop_spark</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>
</project>


It is best to keep these dependency versions consistent with the Hadoop cluster you have installed.

  2. Secondary sort is mainly an exercise in a single MapReduce job; I will skip the basic theory.
  • The entity (composite key) class
package org.dataalgorithms.chap01.mapreduce;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * The DateTemperaturePair class enables us to represent a
 * composite type of (yearMonth, day, temperature). To persist
 * a composite type (actually any data type) in Hadoop, it has
 * to implement the org.apache.hadoop.io.Writable interface.
 *
 * To compare composite types in Hadoop, it has to implement
 * the org.apache.hadoop.io.WritableComparable interface.
 *
 * @author Mahmoud Parsian
 */
public class DateTemperaturePair
        implements Writable, WritableComparable<DateTemperaturePair> {
    // a custom data type must implement Writable to be persisted,
    // and also WritableComparable if it needs to be compared

    private final Text yearMonth = new Text();
    private final Text day = new Text();
    private final IntWritable temperature = new IntWritable();

    public DateTemperaturePair() {
    }

    public DateTemperaturePair(String yearMonth, String day, int temperature) {
        this.yearMonth.set(yearMonth);
        this.day.set(day);
        this.temperature.set(temperature);
    }

    public static DateTemperaturePair read(DataInput in) throws IOException {
        DateTemperaturePair pair = new DateTemperaturePair();
        pair.readFields(in);
        return pair;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        yearMonth.write(out);
        day.write(out);
        temperature.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        yearMonth.readFields(in);
        day.readFields(in);
        temperature.readFields(in);
    }

    @Override
    public int compareTo(DateTemperaturePair pair) {
        // sort by yearMonth first
        int compareValue = this.yearMonth.compareTo(pair.getYearMonth());
        if (compareValue == 0) {
            // then by temperature
            compareValue = temperature.compareTo(pair.getTemperature());
        }
        //return compareValue; // to sort ascending
        // to sort descending
        return -1 * compareValue;
    }

    public Text getYearMonthDay() {
        return new Text(yearMonth.toString() + day.toString());
    }

    public Text getYearMonth() {
        return yearMonth;
    }

    public Text getDay() {
        return day;
    }

    public IntWritable getTemperature() {
        return temperature;
    }

    public void setYearMonth(String yearMonthAsString) {
        yearMonth.set(yearMonthAsString);
    }

    public void setDay(String dayAsString) {
        day.set(dayAsString);
    }

    public void setTemperature(int temp) {
        temperature.set(temp);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        // two pairs are considered equal when temperature and yearMonth are equal
        DateTemperaturePair that = (DateTemperaturePair) o;
        if (temperature != null ? !temperature.equals(that.temperature) : that.temperature != null) {
            return false;
        }
        if (yearMonth != null ? !yearMonth.equals(that.yearMonth) : that.yearMonth != null) {
            return false;
        }
        return true;
    }

    @Override
    public int hashCode() {
        int result = yearMonth != null ? yearMonth.hashCode() : 0;
        result = 31 * result + (temperature != null ? temperature.hashCode() : 0);
        return result;
    }

    @Override
    public String toString() {
        StringBuilder builder = new StringBuilder();
        builder.append("DateTemperaturePair{yearMonth=");
        builder.append(yearMonth);
        builder.append(", day=");
        builder.append(day);
        builder.append(", temperature=");
        builder.append(temperature);
        builder.append("}");
        return builder.toString();
    }
}

Each input record carries (year-month), day, and temperature, so the class defines those three fields.
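Note that compareTo() negates the combined comparison, so both yearMonth and temperature end up descending (which is why the job output later in this post lists 2019z before 2019p). A minimal Hadoop-free sketch of that ordering; the class and method names here are illustrative, not from the book:

```java
import java.util.Arrays;

public class SecondarySortSketch {

    // A plain stand-in for DateTemperaturePair: (yearMonth, temperature).
    static final class Pair {
        final String yearMonth;
        final int temperature;

        Pair(String yearMonth, int temperature) {
            this.yearMonth = yearMonth;
            this.temperature = temperature;
        }
    }

    // Sorts with the same rule as DateTemperaturePair.compareTo():
    // yearMonth first, then temperature, then negate the result.
    static String summary(Pair[] pairs) {
        Arrays.sort(pairs, (a, b) -> {
            int c = a.yearMonth.compareTo(b.yearMonth);
            if (c == 0) {
                c = Integer.compare(a.temperature, b.temperature);
            }
            return -1 * c; // the same negation as in the entity class
        });
        StringBuilder sb = new StringBuilder();
        for (Pair p : pairs) {
            sb.append(p.yearMonth).append('=').append(p.temperature).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Pair[] pairs = {
            new Pair("2019p", 40), new Pair("2019x", 9),
            new Pair("2019p", 20), new Pair("2019x", 3),
        };
        // descending yearMonth, and descending temperature within it
        System.out.println(summary(pairs));
        // prints: 2019x=9 2019x=3 2019p=40 2019p=20
    }
}
```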

  3. Define a custom partitioner
     The partitioner decides, based on the mapper's output key, which reducer each record is sent to.
package org.dataalgorithms.chap01.mapreduce;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * The DateTemperaturePartitioner is a custom partitioner class,
 * which partitions data by the natural key only (using the yearMonth).
 * Without a custom partitioner, Hadoop will partition your mapped data
 * based on a hash code.
 *
 * In Hadoop, the partitioning phase takes place after the map() phase
 * and before the reduce() phase.
 *
 * @author Mahmoud Parsian
 */
public class DateTemperaturePartitioner
        extends Partitioner<DateTemperaturePair, Text> {

    @Override
    public int getPartition(DateTemperaturePair pair, Text text, int numberOfPartitions) {
        // make sure that partitions are non-negative
        // partition by the hash of the natural key: yearMonth
        return Math.abs(pair.getYearMonth().hashCode() % numberOfPartitions);
    }
}

  4. Define a grouping comparator
     The grouping comparator controls which keys are grouped together into a single call to reducer.reduce(). This is the DateTemperatureGroupingComparator class registered in the driver below.
package org.dataalgorithms.chap01.mapreduce;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * The DateTemperatureGroupingComparator is a custom comparator class,
 * which groups data by the natural key only (using the yearMonth).
 *
 * In Hadoop, the grouping comparator decides which of the sorted keys
 * end up in the same call to reduce().
 *
 * @author Mahmoud Parsian
 */
public class DateTemperatureGroupingComparator extends WritableComparator {

    public DateTemperatureGroupingComparator() {
        super(DateTemperaturePair.class, true);
    }

    /**
     * This comparator controls which keys are grouped together
     * into a single call to the reduce() method.
     */
    @Override
    public int compare(WritableComparable wc1, WritableComparable wc2) {
        DateTemperaturePair pair = (DateTemperaturePair) wc1;
        DateTemperaturePair pair2 = (DateTemperaturePair) wc2;
        // compare the natural key (yearMonth) only
        return pair.getYearMonth().compareTo(pair2.getYearMonth());
    }
}

  5. Understanding the partitioner and the grouping comparator
     How the partitioner and the grouping comparator relate can be confusing. To restate the goal first: we want to aggregate records by year-month and output each month's daily temperatures in ascending or descending order. Given input such as:
     2019 1 12 25
     2019 2 13 25
     2019 1 12 24
     ...
     comparing the temperatures within one year-month requires that all of that year-month's records land in a single partition; if they were spread over two or more partitions, no single secondary sort could produce the desired result. That is exactly what the custom partitioner is for: it routes all records of one year-month to one partition. Within a partition, the framework then sorts the composite keys with DateTemperaturePair.compareTo() and uses the grouping comparator to decide which consecutive sorted keys share one reduce() call. The entity's equals()/hashCode() express plain object equality and are never consulted during the shuffle; that is why a separate comparator is still needed even though the entity already defines equals().
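To make that division of labor concrete, here is a small Hadoop-free simulation (all names illustrative): keys are assumed to arrive already sorted by the composite compareTo(), and the "grouping comparator" step merges consecutive keys with the same natural key (yearMonth) into one reduce() group, without ever calling equals() on the composite key:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingSketch {

    // Each key is {yearMonth, temperature}; the input is assumed to be
    // already sorted by the composite key, as it is after the shuffle sort.
    static List<String> reduceGroups(String[][] sortedKeys) {
        List<String> groups = new ArrayList<>();
        String currentNaturalKey = null;
        StringBuilder current = null;
        for (String[] key : sortedKeys) {
            // the grouping comparator compares the natural key ONLY
            if (!key[0].equals(currentNaturalKey)) {
                if (current != null) {
                    groups.add(current.toString());
                }
                currentNaturalKey = key[0];
                current = new StringBuilder(key[0] + " -> ");
            }
            current.append(key[1]).append(','); // one value list per group
        }
        if (current != null) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        String[][] sorted = {
            {"2019x", "9"}, {"2019x", "3"},
            {"2019p", "40"}, {"2019p", "20"},
        };
        // Each printed line corresponds to one reduce() call, with the
        // temperatures already in descending order.
        for (String group : reduceGroups(sorted)) {
            System.out.println(group);
        }
    }
}
```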

  6. Define the Mapper

package org.dataalgorithms.chap01.mapreduce;


import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/** 
 * SecondarySortMapper implements the map() function for 
 * the secondary sort design pattern.
 *
 * @author Mahmoud Parsian
 *
 */
public class SecondarySortMapper
        extends Mapper<LongWritable, Text, DateTemperaturePair, Text> {
    // the four type parameters are <input key, input value, output key, output value>

    // value
    private final Text theTemperature = new Text();
    // key
    private final DateTemperaturePair pair = new DateTemperaturePair();

    /**
     * @param key is generated by Hadoop (ignored here)
     * @param value has this format: "YYYY,MM,DD,temperature"
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // the first two parameters are the mapper's input key and value, and the context
        // collects the output; the input key is the byte offset of the line and is unused here
        String line = value.toString();
        String[] tokens = line.split(",");
        // YYYY = tokens[0]
        // MM = tokens[1]
        // DD = tokens[2]
        // temperature = tokens[3]
        String yearMonth = tokens[0] + tokens[1];
        String day = tokens[2];
        int temperature = Integer.parseInt(tokens[3]);

        pair.setYearMonth(yearMonth);
        pair.setDay(day);
        pair.setTemperature(temperature);
        theTemperature.set(tokens[3]);

        context.write(pair, theTemperature);
    }
}
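For a single input line, map() above does nothing more than string surgery. A standalone illustration of what one record becomes (no Hadoop types; the class name is mine, not the book's):

```java
public class MapStepSketch {

    // Mirrors the parsing in SecondarySortMapper.map() for one input line.
    // Returns "key -> value" in the shape the mapper would emit.
    static String mapOneLine(String line) {
        String[] tokens = line.split(",");
        String yearMonth = tokens[0] + tokens[1]; // e.g. "2019" + "p" = "2019p"
        String day = tokens[2];
        int temperature = Integer.parseInt(tokens[3]);
        return "(" + yearMonth + ", " + day + ", " + temperature + ") -> " + tokens[3];
    }

    public static void main(String[] args) {
        // a line from the sample input used later in this post
        System.out.println(mapOneLine("2019,p,4,40"));
        // prints: (2019p, 4, 40) -> 40
    }
}
```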

  7. The Reducer
     The reducer's input key and value types must match the mapper's output types.
package org.dataalgorithms.chap01.mapreduce;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/** 
 * SecondarySortReducer implements the reduce() function for 
 * the secondary sort design pattern.
 *
 * @author Mahmoud Parsian
 *
 */
// The book's original reducer, which emits only the natural key:
// public class SecondarySortReducer extends Reducer<DateTemperaturePair, Text, Text, Text> {
//     // reducer type parameters: <input key, input value, output key, output value>
//
//     @Override
//     protected void reduce(DateTemperaturePair key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//         StringBuilder builder = new StringBuilder();
//         for (Text value : values) {
//             builder.append(value.toString());
//             builder.append(",");
//         }
//         context.write(key.getYearMonth(), new Text(builder.toString()));
//     }
// }


public class SecondarySortReducer
        extends Reducer<DateTemperaturePair, Text, DateTemperaturePair, Text> {
    // reducer type parameters: <input key, input value, output key, output value>

    @Override
    protected void reduce(DateTemperaturePair key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // values: all of the mapper outputs grouped under this key
        StringBuilder builder = new StringBuilder();
        for (Text value : values) {
            builder.append(value.toString());
            builder.append(",");
        }
        context.write(key, new Text(builder.toString()));
    }
}

Here the output key of reduce() is changed slightly from the book's version (it writes the whole composite key rather than just the yearMonth), keeping it consistent with the mapper's output.
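The reduce() body itself is just string concatenation over the grouped values. A quick sanity check of the output format (illustrative, no Hadoop types), reproducing the value list seen in the job output below:

```java
public class ReduceStepSketch {

    // Mirrors SecondarySortReducer.reduce(): join the grouped temperature
    // values with trailing commas, as in the final job output.
    static String reduceOneGroup(String[] values) {
        StringBuilder builder = new StringBuilder();
        for (String value : values) {
            builder.append(value);
            builder.append(",");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // the values for the 2019z group from the sample run
        System.out.println(reduceOneGroup(new String[] {"8", "7", "4", "0"}));
        // prints: 8,7,4,0,
    }
}
```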

  8. Define the Driver
package org.dataalgorithms.chap01.mapreduce;

import org.apache.log4j.Logger;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;

/**
 * SecondarySortDriver is driver class for submitting secondary sort job to Hadoop.
 *
 * @author Mahmoud Parsian
 */
public class SecondarySortDriver extends Configured implements Tool {
    // this class wires up and runs the mapper and reducer

    private static Logger theLogger = Logger.getLogger(SecondarySortDriver.class);

    @Override
    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SecondarySortDriver.class);
        job.setJobName("SecondarySortDriver");

        // args[0] = input directory
        // args[1] = output directory
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // the reducer's output key/value types; these must match what reduce() writes
        job.setOutputKeyClass(DateTemperaturePair.class);
        job.setOutputValueClass(Text.class);

        // the mapper's output types; when they are identical to the
        // reducer's output types, these calls may be omitted
        // job.setMapOutputKeyClass(DateTemperaturePair.class);
        // job.setMapOutputValueClass(Text.class);


        job.setMapperClass(SecondarySortMapper.class);
        job.setReducerClass(SecondarySortReducer.class);
        job.setPartitionerClass(DateTemperaturePartitioner.class);
        job.setGroupingComparatorClass(DateTemperatureGroupingComparator.class);


        // submit the job and wait for it to complete
        boolean status = job.waitForCompletion(true);
        theLogger.info("run(): status=" + status);
        return status ? 0 : 1;
    }

    /**
     * The main driver for the secondary sort map/reduce program.
     * Invoke this method to submit the map/reduce job.
     *
     * @throws Exception when there are communication problems with the job tracker.
     */
    public static void main(String[] args) throws Exception {
        // Make sure there are exactly 2 parameters
        if (args.length != 2) {
            theLogger.warn("usage: SecondarySortDriver <input-dir> <output-dir>");
            throw new IllegalArgumentException("usage: SecondarySortDriver <input-dir> <output-dir>");
        }

        //String inputDir = args[0];
        //String outputDir = args[1];
        int returnStatus = submitJob(args);
        theLogger.info("returnStatus=" + returnStatus);

        System.exit(returnStatus);
    }


    /**
     * Submits the secondary sort map/reduce job via ToolRunner.
     *
     * @throws Exception when there are communication problems with the job tracker.
     */
    public static int submitJob(String[] args) throws Exception {
        //String[] args = new String[2];
        //args[0] = inputDir;
        //args[1] = outputDir;
        return ToolRunner.run(new SecondarySortDriver(), args);
    }
}
  9. Submitting the jar to the cluster
  1. First start the cluster (an earlier post walks through installing a cluster on virtual machines):

     [Screenshot: cluster processes up and running]

  2. Package the jar with Maven:

     [Screenshot: mvn package build output]

  3. Upload the data to HDFS:
[root@master chapter1]# cat sample_input.txt 
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
[root@master bin]# ./hadoop fs -ls /data_algorithms
[root@master bin]# ./hadoop fs -mkdir -p /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -ls /data_algorithms/chapter1
Found 1 items
drwxr-xr-x   - root supergroup          0 2019-04-14 16:13 /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -put /root/Data/data_algorithms/chapter1/sample_input.txt /data_algorithms/chapter1/input
[root@master bin]# ./hadoop fs -cat /data_algorithms/chapter1/input/sample_input.txt
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
  10. Run the jar
root@master bin]# ./hadoop jar /root/Data/data_algorithms/chapter1/hadoop_spark-1.0-SNAPSHOT.jar org.dataalgorithms.chap01.mapreduce.SecondarySortDriver /data_algorithms/chapter1/input /data_algorithms/chapter1/output
19/04/14 16:27:04 INFO client.RMProxy: Connecting to ResourceManager at master/172.16.21.220:8032
19/04/14 16:27:05 INFO input.FileInputFormat: Total input paths to process : 1
19/04/14 16:27:06 INFO mapreduce.JobSubmitter: number of splits:1
19/04/14 16:27:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555221334170_0003
19/04/14 16:27:07 INFO impl.YarnClientImpl: Submitted application application_1555221334170_0003
19/04/14 16:27:07 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1555221334170_0003/
19/04/14 16:27:07 INFO mapreduce.Job: Running job: job_1555221334170_0003
19/04/14 16:27:17 INFO mapreduce.Job: Job job_1555221334170_0003 running in uber mode : false
19/04/14 16:27:17 INFO mapreduce.Job:  map 0% reduce 0%
19/04/14 16:27:24 INFO mapreduce.Job:  map 100% reduce 0%
19/04/14 16:27:32 INFO mapreduce.Job:  map 100% reduce 100%
19/04/14 16:27:33 INFO mapreduce.Job: Job job_1555221334170_0003 completed successfully
19/04/14 16:27:33 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=234
        FILE: Number of bytes written=238407
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=289
        HDFS: Number of bytes written=334
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4958
        Total time spent by all reduces in occupied slots (ms)=5274
        Total time spent by all map tasks (ms)=4958
        Total time spent by all reduce tasks (ms)=5274
        Total vcore-milliseconds taken by all map tasks=4958
        Total vcore-milliseconds taken by all reduce tasks=5274
        Total megabyte-milliseconds taken by all map tasks=5076992
        Total megabyte-milliseconds taken by all reduce tasks=5400576
    Map-Reduce Framework
        Map input records=14
        Map output records=14
        Map output bytes=200
        Map output materialized bytes=234
        Input split bytes=131
        Combine input records=0
        Combine output records=0
        Reduce input groups=5
        Reduce shuffle bytes=234
        Reduce input records=14
        Reduce output records=5
        Spilled Records=28
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=151
        CPU time spent (ms)=1920
        Physical memory (bytes) snapshot=309817344
        Virtual memory (bytes) snapshot=4159598592
        Total committed heap usage (bytes)=165810176
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=158
    File Output Format Counters 
        Bytes Written=334
19/04/14 16:27:33 INFO mapreduce.SecondarySortDriver: run(): status=true
19/04/14 16:27:33 INFO mapreduce.SecondarySortDriver: returnStatus=0
[root@master bin]# ./hadoop fs -ls /data_algorithms/chapter1/output/
Found 2 items
-rw-r--r--   3 root supergroup          0 2019-04-14 16:27 /data_algorithms/chapter1/output/_SUCCESS
-rw-r--r--   3 root supergroup        334 2019-04-14 16:27 /data_algorithms/chapter1/output/part-r-00000
[root@master bin]# ./hadoop fs -cat /data_algorithms/chapter1/output/p*
DateTemperaturePair{yearMonth=2019z, day=4, temperature=0}  8,7,4,0,
DateTemperaturePair{yearMonth=2019y, day=3, temperature=1}  7,5,1,
DateTemperaturePair{yearMonth=2019x, day=1, temperature=3}  9,6,3,
DateTemperaturePair{yearMonth=2019r, day=3, temperature=60} 60,
DateTemperaturePair{yearMonth=2019p, day=1, temperature=10} 40,20,10,
  11. Rewriting the commands as a script
[root@master chapter1]# ./run.sh 
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: `/data_algorithms/chapter1/output': No such file or directory
2019,p,4,40
2019,p,6,20
2019,x,2,9
2019,y,2,5
2019,x,1,3
2019,y,1,7
2019,y,3,1
2019,x,3,6
2019,z,1,4
2019,z,2,8
2019,z,3,7
2019,z,4,0
2019,p,1,10
2019,r,3,60
./run.sh: line 7: org.dataalgorithms.chap01.mapreduce.SecondarySortDriver: command not found
Exception in thread "main" java.lang.ClassNotFoundException: /data_algorithms/chapter1/input
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
[root@master chapter1]# cat run.sh 
# run.sh
export APP_JAR=/root/Data/data_algorithms/chapter1/hadoop_spark-1.0-SNAPSHOT.jar
INPUT=/data_algorithms/chapter1/input
OUTPUT=/data_algorithms/chapter1/output
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT
$HADOOP_HOME/bin/hadoop fs -cat $INPUT/sam*
PROG=package org.dataalgorithms.chap01.mapreduce.SecondarySortDriver
$HADOOP_HOME/bin/hadoop jar $APP_JAR $PROG $INPUT $OUTPUT

The `cat` does print the input file, so the HDFS paths are fine; both errors trace back to line 7 of the script, `PROG=package org.dataalgorithms.chap01.mapreduce.SecondarySortDriver`. In shell, the form `VAR=value command` runs `command` with a temporary assignment `VAR=value`, so the class name is executed as a command (hence "command not found") and `$PROG` remains empty afterwards. With `$PROG` empty, `hadoop jar` then takes the next argument, the input path, as the main class name, which produces `ClassNotFoundException: /data_algorithms/chapter1/input`. The fix is to delete the stray word `package`, so that the line reads `PROG=org.dataalgorithms.chap01.mapreduce.SecondarySortDriver`.
