Accelerating Comparison by Providing RawComparator

When a job is in its sorting or merging phase, Hadoop uses the RawComparator for the map output key to compare keys. Built-in Writable classes such as IntWritable have byte-level implementations that are fast because they don't require the byte form of the object to be unmarshalled into object form for the comparison. When writing your own Writable, it may be tempting to implement only the WritableComparable interface, because it is easy to implement without knowing how the custom Writable is laid out in byte form. Unfortunately, it requires unmarshalling objects from their byte form, which makes comparisons inefficient.

 

In this blog post, I'll show you how to implement a custom RawComparator to avoid these inefficiencies. For comparison, I'll implement the WritableComparable interface first, then implement a RawComparator for the same custom object.

 

Suppose you have a custom Writable called Person. To make it comparable, you implement WritableComparable like this:

import org.apache.hadoop.io.WritableComparable;

import java.io.*;

public class Person implements WritableComparable<Person> {

    private String firstName;
    private String lastName;

    public Person() {
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.lastName = in.readUTF();
        this.firstName = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(lastName);
        out.writeUTF(firstName);
    }

    @Override
    public int compareTo(Person other) {
        int cmp = this.lastName.compareTo(other.lastName);
        if (cmp != 0) {
            return cmp;
        }
        return this.firstName.compareTo(other.firstName);
    }

    public void set(String lastName, String firstName) {
        this.lastName = lastName;
        this.firstName = firstName;
    }
}

The trouble with this comparator is that MapReduce stores your intermediate map output data in byte form, and every time it needs to sort your data, it has to unmarshall it into Writable form to perform the comparison. This unmarshalling is expensive because it recreates your objects for comparison purposes.
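To make that cost concrete, here is a minimal stand-alone sketch of the "deserialize, then compare" path. It uses only java.io rather than Hadoop itself, and the class and method names (DeserializingCompareSketch, serialize, compareByDeserializing) are hypothetical, introduced just for illustration. Every comparison allocates four new String objects before a single character is compared.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; this is not Hadoop code.
class DeserializingCompareSketch {

    // Serialize the two fields in the same order as Person.write():
    // last name first, then first name.
    static byte[] serialize(String lastName, String firstName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(lastName);
            out.writeUTF(firstName);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild all four strings from the byte form, then compare them.
    // The allocations made here are the overhead a byte-level comparator avoids.
    static int compareByDeserializing(byte[] a, byte[] b) {
        try {
            DataInputStream inA = new DataInputStream(new ByteArrayInputStream(a));
            DataInputStream inB = new DataInputStream(new ByteArrayInputStream(b));
            String lastA = inA.readUTF();
            String firstA = inA.readUTF();
            String lastB = inB.readUTF();
            String firstB = inB.readUTF();
            int cmp = lastA.compareTo(lastB);
            return cmp != 0 ? cmp : firstA.compareTo(firstB);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] alice = serialize("Smith", "Alice");
        byte[] bob = serialize("Smith", "Bob");
        System.out.println(compareByDeserializing(alice, bob) < 0); // prints true
    }
}
```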

 

To write a byte-level comparator for the Person class, we have to implement the RawComparator interface. Let's revisit the Person class and look at how to do this. In the Person class, we store the two fields, first name and last name, as strings, and use DataOutput's writeUTF method to write them out.

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(lastName);
        out.writeUTF(firstName);
    }

If you read the Javadoc of DataOutputStream's writeUTF(String str, DataOutput out), you will see the following statement:

 

     * First, two bytes are written to out as if by the <code>writeShort</code>
     * method giving the number of bytes to follow. This value is the number of
     * bytes actually written out, not the length of the string. Following the
     * length, each character of the string is output, in sequence, using the
     * modified UTF-8 encoding for the character. If no exception is thrown, the
     * counter <code>written</code> is incremented by the total number of
     * bytes written to the output stream. This will be at least two
     * plus the length of <code>str</code>, and at most two plus
     * thrice the length of <code>str</code>.

This simply means that the writeUTF method writes two bytes containing the length of the string, followed by the byte form of the string.
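The layout is easy to check with plain java.io. The following is a small sketch, not part of the original Person code; the class name WriteUtfLayout and its helper methods are made up for this illustration. It writes one string with writeUTF and inspects the resulting bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper for inspecting what writeUTF produces.
class WriteUtfLayout {

    // Encode a single string with writeUTF and return the raw bytes.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode the two-byte, big-endian length prefix at the start of the array.
    static int lengthPrefix(byte[] bytes) {
        return ((bytes[0] & 0xff) << 8) | (bytes[1] & 0xff);
    }

    public static void main(String[] args) {
        byte[] smith = encode("Smith");
        // Two length bytes (0, 5) followed by the five bytes of "Smith".
        System.out.println(Arrays.toString(smith)); // prints [0, 5, 83, 109, 105, 116, 104]

        // The prefix counts encoded bytes, not characters:
        // U+00E9 takes two bytes in modified UTF-8.
        System.out.println(lengthPrefix(encode("\u00e9"))); // prints 2
    }
}
```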

 

Assume that you want to perform a lexicographical comparison that includes both the last and the first name. You cannot do this with the entire byte array because the string lengths are also encoded in the array. Instead, the comparator needs to be smart enough to skip over the string lengths, as the code below shows:

import org.apache.hadoop.io.WritableComparator;

public class PersonBinaryComparator extends WritableComparator {
    protected PersonBinaryComparator() {
        super(Person.class, true);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
                       int l2) {
        
        // Compare last name
        int lastNameResult = compare(b1, s1, b2, s2);

        // If the last names differ, return the result of the comparison
        if (lastNameResult != 0) {
            return lastNameResult;
        }

        // Read the size of the last name from the byte array
        int b1l1 = readUnsignedShort(b1, s1);
        int b2l1 = readUnsignedShort(b2, s2);

        // Skip each last name (2 length bytes + data) and compare the first names
        return compare(b1, s1 + b1l1 + 2, b2, s2 + b2l1 + 2);
    }

    // Compare string in byte form
    public static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        // Read the size of the UTF-8 string in byte array
        int b1l1 = readUnsignedShort(b1, s1);
        int b2l1 = readUnsignedShort(b2, s2);

        // Perform lexicographical comparison of the UTF-8 binary data
        // with the WritableComparator.compareBytes(...) method
        return compareBytes(b1, s1 + 2, b1l1, b2, s2 + 2, b2l1);
    }

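    // Read two bytes as an unsigned 16-bit big-endian value.
    // Java bytes are signed, so mask each one with 0xff before combining.
    public static int readUnsignedShort(byte[] b, int offset) {
        int ch1 = b[offset] & 0xff;
        int ch2 = b[offset + 1] & 0xff;
        return (ch1 << 8) + ch2;
    }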
}
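If you want to try the skip-over-the-lengths logic without a Hadoop classpath, the sketch below mirrors it using only the JDK. The class RawPersonCompareSketch and its compareBytes method are stand-ins written for this illustration (compareBytes plays the role of WritableComparator.compareBytes, assumed here to be an unsigned lexicographic comparison); they are not Hadoop's implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-alone mirror of the byte-level comparison; no Hadoop classes.
class RawPersonCompareSketch {

    // Same field order as Person.write(): last name, then first name.
    static byte[] serialize(String lastName, String firstName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(lastName);
            out.writeUTF(firstName);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Two-byte big-endian length prefix, masked so that values >= 128 stay positive.
    static int readUnsignedShort(byte[] b, int offset) {
        return ((b[offset] & 0xff) << 8) | (b[offset + 1] & 0xff);
    }

    // Stand-in for an unsigned, lexicographic byte comparison.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    // Compare last names first; on a tie, skip past them and compare first names.
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        int lastLen1 = readUnsignedShort(b1, s1);
        int lastLen2 = readUnsignedShort(b2, s2);
        int cmp = compareBytes(b1, s1 + 2, lastLen1, b2, s2 + 2, lastLen2);
        if (cmp != 0) {
            return cmp;
        }
        int first1 = s1 + 2 + lastLen1; // offset of the first name's length prefix
        int first2 = s2 + 2 + lastLen2;
        return compareBytes(b1, first1 + 2, readUnsignedShort(b1, first1),
                            b2, first2 + 2, readUnsignedShort(b2, first2));
    }

    public static void main(String[] args) {
        byte[] alice = serialize("Smith", "Alice");
        byte[] bob = serialize("Smith", "Bob");
        System.out.println(compare(alice, 0, bob, 0) < 0); // prints true
    }
}
```

In an actual job, a comparator such as PersonBinaryComparator still has to be registered before Hadoop will use it, typically through Job.setSortComparatorClass or WritableComparator.define; which one fits depends on your setup.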

 

Final note: using writeUTF is limiting because it can only support strings whose encoded form is at most 65,535 bytes (the largest value the two-byte length prefix can hold). If you need to work with larger strings, you should look at using Hadoop's Text class, which can support much larger strings. The implementation of Text's comparator is similar to what we completed in this blog post.
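The limit is easy to observe with the JDK alone. In the sketch below, the class WriteUtfLimit and its helpers are hypothetical names used only for this demonstration. It relies on writeUTF throwing UTFDataFormatException once the encoded form exceeds 65,535 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demonstration of the writeUTF size limit.
class WriteUtfLimit {

    // Returns true if writeUTF accepts the string, false if it is too long to encode.
    static boolean fitsInWriteUtf(String s) {
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false; // encoded form would exceed the two-byte length prefix
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Build a string consisting of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(fitsInWriteUtf(repeat('a', 65535)));      // prints true
        System.out.println(fitsInWriteUtf(repeat('a', 65536)));      // prints false
        // 40,000 characters, but 80,000 bytes once encoded: the limit is in bytes.
        System.out.println(fitsInWriteUtf(repeat('\u00e9', 40000))); // prints false
    }
}
```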

 
