Why Not Use Java Object Serialization?

Java comes with its own serialization mechanism, called Java Object Serialization (often
referred to simply as “Java Serialization”), that is tightly integrated with the language,
so it’s natural to ask why this wasn’t used in Hadoop. Here’s what Doug Cutting said
in response to that question:
Why didn’t I use Serialization when we first started Hadoop? Because it looked
big and hairy and I thought we needed something lean and mean, where we had
precise control over exactly how objects are written and read, since that is central
to Hadoop. With Serialization you can get some control, but you have to fight for
it.
The logic for not using RMI was similar. Effective, high-performance inter-process
communications are critical to Hadoop. I felt like we’d need to precisely control
how things like connections, timeouts and buffers are handled, and RMI gives you
little control over those.
The problem is that Java Serialization doesn’t meet the criteria for a serialization format
listed earlier: compact, fast, extensible, and interoperable.

Java Serialization is not compact: it writes the classname of each object being written
to the stream—this is true of classes that implement java.io.Serializable or
java.io.Externalizable. Subsequent instances of the same class write a reference handle
to the first occurrence, which occupies only 5 bytes. However, reference handles
don’t work well with random access, since the referent class may occur at any point in
the preceding stream—that is, there is state stored in the stream. Even worse, reference
handles play havoc with sorting records in a serialized stream, since the first record of
a particular class is distinguished and must be treated as a special case.
All these problems are avoided by not writing the classname to the stream at all, which
is the approach that Writable takes. This makes the assumption that the client knows
the expected type. The result is that the format is considerably more compact than Java
Serialization, and random access and sorting work as expected since each record is
independent of the others (so there is no stream state).
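To make the size difference concrete, here is a minimal sketch (assuming Hadoop's org.apache.hadoop.io.IntWritable is on the classpath) that serializes the same int value both ways and prints how many bytes each produces. The Java Serialization output carries the stream header and the class descriptor for java.lang.Integer, while the Writable output is just the raw four bytes of the value.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

import org.apache.hadoop.io.IntWritable;

// Compares the serialized size of the same int value under Java
// Serialization and under Hadoop's Writable mechanism.
public class SerializationSizeDemo {
    public static void main(String[] args) throws IOException {
        // Java Serialization: the stream header plus the class descriptor
        // for java.lang.Integer are written alongside the 4-byte value.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(163); // autoboxed to Integer
        }

        // Writable: only the raw 4 bytes of the int are written; the reader
        // is expected to know the type in advance.
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(writableBytes)) {
            new IntWritable(163).write(out);
        }

        System.out.println("Java Serialization: " + javaBytes.size() + " bytes");
        System.out.println("Writable:           " + writableBytes.size() + " bytes");
    }
}
```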
Java Serialization is a general-purpose mechanism for serializing graphs of objects, so
it necessarily has some overhead for serialization and deserialization operations. What’s
more, the deserialization procedure creates a new instance for each object deserialized
from the stream. Writable objects, on the other hand, can be (and often are) reused.
For example, for a MapReduce job, which at its core serializes and deserializes billions
of records of just a handful of different types, the savings gained by not having to allocate
new objects are significant.
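The reuse pattern looks roughly like the following sketch: a single IntWritable is repopulated with readFields() on each iteration instead of allocating a fresh object per record. The sum() helper and the assumption that the input stream holds back-to-back serialized ints are illustrative only.

```java
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class WritableReuse {
    // Sums numRecords serialized ints from the stream, reusing one
    // IntWritable instance instead of allocating a new object per record.
    public static long sum(DataInputStream in, int numRecords) throws IOException {
        IntWritable value = new IntWritable(); // single reusable instance
        long total = 0;
        for (int i = 0; i < numRecords; i++) {
            value.readFields(in); // overwrites the instance in place
            total += value.get();
        }
        return total;
    }
}
```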
In terms of extensibility, Java Serialization has some support for evolving a type, but it
is brittle and hard to use effectively. (Writables have no built-in support: the programmer
has to manage versioning manually, for example as sketched below.)
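One common hand-rolled approach is to write a version number ahead of the fields and branch on it when reading back, so that data written by an older version of the class can still be read. The class and field names below are hypothetical; this is only a sketch of managing evolution manually.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical record that manages its own evolution: a version byte is
// written first, and readFields() branches on it so that data written
// before a field was added can still be read.
public class VersionedRecord implements Writable {
    private static final byte CURRENT_VERSION = 2;

    private int id;         // present since version 1
    private long timestamp; // added in version 2

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(CURRENT_VERSION);
        out.writeInt(id);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        id = in.readInt();
        // Records written before version 2 have no timestamp field.
        timestamp = (version >= 2) ? in.readLong() : 0L;
    }
}
```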
In principle, other languages could interpret the Java Serialization stream protocol (defined
by the Java Object Serialization Specification), but in practice there are no widely
used implementations in other languages, so it is a Java-only solution. The situation is
the same for Writables.
