"Hadoop: The Definitive Guide" reading notes:
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
MapReduce:
http://strata.oreilly.com/2011/01/what-is-hadoop.html
Quote:
MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
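To make the quoted description concrete, here is a minimal word-count sketch against Hadoop's org.apache.hadoop.mapreduce API (the class names are mine, not from the book):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: "map the operation out" -- each map task emits (word, 1)
// for every word in its slice of the input.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: "reduce the results back" -- all counts for the same word
// arrive at one reduce call and are summed into a single result.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```

The shuffle between the two phases groups every (word, 1) pair by word, which is what lets a single reduce call produce one count per word.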
MapReduce can be seen as a complement to a Relational Database Management System (RDBMS).
MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data.
MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
Another difference between MapReduce and an RDBMS is the amount of structure in the datasets on which they operate. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data.
MapReduce works well on unstructured or semi-structured data because it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data, but are chosen by the person analyzing the data.
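A small sketch of that last point, assuming a hypothetical one-record-per-line format of "YYYY-MM-DD temperature": the key is not in the data, it is imposed by the mapper at processing time, and a different analysis could key the very same lines differently (all names here are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The input is plain text; the key is chosen at processing time.
// Keying by year supports "temperature per year"; keying by month
// (the commented-out line) would answer a different question from
// the exact same raw input.
public class YearTemperatureMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");  // e.g. "2001-07-14 31.5"
    if (fields.length != 2) {
      return;  // skip malformed records rather than failing the task
    }
    String year = fields[0].substring(0, 4);
    // String month = fields[0].substring(0, 7);  // alternative key choice
    context.write(new Text(year),
        new DoubleWritable(Double.parseDouble(fields[1])));
  }
}
```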
Chapter 2, MapReduce - Scaling Out (the whole chapter needs a thorough read):
Quote:
A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
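A sketch of how those three ingredients of a job (input data, program, configuration) come together in a driver, reusing the word-count classes sketched earlier (Job.getInstance is the Hadoop 2.x idiom; older releases used new Job(conf) instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // configuration information
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);        // the MapReduce program
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // The input data: Hadoop turns each file here into one or more
    // splits and schedules one map task per split.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the driver never specifies a number of map tasks directly: Hadoop derives it from the number of splits.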
HDFS:
These are areas where HDFS is not a good fit today:
1 Low-latency data access - HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.
2 Lots of small files - Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode (see the rough arithmetic after this list).
3 Multiple writers, arbitrary file modifications - Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers or for modifications at arbitrary offsets in the file. (These might be supported in the future, but they are likely to be relatively inefficient.)
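To get a rough sense of the small-files limit, using the figure of about 150 bytes of namenode memory per filesystem object (file, directory, or block) that is commonly cited for HDFS: 10 million files, each small enough to fit in a single block, means roughly 20 million objects (10M files + 10M blocks), or about 20,000,000 x 150 B ≈ 3 GB of namenode heap for metadata alone, before any actual data is stored. The exact per-object cost varies by version, so treat this as an order-of-magnitude estimate.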
Hadoop Default Ports Quick Reference:
http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
Hadoop FS Shell Guide (0.19):
http://hadoop.apache.org/docs/r0.19.1/hdfs_shell.html