lHDFS too has the concept of a block, but it is a much larger unit: 64 MB by default.
lLike in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units.
lUnlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
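As a rough illustration of the two points above, the following sketch computes how many block-sized chunks a file is broken into and how many bytes its final block actually holds. The class and method names are hypothetical, and the block size is simply the 64 MB default mentioned above.

```java
class BlockMath {
    // 64 MB default block size, as stated in the notes above.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    /** Number of block-sized chunks needed for a file of the given size. */
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    /** Bytes stored in the final block; it is not padded out to a full block. */
    static long lastBlockBytes(long fileSize) {
        if (fileSize == 0) return 0;
        long remainder = fileSize % BLOCK_SIZE;
        return remainder == 0 ? BLOCK_SIZE : remainder;
    }

    public static void main(String[] args) {
        long size = 150L * 1024 * 1024; // a hypothetical 150 MB file
        System.out.println(blockCount(size) + " blocks, last block holds "
                + lastBlockBytes(size) + " bytes");
    }
}
```

A 150 MB file therefore needs three blocks, and the third one occupies only 22 MB of underlying storage.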
lThe namenode manages the filesystem namespace.
nIt maintains the filesystem tree and the metadata for all the files and directories in the tree.
nThis information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
nThe namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
lDatanodes are the workhorses of the filesystem.
nThey store and retrieve blocks when they are told to (by clients or the namenode).
nThey report back to the namenode periodically with lists of blocks that they are storing.
lThe secondary namenode
nIt does not act as a namenode.
nIts main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
nIt keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.
lThe VERSION file is a Java properties file that contains information about the version of HDFS that is running.
nThe layoutVersion is a negative integer that defines the version of HDFS’s persistent data structures.
nThe namespaceID is a unique identifier for the filesystem, which is created when the filesystem is first formatted.
nThe cTime property marks the creation time of the namenode’s storage.
nThe storageType indicates that this storage directory contains data structures for a namenode.
lWhen a filesystem client performs a write operation, it is first recorded in the edit log.
lThe namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified.
lThe edit log is flushed and synced after every write before a success code is returned to the client.
lThe fsimage file is a persistent checkpoint of the filesystem metadata. It is not updated for every filesystem write operation.
lIf the namenode fails, then the latest state of its metadata can be reconstructed by loading the fsimage from disk into memory, then applying each of the operations in the edit log.
lThis is precisely what the namenode does when it starts up.
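The recovery procedure described above can be sketched as a replay loop. This is a simplified model, not the namenode's actual code: the namespace is reduced to a set of paths, only two hypothetical operation types (CREATE and DELETE) are modelled, and all names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class EditLogReplay {
    /** One logged namespace operation; only CREATE and DELETE are modelled here. */
    static final class Edit {
        final String op;
        final String path;
        Edit(String op, String path) { this.op = op; this.path = path; }
    }

    /** Rebuilds the in-memory namespace: load the checkpoint, then apply each edit in order. */
    static TreeSet<String> recover(List<String> fsimage, List<Edit> editLog) {
        TreeSet<String> namespace = new TreeSet<>(fsimage);
        for (Edit e : editLog) {
            if (e.op.equals("CREATE")) {
                namespace.add(e.path);
            } else if (e.op.equals("DELETE")) {
                namespace.remove(e.path);
            } else {
                throw new IllegalArgumentException("unknown op: " + e.op);
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        List<String> fsimage = Arrays.asList("/a", "/b");
        List<Edit> edits = Arrays.asList(new Edit("CREATE", "/c"), new Edit("DELETE", "/a"));
        System.out.println(recover(fsimage, edits)); // namespace after replay
    }
}
```

The same merge of a checkpoint with the logged operations is what a checkpoint on the secondary namenode performs, which is why a long edit log makes startup slow.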
lThe fsimage file contains a serialized form of all the directory and file inodes in the filesystem.
lThe role of the secondary namenode is to produce checkpoints of the primary’s in-memory filesystem metadata.
lThe checkpointing process proceeds as follows:
nThe secondary asks the primary to roll its edits file, so new edits go to a new file.
nThe secondary retrieves fsimage and edits from the primary (using HTTP GET).
nThe secondary loads fsimage into memory, applies each operation from edits, then creates a new consolidated fsimage file.
nThe secondary sends the new fsimage back to the primary (using HTTP POST).
nThe primary replaces the old fsimage with the new one from the secondary, and the old edits file with the new one it started in step 1. It also updates the fstime file to record the time that the checkpoint was taken.
nAt the end of the process, the primary has an up-to-date fsimage file, and a shorter edits file.
lA datanode’s VERSION file
lThe other files in the datanode’s current storage directory are the files with the blk_ prefix.
nThere are two types: the HDFS blocks themselves (which just consist of the file’s raw bytes) and the metadata for a block (with a .meta suffix).
nA block file just consists of the raw bytes of a portion of the file being stored;
nthe metadata file is made up of a header with version and type information, followed by a series of checksums for sections of the block.
lWhen the number of blocks in a directory grows to a certain size, the datanode creates a new subdirectory in which to place new blocks and their accompanying metadata.
lThe client opens the file it wishes to read by calling open() on the FileSystem object (step 1).
lDistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file (step 2).
lFor each block, the namenode returns the addresses of the datanodes that have a copy of that block.
lThe datanodes are sorted according to their proximity to the client.
lThe DistributedFileSystem returns an FSDataInputStream to the client for it to read data from.
lThe client then calls read() on the stream (step 3).
lDFSInputStream connects to the first (closest) datanode for the first block in the file.
lData is streamed from the datanode back to the client (step 4).
lWhen the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block (step 5).
lWhen the client has finished reading, it calls close() on the FSDataInputStream (step 6).
lDuring reading, if the client encounters an error while communicating with a datanode, then it will try the next closest one for that block.
lIt will also remember datanodes that have failed so that it doesn’t needlessly retry them for later blocks.
lThe client also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, it is reported to the namenode.
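The failure handling on the read path can be sketched without any Hadoop classes. In the model below, `BlockSource` is a hypothetical stand-in for a datanode connection (returning null means the read failed), and the replica list is assumed to arrive already sorted by proximity, as the namenode provides it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ReadFailover {
    /** Stand-in for a datanode connection; returns null to signal a failed read. */
    interface BlockSource {
        String read(String datanode, int blockId);
    }

    // Failed datanodes are remembered across blocks for the lifetime of the stream.
    private final Set<String> deadNodes = new HashSet<>();

    /** Tries replicas in proximity order, skipping datanodes that failed earlier. */
    String readBlock(int blockId, List<String> replicasByProximity, BlockSource source) {
        for (String node : replicasByProximity) {
            if (deadNodes.contains(node)) continue;
            String data = source.read(node, blockId);
            if (data != null) return data;
            deadNodes.add(node); // do not retry this node for later blocks
        }
        throw new IllegalStateException("no live replica for block " + blockId);
    }

    Set<String> deadNodes() { return deadNodes; }

    public static void main(String[] args) {
        // Hypothetical cluster in which dn1 is unreachable.
        BlockSource source = (node, id) -> node.equals("dn1") ? null : "block" + id + "@" + node;
        ReadFailover client = new ReadFailover();
        System.out.println(client.readBlock(0, Arrays.asList("dn1", "dn2", "dn3"), source));
        System.out.println(client.readBlock(1, Arrays.asList("dn1", "dn3"), source));
    }
}
```

On the second call dn1 is skipped without being contacted, which is the "remember failed datanodes" behaviour described above. Checksum verification is left out of this sketch.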
lThe client creates the file by calling create() (step 1).
lDistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it (step 2).
lThe namenode performs various checks to make sure the file doesn’t already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException.
lThe DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
lAs the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue.
lThe data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline.
lThe DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
lDFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).
lIf a datanode fails while data is being written to it, the following actions are taken:
nFirst the pipeline is closed, and any packets in the ack queue are added to the front of the data queue.
nThe current block on the good datanodes is given a new identity by the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on.
nThe failed datanode is removed from the pipeline and the remainder of the block’s data is written to the two good datanodes in the pipeline.
nThe namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node.
lWhen the client has finished writing data, it calls close() on the stream (step 6). This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete (step 7).
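The interplay of the data queue and the ack queue on the write path can be modelled with two deques. This is only a sketch under simplifying assumptions: packets are plain integers, acknowledgments arrive in order, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

class PacketQueues {
    final Deque<Integer> dataQueue = new ArrayDeque<>(); // packets waiting to be sent
    final Deque<Integer> ackQueue = new ArrayDeque<>();  // packets sent but not yet acknowledged

    void write(int packet) { dataQueue.addLast(packet); }

    /** Sends the next packet down the pipeline and parks it in the ack queue. */
    void sendNext() { ackQueue.addLast(dataQueue.removeFirst()); }

    /** Called once every datanode in the pipeline has acknowledged the oldest packet. */
    void ackOldest() { ackQueue.removeFirst(); }

    /** On pipeline failure, unacknowledged packets go back to the front of the data queue, in order. */
    void onPipelineFailure() {
        Iterator<Integer> newestFirst = ackQueue.descendingIterator();
        while (newestFirst.hasNext()) dataQueue.addFirst(newestFirst.next());
        ackQueue.clear();
    }

    public static void main(String[] args) {
        PacketQueues q = new PacketQueues();
        for (int p = 1; p <= 4; p++) q.write(p);
        q.sendNext(); q.sendNext(); q.sendNext(); // packets 1..3 are in flight
        q.ackOldest();                            // packet 1 is fully acknowledged
        q.onPipelineFailure();                    // packets 2 and 3 must be resent
        System.out.println("data queue: " + q.dataQueue + ", ack queue: " + q.ackQueue);
    }
}
```

Putting the unacknowledged packets at the front of the data queue means no packet is lost when a datanode downstream of the failure never received it.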
lMapReduce has two phases: the map phase and the reduce phase.
lEach phase has key-value pairs as input and output (the types can be specified).
nThe input key-value types of the map phase are determined by the input format.
nThe output key-value types of the map phase should match the input key-value types of the reduce phase.
nThe output key-value types of the reduce phase can be set in the JobConf interface.
lThe programmer specifies two functions: the map function and the reduce function.
lThe input types of the reduce function must match the output type of the map function.
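To make the relationship between the two functions concrete, here is a tiny in-memory model of a job that finds the maximum value per key. It does not use the Hadoop API; the "year,temperature" record layout, the class name, and the grouping step that stands in for the framework are all assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    /** Map function: parses a "year,temperature" line and emits (year, temperature). */
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0], Integer.parseInt(parts[1].trim()));
    }

    /** Reduce function: receives one key with all of its values and emits the maximum. */
    static int reduce(String year, List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) max = Math.max(max, t);
        return max;
    }

    /** Runs map over every record, groups the map output by key, then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // sorted by key, like reducer input
        for (String line : lines) {
            Map.Entry<String, Integer> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("1949,111", "1950,0", "1949,78", "1950,22")));
    }
}
```

The map output type (String key, Integer value) is exactly the reduce input type, which is the constraint stated above.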
lAn input path is specified by calling the static addInputPath() method on FileInputFormat
nIt can be a single file, a directory, or a file pattern.
naddInputPath() can be called more than once to use input from multiple paths.
lThe output path is specified by the static setOutputPath() method on FileOutputFormat.
nIt specifies a directory where the output files from the reducer functions are written.
nThe directory shouldn’t exist before running the job
lThe map and reduce types can be specified via the setMapperClass() and setReducerClass() methods.
lThe setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions, which are often the same.
nIf they are different, then the map output types can be set using the methods setMapOutputKeyClass() and setMapOutputValueClass().
lThe input types are controlled via the input format, which we have not explicitly set since we are using the default TextInputFormat.
lA MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information.
lHadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
lThere are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.
nThe jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
nTasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
nIf a task fails, the jobtracker can reschedule it on a different tasktracker.
lHadoop divides the input to a MapReduce job into fixed-size input splits.
lHadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
lHadoop does its best to run the map task on a node where the input data resides in HDFS.
nThis is called the data locality optimization.
nThis is why the optimal split size is the same as the block size: it is the largest size of input that can be guaranteed to be stored on a single node.
lReduce tasks don’t have the advantage of data locality
nThe input to a single reduce task is normally the output from all mappers.
nThe output of the reduce is normally stored in HDFS for reliability.
lThe number of reduce tasks is not governed by the size of the input, but is specified independently.
lWhen there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task.
lThere can be many keys (and their associated values) in each partition, but the records for every key are all in a single partition.
lThe partitioning can be controlled by a user-defined partitioning function
nNormally the default partitioner is used, which buckets keys using a hash function.
nconf.setPartitionerClass(HashPartitioner.class);
nconf.setNumReduceTasks(1);
lThe data flow between map and reduce tasks is known as “the shuffle,” as each reduce task is fed by many map tasks.
lIt’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle since the processing can be carried out entirely in parallel
lThe map and reduce functions in Hadoop MapReduce have the following general form:
lThe partition function operates on the intermediate key and value types (K2 and V2), and returns the partition index.
lInput types are set by the input format.
nFor instance, a TextInputFormat generates keys of type LongWritable and values of type Text.
lA minimal MapReduce driver, with the defaults explicitly set
lThe default input format is TextInputFormat, which produces keys of type LongWritable (the offset of the beginning of the line in the file) and values of type Text (the line of text).
lThe setNumMapTasks() call does not necessarily set the number of map tasks to one
nThe actual number of map tasks depends on the size of the input
lThe default mapper is IdentityMapper
lMap tasks are run by MapRunner, the default implementation of MapRunnable that calls the Mapper’s map() method sequentially with each record.
lThe default partitioner is HashPartitioner, which hashes a record’s key to determine which partition the record belongs in.
nEach partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job
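A minimal sketch of hash partitioning follows. To the best of my knowledge it mirrors the computation performed by Hadoop's HashPartitioner (mask off the sign bit of the key's hash code, then take the remainder by the number of reduce tasks), but the class and method names here are hypothetical and nothing is imported from Hadoop.

```java
class HashPartitionSketch {
    /** Masks off the sign bit so the result is never negative, then takes the remainder. */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"1949", "1950", "1951"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Because the result depends only on the key, every record with the same key lands in the same partition and therefore reaches the same reduce task. With a single reduce task every key maps to partition 0.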
lThe default reducer is IdentityReducer
lRecords are sorted by the MapReduce system before being presented to the reducer.
lThe default output format is TextOutputFormat, which writes out records, one per line, by converting keys and values to strings and separating them with a tab character.
lAn input split is a chunk of the input that is processed by a single map.
lEach split is divided into records, and the map processes each record—a key-value pair—in turn.
lAn InputSplit has a length in bytes, and a set of storage locations, which are just hostname strings.
lA split doesn’t contain the input data; it is just a reference to the data.
lThe storage locations are used by the MapReduce system to place map tasks as close to the split’s data as possible
lThe size is used to order the splits so that the largest get processed first
lAn InputFormat is responsible for creating the input splits, and dividing them into records.
lThe JobClient calls the getSplits() method, passing the desired number of map tasks as the numSplits argument.
lHaving calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.
lOn a tasktracker, the map task passes the split to the getRecordReader() method on InputFormat to obtain a RecordReader for that split.
lA RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function.
lThe same key and value objects are used on each invocation of the map() method; only their contents are changed. If you need to keep a key or value beyond a single map() call, make a copy of the object you want to hold on to.
lFileInputFormat is the base class for all implementations of InputFormat that use files as their data source.
lIt provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files.
lFileInputFormat input paths may represent a file, a directory, or, by using a glob, a collection of files and directories.
lTo exclude certain files from the input, you can set a filter using the setInputPathFilter() method on FileInputFormat
lFileInputFormat splits only large files. Here “large” means larger than an HDFS block.
lProperties for controlling split size
nThe minimum split size is usually 1 byte. By setting it to a value larger than the block size, you can force splits to be larger than a block.
nThe maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block.
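The effect of the two properties can be summarised by the formula max(minimumSize, min(maximumSize, blockSize)), which is how I understand FileInputFormat to derive the split size. The sketch below just evaluates that formula; the class name and the sample sizes are hypothetical.

```java
class SplitSize {
    /** Split size as max(minimumSize, min(maximumSize, blockSize)). */
    static long compute(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long block = 64 * mb;
        System.out.println(compute(1, Long.MAX_VALUE, block) / mb + " MB with defaults");
        System.out.println(compute(128 * mb, Long.MAX_VALUE, block) / mb + " MB with a 128 MB minimum");
        System.out.println(compute(1, 32 * mb, block) / mb + " MB with a 32 MB maximum");
    }
}
```

With the defaults the split size equals the block size. Only a minimum above the block size or a maximum below it changes the result.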
lHadoop works better with a small number of large files than a large number of small files.
lWhere FileInputFormat creates a split per file, CombineFileInputFormat packs many files into each split so that each mapper has more to process.
lOne technique for avoiding the many small files case is to merge small files into larger files by using a SequenceFile: the keys can act as filenames and the values as file contents.
lTextInputFormat is the default InputFormat.
nEach record is a line of input.
nThe key, a LongWritable, is the byte offset within the file of the beginning of the line.
nThe value is the contents of the line, excluding any line terminators, and is packaged as a Text object.
lThe logical records that FileInputFormats define do not usually fit neatly into HDFS blocks.
lA single file is broken into lines, and the line boundaries do not correspond with the HDFS block boundaries.
lSplits honor logical record boundaries
nThe first split contains line 5, even though it spans the first and second block.
nThe second split starts at line 6.
lData-local maps will perform some remote reads.
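The way splits honor line boundaries can be sketched as follows. This is a simplified model, not Hadoop's record reader: the file is an in-memory string, offsets are character positions, and the rule assumed here is that a line belongs to the split in which its first character falls. A split that does not start at offset 0 skips the partial line it lands in, and every split reads past its end to finish its last line.

```java
import java.util.ArrayList;
import java.util.List;

class SplitLines {
    /** Returns the lines owned by the split [start, end) of the given data. */
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Back up one character and skip to just after the next newline, so a
            // line that began in the previous split is left to that split.
            pos = start - 1;
            while (pos < data.length() && data.charAt(pos) != '\n') pos++;
            pos++;
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length()) {
            int newline = data.indexOf('\n', pos);
            int lineEnd = (newline == -1) ? data.length() : newline;
            lines.add(data.substring(pos, lineEnd)); // may extend beyond 'end'
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "aaa\nbbbbb\ncc\n"; // 13 characters, three lines
        System.out.println(readSplit(data, 0, 6));  // split boundary falls inside "bbbbb"
        System.out.println(readSplit(data, 6, 13));
    }
}
```

The first split reads the whole of "bbbbb" even though part of it lies beyond the split boundary. In HDFS that tail may live in a block on another node, which is why data-local maps still perform some remote reads.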
lIt is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character.
lYou can specify the separator via the key.value.separator.in.input.line property.
lIf you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use.
lLike TextInputFormat, the keys are the byte offsets within the file and the values are the lines themselves.
lN refers to the number of lines of input that each mapper receives.
lHadoop’s sequence file format stores sequences of binary key-value pairs.
lTo use data from sequence files as the input to MapReduce, you use SequenceFileInputFormat.
lThe keys and values are determined by the sequence file, and you need to make sure that your map input types correspond.
lFor example, if your sequence file has IntWritable keys and Text values, then the map signature would be Mapper<IntWritable, Text, K, V>.
lSequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects.
lSequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file’s keys and values as opaque binary objects.
lThey are encapsulated as BytesWritable objects
lWriting a SequenceFile
nTo create a SequenceFile, use one of its createWriter() static methods, which returns a SequenceFile.Writer instance.
nSpecify a stream to write to (either an FSDataOutputStream or a FileSystem and Path pairing), a Configuration object, and the key and value types.
nOnce you have a SequenceFile.Writer, you then write key-value pairs, using the append() method.
nThen, when you’ve finished, you call the close() method.
lReading a SequenceFile
nReading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader, and iterating over records by repeatedly invoking one of the next() methods.
lThe SequenceFile Format
nA sequence file consists of a header followed by one or more records.
nThe first three bytes of a sequence file are the bytes SEQ, which act as a magic number, followed by a single byte representing the version number.
nThe header contains other fields including the names of the key and value classes, compression details, user-defined metadata, and the sync marker.
nThe sync marker is used to allow a reader to synchronize to a record boundary from any position in the file.
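As a small illustration of the first four header bytes, the sketch below checks for the SEQ magic number and extracts the version byte. It only inspects a byte array and does not parse the rest of the header; the class name and the version value used in the example are hypothetical.

```java
class SeqHeader {
    /** Returns the version byte if the data starts with the SEQ magic, or -1 otherwise. */
    static int version(byte[] header) {
        if (header.length < 4) return -1;
        if (header[0] != 'S' || header[1] != 'E' || header[2] != 'Q') return -1;
        return header[3] & 0xFF; // treat the version byte as unsigned
    }

    public static void main(String[] args) {
        byte[] sample = {'S', 'E', 'Q', 6}; // version 6 is an arbitrary example value
        System.out.println("version: " + version(sample));
        System.out.println("not a sequence file: " + version(new byte[] {'P', 'K', 3, 4}));
    }
}
```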
lThe MultipleInputs class allows you to specify the InputFormat and Mapper to use on a per-path basis.
lThe default output format, TextOutputFormat, writes records as lines of text.
lIts keys and values may be of any type, since TextOutputFormat turns them to strings by calling toString() on them.
lEach key-value pair is separated by a tab character, although that may be changed using the mapred.textoutputformat.separator property.
lSequenceFileOutputFormat
lSequenceFileAsBinaryOutputFormat
lMapFileOutputFormat
lYou create an instance of MapFile.Writer, then call the append() method to add entries in order.
lKeys must be instances of WritableComparable, and values must be Writable
lIf we look at the MapFile, we see it’s actually a directory containing two files called data and index:
lBoth files are SequenceFiles. The data file contains all of the entries, in order:
lThe index file contains a fraction of the keys, and contains a mapping from the key to that key’s offset in the data file:
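The data-plus-sparse-index layout can be modelled in memory to show how a lookup works. This is a sketch, not the MapFile implementation: entries live in a list rather than a file, "offsets" are list positions rather than byte offsets, and the indexing interval and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SparseIndexLookup {
    final List<String[]> data = new ArrayList<>();          // sorted (key, value) entries: the "data file"
    final TreeMap<String, Integer> index = new TreeMap<>(); // every Nth key -> position: the "index file"
    final int interval;

    SparseIndexLookup(int interval) { this.interval = interval; }

    /** Entries must be appended in key order. */
    void append(String key, String value) {
        if (!data.isEmpty() && data.get(data.size() - 1)[0].compareTo(key) > 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (data.size() % interval == 0) index.put(key, data.size());
        data.add(new String[] {key, value});
    }

    /** Finds the closest indexed key at or before the target, then scans forward in the data. */
    String get(String key) {
        Map.Entry<String, Integer> start = index.floorEntry(key);
        if (start == null) return null; // target sorts before the first key
        for (int i = start.getValue(); i < data.size(); i++) {
            int cmp = data.get(i)[0].compareTo(key);
            if (cmp == 0) return data.get(i)[1];
            if (cmp > 0) break; // passed the place where the key would be
        }
        return null;
    }

    public static void main(String[] args) {
        SparseIndexLookup file = new SparseIndexLookup(3);
        for (int i = 0; i < 10; i++) file.append(String.format("k%02d", i), "v" + i);
        System.out.println("index holds " + file.index.size() + " of 10 keys");
        System.out.println("k07 -> " + file.get("k07"));
    }
}
```

Keeping only a fraction of the keys in the index keeps it small enough to hold in memory, at the cost of a short sequential scan on each lookup.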
lTo read a MapFile, you create a MapFile.Reader, then call the next() method until it returns false.
lMultipleOutputFormat allows you to write data to multiple files whose names are derived from the output keys and values.
nconf.setOutputFormat(StationNameMultipleTextOutputFormat.class);
lMultipleOutputs can emit different types for each output.
lAn instance of the Configuration class (found in the org.apache.hadoop.conf package) represents a collection of configuration properties and their values.
lConfigurations read their properties from resources—XML files
lWe can access its properties using a piece of code like this:
lWhen developing Hadoop applications, it is common to switch between running the application locally and running it on a cluster.
lhadoop-local.xml
lhadoop-localhost.xml
lhadoop-cluster.xml
lWith this setup, it is easy to use any configuration with the -conf command-line switch.
lFor example, the following command shows a directory listing on the HDFS server running in pseudo-distributed mode on localhost:
lThere are four independent entities:
nThe client, which submits the MapReduce job.
nThe jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
nThe tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
nThe distributed filesystem, which is used for sharing job files between the other entities.
lThe runJob() method on JobClient creates a new JobClient instance and calls submitJob() on it.
lHaving submitted the job, runJob() polls the job’s progress once a second, and reports the progress to the console if it has changed since the last report.
lWhen the job is complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
lAsks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker)
lChecks the output specification of the job.
lComputes the input splits for the job.
lCopies the resources needed to run the job, including the job JAR file, the configuration file and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID.
lTells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker)
lWhen the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from where the job scheduler will pick it up and initialize it.
lInitialization involves creating an object to represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks’ status and progress.
lTo create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem.
lIt then creates one map task for each split.
lTasks are given IDs at this point.
lTasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.
lAs a part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value
lBefore it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from, according to priority (see setJobPriority(); the default scheduler is FIFO).
lTasktrackers have a fixed number of slots for map tasks and for reduce tasks.
lThe default scheduler fills empty map task slots before reduce task slots
lTo choose a reduce task the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks, since there are no data locality considerations.
lNow that the tasktracker has been assigned a task, the next step is for it to run the task.
lFirst, it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s filesystem.
lIt also copies any files needed from the distributed cache by the application to the local disk
lSecond, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory.
lThird, it creates an instance of TaskRunner to run the task.
lTaskRunner launches a new Java Virtual Machine to run each task in.
lIt is, however, possible to reuse the JVM between tasks.
lThe child process communicates with its parent through the umbilical interface.
lWhen the jobtracker receives a notification that the last task for a job is complete, it changes the status for the job to “successful.”
lThen, when the JobClient polls for status, it learns that the job has completed successfully, so it prints a message to tell the user, and then returns from the runJob() method.
lThe most common way a task fails is when user code in the map or reduce task throws a runtime exception.
nThe child JVM reports the error back to its parent tasktracker before it exits.
nThe error ultimately makes it into the user logs.
nThe tasktracker marks the task attempt as failed, freeing up a slot to run another task.
lAnother failure mode is the sudden exit of the child JVM
nThe tasktracker notices that the process has exited, and marks the attempt as failed.
lHanging tasks are dealt with differently.
nThe tasktracker notices that it hasn’t received a progress update for a while, and proceeds to mark the task as failed.
nThe child JVM process will be automatically killed after this period.
lWhen the jobtracker is notified of a task attempt that has failed (by the tasktracker’s heartbeat call) it will reschedule execution of the task.
nThe jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed.
nIf a task fails more than four times, it will not be retried further.
lIf a tasktracker fails by crashing, or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently).
lThe jobtracker will notice a tasktracker that has stopped sending heartbeats and remove it from its pool of tasktrackers to schedule tasks on.
lThe jobtracker arranges for map tasks that were run and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed tasktracker’s local filesystem may not be accessible to the reduce task. Any tasks in progress are also rescheduled.
lWhen the map function starts producing output, it is not simply written to disk.
lEach map task has a circular memory buffer that it writes the output to.
lWhen the contents of the buffer reach a certain threshold size, a background thread will start to spill the contents to disk.
lSpills are written in round-robin fashion to the directories specified by the mapred.local.dir property
lBefore it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.
lWithin each partition, the background thread performs an in-memory sort by key.
lEach time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files.
lBefore the task is finished, the spill files are merged into a single partitioned and sorted output file.
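The map-side buffering, spilling, and final merge can be sketched with in-memory lists. Several simplifications are assumed: the threshold counts records rather than bytes, spilling happens synchronously instead of in a background thread, only keys are tracked, partitions are chosen by a hash of the key, and the final merge is done by concatenating and sorting rather than streaming. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class MapSideSpill {
    final int numPartitions;
    final int spillThreshold;                                   // records buffered before a spill
    final List<String> buffer = new ArrayList<>();
    final List<List<List<String>>> spills = new ArrayList<>();  // spill -> partition -> sorted keys

    MapSideSpill(int numPartitions, int spillThreshold) {
        this.numPartitions = numPartitions;
        this.spillThreshold = spillThreshold;
    }

    /** Buffers one map output key and spills when the threshold is reached. */
    void collect(String key) {
        buffer.add(key);
        if (buffer.size() >= spillThreshold) spill();
    }

    /** Divides buffered keys into partitions, sorts each partition, and records a spill. */
    void spill() {
        if (buffer.isEmpty()) return;
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) partitions.add(new ArrayList<>());
        for (String key : buffer) {
            partitions.get((key.hashCode() & Integer.MAX_VALUE) % numPartitions).add(key);
        }
        for (List<String> partition : partitions) Collections.sort(partition);
        spills.add(partitions);
        buffer.clear();
    }

    /** Flushes the buffer, then merges all spills into one sorted list per partition. */
    List<List<String>> finish() {
        spill();
        List<List<String>> merged = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<String> out = new ArrayList<>();
            for (List<List<String>> s : spills) out.addAll(s.get(p));
            Collections.sort(out); // stands in for a streaming merge of the sorted runs
            merged.add(out);
        }
        return merged;
    }

    public static void main(String[] args) {
        MapSideSpill task = new MapSideSpill(2, 3);
        for (String key : new String[] {"d", "a", "c", "b", "e", "a", "f"}) task.collect(key);
        System.out.println(task.finish() + " from " + task.spills.size() + " spills");
    }
}
```

Seven records with a threshold of three produce three spills, and the result is a single output with one sorted run per partition, ready to be served to the reducers.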
lThe output file’s partitions are made available to the reducers over HTTP.
lThe number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property.
lAs map tasks complete successfully, they notify their parent tasktracker of the status update, which in turn notifies the jobtracker.
lFor a given job, the jobtracker knows the mapping between map outputs and tasktrackers.
lA thread in the reducer periodically asks the jobtracker for map output locations until it has retrieved them all.
lThe reduce task needs the map output for its particular partition from several map tasks across the cluster.
lThe map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task.
lThe reduce task has a small number of copier threads so that it can fetch map outputs in parallel.
lAs the copies accumulate on disk, a background thread merges them into larger, sorted files.
lWhen all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering.
lDuring the reduce phase the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS.
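Because each map output arrives already sorted, the reduce side only needs to merge sorted runs. A minimal sketch of such a k-way merge with a priority queue follows. It works on in-memory lists of keys, whereas the real merge operates on files in multiple rounds; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class SortedRunMerge {
    /** Merges already-sorted runs into one sorted list without re-sorting them. */
    static List<String> merge(List<List<String>> runs) {
        // Each heap entry is {run index, position in that run}, ordered by the key it points at.
        Comparator<int[]> byKey =
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1]));
        PriorityQueue<int[]> heap = new PriorityQueue<>(byKey);
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) heap.add(new int[] {r, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the same run, if it has more keys.
            if (head[1] + 1 < run.size()) heap.add(new int[] {head[0], head[1] + 1});
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> mapOutputs = Arrays.asList(
                Arrays.asList("a", "d"), Arrays.asList("b", "c", "e"), new ArrayList<String>());
        System.out.println(merge(mapOutputs));
    }
}
```

The merged stream presents equal keys next to each other, which is what allows the reduce function to be invoked once per key.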
Original article: http://www.cnblogs.com/forfuture1978/archive/2010/02/27/1674955.html. Thanks to the original author for sharing!