GarfieldEr007

Yahoo! Hadoop Module 2: The Hadoop Distributed File System

Module 2: The Hadoop Distributed File System

Previous module | Table of contents | Next module

Introduction

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it.

Goals for this Module:

Understand the basic design of HDFS and how it relates to basic distributed file system concepts
Learn how to set up and use HDFS from the command line
Learn how to use HDFS in your applications

Outline

Introduction
Goals for this Module
Outline
Distributed File System Basics
Configuring HDFS
Interacting With HDFS
1. Common Example Operations
2. HDFS Command Reference
3. DFSAdmin Command Reference
Using HDFS in MapReduce
Using HDFS Programmatically
HDFS Permissions and Security
Additional HDFS Tasks
1. Rebalancing Blocks
2. Copying Large Sets of Files
3. Decommissioning Nodes
4. Verifying File System Health
5. Rack Awareness
HDFS Web Interface
References

Distributed File System Basics

A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network. There are a number of distributed file systems that solve this problem in different ways.

NFS, the Network File System, is the most ubiquitous distributed file system. It is one of the oldest still in use. While its design is straightforward, it is also very constrained. NFS provides remote access to a single logical volume stored on a single machine. An NFS server makes a portion of its local file system visible to external clients. The clients can then mount this remote file system directly into their own Linux file system, and interact with it as though it were part of the local drive.

One of the primary advantages of this model is its transparency. Clients do not need to be particularly aware that they are working on files stored remotely. The existing standard library methods like open(), close(),fread(), etc. will work on files hosted over NFS.

But as a distributed file system, it is limited in its power. The files in an NFS volume all reside on a single machine. This means that it will only store as much information as can be stored in one machine, and does not provide any reliability guarantees if that machine goes down (e.g., by replicating the files to other servers). Finally, as all the data is stored on a single machine, all the clients must go to this machine to retrieve their data. This can overload the server if a large number of clients must be handled. Clients must also always copy the data to their local machines before they can operate on it.

HDFS is designed to be robust to a number of the problems that other DFS's such as NFS are vulnerable to. In particular:

HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.
HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.

But while HDFS is very scalable, its high performance design also restricts it to a particular class of applications; it is not as general-purpose as NFS. There are a large number of additional decisions and trade-offs that were made with HDFS. In particular:

Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files after they have already been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled to be included in Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data. The overhead of caching is great enough that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time (e.g., if a rack fails all together). While performance may degrade proportional to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost. Data replication strategies combat this problem.

The design of HDFS is based on the design of GFS, the Google File System. Its design was described in a paper published by Google.

HDFS is a block-structured file system: individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are referred to as DataNodes. A file can be made of several blocks, and they are not necessarily stored on the same machine; the target machines which hold each block are chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation of multiple machines, but supports file sizes far larger than a single-machine DFS; individual files can require more space than a single hard drive could hold.

If several machines must be involved in the serving of a file, then a file could be rendered unavailable by the loss of any one of those machines. HDFS combats this problem by replicating each block across a number of machines (3, by default).

Figure 2.1: DataNodes holding blocks of multiple files with a replication factor of 2. The NameNode maps the filenames onto the block ids.

Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64MB -- orders of magnitude larger. This allows HDFS to decrease the amount of metadata storage required per file (the list of blocks per file will be smaller as the size of individual blocks increases). Furthermore, it allows for fast streaming reads of data, by keeping large amounts of data sequentially laid out on the disk. The consequence of this decision is that HDFS expects to have very large files, and expects them to be read sequentially. Unlike a file system such as NTFS or EXT, which see many very small files, HDFS expects to store a modest number of very large files: hundreds of megabytes, or gigabytes each. After all, a 100 MB file is not even two full blocks. Files on your computer may also frequently be accessed "randomly," with applications cherry-picking small amounts of information from several different locations in a file which are not sequentially laid out. By contrast, HDFS expects to read a block start-to-finish for a program. This makes it particularly useful to the MapReduce style of programming described in Module 4. That having been said, attempting to use HDFS as a general-purpose distributed file system for a diverse set of applications will be suboptimal.

Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system. Typing ls on a machine running a DataNode daemon will display the contents of the ordinary Linux file system being used to host the Hadoop services -- but it will not include any of the files stored inside the HDFS. This is because HDFS runs in a separate namespace, isolated from the contents of your local files. The files inside HDFS (or more accurately: the blocks that make them up) are stored in a particular directory managed by the DataNode service, but the files will named only with block ids. You cannot interact with HDFS-stored files using ordinary Linux file modification tools (e.g., ls, cp, mv, etc). However, HDFS does come with its own utilities for file management, which act very similar to these familiar tools. A later section in this tutorial will introduce you to these commands and their operation.

It is important for this file system to store its metadata reliably. Furthermore, while the file data is accessed in a write once and read many model, the metadata structures (e.g., the names of files and directories) can be modified by a large number of clients concurrently. It is important that this information is never desynchronized. Therefore, it is all handled by a single machine, called the NameNode. The NameNode stores all the metadata for the file system. Because of the relatively low amount of metadata per file (it only tracks file names, permissions, and the locations of each block of each file), all of this information can be stored in the main memory of the NameNode machine, allowing fast access to the metadata.

To open a file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file. These locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel. The NameNode is not directly involved in this bulk data transfer, keeping its overhead to a minimum.

Of course, NameNode information must be preserved even if the NameNode machine fails; there are multiple redundant systems that allow the NameNode to preserve the file system's metadata even if the NameNode itself crashes irrecoverably. NameNode failure is more severe for the cluster than DataNode failure. While individual DataNodes may crash and the entire cluster will continue to operate, the loss of the NameNode will render the cluster inaccessible until it is manually restored. Fortunately, as the NameNode's involvement is relatively minimal, the odds of it failing are considerably lower than the odds of an arbitrary DataNode failing at any given point in time.

A more thorough overview of the architectural decisions involved in the design and implementation of HDFS is given in the official Hadoop HDFS documentation. Before continuing in this tutorial, it is advisable that you read and understand the information presented there.

Configuring HDFS

The HDFS for your cluster can be configured in a very short amount of time. First we will fill out the relevant sections of the Hadoop configuration file, then format the NameNode.

CLUSTER CONFIGURATION

These instructions for cluster configuration assume that you have already downloaded and unzipped a copy of Hadoop. Module 3 discusses getting started with Hadoop for this tutorial. Module 7 discusses how to set up a larger cluster and provides preliminary setup instructions for Hadoop, including downloading prerequisite software.

The HDFS configuration is located in a set of XML files in the Hadoop configuration directory; conf/ under the main Hadoop install directory (where you unzipped Hadoop to). The conf/hadoop-defaults.xml file contains default values for every parameter in Hadoop. This file is considered read-only. You override this configuration by setting new values in conf/hadoop-site.xml. This file should be replicated consistently across all machines in the cluster. (It is also possible, though not advisable, to host it on NFS.)

Configuration settings are a set of key-value pairs of the format:

  
    property-name
    property-value

Adding the line true inside the property body will prevent properties from being overridden by user applications. This is useful for most system-wide configuration options.

The following settings are necessary to configure HDFS:

key	value	example
fs.default.name	protocol://servername:port	hdfs://alpha.milkman.org:9000
dfs.data.dir	pathname	/home/username/hdfs/data
dfs.name.dir	pathname	/home/username/hdfs/name

These settings are described individually below:

fs.default.name - This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster. Each node in the system on which Hadoop is expected to operate needs to know the address of the NameNode. The DataNode instances will register with this NameNode, and make their data available through it. Individual client programs will connect to this address to retrieve the locations of actual file blocks.

dfs.data.dir - This is the path on the local file system in which the DataNode instance should store its data. It is not necessary that all DataNode instances store their data under the same local path prefix, as they will all be on separate machines; it is acceptable that these machines are heterogeneous. However, it will simplify configuration if this directory is standardized throughout the system. By default, Hadoop will place this under/tmp. This is fine for testing purposes, but is an easy way to lose actual data in a production system, and thus must be overridden.

dfs.name.dir - This is the path on the local file system of the NameNode instance where the NameNode metadata is stored. It is only used by the NameNode instance to find its information, and does not exist on the DataNodes. The caveat above about /tmp applies to this as well; this setting must be overridden in a production system.

Another configuration parameter, not listed above, is dfs.replication. This is the default replication factor for each block of data in the file system. For a production cluster, this should usually be left at its default value of 3. (You are free to increase your replication factor, though this may be unnecessary and use more space than is required. Fewer than three replicas impact the high availability of information, and possibly the reliability of its storage.)

The following information can be pasted into the hadoop-site.xml file for a single-node configuration:


  
    fs.default.name

    hdfs://your.server.name.com:9000
  
  
    dfs.data.dir

    /home/username/hdfs/data
  
  
    dfs.name.dir

    /home/username/hdfs/name

Of course, your.server.name.com needs to be changed, as does username. Using port 9000 for the NameNode is arbitrary.

After copying this information into your conf/hadoop-site.xml file, copy this to the conf/ directories on all machines in the cluster.

The master node needs to know the addresses of all the machines to use as DataNodes; the startup scripts depend on this. Also in the conf/ directory, edit the file slaves so that it contains a list of fully-qualified hostnames for the slave instances, one host per line. On a multi-node setup, the master node (e.g.,localhost) is not usually present in this file.

Then make the directories necessary:

  user@EachMachine$ mkdir -p $HOME/hdfs/data

  user@namenode$ mkdir -p $HOME/hdfs/name

The user who owns the Hadoop instances will need to have read and write access to each of these directories. It is not necessary for all users to have access to these directories. Set permissions with chmodas appropriate. In a large-scale environment, it is recommended that you create a user named "hadoop" on each node for the express purpose of owning and running Hadoop tasks. For a single individual's machine, it is perfectly acceptable to run Hadoop under your own username. It is not recommended that you run Hadoop as root.

STARTING HDFS

Now we must format the file system that we just configured:

  user@namenode:hadoop$ bin/hadoop namenode -format

This process should only be performed once. When it is complete, we are free to start the distributed file system:

  user@namenode:hadoop$ bin/start-dfs.sh

This command will start the NameNode server on the master machine (which is where the start-dfs.shscript was invoked). It will also start the DataNode instances on each of the slave machines. In a single-machine "cluster," this is the same machine as the NameNode instance. On a real cluster of two or more machines, this script will ssh into each slave machine and start a DataNode instance.

Interacting With HDFS

This section will familiarize you with the commands necessary to interact with HDFS, loading and retrieving data, as well as manipulating files. This section makes extensive use of the command-line.

The bulk of commands that communicate with the cluster are performed by a monolithic script namedbin/hadoop. This will load the Hadoop system with the Java virtual machine and execute a user command. The commands are specified in the following form:

  user@machine:hadoop$ bin/hadoop moduleName -cmd args...

The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.

Two such modules are relevant to HDFS: dfs and dfsadmin. Their use is described in the sections below.

COMMON EXAMPLE OPERATIONS

The dfs module, also known as "FsShell," provides basic file manipulation operations. Their usage is introduced here.

A cluster is only useful if it contains data of interest. Therefore, the first operation to perform is loading information into the cluster. For purposes of this example, we will assume an example user named "someone" -- but substitute your own username where it makes sense. Also note that any operation on files in HDFS can be performed from any node with access to the cluster, whose conf/hadoop-site.xml is configured to setfs.default.name to your cluster's NameNode. We will call the fictional machine on which we are operatinganynode. Commands are being run from the "hadoop" directory where you installed Hadoop. This may be/home/someone/src/hadoop on your machine, or /home/foo/hadoop on someone else's. These initial commands are centered around loading information into HDFS, checking that it's there, and getting information back out of HDFS.

Listing files

If we attempt to inspect HDFS, we will not find anything interesting there:

  someone@anynode:hadoop$ bin/hadoop dfs -ls
  someone@anynode:hadoop$

The "-ls" command returns silently. Without any arguments, -ls will attempt to show the contents of your "home" directory inside HDFS. Don't forget, this is not the same as /home/$USER (e.g., /home/someone) on the host machine (HDFS keeps a separate namespace from the local files). There is no concept of a "current working directory" or cd command in HDFS.

If you provide -ls with an argument, you may see some initial directory contents:

  someone@anynode:hadoop$ bin/hadoop dfs -ls /
  Found 2 items
  drwxr-xr-x   - hadoop supergroup          0 2008-09-20 19:40 /hadoop
  drwxr-xr-x   - hadoop supergroup          0 2008-09-20 20:08 /tmp

These entries are created by the system. This example output assumes that "hadoop" is the username under which the Hadoop daemons (NameNode, DataNode, etc) were started. "supergroup" is a special group whose membership includes the username under which the HDFS instances were started (e.g., "hadoop"). These directories exist to allow the Hadoop MapReduce system to move necessary data to the different job nodes; this is explained in more detail in Module 4.

So we need to create our home directory, and then populate it with some files.

Inserting data into the cluster

Whereas a typical UNIX or Linux system stores individual users' files in /home/$USER, the Hadoop DFS stores these in /user/$USER. For some commands like ls, if a directory name is required and is left blank, this is the default directory name assumed. (Other commands require explicit source and destination paths.) Any relative paths used as arguments to HDFS, Hadoop MapReduce, or other components of the system are assumed to be relative to this base directory.

Step 1: Create your home directory if it does not already exist.

  someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user

If there is no /user directory, create that first. It will be automatically created later if necessary, but for instructive purposes, it makes sense to create it manually ourselves this time.

Then we are free to add our own home directory:

  someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone

Of course, replace /user/someone with /user/yourUserName.

Step 2: Upload a file. To insert a single file into HDFS, we can use the put command like so:

  someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

This copies /home/someone/interestingFile.txt from the local file system into/user/yourUserName/interestingFile.txt on HDFS.

Step 3: Verify the file is in HDFS. We can verify that the operation worked with either of the two following (equivalent) commands:

  someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourUserName
  someone@anynode:hadoop$ bin/hadoop dfs -ls

You should see a listing that starts with Found 1 items and then includes information about the file you inserted.

The following table demonstrates example uses of the put command, and their effects:

Command:	Assuming:	Outcome:
`bin/hadoop dfs -put foo bar`	No file/directory named`/user/$USER/bar` exists in HDFS	Uploads local file `foo` to a file named`/user/$USER/bar`
`bin/hadoop dfs -put foo bar`	`/user/$USER/bar` is a directory	Uploads local file `foo` to a file named`/user/$USER/bar/foo`
`bin/hadoop dfs -put foo somedir/somefile`	`/user/$USER/somedir` does not exist in HDFS	Uploads local file `foo` to a file named`/user/$USER/somedir/somefile`, creating the missing directory
`bin/hadoop dfs -put foo bar`	`/user/$USER/bar` is already a file in HDFS	No change in HDFS, and an error is returned to the user.

When the put command operates on a file, it is all-or-nothing. Uploading a file into HDFS first copies the data onto the DataNodes. When they all acknowledge that they have received all the data and the file handle is closed, it is then made visible to the rest of the system. Thus based on the return value of the put command, you can be confident that a file has either been successfully uploaded, or has "fully failed;" you will never get into a state where a file is partially uploaded and the partial contents are visible externally, but the upload disconnected and did not complete the entire file contents. In a case like this, it will be as though no upload took place.

Step 4: Uploading multiple files at once. The put command is more powerful than moving a single file at a time. It can also be used to upload entire directory trees into HDFS.

Create a local directory and put some files into it using the cp command. Our example user may have a situation like the following:

  someone@anynode:hadoop$ ls -R myfiles
  myfiles:
  file1.txt  file2.txt  subdir/

  myfiles/subdir:
  anotherFile.txt
  someone@anynode:hadoop$

This entire myfiles/ directory can be copied into HDFS like so:

  someone@anynode:hadoop$ bin/hadoop -put myfiles /user/myUsername
  someone@anynode:hadoop$ bin/hadoop -ls
  Found 1 items
  /user/someone/myfiles       2008-06-12 20:59    rwxr-xr-x    someone    supergroup
  user@anynode:hadoop bin/hadoop -ls myfiles
  Found 3 items
  /user/someone/myfiles/file1.txt      186731  2008-06-12 20:59        rw-r--r--       someone   supergroup
  /user/someone/myfiles/file2.txt      168     2008-06-12 20:59        rw-r--r--       someone   supergroup
  /user/someone/myfiles/subdir                 2008-06-12 20:59        rwxr-xr-x       someone   supergroup

Thus demonstrating that the tree was correctly uploaded recursively. You'll note that in addition to the file path, ls also reports the number of replicas of each file that exist (the "1" in ), the file size, upload time, permissions, and owner information.

Another synonym for -put is -copyFromLocal. The syntax and functionality are identical.

Retrieving data from HDFS

There are multiple ways to retrieve files from the distributed file system. One of the easiest is to use cat to display the contents of a file on stdout. (It can, of course, also be used to pipe the data into other applications or destinations.)

Step 1: Display data with cat.

If you have not already done so, upload some files into HDFS. In this example, we assume that a file named "foo" has been loaded into your home directory on HDFS.

  someone@anynode:hadoop$ bin/hadoop dfs -cat foo
  (contents of foo are displayed here)
  someone@anynode:hadoop$

Step 2: Copy a file from HDFS to the local file system.

The get command is the inverse operation of put; it will copy a file or directory (recursively) from HDFS into the target of your choosing on the local file system. A synonymous operation is called -copyToLocal.

  someone@anynode:hadoop$ bin/hadoop dfs -get foo localFoo
  someone@anynode:hadoop$ ls
  localFoo
  someone@anynode:hadoop$ cat localFoo
  (contents of foo are displayed here)

Like the put command, get will operate on directories in addition to individual files.

Shutting Down HDFS

If you want to shut down the HDFS functionality of your cluster (either because you do not want Hadoop occupying memory resources when it is not in use, or because you want to restart the cluster for upgrading, configuration changes, etc.), then this can be accomplished by logging in to the NameNode machine and running:

  someone@namenode:hadoop$ bin/stop-dfs.sh

This command must be performed by the same user who started HDFS with bin/start-dfs.sh.

HDFS COMMAND REFERENCE

There are many more commands in bin/hadoop dfs than were demonstrated here, although these basic operations will get you started. Running bin/hadoop dfs with no additional arguments will list all commands which can be run with the FsShell system. Furthermore, bin/hadoop dfs -help commandName will display a short usage summary for the operation in question, if you are stuck.

A table of all operations is reproduced below. The following conventions are used for parameters:

italics denote variables to be filled out by the user.
"path" means any file or directory name.
"path..." means one or more file or directory names.
"file" means any filename.
"src" and "dest" are path names in a directed operation.
"localSrc" and "localDest" are paths as above, but on the local file system. All other file and path names refer to objects inside HDFS.
Parameters in [brackets] are optional.

Command	Operation
-ls path	Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr path	Behaves like `-ls`, but recursively displays entries in all subdirectories of path.
-du path	Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-dus path	Like `-du`, but prints a summary of disk usage of all files/directories in the path.
-mv src dest	Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest	Copies the file or directory identified by src to dest, within HDFS.
-rm path	Removes the file or empty directory identified by path.
-rmr path	Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrcdest	Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
-copyFromLocallocalSrc dest	Identical to `-put`
-moveFromLocallocalSrc dest	Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] srclocalDest	Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge srclocalDest[addnl]	Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat filename	Displays the contents of filename on stdout.
-copyToLocal [-crc] srclocalDest	Identical to `-get`
-moveToLocal [-crc] srclocalDest	Works like `-get`, but deletes the HDFS copy on success.
-mkdir path	Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like `mkdir -p` in Linux).
-setrep [-R] [-w]rep path	Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)
-touchz path	Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] path	Returns 1 if path exists; has zero length; or is a directory, or 0 otherwise.
-stat [format]path	Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file	Shows the lats 1KB of file on stdout.
-chmod [-R]mode,mode,...path...	Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with `-R`. mode is a 3-digit octal mode, or `{augo}+/-{rwxX}`. Assumes `a` if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path...	Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if `-R` is specified.
-chgrp [-R]group path...	Sets the owning group for files or directories identified by path.... Sets group recursively if`-R` is specified.
-help cmd	Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd

DFSADMIN COMMAND REFERENCE

While the dfs module for bin/hadoop provides common file and directory manipulation commands, they all work with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole. The operation of the commands in this module is described in this section.

Getting overall status: A brief status report for HDFS can be retrieved with bin/hadoop dfsadmin -report. This returns basic information about the overall health of the HDFS cluster, as well as some per-server metrics.

More involved status: If you need to know more details about what the state of the NameNode's metadata is, the command bin/hadoop dfsadmin -metasave filename will record this information in filename. The metasave command will enumerate lists of blocks which are under-replicated, in the process of being replicated, and scheduled for deletion. NB: The help for this command states that it "saves NameNode's primary data structures," but this is a misnomer; the NameNode's state cannot be restored from this information. However, it will provide good information about how the NameNode is managing HDFS's blocks.

Safemode: Safemode is an HDFS state in which the file system is mounted read-only; no replication is performed, nor can files be created or deleted. This is automatically entered as the NameNode starts, to allow all DataNodes time to check in with the NameNode and announce which blocks they hold, before the NameNode determines which blocks are under-replicated, etc. The NameNode waits until a specific percentage of the blocks are present and accounted-for; this is controlled in the configuration by thedfs.safemode.threshold.pct parameter. After this threshold is met, safemode is automatically exited, and HDFS allows normal operations. The bin/hadoop dfsadmin -safemode what command allows the user to manipulate safemode based on the value of what, described below:

enter - Enters safemode
leave - Forces the NameNode to exit safemode
get - Returns a string indicating whether safemode is ON or OFF
wait - Waits until safemode has exited and returns

Changing HDFS membership - When decommissioning nodes, it is important to disconnect nodes from HDFS gradually to ensure that data is not lost. See the section on decommissioning later in this document for an explanation of the use of the -refreshNodes dfsadmin command.

Upgrading HDFS versions - When upgrading from one version of Hadoop to the next, the file formats used by the NameNode and DataNodes may change. When you first start the new version of Hadoop on the cluster, you need to tell Hadoop to change the HDFS version (or else it will not mount), using the command:bin/start-dfs.sh -upgrade. It will then begin upgrading the HDFS version. The status of an ongoing upgrade operation can be queried with the bin/hadoop dfsadmin -upgradeProgress status command. More verbose information can be retrieved with bin/hadoop dfsadmin -upgradeProgress details. If the upgrade is blocked and you would like to force it to continue, use the command: bin/hadoop dfsadmin -upgradeProgress force. (Note: be sure you know what you are doing if you use this last command.)

When HDFS is upgraded, Hadoop retains backup information allowing you to downgrade to the original HDFS version in case you need to revert Hadoop versions. To back out the changes, stop the cluster, re-install the older version of Hadoop, and then use the command: bin/start-dfs.sh -rollback. It will restore the previous HDFS state.

Only one such archival copy can be kept at a time. Thus, after a few days of operation with the new version (when it is deemed stable), the archival copy can be removed with the command bin/hadoop dfsadmin -finalizeUpgrade. The rollback command cannot be issued after this point. This must be performed before a second Hadoop upgrade is allowed.

Getting help - As with the dfs module, typing bin/hadoop dfsadmin -help cmd will provide more usage information about the particular command.

Using HDFS in MapReduce

The HDFS is a powerful companion to Hadoop MapReduce. By setting the fs.default.name configuration option to point to the NameNode (as was done above), Hadoop MapReduce jobs will automatically draw their input files from HDFS. Using the regular FileInputFormat subclasses, Hadoop will automatically draw its input data sources from file paths within HDFS, and will distribute the work over the cluster in an intelligent fashion to exploit block locality where possible. The mechanics of Hadoop MapReduce are discussed in much greater detail in Module 4.

Using HDFS Programmatically

While HDFS can be manipulated explicitly through user commands, or implicitly as the input to or output from a Hadoop MapReduce job, you can also work with HDFS inside your own Java applications. (A JNI-based wrapper, libhdfs also provides this functionality in C/C++ programs.)

This section provides a short tutorial on using the Java-based HDFS API. It will be based on the following code listing:

1:  import java.io.File;
2:  import java.io.IOException;
3:
4:  import org.apache.hadoop.conf.Configuration;
5:  import org.apache.hadoop.fs.FileSystem;
6:  import org.apache.hadoop.fs.FSDataInputStream;
7:  import org.apache.hadoop.fs.FSDataOutputStream;
8:  import org.apache.hadoop.fs.Path;
9:
10: public class HDFSHelloWorld {
11:
12:   public static final String theFilename = "hello.txt";
13:   public static final String message = "Hello, world!\n";
14:
15:   public static void main (String [] args) throws IOException {
16:
17:     Configuration conf = new Configuration();
18:     FileSystem fs = FileSystem.get(conf);
19:
20:     Path filenamePath = new Path(theFilename);
21:
22:     try {
23:       if (fs.exists(filenamePath)) {
24:         // remove the file first
25:         fs.delete(filenamePath);
26:       }
27:
28:       FSDataOutputStream out = fs.create(filenamePath);
29:       out.writeUTF(message;
30:       out.close();
31:
32:       FSDataInputStream in = fs.open(filenamePath);
33:       String messageIn = in.readUTF();
34:       System.out.print(messageIn);
35:       in.close();
46:     } catch (IOException ioe) {
47:       System.err.println("IOException during operation: " + ioe.toString());
48:       System.exit(1);
49:     }
40:   }
41: }

This program creates a file named hello.txt, writes a short message into it, then reads it back and prints it to the screen. If the file already existed, it is deleted first.

First we get a handle to an abstract FileSystem object, as specified by the application configuration. TheConfiguration object created uses the default parameters.

17:     Configuration conf = new Configuration();
18:     FileSystem fs = FileSystem.get(conf);

The FileSystem interface actually provides a generic abstraction suitable for use in several file systems. Depending on the Hadoop configuration, this may use HDFS or the local file system or a different one altogether. If this test program is launched via the ordinary 'java classname' command line, it may not findconf/hadoop-site.xml and will use the local file system. To ensure that it uses the proper Hadoop configuration, launch this program through Hadoop by putting it in a jar and running:

$HADOOP_HOME/bin/hadoop jar yourjar HDFSHelloWorld

Regardless of how you launch the program and which file system it connects to, writing to a file is done in the same way:

28:       FSDataOutputStream out = fs.create(filenamePath);
29:       out.writeUTF(message);
30:       out.close();

First we create the file with the fs.create() call, which returns an FSDataOutputStream used to write data into the file. We then write the information using ordinary stream writing functions; FSDataOutputStream extends the java.io.DataOutputStream class. When we are done with the file, we close the stream with out.close().

This call to fs.create() will overwrite the file if it already exists, but for sake of example, this program explicitly removes the file first anyway (note that depending on this explicit prior removal is technically a race condition). Testing for whether a file exists and removing an existing file are performed by lines 23-26:

23:       if (fs.exists(filenamePath)) {
24:         // remove the file first
25:         fs.delete(filenamePath);
26:       }

Other operations such as copying, moving, and renaming are equally straightforward operations on Pathobjects performed by the FileSystem.

Finally, we re-open the file for read, and pull the bytes from the file, converting them to a UTF-8 encoded string in the process, and print to the screen:

32:       FSDataInputStream in = fs.open(filenamePath);
33:       String messageIn = in.readUTF();
34:       System.out.print(messageIn);
35:       in.close();

The fs.open() method returns an FSDataInputStream , which subclasses java.io.DataInputStream . Data can be read from the stream using the readUTF() operation, as on line 33. When we are done with the stream, we call close() to free the handle associated with the file.

More information:

Complete JavaDoc for the HDFS API is provided athttp://hadoop.apache.org/common/docs/r0.20.2/api/index.html.

A direct link to the FileSystem interface is:http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html.

Another example HDFS application is available on the Hadoop wiki. This implements a file copy operation.

HDFS Permissions and Security

Starting with Hadoop 0.16.1, HDFS has included a rudimentary file permissions system. This permission system is based on the POSIX model, but does not provide strong security for HDFS files. The HDFS permissions system is designed to prevent accidental corruption of data or casual misuse of information within a group of users who share access to a cluster. It is not a strong security model that guarantees denial of access to unauthorized parties.

HDFS security is based on the POSIX model of users and groups. Each file or directory has 3 permissions (read, write and execute) associated with it at three different granularities: the file's owner, users in the same group as the owner, and all other users in the system. As the HDFS does not provide the full POSIX spectrum of activity, some combinations of bits will be meaningless. For example, no file can be executed; the +x bits cannot be set on files (only directories). Nor can an existing file be written to, although the +w bits may still be set.

Security permissions and ownership can be modified using the bin/hadoop dfs -chmod, -chown, and -chgrpoperations described earlier in this document; they work in a similar fashion to the POSIX/Linux tools of the same name.

Determining identity - Identity is not authenticated formally with HDFS; it is taken from an extrinsic source. The Hadoop system is programmed to use the user's current login as their Hadoop username (i.e., the equivalent of whoami). The user's current working group list (i.e, the output of groups) is used as the group list in Hadoop. HDFS itself does not verify that this username is genuine to the actual operator.

Superuser status - The username which was used to start the Hadoop process (i.e., the username who actually ran bin/start-all.sh or bin/start-dfs.sh) is acknowledged to be the superuser for HDFS. If this user interacts with HDFS, he does so with a special username superuser. This user's operations on HDFS never fail, regardless of permission bits set on the particular files he manipulates. If Hadoop is shutdown and restarted under a different username, that username is then bound to the superuser account.

Supergroup - There is also a special group named supergroup, whose membership is controlled by the configuration parameter dfs.permissions.supergroup.

Disabling permissions - By default, permissions are enabled on HDFS. The permission system can be disabled by setting the configuration option dfs.permissions to false. The owner, group, and permissions bits associated with each file and directory will still be preserved, but the HDFS process does not enforce them, except when using permissions-related operations such as -chmod.

Additional HDFS Tasks

REBALANCING BLOCKS

New nodes can be added to a cluster in a straightforward manner. On the new node, the same Hadoop version and configuration (conf/hadoop-site.xml) as on the rest of the cluster should be installed. Starting the DataNode daemon on the machine will cause it to contact the NameNode and join the cluster. (The new node should be added to the slaves file on the master server as well, to inform the master how to invoke script-based commands on the new node.)

But the new DataNode will have no data on board initially; it is therefore not alleviating space concerns on the existing nodes. New files will be stored on the new DataNode in addition to the existing ones, but for optimum usage, storage should be evenly balanced across all nodes.

This can be achieved with the automatic balancer tool included with Hadoop. The Balancer class will intelligently balance blocks across the nodes to achieve an even distribution of blocks within a given threshold, expressed as a percentage. (The default is 10%.) Smaller percentages make nodes more evenly balanced, but may require more time to achieve this state. Perfect balancing (0%) is unlikely to actually be achieved.

The balancer script can be run by starting bin/start-balancer.sh in the Hadoop directory. The script can be provided a balancing threshold percentage with the -threshold parameter; e.g., bin/start-balancer.sh -threshold 5. The balancer will automatically terminate when it achieves its goal, or when an error occurs, or it cannot find more candidate blocks to move to achieve better balance. The balancer can always be terminated safely by the administrator by running bin/stop-balancer.sh.

The balancing script can be run either when nobody else is using the cluster (e.g., overnight), but can also be run in an "online" fashion while many other jobs are on-going. To prevent the rebalancing process from consuming large amounts of bandwidth and significantly degrading the performance of other processes on the cluster, the dfs.balance.bandwidthPerSec configuration parameter can be used to limit the number of bytes/sec each node may devote to rebalancing its data store.

COPYING LARGE SETS OF FILES

When migrating a large number of files from one location to another (either from one HDFS cluster to another, from S3 into HDFS or vice versa, etc), the task should be divided between multiple nodes to allow them all to share in the bandwidth required for the process. Hadoop includes a tool called distcp for this purpose.

By invoking bin/hadoop distcp src dest, Hadoop will start a MapReduce task to distribute the burden of copying a large number of files from src to dest. These two parameters may specify a full URL for the the path to copy. e.g., "hdfs://SomeNameNode:9000/foo/bar/" and "hdfs://OtherNameNode:2000/baz/quux/" will copy the children of /foo/bar on one cluster to the directory tree rooted at /baz/quux on the other. The paths are assumed to be directories, and are copied recursively. S3 URLs can be specified with s3://bucket-name/key.

DECOMMISSIONING NODES

In addition to allowing nodes to be added to the cluster on the fly, nodes can also be removed from a cluster while it is running, without data loss. But if nodes are simply shut down "hard," data loss may occur as they may hold the sole copy of one or more file blocks.

Nodes must be retired on a schedule that allows HDFS to ensure that no blocks are entirely replicated within the to-be-retired set of DataNodes.

HDFS provides a decommissioning feature which ensures that this process is performed safely. To use it, follow the steps below:

Step 1: Cluster configuration. If it is assumed that nodes may be retired in your cluster, then before it is started, an excludes file must be configured. Add a key named dfs.hosts.exclude to your conf/hadoop-site.xml file. The value associated with this key provides the full path to a file on the NameNode's local file system which contains a list of machines which are not permitted to connect to HDFS.

Step 2: Determine hosts to decommission. Each machine to be decommissioned should be added to the file identified by dfs.hosts.exclude, one per line. This will prevent them from connecting to the NameNode.

Step 3: Force configuration reload. Run the command bin/hadoop dfsadmin -refreshNodes. This will force the NameNode to reread its configuration, including the newly-updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.

Step 4: Shutdown nodes. After the decommission process has completed, the decommissioned hardware can be safely shutdown for maintenance, etc. The bin/hadoop dfsadmin -report command will describe which nodes are connected to the cluster.

Step 5: Edit excludes file again. Once the machines have been decommissioned, they can be removed from the excludes file. Running bin/hadoop dfsadmin -refreshNodes again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after maintenance has been completed, or additional capacity is needed in the cluster again, etc.

VERIFYING FILE SYSTEM HEALTH

After decommissioning nodes, restarting a cluster, or periodically during its lifetime, you may want to ensure that the file system is healthy--that files are not corrupted or under-replicated, and that blocks are not missing.

Hadoop provides an fsck command to do exactly this. It can be launched at the command line like so:

  bin/hadoop fsck [path] [options]

If run with no arguments, it will print usage information and exit. If run with the argument /, it will check the health of the entire file system and print a report. If provided with a path to a particular directory or file, it will only check files under that path. If an option argument is given but no path, it will start from the file system root (/). The options may include two different types of options:

Action options specify what action should be taken when corrupted files are found. This can be -move, which moves corrupt files to /lost+found, or -delete, which deletes corrupted files.

Information options specify how verbose the tool should be in its report. The -files option will list all files it checks as it encounters them. This information can be further expanded by adding the -blocks option, which prints the list of blocks for each file. Adding -locations to these two options will then print the addresses of the DataNodes holding these blocks. Still more information can be retrieved by adding -racks to the end of this list, which then prints the rack topology information for each location. (See the next subsection for more information on configuring network rack awareness.) Note that the later options do not imply the former; you must use them in conjunction with one another. Also, note that the Hadoop program uses -files in a "common argument parser" shared by the different commands such as dfsadmin, fsck, dfs, etc. This means that if you omit a path argument to fsck, it will not receive the -files option that you intend. You can separate common options from fsck-specific options by using -- as an argument, like so:

  bin/hadoop fsck -- -files -blocks

The -- is not required if you provide a path to start the check from, or if you specify another argument first such as -move.

By default, fsck will not operate on files still open for write by another client. A list of such files can be produced with the -openforwrite option.

Rack Awareness

For small clusters in which all servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine." When loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster.

For larger Hadoop installations which span multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.

HDFS can be made rack-aware by the use of a script which allows the master node to map the network topology of the cluster. While alternate configuration strategies can be used, the default implementation allows you to provide an executable script which returns the "rack address" of each of a list of IP addresses.

The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. The input and output order must be consistent.

To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml. This provides a command to run to return a rack id; it must be an executable script or program. By default, Hadoop will attempt to send a set of IP addresses to the file as several separate command line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key.

Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a rack id of/default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for a large installation may be /top-switch-name/rack-name.

Hadoop rack ids are not currently expressive enough to handle an unusual routing topology such as a 3-d torus; they assume that each node is connected to a single switch which in turn has a single upstream switch. This is not usually a problem, however. Actual packet routing will be directed using the topology discovered by or set in switches and routers. The Hadoop rack ids will be used to find "near" and "far" nodes for replica placement (and in 0.17, MapReduce task placement).

The following example script performs rack identification based on IP addresses given a hierarchical IP addressing scheme enforced by the network administrator. This may work directly for simple installations; more complex network configurations may require a file- or table-based lookup process. Care should be taken in that case to keep the table up-to-date as nodes are physically relocated, etc. This script requires that the maximum number of arguments be set to 1.

#!/bin/bash
# Set rack id based on IP address.
# Assumes network administrator has complete control
# over IP addresses assigned to nodes and they are
# in the 10.x.y.z address space. Assumes that
# IP addresses are distributed hierarchically. e.g.,
# 10.1.y.z is one data center segment and 10.2.y.z is another;
# 10.1.1.z is one rack, 10.1.2.z is another rack in
# the same segment, etc.)
#
# This is invoked with an IP address as its only argument

# get IP address from the input
ipaddr=$0

# select "x.y" and convert it to "x/y"
segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`
echo /${segments}

HDFS Web Interface

HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations. By default this is exposed on port 50070 on the NameNode. Accessing http://namenode:50070/ with a web browser will return a page containing overview information about the health, capacity, and usage of the cluster (similar to the information returned by bin/hadoop dfsadmin -report).

The address and port where the web interface listens can be changed by setting dfs.http.address inconf/hadoop-site.xml. It must be of the form address:port. To accept requests on all addresses, use0.0.0.0.

From this interface, you can browse HDFS itself with a basic file-browser interface. Each DataNode exposes its file browser interface on port 50075. You can override this by setting the dfs.datanode.http.addressconfiguration key to a setting other than 0.0.0.0:50075. Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.

References

HDFS source code
from: https://developer.yahoo.com/hadoop/tutorial/module2.html

你可能感兴趣的:(Hadoop)

浅谈MapReduce Android路上的人 Hadoop 分布式计算 mapreduce 分布式框架 hadoop
从今天开始，本人将会开始对另一项技术的学习，就是当下炙手可热的Hadoop分布式就算技术。目前国内外的诸多公司因为业务发展的需要，都纷纷用了此平台。国内的比如BAT啦，国外的在这方面走的更加的前面，就不一一列举了。但是Hadoop作为Apache的一个开源项目，在下面有非常多的子项目，比如HDFS，HBase,Hive，Pig,等等，要先彻底学习整个Hadoop，仅仅凭借一个的力量，是远远不够的。
Hadoop 傲雪凌霜，松柏长青后端大数据 hadoop 大数据分布式
ApacheHadoop是一个开源的分布式计算框架，主要用于处理海量数据集。它具有高度的可扩展性、容错性和高效的分布式存储与计算能力。Hadoop核心由四个主要模块组成，分别是HDFS（分布式文件系统）、MapReduce（分布式计算框架）、YARN（资源管理）和HadoopCommon（公共工具和库）。1.HDFS（HadoopDistributedFileSystem）HDFS是Hadoop生
Hadoop架构 henan程序媛 hadoop 大数据分布式
一、案列分析1.1案例概述现在已经进入了大数据(BigData)时代，数以万计用户的互联网服务时时刻刻都在产生大量的交互，要处理的数据量实在是太大了，以传统的数据库技术等其他手段根本无法应对数据处理的实时性、有效性的需求。HDFS顺应时代出现，在解决大数据存储和计算方面有很多的优势。1.2案列前置知识点1.什么是大数据大数据是指无法在一定时间范围内用常规软件工具进行捕捉、管理和处理的大量数据集合，
分享一个基于python的电子书数据采集与可视化分析 hadoop电子书数据分析与推荐系统 spark大数据毕设项目（源码、调试、LW、开题、PPT) 计算机源码社 Python项目大数据大数据 python hadoop 计算机毕业设计选题计算机毕业设计源码数据分析 spark毕设
作者：计算机源码社个人简介：本人八年开发经验，擅长Java、Python、PHP、.NET、Node.js、Android、微信小程序、爬虫、大数据、机器学习等，大家有这一块的问题可以一起交流！学习资料、程序开发、技术解答、文档报告如需要源码，可以扫取文章下方二维码联系咨询Java项目微信小程序项目Android项目Python项目PHP项目ASP.NET项目Node.js项目选题推荐项目实战|p
hbase介绍 CrazyL- 云计算+大数据 hbase
hbase是一个分布式的、多版本的、面向列的开源数据库hbase利用hadoophdfs作为其文件存储系统，提供高可靠性、高性能、列存储、可伸缩、实时读写、适用于非结构化数据存储的数据库系统hbase利用hadoopmapreduce来处理hbase、中的海量数据hbase利用zookeeper作为分布式系统服务特点：数据量大：一个表可以有上亿行，上百万列（列多时，插入变慢）面向列：面向列（族）的
大数据毕业设计hadoop+spark+hive知识图谱租房数据分析可视化大屏租房推荐系统 58同城租房爬虫房源推荐系统房价预测系统计算机毕业设计机器学习深度学习人工智能 2401_84572577 程序员大数据 hadoop 人工智能
做了那么多年开发，自学了很多门编程语言，我很明白学习资源对于学一门新语言的重要性，这些年也收藏了不少的Python干货，对我来说这些东西确实已经用不到了，但对于准备自学Python的人来说，或许它就是一个宝藏，可以给你省去很多的时间和精力。别在网上瞎学了，我最近也做了一些资源的更新，只要你是我的粉丝，这期福利你都可拿走。我先来介绍一下这些东西怎么用，文末抱走。（1）Python所有方向的学习路线（
Spark集群的三种模式 MelodyYN #Spark spark hadoop big data
文章目录1、Spark的由来1.1Hadoop的发展1.2MapReduce与Spark对比2、Spark内置模块3、Spark运行模式3.1Standalone模式部署配置历史服务器配置高可用运行模式3.2Yarn模式安装部署配置历史服务器运行模式4、WordCount案例1、Spark的由来定义：Hadoop主要解决，海量数据的存储和海量数据的分析计算。Spark是一种基于内存的快速、通用、可
月度总结 | 2022年03月 | 考研与就业的抉择 | 确定未来走大数据开发路线「已注销」个人总结 hadoop
一、时间线梳理3月3日，寻找到同专业的就业伙伴3月5日，着手准备Java八股文，决定先走Java后端路线3月8月，申请到了校图书馆的考研专座，决定暂时放弃就业，先准备考研，买了数学和408的资料书3月9日-3月13日，因疫情原因，宿舍区暂封，这段时间在准备考研，发现内容特别多3月13日-3月19日，大部分时间在刷Hadoop、Zookeeper、Kafka的视频，同时在准备实习的项目3月20日，退
HBase介绍 mingyu1016 数据库
概述HBase是一个分布式的、面向列的开源数据库,源于google的一篇论文《bigtable：一个结构化数据的分布式存储系统》。HBase是GoogleBigtable的开源实现，它利用HadoopHDFS作为其文件存储系统，利用HadoopMapReduce来处理HBase中的海量数据，利用Zookeeper作为协同服务。HBase的表结构HBase以表的形式存储数据。表有行和列组成。列划分为
Java中的大数据处理框架对比分析省赚客app开发者 java 开发语言
Java中的大数据处理框架对比分析大家好，我是微赚淘客系统3.0的小编，是个冬天不穿秋裤，天冷也要风度的程序猿！今天，我们将深入探讨Java中常用的大数据处理框架，并对它们进行对比分析。大数据处理框架是现代数据驱动应用的核心，它们帮助企业处理和分析海量数据，以提取有价值的信息。本文将重点介绍ApacheHadoop、ApacheSpark、ApacheFlink和ApacheStorm这四种流行的
Hadoop windows intelij 跑 MR WordCount piziyang12138
一、软件环境我使用的软件版本如下:IntellijIdea2017.1Maven3.3.9Hadoop分布式环境二、创建maven工程打开Idea,file->new->Project,左侧面板选择maven工程。(如果只跑MapReduce创建java工程即可，不用勾选Creatfromarchetype，如果想创建web工程或者使用骨架可以勾选)image.png设置GroupId和Artif
Hadoop学习第三课（HDFS架构--读、写流程）小小程序员呀~ 数据库 hadoop 架构 big data
1.块概念举例1：一桶水1000ml，瓶子的规格100ml=>需要10个瓶子装完一桶水1010ml，瓶子的规格100ml=>需要11个瓶子装完一桶水1010ml，瓶子的规格200ml=>需要6个瓶子装完块的大小规格，只要是需要存储，哪怕一点点，也是要占用一个块的块大小的参数：dfs.blocksize官方默认的大小为128M官网：https://hadoop.apache.org/docs/r3.
hadoop启动HDFS命令 m0_67401228 java 搜索引擎 linux 后端
启动命令：/hadoop/sbin/start-dfs.sh停止命令：/hadoop/sbin/stop-dfs.sh
【计算机毕设-大数据方向】基于Hadoop的电商交易数据分析可视化系统的设计与实现程序员-石头山大数据实战案例大数据 hadoop 毕业设计毕设
博主介绍：✌全平台粉丝5W+,高级大厂开发程序员，博客之星、掘金/知乎/华为云/阿里云等平台优质作者。【源码获取】关注并且私信我【联系方式】最下边感兴趣的可以先收藏起来，同学门有不懂的毕设选题，项目以及论文编写等相关问题都可以和学长沟通，希望帮助更多同学解决问题前言随着电子商务行业的迅猛发展，电商平台积累了海量的数据资源，这些数据不仅包括用户的基本信息、购物记录，还包括用户的浏览行为、评价反馈等多
分布式离线计算—Spark—基础介绍测试开发abbey 人工智能—大数据
原文作者：饥渴的小苹果原文地址：【Spark】Spark基础教程目录Spark特点Spark相对于Hadoop的优势Spark生态系统Spark基本概念Spark结构设计Spark各种概念之间的关系Executor的优点Spark运行基本流程Spark运行架构的特点Spark的部署模式Spark三种部署方式Hadoop和Spark的统一部署摘要：Spark是基于内存计算的大数据并行计算框架Spar
spark常用命令我是浣熊的微笑 spark
查看报错日志：yarnlogsapplicationIDspark2-submit--masteryarn--classcom.hik.ReadHdfstest-1.0-SNAPSHOT.jar进入$SPARK_HOME目录，输入bin/spark-submit--help可以得到该命令的使用帮助。hadoop@wyy:/app/hadoop/spark100$bin/spark-submit--
spark启动命令学不会又听不懂 spark 大数据分布式
hadoop启动：cd/root/toolssstart-dfs.sh，只需在hadoop01上启动stop-dfs.sh日志查看：cat/root/toolss/hadoop/logs/hadoop-root-datanode-hadoop03.outzookeeper启动：cd/root/toolss/zookeeperbin/zkServer.shstart，三台都要启动bin/zkServ
编程常用命令总结 Yellow0523 Linux BigData 大数据
编程命令大全1.软件环境变量的配置JavaScalaSparkHadoopHive2.大数据软件常用命令Spark基本命令Spark-SQL命令Hive命令HDFS命令YARN命令Zookeeper命令kafka命令Hibench命令MySQL命令3.Linux常用命令Git命令conda命令pip命令查看Linux系统的详细信息查看Linux系统架构(X86还是ARM，两种方法都可)端口号命令L
Hadoop常见面试题整理及解答叶青舟 Linux hdfs 大数据 hadoop linux
Hadoop常见面试题整理及解答一、基础知识篇：1.把数据仓库从传统关系型数据库转到hadoop有什么优势？答：（1）关系型数据库成本高，且存储空间有限。而Hadoop使用较为廉价的机器存储数据，且Hadoop可以将大量机器构建成一个集群，并在集群中使用HDFS文件系统统一管理数据，极大的提高了数据的存储及处理能力。（2）关系型数据库仅支持标准结构化数据格式，Hadoop不仅支持标准结构化数据格式
2025毕业设计指南：如何用Hadoop构建超市进货推荐系统？大数据分析助力精准采购计算机编程指导师 Java实战集 Python实战集大数据实战集课程设计 hadoop 数据分析 spring boot java 进货 python
✍✍计算机编程指导师⭐⭐个人介绍：自己非常喜欢研究技术问题！专业做Java、Python、小程序、安卓、大数据、爬虫、Golang、大屏等实战项目。⛽⛽实战项目：有源码或者技术上的问题欢迎在评论区一起讨论交流！⚡⚡Java实战|SpringBoot/SSMPython实战项目|Django微信小程序/安卓实战项目大数据实战项目⚡⚡文末获取源码文章目录⚡⚡文末获取源码基于hadoop的超市进货推荐系
Hadoop Common 之序列化机制小解猫君之上 #Apache Hadoop
1.JavaSerializable序列化该序列化通过ObjectInputStream的readObject实现序列化，ObjectOutputStream的writeObject实现反序列化。这不过此种序列化虽然跨病态兼容性强，但是因为存储过多的信息，但是传输效率比较低，所以hadoop弃用它。（序列化信息包括这个对象的类，类签名，类的所有静态，费静态成员的值，以及他们父类都要被写入）publ
深入理解hadoop(一)----Common的实现----Configuration maoxiao_jsd 深入理解----hadoop
属本人个人原创，转载请注明,希望对大家有帮助！！一,hadoop的配置管理a,hadoop通过独有的Configuration处理配置信息Configurationconf=newConfiguration();conf.addResource("core-default.xml");conf.addResource("core-site.xml");后者会覆盖前者中未final标记的相同配置项b
hadoop 0.22.0 部署笔记 weixin_33701564 大数据 java 运维
为什么80%的码农都做不了架构师？>>>因为需要使用hbase，所以开始对hbase进行学习。hbase是部署在hadoop平台上的NOSql数据库，因此在部署hbase之前需要先部署hadoop。环境：redhat5、hadoop-0.22.0.tar.gz、jdk-6u13-linux-i586.zipip192.168.1.128hostname：localhost.localdomain（
解决Windows环境下hadoop集群的运行_window运行hadoop,unknown hadoop01(4) 2401_84160087 大数据面试学习
网上学习资料一大堆，但如果学到的知识不成体系，遇到问题时只是浅尝辄止，不再深入研究，那么很难做到真正的技术提升。需要这份系统化资料的朋友，可以戳这里获取一个人可以走的很快，但一群人才能走的更远！不论你是正从事IT行业的老鸟或是对IT行业感兴趣的新人，都欢迎加入我们的的圈子（技术交流、学习资源、职场吐槽、大厂内推、面试辅导），让我们一起学习成长！org.apache.hadoophadoop-com
解决Windows环境下hadoop集群的运行_window运行hadoop,unknown hadoop01(3) 2401_84160087 大数据面试学习
网上学习资料一大堆，但如果学到的知识不成体系，遇到问题时只是浅尝辄止，不再深入研究，那么很难做到真正的技术提升。需要这份系统化资料的朋友，可以戳这里获取一个人可以走的很快，但一群人才能走的更远！不论你是正从事IT行业的老鸟或是对IT行业感兴趣的新人，都欢迎加入我们的的圈子（技术交流、学习资源、职场吐槽、大厂内推、面试辅导），让我们一起学习成长！xmlns:xsi="http://www.w3.or
深入解析HDFS：定义、架构、原理、应用场景及常用命令 CloudJourney hdfs 架构 hadoop
引言Hadoop分布式文件系统（HDFS，HadoopDistributedFileSystem）是Hadoop框架的核心组件之一，它提供了高可靠性、高可用性和高吞吐量的大规模数据存储和管理能力。本文将从HDFS的定义、架构、工作原理、应用场景以及常用命令等多个方面进行详细探讨，帮助读者全面深入地了解HDFS。1.HDFS的定义1.1什么是HDFSHDFS是Hadoop生态系统中的一个分布式文件系
Hadoop的搭建流程 lzhlizihang hadoop 大数据分布式
文章目录一、配置IP二、配置主机名三、配置主机映射四、关闭防火墙五、配置免密六、安装jdk1、第一步：2、第二步：3、第三步：4、第四步：5、第五步：七、安装hadoop1、上传2、解压3、重命名4、开始配置环境变量5、刷新配置文件6、验证hadoop命令是否可以识别八、全分布搭建7、修改配置文件core-site.xml8、修改配置文件hdfs-site.xml9、修改配置文件hadoop-en
hive搭建 -----内嵌模式和本地模式 lzhlizihang hive hadoop
文章目录一、内嵌模式（使用较少）1、上传、解压、重命名2、配置环境变量3、配置conf下的hive-env.sh4、修改conf下的hive-site.xml5、启动hadoop集群6、给hdfs创建文件夹7、修改hive-site.xml中的非法字符8、初始化元数据9、测试是否成功10、内嵌模式的缺点二、本地模式（最常用）1、检查mysql是否正常2、上传、解压、重命名3、配置环境变量4、修改c
Hadoop之mapreduce -- WrodCount案例以及各种概念 lzhlizihang hadoop mapreduce 大数据
文章目录一、MapReduce的优缺点二、MapReduce案例--WordCount1、导包2、Mapper方法3、Partitioner方法（自定义分区器）4、reducer方法5、driver（main方法）6、Writable（手机流量统计案例的实体类）三、关于片和块1、什么是片，什么是块？2、mapreduce启动多少个MapTask任务？四、MapReduce的原理五、Shuffle过
IAAS: IT公司去IOE-Alibaba系统构架解读 wishchin 心理学/职业 BigDataMini Spark PaaS
从Hadoop到自主研发，技术解读阿里去IOE后的系统架构原地址：......................云计算阿里飞天摘要：从IOE时代，到Hadoop与飞天并行，再到飞天单集群5000节点的实现，阿里一直摸索在技术衍变的前沿。这里，我们将从架构、性能、运维等多个方面深入了解阿里基础设施。【导读】互联网的普及，智能终端的增加，大数据时代悄然而至。在这个数据为王的时代，数十倍、数百倍的数据给各
矩阵求逆（JAVA）初等行变换 qiuwanchi 矩阵求逆（JAVA）
package gaodai.matrix; import gaodai.determinant.DeterminantCalculation; import java.util.ArrayList; import java.util.List; import java.util.Scanner; /** * 矩阵求逆(初等行变换) * @author 邱万迟 *
JDK timer antlove java jdk schedule code timer
1.java.util.Timer.schedule(TimerTask task, long delay)：多长时间（毫秒）后执行任务 2.java.util.Timer.schedule(TimerTask task, Date time)：设定某个时间执行任务 3.java.util.Timer.schedule(TimerTask task, long delay,longperiod
JVM调优总结 -Xms -Xmx -Xmn -Xss coder_xpf jvm 应用服务器
堆大小设置JVM 中最大堆大小有三方面限制：相关操作系统的数据模型（32-bt还是64-bit）限制；系统的可用虚拟内存限制；系统的可用物理内存限制。32位系统下，一般限制在1.5G~2G；64为操作系统对内存无限制。我在Windows Server 2003 系统，3.5G物理内存，JDK5.0下测试，最大可设置为1478m。典型设置： java -Xmx
JDBC连接数据库 Array_06 jdbc
package Util; import java.sql.Connection; import java.sql.DriverManager; import java.sql.ResultSet; import java.sql.SQLException; import java.sql.Statement; public class JDBCUtil { //完
Unsupported major.minor version 51.0（jdk版本错误） oloz java
java.lang.UnsupportedClassVersionError: cn/support/cache/CacheType : Unsupported major.minor version 51.0 (unable to load class cn.support.cache.CacheType) at org.apache.catalina.loader.WebappClassL
用多个线程处理1个List集合 362217990 多线程 thread list 集合
昨天发了一个提问，启动5个线程将一个List中的内容，然后将5个线程的内容拼接起来，由于时间比较急迫，自己就写了一个Demo，希望对菜鸟有参考意义。。 import java.util.ArrayList; import java.util.List; import java.util.concurrent.CountDownLatch; public c
JSP简单访问数据库香水浓 sql mysql jsp
学习使用javaBean，代码很烂，仅为留个脚印 public class DBHelper { private String driverName; private String url; private String user; private String password; private Connection connection; privat
Flex4中使用组件添加柱状图、饼状图等图表 AdyZhang Flex
1.添加一个最简单的柱状图 ? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 <?xml version= "1.0"&n
Android 5.0 - ProgressBar 进度条无法展示到按钮的前面 aijuans android
在低于SDK < 21 的版本中，ProgressBar 可以展示到按钮前面，并且为之在按钮的中间，但是切换到android 5.0后进度条ProgressBar 展示顺序变化了，按钮再前面，ProgressBar 在后面了我的xml配置文件如下： [html] view plain copy <RelativeLa
查询汇总的sql baalwolf sql
select list.listname, list.createtime,listcount from dream_list as list , (select listid,count(listid) as listcount from dream_list_user group by listid order by count(
Linux du命令和df命令区别 BigBird2012 linux
1，两者区别 du，disk usage,是通过搜索文件来计算每个文件的大小然后累加，du能看到的文件只是一些当前存在的，没有被删除的。他计算的大小就是当前他认为存在的所有文件大小的累加和。
AngularJS中的$apply，用还是不用？ bijian1013 JavaScript AngularJS $apply
在AngularJS开发中，何时应该调用$scope.$apply()，何时不应该调用。下面我们透彻地解释这个问题。但是首先，让我们把$apply转换成一种简化的形式。 scope.$apply就像一个懒惰的工人。它需要按照命
[Zookeeper学习笔记十]Zookeeper源代码分析之ClientCnxn数据序列化和反序列化 bit1129 zookeeper
ClientCnxn是Zookeeper客户端和Zookeeper服务器端进行通信和事件通知处理的主要类，它内部包含两个类，1. SendThread 2. EventThread， SendThread负责客户端和服务器端的数据通信，也包括事件信息的传输，EventThread主要在客户端回调注册的Watchers进行通知处理 ClientCnxn构造方法 &
【Java命令一】jmap bit1129 Java命令
jmap命令的用法： [hadoop@hadoop sbin]$ jmap Usage: jmap [option] <pid> (to connect to running process) jmap [option] <executable <core> (to connect to a
Apache 服务器安全防护及实战 ronin47
此文转自IBM. Apache 服务简介 Web 服务器也称为 WWW 服务器或 HTTP 服务器 (HTTP Server)，它是 Internet 上最常见也是使用最频繁的服务器之一，Web 服务器能够为用户提供网页浏览、论坛访问等等服务。由于用户在通过 Web 浏览器访问信息资源的过程中，无须再关心一些技术性的细节，而且界面非常友好，因而 Web 在 Internet 上一推出就得到
unity 3d实例化位置出现布置？ brotherlamp unity教程 unity unity资料 unity视频 unity自学
问：unity 3d实例化位置出现布置？答：实例化的同时就可以指定被实例化的物体的位置,即 position Instantiate (original : Object, position : Vector3, rotation : Quaternion) : Object 这样你不需要再用Transform.Position了, 如果你省略了第二个参数(
《重构，改善现有代码的设计》第八章 Duplicate Observed Data bylijinnan java 重构
import java.awt.Color; import java.awt.Container; import java.awt.FlowLayout; import java.awt.Label; import java.awt.TextField; import java.awt.event.FocusAdapter; import java.awt.event.FocusE
struts2更改struts.xml配置目录 chiangfai struts.xml
struts2默认是读取classes目录下的配置文件，要更改配置文件目录，比如放在WEB-INF下，路径应该写成../struts.xml(非/WEB-INF/struts.xml) web.xml文件修改如下： <filter> <filter-name>struts2</filter-name> <filter-class&g
redis做缓存时的一点优化 chenchao051 redis hadoop pipeline
最近集群上有个job，其中需要短时间内频繁访问缓存，大概7亿多次。我这边的缓存是使用redis来做的，问题就来了。首先，redis中存的是普通kv，没有考虑使用hash等解结构，那么以为着这个job需要访问7亿多次redis，导致效率低，且出现很多redi
mysql导出数据不输出标题行 daizj mysql 数据导出去掉第一行去掉标题
当想使用数据库中的某些数据，想将其导入到文件中，而想去掉第一行的标题是可以加上-N参数如通过下面命令导出数据： mysql -uuserName -ppasswd -hhost -Pport -Ddatabase -e " select * from tableName" > exportResult.txt 结果为： studentid
phpexcel导出excel表简单入门示例 dcj3sjt126com PHP Excel phpexcel
先下载PHPEXCEL类文件，放在class目录下面，然后新建一个index.php文件，内容如下 <?php error_reporting(E_ALL); ini_set('display_errors', TRUE); ini_set('display_startup_errors', TRUE); if (PHP_SAPI == 'cli') die('
爱情格言 dcj3sjt126com 格言
1) I love you not because of who you are, but because of who I am when I am with you. 　　我爱你，不是因为你是一个怎样的人，而是因为我喜欢与你在一起时的感觉。 　　2) No man or woman is worth your tears, and the one who is, won‘t
转 Activity 详解——Activity文档翻译 e200702084 android UI sqlite 配置管理网络应用
activity 展现在用户面前的经常是全屏窗口，你也可以将 activity 作为浮动窗口来使用（使用设置了 windowIsFloating 的主题），或者嵌入到其他的 activity （使用 ActivityGroup ）中。当用户离开 activity 时你可以在 onPause() 进行相应的操作。更重要的是，用户做的任何改变都应该在该点上提交 ( 经常提交到 ContentPro
win7安装MongoDB服务 geeksun mongodb
1. 下载MongoDB的windows版本：mongodb-win32-x86_64-2008plus-ssl-3.0.4.zip，Linux版本也在这里下载，下载地址： http://www.mongodb.org/downloads 2. 解压MongoDB在D:\server\mongodb, 在D:\server\mongodb下创建d
Javascript魔法方法:__defineGetter__,__defineSetter__ hongtoushizi js
转载自： http://www.blackglory.me/javascript-magic-method-definegetter-definesetter/ 在javascript的类中,可以用defineGetter和defineSetter_控制成员变量的Get和Set行为例如,在一个图书类中,我们自动为Book加上书名符号: function Book(name){
错误的日期格式可能导致走nginx proxy cache时不能进行304响应 jinnianshilongnian cache
昨天在整合某些系统的nginx配置时，出现了当使用nginx cache时无法返回304响应的情况，出问题的响应头： Content-Type:text/html; charset=gb2312 Date:Mon, 05 Jan 2015 01:58:05 GMT Expires:Mon , 05 Jan 15 02:03:00 GMT Last-Modified:Mon, 05
数据源架构模式之行数据入口 home198979 PHP 架构行数据入口
注：看不懂的请勿踩，此文章非针对java，java爱好者可直接略过。一、概念行数据入口（Row Data Gateway）：充当数据源中单条记录入口的对象，每行一个实例。二、简单实现行数据入口为了方便理解，还是先简单实现： <?php /** * 行数据入口类 */ class OrderGateway { /*定义元数
Linux各个目录的作用及内容 pda158 linux 脚本
1）根目录“/” 　　根目录位于目录结构的最顶层，用斜线（/）表示，类似于 Windows 操作系统的“C:\“，包含Fedora操作系统中所有的目录和文件。　　2）/bin 　　/bin 　　目录又称为二进制目录，包含了那些供系统管理员和普通用户使用的重要 linux命令的二进制映像。该目录存放的内容包括各种可执行文件，还有某些可执行文件的符号连接。常用的命令有：cp、d
ubuntu12.04上编译openjdk7 ol_beta HotSpot jvm jdk OpenJDK
获取源码从openjdk代码仓库获取(比较慢) 安装mercurial Mercurial是一个版本管理工具。 sudo apt-get install mercurial 将以下内容添加到$HOME/.hgrc文件中，如果没有则自己创建一个： [extensions] forest=/home/lichengwu/hgforest-crew/forest.py fe
将数据库字段转换成设计文档所需的字段 vipbooks 设计模式工作正则表达式
哈哈，出差这么久终于回来了，回家的感觉真好！ PowerDesigner的物理数据库一出来，设计文档中要改的字段就多得不计其数，如果要把PowerDesigner中的字段一个个Copy到设计文档中，那将会是一件非常痛苦的事情。