Building and Installing Hadoop 2.0 or Newer on Windows

  1. Introduction
  2. Building Hadoop Core for Windows
  3. Starting a Single Node (pseudo-distributed) Cluster
  4. Conclusion

1、Introduction

Hadoop version 2.2 onwards includes native support for Windows. The official Apache Hadoop releases do not include Windows binaries (yet, as of January 2014). However, building a Windows package from the sources is fairly straightforward.
Hadoop is a complex system with many components. Some familiarity at a high level is helpful before attempting to build or install it for the first time. Familiarity with Java is necessary in case you need to troubleshoot.


2、Building Hadoop Core for Windows

2、1 Choose the Target OS Version
The Hadoop developers have used Windows Server 2008 and Windows Server 2008 R2 during development and testing. Windows Vista and Windows 7 are also likely to work because of the Win32 API similarities with the respective server SKUs. We have not tested on Windows XP or any earlier versions of Windows and these are not likely to work. Any issues reported on Windows XP or earlier will be closed as Invalid.
Do not attempt to run the installation from within Cygwin. Cygwin is neither required nor supported.


2、2 Choose the Java Version and Set JAVA_HOME
Oracle JDK versions 1.7 and 1.6 have been tested by the Hadoop developers and are known to work.
Make sure that JAVA_HOME is set in your environment and does not contain any spaces. If your default Java installation directory has spaces, then you must use the Windows 8.3 pathname, e.g. c:\Progra~1\Java… rather than c:\Program Files\Java….
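
The rule above can be checked before starting a long build. The following is a minimal sketch in Python, not part of the Hadoop tooling; the function name java_home_problem and its messages are hypothetical, and it only tests the two conditions stated above (the variable is set and contains no spaces).

```python
import os


def java_home_problem(java_home):
    """Return a short description of what is wrong with a JAVA_HOME value,
    or None if it satisfies the two conditions described in the text."""
    if not java_home:
        return "JAVA_HOME is not set"
    if " " in java_home:
        return "JAVA_HOME contains spaces; use the 8.3 short path instead"
    return None


if __name__ == "__main__":
    # Check the value from the current environment.
    problem = java_home_problem(os.environ.get("JAVA_HOME"))
    print(problem or "JAVA_HOME looks usable")
```

The check does not verify that the directory exists or that it holds a JDK; it only catches the space problem that this section warns about.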


2、3 Getting the Hadoop Sources
The current stable release as of August 2014 is 2.5. The source distribution can be retrieved from the ASF download server or using subversion or git.

  • From the ASF Hadoop download page or a mirror.
  • Subversion URL: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.5
  • Git repository URL: git://git.apache.org/hadoop-common.git. After downloading the sources via git, switch to the stable 2.5 branch using git checkout branch-2.5, or use the appropriate branch name if you are targeting a newer version.

2、4 Installing Dependencies and Setting Up the Build Environment
The BUILDING.txt file in the root of the source tree has detailed information on the list of requirements and how to install them. It also includes information on setting up the environment and a few quirks that are specific to Windows. It is strongly recommended that you read and understand it before proceeding.


Supplementary excerpt from BUILDING.txt:


Building on Windows

Requirements:

  • Windows System
  • JDK 1.6+
  • Maven 3.0 or later
  • Findbugs 1.3.9 (if running findbugs)
  • ProtocolBuffer 2.5.0
  • CMake 2.6 or newer
  • Windows SDK or Visual Studio 2010 Professional
  • Unix command-line tools from GnuWin32 or Cygwin: sh, mkdir, rm, cp, tar, gzip
  • zlib headers (if building native code bindings for zlib)
  • Internet connection for first build (to fetch all Maven and Hadoop dependencies)

If using Visual Studio, it must be Visual Studio 2010 Professional (not 2012). Do not use Visual Studio Express. It does not support compiling for 64-bit, which is problematic if running a 64-bit system. The Windows SDK is free to download here:
http://www.microsoft.com/en-us/download/details.aspx?id=8279

Building:

  1. Keep the source code tree in a short path to avoid running into problems related to Windows maximum path length limitation. (For example, C:\hdc).
  2. Run builds from a Windows SDK Command Prompt. (Start, All Programs, Microsoft Windows SDK v7.1, Windows SDK 7.1 Command Prompt.)
  3. JAVA_HOME must be set, and the path must not contain spaces. If the full path would contain spaces, then use the Windows short path instead.
  4. You must set the Platform environment variable to either x64 or Win32 depending on whether you’re running a 64-bit or 32-bit system. Note that this is case-sensitive. It must be “Platform”, not “PLATFORM” or “platform”. Environment variables on Windows are usually case-insensitive, but Maven treats them as case-sensitive. Failure to set this environment variable correctly will cause msbuild to fail while building the native code in hadoop-common.
     set Platform=x64 (when building on a 64-bit system)
     set Platform=Win32 (when building on a 32-bit system)
  5. Several tests require that the user must have the Create Symbolic Links privilege.
  6. All Maven goals are the same as described above with the exception that native code is built by enabling the ‘native-win’ Maven profile. -Pnative-win is enabled by default when building on Windows since the native components are required (not optional) on Windows.
     If native code bindings for zlib are required, then the zlib headers must be deployed on the build machine. Set the ZLIB_HOME environment variable to the directory containing the headers.
     set ZLIB_HOME=C:\zlib-1.2.7
  7. At runtime, zlib1.dll must be accessible on the PATH. Hadoop has been tested with zlib 1.2.7, built using Visual Studio 2010 out of contrib\vstudio\vc10 in the zlib 1.2.7 source tree.
     http://www.zlib.net/
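
The case-sensitivity requirement on the Platform variable is easy to get wrong, so it may be worth checking before running Maven. The sketch below is an illustration in Python, not part of the Hadoop build. It works on the NAME=value text printed by the cmd set command, because Python's own os.environ converts variable names to uppercase on Windows and so cannot show the original spelling. The function names are hypothetical, and treating the value as an exact match against x64 or Win32 is an assumption.

```python
# Values named in BUILDING.txt; exact-match comparison is an assumption.
VALID_PLATFORMS = ("x64", "Win32")


def parse_set_output(text):
    """Parse the NAME=value lines printed by the cmd `set` command
    into a dict, keeping each name in its original case."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def check_platform(env):
    """Return None if Platform is set as described, else a message."""
    if "Platform" in env:
        if env["Platform"] in VALID_PLATFORMS:
            return None
        return "Platform must be x64 or Win32, got %r" % env["Platform"]
    for name in env:
        if name.lower() == "platform":
            return "variable is spelled %r; it must be exactly 'Platform'" % name
    return "Platform is not set"
```

One way to use it is to redirect the output of set to a file from the Windows SDK command prompt, read that file in Python, and pass the text to parse_set_output.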

Building distributions
* Build distribution with native code (this builds the release distribution):

mvn package [-Pdist][-Pdocs][-Psrc][-Dtar]

2、5 A few words on Native IO support
Hadoop on Linux includes optional Native IO support. However, Native IO is mandatory on Windows and without it you will not be able to get your installation working. You must follow all the instructions from BUILDING.txt to ensure that Native IO support is built correctly.


2、6 Build and Copy the Package Files
To build a binary distribution, run the following command from the root of the source tree.

mvn package -Pdist,native-win -DskipTests -Dtar

Note that this command must be run from a Windows SDK command prompt as documented in BUILDING.txt. A successful build generates a binary hadoop .tar.gz package in hadoop-dist\target.
The Hadoop version is present in the package file name. If you are targeting a different version then the package name will be different.
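
Because the version is embedded in the file name, a deployment script can read it from there instead of hard-coding it. The following Python sketch assumes the package is named like hadoop-2.5.0.tar.gz; the pattern and the function name version_from_package are illustrative assumptions, not something defined by the Hadoop build.

```python
import re

# Assumed naming pattern, modelled on a name such as hadoop-2.5.0.tar.gz.
PACKAGE_RE = re.compile(r"^hadoop-(?P<version>.+)\.tar\.gz$")


def version_from_package(file_name):
    """Return the version part of a package file name, or None if the
    name does not match the assumed pattern."""
    match = PACKAGE_RE.match(file_name)
    return match.group("version") if match else None
```

For example, version_from_package("hadoop-2.5.0.tar.gz") gives "2.5.0", while a name that does not follow the pattern gives None.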


2、7 Install
Pick a target directory for installing the package. We use c:\deploy as an example. Extract the tar.gz file (e.g. hadoop-2.5.0.tar.gz) under c:\deploy. This will yield a directory structure like the following. If installing a multi-node cluster, then repeat this step on every node.

C:\deploy>dir
 Volume in drive C has no label.
 Volume Serial Number is 9D1F-7BAC

 Directory of C:\deploy

01/18/2014  08:11 AM    <DIR>          .
01/18/2014  08:11 AM    <DIR>          ..
01/18/2014  08:28 AM    <DIR>          bin
01/18/2014  08:28 AM    <DIR>          etc
01/18/2014  08:28 AM    <DIR>          include
01/18/2014  08:28 AM    <DIR>          libexec
01/18/2014  08:28 AM    <DIR>          sbin
01/18/2014  08:28 AM    <DIR>          share
               0 File(s)              0 bytes

3、Starting a Single Node (pseudo-distributed) Cluster

This section describes the absolute minimum configuration required to start a Single Node (pseudo-distributed) cluster and also run an example MapReduce job.


3、1 Example HDFS Configuration
Before you can start the Hadoop Daemons you will need to make a few edits to configuration files. The configuration file templates will all be found in c:\deploy\etc\hadoop, assuming your installation directory is c:\deploy.


First edit the file hadoop-env.cmd to add the following lines near the end of the file.

set HADOOP_PREFIX=c:\deploy
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin

Edit or create the file core-site.xml and make sure it has the following configuration key:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:19000</value>
  </property>
</configuration>

Edit or create the file hdfs-site.xml and add the following configuration key:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
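
Both files above share the same layout: a configuration root element holding property elements, each with a name and a value. If you prefer to generate these files rather than edit them by hand, a minimal Python sketch using the standard library could look like this. The function name build_site_xml is hypothetical and the sketch is not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Build a Hadoop-style <configuration> document from a dict of
    property names to values and return it as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The single hdfs-site.xml key used in this section.
    print(build_site_xml({"dfs.replication": 1}))
```

The output is written on a single line without indentation, which is still well-formed XML; writing it to a file such as hdfs-site.xml is left to the caller.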

Finally, edit or create the file slaves and make sure it has the following entry:

localhost

The default configuration puts the HDFS metadata and data files under \tmp on the current drive. In the above example this would be c:\tmp. For your first test setup you can just leave it at the default.


3、2 Example YARN Configuration
Edit or create mapred-site.xml under %HADOOP_PREFIX%\etc\hadoop and add the following entries, replacing %USERNAME% with your Windows user name.

<configuration>

  <property>
    <name>mapreduce.job.user.name</name>
    <value>%USERNAME%</value>
  </property>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

  <property>
    <name>yarn.apps.stagingDir</name>
    <value>/user/%USERNAME%/staging</value>
  </property>

  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>local</value>
  </property>

</configuration>

Finally, edit or create yarn-site.xml and add the following entries:

<configuration>
  <property>
    <name>yarn.server.resourcemanager.address</name>
    <value>0.0.0.0:8020</value>
  </property>

  <property>
    <name>yarn.server.resourcemanager.application.expiry.interval</name>
    <value>60000</value>
  </property>

  <property>
    <name>yarn.server.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <name>yarn.server.nodemanager.remote-app-log-dir</name>
    <value>/app-logs</value>
  </property>

  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/dep/logs/userlogs</value>
  </property>

  <property>
    <name>yarn.server.mapreduce-appmanager.attempt-listener.bindAddress</name>
    <value>0.0.0.0</value>
  </property>

  <property>
    <name>yarn.server.mapreduce-appmanager.client-service.bindAddress</name>
    <value>0.0.0.0</value>
  </property>

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>-1</value>
  </property>

  <property>
    <name>yarn.application.classpath</name>
    <value>%HADOOP_CONF_DIR%,%HADOOP_COMMON_HOME%/share/hadoop/common/*,%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*</value>
  </property>
</configuration>
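
A mistyped or unclosed tag in any of these files is an easy mistake to make when editing by hand. As a rough sanity check before starting the daemons, the files can be parsed with any XML parser. The Python sketch below is one way to do it, assuming the configuration/property/name/value layout shown above; the function name read_site_xml is hypothetical and this is not a Hadoop tool.

```python
import xml.etree.ElementTree as ET


def read_site_xml(text):
    """Parse a Hadoop-style configuration document and return a dict of
    property names to values. Raises xml.etree.ElementTree.ParseError on
    malformed XML and ValueError on an unexpected layout."""
    root = ET.fromstring(text)
    if root.tag != "configuration":
        raise ValueError("root element must be <configuration>")
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is None or value is None:
            raise ValueError("each <property> needs a <name> and a <value>")
        props[name.strip()] = value.strip()
    return props
```

Reading a file's contents and passing them to read_site_xml either returns the key/value pairs or fails with an error when a tag is not closed properly.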

3、3 Initialize Environment Variables
Run c:\deploy\etc\hadoop\hadoop-env.cmd to set up the environment variables that will be used by the startup scripts and the daemons.


3、4 Format the FileSystem
Format the filesystem with the following command:

%HADOOP_PREFIX%\bin\hdfs namenode -format

This command will print a number of filesystem parameters. Just look for the following two strings to ensure that the format command succeeded.

14/01/18 08:36:23 INFO namenode.FSImage: Saving image file \tmp\hadoop-username\dfs\name\current\fsimage.ckpt_0000000000000000000 using no compression
14/01/18 08:36:23 INFO namenode.FSImage: Image file \tmp\hadoop-username\dfs\name\current\fsimage.ckpt_0000000000000000000 of size 200 bytes saved in 0 seconds.
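
If the format step is scripted, the same check can be automated by searching the captured output for those two messages. This is a minimal Python sketch under the assumption that the output has been saved as text. The function name format_succeeded is hypothetical, and the matched substrings are taken from the two sample lines above, so they may differ in other Hadoop versions.

```python
def format_succeeded(log_text):
    """Return True if the captured output of `hdfs namenode -format`
    contains both FSImage messages shown in the sample output."""
    saw_saving = False
    saw_saved = False
    for line in log_text.splitlines():
        if "namenode.FSImage: Saving image file" in line:
            saw_saving = True
        if "namenode.FSImage: Image file" in line and "saved in" in line:
            saw_saved = True
    return saw_saving and saw_saved
```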

3、5 Start HDFS Daemons
Run the following command to start the NameNode and DataNode on localhost.

%HADOOP_PREFIX%\sbin\start-dfs.cmd

To verify that the HDFS daemons are running, try copying a file to HDFS.

C:\deploy>%HADOOP_PREFIX%\bin\hdfs dfs -put myfile.txt /

C:\deploy>%HADOOP_PREFIX%\bin\hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - username supergroup          4640 2014-01-18 08:40 /myfile.txt

C:\deploy>

3、6 Start YARN Daemons and run a YARN job
Finally, start the YARN daemons.

%HADOOP_PREFIX%\sbin\start-yarn.cmd

The cluster should be up and running! To verify, we can run a simple wordcount job on the text file we just copied to HDFS.

%HADOOP_PREFIX%\bin\yarn jar %HADOOP_PREFIX%\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.5.0.jar wordcount /myfile.txt /out

4、Conclusion

Caveats
The following features are yet to be implemented for Windows.

  • Hadoop Security
  • Short-circuit reads
