hadoop(1) Quick Guide to Hadoop on Ubuntu

The Apache Hadoop software library scales up from single servers to thousands of machines, each offering local computation and storage.

Subprojects:
Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.

1. Single Node Setup
Quickly set up a single node and perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

On Windows 7 we would need to install Cygwin (http://www.cygwin.com/) first, and Windows is supported only as a development platform. I use my Ubuntu virtual machine instead.

Download the Hadoop release from http://mirror.cc.columbia.edu/pub/software/apache//hadoop/common/stable/. The file names are
hadoop-0.23.0-src.tar.gz  and hadoop-0.23.0.tar.gz.

I decided to build it from source.

Install ProtocolBuffer on Ubuntu.
Download the file from http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz:
>wget http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz
>tar zxvf protobuf-2.4.1.tar.gz
>cd protobuf-2.4.1
>sudo ./configure --prefix=/usr/local
>sudo make
>sudo make install
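
To confirm the installation, protoc --version should print something like libprotoc 2.4.1 (this check is not part of the original guide):
>protoc --version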

Install Hadoop Common
>svn checkout http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.23.0/
>cd release-0.23.0
>mvn package -Pdist,native,docs,src -DskipTests -Dtar

Error message:
org.apache.maven.reactor.MavenExecutionException: Failed to validate POM for project org.apache.hadoop:hadoop-project at /home/carl/download/release-0.23.0/hadoop-project/pom.xml
        at org.apache.maven.DefaultMaven.getProjects(DefaultMaven.java:404)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:272)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)

Solution:
Try installing Maven 3 instead:
>sudo apt-get remove maven2
>sudo apt-get autoremove maven2
>sudo apt-get install python-software-properties
>sudo add-apt-repository "deb http://build.discursive.com/apt/ lucid main"
>sudo apt-get update
>sudo apt-get install maven

Add this directory to the PATH entry in /etc/environment:
/usr/local/apache-maven-3.0.3/bin
>. /etc/environment
Maven works now.
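
For reference, the resulting PATH line in /etc/environment would look roughly like this (the other entries are just Ubuntu's defaults and may differ on your system):
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/apache-maven-3.0.3/bin"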

>mvn package -Pdist,native,docs,src -DskipTests -Dtar

It fails again, so I check the BUILDING.txt file, which lists these requirements:
* Unix System
* JDK 1.6
* Maven 3.0
* Forrest 0.8 (if generating docs)
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.4.1+ (for MapReduce)
* Autotools (if compiling native code)
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)
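
Before retrying the build, it may be worth confirming the toolchain versions that are actually on the PATH (a quick sanity check, not part of the original guide):
>java -version
>mvn -version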

Install Forrest on Ubuntu
http://forrest.apache.org/
>wget http://mirrors.200p-sf.sonic.net/apache//forrest/apache-forrest-0.9.tar.gz
>tar zxvf apache-forrest-0.9.tar.gz
>sudo mv apache-forrest-0.9 /usr/local/
>sudo vi /etc/environment
Add /usr/local/apache-forrest-0.9/bin to the PATH entry.
>. /etc/environment

Install Autotools on Ubuntu
>sudo apt-get install build-essential g++ automake autoconf gnu-standards autoconf-doc libtool gettext autoconf-archive

Build Hadoop again:
>mvn package -Pdist -DskipTests=true -Dtar

The build succeeds, and I get the file /home/carl/download/release-0.23.0/hadoop-dist/target/hadoop-0.23.0-SNAPSHOT.tar.gz.

Make sure ssh and rsync are installed on the system.
>sudo apt-get install ssh
>sudo apt-get install rsync

Unpack the hadoop distribution.
>tar zxvf hadoop-0.23.0-SNAPSHOT.tar.gz
>sudo mv hadoop-0.23.0-SNAPSHOT /usr/local/
>cd /usr/local/
>sudo mv hadoop-0.23.0-SNAPSHOT hadoop-0.23.0
>cd hadoop-0.23.0/conf/
>vi hadoop-env.sh
Modify the JAVA_HOME line to the following:
JAVA_HOME=/usr/lib/jvm/java-6-sun
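
If the Sun JDK is not at that location, list the installed JVMs and adjust the path accordingly (java-6-sun is an assumption; on many Ubuntu installs the directory is java-6-openjdk instead):
>ls /usr/lib/jvm/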

Check the hadoop command:
>bin/hadoop version
Hadoop 0.23.0-SNAPSHOT
Subversion http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.23.0/hadoop-common-project/hadoop-common -r 1196973
Compiled by carl on Wed Nov 30 02:32:31 EST 2011
From source with checksum 4e42b2d96c899a98a8ab8c7cc23f27ae

There are 3 modes:
Local (Standalone) Mode
Pseudo-Distributed Mode
Fully-Distributed Mode

Standalone Operation
>mkdir input
>cp conf/*.xml input
>vi input/1.xml
YARNtestforfun
>bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar grep input output 'YARN[a-zA-Z.]+'
>cat output/*
1       YARNtestforfun

Pseudo-Distributed Operation
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

Configuration
conf/core-site.xml:
<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
</configuration>

conf/hdfs-site.xml:
<configuration>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>

conf/mapred-site.xml:
<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
</configuration>

Setup passphraseless ssh
>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
>cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
>ssh localhost
Now I can connect to localhost over ssh without a password.
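
If ssh still prompts for a password, tightening the permissions on ~/.ssh is a common fix (this step is not from the original guide):
>chmod 700 ~/.ssh
>chmod 600 ~/.ssh/authorized_keys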

Execution
Format a new distributed filesystem:
>bin/hadoop namenode -format
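
After formatting, the namenode metadata should appear under hadoop.tmp.dir, which by default is /tmp/hadoop-<user> (this check and the default location are assumptions, not from the original guide):
>ls /tmp/hadoop-$USER/dfs/name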

Start the hadoop daemons:
>bin/start-all.sh

The logs go to ${HADOOP_HOME}/logs, for example /usr/local/hadoop-0.23.0/logs/yarn-carl-nodemanager-ubuntus.out. The error messages are as follows:
No HADOOP_CONF_DIR set.
Please specify it either in yarn-env.sh or in the environment.
Solution:
>sudo vi yarn-env.sh
>sudo vi /etc/environment
>sudo vi hadoop-env.sh
Add the following lines:
HADOOP_CONF_DIR=/usr/local/hadoop-0.23.0/conf
HADOOP_COMMON_HOME=/usr/local/hadoop-0.23.0/share/hadoop/common
HADOOP_HDFS_HOME=/usr/local/hadoop-0.23.0/share/hadoop/hdfs

>bin/start-all.sh
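
To verify that the daemons actually came up, jps (which ships with the JDK) lists the running Java processes; I would expect to see entries such as NameNode, DataNode, ResourceManager and NodeManager, though the exact set depends on the version and configuration:
>jps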

The web UIs are then available at:
http://192.168.56.101:9999/node
http://192.168.56.101:8088/cluster

Change the configuration files; comment out the other XML files in the conf directory.
>vi conf/yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Things work differently in the latest version, 0.23.0, so I need to make some changes following another guide.

References:
http://guoyunsky.iteye.com/category/186934
http://hadoop.apache.org/common/docs/r0.19.2/cn/quickstart.html
http://hadoop.apache.org/
http://guoyunsky.iteye.com/blog/1233707
http://hadoop.apache.org/common/
http://hadoop.apache.org/common/docs/stable/single_node_setup.html
http://www.blogjava.net/shenh062326/archive/2011/11/10/yuling_hadoop_0-23_compile.html
http://sillycat.iteye.com/blog/965534
http://www.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/SingleCluster.html
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html

