Nutch and Hadoop Tutorial Finished

Dennis Kubes
Mon, 20 Mar 2006 10:47:14 -0800

Here it is for the list, I will try to put it on the wiki as well.

Dennis

How to Setup Nutch and Hadoop 
--------------------------------------------------------------------------------
After searching the web and mailing lists, it seems that there is very
little information on how to setup Nutch using the Hadoop (formerly NDFS)
distributed file system.  The purpose of this tutorial is to provide a
step-by-step method to get Nutch running with the Hadoop file system on
multiple machines, including being able to both index and search across the
distributed filesystem (DFS).  This document does not go into the Nutch or
Hadoop architecture.  It only tells how to get the systems up and running.

Some things are assumed for this tutorial.  First, you will need root level
access to all of the boxes you are deploying to.  Second, all boxes will need
an SSH server running (not just a client), as Hadoop uses SSH to start slave
servers.  Third, this tutorial uses Whitebox Enterprise Linux 3 Respin 2
(WHEL).  For those of you who don't know Whitebox, it is a RedHat Enterprise
Linux clone.  You should be able to follow along on any Linux system, but
the systems I use are Whitebox.  Fourth, this tutorial uses Nutch 0.8 Dev
Revision 385702, and may not be compatible with future releases of either
Nutch or Hadoop.  Fifth, for this tutorial we are using a single name node
and six data nodes, one of which is on the same machine as the name node.  If
you are using a different number of machines you should still be fine, but
you should have at least one machine for a name/data node and one other
machine for just a data node.  Last, remember that this is a tutorial from my
personal experience setting up Nutch and Hadoop, and suggestions to help
improve it for others are welcome.


Downloading Nutch and Hadoop
--------------------------------------------------------------------------------
Unless you are working with a development version of Hadoop, which this
document doesn't cover, you only need to download Nutch.  The  Hadoop
filesystem is bundled with Nutch.

As of this writing, the only way I know of to get Nutch 0.8 Dev is
through Subversion.  I used the Eclipse plugin for Subversion, which can be
downloaded through the update manager using the url:

http://subclipse.tigris.org/update_1.0.x

If you are not using Eclipse you will need to get a Subversion client.  Once
you have a Subversion client you can visit the nutch Subversion webpage to
see all of the Subversion info:

http://lucene.apache.org/nutch/version_control.html

Or if you just want to download nutch the subversion url is:

http://svn.apache.org/repos/asf/lucene/nutch/

I checked out the main trunk into my Eclipse workspace, but it can be checked
out to a standard filesystem as well.  We are going to use ant to build it,
so if you have java and ant installed you should be fine.  I am not going to
go into how to install java or ant; if you are working with this level of
software you should know how to do it.


Building Nutch and Hadoop
--------------------------------------------------------------------------------
Once you have nutch downloaded go to the download directory where you should
see the following folders and files:

+ bin
+ conf
+ docs
+ lib
+ site
+ src
    build.properties (add this one)
    build.xml
    CHANGES.txt
    default.properties
    index.html
    LICENSE.txt
    README.txt

Add a build.properties file and inside of it add a variable called dist.dir
with its value as the location where you want to build nutch.  So if you are
building on a linux machine it would look something like this:

dist.dir=/path/to/build

You can name it anything you want but I recommend using a new empty folder
to build into.  Go ahead and create the folder you specified if it doesn't
exist.

To build nutch call the package ant task like this:

ant package

This should build nutch into your output folder.  When it is finished you
are ready to move on to deploying and configuring nutch.
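
The build preparation above can also be scripted.  What follows is only a
minimal sketch and not part of the Nutch distribution; the function name
prepare_build and its two arguments are made up for illustration.

```shell
# Hypothetical helper: write build.properties into a Nutch checkout and
# make sure the build output folder exists.
#   $1 = Nutch checkout directory (where build.xml lives)
#   $2 = folder to build into (becomes the dist.dir value)
prepare_build() {
  mkdir -p "$2" || return 1
  printf 'dist.dir=%s\n' "$2" > "$1/build.properties"
}

# Example usage (both paths are placeholders):
#   prepare_build /path/to/nutch /path/to/build
#   cd /path/to/nutch && ant package
```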


Setting Up The Deployment Architecture
--------------------------------------------------------------------------------
Once we get nutch deployed to all six machines we are going to call a script
that starts the name node and all of the data nodes.  This means that the
script is going to start the hadoop servers on the master node and then will
ssh into all of the slave nodes and start servers on the slave nodes.

Because of this the script is going to expect that nutch is installed in
exactly the same location on every machine.  It is also going to expect
that the local filesystem path that Hadoop uses to store data is the same
on every machine.

The way we did it was to create the following directory structure on every
machine.  The search directory is where nutch is installed.  The filesystem
directory is the root of the hadoop filesystem.  The home directory is the
nutch user's home directory.  On our master node we also installed a tomcat
5.5 server for searching.

/nutch
  /search
    (nutch installation goes here)
  /filesystem
  /home
    (nutch user's home directory)
  /tomcat    (only on one server for searching)

I am not going to go into detail about how to install tomcat as there are
plenty of good tutorials on how to do that, except to say that we removed
all of the wars from the webapps directory and created a folder called ROOT
under webapps into which we unzipped the nutch war file (nutch-0.8-dev.war).

So log into the master node and all of the slave nodes as root.  Create the
nutch user and the different filesystems with the following commands:

mkdir /nutch
mkdir /nutch/search
mkdir /nutch/filesystem
mkdir /nutch/home

useradd -d /nutch/home -g users nutch
chown -R nutch:users /nutch
passwd nutch (enter the nutch user's password when prompted)

The script that we are going to use to start the master and slave nodes is
going to need to be able to use a password-less login through ssh.  For this
we are going to have to set up ssh keys on each of the nodes, including the
ability for the master node to log in to itself.  We are also going to have
to change the ssh daemon (sshd) configuration to allow us to set up proper
environment variables on login.

First we are going to edit the ssh daemon.  The line that reads
#PermitUserEnvironment no should be changed to yes and the daemon restarted.
This will need to be done on all nodes.

vi /etc/ssh/sshd_config
PermitUserEnvironment yes

service sshd restart
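
If you would rather not edit the file by hand on every node, the change can
be sketched as a small shell function.  This is an illustration under my own
assumptions, not something shipped with nutch or ssh: the function name is
made up, and it keeps a backup copy next to the original.

```shell
# Hypothetical helper: force "PermitUserEnvironment yes" in an sshd_config
# file.  A backup of the original is kept as <file>.bak.
#   $1 = path to sshd_config
enable_user_environment() {
  cp "$1" "$1.bak" || return 1
  if grep -Eq '^[#[:space:]]*PermitUserEnvironment' "$1.bak"; then
    # Rewrite the existing (possibly commented-out) line.
    sed -E 's/^[#[:space:]]*PermitUserEnvironment.*/PermitUserEnvironment yes/' \
      "$1.bak" > "$1"
  else
    # No such line yet, so append one.
    echo 'PermitUserEnvironment yes' >> "$1"
  fi
}

# Example usage, as root on every node:
#   enable_user_environment /etc/ssh/sshd_config
#   service sshd restart
```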

Next we are going to create the keys on the master node and copy them over
to each of the slave nodes.  This must be done as the nutch user we created
earlier.  Don't just su in as the nutch user; start up a new shell and
log in as the nutch user.  If you su in, the passwordless login we are about
to set up will not work when you test it, but it will work when a new session
is started as the nutch user.

cd /nutch/home

ssh-keygen -t rsa (Use empty responses for each prompt)
  Enter passphrase (empty for no passphrase): 
  Enter same passphrase again: 
  Your identification has been saved in /nutch/home/.ssh/id_rsa.
  Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
  The key fingerprint is:
  a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc [EMAIL PROTECTED]

On the master node you will copy the public key you just created to a file
called authorized_keys in the same directory:

cd /nutch/home/.ssh
cp id_rsa.pub authorized_keys

You only have to run the ssh-keygen on the master node.  On each of the
slave nodes after the filesystem is created you will just need to copy the
keys over using scp.

scp /nutch/home/.ssh/authorized_keys \
  [EMAIL PROTECTED]:/nutch/home/.ssh/authorized_keys

You will have to enter the password for the nutch user the first time.  An
ssh prompt will appear the first time you log in to each computer, asking if
you want to add the computer to the known hosts.  Answer yes to the prompt.
Once the key is copied you shouldn't have to enter a password when logging
in as the nutch user.  Test it by logging into the slave nodes that you just
copied the keys to:

ssh slavenode
[EMAIL PROTECTED] (a command prompt should appear without requiring a password)
hostname (should return the name of the slave node)

Once we have the ssh daemon configured, the ssh keys created and copied to
all of the nodes we will need to create an environment file for ssh to use.
When nutch logs in to the slave nodes using ssh, the environment file
creates the environment variables for the shell.  The environment file is
created under the nutch home .ssh directory.  We will create the environment
file on the master node and copy it to all of the slave nodes.

vi /nutch/home/.ssh/environment

.. environment variables

Then copy it to all of the slave nodes using scp:

scp /nutch/home/.ssh/environment [EMAIL PROTECTED]:/nutch/home/.ssh/environment
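
With several slave nodes the scp steps get repetitive, for the authorized_keys
file as well as for the environment file.  A loop along the following lines
may help.  It is a sketch only: the function name, the hosts file, and the
DRY_RUN switch are my own inventions, and it assumes the remote login is the
nutch user and that /nutch/home/.ssh already exists on each slave.

```shell
# Hypothetical helper: copy one file to the same path on each host listed
# in a hosts file (one host name per line).  With DRY_RUN=1 the scp
# commands are printed instead of executed.
#   $1 = file to copy
#   $2 = hosts file
push_to_slaves() {
  while read -r host; do
    [ -n "$host" ] || continue            # skip blank lines
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "scp $1 nutch@$host:$1"
    else
      # Keep scp from swallowing the rest of the hosts file on stdin.
      scp "$1" "nutch@$host:$1" < /dev/null
    fi
  done < "$2"
}

# Example usage (slaves.txt is a placeholder file name):
#   DRY_RUN=1 push_to_slaves /nutch/home/.ssh/environment slaves.txt
```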

When all of this is complete we are ready to start deploying nutch to all of
the nodes.


Deploy Nutch to Single Machine
--------------------------------------------------------------------------------
First we will deploy nutch to a single node, the master node, but operate it
in distributed mode.  This means that it will use the Hadoop filesystem
instead of the local filesystem.  We will start with a single node to make
sure that everything is up and running and will then move on to adding the
other slave nodes.  All of the following should be done from a session
started as the nutch user.  We are going to set up nutch on the master node
and then, when we are ready, we will copy the entire installation to the
slave nodes.

First copy the files from the nutch build to the deploy directory using
something like the following command:

cp -R /path/to/build/* /nutch/search

Then make sure that all of the shell scripts are in unix format and are
executable.

dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop \
  /nutch/search/bin/nutch
chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop \
  /nutch/search/bin/nutch
dos2unix /nutch/search/conf/*.sh
chmod 700 /nutch/search/conf/*.sh

When we were first trying to set up nutch we were getting bad interpreter and
command not found errors because the scripts were in dos format on linux and
not executable.  Notice that we are doing both the bin and conf directories.
In the conf directory there is a file called hadoop-env.sh that is called
by other scripts.
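
To confirm that no script was missed, a quick check like the one below can
help.  It is a sketch with a made-up function name; it simply reports any
file that still contains a carriage return or lacks an execute bit.

```shell
# Hypothetical helper: report script files that still have DOS (CRLF)
# line endings or are not executable.
#   $@ = script files to check
check_scripts() {
  cr=$(printf '\r')                       # a literal carriage return
  for f in "$@"; do
    if grep -q "$cr" "$f"; then echo "dos-format: $f"; fi
    if [ ! -x "$f" ]; then echo "not-executable: $f"; fi
  done
}

# Example usage:
#   check_scripts /nutch/search/bin/*.sh /nutch/search/bin/hadoop \
#     /nutch/search/bin/nutch /nutch/search/conf/*.sh
```

No output means every file passed both checks.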

There are a few scripts that you will need to be aware of.  In the bin
directory there is the nutch script, the hadoop script, the start-all.sh
script and the stop-all.sh script.  The nutch script is used to do things
like start the nutch crawl.  The hadoop script allows you to interact with
the hadoop file system.  The start-all.sh script starts all of the servers
on the master and slave nodes.  The stop-all.sh script stops all of the
servers.

If you want to see options for nutch use the following command:

bin/nutch

Or if you want to see the options for hadoop use:

bin/hadoop

Hadoop has other components, such as the distributed filesystem, which have
their own options.  You can use commands such as the following to see
component options.

bin/hadoop dfs

There are also files that you need to be aware of.  In the conf directory
there are the nutch-default.xml, the nutch-site.xml, the hadoop-default.xml
and the hadoop-site.xml.  The nutch-default.xml file holds all of the
default options for nutch; the hadoop-default.xml file does the same for
hadoop.  To override any of these options, we copy the properties to their
respective *-site.xml files and change their values.  Below I will give you
an example hadoop-site.xml file and later a nutch-site.xml file.

There is also a file named slaves inside the conf directory.  This is
where we put the names of the slave nodes.  Since we are running a slave
data node on the same machine as the master node, we will also need the
local computer in this slave list.  Here is what the slaves file will look
like to start.

localhost

It comes this way to start so you shouldn't have to make any changes.  Later
we will add all of the nodes to this file, one node per line.  Below is an
example hadoop-site.xml file.

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>fs.default.name</name>
  <value>devcluster01:9000</value>
  <description>
    The name of the default file system. Either the literal string 
    "local" or a host:port for NDFS.
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>devcluster01:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If 
    "local", then jobs are run in-process as a single map and 
    reduce task.
  </description>
</property>

<property> 
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description> 
</property> 

<property> 
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description> 
</property> 

<property>
  <name>dfs.name.dir</name>
  <value>/nutch/filesystem/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/nutch/filesystem/data</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/nutch/filesystem/mapreduce/system</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/nutch/filesystem/mapreduce/local</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

</configuration>

The fs.default.name property is used by nutch to determine the filesystem
that it is going to use.  Since we are using the hadoop filesystem we have
to point this to the hadoop master or name node.  In this case it is
devcluster01:9000 which is the server that houses the name node on our
network.
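
When checking a node it can be handy to read a value back out of
hadoop-site.xml.  The function below is a rough sketch with a made-up name.
It is not a real XML parser and assumes one tag per line, as in the example
file above.

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop-style configuration file.  Assumes one tag per line.
#   $1 = configuration file
#   $2 = property name
get_property() {
  awk -v want="$2" '
    index($0, "<name>" want "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
      print; exit
    }
  ' "$1"
}

# Example usage:
#   get_property /nutch/search/conf/hadoop-site.xml fs.default.name
```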

The hadoop package really comes with two components.  One is the distributed
filesystem.  The other is the mapreduce functionality.  While the distributed
filesystem allows you to store and replicate files over many commodity
machines, the mapreduce package allows you to easily perform parallel
programming tasks.

The distributed file system has name nodes and data nodes.  When a client
wants to manipulate a file in the file system it contacts the name node
which then tells it which data node to contact to get the file.  The name
node is the coordinator and stores what blocks (not really files but you can
think of them as such for now) are on what computers and what needs to be
replicated to different data nodes.  The data nodes are just the workhorses.
They store the actual files, serve them up on request, etc.  So if you are
running a name node and a data node on the same computer it is still
communicating over sockets as if the data node was on a different computer.

I won't go into detail here about how mapreduce works; that is a topic for
another tutorial, and when I have learned it better myself I will write one.
Simply put, mapreduce breaks programming tasks into map operations (a ->
b,c,d) and reduce operations (list -> a).  Once a problem has been
broken down into map and reduce operations, multiple map operations and
multiple reduce operations can be distributed to run on different servers in
parallel.  So instead of handing off a file to a filesystem node, we are
handing off a processing operation to a node, which then processes it and
returns the result to the master node.  The coordination server for
mapreduce is called the mapreduce job tracker.  Each node that performs
processing has a daemon called a task tracker that runs and communicates
with the mapreduce job tracker.
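
To make the map and reduce idea concrete, here is a toy word count written as
two shell functions.  It runs on one machine and has nothing to do with the
Hadoop API; it only illustrates the shape of the two steps, and the function
names are made up.

```shell
# Map step: turn each input line into one "word<TAB>1" record per word.
map_words() {
  awk '{ for (i = 1; i <= NF; i++) print $i "\t" 1 }'
}

# Reduce step: sum the counts for each word, then sort by word.
reduce_counts() {
  awk -F '\t' '{ sum[$1] += $2 }
       END { for (w in sum) print w "\t" sum[w] }' | sort
}

# Example usage:
#   printf 'a b\na\n' | map_words | reduce_counts
```

In a real cluster many copies of the map step would run on different nodes,
each over its own slice of the input, before the reduce step combines them.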

The nodes for both the filesystem and mapreduce communicate with their
masters through a continuous heartbeat (like a ping) every 10 seconds or so.
If the heartbeat stops then the master assumes the node is down and doesn't
use it for future operations.

The mapred.job.tracker property specifies the master mapreduce tracker so I
guess it is possible to have the name node and the mapreduce tracker on
different computers.  That is something I have not done yet.  

The mapred.map.tasks and mapred.reduce.tasks properties tell how many tasks
you want to run in parallel.  This should be a multiple of the number of
computers that you have.  In our case since we are starting out with 1
computer we will have 2 map and 2 reduce tasks.  Later we will increase
these values as we add more nodes.

The dfs.name.dir property is the directory used by the name node to store
tracking and coordination information for the data nodes.

The dfs.data.dir property is the directory used by the data nodes to store
the actual filesystem data blocks.  Remember that this is expected to be the
same on every node.

The mapred.system.dir property is the directory that the mapreduce tracker
uses to store its data.  This is only on the tracker and not on the
mapreduce hosts.

The mapred.local.dir property is the directory on the nodes that mapreduce
uses to store its local data.  I have found that mapreduce uses a huge
amount of local space to perform its tasks (i.e. in the Gigabytes).  That
may just be how I have my servers configured though.  I have also found that
the intermediate files produced by mapreduce don't seem to get deleted when
the task exits.  Again that may be my configuration.  This property is also
expected to be the same on every node.

The dfs.replication property states how many servers a single file should be
replicated to before it becomes available.  Because we are using only a
single server for right now we have this at 1.  If you set this value higher
than the number of data nodes that you have available then you will start
seeing a lot of (Zero targets found, forbidden1.size=1) type errors in the
logs.  We will increase this value as we add more nodes.

Now that we have hadoop configured and our slaves file configured, it is
time to start up hadoop on a single node and test that it is working
properly.  To start up all of the hadoop servers on the local machine (name
node, data node, job tracker, task tracker) use the following command
as the nutch user:

cd /nutch/search
bin/start-all.sh

To stop all of the servers you would use the following command:

bin/stop-all.sh

If everything has been set up correctly you should see output saying that the
name node, data node, job tracker, and task tracker services have started.
If this happens then we are ready to test out the filesystem.  You can also
take a look at the log files under /nutch/search/logs to see output from the
different daemon services we just started.
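
A quick way to spot trouble in those logs is to list the files that mention
an exception or error.  This is a sketch with a made-up function name, and
the two search strings are only a guess at what a failure looks like in
your logs.

```shell
# Hypothetical helper: print the names of log files that contain the
# strings "Exception" or "ERROR".
#   $1 = log directory
scan_logs() {
  grep -l -E 'Exception|ERROR' "$1"/* 2>/dev/null
  return 0                                # no matches is not a failure
}

# Example usage:
#   scan_logs /nutch/search/logs
```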

To test the filesystem we are going to create a list of urls that we are
going to use later for the crawl.  Run the following commands:

cd /nutch/search
mkdir urls
vi urls/urllist.txt

http://lucene.apache.org

You should now have a urls/urllist.txt file with the one line pointing to
the apache lucene site.  Now we are going to add that directory to the
filesystem.  Later the nutch crawl will use this file as a list of urls to
crawl.  To add the urls directory to the filesystem run the following
command:

cd /nutch/search
bin/hadoop dfs -put urls urls

You should see output stating that the directory was added to the
filesystem. You can also confirm that the directory was added by using the
ls command:

cd /nutch/search
bin/hadoop dfs -ls

Something interesting to note about the distributed filesystem is that it is
user specific.  If you store a directory urls under the filesystem with the
nutch user, it is actually stored as /user/nutch/urls.  What this means to
us is that the user that does the crawl and stores it in the distributed
filesystem must also be the user that starts the search, or no results will
come back.  You can try this yourself by logging in with a different user
and running the ls command as shown.  It won't find the directories because
it is looking under a different directory, /user/username, instead of
/user/nutch.
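
The path rule just described can be written down as a tiny function.  This is
an illustration of the behaviour, not Hadoop code; the function name is made
up, and leaving paths that start with a slash untouched is my assumption.

```shell
# Illustration only: where a DFS path ends up for a given user.
#   $1 = user name
#   $2 = path as typed on the command line
dfs_abs_path() {
  case "$2" in
    /*) echo "$2" ;;                      # assumed: absolute paths are kept
    *)  echo "/user/$1/$2" ;;             # relative paths go under /user/<name>
  esac
}

# Example usage:
#   dfs_abs_path nutch urls
```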

If everything worked then you are good to add other nodes and start the
crawl.


Deploy Nutch to Multiple Machines
--------------------------------------------------------------------------------
Once you have the single node up and running we can copy the
configuration to the other slave nodes and set up those slave nodes to be
started by our start script.  First, if you still have the servers running on
the local node, stop them with the stop-all script.

To copy the configuration to the other machines run the following command.
If you have followed the configuration up to this point, things should go
smoothly:

cd /nutch/search
scp -r /nutch/search/* [EMAIL PROTECTED]:/nutch/search

Do this for every computer you want to use as a slave node.  Then edit the
slaves file, adding each slave node name to the file, one per line.  You
will also want to edit the hadoop-site.xml file and change the values for
the map and reduce task numbers, making this a multiple of the number of
machines you have.  For our system which has 6 data nodes I put in 32 as the
number of tasks.  The replication property can also be changed at this time.
A good starting value is something like 2 or 3.  Once this is done you
should be able to start up all of the nodes.
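
If you want a task number that really is a multiple of the node count, it can
be derived from the slaves file.  The function below is a sketch with a
made-up name, and the tasks-per-node factor is something you choose yourself.

```shell
# Hypothetical helper: count the non-blank lines in a slaves file and
# multiply by a per-node factor.
#   $1 = slaves file
#   $2 = tasks per node
task_count() {
  nodes=$(grep -c -v '^[[:space:]]*$' "$1")
  echo $((nodes * $2))
}

# Example usage:
#   task_count /nutch/search/conf/slaves 2
```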

To start all of the nodes we use the exact same command as before:

cd /nutch/search
bin/start-all.sh

The first time all of the nodes are started there may be an ssh dialog
asking to add the hosts to the known_hosts file.  You will have to type in
yes for each one and hit enter.  The output may be a little weird the first
time but just keep typing yes and hitting enter if the dialogs keep
appearing.  You should see output showing all the servers starting on the
local machine and the task tracker and data node servers starting on the
slave nodes.  Once this is complete we are ready to begin our crawl.


Performing a Nutch Crawl On a Single Site
--------------------------------------------------------------------------------
Now that we have the distributed file system up and running we can
perform our nutch crawl.  In this tutorial we are only going to crawl a
single site.  I am not as concerned with someone being able to learn the
crawling aspect of nutch as I am with being able to set up the distributed
filesystem and mapreduce.

To make sure we crawl only a single site we are going to edit the crawl
urlfilter file and set the filter to only pick up lucene.apache.org:

cd /nutch/search
vi conf/crawl-urlfilter.txt

change the line that reads +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to read +^http://([a-z0-9]*\.)*lucene.apache.org/
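
Before starting a crawl you can try the edited pattern against a few URLs.
The sketch below uses grep -E as a stand-in.  Nutch evaluates the filter as a
Java regular expression, but for a pattern this simple the two syntaxes
should agree.  The function name is made up.

```shell
# Illustration only: test a URL against the edited filter pattern (the
# leading "+" in the filter file means "accept" and is not part of the
# regular expression).
#   $1 = URL to test
url_allowed() {
  printf '%s\n' "$1" |
    grep -Eq '^http://([a-z0-9]*\.)*lucene.apache.org/'
}

# Example usage:
#   url_allowed http://lucene.apache.org/nutch/ && echo accepted
```

Note that the pattern ends in a slash, so in this test a URL typed without
any path, such as http://lucene.apache.org, does not match.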

We have already added our urls to the distributed filesystem and we have
edited our urlfilter so now it is time to begin the crawl.  To start the
nutch crawl use the following command:

cd /nutch/search
bin/nutch crawl urls -dir crawled -depth 1

We are using the nutch crawl command.  The urls is the urls directory that
we added to the distributed filesystem.  The -dir crawled is the output
directory.  This will also go to the distributed filesystem.  The depth is 1
meaning it will only get 1 page link deep.  There are other options you can
specify, see the command documentation for those options.

You should see the crawl start up and see output for jobs running and map and
reduce percentages.  You can keep track of the jobs by pointing your browser
to the master node:

http://devcluster01:50030

You can also start up new terminals on the slave machines and tail the log
files to see detailed output for each slave node.  The crawl will probably
take a while to complete.  When it is done we are ready to do the search.


Performing a Search with Hadoop and Nutch
--------------------------------------------------------------------------------
To perform a search on the index we just created within the distributed
filesystem we first need to setup and configure the nutch war file.  If you
setup the tomcat server as we stated earlier then you should have a tomcat
installation under /nutch/tomcat and in the webapps directory you should
have a folder called ROOT with the nutch war file unzipped inside of it.
Now we just need to configure the application to use the distributed
filesystem for searching.  We do this by editing the hadoop-site.xml file
under the WEB-INF/classes directory.  Use the following commands:

cd /nutch/tomcat/webapps/ROOT/WEB-INF/classes
vi hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>devcluster01:9000</value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>crawled</value>
  </property>

</configuration>

The fs.default.name property as before is pointed to our name node.

The searcher.dir directory is the directory that we specified in the
distributed filesystem under which the index was stored.  In our crawl
command earlier we used the crawled directory.

Once the hadoop-site.xml file is edited then the application should be ready
to go.  You can start tomcat with the following command:

cd /nutch/tomcat
bin/startup.sh

Then point your browser to http://namenode:8080 to see the nutch application.
If everything has been configured correctly then you should be able to enter
queries and see results.


Conclusion
--------------------------------------------------------------------------------
I know this has been a lengthy tutorial but hopefully it has gotten you
familiar with both nutch and hadoop.  Both Nutch and Hadoop are complicated
applications and setting them up as you have learned is not necessarily an
easy task.  I hope that this document has helped to make it easier for you.

As Nutch and Hadoop continue to evolve I will do my best to update this
tutorial.  If you have any comments or suggestions feel free to email them
to me at [EMAIL PROTECTED]  If you have questions about nutch or
hadoop they should be addressed to their respective mailing lists.
