This section describes the installation of three software packages that, when installed, will provide you with a complete working cluster. These packages differ radically. openMosix provides Linux kernel extensions that transparently move processes among machines to balance loads and optimize performance. While a truly remarkable package, it is not what people typically think about when they hear the word "cluster." OSCAR and Rocks are collections of software packages that can be installed at once, providing a more traditional Beowulf-style cluster. Whichever way you decide to go, you will be up and running in short order.
Chapter 5. openMosix
openMosix is software that extends the Linux kernel so that processes can migrate transparently among the different machines within a cluster in order to more evenly distribute the workload. This chapter gives the basics of setting up and using an openMosix cluster. There is a lot more to openMosix than described here, but this should be enough to get you started and keep you running for a while unless you have some very special needs.
5.1 What Is openMosix?
Basically, the openMosix software includes both a set of kernel patches and support tools. The patches extend the kernel to provide support for moving processes among machines in the cluster. Typically, process migration is totally transparent to the user. However, by using the tools provided with openMosix, as well as third-party tools, you can control the migration of processes among machines.
Let's look at how openMosix might be used to speed up a set of computationally expensive tasks. Suppose, for example, you have a dozen files to compress using a CPU-intensive program on a machine that isn't part of an openMosix cluster. You could compress each file one at a time, waiting for one to finish before starting the next. Or you could run all the compressions simultaneously by starting each compression in a separate window or by running each compression in the background (ending each command line with an &). Of course, either way will take about the same amount of time and will load down your computer while the programs are running.
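For example, assuming the files are named *.dat and bzip2 is the compressor (any CPU-intensive compressor would do), starting all the compressions at once from the shell might look like this:

for f in *.dat
do
    bzip2 "$f" &     # compress each file in the background
done
wait                 # wait for all the compressions to finish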
However, if your computer is part of an openMosix cluster, here's what will happen: First, you will start all of the processes running on your computer. With an openMosix cluster, after a few seconds, processes will start to migrate from your heavily loaded computer to other idle or less loaded computers in the cluster. (As explained later, because some jobs may finish quickly, it can be counterproductive to migrate too quickly.) If you have a dozen idle machines in the cluster, each compression should run on a different machine. Your machine will have only one compression running on it (along with a little added overhead) so you still may be able to use it. And the dozen compressions will take only a little longer than it would normally take to do a single compression.
If you don't have a dozen computers, or some of your computers are slower than others, or some are otherwise loaded, openMosix will move the jobs around as best it can to balance the load. Once the cluster is set up, this is all done transparently by the system. Normally, you just start your jobs. openMosix does the rest. On the other hand, if you want to control the migration of jobs from one computer to the next, openMosix supplies you with the tools to do just that.
(Currently, openMosix also includes a distributed filesystem. However, this is slated for removal in future releases. The new goal is to integrate support for a clustering filesystem such as InterMezzo.)
5.2 How openMosix Works
openMosix originated as a fork from the earlier MOSIX (Multicomputer Operating System for Unix) project. The openMosix project began when the licensing structure for MOSIX moved away from a General Public License. Today, it has evolved into a project in its own right. The original MOSIX project is still quite active under the direction of Amnon Barak (http://www.mosix.org). openMosix is the work of Moshe Bar, originally a member of the MOSIX team, and a number of volunteers. This book focuses on openMosix, but MOSIX is a viable alternative that can be downloaded at no cost.
As noted in Chapter 1, one approach to sharing a computation between processors in a single-enclosure computer with multiple CPUs is symmetric multiprocessor (SMP) computing. openMosix has been described, accurately, as turning a cluster of computers into a virtual SMP machine, with each node providing a CPU. openMosix is potentially much cheaper and scales much better than SMPs, but communication overhead is higher. (openMosix will work with both single-processor systems and SMP systems.) openMosix is an example of what is sometimes called single system image (SSI) clustering, since each node in the cluster has a copy of a single operating system kernel.
The granularity for openMosix is the process. Individual programs, as in the compression example, may create the processes, or the processes may be the result of different forks from a single program. However, if a computationally intensive task does everything in a single process (even one that uses multiple threads), then, since there is only one process, the work can't be spread among processors. The best you can hope for is that the process will migrate to the fastest available machine in the cluster.
Not all processes migrate. For example, if a process only lasts a few seconds (very roughly, less than 5 seconds depending on a number of factors), it will not have time to migrate. Currently, openMosix does not work with multiple processes using shared writable memory, such as web servers.[1] Similarly, processes doing direct manipulation of I/O devices won't migrate. And processes using real-time scheduling won't migrate. If a process has already migrated to another processor and attempts to do any of these things, the process will migrate back to its unique home node (UHN), the node where the process was initially created, before continuing.
[1] Actually, the migration of shared memory (MigSHM) patch is an openMosix patch that implements shared memory migration. At the time this was written, it was not part of the main openMosix tree. (Visit http://mcaserta.com/maask/.)
To support process migration, openMosix divides processes into two parts or contexts. The user context contains the program code, stack, data, etc., and is the part that can migrate. The system context, which contains a description of the resources the process is attached to and the kernel stack, does not migrate but remains on the UHN.
openMosix uses an adaptive resource allocation policy. That is, each node monitors and compares its own load with the loads on a portion of the other computers within the cluster. When a computer finds a more lightly loaded computer (based on the overall capacity of the computer), it will attempt to migrate a process to the more lightly loaded computer, thereby creating a more balanced load between the two. As the loads on individual computers change, e.g., when jobs start or finish, processes will migrate among the computers to rebalance loads across the cluster, adapting dynamically to the changes in loads.
Individual nodes, acting as autonomous systems, decide which processes migrate. To compare loads, each node exchanges information with a small, randomly chosen subset of other nodes, and this random element helps the cluster scale well. Because communication is limited to subsets of the cluster, each node has limited but recent information about the state of the whole cluster. This approach reduces overhead and communication.
While load comparison and process migration are generally automatic within a cluster, openMosix provides tools to control migration. It is possible to alter the cluster's perception of how heavily an individual computer is loaded, to tie processes to a specific computer, or to block the migration of processes to a computer. However, precise control for the migration of a group of processes is not practical with openMosix at this time.[2]
[2] This issue is addressed by a patch that allows the creation of process groups, available at http://www.openmosixview.com/miggroup/.
The openMosix API uses values in flat files under /proc/hpc to record and control the state of the cluster. If you need information about the current configuration, want to do really low-level management, or want to write management scripts, you can read from or write to these files.
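For example, on a 2.4-era openMosix kernel the administrative files live under /proc/hpc/admin and per-node information under /proc/hpc/nodes. The exact file names vary between releases, so treat the following as an illustration and check your own /proc/hpc tree:

cat /proc/hpc/nodes/2/load       # read the load openMosix reports for node 2
echo 1 > /proc/hpc/admin/block   # refuse guest processes on this node (roughly what mosctl's blocking option does)

In most cases, though, the command-line tools described later in this chapter are a more convenient interface to the same files.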
5.3 Selecting an Installation Approach
Since openMosix is a kernel extension, it won't work with just any kernel. At this time, you are limited to a relatively recent (version 2.4.17 or later) IA32-compatible Linux kernel. An IA64 port is also available. However, don't expect openMosix to be available for a new kernel the same day the kernel is released; it takes time to develop the patches. Fortunately, your choice of Linux distributions is fairly broad. Among others, openMosix has been reported to work on Debian, Gentoo, Red Hat, and SuSE Linux. If you just want to play with it, you might consider Bootable Cluster CD (BCCD), Knoppix, or PlumpOS, three CD-bootable Linux distributions that include openMosix. You'll also need a reasonably fast network and a fair amount of swap space to run openMosix.
To build your openMosix cluster, you need to install an openMosix extended kernel on each of the nodes in the cluster. If you are using a suitable version of Linux and have no other special needs, you may be able to download a precompiled version of the kernel. This will significantly simplify setup. Otherwise, you'll need to obtain a clean copy of the kernel sources, apply the openMosix patches to the kernel source code, recompile the sources, and install the patched kernel. This isn't as difficult as it might sound, but it is certainly more involved than just installing a precompiled kernel. Recompiling the kernel is described in detail later in this chapter. We'll start with precompiled kernels.
While using a precompiled kernel is the easiest way to go, it has a few limitations. The documentation is a little weak with the precompiled kernels, so you won't know exactly what options have been compiled into the kernel without doing some digging. (However, the .config files are available via CVS and the options seem to be reasonable.) If you already have special needs that required recompiling your kernel, e.g., nonstandard hardware, don't expect those needs to go away.
You'll need to use the same version of the patched kernel on all your systems, so choose accordingly. This doesn't mean you must use the same kernel image. For example, you can use different compiles to support different hardware. But all your kernels should have the same version number.
The openMosix user tools should be downloaded when you download the openMosix kernel or kernel patches. You will also want to download and install openMosixView, a set of third-party tools for openMosix.
5.4 Installing a Precompiled Kernel
The basic steps for installing a precompiled kernel are selecting and downloading the appropriate files and packages, installing those packages, and making a few minor configuration changes.
5.4.1 Downloading
You'll find links to available packages at http://openmosix.sourceforge.net.[3] You'll need to select from among several versions and compilations. At the time this was written, there were half a dozen different kernel versions available. For each of these, there were eight possible downloads, including a README file, a kernel patch file, a source file that contains both a clean copy of the kernel and the patches, and five precompiled kernels for different processors. The precompiled versions are for an Intel 386 processor, an Intel 686 processor, an Athlon processor, Intel 686 SMP processors, or Athlon SMP processors. The Intel 386 is said to be the safest version. The Intel 686 version is for Intel Pentium II and later CPUs. With the exception of the text README file and a compressed (gz) set of patches, the files are in RPM format.
[3] And while you are at it, you should also download a copy of Kris Buytaert's openMosix HOWTO from http://www.tldp.org/HOWTO/openMosix-HOWTO/.
The example that follows uses the package openmosix-kernel-2.4.24-openmosix1.i686.rpm for a single-processor Pentium II system running Red Hat 9. Be sure you read the README file! While you are at it, you should also download the latest suitable version of the openMosix user tools from the same site. Again, you'll have a number of choices. You can download binaries in RPM or DEB format as well as the sources. For this example, the file openmosix-tools-0.3.5-1.i386.rpm was used.
Perhaps the easiest thing to do is to download everything at once and burn it to a CD so you'll have everything handy as you move from machine to machine. But you could use any of the techniques described in Chapter 8, or you could use the C3 tools described in Chapter 10. Whatever your preference, you'll need to get copies of these files on each machine in your cluster.
There is one last thing to do before you install: create an emergency boot disk if you don't have one. While it is unlikely that you'll run into any problems with openMosix, you are adding a new kernel.
Don't delete the old kernel. As long as you keep it and leave it in your boot configuration file, you should still be able to go back to it. If you do delete it, an emergency boot disk will be your only hope.
To create a boot disk, you use the mkbootdisk command as shown here:
[root@fanny root]# uname -r
2.4.20-6
[root@fanny root]# mkbootdisk \
> --device /dev/fd0 2.4.20-6
Insert a disk in /dev/fd0. Any information on the disk will be lost.
Press <Enter> to continue or ^C to abort:
(The last argument to mkbootdisk is the kernel version. If you can't remember this, use the command uname -r first to refresh your memory.)
5.4.2 Installing
Since we are working with RPM packages, installation is a breeze. Just change to the directory where you have the files and, as root, run rpm.
[root@fanny root]# rpm -vih openmosix-kernel-2.4.24-openmosix1.i686.rpm
Preparing...                ########################################### [100%]
   1:openmosix-kernel       ########################################### [100%]
[root@fanny root]# rpm -vih openmosix-tools-0.3.5-1.i386.rpm
Preparing...                ########################################### [100%]
   1:openmosix-tools        ########################################### [100%]
Edit /etc/openmosix.map if you don't want to use the autodiscovery daemon.
That's it! The kernel has been installed for you in the /boot directory.
This example uses the 2.4.24-om1 release. 2.4.24-om2 should be available by the time you read this. This newer release corrects several bugs and should be used.
You should also take care to use an openMosix tool set that is in sync with the kernel you are using, i.e., one that has been compiled with the same kernel header files. If you are compiling both, this shouldn't be a problem. Otherwise, you should consult the release notes for the tools.
5.4.3 Configuration Changes
While the installation will take care of the stuff that can be automated, there are a few changes you'll have to make manually to get openMosix running. These are very straightforward.
As currently installed, the next time you reboot your systems, your loader will give you the option of starting openMosix but it won't be your default kernel. To boot to the new openMosix kernel, you'll just need to select it from the menu. However, unless you set openMosix as the default kernel, you'll need to manually select it every time you reboot a system.
If you want openMosix as the default kernel, you'll need to reconfigure your boot loader. For example, if you are using grub, you'll need to edit /etc/grub.conf to select the openMosix kernel. The installation will have added openMosix to this file, but will not have set it as the default kernel. You should see two sets of entries in this file. (You'll see more if you already have additional kernels.) Change the variable default to select which kernel you want as the default; it is indexed from 0. If openMosix is the first entry in the file, change the line setting default so that it reads default=0.
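As an illustration only (the kernel and initrd file names below depend on the openMosix RPM you installed, so check /boot for the real names), a grub.conf might end up looking something like this with the openMosix kernel as the default:

default=0
timeout=10

title openMosix (2.4.24-openmosix1)
        root (hd0,0)
        kernel /vmlinuz-2.4.24-openmosix1 ro root=LABEL=/
        initrd /initrd-2.4.24-openmosix1.img
title Red Hat Linux (2.4.20-6)
        root (hd0,0)
        kernel /vmlinuz-2.4.20-6 ro root=LABEL=/
        initrd /initrd-2.4.20-6.img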
If you are using LILO, the procedure is pretty much the same except that you will need to manually create the entry in the configuration file and rerun the loader. Edit the file /etc/lilo.conf. You can use a current entry as a template. Just copy the entry, edit it to use the new kernel, and give it a new label. Change default so that it matches your new label, e.g., default=openMosix. Save the file and run the command /sbin/lilo -v.
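Again as a sketch with hypothetical file names and a hypothetical root partition, the new lilo.conf entry might look like this:

default=openMosix

image=/boot/vmlinuz-2.4.24-openmosix1
        label=openMosix
        initrd=/boot/initrd-2.4.24-openmosix1.img
        root=/dev/hda2
        read-only

Remember that none of this takes effect until you rerun /sbin/lilo.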
Another issue is whether your firewall will block openMosix traffic. The openMosix FAQ reports that openMosix uses UDP ports in the 5000-5700 range, UDP port 5428, and TCP ports 723 and 4660. (You can easily confirm this by monitoring network traffic, if in doubt.) You will also need to allow any other related traffic such as NFS or SSH traffic. Address this before you proceed with the configuration of openMosix.
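If you are using iptables and the cluster sits on its own interface (eth1 here is an assumption; substitute your own), rules along the following lines would admit the openMosix traffic listed above:

iptables -A INPUT -i eth1 -p udp --dport 5000:5700 -j ACCEPT
iptables -A INPUT -i eth1 -p udp --dport 5428 -j ACCEPT
iptables -A INPUT -i eth1 -p tcp --dport 723 -j ACCEPT
iptables -A INPUT -i eth1 -p tcp --dport 4660 -j ACCEPT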
In general, security has not been a driving issue with the development of openMosix. Consequently, it is probably best to use openMosix in a restrictive environment. You should either locate your firewall between your openMosix cluster and all external networks, or you should completely eliminate the external connection.
openMosix needs to know about the other machines in your cluster. You can either use the autodiscovery tool omdiscd to dynamically create a map, or you can create a static map by editing the file /etc/openmosix.map (or /etc/mosix.map or /etc/hpc.map on earlier versions of openMosix). omdiscd can be run as a foreground command or as a daemon in the background. Routing must be correctly configured for omdiscd to run correctly. For small, static clusters, it is probably easier to edit /etc/openmosix.map once and be done with it.
For a simple cluster, this file can be very short. Its simplest form has one entry for each machine. In this format, each entry consists of three fields: a unique node number (starting at 1) for each machine, the machine's IP address, and a 1 indicating that it is a single machine. It is also possible to have a single entry for a range of machines that have contiguous IP addresses. In that case, the first two fields are the same: the node number for the first machine and the IP address of the first machine. The third field is the number of machines in the range. The address can be an IP number or a hostname from your /etc/hosts file. For example, consider the following entry:
1 fanny.wofford.int 5
This says that fanny.wofford.int is the first of five nodes in a cluster. Since fanny's IP address is 10.0.32.144, the cluster consists of the following five machines: 10.0.32.144, 10.0.32.145, 10.0.32.146, 10.0.32.147, and 10.0.32.148. Their node numbers are 1 through 5. You could use separate entries for each machine. For example,
1    fanny.wofford.int     1
2    george.wofford.int    1
3    hector.wofford.int    1
4    ida.wofford.int       1
5    james.wofford.int     1
or, equivalently
1    10.0.32.144    1
2    10.0.32.145    1
3    10.0.32.146    1
4    10.0.32.147    1
5    10.0.32.148    1
Again, you can use the first of these two formats only if you have entries for each machine in /etc/hosts. If you have multiple blocks of noncontiguous machines, you will need an entry for each contiguous block. If you use host names, be sure you have an entry in your host table for your node that has its actual IP address, not just the local host address. That is, you need lines that look like
127.0.0.1      localhost
172.16.1.1     amy
not
127.0.0.1 localhost amy
You can list the map that openMosix is using with the showmap command. (This is nice to know if you are using autodiscovery.)
[root@fanny etc]# showmap
My Node-Id: 0x0001

Base Node-Id Address          Count
------------ ---------------- -----
0x0001       10.0.32.144      1
0x0002       10.0.32.145      1
0x0003       10.0.32.146      1
0x0004       10.0.32.147      1
0x0005       10.0.32.148      1
Keep in mind that the format depends on the map file format. If you use the range format for your map file, you will see something like this instead:
[root@fanny etc]# showmap
My Node-Id: 0x0001

Base Node-Id Address          Count
------------ ---------------- -----
0x0001       10.0.32.144      5
While the difference is insignificant, it can be confusing if you aren't expecting it.
There is also a configuration file /etc/openmosix/openmosix.config. If you are using autodiscovery, you can edit this to start the discovery daemon whenever openMosix is started. This file is heavily commented, so it should be clear what you might need to change, if anything. It can be ignored for most small clusters using a map file.
Of course, you will need to duplicate this configuration on each node on your cluster. You'll also need to reboot each machine so that the openMosix kernel is loaded. As root, you can turn openMosix on or off as needed. When you install the user tools package, a script called openmosix is copied to /etc/init.d so that openMosix will be started automatically. (If you are manually compiling the tools, you'll need to copy this script over.) The script takes the arguments start, stop, status, restart, and reload, as you might have guessed. For example,
[root@james root]# /etc/init.d/openmosix status
This is OpenMosix node #5
Network protocol: 2 (AF_INET)
OpenMosix range     1-5     begins at fanny.wofford.int
Total configured: 5
Use this script to control openMosix as needed. You can also use the setpe command, briefly described later in this chapter, to control openMosix.
Congratulations, you are up and running.
5.5 Using openMosix
At its simplest, openMosix is transparent to the user. You can sit back and reap the benefits. But at times, you'll want more control. At the very least, you may want to verify that it is really running properly. (You could just time applications with computers turned on and off, but you'll probably want to be a little more sophisticated than that.) Fortunately, openMosix provides some tools that allow you to monitor and control various jobs. If you don't like the tools that come with openMosix, you can always install other tools such as openMosixView.
5.5.1 User Tools
You should install the openMosix user tools before you start running openMosix. This package includes several useful management tools (migrate, mosctl, mosmon, mosrun, and setpe), openMosix-aware versions of ps and top called, suitably, mps and mtop, and a startup script /etc/init.d/openmosix. (This is actually a link to the file /etc/rc.d/init.d/openmosix.)
5.5.1.1 mps and mtop
Both mps and mtop will look a lot like their counterparts, ps and top. The major difference is that each has an additional column that gives the node number on which a process is running. Here is part of the output from mps:
[root@fanny sloanjd]# mps
  PID TTY     NODE STAT TIME COMMAND
...
19766 ?          0 R    2:32 ./loop
19767 ?          2 S    1:45 ./loop
19768 ?          5 S    3:09 ./loop
19769 ?          4 S    2:58 ./loop
19770 ?          2 S    1:47 ./loop
19771 ?          3 S    2:59 ./loop
19772 ?          6 S    1:43 ./loop
19773 ?          0 R    1:59 ./loop
...
As you can see from the third column, process 19769 is running on node 4. It is important to note that mps must be run on the machine where the process originated. You will not see the process if you run ps, mps, top, or mtop on any of the other machines in the cluster even if the process has migrated to that machine. (Arguably, in this respect, openMosix is perhaps a little too transparent. Fortunately, a couple of the other tools help.)
5.5.1.2 migrate
The tool migrate explicitly moves a process from one node to another. Since there are circumstances under which some processes can't migrate, the system may be forced to ignore this command. You'll need the PID and the node number of the destination machine. Here is an example:
[sloanjd@fanny sloanjd]$ migrate 19769 5
This command will move process 19769 to node number 5. (You can use home in place of the node number to send a process back to the CPU where it was started.) It might be tempting to think you are reducing the load on node number 4, the node where the process was running, but in a balanced system with no other action, another process will likely migrate to node 4.
5.5.1.3 mosctl
With mosctl, you have greater control over how processes are run on individual machines. For example, you can block the arrival of guest processes to lighten the load on a machine. You can use mosctl with the setspeed option to override a node's idea of its own speed. This can be used to attract or discourage process migration to the machine. mosctl can also be used to display utilization or tune openMosix performance parameters. There are too many arguments to go into here, but they are described in the manpage.
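A few representative invocations follow. The option names are taken from the standard user tools, but double-check them against the manpage for your version:

mosctl block            # refuse migrating guest processes on this node
mosctl noblock          # accept guest processes again
mosctl setspeed 15000   # override this node's notion of its own speed
mosctl whois 4          # translate node number 4 into its address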
5.5.1.4 mosmon
While mps won't tell you if a process has migrated to your machine, the mosmon utility gives you a good idea of what is going on across the cluster. mosmon is an ncurses-based utility that displays a simple bar graph showing the loads on the nodes in your cluster. Figure 5-1 shows mosmon in action.
Figure 5-1. mosmon
In this example, eight identical processes are running on a six-node cluster. Obviously, the second and sixth nodes have two processes each while the remaining four machines are each running a single process. Of course, other processes could be mixed into this, affecting an individual machine's load. You can change the view to display memory, speed, and utilization as well as change the layout of the graph. Press h while the program is running to display the various options. Press q to quit the program.
Incidentally, mosmon goes by several different names, including mon and, less commonly, mmon. The original name was mon, and it is often referred to by that name in openMosix documentation. The shift to mosmon was made to eliminate a naming conflict with the network-monitoring tool mon. The local name is actually set by a compile-time variable.
5.5.1.5 mosrun
The mosrun command can also be used to advise the system to run a specific program on a specified node. You'll need the program name and the destination node number (or use -h for the home node). Actually, mosrun is one of a family of commands used to control node allocation preferences. These are listed and described on the manpage for mosrun.
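For example, the following keeps a job on its home node; the option for naming a specific destination node varies between versions, so consult the mosrun manpage rather than relying on this sketch:

mosrun -h ./loop &    # start ./loop, advising openMosix to keep it on the home node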
5.5.1.6 setpe
The setpe command can be used to manually configure a node. (In practice, setpe is usually called from the script /etc/init.d/openmosix rather than used directly.) As root, you can use setpe to start or stop openMosix. For example, you could start openMosix with a specific configuration file with a command like
[root@ida sloanjd]# /sbin/setpe -w -f /etc/openmosix.map
setpe takes several options including -r to read the configuration file, -c to check the map's consistency, and -off to shut down openMosix. Consult the manpage for more information.
5.5.2 openMosixView
openMosixView extends the basic functionality of the user tools while providing a spiffy X-based GUI. However, the basic user tools must be installed for openMosixView to work. openMosixView is actually seven applications that can be invoked from the main administration application.
If you want to install openMosixView, which is strongly recommended, download the package from http://www.openmosixview.com. Look over the documentation for any dependencies that might apply. Depending on what you have already installed on your system, you may need to install additional packages. For example, GLUT is one of more than two dozen dependencies. Fortunately (or annoyingly), rpm will point out what needs to be added.
Then, as root, install the appropriate packages.
[root@fanny root]# rpm -vih glut-3.7-12.i386.rpm
warning: glut-3.7-12.i386.rpm: V3 DSA signature: NOKEY, key ID db42a60e
Preparing...                ########################################### [100%]
   1:glut                   ########################################### [100%]
[root@fanny root]# rpm -vih openmosixview-1.5-redhat90.i386.rpm
Preparing...                ########################################### [100%]
   1:openmosixview          ########################################### [100%]
As with the kernel, you'll want to repeat this on every node. This installation will install documentation in /usr/local.
Once installed, you are basically ready to run. However, by default, openMosixView uses RSH. It is strongly recommended that you change this to SSH. Make sure you have SSH set up on your system. (See Chapter 4 for more information on SSH.) Then, from the main application, select the Config menu.
The main application window is shown in Figure 5-2. You get this by running the command openmosixview in an X Window environment.
Figure 5-2. openMosixView
This view displays information for each of the five nodes in this cluster. The first column displays the node's status by node number. The background color is green if the node is available or red if it is unavailable. The second column, buttons with IP numbers, allows you to configure individual systems. If you click on one of these buttons, a pop-up window will appear for that node, as shown in Figure 5-3. You'll notice that the configuration options are very similar to those provided by the mosctl command.
Figure 5-3. openMosix configuration window
As you can see from the figure, you can control process migration, etc., with this window. The third column in Figure 5-2, the sliders, controls the node efficiencies used by openMosix when load balancing. By changing these, you alter openMosix's idea of the relative efficiencies of the nodes in the cluster. This in turn influences how jobs migrate. Note that the slider settings do not change the efficiency of the node, just openMosix's perception of the node's capabilities. The remaining columns provide general information about the nodes. These should be self-explanatory.
The buttons along the top provide access to additional applications. For example, the third button, which looks like a gear, launches the process viewer openMosixprocs. This is shown in Figure 5-4.
Figure 5-4. openMosixprocs
openMosixprocs allows you to view and manage individual processes started on the node from which openMosixprocs is run. (Since it won't show you processes migrated from other systems, you'll need openMosixprocs on each node.) You can select a user in the first entry field at the top of the window and click on refresh to focus in on a single user's processes. By double-clicking on an individual process, you can call up the openMosixprocs-Migrator, which will provide additional statistics and allow some control of a process.
openMosixView provides a number of additional tools that aren't described here. These include a 3D process viewer (3dmosmon), a data collection daemon (openMosixcollector), an analyzer (openMosixanalyzer), an application for viewing process history (openMosixHistory), and a migration monitor and controller (openMosixmigmon) that supports drag-and-drop control on process migration.
5.5.3 Testing openMosix
It is unlikely that you will have any serious problems setting up openMosix. But you may want to confirm that it is working. You could just start a few processes and time them with openMosix turned on and off. Here is a simple C program that can be used to generate some activity.
#include <stdio.h>

int foo(int, int);

int main( void )
{
    int i, j;

    for (i=1; i<100000; i++)
        for (j=1; j<100000; j++)
            foo(i, j);
    return 0;
}

int foo(int x, int y)
{
    return(x+y);
}
This program does nothing useful, but it will take several minutes to complete on most machines. (You can adjust the loop count if it doesn't run long enough to suit you.) By compiling this (without optimizations) and then starting several copies running in the background, you'll have a number of processes you can watch.
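Assuming the program above is saved as loop.c, a test run might look like the following; the file name and the eight copies are arbitrary choices:

gcc -o loop loop.c        # no -O flag, so the empty loops aren't optimized away
for i in 1 2 3 4 5 6 7 8
do
    ./loop &
done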
While timing will confirm that you are actually getting a speedup, you'll get a better idea of what is going on if you run mosmon. With mosmon, you can watch process migration and load balancing as it happens.
If you are running a firewall on your machines, the most likely problem you will have is getting connection privileges correct. You may want to start by disconnecting your cluster from the Internet and disabling the firewall. This will allow you to confirm that openMosix is correctly installed and that the firewall is the problem. You can use the command netstat -a to identify which connections you are using. This should give you some guidance in reconfiguring your firewall.
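For example, while jobs are migrating, something like the following (using the ports mentioned earlier in this chapter) will show whether openMosix connections are being made:

netstat -an | egrep ':(723|4660|5428)'

Seeing traffic on these ports also tells you exactly what your firewall needs to pass on the cluster interface.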
Finally, an openMosix stress test is available for the truly adventurous. It can be downloaded from http://www.openmosixview.com/omtest/. This web page also describes the test (actually a test suite) and has a link to a sample report. You can download sources or an RPM. You'll need to install expect before installing the stress test. To run the test, you should first change to the /usr/local/omtest directory and then run the script ./openmosix_stress_test.sh. A report is saved in the /tmp directory.
The test takes a while to run and produces a very long report. For example, it took over an hour and a half on an otherwise idle five-node cluster of Pentium II's and produced an 18,224-line report. While most users will find this a bit of overkill for their needs, it is nice to know it is available. Interpretation of the results is beyond the scope of this book.
5.6 Recompiling the Kernel
First, ask yourself why you would want to recompile the kernel. There are several valid reasons. If you normally have to recompile your kernel, perhaps because you use less-common hardware or need some special compile option, then you'll definitely need to recompile for openMosix. Or maybe you just like tinkering with things. If you have a reason, go for it. Even if you have never done it before, it is not that difficult, but the precompiled kernels do work well. For most readers, recompiling the kernel is optional, not mandatory. (If you are not interested in recompiling the kernel, you can skip the rest of this section.)
Before you start, do you have a recovery disk? Are you sure you can boot from it? If not, go make one right now before you begin.
Let's begin by going over the basic steps of a fairly generic recompilation, and then we'll go through an example. First, you'll need to decide which version of the kernel you want to use. Check to see what is available. (You can use the uname -r command to see what you are currently using, but you don't have to feel bound by that.)
You are going to need both a set of patches and a clean set of kernel source files. Accepted wisdom says that you shouldn't use the source files that come with any specific Linux releases because, as a result of customizations, the patches will not apply properly. As noted earlier in this chapter, you can download the kernel sources and patches from http://openmosix.sourceforge.net or you can just download the patches. If you have downloaded just the patches, you can go to http://www.kernel.org to get the sources. You'll end up with the same source files either way.
If you download the source file from the openMosix web site, you'll have an RPM package to install. When you install this, it will place compressed copies of the patches and the source tree (in gzip or bzip2 format) as well as several sample kernel configuration files in the directory /usr/src/redhat/SOURCES. The next step is to unpack the sources and apply the patches.
Using gunzip or bzip2 and then tar, unpack the files in the appropriate directory. Where you put things is largely up to you, but it is a good idea to try to be consistent with the default layout of your system. Move the patch files into the root directory of your source tree. Once you have all the files in place, you can use the patch command to patch the kernel sources.
The next step is to create the appropriate configuration file. In theory, there are four ways you can do this. You could directly edit the default configuration file, typically /usr/src/linux/.config, or you can run one of the commands make config, make menuconfig, or make xconfig. In practice, you should limit yourself to the last two choices. Direct editing of the configuration file for anything other than minor changes is for fools, experts, or foolish experts. And while config is the most universal approach, it is also the most unforgiving and should be used only as a last resort. It streams the configuration decisions past you and there is no going back once you have made a decision. The remaining choices are menuconfig, which requires the ncurses library, and xconfig, which requires X windows and TCL/TK libraries. Both work nicely. Figure 5-5 shows the basic layout with menuconfig.
Figure 5-5. Main menuconfig menu
Configuration parameters are arranged in groups by functionality. The first group is for openMosix. You can easily move through this menu and select the appropriate actions. You will be given a submenu for each group. Figure 5-6 shows the openMosix submenu.
Figure 5-6. openMosix system submenu
xconfig is very similar but has a fancy GUI.
Because there are so many decisions, this is the part of the process where you are most apt to make a mistake. This isn't meant to discourage you, but don't be surprised if you have to go through this process several times. For the most part, the defaults are reasonable. Be sure you select the right processor type and all appropriate file systems. (Look at /etc/fstab, run the mount command, or examine /proc/filesystems to get an idea of what file systems you are currently using.) If you downloaded the sources from the openMosix web page, you have several sample configuration files. You can copy one of these over and use it as your starting point. This will give you some reasonable defaults. You can also get a description of various options (including openMosix options!) by looking in the Documentation/Configure.help file in your source tree. As a general rule of thumb, if you don't need something, don't include it.
Once you have the configuration file, you are ready to build the image. You'll use the commands make dep, make clean, make bzImage, make modules, and make modules_install. (You'll need modules enabled, since openMosix uses them.) If all goes well, you'll be left with a file bzImage in the directory arch/i386/boot/ under your source tree.
The next to last step is to install the kernel, i.e., arrange for the system to boot from this new kernel. You'll probably want to move it to the /boot directory and rename it. Since you are likely to make several kernels once you get started, be sure to use a meaningful name. You may need to create a ram-disk. You also need to configure your boot loader to find the file as described earlier in this chapter. When copying over the new kernel, don't delete the original kernel!
Now you are ready to reboot and test your new kernel. Pay close attention to the system messages when you reboot. This will be your first indication of any configuration errors you may have made. You'll need to go back to the configuration step to address these.
Of course, this is just the kernel you've installed. You'll still need to go back and install the user tools and configure openMosix for your system. But even if you are compiling the kernel, there is no reason you can't use the package to install the user tools.
Here is an example using Red Hat 9. Although Red Hat 9 comes with the 2.4.20 version of the kernel, this example uses a later version of the kernel, openmosix-kernel-2.4.24-openmosix1.src.rpm. The first step is installing this package.
[root@fanny root]# rpm -vih openmosix-kernel-2.4.24-openmosix1.src.rpm
   1:openmosix-kernel       ########################################### [100%]
[root@fanny root]# cd /usr/src/redhat/SOURCES
[root@fanny SOURCES]# ls
kernel-2.4.20-athlon.config      kernel-2.4.24-athlon-smp.config
kernel-2.4.20-athlon-smp.config  kernel-2.4.24-i386.config
kernel-2.4.20-i386.config        kernel-2.4.24-i686.config
kernel-2.4.20-i686.config        kernel-2.4.24-i686-smp.config
kernel-2.4.20-i686-smp.config    linux-2.4.24.tar.bz2
kernel-2.4.24-athlon.config      openMosix-2.4.24-1.bz2
As you can see, the package includes the source files, patches, and sample configuration files.
Next, unpack the files. (With some versions, you may need to use gunzip instead of bunzip2.)
[root@fanny SOURCES]# bunzip2 linux-2.4.24.tar.bz2
[root@fanny SOURCES]# bunzip2 openMosix-2.4.24-1.bz2
[root@fanny SOURCES]# mv linux-2.4.24.tar /usr/src
[root@fanny SOURCES]# cd /usr/src
[root@fanny src]# tar -xvf linux-2.4.24.tar
...
The last command creates the directory linux-2.4.24 under /usr/src. If you are working with different versions of the kernel, you probably want to give this directory a more meaningful name.
The next step is to copy over the patch file and, if you desire, one of the sample configuration files. Then, you can apply the patches.
[root@fanny src]# cd /usr/src/redhat/SOURCES
[root@fanny SOURCES]# cp openMosix-2.4.24-1 /usr/src/linux-2.4.24/
[root@fanny SOURCES]# cp kernel-2.4.24-i686.config \
> /usr/src/linux-2.4.24/.config
[root@fanny SOURCES]# cd /usr/src/linux-2.4.24
[root@fanny linux-2.4.24]# cat openMosix-2.4.24-1 | patch -Np1
...
You should see a list of the patched files stream by as the last command runs.
Next, you'll need to create or edit a configuration file. This example uses the supplied configuration file that was copied over as a starting point.
[root@fanny linux-2.4.24]# make menuconfig
Make whatever changes you need and then save your new configuration.
Once configured, it is time to make the kernel.
[root@fanny linux-2.4.24]# make dep
...
[root@fanny linux-2.4.24]# make clean
...
[root@fanny linux-2.4.24]# make bzImage
...
[root@fanny linux-2.4.24]# make modules
...
[root@fanny linux-2.4.24]# make modules_install
...
These commands can take a while and produce a lot of output, which has been omitted here.
The worst is over now. You need to copy your kernel to /boot, create a ram-disk, and configure your boot loader.
[root@fanny linux-2.4.24]# cd /usr/src/linux-2.4.24/arch/i386/boot/
[root@fanny boot]# cp bzImage /boot/vmlinuz-8jul04
If you haven't changed kernels, you may be able to use the existing ram-disk. Otherwise, use the mkinitrd script to create a new one.
[root@fanny boot]# cd /boot
[root@fanny boot]# mkinitrd /boot/initrd-2.4.24.img 2.4.24-om
The first argument is the name for the ram-disk and the second argument is the appropriate module directory under /lib/modules. See the manpage for details.
The last step is to change the boot loader. This system uses grub, so the file /etc/grub.conf needs to be edited. You might add something like the following:
title My New openMosix Kernel
        root (hd0,0)
        kernel /vmlinuz-8jul04 ro root=LABEL=/
        initrd /initrd-2.4.24.img
When the system reboots, the boot menu now has My New openMosix Kernel as an entry. Select that entry to boot to the new kernel.
While these steps should be adequate for most readers, it is important to note that, depending on your hardware, additional steps may be required. Fortunately, a lot has been written on the general process of recompiling Linux kernels. See Appendix A for pointers to more information.
Chapter 6. OSCAR

Setting up a cluster can involve the installation and configuration of a lot of software as well as reconfiguration of the system and previously installed software. OSCAR (Open Source Cluster Application Resources) is a software package that is designed to simplify cluster installation. A collection of open source cluster software, OSCAR includes everything that you are likely to need for a dedicated, high-performance cluster. OSCAR takes you completely through the installation of your cluster. If you download, install, and run OSCAR, you will have a fully functioning cluster when you are done.
This chapter begins with an overview of why you might use OSCAR, followed by a description of what is included in OSCAR. Next, the discussion turns to the installation and configuration of OSCAR. This includes a description of how to customize OSCAR and the changes OSCAR makes to your system. Finally, there are three brief sections, one on cluster security, one on switcher, and another on using OSCAR with LAM/MPI.
Because OSCAR is an extensive collection of software, it is beyond the scope of this book to cover every package in detail. Most of the software in OSCAR is available as standalone versions, and many of the key packages included by OSCAR are described in later chapters in this book. Consequently, this chapter focuses on setting up OSCAR and on software unique to OSCAR. By the time you have finished this chapter, you should be able to judge whether OSCAR is appropriate for your needs and know how to get started.
6.1 Why OSCAR?

The design goals for OSCAR include using best-of-class software, eliminating the downloading, installation, and configuration of individual components, and moving toward the standardization of clusters. OSCAR, it is said, reduces the need for expertise in setting up a cluster. In practice, it might be more fitting to say that OSCAR delays the need for expertise and allows you to create a fully functional cluster before mastering all the skills you will eventually need. In the long run, you will want to master those packages in OSCAR that you come to rely on. OSCAR makes it very easy to experiment with packages and dramatically lowers the barrier to getting started.
OSCAR was created and is maintained by the Open Cluster Group (http://www.openclustergroup.org), an informal group dedicated to simplifying the installation and use of clusters and broadening their use. Over the years, a number of organizations and companies have supported the Open Cluster Group, including Dell, IBM, Intel, NCSA, and ORNL, to mention only a few.
OSCAR is designed with high-performance computing in mind. Basically, it is designed to be used with an asymmetric cluster (see Chapter 1). Unless you customize the installation, the compute nodes are meant to be dedicated to the cluster. Typically, you do not log directly onto the client nodes but rather work from the head node. (Although OSCAR sets up SSH so that you can log onto clients without a password, this is done primarily to simplify using the cluster software.)
Actually, OSCAR could be used for any cluster application, not just high-performance computing. (A recently created subgroup, HA-OSCAR, is starting to look into high-availability clusters.) While OSCAR installs by default a number of packages specific to high-performance computing, e.g., MPI and PVM, that would be of little use for some other cluster applications, it is easy to skip the installation of these packages. It is also very easy to include additional RPM packages in an OSCAR installation. Although OSCAR does not provide a simple mechanism to do post-installation configuration for such packages, you can certainly include configuration scripts if you create your own packages. There is a HOWTO on the OSCAR web site that describes how to create custom packages. Generally, this will be easier than manually configuring added packages after the installation. (However, using the C3 tool set included in OSCAR, many post-install configuration tasks shouldn't be too difficult.)
Because of the difficulty of bringing together a wide variety of software, and because the individual software packages are constantly being updated, some of the software included in OSCAR is not always the most current version available. In practice, this is not a problem. The software OSCAR includes is stable and should meet most of your needs.
While OSCAR was originally created using Red Hat Linux, a goal of the project is to move beyond support for a single distribution, and Mandrake Linux is now also supported. The OSCAR project has shifted to SIS in order to eventually support most RPM-based versions of Linux. But don't expect support for the latest Linux versions to be immediately available as new versions are released.
6.2 What's in OSCAR

OSCAR brings together a number of software packages for clustering. Most of the packages listed in this section are available as standalone packages and have been briefly described in Chapter 2. Some of the more important packages are described in detail in later chapters as well. However, there are several scripts unique to OSCAR. Most are briefly described in this chapter.

It is likely that everything you really need to get started with a high-performance cluster is included either in the OSCAR tarball or as part of the base operating system OSCAR is installed under. Nonetheless, OSCAR provides a script, the OSCAR Package Downloader (opd), that simplifies the download and installation of additional packages that are available from OSCAR repositories in an OSCAR-compatible format. opd is so easy to use that, for practical purposes, any package available through opd can be considered part of OSCAR. opd can be invoked as a standalone program or from the OSCAR installation wizard, the GUI-based OSCAR installer. Additional packages available using opd include things like Myrinet drivers and support for thin OSCAR clients, as well as management packages like Ganglia. Use of opd is described later in this chapter.

OSCAR packages fall into three categories. Core packages must be installed. Included packages are distributed as part of OSCAR, but you can opt out of installing them. Third-party packages are additional packages that are available for download and are compatible with OSCAR, but aren't required.

There are six core packages at the heart of OSCAR that you must install:
OSCAR includes a number of packages and scripts that are used to build your cluster. The installation wizard will give you the option of deciding which to include:
OSCAR provides additional system tools, either as part of the OSCAR distribution or through opd, used to manage your cluster:
Of course, any high-performance cluster would be incomplete without programming tools. The OSCAR distribution includes four packages, while two more (as noted) are available through opd:
If you install the four included packages, the default, they should cover all your programming needs. Additionally, OSCAR will install and configure (or reconfigure) a number of services and packages supplied as part of your Linux release.[1] These potentially include Apache, DHCP, NFS, MySQL, OpenSSL, OpenSSH, rrdtool, pcp, php, python, rsync, tftp, etc. Exactly which of these is actually installed or configured will depend on what other software you elect to install. In the unlikely event that you are unhappy with the way OSCAR sets up any of these, you'll need to go back and reconfigure them after the installation is completed.
This section should provide you with a fairly complete overview of the installation process. The goal here is to take you through a typical installation and to clarify a few potential problems you might encounter. Some customizations you might want to consider are described briefly at the end of this section. The OSCAR project provides a very detailed set of installation instructions running over 60 pages, which includes a full screen-by-screen walkthrough. If you decide OSCAR is right for you, you should download the latest version and read it very carefully before you begin. It will be more current and complete than the overview provided here. Go to http://oscar.openclustergroup.org and follow the documentation link.
Because OSCAR is a complex set of software that includes a large number of programs and services, it can be very unforgiving if you make mistakes when setting it up. For some errors, you may be able to restart the installation process. For others, you will be better served by starting again from scratch. A standard installation, however, should not be a problem. If you have a small cluster and the hardware is ready to go, with a little practice you can be up and running in less than a day.
The installation described here is typical. Keep in mind, however, that your installation may not go exactly the same way; it will depend on some of the decisions you make. For example, if you choose to install PVFS, you'll see an additional console window early in the installation specific to that software.
There are several things you need to do before you install OSCAR. First, you need to plan your system. Figure 6-1 shows the basic architecture of an OSCAR cluster. You first install OSCAR on the cluster's head node or server, and then OSCAR installs the remaining machines, or clients, from the server. The client image is a disk image for the client that includes the boot sector, operating system, and other software for the client. Since the head node is used to build the client image, is the home for most user services, and is used to administer the cluster, you'll need a well-provisioned machine. In particular, don't try to skimp on disk space; OSCAR uses a lot. The installation guide states that after you have installed the system, you will need at least 2 GB (each) of free space under both the / and /var directories while 4 GB for each is recommended. Since the head is also the home for your users' files, you'll need to keep this in mind as well. It is a good idea to put the /, /var, and /home directories on separate disk partitions. This will simplify reinstalls and provide a more robust server.
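As a rough illustration only (the numbers are assumptions and will depend on your disks and your users), a head-node layout that meets those recommendations might look like this:

/boot    100 MB
/        8 GB
/var     8 GB
swap     2 GB
/home    the remainder of the disk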
As you can see from the figure, the server or head is dual homed; that is, it has two network interfaces. The interface attached to the external network is called the public interface. The private interface attaches to the cluster's network. While you don't have to use this configuration, be aware that OSCAR will set up a DHCP server on the private interface. If you put everything on a public network with an existing DHCP server, you may have a war between the two DHCP servers. The remainder of this chapter assumes you'll be using a configuration like the one shown in Figure 6-1.
It is strongly recommended that you begin with a clean install of your operating system and that you customize your OSCAR installation as little as possible the first time you install it. OSCAR is a complex collection of software. With a vanilla installation, all should work well. This isn't to say you can't do customizations, just do so with discretion. Don't be surprised if a custom install takes a few tries to get right.
The installation documentation lists a few supported versions of Linux. It is strongly recommended that you stick to the list. For Red Hat, a workstation install that includes the Software Development group and an X Windows environment should work nicely for the server. (You may also want to add some network utilities such as VNC-server and Ethereal to make life easier, and you may want to remove OpenOffice to discourage that kind of activity on the cluster. That's your call; it won't affect your OSCAR installation either way.) You should also do manual disk partitioning to ensure that you meet the space requirements and to control the disk layout. (It is possible to work around some allocation problems using links, but this is a nuisance best avoided.) Don't install any updates to your system at this point. Doing so may break the OSCAR installation, and you can always add them after you install OSCAR.
Since you have two interfaces, you need to make sure that your network configuration is correct. The configuration of the public interface, of course, will be determined by the configuration of the external network. For example, an external DHCP server might be used to configure the public interface when booting the server. For the cluster's network, use a private address space distinct from the external address space. Table 6-1 lists reserved address spaces that you might use per RFC 1918.
Table 6-1. Reserved address spaces (RFC 1918)

Address Spaces
10.0.0.0     to  10.255.255.255
172.16.0.0   to  172.31.255.255
192.168.0.0  to  192.168.255.255
By way of example, assume you have fewer than 255 computers and your organization's internal network is already using the first address range (10.X.X.X). You might select one of the class C ranges from the third address range, e.g., 192.168.1.0 through 192.168.1.255. The usual IP configuration constraints apply, e.g., don't assign the broadcast address to a machine. In this example, you would want to avoid 192.168.1.0 (and, possibly, 192.168.1.255). Once you have selected the address space, you can configure the private interface using the tool of your choice, e.g., neat, ifconfig, or netcfg. You will need to set the IP address, subnet mask, and default gateway. And don't forget to configure the interface to be active on startup. In this example, you might use an IP address of 192.168.1.1 with a mask of 255.255.255.0 for the private interface.[2] The public interface will be the gateway for the private network. This will leave 192.168.1.2 through 192.168.1.254 as addresses for your compute nodes when you set up DHCP. Of course, if you plan ahead, you can also configure the interface during the Linux installation.
[2] While this is the simplest choice, a better choice is to use 192.168.1.254 for the server and to start client addresses at 192.168.1.1. The advantage is that the low-order portion of the IP addresses will match the node numbers, at least for your first 253 machines.
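If you prefer to do this from the command line, here is a minimal sketch using ifconfig, assuming the private interface is eth1 (the interface used in the OSCAR example later in this chapter) and the addressing from the example above:

[root@amy root]# ifconfig eth1 192.168.1.1 netmask 255.255.255.0 up

Keep in mind that an ifconfig setting does not survive a reboot. On a Red Hat system, for the interface to come up at startup the settings also need to be recorded in /etc/sysconfig/network-scripts/ifcfg-eth1 (with ONBOOT=yes), which is exactly what graphical tools such as neat do for you.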
Once you have the interfaces configured, reboot the server and verify that everything works. You can use ifconfig -a to quickly confirm that both interfaces are up. If it is possible to put a live machine on the internal network, you can confirm that routing works correctly by pinging the machine. Do as much checking as you can at this point. Once the cluster is installed, testing can be more difficult. You don't want to waste a lot of time trying to figure out what went wrong with the OSCAR installation when the network was broken before you began.
Another pre-installation consideration is the security settings for the server you are building. If you have the security set too tightly on the server, it will interfere with the client installation. If you have customized the security settings on a system, you need to pay particular attention. For example, if you have already installed SSH, be sure that you permit root logins to your server (or plan to spend a lot of time at the server). If you can isolate the cluster from the external network, you can just turn off the firewall.
Even if the installation goes well, you still may encounter problems later. For example, with Red Hat 9, the default firewall settings may cause problems for services like Ganglia. Since OSCAR includes pfilter, it is usually OK to just turn off Red Hat's firewall. However, this is a call you will have to make based on your local security policies.
You should also ensure that the head node's host name is correctly set. Make sure that the hostname command returns something other than localhost and that the returned name resolves to the internal interface. For example,
[root@amy root]# /bin/hostname
amy
[root@amy root]# ping -c1 amy
PING amy (172.16.1.254) 56(84) bytes of data.
64 bytes from amy (172.16.1.254): icmp_seq=1 ttl=64 time=0.166 ms

--- amy ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.166/0.166/0.166/0.000 ms
Notice that hostname returns amy and that when amy is pinged, the name resolves to the address 172.16.1.254.
It is also a good idea to make sure you have enough disk space before going on. You can use the df -h command. This is also a good point to do other basic configuration tasks, such as setting up printers, setting the message of the day, etc.
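For example, a quick check might look like the following; the output shown here is only illustrative, and your filesystems and sizes will differ:

[root@amy root]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda2              35G  4.2G   29G  13% /
/dev/hda1              99M   15M   80M  16% /boot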
The next step is to get the software you will need onto the server. This consists of the OSCAR distribution and the Linux packages you need to build the image for the client machines. For the Linux packages, first create the directory /tftpboot/rpm and then copy over the packages. It will be a lot simpler if you just copy everything over rather than try to figure out exactly what is needed. For Red Hat 9, mount each of the three distribution disks and copy over all the RPM files from /mnt/cdrom/RedHat/RPMS. The sequence looks like this:
[root@amy /root]# mkdir -p /tftpboot/rpm
[root@amy /root]# mount /mnt/cdrom
[root@amy /root]# cd /mnt/cdrom/RedHat/RPMS
[root@amy RPMS]# cp *.rpm /tftpboot/rpm/
[root@amy RPMS]# cd /
[root@amy /]# eject cdrom
You repeat the last five steps two more times, once for each of the remaining CD-ROMs. If your system automounts CD-ROMs, you'll skip the manual mounts. You'll copy more than 1,400 packages, so this can take a while with slower systems. (OSCAR will subsequently add additional packages to this directory.)
If you are tight on disk space, you can install the packages on a different partition and link to them. For example, if you've installed the packages in /var/tftpboot/rpm, you could do the following:
[root@amy root]# mkdir /tftpboot/
[root@amy root]# ln -s /var/tftpboot/rpm /tftpboot/rpm
Note that the directory, not the individual packages, is linked.
You can download the OSCAR package from http://oscar.sourceforge.net. You'll have the option of downloading OSCAR with or without the sources (SRPMs) for most of the packages in OSCAR. Since it is unlikely you'll need the sources and since you can download them separately later should you need them, it is OK to skip them and go with the standard download. We'll download to the /root directory, a safe place to install from.
Next, you will unpack the code:
[root@amy root]# gunzip oscar-3.0.tar.gz
[root@amy root]# tar -xvf oscar-3.0.tar
...
This creates a directory, /root/oscar-3.0, which you should cd to for the next phase of the installation process. You may also want to browse the subdirectories that are created.
Before the installation wizard can be run the first time, it must be configured and installed. Log in as root or use su - to become root. Change to the installation directory and run configure and make install.
[root@amy root]# cd /root/oscar-3.0
[root@amy oscar-3.0]# ./configure
...
[root@amy oscar-3.0]# make install
...
Now you are ready to run the wizard.
At this point, it is generally a good idea to start another shell so the environment variables are sourced from /etc/profile.d. To start the installation, change to the installation directory and run the install_cluster script from a terminal window under X. The install_cluster script expects the private interface as an argument. Be sure to adjust this parameter as needed. Here is an example of starting the script:
[root@amy oscar-3.0]# cd $OSCAR_HOME && pwd
/opt/oscar
[root@amy oscar]# ./install_cluster eth1
The first time you run the wizard, you will be prompted for a password for the MySQL database. Then, after a bit (depending on dependencies that need to be addressed), the OSCAR GUI-style installation wizard will appear. It may take several minutes for the wizard to appear. The console window from which the script was run will provide additional output, so keep it visible. This information is also written to an install log in the OSCAR installation directory. Figure 6-2 shows the wizard.
The Installation Wizard shows the basic steps that you will be going through to install your cluster. You can get a helpful explanation for any step by using the adjacent Help... button.
Before the installation can proceed, you should download any third-party packages you'll want using opd. Since opd downloads packages over the Internet, you'll need a working Internet connection to use it. Of course, if you are not interested in any of the third-party packages, you can skip this step. Also, it is possible to add packages later. But it is generally simpler if you do everything at once. You'll miss out on some very nice software if you skip this step.
opd can be run as a separate program outside of the wizard, or you can run it from the wizard by clicking on the first button, Downloading Additional OSCAR Packages.... Generally, it is easier to run opd from the wizard, so that's what's described here. But there are some rare circumstances where you might want to use the command-line version of opd, so there is a very brief description in the accompanying sidebar.
When you open opd from the wizard, a window will appear as shown in Figure 6-3. Another pop up will appear briefly displaying the message Downloading Package Information... as the OSCAR repositories on the Internet are visited to see what packages are available. (Keep in mind that packages are added over time, so you may see additional packages not shown or discussed here.)
Using the downloader is straightforward. If you click on an item, it will display information about the package in the lower pane, including a description, prerequisite packages, and conflicts. Just select the appropriate tab. In the upper pane, put a checkmark next to the packages you want. Then click on the Download Selected Packages button. A new pop-up will appear with the message Downloading Package File, along with a file name and a percentage. Be patient; it may look like nothing is happening although the download is proceeding normally.[3] If you have a reasonable connection to the Internet, the download should go quickly. The packages are downloaded to the directory /var/cache/oscar/downloads and are unpacked in separate directories under /var/lib/oscar/packages/.
[3] The percentage refers not to an individual package download but to the percentage of the total number of packages that have been downloaded. So if you are downloading five packages, the percentages will jump by 20 percent as each package is retrieved.
The next step is to select the packages you want to install. When you click on the Select OSCAR Packages to Install... button, the Oscar Package Selection window will
be displayed as shown in Figure 6-4. This displays the packages that are available (but not the individual RPMs).
The information provided in the lower pane is basically the same as that provided by the OSCAR Package Downloader window, except the information is available for all the packages. The check boxes in the upper pane determine whether the packages are to be installed. Any package that you added with opd will also be included in the list, but by default, will not be selected. Don't forget to select these. If you haven't downloaded any packages, you probably won't need to change anything here, but scroll down the list and carefully look it over. If there is something you don't need or want, disable it. But keep in mind that it is generally easier to include something now than to go back and add it later. Don't bother trying to remove any of OSCAR's core packages; OSCAR won't let you. And it is strongly recommended that you don't remove pfilter. (If you have a compelling reason not to include pfilter, be sure to consult the installation manual for additional details explaining how to do this correctly.)
OSCAR constructs an image for client nodes, i.e., a copy of the operating system files and software that will be installed on the client. With OSCAR, you can build multiple images. If you are going to build multiple images, it is possible to define different sets of installation packages. The drop-down box at the top of the window allows you to select among the sets you've defined. You can define and manipulate sets by clicking on the Manage Sets button at the top of the window. A pop-up window, shown in Figure 6-5, allows you to manipulate sets, etc. The easiest way to create a new set is to duplicate an existing set, rename the set, and then edit it.
Step 2 is the configuration of selected OSCAR packages. All in all, the default configurations should meet most users' needs, so you can probably skip this step. Figure 6-6 shows the configuration menu. Most packages do not require configuration at this point and are not included in the menu.
In this example, only five of the packages need or permit additional configuration. Each of these, if selected, will generate a window that is self-explanatory. The Environment Switcher allows you to select either LAM/MPI or MPICH as the default. Since a user can change the default setting, your selection isn't crucial. The switcher script can be run on the command line and is described later in the chapter.
The kernel_picker is potentially a complicated option. Fortunately, if you are using the default kernel, you can ignore it completely. Basically, the kernel_picker allows you to change kernels used when building system images. You could use it to install a previously built kernel such as one configured with the openMosix extensions. The kernel_picker window is shown in Figure 6-7. (See the kernel_picker(1) manpage for more information.)
Figure 6-8 shows the ntpconfig window. The ntpconfig option allows you to specify the address of NTP servers used by the cluster server. While the server synchronizes to an external source, the clients synchronize to the cluster server. There are several default NTP servers listed with check boxes, and you can enter your own choices. In this example, salieri.wofford.int has been added. If you have a local timeserver, you'll certainly want to use that instead of the defaults, or if you know of a "closer" timeserver, you may prefer to use it. But if in doubt, the defaults will work.
Pretty much everyone can expect to see the three choices just described. If you have added additional packages, you may have other choices. In this example, the packages for Ganglia and PVFS were both added, so there are configuration windows for each of these. (With Ganglia you can change the naming information and the network interface used to reach the client nodes. With PVFS you can change the number of I/O servers you are using.)
When you complete a step successfully, you should see a message to that effect in the console window, as shown in Figure 6-9. For some steps, there is also a pop-up window that tells you when the step is finished. While the first two steps are optional, in general be very careful not to go to the next step until you are told to do so. The console window also displays error messages. Unfortunately, the console can be a little misleading. You may see some benign error messages, particularly from rpm and rsync, and occasionally real error messages may get lost in the output. Nonetheless, the console is worth watching and will give you an idea of what is going on.
In Step 3, you will install all the packages that the server needs and configure them. There are no fancy graphics here, but you will see a lot of activity in the console window. It will take several minutes to set up everything. A pop-up window will appear, telling you that you were successful or that there was an error, when this step completes. If all is well, you can close the popup window and move on to the next step. If not, you'll need to go to the console window and try to puzzle out the error messages, correct the problem, and begin again. You should need to run this step only once.
In Step 4, you build the client image. The client image is all the software that will be installed on a client, including the operating system. Since it is possible to create multiple client images, you are given the option to specify a few details as shown in Figure 6-10. You can specify image names if you have multiple images, the location of the packages used to build the image, and the names of the package list and disk partition files. These last two files are described later in this chapter. The defaults are shown in the figure. If you aren't building multiple images, you can probably stick with the defaults. You can also determine how the IP addresses of the clients are set and the behavior of the clients once the installation completes. Your choices are dhcp, static, and replicant. With static, the IP addresses will be assigned to the clients once and for all at the time of the installation. This is the most reasonable choice. dhcp uses DHCP to set IP addresses, while replicant doesn't change addresses at all. The next button allows you to turn multicasting on or off. The possible post-install actions are beep, reboot, or shutdown. With beep, the clients will unmount the file system and beep at you until rebooted. reboot and shutdown are just what you would expect. All in all, OSCAR's defaults are reasonable. When you have made your selections, click on Build Image.
OSCAR uses SIS to create the image. Unlike our example in Chapter 8, you do not need to create a sample system. Image creation is done on the server.
This step takes a while to complete. There is a red bar that grows from left to right at the bottom of the window that will give you some idea of your progress. However, you will be done before the bar is complete. Another pop-up window will appear when you are done. You'll run this step once for each different image you want to create. For most clusters, that's one image. Keep in mind that images take a lot of space. Images are stored in the directory /var/lib/systemimager/images.
Once you have built the image, things should start going a lot faster. Step 5 defines the scope of your network. This is done using the window shown in Figure 6-11. If you have multiple images, you can select the image you want to use in the first field. The next five fields are used to specify how node names will be constructed. The host name is constructed by appending a number to the base name. That number begins at the start value and is padded with leading zeros, if needed, as specified by the padding field. The domain name is then appended to the node name to form the fully qualified domain name or FQDN. The number of hosts you create is specified in the fourth field. In this example, four nodes are created with the names node1.oscar.int, node2.oscar.int, node3.oscar.int, and node4.oscar.int. (With padding set to 3, you would get node001.oscar.int, etc.) OSCAR assumes that hosts are numbered sequentially. If for some reason you aren't building a single block of sequential hosts, you can rerun this step to build the block's hosts as needed.
The last three fields are used to set IP parameters. In this example, the four hosts will have IP addresses from 172.16.1.1 through 172.16.1.4 inclusive.
Once you have the fields the way you want them, click on the Addclients button. You should see a small pop-up window indicating that you were successful. If so, you can close the pop-up window and the client definition window and go on to the next step.
Step 6, shown in Figure 6-12, sets up the DHCP server and maps IP addresses to MAC addresses. (It is possible to run OSCAR without configuring the head as a DHCP server, but that isn't described here.) This step requires several substeps. First, you will need to collect the MAC or Ethernet addresses from the adapters in each of the client machines. You can do this manually or use OSCAR to do it. If you select the Collect MAC Addresses button and then power on each client, OSCAR will listen to the network, capture MAC addresses from DHCP requests, and display the captured addresses in the upper left pane. However, if no DHCP requests are generated, the machines won't be discovered. (Be sure to turn this option off when you have collected your addresses.) Under some circumstances, it is possible to collect MAC addresses from machines not in your cluster. If this happens, you can use the Remove button to get rid of the addresses you don't want. If you collect the MAC addresses, be sure to save them to a file using the Export MACs to file... button.
Alternately, if you know the MAC addresses, you can enter them into a file and read the file with the Import MACs from file... button. To create the file, just put one MAC address on a line with the fields separated by colons. Here is part of a MAC file:
00:08:c7:07:6e:57
00:08:c7:07:68:48
00:08:c7:07:c1:73
00:08:c7:07:6f:56
OSCAR can be picky about the format of these addresses. (If you are collecting MAC addresses rather than importing them from a file, it is a good idea to export the collected MAC addresses. In the event you want to reinstall your clusters, this can save some work.)
Once you have the MACs, you'll need to assign them to the clients displayed in the top right pane. You can do this all at once with the Assign all MACs button, or you can do it individually with the Assign MAC to Node button. While the first method is quicker, you may prefer the second method to better control which machine gets which address. With the second method, click on a MAC address to select it, click on a client's interface, and then click the Assign MAC to Node button. Repeat this step for each client.
If the Dynamic DHCP update checkbox is selected, then each time you assign a MAC address, the DHCP server is refreshed. If it is not selected, then once you have configured your nodes, you can click on Configure DHCP Server. OSCAR creates the DHCP configuration file /etc/dhcpd.conf and starts DHCP. If you already have a DHCP configuration file, OSCAR will save it as dhcpd.conf.oscarbak before creating the new file.
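If you want to see what OSCAR did here, you can look over the generated configuration and confirm that the service is running. This is just a sanity check, not a required step:

[root@amy root]# less /etc/dhcpd.conf
[root@amy root]# service dhcpd status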
SIS is used to push files to the nodes. By default, images are transferred using rsync. It is also possible to distribute images using flamethrower, a multicast-based program. Because the multicast facilities are still somewhat experimental, rsync is the recommended method for new users. If you elect to use flamethrower, you'll need to ensure that your network is properly configured to support multicasting. If the Enable Multicasting checkbox is selected, flamethrower is used to push files. If it is unselected, rsync is used. Chapter 8 provides a detailed description of SIS and rsync.
Next, you'll need to create an autoinstall diskette. When the potential client machines are booted with this diskette, the process of downloading their image begins. Click on the button in the lower left of the window and a new window will take you through the creation of the floppy. Use the default standard when prompted for a flavor. If you have a large cluster, you should create several diskettes so you can install several systems at once.
You are through with the MAC Address Collection window, but there is one more thing you must do before going to the next step: install the image on your clients. While this sounds formidable, it is very straightforward with OSCAR. Just insert the floppy you just created and reboot each system.
You should see a "SYSLINUX 2.0" screen with a boot prompt. You can hit return at the prompt or just wait a few seconds. The system will go to the OSCAR server and download and install the client operating system. Repeat this process with each system. You can do all your clients at the same time if you wish. The boot floppy is only used for a couple of minutes, so once the install is underway, you can remove the floppy and move on to another machine. If you have several floppies, you can get a number of installations going very quickly. The installation time will depend on how many clients you have, how fast your network is, and how many packages went into your cluster image, but it should go fairly quickly.
When a client's image is installed, the machine will start beeping. If you haven't already removed the floppy, do so now and reboot the system. The filesystems on the clients will not be mounted at this point so it is safe to just cycle the power. (Actually, you could have set the system to automatically reboot back in Step 4, but you'll need to make sure the floppy has been removed in a timely manner if you do so.)
Once all the clients have booted, there are a few post-install scripts that need to be run. Just click on the corresponding wizard button. After a few minutes, you should get the pop-up window shown in Figure 6-13. Well done! But just to be on the safe side, you should test your cluster.
Step 8 tests your cluster. Another console window opens and you see the results from a variety of tests. Figure 6-14 shows what the output looks like early in the process. There is a lot more output that will vary depending on what you've installed. (Note that you may see some PBS errors because the PBS server is initially shut down. It's OK to ignore these.)
Congratulations! You have an OSCAR cluster up and running! This probably seems like a complicated process when you read about it here, but it all goes fairly quickly. And think for a moment how much you have accomplished.
If something goes wrong with your installation, OSCAR provides a start_over script that can be used to clean up from the installation and give you another shot at installing OSCAR. This is not an uninstaller. It will not return your machine to the pristine state it was in before the installation, but it should clean things up enough so that you'll be able to reinstall OSCAR. If you use this script, be sure to log out and back in before you reinstall OSCAR. On the other hand, you may just want to go back and do a clean install.
As should be apparent from the installation you just went through, there are several things you can do to customize your installation. First, you can alter the kernel using kernel_picker. For example, if you want to install the openMosix kernel on each system, you would begin by installing the openMosix kernel on the head node. Then, when installing OSCAR, you would use kernel_picker to select the openMosix kernel. This is shown in Figure 6-15.
Of course, for a new kernel to boot properly, you'll need to ensure that the appropriate kernel load modules are available on each machine. For openMosix, you can do this by installing the openMosix package.
Fortunately, it is straightforward to change the packages that OSCAR installs. For example, if you are installing the openMosix kernel, you'll want the openMosix tools as well. If you look back at Figure 6-10, one of the fields was Package File. In the directory /opt/oscar/oscarsamples there are several files, one for each supported Linux distribution. These files contain the packages that will be installed by OSCAR. For example, for Red Hat 9 the file is redhat-9-i386.rpmlist. If there are some additional packages that you would like to install on the cluster nodes, you can make a backup copy of the desired lists and then add those packages to the list. You should put one package per line. You need to include only the package name, not its version number. For example, to install the openMosix tools package, you could add a line with openmosix-tools (rather than openmosix-tools-0.3.5-1.i386.rpm). The package list is pretty basic, which leads to a quick install but a minimal client. Of course, you'll need to make sure the packages are in (or linked to) the /tftpboot/rpm directory and that you include all dependencies in the package list.
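Continuing with the openMosix tools example, the change might look something like the following; the backup file name is just a suggestion:

[root@amy root]# cd /opt/oscar/oscarsamples
[root@amy oscarsamples]# cp redhat-9-i386.rpmlist redhat-9-i386.rpmlist.orig
[root@amy oscarsamples]# echo "openmosix-tools" >> redhat-9-i386.rpmlist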
While you are in the /opt/oscar/oscarsamples directory, you can also alter the disk setup by editing either the sample.disk.ide or sample.disk.scsi file. For example, if you have an IDE drive and you want to use the ext3 file system rather than ext2, just change all the ext2 entries to ext3 in the file sample.disk.ide. Of course, unless you have a compelling reason, you should probably skip these changes.
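If you do decide to make the ext3 change, a minimal sketch looks like this, assuming a version of sed that supports in-place editing with -i (otherwise, just make the same substitution with a text editor):

[root@amy oscarsamples]# cp sample.disk.ide sample.disk.ide.orig
[root@amy oscarsamples]# sed -i 's/ext2/ext3/g' sample.disk.ide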
It is pretty obvious that OSCAR has just installed a number of applications on your system. As you might expect, OSCAR made a number of additional, mostly minor, changes. It will probably take you a while to discover everything that has changed, but these changes shouldn't cause any problems.
While OSCAR tries to conform to standard installation practices, you won't get exactly the same installation and file layout that you might have gotten had you installed each application individually. The changes are really minimal, however. If you've never done individual installations, the whole issue is probably irrelevant unless you are looking at the original documentation that comes with the application.
You can expect to find most configuration files in the usual places, typically but not always under the /etc directory. Configuration files that OSCAR creates or changes include c3.conf, crontab, dhcpd.conf, gmetad.conf, gmond.conf, ntp.conf, ntp/step-tickers, pcp.conf, pfilter.conf, ssh/ssh_config, and files in xinetd.d. OSCAR will also update /etc/hosts, /etc/exports, and /etc/fstab as needed.
Several of the packages that are installed require accounts, which are created during the install. Take a look at /etc/passwd to see which accounts have been added to your system. For the global user profiles, OSCAR includes a link to a script to set up SSH keys and adds some paths. You might want to look at /etc/profile.d/ssh-oscar.sh and /etc/profile.d/ssh-oscar.csh. OSCAR restarts all affected services.
There are three more buttons above the Quit button on the wizard. Each does exactly what you would expect. The Add OSCAR Clients... button adds additional nodes. Adding a node involves three now-familiar steps. When you select Add OSCAR Clients... you'll get the menu shown in Figure 6-16.
The first step defines the client or range of clients. You'll get the same menu (Figure 6-11) you used when you originally set up clients. Be sure you set every field as appropriate. OSCAR doesn't remember what you used in the past, so it is possible to end up with inconsistent host names and domains. (If this happens, you can just delete the new nodes and add them again, correcting the problem, but be sure to exit and restart OSCAR after deleting and before adding a node back.) Of course, you'll also need to set the starting node and number of nodes you are adding. In the second step, you map the MAC address to a machine just as you've done before (see Figure 6-12). Finally, with the last step you run the scripts to complete the setup.
Deleting a node is even easier. Just select the Delete OSCAR Clients... button on the wizard. You'll see a window like the one shown in Figure 6-17 listing the nodes on your cluster. Select the nodes you want to delete and click on the Delete clients button. OSCAR will take care of the rest. (Deleting a node only removes it from the cluster. The data on the node's hard disk is unaffected as are services running on the node.)
Finally, you can install and uninstall packages using the Install/Uninstall OSCAR Packages... button. This opens the window shown in Figure 6-18. Check or uncheck packages as desired and click on the Execute button. Any new packages you've checked will be installed, while old packages you've unchecked will be uninstalled. This is a new feature in OSCAR and should be used with caution.
OSCAR uses a layered approach to security. The architecture used in this chapter, a single-server node as the only connection to the external network, implies that everything must go through the server. If you can control the placement of the server on the external network, e.g., behind a corporate firewall, you can minimize the threat to the cluster. While outside the scope of this discussion, this is something you should definitely investigate.
The usual advice for securing a server applies to an OSCAR server. For example, you should disable unneeded services and delete unused accounts. With a Red Hat installation, TCP wrappers is compiled into xinetd and available by default. You'll need to edit the /etc/hosts.allow and /etc/hosts.deny files to configure this correctly. There are a number of good books (and web pages) on security. Get one and read it!
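As a rough starting point, one common pattern is to deny everything by default and then allow the cluster's private network. The address prefix below assumes the 192.168.1.0 network used earlier in this chapter, so adjust it, and the list of permitted services, to your own setup:

# /etc/hosts.deny
ALL: ALL

# /etc/hosts.allow
ALL: 127.0.0.1
ALL: 192.168.1.

Remember that only services run under or compiled with TCP wrappers honor these files, so this complements, rather than replaces, the pfilter rules described next.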
In an OSCAR cluster, access to the cluster is controlled through pfilter, a package included in the OSCAR distribution. pfilter is both a firewall and a compiler for firewall rulesets. (The pfilter software can be downloaded separately from http://pfilter.sourceforge.net/.)
pfilter is run as a service, which makes it easy to start it, stop it, or check its status.
[root@amy root]# service pfilter stop
Stopping pfilter:                                          [  OK  ]
[root@amy root]# service pfilter start
Starting pfilter:                                          [  OK  ]
[root@amy root]# service pfilter status
pfilter is running
If you are having communications problems between nodes, you may want to temporarily disable pfilter. Just don't forget to restart it when you are done!
You can request a list of the chains or rules used by pfilter with the service command.
[root@amy root]# service pfilter chains
table filter:
...
This produces a lot of output that is not included here.
The configuration file for pfilter, /etc/pfilter.conf, contains the rules used by pfilter and can be edited if you need to change them. The OSCAR installation adds some rules to the default configuration. These appear to be quite reasonable, so it is unlikely that you'll need to make any changes. The manpages for pfilter.conf(5) and pfilter.rulesets(5) provide detailed instructions should you wish to make changes. While the rules use a very simple and readable syntax, instruction in firewall rulesets is outside the scope of this book.
Within the cluster, OSCAR is designed to use the SSH protocol for communications. Use of older protocols such as TELNET or RSH is strongly discouraged and really isn't needed. openSSH is set up for you as part of the installation. OPIUM, the OSCAR Password Installer and User Manager tool, handles this. OPIUM installs scripts that will automatically generate SSH keys for users. Once OSCAR is installed, the next time a user logs in or starts a new shell, she will see the output from the key generation script. (Actually, at any point after Step 3 in the installation of OSCAR, key generation is enabled.) Figure 6-19 shows such a login. Note that no action is required on the part of the user. Apart from the display of a few messages, the process is transparent to users.
Once you set up the cluster, you should be able to use the ssh command to log onto any node from any other node, including the server, without using a password. On first use, you will see a warning that the host has been added to the list of known hosts. All this is normal. (The changes are saved to the directory /etc/profile.d.)
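A quick way to test this is to run a command on a node over ssh. The output below is only illustrative and assumes the node names from the earlier example:

[sloanjd@amy sloanjd]$ ssh node1 hostname
Warning: Permanently added 'node1' (RSA) to the list of known hosts.
node1.oscar.int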
In addition to setting up openSSH on the cluster, OPIUM includes a sync_users script that synchronizes password and group files across the cluster using C3 as a transport mechanism. By default, this is run every 15 minutes by cron. It can also be run by root with the --force option if you don't want to wait for cron. It cannot be run by other users. OPIUM is installed in /opt/opium with sync_users in the subdirectory bin. The configuration file for sync_users, sync_user.conf, is in the etc subdirectory. You can edit the configuration file to change how often cron runs sync_users or which files are updated, among other things. (sync_users is something of a misnomer since it can be used to update any file.)
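For example, after changing a password on the server, root could push the change out immediately rather than waiting for cron:

[root@amy root]# /opt/opium/bin/sync_users --force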
switcher is a script that simplifies changes to a user's environment. It allows the user to make, with a single command, all the changes to paths and environment variables needed to run an application. Under the hood, switcher relies on the modules package.
The modules package is an interesting package in its own right. It is a general utility that allows users to dynamically modify their environment using modulefiles. Each modulefile contains the information required to configure a shell for a specific application. A user can easily switch to another application, making the required environment changes with a single command. While it is not necessary to know anything about modules to use switcher, OSCAR installs the modules system, and it is available should you need or wish to use it. modules can be downloaded from http://modules.sourceforge.net/.
switcher is designed so that changes take effect on future shells, not the current one. This was a conscious design decision. The disadvantage is that you will need to start a new shell to see the benefits of your change. On the positive side, you will not need to run switcher each time you log in. Nor will you need to edit your "dot" files such as .bashrc. You can make your changes once and forget about them. While switcher is currently used to change between the two MPI environments provided with OSCAR, it provides a general mechanism that can be used for other tasks. When experimenting with switcher, it is a good idea to create a new shell and test changes before closing the old shell. If you have problems, you can go back to the old shell and correct them.
With switcher, tags are used to group similar software packages. For example, OSCAR uses the tag mpi for the included MPI systems. (You can list all available tags by invoking switcher with just the --list option.) You can easily list the attributes associated with a tag.
[sloanjd@amy sloanjd]$ switcher mpi --list
lam-7.0
lam-with-gm-7.0
mpich-ch_p4-gcc-1.2.5.10
In this example, we see the attributes are the two available MPI implementations.
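If you want to see which tags exist in the first place, invoke switcher with just --list, as mentioned above. On a stock OSCAR installation you should at least see the mpi tag, though the exact output will vary with what is installed:

[sloanjd@amy sloanjd]$ switcher --list
mpi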
You can use the --show option to have switcher report the default MPI environment.
[sloanjd@amy sloanjd]$ switcher mpi --show
system:default=lam-7.0
system:exists=true
Alternately, you can use the which command:
[sloanjd@amy sloanjd]$ which mpicc
/opt/lam-7.0/bin/mpicc
From the path, we can see that we are set up to use LAM/MPI rather than MPICH.
To change the default to MPICH, simply assign the desired attribute value to the tag.
[sloanjd@amy sloanjd]$ switcher mpi = mpich-ch_p4-gcc-1.2.5.10
Attribute successfully set; new attribute setting will be effective for future shells
The change will not take effect immediately, but you will be using MPICH the next time you log in (and every time you log in until you run switcher again). After the first time you make a change, switcher will ask you to confirm tag changes. (Also, the very first time you use switcher to change a tag, you'll receive a tag "does not exist" error message that can be safely ignored.)
As root, you can change the default tag for everyone using the --system flag.
[root@amy root]# switcher mpi = lam-7.0 --system
One last word of warning! If you make a typo when entering the value for the attribute, switcher will not catch your mistake.
6.6 Using LAM/MPI with OSCAR
Before we leave OSCAR, let's look at a programming example. You can use this to convince yourself that everything is really working. You can find several LAM/MPI examples in /usr/share/doc/lam-oscar-7.0/examples and the documentation in /opt/lam-7.0/share/lam/doc. (For MPICH, look in /opt/mpich-1.2.5.10-ch_p4-gcc/examples for code and /opt/mpich-1.2.5.10-ch_p4-gcc/doc for documentation.)
Log on as a user other than root and verify that LAM/MPI is selected using switcher.

[sloanjd@amy doc]$ switcher mpi --show
user:default=lam-7.0
system:exists=true

If necessary, change this and log off and back on. If you haven't logged onto the individual machines, you need to do so now using ssh to register each machine with ssh. You could do this with a separate command for each machine.

[sloanjd@amy sloanjd]$ ssh node1
...

Using a shell looping command is probably better since it will ensure that you don't skip any machines and can reduce typing. With the Bash shell, the following command will initiate your logon to the machines node1 through node99, each in turn.

[sloanjd@amy sloanjd]$ for ((i=1; i<100; i++))
> do
> ssh node${i}
> done

Just adjust the loop for a different number of machines. You will need to adjust the syntax accordingly for other shells. This goes fairly quickly and you'll need to do this only once.
Create a file that lists the individual machines in the cluster by IP address. For example, you might create a file called myhosts like the following:

[sloanjd@amy sloanjd]$ cat myhosts
172.16.1.1
172.16.1.2
172.16.1.3
172.16.1.4
172.16.1.5

This should contain the server as well as the clients. Next, run lamboot with the file's name as an argument.

[sloanjd@amy sloanjd]$ lamboot myhosts

LAM 7.0/MPI 2 C++/ROMIO - Indiana University

You now have a LAM/MPI daemon running on each machine in your cluster. Copy over the example you want to run, compile it with mpicc, and then run it with mpirun.

[sloanjd@amy sloanjd]$ cp /usr/share/doc/lam-oscar-7.0/examples/alltoall/alltoall.c $HOME
[sloanjd@amy sloanjd]$ mpicc -o alltoall alltoall.c
[sloanjd@amy sloanjd]$ mpirun -np 4 alltoall
Rank 0 not sending to myself
Rank 1 sending message "1" to rank 0
Rank 2 sending message "2" to rank 0
...

You should see additional output. The amount will depend on the number of machines in myhosts.
The previous chapter showed the use of OSCAR to coordinate the many activities that go into setting up and administering a cluster. This chapter discusses another popular kit for accomplishing roughly the same tasks.
NPACI Rocks is a collection of open source software for building a high-performance cluster. The primary design goal for Rocks is to make cluster installation as easy as possible. Unquestionably, its developers have gone a long way toward meeting this goal. To accomplish this, the default installation makes a number of reasonable assumptions about what software should be included and how the cluster should be configured. Nonetheless, with a little more work, it is possible to customize many aspects of Rocks.
When you install Rocks, you will install both the clustering software and a current version of Red Hat Linux updated to include security patches. The Rocks installation will correctly configure various services, so this is one less thing to worry about. Because Rocks installs its own copy of Red Hat Linux, you won't be able to add Rocks to an existing server or use it with some other Linux distribution.
Default installations tend to go very quickly and very smoothly. In fact, Rocks' management strategy assumes that you will deal with software problems on a node by reinstalling the system on that node rather than trying to diagnose and fix the problem. Depending on hardware, it may be possible to reinstall a node in under 10 minutes. Even if your systems take longer, after you start the reinstall, everything is automatic, so you don't need to hang around.
In this chapter, we'll look briefly at how to build and use a Rocks cluster. This coverage should provide you with enough information to decide whether Rocks is right for you. If you decide to install Rocks, be sure you download and read the current documentation. You might also want to visit Steven Baum's site, http://stommel.tamu.edu/~baum/npaci.html.
In this section we'll look at a default Rocks installation. We won't go into the same level of detail as we did with OSCAR, in part because Rocks offers a simpler installation. This section should give you the basics.
There are several things you need to do before you begin your installation. First, you need to plan your system. A Rocks cluster has the same basic architecture as an OSCAR cluster (see Figure 6-1). The head node or frontend is a server with two network interfaces. The public interface is attached to the campus network or the Internet while the private interface is attached to the cluster. With Rocks, the first interface (e.g., eth0) is the private interface and the second (e.g., eth1) is the public interface. (This is the opposite of what was described for OSCAR.)
You'll install the frontend first and then use it to install the compute nodes. The compute nodes use HTTP to pull the Red Hat and cluster packages from the front-end. Because Rocks uses Kickstart and Anaconda (described in Chapter 8), heterogeneous hardware is supported.
Diskless clusters are not an option with Rocks. It assumes you will have hard disks in all your nodes. For a default installation, you'll want at least an 8 GB disk on the frontend. For compute nodes, by altering the defaults, you can get by with smaller drives. It is probably easier to install the software on the compute nodes by booting from a CD-ROM, but if your systems don't have CD-ROM drives, you can install the software by booting from a floppy or by doing a network boot. Compute nodes should be configured to boot without an attached keyboard or should have a keyboard or KVM switch attached.
Rocks supports both Ethernet and Myrinet. For the cluster's private network, use a private address space distinct from the external address space per RFC 1918. It's OK to let an external DHCP server configure the public interface, but you should let Rocks configure the private interface.
To install Rocks, you'll first need the appropriate CD-ROMs. Typically, you'll go to the Rocks web site http://rocks.npaci.edu/Rocks/, follow the link to the download page, download the ISO images you want, and burn CD-ROMs from these images. (This is also a good time to download the user manuals if you haven't already done so.) Rocks currently supports x86 (Pentium and Athlon), x86_64 (AMD Opteron), and IA-64 (Itanium) architectures.
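Before burning, it is worth verifying the downloaded images against the checksums published on the download page. The host name and ISO file name below are only placeholders for whatever machine and image you actually use, and the cdrecord device specification will depend on your burner:

[root@desktop root]# md5sum rocks-base-3.2.0-i386.iso
[root@desktop root]# cdrecord -v speed=4 dev=0,0,0 rocks-base-3.2.0-i386.iso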
Be sure to download the software that is appropriate for your systems. You'll need at least two ISO images, maybe more depending upon the software you want. Every installation will require the Rocks Base and HPC Roll. The core install provides several flavors of MPICH, Ganglia, and PVFS. If you want additional software that is not part of the core Rocks installation, you'll need to download additional rolls. For example, if you want tripwire and chkrootkit, two common security enhancements, you could download the Area 51 roll. If you are interested in moving on to grid computing, Rocks provides rolls that ease that process (see the sidebar, "Rocks and Grids").
Currently available rolls include the following:
SGE roll: This roll includes the Sun Grid Engine (SGE), an open source distributed-management and job-queuing system for grids. Think of it as a grid-aware alternative to openPBS. For more information on SGE, visit http://gridengine.sunsource.net.

Grid roll: The NSF Middleware Initiative (NMI) grid roll contains a full complement of grid software, including the Globus toolkit, Condor-G, Network Weather Service, and MPICH-G2, to name only a few. For more information on the NMI project, visit http://www.nsf-middleware.org.

Intel roll: This roll installs and configures the Intel C compiler and the Intel FORTRAN compiler. (You'll still need licenses from Intel.) It also includes the MPICH environments built for these compilers. For more information on the Intel compilers and their use with Rocks, visit http://www.intel.com/software/products/distributors/rock_cluster.htm.

Area 51 roll: This roll currently includes tripwire and chkrootkit. tripwire is a security auditing package. chkrootkit examines a system for any indication that a root kit has been installed. For more information on these tools, visit http://www.tripwire.org and http://www.chkrootkit.org.

SCE roll: This roll includes the OpenSCE software that originated at Kasetsart University, Thailand. For more information on OpenSCE, visit http://www.opensce.org.

Java roll: The Java roll contains the Java Virtual Machine. For more information on Java, visit http://java.sun.com.

PBS roll: The Portable Batch System roll includes the OpenPBS and Maui queuing and scheduling software. For more information on these packages, see Chapter 11 or visit http://www.openpbs.org.

Condor roll: This roll includes the Condor workload management software. Condor provides job queuing, scheduling, and priority management along with resource monitoring and management. For more information on Condor, visit http://www.cs.wisc.edu/condor/.
Some rolls are not available for all architectures. It's OK to install more than one roll, so get what you think you may need now. Generally, you won't be able to add a roll once the cluster is installed. (This should change in the future.)
Once you've burned CD-ROMs from the ISO images, you are ready to start the installation. You'll start with the frontend.
Rocks and Grids
While grids are beyond the scope of this book, it is worth mentioning that, through its rolls mechanism, Rocks makes it particularly easy to move into grid computing. The grid roll is particularly complete, providing pretty much everything you'll need to get started: literally dozens of software tools and packages, including the Globus toolkit, Condor-G, Network Weather Service, and MPICH-G2. And these are just the core. If you are new to grids and want to get started, this is the way to go. (Appendix A includes the URLs for these tools.)
The frontend installation should go very smoothly. After the initial boot screens, you'll see a half dozen or so screens asking for additional information along with other screens giving status information for the installation. If you've installed Red Hat Linux before, these screens will look very familiar. On a blue background, you'll see the Rocks version information at the very top of the screen and interface directions at the bottom of the screen. In the center of the screen, you'll see a gray window with fields for user supplied information or status information. Although you can probably ignore them, as with any Red Hat installation, the Linux virtual consoles are available as shown in Table 7-1. If you have problems, don't forget these.
Table 7-1. Linux virtual consoles

Console   Use                Keystroke
1         Installation       Ctrl-Alt-F1
2         Shell prompt       Ctrl-Alt-F2
3         Installation log   Ctrl-Alt-F3
4         System messages    Ctrl-Alt-F4
5         Other messages     Ctrl-Alt-F5
Boot the frontend with the Rocks Base CD and stay with the machine. After a moment, you will see a boot screen giving you several options. Type frontend at the boot: prompt and press Enter. You need to do this quickly because the system will default to a compute node installation after a few seconds and the prompt will disappear. If you miss the prompt, just reboot the system and pay closer attention.
After a brief pause, the system prompts you to register your roll CDs. When it asks whether you have any roll CDs, click on Yes. When the CD drive opens, replace the Rocks Base CD with the HPC Roll CD. After a moment the system will ask if you have another roll CD. Repeat this process until you have added all the roll CDs you have. Once you are done, click on No and the system will prompt you for the original Rocks Base CD. Registration is now done, but at the end of the installation you'll be prompted for these disks again for the purpose of actual software installation.
The next screen prompts you for information that will be included in the web reports that Ganglia creates. This includes the cluster name, the cluster owner, a contact, a URL, and the latitude and longitude for the cluster location. You can skip any or all of this information, but it only takes a moment to enter. You can change all this later, but it can be annoying trying to find the right files. By default, the web interface is not accessible over the public interface, so you don't have to worry about others outside your organization seeing this information.
The next step is partitioning the disk drive. You can select Autopartition and let Rocks partition the disk using default values or you can manually partition the disk using Disk Druid. The current defaults are 6 GB for / and 1 GB for swap space. /export gets the remaining space. If you manually partition the drive, you need at least 6 GB for / and you must have a /export partition.
The next few screens are used to configure the network. Rocks begins with the private interface. You can choose to have DHCP configure this interface, but since this is on the internal network, it isn't likely that you want to do this. For the internal network, use a private address range that doesn't conflict with the external address range. For example, if your campus LAN uses 10.X.X.X, you might use 172.16.1.X for your internal network. When setting up clients, Rocks numbers machines from the highest number downward, e.g., 172.16.1.254, 172.16.1.253, ....
For the public interface, you can manually enter an IP address and mask or you can rely on DHCP. If you are manually entering the information, you'll be prompted for a routing gateway and DNS servers. If you are using DHCP, you shouldn't be asked for this information.
The last network setup screen asks for a node name. While it is possible to retrieve this information by DHCP, it is better to set it manually. Otherwise, you'll need to edit /etc/resolv.conf after the installation to add the frontend to the name resolution path. Choose the frontend name carefully. It will be written to a number of files, so it is very difficult to change. It is a very bad idea to try to change hostnames after installing Rocks.
Once you have the network parameters set, you'll be prompted for a root password. Then Rocks will format the filesystem and begin installing the packages. As the installation proceeds, Rocks provides a status report showing each package as it is installed, time used, time remaining, etc. This step will take a while.
Once the Rocks Base CD has been installed, you'll be prompted for each of the roll CDs once again. Just swap CDs when prompted to do so. When the last roll CD has been installed, the frontend will reboot.
Your frontend is now installed. You can move on to the compute nodes, or you can stop and poke around on the frontend first. The first time you log onto the frontend, you will be prompted for a file and passphrase for SSH.
Rocks Frontend Node - Wofford Rocks Cluster
Rocks 3.2.0 (Shasta)
Profile built 17:10 29-Jul-2004
Kickstarted 17:12 29-Jul-2004

It doesn't appear that you have set up your ssh key.
This process will make the files:
    /root/.ssh/identity.pub
    /root/.ssh/identity
    /root/.ssh/authorized_keys

Generating public/private rsa1 key pair.
Enter file in which to save the key (/root/.ssh/identity):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/identity.
Your public key has been saved in /root/.ssh/identity.pub.
The key fingerprint is:
86:ad:c4:e3:a4:3a:90:bd:7f:f1:bd:7a:df:f7:a0:1c [email protected]
The default file name is reasonable, but you really should enter a passphrase, one you can remember.
The next step is to install the compute nodes. Before you do this, you may want to make a few changes to the defaults. For example, you might want to change how the disks will be partitioned, what packages will be installed, or even which kernel will be used. For now, we'll stick with the defaults. Customizations are described in the next two sections, so you may want to read ahead before going on. But it's really easy to reinstall the compute nodes, so don't feel you have to master everything at once.
To install the compute nodes, you'll begin by running the program insert-ethers as root on the frontend. Next, you'll boot a compute node using the Rocks Base CD. Since the Rocks Base CD defaults to compute node install, you won't need to type anything on the cluster node. The insert-ethers program listens for a DHCP query from the booting compute node, assigns it a name and IP address, records information in its database, and begins the installation of the client.
Let's look at the process in a little more detail. insert-ethers collects MAC address information and enters it into the Rocks cluster database. It can also be used to replace (--replace), update (--update), and remove (--remove) information in the database. This information is used to generate the DHCP configuration file and the host file.
There is one potential problem you might face when using insert-ethers. If you have a managed Ethernet switch, when booted it will issue a DHCP request. You don't want to treat it like a compute node. Fortunately, the Rocks implementers foresaw this problem. When you start insert-ethers, you are given a choice of the type of appliance to install. You can select Ethernet Switch as an option and configure your switch. When you are done, quit and restart insert-ethers. This time select Compute. Now you are ready to boot your compute nodes. If you aren't setting up an Ethernet switch, you can just select Compute the first time you run insert-ethers.
The next step is to boot your compute nodes. As previously noted, you can use the Rocks Base CD to do this. If your compute nodes don't have CD-ROM drives, you have two other options. You can use a network boot if your network adapters support a PXE boot, or you can create a PXE boot floppy. Consult your hardware documentation to determine how to do a PXE boot using a network adapter. The Rocks FAQ, included in NPACI Rocks Cluster Distribution: Users Guide, has the details for creating a PXE boot floppy.
When insert-ethers runs, it displays a window labeled Inserted Appliances. As each compute node is booted, it displays the node's MAC address and assigned name. Typically, insert-ethers will name the systems compute-0-0, compute-0-1, etc. (The file /etc/hosts defines aliases for these, c0-0, c0-1, etc., for those of us who don't type well.) If you start insert-ethers with the command-line option --cabinet=1, it will generate the names compute-1-0, compute-1-1, etc. This allows you to create a two-tier naming system, if you want. You can change the starting point for the second number with the --rank option. See the insert-ethers(8) manpage for more details.
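For example, to start a second tier of names, you would launch insert-ethers like this before booting those nodes:

[root@frontend root]# insert-ethers --cabinet=1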
A couple of minutes after you reboot your compute node, it will eject the CD-ROM. You can take the CD-ROM and move on to your next machine. If you have a terminal connected to the system, you'll get a status report as the installation proceeds.
If you need to reinstall a node, you can use the shoot-node command. This is useful when changing the configuration of a node, e.g., adding a new package. This command takes the name of the machine or machines as an argument.
[root@frontend root]# shoot-node compute-0-0
Since this is run on the frontend, it can be used to remotely reinstall a system. This command is described in the shoot-node(8) manpage.
Since Rocks installs Linux for you, you will need to do a little digging to see how things are set up. Among other services, Rocks installs and configures 411 (an NIS replacement), Apache, DHCP, MySQL, NFS, NTP, Postfix, and SSH, as well as cluster-specific software such as Ganglia and PVFS. Configuration files are generally where you would expect them. You'll probably want to browse the files in /etc, /etc/init.d, /etc/ssh, and /etc/xinetd.d. Other likely files include crontab, dhcpd.conf, exports, fstab, gmetad.conf, gmond.conf, hosts, ntp.conf, and ntp/step-tickers. You might also run the commands
[root@frontend etc]# ps -aux | more
...
[root@frontend etc]# /sbin/service --status-all | more
...
[root@frontend etc]# netstat -a | more
...
The cluster software that Rocks installs is in /opt or /usr/share.
If you have been using Red Hat for a while, you probably have some favorite packages that Rocks may not have installed. Probably the best way to learn what you have is to just poke around and try things.
Starting with Rocks 3.1.0, 411 replaces NIS. 411 automatically synchronizes the files listed in /var/411/Files.mk; the password and group files are among these. When you add users, you'll want to use useradd.
[root@frontend 411]# useradd -p xyzzy -c "Joe Sloan" \
> -d /export/home/sloanjd sloanjd
...
This automatically invokes 411. When a user changes a password, you'll need to sync the changes with the compute nodes. You can do this with the command
[root@frontend root]# make -C /var/411
A more complete discussion of 411 can be found in the Rocks user's guide. At this time, there isn't a 411 man page. To remove users, use userdel.
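For example, to remove the account created above and make sure the change reaches the compute nodes, something like the following should work (resyncing with make is harmless even if 411 picks up the change on its own; the -r option, which also removes the home directory, is optional):

[root@frontend root]# userdel -r sloanjd
[root@frontend root]# make -C /var/411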
You'll probably want to start the X Window System so you can run useful graphical tools such as Ganglia. Before you run X for the first time, you'll need to run redhat-config-xfree86. If you are comfortable setting options, go for it. If you are new to the X Window System, you'll probably be OK just accepting the defaults. You can then start X with the startx command. (If you get a warning message about no screen savers, just ignore it.)
Once X is working, you'll need to do the usual local customizations such as setting up printers, creating a message of the day, etc.
Rocks uses Kickstart and Anaconda to install the individual compute nodes. However, rather than use the usual flat, text-based configuration file for Kickstart, Rocks decomposes the Kickstart file into a set of XML files for the configuration information. The Kickstart configuration is generated dynamically from these. These files are located in the /export/home/install/rocks-dist/enterprise/3/en/os/i386/build/nodes/ directory. Don't change these. If you need to create customization files, you can put them in the directory /home/install/site-profiles/3.2.0/nodes/ for Rocks Version 3.2.0. There is a sample file skeleton.xml that you can use as a template when creating new configuration files. When you make these changes, you'll need to apply the configuration change to the distribution using the rocks-dist command. The following subsections give examples. (For more information on rocks-dist, see the rocks-dist(1) manpage.)
If you want to install additional RPM packages, first copy those packages to the directory /home/install/contrib/enterprise/3/public/arch/RPMS, where arch is the architecture you are using, e.g., i386.
[root@frontend root]# mv ethereal-0.9.8-6.i386.rpm \
> /home/install/contrib/enterprise/3/public/i386/RPMS/
[root@frontend root]# mv ethereal-gnome-0.9.8-6.i386.rpm \
> /home/install/contrib/enterprise/3/public/i386/RPMS/
Next, create a configuration file extend-compute.xml. Change to the profile directory, copy skeleton.xml, and edit it with your favorite text editor such as vi.
[root@frontend root]# cd /home/install/site-profiles/3.2.0/nodes
[root@frontend nodes]# cp skeleton.xml extend-compute.xml
[root@frontend nodes]# vi extend-compute.xml
...
Next, add a line to extend-compute.xml for each package.
<package> ethereal </package>
<package> ethereal-gnome </package>
Notice that only the base name for a package is used; omit the version number and .rpm suffix.
Finally, apply the configuration change to the distribution.
[root@frontend nodes]# cd /home/install
[root@frontend install]# rocks-dist dist
...
You can now install the compute nodes and the desired packages will be included.
In general, it is probably a good idea to stick to one disk-partitioning scheme. Unless you turn the feature off as described in the next subsection, compute nodes will automatically be reinstalled after a power outage. If you are using multiple partitioning schemes, the automatic reinstallation could leave some drives with undesirable partitioning. Of course, the downside of a single partitioning scheme is that it may limit the diversity of hardware you can use.
To change the default disk-partitioning scheme Rocks uses to install compute nodes, create a replacement partition configuration file named replace-auto-partition.xml. Change to the profile directory, copy skeleton.xml, and edit it.
[root@frontend root]# cd /home/install/site-profiles/3.2.0/nodes
[root@frontend nodes]# cp skeleton.xml replace-auto-partition.xml
[root@frontend nodes]# vi replace-auto-partition.xml
...
Under the main section, you'll add something like the following:
<main>
  <part> / --size 2048 --ondisk hda </part>
  <part> swap --size 500 --ondisk hda </part>
  <part> /mydata --size 1 --grow --ondisk hda </part>
</main>
Apart from the XML tags, this is standard Kickstart syntax. This example, a partitioning scheme for an older machine, uses 2 GB for the root partition, 500 MB for a swap partition, and the rest of the disk for the /mydata partition.
The last step is to apply the configuration change to the distribution.
[root@frontend nodes]# cd /home/install
[root@frontend install]# rocks-dist dist
...
You can now install the system using the new partitioning scheme.
By default, a compute node will attempt to reinstall itself whenever it does a hard restart, e.g., after a power failure. You can disable this behavior by executing the next two commands.
[root@frontend root]# cluster-fork '/etc/rc.d/init.d/rocks-grub stop'
compute-0-0:
Rocks GRUB: Setting boot action to 'boot current kernel':  [  OK  ]
...
[root@frontend root]# cluster-fork '/sbin/chkconfig --del rocks-grub'
compute-0-0:
...
The command cluster-fork is used to execute a command on every machine in the cluster. In this example, the two commands enclosed in quotes will be executed on each compute node. Of course, if you really wanted to, you could log onto each, one at a time, and execute those commands. cluster-fork is a convenient tool to have around. Additional information can be found in the Rocks user's guide. There is no manpage at this time.
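As a quick illustration of how handy cluster-fork can be, you might check the load on every node at once; each node's name is printed, followed by the output of the command on that node:

[root@frontend root]# cluster-fork uptime
compute-0-0:
...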
Creating and installing custom kernels on the compute nodes, although more involved, is nonetheless straightforward under Rocks. Briefly, you'll build a new kernel on one of the compute nodes, package it using rpm, copy the package to the frontend, rebuild the Rocks distribution with rocks-dist, and then reinstall the compute nodes. The details are provided in the Rocks user's guide along with descriptions of other customizations you might want to consider.
One of Rocks' strengths is the web-based management tools it provides. Initially, these are available only from within the clusters since the default firewall configuration blocks HTTP connections to the frontend's public interface. If you want to allow external access, you'll need to change the firewall configuration. To allow access over the public interface, edit the file /etc/sysconfig/iptables and uncomment the line:
-A INPUT -i eth1 -p tcp -m tcp --dport www -j ACCEPT
Then restart the iptables service.
[root@frontend sysconfig]# service iptables restart
Some pages, for security reasons, will still be unreachable.
To view the management page locally, log onto the frontend, start the X Window System, start your browser, and go to http://localhost. You should get a screen that looks something like Figure 7-1.
The links on the page will vary depending on the software or rolls you chose to install. For example, if you didn't install PBS, you won't see a link to the PBS Job Queue. Here is a brief description of the links shown on this page.
Rocks maintains a MySQL database for the server. The database is used to generate service-specific configuration files such as /etc/hosts and /etc/dhcpd.conf. This phpMyAdmin web interface to the database can be accessed through the first link. This page will not be accessible over the public interface even if you've changed the firewall. Figure 7-2 shows the first screen into the database. You can follow the links on the left side of the page to view information about the cluster.
This link provides a way into Ganglia's home page. Ganglia, a cluster monitoring package, is described in Chapter 10.
This link takes you to a page that displays the top processes running on the cluster. This is basically the Unix top command, but provides cluster-wide information. The columns are similar to those provided by top except for the first two. The first, TN, gives the age of the information in seconds, and the second, HOST, is the host name for the cluster node that the process is running on. You can look at the top(1) manpage for information on how to interpret this page. Figure 7-3 shows the Cluster Top screen for an idle cluster.
PBS is described in Chapter 11. You should see the PBS link only if you've installed the PBS roll.
This is an alert system that sends RSS-style news items for events within the cluster. It is documented in the Rocks Reference Guide.
This link takes you into the /proc subdirectory. The files in this subdirectory contain dynamic information about the state of the operating system. You can examine files to see the current configuration, and, in some cases, change the file to alter the configuration. This page is accessible only on a local system.
The Cluster Distribution link is a link into the /home/install directory on the frontend. This directory holds the RPM packages used to construct the cluster. This page is accessible only on a local system.
This link provides a graphical representation of the information used to create the Kickstart file. This is generated on the fly. Different display sizes are available.
This link returns a page that lists the various rolls that have been installed on your cluster.
These are online versions of the Rocks documentation that have been alluded to so often in this chapter.
This link generates a PDF document containing labels for each node in the cluster. The labels contain the cluster name, node name, MAC address, and the Rocks logo. If your cluster name is too long, the logo will obscure it. You should be able to print the document on a standard sheet of labels such as Avery 5260 stock.
This will take you to the Rocks registration site, so you can add your cluster to the list of other Rocks clusters.
Finally, there is a link to the Rocks home page.
Before we leave Rocks, let's look at a programming example you can use to convince yourself that everything is really working.
While Rocks doesn't include LAM/MPI, it gives you a choice of several MPICH distributions. The /opt directory contains subdirectories for MPICH, MPICH-MPD, and MPICH2-MPD. Under /opt/mpich, there is also a version of MPICH for Myrinet users. The distinctions are described briefly in Chapter 9. We'll stick to MPICH for now.
You can begin by copying one of the examples to your home directory.
[sloanjd@frontend sloanjd]$ cd /opt/mpich/gnu/examples
[sloanjd@frontend examples]$ cp cpi.c ~
[sloanjd@frontend examples]$ cd
Next, compile the program.
[sloanjd@frontend sloanjd]$ /opt/mpich/gnu/bin/mpicc cpi.c -o cpi
(Clearly, you'll want to add this directory to your path once you decide which version of MPICH to use.)
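For the GNU-built MPICH used here, a line like the following, added to your shell startup file (e.g., ~/.bashrc), should take care of it:

[sloanjd@frontend sloanjd]$ export PATH=/opt/mpich/gnu/bin:$PATH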
Before you can run the program, you'll want to make sure SSH is running and that no error or warning messages are generated when you log onto the remote machines. (SSH is discussed in Chapter 4.)
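A quick way to check is to run a simple command on one of the compute nodes; you should not be prompted for a password or see any warnings. (The node name here is just the default first compute node.)

[sloanjd@frontend sloanjd]$ ssh compute-0-0 hostname
compute-0-0.local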
Now you can run the program. (Rocks automatically creates the machines file used by the system, so that's one less thing to worry about. But you can use the -machinefile filename option if you wish.)
[sloanjd@frontend sloanjd]$ /opt/mpich/gnu/bin/mpirun -np 4 cpi
Process 0 on frontend.public
Process 2 on compute-0-1.local
Process 1 on compute-0-0.local
Process 3 on compute-0-0.local
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.010533
That's all there is to it.
Since Rocks also includes the High-Performance Linpack (HPL) benchmark, you might want to run it. You'll need the HPL.dat file; with Rocks 3.2.0, you can copy it to your directory from /var/www/html/rocks-documentation/3.2.0/. To run the benchmark, use the command
[sloanjd@frontend sloanjd]$ /opt/mpich/gnu/bin/mpirun -nolocal \
> -np 2 /opt/hpl/gnu/bin/xhpl
...
(Add a machine file if you like.) You can find more details in the Rocks user manual.
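If you do want to point mpirun at a specific machine file, a sketch along these lines should work, assuming you've copied HPL.dat into your home directory and created a file named machines that lists your node names (the machines filename is just a placeholder):

[sloanjd@frontend sloanjd]$ cp /var/www/html/rocks-documentation/3.2.0/HPL.dat .
[sloanjd@frontend sloanjd]$ /opt/mpich/gnu/bin/mpirun -nolocal -machinefile machines \
> -np 2 /opt/hpl/gnu/bin/xhpl
...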