Part III: Building Custom Clusters
Chapter 8. Cloning Systems
Setting up a cluster means setting up machines, and hopefully lots of machines. While you should begin with a very small number of machines as you figure out what you want, eventually you'll get to the point where you are mindlessly installing system after system. Fortunately, most of those machines will have identical setups. You could simply repeat the process for each machine, but this will be both error-prone and immensely boring. You need a way to automate the process.
The approach you need depends on the number of machines to be set up and configured, the variety of machines, how mission critical the cluster is, and your level of patience. For three or four machines, a manual install and configuration of each machine is a reasonable approach, particularly if you are working with an odd mix of different machines so that each setup is different. But even with a very small number of machines, the process will go more smoothly if you can automate some of the post-installation tasks such as copying configuration files.
Unless you have the patience of Job, with more than eight or ten machines in your cluster, you'll want to automate as much of the process as possible. And as your cluster's continuous operation becomes more crucial, the need for an automated approach becomes even more important.
This chapter begins with a quick look at simple approaches to ease configuring multiple systems after the operating system has been installed. These techniques are useful for any size cluster. Even if you are clearly in the fully automated camp, you should still skim this section since these techniques apply to maintaining clusters as well as setting up clusters.
Next, three tools that are useful when building larger clusters are described: Kickstart, g4u (ghost for Unix), and SystemImager (part of the Systems Installation Suite). These tools are representative of three different approaches that can be used. Kickstart is a package-based installation program that allows you to automate the installation of the operating system. g4u is a simple image-based program that allows you to copy and distribute disk images. SystemImager is a more versatile set of tools with capabilities that extend beyond installing systems. The tools in SystemImager allow you to build, clone, and configure a system. While these tools vary in scope, each does what it was designed to do quite well. There are many other tools not discussed here.
8.1 Configuring Systems
Cloning refers to creating a number of identical systems. In practice, you may not always want systems that are exactly alike. If you have several different physical configurations, you'll need to adapt to match the hardware you have. It would be pointless to use the identical partitioning schemes on hard disks with different capacities. Furthermore, each system will have different parameters, e.g., an IP address or host name that must be unique to the system.
Setting up a system can be divided roughly into two stages: installing the operating system and then customizing it to fit your needs. This division is hazy at best. Configuration changes to the operating system could easily fall into either category. Nonetheless, many tools and techniques fall primarily into one of these stages, so the distinction is helpful. We'll start with the second task first since you'll want to keep this ongoing process in mind when looking at tools designed for installing systems.
8.1.1 Distributing Files
The major part of the post-install configuration is getting the right files onto your system and keeping those files synchronized. This applies both to configuring the machine for the first time and to maintaining existing systems. For example, when you add a new user to your cluster, you won't want to log onto every machine in the cluster and repeat the process. It is much simpler if you can push the relevant accounting files to each machine in the cluster from your head node.
What you will want to copy will vary with your objectives, but Table 8-1 lists a few likely categories.
Table 8-1. Types of files
Accounting files, e.g., /etc/passwd, /etc/shadow, /etc/group, /etc/gshadow
Configuration files, e.g., /etc/motd, /etc/fstab, /etc/hosts, /etc/printcap.local
Security configuration files such as firewall rulesets or public keys
Packages for software you wish to install
Configuration files for installed software
User scripts
Kernel images and kernel source files
Many of these are one-time copies, but others, like the accounting files, will need to be updated frequently.
You have a lot of options. Some approaches work best when moving sets of files but can be tedious when dealing with just one or two files. If you are dealing with a number of files, you'll need some form of repository. (While you could pack a collection of files into a single file using tar, this approach works well only if the files aren't changing.) You could easily set up your own HTTP or FTP server for both packages and customized configuration files, or you could put them on a floppy or CD and carry the disk to each machine. If you are putting together a repository of files, perhaps the best approach is to use NFS.
With NFS, you won't need to copy anything. But while this works nicely with user files, it can create problems with system files. For example, you may not want to mount a single copy of /etc using NFS since, depending on your flavor of Linux, there may be files in the /etc that are unique to each machine, e.g., /etc/HOSTNAME. The basic problem with NFS is that the granularity (a directory) is too coarse. Nonetheless, NFS can be used as a first step in distributing files. For example, you might set up a shared directory with all the distribution RPMs along with any other software you want to add. You can then mount this directory on the individual machines. Once mounted, you can easily copy files where you need them or install them from that directory. For packages, this can easily be done with a shell script.
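As a rough sketch of the kind of shell script mentioned above, the following installs every RPM found in a mounted software directory. The mount point /mnt/rpms, the variables RPM_DIR and RPM_CMD, and the function name install_shared_rpms are all assumptions made for illustration; adjust them to match your own layout.

```shell
#!/bin/sh
# Sketch: install every RPM found in an NFS-mounted software directory.
# RPM_DIR (the mount point) is an assumption; adjust it to your site.
RPM_DIR="${RPM_DIR:-/mnt/rpms}"
RPM_CMD="${RPM_CMD:-rpm -Uvh}"    # -U upgrade or install, -v verbose, -h hash marks

install_shared_rpms() {
    count=0
    for pkg in "$RPM_DIR"/*.rpm; do
        # With no matches the pattern stays literal, so skip non-files.
        [ -f "$pkg" ] || continue
        # RPM_CMD is left unquoted on purpose so its options are split.
        $RPM_CMD "$pkg" || echo "failed: $pkg" >&2
        count=$((count + 1))
    done
    echo "processed $count package(s)"
}
```

On each node you would first mount the shared directory and then call install_shared_rpms. Installing packages one at a time is the simplest approach but can fail when packages depend on each other; in that case, passing all the files to a single rpm command may work better.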
While any of these approaches will work and are viable approaches on an occasional basis, they are a little clunky, particularly if you need to move only a file or two. Fortunately, there are also a number of commands designed specifically to move individual files between machines. If you have enabled the r-service commands, you could use rcp. A much better choice is scp, the SSH equivalent. You could also consider rdist. Debian users should consider apt-get. cpush, one of the tools supplied in C3 and described in Chapter 10, is another choice. One particularly useful command is rsync, which will be described next.
8.1.1.1 Pushing files with rsync
rsync is open source software, released under the GNU General Public License, written by Andrew Tridgell and Paul Mackerras. rsync is sometimes described as a faster, more flexible replacement for rcp, but it is really much more. rsync has several advantages. It can synchronize a set of files very quickly because it sends only the differences in the files over the link. It can also preserve file settings. Finally, since other tools described later in this book such as SystemImager and C3 use it, a quick review is worthwhile.
rsync is included in most Linux distributions. It is run as a client on the local machine and as a server on the remote machine. With most systems, before you can start the rsync daemon on the machine that will act as the server, you'll need to create both a configuration file and a password file.[1]
[1] Strictly speaking, the daemon is unnecessary if you have SSH or RSH.
A configuration file is composed of optional global commands followed by one or more module sections. Each module or section begins with a module name and continues until the next module is defined. A module name associates a symbolic name to a directory. Modules are composed of parameter assignments in the form option = value. An example should help clarify this.
# a sample rsync configuration file -- /etc/rsyncd.conf
#
[systemfiles]
# source/destination directory for files
path = /etc
# authentication -- users, hosts, and password file
auth users = root, sloanjd
hosts allow = amy basil clara desmond ernest fanny george hector james
secrets file = /etc/rsyncd.secrets
# allow read/write
read only = false
# UID and GID for transfer
uid = root
gid = root
There are no global commands in this example, only the single module [systemfiles]. The name is an arbitrary string (hopefully not too arbitrary) enclosed in square brackets. For each module, you must specify a path option, which identifies the target directory on the server accessed through the module.
The default is for files to be accessible to all users without a password, i.e., anonymous rsync. This is not what we want, so we use the next three options to limit access. The auth users option specifies a list of users that can access a module, effectively denying access to all other users. The hosts allow option limits the machines that can use this module. If omitted, then all machines will have access. In place of a list of machines, an address/mask pattern can be used. The secrets file option specifies the name of a password file used for authentication. The file is used only if the auth users option is also used. The format of the secrets file is user:password, one entry per line. Here is an example:
root:RSpw012...
The secrets file should be readable only by root, and should not be writable or executable. rsync will balk otherwise.
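One way to create the secrets file with suitably restrictive permissions is sketched below. The function name make_secrets_file and its arguments are illustrative only, not part of rsync.

```shell
#!/bin/sh
# Sketch: create an rsync secrets file that only its owner can read.
# The function name and arguments are illustrative.
make_secrets_file() {
    file="$1"; user="$2"; password="$3"
    # umask 077 keeps the file from ever being group- or world-accessible,
    # even for the instant between its creation and the chmod.
    ( umask 077 && printf '%s:%s\n' "$user" "$password" > "$file" ) || return 1
    chmod 400 "$file"    # read-only for the owner, no access for anyone else
}
```

Run as root, a call such as make_secrets_file /etc/rsyncd.secrets root 'your-password' would produce a one-line file in the user:password format described above.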
By default, files are read only; i.e., files can be downloaded from the server but not uploaded to the server. Set the read only option to false if you want to allow writing, i.e., uploading files from clients to the server. Finally, the uid and gid options set the user and group identities for the transfer. The configuration file is described in detail in the manpage rsyncd.conf(5). As you might imagine, there are a number of other options not described here.
rsync usually uses rsh or ssh for communications (although it is technically possible to bypass these). Consequently, you'll need to have a working version of rsh or ssh on your system before using rsync.
To move files between machines, you will issue an rsync command on a local machine, which will contact an rsync daemon on a remote machine. Thus, to move files rsync must be installed on each client and the remote server must be running the rsync daemon. The rsync daemon is typically run by xinetd but can be run as a separate process if it is started using the --daemon option. To start rsync from xinetd, you need to edit the file /etc/xinetd.d/rsync, change the line disable = yes to disable = no, and reinitialize or restart xinetd. You can confirm it is listening by using netstat.
[root@fanny xinetd.d]# netstat -a | grep rsync
tcp        0      0 *:rsync                 *:*                     LISTEN
rsync uses TCP port 873 by default.
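The xinetd change described above can also be scripted, which is handy when the daemon has to be enabled on many nodes. This is only a sketch: the function name enable_xinetd_service is made up for the example, and it assumes the file contains a line of the form disable = yes.

```shell
#!/bin/sh
# Sketch: enable a service in its xinetd configuration file by changing
# "disable = yes" to "disable = no". The function name is illustrative.
enable_xinetd_service() {
    conf="$1"
    cp "$conf" "$conf.bak" || return 1    # keep a backup of the original
    # Match the line regardless of spacing around "=" and rewrite only the value.
    sed 's/^\([[:space:]]*disable[[:space:]]*=[[:space:]]*\)yes/\1no/' \
        "$conf.bak" > "$conf"
}
```

After calling enable_xinetd_service /etc/xinetd.d/rsync, restart xinetd and check with netstat as shown earlier.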
rsync can be used in a number of different ways. Here are a couple of examples to get you started. In this example, the file passwd is copied from fanny to george while preserving the group, owner, permissions, and time settings for the file.
[root@fanny etc]# rsync -gopt passwd george::systemfiles
Password:
Recall systemfiles is the module name in the configuration file. Note that the system prompts for the password that is stored in the /etc/rsyncd.secrets file on george. You can avoid this step (useful in scripts) with the --password-file option. This is shown in the next example when copying the file shadow.
[root@fanny etc]# rsync -gopt --password-file=rsyncd.secrets shadow \
george::systemfiles
If you have the rsync daemon running on each node in your cluster, you could easily write a script that would push the current accounting files to each node. Just be sure you get the security right.
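A minimal sketch of such a script follows. The node list, the password file location /root/rsync.pass, and the function name push_accounts are assumptions for illustration. It also assumes the file named by --password-file holds only the password on its first line, which is what the rsync client expects, and that the file is not readable by other users.

```shell
#!/bin/sh
# Sketch: push the accounting files to the rsync daemon on each node.
# NODES, PASSFILE, and the function name are assumptions for illustration.
NODES="${NODES:-george hector ida james}"
MODULE="${MODULE:-systemfiles}"
FILES="${FILES:-/etc/passwd /etc/shadow /etc/group /etc/gshadow}"
PASSFILE="${PASSFILE:-/root/rsync.pass}"   # password only, not user:password
RSYNC="${RSYNC:-rsync}"

push_accounts() {
    status=0
    for node in $NODES; do
        # -gopt preserves group, owner, permissions, and times.
        # FILES is unquoted on purpose so each file is a separate argument.
        if ! $RSYNC -gopt --password-file="$PASSFILE" $FILES "${node}::${MODULE}"; then
            echo "push to $node failed" >&2
            status=1
        fi
    done
    return $status
}
```

Calling push_accounts from the head node after adding a user would update every node in the list. A failure on one node is reported but does not stop the remaining transfers.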
In the preceding examples, rsync was used to push files. It can also be used to pull files. (fanny has the same configuration files as george.)
[root@george etc]# rsync -gopt fanny::systemfiles/shadow /etc/shadow
Notice that the source file is actually /etc/shadow but the /etc is implicit because it is specified in the configuration file.
rsync is a versatile tool. It is even possible to clone running systems with rsync. Other command forms are described in the manpage rsync(1).
8.2 Automating Installations
There are two real benefits from an automated installation: it should save you work, and it will ensure the consistency of your installation, which will ultimately save you a lot more work. There are several approaches you can take, but the key to any approach is documentation. You'll first want to work through one or more manual installations to become clear on the details. You need to determine how you want your system configured and in what order the configuration steps must be done. Create an install and a post-install checklist.
If you are only doing a few machines, you can do the installations manually from the checklist if you are very careful. But this can be an error-prone activity, so even small clusters can benefit from automated installs. If you are building a large cluster, you'll definitely need some tools. There are many. This chapter focuses on three fairly representative approaches: Red Hat's Kickstart, g4u, and SystemImager.
Each of the tools described in this chapter has its place. Kickstart does a nice job for repetitive installations. It is the best approach if you have different hardware. You just create and edit a copy of the configuration file for each machine type. However, Kickstart may not be the best tool for post-installation customizations.
With image software like g4u or SystemImager, you can install software and reconfigure systems to your heart's delight before cloning. If you prepare your disk before using it, g4u images use less space than SystemImager, and it is definitely faster. g4u is the simplest tool to learn to use and is largely operating system independent. SystemImager is the more versatile tool, but comes with a significant learning curve. Used in combination with rsync, it provides a mechanism to maintain your systems as well as install them. In the long run, this combination may be your best choice.
8.2.1 Kickstart
Red Hat's Kickstart is a system designed to automate the installation of a large number of identical Linux systems. Similar programs exist for other releases, such as DrakX for Mandrake Linux and Fully Automatic Installation (FAI) for Debian. A Kickstart installation can be done using a local CD-ROM or hard drive, or over a network using FTP, NFS, or HTTP. We'll look at using a local CD-ROM and using NFS over a network. NFS is preferable when working with a large number of machines.
Anaconda is the Red Hat installation program. It is written in Python with some custom modules in C. Anaconda is organized in stages. The first stage is an installer which loads kernel modules needed later. It is this loader that goes to the appropriate installation source. Finally, Anaconda has an auto-install mechanism, Kickstart, that allows installs to be scripted via the Kickstart configuration file.
8.2.1.1 Configuration file
The first step in using Kickstart is to create a Kickstart configuration file. Once you have the configuration file, you'll create a boot disk and start the installation. You have two options in creating a configuration file: you can edit an existing configuration file or you can use Red Hat's Kickstart Configurator program to create a new file. While the configuration program has a nice GUI and is easy to use, older versions don't give you the option of reopening an existing configuration file. So with the older version, you'll need to get everything right the first time, start over from scratch, or manually edit the file that it creates after the fact.
Using Kickstart Configurator is straightforward. Since it provides a GUI, you'll need to be running the X Window System. You can start it from a console window with the command /usr/sbin/ksconfig or, if you are using GNOME, from Main Menu Button > Programs > System > Kickstart Configurator. Figure 8-1 shows the initial window.
Figure 8-1. Kickstart Configurator
Simply work your way down the lists on the left, setting the fields on the right as needed. Most of what you'll see will be familiar questions from a normal installation, although perhaps in slightly more detail. On the second screen, Installation Method, you'll be asked for the installation method: CD-ROM, FTP, etc. The last two screens ask for pre-installation and post-installation scripts, allowing you to add additional tasks to the install. When you are done, save the file.
Alternatively, you could use an existing configuration file. The Red Hat installation program creates a Kickstart file for the options you select when you do an installation. This is saved as /root/anaconda-ks.cfg. (There is also a template for a configuration file on the Red Hat documentation disk called sample.ks, but it is a bit sparse.) If you have already done a test installation, you may have something very close to what you need, although you may want to tweak it a bit.
Once you have a configuration file, you may need to make a few changes. Often, manually editing an existing configuration file is the easiest way to get exactly what you want. Since the configuration is a simple text file, this is a very straightforward process. The configuration file is divided into four sections that must be in the order they are described here.
The command section comes first and contains basic system information such as keyboard and mouse information, the disk partition, etc. Here is part of a command section with comments explaining each command:
# Kickstart file
# Do a clean install rather than an upgrade (optional).
install
# Install from a CD-ROM; could also be nfs, hard drive, or
# a URL for FTP or HTTP (required).
cdrom
# language used during installation (required)
lang en_US
# languages to install on system (required)
langsupport --default en_US.iso885915 en_US.iso885915
# type of keyboard (required)
keyboard us
# type of mouse (required)
mouse genericps/2 --device psaux --emulthree
# X configuration (optional)
xconfig --card "Matrox Millennium G200" --videoram 8192 --hsync 30.0-60.0 --vsync 47.5-125.0 --resolution 1024x768 --depth 16 --startxonboot
# network setup (optional)
network --device eth0 --bootproto dhcp
# root password (required)
rootpw --iscrypted $1$ÌZ5ÙÏÍÙÑ$Ulh7W6TkpQ3O3eTHtk4wG1
# firewall setup (optional)
firewall --medium --dhcp --port ssh:tcp
# system authentication (required)
authconfig --enableshadow --enablemd5
# timezone (required)
timezone --utc America/New_York
# bootloader (required)
bootloader --md5pass=$1$Åq9erÒE$HoYKj.adlPZyv4mGtc62W.
# remove old partitions from disk (optional)
clearpart --all --drives=hda
# partition information (required)
part /boot --fstype ext3 --size=50 --ondisk=hda
part / --fstype ext3 --size=1100 --grow --ondisk=hda
part swap --size=256 --ondisk=hda
Other options and details can be found in the first chapter of The Official Red Hat Linux Customization Guide on the Red Hat documentation disk.
The second part of the configuration file lists the packages that will be installed. This section begins with the line %packages. Here is a part of a sample listing for this section:
%packages
@ Printing Support
@ Classic X Window System
@ X Window System
@ GNOME
@ Sound and Multimedia Support
@ Network Support
@ Software Development
@ Workstation Common
...
balsa
gnumeric-devel
esound-devel
ImageMagick-c++-devel
mozilla-chat
...
Often you need to list only a component, not the individual packages. In this example, the lines starting with @ are all components. The remaining lines are all individual packages.
The last two sections, the pre-install and post-install configuration sections, are optional. These are commands that are run immediately before and immediately after installation. Here is an example that adds a user:
%post
/usr/sbin/useradd sloanjd
chfn -f 'Joe Sloan' sloanjd
/usr/sbin/usermod -p '$1$ÎgùyUDî$oyWJSirX8I0XElXVGXesG2.' sloanjd
Note that a pre-install section is not run in a chroot environment, while a post-install section is.[2] Basically, these sections provide a primitive way of doing custom configurations. This can be useful for small changes but is awkward for complex tasks. For more details about the configuration file, see the Red Hat documentation.
[2] A chroot environment restricts access to the part of the filesystem you are working in, denying access to the remainder of the filesystem.
8.2.1.2 Using Kickstart
Once you have the Kickstart file, you need to place the file where it will be available to the system you are configuring. This can be done in several ways depending on how you will boot the system. For a CD-ROM installation, you could simply copy the file over to a floppy.
[root@amy root]# mount /mnt/floppy
[root@amy root]# cp ks.cfg /mnt/floppy/ks.cfg
[root@amy root]# umount /mnt/floppy
Reboot your system from an installation CD-ROM. (If your system won't boot from a CD-ROM, you could create a floppy boot disk and copy the configuration file onto it.) With this approach, you'll need to tell the system where to find the configuration file. At the boot prompt, enter the command
boot: linux ks=floppy
While you will be able to complete the installation without typing anything else, you will still need to swap CD-ROMs. This probably isn't what you had in mind, but it is a good, quick way to test your Kickstart file.
If you want to do a network installation, you can provide the installation files via FTP, NFS, or HTTP. You will need to set up the corresponding server, make the appropriate changes to the Kickstart configuration file and copy it to the server, and create a network boot disk. (A network or PXE boot is also an option.) If you want to do an unattended installation, you will also need a DHCP server to provide both the IP address and the location of the Kickstart configuration file. Using a boot disk with an NFS server is probably the most common approach.
To set up an NFS server, you'll need to identify a machine with enough free space to hold all the installation CD-ROMs, copy over the contents of the CD-ROMs, and configure the NFS server software. For example, to install Red Hat 9, you might begin by creating the directory /export/9.0 and copying over the distribution files.
[root@fanny root]# mkdir -p /export/9.0
[root@fanny root]# mount /mnt/cdrom
[root@fanny root]# cp -arv /mnt/cdrom/RedHat /export/9.0
...
[root@fanny root]# eject cdrom
You'll repeat the last three steps for each CD-ROM.
To configure NFS, you'll need to install the NFS package if it is not already installed, edit /etc/exports so that the target can mount the directory with the files, e.g., /export/9.0, and start or restart NFS. For example, you might add something like the following lines to /etc/exports.
/export/9.0 george hector ida james
/kickstart george hector ida james
This allows the four listed machines access to the installation directory and the directory holding the Kickstart configuration file. You'll start or restart NFS with either /sbin/service nfs start or /sbin/service nfs restart.
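If you set up exports on more than one server, or rerun your setup steps, it helps to add these lines only when they are missing. The following is a sketch under that assumption; the function name add_export and its argument order are invented for the example.

```shell
#!/bin/sh
# Sketch: add an export line to an exports file only if the directory
# is not already listed. Function name and arguments are illustrative.
add_export() {
    exports_file="$1"; dir="$2"; shift 2
    # An existing entry starts with the directory followed by whitespace.
    # (Dots in the path act as regex wildcards, which is close enough here.)
    if grep -q "^${dir}[[:space:]]" "$exports_file" 2>/dev/null; then
        echo "$dir already exported"
        return 0
    fi
    # Remaining arguments are the hosts allowed to mount the directory.
    echo "$dir $*" >> "$exports_file"
}
```

For example, add_export /etc/exports /export/9.0 george hector ida james would append the first line shown above, and a second identical call would leave the file unchanged. You still need to start or restart NFS afterward.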
Since you are doing a network install, you'll need to replace the cdrom entry in ks.cfg with information about the NFS server such as
nfs --server 10.0.32.144 --dir /export/9.0
network --device eth0 --bootproto dhcp
The second line says to use DHCP, which is the default if this information isn't provided. While not always necessary, it may be safer in some circumstances to use IP addresses rather than host names.
If you aren't using PXE, you'll need a network boot disk. It is tempting to think that, since we have specified an NFS install in the Kickstart file, any boot disk should work. Not so! Put a blank floppy in your floppy drive, mount the first distribution CD-ROM, change to the images subdirectory, and then use the following command:
[root@amy images]# dd if=bootnet.img of=/dev/fd0 bs=1440k
...
If you don't need to do an unattended installation, the simplest approach is to copy the configuration file to the boot floppy and tell the boot loader where to find the file, just as you did with the CD-ROM installation. If you want to do an unattended installation, things are a little more complicated.
For an unattended installation, you will need to copy the Kickstart configuration file onto your NFS server and edit the boot disk configuration file. While you can place the file in the installation directory of your NFS server, a more general approach is to create a separate directory for Kickstart configuration files such as /kickstart. You'll need to export this directory via NFS as shown earlier. If you only need one configuration file, ks.cfg is the usual choice. However, if you create multiple Kickstart configuration files, you can use a convention supported by Kickstart. Name each file using the format IP-number-kickstart, where IP-number is replaced by the IP address of the target node, such as 10.0.32.146-kickstart. This allows you to maintain a different configuration file for each machine in your cluster.
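Creating the per-machine files by hand gets tedious for more than a few nodes. As a sketch, the following copies a common template to one file per IP address using the naming convention just described. The function name make_node_kickstarts and the idea of starting from a single template are assumptions; you would still edit each copy where machines differ.

```shell
#!/bin/sh
# Sketch: create one Kickstart file per node, named IP-number-kickstart,
# from a common template. Function name and arguments are illustrative.
make_node_kickstarts() {
    template="$1"; outdir="$2"; shift 2
    # Every remaining argument is the IP address of a target node.
    for ip in "$@"; do
        cp "$template" "${outdir}/${ip}-kickstart" || return 1
    done
}
```

A call such as make_node_kickstarts /kickstart/ks.cfg /kickstart 10.0.32.146 10.0.32.147 would create /kickstart/10.0.32.146-kickstart and /kickstart/10.0.32.147-kickstart.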
To access the file, you need to tell the client where to find the configuration file. For testing, you can do this manually at the boot loader. For example, you might enter something like
boot: linux ks=nfs:10.0.32.144:/kickstart/
This tells the loader to use the NFS server 10.0.32.144 and look in the /kickstart directory. It will look for a file using the name format IP-number-kickstart. Alternatively, you could give a complete file name.
For an unattended installation, you will need to edit syslinux.cfg on the boot disk, changing the line
default
to something like
default linux ks=nfs:10.0.32.144:/kickstart/
You might also shorten the timeout. Once done, you just insert the floppy and power up the node. The remainder of the installation will take place over your network.
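If you prepare several boot disks, the edit to syslinux.cfg can be scripted. This sketch assumes the file has lines beginning with default and timeout, as syslinux configuration files normally do, and that the floppy is already mounted. The function name set_kickstart_default is made up for the example. In syslinux the timeout value is given in tenths of a second.

```shell
#!/bin/sh
# Sketch: point a boot disk's syslinux.cfg at a Kickstart location and
# shorten the timeout. Function name and arguments are illustrative.
set_kickstart_default() {
    cfg="$1"; ks="$2"; timeout="$3"
    tmp=$(mktemp) || return 1
    # Replace the whole "default" line and the whole "timeout" line;
    # "|" is the sed delimiter because the ks value contains slashes.
    sed -e "s|^default.*|default linux ks=$ks|" \
        -e "s|^timeout.*|timeout $timeout|" \
        "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

With the boot floppy mounted at /mnt/floppy, set_kickstart_default /mnt/floppy/syslinux.cfg nfs:10.0.32.144:/kickstart/ 50 would make the change shown above and set a five-second timeout.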
While Kickstart does what it was designed to do quite well, there are some severe limitations to what it can do. As a package-based installation, there is no easy way to deal with needs that aren't package-based. For example, if you recompile your kernel, modify configuration files, or install non-package software, you'll need to do some kind of scripting to deal with these special cases. That may be OK for one or two changes, but it can become tedious very quickly. If you need to make a number of customizations, you may be better served with an image-based tool like g4u or SystemImager.
8.2.2 g4u
Image copying is useful in any context where you have a large number of identical machines. While we will be using it to clone machines in a high-performance cluster, it could also be used in setting up a web server farm, a corporate desktop environment, or a computer laboratory. With image copying, you begin by building a sample machine, installing all the software needed, and doing any desired customizations. Then you copy over an image of the disk to other machines, causing all the added software and customizations to get copied as well.
g4u is a simple disk image installer. It allows you to copy the image of a computer's disk to a server and then install that image on other machines in your cluster. The design philosophy for g4u is very simple. g4u is indifferent to what is on the disk; it just copies bits. It doesn't matter what version of Unix or what file system you use. It doesn't care if a disk sector is unused; it still gets copied. The image is compressed while on the server, but otherwise is an exact copy of the disk. If the image includes configuration files that are specific to the original machine, e.g., a static IP address or a host-name file, you will have to correct these after installing the image. (You can avoid most problems of this sort if you use DHCP to configure your systems.) g4u works best when used with disks with the same size and geometry but, under limited circumstances, it may be finessed to work with other disks. Image copying is the simplest approach to learn and to use and is usable with almost any operating system.
There are three things you will need to do before using g4u. If you don't already have an FTP server, you will need to create one to store the images. You will need to download the g4u software. And, while not strictly required, you should prepare your source system for cloning. All of these are very straightforward.
To set up an FTP server, you'll need to install the software, edit the configuration files, and start the daemon. Several FTP server implementations are available. Select and install your favorite. The vsftpd (Very Secure FTP) package is a good choice for this purpose. You'll need to edit the appropriate configuration files, /etc/vsftpd/vsftpd.conf, /etc/vsftpd.ftpusers, and /etc/vsftpd.user_list. Then start the service.
[root@fanny etc]# /etc/init.d/vsftpd start
Starting vsftpd for vsftpd:                                [  OK  ]
(When you are through cloning systems, you may want to disable FTP until you need it again because it poses a security risk. Just replace start with stop in the above.) Consult the documentation with your distribution or the appropriate manpages.
The g4u software consists of a NetBSD boot disk with the image-copying software. While it is possible to download the sources, it is much simpler if you just download a disk image with the software. You can download either a floppy image or a CD-ROM ISO image in either zipped or uncompressed format from http://www.feyrer.de/g4u/. (The uncompressed ISO image is smaller than 1.5 MB so downloads go quickly.) Once you have downloaded the image, unzip it if it is compressed and create your disk. With a floppy, you can use a command similar to the following, adjusting the version number as needed:
[root@fanny root]# cat g4u-1.16.fs > /dev/fd0
(With Windows, you can use rawrite.exe, which can also be downloaded from the web site.) For a CD-ROM, use your favorite software.
Since g4u creates a disk image, it copies not only files but unused sectors as well. If there is a lot of garbage in the unused sectors on the disk, they will take up space in the compressed image, and creating that image will take longer. You can minimize this problem by writing zeros out to the unused sectors before you capture the image. (Long strings of zeros compress quickly and use very little space.) The g4u documentation recommends creating a file of zeros that grows until it fills all the free space on the system, and then deleting that file.
[root@ida root]# dd if=/dev/zero of=/0bits bs=20971520
dd: writing `/0bits': No space left on device
113+0 records in
112+0 records out
[root@ida root]# rm /0bits
rm: remove `/0bits'? y
Once the file is deleted, the unused sectors will still contain mostly zeros and should compress nicely. While you don't have to do this, it will significantly reduce storage needs and transfer time.
To use g4u, you will need to capture the original disk and then copy it to the new machines. Begin by shutting down the source machine and then booting it with the g4u disk. As the system boots, you'll see some messages, including a list of commands, and then a command-line prompt. To capture and upload the disk, use the uploaddisk command. For example,
# uploaddisk [email protected] ida.g4u
The arguments to uploaddisk are the FTP account and server, given in the form user@server, and the file name for the saved image. You'll see a few more messages and then the system will prompt you for the user's FTP password. As the disk image is captured and uploaded to the FTP server, the software will display dots on the screen. When the upload is complete, the software will display some statistics about the transfer.
To create new systems from the image, the process is almost the same. Boot the new system from the g4u disk and use the slurpdisk command, like so:
# slurpdisk [email protected] ida.g4u
You'll be prompted for a password again and see similar messages. However, the download tends to go much faster than the upload. When the user prompt returns, remove the g4u disk and reboot the system. Log in and make any needed configuration changes. That's really all there is to it!
8.2.3 SystemImager
SystemImager is a part of the Systems Installation Suite (SIS), a set of tools for building an image for a cluster node and then copying it to other nodes. In many ways it is quite similar to g4u. However, there are several major differences in both the way it works and in the added functionality it provides. It is also a much more complicated tool to learn to use. These differences will be apparent as you read through this section.
As with g4u, with SIS you will set up a single node as a model, install the operating system and any additional software you want, and configure the machine exactly the way you want it. Next, copy the image of this machine to a server, and then from the server to the remaining machines in the cluster.
SystemImager is also useful in maintaining clusters since it provides an easy way to synchronize files among machines. For example, if you have a security patch to install on all the machines in the cluster, you could install it on your model computer and then update the cluster. Since SIS uses rsync, this is very efficient. Only the files changed by the patch will be copied.
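To see why this is cheap, it helps to look at the kind of check rsync makes. By default, rsync skips any file whose size and modification time match on both sides. The sketch below imitates that quick check in Python. It is an illustration only: the function name files_needing_copy is made up for the example, and real rsync goes further by sending only the changed portions of a file.

```python
import os


def files_needing_copy(src_root, dst_root):
    """Return the sorted relative paths under src_root that are missing
    from dst_root or that differ from it in size or modification time.

    This mimics rsync's default quick check; it does not copy anything.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            src_stat = os.lstat(src)
            try:
                dst_stat = os.lstat(dst)
            except FileNotFoundError:
                changed.append(rel)  # not on the destination yet
                continue
            same_size = src_stat.st_size == dst_stat.st_size
            same_time = int(src_stat.st_mtime) == int(dst_stat.st_mtime)
            if not (same_size and same_time):
                changed.append(rel)
    return sorted(changed)
```

After a patch touches a handful of files on the model computer, a list like this would contain only those files, which is why an update is so much faster than the initial transfer.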
The Systems Installation Suite is made up of three tools, SystemConfigurator, SystemImager, and SystemInstaller. From a pragmatic perspective, SystemImager is the place to begin and, depending upon your needs, may be the only part of the suite you will need to master.
SystemInstaller is generally used to build a pre-installation image on the image server without having to first create a model system. For example, OSCAR uses SystemInstaller to do just this. But if you are happy building the model system, which is strongly recommended since it gives you an opportunity to test your configuration before you copy it, there is no reason to be in a hurry to learn the details of SystemInstaller.
SystemConfigurator allows you to do a post-installation configuration of your system. While it is a useful standalone tool, it is integrated into SystemImager so that its use is transparent to the user. So while you will need to install SystemConfigurator, you don't need to learn the details of SystemConfigurator to get started using SIS. Consequently, this section focuses on SystemImager.
Since SystemImager uses client-server architecture, you will need to set up two machines initially before you can begin cloning systems. The image server manages the installation, holds the clone image, and usually provides other needed services such as DHCP. You will also need to set up the model node or golden client. Once you have created the golden client, its image is copied to the server and can then be installed on the remaining machines within the cluster.
The installation of SystemImager can be divided into four multistep phases: setting up the image server, setting up the golden client, transferring the image to the image server, and copying the image to the remaining nodes in the cluster. Each of these phases is described in turn. If you installed OSCAR, this setup has already been done for you. However, OSCAR users may want to skim this material to get a better idea of how OSCAR works and can be used.
8.2.3.1 Image server setup
In setting up the image server, you will need to select a server, install Linux and other system software as needed, install the SystemImager software on the server, and determine both how you will assign IP addresses to clients and how you will start the download.
You'll want to take care in selecting your server. Typically, the SystemImager server will also act as the head node for your cluster and will provide additional network services such as DHCP. While it is possible to distribute some of this functionality among several machines, this isn't usually done and won't be discussed here. If you already have a server, this is a likely choice provided it has enough space.
Unlike g4u, the images SystemImager creates are stored as uncompressed directory trees on the server. This has a number of advantages. First, it works nicely with rsync. And as a live filesystem, you can chroot to it and make changes or even install packages (if you are a brave soul). You'll only copy useful files, not unused sectors. While this approach has a number of advantages, even a single image can take up a lot of space. Heterogeneous clusters will require multiple images. Taken together, this implies you'll want a speedy machine with lots of disk space for your server.
Because of dependencies, you should install all of SIS even if you plan to use only SystemImager. You have a couple of choices as to how you do this. There is a Perl installation script that can be downloaded and run. It will take care of downloading and installing everything else you need. Of course, you'll need Internet access from your cluster for this to work. Alternatively, you can download the DEB or RPM packages and install them yourself. Downloading these packages and burning them onto a CD-ROM is one approach to setting up an isolated cluster. This chapter describes the installation process using RPM packages.
Since SIS supports a wide variety of different Linux releases, you'll need to select the correct packages for your distribution, and you'll need a number of packages to install SystemImager. These can be downloaded from SourceForge. Go to http://sisuite.sourceforge.net and follow the links to SystemConfigurator, SystemImager, and SystemInstaller, as needed, to download the individual packages. If in doubt, you can read the release notes for details on many of the packages.
There may be additional dependencies that you'll also need to address. For a simple Red Hat install, you'll need to install the following packages, if they are not already on your system, in this order: rsync, perl-AppConfig, perl-XML-Simple, systemconfigurator, systemimager-common, systemimager-server, perl-MLDBM, and systeminstaller. You'll also need a boot package specific to your architecture. For example, you would use systemimager-boot-i386-standard for the Intel 386 family. rsync is usually already installed. Install these as you would install any RPM.
[root@fanny sysimager]# rpm -vih perl-AppConfig-1.52-4.noarch.rpm
Preparing... ########################################### [100%]
1:perl-AppConfig ########################################### [100%]
Repeat the process with each package. There is also an X interface to SystemInstaller called tksis. If you want to install this, you will need to install perl-DBI, perl-TK, and systeminstaller-x11. (If you have problems with circular dependencies, you might put the package names all on the same line and use rpm -Uvh to install them.)
The SIS installation will create a directory /etc/systemimager containing the configuration files used by SystemImager. By default, SystemImager is not started. You can use the command service systemimager start to manually start it. SystemImager starts the rsync daemon using the configuration file in /etc/systemimager, so if rsync is already running on your system, you'll need to turn it off first. As with any manual start, if you restart the system, you'll need to restart SystemImager. (With a recent release, the names of several services have changed. To ensure you are using the appropriate names, look in /etc/init.d to see what is installed.)
There are a couple of other things you might want to set up on your server if you don't already have them. With SIS, there are four installation methods. You can boot the machine you are installing the image on from a floppy, from CD-ROM, from its hard drive, or over the network using a PXE-based network adapter. (The hard drive option is used for upgrading systems rather than for new installs.)
If you are going to do a network boot, you will need a TFTP server. SIS includes a command, mkbootserver, which will handle the configuration for you, but you must first install some packages: tftp-server, tftp, and pxe. Once these packages are installed, the script mkbootserver will take care of everything else. As needed, it will create the /tftpboot directory, modify /etc/services, modify /etc/inetd.conf or /etc/xinetd.d/tftp, verify that the TFTP server works, configure PXE creating /etc/pxe.conf, verify the pxe daemon is running, verify the network interface is up, and pass control to the mkdhcpserver command to configure a DHCP server. Once mkbootserver has been run, your server should be appropriately configured for booting clients and installing images via PXE. Of course, you'll need a PXE-enabled network adapter in your client.
Even if you aren't booting via PXE, you will probably still want to use DHCP to assign IP addresses. This isn't absolutely necessary since you can create a configuration diskette for each machine with the appropriate information, but it is probably the easiest way to go. Using DHCP implies you'll need a DHCP server, i.e., both server software and a configuration file. Setting up the software is usually just a matter of installing the dhcp package.
[root@fanny root]# rpm -vih dhcp-3.0pl1-23.i386.rpm
warning: dhcp-3.0pl1-23.i386.rpm: V3 DSA signature: NOKEY, key ID db42a60e
Preparing... ########################################### [100%]
1:dhcp ########################################### [100%]
To create a configuration file, typically /etc/dhcpd.conf, use the mkdhcpserver script. You'll need to collect information about your network such as the IP address range, broadcast address, network mask, DNS servers, and the network gateway before you run this script. Here is an example of using mkdhcpserver for a simple network.
[root@fanny root]# mkdhcpserver
Welcome to the SystemImager "mkdhcpserver" command.
This command will prepare this computer to be a DHCP server by creating a dhcpd.conf file for use with your ISC DHCP server (v2 or v3).
If there is an existing file, it will be backed up with the .beforesystemimager extension.
Continue? (y/[n]): y
Type your response or hit <Enter> to accept [defaults].
If you don't have a response, such as no first or second DNS server, just hit <Enter> and none will be used.
What is your DHCP daemon major version number (2 or 3)? [2]: 2
Use of uninitialized value in concatenation (.) or string at /usr/sbin/mkdhcpserver line 202, <STDIN> line 2.
What is the name of your DHCP daemon config file? [ ]: /etc/dhcpd.conf
What is your domain name? [localdomain.domain]: wofford.int
What is your network number? [192.168.1.0]: 10.0.32.0
What is your netmask? [255.255.255.0]: 255.255.248.0
What is the starting IP address for your dhcp range? [192.168.1.1]: 10.0.32.145
What is the ending IP address for your dhcp range? [192.168.1.100]: 10.0.32.146
What is the IP address of your first DNS server? [ ]: 10.0.80.3
What is the IP address of your second DNS server? [ ]: 10.0.80.2
What is the IP address of your third DNS server? [ ]:
What is the IP address of your default gateway? [192.168.1.254]: 10.0.32.2
What is the IP address of your image server? [192.168.1.254]: 10.0.32.144
What is the IP address of your boot server? [ ]: 10.0.32.144
What is the IP address of your log server? [ ]:
Will your clients be installed over SSH? (y/[n]): y
What is the base URL to use for ssh installs? [http://10.0.32.144/systemimager/boot/]:
What... is the air-speed velocity of an unladen swallow? [ ]:
Wrong!!! (with a Monty Python(TM) accent...)
Press <Enter> to continue...
Ahh, but seriously folks...
Here are the values you have chosen:
#######################################################################
ISC DHCP daemon version: 2
DHCP daemon using fixed-address patch: n
ISC DHCP daemon config file: /etc/dhcpd.conf
DNS domain name: wofford.int
Network number: 10.0.32.0
Netmask: 255.255.248.0
Starting IP address for your DHCP range: 10.0.32.145
Ending IP address for your DHCP range: 10.0.32.146
First DNS server: 10.0.80.3
Second DNS server: 10.0.80.2
Third DNS server:
Default gateway: 10.0.32.2
Image server: 10.0.32.144
Boot server: 10.0.32.144
Log server:
Log server port:
SSH files download URL: http://10.0.32.144/systemimager/boot/
#######################################################################
Are you satisfied? (y/[n]): y
The dhcp server configuration file (/etc/dhcpd.conf) file has been created for you. Please verify it for accuracy.
If this file does not look satisfactory, you can run this command again to re-create it: "mkdhcpserver"
WARNING!: If you have multiple physical network interfaces, be sure to edit the init script that starts dhcpd to specify the interface that is connected to your DHCP clients.
Here's an example: Change "/usr/sbin/dhcpd" to "/usr/sbin/dhcpd eth1".
Depending on your distribution, you may be able to set this with the "INTERFACES" variable in either "/etc/default/dhcp" or in your dhcpd initialization script (usually "/etc/init.d/dhcpd").
Also, be sure to start or restart your dhcpd daemon. This can usually be done with a command like "/etc/init.d/dhcpd restart" or similar.
Would you like me to restart your DHCP server software now? (y/[n]): y
Shutting down dhcpd: [FAILED]
Starting dhcpd: [ OK ]
As you can see, the script is very friendly. There is also important information buried in the output, such as the warning about restarting the DHCP daemon. Be sure you read it carefully. If you already have a DHCP configuration file, it is backed up, usually as /etc/dhcpd.conf.beforesystemimager. You may need to merge information from your old file into the newly created file.
As previously noted, you don't have to use DHCP. You can create a configuration disk with a file local.cfg for each machine with the information provided by DHCP. Here is an example.
HOSTNAME=hector
DOMAINNAME=wofford.int
DEVICE=eth0
IPADDR=10.0.32.146
NETMASK=255.255.248.0
NETWORK=10.0.32.0
BROADCAST=10.0.39.255
GATEWAY=10.0.32.2
IMAGESERVER=10.0.32.144
IMAGENAME=ida.image
Regardless of how you are booting for your install, the software will look for a floppy with this file and use the information if provided. In this example, the client names that have been automatically generated are not being used, so it is necessary to rename the installation scripts on the image server. We'll come back to this.
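The NETWORK and BROADCAST values in local.cfg follow directly from the IP address and netmask, so when you need one diskette per machine it may be easier to generate the files than to type them. Here is a minimal sketch using Python's standard ipaddress module. The function name make_local_cfg is hypothetical, and the field list simply mirrors the example above; check it against your SystemImager release before relying on it.

```python
import ipaddress


def make_local_cfg(hostname, domain, ipaddr, netmask, gateway,
                   imageserver, imagename, device="eth0"):
    """Return the text of a local.cfg file for one client.

    The network and broadcast addresses are derived from ipaddr and
    netmask rather than typed in. The function itself is illustrative
    and is not part of SystemImager.
    """
    iface = ipaddress.IPv4Interface(f"{ipaddr}/{netmask}")
    network = iface.network
    # A gateway outside the client's own network is almost certainly a typo.
    if ipaddress.IPv4Address(gateway) not in network:
        raise ValueError(f"gateway {gateway} is not in {network}")
    fields = [
        ("HOSTNAME", hostname),
        ("DOMAINNAME", domain),
        ("DEVICE", device),
        ("IPADDR", ipaddr),
        ("NETMASK", netmask),
        ("NETWORK", network.network_address),
        ("BROADCAST", network.broadcast_address),
        ("GATEWAY", gateway),
        ("IMAGESERVER", imageserver),
        ("IMAGENAME", imagename),
    ]
    return "\n".join(f"{key}={value}" for key, value in fields) + "\n"
```

Called with the values for hector shown above, the function derives the same NETWORK and BROADCAST lines as the hand-written file, which is a useful sanity check on the netmask.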
8.2.3.2 Golden client setup
The golden client is a model for the other machines in your cluster. Setting up the golden client requires installing and configuring Linux, the SystemImager software, and any other software you want on each client. You will also need to run the prepareclient script to collect image information and start the rsync daemon for the image transfer.
Because you are using an image install, your image should contain everything you want on the cluster nodes, and should be compatible with the node's hardware. In setting up the client, think about how it will be used and what you will need. Doing as much of this as possible will save you work in the long run. For example, if you generate SSH keys prior to cloning systems, you won't have to worry about key distribution. However, getting the software right from the start isn't crucial. SystemImager includes a script to update clients, and since it uses rsync, updates go fairly quickly. Nonetheless, this is something of a nuisance, so you'll want to minimize updates as much as possible. If possible, set up your client and test it in the environment in which it will be used.
Getting the hardware right is more important. The hardware doesn't have to be identical on every node, but it needs to be close. For network and video adapters, you'll want the same chipset. Although disk sizes don't have to be identical, it is better to select for your golden client a machine with the smallest disk size in your cluster. And you can't mix IDE and SCSI systems. Having said all this, remember that you can have multiple images. So if you have a cluster with three different sets of hardware, you can create three images and do three sets of installs.[3]
[3] To some extent, you can install an image configured for different hardware and use kudzu to make corrections once the system reboots. For example, I've done this with network adapters. When the system boots for the first time, I delete the image's adapter and configure the actual adapter in the machine. (Actually, SystemConfigurator should be able to manage NIC detection and setup.)
Once you have built your client, you'll need to install the SystemImager client software. This is done in much the same manner as with the server but there is less to install. For a typical Red Hat install, you'll need perl-AppConfig, systemconfigurator, systemimager-common, and systemimager-client packages at a minimum.
Once all the software has been installed and configured, there is one final step in preparing the client. This involves collecting information about the client needed to build the image by running the prepareclient script. The script is very friendly and describes in some detail what it is doing.
[root@ida sis]# prepareclient
Welcome to the SystemImager prepareclient command. This command may modify the following files to prepare your golden client for having its image retrieved by the imageserver. It will also create the /etc/systemimager directory and fill it with information about your golden client. All modified files will be backed up with the .before_systemimager-3.0.1 extension.
/etc/services: This file defines the port numbers used by certain software on your system. I will add appropriate entries for rsync if necessary.
/etc/inetd.conf: This is the configuration file for the inet daemon, which starts up certain server software when the associated client software connects to your machine. SystemImager needs to run rsync as a standalone daemon on your golden client until it's image is retrieved by your image server. I will comment out the rsync entry in this file if it exists. The rsync daemon will not be restarted when this machine is rebooted.
/tmp/rsyncd.conf.13129: This is a temporary configuration file that rsync needs on your golden client in order to make your filesystem available to your image server.
See "prepareclient -help" for command line options.
Continue? (y/[n]): y
*********************************** WARNING ***********************************
This utility starts an rsync daemon that makes all of your files accessible by anyone who can connect to the rsync port of this machine. This is the case until you reboot, or kill the 'rsync --daemon' process by hand. By default, once you use getimage to retrieve this image on your image server, these contents will become accessible to anyone who can connect to the rsync port on your imageserver. See rsyncd.conf(5) for details on restricting access to these files on the imageserver. See the systemimager-ssh package for a more secure method of making images available to clients.
*********************************** WARNING ***********************************
Continue? (y/[n]): y
Signaling xinetd to restart...
Using "sfdisk" to gather information about /dev/hda... done!
Starting or re-starting rsync as a daemon.....done!
This client is ready to have its image retrieved.
You must now run the "getimage" command on your imageserver.
As you can see from the output, the script runs the rsync server daemon on the client. For this reason, you should wait to run this script until just before you are ready to transfer the image to the image server. Also, be sure to disable this rsync server after copying the client image to the image server.
8.2.3.3 Retrieving the image
This is perhaps the simplest phase of the process. To get started, run the getimage script. You'll need to specify the name or address of the client and a name for the image. It should look something like this:
[root@fanny scripts]# getimage -golden-client ida -image ida.image
This program will get the "ida.image" system image from "ida" making the assumption that all filesystems considered part of the system image are using ext2, ext3, jfs, FAT, reiserfs, or xfs.
This program will not get /proc, NFS, or other filesystems not mentioned above.
*********************************** WARNING ***********************************
All files retrieved from a golden client are, by default, made accessible to anyone who can connect to the rsync port of this machine. See rsyncd.conf(5) for details on restricting access to these files on the imageserver. See the systemimager-ssh package for a more secure (but less effecient) method of making images available to clients.
*********************************** WARNING ***********************************
See "getimage -help" for command line options.
Continue? ([y]/n): y
Retrieving /etc/systemimager/mounted_filesystems from ida to check for mounted filesystems...
------------- ida mounted_filesystems RETRIEVAL PROGRESS -------------
receiving file list ... done
/var/lib/systemimager/images/ida.image/etc/systemimager/mounted_filesystems
wrote 138 bytes read 114 bytes 504.00 bytes/sec
total size is 332 speedup is 1.32
------------- ida mounted_filesystems RETRIEVAL FINISHED -------------
Retrieving image ida.image from ida
------------- ida.image IMAGE RETRIEVAL PROGRESS -------------
...
At this point you'll see the names of each of the files whiz by. After the last file has been transferred, the script will print a summary.
...
wrote 92685 bytes read 2230781 bytes 10489.69 bytes/sec
total size is 1382212004 speedup is 594.89
------------- ida.image IMAGE RETRIEVAL FINISHED -------------
Press <Enter> to continue...
IP Address Assignment
---------------------
There are four ways to assign IP addresses to the client systems on an ongoing basis:
1) DHCP
----------------------------------------------------------------
A DHCP server will assign IP addresses to clients installed with this image. They may be assigned a different address each time. If you want to use DHCP, but must ensure that your clients receive the same IP address each time, see "man mkdhcpstatic".
2) STATIC
----------------------------------------------------------------
The IP address the client uses during autoinstall will be permanently assigned to that client.
3) REPLICANT
----------------------------------------------------------------
Don't mess with the network settings in this image. I'm using it as a backup and quick restore mechanism for a single machine.
Which method do you prefer? [1]:
You have chosen method 1 for assigning IP addresses.
Are you satisfied? ([y]/n): y
Would you like to run the "addclients" utility now? (y/[n]): n
Unless you have edited /etc/systemimager/systemimager.conf, the image will be stored in the directory /var/lib/systemimager/images as the subdirectory ida.image.
The getimage command runs mkautoinstallscript, which creates the auto-install script (/var/lib/systemimager/scripts/ida.image.master in this case), and gives you the option to move on to the next step. But before you do, you may want to kill the rsync daemon on the golden client.
[root@ida sysconfig]# ps -aux | grep rsync | grep -v grep
root 13142 0.0 0.4 1664 576 ? S 15:46 0:00 rsync --daemon --
[root@ida sysconfig]# kill 13142
8.2.3.4 Cloning the systems
The final steps of distributing the image to the clients require creating the installation scripts for the clients, preparing any needed boot media, and then booting the clients to initiate the process.[4]
[4] The latest release of SIS includes a program called flamethrower. It is used to multicast images, which speeds up file distribution on multicast-enabled networks. flamethrower is not discussed in this chapter.
As noted above, you should now have an initial auto-install script. The next script you'll run is addclients, which does three things: it automatically generates host names for each node, it creates symbolic links to the auto-install script, one for each client, and it populates the /etc/hosts table.
[root@fanny root]# addclients
Welcome to the SystemImager "addclients" utility
...
A copy of the host table and the install scripts for the individual machines are located in the directory /var/lib/systemimager/scripts. If you don't want to use the automatically generated names, you'll need to edit /etc/hosts and /var/lib/systemimager/scripts/hosts, replacing the automatically generated names with the names you want. You'll also need to rename the individual install scripts in /var/lib/systemimager/scripts to match your naming scheme. Of course, if you are happy with the generated names, you can skip all this.
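If you have more than a handful of nodes, renaming the per-client scripts by hand gets tedious. The following Python sketch shows one way to do it. It rests on assumptions you should verify by listing /var/lib/systemimager/scripts first: that each client's script is a link named after the host with a .sh suffix, and that the generated names look like node1. Neither detail is taken from the SystemImager documentation, and the function name rename_install_scripts is invented for the example. You still need to edit the two hosts files yourself.

```python
import os


def rename_install_scripts(scripts_dir, name_map, suffix=".sh"):
    """Rename per-client install scripts from generated names to your own.

    name_map maps a generated host name to the name you want, for
    example {"node1": "hector"}. The ".sh" suffix is an assumption;
    pass a different suffix if your scripts directory uses another one.
    Returns a list of (old_file, new_file) pairs that were renamed.
    """
    renamed = []
    for old, new in name_map.items():
        src = os.path.join(scripts_dir, old + suffix)
        dst = os.path.join(scripts_dir, new + suffix)
        if not os.path.lexists(src):
            continue  # no script for this generated name; nothing to do
        if os.path.lexists(dst):
            raise FileExistsError(f"{dst} already exists")
        # os.rename renames the link itself, so it keeps pointing at the
        # same master script.
        os.rename(src, dst)
        renamed.append((old + suffix, new + suffix))
    return renamed
```

Because only the link's name changes, every renamed script still refers to the same master auto-install script created by getimage.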
If you are using a network or PXE boot, you can restart the clients now. If you are booting from a floppy or CD-ROM, you'll first need to make a boot disk. You can use the scripts mkautoinstalldiskette or mkautoinstallcd to make, respectively, a boot diskette or boot CD-ROM. Here is an example of making a CD-ROM.
[root@fanny root]# mkautoinstallcd -out autoinstall.iso
Here is a list of available flavors:
standard
Which flavor would you like to use? [standard]:
...
Note that the default or standard flavor was used. This was created when the package systemimager-boot-i386-standard was installed. With the CD-ROM script, an ISO image is generated that can be used to burn a CD-ROM. Fortunately, this is a relatively small file, so it can easily be moved to another system with a CD-ROM burner. If you elect to use the diskette script instead, it will mount, format, and record the diskette for you. If you don't want to use DHCP, put the file local.cfg on a separate diskette even if you are using a CD-ROM to boot. When booting from a diskette, you'll need to put local.cfg on that diskette. Be warned, you may run out of space if you use a diskette. If you aren't using a local configuration file, you need only one boot disk. You need a diskette for each machine, however, if you are using the local configuration file. If you upgrade SystemImager, remember to regenerate your boot disks as they are release dependent.
Now that you have the boot disk, all you need to do is reboot the client from it. The client will locate the image server and then download and run the installation script. You can sit back and watch the magic for a while. After a short time, your systems should begin to beep at you. At this point, you can remove any diskettes or CD-ROMs and reboot the systems. Your node is installed.
There is one last script you may want to run if you are using DHCP. The script mkdhcpstatic can update your DHCP configuration file, associating IP addresses with MAC addresses. That is, if you run this script, each IP address will be tied to a specific machine based on the MAC address of the machine to which it was first assigned. Since IP addresses are handed out in numerical order, by booting the individual machines in a specific order and then running mkdhcpstatic, you can control IP assignments.
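The idea behind controlling assignments by boot order can be sketched as follows. The Python below pairs machines, listed in the order you boot them, with the addresses in the DHCP range, and then renders the kind of host declaration that ISC dhcpd uses for a fixed address. It is not the output of mkdhcpstatic; the function names are invented for illustration, and any MAC addresses you pass in a trial run are your own placeholders.

```python
import ipaddress


def assign_in_boot_order(boot_order, first_ip, last_ip):
    """Pair (hostname, mac) tuples, given in boot order, with the
    addresses of the DHCP range taken in ascending numerical order.

    Returns a list of (hostname, mac, ip) tuples.
    """
    start = int(ipaddress.IPv4Address(first_ip))
    end = int(ipaddress.IPv4Address(last_ip))
    pool = [ipaddress.IPv4Address(n) for n in range(start, end + 1)]
    if len(boot_order) > len(pool):
        raise ValueError("more machines than addresses in the range")
    return [(name, mac, str(ip)) for (name, mac), ip in zip(boot_order, pool)]


def static_host_entries(assignments):
    """Render ISC dhcpd host declarations for (hostname, mac, ip) tuples."""
    lines = []
    for name, mac, ip in assignments:
        lines += [
            f"host {name} {{",
            f"    hardware ethernet {mac.lower()};",
            f"    fixed-address {ip};",
            "}",
        ]
    return "\n".join(lines) + "\n"
```

With the two-address range used earlier in this chapter, the first machine booted would be paired with 10.0.32.145 and the second with 10.0.32.146. Treat the result as a way to check your plan before you run mkdhcpstatic, not as a replacement for it.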
8.2.3.5 Other tasks
As if building your network isn't enough, SystemImager can also be used to maintain and update your clients. The script updateclient is used to resynchronize a client with an image. Its calling syntax is similar to getimage.
[root@hector root]# updateclient -server fanny -image ida.image
Updating image from module ida.image...
receiving file list ... done
...
You'll see a lot of file names whiz by at this point.
...
wrote 271952 bytes read 72860453 bytes 190201.31 bytes/sec
total size is 1362174476 speedup is 18.63
Running bootloader...
Probing devices to guess BIOS drives. This may take a long time.
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not. If any of the lines is incorrect, fix it and re-run the script `grub-install'.
(fd0) /dev/fd0
(hd0) /dev/hda
Probing devices to guess BIOS drives. This may take a long time.
Probing devices to guess BIOS drives. This may take a long time.
It should be noted that the script is fairly intelligent. It will not attempt to update some classes of files, such as log files.
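The speedup figure in the rsync summaries shown in this section is simply the total size of the files divided by the number of bytes that actually crossed the network (bytes written plus bytes read). A quick check in Python, using the numbers from the transcripts in this section, bears this out. The function name is just a label for the example.

```python
def rsync_speedup(total_size, wrote, read):
    """Ratio of the total file size to the bytes actually transferred."""
    return total_size / (wrote + read)


# Figures copied from the rsync summaries in this section.
initial_getimage = rsync_speedup(1382212004, 92685, 2230781)
later_updateclient = rsync_speedup(1362174476, 271952, 72860453)

print(f"getimage:     {initial_getimage:.2f}")
print(f"updateclient: {later_updateclient:.2f}")
```

The two values agree with the speedups reported in the transcripts, 594.89 and 18.63. The higher the number, the less data rsync had to move relative to the size of the image.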
SystemImager also provides several commands for manipulating images. The commands cpimage, mvimage, lsimage, and rmimage are, as you might guess, analogous to cp, mv, ls, and rm.
8.3 Notes for OSCAR and Rocks Users
Since OSCAR installs and uses SIS, much of this material probably seemed vaguely familiar to you. OSCAR uses SystemInstaller to build the image directly on the server rather than capture the image from a golden client. However, once you have installed OSCAR, you can use the SIS scripts as you see fit.
The configuration file for rsync is in /etc/systemimager/rsync. OSCAR stores the SystemImager files in /var/lib/systemimager. For example, the image files it creates are in /var/lib/systemimager/images.
Rocks uses Kickstart. It uses XML files to record configuration information, dynamically generating the Kickstart configuration file. Changing these XML files is described in Chapter 7. You can interactively re-Kickstart a compute node with the shoot-node command. See the manpage shoot-node(8) for more details.
Chapter 9. Programming Software
After the operating system and other basic system software, you'll want to install the core software as determined by the cluster's mission. If you are planning to develop applications, you'll need software development tools, including libraries that support parallel processing. If you plan to run a set of existing cluster-ready applications, you'll need to select and install those applications as part of the image you will clone.
This chapter presupposes you'll want to develop cluster software and will need the tools to do so. For many clusters this may not be the case. For example, if you are setting up a cluster to process bioinformatics data, your needs may be met with the installation of applications such as BLAST, ClustalW, FASTA, etc. If this is the path you are taking, then identifying, installing, and learning to use these applications are the next steps you need to take.[1] For now, you can safely skip this chapter. But don't forget that it is here. Even if you are using canned applications, at some point you may want to go beyond what is available and you'll need the tools in this chapter.
[1] Steven Baum's site, http://stommel.tamu.edu/~baum/npaci.html, while ostensibly about Rocks, contains a very long list of cluster applications for those who would rather not write their own applications.
This chapter describes the installation and basic use of the software development tools used to develop and run cluster applications. It also briefly mentions some tools that you are likely to need that should already be part of your system. For clusters where you develop the application software, the software described in this chapter is essential. In contrast, you may be able to get by without management and scheduling software. You won't get far without the software described here.
If you've installed OSCAR or Rocks, you will have pretty much everything you need. Nonetheless, you'll still want to skim this chapter to learn more about how to use that software. For cluster application developers, this is the first software you need to learn how to use.
9.1 Programming Languages
While there are hundreds of programming languages available, when it comes to writing code for high-performance clusters, there are only a couple of realistic choices. For pragmatic reasons, your choices are basically FORTRAN or C/C++.
Like it or not, FORTRAN has always been the lingua franca of high-performance computing. Because of the installed base of software, this isn't likely to change soon. This doesn't mean that you need to use FORTRAN for new projects, but if you have an existing project using FORTRAN, then you'll need to support it. This comes down to knowing how your cluster will be used and knowing your users' needs.
FORTRAN has changed considerably over the years, so the term can mean different things to different people. While there are more recent versions of FORTRAN, your choice will likely be between FORTRAN 77 and FORTRAN 90. For a variety of reasons, FORTRAN 77 is likely to get the nod over FORTRAN 90 despite the greater functionality of FORTRAN 90. First, the GNU implementation of FORTRAN 77 is likely to already be on your machine. If it isn't, it is freely available and easily obtainable. If you really want FORTRAN 90, don't forget to budget for it. But you should also realize that you may face compatibility issues. When selecting parallel programming libraries to use with your compiler, your choices will be more limited with FORTRAN 90.
C and C++ are the obvious alternatives to FORTRAN. For new applications that don't depend on compatibility with legacy FORTRAN applications, C is probably the best choice. In general, you have greater compatibility with libraries. And at this point in time, you are likely to find more programmers trained in C than in FORTRAN, so when you need help, you are more likely to find a helpful C programmer than a helpful FORTRAN programmer. For this and other reasons, the examples in this book will stick to C.
With most other languages you are out of luck. With very few exceptions, the parallel programming libraries simply don't have bindings for other languages. This is changing. While bindings for Python and Java are being developed, it is probably best to think of these as works in progress. If you want to play it safe, you'll stick to C or FORTRAN.
9.2 Selecting a Library
Those of you who do your own dentistry will probably want to program your parallel applications from scratch. It is certainly possible to develop your code with little more than a good compiler. You could manually set up communication channels among processes using standard system calls.[2]
[2] In fairness, there may be some very rare occasions where efficiency concerns might dictate this approach.
The rest of you will probably prefer to use libraries designed to simplify parallel programming. This really comes down to two choices: the Parallel Virtual Machine (PVM) library or the Message Passing Interface (MPI) library. Work was begun on PVM in 1989 and continued into the early '90s as a joint effort among Oak Ridge National Laboratory, the University of Tennessee, Emory University, and Carnegie Mellon University. An implementation of PVM is available from http://www.netlib.org/pvm3/. This PVM implementation provides both libraries and tools based on a message-passing model.
Without getting into a philosophical discussion, MPI is a newer standard that seems to be generally preferred over PVM by many users. For this reason, this book will focus on MPI. However, both PVM and MPI are solid, robust approaches that will potentially meet most users' needs. You won't go too far wrong with either. OSCAR, you will recall, installs both PVM and MPI.
MPI is an API for parallel programming based on a message-passing model for parallel computing. MPI processes execute in parallel. Each process has a separate address space. Sending processes specify data to be sent and a destination process. The receiving process specifies an area in memory for the message, the identity of the source, etc.
Primarily, MPI can be thought of as a standard that specifies a library. Users can write code in C, C++, or FORTRAN using a standard compiler and then link to the MPI library. The library implements a predefined set of function calls to send and receive messages among collaborating processes on the different machines in the cluster. You write your code using these functions and link the completed code to the library.
The MPI specification was developed by the MPI Forum, a collaborative effort with support from both academia and industry. It is suitable for both small clusters and "big-iron" implementations. It was designed with functionality, portability, and efficiency in mind. By providing a well-designed set of function calls, the library provides a wide range of functionality that can be implemented in an efficient manner. As a clearly defined standard, the library can be implemented on a variety of architectures, allowing code to move easily among machines.
MPI has gone through a couple of revisions since it was introduced in the early '90s. Currently, people talk of MPI-1 (typically meaning Version 1.2) and MPI-2. MPI-1 should provide for most of your basic needs, while MPI-2 provides enhancements.
While there are several different implementations of MPI, there are two that are widely used: LAM/MPI and MPICH. Both LAM/MPI and MPICH go beyond simply providing a library. Both include programming and runtime environments providing mechanisms to run programs across the cluster. Both are widely used, robust, well supported, and freely available. Excellent documentation is provided with both. Both provide all of MPI-1 and considerable portions of MPI-2, including ROMIO, Argonne National Laboratory's freely available high-performance IO system. (For more information on ROMIO, visit http://www.mcs.anl.gov/romio.) At this time, neither is totally thread-safe. While there are differences, if you are just getting started, you should do well with either product. And since both are easy to install, with very little extra work you can install both.
9.3 LAM/MPI
The Local Area Multicomputer/Message Passing Interface (LAM/MPI) was originally developed by the Ohio Supercomputer Center. It is now maintained by the Open Systems Laboratory at Indiana University. As previously noted, LAM/MPI (or LAM for short) is both an MPI library and an execution environment. LAM was designed around an extensible component framework known as the System Services Interface (SSI), one of its major strengths, although the details are beyond the scope of this book. It works well in a wide variety of environments and supports several methods of interprocess communication using TCP/IP. LAM will run on most Unix machines (but not Windows). New releases are tested with both Red Hat and Mandrake Linux.
Documentation can be downloaded from the LAM site, http://www.lam-mpi.org/. There are also tutorials, a FAQ, and archived mailing lists. This chapter provides an overview of the installation process and a description of how to use LAM. For more up-to-date and detailed information, you should consult the LAM/MPI Installation Guide and the LAM/MPI User's Guide.
9.3.1 Installing LAM/MPI
You have two basic choices when installing LAM. You can download and install a Red Hat package, or you can download the source and recompile it. The package approach is very quick, easy to automate, and uses somewhat less space. If you have a small cluster and are manually installing the software, it will be a lot easier to use packages. Installing from the source will allow you to customize the installation, i.e., select which features are enabled and determine where the software is installed. It is probably a bad idea to mix installations since you could easily end up with different versions of the software, something you'll definitely want to avoid.
Installing from a package is done just as you'd expect. Download the package from http://www.lam-mpi.org/ and install it just as you would any Red Hat package.
[root@fanny root]# rpm -vih lam-7.0.6-1.i586.rpm
Preparing...                ########################################### [100%]
   1:lam                    ########################################### [100%]
The files will be installed under the /usr directory. The space used is minimal. You can use the laminfo command to see the details of the installation, including compiler bindings and which modules are installed, etc.
If you need more control over the installation, you'll want to do a manual install: fetch the source, compile, install, and configure. The manual installation is only slightly more involved. However, it does take considerably longer, something to keep in mind if you'll be repeating the installation on each machine in your cluster. But if you are building an image, this is a one-time task. The installation requires a POSIX-compliant operating system, an appropriate compiler (e.g., the GNU 2.95 compiler suite), utilities such as sed, grep, and awk, and a modern make. You should have no problem with most versions of Linux.
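Before starting a source build, it can save time to confirm that the tools listed above are actually on your path. What follows is a minimal sketch, not something supplied with LAM. The helper name have is made up for this example, and gcc is only an assumed stand-in for "an appropriate compiler."

```shell
# Report whether a command can be found on the current PATH.
# "have" is a hypothetical helper name, not a LAM utility.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Tools the text lists as prerequisites; gcc stands in for
# "an appropriate compiler" and is an assumption.
missing=""
for tool in gcc make sed grep awk; do
    if have "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing="$missing $tool"
    fi
done

if [ -n "$missing" ]; then
    echo "Install these before running configure:$missing"
fi
```

If you use a different compiler, substitute its name in the list.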
First, you'll need to decide where to put everything, a crucial step if you are installing more than one version of MPI. If care isn't taken, you may find that part of an installation has been overwritten. In this example, the source files are saved in /usr/local/src/lam-7.0.6 and the installed code in /usr/local/lam-7.0.6. First, download the appropriate file from http://www.lam-mpi.org/ to /usr/local/src. Next, uncompress and unpack the file.
[root@fanny src]# bunzip2 lam-7.0.6.tar.bz2
[root@fanny src]# tar -xvf lam-7.0.6.tar
...
[root@fanny src]# cd lam-7.0.6
You'll see a lot of files stream by as the source is unpacked. If you want to capture this output, you can tee it to a log file. Just append | tee tar.log to the end of the line and the output will be copied to the file tar.log. You can do something similar with subsequent commands.
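Keep in mind that a plain pipe to tee captures only standard output, so error messages will not reach the log. The sketch below shows one way to send both streams to the same file. The function noisy_step is a made-up stand-in for a long-running command such as make.

```shell
# noisy_step is a made-up stand-in for a long-running build command;
# it writes one line to stdout and one to stderr.
noisy_step() {
    echo "compiling module"
    echo "warning: unused variable" >&2
}

log=$(mktemp)

# 2>&1 folds stderr into stdout before the pipe, so tee sees both streams.
noisy_step 2>&1 | tee "$log"
```

With a real build, the same redirection would be written as make 2>&1 | tee make.log.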
Next, create the directory where the executables will be installed and configure the code specifying that directory with the --prefix option. You may also include any other options you desire. The example uses a configuration option to specify SSH as well. (You could also set this through the environment variable LAMRSH rather than compiling it into the code, something you must do if you use a package installation.)
[root@fanny lam-7.0.6]# mkdir /usr/local/lam-7.0.6
[root@fanny lam-7.0.6]# ./configure --prefix=/usr/local/lam-7.0.6 \
> --with-rsh="ssh -x"
If you don't have a FORTRAN compiler, you'll need to add --without-fc to the configure command. A description of other configuration options can be found in the documentation. However, the defaults are quite reasonable and will be adequate for most users. Also, if you aren't using the GNU compilers, you need to set and export compiler variables. The documentation advises that you use the same compiler to build LAM/MPI that you'll use when using LAM/MPI.
Next, you'll need to make and install the code.
[root@fanny lam-7.0.6]# make
...
[root@fanny lam-7.0.6]# make install
...
You'll see a lot of output with these commands, but all should go well. You may also want to make the examples and clean up afterwards.
[root@fanny lam-7.0.6]# make examples
...
[root@fanny lam-7.0.6]# make clean
...
Again, expect a lot of output. You only need to make the examples on the cluster head. Congratulations, you've just installed LAM/MPI. You can verify the settings and options with the laminfo command.
9.3.2 User Configuration
Before you can use LAM, you'll need to do a few more things. First, you'll need to create a host file or schema, which is basically a file that contains a list of the machines in your cluster that will participate in the computation. In its simplest form, it is just a text file with one machine name per line. If you have multiple CPUs on a host, you can repeat the host name or you can append a CPU count to a line in the form cpu=n, where n is the number of CPUs. However, you should realize that the actual process scheduling on the node is left to the operating system. If you need to change identities when logging into a machine, it is possible to specify that username for a machine in the schema file, e.g., user=smith. You can create as many different schemas as you want and can put them anywhere on the system. If you have multiple users, you'll probably want to put the schema in a public directory, for example, /etc/lamhosts.
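To make the format concrete, here is a small sketch that writes a sample schema to a scratch file and totals the CPUs it declares. The host names follow the examples used in this chapter, but the CPU counts and the user name are invented for illustration.

```shell
# Write a sample schema to a scratch file. The host names follow the
# examples in this chapter; the CPU counts and user name are invented.
schema=$(mktemp)
cat > "$schema" <<'EOF'
fanny.wofford.int
george.wofford.int cpu=2
hector.wofford.int cpu=2 user=smith
EOF

# Total the CPUs: a line counts as 1 unless it carries a cpu=n entry.
cpus=$(awk '{
    n = 1
    for (i = 2; i <= NF; i++)
        if ($i ~ /^cpu=/) n = substr($i, 5)
    total += n
} END { print total }' "$schema")

echo "schema lists $cpus CPUs"
```

Once you are happy with the contents, you would copy the file to wherever you keep your schemas, for example /etc/lamhosts.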
You'll also want to set your $PATH variable to include the LAM executables, which can be trickier than it might seem. If you are installing both LAM/MPI and MPICH, there are several programs (e.g., mpirun, mpicc, etc.) that have the same name with both systems, and you need to be able to distinguish between them. While you could rename these programs for one of the packages, that is not a good idea. It will confuse your users and be a nuisance when you upgrade software. Since it is unlikely that an individual user will want to use both packages, the typical approach is to set the path to include one but not the other. Of course, as the system administrator, you'll want to test both, so you'll need to be able to switch back and forth. OSCAR's solution to this problem is a package called switcher that allows a user to easily change between two configurations. switcher is described in Chapter 6.
A second issue is making sure the path is set properly for both interactive and noninteractive or non-login shells. (The path you want to add is /usr/local/lam-7.0.6/bin if you are using the same directory layout used here.) The processes that run on the compute nodes are run in noninteractive shells. This can be particularly confusing for bash users. With bash, if the path is set in .bash_profile and not in .bashrc, you'll be able to log onto each individual system and run the appropriate programs, but you won't be able to run the programs remotely. Until you realize what is going on, this can be a frustrating problem to debug. So, if you use bash, don't forget to set your path in .bashrc. (And while you are setting paths, don't forget to add the manpages when setting up your paths, e.g., /usr/local/lam-7.0.6/man.)
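As a rough sketch, lines like the following could go in .bashrc. They assume the directory layout used in this chapter. The variable name LAM_HOME is an invention for this example, not something LAM itself reads. The case tests keep the directories from being added again each time a new shell starts.

```shell
# LAM_HOME is a hypothetical convenience variable; adjust it to match
# your own installation directory.
LAM_HOME=/usr/local/lam-7.0.6

# Add the LAM executables unless they are already on the path.
case ":$PATH:" in
    *":$LAM_HOME/bin:"*) ;;
    *) PATH="$LAM_HOME/bin:$PATH" ;;
esac

# Do the same for the manpages. MANPATH may be unset to begin with.
case ":${MANPATH:-}:" in
    *":$LAM_HOME/man:"*) ;;
    *) MANPATH="$LAM_HOME/man:${MANPATH:-}" ;;
esac

export PATH MANPATH
```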
It should be downhill from here. Make sure you have ssh-agent running and that you can log onto other machines without a password. Setting up and using SSH is described in Chapter 4. You'll also need to ensure that there is no output to stderr whenever you log in using SSH. (When LAM sees output to stderr, it thinks something bad is happening and aborts.) Since you'll get a warning message the first time you log into a system with SSH as it adds the remote machine to the known hosts, often the easiest thing to do (provided you don't have too many machines in the cluster) is to manually log into each machine once to get past this problem. You'll only need to do this once. recon, described in the subsection on testing, can alert you to some of these problems.
Also, the directory /tmp must be writable. Don't forget to turn off or reconfigure your firewall as needed.
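If you have more than a handful of machines, you may prefer to script these checks. The following is only a sketch under several assumptions: the names REMOTE_SHELL, SCHEMA, and check_host are invented here, the schema is assumed to live in /etc/lamhosts, and BatchMode is used so that SSH fails rather than stopping to ask for a password. For each host, it confirms that the login succeeds, that nothing is written to stderr, and that /tmp is writable.

```shell
# Hypothetical knobs: REMOTE_SHELL and SCHEMA are names invented for this
# sketch, not variables that LAM itself reads.
REMOTE_SHELL=${REMOTE_SHELL:-ssh -x -o BatchMode=yes}
SCHEMA=${SCHEMA:-/etc/lamhosts}

check_host() {
    status=0
    # Capture only stderr: 2>&1 points stderr at the capture, then
    # >/dev/null discards stdout. </dev/null keeps the remote command from
    # swallowing the host list that the loop below reads on stdin.
    err=$($REMOTE_SHELL "$1" test -w /tmp 2>&1 >/dev/null </dev/null) || status=$?
    if [ "$status" -ne 0 ] || [ -n "$err" ]; then
        echo "$1: problem (exit status $status) $err"
        return 1
    fi
    echo "$1: ok"
}

# Check every host named in the schema, if there is one.
if [ -r "$SCHEMA" ]; then
    while read -r host rest; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        check_host "$host" || echo "fix $host before running lamboot"
    done < "$SCHEMA"
fi
```

This does not replace recon, which also verifies that the LAM executables can be found on the remote machines.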
9.3.3 Using LAM/MPI
The basic steps in creating and executing a program with LAM are as follows:
- Boot the runtime system with lamboot.
- Write and compile a program with the appropriate compiler, e.g., mpicc.[3]
- Execute the code with the mpirun command.
- Clean up any crashed processes with lamclean if things didn't go well.
- Shut down the runtime system with the command lamhalt.
[3] Actually, you don't need to boot the system to compile code.
Each of these steps will now be described.
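As a preview, the whole sequence can be sketched as a short script. The commands and arguments are the ones used in the examples that follow. The DRY_RUN switch and the run and lam_session functions are inventions for this sketch. By default the commands are only printed, so you can try the script on a machine that doesn't have LAM installed.

```shell
# DRY_RUN is a hypothetical switch for this sketch. Left at 1, the
# commands are printed rather than executed, so the script can be read
# and tried on a machine without LAM installed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

lam_session() {
    # Boot the runtime; nothing else can work if this fails.
    run lamboot -v /etc/lamhosts || return 1

    # Compile and run; clean up stray processes only if one of them fails.
    if ! { run mpicc -o hello hello.c && run mpirun -np 4 hello; }; then
        run lamclean -v
    fi

    # Always shut the daemons down at the end of the session.
    run lamhalt
}

session_log=$(lam_session)
echo "$session_log"
```

Setting DRY_RUN=0 would execute the commands for real, which assumes that LAM is on your path and that hello.c is in the current directory.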
In order to use LAM, you will need to launch the runtime environment. This is referred to as booting LAM and is done with the lamboot command. Basically, lamboot starts the lamd daemon, the message server, on each machine.
You specify the schema you want to use as an argument.
[sloanjd@fanny sloanjd]$ lamboot -v /etc/lamhosts

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<9677> ssi:boot:base:linear: booting n0 (fanny.wofford.int)
n-1<9677> ssi:boot:base:linear: booting n1 (george.wofford.int)
...
n0<15402> ssi:boot:base:linear: finished
As noted above, you must be able to log onto the remote systems without a password and without any error messages. (If this command doesn't work the first time, you might give this a couple of tries to clear out any one-time error messages.) If you don't want to see the list of nodes, leave out the -v. You can always use the lamnodes command to list the nodes later if you wish.
[sloanjd@fanny sloanjd]$ lamnodes
n0      10.0.32.144:1:origin,this_node
n1      10.0.32.145:1:
...
You'll only need to boot the system once at the beginning of the session. It will remain loaded until you halt it or log out. (Also, you can omit the schema and just use the local machine. Your code will run only on the local node, but this can be useful for initial testing.)
Once you have entered your program using your favorite editor, the next step is to compile and link the program. You could do this directly by typing in all the compile options you'll need. But it is much simpler to use one of the wrapper programs supplied with LAM. The programs mpicc, mpiCC, and mpif77 will respectively invoke the C, C++, and FORTRAN 77 compilers on your system, supplying the appropriate command-line arguments for LAM. For example, you might enter something like the following:
[sloanjd@fanny sloanjd]$ mpicc -o hello hello.c
(hello.c is one of the examples that comes with LAM and can be found in /usr/local/src/lam-7.0.6/examples/hello if you use the same directory structure used here to set up LAM.) If you want to see which arguments are being passed to the compiler, you can use the -showme argument. For example,
[sloanjd@fanny sloanjd]$ mpicc -showme -o hello hello.c
gcc -I/usr/local/lam-7.0.6/include -pthread -o hello hello.c -L/usr/local/lam-7.0.6/lib -llammpio -llamf77mpi -lmpi -llam -lutil
With -showme, the program isn't compiled; you just see the arguments that would have been used had it been compiled. Any other arguments that you include in the call to mpicc are passed on to the underlying compiler unchanged. In general, you should avoid using the -g (debug) option when it isn't needed because of the overhead it adds.
To compile the program, rerun the last command without -showme if you haven't done so. You now have an executable program. Run the program with the mpirun command. Basically, mpirun communicates with the remote LAM daemon to fork a new process, set environment variables, redirect I/O, and execute the user's command. Here is an example:
[sloanjd@fanny sloanjd]$ mpirun -np 4 hello
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
Hello, world! I am 2 of 4
Hello, world! I am 3 of 4
As shown in this example, the argument -np 4 specifies that four processes be used when running the program. If more machines are available, only four will be used. If fewer machines are available, some machines will be used more than once.
Of course, you'll need the executable on each machine. If you're using NFS to mount your home directories, this has already been taken care of if you are working in that directory. You should also remember that mpirun can be run on a single machine, which can be helpful when you want to test code away from a cluster.
If a program crashes, there may be extraneous processes running on remote machines. You can clean these up with the lamclean command. This is a command you'll use only when you are having problems. Try lamclean first and if it hangs, you can escalate to wipe. Rerun lamboot after using wipe. This isn't necessary with lamclean. Both lamclean and wipe take a -v for verbose output.
Once you are done, you can shut down LAM with the lamhalt command, which kills the lamd daemon on each machine. If you wish, you can use -v for verbose output. Two other useful LAM commands are mpitask and mpimsg, which are used to monitor processes across the cluster and to monitor the message buffer, respectively.
9.3.4 Testing the Installation
LAM comes with a set of examples, tests, and tools that you can use to verify that it is properly installed and runs correctly. We'll start with the simplest tests first.
The recon tool verifies that LAM will boot properly. recon is not a complete test, but it confirms that the user can execute commands on the remote machine, and that the LAM executables can be found and executed.
[sloanjd@fanny bin]$ recon
-----------------------------------------------------------------------------
Woo hoo!

recon has completed successfully.  This means that you will most likely
be able to boot LAM successfully with the "lamboot" command (but this
is not a guarantee).  See the lamboot(1) manual page for more
information on the lamboot command.

If you have problems booting LAM (with lamboot) even though recon
worked successfully, enable the "-d" option to lamboot to examine each
step of lamboot and see what fails.  Most situations where recon
succeeds and lamboot fails have to do with the hboot(1) command (that
lamboot invokes on each host in the hostfile).
-----------------------------------------------------------------------------
Since lamboot is required to run the next tests, you'll need to run these tests as a non-privileged user. Once you have booted LAM, you can use the tping command to check basic connectivity. tping is similar to ping but uses the LAM echo server. This confirms both that the network is working and that the LAM daemon is listening. For example, the following command sends two one-byte messages to nodes n1 through n3 in your cluster.
[sloanjd@fanny sloanjd]$ tping n1-3 -c2
  1 byte from 3 remote nodes: 0.003 secs
  1 byte from 3 remote nodes: 0.002 secs

2 messages, 2 bytes (0.002K), 0.006 secs (0.710K/sec)
roundtrip min/avg/max: 0.002/0.003/0.003
If you want to probe every machine, use N, LAM's shorthand for all booted nodes, in place of the node range.
The LAM test suite is the most comprehensive way to test your system. It can be used to confirm that you have a complete and correct installation. Download the test suite that corresponds to your installation and then uncompress and unpack it.
[sloanjd@fanny sloanjd]$ bunzip2 lamtests-7.0.6.tar.bz2
[sloanjd@fanny sloanjd]$ tar -xvf lamtests-7.0.6.tar
...
This creates the directory lamtests-7.0.6 with the tests and a set of directions in the file README. Next, you should start LAM with lamboot if you haven't already done so. Then change to the test directory and run configure.
[sloanjd@fanny sloanjd]$ cd lamtests-7.0.6
[sloanjd@fanny lamtests-7.0.6]$ ./configure
...
Finally, run make.
[sloanjd@fanny lamtests-7.0.6]$ make -k check
...
You'll see lots of output scroll past. Don't be concerned about an occasional error message while it is running. What you want is a clean bill of health when it is finally done. You can run specific tests in the test suite by changing into the appropriate subdirectory and running make.
9.4 MPICH
Message Passing Interface Chameleon (MPICH) was developed by William Gropp and Ewing Lusk and is freely available from Argonne National Laboratory (http://www-unix.mcs.anl.gov/mpi/mpich/). Like LAM, it is both a library and an execution environment. It runs on a wide variety of Unix platforms and is even available for Windows NT.
Documentation can be downloaded from the web site. There are separate manuals for each of the communication models. This chapter provides an overview of the installation process and a description of how to use MPICH. For more up-to-date and detailed information, you should consult the appropriate manual for the communications model you are using.
9.4.1 Installing
There are five different "flavors" of MPICH reflecting the type of machine it will run on and how interprocess communication has been implemented:
ch_p4
This is probably the most common version. The "ch" is for channel and the "p4" for portable programs for parallel processors.
ch_p4mpd
This extends ch_p4 mode by including a set of daemons built to support parallel processing. The MPD is for multipurpose daemon. MPD is a new high-performance job launcher designed as a replacement for mpirun.
ch_shmem
This is a version for shared memory or SMP systems.
globus2
This is a version for computational grids. (See http://www.globus.org for more on the Globus project.)
ch_nt
This is a version of MPI for Windows NT machines.
The best choice for most clusters is either the ch_p4 model or the ch_p4mpd model. The ch_p4mpd model assumes a homogeneous architecture while ch_p4 works with mixed architectures. If you have a homogeneous architecture, ch_p4mpd should provide somewhat better performance. This section will describe ch_p4 since it is more versatile.
The first step in installing MPICH is to download the source code for your system. MPICH is not available in binary (except for Windows NT). Although the available code is usually updated with the latest patches, new patches are occasionally made available, so you'll probably want to check the patch list at the site. If necessary, apply the patches to your download file following the directions supplied with the patch file.
Decide where you want to install the software. This example uses /usr/local/src/mpich. Then download the source to the appropriate directory, uncompress it, and unpack it.
[root@fanny src]# gunzip mpich.tar.gz
[root@fanny src]# tar -xvf mpich.tar
...
Expect lots of output! Change to the directory where the code was unpacked, make a directory for the installation, and run configure.
[root@fanny src]# cd mpich-1.2.5.2
[root@fanny mpich-1.2.5.2]# mkdir /usr/local/mpich-1.2.5.2
[root@fanny mpich-1.2.5.2]# ./configure --prefix=/usr/local/mpich-1.2.5.2 \
> -rsh=ssh
...
As with LAM, this installation configures MPICH to use SSH.[4] Other configuration options are described in the installation and user's guides.
[4] Alternatively, you could use the environment variable $RSHCOMMAND to specify SSH.
Next, you'll make, install, and clean up.
[root@fanny mpich-1.2.5.2]# make
...
[root@fanny mpich-1.2.5.2]# make install
...
[root@fanny mpich-1.2.5.2]# make clean
...
Again, you'll see lots of output after each of these steps. The first make builds the software while the make install, which is optional, puts it in a public directory. It is also a good idea to make the tests on the head node.
MPICH on Windows Systems
For those who need to work in different environments, it is worth noting that MPICH will run under Windows NT and 2000. (While I've never tested it in a cluster setting, I have used MPICH on XP to compile and run programs.) To install, download the self-extracting archive. By default, this will install the runtime DLLs, the development libraries, jumpshot, and a PDF of the user's manual. I've used this combination without problems with Visual Studio.NET and CodeWarrior. It is said to work with GCC but I haven't tested it. Installing MPICH on a laptop can be very helpful at times, even if you aren't attaching the laptop to a cluster. You can use it to initially develop and test code. In this mode, you would run code on a single machine as though it were a cluster. This is not the same as running the software on a cluster, and you definitely won't see any performance gains, but it will allow you to program when you are away from your cluster. Of course, you can also include Windows machines in your cluster as compute nodes. For more information, see the MPICH ch_nt manual.
Before you can use MPICH, you'll need to tell it which machines to use by editing the file machines.architecture. For Linux clusters, this is the file machines.LINUX, located in the directory ../share under the installation directory. If you use the same file layout used here, the file is /usr/local/mpich-1.2.5.2/share/machines.LINUX. This file is just a simple list of machines with one hostname per line. For SMP systems, you can append a :n where n is the number of processors in the host. This file plays the same role as the schema with LAM. (You can specify a file with a different set of machines as a command-line argument when you run a program if desired.)
9.4.2 User Configuration
Since individual users don't set up schemas for MPICH, there is slightly less you need to do compared to LAM. Besides this difference, the user setup is basically the same. You'll need to set the $PATH variable appropriately (and $MANPATH, if you wish). The same concerns apply with MPICH as with LAM: you need to distinguish between LAM and MPICH executables if you install both, and you need to ensure the path is set for both interactive and noninteractive logins. You'll also need to ensure that you can log onto each machine in the cluster using SSH without a password. (For more information on these issues, see the subsection on user configuration under LAM/MPI.)
9.4.3 Using MPICH
Unlike LAM, you don't need to boot or shut down the runtime environment when running an MPICH program. With MPICH you'll just need to write, compile, and run your code. The downside is, if your program crashes, you may need to manually kill errant processes on compute nodes. But this shouldn't be a common problem. Also, you'll be able to run programs as root provided you distribute the binaries to all the nodes. (File access can be an issue if you don't export root's home directory via NFS.)
The first step is to write and enter your program using your favorite text editor. Like LAM, MPICH supplies a set of wrapper programs to simplify compilation: mpicc, mpiCC, mpif77, and mpif90 for C, C++, FORTRAN 77, and FORTRAN 90, respectively. Here is an example of compiling a C program:
[sloanjd@fanny sloanjd]$ mpicc -o cpi cpi.c
cpi.c is one of the sample programs included with MPICH. It can be found in the directory ../examples/basic under the source directory.
You can see the options supplied by the wrapper program without executing the code by using the -show option. For example,
[sloanjd@fanny sloanjd]$ mpicc -show -o cpi cpi.c
gcc -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STDARG_H=1 -DUSE_STDARG=1 -DMALLOC_RET_VOID=1 -L/opt/mpich-1.2.5.10-ch_p4-gcc/lib -o cpi cpi.c -lmpich
Obviously, you'll want to use the wrapper programs rather than type in arguments manually.
To run a program, you use the mpirun command. Again, before the code will run, you must have copies of the binaries on each machine and you must be able to log into each machine with SSH without a password. Here is an example of running the code we just compiled.
[sloanjd@fanny sloanjd]$ mpirun -np 4 cpi
Process 0 of 4 on fanny.wofford.int
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.008783
Process 2 of 4 on hector.wofford.int
Process 1 of 4 on george.wofford.int
Process 3 of 4 on ida.wofford.int
The argument -np 4 specified running the program with four processes. If you want to specify a particular set of machines, use the -machinefile argument.
[sloanjd@fanny sloanjd]$ mpirun -np 4 -machinefile machines cpi
Process 0 of 4 on fanny.wofford.int
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.007159
Process 1 of 4 on george.wofford.int
Process 2 of 4 on fanny.wofford.int
Process 3 of 4 on george.wofford.int
In this example, four processes were run on the two machines listed in the file machines. Notice that each machine was used twice. You can view the mpicc(1) and mpirun(1) manpages for more details.
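The placement seen in this output, with processes handed out to the listed machines in turn and wrapping around when the list runs out, can be mimicked with a few lines of shell. This is only an illustration of the pattern visible above, under the assumption of a simple wraparound. It is not MPICH's own placement code, and the scratch machine file is created just for the demonstration.

```shell
# Scratch machine file with the two hosts from the example above.
machines=$(mktemp)
cat > "$machines" <<'EOF'
fanny.wofford.int
george.wofford.int
EOF

np=4
nhosts=$(grep -c . "$machines")

# Cycle through the list, wrapping around when the hosts run out.
placement=$(
    rank=0
    while [ "$rank" -lt "$np" ]; do
        line=$(( rank % nhosts + 1 ))
        host=$(sed -n "${line}p" "$machines")
        echo "Process $rank of $np on $host"
        rank=$(( rank + 1 ))
    done
)
echo "$placement"
```

Changing np or adding hosts to the scratch file lets you see how many processes would land on each machine under this assumption.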
9.4.4 Testing the Installation
You can test connectivity issues and the like with the MPICH-supplied script tstmachines, which is located in the ../sbin directory under the MPICH installation. This script takes the architecture as an argument. For example,
[sloanjd@fanny sloanjd]$ /usr/local/mpich-1.2.5.2/sbin/tstmachines LINUX
If all is well, the script runs and terminates silently. If there is a problem, it makes suggestions on how to fix the problem. If you want more reassurance that it is actually doing something, you can run it with the -v argument.
For more thorough testing, MPICH supplies a collection of tests with the source files. These are in the directory ../examples/test. You run these tests by executing the command:
[sloanjd@fanny test]$ make testing | tee make.log
...
You'll need to do this in the test directory. This directory must be shared among all the nodes on the cluster, so you will have to either mount this directory on all the machines or copy its contents over to a mounted directory. When this runs, you'll see a lot of output as your cluster is put through its paces. The output will be copied to the file make.log, so you'll be able to peruse it at your leisure.
9.4.5 MPE
The Multi-Processing Environment (MPE) library extends MPI. MPE provides such additional facilities as libraries for creating log files, an X graphics library, graphical visualization tools, routines for serializing sections of parallel code, and debugger setup routines. While developed for use with MPICH, MPE can be used with any MPI implementation. MPE is included with MPICH and will be built and installed. MPE includes both a library for collecting information and a viewer for displaying the collected information. A user's guide is available that provides greater detail. Use of MPE is described in greater detail in Chapter 17.
MPE includes four viewers: upshot, nupshot, jumpshot-2, and jumpshot-3. These are not built automatically since the software required for the build may not be present on every machine. Both upshot and nupshot require Tcl/Tk and Wish. jumpshot-2 and jumpshot-3 require Java.
There are three different output formats for MPE log files: alog, an ASCII format provided for backwards compatibility; clog, alog's binary equivalent; and slog, a scalable format capable of handling very large files. upshot reads alog files, nupshot and jumpshot-2 read clog files, and jumpshot-3 reads slog files. MPE includes two utilities, clog2slog and clog2alog, to convert between formats. The basic functionality of the viewers is similar, so installing any one of them will probably meet your basic needs.
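The pairing of viewers, log formats, and converters described above can be written down as a small lookup. This is only an illustrative sketch: the function conversion_command is a hypothetical helper, not part of MPE, and it assumes the log format can be read from the file extension.

```python
# Log format each viewer reads, as listed in the text above.
VIEWER_FORMAT = {
    "upshot": "alog",
    "nupshot": "clog",
    "jumpshot-2": "clog",
    "jumpshot-3": "slog",
}

# The two converters named in the text, keyed by (source, target) format.
CONVERTERS = {
    ("clog", "slog"): "clog2slog",
    ("clog", "alog"): "clog2alog",
}


def conversion_command(viewer, logfile):
    """Return the converter command needed before viewer can open logfile.

    Returns None when the viewer reads the file as it is. Raises ValueError
    when neither of the two converters covers the required conversion.
    """
    source = logfile.rsplit(".", 1)[-1]   # assumption: extension names the format
    target = VIEWER_FORMAT[viewer]
    if source == target:
        return None
    if (source, target) not in CONVERTERS:
        raise ValueError(f"no converter listed from {source} to {target}")
    return [CONVERTERS[(source, target)], logfile]
```

For a clog file and jumpshot-3, this returns the clog2slog command that is used later in this section.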
Although the requirements are different, the compilation process is similar for each tool. You can build the viewers collectively or individually. For example, to compile jumpshot-3, you'll need to install Java if you don't already have it. JDK-1.1, JDK-1.2, or JDK-1.3 can be used. (jumpshot-2 compiles only with JDK-1.1.) If you don't have the appropriate Java, you can download it from http://www.blackdown.org or http://java.sun.com and follow the installation directions given at the respective site. Once Java has been installed, make sure that you add its directory to your path. Next, change to the ../mpe/viewers/jumpshot-3 subdirectory under the MPICH directory, for example, /usr/local/src/mpich-1.2.5.2/mpe/viewers/jumpshot-3. Now you can configure and build jumpshot-3.
[root@fanny jumpshot-3]# ./configure
...
[root@fanny jumpshot-3]# make
...
[root@fanny jumpshot-3]# make install
...
jumpshot-3 will be installed in the /usr/local/bin directory as jumpshot. (You will only need to install it on the head node.) For details on the installation of the other viewers, see the MPE installation and user's guide.
To test your installation, you'll need to compile a program using the -mpilog option and run the code to create a log file.
[sloanjd@fanny sloanjd]$ mpicc -mpilog -o cpi cpi.c
[sloanjd@fanny sloanjd]$ mpirun cpi
...
When you run the code, the log file cpi.clog will be created. You'll need to convert this to a format that jumpshot-3 can read.
[sloanjd@fanny sloanjd]$ clog2slog cpi.clog
The conversion routines are in the directory ../mpich-1.2.5.2/bin. Now you can view the output. Of course, you must have a graphical login for this to work. With this command, several windows should open on your display.
[sloanjd@fanny sloanjd]$ jumpshot cpi.slog
As noted, the use of MPE will be described in greater detail in Chapter 17.
9.5 Other Programming Software
Keeping in mind that your head node will also serve as a software development platform, there are other software packages that you'll want to install. One obvious utility is the ubiquitous text editor. Fortunately, the most likely choices are readily available and will be part of your basic installation. Just don't forget them when you install the system. Because personal preferences vary so widely, you'll want to include the full complement.
9.5.1 Debuggers
Another essential tool is a software debugger. Let's face it, using printf to debug parallel code is usually a hopeless task. With multiple processes and buffered output, it is unlikely you'll know where the program was executing when you actually see the output. The best solution is a debugger designed specifically for parallel code. While commercial products such as TotalView are available and work well with MPI, free software is wanting. At the very least, you will want a good traditional debugger such as gdb. Programs that extend gdb, such as ddd (the Data Display Debugger), are a nice addition. (Debugging is discussed in greater detail in Chapter 16.) Since it is difficult to tell when they will be needed and just how essential they will be, try to be as inclusive as possible when installing these tools.
As part of the gcc development package, gdb is pretty standard fare and should already be on your system. However, ddd may not be installed by default. Since ddd provides a GUI for other debuggers such as gdb, there is no point installing it on a system that doesn't have X Windows and gdb or a similar debugger. ddd is often included as part of a Linux distribution; for instance, Red Hat includes it. If not, you can download it from http://www.gnu.org/software/ddd. The easiest way to install it is from an RPM.
[root@fanny root]# rpm -vih ddd-3.3.1-23.i386.rpm
warning: ddd-3.3.1-23.i386.rpm: V3 DSA signature: NOKEY, key ID db42a60e
Preparing... ########################################### [100%]
1:ddd ########################################### [100%]
Depending on what is installed on your system, you may run into a few dependencies. For example, ddd requires openmotif.
9.5.2 HDF5
Depending on the nature of the programming you do, there may be other useful libraries that you'll want to install. One such package that OSCAR includes is Hierarchical Data Format (Version 5) or HDF5. HDF5 is a freely available software package developed by the HDF5 group at the National Center for Supercomputing Applications (NCSA). The official web site is http://hdf.ncsa.uiuc.edu/HDF5/.
HDF5 is both a file format standard and a library with utilities specifically designed for storing scientific data. It supports very large files and is designed and tuned for efficient storage on parallel computing systems. Data is stored in two parts, a header and a data array. The header contains the information needed to interpret the data array. That is, it describes and annotates the data set. The data sets are essentially multidimensional arrays of items. The API is available only in C. While HDF5 is beyond the scope of this book, you should be aware it exists should you need it. An extensive tutorial, as well as other documentation, is available at the software's web site.
9.5.3 SPRNG
Scalable Parallel Random Number Generators (SPRNG) is a library that provides six different state-of-the-art random number generators for use with parallel programs. SPRNG integrates nicely with MPI. Its use is described in Chapter 15.
SPRNG is freely available from http://sprng.cs.fsu.edu/. At the time this was written, the latest version was sprng2.0a.tgz. First, download the package and move it to an appropriate directory, e.g., /usr/local/src. The next step is to unpack it.
[root@amy src]# gunzip sprng2.0a.tgz
[root@amy src]# tar -xvf sprng2.0a.tar
...
Then change to the directory where you just unpacked the source.
Before you can build it, you need to edit a couple of files. In the first section of the file make.CHOICES, select the appropriate platform. Typically, this will be INTEL for Linux clusters. Make sure the line
PLAT = INTEL
is uncommented and the lines for other platforms are commented out. Because you want to use it with MPI, in the second section, uncomment the line
MPIDEF = -DSPRNG_MPI
You should also comment out the two lines in the third section if libgmp.a is not available on your system.
You should also edit the appropriate architecture file in the SRC subdirectory, typically make.INTEL. You'll need to make two sets of changes for a Linux cluster. First, change all the gcc optimization flags from -O3 to -O1. Next, change all the paths to MPI to match your machine. For the setup shown in this chapter, the following lines were changed:
MPIDIR = -L/usr/local/mpich-1.2.5.2/lib
and
CFLAGS = -O1 -DLittleEndian $(PMLCGDEF) $(MPIDEF) -D$(PLAT) \
 -I/usr/local/mpich-1.2.5.2/include -I/usr/local/mpich-1.2.5.2/include
CLDFLAGS = -O1
FFLAGS = -O1 $(PMLCGDEF) $(MPIDEF) -D$(PLAT) \
 -I/usr/local/mpich-1.2.5.2/include -I/usr/local/mpich-1.2.5.2/include -I.
F77LDFLAGS = -O1
Once you've done this, run make from the root of the source tree. If you want to play with the MPI examples, run make mpi in the EXAMPLES subdirectory.
To use the library, you must adjust your compile paths to include the appropriate directories. For example, to use SPRNG with OSCAR and MPICH, the following changes should work.
MPIDIR = -L/opt/mpich-1.2.5.10-ch_p4-gcc/lib
MPILIB = -lmpich
# Please include mpi header file path, if needed
CFLAGS = -O1 -DLittleEndian $(PMLCGDEF) $(MPIDEF) -D$(PLAT) \
 -I/opt/mpich-1.2.5.10-ch_p4-gcc/include -I/opt/mpich-1.2.5.10-ch_p4-gcc/include
CLDFLAGS = -O1
FFLAGS = -O1 $(PMLCGDEF) $(MPIDEF) -D$(PLAT) \
 -I/opt/mpich-1.2.5.10-ch_p4-gcc/include -I/opt/mpich-1.2.5.10-ch_p4-gcc/include -I.
F77LDFLAGS = -O1
Note this installation is specific to one version of MPI.
See Chapter 15 for the details of using SPRNG.
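If you would rather not change the optimization flags in make.INTEL by hand, the substitution from -O3 to -O1 is easy to script. The following is a minimal sketch, not part of SPRNG: the function name lower_optimization is made up for illustration, and it works on the file's text so you can inspect the result before writing anything back.

```python
import re


def lower_optimization(makefile_text, old="-O3", new="-O1"):
    """Replace each stand-alone occurrence of the old flag with the new one.

    The flag must be delimited by whitespace (or the start/end of the text),
    so a string such as 'foo-O3x' is left alone.
    """
    pattern = re.compile(r"(?<!\S)" + re.escape(old) + r"(?!\S)")
    return pattern.sub(new, makefile_text)


# Hypothetical fragment in the style of an architecture file.
sample = "CFLAGS = -O3 -DLittleEndian\nCLDFLAGS = -O3\n"
print(lower_optimization(sample))
```

You would still need to edit the MPI paths yourself, since those depend on where MPICH lives on your machine.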
9.6 Notes for OSCAR Users
LAM/MPI, MPICH, and HDF5 are installed as part of a standard OSCAR installation under the /opt directory to conform to the Filesystem Hierarchy Standard (FHS) (http://www.pathname.com/fhs/). Both MPICH and HDF5 have documentation subdirectories doc with additional information. OSCAR does not install MPE as part of the MPICH installation. If you want to use MPE, you'll need to go back and do a manual installation. Fortunately, this is not particularly difficult, but it can be a bit confusing.
9.6.1 Adding MPE
First, use switcher to select your preferred version of MPI. Since you can't run LAM/MPI as root, MPICH is probably a better choice. For example,
[root@amy root]# switcher mpi --list
lam-7.0
lam-with-gm-7.0
mpich-ch_p4-gcc-1.2.5.10
[root@amy root]# switcher mpi = mpich-ch_p4-gcc-1.2.5.10
Attribute successfully set; new attribute setting will be effective for future shells
If you had to change MPI, log out and back onto the system.
Next, you'll need to retrieve and unpack a copy of MPICH.
[root@amy root]# cp mpich.tar.gz /usr/local/src
[root@amy root]# cd /usr/local/src
[root@amy src]# gunzip mpich.tar.gz
[root@amy src]# tar -xvf mpich.tar
...
/usr/local/src is a reasonable location.
If you don't have it on your system, you'll need to install Java to build jumpshot.
[root@amy src]# bunzip2 j2sdk-1.3.1-FCS-linux-i386.tar.bz2
[root@amy src]# tar -xvf j2sdk-1.3.1-FCS-linux-i386.tar
...
Again, /usr/local/src is a reasonable choice.
Next, you need to set your PATH to include Java and set environmental variables for MPICH.
[root@amy src]# export PATH=/usr/local/src/j2sdk1.3.1/bin:$PATH
[root@amy src]# export MPI_INC="-I/opt/mpich-1.2.5.10-ch_p4-gcc/include"
[root@amy src]# export MPI_LIBS="-L/opt/mpich-1.2.5.10-ch_p4-gcc/lib"
[root@amy src]# export MPI_CC=mpicc
[root@amy src]# export MPI_F77=mpif77
(Be sure these paths match your system.)
Now you can change to the MPE directory and run configure, make, and make install.
[root@amy src]# cd mpich-1.2.5.2/mpe
[root@amy mpe]# ./configure
...
[root@amy mpe]# make
...
[root@amy mpe]# make install
...
You should now have MPE on your system. If you used the same directories as used here, it will be in /usr/local/src/mpich-1.2.5.2/mpe.
9.7 Notes for Rocks Users
Rocks does not include LAM/MPI or HDF5 but does include several different MPICH releases, located in /opt. MPE is included as part of Rocks with each release. The MPE libraries are included with the MPICH libraries, e.g., /opt/mpich/gnu/lib. Rocks includes the jumpshot3 script as well, e.g., /opt/mpich/gnu/share/jumpshot-3/bin for MPICH. (Rocks also includes upshot.)
By default, Rocks does not include Java. There is, however, a Java roll for Rocks. To use jumpshot3, you'll need to install the appropriate version of Java. You can look in the jumpshot3 script to see what it expects. You should see something like the following near the top of the file:
...
JAVA_HOME=/usr/java/j2sdk1.4.2_02
...
JVM=/usr/java/j2sdk1.4.2_02/bin/java
...
You can either install j2sdk1.4.2_02 in /usr/java or you can edit these lines to match your Java installation. For example, if you install the Java package described in the last section, you might change these lines to
JAVA_HOME=/usr/local/src/j2sdk1.3.1
JVM=/usr/local/src/j2sdk1.3.1/bin/java
Adjust the path according to your needs.
Chapter 10. Management Software
Now that you have a cluster, you are going to want to keep it running, which will involve a number of routine system administration tasks. If you have done system administration before, then for the most part you won't be doing anything new. The administrative tasks you'll face are largely the same tasks you would face with any multiuser system. It is just that these tasks will be multiplied by the number of machines in your cluster. While creating 25 new accounts on a server may not sound too hard, when you have to duplicate those accounts on each node in a 200-node cluster, you'll probably want some help.
For a small cluster with only a few users, you may be able to get by doing things the way you are used to doing them. But why bother? The tools in this chapter are easy to install and use. Mastering them, which won't take long, will lighten your workload.
While there are a number of tools available, two representative tools (or tool sets) are described in this chapter: the Cluster Command and Control (C3) tool set and Ganglia. C3 is a set of utilities that can be used to automate a number of tasks across a cluster or multiple clusters, such as executing the same command on every machine or distributing files to every machine. Ganglia is used to monitor the health of your cluster from a single node using a web-based interface.
10.1 C3
Cluster Command and Control is a set of about a dozen command-line utilities used to execute common management tasks. These commands were designed to provide a look and feel similar to that of issuing commands on a single machine.[1] The commands are both secure and scale reliably. Each command is actually a Python script. C3 was developed at Oak Ridge National Laboratory and is freely available.
[1] A Python/TK GUI known as C2G has also been developed.
10.1.1 Installing C3
There are two ways C3 can be installed. With the basic install, you'll do a full C3 installation on a single machine, typically the head node, and issue commands on that machine. With large clusters, this can be inefficient because that single machine must communicate with each of the other machines in the cluster. The alternate approach is referred to as a scalable installation. With this method, C3 is installed on all the machines and the configuration is changed so that a tree structure is used to distribute commands. That is, commands fan out through intermediate machines and are relayed across the cluster more efficiently. Both installations begin the same way; for the scalable install, you'll repeat the installation on the other machines and alter the configuration file. This description will stick to the simple install. The simple installation includes a file README.scale that describes the scalable installation.
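To see why a tree helps, it is enough to count how many machines the head node has to contact directly. The sketch below is purely illustrative and makes no claim about how C3 actually arranges its relays: the function fan_out_plan, the two-level layout, and the group size of 8 are all assumptions chosen for the example.

```python
def fan_out_plan(nodes, group_size):
    """Split nodes into groups; the first node of each group acts as a relay.

    Returns a dict mapping each relay to the nodes it forwards commands to.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    plan = {}
    for start in range(0, len(nodes), group_size):
        group = nodes[start:start + group_size]
        plan[group[0]] = group[1:]
    return plan


nodes = [f"node{i}" for i in range(1, 65)]   # hypothetical 64-node cluster
plan = fan_out_plan(nodes, 8)

# Flat layout: the head node contacts every machine itself.
print("flat:", len(nodes), "connections from the head node")
# Two-level tree: the head node contacts only the relays.
print("tree:", len(plan), "connections from the head node")
```

With 64 nodes and groups of 8, the head node talks to 8 relays rather than 64 machines, and each relay handles 7 others.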
Since the C3 tools are scripts, there is very little to do to install them. However, since they rely on several other common packages and services, you will need to be sure that all the prerequisites are met. On most systems this won't be a problem; everything you'll need will already be in place.
Before you can install C3, make sure that rsync, Perl, SSH, and Python are installed on your system and available. Name resolution, either through DNS or a host file, must be available as well. Additionally, if you want to use the C3 command cpushimage, SystemImager must be installed. Installing SystemImager is discussed in Chapter 8.
Once you have met the prerequisites, you can download, unpack, and install C3. To download it, go to http://www.csm.ornl.gov/torc/C3/ and follow the link to the download page. You can download sources or an RPM package. In this example, sources are used. If you install from RPMs, install the full install RPM and profile RPM on servers and the client RPM on clients. Note that with the simple installation you only need to install C3 on the head node of your cluster. However, you will need SSH and the like on every node.
Once you have unpacked the software and read the README files, you can run the install script Install-c3.
[root@fanny src]# gunzip c3-4.0.1.tar.gz
[root@fanny src]# tar -xvf c3-4.0.1.tar
[root@fanny src]# cd c3-4.0.1
[root@fanny c3-4.0.1]# ./Install-c3
The install script will copy the scripts to /opt/c3-4 (for Version 4 at least), set paths, and install man pages. There is nothing to compile.
The next step is creating a configuration file. The default file is /etc/c3.conf. However, you can use other configuration files if you wish by explicitly referencing them in C3 commands using the -f option with the file name.
Here is a very simple configuration file:
cluster local {
    fanny.wofford.int
    george.wofford.int
    hector.wofford.int
    ida.wofford.int
    james.wofford.int
}
This example shows a configuration for a single cluster. In fact, the configuration file can contain information on multiple clusters. Each cluster will have its own cluster description block, which begins with the identifier cluster followed by a name for a cluster. The name can be used in C3 commands to identify the specific cluster if you have multiple cluster description blocks. Next, the machines within the cluster are listed within curly braces. The first machine listed is the head node. To remove ambiguity, the head node entry can consist of two parts separated by a colon: the head node's external interface to the left of the colon and the head node's internal interface to the right of the colon. (Since fanny has a single interface, that format was not appropriate for this example.) The head node is followed by the compute nodes. In this example, the compute nodes are listed one per line. It is possible to specify a range. For example, node[1-64] would specify 64 machines with the names node1, node2, etc. The cluster definition block is closed with another curly brace. Of course, all machine names must resolve to IP addresses, typically via the /etc/hosts file. (The commands cname and cnum, described later in this section, can be helpful in discerning the details surrounding node indices.)
Within the compute node list, you can also use the qualifiers exclude and dead. exclude is applied to range qualifiers and immediately follows a range specification. dead applies to individual machines and precedes the machine name. For example,
node[1-64]
exclude 60
alice
dead bob
carol
In this list node60 and bob are designated as being unavailable. Starting with Version 3 of C3, it is possible to use ranges in C3 commands to restrict actions to just those machines within the range. The order of the machines in the configuration file determines their numerical position within the range. In the example, the 67 machines defined have list positions 0 through 66. If you deleted bob from the file instead of marking it as dead, carol's position would change from 66 to 65, which could cause confusion. By using exclude and dead, you effectively remove a machine from a cluster without renumbering the remaining machines. dead can also be used with a dummy machine to switch from 0-indexing to 1-indexing. For example, just add the following line to the beginning of the machine list:
dead place_holder
Once done, all the machines in the list move up one position. For more details on the configuration file, see the c3.conf(5) and c3-scale(5) manpages.
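If the numbering rules seem slippery, it can help to work them out mechanically. The sketch below is not C3 code; it is a hypothetical helper, node_positions, that applies the rules as described in this section to a compute node list. It assumes only the forms shown above: a name[first-last] range, exclude followed by a single number, dead followed by a name, and plain machine names.

```python
import re

# Matches a range entry such as node[1-64].
RANGE = re.compile(r"^(?P<prefix>\w+)\[(?P<first>\d+)-(?P<last>\d+)\]$")


def node_positions(lines):
    """Return a list of (position, name, available) for a compute node list."""
    entries = []       # [name, available] pairs in file order
    last_range = {}    # range number -> entry, for the most recent range line
    for raw in lines:
        words = raw.split()
        if not words:
            continue
        if words[0] == "exclude":
            # Mark one member of the preceding range as unavailable.
            last_range[int(words[1])][1] = False
        elif words[0] == "dead":
            # A dead machine still occupies a position in the list.
            entries.append([words[1], False])
        else:
            match = RANGE.match(words[0])
            if match:
                last_range = {}
                first, last = int(match["first"]), int(match["last"])
                for n in range(first, last + 1):
                    entry = [f"{match['prefix']}{n}", True]
                    entries.append(entry)
                    last_range[n] = entry
            else:
                entries.append([words[0], True])
    return [(i, name, ok) for i, (name, ok) in enumerate(entries)]


listing = ["node[1-64]", "exclude 60", "alice", "dead bob", "carol"]
for position, name, ok in node_positions(listing)[-3:]:
    print(position, name, "available" if ok else "unavailable")
```

Run against the example list, this places carol at position 66 and keeps node60 and bob in their slots while marking them unavailable. Adding dead place_holder as the first entry shifts node1 from position 0 to position 1.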
Once you have created your configuration file, there is one last thing you need to do before C3 is ready to go. For the command ckill to work properly, the Perl script ckillnode must be installed on each individual machine. Fortunately, the rest of C3 is installed and functional, so you can use it to complete the installation. Just issue these commands:
[root@fanny root]# cexec mkdir /opt/c3-4
************************* local *************************
--------- george.wofford.int---------
...
[root@fanny root]# cpush /opt/c3-4/ckillnode
building file list ... building file list ... building file list ... building file list ... done
...
The first command makes the directory /opt/c3-4 on each machine in your cluster and the second copies the file ckillnode to each machine. You should see a fair amount of output with each command. If you are starting SSH manually, you'll need to start it before you try this.
10.1.2 Using C3 Commands
Here is a brief description of C3's more useful utilities.
10.1.2.1 cexec
This command executes a command string on each node in a cluster. For example,
[root@fanny root]# cexec mkdir tmp
************************* local *************************
--------- george.wofford.int---------
--------- hector.wofford.int---------
--------- ida.wofford.int---------
--------- james.wofford.int---------
The directory tmp has been created on each machine in the local cluster. cexec has a serial version cexecs that can be used for testing. With the serial version, the command is executed to completion on each machine before it is executed on the next machine. If there is any ambiguity about the order of execution for the parts of a command, you should enclose the command string in double quotes. Consider:
[root@fanny root]# cexec "ps | grep a.out"
...
The quotes are needed here so grep will be run on each individual machine rather than have the full output from ps shipped to the head node.
10.1.2.2 cget
This command is used to retrieve a file from each machine in the cluster. Since each file will initially have the same name, when the file is copied over, the cluster and host names are appended. Here is an example.
[root@fanny root]# cget /etc/motd
[root@fanny root]# ls
motd_local_george.wofford.int
motd_local_hector.wofford.int
motd_local_ida.wofford.int
motd_local_james.wofford.int
cget ignores links and subdirectories.
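The naming pattern in the listing above is easy to reproduce if you need to locate or post-process the retrieved files in a script. This is a small sketch based only on the file names shown in the example; cget_local_name is a hypothetical helper, not a C3 function.

```python
import posixpath


def cget_local_name(remote_path, cluster, host):
    """Build the local name for a file fetched from host in cluster.

    Follows the pattern in the example: base name, cluster name, and host
    name joined with underscores.
    """
    return f"{posixpath.basename(remote_path)}_{cluster}_{host}"


hosts = ["george.wofford.int", "hector.wofford.int"]
for host in hosts:
    print(cget_local_name("/etc/motd", "local", host))
```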
10.1.2.3 ckill
This script allows you to kill a process running on each node in your cluster. To use it, specify the process by name, not by number, because it is unlikely that the processes will have the same process ID on each node.
[root@fanny root]# ckill -u sloanjd a.out
uid selected is 500
uid selected is 500
uid selected is 500
uid selected is 500
You may also specify an owner as shown in the example. By default, the local user name will be used.
10.1.2.4 cpush
This command is used to move a file to each node on the cluster.
[root@fanny root]# cpush /etc/motd /root/motd.bak
building file list ... done
building file list ... done
motd
motd
building file list ... done
motd
wrote 119 bytes read 36 bytes 62.00 bytes/sec
total size is 39 speedup is 0.25
wrote 119 bytes read 36 bytes 62.00 bytes/sec
total size is 39 speedup is 0.25
wrote 119 bytes read 36 bytes 62.00 bytes/sec
total size is 39 speedup is 0.25
building file list ... done
motd
wrote 119 bytes read 36 bytes 62.00 bytes/sec
total size is 39 speedup is 0.25
As you can see, statistics for each move are printed. If you only specify one file, it will use the same name and directory for the source and the destination.
10.1.2.5 crm
This routine deletes or removes files across the cluster.
[root@fanny root]# crm /root/motd.bak
As with its serial counterpart, rm, you can use the -i, -r, and -v options for interactive, recursive, and verbose deletes, respectively. Please note, the -i option only prompts once, not for each node. Without options, crm silently deletes files.
10.1.2.6 cshutdown
This utility allows you to shut down the nodes in your cluster.
[root@fanny root]# cshutdown -r t 0
In this example, the time specified was 0 for an immediate reboot. (Note the absence of the hyphen for the t option.) Additional options are supported, e.g., to include a shutdown message.
10.1.2.7 clist, cname, and cnum
These three commands are used to query the configuration file to assist in determining the appropriate numerical ranges to use with C3 commands. clist lists the different clusters in the configuration file.
[root@amy root]# clist
cluster oscar_cluster is a direct local cluster
cluster pvfs_clients is a direct local cluster
cluster pvfs_iod is a direct local cluster
cname lists the names of machines for a specified range.
[root@fanny root]# cname local:0-1
nodes from cluster: local
cluster: local ; node name: george.wofford.int
cluster: local ; node name: hector.wofford.int
Note the use of 0 indexing.
cnum determines the index of a machine given its name.
[root@fanny root]# cnum ida.wofford.int
nodes from cluster: local
ida.wofford.int is at index 2 in cluster local
These can be very helpful because it is easy to lose track of which machine has which index.
10.1.2.8 Further examples and comments
Here is an example using a range:
[root@fanny root]# cpush local:2-3 data ...
local designates which cluster within your configuration file is meant. Because compute nodes are numbered from 0, this will push the file data to the third and fourth nodes in the cluster. (That is, it will send the file from fanny to ida and james, skipping over george and hector.) Is that what you expected? For more information on ranges, see the manpage c3-range(5).
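If you want to double-check which machines a range will touch before you run a command, the arithmetic is simple enough to sketch. The helper below, select_nodes, is hypothetical and handles only the cluster:first-last form used in these examples; see c3-range(5) for the forms C3 itself accepts. The node lists hold compute nodes only, since the head node is not counted.

```python
def select_nodes(spec, clusters):
    """Resolve a 'cluster:first-last' range against per-cluster node lists.

    Positions are 0-based and the range is inclusive at both ends.
    """
    name, _, positions = spec.partition(":")
    first, _, last = positions.partition("-")
    return clusters[name][int(first):int(last) + 1]


# Compute nodes from the sample configuration file, in file order.
clusters = {
    "local": [
        "george.wofford.int",
        "hector.wofford.int",
        "ida.wofford.int",
        "james.wofford.int",
    ],
}

print(select_nodes("local:2-3", clusters))
```

For local:2-3, this selects ida and james, which agrees with the description above. In practice, cname gives you the same answer directly from your real configuration file.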
Note that the name used in C3 commands must match the name used in the configuration file. For C3, ida and ida.wofford.int are not equal even if there is an alias ida that resolves to ida.wofford.int. For example,
[root@fanny root]# cnum ida.wofford.int
nodes from cluster: local
ida.wofford.int is at index 2 in cluster local
[root@fanny root]# cnum ida
nodes from cluster: local
When in doubt about what form to use, just refer back to /etc/c3.conf.
In addition to the commands just described, the C3 command cpushimage can be used with SystemImager to push an image from server to nodes. There are also several user-contributed utilities. While not installed, these can be found in the C3 source tree in the subdirectory contrib. User-contributed scripts can be used as examples for writing other scripts using C3 commands.
C3 commands take a number of different options not discussed here. For a brief description of other options, use the --help option with individual commands. For greater detail, consult the manpage for the individual command.
10.2 Ganglia
With a large cluster, it can be a daunting task just to ensure that every machine is up and running every day if you try to do it manually. Fortunately, there are several tools that you can use to monitor the state of your cluster. In clustering circles, the better known of these include Ganglia, Clumon, and Performance Co-Pilot (PCP). While this section will describe Ganglia, you might reasonably consider any of these.
Ganglia is a real-time performance monitor for clusters and grids. If you are familiar with MRTG, Ganglia uses the same round-robin database package that was developed for MRTG. Memory efficient and robust, Ganglia scales well and has been used with clusters with hundreds of machines. It is also straightforward to configure for use with multiple clusters so that a single management station can monitor all the nodes within multiple clusters. It was developed at UCB, is freely available (via a BSD license), and has been ported to a number of different architectures.
Ganglia uses a client-server model and is composed of four parts. The monitor daemon gmond needs to be installed on every machine in the cluster. The backend for data collection, the daemon gmetad, and the web interface frontend are installed on a single management station. (There is also a Python class for sorting and classifying data from large clusters.) Data are transmitted using XML and XDR via both TCP and multicasting.
In addition to these core components, there are two command-line tools. The cluster status tool gstat provides a way to query gmond, allowing you to create a status report for your cluster. The metric tool gmetric allows you to easily monitor additional host metrics in addition to Ganglia's predefined metrics. For instance, suppose you have a program (and interface) that measures a computer's temperature on each node. gmetric can be used to request that gmond run this program. By running the gmetric command under cron, you could track computer temperature over time.
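As a rough illustration of the temperature example, a script run from cron might take a reading and hand it to gmetric. The sketch below only builds the command line. It assumes gmetric's long options --name, --value, --type, and --units; check gmetric --help on your system before relying on them. The function read_temperature is a made-up placeholder for whatever sensor program you actually have.

```python
import subprocess


def gmetric_command(name, value, value_type, units):
    """Build the argument list for a single gmetric call."""
    return [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={value_type}",
        f"--units={units}",
    ]


def read_temperature():
    """Placeholder for a site-specific sensor reading (hypothetical)."""
    return 41.5


def publish_temperature():
    """Take one reading and announce it; meant to be called from a cron job."""
    command = gmetric_command("temperature", read_temperature(), "float", "Celsius")
    subprocess.run(command, check=True)
```

Keeping the command construction separate from the call to subprocess.run makes it easy to print and inspect the command before you let cron run it unattended.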
Finally, Ganglia also provides an execution environment. gexec allows you to run commands across the cluster transparently and forward stdin, stdout, and stderr. This discussion will focus on the three core elements of Ganglia: gmond, gmetad, and the web frontend.
10.2.1 Installing and Using Ganglia
Ganglia can be installed by compiling the sources or using RPM packages. The installation of the software for the management station, i.e., the node that collects information from the other nodes and maintains the database, is somewhat more involved than the installation on the other nodes. With large clusters, you may want to use a machine as a dedicated monitor. For smaller clusters, you may be able to get by with your head node if it is reasonably equipped. We'll look at the installation of the management station first.
10.2.1.1 RRDTool
Before you begin, there are several prerequisites for installing Ganglia. First, your network and hosts must be multicast enabled. This typically isn't a problem with most Linux installations. Next, the management station or stations, i.e., the machine on which you'll install gmetad and the web frontend, will also need RRDtool, Perl, and a PHP-enabled web server.[2] (Since you will install only gmond on your compute nodes, these do not require Apache or RRDtool.)
[2] It appears that only the include file and library from RRDtool are needed, but I have not verified this. Perl is required for RRDtool, not Ganglia.
RRDtool is a round-robin database. As you add information to the database, the oldest data is dropped from the database. This allows you to store data in a compact manner that will not expand endlessly over time. Sources can be downloaded from http://www.rrdtool.org/. To install it, you'll need to unpack it and run configure, make, and make install.
[root@fanny src]# gunzip rrdtool-1.0.48.tar.gz
[root@fanny src]# tar -vxf rrdtool-1.0.48.tar
...
[root@fanny src]# cd rrdtool-1.0.48
[root@fanny rrdtool-1.0.48]# ./configure
...
[root@fanny rrdtool-1.0.48]# make
[root@fanny rrdtool-1.0.48]# make install
...
You'll see a lot of output along the way. In this example, I've installed it under /usr/local/src. If you want to install it in a different directory, you can use the --prefix option to specify the directory when you run configure. It doesn't really matter where you put it, but when you build Ganglia you'll need to tell Ganglia where to find the RRDtool library and include files.
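If the idea of a round-robin database is new to you, the behavior described at the start of this section, where the oldest data is dropped as new data arrives, can be mimicked in a few lines. This is a toy model only. It does not use RRDtool or its file format, and the class name RoundRobinSeries is invented for the example.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size store that discards the oldest sample when it is full."""

    def __init__(self, slots):
        # A deque with maxlen drops items from the far end automatically.
        self._samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        """Record one sample, pushing out the oldest if no slot is free."""
        self._samples.append((timestamp, value))

    def fetch(self):
        """Return the retained samples, oldest first."""
        return list(self._samples)


series = RoundRobinSeries(slots=3)
for t in range(5):
    series.update(t, t * 10)
print(series.fetch())   # only the three most recent samples remain
```

However many samples you add, the store never grows beyond the number of slots you gave it, which is why the database does not expand endlessly over time.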
10.2.1.2 Apache and PHP
Next, check the configuration files for Apache to ensure the PHP module is loaded. For Red Hat 9.0, the primary configuration file is httpd.conf and is located in /etc/httpd/conf/. It, in turn, includes the configuration files in /etc/httpd/conf.d/, in particular php.conf. What you are looking for is a configuration command that loads the PHP module somewhere in one of the Apache configuration files. That is, one of the configuration files should have some lines like the following:
LoadModule php4_module modules/libphp4.so
...
SetOutputFilter PHP
SetInputFilter PHP
LimitRequestBody 524288
If you used the package system to set up Apache and PHP, this should have been done for you. Finally, make sure Apache is running.
10.2.1.3 Ganglia monitor core
Next, you'll need to download the appropriate software. Go to http://ganglia.sourceforge.net/. You'll have a number of choices, including both source files and RPM files, for both Ganglia and related software. The Ganglia monitor core contains both gmond and gmetad (although by default it doesn't install gmetad). Here is an example of using the monitor core download to install from source files. First, unpack the software.
[root@fanny src]# gunzip ganglia-monitor-core-2.5.6.tar.gz
[root@fanny src]# tar -xvf ganglia-monitor-core-2.5.6.tar
...
As always, once you have unpacked the software, be sure to read the README file.
Next, change to the installation directory and build the software.
[root@fanny src]# cd ganglia-monitor-core-2.5.6
[root@fanny ganglia-monitor-core-2.5.6]# ./configure \
> CFLAGS="-I/usr/local/rrdtool-1.0.48/include" \
> CPPFLAGS="-I/usr/local/rrdtool-1.0.48/include" \
> LDFLAGS="-L/usr/local/rrdtool-1.0.48/lib" --with-gmetad
...
[root@fanny ganglia-monitor-core-2.5.6]# make
...
[root@fanny ganglia-monitor-core-2.5.6]# make install
...
As you can see, this is a pretty standard install with a couple of small exceptions. First, you'll need to tell configure where to find the RRDtool include file and library by setting the various flags as shown above. Second, you'll need to explicitly tell configure to build gmetad. This is done with the --with-gmetad option.
Once you've built the software, you'll need to install and configure it. Both gmond and gmetad have very simple configuration files. The sample files gmond/gmond.conf and gmetad/gmetad.conf are included as part of the source tree. You should copy these to /etc and edit them before you start either program. The sample files are well documented and straightforward to edit. Most defaults are reasonable. Strictly speaking, the gmond.conf file is not necessary if you are happy with the defaults. However, you will probably want to update the cluster information at a minimum. The gmetad.conf file must be present and you'll need to identify at least one data source. You may also want to change the identity information in it.
For gmetad.conf, the data source entry is a list of the machines that will be monitored. The format is the identifier data_source followed by a unique string identifying the cluster. Next is an optional polling interval. Finally, there is a list of machines and optional port numbers. Here is a simple example:
data_source "my cluster" 10.0.32.144 10.0.32.145 10.0.32.146 10.0.32.147
The default sampling interval is 15 seconds and the default port is 8649.
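To make the format concrete, here is a small shell sketch that assembles a data_source line from a cluster name, an optional polling interval, and a list of hosts. The function name make_data_source is made up for this example, and the address:port spelling for a nondefault port is an assumption you should confirm against the comments in the sample gmetad.conf.

```shell
#!/bin/sh
# Sketch: build a gmetad.conf data_source line.
# make_data_source is a hypothetical helper, not part of Ganglia.
# Usage: make_data_source NAME INTERVAL HOST...
#   INTERVAL may be empty ("") to accept gmetad's default.
make_data_source() (
    name=$1
    interval=$2
    shift 2
    line="data_source \"$name\""
    # The polling interval is optional; include it only when given.
    if [ -n "$interval" ]; then
        line="$line $interval"
    fi
    # Remaining arguments are hosts, written as address or address:port.
    for host in "$@"; do
        line="$line $host"
    done
    printf '%s\n' "$line"
)

make_data_source "my cluster" "" 10.0.32.144 10.0.32.145
make_data_source "my cluster" 30 10.0.32.144:8649 10.0.32.145:8649
```

The first call reproduces the style of the example above; the second shows where the interval and port would go if you needed to override the defaults.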
Once you have the configuration files in place and edited to your satisfaction, copy the initialization files and start the programs. For gmond, it will look something like this:
[root@fanny ganglia-monitor-core-2.5.6]# cp ./gmond/gmond.init \
> /etc/rc.d/init.d/gmond
[root@fanny ganglia-monitor-core-2.5.6]# chkconfig --add gmond
[root@fanny ganglia-monitor-core-2.5.6]# /etc/rc.d/init.d/gmond start
Starting GANGLIA gmond: [ OK ]
As shown, you'll want to ensure that gmond is started whenever you reboot.
Before you start gmetad, you'll want to create a directory for the database.
[root@fanny ganglia-monitor-core-2.5.6]# mkdir -p /var/lib/ganglia/rrds
[root@fanny ganglia-monitor-core-2.5.6]# chown -R nobody \
> /var/lib/ganglia/rrds
Next, copy over the initialization file and start the program.
[root@fanny ganglia-monitor-core-2.5.6]# cp ./gmetad/gmetad.init \
> /etc/rc.d/init.d/gmetad
[root@fanny ganglia-monitor-core-2.5.6]# chkconfig --add gmetad
[root@fanny ganglia-monitor-core-2.5.6]# /etc/rc.d/init.d/gmetad start
Starting GANGLIA gmetad: [ OK ]
Both programs should now be running. You can verify this by trying to TELNET to their respective ports, 8649 for gmond and 8651 for gmetad. When you do this you should see a couple of messages followed by a fair amount of XML scroll by.
[root@fanny ganglia-monitor-core-2.5.6]# telnet localhost 8649
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
...
If you see output such as this, everything is up and running. (Since you are going to the localhost, this should work even if your firewall is blocking TELNET.)
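If you would rather not read the XML by eye, a short filter can summarize it. The sketch below counts the HOST elements in a gmond XML dump read from standard input. The function name count_hosts is made up for this example, and the sample XML is abbreviated and illustrative rather than real gmond output.

```shell
#!/bin/sh
# Sketch: count <HOST ...> elements in gmond XML read from stdin.
# count_hosts is a hypothetical helper, not part of Ganglia.
count_hosts() {
    # grep -o prints each match on its own line; wc -l counts them.
    grep -o '<HOST ' | wc -l | tr -d ' '
}

# Abbreviated, made-up sample standing in for real gmond output.
count_hosts <<'EOF'
<GANGLIA_XML VERSION="2.5.6" SOURCE="gmond">
<CLUSTER NAME="my cluster">
<HOST NAME="fanny" IP="10.0.32.144">
</HOST>
<HOST NAME="george" IP="10.0.32.145">
</HOST>
</CLUSTER>
</GANGLIA_XML>
EOF
```

Against a live daemon, and assuming a tool such as nc is installed, something like nc localhost 8649 | count_hosts should report how many nodes gmond currently knows about.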
10.2.1.4 Web frontend
The final step in setting up the monitoring station is to install the frontend software. This is just a matter of downloading the appropriate file and unpacking it. Keep in mind that you must install this so that it is reachable as part of your website. Examine the DocumentRoot in your Apache configuration file and install the package under this directory. For example,
[root@fanny root]# grep DocumentRoot /etc/httpd/conf/httpd.conf
...
DocumentRoot "/var/www/html"
...
Now that you know where the document root is, copy the web frontend to this directory and unpack it.
[root@fanny root]# cp ganglia-webfrontend-2.5.5.tar.gz /var/www/html/
[root@fanny root]# cd /var/www/html
[root@fanny html]# gunzip ganglia-webfrontend-2.5.5.tar.gz
[root@fanny html]# tar -xvf ganglia-webfrontend-2.5.5.tar
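The lookup can also be scripted. The following is a minimal sketch, assuming the DocumentRoot directive appears uncommented as the first word on its line; the function name get_docroot is hypothetical, and the example runs against a throwaway file rather than the live configuration.

```shell
#!/bin/sh
# Sketch: print the DocumentRoot value from an Apache configuration file.
# get_docroot is a hypothetical helper name.
get_docroot() {
    # Match lines whose first field is DocumentRoot (this skips comments),
    # strip the surrounding quotes, and stop at the first match.
    awk '$1 == "DocumentRoot" { gsub(/"/, "", $2); print $2; exit }' "$1"
}

# Example against a temporary file standing in for httpd.conf.
conf=$(mktemp)
printf '%s\n' '# DocumentRoot: where documents are served from' \
              'DocumentRoot "/var/www/html"' > "$conf"
get_docroot "$conf"
rm -f "$conf"
```

On the monitoring station, the result of get_docroot /etc/httpd/conf/httpd.conf could then be used as the target directory when unpacking the frontend.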
There is nothing to build in this case. The configuration file is conf.php. Among other things, you can use this to change the appearance of your web site by changing the display themes.
At this point, you should be able to examine the state of this machine. (You'll still need to install gmond on the individual nodes before you can look at the rest of the cluster.) Start your web browser and visit your site, e.g., http://localhost/ganglia-webfrontend-2.5.5/. You should see something like Figure 10-1.
Figure 10-1. Ganglia on a single node
This shows the host is up. Next, we need to install gmond on the individual nodes so we can see the rest of the cluster. You could use the same technique used above; just skip over the prerequisites and the gmetad steps. But it is much easier to use RPM. Just download the package to an appropriate location and install it. For example,
[root@george root]# rpm -vih ganglia-monitor-core-gmond-2.5.6-1.i386.rpm
Preparing...                ########################################### [100%]
   1:ganglia-monitor-core-gm########################################### [100%]
Starting GANGLIA gmond: [ OK ]
gmond is installed in /usr/sbin and its configuration file in /etc. Once you've installed gmond on a machine, it should appear on your web page when you click on refresh. Repeat the installation for your remaining nodes.
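Repeating the installation by hand gets tedious quickly. One option is a loop like the sketch below, which only prints the commands it would run so that you can inspect them first. The /tmp staging path and the use of scp and ssh as root are assumptions for illustration; the C3 tools described earlier in this chapter are another way to do the same thing.

```shell
#!/bin/sh
# Sketch: print the copy-and-install commands for each node (dry run).
# The /tmp staging directory and root logins are illustrative assumptions.
rpm_file=ganglia-monitor-core-gmond-2.5.6-1.i386.rpm

plan_install() {
    for node in "$@"; do
        echo "scp $rpm_file root@$node:/tmp/"
        echo "ssh root@$node rpm -vih /tmp/$rpm_file"
    done
}

plan_install george ida
```

Once the printed plan looks right, piping it to sh (plan_install george ida | sh) would carry it out.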
Once you have Ganglia running, you may want to revisit the configuration files. With Ganglia running, it will be easier to see exactly what effect a change to a configuration file has. Of course, if you change a configuration file, you'll need to restart the appropriate services before you will see anything different.
You should have no difficulty figuring out how to use Ganglia. There are lots of "hot spots" on the pages, so just click and see what you get. The first page will tell you how many machines are up and down and their loads. You can select a physical view or collect information on individual machines. Figure 10-2 shows information for an individual machine. You can also change the metric displayed. However, not all metrics are supported. The Ganglia documentation supplies a list of supported metrics by architecture.
Figure 10-2. Ganglia Node View
As you can see, these screen captures were made when the cluster was not otherwise in use. Otherwise the highlighted load figures would reflect that activity.
10.3 Notes for OSCAR and Rocks Users
C3 is a core OSCAR package that is installed in /opt/c3-4 and can be used as shown in this chapter. Both Ganglia and Clumon (which uses Performance Co-Pilot) may be available as additional packages for OSCAR. As add-ons, these may not always be available immediately when new versions of OSCAR are released. For example, there was a delay with both when OSCAR 3.0 was released. When installing Ganglia using the package add option with OSCAR, you may want to tweak the configuration files, etc.
Although not as versatile as the C3 command set, Rocks supplies the command cluster-fork for executing commands across a cluster.
For OSCAR, the web-accessible reports for Clumon and Ganglia are installed in /var/www/html/clumon and /var/www/html/ganglia, respectively. Thus, to access the Ganglia web report on amy.wofford.int, the URL is http://amy.wofford.int/ganglia/. The page format used by OSCAR is a little different, but you would use Ganglia in much the same way.
Ganglia is fully integrated into Rocks and is available as a link from the administrative home for the frontend.
Chapter 11. Scheduling Software
Basically, scheduling software lets you run your cluster like a batch system, allowing you to allocate cluster resources, such as CPU time and memory, on a job-by-job basis. Jobs are queued and run as resources become available, subject to the priorities you establish. Your users will be able to add and remove jobs from the job queue as well as track the progress of their jobs. As the administrator, you will be able to establish priorities and manage the queue.
Scheduling software is not a high priority for everyone. If the cluster is under the control of a single user, then scheduling software probably isn't needed. Similarly, if you have a small cluster with very few users or if your cluster is very lightly used, you may not need scheduling software. As long as you have more resources than you need, manual scheduling may be a viable alternative, at least initially. If you have a small cluster and only occasionally wish you had scheduling software, it may be easier to add a few more computers or build a second cluster than deal with the problems that scheduling software introduces.
But if you have a large cluster with a growing user base, at some point you'll want to install scheduling software. At a minimum, scheduling software helps you effectively use your hardware and provides a more equitable sharing of resources. Scheduling software has other uses as well, including accounting and monitoring. The information provided by good scheduling software can be a huge help when planning for the future of your cluster.
There are several freely available scheduling systems from which you can select, including Portable Batch System (PBS), Maui, Torque, and Condor. OSCAR includes Portable Batch System (PBS) along with Maui. Torque is also available for OSCAR via opd. Rocks provides a PBS roll that includes Maui and Torque and a second roll that includes Condor.
Since PBS is available for both OSCAR and Rocks, that's what's described in this chapter. (For more information on the alternatives, visit the web sites listed in Appendix A.) PBS is a powerful and versatile system. While this chapter sticks to the basics, you should keep in mind that there is a lot more to PBS than described here. Look at the Administrator Guide to learn more, particularly if you need help with more advanced features.
11.1 OpenPBS
Before the emergence of clusters, the Unix-based Network Queuing System (NQS) from NASA Ames Research Center was a commonly used batch-queuing system. With the emergence of parallel distributed systems, NQS began to show its limitations. Consequently, Ames led an effort to develop requirements and specifications for a newer, cluster-compatible system. These requirements and specifications later became the basis for the IEEE 1003.2d POSIX standard. With NASA funding, PBS, a system conforming to those standards, was developed by Veridian in the early 1990s.
PBS is available in two forms: OpenPBS and PBSPro. OpenPBS is the unsupported original open source version of PBS, while PBSPro is a newer commercial product. In 2003, PBSPro was acquired by Altair Engineering and is now marketed by Altair Grid Technologies, a subsidiary of Altair Engineering. The web site for OpenPBS is http://www.openpbs.org; the web site for PBSPro is http://www.pbspro.com. Although much of the following will also apply to PBSPro, the remainder of this chapter describes OpenPBS, which is often referred to simply as PBS. However, if you have the resources to purchase software, it is well worth looking into PBSPro. Academic grants have been available in the past, so if you are eligible, this is worth looking into as well.
As an unsupported product, OpenPBS has its problems. Of the software described in this book, it was, for me, the most difficult to install. In my opinion, it is easier to install OSCAR, which has OpenPBS as a component, or Rocks along with the PBS roll than it is to install just OpenPBS. With this warning in mind, we'll look at a typical installation later in this chapter.
11.1.1 Architecture
Before we install PBS, it is helpful to describe its architecture. PBS uses a client-server model and is organized as a set of user-level commands that interact with three system-level daemons. Jobs are submitted using the user-level commands and managed by the daemons. PBS also includes an API.
The pbs_server daemon, the job server, runs on the server system and is the heart of the PBS system. It provides basic batch services such as receiving and creating batch jobs, modifying the jobs, protecting jobs against crashes, and running the batch jobs. User commands and the other daemons communicate with the pbs_server over the network using TCP. The user commands need not be installed on the server.
The job server manages one or more queues. (Despite the name, queues are not restricted to first-in, first-out scheduling.) A scheduled job waiting to be run or a job that is actually running is said to be a member of its queue. The job server supports two types of queues, execution and routing. A job in an execution queue is waiting to execute while a job in a routing queue is waiting to be routed to a new destination for execution.
The pbs_mom daemon executes the individual batch jobs. This job executor daemon is often called the MOM because it is the "mother" of all executing jobs. It must run on every system within the cluster. It creates an execution environment that is as nearly identical to the user's session as possible. MOM is also responsible for returning the job's output to the user.
The final daemon, pbs_sched, implements the cluster's job-scheduling policy. As such, it communicates with the pbs_server and pbs_mom daemons to match available jobs with available resources. By default, a first-in, first-out scheduling policy is used, but you are free to set your own policies. The scheduler is highly extensible.
PBS provides both a GUI and 1003.2d-compliant command-line utilities. These commands fall into three categories: management, operator, and user commands. Management and operator commands are usually restricted commands. The commands are used to submit, modify, delete, and monitor batch jobs.
11.1.2 Installing OpenPBS
While detailed installation directions can be found in the PBS Administrator Guide, there are enough "gotchas" that it is worth going over the process in some detail. Before you begin, be sure you look over the Administrator Guide as well. Between the guide and this chapter, you should be able to overcome most obstacles.
Before starting with the installation proper, there are a couple of things you need to check. As noted, PBS provides both command-line utilities and a graphical interface. The graphical interface requires Tcl/Tk 8.0 or later, so if you want to use it, make sure Tcl/Tk is installed. You'll want to install Tcl/Tk before you install PBS. For a Red Hat installation, you can install Tcl/Tk from the packages supplied with the operating system. For more information on Tcl/Tk, visit the web site http://www.scriptics.com/. In order to build the GUI, you'll also need the X11 development packages, which Red Hat users can install from the supplied RPMs.
The first step in the installation proper is to download the software. Go to the OpenPBS web site (http://www-unix.mcs.anl.gov/openpbs/) and follow the links to the download page. The first time through, you will be redirected to a registration page. With registration, you will receive by email an account name and password that you can use to access the actual download page. Since you have to wait for approval before you receive the account information, you'll want to plan ahead and register a couple of days before you plan to download and install the software. Making your way through the registration process is a little annoying because it keeps pushing the commercial product, but it is straightforward and won't take more than a few minutes.
Once you reach the download page, you'll have the choice of downloading a pair of RPMs or the patched source code. The first RPM contains the full PBS distribution and is used to set up the server, and the second contains just the software needed by the client and is used to set up compute nodes within a cluster. While RPMs might seem the easiest way to go, the available RPMs are based on an older version of Tcl/Tk (Version 8.0). So unless you want to backpedal (i.e., track down and install these older packages, a nontrivial task), installing the source is preferable. That's what's described here.
Download the source and move it to your directory of choice. With a typical installation, you'll end up with three directory trees: the source tree, the installation tree, and the working directory tree. In this example, I'm setting up the source tree in the directory /usr/local/src. Once you have the source package where you want it, unpack the code.
[root@fanny src]# gunzip OpenPBS_2_3_16.tar.gz
[root@fanny src]# tar -vxpf OpenPBS_2_3_16.tar
When untarring the package, use the -p option to preserve permission bits.
Since the OpenPBS code is no longer supported, it is somewhat brittle. Before you can compile the code, you will need to apply some patches. What you install will depend on your configuration, so plan to spend some time on the Internet: the OpenPBS URL given above is a good place to start. For Red Hat Linux 9.0, start by downloading the scaling patch from http://www-unix.mcs.anl.gov/openpbs/ and the errno and gcc patches from http://bellatrix.pcl.ox.ac.uk/~ben/pbs/. (Working out the details of what you need is the annoying side of installing OpenPBS.) Once you have the patches you want, install them.
[root@fanny src]# cp openpbs-gcc32.patch /usr/local/src/OpenPBS_2_3_16/
[root@fanny src]# cp openpbs-errno.patch /usr/local/src/OpenPBS_2_3_16/
[root@fanny src]# cp ncsa_scaling.patch /usr/local/src/OpenPBS_2_3_16/
[root@fanny src]# cd /usr/local/src/OpenPBS_2_3_16/
[root@fanny OpenPBS_2_3_16]# patch -p1 -b < openpbs-gcc32.patch
patching file buildutils/exclude_script
[root@fanny OpenPBS_2_3_16]# patch -p1 -b < openpbs-errno.patch
patching file src/lib/Liblog/pbs_log.c
patching file src/scheduler.basl/af_resmom.c
[root@fanny OpenPBS_2_3_16]# patch -p1 -b < ncsa_scaling.patch
patching file src/include/acct.h
patching file src/include/cmds.h
patching file src/include/pbs_ifl.h
patching file src/include/qmgr.h
patching file src/include/server_limits.h
The scaling patch changes built-in limits that prevent OpenPBS from working with larger clusters. The other patches correct problems resulting from recent changes to the gcc compiler.[1]
As noted, you'll want to keep the installation directory separate from the source tree, so create a new directory for PBS. /usr/local/OpenPBS is a likely choice. Change to this directory and run configure, make, make install, and make clean from it.
[root@fanny src]# mkdir /usr/local/OpenPBS
[root@fanny src]# cd /usr/local/OpenPBS
[root@fanny OpenPBS]# /usr/local/src/OpenPBS_2_3_16/configure \
> --set-default-server=fanny --enable-docs --with-scp
...
[root@fanny OpenPBS]# make
...
[root@fanny OpenPBS]# make install
...
[root@fanny OpenPBS]# make clean
...
In this example, the configuration options set fanny as the server, create the documentation, and use scp (the SSH secure copy program) when moving files between remote hosts. Normally, you'll create the documentation only on the server. The Administrator Guide contains several pages of additional options.
By default, the procedure builds all the software. For the compute nodes, this really isn't necessary since all you need is pbs_mom on these machines. Thus, there are several alternatives that you might want to consider when setting up the clients. You could just go ahead and build everything like you did for the server, or you could use different build options to restrict what is built. For example, the option --disable-server prevents the pbs_server daemon from being built. Or you could build and then install just pbs_mom and the files it needs. To do this, change to the MOM subdirectory, in this example /usr/local/OpenPBS/src/resmom, and run make install to install just MOM.
[root@ida OpenPBS]# cd /usr/local/OpenPBS/src/resmom
[root@ida resmom]# make install
...
Yet another possibility is to use NFS to mount the appropriate directories on the client machines. The Administrator Guide outlines these alternatives but doesn't provide many details.
Whatever your approach, you'll need pbs_mom on every compute node. The make install step will create the /usr/spool/PBS working directory, and will install the user commands in /usr/local/bin and the daemons and administrative commands in /usr/local/sbin. make clean removes unneeded files.
11.1.3 Configuring PBS
Before you can use PBS, you'll need to create or edit the appropriate configuration files, located in the working directory, e.g., /usr/spool/PBS, or its subdirectories. First, the server needs the node file, a file listing the machines it will communicate with. This file provides the list of nodes used at startup. (This list can be altered dynamically with the qmgr command.) In the subdirectory server_priv, create the file nodes with the editor of your choice. The nodes file should have one entry per line with the names of the machines in your cluster. (This file can contain additional information, but this is enough to get you started.) If this file does not exist, the server will know only about itself.
MOM will need the configuration file config, located in the subdirectory mom_priv. At a minimum, you need an entry to start logging and an entry to identify the server to MOM. For example, your file might look something like this:
$logevent 0x1ff
$clienthost fanny
The argument to $logevent is a mask that determines what is logged. A value of 0x0ff will log all events excluding debug messages, while a value of 0x1ff will log all events including debug messages. You'll need this file on every machine. There are a number of other options, such as creating an access list.
Finally, you'll want to create a default_server file in the working directory with the fully qualified domain name of the machine running the server daemon.
PBS uses ports 15001-15004 by default, so it is essential that your firewall doesn't block these ports. These can be changed by editing the /etc/services file.
A full list of services and ports can be found in the Administrator Guide (along with other configuration options). If you decide to change ports, it is essential that you do this consistently across your cluster!
Once you have the configuration files in place, the next step is to start the appropriate daemons, which must be started as root. The first time through, you'll want to start these manually. Once you are convinced that everything is working the way you want, configure the daemons to start automatically when the systems boot by adding them to the appropriate startup file, such as /etc/rc.d/rc.local. All three daemons must be started on the server, but pbs_mom is the only daemon needed on the compute nodes.
It is best to start pbs_mom before you start pbs_server so that it can respond to the server's polling. Typically, no options are needed for pbs_mom. The first time (and only the first time) you run pbs_server, start it with the option -t create.
[root@fanny OpenPBS]# pbs_server -t create
This option is used to create a new server database. Unlike pbs_mom and pbs_sched, pbs_server can be configured dynamically after it has been started.
The options to pbs_sched will depend on your site's scheduling policies. For the default FIFO scheduler, no options are required. For a more detailed discussion of command-line options, see the manpages for each daemon.
11.1.4 Managing PBS
We'll begin with the command-line utilities since the GUI may not always be available. Once you have mastered these commands, using the GUI should be straightforward. From a manager's perspective, the first command you'll want to become familiar with is qmgr, the queue management command. qmgr is used to create job queues and manage their properties. It is also used to manage nodes and servers, providing an interface to the batch system. In this section we'll look at a few basic examples rather than try to be exhaustive.
First, identify the pbs_server managers, i.e., the users who are allowed to reconfigure the batch system. This is generally a one-time task. (Keep in mind that not all commands require administrative privileges. Subcommands such as list and print can be executed by all users.) Run the qmgr command as follows, substituting your username:
[root@fanny OpenPBS]# qmgr
Max open servers: 4
Qmgr: set server [email protected]
Qmgr: quit
You can specify multiple managers by adding their names to the end of the command, separated by commas. Once done, you'll no longer need root privileges to manage PBS.
Your next task will be to create a queue. Let's look at an example.
[sloanjd@fanny PBS]$ qmgr
Max open servers: 4
Qmgr: create queue workqueue
Qmgr: set queue workqueue queue_type = execution
Qmgr: set queue workqueue resources_max.cput = 24:00:00
Qmgr: set queue workqueue resources_min.cput = 00:00:01
Qmgr: set queue workqueue enabled = true
Qmgr: set queue workqueue started = true
Qmgr: set server scheduling = true
Qmgr: set server default_queue = workqueue
Qmgr: quit
In this example we have created a new queue named workqueue. We have limited CPU time to between 1 second and 24 hours. The queue has been enabled, started, and set as the default queue for the server, which must have at least one queue defined. All queues must have a type, be enabled, and be started.
As you can see from the example, the general form of a qmgr command line is a command (active, create, delete, set, unset, list, or print) followed by a target (server, queue, or node) followed by an attribute assignment. These keywords can be abbreviated as long as there is no ambiguity. In the first example in this section, we set a server attribute. In the second example, the target was the queue that we were creating for most of the commands.
To examine the configuration of the server, use the command
Qmgr: print server
This can be used to save the configuration you are using.
Use the command
[root@fanny PBS]# qmgr -c "print server" > server.config
Note that with the -c flag, qmgr commands can be entered on a single line. To re-create the queue at a later time, use the command
[root@fanny PBS]# qmgr < server.config
This can save a lot of typing or can be automated if needed. Other actions are described in the documentation.
Another useful command is pbsnodes, which lists the status of the nodes on your cluster.
[sloanjd@amy sloanjd]$ pbsnodes -a
oscarnode1.oscardomain
     state = free
     np = 1
     properties = all
     ntype = cluster

oscarnode2.oscardomain
     state = free
     np = 1
     properties = all
     ntype = cluster
...
On a large cluster, that can create a lot of output.
11.1.5 Using PBS
From the user's perspective, the place to start is the qsub command, which submits jobs. The only jobs that qsub accepts are scripts, so you'll need to package your tasks appropriately. Here is a simple example script:
#!/bin/sh
#PBS -N demo
#PBS -o demo.txt
#PBS -e demo.txt
#PBS -q workq
#PBS -l mem=100mb

mpiexec -machinefile /etc/myhosts -np 4 /home/sloanjd/area/area
The first line specifies the shell to use in interpreting the script, while the next few lines starting with #PBS are directives that are passed to PBS. The first names the job, the next two specify where output and error output go, the next to last identifies the queue that is used, and the last lists a resource that will be needed, in this case 100 MB of memory. The blank line signals the end of PBS directives. Lines that follow the blank line indicate the actual job.
Once you have created the batch script for your job, the qsub command is used to submit the job.
[sloanjd@amy area]$ qsub pbsdemo.sh
11.amy
When run, qsub returns the job identifier as shown. A number of different options are available, both as command-line arguments to qsub or as directives that can be included in the script. See the qsub (1B) manpage for more details.
There are several things you should be aware of when using qsub.
First, as noted, it expects a script. Next, the target script cannot take any command-line arguments. Finally, the job is launched on one node. The script must ensure that any parallel processes are then launched on other nodes as needed.
In addition to qsub, there are a number of other useful commands available to the general user. The commands qstat and qdel can be used to manage jobs. In this example, qstat is used to determine what is on the queue:
[sloanjd@amy area]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
11.amy           pbsdemo          sloanjd                 0 Q workq
12.amy           pbsdemo          sloanjd                 0 Q workq
qdel is used to delete jobs as shown.
[sloanjd@amy area]$ qdel 11.amy
[sloanjd@amy area]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
12.amy           pbsdemo          sloanjd                 0 Q workq
qstat can be called with the job identifier to get more information about a particular job or with the -s option to get more details. A few of the more useful user commands include the following:
qalter
This is used to modify the attributes of an existing job.
qhold
This is used to place a hold on a job.
qmove
This is used to move a job from one queue to another.
qorder
This is used to change the order of two jobs.
qrun
This is used to force a server to start a job.
If you start with the qsub (1B) manpage, other available commands are listed in the "See Also" section.
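Because qsub accepts only scripts and the target script cannot take command-line arguments, one workaround is to generate the script on the fly. The sketch below writes a job script modeled on the earlier demo to standard output, with the job name, queue, process count, and program filled in. The function name make_job and its parameter order are hypothetical; the directives and the mpiexec line simply mirror the example shown earlier.

```shell
#!/bin/sh
# Sketch: emit a PBS job script with values substituted in.
# make_job is a hypothetical helper; adjust directives to your site.
# Usage: make_job JOBNAME QUEUE NPROCS PROGRAM
make_job() {
    cat <<EOF
#!/bin/sh
#PBS -N $1
#PBS -o $1.txt
#PBS -e $1.txt
#PBS -q $2
#PBS -l mem=100mb

mpiexec -machinefile /etc/myhosts -np $3 $4
EOF
}

make_job demo workq 4 /home/sloanjd/area/area
```

Redirecting the output to a file, for example make_job demo workq 4 /home/sloanjd/area/area > pbsdemo.sh, gives you a script that can be handed to qsub in the usual way.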
Figure 11-1. xpbs -admin
Figure 11-2. xpbsmon
11.1.6 PBS's GUI
PBS provides two GUIs for queue management. The command xpbs will start a general interface. If you need to do administrative tasks, you should include the argument -admin. Figure 11-1 shows the xpbs GUI with the -admin option. Without this option, the general appearance is the same, but a number of buttons are missing. You can terminate a server; start, stop, enable, or disable a queue; or run or rerun a job. To monitor nodes in your cluster, you can use the xpbsmon command, shown for a few machines in Figure 11-2.
11.1.7 Maui Scheduler
If you need to go beyond the schedulers supplied with PBS, you should consider installing Maui. In a sense, Maui picks up where PBS leaves off. It is an external scheduler; that is, it does not include a resource manager. Rather, it can be used in conjunction with a resource manager such as PBS to extend the resource manager's capabilities. In addition to PBS, Maui works with a number of other resource managers.
Maui controls how, when, and where jobs will be run and can be described as a policy engine. When used correctly, it can provide extremely high system utilization and should be considered for any large or heavily utilized cluster that needs to optimize throughput. Maui provides a number of very advanced scheduling options. Administration is through the master configuration file maui.cfg and through either a text-based or a web-based interface.
Maui is installed by default as part of OSCAR and Rocks. For the most recent version of Maui or for further documentation, you should visit the Maui web site, http://www.supercluster.org.
11.2 Notes for OSCAR and Rocks Users
As previously noted, both OpenPBS and Maui are installed as part of the OSCAR setup. The installation directory for OpenPBS is /opt/pbs. You'll find the various commands in subdirectories under this directory. The working directory for OpenPBS is /var/spool/pbs, where you'll find the configuration and log files. The default queue, as you may have noticed from previous examples, is workq. Under OSCAR, Maui is installed in the directory /opt/maui. By default, the OpenPBS FIFO scheduler is disabled.
OpenPBS and Maui are available for Rocks as a separate roll. If you need OpenPBS, be sure you include the roll when you build your cluster as it is not currently possible to add the roll once the cluster has been installed. Once installed, the system is ready to use. The default queue is default.
Rocks also provides a web-based interface for viewing the job queue that is available from the frontend's home page. Using the web interface, you can view both the job queue and the physical job assignments. PBS configuration files are located in /opt/torque. Manpages are in /opt/torque/man. Maui is installed under /opt/maui.
Chapter 12. Parallel Filesystems
If you are certain that your cluster will only be used for computationally intensive tasks that involve very little interaction with the filesystem, you can safely skip this chapter. But increasingly, tasks that are computationally expensive also involve a large amount of I/O, frequently accessing either large data sets or large databases. If this is true for at least some of your cluster's applications, you need to ensure that the I/O subsystem you are using can keep up. For these applications to perform well, you will need a high-performance filesystem.
Selecting a filesystem for a cluster is a balancing act. There are a number of different characteristics that can be used to compare filesystems, including robustness, failure recovery, journaling, enhanced security, and reduced latency. With clusters, however, it often comes down to a trade-off between convenience and performance. From the perspective of convenience, the filesystem should be transparent to users, with files readily available across the cluster. From the perspective of performance, data should be available to the processor that needs it as quickly as possible. Getting the most from a high-performance filesystem often means programming with the filesystem in mind, typically a very "inconvenient" task. The good news is that you are not limited to a single filesystem.
The Network File System (NFS) was introduced in Chapter 4. NFS is strong on convenience. With NFS, you will recall, files reside in a directory on a single disk drive that is shared across the network. The centralized availability provided by NFS makes it an important part of any cluster. For example, it provides a transparent mechanism to ensure that binaries of freshly compiled parallel programs are available on all the machines in the cluster. Unfortunately, NFS is not very efficient. In particular, it has not been optimized for the types of I/O often needed with many high-performance cluster applications.
High-performance filesystems for clusters are designed using different criteria, primarily to optimize performance when accessing large data sets from parallel applications. With parallel filesystems, files may be distributed across a cluster with different pieces of the file on different machines allowing parallel access.
A parallel filesystem might not provide optimal performance for serial programs or single tasks. Because high-performance filesystems are designed for a different purpose, they should not be thought of as replacements for NFS. Rather, they complement the functionality provided by NFS. Many clusters benefit from both NFS and a high-performance filesystem.
There's more good news. If you need a high-performance filesystem, there are a number of alternatives. If you have very deep pockets, you can go for hardware-based solutions. With network attached storage (NAS), a dedicated server is set up to service file requests for the network. In a sense, NAS owns the filesystem. Since serving files is NAS's only role, NAS servers tend to be highly optimized file servers. But because these are still traditional servers, latency can still be a problem.
The next step up is a storage area network (SAN). Typically, a SAN provides direct block-level access to the physical hardware. A SAN typically includes high-performance networking as well. Traditionally, SANs use fibre channel (FC) technology. More recently, IP-based storage technologies that operate at the block level have begun to emerge. This allows the creation of a SAN using more familiar IP-based technologies.
Because of the high cost of hardware-based solutions, they are outside the scope of this book. Fortunately, there are also a number of software-based filesystems for clusters, each with its own set of features and limitations. Not all of the following would be considered high-performance filesystems, but one of them may suit you, depending upon your needs. However, you should be very careful before adopting any of these. Like most software, these should be regarded as works in progress. While they may be ideal for some uses, they may be problematic for others. Caveat emptor! These packages are generally available both as source tarballs and as RPMs.
ClusterNFS
This is a set of patches for the NFS server daemon. The clients run standard NFS software. The patches allow multiple diskless clients to mount the same root filesystem by "reinterpreting" file names. ClusterNFS is often used with Mosix. If you are building a diskless cluster, this is a package you might want to consider (http://clusternfs.sourceforge.net/).
Coda
Coda is a distributed filesystem developed at Carnegie Mellon University. It is derived from the Andrew File System. Coda has many interesting features such as performance enhancement through client side persistent caching, bandwidth adaptation, and robust behavior with partial network failures. It is a well documented, ongoing project. While it may be too early to use Coda with large, critical systems, this is definitely a distributed filesystem worth watching (http://www.coda.cs.cmu.edu/index.html).
InterMezzo
This distributed filesystem from CMU was inspired by Coda. InterMezzo is designed for use with high-availability clusters. Among other features, it offers automatic recovery from network outages (http://www.inter-mezzo.org/).
Lustre
Lustre is a cluster filesystem designed to work with very large clusters of up to 10,000 nodes. It was developed and is maintained by Cluster File Systems, Inc. and is available under the GPL. Since Lustre patches the kernel, you'll need to be running a 2.4.X kernel (http://www.lustre.org/).
OpenAFS
The Andrew File System was originally created at CMU and is now developed and supported by IBM. OpenAFS is a source fork released by IBM. It provides a scalable client-server architecture with transparent data migration. Consider OpenAFS a potential replacement for NFS (http://www.openafs.org/).
Parallel Virtual File System (PVFS)
PVFS is a high-performance, parallel filesystem. The remainder of this chapter describes PVFS in detail (http://www.parl.clemson.edu/pvfs/).
This is only a partial listing of what is available. If you are looking to implement a SAN, you might consider Open Global File System (OpenGFS) (http://opengfs.sourceforge.net/). Red Hat markets a commercial, enterprise version of OpenGFS. If you are using IBM hardware, you might want to look into General Parallel File System (GPFS) (http://www-1.ibm.com/servers/eserver/clusters/software/gpfs.html). In this chapter we will look more closely at PVFS, an open source, high-performance filesystem available for both Rocks and OSCAR.
12.1 PVFS
PVFS is a freely available, software-based solution jointly developed by Argonne National Laboratory and Clemson University. PVFS is designed to distribute data among the disks throughout the cluster and will work with both serial and parallel programs. In programming, it works with traditional Unix file I/O semantics, with the MPI-2 ROMIO semantics, or with the native PVFS semantics. It provides a consistent namespace and transparent access using existing utilities along with a mechanism for programming application-specific access. Although PVFS is developed on x86-based Linux platforms, it runs on some other platforms. It is available for both OSCAR and Rocks. PVFS2, a second generation PVFS, is in the works.
On the downside, PVFS does not provide redundancy, does not support symbolic or hard links, and it does not provide a fsck-like utility.
Figure 12-1 shows the overall architecture for a cluster using PVFS. Machines in a cluster using PVFS fall into three possibly overlapping categories based on functionality. Each PVFS has one metadata server. This is a filesystem management node that maintains or tracks information about the filesystem such as file ownership, access privileges, and locations, i.e., the filesystem's metadata.
Figure 12-1. Internal cluster architecture
Because PVFS distributes files across the cluster nodes, the actual files are located on the disks on I/O servers. I/O servers store the data using the existing hardware and filesystem on that node. By spreading or striping a file across multiple nodes, applications have multiple paths to data. A compute node may access a portion of the file on one machine while another node accesses a different portion of the file located on a different I/O server. This eliminates the bottleneck inherent in a single file server approach such as NFS.
The remaining nodes are the client nodes. These are the actual compute nodes within the clusters, i.e., where the parallel jobs execute. With PVFS, client nodes and I/O servers can overlap. For a small cluster, it may make sense for all nodes to be both client and I/O nodes. Similarly, the metadata server can also be an I/O server or client node, or both. Once you start writing data to these machines, it is difficult to change the configuration of your system. So give some thought to what you need.
12.1.1 Installing PVFS on the Head Node
Installing and configuring PVFS is more complicated than most of the other software described in this book for a couple of reasons. First, you will need to decide how to partition your cluster. That is, you must decide which machine will be the metadata server, which machines will be clients, and which machines will be I/O servers. For each type of machine, there is different software to install and a different configuration. If a machine is going to be both a client and an I/O server, it must be configured for each role. Second, in order to limit the overhead of accessing the filesystem through the kernel, a kernel module is used. This may entail further tasks such as making sure the appropriate kernel header files are available or patching the code to account for differences among Linux kernels.
This chapter describes a simple configuration where fanny is the metadata server, a client, and an I/O server, and all the remaining nodes are both clients and I/O servers. As such, it should provide a fairly complete idea about how PVFS is set up. If you are configuring your cluster differently, you won't need to do as much. For example, if some of your nodes are only I/O nodes, you can skip the client configuration steps on those machines.
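When planning which pieces each machine needs, it can help to write the role assignments down as data. The following is only a sketch: the component lists are collected from the setup steps in this chapter, while the dictionary names and the components_for helper are hypothetical names chosen for illustration.

```python
# Files and directories each PVFS role needs, collected from the setup
# steps described in this chapter. The names ROLE_COMPONENTS, CLUSTER,
# and components_for are illustrative, not part of PVFS.
ROLE_COMPONENTS = {
    "metadata": ["/usr/local/sbin/mgr", "/pvfs-meta/.pvfsdir", "/pvfs-meta/.iodtab"],
    "io": ["/usr/local/sbin/iod", "/etc/iod.conf", "/pvfs-data"],
    "client": ["/usr/local/sbin/pvfsd", "/sbin/mount.pvfs", "/dev/pvfsd",
               "/etc/pvfstab", "/mnt/pvfs", "pvfs.o"],
}

# The example cluster: fanny plays all three roles, the other nodes two.
CLUSTER = {
    "fanny": {"metadata", "io", "client"},
    "george": {"io", "client"},
    "hector": {"io", "client"},
    "ida": {"io", "client"},
    "james": {"io", "client"},
}


def components_for(host, cluster=CLUSTER):
    """Return the sorted components a host needs for all of its roles."""
    needed = set()
    for role in cluster[host]:
        needed.update(ROLE_COMPONENTS[role])
    return sorted(needed)
```

A node that is only an I/O server would list just the "io" role, and components_for would then omit everything client related.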
In this example, the files are downloaded, compiled, and installed on fanny since fanny plays all three roles. Once the software is installed on fanny, the appropriate pieces are pushed to the remaining machines in the cluster.
The first step, then, is to download the appropriate software. To download PVFS, first go to the PVFS home page (http://www.parl.clemson.edu/pvfs/) and follow the link to files. This site has links to several download sites. (You'll want to download the documentation from this site before moving on to the software download sites.) There are two tar archives to download: the sources for PVFS and for the kernel module.
You should also look around for any patches you might need. For example, at the time this was written, because of customizations to the kernel, the current version of PVFS would not compile correctly under Red Hat 9.0. Fortunately, a patch from http://www.mcs.anl.gov/~robl/pvfs/redhat-ntpl-fix.patch.gz was available.[1] Other patches may also be available.
[1] Despite the URL, this was an uncompressed text file at the time this was written.
Once you have the files, copy the files to an appropriate directory and unpack them.
[root@fanny src]# gunzip pvfs-1.6.2.tgz
[root@fanny src]# gunzip pvfs-kernel-1.6.2-linux-2.4.tgz
[root@fanny src]# tar -xvf pvfs-1.6.2.tar
...
[root@fanny src]# tar -xvf pvfs-kernel-1.6.2-linux-2.4.tar
...
It is simpler if you install these under the same directory. In this example, the directory /usr/local/src is used. In the documentation that comes with PVFS, a link was created to the first directory.
[root@fanny src]# ln -s pvfs-1.6.2 pvfs
This will save a little typing but isn't essential.
Next, apply any patches you may need. As noted, with this version the kernel module sources need to be patched.
[root@fanny src]# mv redhat-ntpl-fix.patch pvfs-kernel-1.6.2-linux-2.4/
[root@fanny src]# cd pvfs-kernel-1.6.2-linux-2.4
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# patch -p1 -b < \
> redhat-ntpl-fix.patch
patching file config.h.in
patching file configure
patching file configure.in
patching file kpvfsd.c
patching file kpvfsdev.c
patching file pvfsdev.c
patching file pvfsdev.c
Apply any other patches that might be needed.
The next steps are compiling PVFS and the PVFS kernel module. Here are the steps for compiling PVFS:
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cd /usr/local/src/pvfs
[root@fanny pvfs]# ./configure
...
[root@fanny pvfs]# make
...
[root@fanny pvfs]# make install
...
There is nothing new here.
Next, repeat the process with the kernel module.
[root@fanny src]# cd /usr/local/src/pvfs-kernel-1.6.2-linux-2.4
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# ./configure
...
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# make
...
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# make install
install -c -d /usr/local/sbin
install -c mount.pvfs /usr/local/sbin
install -c pvfsd /usr/local/sbin
NOTE: pvfs.o must be installed by hand!
NOTE: install mount.pvfs by hand to /sbin if you want 'mount -t pvfs' to work
This should go very quickly.
As you see from the output, the installation for the kernel requires some additional manual steps. Specifically, you need to decide where you want to put the kernel module. The following works for Red Hat 9.0.
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# mkdir \
> /lib/modules/2.4.20-6/kernel/fs/pvfs
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cp pvfs.o \
> /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o
If you are doing something different, you may need to poke around a bit to find the right location.
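If you script this step, the kernel release can be looked up rather than typed in. The sketch below assumes the same directory layout used in the Red Hat 9.0 example above; other distributions may place filesystem modules elsewhere, and pvfs_module_path is a hypothetical helper name.

```python
import platform
from pathlib import PurePosixPath


def pvfs_module_path(release=None):
    """Build the pvfs.o location used in this chapter for a kernel release.

    Assumes the /lib/modules/<release>/kernel/fs/pvfs layout shown above.
    """
    if release is None:
        # platform.release() reports the same string as `uname -r`.
        release = platform.release()
    return PurePosixPath("/lib/modules") / release / "kernel" / "fs" / "pvfs" / "pvfs.o"


print(pvfs_module_path("2.4.20-6"))
```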
12.1.2 Configuring the Metadata Server
If you have been following along, at this point you should have all the software installed on the head node, i.e., the node that will function as the metadata server for the filesystem. The next step is to finish configuring the metadata server. Once this is done, the I/O server and client software can be installed and configured.
Configuring the meta-server is straightforward. First, create a directory to store filesystem data.
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# mkdir /pvfs-meta
Keep in mind, this directory is used to store information about the PVFS filesystem. The actual data is not stored in this directory. Once PVFS is running, you can ignore this directory.
Next, create the two metadata configuration files and place them in this directory. Fortunately, PVFS provides a script to simplify the process.
[root@fanny pvfs-kernel-1.6.2-linux-2.4]# cd /pvfs-meta
[root@fanny pvfs-meta]# /usr/local/bin/mkmgrconf
This script will make the .iodtab and .pvfsdir files
in the metadata directory of a PVFS file system.
Enter the root directory (metadata directory):
/pvfs-meta/
Enter the user id of directory:
root
Enter the group id of directory:
root
Enter the mode of the root directory:
777
Enter the hostname that will run the manager:
fanny
Searching for host...success
Enter the port number on the host for manager:
(Port number 3000 is the default)
3000
Enter the I/O nodes: (can use form node1, node2, ... or
nodename{#-#,#,#})
fanny george hector ida james
Searching for hosts...success
I/O nodes: fanny george hector ida james
Enter the port number for the iods:
(Port number 7000 is the default)
7000
Done!
Running this script creates the two configuration files .pvfsdir and .iodtab. The file .pvfsdir contains permission information for the metadata directory. Here is the file the mkmgrconf script creates when run as shown.
84230
0
0
0040777
3000
fanny
/pvfs-meta/
/
The first entry is the inode number of the configuration file. The remaining entries correspond to the questions answered earlier.
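To see how the entries line up with the mkmgrconf answers, the file can be decoded with a few lines of Python. This is a sketch rather than a PVFS tool: the field names are labels chosen here, and the meaning given to the final entry (the directory's name relative to the PVFS root) is an assumption. Note that the mode is octal and includes the directory bit, and that root appears as user and group ID 0.

```python
import stat

# Labels for the eight entries; these names are illustrative only.
FIELDS = ("inode", "uid", "gid", "mode", "port", "host", "metadata_dir", "subdir")


def parse_pvfsdir(text):
    """Decode the whitespace-separated entries of a .pvfsdir file."""
    values = text.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected %d entries, found %d" % (len(FIELDS), len(values)))
    entry = dict(zip(FIELDS, values))
    for key in ("inode", "uid", "gid", "port"):
        entry[key] = int(entry[key])
    entry["mode"] = int(entry["mode"], 8)  # the mode is written in octal
    return entry


entry = parse_pvfsdir("84230 0 0 0040777 3000 fanny /pvfs-meta/ /")
print(stat.filemode(entry["mode"]), entry["host"], entry["port"])
```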
The file .iodtab is a list of the I/O servers and their port numbers. For this example, it should look like this:
fanny:7000
george:7000
hector:7000
ida:7000
james:7000
Systems can be listed by name or by IP number. If the default port (7000) is used, it can be omitted from the file.
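The default-port rule is easy to capture in code. Here is a small sketch of a reader for .iodtab entries; parse_iodtab is a hypothetical helper, and it simply treats each whitespace-separated entry as host or host:port.

```python
DEFAULT_IOD_PORT = 7000  # the iod default noted above


def parse_iodtab(text):
    """Return (host, port) pairs, filling in the default port when omitted."""
    servers = []
    for token in text.split():
        host, _, port = token.partition(":")
        servers.append((host, int(port) if port else DEFAULT_IOD_PORT))
    return servers


print(parse_iodtab("fanny:7000\ngeorge\n192.168.1.3:7001\n"))
```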
12.1.3 I/O Server Setup
To set up the I/O servers, you need to create a data directory on the appropriate machines, create a configuration file, and then push the configuration file, along with the other I/O server software, to the appropriate machines. In this example, all the nodes in the cluster including the head node are I/O servers.
The first step is to create a directory with the appropriate ownership and permissions on all the I/O servers. We start with the head node.
[root@fanny /]# mkdir /pvfs-data
[root@fanny /]# chmod 700 /pvfs-data
[root@fanny /]# chown nobody.nobody /pvfs-data
Keep in mind that these directories are where the actual pieces of a data file will be stored. However, you will not access this data in these directories directly. That is done through the filesystem at the appropriate mount point. These PVFS data directories, like the meta-server's metadata directory, can be ignored once PVFS is running.
Next, create the configuration file /etc/iod.conf using your favorite text editor. (This is optional, but recommended.) iod.conf describes the iod environment. Every line, apart from comments, consists of a key and a corresponding value. Here is a simple example:
# iod.conf-iod configuration file
datadir /pvfs-data
user nobody
group nobody
logdir /tmp
rootdir /
debug 0
As you can see, this specifies a directory for the data, the user and group under which the I/O daemon iod will run, the log and root directories, and a debug level. You can also specify other parameters such as the port and buffer information. In general, the defaults are reasonable, but you may want to revisit this file when fine-tuning your system.
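Because the format is just a key followed by a value, a configuration file like this is simple to check before you push it across the cluster. The following sketch reads the key/value lines into a dictionary. It assumes that comments are whole lines beginning with #, and parse_iod_conf is a name chosen for illustration.

```python
def parse_iod_conf(text):
    """Read iod.conf style 'key value' lines into a dictionary."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


SAMPLE = """# iod.conf-iod configuration file
datadir /pvfs-data
user nobody
group nobody
logdir /tmp
rootdir /
debug 0
"""
print(parse_iod_conf(SAMPLE)["datadir"])
```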
While this takes care of the head node, the process must be repeated for each of the remaining I/O servers. First, create the directory and configuration file for each of the remaining I/O servers. Here is an example using the C3 utilities. (C3 is described in Chapter 10.)
[root@fanny /]# cexec mkdir /pvfs-data
...
[root@fanny /]# cexec chmod 700 /pvfs-data
...
[root@fanny /]# cexec chown nobody.nobody /pvfs-data
...
[root@fanny /]# cpush /etc/iod.conf
...
Since the configuration file is the same, it's probably quicker to copy it to each machine, as shown here, rather than re-create it.
Finally, since the iod daemon was created only on the head node, you'll need to copy it to each of the remaining I/O servers.
[root@fanny root]# cpush /usr/local/sbin/iod ...
While this example uses C3's cpush, you can use whatever you are comfortable with.
If you aren't configuring every machine in your cluster to be an I/O server, you'll need to adapt these steps as appropriate for your cluster. This is easy to do with C3's range feature.
12.1.4 Client Setup
Client setup is a little more involved. For each client, you'll need to create a PVFS device file, copy over the kernel module, create a mount point and a PVFS mount table, and copy over the appropriate executable along with any other utilities you might need on the client machine. In this example, all nodes including the head are configured as clients. But because we have already installed software on the head node, some of the steps aren't necessary for that particular machine.
First, a special character file needs to be created on each of the clients using the mknod command.
[root@fanny /]# cexec mknod /dev/pvfsd c 60 0 ...
/dev/pvfsd is used to communicate between the pvfsd daemon and the kernel module pvfs.o. It allows programs to access PVFS files, once mounted, using traditional Unix filesystem semantics.
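If a client later fails to mount the filesystem, it is worth confirming that the device file really is a character device with major number 60 and minor number 0. The sketch below performs that check with Python's os and stat modules on a Unix system. The function names are hypothetical.

```python
import os
import stat

PVFSD_MAJOR, PVFSD_MINOR = 60, 0  # values passed to mknod above


def is_pvfsd_device(mode, rdev):
    """True if a stat result describes a character device 60,0."""
    return (stat.S_ISCHR(mode)
            and os.major(rdev) == PVFSD_MAJOR
            and os.minor(rdev) == PVFSD_MINOR)


def check_pvfsd(path="/dev/pvfsd"):
    """Check the device file on this node; a missing file counts as a failure."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    return is_pvfsd_device(st.st_mode, st.st_rdev)
```

Running check_pvfsd() on each node reports False wherever the mknod step was missed.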
We will need to distribute both the kernel module and the daemon to each node.
[root@fanny /]# cpush /usr/local/sbin/pvfsd
...
[root@fanny /]# cexec mkdir /lib/modules/2.4.20-6/kernel/fs/pvfs/
...
[root@fanny /]# cpush /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o
...
The kernel module registers the filesystem with the kernel while the daemon performs network transfers.
Next, we need to create a mount point.
[root@fanny root]# mkdir /mnt/pvfs
[root@fanny /]# cexec mkdir /mnt/pvfs
...
This example uses /mnt/pvfs, but /pvfs is another frequently used alternative. The mount directory is where the files appear to be located. This is the directory you'll use to access or reference files.
The mount.pvfs executable is used to mount a filesystem using PVFS and should be copied to each client node.
[root@fanny /]# cpush /usr/local/sbin/mount.pvfs /sbin/ ...
mount.pvfs can be invoked by the mount command on some systems, or it can be called directly.
Finally, create /etc/pvfstab, a mount table for the PVFS system. This needs to contain only a single line of information as shown here:
fanny:/pvfs-meta /mnt/pvfs pvfs port=3000 0 0
If you are familiar with /etc/fstab, this should look very familiar. The first field is the path to the metadata information. The next field is the mount point. The third field is the filesystem type, which is followed by the port number. The last two fields, traditionally used to determine when a filesystem is dumped or checked, aren't currently used by PVFS. These fields should be zeros. You'll probably need to change the first two fields to match your cluster, but everything else should work as shown here.
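The field layout just described can be made concrete with a short parser. This is a sketch for checking a pvfstab line, not part of PVFS. The PvfsTabEntry type and its field names are invented for illustration, and the fallback to port 3000 when no port option is present is an assumption based on the manager's default port.

```python
from typing import NamedTuple


class PvfsTabEntry(NamedTuple):
    host: str
    metadata_dir: str
    mount_point: str
    fs_type: str
    port: int
    dump: int
    fsck_pass: int


def parse_pvfstab_line(line):
    """Split one pvfstab line into its six fstab-style fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, found %d" % len(fields))
    source, mount_point, fs_type, options, dump, fsck_pass = fields
    host, _, metadata_dir = source.partition(":")
    # Options are comma separated key=value pairs, e.g. port=3000.
    opts = dict(opt.partition("=")[::2] for opt in options.split(","))
    return PvfsTabEntry(host, metadata_dir, mount_point, fs_type,
                        int(opts.get("port", 3000)), int(dump), int(fsck_pass))


print(parse_pvfstab_line("fanny:/pvfs-meta /mnt/pvfs pvfs port=3000 0 0"))
```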
Once you have created the mount table, push it to the remaining nodes.
[root@fanny /]# cpush /etc/pvfstab
...
[root@fanny /]# cexec chmod 644 /etc/pvfstab
...
Make sure the file is readable as shown.
While it isn't strictly necessary, there are some other files that you may want to push to your client nodes. The installation of PVFS puts a number of utilities in /usr/local/bin. You'll need to push these to the clients before you'll be able to use them effectively. The most useful include mgr-ping, iod-ping, pvstat, and u2p.
[root@fanny root]# cpush /usr/local/bin/mgr-ping
...
[root@fanny root]# cpush /usr/local/bin/iod-ping
...
[root@fanny root]# cpush /usr/local/bin/pvstat
...
[root@fanny pvfs]# cpush /usr/local/bin/u2p
...
As you gain experience with PVFS, you may want to push other utilities across the cluster.
If you want to do program development using PVFS, you will need access to the PVFS header files and libraries and the pvfstab file. By default, header and library files are installed in /usr/local/include and /usr/local/lib, respectively. If you do program development only on your head node, you are in good shape. But if you do program development on any of your cluster nodes, you'll need to push these files to those nodes. (You might also want to push the manpages as well, which are installed in /usr/local/man.)
12.1.5 Running PVFS
Finally, now that you have everything installed, you can start PVFS. You need to start the appropriate daemons on the appropriate machines and load the kernel module. To load the kernel module, use the insmod command.
[root@fanny root]# insmod /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o
[root@fanny root]# cexec insmod /lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o
...
Next, run the mgr daemon on the metadata server. This is the management daemon.
[root@fanny root]# /usr/local/sbin/mgr
On each I/O server, start the iod daemon.
[root@fanny root]# /usr/local/sbin/iod
[root@fanny root]# cexec /usr/local/sbin/iod
...
Next, start the pvfsd daemon on each client node.
[root@fanny root]# /usr/local/sbin/pvfsd
[root@fanny root]# cexec /usr/local/sbin/pvfsd
...
Finally, mount the filesystem on each client.
[root@fanny root]# /usr/local/sbin/mount.pvfs fanny:/pvfs-meta /mnt/pvfs
[root@fanny /]# cexec /sbin/mount.pvfs fanny:/pvfs-meta /mnt/pvfs
...
PVFS should be up and running.[2]
[2] Although not described here, you'll probably want to make the necessary changes to your startup file so that this is all done automatically. PVFS provides scripts enablemgr and enableiod for use with Red Hat machines.
To shut PVFS down, use the umount command to unmount the filesystem, e.g., umount /mnt/pvfs, stop the PVFS processes with kill or killall, and unload the pvfs.o module with the rmmod command.
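The order of the startup and shutdown steps depends on the roles a node plays, so it can be useful to generate the command list rather than remember it. The sketch below only builds lists of command strings and runs nothing. It follows the order used in this section and assumes the paths from this chapter's example, with mount.pvfs already copied to /sbin. The function names, the role labels, and the use of killall with rmmod pvfs for shutdown are illustrative choices.

```python
MODULE = "/lib/modules/2.4.20-6/kernel/fs/pvfs/pvfs.o"  # Red Hat 9.0 example path


def startup_commands(roles, meta="fanny:/pvfs-meta", mount_point="/mnt/pvfs"):
    """List the commands to start PVFS on a node with the given roles."""
    cmds = []
    if "client" in roles:
        cmds.append("insmod " + MODULE)
    if "metadata" in roles:
        cmds.append("/usr/local/sbin/mgr")
    if "io" in roles:
        cmds.append("/usr/local/sbin/iod")
    if "client" in roles:
        cmds.append("/usr/local/sbin/pvfsd")
        cmds.append("/sbin/mount.pvfs %s %s" % (meta, mount_point))
    return cmds


def shutdown_commands(roles, mount_point="/mnt/pvfs"):
    """List the commands to stop PVFS: unmount, stop daemons, unload module."""
    cmds = []
    if "client" in roles:
        cmds += ["umount " + mount_point, "killall pvfsd"]
    if "io" in roles:
        cmds.append("killall iod")
    if "metadata" in roles:
        cmds.append("killall mgr")
    if "client" in roles:
        cmds.append("rmmod pvfs")
    return cmds


for cmd in startup_commands({"metadata", "io", "client"}):
    print(cmd)
```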
12.1.5.1 Troubleshooting
There are several things you can do to quickly check whether everything is running. Perhaps the simplest is to copy a file to the mounted directory and verify that it is accessible on other nodes. If you have problems, there are a couple of other things you might want to try to narrow things down.
First, use ps to ensure the daemons are running on the appropriate machines. For example,
[root@fanny root]# ps -aux | grep pvfsd
root 15679 0.0 0.1 1700 184 ? S Jun21 0:00 /usr/local/sbin/pvfsd
Of course, mgr should be running only on the metadata server and iod should be running on all the I/O servers (but nowhere else).
Each process will create a log file, by default in the /tmp directory. Look to see if these are present.
[root@fanny root]# ls -l /tmp
total 48
-rwxr-xr-x 1 root root 354 Jun 21 11:13 iolog.OxLkSR
-rwxr-xr-x 1 root root 0 Jun 21 11:12 mgrlog.z3tg11
-rwxr-xr-x 1 root root 119 Jun 21 11:21 pvfsdlog.msBrCV
...
The garbage at the end of the filenames is generated to produce a unique filename.
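Checking for the log files by eye gets tedious on more than a few nodes. As a sketch, the function below takes a list of filenames, such as the result of os.listdir("/tmp"), and reports which daemons have no log file. The prefixes are taken from the listing above, and the function name and daemon labels are hypothetical.

```python
# Log file prefixes as they appear in the /tmp listing above.
LOG_PREFIXES = {"mgr": "mgrlog", "iod": "iolog", "pvfsd": "pvfsdlog"}


def missing_logs(filenames, expected=("mgr", "iod", "pvfsd")):
    """Return the expected daemons that have no <prefix>.<suffix> log file."""
    present = set()
    for name in filenames:
        prefix, dot, suffix = name.partition(".")
        for daemon, log_prefix in LOG_PREFIXES.items():
            if prefix == log_prefix and dot and suffix:
                present.add(daemon)
    return sorted(set(expected) - present)


print(missing_logs(["iolog.OxLkSR", "pvfsdlog.msBrCV"]))
```

On a node that is not the metadata server, pass expected=("iod", "pvfsd") so that the absence of a mgr log is not reported as a problem.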
The mounted PVFS will be included in the listing given with the mount command.
[root@fanny root]# mount
...
fanny:/pvfs-meta on /mnt/pvfs type pvfs (rw)
...
This should work on each node.
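To automate this check, the output of mount can be filtered for entries of type pvfs. The sketch below assumes the "source on mount-point type fstype (options)" line format shown above and mount points without embedded spaces; pvfs_mounts is a hypothetical helper.

```python
def pvfs_mounts(mount_output):
    """Return (source, mount point) pairs for filesystems of type pvfs."""
    mounts = []
    for line in mount_output.splitlines():
        fields = line.split()
        # Expected shape: <source> on <mount point> type <fstype> (<options>)
        if (len(fields) >= 5 and fields[1] == "on"
                and fields[3] == "type" and fields[4] == "pvfs"):
            mounts.append((fields[0], fields[2]))
    return mounts


SAMPLE = """/dev/hda2 on / type ext3 (rw)
fanny:/pvfs-meta on /mnt/pvfs type pvfs (rw)
"""
print(pvfs_mounts(SAMPLE))
```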
In addition to the fairly obvious tests just listed, PVFS provides a couple of utilities you can turn to. The utilities iod-ping and mgr-ping can be used to check whether the I/O and metadata servers are running and responding on a particular machine.
Here is an example of using iod-ping:
[root@fanny root]# /usr/local/bin/iod-ping
localhost:7000 is responding.
[root@fanny root]# cexec /usr/local/bin/iod-ping
************************* local *************************
--------- george.wofford.int---------
localhost:7000 is responding.
--------- hector.wofford.int---------
localhost:7000 is responding.
--------- ida.wofford.int---------
localhost:7000 is responding.
--------- james.wofford.int---------
localhost:7000 is responding.
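On a larger cluster you will not want to read this output line by line. The following sketch summarizes cexec output of the form shown above, marking a node as healthy only if an "is responding." line follows its header. The wording of a failure message is not shown in this chapter, so anything else is simply treated as not responding; ping_report is a hypothetical helper.

```python
def ping_report(cexec_output):
    """Map each node named in cexec output to True if it reported responding."""
    status = {}
    host = None
    for line in cexec_output.splitlines():
        line = line.strip()
        if line.startswith("-----"):
            # Header lines look like: --------- george.wofford.int---------
            host = line.strip("- ")
            status[host] = False
        elif host is not None and line.endswith("is responding."):
            status[host] = True
    return status


def not_responding(cexec_output):
    """Return the sorted names of nodes that did not report responding."""
    return sorted(h for h, ok in ping_report(cexec_output).items() if not ok)
```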
12.2 Using PVFS
To make effective use of PVFS, you need to understand how PVFS distributes files across the cluster. PVFS uses a simple striping scheme with three striping parameters.
base
The cluster node where the file starts, given as an index where the first I/O server is 0. Typically, this defaults to 0.
pcount
The number of I/O servers among which the file is partitioned. Typically, this defaults to the total number of I/O servers.
ssize
The size of each strip, i.e., contiguous blocks of data. Typically, this defaults to 64 KB.
Figure 12-2 should help clarify how files are distributed. In the figure, the file is broken into eight pieces and distributed among four I/O servers. base is the index of the first I/O server. pcount is the number of servers used, i.e., four in this case. ssize is the size of each of the eight blocks. Of course, the idea is to select a block size that will optimize parallel access to the file.
Figure 12-2. Overlap within files
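The striping scheme can be sketched in a few lines of arithmetic. The code below assumes strips are dealt out round-robin starting at base and wrapping around the list of I/O servers, which is one reading of the figure rather than a statement about PVFS internals. The function and parameter names other than base, pcount, and ssize are invented for illustration.

```python
def locate(offset, base=0, pcount=4, ssize=65536, n_servers=None):
    """Map a byte offset to (server index, strip number on that server,
    offset within the strip), assuming round-robin striping."""
    if n_servers is None:
        n_servers = pcount  # assume the file uses every I/O server
    strip = offset // ssize
    server = (base + strip % pcount) % n_servers
    return server, strip // pcount, offset % ssize


def strips_per_server(file_size, pcount, ssize):
    """Return (total strips, strips held by each of the pcount servers)."""
    total = -(-file_size // ssize)  # ceiling division
    counts = [total // pcount + (1 if i < total % pcount else 0)
              for i in range(pcount)]
    return total, counts


# Eight strips over four servers, as in the figure: two strips per server.
print(strips_per_server(8 * 65536, pcount=4, ssize=65536))
```

With these assumptions, a 10 MB file with a 64 KB strip size on five servers works out to 160 strips, 32 on each server.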
You can examine the distribution of a file using the pvstat utility. For example,
[root@fanny pvfs]# pvstat data
data: base = 0, pcount = 5, ssize = 65536
[root@fanny pvfs]# ls -l data
-rw-r--r-- 1 root root 10485760 Jun 21 12:49 data
A little arithmetic shows this file is broken into 160 pieces (10,485,760 bytes divided by 65,536 bytes per strip) with 32 blocks on each of the five I/O servers.
If you copy a file to a PVFS filesystem using cp, it will be partitioned automatically for you using what should be reasonable defaults. For more control, you can use the u2p utility. With u2p, the command-line option -s sets the stripe size; -b specifies the base; and -n specifies the number of nodes. Here is an example:
[root@fanny /]# u2p -s16384 data /mnt/data
1 node(s); ssize = 8192; buffer = 0; nanMBps (0 bytes total)
[root@fanny /]# pvstat /mnt/data
/mnt/data: base = 0, pcount = 1, ssize = 8192
Typically, u2p is used to convert an existing file for use with a parallel program.
While the Unix system calls read and write will work with PVFS without any changes, large numbers of small accesses will not perform well. The buffered routines from the standard I/O library (e.g., fread and fwrite) should work better provided an adequate buffer is used.
To make optimal use of PVFS, you will need to write your programs to use PVFS explicitly. This can be done using the native PVFS access provided through the libpvfs.a library. Details can be found in Using the Parallel Virtual File System, part of the documentation available at the PVFS web site. Programming examples are included with the source in the examples subdirectory. Clearly, you should understand your application's data requirements before you begin programming.
Alternatively, PVFS can be used with the ROMIO interface from http://www.mcs.anl.gov. ROMIO is included with both MPICH and LAM/MPI. (If you compile ROMIO, you need to specify PVFS support. Typically, you use the compile flags -lib=/usr/local/lib/libpvfs.a and -file_system=pvfs+nfs+ufs.) ROMIO provides two important optimizations, data sieving and two-phase I/O. Additional information is available at the ROMIO web site.
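The advice about avoiding many small accesses applies to any language that goes through the ordinary file interface. As a sketch, the Python generator below reads a file in large chunks instead of a few bytes at a time. Sizing each chunk to one full stripe (ssize times pcount) is an assumption made here for illustration, not a documented PVFS recommendation, and the example uses an in-memory file as a stand-in for a file under /mnt/pvfs.

```python
import io


def iter_chunks(fileobj, ssize=65536, pcount=5):
    """Yield a binary file's contents in chunks of ssize * pcount bytes."""
    chunk = ssize * pcount
    while True:
        data = fileobj.read(chunk)
        if not data:
            break
        yield data


def open_buffered(path, ssize=65536, pcount=5):
    """Open a file for reading with a buffer sized to one full stripe."""
    return open(path, "rb", buffering=ssize * pcount)


# Stand-in for a PVFS file: 10 bytes read with a 4-byte "stripe".
sizes = [len(c) for c in iter_chunks(io.BytesIO(b"x" * 10), ssize=2, pcount=2)]
print(sizes)
```

For a real file you would combine the two, e.g., iterate over iter_chunks(open_buffered(path)) with the ssize and pcount values reported by pvstat.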
12.3 Notes for OSCAR and Rocks Users
Both OSCAR and Rocks use NFS. Rocks uses autofs to mount home directories; OSCAR doesn't. (Automounting and autofs are discussed briefly in Chapter 4.)
PVFS is available as an add-on package for OSCAR. By default, it installs across the first eight available nodes using the OSCAR server as the metadata server and mount point. The OSCAR server is not configured as an I/O server. OSCAR configures PVFS to start automatically when the system is rebooted.
With OSCAR, PVFS is installed in the directory /opt/pvfs, e.g., the libraries are in /opt/pvfs/lib and the manpages are in /opt/pvfs/man. The manpages are not placed on the user's path but can be reached with the -M option to man. For example,
[root@amy /]# man -M /opt/pvfs/man/ pvfs_chmod
The PVFS utilities are in /opt/pvfs/bin and the daemons are in /opt/pvfs/sbin. The mount point for PVFS is /mnt/pvfs. Everything else is pretty much where you would expect it to be.
PVFS is fully integrated into Rocks on all nodes. However, you will need to do several configuration tasks. Basically, this means following the steps outlined in this chapter. However, you'll find that some of the steps have been done for you.
On the meta-server, the directory /pvfs-meta is already in place; run /usr/bin/mkmgrconf to create the configuration files. For the I/O servers, you'll need to create the data directory /pvfs-data but the configuration file is already in place. The kernel modules are currently in /lib/modules/2.4.21-15.EL/fs/ and are preloaded. You'll need to start the I/O daemon /usr/sbin/iod, and you'll need to mount each client using /sbin/mount.pvfs. All in all it goes quickly. Just be sure to note locations for the various commands.
The iod daemon seems to be OK on all the clients. If you run mgr-ping, only the metadata server should respond.