(Part 1--Chapter 1-4) High Performance Linux Clusters with OSCAR, Rocks, OpenMosix, and MPI

Part I: An Introduction to Clusters
The first section of this book is a general introduction to clusters. It is largely background material. Readers already familiar with clusters may want to quickly skim this material and then move on to subsequent chapters. This section is divided into four chapters.

Chapter 1. Cluster Architecture

Computing speed isn't just a convenience. Faster computers allow us to solve larger problems, and to find solutions more quickly, with greater accuracy, and at a lower cost. All this adds up to a competitive advantage. In the sciences, this may mean the difference between being the first to publish and not publishing. In industry, it may determine who's first to the patent office.

Traditional high-performance clusters have proved their worth in a variety of uses, from predicting the weather to industrial design, from molecular dynamics to astronomical modeling. High-performance computing (HPC) has created a new approach to science: modeling is now a viable and respected alternative to the more traditional experimental and theoretical approaches.

Clusters are also playing a greater role in business. High performance is a key issue in data mining or in image rendering. Advances in clustering technology have led to high-availability and load-balancing clusters. Clustering is now used for mission-critical applications such as web and FTP servers. For example, Google uses an ever-growing cluster composed of tens of thousands of computers.

1.1 Modern Computing and the Role of Clusters

Because of the expanding role that clusters play in distributed computing, it is worth briefly clarifying the terminology. There is a great deal of ambiguity, and the terms used to describe clusters and distributed computing are often used inconsistently. This chapter doesn't provide a detailed taxonomy: it doesn't include a discussion of Flynn's taxonomy or of cluster topologies. That has been done quite well a number of times, and too much of it would be irrelevant to the purpose of this book. However, this chapter does try to explain the language used. If you need more general information, see Appendix A for other sources. High Performance Computing, Second Edition (O'Reilly), by Dowd and Severance, is a particularly readable introduction.

When computing, there are three basic approaches to improving performance: use a better algorithm, use a faster computer, or divide the calculation among multiple computers. A very common analogy is that of a horse-drawn cart. You can lighten the load, you can get a bigger horse, or you can get a team of horses. (We'll ignore the option of going into therapy and learning to live with what you have.) Let's look briefly at each of these approaches.

First, consider what you are trying to calculate. All too often, improvements in computing hardware are taken as a license to use less efficient algorithms, to write sloppy programs, or to perform meaningless or redundant calculations rather than carefully defining the problem. Selecting appropriate algorithms is a key way to eliminate instructions and speed up a calculation. The quickest way to finish a task is to skip it altogether.

If you need only a modest improvement in performance, then buying a faster computer may solve your problems, provided you can find something you can afford. But just as there is a limit on how big a horse you can buy, there are limits on the computers you can buy. You can expect rapidly diminishing returns when buying faster computers. While there are no hard and fast rules, it is not unusual to see a quadratic increase in cost with a linear increase in performance, particularly as you move away from commodity technology.

The third approach is parallelism, i.e., executing instructions simultaneously. There are a variety of ways to achieve this. At one end of the spectrum, parallelism can be integrated into the architecture of a single CPU (which brings us back to buying the best computer you can afford). At the other end of the spectrum, you may be able to divide the computation up among different computers on a network, each computer working on a part of the calculation, all working at the same time. This book is about that approach: harnessing a team of horses.

1.1.1 Uniprocessor Computers

The traditional classification of computers based on size and performance, i.e., classifying computers as microcomputers, workstations, minicomputers, mainframes, and supercomputers, has become obsolete. The ever-changing capabilities of computers mean that today's microcomputers now outperform the mainframes of the not-too-distant past. Furthermore, this traditional classification scheme does not readily extend to parallel systems and clusters. Nonetheless, it is worth looking briefly at the capabilities and problems associated with more traditional computers, since these will be used to assemble clusters. If you are working with a team of horses, it is helpful to know something about a horse.

Regardless of where we place them in the traditional classification, most computers today are based on an architecture often attributed to the Hungarian mathematician John von Neumann. The basic structure of a von Neumann computer is a CPU connected to memory by a communications channel or bus. Instructions and data are stored in memory and are moved to and from the CPU across the bus. The overall speed of a computer depends on both the speed at which its CPU can execute individual instructions and the overhead involved in moving instructions and data between memory and the CPU.

Several technologies are currently used to speed up the processing speed of CPUs. The development of reduced instruction set computer (RISC) architectures and post-RISC architectures has led to more uniform instruction sets. This eliminates cycles from some instructions and allows a higher clock-rate. The use of RISC technology and the steady increase in chip densities provide great benefits in CPU speed.

Superscalar architectures and pipelining have also increased processor speeds. Superscalar architectures execute two or more instructions simultaneously. For example, an addition and a multiplication instruction, which use different parts of the CPU, might be executed at the same time. Pipelining overlaps the different phases of instruction execution like an assembly line. For example, while one instruction is executed, the next instruction can be fetched from memory or the results from the previous instruction can be stored.

Memory bandwidth, basically the rate at which bits are transferred from memory over the bus, is a different story. Improvements in memory bandwidth have not kept up with CPU improvements. It doesn't matter how fast the CPU is theoretically capable of running if you can't get instructions and data into or out of the CPU fast enough to keep the CPU busy. Consequently, memory access has created a performance bottleneck for the classical von Neumann architecture: the von Neumann bottleneck.

Computer architects and manufacturers have developed a number of techniques to minimize the impact of this bottleneck. Computers use a hierarchy of memory technology to improve overall performance while minimizing cost. Frequently used data is placed in very fast cache memory, while less frequently used data is placed in slower but cheaper memory. Another alternative is to use multiple processors so that memory operations are spread among the processors. If each processor has its own memory and its own bus, all the processors can access their own memory simultaneously.
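The effect of this memory hierarchy is easy to see from a program. The following sketch (a made-up illustration, not taken from any benchmark suite) walks the same array twice: once sequentially, so each cache line that is fetched is fully used, and once with a large stride, so that almost every access misses the cache. On most machines the second pass is noticeably slower even though both passes touch exactly the same number of elements.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)       /* 16M integers, larger than a typical cache */
    #define STRIDE 4096       /* jump far enough to defeat the cache       */

    int main(void)
    {
        int *a = malloc(N * sizeof(int));
        long sum = 0;
        clock_t t;

        if (a == NULL)
            return 1;

        /* Touch every element once so the memory is really allocated. */
        for (int i = 0; i < N; i++)
            a[i] = i;

        /* Sequential pass: consecutive accesses reuse each cache line. */
        t = clock();
        for (int i = 0; i < N; i++)
            sum += a[i];
        printf("sequential: %.2f seconds\n",
               (double)(clock() - t) / CLOCKS_PER_SEC);

        /* Strided pass: the same N accesses, but with poor cache reuse. */
        t = clock();
        for (int j = 0; j < STRIDE; j++)
            for (int i = j; i < N; i += STRIDE)
                sum += a[i];
        printf("strided:    %.2f seconds\n",
               (double)(clock() - t) / CLOCKS_PER_SEC);

        free(a);
        return (int)(sum & 1);    /* keep the compiler from discarding the loops */
    }

The exact timings will vary with the processor and memory system, but the gap between the two passes is a direct view of the von Neumann bottleneck.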

1.1.2 Multiple Processors

Traditionally, supercomputers have been pipelined, superscalar processors with a single CPU. These are the "big iron" of the past, often requiring "forklift upgrades" and multiton air conditioners to prevent them from melting from the heat they generate. In recent years we have come to augment that definition to include parallel computers with hundreds or thousands of CPUs, otherwise known as multiprocessor computers. Multiprocessor computers fall into two basic categories: centralized multiprocessors (or single-enclosure multiprocessors) and multicomputers.

1.1.2.1 Centralized multiprocessors

With centralized multiprocessors, there are two architectural approaches based on how memory is managed: uniform memory access (UMA) and nonuniform memory access (NUMA) machines. With UMA machines, also called symmetric multiprocessors (SMP), there is a common shared memory. Identical memory addresses map, regardless of the CPU, to the same location in physical memory. Main memory is equally accessible to all CPUs, as shown in Figure 1-1. To improve memory performance, each processor has its own cache.

Figure 1-1. UMA architecture

 


There are two closely related difficulties when designing a UMA machine. The first problem is synchronization. Communications among processes and access to peripherals must be coordinated to avoid conflicts. The second problem is cache consistency. If two different CPUs are accessing the same location in memory and one CPU changes the value stored in that location, then how is the cache entry for the other CPU updated? While several techniques are available, the most common is snooping. With snooping, each cache listens to all memory accesses. If a cache contains a memory address that is being written to in main memory, the cache updates its copy of the data to remain consistent with main memory.

A closely related architecture is used with NUMA machines. Roughly, with this architecture, each CPU maintains its own piece of memory, as shown in Figure 1-2. Effectively, memory is divided among the processors, but each processor has access to all the memory. Each individual memory address, regardless of the processor, still references the same location in memory. Memory access is nonuniform in the sense that some parts of memory will appear to be much slower than other parts, since the bank of memory "closest" to a processor can be accessed more quickly by that processor. While this memory arrangement can simplify synchronization, the problem of memory coherency increases.

Figure 1-2. NUMA architecture

 


Operating system support is required with either multiprocessor scheme. Fortunately, most modern operating systems, including Linux, provide support for SMP systems, and support is improving for NUMA architectures.

When dividing a calculation among processors, an important concern is granularity, or the smallest piece that a computation can be broken into for purposes of sharing among different CPUs. Architectures that allow smaller pieces of code to be shared are said to have a finer granularity (as opposed to a coarser granularity). The granularity of each of these architectures is the thread. That is, the operating system can place different threads from the same process on different processors. Of course, this implies that, if your computation generates only a single thread, then that thread can't be shared between processors but must run on a single CPU. If the operating system has nothing else for the other processors to do, they will remain idle and you will see no benefit from having multiple processors.
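To make thread-level granularity concrete, here is a minimal POSIX threads sketch (the work done by each thread is arbitrary and purely illustrative). The process creates two worker threads; on an SMP or NUMA machine, an SMP-aware operating system such as Linux is free to run the two threads on different processors at the same time. If the program created only one thread, that thread could not be split across processors.

    #include <pthread.h>
    #include <stdio.h>

    /* Each worker sums part of a series; the operating system may schedule
       the two threads on different processors simultaneously. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        double sum = 0.0;

        for (long i = id * 50000000L; i < (id + 1) * 50000000L; i++)
            sum += 1.0 / (i + 1.0);

        printf("thread %ld finished, partial sum = %f\n", id, sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[2];

        for (long i = 0; i < 2; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (long i = 0; i < 2; i++)
            pthread_join(threads[i], NULL);

        return 0;
    }

Compiled with gcc -pthread, this program can keep two processors busy; written as a single-threaded loop, it could use only one.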

A third architecture worth mentioning in passing is processor array, which, at one time, generated a lot of interest. A processor array is a type of vector computer built with a collection of identical, synchronized processing elements. Each processor executes the same instruction on a different element in a data array.

Numerous issues have arisen with respect to processor arrays. While some problems map nicely to this architecture, most problems do not. This severely limits the general use of processor arrays. The overall design doesn't work well for problems with large serial components. Processor arrays are typically designed around custom VLSI processors, resulting in much higher costs when compared to more commodity-oriented multiprocessor designs. Furthermore, processor arrays typically are single user, adding to the inherent cost of the system. For these and other reasons, processor arrays are no longer popular.

1.1.2.2 Multicomputers

A multicomputer configuration, or cluster, is a group of computers that work together. A cluster has three basic elements: a collection of individual computers, a network connecting those computers, and software that enables a computer to share work among the other computers via the network.

For most people, the most likely thing to come to mind when speaking of multicomputers is a Beowulf cluster. Thomas Sterling and Don Becker at NASA's Goddard Space Flight Center built a parallel computer out of commodity hardware and freely available software in 1994 and named their system Beowulf.[1] While this is perhaps the best-known type of multicomputer, a number of variants now exist.

[1] If you think back to English lit, you will recall that the epic hero Beowulf was described as having "the strength of many."

First, both commercial multicomputers and commodity clusters are available. Commodity clusters, including Beowulf clusters, are constructed using commodity, off-the-shelf (COTS) computers and hardware. When constructing a commodity cluster, the norm is to use freely available, open source software. This translates into an extremely low cost that allows people to build a cluster when the alternatives are just too expensive. For example, the "Big Mac" cluster built by Virginia Polytechnic Institute and State University was initially built using 1100 dual-processor Macintosh G5 PCs. It achieved speeds on the order of 10 teraflops, making it one of the fastest supercomputers in existence. But while supercomputers in that class usually take a couple of years to construct and cost in the range of $100 million to $250 million, Big Mac was put together in about a month and at a cost of just over $5 million. (A list of the fastest machines can be found at http://www.top500.org. The site also maintains a list of the top 500 clusters.)

In commodity clusters, the software is often mix-and-match. It is not unusual for the processors to be significantly faster than the network. The computers within a cluster can be dedicated to that cluster or can be standalone computers that dynamically join and leave the cluster. Typically, the term Beowulf is used to describe a cluster of dedicated computers, often with minimal hardware. If no one is going to use a node as a standalone machine, there is no need for that node to have a dedicated keyboard, mouse, video card, or monitor. Node computers may or may not have individual disk drives. (Beowulf is a politically charged term that is avoided in this book.) While a commodity cluster may consist of identical, high-performance computers purchased specifically for the cluster, they are often a collection of recycled cast-off computers, or a pile-of-PCs (POP).

Commercial clusters often use proprietary computers and software. For example, a SUN Ultra is not generally thought of as a COTS computer, so an Ultra cluster would typically be described as a proprietary cluster. With proprietary clusters, the software is often tightly integrated into the system, and the CPU performance and network performance are well matched. The primary disadvantage of commercial clusters is, as you no doubt guessed, their cost. But if money is not a concern, then IBM, Sun Microsystems, or any number of other companies will be happy to put together a cluster for you. (The salesman will probably even take you to lunch.)

A network of workstations (NOW), sometimes called a cluster of workstations (COW), is a cluster composed of computers usable as individual workstations. A computer laboratory at a university might become a NOW on the weekend when the laboratory is closed. Or office machines might join a cluster in the evening after the daytime users leave.

Software is an integral part of any cluster. A discussion of cluster software will constitute the bulk of this book. Support for clustering can be built directly into the operating system or may sit above the operating system at the application level, often in user space. Typically, when clustering support is part of the operating system, all nodes in the cluster need to have identical or nearly identical kernels; this is called a single system image (SSI). At best, the granularity is the process. With some software, you may need to run distinct programs on each node, resulting in even coarser granularity. Since each computer in a cluster has its own memory (unlike a UMA or NUMA computer), identical addresses on individual CPUs map to different physical memory locations. Communication is more involved and costly.

1.1.2.3 Cluster structure

It's tempting to think of a cluster as just a bunch of interconnected machines, but when you begin constructing a cluster, you'll need to give some thought to the internal structure of the cluster. This will involve deciding what roles the individual machines will play and what the interconnecting network will look like.

The simplest approach is a symmetric cluster. With a symmetric cluster (Figure 1-3) each node can function as an individual computer. This is extremely straightforward to set up. You just create a subnetwork with the individual machines (or simply add the computers to an existing network) and add any cluster-specific software you'll need. You may want to add a server or two depending on your specific needs, but this usually entails little more than adding some additional software to one or two of the nodes. This is the architecture you would typically expect to see in a NOW, where each machine must be independently usable.

Figure 1-3. Symmetric clusters

 


There are several disadvantages to a symmetric cluster. Cluster management and security can be more difficult. Workload distribution can become a problem, making it more difficult to achieve optimal performance.

For dedicated clusters, an asymmetric architecture is more common. With asymmetric clusters (Figure 1-4) one computer is the head node or frontend. It serves as a gateway between the remaining nodes and the users. The remaining nodes often have very minimal operating systems and are dedicated exclusively to the cluster. Since all traffic must pass through the head, asymmetric clusters tend to provide a high level of security. If the remaining nodes are physically secure and your users are trusted, you'll only need to harden the head node.

Figure 1-4. Asymmetric clusters

 


The head often acts as a primary server for the remainder of the cluster. Since, as a dual-homed machine, it will be configured differently from the remaining nodes, it may be easier to keep all customizations on that single machine. This simplifies the installation of the remaining machines. In this book, as with most descriptions of clusters, we will use the term public interface to refer to the network interface directly connected to the external network and the term private interface to refer to the network interface directly connected to the internal network.

The primary disadvantage of this architecture comes from the performance limitations imposed by the cluster head. For this reason, a more powerful computer may be used for the head. While beefing up the head may be adequate for small clusters, its limitations will become apparent as the size of the cluster grows. An alternative is to incorporate additional servers within the cluster. For example, one of the nodes might function as an NFS server, a second as a management station that monitors the health of the clusters, and so on.

I/O represents a particular challenge. It is often desirable to distribute a shared filesystem across a number of machines within the cluster to allow parallel access. Figure 1-5 shows a more fully specified cluster.

Figure 1-5. Expanded cluster

 


Network design is another key issue. With small clusters, a simple switched network may be adequate. With larger clusters, a fully connected network may be prohibitively expensive. Numerous topologies have been studied to minimize connections (costs) while maintaining viable levels of performance. Examples include hyper-tree, hyper-cube, butterfly, and shuffle-exchange networks. While a discussion of network topology is outside the scope of this book, you should be aware of the issue.

Heterogeneous networks are not uncommon. Although not shown in the figure, it may be desirable to locate the I/O servers on a separate parallel network. For example, some clusters have parallel networks allowing administration and user access through a slower network, while communications for processing and access to the I/O servers are handled over a high-speed network.

1.2 Types of Clusters

Originally, "clusters" and "high-performance computing" were synonymous. Today, the meaning of the word "cluster" has expanded beyond high-performance to include high-availability (HA) clusters and load-balancing (LB) clusters. In practice, there is considerable overlap among these梩hey are, after all, all clusters. While this book will focus primarily on high-performance clusters, it is worth taking a brief look at high-availability and load-balancing clusters.

High-availability clusters, also called failover clusters, are often used in mission-critical applications. If you can't afford the lost business that will result from having your web server go down, you may want to implement it using an HA cluster. The key to high availability is redundancy. An HA cluster is composed of multiple machines, a subset of which can provide the appropriate service. In its purest form, only a single machine or server is directly available; all other machines will be in standby mode. They will monitor the primary server to ensure that it remains operational. If the primary server fails, a secondary server takes its place.

The idea behind a load-balancing cluster is to provide better performance by dividing the work among multiple computers. For example, when a web server is implemented using LB clustering, the different queries to the server are distributed among the computers in the cluster. This might be accomplished using a simple round-robin algorithm. For example, Round-Robin DNS can be used to map responses to DNS queries to the different IP addresses: when a DNS query is made, the local DNS server returns the address of the next machine in the cluster, visiting machines in a round-robin fashion. However, this approach can lead to dynamic load imbalances. More sophisticated algorithms use feedback from the individual machines to determine which machine can best handle the next task.
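The round-robin idea itself fits in a few lines of C. In this sketch (the server addresses and the selection function are hypothetical, not part of any DNS or load-balancer implementation), each request is simply handed the next address in a fixed list, wrapping around at the end. Nothing about the servers' actual load is consulted, which is exactly why plain round-robin can drift out of balance.

    #include <stdio.h>

    /* Hypothetical pool of back-end servers. */
    static const char *servers[] = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };
    static const int nservers = 3;
    static int next_index = 0;

    /* Return the next server, cycling through the list. */
    static const char *next_server(void)
    {
        const char *s = servers[next_index];
        next_index = (next_index + 1) % nservers;
        return s;
    }

    int main(void)
    {
        for (int i = 0; i < 7; i++)
            printf("request %d -> %s\n", i, next_server());
        return 0;
    }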

Keep in mind, the term "load-balancing" means different things to different people. A high-performance cluster used for scientific calculation and a cluster used as a web server would likely approach load-balancing in entirely different ways. Each application has different critical requirements.

To some extent, any cluster can provide redundancy, scalability, and improved performance, regardless of its classification. Since load-balancing provides greater availability, it is not unusual to see both load-balancing and high-availability in the same cluster. The Linux Virtual Server Project (LVS) is an example of combining these two approaches. An LVS server is a high-availability server implemented by distributing tasks among a number of real servers. Interested readers are encouraged to visit the web pages for the Linux Virtual Server Project (http://www.linux-vs.org) and the High-Availability Linux Project (http://www.linux-ha.org) and to read the relevant HOWTOs. OSCAR users will want to visit the High-Availability OSCAR web site at http://www.openclustergroup.org/HA-OSCAR/.

1.3 Distributed Computing and Clusters

While the term parallel is often used to describe clusters, they are more correctly described as a type of distributed computing. Typically, the term parallel computing refers to tightly coupled sets of computation. Distributed computing is usually used to describe computing that spans multiple machines or multiple locations. When several pieces of data are being processed simultaneously in the same CPU, this might be called a parallel computation, but would never be described as a distributed computation. Multiple CPUs within a single enclosure might be used for parallel computing, but would not be an example of distributed computing. When talking about systems of computers, the term parallel usually implies a homogeneous collection of computers, while distributed computing typically implies a more heterogeneous collection. Computations that are done asynchronously are more likely to be called distributed than parallel. Clearly, the terms parallel and distributed lie at either end of a continuum of possible meanings. In any given instance, the exact meanings depend upon the context. The distinction is more one of connotation than of clearly established usage.

Since cluster computing is just one type of distributed computing, it is worth briefly mentioning the alternatives. The primary distinction between clusters and other forms of distributed computing is the scope of the interconnecting network and the degree of coupling among the individual machines. The differences are often ones of degree.

Clusters are generally restricted to computers on the same subnetwork or LAN. The term grid computing is frequently used to describe computers working together across a WAN or the Internet. The idea behind the term "grid" is to invoke a comparison between a power grid and a computational grid. A computational grid is a collection of computers that provide computing power as a commodity. This is an active area of research and has received (deservedly) a lot of attention from the National Science Foundation. The most significant differences between cluster computing and grid computing are that computing grids typically have a much larger scale, tend to be used more asynchronously, and have much greater access, authorization, accounting, and security concerns. From an administrative standpoint, if you build a grid, plan on spending a lot of time dealing with security-related issues. Grid computing has the potential of providing considerably more computing power than individual clusters since a grid may combine a large number of clusters.

Peer-to-peer computing provides yet another approach to distributed computing. Again this is an ambiguous term. Peer-to-peer may refer to sharing cycles, to the communications infrastructure, or to the actual data distributed across a WAN or the Internet. Peer-to-peer cycle sharing is best exemplified by SETI@Home, a project to analyze radio telescope data for signs of extraterrestrial intelligence. Volunteers load software onto their Internet-connected computers. To the casual PC or Mac user, the software looks like a screensaver. When a computer becomes idle, the screensaver comes on and the computer begins analyzing the data. If the user begins using the computer again, the screensaver closes and the data analysis is suspended. This approach has served as a model for other research, including the analysis of cancer and AIDS data.

Data or file-sharing peer-to-peer networks are best exemplified by Napster, Gnutella, or Kazaa technologies. With some peer-to-peer file-sharing schemes, cycles may also be provided for distributed computations. That is, by signing up and installing the software for some services, you may be providing idle cycles to the service for other uses beyond file sharing. Be sure you read the license before you install the software if you don't want your computers used in this way.

Other entries in the distributed computing taxonomy include federated clusters and constellations. Federated clusters are clusters of clusters, while constellations are clusters where the number of CPUs is greater than the number of nodes. A four-node cluster of SGI Altix computers with 128 CPUs per node is a constellation. Peer-to-peer, grids, federated clusters, and constellations are outside the scope of this book.

1.4 Limitations

While clusters have a lot to offer, they are not panaceas. There is a limit to how much adding another computer to a problem will speed up a calculation. In the ideal situation, you might expect a calculation to go twice as fast on two computers as it would on one. Unfortunately, this is the limiting case and you can only approach it.

Any calculation can be broken into blocks of code or instructions that can be classified in one of two exclusive ways. Either a block of code can be parallelized and shared among two or more machines, or the code is essentially serial and the instructions must be executed in the order they are written on a single machine. Any code that can't be parallelized won't benefit from any additional processors you may have.

There are several reasons why some blocks of code can't be parallelized and must be executed in a specific order. The most obvious example is I/O, where the order of operations is typically determined by the availability, order, and format of the input and the format of the desired output. If you are generating a report at the end of a program, you won't want the characters or lines of output printed at random.

Another reason some code can't be parallelized comes from the data dependencies within the code. If you use the value of x to calculate the value of y, then you'll need to calculate x before you calculate y. Otherwise, you won't know what value to use in the calculation. Basically, to be able to parallelize two instructions, neither can depend on the other. That is, the order in which the two instructions finish must not matter.
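A trivial C fragment makes the distinction concrete; the arrays and values here are made up for illustration. In the first loop, each iteration depends on the result of the previous one, so the iterations must run in order. In the second loop, no iteration touches another iteration's data, so the iterations could in principle be divided among processors.

    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        double a[N], x[N], b[N], c[N], y[N];

        for (int i = 0; i < N; i++) {
            a[i] = 1.0;
            b[i] = i;
            c[i] = 2.0 * i;
        }

        /* Serial: x[i] depends on x[i-1], so the order of iterations matters. */
        x[0] = a[0];
        for (int i = 1; i < N; i++)
            x[i] = x[i - 1] + a[i];

        /* Parallelizable: each iteration is independent of the others. */
        for (int i = 0; i < N; i++)
            y[i] = b[i] + c[i];

        printf("x[N-1] = %.1f, y[N-1] = %.1f\n", x[N - 1], y[N - 1]);
        return 0;
    }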

Thus, any program can be seen as a series of alternating sections: sections that can be parallelized and effectively run on different machines, interspersed with sections that must be executed as written and that effectively can only be run on a single machine. If a program spends most of its time in code that is essentially serial, parallel processing will have limited value for this code. In this case, you will be better served with a faster computer than with parallel computers. If you can't change the algorithm, big iron is the best approach for this type of problem.

1.4.1 Amdahl's Law

As just noted, the amount of code that must be executed serially limits how much of a speedup you can expect from parallel execution. This idea has been formalized by what is known as Amdahl's Law, named after Gene Amdahl, who first stated the law in the late sixties. In a nutshell, Amdahl's Law states that the serial portion of a program will be the limiting factor in how much you can speed up the execution of the program using multiple processors.[2]

[2] While Amdahl's Law is the most widely known and most useful metric for describing parallel performance, there are others. These include Gustafson-Barsis's Law, Sun and Ni's Law, the Karp-Flatt metric, and the isoefficiency metric.

An example should help clarify Amdahl's Law. Let's assume you have a computation that takes 10 hours to complete on a currently available computer and that 90 percent of your code can be parallelized. In other words, you are spending one hour doing instructions that must be done serially and nine hours doing instructions that can be done in parallel. Amdahl's Law states that you'll never be able to run this code on this class of computers in less than one hour, regardless of how many additional computers you have available. To see this, imagine that you had so many computers that you could execute all the parallel code instantaneously. You would still have the serial code to execute, which has to be done on a single computer, and it would still take an hour.[3]

[3] For those of you who love algebra, the speedup factor is equal to 1/(s + p/N), where s is the fraction of the code that is inherently serial, p is the fraction of the code that can be parallelized, and N is the number of processors available. Clearly, p + s = 1. As the number of processors becomes very large, p/N becomes very small, and the speedup becomes essentially 1/s. So if s is 0.1, the largest speedup you can expect is a factor of 10, no matter how many processors you have available.
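The formula in the footnote is easy to turn into a short program. This sketch plugs in the numbers from the example above (a 10-hour job with a serial fraction of 0.1) and prints the predicted speedup and run time for several cluster sizes; the processor counts are arbitrary.

    #include <stdio.h>

    int main(void)
    {
        const double total = 10.0;    /* hours on a single machine (example above) */
        const double s = 0.1;         /* fraction of the code that is serial       */
        const double p = 1.0 - s;     /* fraction that can be parallelized         */
        int sizes[] = { 1, 2, 4, 8, 16, 64, 256, 1024 };

        for (int i = 0; i < 8; i++) {
            int n = sizes[i];
            double speedup = 1.0 / (s + p / n);    /* Amdahl's Law */
            printf("%5d processors: speedup %6.2f, time %5.2f hours\n",
                   n, speedup, total / speedup);
        }
        return 0;
    }

No matter how large the number of processors becomes, the predicted time never drops below the one hour of serial work.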

In practice, you won't have an unlimited number of processors, so your total time will always be longer. Figure 1-6 shows the amount of time needed for this example, depending on the number of processors you have available.

Figure 1-6. Execution time vs. number of processors

 


You should also remember that Amdahl's law is an ideal. In practice, there is the issue of the overhead introduced by parallelizing the code. For example, coordinating communications among the various processes will require additional code. This adds to the overall execution time. And if there is contention for the network, this can stall processes, further slowing the calculation. In other words, Amdahl's Law is the best speedup you can hope for, but not the actual speedup you'll see.

What can you do if you need to do this calculation in less than one hour? As I noted earlier, you have three choices when you want to speed up a calculation: better algorithms, faster computers, or more computers. If more computers won't take you all the way, your remaining choices are better algorithms and faster computers. If you can rework your code so that a larger fraction can be done in parallel, you'll see an increased benefit from a parallel approach. Otherwise, you'll need to dig deep into your pocket for faster computers.

Surprisingly, a fair amount of controversy still surrounds what should be obvious once you think about it. This stems in large part from the misapplication of Amdahl's Law over the years. For example, Amdahl's Law has been misused as an argument favoring faster computers over parallel computing.

The most common misuse is based on the assumption that the amount of speedup is independent of the size of the problem. Amdahl's Law simply does not address how problems scale. The fraction of the code that must be executed serially usually changes as the size of the problem changes, so it is a mistake to assume that a problem's speedup factor will be the same when the scale of the problem changes. For instance, if you double the length of a simulation, you may find that the serial portions of the simulation, such as the initialization and report phases, are basically unchanged, while the parallelizable portion of the code is what doubles. Hence, the fraction of the time spent in the serial code will decrease and Amdahl's Law will predict a greater speedup. This is good news! After all, it's when problems get bigger that we most need the speedup. The exact amount will depend on the nature of the individual problem, but typically the speedup will increase as the size of the problem increases. As a problem grows, it is not unusual to see a linear increase in the amount of time spent in the serial portion of the code and a quadratic increase in the amount of time spent in the parallelizable portion. Unfortunately, if you apply Amdahl's Law only to the smaller problem size, you'll underestimate the benefit of a parallel approach.
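To see why the serial fraction can shrink, suppose, purely as an illustration, that the serial setup and report phases take time proportional to the problem size n while the parallelizable core takes time proportional to n squared. A few lines of C show the serial fraction, and therefore the Amdahl limit, changing as the problem grows:

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical growth rates: serial work grows linearly with problem
           size, parallelizable work grows quadratically. */
        int sizes[] = { 100, 200, 400, 800 };

        for (int i = 0; i < 4; i++) {
            double n = sizes[i];
            double serial = n;          /* e.g., initialization and reporting */
            double parallel = n * n;    /* e.g., the main computation         */
            double s = serial / (serial + parallel);

            printf("size %4.0f: serial fraction %.4f, maximum speedup %8.1f\n",
                   n, s, 1.0 / s);
        }
        return 0;
    }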

Having said this, it is important to remember that Amdahl's Law does clearly state a limitation of parallel computing. But this limitation varies not only from problem to problem, but with the size of the problem as well.

One last word about the limitations of clusters: the limitations are often tied to a particular approach. It is often possible to mix approaches and avoid limitations. For example, in constructing your clusters, you'll want to use the best computers you can afford. This will lessen the impact of inherently serial code. And don't forget to look at your algorithms!

1.5 My Biases

The material covered in this book reflects three of my biases, of which you should be aware. I have tried to write a book to help people get started with clusters. As such, I have focused primarily on mainstream, high-performance computing, using open source software. Let me explain why.

First, there are many approaches and applications for clusters. I do not believe that it is feasible for any book to address them all, even if a less-than-exhaustive approach is used. In selecting material for this book, I have tried to use the approaches and software that are the most useful for the largest number of people. I feel that it is better to cover a limited number of approaches than to try to say too much and risk losing focus. However, I have tried to justify my decisions and point out options along the way so that if your needs don't match my assumptions, you'll at least have an idea where to start looking.

Second, in keeping with my goal of addressing mainstream applications of clusters, the book primarily focuses on high-performance computing. This is the application from which clusters grew and remains one of their dominant uses. Since high availability and load balancing tend to be used with mission-critical applications, they are beyond the scope of a book focusing on getting started with clusters. You really should have some basic experience with generic clusters before moving on to such mission-critical applications. And, of course, improved performance lies at the core of all the other uses for clusters.

Finally, I have focused on open source software. There are a number of proprietary solutions available, some of which are excellent. But given the choice between comparable open source software and proprietary software, my preference is for open source. For clustering, I believe that high-quality, robust open source software is readily available and that there is little justification for considering proprietary software for most applications.

While I'll cover the basics of clusters here, you would do well to study the specifics of clusters that closely match your applications as well. There are a number of well-known clusters that have been described in detail. A prime example is Google, with literally tens of thousands of computers. Others include clusters at Fermilab, Argonne National Laboratory (Chiba City cluster), and Oak Ridge National Laboratory. Studying the architecture of clusters similar to what you want to build should provide additional insight. Hopefully, this book will leave you well prepared to do just that.

One last comment: if you keep reading, I promise not to mention horses again.

Chapter 2. Cluster Planning

This chapter is an overview of cluster planning. It begins by introducing four key steps in developing a design for a cluster. Next, it presents several questions you can ask to help you determine what you want and need in a cluster. Finally, it briefly describes some of the software decisions you'll make and how these decisions impact the overall architecture of the cluster. In addition to helping people new to clustering plan the critical foundations of their cluster, the chapter serves as an overview of the software described in the book and its uses.

2.1 Design Steps

Designing a cluster entails four sets of design decisions. You should:

  1. Determine the overall mission for your cluster.

  2. Select a general architecture for your cluster.

  3. Select the operating system, cluster software, and other system software you will use.

  4. Select the hardware for the cluster.

While each of these tasks, in part, depends on the others, the first step is crucial. If at all possible, the cluster's mission should drive all other design decisions. At the very least, the other design decisions must be made in the context of the cluster's mission and be consistent with it.

Selecting the hardware should be the final step in the design, but often you won't have as much choice as you would like. A number of constraints may drive you to select the hardware early in the design process. The most obvious is the need to use recycled hardware or similar budget constraints. Chapter 3 describes hardware considerations in greater detail.

2.2 Determining Your Cluster's Mission

Defining what you want to do with the cluster is really the first step in designing it. For many clusters, the mission will be clearly understood in advance. This is particularly true if the cluster has a single use or a few clearly defined uses. However, if your cluster will be an open resource, then you'll need to anticipate potential uses. In that case, the place to start is with your users.

While you may think you have a clear idea of what your users will need, there may be little resemblance between what you think they should need and what they think they need. And while your assessment may be the correct one, your users are still apt to be disappointed if the cluster doesn't live up to their expectations. Talk to your users.

You should also keep in mind that clusters have a way of evolving. What may be a reasonable assessment of needs today may not be tomorrow. Good design is often the art of balancing today's resources with tomorrow's needs. If you are unsure about your cluster's mission, answering the following questions should help.

2.2.1 What Is Your User Base?

In designing a cluster, you must take into consideration the needs of all users. Ideally this will include both the potential users as well as the obvious early adopters. You will need to anticipate any potential conflicting needs and find appropriate compromises.

The best way to avoid nasty surprises is to include representative users in the design process. If you have only a few users, you can easily poll them to see what they need.

If you have a large user base, particularly one that is in flux, you will need to anticipate all reasonable, likely needs. Generally, this will mean supporting a wider range of software. For example, if you are the sole user and you only use one programming language and parallel programming library, there is no point in installing others. If you have dozens of users, you'll probably need to install multiple programming languages and parallel programming libraries.

2.2.2 How Heavily Will the Cluster Be Used?

Will the cluster be in constant use, with users fighting over it, or will it be used on an occasional basis as large problems arise? Will some of your jobs have higher priorities than others? Will you have a mix of jobs, some requiring the full capabilities of the cluster while others will need only a subset?

If you have a large user base with lots of potential conflicts, you will need some form of scheduling software. If your cluster will be lightly used or have very few users who are willing to work around each other, you may be able to postpone installing scheduling software.

2.2.3 What Kinds of Software Will You Run on the Cluster?

There are several levels at which this question can be asked. At a cluster management level, you'll need to decide which systems software you want, e.g., BSD, Linux, or Windows, and you'll need to decide what clustering software you'll need. Both of these choices will be addressed later in this chapter.

From a user perspective, you'll need to determine what application-level software to use. Will your users be using canned applications? If so, what are these applications and what are their requirements? Will your users be developing software? If so, what tools will they need? What is the nature of the software they will write and what demands will this make on your cluster? For example, if your users will be developing massive databases, will you have adequate storage? Will the I/O subsystem be adequate? If your users will carry out massive calculations, do you have adequate computational resources?

2.2.4 How Much Control Do You Need?

Closely related to the types of code you will be running is the question of how much control you will need over the code. There are a range of possible answers. If you need tight control over resources, you will probably have to write your own applications. User-developed code can make explicit use of the available resources.

For some uses, explicit control isn't necessary. If you have calculations that split nicely into separate processes and you'd just like them to run faster, software that provides transparent control may be the best solution. For example, suppose you have a script that invokes a file compression utility on a large number of files. It would be convenient if you could divide these file compression tasks among a number of processes, but you don't care about the details of how this is done.
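As a minimal sketch of what dividing such a job among processes might look like (the file names are placeholders, and gzip is just an example of a compression utility), the parent below forks one child per file and each child compresses its file. On a single machine the children share the local CPUs; under openMosix, described next, processes like these could migrate to other nodes without any change to the code.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Placeholder file names; in practice these might come from argv. */
        const char *files[] = { "data1.log", "data2.log", "data3.log" };
        int nfiles = 3;

        for (int i = 0; i < nfiles; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                /* Child: compress one file, then exit. */
                execlp("gzip", "gzip", files[i], (char *)NULL);
                perror("execlp");    /* reached only if exec fails */
                _exit(1);
            }
        }

        /* Parent: wait for all the compression jobs to finish. */
        while (wait(NULL) > 0)
            ;
        return 0;
    }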

openMosix, a set of extensions to the Linux kernel, provides this type of transparent support. Processes automatically migrate among cluster computers. The advantage is that you don't need to rewrite user code. However, the transparent control provided by openMosix will not work if the application uses shared memory or runs as a single process.

2.2.5 Will This Be a Dedicated or Shared Cluster?

Will the machines that comprise the cluster be dedicated to the cluster, or will they be used for other tasks? For example, a number of clusters have been built from office machines. During the day, the administrative staff uses the machines. In the evening and over the weekend, they are elements of a cluster. University computing laboratories have been used in the same way.

Obviously, if you have a dedicated cluster, you are free to configure the nodes as you see fit. With a shared cluster, you'll be limited by the requirements of the computers' day jobs. If this is the case, you may want to consider whether a dual-boot approach is feasible.

2.2.6 What Resources Do You Have?

Will you be buying equipment or using existing equipment? Will you be using recycled equipment? Recycled equipment can certainly reduce your costs, but it will severely constrain what you can do. At the very least, you'll need a small budget to adapt and maintain the equipment you have. You may need to purchase networking equipment such as a switch and cables, or you may need to replace failing parts such as disk drives and network cards. (See Chapter 3 for more information about hardware.)

2.2.7 How Will Cluster Access Be Managed?

Will you need local or remote access or both? Will you need to provide Internet access, or can you limit it to the local or campus network? Can you isolate the cluster? If you must provide remote access, what will be the nature of that access? For example, will you need to install software to provide a graphical interface for remote users? If you can isolate your network, security becomes less of an issue. If you must provide remote access, you'll need to consider tools like SSH and VNC. Or is serial port access by a terminal server sufficient?

2.2.8 What Is the Extent of Your Cluster?

The term cluster usually applies to computers that are all on the same subnet. If you will be using computers on different networks, you are building a grid. With a grid you'll face greater communications overhead and more security issues. Maintaining the grid will also be more involved and should be addressed early in the design process. This book doesn't cover the special considerations needed for grids.

2.2.9 What Security Concerns Do You Have?

Can you trust your users? If the answer is yes, this greatly simplifies cluster design. You can focus on controlling access to the cluster. If you can't trust your users, you'll need to harden each machine and develop secure communications. A closely related question is whether you can control physical access to your computers. Again, controlling physical access will simplify securing your cluster since you can focus on access points, e.g., the head node rather than the cluster as a whole. Finally, do you deal with sensitive data? Often the value of the data you work with determines the security measures you must take.

2.3 Architecture and Cluster Software

Once you have established the mission for your cluster, you can focus on its architecture and select the software. Most high-performance clusters use an architecture similar to that shown in Figure 1-5. The software described in this book is generally compatible with that basic architecture. If this does not match the mission of your cluster, you still may be able to use many of the packages described in this book, but you may need to make a few adaptations.

Putting together a cluster involves the selection of a variety of software. The possibilities are described briefly here. Each is discussed in greater detail in subsequent chapters in this book.

2.3.1 System Software

One of the first selections you will probably want to make is the operating system, but this is actually the final software decision you should make. When selecting an operating system, the fundamental question is compatibility. If you have a compelling reason to use a particular piece of software and it will run only under a single operating system, the choice has been made for you. For example, openMosix uses extensions to the Linux kernel, so if you want openMosix, you must use Linux. Provided the basic issue of compatibility has been met, the primary reasons to select a particular operating system are familiarity and support. Stick with what you know and what's supported.

All the software described in this book is compatible with Linux. Most, but not all, of the software will also work nicely with other Unix systems. In this book, we'll be assuming the use of Linux. If you'd rather use BSD or Solaris, you'll probably be OK with most of the software, but be sure to check its compatibility before you make a commitment. Some of the software, such as MPICH, even works with Windows.

There is a natural human tendency to want to go with the latest available version of an operating system, and there are some obvious advantages to using the latest release. However, compatibility should drive this decision as well. Don't expect clustering software to be immediately compatible with the latest operating system release. Compatibility may require that you use an older release. (For more on Linux, see Chapter 4.)

In addition to the operating system itself, you may need additional utilities or extensions to the basic services provided by the operating system. For example, to create a cluster you'll need to install the operating system and software on a large number of machines. While you could do this manually with a small cluster, it's an error-prone and tedious task. Fortunately, you can automate the process with cloning software. Cloning is described in detail in Chapter 8.

High-performance systems frequently require extensive I/O. To optimize performance, parallel file systems may be used. Chapter 12 looks at the Parallel Virtual File System (PVFS), an open source high-performance file system.

2.3.2 Programming Software

There are two basic decisions you'll need to make with respect to programming software: the programming languages you want to support and the libraries you want to use. If you have a small user base, you may be able to standardize on a single language and a single library. If you can pull this off, go for it; life will be much simpler. However, if you need to support a number of different users and applications, you may be forced to support a wider variety of programming software.

The parallel programming libraries provide a mechanism that allows you to easily coordinate computing and exchange data among programs running on the cluster. Without this software, you'll be forced to rely on operating system primitives to program your cluster. While it is certainly possible to use sockets to build parallel programs, it is a lot more work and more error prone. The most common libraries are the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) libraries.
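As a small preview of what such a library provides, the classic MPI "hello world" below lets each copy of a program discover its rank within the job and the total number of processes; the library takes care of starting the copies on the cluster nodes and passing messages between them. Building and running MPI programs is covered in later chapters.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* start the MPI environment    */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes in total? */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                          /* shut down cleanly            */
        return 0;
    }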

The choice of programming languages depends on the parallel libraries you want to use. Typically, the libraries provide bindings for only a small number of programming languages. There is no point in installing Ada if you can't link it to the parallel library you want to use. Traditionally, parallel programming libraries support C and FORTRAN, and C++ bindings are growing in popularity. Libraries and languages are discussed in greater detail in Chapter 9.

2.3.3 Control and Management

In addition to the programming software, you'll need to keep your cluster running. This includes scheduling and management software.

Cluster management includes both routine system administration tasks and monitoring the health of your cluster. With a cluster, even a simple task can become cumbersome if it has to be replicated over a large number of systems. Just checking which systems are available can be a considerable time sink if done on a regular basis. Fortunately, there are several packages that can be used to simplify these tasks. Cluster Command and Control (C3) provides a command-line interface that extends across a cluster, allowing easy replication of tasks on each machine in a cluster or on a subset of the cluster. Ganglia provides web-based monitoring in a single interface. Both C3 and Ganglia can be used with federated clusters as well as simple clusters. C3 and Ganglia are described in Chapter 10.

Scheduling software determines when your users' jobs will be executed. Typically, scheduling software can allocate resources, establish priorities, and do basic accounting. For Linux clusters there are two likely choices: Condor and Portable Batch System (PBS). If you need an advanced scheduler, you might also consider Maui. PBS is available as a commercial product, PBSPro, and as open source software, OpenPBS. OpenPBS is described in Chapter 11.

2.4 Cluster Kits

If installing all of this software sounds daunting, don't panic. There are a couple of options you can consider. For permanent clusters there are, for lack of a better name, cluster kits, software packages that automate the installation process. A cluster kit provides all the software you are likely to need in a single distribution.

Cluster kits tend to be very complete. For example, the OSCAR distribution contains both PVM and two versions of MPI. If some software isn't included, you can probably get by without it. Another option, described in the next section, is a CD-ROM-based cluster.

Cluster kits are designed to be turnkey solutions. Short of purchasing a prebuilt, preinstalled proprietary cluster, a cluster kit is the simplest approach to setting up a full cluster. Configuration parameters are largely preset by people who are familiar with the software and how the different pieces may interact. Once you have installed the kit, you have a functioning cluster. You can focus on using the software rather than installing it. Support groups and mailing lists are generally available.

Some kits have a Linux distribution included in the package (e.g., Rocks), while others are installed on top of an existing Linux installation (e.g., OSCAR). Even if Linux must be installed first, most of the configuration and the installation of needed packages will be done for you.

There are two problems with using cluster kits. First, cluster kits do so much for you that you can lose touch with your cluster, particularly if everything is new to you. Initially, you may not understand how the cluster is configured, what customizations have been made or are possible, or even what has been installed. Even making minor changes after installing a kit can create problems if you don't understand what you have. Ironically, the more these kits do for you, the worse this problem may be. With a kit, you may get software you don't want to deal with, software your users may expect you to maintain and support. And when something goes wrong, as it will, you may be at a loss about how to deal with it.

A second problem is that, in making everything work together, kit builders occasionally have to do things a little differently. So when you look at the original documentation for the individual components in a kit, you may find that the software hasn't been installed as described. When you learn more about the software, you'll come to understand and appreciate why the changes were made. But in the short term, these changes can add to the confusion.

So while a cluster kit can get you up and running quickly, you will still need to learn the details of the individual software. You should follow up the installation with a thorough study of how the individual pieces in the kit work. For most beginners, the single advantage of being able to get a cluster up and running quickly probably outweighs all of the disadvantages.

While other cluster kits are available, the three most common kits for Linux clusters are NPACI Rocks, OSCAR, and Scyld Beowulf.[1] While Scyld Beowulf is a commercial product available from Penguin Computing, an earlier, unsupported version is available for a very nominal cost from http://www.linuxcentral.com/. Donald Becker, one of the original Beowulf developers, founded Scyld Computing, which was subsequently acquired by Penguin Computing. Scyld is built on top of Red Hat Linux and includes an enhanced kernel, tools, and utilities. While Scyld Beowulf is a solid system, you face the choice of using an expensive commercial product or a somewhat dated, unsupported product. Furthermore, variants of both Rocks and OSCAR are available. For example, BioBrew (http://bioinformatics.org/biobrew/) is a Rocks-based system that contains a number of packages for analyzing bioinformatics information. For these reasons, either Rocks or OSCAR is arguably a better choice than Scyld Beowulf.

[1] For grid computing, which is outside the scope of this book, the Globus Toolkit is a likely choice.

NPACI (National Partnership for Advanced Computational Infrastructure) Rocks is a collection of open source software for creating a cluster built on top of Red Hat Linux. Rocks takes a cookie-cutter approach. To install Rocks, begin by downloading a set of ISO images from http://rocks.npaci.edu/Rocks/ and using them to create installation CD-ROMs. Next, boot to the first CD-ROM and answer a few questions as the cluster is built. Both Linux and the clustering software are installed. (This is a mixed blessing: it simplifies the installation, but you won't have any control over how Linux is installed.) The installation should go very quickly. In fact, part of Rocks' management strategy is that, if you have problems with a node, the best solution is to reinstall the node rather than try to diagnose and fix the problem. Depending on hardware, it may be possible to reinstall a node in under 10 minutes. When a Rocks installation goes as expected, you can be up and running in a very short amount of time. However, because the installation of the cluster software is tied to the installation of the operating system, a failed installation can leave you staring at a dead system with little idea of what to do. Fortunately, this rarely happens.

OSCAR, from the Open Cluster Group, uses a different installation strategy. With OSCAR, you first install Linux (but only on the head node) and then install OSCAR; the installations of the two are separate. This makes the installation more involved, but it gives you more control over the configuration of your system, and it is somewhat easier (that's easier, not easy) to recover when you encounter installation problems. And because the OSCAR installation is separate from the Linux installation, you are not tied to a single Linux distribution.

Rocks uses a variant of Red Hat's Anaconda and Kickstart programs to install the compute nodes. Thus, Rocks is able to probe the system to see what hardware is present. To be included in Rocks, software must be available as an RPM and configuration must be entirely automatic. As a result, with Rocks it is very straightforward to set up a cluster using heterogeneous hardware. OSCAR, in contrast, uses a system image cloning strategy to distribute the disk image to the compute nodes. With OSCAR it is best to use the same hardware throughout your cluster. Rocks requires systems with hard disks. Although not discussed in this book, OSCAR's thin client model is designed for diskless systems.

Both Rocks and OSCAR include a variety of software and build complete clusters. In fact, most of the core software is the same for both OSCAR and Rocks. However, there are a few packages that are available for one but not the other. For example, Condor is readily available for Rocks while LAM/MPI is included in OSCAR.

Clearly, Rocks and OSCAR take different approaches to building clusters. Cluster kits are difficult to build, and each approach involves tradeoffs: OSCAR adapts well across Linux distributions, while Rocks scales well with heterogeneous hardware. No one approach is better in every situation.

Rocks and OSCAR are at the core of this book. The installation, configuration, and use of OSCAR are described in detail in Chapter 6. The installation, configuration, and use of Rocks are described in Chapter 7. Rocks and OSCAR heavily influenced the selection of the individual tools described in this book. Most of the software described in this book is included in Rocks and OSCAR or is compatible with them. However, to keep the discussions of different software clean, the book includes separate chapters for the various software packages included in Rocks and OSCAR.

This book also describes many of the customizations made by these kits. At the end of many of the chapters, there is a brief section for Rocks and OSCAR users summarizing the differences between the default, standalone installation of the software and how these kits install it. Hopefully, therefore, this book addresses both of the potential difficulties you might encounter with a cluster kit: learning the details of the individual software and discovering the differences that cluster kits introduce.

Putting aside other constraints such as the need for diskless systems or heterogeneous hardware, if all goes well, a novice can probably build a Rocks cluster a little faster than an OSCAR cluster. But if you want greater control over how your cluster is configured, you may be happier with OSCAR in the long run. Typically, OSCAR provides better documentation, although Rocks documentation has been improving. You shouldn't go far wrong with either.

2.5 CD-ROM-Based Clusters

If you just want to learn about clusters, only need a cluster occasionally, or can't permanently install a cluster, you might consider one of the CD-ROM-based clusters. With these, you create a set of bootable CD-ROMs, sometimes called "live filesystem" CDs. When you need the cluster, you reboot your available systems using the CD-ROMs, do a few configuration tasks, and start using your cluster. The cluster software is all available from the CD-ROM and the computers' hard disks are unchanged. When you are done, you simply remove the CD-ROM and reboot the system to return to the operating system installed on the hard disk. Your cluster persists until you reboot.

Clearly, this is not an approach to use for a high-availability or mission-critical cluster, but it is a way to get started and learn about clusters. It is a viable way to create a cluster for short-term use. For example, if a computer lab is otherwise idle over the weekend, you could do some serious calculations using this approach.

There are some significant difficulties with this approach, most notably problems with storage. It is possible to work around this problem by using a hybrid approach: setting up a dedicated system for storage and using the CD-ROM-based systems as compute-only nodes.

Several CD-ROM-based systems are available. You might look at ClusterKnoppix, http://bofh.be/clusterknoppix/, or Bootable Cluster CD (BCCD), http://bccd.cs.uni.edu/. The next subsection, a very brief description of BCCD, should give you the basic idea of how these systems work.

2.5.1 BCCD

BCCD was developed by Paul Gray as an educational tool. If you want to play around with a small cluster, BCCD is a very straightforward way to get started. On an occasional basis, it is a viable alternative. What follows is a general overview of running BCCD for the first time.

The first step is to visit the BCCD download site, download an ISO image for a CD-ROM, and use it to burn a CD-ROM for each system. (Creating CD-ROMs from ISO images is briefly discussed in Chapter 4.) Next, boot each machine in your cluster from the CD-ROM. You'll need to answer a few questions as the system boots. First, you'll enter a password for the default user, bccd. Next, you'll answer some questions about your network. The system should autodetect your network card. Then it will prompt you for the appropriate driver. If you know the driver, select it from the list BCCD displays. Otherwise, select "auto" from the menu to have the system load drivers until a match is found. If you have a DHCP and DNS server available on your network, this will go much faster. Otherwise, you'll need to enter the usual network configuration information: IP address, netmask, gateway, etc.

Once the system boots, log in to complete the configuration process. When prompted, start the BCCD heartbeat process. Next, run the utilities bccd-allowall and bccd-snarfhosts. The first of these collects the SSH host keys from the nodes, and the second creates the machines file used by MPI. You are now ready to use the system.
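In other words, once each node has booted, the remaining setup amounts to a couple of commands run from the bccd account. (A rough sketch; the location of the machines file is an assumption and may differ in your BCCD release.)

    $ bccd-allowall          # gather SSH host keys from the other nodes
    $ bccd-snarfhosts        # build the machines file that MPI will use
    $ cat ~/machines         # optional: verify which nodes were found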

Admittedly, this is a pretty brief description, but it should give you some idea as to what's involved in using BCCD. The boot process is described in greater detail at the project's web site. To perform this on a regular basis with a number of machines would be an annoying process. But for a few machines on an occasional basis, it is very straightforward.

2.6 Benchmarks

Once you have your cluster running, you'll probably want to run a benchmark or two just to see how well it performs. Unfortunately, benchmarking is, at best, a dark art. In practice, sheep entrails may give better results.

Often the motivation for benchmarks is hubris: the desire to prove your system is the best. This can be crucial if funding is involved, but otherwise is probably a meaningless activity and a waste of time. You'll have to judge for yourself.

Keep in mind that a benchmark supplies a single set of numbers that is very difficult to interpret in isolation. Benchmarks are mostly useful when making comparisons between two or more closely related configurations on your own cluster.

There are at least three reasons you might run benchmarks. First, a benchmark will provide you with a baseline. If you make changes to your cluster or if you suspect problems with your cluster, you can rerun the benchmark to see if performance is really any different. Second, benchmarks are useful when comparing systems or cluster configurations. They can provide a reasonable basis for selecting between alternatives. Finally, benchmarks can be helpful with planning. If you can run several with differently sized clusters, etc., you should be able to make better estimates of the impact of scaling your cluster.

Benchmarks are not infallible. Consider the following rather simplistic example: Suppose you are comparing two clusters with the goal of estimating how well a particular cluster design scales. Cluster B is twice the size of cluster A. Your goal is to project the overall performance for a new cluster C, which is twice the size of B. If you rely on a simple linear extrapolation based on the overall performance of A and B, you could be grossly misled. For instance, if cluster A has a 30% network utilization and cluster B has a 60% network utilization, the network shouldn't have a telling impact on overall performance for either cluster. But if the trend continues, you'll have a difficult time meeting cluster C's need for 120% network utilization.

There are several things to keep in mind when selecting benchmarks. A variety of different things affect the overall performance of a cluster, including the configuration of the individual systems and the network, the job mix on the cluster, and the instruction mix in the cluster applications. Benchmarks attempt to characterize performance by measuring, in some sense, the performance of CPU, memory, or communications. Thus, there is no exact correspondence between what may affect a cluster's performance and what a benchmark actually measures.

Furthermore, since several factors are involved, different benchmarks may weight different factors. Thus, it is generally meaningless to compare the results of one benchmark on one system with a different set of benchmarks on a different system, even when the benchmarks reputedly measure the same thing.

When you select a benchmark, first decide why you need it and how it will be used. For many purposes, the best benchmark is the actual applications that you will run on your cluster. It doesn't matter how well your cluster does with memory benchmarks if your applications are constantly thrashing. The primary difficulty in using actual applications is running them in a consistent manner so that you have repeatable results. This can be a real bear! Even small changes in data can produce significant changes in performance. If you do decide to use your applications, be consistent.

If you don't want to use your applications, there are a number of cluster benchmarks available. Here are a few that you might consider:

Hierarchical Integration (HINT)

The HINT benchmark, developed at the U.S. Department of Energy's Ames Laboratory, is used to test subsystem performance. It can be used to compare both processor performance and memory subsystem performance. It is now supported by Brigham Young University. (http://hint.byu.edu)

High Performance Linpack

Linpack was written by Jack Dongarra and is probably the best known and most widely used benchmark in high-performance computing. The HPL version of Linpack is used to rank computers on the TOP500 Supercomputer Site. HPL differs from its predecessor in that the user can specify the problem size. (http://www.netlib.org/benchmark/hpl/)

Iozone

Iozone is an I/O and filesystem benchmark tool. It generates and performs a variety of file operations and can be used to assess filesystem performance. (http://www.iozone.org)

Iperf

Iperf was developed to measure network performance. It measures TCP and UDP bandwidth performance, reporting delay jitter and datagram loss as well as bandwidth. An example of running Iperf follows this list. (http://dast.nlanr.net/Projects/Iperf/)

NAS Parallel Benchmarks

The Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) are application-centric benchmarks that have been widely used to compare the performance of parallel computers. NPB is actually a suite of eight programs. (http://science.nas.nasa.gov/Software/NPB/)

There are many other benchmarks available. The Netlib Repository is a good place to start if you need additional benchmarks, http://www.netlib.org.
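To show how little effort some of these tools require, here is a minimal Iperf session measuring TCP bandwidth between two nodes. (The flags shown are standard Iperf options; the hostnames are placeholders.)

    node1$ iperf -s                    # start Iperf in server mode on one node
    node2$ iperf -c node1 -t 30        # run a 30-second TCP test from a second node

Iperf reports the bandwidth it achieved when the test completes; add -u on both ends for a UDP test, which also reports jitter and datagram loss.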

3.1 Design Decisions

While you may have some idea of what you want, it is still worthwhile to review the implications of your choices. There are several closely related, overlapping key issues to consider when acquiring PCs for the nodes in your cluster:

  • Will you have identical systems or a mixture of hardware?

  • Will you scrounge for existing computers, buy assembled computers, or buy the parts and assemble your own computers?

  • Will you have full systems with monitors, keyboards, and mice, minimal systems, or something in between?

  • Will you have dedicated computers, or will you share your computers with other users?

  • Do you have a broad or narrow user base?

This is the most important thing I'll say in this chapter: if at all possible, use identical systems for your nodes. Life will be much simpler. You'll need to develop and test only one configuration and then you can clone the remaining machines. When programming your cluster, you won't have to consider different hardware capabilities as you attempt to balance the workload among machines. Also, maintenance and repair will be easier since you will have less to become familiar with and will need to keep fewer parts on hand. You can certainly use heterogeneous hardware, but it will be more work.

In constructing a cluster, you can scrounge for existing computers, buy assembled computers, or buy the parts and assemble your own. Scrounging is the cheapest way to go, but this approach is often the most time consuming. Usually, using scrounged systems means you'll end up with a wide variety of hardware, which creates both hardware and software problems. With older scrounged systems, you are also more likely to have even more hardware problems. If this is your only option, try to standardize hardware as much as possible. When acquiring computers, look for organizations doing bulk upgrades. If you can find someone replacing a number of computers at one time, there is a good chance the computers being replaced will have been a similar bulk purchase and will be very similar or identical. These could come from a computer laboratory at a college or university or from an IT department doing a periodic upgrade.

Buying new, preassembled computers may be the simplest approach if money isn't the primary concern. This is often the best approach for mission-critical applications or when time is a critical factor. Buying new is also the safest way to go if you are uncomfortable assembling computers. Most system integrators will allow considerable latitude over what to include with your systems, particularly if you are buying in bulk. If you are using a system integrator, try to have the integrator provide a list of MAC addresses and label each machine.

Building your own system is cheaper, provides higher performance and reliability, and allows for customization. Assembling your own computers may seem daunting, but it isn't that difficult. You'll need time, personnel, space, and a few tools. It's a good idea to build a single system and test it for hardware and software compatibility before you commit to a large bulk order. Even if you do buy preassembled computers, you will still need to do some testing and maintenance. Unfortunately, even new computers are occasionally DOA.[1] So the extra time may be less than you'd think. And by building your own, you'll probably be able to afford more computers.

[1] Dead on arrival: nonfunctional when first installed.

If you are constructing a dedicated cluster, you will not need full systems. The more you can leave out of each computer, the more computers you will be able to afford, and the less you will need to maintain on individual computers. For example, with dedicated clusters you can probably do without monitors, keyboards, and mice for each individual compute node. Minimal machines have the smallest footprint, allowing larger clusters when space is limited and have smaller power and air conditioning requirements. With a minimal configuration, wiring is usually significantly easier, particularly if you use rack-mounted equipment. (However, heat dissipation can be a serious problem with rack-mounted systems.) Minimal machines also have the advantage of being less likely to be reallocated by middle management.

The size of your user base will also affect your cluster design. With a broad user base, you'll need to prepare for a wider range of potential uses梞ore applications software and more systems tools. This implies more secondary storage and, perhaps, more memory. There is also the increased likelihood that your users will need direct access to individual nodes.

Shared machines, i.e., computers that have other uses in addition to their role as a cluster node, may be a way of constructing a part-time cluster that would not be possible otherwise. If your cluster is shared, then you will need complete, fully functioning machines. While this book won't focus on such clusters, it is certainly possible to have a setup that is a computer lab on work days and a cluster on the weekend, or office machines by day and cluster nodes at night.

3.1.1 Node Hardware

Obviously, your computers need adequate hardware for all intended uses. If your cluster includes workstations that are also used for other purposes, you'll need to consider those other uses as well. This probably means acquiring a fairly standard workstation. For a dedicated cluster, you determine your needs, and there may be a lot you won't need: audio cards and speakers, video capture cards, etc. Beyond these obvious expendables, there are other parts you might want to consider omitting, such as disk drives, keyboards, mice, and displays. However, you should be aware of some of the potential problems you'll face with a truly minimalist approach. This subsection is a quick review of the design decisions you'll need to make.

3.1.1.1 CPUs and motherboards

While you can certainly purchase CPUs and motherboards from different sources, you need to select each with the other in mind. These two items are the heart of your system. For optimal performance, you'll need total compatibility between them. If you are buying your systems piece by piece, consider buying an Intel- or AMD-compatible motherboard with an installed CPU. However, you should be aware that some motherboards with permanently affixed CPUs are poor performers, so choose with care.

You should also buy your equipment from a known, trusted source with a reputable warranty. For example, in recent years a number of boards have been released with low-grade electrolytic capacitors. While these capacitors work fine initially, the board life is disappointingly brief. People who bought these boards from fly-by-night companies were out of luck.

In determining the performance of a node, the most important factors are processor clock rate, cache size, bus speed, memory capacity, disk access speed, and network latency. The first four are determined by your selection of CPU and motherboard. And if you are using integrated EIDE interfaces and network adapters, all six are at least influenced by your choice of CPU and motherboard.

Clock speed can be misleading. It is best used to compare processors within the same family since comparing processors from different families is an unreliable way to measure performance. For example, an AMD Athlon 64 may outperform an Intel Pentium 4 when running at the same clock rate. Processor speed is also very application dependent. If your data set fits within the large cache in a Prescott-core Pentium 4 but won't fit in the smaller cache in an Athlon, you may see much better performance with the Pentium.

Selecting a processor is a balancing act. Your choice will be constrained by cost, performance, and compatibility. Remember, the rationale behind a commodity off-the-shelf (COTS) cluster is buying machines that have the most favorable price to performance ratio, not pricey individual machines. Typically you'll get the best ratio by purchasing a CPU that is a generation behind the current cutting edge. This means comparing the numbers. When comparing CPUs, you should look at the increase in performance versus the increase in the total cost of a node. When the cost starts rising significantly faster than the performance, it's time to back off. When a 20 percent increase in performance raises your cost by 40 percent, you've gone too far.
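As a concrete (and entirely made-up) illustration of this kind of comparison, suppose node A costs $1,200 and node B, with a faster CPU, costs $1,700, while a benchmark relevant to your application rates them at 950 and 1,150 units, respectively. A quick calculation of cost per unit of performance tells the story:

    $ echo "scale=2; 1200/950" | bc      # node A: dollars per unit of performance
    1.26
    $ echo "scale=2; 1700/1150" | bc     # node B: dollars per unit of performance
    1.47

Here the faster node delivers roughly 20 percent more performance for about 40 percent more money, so the slower, cheaper node is probably the better building block for a cluster.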

Since Linux works with most major chip families, stay mainstream and you shouldn't have any software compatibility problems. Nonetheless, it is a good idea to test a system before committing to a bulk purchase. Since a primary rationale for building your own cluster is the economic advantage, you'll probably want to stay away from the less common chips. While clusters built with UltraSPARC systems may be wonderful performers, few people would describe these as commodity systems. So unless you just happen to have a number of these systems that you aren't otherwise using, you'll probably want to avoid them.[2]

[2] Radajewski and Eadline's Beowulf HOWTO refers to "Computer Shopper"-certified equipment. That is, if equipment isn't advertised in Computer Shopper, it isn't commodity equipment.

With standalone workstations, the overall benefit of multiple processors (i.e., SMP systems) is debatable since a second processor can remain idle much of the time. A much stronger argument can be made for the use of multiple processor systems in clusters where heavy utilization is assured. They add additional CPUs without requiring additional motherboards, disk drives, power supplies, cases, etc.

When comparing motherboards, look to see what is integrated into the board. There are some significant differences. Serial, parallel, and USB ports along with EIDE disk adapters are fairly standard. You may also find motherboards with integrated FireWire ports, a network interface, or even a video interface. While you may be able to save money with built-in network or display interfaces (provided they actually meet your needs), make sure they can be disabled should you want to install your own adapter in the future. If you are really certain that some fully integrated motherboard meets your needs, eliminating the need for daughter cards may allow you to go with a small case. On the other hand, expandability is a valuable hedge against the future. In particular, having free memory slots or adapter slots can be crucial at times.

Finally, make sure the BIOS Setup options are compatible with your intended configuration. If you are building a minimal system without a keyboard or display, make sure the BIOS will allow you to boot without them attached; not all BIOSs do.

3.1.1.2 Memory and disks

Subject to your budget, the more cache and RAM in your system, the better. Typically, the faster the processor, the more RAM you will need. A very crude rule of thumb is one byte of RAM for every floating-point operation per second. So a processor capable of 100 MFLOPs would need around 100 MB of RAM. But don't take this rule too literally.

Ultimately, what you will need depends on your applications. Paging creates a severe performance penalty and should be avoided whenever possible. If you are paging frequently, then you should consider adding more memory. It comes down to matching the memory size to the cluster application. While you may be able to get some idea of what you will need by profiling your application, if you are creating a new cluster for as yet unwritten applications, you will have little choice but to guess what you'll need as you build the cluster and then evaluate its performance after the fact. Having free memory slots can be essential under these circumstances.

Which disks to include, if any, is perhaps the most controversial decision you will make in designing your cluster. Opinions vary widely. The cases both for and against diskless systems have been grossly overstated. This decision is one of balancing various tradeoffs. Different contexts tip the balance in different directions. Keep in mind, diskless systems were once much more popular than they are now. They disappeared for a reason. Despite a lot of hype a few years ago about thin clients, the reemergence of these diskless systems was a spectacular flop. Clusters are, however, a notable exception. Diskless clusters are a widely used, viable approach that may be the best solution in some circumstances.

There are a number of obvious advantages to diskless systems. There is a lower cost per machine, which means you may be able to buy a bigger cluster with better performance. With rapidly declining disk prices, this is becoming less of an issue. A small footprint translates into lowered power and HVAC needs. And once the initial configuration has stabilized, software maintenance is simpler.

But the real advantage of diskless systems, at least with large clusters, is reduced maintenance. With diskless systems, you eliminate all moving parts aside from fans. For example, the average life (often known as mean time between failures, mean time before failure, or mean time to failure) of one manufacturer's disks is reported to be 300,000 hours or 34 years of continuous operation. If you have a cluster of 100 machines, you'll replace about three of these drives a year. This is a nuisance, but doable. If you have a cluster with 12,000 nodes, then you are looking at a failure, on average, every 25 hours, or roughly once a day.
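The arithmetic behind these estimates is worth being able to reproduce for your own cluster size and whatever MTBF figures your vendor quotes:

    $ echo "scale=1; 100 * 8760 / 300000" | bc     # 100 drives, 8,760 hours/year: expected failures per year
    2.9
    $ echo "300000 / 12000" | bc                   # hours, on average, between failures for 12,000 drives
    25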

There is also a downside to consider. Diskless systems are much harder for inexperienced administrators to configure, particularly with heterogeneous hardware. The network is often the weak link in a cluster. In diskless systems the network will see more traffic from the network file system, compounding the problem. Paging across a network can be devastating to performance, so it is critical that you have adequate local memory. But while local disks can reduce network traffic, they don't eliminate it. There will still be a need for network-accessible file systems.

Simply put, disk-based systems are more versatile and more forgiving. If you are building a dedicated cluster with new equipment and have experience with diskless systems, you should definitely consider diskless systems. If you are new to clusters, a disk-based cluster is a safer approach. (Since this book's focus is getting started with clusters, it does not describe setting up diskless clusters.)

If you are buying hard disks, there are three issues: interface type (EIDE vs. SCSI), disk latency (a function of rotational speed), and disk capacity. From a price-performance perspective, EIDE is probably a better choice than SCSI since virtually all motherboards include a built-in EIDE interface. And unless you are willing to pay a premium, you won't have much choice with respect to disk latency. Almost all current drives rotate at 7,200 RPM. While a few 10,000 RPM drives are available, their performance, unlike their price, is typically not all that much higher. With respect to disk capacity, you'll need enough space for the operating system, local paging, and the data sets you will be manipulating. Unless you have extremely large data sets, when recycling older computers a 10 GB disk should be adequate for most uses. Often smaller disks can be used. For new systems, you'll be hard pressed to find anything smaller than 20 GB, which should satisfy most uses. Of course, other non-cluster needs may dictate larger disks.

You'll probably want to include either a floppy drive or CD-ROM drive in each system. Since CD-ROM drives can be bought for under $15 and floppy drives for under $5, you won't save much by leaving these out. For disk-based systems, CD-ROMs or floppies can be used to initiate and customize network installs. For example, when installing the software on compute nodes, you'll typically use a boot floppy for OSCAR systems and a CD-ROM on Rocks systems. For diskless systems, CD-ROMs or floppies can be used to boot systems over the network without special BOOT ROMs on your network adapters. The only compelling reason to not include a CD-ROM or floppy is a lack of space in a truly minimal system.

When buying any disks, don't forget the cables.

3.1.1.3 Monitors, keyboards, and mice

Many minimal systems elect not to include monitors, keyboards, or mice but rely on the network to provide local connectivity as needed. While this approach is viable only with a dedicated cluster, its advantages include lower cost, less equipment to maintain, and a smaller equipment footprint. There are also several problems you may encounter with these headless systems. Depending on the system BIOS, you may not be able to boot a system without a display card or keyboard attached. When such systems boot, they probe for an attached keyboard and monitor and halt if none are found. Often, there will be a CMOS option that will allow you to override the test, but this isn't always the case.

Another problem comes when you need to configure or test equipment. A lack of monitor and keyboard can complicate such tasks, particularly if you have network problems. One possible solution is the use of a crash cart: a cart with keyboard, mouse, and display that can be wheeled to individual machines and connected temporarily. Provided the network is up and the system is booting properly, X Windows or VNC provide a software solution.

Yet another alternative, particularly for small clusters, is the use of a keyboard-video-mouse (KVM) switch. With these switches, you can attach a single keyboard, mouse, and monitor to a number of different machines. The switch allows you to determine which computer is currently connected. You'll be able to access only one of the machines at a time, but you can easily cycle among the machines at the touch of a button. It is not too difficult to jump between machines and perform several tasks at once. However, it is fairly easy to get confused about which system you are logged on to. If you use a KVM switch, it is a good idea to configure the individual systems so that each displays its name, either as part of the prompt for command-line systems or as part of the background image for GUI-based systems.
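For example, on systems using bash, putting the hostname in the prompt is a one-line change. (A minimal sketch; the user and host names shown are hypothetical, and you would normally set this in a system-wide profile script.)

    $ PS1='\u@\h:\w\$ '       # \h expands to the machine's hostname
    joe@node04:/etc$

With this prompt, every command line reminds you which node the KVM switch currently has you on.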

There are a number of different switches available. Avocent even sells a KVM switch that operates over IP and can be used with remote clusters. Some KVM switches can be very pricey, so be sure to shop around. Don't forget to include the cost of cables when pricing KVM switches. Frequently, these are not included with the switch and are usually overpriced. You'll need a set for every machine you want to leave connected, but not necessarily every machine.

The interaction between the system and the switch may provide a surprise or two. As previously noted, some systems don't allow booting without a keyboard, i.e., there is no CMOS override for booting without a keyboard. A KVM switch may be able to fool these systems. Such systems may detect a keyboard when connected to a KVM switch even when the switch is set to a different system. On the other hand, if you are installing Linux on a computer and it probes for a monitor, unless the switch is set to that system, the monitor won't be found.

Keep in mind, both the crash cart and the KVM switch approaches assume that individual machines have display adapters.


For this reason, you should seriously consider including a video card even when you are going with headless systems. Very inexpensive cards or integrated adapters can be used since you won't need anything fancy. Typically, embedded video will only add a few dollars to the price of a motherboard.

One other possibility is to use serial consoles. Basically, the idea is to replace the attached monitor and keyboard with a serial connection to a remote system. With a fair amount of work, most Linux systems can be reconfigured to work in this manner. If you are using rack-mount machines, many of them support serial console redirection out of the box. With this approach, the systems use a connection to a serial port to eliminate the need for a KVM switch. Additional hardware is available that will allow you to multiplex serial connections from a number of machines. If this approach is of interest, consult the Remote Serial Console HOWTO at http://www.tldp.org/HOWTO/Remote-Serial-Console-HOWTO/.
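To give a rough idea of what the HOWTO walks you through, redirecting a console to the first serial port usually involves a kernel boot parameter plus a getty on that port. (These are illustrative fragments only; the exact files, baud rate, and agetty options vary by distribution and version.)

    # appended to the kernel line in the boot loader configuration:
    console=ttyS0,9600n8

    # added to /etc/inittab so logins are offered on the serial port:
    S0:2345:respawn:/sbin/agetty -L 9600 ttyS0 vt100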

3.1.1.4 Adapters, power supplies, and cases

As just noted, you should include a video adapter. The network adapter is also a key component. You must buy an adapter that is compatible with the cluster network. If you are planning to boot a diskless system over the network, you'll need an adapter that supports it. This translates into an adapter with an appropriate network BOOT ROM, i.e., one with Preboot Execution Environment (PXE) support. Many adapters come with a built-in (but empty) BOOT ROM socket so that the ROM can be added. You can purchase BOOT ROMs for these cards or burn your own. However, it may be cheaper to buy a new card with an installed BOOT ROM than to add the BOOT ROMs. And unless you are already set up to burn ROMs, you'll need to be using several machines before it becomes cost effective to buy an EPROM burner.

To round things out, you'll need something to put everything in and a way to supply power, i.e., a case and power supply. With the case, you'll have to balance keeping the footprint small and having room to expand your system. If you buy too small a power supply, it won't meet your needs or allow you to expand your system. If you buy too large a power supply, you waste money and space. If you add up the power requirements for your individual components and add in another 50 percent as a fudge factor, you should be safe.

One last word about node selection: while we have considered components individually, you should also think about the system collectively before you make a final decision. If the individual systems collectively generate more heat than you can manage, you may need to reconsider how you configure individual machines. For example, Google is said to use less-powerful machines in its clusters in order to balance computation needs with total operational costs, a judgment that includes the impact of cooling needs.

3.1.2 Cluster Head and Servers

Thus far, we have been looking at the compute nodes within the cluster. Depending on your configuration, you will need a head node and possibly additional servers. Ideally, the head node and most servers should be complete systems, since this will add little to your overall cost and can simplify customizing and maintaining these systems. Typically, there is no need for these systems to use the same hardware that your compute nodes use. Go for enhancements that will improve performance but that you might not be able to afford on every node. These machines are the place for large, fast disks and lots of fast memory. A faster processor is also in order.

On smaller clusters, you can usually use one machine as both the head and as the network file server. This will be a dual-homed machine (two network interfaces) that serves as an access point for the cluster. As such, it will be configured to limit and control access as well as provide it. When the services required by the network file systems put too great a strain on the head node, the network file system can be moved to a separate server to improve performance.

If you are setting up systems as I/O servers for a parallel file system, it is likely that you'll want larger and faster drives on these systems. Since you may have a number of I/O servers in a larger cluster, you may need to look more closely at cost and performance trade-offs.

3.1.3 Cluster Network

By definition, a cluster is a networked collection of computers. For commodity clusters, networking is often the weak link. The two key factors to consider when designing your network are bandwidth and latency. Your application or application mix will determine just how important these two factors are. If you need to move large blocks of data, bandwidth will be critical. For real-time applications or applications that have lots of interaction among nodes, minimizing latency is critical. If you have a mix of applications, both can be critical.

It should come as no surprise that a number of approaches and products have been developed. High-end Ethernet is probably the most common choice for clusters. But for some low-latency applications, including many real-time applications, you may need to consider specialized low-latency hardware. There are a number of choices. The most common alternative to Ethernet is Myrinet from Myricom, Inc. Myrinet is a proprietary solution providing high-speed bidirectional connectivity (currently about 2 Gbps in each direction) and low latencies (currently under 4 microseconds). Myrinet uses a source-routing strategy and allows arbitrary length packets.

Other competitive technologies that are emerging or are available include cLAN from Emulex, QsNet from Quadrics, and Infiniband from the Infiniband consortium. These are high-performance solutions and this technology is rapidly changing.

The problem with these alternative technologies is their extremely high cost. Adapters can cost more than the combined cost of all the other hardware in a node. And once you add in the per node cost of the switch, you can easily triple the cost of a node. Clearly, these approaches are for the high-end systems.

Fortunately, most clusters will not need this extreme level of performance. Continuing gains in speed and rapidly declining costs make Ethernet the network of choice for most clusters. Now that Gigabit Ethernet is well established and 10 Gigabit Ethernet has entered the marketplace, the highly expensive proprietary products are no longer essential for most needs.

For Gigabit Ethernet, you will be better served with an embedded adapter rather than an add-on PCI board since Gigabit can swamp the PCI bus. Embedded adapters use workarounds that take the traffic off the PCI bus. Conversely, with 100BaseT, you may prefer a separate adapter rather than an embedded one since an embedded adapter may steal clock cycles from your applications.

Unless you are just playing around, you'll probably want, at minimum, switched Fast Ethernet. If your goal is just to experiment with clusters, almost any level of networking can be used. For example, clusters have been created using FireWire ports. For two (or even three) machines, you can create a cluster using crossover cables.

Very high-performance clusters may have two parallel networks. One is used for message passing among the nodes, while the second is used for the network file system. In the past, elaborate technology, architectures, and topologies have been developed to optimize communications. For example, channel bonding uses multiple interfaces to multiplex channels for higher bandwidth. Hypercube topologies have been used to minimize communication path length. These approaches are beyond the scope of this book. Fortunately, declining networking prices and faster networking equipment have lessened the need for these approaches.

3.2 Environment

You are going to need some place to put your computers. If you are lucky enough to have a dedicated machine room, then you probably have everything you need. Otherwise, select or prepare a location that provides physical security, adequate power, and adequate heating and cooling. While these might not be issues with a small cluster, proper planning and preparation is essential for large clusters. Keep in mind, you are probably going to be so happy with your cluster that you'll want to expand it. Since small clusters have ways of becoming large clusters, plan for growth from the start.

3.2.1 Cluster Layout

Since the more computers you have, the more space they will need, plan your layout with wiring, cooling, and physical access in mind. Ignore any of these at your peril. While it may be tempting to stack computers or pack them into large shelves, this can create a lot of problems if not handled with care. First, you may find it difficult to physically access individual computers to make repairs. If the computers are packed too tightly, you'll create heat dissipation problems. And while this may appear to make wiring easier, in practice it can lead to a rat's nest of cables, making it difficult to divide your computers among different power circuits.

From the perspective of maintenance, you'll want to have physical access to individual computers without having to move other computers and with a minimum of physical labor. Ideally, you should have easy access to both the front and back of your computers. If your nodes are headless (no monitor, mouse, or keyboard), it is a good idea to assemble a crash cart. So be sure to leave enough space to both wheel and park your crash cart (and a chair) among your machines.

To prevent overheating, leave a small gap between computers and take care not to obstruct any ventilation openings. (These are occasionally seen on the sides of older computers!) An inch or two usually provides enough space between computers, but watch for signs of overheating.

Cable management is also a concern. For the well-heeled, there are a number of cable management systems on the market. Ideally, you want to keep power cables and data cables separated. The traditional rule of thumb was that there should be at least a foot of separation between parallel runs of data and power cables, and that data cables and power cables should cross at right angles. In practice, the 60 Hz analog power signal doesn't affect high-speed digital signals. Still, separating cables can make your cluster more manageable.

Standard equipment racks are very nice if you can afford them. Cabling is greatly simplified. But keep in mind that equipment racks pack things very closely and heat can be a problem. One rule of thumb is to stay under 100 W per square foot. That is about 1000 W for a 6-foot, 19-inch rack.

Otherwise, you'll probably be using standard shelving. My personal preference is metal shelves that are open on all sides. When buying shelves, take into consideration both the size and the weight of all the equipment you will have. Don't forget any displays, keyboards, mice, KVM switches, network switches, or uninterruptible power supplies that you plan to use. And leave yourself some working room.

3.2.2 Power and Air Conditioning

You'll need to make sure you have adequate power for your cluster, and to remove all the heat generated by that power, you'll need adequate air conditioning. For small clusters, power and air conditioning may not be immediate concerns (for now!), but it doesn't hurt to estimate your needs. If you are building a large cluster, take these needs into account from the beginning. Your best bet is to seek professional advice if it is readily available. Most large organizations have heating, ventilation, and air conditioning (HVAC) personnel and electricians on staff. While you can certainly estimate your needs yourself, if you have any problems you will need to turn to these folks for help, so you might want to include them from the beginning. Also, a second set of eyes can help prevent a costly mistake.

3.2.2.1 Power

In an ideal universe, you would simply know the power requirements of your cluster. But if you haven't built it yet, this knowledge can be a little hard to come by. The only alternative is to estimate your needs. A rough estimate is fairly straightforward: just inventory all your equipment and then add up all the wattages. Divide the total wattage by the voltage to get the amperage for the circuit, and then figure in an additional 50 percent or so as a safety factor.
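For example, for a hypothetical setup of eight 300 W compute nodes plus another 500 W of head node and network equipment on a 120 V circuit, the estimate works out like this (all of the wattages here are made up for illustration):

    $ echo "scale=1; (8*300 + 500) * 1.5 / 120" | bc    # (watts * safety factor) / volts = amps
    36.2

At roughly 36 amps, even this small cluster already needs more than a single 20-amp circuit, which is exactly the kind of surprise the estimate is meant to catch.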

For a more careful analysis, you should take into account the power factor. A switching power supply can draw more current than its wattage rating suggests. For example, a fully loaded 350 W power supply may draw 500 W for 70 percent of the time and be off the other 30 percent of the time. And since a power supply may be 70 percent efficient, delivering those 500 W may require around 715 W. In practice, your equipment will rarely operate at maximum-rated capacity. Some power supplies are power-factor corrected (PFC). These power supplies will have power factors closer to 95 percent than 70 percent.

As you can see, this can get complicated very quickly. Hopefully, you won't be working with fully loaded systems. On the other hand, if you expect your cluster to grow, plan for more. Having said all this, for small clusters a 20-amp circuit should be adequate, but there are no guarantees.

When doing your inventory, the trick is remembering to include everything that enters the environment. It is not just the computers, network equipment, monitors, etc., that make up a cluster. It includes everything: equipment that is only used occasionally such as vacuum cleaners, personal items such as the refrigerator under your desk, and fixtures such as lights. (Ideally, you should keep the items that potentially draw a lot of current, such as vacuum cleaners, floor polishers, refrigerators, and laser printers, off the circuits your cluster is on.) Also, be careful to ensure you aren't sharing a circuit unknowingly, a potential problem in an older building, particularly if you have remodeled and added partitions.

The quality of your power can be an issue. If in doubt, put a line monitor on your circuit to see how it behaves. You might consider an uninterruptible power supply (UPS), particularly for your servers or head nodes. However, the cost can be daunting when trying to provide UPSs for an entire cluster. Moreover, UPSs should not be seen as an alternative to adequate wiring. If you are interested in learning more about or sizing a UPS, see the UPS FAQ at the site of the Linux Documentation Project (http://www.tldp.org/).

While you are buying UPSs, you may also want to consider buying other power management equipment. There are several vendors that supply managed power distribution systems. These often allow management over the Internet, through a serial connection, or via SNMP. With this equipment, you'll be able to monitor your cluster and remotely power-down or reboot equipment.

And one last question to the wise:

Do you know how to kill the power to your system?


This is more than idle curiosity. There may come a time when you don't want power to your cluster. And you may be in a big hurry when the time comes.

Knowing where the breakers are is a good start. Unfortunately, these may not be close at hand. They may even be locked away in a utility closet. One alternative is a scram switch. A scram switch should be installed between the UPS and your equipment. You should take care to ensure the switch is accessible but will not inadvertently be thrown.

You should also ensure that your maintenance staff knows what a UPS is. I once had a server/UPS setup in an office that flooded. When I came in, the UPS had been unplugged from the wall, but the computer was still plugged into the UPS. Both computer and UPS were drenched, a potentially deadly situation. Make sure your maintenance staff knows what they are dealing with.

3.2.2.2 HVAC

As with most everything else, when it comes to electronics, heat kills. There is no magical temperature range such that, if you just keep your computers and other equipment within it, everything will be OK. Unfortunately, it just isn't that simple.

Failure rate is usually a nonlinear function of temperature. As the temperature rises, the probability of failure also increases. For small changes in temperature, a rough rule of thumb is that you can expect the failure rate to double with an 18F (10C) increase in temperature. For larger changes, the rate of failure typically increases more rapidly than the rise in temperature. Basically, you are playing the odds. If you operate your machine room at a higher than average temperature, you'll probably see more failures. It is up to you to decide if the failure rate is unacceptable.

Microenvironments also matter. It doesn't matter if it is nice and cool in your corner of the room if your equipment rack is sitting in a corner in direct sunlight where the temperature is 15F (8C) warmer. If the individual pieces of equipment don't have adequate cooling, you'll have problems. This means that computers that are spread out in a room with good ventilation may be better off at a higher room temperature than those in a tightly packed cluster that lacks ventilation, even when the room temperature is lower.

Finally, the failure rate will also depend on the actual equipment you are using. Some equipment is designed and constructed to be more heat tolerant, e.g., military grade equipment. Consult the specifications if in doubt.

While occasionally you'll see recommended temperature ranges for equipment or equipment rooms, these should be taken with a grain of salt. Usually, recommended temperatures are a little below 70F (21C). So if you are a little chilly, your machines are probably comfortable.

Maintaining a consistent temperature can be a problem, particularly if you leave your cluster up and running at night, over the weekend, and over holidays. Heating and air conditioning are often turned off or scaled back when people aren't around. Ordinarily, this makes good economic sense. But when the air conditioning is cut off for a long Fourth of July weekend, equipment can suffer. Make sure you discuss this with your HVAC folks before it becomes a problem. Again, occasional warm spells probably won't be a problem, but you are pushing your luck.

Humidity is also an issue. At a high humidity, condensation can become a problem; at a low humidity, static electricity is a problem. The optimal range is somewhere in between. Recommended ranges are typically around 40 percent to 60 percent.

Estimating your air conditioning needs is straightforward but may require information you don't have. Among other things, proper cooling depends on the number and area of external walls, the number of windows and their exposure to the sun, the external temperature, and insulation. Your maintenance folks may have already calculated all this or may be able to estimate some of it.

What you are adding is heat contributed by your equipment and staff, something that your maintenance folks may not have been able to accurately predict. Once again, you'll start with an inventory of your equipment. You'll want the total wattage. You can convert this to British Thermal Units per hour by multiplying the wattage by 3.412. Add in another 300 BTU/H for each person working in the area. Add in the load from the lights, walls, windows, etc., and then figure in another 50 percent as a safety factor. Since air conditioning is usually expressed in tonnage, you may need to divide the BTU/H total by 12,000 to get the tonnage you need. (Or, just let the HVAC folks do all this for you.)
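
For example, here is a rough worked calculation for a hypothetical 16-node cluster drawing about 300 watts per node, with two people in the room. The numbers are invented purely for illustration, and the building load (lights, walls, windows) is left out; in practice you would add it to the subtotal before applying the safety factor.

16 nodes x 300 W                   =  4,800 W
4,800 W x 3.412                    = 16,378 BTU/H  (equipment)
2 people x 300 BTU/H               =    600 BTU/H  (people)
subtotal (ignoring building load)  = 16,978 BTU/H
16,978 BTU/H x 1.5 safety factor   = 25,467 BTU/H
25,467 / 12,000                    = roughly 2.1 tons of air conditioning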

3.2.3 Physical Security

Physical security includes both controlling access to computers and protecting computers from physical threats such as flooding. If you are concerned about someone trying to break into your computers, the best solution is to take whatever steps you can to ensure that they don't have physical access to the computers. If you can't limit access to the individual computers, then you should password protect the CMOS, set the boot order so the system only boots from the hard drive, and put a lock on each case. Otherwise, someone can open the case and remove the battery briefly (roughly 15 to 20 minutes) to erase the information in CMOS including the password.[3] With the password erased, the boot order can be changed. Once this is done, it is a simple matter to boot to a floppy or CD-ROM, mount the hard drive, and edit the password files, etc. (Even if you've removed both floppy and CD-ROM drives, an intruder could bring one with them.) Obviously, this solution is only as good as the locks you can put on the computers and does very little to protect you from vandals.

[3] Also, there is usually a jumper that will immediately discharge the CMOS.

Broken pipes and similar disasters can be devastating. Unfortunately, it can be difficult to assess these potential threats. Computers can be damaged when a pipe breaks on another floor. Just because there is no pipe immediately overhead doesn't mean that you won't be rained on as water from higher floors makes its way to the basement. Keeping equipment off the floor and off the top of shelves can provide some protection. It is also a good idea to keep equipment away from windows.

There are several web sites and books that deal with disaster preparedness. As the importance of your cluster grows, disaster preparedness will become more important.



Chapter 4. Linux for Clusters

This chapter reviews some of the issues involved in setting up a Linux system for use in a cluster. While several key services are described in detail, for the most part the focus is more on the issues and rationales than on specifics. Even if you are an old pro at Linux system administration, you may still want to skim this chapter for a quick overview of the issues as they relate to clusters, particularly the section on configuring services. If you are new to Linux system administration, this chapter will probably seem very terse. What's presented here is the bare minimum a novice system administrator will need to get started. The Appendix A lists additional sources.

This chapter covers material you'll need when setting up the head node and a typical cluster node. Depending on the approach you take, much of this may be done for you. If you are building your cluster from the ground up, you'll need to install the head node, configure the individual services on it, and build at least one compute node. Once you have determined how a compute node should be configured, you can turn to Chapter 8 for a discussion of how to duplicate systems in an efficient manner. It is much simpler with kits like OSCAR and Rocks.

With OSCAR, you'll need to install Linux on the head system, but OSCAR will configure the services for you. It will also build the client, i.e., generate a system image and install it on the compute nodes. OSCAR will configure and install most of the packages you'll need. The key to using OSCAR is to use a version of Linux that is known to be compatible with OSCAR. OSCAR is described in Chapter 6. With Rocks, described in Chapter 7, everything will be done for you. Red Hat Linux comes as part of the Rocks distribution.

This chapter begins with a discussion of selecting a Linux distribution. A general discussion of installing Linux follows. Next, the configuration of relevant network services is described. Finally, there is a brief discussion of security. If you are adding clustering software to an existing collection of workstations, presumably Linux is already installed on your machines. If this is the case, you can probably skim the first couple of sections. But while you won't need to install Linux, you will need to ensure that it is configured correctly and all the services you'll need are available.

4.1 Installing Linux

If Linux isn't built into your cluster software, the first step is to decide what distribution and version of Linux you want.

4.1.1 Selecting a Distribution

This decision will depend on what clustering software you want to use. It doesn't matter what the "best" distribution of Linux (Red Hat, Debian, SUSE, Mandrake, etc.) or version (7.3, 8.0, 9.0, etc.) is in some philosophical sense if the clustering software you want to use isn't available for that choice. This book uses the Red Hat distribution because the clustering software being discussed was known to work with that distribution. This is not an endorsement of Red Hat; it was just a pragmatic decision.

Keep in mind that your users typically won't be logging onto the compute nodes to develop programs, etc., so the version of Linux used there should be largely irrelevant to the users. While users will be logging onto the head node, this is not a general-purpose server. They won't be reading email, writing memos, or playing games on this system (hopefully). Consequently, many of the reasons someone might prefer a particular distribution are irrelevant.

This same pragmatism should extend to selecting the version as well as the distribution you use. In practice, this may mean using an older version of Linux. There are basically three issues involved in using an older version: compatibility with newer hardware; bug fixes, patches, and continued support; and compatibility with clustering software.

If you are using recycled hardware, using an older version shouldn't be a problem since drivers should be readily available for your older equipment. If you are using new equipment, however, you may run into problems with older Linux releases. The best solution, of course, is to avoid this problem by planning ahead if you are buying new hardware. This is something you should be able to work around by putting together a single test system before buying the bulk of the equipment.

With older versions, many of the problems are known. For bugs, this is good news since someone else is likely to have already developed a fix or workaround. With security holes, this is bad news since exploits are probably well circulated. With an older version, you'll need to review and install all appropriate security patches. If you can isolate your cluster, this will be less of an issue.

Unfortunately, at some point you can expect support for older systems to be discontinued. However, a system will not stop working just because it isn't supported. While not desirable, this is also something you can live with.

The final and key issue is software compatibility. Keep in mind that it takes time to develop software for use with a new release, particularly if you are customizing the kernel. As a result, the clustering software you want to use may not be available for the latest version of your favorite Linux distribution. In general, software distributed as libraries (e.g., MPI) is more forgiving than software requiring kernel patches (e.g., openMosix) or software that builds kernel modules (e.g., PVFS). These latter categories, by their very nature, must be system specific. Remember that using clustering software is the raison d'être for your cluster. If you can't run it, you are out of business. Unless you are willing to port the software or compromise your standards, you may be forced to use an older version of Linux. While you may want the latest and greatest version of your favorite flavor of Linux, you need to get over it.

If at all feasible, it is best to start your cluster installation with a clean install of Linux. Of course, if you are adding clustering software to existing systems, this may not be feasible, particularly if the machines are not dedicated to the cluster. If that is the case, you'll need to tread lightly. You'll almost certainly need to make changes to these systems, changes that may not go as smoothly as you'd like. Begin by backing up and carefully documenting these systems.

4.1.2 Downloading Linux

With most flavors of Linux, there are several ways you can do the installation. Typically you can install from a set of CD-ROMs, from a hard disk partition, or over a network using NFS, FTP, or HTTP. The decision will depend in part on the hardware you have available, but for initial experimentation it is probably easiest to use CD-ROMs. Buying a boxed set can be a real convenience, particularly if it comes with a printed set of manuals. But if you are using an older version of Linux, finding a set of CD-ROMs to buy can be difficult. Fortunately, you should have no trouble finding what you need on the Internet.

Downloading is the cheapest and easiest way to go if you have a fast Internet connection and a CD-ROM burner. Typically, you download ISO images (disk images for CD-ROMs). These are basically single-file archives of everything on a CD-ROM. Since ISO images are frequently over 600 MB each and since you'll need several of them, downloading can take hours even if you have a fast connection and days if you're using a slow modem.

If you decide to go this route, follow the installation directions from your download site. These should help clarify exactly what you need and don't need and explain any other special considerations. For example, for Red Hat Linux the place to start is http://www.redhat.com/apps/download/. This will give you a link to a set of directions with links to download sites. Don't overlook the mirror sites; your download may go faster with them than with Red Hat's official download site.

For Red Hat Linux 9.0, there are seven disks. (Earlier versions of Red Hat have fewer disks.) Three of these are the installation disks and are essential. Three disks contain the source files for the packages. It is very unlikely you'll ever need these. If you do, you can download them later. The last disk is a documentation disk. You'd be foolish to skip this disk. Since the files only fill a small part of a CD, the ISO image is relatively small and the download doesn't take very long.

It is a good idea to check the MD5SUM for each ISO you download. Run the md5sum program and compare the results to published checksums.

[root@cs sloanjd]# md5sum FC2-i386-rescuecd.iso
22f4bfca5baefe89f0e04166e738639f  FC2-i386-rescuecd.iso


This will ensure both that the disk image hasn't been tampered with and that your download wasn't corrupted.

Once you have downloaded the ISO images, you'll need to burn your CD-ROMs. If you downloaded the ISO images to a Windows computer, you could use something like Roxio Easy Creator.[1] If you already have a running Linux system, you might use X-CD-Roast.

[1] There is an appealing irony to using Windows to download Linux.

Once you have the CD-ROMs, you can do an installation by following the appropriate directions for your software and system. Usually, this means booting to the first CD-ROM, which, in turn, runs an installation script. If you can't boot from the CD-ROM, you'll need to create a boot floppy using the directions supplied with the software. For Red Hat Linux, see the README file on the first installation disk.

4.1.3 What to Install?

What you install will depend on how you plan to use the machine. Is this a dedicated cluster? If so, users probably won't log onto individual machines, so you can get by with installing the minimal software required to run applications on each compute node. Is it a cluster of workstations that will be used in other ways? If that is the case, be sure to install X and any other appropriate applications. Will you be writing code? Don't forget the software development package and editors. Will you be recompiling the kernel? If so, you'll need the kernel sources.[2] If you are building kernel modules, you'll need the kernel header files. (In particular, these are needed if you install PVFS. PVFS is described in Chapter 12.) A custom installation will give you the most control over what is installed, i.e., the greatest opportunity to install software that you don't need and omit that which you do need.

[2] In general, you should avoid recompiling the kernel unless it is absolutely necessary. While you may be able to eke out some modest performance gains, they are rarely worth the effort.

Keep in mind that you can go back and add software. You aren't trapped by what you include at this point. At this stage, the important thing is to remember what you actually did. Take careful notes and create a checklist as you proceed. The quickest way to get started is to take a minimalist approach and add anything you need later, but some people find it very annoying to have to go back and add software. If you have the extra disk space (2 GB or so), then you may want to copy all the packages to a directory on your server. Not having to mount disks and search for packages greatly simplifies adding packages as needed. You only need to do this with one system and it really doesn't take that long. Once you have worked out the details, you can create a Kickstart configuration file to automate all this. Kickstart is described in more detail in Chapter 8.

4.2 Configuring Services

Once you have the basic installation completed, you'll need to configure the system. Many of the tasks are no different for machines in a cluster than for any other system. For other tasks, being part of a cluster impacts what needs to be done. The following subsections describe the issues associated with several services that require special considerations. These subsections briefly recap how to configure and use these services. Remember, most of this will be done for you if you are using a package like OSCAR or Rocks. Still, it helps to understand the issues and some of the basics.

4.2.1 DHCP

Dynamic Host Configuration Protocol (DHCP) is used to supply network configuration parameters, including IP addresses, host names, and other information to clients as they boot. With clusters, the head node is often configured as a DHCP server and the compute nodes as DHCP clients. There are two reasons to do this. First, it simplifies the installation of compute nodes since the information DHCP can supply is often the only thing that is different among the nodes. Since a DHCP server can handle these differences, the node installation can be standardized and automated. A second advantage of DHCP is that it is much easier to change the configuration of the network. You simply change the configuration file on the DHCP server, restart the server, and reboot each of the compute nodes.

The basic installation is rarely a problem. The DHCP system can be installed as a part of the initial Linux installation or after Linux has been installed. The DHCP server configuration file, typically /etc/dhcpd.conf, controls the information distributed to the clients. If you are going to have problems, the configuration file is the most likely source.

The DHCP configuration file may be created or changed automatically when some cluster software is installed. Occasionally, the changes may not be done optimally or even correctly so you should have at least a reading knowledge of DHCP configuration files. Here is a heavily commented sample configuration file that illustrates the basics. (Lines starting with "#" are comments.)

# A sample DHCP configuration file.
            # The first commands in this file are global,
            # i.e., they apply to all clients.
            # Only answer requests from known machines,
            # i.e., machines whose hardware addresses are given.
            deny unknown-clients;
            # Set the subnet mask, broadcast address, and router address.
            option subnet-mask 255.255.255.0;
            option broadcast-address 172.16.1.255;
            option routers 172.16.1.254;
            # This section defines individual cluster nodes.
            # Each subnet in the network has its own section.
            subnet 172.16.1.0 netmask 255.255.255.0 {
            group {
            # The first host, identified by the given MAC address,
            # will be named node1.cluster.int, will be given the
            # IP address 172.16.1.1, and will use the default router
            # 172.16.1.254 (the head node in this case).
            host node1{
            hardware ethernet 00:08:c7:07:68:48;
            fixed-address 172.16.1.1;
            option routers 172.16.1.254;
            option domain-name "cluster.int";
            }
            host node2{
            hardware ethernet 00:08:c7:07:c1:73;
            fixed-address 172.16.1.2;
            option routers 172.16.1.254;
            option domain-name "cluster.int";
            }
            # Additional node definitions go here.
            }
            }
            # For servers with multiple interfaces, this entry says to ignore requests
            # on specified subnets.
            subnet 10.0.32.0 netmask 255.255.248.0 {  not authoritative; }

As shown in this example, you should include a subnet section for each subnet on your network. If the head node has an interface for the cluster and a second interface connected to the Internet or your organization's network, the configuration file will have a group for each interface or subnet. Since the head node should answer DHCP requests for the cluster but not for the organization, DHCP should be configured so that it will respond only to DHCP requests from the compute nodes.
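
On Red Hat-style systems, one common way to do this is to tell the dhcpd startup script which interface to listen on. The file and variable shown here are what Red Hat's init script typically uses; check your distribution's documentation, since the details vary, and the interface name eth1 is only an example.

# /etc/sysconfig/dhcpd
# Listen only on the cluster-side interface.
DHCPDARGS="eth1"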

4.2.2 NFS

A network filesystem is a filesystem that physically resides on one computer (the file server), which in turn shares its files over the network with other computers on the network (the clients). The best-known and most common network filesystem is Network File System (NFS). In setting up a cluster, designate one computer as your NFS server. This is often the head node for the cluster, but there is no reason it has to be. In fact, under some circumstances, you may get slightly better performance if you use different machines for the NFS server and head node. Since the server is where your user files will reside, make sure you have enough storage. This machine is a likely candidate for a second disk drive or RAID array and a fast I/O subsystem. You may even want to consider mirroring the filesystem using a small high-availability cluster.

Why use an NFS? It should come as no surprise that for parallel programming you'll need a copy of the compiled code or executable on each machine on which it will run. You could, of course, copy the executable over to the individual machines, but this quickly becomes tiresome. A shared filesystem solves this problem. Another advantage to an NFS is that all the files you will be working on will be on the same system. This greatly simplifies backups. (You do backups, don't you?) A shared filesystem also simplifies setting up SSH, as it eliminates the need to distribute keys. (SSH is described later in this chapter.) For this reason, you may want to set up NFS before setting up SSH. NFS can also play an essential role in some installation strategies.

If you have never used NFS before, setting up the client and the server are slightly different, but neither is particularly difficult. Most Linux distributions come with most of the work already done for you.

4.2.2.1 Running NFS

Begin with the server; you won't get anywhere with the client if the server isn't already running. Two things need to be done to get the server running. The file /etc/exports must be edited to specify which machines can mount which directories, and then the server software must be started. Here is a single line from the file /etc/exports on the server amy:

/home    basil(rw) clara(rw) desmond(rw) ernest(rw) george(rw)

This line gives the clients basil, clara, desmond, ernest, and george read/write access to the directory /home on the server. Read access is the default. A number of other options are available and could be included. For example, the no_root_squash option could be added if you want to edit root permission files from the nodes.
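
For instance, a variation on the line above that commits writes synchronously and grants root access from basil might look like the following; see the exports(5) manpage for the full list of options. (This is just an illustration, not a recommendation to enable no_root_squash.)

/home    basil(rw,sync,no_root_squash) clara(rw) desmond(rw) ernest(rw) george(rw)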

Pay particular attention to the use of spaces in this file.


Had a space been inadvertently included between basil and (rw), read access would have been granted to basil and read/write access would have been granted to all other systems. (Once you have the systems set up, it is a good idea to use the command showmount -a to see who is mounting what.)

Once /etc/exports has been edited, you'll need to start NFS. For testing, you can use the service command as shown here

[root@fanny init.d]# /sbin/service nfs start
            Starting NFS services:                                     [  OK  ]
            Starting NFS quotas:                                       [  OK  ]
            Starting NFS mountd:                                       [  OK  ]
            Starting NFS daemon:                                       [  OK  ]
            [root@fanny init.d]# /sbin/service nfs status
            rpc.mountd (pid 1652) is running...
            nfsd (pid 1666 1665 1664 1663 1662 1661 1660 1657) is running...
            rpc.rquotad (pid 1647) is running...

(With some Linux distributions, when restarting NFS, you may find it necessary to explicitly stop and restart both nfslock and portmap as well.) You'll want to change the system configuration so that this starts automatically when the system is rebooted. For example, with Red Hat, you could use the serviceconf or chkconfig commands.
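
For example, with chkconfig you might enable NFS and its supporting services in the usual multiuser runlevels. The exact service names can vary slightly between releases.

[root@fanny root]# /sbin/chkconfig --level 345 portmap on
[root@fanny root]# /sbin/chkconfig --level 345 nfslock on
[root@fanny root]# /sbin/chkconfig --level 345 nfs on
[root@fanny root]# /sbin/chkconfig --list nfs
nfs             0:off   1:off   2:off   3:on    4:on    5:on    6:off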

For the client, the software is probably already running on your system. You just need to tell the client to mount the remote filesystem. You can do this several ways, but in the long run, the easiest approach is to edit the file /etc/fstab, adding an entry for the server. Basically, you'll add a line to the file that looks something like this:

amy:/home    /home    nfs    rw,soft    0 0

In this example, the local system mounts the /home filesystem located on amy as the /home directory on the local machine. The filesystems may have different names. You can now manually mount the filesystem with the mount command

[root@ida /]# mount /home

When the system reboots, this will be done automatically.

When using NFS, you should keep a couple of things in mind. The mount point, /home, must exist on the client prior to mounting. While the remote directory is mounted, any files that were stored on the local system in the /home directory will be inaccessible. They are still there; you just can't get to them while the remote directory is mounted. Next, if you are running a firewall, it will probably block NFS traffic. If you are having problems with NFS, this is one of the first things you should check.

File ownership can also create some surprises. User and group IDs should be consistent among systems using NFS, i.e., each user will have identical IDs on all systems. Finally, be aware that root privileges don't extend across NFS shared systems (if you have configured your systems correctly). So if, as root, you change the directory (cd) to a remotely mounted filesystem, don't expect to be able to look at every file. (Of course, as root you can always use su to become the owner and do all the snooping you want.) Details for the syntax and options can be found in the nfs(5), exports(5), fstab(5), and mount(8) manpages. Additional references can be found in the Appendix A.

4.2.2.2 Automount

The preceding discussion of NFS describes editing the /etc/fstab to mount filesystems. There's another alternative梪sing an automount program such as autofs or amd. An automount daemon mounts a remote filesystem when an attempt is made to access the filesystem and unmounts the filesystem when it is no longer needed. This is all transparent to the user.

While the most common use of automounting is to automatically mount floppy disks and CD-ROMs on local machines, there are several advantages to automounting across a network in a cluster. You can avoid the problem of maintaining consistent /etc/fstab files on dozens of machines. Automounting can also lessen the impact of a server crash. It is even possible to replicate a filesystem on different servers for redundancy. And since a filesystem is mounted only when needed, automounting can reduce network traffic. We'll look at a very simple example here. There are at least two different HOWTOs (http://www.tldp.org/) for automounting should you need more information.

Automounting originated at Sun Microsystems, Inc. The Linux automounter autofs, which mimics Sun's automounter, is readily available on most Linux systems. While other automount programs are available, most notably amd, this discussion will be limited to using autofs.

Support for autofs must be compiled into the kernel before it can be used. With most Linux releases, this has already been done. If in doubt, use the following to see if it is installed:

[root@fanny root]# cat /proc/filesystems
            ...

Somewhere in the output, you should see the line

nodev   autofs

If you do, you are in business. Otherwise, you'll need a new kernel.

Next, you need to configure your systems. autofs uses the file /etc/auto.master to determine mount points. Each line in the file specifies a mount point and a map file that defines which filesystems will be mounted to the mount point. For example, in Rocks the auto.master file contains the single line:

/home auto.home --timeout 600

In this example, /home is the mount point, i.e., where the remote filesystem will be mounted. The file auto.home specifies what will be mounted.

In Rocks, the file /etc/auto.home will have multiple entries such as:

sloanjd  frontend.local:/export/home/sloanjd

The first field is the name of the subdirectory that will be created under the original mount point. In this example, the directory sloanjd will be mounted as a subdirectory of /home on the client system. The subdirectories are created dynamically by automount and should not exist on the client. The second field is the hostname (or server) and directory that is exported. (Although not shown in this example, it is possible to specify mount parameters for each directory in /etc/auto.home.) NFS should be running and you may need to update your /etc/exports file.
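
As an illustration, an auto.home entry with explicit mount parameters might look like the following. The options shown are ordinary NFS mount options chosen for the example, not what Rocks actually generates.

sloanjd  -rw,soft,intr  frontend.local:/export/home/sloanjd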

Once you have the configuration files copied to each system, you need to start autofs on each system. autofs is usually located in /etc/init.d and accepts the commands start, restart, status, and reload. With Red Hat, it is available through the /sbin/service command. After reading the file, autofs starts an automount process with appropriate parameters for each mount point and mounts filesystems as needed. For more information see the autofs(8) and auto.master(5) manpages.

4.2.3 Other Cluster Filesystems

NFS has its limitations. First, there are potential security issues. Since the idea behind NFS is sharing, it should come as no surprise that over the years crackers have found ways to exploit NFS. If you are going to use NFS, it is important that you use a current version, apply any needed patches, and configure it correctly.

Also, NFS does not scale well, although there seems to be some disagreement about its limitations. For clusters with fewer than 100 nodes, NFS is probably a reasonable choice. For clusters with more than 1,000 nodes, NFS is generally thought to be inadequate. Between 100 and 1,000 nodes, opinions seem to vary. This will depend in part on your hardware. It will also depend on how your applications use NFS. For a bioinformatics cluster, many of the applications will be read intensive. For a graphics processing cluster, rendering applications will be write intensive. You may find that NFS works better with the former than the latter. Other applications will have different characteristics, each stressing the filesystem in a different way. Ultimately, it comes down to what works best for you and your applications, so you'll probably want to do some experimenting.

Keep in mind that NFS is not meant to be a high-performance, parallel filesystem. Parallel filesystems are designed for a different purpose. There are other filesystems you could consider, each with its own set of characteristics. Some of these are described briefly in Chapter 12. Additionally, there are other storage technologies such as storage area network (SAN) technology. SANs offer greatly improved filesystem failover capabilities and are ideal for use with high-availability clusters. Unfortunately, SANs are both expensive and difficult to set up. iSCSI (SCSI over IP) is an emerging technology to watch.

If you need a high-performance, parallel filesystem, PVFS is a reasonable place to start, as it is readily available for both Rocks and OSCAR. PVFS is discussed in Chapter 12.

4.2.4 SSH

To run software across a cluster, you'll need some mechanism to start processes on each machine. In practice, a prerequisite is the ability to log onto each machine within the cluster. If you need to enter a password for each machine each time you run a program, you won't get very much done. What is needed is a mechanism that allows logins without passwords.

This boils down to two choices: you can use remote shell (RSH) or secure shell (SSH). If you are a trusting soul, you may want to use RSH. It is simpler to set up with less overhead. On the other hand, SSH network traffic is encrypted, so it is safe from snooping. Since SSH provides greater security, it is generally the preferred approach.

SSH provides mechanisms to log onto remote machines, run programs on remote machines, and copy files among machines. SSH is a replacement for ftp, telnet, rlogin, rsh, and rcp. A commercial version of SSH is available from SSH Communications Security (http://www.ssh.com), a company founded by Tatu Ylönen, an original developer of SSH. Or you can go with OpenSSH, an open source version from http://www.openssh.org.

OpenSSH is the easiest since it is already included with most Linux distributions. It has other advantages as well. By default, OpenSSH automatically forwards the DISPLAY variable. This greatly simplifies using the X Window System across the cluster. If you are running an SSH connection under X on your local machine and execute an X program on the remote machine, the X window will automatically open on the local machine. This can be disabled on the server side, so if it isn't working, that is the first place to look.
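
If forwarding isn't working, the directives to check are X11Forwarding in the server's sshd_config and ForwardX11 in the client's ssh_config (both files are discussed later in this section). A minimal sketch of the relevant lines follows; defaults differ between OpenSSH versions, so consult the sshd_config(5) and ssh_config(5) manpages.

# /etc/ssh/sshd_config (on the remote machine)
X11Forwarding yes

# /etc/ssh/ssh_config (on the local machine)
ForwardX11 yes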

There are two sets of SSH protocols, SSH-1 and SSH-2. Unfortunately, SSH-1 has a serious security vulnerability. SSH-2 is now the protocol of choice. This discussion will focus on using OpenSSH with SSH-2.

Before setting up SSH, check to see if it is already installed and running on your system. With Red Hat, you can check to see what packages are installed using the package manager.

[root@fanny root]# rpm -q -a | grep ssh
            openssh-3.5p1-6
            openssh-server-3.5p1-6
            openssh-clients-3.5p1-6
            openssh-askpass-gnome-3.5p1-6
            openssh-askpass-3.5p1-6

This particular system has the SSH core package, both server and client software as well as additional utilities. The SSH daemon is usually started as a service. As you can see, it is already running on this machine.

[root@fanny root]# /sbin/service sshd status
            sshd (pid 28190 1658) is running...

Of course, it is possible that it wasn't started as a service but is still installed and running. You can use ps to double check.

[root@fanny root]# ps -aux | grep ssh
            root     29133  0.0  0.2  3520  328 ?        S    Dec09   0:02 /usr/sbin/sshd
            ...

Again, this shows the server is running.

With some older Red Hat installations, e.g., the 7.3 workstation, only the client software is installed by default. You'll need to manually install the server software. If using Red Hat 7.3, go to the second install disk and copy over the file RedHat/RPMS/openssh-server-3.1p1-3.i386.rpm. (Better yet, download the latest version of this software.) Install it with the package manager and then start the service.

[root@james root]# rpm -vih openssh-server-3.1p1-3.i386.rpm
            Preparing...                ########################################### [100%]
            1:openssh-server         ########################################### [100%]
            [root@james root]# /sbin/service sshd start
            Generating SSH1 RSA host key:                              [  OK  ]
            Generating SSH2 RSA host key:                              [  OK  ]
            Generating SSH2 DSA host key:                              [  OK  ]
            Starting sshd:                                             [  OK  ]

When SSH is started for the first time, encryption keys for the system are generated. Be sure to set this up so that it is done automatically when the system reboots.
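
As with other services, chkconfig is one way to do this on Red Hat systems:

[root@james root]# /sbin/chkconfig --level 345 sshd on
[root@james root]# /sbin/chkconfig --list sshd
sshd            0:off   1:off   2:off   3:on    4:on    5:on    6:off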

Configuration files for both the server, sshd_config, and client, ssh_config, can be found in /etc/ssh, but the default settings are usually quite reasonable. You shouldn't need to change these files.

4.2.4.1 Using SSH

To log onto a remote machine, use the command ssh with the name or IP address of the remote machine as an argument. The first time you connect to a remote machine, you will receive a message with the remote machine's fingerprint, a string that identifies the machine. You'll be asked whether to proceed or not. This is normal.

[root@fanny root]# ssh amy
            The authenticity of host 'amy (10.0.32.139)' can't be established.
            RSA key fingerprint is 98:42:51:3e:90:43:1c:32:e6:c4:cc:8f:4a:ee:cd:86.
            Are you sure you want to continue connecting (yes/no)? yes
            Warning: Permanently added 'amy,10.0.32.139' (RSA) to the list of known hosts.
            root@amy's password:
            Last login: Tue Dec  9 11:24:09 2003
            [root@amy root]#

The fingerprint will be recorded in a list of known hosts on the local machine. SSH will compare fingerprints on subsequent logins to ensure that nothing has changed. You won't see anything else about the fingerprint unless it changes. Then SSH will warn you and query whether you should continue. If the remote system has changed, e.g., if it has been rebuilt or if SSH has been reinstalled, it's OK to proceed. But if you think the remote system hasn't changed, you should investigate further before logging in.

Notice in the last example that SSH automatically uses the same identity when logging into a remote machine. If you want to log on as a different user, use the -l option with the appropriate account name.

You can also use SSH to execute commands on remote systems. Here is an example of using date remotely.

[root@fanny root]# ssh -l sloanjd hector date
            sloanjd@hector's password:
            Mon Dec 22 09:28:46 EST 2003

Notice that a different account, sloanjd, was used in this example.

To copy files, you use the scp command. For example,

[root@fanny root]# scp /etc/motd george:/root/
            root@george's password:
            motd                 100% |*****************************|     0       00:00

Here, the file /etc/motd was copied from fanny to the /root directory on george.

In the examples thus far, the system has asked for a password each time a command was run. If you want to avoid this, you'll need to do some extra work. You'll need to generate a pair of authorization keys that will be used to control access and then store these in the directory ~/.ssh. The ssh-keygen command is used to generate keys.

[sloanjd@fanny sloanjd]$ ssh-keygen -b1024 -trsa
            Generating public/private rsa key pair.
            Enter file in which to save the key (/home/sloanjd/.ssh/id_rsa):
            Enter passphrase (empty for no passphrase):
            Enter same passphrase again:
            Your identification has been saved in /home/sloanjd/.ssh/id_rsa.
            Your public key has been saved in /home/sloanjd/.ssh/id_rsa.pub.
            The key fingerprint is:
            2d:c8:d1:e1:bc:90:b2:f6:6d:2e:a5:7f:db:26:60:3f sloanjd@fanny
            [sloanjd@fanny sloanjd]$ cd .ssh
            [sloanjd@fanny .ssh]$ ls -a
            .  ..  id_rsa  id_rsa.pub  known_hosts

The options in this example are used to specify a 1,024-bit key and the RSA algorithm. (You can use DSA instead of RSA if you prefer.) Notice that SSH will prompt you for a passphrase, basically a multi-word password.

Two keys are generated, a public and a private key. The private key should never be shared and resides only on the client machine. The public key is distributed to remote machines. Copy the public key to each system you'll want to log onto, renaming it authorized_keys2.

[sloanjd@fanny .ssh]$ cp id_rsa.pub authorized_keys2
            [sloanjd@fanny .ssh]$ chmod go-rwx authorized_keys2
            [sloanjd@fanny .ssh]$ chmod 755 ~/.ssh

If you are using NFS, as shown here, all you need to do is copy and rename the file in the current directory. Since that directory is mounted on each system in the cluster, it is automatically available.

If you used the NFS setup described earlier, root's home directory, /root, is not shared. If you want to log in as root without a password, manually copy the public keys to the target machines. You'll need to decide whether you feel secure setting up the root account like this.
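
If you do go this route, a minimal sketch of the manual copy looks something like the following. The node name node1 is hypothetical, and it is assumed that no authorized_keys2 file exists there yet; if one does, simply append to it.

[root@fanny root]# scp /root/.ssh/id_rsa.pub node1:/tmp/
[root@fanny root]# ssh node1
[root@node1 root]# mkdir -p /root/.ssh && chmod 700 /root/.ssh
[root@node1 root]# cat /tmp/id_rsa.pub >> /root/.ssh/authorized_keys2
[root@node1 root]# chmod 600 /root/.ssh/authorized_keys2
[root@node1 root]# rm /tmp/id_rsa.pub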


You will use two utilities supplied with SSH to manage the login process. The first is an SSH agent program that caches private keys, ssh-agent. This program stores the keys locally and uses them to respond to authentication queries from SSH clients. The second utility, ssh-add, is used to manage the local key cache. Among other things, it can be used to add, list, or remove keys.

[sloanjd@fanny .ssh]$ ssh-agent $SHELL
            [sloanjd@fanny .ssh]$ ssh-add
            Enter passphrase for /home/sloanjd/.ssh/id_rsa:
            Identity added: /home/sloanjd/.ssh/id_rsa (/home/sloanjd/.ssh/id_rsa)

(While this example uses the $SHELL variable, you can substitute the actual name of the shell you want to run if you wish.) Once this is done, you can log in to remote machines without a password.

This process can be automated to varying degrees. For example, you can add the call to ssh-agent as the last line of your login script so that it will be run before you make any changes to your shell's environment. Once you have done this, you'll need to run ssh-add only when you log in. But you should be aware that Red Hat console logins don't like this change.
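
For example, with bash you might add something like the following to the end of ~/.bash_profile. This is only a sketch; as noted above, it can interact badly with Red Hat console logins, so test it before rolling it out.

# Last lines of ~/.bash_profile: replace the login shell with one
# running under ssh-agent, unless an agent is already active.
if [ -z "$SSH_AGENT_PID" ]; then
    exec /usr/bin/ssh-agent $SHELL
fi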

You can find more information by looking at the ssh(1), ssh-agent(1), and ssh-add(1) manpages. If you want more details on how to set up ssh-agent, you might look at SSH, The Secure Shell by Barrett and Silverman, O'Reilly, 2001. You can also find scripts on the Internet that will set up a persistent agent so that you won't need to rerun ssh-add each time.

One last word of warning: If you are using ssh-agent, it becomes very important that you log off whenever you leave your machine. Otherwise, you'll be leaving not just one system wide open, but all of your systems.


4.2.5 Other Services and Configuration Tasks

Thus far, we have taken a minimalist approach. To make life easier, there are several other services that you'll want to install and configure. There really isn't anything special that you'll need to do; just don't overlook these.

4.2.5.1 Apache

While an HTTP server may seem unnecessary on a cluster, several cluster management tools such as Clumon and Ganglia use HTTP to display results. If you will monitor your cluster only from the head node, you may be able to get by without installing a server. But if you want to do remote monitoring, you'll need to install an HTTP server. Since most management packages like these assume Apache will be installed, it is easiest if you just go ahead and set it up when you install your cluster.

4.2.5.2 Network Time Protocol (NTP)

It is important to have synchronized clocks on your cluster, particularly if you want to do performance monitoring or profiling. Of course, you don't have to synchronize your system to the rest of the world; you just need to be internally consistent. Typically, you'll want to set up the head node as an NTP server and the compute nodes as NTP clients. If you can, you should sync the head node to an external timeserver. The easiest way to handle this is to select the appropriate option when you install Linux. Then make sure that the NTP daemon is running:

[root@fanny root]# /sbin/service ntpd status
            ntpd (pid 1689) is running...

Start the daemon if necessary.
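
If you need to set up the configuration by hand, a minimal sketch of /etc/ntp.conf for each role follows. The head node's cluster address (172.16.1.254) matches the earlier DHCP example, and ntp.example.org is a placeholder for whatever external timeserver you are allowed to use; real configurations usually include additional restrict lines.

# /etc/ntp.conf on the head node
server ntp.example.org     # external timeserver (placeholder)
driftfile /etc/ntp/drift

# /etc/ntp.conf on each compute node
server 172.16.1.254        # the head node
driftfile /etc/ntp/drift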

4.2.5.3 Virtual Network Computing (VNC)

This is a very nice package that allows remote graphical logins to your system. It is available as a Red Hat package or from http://www.realvnc.com/. VNC can be tunneled using SSH for greater security.
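
For example, one simple way to tunnel a session by hand is to forward the VNC port with ssh and point the viewer at the local end of the tunnel. VNC display :1 corresponds to TCP port 5901; the hostname amy is from the earlier examples.

# In one window, forward local port 5901 to display :1 on amy.
ssh -L 5901:localhost:5901 sloanjd@amy
# In a second window, connect through the tunnel.
vncviewer localhost:1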


4.2.5.4 Multicasting

Several clustering utilities use multicasting to distribute data among nodes within a cluster, either for cloning systems or when monitoring systems. In some instances, multicasting can greatly increase performance. If you are using a utility that relies on multicasting, you'll need to ensure that multicasting is supported. With Linux, multicasting must be enabled when the kernel is built. With most distributions, this is not a problem. Additionally, you will need to ensure that an appropriate multicast entry is included in your route tables. You will also need to ensure that your networking equipment supports multicast. This won't be a problem with hubs; this may be a problem with switches; and, should your cluster span multiple networks, this will definitely be an issue with routers. Since networking equipment varies significantly from device to device, you need to consult the documentation for your specific hardware. For more general information on multicasting, you should consult the multicasting HOWTOs.
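
As a quick check on the Linux side, you can look for a multicast route and, if it is missing, add one by hand. The interface name eth1 is only an example; use whichever interface faces your cluster.

[root@fanny root]# /sbin/route -n | grep 224.0.0.0
[root@fanny root]# /sbin/route add -net 224.0.0.0 netmask 240.0.0.0 dev eth1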

4.2.5.5 Hosts file and name services

Life will be much simpler in the long run if you provide appropriate name services. NIS is certainly one possibility. At a minimum, don't forget to edit /etc/hosts for your cluster. At the very least, this will reduce network traffic and speed up some software. And some packages assume it is correctly installed. Here are a few lines from the host file for amy:

127.0.0.1               localhost.localdomain localhost
            10.0.32.139             amy.wofford.int         amy
            10.0.32.140             basil.wofford.int       basil
            ...

Notice that amy is not included on the line with localhost. Specifying the host name as an alias for localhost can break some software.

4.3 Cluster Security

Security is always a two-edged sword. Adding security always complicates the configuration of your systems and makes using a cluster more difficult. But if you don't have adequate security, you run the risk of losing sensitive data, losing control of your cluster, having it damaged, or even having to completely rebuild it. Security management is a balancing act, one of trying to figure out just how little security you can get by with.

As previously noted, the usual architecture for a cluster is a set of machines on a dedicated subnet. One machine, the head node, connects this network to the outside world, i.e., the organization's network and the Internet. The only access to the cluster's dedicated subnet is through the head node. None of the compute nodes are attached to any other network. With this model, security typically lies with the head node. The subnet is usually a trust-based open network.

There are several reasons for this approach. With most clusters, the communication network is the bottleneck. Adding layers of security to this network will adversely affect performance. By focusing on the head node, security administration is localized and thus simpler. Typically, with most clusters, any sensitive information resides on the head node, so it is the point where the greatest level of protection is needed. If the compute nodes are not isolated, each one will need to be secured from attack.

This approach also simplifies setting up packet filtering, i.e., firewalls. Incorrectly configured, packet filters can create havoc within your cluster. Determining what traffic to allow can be a formidable challenge when using a number of different applications. With the isolated network approach, you can configure the internal interface to allow all traffic and apply the packet filter only to the public interface.

This approach doesn't mean you have a license to be sloppy within the cluster. You should take all reasonable precautions. Remember that you need to protect the cluster not just from external threats but from internal ones as well, whether intentional or otherwise.

Since a thorough discussion of security could easily add a few hundred pages to this book, it is necessary to assume that you know the basics of security. If you are a novice system administrator, this is almost certainly not the case, and you'll need to become proficient as quickly as possible. To get started, you should:

  • Be sure to apply all appropriate security patches, at least to the head node, and preferably to all nodes. This is a task you will need to do routinely, not just when you set up the cluster.

  • Know what is installed on your system. This can be a particular problem with cluster kits. Audit your systems regularly.

  • Differentiate between what's available inside the cluster and what is available outside the cluster. For example, don't run NFS outside the cluster. Block portmapper on the public interface of the head node (see the example following this list).

  • Don't put too much faith in firewalls, but use one, at least on the head node's public interface, and ensure that it is configured correctly.

  • Don't run services that you don't need. Routinely check which services are running, both with netstat and with a port scanner like nmap.

  • Your head node should be dedicated to the cluster, if at all possible. Don't set it up as a general server.

  • Use the root account only when necessary. Don't run programs as root unless it is absolutely necessary.
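
As an example of the portmapper suggestion above, the following iptables rules drop portmapper traffic (TCP and UDP port 111) arriving on the head node's public interface. The interface name eth0 is assumed to be the public interface, and in practice rules like these would be part of a larger, carefully tested firewall configuration.

[root@fanny root]# /sbin/iptables -A INPUT -i eth0 -p tcp --dport 111 -j DROP
[root@fanny root]# /sbin/iptables -A INPUT -i eth0 -p udp --dport 111 -j DROP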

There is no easy solution to the security dilemma. While you may be able to learn enough, you'll never be able to learn it all.

