1.What is a Cluster?
VERITAS Cluster Server (VCS) connects, or clusters, multiple independent systems
into a management framework for increased availability. Each system, or node, runs its
own operating system and cooperates at the software level to form a cluster. VCS links
commodity hardware with intelligent software to provide application failover and
control. When a node or a monitored application fails, other nodes can take predefined
actions to take over and bring up services elsewhere in the cluster.
2.Detecting Failure
VCS can detect application failure and node failure among cluster members.
(1)Detecting Application Failure
At the highest level, VCS is typically deployed to keep business-critical applications
online and available to users. VCS provides a mechanism to detect failure of an
application and any underlying resources or services supporting the application. VCS
issues specific commands, tests, or scripts that monitor the overall health of an
application. VCS also determines the health of underlying system resources supporting
the application, such as file systems and network interfaces.
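
The monitoring idea above can be sketched as a set of health checks, one per resource, whose results are combined into an overall application state. This is a minimal illustration only, not how VCS agents are implemented; the `Resource` class, the `monitor` and `application_healthy` functions, and the ONLINE/OFFLINE labels as used here are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Resource:
    """A monitored item (hypothetical model, not a VCS type)."""
    name: str
    check: Callable[[], bool]  # returns True when the resource is healthy


def monitor(resources: List[Resource]) -> Dict[str, str]:
    """Run every check once and report ONLINE or OFFLINE per resource."""
    status = {}
    for res in resources:
        try:
            healthy = res.check()
        except Exception:
            healthy = False  # a check that crashes is treated as a failure
        status[res.name] = "ONLINE" if healthy else "OFFLINE"
    return status


def application_healthy(status: Dict[str, str]) -> bool:
    """The application counts as healthy only if everything it depends on is."""
    return all(state == "ONLINE" for state in status.values())
```

In this sketch a failed file system or network interface check marks the whole application as unhealthy, which mirrors the point that the supporting resources are monitored together with the application itself.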
(2)Detecting Node Failure
One of the most difficult tasks in clustering is correctly discriminating between loss of a
system and loss of communication between systems. There are several technologies used
for this purpose, including heartbeat networks between servers, quorum disks, and SCSI
reservation. VCS combines a redundant network heartbeat with SCSI-III-based
membership coordination and data protection, both to detect node failure and to
fence off failed nodes.
3.Switchover and Failover
Failover and switchover are the processes of bringing up application services on a
different node in a cluster. In both cases, an application and its network identity are
brought up on a selected node. Client systems access a virtual IP address that moves with
the service. Client systems are unaware of which server they are using.
A virtual IP address is an address brought up in addition to the base address of systems in
the cluster. For example, in a 2-node cluster consisting of db-server1 and db-server2, a
virtual address may be called db-server. Clients will then access db-server and be
unaware of which physical server is actually hosting the db-server. Virtual IP addresses
use a technology known as IP Aliasing.
(1)The Switchover Process
A switchover is an orderly shutdown of an application and its supporting resources on
one server and a controlled startup on another server. Typically this means unassigning
the virtual IP, stopping the application, and deporting shared storage. On the other server,
the process is reversed. Storage is imported, file systems are mounted, the application is
started, and the virtual IP address is brought up.
(2)The Failover Process
A failover is similar to a switchover, except the ordered shutdown of applications on the
original node may not be possible. In this case services are simply started on another
node. The process of starting the application on the node is identical in a failover or
switchover. This means the application must be capable of restarting following a crash of
its original host.
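
The two sequences described above can be sketched as a single planning function: a switchover prepends an orderly shutdown on the source node, while a failover skips it because the source is assumed unreachable. This is an illustrative model rather than VCS logic; the `STEPS` table and the `plan` function are hypothetical, and the "unmount file systems" step is an assumption added as the counterpart of mounting.

```python
from typing import List, Tuple

# (start action, matching stop action), listed in start order.
STEPS = [
    ("import storage", "deport storage"),
    ("mount file systems", "unmount file systems"),  # unmount is assumed
    ("start application", "stop application"),
    ("bring up virtual IP", "unassign virtual IP"),
]


def plan(source: str, target: str, source_alive: bool) -> List[Tuple[str, str]]:
    """Return the ordered (node, action) pairs for moving a service."""
    actions = []
    if source_alive:
        # Switchover: shut down on the source in reverse start order.
        for _, stop in reversed(STEPS):
            actions.append((source, stop))
    # Switchover and failover share the same startup on the target.
    for start, _ in STEPS:
        actions.append((target, start))
    return actions
```

Calling `plan("db-server1", "db-server2", source_alive=False)` yields only the four startup actions on db-server2, which is the failover case; with `source_alive=True` the same four actions follow the shutdown on db-server1.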
4.Cluster Control, Communications, and Membership
(1)High-Availability Daemon (HAD)
The high-availability daemon, or HAD, is the main VCS daemon running on each system.
It is responsible for building the running cluster configuration from the configuration
files, distributing the information when new nodes join the cluster, responding to operator
input, and taking corrective action when something fails. It is typically known as the VCS
engine. The engine uses agents to monitor and manage resources.
(2)Low Latency Transport (LLT)
VCS uses private network communications between cluster nodes for cluster
maintenance. The Low Latency Transport functions as a high-performance, low-latency
replacement for the IP stack, and is used for all cluster communications. VERITAS
requires two completely independent networks between all cluster nodes, which provide
the required redundancy in the communication path and enable VCS to discriminate
between a network failure and a system failure. LLT has two major functions: traffic
distribution across the private network links and heartbeat exchange between nodes.
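
A rough sketch of why two independent networks help: if heartbeats from a peer stop on one link but continue on the other, the problem is most likely the link, not the peer. The `classify_peer` function and its return labels are hypothetical names for this illustration and do not correspond to an LLT interface.

```python
from typing import Dict


def classify_peer(link_alive: Dict[str, bool]) -> str:
    """Classify a peer from per-link heartbeat status.

    link_alive maps a link name to True when heartbeats from the peer
    are still arriving on that link.
    """
    if not link_alive:
        raise ValueError("at least one link is required")
    alive = [name for name, ok in link_alive.items() if ok]
    if len(alive) == len(link_alive):
        return "healthy"
    if alive:
        return "link failure"  # the peer is still heard on another link
    return "peer presumed down"  # silent on every link
```

Note that silence on every link still cannot tell a crashed peer apart from a complete loss of the private network, which is the case the I/O fencing module exists to handle.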
(3)Group Membership Services/Atomic Broadcast (GAB)
The Group Membership Services/Atomic Broadcast protocol (GAB) is responsible for
cluster membership and cluster communications.
◆ Cluster Membership
GAB maintains cluster membership by receiving input on the status of the heartbeat
from each node via LLT. When a system no longer receives heartbeats from a peer, it
marks the peer as DOWN and excludes the peer from the cluster. In most
configurations, the I/O fencing module is used to guard against the effects of
network partitions.
◆ Cluster Communications
GAB’s second function is reliable cluster communications. GAB provides guaranteed
delivery of point-to-point and broadcast messages to all nodes.
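
The membership rule above (a peer that stops sending heartbeats is marked DOWN and excluded) can be sketched from the point of view of one node tracking its peers. The `Membership` class, its method names, and the timeout value used in the example are assumptions for illustration and are not GAB interfaces or defaults.

```python
from typing import Iterable, Set


class Membership:
    """One node's view of which peers are still cluster members."""

    def __init__(self, nodes: Iterable[str], timeout: float):
        self.timeout = timeout
        self.members: Set[str] = set(nodes)
        self.last_seen = {node: 0.0 for node in self.members}

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat; excluded peers do not rejoin silently."""
        if node in self.members:
            self.last_seen[node] = now

    def sweep(self, now: float) -> Set[str]:
        """Exclude peers whose last heartbeat is older than the timeout."""
        down = {n for n in self.members if now - self.last_seen[n] > self.timeout}
        self.members -= down
        return down
```

For example, with `Membership(["db-server1", "db-server2"], timeout=5.0)`, a heartbeat from db-server1 at time 10.0 followed by `sweep(now=12.0)` excludes only db-server2.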
5.I/O Fencing Module
The I/O fencing module implements a quorum-type functionality to ensure only one
cluster survives a split of the private network. I/O fencing also provides the ability to
perform SCSI-III persistent reservations on failover. The shared VERITAS Volume
Manager disk groups offer complete protection against data corruption by nodes assumed
to be excluded from cluster membership.
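
One common way to realize quorum-type behavior is a race for an odd number of shared arbitration points, where the side that wins a majority survives. The sketch below models only that majority rule; it is an assumption made for illustration, not a description of the fencing module's internals, and `surviving_partition` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Optional


def surviving_partition(point_winners: List[str]) -> Optional[str]:
    """Pick the partition that won a majority of arbitration points.

    point_winners[i] names the partition that won the race for point i.
    Returns None when no partition holds a strict majority.
    """
    if len(point_winners) % 2 == 0:
        raise ValueError("use an odd number of arbitration points")
    winner, count = Counter(point_winners).most_common(1)[0]
    return winner if count > len(point_winners) // 2 else None
```

With two partitions and an odd number of points, exactly one side always reaches a majority, so at most one side of a split keeps running.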