OpenPBS sample scripts (well worth referencing)

Here is a list of frequently asked questions that may answer your questions before you need to ask them.


"Using your account"

  • Login and file transfer

    TELNET, RLOGIN and FTP have been disabled on ABACUS for security reasons. Use SSH or SLOGIN to log into your account on ABACUS, and use SCP to transfer files between machines. Windows users can use "Secure Shell Client" for login and "Secure File Transfer Client" for file transfer. Windows users who have no access to "Secure Shell Client" can download a free SSH client called putty.exe from the PuTTY web page. Those who prefer an FTP-style interface for file transfer can use SFTP instead of FTP.

    The head node is the login node. Its host name is abacus.uwaterloo.ca. The following example shows how to log into the head node. Logging into compute nodes is not recommended, but users can do so when necessary. Home directories look the same no matter which node you log into.

    Example of logging into ABACUS from another UNIX/Linux machine: suppose you are a user on a UNIX machine named monolith and want to log into ABACUS, where your user name is "foobar" and your password is "tricky". Do the following (the text after each prompt is what you type):

    monolith:~% ssh -l foobar abacus.uwaterloo.ca
    
        foobar@abacus's password: tricky
    
        [foobar@head ~]$
    
        
    Example of transferring files between ABACUS and another UNIX/Linux machine: suppose you are a user on the UNIX machine monolith and want to transfer a file named file.txt, located in your home directory on monolith, to ABACUS, where your user name is "foobar" and your password is "tricky". Do the following:
    monolith:~% scp file.txt foobar@abacus.uwaterloo.ca:
    
        foobar@abacus's password: tricky
    
        
    Example of using SFTP: suppose you are a user on the UNIX machine monolith and want to transfer files between monolith and ABACUS, where your user name is "foobar" and your password is "tricky". Do the following:
    monolith:~% sftp foobar@abacus.uwaterloo.ca
    
        foobar@abacus's password: tricky
    
        sftp>
    
        
  • Changing the password

    First, log in to ABACUS, then issue the command 'passwd'. The system will prompt you for the old (existing) password and then ask you to choose a new one. Please follow the password guideline when choosing a new password.

    [foobar@head ~]$ passwd
    
        

  • Login to compute nodes from a head node

    Suppose you have logged into the head node and now want to log into node035 (i.e., quad32g001). Run:

    [foobar@head ~]$ ssh node035
    
        

"Login without using password"


Users can generate an authentication key to log in to ABACUS from another UNIX machine without typing a password. The key is specific to each machine, so each pair of machines must be set up individually. Suppose a user named "foobar" wants to log in to ABACUS from another UNIX machine, monolith. Follow these steps:
monolith:~% ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/foobar/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/foobar/.ssh/id_rsa.
Your public key has been saved in /home/foobar/.ssh/id_rsa.pub.
The key fingerprint is:
0c:44:8c:3e:b9:b4:20:e3:83:4b:19:d9:54:cf:65:35 foobar@monolith

Please note: when the system prompts for a passphrase, just press Enter; do not type a passphrase.
monolith:~% cd .ssh

monolith:~/.ssh% scp id_rsa.pub abacus:

On ABACUS,
[foobar@head ~]$ cd .ssh

If the file authorized_keys does not already exist,
[foobar@head .ssh]$ touch authorized_keys

[foobar@head .ssh]$ cat ~/id_rsa.pub >> authorized_keys

Now user foobar can log in to ABACUS from monolith without typing the password:
monolith:~% ssh foobar@abacus
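
The steps on ABACUS can also be wrapped in a small helper. The following is only a sketch: the function name install_pubkey is made up for illustration, and the permission settings (700 on the directory, 600 on the file) are an assumption about what the SSH server expects, which is common but not stated above. Unlike a plain append, it skips keys that are already present.

```shell
#!/bin/bash
# Hypothetical helper: append a public key to authorized_keys only if it is
# not already listed, and tighten permissions on the directory and file.
install_pubkey() {
    local pubkey_file="$1"
    local ssh_dir="${2:-$HOME/.ssh}"
    local auth_file="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth_file" && chmod 600 "$auth_file" || return 1

    # -F: fixed strings, -x: whole-line match, -f: read patterns from file.
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}
```

On ABACUS, after the scp step, this would be called as: install_pubkey ~/id_rsa.pub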


"Using a job queuing system"


TORQUE/PBS and Maui were installed on ABACUS for batch processing.

The Portable Batch System, PBS, is a workload management system for Linux clusters. It supplies commands to submit, monitor, and delete jobs. It has the following components:

Job Server - also called pbs_server, provides the basic batch services such as receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job.

Job Executor - a daemon (pbs_mom) that actually places the job into execution when it receives a copy of the job from the Job Server, and returns the job's output to the user.

Job Scheduler - a daemon that contains the site's policy controlling which job is run and where and when it is run. PBS allows each site to create its own Scheduler. Maui Scheduler is used on ABACUS.

Below are the steps needed to run a user job:

  • Create a job script containing the PBS options.
  • Submit the job script file to PBS.
  • Monitor the job.

PBS Options

Below are some of the commonly used PBS options in a job script file. Each option line starts with "#PBS".

Option                     Description
======                     ===========
#PBS -N MyJob              Assigns a job name. The default is the name
                           of the PBS job script.
#PBS -l nodes=4:ppn=2      The number of nodes and processors per node.
#PBS -q queuename          Assigns the queue your job will use.
#PBS -l walltime=01:00:00  The maximum wall-clock time during which this
                           job can run.
#PBS -o mypath/my.out      The path and file name for standard output.
#PBS -e mypath/my.err      The path and file name for standard error.
#PBS -j oe                 Join option that merges the standard error stream
                           with the standard output stream of the job.
#PBS -W stagein=file_list  Copies the listed files onto the execution host
                           before the job starts.
#PBS -W stageout=file_list Copies the listed files from the execution host
                           after the job completes.
#PBS -m b                  Sends mail to the user when the job begins.
#PBS -m e                  Sends mail to the user when the job ends.
#PBS -m a                  Sends mail to the user when the job aborts (with
                           an error).
#PBS -m ba                 Combines several mail options in one directive
                           (here, begin and abort). If -m is given on
                           separate lines, only the last one takes effect.
#PBS -r n                  Indicates that a job should not rerun if it fails.
#PBS -V                    Exports all environment variables to the job.
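
As an illustration, several of the options above can be combined at the top of one script. This is a minimal sketch, not a template taken from ABACUS: the resource values are arbitrary, and the fallbacks for PBS_O_WORKDIR and PBS_JOBID are an addition so that the script can also be run by hand for a quick check.

```shell
#!/bin/bash
#PBS -N MyJob
#PBS -l nodes=1:ppn=2
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -m ba
#PBS -r n

# Under PBS, PBS_O_WORKDIR is the directory qsub was run from and PBS_JOBID
# is the job identifier. The fallbacks only apply when run outside PBS.
workdir="${PBS_O_WORKDIR:-$PWD}"
jobid="${PBS_JOBID:-interactive}"

cd "$workdir" || exit 1
echo "Job $jobid (MyJob) running in $workdir on $(uname -n)"
```

To the shell, the "#PBS" lines are ordinary comments; only qsub interprets them.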

Job Script Example

A job script may consist of PBS directives, comments and executable statements. A PBS directive provides a way of specifying job attributes in addition to the command line options.

For example, a simple job script, named geo1.bash, contains the following lines:

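#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -V

# Working directory, program and input file: adjust these for your job.
PBS_O_WORKDIR=/home/huang/temp
myPROG='/home/huang/software/nwchem-4.7/bin/LINUX64_x86_64/nwchem'
myARGS='/home/huang/software/tce-test/geo-0.98.nw'

cd "$PBS_O_WORKDIR" || exit 1
# Send both standard output and standard error to the file out1.
$myPROG $myARGS >& out1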

An example that runs a job on a specific node contains the following lines:
#!/bin/bash
# Request one processor on the named node, node035.
#PBS -l nodes=node035:ppn=1
#PBS -V

PBS_O_WORKDIR=/home/huang/temp
myPROG='/home/huang/software/nwchem-4.7/bin/LINUX64_x86_64/nwchem'
myARGS='/home/huang/software/tce-test/geo-0.98.nw'

cd "$PBS_O_WORKDIR" || exit 1
$myPROG $myARGS >& out1

Another example, an MPI job script named geo2.bash, contains the following lines:
#!/bin/bash
#PBS -l nodes=4:ppn=4
#PBS -V

# Total number of MPI processes: nodes x ppn (4 x 4 = 16).
NCPUS=16
PBS_O_WORKDIR=/home/huang/temp
cd "$PBS_O_WORKDIR" || exit 1

# PBS lists the hosts assigned to this job in the file named by $PBS_NODEFILE.
cat "$PBS_NODEFILE" > .machinefile

myPROG='/home/huang/software/nwchem-4.7/bin/LINUX64_x86_64/nwchem_mpi'
myARGS='/home/huang/software/tce-test/geo-0.98.nw'
MPIRUN='/opt/mpich.pgi/bin/mpirun'

$MPIRUN -np $NCPUS -machinefile .machinefile $myPROG $myARGS >& out2

The job script templates above should be adapted to the needs of your job. You only need to change the variables PBS_O_WORKDIR, myPROG and myARGS. In the MPI script, NCPUS must also stay equal to the number of nodes times ppn if you change the resource request.
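
Instead of hard-coding NCPUS, the processor count can be derived from the node file. The sketch below assumes that the file named by $PBS_NODEFILE contains one line per allocated processor slot, so that nodes=4:ppn=4 gives 16 lines; the function name count_slots is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count the processor slots listed in a PBS node file.
count_slots() {
    local nodefile="$1"
    # grep -c . counts the non-empty lines, one per processor slot.
    grep -c . "$nodefile"
}

# Only meaningful inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NCPUS=$(count_slots "$PBS_NODEFILE")
    echo "Running on $NCPUS processor slots"
fi
```

With this in place, the -np argument follows the "#PBS -l nodes=...:ppn=..." line automatically when the resource request changes.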


Submitting a Job

Use the qsub command to submit the job,

qsub geo2.bash

PBS assigns each job a unique job identifier when it is submitted (e.g. 70.head). This identifier is used later to monitor the status of the job. After a job has been queued, it is selected for execution based on the time it has been in the queue, its wall-clock time limit, and the number of processors it requests.
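
Because qsub prints the job identifier on standard output, a script can capture it for later use. The following is a sketch under that assumption; the helper names submit_job and job_number are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: submit a job script and print the identifier
# that qsub returns (for example 70.head).
submit_job() {
    local script="$1"
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found; run this on the head node" >&2
        return 127
    fi
    qsub "$script"
}

# Hypothetical helper: strip the server suffix, e.g. 70.head becomes 70.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Example use on the cluster:
#   jobid=$(submit_job geo2.bash) && echo "Submitted job $(job_number "$jobid")"
```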


Monitoring a Job

Below are the PBS commands for monitoring a job:

Command       Function
=======       ========
qstat -a      check status of jobs, queues, and the PBS server
qstat -f      get all the information about a job, i.e. resources requested,
              resource limits, owner, source, destination, queue, etc.
qdel JobID    delete a job from the queue
qhold JobID   hold a job if it is in the queue
qrls JobID    release a job from hold


There are some quite useful Maui commands for monitoring a job, too:

Command            Description
=======            ===========
showq              Show a detailed list of submitted jobs
showbf             Show the free resources (time and processors available)
                   at the moment
checkjob JobID     Show a detailed description of the job JobID
showstart JobID    Give an estimate of the expected start time of the
                   job JobID


For example, to check the status of a job,

qstat -f 70.head

or

checkjob 70.head
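
To wait for a job from a script, qstat can be polled in a loop. This is a sketch only: it assumes that qstat exits with a non-zero status once the server no longer knows the job, and a server that keeps completed jobs listed for a while will make the loop run correspondingly longer. The function name wait_for_job is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: poll qstat until the job is no longer listed.
# Usage: wait_for_job JobID [interval_seconds]
wait_for_job() {
    local jobid="$1"
    local interval="${2:-60}"
    if ! command -v qstat >/dev/null 2>&1; then
        echo "qstat not found; run this on the head node" >&2
        return 127
    fi
    # Assumption: qstat fails once the job has left the queue.
    while qstat "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "Job $jobid is no longer known to the server"
}
```

A generous polling interval, such as the default of 60 seconds, keeps the extra load on the PBS server small.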


"File backup"

File systems on the head node are backed up to tape drives once a week. An incremental backup of the /home file system to another Linux machine is done daily. Users are also encouraged to back up their own files to another system or to removable media for safety. For example, to copy files to another UNIX/Linux machine, users can use the rsync or scp commands. To copy files to their PCs, users can use 'SSH Secure File Transfer Client'.
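
A personal backup with rsync might look like the sketch below. The function name backup_dir and the destination shown in the usage line are placeholders, not paths that exist on ABACUS or monolith.

```shell
#!/bin/bash
# Hypothetical helper: mirror a directory to a backup destination.
# The destination may be a local path or a remote one (user@host:path).
backup_dir() {
    local src="$1"
    local dest="$2"
    # -a recurses and preserves permissions and modification times.
    # The trailing slash copies the contents of src, not src itself.
    rsync -a "${src%/}/" "$dest"
}
```

For example, run on ABACUS: backup_dir ~/results foobar@monolith:abacus-backup/
Repeated runs transfer only the files that have changed since the last run.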


"Using the cluster in a courteous way"

You might wonder why your jobs sometimes run slowly. There are many possible explanations for ABACUS's performance, but system load and the NFS file system are the two most common causes.
  • High system load.
    ABACUS has 37 nodes: 33 are dual-CPU systems and 4 are quad-CPU systems. On any node, if more than 2 jobs are running on a dual system, or more than 4 on a quad system, each job effectively gets only part of a CPU for computation. Users are therefore advised to submit jobs through the job queuing system rather than logging into a compute node and running jobs there directly. The queuing system balances the load among the nodes automatically.
  • I/O intensive jobs.
    User home directories are mounted over NFS. No matter which node a job runs on, reads and writes to the /home file system take place on the head node through the NFS mount. Running jobs can slow down significantly if many of them are I/O intensive, since they all access files on the head node at the same time. Users are therefore required to use the scratch space local to the compute nodes for intermediate files created by running programs.
In short, users should use the cluster in a courteous way and should not run too many jobs at one time.
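
The scratch-space advice can be sketched as follows. The location of the local scratch space is not given here, so SCRATCH_ROOT is a placeholder that defaults to /tmp; set it to the real node-local path. The function name run_in_scratch is made up for illustration.

```shell
#!/bin/bash
# Placeholder: replace /tmp with the node-local scratch path on the cluster.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp}"

# Hypothetical helper: run a command in a private scratch directory, then
# copy everything it produced to a results directory (e.g. under /home).
# Usage: run_in_scratch results_dir command [args...]
run_in_scratch() {
    local results_dir="$1"
    shift
    local workdir status
    workdir=$(mktemp -d "$SCRATCH_ROOT/job.XXXXXX") || return 1
    # Subshell: the caller's working directory stays unchanged.
    ( cd "$workdir" && "$@" )
    status=$?
    mkdir -p "$results_dir" && cp -r "$workdir"/. "$results_dir"/
    rm -rf "$workdir"
    return "$status"
}
```

Intermediate files then live on the local disk of the compute node, and /home is written to only once, when the job finishes.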
