Installing and restarting Slurm

1. Install OpenSSL and MUNGE first

2. Install Slurm

Install (as user caoj7)
./configure --prefix=/usr/local --sysconfdir=/usr/local/etc --enable-debug
make
sudo make install

2. slurm.conf (if it is revised, slurmctld and slurmd must be restarted)

Use doc/html/configurator.html to create slurm.conf
# slurm.conf file generated by configurator.easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=vm1
#ControlAddr=
# 
#MailProg=/bin/mail 
MpiDefault=none
#MpiParams=ports=#-# 
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818 
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=caoj7
SlurmdUser=caoj7 
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
# 
# 
# TIMERS 
#KillWait=30 
#MinJobAge=300 
#SlurmctldTimeout=120 
#SlurmdTimeout=300 
# 
# 
# SCHEDULING 
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321 
SelectType=select/linear
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3 
#SlurmctldLogFile=
#SlurmdDebug=3 
#SlurmdLogFile=
# 
# 
# COMPUTE NODES 
NodeName=vm[2-5] CPUs=4 State=UNKNOWN 
PartitionName=compute Nodes=vm[2-5] Default=YES MaxTime=INFINITE State=UP
/usr/local/etc/slurm.conf (revised: SlurmUser=caoj7, SlurmdUser=caoj7)
sudo scp /usr/local/etc/slurm.conf vm2:/usr/local/etc/   (repeat for the other nodes)
sudo chown caoj7:caoj7 /usr/local/etc/slurm.conf   (repeat on the other nodes)
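
The copy step above repeats once per node, so it can be sketched as a loop. The following is a minimal dry-run sketch that only prints the commands instead of running them. The function name print_distribute_cmds is hypothetical, and running chown over ssh is an assumption about how the "repeat on the other nodes" step might be done.

```shell
#!/bin/sh
# Dry run: print the copy and chown commands for each compute node.
# The path, the user caoj7, and the node names vm2..vm5 come from the notes;
# print_distribute_cmds is a hypothetical helper name.
CONF=/usr/local/etc/slurm.conf

print_distribute_cmds() {
    for node in "$@"; do
        echo "sudo scp $CONF $node:/usr/local/etc/"
        # Assumption: the ownership fix is applied remotely over ssh.
        echo "ssh $node sudo chown caoj7:caoj7 $CONF"
    done
}

print_distribute_cmds vm2 vm3 vm4 vm5
```

Review the printed commands first, then paste them into a shell once they look right for the cluster.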

3. Create files and directories
sudo touch /var/run/slurmctld.pid
sudo chown caoj7:caoj7 /var/run/slurmctld.pid
sudo touch /var/run/slurmd.pid
sudo chown caoj7:caoj7 /var/run/slurmd.pid
touch /var/run/slurmd.pid
sudo mkdir /var/spool/slurmd
sudo chown -R caoj7:caoj7 /var/spool/slurmd
sudo touch /var/spool/job_state
sudo chown caoj7:caoj7 /var/spool/job_state
sudo touch /var/spool/resv_state
sudo chown caoj7:caoj7 /var/spool/resv_state
sudo touch /var/spool/node_state
sudo chown caoj7:caoj7 /var/spool/node_state
sudo touch /var/spool/trigger_state
sudo chown caoj7:caoj7 /var/spool/trigger_state
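
The four state files are created the same way, so the touch commands can be folded into one loop. This is a minimal sketch: create_state_files is a hypothetical helper, and STATE_DIR defaults to a scratch directory (an assumption) so the loop can be tried without root. The notes above use /var/spool.

```shell
#!/bin/sh
# Create empty state files under a directory.
# STATE_DIR defaults to a scratch directory so this can be tried without root;
# the notes above use /var/spool together with sudo.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"

create_state_files() {
    dir=$1
    shift
    for name in "$@"; do
        touch "$dir/$name" || return 1
    done
}

create_state_files "$STATE_DIR" job_state resv_state node_state trigger_state
ls "$STATE_DIR"
# On a real node the files would then be handed to the Slurm user, e.g.:
#   sudo chown caoj7:caoj7 /var/spool/job_state   (and likewise for the others)
```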

4. Startup
Master
slurmctld -D -vvvvvv
If /var/run/slurmctld.pid has been removed, use vi to re-create it
Slave
slurmd -D -vvvvvv
If /var/run/slurmd.pid has been removed, use vi to re-create it

5. Errors

slurmctld error: authentication: expired credential
The clocks on the nodes are not synchronized.
date -s "2012-9-3 14:27:00"
Restart munge and slurm

If node002 can't register with the master
This might be caused by ssh
Try ssh to the master node (e.g., node001) from node002
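
Before resetting the date by hand, it can help to measure how far apart two clocks are. The sketch below compares two epoch timestamps against a threshold. clock_skew_ok is a hypothetical helper, and the 300-second default is an assumption chosen for illustration, not a value taken from these notes.

```shell
#!/bin/sh
# Succeed when two epoch timestamps differ by no more than a threshold.
# clock_skew_ok is a hypothetical helper; the 300-second default is an assumption.
clock_skew_ok() {
    local_ts=$1
    remote_ts=$2
    max_skew=${3:-300}
    diff=$((local_ts - remote_ts))
    # Take the absolute value of the difference.
    if [ "$diff" -lt 0 ]; then
        diff=$((-diff))
    fi
    [ "$diff" -le "$max_skew" ]
}

# Example use on a cluster (vm2 is a node name from the notes above):
#   clock_skew_ok "$(date +%s)" "$(ssh vm2 date +%s)" || echo "clocks differ too much"
if clock_skew_ok 1000 1100; then
    echo "within threshold"
else
    echo "skew too large"
fi
```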

salloc error
[caoj7@vm2 mpi]$ salloc -N2
-bash: ./salloc: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory
[caoj7@vm1 mpi]$ ldd /usr/local/bin/salloc
  linux-vdso.so.1 =>  (0x00007fff0ebff000)
  libdl.so.2 => /lib64/libdl.so.2 (0x0000003d3f000000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003d6e000000)
  libc.so.6 => /lib64/libc.so.6 (0x0000003d6dc00000)
  /lib64/ld-linux-x86-64.so.2 (0x0000003d6d400000)

[caoj7@vm1 mpi]$ cd /lib
[caoj7@vm1 lib]$ ln -s /lib64/ld-linux-x86-64.so.2 ld-linux.so.2
Later it failed again; it worked correctly after the symlink was removed with unlink.
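
/lib/ld-linux.so.2 is the loader for 32-bit x86 binaries, so one possible reading of the error is that the ./salloc being run was a 32-bit build, unlike the 64-bit /usr/local/bin/salloc shown by ldd. That is an assumption, not something the notes confirm. A minimal sketch for checking which kind of binary a file is, by reading the class byte of the ELF header (elf_class is a hypothetical helper):

```shell
#!/bin/sh
# Print whether a file is a 32-bit or 64-bit ELF binary.
# Bytes 0-3 of an ELF file are the magic 7f 45 4c 46; byte 4 (EI_CLASS)
# is 1 for 32-bit and 2 for 64-bit. elf_class is a hypothetical helper.
elf_class() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 0
    fi
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case $class in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown" ;;
    esac
}

# Example use:
#   elf_class ./salloc
#   elf_class /usr/local/bin/salloc
```

If the two files report different classes, running the matching build is likely a safer fix than pointing the 32-bit loader name at the 64-bit loader.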

------------------------------------------------------------------
Restart
1. Start munge
[caoj7@vm5 ~]$ sudo /etc/init.d/munge start
2. Start slurmctld or slurmd

[caoj7@vm5 ~]$ slurmd -D -vvvvvv
slurmd: slurmd version 2.4.4 started
slurmd: error: Unable to open pidfile `/var/run/slurmd.pid': Permission denied
slurmd: slurmd started on Fri 30 Nov 2012 09:57:55 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=846
^Cslurmd: error: Unable to remove pidfile `/var/run/slurmd.pid': No such file or directory
slurmd: Slurmd shutdown completing

[caoj7@vm5 ~]$ sudo touch /var/run/slurmd.pid
[caoj7@vm5 ~]$ sudo chown caoj7:caoj7 /var/run/slurmd.pid

[caoj7@vm5 ~]$ slurmd -D -vvvvvv

slurmd: slurmd version 2.4.4 started
slurmd: error: Possible corrupt pidfile `/var/run/slurmd.pid'
slurmd: slurmd started on Fri 30 Nov 2012 09:58:48 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=899
^Cslurmd: error: Unable to remove pidfile `/var/run/slurmd.pid': Permission denied
slurmd: Slurmd shutdown completing

[caoj7@vm5 ~]$ touch /var/run/slurmd.pid 

[caoj7@vm5 ~]$ slurmd -D -vvvvvv

slurmd: slurmd version 2.4.4 started
slurmd: slurmd started on Fri 30 Nov 2012 09:59:14 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=925
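
The transcript above shows slurmd complaining whenever the pid file is missing or not writable by the user that starts the daemon. A minimal pre-flight check for this, assuming the pid file path from slurm.conf (pidfile_ready is a hypothetical helper):

```shell
#!/bin/sh
# Succeed when the pid file exists and is writable by the current user.
# pidfile_ready is a hypothetical helper; the default path is taken from
# SlurmdPidFile in the slurm.conf shown earlier.
pidfile_ready() {
    [ -f "$1" ] && [ -w "$1" ]
}

PIDFILE="${PIDFILE:-/var/run/slurmd.pid}"
if pidfile_ready "$PIDFILE"; then
    echo "$PIDFILE is ready"
else
    echo "$PIDFILE is missing or not writable; re-create it with sudo touch and sudo chown"
fi
```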
