Installing a Slurm Cluster

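Environment: three physical machines, all running Ubuntu 18.04 LTS, with hostnames tian-609-06, tian-609-07, and tian-609-08. tian-609-06 serves as both the control node and a compute node; the other two are compute nodes only.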

1. Install munge and slurm (all machines)

sudo apt install munge slurm-wlm

2. Configure /etc/slurm-llnl/slurm.conf (all machines, identical contents)

# slurm.conf file generated by configurator easy.html. 
# Put this file on all nodes of your cluster. 
# See the slurm.conf man page for more information. 
# 
ControlMachine=tian-609-06 # 
#ControlAddr= 
# 
#MailProg=/bin/mail 
MpiDefault=none 
#MpiParams=ports=#-# 
ProctrackType=proctrack/pgid 
ReturnToService=1 
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid 
#SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid 
#SlurmdPort=6818 
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd 
SlurmUser=slurm 
#SlurmdUser=root 
StateSaveLocation=/var/lib/slurm-llnl/slurmctld 
SwitchType=switch/none 
TaskPlugin=task/none 
# 
# 
# TIMERS 
#KillWait=30 
#MinJobAge=300 
#SlurmctldTimeout=120 
#SlurmdTimeout=300 
# 
# 
# SCHEDULING 
FastSchedule=1 
SchedulerType=sched/builtin 
#SchedulerPort=7321 
SelectType=select/linear 
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageType=accounting_storage/none 
#AccountingStoragePass=/var/run/munge/global.socket.2 
ClusterName=workstation # 
#JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none 
#SlurmctldDebug=3 
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log 
#SlurmdDebug=4 
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log 
# 
# 
# COMPUTE NODES 
NodeName=tian-609-06,tian-609-07,tian-609-08 CPUs=48 Sockets=2 CoresPerSocket=12 RealMemory=257731 ThreadsPerCore=2 State=IDLE
PartitionName=debug Nodes=tian-609-06,tian-609-07,tian-609-08 Default=YES MaxTime=INFINITE State=UP
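
The NodeName line above only makes sense if CPUs equals Sockets x CoresPerSocket x ThreadsPerCore (here 2 x 12 x 2 = 48). A minimal sketch of a consistency check follows; `check_node_cpus` is a hypothetical helper name, not a Slurm tool, and it assumes all four keys appear on the NodeName line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): reads slurm.conf-style text on
# stdin and, for every NodeName line, reports whether
# CPUs == Sockets * CoresPerSocket * ThreadsPerCore.
# Exits non-zero if any line is inconsistent.
check_node_cpus() {
    awk '
    BEGIN { bad = 0 }
    /^NodeName=/ {
        cpus = sockets = cores = threads = 0
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "CPUs")           cpus    = kv[2] + 0
            if (kv[1] == "Sockets")        sockets = kv[2] + 0
            if (kv[1] == "CoresPerSocket") cores   = kv[2] + 0
            if (kv[1] == "ThreadsPerCore") threads = kv[2] + 0
        }
        expected = sockets * cores * threads
        status = (cpus == expected) ? "OK" : "MISMATCH"
        printf "%s CPUs=%d expected=%d\n", status, cpus, expected
        if (cpus != expected) bad = 1
    }
    END { exit bad }
    '
}
```

It could be run as `check_node_cpus < /etc/slurm-llnl/slurm.conf`. To see what the hardware actually reports, `slurmd -C` prints the node's detected configuration in slurm.conf format.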

3. Add each node's hostname and IP address to /etc/hosts (all machines)
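
As a rough illustration of this step, the sketch below shows what the entries might look like and how to check that none is missing. The 192.168.1.x addresses are placeholders (the original does not give the real IPs), and `missing_hosts` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Example /etc/hosts entries (addresses are placeholders, substitute your own):
#   192.168.1.6  tian-609-06
#   192.168.1.7  tian-609-07
#   192.168.1.8  tian-609-08

# Hypothetical helper: print every hostname (arguments 2..n) that has no
# entry in the hosts file given as the first argument.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines; compare the name against each field after the
        # address column, stopping at an inline comment.
        if ! awk -v h="$name" '
            /^[[:space:]]*#/ { next }
            {
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^#/) break
                    if ($i == h) found = 1
                }
            }
            END { exit !found }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

Running `missing_hosts /etc/hosts tian-609-06 tian-609-07 tian-609-08` on each machine should print nothing once the file is complete.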

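4. Start slurm. The daemons authenticate through munge, so munge must be running with the same key on every node (steps 5 and 6); if they were started earlier, restart them afterwards.
sudo systemctl enable slurmctld    # control node tian-609-06
sudo service slurmctld start       # control node tian-609-06
sudo systemctl enable slurmd       # compute nodes tian-609-[06-08]
sudo service slurmd start          # compute nodes tian-609-[06-08]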

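5. Copy /etc/munge/munge.key from the control node to the same directory on the other machines. The file must be owned by the munge user and group (the account munged runs as on Ubuntu), not slurm, and should be readable only by its owner (mode 0400).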
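
One way to confirm the copies are identical is to compare checksums. The sketch below assumes GNU `sha256sum` is available (it is on Ubuntu) and that the key files have been made readable to the caller, for example by running as root; `keys_match` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if all files given as arguments have
# identical contents, judged by their SHA-256 digests.
keys_match() {
    # sha256sum prints "<digest>  <file>"; keep the digest column and
    # count how many distinct digests there are.
    distinct=$(sha256sum "$@" | awk '{ print $1 }' | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}
```

In practice it is often simpler to run `sudo sha256sum /etc/munge/munge.key` on each node and compare the digests by eye. Once munge is running on both sides, `munge -n | ssh tian-609-07 unmunge` should decode successfully if the keys agree.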

6. Start munge
sudo /etc/init.d/munge start    # all nodes

7. Check the state of the slurm cluster

> sinfo 
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   idle tian-609-[06-08]
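
For scripted checks, the default `sinfo` table shown above can be summarized with a little awk. This is only a sketch: `count_idle_nodes` is a hypothetical helper, and it assumes the default column order (NODES in column 4, STATE in column 5).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read default-format sinfo output on stdin and print
# the total number of nodes whose STATE column is exactly "idle".
count_idle_nodes() {
    # NR > 1 skips the header row; "+ 0" prints 0 when nothing matched.
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}
```

With the cluster in the state shown above, `sinfo | count_idle_nodes` would print 3.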

8. Run a command (test)

> srun -N 3 hostname
tian-609-07
tian-609-08
tian-609-06
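
The order of the hostnames in the output varies from run to run, so an automated check should compare sets rather than lines. A minimal sketch, with `all_hosts_reported` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read hostnames on stdin (one per line) and succeed
# only if every expected hostname passed as an argument appears among them.
all_hosts_reported() {
    reported=$(sort -u)
    for name in "$@"; do
        # -x matches the whole line, -F treats the name as a literal string.
        printf '%s\n' "$reported" | grep -qxF -- "$name" || return 1
    done
}
```

For example: `srun -N 3 hostname | all_hosts_reported tian-609-06 tian-609-07 tian-609-08 && echo "all nodes responded"`.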

References:
https://ubuntuforums.org/showthread.php?t=2404746
https://nablacfd.github.io/2019/01/27/Notes-of-installing-slurm-in-Ubuntu-WSL/
