1. Disable the firewall and SELinux
systemctl stop firewalld;systemctl disable firewalld
sed -i -e 's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
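The sed edit above can be tried on a scratch copy before it touches the real file. The following is a minimal sketch; the helper name set_selinux_disabled and the sample file contents are illustrative assumptions, not part of any system tool.

```shell
# Hypothetical helper: apply the same substitution to any SELinux-style config file.
set_selinux_disabled() {
    # $1: path to the config file to edit in place
    sed -i -e 's/^SELINUX=.*/SELINUX=disabled/g' "$1"
}

# Demonstrate on a temporary file with made-up contents.
demo=$(mktemp)
printf '%s\n' '# sample config' 'SELINUX=enforcing' 'SELINUXTYPE=targeted' > "$demo"
set_selinux_disabled "$demo"

# Only the SELINUX= line changes; SELINUXTYPE= is left alone.
selinux_line=$(grep '^SELINUX=' "$demo")
type_line=$(grep '^SELINUXTYPE=' "$demo")
echo "$selinux_line"
echo "$type_line"
rm -f "$demo"
```

The change in /etc/selinux/config only takes effect after a reboot. For the current boot, setenforce 0 switches SELinux to permissive mode.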
2. Set the hostname
vim /etc/hosts
For example, if your IP is 192.168.60.8 and you want the hostname to be xx, add this line:
192.168.60.8 xx
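If you prefer to script this step instead of editing the file in vim, the sketch below appends the entry only when it is not already present. The helper name add_host_entry is a hypothetical name chosen for illustration, and the demo runs against a temporary file rather than /etc/hosts.

```shell
# Hypothetical helper: append "IP hostname" to a hosts file unless that exact line exists.
add_host_entry() {
    # $1: hosts file, $2: IP address, $3: hostname
    grep -qxF "$2 $3" "$1" || printf '%s %s\n' "$2" "$3" >> "$1"
}

# Demonstrate on a temporary file; calling the helper twice must not duplicate the line.
hosts=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' > "$hosts"
add_host_entry "$hosts" 192.168.60.8 xx
add_host_entry "$hosts" 192.168.60.8 xx

entry_count=$(grep -c -xF '192.168.60.8 xx' "$hosts")
echo "$entry_count"
rm -f "$hosts"
```

On a real machine this would be add_host_entry /etc/hosts 192.168.60.8 xx, run as root. Note that /etc/hosts only maps the name to the IP; if the machine's own hostname also needs to change, hostnamectl set-hostname xx does that on systemd-based systems.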
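3. On Red Hat, configure a local yum repository. On CentOS the default repositories work, so no extra setup is needed.
4. Install munge
Here we use munge packages that have already been downloaded (you can also install munge yourself). Copy the packages into the current user's directory, then run the following commands: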
cd munge/
yum -y install perl-Cairo* perl-Cairo-GObject* perl-common-sense* perl-File-FcntlLock* perl-Glib* perl-Glib-devel* perl-Glib-Object-Introspection* perl-Gtk2* perl-Linux-Inotify2* perl-Linux-Inotify2-tests* perl-Pango* perl-Switch*
yum -y install zlib*
yum -y install munge*
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
systemctl start munge
systemctl status munge
If the status shows a green "active", munge started successfully.
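If munge refuses to start, the permissions on the key file are a common thing to check first. The sketch below treats modes 400 and 600 as acceptable; that accepted set and the helper name key_mode_ok are assumptions made for illustration, and the demo uses a temporary file instead of the real key.

```shell
# Hypothetical helper: succeed only if the file mode is 400 or 600
# (assumed here to be the modes a secret key file should have).
key_mode_ok() {
    # $1: path to the key file
    mode=$(stat -c '%a' "$1") || return 1
    case "$mode" in
        400|600) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstrate on a temporary file with a loose and a strict mode.
key=$(mktemp)
chmod 644 "$key"
if key_mode_ok "$key"; then loose=ok; else loose=rejected; fi
chmod 400 "$key"
if key_mode_ok "$key"; then strict=ok; else strict=rejected; fi
echo "644: $loose, 400: $strict"
rm -f "$key"
```

On the real system the call would be key_mode_ok /etc/munge/munge.key. Once the service is running, munge -n | unmunge encodes and decodes a credential locally, which is a quick end-to-end check.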
5. Download the Slurm source package from the official site (I use version 20 here), then run the following commands:
tar -jxvf slurm-20.02.7.tar.bz2
cd slurm-20.02.7/
yum -y install freeipmi hwloc-libs mariadb mariadb-server perl-ExtUtils-MakeMaker gcc python3
yum -y install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad perl-Switch mariadb-devel
./configure --prefix=/usr/local --sysconfdir=/etc/slurm
make -j 4
make install
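Note: with this build, the installation goes to /usr/local, the configuration files live in /etc/slurm (important), and the default lib path is /usr/local/lib/slurm/.
Finally, copy the three service unit files (slurmctld.service, slurmd.service, slurmdbd.service) from the slurm-20.02.7/etc directory to /etc/systemd/system/.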
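The copy step can be sketched as a small loop. The helper name install_units is hypothetical, and the demo below copies empty placeholder files between two temporary directories so that it can run anywhere.

```shell
# Hypothetical helper: copy the three Slurm unit files from a source dir to a target dir.
install_units() {
    # $1: directory containing the unit files, $2: destination directory
    for unit in slurmctld.service slurmd.service slurmdbd.service; do
        cp "$1/$unit" "$2/" || return 1
    done
}

# Demonstrate with placeholder files in temporary directories.
src=$(mktemp -d)
dst=$(mktemp -d)
for unit in slurmctld.service slurmd.service slurmdbd.service; do
    : > "$src/$unit"
done
install_units "$src" "$dst"

copied=$(ls "$dst" | wc -l | tr -d ' ')
echo "$copied"
rm -rf "$src" "$dst"
```

On the real system, pass the etc directory of the unpacked source tree as the first argument and /etc/systemd/system as the second, then run systemctl daemon-reload so systemd picks up the new units.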
6. Start the database and create the Slurm database and user
systemctl start mariadb
mysql
create database slurm_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurm123';
GRANT ALL ON *.* TO 'slurm'@'localhost';
grant all on slurm_db.* to 'slurm'@'localhost' identified by 'slurm123' with grant option;
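The statements above are typed into an interactive mysql session. As a non-interactive alternative, the sketch below generates equivalent SQL that can be piped into mysql. The helper name slurm_db_sql is hypothetical, and this version deliberately grants privileges on the Slurm database only, which is narrower than the `*.*` grant shown above.

```shell
# Hypothetical helper: print the SQL needed to create the database and user.
slurm_db_sql() {
    # $1: database name, $2: user name, $3: password
    cat <<EOF
CREATE DATABASE IF NOT EXISTS $1;
CREATE USER '$2'@'localhost' IDENTIFIED BY '$3';
GRANT ALL ON $1.* TO '$2'@'localhost';
EOF
}

# Generate the statements with the values used in this guide and show them.
sql=$(slurm_db_sql slurm_db slurm slurm123)
echo "$sql"
```

To apply it, pipe the output into the client as root: slurm_db_sql slurm_db slurm slurm123 | mysql. The CREATE USER line fails if the user already exists, so this is meant for a fresh install.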
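7. Copy the three configuration files cgroup.conf, slurm.conf, and slurmdbd.conf to the /etc/slurm directory.
Edit slurm.conf and change ControlMachine, AccountingStorageHost, NodeName, and Nodes to your hostname (check it with the hostname command).
In slurmdbd.conf, set DbdHost to your hostname.
In slurmdbd.conf, set StorageLoc to the name of the database you created above (slurm_db) and StoragePass to the database password (slurm123).
Restrict the permissions on slurmdbd.conf: chmod 600 /etc/slurm/slurmdbd.conf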
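These edits can also be scripted. The sketch below rewrites a Key=value token wherever it appears on a line, which matters because NodeName and Nodes usually share a line with other settings. The helper name set_conf_key is hypothetical, and the sample slurm.conf lines (CPUs=4, PartitionName=debug, and so on) are made-up values for the demo.

```shell
# Hypothetical helper: replace the value of Key=value, at line start or after whitespace.
# Assumes the new value contains no "/" or "&" characters (true for plain hostnames).
set_conf_key() {
    # $1: config file, $2: key, $3: new value
    sed -E -i -e "s/(^|[[:space:]])$2=[^[:space:]]*/\1$2=$3/g" "$1"
}

# Demonstrate on a temporary file with illustrative contents.
conf=$(mktemp)
cat > "$conf" <<'EOF'
ControlMachine=oldhost
AccountingStorageHost=oldhost
NodeName=oldhost CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=oldhost Default=YES
EOF

for key in ControlMachine AccountingStorageHost NodeName Nodes; do
    set_conf_key "$conf" "$key" xx
done

remaining=$(grep -c oldhost "$conf" || true)
node_line=$(grep '^NodeName=' "$conf")
part_line=$(grep '^PartitionName=' "$conf")
cat "$conf"
rm -f "$conf"
```

On the real system, something like set_conf_key /etc/slurm/slurm.conf ControlMachine "$(hostname)" applies the same change, and the helper works equally for DbdHost, StorageLoc, and StoragePass in slurmdbd.conf.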
8. Start the services
systemctl start slurmd
systemctl status slurmd
systemctl start slurmdbd
systemctl status slurmdbd
systemctl start slurmctld
systemctl status slurmctld
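Start the services in order; each should normally show a green "active" status. If one fails, run the corresponding daemon in the foreground with verbose output to see the errors: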
slurmctld -Dvvvvv
slurmdbd -Dvvvvv
slurmd -Dvvvvv
1. If a node is in the down state after startup, you can bring it back with:
scontrol update nodename=XX state=idle
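With more than one node, it can help to generate these commands from the node list instead of typing each name. The sketch below reads "node state" pairs and prints one scontrol command per down node. The helper name resume_commands and the sample node names are illustrative assumptions.

```shell
# Hypothetical helper: read "node state" pairs on stdin and print a
# scontrol command for every node whose state starts with "down".
resume_commands() {
    awk '$2 ~ /^down/ { printf "scontrol update nodename=%s state=idle\n", $1 }'
}

# Demonstrate with made-up node names and states.
sample='node01 idle
node02 down
node03 down*'
cmds=$(printf '%s\n' "$sample" | resume_commands)
echo "$cmds"
```

On a running cluster the input could come from sinfo -h -N -o '%N %T' | resume_commands. The output is only printed, so it can be reviewed before being run. A node that is down because slurmd is unreachable will likely fall back to down until that is fixed.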
2. Disable hyper-threading; otherwise you will run into problems when changing the CPUs value in the configuration file.
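To check whether hyper-threading is currently active, one option is to read the "Thread(s) per core" line from lscpu. The sketch below parses lscpu-style text from standard input; the helper name threads_per_core is hypothetical and the sample output is made up for the demo.

```shell
# Hypothetical helper: print the "Thread(s) per core" value from lscpu-style input.
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Demonstrate with illustrative lscpu output.
sample='CPU(s):              8
Thread(s) per core:  2
Core(s) per socket:  4'
tpc=$(printf '%s\n' "$sample" | threads_per_core)
echo "$tpc"
```

On a real machine, run LC_ALL=C lscpu | threads_per_core so the field label is not localized. A value of 1 means hyper-threading is off; a value of 2 means it is still enabled.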
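If you would rather not download the packages and configuration files yourself, they are available here:
Munge packages and configuration files for the Slurm installation (CSDN download, Linux documentation resources)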