Contents
1. Environment preparation
2. Time synchronization
3. MUNGE authentication
4. Database installation
5. Slurm setup
6. Cluster user management and initial configuration
QOS configuration
Host plan
master 192.168.220.128
node1 192.168.220.129
Disable the firewall, and make sure the two hosts can reach each other by name via /etc/hosts.
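Name resolution can be wired up with /etc/hosts entries matching the plan above. A minimal sketch (it writes to a local demo file so it is safe to run anywhere; on the real nodes, append the same two lines to /etc/hosts):

```shell
# /etc/hosts entries for the two-node plan; written to ./hosts.demo for illustration
cat > hosts.demo <<'EOF'
192.168.220.128 master
192.168.220.129 node1
EOF
cat hosts.demo
```

After this, `ping node1` from master (and `ping master` from node1) should resolve without DNS.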
1. Install the time-sync packages
yum -y install ntp.x86_64
yum -y install ntpdate.x86_64
2. Sync against the Aliyun time server
ntpdate ntp.aliyun.com
3. Stop the firewall and enable ntpd at boot
systemctl stop firewalld
systemctl enable ntpd
systemctl restart ntpd
Configuration on master
1. Make sure munge and the munge user are not already present
yum remove -y munge munge-libs munge-devel
userdel -r munge
2. Add the munge user
export MUNGEUSER=1120
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
3. Install the munge packages
yum install munge munge-devel munge-libs rng-tools -y
4. Create the key and fix ownership
rngd -r /dev/urandom
create-munge-key
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
chown -R munge: /var/lib/munge
chown -R munge: /var/run/munge
chown -R munge: /var/log/munge
5. Copy the key to the client node
scp /etc/munge/munge.key root@node1:/etc/munge/
6. Start the service
systemctl start munge
systemctl enable munge
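The key file must be exactly 1 KiB of random data, readable only by its owner. A small sketch of just the key-generation and permission step (writing to ./munge.key so it can run outside the cluster; the real path is /etc/munge/munge.key, owned by the munge user):

```shell
# Create a 1024-byte random key and lock the permissions down to owner read-only
dd if=/dev/urandom bs=1 count=1024 of=munge.key 2>/dev/null
chmod 400 munge.key
ls -l munge.key
```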
Configuration on node1
1. Make sure munge and the munge user are not already present
yum remove -y munge munge-libs munge-devel
userdel -r munge
2. Add the munge user
export MUNGEUSER=1120
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
3. Install the munge packages
yum install munge munge-devel munge-libs rng-tools -y
4. Fix directory ownership and permissions
rngd -r /dev/urandom
chmod 700 /etc/munge
chown -R munge: /etc/munge
chown -R munge: /var/lib/munge
chown -R munge: /var/run/munge
chown -R munge: /var/log/munge
5. Start the services
systemctl start rngd
systemctl start munge
systemctl enable rngd
systemctl enable munge
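Once munge is running on both nodes, authentication can be verified with munge's built-in round-trip test (these commands must be run on the cluster itself):

```shell
# Encode a credential locally and decode it locally
munge -n | unmunge
# Encode on master, decode on node1 -- both should report STATUS: Success (0)
munge -n | ssh node1 unmunge
```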
1. Install the database
yum install mariadb-server mariadb-devel
2. Set the root password
mysql_secure_installation
mysql -u root -ppassword  # password is the root password you just set
3. In the database shell
# Create the slurm user so it can operate on the slurm_acct_db database; its password is SomePassWD, change it as you see fit
create user 'slurm'@'localhost' identified by 'SomePassWD';
# Create the accounting database slurm_acct_db
create database slurm_acct_db;
# Grant slurm, logging in from localhost with password SomePassWD, full rights on all tables in slurm_acct_db
grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'SomePassWD' with grant option;
# Same grant for slurm logging in from node1
grant all on slurm_acct_db.* TO 'slurm'@'node1' identified by 'SomePassWD' with grant option;
# Create the job-completion database slurm_jobcomp_db
create database slurm_jobcomp_db;
# Grant slurm, logging in from localhost with password SomePassWD, full rights on all tables in slurm_jobcomp_db
grant all on slurm_jobcomp_db.* TO 'slurm'@'localhost' identified by 'SomePassWD' with grant option;
# Same grant for slurm logging in from node1
grant all on slurm_jobcomp_db.* TO 'slurm'@'node1' identified by 'SomePassWD' with grant option;
flush privileges;  ## save the changes
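To confirm the grants took effect, log in as the slurm user and inspect the privileges (a quick check on the cluster, assuming the user and password created above):

```shell
mysql -u slurm -pSomePassWD -e "SHOW GRANTS FOR CURRENT_USER; SHOW DATABASES LIKE 'slurm%';"
```

Both slurm_acct_db and slurm_jobcomp_db should appear in the output.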
Edit the configuration file my.cnf
# The following options will be passed to all MySQL clients
[client]
port=3306
socket=/var/lib/mysql/mysql.sock
default-character-set=utf8mb4
# Here follows entries for some specific programs
[mariadb_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid
# The MySQL server
[mariadb]
# explicit_defaults_for_timestamp = true
datadir=/var/lib/mysql
port = 3306
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION'
interactive_timeout=1200
wait_timeout=1800
skip_name_resolve=OFF
innodb_file_per_table=ON
max_connections=2048
max_connect_errors=1000000
max_allowed_packet=16M
sort_buffer_size=512K
net_buffer_length=16K
read_buffer_size=512K
read_rnd_buffer_size=512K
character_set_server=utf8mb4
collation_server=utf8mb4_bin
thread_stack=256K
thread_cache_size=384
tmp_table_size=96M
max_heap_table_size=96M
# enable slow query log
slow_query_log=OFF
slow_query_log_file=/var/lib/mysql/mysql-slow-query.log
# slow-query threshold in seconds (default 10, minimum 0)
long_query_time=4
local_infile=OFF
# binary logging is required for replication
#log_bin=mysql-bin
# master - slave synchronization settings
log_slave_updates=ON
server-id=1
log-bin=mysql-bin
sync_binlog=1
binlog_checksum = none
binlog_format = mixed
auto-increment-increment = 2
auto-increment-offset = 1
slave-skip-errors = all
# Uncomment the following if you are using InnoDB tables
#innodb_data_home_dir = /var/lib/mysql/
#innodb_data_file_path = ibdata1:10M:autoextend
#innodb_log_group_home_dir = /var/lib/mysql/
#innodb_log_arch_dir = /var/lib/mysql/
# You can set .._buffer_pool_size up to 50 - 80 %
# of RAM but beware of setting memory usage too high
event_scheduler=ON
default_storage_engine=InnoDB
innodb_buffer_pool_size=1024M  #64M # 1024M
innodb_purge_threads=1
innodb_log_file_size=128M
innodb_log_buffer_size=2M
innodb_lock_wait_timeout=900  #120
bulk_insert_buffer_size=32M
myisam_sort_buffer_size=8M
# maximum temporary file size MySQL may use when rebuilding MyISAM indexes
myisam_max_sort_file_size=4G
myisam_repair_threads=1
lower_case_table_names=0
[mysqldump]
quick
max_allowed_packet=16M
#[isamchk]
#key_buffer = 16M
#sort_buffer_size = 16M
#read_buffer = 4M
#write_buffer = 4M
[myisamchk]
key_buffer=16M
sort_buffer_size=16M
read_buffer=4M
write_buffer=4M
# include all files from the config directory
!includedir /etc/my.cnf.d
4. Restart the database
The old InnoDB redo log files must be moved aside first, because innodb_log_file_size was changed above:
cd /var/lib/mysql
mv ib_logfile0 ib_logfile0.bak
mv ib_logfile1 ib_logfile1.bak
systemctl restart mariadb && systemctl enable mariadb
1. Install the build dependencies for the slurm package
yum install -y readline-devel perl-ExtUtils* perl-Switch pam-devel lua-devel hwloc-devel
2. Download the Slurm tarball; the build command below uses slurm-21.08.8-2
Downloads | SchedMD: https://www.schedmd.com/downloads.php
3. Build the rpm packages
rpmbuild -ta --with lua slurm-21.08.8-2.tar.bz2
When the build finishes as root, the Slurm rpm packages are generated under /root/rpmbuild/RPMS/x86_64.
4. Add the slurm user
export SLURMUSER=1121
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
5. Install the rpm packages
rpm -ivh slurm*.rpm
Steps on the master node
6. Create the configuration files
# slurmdbd.conf is the configuration file for the slurmdbd service; it must be owned by the slurm user
touch /etc/slurm/slurmdbd.conf
chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
# slurm.conf is the configuration file for slurmd and slurmctld; it must be owned by root
touch /etc/slurm/slurm.conf
chown root:root /etc/slurm/slurm.conf
# Create the directory where slurmctld stores its state, defined by the StateSaveLocation parameter in slurm.conf:
mkdir /var/spool/slurmctld
chown slurm:slurm /var/spool/slurmctld
# Create the log directory and files, and fix their ownership
mkdir /var/log/slurm
cd /var/log/slurm/
touch slurmd.log
touch slurmctld.log
touch slurmdbd.log
chown slurm:slurm /var/log/slurm
Edit /etc/slurm/slurmdbd.conf and add the following:
AuthType=auth/munge  # authentication method; munge is used here
AuthInfo=/var/run/munge/munge.socket.2  # additional authentication info for communicating with the slurmctld control node
# slurmDBD info
DbdAddr=localhost  # address the slurmdbd node listens on
DbdHost=localhost  # hostname of the slurmdbd node
DbdPort=6819  # slurmdbd port, 6819 by default
SlurmUser=slurm  # user performing the database operations
MessageTimeout=60  # seconds allowed for a message round trip, 10 by default
DebugLevel=5  # debug level: quiet (nothing); fatal (fatal errors only); error (errors only); info (errors and general info); verbose (errors and detailed info); debug through debug5 (errors, detailed info, and progressively more debug output as the number grows)
LogFile=/var/log/slurm/slurmdbd.log  # absolute path of the slurmdbd daemon log file
PidFile=/var/run/slurmdbd.pid  # absolute path of the slurmdbd daemon pid file
# Database info
StorageType=accounting_storage/mysql  # storage backend
StorageHost=localhost  # database host
StoragePort=3306  # database port
StoragePass=SomePassWD  # password of the slurm database user created above
StorageUser=slurm  # database user
StorageLoc=slurm_acct_db  # storage location: the slurm_acct_db database created above
After saving the file, start the slurmdbd service and enable it at boot:
systemctl enable slurmdbd
systemctl restart slurmdbd
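Whether slurmdbd came up and can reach MariaDB can be checked from its log and with sacctmgr (run on the master; exact output depends on your setup):

```shell
systemctl status slurmdbd
tail /var/log/slurm/slurmdbd.log
# list the clusters known to the accounting database
sacctmgr list cluster
```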
Configure the slurmd service
Configless mode is a newer Slurm feature (supported since version 20.02) that lets compute nodes and login nodes fetch their configuration from the slurmctld daemon instead of keeping local copies under /etc/slurm. It requires setting SlurmctldParameters=enable_configless in slurm.conf on the management node.
Edit /etc/slurm/slurm.conf and add the following:
More configuration details are available in the official documentation: Slurm Workload Manager - slurm.conf
# slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
################################################
#                   CONTROL                    #
################################################
ClusterName=hgy  # cluster name
SlurmctldHost=master  # hostname of the management (control) node
SlurmUser=root  # primary user for Slurm
SlurmdUser=root  # user that runs the slurmd service
SlurmctldPort=6817  # slurmctld service port
SlurmdPort=6818  # slurmd service port
AuthType=auth/munge  # use munge authentication when talking to the compute nodes
################################################
#            LOGGING & OTHER PATHS             #
################################################
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
SlurmctldParameters=enable_configless  # enable configless mode
################################################
#                  ACCOUNTING                  #
################################################
AccountingStorageEnforce=associations,limits,qos  # what the accounting storage enforces
AccountingStorageHost=master  # accounting storage (slurmdbd) host
AccountingStoragePass=/var/run/munge/munge.socket.2  # munge socket, identical to AuthInfo in slurmdbd.conf
AccountingStoragePort=6819  # port slurmdbd listens on, 6819 by default
AccountingStorageType=accounting_storage/slurmdbd  # store accounting data via slurmdbd
AccountingStorageTRES=cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu:tesla  # tracked accounting resources
AcctGatherEnergyType=acct_gather_energy/none  # job energy accounting; none means do not collect
AcctGatherFilesystemType=acct_gather_filesystem/none
AcctGatherInterconnectType=acct_gather_interconnect/none
AcctGatherNodeFreq=0
AcctGatherProfileType=acct_gather_profile/none
ExtSensorsType=ext_sensors/none
ExtSensorsFreq=0
################################################
#                     JOBS                     #
################################################
JobCompHost=localhost  # host of the job-completion database (this node)
#JobCompLoc=
JobCompPass=SomePassWD  # database password of the slurm user
JobCompPort=3306  # port of the job-completion database (the MySQL port)
JobCompType=jobcomp/mysql  # store job-completion data in a MySQL database
JobCompUser=slurm  # database user for job-completion data
JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
PrivateData=jobs,usage
DisableRootJobs=NO
################################################
#           SCHEDULING & ALLOCATION            #
################################################
PreemptMode=OFF
PreemptType=preempt/none
PreemptExemptTime=00:00:00
PriorityType=priority/multifactor
SchedulerTimeSlice=300
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
SlurmSchedLogLevel=0
################################################
#                   TOPOLOGY                   #
################################################
TopologyPlugin=topology/none
################################################
#                    TIMERS                    #
################################################
BatchStartTimeout=100
CompleteWait=0
EpilogMsgTime=2000
GetEnvTimeout=10
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=600
SlurmdTimeout=600
WaitTime=0
MessageTimeout=30
TCPTimeout=10
################################################
#                    POWER                     #
################################################
ResumeRate=300
ResumeTimeout=120
SuspendRate=60
SuspendTime=NONE
SuspendTimeout=60
################################################
#                    DEBUG                     #
################################################
DebugFlags=NO_CONF_HASH
################################################
#               PROCESS TRACKING               #
################################################
ProctrackType=proctrack/linuxproc
################################################
#            RESOURCE CONFINEMENT              #
################################################
TaskPlugin=task/affinity
TaskPluginParam=threads
################################################
#                   PRIORITY                   #
################################################
#PrioritySiteFactorPlugin=
PriorityDecayHalfLife=7-00:00:00
PriorityCalcPeriod=00:05:00
PriorityFavorSmall=No
#PriorityFlags=
PriorityMaxAge=7-00:00:00
PriorityUsageResetPeriod=NONE
PriorityWeightAge=0
PriorityWeightAssoc=0
PriorityWeightFairShare=0
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityWeightQOS=1000
################################################
#                    OTHER                     #
################################################
AllowSpecResourcesUsage=No
CoreSpecPlugin=core_spec/none
CpuFreqGovernors=Performance,OnDemand,UserSpace
CredType=cred/munge
EioTimeout=120
EnforcePartLimits=NO
MpiDefault=none
FirstJobId=2
JobFileAppend=0
JobRequeue=1
MailProg=/bin/mail
MaxArraySize=1001
MaxDBDMsgs=24248
MaxJobCount=10000
MaxJobId=67043328
MaxMemPerNode=UNLIMITED
MaxStepCount=40000
MaxTasksPerNode=512
MCSPlugin=mcs/none
ReturnToService=2
RoutePlugin=route/default
TmpFS=/tmp
TrackWCKey=no
TreeWidth=50
UsePAM=0
SwitchType=switch/none
UnkillableStepTimeout=60
VSizeFactor=0
################################################
#                    NODES                     #
################################################
# node definition; add one NodeName line per compute node (node1 etc.), or list
# several nodes in one line if their hardware is identical
NodeName=master CPUs=3 Boards=1 SocketsPerBoard=1 CoresPerSocket=3 ThreadsPerCore=1 RealMemory=2827 State=UNKNOWN
################################################
#                  PARTITIONS                  #
################################################
PartitionName=debug MaxCPUsPerNode=2 Nodes=ALL Default=YES MaxTime=INFINITE State=UP  # partition definition
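The hardware values in the NodeName line do not have to be typed by hand: running `slurmd -C` on a compute node prints a matching NodeName definition that can be pasted into slurm.conf (run on the node itself):

```shell
# Prints the node's hardware as a slurm.conf line, e.g.:
# NodeName=master CPUs=3 Boards=1 SocketsPerBoard=1 CoresPerSocket=3 ThreadsPerCore=1 RealMemory=2827
slurmd -C
```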
Compute node configuration
1. Create the log directory and log file:
mkdir -p /var/log/slurm
cd /var/log/slurm/
touch slurmd.log
2. Edit /lib/systemd/system/slurmd.service and change the ExecStart line as follows:
# before
ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
# after
ExecStart=/usr/sbin/slurmd --conf-server master:6817 -D -s $SLURMD_OPTIONS
After the edit, simply start slurmd and enable it at boot.
3. Restart the service
systemctl daemon-reload
systemctl enable slurmd
systemctl start slurmd
4. Master node configuration
[root@master x86_64]# vim /lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service mariadb.service  ## mariadb.service must be added here, otherwise slurmctld fails to come up after a reboot
At this point sinfo already shows the partition information, and the Slurm installation and configuration are complete.
If a node shows as drain, bring it back with:
[root@master x86_64]# scontrol update NodeName=node1 State=resume  # refresh the node
When all nodes show idle, the cluster is healthy.
[root@master x86_64]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug*    up    infinite  2     idle  master,node1
1. Add an account to the cluster (the default account)
sacctmgr add account normal Description="Default account"
2. Add a user
UIDNOW=1300
useradd test -p test -d /public/home/test -u ${UIDNOW} -s /bin/bash  # note: -p expects an encrypted hash; run `passwd test` afterwards to set a real password
scp /etc/passwd /etc/shadow /etc/group node1:/etc/
3. Add the user to the Slurm cluster
sacctmgr -i add user test DefaultAccount=normal
4. Test
bash-4.2$ srun -n 6 hostname
master
node1
node1
master
node1
master
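Beyond srun, a batch job exercises the scheduler end to end. A minimal sketch of a job script (hypothetical file name test.sh; the partition matches the debug partition configured above):

```shell
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --output=hello_%j.out
srun hostname
```

Submit it with `sbatch test.sh` and watch it with `squeue`; the output file should list both hostnames.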
QOS configuration
1. Add a QOS
[root@master x86_64]# sacctmgr add qos ceshi  ## add the qos
[root@master x86_64]# sacctmgr show qos format=name,priority,user  ## show qos and priorities
[root@master x86_64]# sacctmgr modify qos ceshi set priority=10  ## set qos ceshi to priority 10
[root@master x86_64]# sacctmgr modify user test set qos=ceshi  ## attach the qos to the user
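After a QOS is attached to a user, jobs can request it explicitly at submission time (commands to run on the cluster):

```shell
# submit a job under the ceshi QOS
srun --qos=ceshi -n 2 hostname
# sacct shows which QOS each job ran with
sacct --format=JobID,User,QOS,State
```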
User operations
Query users
sacctmgr show user
Add a user
sacctmgr add user sghpc2 DefaultAccount=acct02 Qos=test_qos
Modify a user
sacctmgr modify user sghpc2 set QoS=normal
Delete a user
sacctmgr delete user username
Reference: "slurm作业调度管理系统配置-集群搭建步骤6" (R★F's CSDN blog): https://blog.csdn.net/xhk12345678/article/details/124710528?spm=1001.2014.3001.5502
This article only fixes the errors encountered while following that blogger's Slurm setup; sincere thanks for the template they provided.