HPC Setup Manual (Part 1)

Installing and Configuring the Slurm Workload Manager

(1) Topology and Node Roles

Node   Role              Address
node1  management node   192.168.101.1
node2  compute node      192.168.101.2
node3  compute node      192.168.101.3
node4  compute node      192.168.101.4
node5  compute node      192.168.101.5

(2) Preparation (on every node)

  1. Install tools

    yum install net-tools wget vim nfs-utils rpcbind ntp ntpdate
    
  2. Configure hostnames and host-to-IP mapping

    vim /etc/hostname
    Set it to the node's own name
    
    vim /etc/hosts
    
    192.168.101.1 node1
    192.168.101.2 node2
    192.168.101.3 node3
    192.168.101.4 node4
    192.168.101.5 node5
    
  3. Configure passwordless root SSH login

    vim /etc/ssh/sshd_config
    Set PermitRootLogin yes
    Set PasswordAuthentication yes
    Set PubkeyAuthentication yes
    Restart sshd to apply the changes: systemctl restart sshd

    Generate an SSH key pair
    ssh-keygen -t rsa


    Copy the public key to every node (ssh-copy-id does not expand the node[1-5] pattern, so loop over the hosts)

    for h in node{1..5}; do ssh-copy-id -i ~/.ssh/id_rsa.pub root@$h; done

  4. Open firewall ports within the cluster network

    firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.101.0/24 port port=1-65535 protocol=udp accept'
    firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.101.0/24 port port=1-65535 protocol=tcp accept'
    firewall-cmd --reload
    
  5. Disable SELinux

    vim /etc/selinux/config
    Change SELINUX=enforcing to SELINUX=disabled

    Reboot for the change to take effect (setenforce 0 switches to permissive mode immediately, until the next boot)
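
    Once every node has been prepared and rebooted, a quick loop from the management node confirms name resolution and passwordless SSH (a minimal check, assuming the hostnames above):

    for h in node{1..5}; do ssh root@$h hostname; done
    Each node should print its own hostname without prompting for a password.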
    

(3) Configure NFS Storage

  1. Management node

    Enable the related services at boot
    systemctl enable rpcbind
    systemctl enable nfs-server
    systemctl enable nfs-lock
    systemctl enable nfs-idmap
    Start the services
    systemctl start rpcbind
    systemctl start nfs-server
    systemctl start nfs-lock
    systemctl start nfs-idmap
    

    Configure the shared paths

    mkdir /workspace
    mkdir /rhome
    vim /etc/exports
    Add /workspace   192.168.101.0/24(rw)
    Add /rhome       192.168.101.0/24(rw)

    Run exportfs -a to make the exports take effect
    
  2. Compute nodes

    mkdir /workspace
    mkdir /rhome
    vim /etc/fstab
    Add 192.168.101.1:/workspace /workspace nfs defaults 0 0
    Add 192.168.101.1:/rhome     /rhome     nfs defaults 0 0

    Run mount -a to mount the directories
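
    To verify the shares (a minimal check; showmount queries the exports published by the management node):

    showmount -e 192.168.101.1
    df -h /workspace /rhome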
    

(4) Install munge

  1. Management node

    • Download the munge source tarball (munge-0.5.15.tar.xz)

    • Build the RPM packages

      Install the build tools
      yum install rpmdevtools gcc bzip2-devel openssl-devel zlib-devel
      Build the RPMs
      rpmbuild -tb --without verify munge-0.5.15.tar.xz

      rpm -ivh rpmbuild/RPMS/x86_64/munge*
      Copy the generated RPMs to /workspace so the other nodes can install them
      cp -r rpmbuild/RPMS/x86_64/ /workspace
      
    • Generate munge.key

      sudo -u munge /usr/sbin/mungekey --verbose

      chown munge:munge /etc/munge/munge.key
      Copy the generated key to /workspace for the other nodes
      cp /etc/munge/munge.key /workspace/munge.key
      
    • Start the service: systemctl start munge && systemctl enable munge

  2. Compute nodes

    rpm -ivh /workspace/x86_64/*

    cp /workspace/munge.key /etc/munge/
    chown munge:munge /etc/munge/munge.key

    systemctl start munge
    systemctl enable munge
    
  3. Test munge

    munge -n | ssh [compute node] unmunge
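
    A fuller check from the management node (a minimal sketch, assuming the hostnames above); unmunge should report STATUS: Success (0) for every node:

    munge -n | unmunge                      # local round trip
    for h in node{2..5}; do
        echo "== $h =="
        munge -n | ssh $h unmunge | grep STATUS
    done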

(5) Configure NTP Time Synchronization (every node)

ntpdate ntp.aliyun.com    # one-off clock sync
systemctl start ntpd
systemctl enable ntpd
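
To confirm that ntpd is actually synchronizing, query its peers (a minimal check):

ntpq -p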

(6) Install MySQL

  1. Management node

    rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022
    yum install mysql80-community-release-el7-6.noarch.rpm
    yum install mysql-community-{server,client,common,libs,devel}-*
    
    systemctl start mysqld
    systemctl enable mysqld
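
    The MySQL community packages generate a temporary root password on first start; it is needed to log in before creating the slurm user, and MySQL forces you to change it first (a minimal sketch; [NEW_ROOT_PASSWORD] is a placeholder):

    grep 'temporary password' /var/log/mysqld.log
    mysql -u root -p
    mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY '[NEW_ROOT_PASSWORD]';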
    
    Create the slurm MySQL user and accounting database

    mysql> create user 'slurm'@'localhost' identified by 'password';
    mysql> grant all on slurm_acct_db.* to 'slurm'@'localhost';
    mysql> create database slurm_acct_db;
    
  2. Compute nodes

    rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022
    yum install mysql80-community-release-el7-6.noarch.rpm
    yum install mysql-community-devel
    

(7) Install Slurm

  1. Create the user

    Create the slurm user on every node (with the same UID/GID on all nodes): useradd slurm

  2. Build the Slurm RPM packages

    Install dependencies
    yum install hwloc

    Download the Slurm source tarball

    rpmbuild -ta slurm*.tar.bz2 (if it complains about missing dependencies, install them and rebuild)
    cp -r rpmbuild/RPMS/x86_64 /workspace/slurm_rpm 
    
  3. Install and configure Slurm

    rpm -ivh /workspace/slurm_rpm/*
    

    slurm.conf (all nodes) can be generated with the official Slurm configurator tool.

    Reference configuration:

    ControlMachine=node1 
    ControlAddr=192.168.101.1  #controller IP
    ClusterName=cluster
    #MailProg=/bin/mail
    MpiDefault=none
    #MpiParams=ports=#-#
    ProctrackType=proctrack/cgroup
    ReturnToService=1
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/var/spool/
    SlurmUser=slurm
    SlurmdUser=root
    StateSaveLocation=/var/spool/slurmctld
    SwitchType=switch/none
    TaskPlugin=task/cgroup
    
    PrologFlags=CONTAIN
    #
    #
    # TIMERS
    #KillWait=30
    #MinJobAge=300
    #SlurmctldTimeout=120
    #SlurmdTimeout=300
    #
    
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    SelectType=select/cons_res
    #SelectTypeParameters=
    #
    #
    # LOGGING AND ACCOUNTING
    #JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/cgroup
    #SlurmctldDebug=info
    SlurmctldLogFile=/var/log/slurmctld.log
    #SlurmdDebug=info
    SlurmdLogFile=/var/log/slurmd.log
    #
    #
    # Accounting
    AccountingStorageHost=127.0.0.1 #database host
    #AccountingStoragePass=
    AccountingStoragePort=6819
    AccountingStorageType=accounting_storage/slurmdbd
    #
    # COMPUTE NODES
    NodeName=node[1-5] CPUs=40  Sockets=2 CoresPerSocket=10 ThreadsPerCore=2  State=UNKNOWN                                  
    PartitionName=workspace Nodes=ALL Default=YES MaxTime=INFINITE State=UP
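
    The CPUs/Sockets/CoresPerSocket/ThreadsPerCore values must match the real hardware. After the slurm RPMs are installed, slurmd -C prints a NodeName line describing the node it runs on, which can be pasted here (example output is illustrative):

    slurmd -C
    # NodeName=node2 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 ...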
    

    slurmdbd.conf (management node) reference configuration:

    #
    # Example slurmdbd.conf file.
    #
    # See the slurmdbd.conf man page for more information.
    #
    # Archive info
    #ArchiveJobs=yes
    #ArchiveDir="/tmp"
    #ArchiveSteps=yes
    #ArchiveScript=
    #JobPurge=12
    #StepPurge=1
    #
    # Authentication info
    AuthType=auth/munge
    AuthInfo=/var/run/munge/munge.socket.2
    #
    # slurmDBD info
    DbdAddr=127.0.0.1
    DbdHost=localhost
    DbdPort=6819
    SlurmUser=slurm
    #MessageTimeout=300
    DebugLevel=verbose
    #DefaultQOS=normal,standby
    LogFile=/opt/slurm/log/slurmdbd.log
    PidFile=/opt/slurm/log/slurmdbd.pid
    #PluginDir=/usr/lib/slurm
    #PrivateData=accounts,users,usage,jobs
    #TrackWCKey=yes
    #
    # Database info
    StorageType=accounting_storage/mysql
    # Database connection info
    StorageHost=127.0.0.1
    StoragePort=3306
    StoragePass=[PASSWORD]
    StorageUser=slurm
    # Database name
    StorageLoc=slurm_acct_db
    

    cgroup.conf (all nodes) reference configuration:

    ###
    #
    # Slurm cgroup support configuration file
    #
    # See man slurm.conf and man cgroup.conf for further
    # information on cgroup configuration parameters
    #--
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=no
    CgroupMountpoint=/sys/fs/cgroup
    
  4. After configuring, copy slurm.conf and cgroup.conf to /etc/slurm/ on every node (slurmdbd.conf stays on the management node only), then fix ownership:

    chown slurm:slurm /etc/slurm/slurm.conf
    chown slurm:slurm /etc/slurm/slurmdbd.conf
    chmod 600 /etc/slurm/slurmdbd.conf    # slurmdbd.conf should be readable only by the slurm user
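
    A minimal sketch for pushing the configuration files from the management node to the compute nodes (assuming the hostnames above):

    for h in node{2..5}; do scp /etc/slurm/slurm.conf /etc/slurm/cgroup.conf root@$h:/etc/slurm/; done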

  5. Start the services

    Management node: systemctl start slurmdbd && systemctl start slurmctld && systemctl start slurmd

    and enable them at boot

    Compute nodes: systemctl start slurmd, and enable it at boot (see the sketch below)
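
    Enabling the services at boot (sketch):

    systemctl enable slurmdbd slurmctld slurmd    # management node
    systemctl enable slurmd                       # compute nodes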

  6. Test

    sinfo
    PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
    workspace*    up   infinite      5   idle node[1-5]
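
    If all nodes report idle, a trivial job run confirms the whole stack end to end (a minimal check):

    srun -N5 hostname
    # each of node1..node5 should print its hostname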
    

Installing OpenMPI

Open MPI is a high-performance, open-source implementation of the MPI (Message Passing Interface) standard. It makes it straightforward to turn serial programs into parallel ones that run across many cores and nodes, and it has good support for C and Fortran HPC codes.

Preparation (all nodes)

yum install gcc gcc-c++ gcc-gfortran make 

Download the OpenMPI source tarball (openmpi-4.1.4.tar.gz), then extract it:

tar -zxf openmpi-4.1.4.tar.gz

Build

cd openmpi-4.1.4
./configure --prefix=/opt/openmpi/4.1.4/ CC=gcc CXX=g++ FC=gfortran
make -j40 && make install

Set environment variables

vim  /etc/profile
Append at the end:
OPENMPI=/opt/openmpi/4.1.4
PATH=$OPENMPI/bin:$PATH
LD_LIBRARY_PATH=$OPENMPI/lib:$LD_LIBRARY_PATH
INCLUDE=$OPENMPI/include:$INCLUDE
CPATH=$OPENMPI/include:$CPATH
MANPATH=$OPENMPI/share/man:$MANPATH
export PATH
export LD_LIBRARY_PATH
export INCLUDE
export CPATH
export MANPATH

source /etc/profile
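
A quick way to confirm the installation (a minimal check; the cross-node run assumes OpenMPI is available at the same path on every node, and running as root requires adding --allow-run-as-root):

mpirun --version
mpirun -np 4 hostname                          # run 4 copies locally
mpirun -np 2 --host node1,node2 hostname       # run across two nodes over passwordless SSH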
