References
Setup AWS ParallelCluster 3.0 with AWS Cloud9 (Level 200)
HPC For Public Sector Customers (Level 200)
HPC pcluster workshop (Level 200)
Running CFD on AWS ParallelCluster at scale (Level 400)
Tutorial on how to run CFD on AWS ParallelCluster (Level 400)
Running WRF on AWS ParallelCluster (Level 300)
Slurm REST API, Accounting and Federation on AWS ParallelCluster (Level 400)
Running Fire Dynamics CFD Simulation on AWS ParallelCluster at scale (Level 200)
Spack Tutorial on AWS ParallelCluster
AWS ParallelCluster is an open-source cluster management tool supported by AWS. It lets customers get started easily and update and scale HPC cluster environments in the AWS Cloud within minutes. It supports several job schedulers, such as AWS Batch, SGE, Torque, and Slurm (ParallelCluster 3.x no longer supports the SGE and Torque schedulers), to simplify job submission.
Install the pcluster CLI. It depends on the AWS CDK to generate CloudFormation templates, so a Node.js environment must be installed beforehand.
virtualenv pvenv
source pvenv/bin/activate
pip3 install --upgrade "aws-parallelcluster"
# pip install aws-parallelcluster==3.2.1
pcluster version
# pip install aws-parallelcluster --upgrade --user
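If Node.js is not already available, one option is to install it via nvm (a sketch; the nvm release below is just an example):

# assumption: using nvm to provide the Node.js runtime that the CDK dependency needs
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
source ~/.nvm/nvm.sh
nvm install --lts
node --version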
Generate the cluster configuration
$ pcluster configure --config cluster-config.yaml --region cn-north-1
An example configuration file is shown below; for the network settings, see the ParallelCluster networking section later in these notes.
ParallelCluster 3 supports the Slurm and AWS Batch schedulers; only Slurm is covered here.
Region: cn-north-1
Image:
  Os: ubuntu1804
HeadNode:
  InstanceType: m5.large
  Networking:
    SubnetId: subnet-027025e9d9760acdd
  Ssh:
    KeyName: cluster-key
  CustomActions: # custom bootstrap actions
    OnNodeConfigured:
      Script: s3://parallelcluster-2e9f54dea72c8134-v1-do-not-delete/script/hello.sh
  Iam:
    S3Access: # S3 access permissions
      - BucketName: parallelcluster-2e9f54dea72c8134-v1-do-not-delete
        EnableWriteAccess: false # read-only
    AdditionalIamPolicies: # ECR access permissions
      - Policy: arn:aws-cn:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: false # default is false: use Route 53 for DNS resolution
    ScaledownIdletime: 3 # scale in after the queue has been idle for 3 minutes
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: c5large
          DisableSimultaneousMultithreading: false
          Efa:
            Enabled: false
            GdrSupport: false
          InstanceType: c5.large
          MinCount: 1 # number of static instances
          MaxCount: 10
      Iam:
        AdditionalIamPolicies: # ECR access permissions
          - Policy: arn:aws-cn:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess
      Networking:
        SubnetIds:
          - subnet-027025e9d9760acdd
SharedStorage:
  - FsxLustreSettings: # FSx for Lustre shared storage
      StorageCapacity: 1200
    MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
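The OnNodeConfigured script referenced above is an ordinary shell script stored in S3; its contents are not included in these notes. A minimal sketch of such a post-install script might look like the following (make sure it uses Unix line endings, see the troubleshooting notes at the end):

#!/bin/bash
# hypothetical OnNodeConfigured post-install script (the real hello.sh is not shown in these notes)
set -e
echo "node configured on $(hostname)" >> /var/log/postinstall.log
# install any extra OS packages that jobs need, for example:
# apt-get update && apt-get install -y htop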
Create the cluster
By default, the cluster that ParallelCluster creates does not enable VPC Flow Logs.
$ pcluster create-cluster --cluster-name mycluster --cluster-configuration cluster-config.yaml
Describe the cluster
$ pcluster describe-cluster --cluster-name mycluster
{
"creationTime": "2023-01-1xT01:33:01.470Z",
"version": "3.4.1",
"clusterConfiguration": {
"url": "https://parallelcluster-2e9f54dea72c8134-v1-do-not-delete.s3.cn-north-1.amazonaws.com.cn/parallelcluster/3.4.1/clusters/mycluster-69tt2sf5bgsldktx/configs/cluster-config.yaml?versionId=Q7XO1MF.LE4sh3d.K06n49CmQIirsb3k&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAQRIBWRJKH4DPNLIN%2F20230116%2Fcn-north-1%2Fs3%2Faws4_request&..."
},
"tags": [...],
"cloudFormationStackStatus": "CREATE_IN_PROGRESS",
"clusterName": "mycluster",
"computeFleetStatus": "UNKNOWN",
"cloudformationStackArn": "arn:aws-cn:cloudformation:cn-north-1:xxxxxxxxxxx:stack/mycluster/b6ea1050-953d-11ed-ad17-0e468cb97d98",
"region": "cn-north-1",
"clusterStatus": "CREATE_IN_PROGRESS",
"scheduler": {
"type": "slurm"
}
}
Accessing the configuration URL returns a permission error; it appears to be a presigned URL that is unreachable because the China-region account has not completed ICP filing.
The configuration object is stored in a dedicated S3 bucket.
Describe the cluster instances
$ pcluster describe-cluster-instances --cluster-name mycluster
{
"instances": [
{
"launchTime": "2023-01-xxT01:35:50.000Z",
"instanceId": "i-0c3xxxxxxxd164",
"publicIpAddress": "xx.xx.xx.xx",
"instanceType": "m5.large",
"state": "running",
"nodeType": "HeadNode",
"privateIpAddress": "172.31.20.150"
}
]
}
Update the cluster
$ pcluster update-cluster -n mycluster -c cluster-config.yaml
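Some configuration changes can only be applied while the compute fleet is stopped. In that case stop the fleet, update, then start it again, for example:

# stop the compute fleet before applying updates that require it
$ pcluster update-compute-fleet --cluster-name mycluster --status STOP_REQUESTED
$ pcluster update-cluster --cluster-name mycluster --cluster-configuration cluster-config.yaml
# start the fleet again once the update has completed
$ pcluster update-compute-fleet --cluster-name mycluster --status START_REQUESTED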
Delete the cluster
$ pcluster delete-cluster --cluster-name mycluster
Connect to the cluster
$ pcluster ssh --cluster-name mycluster -i /home/ec2-user/.ssh/cluster-key.pem
Export the cluster logs; the CLI stages them in an S3 bucket and downloads the archive in one step, so there is no separate export-to-S3-then-download workflow anymore.
$ pcluster export-cluster-logs --cluster-name mycluster --region cn-north-1 \
--bucket zhaojiew-test --bucket-prefix logs --output-file /tmp/archive.tar.gz
$ tar -xzvf /tmp/archive.tar.gz
mycluster-logs-202301160516/cloudwatch-logs/ip-172-31-17-51.i-0b3f352aa1a503b5a.cloud-init
mycluster-logs-202301160516/cloudwatch-logs/ip-172-31-17-51.i-0b3f352aa1a503b5a.cloud-init-output
mycluster-logs-202301160516/cloudwatch-logs/ip-172-31-17-51.i-0b3f352aa1a503b5a.computemgtd
mycluster-logs-202301160516/cloudwatch-logs/ip-172-31-17-51.i-0b3f352aa1a503b5a.slurmd
mycluster-logs-202301160516/cloudwatch-logs/ip-172-31-17-51.i-0b3f352aa1a503b5a.supervisord
...
mycluster-logs-202301160516/mycluster-cfn-events
(1) JobRequeue
Controls the default ability for batch jobs to be requeued. A job may be restarted, for example, after scheduled downtime, after recovery from a node failure, explicitly by a system administrator, or upon preemption by a higher-priority job.
Error reported when a job is cancelled due to preemption:
slurmstepd: error: *** JOB 63830645 ON p08r06n17 CANCELLED AT 2020-08-18T21:40:52 DUE TO PREEMPTION ***
With ParallelCluster, failed jobs are automatically requeued.
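Individual jobs can also opt in or out of requeueing explicitly, regardless of the cluster default, for example:

# opt a single job in or out of requeueing at submission time
$ sbatch --requeue hellojob.sh
$ sbatch --no-requeue hellojob.sh
# an administrator can requeue a job manually (job ID is a placeholder)
$ scontrol requeue <jobid>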
(2) Backfill
https://hpc.nmsu.edu/discovery/slurm/backfill-and-checkpoints/
When a job running in a backfill partition is suspended, it is restarted and recomputed from the beginning once the higher-priority job finishes executing.
IBM's LSF documentation explains backfill scheduling fairly clearly, although it is not obvious how it differs from Slurm's implementation:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=jobs-backfill-scheduling
Introducing new backfill-based scheduler for SLURM resource manager
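Backfill scheduling relies on accurate wall-clock limits: a short job with an explicit time limit can be slotted onto idle nodes while a larger, higher-priority job is still waiting for resources. A sketch:

# give the scheduler an accurate time limit so the job becomes a backfill candidate
$ sbatch --time=00:10:00 --ntasks=1 hellojob.sh
# show the expected start times the scheduler has computed for pending jobs
$ squeue --start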
View the nodes (a trailing ~ in the STATE column marks cloud nodes that are currently powered down):
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
queue1* up infinite 9 idle~ queue1-dy-c5large-[1-9]
queue1* up infinite 1 down~ queue1-st-c5large-1
Submit a job
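The contents of hellojob.sh are not included in these notes; consistent with the slurm-1.out output shown below, it could be as simple as this hypothetical script:

#!/bin/bash
#SBATCH --job-name=hellojob
# hypothetical hellojob.sh, consistent with the output in slurm-1.out below
echo "Hello World from $(hostname)"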
$ sbatch hellojob.sh
View the job queue
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 queue1 hellojob ubuntu R 0:01 1 queue1-st-c5large-1
$ squeue --format="%.3i %.9P %.40j %.8T %.10M %.6D %.30R %E"
JOB PARTITION NAME STATE TIME NODES NODELIST(REASON) DEPENDENCY
2 queue1 hellojob.sh RUNNING 0:02 1 queue1-st-c5large-1 (null)
$ squeue --states=RUNNING -o "%i" --noheader
2
Cancel jobs
$ scancel $(squeue --states=RUNNING -o "%i" --noheader)
View the job output
$ cat slurm-1.out
Hello World from queue1-st-c5large-1
You can also SSH directly into a compute node:
$ ssh queue1-st-c5large-1
Cluster diagnostics and configuration
$ sdiag
$ scontrol show config | grep -i time
BatchStartTimeout = 10 sec
BOOT_TIME = 2023-01-16T03:19:54
EioTimeout = 60
EpilogMsgTime = 2000 usec
GetEnvTimeout = 2 sec
GroupUpdateTime = 600 sec
LogTimeFormat = iso8601_ms
MessageTimeout = 60 sec
OverTimeLimit = 0 min
PreemptExemptTime = 00:00:00
PrologEpilogTimeout = 65534
ResumeTimeout = 1800 sec
SchedulerTimeSlice = 30 sec
SlurmctldTimeout = 300 sec
SlurmdTimeout = 180 sec
SuspendTime = 180 sec
SuspendTimeout = 120 sec
TCPTimeout = 2 sec
UnkillableStepTimeout = 180 sec
WaitTime = 0 sec
PMIxTimeout = 300
cat > hello.c << EOF
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
// Initialize the MPI environment
MPI_Init(NULL, NULL);
// Get the number of processes
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of the process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
// Get the name of the processor
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
// Print off a hello world message
printf("Hello world from processor %s, rank %d out of %d processors\n",
processor_name, world_rank, world_size);
// Finalize the MPI environment.
MPI_Finalize();
}
EOF
Compile and run it:
$ mpicc -o hello hello.c
$ mpirun -n 4 hello
Hello world from processor ip-172-31-23-84, rank 0 out of 4 processors
Hello world from processor ip-172-31-23-84, rank 1 out of 4 processors
Hello world from processor ip-172-31-23-84, rank 2 out of 4 processors
Hello world from processor ip-172-31-23-84, rank 3 out of 4 processors
Submit it as a batch job:
cat > hello.sbatch << EOF
#!/bin/bash
#SBATCH --job-name=hello-world
#SBATCH --ntasks-per-node=2
#SBATCH --output=/fsx/logs/%x_%j.out
set -x
module load openmpi
mpirun /home/ubuntu/hello
sleep 10
EOF
mkdir -p /fsx/logs
sbatch -N2 /home/ubuntu/hello.sbatch
View the job output and execution trace:
$ cat hello-world_4.out
+ module load openmpi
+ mpirun /home/ubuntu/hello
Hello world from processor queue1-dy-c5large-1, rank 0 out of 4 processors
Hello world from processor queue1-dy-c5large-1, rank 1 out of 4 processors
Hello world from processor queue1-st-c5large-1, rank 3 out of 4 processors
Hello world from processor queue1-st-c5large-1, rank 2 out of 4 processors
+ sleep 10
Two long-running test jobs, one shared and one exclusive, can be used to exercise scale-out:
cat > ~/slurm/sleep.sbatch << EOF
#!/bin/bash
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --job-name=sleep-inf
sleep inf
EOF
cat > ~/slurm/sleep-exclusive.sbatch << EOF
#!/bin/bash
#SBATCH --exclusive
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --job-name=sleep-inf-exclusive
sleep inf
EOF
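Submitting several of these jobs forces dynamic nodes to launch; one way to exercise scale-out (the job count is arbitrary):

# submit a few exclusive sleep jobs so each one claims a whole node
for i in $(seq 1 5); do sbatch ~/slurm/sleep-exclusive.sbatch; done
# watch the queue and node states while the cluster scales out
watch -n 10 'squeue; sinfo'
# cancel everything afterwards so the dynamic nodes can scale back in
scancel --user $USER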
ParallelCluster has fairly strict networking requirements for the cluster: the VPC must have DNS Resolution and DNS Hostnames enabled.
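These VPC attributes can be checked and enabled with the AWS CLI, for example (the VPC ID is a placeholder):

# check the current DNS settings of the VPC
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsHostnames
# enable both attributes if they are disabled
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-hostnames "{\"Value\":true}"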
Possible network configurations include:
(1) A single public subnet
(2) Head node in a public subnet, compute nodes in a private subnet behind a NAT gateway
(3) Over AWS Direct Connect with an HTTP proxy
(4) Private subnets only, using VPC endpoints
The following VPC endpoints must be configured:
Service | Service name | Type |
---|---|---|
Amazon CloudWatch | com.amazonaws.region-id.logs | Interface |
AWS CloudFormation | com.amazonaws.region-id.cloudformation | Interface |
Amazon EC2 | com.amazonaws.region-id.ec2 | Interface |
Amazon S3 | com.amazonaws.region-id.s3 | Gateway |
Amazon DynamoDB | com.amazonaws.region-id.dynamodb | Gateway |
AWS Secrets Manager (required for the AD integration) | com.amazonaws.region-id.secretsmanager | Interface |
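As a sketch, one of the interface endpoints can be created with the AWS CLI like this (all IDs are placeholders; substitute the real region in the service name, and note that China regions may prefix interface endpoint service names with cn.):

# create the EC2 interface endpoint in the cluster VPC (placeholder IDs)
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.region-id.ec2 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled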
Disable the managed Route 53 DNS (ParallelCluster creates it by default, but Route 53 does not offer a VPC endpoint) and use EC2 DNS hostnames instead:
Scheduling:
  ...
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
Only the Slurm scheduler is supported for this.
https://docs.aws.amazon.com/zh_cn/parallelcluster/latest/ug/building-custom-ami-v3.html
Prefer node bootstrap actions (custom scripts) over building a custom AMI for node customization, because a custom AMI has to be rebuilt for every ParallelCluster upgrade.
Starting with 3.0.0, pcluster supports building AMIs; it relies on the EC2 Image Builder service to build the custom AMI.
Create the build configuration. InstanceType and ParentImage are required. By default the build instance is launched in the default VPC (it needs internet access); here a public subnet is configured explicitly.
https://docs.amazonaws.cn/zh_cn/parallelcluster/latest/ug/Build-v3.html
$ cat > image-config.yaml << EOF
Build:
  InstanceType: c5.4xlarge
  ParentImage: ami-0e726e5c50ce7e38f
  SubnetId: subnet-xxxxxxxxx
  SecurityGroupIds:
    - sg-xxxxxxxxx
  UpdateOsPackages:
    Enabled: true
EOF
The Image Builder build can take more than an hour. The main steps are:
Create the supporting infrastructure via CloudFormation
Add the ParallelCluster custom components
https://catalog.us-east-1.prod.workshops.aws/workshops/e2f40d13-8082-4718-909b-6cdc3155ae41/en-US/examples/custom-ami
Launch a new instance from the resulting AMI to test it
Delete the stack after a successful build
Start the build:
$ pcluster build-image --image-configuration image-config.yaml --image-id myubuntu1804
{
"image": {
"imageId": "myubuntu1804",
"imageBuildStatus": "BUILD_IN_PROGRESS",
"cloudformationStackStatus": "CREATE_IN_PROGRESS",
"cloudformationStackArn": "arn:aws-cn:cloudformation:cn-north-1:xxxxxxxxxxx:stack/myubuntu1804/f5c1b530-9592-11ed-ad5b-06bf38020c5c",
"region": "cn-north-1",
"version": "3.4.1"
}
}
In the Image Builder console you can see that three components were created.
A new EC2 instance is launched to perform the build.
View the build log:
$ watch -n 1 'pcluster get-image-log-events -i myubuntu1804 \
--log-stream-name 3.4.1/1 \
--query "events[*].message" | tail -n 50'
Describe the image:
$ pcluster describe-image --image-id myubuntu1804
Delete the image:
$ pcluster delete-image --image-id myubuntu1804
List the official images:
$ pcluster list-official-images | grep -B 2 ubuntu1804
After the build completes, the stack is deleted automatically.
(1) Instance type
The head node coordinates the cluster's scaling logic and is responsible for attaching new nodes to the scheduler; if it is under-provisioned, the whole cluster can fail.
The head node shares job data with the compute nodes over NFS, so make sure it has sufficient network and storage bandwidth.
The following directories are exported to the compute nodes:
$ cat /etc/exports
/home 172.31.0.0/16(rw,sync,no_root_squash)
/opt/parallelcluster/shared 172.31.0.0/16(rw,sync,no_root_squash)
/opt/intel 172.31.0.0/16(rw,sync,no_root_squash)
/opt/slurm 172.31.0.0/16(rw,sync,no_root_squash)
$ sudo showmount -e 127.0.0.1
Export list for 127.0.0.1:
/opt/slurm 172.31.0.0/16
/opt/intel 172.31.0.0/16
/opt/parallelcluster/shared 172.31.0.0/16
/home 172.31.0.0/16
(2) Network performance
(3) Shared storage
(4) Cluster monitoring
Custom bootstrap script failure: the bash script had Windows (CRLF) line endings from being edited on Windows.
[ERROR] Command runpostinstall (/opt/parallelcluster/scripts/fetch_and_run -postinstall) failed
2023-01-16 03:01:40,474 [DEBUG] Command runpostinstall output: /opt/parallelcluster/scripts/fetch_and_run: /tmp/tmp.wI9VD7fhQs: /bin/bash^M: bad interpreter: No such file or directory
parallelcluster: fetch_and_run - Failed to run postinstall, s3://parallelcluster-2e9f54dea72c8134-v1-do-not-delete/script/hello.sh failed with non 0 return code: 126
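A common fix is to convert the script to Unix line endings and re-upload it, for example:

# strip carriage returns from the script before uploading it again
sed -i 's/\r$//' hello.sh
# or: dos2unix hello.sh
aws s3 cp hello.sh s3://parallelcluster-2e9f54dea72c8134-v1-do-not-delete/script/hello.sh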
Error when building a custom AMI (log from Image Builder): the pcluster CLI was version 3.1.4 while the AMI targets ParallelCluster 3.2.1. The error was:
================================================================================
Stdout: Recipe Compile Error in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/attributes/conditions.rb
Stdout: ================================================================================
Stdout:
Stdout: RuntimeError
Stdout: ------------
Stdout: This AMI was created with aws-parallelcluster-cookbook-3.2.1, but is trying to be used with aws-parallelcluster-cookbook-3.1.4. Please either use an AMI created with aws-parallelcluster-cookbook-3.1.4 or change your ParallelCluster to aws-parallelcluster-cookbook-3.2.1
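The fix is to keep the pcluster CLI and the cookbook version of the ParentImage aligned, for example by pinning the CLI to the matching release:

# install the CLI release that matches the aws-parallelcluster-cookbook version of the ParentImage
pip3 install --upgrade "aws-parallelcluster==3.2.1"
pcluster version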