0. Build the NVIDIA CUDA Samples
- Build the CUDA samples, which include the deviceQuery utility and the basic bandwidth benchmarks used below
- Set SMS=70, since we are building for the V100 platform (compute capability 7.0); the samples Makefile expands this into the matching -gencode arch=compute_70,code=sm_70 flags
root@clx-mld-63:~# cd NVIDIA_CUDA-10.0_Samples/
root@clx-mld-63:~/NVIDIA_CUDA-10.0_Samples# make SMS=70 -j $(nproc)
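Each sample directory also ships its own Makefile that honors the same SMS variable, so a single tool can be rebuilt in place without touching the rest of the tree:

root@clx-mld-63:~/NVIDIA_CUDA-10.0_Samples# cd 1_Utilities/deviceQuery
root@clx-mld-63:~/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery# make SMS=70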
root@clx-mld-63:~# CUDA_VISIBLE_DEVICES=1 /root/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery/deviceQuery
/root/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16130 MBytes (16914055168 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 22 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
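All of the fields above are read through the CUDA runtime API. As a minimal sketch (not the sample's actual source; the file name query_sketch.cu is illustrative), the same properties can be pulled with cudaGetDeviceProperties() and compiled like the samples with nvcc -arch=sm_70 query_sketch.cu -o query_sketch:

// query_sketch.cu: reading a few of the fields deviceQuery prints
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Detected %d CUDA capable device(s)\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: \"%s\"\n", dev, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:      %zu bytes\n", prop.totalGlobalMem);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Warp size:          %d\n", prop.warpSize);
        printf("  Max threads/block:  %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}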
1. Single-GPU BW Test
root@clx-mld-63:~# CUDA_VISIBLE_DEVICES=1 /root/NVIDIA_CUDA-10.0_Samples/1_Utilities/bandwidthTest/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla V100-SXM2-16GB
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12079.0
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12842.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 742690.9
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
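The PINNED rows above come from page-locked host buffers, which let the copy engines DMA directly without staging. A minimal sketch of the same measurement idea, using cudaMallocHost() for the pinned buffer and CUDA events for timing (the 32 MiB size matches the sample's transfer size; the repetition count and the GB/s reporting are arbitrary choices here, not the sample's):

// bw_sketch.cu (illustrative name): pinned host-to-device bandwidth
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 33554432;        // 32 MiB, the sample's transfer size
    const int reps = 100;                 // arbitrary repetition count
    void *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost(&h_buf, bytes);        // page-locked (pinned) host memory
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.1f GB/s\n", (double)bytes * reps / (ms / 1e3) / 1e9);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}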
2. Multi-GPU P2P BW/LAT Tests
root@clx-mld-63:~# /root/NVIDIA_CUDA-10.0_Samples/1_Utilities/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-SXM2-16GB, pciBusID: 15, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-SXM2-16GB, pciBusID: 16, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-SXM2-16GB, pciBusID: 3a, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-SXM2-16GB, pciBusID: 3b, pciDeviceID: 0, pciDomainID:0
Device: 4, Tesla V100-SXM2-16GB, pciBusID: 89, pciDeviceID: 0, pciDomainID:0
Device: 5, Tesla V100-SXM2-16GB, pciBusID: 8a, pciDeviceID: 0, pciDomainID:0
Device: 6, Tesla V100-SXM2-16GB, pciBusID: b2, pciDeviceID: 0, pciDomainID:0
Device: 7, Tesla V100-SXM2-16GB, pciBusID: b3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 739.82 9.91 10.99 11.01 11.04 11.03 11.02 11.02
1 9.88 746.89 10.96 11.02 11.04 11.01 11.03 11.03
2 11.00 11.03 746.89 9.90 11.04 11.02 11.03 11.02
3 10.96 11.01 9.85 746.89 11.03 11.00 11.02 11.02
4 11.05 11.06 11.04 11.02 746.89 9.91 11.03 11.02
5 11.05 11.03 11.05 11.04 9.91 745.47 10.99 10.97
6 11.04 11.04 11.04 11.03 11.08 11.05 748.32 9.86
7 11.06 11.05 11.04 11.03 11.07 11.03 9.92 749.76
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 741.22 24.22 24.22 48.34 48.34 9.47 9.10 9.32
1 24.22 749.76 48.32 24.22 9.30 48.34 9.32 9.67
2 24.22 48.34 752.65 48.34 9.12 8.80 24.22 9.13
3 48.34 24.22 48.35 752.65 8.95 9.12 8.82 24.22
4 48.34 8.95 9.43 9.49 752.65 24.22 24.22 48.33
5 9.11 48.34 9.32 9.48 24.22 751.20 48.32 24.22
6 9.30 9.13 24.22 9.35 24.22 48.33 752.65 48.35
7 9.13 9.14 9.19 24.22 48.33 24.22 48.34 749.76
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 754.10 10.42 17.37 18.02 17.33 18.00 17.38 17.89
1 10.43 755.56 18.03 18.88 17.94 18.91 18.02 18.84
2 17.32 18.08 754.83 10.38 17.37 18.14 17.30 17.98
3 17.87 18.84 10.49 754.83 17.86 18.95 17.93 18.91
4 17.58 18.00 17.52 18.04 758.50 10.54 17.49 18.18
5 18.24 18.89 17.94 18.91 10.51 757.03 17.90 18.93
6 17.42 18.10 17.53 17.91 17.54 18.01 755.56 10.48
7 18.06 18.93 17.87 18.89 17.97 18.87 10.51 761.45
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 753.38 48.38 48.38 96.44 96.46 18.32 18.45 18.11
1 48.34 757.03 96.51 48.39 18.34 96.51 18.35 18.36
2 48.34 96.49 758.50 96.44 18.39 18.11 48.39 18.45
3 96.47 48.38 96.42 756.29 18.77 18.11 17.88 48.38
4 96.46 18.36 18.43 18.43 759.97 48.40 48.39 96.44
5 18.34 96.43 18.39 18.43 48.39 758.50 96.49 48.34
6 17.49 17.83 48.38 17.89 48.39 96.49 754.10 96.46
7 17.49 18.35 18.20 48.39 96.49 48.39 96.49 758.50
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 1.81 15.62 15.57 15.66 16.63 16.58 16.43 16.44
1 16.50 1.76 16.48 16.42 17.47 17.48 17.42 17.43
2 16.58 16.57 1.79 16.46 17.75 17.77 17.83 18.17
3 16.49 16.42 16.53 1.93 17.42 17.44 17.53 17.44
4 17.56 17.43 17.45 17.44 1.88 16.49 16.57 16.60
5 17.60 17.55 17.70 17.51 16.68 1.77 16.68 16.68
6 17.61 17.44 17.57 17.64 16.50 23.48 1.83 16.47
7 16.47 16.46 16.62 16.43 16.94 16.44 16.54 1.83
CPU 0 1 2 3 4 5 6 7
0 4.41 11.86 11.70 10.26 12.23 10.92 10.69 12.06
1 11.62 4.33 11.50 10.21 11.93 11.08 10.61 12.08
2 11.56 11.74 4.35 10.26 12.03 10.72 10.72 12.01
3 10.66 10.60 10.63 3.54 11.18 9.83 9.83 11.06
4 11.78 11.68 11.76 10.51 3.74 11.05 11.00 12.26
5 10.81 10.80 10.89 9.81 11.41 4.50 10.15 11.37
6 10.89 10.91 10.82 9.52 11.60 10.29 3.69 11.27
7 11.95 11.72 11.94 10.51 12.26 11.11 10.99 4.45
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 1.80 1.53 1.53 1.99 1.98 2.14 2.12 2.12
1 1.53 1.77 2.02 1.53 2.14 1.99 2.13 2.13
2 1.54 2.00 1.78 2.01 2.14 2.13 1.54 2.13
3 1.99 1.53 2.01 1.93 2.14 2.11 2.13 1.53
4 1.90 2.19 2.18 2.19 1.89 1.50 1.48 1.89
5 2.18 1.91 2.16 2.19 1.51 1.77 1.90 1.52
6 2.14 2.12 1.53 2.12 1.51 1.98 1.82 2.00
7 2.15 2.14 2.14 1.54 1.99 1.54 1.99 1.82
CPU 0 1 2 3 4 5 6 7
0 4.41 3.02 2.98 2.98 2.96 3.10 3.14 3.08
1 3.00 4.59 3.04 2.60 3.10 3.00 3.08 3.10
2 3.02 3.01 4.38 2.96 3.14 3.11 3.02 3.27
3 2.63 2.62 2.67 3.60 2.74 2.73 2.72 2.62
4 3.16 3.35 3.28 3.27 4.51 3.14 3.11 3.18
5 2.75 3.17 2.90 2.98 2.81 4.55 2.83 2.84
6 2.74 2.72 3.24 2.89 3.14 3.17 3.71 3.11
7 3.06 3.10 3.04 3.19 3.13 3.20 3.16 4.46
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
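The P2P=Enabled matrices show why peer access matters on this box: directly linked GPU pairs reach NVLink-class bandwidth (the ~24 and ~48 GB/s unidirectional entries above) at roughly 1.5-2 us latency, while pairs without a direct link fall back to ~9 GB/s through the host. A minimal sketch of the peer-access mechanics the test exercises, with devices 0 and 1 and the buffer size picked arbitrarily:

// p2p_sketch.cu (illustrative name): enabling P2P and copying GPU-to-GPU
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 reach device 1 directly?
    printf("Device=0 %s Access Peer Device=1\n", canAccess ? "CAN" : "CANNOT");
    if (!canAccess) return 1;

    const size_t bytes = 33554432;               // 32 MiB, arbitrary
    void *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // flags argument must be 0
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, bytes);

    // Direct GPU-to-GPU copy; with peer access enabled this stays on
    // NVLink/PCIe and never stages through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    printf("Peer copy of %zu bytes complete\n", bytes);
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}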