NVIDIA Samples

0. Build the NVIDIA Sample Code

  • Build the CUDA sample code for the device-query utility and the basic bandwidth benchmarks
  • Set SMS=70, since we are building the applications for the V100 platform (compute capability 7.0)
root@clx-mld-63:~# cd NVIDIA_CUDA-10.0_Samples/
root@clx-mld-63:~/NVIDIA_CUDA-10.0_Samples# make SMS=70 -j $(nproc)
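The SMS value handed to make selects the compute-capability target that nvcc compiles for (sm_70 is Volta). A quick reference sketch of common SMS values, following NVIDIA's published compute-capability tables; the helper name `arch_for` is ours:

```python
# SMS value passed to the CUDA Samples makefile -> targeted architecture.
# Mapping follows NVIDIA's compute-capability tables.
SM_ARCH = {
    "60": "Volta predecessor Pascal (e.g. Tesla P100)",
    "70": "Volta (e.g. Tesla V100)",
    "75": "Turing (e.g. Tesla T4)",
}

def arch_for(sms: str) -> str:
    """Return the GPU architecture targeted by a given SMS value."""
    return SM_ARCH.get(sms, "unknown")

print(arch_for("70"))  # Volta (e.g. Tesla V100)
```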
root@clx-mld-63:~# CUDA_VISIBLE_DEVICES=1 /root/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery/deviceQuery 
/root/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla V100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 16130 MBytes (16914055168 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1530 MHz (1.53 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 22 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
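Two of the figures printed above can be cross-checked by hand: the CUDA core count is SMs × cores per SM, and the theoretical HBM2 bandwidth follows from the memory clock and bus width. A minimal sanity check, assuming HBM2 transfers data on both clock edges (double data rate):

```python
# Sanity-check two derived figures from the deviceQuery output above.
sms, cores_per_sm = 80, 64            # "(80) Multiprocessors, ( 64) CUDA Cores/MP"
total_cores = sms * cores_per_sm
assert total_cores == 5120            # matches "5120 CUDA Cores"

mem_clock_hz = 877e6                  # "Memory Clock rate: 877 Mhz"
bus_width_bits = 4096                 # "Memory Bus Width: 4096-bit"
# Double data rate: 2 transfers per memory clock.
peak_bw_gbs = mem_clock_hz * 2 * bus_width_bits / 8 / 1e9
print(f"theoretical peak memory bandwidth ~ {peak_bw_gbs:.0f} GB/s")  # ~ 898 GB/s
```

This lands at roughly 898 GB/s, which matches the ~900 GB/s NVIDIA quotes for V100-SXM2.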

1. Single-GPU Bandwidth Test

root@clx-mld-63:~# CUDA_VISIBLE_DEVICES=1 /root/NVIDIA_CUDA-10.0_Samples/1_Utilities/bandwidthTest/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla V100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         12079.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         12842.4

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         742690.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
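The host-to-device and device-to-host numbers are bounded by the PCIe link to the host, while device-to-device is bounded by HBM2. A rough efficiency check, assuming a PCIe Gen3 x16 host link (8 GT/s per lane, 128b/130b encoding) and the ~898 GB/s HBM2 peak derived from the deviceQuery figures:

```python
# Put the bandwidthTest numbers above in context.
pcie_gen3_x16 = 8e9 * 16 * (128 / 130) / 8 / 1e9        # theoretical GB/s
h2d = 12079.0 / 1e3                                      # MB/s -> GB/s
d2h = 12842.4 / 1e3
d2d = 742690.9 / 1e3

print(f"PCIe Gen3 x16 theoretical: {pcie_gen3_x16:.2f} GB/s")
print(f"H2D efficiency vs PCIe:    {h2d / pcie_gen3_x16:.0%}")
print(f"D2H efficiency vs PCIe:    {d2h / pcie_gen3_x16:.0%}")
print(f"D2D vs ~898 GB/s HBM2:     {d2d / 898:.0%}")
```

Around 75-80% of the PCIe theoretical rate and ~83% of HBM2 peak are typical for 32 MiB transfers, so these results look healthy.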

2. Multi-GPU P2P Bandwidth/Latency Tests

root@clx-mld-63:~# /root/NVIDIA_CUDA-10.0_Samples/1_Utilities/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-SXM2-16GB, pciBusID: 15, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-SXM2-16GB, pciBusID: 16, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-SXM2-16GB, pciBusID: 3a, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-SXM2-16GB, pciBusID: 3b, pciDeviceID: 0, pciDomainID:0
Device: 4, Tesla V100-SXM2-16GB, pciBusID: 89, pciDeviceID: 0, pciDomainID:0
Device: 5, Tesla V100-SXM2-16GB, pciBusID: 8a, pciDeviceID: 0, pciDomainID:0
Device: 6, Tesla V100-SXM2-16GB, pciBusID: b2, pciDeviceID: 0, pciDomainID:0
Device: 7, Tesla V100-SXM2-16GB, pciBusID: b3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0       1     1     1     1     1     1     1     1
     1       1     1     1     1     1     1     1     1
     2       1     1     1     1     1     1     1     1
     3       1     1     1     1     1     1     1     1
     4       1     1     1     1     1     1     1     1
     5       1     1     1     1     1     1     1     1
     6       1     1     1     1     1     1     1     1
     7       1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 739.82   9.91  10.99  11.01  11.04  11.03  11.02  11.02 
     1   9.88 746.89  10.96  11.02  11.04  11.01  11.03  11.03 
     2  11.00  11.03 746.89   9.90  11.04  11.02  11.03  11.02 
     3  10.96  11.01   9.85 746.89  11.03  11.00  11.02  11.02 
     4  11.05  11.06  11.04  11.02 746.89   9.91  11.03  11.02 
     5  11.05  11.03  11.05  11.04   9.91 745.47  10.99  10.97 
     6  11.04  11.04  11.04  11.03  11.08  11.05 748.32   9.86 
     7  11.06  11.05  11.04  11.03  11.07  11.03   9.92 749.76 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 741.22  24.22  24.22  48.34  48.34   9.47   9.10   9.32 
     1  24.22 749.76  48.32  24.22   9.30  48.34   9.32   9.67 
     2  24.22  48.34 752.65  48.34   9.12   8.80  24.22   9.13 
     3  48.34  24.22  48.35 752.65   8.95   9.12   8.82  24.22 
     4  48.34   8.95   9.43   9.49 752.65  24.22  24.22  48.33 
     5   9.11  48.34   9.32   9.48  24.22 751.20  48.32  24.22 
     6   9.30   9.13  24.22   9.35  24.22  48.33 752.65  48.35 
     7   9.13   9.14   9.19  24.22  48.33  24.22  48.34 749.76 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 754.10  10.42  17.37  18.02  17.33  18.00  17.38  17.89 
     1  10.43 755.56  18.03  18.88  17.94  18.91  18.02  18.84 
     2  17.32  18.08 754.83  10.38  17.37  18.14  17.30  17.98 
     3  17.87  18.84  10.49 754.83  17.86  18.95  17.93  18.91 
     4  17.58  18.00  17.52  18.04 758.50  10.54  17.49  18.18 
     5  18.24  18.89  17.94  18.91  10.51 757.03  17.90  18.93 
     6  17.42  18.10  17.53  17.91  17.54  18.01 755.56  10.48 
     7  18.06  18.93  17.87  18.89  17.97  18.87  10.51 761.45 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 753.38  48.38  48.38  96.44  96.46  18.32  18.45  18.11 
     1  48.34 757.03  96.51  48.39  18.34  96.51  18.35  18.36 
     2  48.34  96.49 758.50  96.44  18.39  18.11  48.39  18.45 
     3  96.47  48.38  96.42 756.29  18.77  18.11  17.88  48.38 
     4  96.46  18.36  18.43  18.43 759.97  48.40  48.39  96.44 
     5  18.34  96.43  18.39  18.43  48.39 758.50  96.49  48.34 
     6  17.49  17.83  48.38  17.89  48.39  96.49 754.10  96.46 
     7  17.49  18.35  18.20  48.39  96.49  48.39  96.49 758.50 
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7 
     0   1.81  15.62  15.57  15.66  16.63  16.58  16.43  16.44 
     1  16.50   1.76  16.48  16.42  17.47  17.48  17.42  17.43 
     2  16.58  16.57   1.79  16.46  17.75  17.77  17.83  18.17 
     3  16.49  16.42  16.53   1.93  17.42  17.44  17.53  17.44 
     4  17.56  17.43  17.45  17.44   1.88  16.49  16.57  16.60 
     5  17.60  17.55  17.70  17.51  16.68   1.77  16.68  16.68 
     6  17.61  17.44  17.57  17.64  16.50  23.48   1.83  16.47 
     7  16.47  16.46  16.62  16.43  16.94  16.44  16.54   1.83 

   CPU     0      1      2      3      4      5      6      7 
     0   4.41  11.86  11.70  10.26  12.23  10.92  10.69  12.06 
     1  11.62   4.33  11.50  10.21  11.93  11.08  10.61  12.08 
     2  11.56  11.74   4.35  10.26  12.03  10.72  10.72  12.01 
     3  10.66  10.60  10.63   3.54  11.18   9.83   9.83  11.06 
     4  11.78  11.68  11.76  10.51   3.74  11.05  11.00  12.26 
     5  10.81  10.80  10.89   9.81  11.41   4.50  10.15  11.37 
     6  10.89  10.91  10.82   9.52  11.60  10.29   3.69  11.27 
     7  11.95  11.72  11.94  10.51  12.26  11.11  10.99   4.45 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7 
     0   1.80   1.53   1.53   1.99   1.98   2.14   2.12   2.12 
     1   1.53   1.77   2.02   1.53   2.14   1.99   2.13   2.13 
     2   1.54   2.00   1.78   2.01   2.14   2.13   1.54   2.13 
     3   1.99   1.53   2.01   1.93   2.14   2.11   2.13   1.53 
     4   1.90   2.19   2.18   2.19   1.89   1.50   1.48   1.89 
     5   2.18   1.91   2.16   2.19   1.51   1.77   1.90   1.52 
     6   2.14   2.12   1.53   2.12   1.51   1.98   1.82   2.00 
     7   2.15   2.14   2.14   1.54   1.99   1.54   1.99   1.82 

   CPU     0      1      2      3      4      5      6      7 
     0   4.41   3.02   2.98   2.98   2.96   3.10   3.14   3.08 
     1   3.00   4.59   3.04   2.60   3.10   3.00   3.08   3.10 
     2   3.02   3.01   4.38   2.96   3.14   3.11   3.02   3.27 
     3   2.63   2.62   2.67   3.60   2.74   2.73   2.72   2.62 
     4   3.16   3.35   3.28   3.27   4.51   3.14   3.11   3.18 
     5   2.75   3.17   2.90   2.98   2.81   4.55   2.83   2.84 
     6   2.74   2.72   3.24   2.89   3.14   3.17   3.71   3.11 
     7   3.06   3.10   3.04   3.19   3.13   3.20   3.16   4.46 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
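The unidirectional P2P=Enabled matrix falls into three clear tiers: ~48 GB/s, ~24 GB/s, and ~9 GB/s. A plausible reading, assuming NVLink 2.0 at ~25 GB/s per link per direction: pairs connected by two NVLink links hit ~48 GB/s, single-link pairs hit ~24 GB/s, and pairs with no direct NVLink fall back to PCIe across the CPU interconnect at ~9-10 GB/s (matching the P2P=Disabled matrix). A small classifier sketch of that interpretation:

```python
# Classify the unidirectional P2P=Enabled bandwidths into link tiers.
# Assumes NVLink 2.0 at ~25 GB/s per link per direction; pairs with no
# direct NVLink fall back to PCIe through the CPU (~9-10 GB/s).
NVLINK2_PER_LINK = 25.0  # GB/s, per direction

def link_tier(bw_gbs: float) -> str:
    links = round(bw_gbs / NVLINK2_PER_LINK)
    return f"{links} NVLink link(s)" if links >= 1 else "PCIe fallback"

for bw in (48.34, 24.22, 9.47):      # representative values from the matrix
    print(f"{bw:6.2f} GB/s -> {link_tier(bw)}")
```

The latency matrices tell the same story: ~1.5-2.2 us between NVLink peers versus ~16-18 us when staging through the host.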
