Published 2024-01-24 · Updated 2024-04-01

NVIDIA GPU Architectures and CUDA Compute Capability

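When compiling with NVCC, each -gencode option takes an arch parameter and a code parameter. The arch parameter names the virtual architecture (compute_XX, a PTX version) that the CUDA file is compiled against, and the code parameter names the real GPU architecture (sm_XX) for which machine code is generated. For example: -gencode arch=compute_75,code=sm_75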
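To make the two halves of a -gencode option concrete, here is a minimal sketch that splits one flag into its parts. The function name parse_gencode and the returned dictionary keys are made up for this illustration and are not part of nvcc; the sketch only handles a single code= value, not the list forms that nvcc also accepts.

```python
import re

def parse_gencode(flag: str) -> dict:
    """Split one -gencode flag into its virtual arch and its code target.

    Hypothetical helper for illustration; handles one code= value only.
    """
    match = re.fullmatch(r"-gencode[= ]arch=(compute_\w+),code=(\w+)", flag.strip())
    if match is None:
        raise ValueError(f"not a simple -gencode flag: {flag!r}")
    arch, code = match.groups()
    return {
        "virtual_arch": arch,                       # PTX version compiled against
        "target": code,                             # what ends up in the binary
        "embeds_ptx": code.startswith("compute_"),  # True when PTX is kept
    }

print(parse_gencode("-gencode arch=compute_75,code=sm_75"))
print(parse_gencode("-gencode=arch=compute_75,code=compute_75"))
```

The first call describes machine code for Turing cards; the second describes a flag that keeps PTX in the binary instead.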

The table below maps NVIDIA GPU architecture names to their compute capabilities:

Architecture   Compute capabilities
Fermi †        sm_20
Kepler †       sm_30, sm_35, sm_37
Maxwell ‡      sm_50, sm_52, sm_53
Pascal         sm_60, sm_61, sm_62
Volta          sm_70, sm_72 (Xavier)
Turing         sm_75
Ampere         sm_80, sm_86, sm_87 (Orin)
Ada            sm_89
Hopper         sm_90, sm_90a
Blackwell      sm_100

† Fermi and Kepler are deprecated from CUDA 9 and CUDA 11 onwards, respectively.
‡ Maxwell is deprecated from CUDA 11.6 onwards.

Q&A

1. When should different "gencodes" or "cuda arch" values be used?

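When compiling CUDA code, always pass one -arch flag that matches the GPU you use most. Programs then start faster, because the GPU machine code is generated at compile time rather than at load time.
If the resulting binary carries only PTX and no machine code for the GPU in use, the CUDA driver generates the GPU code at run time with its JIT compiler.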

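To speed up CUDA compilation, remove the -gencode flags you do not need. On the other hand, you may sometimes want better CUDA backward compatibility, which means adding a more comprehensive set of -gencode flags.

Before continuing, identify your GPU and the CUDA version you have installed.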
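One way to identify the GPU is to ask nvidia-smi for the compute capability directly. The sketch below assumes a reasonably recent driver whose nvidia-smi supports the compute_cap query field; older drivers may not, in which case the function simply returns an empty list. The helper names are hypothetical.

```python
import shutil
import subprocess

def parse_compute_caps(smi_output: str) -> list[tuple[str, str]]:
    """Turn 'name, compute_cap' CSV lines into (name, 'sm_XY') pairs."""
    gpus = []
    for line in smi_output.strip().splitlines():
        if "," not in line:
            continue  # skip error messages or unexpected lines
        name, cap = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, "sm_" + cap.replace(".", "")))
    return gpus

def detect_gpus() -> list[tuple[str, str]]:
    """Query nvidia-smi; return [] when the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_compute_caps(result.stdout)

# A sample output line, hard-coded here for illustration only.
print(parse_compute_caps("Tesla T4, 7.5\n"))
```

Calling detect_gpus() on a machine with a supported driver would return one pair per installed GPU; the installed toolkit version can be read separately with nvcc --version.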

2. Detailed list of CUDA versions and GPU core architectures

Fermi architecture (CUDA 3.2 to CUDA 8)

Fermi is deprecated from CUDA 9 onwards, and support is dropped completely from CUDA 10.

SM20 or SM_20, compute_20 –
GeForce 400, 500, 600, GT-630.

Kepler architecture (CUDA 5 to CUDA 10)

Kepler is deprecated from CUDA 11 onwards.

SM30 or SM_30, compute_30 –
Generic Kepler, GeForce 700, GT-730.
SM35 or SM_35, compute_35 –
Tesla K40.
SM37 or SM_37, compute_37 –
Tesla K80.

Maxwell architecture (CUDA 6 to CUDA 11)

Maxwell is deprecated from CUDA 11.6 onwards.

SM50 or SM_50, compute_50 –
Tesla/Quadro M series.
SM52 or SM_52, compute_52 –
Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X.
SM53 or SM_53, compute_53 –
Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.

Pascal architecture (CUDA 8 to present)

SM60 or SM_60, compute_60 –
Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
SM61 or SM_61, compute_61 –
GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, NVIDIA Drive PX2
SM62 or SM_62, compute_62 –
NVIDIA Drive PX2, Tegra (Jetson) TX2

Volta architecture (CUDA 9 to present)

SM70 or SM_70, compute_70 –
DGX-1 with Volta, Tesla V100, Titan V, Quadro GV100
SM72 or SM_72, compute_72 –
Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX

Turing architecture (CUDA 10 to present)

SM75 or SM_75, compute_75 –
GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4

Ampere architecture (CUDA 11.0 to present)

SM80 or SM_80, compute_80 –
NVIDIA A100 (GA100; the "Tesla" name was dropped starting with this generation), NVIDIA DGX-A100
SM86 or SM_86, compute_86 – (CUDA 11.1 onwards)
Tesla GA10x cards, RTX Ampere: RTX 3080, RTX 3090 (GA102), RTX 3070 (GA104), RTX 3060 (GA106), RTX 3050 (GA107), RTX A2000, A3000, RTX A4000, A5000, A6000, NVIDIA A40, RTX A10, RTX A16, RTX A40, A2 Tensor Core GPU
SM87 or SM_87, compute_87 – (CUDA 11.4 onwards, introduced with PTX ISA 7.4 / driver r470 and newer); applies only to Jetson AGX Orin and Drive AGX Orin

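"Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput."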
https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32

Ada Lovelace architecture (CUDA 11.8 to present)

SM89 or SM_89, compute_89 –
NVIDIA GeForce RTX 4090, RTX 4080, RTX 4070, RTX 4060, RTX 6000 Ada, Tesla L40, L40s Ada, L4 Ada

Hopper architecture (CUDA 12 to present)

SM90 or SM_90, compute_90 –
NVIDIA H100 (GH100), NVIDIA H200
SM90a or SM_90a, compute_90a – (for PTX ISA version 8.0); adds acceleration for features such as wgmma and setmaxnreg. NVIDIA CUTLASS requires it.

Blackwell architecture (CUDA 12.8 to present)

SM100 or SM_100, compute_100 –
NVIDIA B100 (GB100)

3. Sample nvcc gencode and arch flags in GCC

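Note: in a production environment where the deployed compute cards are a single fixed model and no compatibility is needed (an A100, for example), it is advisable to pin nvcc to that card's exact compute capability, which helps ensure none of the card's compute power goes to waste.

According to NVIDIA:

The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can be either cubin or PTX, or both. Only the back-end target versions specified by the code= clause are retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.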
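The rule quoted above (a cubin per real architecture, plus at least one PTX target) can be sketched as a small generator. This is an illustrative helper, not an NVIDIA tool: the name gencode_flags is hypothetical, and the choice to attach PTX to the newest capability simply mirrors the sample flag sets in this section.

```python
def gencode_flags(capabilities: list[str]) -> list[str]:
    """Build -gencode flags: one cubin per capability, PTX for the newest.

    Hypothetical helper; capabilities look like "7.0" or "8.6".
    """
    if not capabilities:
        raise ValueError("at least one compute capability is required")
    # Normalise "7.5" -> "75", drop duplicates, and sort numerically.
    ccs = sorted({c.replace(".", "") for c in capabilities}, key=int)
    flags = [f"-gencode=arch=compute_{cc},code=sm_{cc}" for cc in ccs]
    newest = ccs[-1]
    # The final flag keeps PTX so newer GPUs can JIT-compile the kernels.
    flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags

print(" \\\n".join(gencode_flags(["7.0", "7.5", "8.0"])))
```

The printed result has the same shape as the hand-written flag lists that follow: one line per real architecture and a closing PTX line.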

Sample flags for building on CUDA 7.0 with maximum compatibility across all cards of that era:

-arch=sm_30 \
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_52,code=compute_52

Sample flags for building on CUDA 8.1 with maximum compatibility for cards predating Volta:

Sample flags for building on CUDA 9.2 with maximum compatibility for Volta cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_70,code=compute_70

Sample flags for building on CUDA 10.1 with maximum compatibility for V100 and T4 Turing cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75

Sample flags for building on CUDA 11.0 with maximum compatibility for V100 and T4 Turing cards:

-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80

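Sample flags for building on CUDA 11.7 with maximum compatibility for V100 and T4 cards, while also supporting the newer RTX 3080 and Drive AGX Orin:

Sample flags for building on CUDA 11.4 for best performance on RTX 3080 cards: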

-arch=sm_80 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86

Sample flags for building on CUDA 12 for best performance on GeForce RTX 4080, L40s, L4, and RTX 6000 Ada cards:

-arch=sm_89 \
-gencode=arch=compute_89,code=sm_89 \
-gencode=arch=compute_89,code=compute_89

Sample flags for building on CUDA 12 (PTX ISA version 8.0) for best performance on NVIDIA H100 and H200 (Hopper) GPUs, with no backward compatibility for previous generations:

-arch=sm_90 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90a,code=sm_90a \
-gencode=arch=compute_90a,code=compute_90a

To build for Hopper GPUs with broader compatibility, including some backward compatibility:

-arch=sm_52 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90,code=compute_90

4. Using TORCH_CUDA_ARCH_LIST with PyTorch

If you are using PyTorch, you can set the target architectures at install time with the TORCH_CUDA_ARCH_LIST environment variable, for example:

TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6" python3 setup.py install

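Note that although you can list every architecture in this variable, each one lengthens the build, because the kernels must be compiled for every architecture listed.

You can also append the +PTX suffix to the newest architecture you list, which tells PyTorch to generate PTX code that is forward compatible with newer cards: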

TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6+PTX" python3 build_my_extension.py
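To show how such a list relates to the nvcc flags from section 3, the sketch below expands a TORCH_CUDA_ARCH_LIST value into -gencode flags. It is an approximation written for this article, not PyTorch's actual implementation; the function name is hypothetical, and it assumes entries are separated by spaces or semicolons.

```python
def arch_list_to_flags(arch_list: str) -> list[str]:
    """Expand a TORCH_CUDA_ARCH_LIST-style string into -gencode flags.

    Illustrative approximation only; not PyTorch's own code.
    """
    flags = []
    for entry in arch_list.replace(";", " ").split():
        want_ptx = entry.endswith("+PTX")
        cc = entry.removesuffix("+PTX").replace(".", "")  # "8.6+PTX" -> "86"
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
        if want_ptx:
            # +PTX additionally keeps PTX for forward compatibility.
            flags.append(f"-gencode=arch=compute_{cc},code=compute_{cc}")
    return flags

for flag in arch_list_to_flags("7.0 7.5 8.0 8.6+PTX"):
    print(flag)
```

Each extra entry adds one more compilation pass, which is why a long list slows the build.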

5. Building TensorRT with CMake

If you build TensorRT with CMake, drop the sm_ and compute_ prefixes and give only the compute capability.

Example for Tesla V100 and Volta cards:
cmake <...> -DGPU_ARCHS="70"

Example for NVIDIA RTX 2070 and Tesla T4:
cmake <...> -DGPU_ARCHS="75"

Example for NVIDIA A100:
cmake <...> -DGPU_ARCHS="80"

Example for NVIDIA RTX 3080 and A100 together:
cmake <...> -DGPU_ARCHS="80 86"

Example for NVIDIA H100:
cmake <...> -DGPU_ARCHS="90"
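The prefix-dropping rule can be written down as a tiny conversion. The helper below is a hypothetical convenience for illustration, not part of TensorRT or CMake; it only builds the -DGPU_ARCHS argument string used in the examples above.

```python
def gpu_archs_option(targets: list[str]) -> str:
    """Convert names like 'sm_80' or 'compute_86' into a -DGPU_ARCHS option."""
    archs = []
    for target in targets:
        for prefix in ("sm_", "compute_"):
            if target.startswith(prefix):
                target = target[len(prefix):]  # keep only the digits
        if target not in archs:                # avoid listing a value twice
            archs.append(target)
    joined = " ".join(archs)
    return f'-DGPU_ARCHS="{joined}"'

print(gpu_archs_option(["sm_80", "compute_80", "sm_86"]))
```

The output for this call is the same option string as in the RTX 3080 plus A100 example.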

6. Building CUTLASS for Hopper GH100 with CMake

cmake .. -DCUTLASS_NVCC_ARCHS=90a

7. Value 'sm_86' is not defined for option 'gpu-architecture'

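If you get an error like this one:

nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'.

you probably have an old version of CUDA and/or the driver installed. sm_86 is only available from CUDA 11.1 onwards. Upgrade to a newer driver, version 450.36.06 or later, to support sm_8x cards such as the A100 and RTX 3080.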
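A quick way to reason about this error is to compare the installed toolkit version with the first CUDA release that knows the target. The table in the sketch below is copied from the per-architecture notes in section 2 of this article and covers only a few targets; the names MIN_CUDA and toolkit_supports are hypothetical.

```python
# First CUDA release supporting each target, as listed in section 2.
MIN_CUDA = {
    "sm_70": (9, 0),
    "sm_75": (10, 0),
    "sm_86": (11, 1),
    "sm_87": (11, 4),
    "sm_89": (11, 8),
    "sm_90": (12, 0),
}

def toolkit_supports(cuda_version: str, target: str) -> bool:
    """Return True if a toolkit such as '11.0' should accept the target."""
    parts = (cuda_version.split(".") + ["0"])[:2]  # "11" -> ["11", "0"]
    have = tuple(int(p) for p in parts)
    return have >= MIN_CUDA[target]

print(toolkit_supports("11.0", "sm_86"))
print(toolkit_supports("11.1", "sm_86"))
```

With CUDA 11.0 the check fails for sm_86, which corresponds to the nvcc error above; from 11.1 on it passes.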

8. CUDA runtime error: operation not supported

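If you get a std::runtime_error like the following:

CUDA runtime error: operation not supported

the GPU does not support the runtime code that was generated.

Use nvidia-smi to check the GPU model and driver version. Then adjust the gencode flags so that the generated runtime code matches your card.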
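Matching the generated code to the card follows two general CUDA compatibility rules: machine code built for sm_XY runs on devices with the same major version X and a minor version of at least Y, while embedded PTX for compute_XY can be JIT-compiled on any device of capability X.Y or higher. The sketch below encodes just these two rules. It is a simplification with a hypothetical function name: it skips architecture-specific targets such as sm_90a and ignores special cases for embedded (Tegra) parts.

```python
def can_run(device_cc: str, code_targets: list[str]) -> bool:
    """Check whether any code= target is usable on a device like '8.6'."""
    dev_major, dev_minor = (int(p) for p in device_cc.split("."))
    for target in code_targets:
        kind, _, digits = target.partition("_")
        if not digits.isdigit():
            continue  # e.g. sm_90a: arch-specific, not covered by this sketch
        major, minor = int(digits[:-1]), int(digits[-1])
        # Machine code: same major version, minor not newer than the device.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # PTX: any device at or above the PTX's compute capability.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False

print(can_run("8.6", ["sm_80"]))                # cubin for 8.0 on an 8.6 device
print(can_run("7.5", ["sm_80", "compute_80"]))  # built only for Ampere
print(can_run("8.9", ["sm_70", "compute_70"]))  # older PTX on a newer device
```

When the function returns False for your card, the binary contains nothing the GPU can execute, and the flags need a matching sm_ or an older compute_ target.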

Author: cyhasuka
https://cyhasuka.github.io/posts/8a630bae/
