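Out of curiosity, over the past few days I had Gemini summarize the specs of the mainstream cards, and compare the AMD and Intel cards against NVIDIA cards of roughly the same compute performance. The data is not guaranteed to be fully accurate, so treat it as a reference only.
Here is my understanding of the hardware; please correct me if anything is wrong: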
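- These are theoretical figures only. Many other bottlenecks exist, so they do not fully represent real-world performance. The sparse matmul numbers in particular are for reference only, and none of this reflects performance with CUDA Graphs or other specially optimized kernels.
- Unless told otherwise, most frameworks run in whichever dtype is most broadly compatible with the hardware. For example, llama.cpp by default dequantizes activations to fp32 and weights to fp16, while vLLM and ComfyUI default to fp16/bf16.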
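- Sparse matmul means the weight matrix contains many zero weights, and NVIDIA tensor cores have a dedicated fast path for such matrices. That path requires 2:4 structured sparsity (two zeros in every group of four weights), so an LLM only benefits if its weights have been pruned to that pattern; ordinary checkpoints still run as dense matmul.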
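- ComfyUI workloads are mostly dense matmul.
- Compute throughput is only part of the picture. Some stages, such as LLM decode and video generation, are limited more by memory bandwidth and cannot saturate the compute units.
- Intel's INT8 throughput is strong, but it seems only OpenVINO supports it.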
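- The 7900 XTX and R9700 do not have native fp16 hardware support, but ROCm seems to have some trick that accelerates fp16 anyway. The R9700 does have native fp8 hardware support.
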
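
To make the sparse matmul point concrete: NVIDIA's sparse tensor core figures are quoted for 2:4 structured sparsity, where two of every four consecutive weights are zero. The pure-Python sketch below prunes a row of weights to that pattern and checks it. The function names `prune_2_of_4` and `is_2_of_4_sparse` and the sample weights are made up for illustration; this is not how any framework actually implements pruning.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2_of_4_sparse(row):
    """Return True if every group of four holds at least two zeros."""
    return all(
        sum(1 for v in row[s:s + 4] if v == 0) >= 2
        for s in range(0, len(row), 4)
    )


# Hypothetical sample weights, two groups of four.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_2_of_4(weights)
print(pruned)
print(is_2_of_4_sparse(weights), is_2_of_4_sparse(pruned))
```

A dense checkpoint like `weights` fails the check, which is why the sparse rows in the table only matter for models that were pruned this way.
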
| Hardware Metric | Intel Arc Pro B70 | AMD Radeon AI PRO R9700 | AMD Radeon RX 7900 XTX | NVIDIA RTX 3090 | NVIDIA RTX 4070 | NVIDIA RTX 5060 Ti | NVIDIA RTX 5070 | NVIDIA RTX 5070 Ti | NVIDIA RTX 4090 | NVIDIA RTX 5090 |
|---|---|---|---|---|---|---|---|---|---|---|
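| Architecture | Intel Xe2 | AMD RDNA 4 | AMD RDNA 3 | Ampere (3rd-gen Tensor Cores) | Ada (4th-gen Tensor Cores) | Blackwell (5th-gen Tensor Cores) | Blackwell (5th-gen Tensor Cores) | Blackwell (5th-gen Tensor Cores) | Ada (4th-gen Tensor Cores) | Blackwell (5th-gen Tensor Cores) |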
| VRAM Capacity | 32 GB GDDR6 | 32 GB GDDR6 | 24 GB GDDR6 | 24 GB GDDR6X | 12 GB GDDR6X | 16 GB GDDR7 | 12 GB GDDR7 | 16 GB GDDR7 | 24 GB GDDR6X | 32 GB GDDR7 |
| Memory Bus Width | 256-bit | 256-bit | 384-bit | 384-bit | 192-bit | 128-bit | 192-bit | 256-bit | 384-bit | 512-bit |
| Memory Bandwidth | 608 GB/s | 644.6 GB/s | 960 GB/s | 936 GB/s | 504 GB/s | 448 GB/s | 672 GB/s | 896 GB/s | 1,008 GB/s | 1,792 GB/s |
| FP32 (Float32) | ~22.9 TFLOPS | ~47.8 TFLOPS | ~61.4 TFLOPS | ~35.6 TFLOPS | ~29.2 TFLOPS | ~23.7 TFLOPS | ~30.8 TFLOPS | ~43.9 TFLOPS | ~82.6 TFLOPS | ~104.8 TFLOPS |
| FP16 / BF16 (Dense) | ~46 TFLOPS | ~95.7 TFLOPS | ~123 TFLOPS | ~71 TFLOPS | ~117 TFLOPS | ~94.8 TFLOPS | ~124 TFLOPS | ~175.7 TFLOPS | ~165.2 TFLOPS | ~419.2 TFLOPS |
| FP16 / BF16 (Sparse) | No Sparsity | No Sparsity | No Sparsity | ~142 TFLOPS | ~233 TFLOPS | ~189.6 TFLOPS | ~248 TFLOPS | ~351.4 TFLOPS | ~330.3 TFLOPS | ~838.4 TFLOPS |
| INT8 / FP8 (Dense) | 367 TOPS / ~46 TFLOPS | ~191.4 TOPS / ~95.7 TFLOPS | ~246 TOPS / Emulated | ~142 TOPS / Emulated | ~233 TOPS / ~233 TFLOPS | ~189.6 TOPS / ~189.6 TFLOPS | ~248 TOPS / ~248 TFLOPS | ~351.4 TOPS / ~351.4 TFLOPS | ~330.3 TOPS / ~330.3 TFLOPS | ~838.4 TOPS / ~838.4 TFLOPS |
| INT8 / FP8 (Sparse) | No Sparsity | No Sparsity | No Sparsity | ~284 TOPS / Emulated | ~466 TOPS / ~466 TFLOPS | ~379.2 TOPS / ~379.2 TFLOPS | ~496 TOPS / ~496 TFLOPS | ~702.8 TOPS / ~702.8 TFLOPS | ~660.6 TOPS / ~660.6 TFLOPS | ~1,676.8 TOPS / ~1,676.8 TFLOPS |
| INT4 (Dense / Sparse) | ~734 TOPS / No Sparsity | ~1,531 TOPS / No Sparsity | ~246 TOPS / No Sparsity | ~284 / ~568 TOPS | ~466 / ~932 TOPS | ~379.2 / ~758.4 TOPS | ~496 / ~992 TOPS | ~702.8 / ~1,405.6 TOPS | ~660.6 / ~1,321 TOPS | ~1,676.8 / ~3,353.6 TOPS |
| FP4 (Dense / Sparse) | N/A (Emulated) | N/A (Emulated) | N/A (Emulated) | N/A (Emulated) | N/A (Emulated) | ~758.4 / ~1,518 TFLOPS | ~988 / ~1,976 TFLOPS | ~1,403 / ~2,806 TFLOPS | N/A (Emulated) | ~1,676.8 / ~3,353.6 TFLOPS |
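
As a rough way to read the bandwidth row, here is a back-of-the-envelope sketch in Python. It assumes batch-size-1 decode where every weight is read from VRAM once per token, and it ignores the KV cache, activations, and any kernel overhead, so the result is a ceiling, not a prediction. The 8 GB weight size is an arbitrary assumption, the function names are hypothetical, and the bandwidth and dense FP16/BF16 numbers are copied from the table above for a few of the cards.

```python
# (memory bandwidth in GB/s, dense FP16/BF16 peak in TFLOPS), from the table.
CARDS = {
    "Intel Arc Pro B70": (608.0, 46.0),
    "AMD Radeon AI PRO R9700": (644.6, 95.7),
    "AMD Radeon RX 7900 XTX": (960.0, 123.0),
    "NVIDIA RTX 3090": (936.0, 71.0),
    "NVIDIA RTX 4090": (1008.0, 165.2),
    "NVIDIA RTX 5090": (1792.0, 419.2),
}

# Assumption: total bytes of model weights resident in VRAM, in GB.
WEIGHT_GB = 8.0


def decode_tokens_per_s_ceiling(bandwidth_gb_s, weight_gb):
    """Upper bound on tokens/s if each token reads all weights once."""
    return bandwidth_gb_s / weight_gb


def ridge_flops_per_byte(peak_tflops, bandwidth_gb_s):
    """FLOPs per byte a kernel needs before compute becomes the limit."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


for name, (bandwidth, tflops) in CARDS.items():
    ceiling = decode_tokens_per_s_ceiling(bandwidth, WEIGHT_GB)
    ridge = ridge_flops_per_byte(tflops, bandwidth)
    print(f"{name:26s} <= {ceiling:6.1f} tok/s, ridge {ridge:6.1f} FLOP/byte")
```

For example, the RTX 3090's 936 GB/s with 8 GB of weights gives a ceiling of 936 / 8 = 117 tokens/s no matter how many TFLOPS the card has. Workloads whose arithmetic intensity sits below the ridge value are bandwidth-limited under this simple model.
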