Consumer AI Hardware Spec Comparison
-
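Out of curiosity, over the past few days I asked Gemini to summarize the specs of the mainstream cards, and to compare the AMD and Intel cards against NVIDIA cards with roughly the same compute performance. The data may not be fully accurate, so treat it as a reference only.
Here is my understanding of the hardware. Please correct me if anything is wrong: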
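- These are theoretical figures only. There are many other bottlenecks, so they do not fully represent real-world performance; the sparse matmul figures in particular are for reference only. They also do not reflect performance with CUDA Graphs or other specially optimized kernels.
- Unless told otherwise, some frameworks pick the most widely compatible dtype for the hardware. For example, llama.cpp by default dequantizes activations to FP32 and weights to FP16, while vLLM and ComfyUI default to FP16/BF16.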
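- "Sparse matmul" means the weight matrix contains many zeros. NVIDIA Tensor Cores have special acceleration for this, but only for 2:4 structured sparsity (at least two zeros in every four consecutive weights). As far as I know, most LLM weights are not pruned to that pattern, so LLM inference usually runs as dense matmul in practice and the sparse figures rarely apply.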
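- ComfyUI workloads are mostly dense matmul.
- Compute throughput is only part of the picture. Some stages, such as LLM decode and video generation, are limited more by memory bandwidth and cannot saturate the compute units.
- Intel's INT8 performance is strong, but it seems only OpenVINO supports it.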
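- The 7900 XTX and R9700 apparently lack native FP16 hardware support, but ROCm seems to have some black magic that accelerates FP16 compute. The R9700 does have native FP8 hardware support.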
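
To make the sparsity point concrete: the sparse throughput NVIDIA quotes assumes 2:4 structured sparsity, meaning at least two of every four consecutive weights are zero. Below is a minimal pure-Python sketch of what that pattern looks like. The helper names `prune_2_of_4` and `is_2_of_4_sparse` are made up for illustration, and pruning by magnitude is just one simple choice, not a description of any vendor tool.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four (illustrative)."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries, which are kept.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2_of_4_sparse(row):
    """True if every group of four values contains at least two zeros."""
    return len(row) % 4 == 0 and all(
        sum(1 for v in row[s:s + 4] if v == 0) >= 2
        for s in range(0, len(row), 4)
    )


# Hypothetical weights, chosen only to show the pattern.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_2_of_4(weights)
print(pruned)
print(is_2_of_4_sparse(weights), is_2_of_4_sparse(pruned))
```

An ordinary dense weight row fails the check, so a model has to be pruned to this pattern before the sparse TFLOPS numbers mean anything for it.
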

| Hardware Metric | Intel Arc Pro B70 | AMD Radeon AI PRO R9700 | AMD Radeon RX 7900 XTX | NVIDIA RTX 3090 | NVIDIA RTX 4070 | NVIDIA RTX 5060 Ti | NVIDIA RTX 5070 | NVIDIA RTX 5070 Ti | NVIDIA RTX 4090 | NVIDIA RTX 5090 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Architecture | Intel Xe2 | AMD RDNA 4 | AMD RDNA 3 | 3rd-Gen Ampere | 4th-Gen Ada | 5th-Gen Blackwell | 5th-Gen Blackwell | 5th-Gen Blackwell | 4th-Gen Ada | 5th-Gen Blackwell |
| VRAM Capacity | 32 GB GDDR6 | 32 GB GDDR6 | 24 GB GDDR6 | 24 GB GDDR6X | 12 GB GDDR6X | 16 GB GDDR7 | 12 GB GDDR7 | 16 GB GDDR7 | 24 GB GDDR6X | 32 GB GDDR7 |
| Memory Bus Width | 256-bit | 256-bit | 384-bit | 384-bit | 192-bit | 128-bit | 192-bit | 256-bit | 384-bit | 512-bit |
| Memory Bandwidth (GB/s) | 608 | 644.6 | 960 | 936 | 504 | 448 | 672 | 896 | 1,008 | 1,792 |
| FP32 (TFLOPS) | ~22.9 | ~47.8 | ~61.4 | ~35.6 | ~29.2 | ~23.7 | ~30.8 | ~43.9 | ~82.6 | ~104.8 |
| FP16 / BF16 Dense (TFLOPS) | ~46 | ~95.7 | ~123 | ~71 | ~117 | ~94.8 | ~124 | ~175.7 | ~165.2 | ~419.2 |
| FP16 / BF16 Sparse (TFLOPS) | No sparsity | No sparsity | No sparsity | ~142 | ~233 | ~189.6 | ~248 | ~351.4 | ~330.3 | ~838.4 |
| INT8 (TOPS) / FP8 (TFLOPS), Dense | 367 / ~46 | ~191.4 / ~95.7 | ~246 / Emulated | ~142 / Emulated | ~233 / ~233 | ~189.6 / ~189.6 | ~248 / ~248 | ~351.4 / ~351.4 | ~330.3 / ~330.3 | ~838.4 / ~838.4 |
| INT8 (TOPS) / FP8 (TFLOPS), Sparse | No sparsity | No sparsity | No sparsity | ~284 / Emulated | ~466 / ~466 | ~379.2 / ~379.2 | ~496 / ~496 | ~702.8 / ~702.8 | ~660.6 / ~660.6 | ~1,676.8 / ~1,676.8 |
| INT4 Dense / Sparse (TOPS) | ~734 / No sparsity | ~1,531 / No sparsity | ~246 / No sparsity | ~284 / ~568 | ~466 / ~932 | ~379.2 / ~758.4 | ~496 / ~992 | ~702.8 / ~1,405.6 | ~660.6 / ~1,321 | ~1,676.8 / ~3,353.6 |
| FP4 Dense / Sparse (TFLOPS) | N/A (emulated) | N/A (emulated) | N/A (emulated) | N/A (emulated) | N/A (emulated) | ~758.4 / ~1,518 | ~988 / ~1,976 | ~1,403 / ~2,806 | N/A (emulated) | ~1,676.8 / ~3,353.6 |
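
As a rough illustration of why LLM decode tends to hit the bandwidth wall first, here is a back-of-the-envelope sketch using a few bandwidth and dense FP16 numbers from the table. It assumes a hypothetical 8B-parameter model with FP16 weights at batch size 1, where every weight is read once per token and costs about 2 FLOPs; KV cache, activations, and kernel overhead are ignored, so these are upper bounds, not predictions. The function name `decode_bounds` is my own.

```python
# name: (memory bandwidth in GB/s, dense FP16 TFLOPS), copied from the table
CARDS = {
    "Arc Pro B70": (608.0, 46.0),
    "Radeon AI PRO R9700": (644.6, 95.7),
    "RTX 3090": (936.0, 71.0),
    "RTX 5090": (1792.0, 419.2),
}


def decode_bounds(params_billion, bytes_per_param, bandwidth_gbs, tflops):
    """Return (memory_bound, compute_bound) tokens/s for batch-1 decode."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    # Assumption: one multiply and one add per weight for each token.
    flops_per_token = 2 * params_billion * 1e9
    memory_bound = bandwidth_gbs * 1e9 / weight_bytes
    compute_bound = tflops * 1e12 / flops_per_token
    return memory_bound, compute_bound


for name, (bw, tf) in CARDS.items():
    # Hypothetical model: 8B parameters, 2 bytes per parameter (FP16).
    mem, comp = decode_bounds(8, 2, bw, tf)
    print(f"{name}: memory-bound {mem:.0f} tok/s, compute-bound {comp:.0f} tok/s")
```

Under these assumptions the memory-bound limit is far below the compute-bound one on every card listed, which is why single-stream decode speed tracks memory bandwidth much more closely than TFLOPS.
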