抡锤者

kaifan

@xiaote 帮我检查一下数据是否准确

kaifan

还有好几个错误更正后的

Hardware Metric	Arc Pro B70	Radeon AI PRO R9700	RX 7900 XTX	RTX 3090	RTX 4070	RTX 5060 Ti 16GB	RTX 5070	RTX 5070 Ti	RTX 4090	RTX 5090
Architecture	Xe2	RDNA4	RDNA3	Ampere	Ada	Blackwell	Blackwell	Blackwell	Ada	Blackwell
VRAM	32 GB	32 GB	24 GB	24 GB	12 GB	16 GB	12 GB	16 GB	24 GB	32 GB
Memory Bus	256-bit	256-bit	384-bit	384-bit	192-bit	128-bit	192-bit	256-bit	384-bit	512-bit
Memory Bandwidth	608 GB/s	644.6 GB/s	960 GB/s	936 GB/s	504 GB/s	448 GB/s	672 GB/s	896 GB/s	1008 GB/s	1792 GB/s
FP32	22.9 TF	47.8 TF	61.4 TF	35.6 TF	29.2 TF	23.7 TF	30.8 TF	43.9 TF	82.6 TF	104.8 TF
FP16/BF16 Vector	45.9 TF	95.7 TF	122.8 TF	71.2 TF	58.3 TF	47.4 TF	61.6 TF	87.8 TF	165.2 TF	209.6 TF
FP16/BF16 Matrix Dense	~183 TF*	191.4 TF	~123 TF†	142 TF	233 TF	189.6 TF	248 TF	351.4 TF	330.3 TF	838.4 TF
FP16/BF16 Matrix Sparse	—	382.8 TF	—	284 TF	466 TF	379.2 TF	496 TF	702.8 TF	660.6 TF	1676.8 TF
FP8 Matrix Dense	~367 TF*	382.8 TF	Emulated	Emulated	466 TF	379.2 TF	496 TF	702.8 TF	660.6 TF	1676.8 TF
FP8 Matrix Sparse	—	765.6 TF	—	—	932 TF	758.4 TF	992 TF	1405.6 TF	1321 TF	3353.6 TF
INT8 Dense	367 TOPS	382.8 TOPS	~246 TOPS†	142 TOPS	233 TOPS	189.6 TOPS	248 TOPS	351.4 TOPS	330.3 TOPS	838.4 TOPS
INT8 Sparse	—	765.6 TOPS	—	284 TOPS	466 TOPS	379.2 TOPS	496 TOPS	702.8 TOPS	660.6 TOPS	1676.8 TOPS
INT4 Dense	~734 TOPS*	1531 TOPS	~246 TOPS†	284 TOPS	466 TOPS	379.2 TOPS	496 TOPS	702.8 TOPS	660.6 TOPS	1676.8 TOPS
INT4 Sparse	—	3062 TOPS	—	568 TOPS	932 TOPS	758.4 TOPS	992 TOPS	1405.6 TOPS	1321 TOPS	3353.6 TOPS
Native FP8	Estimated	Yes	No	No	Yes	Yes	Yes	Yes	Yes	Yes
Native FP4	No	No	No	No	No	Yes	Yes	Yes	No	Yes
FP4 Dense	—	—	—	—	—	758.4 TF	992 TF	1405.6 TF	—	1676.8 TF
FP4 Sparse	—	—	—	—	—	1516.8 TF	1984 TF	2811.2 TF	—	3353.6 TF

kaifan

更正一下 Arc Pro B70, R9700 和 3090 的FP16 matmul数据是fp16 vector op的数据并不是matmul的数据。AI对于整理图表还是很可能会出错

GPU	VRAM	Memory BW	FP16 Matrix Throughput
RTX 3090	24 GB	936 GB/s	142 TFLOPS(dense)/284 TFLOPS(sparse)
Arc Pro B70	32 GB	608 GB/s	~184 TFLOPS
AMD AI Pro R9700	32 GB	640 GB/s	191 TFLOPS

kaifan

这几天出于好奇让Gemini帮忙总结了一下主流卡的参数，以及A卡和I卡大致相同计算性能的N卡的对比，数据不一定完全准确所以仅供参考

这是一些我的硬件理解，如果有不对的还请指正：

这些数据只是理论数据，因为有很多其他瓶颈所以并不完全代表实际性能，尤其是sparse matmul数据仅供参考，也不代表用了cudagraph或者其他特别优化的kernel之后的性能
一些架构除非特别指定一般会根据硬件用兼容性最高的dtype作为运算，比如llama.cpp默认dequant activation到fp32, weight到fp16，vllm和comfyui默认fp16/bf16
跑LLM更多的是sparse matmul 意思是矩阵会有很多0weight N卡tensor对这种矩阵运算有特殊的优化
跑comfyui更多是dense matmul
运算性能只是一部分，有些步骤比如LLM decode和video generation更加多是受显存带宽限制而跑不满运算
I卡的INT8性能虽然强，但似乎只有openvino支持
7900XTX和r9700虽然没有原生fp16硬件支持但似乎Rocm有黑科技能加速fp16运算 R9700是有原生fp8硬件支持的

Hardware Metric	Intel Arc Pro B70	AMD Radeon AI PRO R9700	AMD Radeon RX 7900 XTX	NVIDIA RTX 3090	NVIDIA RTX 4070	NVIDIA RTX 5060 Ti	NVIDIA RTX 5070	NVIDIA RTX 5070 Ti	NVIDIA RTX 4090	NVIDIA RTX 5090
Architecture	Intel Xe2	AMD RDNA 4	AMD RDNA 3	3rd-Gen Ampere	4th-Gen Ada	5th-Gen Blackwell	5th-Gen Blackwell	5th-Gen Blackwell	4th-Gen Ada	5th-Gen Blackwell
VRAM Capacity	32 GB GDDR6	32 GB GDDR6	24 GB GDDR6	24 GB GDDR6X	12 GB GDDR6X	16 GB GDDR7	12 GB GDDR7	16 GB GDDR7	24 GB GDDR6X	32 GB GDDR7
Memory Bus Width	256-bit	256-bit	384-bit	384-bit	192-bit	128-bit	192-bit	256-bit	384-bit	512-bit
Memory Bandwidth	608 GB/s	644.6 GB/s	960 GB/s	936 GB/s	504 GB/s	448 GB/s	672 GB/s	896 GB/s	1,008 GB/s	1,792 GB/s
FP32 (Float32)	~22.9 TFLOPS	~47.8 TFLOPS	~61.4 TFLOPS	~35.6 TFLOPS	~29.2 TFLOPS	~23.7 TFLOPS	~30.8 TFLOPS	~43.9 TFLOPS	~82.6 TFLOPS	~104.8 TFLOPS
FP16 / BF16 (Dense)	~46 TFLOPS	~95.7 TFLOPS	~123 TFLOPS	~71 TFLOPS	~117 TFLOPS	~94.8 TFLOPS	~124 TFLOPS	~175.7 TFLOPS	~165.2 TFLOPS	~419.2 TFLOPS
FP16 / BF16 (Sparse)	No Sparsity	No Sparsity	No Sparsity	~142 TFLOPS	~233 TFLOPS	~189.6 TFLOPS	~248 TFLOPS	~351.4 TFLOPS	~330.3 TFLOPS	~838.4 TFLOPS
INT8 / FP8 (Dense)	367 TOPS / ~46 TF	~191.4 / ~95.7 TF	~246 TOPS / Emulated	~142 TOPS / Emulated	~233 TOPS / ~233 TF	~189.6 / ~189.6 TF	~248 / ~248 TF	~351.4 TOPS / ~351.4 TF	~330.3 / ~330.3 TF	~838.4 / ~838.4 TF
INT8 / FP8 (Sparse)	No Sparsity	No Sparsity	No Sparsity	~284 TOPS / Emulated	~466 TOPS / ~466 TF	~379.2 / ~379.2 TF	~496 / ~496 TF	~702.8 TOPS / ~702.8 TF	~660.6 / ~660.6 TF	~1,676.8 / ~1,676.8 TF
INT4 (Dense / Sparse)	~734 / No Sparse	~1,531 / No Sparse	~246 / No Sparse	~284 / ~568 TOPS	~466 / ~932 TOPS	~379.2 / ~758.4 TOPS	~496 / ~992 TOPS	~702.8 / ~1,405.6 TOPS	~660.6 / ~1,321 TOPS	~1,676.8 / ~3,353.6 TOPS
FP4 (Dense / Sparse)	N/A (Emulated)	N/A (Emulated)	N/A (Emulated)	N/A (Emulated)	N/A (Emulated)	~758.4 / ~1518 TF	~988 / ~1,976 TF	~1,403 / ~2,806 TFLOPS	N/A (Emulated)	~1,676.8 / ~3,353.6 TF

kaifan

5060ti显存可以超到32GHz 估计tg会更快

kaifan

谢谢回答

ComfyUI不支持多卡，LTX Unet生成Latent文件必须单卡连续显存

对，我是一个卡一个comfyui一起跑的。我的理解是latent space肯定一直在gpu上只是weight是dynamic loading的。三个卡每个300秒就是一个视频100秒

4090的速度大概是4060Ti的10倍以上都不止

这个有可能是4060ti的显存带宽导致的 200多GB/s 跟4090的差不多1TB/s应该没有太大可比性。我的5060ti其实超频了和4070的运算能力差不太多，但是估计因为显存带宽不够功率怎么也跑不上去，只能跑到60％的满功耗。4070显存超频之后倒是能跑到80-90％功耗大概显存带宽有550GB/s
好奇gemini说的4090baseline准不准，24G以上显存太难买了但是12G显存的容易很多

kaifan

最近用原生 ComfyUI 测试了最新的 LTX 视频生成。配置是 22B FP8 模型（因为显存不够，走的是 Dynamic Loading 动态加载），纯 PyTorch Attention 架构，没开 sageattention 或 fa3。

生成参数：官方例子workflow 720p 24fps 10秒视频（共240帧）。同时双卡双comfyui instance连续跑了15个小时，清除缓存与模型之后实测每张卡生成单个视频的时间和功耗如下：

5060 Ti：约 380 秒（平均负载 120W/200W，感觉撞了显存带宽墙）

4070：约 340 秒（平均负载 180W/200W）缓存模型和execution cache后速度约230s

我把这个结果喂给 Gemini，它说4090 的基准成绩（75 到 90 秒）但我没有4090所以没办法验证。但是按照这个计算

算力缩放：4090 有 512 个 Tensor Core，4070 有 184 个，算力核心比差距约 2.84 倍。按 4090 最慢 90 秒算纯算力时间：90秒 * 2.84 = 255.6 秒。

显存带宽损耗（带宽税）：用Pytorch attention，4070显存带宽比4090少一半假设这个带来额外 25% 左右的时间损耗。

最终估算：255.6 秒 * 1.25 约等于 320 秒。

这个估算出来的 320 秒和我的实测 340 秒非常接近。我想请教一下懂底层的大佬：

两张卡都是PCIe 4.0x4 8G/s duplex 在跑sampling的时候看nvtop大约5-10秒一次从内存load一次，并没有想象中那么频繁，假设这个4090的数据属实，假设有足够的内存，如果主板支持3卡，是不是3个4070大概也能达到4090的生成速度？

用的workflow：

video_ltx2_3_t2v.json

kaifan

@sirwang 哦？是这周的事情吗周末去试试看谢谢告知

kaifan

@sirwang arc pro b70

kaifan

分享一下单卡跑llmscaler数据
周末把 Qwen3.6-27B 调到了一个对于 Agentic Loop 来说还算能接受的状态。比较系统的跑了一下单请求和并行 5 rep的benchmark。pp速度还可以，但 tg还是有点慢。不过配合 vLLM 的 continuous batching，并行 token 生成整体还比较稳定。目前专门用来给Hermes agent的delegate task去收集代码库context打下手

目前唯一比较大的问题是：KV Cache 必须使用 BF16，才能达到可用的 token generation 速度，但ctx就只有43000了。另外还需要骗 vLLM，让它识别 layer architecture。希望未来能有优化过的 FP8 dequant kernel去支持fp8的kvcache。fp8的dequant比Q8_0慢很多，可惜官方docker的vllm版本还不支持除了fp8和bf16以外的kvcache dtype。可惜它和7900xtx都没有fp8的硬件支持，好像r9700有。另外autoround质量还是稍微比不过Q4的gguf

硬件比较旧 64g的ddr4 虽然比较慢，但总比 pcie4x16 快。proxmox 9.1

vLLM 单请求 qwen/qwen3.6-27b（int4 AutoRound）：

PP TTFT：1,685 ms

PP2048 TPS：1,686 ± 66 tok/s

TG512：13.7 ± 1.4 tok/s

并行测试 pp2048 tg512
Conc: 1
• TTFT(ms): 1,261
• Prefill(tok/s): 1,400
• Decode(tok/s): 13.3
• Output(tok/s): 12.9

• Conc: 2
• TTFT(ms): 1,907
• Prefill(tok/s): 925
• Decode(tok/s): 12.9
• Output(tok/s): 24.7

• Conc: 4
• TTFT(ms): 3,319
• Prefill(tok/s): 532
• Decode(tok/s): 12.7
• Output(tok/s): 46.7

• Conc: 8
• TTFT(ms): 6,231
• Prefill(tok/s): 283
• Decode(tok/s): 11.9
• Output(tok/s): 82.7

docker run 命令：

docker run -it --rm --name vllmb70 --ipc=host --shm-size=32g
--device=/dev/dri:/dev/dri --privileged -p 1234:8000
-v ~/.cache/huggingface:/root/.cache/huggingface
-e VLLM_TARGET_DEVICE=xpu
--entrypoint /bin/bash intel/llm-scaler-vllm:0.14.0-b8.2.1 -c "
source /opt/intel/oneapi/setvars.sh --force &&
sed -i 's/image_processor.max_pixels/getattr(image_processor, "max_pixels", 12845056)/g'
/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_vl.py &&
python3 -m vllm.entrypoints.openai.api_server
--model Intel/Qwen3.6-27B-int4-AutoRound
--tokenizer Qwen/Qwen3.6-27B
--served-model-name qwen/qwen3.6-27b
--kv-cache-dtype auto
--max-model-len 65536
--gpu-memory-utilization 0.9
--enable-auto-tool-choice
--tool-call-parser qwen3_xml
--allow-deprecated-quantization
--trust-remote-code
--port 8000
--tensor-parallel-size 1
--pipeline-parallel-size 1
--enforce-eager
"

也跑了一下ltx2.3 full gpu offload比4070需要dynamic loading快10%左右 custom node很多不支持暂时不值得折腾

抡锤者

kaifan

帖子