Is the R9700 really a viable entry point for a newcomer?
-
This is a setup with two R9700 cards, running vllm-openai-rocm with FP8.

| Model | Test | Tokens/sec | Peak Tokens/sec | TTFR (ms) | Est PPT (ms) | E2E TTFT (ms) |
|:------------|--------------:|----------------:|----------------:|----------------:|----------------:|----------------:|
| Qwen3.6-27B | pp2048 @ d4096 | 2508.92 ± 11.57 | — | 2529.74 ± 11.19 | 2449.58 ± 11.19 | 2529.74 ± 11.19 |
| Qwen3.6-27B | tg32 @ d4096 | 72.94 ± 0.55 | 75.30 ± 0.57 | — | — | — |
| Qwen3.6-27B | pp2048 @ d8132 | 2402.38 ± 1.13 | — | 4318.05 ± 1.99 | 4237.88 ± 1.99 | 4318.05 ± 1.99 |
| Qwen3.6-27B | tg32 @ d8132 | 63.52 ± 3.35 | 65.58 ± 3.46 | — | — | — |
| Qwen3.6-27B | pp2048 @ d16000 | 2197.86 ± 7.44 | — | 8292.49 ± 28.04 | 8212.32 ± 28.04 | 8293.70 ± 28.04 |
| Qwen3.6-27B | tg32 @ d16000 | 53.45 ± 2.63 | 55.18 ± 2.71 | — | — | — |
| Qwen3.6-27B | pp2048 @ d30000 | 1899.63 ± 1.41 | — | 16951.73 ± 13.21 | 16871.56 ± 13.21 | 16952.54 ± 14.22 |
| Qwen3.6-27B | tg32 @ d30000 | 53.23 ± 0.16 | 54.95 ± 0.17 | — | — | — |
| Qwen3.6-27B | pp2048 @ d60000 | 1459.41 ± 0.62 | — | 42596.49 ± 18.16 | 42516.32 ± 18.16 | 42598.65 ± 18.72 |
| Qwen3.6-27B | tg32 @ d60000 | 40.35 ± 0.04 | 41.66 ± 0.04 | — | — | — |
| Qwen3.6-27B | pp2048 @ d90000 | 1181.78 ± 0.27 | — | 77970.53 ± 16.71 | 77890.36 ± 16.71 | 77970.53 ± 16.71 |
| Qwen3.6-27B | tg32 @ d90000 | 28.89 ± 0.07 | 30.33 ± 0.47 | — | — | — |
| Qwen3.6-27B | pp2048 @ d120000 | 991.43 ± 0.47 | — | 123185.76 ± 58.07 | 123103.97 ± 58.07 | 123187.93 ± 60.50 |
| Qwen3.6-27B | tg32 @ d120000 | 25.20 ± 1.44 | 26.67 ± 0.94 | — | — | — |
| Qwen3.6-27B | pp2048 @ d150000 | 854.21 ± 0.17 | — | 178081.59 ± 36.01 | 177999.80 ± 36.01 | 178088.15 ± 32.55 |
| Qwen3.6-27B | tg32 @ d150000 | 21.86 ± 1.19 | 24.33 ± 0.94 | — | — | — |

Launch parameters:
--model /app/models --served-model-name Qwen3.6-27B-FP8 --host 192.168.1.224 --port 5678 --tool-call-parser qwen3_coder --enable-auto-tool-choice --reasoning-parser qwen3 --language-model-only --tensor-parallel-size 2 --max-num-seqs 4 --max-model-len 200k --dtype auto --gpu-memory-utilization 0.95 --attention-config.backend TRITON_ATTN --quantization fp8 --enable-chunked-prefill --enable-prefix-caching --override-generation-config '{"temperature":0.6, "top_p":0.95, "top_k":20, "presence_penalty": 0.0 , "repetition_penalty":1.0}' --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Going by this, a single card would probably have to halve the context length to 100K, and TTFT will most likely take a big hit as well.
If you want to tinker with one card, llama.cpp + Vulkan is probably still the way to go.
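
A quick sanity check on how the table columns relate. The Est PPT figures appear consistent with dividing the total number of tokens to prefill (context depth plus the 2048-token prompt) by the reported tokens/sec. This is an inference from the numbers, not a statement of how the benchmark tool defines the column; the function name below is made up for illustration.

```python
# Rows copied from the pp2048 results in the table above:
# (context depth, prompt tokens, reported tokens/sec, reported Est PPT in ms)
ROWS = [
    (4096, 2048, 2508.92, 2449.58),
    (8132, 2048, 2402.38, 4237.88),
    (30000, 2048, 1899.63, 16871.56),
    (150000, 2048, 854.21, 177999.80),
]


def estimated_prefill_ms(depth, prompt_tokens, tokens_per_sec):
    """Time in ms to process depth + prompt_tokens at the given throughput.

    Assumption: the whole context (depth) is prefilled together with the prompt.
    """
    return (depth + prompt_tokens) / tokens_per_sec * 1000.0


for depth, prompt_tokens, tps, reported_ms in ROWS:
    est_ms = estimated_prefill_ms(depth, prompt_tokens, tps)
    print(f"d{depth}: estimated {est_ms:.0f} ms, reported {reported_ms:.0f} ms")
```

The estimates land within about 0.1% of the reported values, so the same formula gives a rough prefill time for any other context length, as long as you plug in a throughput measured near that depth.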
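-
@566656661 My target is only 100K anyway 🥲 They're out of stock.
My machine can't take two cards. I did consider two 7900 XTX: under 20,000 with 960 GB/s of bandwidth, which seems like better value than two R9700s.
But for a single card, there is simply nothing under 20,000 that is faster than the R9700. Thanks anyway, I've copied down the data and it's very useful.
Right now I'm on WSL + LM Studio... if I get an R9700, it looks like I'd have to move everything over to Ubuntu....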
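
For comparing cards by bandwidth, a common rule of thumb is that single-stream decoding of a dense model is memory-bound: every generated token reads all the weights once, so tokens/sec cannot exceed bandwidth divided by model size. The sketch below applies that rough bound to the 960 GB/s figure mentioned above and to the 16.39 GiB Q4_K model size and 31.26 t/s tg128 result from the llama-bench output posted in this thread. It ignores KV-cache reads, speculative decoding, and multi-GPU overhead, and the function names are illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB


def decode_upper_bound_tps(bandwidth_gb_per_s, model_size_gib):
    """Rough ceiling on decode tokens/sec if each token reads all weights once."""
    bytes_per_token = model_size_gib * GIB
    return bandwidth_gb_per_s * 1e9 / bytes_per_token


def implied_bandwidth_gb_per_s(tokens_per_sec, model_size_gib):
    """Effective bandwidth (GB/s) implied by a measured decode rate."""
    return tokens_per_sec * model_size_gib * GIB / 1e9


MODEL_SIZE_GIB = 16.39  # Q4_K model size reported by llama-bench in this thread

ceiling = decode_upper_bound_tps(960, MODEL_SIZE_GIB)
effective = implied_bandwidth_gb_per_s(31.26, MODEL_SIZE_GIB)
print(f"960 GB/s ceiling: {ceiling:.1f} t/s")
print(f"bandwidth implied by 31.26 t/s: {effective:.0f} GB/s")
```

Real numbers will come in below the ceiling, so treat it only as a way to rank cards before buying, not as a prediction.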
-
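@rolex-lo Yes. The sweet spot for local LLMs (a 32 GB card with high memory bandwidth) used to be the 5090, but its price has gone through the roof.
The 5090 currently costs more than the RTX PRO 5000, which I find hard to understand... If you want to run LLMs comfortably, memory bandwidth above 1 TB/s is the baseline; only then do you get good-looking prefill numbers without cutting model precision too far and with a reasonably large context. Now that agent tools are everywhere, system prompts easily exceed 20k tokens, and slow prefill means waiting too long.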
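
To put the 20k-token system prompt into numbers, here is a small sketch that converts prefill throughput into waiting time. The two throughput values are taken from benchmarks posted in this thread (llama.cpp Vulkan pp512 on one R9700, and vLLM pp2048 at d4096 on two R9700s); since throughput drops as context grows, the real wait at 20k tokens would be somewhat longer than this simple division suggests.

```python
def prefill_wait_seconds(prompt_tokens, prefill_tokens_per_sec):
    """Seconds spent on prefill before the first output token can appear."""
    return prompt_tokens / prefill_tokens_per_sec


PROMPT_TOKENS = 20_000  # the system prompt size mentioned above

# Throughput figures quoted from benchmarks elsewhere in this thread.
SETUPS = [
    ("1x R9700, llama.cpp Vulkan (pp512)", 1050.13),
    ("2x R9700, vLLM FP8 (pp2048 @ d4096)", 2508.92),
]

for label, tps in SETUPS:
    wait = prefill_wait_seconds(PROMPT_TOKENS, tps)
    print(f"{label}: about {wait:.1f} s for {PROMPT_TOKENS} tokens")
```

That works out to roughly 19 s versus 8 s before the first token on a cold cache; prefix caching helps only when the prompt prefix is reused across requests.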
-
$: llama-bench-vulkan -m 'Qwen3.6-27B-UD-Q4_K_XL.gguf'
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon AI PRO R9700 (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | Vulkan | 99 | pp512 | 1050.13 ± 0.54 |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | Vulkan | 99 | tg128 | 31.26 ± 0.01 |

build: 97895129e (8863)

Launch parameters:
llama-server-vulkan -m '/Qwen3.6-27B-UD-Q4_K_XL.gguf' --mmproj '/mmproj-BF16(3).gguf' -np 1 -ngl 99 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence_penalty 0.00 --jinja --chat-template-kwargs '{"preserve_thinking": true}' -ub 2048 -fa 1 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 --host 0.0.0.0 --port 8180

--- Prompt Processing (PPS) Statistics ---
Mean: 549.60 t/s
Median: 519.19 t/s
P95: 936.60 t/s
StdDev: 240.80 (Stability)
Range: 64.18 - 1015.91 t/s

--- Token Generation (Tok/s) Statistics ---
Mean: 28.80 t/s
Median: 28.20 t/s
P95: 45.34 t/s
StdDev: 6.78 (Stability)
Range: 16.49 - 53.63 t/s
Total Tokens Generated: 87840

$:~/Documents/llama_perf$ python3 parse_performance_stats_full.py

== Prompt Processing (PPS) Analysis ==
Effective Avg: 549.60 t/s (Token-Weighted)
Median (P50): 519.19 t/s
Tail (P99): 958.31 t/s
Stability(CV): 43.8% (JITTERY)
Skewness: 0.04 (Symmetric)

== Token Generation (Tok/s) Analysis ==
Effective Avg: 1697.20 t/s (Token-Weighted)
Median (P50): 28.20 t/s
Tail (P99): 51.39 t/s
Stability(CV): 23.5% (JITTERY)
Skewness: 1.40 (Burst Heavy)

It looks at least better than vLLM, though really only by a little.
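
The script `parse_performance_stats_full.py` is not shown in the thread, so the following is only a guess at how statistics of this kind could be computed from per-request measurements. The `(tokens, seconds)` input format, the use of sample standard deviation, and the function name are all assumptions. The CV definition (standard deviation divided by mean) does match the posted figures: 240.80 / 549.60 ≈ 43.8% and 6.78 / 28.80 ≈ 23.5%.

```python
import statistics


def summarize(samples):
    """Summarize throughput for a list of (tokens, seconds) pairs, one per request."""
    rates = [tokens / seconds for tokens, seconds in samples]
    mean = statistics.fmean(rates)
    stdev = statistics.stdev(rates)  # sample standard deviation (assumption)
    total_tokens = sum(tokens for tokens, _ in samples)
    total_seconds = sum(seconds for _, seconds in samples)
    return {
        "mean": mean,
        "median": statistics.median(rates),
        # 99 cut points for n=100; index 94 is the 95th percentile.
        "p95": statistics.quantiles(rates, n=100, method="inclusive")[94],
        "stdev": stdev,
        "cv_percent": stdev / mean * 100.0,
        # Token-weighted average: total tokens over total time.
        "token_weighted_avg": total_tokens / total_seconds,
    }


# Hypothetical measurements, not data from the thread.
example = [(100, 4.0), (200, 5.0), (400, 20.0)]
for name, value in summarize(example).items():
    print(f"{name}: {value:.2f}")
```

The token-weighted average counts long requests more heavily than the plain mean of per-request rates, so the two can differ noticeably when request sizes vary.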
-
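@rolex-lo
I use opencode with LiteLLM to run Gemma 4 / Qwen 3.6 and 3.7.
My main tools are Codex Max + Claude Code Max 200; my work is mobile full-stack development plus LLM DevOps.
I regularly feed large volumes of on-device logs straight in for analysis, and I also let the AI run E2E tests directly.
I pair that with BDD for testing and development.
-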
@566656661 I've read through this again and again. So how big is the gap between a Blackwell 4500 with 32 GB of VRAM and the R9700? Apart from the price...
-
@566656661 Thanks. I just want to understand whether, at double the price, it would even be 50% better than the R9700...

-
@566656661 I'm looking forward to it too. Maybe we could benchmark the same metric?
-
Then the R9700 really is underwhelming for your use case. How large do you set your context?