看目前這社區越來越多人買7900XTX了，大家為了一個爽度token無限發與反應速度，這幾天折騰的過程分享給大家(win11+vulkan & ubuntu +rocm)

CHIA AN YANG

看目前這社區越來越多人買7900XTX了，大家為了一個爽度token無限連發與反應速度，這幾天折騰的過程分享給大家，我主要場景是hermes agent,透過telgram發任務請求

7900 XTX × 2 跑 Qwen3.6 27B 本地 LLM 測試報告

硬體：Z10PE-D16-WS 工作站主機板 × 雙 Xeon E5-2678v3（24C/48T）× 128GB DDR4 ECC × 雙 RX 7900 XTX 24GB
系統：Ubuntu + ROCm / Vulkan（AMD GPU 的 Linux 推理框架）
模型：Qwopus3.6-27B-v2-MTP（Unsloth 發布，基於 Qwen3.6 27B，內建 MTP 輔助頭）
量化：IQ4_XS（4.25 bpw）
時間：2026-06

名詞解釋（新手看這裡）

量化（Quantization）：把模型權重壓縮存放，犧牲一點點精度換取更小的檔案和更快速度。

fp16：16-bit 浮點數，每個值 2 bytes，沒有壓縮，是 GPU 的「原始精度」。27B 模型 fp16 約需 54 GB VRAM，一般玩家裝不下。

Q4_K_M：每個值平均 4.5 bpw（bits per weight），是最普遍的量化格式，品質好、速度穩定。

IQ4_XS：每個值 4.25 bpw，比 Q4_K_M 稍微更壓縮，在 RDNA3 架構（7900 XTX）上因為 VRAM 佔用更少，實際跑起來反而更快。

turbo4：beellama fork 獨有的 KV cache 量化，3.5 bpw，比 q4_0（4 bpw）更省 VRAM，用在 KV cache（不是模型本身）。

KV cache：模型在處理長對話時，會把「記憶」存在 VRAM 裡，稱為 KV cache。對話越長，佔用越多 VRAM。量化 KV cache 可以在不降低多少品質的前提下省 VRAM。

MTP（Multi-Token Prediction）：一種加速推理的技術。模型一次預測多個「草稿 token」，再批次驗證，如果草稿正確就直接採用，不正確就丟掉重算。接受率越高速度越快，若接受率 0% 反而比不開更慢（多餘計算）。

t/s（tokens per second）：每秒輸出多少個 token。中文大約 1 個 token = 1 個字，英文 1 個 token ≈ 0.75 個單字。一般對話感覺順暢約需 20+ t/s。

背景與問題

之前在 Win11 + Vulkan 跑 Qwen3.6 27B 可以穩定 60-80 t/s，換到 Ubuntu + ROCm 後掉到 28-33 t/s，本篇記錄怎麼找回速度。

測試過程

階段一：找出 ROCm 慢的原因

配置	速度	備註
goodbyecain b9256 + Q4_K_M + q4_0 KV	22-27 t/s	舊主力，最慢
goodbyecain b9256 + IQ4_XS + q4_0 KV	30-33 t/s	換小量化有幫助
Vulkan build + IQ4_XS（無 MTP）	31-33 t/s	Vulkan base 跟 ROCm 差不多

發現：Vulkan base 速度跟 ROCm 幾乎一樣。Win11 快那麼多，關鍵不在 Vulkan vs ROCm，而是 MTP 能否有效運作。

階段二：MTP 在 Linux 上的問題

原始結論（goodbyecain b9256，AMD 7900 XTX）：Vulkan MTP 接受率約 0.7%，幾乎無效。

更新（2026-06-22，upstream b9377）：實測後確認 issue #22842 已在新版修復。同樣硬體（AMD 7900 XTX），升級到 upstream b9377 之後：

配置	MTP 接受率	速度
goodbyecain b9256 + Vulkan	~0.7%	31-33 t/s
upstream b9377 + Vulkan（實測）	53.5%	49.8 t/s
goodbyecain b9256 + ROCm	54-77%	39-42 t/s

結論更新：舊版 Vulkan MTP 確實有 bug，新版已修。AMD 7900 XTX 上新版 Vulkan MTP 接受率（53.5%）與 ROCm 相近，速度更快（49.8 vs 39-42 t/s）。如果你在 AMD GPU 上跑 Vulkan，建議升級到 upstream b9377 以上。

階段三：beellama + TurboQuant（turbo4 KV cache）

beellama（v0.3.2）是 llama.cpp 的非官方 fork，加入了 TurboQuant（ICLR 2026 論文），一種更激進的 KV cache 量化方式，稱為 turbo4（3.5 bpw，比標準 q4_0 的 4 bpw 更壓縮）。

更省 VRAM → KV cache 更小 → MTP draft 驗證更快 → 接受率更高

配置	速度	MTP 接受率
beellama + IQ4_XS + turbo4 KV + n=4 草稿	38-40 t/s	~38%（n4 太多，浪費）
beellama + IQ4_XS + turbo4 KV + n=3 草稿	39-42 t/s	54-77%

n=4 表示一次預測 4 個草稿 token，但這個模型在 n=4 時常常只接受 0 或 1 個，白費算力；n=3 接受率更穩定。

階段四：Context 大小與速度曲線

Vulkan b9377（q4_0 KV，65K ctx，2026-06-22 實測）：

Context 大小	速度	備註
1K–5K tokens	72–75 t/s	KV cache 小，attention 快
5K–17K tokens	70–72 t/s	輕微下降
27K tokens	40.7 t/s	明顯減速
48K tokens	34.7 t/s	趨於穩定
59K tokens	35.7 t/s	大 context 底限

Vulkan b9377 在短 context 下速度驚人，但隨 context 成長速度明顯衰減。在 48K+ 時（35 t/s），ROCm beellama + turbo4（39-42 t/s）反而更快，因為 turbo4 KV cache 更小（3.5 bpw vs 4 bpw），attention bandwidth 占用更少。

階段五：KV cache 精度（q4_0 vs q8_0）對 MTP 的影響

這個發現比較意外：KV cache 精度直接影響 MTP 接受率。

原因：MTP 的 draft head 在驗證草稿 token 時，需要讀取已有 token 的 KV cache 做 attention。KV cache 精度越低，attention 結果的誤差越大，導致 draft 驗證時機率分佈偏移，更多草稿被拒絕。

KV cache 格式	VRAM（65K ctx）	MTP 接受率	速度
q4_0（4 bpw）	73%（~18GB）	38-51%	35-42 t/s
q8_0（8 bpw）	75%（~18.4GB）	52-57%	44-51 t/s

只多用 2% VRAM（約 400MB），速度提升 25%。 這是這次測試中 CP 值最高的發現。

128K context 也測了：

配置	VRAM	速度
128K + q4_0	76%	~40 t/s（短 ctx）
128K + q8_0	84%（~20.6GB）	47-52 t/s

兩種 128K 配置都能裝進 24GB 的 7900 XTX，之前 ROCm beellama 128K 溢出是 ROCm 的記憶體管理問題，Vulkan 不同。

階段六：AMD GPU 時脈陷阱

症狀：同樣配置，server 第一次啟動時速度明顯比後來重啟的慢。

根本原因：AMD GPU 在 auto 效能模式下，閒置時 shader clock（sclk）會降到 25-87 MHz（正常推理需要 2371 MHz）。新啟動的 server 若沒有立即收到請求，GPU 時脈爬升不及，前幾個 query 在極低頻率下跑。

驗證：在推理進行時監控 sclk：

閒置：25 MHz
推理中：2080 → 2369 → 2400 MHz（正常）
推理結束：立刻掉回 94 MHz

解法：手動鎖定 sclk 和 mclk 到最高頻：

# 需要 sudo
rocm-smi --device 0 --setperflevel manual
echo "2" > /sys/class/drm/card0/device/pp_dpm_sclk  # 鎖 sclk → 2371 MHz
echo "3" > /sys/class/drm/card0/device/pp_dpm_mclk  # 鎖 mclk → 1249 MHz

注意：card0 編號因系統而異，用 ls /sys/class/drm/card*/device/pp_dpm_sclk 查詢。

階段七：server 層 sampling 參數會殺死 MTP

症狀：在 llama-server 啟動時加了 --presence-penalty 1.5 --top-k 20，MTP 接受率從 54% 掉到 41%，速度從 52 t/s 掉到 37 t/s。

原因：MTP 的 draft head 預測的是「原始機率分佈」，不知道有 sampling 限制。當 server 強制 top-k 20 時，很多 draft head 認為高機率的 token 其實在 top-20 之外，被過濾掉，導致接受率下降。

解法：server 層不設 sampling 參數，讓 client 每次 request 自帶。這樣 MTP 在驗證時用的是完整分佈，接受率恢復正常。

最終最佳配置（2026-06-22 更新）

使用軟體：upstream llama.cpp（b9377 以上，標準 Vulkan build）
模型：Qwopus3.6-27B-v2-MTP IQ4_XS（Unsloth HuggingFace）

#!/bin/bash
# 先鎖 GPU 時脈（需 sudo）
sudo rocm-smi --device 0 --setperflevel manual
sudo bash -c "echo '2' > /sys/class/drm/card2/device/pp_dpm_sclk"
sudo bash -c "echo '3' > /sys/class/drm/card2/device/pp_dpm_mclk"

export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json

SERVER=/path/to/llama.cpp/build-vulkan/bin/llama-server
MODEL=/path/to/Qwopus3.6-27B-v2-MTP-IQ4_XS.gguf

"$SERVER" \
  --host 0.0.0.0 --port 8080 \
  --device Vulkan0 \            # 指定 GPU0
  -m "$MODEL" \
  --alias "unsloth/Qwen3.6-27B-GGUF" \
  --spec-type draft-mtp \       # 開啟 MTP 推測解碼
  --spec-draft-n-max 3 \        # 一次預測 3 個草稿 token
  -ngl 99 \                     # 全部層放 GPU
  --ctx-size 65536 \            # 65K context
  -n 8192 \
  -b 2048 -ub 512 -np 1 \
  --cache-type-k q8_0 \         # q8_0 KV cache（比 q4_0 接受率高 10-15%）
  --cache-type-v q8_0 \
  --no-mmap --mlock \
  --flash-attn on \
  --jinja --no-warmup --reasoning off
  # 注意：不在 server 層設 sampling 參數（top-k/presence-penalty 會降低 MTP 接受率）

速度總結

配置	速度	vs 舊主力
goodbyecain + Q4_K_M + q4_0（舊主力）	22-27 t/s	基準
goodbyecain + IQ4_XS + q4_0	30-33 t/s	+25%
Vulkan 無 MTP（b9256）	31-33 t/s	+25%
beellama + IQ4_XS + turbo4 + n3 MTP（ROCm）	39-42 t/s	+60%
Vulkan b9377 + IQ4_XS + q4_0 + n3 MTP	35-42 t/s	+50%
Vulkan b9377 + IQ4_XS + q8_0 + n3 MTP（現役）	44-51 t/s	+90%

關鍵結論

Vulkan MTP bug 已在 b9377 修復（issue #22842）：舊版 b9256 接受率 0.7%，新版 53%+
q8_0 KV cache 比 q4_0 快 25%：只多用 2% VRAM，MTP 接受率從 42% 升到 54%，CP 值最高的優化
server 層 sampling 參數會殺 MTP：--presence-penalty、--top-k 等讓 draft 被過濾，接受率掉 10-15%；改由 client 每次帶
AMD GPU 時脈陷阱：auto 模式閒置時 sclk 掉到 25-87 MHz，啟動 server 前要先鎖時脈
n=3 草稿比 n=4 好：接受率更穩定，不浪費算力
65K 是 Vulkan 的甜蜜點，128K q8_0 也能裝（84% VRAM），但速度差不多
IQ4_XS 比 Q4_K_M 快：VRAM footprint 更小，MTP 批次更有效率

已知限制

結構化輸出（JSON schema）時 MTP 失效：grammar constraint 會讓 MTP 接受率歸零，速度掉回 ~21 t/s。這是 llama.cpp 的已知問題。
Win11 的 60-80 t/s 差距：現在 Linux Vulkan 已可到 44-51 t/s，差距縮小。Windows 快的部分原因可能是 grammar 限制較少，還在研究中。

相關連結

beellama（TurboQuant fork）
Qwopus3.6-27B-v2-MTP（Unsloth）
llama.cpp Vulkan MTP 問題：issue #22842
TurboQuant 論文：ICLR 2026（beellama README 有連結）

AGI · **Vulkan b9377（q4_0 KV，65K ctx，2026-06-22 實測）：**

@CHIA-AN-YANG 说:

Vulkan 版在 Linux 上 MTP 幾乎無效（GitHub issue #22842）

有效果啊，接受率70%+

terry

非常好，这样的帖子对小白很有帮助

CHIA AN YANG

@AGI 不知道耶我怎麼測速度都起不來,,cc直接放棄,但我在win11+vulkan明顯可以我的AI 可能累了,你有作業可以抄嗎?

AGI

@CHIA-AN-YANG

/usr/local/bin/llama-server \
    -m ./models/Qwen3.6-27B-Uncensored-HauhauCS-Balanced-MTP-Q5_K_P.gguf \
    --mmproj ./models/mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-f16.gguf \
    -c 131072 \
    --parallel 1 \
    -b 2048 \
    -ub 512 \
    -fa 1 \
    -ngl 99 \
    -t 16 \
    --spec-type draft-mtp \
    --cache-type-k q5_0 \
    --cache-type-v q4_1 \
    --no-mmap \
    --temp 0.4 \
    --spec-draft-n-max 3 \
    --top-p 0.95 \
    --top-k 20 \
    --host 0.0.0.0 \
    --port 8080 \
    --tools all

root@ailab:~# llama-server --version
version: 236 (d5376cf5d)
built with GNU 13.3.0 for Linux x86_64

abaalei

啊啊啊看到就心痒痒又想折腾了

Z Boss丶

@CHIA-AN-YANG llama-server.service
/home/myclaw/Downloads/llama.cpp/vulkan/bin/llama-server -m /media/myclaw/SYS/VM/llm/Qwen3.6-27B-Q4_K_M-mtp.gguf --alias qwen3.6-27b --spec-type draft-mtp --spec-draft-n-max 3 --cache-type-k q4_0 --cache-type-v q4_0 -np 1 -c 131072 --temp 0.7 --top-k 20 -ngl 99 --port 8080 --host 0.0.0.0 -fa 1 -ub 256 -fit off

CHIA AN YANG

@AGI 感謝大老~驗證成功了我更新文章了

nami ryuu

@chia-an-yang @agi Qwen3.6-27B-Uncensored-HauhauCS-Balanced-MTP-Q5_K_P.gguf ,请问你们这个模型是在哪下载的，现在hauhuacs的huggingface的repo里面已经没有这个模型了。google也搜不到。

CHIA AN YANG

@nami-ryuu 通常都是讓ai agent代勞了,,比較快

nami ryuu

@agi 您好！我也是用7900xtx显卡，使用
/usr/local/bin/llama-server
-m ./models/Qwen3.6-27B-Uncensored-HauhauCS-Balanced-MTP-Q5_K_P.gguf
--mmproj ./models/mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-f16.gguf
-c 131072
--parallel 1
-b 2048
-ub 512
-fa 1
-ngl 99
-t 16
--spec-type draft-mtp
--cache-type-k q5_0
--cache-type-v q4_1
--no-mmap
--temp 0.4
--spec-draft-n-max 3
--top-p 0.95
--top-k 20
--host 0.0.0.0
--port 8080
--tools all

启动llama.cpp, 但是遇到oom的错误如下：
/usr/local/bin/llama-server -m ./models/Qwen3.6-27B-Uncensored-HauhauCS-Balanced-MTP-Q5_K_P.gguf --mmproj ./models/mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-f16.gguf -c 131072 --parallel 1 -b 2048 -ub 512 -fa 1 -ngl 99 -t 16 --spec-type draft-mtp --cache-type-k q5_0 --cache-type-v q4_1 --no-mmap --temp 0.4 --spec-draft-n-max 3 --top-p 0.95 --top-k 20 --host 0.0.0.0 --port 8080 --tools all
0.00.014.095 I log_info: verbosity = 3 (adjust with the -lv N CLI arg)
0.00.014.097 I device_info:
0.00.014.112 I - ROCm0 : Radeon RX 7900 XTX (24560 MiB, 24524 MiB free)
0.00.014.154 I - ROCm1 : AMD Radeon Graphics (47068 MiB, 89322 MiB free)
0.00.014.156 I - CPU : AMD Ryzen 7 9700X 8-Core Processor (94137 MiB, 94137 MiB free)
0.00.014.207 I system_info: n_threads = 16 (n_threads_batch = 16) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.014.234 I srv init: running without SSL
0.00.014.273 I srv init: using 15 threads for HTTP server
0.00.014.473 W srv llama_server: -----------------
0.00.014.474 W srv llama_server: Built-in tools are enabled, do not expose server to untrusted environments
0.00.014.474 W srv llama_server: This feature is EXPERIMENTAL and may be changed in the future
0.00.014.474 W srv llama_server: -----------------
0.00.014.481 I srv start: binding port with default address family
0.00.015.619 I srv llama_server: loading model
0.00.015.661 I srv load_model: loading model './models/Qwen3.6-27B-Uncensored-HauhauCS-Balanced-MTP-Q5_K_P.gguf'
0.00.052.136 I srv load_model: [mtmd] estimated worst-case memory usage of mmproj is 1157.64 MiB (took 36.45 ms)
0.00.295.983 I srv load_model: [spec] estimated memory usage of MTP context is 708.02 MiB
0.00.296.004 I common_init_result: fitting params to device memory ...
0.00.296.004 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.517.578 W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 99, abort
0.01.810.285 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.01.838.385 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.01.916.196 I srv load_model: creating MTP draft context against the target model './models/Qwen3.6-27B-Uncensored-HauhauCS-Balanced-MTP-Q5_K_P.gguf'
0.01.916.222 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.01.932.754 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.01.932.756 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.01.932.756 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

0.01.933.558 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 884.62 MiB on device 0: cudaMalloc failed: out of memory
0.01.933.561 E alloc_tensor_range: failed to allocate ROCm0 buffer of size 927588992
/home/liubo/llama.cpp/ggml/src/ggml-backend.cpp:179: GGML_ASSERT(buffer) failed
[New LWP 459888]
[New LWP 459887]
[New LWP 459886]
[New LWP 459885]
[New LWP 459884]
[New LWP 459883]
[New LWP 459882]
[New LWP 459881]
[New LWP 459880]
[New LWP 459879]
[New LWP 459878]
[New LWP 459877]
[New LWP 459876]
[New LWP 459875]
[New LWP 459874]
[New LWP 459717]
[New LWP 459716]
[New LWP 459715]
[New LWP 459714]
[New LWP 459713]
[New LWP 459712]
[New LWP 459711]
[New LWP 459710]
[New LWP 459709]
[New LWP 459708]
[New LWP 459707]
[New LWP 459706]
[New LWP 459705]
[New LWP 459704]
[New LWP 459703]
[New LWP 459702]
[New LWP 459700]
[New LWP 459699]
[New LWP 459696]

This GDB supports auto-downloading debuginfo from the following URLs:
https://debuginfod.ubuntu.com
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000762b61110813 in __GI___wait4 (pid=459889, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x0000762b61110813 in __GI___wait4 (pid=459889, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000762b6134e663 in ggml_print_backtrace () from /home/liubo/llama.cpp/build/bin/libggml-base.so.0
#2 0x0000762b6134e80b in ggml_abort () from /home/liubo/llama.cpp/build/bin/libggml-base.so.0
#3 0x0000762b61367611 in ggml_backend_buffer_set_usage () from /home/liubo/llama.cpp/build/bin/libggml-base.so.0
#4 0x0000762b617a75e8 in clip_model_loader::load_tensors(clip_ctx&) () from /home/liubo/llama.cpp/build/bin/libmtmd.so.0
#5 0x0000762b61795dcd in clip_init(char const*, clip_context_params) () from /home/liubo/llama.cpp/build/bin/libmtmd.so.0
#6 0x0000762b6170987c in mtmd_context::mtmd_context(char const*, llama_model const*, mtmd_context_params const&, bool) () from /home/liubo/llama.cpp/build/bin/libmtmd.so.0
#7 0x0000762b61703211 in mtmd_init_from_file () from /home/liubo/llama.cpp/build/bin/libmtmd.so.0
#8 0x0000762b619aab79 in server_context_impl::load_model(common_params&) () from /home/liubo/llama.cpp/build/bin/libllama-server-impl.so
#9 0x0000762b618e4a48 in llama_server(int, char**) () from /home/liubo/llama.cpp/build/bin/libllama-server-impl.so
#10 0x0000762b6102a1ca in __libc_start_call_main (main=main@entry=0x5e6c5fa22270 <main>, argc=argc@entry=40, argv=argv@entry=0x7fffc3eb01c8) at ../sysdeps/nptl/libc_start_call_main.h:58
warning: 58 ../sysdeps/nptl/libc_start_call_main.h: No such file or directory
#11 0x0000762b6102a28b in __libc_start_main_impl (main=0x5e6c5fa22270 <main>, argc=40, argv=0x7fffc3eb01c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffc3eb01b8) at ../csu/libc-start.c:360
warning: 360 ../csu/libc-start.c: No such file or directory
#12 0x00005e6c5fa222a5 in _start ()
[Inferior 1 (process 459658) detached]
Aborted (core dumped)

请问是我哪步弄错了吗？我问了gemini,它让我减少上下文，q4我可运行，占用21.5g，我加上q4和q5模型的权重差，我大概差1g的内存。我们几乎是一样的环境。感谢！!

Screenshot from 2026-06-24 20-56-25.png

AGI

@nami-ryuu https://huggingface.co/crotron/Qwen3.6-27B-Uncensored-HauhauCS-Balanced-MTP/tree/main

AGI · Screenshot from 2026-06-24 20-56-25.png

@nami-ryuu 我用的是vulkan，你用的是rocm吧？

CHIA AN YANG · Screenshot from 2026-06-24 20-56-25.png

@nami-ryuu 建議vulkan順很多

#!/bin/bash

先鎖 GPU 時脈（需 sudo）

sudo rocm-smi --device 0 --setperflevel manual
sudo bash -c "echo '2' > /sys/class/drm/card2/device/pp_dpm_sclk"
sudo bash -c "echo '3' > /sys/class/drm/card2/device/pp_dpm_mclk"

export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json

SERVER=/path/to/llama.cpp/build-vulkan/bin/llama-server
MODEL=/path/to/Qwopus3.6-27B-v2-MTP-IQ4_XS.gguf

"$SERVER"
--host 0.0.0.0 --port 8080
--device Vulkan0 \ # 指定 GPU0
-m "$MODEL"
--alias "unsloth/Qwen3.6-27B-GGUF"
--spec-type draft-mtp \ # 開啟 MTP 推測解碼
--spec-draft-n-max 3 \ # 一次預測 3 個草稿 token
-ngl 99 \ # 全部層放 GPU
--ctx-size 65536 \ # 65K context
-n 8192
-b 2048 -ub 512 -np 1
--cache-type-k q8_0 \ # q8_0 KV cache（比 q4_0 接受率高 10-15%）
--cache-type-v q8_0
--no-mmap --mlock
--flash-attn on
--jinja --no-warmup --reasoning off

注意：不在 server 層設 sampling 參數（top-k/presence-penalty 會降低 MTP 接受率）

python96998

这个论坛的界面太丑了吧

terry

@python96998 你可以在随便聊聊板块专门发帖，说出你对论坛UI的感受，可以说出哪里丑，这是你作为访客的权利，也可以提出改进建议。

这是个技术话题的帖子，你在这里如此回帖，是缺乏教养的表现。你不是宇宙的中心，这个论坛不是你的许愿池，如此缺乏教养就会被我扇耳光，被骂然后被禁言。煞笔东西。

nami ryuu

@chia-an-yang @agi 谢谢两位老师，我先换成vulkan试一试。

AGI

@terry 你的性格也太直了，看Ytb能感觉出来，但是既然大家都来了，感觉可以允许有不同意见，这个ui丑，又不是你的过错，解释清楚就可以了。

terry

@AGI 这个头不能开，说的很清楚，这个UI丑，可以专门发帖提出来。
论坛有专门的板块，可以随便聊聊，也可以到公告区回复帖子。
这种人发帖的动机是什么？他发帖的时候，从未考虑过是否影响公共秩序，对这种人，我的原则就是绝不姑息。

另外，虽然不太重要，我认为这个UI即便算不上好看，它怎么也算不上丑。当然了，说它丑是自由，但这不是这人被我骂的原因。

nami ryuu

@agi @chia-an-yang 两位老师我跑通了，但是我用hermes的时候工具调用感觉卡了额，我的7900xtx在疯狂的生成，但是hermes却卡住了。请问两位遇到过类似的问题吗？
llama.cpp 输出：

65.27 t/s, tg_3s = 55.86 t/s
36.30.567.224 I slot print_timing: id 0 | task 8648 | n_decoded = 62072, tg = 65.24 t/s, tg_3s = 55.84 t/s
36.33.579.807 I slot print_timing: id 0 | task 8648 | n_decoded = 62240, tg = 65.21 t/s, tg_3s = 55.77 t/s
36.36.592.579 I slot print_timing: id 0 | task 8648 | n_decoded = 62408, tg = 65.18 t/s, tg_3s = 55.76 t/s
36.39.607.362 I slot print_timing: id 0 | task 8648 | n_decoded = 62576, tg = 65.15 t/s, tg_3s = 55.73 t/s
36.42.629.501 I slot print_timing: id 0 | task 8648 | n_decoded = 62744, tg = 65.12 t/s, tg_3s = 55.59 t/s
36.45.651.508 I slot print_timing: id 0 | task 8648 | n_decoded = 62912, tg = 65.09 t/s, tg_3s = 55.59 t/s
36.48.669.380 I slot print_timing: id 0 | task 8648 | n_decoded = 63080, tg = 65.06 t/s, tg_3s = 55.67 t/s
36.51.697.721 I slot print_timing: id 0 | task 8648 | n_decoded = 63247, tg = 65.03 t/s, tg_3s = 55.15 t/s
36.54.730.154 I slot print_timing: id 0 | task 8648 | n_decoded = 63415, tg = 65.00 t/s, tg_3s = 55.40 t/s
36.57.762.852 I slot print_timing: id 0 | task 8648 | n_decoded = 63583, tg = 64.97 t/s, tg_3s = 55.40 t/s
37.00.794.845 I slot print_timing: id 0 | task 8648 | n_decoded = 63751, tg = 64.94 t/s, tg_3s = 55.41 t/s

hermes输出：

c09f0fd3-2890-42e1-838f-8e36a2ab527b-bd93db497055bc01fe89b39dc4f1a308915fe680.rtfd

preparing browser_navigate...
navigate
search.yahoo.com
14.2s

Hermes
Let me try a more targeted search.
A
preparing browser_navigate... navigate www.google.com
3.35
Response truncated (finish_reason='length')
preparing browser_navigate...
navigate duckduckgo.com 20.5s preparing browser_scroll...
↓
scroll
down 0.2s
LOI
preparing browser_snapshot...
snapshot compact 0.2s preparing browser_navigate... navigate duckduckgo.com 1.5s
(>** cogitating...
model hit max output toke
qwen3.6-27b 30,9K/131.1K [
1］24% ｜36m ｜020

抡锤者

看目前這社區越來越多人買7900XTX了，大家為了一個爽度token無限發與反應速度，這幾天折騰的過程分享給大家(win11+vulkan & ubuntu +rocm)

7900 XTX × 2 跑 Qwen3.6 27B 本地 LLM 測試報告

名詞解釋（新手看這裡）

背景與問題

測試過程

階段一：找出 ROCm 慢的原因

階段二：MTP 在 Linux 上的問題

階段三：beellama + TurboQuant（turbo4 KV cache）

階段四：Context 大小與速度曲線

階段五：KV cache 精度（q4_0 vs q8_0）對 MTP 的影響

階段六：AMD GPU 時脈陷阱

階段七：server 層 sampling 參數會殺死 MTP

最終最佳配置（2026-06-22 更新）

速度總結

關鍵結論

已知限制

相關連結

7900 XTX × 2 跑 Qwen3.6 27B 本地 LLM 測試報告

名詞解釋（新手看這裡）

背景與問題

測試過程

階段一：找出 ROCm 慢的原因

階段二：MTP 在 Linux 上的問題

階段三：beellama + TurboQuant（turbo4 KV cache）

階段四：Context 大小與速度曲線

階段五：KV cache 精度（q4_0 vs q8_0）對 MTP 的影響

階段六：AMD GPU 時脈陷阱

階段七：server 層 sampling 參數會殺死 MTP

最終最佳配置（2026-06-22 更新）

速度總結

關鍵結論

已知限制

相關連結

先鎖 GPU 時脈（需 sudo）

注意：不在 server 層設 sampling 參數（top-k/presence-penalty 會降低 MTP 接受率）