感谢,抄了作业,重新编译一下从原来~30TPS 提升到~40TPS,后面对coding微调了一下基本上确定大概eGPU +7900xtx能编程能测试了,等装上x99-cd3会来更新一下
# 7900 XTX (TB3 eGPU) + Qwen3.6-27B llama.cpp MTP — Bench Summary
Hardware: AMD 7900 XTX via Razer Core X Chroma (TB3) + Beelink SER7
Tool: llama-benchy (Sherlock Holmes prompts, pp=512 tg=128 depth=[0, 4096])
| # | Config | tg mean | tg peak | tg @ d4096 | pp512 | Accept |
|---|-------------------------------------------------|--------:|--------:|-----------:|------:|-------:|
| 1 | Baseline (mainline, no MTP, temp=0.2) | 30.26 | 31.5 | 29.79 | 459 | n/a |
| 2 | + MTP enabled (old PR build 9117) | 35.54 | 41.0 | 29.45 | 310 | 97% |
| 3 | + Rebuilt PR to latest (9173, GDN rollback fix) | 37.25 | 45.5 | 34.70 | 353 | 57% |
| 4 | + GPU power_dpm forced to `high` | 45.00 | 54.8 | 37.94 | 351 | 57% |
| 5 | + Qwen "precise coding" sampling (current) | 37.32 | 46.8 | 31.75 | 368 | 54% |
Cumulative gain vs original baseline: **+23% TG mean, +49% TG peak**
(Step 4 alone is +49% / +74%; step 5 trades 16% speed for output quality)
## Variant comparisons (PR 9173 + perf=high)
| Variant | tg mean | tg peak | tg @ d4096 | Accept | Verdict |
|--------------------------------------------|--------:|--------:|-----------:|-------:|------------------|
| froggeric Q4_K_M MTP (default) | 45.00 | 54.8 | 37.94 | 67% | ✅ Best mean |
| unsloth Q4_K_M MTP | 36.13 | 44.0 | 34.68 | 49% | ❌ -19% TG |
| unsloth UD-Q4_K_XL MTP | 43.65 | 53.0 | 33.01 | 60% | ≈ Tied, worse @d |
| Chain: `ngram-mod,draft-mtp` (unsloth tip) | — | — | — | — | 🔴 CRASH (SSM) |
## Sampling A/B (froggeric MTP, n=2, perf=high)
| Preset | temp / top_p / top_k / pp | tg mean | Accept@0 | Note |
|-------------------------|---------------------------|--------:|---------:|---------------|
| Fast (temp=0.2) | 0.2 / — / 20 / — | 45.00 | 67% | Fastest, repetitive |
| Precise coding (active) | 0.6 / 0.95 / 20 / 0.0 | 37.32 | 54% | ★ Current default |
| Non-thinking general | 0.7 / 0.8 / 20 / 1.5 | 36.26 | 57% | Best @ long ctx |
| Thinking general | 1.0 / 0.95 / 20 / 1.5 | 37.68 | 59% | Avoid (no MTP gain) |
## Other paths evaluated and rejected
| Option | Result on 7900 XTX |
|------------------------------|----------------------------------------|
| vLLM (ROCm) | ❌ -10–20%, no Qwen3.6 MTP, 4–8h install |
| TurboQuant (Vulkan port) | ❌ Broken — 10 t/s, GPU util <30% |
| DFlash / Hipfire | ❌ Crashes >4k context, no MTP |
| MLC-LLM (Vulkan) | ⚠️ ~10 t/s slower, no MTP |
## Hardware ceiling vs realistic upgrades
| Setup | Expected tg mean |
|--------------------------------------------------|-----------------:|
| Current (TB3 eGPU, all sw optimizations) | 37–45 |
| OCuLink mod to Core X Chroma (~$80, 3h) | 52–55 |
| Move GPU to X99 desktop (PCIe 3.0 x16) | 58–62 |
| Modern AM5 + PCIe 4.0 x16 (blog reference) | 67 |
**Current `start_server start`:** llama.cpp PR 9173 + froggeric MTP Q4_K_M + `--spec-type draft-mtp --spec-draft-n-max 2` + KV q4_0 + FA on + Qwen precise coding sampling + GPU perf=high.

