<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Lucebox DFlash + PFlash on 7900XTX, Qwen3.6-27B, ~2.8–3.1x speedup: benchmark data]]></title><description><![CDATA[<p dir="auto">I have been following Lucebox for a few weeks, and its AMD HIP backend has finally gained DFlash support.<br />
<a href="https://github.com/Luce-Org/lucebox-hub" rel="nofollow ugc">https://github.com/Luce-Org/lucebox-hub</a><br />
Original post: <a href="https://www.lucebox.com/blog/amd" rel="nofollow ugc">https://www.lucebox.com/blog/amd</a></p>
<p dir="auto">I then had opencode run it locally on a single 7900 XTX, covering q8, q4, and tq3 quantization and contexts up to 256K. The results:</p>
<h1>Lucebox DFlash + PFlash Reproduction Report (RX 7900 XTX)</h1>
<h2>Hardware</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Item</th>
<th>Spec</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU</td>
<td>AMD Radeon RX 7900 XTX (Navi 31, gfx1100)</td>
</tr>
<tr>
<td>VRAM</td>
<td>24 GiB GDDR6 (~960 GB/s)</td>
</tr>
<tr>
<td>System RAM</td>
<td>62 GiB DDR5</td>
</tr>
<tr>
<td>ROCm</td>
<td>7.1</td>
</tr>
<tr>
<td>OS</td>
<td>Ubuntu 26.04, Linux 7.0.0-14-generic</td>
</tr>
</tbody>
</table>
<h2>Benchmark Results</h2>
<p dir="auto"><strong>Model</strong>: Qwen3.6-27B Q4_K_M (15.65 GiB)</p>
<p dir="auto"><strong>Draft model comparison</strong>: two DFlash draft models were tested:</p>
<ul>
<li><strong>Lucebox Q8_0</strong> (1.84 GiB, official): <code>dflash-draft-3.6-q8_0.gguf</code></li>
<li><strong>spiritbuun Q4_K_M</strong> (1.03 GiB, community): <code>dflash-draft-3.6-q4_k_m.gguf</code></li>
</ul>
<p dir="auto">Conclusion: the Q4 draft <strong>brought no performance gain</strong>. On the 7900 XTX:</p>
<ul>
<li>The draft compute phase is actually slower with Q4 (31.43 ms vs 27.46 ms), likely from Q4 dequantization overhead and possibly less-optimized kernels</li>
<li>The verification phase (target forward) is identical (~68.8 ms)</li>
<li>The Q4 draft's acceptance length drops slightly (AL 4.69 vs 4.93)</li>
</ul>
<p dir="auto">Unless a table says otherwise, the remaining tests use the official Lucebox Q8_0 draft.</p>
<h3>Short-Prompt Decode (HumanEval, ~125 tok prompt, n_gen=128)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Configuration</th>
<th>KV cache</th>
<th>Mean tok/s</th>
<th>Mean AL</th>
<th>Speedup (vs llama.cpp HIP)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>llama.cpp HIP AR</strong></td>
<td>—</td>
<td><strong>28.07</strong></td>
<td>—</td>
<td><strong>1.00x</strong></td>
</tr>
<tr>
<td><strong>DFlash (standard chain speculation)</strong></td>
<td>Q8_0</td>
<td><strong>64.23</strong></td>
<td>5.36</td>
<td><strong>2.29x</strong></td>
</tr>
<tr>
<td><strong>DFlash DDTree budget=8</strong></td>
<td>Q8_0</td>
<td><strong>62.75</strong></td>
<td>4.93</td>
<td><strong>2.24x</strong></td>
</tr>
<tr>
<td><strong>DFlash DDTree budget=8</strong></td>
<td><strong>tq3_0</strong></td>
<td><strong>68.64</strong></td>
<td>5.57</td>
<td><strong>2.44x</strong></td>
</tr>
<tr>
<td>DFlash DDTree budget=22</td>
<td>Q8_0</td>
<td>60.94</td>
<td>6.11</td>
<td>2.17x</td>
</tr>
</tbody>
</table>
<h3>Multi-Context Sweep (PFlash Prefill + DFlash Decode)</h3>
<p dir="auto">At each context length, PFlash prefill (rocWMMA Phase 2) plus DFlash DDTree (budget=8) decode is compared against llama.cpp HIP AR.</p>
<p dir="auto">The KV cache is quantized to save VRAM: Q8_0 for 4K–64K, Q4_0 from 128K up.</p>
<h4>Prefill Throughput (tok/s)</h4>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Context</th>
<th>llama.cpp HIP (AR)</th>
<th>Lucebox (PFlash)</th>
<th>Speedup</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>128</td>
<td>619.51</td>
<td>312.5</td>
<td>0.50x</td>
<td>PFlash overhead dominates at short context</td>
</tr>
<tr>
<td>4K</td>
<td>734.57</td>
<td>726 (Q8 KV)</td>
<td>0.99x</td>
<td>On par</td>
</tr>
<tr>
<td>16K</td>
<td>649.08</td>
<td>735 (Q8 KV)</td>
<td><strong>1.13x</strong></td>
<td></td>
</tr>
<tr>
<td>64K</td>
<td>—¹</td>
<td>733 (Q8 KV)</td>
<td>—</td>
<td>¹ llama.cpp OOMs at context creation</td>
</tr>
<tr>
<td>128K</td>
<td>—¹</td>
<td>730 (Q4 KV)</td>
<td>—</td>
<td></td>
</tr>
<tr>
<td>192K</td>
<td>—¹</td>
<td><strong>730</strong> (Q4 KV)</td>
<td>—</td>
<td></td>
</tr>
<tr>
<td>256K</td>
<td>—¹</td>
<td><strong>730</strong> (Q4 KV + Q4 draft / tq3_0 KV + Q8 draft / tq3_0 KV + Q4 draft)</td>
<td>—</td>
<td></td>
</tr>
</tbody>
</table>
<blockquote>
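<p dir="auto">¹ AR prefill cannot run at 64K and above because llama.cpp allocates the full KV cache at context creation (O(n²) attention plus full KV storage): roughly 8 GiB of Q4 KV at 64K on top of the ~15 GiB model, with compute buffers added, no longer fits in 24 GiB. This is PFlash's core advantage: <strong>compressed prefill reduces a long prompt to a fixed size, taking prefill complexity from O(n²) down to O(n)</strong>.</p>
<p dir="auto">At 256K the workable combinations are Q4 draft + Q4 KV cache, Q8 draft + tq3_0 KV cache, or Q4 draft + tq3_0 KV cache.</p>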
</blockquote>
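<p dir="auto">To put the prefill throughput in wall-clock terms, here is a minimal Python sketch that converts the table's tok/s into seconds of prefill. It assumes throughput stays constant across the whole prompt and that labels such as 4K mean 4096 tokens (as in the prompt generator further down this post); <code>prefill_seconds</code> is a name made up for this example.</p>
<pre><code class="language-python"># Prefill throughput (tok/s) copied from the table above.
# Context sizes assume 4K = 4096 tokens, 16K = 16384 tokens, and so on.
PREFILL_TOK_S = {
    "llama.cpp HIP": {128: 619.51, 4096: 734.57, 16384: 649.08},
    "Lucebox PFlash": {
        128: 312.5, 4096: 726.0, 16384: 735.0, 65536: 733.0,
        131072: 730.0, 196608: 730.0, 262144: 730.0,
    },
}


def prefill_seconds(n_tokens, tok_per_s):
    """Wall-clock prefill time, assuming constant throughput over the prompt."""
    return n_tokens / tok_per_s


for backend, rows in PREFILL_TOK_S.items():
    for n_tokens, rate in rows.items():
        seconds = prefill_seconds(n_tokens, rate)
        print(f"{backend:16s} {n_tokens:7d} tok  {seconds:7.1f} s")
</code></pre>
<p dir="auto">Under those assumptions, a 256K prompt at 730 tok/s takes about six minutes to prefill.</p>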
<h4>Decode Throughput (tok/s)</h4>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Context</th>
<th>llama.cpp HIP (AR)</th>
<th>DFlash (Q8 Draft)</th>
<th>DFlash (tq3_0 KV, Q8 Draft)</th>
<th>DFlash (Q4 Draft)</th>
<th>DFlash (tq3_0 KV, Q4 Draft)</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>128</td>
<td>28.07</td>
<td>62.75</td>
<td><strong>68.64</strong></td>
<td>59.01</td>
<td>—</td>
<td>2.44x</td>
</tr>
<tr>
<td>4K</td>
<td>27.77</td>
<td><strong>86.37</strong></td>
<td>79.27</td>
<td>82.72</td>
<td>78.75</td>
<td><strong>3.11x</strong></td>
</tr>
<tr>
<td>16K</td>
<td>27.79</td>
<td>76.87</td>
<td>72.43</td>
<td>—</td>
<td>78.57</td>
<td>2.77x</td>
</tr>
<tr>
<td>64K</td>
<td>27.78¹</td>
<td>78.33</td>
<td>71.96</td>
<td>—</td>
<td>75.72</td>
<td>2.82x</td>
</tr>
<tr>
<td>128K</td>
<td>27.78¹</td>
<td>78.82 (Q4 KV)</td>
<td>71.20</td>
<td>—</td>
<td>75.27</td>
<td>2.84x</td>
</tr>
<tr>
<td>192K</td>
<td>27.78¹</td>
<td>73.77</td>
<td>71.56</td>
<td><strong>81.09</strong> (Q4 KV)</td>
<td>76.54</td>
<td>2.92x</td>
</tr>
<tr>
<td>256K</td>
<td>27.78¹</td>
<td>OOM</td>
<td><strong>72.26</strong></td>
<td><strong>81.01</strong> (Q4 KV + Q4 draft)</td>
<td>75.70</td>
<td>2.92x</td>
</tr>
</tbody>
</table>
<blockquote>
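<p dir="auto">AR decode speed does not depend on context length here (KV cache size does not affect the single-token forward), so the rows marked ¹, where AR could not be run, reuse one baseline: 27.78 tok/s, the mean of the 4K and 16K measurements. DFlash decode holds a steady ~2.8–3.1x speedup from 4K to 256K.<br />
At 256K, a Q4 KV cache only fits in 24 GiB of VRAM together with the Q4 draft; with tq3_0 KV (3.5 bpv), either the Q8 or the Q4 draft fits.</p>
</blockquote>
<p dir="auto">PFlash prefill is on par with AR at short context (4K) and pulls ahead as the context grows (1.13x at 16K). DFlash decode keeps a ~2.8–3.1x speedup from 4K to 256K.</p>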
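<p dir="auto">The speedup column of the decode table can be reproduced by dividing a DFlash figure by the AR baseline in the same row. The sketch below does this for the rows from 4K up. Which DFlash column feeds each row is inferred from the numbers (it is the value that reproduces the printed speedup), not something the post states, and <code>speedup</code> is a helper name chosen for this example.</p>
<pre><code class="language-python"># (context, AR tok/s, DFlash tok/s that the row's speedup appears to be based on)
ROWS = [
    ("4K", 27.77, 86.37),
    ("16K", 27.79, 76.87),
    ("64K", 27.78, 78.33),
    ("128K", 27.78, 78.82),
    ("192K", 27.78, 81.09),
    ("256K", 27.78, 81.01),
]


def speedup(dflash_tok_s, ar_tok_s):
    """Decode speedup of DFlash over autoregressive llama.cpp."""
    return dflash_tok_s / ar_tok_s


for label, ar, dflash in ROWS:
    print(f"{label:5s} {speedup(dflash, ar):.2f}x")
</code></pre>
<p dir="auto">Note that the 16K row uses the Q8-draft figure (76.87); taking the fastest 16K measurement instead (78.57 with tq3_0 KV and the Q4 draft) would give about 2.83x.</p>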
<h3>Key Findings</h3>
<ol>
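<li><strong>Among DDTree budgets, budget=8 is best on the 7900 XTX</strong> (62.75 tok/s), in line with the blog. With GDDR6's high bandwidth a small tree works better and avoids tile waste; Strix Halo with LPDDR5X needs budget=22 to amortize launch overhead.</li>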
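<li><strong>The 2.24x speedup</strong> closely matches the blog's 2.23x on Strix Halo. In absolute terms the 7900 XTX reaches 62.75 tok/s against 26.85 tok/s, which tracks its memory bandwidth advantage (roughly 3.7x over Strix Halo's ~256 GB/s LPDDR5X).</li>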
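<li><strong>Standard chain speculation (no DDTree) is slightly faster</strong> (64.23 tok/s), suggesting that the simpler strategy has lower overhead for short generations (128 tokens).</li>
<li><strong>PFlash prefill</strong> is on par with AR at short context (4K: 726 vs 735 tok/s) and pulls ahead at longer context (16K: 735 vs 649 tok/s, 1.13x).</li>
<li><strong>DFlash decode</strong> holds a steady 2.8–3.1x speedup from 4K to 256K, with no clear dependence on context length.</li>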
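<li><strong>tq3_0 KV (3.5 bpv)</strong>: 68.64 tok/s (2.44x) on short prompts, and 256K works with the Q8 draft. Q4 draft + tq3_0 KV is slightly faster than Q8 draft + tq3_0 KV from 16K to 192K (~75–79 vs ~71–72 tok/s; at 4K the two are level at ~79), but carries an OOM risk at 256K. Overall, tq3_0 offers the best compression/speed balance.</li>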
</ol>
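<p dir="auto">One way to read findings 1 and 3 is through the cost of a single speculative step. If AL is the number of tokens committed per step (the logs label the same quantity commit/step), then the time per step is roughly AL divided by tok/s. The sketch below applies that to the mean rows of the short-prompt table; it uses means of per-prompt values, so treat the results as approximate. <code>step_ms</code> is a name invented for this example.</p>
<pre><code class="language-python"># (configuration, mean tok/s, mean AL) from the short-prompt table, Q8_0 KV rows.
CONFIGS = [
    ("chain", 64.23, 5.36),
    ("ddtree budget=8", 62.75, 4.93),
    ("ddtree budget=22", 60.94, 6.11),
]


def step_ms(tok_per_s, accept_len):
    """Approximate milliseconds per draft-plus-verify step."""
    return 1000.0 * accept_len / tok_per_s


for name, tok_s, al in CONFIGS:
    print(f"{name:18s} AL={al:.2f}  step={step_ms(tok_s, al):6.1f} ms")
</code></pre>
<p dir="auto">On these numbers budget=22 commits more tokens per step, but each step costs roughly 100 ms against roughly 79 ms for budget=8, which is consistent with the larger tree ending up slower overall on this GPU.</p>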
<hr />
<h2>Full Reproduction Steps</h2>
<h3>1. Clone the Repository (PR #119 is merged, so no separate branch checkout is needed)</h3>
<pre><code class="language-bash">git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git submodule update --init --recursive
</code></pre>
<h3>2. Install the rocWMMA Headers (optional but recommended; enables Phase 2 FlashPrefill)</h3>
<p dir="auto">If you lack sudo rights to install the <code>rocwmma</code> package, you can pull the headers straight from GitHub:</p>
<pre><code class="language-bash">git clone --depth 1 https://github.com/ROCm/rocWMMA.git /tmp/rocwmma
mkdir -p /tmp/rocm_include/include
cp -r /tmp/rocwmma/library/include/rocwmma /tmp/rocm_include/include/rocwmma
</code></pre>
<h3>3. Build (gfx1100 / 7900 XTX)</h3>
<pre><code class="language-bash">cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1100 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON \
  -DROCM_PATH=/tmp/rocm_include    # path prepared in step 2; not needed if rocwmma is already installed

cmake --build build --target test_dflash -j$(nproc)
</code></pre>
<blockquote>
<p dir="auto">For other GPU architectures, replace <code>gfx1100</code> with the matching value, e.g. gfx1151 (Strix Halo) or gfx1030 (Navi 21).<br />
If you are not using rocWMMA, set <code>-DDFLASH27B_HIP_SM80_EQUIV=OFF</code> to use the q8 fallback.</p>
</blockquote>
<h3>4. Download the Models</h3>
<pre><code class="language-bash">mkdir -p models/draft
wget -c -O models/Qwen3.6-27B-Q4_K_M.gguf \
  "https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf"
wget -c -O models/draft/dflash-draft-3.6-q8_0.gguf \
  "https://huggingface.co/Lucebox/Qwen3.6-27B-DFlash-GGUF/resolve/main/dflash-draft-3.6-q8_0.gguf"
</code></pre>
<h3>5. Install Python Dependencies (for the bench scripts)</h3>
<pre><code class="language-bash">pip3 install --break-system-packages transformers torch
</code></pre>
<h3>6. Run the Benchmark</h3>
<pre><code class="language-bash"># DFlash DDTree budget=8 (recommended for gfx1100)
cd dflash   # skip this if you are still inside dflash/ from step 3
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
  python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 8
</code></pre>
<h4>Environment Variables</h4>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Variable</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>DFLASH_BIN</code></td>
<td>Path to the test_dflash binary</td>
</tr>
<tr>
<td><code>DFLASH_TARGET</code></td>
<td>Path to the target model GGUF</td>
</tr>
<tr>
<td><code>DFLASH_DRAFT</code></td>
<td>Path to the draft model GGUF</td>
</tr>
<tr>
<td><code>DFLASH27B_DRAFT_SWA</code></td>
<td>Sliding window attention window of the Qwen3.6 draft (2048)</td>
</tr>
<tr>
<td><code>DFLASH27B_PREFILL_UBATCH</code></td>
<td>Micro-batch size for compressed prefill (512, matching PR #159)</td>
</tr>
</tbody>
</table>
<h4>Common bench_he.py Options</h4>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Option</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>--n-gen N</code></td>
<td>Tokens generated per prompt (default 128)</td>
</tr>
<tr>
<td><code>--ddtree-budget N</code></td>
<td>DDTree node budget (8/22/32/48/64/96/128)</td>
</tr>
<tr>
<td><code>--ddtree-temp T</code></td>
<td>Draft logits temperature (T&lt;1 widens the top-1/top-2 gap)</td>
</tr>
<tr>
<td><code>--max-ctx N</code></td>
<td>Maximum context length</td>
</tr>
<tr>
<td><code>--target-tokenizer REPO</code></td>
<td>Target model tokenizer (default Qwen/Qwen3.5-27B)</td>
</tr>
<tr>
<td><code>--target-split-dflash</code></td>
<td>Use target layer-split mode (reports prefill time)</td>
</tr>
<tr>
<td><code>--skip-tokenize</code></td>
<td>Skip the tokenize step (reuse the cache)</td>
</tr>
</tbody>
</table>
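<p dir="auto">For running several configurations in a row, the environment variables and options above can be assembled in Python rather than retyped. The following is a sketch only: it uses just the variables and flags listed in this post, the helper names <code>bench_env</code> and <code>bench_argv</code> are invented here, and it does not set <code>LD_LIBRARY_PATH</code>, which you may still need for ROCm. Leaving out <code>--ddtree-budget</code> corresponds to the plain chain runs shown in the detailed logs.</p>
<pre><code class="language-python">import os


def bench_env(dflash_dir, draft="dflash-draft-3.6-q8_0.gguf"):
    """Environment for scripts/bench_he.py, mirroring the variables in the table."""
    env = dict(os.environ)
    env.update({
        "DFLASH_BIN": os.path.join(dflash_dir, "build", "test_dflash"),
        "DFLASH_TARGET": os.path.join(dflash_dir, "models", "Qwen3.6-27B-Q4_K_M.gguf"),
        "DFLASH_DRAFT": os.path.join(dflash_dir, "models", "draft", draft),
        "DFLASH27B_DRAFT_SWA": "2048",
        "DFLASH27B_PREFILL_UBATCH": "512",
    })
    return env


def bench_argv(n_gen=128, ddtree_budget=None):
    """Command line for one run; no budget means plain chain speculation."""
    argv = ["python3", "scripts/bench_he.py", "--n-gen", str(n_gen)]
    if ddtree_budget is not None:
        argv += ["--ddtree-budget", str(ddtree_budget)]
    return argv


if __name__ == "__main__":
    for budget in (None, 8, 22):
        print(" ".join(bench_argv(ddtree_budget=budget)))
        # To execute for real from inside dflash/, import subprocess and call:
        # subprocess.run(bench_argv(ddtree_budget=budget),
        #                env=bench_env(os.getcwd()), check=True)
</code></pre>
<p dir="auto">Printing the commands first makes it easy to check them against step 6 before spending GPU time on a sweep.</p>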
<h3>7. Multi-Context Stress Test</h3>
<p dir="auto">Measure PFlash prefill + DFlash decode at different context lengths:</p>
<pre><code class="language-bash"># Generate prompt files of different lengths
python3 -c "
import struct
from pathlib import Path
d = Path('/tmp/dflash_sweep')
d.mkdir(parents=True, exist_ok=True)

# Use a real HumanEval prompt as the seed (a tokenized prompt file, assumed to be left behind by the step 6 bench run)
he = Path('/tmp/dflash_bench/he_prompt_Qwen_Qwen3.5-27B_00.bin').read_bytes()
tokens = [struct.unpack('&lt;i', he[i*4:(i+1)*4])[0] for i in range(len(he)//4)]

for target, name in [(4096, '4k'), (16384, '16k'), (65536, '64k'), (131072, '128k'), (262144, '256k')]:
    repeated = (tokens * (target // len(tokens) + 1))[:target]
    p = d / f'prompt_real_{name}.bin'
    p.write_bytes(b''.join(struct.pack('&lt;i', t) for t in repeated))
    print(f'Created {name} prompt')
"

# Run the test (example: 4K context, Q8 KV cache)
export DFLASH27B_DRAFT_SWA=2048 DFLASH27B_PREFILL_UBATCH=512
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
  build/test_dflash \
  models/Qwen3.6-27B-Q4_K_M.gguf \
  models/draft/dflash-draft-3.6-q8_0.gguf \
  /tmp/dflash_sweep/prompt_real_4k.bin \
  128 /dev/null --fast-rollback --ddtree --ddtree-budget=8 --max-ctx=8704 \
  -ctk q8_0 -ctv q8_0

# Larger-context example (128K needs Q4 KV)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
  build/test_dflash \
  models/Qwen3.6-27B-Q4_K_M.gguf \
  models/draft/dflash-draft-3.6-q8_0.gguf \
  /tmp/dflash_sweep/prompt_real_128k.bin \
  128 /dev/null --fast-rollback --ddtree --ddtree-budget=8 --max-ctx=200000 \
  -ctk q4_0 -ctv q4_0
</code></pre>
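<p dir="auto">The inline prompt generator above can also be written as a few small functions, which makes the tiling step easier to test in isolation. This is a sketch under the same assumption as the inline script, namely that a prompt file is a flat sequence of little-endian signed 32-bit token ids; the function names are invented for this example.</p>
<pre><code class="language-python">from pathlib import Path


def read_tokens(path):
    """Read a .bin prompt file as a list of little-endian int32 token ids."""
    raw = Path(path).read_bytes()
    usable = len(raw) - len(raw) % 4  # ignore any trailing partial token
    return [int.from_bytes(raw[i:i + 4], "little", signed=True)
            for i in range(0, usable, 4)]


def write_tokens(path, tokens):
    """Write token ids back in the same int32 little-endian layout."""
    data = b"".join(t.to_bytes(4, "little", signed=True) for t in tokens)
    Path(path).write_bytes(data)


def tile_tokens(seed, target_len):
    """Repeat the seed prompt until it is exactly target_len tokens long."""
    if not seed:
        raise ValueError("seed prompt is empty")
    repeats = target_len // len(seed) + 1
    return (seed * repeats)[:target_len]
</code></pre>
<p dir="auto">For example, <code>write_tokens(out_path, tile_tokens(read_tokens(seed_path), 4096))</code> produces the 4K file. Because every long prompt is one HumanEval prompt repeated, the text is highly repetitive, so acceptance lengths measured on it may be more favorable than on natural long documents.</p>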
<p dir="auto">KV cache quantization by context length:</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Context</th>
<th>Recommended KV quantization</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>4K–16K</td>
<td><code>-ctk q8_0 -ctv q8_0</code></td>
<td>Plenty of VRAM headroom</td>
</tr>
<tr>
<td>64K</td>
<td><code>-ctk q8_0 -ctv q8_0</code></td>
<td>VRAM is tight but it works</td>
</tr>
<tr>
<td>128K</td>
<td><code>-ctk q4_0 -ctv q4_0</code></td>
<td>Q8 exceeds the 24 GiB of VRAM</td>
</tr>
<tr>
<td>256K</td>
<td><code>-ctk tq3_0 -ctv tq3_0</code> (Q8 draft)</td>
<td>tq3_0 (3.5 bpv) with the Q8 draft fits in 24 GiB</td>
</tr>
</tbody>
</table>
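<p dir="auto">The table above maps directly onto a small lookup. The sketch below returns the <code>-ctk</code>/<code>-ctv</code> arguments for a given context length. The table has no 192K row, so sending 192K to tq3_0 is an assumption of this sketch (the sweep itself also ran 192K with Q4 KV), and <code>kv_flags</code> is a name invented here.</p>
<pre><code class="language-python">import bisect

# Largest context (in tokens) covered by each KV type, following the table above.
# 192K is not listed there; it falls into the tq3_0 bucket here by assumption.
_BOUNDS = [65536, 131072, 262144]
_KV_TYPES = ["q8_0", "q4_0", "tq3_0"]


def kv_flags(ctx_tokens):
    """Return the -ctk/-ctv arguments suggested for a context length."""
    idx = bisect.bisect_left(_BOUNDS, ctx_tokens)
    if idx == len(_BOUNDS):
        raise ValueError(f"no recommendation above {_BOUNDS[-1]} tokens")
    kv = _KV_TYPES[idx]
    return ["-ctk", kv, "-ctv", kv]


for ctx in (4096, 65536, 131072, 262144):
    print(ctx, " ".join(kv_flags(ctx)))
</code></pre>
<p dir="auto">The returned list can be appended to the <code>test_dflash</code> argument list used in the commands above.</p>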
<h3>KV Type Benchmark Results (7900 XTX, DDTree budget=8, Q8 draft)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>KV type</th>
<th>Decode (tok/s)</th>
<th>Speedup vs AR (28.07)</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>AR (llama.cpp HIP)</td>
<td>28.07</td>
<td>1.00x</td>
<td>Baseline</td>
</tr>
<tr>
<td>Default (no explicit type)</td>
<td>57.70</td>
<td>2.06x</td>
<td>Defaults to q8_0</td>
</tr>
<tr>
<td>Q8_0</td>
<td>58.84</td>
<td>2.10x</td>
<td>Explicit q8_0</td>
</tr>
<tr>
<td>Q4_0</td>
<td>57.32</td>
<td>2.04x</td>
<td>DFlash decode</td>
</tr>
<tr>
<td><strong>tq3_0</strong></td>
<td><strong>68.64</strong></td>
<td><strong>2.44x</strong></td>
<td>Short prompt; 3.5 bpv compression</td>
</tr>
<tr>
<td>turbo2/3/4</td>
<td>—</td>
<td>—</td>
<td>Not measured: the CUDA cpy kernel needs an f32→turbo conversion</td>
</tr>
</tbody>
</table>
<p dir="auto"><strong>tq3_0 decode throughput across contexts</strong> (Q8 draft vs Q4 draft):</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Context</th>
<th>AR (tok/s)</th>
<th>Q8 draft (tok/s)</th>
<th>Q4 draft (tok/s)</th>
<th>Speedup (faster draft vs AR)</th>
</tr>
</thead>
<tbody>
<tr>
<td>4K</td>
<td>27.77</td>
<td><strong>79.27</strong></td>
<td><strong>78.75</strong></td>
<td><strong>2.85x</strong></td>
</tr>
<tr>
<td>16K</td>
<td>27.79</td>
<td>72.43</td>
<td>78.57</td>
<td>2.83x</td>
</tr>
<tr>
<td>64K</td>
<td>27.78</td>
<td>71.96</td>
<td>75.72</td>
<td>2.73x</td>
</tr>
<tr>
<td>128K</td>
<td>27.78</td>
<td>71.20</td>
<td>75.27</td>
<td>2.71x</td>
</tr>
<tr>
<td>192K</td>
<td>27.78</td>
<td>71.56</td>
<td>76.54</td>
<td>2.76x</td>
</tr>
<tr>
<td>256K</td>
<td>27.78</td>
<td>72.26</td>
<td>75.70</td>
<td>2.73x</td>
</tr>
</tbody>
</table>
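<p dir="auto"><strong>Key findings</strong>:</p>
<ul>
<li>tq3_0 is best on short prompts (68.64 tok/s, 2.44x), ahead of the q8_0 KV baseline with the same Q8 draft (62.75)</li>
<li>tq3_0 + Q4 draft beats tq3_0 + Q8 draft at long context (16K–192K) by about 4–6 tok/s, attributed to more stable Q4 draft compute time</li>
<li><strong>256K context</strong>: Q8 draft + tq3_0 KV runs reliably (72.26 tok/s); Q4 draft + tq3_0 KV also works (75.70 tok/s) but carries an OOM risk</li>
<li>At short context (4K), tq3_0 + Q8 draft decode (79.27 tok/s) is somewhat slower than plain q8_0 KV with the Q8 draft (86.37 tok/s) because of the WHT rotation overhead; tq3_0 + Q4 draft reaches 78.75 tok/s at 4K</li>
<li>tq3_0 gives 4.6× compression over F16 (3.5 bpv), so 256K does not require dropping to the Q4 draft</li>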
</ul>
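<p dir="auto">The compression figures follow from bits per value (bpv). As a quick check, the sketch below compares the KV types by bpv. The q8_0 and q4_0 values come from ggml's block layouts (34 and 18 bytes per 32 values); the 3.5 bpv for tq3_0 is simply the figure quoted in this post, and the helper names are made up for the example.</p>
<pre><code class="language-python"># Bits per value for each KV cache type.
# q8_0 / q4_0: ggml block layouts (34 and 18 bytes per 32 values).
# tq3_0: the 3.5 bpv quoted in this post, taken as given.
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5, "tq3_0": 3.5}


def compression_vs_f16(kv_type):
    """How many times smaller than an F16 cache this KV type is."""
    return BITS_PER_VALUE["f16"] / BITS_PER_VALUE[kv_type]


def relative_size(kv_type, baseline):
    """Cache size of kv_type as a fraction of baseline at equal context."""
    return BITS_PER_VALUE[kv_type] / BITS_PER_VALUE[baseline]


for name in ("q8_0", "q4_0", "tq3_0"):
    print(f"{name:6s} {compression_vs_f16(name):.2f}x vs f16")
print(f"tq3_0 is {relative_size('tq3_0', 'q4_0'):.0%} the size of q4_0")
</code></pre>
<p dir="auto">At equal context length a tq3_0 cache is a little under four fifths the size of a q4_0 cache, which is where the extra room for the Q8 draft at 256K comes from.</p>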
<p dir="auto">Self-test for KV type parsing:</p>
<pre><code class="language-bash">./build/test_kv_quant
</code></pre>
<h3>8. Build and Run the llama.cpp Baseline</h3>
<pre><code class="language-bash"># Build llama.cpp separately from dflash/deps/llama.cpp (the -S path is relative to the repository root)
BUILD_DIR=/tmp/llama-bench-build
cmake -B $BUILD_DIR -S dflash/deps/llama.cpp \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_HIP=ON \
  -DLLAMA_BUILD_TOOLS=ON
cmake --build $BUILD_DIR --target llama-bench -j$(nproc)

# Run the baseline (short prompt)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
$BUILD_DIR/bin/llama-bench \
  -m models/Qwen3.6-27B-Q4_K_M.gguf \
  -n 128 -p 128 -o md

# Different context lengths
for ctx in 4096 16384 65536; do
  LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
    $BUILD_DIR/bin/llama-bench \
    -m models/Qwen3.6-27B-Q4_K_M.gguf \
    -n 128 -p $ctx -ngl 99 -o md 2&gt;&amp;1 | grep -E "pp|tg"
done
</code></pre>
<hr />
<h2>Comparison with the Blog Numbers</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th>Strix Halo (gfx1151), blog</th>
<th>7900 XTX (gfx1100), measured</th>
</tr>
</thead>
<tbody>
<tr>
<td>llama.cpp HIP AR</td>
<td>12.02 tok/s</td>
<td>28.07 tok/s</td>
</tr>
<tr>
<td>DFlash (best budget)</td>
<td>26.85 tok/s (budget=22)</td>
<td>62.75 tok/s (budget=8)</td>
</tr>
<tr>
<td>Speedup</td>
<td>2.23x</td>
<td>2.24x</td>
</tr>
<tr>
<td>Best budget</td>
<td>22 (LPDDR5X, bandwidth-bound)</td>
<td>8 (GDDR6, high bandwidth)</td>
</tr>
</tbody>
</table>
<p dir="auto">Blog post: <a href="https://www.lucebox.com/blog/amd" rel="nofollow ugc">https://www.lucebox.com/blog/amd</a></p>
<hr />
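<h2>Notes</h2>
<ol>
<li><strong>BSA scoring kernel</strong>: not implemented on HIP, so it falls back to ggml flash_attn_ext (about 3.4x slower than CUDA BSA). This is the remaining optimization headroom for PFlash.</li>
<li><strong>ubatch=512 from PR #159</strong> is applied through the <code>DFLASH27B_PREFILL_UBATCH=512</code> environment variable (a manual override on top of PR #119).</li>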
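<li><strong>VRAM limit</strong>: 24 GiB on the 7900 XTX is tight. At 16K, model weights (~16 GiB) plus ~6 GiB of KV cache already leave almost no headroom; that KV figure is presumably for an unquantized cache, which is why the sweep above quantizes the KV cache at every context length. Strix Halo's 128 GiB of unified memory is what it takes to run large contexts with large models comfortably.</li>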
</ol>
<hr />
<h2>tq3_0 KV: Pros and Cons</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Aspect</th>
<th>Pros</th>
<th>Cons</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compression</strong></td>
<td>3.5 bpv (4.6× vs F16), about 1 bpv less than Q4_0</td>
<td>Loses information relative to Q8_0, though no decode-quality degradation was observed</td>
</tr>
<tr>
<td><strong>Decode speed (short prompt)</strong></td>
<td><strong>68.64 tok/s (2.44x)</strong>, ahead of the Q8_0 and Q4_0 KV baselines</td>
<td>—</td>
</tr>
<tr>
<td><strong>Decode speed (multi-context)</strong></td>
<td>With the Q4 draft, reaches <strong>75–79 tok/s</strong> at 16K–192K, faster than with the Q8 draft</td>
<td>With the Q8 draft, slightly slower than plain Q8 KV at 4K (79 vs 86) due to WHT rotation overhead</td>
</tr>
<tr>
<td><strong>256K support</strong></td>
<td><strong>Works with the Q8 draft</strong> (no need to downgrade draft quality) at 72 tok/s; 76 tok/s with the Q4 draft</td>
<td>Q4 draft + tq3_0 at 256K carries an OOM risk (VRAM fragmentation)</td>
</tr>
<tr>
<td><strong>VRAM savings</strong></td>
<td>At 256K, saves ~1 GiB over Q4_0 KV, enough to fit the Q8 draft</td>
<td>—</td>
</tr>
<tr>
<td><strong>Compatibility</strong></td>
<td>Validated end to end on HIP/ROCm; supported by <code>parse_kv_type</code></td>
<td>turbo2/3/4 still need CUDA cpy kernel support</td>
</tr>
<tr>
<td><strong>Use cases</strong></td>
<td>256K long context + Q8 draft (recommended); general speedup on short prompts</td>
<td>At very short context (4K), plain Q8_0 KV is faster if you want maximum tok/s</td>
</tr>
</tbody>
</table>
<hr />
<h2>KV Type Comparison: q8_0 vs q4_0 vs tq3_0</h2>
<p dir="auto">Based on measurements with the 7900 XTX (24 GiB) and Qwen3.6-27B Q4_K_M, summarized by context-length tier:</p>
<h3>4K and Below (Short Context, 128–4K)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>KV type</th>
<th>Draft</th>
<th>tok/s</th>
<th>VRAM</th>
<th>Quality</th>
<th>Recommended?</th>
</tr>
</thead>
<tbody>
<tr>
<td>q8_0</td>
<td>Q8</td>
<td><strong>86.37</strong></td>
<td>Ample</td>
<td>Near-lossless</td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2b50.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--star" style="height:23px;width:auto;vertical-align:middle" title="⭐" alt="⭐" /> <strong>Top pick</strong></td>
</tr>
<tr>
<td>q8_0</td>
<td>Q4</td>
<td>82.72</td>
<td>Ample</td>
<td>Near-lossless</td>
<td>Alternative</td>
</tr>
<tr>
<td>tq3_0</td>
<td>Q8</td>
<td>79.27</td>
<td>Lower</td>
<td>3.5 bpv</td>
<td>WHT overhead costs ~8%</td>
</tr>
<tr>
<td>tq3_0</td>
<td>Q4</td>
<td>78.75</td>
<td>Lowest</td>
<td>3.5 bpv</td>
<td>—</td>
</tr>
</tbody>
</table>
<p dir="auto"><strong>Conclusion</strong>: at short context VRAM is plentiful and q8_0 KV + Q8 draft is fastest (86 tok/s), so tq3_0 is unnecessary. The WHT rotation overhead of tq3_0 is not offset by its VRAM savings at short context.</p>
<h3>16K–64K (Medium Context)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>KV type</th>
<th>Draft</th>
<th>tok/s</th>
<th>VRAM constraint</th>
<th>Quality</th>
</tr>
</thead>
<tbody>
<tr>
<td>q8_0</td>
<td>Q8</td>
<td>76–78</td>
<td>64K needs PFlash (AR OOMs)</td>
<td>Near-lossless</td>
</tr>
<tr>
<td>q8_0</td>
<td>Q4</td>
<td>—</td>
<td>64K barely fits</td>
<td>Near-lossless</td>
</tr>
<tr>
<td>tq3_0</td>
<td>Q8</td>
<td>72</td>
<td>Ample</td>
<td>3.5 bpv</td>
</tr>
<tr>
<td>tq3_0</td>
<td>Q4</td>
<td><strong>75–79</strong></td>
<td>Ample</td>
<td>3.5 bpv</td>
</tr>
</tbody>
</table>
<p dir="auto"><strong>Conclusion</strong>:</p>
<ul>
<li>16K: tq3_0 + Q4 is slightly ahead (79 tok/s), with q8_0 + Q8 close behind (77 tok/s).</li>
<li>64K: q8_0 + Q8 still works (with PFlash) at 78 tok/s; tq3_0 + Q4 is slightly slower (76 tok/s) but uses less VRAM.</li>
<li><strong>tq3_0 + Q4 draft is the best all-round choice in this range</strong> (75–79 tok/s): the more stable Q4 draft compute offsets the WHT overhead of tq3_0.</li>
</ul>
<h3>128K–192K (Long Context)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>KV type</th>
<th>Draft</th>
<th>tok/s</th>
<th>VRAM constraint</th>
<th>Quality</th>
</tr>
</thead>
<tbody>
<tr>
<td>q4_0</td>
<td>Q8</td>
<td>74–79</td>
<td>128K ✓, 192K ✗ (exceeds 24 GiB)</td>
<td>4.5 bpv</td>
</tr>
<tr>
<td>q4_0</td>
<td>Q4</td>
<td><strong>81</strong></td>
<td>Needs the Q4 draft to save ~1 GiB</td>
<td>4.5 bpv</td>
</tr>
<tr>
<td>tq3_0</td>
<td>Q8</td>
<td>71</td>
<td>128K ✓, 192K barely fits</td>
<td>3.5 bpv</td>
</tr>
<tr>
<td>tq3_0</td>
<td>Q4</td>
<td>75–77</td>
<td>Ample</td>
<td>3.5 bpv</td>
</tr>
</tbody>
</table>
<p dir="auto"><strong>Conclusion</strong>:</p>
<ul>
<li>128K: q4_0 + Q8 is fastest (79 tok/s) and just fits in VRAM.</li>
<li>192K: q4_0 has to be paired with the Q4 draft (81 tok/s); tq3_0 + Q8 avoids downgrading the draft (72 tok/s).</li>
<li><strong>For maximum speed</strong>: q4_0 + Q4 draft (81 tok/s); <strong>for draft quality</strong>: tq3_0 + Q8 draft (72 tok/s).</li>
</ul>
<h3>256K (Ultra-Long Context)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>KV type</th>
<th>Draft</th>
<th>tok/s</th>
<th>VRAM constraint</th>
<th>Quality</th>
</tr>
</thead>
<tbody>
<tr>
<td>q4_0</td>
<td>Q8</td>
<td>OOM</td>
<td>Q4 KV + Q8 draft exceeds 24 GiB</td>
<td>—</td>
</tr>
<tr>
<td>q4_0</td>
<td>Q4</td>
<td><strong>81.01</strong></td>
<td>Barely fits; needs the Q4 draft to save VRAM</td>
<td>4.5 bpv + Q4 draft</td>
</tr>
<tr>
<td>tq3_0</td>
<td>Q8</td>
<td>72.26</td>
<td>✓ Runs reliably, no draft downgrade needed</td>
<td>3.5 bpv + Q8 draft</td>
</tr>
<tr>
<td>tq3_0</td>
<td>Q4</td>
<td>75.70</td>
<td>✓ But with OOM risk (fragmentation)</td>
<td>3.5 bpv + Q4 draft</td>
</tr>
</tbody>
</table>
<p dir="auto"><strong>Conclusion</strong>:</p>
<ul>
<li><strong>Recommended</strong>: <strong>tq3_0 KV + Q8 draft (72 tok/s)</strong>, the only 256K option that keeps Q8 draft quality.</li>
<li><strong>Maximum speed</strong>: q4_0 KV + Q4 draft (81 tok/s), at the lowest draft quality.</li>
<li><strong>Not recommended</strong>: tq3_0 + Q4 draft (76 tok/s), which is slower than q4_0 + Q4 and has lower draft quality than tq3_0 + Q8.</li>
</ul>
<h3>Overall Selection Guide</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Scenario</th>
<th>Best combination</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td>Chat/dialogue (up to 4K)</td>
<td>q8_0 KV + Q8 draft</td>
<td>Fastest decode, near-lossless quality</td>
</tr>
<tr>
<td>Document analysis (16K–64K)</td>
<td>tq3_0 KV + Q4 draft</td>
<td>Best balance of speed and VRAM</td>
</tr>
<tr>
<td>Codebase understanding (128K–192K)</td>
<td>q4_0 KV + Q4 draft (for speed)<br />tq3_0 KV + Q8 draft (for draft quality)</td>
<td>The former is about 13% faster at 192K; the latter has a more accurate draft</td>
</tr>
<tr>
<td>Ultra-long context (256K)</td>
<td><strong>tq3_0 KV + Q8 draft</strong> <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /></td>
<td>The only option that combines the Q8 draft with 256K</td>
</tr>
<tr>
<td>VRAM extremely tight</td>
<td>tq3_0 KV + Q4 draft</td>
<td>Lowest VRAM use (but OOM risk at 256K)</td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto"><strong>In one sentence</strong>: the main value of tq3_0 is <strong>keeping Q8 draft quality at 256K</strong>, which q4_0 KV cannot do. In the 16K–192K range, tq3_0 + Q4 draft is a cost-effective combination worth considering. For short contexts (up to 4K), just use q8_0 KV.</p>
</blockquote>
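<p dir="auto">The guide can be cross-checked against the raw measurements with a small lookup that picks the fastest measured combination per context. This is only a sketch: the numbers are copied from the multi-context decode table, the KV type for the plain Q8-draft and Q4-draft columns is inferred from the sweep notes (Q8_0 up to 64K, Q4_0 from 128K), and <code>fastest</code> is a name invented for the example.</p>
<pre><code class="language-python"># Measured decode tok/s: context -> {(KV type, draft): tok/s}.
# KV types for the unlabeled draft columns are inferred, not stated per cell.
DECODE = {
    "4K": {("q8_0", "Q8"): 86.37, ("q8_0", "Q4"): 82.72,
           ("tq3_0", "Q8"): 79.27, ("tq3_0", "Q4"): 78.75},
    "16K": {("q8_0", "Q8"): 76.87,
            ("tq3_0", "Q8"): 72.43, ("tq3_0", "Q4"): 78.57},
    "64K": {("q8_0", "Q8"): 78.33,
            ("tq3_0", "Q8"): 71.96, ("tq3_0", "Q4"): 75.72},
    "128K": {("q4_0", "Q8"): 78.82,
             ("tq3_0", "Q8"): 71.20, ("tq3_0", "Q4"): 75.27},
    "192K": {("q4_0", "Q8"): 73.77, ("q4_0", "Q4"): 81.09,
             ("tq3_0", "Q8"): 71.56, ("tq3_0", "Q4"): 76.54},
    "256K": {("q4_0", "Q4"): 81.01,
             ("tq3_0", "Q8"): 72.26, ("tq3_0", "Q4"): 75.70},
}


def fastest(ctx, require_q8_draft=False):
    """Return ((kv_type, draft), tok/s) for the fastest measured combination."""
    rows = DECODE[ctx]
    if require_q8_draft:
        rows = {k: v for k, v in rows.items() if k[1] == "Q8"}
    return max(rows.items(), key=lambda item: item[1])


for ctx in DECODE:
    print(ctx, fastest(ctx), fastest(ctx, require_q8_draft=True))
</code></pre>
<p dir="auto">Speed alone does not settle the choice, since the guide also weighs draft quality and OOM risk, but the lookup makes it easy to see where the recommendation trades a few tok/s for headroom.</p>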
<hr />
<h1>Detailed Test Data</h1>
<hr />
<h2>1. Short-Prompt Decode (HumanEval, 10 prompts, n_gen=128)</h2>
<h3>1a. llama.cpp HIP AR Baseline</h3>
<pre><code>$ llama-bench -m Qwen3.6-27B-Q4_K_M.gguf -n 128 -p 128 -o md
| model                    | size     | params  | backend | ngl | test   | t/s              |
|--------------------------|----------|---------|---------|-----|--------|------------------|
| qwen35 27B Q4_K - Medium | 15.65 GiB| 26.90 B | ROCm    | 99  | pp128  | 619.51 ± 3.29    |
| qwen35 27B Q4_K - Medium | 15.65 GiB| 26.90 B | ROCm    | 99  | tg128  | 28.07 ± 0.06     |
</code></pre>
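<p dir="auto">When collecting several baseline runs, it helps to pull the numbers out of the markdown table programmatically. Below is a minimal parser sketch for output shaped like the sample above. It assumes the test name and t/s are the last two columns, which holds for this sample but is not guaranteed for every llama-bench version; <code>parse_llama_bench</code> is a name invented here.</p>
<pre><code class="language-python">def parse_llama_bench(markdown):
    """Map test name (e.g. pp128, tg128) to mean t/s from markdown output."""
    results = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 1:
            continue  # blank line or text that is not a table row
        test, rate = cells[-2], cells[-1]
        if test.startswith(("pp", "tg")):
            # The cell looks like "619.51 ± 3.29"; keep only the mean.
            results[test] = float(rate.split()[0])
    return results


SAMPLE = """\
| model                    | size     | params  | backend | ngl | test   | t/s              |
|--------------------------|----------|---------|---------|-----|--------|------------------|
| qwen35 27B Q4_K - Medium | 15.65 GiB| 26.90 B | ROCm    | 99  | pp128  | 619.51 ± 3.29    |
| qwen35 27B Q4_K - Medium | 15.65 GiB| 26.90 B | ROCm    | 99  | tg128  | 28.07 ± 0.06     |
"""

print(parse_llama_bench(SAMPLE))
</code></pre>
<p dir="auto">The header and separator rows are skipped because their test column does not start with pp or tg.</p>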
<h3>1b. Lucebox Q8 Draft (official)</h3>
<h4>DDTree budget=8</h4>
<pre><code>$ bench_he.py --n-gen 128 --ddtree-budget 8

prompt                         steps     AL   pct%  prefill   decode
------------------------------------------------------------------------
  has_close_elements              22   5.82   36.4      n/a    74.25
  separate_paren_groups           23   5.57   34.8      n/a    70.97
  truncate_number                 42   3.05   19.0      n/a    39.32
  below_zero                      22   5.82   36.4      n/a    74.16
  mean_absolute_deviation         28   4.57   28.6      n/a    58.51
  intersperse                     25   5.12   32.0      n/a    65.33
  parse_nested_parens             23   5.57   34.8      n/a    70.23
  filter_by_substring             28   4.57   28.6      n/a    58.29
  sum_product                     26   4.92   30.8      n/a    62.29
  rolling_max                     30   4.27   26.7      n/a    54.13
------------------------------------------------------------------------
MEAN                                   4.93   30.8      n/a    62.75

commit/step range: 3.05 - 5.82
tok/s range:    39.3 - 74.2
</code></pre>
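<p dir="auto">The AL column and the MEAN row in this log can be recomputed from the step counts: with n_gen=128, AL works out to 128 divided by the number of steps. The sketch below checks that against the budget=8 rows above; it is a sanity check on the published numbers, not part of the bench script.</p>
<pre><code class="language-python"># (prompt, steps, decode tok/s) rows from the budget=8 log above.
ROWS = [
    ("has_close_elements", 22, 74.25),
    ("separate_paren_groups", 23, 70.97),
    ("truncate_number", 42, 39.32),
    ("below_zero", 22, 74.16),
    ("mean_absolute_deviation", 28, 58.51),
    ("intersperse", 25, 65.33),
    ("parse_nested_parens", 23, 70.23),
    ("filter_by_substring", 28, 58.29),
    ("sum_product", 26, 62.29),
    ("rolling_max", 30, 54.13),
]
N_GEN = 128  # tokens generated per prompt

# Acceptance length: tokens committed per speculative step.
accept_lengths = [N_GEN / steps for _, steps, _ in ROWS]
mean_al = sum(accept_lengths) / len(ROWS)
mean_decode = sum(rate for _, _, rate in ROWS) / len(ROWS)

print(f"mean AL = {mean_al:.2f}, mean decode = {mean_decode:.2f} tok/s")
</code></pre>
<p dir="auto">This reproduces the MEAN row (AL 4.93, decode 62.75 tok/s), and the spread shows how much a single hard prompt such as truncate_number pulls the average down.</p>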
<h4>DDTree budget=22</h4>
<pre><code>$ bench_he.py --n-gen 128 --ddtree-budget 22

prompt                         steps     AL   pct%  prefill   decode
------------------------------------------------------------------------
  has_close_elements              16   8.00   50.0      n/a    80.41
  separate_paren_groups           16   8.00   50.0      n/a    80.08
  truncate_number                 33   3.88   24.2      n/a    39.03
  below_zero                      16   8.00   50.0      n/a    80.53
  mean_absolute_deviation         24   5.33   33.3      n/a    52.11
  intersperse                     26   4.92   30.8      n/a    49.51
  parse_nested_parens             21   6.10   38.1      n/a    60.51
  filter_by_substring             25   5.12   32.0      n/a    51.54
  sum_product                     20   6.40   40.0      n/a    63.81
  rolling_max                     24   5.33   33.3      n/a    51.83
------------------------------------------------------------------------
MEAN                                   6.11   38.2      n/a    60.94

commit/step range: 3.88 - 8.00
tok/s range:    39.0 - 80.5
</code></pre>
<h4>Standard chain speculation (no DDTree)</h4>
<pre><code>$ bench_he.py --n-gen 128

prompt                         steps     AL   pct%  prefill   decode
------------------------------------------------------------------------
  has_close_elements              17   7.53   47.4      n/a    90.09
  separate_paren_groups           21   6.10   39.6      n/a    72.99
  truncate_number                 41   3.12   19.5      n/a    37.79
  below_zero                      19   6.74   44.1      n/a    80.57
  mean_absolute_deviation         25   5.12   32.2      n/a    61.26
  intersperse                     31   4.13   25.8      n/a    49.59
  parse_nested_parens             21   6.10   38.4      n/a    72.60
  filter_by_substring             28   4.57   29.0      n/a    54.95
  sum_product                     22   5.82   37.8      n/a    69.52
  rolling_max                     29   4.41   28.0      n/a    52.94
------------------------------------------------------------------------
MEAN                                   5.36   34.2      n/a    64.23

commit/step range: 3.12 - 7.53
tok/s range:    37.8 - 90.1
</code></pre>
<h3>1c. spiritbuun Q4 Draft</h3>
<h4>DDTree budget=8</h4>
<pre><code>$ bench_he.py --n-gen 128 --ddtree-budget 8

prompt                         steps     AL   pct%  prefill   decode
------------------------------------------------------------------------
  has_close_elements              23   5.57   34.8      n/a    69.64
  separate_paren_groups           22   5.82   36.4      n/a    74.49
  truncate_number                 40   3.20   20.0      n/a    40.57
  below_zero                      31   4.13   25.8      n/a    51.59
  mean_absolute_deviation         30   4.27   26.7      n/a    53.28
  intersperse                     24   5.33   33.3      n/a    66.22
  parse_nested_parens             28   4.57   28.6      n/a    58.52
  filter_by_substring             26   4.92   30.8      n/a    61.48
  sum_product                     26   4.92   30.8      n/a    63.17
  rolling_max                     31   4.13   25.8      n/a    51.18
------------------------------------------------------------------------
MEAN                                   4.69   29.3      n/a    59.01
</code></pre>
<h4>Standard chain speculation (no DDTree)</h4>
<pre><code>$ bench_he.py --n-gen 128

prompt                         steps     AL   pct%  prefill   decode
------------------------------------------------------------------------
  has_close_elements              18   7.11   45.1      n/a    82.30
  separate_paren_groups           17   7.53   50.4      n/a    89.03
  truncate_number                 36   3.56   22.4      n/a    41.97
  below_zero                      30   4.27   26.7      n/a    49.76
  mean_absolute_deviation         24   5.33   35.4      n/a    61.82
  intersperse                     35   3.66   23.0      n/a    43.04
  parse_nested_parens             24   5.33   35.4      n/a    63.36
  filter_by_substring             24   5.33   35.7      n/a    61.94
  sum_product                     23   5.57   36.1      n/a    66.46
  rolling_max                     30   4.27   27.1      n/a    49.96
------------------------------------------------------------------------
MEAN                                   5.20   33.7      n/a    60.96
</code></pre>
<h3>1d. Q8 vs Q4 Draft vs tq3_0 KV Summary</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Configuration</th>
<th>Draft</th>
<th>KV type</th>
<th>Mean tok/s</th>
<th>Mean AL</th>
<th>Speedup vs AR</th>
</tr>
</thead>
<tbody>
<tr>
<td>AR (llama.cpp)</td>
<td>—</td>
<td>—</td>
<td>28.07</td>
<td>—</td>
<td>1.00x</td>
</tr>
<tr>
<td>Standard chain</td>
<td>Q8</td>
<td>q8_0</td>
<td><strong>64.23</strong></td>
<td>5.36</td>
<td>2.29x</td>
</tr>
<tr>
<td>Standard chain</td>
<td>Q4</td>
<td>q8_0</td>
<td>60.96</td>
<td>5.20</td>
<td>2.17x</td>
</tr>
<tr>
<td>DDTree b=8</td>
<td>Q8</td>
<td>q8_0</td>
<td><strong>62.75</strong></td>
<td>4.93</td>
<td>2.24x</td>
</tr>
<tr>
<td>DDTree b=8</td>
<td>Q4</td>
<td>q8_0</td>
<td>59.01</td>
<td>4.69</td>
<td>2.10x</td>
</tr>
<tr>
<td>DDTree b=8</td>
<td><strong>Q8</strong></td>
<td><strong>tq3_0</strong></td>
<td><strong>68.64</strong></td>
<td>5.57</td>
<td><strong>2.44x</strong></td>
</tr>
<tr>
<td>DDTree b=22</td>
<td>Q8</td>
<td>q8_0</td>
<td>60.94</td>
<td>6.11</td>
<td>2.17x</td>
</tr>
</tbody>
</table>
<p dir="auto"><strong>tq3_0 KV is the fastest for short-prompt decode</strong> (68.64 tok/s, 2.44x): 9% faster than Q8 draft + q8_0 KV and 16% faster than the Q4 draft. Its 3.5 bpv format also saves roughly 1 GiB of VRAM.</p>
<hr />
]]></description><link>https://lcz.me/topic/195/lucebox-dflash-pflash-7900xtx-qwen3.6-27b-2.8-3.1x加速-测试数据分享</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 06:08:39 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/195.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 18 May 2026 06:54:55 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Lucebox DFlash + PFlash on 7900XTX, Qwen3.6-27B, ~2.8–3.1x speedup: benchmark data on Mon, 18 May 2026 15:26:23 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/david-zhang" aria-label="Profile: David-Zhang">@<bdi>David-Zhang</bdi></a> Could you simplify it so that I can just copy and paste and do the whole thing with the mouse <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f602.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--joy" style="height:23px;width:auto;vertical-align:middle" title="😂" alt="😂" /> YouTube has been driving me up the wall.</p>
]]></description><link>https://lcz.me/post/2460</link><guid isPermaLink="true">https://lcz.me/post/2460</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 15:26:23 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash on 7900XTX, Qwen3.6-27B, ~2.8–3.1x speedup: benchmark data on Mon, 18 May 2026 15:18:12 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/coin1860" aria-label="Profile: coin1860">@<bdi>coin1860</bdi></a> Sure, I'll run some tests over the next few days.</p>
]]></description><link>https://lcz.me/post/2455</link><guid isPermaLink="true">https://lcz.me/post/2455</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 15:18:12 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash on 7900XTX, Qwen3.6-27B, ~2.8–3.1x speedup: benchmark data on Mon, 18 May 2026 15:17:04 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Sure, whenever you have time, no rush.</p>
]]></description><link>https://lcz.me/post/2452</link><guid isPermaLink="true">https://lcz.me/post/2452</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 15:17:04 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash on 7900XTX, Qwen3.6-27B, ~2.8–3.1x speedup: benchmark data on Mon, 18 May 2026 15:14:34 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/david-zhang" aria-label="Profile: David-Zhang">@<bdi>David-Zhang</bdi></a> I'm not tinkering for now; I'll pick it up again after I buy a second XTX. Right now YouTube's AI video policy is giving me a headache, and I've spent the last few days agonizing over what content to make. It's driving me crazy.</p>
]]></description><link>https://lcz.me/post/2449</link><guid isPermaLink="true">https://lcz.me/post/2449</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 15:14:34 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 15:07:30 GMT]]></title><description><![CDATA[<p dir="auto">DFlash looks good, but PFlash deserves a closer look. I had Gemini search for it, and it found that the author acknowledges PFlash is not lossless. For agent use I don't think that matters, but for coding it hurts a bit. I'll wait for your real-world test results.</p>
]]></description><link>https://lcz.me/post/2445</link><guid isPermaLink="true">https://lcz.me/post/2445</guid><dc:creator><![CDATA[coin1860]]></dc:creator><pubDate>Mon, 18 May 2026 15:07:30 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 14:19:14 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/quincysnow" aria-label="Profile: QuincySnow">@<bdi>QuincySnow</bdi></a> You'll need to modify the code yourself for that.</p>
]]></description><link>https://lcz.me/post/2414</link><guid isPermaLink="true">https://lcz.me/post/2414</guid><dc:creator><![CDATA[iamvirus]]></dc:creator><pubDate>Mon, 18 May 2026 14:19:14 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 13:11:15 GMT]]></title><description><![CDATA[<p dir="auto">If you just want to copy my setup, look here:<br />
<a href="https://lcz.me/topic/202/lucebox-dflash-pflash-%E7%BC%96%E8%AF%91%E4%B8%8E%E9%83%A8%E7%BD%B2%E6%8C%87%E5%8D%97-qwen3.6-27b-%E6%96%B9%E4%BE%BF%E6%8A%84%E4%BD%9C%E4%B8%9A-linux">Lucebox DFlash + PFlash build and deployment guide for Qwen3.6-27B, ready to copy (Linux)</a></p>
]]></description><link>https://lcz.me/post/2396</link><guid isPermaLink="true">https://lcz.me/post/2396</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 13:11:15 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 12:56:39 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> I've updated the post with tq3_0. You're up <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f601.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--grin" style="height:23px;width:auto;vertical-align:middle" title=":grin:" alt="😁" /></p>
]]></description><link>https://lcz.me/post/2390</link><guid isPermaLink="true">https://lcz.me/post/2390</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 12:56:39 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 09:54:16 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/quincysnow" aria-label="Profile: QuincySnow">@<bdi>QuincySnow</bdi></a> Yeah, I hope that guy keeps at it. There hasn't been a major release for several days, but its current 4K performance is about on par with Vulkan. I don't know whether it can get any faster, so I'll try it again after a while.</p>
]]></description><link>https://lcz.me/post/2349</link><guid isPermaLink="true">https://lcz.me/post/2349</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 09:54:16 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 09:19:56 GMT]]></title><description><![CDATA[<p dir="auto">We can only wait for it to be optimized. At least there's a purpose-built, optimized option to choose from, right?</p>
]]></description><link>https://lcz.me/post/2346</link><guid isPermaLink="true">https://lcz.me/post/2346</guid><dc:creator><![CDATA[QuincySnow]]></dc:creator><pubDate>Mon, 18 May 2026 09:19:56 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 08:55:20 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/quincysnow" aria-label="Profile: QuincySnow">@<bdi>QuincySnow</bdi></a> That thing falls over above 8K ctx; at 4K you can play with it freely.</p>
]]></description><link>https://lcz.me/post/2343</link><guid isPermaLink="true">https://lcz.me/post/2343</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 08:55:20 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 08:45:06 GMT]]></title><description><![CDATA[<p dir="auto">If you have an AMD card, you can try <a href="https://github.com/Kaden-Schutt/hipfire" rel="nofollow ugc">https://github.com/Kaden-Schutt/hipfire</a>. It isn't very mature yet, but on my 6650XT under Linux, Qwen 3.5 9B reaches 45 tok/s, and it's faster still with DFlash enabled.</p>
]]></description><link>https://lcz.me/post/2341</link><guid isPermaLink="true">https://lcz.me/post/2341</guid><dc:creator><![CDATA[QuincySnow]]></dc:creator><pubDate>Mon, 18 May 2026 08:45:06 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 08:38:24 GMT]]></title><description><![CDATA[<p dir="auto">I thought llama.cpp MTP holding steady at 50-60 was already great, but its prefill speed drops steadily as the context grows, which hurts agent workloads a lot.<br />
With prefill this stable here, I'm tempted to buy another 7900XTX! I wonder how the output quality holds up.</p>
]]></description><link>https://lcz.me/post/2337</link><guid isPermaLink="true">https://lcz.me/post/2337</guid><dc:creator><![CDATA[iamvirus]]></dc:creator><pubDate>Mon, 18 May 2026 08:38:24 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 10:45:09 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Trying it now... I'll post when it's done.</p>
<p dir="auto">Lucebox now has tq_3. Surprised? Pleasantly?<br />
<img src="https://upload.lcz.me/uploads/47d117bf-f9cf-4303-a88f-3a2e1c2fb2f1.jpeg" alt="92fc9853-0c7d-4e6b-8402-e1fc6e3ec468-image.jpeg" class=" img-fluid img-markdown" /></p>
]]></description><link>https://lcz.me/post/2332</link><guid isPermaLink="true">https://lcz.me/post/2332</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 10:45:09 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 08:19:15 GMT]]></title><description><![CDATA[<p dir="auto">We still need to test TurboQuant + DFlash. The two have to work together; otherwise it's pointless on a 24 GB card.</p>
]]></description><link>https://lcz.me/post/2329</link><guid isPermaLink="true">https://lcz.me/post/2329</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 08:19:15 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 08:11:22 GMT]]></title><description><![CDATA[<p dir="auto">This Lucebox is pretty impressive.</p>
]]></description><link>https://lcz.me/post/2324</link><guid isPermaLink="true">https://lcz.me/post/2324</guid><dc:creator><![CDATA[bin flamebox]]></dc:creator><pubDate>Mon, 18 May 2026 08:11:22 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 08:38:40 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> opencode finished the runs. Going by the numbers, DFlash + PFlash really does deliver.</p>
<h4>Prefill throughput (tok/s)</h4>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Context</th>
<th>llama.cpp HIP (AR)</th>
<th>Lucebox (PFlash)</th>
<th>Speedup</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>128</td>
<td>619.51</td>
<td>312.5</td>
<td>0.50x</td>
<td>PFlash has high overhead at short context</td>
</tr>
<tr>
<td>4K</td>
<td>734.57</td>
<td>726 (Q8 KV)</td>
<td>0.99x</td>
<td>On par</td>
</tr>
<tr>
<td>16K</td>
<td>649.08</td>
<td>735 (Q8 KV)</td>
<td><strong>1.13x</strong></td>
<td></td>
</tr>
<tr>
<td>64K</td>
<td>—¹</td>
<td>733 (Q8 KV)</td>
<td>—</td>
<td>¹ llama.cpp runs out of memory (OOM) at context creation</td>
</tr>
<tr>
<td>128K</td>
<td>—¹</td>
<td>730 (Q4 KV)</td>
<td>—</td>
<td></td>
</tr>
<tr>
<td>192K</td>
<td>—¹</td>
<td><strong>730</strong> (Q4 KV)</td>
<td>—</td>
<td></td>
</tr>
<tr>
<td>256K</td>
<td>—¹</td>
<td><strong>730</strong> (Q4 KV + Q4 draft)</td>
<td>—</td>
<td></td>
</tr>
</tbody>
</table>
<blockquote>
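<p dir="auto">¹ AR prefill cannot run at 64K and above because llama.cpp has to allocate the full KV cache at context creation (O(n²) attention plus full KV storage). At 64K, about 8 GiB of Q4 KV on top of the roughly 15 GiB model is already enough to exhaust the 24 GiB of VRAM. This is PFlash's core advantage: <strong>compressed prefill squeezes a long prompt down to a fixed size, reducing prefill complexity from O(n²) to O(n)</strong>.</p>
<p dir="auto">256K requires the Q4 draft plus a Q4 KV cache to save VRAM.</p>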
</blockquote>
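<p dir="auto">As a quick sanity check, the speedup column and the wall-clock cost of a long prefill can be recomputed from the throughput figures in the table. The sketch below is illustrative only: the helper names are made up for this example, and it assumes "256K" means 256 × 1024 tokens, which the post does not state.</p>

```python
# Throughput figures (tok/s) are copied from the prefill table above.
# Assumption: "4K" means 4 * 1024 tokens, "256K" means 256 * 1024 tokens.

def speedup(baseline_tps, candidate_tps):
    """Ratio candidate/baseline, or None when the baseline run hit OOM."""
    if baseline_tps is None:
        return None
    return round(candidate_tps / baseline_tps, 2)

def prefill_seconds(prompt_tokens, tokens_per_second):
    """Wall-clock time to prefill a prompt at a steady throughput."""
    return prompt_tokens / tokens_per_second

if __name__ == "__main__":
    print(speedup(619.51, 312.5))   # 128-token row
    print(speedup(734.57, 726.0))   # 4K row
    print(speedup(649.08, 735.0))   # 16K row
    print(speedup(None, 733.0))     # 64K row: llama.cpp OOM, no ratio
    # A 256K prompt at 730 tok/s takes about six minutes to prefill.
    print(f"{prefill_seconds(256 * 1024, 730.0):.0f} s")
```

<p dir="auto">Even at a steady ~730 tok/s, a full 256K prompt is a prefill of several minutes, so the throughput staying flat matters more than the small ratio at 16K.</p>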
<h4>Decode throughput (tok/s)</h4>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Context</th>
<th>llama.cpp HIP (AR)</th>
<th>Lucebox (DFlash)</th>
<th>Speedup</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>128</td>
<td>28.07</td>
<td>62.75</td>
<td><strong>2.24x</strong></td>
<td></td>
</tr>
<tr>
<td>4K</td>
<td>27.77</td>
<td><strong>86.37</strong></td>
<td><strong>3.11x</strong></td>
<td></td>
</tr>
<tr>
<td>16K</td>
<td>27.79</td>
<td><strong>76.87</strong></td>
<td><strong>2.77x</strong></td>
<td></td>
</tr>
<tr>
<td>64K</td>
<td>27.78¹</td>
<td><strong>78.33</strong></td>
<td><strong>2.82x</strong></td>
<td>¹ AR decode is independent of ctx</td>
</tr>
<tr>
<td>128K</td>
<td>27.78¹</td>
<td><strong>78.82</strong> (Q4 KV)</td>
<td><strong>2.84x</strong></td>
<td></td>
</tr>
<tr>
<td>192K</td>
<td>27.78¹</td>
<td><strong>81.09</strong> (Q4 KV)</td>
<td><strong>2.92x</strong></td>
<td></td>
</tr>
<tr>
<td>256K</td>
<td>27.78¹</td>
<td><strong>81.01</strong> (Q4 KV + Q4 draft)</td>
<td><strong>2.92x</strong></td>
<td></td>
</tr>
</tbody>
</table>
<blockquote>
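<p dir="auto">AR decode speed is independent of context length (KV cache size does not affect a single-token forward pass), so the 64K and longer rows, where AR could not run, all use the same baseline: the mean of the measured 4K and 16K values, 27.78 tok/s. DFlash decode delivers a steady ~2.8-3.1x speedup from 4K upward (2.24x at 128 tokens).<br />
At 256K, the Q4 draft plus a Q4 KV cache is needed to fit in 24 GiB of VRAM.</p>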
</blockquote>
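<p dir="auto">The decode speedup column can be reproduced the same way. This is a minimal sketch with hypothetical helper names: it applies the measured AR value where one exists and the shared 27.78 tok/s baseline for the rows where AR could not run.</p>

```python
# Decode throughput (tok/s) copied from the table above.
MEASURED_AR = {"128": 28.07, "4K": 27.77, "16K": 27.79}
DFLASH = {
    "128": 62.75, "4K": 86.37, "16K": 76.87, "64K": 78.33,
    "128K": 78.82, "192K": 81.09, "256K": 81.01,
}

def shared_baseline():
    """Mean of the measured 4K and 16K AR decode speeds."""
    return round((MEASURED_AR["4K"] + MEASURED_AR["16K"]) / 2, 2)

def decode_speedups():
    """Speedup per context; unmeasured AR rows fall back to the shared baseline."""
    base = shared_baseline()
    return {
        ctx: round(tps / MEASURED_AR.get(ctx, base), 2)
        for ctx, tps in DFLASH.items()
    }

if __name__ == "__main__":
    for ctx, ratio in decode_speedups().items():
        print(f"{ctx:>5}: {ratio:.2f}x")
```

<p dir="auto">From 4K upward the ratios fall between 2.77x and 3.11x, which is where the ~2.8–3.1x headline figure comes from; the 128-token row sits lower at 2.24x.</p>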
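<p dir="auto">PFlash prefill is on par with AR at short context (4K), and its advantage grows with context length (1.13x at 16K). DFlash decode holds a ~2.8–3.1x speedup from 4K upward (2.24x at 128 tokens).</p>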
<hr />
]]></description><link>https://lcz.me/post/2322</link><guid isPermaLink="true">https://lcz.me/post/2322</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 08:38:40 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 07:33:38 GMT]]></title><description><![CDATA[<p dir="auto">Add the rest when you can, and I'll pin the thread then.</p>
]]></description><link>https://lcz.me/post/2307</link><guid isPermaLink="true">https://lcz.me/post/2307</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 07:33:38 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 07:02:05 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Let me test it for a couple of days first. All I did was have opencode pull it, build it, and run it. Give me a few days to see how it holds up in real-world use.</p>
]]></description><link>https://lcz.me/post/2297</link><guid isPermaLink="true">https://lcz.me/post/2297</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 07:02:05 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash 7900XTX Qwen3.6-27B ~2.8–3.1x speedup (benchmark data) on Mon, 18 May 2026 06:58:31 GMT]]></title><description><![CDATA[<p dir="auto">Another top-quality post. What context length did you test DFlash at, and does it affect prefill speed? With a context this short, does PFlash have much practical value?</p>
]]></description><link>https://lcz.me/post/2294</link><guid isPermaLink="true">https://lcz.me/post/2294</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 06:58:31 GMT</pubDate></item></channel></rss>