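<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux)]]></title><description><![CDATA[<p dir="auto">Hold off on copying this for now, everyone. The Lucebox code still has a pitfall: DFlash is only actually started in CLI mode, and --daemon mode never enables it at all.<br />
I'll try patching it first and see how it goes, then come back and update this post. Sorry, everyone.</p>
<p dir="auto">Just got it working and ran it; the results are stunning. I'll polish it a bit more and push the code to GitHub shortly.<br />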
<img src="https://upload.lcz.me/uploads/38861827-832d-46c8-9164-4585fef8d340.jpeg" alt="a171da8d-568d-490b-b720-a14a35964a10-image.jpeg" class=" img-fluid img-markdown" /><br />
<img src="https://upload.lcz.me/uploads/6ac7cb7e-a2ed-4279-8a48-3ecb6443601c.jpeg" alt="4c46da2c-dbe8-492d-9e84-33235bc1b962-image.jpeg" class=" img-fluid img-markdown" /></p>
<h1>Lucebox DFlash + PFlash Build and Deployment Guide</h1>
<h2>1. Clone and Initialize Submodules</h2>
<pre><code class="language-bash">git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git submodule update --init --recursive
</code></pre>
<hr />
<h2>2. Build</h2>
<h3>2.1 System Dependencies</h3>
<pre><code class="language-bash"># CUDA (NVIDIA)
sudo apt install build-essential cmake git

# ROCm (AMD)
sudo bash dflash/scripts/setup_system.sh
</code></pre>
<h3>2.2 Build dflash (GPU kernels + test_dflash)</h3>
<pre><code class="language-bash">cd dflash

# CUDA (NVIDIA, e.g. RTX 4090 sm_89)
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --target test_dflash -j$(nproc)

# ROCm (AMD, e.g. 7900 XTX gfx1100)
# Optional: install the rocWMMA headers to enable Phase 2 FlashPrefill
git clone --depth 1 https://github.com/ROCm/rocWMMA.git /tmp/rocwmma
mkdir -p /tmp/rocm_include/include
cp -r /tmp/rocwmma/library/include/rocwmma /tmp/rocm_include/include/rocwmma

cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1100 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j$(nproc)
</code></pre>
<blockquote>
<p dir="auto"><code>DFLASH27B_HIP_SM80_EQUIV=ON</code> enables rocWMMA Phase 2 prefill. If you are not using rocWMMA, set it to OFF to use the q8 fallback.</p>
</blockquote>
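<p dir="auto">The two configure commands above differ only in a handful of flags. As a rough sketch (the helper name <code>cmake_configure_args</code> is made up for illustration and is not part of Lucebox), the choice can be written out in Python using only the flags shown in this section:</p>

```python
def cmake_configure_args(backend, arch, sm80_equiv=True):
    """Return the dflash configure command as an argv list.

    backend: "cuda" or "hip"; arch: e.g. "89" or "gfx1100".
    Only flags that appear in the guide above are used.
    """
    args = ["cmake", "-B", "build", "-S", ".", "-DCMAKE_BUILD_TYPE=Release"]
    if backend == "cuda":
        args.append(f"-DCMAKE_CUDA_ARCHITECTURES={arch}")
    elif backend == "hip":
        args += [
            "-DDFLASH27B_GPU_BACKEND=hip",
            f"-DDFLASH27B_HIP_ARCHITECTURES={arch}",
            # ON enables rocWMMA Phase 2 prefill, OFF selects the q8 fallback
            f"-DDFLASH27B_HIP_SM80_EQUIV={'ON' if sm80_equiv else 'OFF'}",
        ]
    else:
        raise ValueError(f"unknown backend: {backend}")
    return args


if __name__ == "__main__":
    # 7900 XTX without rocWMMA installed
    print(" ".join(cmake_configure_args("hip", "gfx1100", sm80_equiv=False)))
```

<p dir="auto">The returned list can be passed to <code>subprocess.run</code> from inside the <code>dflash</code> directory.</p>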
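<h3>2.3 Build the llama.cpp Baseline (Optional)</h3>
<pre><code class="language-bash"># Run from the lucebox-hub repository root (the -S path below is relative to it)
BUILD_DIR=/tmp/llama-bench-build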
cmake -B $BUILD_DIR -S dflash/deps/llama.cpp \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON                    # NVIDIA
  # -DGGML_HIP=ON                   # AMD
cmake --build $BUILD_DIR --target llama-bench llama-server -j$(nproc)
</code></pre>
<h3>2.4 Install Python Dependencies (server.py)</h3>
<pre><code class="language-bash">pip install fastapi uvicorn transformers pydantic starlette
</code></pre>
<hr />
<h2>3. Download Model Files</h2>
<h3>3.1 Directory Layout</h3>
<pre><code>lucebox-hub/
├── dflash/
│   ├── models/
│   │   ├── Qwen3.6-27B-Q4_K_M.gguf     # target model (~16 GB)
│   │   ├── Qwen3-0.6B-BF16.gguf         # PFlash drafter (~1.2 GB)
│   │   └── draft/
│   │       └── dflash-draft-3.6-q8_0.gguf  # speculative-decoding draft model (~1.84 GB)
│   └── build/
│       └── test_dflash                   # GPU daemon binary
└── ...
</code></pre>
<h3>3.2 Download Commands</h3>
<pre><code class="language-bash">cd dflash
mkdir -p models/draft

# Option A: huggingface-cli
huggingface-cli download unsloth/Qwen3.6-27B-GGUF \
  Qwen3.6-27B-Q4_K_M.gguf --local-dir models/

huggingface-cli download Lucebox/Qwen3.6-27B-DFlash-GGUF \
  dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

huggingface-cli download unsloth/Qwen3-0.6B-GGUF \
  Qwen3-0.6B-BF16.gguf --local-dir models/

# Option B: wget
wget -c -O models/Qwen3.6-27B-Q4_K_M.gguf \
  "https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf"

wget -c -O models/draft/dflash-draft-3.6-q8_0.gguf \
  "https://huggingface.co/Lucebox/Qwen3.6-27B-DFlash-GGUF/resolve/main/dflash-draft-3.6-q8_0.gguf"

wget -c -O models/Qwen3-0.6B-BF16.gguf \
  "https://huggingface.co/unsloth/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-BF16.gguf"
</code></pre>
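<p dir="auto">Before launching, it can help to confirm that the three downloads ended up where section 3.1 expects them. A minimal sketch, assuming the layout above (the helper <code>missing_model_files</code> is hypothetical, not a Lucebox script):</p>

```python
from pathlib import Path

# Relative paths taken from the directory layout in section 3.1
EXPECTED_FILES = [
    "models/Qwen3.6-27B-Q4_K_M.gguf",
    "models/Qwen3-0.6B-BF16.gguf",
    "models/draft/dflash-draft-3.6-q8_0.gguf",
]


def missing_model_files(dflash_dir):
    """Return the expected model files that are absent under dflash_dir."""
    root = Path(dflash_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from lucebox-hub/dflash/
    missing = missing_model_files(".")
    print("all model files present" if not missing else f"missing: {missing}")
```

<p dir="auto">This only checks that the files exist; it does not verify size or checksum, so an interrupted download can still pass.</p>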
<hr />
<h2>4. Launch Commands (by Context Length)</h2>
<p dir="auto">Run all commands from the <code>lucebox-hub/dflash/</code> directory.</p>
<blockquote>
<p dir="auto"><strong><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/26a0.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--warning" style="height:23px;width:auto;vertical-align:middle" title="⚠" alt="⚠" /> Important: DFlash / PFlash cannot be launched directly with llama-server.</strong><br />
Integrating <code>llama-speculative-dflash.cpp</code> with <code>llama-server</code> is still a <strong>to-do item</strong> (see Contributing in the README) and has not been implemented.<br />
For now you must use <code>dflash/scripts/server.py</code>, which runs <code>test_dflash</code> internally as a daemon subprocess<br />
and exposes an OpenAI-compatible API (<code>/v1/chat/completions</code>) that is used the same way as llama-server.<br />
To connect Open WebUI / LM Studio / Cline, just set <code>OPENAI_API_BASE=http://localhost:8080/v1</code>.</p>
</blockquote>
<blockquote>
<p dir="auto"><strong>Model paths</strong>: the commands below assume the model files are under <code>dflash/models/</code> and the draft is under <code>dflash/models/draft/</code>. If your paths differ, change the <code>--target</code> / <code>--draft</code> / <code>--prefill-drafter</code> arguments.</p>
</blockquote>
<h3>4.1 Short Context (4K): q8_0 KV + Q8 draft, fastest decoding</h3>
<pre><code class="language-bash">python scripts/server.py \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft/dflash-draft-3.6-q8_0.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --max-ctx 8704 \
  --fa-window 2048 \
  --budget 8 \
  --host 0.0.0.0 --port 8080
</code></pre>
<ul>
<li>Plenty of VRAM; no PFlash compression needed</li>
<li><code>budget=8</code> is optimal for the 7900 XTX (high-bandwidth GDDR6)</li>
</ul>
<h3>4.2 Medium Context (16K–64K): tq3_0 KV + Q4 draft recommended</h3>
<pre><code class="language-bash">python scripts/server.py \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft/dflash-draft-3.6-q4_k_m.gguf \
  --cache-type-k tq3_0 --cache-type-v tq3_0 \
  --max-ctx 131072 \
  --fa-window 2048 \
  --budget 8 \
  --prefill-compression auto \
  --prefill-threshold 32000 \
  --prefill-drafter models/Qwen3-0.6B-BF16.gguf \
  --host 0.0.0.0 --port 8080
</code></pre>
<ul>
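<li>tq3_0 + Q4 draft reaches 75–79 tok/s in the 16K–64K range, the best balance of speed and VRAM</li>
<li>PFlash compresses long prompts to 5%; 64K prefill runs at ~733 tok/s</li>
<li>Note: section 3.2 downloads only the Q8 draft, so the <code>dflash-draft-3.6-q4_k_m.gguf</code> file used here has to be obtained separately and placed in <code>models/draft/</code></li>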
</ul>
<h3>4.3 Long Context (128K–192K): q4_0 + Q4 draft for speed</h3>
<pre><code class="language-bash">python scripts/server.py \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft/dflash-draft-3.6-q4_k_m.gguf \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --max-ctx 200000 \
  --fa-window 2048 \
  --budget 8 \
  --prefill-compression auto \
  --prefill-threshold 32000 \
  --prefill-drafter models/Qwen3-0.6B-BF16.gguf \
  --host 0.0.0.0 --port 8080
</code></pre>
<ul>
<li>Decode ~81 tok/s (fastest); the Q4 draft saves ~1 GiB of VRAM</li>
<li>At 192K, only q4_0 KV + Q4 draft fits in 24 GiB</li>
</ul>
<h3>4.4 Long Context (128K–192K): tq3_0 + Q8 draft for draft quality</h3>
<pre><code class="language-bash">python scripts/server.py \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft/dflash-draft-3.6-q8_0.gguf \
  --cache-type-k tq3_0 --cache-type-v tq3_0 \
  --max-ctx 200000 \
  --fa-window 2048 \
  --budget 8 \
  --prefill-compression auto \
  --prefill-threshold 32000 \
  --prefill-drafter models/Qwen3-0.6B-BF16.gguf \
  --host 0.0.0.0 --port 8080
</code></pre>
<ul>
<li>Decode ~72 tok/s while keeping Q8 draft quality (more accurate than the Q4 draft)</li>
<li>tq3_0 compression at 3.5 bpv frees ~1 GiB of VRAM for the Q8 draft</li>
</ul>
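<p dir="auto">To see why the KV cache type matters at long context, here is a back-of-the-envelope sketch comparing bits per value (bpv). The q8_0 and q4_0 figures follow ggml's 32-value block layout (34 and 18 bytes per block); 3.5 bpv for tq3_0 is the number quoted above. The layer and head counts are <strong>hypothetical placeholders</strong>, not the real Qwen3.6-27B shape, and the formula ignores <code>--fa-window</code> and any non-attention layers, so treat the output as relative rather than absolute.</p>

```python
# Bits per cached value for each KV cache type
BPV = {"q8_0": 8.5, "q4_0": 4.5, "tq3_0": 3.5}


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bpv):
    """Bytes needed to hold K and V for n_tokens cached positions."""
    values = 2 * n_tokens * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return values * bpv / 8


if __name__ == "__main__":
    # HYPOTHETICAL shape for illustration only, NOT Qwen3.6-27B's real config
    shape = dict(n_layers=16, n_kv_heads=4, head_dim=256)
    for name, bpv in BPV.items():
        gib = kv_cache_bytes(200_000, bpv=bpv, **shape) / 2**30
        print(f"{name}: {gib:.2f} GiB")
```

<p dir="auto">Whatever the real shape is, the ratio between cache types is fixed by the bpv values alone: tq3_0 needs 3.5/4.5 of the space of q4_0 and 3.5/8.5 of the space of q8_0.</p>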
<h3>4.5 Ultra-Long Context (256K): tq3_0 + Q8 draft recommended (the only option)</h3>
<pre><code class="language-bash">python scripts/server.py \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft/dflash-draft-3.6-q8_0.gguf \
  --cache-type-k tq3_0 --cache-type-v tq3_0 \
  --max-ctx 270000 \
  --fa-window 2048 \
  --budget 8 \
  --prefill-compression auto \
  --prefill-threshold 32000 \
  --prefill-drafter models/Qwen3-0.6B-BF16.gguf \
  --host 0.0.0.0 --port 8080
</code></pre>
<ul>
<li><strong>The only setup that keeps Q8 draft quality at 256K</strong></li>
<li>tq3_0 (3.5 bpv) saves ~1 GiB of VRAM, just enough to fit the Q8 draft</li>
<li>Decode ~72 tok/s, prefill ~730 tok/s</li>
</ul>
<h3>4.6 Ultra-Long Context (256K): maximum speed with q4_0 + Q4 draft</h3>
<pre><code class="language-bash">python scripts/server.py \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft/dflash-draft-3.6-q4_k_m.gguf \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --max-ctx 270000 \
  --fa-window 2048 \
  --budget 8 \
  --prefill-compression auto \
  --prefill-threshold 32000 \
  --prefill-drafter models/Qwen3-0.6B-BF16.gguf \
  --host 0.0.0.0 --port 8080
</code></pre>
<ul>
<li>Decode ~81 tok/s (fastest), but with the lowest draft quality</li>
<li>Barely fits in 24 GiB of VRAM</li>
</ul>
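<p dir="auto">The six commands above share most of their arguments and differ only in draft file, KV cache type, context size, and whether PFlash is enabled. A small sketch that assembles them (the function <code>server_command</code> is hypothetical; every flag comes from the commands in this section):</p>

```python
import shlex


def server_command(draft, kv_type, max_ctx, budget=8, pflash=True,
                   host="0.0.0.0", port=8080):
    """Build the scripts/server.py command line as an argv list."""
    cmd = ["python", "scripts/server.py",
           "--target", "models/Qwen3.6-27B-Q4_K_M.gguf",
           "--draft", draft,
           "--cache-type-k", kv_type, "--cache-type-v", kv_type,
           "--max-ctx", str(max_ctx),
           "--fa-window", "2048",
           "--budget", str(budget)]
    if pflash:
        # PFlash prompt compression, as used in sections 4.2 to 4.6
        cmd += ["--prefill-compression", "auto",
                "--prefill-threshold", "32000",
                "--prefill-drafter", "models/Qwen3-0.6B-BF16.gguf"]
    cmd += ["--host", host, "--port", str(port)]
    return cmd


if __name__ == "__main__":
    # Section 4.1: short context, no PFlash
    print(shlex.join(server_command(
        "models/draft/dflash-draft-3.6-q8_0.gguf", "q8_0", 8704, pflash=False)))
```

<p dir="auto">Run the printed command from <code>lucebox-hub/dflash/</code>, or pass the list to <code>subprocess.run</code> with <code>cwd</code> set to that directory.</p>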
<hr />
<h2>5. Quick Selection Guide</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Scenario</th>
<th>KV type</th>
<th>Draft</th>
<th>Decode (tok/s)</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Chat (≤4K)</td>
<td>q8_0</td>
<td>Q8</td>
<td><strong>86</strong></td>
<td>Fastest, lossless quality</td>
</tr>
<tr>
<td>Document analysis (16K–64K)</td>
<td>tq3_0</td>
<td>Q4</td>
<td><strong>75–79</strong></td>
<td>Best speed/VRAM balance</td>
</tr>
<tr>
<td>Code understanding (128K–192K)</td>
<td>q4_0</td>
<td>Q4</td>
<td><strong>81</strong></td>
<td>Maximum speed</td>
</tr>
<tr>
<td>Code understanding (128K–192K)</td>
<td>tq3_0</td>
<td>Q8</td>
<td><strong>72</strong></td>
<td>Draft quality first</td>
</tr>
<tr>
<td>Ultra-long context (256K)</td>
<td><strong>tq3_0</strong></td>
<td><strong>Q8</strong></td>
<td><strong>72</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /> Recommended; the only Q8 option</td>
</tr>
<tr>
<td>Ultra-long context (256K)</td>
<td>q4_0</td>
<td>Q4</td>
<td><strong>81</strong></td>
<td>Fastest, but with OOM risk</td>
</tr>
</tbody>
</table>
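<p dir="auto">The table can be read as a simple lookup on context length plus a speed-or-quality preference. A sketch of that lookup follows; the function name <code>pick_profile</code> and the exact cut-offs are my own assumptions, since the table leaves the 4K–16K and 64K–128K ranges unspecified.</p>

```python
def pick_profile(ctx_tokens, prefer="speed"):
    """Return (kv_type, draft) for a context size, following the table rows.

    Cut-offs at 4K and 64K tokens are assumptions: the table does not
    say what to use between its listed ranges.
    """
    if prefer not in ("speed", "quality"):
        raise ValueError("prefer must be 'speed' or 'quality'")
    if ctx_tokens > 64 * 1024:
        # 128K-192K and 256K rows: q4_0 + Q4 for speed, tq3_0 + Q8 for quality
        return ("q4_0", "Q4") if prefer == "speed" else ("tq3_0", "Q8")
    if ctx_tokens > 4 * 1024:
        # Document-analysis row
        return ("tq3_0", "Q4")
    # Chat row
    return ("q8_0", "Q8")


if __name__ == "__main__":
    for ctx in (4096, 32768, 196608, 262144):
        print(ctx, pick_profile(ctx), pick_profile(ctx, prefer="quality"))
```

<p dir="auto">The returned pair maps onto <code>--cache-type-k</code> / <code>--cache-type-v</code> and the draft file passed to <code>--draft</code>.</p>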
<hr />
<h2>6. Connecting Clients</h2>
<p dir="auto">Once started, the server exposes an OpenAI-compatible API and works with any compatible client:</p>
<pre><code class="language-bash"># Smoke test
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"luce-dflash","messages":[{"role":"user","content":"Hello"}],"stream":true}'
</code></pre>
<p dir="auto"><strong>Open WebUI / LM Studio / Cline settings:</strong></p>
<ul>
<li>API Base: <code>http://localhost:8080/v1</code></li>
<li>API Key: <code>sk-any</code> (any value works)</li>
<li>Model: <code>luce-dflash</code></li>
</ul>
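<p dir="auto">For scripting, the same request can be sent from Python with only the standard library. This is a sketch under two assumptions: that the server also accepts <code>"stream": false</code>, and that the reply uses the standard OpenAI <code>choices[0].message.content</code> layout. The helper names are made up for illustration.</p>

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1",
                       model="luce-dflash", api_key="sk-any"):
    """Build a POST request for the /chat/completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # assumption: non-streaming replies are supported
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    # Assumes the standard OpenAI response layout
    return reply["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Only prints the request; call chat("Hello") once the server is running
    req = build_chat_request("Hello")
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

<p dir="auto">With the server from section 4 running, <code>print(chat("Hello"))</code> should return the model's answer as plain text.</p>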
<hr />
<h2>7. Common Environment Variables</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Variable</th>
<th>Description</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>DFLASH27B_DRAFT_SWA</code></td>
<td>Draft sliding-window size</td>
<td>2048</td>
</tr>
<tr>
<td><code>DFLASH27B_PREFILL_UBATCH</code></td>
<td>PFlash prefill micro-batch size</td>
<td>512</td>
</tr>
<tr>
<td><code>DFLASH_BIN</code></td>
<td>Path to the test_dflash binary</td>
<td><code>build/test_dflash</code></td>
</tr>
<tr>
<td><code>DFLASH_TARGET</code></td>
<td>Target model path</td>
<td><code>models/Qwen3.6-27B-Q4_K_M.gguf</code></td>
</tr>
<tr>
<td><code>DFLASH_DRAFT</code></td>
<td>Draft model path</td>
<td><code>models/draft/</code></td>
</tr>
</tbody>
</table>
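<p dir="auto">When wrapping the tools in your own script, the table above translates into a lookup with fallbacks. A minimal sketch (the <code>resolve_settings</code> helper is hypothetical; the defaults are copied from the table, and how the Lucebox scripts read these variables internally is not shown here):</p>

```python
import os

# Defaults as listed in the table above
DEFAULTS = {
    "DFLASH27B_DRAFT_SWA": "2048",
    "DFLASH27B_PREFILL_UBATCH": "512",
    "DFLASH_BIN": "build/test_dflash",
    "DFLASH_TARGET": "models/Qwen3.6-27B-Q4_K_M.gguf",
    "DFLASH_DRAFT": "models/draft/",
}


def resolve_settings(environ=None):
    """Return each variable's value, falling back to the table default."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(f"{name}={value}")
```

<p dir="auto">Passing a plain dict instead of <code>os.environ</code> makes it easy to preview the effect of an override without exporting anything.</p>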
]]></description><link>https://lcz.me/topic/202/lucebox-dflash-pflash-编译与部署指南-qwen3.6-27b-方便抄作业-linux</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 06:05:02 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/202.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 18 May 2026 13:09:23 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Wed, 20 May 2026 05:21:33 GMT]]></title><description><![CDATA[<p dir="auto">I tried the draft from the bee branch. In a coding scenario with thinking enabled and multiple tool calls, the draft's hit rate and coverage were practically useless; you are better off leaving it disabled.</p>
]]></description><link>https://lcz.me/post/2718</link><guid isPermaLink="true">https://lcz.me/post/2718</guid><dc:creator><![CDATA[blackjack]]></dc:creator><pubDate>Wed, 20 May 2026 05:21:33 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Wed, 20 May 2026 03:19:38 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/david-zhang" aria-label="Profile: David-Zhang">@<bdi>David-Zhang</bdi></a> That's not what I meant. Higher draft quality should only affect the prediction hit rate; final accuracy still depends on the main model and the main model's KV cache.</p>
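<p dir="auto">A toy illustration of this point, not Lucebox's implementation: under greedy verification, every emitted token is the target model's own choice, so a worse draft lowers the acceptance rate but cannot change the output. The "models" below are made-up deterministic functions, and the sketch omits the bonus token that real speculative decoding emits after a fully accepted draft.</p>

```python
def target_next(prefix):
    """Toy deterministic target model: next token depends on prefix length."""
    return (len(prefix) * 7 + 3) % 10


def good_draft(prefix):
    """Draft that always agrees with the target."""
    return target_next(prefix)


def bad_draft(prefix):
    """Draft that always proposes token 0, so it is usually wrong."""
    return 0


def speculative_decode(draft_next, n_tokens, budget=4):
    """Greedy speculative decoding; returns (tokens, acceptance rate)."""
    out, accepted, proposed = [], 0, 0
    while n_tokens > len(out):
        # Draft proposes `budget` tokens ahead of the verified output
        guess = []
        for _ in range(budget):
            guess.append(draft_next(out + guess))
        # Target verifies each proposal in order
        for tok in guess:
            if len(out) == n_tokens:
                break
            proposed += 1
            correct = target_next(out)
            out.append(correct)  # the emitted token is always the target's
            if tok != correct:
                break            # reject the rest of this draft
            accepted += 1
    return out, accepted / proposed


if __name__ == "__main__":
    for draft in (good_draft, bad_draft):
        tokens, rate = speculative_decode(draft, 20)
        print(draft.__name__, f"acceptance={rate:.2f}", tokens)
```

<p dir="auto">Both drafts print the same token sequence; only the acceptance rate, and therefore the speedup, differs.</p>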
]]></description><link>https://lcz.me/post/2703</link><guid isPermaLink="true">https://lcz.me/post/2703</guid><dc:creator><![CDATA[stakira]]></dc:creator><pubDate>Wed, 20 May 2026 03:19:38 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Tue, 19 May 2026 03:12:42 GMT]]></title><description><![CDATA[<p dir="auto">This helped a lot. I got it deployed right away and it is more than 2x faster. Thanks.</p>
]]></description><link>https://lcz.me/post/2519</link><guid isPermaLink="true">https://lcz.me/post/2519</guid><dc:creator><![CDATA[You Be with]]></dc:creator><pubDate>Tue, 19 May 2026 03:12:42 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Tue, 19 May 2026 03:01:23 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%BC%A0%E9%91%AB%E7%A3%8A" aria-label="Profile: 张鑫磊">@<bdi>张鑫磊</bdi></a> <a href="/post/2517">said</a>:</p>
<p dir="auto">rocm HIP SDK</p>
</blockquote>
<p dir="auto">Have opencode build it for you:<br />
<a href="https://github.com/ROCm/HIP" rel="nofollow ugc">https://github.com/ROCm/HIP</a></p>
]]></description><link>https://lcz.me/post/2518</link><guid isPermaLink="true">https://lcz.me/post/2518</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Tue, 19 May 2026 03:01:23 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Tue, 19 May 2026 02:56:45 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/david-zhang" aria-label="Profile: david-zhang">@<bdi>david-zhang</bdi></a> Where can I download version 7.2.3 of the ROCm HIP SDK for Windows? I really can't find it.<img src="https://upload.lcz.me/uploads/ea035218-8a35-43b8-aa48-7c65d294443a.jpeg" alt="a3729306-d390-4f11-aee1-1165959991a7-image.jpeg" class=" img-fluid img-markdown" /></p>
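]]></description><link>https://lcz.me/post/2517</link><guid isPermaLink="true">https://lcz.me/post/2517</guid><dc:creator><![CDATA[张鑫磊]]></dc:creator><pubDate>Tue, 19 May 2026 02:56:45 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Tue, 19 May 2026 01:06:26 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/chang-ching-chun" aria-label="Profile: Chang-Ching-Chun">@<bdi>Chang-Ching-Chun</bdi></a> On whether DFlash and MTP can be combined: they are indeed acceleration schemes built on different ideas. DFlash uses speculative decoding to reduce the number of serial generation steps, while MTP (Multi-Token Prediction) predicts several tokens at once. In principle they do not exclude each other, but in Lucebox's current implementation they are mutually exclusive, so combining them has to wait for the code to be integrated.</p>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/mraksugar" aria-label="Profile: mraksugar">@<bdi>mraksugar</bdi></a> On the Hermes crash: I suggest checking the batch parameter settings on the API port. If you connect Hermes through Open WebUI's compatible API, make sure the response format is standard OpenAI-compatible. Some defaults on Lucebox's API side differ from what Hermes expects, such as the max_tokens limit and stop-token handling. You could try adding <code>--api-server --api-host 0.0.0.0 --api-port 8081</code> to the Lucebox launch arguments and then pointing Hermes's provider configuration at that address.</p>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/stakira" aria-label="Profile: stakira">@<bdi>stakira</bdi></a> The draft quality modes can be understood like this: "draft quality first" suits scenarios that prioritize output quality, giving higher final generation quality but a limited speedup; "final quality first" suits scenarios that need high throughput, trading a little draft quality for a larger speedup. For Qwen3.6-27B, in my tests the "final quality first" mode improves decode speed by 20-30% on a 3090, with a very small difference in output quality.</p>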
]]></description><link>https://lcz.me/post/2509</link><guid isPermaLink="true">https://lcz.me/post/2509</guid><dc:creator><![CDATA[Xiaote]]></dc:creator><pubDate>Tue, 19 May 2026 01:06:26 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 15:49:49 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/fanwen1974" aria-label="Profile: fanwen1974">@<bdi>fanwen1974</bdi></a> PR #119 has already been merged.</p>
]]></description><link>https://lcz.me/post/2469</link><guid isPermaLink="true">https://lcz.me/post/2469</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 15:49:49 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 15:45:14 GMT]]></title><description><![CDATA[<p dir="auto">The OP's ROCm build method is slightly off. Here is the one from the official blog, reposted below:</p>
<pre><code class="language-bash"># 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 &amp;&amp; git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22
</code></pre>
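<p dir="auto">Change <code>gfx1151</code> to match your own GPU:</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Architecture</th>
<th>GPU</th>
</tr>
</thead>
<tbody>
<tr>
<td>gfx1100</td>
<td>7900 XTX</td>
</tr>
<tr>
<td>gfx1151</td>
<td>Strix Halo iGPU</td>
</tr>
<tr>
<td>gfx1201</td>
<td>R9700</td>
</tr>
</tbody>
</table>
<p dir="auto">As for <code>budget</code>: use 8 on the 7900, and 22 on the AMD Strix Halo (AI MAX 395+) and the R9700.<br />
In my test the R9700 reaches 55-63 t/s.</p>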
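<p dir="auto">The architecture and budget values mentioned in this reply can be kept in one small lookup so a launch script picks both together. A sketch only: the dictionary and the <code>settings_for</code> helper are made up for illustration, and the values are just the ones reported in this thread.</p>

```python
# Values as reported in this thread; other GPUs are not covered
GPU_SETTINGS = {
    "7900 XTX": {"arch": "gfx1100", "budget": 8},
    "Strix Halo": {"arch": "gfx1151", "budget": 22},
    "R9700": {"arch": "gfx1201", "budget": 22},
}


def settings_for(gpu_name):
    """Return the HIP architecture and --budget value for a known GPU."""
    try:
        return GPU_SETTINGS[gpu_name]
    except KeyError:
        known = ", ".join(GPU_SETTINGS)
        raise ValueError(f"no settings for {gpu_name!r}; known GPUs: {known}")


if __name__ == "__main__":
    cfg = settings_for("R9700")
    print(f"-DDFLASH27B_HIP_ARCHITECTURES={cfg['arch']}  --budget {cfg['budget']}")
```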
<p dir="auto">run.sh</p>
<pre><code class="language-bash">#!/bin/sh
python scripts/server.py \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft/dflash-draft-3.6-q8_0.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --max-ctx 8704 \
  --fa-window 2048 \
  --budget 22 \
  --host 0.0.0.0 --port 1234
</code></pre>
<p dir="auto"><img src="https://upload.lcz.me/uploads/a2caa5cd-e639-4311-8c4f-c03141a5847f.jpeg" alt="4f238f6f-443f-4cb4-a425-2ff5a37fbf7e-image.jpeg" class=" img-fluid img-markdown" /></p>
]]></description><link>https://lcz.me/post/2468</link><guid isPermaLink="true">https://lcz.me/post/2468</guid><dc:creator><![CDATA[fanwen1974]]></dc:creator><pubDate>Mon, 18 May 2026 15:45:14 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 15:15:27 GMT]]></title><description><![CDATA[<p dir="auto">I'll copy this later. Please post more screenshots, everyone; ideally it will be pure copy-paste by the time I get to it.</p>
]]></description><link>https://lcz.me/post/2450</link><guid isPermaLink="true">https://lcz.me/post/2450</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 15:15:27 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 14:52:30 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/stakira" aria-label="Profile: stakira">@<bdi>stakira</bdi></a> As far as model quantization goes, q8 is the best. With limited VRAM, model quantization, ctx, and KV cache type form an impossible triangle; the only solution is your wallet.</p>
]]></description><link>https://lcz.me/post/2421</link><guid isPermaLink="true">https://lcz.me/post/2421</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 14:52:30 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 14:47:58 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/chang-ching-chun" aria-label="Profile: Chang-Ching-Chun">@<bdi>Chang-Ching-Chun</bdi></a> It is feasible in theory, but it depends on the actual code implementation. Let's wait for the experts to work on it; ddtree is still to come, so there is plenty more to watch.</p>
]]></description><link>https://lcz.me/post/2419</link><guid isPermaLink="true">https://lcz.me/post/2419</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 14:47:58 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 14:18:49 GMT]]></title><description><![CDATA[<p dir="auto">What is the point of prioritizing draft quality? Prioritizing final quality is what matters, for example KV q_8 + drafter q_4.</p>
]]></description><link>https://lcz.me/post/2413</link><guid isPermaLink="true">https://lcz.me/post/2413</guid><dc:creator><![CDATA[stakira]]></dc:creator><pubDate>Mon, 18 May 2026 14:18:49 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 13:54:37 GMT]]></title><description><![CDATA[<p dir="auto">Thanks for generously sharing. The DFlash concept is cool and much like the Pyramid algorithm; it makes more effective use of the GPU!<br />
One more question: DFlash and MTP can't be used together, right? They seem mutually exclusive.</p>
]]></description><link>https://lcz.me/post/2406</link><guid isPermaLink="true">https://lcz.me/post/2406</guid><dc:creator><![CDATA[Chang Ching-Chun]]></dc:creator><pubDate>Mon, 18 May 2026 13:54:37 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 13:51:44 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/mraksugar" aria-label="Profile: mraksugar">@<bdi>mraksugar</bdi></a> Thanks for the feedback. I plan to try it in the next few days.</p>
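]]></description><link>https://lcz.me/post/2404</link><guid isPermaLink="true">https://lcz.me/post/2404</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 13:51:44 GMT</pubDate></item><item><title><![CDATA[Reply to Lucebox DFlash + PFlash Build and Deployment Guide for Qwen3.6-27B, Copy-Paste Friendly (Linux) on Mon, 18 May 2026 13:37:34 GMT]]></title><description><![CDATA[<p dir="auto">This project works quite well for me with Open WebUI on a 3090.<br />
After the recent fixes for several issues it no longer crashes outright when called from Hermes, but it is still unstable and needs more observation; this is also on the 3090.<br />
Also, some of the latest scripts from the official site don't run. The noonghunna/qwen36-27b-single-3090 setup I ended up using is much more stable than this.</p>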
]]></description><link>https://lcz.me/post/2400</link><guid isPermaLink="true">https://lcz.me/post/2400</guid><dc:creator><![CDATA[mraksugar]]></dc:creator><pubDate>Mon, 18 May 2026 13:37:34 GMT</pubDate></item></channel></rss>