<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Dual AI PRO R9700 32 GB: Qwen 3.6 27B FP8 with 256K context deployed successfully on SGLang]]></title><description><![CDATA[<p dir="auto">Compiled with Hermes + DeepSeek; not guaranteed to be entirely correct.</p>
<h2>Hardware</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Component</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU</td>
<td>Intel Xeon E5-2686 v4</td>
</tr>
<tr>
<td>Motherboard</td>
<td>X99 (<strong>no direct GPU P2P</strong>)</td>
</tr>
<tr>
<td>GPU</td>
<td>2× AMD Radeon AI PRO R9700 (RDNA4 / gfx1201)</td>
</tr>
<tr>
<td>VRAM</td>
<td>32 GB GDDR6 per card</td>
</tr>
<tr>
<td>System RAM</td>
<td>62 GB</td>
</tr>
<tr>
<td>OS</td>
<td>Ubuntu 24.04</td>
</tr>
<tr>
<td>ROCm</td>
<td>7.2.4</td>
</tr>
<tr>
<td>Python</td>
<td>3.12 (conda: sglang-triton36)</td>
</tr>
</tbody>
</table>
<h2>Model Information</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Model</td>
<td>Qwen3.6-27B-FP8 (local HuggingFace cache)</td>
</tr>
<tr>
<td>Architecture</td>
<td>Qwen3.5 (GDN + hybrid attention), 64 layers</td>
</tr>
<tr>
<td>Quantization</td>
<td>Native FP8 (e4m3, dynamic activation scheme)</td>
</tr>
<tr>
<td>Model size</td>
<td>29 GB (on disk)</td>
</tr>
<tr>
<td>Hidden size</td>
<td>5120</td>
</tr>
<tr>
<td>Attention heads</td>
<td>24 (Q) / 4 (KV), head_dim=256</td>
</tr>
<tr>
<td>Vocabulary size</td>
<td>248,320</td>
</tr>
<tr>
<td>Max context</td>
<td>262,144 (256K)</td>
</tr>
</tbody>
</table>
<h2>Current Configuration (2026-06-15)</h2>
<h3>Launch Command</h3>
<pre><code class="language-bash"># Activate the environment
export PATH="/home/XXX/miniforge3/bin:/home/XXX/.cargo/bin:$PATH"
source /home/XXX/miniforge3/etc/profile.d/conda.sh
conda activate sglang-triton36

# Start the server
python -m sglang.launch_server \
  --model-path /home/XXX/models-hf/Qwen3.6-27B-FP8 \
  --tp-size 2 \
  --mem-fraction-static 0.75 \
  --context-length 262144 \
  --attention-backend triton \
  --fp8-gemm-backend triton \
  --trust-remote-code \
  --port 23334 \
  --host 0.0.0.0 \
  --disable-custom-all-reduce \
  --cuda-graph-max-bs 1 \
  --cuda-graph-bs 1 \
  --max-running-requests 1 \
  --num-continuous-decode-steps 8 \
  --max-mamba-cache-size 8 \
  --chunked-prefill-size 8192 \
  --disable-overlap-schedule \
  --tool-call-parser qwen3_coder \
  --chat-template /home/XXX/models-hf/Qwen3.6-27B-FP8/chat_template_nothink.jinja \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 2 \
  --speculative-eagle-topk 1 \
  --allow-auto-truncate \
  --log-requests
</code></pre>
<h3>Key Parameters</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>--tp-size 2</code></td>
<td>Tensor parallelism across both GPUs</td>
<td>The KV cache for 256K context does not fit on a single card</td>
</tr>
<tr>
<td><code>--mem-fraction-static 0.75</code></td>
<td>75% of VRAM</td>
<td>The MTP draft graph needs extra memory; 0.80 causes OOM</td>
</tr>
<tr>
<td><code>--context-length 262144</code></td>
<td>256K context</td>
<td>The model's maximum context length</td>
</tr>
<tr>
<td><code>--fp8-gemm-backend triton</code></td>
<td>Triton GEMM</td>
<td>Best-performing FP8 matrix-multiply backend on RDNA4</td>
</tr>
<tr>
<td><code>--disable-custom-all-reduce</code></td>
<td>Disabled</td>
<td>X99 has no GPU P2P, so custom all-reduce must be disabled</td>
</tr>
<tr>
<td><code>--cuda-graph-max-bs 1</code></td>
<td>Graph for batch size 1</td>
<td>CUDA graph speeds up decoding</td>
</tr>
<tr>
<td><code>--disable-overlap-schedule</code></td>
<td>Overlap scheduling disabled</td>
<td>Incompatible with Mamba no_buffer</td>
</tr>
<tr>
<td><code>--tool-call-parser qwen3_coder</code></td>
<td>Qwen3 tool calling</td>
<td>Function calling for Hermes Agent</td>
</tr>
<tr>
<td><code>--chat-template nothink.jinja</code></td>
<td>Custom template</td>
<td>Turns off thinking (no reasoning_parser; the template drops the <code>&lt;think&gt;</code> tag)</td>
</tr>
<tr>
<td><code>--speculative-algorithm NEXTN</code></td>
<td>MTP (EAGLE)</td>
<td>Built-in MTP to speed up decoding</td>
</tr>
<tr>
<td><code>--speculative-num-steps 5</code></td>
<td>5 draft steps</td>
<td>Best value: accept_len 4.38, accept_rate 68%</td>
</tr>
<tr>
<td><code>--speculative-num-draft-tokens 2</code></td>
<td>2 tokens per step</td>
<td>Paired with steps=5</td>
</tr>
<tr>
<td><code>--speculative-eagle-topk 1</code></td>
<td>Single branch</td>
<td>topk=2 gives no extra gain and uses more VRAM</td>
</tr>
<tr>
<td><code>--allow-auto-truncate</code></td>
<td>Auto-truncate</td>
<td>Requests that exceed the KV cache (147K tokens) are truncated automatically instead of failing with HTTP 400</td>
</tr>
<tr>
<td><code>--max-running-requests 1</code></td>
<td>Single request</td>
<td>Optimized for a single user</td>
</tr>
<tr>
<td><code>--log-requests</code></td>
<td>Request logging</td>
<td>Debugging in production</td>
</tr>
</tbody>
</table>
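<p dir="auto">To make the 0.75 versus 0.80 trade-off concrete, here is a minimal sketch of the arithmetic. It assumes exactly 32768 MiB per card (the usable amount is lower in practice) and treats <code>--mem-fraction-static</code> as the share reserved for model weights plus the KV pool. The <code>static_budget_mib</code> helper is hypothetical and not part of SGLang.</p>
<pre><code class="language-bash"># Split per-GPU VRAM into the static pool (weights + KV cache) and the
# remainder left for CUDA graphs, activations and the driver.
# Assumption: 32768 MiB per card; real free memory is lower.
static_budget_mib() {
  total_mib="$1"
  fraction_pct="$2"   # --mem-fraction-static as an integer percentage
  static_mib=$((total_mib * fraction_pct / 100))
  echo "static=${static_mib} other=$((total_mib - static_mib))"
}

static_budget_mib 32768 75    # prints: static=24576 other=8192
static_budget_mib 32768 80    # prints: static=26214 other=6554
</code></pre>
<p dir="auto">Under these assumptions, moving from 0.75 to 0.80 shrinks the non-static remainder by roughly 1.6 GiB, which fits the observation that the draft graph no longer has room at 0.80.</p>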
<h3>VRAM Allocation (256K context + MTP)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:center">GPU</th>
<th style="text-align:center">Model Weights</th>
<th style="text-align:center">KV Pool</th>
<th style="text-align:center">Draft Graph</th>
<th style="text-align:center">CUDA Graph + Driver</th>
<th style="text-align:center">Headroom</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">GPU0</td>
<td style="text-align:center">~17 GB</td>
<td style="text-align:center">~8 GB</td>
<td style="text-align:center">~0.3 GB</td>
<td style="text-align:center">~5 GB</td>
<td style="text-align:center">~5.6 GB</td>
</tr>
<tr>
<td style="text-align:center">GPU1</td>
<td style="text-align:center">~17 GB</td>
<td style="text-align:center">~8 GB</td>
<td style="text-align:center">~0.3 GB</td>
<td style="text-align:center">~5 GB</td>
<td style="text-align:center">~5.6 GB</td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto"><code>max_total_num_tokens=146964</code>; with <code>allow_auto_truncate=True</code>, requests beyond this limit are truncated automatically.</p>
</blockquote>
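<p dir="auto">As a rough illustration of that limit, the sketch below checks whether a request fits the KV pool. The value 146964 is the <code>max_total_num_tokens</code> figure quoted above; the <code>fits_kv_pool</code> helper and the example token counts are illustrative assumptions, not SGLang functionality.</p>
<pre><code class="language-bash"># Check whether prompt tokens plus requested output tokens fit the KV pool.
# 146964 is max_total_num_tokens from the startup log quoted above.
KV_POOL_TOKENS=146964

fits_kv_pool() {
  prompt_tokens="$1"
  max_new_tokens="$2"
  needed=$((prompt_tokens + max_new_tokens))
  if [ "$needed" -le "$KV_POOL_TOKENS" ]; then
    echo "fits: ${needed} of ${KV_POOL_TOKENS} tokens"
  else
    echo "over by $((needed - KV_POOL_TOKENS)) tokens"
  fi
}

fits_kv_pool 100000 4096    # prints: fits: 104096 of 146964 tokens
fits_kv_pool 160000 4096    # prints: over by 17132 tokens
</code></pre>
<p dir="auto">Although the context length is configured as 262144, the second request is the kind that <code>--allow-auto-truncate</code> would shorten rather than reject.</p>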
<h3>Thinking Mode</h3>
<ul>
<li><code>reasoning_parser=None</code>: no reasoning parser is configured</li>
<li><code>chat_template_nothink.jinja</code>: a custom template whose add_generation_prompt branch does not emit the <code>&lt;think&gt;</code> tag</li>
<li>Handling of the <code>enable_thinking</code> variable has been removed entirely</li>
<li>Residual <code>&lt;think&gt;</code> tags left over from the model's training (output_ids beginning with token 248068) are part of the model's output content and do not affect reasoning_tokens, which is always 0</li>
<li>NOTE: after editing <code>chat_template_nothink.jinja</code>, <strong>do not restore the &lt;think&gt; tag</strong>. If template behavior changes after an SGLang upgrade, use the "Thinking Mode" notes in this document when debugging.</li>
</ul>
<h2>Performance Benchmarks</h2>
<h3>Decode Speed (MTP steps=5)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:center">Scenario</th>
<th style="text-align:center">Speed (MTP)</th>
<th style="text-align:center">Speed (no MTP)</th>
<th style="text-align:center">Gain</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Short context (&lt;1K)</td>
<td style="text-align:center">~34 tok/s</td>
<td style="text-align:center">~18 tok/s</td>
<td style="text-align:center">+89%</td>
</tr>
<tr>
<td style="text-align:center">Medium context (4K–32K)</td>
<td style="text-align:center">~30 tok/s</td>
<td style="text-align:center">~18 tok/s</td>
<td style="text-align:center">+67%</td>
</tr>
</tbody>
</table>
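<p dir="auto">The Gain column can be reproduced from the two speed columns. A minimal sketch in integer shell arithmetic follows; the <code>speedup_pct</code> helper name is made up for illustration.</p>
<pre><code class="language-bash"># Relative decode speedup of MTP over the baseline, rounded to a whole percent.
speedup_pct() {
  mtp_tps="$1"
  base_tps="$2"
  gain=$(( (mtp_tps - base_tps) * 100 ))
  # Adding half the divisor before dividing rounds to the nearest integer.
  echo $(( (gain + base_tps / 2) / base_tps ))
}

speedup_pct 34 18    # prints 89
speedup_pct 30 18    # prints 67
</code></pre>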
<h3>MTP Parameter Comparison Results</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:center">steps</th>
<th style="text-align:center">draft_tokens</th>
<th style="text-align:center">topk</th>
<th style="text-align:center">Speed</th>
<th style="text-align:center">accept_len</th>
<th style="text-align:center">accept_rate</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">No MTP</td>
<td style="text-align:center">-</td>
<td style="text-align:center">-</td>
<td style="text-align:center">~18 tok/s</td>
<td style="text-align:center">-</td>
<td style="text-align:center">-</td>
</tr>
<tr>
<td style="text-align:center">1</td>
<td style="text-align:center">2</td>
<td style="text-align:center">1</td>
<td style="text-align:center">~24 tok/s</td>
<td style="text-align:center">1.88</td>
<td style="text-align:center">88%</td>
</tr>
<tr>
<td style="text-align:center">3</td>
<td style="text-align:center">2</td>
<td style="text-align:center">1</td>
<td style="text-align:center">~29 tok/s</td>
<td style="text-align:center">3.08</td>
<td style="text-align:center">69%</td>
</tr>
<tr>
<td style="text-align:center">4</td>
<td style="text-align:center">2</td>
<td style="text-align:center">1</td>
<td style="text-align:center">~30 tok/s</td>
<td style="text-align:center">3.38</td>
<td style="text-align:center">56%</td>
</tr>
<tr>
<td style="text-align:center"><strong>5</strong></td>
<td style="text-align:center"><strong>2</strong></td>
<td style="text-align:center"><strong>1</strong></td>
<td style="text-align:center"><strong>~34 tok/s</strong></td>
<td style="text-align:center"><strong>4.38</strong></td>
<td style="text-align:center"><strong>68%</strong></td>
</tr>
<tr>
<td style="text-align:center">6</td>
<td style="text-align:center">2</td>
<td style="text-align:center">1</td>
<td style="text-align:center">~33 tok/s</td>
<td style="text-align:center">4.62</td>
<td style="text-align:center">60%</td>
</tr>
<tr>
<td style="text-align:center">5</td>
<td style="text-align:center">4</td>
<td style="text-align:center">2</td>
<td style="text-align:center">~28 tok/s</td>
<td style="text-align:center">3.23</td>
<td style="text-align:center">72%</td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto"><strong>Best: steps=5, draft_tokens=2, topk=1</strong>. With steps of 2 or more, a draft graph is added during CUDA graph capture (about 0.3 GB of VRAM overhead).</p>
</blockquote>
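<p dir="auto">When repeating this sweep on other hardware, the fastest row can be picked mechanically. The sketch below hard-codes the rows from the table above as <code>steps,draft_tokens,topk,tok_per_s</code>; the CSV layout and the <code>best_mtp_config</code> name are assumptions made for illustration.</p>
<pre><code class="language-bash"># Rows copied from the comparison table: steps,draft_tokens,topk,tok_per_s
best_mtp_config() {
  printf '%s\n' \
    "1,2,1,24" \
    "3,2,1,29" \
    "4,2,1,30" \
    "5,2,1,34" \
    "6,2,1,33" \
    "5,4,2,28" |
    sort -t, -k4,4nr | head -n 1   # highest decode speed first
}

best_mtp_config    # prints 5,2,1,34
</code></pre>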
<h3>Prefill Speed (chunk_size=8K)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:center">Context</th>
<th style="text-align:center">Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">16K</td>
<td style="text-align:center">~473 tok/s</td>
</tr>
<tr>
<td style="text-align:center">64K</td>
<td style="text-align:center">~450 tok/s</td>
</tr>
<tr>
<td style="text-align:center">108K</td>
<td style="text-align:center">~407 tok/s</td>
</tr>
</tbody>
</table>
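<p dir="auto">These rates translate into noticeable waiting time for long prompts. A minimal sketch of the estimate, assuming "K" in the table means 1024 tokens and using a hypothetical <code>prefill_seconds</code> helper:</p>
<pre><code class="language-bash"># Estimate prefill time in whole seconds from prompt length and measured speed.
# Assumption: "K" in the table means 1024 tokens.
prefill_seconds() {
  tokens="$1"
  tok_per_s="$2"
  echo $(( (tokens + tok_per_s / 2) / tok_per_s ))
}

prefill_seconds 16384 473     # prints 35
prefill_seconds 65536 450     # prints 146
prefill_seconds 110592 407    # prints 272
</code></pre>
<p dir="auto">By this estimate a 108K prompt spends about four and a half minutes in prefill before the first output token appears.</p>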
<h2>Environment Setup Steps</h2>
<h3>1. Install Miniforge3</h3>
<pre><code class="language-bash">curl -L -o /tmp/miniforge.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh"
bash /tmp/miniforge.sh -b -p /home/gaopy/miniforge3
</code></pre>
<h3>2. Install Rust (required to build SGLang)</h3>
<pre><code class="language-bash">curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
</code></pre>
<h3>3. Clone and run the SGLang RDNA4 setup</h3>
<pre><code class="language-bash">git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference.git sglang-rdna4
cd sglang-rdna4
export PATH="/home/gaopy/miniforge3/bin:/home/gaopy/.cargo/bin:$PATH"
bash scripts/setup.sh
</code></pre>
<h3>4. Fix dependency conflicts</h3>
<pre><code class="language-bash">conda activate sglang-triton36
cp components/sglang/sgl-kernel/python/sgl_kernel/*.py $CONDA_PREFIX/lib/python3.12/site-packages/sgl_kernel/
pip install kernels==0.14.1
pip install --force-reinstall --no-deps transformers==5.8.0
bash scripts/build_awq_gemv.sh --env sglang-triton36
cd components/sglang
git checkout v0.5.12
for p in ../../patches/0*.patch; do git apply --3way "$p" 2&gt;/dev/null; done
pip install -e "python[all]" --no-build-isolation --no-deps
</code></pre>
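<p dir="auto">The patch loop above discards error output from <code>git apply</code>, so a patch that fails to apply goes unnoticed. A hedged variant that reports each result is sketched below. It uses the same <code>git apply --3way</code> call; the <code>apply_patches</code> function name and its messages are illustrative and not part of the referenced repository.</p>
<pre><code class="language-bash"># Apply every numbered patch in a directory and report which ones failed.
# The return value is the number of patches that did not apply.
apply_patches() {
  dir="$1"
  failed=0
  for p in "$dir"/0*.patch; do
    if git apply --3way "$p"; then
      echo "applied: $p"
    else
      echo "FAILED:  $p"
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}
</code></pre>
<p dir="auto">From <code>components/sglang</code> it would be called as <code>apply_patches ../../patches</code>; a non-zero exit status means at least one patch needs manual attention.</p>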
<h3>5. Download the model (FP8)</h3>
<pre><code class="language-bash">huggingface-cli download Qwen/Qwen3.6-27B-FP8 \
  --local-dir /home/gaopy/models-hf/Qwen3.6-27B-FP8 \
  --max-workers 4
</code></pre>
<h2>Service API</h2>
<pre><code class="language-bash"># Health check
curl http://localhost:23334/health

# Model list
curl http://localhost:23334/v1/models

# Inference (OpenAI-compatible)
curl http://localhost:23334/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/home/XXX/models-hf/Qwen3.6-27B-FP8",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100,
    "temperature": 0.7
  }'
</code></pre>
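<p dir="auto">Because the first start takes about 90 seconds, scripts that call the API right after launch may want to wait for the health endpoint first. The following is a minimal sketch, assuming the port from the launch command; the <code>wait_for_health</code> name, the 5-second polling interval and the 180-second default timeout are arbitrary choices for illustration.</p>
<pre><code class="language-bash"># Poll the /health endpoint until it answers 200 or the timeout expires.
# Assumptions: 5 s polling interval, 180 s default timeout.
wait_for_health() {
  url="$1"
  timeout_s="${2:-180}"
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -w prints only the HTTP status code; the response body is discarded.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "server ready after ${waited}s"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "server not ready after ${timeout_s}s"
  return 1
}
</code></pre>
<p dir="auto">A typical call would be <code>wait_for_health http://localhost:23334/health</code>, placed before the first request to <code>/v1/chat/completions</code>.</p>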
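<h2>Known Issues</h2>
<ol>
<li><strong>Residual thinking tags</strong>: a model training artifact; the <code>&lt;/think&gt;</code> tag appears at the start of the output as tokens <code>248068/248069</code> and does not affect content quality</li>
<li><strong>No GPU P2P on X99</strong>: multi-GPU setups must use <code>--disable-custom-all-reduce</code></li>
<li><strong>Slow first start</strong>: model loading, KV cache allocation, and CUDA graph plus draft graph capture take about 90 seconds</li>
<li><strong>FP8 GEMM warning</strong>: startup prints <code>Using default W8A8 Block FP8 kernel config</code>; performance may be below optimal until the community contributes a tuned config for the R9700</li>
<li><strong>Triton deprecation warning</strong>: <code>tl.where with non-boolean condition</code>; it currently does not affect operation</li>
</ol>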
<h2>Version History</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Date</th>
<th>Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>2026-06-15</td>
<td>MTP acceleration (+89%, ~34 tok/s); thinking disabled (no-think v2 template); added --allow-auto-truncate; 256K context</td>
</tr>
<tr>
<td>2026-06-14</td>
<td>Switched from AWQ to native FP8; added a no-think template to bypass reasoning_parser auto-detection</td>
</tr>
<tr>
<td>2026-06-13</td>
<td>Initial deployment: AWQ + 256K, later switched to FP8 because of OOM</td>
</tr>
</tbody>
</table>
]]></description><link>https://lcz.me/topic/569/双卡ai-pro-r9700-32g-qwen-3.6-27b-fp8-256k-sglang部署成功</link><generator>RSS for Node</generator><lastBuildDate>Wed, 01 Jul 2026 10:53:00 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/569.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 15 Jun 2026 07:39:09 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Dual AI PRO R9700 32 GB: Qwen 3.6 27B FP8 with 256K context deployed successfully on SGLang on Sun, 21 Jun 2026 22:20:41 GMT]]></title><description><![CDATA[<p dir="auto">After looking at all of this, I'll just stick with the cloud. These 24 GB cards of yours are all entry-level.</p>
]]></description><link>https://lcz.me/post/7751</link><guid isPermaLink="true">https://lcz.me/post/7751</guid><dc:creator><![CDATA[付贵]]></dc:creator><pubDate>Sun, 21 Jun 2026 22:20:41 GMT</pubDate></item><item><title><![CDATA[Reply to Dual AI PRO R9700 32 GB: Qwen 3.6 27B FP8 with 256K context deployed successfully on SGLang on Wed, 17 Jun 2026 15:00:22 GMT]]></title><description><![CDATA[<p dir="auto">I spent 15 hours yesterday trying and failing to get SGLang running on dual 7900 XTX cards, so I came to study this post.</p>
]]></description><link>https://lcz.me/post/7207</link><guid isPermaLink="true">https://lcz.me/post/7207</guid><dc:creator><![CDATA[abaalei]]></dc:creator><pubDate>Wed, 17 Jun 2026 15:00:22 GMT</pubDate></item><item><title><![CDATA[Reply to Dual AI PRO R9700 32 GB: Qwen 3.6 27B FP8 with 256K context deployed successfully on SGLang on Wed, 17 Jun 2026 14:23:03 GMT]]></title><description><![CDATA[<p dir="auto"><img src="https://upload.lcz.me/uploads/06a23807-3487-4a5d-973f-1c55d09e38ba.png" alt="IMG_1985.png" class=" img-fluid img-markdown" /><br />
In the end, MTP 3 turned out to be more stable.</p>
<p dir="auto"><img src="https://upload.lcz.me/uploads/7e76dd8e-0a84-432c-8f81-9ab25f2e50f8.png" alt="IMG_1983.png" class=" img-fluid img-markdown" /><br />
Previously, with MTP 5, long tasks would stall.</p>
]]></description><link>https://lcz.me/post/7206</link><guid isPermaLink="true">https://lcz.me/post/7206</guid><dc:creator><![CDATA[Brian]]></dc:creator><pubDate>Wed, 17 Jun 2026 14:23:03 GMT</pubDate></item><item><title><![CDATA[Reply to Dual AI PRO R9700 32 GB: Qwen 3.6 27B FP8 with 256K context deployed successfully on SGLang on Wed, 17 Jun 2026 14:19:54 GMT]]></title><description><![CDATA[<p dir="auto"><img src="https://upload.lcz.me/uploads/ce3e76b0-d239-4c75-b906-85ee140e050d.png" alt="IMG_1984.png" class=" img-fluid img-markdown" /><br />
I enabled Above 4G Decoding and Resizable BAR and it worked very well; both speed and day-to-day feel are much better.</p>
]]></description><link>https://lcz.me/post/7205</link><guid isPermaLink="true">https://lcz.me/post/7205</guid><dc:creator><![CDATA[Brian]]></dc:creator><pubDate>Wed, 17 Jun 2026 14:19:54 GMT</pubDate></item><item><title><![CDATA[Reply to Dual AI PRO R9700 32 GB: Qwen 3.6 27B FP8 with 256K context deployed successfully on SGLang on Mon, 15 Jun 2026 10:32:02 GMT]]></title><description><![CDATA[<p dir="auto">Very important data; a lot of people have been waiting to see this. SGLang is the key point. Please post screenshots, not just text.<br />
"No GPU P2P on X99: multi-GPU setups must use --disable-custom-all-reduce." Are you sure? The motherboard settings include options for this that can be enabled: Above 4G Decoding and Resizable BAR.</p>
]]></description><link>https://lcz.me/post/6925</link><guid isPermaLink="true">https://lcz.me/post/6925</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 15 Jun 2026 10:32:02 GMT</pubDate></item><item><title><![CDATA[Reply to Dual AI PRO R9700 32 GB: Qwen 3.6 27B FP8 with 256K context deployed successfully on SGLang on Mon, 15 Jun 2026 09:13:51 GMT]]></title><description><![CDATA[<p dir="auto">Finally, an expert is running two R9700s.</p>
]]></description><link>https://lcz.me/post/6915</link><guid isPermaLink="true">https://lcz.me/post/6915</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Mon, 15 Jun 2026 09:13:51 GMT</pubDate></item><item><title><![CDATA[Reply to Dual AI PRO R9700 32 GB: Qwen 3.6 27B FP8 with 256K context deployed successfully on SGLang on Mon, 15 Jun 2026 08:59:44 GMT]]></title><description><![CDATA[<p dir="auto">You still need to turn thinking on; without it the results are really poor.</p>
]]></description><link>https://lcz.me/post/6914</link><guid isPermaLink="true">https://lcz.me/post/6914</guid><dc:creator><![CDATA[Brian]]></dc:creator><pubDate>Mon, 15 Jun 2026 08:59:44 GMT</pubDate></item></channel></rss>