<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果]]></title><description><![CDATA[<h2>1. 说明</h2>
<p dir="auto"><img src="https://upload.lcz.me/uploads/4a26bf7c-57f2-4b00-9b88-f1910243aca1.jpeg" alt="180d054f-6091-4d0f-9a75-3ecb239b511f-image.jpeg" class=" img-fluid img-markdown" /><br />
双智铠100算力卡运行大模型的测试情况，当前已完整形成性能测试结果的模型为：</p>
<ul>
<li><code>Qwen3.6-35B-A3B-W4A8</code></li>
</ul>
<p dir="auto">并且opencode接入了该模型使用，非常快<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f602.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--joy" style="height:23px;width:auto;vertical-align:middle" title=":joy:" alt="😂" /></p>
<p dir="auto"><img src="https://upload.lcz.me/uploads/59cf2164-5a8f-47cc-87e8-205ea5ac3403.jpeg" alt="06c9b5b7-7255-48a1-b01d-75a88502d17f-image.jpeg" class=" img-fluid img-markdown" /></p>
<h2>2. 测试对象</h2>
<p dir="auto">硬件对象：双智铠100算力卡。</p>
<p dir="auto">推理框架：vLLM。</p>
<p dir="auto">接口协议：OpenAI Chat Completions API。</p>
<p dir="auto">主要测试接口：</p>
<pre><code class="language-shell">http://127.0.0.1:10030/v1/chat/completions
</code></pre>
<p dir="auto">主要测试模型：</p>
<pre><code class="language-shell">Qwen3.6-35B-A3B-W4A8
</code></pre>
<p dir="auto">模型路径：</p>
<pre><code class="language-shell">/data/model/Qwen3___6-35B-A3B-W4A8
</code></pre>
<h2>3. Qwen3.6-35B-A3B-W4A8 启动命令</h2>
<h3>3.1 日常交互启动命令</h3>
<p dir="auto">该配置适合低并发、普通上下文和长上下文测试。</p>
<pre><code class="language-shell">export VLLM_RPC_TIMEOUT=50000
export VLLM_ENFORCE_CUDA_GRAPH=1
export VLLM_W8A8_MOE_USE_W4A8=1
export VLLM_KV_DISABLE_CROSS_GROUP_SHARE=1

vllm serve /data/model/Qwen3___6-35B-A3B-W4A8 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --host 0.0.0.0 \
  --port 10030 \
  --gpu-memory-utilization 0.90 \
  --served-model-name Qwen3.6-35B-A3B-W4A8 \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "level": 0}' \
  --default-chat-template-kwargs '{"enable_thinking": false}'
</code></pre>
<h3>3.2 吞吐压测启动命令</h3>
<p dir="auto">该配置用于 6、8、12 并发测试，主要观察吞吐上限和过载边界。</p>
<pre><code class="language-shell">export VLLM_RPC_TIMEOUT=50000
export VLLM_ENFORCE_CUDA_GRAPH=1
export VLLM_W8A8_MOE_USE_W4A8=1
export VLLM_KV_DISABLE_CROSS_GROUP_SHARE=1

vllm serve /data/model/Qwen3___6-35B-A3B-W4A8 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --max-num-seqs 12 \
  --enable-chunked-prefill \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --host 0.0.0.0 \
  --port 10030 \
  --gpu-memory-utilization 0.90 \
  --served-model-name Qwen3.6-35B-A3B-W4A8 \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "level": 0}' \
  --default-chat-template-kwargs '{"enable_thinking": false}'
</code></pre>
<h2>4. 测试命令模板</h2>
<h3>4.1 单并发普通上下文测试</h3>
<pre><code class="language-shell">vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:10030 \
  --endpoint /v1/chat/completions \
  --model Qwen3.6-35B-A3B-W4A8 \
  --tokenizer /data/model/Qwen3___6-35B-A3B-W4A8 \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 20 \
  --request-rate inf \
  --max-concurrency 1 \
  --ignore-eos \
  --seed 123
</code></pre>
<h3>4.2 普通上下文多并发测试</h3>
<p dir="auto">将 <code>--max-concurrency</code> 分别设置为 <code>4</code>、<code>6</code>、<code>8</code>、<code>12</code>。</p>
<pre><code class="language-shell">vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:10030 \
  --endpoint /v1/chat/completions \
  --model Qwen3.6-35B-A3B-W4A8 \
  --tokenizer /data/model/Qwen3___6-35B-A3B-W4A8 \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 512 \
  --num-prompts 50 \
  --request-rate inf \
  --max-concurrency 8 \
  --ignore-eos \
  --seed 123
</code></pre>
<p dir="auto">说明：4 并发测试时，实际提供的测试请求数为 10；6、8、12 并发测试请求数为 50。</p>
<h3>4.3 长上下文测试</h3>
<pre><code class="language-shell">vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:10030 \
  --endpoint /v1/chat/completions \
  --model Qwen3.6-35B-A3B-W4A8 \
  --tokenizer /data/model/Qwen3___6-35B-A3B-W4A8 \
  --dataset-name random \
  --random-input-len 16384 \
  --random-output-len 512 \
  --num-prompts 20 \
  --request-rate inf \
  --max-concurrency 2 \
  --ignore-eos \
  --seed 123
</code></pre>
<h2>5. Qwen3.6-35B-A3B-W4A8 测试结果总表</h2>
<h3>表格 1：基础信息与吞吐量</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">测试场景</th>
<th style="text-align:right">输入/输出 tokens</th>
<th style="text-align:right">并发</th>
<th style="text-align:right">请求数</th>
<th style="text-align:right">成功数</th>
<th style="text-align:right">失败数</th>
<th style="text-align:right">总耗时</th>
<th style="text-align:right">输出吞吐 (tok/s)</th>
<th style="text-align:right">总吞吐 (tok/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">单并发普通上下文</td>
<td style="text-align:right">2048 / 512</td>
<td style="text-align:right">1</td>
<td style="text-align:right">20</td>
<td style="text-align:right">20</td>
<td style="text-align:right">0</td>
<td style="text-align:right">181.81s</td>
<td style="text-align:right">56.32</td>
<td style="text-align:right">281.61</td>
</tr>
<tr>
<td style="text-align:left">4 并发普通上下文</td>
<td style="text-align:right">4096 / 512</td>
<td style="text-align:right">4</td>
<td style="text-align:right">10</td>
<td style="text-align:right">10</td>
<td style="text-align:right">0</td>
<td style="text-align:right">44.94s</td>
<td style="text-align:right">113.93</td>
<td style="text-align:right">1025.39</td>
</tr>
<tr>
<td style="text-align:left">6 并发普通上下文</td>
<td style="text-align:right">4096 / 512</td>
<td style="text-align:right">6</td>
<td style="text-align:right">50</td>
<td style="text-align:right">50</td>
<td style="text-align:right">0</td>
<td style="text-align:right">172.87s</td>
<td style="text-align:right">148.09</td>
<td style="text-align:right">1332.81</td>
</tr>
<tr>
<td style="text-align:left">8 并发普通上下文</td>
<td style="text-align:right">4096 / 512</td>
<td style="text-align:right">8</td>
<td style="text-align:right">50</td>
<td style="text-align:right">50</td>
<td style="text-align:right">0</td>
<td style="text-align:right">149.76s</td>
<td style="text-align:right">170.94</td>
<td style="text-align:right">1538.48</td>
</tr>
<tr>
<td style="text-align:left">12 并发普通上下文</td>
<td style="text-align:right">4096 / 512</td>
<td style="text-align:right">12</td>
<td style="text-align:right">50</td>
<td style="text-align:right">50</td>
<td style="text-align:right">0</td>
<td style="text-align:right">236.90s</td>
<td style="text-align:right">108.06</td>
<td style="text-align:right">972.58</td>
</tr>
<tr>
<td style="text-align:left">长上下文</td>
<td style="text-align:right">16384 / 512</td>
<td style="text-align:right">2</td>
<td style="text-align:right">20</td>
<td style="text-align:right">20</td>
<td style="text-align:right">0</td>
<td style="text-align:right">192.28s</td>
<td style="text-align:right">53.26</td>
<td style="text-align:right">1757.45</td>
</tr>
</tbody>
</table>
<h3>表格 2：延迟指标（TTFT / TPOT / ITL）</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">测试场景</th>
<th style="text-align:right">平均 TTFT</th>
<th style="text-align:right">P99 TTFT</th>
<th style="text-align:right">平均 TPOT</th>
<th style="text-align:right">P99 TPOT</th>
<th style="text-align:right">P99 ITL</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">单并发普通上下文</td>
<td style="text-align:right">675.33ms</td>
<td style="text-align:right">684.19ms</td>
<td style="text-align:right">16.47ms</td>
<td style="text-align:right">16.59ms</td>
<td style="text-align:right">17.21ms</td>
</tr>
<tr>
<td style="text-align:left">4 并发普通上下文</td>
<td style="text-align:right">2539.73ms</td>
<td style="text-align:right">4174.28ms</td>
<td style="text-align:right">25.62ms</td>
<td style="text-align:right">28.45ms</td>
<td style="text-align:right">24.38ms</td>
</tr>
<tr>
<td style="text-align:left">6 并发普通上下文</td>
<td style="text-align:right">2812.72ms</td>
<td style="text-align:right">5848.28ms</td>
<td style="text-align:right">33.38ms</td>
<td style="text-align:right">36.07ms</td>
<td style="text-align:right">508.41ms</td>
</tr>
<tr>
<td style="text-align:left">8 并发普通上下文</td>
<td style="text-align:right">3110.26ms</td>
<td style="text-align:right">8321.04ms</td>
<td style="text-align:right">38.25ms</td>
<td style="text-align:right">41.46ms</td>
<td style="text-align:right">515.14ms</td>
</tr>
<tr>
<td style="text-align:left">12 并发普通上下文</td>
<td style="text-align:right">3593.71ms</td>
<td style="text-align:right">12122.58ms</td>
<td style="text-align:right">100.03ms</td>
<td style="text-align:right">106.45ms</td>
<td style="text-align:right">524.32ms</td>
</tr>
<tr>
<td style="text-align:left">长上下文</td>
<td style="text-align:right">6423.67ms</td>
<td style="text-align:right">8687.50ms</td>
<td style="text-align:right">25.04ms</td>
<td style="text-align:right">28.39ms</td>
<td style="text-align:right">22.67ms</td>
</tr>
</tbody>
</table>
<h2>6. 每用户体感输出速度</h2>
<p dir="auto">每用户体感输出速度按以下公式估算：</p>
<pre><code class="language-text">每用户输出速度 ≈ 1000 / 平均 TPOT(ms)
</code></pre>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>测试场景</th>
<th style="text-align:right">平均 TPOT</th>
<th style="text-align:right">估算每用户输出速度</th>
</tr>
</thead>
<tbody>
<tr>
<td>单并发普通上下文</td>
<td style="text-align:right">16.47ms</td>
<td style="text-align:right">约 60.72 tok/s</td>
</tr>
<tr>
<td>4 并发普通上下文</td>
<td style="text-align:right">25.62ms</td>
<td style="text-align:right">约 39.03 tok/s</td>
</tr>
<tr>
<td>6 并发普通上下文</td>
<td style="text-align:right">33.38ms</td>
<td style="text-align:right">约 29.96 tok/s</td>
</tr>
<tr>
<td>8 并发普通上下文</td>
<td style="text-align:right">38.25ms</td>
<td style="text-align:right">约 26.14 tok/s</td>
</tr>
<tr>
<td>12 并发普通上下文</td>
<td style="text-align:right">100.03ms</td>
<td style="text-align:right">约 10.00 tok/s</td>
</tr>
<tr>
<td>长上下文</td>
<td style="text-align:right">25.04ms</td>
<td style="text-align:right">约 39.94 tok/s</td>
</tr>
</tbody>
</table>
<h2>补充：</h2>
<h3>配置信息</h3>
<p dir="auto"><img src="https://upload.lcz.me/uploads/e9db259a-3840-46f6-ba15-264e0e3b556b.jpeg" alt="4a53387d-1f58-4c0d-9d7d-4a8485c27ee3-image.jpeg" class=" img-fluid img-markdown" /></p>
<h3>价格</h3>
<p dir="auto">公司订购的一台测试机子，工作站样式，外壳应该是铝的定制的；整机5w多。我看淘宝上同款推理卡mr-100一张1.5w左右</p>
]]></description><link>https://lcz.me/topic/684/国产替代-智铠100-32gx2部署qwen3.6-35b-w4a8含多并发测试结果</link><generator>RSS for Node</generator><lastBuildDate>Wed, 01 Jul 2026 09:31:38 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/684.rss" rel="self" type="application/rss+xml"/><pubDate>Wed, 24 Jun 2026 09:34:56 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Thu, 25 Jun 2026 03:30:44 GMT]]></title><description><![CDATA[<p dir="auto">反正后面都是要全力工作的了，待机功耗大无所谓拉</p>
]]></description><link>https://lcz.me/post/8209</link><guid isPermaLink="true">https://lcz.me/post/8209</guid><dc:creator><![CDATA[vosrock]]></dc:creator><pubDate>Thu, 25 Jun 2026 03:30:44 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Thu, 25 Jun 2026 01:46:13 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/ezios" aria-label="Profile: ezios">@<bdi>ezios</bdi></a></p>
<p dir="auto">估計驅動還沒有調校好吧, 藍綠紅三家的非魔改卡都有閒置降頻的設定, 降到200到300mhz</p>
<p dir="auto">我之前的4090D 48GB閒置都要50到60w左右...核心頻率不會降下來</p>
]]></description><link>https://lcz.me/post/8200</link><guid isPermaLink="true">https://lcz.me/post/8200</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Thu, 25 Jun 2026 01:46:13 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Thu, 25 Jun 2026 01:40:10 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/566656661" aria-label="Profile: 566656661">@<bdi>566656661</bdi></a> 这卡待机功耗也太高了，两张100w，在旁边闷热闷热的</p>
]]></description><link>https://lcz.me/post/8199</link><guid isPermaLink="true">https://lcz.me/post/8199</guid><dc:creator><![CDATA[ezios]]></dc:creator><pubDate>Thu, 25 Jun 2026 01:40:10 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Thu, 25 Jun 2026 01:28:46 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/ezios" aria-label="Profile: ezios">@<bdi>ezios</bdi></a></p>
<p dir="auto">因為數據中心的卡極度依賴機箱風扇和周邊的冷空氣</p>
<p dir="auto">那些風扇基本上轉速都上個5到6千轉了, 改家用的話基本上就要另外裝個渦輪</p>
]]></description><link>https://lcz.me/post/8197</link><guid isPermaLink="true">https://lcz.me/post/8197</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Thu, 25 Jun 2026 01:28:46 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Thu, 25 Jun 2026 01:15:42 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/566656661" aria-label="Profile: 566656661">@<bdi>566656661</bdi></a> 我这里是个台式机，推理卡也是改了涡轮散热，太神奇了 <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f633.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--flushed" style="height:23px;width:auto;vertical-align:middle" title=":flushed:" alt="😳" /></p>
]]></description><link>https://lcz.me/post/8195</link><guid isPermaLink="true">https://lcz.me/post/8195</guid><dc:creator><![CDATA[ezios]]></dc:creator><pubDate>Thu, 25 Jun 2026 01:15:42 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Thu, 25 Jun 2026 01:14:50 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> 已修改，拆分成两个表格，看着会舒服一些<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f60a.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--blush" style="height:23px;width:auto;vertical-align:middle" title=":blush:" alt="😊" /></p>
]]></description><link>https://lcz.me/post/8194</link><guid isPermaLink="true">https://lcz.me/post/8194</guid><dc:creator><![CDATA[ezios]]></dc:creator><pubDate>Thu, 25 Jun 2026 01:14:50 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Wed, 24 Jun 2026 23:15:08 GMT]]></title><description><![CDATA[<p dir="auto">被動散熱估計也是data center的卡, 類似6000D的東東</p>
<p dir="auto">先不說家用要改散熱, 有點懷疑一張卡的價格估計都要20到30K了</p>
<p dir="auto">不過多一個玩家總是好事, 期待能把價格打下來 <s>雖然以老黃的性格我覺得很難就是了</s></p>
]]></description><link>https://lcz.me/post/8188</link><guid isPermaLink="true">https://lcz.me/post/8188</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Wed, 24 Jun 2026 23:15:08 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Wed, 24 Jun 2026 21:53:40 GMT]]></title><description><![CDATA[<p dir="auto">牛逼, 国产显卡 开始支棱起来了.</p>
]]></description><link>https://lcz.me/post/8185</link><guid isPermaLink="true">https://lcz.me/post/8185</guid><dc:creator><![CDATA[mark]]></dc:creator><pubDate>Wed, 24 Jun 2026 21:53:40 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Wed, 24 Jun 2026 21:19:10 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/ezios" aria-label="Profile: ezios">@<bdi>ezios</bdi></a></p>
<p dir="auto">多分享, 期待国产尽快能顶上来.</p>
]]></description><link>https://lcz.me/post/8184</link><guid isPermaLink="true">https://lcz.me/post/8184</guid><dc:creator><![CDATA[Tony Wang]]></dc:creator><pubDate>Wed, 24 Jun 2026 21:19:10 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Wed, 24 Jun 2026 20:25:17 GMT]]></title><description><![CDATA[<p dir="auto">稀有内容，很牛逼，这玩意怎么也不提下价格。有个表哥太臃肿了，你给修改下，分成两个。</p>
]]></description><link>https://lcz.me/post/8183</link><guid isPermaLink="true">https://lcz.me/post/8183</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Wed, 24 Jun 2026 20:25:17 GMT</pubDate></item><item><title><![CDATA[Reply to 【国产替代】智铠100 32Gx2部署Qwen3.6-35B-W4A8含多并发测试结果 on Wed, 24 Jun 2026 09:41:41 GMT]]></title><description><![CDATA[<p dir="auto">这家伙跟arc一样，待机功耗奇高，ixsmi官方工具查看显示待机功耗达到了45-50w。我在旁边调试，快热死我了</p>
]]></description><link>https://lcz.me/post/8126</link><guid isPermaLink="true">https://lcz.me/post/8126</guid><dc:creator><![CDATA[ezios]]></dc:creator><pubDate>Wed, 24 Jun 2026 09:41:41 GMT</pubDate></item></channel></rss>