<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[M3U 96G: sharing some data for reference]]></title><description><![CDATA[<p dir="auto">GPT summarized this for me; I hope it helps everyone:<br />
Below is the organized version of this round of oMLX / Qwen3.6-27B-8bit test data, ready to share as-is.</p>
<h1>oMLX + Qwen3.6-27B-8bit Model Performance Test Report</h1>
<h2>1. Test Environment</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Item</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inference framework</td>
<td>oMLX</td>
</tr>
<tr>
<td>Device</td>
<td>Mac / Apple Silicon (M3U 96G, per the post title)</td>
</tr>
<tr>
<td>Main model</td>
<td>Qwen3.6-27B-8bit</td>
</tr>
<tr>
<td>DFlash Draft</td>
<td>z-lab/Qwen3.6-27B-DFlash</td>
</tr>
<tr>
<td>Primary use</td>
<td>Local model serving for Hermes / OpenClaw / agents</td>
</tr>
<tr>
<td>Test method</td>
<td>oMLX Benchmark + server.log inspection</td>
</tr>
<tr>
<td>Key metrics</td>
<td>tg TPS, TTFT, E2E, Peak Mem, DFlash acceptance</td>
</tr>
</tbody>
</table>
<hr />
<h2>2. Core Benchmark Comparison</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Mode</th>
<th>Test</th>
<th>TTFT</th>
<th>TPOT</th>
<th>Prompt processing speed</th>
<th>Generation speed (tg TPS)</th>
<th>E2E total time</th>
<th>Peak memory</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Standard BatchedEngine</td>
<td>pp1024 / tg128</td>
<td>3416.4 ms</td>
<td>43.84 ms</td>
<td>299.7 tok/s</td>
<td>23.0 tok/s</td>
<td>8.985 s</td>
<td>30.45 GB</td>
<td>Stable; suited to a resident agent</td>
</tr>
<tr>
<td>Standard BatchedEngine</td>
<td>pp4096 / tg128</td>
<td>12828.8 ms</td>
<td>44.66 ms</td>
<td>319.3 tok/s</td>
<td>22.6 tok/s</td>
<td>18.500 s</td>
<td>31.90 GB</td>
<td>Generation speed holds steady at longer context</td>
</tr>
<tr>
<td>DFlash enabled</td>
<td>pp1024 / tg128</td>
<td>3607.3 ms</td>
<td>9.21 ms</td>
<td>283.9 tok/s</td>
<td>109.4 tok/s</td>
<td>4.777 s</td>
<td>34.13 GB</td>
<td>Clear speedup at short context</td>
</tr>
<tr>
<td>DFlash enabled</td>
<td>pp4096 / tg128</td>
<td>15439.9 ms</td>
<td>44.67 ms</td>
<td>265.3 tok/s</td>
<td>22.6 tok/s</td>
<td>21.114 s</td>
<td>31.90 GB</td>
<td>Fallback triggered; reverts to the standard engine</td>
</tr>
</tbody>
</table>
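<p dir="auto">The columns above appear to be tied together by simple arithmetic. The sketch below re-derives prompt speed, tg TPS and E2E from TTFT and TPOT. The formulas, in particular counting 127 decode intervals for 128 generated tokens, are an assumption inferred from the numbers in this table rather than anything taken from oMLX documentation, and the names <code>Row</code> and <code>derive</code> are made up for illustration.</p>

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One benchmark row, with values copied from the table above."""
    label: str
    prompt_tokens: int
    gen_tokens: int
    ttft_ms: float
    tpot_ms: float


def derive(row: Row) -> dict:
    """Re-derive the reported columns (assumed formulas, not oMLX's own)."""
    # Assumption: the first token arrives at TTFT, so only
    # gen_tokens - 1 decode intervals follow it.
    decode_s = (row.gen_tokens - 1) * row.tpot_ms / 1000.0
    return {
        "pp_tps": row.prompt_tokens / (row.ttft_ms / 1000.0),
        "tg_tps": row.gen_tokens / decode_s,
        "e2e_s": row.ttft_ms / 1000.0 + decode_s,
    }


ROWS = [
    Row("standard pp1024", 1024, 128, 3416.4, 43.84),
    Row("standard pp4096", 4096, 128, 12828.8, 44.66),
    Row("DFlash pp1024", 1024, 128, 3607.3, 9.21),
    Row("DFlash pp4096", 4096, 128, 15439.9, 44.67),
]

if __name__ == "__main__":
    for row in ROWS:
        d = derive(row)
        print(f"{row.label}: pp={d['pp_tps']:.1f} tok/s, "
              f"tg={d['tg_tps']:.1f} tok/s, E2E={d['e2e_s']:.3f} s")
```

<p dir="auto">Under these assumptions the derived values agree with the table to within rounding, which suggests that tg TPS measures decode-only throughput and excludes the prefill time.</p>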
<hr />
<h2>3. DFlash Speedup</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Comparison</th>
<th>Standard mode</th>
<th>DFlash mode</th>
<th>Improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>pp1024 / tg128 generation speed</td>
<td>23.0 tok/s</td>
<td>109.4 tok/s</td>
<td>About 4.76x</td>
</tr>
<tr>
<td>pp1024 / tg128 E2E total time</td>
<td>8.985 s</td>
<td>4.777 s</td>
<td>About 46.8% less time</td>
</tr>
<tr>
<td>pp4096 / tg128 generation speed</td>
<td>22.6 tok/s</td>
<td>22.6 tok/s</td>
<td>No gain; already fell back</td>
</tr>
<tr>
<td>pp4096 / tg128 E2E total time</td>
<td>18.500 s</td>
<td>21.114 s</td>
<td>Slower (about 14.1% more time)</td>
</tr>
</tbody>
</table>
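<p dir="auto">For anyone who wants to recompute the ratios from the raw numbers, here is a minimal sketch. The helper names <code>speedup</code> and <code>time_change_pct</code> are made up for illustration; the inputs are copied from the tables above.</p>

```python
def speedup(baseline: float, candidate: float) -> float:
    """Throughput ratio: how many times faster the candidate is."""
    return candidate / baseline


def time_change_pct(baseline_s: float, candidate_s: float) -> float:
    """Percent change in elapsed time; negative means less time."""
    return (candidate_s - baseline_s) / baseline_s * 100.0


if __name__ == "__main__":
    # Values copied from the tables above.
    print(f"pp1024 tg speedup: {speedup(23.0, 109.4):.2f}x")
    print(f"pp1024 E2E change: {time_change_pct(8.985, 4.777):+.1f}%")
    print(f"pp4096 E2E change: {time_change_pct(18.500, 21.114):+.1f}%")
```

<p dir="auto">Note that a 4.76x faster decode only cuts E2E by about 46.8%, because prefill (TTFT) takes roughly the same time in both modes.</p>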
<hr />
<h2>4. DFlash Log Observations</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Item</th>
<th>Observation</th>
</tr>
</thead>
<tbody>
<tr>
<td>DFlash enabled successfully?</td>
<td>Yes</td>
</tr>
<tr>
<td>Draft model path</td>
<td>/Users/hao/.omlx/models/z-lab/Qwen3.6-27B-DFlash</td>
</tr>
<tr>
<td>DFlash max_ctx</td>
<td>4096</td>
</tr>
<tr>
<td>Short-context acceptance</td>
<td>About 85.9% - 93.8%</td>
</tr>
<tr>
<td>Short-context speed in logs</td>
<td>About 20.4 - 27.0 tok/s (measured differently from the Benchmark, so not directly comparable)</td>
</tr>
<tr>
<td>Long-context behavior</td>
<td>Automatically falls back to BatchedEngine once the context reaches max_ctx (logged as 4096 &gt;= 4096)</td>
</tr>
<tr>
<td>Conclusion</td>
<td>DFlash is very effective at short context; at long context it brings essentially no benefit</td>
</tr>
</tbody>
</table>
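<p dir="auto">The fallback behavior observed in the log can be modeled roughly as follows. This is only a sketch of the observed rule, not oMLX's actual code: the function <code>pick_engine</code> and its parameters are hypothetical, and whether the real check counts prompt tokens only or prompt plus generated tokens is not known from the log.</p>

```python
def pick_engine(context_tokens: int, dflash_enabled: bool,
                dflash_max_ctx: int = 4096) -> str:
    """Return which engine would serve the request (illustrative only)."""
    if not dflash_enabled:
        return "BatchedEngine"
    # Mirrors the logged condition "4096 >= 4096": reaching the limit
    # is already enough to fall back.
    if context_tokens >= dflash_max_ctx:
        return "BatchedEngine"
    return "DFlash"


if __name__ == "__main__":
    for n in (1024, 4095, 4096):
        print(n, pick_engine(n, dflash_enabled=True))
```

<p dir="auto">The comparison is inclusive, which is why the pp4096 benchmark row never benefits from DFlash even though its prompt is exactly max_ctx long.</p>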
<hr />
<h2>5. Continuous Batching Test (relevant for agents)</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Mode</th>
<th>Batch</th>
<th>tg TPS</th>
<th>Speedup</th>
<th>pp TPS</th>
<th>pp TPS/req</th>
<th>TTFT</th>
<th>E2E</th>
</tr>
</thead>
<tbody>
<tr>
<td>Standard BatchedEngine</td>
<td>1x</td>
<td>23.0 tok/s</td>
<td>1.00x</td>
<td>299.7 tok/s</td>
<td>299.7 tok/s</td>
<td>3416.4 ms</td>
<td>8.985 s</td>
</tr>
<tr>
<td>Standard BatchedEngine</td>
<td>2x</td>
<td>39.6 tok/s</td>
<td>1.72x</td>
<td>297.5 tok/s</td>
<td>148.8 tok/s</td>
<td>6757.2 ms</td>
<td>13.353 s</td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto">With two concurrent requests, aggregate generation throughput rises to 39.6 tok/s (1.72x the single-request rate).</p>
</blockquote>
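<p dir="auto">To see what the 2x row means per request, the aggregate numbers can be split as in the sketch below. The function <code>batch_stats</code> is made up for illustration, and dividing tg TPS evenly across requests is an assumption; the benchmark itself only reports the aggregate.</p>

```python
def batch_stats(batch: int, total_tg_tps: float, total_pp_tps: float,
                single_tg_tps: float) -> dict:
    """Split aggregate throughput across concurrent requests."""
    return {
        "speedup": total_tg_tps / single_tg_tps,
        # Assumption: decode throughput is shared evenly per request.
        "tg_tps_per_req": total_tg_tps / batch,
        "pp_tps_per_req": total_pp_tps / batch,
    }


if __name__ == "__main__":
    # 2x row from the table above; 23.0 tok/s is the 1x baseline.
    s = batch_stats(batch=2, total_tg_tps=39.6, total_pp_tps=297.5,
                    single_tg_tps=23.0)
    print(f"speedup {s['speedup']:.2f}x, "
          f"tg/req {s['tg_tps_per_req']:.1f} tok/s, "
          f"pp/req {s['pp_tps_per_req']:.2f} tok/s")
```

<p dir="auto">In other words, total throughput goes up, but each individual request gets slower (roughly 19.8 tok/s instead of 23.0 under the even-split assumption), and TTFT roughly doubles because prompt processing is shared.</p>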
<hr />
<h2>6. SpecPrefill / TurboQuant Tests</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Feature</th>
<th>Test result</th>
<th>Conclusion</th>
</tr>
</thead>
<tbody>
<tr>
<td>SpecPrefill</td>
<td>Draft model loaded successfully</td>
<td>Works, but not recommended as an always-on setting for now</td>
</tr>
<tr>
<td>SpecPrefill Draft</td>
<td>Huihui-Qwen3.5-9B-abliterated-mlx-4bit</td>
<td>A 9B draft is on the large side; not suited to long-running agents</td>
</tr>
<tr>
<td>TurboQuant 8bit KV</td>
<td>Enabled successfully</td>
<td>Saves KV cache memory; optional</td>
</tr>
<tr>
<td>TurboQuant 4bit KV</td>
<td>Enabled successfully, but with heavy memory-reclaim pressure</td>
<td>Not recommended as always-on</td>
</tr>
<tr>
<td>DFlash + SpecPrefill + TurboQuant</td>
<td>Runs, but memory pressure appears</td>
<td>Not suited to long-term stable operation</td>
</tr>
</tbody>
</table>
<hr />
<h2>7. Key Log Warnings</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Log message</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>DFlash context fallback: 4096 &gt;= 4096</td>
<td>Once the context reaches 4096, DFlash falls back to the standard engine</td>
</tr>
<tr>
<td>active_memory=56.66GB exceeds safe threshold</td>
<td>Memory usage exceeds the safe threshold</td>
</tr>
<tr>
<td>Emergency reclaim failed</td>
<td>Heavy memory-reclaim pressure; not suited to a resident agent</td>
</tr>
</tbody>
</table>
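<p dir="auto">A small script can watch for these three messages. The sketch below builds its patterns only from the message text quoted in the table; everything else about the server.log format (timestamps, prefixes, file location) is an assumption, and the names <code>PATTERNS</code> and <code>scan</code> are hypothetical.</p>

```python
import re
from collections import Counter

# Patterns built from the three log messages quoted above; everything
# else about the log format is an assumption.
PATTERNS = {
    "dflash_fallback": re.compile(r"DFlash context fallback: (\d+) >= (\d+)"),
    "memory_over_threshold": re.compile(
        r"active_memory=([\d.]+)GB exceeds safe threshold"),
    "reclaim_failed": re.compile(r"Emergency reclaim failed"),
}


def scan(lines) -> Counter:
    """Count how often each warning appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "DFlash context fallback: 4096 >= 4096",
        "active_memory=56.66GB exceeds safe threshold",
        "Emergency reclaim failed",
    ]
    print(dict(scan(sample)))
```

<p dir="auto">Since <code>scan</code> accepts any iterable of strings, an open file handle for the real log can be passed in directly.</p>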
<hr />
<h2>8. Final Recommended Configurations</h2>
<h3>1. Hermes / Agent Resident Configuration (recommended for long-term use)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Setting</th>
<th>Recommended value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Model</td>
<td>Qwen3.6-27B-8bit</td>
</tr>
<tr>
<td>DFlash</td>
<td>Off</td>
</tr>
<tr>
<td>SpecPrefill</td>
<td>Off</td>
</tr>
<tr>
<td>TurboQuant KV Cache</td>
<td>Off</td>
</tr>
<tr>
<td>Temperature</td>
<td>0.3 - 0.5</td>
</tr>
<tr>
<td>Top P</td>
<td>0.9 - 0.95</td>
</tr>
<tr>
<td>Top K</td>
<td>20</td>
</tr>
<tr>
<td>Max Tokens</td>
<td>2048</td>
</tr>
<tr>
<td>CTX Window</td>
<td>Default or 8192</td>
</tr>
<tr>
<td>Use cases</td>
<td>Hermes, OpenClaw, Codex, local agents, long-running automation tasks</td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto">Reason: the standard BatchedEngine is stable, generating at 22.6 - 23.0 tok/s with 30.45 - 31.90 GB peak memory.</p>
</blockquote>
<h3>2. Short-Task High-Speed Configuration (one-off generation)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Setting</th>
<th>Recommended value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Model</td>
<td>Qwen3.6-27B-8bit</td>
</tr>
<tr>
<td>DFlash</td>
<td>On</td>
</tr>
<tr>
<td>Draft Model</td>
<td>z-lab/Qwen3.6-27B-DFlash</td>
</tr>
<tr>
<td>SpecPrefill</td>
<td>Off</td>
</tr>
<tr>
<td>TurboQuant KV Cache</td>
<td>Off</td>
</tr>
<tr>
<td>Temperature</td>
<td>0.5 - 0.6</td>
</tr>
<tr>
<td>Top P</td>
<td>0.95</td>
</tr>
<tr>
<td>Top K</td>
<td>20</td>
</tr>
<tr>
<td>CTX Window</td>
<td>4096</td>
</tr>
<tr>
<td>Max Tokens</td>
<td>1024 - 2048</td>
</tr>
<tr>
<td>Use cases</td>
<td>E-commerce copy, titles, direct messages, scripts, selling points, batch rewriting</td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto">Reason: short-context generation speed goes from 23.0 to 109.4 tok/s, an improvement of about 4.76x.</p>
</blockquote>
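<p dir="auto">The two configurations above can be written down as data and selected per job, as in the sketch below. The key names, the <code>choose_profile</code> function and the specific values (picked from within the recommended ranges) are all hypothetical; they do not correspond to a confirmed oMLX settings schema, so map them onto whatever the oMLX settings UI or config actually exposes.</p>

```python
# Values picked from within the recommended ranges above; the key names
# are hypothetical and do not correspond to a confirmed oMLX schema.
PROFILES = {
    "agent": {"dflash": False, "spec_prefill": False, "turboquant_kv": False,
              "temperature": 0.4, "top_p": 0.9, "top_k": 20,
              "max_tokens": 2048, "ctx_window": 8192},
    "short_task": {"dflash": True, "spec_prefill": False,
                   "turboquant_kv": False, "temperature": 0.55,
                   "top_p": 0.95, "top_k": 20,
                   "max_tokens": 1024, "ctx_window": 4096},
}

DFLASH_MAX_CTX = 4096  # from the DFlash log observations


def choose_profile(prompt_tokens: int, long_running: bool) -> dict:
    """Pick a profile; DFlash only pays off below its context limit."""
    if long_running or prompt_tokens >= DFLASH_MAX_CTX:
        return PROFILES["agent"]
    return PROFILES["short_task"]


if __name__ == "__main__":
    print(choose_profile(prompt_tokens=800, long_running=False))
    print(choose_profile(prompt_tokens=6000, long_running=False))
```

<p dir="auto">The design choice here is to treat the DFlash context limit as a hard cutoff: once a prompt reaches it, the short-task profile would fall back anyway, so the stable profile is the safer pick.</p>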
<hr />
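<h2>9. One-Line Summary</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Use</th>
<th>Recommended configuration</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hermes / agent long-term operation</td>
<td>Standard BatchedEngine, all experimental features off</td>
</tr>
<tr>
<td>Short copy / batch generation / fast output</td>
<td>DFlash on, SpecPrefill and TurboQuant off</td>
</tr>
<tr>
<td>Long-context document analysis</td>
<td>Standard BatchedEngine; test TurboQuant as needed</td>
</tr>
<tr>
<td>Not recommended as always-on</td>
<td>DFlash + SpecPrefill + TurboQuant all on</td>
</tr>
</tbody>
</table>
<hr />
<h2>10. Final Conclusion</h2>
<p dir="auto">Qwen3.6-27B-8bit is already very stable in standard mode on oMLX, so <strong>for a resident agent, prioritize stability</strong>;<br />
DFlash suits fast short-context generation, but it is <strong>not suitable as the main resident configuration for Hermes</strong>.</p>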
]]></description><link>https://lcz.me/topic/194/m3u-96g-发一下数据供以参考</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 07:09:13 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/194.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 18 May 2026 06:44:41 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to M3U 96G: sharing some data for reference on Mon, 18 May 2026 07:29:31 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Learned something new again; now I know how to post in the future.</p>
]]></description><link>https://lcz.me/post/2304</link><guid isPermaLink="true">https://lcz.me/post/2304</guid><dc:creator><![CDATA[kinco520]]></dc:creator><pubDate>Mon, 18 May 2026 07:29:31 GMT</pubDate></item><item><title><![CDATA[Reply to M3U 96G: sharing some data for reference on Mon, 18 May 2026 06:52:15 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/kinco520" aria-label="Profile: kinco520">@<bdi>kinco520</bdi></a> I had Doubao convert it to markdown. For future posts, just have GPT output markdown directly. The forum supports markdown, which makes it easy to produce rich text such as tables.</p>
]]></description><link>https://lcz.me/post/2290</link><guid isPermaLink="true">https://lcz.me/post/2290</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 06:52:15 GMT</pubDate></item></channel></rss>