<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[llama.cpp+qwen3.6-27b preliminary testing]]></title><description><![CDATA[<p dir="auto">Reply to: [Ordered a 7900xtx](Starting to tinker with llama.cpp)</p>
<p dir="auto">I ran some quick tests over the weekend. The main goal was met: qwen3.6-27b runs entirely in VRAM with a 256k context and multimodal support.</p>
<ul>
<li>Model: <code>unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M.gguf</code> + <code>mmproj-BF16.gguf</code></li>
<li>Branch: <code>ggml-org/llama.cpp</code> (upstream master, latest)</li>
<li>Backend: Vulkan (Mesa RADV)</li>
<li>Build configuration:</li>
</ul>
<pre><code>cmake -S . -B build-vulkan \
    -DGGML_VULKAN=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_C_FLAGS="-fPIC -mcmodel=large -mavx2 -mfma -mf16c" \
    -DCMAKE_CXX_FLAGS="-fPIC -mcmodel=large -mavx2 -mfma -mf16c"
</code></pre>
<ul>
<li>llama-server command line:</li>
</ul>
<pre><code>bruin@lmde7:~/github/llama.cpp$ cat run-qwen3-vulkan.sh
#!/bin/bash

export LD_LIBRARY_PATH=$(pwd)/build-vulkan/src:$(pwd)/build-vulkan/ggml/src:$LD_LIBRARY_PATH ;
./build-vulkan/bin/llama-server \
  -m /opt/gguf-models/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /opt/gguf-models/unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf \
  --spec-type draft-mtp \
  -c 262144 \
  -np 1 \
  -fa on \
  -ngl 999 \
  -ctk q4_0 -ctv q4_0 \
  --cont-batching --jinja --mlock \
  --host 0.0.0.0 --port 8000
</code></pre>
<ul>
<li>VRAM usage comparison</li>
</ul>
<p dir="auto">The ik_llama.cpp column can be ignored. The upstream column also seems somewhat inaccurate: nvtop shows very little VRAM headroom left, although everything did fit in VRAM.</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Component</th>
<th style="text-align:center">ik_llama.cpp (IQ4_KS)</th>
<th style="text-align:center">Upstream (Q4_K_M + MTP)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Model weights</td>
<td style="text-align:center">13,003 MiB</td>
<td style="text-align:center">15,850 MiB</td>
</tr>
<tr>
<td>MTP heads (1 extra layer)</td>
<td style="text-align:center">—</td>
<td style="text-align:center">~260 MiB</td>
</tr>
<tr>
<td>mmproj (multimodal projector)</td>
<td style="text-align:center">—</td>
<td style="text-align:center">1,161 MiB</td>
</tr>
<tr>
<td>KV cache q4_0 256K</td>
<td style="text-align:center">4,758 MiB</td>
<td style="text-align:center">4,758 MiB</td>
</tr>
<tr>
<td>Compute buffer</td>
<td style="text-align:center">505 MiB</td>
<td style="text-align:center">~505 MiB</td>
</tr>
<tr>
<td><strong>Total GPU</strong></td>
<td style="text-align:center"><strong>~18,266 MiB</strong></td>
<td style="text-align:center"><strong>~22,534 MiB</strong></td>
</tr>
<tr>
<td>Available</td>
<td style="text-align:center">23,984 MiB</td>
<td style="text-align:center">23,984 MiB</td>
</tr>
<tr>
<td><strong>Margin</strong></td>
<td style="text-align:center"><strong>5,718 MiB</strong></td>
<td style="text-align:center"><strong>1,450 MiB <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /></strong></td>
</tr>
</tbody>
</table>
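<p dir="auto">The totals in the table can be re-derived with a few lines of shell. This is a minimal sketch, not something llama.cpp provides: the helper names <code>sum_mib</code> and <code>vram_margin_mib</code> are hypothetical, and the inputs are simply the per-component MiB figures from the table.</p>

```shell
# Add up a list of per-component sizes (MiB), as in the "Total GPU" row.
sum_mib() (
  total=0
  for part in "$@"; do
    total=$(( total + part ))
  done
  echo "$total"
)

# vram_margin_mib AVAILABLE COMPONENT...
# Headroom left after subtracting all components from the available VRAM.
vram_margin_mib() (
  available=$1
  shift
  echo $(( available - $(sum_mib "$@") ))
)

# Upstream column: weights, MTP, mmproj, KV cache, compute buffer.
echo "total:  $(sum_mib 15850 260 1161 4758 505) MiB"
echo "margin: $(vram_margin_mib 23984 15850 260 1161 4758 505) MiB"
```

<p dir="auto">With the upstream figures this prints a total of 22534 MiB and a margin of 1450 MiB, matching the table.</p>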
<ul>
<li>Preliminary <code>llama-benchy</code> results:</li>
</ul>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:right">Depth</th>
<th style="text-align:center">pp2048 (tok/s)</th>
<th style="text-align:center">tg128 (tok/s)</th>
<th style="text-align:center">ttfr</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right">0 (empty)</td>
<td style="text-align:center"><strong>541</strong></td>
<td style="text-align:center"><strong>76</strong> <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f3c6.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--trophy" style="height:23px;width:auto;vertical-align:middle" title="🏆" alt="🏆" /></td>
<td style="text-align:center">3.8s</td>
</tr>
<tr>
<td style="text-align:right">65,536</td>
<td style="text-align:center">353</td>
<td style="text-align:center">45</td>
<td style="text-align:center">191s</td>
</tr>
<tr>
<td style="text-align:right">131,072</td>
<td style="text-align:center">257</td>
<td style="text-align:center">31</td>
<td style="text-align:center">519s</td>
</tr>
<tr>
<td style="text-align:right">250,000</td>
<td style="text-align:center">170</td>
<td style="text-align:center">26</td>
<td style="text-align:center">1480s (~25 min)</td>
</tr>
<tr>
<td style="text-align:right">262,144</td>
<td style="text-align:center">N/A (corpus too small)</td>
<td style="text-align:center">—</td>
<td style="text-align:center">—</td>
</tr>
</tbody>
</table>
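<p dir="auto">One way to read the table: the ttfr values are close to (depth + 2048) / pp, that is, the time needed to push the whole prompt through at the reported prefill rate. This relation is inferred from the numbers above and is not taken from the llama-benchy documentation. <code>estimate_ttfr_seconds</code> is a hypothetical helper written only to illustrate the arithmetic.</p>

```shell
# estimate_ttfr_seconds DEPTH PP_RATE [PP_TOKENS]
# Rough time to first response in whole seconds, assuming the whole prompt
# (existing depth plus the new pp tokens) is processed at PP_RATE tok/s.
# PP_TOKENS defaults to 2048 to match the pp2048 column.
estimate_ttfr_seconds() (
  depth=$1
  pp_rate=$2
  pp_tokens=${3:-2048}
  # Adding pp_rate / 2 before dividing rounds to the nearest second.
  echo $(( (depth + pp_tokens + pp_rate / 2) / pp_rate ))
)

echo "depth  65536: ~$(estimate_ttfr_seconds 65536 353)s (measured 191s)"
echo "depth 131072: ~$(estimate_ttfr_seconds 131072 257)s (measured 519s)"
echo "depth 250000: ~$(estimate_ttfr_seconds 250000 170)s (measured 1480s)"
```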
<ul>
<li>Power measurements and electricity cost estimates:</li>
</ul>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>State</th>
<th style="text-align:center">Power</th>
<th style="text-align:center">Per day (24h)</th>
<th style="text-align:center">Per month (30d)</th>
<th style="text-align:center">Per year (365d)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Idle</td>
<td style="text-align:center">~20W</td>
<td style="text-align:center">0.48 kWh = <strong>0.24 RMB</strong></td>
<td style="text-align:center">14.4 kWh = <strong>7.2 RMB</strong></td>
<td style="text-align:center">175.2 kWh = <strong>87.6 RMB</strong></td>
</tr>
<tr>
<td>Full load (100%)</td>
<td style="text-align:center">~400W</td>
<td style="text-align:center">9.6 kWh = <strong>4.8 RMB</strong></td>
<td style="text-align:center">288 kWh = <strong>144 RMB</strong></td>
<td style="text-align:center">3504 kWh = <strong>1,752 RMB</strong></td>
</tr>
<tr>
<td>Typical use (3h full + 21h idle)</td>
<td style="text-align:center">~67W avg</td>
<td style="text-align:center">1.62 kWh = <strong>0.81 RMB</strong></td>
<td style="text-align:center">48.6 kWh = <strong>24.3 RMB</strong></td>
<td style="text-align:center">591.3 kWh = <strong>295.7 RMB</strong></td>
</tr>
</tbody>
</table>
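<p dir="auto">The table implies an electricity price of 0.5 RMB/kWh (0.24 RMB for 0.48 kWh). Under that assumption, the rows can be reproduced with the sketch below. The function names are hypothetical, and <code>awk</code> is used only because plain shell arithmetic is integer-only.</p>

```shell
# Assumption: price implied by the table, not stated explicitly in the post.
PRICE_RMB_PER_KWH=0.5

# energy_kwh WATTS HOURS: energy used at a constant power draw.
energy_kwh() {
  awk -v w="$1" -v h="$2" 'BEGIN { printf "%.2f\n", w * h / 1000 }'
}

# typical_day_kwh FULL_WATTS IDLE_WATTS: 3 h at full load plus 21 h idle.
typical_day_kwh() {
  awk -v full="$1" -v idle="$2" 'BEGIN { printf "%.2f\n", (3 * full + 21 * idle) / 1000 }'
}

# cost_rmb KWH: cost at the assumed price.
cost_rmb() {
  awk -v kwh="$1" -v price="$PRICE_RMB_PER_KWH" 'BEGIN { printf "%.2f\n", kwh * price }'
}

day=$(typical_day_kwh 400 20)
echo "typical day: ${day} kWh, $(cost_rmb "$day") RMB"
```

<p dir="auto">For the typical-use row this prints 1.62 kWh and 0.81 RMB per day; changing <code>PRICE_RMB_PER_KWH</code> adapts the estimate to a different tariff.</p>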
<ul>
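<li>Issues noted:
<ul>
<li>With my ik_llama.cpp build, GPU utilization tops out at 50% for unknown reasons. I gave up and went back to upstream llama.cpp.</li>
<li>On mainline llama.cpp, the ROCm backend seems to be about 20% slower. Dropped for now.</li>
<li>Getting Hermes to use images still seems to have problems. With the built-in web UI, uploading and recognizing images works.</li>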
</ul>
</li>
</ul>
]]></description><link>https://lcz.me/topic/301/llama.cpp-qwen3.6-27b-初步测试</link><generator>RSS for Node</generator><lastBuildDate>Sun, 31 May 2026 06:31:28 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/301.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 25 May 2026 01:50:30 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b preliminary testing on Fri, 29 May 2026 02:54:42 GMT]]></title><description><![CDATA[<p dir="auto">I've been using llama.cpp + qwen 35 a3b lately. On an M5 Max it works quite well, roughly a hundred tokens per second.</p>
]]></description><link>https://lcz.me/post/4138</link><guid isPermaLink="true">https://lcz.me/post/4138</guid><dc:creator><![CDATA[goodhat5405]]></dc:creator><pubDate>Fri, 29 May 2026 02:54:42 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b preliminary testing on Fri, 29 May 2026 01:45:34 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/john-ato" aria-label="Profile: John-Ato">@<bdi>John-Ato</bdi></a> What is smithy? I couldn't find anything when I searched for it.</p>
]]></description><link>https://lcz.me/post/4135</link><guid isPermaLink="true">https://lcz.me/post/4135</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Fri, 29 May 2026 01:45:34 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b preliminary testing on Thu, 28 May 2026 14:21:07 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/laobenxiong" aria-label="Profile: laobenxiong">@<bdi>laobenxiong</bdi></a> Your parameter set is close to the current limit. There is also SMITHY, which you could try, and --cache-ram seems to help with multi-turn conversations.</p>
]]></description><link>https://lcz.me/post/4067</link><guid isPermaLink="true">https://lcz.me/post/4067</guid><dc:creator><![CDATA[John Ato]]></dc:creator><pubDate>Thu, 28 May 2026 14:21:07 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b preliminary testing on Thu, 28 May 2026 14:16:51 GMT]]></title><description><![CDATA[<p dir="auto">Two observations from the past couple of days about connecting hermes to llama-server:</p>
<ol>
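<li>The first prompt of a new hermes session is a little under 20k tokens. The 7900xtx needs about 30 to 40 seconds of pp (prompt processing) before tg (token generation) starts. Watching the llama-server log, I found that the first prompt is immediately followed by a small prompt, which flushes all the earlier checkpoints (roughly one checkpoint per 8k tokens). As a result, the next turn of the conversation has to pp the earlier ~20k prompt all over again. I had hermes check this itself: the second, small prompt is the title generation request it sends. To avoid this, either disable title generation or configure an auxiliary (aux) model to generate titles (for example, I let the online deepseek-v4-flash handle all aux work);</li>
<li>In stream mode, hermes has an environment variable <code>HERMES_STREAM_READ_TIMEOUT</code> that controls the timeout for receiving the first reply token; the default is 120s. On the 7900xtx, pp takes roughly this long (q5_1/q5_1 cache quant):</li>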
</ol>
<pre><code> -  20k:  40s
 -  40k: 100s
 -  60k: 170s
 -  80k: 260s
 - 100k: 360s
 - 120k: 460s
 - 140k: 580s
 - 160k: 710s
</code></pre>
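<p dir="auto">If hermes has a ~50k prompt and the llama-server cache checkpoints happen to have all been flushed, hermes times out before pp finishes. After the timeout, hermes aborts the current request and sends it again (three attempts in total). If the checkpoints happen to be flushed again on the second attempt (I have run into this, and have not yet figured out why), all three attempts are bound to fail. When this happens, increase this environment variable.</p>
<p dir="auto">After updating llama.cpp to the latest version today, I also hit a case where llama-server's RSS grew so large that it was oom-killed, and a case where llama-server/hermes got stuck in an infinite loop. I have not had time to investigate the details. For now I have gone back to b9305.</p>
<p dir="auto">The current startup script is:</p>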
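<p dir="auto">The pp timings listed above can be turned into a rough timeout estimate. The sketch below linearly interpolates between the measured points and adds a 50% margin. The helper names, the interpolation, and the margin are all assumptions made for illustration; none of this is part of hermes or llama.cpp.</p>

```shell
# Measured "prompt size in k tokens : pp seconds" pairs from the list above
# (7900xtx, q5_1/q5_1 cache quant).
PP_TABLE="20:40 40:100 60:170 80:260 100:360 120:460 140:580 160:710"

# estimate_pp_seconds K_TOKENS
# Linear interpolation between neighbouring measurements; below 20k it
# interpolates from (0, 0). Fails for sizes beyond the last measurement.
estimate_pp_seconds() (
  k=$1
  prev_k=0
  prev_s=0
  for pair in $PP_TABLE; do
    cur_k=${pair%%:*}
    cur_s=${pair##*:}
    if [ "$k" -le "$cur_k" ]; then
      echo $(( prev_s + (cur_s - prev_s) * (k - prev_k) / (cur_k - prev_k) ))
      return 0
    fi
    prev_k=$cur_k
    prev_s=$cur_s
  done
  return 1
)

# suggest_timeout_seconds K_TOKENS
# Adds a 50% safety margin (an arbitrary choice) on top of the estimate.
suggest_timeout_seconds() (
  pp=$(estimate_pp_seconds "$1") || return 1
  echo $(( pp * 3 / 2 ))
)

echo "50k prompt: ~$(estimate_pp_seconds 50)s pp, suggested timeout $(suggest_timeout_seconds 50)s"

# Assumption: the variable is read in seconds (the post gives its default as 120s).
# export HERMES_STREAM_READ_TIMEOUT="$(suggest_timeout_seconds 50)"
```

<p dir="auto">For a 50k prompt the estimate is about 135s of pp, which is already above the 120s default and is consistent with the timeouts described above.</p>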
<pre><code>#!/bin/bash

LLAMA_SERVER=/home/bruin/github/llama.cpp/build-vulkan/bin/llama-server

TOKEN_PER_CKPT=8192    # token per checkpoint, seems llama.cpp hardcoded
NUM_CKPT=32
CTX_SIZE=$((TOKEN_PER_CKPT * NUM_CKPT))

ARGS=(
  --model              /home/bruin/Qwen3.6-27B-Q4_K_M.gguf
  --mmproj             /opt/gguf-models/unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf
  #--chat-template-file /opt/gguf-models/froggeric/Qwen-Fixed-Chat-Templates/chat_template.jinja
  --spec-type          draft-mtp
  --spec-draft-n-max   2                       # Max draft tokens
  --ctx-checkpoints ${NUM_CKPT}                # 8k token per ckpt
  --ctx-size ${CTX_SIZE}                       # 262144 for 256k context
  #--swa-full                                   # qwen3.6-27b does not support it
  --parallel   1                               # Single slot
  --flash-attn on                              # Enable FlashAttention
  --n-gpu-layers 999                           # All layers to GPU
  --cache-type-k q5_1                          # Quantize KV cache keys
  --cache-type-v q5_1                          # Quantize KV cache values
  #--fit off                                    #
  --threads 16                                 # CPU threads helping tg
  --threads-batch 16                           # CPU threads helping pp
  --batch-size 2048                            # Batch size
  --ubatch-size 1024                           # Micro‑batch size
  --cache-ram 0                                # seems not working
  --reasoning auto                             # Auto reasoning
  --reasoning-format deepseek                  # Reasoning format
  --reasoning-budget 1024                      # Reasoning budget
  --log-verbosity 4                            # Log verbosity
  --host 0.0.0.0 --port 8000                   # Listen on all interfaces, port 8000
  --cont-batching                              # Continuous batching
  --no-warmup                                  # Skip warmup
  --no-mmap                                    # Don’t memory‑map model
  --mlock                                      # Lock model in RAM
  --jinja                                      # Jinja chat template
  --metrics                                    # View metrics by accessing http://&lt;ip:port&gt;/metrics
)

# print the cmdline
echo "${LLAMA_SERVER}"
for ((i=0; i&lt;${#ARGS[@]}; i+=2)); do
  echo "${ARGS[i]} ${ARGS[i+1]}"
done

# run the cmd
${LLAMA_SERVER} "${ARGS[@]}"
</code></pre>
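<p dir="auto">The print loop in the script walks the array two items at a time, so value-less flags such as <code>--cont-batching</code> are printed paired with whatever flag follows them. A possible alternative, shown as a sketch: start a new output line at every argument that begins with <code>--</code>. <code>print_args</code> is a hypothetical helper, and it assumes all options use the long <code>--name</code> form, as they do in the script.</p>

```shell
# print_args ARG...
# Print one line per option: the option itself followed by its values, if any.
print_args() (
  line=""
  for arg in "$@"; do
    case "$arg" in
      --*)
        # A new option starts: flush the previous one first.
        if [ -n "$line" ]; then printf '%s\n' "$line"; fi
        line=$arg
        ;;
      *)
        # A value belonging to the current option.
        line="$line $arg"
        ;;
    esac
  done
  if [ -n "$line" ]; then printf '%s\n' "$line"; fi
)

print_args --ctx-size 262144 --mlock --host 0.0.0.0 --port 8000
```

<p dir="auto">In the script it would be called as <code>print_args "${ARGS[@]}"</code> in place of the <code>for</code> loop.</p>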
]]></description><link>https://lcz.me/post/4065</link><guid isPermaLink="true">https://lcz.me/post/4065</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Thu, 28 May 2026 14:16:51 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b preliminary testing on Thu, 28 May 2026 06:58:40 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/blackjack" aria-label="Profile: blackjack">@<bdi>blackjack</bdi></a> <a href="/post/3735">said</a>:</p>
<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> <a href="/post/3730">said</a>:</p>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/rock-shi" aria-label="Profile: rock-shi">@<bdi>rock-shi</bdi></a> That explains it: 24G is not enough to run a 128K context plus MTP.</p>
</blockquote>
<p dir="auto">I run the 27b at q4 quantization, with both K and V caches quantized to q8_0, 128k context, MTP, on a 5090 laptop with 24G RAM, thinking enabled, and get 50+ tps. It absolutely flies.</p>
</blockquote>
<p dir="auto">Impressive! I have the same card. Could you share your settings so I can copy them? 14900k, 32g RAM, llama.cpp. Thanks!</p>
]]></description><link>https://lcz.me/post/4021</link><guid isPermaLink="true">https://lcz.me/post/4021</guid><dc:creator><![CDATA[ran z]]></dc:creator><pubDate>Thu, 28 May 2026 06:58:40 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b preliminary testing on Wed, 27 May 2026 04:05:09 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/blackjack" aria-label="Profile: blackjack">@<bdi>blackjack</bdi></a> I'm using the official llama.cpp release: llama-b9329-bin-win-cuda-12.4-x64</p>
<p dir="auto">Earlier I compiled it myself. Some build option must have been wrong, because the resulting llama-server only reached 10 tokens/s no matter how I tuned the parameters.</p>
]]></description><link>https://lcz.me/post/3887</link><guid isPermaLink="true">https://lcz.me/post/3887</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Wed, 27 May 2026 04:05:09 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b preliminary testing on Wed, 27 May 2026 03:56:53 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/blackjack" aria-label="Profile: blackjack">@<bdi>blackjack</bdi></a> With pointers from the experts on this forum, mine has taken off too<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f604.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--smile" style="height:23px;width:auto;vertical-align:middle" title="😄" alt="😄" /></p>
<p dir="auto">llama-server startup parameters:</p>
<p dir="auto">--reasoning off ^<br />
--n-gpu-layers -1 ^<br />
--ctx-size 131072 ^<br />
--batch-size 2048 ^<br />
--ubatch-size 1024 ^<br />
--flash-attn on ^<br />
--cache-type-k q4_0 ^<br />
--cache-type-v q4_0 ^<br />
--spec-type draft-mtp,ngram-mod ^<br />
--spec-draft-n-max 3 ^<br />
--spec-ngram-mod-n-max 5 ^<br />
--spec-ngram-mod-n-min 3 ^<br />
--temp 0.7 ^<br />
--parallel 1</p>
]]></description><link>https://lcz.me/post/3883</link><guid isPermaLink="true">https://lcz.me/post/3883</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Wed, 27 May 2026 03:56:53 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b preliminary testing on Wed, 27 May 2026 00:56:48 GMT]]></title><description><![CDATA[<p dir="auto">Yesterday I hit an OOM, apparently a host RAM OOM. I don't understand why it uses so much system RAM...</p>
<pre><code>[71766.725058] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2025.05-1~bpo12+1 03/12/2026
[71766.725059] Call Trace:
[71766.725073]  &lt;TASK&gt;
[71766.725076]  dump_stack_lvl+0x5d/0x80
[71766.725082]  dump_header+0x43/0x1aa
[71766.725085]  oom_kill_process.cold+0xa/0xb2
[71766.725088]  out_of_memory+0x217/0x4b0
[71766.725091]  __alloc_pages_slowpath.constprop.0+0xc3b/0xdd0
[71766.725098]  __alloc_frozen_pages_noprof+0x2cd/0x320
[71766.725103]  alloc_pages_mpol+0x7d/0x180
[71766.725107]  folio_alloc_noprof+0x5d/0xe0
[71766.725110]  __filemap_get_folio+0x1dd/0x330
[71766.725112]  filemap_fault+0x10c/0x12f0
[71766.725116]  __do_fault+0x30/0x180
[71766.725119]  do_fault+0x310/0x540
[71766.725122]  __handle_mm_fault+0x8ee/0xf20
[71766.725124]  ? srso_alias_return_thunk+0x5/0xfbef5
[71766.725129]  handle_mm_fault+0xec/0x2e0
[71766.725132]  do_user_addr_fault+0x2c3/0x7f0
[71766.725135]  exc_page_fault+0x74/0x180
[71766.725139]  asm_exc_page_fault+0x26/0x30
[71766.725140] RIP: 0033:0x9bab0a
[71766.725161] Code: Unable to access opcode bytes at 0x9baae0.
[71766.725162] RSP: 002b:000000c00093cc60 EFLAGS: 00010216
[71766.725164] RAX: 0000000001b35410 RBX: 0000000001b77928 RCX: 0000000001b77928
[71766.725165] RDX: 0000000000a244e0 RSI: 00000000009baae0 RDI: 000000c000aaa6e0
[71766.725166] RBP: 000000c00093cca0 R08: 0000000000000040 R09: 0000000000000082
[71766.725167] R10: 00007f88af626fa8 R11: 00000000000000d0 R12: 0000000000000006
[71766.725168] R13: 000000c001548410 R14: 000000c000171880 R15: ffffffffffffffff
[71766.725172]  &lt;/TASK&gt;
[71766.725173] Mem-Info:
[71766.725181] active_anon:3652772 inactive_anon:3225935 isolated_anon:0
                active_file:74 inactive_file:592 isolated_file:0
                unevictable:2004 dirty:0 writeback:0
                slab_reclaimable:16967 slab_unreclaimable:49972
                mapped:2076 shmem:4443 pagetables:27449
                sec_pagetables:0 bounce:0
                kernel_misc_reclaimable:0
                free:57534 free_pcp:0 free_cma:0
[71766.725185] Node 0 active_anon:14611088kB inactive_anon:12903740kB active_file:296kB inactive_file:2004kB unevictable:8016kB isolated(anon):0kB isolated(file):0kB mapped:8304kB dirty:0kB writeback:0kB shmem:17772kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:17055744kB kernel_stack:12720kB pagetables:109796kB sec_pagetables:0kB all_unreclaimable? no Balloon:0kB
[71766.725189] Node 0 DMA free:11264kB boost:0kB min:28kB low:40kB high:52kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15864kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[71766.725193] lowmem_reserve[]: 0 1948 32063 32063 32063
[71766.725198] Node 0 DMA32 free:124216kB boost:0kB min:3760kB low:5576kB high:7392kB reserved_highatomic:0KB free_highatomic:0KB active_anon:1852244kB inactive_anon:14144kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2061152kB managed:1995156kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[71766.725202] lowmem_reserve[]: 0 0 30114 30114 30114
[71766.725207] Node 0 Normal free:94656kB boost:30720kB min:94508kB low:125332kB high:156156kB reserved_highatomic:0KB free_highatomic:0KB active_anon:12758844kB inactive_anon:12889596kB active_file:660kB inactive_file:2004kB unevictable:8016kB writepending:0kB present:31457280kB managed:30837388kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[71766.725210] lowmem_reserve[]: 0 0 0 0 0
[71766.725215] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
[71766.725228] Node 0 DMA32: 0*4kB 1*8kB (M) 3*16kB (UM) 8*32kB (UM) 10*64kB (UM) 5*128kB (UM) 3*256kB (UM) 4*512kB (UM) 1*1024kB (U) 2*2048kB (M) 28*4096kB (M) = 124216kB
[71766.725245] Node 0 Normal: 635*4kB (UME) 484*8kB (UME) 664*16kB (UME) 503*32kB (UME) 315*64kB (UME) 196*128kB (UME) 59*256kB (UME) 3*512kB (ME) 0*1024kB 0*2048kB 0*4096kB = 95020kB
[71766.725261] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[71766.725263] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[71766.725264] 12002 total pagecache pages
[71766.725265] 6815 pages in swap cache
[71766.725266] Free swap  = 72kB
[71766.725267] Total swap = 8496124kB
[71766.725268] 8383574 pages RAM
[71766.725269] 0 pages HighMem/MovableOnly
[71766.725269] 171598 pages reserved
[71766.725270] 0 pages hwpoisoned
[71766.725271] Tasks state (memory values in pages):
[71766.725272] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[71766.725279] [    513]     0   513    16897      240       32      208         0   135168      256          -250 systemd-journal
[71766.725283] [    535]   990   535    23023      185        0      185         0    81920      256             0 systemd-timesyn
[71766.725285] [    541]     0   541     9456      707      544      163         0    98304      384         -1000 systemd-udevd
[71766.725288] [    819]     0   819    77254      161       64       97         0   106496       96             0 accounts-daemon
[71766.725290] [    821]   105   821     1564      239        0      239         0    53248       64             0 avahi-daemon
[71766.725292] [    824]     0   824     1716      185       64      121         0    49152        0             0 cron
[71766.725294] [    826]   989   826     2549      457      256      201         0    61440      256          -900 dbus-daemon
[71766.725297] [    839]     0   839    20199      275       96      179         0    61440        0             0 irqbalance
[71766.725299] [    845]   987   845    95842      717      569      148         0   114688      224             0 polkitd
[71766.725301] [    846]     0   846    54923      493      384      109         0    77824        0             0 rsyslogd
[71766.725303] [    848]     0   848    76962      254      128      126         0   102400        0             0 switcheroo-cont
[71766.725305] [    861]     0   861     4761      267      128      139         0    73728      192             0 systemd-logind
[71766.725308] [    862]     0   862    48059      131        0      131         0   126976      480             0 touchegg
[71766.725310] [    865]     0   865   117523      709      544      165         0   151552        0             0 udisksd
[71766.725312] [    881]   105   881     1518      151       37      114         0    53248       32             0 avahi-daemon
[71766.725314] [    964]     0   964   102609      882      704      178         0   167936        0             0 NetworkManager
[71766.725317] [    969]     0   969     4383      353      224      129         0    77824        0             0 wpa_supplicant
[71766.725319] [   1000]     0  1000    97731      195       64      131         0   126976      384             0 ModemManager
[71766.725321] [   1045]     0  1045    95236      179        0      179         0   110592      192             0 lightdm
[71766.725323] [   1048]     0  1048     2944      447      256      191         0    61440       32         -1000 sshd
[71766.725325] [   1082]     0  1082   654404     6572     5217      178      1177   782336     9472             0 Xorg
[71766.725327] [   1085]     0  1085     2042      147       32      115         0    57344        0             0 agetty
[71766.725329] [   1228]     0  1228    10995      185       40      145         0    86016       96             0 master
[71766.725332] [   1230]   116  1230    11129      208       32      176         0    77824      128             0 qmgr
[71766.725334] [   1246]     0  1246    43208      161        0      161         0    90112      256             0 lightdm
[71766.725336] [   1254]  1000  1254     5813      822      640      182         0    98304       32           100 systemd
[71766.725338] [   1256]  1000  1256     6387      531      421      110         0    81920      128           100 (sd-pam)
[71766.725340] [   1278]  1000  1278     2343      644      480      164         0    53248        0           200 dbus-daemon
[71766.725342] [   1279]  1000  1279    25582     1141      960      181         0    98304        0           200 pipewire
[71766.725345] [   1280]  1000  1280    21187      311      128      183         0    77824        0           200 pipewire
[71766.725347] [   1281]  1000  1281   119867     1068      896      172         0   155648        0           200 wireplumber
[71766.725349] [   1282]  1000  1282    41570      791      561      230         0    90112        0           200 pipewire-pulse
[71766.725351] [   1283]  1000  1283   119267      162        0      162         0   151552      544             0 cinnamon-sessio
[71766.725353] [   1297]  1000  1297     1810      183       32      151         0    53248        0           200 mpris-proxy
[71766.725355] [   1300]   114  1300     5369      193       32      161         0    61440        0             0 rtkit-daemon
[71766.725358] [   1361]  1000  1361     2637      153       71       82         0    49152      192             0 ssh-agent
[71766.725360] [   1372]  1000  1372    61760      178       21      157         0   225280     2816             0 fcitx
[71766.725362] [   1378]  1000  1378     2054      139       32      107         0    57344       64             0 dbus-daemon
[71766.725364] [   1382]  1000  1382     1279      117        3      114         0    49152       32             0 fcitx-dbus-watc
[71766.725367] [   1395]  1000  1395    45784      376      224      152         0   102400        0           200 gnome-keyring-d
[71766.725369] [   1404]  1000  1404   191033      682      517      165         0   462848     3232             0 csd-media-keys
[71766.725371] [   1407]  1000  1407    76111      218       32      186         0    98304       96             0 csd-screensaver
[71766.725373] [   1408]  1000  1408    79234      246       64      182         0   114688      256             0 csd-print-notif
[71766.725375] [   1411]  1000  1411   154236      268        0      268         0   438272     3648             0 csd-automount
[71766.725377] [   1416]  1000  1416   136010      270        1      269         0   425984     3872             0 csd-wacom
[71766.725379] [   1419]  1000  1419   176584      726      530      196         0   458752     3104             0 csd-color
[71766.725381] [   1421]  1000  1421   119372     1360     1084      276         0   421888     2496             0 csd-housekeepin
[71766.725384] [   1422]  1000  1422   119717      724      544      180         0   421888     3168             0 csd-xsettings
[71766.725387] [   1425]  1000  1425   127752     1320     1062      226        32   446464     2624             0 csd-background
[71766.725391] [   1427]  1000  1427   100813      282       97      185         0   413696     3520             0 csd-clipboard
[71766.725394] [   1428]  1000  1428   175014     1286     1073      213         0   454656     2560             0 csd-power
[71766.725396] [   1430]  1000  1430    60105      130       32       98         0    86016      128             0 csd-a11y-settin
[71766.725398] [   1431]  1000  1431    59835      114        0      114         0    81920      160             0 csd-settings-re
[71766.725400] [   1432]  1000  1432   154174      248        0      248         0   434176     3584             0 csd-keyboard
[71766.725402] [   1435]  1000  1435    95328      128        0      128         0   102400      128             0 at-spi-bus-laun
[71766.725404] [   1446]  1000  1446    41342      209      128       81         0    69632        0           200 dconf-service
[71766.725406] [   1447]  1000  1447     2120      172        0      172         0    53248      128             0 dbus-daemon
[71766.725408] [   1466]  1000  1466    78186      373      160      213         0   110592        0           200 gvfsd
[71766.725410] [   1487]  1000  1487   118027      287      160      127         0   122880        0           200 gvfsd-fuse
[71766.725413] [   1510]  1000  1510    42226      171        0      171         0    86016      192             0 at-spi2-registr
[71766.725415] [   1514]  1000  1514   134662      585      320      265         0   147456        0           200 gvfs-udisks2-vo
[71766.725417] [   1520]   115  1520    79008      173        0      173         0   122880     1120             0 colord
[71766.725419] [   1527]     0  1527    79740      509      320      189         0   122880      192             0 upowerd
[71766.725421] [   1539]  1000  1539    76984      252       32      220         0    98304      128           200 gvfs-mtp-volume
[71766.725423] [   1552]  1000  1552    97513      145        0      145         0   126976      288           200 gvfs-afc-volume
[71766.725425] [   1558]  1000  1558    77224      388      128      260         0   102400       32           200 gvfs-gphoto2-vo
[71766.725428] [   1563]  1000  1563    76958      244       96      148         0   102400        0           200 gvfs-goa-volume
[71766.725430] [   1568]  1000  1568   129472     1092      928      164         0   233472        0           200 goa-daemon
[71766.725432] [   1576]  1000  1576   103338      155       32      123         0   167936      448             0 csd-printer
[71766.725435] [   1582]  1000  1582   112521     2526     2302      224         0   229376     2176             0 cinnamon-launch
[71766.725437] [   1593]  1000  1593    96929      398      192      206         0   114688        0           200 goa-identity-se
[71766.725439] [   1609]  1000  1609  1585603    21400    19495      175      1730  1376256     9984             0 cinnamon
[71766.725441] [   1666]  1000  1666    96417      143        0      143         0   118784      960             0 ibus-daemon
[71766.725443] [   1673]  1000  1673    42299      171       32      139         0    86016       96             0 ibus-memconf
[71766.725445] [   1674]  1000  1674    69794      271       27      244         0   167936     3040             0 ibus-extension-
[71766.725447] [   1676]  1000  1676    44591      237       32      205         0    98304      320             0 ibus-x11
[71766.725449] [   1681]  1000  1681    77136      234      128      106         0   102400        0           200 ibus-portal
[71766.725452] [   1693]  1000  1693   102902      252       64      188         0   163840     1280             0 xapp-sn-watcher
[71766.725454] [   1715]  1000  1715    42298      196       32      164         0    86016       96             0 ibus-engine-sim
[71766.725456] [   1721]  1000  1721   153619      785      526      227        32   299008     5696             0 nemo-desktop
[71766.725458] [   1724]  1000  1724   132589     3439     3227      212         0   270336     2816             0 blueman-applet
[71766.725460] [   1726]  1000  1726    75200      278       64      214         0   204800     3904             0 cinnamon-killer
[71766.725462] [   1729]  1000  1729   263369      806      525      281         0   450560     4480             0 evolution-alarm
[71766.725464] [   1770]  1000  1770   375036     2744     2536      208         0   446464     1472           200 evolution-sourc
[71766.725466] [   1775]  1000  1775   113530      795      608      187         0   180224       32           200 obexd
[71766.725469] [   1788]  1000  1788   206201     1077      928      149         0   274432        0           200 evolution-addre
[71766.725498] [   1791]  1000  1791   225026      209        0      209         0   241664      864           200 evolution-calen
[71766.725501] [   1823]     0  1823   320552    10808    10778       30         0   282624      797             0 netbird
[71766.725504] [   1824]     0  1824    77293      260      128      132         0    98304        0             0 power-profiles-
[71766.725506] [   1910]  1000  1910   133521      405      192      213         0   126976        0           200 gvfsd-trash
[71766.725509] [   2079]  1000  2079   263021     2578     2306      272         0   413696     9664             0 mintUpdate
[71766.725511] [   2137]  1000  2137    16271      287      115      172         0   172032     4544             0 applet.py
[71766.725513] [   2145]     0  2145     5552      717      512      205         0    81920       32             0 sshd-session
[71766.725516] [   2155]  1000  2155     4453      169        0      169         0    81920      480             0 ssh
[71766.725518] [   2158]  1000  2158     4323      265       64      201         0    69632      352             0 ssh
[71766.725520] [   2160]  1000  2160   114180      159        0      159         0   102400      352             0 sshfs
[71766.725523] [   2161]  1000  2161    77314       35       35        0         0    86016       32             0 sshfs
[71766.725526] [   2169]  1000  2169   145847     5382     5163      219         0   315392     4672             0 mintreport-tray
[71766.725528] [   2184]  1000  2184   134921     2068     1716      224       128   225280      192           200 gnome-terminal-
[71766.725531] [   2189]  1000  2189   157663     1011      832      179         0   184320        0           200 xdg-desktop-por
[71766.725533] [   2194]  1000  2194    77191      257       96      161         0   102400        0           200 xdg-permission-
[71766.725535] [   2199]  1000  2199   134930      273      128      145         0   139264        0           200 xdg-document-po
[71766.725538] [   2205]  1000  2205      646       83        0       83         0    49152        0           200 fusermount3
[71766.725540] [   2210]  1000  2210   102736     1353     1120      233         0   167936      192           200 xdg-desktop-por
[71766.725543] [   2228]  1000  2228   103011     1398     1248      150         0   172032       64           200 xdg-desktop-por
[71766.725546] [   2239]  1000  2239     2256      491      384      107         0    61440      160           200 bash
[71766.725548] [   2285]  1000  2285     5666      788      557      231         0    81920      128             0 sshd-session
[71766.725550] [   2286]     0  2286     5554      711      512      199         0    90112        0             0 sshd-session
[71766.725552] [   2288]  1000  2288     2282      165       32      133         0    57344      512             0 bash
[71766.725554] [   2319]  1000  2319     5595      770      590      180         0    90112        0             0 sshd-session
[71766.725557] [   2331]  1000  2331      675      117        0      117         0    45056        0             0 sftp-server
[71766.725559] [   4065]  1000  4065     4282      188       80      108         0    73728      576             0 tmux: client
[71766.725561] [   4067]  1000  4067     8892      235       72      163         0   102400     1664             0 tmux: server
[71766.725563] [   5508]  1000  5508     2323      258      128      130         0    61440      480             0 bash
[71766.725566] [  38132]  1000 38132     2290      187       64      123         0    57344      512             0 bash
[71766.725569] [  38178]  1000 38178     5185     1453     1275      178         0    73728        0             0 nvtop
[71766.725571] [  50374]     0 50374   127230     1314     1115      199         0   266240     5024             0 fwupd
[71766.725574] [  50951]  1000 50951     1771      206       64      142         0    57344        0             0 bash
[71766.725576] [ 137326]     0 137326     5246      368      224      144         0    81920      224             0 cupsd
[71766.725579] [ 137327]     0 137327    48293      395      224      171         0   131072      544             0 cups-browsed
[71766.725581] [ 416850]     0 416850     5553      758      544      214         0    90112        0             0 sshd-session
[71766.725584] [ 416920]  1000 416920     5594      714      558      156         0    90112       32             0 sshd-session
[71766.725586] [ 416921]  1000 416921     2282      562      480       82         0    65536       96             0 bash
[71766.725588] [ 416960]  1000 416960     4282      280      112      168         0    65536       64             0 tmux: client
[71766.725591] [ 416961]  1000 416961     2323      454      384       70         0    57344      224             0 bash
[71766.725593] [ 438401]     0 438401     5552      722      576      146         0    94208        0             0 sshd-session
[71766.725596] [ 438453]  1000 438453     5633      850      621      229         0    94208       32             0 sshd-session
[71766.725598] [ 438468]     0 438468     5552      745      512      233         0    81920        0             0 sshd-session
[71766.725601] [ 438470]  1000 438470     2282      591      480      111         0    53248       96             0 bash
[71766.725603] [ 438525]  1000 438525     5593      792      621      171         0    81920        0             0 sshd-session
[71766.725605] [ 438547]  1000 438547      675       81        0       81         0    49152        0             0 sftp-server
[71766.725607] [ 438608]  1000 438608     4282     1338     1136      202         0    73728        0             0 tmux: client
[71766.725610] [ 793687]  1000 793687     1771      170       32      138         0    57344       32             0 run-qwen3-vulka
[71766.725612] [ 793688]  1000 793688 17163172  6781252  6781035      217         0 88559616  1995769             0 llama-server
[71766.725615] [ 945076]   116 945076    11117      188       32      156         0    77824      128             0 pickup
[71766.725619] [1047815]  1000 1047815     1395      103        0      103         0    53248        0             0 sleep
[71766.725621] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/tmux-spawn-ff073762-1455-4c6b-8b1c-b737ee738d0c.scope,task=llama-server,pid=793688,uid=1000
[71766.725781] Out of memory: Killed process 793688 (llama-server) total-vm:68652688kB, anon-rss:27124140kB, file-rss:868kB, shmem-rss:0kB, UID:1000 pgtables:86484kB oom_score_adj:0
</code></pre>
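<p dir="auto">To read the last kernel line in more familiar units, here is a minimal sketch that pulls the <code>kB</code> fields out of the "Out of memory" line and converts them to GiB. The helper names <code>parse_oom_kb</code> and <code>kb_to_gib</code> are made up for illustration; the input string is copied from the log above.</p>

```python
import re

# The "Out of memory" line copied from the kernel log above.
OOM_LINE = (
    "Out of memory: Killed process 793688 (llama-server) "
    "total-vm:68652688kB, anon-rss:27124140kB, file-rss:868kB, "
    "shmem-rss:0kB, UID:1000 pgtables:86484kB oom_score_adj:0"
)


def parse_oom_kb(line):
    """Return every 'name:NUMBERkB' field of an OOM-killer line as {name: kB}."""
    return {name: int(kb) for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}


def kb_to_gib(kb):
    """Convert kibibytes to gibibytes."""
    return kb / (1024 * 1024)


fields = parse_oom_kb(OOM_LINE)
for name, kb in fields.items():
    print(f"{name:10} {kb_to_gib(kb):6.2f} GiB")
```

<p dir="auto">Read this way, <code>anon-rss</code> comes to about 25.87 GiB of anonymous memory resident in system RAM for llama-server at the moment it was killed.</p>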
]]></description><link>https://lcz.me/post/3860</link><guid isPermaLink="true">https://lcz.me/post/3860</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Wed, 27 May 2026 00:56:48 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 13:24:27 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/blackjack" aria-label="Profile: blackjack">@<bdi>blackjack</bdi></a> I haven't dug into it, but once I switched to a Q8 KV cache the problem went away.</p>
]]></description><link>https://lcz.me/post/3814</link><guid isPermaLink="true">https://lcz.me/post/3814</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Tue, 26 May 2026 13:24:27 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 13:22:12 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> <a href="/post/3795">said</a>:</p>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/blackjack" aria-label="Profile: blackjack">@<bdi>blackjack</bdi></a> I trust your test results, but in my actual Hermes runs Q4_0 really does underperform, and even more so with OpenClaw: it frequently gets stuck in an infinite loop.</p>
</blockquote>
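<p dir="auto">Qwen's tool calling is very weak. I had it run a dedicated test of the patch tool, and it could not tell the tool name patch from the parameter name path. This is a model capability problem, and no amount of prompting fixes it. The only workaround is to rename the path parameter in Hermes to something it will not mix up (for example 路径, Chinese for "path"), and to give it more human-friendly feedback. The infinite loops are basically the model getting lost in a sea of tool calls. You could spin up an AI and have it dig through the logs.</p>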
]]></description><link>https://lcz.me/post/3813</link><guid isPermaLink="true">https://lcz.me/post/3813</guid><dc:creator><![CDATA[blackjack]]></dc:creator><pubDate>Tue, 26 May 2026 13:22:12 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 12:09:12 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/blackjack" aria-label="Profile: blackjack">@<bdi>blackjack</bdi></a> I trust your test results, but in my actual Hermes runs Q4_0 really does underperform, and even more so with OpenClaw: it frequently gets stuck in an infinite loop.</p>
]]></description><link>https://lcz.me/post/3795</link><guid isPermaLink="true">https://lcz.me/post/3795</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Tue, 26 May 2026 12:09:12 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 06:09:33 GMT]]></title><description><![CDATA[<blockquote>
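<p dir="auto">@老用户 <a href="/post/3493">said</a>:</p>
<p dir="auto">With the KV cache type at q4_0 precision, have you tested stability at long context? How do reasoning quality and tool calling hold up? In my own use, it starts to drift once the context reaches 70K-80K. So I sometimes wonder whether 256K context is really necessary; it might be better to raise the KV cache type precision instead.</p>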
</blockquote>
<p dir="auto">I happened to run a test on exactly this today:</p>
<h1>Hermes long-session KV <code>q8_0</code> vs <code>q4_0</code> A/B (2026-05-26)</h1>
<h2>Conclusion</h2>
<ul>
<li>In this real multi-turn, long-session Hermes benchmark, <code>KV q8_0</code> and <code>KV q4_0</code> <strong>showed no visible difference in semantic or structural quality</strong>.</li>
<li>Both runs had:
<ul>
<li>all <code>12 / 12</code> turns passed</li>
<li>final exact recall passed</li>
<li>final file-state recall passed</li>
<li><code>chain_diff</code> containing only <code>first_request + exact_message_append</code></li>
<li><strong>no</strong> <code>message_prefix_drift</code></li>
<li><strong>no</strong> <code>forcing full prompt re-processing</code></li>
<li><strong>no</strong> server-side <code>class=prefix-drift</code></li>
</ul>
</li>
</ul>
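<p dir="auto">Put more directly:<br />
On this real Hermes replay/tool/file multi-turn path, at least in this one run, <code>q8_0</code> was not noticeably more stable than <code>q4_0</code>, and <code>q4_0</code> showed no obvious drift or degradation.</p>
<h2>Benchmark configuration</h2>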
<ul>
<li>Date: <code>2026-05-26</code></li>
<li>Benchmark script: <code>~/custom-agent-stack/local-agent-setup/scripts/benchmark_hermes_long_session_kv.py</code></li>
<li>Results directory: <code>~/.cache/local-agent-setup/benchmarks/hermes-long-kv-20260526-ab1</code></li>
<li>Hermes runtime: repository version, <code>~/custom-agent-stack/hermes</code></li>
<li>llama-server: <code>~/src/ik_llama.cpp/build-mmq/bin/llama-server</code></li>
<li>Model: <code>~/models/Qwen3.6-27B-MTP-IQ4_KS.gguf</code></li>
<li>ctx: <code>128000</code></li>
<li>block chars: <code>30000</code></li>
<li>toolset: <code>file</code></li>
<li>compression: <code>off</code></li>
<li>ignore_rules: <code>on</code></li>
</ul>
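<h2>Task shape</h2>
<p dir="auto">This is neither a single-question, single-answer test nor a pseudo-benchmark that hits the OpenAI-compatible <code>/chat/completions</code> endpoint directly.<br />
It goes through the <strong>real Hermes multi-turn path</strong>:</p>
<ol>
<li><code>HermesCLI.chat()</code> appends to the history across consecutive turns</li>
<li>Long reference turns are mixed in</li>
<li><code>write_file / patch / read_file</code> calls are mixed in</li>
<li>An exact JSON recall is done at the end</li>
<li>Captured along the way: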
<ul>
<li>Hermes request diagnostics</li>
<li>llama-server console log</li>
<li>final semantic results and file state</li>
</ul>
</li>
</ol>
<h2>Results table</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>KV</th>
<th style="text-align:right">Passed Turns</th>
<th style="text-align:right">Final Recall</th>
<th style="text-align:right">Final File Recall</th>
<th style="text-align:right">Max Approx Tokens</th>
<th style="text-align:right"><code>exact_message_append</code></th>
<th style="text-align:right"><code>message_prefix_drift</code></th>
<th style="text-align:right"><code>forcing_full</code></th>
<th style="text-align:right"><code>prefix_drift</code></th>
<th style="text-align:right">acceptance avg</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>q8_0</code></td>
<td style="text-align:right"><code>12/12</code></td>
<td style="text-align:right">yes</td>
<td style="text-align:right">yes</td>
<td style="text-align:right"><code>41221</code></td>
<td style="text-align:right"><code>15</code></td>
<td style="text-align:right"><code>0</code></td>
<td style="text-align:right"><code>0</code></td>
<td style="text-align:right"><code>0</code></td>
<td style="text-align:right"><code>0.73017</code></td>
</tr>
<tr>
<td><code>q4_0</code></td>
<td style="text-align:right"><code>12/12</code></td>
<td style="text-align:right">yes</td>
<td style="text-align:right">yes</td>
<td style="text-align:right"><code>41221</code></td>
<td style="text-align:right"><code>15</code></td>
<td style="text-align:right"><code>0</code></td>
<td style="text-align:right"><code>0</code></td>
<td style="text-align:right"><code>0</code></td>
<td style="text-align:right"><code>0.75664</code></td>
</tr>
</tbody>
</table>
<h2>Interpretation</h2>
<h3>1. Structural stability</h3>
<p dir="auto">The two runs are identical here:</p>
<ul>
<li><code>first_request = 1</code></li>
<li><code>exact_message_append = 15</code></li>
<li><code>message_prefix_drift = 0</code></li>
<li><code>session_changed = 0</code></li>
<li><code>forcing full prompt re-processing = 0</code></li>
</ul>
<p dir="auto">This shows that:</p>
<ul>
<li>Hermes replay introduced no visible prefix drift in this set of tasks</li>
<li>llama-server's checkpoint / prompt cache path worked correctly</li>
<li><code>q4_0</code> was no more likely than <code>q8_0</code> to break the replay chain</li>
</ul>
<h3>2. Semantic results</h3>
<p dir="auto">The two runs are identical here as well:</p>
<ul>
<li>memory of the long reference blocks was not lost</li>
<li>the file tool chain made no errors</li>
<li>final exact JSON recall passed</li>
<li>final file-tail state recall passed</li>
</ul>
<p dir="auto">So on the question of "long-session drift", <strong>this run gives no evidence that <code>q8_0</code> is more stable</strong>.</p>
<h3>3. acceptance</h3>
<p dir="auto">In this single run:</p>
<ul>
<li><code>q8_0</code>: <code>0.73017</code></li>
<li><code>q4_0</code>: <code>0.75664</code></li>
</ul>
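<p dir="auto"><code>q4_0</code> is slightly higher, but the gap is small, and this is only the average from a single run.<br />
Without repeated samples, <strong>this does not support the conclusion that <code>q4_0</code> is better than <code>q8_0</code></strong>, and still less the reverse inference that "<code>q8_0</code> must be smarter in real Hermes long sessions".</p>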
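<p dir="auto">For scale, the gap between the two acceptance averages can be worked out directly. This is only arithmetic on the two numbers reported above, not a significance test; the variable names are chosen for illustration.</p>

```python
# Acceptance averages from the single run reported in this post.
acceptance = {"q8_0": 0.73017, "q4_0": 0.75664}

# Absolute gap, and the gap relative to the q8_0 baseline.
abs_gap = acceptance["q4_0"] - acceptance["q8_0"]
rel_gap = abs_gap / acceptance["q8_0"]

print(f"absolute gap: {abs_gap:.5f}")
print(f"relative gap: {rel_gap:.1%}")
```

<p dir="auto">That is a relative difference of roughly 3.6%, measured once per configuration, so run-to-run variance is unknown.</p>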
<h2>The more credible assessment for now</h2>
<p dir="auto">At least on your path:</p>
<ul>
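<li><code>KV q8_0</code> shows no clear extra advantage on <code>patch/path</code>-type errors</li>
<li>The main factors that actually affect tool success rate still look more like:
<ul>
<li>the model file itself</li>
<li>chat template / replay consistency</li>
<li>Hermes-side prefix stabilization hacks</li>
<li>llama-server-side checkpoint / single-slot behavior</li>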
</ul>
</li>
</ul>
<p dir="auto">Rather than simply:</p>
<ul>
<li>"Raise KV from <code>q4_0</code> to <code>q8_0</code> and the model suddenly can tell patch from path"</li>
</ul>
]]></description><link>https://lcz.me/post/3740</link><guid isPermaLink="true">https://lcz.me/post/3740</guid><dc:creator><![CDATA[blackjack]]></dc:creator><pubDate>Tue, 26 May 2026 06:09:33 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 06:00:56 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> <a href="/post/3726">said</a>:</p>
<p dir="auto">The 7900xtx + Ubuntu performs that well?<br />
I'm on Windows 10 + RTX 3090 Ti with:</p>
<p dir="auto">--n-gpu-layers 999 ^<br />
--ctx-size 131072 ^<br />
--batch-size 2048 ^<br />
--ubatch-size 1024 ^<br />
--flash-attn on ^<br />
--cache-type-k q4_0 ^<br />
--cache-type-v q4_0 ^<br />
--cache-type-k-draft q4_0 ^<br />
--cache-type-v-draft q4_0 ^</p>
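<p dir="auto">Without MTP, Qwen3.6 27B only reaches 30 tokens/s;<br />
with MTP enabled it gets even slower.</p>
<p dir="auto">This is especially bad at long context. For example, I asked the model to analyze a roughly 128K md file, and it crashed.</p>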
</blockquote>
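<p dir="auto">You could check whether your llama-server build uses MMQ or cuBLAS, or whether it is falling back to cuBLAS. In my own testing the performance gap between the two is huge.</p>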
<p dir="auto"><img src="https://upload.lcz.me/uploads/10f04120-003d-4bfc-bc38-320cbe206019.jpeg" alt="33077301-ba4c-4f17-8c7f-8bf20a217b19-image.jpeg" class=" img-fluid img-markdown" /></p>
]]></description><link>https://lcz.me/post/3737</link><guid isPermaLink="true">https://lcz.me/post/3737</guid><dc:creator><![CDATA[blackjack]]></dc:creator><pubDate>Tue, 26 May 2026 06:00:56 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 05:53:20 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> <a href="/post/3730">said</a>:</p>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/rock-shi" aria-label="Profile: rock-shi">@<bdi>rock-shi</bdi></a> That explains it: 24 GB is not enough for 128K context + MTP.</p>
</blockquote>
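<p dir="auto">I run the 27B at q4 quantization with K and V caches both at q8_0, 128K context, and MTP, on a 5090 laptop with 24 GB of memory, thinking enabled, and get 50+ tps. It flies.</p>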
]]></description><link>https://lcz.me/post/3735</link><guid isPermaLink="true">https://lcz.me/post/3735</guid><dc:creator><![CDATA[blackjack]]></dc:creator><pubDate>Tue, 26 May 2026 05:53:20 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 05:20:07 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> Get yourself another 3090. Once DFlash gets merged it will be a real blessing.</p>
]]></description><link>https://lcz.me/post/3731</link><guid isPermaLink="true">https://lcz.me/post/3731</guid><dc:creator><![CDATA[rock shi]]></dc:creator><pubDate>Tue, 26 May 2026 05:20:07 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 05:17:48 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/rock-shi" aria-label="Profile: rock-shi">@<bdi>rock-shi</bdi></a> That explains it: 24 GB is not enough for 128K context + MTP.</p>
]]></description><link>https://lcz.me/post/3730</link><guid isPermaLink="true">https://lcz.me/post/3730</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Tue, 26 May 2026 05:17:48 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 04:56:05 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> Or else 24 GB of VRAM is simply not enough. On my 3080 40g, the 27B with 128K context fully allocated takes up 32 GB of VRAM.</p>
]]></description><link>https://lcz.me/post/3729</link><guid isPermaLink="true">https://lcz.me/post/3729</guid><dc:creator><![CDATA[rock shi]]></dc:creator><pubDate>Tue, 26 May 2026 04:56:05 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Tue, 26 May 2026 04:53:24 GMT]]></title><description><![CDATA[<p dir="auto">The 7900xtx + Ubuntu performs that well?<br />
I'm on Windows 10 + RTX 3090 Ti with:</p>
<p dir="auto">--n-gpu-layers 999 ^<br />
--ctx-size 131072 ^<br />
--batch-size 2048 ^<br />
--ubatch-size 1024 ^<br />
--flash-attn on ^<br />
--cache-type-k q4_0 ^<br />
--cache-type-v q4_0 ^<br />
--cache-type-k-draft q4_0 ^<br />
--cache-type-v-draft q4_0 ^</p>
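<p dir="auto">Without MTP, Qwen3.6 27B only reaches 30 tokens/s;<br />
with MTP enabled it gets even slower.</p>
<p dir="auto">This is especially bad at long context. For example, I asked the model to analyze a roughly 128K md file, and it crashed.</p>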
]]></description><link>https://lcz.me/post/3726</link><guid isPermaLink="true">https://lcz.me/post/3726</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Tue, 26 May 2026 04:53:24 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Mon, 25 May 2026 22:12:07 GMT]]></title><description><![CDATA[<p dir="auto">Yes, 256K context really isn't that necessary, and it gets slower anyway. I've switched to 128K for now. The updated script:</p>
<pre><code>bruin@lmde7:~$ cat run-qwen3-vulkan.sh
#!/bin/bash

LLAMA_SERVER=/home/bruin/github/llama.cpp/build-vulkan/bin/llama-server

ARGS=(
  --model              /opt/gguf-models/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M.gguf
  --mmproj             /opt/gguf-models/unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf
  --chat-template-file /opt/gguf-models/froggeric/Qwen-Fixed-Chat-Templates/chat_template.jinja
  --spec-type          draft-mtp
  --spec-draft-n-max   2                       # Max draft tokens
  --ctx-size   131072                          # 262144 for 256k context
  --parallel   1                               # Single slot
  --flash-attn on                              # Enable FlashAttention
  --n-gpu-layers 999                           # All layers to GPU
  --cache-type-k q8_0                          # Quantize KV cache keys
  --cache-type-v q8_0                          # Quantize KV cache values
  #--fit off                                    #
  --threads 16                                 # CPU threads for token generation (tg)
  --threads-batch 16                           # CPU threads for batch/prompt processing (pp)
  --batch-size 2048                            # Batch size
  --ubatch-size 1024                           # Micro-batch size
  --no-warmup                                  # Skip warmup
  --no-mmap                                    # Don't memory-map model
  --mlock                                      # Lock model in RAM
  --cont-batching                              # Continuous batching
  --jinja                                      # Jinja chat template
  --reasoning auto                             # Auto reasoning
  --reasoning-format deepseek                  # Reasoning format
  --reasoning-budget 1024                      # Reasoning budget
  --metrics                                    # View metrics by accessing http://&lt;ip:port&gt;/metrics
  --log-verbosity 4                            # Log verbosity
  --host 0.0.0.0 --port 8000                   # Listen on all interfaces, port 8000
)

"${LLAMA_SERVER}" "${ARGS[@]}"
</code></pre>
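<p dir="auto">Since the script turns on <code>--metrics</code>, the endpoint can also be read from a script instead of a browser. A minimal sketch, assuming the server is reachable at <code>http://127.0.0.1:8000</code> and that the samples are plain <code>name value</code> lines without labels or timestamps. The function names and the sample text are made up for illustration and are not taken from llama-server output.</p>

```python
from urllib.request import urlopen


def parse_prometheus_text(text):
    """Parse unlabeled 'name value' samples from Prometheus text format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore anything that is not a plain sample line
    return samples


def fetch_metrics(base_url="http://127.0.0.1:8000"):
    """Fetch and parse /metrics from a server started with --metrics."""
    with urlopen(base_url + "/metrics", timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


# Made-up sample text, only to show the shape of the parser's output.
SAMPLE = """\
# HELP example_tokens_total Illustrative counter.
# TYPE example_tokens_total counter
example_tokens_total 1234
example_tokens_seconds 41.5
"""
print(parse_prometheus_text(SAMPLE))
```

<p dir="auto">Calling <code>fetch_metrics()</code> while the server is running returns the same kind of dictionary, keyed by whatever metric names the server exposes.</p>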
]]></description><link>https://lcz.me/post/3666</link><guid isPermaLink="true">https://lcz.me/post/3666</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Mon, 25 May 2026 22:12:07 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Mon, 25 May 2026 03:40:31 GMT]]></title><description><![CDATA[<p dir="auto">Nice. It would be even better with some screenshots of it actually running.</p>
]]></description><link>https://lcz.me/post/3510</link><guid isPermaLink="true">https://lcz.me/post/3510</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 25 May 2026 03:40:31 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Mon, 25 May 2026 03:12:10 GMT]]></title><description><![CDATA[<p dir="auto">@刘海彬 Yep, still testing. As for vision, I just had Hermes work through the setup on its own, and it works now. I sent it a floor plan and its analysis was basically correct. It can also handle the labeled dimensions and so on. I'll use it for a few more days and see.</p>
]]></description><link>https://lcz.me/post/3501</link><guid isPermaLink="true">https://lcz.me/post/3501</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Mon, 25 May 2026 03:12:10 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Mon, 25 May 2026 02:10:04 GMT]]></title><description><![CDATA[<p dir="auto">With the KV cache type at q4_0 precision, have you tested stability at long context? How do reasoning quality and tool calling hold up? In my own use, it starts to drift once the context reaches 70K-80K. So I sometimes wonder whether 256K context is really necessary; it might be better to raise the KV cache type precision instead.</p>
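<p dir="auto">The trade-off between context length and KV precision can be put in rough numbers. A minimal sketch, assuming KV cache size scales linearly with context length and with bytes per cached element. The block sizes are ggml's (q4_0 stores 32 values in 18 bytes, q8_0 stores 32 values in 34 bytes); the function names are made up for illustration.</p>

```python
# Bytes per 32-element block: a 2-byte fp16 scale plus the packed values.
BLOCK_ELEMS = 32
BLOCK_BYTES = {"q4_0": 2 + 16, "q8_0": 2 + 32, "f16": 64}


def bits_per_element(cache_type):
    """Average storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_ELEMS


def equivalent_context(ctx, from_type, to_type):
    """Context length under to_type using the same KV bytes as ctx under from_type."""
    return int(ctx * BLOCK_BYTES[from_type] / BLOCK_BYTES[to_type])


print(bits_per_element("q4_0"), bits_per_element("q8_0"))
print(equivalent_context(131072, "q8_0", "q4_0"))
```

<p dir="auto">Under those assumptions, a q8_0 cache at 131072 context takes about as much space as a q4_0 cache at roughly 247K context, so halving the context from 256K pays for most of the move from q4_0 to q8_0.</p>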
]]></description><link>https://lcz.me/post/3493</link><guid isPermaLink="true">https://lcz.me/post/3493</guid><dc:creator><![CDATA[[[global:former-user]]]]></dc:creator><pubDate>Mon, 25 May 2026 02:10:04 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp+qwen3.6-27b 初步测试 on Mon, 25 May 2026 01:54:41 GMT]]></title><description><![CDATA[<p dir="auto">How did this end up as a new thread? It was supposed to go under the <a href="https://lcz.me/topic/268/%E4%B8%8B%E5%8D%95-7900xtx-%E5%BC%80%E5%A7%8B%E6%8A%98%E8%85%BE-llama.cpp/">original thread</a>... my mistake...</p>
]]></description><link>https://lcz.me/post/3492</link><guid isPermaLink="true">https://lcz.me/post/3492</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Mon, 25 May 2026 01:54:41 GMT</pubDate></item></channel></rss>