<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[ik_llama.cpp performance issue]]></title><description><![CDATA[<p dir="auto"><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/" rel="nofollow ugc">https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/</a></p>
<p dir="auto">I followed the settings from this post,<br />
but the speed (~30 tokens/s) turned out slower than llama.cpp (~50 tokens/s).<br />
Does anyone know why?<br />
OS: Win11<br />
GPU: 4070 12 GB<br />
RAM: DDR4, 2×16 GB</p>
]]></description><link>https://lcz.me/topic/467/ik_llama.cpp效能問題</link><generator>RSS for Node</generator><lastBuildDate>Wed, 01 Jul 2026 12:08:36 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/467.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 08 Jun 2026 01:41:13 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Sun, 21 Jun 2026 02:20:42 GMT]]></title><description><![CDATA[<p dir="auto">You could try the BEELLAMA 3.2 preview, which uses Huawei's kavrn KV cache format, though it doesn't seem to support the draft format yet.</p>
]]></description><link>https://lcz.me/post/7647</link><guid isPermaLink="true">https://lcz.me/post/7647</guid><dc:creator><![CDATA[stxpnet]]></dc:creator><pubDate>Sun, 21 Jun 2026 02:20:42 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Sat, 20 Jun 2026 14:19:14 GMT]]></title><description><![CDATA[<p dir="auto">I got it working, and the results are great:<br />
prompt eval time =    2634.48 ms /  2988 tokens (    0.88 ms per token,  1134.19 tokens per second)<br />
eval time =  132563.02 ms /  9496 tokens (   13.96 ms per token,    71.63 tokens per second)<br />
total time =  135197.50 ms / 12484 tokens</p>
<p dir="auto">Command:<br />
.\build\bin\Release\llama-server.exe -m "C:\Users\User.lmstudio\models\byteshape\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-IQ4_XS-3.53bpw.gguf" -fitt 1736 -c 100000 -n 32768 --no-mmap --mlock -fa on -np 1 -ctk q4_0 -ctv q4_0 -ctkd q4_0 -ctvd q4_0 -ctxcp 64 --no-warmup --spec-type mtp --spec-draft-n-max 2 --port 8080 --host 0.0.0.0</p>
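<p dir="auto">As a quick sanity check, the throughput figures in the log quoted above can be recomputed from the raw token counts and timings. This is a minimal sketch: the numbers are copied from the log, while the helper name <code>tokens_per_second</code> is illustrative only and not part of llama.cpp or ik_llama.cpp.</p>

```python
# Figures copied from the llama-server log quoted in this post.
prompt_ms, prompt_tokens = 2634.48, 2988
eval_ms, eval_tokens = 132563.02, 9496


def tokens_per_second(tokens, elapsed_ms):
    """Throughput in tokens/s from a token count and a duration in ms.

    Hypothetical helper for illustration; not a llama.cpp API.
    """
    return tokens / (elapsed_ms / 1000.0)


prompt_tps = tokens_per_second(prompt_tokens, prompt_ms)
eval_tps = tokens_per_second(eval_tokens, eval_ms)
total_ms = prompt_ms + eval_ms
total_tokens = prompt_tokens + eval_tokens

print(f"prompt eval: {prompt_tps:.2f} tokens/s")
print(f"eval:        {eval_tps:.2f} tokens/s")
print(f"total:       {total_ms:.2f} ms / {total_tokens} tokens")
```

<p dir="auto">The eval (generation) rate is the one to compare against the ~30 and ~50 tokens/s figures from the first post; the prompt eval rate measures prompt processing and is not comparable.</p>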
]]></description><link>https://lcz.me/post/7615</link><guid isPermaLink="true">https://lcz.me/post/7615</guid><dc:creator><![CDATA[Hcl]]></dc:creator><pubDate>Sat, 20 Jun 2026 14:19:14 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Sun, 14 Jun 2026 07:20:21 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/cs6" aria-label="Profile: CS6">@<bdi>CS6</bdi></a> OK, thanks. I'll give it another try.</p>
]]></description><link>https://lcz.me/post/6790</link><guid isPermaLink="true">https://lcz.me/post/6790</guid><dc:creator><![CDATA[Hcl]]></dc:creator><pubDate>Sun, 14 Jun 2026 07:20:21 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Fri, 12 Jun 2026 04:13:15 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/hcl" aria-label="Profile: Hcl">@<bdi>Hcl</bdi></a> <a href="/post/6433">said</a>:</p>
<p dir="auto">CachyOS</p>
</blockquote>
<p dir="auto">What you need is an environment with no UI at all. Any GUI (window manager) takes up GPU resources.</p>
]]></description><link>https://lcz.me/post/6438</link><guid isPermaLink="true">https://lcz.me/post/6438</guid><dc:creator><![CDATA[CS6]]></dc:creator><pubDate>Fri, 12 Jun 2026 04:13:15 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Fri, 12 Jun 2026 03:27:29 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/cs6" aria-label="Profile: CS6">@<bdi>CS6</bdi></a> I'm considering switching to CachyOS, as in that Reddit post.</p>
]]></description><link>https://lcz.me/post/6433</link><guid isPermaLink="true">https://lcz.me/post/6433</guid><dc:creator><![CDATA[Hcl]]></dc:creator><pubDate>Fri, 12 Jun 2026 03:27:29 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Thu, 11 Jun 2026 22:13:18 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/hcl" aria-label="Profile: Hcl">@<bdi>Hcl</bdi></a> ik_llama.cpp being slower than llama.cpp is expected with 12 GB of VRAM on Windows, for the following reasons:</p>
<p dir="auto">ik_llama.cpp's core optimization goal is squeezing VRAM usage as far as possible. The non-standard quantization formats it uses, such as IQ4_XS, do compress better than Q4_K_M, but the cost is extra CPU/GPU compute during dequantization. In your case, running a 35B model on 12 GB of VRAM:</p>
<ol>
<li>
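<p dir="auto">The model cannot fit entirely in VRAM (a 35B model is roughly 20 GB at Q4 and roughly 17 GB at IQ4_XS), so your 12 GB inevitably triggers heavy CPU offload, and ik's CPU offload path has not been specifically optimized for Windows.</p>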
</li>
<li>
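<p dir="auto">CUDA on Windows has higher overhead than on Linux to begin with. Mainline llama.cpp's Windows CUDA backend has been polished by a large user base, whereas ik_llama.cpp, as a personal fork, has not gone as far in CUDA kernel optimization.</p>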
</li>
<li>
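<p dir="auto">The 110 t/s result on Reddit was achieved with DDR5 6000 on Linux, and memory bandwidth matters enormously in offload scenarios. Your dual-channel DDR4 bandwidth (~40-50 GB/s) is only about half that of DDR5 6000 (~90 GB/s).</p>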
</li>
</ol>
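<p dir="auto">Recommendation: with 12 GB of VRAM and DDR4, the best fit is a 7B-14B model at Q4 running entirely in VRAM, or a 20B+ model at Q3_K_M with -ngl 20 (offloading only 20 layers to the GPU). ik_llama.cpp's advantage lies in edge cases where VRAM is extremely tight (for example, running a 14B model on 6 GB); with 12 GB it has no advantage.</p>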
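<p dir="auto">The size and bandwidth figures above can be approximated with two back-of-the-envelope formulas. The sketch below rests on several assumptions: DDR4-3200 is a hypothetical speed (the thread does not state the actual DDR4 rating), each channel is taken as 64 bits wide, and the results are theoretical peaks, so they come out higher than the real-world numbers quoted above. The 35B parameter count and the 3.53 bits per weight are read off the model file name posted in this thread, and 4.5 bits per weight is the nominal rate of Q4_0. The function names are illustrative only.</p>

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s.

    transfers per second x bytes per transfer x number of channels.
    Assumes 64-bit (8-byte) channels in a dual-channel configuration.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 12  # RTX 4070, as stated in the first post

ddr4 = peak_bandwidth_gb_s(3200)  # ASSUMPTION: DDR4-3200, not given in the thread
ddr5 = peak_bandwidth_gb_s(6000)  # DDR5 6000, as cited for the Reddit setup
print(f"DDR4-3200 dual channel: {ddr4:.1f} GB/s")
print(f"DDR5-6000 dual channel: {ddr5:.1f} GB/s")
print(f"DDR4 / DDR5 ratio:      {ddr4 / ddr5:.2f}")

# Bits-per-weight values: 4.5 is nominal Q4_0; 3.53 comes from the file name.
for label, bpw in [("Q4_0 (4.5 bpw)", 4.5), ("file name (3.53 bpw)", 3.53)]:
    size = weights_gb(35, bpw)
    spill = max(size - VRAM_GB, 0.0)
    print(f"{label}: about {size:.1f} GB of weights, "
          f"at least {spill:.1f} GB beyond {VRAM_GB} GB of VRAM")
```

<p dir="auto">These estimates ignore the KV cache, compute buffers, and the VRAM already used by the desktop, so the real amount that spills into system RAM is larger than the figure printed here.</p>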
]]></description><link>https://lcz.me/post/6390</link><guid isPermaLink="true">https://lcz.me/post/6390</guid><dc:creator><![CDATA[Xiaote]]></dc:creator><pubDate>Thu, 11 Jun 2026 22:13:18 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Thu, 11 Jun 2026 10:05:22 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/cs6" aria-label="Profile: CS6">@<bdi>CS6</bdi></a> There isn't enough RAM either. <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f602.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--joy" style="height:23px;width:auto;vertical-align:middle" title="😂" alt="😂" /></p>
]]></description><link>https://lcz.me/post/6302</link><guid isPermaLink="true">https://lcz.me/post/6302</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Thu, 11 Jun 2026 10:05:22 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Thu, 11 Jun 2026 09:07:29 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/hcl" aria-label="Profile: hcl">@<bdi>hcl</bdi></a> Buddy, give up on Win11 first. With this little RAM and VRAM, there isn't even enough for the OS itself to work with...</p>
]]></description><link>https://lcz.me/post/6295</link><guid isPermaLink="true">https://lcz.me/post/6295</guid><dc:creator><![CDATA[CS6]]></dc:creator><pubDate>Thu, 11 Jun 2026 09:07:29 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Thu, 11 Jun 2026 09:05:20 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/566656661" aria-label="Profile: 566656661">@<bdi>566656661</bdi></a> What I find strange is that, on Windows, ik_llama.cpp actually runs slower than llama.cpp for me.</p>
]]></description><link>https://lcz.me/post/6294</link><guid isPermaLink="true">https://lcz.me/post/6294</guid><dc:creator><![CDATA[Hcl]]></dc:creator><pubDate>Thu, 11 Jun 2026 09:05:20 GMT</pubDate></item><item><title><![CDATA[Reply to ik_llama.cpp performance issue on Mon, 08 Jun 2026 01:49:07 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/hcl" aria-label="Profile: Hcl">@<bdi>Hcl</bdi></a></p>
<p dir="auto">The model is too large to fit in 12 GB of VRAM, so part of it has to spill into system RAM. The Reddit setup used DDR5 6000, which is considerably faster than DDR4.</p>
]]></description><link>https://lcz.me/post/5634</link><guid isPermaLink="true">https://lcz.me/post/5634</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Mon, 08 Jun 2026 01:49:07 GMT</pubDate></item></channel></rss>