<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[An Unconventional 16GB+12GB Setup]]></title><description><![CDATA[<p dir="auto">This post is for people who already own a 16GB GPU and want to experiment on a budget.</p>
<ul>
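<li>My starting point was an RTX 5070 Ti with 16GB of VRAM and an RTX 2060 with 6GB.</li>
<li>Running a 27B model on the 5070 Ti alone requires CPU offload; at 160k context, LM Studio only reaches single-digit tokens/s generation speed.</li>
<li>After plugging in the 6GB 2060, carefully configuring llama.cpp, and shortening the context, generation speed rose to about 20 tokens/s, which is usable.</li>
<li>I later bought a 12GB 3060. With much more VRAM headroom, llama.cpp generation speed rose to nearly 30 tokens/s.</li>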
</ul>
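<p dir="auto">Overall, a single card with plenty of VRAM is still the better choice. A used 3090 costs about the same as a new 5070 Ti, or even less. 24GB does not leave much room for context either, but it can reach 40+ tokens/s generation speed. The 5070 Ti actually has more compute than the 3090, but its limited VRAM becomes the bottleneck.</p>
<p dir="auto">The specific settings are:</p>
<p dir="auto">I use the Vulkan build of llama.cpp. The CUDA build seems to have higher overhead and cannot reach the same context length. LM Studio uses llama.cpp as its backend, but it does not expose enough tunable parameters.</p>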
<p dir="auto">models.ini</p>
<pre><code>[unsloth/qwen3.6-27b]
model = ./unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_S.gguf
mmproj = ./unsloth/Qwen3.6-27B-GGUF/mmproj-F32.gguf
no-mmproj-offload = true
no-mmap = true
mlock = false
cache-type-k = q8_0
cache-type-v = q8_0
reasoning = on
dev = Vulkan1,Vulkan2
n-gpu-layers = 999
t = 0
split-mode = layer
tensor-split = 66,34
kv-unified = true
c = 160000
np = 1
; Thinking mode for precise coding tasks
temperature = 0.6
top-k = 20
top-p = 0.95
min-p = 0.0
repeat-penalty = 1.0
presence-penalty = 0.0
</code></pre>
<pre><code>llama-server.exe \
    --models-preset ./models.ini \
    --host 0.0.0.0 \
    --models-max 1 \
    --port 1235
</code></pre>
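<p dir="auto">The <code>tensor-split = 66,34</code> line in models.ini gives relative proportions, so about two thirds of the model goes to the first device and one third to the second. As a rough illustration, here is a minimal sketch of how such weights could translate into per-device layer counts. It assumes simple proportional rounding, which may not match what llama.cpp does internally, and the layer count of 64 is a made-up placeholder, not a property of the model above.</p>

```python
def split_layers(n_layers, weights):
    """Divide n_layers among devices in proportion to weights.

    Uses largest-remainder rounding so the counts always sum to n_layers.
    This is an illustrative assumption, not llama.cpp's actual algorithm.
    """
    total = sum(weights)
    exact = [n_layers * w / total for w in weights]
    counts = [int(x) for x in exact]  # round each share down first
    leftover = n_layers - sum(counts)
    # Hand the remaining layers to the devices with the largest fractional parts.
    order = sorted(range(len(weights)),
                   key=lambda i: exact[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical 64-layer model with the 66,34 split from models.ini.
print(split_layers(64, [66, 34]))
```

<p dir="auto">If one card runs out of memory at load time, shifting a few points between the two weights and recomputing is a quick way to see how many layers would move.</p>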
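<p dir="auto">For the dev parameter, run llama-server.exe --list-devices to see the actual device names.<br />
Alternatively, converting the contents of models.ini into llama-server command-line arguments is equivalent.</p>
<p dir="auto">edit: I may have posted this in the wrong section, sorry.</p>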
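<p dir="auto">As a sketch of that conversion, the following script turns one preset section into an argument list. It assumes that each key maps one-to-one onto a flag of the same name, that <code>true</code> means a bare switch, and that <code>false</code> means the switch is left out. The helper name <code>preset_to_args</code> and the <code>SHORT_KEYS</code> set are made up for illustration; check the resulting flags against <code>llama-server --help</code> for your build.</p>

```python
import configparser

# Assumption: these keys are single-dash short flags; everything else is a long flag.
SHORT_KEYS = {"c", "t", "np", "dev"}


def preset_to_args(ini_text, section):
    """Turn one preset section into a flat list of command-line arguments."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys exactly as written
    parser.read_string(ini_text)

    args = []
    for key, value in parser[section].items():
        flag = ("-" if key in SHORT_KEYS else "--") + key
        if value.lower() == "true":
            args.append(flag)  # boolean switch, no value
        elif value.lower() == "false":
            continue  # assumption: false means the switch is simply omitted
        else:
            args.extend([flag, value])
    return args


example = """
[demo]
model = ./model.gguf
no-mmap = true
mlock = false
c = 160000
dev = Vulkan1,Vulkan2
"""
print(" ".join(preset_to_args(example, "demo")))
```

<p dir="auto">The <code>[demo]</code> section is a shortened stand-in for the real preset; passing the full models.ini text and its section name works the same way.</p>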
]]></description><link>https://lcz.me/topic/29/另类16gb-12gb配置</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 07:49:27 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/29.rss" rel="self" type="application/rss+xml"/><pubDate>Tue, 05 May 2026 18:02:24 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to An Unconventional 16GB+12GB Setup on Thu, 07 May 2026 02:55:19 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/tomcatzh" aria-label="Profile: tomcatzh">@<bdi>tomcatzh</bdi></a> Around 1000.</p>
]]></description><link>https://lcz.me/post/398</link><guid isPermaLink="true">https://lcz.me/post/398</guid><dc:creator><![CDATA[stakira]]></dc:creator><pubDate>Thu, 07 May 2026 02:55:19 GMT</pubDate></item><item><title><![CDATA[Reply to An Unconventional 16GB+12GB Setup on Thu, 07 May 2026 01:10:35 GMT]]></title><description><![CDATA[<p dir="auto">What about prefill speed? When running agents, prefills of 30-40k tokens, or even 70k-100k, are very common.</p>
<p dir="auto">It doesn't matter when the cache hits, but there are always cold starts.</p>
]]></description><link>https://lcz.me/post/394</link><guid isPermaLink="true">https://lcz.me/post/394</guid><dc:creator><![CDATA[tomcatzh]]></dc:creator><pubDate>Thu, 07 May 2026 01:10:35 GMT</pubDate></item><item><title><![CDATA[Reply to An Unconventional 16GB+12GB Setup on Wed, 06 May 2026 14:18:42 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/chia-an-yang" aria-label="Profile: CHIA-AN-YANG">@<bdi>CHIA-AN-YANG</bdi></a> Switching cards was the right call. The 5070 Ti in this setup has plenty of compute, but it is held back by the 3060. The card is expensive, and the result is still worse than a single 3090.</p>
]]></description><link>https://lcz.me/post/345</link><guid isPermaLink="true">https://lcz.me/post/345</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Wed, 06 May 2026 14:18:42 GMT</pubDate></item><item><title><![CDATA[Reply to An Unconventional 16GB+12GB Setup on Wed, 06 May 2026 14:14:33 GMT]]></title><description><![CDATA[<p dir="auto">I tried three RTX 3060 12GB cards before and couldn't get it working. I later switched to a 7900 XTX 24GB and the experience was much better.</p>
]]></description><link>https://lcz.me/post/342</link><guid isPermaLink="true">https://lcz.me/post/342</guid><dc:creator><![CDATA[CHIA AN YANG]]></dc:creator><pubDate>Wed, 06 May 2026 14:14:33 GMT</pubDate></item><item><title><![CDATA[Reply to An Unconventional 16GB+12GB Setup on Tue, 05 May 2026 18:15:05 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/stakira" aria-label="Profile: stakira">@<bdi>stakira</bdi></a> Impressive! I've been wanting to tinker with this too. What you did is really useful and gives me something to build on: I can test running an AMD card and an NVIDIA card together with layer splitting on Vulkan<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f602.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--joy" style="height:23px;width:auto;vertical-align:middle" title="😂" alt="😂" />. Speaking of layer splitting, you're right: if the main card has enough compute and only lacks VRAM, using a secondary card for offload is far better value than falling back to CPU memory. That's a great approach. Thanks for sharing, great post!</p>
]]></description><link>https://lcz.me/post/248</link><guid isPermaLink="true">https://lcz.me/post/248</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Tue, 05 May 2026 18:15:05 GMT</pubDate></item></channel></rss>