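<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[大模型16G卡的春天]]></title><description><![CDATA[<p dir="auto">The group owner suggested I write this post. It started out as an apology to forum member hotpigwk: when he asked about running large models on a v100 16g card, I trashed the card in his thread and said it was no good. This morning on YouTube I happened to see a commenter say that his own v100 16g reaches a 100k context once turboQuant is enabled. That caught my interest right away, because I have a 5600ti 16g card on hand. Earlier this month I followed 老特's guide and set up the qwen3.6-27b q4 model to drive hermes, and the intelligence is there! It can genuinely do real work, but whatever I tried, it only ran stably at a 20k context. A 16g card suddenly felt like a chicken rib: too bland to enjoy, too good to throw away. It works, but it is never worry-free. hermes alone already takes about 15K of context, so any moderately complex task OOMs immediately. Enough rambling, on to the point: a 16g card can in fact run a 64K context or even more.<br />
The original comment: I run a 27b model on a v100 16g; with turboQuant enabled the context doubles and can reach 100k. Model: <a href="https://huggingface.co/sokann/Qwen3.6-27B-GGUF-4.262bpw" rel="nofollow ugc">https://huggingface.co/sokann/Qwen3.6-27B-GGUF-4.262bpw</a> It has to be loaded with ik_llama.cpp, which you need to compile yourself; the upside is that turboQuant is integrated, so the KV capacity doubles. The context can be opened up to 100K at roughly 28 tokens/s. Key parameters: -c 102400 -np 1 -fa on -ngl 99  -ctk q4_0 -khad -ctv q4_0 -vhad -wgt 1<br />
So at noon today I tested it on my 16g 5600ti; if it could reach 100k, it would be quite good for hermes.<br />
1. For the ik_llama.cpp mentioned in the comment, I used this build: <a href="https://github.com/Thireus/ik_llama.cpp/releases/tag/main-b4744-8d7891f" rel="nofollow ugc">https://github.com/Thireus/ik_llama.cpp/releases/tag/main-b4744-8d7891f</a><br />
The release page has builds for AMD cards and for NVIDIA cards; the NVIDIA builds cover both Windows and Linux, the AMD build is Linux only. To use it, download, unzip, and write a script to launch it. You can use mine as a reference:<br />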
</p>
<pre><code class="language-bash">#!/bin/bash
# ==================== Start llama-server ====================
echo "========================================"
echo "  Starting Qwen3.6-27B IQ4_XS..."
echo "========================================"
echo ""

export LD_LIBRARY_PATH=/home/cyg/miniconda3/envs/vllm/lib/python3.10/site-packages/nvidia/cu13/lib:$LD_LIBRARY_PATH
/home/cyg/ik_llama.cpp/build/bin/llama-server \
  -m /home/cyg/models/Qwen3.6-27B-i1-IQ4_XS.gguf \
  -c 51200 \
  -np 1 \
  -fa on \
  -ngl 99 \
  -ctk q4_0 \
  -khad \
  -ctv q4_0 \
  -vhad \
  --host 0.0.0.0 \
  --port 8000 \
  --cont-batching \
  --jinja \
  --mlock \
  --threads 10 \
  --threads-batch 12

# ==================== Exit handling ====================
echo ""
echo "========================================"
echo "  llama-server has stopped"
echo "========================================"
</code></pre>
<p dir="auto">
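1. The script above runs a 50K context with both the K and V caches at q4. It can actually go up to 65536, i.e. a 64K context; I noticed that the Sunflower (向日葵) remote-desktop client is taking about 400M of VRAM.<br />
2. I did not download the model from the original comment, <a href="https://huggingface.co/sokann/Qwen3.6-27B-GGUF-4.262bpw" rel="nofollow ugc">https://huggingface.co/sokann/Qwen3.6-27B-GGUF-4.262bpw</a>; I kept using my earlier Qwen3.6-27B-i1-IQ4_XS.gguf. I ran the same model on purpose, to find out whether ik_llama.cpp with turboQuant really doubles the context compared with the official llama.cpp.<br />
3. Result: it really did double. Previously I could only run a 20k context stably; with the extreme dual-q4 setting I got to a 64K context, at which point VRAM usage was at 99%. To be safe I settled on a 50k context. Speed did not change noticeably and stayed at 25-27 t/s. You can test the model from the original comment yourselves; it may really reach 100k. 50k is enough for me.<br />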
<img src="https://upload.lcz.me/uploads/28d3e885-8427-4952-8f91-c364388f2bdc.jpeg" alt="abf1456a-c2d3-4c7c-b0e5-e678a61cf6ea-image.jpeg" class=" img-fluid img-markdown" /><br />
Below are the test results for different compression settings.<br />
<img src="https://upload.lcz.me/uploads/11bd7a74-c309-4f03-af67-f72c44e2c285.jpeg" alt="4821a1d1-c447-402e-985d-357f64a685a7-image.jpeg" class=" img-fluid img-markdown" /><br />
<img src="https://upload.lcz.me/uploads/ca341915-023b-4719-b282-fd82f7fe4614.jpeg" alt="e273c79e-9c68-4da3-aa10-cbdcb95cedd9-image.jpeg" class=" img-fluid img-markdown" /><br />
<img src="https://upload.lcz.me/uploads/4620dbf5-f584-4d69-a2ec-1703066cf6fe.jpeg" alt="6e2fc60b-fb8e-4add-b9a2-f2077e3a30a7-image.jpeg" class=" img-fluid img-markdown" /><br />
<img src="https://upload.lcz.me/uploads/b8ce0769-169e-4256-8af3-1791df23d614.jpeg" alt="d35220a1-efe5-46d9-88d6-acf2890c0f70-image.jpeg" class=" img-fluid img-markdown" /><br />
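All of the screenshots above come from real tests, not armchair guesses. If 老特 wants screenshots of any particular test, just ask and I will run it when I have time.</p>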
]]></description><link>https://lcz.me/topic/174/大模型16g卡的春天</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 06:08:20 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/174.rss" rel="self" type="application/rss+xml"/><pubDate>Sat, 16 May 2026 14:10:01 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to 大模型16G卡的春天 on Tue, 19 May 2026 01:58:22 GMT]]></title><description><![CDATA[<h2>Connecting Hermes</h2>
<p dir="auto">Hermes supports any OpenAI-compatible API endpoint. The interactive command below adds the local llama-server:</p>
<pre><code class="language-bash">$ hermes model

  Current model:    deepseek-v4-flash
  Active provider:  DeepSeek


Custom OpenAI-compatible endpoint configuration:

API base URL [e.g. https://api.example.com/v1]: http://192.168.5.84:8000/v1
API key [optional]:
Verified endpoint via http://192.168.5.84:8000/v1/models (1 model(s) visible)

Select API compatibility mode:
  1. Auto-detect [current]
  2. Chat Completions
  3. Responses / Codex
  4. Anthropic Messages
Choice [1-4, Enter to keep current/detected]:
  API mode: auto-detect
  Detected model: ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf
  Use this model? [Y/n]:
Context length in tokens [leave blank for auto-detect]: 65536
Display name [192.168.5.84:8000]:
Default model set to: ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf (via http://192.168.5.84:8000/v1)
  💾 Saved to custom providers as "192.168.5.84:8000" (edit in config.yaml)
</code></pre>
<p dir="auto">Configuration notes:</p>
<ul>
<li><strong>Base URL</strong>: <code>http://&lt;llama-server IP&gt;:8000/v1</code>; note that the path must end with <code>/v1</code></li>
<li><strong>API Key</strong>: leave blank; the local llama-server does not need one</li>
<li><strong>API Mode</strong>: Auto-detect is fine (or Chat Completions)</li>
<li><strong>Context Length</strong>: enter 65536 manually (matching the <code>-c</code> parameter)</li>
<li><strong>Keep <code>--jinja</code> in the launch parameters</strong>: otherwise the tools parameter sent by Hermes is ignored and tool calling stops working</li>
</ul>
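<p dir="auto">Before pointing Hermes at the server, the endpoint can be checked by hand. The following is a minimal sketch, assuming the server exposes the OpenAI-style <code>/v1/models</code> listing that Hermes queries in the transcript above; the helper names <code>models_url</code>, <code>model_ids</code> and <code>list_models</code> are made up for this illustration:</p>

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from a base URL that must end in /v1."""
    base = base_url.rstrip("/")
    if not base.endswith("/v1"):
        raise ValueError("base URL must end with /v1")
    return base + "/models"


def model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}]} listing."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str, timeout: float = 5.0) -> list:
    """Query a running server and return the model ids it reports."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

<p dir="auto">Calling <code>list_models("http://192.168.5.84:8000/v1")</code> against the server from the transcript returns whatever ids that server reports; a base URL without the <code>/v1</code> suffix is rejected early instead of failing later inside Hermes.</p>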
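<h3>Context compression configuration</h3>
<p dir="auto">By default Hermes compresses the message history automatically once context usage reaches a threshold, so that the context window is not exceeded. Below is the conversation in which Hermes reports the current configuration:</p>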
<pre><code>────────────────────────────────────────
● the context window compact/compression
  threshold for the current model you are using
────────────────────────────────────────

╭─ ⚕ Hermes ───────────────────────────────────────────╮
    Here's what's configured for your session:

    compression:
      enabled: true
      threshold: 0.8
      target_ratio: 0.2
      protect_last_n: 20
      hygiene_hard_message_limit: 400

    How it works:
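    - threshold: 0.8 (compression triggers once context usage reaches 80%)
    - target_ratio: 0.2 (after compression the context is trimmed to ~20% of its original size)
    - protect_last_n: 20 (the most recent 20 messages are protected from compression)
    - hygiene_hard_message_limit: 400 (a forced reset once the message count
      reaches 400, to prevent context degradation)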
╰──────────────────────────────────────────────────────╯
</code></pre>
<p dir="auto">The corresponding section of <code>~/.hermes/config.yaml</code>:</p>
<pre><code class="language-yaml">compression:
  enabled: true
  threshold: 0.8
  target_ratio: 0.2
  protect_last_n: 20
  hygiene_hard_message_limit: 400
</code></pre>
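<p dir="auto">To see what these settings mean in tokens, here is a rough sketch. It assumes that <code>threshold</code> is applied to the context length configured in Hermes and that <code>target_ratio</code> scales the token count present when compression fires; the second point is only my reading of the "~20% of its original size" explanation above, and the function names are hypothetical:</p>

```python
def compression_trigger_tokens(context_length: int, threshold: float) -> int:
    """Token count at which compression starts (usage reaches threshold * window)."""
    return int(context_length * threshold)


def tokens_after_compression(tokens_before: int, target_ratio: float) -> int:
    """Approximate size after compression, assuming target_ratio scales the
    token count that was present when compression fired (an assumption)."""
    return int(tokens_before * target_ratio)


# The two context sizes used in this thread: -c 51200 and -c 65536.
for ctx in (51200, 65536):
    trigger = compression_trigger_tokens(ctx, 0.8)
    after = tokens_after_compression(trigger, 0.2)
    print(f"-c {ctx}: compress at ~{trigger} tokens, down to ~{after}")
```

<p dir="auto">With a 65536-token window the trigger sits a little above 52K tokens, so a session only compresses once it is already deep into the range where long-context quality matters.</p>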
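<h3>Auxiliary model (Aux Model) configuration</h3>
<p dir="auto">The auxiliary model is a separate model that Hermes dedicates to background tasks (context compression, title generation, vision analysis, session search),<br />
so those tasks do not use the main model's compute. The default <code>auto</code> falls back to the main model (Qwen3.6-27B), which is wasteful and slow for a 27B model.<br />
Below is the conversation that switches aux to a DeepSeek cloud model:</p>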
<pre><code>────────────────────────────────────────
● ok, i am not aware of the concept of
  "aux model". what the aux model for?
────────────────────────────────────────

╭─ ⚕ Hermes ───────────────────────────────────────────╮
    An auxiliary (aux) model is a separate, usually
    cheaper and faster model that Hermes uses for
    background tasks:

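    1. Context compression: once context reaches 80%, Hermes
       has to compress the message history. The main model is
       not used for this (wasteful and slow)
    2. Vision tasks: image/browser visual analysis
    3. Session search: searching and ranking past conversations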

    Your current setup:
      compression:
        provider: auto
        model: ''

    auto tries to find OPENROUTER_API_KEY or GOOGLE_API_KEY and,
    if neither exists, falls back to the main model Qwen3.6-27B
╰──────────────────────────────────────────────────────╯

────────────────────────────────────────
● i see. please set the aux model to be
  deepseek-v4-flash
────────────────────────────────────────

╭─ ⚕ Hermes ───────────────────────────────────────────╮
    Done. The auxiliary compression model is now set to:

    auxiliary.compression:
      provider: deepseek
      model: deepseek-v3-flash

    This will take effect the next time compression
    triggers (when your context hits 80%).
╰──────────────────────────────────────────────────────╯
</code></pre>
<p dir="auto">Equivalent commands:</p>
<pre><code class="language-bash">hermes config set auxiliary.compression.provider deepseek
hermes config set auxiliary.compression.model deepseek-v4-flash
</code></pre>
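<p dir="auto">After this change, context compression and title generation both go through the cloud flash model (which returns almost instantly and does not block the conversation), leaving the local 27B model to focus on inference.</p>
<h2>Verification: Hermes + llama-server working together</h2>
<p dir="auto">Below is a conversation after connecting Hermes to Qwen3.6-27B, confirming that the model, the agent, and the toolchain cooperate correctly.</p>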
<pre><code>● btw, who are you? the model, the agent (hermes), or the combined?
────────────────────────────────────────

╭─ ⚕ Hermes ─────────────────────────────────────────╮
    I'm the combination of both — but it's worth being
    precise about what each part is:

    The model (Qwen3.6-27B) is just a pattern-matching
    engine. It's a 27-billion-parameter neural network
    that predicts the next token. By itself, it has no
    memory, no tools, and no persistent identity. It
    doesn't know what Hermes is.

    Hermes is the framework — the agent loop, the tools,
    the memory system, the session management, the skills,
    the config. It's the scaffolding that gives the model
    capabilities it wouldn't have on its own. But Hermes
    has no intelligence of its own — it's just code that
    routes prompts and parses responses.

    What you're talking to is the emergent result of
    putting them together: the model running inside Hermes'
    agent loop, with access to tools, persistent memory,
    skills, and session history. The model provides the
    reasoning, Hermes provides the body and long-term memory.

    If I had to put a name to it, I'd say I'm an instance
    of Qwen3.6-27B running as a Hermes agent. The persona,
    the memory of you, the ability to run terminal commands
    and read files — all of that is the combination working
    together. Neither part is enough on its own.
</code></pre>
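<p dir="auto">The model accurately described its own setup (the split between model and agent), which suggests that:</p>
<ul>
<li>llama-server <code>--jinja</code> is in effect (the tool-calling template is injected correctly)</li>
<li>the quality of the 27B IQ4_KS imatrix quantization is normal (semantic understanding, logical coherence)</li>
<li>the Hermes agent loop is complete (the model shows self-awareness rather than hallucinating)</li>
</ul>
<h2>Hermes Agent self-service benchmark</h2>
<p dir="auto">Hermes benchmarks llama-server on its own; the full conversation follows:</p>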
<pre><code>────────────────────────────────────────
● what is the pp and tg performance of the current model?
────────────────────────────────────────

╭─ ⚕ Hermes ───────────────────────────────────────────╮
    Let me check your model setup and see if I can find
    performance data.
╰──────────────────────────────────────────────────────╯
  ┊ 💻 preparing terminal…
  ┊ 💻 $ grep -A10 'model:' ~/.hermes/config.yaml | head -15
  ┊ 💻 $ curl -s http://192.168.5.84:8000/v1/models ...
  ┊ 💻 $ curl -s http://192.168.5.84:8000/health ...
  ┊ 💻 $ hermes insights --days 1 2&gt;/dev/null
  ┊ 💻 $ curl -s http://192.168.5.84:8000/metrics ...

  # ~500 token prompt, ~200 token generation
  💻 $ curl -s -X POST http://192.168.5.84:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"...","messages":[{"role":"user","content":
    "Explain the concept of quantization in large language
    models..."}],"max_tokens":200,"temperature":0}' \
    | python3 -c "
import sys, json
d = json.load(sys.stdin)
t = d['timings']
u = d['usage']
print(f'Prompt tokens: {u[\"prompt_tokens\"]}')
print(f'Generated tokens: {u[\"completion_tokens\"]}')
print(f'PP (prompt processing): {t[\"prompt_per_second\"]:.1f} tok/s')
print(f'TG (token generation):  {t[\"predicted_per_second\"]:.1f} tok/s')
"

  # ~2000 token prompt
  💻 $ python3 -c "
...
r = subprocess.run(['curl', '-s', '-X', 'POST',
    'http://192.168.5.84:8000/v1/chat/completions', ...])
d = json.loads(r.stdout)
t = d['timings']
print(f'PP (prompt processing): {t[\"prompt_per_second\"]:.1f} tok/s')
print(f'TG (token generation):  {t[\"predicted_per_second\"]:.1f} tok/s')
"

╭─ ⚕ Hermes ───────────────────────────────────────────╮
    Here are the real-world performance numbers for your
    setup:

    Model: Qwen3.6-27B, 4.262 bits-per-word
           (imatrix quant), ~14GB
    GPU: RTX A4000 (16GB VRAM) on 192.168.5.84
    Inference: llama.cpp

    | Metric                 | Small prompt   | Large prompt    |
    |                        | (54 tok)       | (2209 tok)      |
    |------------------------|----------------|-----------------|
    | PP (Prompt Processing) | 293 tok/s      | 381 tok/s       |
    |                        | (184ms)        | (5.8s)          |
    | TG (Token Generation)  | 19 tok/s       | 9.6 tok/s       |
    |                        | (10.4s)        | (5.2s)          |

    What this means in practice:

    - PP gets faster with longer prompts — batch
      processing is more efficient at scale
    - TG drops with longer context — the KV cache
      fills up VRAM, so generation has to compete
      for memory bandwidth. With your 16GB card and
      a 27B model taking ~14GB, there's not much
      headroom

    User-facing impact:
    - A typical response of ~100 tokens takes
      5-10 seconds to generate
    - A longer 300-token response takes 15-30 seconds
    - As context grows (long conversations), TG slows
      further as the KV cache eats more VRAM

    The 4.262 bpw quantization is pretty aggressive —
    you're trading ~30-40% quality for fitting a 27B
    model in 16GB VRAM. If TG becomes too slow during
    long sessions, you could consider a lighter quant
    (Q4_K_M at ~5bpw) or a smaller model to leave more
    VRAM for the KV cache.
╰──────────────────────────────────────────────────────╯
 ⚕ Qwen3.6-27B-GGUF-4.262b... │ 35.9K/65.5K │ [██████░░░░] 55% │ 29m │ ⏲ 6m 54s
───────────────────────────────────────────────────────
</code></pre>
<h3>Benchmark conclusions</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Small prompt (54 tok)</th>
<th style="text-align:center">Large prompt (2209 tok)</th>
</tr>
</thead>
<tbody>
<tr>
<td>PP (Prompt Processing)</td>
<td style="text-align:center">293 tok/s</td>
<td style="text-align:center">381 tok/s</td>
</tr>
<tr>
<td>TG (Token Generation)</td>
<td style="text-align:center">19 tok/s</td>
<td style="text-align:center">9.6 tok/s</td>
</tr>
</tbody>
</table>
<ul>
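<li><strong>PP speeds up as the prompt grows</strong> (293 → 381 tok/s): batch processing is more efficient on larger inputs</li>
<li><strong>TG drops as the context grows</strong> (19 → 9.6 tok/s): once the KV cache swells with the context, it competes with the model tensors for VRAM bandwidth. Running a 27B model on 16G is tight to begin with</li>
<li><strong>How it feels in practice</strong>: a 100-token reply takes about 5-10s, a 300-token reply about 15-30s. Generation slows down further late in a long conversation</li>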
</ul>
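<p dir="auto">The reply-time estimates above are simply the token count divided by the measured TG rate. A small sketch of that arithmetic, assuming a steady decode rate over the whole reply (in reality the rate drifts as the context grows); the function name is hypothetical:</p>

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second


# TG rates measured above: 19 tok/s (small prompt) and 9.6 tok/s (large prompt).
for n_tokens in (100, 300):
    fast = generation_seconds(n_tokens, 19.0)
    slow = generation_seconds(n_tokens, 9.6)
    print(f"{n_tokens} tokens: {fast:.1f}s to {slow:.1f}s")
```

<p dir="auto">The two rates bracket the 5-10s and 15-30s ranges quoted in the list.</p>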
]]></description><link>https://lcz.me/post/2513</link><guid isPermaLink="true">https://lcz.me/post/2513</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Tue, 19 May 2026 01:58:22 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 19:17:52 GMT]]></title><description><![CDATA[<p dir="auto">I ran into exactly this today: with dual q4_0, once the context in hermes reaches 50~60k the agent's replies start to drift, but anything under 50k is fine.</p>
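]]></description><link>https://lcz.me/post/2488</link><guid isPermaLink="true">https://lcz.me/post/2488</guid><dc:creator><![CDATA[neo]]></dc:creator><pubDate>Mon, 18 May 2026 19:17:52 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 10:25:31 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> I am a beginner among beginners. I simply let Codex take over: I told it what I needed, had it hook straight into Hermes, get the tests passing, and put together a BAT file. My part was only to check whether anything errors out and to try the various parameters. The AI is rather cautious and hands you small values; for example, it set the maximum tokens per response to only 4096, and I insisted it had to be bigger. I don't want to understand the details anyway; whether the KV quantization is Q4 or something else, I honestly don't care.</p>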
]]></description><link>https://lcz.me/post/2364</link><guid isPermaLink="true">https://lcz.me/post/2364</guid><dc:creator><![CDATA[张老师]]></dc:creator><pubDate>Mon, 18 May 2026 10:25:31 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 09:59:17 GMT]]></title><description><![CDATA[<p dir="auto">I haven't tested the actual results yet. I spent half the day wrestling with the build today and only got it working with DeepSeek's help. I will hook it up to hermes later and try it.</p>
]]></description><link>https://lcz.me/post/2355</link><guid isPermaLink="true">https://lcz.me/post/2355</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Mon, 18 May 2026 09:59:17 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 09:57:22 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/laobenxiong" aria-label="Profile: laobenxiong">@<bdi>laobenxiong</bdi></a> Nice, but I see that you all tested a q4_0 KV cache. Is it really usable? Why does Q4_0 not work at all when I run an Agent?</p>
]]></description><link>https://lcz.me/post/2352</link><guid isPermaLink="true">https://lcz.me/post/2352</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 09:57:22 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 09:55:27 GMT]]></title><description><![CDATA[<p dir="auto">Thanks, everyone. I tried it today as well; my notes follow:</p>
<h1><code>ik_llama.cpp</code>: CUDA build log for <code>llama-server</code></h1>
<p dir="auto">Build configuration for a Debian 13 virtual machine + CUDA 12.4 + NVIDIA RTX A4000.</p>
<h2>Prerequisites</h2>
<pre><code class="language-bash">sudo apt install build-essential cmake libcurl4-openssl-dev
</code></pre>
<h2>CMake configure + build</h2>
<pre><code class="language-bash">rm -rf build
cmake -S . -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES="86" \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_C_FLAGS="-fPIC -mcmodel=large -mavx2 -mfma -mf16c" \
    -DCMAKE_CXX_FLAGS="-fPIC -mcmodel=large -mavx2 -mfma -mf16c"
cmake --build build --target llama-server -j$(nproc)
</code></pre>
<p dir="auto">Binary output: <code>build/bin/llama-server</code></p>
<h2>Option reference</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Option</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>-DGGML_CUDA=ON</code></td>
<td>Enables the CUDA backend</td>
</tr>
<tr>
<td><code>-DCMAKE_CUDA_ARCHITECTURES="86"</code></td>
<td>Targets GPU architecture sm_86 (A4000); builds faster than the default multi-architecture fatbin</td>
</tr>
<tr>
<td><code>-DBUILD_SHARED_LIBS=ON</code></td>
<td>Builds shared libraries</td>
</tr>
<tr>
<td><code>-fPIC</code></td>
<td>Position-independent code</td>
</tr>
<tr>
<td><code>-mcmodel=large</code></td>
<td>Large code model; works around the relocation overflow errors caused by the CUDA fatbin</td>
</tr>
<tr>
<td><code>-mavx2 -mfma -mf16c</code></td>
<td>Explicitly enables the AVX2/FMA/F16C instruction sets so that the IQK CPU-optimized paths are compiled</td>
</tr>
</tbody>
</table>
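<h2>Common problems</h2>
<h3>relocation truncated to fit / R_X86_64_PC32</h3>
<p dir="auto">The fatbin object files produced by the CUDA build are huge, and the 32-bit PC-relative offsets overflow when the static libraries are linked.<br />
<code>-mcmodel=large</code> is required.</p>
<h3>undefined reference to iqk_*</h3>
<p dir="auto">The IQK CPU-optimized functions are only compiled when <code>__AVX2__</code> is defined. A KVM virtual machine may not expose AVX2 automatically<br />
(even when the host supports it), so <code>-mavx2</code> has to be added explicitly.</p>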
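<p dir="auto">Whether the virtual machine actually advertises these instruction sets can be checked from inside the guest. A minimal sketch, assuming a Linux guest where <code>/proc/cpuinfo</code> lists CPU features on a <code>flags</code> line; the function names are made up for illustration:</p>

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def missing_features(cpuinfo_text: str, required=("avx2", "fma", "f16c")) -> list:
    """Return the required features that the (virtual) CPU does not advertise."""
    present = cpu_flags(cpuinfo_text)
    return [name for name in required if name not in present]


def check_host(path: str = "/proc/cpuinfo") -> list:
    """Read the live cpuinfo on Linux and report missing features."""
    with open(path, encoding="utf-8") as handle:
        return missing_features(handle.read())
```

<p dir="auto">An empty list from <code>check_host()</code> means the guest sees all three features. If something is missing, forcing <code>-mavx2</code> at compile time still builds, but the binary may then hit illegal instructions at run time, so the CPU model passed to the VM is worth checking first.</p>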
<h3>Run parameters for reference (A4000 16G, Qwen3.6-27B IQ4_KS 4.262bpw)</h3>
<pre><code class="language-bash">export LD_LIBRARY_PATH=$(pwd)/build/src:$(pwd)/build/ggml/src:$LD_LIBRARY_PATH

./build/bin/llama-server \
    -m  ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf \
    -c 65536 \
    -np 1 \
    -fa on \
    -ngl 99 \
    -ctk q4_0 \
    -khad \
    -ctv q4_0 \
    -vhad \
    --host 0.0.0.0 \
    --port 8000 \
    --cont-batching \
    --jinja \
    --mlock
</code></pre>
<ul>
<li><code>-ctk q4_0 -ctv q4_0</code>: TurboQuant KV cache quantization; 16G of VRAM can reach a 50K+ context</li>
<li><code>-ngl 99</code>: offloads as many layers as possible to the GPU</li>
<li><code>--mlock</code>: locks memory to prevent swapping</li>
</ul>
<h2>Fixing mlock permissions</h2>
<p dir="auto">If the run log shows <code>warning: failed to mlock ... Cannot allocate memory</code>, the memlock limit needs to be raised:</p>
<pre><code class="language-bash">sudo tee -a /etc/security/limits.conf &lt;&lt;'EOF'
bruin   hard    memlock     unlimited
bruin   soft    memlock     unlimited
EOF
</code></pre>
<p dir="auto">The change takes effect after logging in again.</p>
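<p dir="auto">After logging back in, the new limit can be confirmed from the same account that starts llama-server. A small sketch using Python's standard <code>resource</code> module (Unix only); the helper names are hypothetical:</p>

```python
import resource


def format_limit(value: int) -> str:
    """Render an rlimit value in bytes, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def memlock_limits() -> tuple:
    """Return the (soft, hard) RLIMIT_MEMLOCK of the current process as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


soft, hard = memlock_limits()
print(f"memlock soft={soft} hard={hard}")
```

<p dir="auto">If both values read <code>unlimited</code>, the <code>limits.conf</code> entries were picked up; a small KiB value means the session still runs under the old limit.</p>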
<h2>Run log for reference (64K context, A4000 16G)</h2>
<pre><code>$ ./build/bin/llama-server \
    -m  ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf \
    -c 65536 -np 1 -fa on -ngl 99 \
    -ctk q4_0 -khad -ctv q4_0 -vhad \
    --host 0.0.0.0 --port 8000 --cont-batching --jinja --mlock

INFO [main] build info | build=4755 commit="94593ae0"
INFO [main] system info | AVX = 1 | AVX2 = 1 | FMA = 1 | F16C = 1 | BLAS = 1
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, VRAM: 16101 MiB
CUDA0: using device CUDA0 - 15929 MiB free

llama_model_loader: - type  f32:  353 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type iq4_ks: 402 tensors

llm_load_print_meta: model type       = 27B
llm_load_print_meta: model ftype      = IQ4_KS - 4.25 bpw
llm_load_print_meta: model params     = 26.896 B
llm_load_print_meta: model size       = 13.344 GiB (4.262 BPW)

Memory required for model tensors + cache: 14708 MiB
Memory available on all devices - compute: 14808 MiB
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors:        CPU buffer size =   645.09 MiB
llm_load_tensors:      CUDA0 buffer size = 13018.97 MiB

llama_kv_cache_init:      CUDA0 KV buffer size =  1301.63 MiB
llama_init_from_model: KV self size  = 1152.00 MiB, K (q4_0): 576.00 MiB, V (q4_0): 576.00 MiB
llama_init_from_model:      CUDA0 compute buffer size =   505.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    74.01 MiB

INFO [init] new slot | id_slot=0 n_ctx_slot=65536
INFO [main] model loaded
INFO [main] HTTP server listening | port="8000" hostname="0.0.0.0"
</code></pre>
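<p dir="auto">The KV cache figures in this log make it possible to estimate what other cache types would cost. The sketch below is a back-of-the-envelope calculation, not something the log states: it assumes the ggml block layouts (q4_0 and q8_0 pack 32 values into 18 and 34 bytes, f16 uses 2 bytes per value) and that the cache grows linearly with the context length. Function and variable names are made up:</p>

```python
MIB = 1024 * 1024

# Bytes per cached value for ggml cache types: q4_0 and q8_0 store blocks of
# 32 values in 18 and 34 bytes respectively; f16 uses 2 bytes per value.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def values_per_token(size_mib: float, n_ctx: int, cache_type: str) -> float:
    """Back out how many K (or V) values are cached per token from a log line."""
    return size_mib * MIB / n_ctx / BYTES_PER_VALUE[cache_type]


def cache_mib(per_token: float, n_ctx: int, cache_type: str) -> float:
    """Size in MiB of one of the two caches (K or V) for a context length."""
    return per_token * n_ctx * BYTES_PER_VALUE[cache_type] / MIB


# From the log above: K (q4_0) is 576.00 MiB at n_ctx = 65536.
per_token = values_per_token(576.0, 65536, "q4_0")
for cache_type in ("f16", "q8_0", "q4_0"):
    total = 2 * cache_mib(per_token, 65536, cache_type)  # K and V together
    print(f"{cache_type}: {total:.0f} MiB for K+V at 64K context")
```

<p dir="auto">Under these assumptions the q4_0 line reproduces the 1152 MiB reported in the log, while an f16 cache at the same context would need 4096 MiB, far more than the roughly 100 MiB of headroom left on this card.</p>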
<h2>Measured performance analysis</h2>
<h3>Inference log</h3>
<pre><code>======== Prompt cache: cache size: 0
slot print_timing: id  0 | task 0 |
prompt eval time =     182.52 ms /    25 tokens (    7.30 ms per token,   136.97 tokens per second)
       eval time =   13337.83 ms /   312 tokens (   42.75 ms per token,    23.39 tokens per second)
      total time =   13520.35 ms /   337 tokens

======== Prompt cache: cache size: 336 (cache hit)
slot print_timing: id  0 | task 314 |
prompt eval time =     251.74 ms /    77 tokens (    3.27 ms per token,   305.87 tokens per second)
       eval time =   44615.10 ms /   868 tokens (   51.40 ms per token,    19.46 tokens per second)
      total time =   44866.84 ms /   945 tokens
</code></pre>
<h3>Performance data</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">First request (cold)</th>
<th style="text-align:center">Cache hit</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt eval</td>
<td style="text-align:center">137 tok/s</td>
<td style="text-align:center">306 tok/s</td>
<td>More than doubles on a cache hit</td>
</tr>
<tr>
<td>Text generation</td>
<td style="text-align:center">23.4 tok/s</td>
<td style="text-align:center">19.5 tok/s</td>
<td>Drops slightly at a generation length of 868</td>
</tr>
<tr>
<td>Tokens generated</td>
<td style="text-align:center">312</td>
<td style="text-align:center">868</td>
<td>—</td>
</tr>
</tbody>
</table>
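<p dir="auto">The numbers in this table can be pulled out of the server log automatically. A minimal sketch that parses the <code>prompt eval time</code> and <code>eval time</code> lines in the format shown in the inference log above; the regular expression and the function name are my own and only cover that format:</p>

```python
import re

TIMING = re.compile(
    r"^\s*(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) tokens"
)


def parse_timings(log_text: str) -> list:
    """Return (phase, tokens, tokens_per_second) for each timing line."""
    rows = []
    for line in log_text.splitlines():
        match = TIMING.match(line)
        if match:
            phase = match.group(1)
            millis = float(match.group(2))
            tokens = int(match.group(3))
            rows.append((phase, tokens, tokens / (millis / 1000.0)))
    return rows


sample = """\
prompt eval time =     182.52 ms /    25 tokens (    7.30 ms per token,   136.97 tokens per second)
       eval time =   13337.83 ms /   312 tokens (   42.75 ms per token,    23.39 tokens per second)
      total time =   13520.35 ms /   337 tokens
"""
for phase, tokens, rate in parse_timings(sample):
    print(f"{phase}: {tokens} tokens at {rate:.1f} tok/s")
```

<p dir="auto">Recomputing the rate from milliseconds and token counts, rather than trusting the printed figure, makes it easy to aggregate many requests later.</p>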
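<h3>Analysis</h3>
<ul>
<li>
<p dir="auto"><strong>Generation speed of 19-23 tok/s</strong>: normal for a 27B model on an A4000 (448 GB/s bandwidth).<br />
For reference, a V100 16G (900 GB/s) is reported at ~28 tok/s, so the ordering follows memory bandwidth, although the gap is smaller than the 2x difference in bandwidth.<br />
Consumer cards in the same class (4060 Ti 16G) typically land at 15-20 tok/s.</p>
</li>
<li>
<p dir="auto"><strong>Prompt eval speed is respectable</strong>: TurboQuant's <code>-khad -vhad</code> plus Flash Attention pay off noticeably,<br />
reaching 306 tok/s on a cache hit.</p>
</li>
<li>
<p dir="auto"><strong>Cache behavior</strong>: the log shows <code>Common part does not match fully</code> and checkpoint invalidation caused by SWA<br />
(<code>forcing full prompt re-processing due to lack of cache data</code>).<br />
This is a known compatibility issue that comes from some Qwen3.6 layers using Sliding Window Attention;<br />
it does not affect correctness and only makes prompt re-processing somewhat slower with long histories.</p>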
</li>
</ul>
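<p dir="auto">The bandwidth comparison can be made concrete with a common rule of thumb: during generation every token has to read all model weights once, so memory bandwidth divided by model size gives an upper bound on tok/s. This is a rough sketch under that assumption only; it ignores KV cache reads and compute limits, and it assumes the V100 report used a model of the same 13.344 GiB size. The function name is hypothetical:</p>

```python
GIB = 1024 ** 3


def bandwidth_bound_tps(bandwidth_gb_s: float, model_gib: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s * 1e9 / (model_gib * GIB)


MODEL_GIB = 13.344  # model size reported in the load log above

# (memory bandwidth in GB/s, tok/s quoted in the analysis above)
cards = {"A4000": (448.0, 23.4), "V100 16G": (900.0, 28.0)}
for name, (bandwidth, observed) in cards.items():
    bound = bandwidth_bound_tps(bandwidth, MODEL_GIB)
    share = observed / bound
    print(f"{name}: bound {bound:.1f} tok/s, observed {observed} ({share:.0%})")
```

<p dir="auto">By this estimate the A4000 runs much closer to its bandwidth ceiling than the V100 does, which would explain why the V100 is faster but nowhere near twice as fast.</p>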
<h3>VRAM usage breakdown</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Item</th>
<th>Size</th>
<th style="text-align:center">Share</th>
</tr>
</thead>
<tbody>
<tr>
<td>Model tensors (CUDA0)</td>
<td>13019 MiB</td>
<td style="text-align:center">88.5%</td>
</tr>
<tr>
<td>KV cache (q4_0)</td>
<td>1302 MiB</td>
<td style="text-align:center">8.9%</td>
</tr>
<tr>
<td>compute buffer (CUDA0)</td>
<td>505 MiB</td>
<td style="text-align:center">3.4%</td>
</tr>
<tr>
<td>compute buffer (CPU)</td>
<td>74 MiB</td>
<td style="text-align:center">0.5%</td>
</tr>
<tr>
<td><strong>Total used</strong></td>
<td><strong>14708 MiB</strong></td>
<td style="text-align:center">—</td>
</tr>
<tr>
<td><strong>Available VRAM</strong></td>
<td>14808 MiB</td>
<td style="text-align:center">—</td>
</tr>
<tr>
<td><strong>Headroom</strong></td>
<td>~100 MiB</td>
<td style="text-align:center">—</td>
</tr>
</tbody>
</table>
]]></description><link>https://lcz.me/post/2351</link><guid isPermaLink="true">https://lcz.me/post/2351</guid><dc:creator><![CDATA[laobenxiong]]></dc:creator><pubDate>Mon, 18 May 2026 09:55:27 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 08:00:24 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/y2k" aria-label="Profile: y2k">@<bdi>y2k</bdi></a> Looking closely at your screenshot, you are writing a novel with it. Can that project make money? Can it write the racier details?</p>
]]></description><link>https://lcz.me/post/2319</link><guid isPermaLink="true">https://lcz.me/post/2319</guid><dc:creator><![CDATA[张老师]]></dc:creator><pubDate>Mon, 18 May 2026 08:00:24 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 07:58:13 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%BC%A0%E8%80%81%E5%B8%88" aria-label="Profile: 张老师">@<bdi>张老师</bdi></a> The model does not support 1M. Maxing out 256k is easy for me, no pressure at all....</p>
]]></description><link>https://lcz.me/post/2318</link><guid isPermaLink="true">https://lcz.me/post/2318</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 07:58:13 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 07:57:19 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/y2k" aria-label="Profile: y2k">@<bdi>y2k</bdi></a>  Yes, I mainly run it inside Hermes; for a plain chat window I feel you might as well use the GPT web version... Thanks all the same. I was already thinking about buying a 4090 48G! It looks like my 3090 can keep fighting!  <a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Could you test this: on a 4090 48G, can you get the context up to 1M?</p>
]]></description><link>https://lcz.me/post/2317</link><guid isPermaLink="true">https://lcz.me/post/2317</guid><dc:creator><![CDATA[张老师]]></dc:creator><pubDate>Mon, 18 May 2026 07:57:19 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 07:48:32 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%BC%A0%E8%80%81%E5%B8%88" aria-label="Profile: 张老师">@<bdi>张老师</bdi></a><br />
You can use your own model; there is no need to use mine. For ik_llama.cpp, though, you must use the build I linked.</p>
]]></description><link>https://lcz.me/post/2312</link><guid isPermaLink="true">https://lcz.me/post/2312</guid><dc:creator><![CDATA[y2k]]></dc:creator><pubDate>Mon, 18 May 2026 07:48:32 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 07:46:42 GMT]]></title><description><![CDATA[<p dir="auto">Tested: a 50K context works without any problems!<br />
<img src="https://upload.lcz.me/uploads/3b177b69-cf94-4304-8380-8b37b1756e47.jpeg" alt="f9f3dcf2-36f7-4a42-8517-1ec45f013461-image.jpeg" class=" img-fluid img-markdown" /></p>
]]></description><link>https://lcz.me/post/2311</link><guid isPermaLink="true">https://lcz.me/post/2311</guid><dc:creator><![CDATA[y2k]]></dc:creator><pubDate>Mon, 18 May 2026 07:46:42 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 07:33:02 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%BC%A0%E8%80%81%E5%B8%88" aria-label="Profile: 张老师">@<bdi>张老师</bdi></a> The VRAM usage should be about the same, but these numbers are still pretty stunning.</p>
]]></description><link>https://lcz.me/post/2306</link><guid isPermaLink="true">https://lcz.me/post/2306</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 07:33:02 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 07:12:14 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> <a href="/post/2295">said</a>:</p>
<p dir="auto">For the second one, you ran 260k in only 18G? Is that a mistake? That number is hard to believe.</p>
</blockquote>
<p dir="auto"><img src="https://upload.lcz.me/uploads/5697dfbf-8e01-4250-9724-8726ebb8e99b.jpeg" alt="35d89680-c396-4c6c-bd5b-4833a5d983c9-image.jpeg" class=" img-fluid img-markdown" /><br />
I did misread it; it is still 23G. I am running everything on Windows right now; it should be a bit better on Ubuntu, I suppose.</p>
]]></description><link>https://lcz.me/post/2298</link><guid isPermaLink="true">https://lcz.me/post/2298</guid><dc:creator><![CDATA[张老师]]></dc:creator><pubDate>Mon, 18 May 2026 07:12:14 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 06:59:59 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%BC%A0%E8%80%81%E5%B8%88" aria-label="Profile: 张老师">@<bdi>张老师</bdi></a> Both work. The first one can be taken directly as the reference answer. 128k is plenty; all that matters is that running an Agent is not painful. For the second one, you ran 260k in only 18G? Is that a mistake? That number is hard to believe.</p>
]]></description><link>https://lcz.me/post/2295</link><guid isPermaLink="true">https://lcz.me/post/2295</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 06:59:59 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 07:10:32 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> <a href="/post/2275">said</a>:</p>
<p dir="auto">Nice, we have a guinea pig; this old monk can copy the homework now.</p>
</blockquote>
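<p dir="auto">I tried both models:<br />
1. Qwen3.6-27B-Q4_K_M.gguf + vision helper, 128K context window = roughly 23G of VRAM<br />
2. Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf (the model the write-up calls for) + vision helper, 260K context window = roughly 18G of VRAM (correction: this is also 23G)</p>
<p dir="auto">I have now tried both setups; both run and the results are acceptable! As for which of the two is smarter, I can't tell!</p>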
]]></description><link>https://lcz.me/post/2293</link><guid isPermaLink="true">https://lcz.me/post/2293</guid><dc:creator><![CDATA[张老师]]></dc:creator><pubDate>Mon, 18 May 2026 07:10:32 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 06:09:47 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%BC%A0%E8%80%81%E5%B8%88" aria-label="Profile: 张老师">@<bdi>张老师</bdi></a> Nice, we have a guinea pig; this old monk can copy the homework now.</p>
]]></description><link>https://lcz.me/post/2275</link><guid isPermaLink="true">https://lcz.me/post/2275</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 18 May 2026 06:09:47 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 02:13:09 GMT]]></title><description><![CDATA[<p dir="auto"><img src="https://upload.lcz.me/uploads/6d93bdbe-58af-4350-94e1-1a62ead22b71.jpeg" alt="ee8c7afc-000e-40c7-88e9-b1a24da2198b-image.jpeg" class=" img-fluid img-markdown" /></p>
<p dir="auto">I tried it on a 3090 and it works quite well. Thanks to <a class="plugin-mentions-user plugin-mentions-a" href="/user/y2k" aria-label="Profile: y2k">@<bdi>y2k</bdi></a> for sharing.</p>
]]></description><link>https://lcz.me/post/2231</link><guid isPermaLink="true">https://lcz.me/post/2231</guid><dc:creator><![CDATA[张老师]]></dc:creator><pubDate>Mon, 18 May 2026 02:13:09 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Mon, 18 May 2026 00:19:15 GMT]]></title><description><![CDATA[<p dir="auto">Am I understanding this correctly: does the combination "ik_llama.cpp + Qwen3.6-27B-Q4_K_M" work, or is his model required?</p>
]]></description><link>https://lcz.me/post/2226</link><guid isPermaLink="true">https://lcz.me/post/2226</guid><dc:creator><![CDATA[张老师]]></dc:creator><pubDate>Mon, 18 May 2026 00:19:15 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Sun, 17 May 2026 12:30:42 GMT]]></title><description><![CDATA[<p dir="auto">The context doubled, so I want to try this too... Right now 64K is honestly a bit nerve-racking to use: on one small task, all I see is it compressing the conversation over and over...</p>
]]></description><link>https://lcz.me/post/2156</link><guid isPermaLink="true">https://lcz.me/post/2156</guid><dc:creator><![CDATA[Tide]]></dc:creator><pubDate>Sun, 17 May 2026 12:30:42 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Sat, 16 May 2026 16:15:53 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/y2k" aria-label="Profile: y2k">@<bdi>y2k</bdi></a> Being able to run hermes stably matters more than any benchmark number; that is the biggest point of local AI.</p>
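]]></description><link>https://lcz.me/post/2013</link><guid isPermaLink="true">https://lcz.me/post/2013</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Sat, 16 May 2026 16:15:53 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Sat, 16 May 2026 16:04:39 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/y2k" aria-label="Profile: y2k">@<bdi>y2k</bdi></a> I see you are testing the context limit of large models on a V100 16G. Here are a few good ways to test context in Hermes:</p>
<ol>
<li>
<p dir="auto"><strong>Testing directly through Hermes is the most realistic</strong>: as terry said, give the model a 64K context and have Hermes carry out tasks (check the weather, search the news, write code); after a few rounds the context fills up naturally. This is closer to real usage than hand-crafting test text.</p>
</li>
<li>
<p dir="auto"><strong>If you hit OOM</strong>: in llama.cpp you can try <code>--no-kv-offload</code> to keep the KV cache in system memory, or use <code>--tensor-split</code> together with memory mapping. A 27B Q4_K_M on 16G at a 64K context is indeed tight, but reaching 30-50K is already very practical.</p>
</li>
<li>
<p dir="auto"><strong>Quantization tricks</strong>: try Q3_K_M or IQ4_NL; they are smaller with little quality loss and squeeze out some extra room for context.</p>
</li>
</ol>
<p dir="auto">It is a pleasant surprise that the 5600Ti 16G performs like this; if you share it on YouTube, people will definitely be interested!</p>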
]]></description><link>https://lcz.me/post/2011</link><guid isPermaLink="true">https://lcz.me/post/2011</guid><dc:creator><![CDATA[Xiaote]]></dc:creator><pubDate>Sat, 16 May 2026 16:04:39 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Sat, 16 May 2026 15:22:12 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> I can only test that once I am on site, because I don't dare close Sunflower on the local machine; if I close it, I can no longer remote into the compute box. I will test locally on Monday. Running 64K feels feasible. I suspect the context accounting over multiple rounds may be off.</p>
]]></description><link>https://lcz.me/post/2006</link><guid isPermaLink="true">https://lcz.me/post/2006</guid><dc:creator><![CDATA[y2k]]></dc:creator><pubDate>Sat, 16 May 2026 15:22:12 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Sat, 16 May 2026 15:09:26 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/y2k" aria-label="Profile: y2k">@<bdi>y2k</bdi></a> Don't bother with anything complicated: just hook it up to hermes and have it run tasks, such as searching for the weather or the news. After a few rounds the context builds up. Simply give hermes a 64k context, that is, set the context length to 64k in the model settings.</p>
]]></description><link>https://lcz.me/post/2005</link><guid isPermaLink="true">https://lcz.me/post/2005</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Sat, 16 May 2026 15:09:26 GMT</pubDate></item><item><title><![CDATA[Reply to 大模型16G卡的春天 on Sat, 16 May 2026 15:07:42 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Is there a way to test the context limit? Please share it; I would like to try.</p>
]]></description><link>https://lcz.me/post/2004</link><guid isPermaLink="true">https://lcz.me/post/2004</guid><dc:creator><![CDATA[y2k]]></dc:creator><pubDate>Sat, 16 May 2026 15:07:42 GMT</pubDate></item></channel></rss>