大模型16G卡的春天
-
-
感谢各位大佬. 我今天也试了下, 记录如下:
ik_ollama.cpp之llama-serverCUDA 构建日志针对 Debian 13 虚拟机 + CUDA 12.4 + NVIDIA RTX A4000 的编译配置。
前置依赖
sudo apt install build-essential cmake libcurl4-openssl-devCMake 配置 + 构建
rm -rf build cmake -S . -B build \ -DGGML_CUDA=ON \ -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CUDA_ARCHITECTURES="86" \ -DBUILD_SHARED_LIBS=ON \ -DCMAKE_C_FLAGS="-fPIC -mcmodel=large -mavx2 -mfma -mf16c" \ -DCMAKE_CXX_FLAGS="-fPIC -mcmodel=large -mavx2 -mfma -mf16c" cmake --build build --target llama-server -j$(nproc)二进制输出:
build/bin/llama-server参数说明
参数 说明 -DGGML_CUDA=ON启用 CUDA 后端 -DCMAKE_CUDA_ARCHITECTURES="86"指定 GPU 架构为 sm_86(A4000),比默认的多架构 fatbin 编译更快 -DBUILD_SHARED_LIBS=ON编译为动态库 -fPIC位置无关代码 -mcmodel=large大代码模型,解决 CUDA fatbin 导致的 relocation overflow 错误 -mavx2 -mfma -mf16c显式启用 AVX2/FMA/F16C 指令集,确保 IQK CPU 优化路径编译 常见问题
relocation truncated to fit / R_X86_64_PC32
CUDA 编译产生的 fatbin 目标文件体积巨大,静态库链接时 32 位 PC 相对偏移溢出。
必须加-mcmodel=large。undefined reference to iqk_*
IQK CPU 优化函数需要在
__AVX2__定义时才会编译。KVM 虚拟机可能不自动暴露 AVX2
(即使宿主机支持),需显式加-mavx2。运行参数参考(A4000 16G,Qwen3.6-27B IQ4_XS)
export LD_LIBRARY_PATH=$(pwd)/build/src:$(pwd)/build/ggml/src:$LD_LIBRARY_PATH ./build/bin/llama-server \ -m ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf \ -c 65536 \ -np 1 \ -fa on \ -ngl 99 \ -ctk q4_0 \ -khad \ -ctv q4_0 \ -vhad \ --host 0.0.0.0 \ --port 8000 \ --cont-batching \ --jinja \ --mlock-ctk q4_0 -ctv q4_0:TurboQuant KV cache 量化,16G 显存可跑到 50K+ 上下文-ngl 99:尽可能把所有层 offload 到 GPU--mlock:锁定内存,防止 swap
mlock 权限修正
如果运行日志出现
warning: failed to mlock ... Cannot allocate memory,需提升 memlock 限制:sudo tee -a /etc/security/limits.conf <<'EOF' bruin hard memlock unlimited bruin soft memlock unlimited EOF重新登录后生效。
运行日志参考(64K context,A4000 16G)
$ ./build/bin/llama-server \ -m ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf \ -c 65536 -np 1 -fa on -ngl 99 \ -ctk q4_0 -khad -ctv q4_0 -vhad \ --host 0.0.0.0 --port 8000 --cont-batching --jinja --mlock INFO [main] build info | build=4755 commit="94593ae0" INFO [main] system info | AVX = 1 | AVX2 = 1 | FMA = 1 | F16C = 1 | BLAS = 1 ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, VRAM: 16101 MiB CUDA0: using device CUDA0 - 15929 MiB free llama_model_loader: - type f32: 353 tensors llama_model_loader: - type q8_0: 96 tensors llama_model_loader: - type iq4_ks: 402 tensors llm_load_print_meta: model type = 27B llm_load_print_meta: model ftype = IQ4_KS - 4.25 bpw llm_load_print_meta: model params = 26.896 B llm_load_print_meta: model size = 13.344 GiB (4.262 BPW) Memory required for model tensors + cache: 14708 MiB Memory available on all devices - compute: 14808 MiB llm_load_tensors: offloaded 65/65 layers to GPU llm_load_tensors: CPU buffer size = 645.09 MiB llm_load_tensors: CUDA0 buffer size = 13018.97 MiB llama_kv_cache_init: CUDA0 KV buffer size = 1301.63 MiB llama_init_from_model: KV self size = 1152.00 MiB, K (q4_0): 576.00 MiB, V (q4_0): 576.00 MiB llama_init_from_model: CUDA0 compute buffer size = 505.00 MiB llama_init_from_model: CUDA_Host compute buffer size = 74.01 MiB INFO [init] new slot | id_slot=0 n_ctx_slot=65536 INFO [main] model loaded INFO [main] HTTP server listening | port="8000" hostname="0.0.0.0"实测性能分析
推理日志
======== Prompt cache: cache size: 0 slot print_timing: id 0 | task 0 | prompt eval time = 182.52 ms / 25 tokens ( 7.30 ms per token, 136.97 tokens per second) eval time = 13337.83 ms / 312 tokens ( 42.75 ms per token, 23.39 tokens per second) total time = 13520.35 ms / 337 tokens ======== Prompt cache: cache size: 336 (命中缓存) slot print_timing: id 0 | task 314 | prompt eval time = 251.74 ms / 77 tokens ( 3.27 ms per token, 305.87 tokens per second) eval time = 44615.10 ms / 868 tokens ( 51.40 ms per token, 19.46 tokens per second) total time = 44866.84 ms / 945 tokens性能数据
指标 首次请求(冷) 缓存命中 说明 Prompt eval 137 tok/s 306 tok/s 缓存命中后翻倍 文本生成 23.4 tok/s 19.5 tok/s 生成长度 868 后略有下降 生成 token 数 312 868 — 分析
-
生成速度 19-23 tok/s:对于 27B 模型在 A4000(448 GB/s 带宽)上表现正常。
参考 V100 16G(900 GB/s)报告 ~28 tok/s,与带宽比例吻合。
同级别消费卡(4060 Ti 16G)通常在 15-20 tok/s。 -
Prompt eval 速度可观:TurboQuant 的
-khad -vhad+ Flash Attention 效果显著,
缓存命中时可达 306 tok/s。 -
缓存行为:日志中出现
Common part does not match fully和 SWA 导致的
checkpoint 失效(forcing full prompt re-processing due to lack of cache data)。
这是 Qwen3.6 模型部分层使用 Sliding Window Attention 的已知兼容性问题,
不影响正确性,仅长历史场景下 prompt 重处理稍慢。
显存使用明细
项目 大小 占比 模型 tensors (CUDA0) 13019 MiB 88.5% KV cache (q4_0) 1302 MiB 8.9% compute buffer (CUDA0) 505 MiB 3.4% compute buffer (CPU) 74 MiB 0.5% 已用合计 14708 MiB — 可用显存 14808 MiB — 余量 ~100 MiB — -
感谢各位大佬. 我今天也试了下, 记录如下:
ik_ollama.cpp之llama-serverCUDA 构建日志针对 Debian 13 虚拟机 + CUDA 12.4 + NVIDIA RTX A4000 的编译配置。
前置依赖
sudo apt install build-essential cmake libcurl4-openssl-devCMake 配置 + 构建
rm -rf build cmake -S . -B build \ -DGGML_CUDA=ON \ -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CUDA_ARCHITECTURES="86" \ -DBUILD_SHARED_LIBS=ON \ -DCMAKE_C_FLAGS="-fPIC -mcmodel=large -mavx2 -mfma -mf16c" \ -DCMAKE_CXX_FLAGS="-fPIC -mcmodel=large -mavx2 -mfma -mf16c" cmake --build build --target llama-server -j$(nproc)二进制输出:
build/bin/llama-server参数说明
参数 说明 -DGGML_CUDA=ON启用 CUDA 后端 -DCMAKE_CUDA_ARCHITECTURES="86"指定 GPU 架构为 sm_86(A4000),比默认的多架构 fatbin 编译更快 -DBUILD_SHARED_LIBS=ON编译为动态库 -fPIC位置无关代码 -mcmodel=large大代码模型,解决 CUDA fatbin 导致的 relocation overflow 错误 -mavx2 -mfma -mf16c显式启用 AVX2/FMA/F16C 指令集,确保 IQK CPU 优化路径编译 常见问题
relocation truncated to fit / R_X86_64_PC32
CUDA 编译产生的 fatbin 目标文件体积巨大,静态库链接时 32 位 PC 相对偏移溢出。
必须加-mcmodel=large。undefined reference to iqk_*
IQK CPU 优化函数需要在
__AVX2__定义时才会编译。KVM 虚拟机可能不自动暴露 AVX2
(即使宿主机支持),需显式加-mavx2。运行参数参考(A4000 16G,Qwen3.6-27B IQ4_XS)
export LD_LIBRARY_PATH=$(pwd)/build/src:$(pwd)/build/ggml/src:$LD_LIBRARY_PATH ./build/bin/llama-server \ -m ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf \ -c 65536 \ -np 1 \ -fa on \ -ngl 99 \ -ctk q4_0 \ -khad \ -ctv q4_0 \ -vhad \ --host 0.0.0.0 \ --port 8000 \ --cont-batching \ --jinja \ --mlock-ctk q4_0 -ctv q4_0:TurboQuant KV cache 量化,16G 显存可跑到 50K+ 上下文-ngl 99:尽可能把所有层 offload 到 GPU--mlock:锁定内存,防止 swap
mlock 权限修正
如果运行日志出现
warning: failed to mlock ... Cannot allocate memory,需提升 memlock 限制:sudo tee -a /etc/security/limits.conf <<'EOF' bruin hard memlock unlimited bruin soft memlock unlimited EOF重新登录后生效。
运行日志参考(64K context,A4000 16G)
$ ./build/bin/llama-server \ -m ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf \ -c 65536 -np 1 -fa on -ngl 99 \ -ctk q4_0 -khad -ctv q4_0 -vhad \ --host 0.0.0.0 --port 8000 --cont-batching --jinja --mlock INFO [main] build info | build=4755 commit="94593ae0" INFO [main] system info | AVX = 1 | AVX2 = 1 | FMA = 1 | F16C = 1 | BLAS = 1 ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, VRAM: 16101 MiB CUDA0: using device CUDA0 - 15929 MiB free llama_model_loader: - type f32: 353 tensors llama_model_loader: - type q8_0: 96 tensors llama_model_loader: - type iq4_ks: 402 tensors llm_load_print_meta: model type = 27B llm_load_print_meta: model ftype = IQ4_KS - 4.25 bpw llm_load_print_meta: model params = 26.896 B llm_load_print_meta: model size = 13.344 GiB (4.262 BPW) Memory required for model tensors + cache: 14708 MiB Memory available on all devices - compute: 14808 MiB llm_load_tensors: offloaded 65/65 layers to GPU llm_load_tensors: CPU buffer size = 645.09 MiB llm_load_tensors: CUDA0 buffer size = 13018.97 MiB llama_kv_cache_init: CUDA0 KV buffer size = 1301.63 MiB llama_init_from_model: KV self size = 1152.00 MiB, K (q4_0): 576.00 MiB, V (q4_0): 576.00 MiB llama_init_from_model: CUDA0 compute buffer size = 505.00 MiB llama_init_from_model: CUDA_Host compute buffer size = 74.01 MiB INFO [init] new slot | id_slot=0 n_ctx_slot=65536 INFO [main] model loaded INFO [main] HTTP server listening | port="8000" hostname="0.0.0.0"实测性能分析
推理日志
======== Prompt cache: cache size: 0 slot print_timing: id 0 | task 0 | prompt eval time = 182.52 ms / 25 tokens ( 7.30 ms per token, 136.97 tokens per second) eval time = 13337.83 ms / 312 tokens ( 42.75 ms per token, 23.39 tokens per second) total time = 13520.35 ms / 337 tokens ======== Prompt cache: cache size: 336 (命中缓存) slot print_timing: id 0 | task 314 | prompt eval time = 251.74 ms / 77 tokens ( 3.27 ms per token, 305.87 tokens per second) eval time = 44615.10 ms / 868 tokens ( 51.40 ms per token, 19.46 tokens per second) total time = 44866.84 ms / 945 tokens性能数据
指标 首次请求(冷) 缓存命中 说明 Prompt eval 137 tok/s 306 tok/s 缓存命中后翻倍 文本生成 23.4 tok/s 19.5 tok/s 生成长度 868 后略有下降 生成 token 数 312 868 — 分析
-
生成速度 19-23 tok/s:对于 27B 模型在 A4000(448 GB/s 带宽)上表现正常。
参考 V100 16G(900 GB/s)报告 ~28 tok/s,与带宽比例吻合。
同级别消费卡(4060 Ti 16G)通常在 15-20 tok/s。 -
Prompt eval 速度可观:TurboQuant 的
-khad -vhad+ Flash Attention 效果显著,
缓存命中时可达 306 tok/s。 -
缓存行为:日志中出现
Common part does not match fully和 SWA 导致的
checkpoint 失效(forcing full prompt re-processing due to lack of cache data)。
这是 Qwen3.6 模型部分层使用 Sliding Window Attention 的已知兼容性问题,
不影响正确性,仅长历史场景下 prompt 重处理稍慢。
显存使用明细
项目 大小 占比 模型 tensors (CUDA0) 13019 MiB 88.5% KV cache (q4_0) 1302 MiB 8.9% compute buffer (CUDA0) 505 MiB 3.4% compute buffer (CPU) 74 MiB 0.5% 已用合计 14708 MiB — 可用显存 14808 MiB — 余量 ~100 MiB — @laobenxiong 不错,不过我看到你们都测试了q4_0 KV,真的能用吗?为什么我跑Agent Q4_0完全不行。
-
我还没有实际测试效果. 今天折腾编译搞了半天, 还是在deepseek 的帮助下搞定的. 回头接入hermes试试
-
@laobenxiong 不错,不过我看到你们都测试了q4_0 KV,真的能用吗?为什么我跑Agent Q4_0完全不行。
-
系统 取消固定了该主题
-
接入 Hermes
Hermes 支持任意 OpenAI 兼容 API 端点。以下通过交互式命令添加本机 llama-server:
$ hermes model Current model: deepseek-v4-flash Active provider: DeepSeek Custom OpenAI-compatible endpoint configuration: API base URL [e.g. https://api.example.com/v1]: http://192.168.5.84:8000/v1 API key [optional]: Verified endpoint via http://192.168.5.84:8000/v1/models (1 model(s) visible) Select API compatibility mode: 1. Auto-detect [current] 2. Chat Completions 3. Responses / Codex 4. Anthropic Messages Choice [1-4, Enter to keep current/detected]: API mode: auto-detect Detected model: ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf Use this model? [Y/n]: Context length in tokens [leave blank for auto-detect]: 65536 Display name [192.168.5.84:8000]: Default model set to: ./models/qwen3.6/Qwen3.6-27B-GGUF-4.262bpw-imatrix.gguf (via http://192.168.5.84:8000/v1) 💾 Saved to custom providers as "192.168.5.84:8000" (edit in config.yaml)配置要点:
- Base URL:
http://<llama-server IP>:8000/v1,注意路径必须以/v1结尾 - API Key:留空,本地 llama-server 不需要
- API Mode:选择 Auto-detect 即可(或 Chat Completions)
- Context Length:手动填入 65536(与
-c参数一致) - 启动参数需保留
--jinja:否则 Hermes 发来的 tools 参数会被忽略,工具调用功能失效
上下文压缩配置
Hermes 默认会在上下文占用达到阈值后自动压缩历史消息,避免超 context 窗口。以下是 Hermes 查询当前配置的对话:
──────────────────────────────────────── ● the context window compact/compression threshold for the current model you are using ──────────────────────────────────────── ╭─ ⚕ Hermes ───────────────────────────────────────────╮ Here's what's configured for your session: compression: enabled: true threshold: 0.8 target_ratio: 0.2 protect_last_n: 20 hygiene_hard_message_limit: 400 How it works: - threshold: 0.8 — 当上下文占用达到 80% 时触发压缩 - target_ratio: 0.2 — 压缩后上下文裁剪至原大小的 ~20% - protect_last_n: 20 — 保护最近 20 条消息不被压缩 - hygiene_hard_message_limit: 400 — 消息数达 400 条时强制重置,防止上下文退化 ╰──────────────────────────────────────────────────────╯对应
~/.hermes/config.yaml:compression: enabled: true threshold: 0.8 target_ratio: 0.2 protect_last_n: 20 hygiene_hard_message_limit: 400辅助模型(Aux Model)配置
辅助模型是 Hermes 专用于后台任务(上下文压缩、标题生成、视觉分析、session search)的独立模型,
不占用主模型算力。默认auto会回退到主模型(Qwen3.6-27B),对 27B 来说浪费且慢。
以下是将 aux 改为 DeepSeek 云端模型的对话:──────────────────────────────────────── ● ok, i am not aware of the concept of "aux model". what the aux model for? ──────────────────────────────────────── ╭─ ⚕ Hermes ───────────────────────────────────────────╮ An auxiliary (aux) model is a separate, usually cheaper and faster model that Hermes uses for background tasks: 1. Context compression — 当上下文达 80%,Hermes 需压缩历史消息。不用主模型做(浪费且慢) 2. Vision tasks — 图片/浏览器视觉分析 3. Session search — 搜索和排序过去对话 Your current setup: compression: provider: auto model: '' auto 会尝试找 OPENROUTER_API_KEY 或 GOOGLE_API_KEY, 没有则回退到主模型 Qwen3.6-27B ╰──────────────────────────────────────────────────────╯ ──────────────────────────────────────── ● i see. please set the aux model to be deepseek-v4-flash ──────────────────────────────────────── ╭─ ⚕ Hermes ───────────────────────────────────────────╮ Done. The auxiliary compression model is now set to: auxiliary.compression: provider: deepseek model: deepseek-v3-flash This will take effect the next time compression triggers (when your context hits 80%). ╰──────────────────────────────────────────────────────╯等效指令:
hermes config set auxiliary.compression.provider deepseek hermes config set auxiliary.compression.model deepseek-v3-flash配置后上下文压缩和标题生成都走云端 flash 模型(瞬间返回,不阻塞对话),本地 27B 模型专注推理。
验证:Hermes + llama-server 联调成功
以下是 Hermes 接入 Qwen3.6-27B 后的对话,确认模型、agent 和工具链协同正常。
● btw, who are you? the model, the agent (hermes), or the combined? ──────────────────────────────────────── ╭─ ⚕ Hermes ─────────────────────────────────────────╮ I'm the combination of both — but it's worth being precise about what each part is: The model (Qwen3.6-27B) is just a pattern-matching engine. It's a 27-billion-parameter neural network that predicts the next token. By itself, it has no memory, no tools, and no persistent identity. It doesn't know what Hermes is. Hermes is the framework — the agent loop, the tools, the memory system, the session management, the skills, the config. It's the scaffolding that gives the model capabilities it wouldn't have on its own. But Hermes has no intelligence of its own — it's just code that routes prompts and parses responses. What you're talking to is the emergent result of putting them together: the model running inside Hermes' agent loop, with access to tools, persistent memory, skills, and session history. The model provides the reasoning, Hermes provides the body and long-term memory. If I had to put a name to it, I'd say I'm an instance of Qwen3.6-27B running as a Hermes agent. The persona, the memory of you, the ability to run terminal commands and read files — all of that is the combination working together. Neither part is enough on its own.模型准确描述了自身架构(模型 vs agent 的分工),验证了:
- llama-server
--jinja生效(工具调用模板正确注入) - 27B IQ4_XS imatrix 量化质量正常(语义理解、逻辑连贯)
- Hermes agent loop 完整(模型有 self-awareness 而非幻觉)
Hermes Agent 自助压测
Hermes 对 llama-server 进行自助压测,以下是完整对话:
──────────────────────────────────────── ● what is the pp and tg performance of the current model? ──────────────────────────────────────── ╭─ ⚕ Hermes ───────────────────────────────────────────╮ Let me check your model setup and see if I can find performance data. ╰──────────────────────────────────────────────────────╯ ┊ 💻 preparing terminal… ┊ 💻 $ grep -A10 'model:' ~/.hermes/config.yaml | head -15 ┊ 💻 $ curl -s http://192.168.5.84:8000/v1/models ... ┊ 💻 $ curl -s http://192.168.5.84:8000/health ... ┊ 💻 $ hermes insights --days 1 2>/dev/null ┊ 💻 $ curl -s http://192.168.5.84:8000/metrics ... # ~500 token prompt, ~200 token generation 💻 $ curl -s -X POST http://192.168.5.84:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model":"...","messages":[{"role":"user","content": "Explain the concept of quantization in large language models..."}],"max_tokens":200,"temperature":0}' \ | python3 -c " import sys, json d = json.load(sys.stdin) t = d['timings'] u = d['usage'] print(f'Prompt tokens: {u[\"prompt_tokens\"]}') print(f'Generated tokens: {u[\"completion_tokens\"]}') print(f'PP (prompt processing): {t[\"prompt_per_second\"]:.1f} tok/s') print(f'TG (token generation): {t[\"predicted_per_second\"]:.1f} tok/s') " # ~2000 token prompt 💻 $ python3 -c " ... r = subprocess.run(['curl', '-s', '-X', 'POST', 'http://192.168.5.84:8000/v1/chat/completions', ...]) d = json.loads(r.stdout) t = d['timings'] print(f'PP (prompt processing): {t[\"prompt_per_second\"]:.1f} tok/s') print(f'TG (token generation): {t[\"predicted_per_second\"]:.1f} tok/s') " ╭─ ⚕ Hermes ───────────────────────────────────────────╮ Here are the real-world performance numbers for your setup: Model: Qwen3.6-27B, 4.262 bits-per-word (imatrix quant), ~14GB GPU: RTX A4000 (16GB VRAM) on 192.168.5.84 Inference: llama.cpp | Metric | Small prompt | Large prompt | | | (54 tok) | (2209 tok) | |------------------------|----------------|-----------------| | PP (Prompt Processing) | 293 tok/s | 381 tok/s | | | (184ms) | (5.8s) | | TG (Token Generation) | 19 tok/s | 9.6 tok/s | | | (10.4s) | (5.2s) | What this means in practice: - PP gets faster with longer prompts — batch processing is more efficient at scale - TG drops with longer context — the KV cache fills up VRAM, so generation has to compete for memory bandwidth. With your 16GB card and a 27B model taking ~14GB, there's not much headroom User-facing impact: - A typical response of ~100 tokens takes 5-10 seconds to generate - A longer 300-token response takes 15-30 seconds - As context grows (long conversations), TG slows further as the KV cache eats more VRAM The 4.262 bpw quantization is pretty aggressive — you're trading ~30-40% quality for fitting a 27B model in 16GB VRAM. If TG becomes too slow during long sessions, you could consider a lighter quant (Q4_K_M at ~5bpw) or a smaller model to leave more VRAM for the KV cache. ╰──────────────────────────────────────────────────────╯ ⚕ Qwen3.6-27B-GGUF-4.262b... │ 35.9K/65.5K │ [██████░░░░] 55% │ 29m │ ⏲ 6m 54s ───────────────────────────────────────────────────────压测结论
Metric 小 prompt (54 tok) 大 prompt (2209 tok) PP (Prompt Processing) 293 tok/s 381 tok/s TG (Token Generation) 19 tok/s 9.6 tok/s - PP 随 prompt 增长而加速(293 → 381 tok/s):batch processing 在更大输入上效率更高
- TG 随上下文增长而下降(19 → 9.6 tok/s):KV cache 随上下文膨胀后,与模型 tensor 争抢显存带宽。16G 跑 27B 本就很紧张
- 实际体感:100 token 回复约 5-10s,300 token 约 15-30s。长对话后期生成会进一步变慢
- Base URL: