<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Running Qwen3.6-35B-A3B-Q4-K-M at 256k &#x2F; ~45 tok&#x2F;s on an RTX 3080 20GB (Ubuntu)]]></title><description><![CDATA[<p dir="auto">I mostly followed the approach in this video:<br />
<a href="https://www.youtube.com/watch?v=8F_5pdcD3HY" rel="nofollow ugc">https://www.youtube.com/watch?v=8F_5pdcD3HY</a><br />
I did not copy it 1:1; I used it as the main reference and adjusted it for my own machine.</p>
<p dir="auto">My current setup:</p>
<p dir="auto">GPU: RTX 3080 20GB</p>
<p dir="auto">RAM: 15 GB</p>
<p dir="auto">CPU: i3-10100F</p>
<p dir="auto">llama.cpp: turboquant build<br />
<a href="https://github.com/TheTom/llama-cpp-turboquant" rel="nofollow ugc">https://github.com/TheTom/llama-cpp-turboquant</a></p>
<p dir="auto">Model: Qwen3.6-35B-A3B-UD-Q4_K_M.gguf</p>
<p dir="auto">Multimodal projector (mmproj): mmproj-F16.gguf</p>
<p dir="auto">Context: 256k</p>
<p dir="auto">n-cpu-moe: 15</p>
<p dir="auto">cache-type-k: turbo4</p>
<p dir="auto">cache-type-v: turbo3</p>
<p dir="auto">flash-attn: on</p>
<p dir="auto">Current results:</p>
<p dir="auto">Runs stably at 256k context</p>
<p dir="auto">Speed is about 45 tok/s</p>
<p dir="auto">Model load time is about 5 minutes</p>
<p dir="auto">With mmproj added, vision also works<br />
<img src="https://upload.lcz.me/uploads/864769fb-4417-4970-aabf-a7c9dcbe25ba.jpeg" alt="beca22fc-40cd-4620-8b5d-87dca6e8d079-image.jpeg" class=" img-fluid img-markdown" /></p>
<p dir="auto">Launch script:<br />
#!/usr/bin/env bash<br />
set -euo pipefail</p>
<p dir="auto">MODEL="/mnt/hdd_storage/models/llama.cpp/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"<br />
SERVER="/mnt/hdd_storage/llama.cpp-turboquant/repo/build/bin/llama-server"<br />
HOST="0.0.0.0"<br />
PORT="9999"<br />
CTX="262144"<br />
THREADS="6"<br />
THREADS_BATCH="6"<br />
BATCH="256"<br />
UBATCH="128"<br />
GPU_LAYERS="99"<br />
CPU_MOE="20"<br />
PARALLEL="2"<br />
CACHE_K="turbo4"<br />
CACHE_V="turbo3"<br />
MMPROJ="/mnt/hdd_storage/models/llama.cpp/mmproj-F16.gguf"<br />
REASONING_MODE="${REASONING_MODE:-off}"</p>
<p dir="auto">exec "$SERVER" \<br />
--model "$MODEL" \<br />
--host "$HOST" \<br />
--port "$PORT" \<br />
-ngl "$GPU_LAYERS" \<br />
--n-cpu-moe "$CPU_MOE" \<br />
-c "$CTX" \<br />
-t "$THREADS" \<br />
-tb "$THREADS_BATCH" \<br />
-b "$BATCH" \<br />
-ub "$UBATCH" \<br />
-np "$PARALLEL" \<br />
--cache-type-k "$CACHE_K" \<br />
--cache-type-v "$CACHE_V" \<br />
--mmproj "$MMPROJ" \<br />
--flash-attn on \<br />
--no-warmup \<br />
--jinja \<br />
--reasoning "$REASONING_MODE"</p>
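<p dir="auto">One thing worth checking with this script: it passes both -c 262144 and -np 2. The sketch below assumes this fork behaves like upstream llama-server when the KV cache is not shared between slots, where the total context is divided evenly among the parallel slots. That is an assumption about the fork, not something verified here, and ctx_per_slot is just a helper name made up for the example.</p>

```shell
#!/usr/bin/env bash
set -euo pipefail

# Tokens of context available to each server slot, ASSUMING the total
# context (-c) is divided evenly among the parallel slots (-np).
ctx_per_slot() {
  local total_ctx="$1" slots="$2"
  echo $(( total_ctx / slots ))
}

# Values taken from the launch script: CTX=262144, PARALLEL=2.
ctx_per_slot 262144 2   # 256 * 1024 tokens shared by 2 slots
ctx_per_slot 262144 1   # a single slot keeps the whole window
```

<p dir="auto">If that assumption holds, each request sees about 128k tokens rather than 256k, and setting PARALLEL="1" would give a single request the full window.</p>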
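<p dir="auto">I tried running several different quantizations of the 27B model, but none of them could handle long-context tasks reliably; they frequently hit OOM. Does anyone have suggestions?</p>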
]]></description><link>https://lcz.me/topic/216/rtx-3080-20gb-上以-256k-45-tk-s-运行-qwen3.6-35b-a3b-q4-k-m-ubuntu</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 06:08:19 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/216.rss" rel="self" type="application/rss+xml"/><pubDate>Tue, 19 May 2026 11:09:04 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Running Qwen3.6-35B-A3B-Q4-K-M at 256k &#x2F; ~45 tok&#x2F;s on an RTX 3080 20GB (Ubuntu) on Tue, 19 May 2026 12:43:12 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/simo9052" aria-label="Profile: simo9052">@<bdi>simo9052</bdi></a> I'm going to copy your homework <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f44d.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--+1" style="height:23px;width:auto;vertical-align:middle" title=":+1:" alt="👍" /> <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f601.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--grin" style="height:23px;width:auto;vertical-align:middle" title=":grin:" alt="😁" /></p>
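]]></description><link>https://lcz.me/post/2614</link><guid isPermaLink="true">https://lcz.me/post/2614</guid><dc:creator><![CDATA[Tide]]></dc:creator><pubDate>Tue, 19 May 2026 12:43:12 GMT</pubDate></item><item><title><![CDATA[Reply to Running Qwen3.6-35B-A3B-Q4-K-M at 256k &#x2F; ~45 tok&#x2F;s on an RTX 3080 20GB (Ubuntu) on Tue, 19 May 2026 11:38:53 GMT]]></title><description><![CDATA[<p dir="auto">The 27B is a dense model, so every weight is used for every token and the whole model has to sit in VRAM, which you don't have enough of. With the 35B, the MoE expert weights can be offloaded to system RAM. Getting this kind of performance out of a 3080 20G is impressive.</p>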
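<p dir="auto">A back-of-the-envelope sketch of that argument. The 4.8 bits per weight figure is only an assumed average for a Q4_K_M-class quantization, not a measurement of these files, and weights_gib is a helper name made up for this example. The estimate ignores the KV cache, the mmproj file and compute buffers; the actual GGUF file sizes are the authoritative numbers.</p>

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rough size of the quantized weights in GiB.
# $1: parameter count in billions
# $2: average bits per weight (an ASSUMED value, not measured)
weights_gib() {
  awk -v p="$1" -v b="$2" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / 1073741824 }'
}

weights_gib 27 4.8   # dense 27B: all of it has to live in VRAM
weights_gib 35 4.8   # 35B MoE: total size, before any expert offload
```

<p dir="auto">Under these assumptions the dense 27B comes to roughly 15 GiB of weights, which leaves only a few GiB of a 20GB card for a long-context KV cache. The 35B is larger in total, but --n-cpu-moe keeps the expert weights of the first N layers in system RAM, so less of it has to fit in VRAM.</p>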
]]></description><link>https://lcz.me/post/2585</link><guid isPermaLink="true">https://lcz.me/post/2585</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Tue, 19 May 2026 11:38:53 GMT</pubDate></item></channel></rss>