<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得]]></title><description><![CDATA[<p dir="auto">見到各位大神有在分享設置, 然後我這個小白最近在公司還蠻幸運能用上AWS G5dn x12large配置</p>
<p dir="auto">現在就來分享一下我在試驗時候(私心)測試後比較穩定的配置(至少沒有OOM).</p>
<p dir="auto">A10G可以視作被削弱的3090, 所以配置可以視作3090/3090Ti適用, 擁有3090等級或者以上的24GB卡也用來參考 (基本上只會比這個更快, 不會更慢).</p>
<p dir="auto">注意我目前還沒有額外時間跑太多的Benchmark, 而且有跑的Benchmark基本上是vLLm自己官網上有寫的.  因此下面有寫的基本上就是成功架構并且能達到普(老)通(闆)人不會等到不耐煩的速度.</p>
<p dir="auto">使用場景: v0.22.0-cu129-ubuntu2404</p>
<h3>1) Gemma 4 31B</h3>
<p dir="auto">試過2 * A10G 跟 4 * A10G, 以下是2 * A10G設置</p>
<pre><code>docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  -e VLLM_MODEL_NAME="Gemma-4-31B-it" \
  vllm/vllm-openai:v0.22.0-cu129-ubuntu2404 \
  --model Intel/gemma-4-31B-it-int4-AutoRound \
  --served-model-name Gemma-4-31B-it \
  --dtype float16 \
  --quantization auto_round \
  --gpu-memory-utilization 0.90 \
  --max-model-len 192768 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --attention-backend TRITON_ATTN \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}' \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
</code></pre>
<p dir="auto"><em><strong>Gemma 4 31B不能在單A10G的情況下使用, 就算是Int 4也不行, 請研究關於MoE的配置</strong></em></p>
<p dir="auto">來解釋一下配置原因:</p>
<p dir="auto">base: Intel/gemma-4-31B-it-int4-AutoRound  (~22 GB)<br />
drafter: google/gemma-4-31B-it-assistant  (0.5B / ~927 MB BF16)</p>
<p dir="auto">Gemma 4 MTP跟Qwen 3.6不一樣, Qwen内置MTP head, 因此可以不需要額外帶上另外一個drafter.<br />
Gemma 4并沒有内置MTP head, 所以需要一個基於Gemma 4提煉出來的drafter. 理論上也可以使用gemma 4 E2B 跟 E4B作爲drafter, 但是這兩個比31B自己的0.5B drafter還要大</p>
<p dir="auto">dtype : float16, 理論上bf16也可以, 3090 支持float 16跟bfloat 16<br />
quantization: autoround, 低精度下保持相對高的token質量, int4/int8必用</p>
<p dir="auto">gpu-memory-utilization: 0.9, 有其他東西需要VRAM所以我限制在0.9, 理論上Headless Server可以把這個推到0.95的話, 然後2張卡可以上260K長度<br />
max-num-seqs: 1作爲POC超長上下文使用, 普通長上下文Agent可以設成2, 短上下文Agent可以上4</p>
<p dir="auto">max-num-batched-tokens: vision tower需要至少2496以上<br />
attention-backend: TRITON_ATTN, Gemma 4 理論上支持 FA2, 但是基於Head Dimension跟Hybrid Attention的關係用FA會爆炸, 現階段還是用triton attention</p>
<p dir="auto">speculative-config: MTP加速</p>
<p dir="auto"><em><strong>值得一提的是kv-cache-dtype理論上int8_per_token_head, 我也曾經在Reddit上看到有人成功使用, 可是我自己不行, 有3090的朋友可以試試看</strong></em></p>
<pre><code>Benchmark

--dataset-name random --input-len 1024 --output-len 256 --num-prompts 100 --request-rate inf --ignore-eos

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  376.11    
Total input tokens:                      103747    
Total generated tokens:                  25600     
Request throughput (req/s):              0.27      
Output token throughput (tok/s):         68.06     
Peak output token throughput (tok/s):    32.00     
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          343.91    
---------------Time to First Token----------------
Mean TTFT (ms):                          190345.47 
Median TTFT (ms):                        190211.51 
P99 TTFT (ms):                           370098.98 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.63     
Median TPOT (ms):                        10.63     
P99 TPOT (ms):                           13.52     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.00     
Median ITL (ms):                         32.28     
P99 ITL (ms):                            32.95     
---------------Speculative Decoding---------------
Acceptance rate (%):                     51.67     
Acceptance length:                       3.07      
Drafts:                                  8369      
Draft tokens:                            33476     
Accepted tokens:                         17296     
Per-position acceptance (%):
  Position 0:                            77.55     
  Position 1:                            58.00     
  Position 2:                            41.55     
  Position 3:                            29.57     
==================================================
</code></pre>
<hr />
<h3>2) Gemma 4 26B A4B</h3>
<p dir="auto">單純4 * A10G設立過, 但并沒有真正使用過, 沒加上MTP,  原則上跟Gemma 4 31B的思路差不多</p>
<pre><code>docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  -e VLLM_MODEL_NAME="Gemma-4-26B-A4B-it" \
  vllm/vllm-openai:v0.22.0-cu129-ubuntu2404 \
  --model Intel/gemma-4-26B-A4B-it-int4-mixed-AutoRound \
  --served-model-name Gemma-4-26B-A4B-it \
  --dtype float16 \
  --quantization auto_round \
  --gpu-memory-utilization 0.90 \
  --max-model-len 192768 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --enable-expert-parallel true \
  --attention-backend TRITON_ATTN \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
</code></pre>
<p dir="auto">model: Intel/gemma-4-26B-A4B-it-int4-mixed-AutoRound (~15 GB)</p>
<p dir="auto">請注意跟自行修改tensor-parallel-size跟enable-expert-parallel</p>
<p dir="auto">tensor-parallel-size: 4, 因爲x12Large有4張卡, 這個按照你卡的數量以2的次方為準 (2, 4, 8 etc)<br />
enable-expert-parallel: true, 這個沒有NVLINK那些的話就直接false就可以了, 有興趣的可以問我, 之後再回答</p>
<p dir="auto">以下預測MTP的配置:</p>
<p dir="auto">drafter: google/gemma-4-26B-A4B-it-assistant  (~0.4B / ~800 MB BF16)<br />
--speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}'</p>
<p dir="auto">如果想要2卡的話max-model-len很大機會需要乘上0.75 或 0.5</p>
<hr />
<ol start="3">
<li>Qwen 3.6 27B</li>
</ol>
<p dir="auto">有一個比較嚴重的問題是Qwen 3.6在長上下文的情況下思考時間過長(Overthinking), 試過關閉enable_thinking跟限制thinking token數量但效果不太明顯.</p>
<p dir="auto">雖然質量出色但因爲本身System Prompt加上上下文一多就會導致TTFT太長, 需要更多時間研究</p>
<p dir="auto">試過2 * A10G 跟 4 * A10G, 以下是2 * A10G設置</p>
<pre><code>docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  -e VLLM_MODEL_NAME="Qwen3.6-27B" \
  vllm/vllm-openai:v0.22.0-cu129-ubuntu2404 \
  --model cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 \
  --tensor-parallel-size 2 \
  --max-model-len 192768 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
</code></pre>
<p dir="auto">cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 (~ 15GB)</p>
<p dir="auto">沒特別研究所以還沒看autoround (Lorbus/Qwen3.6-27B-int4-AutoRound ) 以及其他人有關fp8 kv cache的設定</p>
<hr />
<ol start="4">
<li>Qwen 3.6 35B A3B</li>
</ol>
<p dir="auto">未嘗試, 但估計思路跟從Gemma 4 31B轉換到26B-A4B類似</p>
]]></description><link>https://lcz.me/topic/398/論-a10g-3090-底下的gemma-4跟qwen-3.6測試心得</link><generator>RSS for Node</generator><lastBuildDate>Sat, 06 Jun 2026 02:11:18 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/398.rss" rel="self" type="application/rss+xml"/><pubDate>Tue, 02 Jun 2026 16:32:59 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Fri, 05 Jun 2026 08:57:05 GMT]]></title><description><![CDATA[<p dir="auto">这个系列的帖子很好，很有参考价值，有其他的也可以分享下。</p>
]]></description><link>https://lcz.me/post/5182</link><guid isPermaLink="true">https://lcz.me/post/5182</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Fri, 05 Jun 2026 08:57:05 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Fri, 05 Jun 2026 04:02:30 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/c0aster" aria-label="Profile: c0aster">@<bdi>c0aster</bdi></a></p>
<p dir="auto">Deepseek在伺服器那邊基本上都會有Prompting優化, 本地的AGENT.md跟Rules基本上不會有同樣的效果</p>
<p dir="auto">倒不如說根本追不上, 就算是同樣的Prompting, 一個27B跟一個1600B-A49B (1.6T-A49B, DeepSeek-V4-Pro), 基本上就是螞蟻跟大象的分別</p>
<p dir="auto">本地最大的優勢就只是在處理敏感資料跟不會額外收費而已</p>
]]></description><link>https://lcz.me/post/5122</link><guid isPermaLink="true">https://lcz.me/post/5122</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Fri, 05 Jun 2026 04:02:30 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Fri, 05 Jun 2026 02:01:00 GMT]]></title><description><![CDATA[<p dir="auto">我的3090跑qwen3.6 27B,TOKEN 54 t/s，但写代码完整的项目，不能直接运行，deepseek v4直接OK，好像实际意义不大，opencode跑的完整项目，简单页面确实能直接跑起来，同样提示词（前端效果的），效果和deepseek差距巨大，是我使用方式不对么</p>
]]></description><link>https://lcz.me/post/5067</link><guid isPermaLink="true">https://lcz.me/post/5067</guid><dc:creator><![CDATA[c0aster]]></dc:creator><pubDate>Fri, 05 Jun 2026 02:01:00 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Thu, 04 Jun 2026 09:12:18 GMT]]></title><description><![CDATA[<p dir="auto">Qwen 27B 參數 (1 * A10G)</p>
<p dir="auto">放棄, VRAM太過於緊張了, 有幾次雖然成功架構vLLM但是壓力測試失敗, 過於不穩定</p>
<p dir="auto">有需要的人可以看著這個折騰, 但這涉及太多偷改內核的東西, 很有可能下一個版本就無法再用, 姑且不使用</p>
<p dir="auto"><a href="https://lcz.me/topic/417/%E6%89%BE%E5%88%B0%E4%B8%AA%E8%9B%AE%E6%9C%89%E7%94%A8%E7%9A%84%E7%94%A83090%E9%83%A8%E7%BD%B2%E6%9C%AC%E5%9C%B0%E6%A8%A1%E5%9E%8B%E7%9A%84repo">https://lcz.me/topic/417/找到个蛮有用的用3090部署本地模型的repo</a></p>
]]></description><link>https://lcz.me/post/4982</link><guid isPermaLink="true">https://lcz.me/post/4982</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Thu, 04 Jun 2026 09:12:18 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Thu, 04 Jun 2026 07:44:47 GMT]]></title><description><![CDATA[<p dir="auto">Qwen 27B 參數 (2 * A10G)</p>
<p dir="auto">Docker Image: vllm-openai:v0.22.0-cu129-ubuntu2404</p>
<pre><code>vllm serve \
  --model Lorbus/Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --served-model-name Qwen-3.6-27B-autoround \
  --dtype float16 \
  --quantization auto_round \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 192768 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --language-model-only \
  --enable-auto-tool-choice \
  --mamba-cache-mode align \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3
</code></pre>
<p dir="auto"><img src="https://upload.lcz.me/uploads/5194629d-8db1-4892-862a-2f1f598ac871.jpeg" alt="c4a7ede3-f9d1-49b6-bec5-8c29739e2ced-image.jpeg" class=" img-fluid img-markdown" /></p>
<hr />
<h2>碎碎唸</h2>
<p dir="auto">思路基本上跟A10G * 4一樣, batch token 降到4096, gpu memory utilization 上到0.95</p>
<p dir="auto">以下是更新版的Benchmark</p>
<pre><code>### Workload

| Metric                     | Run 07:09            | Run 07:17            | Run 07:26            |
| -------------------------- | -------------------- | -------------------- | -------------------- |
| dataset                    | random               | random               | random               |
| input length arg           | 1024                 | 1024                 | 1024                 |
| output length arg          | 256                  | 256                  | 256                  |
| input tokens mean/min/max  | 1034.4 / 1033 / 1036 | 1034.4 / 1033 / 1036 | 1034.4 / 1033 / 1036 |
| output tokens mean/min/max | 256.0 / 256 / 256    | 256.0 / 256 / 256    | 256.0 / 256 / 256    |
| num prompts                | 100                  | 100                  | 100                  |
| request rate               | inf                  | inf                  | inf                  |

### Request Outcome

| Metric                 | Run 07:09  | Run 07:17  | Run 07:26  |
| ---------------------- | ---------- | ---------- | ---------- |
| successful requests    | 100        | 100        | 100        |
| failed requests        | 0          | 0          | 0          |
| benchmark duration (s) | 463.34     | 478.80     | 474.50     |

### Latency

| Metric           | Run 07:09   | Run 07:17   | Run 07:26   |
| ---------------- | ----------- | ----------- | ----------- |
| mean TTFT (ms)   | 232418.08   | 238435.64   | 236922.49   |
| median TTFT (ms) | 231770.91   | 238065.71   | 238316.95   |
| P99 TTFT (ms)    | 455414.07   | 470471.84   | 466104.09   |
| mean TPOT (ms)   | 14.38       | 15.00       | 14.83       |
| P99 TPOT (ms)    | 24.48       | 20.19       | 22.90       |
| mean ITL (ms)    | 39.04       | 39.49       | 39.32       |
| P99 ITL (ms)     | 41.72       | 42.91       | 42.08       |

### Throughput

| Metric                          | Run 07:09 | Run 07:17 | Run 07:26 |
| ------------------------------- | --------- | --------- | --------- |
| request throughput (req/s)      | 0.216     | 0.209     | 0.211     |
| output token throughput (tok/s) | 55.25     | 53.47     | 53.95     |
| total token throughput (tok/s)  | 278.50    | 269.50    | 271.94    |
| prefill throughput (tok/s)      | 4.5       | 4.3       | 4.4       |

### Memory And Cache

| Metric                      | Run 07:09                  | Run 07:17                  | Run 07:26                  |
| --------------------------- | -------------------------- | -------------------------- | -------------------------- |
| VRAM before (MiB)           | 20731                      | 21693                      | 21693                      |
| VRAM peak (MiB)             | 21693                      | 21693                      | 21693                      |
| VRAM peak per GPU (MiB)     | 21691, 21693, 3, 3         | 21691, 21693, 3, 3         | 21691, 21693, 3, 3         |
| RAM used peak (MiB)         | 16572                      | 15092                      | 15119                      |
| vLLM process RSS peak (MiB) | 1837                       | 1837                       | 1837                       |
| gpu/kv_cache_usage peak     | 3.1%                       | 3.1%                       | 3.1%                       |
| prefix caching enabled      | false                      | false                      | false                      |
| prefix cache hit rate       | n/a                        | n/a                        | n/a                        |

### Speculative Decoding

| Metric              | Run 07:09 | Run 07:17 | Run 07:26 |
| ------------------- | --------- | --------- | --------- |
| acceptance rate (%) | 58.40     | 55.60     | 56.28     |
| acceptance length   | 2.75      | 2.67      | 2.69      |
</code></pre>
]]></description><link>https://lcz.me/post/4967</link><guid isPermaLink="true">https://lcz.me/post/4967</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Thu, 04 Jun 2026 07:44:47 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Thu, 04 Jun 2026 06:47:51 GMT]]></title><description><![CDATA[<p dir="auto">然後給一下Qwen 27B 參數 (4 * A10G)</p>
<p dir="auto">Docker Image: vllm-openai:v0.22.0-cu129-ubuntu2404</p>
<pre><code>vllm serve \
  --model Lorbus/Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --served-model-name Qwen-3.6-27B-autoround \
  --dtype float16 \
  --quantization auto_round \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 192768 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --language-model-only \
  --enable-auto-tool-choice \
  --mamba-cache-mode align \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3
</code></pre>
<p dir="auto"><img src="https://upload.lcz.me/uploads/c698a419-847f-4850-bde2-f290a42f42c9.jpeg" alt="1637555a-9c24-425c-b772-1a8fef797783-image.jpeg" class=" img-fluid img-markdown" /></p>
<hr />
<h2>碎碎唸</h2>
<p dir="auto">基本上跟Gemma 4一樣，使用auto round來節省model weight</p>
<p dir="auto">kv cache則使用僅有支持Ampere架構的fp8_e5m2, vllm可以透過fp8_e5m2模仿bfloat16, 並且轉換成int8獲得硬件加速, fp8_e4m3架構則不支持模仿</p>
<p dir="auto">強烈不建議使用 <em><strong>--default-chat-template-kwargs '{"enable_thinking": false}'</strong></em>, Token質量會斷崖式下降</p>
<p dir="auto">以下是更新版的Benchmark</p>
<pre><code>### Workload

| Metric                     | Run 05:17            | Run 05:28            | Run 05:36            |
| -------------------------- | -------------------- | -------------------- | -------------------- |
| dataset                    | random               | random               | random               |
| input length arg           | 1024                 | 1024                 | 1024                 |
| output length arg          | 256                  | 256                  | 256                  |
| input tokens mean/min/max  | 1034.4 / 1033 / 1036 | 1034.4 / 1033 / 1036 | 1034.4 / 1033 / 1036 |
| output tokens mean/min/max | 256.0 / 256 / 256    | 256.0 / 256 / 256    | 256.0 / 256 / 256    |
| num prompts                | 100                  | 100                  | 100                  |
| request rate               | inf                  | inf                  | inf                  |

### Request Outcome

| Metric                 | Run 05:17 | Run 05:28 | Run 05:36 |
| ---------------------- | --------- | --------- | --------- |
| successful requests    | 100       | 100       | 100       |
| failed requests        | 0         | 0         | 0         |
| benchmark duration (s) | 430.22    | 427.70    | 443.72    |

### Latency

| Metric           | Run 05:17 | Run 05:28 | Run 05:36 |
| ---------------- | --------- | --------- | --------- |
| mean TTFT (ms)   | 214258.88 | 211603.07 | 217519.66 |
| median TTFT (ms) | 211865.20 | 210793.65 | 213751.71 |
| P99 TTFT (ms)    | 422468.83 | 418775.36 | 435311.06 |
| mean TPOT (ms)   | 13.11     | 13.01     | 13.63     |
| P99 TPOT (ms)    | 21.82     | 16.84     | 19.43     |
| mean ITL (ms)    | 35.67     | 35.94     | 36.59     |
| P99 ITL (ms)     | 38.89     | 39.51     | 40.25     |

### Throughput

| Metric                          | Run 05:17 | Run 05:28 | Run 05:36 |
| ------------------------------- | --------- | --------- | --------- |
| request throughput (req/s)      | 0.232     | 0.234     | 0.225     |
| output token throughput (tok/s) | 59.50     | 59.85     | 57.69     |
| total token throughput (tok/s)  | 299.94    | 301.70    | 290.81    |
| prefill throughput (tok/s)      | 4.8       | 4.9       | 4.8       |

### Memory And Cache

| Metric                      | Run 05:17                  | Run 05:28                  | Run 05:36                  |
| --------------------------- | -------------------------- | -------------------------- | -------------------------- |
| VRAM before (MiB)           | 20261                      | 21143                      | 21143                      |
| VRAM peak (MiB)             | 21143                      | 21143                      | 21143                      |
| VRAM peak per GPU (MiB)     | 21143, 21143, 21143, 21143 | 21143, 21143, 21143, 21143 | 21143, 21143, 21143, 21143 |
| RAM used peak (MiB)         | 22076                      | 20870                      | 20798                      |
| vLLM process RSS peak (MiB) | 1825                       | 1825                       | 1825                       |
| gpu/kv_cache_usage peak     | 1.2%                       | 1.2%                       | 1.2%                       |
| prefix caching enabled      | false                      | false                      | false                      |
| prefix cache hit rate       | n/a                        | n/a                        | n/a                        |

### Speculative Decoding

| Metric              | Run 05:17 | Run 05:28 | Run 05:36 |
| ------------------- | --------- | --------- | --------- |
| acceptance rate (%) | 58.75     | 60.16     | 57.49     |
| acceptance length   | 2.76      | 2.80      | 2.72      |

---
</code></pre>
]]></description><link>https://lcz.me/post/4960</link><guid isPermaLink="true">https://lcz.me/post/4960</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Thu, 04 Jun 2026 06:47:51 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Thu, 04 Jun 2026 06:10:26 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/566656661" aria-label="Profile: 566656661">@<bdi>566656661</bdi></a> 这个牛，有空我也测试下，我正好A卡N卡都有<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f602.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--joy" style="height:23px;width:auto;vertical-align:middle" title="😂" alt="😂" /></p>
]]></description><link>https://lcz.me/post/4955</link><guid isPermaLink="true">https://lcz.me/post/4955</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Thu, 04 Jun 2026 06:10:26 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Thu, 04 Jun 2026 07:30:12 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/dreamy2k" aria-label="Profile: dreamy2k">@<bdi>dreamy2k</bdi></a></p>
<p dir="auto">好消息是你可以混合使用A + N卡, 你可以用Vulkan來將model分到兩張卡的VRAM上面, 然後llamacpp選用Vulkan, 我也曾經在Reddit上面聽過有人混合RTX 5070 Ti + RX 9070, 除了prefill速度慢了跟沒有特別優化之外應該沒什麼問題</p>
<p dir="auto"><img src="https://upload.lcz.me/uploads/6731b970-fcf9-4437-b99f-520d2c20734f.jpeg" alt="89429049-f523-47df-9f14-eb4632bc1f14-image.jpeg" class=" img-fluid img-markdown" /></p>
<p dir="auto">壞消息是你需要自己編譯Vulkan內核</p>
<p dir="auto">如果是普通人不太想太深入研究的話推薦直接買多一張A10G, 或者賣A10G換成R9700</p>
<hr />
<h2>碎碎念一下</h2>
<p dir="auto"><s>跑去llamacpp看了一下, 很不負責地給一下編譯command</s></p>
<p dir="auto">強烈建議使用docker container + Linux Kernel, 不要在Window底下編譯, <a href="https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp" rel="nofollow ugc">可以用這個試試看</a></p>
<pre><code>編譯
rm -rf build &amp;&amp; \
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -B build \
  -DBUILD_SHARED_LIBS=ON \
  -DGGML_BACKEND_DL=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_CPU_ALL_VARIANTS=ON \
  -DGGML_CUDA=ON \              
  -DGGML_HIP=ON \
  -DGPU_TARGETS=gfx1201 \                           #(R9700 AI 架構)
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES="86" &amp;&amp; \           #(3090 SM86架構)
cmake --build build --config Release -j 64
</code></pre>
<pre><code>啟動
${HOME}/code/llama.cpp/build/bin/llama-server \
	--port 1234 --host 0.0.0.0 \
	--models-preset &lt;你模型的啟動參數&gt;.ini \
	--device CUDA0,ROCm0 --fit-target 3072,512        #(假設你第一張卡是插屏幕，需要預留多點VRAM)
</code></pre>
]]></description><link>https://lcz.me/post/4954</link><guid isPermaLink="true">https://lcz.me/post/4954</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Thu, 04 Jun 2026 07:30:12 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Thu, 04 Jun 2026 04:36:52 GMT]]></title><description><![CDATA[<p dir="auto">我也是用A10G但只用单卡，现在用QWEN3.6-35B-A3B Q4 使用中，但VRAM不够用，我想听听大神的建議，是买多一張A10G好还是有一張R9700 32G呢</p>
]]></description><link>https://lcz.me/post/4940</link><guid isPermaLink="true">https://lcz.me/post/4940</guid><dc:creator><![CDATA[dreamy2k]]></dc:creator><pubDate>Thu, 04 Jun 2026 04:36:52 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Wed, 03 Jun 2026 11:41:55 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a></p>
<p dir="auto">抱歉, 因為是軟件工程出身跟泡太多關於榨乾硬件性能的Reddit帖子 / Github Repo, 所以關於配置跟benchmark可能會帶有比較專業的名詞跟滿多數據, 之後會在回覆開頭加個tldr</p>
<p dir="auto">這個帖子估計也會持續更新, 在實體空(摸)閒(魚)的時候拿來當實驗紀錄</p>
]]></description><link>https://lcz.me/post/4827</link><guid isPermaLink="true">https://lcz.me/post/4827</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Wed, 03 Jun 2026 11:41:55 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Wed, 03 Jun 2026 11:01:24 GMT]]></title><description><![CDATA[<p dir="auto">太偏专业性，可以在帖子开头加入简单点的结论，小白一看就懂，有需求的才会详细看测试数据。</p>
]]></description><link>https://lcz.me/post/4817</link><guid isPermaLink="true">https://lcz.me/post/4817</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Wed, 03 Jun 2026 11:01:24 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Thu, 04 Jun 2026 10:35:02 GMT]]></title><description><![CDATA[<p dir="auto">然後這個是Gemma 4 31B (2 * A10G)</p>
<pre><code>vllm serve \
  --model Intel/gemma-4-31B-it-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --generation-config vllm \
  --served-model-name Gemma-4-31B-it \
  --dtype float16 \
  --quantization auto_round \
  --gpu-memory-utilization 0.95 \        #(需要上到0.95不然OOM)
  --max-model-len 192768 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \ #(8192降到4096)
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --language-model-only \
  --attention-config.backend TRITON_ATTN \
  --limit-mm-per-prompt '{"image":0,"video":0}' \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}' \
  --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
</code></pre>
<p dir="auto"><img src="https://upload.lcz.me/uploads/5ed458b7-8e5a-4b12-a232-732fbf900a2c.jpeg" alt="3cd0bf60-44bd-48e5-ae1c-792ccb9fce2e-image.jpeg" class=" img-fluid img-markdown" /></p>
<p dir="auto">以下是更新版的Benchmark</p>
<pre><code>### Workload

| Metric                     | Run 08:58            | Run 09:07            | Run 09:34            |
| -------------------------- | -------------------- | -------------------- | -------------------- |
| dataset                    | random               | random               | random               |
| input length arg           | 1024                 | 1024                 | 1024                 |
| output length arg          | 256                  | 256                  | 256                  |
| input tokens mean/min/max  | 1037.5 / 1037 / 1039 | 1037.5 / 1037 / 1039 | 1037.5 / 1037 / 1039 |
| output tokens mean/min/max | 256.0 / 256 / 256    | 256.0 / 256 / 256    | 256.0 / 256 / 256    |
| num prompts                | 100                  | 100                  | 100                  |
| request rate               | inf                  | inf                  | inf                  |

### Request Outcome

| Metric                 | Run 08:58 | Run 09:07 | Run 09:34 |
| ---------------------- | --------- | --------- | --------- |
| successful requests    | 100       | 100       | 100       |
| failed requests        | 0         | 0         | 0         |
| benchmark duration (s) | 462.51    | 457.19    | 462.96    |

### Latency

| Metric           | Run 08:58 | Run 09:07 | Run 09:34 |
| ---------------- | --------- | --------- | --------- |
| mean TTFT (ms)   | 233010.17 | 229102.51 | 231664.74 |
| median TTFT (ms) | 234769.52 | 232669.51 | 231388.69 |
| P99 TTFT (ms)    | 453358.78 | 449056.54 | 454054.81 |
| mean TPOT (ms)   | 13.96     | 13.75     | 13.98     |
| P99 TPOT (ms)    | 18.09     | 17.01     | 18.07     |
| mean ITL (ms)    | 42.03     | 41.92     | 42.02     |
| P99 ITL (ms)     | 43.59     | 43.59     | 43.67     |

### Throughput

| Metric                          | Run 08:58 | Run 09:07 | Run 09:34 |
| ------------------------------- | --------- | --------- | --------- |
| request throughput (req/s)      | 0.216     | 0.219     | 0.216     |
| output token throughput (tok/s) | 55.35     | 55.99     | 55.30     |
| total token throughput (tok/s)  | 279.66    | 282.92    | 279.39    |
| prefill throughput (tok/s)      | 4.5       | 4.5       | 4.5       |

### Memory And Cache

| Metric                      | Run 08:58                  | Run 09:07                  | Run 09:34                  |
| --------------------------- | -------------------------- | -------------------------- | -------------------------- |
| VRAM before (MiB)           | 20825                      | 20947                      | 20825                      |
| VRAM peak (MiB)             | 20947                      | 20947                      | 20947                      |
| VRAM peak per GPU (MiB)     | 20947, 20947, 3, 3        | 20947, 20947, 3, 3        | 20947, 20947, 3, 3        |
| RAM used peak (MiB)         | 14713                      | 14706                      | 14809                      |
| vLLM process RSS peak (MiB) | 2117                       | 2117                       | 2133                       |
| gpu/kv_cache_usage peak     | 4.4%                       | 4.4%                       | 4.4%                       |
| prefix caching enabled      | false                      | false                      | false                      |
| prefix cache hit rate       | 0.00% (0/103761)           | 0.00% (0/103761)           | 0.00% (0/103761)           |

### Speculative Decoding

| Metric              | Run 08:58 | Run 09:07 | Run 09:34 |
| ------------------- | --------- | --------- | --------- |
| acceptance rate (%) | 51.55     | 52.60     | 51.51     |
| acceptance length   | 3.06      | 3.10      | 3.06      |
</code></pre>
]]></description><link>https://lcz.me/post/4805</link><guid isPermaLink="true">https://lcz.me/post/4805</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Thu, 04 Jun 2026 10:35:02 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Thu, 04 Jun 2026 10:34:20 GMT]]></title><description><![CDATA[<p dir="auto">昨天太晚打文, 剛才從頭看發現我搞錯了4 * A10G 跟 2 * A10G 的部分參數, 先說聲抱歉, 這個才是 Gemma 4 31B (4 * A10G)</p>
<pre><code>docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  -e VLLM_MODEL_NAME="Gemma-4-31B-it" \
  vllm/vllm-openai:v0.22.0-cu129-ubuntu2404 \
  --model Intel/gemma-4-31B-it-int4-AutoRound \
  --served-model-name Gemma-4-31B-it \
  --dtype float16 \
  --quantization auto_round \
  --gpu-memory-utilization 0.90 \
  --max-model-len 192768 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \    #(這裏4096改8192)
  --tensor-parallel-size 4 \                     #(這裏2改4)
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --attention-backend TRITON_ATTN \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":4}' \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
</code></pre>
<p dir="auto"><img src="https://upload.lcz.me/uploads/b60d46a2-f6fc-4cde-86d5-76815918c57b.jpeg" alt="707cf399-28fc-48f1-aad7-35fdfa029960-image.jpeg" class=" img-fluid img-markdown" /></p>
<p dir="auto">以下是更新版的Benchmark</p>
<pre><code>### Workload

| Metric                     | Run 07:30            | Run 07:45            | Run 08:05            |
| -------------------------- | -------------------- | -------------------- | -------------------- |
| dataset                    | random               | random               | random               |
| input length arg           | 1024                 | 1024                 | 1024                 |
| output length arg          | 256                  | 256                  | 256                  |
| input tokens mean/min/max  | 1037.5 / 1037 / 1039 | 1037.5 / 1037 / 1039 | 1037.5 / 1037 / 1039 |
| output tokens mean/min/max | 256.0 / 256 / 256    | 256.0 / 256 / 256    | 256.0 / 256 / 256    |
| num prompts                | 100                  | 100                  | 100                  |
| request rate               | inf                  | inf                  | inf                  |

### Request Outcome

| Metric                 | Run 07:30 | Run 07:45 | Run 08:05 |
| ---------------------- | --------- | --------- | --------- |
| successful requests    | 100       | 100       | 100       |
| failed requests        | 0         | 0         | 0         |
| benchmark duration (s) | 371.71    | 374.19    | 367.04    |

### Latency

| Metric           | Run 07:30 | Run 07:45 | Run 08:05 |
| ---------------- | --------- | --------- | --------- |
| mean TTFT (ms)   | 185538.74 | 188381.24 | 182757.33 |
| median TTFT (ms) | 185918.49 | 189676.29 | 181425.16 |
| P99 TTFT (ms)    | 364700.58 | 368045.53 | 360175.17 |
| mean TPOT (ms)   | 10.59     | 10.69     | 10.48     |
| P99 TPOT (ms)    | 14.95     | 15.77     | 15.28     |
| mean ITL (ms)    | 31.83     | 31.73     | 31.76     |
| P99 ITL (ms)     | 33.16     | 33.13     | 33.91     |

### Throughput

| Metric                          | Run 07:30 | Run 07:45 | Run 08:05 |
| ------------------------------- | --------- | --------- | --------- |
| request throughput (req/s)      | 0.269     | 0.267     | 0.272     |
| output token throughput (tok/s) | 68.87     | 68.42     | 69.75     |
| total token throughput (tok/s)  | 347.98    | 345.68    | 352.41    |
| prefill throughput (tok/s)      | 5.6       | 5.5       | 5.7       |

### Memory And Cache

| Metric                      | Run 07:30                  | Run 07:45                  | Run 08:05                  |
| --------------------------- | -------------------------- | -------------------------- | -------------------------- |
| VRAM before (MiB)           | 20371                      | 20453                      | 20453                      |
| VRAM peak (MiB)             | 20453                      | 20453                      | 20453                      |
| VRAM peak per GPU (MiB)     | 20453, 20453, 20453, 20453 | 20453, 20453, 20453, 20453 | 20453, 20453, 20453, 20453 |
| RAM used peak (MiB)         | 19233                      | 19075                      | 23152                      |
| vLLM process RSS peak (MiB) | 2117                       | 2117                       | 2117                       |
| gpu/kv_cache_usage peak     | 1.6%                       | 1.6%                       | 1.6%                       |
| prefix caching enabled      | false                      | false                      | false                      |
| prefix cache hit rate       | 0.00% (0/103761)           | 0.00% (0/103761)           | 1.85% (1920/103761)        |

### Speculative Decoding

| Metric              | Run 07:30 | Run 07:45 | Run 08:05 |
| ------------------- | --------- | --------- | --------- |
| acceptance rate (%) | 51.50     | 50.54     | 52.18     |
| acceptance length   | 3.06      | 3.02      | 3.09      |
</code></pre>
]]></description><link>https://lcz.me/post/4803</link><guid isPermaLink="true">https://lcz.me/post/4803</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Thu, 04 Jun 2026 10:34:20 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Wed, 03 Jun 2026 03:37:06 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a></p>
<p dir="auto">抱歉, 這文是昨天晚上下班的時候打的, 現在吃飯前有空補一補圖</p>
<p dir="auto">這不是在Benchmark的時候截圖, 這是我跟老闆在試Prompting, Gemma-4-31B-it-int4-AutoRound</p>
<p dir="auto"><img src="https://upload.lcz.me/uploads/c98a65d8-e1a1-43af-a04b-51b38bfe338e.jpeg" alt="513e2369-2ab9-4cd8-a6e6-09d77f173357-image.jpeg" class=" img-fluid img-markdown" /></p>
<p dir="auto">其他模型我看有沒有機會試試看跟Benchmark, 有的話會在這裏補上</p>
]]></description><link>https://lcz.me/post/4739</link><guid isPermaLink="true">https://lcz.me/post/4739</guid><dc:creator><![CDATA[566656661]]></dc:creator><pubDate>Wed, 03 Jun 2026 03:37:06 GMT</pubDate></item><item><title><![CDATA[Reply to 論 A10G (~3090) 底下的Gemma 4跟Qwen 3.6測試心得 on Wed, 03 Jun 2026 01:32:44 GMT]]></title><description><![CDATA[<p dir="auto">补充点运行截图，我给置顶，有图有真相。</p>
]]></description><link>https://lcz.me/post/4720</link><guid isPermaLink="true">https://lcz.me/post/4720</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Wed, 03 Jun 2026 01:32:44 GMT</pubDate></item></channel></rss>