抡锤者

alanwoo

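I use qwen3-vl 8b q4 for OCR and it works quite well, mainly because it does not take much memory, about 6 GB. qwen3.6-27b q4 reasons better but needs more memory, roughly 16 GB or more.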
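As a rough sanity check on those figures, the weights-only footprint can be estimated from the parameter count and the bits per weight. This is only a sketch: the function name is made up for illustration, it assumes exactly 4 bits per weight (real q4 formats usually store a little more), and it ignores the KV cache and runtime overhead, which is why actual usage is higher than these numbers.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Hypothetical helper for illustration; excludes KV cache and overhead.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Nominal 4-bit quantization is an assumption, not a measured value.
    for name, params in [("qwen3-vl 8b", 8.0), ("qwen3.6-27b", 27.0)]:
        gb = weight_memory_gb(params, 4.0)
        print(f"{name}: ~{gb:.1f} GB of weights at 4 bits")
```

The gap between these estimates and the roughly 6 GB and 16 GB seen in practice is the space taken by context, activations and the runtime itself.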

alanwoo

@Xiaote

Please confirm:
RTX PRO 6000 Blackwell has 96 GB of memory.
Qwen3.6-27B has a maximum context of 262000 tokens.

Alan

alanwoo

Hermes Agent latest version v0.17.0 (June 19, 2026): bug when deploying a local model

Local model: qwen3.6-27b-fp8
Inference engine: vLLM 0.23.0
GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
OS: Ubuntu 24.04.4 LTS

Today, after I updated Hermes Agent from v0.14.0 to the latest version v0.17.0, the errors below started appearing frequently. Adding a max_tokens setting resolved the problem.

Error output

API call failed (attempt 1/3): BadRequestError [HTTP 400]
   Provider: custom  Model: qwen36-27b
   Endpoint: http://127.0.0.1:8000/v1
   Error: HTTP 400: This model's maximum context length is 65536 tokens. However, you requested 65536 output tokens and your prompt contains 81579 characters (more than 0 characters, which is the upper bound for 0 input tokens). Please reduce the length of the input prompt or the number of requested output tokens. (par
   Details: {'message': "This model's maximum context length is 65536 tokens. However, you requested 65536 output tokens and your prompt contains 81579 characters (more than 0 characters, which is the upper bound for 0 input tokens). Please reduce the length of the input prompt or the number of requested output
   Elapsed: 0.21s  Context: 2 msgs, ~4,861 tokens
   Output cap too large for current prompt — retrying with max_tokens=38,279 (available_tokens=38,343; context_length unchanged at 65,536)

API call failed (attempt 1/3): BadRequestError [HTTP 400]
   Provider: custom  Model: qwen36-27b
   Endpoint: http://127.0.0.1:8000/v1
   Error: HTTP 400: This model's maximum context length is 65536 tokens. However, you requested 65536 output tokens and your prompt contains 98728 characters (more than 0 characters, which is the upper bound for 0 input tokens). Please reduce the length of the input prompt or the number of requested output tokens. (par
   Details: {'message': "This model's maximum context length is 65536 tokens. However, you requested 65536 output tokens and your prompt contains 98728 characters (more than 0 characters, which is the upper bound for 0 input tokens). Please reduce the length of the input prompt or the number of requested output
   Elapsed: 0.04s  Context: 5 msgs, ~9,388 tokens
   Output cap too large for current prompt — retrying with max_tokens=32,562 (available_tokens=32,626; context_length unchanged at 65,536)
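The retry lines in the log can be reproduced with simple arithmetic. The sketch below is not Hermes Agent's code; the function name is hypothetical, the prompt token counts are derived from the logged available_tokens values (65,536 minus 38,343 and 65,536 minus 32,626), and the 64-token margin is inferred from the gap between available_tokens and the retried max_tokens in both log entries.

```python
def clamp_max_tokens(context_length: int, prompt_tokens: int,
                     requested: int, margin: int = 64) -> int:
    """Return an output cap that fits in the context window next to the prompt.

    Hypothetical helper; margin=64 is inferred from the log, not documented.
    """
    available = context_length - prompt_tokens
    if available <= margin:
        raise ValueError("prompt leaves no room for output tokens")
    # Never ask for more than what is left after the prompt and the margin.
    return min(requested, available - margin)


if __name__ == "__main__":
    context_length = 65536
    # Prompt sizes derived from the two logged available_tokens values.
    for prompt_tokens in (65536 - 38343, 65536 - 32626):
        cap = clamp_max_tokens(context_length, prompt_tokens, requested=65536)
        print(f"prompt={prompt_tokens} -> max_tokens={cap}")
    # With an explicit, smaller request the cap is left unchanged.
    print(clamp_max_tokens(context_length, 65536 - 38343, requested=8192))
```

Requesting 65,536 output tokens against a 65,536-token window can never fit once the prompt is non-empty, which is why every first attempt is rejected with HTTP 400.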

config.yaml before the change

model:
  provider: custom
  default: qwen36-27b
  base_url: http://127.0.0.1:8000/v1

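Fix:
Add max_tokens: 8192, so the requested output no longer takes up the entire 65536-token context window.
*Adjust the max_tokens value to suit your needs.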

config.yaml after the change

model:
  provider: custom
  default: qwen36-27b
  base_url: http://127.0.0.1:8000/v1
  max_tokens: 8192

Hope this helps everyone.

Alan
