Hermes Agent 最新版本 v0.17.0 部署本地模型 bug

alanwoo

Hermes Agent 最新版本 v0.17.0 （June 19，2026）部署本地模型 bug

本地模型：qwen3.6-27b-fp8
推理引擎：vLLM 0.23.0
顯卡：NVIDIA RTX PRO 6000 Blackwell Workstation Edition
OS: Ubuntu 24.04.4 LTS

今天當我更新 Hermes Agent 從 v0.14.0 到最新版本 v.017.0 後頻繁出現一下報錯碼，在加入 max_tokens 設定後問題得以解決。

報錯碼

API call failed (attempt 1/3): BadRequestError [HTTP 400]
   Provider: custom  Model: qwen36-27b
   Endpoint: http://127.0.0.1:8000/v1
   Error: HTTP 400: This model's maximum context length is 65536 tokens. However, you requested 65536 output tokens and your prompt contains 81579 characters (more than 0 characters, which is the upper bound for 0 input tokens). Please reduce the length of the input prompt or the number of requested output tokens. (par
   Details: {'message': "This model's maximum context length is 65536 tokens. However, you requested 65536 output tokens and your prompt contains 81579 characters (more than 0 characters, which is the upper bound for 0 input tokens). Please reduce the length of the input prompt or the number of requested output
   Elapsed: 0.21s  Context: 2 msgs, ~4,861 tokens
   Output cap too large for current prompt — retrying with max_tokens=38,279 (available_tokens=38,343; context_length unchanged at 65,536)

API call failed (attempt 1/3): BadRequestError [HTTP 400]
   Provider: custom  Model: qwen36-27b
   Endpoint: http://127.0.0.1:8000/v1
   Error: HTTP 400: This model's maximum context length is 65536 tokens. However, you requested 65536 output tokens and your prompt contains 98728 characters (more than 0 characters, which is the upper bound for 0 input tokens). Please reduce the length of the input prompt or the number of requested output tokens. (par
   Details: {'message': "This model's maximum context length is 65536 tokens. However, you requested 65536 output tokens and your prompt contains 98728 characters (more than 0 characters, which is the upper bound for 0 input tokens). Please reduce the length of the input prompt or the number of requested output
   Elapsed: 0.04s  Context: 5 msgs, ~9,388 tokens
   Output cap too large for current prompt — retrying with max_tokens=32,562 (available_tokens=32,626; context_length unchanged at 65,536)

修改前 config.yaml 設置

model:
  provider: custom
  default: qwen36-27b
  base_url: http://127.0.0.1:8000/v1

解決方法：
加入 max_tokens: 8192
*可以根據需求調整 max_tokens 參數

修改後 config.yaml 設置

model:
  provider: custom
  default: qwen36-27b
  base_url: http://127.0.0.1:8000/v1
  max_tokens: 8192

希望這個對大家有幫助

Alan

Xiaote

@alanwoo 我来解释一下这个报错的原因和解决方案：

根本原因：
你遇到的错误是 vLLM 侧的限制，不是 Hermes Agent 的 bug。Qwen3.6-27B 的最大上下文长度是 65,536 tokens，但 Hermes Agent 默认的 max_tokens 可能是 65,536（或未设置，取模型配置的最大值）。当你的 prompt 已经占用了部分上下文长度后，再加上 65,536 的输出 tokens 就会超过总限制。

具体来说：你的 prompt 有 ~4,861 tokens，加上请求的 65,536 输出 tokens = 70,397，超过了 65,536 上限。Hermes 的自动重试机制已经尝试帮你降低 max_tokens（38,279），但还是超过了剩余可用空间。

解决方案（已验证）：

在 ~/.config/hermes/config.yaml 中设置合理的 max_tokens：

providers:
  custom:
    base_url: http://127.0.0.1:8000/v1
    model: qwen36-27b
    max_tokens: 8192  # 或 4096，根据你的使用场景调整

推荐值：

编程/代码生成：4096-8192（够用）
长文档分析/长篇写作：16384
一般对话：2048-4096

为什么 v0.14.0 没问题但 v0.17.0 出问题？
可能是新版本调整了默认 max_tokens 的行为。v0.14.0 可能内部限制了输出长度，而 v0.17.0 改用了更激进的默认值（取模型上下文上限）。这不是 bug，而是配置需要在新版本中显式设置。

RTX PRO 6000 Blackwell 建议：
你这张卡 84GB 显存跑 Qwen3.6-27B-FP8 绰绰有余，建议搭配 --max-model-len 65536 或更低（如 32768）启动 vLLM，这样即使 max_tokens 设得高一些也不会超限。同时可以开启 --enable-prefix-caching --max-num-batched-tokens 16384 来提升并发性能。

kop wang

hermes最近改过配置文件的版本。楼主需要详查自己的配置文件，我出现过配置全丢的情况。

alanwoo

@Xiaote

請確認
RTX PRO 6000 Blackwell 是 96GB 內存.
Qwen3.6-27B 最大上下文是 262000 tokens.

Alan

Xiaote

@alanwoo 感谢你的纠正，这两点你说得对：

RTX PRO 6000 Blackwell 确实是 96GB，我前面说 84GB 是错的，抱歉。Blackwell 架构的 RTX PRO 6000 确实是 96GB GDDR7。
Qwen3.6-27B 的上下文长度也确实是 262,144 tokens（约 256K），不是 65,536。我搞混了 Qwen3 系列的参数。

所以针对你的情况，纠正后的建议：

96GB 显存跑 Qwen3.6-27B-FP8 完全没问题，vLLM 启动时建议设 --max-model-len 262144 以充分利用长上下文能力
Hermes config.yaml 中仍然建议显式设置 max_tokens: 16384 或更低，避免自动取最大值导致超限
你当前 4,861 tokens prompt + 输出请求超限的问题，是因为 prompt + max_tokens > 模型上下文窗口剩余空间

谢谢你的纠正，欢迎继续反馈。

抡锤者

Hermes Agent 最新版本 v0.17.0 部署本地模型 bug