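Put a local proxy between Hermes and OMLX that counts conversation turns automatically (the script counts each user or assistant message as one turn). At turn 15 it calls the same 35B model in the background to produce a silent summary, folds the first 14 turns into a single system-memory message, and forwards only that summary plus the latest message, so the context behind the KV cache drops from 15 turns to 2. Hermes does not notice anything.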
1. Save the proxy script
cat > ~/omlx_compress_proxy.py << 'EOF'
#!/usr/bin/env python3
"""Local proxy that compresses long chat histories before they reach OMLX."""
import json, time, copy
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

OMLX_BASE = "http://127.0.0.1:8201"  # new OMLX address
PROXY_PORT = 8200                    # takes over the port Hermes already uses
COMPRESS_EVERY = 15                  # threshold, counted in user + assistant messages
KEEP_RECENT = 1                      # most recent non-system messages kept verbatim
SUMMARY_MAX_TOKENS = 400
SUMMARY_TEMP = 0.1
SUMMARY_SYSTEM = (
    "You are a background context compressor. Summarize the following conversation "
    "in Chinese, as tersely as possible, keeping every key decision, code snippet, "
    "data point, and unfinished task. Output only the summary, with no explanation, "
    "greeting, or formatting markers."
)

class Store:
    """In-memory record of the last message list forwarded for each session."""
    def __init__(self):
        self.sessions = {}
    def get(self, body):
        key = body.get("conversation_id") or body.get("session_id") or body.get("user") or "default"
        return self.sessions.get(key, []), key
    def set(self, key, messages):
        self.sessions[key] = messages
    def count(self, messages):
        return sum(1 for m in messages if m.get("role") in ("user", "assistant"))

store = Store()

def call_omlx(messages, max_tokens=None, temperature=None):
    payload = {"model": "Qwen3.5-35b-a3b-4bit-mlx", "messages": messages, "stream": False}
    if max_tokens: payload["max_tokens"] = max_tokens
    if temperature is not None: payload["temperature"] = temperature
    req = Request(f"{OMLX_BASE}/v1/chat/completions",
                  data=json.dumps(payload).encode(), headers={"Content-Type": "application/json"}, method="POST")
    with urlopen(req, timeout=300) as r:
        return json.loads(r.read().decode())

def as_text(content):
    # content may be None (tool calls) or a list of parts; flatten it to a string
    if content is None: return ""
    return content if isinstance(content, str) else json.dumps(content, ensure_ascii=False)

def compress(messages):
    sys_msgs = [m for m in messages if m.get("role") == "system"]
    non_sys = [m for m in messages if m.get("role") != "system"]
    if len(non_sys) <= KEEP_RECENT: return messages
    to_compress = non_sys[:-KEEP_RECENT]
    keep = non_sys[-KEEP_RECENT:]
    # Always lead with the summarizer instruction. The original system prompts are
    # re-attached to the compressed history below, not sent with the summary request.
    prompt = [{"role": "system", "content": SUMMARY_SYSTEM}]
    prompt.append({"role": "user", "content": "\n\n".join(
        f"[{m.get('role')}] {as_text(m.get('content'))[:2000]}" for m in to_compress)})
    summary = call_omlx(prompt, max_tokens=SUMMARY_MAX_TOKENS, temperature=SUMMARY_TEMP)["choices"][0]["message"]["content"].strip()
    new_msgs = list(sys_msgs)
    new_msgs.append({"role": "system", "content": f"[History summary] {summary}"})
    new_msgs.extend(keep)
    return new_msgs

class H(BaseHTTPRequestHandler):
    def log_message(self, *args): pass

    def _read_body(self):
        n = int(self.headers.get("Content-Length", 0) or 0)
        return self.rfile.read(n) if n else b""

    def _relay(self, status, headers, data):
        # urlopen has already decoded any chunked body, so drop the upstream framing
        # headers and send an explicit Content-Length instead.
        self.send_response(status)
        for k, v in headers.items():
            if k.lower() not in ("transfer-encoding", "content-length", "connection"):
                self.send_header(k, v)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def _forward(self, req, timeout):
        # Relay the upstream reply; return its body on success, None on an upstream error
        try:
            with urlopen(req, timeout=timeout) as r:
                data = r.read()
                self._relay(r.status, r.headers, data)
                return data
        except HTTPError as e:
            self._relay(e.code, e.headers, e.read())
        except URLError as e:
            err = json.dumps({"error": f"upstream unreachable: {e.reason}"}).encode()
            self._relay(502, {"Content-Type": "application/json"}, err)
        return None

    def do_POST(self):
        if self.path != "/v1/chat/completions": return self._proxy()
        body = json.loads(self._read_body().decode())
        _, key = store.get(body)
        # The client's message list is authoritative; the store only records what was forwarded.
        history = body.get("messages", [])
        # NOTE: a client that resends its full history on every request crosses this
        # threshold again on each later request, so the summary is regenerated each time.
        if store.count(history) >= COMPRESS_EVERY:
            t0 = time.time()
            try:
                history = compress(history)
                print(f"[Compress] {key}: done in {time.time()-t0:.1f}s")
            except Exception as e:
                # Summarization failed: forward the uncompressed history instead
                print(f"[Compress] {key}: failed ({e}), forwarding uncompressed")
        store.set(key, history)
        omlx_body = copy.deepcopy(body)
        omlx_body["messages"] = history
        for k in ["conversation_id", "session_id", "user"]: omlx_body.pop(k, None)
        req = Request(f"{OMLX_BASE}/v1/chat/completions",
                      data=json.dumps(omlx_body).encode(), headers={"Content-Type": "application/json"}, method="POST")
        data = self._forward(req, timeout=600)
        if data is None: return
        try:
            msg = json.loads(data.decode())["choices"][0]["message"]
            history.append(msg); store.set(key, history)
        except (ValueError, KeyError, IndexError, TypeError):
            pass  # streamed or non-JSON reply: nothing to record

    def _proxy(self):
        # Pass every other path through unchanged, keeping the client's HTTP method
        body = self._read_body()
        req = Request(f"{OMLX_BASE}{self.path}", data=body or None,
                      headers={k: v for k, v in self.headers.items()}, method=self.command)
        self._forward(req, timeout=600)

    do_GET = _proxy  # without this, GET requests such as /v1/models get a 501

if __name__ == "__main__":
    print(f"Proxy http://127.0.0.1:{PROXY_PORT}/v1 -> {OMLX_BASE}")
    print(f"Compress every {COMPRESS_EVERY} turns")
    HTTPServer(("127.0.0.1", PROXY_PORT), H).serve_forever()
EOF
chmod +x ~/omlx_compress_proxy.py
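The compression step can be tried offline before any server is involved. The sketch below mirrors the split the script performs (system messages kept, older messages folded into one summary message, the most recent message kept verbatim), but swaps the model call for a stub. `fake_summarize`, `count_turns`, and the `[History summary]` marker text are illustrative names chosen for this demo, not part of OMLX or Hermes.

```python
# Offline demo of the history-compression step, with a stub instead of the model call.
KEEP_RECENT = 1  # same default as the script


def fake_summarize(messages):
    # Hypothetical stand-in for the OMLX call: only reports how much was folded away
    return f"{len(messages)} earlier messages summarized"


def compress(messages, summarize=fake_summarize, keep_recent=KEEP_RECENT):
    sys_msgs = [m for m in messages if m.get("role") == "system"]
    non_sys = [m for m in messages if m.get("role") != "system"]
    if len(non_sys) <= keep_recent:
        return messages
    summary = summarize(non_sys[:-keep_recent])
    memory = {"role": "system", "content": f"[History summary] {summary}"}
    return sys_msgs + [memory] + non_sys[-keep_recent:]


def count_turns(messages):
    # Same rule as the script: every user or assistant message counts as one turn
    return sum(1 for m in messages if m.get("role") in ("user", "assistant"))


# Build a 15-turn conversation: user and assistant alternate, ending on a user message
history = [{"role": "system", "content": "You are helpful."}]
for i in range(15):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i + 1}"})

compressed = compress(history)
print(len(history), "messages ->", len(compressed), "messages")
for m in compressed:
    print(m["role"], "|", m["content"])
```

The compressed list holds the original system prompt, one summary message, and the latest user message, which is the shape the proxy forwards to OMLX after a compression.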
2. Move the ports (the key step)
Assuming Hermes has been connecting to 127.0.0.1:8200:
# Stop OMLX first
pkill -f "omlx serve"
# Rebind OMLX to 8201 (freeing 8200 for the proxy)
omlx serve --port 8201 --model Qwen3.5-35b-a3b-4bit-mlx
3. Start the proxy (on the original port, 8200)
In a second terminal:
python3 ~/omlx_compress_proxy.py
The log should show:
Proxy http://127.0.0.1:8200/v1 -> http://127.0.0.1:8201
Compress every 15 turns
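4. Hermes configuration (leave unchanged)
In the Hermes SSH terminal interface, do not change any port; keep the original 127.0.0.1:8200. The proxy now owns 8200, so requests from Hermes reach the proxy first, and the proxy forwards them to OMLX on 8201.
If Hermes was configured with a different port (11434 or 5001, for example), change PROXY_PORT and OMLX_BASE in the script above to match.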
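Before chatting, it can help to confirm that both ports are actually occupied. The following is a minimal sketch using only the standard library; it assumes the default ports from the steps above (8200 for the proxy, 8201 for OMLX), and `port_is_open` is a helper name invented for this example. It only checks that something accepts TCP connections, not that the listener is the expected program.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports assumed from the steps above; adjust if PROXY_PORT or OMLX_BASE differ
    for name, port in (("proxy", 8200), ("OMLX", 8201)):
        state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on 127.0.0.1:{port}: {state}")
```

If the proxy line reports "not reachable", Hermes will fail to connect even though OMLX itself is running.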
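5. Verify that compression triggers
Chat normally until turn 15; the proxy terminal then prints a line like:
[Compress] default: done in 1.8s
At that point the context held in the KV cache drops from 15 turns to 2 and the memory pressure eases. Nothing unusual shows up in the Hermes front end.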
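Reaching the threshold by hand takes a while, so a small driver script can do it instead. This is a sketch under several assumptions: the proxy listens at the default address from the steps above, the endpoint returns the OpenAI-style `choices[0].message` shape the proxy script itself relies on, and the client resends the full history on every request. `chat_once` and `run_turns` are names made up for this example. Because the script's `count` tallies user and assistant messages, the 8th user message is the request that carries 15 of them.

```python
import json
from urllib.request import Request, urlopen

PROXY_URL = "http://127.0.0.1:8200/v1/chat/completions"  # proxy address from the steps above
MODEL = "Qwen3.5-35b-a3b-4bit-mlx"                       # model name used in the script


def chat_once(messages, url=PROXY_URL, model=MODEL, timeout=600):
    """POST one non-streaming chat request and return the assistant message dict."""
    payload = {"model": model, "messages": messages, "stream": False, "max_tokens": 32}
    req = Request(url, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"}, method="POST")
    with urlopen(req, timeout=timeout) as r:
        return json.loads(r.read().decode())["choices"][0]["message"]


def run_turns(n_user_messages, send=chat_once):
    """Send n_user_messages user messages, resending the full history each time."""
    history = []
    for i in range(n_user_messages):
        history.append({"role": "user", "content": f"Test message {i + 1}: reply with one word."})
        reply = send(history)
        history.append({"role": reply.get("role", "assistant"),
                        "content": reply.get("content") or ""})
    return history
```

With the proxy and OMLX both running, calling `run_turns(8)` should produce a `[Compress]` line in the proxy terminal on the final request. The `send` parameter exists so the loop can be exercised with a stub when no server is available.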
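Key reminders
| Item | Notes |
| --- | --- |
| Latency at turn 15 | The 35B model needs 1-3 seconds to summarize the first 14 turns. This is an inherent limit (OMLX is single-threaded). Set COMPRESS_EVERY to 12-15; do not make it too small. |
| Summary quality | SUMMARY_TEMP is pinned at 0.1 to keep the model from drifting while it summarizes. |
| Memory effect | After compression the KV cache shrinks from about 15 turns to about 2. At roughly 300 MB per turn for the 35B model, that frees 3-4 GB at once. |
| Failure fallback | If the proxy dies, Hermes cannot connect at all. OMLX itself is unaffected; restarting the proxy is enough. |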
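The memory figure in the table is simple arithmetic and can be rechecked for other settings. The sketch below takes the table's 300 MB per turn at face value; that number is the post's own estimate for the 35B model, not a measured constant, and `kv_cache_freed_gb` is a helper name invented here.

```python
def kv_cache_freed_gb(turns_before=15, turns_after=2, mb_per_turn=300):
    """Estimate memory released when cached turns drop from turns_before to turns_after."""
    freed_mb = (turns_before - turns_after) * mb_per_turn
    return freed_mb / 1000  # decimal GB


# Default settings: (15 - 2) * 300 MB = 3900 MB
print(kv_cache_freed_gb())                 # 3.9
# With COMPRESS_EVERY lowered to 12: (12 - 2) * 300 MB = 3000 MB
print(kv_cache_freed_gb(turns_before=12))  # 3.0
```

The default case lands inside the 3-4 GB range quoted in the table.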
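Deploy in this order: change the OMLX port first → start the proxy → keep using Hermes with its original configuration.
On 16 GB and 24 GB machines, test with a smaller compression interval. The test machine was a 32 GB M4.
Do not use this as-is.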