感謝!我再試看看!
F
Fishbubu
@Fishbubu
-
GX10主機vLLM部屬Qwen3.6請益 -
GX10主機vLLM部屬Qwen3.6請益各位大佬好,目前公司要我架設一個地端的 AI 協尋平台,但在效能調校上遇到了一些瓶頸,想請教版上是否有前輩處理過類似架構的經驗。
【專案需求】
一般使用者: 透過 Librechat 進行日常問答。
RD 工程師: 透過 API 串接模型,輔助 Code 編寫。
【目前遇到的問題】
目前我們部署了 Qwen3.5:9B 以及 Qwen3.6:27B 兩個模型,雖然已經嘗試開啟了 fp8 Cache,但實際測試下來,兩個模型的推論生成速度皆不盡理想,無法滿足 RD 開發時需要的流暢度。想請教各位前輩在不購買設備的情況下還有甚麼優化空間呢?
services: vllm-qwen-27b: image: vllm/vllm-openai:latest container_name: vllm-qwen-27b restart: unless-stopped ipc: host shm_size: "16gb" ports: - "8001:8000" volumes: - ~/.cache/huggingface:/root/.cache/huggingface environment: - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} command: > Qwen/Qwen3.6-27B --host 0.0.0.0 --port 8000 --served-model-name qwen3.6-27b --gpu-memory-utilization 0.70 --max-model-len 32768 --kv-cache-dtype fp8 --trust-remote-code --reasoning-parser qwen3 deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] vllm-qwen: image: vllm/vllm-openai:qwen3_5-cu130 container_name: vllm-qwen restart: unless-stopped ipc: host shm_size: "8gb" ports: - "8002:8000" volumes: - ~/.cache/huggingface:/root/.cache/huggingface environment: - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} command: > Qwen/Qwen3.5-9B --host 0.0.0.0 --port 8000 --served-model-name qwen3.5-9b --gpu-memory-utilization 0.25 --max-model-len 16384 --kv-cache-dtype fp8 --trust-remote-code --reasoning-parser qwen3 deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] # ============================================================ # LibreChat:主要聊天介面 # ============================================================ librechat: image: ghcr.io/danny-avila/librechat:latest container_name: librechat restart: unless-stopped ports: - "3000:3080" volumes: - ./librechat-config/librechat.yaml:/app/librechat.yaml - librechat-uploads:/app/client/public/images - librechat-logs:/app/api/logs environment: - MONGO_URI=mongodb://mongodb:27017/LibreChat - MEILI_HOST=http://meilisearch:7700 - MEILI_MASTER_KEY=${MEILI_MASTER_KEY} - JWT_SECRET=${JWT_SECRET} - JWT_REFRESH_SECRET=${JWT_REFRESH_SECRET} - ALLOW_REGISTRATION=true # 首次設定完可改 false - RAG_USE_FULL_CONTEXT=true # ← 關鍵:強制全文送入,不切塊 depends_on: - vllm-qwen-27b - vllm-qwen # LibreChat 需要的 MongoDB mongodb: image: mongo:6 container_name: librechat-mongo restart: unless-stopped volumes: - mongodb-data:/data/db # LibreChat 需要的 MeiliSearch(對話搜尋功能) meilisearch: image: getmeili/meilisearch:v1.12 container_name: librechat-meili restart: unless-stopped environment: - MEILI_MASTER_KEY=${MEILI_MASTER_KEY} - MEILI_NO_ANALYTICS=true volumes: - meilisearch-data:/meili_data volumes: librechat-uploads: librechat-logs: mongodb-data: meilisearch-data: