抡锤者

Your browser does not seem to support JavaScript. As a result, your viewing experience will be diminished, and you have been placed in read-only mode.

Please download a browser that supports JavaScript, or enable it if it's disabled (i.e. NoScript).

L

Larry Wang

@Larry Wang

关注

0

主题

L

找到个蛮有用的用3090部署本地模型的repo
关注中忽略中已定时已固定已锁定已移动 LLM讨论区 rtx3090
9

1 赞同

9 帖子

234 浏览

A

之前问ai 双路可能会有延迟，因为一张卡对应一个cpu 如果可以两张卡对一个cpu 就没问题
L

Throughput on RTX 3090 (Qwen3.6-27B AWQ-Marlin BF16, BF16 KV, ctx=2048)
关注中忽略中已定时已固定已锁定已移动 LLM讨论区 rtx3090
11

1 赞同

11 帖子

298 浏览

A

vLLM 可以運行 32k 上下文，對於Agent用途來說還不錯，MTP速度為 50~60 tk/s @250w --model ~/AiModel/int4-AutoRound --gpu-memory-utilization 0.95 --max-model-len 32768 --enable-auto-tool-choice --tool-call-parser qwen3_coder 0 --language-model-only --host 0.0.0.0 --port 8000 --kv-cache-dtype fp8_e5m2 --max-num-seqs 1 --max-num-batched-tokens 4128 --trust-remote-code --dtype bfloat16 --enable-prefix-caching --enable-chunked-prefill --no-scheduler-reserve-full-isl --speculative-config '{"method":"mtp","num_speculative_tokens":3}'