跳转至内容
  • 版块
  • 最新
  • 标签
  • 热门
  • 用户
  • 群组
皮肤
  • 浅色
  • Brite
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • 深色
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • 默认(不使用皮肤)
  • 不使用皮肤
折叠
品牌标识

抡锤者

  1. 主页
  2. LLM讨论区
  3. 5090 + vLLM + Qwen3.6-27B 成功分享

5090 + vLLM + Qwen3.6-27B 成功分享

已定时 已固定 已锁定 已移动 LLM讨论区
6 帖子 4 发布者 188 浏览
  • 从旧到新
  • 从新到旧
  • 最多赞同
回复
  • 在新帖中回复
登录后回复
此主题已被删除。只有拥有主题管理权限的用户可以查看。
  • remR 离线
    remR 离线
    rem
    编写于 最后由 编辑
    #1

    硬件配置

    • 处理:Ryzen Threadripper 3970X
    • 显卡:RTX5090 32GB
    • 内存:DDR4-3200 32GB x 8

    软件配置

    • 系统:Ubuntu 24.04 LTS
    • 推理:vLLM v0.21.0(Docker)
    • 驱动:v595.71.05|CUDA v13.2
    • 模型一:自制/Qwen3.6-27B-Heretic-W4G128(Refusal:6/100)
    • 模型二:Lorbus/Qwen3.6-27B-int4-AutoRound

    启动指令及参数

    docker run --gpus all \
      --ipc=host \
      -v /fast_pool/models:/models \
      -p 8000:8000 \
      -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512 \
      vllm/vllm-openai:latest \
      --model /models/Qwen3.6-27B-Heretic-w4g128 \
      --quantization auto_round \
      --dtype bfloat16 \
      --max-model-len 196608 \
      --kv-cache-dtype fp8 \
      --gpu-memory-utilization 0.9 \
      --max-num-seqs 1 \
      --enable-prefix-caching \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --default-chat-template-kwargs '{"enable_thinking": false}' \
      --safetensors-load-strategy prefetch \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 8000
    

    Benchmark测试

    ============================================================
      Qwen3.6-27B Heretic W4G128  —  Benchmark
      3 run  ×  1024 output tokens
    ============================================================
    Run 1/3... ✓  TTFT=281ms  TPS=43.5  (522 token)
    Run 2/3... ✓  TTFT=69ms  TPS=43.4  (522 token)
    Run 3/3... ✓  TTFT=69ms  TPS=43.4  (522 token)
    
     Run   TTFT (ms)   TPOT (ms)       TPS    Token   Total (s)
    ────  ──────────  ──────────  ────────  ───────  ──────────
       1      280.78       23.00     43.47      522       12.29
       2       68.85       23.02     43.45      522       12.08
       3       68.82       23.02     43.43      522       12.09
     AVG      139.48       23.01     43.45        ─           ─
    

    平均43.5t/s,比之前我在log看到的86t/s低很多。原因可能是benchmark.py只生成了522token,没到1023,而且这是纯文字短回答,prefix cache没热身。注意TTFT第一次281ms,第二次69ms,prefix cache命中后快了4倍~TPOT 23.01ms,每个token约23ms。

    Screenshot 2026-05-24 at 14.42.52.png

    Screenshot 2026-05-24 at 15.33.09.png

    Screenshot 2026-05-24 at 15.35.51.png

    WARNING 05-24 08:23:58 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
    (APIServer pid=1) INFO 05-24 08:23:58 [utils.py:306] 
    (APIServer pid=1) INFO 05-24 08:23:58 [utils.py:306]        █     █     █▄   ▄█
    (APIServer pid=1) INFO 05-24 08:23:58 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.21.0
    (APIServer pid=1) INFO 05-24 08:23:58 [utils.py:306]   █▄█▀ █     █     █     █  model   /models/Qwen3.6-27B-Heretic-w4g128
    (APIServer pid=1) INFO 05-24 08:23:58 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
    (APIServer pid=1) INFO 05-24 08:23:58 [utils.py:306] 
    (APIServer pid=1) INFO 05-24 08:23:58 [utils.py:240] non-default args: {'model_tag': '/models/Qwen3.6-27B-Heretic-w4g128', 'default_chat_template_kwargs': {'enable_thinking': False}, 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/models/Qwen3.6-27B-Heretic-w4g128', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 196608, 'quantization': 'auto_round', 'safetensors_load_strategy': 'prefetch', 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.9, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_seqs': 1}
    (APIServer pid=1) WARNING 05-24 08:23:58 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
    (APIServer pid=1) WARNING 05-24 08:23:58 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
    (APIServer pid=1) WARNING 05-24 08:23:58 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_BUILD_URL
    (APIServer pid=1) WARNING 05-24 08:23:58 [envs.py:1866] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
    (APIServer pid=1) INFO 05-24 08:24:09 [model.py:568] Resolved architecture: Qwen3_5ForConditionalGeneration
    (APIServer pid=1) INFO 05-24 08:24:09 [model.py:1697] Using max model len 196608
    (APIServer pid=1) INFO 05-24 08:24:10 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
    (APIServer pid=1) WARNING 05-24 08:24:10 [config.py:367] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
    (APIServer pid=1) INFO 05-24 08:24:10 [config.py:387] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
    (APIServer pid=1) INFO 05-24 08:24:10 [vllm.py:886] Asynchronous scheduling is enabled.
    (APIServer pid=1) INFO 05-24 08:24:10 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
    (APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
    (APIServer pid=1) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
    (EngineCore pid=392) INFO 05-24 08:24:26 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model='/models/Qwen3.6-27B-Heretic-w4g128', speculative_config=None, tokenizer='/models/Qwen3.6-27B-Heretic-w4g128', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=196608, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=inc, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/Qwen3.6-27B-Heretic-w4g128, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 2, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto')
    (EngineCore pid=392) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
    (EngineCore pid=392) INFO 05-24 08:24:29 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.3:51473 backend=nccl
    (EngineCore pid=392) INFO 05-24 08:24:29 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
    (EngineCore pid=392) INFO 05-24 08:24:29 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
    (EngineCore pid=392) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
    (EngineCore pid=392) INFO 05-24 08:24:36 [gpu_model_runner.py:4857] Starting to load model /models/Qwen3.6-27B-Heretic-w4g128...
    (EngineCore pid=392) INFO 05-24 08:24:36 [cuda.py:427] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
    (EngineCore pid=392) INFO 05-24 08:24:36 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
    (EngineCore pid=392) INFO 05-24 08:24:36 [gptq_marlin.py:387] Using MarlinLinearKernel for GPTQMarlinLinearMethod
    (EngineCore pid=392) INFO 05-24 08:24:36 [gdn_linear_attn.py:169] Using Triton/FLA GDN prefill kernel
    (EngineCore pid=392) INFO 05-24 08:24:36 [cuda.py:372] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:938] Filesystem type for checkpoints: ZFS. Checkpoint size: 17.41 GiB. Available RAM: 192.24 GiB.
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:900] Prefetching checkpoint files into page cache started (in background, num_threads=8, block_size=16777216 bytes)
    Loading safetensors checkpoint shards:   0% Completed | 0/10 [00:00<?, ?it/s]
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 10% (1/10)
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 20% (2/10)
    Loading safetensors checkpoint shards:  10% Completed | 1/10 [00:00<00:05,  1.74it/s]
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 30% (3/10)
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 40% (4/10)
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 50% (5/10)
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 60% (6/10)
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 70% (7/10)
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 80% (8/10)
    (EngineCore pid=392) INFO 05-24 08:24:37 [weight_utils.py:872] Prefetching checkpoint files: 90% (9/10)
    (EngineCore pid=392) INFO 05-24 08:24:38 [weight_utils.py:872] Prefetching checkpoint files: 100% (10/10)
    (EngineCore pid=392) INFO 05-24 08:24:38 [weight_utils.py:895] Prefetching checkpoint files into page cache finished in 0.97s
    Loading safetensors checkpoint shards:  20% Completed | 2/10 [00:01<00:04,  1.99it/s]
    Loading safetensors checkpoint shards:  30% Completed | 3/10 [00:01<00:03,  2.24it/s]
    Loading safetensors checkpoint shards:  40% Completed | 4/10 [00:01<00:02,  2.39it/s]
    Loading safetensors checkpoint shards:  50% Completed | 5/10 [00:02<00:02,  2.49it/s]
    Loading safetensors checkpoint shards:  60% Completed | 6/10 [00:02<00:01,  2.55it/s]
    Loading safetensors checkpoint shards:  70% Completed | 7/10 [00:02<00:00,  3.04it/s]
    Loading safetensors checkpoint shards:  80% Completed | 8/10 [00:03<00:00,  3.09it/s]
    Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:03<00:00,  3.88it/s]
    Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:03<00:00,  2.93it/s]
    (EngineCore pid=392) 
    (EngineCore pid=392) INFO 05-24 08:24:40 [default_loader.py:397] Loading weights took 3.50 seconds
    (EngineCore pid=392) INFO 05-24 08:24:41 [gpu_model_runner.py:4959] Model loading took 17.45 GiB memory and 4.497936 seconds
    (EngineCore pid=392) INFO 05-24 08:24:41 [interface.py:645] Setting attention block size to 1568 tokens to ensure that attention page size is >= mamba page size.
    (EngineCore pid=392) INFO 05-24 08:24:41 [interface.py:669] Padding mamba page size by 0.13% to ensure that mamba page size and attention page size are exactly equal.
    (EngineCore pid=392) INFO 05-24 08:24:41 [gpu_model_runner.py:5920] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
    (EngineCore pid=392) INFO 05-24 08:25:04 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/537a496283/rank_0_0/backbone for vLLM's torch.compile
    (EngineCore pid=392) INFO 05-24 08:25:04 [backends.py:1148] Dynamo bytecode transform time: 11.73 s
    (EngineCore pid=392) INFO 05-24 08:25:07 [backends.py:378] Cache the graph of compile range (1, 2048) for later use
    (EngineCore pid=392) INFO 05-24 08:25:34 [backends.py:393] Compiling a graph for compile range (1, 2048) takes 28.64 s
    (EngineCore pid=392) INFO 05-24 08:25:41 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/39e231811ba95356ca3efb314dd32a14ee616beee18711c360c6ceba7a22119d/rank_0_0/model
    (EngineCore pid=392) INFO 05-24 08:25:41 [monitor.py:53] torch.compile took 48.44 s in total
    (EngineCore pid=392) INFO 05-24 08:26:58 [monitor.py:81] Initial profiling/warmup run took 76.56 s
    (EngineCore pid=392) INFO 05-24 08:26:58 [gpu_model_runner.py:6063] Profiling CUDA graph memory: PIECEWISE=2 (largest=2), FULL=1 (largest=1)
    (EngineCore pid=392) INFO 05-24 08:27:01 [gpu_model_runner.py:6142] Estimated CUDA graph memory: 0.40 GiB total
    (EngineCore pid=392) INFO 05-24 08:27:01 [gpu_worker.py:462] Available KV cache memory: 8.27 GiB
    (EngineCore pid=392) INFO 05-24 08:27:01 [gpu_worker.py:477] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9000 is equivalent to --gpu-memory-utilization=0.8872 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9128. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
    (EngineCore pid=392) INFO 05-24 08:27:01 [kv_cache_utils.py:1710] GPU KV cache size: 256,186 tokens
    (EngineCore pid=392) INFO 05-24 08:27:01 [kv_cache_utils.py:1711] Maximum concurrency for 196,608 tokens per request: 1.30x
    (EngineCore pid=392) INFO 05-24 08:27:01 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
    Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 2/2 [00:00<00:00, 20.00it/s]
    Capturing CUDA graphs (decode, FULL): 100%|██████████| 1/1 [00:00<00:00,  5.15it/s]
    (EngineCore pid=392) INFO 05-24 08:27:02 [gpu_model_runner.py:6243] Graph capturing finished in 1 secs, took 0.43 GiB
    (EngineCore pid=392) INFO 05-24 08:27:02 [gpu_worker.py:621] CUDA graph pool memory: 0.43 GiB (actual), 0.4 GiB (estimated), difference: 0.03 GiB (7.2%).
    (EngineCore pid=392) INFO 05-24 08:27:02 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
    (EngineCore pid=392) INFO 05-24 08:27:02 [core.py:299] init engine (profile, create kv cache, warmup model) took 141.51 s (compilation: 48.44 s)
    (EngineCore pid=392) INFO 05-24 08:27:03 [vllm.py:886] Asynchronous scheduling is enabled.
    (EngineCore pid=392) INFO 05-24 08:27:03 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
    (APIServer pid=1) INFO 05-24 08:27:03 [api_server.py:613] Supported tasks: ['generate']
    (APIServer pid=1) INFO 05-24 08:27:03 [parser_manager.py:202] "auto" tool choice has been enabled.
    (APIServer pid=1) WARNING 05-24 08:27:03 [model.py:1454] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
    (APIServer pid=1) INFO 05-24 08:27:04 [hf.py:483] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
    (APIServer pid=1) INFO 05-24 08:27:19 [base.py:224] Multi-modal warmup completed in 14.396s
    (APIServer pid=1) INFO 05-24 08:27:19 [base.py:224] Readonly multi-modal warmup completed in 0.556s
    (APIServer pid=1) INFO 05-24 08:27:19 [api_server.py:617] Starting vLLM server on http://0.0.0.0:8000
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:37] Available routes are:
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /docs, Methods: HEAD, GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /tokenize, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /detokenize, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /load, Methods: GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /version, Methods: GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /health, Methods: GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /metrics, Methods: GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/models, Methods: GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /ping, Methods: GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /ping, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /invocations, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/responses, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/completions, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/messages, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /generative_scoring, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
    (APIServer pid=1) INFO 05-24 08:27:19 [launcher.py:46] Route: /v1/completions/render, Methods: POST
    (APIServer pid=1) INFO:     Started server process [1]
    (APIServer pid=1) INFO:     Waiting for application startup.
    (APIServer pid=1) INFO:     Application startup complete.
    (EngineCore pid=392) WARNING 05-24 08:29:10 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _zero_kv_blocks_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
    (EngineCore pid=392) WARNING 05-24 08:29:10 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
    (EngineCore pid=392) WARNING 05-24 08:29:11 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _causal_conv1d_fwd_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
    (EngineCore pid=392) WARNING 05-24 08:29:11 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _fused_post_conv_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
    (APIServer pid=1) INFO 05-24 08:41:30 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO 05-24 08:41:40 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO:     172.17.0.1:45968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=1) INFO 05-24 08:42:00 [loggers.py:271] Engine 000: Avg prompt throughput: 3.5 tokens/s, Avg generation throughput: 56.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO:     172.17.0.1:58652 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=1) INFO 05-24 08:42:10 [loggers.py:271] Engine 000: Avg prompt throughput: 3.5 tokens/s, Avg generation throughput: 86.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO:     172.17.0.1:42426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=1) INFO 05-24 08:42:20 [loggers.py:271] Engine 000: Avg prompt throughput: 3.5 tokens/s, Avg generation throughput: 86.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO 05-24 08:42:30 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 83.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO 05-24 08:42:40 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO:     172.17.0.1:35304 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=1) INFO 05-24 08:44:10 [loggers.py:271] Engine 000: Avg prompt throughput: 3.5 tokens/s, Avg generation throughput: 67.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO:     172.17.0.1:57372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=1) INFO 05-24 08:44:20 [loggers.py:271] Engine 000: Avg prompt throughput: 3.5 tokens/s, Avg generation throughput: 86.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO:     172.17.0.1:37742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=1) INFO 05-24 08:44:30 [loggers.py:271] Engine 000: Avg prompt throughput: 3.5 tokens/s, Avg generation throughput: 86.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO 05-24 08:44:40 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    (APIServer pid=1) INFO 05-24 08:44:50 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.8%, MM cache hit rate: 50.0%
    

    不知道KV还能不能再往上冲,毕竟是多模态,不审查内容跟不审查图片识别都已经测试没问题了~请大神帮忙看看KV跟TPS还有什么可以调整的空间~感谢感谢~第一个帖子搞得不好还请多包涵~

    对了,模型二我就没有贴了,因为效能不如我自制的模型,我是看了几个老外的帖子说很厉害85-110TPS,结果试了之后才34,我这个也有上到85的时候,平均比Lorbus多~但我又看到另外一个说用NVFP4+MTP也很厉害,可能晚上我也来折腾一下,试试自制NVFP4 + MTP跑一下~以上汇报~

    1 条回复 最后回复
    2
    • remR 离线
      remR 离线
      rem
      编写于 最后由 编辑
      #2

      实在是不好意思~我想改昵称但是怎么改都没有用,后来发现是因为注册的时候犯懒点了狗哥注册,就死活改不了昵称了,无奈之下只好删除账号重新用邮箱注册,这才可以改昵称~赶紧把文补上~

      terryT 1 条回复 最后回复
      0
      • remR rem

        实在是不好意思~我想改昵称但是怎么改都没有用,后来发现是因为注册的时候犯懒点了狗哥注册,就死活改不了昵称了,无奈之下只好删除账号重新用邮箱注册,这才可以改昵称~赶紧把文补上~

        terryT 在线
        terryT 在线
        terry
        编写于 最后由 编辑
        #3

        @rem 可以改名字的大哥,就在资料里里改就行了。分享帖子不错。

        油管:https://www.youtube.com/@抡锤者

        九龙杨生九 1 条回复 最后回复
        0
        • terryT terry 固定了该主题
        • terryT terry

          @rem 可以改名字的大哥,就在资料里里改就行了。分享帖子不错。

          九龙杨生九 离线
          九龙杨生九 离线
          九龙杨生
          编写于 最后由 编辑
          #4

          @terry 尝试过了,改不了,改了之后论坛的显示名称还是没变化

          terryT 1 条回复 最后回复
          0
          • 九龙杨生九 九龙杨生

            @terry 尝试过了,改不了,改了之后论坛的显示名称还是没变化

            terryT 在线
            terryT 在线
            terry
            编写于 最后由 编辑
            #5

            @九龙杨生 改完论坛的帖子有缓存的,不是立刻更新的,它实际上该成好了,我测试过了。

            油管:https://www.youtube.com/@抡锤者

            1 条回复 最后回复
            0
            • Billy ShenB 离线
              Billy ShenB 离线
              Billy Shen
              编写于 最后由 编辑
              #6

              我用的是以下配置,跑的是Qwen3.6-35B-A3B-FP8

              Linux OS: Ubuntu 24.04 LTS
              · CPU:2 颗 Intel Xeon 8168(每颗 24 核 48 线程,基础频率 2.7GHz)
              · 主板:Intel Xeon 1代/2代 PIODRG 双路主板
              · 内存:三星 DDR4 4*32GB 2933MHz RECC
              · 系统盘:金士顿NVMe SSD,2TB容量
              · 存储盘:希捷企业级 8TB 硬盘,256MB 缓存,7200RPM SATA
              · 显卡:NVIDIA Quadro RTX 5880 Ada Generation,48GB 显存

              1 条回复 最后回复
              0
              • 系统 取消固定了该主题

              你好!看起来您对这段对话很感兴趣,但您还没有一个账号。

              厌倦了每次访问都刷到同样的帖子?您注册账号后,您每次返回时都能精准定位到您上次浏览的位置,并可选择接收新回复通知(通过邮件或推送通知)。您还能收藏书签、为帖子顶,向社区成员表达您的欣赏。

              有了你的建议,这篇帖子会更精彩哦 💗

              注册 登录
              回复
              • 在新帖中回复
              登录后回复
              • 从旧到新
              • 从新到旧
              • 最多赞同


              • 登录

              • 没有帐号? 注册

              • 登录或注册以进行搜索。
              • 第一个帖子
                最后一个帖子
              0
              • 版块
              • 最新
              • 标签
              • 热门
              • 用户
              • 群组