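<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Linux GPU test tools and scripts]]></title><description><![CDATA[<p dir="auto">The sample test environment is Ubuntu 24.04 with a modded 4090D 48G card, using GPU-RUN, stress-ng, and similar tools. You don't need to worry about the tools themselves: copy the scripts into an AI such as DeepSeek or Gemini and have it adapt them for you. The scripts cover a simple GPU stress test, a VRAM test, and a system memory test, all of which are useful for modded cards and second-hand memory. For a real burn-in I still recommend dedicated tools such as 3DMark or FurMark. That said, many AI servers only have Linux installed, and a heavy AI workload is itself about as intense as a burn-in, so in my opinion these scripts together with tools like nvidia-smi are about enough.</p>
<p dir="auto">Below is the VRAM test script. It checks for bad blocks in VRAM, which is very practical for cards without ECC memory:</p>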
<pre><code>mkdir -p ~/ex
cat &lt;&lt; 'EOF' &gt; ~/ex/vram_heavy_test.py
import torch
import sys

print("📦 Starting chunked VRAM scan for large-VRAM / modded NVIDIA cards...")
if not torch.cuda.is_available():
    print("❌ No CUDA GPU or driver found")
    sys.exit(1)

device = torch.device("cuda:0")

# Total VRAM reported by the driver
total_mem = torch.cuda.get_device_properties(0).total_memory
print(f"⚡ Total VRAM detected: {total_mem / (1024**3):.2f} GB")

# Reserve 2 GB of headroom for the driver and use the rest
usable_mem = total_mem - 2 * 1024 * 1024 * 1024

# Chunked test: allocate 8 GB per chunk
chunk_size_bytes = 8 * 1024 * 1024 * 1024
num_chunks = int(usable_mem // chunk_size_bytes)
element_per_chunk = int(chunk_size_bytes // 4)  # int32 takes 4 bytes

print(f"🔥 Chunked mode: {num_chunks} chunks of {chunk_size_bytes / (1024**3):.1f} GB each...")

try:
    holder = []

    # Pattern 1: write all zeros chunk by chunk and verify
    print("\n -&gt; [Step 1/4] Writing the all-0000 pattern in chunks and verifying cell discharge...")
    for i in range(num_chunks):
        chunk = torch.zeros(element_per_chunk, dtype=torch.int32, device=device)
        torch.cuda.synchronize()
        if not (chunk == 0).all().item():
            raise ValueError(f"Chunk {i+1}: all-zeros (discharge) check failed!")
        holder.append(chunk)
    print(f"   [OK] Held and verified {len(holder) * 8} GB of VRAM with all-0 writes.")

    # Release and clean up
    del holder; torch.cuda.empty_cache(); holder = []

    # Pattern 2: write all ones (every bit set) chunk by chunk and verify
    print("\n -&gt; [Step 2/4] Writing the all-1111 pattern in chunks and verifying cell charge...")
    for i in range(num_chunks):
        chunk = torch.full((element_per_chunk,), -1, dtype=torch.int32, device=device)
        torch.cuda.synchronize()
        if not (chunk == -1).all().item():
            raise ValueError(f"Chunk {i+1}: all-ones (charge) check failed!")
        holder.append(chunk)
    print(f"   [OK] Held and verified {len(holder) * 8} GB of VRAM with all-1 writes.")

    # Release and clean up
    del holder; torch.cuda.empty_cache(); holder = []

    # Patterns 3 &amp; 4: alternating-bit checkerboard write, then full read-back
    print("\n -&gt; [Step 3/4] Writing alternating bit patterns (0101 checkerboard)...")

    # Key fix: use the signed int32 values that carry these bit patterns
    # 0x55555555 -&gt; 1431655765
    # 0xAAAAAAAA -&gt; -1431655766
    p1 = 1431655765
    p2 = -1431655766

    for i in range(num_chunks):
        chunk = torch.empty(element_per_chunk, dtype=torch.int32, device=device)
        # Fill with in-range int32 scalars via fill_, which avoids the integer range check
        chunk[0::2].fill_(p1)
        chunk[1::2].fill_(p2)
        holder.append(chunk)
    torch.cuda.synchronize()
    print("   [OK] Checkerboard bit pattern written.")

    print("\n -&gt; [Step 4/4] Reading back and verifying all tested VRAM...")
    for i, chunk in enumerate(holder):
        if not (chunk[0::2] == p1).all().item():
            raise ValueError(f"Chunk {i+1}: check of even-indexed elements (0x55555555) failed!")
        if not (chunk[1::2] == p2).all().item():
            raise ValueError(f"Chunk {i+1}: check of odd-indexed elements (0xAAAAAAAA) failed!")

    print(f"\n🎉 [VRAM check passed] This {total_mem / (1024**3):.1f}GB large-VRAM card passed the 100% bit-by-bit read/write test!")

except Exception as e:
    print(f"\n❌ [VRAM check failed]: {e}")
    sys.exit(1)
EOF

# Run the test
python ~/ex/vram_heavy_test.py
</code></pre>
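<p dir="auto">The two checkerboard constants in the script are the signed int32 readings of the bit patterns 0x55555555 and 0xAAAAAAAA. As a minimal sketch (the helper name to_int32 is hypothetical and not part of the original script), the conversion can be checked in plain Python without a GPU:</p>

```python
def to_int32(pattern):
    """Reinterpret an unsigned 32-bit pattern as a signed int32 (two's complement)."""
    return int.from_bytes(pattern.to_bytes(4, "little"), "little", signed=True)


p1 = to_int32(0x55555555)  # bits 0101...0101
p2 = to_int32(0xAAAAAAAA)  # bits 1010...1010

# The two patterns are bitwise complements, so XOR sets every bit (int32 value -1).
print(p1, p2, p1 ^ p2)
```

<p dir="auto">Because the two values are complements, every bit position is written as 0 by one pattern and as 1 by the other.</p>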
<p dir="auto">GPU-RUN stress test:</p>
<pre><code># 1. Activate the virtual environment (make sure you are inside the venv)
source /mnt/nvidia/venv_4090/bin/activate

# 2. Write a brute-force hardware stress test script in one command
mkdir -p ~/ex
cat &lt;&lt; 'EOF' &gt; ~/ex/gpu_hardware_test.py
import torch
import time
import sys

print(f"📦 Detected hardware device: {torch.cuda.get_device_name(0)}")
print(f"⚡ Current CUDA version: {torch.version.cuda}")

# Create two 20000x20000 bf16 matrices (about 0.8 GB each) to keep the compute cores saturated
device = torch.device("cuda:0")
try:
    print("🚀 Loading the large matrices into VRAM...")
    x = torch.randn(20000, 20000, dtype=torch.bfloat16, device=device)
    y = torch.randn(20000, 20000, dtype=torch.bfloat16, device=device)

    print("🔥 Matrices ready, starting 1000 rounds of full-load compute (watch the system state)...")
    start_time = time.time()

    for i in range(1, 1001):
        # Back-to-back matrix multiplications
        z = torch.matmul(x, y)

        # Force a synchronization every 100 rounds (meant to provoke the earlier Xid 158 fault)
        if i % 100 == 0:
            torch.cuda.synchronize()
            print(f"   -&gt; [Progress] Completed {i}/1000 rounds...")

    torch.cuda.synchronize()
    print(f"🎉 [Test passed] The GPU hardware survived the full load! Elapsed: {time.time() - start_time:.2f} s!")

except Exception as e:
    print(f"\n❌ Exception caught at the software level: {e}")
    sys.exit(1)
EOF

# 3. Run it
python ~/ex/gpu_hardware_test.py
</code></pre>
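<p dir="auto">The script only prints elapsed time. As a rough sketch, that time can be turned into a throughput estimate by counting 2·n³ floating-point operations per n×n matrix multiplication. The helper name matmul_tflops is hypothetical, and the 104.54 s figure is the run time one reply in this thread reports for an RTX 4090. The result includes launch and synchronization overhead, so treat it as an estimate rather than a benchmark score:</p>

```python
def matmul_tflops(n, rounds, seconds):
    """Approximate throughput of `rounds` n x n matmuls, counting 2*n^3 FLOPs each."""
    total_flops = 2 * n ** 3 * rounds
    return total_flops / seconds / 1e12


# 1000 rounds of 20000x20000 matmul in 104.54 s (figure reported in the thread)
print(f"{matmul_tflops(20000, 1000, 104.54):.1f} TFLOPS")
```

<p dir="auto">A much lower figure on the same card model may point to throttling, which nvidia-smi can help confirm while the test runs.</p>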
<p dir="auto">System memory check:</p>
<pre><code>sudo apt-get install -y stress-ng
# Use 8 workers to fill 90% of total system memory with continuous reads and writes for 2 minutes
stress-ng --vm 8 --vm-bytes 90% --timeout 120s --metrics-brief
</code></pre>
]]></description><link>https://lcz.me/topic/450/linux下显卡测试工具和脚本分享</link><generator>RSS for Node</generator><lastBuildDate>Wed, 01 Jul 2026 08:30:37 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/450.rss" rel="self" type="application/rss+xml"/><pubDate>Sat, 06 Jun 2026 12:06:08 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Mon, 15 Jun 2026 14:43:27 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://upload.lcz.me/uploads/1b95d128-fe87-493e-87f4-e26fcd4dae63.zip" rel="nofollow ugc">Intel_Arc_Validation_Suite_Enterprise_v2.0.zip</a></p>
<p dir="auto">Version 2.0 is out.</p>
]]></description><link>https://lcz.me/post/6969</link><guid isPermaLink="true">https://lcz.me/post/6969</guid><dc:creator><![CDATA[sirwang]]></dc:creator><pubDate>Mon, 15 Jun 2026 14:43:27 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Sat, 13 Jun 2026 08:57:53 GMT]]></title><description><![CDATA[<p dir="auto"><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f4e6.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--package" style="height:23px;width:auto;vertical-align:middle" title="📦" alt="📦" /> Starting chunked VRAM scan for large-VRAM / modded NVIDIA cards...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/26a1.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--zap" style="height:23px;width:auto;vertical-align:middle" title="⚡" alt="⚡" /> Total VRAM detected: 31.35 GB<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f525.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--fire" style="height:23px;width:auto;vertical-align:middle" title="🔥" alt="🔥" /> Chunked mode: 3 chunks of 8.0 GB each...</p>
<pre><code> -&gt; [Step 1/4] Writing the all-0000 pattern in chunks and verifying cell discharge...
   [OK] Held and verified 24 GB of VRAM with all-0 writes.

 -&gt; [Step 2/4] Writing the all-1111 pattern in chunks and verifying cell charge...
   [OK] Held and verified 24 GB of VRAM with all-1 writes.

 -&gt; [Step 3/4] Writing alternating bit patterns (0101 checkerboard)...
   [OK] Checkerboard bit pattern written.

 -&gt; [Step 4/4] Reading back and verifying all tested VRAM...

🎉 [VRAM check passed] This 31.4GB large-VRAM card passed the 100% bit-by-bit read/write test!


Summary of the two tests:

| Test               | What was tested                                            | Result    |
|--------------------|------------------------------------------------------------|-----------|
| VRAM check         | 24 GB bit-by-bit read/write (all 0 / all 1 / checkerboard) | ✅ Passed |
| GPU compute stress | 20000x20000 matmul x1000 rounds                            | ✅ 71.6 s |
</code></pre>
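<p dir="auto">The log above reports 31.35 GB of VRAM but only 3 chunks, so 24 GB were actually written and checked. A rough sketch of that arithmetic, assuming the same 2 GB reserve and 8 GB chunk size as the script in the first post (chunk_plan is a hypothetical helper name, not part of that script):</p>

```python
GIB = 1024 ** 3


def chunk_plan(total_gib, reserve_gib=2, chunk_gib=8):
    """Return (num_chunks, tested_gib, untested_gib) for a reserve-then-floor chunking rule."""
    total = int(total_gib * GIB)
    usable = total - reserve_gib * GIB
    num_chunks = usable // (chunk_gib * GIB)  # floor division drops the partial chunk
    tested_gib = num_chunks * chunk_gib
    return num_chunks, tested_gib, round(total_gib - tested_gib, 2)


print(chunk_plan(31.35))  # the card in the log above
print(chunk_plan(47.37))  # the 48G card reported elsewhere in the thread
```

<p dir="auto">Under these assumptions roughly 7 GB per card is never touched by the test. A smaller chunk size shrinks the part lost to rounding, though the reserved headroom still goes untested.</p>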
]]></description><link>https://lcz.me/post/6641</link><guid isPermaLink="true">https://lcz.me/post/6641</guid><dc:creator><![CDATA[maverick]]></dc:creator><pubDate>Sat, 13 Jun 2026 08:57:53 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Wed, 10 Jun 2026 04:19:32 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://upload.lcz.me/uploads/ace435e5-048c-4ae6-b03f-f847d500143f.zip" rel="nofollow ugc">Intel_Arc_Validation_Suite_v1.zip</a></p>
<p dir="auto">Initial version. I plan to make it more detailed.</p>
]]></description><link>https://lcz.me/post/6066</link><guid isPermaLink="true">https://lcz.me/post/6066</guid><dc:creator><![CDATA[sirwang]]></dc:creator><pubDate>Wed, 10 Jun 2026 04:19:32 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Tue, 09 Jun 2026 14:21:03 GMT]]></title><description><![CDATA[<p dir="auto">I'll put together an Intel version tomorrow and post it.</p>
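]]></description><link>https://lcz.me/post/6012</link><guid isPermaLink="true">https://lcz.me/post/6012</guid><dc:creator><![CDATA[sirwang]]></dc:creator><pubDate>Tue, 09 Jun 2026 14:21:03 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Tue, 09 Jun 2026 14:12:42 GMT]]></title><description><![CDATA[<p dir="auto">Does this count as an extreme stress test?<br />
100.0% proc'd: 29356 (53855 Gflop/s)   errors: 0   temps: 79°C<br />
...<br />
GPU 0: OK</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Item</th>
<th>Value</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>100.0% proc'd</td>
<td>Complete</td>
<td>The test ran to the end without crashing midway</td>
</tr>
<tr>
<td>29356</td>
<td>Iterations</td>
<td>The GPU completed nearly 30,000 matrix operations, each reading and writing a large amount of VRAM</td>
</tr>
<tr>
<td>53855 Gflop/s</td>
<td>Compute throughput</td>
<td>About 53.9 TFLOPS, a normal full-load level for a 4090</td>
</tr>
<tr>
<td>errors: 0</td>
<td>Perfect</td>
<td>No computation errors, which indicates the VRAM and core are stable</td>
</tr>
<tr>
<td>temps: 79°C</td>
<td>Temperature</td>
<td>79°C at full load is very healthy for a 4090 (70-85°C is usually considered normal)</td>
</tr>
<tr>
<td>GPU 0: OK</td>
<td>Final verdict</td>
<td>Passed <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /></td>
</tr>
</tbody>
</table>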
]]></description><link>https://lcz.me/post/6009</link><guid isPermaLink="true">https://lcz.me/post/6009</guid><dc:creator><![CDATA[Daniel]]></dc:creator><pubDate>Tue, 09 Jun 2026 14:12:42 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Tue, 09 Jun 2026 12:57:58 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Based on your post I asked Gemini several times, and each time it said it had followed your approach and given me code I could copy directly. I'll ask again.</p>
]]></description><link>https://lcz.me/post/5995</link><guid isPermaLink="true">https://lcz.me/post/5995</guid><dc:creator><![CDATA[Daniel]]></dc:creator><pubDate>Tue, 09 Jun 2026 12:57:58 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Tue, 09 Jun 2026 12:56:34 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/daniel" aria-label="Profile: Daniel">@<bdi>Daniel</bdi></a> If it passed, there's no real problem. This isn't a stress test, though, just a simple quick check.</p>
]]></description><link>https://lcz.me/post/5994</link><guid isPermaLink="true">https://lcz.me/post/5994</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Tue, 09 Jun 2026 12:56:34 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Tue, 09 Jun 2026 12:28:38 GMT]]></title><description><![CDATA[<p dir="auto">Is my card OK?<br />
daniel@daniel-Default-string:~$ ~/ex/test_env/bin/python3 ~/ex/vram_heavy_test.py<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f4e6.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--package" style="height:23px;width:auto;vertical-align:middle" title="📦" alt="📦" /> Starting dedicated NVIDIA VRAM scan...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/26a1.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--zap" style="height:23px;width:auto;vertical-align:middle" title="⚡" alt="⚡" /> Total VRAM detected: 47.37 GB<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f525.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--fire" style="height:23px;width:auto;vertical-align:middle" title="🔥" alt="🔥" /> Running alternating bit-pattern sweep on the modded memory chips...<br />
-&gt; [Step 1/4] Writing the all-0000 pattern and verifying...<br />
-&gt; [Step 2/4] Writing the all-1111 pattern and verifying...<br />
-&gt; [Step 3/4] Writing alternating bit patterns...<br />
-&gt; [Step 4/4] Reading back and verifying all tested VRAM...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f389.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--tada" style="height:23px;width:auto;vertical-align:middle" title="🎉" alt="🎉" /> [VRAM check passed] All modded memory chips read and wrote 100% correctly!<br />
daniel@daniel-Default-string:~$ ~/ex/test_env/bin/python3 ~/ex/gpu_hardware_test.py<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f4e6.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--package" style="height:23px;width:auto;vertical-align:middle" title="📦" alt="📦" /> Detected hardware device: NVIDIA GeForce RTX 4090<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/26a1.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--zap" style="height:23px;width:auto;vertical-align:middle" title="⚡" alt="⚡" /> Current CUDA version: 13.0<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f680.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--rocket" style="height:23px;width:auto;vertical-align:middle" title="🚀" alt="🚀" /> Loading the large floating-point matrices...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f525.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--fire" style="height:23px;width:auto;vertical-align:middle" title="🔥" alt="🔥" /> Matrices ready, starting 1000 rounds of compute...<br />
-&gt; Completed 100/1000 rounds...<br />
-&gt; Completed 200/1000 rounds...<br />
-&gt; Completed 300/1000 rounds...<br />
-&gt; Completed 400/1000 rounds...<br />
-&gt; Completed 500/1000 rounds...<br />
-&gt; Completed 600/1000 rounds...<br />
-&gt; Completed 700/1000 rounds...<br />
-&gt; Completed 800/1000 rounds...<br />
-&gt; Completed 900/1000 rounds...<br />
-&gt; Completed 1000/1000 rounds...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f389.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--tada" style="height:23px;width:auto;vertical-align:middle" title="🎉" alt="🎉" /> [Test passed] Elapsed: 104.54 s!</p>
]]></description><link>https://lcz.me/post/5990</link><guid isPermaLink="true">https://lcz.me/post/5990</guid><dc:creator><![CDATA[Daniel]]></dc:creator><pubDate>Tue, 09 Jun 2026 12:28:38 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Mon, 08 Jun 2026 08:41:34 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> Google's models all benchmark well but turn out dumb once you put them to real work; that's the norm. They are only good as a tool, a search tool. Small local models are even worse: toys.</p>
]]></description><link>https://lcz.me/post/5706</link><guid isPermaLink="true">https://lcz.me/post/5706</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 08 Jun 2026 08:41:34 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Mon, 08 Jun 2026 06:45:25 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/applejuice" aria-label="Profile: applejuice">@<bdi>applejuice</bdi></a> I switched to the following, and it no longer crashes:<br />
set MODEL_PATH=D:\MyModels\unsloth\gemma-4-12b-it-GGUF\gemma-4-12b-it-Q4_K_M.gguf<br />
set MMProj_PATH=D:\MyModels\unsloth\gemma-4-12b-it-GGUF\mmproj-F32.gguf</p>
<p dir="auto">But this model is a bit dumb; it doesn't feel good enough.</p>
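]]></description><link>https://lcz.me/post/5686</link><guid isPermaLink="true">https://lcz.me/post/5686</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Mon, 08 Jun 2026 06:45:25 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Mon, 08 Jun 2026 04:23:57 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> A crash whenever a VL model looks at an image has several possible causes. Let me help you narrow it down:</p>
<p dir="auto"><strong>1. Run the VRAM test first to confirm the hardware is fine</strong><br />
You're using 锤哥's VRAM test script, right? Run one full pass of the VRAM test (two passes recommended) to rule out bad blocks. If a 3060 12G was previously used for heavy training or for mining, bad blocks are a real possibility.</p>
<p dir="auto"><strong>2. Estimated VRAM usage of Qwen3-VL-8B Q4_K_M</strong></p>
<ul>
<li>Model weights ≈ 5.5-6GB</li>
<li>KV cache (default 8K context) ≈ 0.5-1GB</li>
<li>Image encoding (vision tower + image embeddings) ≈ 2-3GB</li>
<li>Total ≈ 8-10GB<br />
A 3060 12G is enough in theory, but if you are running other things at the same time (browser, IDE, etc.), or VRAM held by the system has not been freed, usage can tip just over the limit.</li>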
</ul>
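<p dir="auto">The estimate in the list above can be tallied in a few lines. This is only a sketch of the arithmetic: the ranges are the rough figures from the list, not measurements, and the dictionary keys are labels chosen for illustration:</p>

```python
# Estimated ranges in GB, taken from the list above (rough figures, not measured)
budget = {
    "model weights": (5.5, 6.0),
    "KV cache (8K context)": (0.5, 1.0),
    "image encoding": (2.0, 3.0),
}

low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())
card_gb = 12  # RTX 3060 12G

print(f"estimated use: {low:.1f}-{high:.1f} GB")
print(f"headroom on a {card_gb} GB card: {card_gb - high:.1f}-{card_gb - low:.1f} GB")
```

<p dir="auto">The remaining headroom is what other applications and the desktop have to fit into.</p>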
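<p dir="auto"><strong>3. Most likely cause: an mmproj loading problem in llama.cpp</strong><br />
Qwen3-VL needs <code>--mmproj</code> to specify the vision projector file (mmproj-Qwen_Qwen3.6-27B-f16.gguf or the corresponding 8B version). If your launch arguments lack <code>--mmproj</code>, or the mmproj file version does not match the model, llama.cpp will crash when it processes an image.</p>
<p dir="auto">Suggested launch arguments:</p>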
<pre><code>llama-server -m Qwen3-VL-8B-Q4_K_M.gguf \
  --mmproj mmproj-Qwen_Qwen3-VL-8B-f16.gguf \
  -ngl 99 \
  --flash-attn \
  -c 8192
</code></pre>
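<p dir="auto"><strong>4. Test in stages</strong><br />
First run in text-only mode (without <code>--mmproj</code>) to confirm the model itself runs stably. If text-only mode does not crash, the problem is in the vision part.<br />
Then, without attaching an image, send only plain-text requests to llama-server and confirm it responds normally. If this step also fails, consider changing the driver version.</p>
<p dir="auto"><strong>5. Driver version</strong><br />
On Linux, NVIDIA driver 550 or newer is recommended. On Windows, make sure the CUDA 12.x runtime matches the driver.</p>
<p dir="auto">I suggest checking in the order 1→3→4. The cause is most likely the mmproj configuration or a VRAM bottleneck; a broken card is unlikely.</p>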
]]></description><link>https://lcz.me/post/5666</link><guid isPermaLink="true">https://lcz.me/post/5666</guid><dc:creator><![CDATA[Xiaote]]></dc:creator><pubDate>Mon, 08 Jun 2026 04:23:57 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Mon, 08 Jun 2026 04:23:08 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> I'm testing for the same reason: under heavy use the GPU hangs and the machine freezes until I reboot.</p>
]]></description><link>https://lcz.me/post/5665</link><guid isPermaLink="true">https://lcz.me/post/5665</guid><dc:creator><![CDATA[applejuice]]></dc:creator><pubDate>Mon, 08 Jun 2026 04:23:08 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Mon, 08 Jun 2026 03:31:37 GMT]]></title><description><![CDATA[<p dir="auto">Why was I in such a hurry to try 锤哥's script? Because I recently ran into something frustrating:<br />
When my 3060 loads Qwen3-VL-8B-Instruct-Q4_K_M.gguf, the machine crashes every single time it recognizes an image.</p>
<p dir="auto">I haven't found the cause yet, and my work needs a model that can read images...</p>
]]></description><link>https://lcz.me/post/5658</link><guid isPermaLink="true">https://lcz.me/post/5658</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Mon, 08 Jun 2026 03:31:37 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Mon, 08 Jun 2026 03:29:09 GMT]]></title><description><![CDATA[<p dir="auto">Thanks for the script, 锤哥. It took two rounds of revisions by DeepSeek before it ran correctly on Windows:</p>
<pre><code>import torch
import sys
import os

# Suggested: enable expandable_segments to reduce allocator fragmentation
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

print("📦 Starting dedicated VRAM scan for consumer NVIDIA cards...")
if not torch.cuda.is_available():
    print("❌ No CUDA GPU or driver found")
    sys.exit(1)

device = torch.device("cuda:0")

# Get total VRAM and leave extra room for the driver and system (3 GB here)
total_mem = torch.cuda.get_device_properties(0).total_memory
safe_margin = 3 * 1024 * 1024 * 1024  # 3 GB
usable_mem = total_mem - safe_margin
print(f"⚡ Total VRAM detected: {total_mem / (1024**3):.2f} GB")
print(f"🔧 Usable VRAM after safety margin: {usable_mem / (1024**3):.2f} GB")

# Chunk size: allocate 2 GB float32 tensors (512M elements) so one huge allocation cannot fail
chunk_bytes = 2 * 1024 * 1024 * 1024
chunk_elements = chunk_bytes // 4   # float32 takes 4 bytes per element
num_chunks = usable_mem // chunk_bytes

print(f"🔥 Chunked scan: {num_chunks} chunks of {chunk_bytes / (1024**3):.2f} GB each")
print("⏳ Starting alternating bit-pattern sweep (this takes a while)...\n")

try:
    for chunk_idx in range(num_chunks):
        print(f" -&gt; [Chunk {chunk_idx+1}/{num_chunks}] Writing all zeros and verifying cell discharge...")
        grid0 = torch.zeros(chunk_elements, dtype=torch.float32, device=device)
        torch.cuda.synchronize()
        if not (grid0 == 0).all():
            raise ValueError(f"Chunk {chunk_idx+1}: discharge check failed!")
        del grid0

        # Note: float32 1.0 is the value one, not an all-bits-set pattern
        print(f" -&gt; [Chunk {chunk_idx+1}/{num_chunks}] Writing all ones and verifying cell charge...")
        grid1 = torch.ones(chunk_elements, dtype=torch.float32, device=device)
        torch.cuda.synchronize()
        if not (grid1 == 1).all():
            raise ValueError(f"Chunk {chunk_idx+1}: charge check failed!")
        del grid1

        print(f" -&gt; [Chunk {chunk_idx+1}/{num_chunks}] Alternating bit-pattern pass...")
        pattern = torch.arange(0, chunk_elements, dtype=torch.float32, device=device)
        torch.cuda.synchronize()
        # Simple check: sum the values (the sum is not compared against an expected value)
        s = pattern.sum().item()
        del pattern

        torch.cuda.empty_cache()  # free the cache after each chunk
        print(f"     ✔ Chunk {chunk_idx+1} passed.\n")

    print("🎉 [VRAM check passed] All modded memory chips read and wrote 100% correctly bit by bit!")

except Exception as e:
    print(f"\n❌ [Conclusive evidence] VRAM chip scan failed: {e}")
    sys.exit(1)
</code></pre>
<p dir="auto">Both the 3090 and the 3060 on my salvaged X99 board pass the check:<br />
D:\temp&gt;python vram_heavy_test.py<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f4e6.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--package" style="height:23px;width:auto;vertical-align:middle" title="📦" alt="📦" /> Starting dedicated VRAM scan for consumer NVIDIA cards...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/26a1.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--zap" style="height:23px;width:auto;vertical-align:middle" title="⚡" alt="⚡" /> Total VRAM detected: 12.00 GB<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f527.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--wrench" style="height:23px;width:auto;vertical-align:middle" title="🔧" alt="🔧" /> Usable VRAM after safety margin: 11.00 GB<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f525.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--fire" style="height:23px;width:auto;vertical-align:middle" title="🔥" alt="🔥" /> Chunked scan: 5 chunks of 2.00 GB each<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/23f3.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--hourglass_flowing_sand" style="height:23px;width:auto;vertical-align:middle" title="⏳" alt="⏳" /> Starting alternating bit-pattern sweep (this takes a while)...</p>
<p dir="auto">-&gt; [Chunk 1/5] Writing all zeros and verifying cell discharge...<br />
D:\temp\vram_heavy_test.py:32: UserWarning: expandable_segments not supported on this platform (Triggered internally at C:\actions-runner_work\pytorch\pytorch\pytorch\c10/cuda/CUDAAllocatorConfig.h:28.)<br />
grid0 = torch.zeros(chunk_elements, dtype=torch.float32, device=device)<br />
-&gt; [Chunk 1/5] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 1/5] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 1 passed.</p>
<p dir="auto">-&gt; [Chunk 2/5] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 2/5] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 2/5] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 2 passed.</p>
<p dir="auto">-&gt; [Chunk 3/5] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 3/5] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 3/5] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 3 passed.</p>
<p dir="auto">-&gt; [Chunk 4/5] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 4/5] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 4/5] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 4 passed.</p>
<p dir="auto">-&gt; [Chunk 5/5] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 5/5] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 5/5] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 5 passed.</p>
<p dir="auto"><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f389.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--tada" style="height:23px;width:auto;vertical-align:middle" title="🎉" alt="🎉" /> [VRAM check passed] All modded memory chips read and wrote 100% correctly bit by bit!</p>
<p dir="auto">D:\temp&gt;python vram_heavy_test.py<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f4e6.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--package" style="height:23px;width:auto;vertical-align:middle" title="📦" alt="📦" /> Starting dedicated VRAM scan for consumer NVIDIA cards...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/26a1.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--zap" style="height:23px;width:auto;vertical-align:middle" title="⚡" alt="⚡" /> Total VRAM detected: 23.99 GB<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f527.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--wrench" style="height:23px;width:auto;vertical-align:middle" title="🔧" alt="🔧" /> Usable VRAM after safety margin: 20.99 GB<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f525.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--fire" style="height:23px;width:auto;vertical-align:middle" title="🔥" alt="🔥" /> Chunked scan: 10 chunks of 2.00 GB each<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/23f3.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--hourglass_flowing_sand" style="height:23px;width:auto;vertical-align:middle" title="⏳" alt="⏳" /> Starting alternating bit-pattern sweep (this takes a while)...</p>
<p dir="auto">-&gt; [Chunk 1/10] Writing all zeros and verifying cell discharge...<br />
D:\temp\vram_heavy_test.py:32: UserWarning: expandable_segments not supported on this platform (Triggered internally at C:\actions-runner_work\pytorch\pytorch\pytorch\c10/cuda/CUDAAllocatorConfig.h:28.)<br />
grid0 = torch.zeros(chunk_elements, dtype=torch.float32, device=device)<br />
-&gt; [Chunk 1/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 1/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 1 passed.</p>
<p dir="auto">-&gt; [Chunk 2/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 2/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 2/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 2 passed.</p>
<p dir="auto">-&gt; [Chunk 3/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 3/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 3/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 3 passed.</p>
<p dir="auto">-&gt; [Chunk 4/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 4/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 4/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 4 passed.</p>
<p dir="auto">-&gt; [Chunk 5/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 5/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 5/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 5 passed.</p>
<p dir="auto">-&gt; [Chunk 6/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 6/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 6/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 6 passed.</p>
<p dir="auto">-&gt; [Chunk 7/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 7/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 7/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 7 passed.</p>
<p dir="auto">-&gt; [Chunk 8/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 8/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 8/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 8 passed.</p>
<p dir="auto">-&gt; [Chunk 9/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 9/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 9/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 9 passed.</p>
<p dir="auto">-&gt; [Chunk 10/10] Writing all zeros and verifying cell discharge...<br />
-&gt; [Chunk 10/10] Writing all ones and verifying cell charge...<br />
-&gt; [Chunk 10/10] Alternating bit-pattern pass...<br />
<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2714.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--heavy_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✔" alt="✔" /> Chunk 10 passed.</p>
<p dir="auto"><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f389.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--tada" style="height:23px;width:auto;vertical-align:middle" title="🎉" alt="🎉" /> [VRAM check passed] All modded memory chips read and wrote 100% correctly bit by bit!</p>
]]></description><link>https://lcz.me/post/5655</link><guid isPermaLink="true">https://lcz.me/post/5655</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Mon, 08 Jun 2026 03:29:09 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Sun, 07 Jun 2026 04:19:27 GMT]]></title><description><![CDATA[<p dir="auto">NVLink has a fairly high barrier to entry these days and is a hassle when something goes wrong; it is practically archaeology. Consumer cards no longer get NVLink, and even ordinary workstation cards have dropped it.</p>
]]></description><link>https://lcz.me/post/5436</link><guid isPermaLink="true">https://lcz.me/post/5436</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Sun, 07 Jun 2026 04:19:27 GMT</pubDate></item><item><title><![CDATA[Reply to Linux GPU test tools and scripts on Sun, 07 Jun 2026 02:06:15 GMT]]></title><description><![CDATA[<p dir="auto">Thanks, 老特, I happened to need this too.<br />
For now I'm sending the NVLink bridge back to the seller.<br />
Shipping a graphics card back from Malaysia is full of obstacles.</p>
<h2>Summary</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Component</th>
<th>Status</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>GPU 0 silicon (compute)</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /> PASS</td>
<td>1,000 rounds of 20K×20K bf16 matrix multiplication, 100% utilization, 70°C peak</td>
</tr>
<tr>
<td><strong>GPU 1 silicon (compute)</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /> PASS</td>
<td>Same as above, 68°C peak</td>
</tr>
<tr>
<td><strong>GPU 0 VRAM</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /> PASS</td>
<td>12 GB scanned with four bit patterns (0x00000000, fp32 1.0, 0xAAAAAAAA, 0x55555555), zero errors</td>
</tr>
<tr>
<td><strong>GPU 1 VRAM</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /> PASS</td>
<td>Same as above</td>
</tr>
<tr>
<td><strong>Host RAM (62 GB)</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /> PASS</td>
<td>stress-ng, 8 workers × 90% of RAM × 120 s, no errors</td>
</tr>
<tr>
<td><strong>PCIe Gen 3 x16 (per GPU)</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /> PASS</td>
<td>Gen 3 x16 under load, Gen 1 at idle (normal power saving)</td>
</tr>
<tr>
<td><strong>NVLink link 0, 1, 2</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" /> PASS</td>
<td>0 replay / 0 CRC during traffic</td>
</tr>
<tr>
<td><strong>NVLink link 3</strong></td>
<td><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/274c.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--x" style="height:23px;width:auto;vertical-align:middle" title="❌" alt="❌" /> <strong>FAIL</strong></td>
<td>Physical hardware fault: 3,487 replay + 604 CRC errors accumulated in a single test run</td>
</tr>
</tbody>
</table>
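<p dir="auto">The pass/fail split in the NVLink rows can be read as a simple rule: a link passes only if both its replay and CRC counters stay at zero during traffic. The snippet below is a minimal sketch of that rule. The <code>classify_links</code> helper and the dictionary layout are assumptions made for illustration, not the tooling used for this report; the counter values are copied from the table above.</p>

```python
# Hypothetical helper: classify NVLink links from per-link error counters.
# In practice the counters would be read from a tool such as nvidia-smi;
# here they are typed in by hand from the summary table.

def classify_links(counters):
    """Return {link_id: "PASS" or "FAIL"}; any nonzero counter fails."""
    verdicts = {}
    for link_id, errs in sorted(counters.items()):
        clean = errs["replay"] == 0 and errs["crc"] == 0
        verdicts[link_id] = "PASS" if clean else "FAIL"
    return verdicts

counters = {
    0: {"replay": 0, "crc": 0},
    1: {"replay": 0, "crc": 0},
    2: {"replay": 0, "crc": 0},
    3: {"replay": 3487, "crc": 604},
}

verdicts = classify_links(counters)
for link_id, verdict in verdicts.items():
    print(f"NVLink link {link_id}: {verdict}")
```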
<hr />
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th>GPU 0</th>
<th>GPU 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Time at 100% utilization</td>
<td>~7 min</td>
<td>~7 min</td>
</tr>
<tr>
<td>Peak temperature</td>
<td>70°C</td>
<td>68°C</td>
</tr>
<tr>
<td>Peak power draw</td>
<td>230 W</td>
<td>239 W</td>
</tr>
<tr>
<td>Clock frequency</td>
<td>1425 MHz boost</td>
<td>1380 MHz boost</td>
</tr>
<tr>
<td>Fan speed</td>
<td>75%</td>
<td>41-68%</td>
</tr>
<tr>
<td>Throttling events</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>CUDA errors</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>Exit</td>
<td>Clean</td>
<td>Clean</td>
</tr>
</tbody>
</table>
<p dir="auto"><strong>Conclusion</strong>: Both 3090 dies were fully stable under sustained tensor-core load for about 7 minutes: no thermal throttling, no power capping, and no GPU faults.</p>
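<p dir="auto">Figures like the ones in the table above can be sampled with <code>nvidia-smi --query-gpu=... --format=csv,noheader,nounits</code>. A minimal sketch follows. How the numbers in this report were actually logged is not stated, so the <code>parse_sample</code> and <code>read_gpus</code> helpers are assumptions, and the sample line is illustrative only, built from the GPU 0 column.</p>

```python
import subprocess

# Fields understood by `nvidia-smi --query-gpu`; list order is column order.
FIELDS = ["index", "temperature.gpu", "power.draw",
          "clocks.sm", "utilization.gpu", "fan.speed"]

def query_command():
    """Build the nvidia-smi argv that prints one CSV line per GPU."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits"]

def parse_sample(line):
    """Turn one CSV line into a dict of floats keyed by field name."""
    values = [float(v.strip()) for v in line.split(",")]
    return dict(zip(FIELDS, values))

def read_gpus():
    """Run nvidia-smi once (needs an NVIDIA driver) and parse every line."""
    out = subprocess.run(query_command(), capture_output=True,
                         text=True, check=True).stdout
    return [parse_sample(row) for row in out.splitlines() if row.strip()]

# Illustrative line only, assembled from the GPU 0 column of the table.
sample = parse_sample("0, 70, 230.00, 1425, 100, 75")
print(sample["temperature.gpu"], sample["power.draw"])
```

<p dir="auto">Calling <code>read_gpus()</code> in a loop during the burn-in gives a simple time series. Fields that a card does not report come back as non-numeric text, which this sketch does not handle.</p>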
<hr />
<h2>2. GPU memory (VRAM): physical chip scan</h2>
<p dir="auto"><strong>Method</strong>: Allocate a 12 GB scratch buffer on each card, then write and verify four bit patterns:</p>
<ul>
<li>All 0x00000000 (every bit 0)</li>
<li>All 0x3f800000 (fp32 1.0; 7 of the 32 bits are 1)</li>
<li>All 0xAAAAAAAA (checkerboard 10101010...)</li>
<li>All 0x55555555 (inverse checkerboard 01010101...)</li>
</ul>
<p dir="auto">Each pattern is written in full, then the entire buffer is read back and verified.</p>
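<p dir="auto">The write-then-verify loop described above can be sketched on a host-memory buffer so that it runs without a GPU. This is a stand-in under stated assumptions, not the script used for this report: <code>scan_buffer</code> and the 1 MiB buffer size are hypothetical, and a real VRAM test would allocate the buffer on the device (for example with PyTorch) instead of using <code>array</code>.</p>

```python
from array import array

# The four 32-bit patterns listed in the post.
PATTERNS = [0x00000000, 0x3F800000, 0xAAAAAAAA, 0x55555555]

def scan_buffer(num_words, patterns=PATTERNS):
    """Fill a buffer with each pattern, read it back, count mismatches."""
    results = {}
    for pattern in patterns:
        # Write pass: typecode "I" is an unsigned int, assumed to be
        # 4 bytes wide here, as it is on mainstream platforms.
        buf = array("I", [pattern]) * num_words
        # Read-back pass: every word must equal the pattern just written.
        results[pattern] = sum(1 for word in buf if word != pattern)
    return results

# 1 MiB of 4-byte words; a hypothetical size chosen to keep the demo quick.
results = scan_buffer(num_words=(1024 * 1024) // 4)
for pattern, mismatches in results.items():
    print(f"0x{pattern:08X}: {mismatches} mismatches")
```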
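<p dir="auto"><strong>Measured bandwidth</strong>: one 12 GB write-and-read pass takes ~0.1-0.2 s, which works out to <strong>roughly 60-120 GB/s</strong>. Note that the 3090 uses GDDR6X rather than HBM, and this figure is well below its rated peak of about 936 GB/s, so it likely reflects overhead in the test loop rather than the limit of the memory.</p>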
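<p dir="auto">The throughput estimate above is simple division. A small sketch, assuming the 12 GB figure counts one pass over the buffer; the 936 GB/s peak used for comparison is the published RTX 3090 memory bandwidth, not a number measured in this report.</p>

```python
# Effective throughput = buffer size / elapsed time.
BUFFER_GB = 12.0
SPEC_PEAK_GBPS = 936.0   # RTX 3090 GDDR6X rated peak, added for comparison

def throughput_gbps(size_gb, seconds):
    """Gigabytes processed per second for one pass over the buffer."""
    return size_gb / seconds

fast = throughput_gbps(BUFFER_GB, 0.1)   # shortest pass time in the post
slow = throughput_gbps(BUFFER_GB, 0.2)   # longest pass time in the post
share_of_peak = fast / SPEC_PEAK_GBPS

print(f"{slow:.0f}-{fast:.0f} GB/s, "
      f"{100 * share_of_peak:.0f}% of rated peak at best")
```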
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Pattern</th>
<th>GPU 0 mismatches</th>
<th>GPU 1 mismatches</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x00000000</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>fp32 1.0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0xAAAAAAAA</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0x55555555</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
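<p dir="auto"><strong>Conclusion</strong>: Every bit in the tested 12 GB on both cards read back correctly. There are no signs of cold solder joints, bad memory cells, or adjacent-bit interference.</p>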
<blockquote>
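<p dir="auto">Note: the test covers 12 of 24 GB (about half), but all 32 bit positions were written and read repeatedly. Systematic bit errors, such as a stuck data line, would very likely show up in the 12 GB sample; defects confined to cells in the untested half would not.</p>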
</blockquote>
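<p dir="auto">One way to check the bit-position claim is to ask, for each of the 32 positions, whether the pattern set ever writes a 0 there and ever writes a 1 there. The sketch below is an illustration with hypothetical helper names; it also counts the 1 bits in each pattern.</p>

```python
PATTERNS = [0x00000000, 0x3F800000, 0xAAAAAAAA, 0x55555555]

def popcount(word):
    """Number of 1 bits in a word."""
    return bin(word).count("1")

def bit_at(word, position):
    """Value (0 or 1) of the bit at `position`, counting from the LSB."""
    return (word // (2 ** position)) % 2

def covered_positions(patterns):
    """Positions where the pattern set writes both a 0 and a 1."""
    covered = []
    for position in range(32):
        values = {bit_at(p, position) for p in patterns}
        if values == {0, 1}:
            covered.append(position)
    return covered

ones = {f"0x{p:08X}": popcount(p) for p in PATTERNS}
covered = covered_positions(PATTERNS)
print(ones)
print(f"{len(covered)} of 32 bit positions see both 0 and 1")
```

<p dir="auto">Covering every bit position is not the same as covering every memory cell: the half of the VRAM outside the buffer is never written, so faults local to those cells would go unnoticed.</p>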
<hr />
<h2>3. Host RAM: stress-ng stress test</h2>
<p dir="auto"><strong>Method</strong>: <code>stress-ng --vm 8 --vm-bytes 90% --timeout 120s --metrics-brief</code></p>
<p dir="auto">Eight workers read and write ~56 GB (90% of the system's 62 GB) concurrently for 120 seconds.</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Actual run time</td>
<td>120.50 s</td>
</tr>
<tr>
<td>Total bogo ops</td>
<td>18,358,645</td>
</tr>
<tr>
<td>bogo ops/s (real time)</td>
<td>152,352</td>
</tr>
<tr>
<td>bogo ops/s (usr+sys time)</td>
<td>19,151</td>
</tr>
<tr>
<td>Failed stressors</td>
<td>0</td>
</tr>
<tr>
<td>Skipped stressors</td>
<td>0</td>
</tr>
<tr>
<td>Metrics reliability</td>
<td>Fully trustworthy</td>
</tr>
</tbody>
</table>
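<p dir="auto">The derived figures in this section can be re-checked from the raw numbers. A minimal sketch with illustrative variable names; small differences from the table are expected because the reported run time is rounded.</p>

```python
TOTAL_RAM_GB = 62            # host RAM reported in the post
VM_FRACTION = 0.90           # from --vm-bytes 90%
BOGO_OPS = 18_358_645        # total bogo ops from the table
RUNTIME_S = 120.50           # real run time from the table
REPORTED_RATE = 152_352      # bogo ops/s (real time) from the table
REPORTED_CPU_RATE = 19_151   # bogo ops/s (usr+sys time) from the table

# Memory under test and the real-time rate.
target_gb = TOTAL_RAM_GB * VM_FRACTION
rate = BOGO_OPS / RUNTIME_S
relative_gap = abs(rate - REPORTED_RATE) / REPORTED_RATE

# CPU seconds implied by the usr+sys rate, and how many workers
# that keeps busy over the wall-clock run time.
cpu_seconds = BOGO_OPS / REPORTED_CPU_RATE
busy_workers = cpu_seconds / RUNTIME_S

print(f"memory under test: {target_gb:.1f} GB")
print(f"bogo ops/s (real time): {rate:,.0f}")
print(f"gap to reported rate: {relative_gap:.4%}")
print(f"average busy workers: {busy_workers:.2f}")
```

<p dir="auto">The usr+sys rate implies close to eight busy workers for the whole run, which matches <code>--vm 8</code>.</p>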
<p dir="auto"><strong>Conclusion</strong>: The 62 GB of non-ECC DDR4 host memory showed no errors in this 120-second run, including cross-NUMA memory accesses.</p>
<hr />
]]></description><link>https://lcz.me/post/5420</link><guid isPermaLink="true">https://lcz.me/post/5420</guid><dc:creator><![CDATA[applejuice]]></dc:creator><pubDate>Sun, 07 Jun 2026 02:06:15 GMT</pubDate></item><item><title><![CDATA[Reply to Linux下显卡测试工具和脚本分享 on Sun, 07 Jun 2026 01:34:12 GMT]]></title><description><![CDATA[<p dir="auto">No. It can only tell you whether the VRAM has a problem right now.</p>
]]></description><link>https://lcz.me/post/5416</link><guid isPermaLink="true">https://lcz.me/post/5416</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Sun, 07 Jun 2026 01:34:12 GMT</pubDate></item><item><title><![CDATA[Reply to Linux下显卡测试工具和脚本分享 on Sun, 07 Jun 2026 01:03:10 GMT]]></title><description><![CDATA[<p dir="auto">Lao Te, am I understanding this right? This code can detect hidden faults in a graphics card!</p>
]]></description><link>https://lcz.me/post/5415</link><guid isPermaLink="true">https://lcz.me/post/5415</guid><dc:creator><![CDATA[张老师]]></dc:creator><pubDate>Sun, 07 Jun 2026 01:03:10 GMT</pubDate></item><item><title><![CDATA[Reply to Linux下显卡测试工具和脚本分享 on Sun, 07 Jun 2026 00:37:40 GMT]]></title><description><![CDATA[<p dir="auto">If it runs to completion, the VRAM is fine; a faulty card would not pass. The script was written by AI, and it is just a simple physical test that aims to cover the whole VRAM.</p>
]]></description><link>https://lcz.me/post/5412</link><guid isPermaLink="true">https://lcz.me/post/5412</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Sun, 07 Jun 2026 00:37:40 GMT</pubDate></item><item><title><![CDATA[Reply to Linux下显卡测试工具和脚本分享 on Sat, 06 Jun 2026 14:34:57 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Yes, it just finished running successfully.</p>
<p dir="auto">The most recent run is the one just before this message. It passed completely, with this output:</p>
<p dir="auto"><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f389.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--tada" style="height:23px;width:auto;vertical-align:middle" title="🎉" alt="🎉" /> [VRAM physical check passed] 3090 memory chips read and wrote 100% correctly, bit by bit!<br />
Measured coverage: 21.65 GB (5,811,627,622 float32 elements)</p>
<p dir="auto">All 4 steps passed <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2705.png?v=9a87c0a6150" class="not-responsive emoji emoji-android emoji--white_check_mark" style="height:23px;width:auto;vertical-align:middle" title="✅" alt="✅" />:</p>
<ol>
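<li>Fill with 0 → read everything back</li>
<li>Fill with 1 → read everything back</li>
<li>7 float32 patterns → read everything back</li>
<li>Linear ramp of 16,777,216 elements → compare chunk by chunk, element by element</li>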
</ol>
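<p dir="auto">Two numbers in this log can be sanity-checked with a few lines. The sketch below is an illustration, not part of the original script. It converts the element count to GiB and shows that 16,777,216 equals 2^24, the point up to which float32 represents every integer exactly. Whether the script chose the ramp length for that reason is an assumption.</p>

```python
import struct

ELEMENTS = 5_811_627_622      # float32 elements reported in the log
BYTES_PER_FLOAT32 = 4
RAMP_LENGTH = 16_777_216      # ramp length reported in step 4

def as_float32(value):
    """Round-trip a number through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

covered_gib = ELEMENTS * BYTES_PER_FLOAT32 / 1024 ** 3
print(f"covered: {covered_gib:.2f} GiB")

# The ramp length is exactly 2**24.
print(RAMP_LENGTH == 2 ** 24)
# 2**24 survives the float32 round trip; 2**24 + 1 does not.
print(as_float32(2 ** 24) == 2 ** 24)
print(as_float32(2 ** 24 + 1) == 2 ** 24 + 1)
```

<p dir="auto">Past 2^24 a float32 ramp would contain repeated values, so an element-by-element comparison against the index would no longer be meaningful.</p>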
<p dir="auto">Want me to run it once more to confirm, or is there something else you need?</p>
<p dir="auto">So that means the VRAM passes, right?</p>
]]></description><link>https://lcz.me/post/5380</link><guid isPermaLink="true">https://lcz.me/post/5380</guid><dc:creator><![CDATA[johnnybegood]]></dc:creator><pubDate>Sat, 06 Jun 2026 14:34:57 GMT</pubDate></item></channel></rss>