Windows 10 VM running BlueIris has igfx driver crash every few days. #228
Comments
Did some testing on Linux kernel 5.13 over the last month, and the behavior noted above was completely resolved. Moving up to kernel 5.15 now, since it's actually being maintained.
Running on 5.15, I was able to get about 3 weeks out of the system before I noticed this in the syslog, and a crashed video driver on the Win10 guest.
May 16 05:26:30 pve kernel: DMAR: DRHD: handling fault status reg 3
Same setup and versions as last time, looks like same error.
May 26 03:05:48 pve kernel: DMAR: DRHD: handling fault status reg 3
Same setup as before.
Jun 12 02:50:44 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800401e73000 [fault reason 0x07] Next page table ptr is invalid
For these kinds of errors, you can try the workaround I've posted here: #153 (comment). It's not a 100% solution though. Check the comments in #153.
Greetings all,
Looking for some hints as to what might be the issue with my setup. I have a Windows 10 VM running BlueIris that started exhibiting igfx driver crashes approximately a month ago. Previously, this system was stable, with uptimes of several months and no issues.
Host system:
Proxmox 7.4-3
Kernels recently used: 6.2, 6.1, 5.19, 5.15, 5.13
Intel E-2186G, 128 GB RAM, Nvidia T1000, LSI HBA
VMs:
Ubuntu 22.04 running Pi-hole, no issues noted
TrueNAS Core, has LSI HBA passed through, no issues noted
Ubuntu 22.04 running Portainer, has Nvidia T1000 passed through, no issues noted
Windows 10 22H2, has Intel iGPU P630 passed through (GVT-d), igfx driver crashes every few days
This setup has been in place for approximately a year with virtually no issues until about a month ago (March 8th, from my notes). In the last week or so, I've worked my way through Linux kernels 5.19, 6.1, and 6.2, and also tried GVT-g to see if I could stop the igfx driver crashes. With GVT-g, when the crash happens the VM stops responding completely and causes issues on the host as well, necessitating a host reboot. With GVT-d, only the VM needs to be rebooted.
Under the 6.1 and 6.2 (and perhaps 5.19) kernels using GVT-g, I get syslog entries (on the host) like this when a crash happens:
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail to flush post shadow
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail to dispatch workload, skip
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c000
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c008
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c010
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c018
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c020
and
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c948
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 17 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 15 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6ca80
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
Under 6.2 and 6.1 using GVT-d, I get messages like this when a crash happens:
Mar 26 07:20:45 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 26 07:20:45 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffffb8024c046000 [fault reason 0x07] Next page table ptr is invalid
Mar 29 12:08:47 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 29 12:08:47 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff8004014b4000 [fault reason 0x07] Next page table ptr is invalid
Mar 31 05:48:36 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 31 05:48:36 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800417686000 [fault reason 0x07] Next page table ptr is invalid
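For anyone comparing these faults across kernels, here is a minimal Python sketch that pulls the timestamp, requesting device, fault address, and fault reason out of DMAR lines shaped like the ones above. The regular expression and the helper name `parse_dmar_faults` are my own assumptions, derived only from these samples; other kernels may format the message differently. The sample list is copied from the log above, and in practice the lines would come from wherever the host keeps its kernel log.

```python
import re
from collections import Counter

# Assumed layout, based only on the sample lines in this issue:
# "<Mon DD HH:MM:SS> <host> kernel: DMAR: [<kind>] Request device [<bdf>]
#  fault addr <hex> [fault reason <hex>] <text>"
DMAR_FAULT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: DMAR: "
    r"\[(?P<kind>[^\]]+)\] Request device \[(?P<device>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>0x[0-9a-f]+) "
    r"\[fault reason (?P<reason>0x[0-9a-f]+)\] (?P<text>.*)$"
)


def parse_dmar_faults(lines):
    """Return one dict per DMAR fault line; non-matching lines are skipped."""
    faults = []
    for line in lines:
        match = DMAR_FAULT.match(line.strip())
        if match:
            faults.append(match.groupdict())
    return faults


# Sample input copied from the syslog excerpt above.
sample = [
    "Mar 26 07:20:45 pve kernel: DMAR: DRHD: handling fault status reg 3",
    "Mar 26 07:20:45 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffffb8024c046000 [fault reason 0x07] Next page table ptr is invalid",
    "Mar 29 12:08:47 pve kernel: DMAR: DRHD: handling fault status reg 3",
    "Mar 29 12:08:47 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff8004014b4000 [fault reason 0x07] Next page table ptr is invalid",
    "Mar 31 05:48:36 pve kernel: DMAR: DRHD: handling fault status reg 3",
    "Mar 31 05:48:36 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800417686000 [fault reason 0x07] Next page table ptr is invalid",
]

faults = parse_dmar_faults(sample)

# Count faults per requesting device and per fault reason code.
by_device = Counter(fault["device"] for fault in faults)
by_reason = Counter(fault["reason"] for fault in faults)

for fault in faults:
    print(fault["stamp"], fault["device"], fault["addr"], fault["reason"])
```

Run against a longer log, the per-device and per-reason counts make it easier to see whether every fault comes from the passed-through iGPU at 00:02.0 and how the interval between faults changes from one kernel to the next.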
I'm trying out older kernels now (currently 5.13) to see if there is any appreciable difference. I do realize that I am running quite a complicated system, and might be bumping up against an edge case.
Any thoughts?