
GDM failing to start on boot: white screen, "Please contact a system administrator" #149539

Closed
tomfitzhenry opened this issue Dec 8, 2021 · 26 comments
Labels
0.kind: bug Something is broken

Comments

@tomfitzhenry
Contributor

Steps To Reproduce

My machine:

  • runs NixOS on bc5d683
  • has a discrete AMD card, and an integrated Intel card
  • runs impermanence, with no /var persisted between boots

The machine's relevant config is:

{ pkgs, lib, config, ... }:
{
  imports =
    [
      "${nixos-hardware}/common/cpu/intel"
      "${nixos-hardware}/common/gpu/amd"
    ];

  services.xserver = {
    enable = true;
    displayManager.gdm.enable = true;
    displayManager.gdm.debug = true;
    desktopManager.gnome3.enable = true;
  };
}

Boot the machine.

It fails. The only known workaround is to boot into an earlier generation.

Expected behavior

I expected the machine to boot GDM and be able to log in to GNOME.

Screenshots

This is what is displayed at boot, after the usual kernel/systemd logs:

[Screenshot: signal-2021-12-08-173042]

Additional context

GDM debug logs: https://pastebin.com/sYvNhmHC

Notify maintainers

@hedning @jtojnar @dasj19 @maxeaubrey

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module: gdm
@tomfitzhenry tomfitzhenry added the 0.kind: bug Something is broken label Dec 8, 2021
@tomfitzhenry
Contributor Author

tomfitzhenry commented Dec 8, 2021

Per the log line "Dec 08 17:21:57 mymachine gdm[2218]: Gdm: Child process -2238 was already dead." in https://pastebin.com/sYvNhmHC, this might be related to #149537.

@tomfitzhenry tomfitzhenry changed the title GDM failing to start on boot GDM failing to start on boot: white screen, "Please contact a system administrator" Dec 8, 2021
@jtojnar
Member

jtojnar commented Dec 8, 2021

Hmm, the only thing I see is “Session never registered, failing”, could you please also try services.xserver.desktopManager.gnome3.debug = true;

@jtojnar
Member

jtojnar commented Dec 8, 2021

Also, do you by chance know what commit of Nixpkgs the previous working derivation was?

@tomfitzhenry
Contributor Author

> Also, do you by chance know what commit of Nixpkgs the previous working derivation was?

Yes: 09a54b1.

I will try with services.xserver.desktopManager.gnome3.debug = true;.

@tomfitzhenry
Contributor Author

> Hmm, the only thing I see is “Session never registered, failing”, could you please also try services.xserver.desktopManager.gnome3.debug = true;

Done.

journalctl -u display-manager.service: https://pastebin.com/cac08hgh

journalctl -u session-c1.scope: https://pastebin.com/tv170GEX

@tomfitzhenry
Contributor Author

tomfitzhenry commented Dec 8, 2021

For comparison, here are the same logs for a working boot (on 09a54b1), with no debug logging:

display-manager.service: https://pastebin.com/tAKuDJsy
session-c1.scope: https://pastebin.com/DgHYD09m

@jtojnar
Member

jtojnar commented Dec 8, 2021

Dec 08 18:23:28 mymachine /nix/store/8c92zvlflv5gka582d5svbmbvmyww03q-gdm-41.0/libexec/gdm-wayland-session[2229]: gnome-session-binary[2229]: WARNING: Failed to upload environment to systemd: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Name "org.freedesktop.systemd1" does not exist

is suspicious but I can see it in the working log too.

But I would still try early KMS per the suggestion in https://bbs.archlinux.org/viewtopic.php?id=262221 (I think the corresponding thing would be boot.initrd.availableKernelModules.)

Also there is this issue, not sure if related #103746

@tomfitzhenry
Contributor Author

> But I would still try early KMS per the suggestion in https://bbs.archlinux.org/viewtopic.php?id=262221 (I think the corresponding thing would be boot.initrd.availableKernelModules.)

Per https://github.com/NixOS/nixos-hardware/blob/4c9f07277bd4bc29a051ff2a0ca58c6403e3881a/common/gpu/amd/default.nix#L4 , nixos-hardware (which I use) sets boot.initrd.kernelModules correctly.

Per

rootModules = config.boot.initrd.availableKernelModules ++ config.boot.initrd.kernelModules;

either boot.initrd.kernelModules or boot.initrd.availableKernelModules is sufficient.
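For reference, a minimal sketch of the two ways to get the GPU driver into the initrd for early KMS. It assumes the amdgpu module, which is what the nixos-hardware common/gpu/amd profile adds; set only one of the two. Per the NixOS option descriptions, boot.initrd.kernelModules is always loaded by the initrd, whereas boot.initrd.availableKernelModules only ships the module in the initrd so it can be loaded on demand.

```nix
{ ... }:
{
  # Option A: always load amdgpu in stage 1 (early KMS).
  # This mirrors what nixos-hardware's common/gpu/amd profile sets.
  boot.initrd.kernelModules = [ "amdgpu" ];

  # Option B: only make amdgpu available in the initrd; it is then loaded
  # on demand (e.g. by udev when the hardware is detected), not unconditionally.
  # boot.initrd.availableKernelModules = [ "amdgpu" ];
}
```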

> Also there is this issue, not sure if related #103746

I tried the workaround in #103746 (comment) but this did not change the behaviour.

I will try to:

@jtojnar
Member

jtojnar commented Dec 8, 2021

At worst, it should also be possible to bisect on a live machine (though there are 11 steps between bc5d683 and 09a54b1, so it might take a while).

@andir
Member

andir commented Dec 8, 2021

I am seeing the exact same issue on my AMD notebook. We can cut down two bisect steps as I can report that with 6daa4a5 it was still working fine.

@jtojnar
Member

jtojnar commented Dec 8, 2021

I can actually reproduce it on an Intel/nouveau system too. It also works for me on 6daa4a5.

@jtojnar
Member

jtojnar commented Dec 8, 2021

I can also reproduce it in the following VM (built with nixos-rebuild build-vm -I nixpkgs=$PWD -I nixos-config=../nix-playground/gdm-fail.nix and run with env QEMU_NET_OPTS="hostfwd=tcp::2222-:22" result/bin/run-*-vm):

{ pkgs, config, ... }: {
  environment.systemPackages = with pkgs; [
    gdb
    binutils # readelf
    file
    htop
    less
  ];
  services.gnome.core-utilities.enable = false;
  environment.enableDebugInfo = true;
  services.xserver = {
    enable = true;
    layout = "cz";
    xkbVariant = "qwerty";

    displayManager.gdm = {
      enable = true;
      debug = true;
    };
    desktopManager.gnome = {
      enable = true;
      debug = true;
    };
  };
  services.openssh.enable = true;

  console.useXkbConfig = true;
  i18n = {
    defaultLocale = "en_UK.UTF-8";
  };

  users.extraUsers.jtojnar = {
    isNormalUser = true;
    uid = 1000;
    extraGroups = [ "wheel" "networkmanager" ];
    password = "";
    openssh.authorizedKeys.keys = [ "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDYbOlZydfRRCGCT08wdtPcpfSrgxMc6weDx3NcWrnMpVgxnMs3HozzkaS/hbcZUocn7XbCOyaxEd1O8Fuaw4JXpUBcMetpPXkQC+bZHQ3YsZZyzVgCXFPRF88QQj0nR7YVE1AeAifjk3TCODstTxit868V1639/TVIi5y5fC0/VbYG2Lt4AadNH67bRv8YiO3iTsHQoZPKD1nxA7yANHCuw38bGTHRhsxeVD+72ThbsYSZeA9dBrzACpEdnwyXclaoyIOnKdN224tu4+4ytgH/vH/uoUfL8SmzzIDvwZ4Ba2yHhZHs5iwsVjTvLe7jjE6I1u8qY7X8ofnanfNcsmz/ jtojnar@kaiser" ];
  };
  virtualisation.memorySize = 2048;
  virtualisation.diskSize = 28048;
  virtualisation.qemu.options = [ "-device intel-hda" "-device hda-duplex" ];

  environment.etc."modprobe.d/floppy.blacklist.conf".text = ''
    blacklist floppy
  '';
}

@andir
Member

andir commented Dec 8, 2021

Shouldn't this also be caught by our VM tests?

EDIT: It isn't caught. I've bisected the entire chain and the test passed on every commit :(

@andir
Member

andir commented Dec 8, 2021

As far as I can tell this issue was introduced with abfcb79

@andir
Copy link
Member

andir commented Dec 8, 2021

> As far as I can tell this issue was introduced with abfcb79

Yep, reverting the commit allows me to boot into my system again.

@jtojnar jtojnar closed this as completed in bcb4b71 Dec 8, 2021
@jtojnar
Member

jtojnar commented Dec 8, 2021

Thanks, I can confirm that too. Reverted for now.

It is weird that the tests cannot reproduce it. It is also surprising given that environment.sessionVariables should be merged into environment.variables anyway:

environment.variables = config.environment.sessionVariables;

cc @ncfavier

@tomfitzhenry
Contributor Author

Thanks for the diagnosis @jtojnar and @andir.

I can confirm this fixed my issue too.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/gnome-display-manager-oh-no-something-has-gone-wrong/16501/4

@wllianwd

wllianwd commented Dec 9, 2021

Hey peeps, I can see this was fixed, but I still get the same error. What needs to be done to fix it?
Here is my flake: https://github.com/wllianwd/my-nix-os/blob/main/flake.nix
And I'm updating the system using:

nix flake update
sudo nixos-rebuild switch --flake .#

Here is the diff against the last working configuration:

    ~  nix store diff-closures /nix/var/nix/profiles/system-86-link /nix/var/nix/profiles/system-88-link                                                                                                                 ✔  17:40:29  
adwaita-qt: 1.4.0 → 1.4.1
asar: +13.4 KiB
bash: 5.1-p8 → 5.1-p12
bash-interactive: 5.1-p8 → 5.1-p12
electron: 13.6.2 → ∅, -183668.3 KiB
eog: 41.0 → 41.1
fwupd: 1.7.1 → 1.7.2, -78.9 KiB
fwupd-efi: ∅ → 1.1, +61.7 KiB
gd: ∅ → 2.3.2, +836.8 KiB
git: 2.33.1 → 2.34.0, +843.9 KiB
gnome-connections: 41.1 → 41.2
gnome-control-center: 41.1 → 41.2
gnome-initial-setup: 41.0 → 41.2
gnome-maps: 41.1 → 41.2
gnome-software: 41.1 → 41.2, +45.9 KiB
idea-community: 2021.2.3 → 2021.3, +414955.7 KiB
initrd-kmod-blacklist: ∅ → ε
initrd-linux: -83.2 KiB
kubectl: +8.0 KiB
ldns: 1.7.1 → 1.8.0
libgphoto2: +286.4 KiB
libnftnl: 1.2.0 → 1.2.1
libselinux: 3.0 → 3.3, +41.7 KiB
libsepol: 3.0 → 3.3, +264.6 KiB
linux: -678.7 KiB
mdadm.conf: ∅ → ε
mesa: 21.2.5 → 21.2.6
nftables: 1.0.0 → 1.0.1
nixos: +10.8 KiB
nixos-manual: +23.7 KiB
nixos-system-nixos: 22.05.20211129.8a30877 → 22.05.20211206.bc5d683
nss: 3.72 → 3.73
obsidian: 0.12.19 → ∅, -21971.0 KiB
postgresql: 13.4 → 13.5
python3.9-importlib-metadata: 4.8.1 → 4.8.2
sof-firmware: 1.9 → 1.9.2, +597.1 KiB
steam-runtime: 0.20210906.1 → 0.20211102.0, -628.0 KiB
strace: 5.14 → 5.15, +89.5 KiB
udisks: 2.8.4 → 2.9.4, +119.8 KiB
yelp: 41.1 → 41.2
yelp-xsl: 41.0 → 41.1
zoom: 5.8.4.210 → 5.8.6.739, +3629.9 KiB

@ncfavier
Member

ncfavier commented Dec 9, 2021

This was fixed in master but nixos-unstable hasn't caught up yet (you can see here that it was last updated two days ago).

@dcarosone

The build has been updated, but unfortunately the new one (581d2d6) also just missed this fix, so you will have to wait for the next one.


@baracoder
Contributor

Alternatively, override the nixpkgs input to point at the fix commit bcb4b71:

nix flake update --override-input nixpkgs 'https://github.com/NixOS/nixpkgs/archive/bcb4b714bdddec94d88ff974f242cdb3f3308dac.tar.gz'
sudo nixos-rebuild switch --flake .

@bbigras
Contributor

bbigras commented Dec 10, 2021

> nix flake update --override-input nixpkgs 'https://github.com/NixOS/nixpkgs/archive/bcb4b714bdddec94d88ff974f242cdb3f3308dac.tar.gz'

Or nix flake update --override-input nixpkgs nixpkgs/bcb4b714bdddec94d88ff974f242cdb3f3308dac :)
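Instead of overriding on the command line, the input can also be pinned in flake.nix itself until nixos-unstable catches up. A minimal sketch, assuming the flake calls its input nixpkgs and normally tracks nixos-unstable; the host name and module path below are hypothetical placeholders, not the actual configuration linked above:

```nix
{
  inputs = {
    # Temporarily pin nixpkgs to the commit that contains the revert (bcb4b71).
    nixpkgs.url = "github:NixOS/nixpkgs/bcb4b714bdddec94d88ff974f242cdb3f3308dac";
    # Once nixos-unstable includes the fix, switch back to:
    # nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = { self, nixpkgs }: {
    # Hypothetical host name and module path; adapt to your own flake.
    nixosConfigurations.nixos = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [ ./configuration.nix ];
    };
  };
}
```

After editing the URL, run nix flake update and sudo nixos-rebuild switch --flake .# as before.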

but thanks 👍

@ncfavier
Member

It seems like this is related to dconf: commenting out the environment.sessionVariables.GIO_EXTRA_MODULES line in dconf.nix fixes the crash, as does removing the programs.dconf.profiles.gdm block in gdm.nix.

Also, the successful logs contain the lines

Dec 11 10:07:31 nixos /nix/store/8c92zvlflv5gka582d5svbmbvmyww03q-gdm-41.0/libexec/gdm-wayland-session[898]: gnome-session-binary[898]: GLib-GIO-DEBUG(+): _g_io_module_get_default: Found default implementation keyfile (GKeyfileSettingsBackend) for ‘gsettings-backend’
Dec 11 10:07:31 nixos gnome-session-binary[898]: GLib-GIO-DEBUG(+): _g_io_module_get_default: Found default implementation keyfile (GKeyfileSettingsBackend) for ‘gsettings-backend’

while the failing logs contain

Dec 11 10:10:24 nixos /nix/store/8c92zvlflv5gka582d5svbmbvmyww03q-gdm-41.0/libexec/gdm-wayland-session[959]: gnome-session-binary[959]: GLib-GIO-DEBUG(+): _g_io_module_get_default: Found default implementation dconf (DConfSettingsBackend) for ‘gsettings-backend’
Dec 11 10:10:24 nixos gnome-session-binary[959]: GLib-GIO-DEBUG(+): _g_io_module_get_default: Found default implementation dconf (DConfSettingsBackend) for ‘gsettings-backend’

I don't know enough about the GNOME ecosystem to know what to do with this information.

@jtojnar
Member

jtojnar commented Dec 11, 2021

Interesting. Possibly it tries to use dconf without the daemon running? I would expect more information about that in the logs if that were the case, though.

@jtojnar jtojnar unpinned this issue Dec 16, 2021
@ncfavier
Member

ncfavier commented Dec 16, 2021

I've opened an issue upstream with all the information I have: https://gitlab.gnome.org/GNOME/gdm/-/issues/756

It turns out the crash doesn't happen if autologin is enabled without a delay, which is why the tests didn't catch it. Adding gdm.autoLogin.delay = 1; to the gnome test seems to make it hang when it should fail (and succeed when it should succeed). I'll see if I can fix that.
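A minimal sketch of a test machine configuration that exercises the delayed path, assuming a test user named alice (hypothetical). As far as I understand the gdm module, a nonzero gdm.autoLogin.delay makes GDM use timed login (TimedLoginEnable) instead of immediate AutomaticLoginEnable, so the greeter session actually starts before the user is logged in:

```nix
{ ... }:
{
  services.xserver = {
    enable = true;
    displayManager = {
      gdm.enable = true;
      # Nonzero delay: timed login via the greeter instead of
      # immediate AutomaticLoginEnable, which skips the failing path.
      gdm.autoLogin.delay = 1;
      autoLogin = {
        enable = true;
        user = "alice"; # hypothetical test user
      };
    };
    desktopManager.gnome.enable = true;
  };

  users.users.alice.isNormalUser = true;
}
```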

ncfavier added a commit to ncfavier/nixpkgs that referenced this issue Jan 3, 2022
Catches failures like NixOS#149539
that don't happen with AutomaticLoginEnable.

We still have a 0-delay autologin test in gnome-xorg, in case there's
ever an issue that only arises with AutomaticLoginEnable.