Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

install-darwin: fix _nixbld uids for macOS sequoia #10919

Merged
merged 3 commits into from
Sep 3, 2024

Conversation

abathur
Copy link
Member

@abathur abathur commented Jun 15, 2024

Motivation

Starting in macOS 15 Sequoia, macOS daemon UIDs are encroaching on our default UIDs of 301-332. This commit relocates our range up to avoid clashing with the current UIDs of 301-304 and buy us a little time while still leaving headroom for people installing more than 32 users.

It also adopts GID 350 (same as first UID), since @emilazy pointed out that this will keep our build group from showing up in the Users & Groups interface. (See #10919 (comment))

Context

Priorities and Process

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@emilazy
Copy link
Member

emilazy commented Jun 15, 2024

Is there any chance we’ll be able to automate this transition on upgrade? Getting the word out to every single user that they need to run a script seems painful. And if we’re going to get everyone to run a migration script, is there any chance we can move the builder group to GID 330 at the same time so it stops showing up in System Settings?

I can try and test this in a VM someday soon.

@abathur
Copy link
Member Author

abathur commented Jun 15, 2024

Is there any chance we’ll be able to automate this transition on upgrade? Getting the word out to every single user that they need to run a script seems painful.

That's a Nix-nix question, I guess. I'm not personally aware of any mechanism for running a stateful script when people update Nix and don't see one on nix upgrade-nix --help (not that this would help people not using the new CLI or who only update their nix via nixos/hm/nix-darwin). We'd probably need to build some user migration logic into the daemon (and it would still only help you if people updated to the new Nix and ran that daemon before they updated to Sequoia).

if we’re going to get everyone to run a migration script, is there any chance we can move the builder group to GID 330 at the same time so it stops showing up in System Settings?

I've wondered why people were mentioning the GID, since we haven't needed to change it previously. (I assumed people were just conflating the 30000 GID we set with the 301 I gather the detsys installer sets?)

I'm less sure about migrating it. Inevitably we can change it, but I don't know if doing so will "break" the existing groups. (If it did, we might have to rework the script a little to just remove the users + group and re-add them all?)

@caldwell
Copy link

Is there a reason that every nix install needs the same UIDs for these users? Furthermore, is there a reason they need to be consecutive? Does the main nix builder code deal with the UIDs directly or look them up from the names?

It appears from the user reported error in #10912 that it uses the names, in which case the installer would be free to pick whatever UIDs happen to be free on a given OS install when creating them… That would be a simpler solution for the install/upgrade script and it would work everywhere. Trying to pick a empty span is only ever guaranteed to work on a fresh OS install—there's always going to be somebody out there who's got a user in that range created by hand or by some other software.

As far as upgrading goes, perhaps the script should be more nixos/puppet-like (declarative/idempotent) and create the users only if they don't exist.

It would also be super convenient if the code that got the error in #10912 could call the upgrade script and fix it right there, but I don't know if it has the correct permissions at that point in time. It could at least mention your upgrade script or prompt the user to re-install.

@abathur
Copy link
Member Author

abathur commented Jun 17, 2024

Is there a reason that every nix install needs the same UIDs for these users? Furthermore, is there a reason they need to be consecutive? Does the main nix builder code deal with the UIDs directly or look them up from the names?

To the best of my understanding (as someone not very familiar with the codebase for Nix itself), the answer to all of these is: not really.

It appears from the user reported error in #10912 that it uses the names, in which case the installer would be free to pick whatever UIDs happen to be free on a given OS install when creating them… That would be a simpler solution for the install/upgrade script and it would work everywhere. Trying to pick a empty span is only ever guaranteed to work on a fresh OS install—there's always going to be somebody out there who's got a user in that range created by hand or by some other software.

Indeed. But reworking this means modifying how we install users on all platforms, having to wrestle with edge cases that the current installer's process is too ~dumb to have to worry about (like what to do if the UIDs we want are taken by old nixbld users that may not match the nixbuildN user we were planning to put at that UID, what to do if we run out of valid role UIDs on macOS before placing the requested number of users, etc.), and having to go test it on a broad spectrum of systems to confirm the change.

While this refactor might save us from having to relocate to a new range a few years in the future, it would not fix the underlying problem here--this macOS update's installer currently clobbers our existing users to take the UIDs for its own daemons. This could of course happen to any UID we use on any update to any existing install on macOS--either they'll have to stop doing this without relocating our users, or Nix itself needs to get smart enough to detect the situation and recover from it or suggest remediation steps to the user.

As far as upgrading goes, perhaps the script should be more nixos/puppet-like (declarative/idempotent) and create the users only if they don't exist.

Not sure if you mean this in the context of the migration script, or the installer. If the latter, I broadly agree--but full idempotence is tricky to reach and maintain, and a lot of idempotence-focused work can lead to minimal benefit if/when one or two things block full idempotence (i.e., we do the work and testing, but the installer will still break or bail somewhere and we still have to tell frustrated users to go manually uninstall and reinstall).

I had thoughts and laid down some patterns for getting us here a few years back, but at this point I imagine this is more likely to come from working on the NixOS org's fork of the detsys installer (directly or by contributing to the upstream detsys installer itself). That said, I'll note that macOS eminent-domaining our UIDs and clobbering the users in the process is also causing trouble for their installer. (IIRC is breaks their ability to do an uninstall, for example.)

It would also be super convenient if the code that got the error in #10912 could call the upgrade script and fix it right there, but I don't know if it has the correct permissions at that point in time. It could at least mention your upgrade script or prompt the user to re-install.

I agree, but I think that's a nix-nix question outside of the scope of this PR (and I imagine it would be better if that took the form of a more general user fixup routine instead of having to figure out how to suggest macos version-specific cleanup to only the right users).

I'll also note that--unless Apple changes the updater to be a bit more polite--the cake is mostly-baked here. Even if someone opened a PR to support this in Nix today and there was a release cut by the end of the week, some fraction of Nix's macOS users will not be using that Nix release when they take the Sequoia update (whether that's a beta this summer or the official release this fall).

@emilazy
Copy link
Member

emilazy commented Jul 1, 2024

I’m sorry that I didn’t yet get around to testing the migration; I will try to do so soon.

@abathur How do you feel about trying to land the UID (and preferably GID) changes for new installs only – which has to be done regardless – and we can worry about migration when it becomes clearer if Sequoia is going to implement any kind of migration itself?

@abathur
Copy link
Member Author

abathur commented Jul 2, 2024

No worries :)

  • I suppose I can separate the migration script out (hopefully tonight).
  • I can't say my expectations are high, but I have been hoping that this report getting escalated within Apple will shake loose some guidance on good UID ranges. (I'll try to remember to follow up on that tonight--it's been 2 weeks since they reported escalating it.)

I'm not opposed to changing the GID (whether here or in a separate PR to ensure both are easy to revert from GH without unrelated regressions), but I'm conservative about fiddling with these and do need some convincing:

  • I don't have a spare eligible macbook sitting around this cycle that I can use to readily go try different macOS versions/updates out on to see for myself and confirm the fix.
  • I can't recall seeing (nor finding via search) a clear report of the problem the current GID is causing (and demonstration that moving the GID fixes it) here, against the detsys installer, or against the Lix fork of it.
  • I'm not aware of any prior art here where someone's demonstrated that moving the GID down won't cause its own trouble. (The detsys and lix installers are both still using 30000 AFAIK.)

If there is an issue pointing me that way is fine, but if not could you open one and document what you're seeing there (ideally w/ screenshots)?

@emilazy
Copy link
Member

emilazy commented Jul 2, 2024

emily@yuyuko ~> sudo dseditgroup -o create -r 'test group for emily 1' -i 3000 emilytest1
emily@yuyuko ~> sudo dseditgroup -o create -r 'test group for emily 2' -i 360 emilytest2
Screenshot of “test group for emily 1” showing up in System Settings

(Sonoma 14.5)

I fiddled with a bunch of combinations along these lines and a bunch of different IDs but the results were pretty clear; as I mentioned in #10892 (comment) the threshold for groups seems to be 500 for whatever reason, but of course picking one that matches the UID we’ll always take up seems the most conservative choice and Apple do use GID 395–400 and 441 as of Sonoma.

I don’t know if I can prove the negative re: the group ID potentially causing problems, but I can’t personally foresee any problems that we wouldn’t already get with UIDs. I’m open to trying to find the time to test stuff in a VM if you have some proposals for ways to test things, though. From reading the old discussion from when we moved the UIDs, I get the impression we were just too preoccupied with those more pressing issues to think about whether there might be any side‐effects of having a GID outside the system range too.

@abathur
Copy link
Member Author

abathur commented Jul 2, 2024

I don’t know if I can prove the negative re: the group ID potentially causing problems, but I can’t personally foresee any problems that we wouldn’t already get with UIDs. I’m open to trying to find the time to test stuff in a VM if you have some proposals for ways to test things, though. From reading the old discussion from when we moved the UIDs, I get the impression we were just too preoccupied with those more pressing issues to think about whether there might be any side‐effects of having a GID outside the system range too.

To clarify, I don't mean that I won't PR the change without meeting that standard of proof--I think your comment here reasonably demonstrates the issue and shows that a lower UID addresses it. I just meant that one reason I'm treading cautiously is that I can't just ~transfer confidence from looking at the other installers frontrunning us on this for days/weeks/months with issues/PRs to document the problem+fix and a lack of subsequent reports I could take as supporting evidence.

@emilazy
Copy link
Member

emilazy commented Jul 2, 2024

Oh yeah I totally understand the conservatism here, don’t worry. I just figure if we’re on Apple’s wild ride for the time being anyway we might as well improve the UX, especially if we do end up having to make everyone do a manual migration. I guess if I find the time to test the Sequoia migration I can make a group in the system range before the upgrade and see what happens to it?

I think the DetSys installer has some kind of A/B testing roll‐out stuff. I don’t know how quickly we could get data on potential GID problems with that though.

export NIX_FIRST_BUILD_UID="${NIX_FIRST_BUILD_UID:-331}"
export NIX_BUILD_GROUP_ID="${NIX_BUILD_GROUP_ID:-331}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be silly, but – is there any chance of this group being associated with _nixbld1 specifically because of sharing the GID with it? If so, we could avoid that by picking 330 instead. I guess that per‐user groups sharing the IDs of their corresponding users is just a convention so hopefully nothing in the system is going to assume it and cause some kind of weird behaviour down the road, but all this system role ID stuff makes me superstitious enough to worry about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope not, but that's definitely in the class of risk that makes me a bit cautious about changing it without a reason (but I think you've provided a sufficiently-good reason to try it).

Hopefully we'll have some degree of confirmation from people running this for at least a few days before merge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pick 330 instead just in case? That matches the staus quo on Linux (30000 ≠ 30001), the previous status quo on Darwin (30000 ≠ 301), and doesn’t seem to have any additional risks.

I’m overdue for reinstalling my crusty old Nix setup anyway, so I will probably try this out soon as long as I can get the Darwin sandbox actually fixed.

@emilazy
Copy link
Member

emilazy commented Jul 3, 2024

Reposting this inline since it’s hidden in the commit comments currently:

Maybe we should make the group name _nixbld too for consistency with the users and the other system groups in the range while we’re at it? Probably doesn’t matter that much, but I seem to recall we renamed the users to get them hidden in some way so it might not hurt to follow suit with the group.

@abathur
Copy link
Member Author

abathur commented Jul 4, 2024

Latest force-push is just a rebase to see if the installer jobs will run.

@abathur
Copy link
Member Author

abathur commented Jul 5, 2024

Installer jobs did run. (For context, they were broken because GH has switched over to arm macs. The upshot is that, while the test-generated installers covered x86_64-darwin, they now cover aarch64-darwin instead. That's pretty great for our case here :)

Installer tests in my fork succeeded; individuals can test with (note that test installers aren't generated for all platforms):

sh <(curl -L https://abathur-nix-install-tests.cachix.org/serve/5ri842sqcg061q1lk2p9zki0s0q1li1r/install) --tarball-url-prefix https://abathur-nix-install-tests.cachix.org/serve

@Eveeifyeve
Copy link

@tomberek This is a pr that should fix that issue.

@emilazy
Copy link
Member

emilazy commented Jul 10, 2024

(FWIW: we learned more about Apple’s recommended UID range for role users, and Determinate Systems have adopted an approach based on that temporarily, so this probably needs a bit more research before we can commit fully to the approach. I’ve been somewhat negligent at getting around to doing VM testing but hope to get around to it soon.)

@randallb
Copy link

randallb commented Aug 2, 2024

I just set mine at 2000, does it have to be < 500? I know it'll pollute the UI.

@abathur
Copy link
Member Author

abathur commented Aug 2, 2024

I just set mine at 2000, does it have to be < 500? I know it'll pollute the UI.

@randallb We don't know. If the underlying problem documented in the issue below still exists, you may get booted into recovery mode on macOS updates.

We know moving to the role user range (200-400) fixed that problem at that time, and I am not personally aware of any user reports that explicitly attest to the absence of that problem on update with uids outside of this range.

@Eveeifyeve
Copy link

What if you do the same as linux is already doing is detect uuid/guids and find free uuid/guids that work?

@abathur abathur force-pushed the macos_sequoia_fixes branch 2 times, most recently from db37199 to 0d75be1 Compare August 15, 2024 02:08
@abathur abathur marked this pull request as ready for review August 19, 2024 13:46
@abathur abathur requested a review from edolstra as a code owner August 19, 2024 13:46
Starting in macOS 15 Sequoia, macOS daemon UIDs are encroaching on our
default UIDs of 301-332. This commit relocates our range up to avoid
clashing with the current UIDs of 301-304 and buy us a little time
while still leaving headroom for people installing more than 32 users.
@tomberek
Copy link
Contributor

tomberek commented Sep 3, 2024

@abathur @emilazy @mkenigs @cole-h I'm intending to merge this as-is. Any blockers?

I'd still like to have this be automatically applied, but I an leave that for a future PR.

@emilazy
Copy link
Member

emilazy commented Sep 3, 2024

I believe this is ready, but the backport labels down to 2.18 should be applied first, as otherwise users tracking the Nix version used by 24.05 or other intermediate versions will get broken Sequoia upgrades. Automatic migration would be nice but should be handled another time.

@abathur
Copy link
Member Author

abathur commented Sep 3, 2024

@tomberek I agree with merging as-is--no blockers from my perspective. (edit: aside from Emily's point above.)

@tomberek
Copy link
Contributor

tomberek commented Sep 4, 2024

@abathur The migration script uses PrimaryGroupID 30001. Something doesn't seem right. This used to be 30000 and is being changed to 350. Is the idea that this marks those users as migrated, or is this an off-by-one that has been gracefully handled so far?

edolstra added a commit that referenced this pull request Sep 4, 2024
…0919

install-darwin: fix _nixbld uids for macOS sequoia (backport #10919)
@abathur
Copy link
Member Author

abathur commented Sep 4, 2024

@abathur The migration script uses PrimaryGroupID 30001. Something doesn't seem right. This used to be 30000 and is being changed to 350.

I think you're right--it does look like I made a mistake.

For the record (since this is actually about the migration script added in the other PR), we're discussing the last line of the function below:

re_create_nixbld_user(){
	local name uid

	name="$1"
	uid="$2"

	sudo /usr/bin/dscl . -create "/Users/$name" "UniqueID" "$uid"
	sudo /usr/bin/dscl . -create "/Users/$name" "IsHidden" "1"
	sudo /usr/bin/dscl . -create "/Users/$name" "NFSHomeDirectory" "/var/empty"
	sudo /usr/bin/dscl . -create "/Users/$name" "RealName" "Nix build user $name"
	sudo /usr/bin/dscl . -create "/Users/$name" "UserShell" "/sbin/nologin"
	sudo /usr/bin/dscl . -create "/Users/$name" "PrimaryGroupID" "30001"
}

I'll brain-dump a little:

Our final decision was that the migration script should not affirmatively touch the group ID at all.

An early version of the migration script did affirmatively change it to 350 (via $NEW_NIX_FIRST_BUILD_UID), but someone pointed out that changing the GID of an existing install might mess up file ownership (unless we also changed the group on all of those).

Since the main reason to change the GID--making sure the nixbld group doesn't show up in Users & Groups--is something people with existing installs weren't complaining too loudly about, we decided it didn't make sense to fiddle with it during migration.

It looks like I copied in the wrong number when I ~reversed the implementation to match that decision:

https://github.com/NixOS/nix/compare/e17114e20ea15a4cc01893bb2e6b71ee660e6e0d..3d5d24b6cdda150f896e0f6a9a8fd0c4b4022a81

Is the idea that this marks those users as migrated, or is this an off-by-one that has been gracefully handled so far?

I'm not quite sure why it's working (but a few people did report it working during some basic testing), but I can think of at least two possible explanations:

  1. The PrimaryGroupID may not be meaningful here as long as there's still a GroupMembership record associating these users by username with the nixbld group (and you previously noticed that those dangling entries do still exist on these installs).
  2. Maybe dscl or macOS are cleaning up those dangling entries when the migration script re-creates the users. (In this case I imagine the re-created build users probably wouldn't actually get used for builds.)

Now that we're focused on it, though, I think the even-more-correct thing to do here is probably to look up the existing group ID (it looks like the installer uses something like dsclattr "/Users/nixbld" "PrimaryGroupID" for this task) and use that instead of hardcoding 30000.

Can't start on a PR atm, but I might be able to this evening if someone else hasn't already opened one by then.

edolstra added a commit that referenced this pull request Sep 10, 2024
…0919

install-darwin: fix _nixbld uids for macOS sequoia (backport #10919)
edolstra added a commit that referenced this pull request Sep 10, 2024
…0919

install-darwin: fix _nixbld uids for macOS sequoia (backport #10919)
edolstra added a commit that referenced this pull request Sep 10, 2024
…0919

install-darwin: fix _nixbld uids for macOS sequoia (backport #10919)
edolstra added a commit that referenced this pull request Sep 10, 2024
…0919

install-darwin: fix _nixbld uids for macOS sequoia (backport #10919)
tomberek added a commit that referenced this pull request Sep 16, 2024
…0919

install-darwin: fix _nixbld uids for macOS sequoia (backport #10919)
tomberek added a commit that referenced this pull request Sep 16, 2024
…0919

install-darwin: fix _nixbld uids for macOS sequoia (backport #10919)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.18-maintenance Automatically creates a PR against the branch backport 2.19-maintenance Automatically creates a PR against the branch backport 2.20-maintenance Automatically creates a PR against the branch backport 2.21-maintenance Automatically creates a PR against the branch backport 2.22-maintenance Automatically creates a PR against the branch backport 2.23-maintenance Automatically creates a PR against the branch backport 2.24-maintenance Automatically creates a PR against the branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants