MTLCreateSystemDefaultDevice returns nil #1779
Comments
Hey, @drewcrawford |
In FB8904929 we noticed that metal::pow often returns the wrong value on macOS. This causes incorrect results for some operations like BCCubicSplit (and left/right variants). It could theoretically affect BCCubicEvaluate, BCCubicEvaluatePrime and BCNormalization as well but it seems a lot less likely for those functions. This may be related to use of negative base and odd exponent, which ought to be well-defined for metal::pow per Metal Shading Language Specification Table 6.4, but in practice behaves like UB with different results on different GPUs. Seems to be quite finicky about whether the operands are statically known or the dataflow of the arguments. The fix is to use explicit multiplication where possible (e.g. integer exponent) and where we think the base might be negative. Also add test coverage. This test coverage is known to trip on some AMD and Intel systems. However some GPUs are known to pass the test even though they experience the issue, so it’s not total. Due to actions/runner-images#1779, there is no CI coverage for this issue. See also FB8904929, mt2-109.
@LeonidLapshin Will it be reasonable to allow Metal support while introducing the M1 runners |
Description
Inside a GitHub-hosted runner, calls to the macOS API MTLCreateSystemDefaultDevice return nil. This prevents use of Metal, is not generally anticipated to happen on macOS, and can break arbitrary software, which is more likely to occur over time. This appears to be caused by GPU configuration in the guest environment.
Area for Triage:
Apple
Question, Bug, or Feature?:
?
Virtual environments affected
Expected behavior
MTLCreateSystemDefaultDevice() should return a non-nil value
Actual behavior
MTLCreateSystemDefaultDevice() returns nil
Repro steps
In the linked action run, this API is called in both the macOS and iOS Simulator environments.
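The linked run is not reproduced here. As a rough sketch of what such a check might look like, the Swift script below reports what Metal exposes on the current machine. The helper name metalDiagnostic is hypothetical and exists only for this illustration; MTLCreateSystemDefaultDevice and MTLCopyAllDevices are the real Metal entry points, and the canImport guard is only there so the file still compiles where the framework is missing.

```swift
#if canImport(Metal)
import Metal
#endif

/// Hypothetical helper for this sketch: summarizes what Metal reports.
func metalDiagnostic() -> [String] {
    var lines: [String] = []
    #if canImport(Metal)
    // The call discussed in this issue; nil means no usable default GPU.
    if let device = MTLCreateSystemDefaultDevice() {
        lines.append("default device: \(device.name)")
    } else {
        lines.append("default device: nil")
    }
    #if os(macOS)
    // MTLCopyAllDevices lists every Metal device the system exposes.
    lines.append("device count: \(MTLCopyAllDevices().count)")
    #endif
    #else
    lines.append("Metal framework not available on this platform")
    #endif
    return lines
}

for line in metalDiagnostic() {
    print(line)
}
```

On an affected runner, the first line would be expected to read "default device: nil", matching the actual behavior described above.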
What is this API?
This API is a chokepoint for use of Metal, the only non-deprecated graphics library on macOS. In addition, Metal is a general-purpose computing language that may be doing the heavy lifting when you call some other system API. It's increasingly likely over time that some software you use or test in a CI environment on Apple platforms is trying to do this.
What is the significance of the current behavior?
Errors related to this appear in other reports, so I wonder if other macOS issues are related to this issue.
It is generally imagined that this API returning nil is not really possible on modern macOS. A brief survey of usage on GitHub supports this view, the predominant pattern being force-unwrapping the API (!), which crashes in a virtual environment. A minority of results generate a soft error, and I wasn't immediately able to turn up any examples that would function correctly in a GitHub runner.
Developers assume it works because a GPU supporting Metal has been a minimum system requirement for macOS since 10.14, and iOS for even longer. So this API working (e.g., slowly with integrated graphics) is imagined to be part of the macOS 10.14+ platform, rather than a question of availability on specific hardware. This is a very different expectation than Windows/Linux.
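For contrast with the force-unwrap pattern, here is a minimal sketch of code that treats nil as a soft failure and skips GPU work instead of crashing. The GPUAvailability type and the probeDefaultDevice and describe functions are hypothetical names invented for this illustration, not part of Metal; only MTLCreateSystemDefaultDevice is the real API.

```swift
#if canImport(Metal)
import Metal
#endif

/// Hypothetical result type for this sketch; it is not part of Metal.
enum GPUAvailability: Equatable {
    case available(name: String)
    case unavailable
}

/// Probes for the default Metal device without force-unwrapping.
func probeDefaultDevice() -> GPUAvailability {
    #if canImport(Metal)
    // Force-unwrapping here (MTLCreateSystemDefaultDevice()!) would trap on nil.
    guard let device = MTLCreateSystemDefaultDevice() else {
        return .unavailable
    }
    return .available(name: device.name)
    #else
    return .unavailable
    #endif
}

/// Turns the probe result into a message a test harness could log.
func describe(_ availability: GPUAvailability) -> String {
    switch availability {
    case .available(let name):
        return "Metal device: \(name)"
    case .unavailable:
        return "No Metal device; skipping GPU-dependent work"
    }
}

print(describe(probeDefaultDevice()))
```

This only avoids the crash. Code that genuinely needs Metal still cannot run or be tested in such an environment, which is the point of this issue.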
I asked someone with knowledge of the implementation for this API if there is any reason a developer today ought to handle a nil response, and they suggested nil probably indicates a serious OS fault, so not really.
Isn't there a software fallback for this?
Not for Metal itself. Codebases that predate widespread Metal availability may have kept around their old codepath which incidentally supported a fallback. These are increasingly not maintained or actively developed, and so if they exist they usually aren't the priority for running or testing/CI workflows.
Roblox recently wrote that
Of course, new code written today is likely to skip this entirely and assume Metal is available.
What can be done about this?
The method I'm aware of is to pass the host GPU through to the guest environment. I don't know if this can be done for multiple guests or would be sensible in GitHub's environment (I'm guessing not).
For virtualizing macOS 11, Apple is forcing a new set of low-level APIs. Some VMware products have experimental support for using these APIs to paravirtualize the host GPU into the guest environment, which fixes this issue. So it seems like the situation for macOS 11 will be better, but it might require additional or experimental config to make it work.