Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement remaining __clz*i2 intrinsics #639

Merged
merged 2 commits into from
Jul 24, 2024
Merged

Implement remaining __clz*i2 intrinsics #639

merged 2 commits into from
Jul 24, 2024

Conversation

tea
Copy link
Contributor

@tea tea commented Jul 6, 2024

This one was a bit interesting to figure out.

Firstly, the existing implementation was wrong, as it was working with whatever native word is, whereas __clzsi2 should be 32-bit specifically (SI is 32-bit exactly). compiler-rt implementation is always 32-bit. Interestingly, existing CI tests didn't catch this because testcrate only tests against compiler-rt when compiled with --no-default-features -F c (because testcrate defaults mangled-names). When tested in this way, testcreate surely enough failed __clzsi2 test on 64 bit target.

Even more interestingly, the compiler-rt implementation is also wrong! It is 32-bit implementation but it uses int type for its argument, so on 16-bit platform it couldn't get 32-bit value. I guess compiler-rt is not used much for 16 bit?

Now, libgcc implementation is the correct one. It always uses 16 bit types for HI functions, 32 bit types for SI, 64 bit for DI and 128 bit for TI. But it has its own twist: it only implements two functions out of all possible - one for native word size and one for double that. So on 16 bit it would be HI and SI, on 32 bit SI/DI and on 64 bit DI/TI. Which is why buggy existing implementation didn't caused any dramas: it would never be called as the compilers know it shouldn't even exist (indeed, even trying hard to call it via __builtin_clz(uint32_t) results in something akin to _clzdi2(x as u64) - 32).

So I did the same thing as with bswap - simply relegate work to the LLVM and let it spit out proper builtin implementation. i686 and x86-64 versions look great using bsr. Riscv64... cannot say if it became better or worse than it was; its a different algorithm, though both are branchless and logarithmic in complexity. One would need to benchmark it like crazy to decide which is better :)

I like new version better because of two things:

  1. it could get better "for free" as llvm improves (definitely more eyes on llvm builtins than on compiler-builtins repo).
  2. it could do much better if allowed (e.g. when there is a native cpu instruction to do it; Riscv64gc+Zbb functions const of two instructions).

With this implementation tests are not quite as useful; they can only work as a documentation of sorts since they check the same implementation against itself. Well, they also can check against compiler-rt so there's that.

@Amanieu
Copy link
Member

Amanieu commented Jul 6, 2024

Unfortunately that won't work here because LLVM does in fact use this builtin, for example on the armv4t-unknown-linux-gnueabi target.

it could do much better if allowed (e.g. when there is a native cpu instruction to do it; Riscv64gc+Zbb functions const of two instructions).

In that case the implementation in compiler-builtins won't be called anyways since LLVM/GCC will just use the Zbb instruction directly.

@tgross35
Copy link
Contributor

tgross35 commented Jul 6, 2024

Possible generic version:

fn leading_zeros<I: Int>(mut x: I) -> usize {
    let mut z = I::BITS;
    let mut t: I;
    let mut shift = I::BITS / 2;

    while shift > 1 {
        t = x >> shift;
        if t != I::ZERO {
            z -= shift;
            x = t;
        }

        shift /= 2;
    }

    t = x >> 1;
    let res = if t != I::ZERO { z - 2 } else { z - x.cast() };
    res as usize
}

LLVM correctly unrolls the loop to get the same result as the existing usize-only version. Which means it doesn't make use of e.g. bsr on x86, but I think we'd have to handwrite those if we want them emitted reliably (and not much purpose since LLVM knows that lowering on those platforms).

Comment on lines 80 to 131
#[cfg(any(target_pointer_width = "32", target_pointer_width = "64"))]
{
use compiler_builtins::int::leading_zeros::__clzdi2;
fuzz(N, |x: u64| {
if x == 0 {
return; // undefined value for an intrinsic
}
let lz = x.leading_zeros() as usize;
let lz0 = __clzdi2(x);
if lz0 != lz {
panic!("__clzdi2({}): std: {}, builtins: {}", x, lz, lz0);
}
});
}
#[cfg(target_pointer_width = "64")]
{
use compiler_builtins::int::leading_zeros::__clzti2;
fuzz(N, |x: u128| {
if x == 0 {
return; // undefined value for an intrinsic
}
let lz = x.leading_zeros() as usize;
let lz0 = __clzti2(x);
if lz0 != lz {
panic!("__clzti2({}): std: {}, builtins: {}", x, lz, lz0);
}
});
}
Copy link
Contributor

@tgross35 tgross35 Jul 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is happening on 16 & 32-bit platforms for the gating - is LLVM lowering .leading_zeros() to this missing symbol? From CI it looks like MSVC is missing ___clzti2 on arm and 32-bit so you'll have to update the gates.

If this logic is a bit more complex than just pointer sizes, it might be better to add a NoSysU128Clz option to

enum Feature {
NoSysF128,
NoSysF128IntConvert,
NoSysF16,
NoSysF16F128Convert,
}
, update that file, and then gate on that (easier if the gates need to be reused).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gated these, and the intrinsics themselves, the same way the compilers do (well, libgcc does so they have to follow along?). Makes no sense to test for intrinsics which aren't ever used.

As for 128 bit on 64-bit msvc - I guess they don't have 128 bit int somehow, according to compiler-rt. I just took "optimized c" attribute off the implementation instead of trying to gate it. Single branch isn't that bad.

Copy link
Contributor

@tgross35 tgross35 Jul 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed that the gates were on the intrinsics as well. Could you please remove them? The reasons that libgcc doesn't supply a specific intrinsic on specific platforms may vary (no integers of that type, GCC itself doesn't use them, platform bugs, untested, just not done yet etc). But if we don't have any specific reason not to make them available and there’s a chance that the compilers make use of them, we should just provide them.

Also a lot of the purpose of this library is to fill in for the shortcomings of libgcc, so it doesn't make any sense to copy their limitations :).

well, libgcc does so they have to follow along?

Unfortunately this isn't very accurate, LLVM (and GCC) both pretty frequently emit symbols that aren't available on lesser-known platforms. Rust hits this more often because we make less common types available on all platforms (e.g. u128, f16, f128) where there might not be a C equivalent available.

Not like this specific symbol is likely to be a problem anywhere. But we should avoid unnecessary cfg (unless it's a arch-specific symbol) because it just means somebody might have to chase them down in the future and wonder why it's disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed all the gate. Let the flood begin!

As for libgcc and their builtins, the reason there is simple: that's how they do it. Basically they have two and only two functions __clzSI2 and __clzDI2 (note uppercase letters). Despite their names, they operate on native word size and double that respectively. These function names (and argument types) get macroed into whatever proper name and argument type is for the platform.

Same is true for many other builtins there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change. I think you might be able to remove some C sources that we are building (mentioned below) but otherwise the implementation looks good to me now.

src/int/leading_zeros.rs Outdated Show resolved Hide resolved
Comment on lines +25 to +26
const { assert!(T::BITS <= 64) };
if T::BITS >= 64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit: the second condition could become == 64 since > 64 is rejected right above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather leave this as is, as it is more future-proof (in case any 128 bit implementations would actually pop up, and whoever applies 128 bit part happens to be both incredibly careless and doesn't test anything...)

More seriously though, result of the expression is a compile-time constant so it doesn't matter at all which operator to use here, and >= is more consistent with 32 bit part down below.

@@ -382,7 +383,6 @@ mod c {
sources.extend(&[
("__absvti2", "absvti2.c"),
("__addvti3", "addvti3.c"),
("__clzti2", "clzti2.c"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could also remove __clzdi2 and __clzsi2 on lines 349-350 now that these aren't broken. Possibly

sources.extend(&[("__clzdi2", "clzdi2.c"), ("__clzsi2", "clzsi2.c")])
too if our implementation works there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure what's the policy on these. Is it "only apply C replacements for broken/missing functions"?

Copy link
Contributor

@tgross35 tgross35 Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is accurate, part of the goal of this crate is to be able to target some platforms without needing a C toolchain. I think that usually once something gets ported over and tested, it can be removed from the C sources lists.

I guess maybe we only used the C implementations on thumb because ours was broken?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Aarch64 CI is down because of new rustc I guess (0/1 asm labels are now off limits instead of just discouraged). This particular thing about numeric labels amazes me so much... This is due to llvm bug, but instead of dealing with that people decided to document it in rust book (so I guess it is a feature now) and make it a compile-time error.

Copy link
Contributor

@tgross35 tgross35 Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's an easy llvm bug to fix unfortunately. That lint should almost certainly be x86-only though, where LLVM was already throwing an error and the lint just makes it a more accurate error. I brought it up.

You can add

#![allow(binary_asm_labels)] // aarch64 has no problems with binary labels

to the top of src/aarch64_linux.rs just to get that to pass.

Copy link
Contributor

@tgross35 tgross35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, thanks. @Amanieu still needs to do the final review. The CI error obviously isn't your problem.

@tgross35
Copy link
Contributor

Fix coming for the lint btw, probably will be on tomorrow's nightly. Just rebase once it's available rust-lang/rust#127935

@Amanieu Amanieu enabled auto-merge July 24, 2024 11:50
@Amanieu Amanieu merged commit b78d0f1 into rust-lang:master Jul 24, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants