Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

32 bit targets in Cranelift #5572

Closed
Artentus opened this issue Jan 13, 2023 · 38 comments
Closed

32 bit targets in Cranelift #5572

Artentus opened this issue Jan 13, 2023 · 38 comments

Comments

@Artentus
Copy link

Hello,

I am currently trying to write a Cranelift backend that targets a 32 bit architecture but am encountering some issues.
Since all the existing backends are 64 bit am I not sure whether this is due to missing support or my lack of understanding.

For example, codegen::machinst::valueregs::ValueRegs states in its description we cap the capacity of this at four (when any 32-bit target is enabled) or two (otherwise), however the capacity is hardcoded to 2.

Also, inside the codegen::machinst::isle::isle_lower_prelude_methods macro, the function temp_writable_reg assumes only a single register is returned. However, this function is called in IsleContext<'_, '_, MInst, XYZBackend>::imm, which takes 64 bit values, so on a 32 bit target two registers are required.

In general, the Context trait generated by ISLE requires 64 bit quantities in most places. How is one expected to implement these for a 32 bit target? Do you invent non-existing instructions that operate on 64 bit and then lower them into equivalent real instructions?

@jameysharp
Copy link
Contributor

Cool, it's nice to hear from people working on more Cranelift backends! I'm curious what architecture you're targeting; glancing at your GitHub, I'm guessing it's your A32 RISC ISA? I'd also love to hear some about your goal: Do you want to run Wasmtime on this target, or compile Rust to it with cg-clif, or something else?

As you noticed, we don't currently have backends implemented for any 32-bit targets, so I'm not particularly surprised that you're running into situations where we didn't quite think through the implications for a 32-bit target.

There are existing cases where backends use what we refer to as "pseudo-instructions", which are what I think you're describing as "non-existing instructions". You can find a bunch of examples with git grep -E 'pseudo-?inst'; we haven't been super consistent about spelling apparently. Maybe there are cases where that would help with 64-bit assumptions, but off-hand my guess is there will be few if any of those cases.

For the most part I think you'll just need to do for 64-bit types what the existing backends do for 128-bit types: split the value across two registers and figure out how to combine register pairs as needed. For some relatively simple examples, look at how various backends implement (lower (has_type $I128 (iadd x y))).

The only case where you should need four registers in ValueRegs is if you're going to support 128-bit integers with 32-bit registers, and we're very inconsistent about I128 support across the existing backends anyway. So I'd just skip that for now. Once you have a proof of concept working we can figure out how to handle issues like ValueRegs being too small. It's a little tricky because I'm guessing making it bigger will have a measurable impact on compile time, and also because we'll need to do some API design work.

The only fn imm I see that takes a 64-bit value and calls temp_writable_reg is in the riscv64 backend. You don't have to implement that. It was apparently convenient for that backend but you can do something different if you need to.

I hope this helps give you a starting point. We're happy to answer questions if we can!

@Artentus
Copy link
Author

Hey, thanks for that detailed answer.

Yes you are right, I'm doing this for academic purposes atm. The ISA I'm targeting is similar to RV32I in a lot of ways so if you don't wanna read through it just pretend its RV32I. This is also the reason why I mostly used the RiscV backend as a template.

My goal is to compile Rust. I messed around with LLVM but writing a backend for that turned out to be really difficult. And since I know my way around Rust a lot better than C++ I decided to give Cranelift a go. Since this is mostly about the learning experience I don't really care that the generated code is not quite on par with LLVM.

Alright so I will ignore i128s for now. I guess as long as I don't use them in the code I want to compile they should never be emitted right? Same with floats probably.

I'm gonna read through the other backedns ISLE code and see how it's being done there. I'm still finding my way around ISLE.

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 13, 2023

If you want the rust standard library to compile without patches you need both i128 and floats. It is possible to patch them out though with a bit of work.

@fitzgen
Copy link
Member

fitzgen commented Jan 14, 2023

I’d suggest looking at the aarch64 and x64 backends when looking for inspiration.

@Artentus
Copy link
Author

Ok, I got a very simple function to correctly compile now.
Before I continue I wanna ask: are there any tests I should modify/add specifically for the backend?
The other backends contain very little testing directly so I guess it happens somewhere else.

@fitzgen
Copy link
Member

fitzgen commented Jan 15, 2023

Nice! Most testing is done via file tests (‘cranelift/filetests/filetests/isa/*’), fuzzing, and wasmtime exercising the codegen.

@Artentus
Copy link
Author

Ah thanks, that's what I was looking for.

@Artentus
Copy link
Author

I have implemented all of the arithmetic and bitwise operations now, except for division and remainder.
The architecture doesn't have division or remainder instructions so they have to be implemented algorithmically.
However emitting the algorithm inline does not seem like a wise choice, instead I want to generate a call to some function (I believe in LLVM the function in question would be called divsi3).

But since none of the other backends do this I don't know how to go about it. I can manage to emit a call but where does the function come from, how do I define it?
Ideally the function would be defined in target-specific code to hand-optimize it.

@cfallin
Copy link
Member

cfallin commented Jan 19, 2023

@Artentus we have a mechanism for this with LibCalls; we do have a few such calls in some of the instruction lowerings.

@Artentus
Copy link
Author

I saw that in the memcpy lowering but couldn't figure out how to actually define new ones.

@cfallin
Copy link
Member

cfallin commented Jan 19, 2023

The enum is defined here. If you grep for one of the existing ones, e.g. FloorF32, you'll see the places in the code where it's used and how the relocation can be handled to link to an actual function (e.g. in Wasmtime here to this function).

@cfallin
Copy link
Member

cfallin commented Jan 19, 2023

(Forgot to finish: then you can follow that example to define a new one; also in general, when adding a new option to an enum, the compile errors will usually lead you to where you need to make changes.)

@Artentus
Copy link
Author

Thanks!

However I assume this only works within the runtime correct?
Does that mean if I want to compile Rust using the rustc cranelift backend I have to supply these through the linker?

@cfallin
Copy link
Member

cfallin commented Jan 19, 2023

Yes, you would need to somehow provide implementations of the libcalls that the relocation resolution can find. cc @bjorn3 for the best way to do that...

@afonso360
Copy link
Contributor

afonso360 commented Jan 19, 2023

👋 Hey, See #4460 for an example on how to define a new LibCall. Mostly once you provide the correct name in the module crate, the linker can usually figure it out automatically.

That PR also does some setup groundwork to emit libcall's from isle that you shouldn't need to repeat.

@Artentus
Copy link
Author

Yes I mostly copied the code from x64 emit_vm_call and it's correctly generating these calls now.
My main problem is how to get linkable object files that contain these functions, since there is no existing compilers for this target.

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 19, 2023

You could add it to the compiler-builtins rust crate I think. This crate is included in everything rustc compiles.

@Artentus
Copy link
Author

I am currently working on lowering for loads and stores, and there are these two ir instructions I'm unsure about: uload32 and sload32.
On a 32 bit architecture these are both functionally identical to just load.i32 but I'm wondering if the compiler will still try to emit these.

I tried experimenting and defined this test function:

function %uload32(i32) -> i32 {
block0(v0: i32):
  v1 = uload32.i32 v0+42
  return v1
}

But this yields the following error: inst1 (return v1): arg 0 (v1) has type i64, must match function signature of i32

So do I just completely ignore these?

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 21, 2023

uload and sload generally return a 64bit value by zero/sign extending a 32bit loaded value.

@Artentus
Copy link
Author

Well, uload8, sload8, uload16 and sload16 do work fine with 32 bit. This code compiles perfectly fine:

function %uload8(i32) -> i32 {
block0(v0: i32):
  v1 = uload8.i32 v0+42
  return v1
}

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 21, 2023

Looks like uload* and sload* have a constraint that their output size is bigger than what they load (up to 64bit). So for example uload32 will always return an i64 and uload16 an i32 or i64.

TypeSetBuilder::new().ints(64..64).build(),

@Artentus
Copy link
Author

Ah, I guess that makes sense. So I should probably implement all of them for both 32 bit and 64 bit right?

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 22, 2023

uload8 and sload8 should he implemented for i16, i32 and i64 (i16 can be skipped for wasm I think, uload16 and sload16 should be implemented for i32 and i64. Uload32 and sload32 should be implemented for i64.

@Artentus
Copy link
Author

If you want the rust standard library to compile without patches you need both i128 and floats.

Unfortunately it seems like this is true even for the core library. I tried to compile it but am gettings loads of errors related to invalid SSA types (f32 and i128).

So how do I go about patching that stuff out?

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 27, 2023

I know there are usages in core::time and core::num. Try commenting them out and see what breaks. Then comment everything out that depends on those impls and repeat. You might take some inspiration from the patch file bjorn3/rustc_codegen_cranelift@425b527 introduced to remove i128 usage from libcore. It doesn't apply to the latest version anymore, but it may be useful as guide.

@Artentus
Copy link
Author

Thank you very much, I actually managed to have it compile the (modified) core library without errors.

However I must say it is very awkward having to inject a custom core library.
I think I will actually do the work and add libcalls for all the float instructions. It's tedious but not difficult.

128 bit integers on the other hand do pose a problem, I'm not sure what to do about that.

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 29, 2023

For rustc_codegen_cranelift the patches directory contains various kinds of patches. Those with -sysroot- are applied to the standard library. You can commit your changes, do git format-patch HEAD~..HEAD (or if you have multiple commits use the appropriate amount of ~) and then move the generated patch to the patches directory and rename it to include -sysroot-. After that running ./y.rs prepare again will apply the patch to the standard library sources that ./y.rs build uses. If you need to only build libcore and not the rest of the standard library, you can edit build_sysroot/Cargo.toml to remove the dependencies on the rest. (you will likely need to keep compiler_builtins) Note that you will have to do ./y.rs prepare again after editing it as it is copied elsewhere. (I plan to make this automatic, but haven't gotten to it yet.)

@Artentus
Copy link
Author

I think adding support for 128 bit integers in the backend is the better solution than patching the core library.
Modifying Cranelift to support 4 register values wasn't actually all that difficult.

@Artentus
Copy link
Author

I'm successfully compiling code now, including the entire Rust core and alloc libraries.

However after inspecting the generated code I am seeing some very weird things being generated.
To be clear, I never expected Cranelift to generate code that is optimized to the same degree as LLVM, it clearly states so after all.
But I did not expect it to do such hilariously bad things.

I compiled this very basic hello world main function:

mod io {
    pub fn serial_write(s: &str) {
        // ...
    }
}

fn main() {
    io::serial_write("Hello World!");
}

And this is the generated assembly (manually commented):

;; allocate stack memory
sub sp, sp, 32

;; load address of "Hello World!" string (0 because pre-relocation)
ldui t7, 0
or a0, t7, 0

;; store address of "Hello World!" string at sp+0
add a1, sp, 0
st [a1], a0

;; store length of "Hello World!" string (12 bytes) at sp+4
add a1, sp, 4
ldi a2, 12
st [a1], a2

;; load address from sp+0
add a2, sp, 0
ld a2, [a2]

;; load length from sp+4
add a3, sp, 4
ld a3, [a3]

;; store address at sp+16
add a4, sp, 16
st [a4], a2

;; store length at sp+20
add a4, sp, 20
st [a4], a3

;; load address from sp+16
add a4, sp, 16
ld a0, [a4]

;; load length from sp+20
add a4, sp, 20
ld a1, [a4]

;; call io::serial_write (0 because pre-relocation)
ldui t7, 0
or a4, t7, 0
jl ra, a4

;; free stack memory
add sp, sp, 32

As you can see what should be loading two simple arguments into a0 and a1 registers is actually round tripping two times through memory for absolutely no reason.
Also it is not making use of offsetting memory instructions at all, instead emitting an add+load/store.

@cfallin
Copy link
Member

cfallin commented Jan 31, 2023

@Artentus I wonder if it would make sense to move some of this discussion to a Zulip thread? We're interested in helping you support your ISA, but normally we like to use GitHub issues more for targeted single issues.

In any case, it's hard to say much without actually playing with your new backend. Are you enabling optimizations? How did you implement the ABI? What does the CLIF look like? The redundant ops could be coming from any one of several layers.

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 31, 2023

I'm successfully compiling code now, including the entire Rust core and alloc libraries.

🎉

However after inspecting the generated code I am seeing some very weird things being generated.

Yeah, rustc generates terrible MIR with a lot of copies in it. I try to remove some of the copies in cg_clif, but many are passed through to Cranelift. Cranelift doesn't have the optimization passes necessary to remove them. If you enable optimizations for cg_clif you this will enable some optimizations working on MIR which may tidy it up a bit.

@Artentus
Copy link
Author

@cfallin got it. I have just never used Zulip in my life so I have to sign up first.

@bjorn3 I'm already compiling the compiler itself in release mode if that's what you mean. Test program was also compiled in release mode.

@bjorn3
Copy link
Contributor

bjorn3 commented Feb 1, 2023

I just checked it and this is indeed caused by the MIR containing copies:

fn main() -> () {
    let mut _0: ();                      // return place in scope 0 at /app/example.rs:7:15: 7:15
    let _1: ();                          // in scope 0 at /app/example.rs:8:5: 8:37
    let mut _2: &str;                    // in scope 0 at /app/example.rs:8:22: 8:36
    let _3: &str;                        // in scope 0 at /app/example.rs:8:22: 8:36

    bb0: {
        _3 = const "Hello World!";       // scope 0 at /app/example.rs:8:22: 8:36
        _2 = _3;                         // scope 0 at /app/example.rs:8:22: 8:36
        _1 = serial_write(move _2) -> bb1; // scope 0 at /app/example.rs:8:5: 8:37
    }

    bb1: {
        return;                          // scope 0 at /app/example.rs:9:2: 9:2
    }
}

The first store is _3 = const "Hello World!";, the first load and store is _2 = _3; and the last load is the move _2 argument. Compiling with -Zmir-opt-level=3 optimizes away the _2 = _3.

This seems to be a case where cg_clif itself should be able to avoid stack allocation, but for some reason it doesn't. I'm currently investigating.

@bjorn3
Copy link
Contributor

bjorn3 commented Feb 1, 2023

Found it. This was a regression from bjorn3/rustc_codegen_cranelift@777d473. I've fixed it in bjorn3/rustc_codegen_cranelift@df04fd6 for a 12% runtime perf win on a benchmark I use. Thanks for pointing this out!

@Artentus
Copy link
Author

Artentus commented Feb 2, 2023

That's great news! 12% from such a small change is huge.

This patch did get rid of one of the unnecessary copies.
I believe the second one is because of a transmute from &str to &[u8] which Cranelift can't optimize out. But that's fine.

I also figured out how to optimize those add+load/store sequences into a single instruction.
Just needed a rule to cover stack_addr followed by load or store IR instructions.

Since everything is working I believe I can now close this. Thanks to everyone who helped me with this, I got way more and way faster responses than I ever expected. The codebase was also very pleasant to work with, especially compared to the mess that is LLVM.

If you're interested I can PR the changes that are required to fully support 32 bit targets in cranelift_codegen.
It's very little code, essentially just extending ValueRegs to size of 4. Easily feature gated as well so if no 32 bit targets are enabled there won't be any impact.

@Artentus Artentus closed this as completed Feb 2, 2023
@cfallin
Copy link
Member

cfallin commented Feb 2, 2023

If you're interested I can PR the changes that are required to fully support 32 bit targets in cranelift_codegen.
It's very little code, essentially just extending ValueRegs to size of 4. Easily feature gated as well so if no 32 bit targets are enabled there won't be any impact.

We actually had exactly this code before, when we had a (partial) arm32 backend in-tree! I think it's probably best to wait until we add a new target to bring this back, as we are also considering other ways now of supporting wide values (namely: legalization/narrowing rules in the mid-end, as Cranelift used to do actually) that might be simpler in the end.

I'm happy you got this working though, it's a really neat demo of portability! And if you want to contribute any documentation regarding rough edges or confusing bits you hit, that might also be a good outcome from this :-)

@Artentus
Copy link
Author

Artentus commented Feb 2, 2023

A pice of documentation I would have loved to have is a listing of all CLIF instructions, just to know what needs to be implemented instead of trying to compile something and then reading error messages about missing lowering rules.
Some of the instructions also have quirks that weren't immediately obvious to me.

If you tell me where to put it I can start working on it.

@cfallin
Copy link
Member

cfallin commented Feb 2, 2023

A pice of documentation I would have loved to have is a listing of all CLIF instructions, just to know what needs to be implemented instead of trying to compile something and then reading error messages about missing lowering rules. Some of the instructions also have quirks that weren't immediately obvious to me.

If you tell me where to put it I can start working on it.

The closest we have to that is the InstBuilder trait docs that are generated from the source (that ultimately come from the doc-strings in cranelift/codegen/meta/shared/src/instructions.rs). Any updates to those descriptions that would have been helpful are very welcome contributions!

We have a cranelift/docs/ directory as well that could use some attention in general; if you come up with any more general thoughts from your experience please do feel free to start a porting.md or similar! It's on our roadmap for this year to update docs and a porting-to-new-ISA guide is something we do want to build out, eventually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants