Use load+store instead of memcpy for small integer arrays #111999

Merged (2 commits) on Jun 6, 2023
3 changes: 3 additions & 0 deletions compiler/rustc_codegen_llvm/src/type_.rs
@@ -288,6 +288,9 @@ impl<'ll, 'tcx> LayoutTypeMethods<'tcx> for CodegenCx<'ll, 'tcx> {
    fn reg_backend_type(&self, ty: &Reg) -> &'ll Type {
        ty.llvm_type(self)
    }
    fn scalar_copy_backend_type(&self, layout: TyAndLayout<'tcx>) -> Option<Self::Type> {
        layout.scalar_copy_llvm_type(self)
    }
}

impl<'ll, 'tcx> TypeMembershipMethods<'tcx> for CodegenCx<'ll, 'tcx> {
33 changes: 33 additions & 0 deletions compiler/rustc_codegen_llvm/src/type_of.rs
@@ -6,6 +6,7 @@ use rustc_middle::bug;
use rustc_middle::ty::layout::{FnAbiOf, LayoutOf, TyAndLayout};
use rustc_middle::ty::print::{with_no_trimmed_paths, with_no_visible_paths};
use rustc_middle::ty::{self, Ty, TypeVisitableExt};
use rustc_target::abi::HasDataLayout;
use rustc_target::abi::{Abi, Align, FieldsShape};
use rustc_target::abi::{Int, Pointer, F32, F64};
use rustc_target::abi::{PointeeInfo, Scalar, Size, TyAbiInterface, Variants};
@@ -192,6 +193,7 @@ pub trait LayoutLlvmExt<'tcx> {
    ) -> &'a Type;
    fn llvm_field_index<'a>(&self, cx: &CodegenCx<'a, 'tcx>, index: usize) -> u64;
    fn pointee_info_at<'a>(&self, cx: &CodegenCx<'a, 'tcx>, offset: Size) -> Option<PointeeInfo>;
    fn scalar_copy_llvm_type<'a>(&self, cx: &CodegenCx<'a, 'tcx>) -> Option<&'a Type>;
}

impl<'tcx> LayoutLlvmExt<'tcx> for TyAndLayout<'tcx> {
@@ -414,4 +416,35 @@ impl<'tcx> LayoutLlvmExt<'tcx> for TyAndLayout<'tcx> {
        cx.pointee_infos.borrow_mut().insert((self.ty, offset), result);
        result
    }

    fn scalar_copy_llvm_type<'a>(&self, cx: &CodegenCx<'a, 'tcx>) -> Option<&'a Type> {
        debug_assert!(self.is_sized());

        // FIXME: this is a fairly arbitrary choice, but 128 bits on WASM
        // (matching the 128-bit SIMD types proposal) and 256 bits on x64
        // (like AVX2 registers) seems at least like a tolerable starting point.
        let threshold = cx.data_layout().pointer_size * 4;
        if self.layout.size() > threshold {
            return None;
        }

        // Vectors, even for non-power-of-two sizes, have the same layout as
        // arrays but don't count as aggregate types
        if let FieldsShape::Array { count, .. } = self.layout.fields()
            && let element = self.field(cx, 0)
            && element.ty.is_integral()
        {
            // `cx.type_ix(bits)` is tempting here, but while that works great
            // for things that *stay* as memory-to-memory copies, it also ends
            // up suppressing vectorization as it introduces shifts when it
            // extracts all the individual values.

            let ety = element.llvm_type(cx);
            return Some(cx.type_vector(ety, *count));
        }

        // FIXME: The above only handled integer arrays; surely more things
        // would also be possible. Be careful about provenance, though!
        None
    }
}
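
To make the cutoff concrete: the sketch below mirrors the size gate above, assuming a 64-bit target where `pointer_size` is 8 bytes, so the threshold works out to 32 bytes (256 bits). It is illustrative only, not compiler code; the `swap_bytes32` test later in this PR confirms that a 32-byte array still qualifies.

fn fits_typed_copy_threshold(total_size_bytes: u64, pointer_size_bytes: u64) -> bool {
    // Mirrors `let threshold = cx.data_layout().pointer_size * 4;` and the
    // `size > threshold` early-out above: at most 4 pointers' worth of bytes.
    let threshold = pointer_size_bytes * 4;
    total_size_bytes <= threshold
}

fn main() {
    assert!(fits_typed_copy_threshold(12, 8)); // [u32; 3]: 12 bytes, copied as <3 x i32>
    assert!(fits_typed_copy_threshold(32, 8)); // [u8; 32]: exactly at the limit
    assert!(!fits_typed_copy_threshold(33, 8)); // 33 bytes: over the limit, stays a memcpy
}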
14 changes: 13 additions & 1 deletion compiler/rustc_codegen_ssa/src/base.rs
@@ -380,7 +380,19 @@ pub fn memcpy_ty<'a, 'tcx, Bx: BuilderMethods<'a, 'tcx>>(
        return;
    }

-    bx.memcpy(dst, dst_align, src, src_align, bx.cx().const_usize(size), flags);
+    if flags == MemFlags::empty()
+        && let Some(bty) = bx.cx().scalar_copy_backend_type(layout)
+    {
+        // I look forward to only supporting opaque pointers
Review thread:

Member: Maybe a FIXME so someone may remove this when LLVM only has opaque pointers? (Well, I guess removing this logic would also preclude other backends with typed pointers, too. In that case, maybe no comment at all.)

Member Author (scottmcm): Yeah, it's a bigger question, because GCC would also need to move to opaque pointers, so I'll leave it tracked by conversations like https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/llvm.20bitcasts.20in.20codegen/near/356591425 instead of by something specific in this place.

+        let pty = bx.type_ptr_to(bty);
+        let src = bx.pointercast(src, pty);
+        let dst = bx.pointercast(dst, pty);
Review thread on lines +388 to +389:

Member: When do these not match pty? Why not just return a more "accurate" type from scalar_copy_llvm_type, like the actual LLVM type that corresponds to src/dst? (Or am I misunderstanding something?)

Member Author (scottmcm), May 29, 2023: (Well, the shallow answer is that they don't match because the hook is -> Option<Type>, not should_load_store_instead_of_memcpy() -> bool.)

Because emitting these as the backend type doesn't necessarily do what we want. Notably, this PR emits the load/store as <4 x i8>, the vector type, rather than as the llvm_backend_type [4 x i8] for arrays.

I tried using the LLVM type first, but with arrays that explodes the IR: https://llvm.godbolt.org/z/vjjsdea9e

Optimizing the version that loads/stores arrays

define void @replace_short_array_using_arrays(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load [3 x i32], ptr %r, align 4
  store [3 x i32] %1, ptr %0, align 4
  %2 = load [3 x i32], ptr %v, align 4
  store [3 x i32] %2, ptr %r, align 4
  ret void
}

gives

define void @replace_short_array_using_arrays(ptr noalias nocapture noundef writeonly sret([3 x i32]) dereferenceable(12) %0, ptr noalias nocapture noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
  %.unpack = load i32, ptr %r, align 4
  %.elt1 = getelementptr inbounds [3 x i32], ptr %r, i64 0, i64 1
  %.unpack2 = load i32, ptr %.elt1, align 4
  %.elt3 = getelementptr inbounds [3 x i32], ptr %r, i64 0, i64 2
  %.unpack4 = load i32, ptr %.elt3, align 4
  store i32 %.unpack, ptr %0, align 4
  %.repack5 = getelementptr inbounds [3 x i32], ptr %0, i64 0, i64 1
  store i32 %.unpack2, ptr %.repack5, align 4
  %.repack7 = getelementptr inbounds [3 x i32], ptr %0, i64 0, i64 2
  store i32 %.unpack4, ptr %.repack7, align 4
  %.unpack9 = load i32, ptr %v, align 4
  %.elt10 = getelementptr inbounds [3 x i32], ptr %v, i64 0, i64 1
  %.unpack11 = load i32, ptr %.elt10, align 4
  %.elt12 = getelementptr inbounds [3 x i32], ptr %v, i64 0, i64 2
  %.unpack13 = load i32, ptr %.elt12, align 4
  store i32 %.unpack9, ptr %r, align 4
  store i32 %.unpack11, ptr %.elt1, align 4
  store i32 %.unpack13, ptr %.elt3, align 4
  ret void
}

whereas optimizing the version with vectors leaves the operations together

define void @replace_short_array_using_vectors(ptr noalias nocapture noundef writeonly sret([3 x i32]) dereferenceable(12) %0, ptr noalias nocapture noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}

for the backend to legalize instead.

Member Author (scottmcm): Oh, and I originally expected to do this with LLVM's arbitrary-length integer support (https://llvm.godbolt.org/z/hzG9aeqhM), like I did for comparisons back in #85828:

define void @replace_short_array_using_long_integer(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load i96, ptr %r, align 4
  store i96 %1, ptr %0, align 4
  %2 = load i96, ptr %v, align 4
  store i96 %2, ptr %r, align 4
  ret void
}

But that broke some of the autovectorization codegen tests for operations on arrays.

Member: Yeah, OK, so we explicitly want to lower these moves to vector types, specifically because they optimize better than arrays.

+        let temp = bx.load(bty, src, src_align);
+        bx.store(temp, dst, dst_align);
+    } else {
+        bx.memcpy(dst, dst_align, src, src_align, bx.cx().const_usize(size), flags);
+    }
}
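
Because review threads interleave the hunk above, here is the changed tail of memcpy_ty consolidated for reading. This is the same code as in the diff, reassembled, not a new proposal:

    if flags == MemFlags::empty()
        && let Some(bty) = bx.cx().scalar_copy_backend_type(layout)
    {
        // I look forward to only supporting opaque pointers
        let pty = bx.type_ptr_to(bty);
        let src = bx.pointercast(src, pty);
        let dst = bx.pointercast(dst, pty);

        let temp = bx.load(bty, src, src_align);
        bx.store(temp, dst, dst_align);
    } else {
        bx.memcpy(dst, dst_align, src, src_align, bx.cx().const_usize(size), flags);
    }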

pub fn codegen_instance<'a, 'tcx: 'a, Bx: BuilderMethods<'a, 'tcx>>(
22 changes: 22 additions & 0 deletions compiler/rustc_codegen_ssa/src/traits/type_.rs
@@ -126,6 +126,28 @@
        index: usize,
        immediate: bool,
    ) -> Self::Type;

    /// A type that can be used in a [`super::BuilderMethods::load`] +
    /// [`super::BuilderMethods::store`] pair to implement a *typed* copy,
    /// such as a MIR `*_0 = *_1`.
    ///
    /// It's always legal to return `None` here, as the provided impl does,
    /// in which case callers should use [`super::BuilderMethods::memcpy`]
    /// instead of the `load`+`store` pair.
    ///
    /// This can be helpful for things like arrays, where the LLVM backend type
    /// `[3 x i16]` optimizes to three separate loads and stores, but it can
    /// instead be copied via an `i48` that stays as a single `load`+`store`.
    /// (As of 2023-05, LLVM cannot necessarily optimize away a `memcpy` in
    /// these cases, due to `poison` handling, but in codegen we have more
    /// information about the type invariants, so we can emit something better
    /// instead.)
    ///
    /// This *should* return `None` for particularly large types, where leaving
    /// the `memcpy` may well be important to avoid code-size explosion.
    fn scalar_copy_backend_type(&self, layout: TyAndLayout<'tcx>) -> Option<Self::Type> {
Review thread:

Member: Having a default impl here may make it non-obvious that other backends should emit typed load/store pairs. Is it bad style to just add this to cranelift and codegen_gcc here too?

Member Author (scottmcm): Well, cg_clif doesn't use cg_ssa, so it's not impacted.

I have no practical way to test cg_gcc, being on Windows, and I don't have any information suggesting that doing something other than memcpy would actually be an improvement for it, so I figured I'd leave the default here, since it's a semantically sufficient implementation.

Notably, GCC apparently doesn't have the poison semantics that nikic mentioned as the obstacle to optimizing this better:

fn const_poison(&self, typ: Type<'gcc>) -> RValue<'gcc> {
    // No distinction between undef and poison.
    self.const_undef(typ)
}

so indeed it might just never need to do this.

        let _ = layout;
        None
    }
}

// For backends that support CFI using type membership (i.e., testing whether a given pointer is
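
As a concrete source-level illustration of the `[3 x i16]` example in the doc comment above (an illustrative snippet, not code from this PR):

// A `[u16; 3]` has LLVM backend type `[3 x i16]`. With the LLVM implementation
// of `scalar_copy_backend_type` returning the vector type `<3 x i16>`, the
// copy below is emitted as one typed load plus one typed store, rather than
// as a memcpy or three element-wise moves.
pub fn copy_rgb48(src: &[u16; 3], dst: &mut [u16; 3]) {
    *dst = *src; // a MIR `*_0 = *_1` typed copy
}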
35 changes: 35 additions & 0 deletions tests/codegen/array-codegen.rs
@@ -0,0 +1,35 @@
// compile-flags: -O -C no-prepopulate-passes
// min-llvm-version: 15.0 (for opaque pointers)

#![crate_type = "lib"]

// CHECK-LABEL: @array_load
#[no_mangle]
pub fn array_load(a: &[u8; 4]) -> [u8; 4] {
    // CHECK: %0 = alloca [4 x i8], align 1
    // CHECK: %[[TEMP1:.+]] = load <4 x i8>, ptr %a, align 1
    // CHECK: store <4 x i8> %[[TEMP1]], ptr %0, align 1
    // CHECK: %[[TEMP2:.+]] = load i32, ptr %0, align 1
    // CHECK: ret i32 %[[TEMP2]]
    *a
}

// CHECK-LABEL: @array_store
#[no_mangle]
pub fn array_store(a: [u8; 4], p: &mut [u8; 4]) {
    // CHECK: %a = alloca [4 x i8]
    // CHECK: %[[TEMP:.+]] = load <4 x i8>, ptr %a, align 1
    // CHECK-NEXT: store <4 x i8> %[[TEMP]], ptr %p, align 1
    *p = a;
}

// CHECK-LABEL: @array_copy
#[no_mangle]
pub fn array_copy(a: &[u8; 4], p: &mut [u8; 4]) {
    // CHECK: %[[LOCAL:.+]] = alloca [4 x i8], align 1
    // CHECK: %[[TEMP1:.+]] = load <4 x i8>, ptr %a, align 1
    // CHECK: store <4 x i8> %[[TEMP1]], ptr %[[LOCAL]], align 1
    // CHECK: %[[TEMP2:.+]] = load <4 x i8>, ptr %[[LOCAL]], align 1
    // CHECK: store <4 x i8> %[[TEMP2]], ptr %p, align 1
    *p = *a;
}
11 changes: 11 additions & 0 deletions tests/codegen/mem-replace-simple-type.rs
@@ -32,3 +32,14 @@ pub fn replace_ref_str<'a>(r: &mut &'a str, v: &'a str) -> &'a str {
    // CHECK: ret { ptr, i64 } %[[P2]]
    std::mem::replace(r, v)
}

#[no_mangle]
// CHECK-LABEL: @replace_short_array(
pub fn replace_short_array(r: &mut [u32; 3], v: [u32; 3]) -> [u32; 3] {
    // CHECK-NOT: alloca
    // CHECK: %[[R:.+]] = load <3 x i32>, ptr %r, align 4
    // CHECK: store <3 x i32> %[[R]], ptr %0
    // CHECK: %[[V:.+]] = load <3 x i32>, ptr %v, align 4
    // CHECK: store <3 x i32> %[[V]], ptr %r
    std::mem::replace(r, v)
}
9 changes: 9 additions & 0 deletions tests/codegen/swap-simd-types.rs
@@ -30,3 +30,12 @@ pub fn swap_m256_slice(x: &mut [__m256], y: &mut [__m256]) {
        x.swap_with_slice(y);
    }
}

// CHECK-LABEL: @swap_bytes32
#[no_mangle]
pub fn swap_bytes32(x: &mut [u8; 32], y: &mut [u8; 32]) {
    // CHECK-NOT: alloca
    // CHECK: load <32 x i8>{{.+}}align 1
    // CHECK: store <32 x i8>{{.+}}align 1
    swap(x, y)
}
25 changes: 20 additions & 5 deletions tests/codegen/swap-small-types.rs
@@ -1,4 +1,4 @@
-// compile-flags: -O
+// compile-flags: -O -Z merge-functions=disabled
// only-x86_64
// ignore-debug: the debug assertions get in the way

@@ -8,13 +8,28 @@ use std::mem::swap;

type RGB48 = [u16; 3];

+// CHECK-LABEL: @swap_rgb48_manually(
+#[no_mangle]
+pub fn swap_rgb48_manually(x: &mut RGB48, y: &mut RGB48) {
+    // CHECK-NOT: alloca
+    // CHECK: %[[TEMP0:.+]] = load <3 x i16>, ptr %x, align 2
+    // CHECK: %[[TEMP1:.+]] = load <3 x i16>, ptr %y, align 2
+    // CHECK: store <3 x i16> %[[TEMP1]], ptr %x, align 2
+    // CHECK: store <3 x i16> %[[TEMP0]], ptr %y, align 2
+
+    let temp = *x;
+    *x = *y;
+    *y = temp;
+}

// CHECK-LABEL: @swap_rgb48
#[no_mangle]
pub fn swap_rgb48(x: &mut RGB48, y: &mut RGB48) {
-    // FIXME MIR inlining messes up LLVM optimizations.
-    // WOULD-CHECK-NOT: alloca
-    // WOULD-CHECK: load i48
-    // WOULD-CHECK: store i48
+    // CHECK-NOT: alloca
+    // CHECK: load <3 x i16>
+    // CHECK: load <3 x i16>
+    // CHECK: store <3 x i16>
+    // CHECK: store <3 x i16>
    swap(x, y)
}
