Golang ppc64x asm Reference

Sun Yimin edited this page Oct 8, 2024 · 16 revisions

Reference

Vector instructions

Arithmetic: addition and subtraction

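Addition

  • VADDCUQ: corresponds to s390x VACCQ; computes the carry out of a 128-bit addition.
  • VADDUQM: corresponds to s390x VAQ; computes the low 128 bits of the sum of two numbers.
  • VADDECUQ: corresponds to s390x VACCCQ; add with carry-in, computes the carry out.
  • VADDEUQM: corresponds to s390x VACQ; add with carry-in, computes the low 128 bits of the sum of the two numbers and the carry.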

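(The Q in the mnemonic stands for the operand width: quadword, 128 bits.)
So adding two wide numbers takes several of these instructions together. The example below computes T2||T1||T0 = T1||T0 + RED2||RED1.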

	VADDCUQ  T0, RED1, CAR1       // VACCQ  T0, RED1, CAR1
	VADDUQM  T0, RED1, T0         // VAQ    T0, RED1, T0
	VADDECUQ T1, RED2, CAR1, CAR2 // VACCCQ T1, RED2, CAR1, CAR2
	VADDEUQM T1, RED2, CAR1, T1   // VACQ   T1, RED2, CAR1, T1
	VADDUQM  T2, CAR2, T2         // VAQ    T2, CAR2, T2
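The carry chain above can be modelled in plain Go to check the arithmetic. This is only a sketch: the `u128` type and the `addQ` and `add256` helpers are hypothetical names introduced here, and each `addQ` call stands in for one carry/sum instruction pair rather than for real vector code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// u128 is a hypothetical stand-in for one 128-bit vector register.
type u128 struct{ hi, lo uint64 }

// addQ returns the low 128 bits of a+b+cin and the carry out (0 or 1).
// With cin = 0 it models the VADDUQM / VADDCUQ pair,
// with cin = carry it models the VADDEUQM / VADDECUQ pair.
func addQ(a, b u128, cin uint64) (sum u128, cout uint64) {
	var c uint64
	sum.lo, c = bits.Add64(a.lo, b.lo, cin)
	sum.hi, cout = bits.Add64(a.hi, b.hi, c)
	return sum, cout
}

// add256 mirrors the five-instruction sequence:
// T1||T0 + RED2||RED1, with the final carry added into T2.
func add256(t2, t1, t0, red2, red1 u128) (r2, r1, r0 u128) {
	var car1, car2 uint64
	r0, car1 = addQ(t0, red1, 0)       // VADDCUQ + VADDUQM
	r1, car2 = addQ(t1, red2, car1)    // VADDECUQ + VADDEUQM
	r2, _ = addQ(t2, u128{0, car2}, 0) // VADDUQM T2, CAR2, T2
	return r2, r1, r0
}

func main() {
	allOnes := u128{^uint64(0), ^uint64(0)}
	// (2^256 - 1) + 1 should carry all the way into T2.
	r2, r1, r0 := add256(u128{}, allOnes, allOnes, u128{}, u128{0, 1})
	fmt.Printf("T2=%+v T1=%+v T0=%+v\n", r2, r1, r0)
}
```

The point of the model is that the carry and the sum are computed by separate instructions on ppc64x, so every 128-bit limb costs two instructions.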

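Subtraction

  • VSUBCUQ: corresponds to s390x VSCBIQ; computes the borrow indication. The result is expressed as a carry bit: 1 means no borrow occurred, 0 means a borrow occurred.
  • VSUBUQM: corresponds to s390x VSQ; computes the difference of two numbers, keeping the low 128 bits.
  • VSUBECUQ: corresponds to s390x VSBCBIQ; subtract with borrow-in, computes the borrow indication.
  • VSUBEUQM: corresponds to s390x VSBIQ; subtract with borrow-in, computes the low 128 bits of the result.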

The example below computes T2||TT1||TT0 = T2||T1||T0 - ZER||PH||PL. T2 is the limb that absorbs the final borrow.

	VSUBCUQ  T0, PL, CAR1       // VSCBIQ  PL, T0, CAR1
	VSUBUQM  T0, PL, TT0        // VSQ     PL, T0, TT0
	VSUBECUQ T1, PH, CAR1, CAR2 // VSBCBIQ T1, PH, CAR1, CAR2
	VSUBEUQM T1, PH, CAR1, TT1  // VSBIQ   T1, PH, CAR1, TT1
	VSUBEUQM T2, ZER, CAR2, T2  // VSBIQ   T2, ZER, CAR2, T2
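A Go model of the borrow chain may help when reading the sequence above. It is a sketch under the assumption that the subtract instructions compute a + ^b + carry-in, so that a carry of 1 means "no borrow"; `u128`, `subQ` and `sub384` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// u128 is a hypothetical stand-in for one 128-bit vector register.
type u128 struct{ hi, lo uint64 }

// subQ computes a + ^b + cin and returns the low 128 bits and the carry out.
// cin = 1 models the plain VSUBUQM / VSUBCUQ pair;
// cin = previous carry models the VSUBEUQM / VSUBECUQ pair.
// cout = 1 means no borrow, cout = 0 means a borrow occurred.
func subQ(a, b u128, cin uint64) (diff u128, cout uint64) {
	var c uint64
	diff.lo, c = bits.Add64(a.lo, ^b.lo, cin)
	diff.hi, cout = bits.Add64(a.hi, ^b.hi, c)
	return diff, cout
}

// sub384 mirrors the five-instruction sequence:
// T2||T1||T0 - 0||PH||PL, with T2 absorbing the final borrow.
func sub384(t2, t1, t0, ph, pl u128) (r2, tt1, tt0 u128) {
	var car1, car2 uint64
	tt0, car1 = subQ(t0, pl, 1)    // VSUBCUQ + VSUBUQM
	tt1, car2 = subQ(t1, ph, car1) // VSUBECUQ + VSUBEUQM
	r2, _ = subQ(t2, u128{}, car2) // VSUBEUQM T2, ZER, CAR2, T2
	return r2, tt1, tt0
}

func main() {
	// 2^256 - 1: the borrow ripples through both low limbs and is taken from T2.
	r2, tt1, tt0 := sub384(u128{0, 1}, u128{}, u128{}, u128{}, u128{0, 1})
	fmt.Printf("T2=%+v TT1=%+v TT0=%+v\n", r2, tt1, tt0)
}
```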

Multiplication

Multiplication is more complicated:

// The following macros are used to implement the ppc64le
// equivalent function from the corresponding s390x
// instruction for vector multiply high, low, and add,
// since there aren't exact equivalent instructions.
// The corresponding s390x instructions appear in the
// comments.
// Implementation for big endian would have to be
// investigated, I think it would be different.
//
//
// Vector multiply word
//
//	VMLF  x0, x1, out_low
//	VMLHF x0, x1, out_hi
#define VMULT(x1, x2, out_low, out_hi) \
	VMULEUW x1, x2, TMP1; \
	VMULOUW x1, x2, TMP2; \
	VMRGEW TMP1, TMP2, out_hi; \
	VMRGOW TMP1, TMP2, out_low

//
// Vector multiply add word
//
//	VMALF  x0, x1, y, out_low
//	VMALHF x0, x1, y, out_hi
#define VMULT_ADD(x1, x2, y, one, out_low, out_hi) \
	VMULEUW  y, one, TMP2; \
	VMULOUW  y, one, TMP1; \
	VMULEUW  x1, x2, out_low; \
	VMULOUW  x1, x2, out_hi; \
	VADDUDM  TMP2, out_low, TMP2; \
	VADDUDM  TMP1, out_hi, TMP1; \
	VMRGOW   TMP2, TMP1, out_low; \
	VMRGEW   TMP2, TMP1, out_hi
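What the two macros produce per lane can be sketched in Go. The model below assumes that the `one` operand of VMULT_ADD holds the value 1 in every word (so that multiplying by it just zero-extends `y` to 64 bits); the `vec`, `vmult` and `vmultAdd` names are hypothetical.

```go
package main

import "fmt"

// vec is a hypothetical view of a vector register as four 32-bit lanes.
type vec [4]uint32

// vmult models VMULT: each lane's 64-bit product x1[i]*x2[i]
// is split into its low and high 32-bit halves.
func vmult(x1, x2 vec) (outLow, outHi vec) {
	for i := range x1 {
		p := uint64(x1[i]) * uint64(x2[i])
		outLow[i] = uint32(p)
		outHi[i] = uint32(p >> 32)
	}
	return outLow, outHi
}

// vmultAdd models VMULT_ADD: each lane computes x1[i]*x2[i] + y[i]
// as a 64-bit value (this cannot overflow 64 bits) and splits it.
func vmultAdd(x1, x2, y vec) (outLow, outHi vec) {
	for i := range x1 {
		s := uint64(x1[i])*uint64(x2[i]) + uint64(y[i])
		outLow[i] = uint32(s)
		outHi[i] = uint32(s >> 32)
	}
	return outLow, outHi
}

func main() {
	x1 := vec{0xffffffff, 2, 0, 1}
	x2 := vec{0xffffffff, 3, 5, 1}
	lo, hi := vmult(x1, x2)
	fmt.Printf("low=%08x high=%08x\n", lo, hi)
	lo, hi = vmultAdd(x1, x2, vec{0xffffffff, 1, 7, 0})
	fmt.Printf("low=%08x high=%08x\n", lo, hi)
}
```

In the macros, the even and odd multiplies each yield two 64-bit products, and the VMRGEW/VMRGOW pair then regroups the high and low halves back into lane order.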

Loading vector data

Splat (fill)

  • VSPLTISB "Vector Splat Immediate Signed Byte". This instruction sign-extends a 5-bit signed immediate (-16 to 15) to 8 bits and replicates it into every byte of the target vector register.
  • VSPLTB "Vector Splat Byte". This instruction takes the byte at a specified index in the source vector register and replicates it into every byte of the target vector register.
  • VSPLTISW "Vector Splat Immediate Signed Word". This instruction sign-extends a 5-bit signed immediate (-16 to 15) to 32 bits and replicates it into all four words of the target vector register.
  • VSPLTW "Vector Splat Word". This instruction replicates a specified word (32-bit element) of the source vector register across all four words of the target vector register.
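The splat semantics can be illustrated with a small Go model. It is a sketch only: the function names are hypothetical, the immediate is assumed to be a 5-bit signed value already held in an `int8`, and element 0 is taken to be the leftmost element of the register.

```go
package main

import "fmt"

// splatImmByte models VSPLTISB: the signed immediate (-16..15)
// is replicated into all 16 bytes.
func splatImmByte(sim int8) (v [16]byte) {
	for i := range v {
		v[i] = byte(sim)
	}
	return v
}

// splatImmWord models VSPLTISW: the signed immediate (-16..15)
// is sign-extended to 32 bits and replicated into all 4 words.
func splatImmWord(sim int8) (v [4]uint32) {
	for i := range v {
		v[i] = uint32(int32(sim))
	}
	return v
}

// splatByte models VSPLTB: byte idx of src is replicated into all 16 bytes.
func splatByte(src [16]byte, idx int) (v [16]byte) {
	for i := range v {
		v[i] = src[idx]
	}
	return v
}

// splatWord models VSPLTW: word idx of src is replicated into all 4 words.
func splatWord(src [4]uint32, idx int) (v [4]uint32) {
	for i := range v {
		v[i] = src[idx]
	}
	return v
}

func main() {
	fmt.Printf("%02x\n", splatImmByte(-1))
	fmt.Printf("%08x\n", splatImmWord(-2))
	fmt.Printf("%08x\n", splatWord([4]uint32{10, 11, 12, 13}, 2))
}
```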

Loading data from memory

  • LVX "Load Vector Indexed". This instruction loads a 16-byte vector from memory into a vector register. LVX on ppc64 requires the data to be 16-byte aligned (the low four bits of the effective address are ignored). To avoid that requirement, data is loaded using LXVD2X with VPERM to reorder bytes correctly.
  • LXVDSX "Load VSX Vector Doubleword and Splat Indexed". This instruction loads one doubleword (64-bit element) from the given memory location into the lower half of the target vector register (byte index 0-7) and copies the same value into the higher half (byte index 8-15).
  • LXVD2X "Load VSX Vector Doubleword*2 Indexed". This instruction loads two consecutive doublewords (64-bit elements) from memory into the target vector register.
  • LXVW4X "Load VSX Vector Word*4 Indexed". It loads a vector of 4 words (16 bytes total, as each word is 4 bytes) from memory into a vector register.

Examples (PPC64LE)

Loading two consecutive 64-bit integers with LXVD2X:

DATA ·mask+0x00(SB)/8, $0x0f0e0d0c0b0a0908 // Permute for vector doubleword endian swap
DATA ·mask+0x08(SB)/8, $0x0706050403020100
GLOBL ·mask(SB), RODATA, $16

MOVD	$·mask(SB), R4
LXVD2X  (R4), V0

Then V[0] = 0x0f0e0d0c, V[1] = 0x0b0a0908, V[2] = 0x07060504, V[3] = 0x03020100

Loading two consecutive 64-bit integers with LVX:

DATA ·mask+0x00(SB)/8, $0x0f0e0d0c0b0a0908 // Permute for vector doubleword endian swap
DATA ·mask+0x08(SB)/8, $0x0706050403020100
GLOBL ·mask(SB), RODATA, $16

MOVD	$·mask(SB), R4
LVX  (R4), V0

Then V[2] = 0x0f0e0d0c, V[3] = 0x0b0a0908, V[0] = 0x07060504, V[1] = 0x03020100

Loading four 32-bit integers with LXVD2X:
Assume the four consecutive 32-bit integers are: [0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100]
Then V[0] = 0x0b0a0908, V[1] = 0x0f0e0d0c, V[2] = 0x03020100, V[3] = 0x07060504

Loading four 32-bit integers with LVX:
Assume the four consecutive 32-bit integers are: [0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100]
Then V[0] = 0x03020100, V[1] = 0x07060504, V[2] = 0x0b0a0908, V[3] = 0x0f0e0d0c

Storing vector register data to memory

  • STVX "Store Vector Indexed". This instruction stores a 16-byte vector from a vector register into memory. STVX on ppc64 requires the data to be 16-byte aligned (the low four bits of the effective address are ignored). To avoid that requirement, data is stored using STXVD2X with VPERM to reorder bytes correctly.
  • STXVD2X "Store VSX Vector Doubleword*2 Indexed". This instruction stores the two doublewords (64-bit elements) of a vector register into consecutive memory locations.
  • STXVW4X "Store VSX Vector Word*4 Indexed". It stores a vector of 4 words (16 bytes total, as each word is 4 bytes) from a vector register to memory.

Equality comparison

  • VCMPEQUD "Vector Compare Equal Unsigned Doubleword". This instruction is used to compare the corresponding doublewords (64-bit elements) in two vector registers for equality. The instruction compares the doublewords in the source registers for equality. If the doublewords are equal, the corresponding element in the result is set to all ones; otherwise, it is set to all zeros.
  • VCMPEQUDCC is the record form of VCMPEQUD (the CC suffix is how the Go assembler spells the trailing dot of vcmpequd.). It performs the same doubleword-by-doubleword comparison and writes the same result: all ones for each equal pair, all zeros otherwise. In addition, it summarizes the comparison in condition register field CR6: one bit is set when every pair compared equal, and another bit is set when no pair compared equal, so the result can be tested directly with a conditional branch on CR6.
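The comparison and its CR6 summary can be sketched in Go as follows. The `cr6` struct and the function names are hypothetical, and the two flags stand for the "all elements equal" and "no element equal" bits that the record form reports.

```go
package main

import "fmt"

// cr6 is a hypothetical model of the two CR6 bits set by the record form.
type cr6 struct {
	allTrue  bool // every doubleword pair compared equal
	allFalse bool // no doubleword pair compared equal
}

// vcmpequd models VCMPEQUD: each result doubleword is all ones
// if the corresponding inputs are equal, all zeros otherwise.
func vcmpequd(a, b [2]uint64) (res [2]uint64) {
	for i := range a {
		if a[i] == b[i] {
			res[i] = ^uint64(0)
		}
	}
	return res
}

// vcmpequdcc models VCMPEQUDCC: same result, plus the CR6 summary.
func vcmpequdcc(a, b [2]uint64) ([2]uint64, cr6) {
	res := vcmpequd(a, b)
	ones := ^uint64(0)
	return res, cr6{
		allTrue:  res[0] == ones && res[1] == ones,
		allFalse: res[0] == 0 && res[1] == 0,
	}
}

func main() {
	res, cr := vcmpequdcc([2]uint64{1, 2}, [2]uint64{1, 3})
	fmt.Printf("%016x %+v\n", res, cr)
}
```

When only one of the two pairs matches, neither flag is set, which is why a 128-bit equality test has to check the "all equal" bit rather than the negation of "none equal".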

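Other

  • VSLDOI "Vector Shift Left Double by Octet Immediate". This instruction concatenates two source vector registers into a 32-byte value, shifts it left by the specified number of octets (8-bit bytes), and keeps the leftmost 16 bytes.
  • XXPERMDI "Vector Permute Doubleword Immediate". This instruction builds the result from two source registers: a 2-bit immediate selects which doubleword (64-bit element) of the first source becomes the left half and which doubleword of the second source becomes the right half.
  • VPERM "Vector Permute". This instruction permutes (rearranges) the bytes of two vector registers based on a permutation vector: each byte of the permutation vector uses its low 5 bits to select one of the 32 bytes of the two concatenated sources.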
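The three instructions above can be modelled in a few lines of Go. This is a sketch with hypothetical function names; bytes and doublewords are numbered from the left (index 0 is the most significant element), which is the register numbering the instruction definitions use regardless of memory endianness.

```go
package main

import "fmt"

// vperm models VPERM: each control byte's low 5 bits select one byte
// out of the 32-byte concatenation a||b.
func vperm(a, b, c [16]byte) (t [16]byte) {
	var src [32]byte
	copy(src[:16], a[:])
	copy(src[16:], b[:])
	for i := range t {
		t[i] = src[c[i]&0x1f]
	}
	return t
}

// vsldoi models VSLDOI: shift a||b left by shb bytes (0..15)
// and keep the leftmost 16 bytes.
func vsldoi(shb int, a, b [16]byte) (t [16]byte) {
	var src [32]byte
	copy(src[:16], a[:])
	copy(src[16:], b[:])
	copy(t[:], src[shb:shb+16])
	return t
}

// xxpermdi models XXPERMDI: the high bit of dm picks the doubleword of a
// for the left half, the low bit picks the doubleword of b for the right half.
func xxpermdi(a, b [2]uint64, dm int) [2]uint64 {
	return [2]uint64{a[(dm>>1)&1], b[dm&1]}
}

func main() {
	var a, b, rev [16]byte
	for i := range a {
		a[i] = byte(i)
		b[i] = byte(16 + i)
		rev[i] = byte(15 - i)
	}
	fmt.Printf("vperm reverse: %02x\n", vperm(a, b, rev))
	fmt.Printf("vsldoi by 8:   %02x\n", vsldoi(8, a, b))
	fmt.Printf("xxpermdi swap: %x\n", xxpermdi([2]uint64{1, 2}, [2]uint64{1, 2}, 2))
}
```

A byte-reversing control vector for `vperm`, as in `main`, is the kind of mask used together with LXVD2X to fix up byte order on little-endian targets.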

Typical ppc64 and ppc64le CPUs

The typical CPUs for ppc64 (PowerPC 64-bit Big Endian) and ppc64le (PowerPC 64-bit Little Endian) are IBM's POWER series of processors.

For ppc64, the IBM POWER5, POWER6, POWER7, and POWER8 processors are commonly used. These processors are often found in high-performance computing environments, enterprise servers, and similar applications.

For ppc64le, the IBM POWER8 and POWER9 processors are typically used. The switch to little-endian mode in these processors was made to improve compatibility with software written for x86_64, which also uses little-endian byte order. These processors are used in a variety of applications, from supercomputers to servers for cloud and data analytics workloads.

What's the difference between function definitions in Go asm written as funcname<>(SB), funcname(SB) and ·funcname(SB)?

In Go assembly, the naming conventions for function definitions are as follows:

  1. funcname(SB): This is a global symbol named exactly funcname, with no package prefix. The (SB) suffix refers to the static base pseudo-register, which is used to reference global symbols. Because the name carries no package path, it does not match a Go function declared in the current package.

  2. funcname<>(SB): This is a file-local function named funcname. The <> makes the symbol visible only in the assembly file that defines it, much like a static declaration in C. Other files, even in the same package, cannot reference it directly.

  3. ·funcname(SB): This is the form used for functions that belong to the current Go package. The leading middle dot (·) stands for the package path, which the toolchain inserts in front of the name, so the symbol becomes the package-qualified funcname. This is the name that matches a Go declaration of funcname in the same package, so it is not the same as funcname(SB).

In Go assembly, a name starting with · is a global symbol qualified by the current package. When you define a symbol like ·mask, it must be unique within that package. If you define ·mask more than once, for example in two assembly files of the same package, you'll get a duplicate symbol definition error.

On the other hand, mask<> is a local symbol. Local symbols are only visible within the file they are defined in. You can have a mask<> symbol in each assembly file in your program, and they won't conflict with each other because they are not visible outside their own files.

So, the difference comes from the scope of the symbols. Package-qualified symbols like ·mask are visible to every file that is linked into the program and must be unique within their package. Local symbols like mask<> are only visible within their own file and can be defined in each file without causing conflicts.