When inline C is too safe.
Did you try inline-c
, but it's not enough? You need more? Nothing seems to satisfy?
inline-asm
to the rescue!
And, since the inline assembly is just a usual Haskell value (even if manipulated at compile-time), there's a lot of pretty cool stuff one can do, like, for instance, explicit compile-time loop unrolling.
Swapping two Int
s and also incrementing one of them by two:
defineAsmFun "swap2p1"
[asmTy| (a : Int) (b : Int) | (_ : Int) (_ : Int)]
[asm|
xchg {a}, {b}
add $2, {b}
|]
This provides a function swap2p1 :: Int -> Int -> (Int, Int)
that, well, swaps two Int
s.
Note that the resulting function is pure, and the {a}
, {b}
antiquoters.
Getting the last character of a ByteString
, or a default character if it's empty:
defineAsmFun "lastChar"
[asmTy| (bs : ByteString) (def : Word) | (w : Word) |]
[asm|
test {bs:len}, {bs:len}
jz is_zero
movzbq -1({bs:ptr},{bs:len}), {w}
RET_HASK
is_zero:
mov {def}, {w}
|]
This provides a function lastChar :: ByteString -> Word -> Word
.
Note the special {bs:ptr}
and {bs:len}
antiquoters, as well as RET_HASK
command to return early.
SIMD-accelerated character occurrences count in a string:
defineAsmFun "countCharSSE42"
[asmTy| (ch : Word8) (ptr : Ptr Word8) (len : Int) | (cnt : Int) |] $
unroll "i" [12..15]
[asm|
push %r{i}|] <> [asm|
vmovd {ch}, %xmm15
vpxor %xmm0, %xmm0, %xmm0
vpshufb %xmm0, %xmm15, %xmm15
shr $7, {len}
mov $16, %eax
mov $16, %edx
xor {cnt}, {cnt}
{move ptr rdi}
loop: |] <> unrolls "i" [1..8] [
[asm|
vmovdqa {(i - 1) * 0x10}({ptr}), %xmm{i}
|], [asm|
vpcmpestrm $10, %xmm15, %xmm{i}
vmovdqa %xmm0, %xmm{i}
|], [asm|
vmovq %xmm{i}, %r{i + 7}
|], [asm|
popcnt %r{i + 7}, %r{i + 7}
|], [asm|
add %r{i + 7}, {cnt}
|]
] <>
[asm|
add $128, {ptr}
dec {len}
jnz loop|] <> unroll "i" [15,14..12] [asm|
pop %r{i} |]
This provides a function countCharSSE42 :: Word8 -> Ptr Word8 -> Int -> Int
.
Note the unroll
/unrolls
Haskell function for compile-time code generation and loop unrolling
with arithmetic expressions in the templates.
Impure computation depending on some external state, like reading the CPU's time stamp counter:
defineAsmFunM "rdtsc"
[asmTy| | (out : Word64) |]
[asm|
rdtsc
mov %rdx, {out}
shl $32, {out}
add %rax, {out}
|]
This provides a function rdtsc :: PrimMonad m => m Word64
which can be used in ST
or IO
contexts.
Note the M letter in defineAsmFunM
.
The entry point is the defineAsmFun
function from Language.Asm.Inline
as well as the asm
and asmTy
quasiquoters from Language.Asm.Inline.QQ
.
First, enable some extensions required for Template Haskell and for the unlifted FFI marshalling stuff:
{-# LANGUAGE TemplateHaskell, QuasiQuotes #-}
{-# LANGUAGE GHCForeignImportPrim, UnliftedFFITypes, UnboxedTuples #-}
then one can just do
defineAsmFun "funName"
[asmTy| (someInt : Int) (somePtr : Ptr Word) (someStr : BS.ByteString) | (len : Int) (count : Int) |]
[asm|
; your asm code follows
|]
to define a function funName
of the type Int -> Ptr Word -> BS.ByteString -> (Int, Int)
.
Good news: it's not necessary to memorize the GHC calling convention
to access the arguments and the output values slots!
Instead, one can use the {someInt}
syntax to refer to the argument or the return slot named someInt
.
NB: despite that, be careful to not accidentally overwrite an input parameter for now by picking a wrong register for temporary computations. We might introduce some syntax to pick unused registers in a future version, but for now care must be taken.
ByteString
parameters are supported, but, being composite objects, they are a bit special:
an input parameter of type ByteString
actually takes two registers:
one for the address of the string, and one for its length.
Sometimes it might be handy to reassociate an input parameter with another register.
For this, the {move argName newReg}
antiquoter can be used (for instance, {move someInt rdi}
).
This will both update the mapping from argument names to register names
as well as issue an assembly mov
command.
In case you need to return early to the Haskell-land, just write RET_HASK
,
which gets substituted by the actual command to return to Haskell.
The asm
quasiquoter basically produces a string,
so it can be manipulated at compile-time with arbitrary Haskell functions.
In particular, a string template can be replicated at compile time.
The unroll
function unrolls a single asm
code block,
calculating arithmetic expressions involving the unroll variable, so
unroll "i" [1..8] [asm|vmovdqa {(i - 1) * 0x10}({ptr}), %xmm{i}|]
is equivalent to
vmovdqa 0x0({ptr}), %xmm1
vmovdqa 0x10({ptr}), %xmm2
vmovdqa 0x20({ptr}), %xmm3
vmovdqa 0x30({ptr}), %xmm4
vmovdqa 0x40({ptr}), %xmm5
vmovdqa 0x50({ptr}), %xmm6
vmovdqa 0x60({ptr}), %xmm7
vmovdqa 0x70({ptr}), %xmm8
unrolls
works analogously, but it takes a list of asm
code blocks (instead of a single block),
unrolls each of them and then concatenates the results. Equationally,
unrolls var ints codes = foldMap (unroll var ints) codes
The countCharSSE42
function above might be a pretty good example.
Most of the functions above are actually pure:
they return the same result given the same parameters and have no side effects.
Perhaps unsurprisingly, this is not always the case.
Let's consider the rdtsc
example again and assume we've written
defineAsmFun "rdtsc"
[asmTy| (_ : Unit) | (out : Word64) |]
[asm|
rdtsc
mov %rdx, {out}
shl $32, {out}
add %rax, {out}
|]
(BTW we need a dummy Unit
input here
since otherwise the type of the generated imported Assembly function would be just Word64#
,
which is not a function but a value, and thus disallowed by GHC).
How do we use this function to measure something? We'd probably write something like
measure = do
let r1 = rdtsc Unit
runLongComputation
let r2 = rdtsc Unit
print $ r2 - r1
The problem is that the compiler is very keen on transforming this into
measure = do
let r1 = rdtsc Unit
runLongComputation
let r2 = r1
print $ r2 - r1
so every computation is executed instantly according to our measurements,
and no amount of {-# NOINLINE #-}
and the likes will fix it.
The only fix is to actually thread through the State#
token,
which is what happens when we use the monadic function defineAsmFunM
:
defineAsmFunM "rdtsc"
[asmTy| | (out : Word64) |]
[asm|
rdtsc
mov %rdx, {out}
shl $32, {out}
add %rax, {out}
|]
In this case, the generated Assembly function would be imported with the type
forall s. State# s -> (# State# s, Word64# #)
,
and what happens to State# s
is completely opaque to the compiler, so it can no longer "optimize" this,
and the following works as expected:
measure = do
r1 <- rdtsc
runLongComputation
r2 <- rdtsc
print $ r2 - r1
By the way, now that there's the state token parameter, we no longer need a dummy Unit
.
- Firstly, all of this is utterly unsafe.
- While this package provides some shortcuts, understanding the calling convention (in particular, which registers are actually used) is important. The best source of truth is, of course, GHC's source code.
- Each function is compiled in its own
.S
file, so one can freely pick arbitrary naming for the labels and so on, but, on the other hand, one cannot access labels in other functions. This can be remedied somewhat easily — consider throwing up an issue if that's actually desired. - Finally, all of this is utterly unsafe.