Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable fusion of unsafePackLen{Bytes,Chars} #494

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

noughtmare
Copy link
Contributor

@noughtmare noughtmare commented Mar 3, 2022

Here's a benchmark:

byteswap :: ByteString -> ByteString
byteswap xs = unsafePackLenBytes (B.length xs `quot` 4 * 4)
  [ unsafeIndex xs $ i * 4 + j
  | i <- [0 .. B.length xs `quot` 4 - 1]
  , j <- [3,2,1,0]
  ]

Before:

All
  byteswap
    1000:    OK (0.36s)
      10.1 μs ± 582 ns,  97 KB allocated,  23 B  copied, 2.0 MB peak memory
    10000:   OK (0.22s)
      102  μs ± 8.6 μs, 967 KB allocated, 205 B  copied, 2.0 MB peak memory
    100000:  OK (0.61s)
      1.18 ms ± 110 μs, 9.4 MB allocated,  11 KB copied, 5.0 MB peak memory
    1000000: OK (0.74s)
      10.5 ms ± 390 μs,  94 MB allocated,  38 KB copied,  34 MB peak memory

After:

All
  byteswap
    1000:    OK (0.18s)
      711  ns ±  70 ns, 1.1 KB allocated,   0 B  copied, 2.0 MB peak memory
    10000:   OK (0.24s)
      7.13 μs ± 583 ns, 9.9 KB allocated,   1 B  copied, 3.0 MB peak memory
    100000:  OK (0.56s)
      68.8 μs ± 6.2 μs,  98 KB allocated,  13 B  copied, 4.0 MB peak memory
    1000000: OK (0.24s)
      781  μs ±  51 μs, 977 KB allocated,  89 B  copied,  34 MB peak memory

@Bodigrim
Copy link
Contributor

Bodigrim commented Mar 3, 2022

@noughtmare could you please fix warnings?

@Bodigrim
Copy link
Contributor

Bodigrim commented Mar 3, 2022

Please submit benchmarks as well.

@noughtmare
Copy link
Contributor Author

noughtmare commented Mar 5, 2022

I've added these benchmarks:

    , bgroup "pack"
      [ bench "not fused" $ nf S.pack (replicate nRepl 0)
      , bench "fused" $ nf (S.pack . replicate nRepl) 0
      ]
    , bgroup "unsafePackLenBytes"
      [ bench "not fused" $ nf (SI.unsafePackLenBytes nRepl) (replicate nRepl 0)
      , bench "fused" $ nf (SI.unsafePackLenBytes nRepl . replicate nRepl) 0
      ]
    , bgroup "unsafePackLenChar"
      [ bench "not fused" $ nf (SI.unsafePackLenChars nRepl) (replicate nRepl 'A')
      , bench "fused" $ nf (SI.unsafePackLenChars nRepl . replicate nRepl) 'A'
      ]

Before

All
  pack
    not fused: OK (0.39s)
      45.7 μs ± 1.9 μs, 6.4 KB allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.26s)
      120  μs ± 8.3 μs, 546 KB allocated, 2.2 KB copied,  76 MB peak memory
  unsafePackLenBytes
    not fused: OK (0.23s)
      27.1 μs ± 1.9 μs, 6.4 KB allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.17s)
      81.0 μs ± 7.1 μs, 546 KB allocated,   6 B  copied,  76 MB peak memory
  unsafePackLenChar
    not fused: OK (0.14s)
      34.7 μs ± 3.4 μs,   0 B  allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.35s)
      84.4 μs ± 4.8 μs, 552 KB allocated,   4 B  copied,  76 MB peak memory

After

All
  pack
    not fused: OK (0.40s)
      46.7 μs ± 1.8 μs, 6.4 KB allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.26s)
      122  μs ± 7.5 μs, 546 KB allocated, 2.2 KB copied,  76 MB peak memory
  unsafePackLenBytes
    not fused: OK (1.85s)
      27.6 μs ± 1.6 μs, 8.9 KB allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.54s)
      3.86 μs ± 189 ns, 9.5 KB allocated,   0 B  copied,  76 MB peak memory
  unsafePackLenChar
    not fused: OK (0.16s)
      38.0 μs ± 2.7 μs,   0 B  allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.55s)
      3.85 μs ± 154 ns, 9.5 KB allocated,   0 B  copied,  76 MB peak memory

If the fusion works out there is a 20x speedup, but otherwise it is slightly slower.

Update: It seems that it is just the inlining which makes things a tiny bit worse in this case.

Also, note that this function is only exposed via pack, where it is never possible to fuse, so to really make this useful I would suggest adding a safe packLen :: Int -> [Word8] -> ByteString function which just fills the rest with zeroes if the list is too short.

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to do the same for ShortByteString?

packLenBytes :: Int -> [Word8] -> ShortByteString
packLenBytes len ws0 =
create len (\mba -> go mba 0 ws0)
where
go :: MBA s -> Int -> [Word8] -> ST s ()
go !_ !_ [] = return ()
go !mba !i (w:ws) = do
writeWord8Array mba i w
go mba (i+1) ws

unsafeCreate len . foldr
(\x go p -> poke p x >> go (p `plusPtr` 1))
(\_ -> return ())
{-# INLINE unsafePackLenBytes #-}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why these functions have not been inlined before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the benchmarks show, if no fusion happens then there is no benefit in inlining (it even costs a few percents of performance).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now think this is actually quite a big problem, because the only way this function is exposed in the safe API is through pack which can never fuse anyway, so is this patch really still desirable? We would need a safe packLen function to really get the benefits.

Copy link
Contributor Author

@noughtmare noughtmare Mar 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a version of pack that is fusable (by using dynamic arrays doubling in size):

https://gist.github.com/noughtmare/8902ddec65fedae9fb24d3826dbe11a0

This again has the problem that it is slightly slower for short inputs if not fused (around x1.5), but a lot faster if it is fused.

@Bodigrim
Copy link
Contributor

@noughtmare what's the status of this? Could you please rebase?

@noughtmare
Copy link
Contributor Author

I was kind of waiting for a response on this thread:

As the benchmarks show, if no fusion happens then there is no benefit in inlining (it even costs a few percents of performance).

I now think this is actually quite a big problem, because the only way this function is exposed in the safe API is through pack which can never fuse anyway, so is this patch really still desirable? We would need a safe packLen function to really get the benefits.

Here's a version of pack that is fusable (by using dynamic arrays doubling in size):

https://gist.github.com/noughtmare/8902ddec65fedae9fb24d3826dbe11a0

This again has the problem that it is slightly slower for short inputs if not fused (around x1.5), but a lot faster if it is fused.

So the options are:

  1. Go through with this and accept the 1.5x performance penalty if the fusion doesn't work.
  2. Add (complicated) rewrite rules to do the fusion and undo if fusion doesn't work. Like how base does it, e.g.:
    {-# RULES
      "iterate"    [~1] forall f x.   iterate f x = build (\c _n -> iterateFB c f x)
      "iterateFB"  [1]                iterateFB (:) = iterate
    #-}
    This requires someone writing a bunch more code, I don't know how soon I will have time.
  3. Keep the old behavior and add the new fusible functions as completely new functions. This requires someone to pick new names for the functions or a new module name.

I'd prefer 2 or 3

@clyring
Copy link
Member

clyring commented Sep 27, 2022

The existing fromListN is a good candidate for option 3.

@Bodigrim
Copy link
Contributor

I'd rather avoid rules with phase control, they are such a nightmare to work with.

@Bodigrim Bodigrim marked this pull request as draft December 16, 2022 20:08
@Bodigrim
Copy link
Contributor

Marking as draft for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants