Enable fusion of unsafePackLen{Bytes,Chars} #494

noughtmare · 2022-03-03T21:11:29Z

Here's a benchmark:

byteswap :: ByteString -> ByteString
byteswap xs = unsafePackLenBytes (B.length xs `quot` 4 * 4)
  [ unsafeIndex xs $ i * 4 + j
  | i <- [0 .. B.length xs `quot` 4 - 1]
  , j <- [3,2,1,0]
  ]

Before:

All
  byteswap
    1000:    OK (0.36s)
      10.1 μs ± 582 ns,  97 KB allocated,  23 B  copied, 2.0 MB peak memory
    10000:   OK (0.22s)
      102  μs ± 8.6 μs, 967 KB allocated, 205 B  copied, 2.0 MB peak memory
    100000:  OK (0.61s)
      1.18 ms ± 110 μs, 9.4 MB allocated,  11 KB copied, 5.0 MB peak memory
    1000000: OK (0.74s)
      10.5 ms ± 390 μs,  94 MB allocated,  38 KB copied,  34 MB peak memory

After:

All
  byteswap
    1000:    OK (0.18s)
      711  ns ±  70 ns, 1.1 KB allocated,   0 B  copied, 2.0 MB peak memory
    10000:   OK (0.24s)
      7.13 μs ± 583 ns, 9.9 KB allocated,   1 B  copied, 3.0 MB peak memory
    100000:  OK (0.56s)
      68.8 μs ± 6.2 μs,  98 KB allocated,  13 B  copied, 4.0 MB peak memory
    1000000: OK (0.24s)
      781  μs ±  51 μs, 977 KB allocated,  89 B  copied,  34 MB peak memory

Bodigrim · 2022-03-03T21:18:02Z

@noughtmare could you please fix warnings?

Data/ByteString/Internal.hs

Bodigrim · 2022-03-03T22:18:49Z

Please submit benchmarks as well.

noughtmare · 2022-03-05T13:03:06Z

I've added these benchmarks:

    , bgroup "pack"
      [ bench "not fused" $ nf S.pack (replicate nRepl 0)
      , bench "fused" $ nf (S.pack . replicate nRepl) 0
      ]
    , bgroup "unsafePackLenBytes"
      [ bench "not fused" $ nf (SI.unsafePackLenBytes nRepl) (replicate nRepl 0)
      , bench "fused" $ nf (SI.unsafePackLenBytes nRepl . replicate nRepl) 0
      ]
    , bgroup "unsafePackLenChar"
      [ bench "not fused" $ nf (SI.unsafePackLenChars nRepl) (replicate nRepl 'A')
      , bench "fused" $ nf (SI.unsafePackLenChars nRepl . replicate nRepl) 'A'
      ]

Before

All
  pack
    not fused: OK (0.39s)
      45.7 μs ± 1.9 μs, 6.4 KB allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.26s)
      120  μs ± 8.3 μs, 546 KB allocated, 2.2 KB copied,  76 MB peak memory
  unsafePackLenBytes
    not fused: OK (0.23s)
      27.1 μs ± 1.9 μs, 6.4 KB allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.17s)
      81.0 μs ± 7.1 μs, 546 KB allocated,   6 B  copied,  76 MB peak memory
  unsafePackLenChar
    not fused: OK (0.14s)
      34.7 μs ± 3.4 μs,   0 B  allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.35s)
      84.4 μs ± 4.8 μs, 552 KB allocated,   4 B  copied,  76 MB peak memory

After

All
  pack
    not fused: OK (0.40s)
      46.7 μs ± 1.8 μs, 6.4 KB allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.26s)
      122  μs ± 7.5 μs, 546 KB allocated, 2.2 KB copied,  76 MB peak memory
  unsafePackLenBytes
    not fused: OK (1.85s)
      27.6 μs ± 1.6 μs, 8.9 KB allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.54s)
      3.86 μs ± 189 ns, 9.5 KB allocated,   0 B  copied,  76 MB peak memory
  unsafePackLenChar
    not fused: OK (0.16s)
      38.0 μs ± 2.7 μs,   0 B  allocated,   0 B  copied,  76 MB peak memory
    fused:     OK (0.55s)
      3.85 μs ± 154 ns, 9.5 KB allocated,   0 B  copied,  76 MB peak memory

If the fusion works out there is a 20x speedup, but otherwise it is slightly slower.

Update: It seems that it is just the inlining which makes things a tiny bit worse in this case.

Also, note that this function is only exposed via pack, where it is never possible to fuse, so to really make this useful I would suggest adding a safe packLen :: Int -> [Word8] -> ByteString function which just fills the rest with zeroes if the list is too short.

Bodigrim

Is it possible to do the same for ShortByteString?

bytestring/Data/ByteString/Short/Internal.hs

Lines 558 to 566 in 1256378

    
           packLenBytes :: Int -> [Word8] -> ShortByteString 
        
           packLenBytes len ws0 = 
        
               create len (\mba -> go mba 0 ws0) 
        
             where 
        
               go :: MBA s -> Int -> [Word8] -> ST s () 
        
               go !_   !_ []     = return () 
        
               go !mba !i (w:ws) = do 
        
                 writeWord8Array mba i w 
        
                 go mba (i+1) ws

Bodigrim · 2022-03-05T19:18:36Z

Data/ByteString/Internal.hs

+    unsafeCreate len . foldr
+      (\x go p -> poke p x >> go (p `plusPtr` 1))
+      (\_ -> return ())
+{-# INLINE unsafePackLenBytes #-}


I wonder why these functions have not been inlined before.

As the benchmarks show, if no fusion happens then there is no benefit in inlining (it even costs a few percents of performance).

I now think this is actually quite a big problem, because the only way this function is exposed in the safe API is through pack which can never fuse anyway, so is this patch really still desirable? We would need a safe packLen function to really get the benefits.

Here's a version of pack that is fusable (by using dynamic arrays doubling in size):

https://gist.github.com/noughtmare/8902ddec65fedae9fb24d3826dbe11a0

This again has the problem that it is slightly slower for short inputs if not fused (around x1.5), but a lot faster if it is fused.

Data/ByteString/Internal.hs

bench/BenchAll.hs

Bodigrim · 2022-09-26T18:02:17Z

@noughtmare what's the status of this? Could you please rebase?

noughtmare · 2022-09-26T18:38:45Z

I was kind of waiting for a response on this thread:

As the benchmarks show, if no fusion happens then there is no benefit in inlining (it even costs a few percents of performance).

I now think this is actually quite a big problem, because the only way this function is exposed in the safe API is through pack which can never fuse anyway, so is this patch really still desirable? We would need a safe packLen function to really get the benefits.

Here's a version of pack that is fusable (by using dynamic arrays doubling in size):

https://gist.github.com/noughtmare/8902ddec65fedae9fb24d3826dbe11a0

This again has the problem that it is slightly slower for short inputs if not fused (around x1.5), but a lot faster if it is fused.

So the options are:

Go through with this and accept the 1.5x performance penalty if the fusion doesn't work.
Add (complicated) rewrite rules to do the fusion and undo if fusion doesn't work. Like how base does it, e.g.:
```
{-# RULES
  "iterate"    [~1] forall f x.   iterate f x = build (\c _n -> iterateFB c f x)
  "iterateFB"  [1]                iterateFB (:) = iterate
#-}
```
This requires someone writing a bunch more code, I don't know how soon I will have time.
Keep the old behavior and add the new fusible functions as completely new functions. This requires someone to pick new names for the functions or a new module name.

I'd prefer 2 or 3

clyring · 2022-09-27T14:12:04Z

The existing fromListN is a good candidate for option 3.

Bodigrim · 2022-09-27T18:49:23Z

I'd rather avoid rules with phase control, they are such a nightmare to work with.

Bodigrim · 2022-12-16T20:08:56Z

Marking as draft for now.

Bodigrim reviewed Mar 3, 2022

View reviewed changes

Data/ByteString/Internal.hs Outdated Show resolved Hide resolved

noughtmare added 5 commits March 5, 2022 11:54

Add packing benchmarks

12a6069

Expand packing benchmarks

969ec2a

Enable fusion of unsafePackLen{Bytes,Chars}

a04d938

Fix name shadowing warnings

2ef63df

Remove some points

b43d295

noughtmare force-pushed the master branch from fb41f64 to b43d295 Compare March 5, 2022 12:48

Bodigrim reviewed Mar 5, 2022

View reviewed changes

sjakobi reviewed Mar 5, 2022

View reviewed changes

Data/ByteString/Internal.hs Show resolved Hide resolved

sjakobi reviewed Mar 5, 2022

View reviewed changes

bench/BenchAll.hs Show resolved Hide resolved

Bodigrim marked this pull request as draft December 16, 2022 20:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable fusion of unsafePackLen{Bytes,Chars} #494

Enable fusion of unsafePackLen{Bytes,Chars} #494

noughtmare commented Mar 3, 2022 •

edited

Loading

Bodigrim commented Mar 3, 2022

Bodigrim commented Mar 3, 2022

noughtmare commented Mar 5, 2022 •

edited

Loading

Bodigrim left a comment

Bodigrim Mar 5, 2022

noughtmare Mar 5, 2022

noughtmare Mar 5, 2022

noughtmare Mar 5, 2022 •

edited

Loading

Bodigrim commented Sep 26, 2022

noughtmare commented Sep 26, 2022

clyring commented Sep 27, 2022

Bodigrim commented Sep 27, 2022

Bodigrim commented Dec 16, 2022

	packLenBytes :: Int -> [Word8] -> ShortByteString
	packLenBytes len ws0 =
	create len (\mba -> go mba 0 ws0)
	where
	go :: MBA s -> Int -> [Word8] -> ST s ()
	go !_ !_ [] = return ()
	go !mba !i (w:ws) = do
	writeWord8Array mba i w
	go mba (i+1) ws

Enable fusion of unsafePackLen{Bytes,Chars} #494

Are you sure you want to change the base?

Enable fusion of unsafePackLen{Bytes,Chars} #494

Conversation

noughtmare commented Mar 3, 2022 • edited Loading

Bodigrim commented Mar 3, 2022

Bodigrim commented Mar 3, 2022

noughtmare commented Mar 5, 2022 • edited Loading

Bodigrim left a comment

Choose a reason for hiding this comment

Bodigrim Mar 5, 2022

Choose a reason for hiding this comment

noughtmare Mar 5, 2022

Choose a reason for hiding this comment

noughtmare Mar 5, 2022

Choose a reason for hiding this comment

noughtmare Mar 5, 2022 • edited Loading

Choose a reason for hiding this comment

Bodigrim commented Sep 26, 2022

noughtmare commented Sep 26, 2022

clyring commented Sep 27, 2022

Bodigrim commented Sep 27, 2022

Bodigrim commented Dec 16, 2022

noughtmare commented Mar 3, 2022 •

edited

Loading

noughtmare commented Mar 5, 2022 •

edited

Loading

noughtmare Mar 5, 2022 •

edited

Loading