Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple guards leads to less efficient HDL than single guard of conjunction #2842

Open
gergoerdi opened this issue Nov 12, 2024 · 6 comments

Comments

@gergoerdi
Copy link
Contributor

In the following program, guess1 is written as a signal function, whereas guess2 is written as a pure function. lifted (via fmap unbundle . liftA2) into a signal function. The Verilog generated for guess1, when fed to Vivado, produces a much smaller circuit than the Verilog generated for guess2.

{-# LANGUAGE DerivingStrategies, DerivingVia, GeneralizedNewtypeDeriving #-}
{-# LANGUAGE BlockArguments #-}
module Sudoku where

import Clash.Prelude hiding (mapAccumR)
import Clash.Annotations.TH

import Data.Traversable (mapAccumR)
import Data.Functor.Compose
import Data.Coerce

newtype Matrix n m a = FromRows{ matrixRows :: Vec n (Vec m a) }
    deriving stock (Generic, Eq)
    deriving anyclass (NFDataX)
    deriving (Functor, Applicative, Foldable) via Compose (Vec n) (Vec m)

instance (KnownNat n, KnownNat m) => Traversable (Matrix n m) where
    traverse f = fmap coerce . traverse @(Compose (Vec n) (Vec m)) f . coerce

newtype Grid n m a = Grid{ getGrid :: Matrix n m (Matrix m n a) }
    deriving stock (Generic, Eq)
    deriving anyclass (NFDataX)
    deriving (Functor, Applicative, Foldable) via Compose (Matrix n m) (Matrix m n)

instance (KnownNat n, KnownNat m) => Traversable (Grid n m) where
    traverse f = fmap coerce . traverse @(Compose (Matrix n m) (Matrix m n)) f . coerce

newtype Cell n m = Cell{ cellBits :: BitVector (n * m) }
    deriving stock (Generic)
    deriving anyclass (NFDataX)
    deriving newtype (Eq)

conflicted :: (KnownNat n, KnownNat m) => Cell n m
conflicted = Cell zeroBits

lastBit :: (KnownNat n) => BitVector n -> BitVector n
lastBit x = x .&. negate x

splitCell :: (KnownNat n, KnownNat m) => Cell n m -> (Cell n m, Cell n m)
splitCell (Cell c) = (Cell last, Cell rest)
  where
    last = lastBit c
    rest = c .&. complement last

type Sudoku n m = Grid n m (Cell n m)
type Solvable n m = (KnownNat n, KnownNat m, 1 <= n * m * m * n)

infixl 4 .<$>.
(.<$>.) :: (Applicative f, Applicative g) => (a -> b) -> f (g a) -> f (g b)
(.<$>.) f = liftA ((<$>) f)

-- VERSION 1: Ugly, but produces much smaller circuit
expand1
    :: forall n m dom. (Solvable n m, HiddenClockResetEnable dom)
    => Grid n m (Signal dom (Cell n m))
    -> (Grid n m (Signal dom Bool), Grid n m (Signal dom (Cell n m)), Grid n m (Signal dom (Cell n m)))
expand1 cells = (singles, first_guesses, next_guesses)
  where
    singles = fst .<$>. guesses
    first_guesses = fst . snd .<$>. guesses
    next_guesses = snd . snd .<$>. guesses

    (_, guesses) = mapAccumR guess1 (pure False) cells

    guess1 guessed_before cell =
        unbundle $ mux
          ((not <$> guessed_before) .&&. (not <$> single))
          (bundle (pure True, bundle (single, split)))
          (bundle (guessed_before, bundle (single, bundle (cell, cell))))
      where
        split = splitCell <$> cell
        single = (== conflicted) . snd <$> split

-- VERSION 2: This is what I want to write, but it produces much larger circuit!
expand2
    :: forall n m dom. (Solvable n m, HiddenClockResetEnable dom)
    => Grid n m (Signal dom (Cell n m))
    -> (Grid n m (Signal dom Bool), Grid n m (Signal dom (Cell n m)), Grid n m (Signal dom (Cell n m)))
expand2 cells = (singles, first_guesses, next_guesses)
  where
    singles = fst .<$>. guesses
    first_guesses = fst . snd .<$>. guesses
    next_guesses = snd . snd .<$>. guesses

    (_, guesses) = mapAccumR guess2 (pure False) cells

    guess2 = fmap unbundle . liftA2 guess2'
    guess2' guessed_before cell
        | not guessed_before
        , not single
        = (True, (single, (first_guess, next_guess)))

        | otherwise
        = (guessed_before, (single, (cell, cell)))
      where
        (first_guess, next_guess) = splitCell cell
        single = next_guess == conflicted

type N = 2
type M = 2

topEntity
    :: "CLK_100MHZ" ::: Clock System
    -> "RESET"      ::: Reset System
    -> "GRID"       ::: Grid N M (Signal System (Cell M M))
    -> ( "SINGLE"   ::: Grid N M (Signal System Bool)
       , "GUESS1"   ::: Grid N M (Signal System (Cell M M))
       , "GUESS2"   ::: Grid N M (Signal System (Cell M M))
       )
topEntity clk rst = withClockResetEnable clk rst enableGen expand1
makeTopEntity 'topEntity

topEntity2
    :: "CLK_100MHZ" ::: Clock System
    -> "RESET"      ::: Reset System
    -> "GRID"       ::: Grid N M (Signal System (Cell M M))
    -> ( "SINGLE"   ::: Grid N M (Signal System Bool)
       , "GUESS1"   ::: Grid N M (Signal System (Cell M M))
       , "GUESS2"   ::: Grid N M (Signal System (Cell M M))
       )
topEntity2 clk rst = withClockResetEnable clk rst enableGen expand2
makeTopEntity 'topEntity2
@gergoerdi
Copy link
Contributor Author

This is what the generated Verilog looks like (renamed a bit so more things line up) for one of the guess1 instantiations:

  assign last = GRID[63:60] & (-GRID[63:60]);
  assign result_5 = {last, GRID[63:60] & (~ last)};
  assign next_guess = result_5[3:0];
  assign single = next_guess == 4'b0000;

  assign b = (~ result_6[9:9]) & (~ single);

  assign result_4 = 
                    b ? 
                    {1'b1, {single, result_5}} : 
                    {result_6[9:9], {single,   {GRID[63:60], GRID[63:60]}}};

And the same for guess2:

  assign last = GRID[63:60] & (-GRID[63:60]);
  assign result_5 = {last, GRID[63:60] & (~ last)};
  assign next_guess = result_5[3:0];
  assign single = next_guess == 4'b0000;

  assign c$ds1_case_alt = 
                          (~ single) ? 
                          {1'b1, {single,   {result_5[7:4], next_guess}}} : 
                          {result_6[9:9],   {single, {GRID[63:60],   GRID[63:60]}}};

  assign result_4 = 
                    (~ result_6[9:9]) ? 
                    c$ds1_case_alt : 
                    {result_6[9:9], {single,   {GRID[63:60],   GRID[63:60]}}};

@gergoerdi
Copy link
Contributor Author

gergoerdi commented Nov 12, 2024

It seems the big difference is that in guess1, there is a single multiplexer with the selector ((not <$> guessed_before) .&&. (not <$> single)), whereas in guess2, we have two separate guards | not guessed_before, not single.

@gergoerdi
Copy link
Contributor Author

gergoerdi commented Nov 12, 2024

OK this is huge: by changing my guard in guess2 from |not guessed_before, not single to | not guessed_before && not single, my real circuit is now shedding ~11% of its total gate count after Vivado synthesis!

@gergoerdi gergoerdi changed the title Pure function liftA2'd produces much larger circuit than signal function Multiple guards leads to less efficient HDL than single guard of conjunction Nov 12, 2024
@gergoerdi
Copy link
Contributor Author

So, what would it take for Clash to be able to merge these conjunctions? I guess the problem is that by Core, this:

f | p1, p2 = x
  | otherwise = y

is turned into

f  = case p1 of
  True -> case p2 of
    True -> x
    False -> y
  False -> y

But it sure would be neat if Clash could recognize this pattern and turn it into case p1 && p2 of True -> x; False -> y.

@christiaanb
Copy link
Member

It would also be neat if the downstream logic synthesis tools would recognize this opportunity. What happens when you tell Vivado to optimize for size instead of whatever its default is?

@DigitalBrains1
Copy link
Member

DigitalBrains1 commented Nov 13, 2024

We recently encountered this as well, with the following snippet being inefficient:

        outReady = _dcSize == 0 || (_dcSize == 1 && inReady)
        st' = case fwdIn of
                Just inp | outReady -> fromPacketStreamM2S inp
                _ -> st  { _dcBuf = _dcBuf'
                         , _dcSize = _dcSize'
                         }

and the equivalent but less pretty version that reduced the muxes this:

       outReady = _dcSize == 0 || (_dcSize == 1 && inReady)
        st'
          | isJust fwdIn && outReady = fromPacketStreamM2S (fromJustX fwdIn)
          | otherwise = st { _dcBuf = _dcBuf'
                            , _dcSize = _dcSize'
                            }

The combined guard collapsed it down to a single mux.

The inefficient version boiled down to

case fwdIn of
  Nothing ->
    st { _dcBuf = _dcBuf'
       , _dcSize = _dcSize'
       }
  Just inp ->
    case (_dcSize == 0 || (_dcSize == 1 && inReady)) of
      True -> fromPacketStreamM2S inp
      False ->
        st { _dcBuf = _dcBuf'
           , _dcSize = _dcSize'
           }

in Core, duplicating the record update for the wild match in the original Haskell.

This inefficiency was observed both with Yosys+nextpnr as well as ISE (Tim didn't have a Vivado at the ready).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants