Avoid usage of __builtin_popcountll when -simdutf is unset #453

Merged
merged 6 commits into from
Jul 19, 2022

wz1000
Contributor

@wz1000 wz1000 commented Jul 7, 2022

Also avoid linking against gcc/gcc_s on all platforms.

This works around https://gitlab.haskell.org/ghc/ghc/-/issues/21787
and https://gitlab.haskell.org/ghc/ghc/-/issues/19900 which cause
problems when GHC's RTS linker tries to load text, which occurs if
you use a statically linked GHC to compile a file with a TH splice that
depends on text.

Since we don't require SSE4.2 to build text -simdutf, this shouldn't
be much of a pessimisation.

Fixes #450

This is meant to be a temporary workaround for https://gitlab.haskell.org/ghc/ghc/-/issues/21787 while we work on a robust method for properly exposing all GCC symbols from the RTS linker.

However, older versions of GHC (particularly the 9.0 series and earlier) which won't be patched still need a workaround so that text-2.0 is usable under all configurations.

It is also problematic to include extra-libraries: gcc_s when compiling with clang.
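
For context on why the builtin creates a library dependency at all: when the target has no hardware popcount instruction enabled, GCC typically lowers __builtin_popcountll to a call to the libgcc helper __popcountdi2, and that is the symbol the RTS linker then has to resolve. A minimal sketch of a portable fallback follows. The function name popcount64 and the use of the __POPCNT__ macro as the guard are assumptions made for illustration; they are not taken from this PR.

```c
#include <stdint.h>

/* Hypothetical helper (the name is an assumption): count the set bits of a
   64-bit word without ever emitting a call into libgcc. */
static inline unsigned popcount64(uint64_t x)
{
#if defined(__POPCNT__)
    /* Hardware popcnt is enabled, so the builtin becomes a single instruction. */
    return (unsigned)__builtin_popcountll(x);
#else
    /* Classic SWAR reduction: sum bits in pairs, then nibbles, then bytes. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiplication adds all eight per-byte counts into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
#endif
}
```

Either branch returns the same value, so callers do not need to know which one was compiled in.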

@Bodigrim
Contributor

Bodigrim commented Jul 7, 2022

Thanks @wz1000. First of all, before we discuss the chosen approach, I'd like to see reproducible evidence of the issue in the form of an additional CI job. This would help us validate the solution and prevent future regressions.

@wz1000
Contributor Author

wz1000 commented Jul 11, 2022

@Bodigrim I've modified the -simdutf CI job to test on windows (which is the only GHC distribution we ship with a statically linked GHC). I also modified it to test GHC-9.2.1 instead of latest because currently latest resolves to 9.2.2, and chocolatey includes a workaround on that version which also masks the text issue.

See https://github.com/haskell/text/runs/7279337599 (which is a run from #454) for an example of the failure without this patch.

@Bodigrim
Contributor

@wz1000

text/text.cabal

Lines 197 to 198 in 971051b

if os(windows) && impl(ghc < 9.3)
  extra-libraries: gcc_s

is there for a reason indeed. If you remove it, the build breaks. But this does not motivate me to avoid __builtin_popcountll, because extra-libraries: gcc_s is a much simpler workaround. What I asked you to demonstrate is that the existing setup is not enough and that there is a reproducible configuration with a build failure.

@wz1000
Contributor Author

wz1000 commented Jul 12, 2022

I've improved the implementation to only use 2 bit shifts and a multiplication to compute the popcount and added a CI job that tests it on alpine.
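
One way a popcount can be reduced to two shifts and a multiplication is to exploit the shape of the input: if only the top bit of each byte can ever be set, every byte contributes either 0 or 1 to the total. The sketch below shows that idea. It is an illustration under that stated assumption, not necessarily the code in this PR, and the name popcount_top_bits is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: popcount for words in which only bit 7 of each byte
   may be set (for example a per-byte flag mask). */
static inline unsigned popcount_top_bits(uint64_t mask)
{
    /* Precondition of this sketch: no bits outside the per-byte top bits. */
    assert((mask & ~0x8080808080808080ULL) == 0);

    /* Shift 1: move each flag from bit 7 to bit 0, so every byte is 0 or 1. */
    uint64_t ones = mask >> 7;

    /* Multiply: the top byte of the product is the sum of all eight bytes.
       The sum is at most 8, so no carry can spill between bytes.
       Shift 2: extract that top byte. */
    return (unsigned)((ones * 0x0101010101010101ULL) >> 56);
}
```

The trade-off is that the function is only valid for such sparse masks; for arbitrary words a general popcount is still needed.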

@wz1000 wz1000 force-pushed the wip/no-popcount branch 2 times, most recently from 72e91da to 4820609 on July 12, 2022 11:07

@wz1000
Contributor Author

wz1000 commented Jul 12, 2022

Unfortunately the static alpine GHC binaries are not usable because of https://gitlab.haskell.org/ghc/ghc/-/issues/21844

I'm stumped by this, not sure how to add a non windows CI job that demonstrates the problem.

@wz1000 wz1000 force-pushed the wip/no-popcount branch 3 times, most recently from a777146 to f6c3cd4 on July 12, 2022 11:40

@wz1000
Contributor Author

wz1000 commented Jul 12, 2022

I guess I can try to run it in an alpine container and it should work. I'll try this tomorrow.

@Bodigrim
Contributor

> I'm stumped by this, not sure how to add a non windows CI job that demonstrates the problem.

As I suggested in #450 (comment), try to replicate the bytestring CI setup that runs the Windows job with a clean PATH: https://github.com/haskell/bytestring/blob/22b36125ac52605e807b7b96ef31e8f087248f17/.github/workflows/ci.yml#L93-L99
If this proves that extra-libraries: gcc_s is not a good solution, we are all set, no need for Alpine.

@wz1000
Contributor Author

wz1000 commented Jul 14, 2022

I've fixed the alpine job and also attempted to add a windows job to perform the check you added in bytestring, but that job seems to fail in a different way to the intended failure mode. I'm not sure about this. See https://github.com/haskell/text/runs/7337239414?check_suite_focus=true

@Bodigrim
Contributor

The simplest setup I can come up with is https://github.com/Bodigrim/text/tree/purge-path. After the first commit the build succeeds, but as soon as we purge PATH, it fails (the error message is just error code 1, that's fine). If this setup is enough, I'd rather avoid an Alpine job with an unreleased GHC version.

Now, I don't quite understand what this has to do with the simdutf flag. The code is used unconditionally for any length / drop / take, so its performance is absolutely crucial. Please accompany your patch with benchmark results (cabal bench --benchmark-options='-p length').
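
To illustrate why a popcount can sit on the hot path of length-like operations: counting UTF-8 code points amounts to counting the bytes that are not continuation bytes, and a word-at-a-time loop turns that into one popcount per 8 bytes. The following is a rough sketch with the hypothetical name utf8_count_code_points. It assumes valid UTF-8 input and is not the library's actual routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: count code points in a valid UTF-8 buffer by counting
   every byte that is not a continuation byte (10xxxxxx). */
static size_t utf8_count_code_points(const uint8_t *src, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    /* Process 8 bytes per iteration. */
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, sizeof w); /* safe unaligned load */

        /* After masking, bit 7 of a byte is set iff (bit 6 is set) or
           (bit 7 is clear), i.e. iff the byte is not 10xxxxxx. */
        uint64_t leads = ((w << 1) | ~w) & 0x8080808080808080ULL;

        /* Popcount of a mask with at most one bit per byte. */
        count += (size_t)(((leads >> 7) * 0x0101010101010101ULL) >> 56);
    }

    /* Handle the remaining 0..7 bytes one at a time. */
    for (; i < len; i++) {
        count += (src[i] & 0xC0) != 0x80;
    }
    return count;
}
```

Because the per-word work is little more than the popcount itself, any slowdown in that step shows up almost directly in benchmarks of such a loop.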

@bgamari
Contributor

bgamari commented Jul 15, 2022

The failure in #454 is due not to text but rather bytestring, which text links against. See haskell/bytestring#497.

This should be fixed in the bytestring shipped with 9.2.3.

@bgamari
Contributor

bgamari commented Jul 15, 2022

#454 also neglects to disable the simdutf8 flag, which introduces a dependency on libstdc++, which will of course fail for the same reason described in the last paragraph of GHC #20878.

@wz1000
Contributor Author

wz1000 commented Jul 15, 2022

Like Ben said, there are two mostly independent issues here:

  1. The RTS linker does not know about __popcountdi2, so it is unable to link any code referencing it. This is tracked by #450 (text-2.* -simdutf8 is broken with statically linked GHC).
  2. extra-libraries: gcc_s is in itself problematic. This is tracked by #456 (Linking against gcc_s is problematic on windows).

#450 is only triggered if we don't link against extra-libraries: gcc_s. So it is triggered in the alpine job, and in the windows job if we remove the extra-libraries section.

The alpine CI job I added demonstrates issue #450. The windows CI job added by this PR demonstrates #456.

@Bodigrim
Contributor

@wz1000 could you please check performance impact of your change?

and avoid linking against gcc/gcc_s on all platforms.

This works around https://gitlab.haskell.org/ghc/ghc/-/issues/21787
and https://gitlab.haskell.org/ghc/ghc/-/issues/19900 which cause
problems when GHC's RTS linker tries to load `text`, which occurs if
you use a statically linked GHC to compile a file with a TH splice that
depends on `text`.

Fixes haskell#450

@wz1000
Contributor Author

wz1000 commented Jul 18, 2022

Here are the results of running cabal bench --benchmark-options='-p length' with GHC 9.2.3

Before: https://gist.github.com/wz1000/7604ab1663d8612bbd8485b13ae22f86#file-before-csv
After: https://gist.github.com/wz1000/7604ab1663d8612bbd8485b13ae22f86#file-after-csv

@wz1000
Contributor Author

wz1000 commented Jul 18, 2022

Comparisons using --baseline:

Running 1 benchmarks...
Benchmark text-benchmarks: RUNNING...
All
  Pure
    tiny
      length
        cons
          Text:     OK (0.23s)
            22.2 ns ± 1.3 ns
          LazyText: OK (0.24s)
            25.4 ns ± 1.4 ns
        decode
          Text:     OK (0.21s)
            40.6 ns ± 2.7 ns
          LazyText: OK (0.29s)
            125  ns ±  10 ns
        drop
          Text:     OK (0.17s)
            30.7 ns ± 2.6 ns,  8% faster than baseline
          LazyText: OK (0.17s)
            31.3 ns ± 3.0 ns, 10% faster than baseline
        filter
          Text:     OK (0.58s)
            15.6 ns ± 382 ps,  6% faster than baseline
          LazyText: OK (0.96s)
            26.3 ns ± 638 ps,  7% faster than baseline
        filter.filter
          Text:     OK (0.17s)
            15.0 ns ± 1.4 ns,  9% faster than baseline
          LazyText: OK (0.26s)
            26.2 ns ± 1.4 ns,  9% faster than baseline
        init
          Text:     OK (0.37s)
            19.4 ns ± 1.9 ns
          LazyText: OK (0.25s)
            24.1 ns ± 2.0 ns, 10% faster than baseline
        intercalate
          Text:     OK (0.24s)
            24.2 ns ± 2.0 ns, 18% faster than baseline
          LazyText: OK (0.26s)
            26.1 ns ± 2.1 ns, 10% faster than baseline
        intersperse
          Text:     OK (0.48s)
            26.4 ns ± 1.4 ns,  6% faster than baseline
          LazyText: OK (0.16s)
            27.5 ns ± 2.6 ns, 10% faster than baseline
        map
          Text:     OK (0.25s)
            23.8 ns ± 1.6 ns,  7% faster than baseline
          LazyText: OK (0.27s)
            26.3 ns ± 2.1 ns, 12% faster than baseline
        map.map
          Text:     OK (0.25s)
            23.8 ns ± 1.7 ns
          LazyText: OK (0.27s)
            26.3 ns ± 1.3 ns
        replicate char
          Text:     OK (0.23s)
            21.6 ns ± 1.4 ns, 12% faster than baseline
          LazyText: OK (0.21s)
            16.8 ns ± 1.3 ns, 13% faster than baseline
        replicate string
          Text:     OK (0.24s)
            23.1 ns ± 1.9 ns, 20% faster than baseline
          LazyText: OK (0.21s)
            19.5 ns ± 1.5 ns, 19% faster than baseline
        take
          Text:     OK (0.23s)
            21.4 ns ± 1.3 ns, 20% faster than baseline
          LazyText: OK (0.27s)
            26.6 ns ± 1.5 ns, 19% faster than baseline
        tail
          Text:     OK (0.25s)
            24.0 ns ± 1.6 ns,  6% faster than baseline
          LazyText: OK (0.29s)
            28.5 ns ± 1.4 ns
        toLower
          Text:     OK (0.28s)
            111  ns ± 5.9 ns
          LazyText: OK (0.86s)
            193  ns ± 4.5 ns, 11% faster than baseline
        toUpper
          Text:     OK (0.17s)
            120  ns ±  11 ns,  9% faster than baseline
          LazyText: OK (0.28s)
            222  ns ±  10 ns,  9% faster than baseline
        words
          Text:     OK (0.22s)
            20.5 ns ± 2.0 ns, 13% faster than baseline
          LazyText: OK (0.20s)
            35.4 ns ± 2.6 ns, 11% faster than baseline
        zipWith
          Text:     OK (0.25s)
            25.4 ns ± 1.4 ns, 13% faster than baseline
          LazyText: OK (0.18s)
            31.5 ns ± 2.7 ns
    ascii-small
      length
        cons
          Text:     OK (0.19s)
            21.9 μs ± 1.5 μs, 81% slower than baseline
          LazyText: OK (0.21s)
            22.1 μs ± 2.1 μs, 79% slower than baseline
        decode
          Text:     OK (0.51s)
            28.5 μs ± 2.1 μs, 58% slower than baseline
          LazyText: OK (1.02s)
            28.2 μs ± 595 ns, 41% slower than baseline
        drop
          Text:     OK (0.20s)
            22.7 μs ± 1.7 μs, 105% slower than baseline
          LazyText: OK (0.20s)
            22.4 μs ± 2.1 μs, 89% slower than baseline
        filter
          Text:     OK (0.28s)
            128  μs ± 6.2 μs
          LazyText: OK (0.16s)
            133  μs ±  11 μs
        filter.filter
          Text:     OK (0.12s)
            128  μs ±  11 μs
          LazyText: OK (0.16s)
            132  μs ±  11 μs
        init
          Text:     OK (0.19s)
            21.7 μs ± 1.6 μs, 114% slower than baseline
          LazyText: OK (0.40s)
            23.1 μs ± 1.5 μs, 124% slower than baseline
        intercalate
          Text:     OK (0.17s)
            33.6 μs ± 2.8 μs, 53% slower than baseline
          LazyText: OK (0.17s)
            35.2 μs ± 2.9 μs, 37% slower than baseline
        intersperse
          Text:     OK (0.22s)
            23.4 μs ± 1.6 μs, 129% slower than baseline
          LazyText: OK (1.59s)
            22.8 μs ± 779 ns, 122% slower than baseline
        map
          Text:     OK (0.24s)
            24.5 μs ± 1.3 μs, 110% slower than baseline
          LazyText: OK (0.84s)
            24.8 μs ± 539 ns, 137% slower than baseline
        map.map
          Text:     OK (0.23s)
            24.0 μs ± 2.0 μs, 133% slower than baseline
          LazyText: OK (0.23s)
            24.6 μs ± 1.5 μs, 134% slower than baseline
        replicate char
          Text:     OK (0.26s)
            23.4 ns ± 1.6 ns,  7% slower than baseline
          LazyText: OK (0.22s)
            19.5 ns ± 1.5 ns
        replicate string
          Text:     OK (0.47s)
            24.3 ns ± 1.4 ns
          LazyText: OK (0.23s)
            20.3 ns ± 1.3 ns, 11% faster than baseline
        take
          Text:     OK (0.22s)
            21.4 μs ± 1.3 μs, 170% slower than baseline
          LazyText: OK (0.20s)
            21.5 μs ± 1.3 μs, 167% slower than baseline
        tail
          Text:     OK (0.18s)
            32.6 μs ± 2.7 μs, 175% slower than baseline
          LazyText: OK (0.18s)
            32.7 μs ± 2.7 μs, 186% slower than baseline
        toLower
          Text:     OK (0.70s)
            1.20 ms ± 101 μs, 11% faster than baseline
          LazyText: OK (0.12s)
            1.75 ms ± 170 μs
        toUpper
          Text:     OK (0.22s)
            1.71 ms ±  99 μs
          LazyText: OK (0.29s)
            2.12 ms ±  99 μs
        words
          Text:     OK (0.17s)
            291  μs ±  28 μs
          LazyText: OK (0.32s)
            560  μs ±  27 μs,  9% faster than baseline
        zipWith
          Text:     OK (0.22s)
            22.7 μs ± 1.7 μs, 104% slower than baseline
          LazyText: OK (0.21s)
            23.2 μs ± 1.6 μs, 102% slower than baseline
    ascii
      length
        cons
          Text:     OK (0.64s)
            18.4 ms ± 1.1 ms, 78% slower than baseline
          LazyText: OK (0.31s)
            19.0 ms ± 1.6 ms, 90% slower than baseline
        decode
          Text:     OK (1.34s)
            27.2 ms ± 707 μs, 56% slower than baseline
          LazyText: OK (1.32s)
            26.9 ms ± 2.0 ms, 43% slower than baseline
        drop
          Text:     OK (0.40s)
            18.7 ms ± 1.4 ms, 106% slower than baseline
          LazyText: OK (0.40s)
            18.9 ms ± 1.6 ms, 106% slower than baseline
        filter
          Text:     OK (0.51s)
            110  ms ± 2.9 ms
          LazyText: OK (0.52s)
            115  ms ± 3.0 ms, 13% faster than baseline
        filter.filter
          Text:     OK (0.51s)
            110  ms ± 3.7 ms
          LazyText: OK (0.46s)
            121  ms ± 9.8 ms
        init
          Text:     OK (0.39s)
            18.4 ms ± 1.4 ms, 77% slower than baseline
          LazyText: OK (0.40s)
            18.6 ms ± 1.4 ms, 75% slower than baseline
        intercalate
          Text:     OK (0.37s)
            26.9 ms ± 1.5 ms, 23% slower than baseline
          LazyText: OK (0.37s)
            27.8 ms ± 1.4 ms, 36% slower than baseline
        intersperse
          Text:     OK (0.40s)
            18.4 ms ± 1.8 ms, 83% slower than baseline
          LazyText: OK (0.40s)
            18.9 ms ± 1.5 ms, 79% slower than baseline
        map
          Text:     OK (0.40s)
            18.4 ms ± 1.4 ms, 89% slower than baseline
          LazyText: OK (0.67s)
            20.4 ms ± 921 μs, 111% slower than baseline
        map.map
          Text:     OK (0.31s)
            19.9 ms ± 1.6 ms, 117% slower than baseline
          LazyText: OK (0.43s)
            20.5 ms ± 1.5 ms, 106% slower than baseline
        replicate char
          Text:     OK (2.46s)
            22.4 ns ± 1.9 ns
          LazyText: OK (2.38s)
            17.7 ns ± 1.4 ns
        replicate string
          Text:     OK (2.35s)
            24.5 ns ± 1.7 ns
          LazyText: OK (2.36s)
            28.0 ns ± 2.7 ns, 11% slower than baseline
        take
          Text:     OK (0.90s)
            12.3 ms ± 471 μs, 78% slower than baseline
          LazyText: OK (0.46s)
            12.7 ms ± 673 μs, 69% slower than baseline
        tail
          Text:     OK (0.30s)
            19.1 ms ± 1.7 ms, 92% slower than baseline
          LazyText: OK (0.43s)
            20.6 ms ± 1.5 ms, 94% slower than baseline
        toLower
          Text:     OK (3.35s)
            1.056 s ±  23 ms, 19% faster than baseline
          LazyText: OK (4.49s)
            1.426 s ± 107 ms, 19% faster than baseline
        toUpper
          Text:     OK (4.16s)
            1.328 s ±  33 ms, 17% faster than baseline
          LazyText: OK (5.27s)
            1.733 s ± 136 ms, 12% faster than baseline
        words
          Text:     OK (0.86s)
            227  ms ± 7.9 ms, 10% faster than baseline
          LazyText: OK (1.65s)
            490  ms ±  22 ms, 18% faster than baseline
        zipWith
          Text:     OK (0.41s)
            17.7 ms ± 1.5 ms, 67% slower than baseline
          LazyText: OK (0.41s)
            17.8 ms ± 1.4 ms, 71% slower than baseline
    english
      length
        cons
          Text:     OK (0.23s)
            1.15 ms ±  87 μs, 67% slower than baseline
          LazyText: OK (0.38s)
            1.19 ms ±  90 μs, 77% slower than baseline
        decode
          Text:     OK (0.52s)
            1.64 ms ±  90 μs, 54% slower than baseline
          LazyText: OK (1.04s)
            1.78 ms ±  83 μs, 45% slower than baseline
        drop
          Text:     OK (0.77s)
            1.38 ms ±  39 μs, 107% slower than baseline
          LazyText: OK (0.39s)
            1.25 ms ± 105 μs, 80% slower than baseline
        filter
          Text:     OK (0.27s)
            7.49 ms ± 491 μs
          LazyText: OK (0.15s)
            8.06 ms ± 734 μs
        filter.filter
          Text:     OK (0.27s)
            7.60 ms ± 658 μs, 11% faster than baseline
          LazyText: OK (0.14s)
            7.87 ms ± 698 μs
        init
          Text:     OK (0.19s)
            1.26 ms ±  91 μs, 97% slower than baseline
          LazyText: OK (0.22s)
            1.27 ms ± 102 μs, 93% slower than baseline
        intercalate
          Text:     OK (0.17s)
            1.83 ms ± 172 μs, 28% slower than baseline
          LazyText: OK (0.29s)
            1.83 ms ± 160 μs, 21% slower than baseline
        intersperse
          Text:     OK (0.17s)
            1.23 ms ±  94 μs, 96% slower than baseline
          LazyText: OK (0.22s)
            1.26 ms ±  98 μs, 104% slower than baseline
        map
          Text:     OK (0.22s)
            1.23 ms ±  96 μs, 76% slower than baseline
          LazyText: OK (0.22s)
            1.26 ms ±  89 μs, 96% slower than baseline
        map.map
          Text:     OK (0.22s)
            1.23 ms ± 107 μs, 67% slower than baseline
          LazyText: OK (0.22s)
            1.26 ms ±  89 μs, 76% slower than baseline
        replicate char
          Text:     OK (0.36s)
            20.6 ns ± 1.8 ns, 20% faster than baseline
          LazyText: OK (0.34s)
            16.9 ns ± 1.5 ns, 17% faster than baseline
        replicate string
          Text:     OK (0.41s)
            24.4 ns ± 1.3 ns, 17% faster than baseline
          LazyText: OK (0.88s)
            19.3 ns ± 324 ps, 21% faster than baseline
        take
          Text:     OK (0.28s)
            845  μs ±  68 μs, 76% slower than baseline
          LazyText: OK (0.28s)
            851  μs ±  50 μs, 81% slower than baseline
        tail
          Text:     OK (0.22s)
            1.29 ms ±  91 μs, 84% slower than baseline
          LazyText: OK (0.22s)
            1.28 ms ± 110 μs, 81% slower than baseline
        toLower
          Text:     OK (0.21s)
            67.1 ms ± 5.8 ms, 15% faster than baseline
          LazyText: OK (0.29s)
            95.0 ms ± 3.0 ms, 11% faster than baseline
        toUpper
          Text:     OK (6.26s)
            92.7 ms ± 8.5 ms, 12% faster than baseline
          LazyText: OK (0.37s)
            117  ms ± 3.6 ms, 11% faster than baseline
        words
          Text:     OK (0.25s)
            15.5 ms ± 671 μs, 15% faster than baseline
          LazyText: OK (0.26s)
            32.7 ms ± 2.6 ms, 15% faster than baseline
        zipWith
          Text:     OK (0.70s)
            1.23 ms ±  36 μs, 72% slower than baseline
          LazyText: OK (0.39s)
            1.25 ms ±  95 μs, 74% slower than baseline
    russian
      length
        cons
          Text:     OK (0.14s)
            3.64 μs ± 344 ns, 74% slower than baseline
          LazyText: OK (0.14s)
            3.59 μs ± 348 ns, 69% slower than baseline
        decode
          Text:     OK (0.77s)
            5.45 μs ± 267 ns, 27% slower than baseline
          LazyText: OK (0.77s)
            5.38 μs ± 281 ns, 22% slower than baseline
        drop
          Text:     OK (0.27s)
            3.56 μs ± 207 ns, 83% slower than baseline
          LazyText: OK (0.27s)
            3.52 μs ± 196 ns, 79% slower than baseline
        filter
          Text:     OK (0.20s)
            21.4 μs ± 1.6 μs,  8% faster than baseline
          LazyText: OK (0.24s)
            25.2 μs ± 1.7 μs,  9% faster than baseline
        filter.filter
          Text:     OK (0.37s)
            20.3 μs ± 921 ns,  9% faster than baseline
          LazyText: OK (0.23s)
            24.8 μs ± 1.7 μs, 10% faster than baseline
        init
          Text:     OK (0.25s)
            3.47 μs ± 303 ns, 83% slower than baseline
          LazyText: OK (0.26s)
            3.34 μs ± 213 ns, 71% slower than baseline
        intercalate
          Text:     OK (0.31s)
            4.22 μs ± 233 ns, 28% slower than baseline
          LazyText: OK (0.31s)
            4.32 μs ± 350 ns
        intersperse
          Text:     OK (0.25s)
            3.43 μs ± 256 ns, 29% slower than baseline
          LazyText: OK (0.15s)
            3.40 μs ± 339 ns, 78% slower than baseline
        map
          Text:     OK (0.15s)
            3.47 μs ± 346 ns, 81% slower than baseline
          LazyText: OK (0.26s)
            3.45 μs ± 300 ns, 81% slower than baseline
        map.map
          Text:     OK (0.26s)
            3.45 μs ± 317 ns, 81% slower than baseline
          LazyText: OK (0.26s)
            3.38 μs ± 260 ns, 77% slower than baseline
        replicate char
          Text:     OK (0.22s)
            19.9 ns ± 1.3 ns, 13% faster than baseline
          LazyText: OK (0.62s)
            16.8 ns ± 902 ps, 11% faster than baseline
        replicate string
          Text:     OK (0.23s)
            21.5 ns ± 1.4 ns, 20% faster than baseline
          LazyText: OK (0.21s)
            18.1 ns ± 1.6 ns, 19% faster than baseline
        take
          Text:     OK (0.19s)
            2.28 μs ± 183 ns, 73% slower than baseline
          LazyText: OK (0.70s)
            2.57 μs ±  82 ns, 92% slower than baseline
        tail
          Text:     OK (0.17s)
            3.88 μs ± 335 ns, 102% slower than baseline
          LazyText: OK (0.30s)
            4.00 μs ± 244 ns, 112% slower than baseline
        toLower
          Text:     OK (0.27s)
            116  μs ± 7.0 μs
          LazyText: OK (0.19s)
            159  μs ±  13 μs
        toUpper
          Text:     OK (0.19s)
            160  μs ±  13 μs
          LazyText: OK (0.24s)
            203  μs ±  13 μs
        words
          Text:     OK (0.61s)
            34.5 μs ± 780 ns,  9% faster than baseline
          LazyText: OK (0.31s)
            72.6 μs ± 2.7 μs
        zipWith
          Text:     OK (0.26s)
            3.42 μs ± 314 ns, 74% slower than baseline
          LazyText: OK (0.25s)
            3.42 μs ± 175 ns, 73% slower than baseline
    japanese
      length
        cons
          Text:     OK (0.28s)
            3.66 μs ± 226 ns, 78% slower than baseline
          LazyText: OK (0.28s)
            3.62 μs ± 188 ns, 77% slower than baseline
        decode
          Text:     OK (0.81s)
            5.54 μs ± 182 ns, 31% slower than baseline
          LazyText: OK (1.67s)
            6.09 μs ± 109 ns, 34% slower than baseline
        drop
          Text:     OK (0.32s)
            4.19 μs ± 271 ns, 96% slower than baseline
          LazyText: OK (0.18s)
            4.27 μs ± 329 ns, 103% slower than baseline
        filter
          Text:     OK (0.25s)
            12.6 μs ± 664 ns
          LazyText: OK (0.15s)
            16.0 μs ± 1.5 μs
        filter.filter
          Text:     OK (0.25s)
            12.9 μs ± 791 ns
          LazyText: OK (0.14s)
            14.7 μs ± 1.4 μs, 15% faster than baseline
        init
          Text:     OK (0.16s)
            3.64 μs ± 353 ns, 73% slower than baseline
          LazyText: OK (0.53s)
            3.63 μs ± 274 ns, 75% slower than baseline
        intercalate
          Text:     OK (0.21s)
            5.36 μs ± 518 ns, 23% slower than baseline
          LazyText: OK (0.24s)
            6.02 μs ± 564 ns, 24% slower than baseline
        intersperse
          Text:     OK (0.28s)
            3.64 μs ± 232 ns, 77% slower than baseline
          LazyText: OK (0.16s)
            3.64 μs ± 360 ns, 78% slower than baseline
        map
          Text:     OK (0.29s)
            3.74 μs ± 321 ns, 83% slower than baseline
          LazyText: OK (0.28s)
            3.63 μs ± 237 ns, 79% slower than baseline
        map.map
          Text:     OK (0.16s)
            3.63 μs ± 350 ns, 84% slower than baseline
          LazyText: OK (0.28s)
            3.62 μs ± 175 ns, 86% slower than baseline
        replicate char
          Text:     OK (0.22s)
            19.4 ns ± 1.5 ns, 10% faster than baseline
          LazyText: OK (0.19s)
            15.7 ns ± 1.3 ns, 11% faster than baseline
        replicate string
          Text:     OK (0.24s)
            21.8 ns ± 2.0 ns, 17% faster than baseline
          LazyText: OK (0.21s)
            18.6 ns ± 1.5 ns, 16% faster than baseline
        take
          Text:     OK (0.20s)
            2.50 μs ± 187 ns, 71% slower than baseline
          LazyText: OK (0.20s)
            2.51 μs ± 181 ns, 74% slower than baseline
        tail
          Text:     OK (0.16s)
            3.76 μs ± 340 ns, 96% slower than baseline
          LazyText: OK (0.16s)
            3.81 μs ± 380 ns, 78% slower than baseline
        toLower
          Text:     OK (0.60s)
            68.0 μs ± 2.8 μs, 13% faster than baseline
          LazyText: OK (0.44s)
            98.9 μs ± 8.1 μs
        toUpper
          Text:     OK (0.17s)
            65.7 μs ± 5.4 μs, 13% faster than baseline
          LazyText: OK (0.22s)
            93.5 μs ± 6.9 μs, 19% faster than baseline
        words
          Text:     OK (0.20s)
            47.2 μs ± 3.8 μs, 22% faster than baseline
          LazyText: OK (0.37s)
            40.4 μs ± 2.2 μs, 16% faster than baseline
        zipWith
          Text:     OK (0.28s)
            3.69 μs ± 204 ns, 74% slower than baseline
          LazyText: OK (0.27s)
            3.65 μs ± 221 ns, 59% slower than baseline

All 216 tests passed (108.34s)
Benchmark text-benchmarks: FINISH

@wz1000
Contributor Author

wz1000 commented Jul 18, 2022

Tried to improve the implementation of popcount16. Results at https://gist.github.com/wz1000/7604ab1663d8612bbd8485b13ae22f86#file-after-improved-csv.

Baseline is before.csv, which is the tip of text-2.0 master (971051b).

All
  Pure
    tiny
      length
        cons
          Text:     OK (0.22s)
            21.4 ns ± 1.3 ns
          LazyText: OK (0.23s)
            23.5 ns ± 1.5 ns,  9% faster than baseline
        decode
          Text:     OK (0.21s)
            40.7 ns ± 3.2 ns
          LazyText: OK (0.29s)
            118  ns ± 5.4 ns, 10% faster than baseline
        drop
          Text:     OK (0.29s)
            29.6 ns ± 1.8 ns, 12% faster than baseline
          LazyText: OK (0.31s)
            31.5 ns ± 2.1 ns,  9% faster than baseline
        filter
          Text:     OK (0.17s)
            15.3 ns ± 1.4 ns,  8% faster than baseline
          LazyText: OK (0.27s)
            26.7 ns ± 1.7 ns,  6% faster than baseline
        filter.filter
          Text:     OK (0.29s)
            15.2 ns ± 692 ps,  8% faster than baseline
          LazyText: OK (0.50s)
            27.3 ns ± 2.2 ns
        init
          Text:     OK (0.36s)
            18.8 ns ± 1.2 ns, 11% faster than baseline
          LazyText: OK (0.25s)
            24.6 ns ± 2.0 ns,  8% faster than baseline
        intercalate
          Text:     OK (0.26s)
            25.2 ns ± 1.6 ns, 14% faster than baseline
          LazyText: OK (0.26s)
            26.0 ns ± 1.5 ns, 10% faster than baseline
        intersperse
          Text:     OK (0.26s)
            25.4 ns ± 2.0 ns, 10% faster than baseline
          LazyText: OK (0.27s)
            26.0 ns ± 2.3 ns, 15% faster than baseline
        map
          Text:     OK (0.23s)
            22.2 ns ± 2.2 ns, 13% faster than baseline
          LazyText: OK (0.25s)
            24.5 ns ± 2.1 ns, 18% faster than baseline
        map.map
          Text:     OK (0.43s)
            22.7 ns ± 1.0 ns, 11% faster than baseline
          LazyText: OK (0.25s)
            24.5 ns ± 2.2 ns, 10% faster than baseline
        replicate char
          Text:     OK (0.21s)
            19.7 ns ± 1.7 ns, 20% faster than baseline
          LazyText: OK (0.18s)
            15.7 ns ± 1.4 ns, 18% faster than baseline
        replicate string
          Text:     OK (0.24s)
            23.1 ns ± 1.7 ns, 20% faster than baseline
          LazyText: OK (0.20s)
            18.6 ns ± 1.3 ns, 23% faster than baseline
        take
          Text:     OK (0.23s)
            21.5 ns ± 2.1 ns, 20% faster than baseline
          LazyText: OK (0.26s)
            26.1 ns ± 1.8 ns, 20% faster than baseline
        tail
          Text:     OK (0.23s)
            21.9 ns ± 1.8 ns, 15% faster than baseline
          LazyText: OK (0.27s)
            26.3 ns ± 1.4 ns,  8% faster than baseline
        toLower
          Text:     OK (0.26s)
            103  ns ± 5.9 ns, 15% faster than baseline
          LazyText: OK (0.21s)
            183  ns ±  13 ns, 16% faster than baseline
        toUpper
          Text:     OK (0.99s)
            113  ns ± 3.0 ns, 14% faster than baseline
          LazyText: OK (0.27s)
            214  ns ±  11 ns, 12% faster than baseline
        words
          Text:     OK (0.40s)
            20.8 ns ± 1.0 ns, 11% faster than baseline
          LazyText: OK (0.18s)
            35.1 ns ± 3.0 ns, 12% faster than baseline
        zipWith
          Text:     OK (0.46s)
            25.4 ns ± 1.8 ns, 13% faster than baseline
          LazyText: OK (0.18s)
            32.2 ns ± 3.0 ns
    ascii-small
      length
        cons
          Text:     OK (0.32s)
            9.26 μs ± 716 ns, 23% faster than baseline
          LazyText: OK (0.18s)
            9.52 μs ± 920 ns, 22% faster than baseline
        decode
          Text:     OK (0.58s)
            15.0 μs ± 1.4 μs, 16% faster than baseline
          LazyText: OK (0.60s)
            15.5 μs ± 906 ns, 22% faster than baseline
        drop
          Text:     OK (0.33s)
            9.61 μs ± 772 ns, 13% faster than baseline
          LazyText: OK (0.19s)
            9.59 μs ± 668 ns, 19% faster than baseline
        filter
          Text:     OK (0.16s)
            133  μs ±  12 μs
          LazyText: OK (0.30s)
            136  μs ±  13 μs
        filter.filter
          Text:     OK (0.16s)
            131  μs ±  11 μs
          LazyText: OK (0.16s)
            137  μs ±  11 μs
        init
          Text:     OK (0.17s)
            9.22 μs ± 687 ns,  9% faster than baseline
          LazyText: OK (0.19s)
            9.56 μs ± 797 ns
        intercalate
          Text:     OK (0.19s)
            19.7 μs ± 1.4 μs, 10% faster than baseline
          LazyText: OK (0.40s)
            22.2 μs ± 1.0 μs, 13% faster than baseline
        intersperse
          Text:     OK (0.64s)
            9.15 μs ± 306 ns,  9% faster than baseline
          LazyText: OK (0.19s)
            9.41 μs ± 688 ns
        map
          Text:     OK (0.18s)
            9.16 μs ± 719 ns, 21% faster than baseline
          LazyText: OK (0.19s)
            9.37 μs ± 809 ns, 10% faster than baseline
        map.map
          Text:     OK (0.18s)
            9.09 μs ± 775 ns, 11% faster than baseline
          LazyText: OK (0.18s)
            8.96 μs ± 781 ns, 14% faster than baseline
        replicate char
          Text:     OK (0.22s)
            19.6 ns ± 1.9 ns, 10% faster than baseline
          LazyText: OK (0.32s)
            16.0 ns ± 862 ps, 19% faster than baseline
        replicate string
          Text:     OK (0.25s)
            23.6 ns ± 1.5 ns
          LazyText: OK (0.22s)
            19.3 ns ± 1.5 ns, 15% faster than baseline
        take
          Text:     OK (0.24s)
            6.17 μs ± 337 ns, 21% faster than baseline
          LazyText: OK (0.24s)
            6.23 μs ± 427 ns, 22% faster than baseline
        tail
          Text:     OK (0.18s)
            9.76 μs ± 844 ns, 17% faster than baseline
          LazyText: OK (0.19s)
            9.88 μs ± 814 ns, 13% faster than baseline
        toLower
          Text:     OK (0.32s)
            1.15 ms ±  82 μs, 14% faster than baseline
          LazyText: OK (0.22s)
            1.57 ms ±  97 μs, 11% faster than baseline
        toUpper
          Text:     OK (0.22s)
            1.63 ms ± 133 μs
          LazyText: OK (0.28s)
            2.04 ms ± 114 μs
        words
          Text:     OK (0.16s)
            271  μs ±  27 μs
          LazyText: OK (0.17s)
            573  μs ±  44 μs
        zipWith
          Text:     OK (0.64s)
            9.21 μs ± 374 ns, 16% faster than baseline
          LazyText: OK (0.17s)
            9.52 μs ± 697 ns, 16% faster than baseline
    ascii
      length
        cons
          Text:     OK (0.67s)
            7.83 ms ± 535 μs, 23% faster than baseline
          LazyText: OK (0.37s)
            8.05 ms ± 739 μs, 18% faster than baseline
        decode
          Text:     OK (1.54s)
            15.2 ms ± 1.4 ms, 12% faster than baseline
          LazyText: OK (1.49s)
            14.8 ms ± 562 μs, 20% faster than baseline
        drop
          Text:     OK (0.68s)
            8.08 ms ± 416 μs, 10% faster than baseline
          LazyText: OK (0.39s)
            8.21 ms ± 711 μs, 10% faster than baseline
        filter
          Text:     OK (0.99s)
            105  ms ± 2.5 ms,  6% faster than baseline
          LazyText: OK (0.49s)
            108  ms ± 5.7 ms, 18% faster than baseline
        filter.filter
          Text:     OK (0.29s)
            104  ms ± 3.3 ms,  9% faster than baseline
          LazyText: OK (0.51s)
            112  ms ± 7.1 ms
        init
          Text:     OK (0.28s)
            8.06 ms ± 718 μs, 22% faster than baseline
          LazyText: OK (0.39s)
            8.29 ms ± 692 μs, 21% faster than baseline
        intercalate
          Text:     OK (0.38s)
            16.9 ms ± 1.5 ms, 22% faster than baseline
          LazyText: OK (0.53s)
            17.6 ms ± 809 μs, 13% faster than baseline
        intersperse
          Text:     OK (0.47s)
            8.06 ms ± 756 μs, 19% faster than baseline
          LazyText: OK (0.72s)
            8.47 ms ± 363 μs, 19% faster than baseline
        map
          Text:     OK (0.48s)
            8.24 ms ± 769 μs, 15% faster than baseline
          LazyText: OK (0.40s)
            8.43 ms ± 704 μs, 12% faster than baseline
        map.map
          Text:     OK (0.66s)
            7.75 ms ± 493 μs, 15% faster than baseline
          LazyText: OK (0.92s)
            7.93 ms ± 394 μs, 20% faster than baseline
        replicate char
          Text:     OK (2.21s)
            20.0 ns ± 1.3 ns
          LazyText: OK (2.72s)
            16.1 ns ± 688 ps
        replicate string
          Text:     OK (2.56s)
            23.3 ns ± 1.0 ns
          LazyText: OK (3.45s)
            17.9 ns ± 212 ps, 28% faster than baseline
        take
          Text:     OK (0.58s)
            5.07 ms ± 396 μs, 26% faster than baseline
          LazyText: OK (0.50s)
            5.34 ms ± 440 μs, 28% faster than baseline
        tail
          Text:     OK (0.39s)
            8.46 ms ± 841 μs, 14% faster than baseline
          LazyText: OK (0.39s)
            8.27 ms ± 722 μs, 22% faster than baseline
        toLower
          Text:     OK (3.14s)
            1.016 s ± 3.2 ms, 22% faster than baseline
          LazyText: OK (4.11s)
            1.347 s ±  71 ms, 23% faster than baseline
        toUpper
          Text:     OK (4.17s)
            1.329 s ±  32 ms, 17% faster than baseline
          LazyText: OK (5.19s)
            1.700 s ±  42 ms, 14% faster than baseline
        words
          Text:     OK (0.84s)
            225  ms ± 8.9 ms, 11% faster than baseline
          LazyText: OK (1.64s)
            488  ms ±  14 ms, 18% faster than baseline
        zipWith
          Text:     OK (0.60s)
            7.91 ms ± 377 μs, 24% faster than baseline
          LazyText: OK (0.61s)
            8.37 ms ± 757 μs, 19% faster than baseline
    english
      length
        cons
          Text:     OK (0.38s)
            549  μs ±  43 μs, 20% faster than baseline
          LazyText: OK (0.36s)
            556  μs ±  27 μs, 17% faster than baseline
        decode
          Text:     OK (1.77s)
            776  μs ±  23 μs, 26% faster than baseline
          LazyText: OK (1.15s)
            1.01 ms ±  95 μs, 16% faster than baseline
        drop
          Text:     OK (0.21s)
            554  μs ±  54 μs, 16% faster than baseline
          LazyText: OK (0.19s)
            540  μs ±  46 μs, 21% faster than baseline
        filter
          Text:     OK (0.27s)
            7.39 ms ± 371 μs,  9% faster than baseline
          LazyText: OK (0.26s)
            7.72 ms ± 371 μs,  7% faster than baseline
        filter.filter
          Text:     OK (0.27s)
            7.32 ms ± 435 μs, 14% faster than baseline
          LazyText: OK (0.14s)
            7.60 ms ± 752 μs
        init
          Text:     OK (0.30s)
            518  μs ±  23 μs, 18% faster than baseline
          LazyText: OK (0.35s)
            542  μs ±  31 μs, 17% faster than baseline
        intercalate
          Text:     OK (0.34s)
            1.11 ms ±  62 μs, 21% faster than baseline
          LazyText: OK (0.37s)
            1.19 ms ±  84 μs, 20% faster than baseline
        intersperse
          Text:     OK (0.20s)
            532  μs ±  52 μs, 14% faster than baseline
          LazyText: OK (0.21s)
            545  μs ±  51 μs, 11% faster than baseline
        map
          Text:     OK (0.35s)
            527  μs ±  31 μs, 24% faster than baseline
          LazyText: OK (0.64s)
            536  μs ±  13 μs, 16% faster than baseline
        map.map
          Text:     OK (0.34s)
            499  μs ±  31 μs, 32% faster than baseline
          LazyText: OK (4.54s)
            538  μs ±  27 μs, 24% faster than baseline
        replicate char
          Text:     OK (0.36s)
            19.9 ns ± 1.7 ns, 23% faster than baseline
          LazyText: OK (0.46s)
            15.7 ns ± 1.2 ns, 22% faster than baseline
        replicate string
          Text:     OK (0.57s)
            22.6 ns ± 1.2 ns, 23% faster than baseline
          LazyText: OK (0.51s)
            18.4 ns ± 1.6 ns, 24% faster than baseline
        take
          Text:     OK (0.24s)
            332  μs ±  22 μs, 30% faster than baseline
          LazyText: OK (0.25s)
            346  μs ±  27 μs, 26% faster than baseline
        tail
          Text:     OK (0.20s)
            528  μs ±  47 μs, 24% faster than baseline
          LazyText: OK (0.34s)
            506  μs ±  38 μs, 28% faster than baseline
        toLower
          Text:     OK (0.21s)
            65.0 ms ± 3.4 ms, 18% faster than baseline
          LazyText: OK (0.28s)
            88.3 ms ± 5.3 ms, 17% faster than baseline
        toUpper
          Text:     OK (0.28s)
            89.2 ms ± 3.2 ms, 16% faster than baseline
          LazyText: OK (0.35s)
            111  ms ± 9.1 ms, 15% faster than baseline
        words
          Text:     OK (0.51s)
            15.1 ms ± 531 μs, 17% faster than baseline
          LazyText: OK (0.25s)
            31.7 ms ± 1.9 ms, 18% faster than baseline
        zipWith
          Text:     OK (0.33s)
            503  μs ±  30 μs, 29% faster than baseline
          LazyText: OK (0.34s)
            518  μs ±  52 μs, 27% faster than baseline
    russian
      length
        cons
          Text:     OK (0.24s)
            1.48 μs ±  93 ns, 29% faster than baseline
          LazyText: OK (0.24s)
            1.49 μs ± 102 ns, 29% faster than baseline
        decode
          Text:     OK (0.49s)
            3.20 μs ± 291 ns, 25% faster than baseline
          LazyText: OK (0.27s)
            3.53 μs ± 228 ns, 19% faster than baseline
        drop
          Text:     OK (0.26s)
            1.66 μs ± 112 ns, 14% faster than baseline
          LazyText: OK (0.25s)
            1.67 μs ± 100 ns, 14% faster than baseline
        filter
          Text:     OK (0.20s)
            22.3 μs ± 1.6 μs
          LazyText: OK (0.25s)
            26.8 μs ± 1.8 μs
        filter.filter
          Text:     OK (0.20s)
            22.1 μs ± 1.8 μs
          LazyText: OK (0.24s)
            26.8 μs ± 1.8 μs
        init
          Text:     OK (0.23s)
            1.57 μs ±  95 ns, 17% faster than baseline
          LazyText: OK (0.25s)
            1.59 μs ± 149 ns, 18% faster than baseline
        intercalate
          Text:     OK (0.20s)
            2.46 μs ± 232 ns, 25% faster than baseline
          LazyText: OK (0.20s)
            2.59 μs ± 254 ns, 41% faster than baseline
        intersperse
          Text:     OK (0.24s)
            1.61 μs ± 100 ns, 39% faster than baseline
          LazyText: OK (0.25s)
            1.58 μs ±  89 ns, 17% faster than baseline
        map
          Text:     OK (0.25s)
            1.57 μs ±  99 ns, 18% faster than baseline
          LazyText: OK (0.25s)
            1.58 μs ± 107 ns, 17% faster than baseline
        map.map
          Text:     OK (0.25s)
            1.59 μs ±  87 ns, 16% faster than baseline
          LazyText: OK (0.25s)
            1.58 μs ±  91 ns, 17% faster than baseline
        replicate char
          Text:     OK (0.22s)
            18.6 ns ± 1.4 ns, 18% faster than baseline
          LazyText: OK (0.32s)
            16.1 ns ± 780 ps, 14% faster than baseline
        replicate string
          Text:     OK (0.26s)
            23.2 ns ± 1.8 ns, 13% faster than baseline
          LazyText: OK (0.37s)
            18.0 ns ± 716 ps, 20% faster than baseline
        take
          Text:     OK (0.59s)
            1.04 μs ±  71 ns, 20% faster than baseline
          LazyText: OK (0.30s)
            1.04 μs ±  96 ns, 21% faster than baseline
        tail
          Text:     OK (0.25s)
            1.60 μs ± 150 ns, 16% faster than baseline
          LazyText: OK (0.25s)
            1.61 μs ± 147 ns, 14% faster than baseline
        toLower
          Text:     OK (0.26s)
            112  μs ±  11 μs
          LazyText: OK (0.17s)
            158  μs ±  13 μs
        toUpper
          Text:     OK (0.18s)
            150  μs ±  14 μs
          LazyText: OK (0.23s)
            201  μs ±  15 μs
        words
          Text:     OK (0.17s)
            33.2 μs ± 3.0 μs, 12% faster than baseline
          LazyText: OK (0.31s)
            67.7 μs ± 4.0 μs, 11% faster than baseline
        zipWith
          Text:     OK (0.22s)
            1.48 μs ±  82 ns, 24% faster than baseline
          LazyText: OK (0.24s)
            1.47 μs ±  92 ns, 25% faster than baseline
    japanese
      length
        cons
          Text:     OK (0.25s)
            1.61 μs ± 129 ns, 21% faster than baseline
          LazyText: OK (0.47s)
            1.61 μs ± 112 ns, 20% faster than baseline
        decode
          Text:     OK (0.97s)
            3.41 μs ±  54 ns, 19% faster than baseline
          LazyText: OK (0.54s)
            3.53 μs ± 243 ns, 22% faster than baseline
        drop
          Text:     OK (0.26s)
            1.65 μs ±  82 ns, 22% faster than baseline
          LazyText: OK (0.24s)
            1.67 μs ± 160 ns, 20% faster than baseline
        filter
          Text:     OK (0.20s)
            11.6 μs ± 1.0 μs, 11% faster than baseline
          LazyText: OK (0.14s)
            14.7 μs ± 1.4 μs, 14% faster than baseline
        filter.filter
          Text:     OK (0.43s)
            11.9 μs ± 340 ns, 12% faster than baseline
          LazyText: OK (0.53s)
            15.0 μs ± 1.0 μs, 14% faster than baseline
        init
          Text:     OK (0.24s)
            1.62 μs ±  84 ns, 22% faster than baseline
          LazyText: OK (0.51s)
            1.69 μs ± 132 ns, 18% faster than baseline
        intercalate
          Text:     OK (0.26s)
            3.36 μs ± 216 ns, 22% faster than baseline
          LazyText: OK (0.30s)
            3.93 μs ± 166 ns, 18% faster than baseline
        intersperse
          Text:     OK (0.25s)
            1.60 μs ± 103 ns, 22% faster than baseline
          LazyText: OK (0.26s)
            1.61 μs ± 113 ns, 21% faster than baseline
        map
          Text:     OK (0.25s)
            1.58 μs ± 145 ns, 22% faster than baseline
          LazyText: OK (0.25s)
            1.59 μs ± 145 ns, 21% faster than baseline
        map.map
          Text:     OK (0.25s)
            1.63 μs ± 123 ns, 17% faster than baseline
          LazyText: OK (0.26s)
            1.61 μs ± 121 ns, 16% faster than baseline
        replicate char
          Text:     OK (0.22s)
            19.4 ns ± 1.6 ns, 10% faster than baseline
          LazyText: OK (0.32s)
            15.1 ns ± 1.2 ns, 14% faster than baseline
        replicate string
          Text:     OK (0.25s)
            22.9 ns ± 1.7 ns, 13% faster than baseline
          LazyText: OK (0.22s)
            18.8 ns ± 1.3 ns, 15% faster than baseline
        take
          Text:     OK (0.19s)
            1.14 μs ± 102 ns, 21% faster than baseline
          LazyText: OK (0.18s)
            1.13 μs ±  93 ns, 21% faster than baseline
        tail
          Text:     OK (0.26s)
            1.72 μs ± 117 ns, 10% faster than baseline
          LazyText: OK (0.28s)
            1.81 μs ± 130 ns, 15% faster than baseline
        toLower
          Text:     OK (0.19s)
            74.9 μs ± 5.9 μs
          LazyText: OK (0.24s)
            105  μs ± 7.2 μs
        toUpper
          Text:     OK (0.17s)
            70.0 μs ± 5.4 μs,  7% faster than baseline
          LazyText: OK (0.22s)
            98.4 μs ± 6.4 μs, 15% faster than baseline
        words
          Text:     OK (0.21s)
            47.7 μs ± 4.4 μs, 21% faster than baseline
          LazyText: OK (0.21s)
            43.0 μs ± 3.9 μs, 10% faster than baseline
        zipWith
          Text:     OK (0.27s)
            1.69 μs ±  92 ns, 19% faster than baseline
          LazyText: OK (0.27s)
            1.73 μs ± 134 ns, 24% faster than baseline

All 216 tests passed (108.18s)
Benchmark text-benchmarks: FINISH

@wz1000
Contributor Author

wz1000 commented Jul 18, 2022

I've successfully validated the implementation of popcount16 using the following program:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static inline size_t popcount16(uint16_t x) {

  // Taken from https://en.wikipedia.org/wiki/Hamming_weight
  const uint16_t m1  = 0x5555; // binary: 0101...
  const uint16_t m2  = 0x3333; // binary: 00110011...
  const uint16_t m4  = 0x0f0f; // binary: 4 zeros, 4 ones...
  x -= (x >> 1) & m1;             // put count of each 2 bits into those 2 bits
  x = (x & m2) + ((x >> 2) & m2); // put count of each 4 bits into those 4 bits
  x = (x + (x >> 4)) & m4;        // put count of each 8 bits into those 8 bits
  return (x >> 8) + (x & 0x00FF); // add the counts of the two bytes
}

int main(void) {
  // Exhaustively compare against the builtin for every 16-bit value
  for (int i = 0; i <= 0xFFFF; i++) {
    size_t a, b;
    a = __builtin_popcount((uint16_t) i);
    b = popcount16((uint16_t) i);
    if (a != b) {
      // size_t must be printed with %zu, not %d
      printf("No match for %d: %zu %zu\n", i, a, b);
      exit(1);
    }
  }
  printf("All values validated\n");
  return 0;
}
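
For comparison, here is a minimal sketch of how the same Hamming-weight technique could be widened to 64 bits, which is the width `__builtin_popcountll` operates on. The names `popcount64_swar` and `popcount64_naive` are hypothetical and are not taken from the text sources; the reference implementation clears one set bit per iteration, so the check does not depend on any compiler builtin.

```c
#include <stdint.h>

/* Hypothetical 64-bit analogue of popcount16 (not from the text sources). */
static inline unsigned popcount64_swar(uint64_t x) {
  const uint64_t m1  = 0x5555555555555555ULL; /* binary: 0101... */
  const uint64_t m2  = 0x3333333333333333ULL; /* binary: 00110011... */
  const uint64_t m4  = 0x0f0f0f0f0f0f0f0fULL; /* binary: 4 zeros, 4 ones... */
  const uint64_t h01 = 0x0101010101010101ULL; /* one bit set in every byte */
  x -= (x >> 1) & m1;             /* count of each 2 bits */
  x = (x & m2) + ((x >> 2) & m2); /* count of each 4 bits */
  x = (x + (x >> 4)) & m4;        /* count of each 8 bits */
  return (unsigned)((x * h01) >> 56); /* sum of the 8 byte counts lands in the top byte */
}

/* Reference implementation: clears the lowest set bit on each iteration. */
static inline unsigned popcount64_naive(uint64_t x) {
  unsigned n = 0;
  while (x) {
    x &= x - 1;
    n++;
  }
  return n;
}
```

Unlike the 16-bit case, the 64-bit input space cannot be checked exhaustively, so a comparison against the naive loop would have to rely on edge cases and sampled values.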

@Bodigrim
Contributor

It is very surprising that software emulation of __builtin_popcount is actually faster, but hard to argue against benchmarks :)

@wz1000 I assume you want to take care of all __builtin_popcount in measure_off.c.

@wz1000
Contributor Author

wz1000 commented Jul 18, 2022

> It is very surprising that software emulation of __builtin_popcount is actually faster, but hard to argue against benchmarks :)

The problem is only triggered if GCC ends up using its own software implementation of popcount, rather than emitting the actual instruction. It is a known issue that the GCC emulation is suboptimal. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041

> @wz1000 I assume you want to take care of all __builtin_popcount in measure_off.c.

I'm reasonably confident all the other usages of the symbol are OK because they are guarded by sufficient feature flags to guarantee that GCC emits the popcount instruction for those usages. The RTS linker bug is only triggered if GCC decides to use its software emulation for popcount.
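
As an illustration of the guarding described above, the sketch below only calls the builtin when the compiler has been told the popcnt instruction is available, and otherwise uses a portable bit-twiddling routine so that no libgcc helper is needed. This is an assumption-laden sketch, not the code in measure_off.c: the function names are hypothetical, and `__POPCNT__` is the macro GCC and Clang predefine when popcnt is enabled (for example via `-mpopcnt` or `-msse4.2`); the guards actually used in text may differ.

```c
#include <stdint.h>

/* Portable fallback, same technique as popcount16 above (hypothetical name). */
static inline unsigned popcount16_portable(uint16_t x) {
  unsigned v = x;                               /* work in unsigned to avoid narrowing */
  v -= (v >> 1) & 0x5555u;                      /* count of each 2 bits */
  v = (v & 0x3333u) + ((v >> 2) & 0x3333u);     /* count of each 4 bits */
  v = (v + (v >> 4)) & 0x0f0fu;                 /* count of each 8 bits */
  return (v >> 8) + (v & 0x00FFu);              /* add the two byte counts */
}

/* Use the builtin only when the compiler can emit the popcnt instruction. */
static inline unsigned popcount16_guarded(uint16_t x) {
#if defined(__POPCNT__)
  /* popcnt is enabled, so the builtin should compile to the instruction */
  return (unsigned)__builtin_popcount(x);
#else
  /* no hardware popcnt assumed: avoid the compiler's software emulation */
  return popcount16_portable(x);
#endif
}
```

Either branch returns the same value, so callers do not need to know which one was compiled in.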

@@ -0,0 +1,12 @@
-- Simple test script for #450
Contributor

Could you please inline this into simdutf-flag-alpine.yml?

Contributor Author

I've done this.

@Bodigrim merged commit 9412f44 into haskell:master Jul 19, 2022
@Bodigrim
Contributor

Thanks @wz1000!

@Bodigrim linked an issue Jul 19, 2022 that may be closed by this pull request
@wz1000
Contributor Author

wz1000 commented Jul 20, 2022

@Bodigrim Could we have a release for inclusion into 9.4?

@Bodigrim
Contributor

@wz1000 sure, I'm waiting for #448 to land into master before releasing.

@Bodigrim
Contributor

@wz1000 @bgamari actually what is the timeline for GHC 9.4.1? Do we have any time left to finish #448? If no, could you please confirm that master branch works for GHC purposes as is?

@wz1000
Contributor Author

wz1000 commented Jul 21, 2022

The RC will be out this week, but we can use the master branch for that without waiting for a release. The final release will be made in about 2 weeks (early August) and we do need a release by then, ideally before.

@wz1000
Contributor Author

wz1000 commented Jul 21, 2022

I will test the master branch one more time to be sure, but I think all the patches we need have been merged.

@bgamari
Contributor

bgamari commented Jul 21, 2022

@Bodigrim, I'm afraid 9.4 is essentially done. rc1 should be released by the end of today and will ship with 9412f44, which appears to be working well. It would be great if we could produce a text release from it or something closely related.

@Bodigrim
Contributor

@wz1000 @bgamari Released as text-2.0.1, fdb06ff.

Successfully merging this pull request may close these issues.

Linking against gcc_s is problematic on windows
text-2.* -simdutf8 is broken with statically linked GHC
3 participants