
Significant build regressions on swift:6.0-noble compared to 5.10-noble #76555

Open · MahdiBM opened this issue Sep 18, 2024 · 37 comments
Labels
bug: A deviation from expected or documented behavior. Also: expected but undesirable behavior.
triage needed: This issue needs more specific labels

Comments

@MahdiBM commented Sep 18, 2024

Description

Significant build regressions on Linux, worsened by swift-testing.

EDIT: Please also read my next comment which includes more info.

Environment

  • GitHub Actions
  • 8 CPU x 16GB RAM c7x EC2 Instance (we use RunsOn)
  • Migrating from swift:5.10-noble to swift:6.0-noble
  • Tried with and without using a previous .build cache. No behavior difference noticed at all.
  • 200K-300K LOC project
  • Updated to // swift-tools-version:6.0 in Package.swift
  • All targets still have Swift 5 language mode enabled.
  • No actual tests migrated to swift-testing just yet.

What Happened?

  • Our Tests CI started getting stuck after the toolchain update.
  • The CI machines were getting killed by AWS. Considering running out of RAM during builds has been a common problem, especially in the server-side Swift ecosystem, and has been brought up numerous times for ages, my first guess was that the machine runs out of RAM while building the Swift project.
  • After many attempts, I noticed that a c7x-family machine with 64 CPU x 128 GB RAM (8x larger than before) runs the tests about as well as they ran before on Swift 5.10.
  • My next guess was that maybe Swift Testing was causing an issue, so I tried --disable-swift-testing.
  • With that flag, our tests have been running on a machine only 2x larger, and everything else in the tests CI is back to normal.
  • Even with this bigger machine, the tests CI still takes almost as long to complete as before (we run tests in parallel with 64 workers).
  • So not only does Swift 6 have a significant build regression, but swift-testing also appears to turn it from significant into massive.
  • This is borderline unusable for the Swift server ecosystem. I hope this problem requires more specific circumstances than what meets the eye.
  • Still trying things out with the deployment CI. Even 2x CPU and 4x RAM isn't proving helpful, even though I did throw the --disable-swift-testing flag into the mix.

Reproduction

Not sure.

Expected behavior

No regressions. Preferably even faster than previous Swift versions.

Environment

Mentioned above.

Additional information

No response

@grynspan (Contributor) commented Sep 18, 2024

Some notes, in no particular order (will update if I think of more):

  • @stmontgomery's and my initial reaction is that this is probably not related to Swift Testing since this project hasn't added any Swift Testing tests yet.
  • --disable-swift-testing does not affect the build. It did for a while with prerelease Swift 6 toolchains before we added Swift Testing to the toolchain, but it has no effect on the release toolchain. Only one build product is produced for a test target, and it's the same either way. It does affect whether or not a second test process is spun up after building to run Swift Testing tests.
  • It is normal to see text output from Swift Testing when you run swift test even if you don't have any tests using it. The output is basically just saying "didn't find any Swift Testing tests, bye now." If you pass --disable-swift-testing, it suppresses spawning the process that looks for Swift Testing tests, which is why you don't see any output from it when you pass that flag.
  • Swift Testing is available in Swift 5 language mode so long as you're using a Swift 6 toolchain. Language mode isn't a factor.
  • @tayloraswift suggested this might be "Ubuntu 24.04 Swift 5.10.1 release toolchain should not have assertions enabled" (#76091).

@MahdiBM (Author) commented Sep 20, 2024

To be clear:

  • I don't think it was necessarily swift-testing doing something wrong; more likely the compiler or SwiftPM is mishandling it.
  • Even without swift-testing there is a significant build-time regression, when the build finishes at all. The benchmarks below show it properly.

It's very sad to see the Linux build situation constantly getting worse despite us asking for faster and less RAM-hungry builds for ages.
Based on the many user reports we consistently get on the Vapor Discord server, a lot of deployment services have blocking problems building Swift projects because their builders run out of RAM. As you can imagine, this can be a significant obstacle to deploying server-side Swift apps.
Just ask the SSWG folks @0xTim / @gwynne if you have any doubts.

Anyway, let's get to the benchmarks I did. I've likely run 500+ CI jobs in the past 48 hours ...
Note that we do use a couple of macros, all maintained by us; one of them lives in an external repo.
Image names are the exact official Docker image names.

Tests CI

  • Tests CI machine sizes are relative to what we had for swift:5.10-noble: an 8 CPU x 16GB RAM c7x EC2 instance.
    • Other than "Same", the other machines are of the m7x family, which has a higher RAM-to-CPU ratio.
  • In the tests below, I tried both with and without --disable-swift-testing. The results are only marginally different.
  • There are 3 variables in the build times: 1) cache usage, 2) build-step time, 3) tests-build/run time.
    • The table below contains 2 of these variables.
    • If cache usage and build-step time have not changed but the total run time has, the difference comes from the tests-build/run time.
    • We build the package, then cache .build, and then run the tests in another step.
    • The tests-build/run step does also build something more. I'm not sure what, but it seems like it does.
    • Build-step command: swift build --explicit-target-dependency-import-check error --build-tests
    • Tests-build/run-step command: swift test --enable-code-coverage --parallel --num-workers 64
Image      | Machine Size  | w/ cache total | w/ cache build | no cache total | no cache build
-----------|---------------|----------------|----------------|----------------|---------------
5.10-noble | Same          | 12m 53s        | 3m 7s          | 19m 37s        | 7m 34s
5.10-noble | 2x RAM        | 13m 41s        | 3m 16s         | 21m 9s         | 7m 55s
5.10-noble | 4x RAM 2x CPU | 9m 19s         | 2m 40s         | 16m 37s        | 6m 18s
6.0-jammy  | Same          | 15m 21s        | 3m 12s         | 24m 8s         | 9m 14s
6.0-jammy  | 2x RAM        | 16m 18s        | 3m 25s         | 25m 1s         | 9m 36s
6.0-jammy  | 4x RAM 2x CPU | 9m 5s          | 2m 17s         | 17m 15s        | 6m 15s
6.0-noble  | Same          | 17m 21s        | 5m 10s         | 23m 31s        | 9m 9s
6.0-noble  | 2x RAM        | 18m 32s        | 5m 42s         | 23m 34s        | 9m 1s
6.0-noble  | 4x RAM 2x CPU | 9m 47s         | 2m 59s         | 18m 42s        | 6m 43s

(Side note: I didn't know using higher RAM could hurt?! I don't think it's a machine-type problem since the deployment builds below show the expected behavior of some performance improvements when having access to more RAM.)

Analyzing the results (excluding the bigger 4x RAM 2x CPU machine):

  • ~30% worse tests-build/run-step performance on 6.0-jammy compared to 5.10-noble.
    • The exact tests-build/run-step numbers are not in the table; they hover around 7-11 minutes.
  • 60+% worse build performance in the build step w/ cache on 6.0-noble compared to 6.0-jammy.
  • Overall, all things considered, 40-50+% worse build performance when moving from 5.10-noble to 6.0-noble.
  • A bit of the total time (not included in the comparisons above, but visible in the totals) is due to our "cache .build" step, which uses actions/cache and apparently doesn't handle the big 2GB .build directories well.

Deployment CI

  • Deployment CI machine sizes are relative to what we had for swift:5.10-noble: a 4 CPU x 8GB RAM c7x EC2 instance.
    • Other than "Same", the other machines are of the m7x family, which has a higher RAM-to-CPU ratio.
  • Compared to the test builds, deployment builds lack --build-tests, use jemalloc, and only build the specific product that will be deployed.
    • jemalloc might be why the behavior is worse than in the test builds?!
Image      | Machine Size  | w/ cache total | w/ cache build | no cache total | no cache build
-----------|---------------|----------------|----------------|----------------|---------------
5.10-noble | Same          | 4m 19s         | 2m 37s         | 13m 27s        | 9m 1s
5.10-noble | 2x RAM        | 3m 31s         | 1m 49s         | 12m 29s        | 9m 12s
5.10-noble | 4x RAM 2x CPU | 4m 7s          | 2m 14s         | 9m 47s         | 6m 9s
6.0-jammy  | Same          | 6m 25s         | 3m 7s          | OOM            | OOM
6.0-jammy  | 2x RAM        | 3m 40s         | 2m 0s          | 13m 57s        | 10m 14s
6.0-jammy  | 4x RAM 2x CPU | 4m 39s         | 2m 39s         | 10m 7s         | 6m 32s
6.0-noble  | Same          | OOM            | OOM            | OOM            | OOM
6.0-noble  | 2x RAM        | OOM            | OOM            | OOM            | OOM
6.0-noble  | 4x RAM 2x CPU | 6m 12s         | 4m 10s         | 11m 6s         | 7m 21s

The only notable change when I moved our app to the Swift 6 compiler concerns our 3 executable targets, which Swift was throwing errors about (e.g. using @testable import on an executable target). What I did is add 3 new .executableTarget targets, turn the existing targets into plain .target library targets, and have each new .executableTarget contain only ~5 lines that call the original target's entry point.
This way we can still @testable import the original targets. A minimal sketch of that layout follows.
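For illustration, a minimal Package.swift sketch of that layout, reusing the MyLib/MyExec names that show up in the build logs below (the real manifest, products, and dependencies are of course more involved, so treat the names here as placeholders):

// swift-tools-version:6.0
// Sketch only: target names are illustrative, not the project's real manifest.
import PackageDescription

let package = Package(
    name: "my-app",
    targets: [
        // The original executable code, now a plain library target so tests can
        // `@testable import MyLib`.
        .target(name: "MyLib"),
        // A thin executable wrapper; its Entrypoint.swift just calls MyLib's entry point.
        .executableTarget(name: "MyExec", dependencies: ["MyLib"]),
        // Tests depend on the library target, not the executable.
        .testTarget(name: "MyLibTests", dependencies: ["MyLib"]),
    ]
)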

You may ask: "Didn't you say you needed an 8x larger machine to run the tests? How come this time they ran even on the same machine as you had before when using swift:5.10-noble?!"
That's a good question.
I tried multiple times yesterday, the results of which are still available in GitHub Actions. I was even posting the results live in a thread on the Vapor Discord server. I also sent a sample of such failing logs to @grynspan.
I also checked that the swift:6.0-noble images from yesterday and today are the same (matching hashes).
There haven't been any other changes to the project since yesterday, so I'm not exactly sure why the tests are no longer getting stuck today. I have no complaints about that, though the deployment CI does still get stuck.

So:
Yesterday I reported that there seemed to be 2 problems: a general build-time regression, and a massive regression when not using --disable-swift-testing. Today I'm unable to reproduce the latter.
Something's really up, and it definitely isn't the Swift code, since that hasn't changed.
I don't think it's the AWS machines we use either; those are just standard EC2 instances.
The only thing I can think of is our cache usage. Maybe Swift 6 was using Swift 5's caches and didn't like that at all. But even then, I remember testing both with and without cache. I've set the CI to disable caching when you rerun a job, so it was trivial to test both scenarios.

Worth mentioning: when the build is stuck, I consistently see a sequence of logs like this near the end, containing Write Objects.LinkFileList:

[6944/6948] Wrapping AST for MyLib for debugging
[6946/6950] Compiling MyExec Entrypoint.swift
[6947/6950] Emitting module MyExec
[6948/6951] Wrapping AST for MyExec for debugging
[6949/6951] Write Objects.LinkFileList

@gwynne (Contributor) commented Sep 20, 2024

This almost starts to sound like a recurrence of the infamous linker RAM usage problem due to huge command lines with repeated libraries. @al45tair Is there any chance we're still failing to deduplicate linker command lines fully?

@MahdiBM changed the title from "Absolutely massive regressions on swift:6.0-noble" to "Significant build regressions on swift:6.0-noble compared to 5.10-noble" on Sep 20, 2024
@finagolfin (Contributor) commented:

Is there any chance we're still failing to deduplicate linker command lines fully?

Yes, I pinged you on that last month, but never got a response. A fix was just merged into release/6.0 in #76504, so it will not be released until 6.0.2 or 6.0.3.

@jmschonfeld or @shahmishal, can that fix be prioritized to get into the next patch release?

@finagolfin (Contributor) commented:

@MahdiBM, thanks for all the build info. Do you do any CI builds of the 6.0 snapshot toolchains before the final release? That would help find and stop build regressions like this when they happen, rather than being surprised on the final release.

If you can, I'd like to know how an earlier 6.0 July 19 snapshot toolchain build for jammy does on these same CI runs of yours. That might help figure out the regression, particularly if you compare it to the next July 21 build of the 6.0 toolchain.

@MahdiBM (Author) commented Sep 20, 2024

@MahdiBM, thanks for all the build info. Do you do any CI builds of the 6.0 snapshot toolchains before the final release? That would help find and stop build regressions like this when they happen, rather than being surprised on the final release.

@finagolfin this is an "executable" work project, not a public library, so we don't test on multiple Swift versions.
I think it would be possible for us though to do such a thing on the next nightly images. Not a bad idea to catch these kinds of issues.
The only question is how reliable the nightly images are. Don't they have assertions and such enabled, which makes the build slower? How can I trust the results? Do I need to run current-nightly and next-ver-nightly and compare those?
I can set up a weekly job perhaps 🤔

If you can, I'd like to know how an earlier 6.0 July 19 snapshot toolchain build for jammy does on these same CI runs of yours. That might help figure out the regression, particularly if you compare it to the next July 21 build of the 6.0 toolchain.

We just use Docker images in CI. To be clear, the image names above are the exact Docker image names we use (I've added this explanation to the comment). I haven't tried or figured out manually using nightly images, although I imagine I could just use swiftly to set up the machine with a specific nightly toolchain with few problems. It would make the different benchmarks diverge a bit in terms of environment/setup, though.
Preferably, I'd just try a nightly image if you think that'll be helpful; just let me know the exact nightly image (the image identifier/hash and tag?).
Or I can try a newer 6.0 nightly image that does contain the linker RAM usage fix, if one already exists.

@MahdiBM (Author) commented Sep 20, 2024

Another visible issue is the noble vs jammy difference ... I don't think I could have caught that even if we were running CIs on nightly images, considering Swift 6 just very recently got a noble image.

Not sure where that comes from. Any ideas?

@MahdiBM (Author) commented Sep 20, 2024

I would guess that even on 5.10, the jammy images would behave better than the noble ones, though I haven't tested that.

@gwynne (Contributor) commented Sep 20, 2024

My guess is that the difference comes from the updated glibc in noble; when going from bionic to jammy we saw a significant improvement in malloc behavior for that reason, so I wouldn't be too surprised if noble regressed some.

@finagolfin (Contributor) commented:

The only question is how reliable the nightly images are. Don't they have assertions and such enabled, which makes the build slower? How can I trust the results?

I don't know. In my Android CI, the latest Swift release builds the same code about 70-80% faster than the development snapshot toolchains. But you'd be looking for regressions in relative build time, so those branch differences shouldn't matter.

Do I need to run current-nightly and next-ver-nightly and compare those?

I'd simply build with the snapshots of the next release, e.g. 6.1 once that's branched, and look for regressions in build time with the almost-daily snapshots.

I can set up a weekly job perhaps

The CI tags snapshots and provides official builds a couple times a week: I'd set it up to run whenever one of those drops.

Preferably, I'd just try a nightly image if you think that'll be helpful; just let me know the exact nightly image (the image identifier/hash and tag?).

I don't use Docker, so don't know anything about its identifiers, but I presume the 6.0 snapshot tag dates I listed should identify the right images.

Or I can try a newer 6.0 nightly image that does contain the linker RAM usage fix, if one already exists.

Not yet. The fix was added to trunk a couple weeks ago, so you could try the Sep. 4 or latest 6.1 snapshot build with it and compare to the Aug. 29 build without it.

You may also want to talk to @ktoso and the SSWG about what kind of toolchain benchmarking exists to catch these issues on linux and what needs to be done to either start or augment it.

@tbkka (Contributor) commented Sep 20, 2024

If you can log into the build machine and watch memory usage with top, it should be a lot clearer what's going on. There's a big difference between linker and compiler, as noted above. There's also a big difference between "the compiler uses too much memory" and "the build system is running too many compilers at the same time."
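For example, a rough sketch along these lines (assuming bash and procps are available in the container, and using the build command from earlier in this thread; adapt as needed) would show which processes are eating the memory: many large swift-frontend processes point at too much compile parallelism, while one huge ld/lld/clang process points at the link step.

# Rough monitoring sketch: run the build in the background and sample memory usage.
swift build --build-tests > build.log 2>&1 &
build_pid=$!
while kill -0 "$build_pid" 2>/dev/null; do
    date
    # Top memory consumers by resident set size (RSS, in KB).
    ps -eo rss,comm --sort=-rss | head -n 12
    sleep 5
done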

@MahdiBM (Author) commented Sep 20, 2024

@tbkka I can SSH into the containers since RunsOn provides such a feature, but as you already noticed, I'm not well-versed in compiler/build-system internals. I can use top, but I'm not sure what to look for or how to draw conclusions about where the problem is.

It does appear @gwynne was on point though, about linker issues.

@MahdiBM (Author) commented Sep 20, 2024

You may also want to talk to @ktoso and the SSWG about what kind of toolchain benchmarking exists to catch these issues on linux and what needs to be done to either start or augment it.

@finagolfin I imagine the SSWG is already aware of these long-standing problems and I expect they have already communicated their concerns over the past few years; they just probably haven't managed to get this higher on the Swift team's priority list. I've seen some discussions in public places.

Even if there were no regressions, Swift's build system looks to me - purely from a user's perspective with no knowledge of the inner workings - pretty behind (dare I say, bad), and one of the biggest pain points of the language. We've just gotten used to our M-series devices building things fast enough before we get way too bored.

That said, I'm open to helping benchmark things. I think one problem is that we need a real, big, and messy project, like most corporate projects are, so we can test things in realistic environments.
There is Vapor's penny-bot project, which shouldn't be a bad start though: not small, and with a fair amount of complexity.

@finagolfin (Contributor) commented:

@MahdiBM, I have submitted that linker fix for the next 6.0.1 patch release, but the branch managers would like some confirmation from you that this is fixed in trunk first. Specifically, you should compare the Aug. 29 trunk 6.1 snapshot build from last month to the Sep. 4 or subsequent trunk builds.

@finagolfin (Contributor) commented:

Moving the thread about benchmarking the linker fix here, since it adds nothing to the review of that pull. I was off the internet for the last seven hours, so only seeing your messages now.

Complains about CompilerPluginSupport or whatever

Maybe you can give some error info on that.

there was a recent release and our dependencies were not up to date

Ideally, you'd build against the same commit of your codebase as the test runs you measured above.

.build/aarch64-unknown-linux-gnu/debug/CNIODarwin-tool.build/module.modulemap:1:8: error: redefinition of module 'CNIODarwin'

Hmm, this is building inside the linux image? A quick fix might be to remove that CNIODarwin dependency from the targets you're building in the NIO package manifest, as it is unused on linux.

However, this seems entirely unrelated to the toolchain used: I'd try first to build the same working commit of your codebase that you used for the test runs you measured above.

@MahdiBM (Author) commented Sep 21, 2024

@finagolfin

Ideally, you'd build against the same commit of your codebase as the test runs you measured above.

That's not possible for multiple reasons, such as the fact that I normally use Swift Docker images, but here I need to install specific toolchains, which means I need to use the Ubuntu image.

Hmm, this is building inside the linux image?

Yes. That's how GitHub Actions works. (ubuntu jammy)

I'd try first to build the same working commit of your codebase that you used for the test runs you measured above.

Not sure how that'd be helpful. Current commit is close enough though.

Those tests above were made in a different environment (Swift Docker images, release images only), so while I trust that you know better than me about this stuff, I don't understand how you're going to be able to properly compare the numbers considering I had to make some adjustments.

@finagolfin (Contributor) commented:

Not sure how that'd be helpful.

I figure you know it builds to completion at least, without hitting all these build errors.

Those tests above were made in a different environment (Swift Docker images, release images only)

There are Swift Docker images for all these snapshot toolchains too, why not use those?

Basically, you can't compare snapshot build timings if you keep hitting compilation errors, so I'm saying you should try to reproduce the known-good environment where you measured the runs above, but only change one ingredient, ie swapping the 6.0 Docker image for the snapshot Docker images.

If these build errors are happening because other factors, like your Swift project, are changing too, that should fix it. If it still doesn't build, I suggest you use the 6.0 snapshot toolchain tags given, as they will be most similar to the 6.0 release, and show any build error output for those.

If you can't get anything but the final release builds to compile your codebase, you're stuck simply observing the build with some process monitor or file timestamps. If linking seems to be the problem, you could get the linker command from the verbose -v output and manually deduplicate the linker flags to see how much of a difference it makes.

I took a look at some linux CI build times of swift-docc between the 6.0.0 and 6.0 branches, ie with and without the linker fix, and didn't see a big difference. I don't know if that's because they have a lot of RAM, unlike your baseline config that showed the most regression.

@MahdiBM (Author) commented Sep 21, 2024

@finagolfin But how do I figure out the hash of the exact image that relates to the specific main snapshots?

I tried a bunch to fix the NIO errors with no luck: https://github.com/MahdiBM/swift-nio/tree/mmbm-no-cniodarwin-on-linux

This is not normal, and not a fault of the project. This is not the first time I've built the app in a Linux environment.

@MahdiBM (Author) commented Sep 21, 2024

It also complained about CNIOWASI, as well as CNIOLinux.
I deleted those alongside CNIOWindows and now I'm getting this:

error: exhausted attempts to resolve the dependencies graph, with the following dependencies unresolved:
* 'swift-nio' from https://github.com/mahdibm/swift-nio.git

@finagolfin (Contributor) commented:

how do I figure out the hash of the exact image that relates to the specific main snapshots?

Hmm, looking it up now, I guess you can't. As I said, I don't use Docker, so I was unaware of that.

My suggestion is that you get the 6.0 Docker image and make sure it builds some known-stable commit of your codebase. Then, use that same docker image to download the 6.0 snapshots given above, like the one I linked yesterday, and after unpacking them in the Docker image, use them to build your code instead. That way, you have a known-good Docker environment and source commit, with the only difference being the Swift 6.0 toolchain build date.

The Docker files almost never change, so only swapping out the toolchain used inside the 6.0 image should minimize the differences.

@MahdiBM (Author) commented Sep 21, 2024

The Docker files almost never change, so only swapping out the toolchain used inside the 6.0 image should minimize the differences.

@finagolfin Good idea, didn't think of that, but still didn't work.

For reference:

CI File
name: test build

on:
  pull_request: { types: [opened, reopened, synchronize] }

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit-tests:
    strategy:
      fail-fast: false
      matrix:
        snapshot:
          - swift-6.0-DEVELOPMENT-SNAPSHOT-2024-07-19-a
          - swift-6.0-DEVELOPMENT-SNAPSHOT-2024-07-21-a
        machine:
          - name: "medium" # 16gb 8cpu c7i-flex
            arch: amd64
          - name: "large" # 32gb 16cpu c7i-flex
            arch: amd64
          - name: "huge-stable-arm" # 128gb 64cpu bare metal c7g
            arch: arm64

    runs-on:
      labels:
        - runs-on
        - runner=${{ matrix.machine.name }}
        - run-id=${{ github.run_id }}

    timeout-minutes: 60

    steps:
      - name: Check out ${{ github.event.repository.name }}
        uses: actions/checkout@v4

      - name: Build Docker Image
        run: |
          docker build \
            --network=host \
            --memory=128g \
            -f SwiftDockerfile \
            -t custom-swift:1 . \
            --build-arg DOWNLOAD_DIR="${{ matrix.snapshot }}" \
            --build-arg TARGETARCH="${{ matrix.machine.arch }}"

      - name: Prepare
        run: |
          docker run --name swift-container custom-swift:1 bash -c 'apt-get update -y && apt-get install -y libjemalloc-dev && git config --global --add url."https://${{ secrets.GH_PAT }}@github.com/".insteadOf "https://github.com/" && git clone https://github.com/${{ github.repository }} && cd ${{ github.event.repository.name }} && git checkout ${{ github.head_ref }} && swift package resolve --force-resolved-versions --skip-update'
          docker commit swift-container prepared-container:1

      - name: Build ${{ matrix.snapshot }}
        run: |
          docker run prepared-container:1 bash -c 'cd ${{ github.event.repository.name }} && swift build --build-tests'
Modified Dockerfile
FROM ubuntu:22.04 AS base
LABEL maintainer="Swift Infrastructure <[email protected]>"
LABEL description="Docker Container for the Swift programming language"

RUN export DEBIAN_FRONTEND=noninteractive DEBCONF_NONINTERACTIVE_SEEN=true && apt-get -q update && \
    apt-get -q install -y \
    binutils \
    git \
    gnupg2 \
    libc6-dev \
    libcurl4-openssl-dev \
    libedit2 \
    libgcc-11-dev \
    libpython3-dev \
    libsqlite3-0 \
    libstdc++-11-dev \
    libxml2-dev \
    libz3-dev \
    pkg-config \
    tzdata \
    zip \
    zlib1g-dev \
    && rm -r /var/lib/apt/lists/*

# Everything up to here should cache nicely between Swift versions, assuming dev dependencies change little

# gpg --keyid-format LONG -k FAF6989E1BC16FEA
# pub   rsa4096/FAF6989E1BC16FEA 2019-11-07 [SC] [expires: 2021-11-06]
#       8A7495662C3CD4AE18D95637FAF6989E1BC16FEA
# uid                 [ unknown] Swift Automatic Signing Key #3 <[email protected]>
ARG SWIFT_SIGNING_KEY=8A7495662C3CD4AE18D95637FAF6989E1BC16FEA
ARG SWIFT_PLATFORM=ubuntu
ARG OS_MAJOR_VER=22
ARG OS_MIN_VER=04
ARG SWIFT_WEBROOT=https://download.swift.org/development
ARG DOWNLOAD_DIR

# This is a small trick to enable if/else for arm64 and amd64.
# Because of https://bugs.swift.org/browse/SR-14872 we need adjust tar options.
FROM base AS base-amd64
ARG OS_ARCH_SUFFIX=

FROM base AS base-arm64
ARG OS_ARCH_SUFFIX=-aarch64

FROM base-$TARGETARCH AS final

ARG OS_VER=$SWIFT_PLATFORM$OS_MAJOR_VER.$OS_MIN_VER$OS_ARCH_SUFFIX
ARG PLATFORM_WEBROOT="$SWIFT_WEBROOT/$SWIFT_PLATFORM$OS_MAJOR_VER$OS_MIN_VER$OS_ARCH_SUFFIX"

RUN echo "${PLATFORM_WEBROOT}/latest-build.yml"

ARG download="$DOWNLOAD_DIR-$SWIFT_PLATFORM$OS_MAJOR_VER.$OS_MIN_VER$OS_ARCH_SUFFIX.tar.gz"

RUN echo "DOWNLOAD IS THIS: ${download} ; ${DOWNLOAD_DIR}"

RUN set -e; \
    # - Grab curl here so we cache better up above
    export DEBIAN_FRONTEND=noninteractive \
    && apt-get -q update && apt-get -q install -y curl && rm -rf /var/lib/apt/lists/* \
    # - Latest Toolchain info
    && echo $DOWNLOAD_DIR > .swift_tag \
    # - Download the GPG keys, Swift toolchain, and toolchain signature, and verify.
    && export GNUPGHOME="$(mktemp -d)" \
    && curl -fsSL ${PLATFORM_WEBROOT}/${DOWNLOAD_DIR}/${download} -o latest_toolchain.tar.gz \
        ${PLATFORM_WEBROOT}/${DOWNLOAD_DIR}/${download}.sig -o latest_toolchain.tar.gz.sig \
    && curl -fSsL https://swift.org/keys/all-keys.asc | gpg --import -  \
    && gpg --batch --verify latest_toolchain.tar.gz.sig latest_toolchain.tar.gz \
    # - Unpack the toolchain, set libs permissions, and clean up.
    && tar -xzf latest_toolchain.tar.gz --directory / --strip-components=1 \
    && chmod -R o+r /usr/lib/swift \
    && rm -rf "$GNUPGHOME" latest_toolchain.tar.gz.sig latest_toolchain.tar.gz \
    && apt-get purge --auto-remove -y curl

# Print Installed Swift Version
RUN swift --version

RUN echo "[ -n \"\${TERM:-}\" -a -r /etc/motd ] && cat /etc/motd" >> /etc/bash.bashrc; \
    ( \
      printf "################################################################\n"; \
      printf "# %-60s #\n" ""; \
      printf "# %-60s #\n" "Swift Nightly Docker Image"; \
      printf "# %-60s #\n" "Tag: $(cat .swift_tag)"; \
      printf "# %-60s #\n" ""; \
      printf "################################################################\n" \
    ) > /etc/motd

@MahdiBM (Author) commented Sep 21, 2024

To be clear, by "didn't work" I mean that I'm getting exactly the same errors.

@MahdiBM (Author) commented Sep 21, 2024

Tried the 6.0 snapshots; they complain about the usage of swiftLanguageMode in dependencies.

@finagolfin (Contributor) commented:

Does simply building your code with the Swift 6.0 release still work? If so, I'd try to instrument the build to figure out the bottlenecks, as I suggested before. In particular, if you're building a large executable at the end, that might be taking the most time.

As I said yesterday, you could try dumping all the commands that swift build is running with the -v flag, check if the final linker command is repeating -lFoundationEssentials -l_FoundationICU over and over again, then manually run that link alone twice to measure the timing: once exactly as it was to get the baseline, then a second time with those library flags de-duplicated to see if that helps. Ideally, you'd do this on the same lower-RAM hardware where you were seeing the largest build-time regressions before.
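Something like this rough sketch, assuming bash and the GNU time package are available in the container (the grep pattern for extracting the link command is only a guess and will likely need adjusting for your product name, and shell quoting in the extracted command may need fixing by hand):

# 1. Capture every command swift build runs, including the final link.
swift build -v > build-verbose.log 2>&1

# 2. Pull out the last command that writes into the debug build directory; this is
#    approximate -- inspect the log and adjust so you grab the actual link invocation.
grep -- '-o .build/.*/debug/' build-verbose.log | tail -n 1 > link-original.sh

# 3. Make a copy with repeated -lFoo flags removed (keeping only the first occurrence of each).
awk '{ for (i = 1; i <= NF; i++) { if ($i ~ /^-l/ && seen[$i]++) continue; printf "%s ", $i }; print "" }' \
    link-original.sh > link-dedup.sh

# 4. Run both and compare wall time and "Maximum resident set size"
#    (needs GNU time: apt-get install -y time).
/usr/bin/time -v bash link-original.sh
/usr/bin/time -v bash link-dedup.sh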

@MahdiBM (Author) commented Sep 22, 2024

Of course there is no problem on the released Swift 6 😅. This whole issue is about CI slowness. We run CI on push, PR, etc... and they've all passed.

@MahdiBM (Author) commented Sep 22, 2024

@finagolfin
Finally got the build working with some dependency downgrades.

        snapshot:
          - swift-6.0-DEVELOPMENT-SNAPSHOT-2024-07-19-a
          - swift-6.0-DEVELOPMENT-SNAPSHOT-2024-07-21-a
        machine:
          - name: "medium" # 16gb 8cpu c7i-flex
            arch: amd64
          - name: "large" # 32gb 16cpu c7i-flex
            arch: amd64
          - name: "huge-stable-arm" # 128gb 64cpu bare metal c7g
            arch: arm64

The results are only marginally different:

[Screenshot of the CI job durations for the two snapshots across the three machines, 2024-09-22 5:08 PM]

Ignore that it says "unit-tests"; it's only a swift build --build-tests, like in the shared files above.
Also, the differences are really negligible, especially when you look at the build step's times.

@finagolfin (Contributor) commented:

It's not the same low-RAM scenarios though, right? You initially gave data for 8 and 16 GB RAM servers that showed regressions, whereas this is from 16, 32, and 128 GB runners, with the smallest showing the biggest regression albeit less than with the final 6.0 release.

Why don't you wait for the next 6.0 snapshot toolchain with the linker patch? That should be a good comparison to the 6.0 release.

@MahdiBM (Author) commented Sep 22, 2024

@finagolfin Sorry, you're right; I just wrote my conclusions here before looking at the smallest machine.
Trying on an 8GB RAM 4-CPU machine now.

For the record, the 07-19 snapshot on the "medium" machine took 9m 46s, and the 07-21 snapshot took 10m 29s.

@MahdiBM (Author) commented Sep 22, 2024

@finagolfin This is not exactly the same scenario as before, so these numbers are not comparable to the older numbers.
We should compare them to each other.

With a smaller machine (half the "medium" machine), 07-19 takes 17m 3s and 07-21 takes 18m 5s (not the whole job run, only the build step).
So yeah, this does indicate part of the regression, although definitely not all of it.

In the comparisons above, 6.0-jammy was 22% slower than 5.10-noble on the same machine (look at the Tests CI build-step time w/ no cache, which uses the same command but has to resolve the packages itself: 9m 14s vs 7m 34s).
Here it's a 7% difference on the same machine (9m 46s vs 10m 29s).

So maybe a third of the regression? I guess there can be other factors here as well (like how snapshots behave vs release builds), so let me know what you think.

@MahdiBM (Author) commented Sep 22, 2024

You initially gave data for 8 and 16 GB RAM servers that showed regressions, whereas this is from 16, 32, and 128 GB runners,

The medium machine is the same as the "Same" machines in "Test CI" section.
There are 3 machines in the image, not 2.

@finagolfin (Contributor) commented:

It seems to indicate the Foundation re-core in July definitely contributed to the slowdown, but wasn't all of it. Are you able to also compare the trunk snapshot toolchains from a couple weeks ago? That might be a better benchmark.

The medium machine is the same as the "Same" machines in "Test CI" section.
There are 3 machines in the image, not 2.

I was talking about your original test and deployment runs from a couple days ago; the smallest from each were 8 and 16 GB.

@MahdiBM (Author) commented Sep 22, 2024

Are you able to also compare the trunk snapshot toolchains from a couple weeks ago?

Just let me know what the snapshot names are.

I was talking about #76555 (comment); the smallest from each were 8 and 16 GB.

The deployment CI uses a smaller base machine because it only builds one specific product, not everything including the tests.
The comparisons I mentioned above are the closest we have, and should be reliable enough considering there's a 3x difference.

@finagolfin (Contributor) commented:

Just let me know what the snapshot names are.

See above: "you could try the Sep. 4 or latest 6.1 snapshot build with it and compare to the Aug. 29 build without it."

@MahdiBM (Author) commented Sep 22, 2024

@finagolfin Ah, I did try those already; the error situation was even less solvable.

@MahdiBM (Author) commented Sep 22, 2024

It was the CNIO errors. I couldn't get them fixed, although I made a decent effort.

@MahdiBM (Author) commented Sep 24, 2024

First victim:
Heroku can't build Swift apps that it previously could: vapor-community/heroku-buildpack#78 (comment)

Heroku is a pretty popular platform for server-side Swift apps, considering it once had a free tier. This problem will have a decent negative impact.

@MahdiBM (Author) commented Sep 30, 2024

Ever since the last comment, I've seen 2 more users encountering this problem on Heroku. So a total of 3, and that's only those who bothered to ask about their weird build failures before reverting.

The interesting bit is that one person was consistently having problems on Swift 6 jammy, but their project would consistently build fine on Swift 6 noble, in Heroku buildpack builders.
This is unexpected to me, but it shows that the regressions are even more complicated and have happened on multiple fronts.

So:

  • Swift on noble, even on 5.10, has build-time regressions.
  • Swift 6 itself contains some regressions as well.
  • jammy / noble contain their own regressions as well?!

I was unable to find the specs of Heroku's builders (I also asked around a bit, no luck) to see if anything changed between their Ubuntu 22 and 24 stacks.
