Parallelization performance severely degraded since Tycho 1.7.0 #342

k-spa · 2021-10-19T22:52:16Z

We have been using Java 11 / Tycho 1.6.0 in our project since the release of Java 11.

The build is quite lengthy and, without any special configuration, takes about 3 hours. Therefore we set the -T 1C command line option to build several projects in parallel. This logs warnings about unsupported parallelization but has always been working fine. Building in parallel drops the build time to 2 hours.

Anticipating the upcoming Java 17 support I tried upgrading the build system to Tycho 2.4.0 to test for any major roadblocks.

With Tycho 2.4.0 the unparallelized build takes the same 3 hours as with Tycho 1.6.0, but setting the parallelization option actually increases the build time to 3 hours and 47 minutes!

I tested with older versions of Tycho and it seems that the severe slowdown happens since version 1.7.0. I did not expect that - minor changes in the build time are fine, but I expected the build time to stay at least in the area of 2 hours.

	with `-T 1C`	without `-T 1C`
Tycho 1.6.0	2 hours	3 hours
Tycho 1.7.0+	~4 hours	3 hours

Going from our current, parallelized, 2 hour build, that is an increase of almost 100 % - but even increasing the build time to 3 hours by building unparallelized is hardly an option.

Is there a problem with running the build parallelized since Tycho 1.7.0? Do I have to configure something differently?

Our maven call looks like this:
mvn install -Dmaven.repo.local=<repository location> -P <profile name> -T 1C -fae

The profile just determines which tests are executed and should not behave differently whether the -T 1C parameter is present or not.

The text was updated successfully, but these errors were encountered:

akurtakov · 2021-10-20T04:16:11Z

Can you provide small reproducer where this is visible for testing purposes?

laeubi · 2021-10-20T05:09:30Z

I tried upgrading the build system to Tycho 2.4.0

Have you tried tycho 2.5.0? It contains some improvements for large build setups regarding performance.

k-spa · 2021-10-21T07:42:53Z

Oh, I completely missed the release of Tycho 2.5.0, thanks for the heads up.

The numbers are in:

	with `-T 1C`	without `-T 1C`
Tycho 1.6.0	2 hours	3 hours
Tycho 2.4.0	~4 hours	3 hours
Tycho 2.5.0	3 h 30 min	3 h 15 min

It is a bit faster but still nowhere near to 1.6.0-durations. The non-parallelized build seems to have become a little bit slower (although maybe within error margins).

I will try to create a reproducible project, but I cannot make promises. I guess it will have to be a bunch of test projects with sleeps in them.

mickaelistria · 2021-10-21T08:05:22Z

The gap in term of amount of commits between Tycho 1.6.0 and 2.5.0 is huge. It's going to be very hard, and not so motivating, to identify the change that cause this issue. And if issue is between 1.6.0 and 1.7.0, it's also not so motivating to dig into outdated code.
So I recommend that you consider upgrading to a recent release more often if you want to get better support. And this is true for all OSS projects.

Bananeweizen · 2021-11-19T09:07:12Z

I would guess the root cause to be around making the mojos thread-safe. We did that by simply introducing an artificial lock object in each and every mojo. Therefore if your builds use a huge amount of time in just one identical mojo for a big number of maven modules, that one mojo may dominate the build and cannot be run in parallel anymore. I.e. you can run tycho-foo and tycho-bar in parallel, but not tycho-foo 2 times, since that change.
Using a lock object is the most simple implementation, and for some mojos it might be possible to lift that restriction after further analysis whether they actually access some shared objects or not. But for that it would be good to see which tycho mojo is the one dominating your builds.
You might want to run Yourkit YouMonitor (not be confused with Yourkit Profiler) to get all the insights of your build times: https://www.yourkit.com/youmonitor/features/

k-spa · 2021-11-21T22:02:10Z

First of all: This might seriously be the best tool recommendation I have gotten in quite a while, thanks a lot!

I monitored builds with tycho 1.7.0 and 2.5.0 and the data I got fully supports your suspicion.

As I mentioned earlier, we run the build with the -T 1C option, which means that a lot of maven modules are being build around the same time.

This is visible in the following screenshot that shows several builds starting at around the same timestamp. This behavior is identical, whether tycho 1.7.0 or tycho 2.5.0 is used.

But when single module builds are examined, it gets really interesting. Please have a look at the duration of the tycho-surefire goal and the duration of the started test process in the following two screenshots:

This is totally different from the results I could observe with tycho 1.7.0. To me it seems that the tycho-surefire goal is started and then waits for sometimes over two hours until the actual JUnit process is started that then usually finishes in a matter of minutes. I actually plotted the build timeline and it is clearly visible that with tycho 2.4.0 only one JUnit process is running at a time - so even with multiple surefire goals started in parallel, the actual JUnit runs are locked against each other.

As the tests take up a majority of our build time, this really hurts us.

I do suspect though that it might be possible to lessen that restriction for tycho-surefire - as said before, we've been successfully building with -T 1C for years now.

Last but not least - thanks for someone taking interest. I fully understand that this might not be the most fun or motivating issue to analyze or work on and I certainly wish we would have noticed the problem sooner, but as we all cannot change the past, this is what it is now. You can be assured that it is not in my personal interest to not upgrade infrastructure software for such a long time either, but that should not be the topic here.

Anyhow, if you need further data, I'll do my best to provide it. I hope the analysis helped!

laeubi · 2021-11-22T05:23:21Z

It seems the lock was introduced with Tycho 1.7.x 8afcd61 from a quick look at the code it seems that at least we do not need to lock globally as each project is using its own work-directory.
The tricky part would be to check if the used services are thread save (even as the could be), but at least I think we can leave the lock at the runTest method.

My suggestion would be, move that method call outside the synchronized block and check if that helps in your case or causing issues. If it helps we can then integrate this change permanently into Tycho.

k-spa · 2021-11-24T08:54:06Z

Your suspicion seems to be correct. I cloned the repository and tested your suggested change:

The build is successful (no multi threading failures, although there are up to 16 test projects running in parallel) and the build time is back down to approximately 2 hours.

I'm a bit unsure though on how to continue from here. As far as I have seen I would need to sign the Eclipse ECA in order to commit the patch myself. I have not done that yet and would have to check with my company first if I would actually be allowed to do that. Should I be going forward with that or would you rather have someone of you commit the change yourself? I'd be fine with that too.

laeubi · 2021-11-24T09:03:20Z

@k-spa please check first with your company as it is generally a great opportunity to contribute (small or larger) patches to (eclipse) OSS projects and such a small change is a good starting-point to get familiar with the process and tools.

If it would become too complicated let me know and I'll write an own patch for this but as you are currently the only one complaining its your time-scedule and probably it helps to convince your company to contribute if they are directly affected by this.

mickaelistria · 2021-11-24T16:44:02Z

Should I be going forward with that or would you rather have someone of you commit the change yourself?

We'd rather have you submitting the patch on your own name for proper and fair attribution, and also -since you've demonstrated some interest in the project and proven you have skills to improve it- to get you familiar with the contribution process as it's very likely you'll want to suggest some more changes in the future.

Running tests while the lock is present leads to the tests being executed in series which severely degrades performance when the build is running with parallel maven projects.

Fixes #342: Run surefire tests outside of lock

Bananeweizen · 2021-11-29T19:42:32Z

We should probably mention this in the release notes. If there are people with SWTBot UI tests and parallel build enabled, they would previously not run into issues, but after this change multiple bot tests might try to manipulate windows and controls at the same time. That will probably fail, at least on Windows (due to not being able to run with multiple separate interactive sessions in parallel). In fact, in our company a big amount of tests follows that SWTBot pattern, and that's why we don't use the parallel build for surefire testing. Otherwise I might have noticed this specific bug much earlier myself. :)

Document parallel testing changes (fixes #342)

laeubi added performance 🚨 regression Regression compared to previous release - blocks upcoming release and other merges labels Nov 24, 2021

laeubi linked a pull request Nov 25, 2021 that will close this issue

Fixes #342: Run surefire tests outside of lock #394

Merged

laeubi closed this as completed in #394 Nov 25, 2021

laeubi added a commit that referenced this issue Nov 25, 2021

Merge pull request #394 from k-spa/master

c9ac1ba

Fixes #342: Run surefire tests outside of lock

Bananeweizen added a commit to Bananeweizen/tycho that referenced this issue Nov 29, 2021

Document parallel testing changes (fixes eclipse-tycho#342)

3cdc51c

Bananeweizen added a commit that referenced this issue Nov 30, 2021

Merge pull request #402 from Bananeweizen/patch-2

b5aa568

Document parallel testing changes (fixes #342)

laeubi added this to the 2.6 milestone Jan 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallelization performance severely degraded since Tycho 1.7.0 #342

Parallelization performance severely degraded since Tycho 1.7.0 #342

k-spa commented Oct 19, 2021

akurtakov commented Oct 20, 2021

laeubi commented Oct 20, 2021

k-spa commented Oct 21, 2021 •

edited

Loading

mickaelistria commented Oct 21, 2021

Bananeweizen commented Nov 19, 2021

k-spa commented Nov 21, 2021 •

edited

Loading

laeubi commented Nov 22, 2021

k-spa commented Nov 24, 2021

laeubi commented Nov 24, 2021

mickaelistria commented Nov 24, 2021

Bananeweizen commented Nov 29, 2021

Parallelization performance severely degraded since Tycho 1.7.0 #342

Parallelization performance severely degraded since Tycho 1.7.0 #342

Comments

k-spa commented Oct 19, 2021

akurtakov commented Oct 20, 2021

laeubi commented Oct 20, 2021

k-spa commented Oct 21, 2021 • edited Loading

mickaelistria commented Oct 21, 2021

Bananeweizen commented Nov 19, 2021

k-spa commented Nov 21, 2021 • edited Loading

laeubi commented Nov 22, 2021

k-spa commented Nov 24, 2021

laeubi commented Nov 24, 2021

mickaelistria commented Nov 24, 2021

Bananeweizen commented Nov 29, 2021

k-spa commented Oct 21, 2021 •

edited

Loading

k-spa commented Nov 21, 2021 •

edited

Loading