Implement a jit for drawing pixels in the software renderer #15163

Merged
merged 31 commits into from
Nov 28, 2021

unknownbrackets
Collaborator

@unknownbrackets unknownbrackets commented Nov 23, 2021

This is not yet battle-tested, but it implements a jit for the software renderer, on x64 only. I created a very simple reg cache for it, which I tried to design with arm64 in mind too.
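
For readers unfamiliar with the term, a register cache for a jit can be as small as a table of which registers are currently handed out. The following is a minimal sketch under that assumption, not the code from this pull request: `Reg`, `SimpleRegCache`, and its methods are hypothetical names, and the four abstract registers stand in for whatever machine registers an x64 or arm64 backend would map them to.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical abstract register ids. A real backend would map these to
// machine registers (different sets on x64 and arm64).
enum class Reg : uint8_t { R0, R1, R2, R3, Invalid };

constexpr size_t kNumRegs = 4;

class SimpleRegCache {
public:
    // Returns the lowest-numbered free register, or Reg::Invalid if none is free.
    Reg Alloc() {
        for (size_t i = 0; i < kNumRegs; ++i) {
            if (!inUse_[i]) {
                inUse_[i] = true;
                return static_cast<Reg>(i);
            }
        }
        return Reg::Invalid;
    }

    // Marks a previously allocated register as free again.
    void Release(Reg r) {
        const size_t i = static_cast<size_t>(r);
        assert(i < kNumRegs && inUse_[i] && "releasing a register that is not allocated");
        inUse_[i] = false;
    }

    // Number of registers that are currently available.
    size_t FreeCount() const {
        size_t n = 0;
        for (bool used : inUse_) {
            if (!used)
                ++n;
        }
        return n;
    }

private:
    std::array<bool, kNumRegs> inUse_{};  // all false: every register starts free
};
```

Keeping the ids abstract is one way to let the same allocation logic serve more than one architecture, with only the id-to-machine-register mapping differing per backend.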

I didn't test Linux at all, though I did attempt to support it.

Currently, seeing 50-80% improvement in some games, most notably Cave Story. Alpha blending is not yet implemented (I want to check its accuracy more), which is the main thing preventing it from running in most games.

I had previously messed with drawing 4 pixels simultaneously with a mask. I didn't go nearly as far with it, but there were some complexities, and my first experiments in C++ suggested it wouldn't be faster. I'm still not sure. Doing that would mean z and fog would be vec4s, and color would be packed 16x8 (4x4x8). A mask would be passed in addition. That could still be interesting, but I wanted to try this for now.
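
To make the four-pixel layout concrete, here is a rough scalar sketch of the data described above: one z and one fog value per lane, 16 bytes of color (4 pixels x 4 channels x 8 bits), and a separate lane mask. The names `PixelQuad` and `WriteQuadMasked`, the 32-bit integer lane type, the RGBA byte order, and the 32-bit destination format are all assumptions for illustration; an actual implementation would use SIMD registers rather than a loop.

```cpp
#include <array>
#include <cstdint>

// Hypothetical layout for a group of four pixels processed together.
struct PixelQuad {
    std::array<int32_t, 4> z;       // one depth value per lane (lane type is an assumption)
    std::array<int32_t, 4> fog;     // one fog factor per lane (lane type is an assumption)
    std::array<uint8_t, 16> color;  // 4 pixels x 4 channels x 8 bits, lane 0 first
};

// Writes only the lanes whose bit is set in `mask` (bit i selects lane i) to
// four consecutive 32-bit pixels at `dst`. Returns how many lanes were written.
inline int WriteQuadMasked(const PixelQuad &q, uint8_t mask, uint32_t *dst) {
    int written = 0;
    for (int lane = 0; lane < 4; ++lane) {
        if ((mask & (1u << lane)) == 0)
            continue;  // lane is masked off, leave the destination pixel untouched
        const uint8_t *c = &q.color[lane * 4];
        dst[lane] = static_cast<uint32_t>(c[0]) |
                    (static_cast<uint32_t>(c[1]) << 8) |
                    (static_cast<uint32_t>(c[2]) << 16) |
                    (static_cast<uint32_t>(c[3]) << 24);
        ++written;
    }
    return written;
}
```

The mask is what lets a group that only partly covers a primitive still go through the same code path, at the cost of passing more data per invocation.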

There are probably still some areas to improve; I've only looked a little at the produced assembly to weed out silly patterns.

Also: not currently well tested. Pretty much just wrote out all the code until "softjit: Initial color write" without any good way to test, but it worked with only a few small fixes, actually.

-[Unknown]

@hrydgard
Owner

Code looks great already!

Yeah it would indeed be a lot of data to pass in and out for 2x2, but you'd also get up to four times the work done in one invocation.

But yeah, hard to say if the benefit would actually be that large.

@unknownbrackets unknownbrackets changed the title from "Implement a jit for drawing pixel in the software renderer" to "Implement a jit for drawing pixels in the software renderer" Nov 23, 2021
It's easier to use it in these places, but seems it stalls longer on the
dest reg.
@unknownbrackets
Collaborator Author

Okay, now all paths are implemented. Some stats:

  • Improvements range from 30-80%, generally.
  • Actual pixel drawing now takes approximately 1/3 the time based on profiling.
  • Profiles are now dominated by threading overhead (it seems even higher than expected?) and ApplyTexturing().
  • The improvements are similar, but slightly better (at most 10 points), with threading disabled.

Was honestly hoping for more, but this really improves things and makes drawPixel no longer a bottleneck in simpler cases. For areas where it runs slower, even excluding threading overhead, ApplyTexturing() is typically 20-40% of the profile (including samplerjit, which can be 10-30%).
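
As a rough sanity check on these figures, Amdahl's law relates the local speedup of pixel drawing (about 3x, from "1/3 the time") to the overall improvement. The sketch below assumes that "30-80% improvement" means an overall speedup of 1.3x to 1.8x and that nothing else changed; the function names are hypothetical. Under those assumptions, pixel drawing would have accounted for roughly 35% to 67% of the original run time.

```cpp
#include <cassert>

// Overall speedup when a fraction `f` of the run time becomes `s` times
// faster and the rest is unchanged (Amdahl's law).
inline double OverallSpeedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Inverse of the above: the fraction of the original run time that must have
// been sped up by `s` in order to observe the given overall speedup.
inline double FractionForSpeedup(double overall, double s) {
    assert(overall >= 1.0 && s > 1.0);
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s);
}
```

For example, `FractionForSpeedup(1.8, 3.0)` is 2/3 and `FractionForSpeedup(1.3, 3.0)` is about 0.35. This also illustrates why the remaining time is dominated by other parts such as threading overhead and texturing once pixel drawing gets cheaper.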

-[Unknown]

@unknownbrackets unknownbrackets marked this pull request as ready for review November 26, 2021 19:04
@unknownbrackets
Collaborator Author

I did some initial experiments with four pixels at a time, and I think I found a way around the problem I saw before that hurt performance. Seems like it'll be a win, but I plan to build upon this and don't want to add more commits to this pull...

It also looks like the sampler path and UV/ST handling could use more SIMD and go through the jit, so I might worry about that first.

-[Unknown]

Owner

@hrydgard hrydgard left a comment

Sorry for slowness reviewing! Looks good, let's merge.

@hrydgard hrydgard merged commit aa12c9b into hrydgard:master Nov 28, 2021
@unknownbrackets unknownbrackets deleted the softjit branch November 28, 2021 13:23
@unknownbrackets unknownbrackets added this to the v1.13.0 milestone Nov 28, 2021
@unknownbrackets
Collaborator Author

So I implemented a SIMD version:
master...unknownbrackets:softjit-vec

In the best cases, it gives an improvement of e.g. 254->278. In most cases, it's pretty flat. And in some (heavy users of DrawSprite, which still does 1 pixel at a time), it's slower (e.g. 600->500).
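
To put the two quoted pairs on a common scale, the relative change can be computed directly. The comment does not state the unit of these numbers, so this small sketch only assumes that higher is better; `PercentChange` is a hypothetical helper. It gives roughly +9.4% for 254->278 and roughly -16.7% for 600->500.

```cpp
#include <cassert>

// Relative change from `before` to `after`, in percent.
// Positive values mean an increase, negative values a decrease.
inline double PercentChange(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```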

Not sure if I did something obviously silly, though I think the main bottleneck is indeed the sampler.

-[Unknown]

@hrydgard
Owner

hrydgard commented Dec 2, 2021

Hm, that's interesting (and cool work!). Though when I thought SIMD would help a lot, I assumed that much more of the execution time was taken by the pixel pipeline. As for the sampler, maybe texture filtering four pixels at a time using SIMD would be better than one at a time, but then again, maybe not.
