Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VFPU sin/cos #16984

Merged
merged 17 commits into from
Mar 28, 2023
Merged

VFPU sin/cos #16984

merged 17 commits into from
Mar 28, 2023

Conversation

fp64
Copy link
Contributor

@fp64 fp64 commented Feb 19, 2023

Implements #16946, with the recent fix.

@hrydgard hrydgard added this to the v1.15.0 milestone Feb 19, 2023
@hrydgard
Copy link
Owner

Looks good, though I'm not the biggest fan of giant text-format headers with binary data. Maybe a binary file in /assets would make sense? "inc-bin"-type approaches are probably out due to cross platform issues.

@hrydgard
Copy link
Owner

Actually maybe it's not so bad these days with the header thing, compiling them isn't that slow..

@fp64
Copy link
Contributor Author

fp64 commented Feb 19, 2023

Not terribly pleased with the header myself.

As for speed, this has some benchmarks (this specific header represents 473KB of binary, so 400KB column is closest: compile time should be around 2s for GCC, and 4s for MSVC). Memory benchmarks below that are also interesting.

@ghost ghost mentioned this pull request Feb 21, 2023
@fp64
Copy link
Contributor Author

fp64 commented Mar 9, 2023

So I assume this one is obsoleted.
Or I can try to add the rest of the functions right here, probably with data in /assets (is VFS reworking over?).

@hrydgard
Copy link
Owner

hrydgard commented Mar 9, 2023

VFS rework is done, but shouldn't make much difference.

Yeah, I think it makes more sense to have a PR with everything in at once, then this can be tested against the math-problematic games like Ridge Racer more easily.

@fp64
Copy link
Contributor Author

fp64 commented Mar 10, 2023

Hm, storing tables as files in theory requires to do something about endianness.

Anyway, I'm having trouble testing this, because on my machine (32-bit Linux) complied PPSSPP (master, e.g. commit 938230e7d64fe5a1826a825d42b72eef91ba615b here) now crashes trying to start any game.

Some details, in case it is useful:

GDB
$ gdb PPSSPPSDL 
Reading symbols from PPSSPPSDL...
(gdb) run
Starting program: [...]/ppsspp/build/PPSSPPSDL 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
VulkanMayBeAvailable: Device allowed ('SDL:Linux')
VulkanMayBeAvailable: Library loaded ('libvulkan.so.1')
VulkanMayBeAvailable: Enumerating instance extensions
VulkanMayBeAvailable: Instance extension count: 2
VulkanMaybeAvailable: Instance extension found: VK_EXT_debug_report (00000009)
VulkanMaybeAvailable: Instance extension found: VK_EXT_debug_utils (00000002)
Surface extension not found
Vulkan with working device not detected.
DEBUG: Vulkan is not available, not using Vulkan.
Info: We compiled against SDL version 2.26.2 and we are linking against SDL version 2.26.2. :)
ThreadManager::Init(compute threads: 2, all: 6)
[New Thread 0xb6515b40 (LWP 16964)]
[New Thread 0xb5d14b40 (LWP 16965)]
[New Thread 0xb5513b40 (LWP 16966)]
[New Thread 0xb4d12b40 (LWP 16967)]
[New Thread 0xb4511b40 (LWP 16968)]
[New Thread 0xb3d10b40 (LWP 16969)]
30:33:614 Core/Config.cpp:634 I[G3D]: Longest display side: 1280 pixels. Choosing scale 2
[New Thread 0xb350fb40 (LWP 16970)]
[Thread 0xb350fb40 (LWP 16970) exited]
dp_xres/yres: 960, 544
pixel_xres/yres: 960, 544
dpi, x, y: 1.000000, 1.000000, 1.000000
pixel_in_dps: 1.000000, 1.000000
dpi_real: 1.000000, 1.000000
display_hz: 60.000000
rotation: 0
1.000000 0.000000 0.000000 0.000000
0.000000 1.000000 0.000000 0.000000
0.000000 0.000000 1.000000 0.000000
0.000000 0.000000 0.000000 1.000000

[New Thread 0xaaf45b40 (LWP 16971)]
OpenGL 2.0 or higher.
[New Thread 0xaa5bcb40 (LWP 16973)]
loading control pad mappings from gamecontrollerdb.txt: SUCCESS!
[New Thread 0xa9bffb40 (LWP 16974)]
[New Thread 0xa90f6b40 (LWP 16975)]
[New Thread 0xa87f5b40 (LWP 16976)]
[New Thread 0xa7ff4b40 (LWP 16977)]
[Thread 0xa7ff4b40 (LWP 16977) exited]
[Thread 0xa87f5b40 (LWP 16976) exited]
30:44:466 Core/System.cpp:421 N[BOOT]: PPSSPP v1.12.3-5014-g938230e7d
30:44:579 Core/Compatibility.cpp:140 N[G3D]: UnitsPerMeter for ULJM05781: 0.000000
[New Thread 0xa7ff4b40 (LWP 16978)]
[New Thread 0xb350fb40 (LWP 16979)]
[New Thread 0xa87f5b40 (LWP 16980)]
[Thread 0xa87f5b40 (LWP 16980) exited]
[New Thread 0xa58f2b40 (LWP 16981)]
[New Thread 0xa50f1b40 (LWP 16982)]
[Thread 0xa7ff4b40 (LWP 16978) exited]
30:45:678 root         N[BOOT]: UI/EmuScreen.cpp:360 Loading [...]/Valkyria Chronicles 3 (English v2)/Valkyria Chronicles 3 (English v2).iso...
30:45:931 Odin_Main    E[SCEUTIL]: HLE/sceUtility.cpp:514 80111102=sceUtilityLoadModule(00000301): already loaded
30:45:931 Odin_Main    E[SCEUTIL]: HLE/sceUtility.cpp:514 80111102=sceUtilityLoadModule(00000300): already loaded
30:45:931 Odin_Main    E[SCEUTIL]: HLE/sceUtility.cpp:514 80111102=sceUtilityLoadModule(00000302): already loaded
30:46:058 CRI FS File  E[SCEIO]: HLE/sceIo.cpp:2583 Not a valid PGD file. Open as normal file.

Thread 11 "Emu" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xa9bffb40 (LWP 16974)]
0x008f45ca in TextureCacheGLES::BuildTexture(TexCacheEntry*) ()
(gdb) bt
#0  0x008f45ca in TextureCacheGLES::BuildTexture(TexCacheEntry*) ()
#1  0x0084d47b in TextureCacheCommon::ApplyTexture() ()
#2  0x008f85de in DrawEngineGLES::DoFlush() ()
#3  0x008fa006 in DrawEngineGLES::DispatchFlush() ()
#4  0x0087371a in GPUCommonHW::FastRunLoop(DisplayList&) ()
#5  0x0086c16c in GPUCommon::InterpretList(DisplayList&) ()
#6  0x0086c42f in GPUCommon::ProcessDLQueue() ()
#7  0x008707cc in GPUCommon::InterruptEnd(int) ()
#8  0x006bd26b in __KernelReturnFromInterrupt() ()
#9  0x0066b14f in CallSyscallWithoutFlags(HLEFunction const*) ()
#10 0xa59ad6db in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) disassemble 0x008f45ca
Dump of assembler code for function _ZN16TextureCacheGLES12BuildTextureEP13TexCacheEntry:
   0x008f4510 <+0>:	push   %ebp
   0x008f4511 <+1>:	mov    $0x100,%eax
   0x008f4516 <+6>:	mov    %esp,%ebp
   0x008f4518 <+8>:	push   %edi
   0x008f4519 <+9>:	push   %esi
   0x008f451a <+10>:	call   0x507123 <__x86.get_pc_thunk.si>
   0x008f451f <+15>:	add    $0xd11ae1,%esi
   0x008f4525 <+21>:	push   %ebx
   0x008f4526 <+22>:	sub    $0x90,%esp
   0x008f452c <+28>:	mov    %ax,-0x50(%ebp)
   0x008f4530 <+32>:	lea    -0x50(%ebp),%eax
   0x008f4533 <+35>:	push   0xc(%ebp)
   0x008f4536 <+38>:	push   %eax
   0x008f4537 <+39>:	push   0x8(%ebp)
   0x008f453a <+42>:	mov    %esi,%ebx
   0x008f453c <+44>:	mov    %eax,-0x88(%ebp)
   0x008f4542 <+50>:	call   0x848630 <_ZN18TextureCacheCommon19PrepareBuildTextureER16BuildTexturePlanP13TexCacheEntry>
   0x008f4547 <+55>:	add    $0x10,%esp
   0x008f454a <+58>:	test   %al,%al
   0x008f454c <+60>:	je     0x8f4ae2 <_ZN16TextureCacheGLES12BuildTextureEP13TexCacheEntry+1490>
   0x008f4552 <+66>:	mov    0xc(%ebp),%eax
   0x008f4555 <+69>:	mov    0x18(%eax),%edi
   0x008f4558 <+72>:	test   %edi,%edi
   0x008f455a <+74>:	je     0x8f458d <_ZN16TextureCacheGLES12BuildTextureEP13TexCacheEntry+125>
   0x008f455c <+76>:	sub    $0xc,%esp
   0x008f455f <+79>:	lea    -0x4b0e0d(%esi),%eax
   0x008f4565 <+85>:	push   %eax
   0x008f4566 <+86>:	lea    -0x4541e0(%esi),%eax
   0x008f456c <+92>:	push   %eax
   0x008f456d <+93>:	lea    -0x454260(%esi),%eax
   0x008f4573 <+99>:	push   $0xee
   0x008f4578 <+104>:	push   %eax
   0x008f4579 <+105>:	lea    -0x4541cc(%esi),%eax
   0x008f457f <+111>:	push   %eax
   0x008f4580 <+112>:	call   0xa37e10 <_Z12HandleAssertPKcS0_iS0_S0_z>
   0x008f4585 <+117>:	add    $0x20,%esp
   0x008f4588 <+120>:	test   %al,%al
   0x008f458a <+122>:	jne    0x8f458d <_ZN16TextureCacheGLES12BuildTextureEP13TexCacheEntry+125>
   0x008f458c <+124>:	int3   
   0x008f458d <+125>:	mov    -0x2c(%ebp),%eax
   0x008f4590 <+128>:	sub    $0x4,%esp
   0x008f4593 <+131>:	mov    %esi,%ebx
   0x008f4595 <+133>:	mov    %eax,-0x7c(%ebp)
   0x008f4598 <+136>:	mov    -0x28(%ebp),%eax
   0x008f459b <+139>:	mov    %eax,-0x84(%ebp)
   0x008f45a1 <+145>:	lea    0x2a970(%esi),%eax
   0x008f45a7 <+151>:	mov    0x314(%eax),%eax
   0x008f45ad <+157>:	and    $0x3,%eax
   0x008f45b0 <+160>:	push   %eax
   0x008f45b1 <+161>:	mov    0xc(%ebp),%eax
   0x008f45b4 <+164>:	movzbl 0x10(%eax),%eax
   0x008f45b8 <+168>:	push   %eax
   0x008f45b9 <+169>:	push   0x8(%ebp)
   0x008f45bc <+172>:	call   0x8f44a0 <_ZNK16TextureCacheGLES13GetDestFormatE15GETextureFormat15GEPaletteFormat>
   0x008f45c1 <+177>:	mov    -0x20(%ebp),%edx
   0x008f45c4 <+180>:	add    $0x10,%esp
   0x008f45c7 <+183>:	mov    %al,-0x7d(%ebp)
=> 0x008f45ca <+186>:	cmpb   $0x0,0x3a(%edx)
   0x008f45ce <+190>:	je     0x8f45eb <_ZN16TextureCacheGLES12BuildTextureEP13TexCacheEntry+219>
   0x008f45d0 <+192>:	mov    (%edx),%edi
   0x008f45d2 <+194>:	mov    0x4(%edx),%eax
   0x008f45d5 <+197>:	mov    -0x40(%ebp),%ecx
   0x008f45d8 <+200>:	sub    %edi,%eax
   0x008f45da <+202>:	sar    $0x3,%eax
   0x008f45dd <+205>:	imul   $0xcccccccd,%eax,%eax
   0x008f45e3 <+211>:	cmp    %ecx,%eax
   0x008f45e5 <+213>:	ja     0x8f4c70 <_ZN16TextureCacheGLES12BuildTextureEP13TexCacheEntry+1888>
   0x008f45eb <+219>:	cmpl   $0x1,-0x3c(%ebp)
   0x008f45ef <+223>:	jle    0x8f4c10 <_ZN16TextureCacheGLES12BuildTextureEP13TexCacheEntry+1792>
   0x008f45f5 <+229>:	movb   $0x4,-0x7d(%ebp)
   0x008f45f9 <+233>:	mov    0x8(%ebp),%eax
   0x008f45fc <+236>:	mov    -0x24(%ebp),%edx
   0x008f45ff <+239>:	mov    0x224(%eax),%edi
   0x008f4605 <+245>:	lea    0x558(%edi),%ecx
   0x008f460b <+251>:	mov    %ecx,-0x90(%ebp)
   0x008f4611 <+257>:	cmp    $0x1,%edx
   0x008f4614 <+260>:	je     0x8f4b90 <_ZN16TextureCacheGLES12BuildTextureEP13TexCacheEntry+1664>
   0x008f461a <+266>:	sub    $0xc,%esp
   0x008f461d <+269>:	mov    %esi,%ebx
   0x008f461f <+271>:	mov    %edx,-0x8c(%ebp)
   0x008f4625 <+277>:	push   $0x34
...

It crashes fairly consistently at this location, regardless of what game is loaded.

@hrydgard
Copy link
Owner

hrydgard commented Mar 10, 2023

Hm, sorry for that, I'll take a look. It's my texture replacer stuff, somehow. Working on fixing up some stuff I broke.

As for endianness, meh, we only support little-endian platforms currently anyway. Big-endian is gone.

@hrydgard
Copy link
Owner

Actually you should hopefully be fine if you update to the current master now.

@fp64
Copy link
Contributor Author

fp64 commented Mar 10, 2023

Yes, fixed now.

@fp64
Copy link
Contributor Author

fp64 commented Mar 10, 2023

Looks like PPSSPP doesn't do -mfpmath=sse (which clang doesn't recognize, but GCC does), which probably affects things in 32-bit. Need to try adding that to CMakeLists.txt near -msse2. There's also the whole -mstackrealign thing (relevant to MinGW, but sometimes does rear its ugly head on Linux), which hopefully we do not need to care about.

As for Ridge Racer, it seems that while JIT, Interpreter and IR all give different results, none of them are affected by these functions (car path, that is; camera is affected if I make them return something funky).

Or I might have missed something. I just added the functions to MIPSVFPUUtils.cpp, and tweaked MIPSIntVFPU.cpp to actually use them. Didn't touch JIT (e.g. CompVFPU.cpp) which currently seems to be doing its own thing for at least sqrt.

Might add the commits, so you can take a look, but I need to figure out how to make CMakeLists.txt copy new assets/vfpu on build.

@fp64
Copy link
Contributor Author

fp64 commented Mar 10, 2023

Okay, after -mfpmath=sse JIT, Interpreter and IR produce the same behaviour (visually) in Ridge Racer. It's still wrong, but consistent. The behaviour is the same as JIT already had (IR was close, but not quite, Interpreter - pretty far).
Besides better consistency, the prformance seems to slightly improve as well.

Again, this only affects 32-bit x86 (and only on GCC).

@hrydgard
Copy link
Owner

That's a good catch, great to have 32-bit match!

@fp64
Copy link
Contributor Author

fp64 commented Mar 10, 2023

Ok, WIP version.

@fp64
Copy link
Contributor Author

fp64 commented Mar 10, 2023

...and I got it (copying to assets/vfpu on build) wrong anyway (it was mostly guessing, without actually knowing CMake).

Other notes:

  • No loading on demand so far.
  • Table files are stored raw, but they are actually fairly compressible. May change later.
  • Functions not routed to JIT yet.

@fp64
Copy link
Contributor Author

fp64 commented Mar 10, 2023

Rather, it looks like the assets/vfpu is copied fine (on my system, anyway), but --unittest doesn't do g_VFS.Register()? Not sure.

@hrydgard
Copy link
Owner

Yeah that wouldn't surprise me, feel free to add that.

@fp64
Copy link
Contributor Author

fp64 commented Mar 10, 2023

Let's try this again.

@hrydgard
Copy link
Owner

Such globals I think are fine for this kind of use, but I do generally prefer to put them at the top of the file instead - statics in functions are harder to see and care needs to be taken with globals in general.

@unknownbrackets
Copy link
Collaborator

unknownbrackets commented Mar 12, 2023

Right. Just to say, we should also not need to really care about threading here since only one thread should ever actively call these funcs.

2211584 vfpu.tar.zst.1

Ah, that's less reduction than I imagined, might still be a win at low compression levels, though they probably compress worse individually.

-[Unknown]

@fp64
Copy link
Contributor Author

fp64 commented Mar 12, 2023

I think C++11 guarantees that local statics are initialized exactly once (on "first" invocation), even in multi-threaded setting, which is what I'm trying to leverage (can these functions be called from multiple threads?), rather than rolling my own thread-safe loader.

The actual code looks like this:

float vfpu_exp2(float x) {
	static bool loaded =
	        LOAD_TABLE(vfpu_exp2_lut65536,    512)&&
		LOAD_TABLE(vfpu_exp2_lut,      262144);
// Rest of the function.
}

The tables themselves are still declared globally.

@hrydgard
Copy link
Owner

As unknown says, the threading behavior is not relevant since we only run the CPU emulation on one thread.

@fp64
Copy link
Contributor Author

fp64 commented Mar 12, 2023

Some tidbits, if you find them useful.

  1. On my system ./PPSSPPUnitTest ALL fails with
[...]/ppsspp/Common/x64Emitter.cpp:WriteAVXOp:1427): [cpu_info.bAVX] (menu) Trying to use AVX on a system that doesn't support it.
Trying to use AVX on a system that doesn't support it.
Trace/breakpoint trap

but, perhaps, that's what it's supposed to do.

  1. For python3 test.py -g --graphics=software the pspautotests/tests/audio/sascore/vag.prx: (the message before Build / test (ubuntu-latest) (pull_request) previously failed) seems to be where e.g. less would wait for the next buffer chunk. Maybe flushing stdout and stderr (on cpp and/or python side) after each test makes sense?

@fp64
Copy link
Contributor Author

fp64 commented Mar 12, 2023

Load-on-demand stuff (no fallbacks presently).

@unknownbrackets
Copy link
Collaborator

Quick-fixed the AVX bug in #17101. It's just that we were testing AVX codegen and disassembly.

-[Unknown]

@fp64
Copy link
Contributor Author

fp64 commented Mar 12, 2023

though they probably compress worse individually

Actually

2211584 vfpu.tar.zst.1
2181120 vfpu.zst.1.tar

No, I don't know why.

@unknownbrackets
Copy link
Collaborator

unknownbrackets commented Mar 12, 2023

Hm, vfpu_sin_lut8192 is nullptr for some reason on Windows when running the unit tests, which is why it crashes. Let's see.

Or not, my working dir for the debug session had the wrong assets I think.

-[Unknown]

@fp64
Copy link
Contributor Author

fp64 commented Mar 12, 2023

Ok, lets try to bring in #17102 (and just up-to-date with master).

@unknownbrackets
Copy link
Collaborator

Actually, I think it's still not working - I forgot to merge when I pushed to actions, I only tested the merge locally. I suspect it's still broken, testing a change on my fork's actions now in case. If so, might have to merge one more time (sorry.)

-[Unknown]

@fp64
Copy link
Contributor Author

fp64 commented Mar 13, 2023

Ok, now (after #17105) it seems to work.

To reiterate: JIT is still mostly not using these functions (it already uses vfpu_sin/vfpu_cos). That's, probably, a separate thing, not for this PR (not sure about perf cost, too).

At least one hex-float constant (needs C++17) is still there, but it seems to compile fine on all platforms, so leaving it as-is.

No fallback for now: if it cannot load the tables most calls would just crash (e.g. vfpu_sin(INFINITY) does not need tables, but most args do). Not sure if needed ("crash early, crash hard" may help debugging, vs fallback likely more convenient to users). It's easy to add: e.g. if(!loaded) return sinf(M_PI_2*x);.

Dissidia loads sin/cos during startup, then asin when starting ingame, then log2/exp2 during battle (after the first attack, apparently).

Can be both HP or BRV attack, from either player or opponent.

@fp64
Copy link
Contributor Author

fp64 commented Mar 13, 2023

Problematic replays from #6710 are still available, but just in case, here's a copy, in zip format:
Replays.zip
ULUS10566Replay16.zip

Haven't tested yet.

@fp64
Copy link
Contributor Author

fp64 commented Mar 13, 2023

Hm, Replay16 looks different now, but (I think) still desyncs.
Does anyone have a video of what it should look, on real PSP?

@fp64
Copy link
Contributor Author

fp64 commented Mar 16, 2023

Oh, I'm dumb. USE_VFPU_SQRT was still false, so I was testing with effectively vfpu_sqrt and vfpu_rsqrt disabled.
Looking at #2990 (comment)

vsqrt     (3891304)   // MAYBE: Introducing a small error makes glitches MUCH worse.
vrsq      (971862)    // MAYBE: Introducing a small error makes glitches MUCH worse.

I started to wonder, why the tables for sqrt do not show in logs.

Also, tables for rcp are only loaded in Interpreter mode, not IR interpreter.
Also, Ridge Racer does use vrnd*. And Dissidia does not.

@fp64
Copy link
Contributor Author

fp64 commented Mar 16, 2023

Hm, with USE_VFPU_DOT = false, USE_VFPU_SQRT = true, the car advances further, but still gets stuck bumping walls after a while. With USE_VFPU_DOT = true, USE_VFPU_SQRT = true difference seems earlier, the car reverses at some point and starts going backwards.

Dissidia replays are also affected, but not fixed.

This was referenced Mar 22, 2023
@hrydgard
Copy link
Owner

hrydgard commented Mar 23, 2023

Building this, I'm getting lot of warnings about applying unary minus to an unsigned value, like these:

uint32_t lo = (x + 0u) & -64u;

But I think we can indeed trust that all compilers will do the same thing with that, so we should probably just add a #pragma warning(ignore) kind of thing?

Anyway, I tested a bit on Android, no unknown issues (not that I would expect any).

@fp64
Copy link
Contributor Author

fp64 commented Mar 24, 2023

Sounds ok.
Well, if e.g. unsigned was 16-bit (which is legal per C++ Standard) the code example above would break (which may explain the warning).

@hrydgard
Copy link
Owner

Alright, let's just get it in now. Not super happy about the large tables of course, but the improved accuracy should be worth it in the long run.

I'm thinking maybe we should fall back to inaccurate functions if the files are not there, but we already have several hard dependencies on the assets folder so we'd be screwed anyway. Though we are largely compatible with mismatched-version assets.

@hrydgard hrydgard merged commit 98d9a9c into hrydgard:master Mar 28, 2023
@fp64
Copy link
Contributor Author

fp64 commented Mar 28, 2023

Now that it's in the codebase, just to list some remaining things:

  1. Make it available in JIT/IR.
  2. Maybe default USE_VFPU_SQRT (and USE_VFPU_DOT while we are at it) to true?
  3. Fallback if tables are not present.
  4. Maybe option to disable/fallback regardless (not sure what speed impact is, but e.g. 4×vfpu_sqrt is likely a lot slower than sqrtps).
  5. Make tables compressed?

Probably not by me (at least, I'm not confident enough to mess with JIT).

hrydgard added a commit that referenced this pull request Apr 3, 2023
…sing.

PR #16984 added more accurate versions of these functions, but they require
large lookup tables stored in assets/.

If these files are missing, PPSSPP would simply crash, which isn't good.

We should probably try to warn the user somehow that these files are
missing, though...
@fp64 fp64 deleted the vfpu-sincos branch June 30, 2023 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants