Old 3DS Performance #28
Comments
I think a big benefit would come from moving the functions gfx_sp_vertex and/or gfx_sp_tri1 from https://github.com/sm64-port/sm64_3ds/blob/master/src/pc/gfx/gfx_pc.c to run on the GPU via vertex shader(s).
I tried simplifying the draw_triangles calls to use ~10 (slightly) different shaders and simply memcpy-ing the vbo_buf, but it didn't do anything for performance. If anything, it was slightly worse.
Do you have the source for the VPU stuff?
Edit: tired brain was conflating VPU/GPU. I've switched out the
That said, do you have additional functionality implemented in assembly? Basically:
    float x = v->ob[0] * rsp.MP_matrix[0][0] + v->ob[1] * rsp.MP_matrix[1][0] + v->ob[2] * rsp.MP_matrix[2][0] + rsp.MP_matrix[3][0];
    float y = v->ob[0] * rsp.MP_matrix[0][1] + v->ob[1] * rsp.MP_matrix[1][1] + v->ob[2] * rsp.MP_matrix[2][1] + rsp.MP_matrix[3][1];
    float z = v->ob[0] * rsp.MP_matrix[0][2] + v->ob[1] * rsp.MP_matrix[1][2] + v->ob[2] * rsp.MP_matrix[2][2] + rsp.MP_matrix[3][2];
    float w = v->ob[0] * rsp.MP_matrix[0][3] + v->ob[1] * rsp.MP_matrix[1][3] + v->ob[2] * rsp.MP_matrix[2][3] + rsp.MP_matrix[3][3];
which is DP4, I think? And these two would be good:
    static inline void gfx_normalize_vector(float v[3]) {
        float s = sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        v[0] /= s;
        v[1] /= s;
        v[2] /= s;
    }
    static inline void gfx_transposed_matrix_mul(float res[3], const float a[3], const float b[4][4]) {
        res[0] = a[0] * b[0][0] + a[1] * b[0][1] + a[2] * b[0][2];
        res[1] = a[0] * b[1][0] + a[1] * b[1][1] + a[2] * b[1][2];
        res[2] = a[0] * b[2][0] + a[1] * b[2][1] + a[2] * b[2][2];
    } |
@CarlosEFML I've just managed to get audio moved across to the 2nd CPU and it's drastically improved performance. It's by no means perfect, but the slowdowns during winged-cap / surfing turtle etc are significantly reduced. |
@mkst Good job!!! I'll try to port gfx_normalize_vector and gfx_transposed_matrix_mul to VFP ASM. |
The first piece of code (generating x, y, z, w) is called for every vertex (see Are you writing the ASM by hand, or creating a simple function and then compiling with fpu=neon? or some other method? In any case, I look forward to the results! |
I'm writing by hand and it's been a long time since my last line of code for 3DS. But you're right, I will try to optimize gfx_sp_vertex first. |
They have a downloadable version in .cia format. |
Bad news. I did not notice any improvement with the use of VFP. Maybe because of the function call overhead (I couldn't make the asm inline code work). I created a PR with the code in case you want to test. Unfortunately, 3DS only supports scalar operations and does not support vector operations like this: |
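If per-call overhead really is the limiting factor (that is only a guess in the comment above, not something measured), one way to test the idea is to transform a whole batch of vertices per call, so the call cost is paid once per batch rather than once per vertex. A plain C sketch, with `transform_vertices_batch` as a hypothetical name that does not exist in the port:

```c
#include <stddef.h>

/* Hypothetical batch entry point: one call transforms `count` positions.
 * `ob` holds object-space positions, `mp` stands in for rsp.MP_matrix,
 * and `out` receives x, y, z, w for each vertex. */
void transform_vertices_batch(const float (*ob)[3], size_t count,
                              const float mp[4][4], float (*out)[4]) {
    for (size_t i = 0; i < count; i++) {
        const float x = ob[i][0], y = ob[i][1], z = ob[i][2];
        for (int c = 0; c < 4; c++) {
            /* Same arithmetic as the per-vertex x/y/z/w expressions. */
            out[i][c] = x * mp[0][c] + y * mp[1][c] + z * mp[2][c] + mp[3][c];
        }
    }
}
```

A hand-written routine with this shape could keep the matrix in registers across the whole loop, which a per-vertex call cannot do.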
How did you test? Were you noting CPU usage before/after or just "feeling it out"? Also, have you tried my fork (based off Gericom's) with audio running on the 2nd CPU? |
Yes, just feelings. But now I have applied this code to your fork and the result is the same. It seems that audio is the real villain in this port. When disabling the audio, we have a solid 30 fps on the O3DS. |
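For before/after comparisons that go beyond feel, a small timing helper may be enough. This is a sketch using only the standard C clock(); `mean_call_ms` and `sample_workload` are hypothetical names, and on the 3DS a finer-grained system tick counter would probably be a better time source (an assumption, not something tested here).

```c
#include <time.h>

/* Stand-in workload that only counts its calls; replace it with the
 * routine under test. */
static int sample_calls = 0;
static void sample_workload(void) { sample_calls++; }

/* Run `fn` `iterations` times and return the mean CPU time per call in
 * milliseconds. Returns 0.0 when `iterations` is not positive. */
double mean_call_ms(void (*fn)(void), int iterations) {
    if (iterations <= 0) {
        return 0.0;
    }
    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        fn();
    }
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC / iterations;
}
```

Calling `mean_call_ms(sample_workload, 100000)` once with the C version and once with the VFP version of a function would give two numbers to compare directly.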
@CarlosEFML do you not get a smooth(er) experience with audio on the OS core? Can any of the audio code (mixer.c) be accelerated by the VPU? There are already SSE4.1 and NEON optimisations for the vanilla port, but the poor CPU in the 3DS doesn't have NEON. Also, I took your |
I tested the fork with the audio processing on CPU=1, but only with .3dsx and didn't notice any improvement. |
Interesting, are you running a new(ish) version of Luma (i.e. 10.1 or above)? It might be that the call to request 80% of CPU1 fails, so the thread is started on the app core (CPU0) instead, which would give no improvement and potentially worse performance. If you go to the first level (BOB) and get the winged cap, the game should not slow down to half-speed like it normally does. If you really want to test, you could check out the 3ds-shaders fork: the bottom screen is a console, so it should tell you which CPU is used for audio: printf("Created audio thread on core %i\n", cpu); I might put in a while loop to set the OS time as high as it can go before it fails (it was 30% on older versions of Luma). BTW, feel free to join the Discord if you'd prefer some real-time communication! |
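The "try as high as it can go before it fails" loop could look roughly like the sketch below. Everything here is hypothetical: `request_highest_share` and `fake_old_system` are made-up names, and the callback is where a real system call would go (on hardware, presumably something like libctru's APT_SetAppCpuTimeLimit, which is an assumption and not taken from the fork). The callback is assumed to return 0 on success.

```c
/* Callback type: returns 0 if the requested percentage was accepted,
 * non-zero otherwise. */
typedef int (*set_limit_fn)(int percent);

/* Try shares from `start` down to `min` in steps of `step` and return the
 * first accepted value, or -1 if every request fails or `step` is invalid. */
int request_highest_share(set_limit_fn try_set, int start, int min, int step) {
    if (step <= 0) {
        return -1;
    }
    for (int pct = start; pct >= min; pct -= step) {
        if (try_set(pct) == 0) {
            return pct;
        }
    }
    return -1;
}

/* Stand-in for a system that rejects anything above 30%, mirroring the
 * older-Luma behaviour mentioned above. */
static int fake_old_system(int percent) {
    return percent <= 30 ? 0 : 1;
}
```

With the stand-in, `request_highest_share(fake_old_system, 80, 5, 5)` walks down from 80 and settles on 30. If it returns -1, the audio thread would have to stay on the app core.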
Is it possible to port gd_math.c to use VFP on the 3DS? This is example code that I used in my homebrews: