Old 3DS Performance #28

Open
CarlosEFML opened this issue Sep 16, 2020 · 12 comments

Comments

@CarlosEFML

Is it possible to port gd_math.c to use VFP on the 3DS? Here is some example code that I used in my homebrew projects:

.global multMatrix44FPU
multMatrix44FPU:
    VPUSH {d8-d11}
    VLDMIA r1 !, {s16-s19} // Load 1st line of m2 -> [0 1 2 3]
    VLDR.F32 s20, [r0, # 16 * 0 + 0 * 4] // Load 1st col of m1 -> g
    VLDR.F32 s21, [r0, # 16 * 1 + 0 * 4] // -> m
    VLDR.F32 s22, [r0, # 16 * 2 + 0 * 4] // -> s
    VLDR.F32 s23, [r0, # 16 * 3 + 0 * 4] // -> w
    VMUL.F32 s0, s20, s16 // = {g * 0}
    VMUL.F32 s1, s20, s17 // = {g * 1}
    VMUL.F32 s2, s20, s18 // = {g * 2}
    VMUL.F32 s3, s20, s19 // = {g * 3}
    VMUL.F32 s4, s21, s16 // = {m * 0}
    VMUL.F32 s5, s21, s17 // = {m * 1}
    VMUL.F32 s6, s21, s18 // = {m * 2}
    VMUL.F32 s7, s21, s19 // = {m * 3}
    VMUL.F32 s8, s22, s16 // = {s * 0}
    VMUL.F32 s9, s22, s17 // = {s * 1}
    VMUL.F32 s10, s22, s18 // = {s * 2}
    VMUL.F32 s11, s22, s19 // = {s * 3}
    VMUL.F32 s12, s23, s16 // = {w * 0}
    VMUL.F32 s13, s23, s17 // = {w * 1}
    VMUL.F32 s14, s23, s18 // = {w * 2}
    VMUL.F32 s15, s23, s19 // = {w * 3}
    VLDMIA r1 !, {s16-s19} // Load 2nd line of m2 -> [4 5 6 7]
    VLDR.F32 s20, [r0, # 16 * 0 + 1 * 4] // Load 2nd col of m1 -> h
    VLDR.F32 s21, [r0, # 16 * 1 + 1 * 4] // -> n
    VLDR.F32 s22, [r0, # 16 * 2 + 1 * 4] // -> t
    VLDR.F32 s23, [r0, # 16 * 3 + 1 * 4] // -> x
    VMLA.F32 s0, s20, s16 // = {g * 0} + {h * 4}
    VMLA.F32 s1, s20, s17 // = {g * 1} + {h * 5}
    VMLA.F32 s2, s20, s18 // = {g * 2} + {h * 6}
    VMLA.F32 s3, s20, s19 // = {g * 3} + {h * 7}
    VMLA.F32 s4, s21, s16 // = {m * 0} + {n * 4}
    VMLA.F32 s5, s21, s17 // = {m * 1} + {n * 5}
    VMLA.F32 s6, s21, s18 // = {m * 2} + {n * 6}
    VMLA.F32 s7, s21, s19 // = {m * 3} + {n * 7}
    VMLA.F32 s8, s22, s16 // = {s * 0} + {t * 4}
    VMLA.F32 s9, s22, s17 // = {s * 1} + {t * 5}
    VMLA.F32 s10, s22, s18 // = {s * 2} + {t * 6}
    VMLA.F32 s11, s22, s19 // = {s * 3} + {t * 7}
    VMLA.F32 s12, s23, s16 // = {w * 0} + {x * 4}
    VMLA.F32 s13, s23, s17 // = {w * 1} + {x * 5}
    VMLA.F32 s14, s23, s18 // = {w * 2} + {x * 6}
    VMLA.F32 s15, s23, s19 // = {w * 3} + {x * 7}
    VLDMIA r1 !, {s16-s19} // Load 3rd line of m2 -> [8 9 A B]
    VLDR.F32 s20, [r0, # 16 * 0 + 2 * 4] // Load 3rd col of m1 -> i
    VLDR.F32 s21, [r0, # 16 * 1 + 2 * 4] // -> o
    VLDR.F32 s22, [r0, # 16 * 2 + 2 * 4] // -> u
    VLDR.F32 s23, [r0, # 16 * 3 + 2 * 4] // -> y
    VMLA.F32 s0, s20, s16 // = {g * 0} + {h * 4} + {i * 8}
    VMLA.F32 s1, s20, s17 // = {g * 1} + {h * 5} + {i * 9}
    VMLA.F32 s2, s20, s18 // = {g * 2} + {h * 6} + {i * A}
    VMLA.F32 s3, s20, s19 // = {g * 3} + {h * 7} + {i * B}
    VMLA.F32 s4, s21, s16 // = {m * 0} + {n * 4} + {o * 8}
    VMLA.F32 s5, s21, s17 // = {m * 1} + {n * 5} + {o * 9}
    VMLA.F32 s6, s21, s18 // = {m * 2} + {n * 6} + {o * A}
    VMLA.F32 s7, s21, s19 // = {m * 3} + {n * 7} + {o * B}
    VMLA.F32 s8, s22, s16 // = {s * 0} + {t * 4} + {u * 8}
    VMLA.F32 s9, s22, s17 // = {s * 1} + {t * 5} + {u * 9}
    VMLA.F32 s10, s22, s18 // = {s * 2} + {t * 6} + {u * A}
    VMLA.F32 s11, s22, s19 // = {s * 3} + {t * 7} + {u * B}
    VMLA.F32 s12, s23, s16 // = {w * 0} + {x * 4} + {y * 8}
    VMLA.F32 s13, s23, s17 // = {w * 1} + {x * 5} + {y * 9}
    VMLA.F32 s14, s23, s18 // = {w * 2} + {x * 6} + {y * A}
    VMLA.F32 s15, s23, s19 // = {w * 3} + {x * 7} + {y * B}
    VLDMIA r1, {s16-s19} // Load 4th line of m2 -> [C D E F]
    VLDR.F32 s20, [r0, # 16 * 0 + 3 * 4] // Load 4th col of m1 -> j
    VLDR.F32 s21, [r0, # 16 * 1 + 3 * 4] // -> p
    VLDR.F32 s22, [r0, # 16 * 2 + 3 * 4] // -> v
    VLDR.F32 s23, [r0, # 16 * 3 + 3 * 4] // -> z
    VMLA.F32 s0, s20, s16 // = {g * 0} + {h * 4} + {i * 8} + {j * C}
    VMLA.F32 s1, s20, s17 // = {g * 1} + {h * 5} + {i * 9} + {j * D}
    VMLA.F32 s2, s20, s18 // = {g * 2} + {h * 6} + {i * A} + {j * E}
    VMLA.F32 s3, s20, s19 // = {g * 3} + {h * 7} + {i * B} + {j * F}
    VMLA.F32 s4, s21, s16 // = {m * 0} + {n * 4} + {o * 8} + {p * C}
    VMLA.F32 s5, s21, s17 // = {m * 1} + {n * 5} + {o * 9} + {p * D}
    VMLA.F32 s6, s21, s18 // = {m * 2} + {n * 6} + {o * A} + {p * E}
    VMLA.F32 s7, s21, s19 // = {m * 3} + {n * 7} + {o * B} + {p * F}
    VMLA.F32 s8, s22, s16 // = {s * 0} + {t * 4} + {u * 8} + {v * C}
    VMLA.F32 s9, s22, s17 // = {s * 1} + {t * 5} + {u * 9} + {v * D}
    VMLA.F32 s10, s22, s18 // = {s * 2} + {t * 6} + {u * A} + {v * E}
    VMLA.F32 s11, s22, s19 // = {s * 3} + {t * 7} + {u * B} + {v * F}
    VMLA.F32 s12, s23, s16 // = {w * 0} + {x * 4} + {y * 8} + {z * C}
    VMLA.F32 s13, s23, s17 // = {w * 1} + {x * 5} + {y * 9} + {z * D}
    VMLA.F32 s14, s23, s18 // = {w * 2} + {x * 6} + {y * A} + {z * E}
    VMLA.F32 s15, s23, s19 // = {w * 3} + {x * 7} + {y * B} + {z * F}
    VPOP {d8-d11}
    VSTMIA r2, {s0-s15}
    BX lr
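
For checking the assembly above against a known-good result, a plain C reference can help. This is only a sketch: the name mult_matrix44_ref is hypothetical, and the argument order (m1, m2, result) is an assumption inferred from how the routine uses r0, r1 and r2. It assumes row-major float[4][4] matrices and computes res = m1 * m2.

```c
#include <stddef.h>

/* Reference 4x4 multiply: res = m1 * m2, row-major float[4][4].
 * Hypothetical helper; the argument order mirrors the r0/r1/r2 usage
 * of the assembly routine (an assumption, not a documented prototype). */
static void mult_matrix44_ref(const float m1[4][4], const float m2[4][4],
                              float res[4][4]) {
    float tmp[4][4];

    for (size_t i = 0; i < 4; i++) {
        for (size_t j = 0; j < 4; j++) {
            /* Same accumulation order as the VMUL/VMLA sequence:
             * start with k = 0, then add k = 1, 2, 3. */
            tmp[i][j] = m1[i][0] * m2[0][j];
            for (size_t k = 1; k < 4; k++) {
                tmp[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }

    /* Write the result only at the end, so res may alias m1 or m2.
     * The assembly behaves the same way: it stores after all loads. */
    for (size_t i = 0; i < 4; i++) {
        for (size_t j = 0; j < 4; j++) {
            res[i][j] = tmp[i][j];
        }
    }
}
```

Comparing the output of both versions on a few random matrices (with a small tolerance) should be enough to catch a wrong offset or register in the hand-written code.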
@mkst

mkst commented Sep 16, 2020

I think a big benefit would come from moving the functions gfx_sp_vertex and/or gfx_sp_tri1 from https://github.com/sm64-port/sm64_3ds/blob/master/src/pc/gfx/gfx_pc.c to run on the gpu via vertex shader(s).

I tried simplifying the draw_triangles calls to use ~10 (slightly) different shaders and simply memcpy-ing the vbo_buf, but it didn't do anything for performance. Perhaps slightly worse if anything.

Do you have the source for the vpu stuff?

Edit: my tired brain was conflating VPU/GPU. I've switched out the gfx_matrix_mul function for that assembly on my branch, and whilst it runs, it did not give an obvious performance improvement.

That said, do you have additional functionality implemented in assembly? Basically:

    float x = v->ob[0] * rsp.MP_matrix[0][0] + v->ob[1] * rsp.MP_matrix[1][0] + v->ob[2] * rsp.MP_matrix[2][0] + rsp.MP_matrix[3][0];
    float y = v->ob[0] * rsp.MP_matrix[0][1] + v->ob[1] * rsp.MP_matrix[1][1] + v->ob[2] * rsp.MP_matrix[2][1] + rsp.MP_matrix[3][1];
    float z = v->ob[0] * rsp.MP_matrix[0][2] + v->ob[1] * rsp.MP_matrix[1][2] + v->ob[2] * rsp.MP_matrix[2][2] + rsp.MP_matrix[3][2];
    float w = v->ob[0] * rsp.MP_matrix[0][3] + v->ob[1] * rsp.MP_matrix[1][3] + v->ob[2] * rsp.MP_matrix[2][3] + rsp.MP_matrix[3][3];

which is DP4, I think?
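
The four lines above are each a 4-component dot product of (x, y, z, 1) with one column of the matrix, which is why they map onto DP4. A compact C sketch of the same arithmetic follows. The name transform_position is hypothetical, and it assumes the object-space position has already been converted to float (the type of v->ob is not shown here).

```c
/* Hypothetical helper: transforms a position by a row-major 4x4 matrix,
 * treating the position as the row vector (x, y, z, 1).
 * Assumption: `ob` holds the vertex position already converted to float. */
static void transform_position(const float mp[4][4], const float ob[3],
                               float out[4]) {
    for (int c = 0; c < 4; c++) {
        /* Dot product of (ob[0], ob[1], ob[2], 1) with column c. */
        out[c] = ob[0] * mp[0][c]
               + ob[1] * mp[1][c]
               + ob[2] * mp[2][c]
               + mp[3][c];
    }
}
```

A VFP version would need this loop fully unrolled, since each output component is an independent chain of one multiply and three multiply-accumulates.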

These two would also be good candidates:

static inline void gfx_normalize_vector(float v[3]) {
    float s = sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] /= s;
    v[1] /= s;
    v[2] /= s;
}

static inline void gfx_transposed_matrix_mul(float res[3], const float a[3], const float b[4][4]) {
    res[0] = a[0] * b[0][0] + a[1] * b[0][1] + a[2] * b[0][2];
    res[1] = a[0] * b[1][0] + a[1] * b[1][1] + a[2] * b[1][2];
    res[2] = a[0] * b[2][0] + a[1] * b[2][1] + a[2] * b[2][2];
}
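
To make clear what gfx_transposed_matrix_mul computes before porting it: it multiplies the vector by the transpose of the upper-left 3x3 block of b. A minimal sketch with hypothetical names (transposed_matrix_mul3 and rotate3 are not from the port) shows that, for a pure rotation, this undoes the ordinary row-vector transform.

```c
/* Loop form of the same arithmetic as gfx_transposed_matrix_mul:
 * res[i] = dot(a, row i of the upper-left 3x3 block of b). */
static void transposed_matrix_mul3(float res[3], const float a[3],
                                   const float b[4][4]) {
    for (int i = 0; i < 3; i++) {
        res[i] = a[0] * b[i][0] + a[1] * b[i][1] + a[2] * b[i][2];
    }
}

/* Ordinary row-vector transform by the same 3x3 block (no translation):
 * res[c] = dot(a, column c of the upper-left 3x3 block of b). */
static void rotate3(float res[3], const float a[3], const float b[4][4]) {
    for (int c = 0; c < 3; c++) {
        res[c] = a[0] * b[0][c] + a[1] * b[1][c] + a[2] * b[2][c];
    }
}
```

With a 90 degree rotation about z, rotate3 maps (1, 0, 0) to (0, 1, 0) and transposed_matrix_mul3 maps it back, because the transpose of a rotation matrix is its inverse.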

@mkst

mkst commented Sep 17, 2020

@CarlosEFML I've just managed to get audio moved across to the 2nd CPU and it's drastically improved performance. It's by no means perfect, but the slowdowns during winged-cap / surfing turtle etc are significantly reduced.

@CarlosEFML
Author

@mkst Good job!!! I'll try to port gfx_normalize_vector and gfx_transposed_matrix_mul to VFP ASM.

@mkst

mkst commented Sep 18, 2020

> @mkst Good job!!! I'll try to port gfx_normalize_vector and gfx_transposed_matrix_mul to VFP ASM.

The first piece of code (generating x, y, z, w) is called for every vertex (see gfx_sp_vertex) so I think that would be a good target.

Are you writing the ASM by hand, creating a simple function and then compiling with fpu=neon, or using some other method? In any case, I look forward to the results!

@CarlosEFML
Author

I'm writing it by hand, and it's been a long time since I last wrote code for the 3DS. But you're right, I will try to optimize gfx_sp_vertex first.

@abelol954

they have a downloadable version in .cia

@CarlosEFML
Author

Bad news. I did not notice any improvement from using VFP, maybe because of the function call overhead (I couldn't get inline asm to work). I created a PR with the code in case you want to test it.

Unfortunately, 3DS only supports scalar operations and does not support vector operations like this:
VMUL.F32 q0, q1, q2

nintendo-3ds-hardware-thread

@mkst

mkst commented Sep 21, 2020

> Bad news. I did not notice any improvement from using VFP, maybe because of the function call overhead (I couldn't get inline asm to work). I created a PR with the code in case you want to test it.
>
> Unfortunately, 3DS only supports scalar operations and does not support vector operations like this:
> VMUL.F32 q0, q1, q2
>
> nintendo-3ds-hardware-thread

How did you test? Were you noting CPU usage before/after or just "feeling it out"?
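
For a before/after number rather than a feeling, a tiny timing harness is one option. This is a sketch with hypothetical names (bench_fn, bench_seconds_per_call, count_call). It uses the standard C clock() for portability, and a hardware tick counter would likely give finer resolution on the 3DS itself.

```c
#include <time.h>

/* Hypothetical callback type: the code under test, with a context pointer. */
typedef void (*bench_fn)(void *ctx);

/* Returns the average CPU time in seconds per call over `iters` calls.
 * Returns 0.0 if `iters` is not positive. */
static double bench_seconds_per_call(bench_fn fn, void *ctx, long iters) {
    if (iters <= 0) {
        return 0.0;
    }
    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        fn(ctx);
    }
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC / (double)iters;
}

/* Stand-in workload for illustration: counts how often it was called.
 * In practice this would wrap the C or VFP routine being compared. */
static void count_call(void *ctx) {
    long *calls = (long *)ctx;
    (*calls)++;
}
```

Running the same harness over the C version and the assembly version with identical inputs gives two numbers that can be compared directly.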

Also, have you tried my fork (based off Gericom's) with audio running on the 2nd CPU?

@CarlosEFML
Author

Yes, just by feel. But I have now applied this code to your fork, and the result is the same. It seems that audio is the real villain in this port. With audio disabled, we get a solid 30 fps on the O3DS.

@mkst

mkst commented Sep 22, 2020

@CarlosEFML do you not get a smooth(er) experience with audio on the OS core?

Can any of the audio code (mixer.c) be accelerated by the VFP? There are already SSE4.1 and NEON optimisations for the vanilla port, but the poor CPU in the 3DS doesn't have NEON.

Also, I took your transfByMatrix44FPU code and used it in gfx_pc.c so it would be more generic than gd_math.c, which is only used for the intro Mario head. That said, I've just added transfByMatrix44FPU too (locally), and sadly it didn't have a noticeable impact either.

@CarlosEFML
Author

I tested the fork with the audio processing on CPU=1, but only with .3dsx and didn't notice any improvement.

@mkst

mkst commented Sep 22, 2020

> I tested the fork with the audio processing on CPU=1, but only with .3dsx and didn't notice any improvement.

Interesting. Are you running a new(ish) version of Luma (i.e. 10.1 or above)? It might be that the call to request 80% of CPU1 fails, so the thread is started on the app core (CPU0) instead, which would give no improvement and potentially worse performance.

If you go to the first level (BOB) and get the winged cap, the game should not slow down to half-speed like it does normally.

If you really want to test, you could check out the 3ds-shaders fork. Its bottom screen is a console, so it should tell you which CPU is used for audio:

printf("Created audio thread on core %i\n", cpu);

I might put in a while loop to set the OS core time as high as it can go before the call fails (the limit was 30% on older versions of Luma).
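
A sketch of that probing loop, under stated assumptions: the real call that sets the limit is abstracted behind a hypothetical callback (set_limit_fn) that returns true on success, and the loop walks downward from the largest value so the first success is the highest accepted one. All names here are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical callback: tries to apply a CPU time limit (in percent)
 * and reports whether the system accepted it. */
typedef bool (*set_limit_fn)(unsigned percent);

/* Tries limits from max_percent down to min_percent in `step` decrements.
 * Returns the first (highest) accepted value, or 0 if none was accepted. */
static unsigned find_highest_limit(set_limit_fn try_set, unsigned max_percent,
                                   unsigned min_percent, unsigned step) {
    if (step == 0) {
        step = 1;
    }
    for (unsigned p = max_percent; p >= min_percent; p -= step) {
        if (try_set(p)) {
            return p;
        }
        if (p < min_percent + step) {
            break; /* next decrement would go below min or wrap around */
        }
    }
    return 0;
}

/* Stand-in stubs for illustration only: one mimics a system that accepts
 * at most 30%, the other rejects every request. */
static bool stub_accepts_up_to_30(unsigned percent) {
    return percent <= 30;
}

static bool stub_rejects_all(unsigned percent) {
    (void)percent;
    return false;
}
```

If the function returns 0, the audio thread could fall back to the app core, and printing the chosen value on the console would make that case visible.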

BTW, feel free to join the Discord if you'd prefer some real-time communication!
