Huge performance drop on r300 with default Thunder and Vega scene #1172
The cause of the bottleneck is very hard to track down. Orbit fails to show me the call trace, only identifying a very costly And gprofng also shows me something suspect with OpenAL, but what if the I/O on the sound card is actually slowed down by the whole PCIe bus being brought to its knees by aggressive I/O for rendering? But gperftools seems to track it back to the r300 DRM driver through Mesa, more explicitly: That's why I suspect something similar to what we had with the sky: that GPU has a slow PCIe 1.0 x16 bus, and if we process GPU memory directly in our computation, this can be very bad. |
What if you move backwards from there? If perf drops once the globe is in view, then it is the issue, otherwise it's not. |
Both the electric arc effect and the globe materials use some
In the slow sky thread, SomaZ recommended to also look at
|
tcMod uses a matrix uniform. |
Commenting out the I also verified that doing |
|
One may also suspect the number of tris, but Vega was running properly in Tremulous.
|
For comparing things like that it'd be helpful to know how many triangles are in each case in particular (some value of |
Same computer, but with onboard nvidia GPU.
Where the comparison is not fair is that instead of dedicated memory the Nvidia chip uses shared DDR2 memory from the host, and instead of a PCIe 1.0 x16 bus the Nvidia chip uses HyperTransport (an onboard bus shared with the CPU). This may help with The Nvidia has a larger ALU (but that should not make a difference as long as the shader runs) and may accept larger images (or properly downscale them). The ATi one uses Just as the ATi one is from the first generation of ATi cards to support OpenGL 2, that Nvidia card is from the first generation of Nvidia cards to support OpenGL 2, and is from the year after the ATi one. Both are low end; in fact the “LE” string in the product name means “low end”. That card can sustain a solid 20fps on both scenes: (Thunder |
For comparison on the same computer and that old ATi card, the With the OpenGL 2 renderer it gets 40fps (almost 50fps): Except for the vertex lighting, this is using the default Tremulous configuration (no “lowest”-like preset). That other engine is also riddled with bugs; on such hardware the OpenGL1 renderer spawns lots of error logs every frame:
and on such hardware the OpenGL2 renderer doesn't work at all with lightmaps, some fonts are totally broken, and some texture blending is broken: But the experiment says something: not only may it be possible to render such a map on such hardware without any performance drop in that scene, but we may be able to achieve double the performance on such hardware. Edit: In fact we already achieve 60fps in other maps that don't trigger the slowdown. This test was done with the ATi card described in the first message, on the computer described in the first message. |
So, it looks like I uncovered not one performance problem but two. The first is that serious performance drop in some places. The second is that we may be able to get twice the performance on some lower-end hardware but don't get it. I expect such hardware to perform better with the OpenGL1 renderer, as at the time most games were doing things the GL1 way and only using shaders for some effects, so the hardware was probably designed with that in mind. We, on the contrary, use shaders for everything. But since the OpenGL 2 renderer of ioquake3 still seems to be twice as fast as us when doing things the shader way on that range of hardware, I believe we should also get twice the performance we already have (reaching 40fps on such a map). And I'm very annoyed by how hard this is to track down. I entirely rebuilt Orbit and its dependencies to run on that old CPU, I have my own build of Mesa and libdrm (even libpciaccess) built with debug symbols, I installed all the debug packages I could find for the installed packages of my distro, including the one for the libc, and Orbit still fails to track down the backtrace of the slow code. |
It looks like I got a calltrace with GDB on
|
The slow code seems to be (if I caught the right Daemon/src/engine/renderer/tr_shade.cpp Line 328 in f42c220
glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts,
GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives ); |
I added this logger right before the Log::Warn( "=== tess.multiDrawPrimitives: %d", tess.multiDrawPrimitives ); Then on the default Vega spectator scene (
|
On the
On the
|
Just as a quick test, reducing Also, the ioq3 and grangerhub Tremulous source code has no reference to |
At the same places we call |
Also, out of curiosity, I did that, but I still get 5fps on that default Vega scene: if ( tess.multiDrawPrimitives )
{
- glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ if ( false )
+ {
+ glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ }
+ else
+ {
+ for ( i = 0; i < tess.multiDrawPrimitives; i++ )
+ {
+ glDrawElements( GL_TRIANGLES, tess.multiDrawCounts[ i ], GL_INDEX_TYPE, tess.multiDrawIndexes[ i ] );
+ }
+ } |
And if I interrupt, I interrupt in the middle of an
|
It looks to happen right after the autosprite deform in if ( oldShader != nullptr )
{
if ( oldShader->autoSpriteMode && !(tess.attribsSet & ATTR_ORIENTATION) ) {
Tess_AutospriteDeform( oldShader->autoSpriteMode,
0, tess.numVertexes,
0, tess.numIndexes );
}
Tess_End(); // <-- here
} |
So I did that: diff --git a/src/engine/renderer/tr_shade.cpp b/src/engine/renderer/tr_shade.cpp
index c99eb8369..3063a5117 100644
--- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -311,6 +311,8 @@ void GLSL_FinishGPUShaders()
Tess_DrawElements
==================
*/
+#include <ctime>
+#include <cstdio>
void Tess_DrawElements()
{
int i;
@@ -325,7 +327,10 @@ void Tess_DrawElements()
{
if ( tess.multiDrawPrimitives )
{
+ clock_t start = clock();
glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, primitives: %d\n", diff, tess.multiDrawPrimitives);
backEnd.pc.c_multiDrawElements++;
backEnd.pc.c_multiDrawPrimitives += tess.multiDrawPrimitives; I get this:
|
With: diff --git a/src/engine/renderer/tr_shade.cpp b/src/engine/renderer/tr_shade.cpp
index c99eb8369..01aac402c 100644
--- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -311,6 +311,8 @@ void GLSL_FinishGPUShaders()
Tess_DrawElements
==================
*/
+#include <ctime>
+#include <cstdio>
void Tess_DrawElements()
{
int i;
@@ -325,7 +327,10 @@ void Tess_DrawElements()
{
if ( tess.multiDrawPrimitives )
{
+ clock_t start = clock();
glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, primitives: %d, time/primitives: %f\n", diff, tess.multiDrawPrimitives, (float) diff/tess.multiDrawPrimitives);
backEnd.pc.c_multiDrawElements++;
backEnd.pc.c_multiDrawPrimitives += tess.multiDrawPrimitives; I get:
Some of them are really slow to render, 1 or 2ms for a single thing… |
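One caveat with the instrumentation above: `clock()` reports processor time consumed by the process, not elapsed wall-clock time, so time spent blocked while waiting on the driver or the bus may be under-reported. Here is a minimal sketch of a wall-clock alternative, assuming C++11 is available. `ScopedWallTimer` is a hypothetical helper, not something that exists in the engine.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical helper (not part of the engine): measures elapsed wall-clock
// time between construction and destruction and prints it in microseconds.
class ScopedWallTimer {
public:
    explicit ScopedWallTimer( const char* label )
        : label_( label ), start_( std::chrono::steady_clock::now() ) {}

    ~ScopedWallTimer() {
        std::printf( "=== time: %lld us, %s\n", ElapsedUs(), label_ );
    }

    // Microseconds elapsed since the timer was created.
    long long ElapsedUs() const {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        return static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>( elapsed ).count() );
    }

private:
    const char* label_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping the draw call in a block such as `{ ScopedWallTimer t( "multiDraw" ); /* draw call here */ }` would print one line per call when the block ends. Comparing these values with the `clock()` values may tell whether the time is spent computing on the CPU or waiting.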
If I take a simple scene that achieves 47fps on the same Vega map (
There are far fewer things to do, but still some lasting more than 1ms and even close to 2ms. So it looks like we have some very slow stuff somewhere, and having more tris just gives more chances to hit that slow stuff. If we find it and speed it up, we may not only fix the huge performance drop but also improve general performance. |
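To put such timings in perspective, here is a small arithmetic sketch relating a single call duration to the per-frame time budget. It assumes the logged values are raw `clock()` ticks (on POSIX systems `CLOCKS_PER_SEC` is 1000000, so one tick is one microsecond). The function names are hypothetical.

```cpp
#include <ctime>

// Convert a clock() tick count to milliseconds.
double TicksToMs( clock_t ticks ) {
    return 1000.0 * static_cast<double>( ticks ) / CLOCKS_PER_SEC;
}

// Time available to render one frame at a given frame rate, in milliseconds.
double FrameBudgetMs( double fps ) {
    return 1000.0 / fps;
}

// Fraction of the frame budget consumed by one call lasting callMs.
double BudgetShare( double callMs, double fps ) {
    return callMs / FrameBudgetMs( fps );
}
```

At 60fps the budget is about 16.7ms per frame, so a single 2ms call takes roughly 12% of it, and nine such calls alone would be enough to miss the target.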
Now I need help to identify what those slow-to-render things are… |
So I did this: diff --git a/src/engine/renderer/tr_shade.cpp b/src/engine/renderer/tr_shade.cpp
index c99eb8369..8b4a46f45 100644
--- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -793,6 +793,8 @@ void Render_generic3D( shaderStage_t *pStage )
GL_CheckErrors();
}
+#include <ctime>
+#include <cstdio>
void Render_generic( shaderStage_t *pStage )
{
if ( backEnd.projection2D )
@@ -801,7 +803,10 @@ void Render_generic( shaderStage_t *pStage )
return;
}
+ clock_t start = clock();
Render_generic3D( pStage );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, %s\n", diff, tess.surfaceShader->name);
}
/* And I got this:
|
I don't know why the
Spending 2ms on rendering this is very wrong. And for the glass01 that also looks stupidly slow, it's a bit more complex but not crazily complex:
Also, why one glass01 stage is fast, and one is slow? And since the lightmap is rendered by
It doesn't make sense at all that this stage can take almost 1.7ms to render! The simple fact that every time I interrupt I stop on a |
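Reading one log line per stage makes it hard to see which shaders cost the most over a whole frame. A possible way to summarize such a log is to accumulate the samples per shader name and sort them. This is only a sketch: `ShaderProfile` and `ShaderTiming` are hypothetical names, and the durations are assumed to be in microseconds.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Accumulated timing for one shader name (hypothetical structure).
struct ShaderTiming {
    long totalUs = 0; // sum of all samples
    long maxUs = 0;   // slowest single sample
    int calls = 0;    // number of samples
};

class ShaderProfile {
public:
    using Entry = std::pair<std::string, ShaderTiming>;

    // Record one timed stage for the given shader name.
    void Add( const std::string& name, long us ) {
        ShaderTiming& t = timings_[ name ];
        t.totalUs += us;
        t.maxUs = std::max( t.maxUs, us );
        t.calls++;
    }

    // Return all shaders, the most expensive in total first.
    std::vector<Entry> SortedByTotal() const {
        std::vector<Entry> out( timings_.begin(), timings_.end() );
        std::sort( out.begin(), out.end(), []( const Entry& a, const Entry& b ) {
            return a.second.totalUs > b.second.totalUs;
        } );
        return out;
    }

private:
    std::map<std::string, ShaderTiming> timings_;
};
```

Calling `Add( tess.surfaceShader->name, diff )` instead of `printf` and dumping `SortedByTotal()` once per frame would show both the total cost and the worst single call for each shader.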
The arguments in favor of the bug being in our code or at least triggered by our code:
This doesn't exclude the possibility that the driver has a nasty bug that can be triggered by doing something as right as The fact The argument in favor of the bug being in driver code:
For information why the There is no reason |
So I managed to record a run with apitrace, and painfully iterated over the calls until I visually recognized the
So I now have some preprocessed GLSL code for both cases. Then I edited the spacing a bit in the vp GLSL shaders to reduce the diff noise. I have not modified the fp GLSL shaders because they are too different and no spacing change can help. Here is preprocessed float smoothstep(float edge0, float edge1, float x) { float t = clamp((x - edge0) / (edge1 - edge0), 0.0, 1.0); return t * t * (3.0 - 2.0 * t); }
const int MAX_GLSL_BONES = 41;
uniform sampler2D u_ColorMap;
uniform float u_AlphaThreshold;
uniform float u_InverseLightFactor;
varying vec2 var_TexCoords;
varying vec4 var_Color;
void main()
{
vec4 color = texture2D(u_ColorMap, var_TexCoords);
color *= var_Color;
color.rgb *= u_InverseLightFactor;
gl_FragColor = color;
} Here is preprocessed float smoothstep(float edge0, float edge1, float x) { float t = clamp((x - edge0) / (edge1 - edge0), 0.0, 1.0); return t * t * (3.0 - 2.0 * t); }
const int MAX_GLSL_BONES = 41;
uniform sampler2D u_Lights;
const struct GetLightOffsets {
int center_radius;
int color_type;
int direction_angle;
} getLightOffsets = GetLightOffsets(0, 1, 2);
uniform int u_numLights;
uniform vec2 u_SpecularExponent;
void ReadLightGrid(in vec4 texel, out vec3 ambientColor, out vec3 lightColor) {
float ambientScale = 2.0 * texel.a;
float directedScale = 2.0 - ambientScale;
ambientColor = ambientScale * texel.rgb;
lightColor = directedScale * texel.rgb;
}
void computeLight(in vec3 lightColor, vec4 diffuseColor, inout vec4 color) {
color.rgb += lightColor.rgb * diffuseColor.rgb;
}
void computeDeluxeLight( vec3 lightDir, vec3 normal, vec3 viewDir, vec3 lightColor,
vec4 diffuseColor, vec4 materialColor,
inout vec4 color ) {
vec3 H = normalize( lightDir + viewDir );
float NdotL = dot( normal, lightDir );
NdotL = clamp( NdotL, 0.0, 1.0 );
color.rgb += lightColor.rgb * NdotL * diffuseColor.rgb;
}
const int lightsPerLayer = 4;
uniform sampler3D u_LightTiles;
vec4 fetchIdxs( in vec3 coords ) {
return texture3D( u_LightTiles, coords ) * 255.0;
}
int nextIdx( inout vec4 idxs ) {
vec4 tmp = idxs;
idxs = floor(idxs * 0.25);
tmp -= 4.0 * idxs;
return int( dot( tmp, vec4( 64.0, 16.0, 4.0, 1.0 ) ) );
}
const int numLayers = 1024 / 256;
vec3 NormalInTangentSpace(vec2 texNormal)
{
vec3 normal;
normal = vec3(0.0, 0.0, 1.0);
return normal;
}
vec3 NormalInWorldSpace(vec2 texNormal, mat3 tangentToWorldMatrix)
{
vec3 normal = NormalInTangentSpace(texNormal);
return normalize(tangentToWorldMatrix * normal);
}
uniform sampler2D u_DiffuseMap;
uniform sampler2D u_MaterialMap;
uniform sampler2D u_GlowMap;
uniform float u_AlphaThreshold;
uniform float u_InverseLightFactor;
uniform vec3 u_ViewOrigin;
varying vec3 var_Position;
varying vec2 var_TexCoords;
varying vec4 var_Color;
varying vec3 var_Tangent;
varying vec3 var_Binormal;
varying vec3 var_Normal;
uniform sampler2D u_LightMap;
varying vec2 var_TexLight;
void main()
{
vec3 viewDir = normalize(u_ViewOrigin - var_Position);
vec2 texCoords = var_TexCoords;
mat3 tangentToWorldMatrix = mat3(var_Tangent.xyz, var_Binormal.xyz, var_Normal.xyz);
vec4 diffuse = texture2D(u_DiffuseMap, texCoords);
diffuse *= var_Color;
if(abs(diffuse.a + u_AlphaThreshold) <= 1.0)
{
discard;
return;
}
vec3 normal = NormalInWorldSpace(texCoords, tangentToWorldMatrix);
vec4 material = texture2D(u_MaterialMap, texCoords);
vec4 color;
color.a = diffuse.a;
vec3 lightColor = texture2D(u_LightMap, var_TexLight).rgb;
color.rgb = vec3(0.0);
computeLight(lightColor, diffuse, color);
if ( u_InverseLightFactor > 0 )
{
color.rgb *= u_InverseLightFactor;
}
vec3 glow = texture2D(u_GlowMap, texCoords).rgb;
if ( u_InverseLightFactor < 0 )
{
glow *= - u_InverseLightFactor;
}
color.rgb += glow;
gl_FragColor = color;
} Here is preprocessed float waveSin(float x) {
return sin( radians( 360.0 * x ) );
}
float waveSquare(float x) {
return sign( waveSin( x ) );
}
float waveTriangle(float x)
{
return 1.0 - abs( 4.0 * fract( x + 0.25 ) - 2.0 );
}
float waveSawtooth(float x)
{
return fract( x );
}
void DeformVertex( inout vec4 pos,
inout vec3 normal,
inout vec2 st,
inout vec4 color,
in float time)
{
vec4 work = vec4(0.0);
}
float smoothstep(float edge0, float edge1, float x) { float t = clamp((x - edge0) / (edge1 - edge0), 0.0, 1.0); return t * t * (3.0 - 2.0 * t); }
const int MAX_GLSL_BONES = 41;
struct localBasis {
vec3 normal;
vec3 tangent, binormal;
};
vec3 QuatTransVec(in vec4 quat, in vec3 vec) {
vec3 tmp = 2.0 * cross( quat.xyz, vec );
return vec + quat.w * tmp + cross( quat.xyz, tmp );
}
void QTangentToLocalBasis( in vec4 qtangent, out localBasis LB ) {
LB.normal = QuatTransVec( qtangent, vec3( 0.0, 0.0, 1.0 ) );
LB.tangent = QuatTransVec( qtangent, vec3( 1.0, 0.0, 0.0 ) );
LB.tangent *= sign( qtangent.w );
LB.binormal = QuatTransVec( qtangent, vec3( 0.0, 1.0, 0.0 ) );
}
attribute vec3 attr_Position;
attribute vec4 attr_Color;
attribute vec4 attr_QTangent;
attribute vec4 attr_TexCoord0;
void VertexFetch(out vec4 position,
out localBasis normalBasis,
out vec4 color,
out vec2 texCoord,
out vec2 lmCoord)
{
position = vec4( attr_Position, 1.0 );
QTangentToLocalBasis( attr_QTangent, normalBasis );
color = attr_Color;
texCoord = attr_TexCoord0.xy;
lmCoord = attr_TexCoord0.zw;
}
uniform mat4 u_TextureMatrix;
uniform vec3 u_ViewOrigin;
uniform mat4 u_ModelViewProjectionMatrix;
uniform float u_Time;
uniform vec4 u_ColorModulate;
uniform vec4 u_Color;
varying vec2 var_TexCoords;
varying vec4 var_Color;
void DeformVertex(inout vec4 pos, inout vec3 normal, inout vec2 st, inout vec4 color, in float time);
void main()
{
localBasis LB;
vec4 position, color;
vec2 texCoord, lmCoord;
VertexFetch(position, LB, color, texCoord, lmCoord);
color = color * u_ColorModulate + u_Color;
DeformVertex(position, LB.normal, texCoord, color, u_Time);
gl_Position = u_ModelViewProjectionMatrix * position;
var_TexCoords = (u_TextureMatrix * vec4(texCoord, 0.0, 1.0)).xy;
var_Color = color;
} Here is preprocessed float waveSin(float x) {
return sin( radians( 360.0 * x ) );
}
float waveSquare(float x) {
return sign( waveSin( x ) );
}
float waveTriangle(float x)
{
return 1.0 - abs( 4.0 * fract( x + 0.25 ) - 2.0 );
}
float waveSawtooth(float x)
{
return fract( x );
}
void DeformVertex( inout vec4 pos,
inout vec3 normal,
inout vec2 st,
inout vec4 color,
in float time)
{
vec4 work = vec4(0.0);
}
float smoothstep(float edge0, float edge1, float x) { float t = clamp((x - edge0) / (edge1 - edge0), 0.0, 1.0); return t * t * (3.0 - 2.0 * t); }
const int MAX_GLSL_BONES = 41;
struct localBasis {
vec3 normal;
vec3 tangent, binormal;
};
vec3 QuatTransVec(in vec4 quat, in vec3 vec) {
vec3 tmp = 2.0 * cross( quat.xyz, vec );
return vec + quat.w * tmp + cross( quat.xyz, tmp );
}
void QTangentToLocalBasis( in vec4 qtangent, out localBasis LB ) {
LB.normal = QuatTransVec( qtangent, vec3( 0.0, 0.0, 1.0 ) );
LB.tangent = QuatTransVec( qtangent, vec3( 1.0, 0.0, 0.0 ) );
LB.tangent *= sign( qtangent.w );
LB.binormal = QuatTransVec( qtangent, vec3( 0.0, 1.0, 0.0 ) );
}
attribute vec3 attr_Position;
attribute vec4 attr_Color;
attribute vec4 attr_QTangent;
attribute vec4 attr_TexCoord0;
void VertexFetch(out vec4 position,
out localBasis normalBasis,
out vec4 color,
out vec2 texCoord,
out vec2 lmCoord)
{
position = vec4( attr_Position, 1.0 );
QTangentToLocalBasis( attr_QTangent, normalBasis );
color = attr_Color;
texCoord = attr_TexCoord0.xy;
lmCoord = attr_TexCoord0.zw;
}
uniform mat4 u_TextureMatrix;
uniform mat4 u_ModelViewProjectionMatrix;
uniform float u_Time;
uniform vec4 u_ColorModulate;
uniform vec4 u_Color;
varying vec3 var_Position;
varying vec2 var_TexCoords;
varying vec2 var_TexLight;
varying vec3 var_Tangent;
varying vec3 var_Binormal;
varying vec3 var_Normal;
varying vec4 var_Color;
void DeformVertex(inout vec4 pos, inout vec3 normal, inout vec2 st, inout vec4 color, in float time);
void main()
{
localBasis LB;
vec4 position, color;
vec2 texCoord, lmCoord;
VertexFetch(position, LB, color, texCoord, lmCoord);
color = color * u_ColorModulate + u_Color;
DeformVertex(position, LB.normal, texCoord, color, u_Time);
gl_Position = u_ModelViewProjectionMatrix * position;
var_Position = position.xyz;
var_Tangent = LB.tangent;
var_Binormal = LB.binormal;
var_Normal = LB.normal;
var_TexLight = lmCoord;
var_TexCoords = (u_TextureMatrix * vec4(texCoord, 0.0, 1.0)).xy;
var_Color = color;
} |
So the diffs: --- pp.generic3D_fp.glsl
+++ pp.lightMapping_fp.glsl
@@ -1,14 +1,95 @@
float smoothstep(float edge0, float edge1, float x) { float t = clamp((x - edge0) / (edge1 - edge0), 0.0, 1.0); return t * t * (3.0 - 2.0 * t); }
const int MAX_GLSL_BONES = 41;
-uniform sampler2D u_ColorMap;
+uniform sampler2D u_Lights;
+const struct GetLightOffsets {
+ int center_radius;
+ int color_type;
+ int direction_angle;
+} getLightOffsets = GetLightOffsets(0, 1, 2);
+uniform int u_numLights;
+uniform vec2 u_SpecularExponent;
+void ReadLightGrid(in vec4 texel, out vec3 ambientColor, out vec3 lightColor) {
+ float ambientScale = 2.0 * texel.a;
+ float directedScale = 2.0 - ambientScale;
+ ambientColor = ambientScale * texel.rgb;
+ lightColor = directedScale * texel.rgb;
+}
+void computeLight(in vec3 lightColor, vec4 diffuseColor, inout vec4 color) {
+ color.rgb += lightColor.rgb * diffuseColor.rgb;
+}
+void computeDeluxeLight( vec3 lightDir, vec3 normal, vec3 viewDir, vec3 lightColor,
+ vec4 diffuseColor, vec4 materialColor,
+ inout vec4 color ) {
+ vec3 H = normalize( lightDir + viewDir );
+ float NdotL = dot( normal, lightDir );
+ NdotL = clamp( NdotL, 0.0, 1.0 );
+ color.rgb += lightColor.rgb * NdotL * diffuseColor.rgb;
+}
+const int lightsPerLayer = 4;
+uniform sampler3D u_LightTiles;
+vec4 fetchIdxs( in vec3 coords ) {
+ return texture3D( u_LightTiles, coords ) * 255.0;
+}
+int nextIdx( inout vec4 idxs ) {
+ vec4 tmp = idxs;
+ idxs = floor(idxs * 0.25);
+ tmp -= 4.0 * idxs;
+ return int( dot( tmp, vec4( 64.0, 16.0, 4.0, 1.0 ) ) );
+}
+const int numLayers = 1024 / 256;
+vec3 NormalInTangentSpace(vec2 texNormal)
+{
+ vec3 normal;
+ normal = vec3(0.0, 0.0, 1.0);
+ return normal;
+}
+vec3 NormalInWorldSpace(vec2 texNormal, mat3 tangentToWorldMatrix)
+{
+ vec3 normal = NormalInTangentSpace(texNormal);
+ return normalize(tangentToWorldMatrix * normal);
+}
+uniform sampler2D u_DiffuseMap;
+uniform sampler2D u_MaterialMap;
+uniform sampler2D u_GlowMap;
uniform float u_AlphaThreshold;
uniform float u_InverseLightFactor;
+uniform vec3 u_ViewOrigin;
+varying vec3 var_Position;
varying vec2 var_TexCoords;
varying vec4 var_Color;
+varying vec3 var_Tangent;
+varying vec3 var_Binormal;
+varying vec3 var_Normal;
+ uniform sampler2D u_LightMap;
+ varying vec2 var_TexLight;
void main()
{
- vec4 color = texture2D(u_ColorMap, var_TexCoords);
- color *= var_Color;
- color.rgb *= u_InverseLightFactor;
+ vec3 viewDir = normalize(u_ViewOrigin - var_Position);
+ vec2 texCoords = var_TexCoords;
+ mat3 tangentToWorldMatrix = mat3(var_Tangent.xyz, var_Binormal.xyz, var_Normal.xyz);
+ vec4 diffuse = texture2D(u_DiffuseMap, texCoords);
+ diffuse *= var_Color;
+ if(abs(diffuse.a + u_AlphaThreshold) <= 1.0)
+ {
+ discard;
+ return;
+ }
+ vec3 normal = NormalInWorldSpace(texCoords, tangentToWorldMatrix);
+ vec4 material = texture2D(u_MaterialMap, texCoords);
+ vec4 color;
+ color.a = diffuse.a;
+ vec3 lightColor = texture2D(u_LightMap, var_TexLight).rgb;
+ color.rgb = vec3(0.0);
+ computeLight(lightColor, diffuse, color);
+ if ( u_InverseLightFactor > 0 )
+ {
+ color.rgb *= u_InverseLightFactor;
+ }
+ vec3 glow = texture2D(u_GlowMap, texCoords).rgb;
+ if ( u_InverseLightFactor < 0 )
+ {
+ glow *= - u_InverseLightFactor;
+ }
+ color.rgb += glow;
gl_FragColor = color;
} --- pps.generic3D_vp.glsl
+++ pps.lightMapping_vp.glsl
@@ -1,75 +1,84 @@
float waveSin(float x) {
return sin( radians( 360.0 * x ) );
}
float waveSquare(float x) {
return sign( waveSin( x ) );
}
float waveTriangle(float x)
{
return 1.0 - abs( 4.0 * fract( x + 0.25 ) - 2.0 );
}
float waveSawtooth(float x)
{
return fract( x );
}
void DeformVertex( inout vec4 pos,
inout vec3 normal,
inout vec2 st,
inout vec4 color,
in float time)
{
vec4 work = vec4(0.0);
}
float smoothstep(float edge0, float edge1, float x) { float t = clamp((x - edge0) / (edge1 - edge0), 0.0, 1.0); return t * t * (3.0 - 2.0 * t); }
const int MAX_GLSL_BONES = 41;
struct localBasis {
vec3 normal;
vec3 tangent, binormal;
};
vec3 QuatTransVec(in vec4 quat, in vec3 vec) {
vec3 tmp = 2.0 * cross( quat.xyz, vec );
return vec + quat.w * tmp + cross( quat.xyz, tmp );
}
void QTangentToLocalBasis( in vec4 qtangent, out localBasis LB ) {
LB.normal = QuatTransVec( qtangent, vec3( 0.0, 0.0, 1.0 ) );
LB.tangent = QuatTransVec( qtangent, vec3( 1.0, 0.0, 0.0 ) );
LB.tangent *= sign( qtangent.w );
LB.binormal = QuatTransVec( qtangent, vec3( 0.0, 1.0, 0.0 ) );
}
attribute vec3 attr_Position;
attribute vec4 attr_Color;
attribute vec4 attr_QTangent;
attribute vec4 attr_TexCoord0;
void VertexFetch(out vec4 position,
out localBasis normalBasis,
out vec4 color,
out vec2 texCoord,
out vec2 lmCoord)
{
position = vec4( attr_Position, 1.0 );
QTangentToLocalBasis( attr_QTangent, normalBasis );
color = attr_Color;
texCoord = attr_TexCoord0.xy;
lmCoord = attr_TexCoord0.zw;
}
uniform mat4 u_TextureMatrix;
-uniform vec3 u_ViewOrigin;
uniform mat4 u_ModelViewProjectionMatrix;
uniform float u_Time;
uniform vec4 u_ColorModulate;
uniform vec4 u_Color;
+varying vec3 var_Position;
varying vec2 var_TexCoords;
+varying vec2 var_TexLight;
+varying vec3 var_Tangent;
+varying vec3 var_Binormal;
+varying vec3 var_Normal;
varying vec4 var_Color;
void DeformVertex(inout vec4 pos, inout vec3 normal, inout vec2 st, inout vec4 color, in float time);
void main()
{
localBasis LB;
vec4 position, color;
vec2 texCoord, lmCoord;
VertexFetch(position, LB, color, texCoord, lmCoord);
color = color * u_ColorModulate + u_Color;
DeformVertex(position, LB.normal, texCoord, color, u_Time);
gl_Position = u_ModelViewProjectionMatrix * position;
+ var_Position = position.xyz;
+ var_Tangent = LB.tangent;
+ var_Binormal = LB.binormal;
+ var_Normal = LB.normal;
+ var_TexLight = lmCoord;
var_TexCoords = (u_TextureMatrix * vec4(texCoord, 0.0, 1.0)).xy;
var_Color = color;
} So it looks like there is no real difference between the If something is slower in |
This doesn't make sense:
The single first line means only 10fps is left after only rendering |
I wondered if the issue came from the OpenGL driver switching GLSL permutations, so I did that: // choose right shader program ----------------------------------
gl_genericShader->SetVertexSkinning( 0 ); //glConfig2.vboVertexSkinningAvailable && tess.vboVertexSkinning );
gl_genericShader->SetVertexAnimation( 0 ); //tess.vboVertexAnimation );
gl_genericShader->SetTCGenEnvironment( 0 ); //pStage->tcGen_Environment );
gl_genericShader->SetTCGenLightmap( 0 ); //pStage->tcGen_Lightmap );
gl_genericShader->SetDepthFade( 0 ); //hasDepthFade );
gl_genericShader->SetVertexSprite( 0 ); //tess.vboVertexSprite );
gl_genericShader->BindProgram( 0 ); //pStage->deformIndex );
// end choose right shader program ------------------------------ It doesn't bring any performance bump on rendering the default spectator Vega scene ( |
The following test was running the previously quoted change to avoid switching between shader permutations. Plat23 bird view,
Vega default spectator scene,
|
Forget about that… reverting that hack doesn't slow down the Plat23 map. I now get 60 fps on the default Plat23 spectator scene and 33 fps on both Human and Alien base entry scenes… This huge performance boost on the default Plat23 scene likely comes from the number of improvements recently merged or to-be-merged that now live in the development branch I rebase my current work onto. So, it looks like we are now reaching on Plat23 the performance of the ioq3 opengl2 renderer, so that Vega default spectator slowdown is really something wrong. |
Default Yocto spectator scene (
|
With: diff --git a/src/engine/renderer/tr_backend.cpp b/src/engine/renderer/tr_backend.cpp
index 4c43547d5..5823a8462 100644
--- a/src/engine/renderer/tr_backend.cpp
+++ b/src/engine/renderer/tr_backend.cpp
@@ -4789,6 +4789,7 @@ static void RB_RenderView( bool depthPass )
}
if( depthPass ) {
+ glFinish();
RB_RenderDrawSurfaces( shaderSort_t::SS_DEPTH, shaderSort_t::SS_DEPTH, DRAWSURFACES_ALL );
RB_RunVisTests();
RB_RenderPostDepthLightTile();
diff --git a/src/engine/renderer/tr_shade.cpp b/src/engine/renderer/tr_shade.cpp
index dc3bf44a9..7cec1886a 100644
--- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -327,7 +327,10 @@ void Tess_DrawElements()
{
if ( tess.multiDrawPrimitives )
{
+ clock_t start = clock();
glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, %s\n", diff, tess.surfaceShader->name );
backEnd.pc.c_multiDrawElements++;
backEnd.pc.c_multiDrawPrimitives += tess.multiDrawPrimitives; I get:
And breaking the execution still breaks in
|
For comparison, this is the timing log of the Plat23 whole map birdview:
|
I did that: --- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -327,7 +327,15 @@ void Tess_DrawElements()
{
if ( tess.multiDrawPrimitives )
{
- glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ // glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+
+ for ( i = 0; i < tess.multiDrawPrimitives; i++ )
+ {
+ clock_t start = clock();
+ glDrawElements( GL_TRIANGLES, tess.multiDrawCounts[ i ], GL_INDEX_TYPE, tess.multiDrawIndexes[ i ] );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, %s, %d\n", diff, tess.surfaceShader->name, i );
+ }
backEnd.pc.c_multiDrawElements++;
backEnd.pc.c_multiDrawPrimitives += tess.multiDrawPrimitives; And now the log is:
|
With this: --- a/src/engine/renderer/tr_shade.cpp
+++ b/src/engine/renderer/tr_shade.cpp
@@ -327,7 +327,15 @@ void Tess_DrawElements()
{
if ( tess.multiDrawPrimitives )
{
- glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+ // glMultiDrawElements( GL_TRIANGLES, tess.multiDrawCounts, GL_INDEX_TYPE, ( const GLvoid ** ) tess.multiDrawIndexes, tess.multiDrawPrimitives );
+
+ for ( i = 0; i < tess.multiDrawPrimitives; i++ )
+ {
+ clock_t start = clock();
+ glDrawElements( GL_TRIANGLES, tess.multiDrawCounts[ i ], GL_INDEX_TYPE, tess.multiDrawIndexes[ i ] );
+ clock_t diff = clock() - start;
+ printf("=== time: %ld, %s#%d, drawCounts: %d\n", diff, tess.surfaceShader->name, i, tess.multiDrawCounts[ i ] );
+ }
backEnd.pc.c_multiDrawElements++;
backEnd.pc.c_multiDrawPrimitives += tess.multiDrawPrimitives; I get that:
The time is not even related to draw counts:
|
As a reminder, this is the call trace:
From /* Translate vertices with non-native layouts or formats. */
if (unroll_indices ||
incompatible_vb_mask ||
mgr->ve->incompatible_elem_mask) {
if (!u_vbuf_translate_begin(mgr, &new_info, &new_draw,
start_vertex, num_vertices,
min_index, unroll_indices, misaligned)) { |
I did this: --- a/src/engine/renderer/tr_backend.cpp
+++ b/src/engine/renderer/tr_backend.cpp
@@ -666,6 +666,7 @@ void GL_VertexAttribsState( uint32_t stateBits )
glState.vertexAttribsState = stateBits;
}
+#include <cstdio>
void GL_VertexAttribPointers( uint32_t attribBits )
{
uint32_t i;
@@ -720,6 +721,7 @@ void GL_VertexAttribPointers( uint32_t attribBits )
frame = glState.vertexAttribsOldFrame;
}
+ printf("=== layout numComponents: %d, componentType: %#x, normalize: %d, stride: %d\n", layout->numComponents, layout->componentType, layout->normalize, layout->stride );
glVertexAttribPointer( i, layout->numComponents, layout->componentType, layout->normalize, layout->stride, BUFFER_OFFSET( layout->ofs + ( frame * layout->frameOffset + base ) ) );
glState.vertexAttribPointersSet |= bit;
}
@@ -4789,6 +4791,7 @@ static void RB_RenderView( bool depthPass )
} I got this:
And
Everything looks good. Well, I don't know why there are some |
Well, maybe the thing is the usage of With With |
The single change that recovers the performance (while breaking the texture rendering) is to do: --- a/src/engine/renderer/tr_vbo.cpp
+++ b/src/engine/renderer/tr_vbo.cpp
@@ -220,7 +220,7 @@ static void R_SetAttributeLayoutsStatic( VBO_t *vbo )
vbo->attribs[ ATTR_INDEX_QTANGENT ].frameOffset = 0;
vbo->attribs[ ATTR_INDEX_TEXCOORD ].numComponents = 4;
- vbo->attribs[ ATTR_INDEX_TEXCOORD ].componentType = GL_HALF_FLOAT;
+ vbo->attribs[ ATTR_INDEX_TEXCOORD ].componentType = GL_FLOAT;
vbo->attribs[ ATTR_INDEX_TEXCOORD ].normalize = GL_FALSE;
vbo->attribs[ ATTR_INDEX_TEXCOORD ].ofs = offsetof( shaderVertex_t, texCoords );
vbo->attribs[ ATTR_INDEX_TEXCOORD ].realStride = sizeShaderVertex; |
Two years ago I started a work-in-progress branch to support graphics cards not having support for But preparatory work for it was already merged:
I revived this branch and worked on it a bit more to make it work (it is still in a very rough state), as I wanted to check whether using it on an ATI r300 would improve performance: Et voilà! 🎉️ I now get 55fps on the default Vega spectator scene: And 53fps on the default Thunder spectator scene: And I guess I fixed the support for those old GL 2 Intel cards in the process (I may be able to verify it with real hardware in the coming months). 😎️ I assume the r300 driver just emulates half-float vertices, as it is said to be a hardware limitation not to support |
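For readers unfamiliar with what such an emulation has to do for every half-float component, here is a sketch of a 16-bit half to 32-bit float conversion following the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits). `HalfToFloat` is a hypothetical name; this is not the code used by the driver or by the branch.

```cpp
#include <cstdint>
#include <cstring>

// Convert an IEEE 754 binary16 value (given as raw bits) to a 32-bit float.
float HalfToFloat( uint16_t h ) {
    const uint32_t sign = static_cast<uint32_t>( h & 0x8000u ) << 16;
    const uint32_t exponent = ( h >> 10 ) & 0x1Fu;
    uint32_t mantissa = h & 0x3FFu;
    uint32_t bits;

    if ( exponent == 0 ) {
        if ( mantissa == 0 ) {
            bits = sign; // signed zero
        } else {
            // Subnormal half: shift until the implicit leading bit appears.
            uint32_t shift = 0;
            while ( ( mantissa & 0x400u ) == 0 ) {
                mantissa <<= 1;
                shift++;
            }
            mantissa &= 0x3FFu;
            bits = sign | ( ( 127u - 15u + 1u - shift ) << 23 ) | ( mantissa << 13 );
        }
    } else if ( exponent == 0x1Fu ) {
        bits = sign | 0x7F800000u | ( mantissa << 13 ); // infinity or NaN
    } else {
        // Normal number: rebias the exponent from 15 to 127.
        bits = sign | ( ( exponent + 127u - 15u ) << 23 ) | ( mantissa << 13 );
    }

    float result;
    std::memcpy( &result, &bits, sizeof result );
    return result;
}
```

Every texture coordinate of every vertex needs this kind of bit manipulation when the conversion is done in software, which is consistent with the cost growing with the amount of geometry in view.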
Is there some necessary reason why the driver has to implement this in the dumbest possible way? I.e. re-translating the attributes on every frame instead of when the VBO is created or used for the first time? Maybe we could implement attribute reformatting at the last minute like the driver does, instead of modifying all our structs. It appears we have all the struct layouts stored in a tabular format ( |
Because it doesn't know if the data in it has changed, probably.
I don't really get this part, since it wouldn't solve the problem? |
If we know that some data stays the same but the driver doesn't, then we would only have to translate it once, which should fix the performance. |
Ah, yea, that would work, I misunderstood "on the fly" as "per frame". |
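The "translate once" idea discussed above can be sketched as a buffer that keeps a converted copy and only rebuilds it when the application says the source changed. Everything here is hypothetical (`TranslatedStream`, its methods, the converter signature); it only illustrates the caching, not how the engine stores its VBO data.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical wrapper around an attribute stream stored as 16-bit values.
class TranslatedStream {
public:
    // Per-element conversion function, for example half-float to float.
    using Converter = float ( * )( uint16_t );

    TranslatedStream( std::vector<uint16_t> source, Converter convert )
        : source_( std::move( source ) ), convert_( convert ) {}

    // Called only when the source data really changes.
    void Update( std::vector<uint16_t> source ) {
        source_ = std::move( source );
        dirty_ = true;
    }

    // Returns the float copy, translating only if the source changed.
    const std::vector<float>& Get() {
        if ( dirty_ ) {
            translated_.clear();
            translated_.reserve( source_.size() );
            for ( uint16_t value : source_ ) {
                translated_.push_back( convert_( value ) );
            }
            dirty_ = false;
            ++translationCount_;
        }
        return translated_;
    }

    // Number of times the data was actually translated.
    int TranslationCount() const { return translationCount_; }

private:
    std::vector<uint16_t> source_;
    Converter convert_;
    std::vector<float> translated_;
    bool dirty_ = true;
    int translationCount_ = 0;
};
```

For static world geometry `Update` would never be called after load, so `Get` could be called every frame while the translation runs only once. The driver cannot make that assumption on its own, which may be why it translates on every draw.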
I will probably report the issue on the Mesa side at some point. Something looks weird indeed. Maybe they can do better, and providing them our game as a testbed may help.
This is currently a very naïve implementation and some conversions may be useless. It may also reveal that some data models are badly shaped. Before I touched anything there were already multiple places where half-floats were converted to float for processing and later re-converted to half-float. We may decide to keep all of this as float until the final storage step, which would not only avoid duplicating some code to support both float and half-float, but would also make the original half-float code faster. |
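If everything stays as float until the final storage step, the only conversion left is a float to half-float pack when writing to a buffer for hardware that supports it. Here is a sketch of such a pack with round-to-nearest-even; `FloatToHalf` is a hypothetical name, and as a stated simplification values too small for a normal half are flushed to zero instead of producing subnormals.

```cpp
#include <cstdint>
#include <cstring>

// Convert a 32-bit float to IEEE 754 binary16 bits, rounding to nearest even.
// Simplification: results below the smallest normal half become signed zero.
uint16_t FloatToHalf( float f ) {
    uint32_t x;
    std::memcpy( &x, &f, sizeof x );

    const uint32_t sign = ( x >> 16 ) & 0x8000u;
    const uint32_t absx = x & 0x7FFFFFFFu;

    if ( absx > 0x7F800000u ) {
        return static_cast<uint16_t>( sign | 0x7E00u ); // NaN
    }

    // Rebias the exponent from 127 (float) to 15 (half).
    const int exponent = static_cast<int>( absx >> 23 ) - 127 + 15;

    if ( exponent >= 31 ) {
        return static_cast<uint16_t>( sign | 0x7C00u ); // overflow or infinity
    }
    if ( exponent <= 0 ) {
        return static_cast<uint16_t>( sign ); // flushed to zero
    }

    const uint32_t mantissa = absx & 0x007FFFFFu;
    uint32_t half = ( static_cast<uint32_t>( exponent ) << 10 ) | ( mantissa >> 13 );

    // Round to nearest, ties to even, using the 13 discarded bits.
    // A carry out of the mantissa correctly bumps the exponent.
    const uint32_t rest = mantissa & 0x1FFFu;
    if ( rest > 0x1000u || ( rest == 0x1000u && ( half & 1u ) != 0 ) ) {
        ++half;
    }

    return static_cast<uint16_t>( sign | half );
}
```

With this kind of function called only at the last storage step, the processing code could work on plain `float` everywhere, and the pack could simply be skipped on hardware where half-float vertex attributes are slow or unsupported.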
I investigated my question a bit and found one possibility... In case your R300 does not have the |
@slipher Running |
Those hints are very unreliable, the driver is free to just completely ignore them. |
I'm trying to investigate a huge performance drop affecting some low-end cards.
I currently suspect:
Sky performance cost #849 (comment)
Here is the Thunder map, lowest preset, 640×480 resolution, default spectator scene (1024 -3456 448 -90 0), 3fps:
Here is the Vega map, lowest preset, 640×480 resolution, default spectator scene (1472 -1792 176 -180 0), 5fps:
This configuration is expected to achieve 60fps (initially estimated at 20~30fps).
The testbench is a very low-end config, one of my favorites for identifying bottlenecks:
In the Thunder map, I suspect the bottleneck comes from rendering something in that room and/or the outside. It can be the electric arc effect on those pillars, and/or the lightning effect on top of the outside of the building, as this room has huge windows onto the outside. Leaving that room and losing sight of the outside brings back performance; leaving that room without losing sight of the outside keeps the performance low.
In the Vega map, I suspect the bottleneck comes from rendering that globe or something around that globe. From the 4 sides of the room, I recover performance by turning my back to that globe.
I identified places where just turning the view around enables/disables the performance loss in both maps.
Thunder, -30 -1497 10 -20 0:
Thunder, -30 -1497 10 112 0:
Vega, 667 -1792 176 0 0:
Vega, 667 -1792 176 -180 0:
As you see, just turning the head around without moving from where you are enables/disables the cause of the bottleneck.