Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dangling OmniLight happening randomly #41360

Closed
Zylann opened this issue Aug 18, 2020 · 16 comments
Closed

Dangling OmniLight happening randomly #41360

Zylann opened this issue Aug 18, 2020 · 16 comments

Comments

@Zylann
Copy link
Contributor

Zylann commented Aug 18, 2020

Godot 3.2.2, GLES3, Windows 10 64 bits

Continuing from Zylann/voxelgame#71

I was hesitant to post this one, but after investigating I think it's a bug in the renderer.

I have a game project in which I create explosions, each with an OmniLight inside. They get freed after 3 seconds, after lighting the terrain and a bunch of debris I spawn around.
After spawning several explosions, the following error happens, and starts spamming:

ERROR: RID_Owner<struct RasterizerSceneGLES3::LightInstance>::getornull: Condition "!id_map.has(p_rid.get_data())" is true. Returned: 0

It happens randomly. Can be after 10 seconds of spawning explosions every so often, 1 minute, or 5 minutes, but consistently happens in a relatively short time, the same way.

After inspection, it seems Godot is pairing lights with mesh instances, but sometimes forgets to unpair. So dangling lights remain attached forever, causing error spams as long as I look at any affected mesh.

EDIT: simpler repro ==> #41360 (comment)
Unfortunately I don't have a simple repro, only my whole project, as I didnt manage to reproduce the problem in a simpler scene so far.
Steps:

  • Download this project https://github.com/Zylann/voxelgame
  • Install Voxel Tools module (name its folder voxel) and compile Godot with it (note, this module doesnt do anything with lights, I'm pretty confident it doesnt affect the bug)
  • Run res://blocky_game/blocky_game.tscn
  • Select the rocket launcher with mouse wheel, and start shooting stuff around. At some point, the error will occur.
@clayjohn
Copy link
Member

Could this be multithreading issue? Is anything happening in multiple threads, and/or is the renderer set to multithreaded?

@Zylann
Copy link
Contributor Author

Zylann commented Aug 18, 2020

Although threads are going on to manage voxels, the renderer is set to single/safe and is not used from any of those threads.

@Zylann
Copy link
Contributor Author

Zylann commented Aug 18, 2020

Just managed to reproduce it in a simpler project.
It's still a rather complicated repro and involves randomness (one of the many situations causes the bug, I just dont know which, so I shoot large), but at least it doesn't involve the voxel module:

OmnilightBug.zip

Launch main scene, start shooting around randomly using LMB. At some point the errors will start printing.
Reproduction rate is similar, it can take a few minutes of trying.

@lawnjelly
Copy link
Member

I've tried this both with the octree fix and the old octree, and haven't been able to reproduce in either.

I think given that it happens on @Zylann's machine, it would be best for him to download #41122 and see if this fixes the issue on his side.

@jocamar
Copy link

jocamar commented Sep 16, 2020

I have this same issue happening to me. I have a project with @Zylann 's terrain addon and I'm instantiating a bunch of bullets, each with an OmniLight, which disappear after a couple of seconds. After some time of firing the bullets I start to get a bunch of errors which brings my game to a crawl:

ERROR: RID_Owner<struct RasterizerSceneGLES3::LightInstance>::getornull: Condition "!id_map.has(p_rid.get_data())" is true. Returned: 0
   At: D:\godot-3.2.2-stable\core/rid.h:160

I tried applying the changes from the suggested fix (#41122) but the issue persists. I'm probably going to try to pool my bullets to fix this but it would be nice to get a fix.

@TokisanGames
Copy link
Contributor

Most likely caused by RID design, dangling RID references, and free() spam by SpatialEditor. Probably fixed by #54650

@Zylann
Copy link
Contributor Author

Zylann commented Nov 6, 2021

This happens in game, not in editor, so it cannot be caused by SpatialEditor. Maybe there is another rogue dangling RID elsewhere though.

@TokisanGames
Copy link
Contributor

OK. But it should be fixed by the PR which enforces self cleaning rids.

@Zylann
Copy link
Contributor Author

Zylann commented Nov 6, 2021

What happens to cleaning RIDs in a case like this?

RID a = something_valid;
RID b = a;
free(a);
// b?

@lawnjelly
Copy link
Member

lawnjelly commented Nov 6, 2021

What happens to cleaning RIDs in a case like this?

RID a = something_valid;
RID b = a;
free(a);
// b?

Yes I also noticed this in #54650 (comment) (marked as second thing)

@TokisanGames
Copy link
Contributor

What happens to cleaning RIDs in a case like this?

RID a = something_valid;
RID b = a;
free(a);
// b?

The same thing that happens to regular pointers, b is dangling.

Here, the object memory is free, a is reset to null, b is dangling. Previously both a and b were dangling, and bad things happen if you free dangling pointers. This change isn't to make a smart pointer system. It's to provide access be able to reset the pointer, and nullify the RID you sent to free.

If the 4.x system is smarter, I'm not opposed to back porting it. I haven't looked at it yet

@TokisanGames
Copy link
Contributor

I've found a few more dangling RIDs in the engine and fixed them in my PR.

Lawnjelly made an RID patch #54907 where you can build in RID handles w/ or w/o tracking to find leaks. However, it's not turned on by default. Maybe testing with this will help. Otherwise, all you can do is ensure you're clearing your own RIDs.

Also in release mode, the id_map red/black trees are compiled out, so it may be less likely to cause an issue with disappearing meshes due to dangling RIDs.

@akien-mga akien-mga added this to the 3.5 milestone Dec 9, 2021
@akien-mga
Copy link
Member

Would be good to test if this was solved by either @lawnjelly or @tinmanjuggernaut's fixes included in 3.4.1 and 3.5. If not, then indeed a custom build with RID handles tracking could help debug further.

@lawnjelly
Copy link
Member

lawnjelly commented Dec 10, 2021

I've just checked the OmniLight_Bug project above and it isn't fixed, however it is being caught by the rids=tracked_handles build. I'll see if I can work out what it causing it.

E 0:00:14.582   handle_get_or_null: RID get_or_null, revision is incorrect, possible dangling RID. RID id=614 [ rev 1 ], PE [ rev 4 ] visual_instance.cpp line 159 ( prev visual_instance.cpp line 159 )
  <C++ Error>   Condition "pe.revision != p_rid._revision" is true. Returned: nullptr
  <C++ Source>  core/rid_handle.cpp:244 @ handle_get_or_null()

This is the error source BTW.

Got it. (Or the problem source). I've found the problem and can fix it, but still tracking down where it is being first introduced, as there is a chain of events involved (for the best fix). Quite an evasive bug! 😀

@lawnjelly
Copy link
Member

lawnjelly commented Dec 10, 2021

This is a really odd bug, and seems very timing related (possibly a race condition but not sure yet).

There is a dangling RID, but this is a symptom at the end of a long chain of events, rather than the cause. The dangling RID is in VisualServerScene, line 2653.

			if (geom->lighting_dirty) {
				int l = 0;
				//only called when lights AABB enter/exit this geometry
				ins->light_instances.resize(geom->lighting.size());

				for (List<Instance *>::Element *E = geom->lighting.front(); E; E = E->next()) {
					InstanceLightData *light = static_cast<InstanceLightData *>(E->get()->base_data);

					ins->light_instances.write[l++] = light->instance;
				}

				geom->lighting_dirty = false;
			}

This code basically updates a list of lights affecting a piece of geometry when this lighting_dirty flag is set.

The problem is that this flag is not set when it should be. If we add an:

			else
			{
				DEV_ASSERT(ins->light_instances.size() == geom->lighting.size());
			}

This triggers just before the error happens - the dirty flag has got out of sync with the lights, and there is one extra dangling RID in the unupdated list ins->light_instances.

(We can incidentally fix the bug by making the condition:

if (geom->lighting_dirty || (ins->light_instances.size() != geom->lighting.size()))

but that is fixing the symptom rather than cause).

It turns out the dirty flag is only set / cleared in the pairing / unpairing callbacks from the BVH, which suggests maybe a problem there. However the problem also occurs with the octree (indeed this issue predates the BVH), and it seems less likely that both independently contain the same bug. So I'm now also suspecting there is misuse going on of the spatial partitioning.

Next I have been reducing the complexity of the reproduction project. It contains a lot of extra unnecessary stuff so I've reduced it quite a bit, and it now seems to go down without a few seconds of firing:
OmnilightBug_Cutdown3.zip

It turns out there are two things which seem to need to happen together to cause the conditions for it to occur. There is a timer in rocket_explosion.gd which seems to need to call queue_free(). This is presumably the cause of the light deletion.

This interacts with a random event in _process in omnilight_bug.gd. Specifically alter_mesh is called, which replaces one of the meshes on an instance (at random) with another random sphere mesh.

My best guess is that during the handover between one mesh and the other, something is not being updated which is causing the lights to get out of sync (perhaps an update missing to the BVH / octree?).

Quite a challenge! That's about all I can manage today. 😁

@akien-mga
Copy link
Member

Fixed by #55813.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants