Skip to content

Conversation

@andriyDev
Copy link
Contributor

@andriyDev andriyDev commented Dec 3, 2025

Objective

  • Fix the first flake related to Flaky asset processing test due to multiple processing tasks for same path. #22001.
  • We had a double lock problem. We would first lock asset_infos, then lock one of the assets within asset_infos and then dropped the asset_infos lock. This means however that if a process needs to access asset_infos (which is necessary for loading nested assets during processing), we'll have one thread trying to lock 1) asset_infos, 2) per-asset lock, and we'll have another thread trying to lock 1) per-asset lock (we've already dropped the asset_infos lock), 2) asset_infos. A classic deadlock!

Solution

  • Before locking the per-asset lock, we clone the Arc<RwLock> out of the asset_infos, drop the asset_infos, and only then lock the per-asset lock. This ensures that we never hang on the asset_infos lock when just trying to fetch a single asset.
  • Make all the access to asset_infos "short" - they get what they need out of asset_infos and then drop the lock as soon as possible.
  • I also used ? to cleanup some methods in ProcessorAssetInfos, where previously it was "remove the asset info, then have one big if let after it". Now we just return early if the value is none.
  • I also happened to fix a weird case where the new path in a rename wasn't guarded by a transaction lock. Now it is!

Testing

  • Running without this fix I get "Ran out of loops" from the only_reprocesses_wrong_hash_on_startup test quite quickly (after a minute or so). With this fix, I now only get the assertion failure problem. If I also skip that assertion, the test hasn't flaked for a while! Yay, no more deadlock!
@andriyDev andriyDev added C-Bug An unexpected or incorrect behavior A-Assets Load files from disk to use for things like images, models, and sounds D-Modest A "normal" level of difficulty; suitable for simple features or challenging fixes S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Dec 3, 2025
Copy link
Contributor

@kfc35 kfc35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to not make the two “operations" (the modifying of asset_infos and the actual removal/writing over of each asset) in handle_removed_asset and handle_renamed_asset atomic with respect to the locking? Now hypothetically both operations could wait after modifying asset_infos. I assume it’s fine but I just wanted to make sure.

Unfortunately for testing, my computer never came up with the deadlock on main, so I can neither confirm nor deny that the code fixes it in reality, but from my reading, it would fix it in theory.

I also have a clarifying question, as someone more unfamiliar with the asset code. You described that there is a situation where a thread (I presume when going through the function process_asset_internal?):

  1. locks the asset_infos
  2. locks the asset specific lock
  3. unlocks the asset_infos (these first three steps seem to correspond what’s happening in process_asset_internal)

Is the 4th step that the same thread eventually wants to lock asset_infos again, causing an inversion of the locking order with other threads (hence deadlock)? If so, where/how exactly does that happen?

Beyond that, I don’t see any mismatch between what you wrote in code and the way you described the solution in the description, so in that respect it looks good!


/// Remove the info for the given path. This should only happen if an asset's source is removed / non-existent
/// Remove the info for the given path. This should only happen if an asset's source is
/// removed/non-existent. Returns the transaction locks for the old and new assets respectively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// removed/non-existent. Returns the transaction locks for the old and new assets respectively.

If using my comment suggestion above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by this suggestion? It looks like just a delete? Regardless I clarified the return value. to match the other method.


Some((
info.file_transaction_lock,
new_info.file_transaction_lock.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old path’s lock is not a clone because theoretically there should be no other threads that want it since it has been removed, but the new path’s lock is a clone because there might be other threads currently using it (if it’s an existing path) right?

(also this was a nice catch!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old path's lock is not a clone because we remove the info from the infos hashmap, so we have an owned instance. Other uses of info already move out of this value, so moving the lock out of this as well is alright (we're not gonna use info)!

@andriyDev
Copy link
Contributor Author

@kfc35

Is it ok to not make the two “operations" [...] in handle_removed_asset and handle_renamed_asset atomic with respect to the locking

In theory we could have specifically the remove and rename still lock asset_infos and then also lock the files. However this means blocking all other access to the files while we wait for access to the old and new assets. I'd prefer not blocking *all access just for that.

my computer never came up with the deadlock on main

If I just run on main, the deadlock is very rare. Once I enabled logging, the deadlock was much more common. As an aside, we should probably just enable logging for bevy_asset tests in general - it makes it impossible to debug these issues otherwise.

Is the 4th step that the same thread eventually wants to lock asset_infos again, causing an inversion of the locking order...

Yes exactly! "Inversion of the locking order" is a much better way to describe it too haha. This happens through nested immediate asset loading. The dep_changed asset loader needs to nested-immedate asset load the source_changed asset, and processed asset reads are gated by ProcessorGatedReader which acquires the file transaction lock (which we need to acquire the asset_infos lock to look up).

@andriyDev andriyDev requested a review from kfc35 December 4, 2025 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Assets Load files from disk to use for things like images, models, and sounds C-Bug An unexpected or incorrect behavior D-Modest A "normal" level of difficulty; suitable for simple features or challenging fixes S-Needs-Review Needs reviewer attention (from anyone!) to move forward

2 participants