
Reduce RM initialization timeout from 2.0s to 1.5s #831

Open
Vanjoseluis wants to merge 1 commit into ros-controls:rolling from Vanjoseluis:rm-init-timeout-study

Conversation

@Vanjoseluis
Contributor

This PR replaces the previous conservative 2.0 s Resource Manager initialization timeout with a measured and reproducible value of 1.5 seconds.
The new value is based on an extensive experimental study designed to identify the minimum stable timeout under Gazebo.

This change reduces the initialization timeout by 25% while maintaining full stability.


Motivation
Issue #801 showed that an overly small timeout (0.2 s) leads to incorrect controller behavior and large deviations from the expected joint positions. The goals of this study were to:

  • Determine the minimum stable timeout
  • Reduce test execution time
  • Eliminate flaky behavior caused by Gazebo initialization jitter
  • Replace a conservative value with an empirically justified one

Methodology
The pendulum_effort_test was used as the primary benchmark because it is the most timing‑sensitive test in the suite.
All other tests (pendulum_position_test, gripper_mimic_joint_position_test, gripper_mimic_joint_effort_test, etc.) were also validated with the final value (1.5 s).
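The repeated-run protocol behind these pass counts can be sketched as a small harness. This is an illustrative sketch only: the actual test command used in the study is not shown in the PR, so `cmd` here is a placeholder.

```python
import subprocess

def pass_rate(cmd, runs):
    """Run `cmd` repeatedly; return (passes, total) based on exit codes."""
    passes = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0:
            passes += 1
    return passes, runs

# In the study this would wrap the pendulum_effort_test invocation
# (e.g. a colcon/launch_test command); the exact command is not given here.
```

Counting exit codes over N runs is what produces figures like "11/11" or "10/11" in the table below.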


Initial Bisection Results

| Timeout (s) | Result |
|-------------|--------|
| 2.0         | PASS (baseline) |
| 1.1         | PASS (11/11) |
| 0.65        | PASS (10/11) |
| 0.425       | FAIL (3/4, 0/1, 0/1) |
| 0.2         | FAIL (issue #801) |
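The bisection itself amounts to a binary search over candidate timeouts, assuming stability is monotone in the timeout. A minimal sketch (the `is_stable` predicate, which would run the test suite N times per candidate, and the `resolution` parameter are my assumptions, not the PR's actual tooling):

```python
def min_stable_timeout(is_stable, lo, hi, resolution=0.05):
    """Binary-search the smallest timeout (to within `resolution`) for
    which `is_stable(timeout)` holds, assuming monotone stability."""
    if not is_stable(hi):
        raise ValueError("upper bound must itself be stable")
    while hi - lo > resolution:
        mid = (lo + hi) / 2.0
        if is_stable(mid):
            hi = mid  # mid passed: the threshold is at or below mid
        else:
            lo = mid  # mid failed: the threshold is above mid
    return hi
```

As the jitter analysis below shows, the monotonicity assumption is only approximate in practice, which is why the raw bisection result had to be validated with many repeated runs.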

Extended jitter analysis
Further testing revealed that Gazebo introduces significant initialization jitter.
Values that initially appeared stable (e.g., 1.1 s) were not consistently reproducible.

Additional runs:

  • 1.0 s → 8/9 (not stable)
  • 1.2 s → unstable
  • 1.5 s → 30/30 PASS, fully stable across all tests
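A 30/30 result can also be put in rough statistical terms. This is my own back-of-envelope framing, not an analysis from the PR:

```python
# If the true per-run failure probability were p, the chance of 30 clean
# runs is (1 - p)**30.  Solving (1 - p)**n = alpha gives the largest
# failure rate still consistent with the observation at level alpha.
def max_failure_rate(clean_runs, alpha=0.05):
    return 1 - alpha ** (1.0 / clean_runs)

# max_failure_rate(30) is roughly 0.095: 30/30 passes rules out flake
# rates much above ~10% at the 5% level, but cannot exclude rarer failures.
```

So 30 clean runs is strong evidence against the frequent flakiness seen at 1.0–1.2 s, while very rare failures would need more runs to detect.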


Conclusion
The RM initialization timeout is updated to 1.5 seconds, which:

  • Maintains full stability under Gazebo jitter
  • Reduces the previous timeout by 25%
  • Removes the arbitrary nature of the previous conservative 2.0 s value

This value is supported by reproducible experimental evidence (30/30 runs).

@Vanjoseluis Vanjoseluis requested a review from ahcorde as a code owner April 4, 2026 23:15
@Vanjoseluis
Contributor Author

Vanjoseluis commented Apr 5, 2026

If needed, we could explore using different RM initialization timeouts depending on the test or hardware.
During the experiments, all tests except the pendulum ones were stable with significantly lower timeout values (e.g., 0.2 s).
This PR keeps a unified timeout for simplicity, but the distinction may be useful in future work.

Long‑term, it might be interesting to explore whether Gazebo could be kept alive across tests with a proper reset mechanism.
This could avoid repeated RM initialization and significantly reduce CI time.
I’m not sure whether the current ros2_control and Gazebo plugin architecture supports this, but it might be worth discussing.

@Vanjoseluis
Contributor Author

It may also be that physics starts advancing before the RM is fully initialized, so a larger timeout is needed to let the system settle.

Member

@christophfroehlich christophfroehlich left a comment


I suppose this is dependent on the system load (CPU) and can get flaky on the CI runners. Have you tested the same with higher CPU load? I use this to max out 15 of my 16 cores, for example: `stress-ng --cpu 15 --vm 1 --vm-bytes 3G --vm-keep`

@Vanjoseluis
Contributor Author

Vanjoseluis commented Apr 5, 2026

> I suppose this is dependent on the system load (CPU) and can get flaky on the CI runners. Have you tested the same with higher CPU load? I use this to max out 15 of my 16 cores, for example: `stress-ng --cpu 15 --vm 1 --vm-bytes 3G --vm-keep`

I tried running the pendulum test under extreme load using stress-ng (15 CPU hogs + 3 GB VM pressure). Under these conditions Gazebo becomes systematically unstable: most runs fail on the joint-position assertion, and a couple fail due to missing joint_state messages.


Update:
I also tested the previous 2.0 s timeout under the same extreme stress‑ng conditions, and it fails consistently as well (mostly due to missing joint_state messages). I even tried 5.0 s with the same result.

This shows that stress‑ng overload breaks Gazebo initialization regardless of the timeout value, so it’s not a meaningful criterion for choosing the timeout.

Under normal and CI‑like load, the 1.5 s timeout behaves reliably.

