Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup() #13682

luke-gruber · 2025-06-23T19:03:25Z

In rb_ractor_sched_wait() (e.g. Ractor.receive), we acquire RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, when a dedicated native thread (DNT) wakes up, thread_sched_wait_running_turn() reacquires thread_sched_lock(cur_th) after the condvar wakeup and only then takes RACTOR_LOCK(cr). This lock inversion can deadlock with rb_ractor_wakeup_all() (e.g. port.send(obj)), which acquires RACTOR_LOCK(other_r) and then thread_sched_lock(other_th).

The deadlock plays out as follows:

nt 1: Ractor.receive
rb_ractor_sched_wait() after condvar wakeup in thread_sched_wait_running_turn():
- thread_sched_lock(cur_th) (condvar) # acquires lock
- rb_ractor_lock_self(cr) # deadlock here: tries to acquire, HANGS

nt 2: port.send
ractor_wakeup_all()
- RACTOR_LOCK(port_r) # acquires lock
- thread_sched_lock # tries to acquire, HANGS
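
The inversion in the trace above can be reproduced outside the VM. The following is a minimal sketch, not Ruby's code: `ractor_lock` and `sched_lock` are hypothetical stand-ins for RACTOR_LOCK and thread_sched_lock, and `rendezvous()`, `run_path()` and `demo_lock_inversion()` are names invented for this illustration. It uses `pthread_mutex_trylock()` for the second lock so that the conflict shows up as `EBUSY` instead of a hang.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for RACTOR_LOCK(r) and thread_sched_lock(th). */
static pthread_mutex_t ractor_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sched_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Rendezvous helper: returns once `target` threads have arrived. */
static pthread_mutex_t gate    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cv = PTHREAD_COND_INITIALIZER;

static void rendezvous(int *counter, int target)
{
    pthread_mutex_lock(&gate);
    (*counter)++;
    pthread_cond_broadcast(&gate_cv);
    while (*counter < target) {
        pthread_cond_wait(&gate_cv, &gate);
    }
    pthread_mutex_unlock(&gate);
}

struct path {
    pthread_mutex_t *first;   /* lock taken first on this code path */
    pthread_mutex_t *second;  /* lock taken second on this code path */
    int *arrived;             /* shared counter: threads holding `first` */
    int *tried;               /* shared counter: threads that tried `second` */
    int rc;                   /* result of the trylock on `second` */
};

static void *run_path(void *arg)
{
    struct path *p = arg;

    pthread_mutex_lock(p->first);
    rendezvous(p->arrived, 2);   /* both threads now hold their first lock */
    p->rc = pthread_mutex_trylock(p->second);
    rendezvous(p->tried, 2);     /* keep `first` until both have tried */
    if (p->rc == 0) {
        pthread_mutex_unlock(p->second);
    }
    pthread_mutex_unlock(p->first);
    return NULL;
}

/* Runs the receiver-like and sender-like paths with opposite lock orders. */
static void demo_lock_inversion(int *receiver_rc, int *sender_rc)
{
    int arrived = 0, tried = 0;
    /* Wakeup side of the receiver: sched lock first, then ractor lock. */
    struct path receiver = { &sched_lock, &ractor_lock, &arrived, &tried, 0 };
    /* Sender side: ractor lock first, then sched lock. */
    struct path sender = { &ractor_lock, &sched_lock, &arrived, &tried, 0 };
    pthread_t t1, t2;
    int err;

    err = pthread_create(&t1, NULL, run_path, &receiver);
    assert(err == 0);
    err = pthread_create(&t2, NULL, run_path, &sender);
    assert(err == 0);
    (void)err;

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    *receiver_rc = receiver.rc;
    *sender_rc = sender.rc;
}
```

Calling `demo_lock_inversion(&r, &s)` from a `main()` leaves both results at `EBUSY`: each thread holds the lock the other one needs. With blocking `pthread_mutex_lock()` calls in place of the trylock, both threads would wait forever, which is the hang described in the trace.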

One solution would be to rework how thread_sched_wait_running_turn() handles DNTs. I didn't do this because it would be a bigger architectural change. Instead, this change unlocks RACTOR_LOCK before calling rb_ractor_sched_wakeup() in a pthread environment. In a non-pthread environment it is safe to hold the lock across that call, so we keep holding it.
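
The shape of the fix can be sketched with plain pthread mutexes. This is an illustration under stated assumptions, not the actual patch: `ractor_lock` and `sched_lock` are hypothetical stand-ins for RACTOR_LOCK and thread_sched_lock, and `receiver_path()`, `sender_path()`, `rendezvous()` and `demo_unlock_before_wakeup()` are invented names. The sender releases the ractor lock before it takes the scheduler lock, so it never holds both at once.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for RACTOR_LOCK(r) and thread_sched_lock(th). */
static pthread_mutex_t ractor_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sched_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Rendezvous state, used only to force the problematic interleaving. */
static pthread_mutex_t gate    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cv = PTHREAD_COND_INITIALIZER;
static int holders_ready = 0;   /* guarded by gate */

static int wakeup_pending = 0;  /* guarded by ractor_lock */
static int received = 0;        /* written by the receiver, read after join */
static int woken = 0;           /* guarded by sched_lock */

static void rendezvous(int *counter, int target)
{
    pthread_mutex_lock(&gate);
    (*counter)++;
    pthread_cond_broadcast(&gate_cv);
    while (*counter < target) {
        pthread_cond_wait(&gate_cv, &gate);
    }
    pthread_mutex_unlock(&gate);
}

/* Receiver after wakeup: scheduler lock first, then ractor lock. */
static void *receiver_path(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sched_lock);
    rendezvous(&holders_ready, 2);      /* sender holds ractor_lock by now */
    pthread_mutex_lock(&ractor_lock);   /* waits until the sender lets go */
    received = wakeup_pending;
    pthread_mutex_unlock(&ractor_lock);
    pthread_mutex_unlock(&sched_lock);
    return NULL;
}

/* Sender: ractor lock first, but dropped before the scheduler lock. */
static void *sender_path(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ractor_lock);
    rendezvous(&holders_ready, 2);      /* receiver holds sched_lock by now */
    wakeup_pending = 1;
    pthread_mutex_unlock(&ractor_lock); /* key step: unlock before wakeup */
    pthread_mutex_lock(&sched_lock);
    woken = 1;
    pthread_mutex_unlock(&sched_lock);
    return NULL;
}

/* Returns 1 if both paths ran to completion without deadlocking. */
static int demo_unlock_before_wakeup(void)
{
    pthread_t t1, t2;
    int err;

    holders_ready = 0;
    wakeup_pending = 0;
    received = 0;
    woken = 0;

    err = pthread_create(&t1, NULL, receiver_path, NULL);
    assert(err == 0);
    err = pthread_create(&t2, NULL, sender_path, NULL);
    assert(err == 0);
    (void)err;

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return received == 1 && woken == 1;
}
```

The rendezvous forces the same interleaving that deadlocks when the sender keeps the ractor lock: the receiver already holds the scheduler lock and is about to request the ractor lock. Because the sender unlocks first, the receiver proceeds, releases both locks, and the sender then acquires the scheduler lock.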

Fixes [Bug #21398]


launchable-app · 2025-06-23T19:19:47Z

❌ Tests Failed

✖️ no tests failed ✔️ 62012 tests passed (1 flake)

luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch from fb0f525 to 9de6c4a Compare June 23, 2025 19:17
