Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup() #13682

Open
wants to merge 1 commit into
base: master

luke-gruber
Contributor

In rb_ractor_sched_wait() (e.g. Ractor.receive), we acquire RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, on wakeup on a dedicated native thread (DNT), thread_sched_wait_running_turn() reacquires thread_sched_lock(cur_th) after the condvar wakeup and only then takes RACTOR_LOCK(cr). This lock-order inversion can deadlock with rb_ractor_wakeup_all() (e.g. port.send(obj)), which acquires RACTOR_LOCK(other_r) and then thread_sched_lock(other_th).

The deadlock then plays out as follows:

nt 1: Ractor.receive
rb_ractor_sched_wait() after condvar wakeup in thread_sched_wait_running_turn():
- thread_sched_lock(cur_th) (condvar) # acquires lock
- rb_ractor_lock_self(cr) # tries to acquire, HANGS (held by nt 2)

nt 2: port.send
ractor_wakeup_all()
- RACTOR_LOCK(port_r) # acquires lock
- thread_sched_lock # tries to acquire, HANGS (held by nt 1)
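The interleaving above can be replayed deterministically with plain pthread mutexes. This is only a sketch and not code from the Ruby tree: `ractor_lock` and `sched_lock` are hypothetical stand-ins for RACTOR_LOCK(cr) and thread_sched_lock(cur_th), and both "threads" are replayed on a single thread with `pthread_mutex_trylock()` so that a would-be hang shows up as `EBUSY` instead of actually blocking.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Replays the two acquisition orders step by step. Returns true when
 * both sides would block on the lock the other side holds, which is
 * the deadlock described in the trace.
 */
bool inverted_order_deadlocks(void)
{
    /* Hypothetical stand-ins for RACTOR_LOCK(cr) and thread_sched_lock(th). */
    pthread_mutex_t ractor_lock;
    pthread_mutex_t sched_lock;
    pthread_mutex_init(&ractor_lock, NULL);
    pthread_mutex_init(&sched_lock, NULL);

    /* nt 1 (receiver): returns from the condvar wait holding the sched lock. */
    pthread_mutex_lock(&sched_lock);
    /* nt 2 (sender): takes the ractor lock first. */
    pthread_mutex_lock(&ractor_lock);

    /* nt 1 now wants the ractor lock, nt 2 now wants the sched lock. */
    bool nt1_blocked = pthread_mutex_trylock(&ractor_lock) == EBUSY;
    bool nt2_blocked = pthread_mutex_trylock(&sched_lock) == EBUSY;

    /* Release the two locks taken above and clean up. */
    pthread_mutex_unlock(&ractor_lock);
    pthread_mutex_unlock(&sched_lock);
    pthread_mutex_destroy(&ractor_lock);
    pthread_mutex_destroy(&sched_lock);

    return nt1_blocked && nt2_blocked;
}
```

With real threads and blocking `pthread_mutex_lock()` calls, the same ordering would leave both threads waiting forever, since neither can release the lock the other needs.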

One solution would be to rework thread_sched_wait_running_turn() for DNTs. I didn't do this because it would be a bigger architectural change. Instead, I changed the code to release RACTOR_LOCK before calling rb_ractor_sched_wakeup() in a pthread environment. In a non-pthread environment it is safe to keep holding this lock, so we still do.
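The general shape of "release the ractor lock before waking" might look like the following. This is a minimal sketch under stated assumptions, not the actual patch: `demo_ractor`, `demo_sched`, `demo_sched_wakeup()`, `demo_wakeup_all()` and `demo_run()` are all hypothetical names, and recording the number of waiters under the lock before releasing it is an assumption made for illustration.

```c
#include <pthread.h>

/* Hypothetical stand-ins for a ractor and a thread scheduler. */
struct demo_ractor { pthread_mutex_t lock; int waiting; };
struct demo_sched  { pthread_mutex_t lock; int woken; };

/* Takes only the scheduler lock, like a wakeup call would. */
static void demo_sched_wakeup(struct demo_sched *s)
{
    pthread_mutex_lock(&s->lock);
    s->woken++;
    pthread_mutex_unlock(&s->lock);
}

/* Reads the waiter count under the ractor lock, then drops that lock
 * BEFORE taking the scheduler lock, so the two are never held together. */
static int demo_wakeup_all(struct demo_ractor *r, struct demo_sched *s)
{
    pthread_mutex_lock(&r->lock);
    int n = r->waiting;
    r->waiting = 0;
    pthread_mutex_unlock(&r->lock);

    for (int i = 0; i < n; i++) {
        demo_sched_wakeup(s);
    }
    return n;
}

/* Small driver: sets up `waiters` pending waiters and returns how many
 * wakeups were delivered. */
int demo_run(int waiters)
{
    struct demo_ractor r;
    struct demo_sched s;
    pthread_mutex_init(&r.lock, NULL);
    pthread_mutex_init(&s.lock, NULL);
    r.waiting = waiters;
    s.woken = 0;

    demo_wakeup_all(&r, &s);
    int woken = s.woken;

    pthread_mutex_destroy(&r.lock);
    pthread_mutex_destroy(&s.lock);
    return woken;
}
```

Because the waker never holds both locks at once, a waiter that wakes up holding the scheduler lock can still take the ractor lock, and the cycle from the trace cannot form.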

Fixes [Bug #21398]

@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch from fb0f525 to 9de6c4a Compare June 23, 2025 19:17

launchable-app bot commented Jun 23, 2025

Tests Failed

✖️ no tests failed ✔️ 62012 tests passed (1 flake)
