Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup() #13682
In rb_ractor_sched_wait() (ex: Ractor.receive), we acquire RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, on wakeup, if we're a DNT, thread_sched_wait_running_turn() acquires thread_sched_lock(cur_th) after the condvar wakeup and then RACTOR_LOCK(cr). This lock inversion can deadlock against rb_ractor_wakeup_all() (ex: port.send(obj)), which acquires RACTOR_LOCK(other_r) and then thread_sched_lock(other_th).
So the deadlock plays out like this:

nt 1: Ractor.receive
  rb_ractor_sched_wait(), after condvar wakeup in thread_sched_wait_running_turn():
  - thread_sched_lock(cur_th) (condvar)  # acquires lock
  - rb_ractor_lock_self(cr)              # deadlock here: tries to acquire, HANGS

nt 2: port.send
  ractor_wakeup_all():
  - RACTOR_LOCK(port_r)                  # acquires lock
  - thread_sched_lock                    # tries to acquire, HANGS
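As a standalone illustration of this ABBA pattern (not Ruby's actual code; `ractor_lock` and `sched_lock` are hypothetical stand-ins for RACTOR_LOCK and thread_sched_lock), a minimal pthreads sketch that reproduces the hang and reports it via a watchdog instead of hanging forever:

```c
/* ABBA lock-inversion sketch. Build: cc -pthread abba.c -o abba */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ractor_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for RACTOR_LOCK */
static pthread_mutex_t sched_lock  = PTHREAD_MUTEX_INITIALIZER; /* stand-in for thread_sched_lock */
static atomic_int finished = 0;

/* Models the waiter path: sched lock first, then ractor lock. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sched_lock);   /* like thread_sched_lock(cur_th) */
    usleep(100 * 1000);                /* widen the race window */
    pthread_mutex_lock(&ractor_lock);  /* like rb_ractor_lock_self(cr): hangs here */
    pthread_mutex_unlock(&ractor_lock);
    pthread_mutex_unlock(&sched_lock);
    atomic_fetch_add(&finished, 1);
    return NULL;
}

/* Models the waker path: ractor lock first, then sched lock. */
static void *waker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ractor_lock);  /* like RACTOR_LOCK(port_r) */
    usleep(100 * 1000);
    pthread_mutex_lock(&sched_lock);   /* like thread_sched_lock: hangs here */
    pthread_mutex_unlock(&sched_lock);
    pthread_mutex_unlock(&ractor_lock);
    atomic_fetch_add(&finished, 1);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, waiter, NULL);
    pthread_create(&b, NULL, waker, NULL);

    sleep(2); /* watchdog: with this interleaving both threads end up stuck */
    if (atomic_load(&finished) < 2) {
        fprintf(stderr, "deadlocked: each thread holds one lock and waits on the other\n");
        _exit(1); /* can't join deadlocked threads; bail out */
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no deadlock this run");
    return 0;
}
```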
One solution would be to rework thread_sched_wait_running_turn() with DNTs, but I didn't do that because it would be a bigger architectural change. What I changed instead is to unlock RACTOR_LOCK before calling rb_ractor_sched_wakeup() in a pthread env. In a non-pthread env it's safe to hold this lock, and we should.

Fixes [Bug #21398]
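A hedged sketch of the fix's general shape, in the same hypothetical harness as the sketch above (names are stand-ins, not Ruby's actual code): drop the outer lock before entering the path that takes the other lock, so neither side ever holds both locks in the inverted order.

```c
/* Fixed waker path for the sketch above: release the hypothetical
 * ractor_lock before calling into the wakeup path that needs
 * sched_lock, mirroring "unlock RACTOR_LOCK before calling
 * rb_ractor_sched_wakeup()". */
static void *waker_fixed(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ractor_lock);
    /* ... inspect/update state guarded by ractor_lock ... */
    pthread_mutex_unlock(&ractor_lock); /* drop it before touching sched_lock */

    pthread_mutex_lock(&sched_lock);    /* wakeup path's lock, taken alone */
    /* ... e.g. signal the waiter's condvar here ... */
    pthread_mutex_unlock(&sched_lock);
    return NULL;
}
```

The usual trade-off with releasing the lock early is that the waker may act on slightly stale state, which is generally acceptable as long as the waiter's wait loop re-checks its condition after waking.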