Drain responses on completion for TransportNodesAction #130303

ywangd · 2025-06-30T06:26:51Z

This PR ensures the node responses are copied and drained exclusively in onCompletion so that they do not get concurrently modified by cancellation.

Resolves: #128852
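
As a rough illustration of the idea in the description, the sketch below keeps the collected responses behind a single lock and reads them in exactly one place. It is a minimal sketch, not the actual `TransportNodesAction` code: the class `DrainOnCompletionSketch` and its method names are hypothetical, and node responses are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector for illustration only; not the Elasticsearch implementation.
final class DrainOnCompletionSketch {
    private final List<String> responses = new ArrayList<>(); // guarded by "this"
    private boolean cancelled = false;                        // guarded by "this"

    // Called as each node responds; responses arriving after cancellation are dropped.
    synchronized void onResponse(String response) {
        if (cancelled == false) {
            responses.add(response);
        }
    }

    // The only place that reads the list: copy it, then drain it, under the lock.
    synchronized List<String> onCompletion() {
        final List<String> copy = List.copyOf(responses);
        responses.clear();
        return copy;
    }

    // Cancellation only flips a flag; it never touches the list itself.
    synchronized void onCancellation() {
        cancelled = true;
    }
}
```

Because the copy and the drain happen together under the same lock, a caller iterating the returned list cannot observe a modification triggered by a later cancellation.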

elasticsearchmachine · 2025-06-30T06:27:17Z

Hi @ywangd, I've created a changelog YAML for you.

elasticsearchmachine · 2025-06-30T06:27:18Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

server/src/main/java/org/elasticsearch/action/support/nodes/TransportNodesAction.java

DaveCTurner · 2025-06-30T07:05:08Z

server/src/test/java/org/elasticsearch/action/support/nodes/TransportNodesActionTests.java

+            ) {
+                final var waited = new AtomicBoolean();
+                for (var response : testNodeResponses) {
+                    if (waited.compareAndSet(false, true)) {


This is kind of a convoluted way to wait on a nonempty list. There's no concurrency here so the compareAndSet is a bit of a sledgehammer. Can we just check testNodeResponses.isEmpty()?

This is to wait for only the first response. You are right there is no need for AtomicBoolean. I changed it to a primitive boolean variable.

DaveCTurner · 2025-06-30T07:48:34Z

server/src/test/java/org/elasticsearch/action/support/nodes/TransportNodesActionTests.java

+                boolean waited = false;
+                for (var response : testNodeResponses) {
+                    if (waited == false) {
+                        waited = true;
+                        safeAwait(barrier);
+                        safeAwait(barrier);
+                    }
+                }


Can we not just do this?

Suggested change

-                boolean waited = false;
-                for (var response : testNodeResponses) {
-                    if (waited == false) {
-                        waited = true;
-                        safeAwait(barrier);
-                        safeAwait(barrier);
-                    }
-                }
+                if (testNodeResponses.isEmpty() == false) {
+                    safeAwait(barrier);
+                    safeAwait(barrier);
+                }

Indeed can we not assert that testNodeResponses is nonempty in this test?

The for-loop is to reproduce the ConcurrentModificationException reported in #128852. The test always passes without it.

I see. Could you add a comment to that effect, or else this'll get "tidied up".

Comment added in fdf0b22
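
For readers unfamiliar with why the iteration matters, here is a minimal single-threaded sketch of the hazard. It is an assumption-laden stand-in for the real test: the class `IterationHazardSketch` is hypothetical, and the `clear()` call inside the loop plays the role of a cancellation that drains the list while another thread is still iterating it.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical demonstration; not taken from TransportNodesActionTests.
final class IterationHazardSketch {

    // Returns true if draining the list mid-iteration triggers a ConcurrentModificationException.
    static boolean clearDuringIteration() {
        final List<String> responses = new ArrayList<>(List.of("node-1", "node-2", "node-3"));
        try {
            for (String response : responses) {
                // Stands in for a cancellation draining the list during iteration.
                responses.clear();
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

Without a loop over the list there is no iterator to invalidate, which is consistent with the observation above that the test always passes without it.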

DaveCTurner · 2025-06-30T07:49:46Z

server/src/main/java/org/elasticsearch/action/support/nodes/TransportNodesAction.java

+                        assert task instanceof CancellableTask : "expect CancellableTask, but got: " + task;
+                        final var cancellableTask = (CancellableTask) task;
+                        assert cancellableTask.isCancelled();
+                        throw new TaskCancelledException("task cancelled [" + cancellableTask.getReasonCancelled() + "]");


getReasonCancelled is racy according to its Javadocs: "May also be null if the task was just cancelled since we don't set the reason and the cancellation flag atomically." You need to use notifyIfCancelled to get the right behaviour here.

Thanks. Pushed 3d07261. Please let me know if it has used the right listener.
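
The race described in this thread can be sketched roughly as follows. This is a hypothetical model, assuming the flag and the reason are written one after the other and read without a lock; `CancellationStateSketch` is not the real `CancellableTask`, and a `Consumer<Exception>` stands in for a listener's failure callback.

```java
import java.util.function.Consumer;

// Hypothetical model of the flag-and-reason race; not the Elasticsearch CancellableTask.
final class CancellationStateSketch {
    private volatile boolean cancelled;
    private volatile String reason;

    void cancel(String why) {
        synchronized (this) {
            cancelled = true;
            // An unsynchronized reader running between these two writes
            // sees cancelled == true but reason == null.
            reason = why;
        }
    }

    boolean isCancelled() {
        return cancelled;
    }

    // Racy: takes no lock, so it may return null just after cancellation.
    String getReasonCancelled() {
        return reason;
    }

    // Safe: takes the same lock as cancel(), so the reason is set whenever the flag is.
    synchronized boolean notifyIfCancelled(Consumer<Exception> onFailure) {
        if (cancelled == false) {
            return false;
        }
        onFailure.accept(new IllegalStateException("task cancelled [" + reason + "]"));
        return true;
    }
}
```

The point of routing through a synchronized notification method is that the caller never builds the exception message from a half-published state.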

ywangd · 2025-06-30T07:56:54Z

server/src/main/java/org/elasticsearch/action/support/nodes/TransportNodesAction.java

+                        logger.debug("task cancelled after all responses were collected");
+                        assert task instanceof CancellableTask : "expect CancellableTask, but got: " + task;
+                        final var cancellableTask = (CancellableTask) task;
+                        assert cancellableTask.isCancelled();
+                        throw new TaskCancelledException("task cancelled [" + cancellableTask.getReasonCancelled() + "]");


This change is to address the edge case commented here. But I struggle to write a test for it. Essentially we need the cancel to come in after all node responses are collected but before the AtomicBoolean responsesHandled is checked. One option is to extract the creation of CancellableFanOut into its own protected method and wrap the returned value in a delegating CancellableFanOut. But this requires making the 4 protected methods in CancellableFanOut package-private. I am a bit suspicious about whether this is the right path to go down. I am open to suggestions.

I'd be content with a test which concurrently completes the action and cancels it, and asserts that we always either get an exception or we get a successful response. I expect such a test would find the bug here pretty reliably.

Cool I added such a test, see fb71e89
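
The shape of such a test can be sketched as below. This is only an illustration under stated assumptions: `CompleteVersusCancelSketch` is hypothetical, the `handled` flag plays the role of a `responsesHandled`-style guard, and the two outcomes are modelled as strings rather than a real response or exception.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the real test; none of these names come from Elasticsearch.
final class CompleteVersusCancelSketch {

    // Races a "complete" thread against a "cancel" thread and returns the single outcome.
    static String raceOnce() {
        final AtomicBoolean handled = new AtomicBoolean();
        final Queue<String> outcomes = new ConcurrentLinkedQueue<>();
        final CyclicBarrier barrier = new CyclicBarrier(2);

        final Thread completer = new Thread(() -> {
            awaitQuietly(barrier);
            if (handled.compareAndSet(false, true)) {
                outcomes.add("success");
            }
        });
        final Thread canceller = new Thread(() -> {
            awaitQuietly(barrier);
            if (handled.compareAndSet(false, true)) {
                outcomes.add("cancelled");
            }
        });
        completer.start();
        canceller.start();
        try {
            completer.join();
            canceller.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Exactly one side may win: never both, never neither.
        if (outcomes.size() != 1) {
            throw new AssertionError("expected exactly one outcome but got " + outcomes);
        }
        return outcomes.peek();
    }

    private static void awaitQuietly(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Running `raceOnce()` in a loop exercises both interleavings over time; the assertion does not care which side wins, only that the result is always one well-formed outcome.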

DaveCTurner · 2025-06-30T08:46:16Z

server/src/main/java/org/elasticsearch/action/support/nodes/TransportNodesAction.java

+                        assert task instanceof CancellableTask : "expect CancellableTask, but got: " + task;
+                        final var cancellableTask = (CancellableTask) task;
+                        assert cancellableTask.isCancelled();
+                        cancellableTask.notifyIfCancelled(finalListener);


I think we should complete l here, not the finalListener.

Ah right. I was in a rush to finish up and missed the obvious 🤦 pushed db76ef8

DaveCTurner · 2025-06-30T09:04:46Z

server/src/test/java/org/elasticsearch/action/support/nodes/TransportNodesActionTests.java

+
+        try {
+            final var testNodesResponse = future.actionGet(SAFE_AWAIT_TIMEOUT);
+            assertFalse(cancellableTask.isCancelled());


I don't think this'll hold in general: we could cancel the task after the completion has already passed the point of no return, and then the task's cancellation flag will be set even though it completed successfully.

Yeah, good point, thanks. I removed that in b38783d, which also contains a few other tweaks.

DaveCTurner

LGTM

Drain responses on completion for TransportNodesAction

a7daa50

ywangd requested review from nicktindall and DaveCTurner June 30, 2025 06:26

ywangd added the >bug, v9.0.0, v8.19.0, v9.1.0, :Distributed Coordination/Distributed, and v9.2.0 labels Jun 30, 2025

elasticsearchmachine added the Team:Distributed Coordination label Jun 30, 2025

Update docs/changelog/130303.yaml

6e8bfbe

ywangd added v9.0.4 and removed v9.0.0 labels Jun 30, 2025

unwanted change

670a175

DaveCTurner reviewed Jun 30, 2025

Use atomicBoolean

4adb4a6

DaveCTurner reviewed Jun 30, 2025

DaveCTurner reviewed Jun 30, 2025

move comment

f990e43

DaveCTurner reviewed Jun 30, 2025

more edge case

3d6140e

DaveCTurner reviewed Jun 30, 2025

ywangd commented Jun 30, 2025

notify cancel

3d07261

[CI] Auto commit changes from spotless

e06c981

ywangd added 2 commits June 30, 2025 18:38

test concurrently completing and cancelling

fb71e89

tweak name

9dcbbd0

ywangd requested a review from DaveCTurner June 30, 2025 08:40

DaveCTurner reviewed Jun 30, 2025

better assertions

8612b1a

complete l

db76ef8

comment for loop

fdf0b22

Merge remote-tracking branch 'origin/main' into es-128852-fix

b521449

DaveCTurner reviewed Jun 30, 2025

remove assertion for task cancellation

b38783d

Merge remote-tracking branch 'origin/main' into es-128852-fix

d40d44b

ywangd requested a review from DaveCTurner July 1, 2025 03:00

wording

5a0f186

DaveCTurner approved these changes Jul 1, 2025
