Skip to content

Add allocation write load stats to write thread pool #130373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

DiannaHohensee
Copy link
Contributor

@DiannaHohensee DiannaHohensee commented Jul 1, 2025

Instrument the WRITE thread pool to collect:

  1. pool utilization EWMA
  2. queue latency EWMA

Relates ES-12233


See the ES ticket for details. I would like to collect some agreement on the approach. For example, taking an EWMA of the queue latencies; and taking the EWMA of the thread utilization metric, which is sampled periodically by the APM system. This approach is notably attempting to be 'good enough' for a first implementation of the write load balancing.

I still need to work out the thread pool utilization EWMA's alpha seed. I believe it must be tied to the APM period in order to be sensible. Unfortunately the APM collection frequency is configurable as well. Unless I make a separate and parallel utilization sampler, we're stuck with a dependency on APM's frequency. Anyway, still working on that bit.

@DiannaHohensee DiannaHohensee self-assigned this Jul 1, 2025
@DiannaHohensee DiannaHohensee added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team and removed v9.2.0 labels Jul 1, 2025
Instrument the WRITE thread pool to collect:
1) pool utilization EWMA
2) queue latency EWMA

Relates ES-12233
@DiannaHohensee DiannaHohensee force-pushed the 2025/06/30/add-node-level-write-load-stats branch from 4a7987e to 4037ccf Compare July 1, 2025 04:43
Comment on lines +212 to +214
if (queueLatencyMillis > 0) {
queueLatencyMillisEWMA.addValue(queueLatencyMillis);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 0 is meaningful for queue time and should be included?

Comment on lines +182 to +186
if (trackUtilizationEWMA) {
percentPoolUtilizationEWMA.addValue(utilizationSinceLastPoll);
// Test only tracking.
assert setUtilizationSinceLastPoll(utilizationSinceLastPoll);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels odd to tie utilization polling to APM. I think they can be separate. For example, we can have a separate set of lastTotalExecutionTime and lastPollTime which should allow us compute the utlization in a different frequency than the APM?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the decision originally was because it avoided the need for an extra thread etc. given that we're now extending it's use, I think it makes sense to rethink all of that.

it sounds like we want a more sophisticated average utilisation number.

If we're willing to wear the cost of a thread to do the polling, I think it makes sense for that thread to maintain THE average (which will become an EWMA) and publish that as the es.thread_pool.{name}.threads.utilization.current value.

Comment on lines +150 to +162
public double getPercentPoolUtilizationEWMA() {
if (trackUtilizationEWMA == false) {
return 0;
}
return this.percentPoolUtilizationEWMA.getAverage();
}

public double getQueuedTaskLatencyMillisEWMA() {
if (trackQueueLatencyEWMA == false) {
return 0;
}
return queueLatencyMillisEWMA.getAverage();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some service needs to call these methods. Is it going to be AverageWriteLoadSampler or is it still in discussion?

Comment on lines +220 to +228
/**
* If the queue latency reaches a high value (e.g. 10-30 seconds), then this thread pool is overwhelmed. It may be temporary, but that
* spike warrants the allocation balancer adjusting some number of shards, if possible. Therefore, it is alright to react quickly.
*
* As an example, suppose the EWMA is 10_000ms, i.e. 10 seconds.
* A single task in the queue that takes 30_000ms, i.e. 30 seconds, would result in a new EWMA of ~12_000ms
* 0.1 x 30_000ms + 0.9 x 10_000 = 3_000ms + 9_000ms = 12_000ms
*/
public static final double DEFAULT_WRITE_THREAD_POOL_QUEUE_LATENCY_EWMA_ALPHA = 0.1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively, I feel this value could use some relationship with the one for executionEWMA since they are updated with the same fequency before and after the task execution, respectively. If the node has N write threads and we observe N large execution times in a row, it feels roughly a similar effect as seeing a single large queue latency, i.e. all threads were busy executing tasks that took long to complete. This makes me think whether this value should be N * executionEWMA_alpha. I could be be wrong on this. Just an idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling a bit with the ewma used in this way. We get a value weighted towards the last N values, those could have arrived in the last 0.5ms or the last 30seconds, so it seems difficult to reason about. I wonder if it'd be easier to reason about if the queue were polled at a fixed interval (by the same regular thread that we introduce to poll the ulitlization) and ask the task at the front of the queue how long it'd been waiting?

That way I think the utilisation and latency would reflect observations over the same fixed time period and we could tweak the alpha values to find a period that was meaningful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants