Track snapshot stats as metrics #130301
@@ -152,6 +156,7 @@ public RepositoriesService(
threadPool.relativeTimeInMillisSupplier()
);
this.preRestoreChecks = preRestoreChecks;
this.snapshotMetrics = new SnapshotMetrics(meterRegistry, this::getSnapshotsInProgress);
We only create SnapshotMetrics here so we can provide it with the supplier for the "snapshots in progress" gauge. This means we need to pass it to the Repository.Factory#create call rather than the RepositoryPlugin#getRepositories call.
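To illustrate the wiring described in this comment, here is a minimal sketch. It assumes a gauge only needs a supplier it can poll. `InProgressGauge` and `OwningService` are hypothetical stand-ins, not the PR's `SnapshotMetrics` or `RepositoriesService`, and no real metrics API is used.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a metrics holder that exposes a polled gauge.
final class InProgressGauge {
    private final LongSupplier inProgressSupplier;

    InProgressGauge(LongSupplier inProgressSupplier) {
        this.inProgressSupplier = inProgressSupplier;
    }

    // Reads the current value on demand, the way a metrics exporter would poll it.
    long read() {
        return inProgressSupplier.getAsLong();
    }
}

// Hypothetical stand-in for the service that owns the "in progress" state.
final class OwningService {
    private final AtomicLong snapshotsInProgress = new AtomicLong();
    private final InProgressGauge gauge;

    OwningService() {
        // Only the owner can supply this method reference, so it creates the
        // gauge itself and hands the instance on to whatever needs it.
        this.gauge = new InProgressGauge(this::getSnapshotsInProgress);
    }

    long getSnapshotsInProgress() {
        return snapshotsInProgress.get();
    }

    void snapshotStarted() {
        snapshotsInProgress.incrementAndGet();
    }

    void snapshotFinished() {
        snapshotsInProgress.decrementAndGet();
    }

    InProgressGauge gauge() {
        return gauge;
    }
}
```

Because the supplier closes over state held by the owning service, the metrics object has to be constructed there and passed down, which matches the constraint raised in the comment.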
ActionListener.running(
() -> blobStoreSnapshotMetrics.shardSnapshotCompleted(threadPool.absoluteTimeInMillis() - startTimeInMillis)
)
);
We track the start and finish of the shard snapshot here, which means the time spent in the queue is not included. We could move this so that it covers the queue time as well.
I think we'd want this duration to reflect how long shards are unable to be moved. I will dig in to see whether queued time or snapshotting time is the better reflection of that.
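The two candidate measurements can be compared with a small sketch. It assumes three timestamps are available per shard (enqueued, started, finished). `ShardSnapshotTiming` is a hypothetical helper for illustration and does not exist in the PR.

```java
// Hypothetical helper: splits a shard snapshot's elapsed time into the part
// spent waiting in the queue and the part spent actually snapshotting.
final class ShardSnapshotTiming {
    private final long enqueuedAtMillis;
    private final long startedAtMillis;
    private final long finishedAtMillis;

    ShardSnapshotTiming(long enqueuedAtMillis, long startedAtMillis, long finishedAtMillis) {
        if (startedAtMillis < enqueuedAtMillis || finishedAtMillis < startedAtMillis) {
            throw new IllegalArgumentException("timestamps must be non-decreasing");
        }
        this.enqueuedAtMillis = enqueuedAtMillis;
        this.startedAtMillis = startedAtMillis;
        this.finishedAtMillis = finishedAtMillis;
    }

    // Time between enqueueing and the snapshot actually starting.
    long queuedMillis() {
        return startedAtMillis - enqueuedAtMillis;
    }

    // Time between start and finish; this is what the current code measures.
    long runningMillis() {
        return finishedAtMillis - startedAtMillis;
    }

    // Queue time plus running time.
    long totalMillis() {
        return finishedAtMillis - enqueuedAtMillis;
    }
}
```

Recording `queuedMillis()` and `runningMillis()` separately would keep both options open until it is clear which one better tracks the period in which a shard cannot be moved.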
snapshotMetrics.snapshotsShardsCompletedCounter().increment();
if (durationInMillis > 0) {
snapshotMetrics.snapshotShardsDurationHistogram().record(durationInMillis / 1_000f);
}
totalTime will be zero in some failure cases, I think.
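The effect of the `> 0` guard can be seen in isolation with a minimal sketch. `ShardDurationRecorder` is a hypothetical in-memory stand-in for the counter and histogram, not the PR's `SnapshotMetrics`, and it assumes a zero duration signals a shard that never really ran.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory stand-in for a completion counter plus a duration histogram.
final class ShardDurationRecorder {
    private long completedCount;
    private final List<Double> recordedSeconds = new ArrayList<>();

    void shardSnapshotCompleted(long durationInMillis) {
        // Every completion is counted, including failures that report no duration.
        completedCount++;
        // Zero (or negative) durations are skipped so they do not skew the histogram.
        if (durationInMillis > 0) {
            recordedSeconds.add(durationInMillis / 1_000.0); // milliseconds to seconds
        }
    }

    long completedCount() {
        return completedCount;
    }

    List<Double> recordedSeconds() {
        return Collections.unmodifiableList(recordedSeconds);
    }
}
```

With this shape the counter and the histogram sample count can diverge, so the gap between them gives a rough indication of how many shards finished without a measurable duration.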
Adds additional snapshot metrics and publishes them via APM.
Still a WIP: tests need to be added, and the new metrics need to be added to node stats.