feat: measure pubsub latencies and expose metrics #13126
Conversation
coderd/database/pubsub/pubsub.go
Outdated
@@ -528,6 +555,19 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
	p.qMu.Unlock()
	metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
	metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))

	// additional metrics
	ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
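For readers skimming the diff, here is a minimal, self-contained sketch of the kind of round-trip measurement this hunk drives. The `Pubsub` interface is a simplified stand-in for coder's, and `measureRoundTrip` (plus the `pubsubsketch` package name) is a hypothetical helper, not the PR's actual code:

```go
package pubsubsketch

import (
	"context"
	"time"
)

// Pubsub is a simplified stand-in for coder's pubsub interface (assumption).
type Pubsub interface {
	Publish(event string, message []byte) error
	Subscribe(event string, listener func(ctx context.Context, message []byte)) (cancel func(), err error)
}

// measureRoundTrip publishes a ping on a dedicated channel and waits for it to
// arrive back, returning the total round-trip time. The caller's context
// timeout bounds how long a scrape can block if NOTIFY delivery is hobbled.
func measureRoundTrip(ctx context.Context, ps Pubsub, channel string) (time.Duration, error) {
	done := make(chan time.Duration, 1)
	start := time.Now()

	cancel, err := ps.Subscribe(channel, func(_ context.Context, _ []byte) {
		select {
		case done <- time.Since(start):
		default: // a result was already recorded; drop duplicates
		}
	})
	if err != nil {
		return 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, []byte("ping")); err != nil {
		return 0, err
	}

	select {
	case d := <-done:
		return d, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}
```

The 10-second bound in the hunk above maps onto `ctx` here; the thread below debates whether that is an acceptable amount of time for Collect() to block.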
I worry about running a measurement with a 10s timeout from within Collect(), which typically gets called every 15s or so.
I settled on that number because a) it's sufficiently high as to indicate a serious problem, b) it won't disrupt the default 1m scrape interval (source), and c) even with a low interval like 15s it still will not interfere with scrapes.
We could lower this a bit though, what do you think?
I guess, what's the expectation for how quickly Collect() returns?
// This method may be called concurrently and must therefore be
// implemented in a concurrency safe way. Blocking occurs at the expense
// of total performance of rendering all registered metrics. Ideally,
// Collector implementations support concurrent readers.
If NOTIFY is hobbled, then we're blocking Collect() for up to 10s, which doesn't seem like we're playing nice.
I think it's acceptable to have Collect() block while we measure the latency, provided we don't overrun a reasonable scrape interval. In most architectures it's rare to have multiple scrapers scraping the same targets, so Collect() should only ever be called once at a time.
Alternatively we could spawn a background process which measures latency periodically and Collect() simply reads its state.
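A rough sketch of that alternative, assuming a periodic measurement loop whose latest result Collect() merely reads (all names here are illustrative, not the PR's code):

```go
package pubsubsketch

import (
	"context"
	"sync"
	"time"
)

// latencyCache holds the most recent measurement for Collect() to read.
type latencyCache struct {
	mu         sync.RWMutex
	lastSend   time.Duration
	lastRecv   time.Duration
	measuredAt time.Time
}

func (c *latencyCache) store(send, recv time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastSend, c.lastRecv, c.measuredAt = send, recv, time.Now()
}

func (c *latencyCache) load() (send, recv time.Duration, at time.Time) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.lastSend, c.lastRecv, c.measuredAt
}

// runLatencyLoop measures periodically until ctx is cancelled. It must be
// shut down on Close() so goleak-style leak detectors stay happy.
func runLatencyLoop(ctx context.Context, cache *latencyCache, interval time.Duration,
	measure func(context.Context) (send, recv time.Duration, err error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if send, recv, err := measure(ctx); err == nil {
				cache.store(send, recv)
			}
		}
	}
}
```

The trade-off raised below, that scrapes would then expose stale data, applies to exactly this pattern.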
I like the idea of separating measurement in the background, but in practice you need to be careful you don't introduce goleak test flakes.
The only hesitation I have with background measurement is that scrapes will be showing stale data, which may give operators the wrong impression.
This architecture should be fine with multiple collections, as the unique ping will just get ignored by other calls to Collect().
OK, let's give it a try: a7c042f
I decided to double back to synchronous-only collection: 33a2a1d
The complexity background collection introduced wasn't worth the benefit; I've satisfied my curiosity now 😅
coderd/database/pubsub/latency.go
Outdated
)

cancel, err := p.Subscribe(latencyChannelName(), func(ctx context.Context, _ []byte) {
	res <- time.Since(start).Seconds()
Having each measurement be an identical message (and ignoring it) could lead to previously timed-out measurements getting processed here. That could lead to very strange results, such as a ping arriving before start is set, which will result in the recv latency being two millennia 🙀
Good call! OK, I'll refactor to generate a unique message and only mark the latency roundtrip done when I observe that.
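For illustration, a sketch of that unique-ping approach, reusing the simplified Pubsub interface from the earlier sketch. Using github.com/google/uuid for the payload is an assumption; this is not the PR's exact code:

```go
package pubsubsketch

import (
	"bytes"
	"context"
	"time"

	"github.com/google/uuid"
)

// measureUnique tags each ping with a unique payload so that late deliveries
// from a previous (timed-out) measurement are ignored rather than mistaken
// for the current round trip.
func measureUnique(ctx context.Context, ps Pubsub, channel string) (time.Duration, error) {
	ping := []byte(uuid.New().String())
	done := make(chan time.Duration, 1)
	start := time.Now()

	cancel, err := ps.Subscribe(channel, func(_ context.Context, msg []byte) {
		if !bytes.Equal(msg, ping) {
			return // stale ping from an earlier measurement; ignore
		}
		select {
		case done <- time.Since(start):
		default:
		}
	})
	if err != nil {
		return 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, ping); err != nil {
		return 0, err
	}

	select {
	case d := <-done:
		return d, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}
```

A late ping from an earlier measurement now fails the bytes.Equal check and is simply dropped.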
Let me know what you think of 65f57b1 please
49d2002 supersedes the above
// Force fast publisher to start after the slow one to avoid flakiness.
<-hold
time.Sleep(time.Millisecond * 50)
YOLO-ed this a bit because Go's scheduler gives a goroutine at most 20ms on CPU before it is pre-empted; I added a bit of extra time to be sure. I ran this test with -count=100 and didn't observe any flakes, but I'm open to better approaches.
This is probably OK for now, although the scheduler latency on Windows can apparently vary.
The approach I like to use for controlling time is to pass in a chan time.Time; whenever something needs the current time, it reads it from that channel.
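A minimal sketch of that pattern, assuming the code under test accepts a now func() time.Time (all names here are hypothetical):

```go
package pubsubsketch

import (
	"testing"
	"time"
)

// clockFromChan adapts a channel of times into a "now" function, so a test
// can dictate exactly which timestamps the code under test observes.
func clockFromChan(ch <-chan time.Time) func() time.Time {
	return func() time.Time { return <-ch }
}

func TestWithControlledTime(t *testing.T) {
	t.Parallel()

	ticks := make(chan time.Time, 2)
	now := clockFromChan(ticks)

	base := time.Date(2024, 5, 1, 0, 0, 0, 0, time.UTC)
	ticks <- base                            // "start" timestamp
	ticks <- base.Add(25 * time.Millisecond) // "end" timestamp

	start := now()
	elapsed := now().Sub(start)
	if elapsed != 25*time.Millisecond {
		t.Fatalf("expected 25ms, got %s", elapsed)
	}
}
```

Because the test supplies both timestamps, no real sleeping or scheduler behaviour is involved.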
This feels too brittle to me. Even if you ran it 100 times locally, we've observed really big variances in scheduling on CI.
I think you should simplify and fake the PubSub, not wrap it with a delay. Then you only need a single measurer, and you send it a bogus ping prior to sending the real one it published. Added bonus is that it will execute in microseconds, not a second.
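A sketch of such a fake, satisfying the same simplified Pubsub interface from the earlier sketches (illustrative test double; coder's in-memory pubsub may look different):

```go
package pubsubsketch

import (
	"context"
	"sync"
)

// fakePubsub delivers published messages to subscribers synchronously, so a
// test can interleave a bogus ping before the real one without sleeps or
// wall-clock delays.
type fakePubsub struct {
	mu        sync.Mutex
	listeners map[string][]func(ctx context.Context, message []byte)
}

func newFakePubsub() *fakePubsub {
	return &fakePubsub{listeners: map[string][]func(context.Context, []byte){}}
}

func (f *fakePubsub) Subscribe(event string, fn func(ctx context.Context, message []byte)) (func(), error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.listeners[event] = append(f.listeners[event], fn)
	return func() {}, nil // cancellation elided for brevity
}

func (f *fakePubsub) Publish(event string, message []byte) error {
	f.mu.Lock()
	fns := append([]func(context.Context, []byte){}, f.listeners[event]...)
	f.mu.Unlock()
	for _, fn := range fns {
		fn(context.Background(), message)
	}
	return nil
}
```

A test can then Publish a bogus payload first and the real unique ping second, asserting that only the latter completes the measurement, deterministically and without any sleeps.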
Refactor async measurement for immediate exit upon signal
We should also update docs/admin/prometheus.md with the added metrics. LGTM once that's done.
Added #13223 to track this; seems like the metrics aren't detected automatically.
Minor suggestion inline, but I don't need to review again
coderd/database/pubsub/pubsub.go
Outdated
@@ -528,6 +566,20 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
	p.qMu.Unlock()
	metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
	metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))

	// additional metrics
	ctx, cancel := context.WithTimeout(context.Background(), LatencyMeasureInterval)
Small quibble with the name: "interval" sounds to me like a periodic thing, so I would suggest LatencyMeasureTimeout
Good call
Thanks for the patience on this PR!
Pubsub is a crucial aspect of Coder's design. We use Postgres LISTEN/NOTIFY for this purpose, but the problem is that this subsystem of Postgres is not very observable: at most, one can observe how full its buffers are.

This PR adds 3 metrics:
coder_pubsub_send_latency_seconds
coder_pubsub_receive_latency_seconds
coder_pubsub_latency_measure_errs_total
These metrics are tracked on each coderd replica, and each replica will use its own notification channel.
Ideally we want to isolate Postgres' contribution to the latency as much as possible, but since there's a network in between this won't be very accurate; however, these metrics will still be Good Enough™️ to alert on, giving operators an indication of why coderd appears to be slowing down.
Operators can use these metrics in conjunction with the rate of change in coder_pubsub_{publishes,subscribes}_total to infer whether the queue has become overly large and/or the receivers are too bogged down to alleviate pressure on the pubsub queue. If Postgres is under-resourced on CPU, this might also be a contributing factor.

TODO: add this as a healthcheck in a subsequent PR.
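For reference, a hedged sketch of how the three metrics named above could be declared with prometheus.NewDesc; the help strings and label sets here are assumptions, and the PR's actual definitions may differ:

```go
package pubsubsketch

import "github.com/prometheus/client_golang/prometheus"

// Descriptors for the three metrics listed in the PR description
// (help text assumed, not taken from the PR).
var (
	sendLatencyDesc = prometheus.NewDesc(
		"coder_pubsub_send_latency_seconds",
		"Time taken to publish the latency-measurement ping via NOTIFY.",
		nil, nil,
	)
	recvLatencyDesc = prometheus.NewDesc(
		"coder_pubsub_receive_latency_seconds",
		"Time taken for the latency-measurement ping to arrive back via LISTEN.",
		nil, nil,
	)
	measureErrsDesc = prometheus.NewDesc(
		"coder_pubsub_latency_measure_errs_total",
		"Total number of failed latency measurements.",
		nil, nil,
	)
)
```

The first two would typically be emitted as gauges from Collect() via MustNewConstMetric, mirroring the existing currentSubscribersDesc/currentEventsDesc pattern in the diff above, and the _errs_total series as a counter.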