
feat: measure pubsub latencies and expose metrics #13126


Merged: 17 commits merged into coder:main from dk/pubsub-latency on May 10, 2024

Conversation

dannykopping (Contributor)

Pubsub is a crucial aspect of Coder's design. We use Postgres LISTEN/NOTIFY for this purpose, but this subsystem of Postgres is not very observable; at most, one can observe how full its buffers are.

This PR adds 3 metrics:

  • coder_pubsub_send_latency_seconds
  • coder_pubsub_receive_latency_seconds
  • coder_pubsub_latency_measure_errs_total

These metrics are tracked on each coderd replica, and each replica will use its own notification channel.

Ideally we want to isolate Postgres' contribution to the latency as much as possible, but with a network in between the measurement won't be perfectly accurate; these metrics will still be Good Enough™️ to alert on, and to give operators an indication of why coderd appears to be slowing down.

Operators can use these metrics in conjunction with the rate of change of coder_pubsub_{publishes,subscribes}_total to infer whether the queue has grown too large and/or the receivers are too bogged down to relieve pressure on the pubsub queue. A Postgres instance that is under-resourced on CPU may also be a contributing factor.

TODO: add this as a healthcheck in a subsequent PR.
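
For illustration, a minimal sketch of the round-trip measurement described above: publish a ping on a dedicated channel and time how long it takes our own subscriber to receive it. The `Pubsub` interface, package name, and function names below are assumptions inferred from the snippets in this PR, not the code that actually landed.

```go
package pubsubping

import (
	"context"
	"time"
)

// Pubsub is an assumed interface, based on the snippets shown in this PR;
// it is not necessarily the exact coder interface.
type Pubsub interface {
	Publish(event string, message []byte) error
	Subscribe(event string, listener func(ctx context.Context, message []byte)) (cancel func(), err error)
}

// measure publishes a ping on a dedicated channel and waits for our own
// subscriber to receive it, yielding send and receive latencies.
func measure(ctx context.Context, ps Pubsub, channel string) (send, recv time.Duration, err error) {
	received := make(chan time.Duration, 1)
	start := time.Now()

	cancel, err := ps.Subscribe(channel, func(_ context.Context, _ []byte) {
		select {
		case received <- time.Since(start):
		default: // a measurement already completed; drop extras
		}
	})
	if err != nil {
		return 0, 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, []byte("ping")); err != nil {
		return 0, 0, err
	}
	send = time.Since(start)

	select {
	case recv = <-received:
		return send, recv, nil
	case <-ctx.Done():
		return send, 0, ctx.Err()
	}
}
```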

@dannykopping dannykopping changed the title feat(database): measure pubsub latency and expose as metrics feat: measure pubsub latency and expose as metrics May 2, 2024
@dannykopping dannykopping changed the title feat: measure pubsub latency and expose as metrics feat: measure pubsub latencies and expose metrics May 2, 2024
@@ -528,6 +555,19 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
p.qMu.Unlock()
metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))

// additional metrics
ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
Contributor

I worry about running a measurement with a 10s timeout from within Collect(), which typically gets called every 15s or so.

Contributor Author

I settled on that number because a) it's high enough that hitting it indicates a serious problem, b) it won't disrupt the default 1m scrape interval (source), and c) even with a short interval like 15s it still won't interfere with scrapes.

We could lower this a bit though, what do you think?

Contributor

I guess, what's the expectation for how quickly Collect() returns?

	// This method may be called concurrently and must therefore be
	// implemented in a concurrency safe way. Blocking occurs at the expense
	// of total performance of rendering all registered metrics. Ideally,
	// Collector implementations support concurrent readers.

If NOTIFY is hobbled, then we're blocking Collect() for up to 10s, which doesn't seem like we're playing nice.

Contributor Author

I think it's acceptable to have Collect() block while we measure the latency, provided we don't overrun a reasonable scrape interval. In most architectures it's rare to have multiple scrapers scraping the same targets, so Collect() should only ever be called once at a time.

Alternatively we could spawn a background process which measures latency periodically and Collect() simply reads its state.
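
Roughly, that background alternative might look like the sketch below; the names and shape are assumptions for illustration, and (per the later comments) this is not the approach that ultimately landed.

```go
package pubsubping

import (
	"context"
	"sync"
	"time"
)

// cachedLatency runs the measurement in a background goroutine so Collect()
// never blocks on the pubsub; Collect() only reads the most recent result.
type cachedLatency struct {
	mu         sync.Mutex
	send, recv float64 // seconds
}

func (c *cachedLatency) run(ctx context.Context, interval time.Duration,
	measure func(context.Context) (send, recv time.Duration, err error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s, r, err := measure(ctx)
			if err != nil {
				continue // an error counter could be incremented here instead
			}
			c.mu.Lock()
			c.send, c.recv = s.Seconds(), r.Seconds()
			c.mu.Unlock()
		}
	}
}

// latest returns the most recently cached values for Collect() to emit.
func (c *cachedLatency) latest() (send, recv float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.send, c.recv
}
```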

Member

I like the idea of moving the measurement into the background, but in practice you need to be careful not to introduce goleak test flakes.

Contributor Author

The only hesitation I have with background measurement is that scrapes will be showing stale data, which may give operators the wrong impression.

Contributor

This architecture should be fine with multiple collections, as the unique ping will just get ignored by other calls to Collect().

Contributor Author

OK, let's give it a try: a7c042f

Contributor Author

I decided to double back to synchronous-only collection: 33a2a1d
The complexity that background collection introduced wasn't worth the benefit; I've satisfied my curiosity now 😅

)

cancel, err := p.Subscribe(latencyChannelName(), func(ctx context.Context, _ []byte) {
res <- time.Since(start).Seconds()
Contributor

Having each measurement be an identical message (and ignoring its contents) could lead to previously timed-out measurements getting processed here. That could produce very strange results, such as a ping arriving before start is set, which would make the recv latency roughly two millennia 🙀

Contributor Author

Good call! OK, I'll refactor to generate a unique message and only mark the latency roundtrip done when I observe that.
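
To make the idea concrete, a hedged sketch of that unique-ping refactor, reusing the assumed Pubsub interface from the earlier sketch; the uuid dependency (github.com/google/uuid) and function name are illustrative, not necessarily what the actual commits implement.

```go
package pubsubping

import (
	"bytes"
	"context"
	"time"

	"github.com/google/uuid"
)

// measureUnique tags each measurement with a unique payload and only counts
// the round trip as done when that exact payload comes back, so stale pings
// from previously timed-out measurements are ignored.
func measureUnique(ctx context.Context, ps Pubsub, channel string) (send, recv time.Duration, err error) {
	ping := []byte(uuid.NewString())
	received := make(chan time.Duration, 1)
	start := time.Now()

	cancel, err := ps.Subscribe(channel, func(_ context.Context, msg []byte) {
		if !bytes.Equal(msg, ping) {
			return // a stale ping from an earlier measurement; ignore it
		}
		select {
		case received <- time.Since(start):
		default:
		}
	})
	if err != nil {
		return 0, 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, ping); err != nil {
		return 0, 0, err
	}
	send = time.Since(start)

	select {
	case recv = <-received:
		return send, recv, nil
	case <-ctx.Done():
		return send, 0, ctx.Err()
	}
}
```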

Contributor Author

Let me know what you think of 65f57b1 please

Contributor Author

49d2002 supersedes the above

// Force fast publisher to start after the slow one to avoid flakiness.
<-hold
time.Sleep(time.Millisecond * 50)
Contributor Author

YOLO-ed this a bit because Go's scheduler gives a goroutine at most 20ms on CPU before it is pre-empted; I added a bit of extra time to be sure. I ran this test with -count=100 and didn't observe any flakes, but I'm open to better approaches.

Member

This is probably OK for now, although the scheduler latency on Windows can apparently vary.

The approach I like to use for controlling time is to pass in a chan time.Time; whenever something needs the current time, it reads it from that channel.
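
A toy illustration of that pattern (hypothetical test, not code from this PR): the goroutine under test gets its notion of "now" from the channel, so the test decides exactly how much time has passed.

```go
package pubsubping

import (
	"testing"
	"time"
)

func TestControlledTime(t *testing.T) {
	now := make(chan time.Time)
	done := make(chan time.Duration, 1)
	start := time.Date(2024, 5, 10, 0, 0, 0, 0, time.UTC)

	go func() {
		current := <-now // blocks until the test supplies a time
		done <- current.Sub(start)
	}()

	now <- start.Add(50 * time.Millisecond) // deterministically "advance" time
	if d := <-done; d != 50*time.Millisecond {
		t.Fatalf("expected 50ms, got %s", d)
	}
}
```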

Contributor

This feels too brittle to me. Even if you ran it 100 times locally, we've observed really big variances in scheduling on CI.

I think you should simplify and fake the PubSub rather than wrap it with a delay. Then you only need a single measurer, and you send it a bogus ping before the real one it publishes. Added bonus: it will execute in microseconds, not a second.
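
A rough sketch of what such a fake might look like (assumed shape, implementing the Pubsub interface from the earlier sketch; not the fake that actually landed in the test): Publish invokes subscribers synchronously, so a test can deliver a bogus ping followed by the real one in microseconds.

```go
package pubsubping

import (
	"context"
	"sync"
)

// fakePubsub is a hypothetical in-memory fake of the assumed Pubsub interface.
type fakePubsub struct {
	mu        sync.Mutex
	listeners map[string][]func(ctx context.Context, msg []byte)
}

func newFakePubsub() *fakePubsub {
	return &fakePubsub{listeners: map[string][]func(context.Context, []byte){}}
}

func (f *fakePubsub) Subscribe(event string, listener func(ctx context.Context, msg []byte)) (func(), error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.listeners[event] = append(f.listeners[event], listener)
	return func() {}, nil // cancellation elided for brevity
}

func (f *fakePubsub) Publish(event string, message []byte) error {
	f.mu.Lock()
	subs := append([]func(context.Context, []byte){}, f.listeners[event]...)
	f.mu.Unlock()
	for _, l := range subs {
		l(context.Background(), message) // deliver synchronously to every subscriber
	}
	return nil
}
```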


@dannykopping dannykopping marked this pull request as ready for review May 3, 2024 11:22
@dannykopping dannykopping requested review from mafredri and johnstcn May 3, 2024 14:35

@johnstcn (Member) left a comment

We should also update docs/admin/prometheus.md with the added metrics. LGTM once that's done.

@dannykopping (Contributor Author)

> We should also update docs/admin/prometheus.md with the added metrics. LGTM once that's done.

Added #13223 to track this; it seems the metrics aren't detected automatically.

@spikecurtis (Contributor) left a comment

Minor suggestion inline, but I don't need to review again

@@ -528,6 +566,20 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
p.qMu.Unlock()
metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))

// additional metrics
ctx, cancel := context.WithTimeout(context.Background(), LatencyMeasureInterval)
Contributor

Small quibble with the name: "interval" sounds to me like a periodic thing, so I would suggest LatencyMeasureTimeout

Contributor Author

Good call
Thanks for the patience on this PR!

@dannykopping dannykopping enabled auto-merge (squash) May 10, 2024 12:20
@dannykopping dannykopping merged commit 4671ebb into coder:main May 10, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 10, 2024
@dannykopping dannykopping deleted the dk/pubsub-latency branch May 10, 2024 12:41