feat: measure pubsub latencies and expose metrics #13126
Conversation
coderd/database/pubsub/pubsub.go
Outdated
@@ -528,6 +555,19 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
	p.qMu.Unlock()
	metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
	metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))

	// additional metrics
	ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
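For readers skimming the diff, here is a minimal, self-contained sketch of the kind of round-trip measurement this hunk drives. The `Pubsub` interface is a simplified stand-in for coder's, and `measureRoundTrip` (plus the `pubsubsketch` package name) is a hypothetical helper, not the PR's actual code:

```go
package pubsubsketch

import (
	"context"
	"time"
)

// Pubsub is a simplified stand-in for coder's pubsub interface (assumption).
type Pubsub interface {
	Publish(event string, message []byte) error
	Subscribe(event string, listener func(ctx context.Context, message []byte)) (cancel func(), err error)
}

// measureRoundTrip publishes a ping on a dedicated channel and waits for it to
// arrive back, returning the total round-trip time. The caller's context
// timeout bounds how long a scrape can block if NOTIFY delivery is hobbled.
func measureRoundTrip(ctx context.Context, ps Pubsub, channel string) (time.Duration, error) {
	done := make(chan time.Duration, 1)
	start := time.Now()

	cancel, err := ps.Subscribe(channel, func(_ context.Context, _ []byte) {
		select {
		case done <- time.Since(start):
		default: // a result was already recorded; drop duplicates
		}
	})
	if err != nil {
		return 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, []byte("ping")); err != nil {
		return 0, err
	}

	select {
	case d := <-done:
		return d, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}
```

The 10-second bound in the hunk above maps onto `ctx` here; the thread below debates whether that is an acceptable amount of time for Collect() to block.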
I worry about running a measurement with a 10s timeout from within Collect(), which typically gets called every 15s or so.
I settled on that number because a) it's sufficiently high as to indicate a serious problem, b) it won't disrupt the default 1m scrape interval (source), and c) even with a low interval like 15s it still will not interfere with scrapes.
We could lower this a bit though, what do you think?
I guess, what's the expectation for how quickly Collect() returns?
// This method may be called concurrently and must therefore be
// implemented in a concurrency safe way. Blocking occurs at the expense
// of total performance of rendering all registered metrics. Ideally,
// Collector implementations support concurrent readers.
If NOTIFY is hobbled, then we're blocking Collect() for up to 10s, which doesn't seem like we're playing nice.
I think it's acceptable to have Collect() block while we measure the latency, provided we don't overrun a reasonable scrape interval. In most architectures it's rare to have multiple scrapers scraping the same targets, so Collect() should only ever be called once at a time.
Alternatively we could spawn a background process which measures latency periodically and Collect() simply reads its state.
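A rough sketch of that alternative, assuming a periodic measurement loop whose latest result Collect() merely reads (all names here are illustrative, not the PR's code):

```go
package pubsubsketch

import (
	"context"
	"sync"
	"time"
)

// latencyCache holds the most recent measurement for Collect() to read.
type latencyCache struct {
	mu         sync.RWMutex
	lastSend   time.Duration
	lastRecv   time.Duration
	measuredAt time.Time
}

func (c *latencyCache) store(send, recv time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastSend, c.lastRecv, c.measuredAt = send, recv, time.Now()
}

func (c *latencyCache) load() (send, recv time.Duration, at time.Time) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.lastSend, c.lastRecv, c.measuredAt
}

// runLatencyLoop measures periodically until ctx is cancelled. It must be
// shut down on Close() so goleak-style leak detectors stay happy.
func runLatencyLoop(ctx context.Context, cache *latencyCache, interval time.Duration,
	measure func(context.Context) (send, recv time.Duration, err error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if send, recv, err := measure(ctx); err == nil {
				cache.store(send, recv)
			}
		}
	}
}
```

The trade-off raised below, that scrapes would then expose stale data, applies to exactly this pattern.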
I like the idea of separating measurement in the background, but in practice you need to be careful you don't introduce goleak test flakes.
The only hesitation I have with background measurement is that scrapes will be showing stale data, which may give operators the wrong impression.
This architecture should be fine with multiple collections, as the unique ping will just get ignored by other calls to Collect().
OK, let's give it a try: a7c042f
I decided to double back to synchronous-only collection: 33a2a1d
The complexity background collection introduced wasn't worth the benefit; I've satisfied my curiosity now 😅
coderd/database/pubsub/latency.go
Outdated
)

cancel, err := p.Subscribe(latencyChannelName(), func(ctx context.Context, _ []byte) {
	res <- time.Since(start).Seconds()
Having each measurement be an identical message (and ignoring it) could lead to previously timed-out measurements getting processed here. That could lead to very strange results, such as a ping arriving before start is set, which will result in the recv latency being two millennia 🙀
Good call! OK, I'll refactor to generate a unique message and only mark the latency roundtrip done when I observe that.
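For illustration, a sketch of that unique-ping approach, reusing the simplified Pubsub interface from the earlier sketch. Using github.com/google/uuid for the payload is an assumption; this is not the PR's exact code:

```go
package pubsubsketch

import (
	"bytes"
	"context"
	"time"

	"github.com/google/uuid"
)

// measureUnique tags each ping with a unique payload so that late deliveries
// from a previous (timed-out) measurement are ignored rather than mistaken
// for the current round trip.
func measureUnique(ctx context.Context, ps Pubsub, channel string) (time.Duration, error) {
	ping := []byte(uuid.New().String())
	done := make(chan time.Duration, 1)
	start := time.Now()

	cancel, err := ps.Subscribe(channel, func(_ context.Context, msg []byte) {
		if !bytes.Equal(msg, ping) {
			return // stale ping from an earlier measurement; ignore
		}
		select {
		case done <- time.Since(start):
		default:
		}
	})
	if err != nil {
		return 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, ping); err != nil {
		return 0, err
	}

	select {
	case d := <-done:
		return d, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}
```

A late ping from an earlier measurement now fails the bytes.Equal check and is simply dropped.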
Let me know what you think of 65f57b1 please
49d2002 supersedes the above
// Force fast publisher to start after the slow one to avoid flakiness.
<-hold
time.Sleep(time.Millisecond * 50)
YOLO-ed this a bit because Go's scheduler gives a goroutine at most 20ms on CPU before it is pre-empted; I added a bit of extra time to be sure. I ran this test with -count=100 and didn't observe any flakes, but I'm open to better approaches.
This is probably OK for now, although the scheduler latency on Windows can apparently vary.
The approach I like to use for controlling time is to pass in a chan time.Time; whenever something needs the current time, it reads it from that channel.
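A minimal sketch of that pattern, assuming the code under test accepts a now func() time.Time (all names here are hypothetical):

```go
package pubsubsketch

import (
	"testing"
	"time"
)

// clockFromChan adapts a channel of times into a "now" function, so a test
// can dictate exactly which timestamps the code under test observes.
func clockFromChan(ch <-chan time.Time) func() time.Time {
	return func() time.Time { return <-ch }
}

func TestWithControlledTime(t *testing.T) {
	t.Parallel()

	ticks := make(chan time.Time, 2)
	now := clockFromChan(ticks)

	base := time.Date(2024, 5, 1, 0, 0, 0, 0, time.UTC)
	ticks <- base                            // "start" timestamp
	ticks <- base.Add(25 * time.Millisecond) // "end" timestamp

	start := now()
	elapsed := now().Sub(start)
	if elapsed != 25*time.Millisecond {
		t.Fatalf("expected 25ms, got %s", elapsed)
	}
}
```

Because the test supplies both timestamps, no real sleeping or scheduler behaviour is involved.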
This feels too brittle to me. Even if you ran it 100 times locally, we've observed really big variances in scheduling on CI.
I think you should simplify and fake the PubSub, not wrap it with a delay. Then you only need a single measurer, and you send it a bogus ping prior to sending the real one it published. Added bonus is that it will execute in microseconds, not a second.
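A sketch of such a fake, satisfying the same simplified Pubsub interface from the earlier sketches (illustrative test double; coder's in-memory pubsub may look different):

```go
package pubsubsketch

import (
	"context"
	"sync"
)

// fakePubsub delivers published messages to subscribers synchronously, so a
// test can interleave a bogus ping before the real one without sleeps or
// wall-clock delays.
type fakePubsub struct {
	mu        sync.Mutex
	listeners map[string][]func(ctx context.Context, message []byte)
}

func newFakePubsub() *fakePubsub {
	return &fakePubsub{listeners: map[string][]func(context.Context, []byte){}}
}

func (f *fakePubsub) Subscribe(event string, fn func(ctx context.Context, message []byte)) (func(), error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.listeners[event] = append(f.listeners[event], fn)
	return func() {}, nil // cancellation elided for brevity
}

func (f *fakePubsub) Publish(event string, message []byte) error {
	f.mu.Lock()
	fns := append([]func(context.Context, []byte){}, f.listeners[event]...)
	f.mu.Unlock()
	for _, fn := range fns {
		fn(context.Background(), message)
	}
	return nil
}
```

A test can then Publish a bogus payload first and the real unique ping second, asserting that only the latter completes the measurement, deterministically and without any sleeps.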
Refactor async measurement for immediate exit upon signal
We should also update docs/admin/prometheus.md with the added metrics. LGTM once that's done.
Added #13223 to track this; seems like the metrics aren't detected automatically.
Minor suggestion inline, but I don't need to review again
coderd/database/pubsub/pubsub.go
Outdated
@@ -528,6 +566,20 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
	p.qMu.Unlock()
	metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
	metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))

	// additional metrics
	ctx, cancel := context.WithTimeout(context.Background(), LatencyMeasureInterval)
Small quibble with the name: "interval" sounds to me like a periodic thing, so I would suggest LatencyMeasureTimeout
Good call
Thanks for the patience on this PR!
Pubsub is a crucial aspect of Coder's design. We use Postgres LISTEN/NOTIFY for this purpose, but the problem is that this subsystem of Postgres is not very observable: at most, one can observe how full its buffers are.

This PR adds 3 metrics:
coder_pubsub_send_latency_seconds
coder_pubsub_receive_latency_seconds
coder_pubsub_latency_measure_errs_total
These metrics are tracked on each coderd replica, and each replica will use its own notification channel.
Ideally we want to isolate Postgres' contribution to the latency as much as possible, but since there's a network in between this won't be very accurate; however, these metrics will still be Good Enough™️ to alert on, giving operators an indication of why coderd appears to be slowing down.
Operators can use these metrics in conjunction with the rate of change in coder_pubsub_{publishes,subscribes}_total to infer whether the queue has become overly large and/or the receivers are too bogged down to alleviate pressure on the pubsub queue. If Postgres is under-resourced on CPU, this might also be a contributing factor.

TODO: add this as a healthcheck in a subsequent PR.
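For reference, a hedged sketch of how the three metrics named above could be declared with prometheus.NewDesc; the help strings and label sets here are assumptions, and the PR's actual definitions may differ:

```go
package pubsubsketch

import "github.com/prometheus/client_golang/prometheus"

// Descriptors for the three metrics listed in the PR description
// (help text assumed, not taken from the PR).
var (
	sendLatencyDesc = prometheus.NewDesc(
		"coder_pubsub_send_latency_seconds",
		"Time taken to publish the latency-measurement ping via NOTIFY.",
		nil, nil,
	)
	recvLatencyDesc = prometheus.NewDesc(
		"coder_pubsub_receive_latency_seconds",
		"Time taken for the latency-measurement ping to arrive back via LISTEN.",
		nil, nil,
	)
	measureErrsDesc = prometheus.NewDesc(
		"coder_pubsub_latency_measure_errs_total",
		"Total number of failed latency measurements.",
		nil, nil,
	)
)
```

The first two would typically be emitted as gauges from Collect() via MustNewConstMetric, mirroring the existing currentSubscribersDesc/currentEventsDesc pattern in the diff above, and the _errs_total series as a counter.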