
Commit 9f33300

Prevent excessive delays before launching new logrep workers.
The logical replication launcher process would sometimes sleep for as much as 3 minutes before noticing that it is supposed to launch a new worker. This could happen if (1) WaitForReplicationWorkerAttach absorbed a process latch wakeup that was meant to cause ApplyLauncherMain to do work, or (2) logicalrep_worker_launch reported failure, either because of resource limits or because the new worker terminated immediately. In case (2), the expected behavior is that we retry the launch after wal_retrieve_retry_interval, but that didn't reliably happen.

It's not clear how often such conditions would occur in the field, but in our subscription test suite they are somewhat common, especially in tests that exercise cases that cause quick worker failure. That causes the tests to take substantially longer than they ought to do on typical setups.

To fix (1), make WaitForReplicationWorkerAttach re-set the latch before returning if it cleared it while looping. To fix (2), ensure that we reduce wait_time to no more than wal_retrieve_retry_interval when logicalrep_worker_launch reports failure. In passing, fix a couple of perhaps-hypothetical race conditions, e.g. examining worker->in_use without a lock.

Backpatch to v16. Problem (2) didn't exist before commit 5a3a953 because the previous code always set wait_time to wal_retrieve_retry_interval when launching a worker, regardless of success or failure of the launch. That behavior also greatly mitigated problem (1), so I'm not excited about adapting the remainder of the patch to the substantially-different code in older branches.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/817604.1750723007@sss.pgh.pa.us
Backpatch-through: 16
Parent: 2c0bc47

2 files changed: +44 −15 lines

src/backend/replication/logical/launcher.c

Lines changed: 32 additions & 8 deletions

@@ -184,12 +184,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   uint16 generation,
 							   BackgroundWorkerHandle *handle)
 {
-	BgwHandleStatus status;
-	int			rc;
+	bool		result = false;
+	bool		dropped_latch = false;

 	for (;;)
 	{
+		BgwHandleStatus status;
 		pid_t		pid;
+		int			rc;

 		CHECK_FOR_INTERRUPTS();

@@ -198,8 +200,9 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/* Worker either died or has started. Return false if died. */
 		if (!worker->in_use || worker->proc)
 		{
+			result = worker->in_use;
 			LWLockRelease(LogicalRepWorkerLock);
-			return worker->in_use;
+			break;
 		}

 		LWLockRelease(LogicalRepWorkerLock);
@@ -214,7 +217,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 			if (generation == worker->generation)
 				logicalrep_worker_cleanup(worker);
 			LWLockRelease(LogicalRepWorkerLock);
-			return false;
+			break;				/* result is already false */
 		}

 		/*
@@ -229,8 +232,18 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		{
 			ResetLatch(MyLatch);
 			CHECK_FOR_INTERRUPTS();
+			dropped_latch = true;
 		}
 	}
+
+	/*
+	 * If we had to clear a latch event in order to wait, be sure to restore
+	 * it before exiting.  Otherwise caller may miss events.
+	 */
+	if (dropped_latch)
+		SetLatch(MyLatch);
+
+	return result;
 }

 /*
@@ -1197,10 +1210,21 @@ ApplyLauncherMain(Datum main_arg)
 					(elapsed = TimestampDifferenceMilliseconds(last_start, now)) >= wal_retrieve_retry_interval)
 				{
 					ApplyLauncherSetWorkerStartTime(sub->oid, now);
-					logicalrep_worker_launch(WORKERTYPE_APPLY,
-											 sub->dbid, sub->oid, sub->name,
-											 sub->owner, InvalidOid,
-											 DSM_HANDLE_INVALID);
+					if (!logicalrep_worker_launch(WORKERTYPE_APPLY,
+												  sub->dbid, sub->oid, sub->name,
+												  sub->owner, InvalidOid,
+												  DSM_HANDLE_INVALID))
+					{
+						/*
+						 * We get here either if we failed to launch a worker
+						 * (perhaps for resource-exhaustion reasons) or if we
+						 * launched one but it immediately quit.  Either way,
+						 * it seems appropriate to try again after
+						 * wal_retrieve_retry_interval.
+						 */
+						wait_time = Min(wait_time,
+										wal_retrieve_retry_interval);
+					}
 				}
 				else
 				{

src/backend/replication/logical/tablesync.c

Lines changed: 12 additions & 7 deletions

@@ -604,14 +604,19 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					TimestampDifferenceExceeds(hentry->last_start_time, now,
 											   wal_retrieve_retry_interval))
 				{
-					logicalrep_worker_launch(WORKERTYPE_TABLESYNC,
-											 MyLogicalRepWorker->dbid,
-											 MySubscription->oid,
-											 MySubscription->name,
-											 MyLogicalRepWorker->userid,
-											 rstate->relid,
-											 DSM_HANDLE_INVALID);
+					/*
+					 * Set the last_start_time even if we fail to start
+					 * the worker, so that we won't retry until
+					 * wal_retrieve_retry_interval has elapsed.
+					 */
 					hentry->last_start_time = now;
+					(void) logicalrep_worker_launch(WORKERTYPE_TABLESYNC,
+													MyLogicalRepWorker->dbid,
+													MySubscription->oid,
+													MySubscription->name,
+													MyLogicalRepWorker->userid,
+													rstate->relid,
+													DSM_HANDLE_INVALID);
 				}
 			}
 		}
