
Commit 7fe2f67

Limit the size of numa_move_pages requests
There's a kernel bug in do_pages_stat(), affecting systems combining a 64-bit kernel and 32-bit user space. The function splits the request into chunks of 16 pointers, but forgets the pointers are 32-bit when advancing to the next chunk. Some of the pointers get skipped, and memory after the array is interpreted as pointers. The result is that the produced status of memory pages is mostly bogus.

Systems combining 64-bit and 32-bit environments like this might seem rare, but that's not the case - all 32-bit Debian packages are built in a 32-bit chroot on a system with a 64-bit kernel.

This is a long-standing kernel bug (since 2010), affecting pretty much all kernels, so it'll take time until all systems get a fixed kernel. Luckily, we can work around the issue by chunking the requests the same way do_pages_stat() does, at least on affected systems. We don't know what kernel a 32-bit build will run on, so all 32-bit builds use chunks of 16 elements (the largest chunk before hitting the issue).

64-bit builds are not affected by this issue, and so could work without the chunking. But chunking has other advantages, so we apply chunking even for 64-bit builds, with chunks of 1024 elements.

Reported-by: Christoph Berg <myon@debian.org>
Author: Christoph Berg <myon@debian.org>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aEtDozLmtZddARdB@msg.df7cb.de
Context: https://marc.info/?l=linux-mm&m=175077821909222&w=2
Backpatch-through: 18
1 parent b5cd0ec commit 7fe2f67
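
To make the failure mode from the commit message concrete, here is a small standalone C illustration. This is not kernel code: only the 16-entry chunk size (DO_PAGES_STAT_CHUNK_NR) comes from the report, everything else is made up for the example. With 32-bit user space the page-pointer array holds 4-byte entries, but an unfixed 64-bit kernel advances between chunks as if the entries were 8 bytes wide, so after the first chunk of 16 it lands 16 entries too far.

#include <stdio.h>

int
main(void)
{
    const unsigned chunk_nr = 16;       /* DO_PAGES_STAT_CHUNK_NR */
    const unsigned user_entry = 4;      /* pointer size in 32-bit user space */
    const unsigned kernel_entry = 8;    /* pointer size assumed when advancing */

    /* bytes the kernel steps over after processing one chunk of 16 pointers */
    unsigned advanced_bytes = chunk_nr * kernel_entry;     /* 128 bytes */

    /* the 32-bit array element the next chunk actually starts at */
    unsigned next_index = advanced_bytes / user_entry;     /* 32, not 16 */

    printf("second chunk starts at element %u, skipping elements %u..%u\n",
           next_index, chunk_nr, next_index - 1);
    return 0;
}

Repeat this for every chunk and the offset drifts further, eventually reading memory past the array, which is why the returned status is mostly bogus and why capping each request at 16 entries sidesteps the bug entirely.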


src/port/pg_numa.c

Lines changed: 49 additions & 1 deletion
@@ -29,6 +29,19 @@
 #include <numa.h>
 #include <numaif.h>
 
+/*
+ * numa_move_pages() chunk size, has to be <= 16 to work around a kernel bug
+ * in do_pages_stat() (chunked by DO_PAGES_STAT_CHUNK_NR). By using the same
+ * chunk size, we make it work even on unfixed kernels.
+ *
+ * 64-bit systems are not affected by the bug, and so use much larger chunks.
+ */
+#if SIZEOF_SIZE_T == 4
+#define NUMA_QUERY_CHUNK_SIZE 16
+#else
+#define NUMA_QUERY_CHUNK_SIZE 1024
+#endif
+
 /* libnuma requires initialization as per numa(3) on Linux */
 int
 pg_numa_init(void)
@@ -42,11 +55,46 @@ pg_numa_init(void)
  * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
  * first one allows us to batch and query about many memory pages in one single
  * giant system call that is way faster.
+ *
+ * We call numa_move_pages() for smaller chunks of the whole array. The first
+ * reason is to work around a kernel bug, but also to allow interrupting the
+ * query between the calls (for many pointers processing the whole array can
+ * take a lot of time).
  */
 int
 pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
 {
-    return numa_move_pages(pid, count, pages, NULL, status, 0);
+    unsigned long next = 0;
+    int         ret = 0;
+
+    /*
+     * Chunk pointers passed to numa_move_pages to NUMA_QUERY_CHUNK_SIZE
+     * items, to work around a kernel bug in do_pages_stat().
+     */
+    while (next < count)
+    {
+        unsigned long count_chunk = Min(count - next,
+                                        NUMA_QUERY_CHUNK_SIZE);
+
+        /*
+         * Bail out if any of the chunks errors out (ret<0). We ignore
+         * (ret>0) which is used to return number of nonmigrated pages,
+         * but we're not migrating any pages here.
+         */
+        ret = numa_move_pages(pid, count_chunk, &pages[next], NULL, &status[next], 0);
+        if (ret < 0)
+        {
+            /* plain error, return as is */
+            return ret;
+        }
+
+        next += count_chunk;
+    }
+
+    /* should have consumed the input array exactly */
+    Assert(next == count);
+
+    return 0;
 }
 
 int
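
For completeness, a minimal, hypothetical caller sketch; it is not part of this commit. It assumes PostgreSQL's port/pg_numa.h header and a backend memory context (palloc/elog), and only illustrates the calling convention of the function patched above: pid 0 for the current process and one status slot per page pointer, each slot receiving the page's NUMA node or a negative errno.

#include "postgres.h"

#include "port/pg_numa.h"

/* hypothetical helper, for illustration only */
static void
report_numa_nodes(void **page_ptrs, unsigned long npages)
{
    int        *status = palloc(npages * sizeof(int));

    /* pid 0 queries the calling process; the chunking happens inside */
    if (pg_numa_query_pages(0, npages, page_ptrs, status) < 0)
        elog(WARNING, "pg_numa_query_pages() failed");
    else
    {
        for (unsigned long i = 0; i < npages; i++)
            elog(DEBUG1, "page %lu is on NUMA node %d", i, status[i]);
    }

    pfree(status);
}

Callers never see the chunking: the loop added by this commit is internal to pg_numa_query_pages(), so existing call sites keep working unchanged.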
