mm: parallelize clear_gigantic_page
hulk inclusion category: feature bugzilla: 13228 CVE: NA --------------------------- Parallelize clear_gigantic_page, which zeroes any page size larger than 8M (e.g. 1G on x86). Performance results (the default number of threads is 4; higher thread counts shown for context only): Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory Test: Clear a range of gigantic pages (triggered via fallocate) nthread speedup size (GiB) min time (s) stdev 1 100 41.13 0.03 2 2.03x 100 20.26 0.14 4 4.28x 100 9.62 0.09 8 8.39x 100 4.90 0.05 16 10.44x 100 3.94 0.03 1 200 89.68 0.35 2 2.21x 200 40.64 0.18 4 4.64x 200 19.33 0.32 8 8.99x 200 9.98 0.04 16 11.27x 200 7.96 0.04 1 400 188.20 1.57 2 2.30x 400 81.84 0.09 4 4.63x 400 40.62 0.26 8 8.92x 400 21.09 0.50 16 11.78x 400 15.97 0.25 1 800 434.91 1.81 2 2.54x 800 170.97 1.46 4 4.98x 800 87.38 1.91 8 10.15x 800 42.86 2.59 16 12.99x 800 33.48 0.83 The speedups are mostly due to the fact that more threads can use more memory bandwidth. The loop we're stressing on the x86 chip in this test is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one thread. We get the same bandwidth per thread for 2, 4, or 8 threads, but at 16 threads the per-thread bandwidth drops to 1420 MiB/s. However, the performance also improves over a single thread because of the ktask threads' NUMA awareness (ktask migrates worker threads to the node local to the work being done). This becomes a bigger factor as the amount of pages to zero grows to include memory from multiple nodes, so that speedups increase as the size increases. Signed-off-by:Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by:
Hongbo Yao <yaohongbo@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Tested-by:
Hongbo Yao <yaohongbo@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
Please register or sign in to comment