- Apr 15, 2021
Jens Axboe authored
mainline inclusion from mainline-5.1-rc1
commit 2b188cc1
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
    Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
    Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try to submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time; this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO the application can just check the CQ ring without entering the kernel.

With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
  include/linux/fs.h
  [ Patch 4666750aa5ade2 ("fs: Export generic_fadvise()") applied earlier. ]
  include/linux/syscalls.h
  [ Non-bugfix 9afc5eee ("y2038: globally rename compat_time to old_time32") is not applied. ]
  include/uapi/asm-generic/unistd.h
  [ Patch 4e21565b ("asm-generic: add kexec_file_load system call to unistd.h") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
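A minimal userspace sketch of driving the two new syscalls directly follows. It assumes mainline uapi headers (<linux/io_uring.h>) and the eventual mainline x86_64 syscall numbers 425/426; both are assumptions that may not hold on a backported kernel, so treat this as illustrative only.

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));

	/* Create an instance with 8 SQ entries; on success the returned
	 * fd can be mmap()ed to reach the SQ ring, the CQ ring and the
	 * io_uring_sqe array. */
	int fd = syscall(425 /* __NR_io_uring_setup, assumed */, 8, &p);
	if (fd < 0) {
		perror("io_uring_setup");
		return 1;
	}

	/* Submit nothing and reap whatever is already complete:
	 * IORING_ENTER_GETEVENTS with min_complete == 0 returns without
	 * waiting, as described in the message above. */
	if (syscall(426 /* __NR_io_uring_enter, assumed */, fd, 0, 0,
		    IORING_ENTER_GETEVENTS, NULL, 0) < 0)
		perror("io_uring_enter");
	return 0;
}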
-
Deepa Dinamani authored
mainline inclusion from mainline-5.0-rc1
commit 7a35397f
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

struct timespec is not y2038 safe. struct __kernel_timespec is the new y2038 safe structure for all syscalls that are using struct timespec. Update io_pgetevents interfaces to use struct __kernel_timespec.

sigset_t also has different representations on 32 bit and 64 bit architectures. Hence, we need to support the following different syscalls:

New y2038 safe syscalls (controlled by CONFIG_64BIT_TIME for 32 bit ABIs):
  Native 64 bit (unchanged) and native 32 bit: sys_io_pgetevents
  Compat: compat_sys_io_pgetevents_time64

Older y2038 unsafe syscalls (controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs):
  Native 32 bit: sys_io_pgetevents_time32
  Compat: compat_sys_io_pgetevents

Note that io_getevents syscalls do not have a y2038 safe solution.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Conflicts:
  fs/aio.c
  include/linux/compat.h
  [ Patch 9afc5eee ("y2038: globally rename compat_time to old_time32") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
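For reference, the y2038-safe structure that this entry (and the pselect/ppoll entries below) switches to looks like this in its mainline uapi form (<linux/time_types.h>, where __kernel_time64_t is a long long); the exact header location in this backport may differ:

struct __kernel_timespec {
	__kernel_time64_t	tv_sec;		/* 64-bit seconds even on 32-bit ABIs */
	long long		tv_nsec;	/* nanoseconds */
};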
-
Deepa Dinamani authored
mainline inclusion from mainline-5.0-rc1
commit e024707b
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

struct timespec is not y2038 safe. struct __kernel_timespec is the new y2038 safe structure for all syscalls that are using struct timespec. Update pselect interfaces to use struct __kernel_timespec.

sigset_t also has different representations on 32 bit and 64 bit architectures. Hence, we need to support the following different syscalls:

New y2038 safe syscalls (controlled by CONFIG_64BIT_TIME for 32 bit ABIs):
  Native 64 bit (unchanged) and native 32 bit: sys_pselect6
  Compat: compat_sys_pselect6_time64

Older y2038 unsafe syscalls (controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs):
  Native 32 bit: pselect6_time32
  Compat: compat_sys_pselect6

Note that all other versions of select syscalls will not have y2038 safe versions.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Conflicts:
  fs/select.c
  include/linux/compat.h
  [ Patch 9afc5eee ("y2038: globally rename compat_time to old_time32") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Deepa Dinamani authored
mainline inclusion from mainline-5.0-rc1
commit 8bd27a30
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

struct timespec is not y2038 safe. struct __kernel_timespec is the new y2038 safe structure for all syscalls that are using struct timespec. Update ppoll interfaces to use struct __kernel_timespec.

sigset_t also has different representations on 32 bit and 64 bit architectures. Hence, we need to support the following different syscalls:

New y2038 safe syscalls (controlled by CONFIG_64BIT_TIME for 32 bit ABIs):
  Native 64 bit (unchanged) and native 32 bit: sys_ppoll
  Compat: compat_sys_ppoll_time64

Older y2038 unsafe syscalls (controlled by CONFIG_32BIT_COMPAT_TIME for 32 bit ABIs):
  Native 32 bit: ppoll_time32
  Compat: compat_sys_ppoll

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Conflicts:
  fs/select.c
  include/linux/compat.h
  [ Patch 9afc5eee ("y2038: globally rename compat_time to old_time32") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Deepa Dinamani authored
mainline inclusion from mainline-5.0-rc1
commit 854a6ed5
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Refactor the logic to restore the sigmask before the syscall returns into an api. This is useful for versions of syscalls that pass in the sigmask and expect current->sigmask to be changed during the execution and restored after the execution of the syscall.

With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Deepa Dinamani authored
mainline inclusion from mainline-5.0-rc1
commit ded653cc
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Refactor reading sigset from userspace and updating sigmask into an api. This is useful for versions of syscalls that pass in the sigmask and expect current->sigmask to be changed during, and restored after, the execution of the syscall.

With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll, and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise.

Note that the calls to sigprocmask() ignored the return value from the api as the function only returns an error on an invalid first argument that is hardcoded at these call sites. The updated logic uses set_current_blocked() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Conflicts:
  include/linux/signal.h
  [ Patch ae7795bc ("signal: Distinguish between kernel_siginfo and siginfo") is not applied. ]
  kernel/signal.c
  [ Patch fb50f5a4 ("signal: Pair exports with their functions") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
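Taken together with the restore helper from the previous entry, a ppoll-style syscall then follows this pattern. This is an in-kernel sketch using the signatures the two patches introduce in 5.0; do_sys_poll stands in for the syscall body, and the exact code in this backport may differ:

	sigset_t ksigmask, sigsaved;
	int ret;

	/* Read the user sigset, install it, and save the old mask. */
	ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize);
	if (ret)
		return ret;

	ret = do_sys_poll(ufds, nfds, to);	/* the syscall's real work */

	/* Put the original mask back; the helper defers restoration when
	 * a signal is pending so it is delivered under the passed mask. */
	restore_user_sigmask(sigmask, &sigsaved);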
-
Damien Le Moal authored
mainline inclusion from mainline-5.0-rc1
commit 20578bdf
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

For the synchronous I/O path case (read(), write() etc. system calls), a BIO I/O priority is not initialized until the execution of blk_init_request_from_bio(), when the BIO is submitted and a request initialized for the BIO execution. This is because the ki_ioprio field of the struct kiocb defined on the stack is always initialized to IOPRIO_CLASS_NONE, regardless of the I/O context ioprio value the calling process set with ioprio_set(). This late initialization can result in the BIO being merged into pending requests even when the I/O priorities differ.

Fix this by initializing the ki_ioprio field of the on-stack struct kiocb using the get_current_ioprio() helper, ensuring that all BIOs allocated and submitted for the system call execution see the correct intended I/O priority early. With this, since a BIO I/O priority is always set to the intended effective value for both the sync and async paths, blk_init_request_from_bio() can be simplified.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
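To exercise the fix from userspace: once a task sets its I/O context priority as below, synchronous read()/write() BIOs should carry that priority from submission time instead of IOPRIO_CLASS_NONE. The IOPRIO_* constants are restated here to mirror <linux/ioprio.h>, since glibc provides no wrapper for this syscall:

#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors <linux/ioprio.h>, which glibc does not re-export. */
#define IOPRIO_CLASS_SHIFT		13
#define IOPRIO_PRIO_VALUE(cl, d)	(((cl) << IOPRIO_CLASS_SHIFT) | (d))
#define IOPRIO_CLASS_BE			2
#define IOPRIO_WHO_PROCESS		1

int main(void)
{
	/* Best-effort class, highest level (0), for the calling task. */
	return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0));
}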
-
Damien Le Moal authored
mainline inclusion from mainline-5.0-rc1
commit 668ffc03
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Growing a high priority request by merging it with a lower priority BIO or request will increase the request execution time. This is the opposite of the desired effect of high I/O priorities, namely getting low I/O latencies. Prevent merging of requests and BIOs that have different I/O priorities to fix this.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
  block/blk-merge.c
  [ Patch 9cf2bab6 ("block: kill request ->cpu member") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Damien Le Moal authored
mainline inclusion from mainline-5.0-rc1
commit 76dc8913
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

For cases when the application does not specify aio_reqprio for an aio, fall back to using get_current_ioprio() to obtain the task I/O priority last set using ioprio_set(), rather than the hardcoded IOPRIO_CLASS_NONE value.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Damien Le Moal authored
mainline inclusion from mainline-5.0-rc1
commit 64845a1d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Define get_current_ioprio() as an inline helper to obtain the caller I/O priority from its task I/O context. Use this helper in blk_init_request_from_bio() to set a request ioprio.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>

Conflicts:
  block/blk-core.c
  [ e2b3fa5a ("block: Remove bio->bi_ioc") not included. ]

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
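For reference, the helper reads as follows in its mainline 5.0 form (include/linux/ioprio.h); given the conflict note above, the backported version may differ slightly:

/*
 * If the calling process has set an I/O priority, use that. Otherwise
 * return the default I/O priority (IOPRIO_CLASS_NONE).
 */
static inline int get_current_ioprio(void)
{
	struct io_context *ioc = current->io_context;

	if (ioc)
		return ioc->ioprio;
	return IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
}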
-
Damien Le Moal authored
mainline inclusion from mainline-5.0-rc1
commit 23464f8c
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Comment the use of the IOCB_FLAG_IOPRIO aio flag similarly to the IOCB_FLAG_RESFD flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
yangerkun authored
hulk inclusion
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Christoph Hellwig authored
mainline inclusion from mainline-5.1-rc1
commit fb7e1600
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

This new method is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is, with a non-null ki_complete) which has the IOCB_HIPRI flag set. The method is assisted by a new ki_cookie field in struct kiocb to store the polling cookie.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
  [ Adding ki_cookie to struct kiocb would change KABI and cannot be fixed up, so block poll support is dropped. ]

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
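In mainline, the method lands in struct file_operations as sketched below (shown for reference only; per the conflict note above, this backport drops ki_cookie and block poll support):

struct file_operations {
	/* ... */
	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
	/* New: poll for completion of a HIPRI iocb submitted earlier;
	 * 'spin' selects busy-waiting over a single poll attempt. */
	int (*iopoll)(struct kiocb *kiocb, bool spin);
	/* ... */
};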
-
Eric W. Biederman authored
stable inclusion from linux-4.19.99
commit 6db0e28b893aa28af3f7c0197749a5d9cbfded5c
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

[ Upstream commit 33da8e7c ]

My recent change to only use force_sig for synchronous events wound up breaking signal reception in cifs and drbd. I had overlooked the fact that by default kthreads start out with all signals set to SIG_IGN. So a change I thought was safe turned out to have made it impossible for those kernel threads to catch their signals.

Reverting the work on force_sig is a bad idea because what the code was doing was very much a misuse of force_sig: the way force_sig ultimately allowed the signal to happen was to change the signal handler to SIG_DFL, which after the first signal allows userspace to send signals to these kernel threads. At least for wake_ack_receiver in drbd that does not appear actively wrong.

So correct this problem by adding allow_kernel_signal, which lets through signals whose siginfo reports they were sent by the kernel but does not allow userspace generated signals, and update cifs and drbd to call allow_kernel_signal in an appropriate place so that their threads can receive this signal.

Fixing things this way ensures that userspace won't be able to send signals and cause problems, that it is clear which signals the threads are expecting to receive, and it guarantees that nothing else in the system will be affected. This change was partly inspired by similar cifs and drbd patches that added allow_signal.

Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Steve French <smfrench@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Fixes: 247bc947 ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
Fixes: 72abe3bc ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
Fixes: fee10990 ("signal/drbd: Use send_sig not force_sig")
Fixes: 3cf5d076 ("signal: Remove task parameter from force_sig")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
[io_uring need allow_kernel_signal]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
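The helper itself is tiny. As added upstream in include/linux/signal.h, it installs the kthread-special handler value so kernel-originated signals pass while userspace-generated ones stay ignored (shown for reference; the backport should match):

static inline void allow_kernel_signal(int sig)
{
	/*
	 * Kernel threads handle their own signals. Allow the kernel
	 * (but not userspace) to send sig to this kthread.
	 */
	kernel_sigaction(sig, SIG_KTHREAD_KERNEL);
}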
-
Steve French authored
stable inclusion from linux-4.19.99
commit 7f6a96dd8223796ffae4dd251be3bff161a28a4b
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

[ Upstream commit 247bc947 ]

Fixes: 72abe3bc ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")

The global change from force_sig caused module unloading of cifs.ko to fail: since the cifsd process could not be killed, "rmmod cifs" would now always fail.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
[io_uring need allow_kernel_signal]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Eric W. Biederman authored
stable inclusion from linux-4.19.99
commit e6a13c753f912564256d81f7036f9c524b1ef8ae
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

[ Upstream commit 72abe3bc ]

The locking in force_sig_info is not prepared to deal with a task that exits or execs (as sighand may change). This is not a problem in force_sig, as force_sig is only built to handle synchronous exceptions.

Further, the function force_sig_info changes the signal state if the signal is ignored, or blocked, or if SIGNAL_UNKILLABLE will prevent the delivery of the signal. The signal SIGKILL can not be ignored and can not be blocked, and SIGNAL_UNKILLABLE won't prevent it from being delivered.

So using force_sig rather than send_sig for SIGKILL is confusing and pointless. Because it won't impact the sending of the signal, and because using force_sig is wrong, replace force_sig with send_sig.

Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Jeff Layton <jlayton@primarydata.com>
Cc: Steve French <smfrench@gmail.com>
Fixes: a5c3e1c7 ("Revert "cifs: No need to send SIGKILL to demux_thread during umount"")
Fixes: e7ddee90 ("cifs: disable sharing session and tcon and add new TCP sharing code")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
[io_uring need allow_kernel_signal]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
- Apr 14, 2021
Piotr Krysiuk authored
stable inclusion from linux-4.19.186
commit 7b77ae2a0d6f9e110e13e85d802124b111b3e027
CVE: CVE-2021-29154

--------------------------------

commit 26f55a59dc65ff77cd1c4b37991e26497fc68049 upstream.

The branch displacement logic in the BPF JIT compilers for x86 assumes that, for any generated branch instruction, the distance cannot increase between optimization passes. But this assumption can be violated due to how the distances are computed. Specifically, whenever a backward branch is processed in do_jit(), the distance is computed by subtracting the positions in the machine code from different optimization passes. This is because part of addrs[] is already updated for the current optimization pass, before the branch instruction is visited. And so the optimizer can expand blocks of machine code in some cases.

This can confuse the optimizer logic, where it assumes that a fixed point has been reached for all machine code blocks once the total program size stops changing. And then the JIT compiler can output abnormal machine code containing incorrect branch displacements.

To mitigate this issue, we assert that a fixed point is reached while populating the output image. This rejects any problematic programs. The issue affects both x86-32 and x86-64. We mitigate separately to ease backporting.

Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Piotr Krysiuk authored
stable inclusion from linux-4.19.186
commit 5f26f1f838aa960045c712e13dbab8ff451fed74
CVE: CVE-2021-29154

--------------------------------

commit e4d4d456436bfb2fe412ee2cd489f7658449b098 upstream.

The branch displacement logic in the BPF JIT compilers for x86 assumes that, for any generated branch instruction, the distance cannot increase between optimization passes. But this assumption can be violated due to how the distances are computed. Specifically, whenever a backward branch is processed in do_jit(), the distance is computed by subtracting the positions in the machine code from different optimization passes. This is because part of addrs[] is already updated for the current optimization pass, before the branch instruction is visited. And so the optimizer can expand blocks of machine code in some cases.

This can confuse the optimizer logic, where it assumes that a fixed point has been reached for all machine code blocks once the total program size stops changing. And then the JIT compiler can output abnormal machine code containing incorrect branch displacements.

To mitigate this issue, we assert that a fixed point is reached while populating the output image. This rejects any problematic programs. The issue affects both x86-32 and x86-64. We mitigate separately to ease backporting.

Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
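The "assert that a fixed point is reached" in both entries boils down to one extra condition when each instruction is copied into the final image. A condensed sketch of the x86-64 guard in do_jit() (arch/x86/net/bpf_jit_comp.c) follows; the x86-32 variant is analogous, and the exact wording of the error string is not guaranteed here:

	if (image) {
		/*
		 * When populating the image, assert that:
		 *  i)  we do not write beyond the allocated space, and
		 *  ii) addrs[i] did not change from the prior pass, which
		 *      validates the assumptions made when computing
		 *      branch displacements.
		 */
		if (unlikely(proglen + ilen > oldproglen ||
			     proglen + ilen != addrs[i])) {
			pr_err("bpf_jit: fatal error\n");
			return -EFAULT;
		}
		memcpy(image + proglen, temp, ilen);
	}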
-
Kuppuswamy Sathyanarayanan authored
mainline inclusion from mainline-5.3-rc5
commit 5336e52c
category: bugfix
bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
CVE: NA

-------------------------------------------------

Recent changes to the vmalloc code by commit 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can cause spurious percpu allocation failures. These, in turn, can result in panic()s in the slub code. One such possible panic was reported by Dave Hansen in the following link: https://lkml.org/lkml/2019/6/19/939. Another related panic observed is:

RIP: 0033:0x7f46f7441b9b
Call Trace:
 dump_stack+0x61/0x80
 pcpu_alloc.cold.30+0x22/0x4f
 mem_cgroup_css_alloc+0x110/0x650
 cgroup_apply_control_enable+0x133/0x330
 cgroup_mkdir+0x41b/0x500
 kernfs_iop_mkdir+0x5a/0x90
 vfs_mkdir+0x102/0x1b0
 do_mkdirat+0x7d/0xf0
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly uses two lists (vmap_area_list & free_vmap_area_list) to track the used and free VM areas in VMALLOC space. The pcpu_get_vm_areas(offsets[], sizes[], nr_vms, align) function is used for allocating congruent VM areas for the percpu memory allocator. In order not to conflict with VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the VMALLOC space. So the search for a free vm_area for the given requirement starts near VMALLOC_END and moves upwards towards VMALLOC_START.

Prior to commit 68ad4a33, the search for a free vm_area in pcpu_get_vm_areas() involved the following two main steps:

Step 1:
    Find an aligned "base" address near VMALLOC_END.
    va = free vm area near VMALLOC_END
Step 2:
    Loop through the number of requested vm_areas and check:
    Step 2.1:
        if (base < VMALLOC_START)
            1. fail with error
    Step 2.2:
        // end is offsets[area] + sizes[area]
        if (base + end > va->vm_end)
            1. Move the base downwards and repeat Step 2
    Step 2.3:
        if (base + start < va->vm_start)
            1. Move to the previous free vm_area node, find an aligned base address and repeat Step 2

But commit 68ad4a33 removed Step 2.2 and modified Step 2.3 as below:

    Step 2.3:
        if (base + start < va->vm_start || base + end > va->vm_end)
            1. Move to the previous free vm_area node, find an aligned base address and repeat Step 2

The above change is the root cause of the spurious percpu memory allocation failures. For example, consider a case where a relatively large vm_area (~30 TB) was ignored in the free vm_area search because it did not pass the base + end < va->vm_end boundary check. Ignoring such large free vm_areas can lead to not finding a free vm_area within the VMALLOC_START to VMALLOC_END boundary, which in turn leads to allocation failures.

So modify the search algorithm to include Step 2.2.

Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5336e52c)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Arnd Bergmann authored
mainline inclusion from mainline-5.2-rc7
commit 2c929233
category: bugfix
bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
CVE: NA

-------------------------------------------------

gcc gets confused in pcpu_get_vm_areas() because there are too many branches that affect whether 'lva' was initialized before it gets used:

mm/vmalloc.c: In function 'pcpu_get_vm_areas':
mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    insert_vmap_area_augment(lva, &va->rb_node,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     &free_vmap_area_root, &free_vmap_area_list);
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/vmalloc.c:916:20: note: 'lva' was declared here
  struct vmap_area *lva;
                    ^~~

Add an initialization to NULL, and check whether this has changed before the first use.

[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2c929233)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
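Condensed, the shape of the fix in the split helper is roughly the following (a paraphrased sketch, not the verbatim hunk; the surrounding setup of lva is elided):

	struct vmap_area *lva = NULL;	/* was previously uninitialized */

	if (type == NE_FIT_TYPE) {
		/* Only this branch ever allocates and fills lva. */
		lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
		if (!lva)
			return -1;
		/* ... set up lva->va_start / lva->va_end ... */
	}

	/* ... */

	if (lva)	/* type == NE_FIT_TYPE */
		insert_vmap_area_augment(lva, &va->rb_node,
					 &free_vmap_area_root,
					 &free_vmap_area_list);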
-
Uladzislau Rezki (Sony) authored
mainline inclusion from mainline-5.2-rc1
commit a6cf4e0f
category: bugfix
bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
CVE: NA

-------------------------------------------------

This macro adds some debug code to check that vmap allocations happen in ascending order. By default this option is set to 0 and not active. Set it to 1 and recompile the kernel to activate it.

[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a6cf4e0f)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Uladzislau Rezki (Sony) authored
mainline inclusion from mainline-5.2-rc1
commit bb850f4d
category: bugfix
bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
CVE: NA

-------------------------------------------------

This macro adds some debug code to check that the augmented tree is maintained correctly, meaning that every node contains a valid subtree_max_size value. By default this option is set to 0 and not active. Set it to 1 and recompile the kernel to activate it.

[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit bb850f4d)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Uladzislau Rezki (Sony) authored
mainline inclusion from mainline-5.2-rc1
commit 68ad4a33
category: bugfix
bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
CVE: NA

-------------------------------------------------

Patch series "improve vmap allocation", v3.

Objective
---------
Please have a look at the description at https://lkml.org/lkml/2018/10/19/786, but let me also summarize it a bit here. The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation time. When I say "long" I mean milliseconds.

Description
-----------
This approach organizes the KVA memory layout into free areas of the 1-ULONG_MAX range, i.e. an allocation is done over free area lookups, instead of finding a hole between two busy blocks. It allows us to have a lower number of objects representing the free space, and therefore a less fragmented memory allocator, because free blocks are always as large as possible.

It uses the augmented tree where all free areas are sorted in ascending order of va->va_start address, in pair with a linked list that provides O(1) access to prev/next elements. Since the tree is augmented, we also maintain the "subtree_max_size" of a VA that reflects the maximum available free block in its left or right sub-tree. Knowing that, we can easily traverse toward the lowest (left-most) free area.

Allocation: ~O(log(N)) complexity. It is a sequential allocation method, so it tends to maximize locality. The search is done until a first suitable block is found that is large enough to encompass the requested parameters. Bigger areas are split.

I copy-paste here the description of how an area is split, since I described it in https://lkml.org/lkml/2018/10/19/786

<snip>
A free block can be split in three different ways. Their names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how the requested size and alignment fit the free block.

FL_FIT_TYPE - in this case the free block is just removed from the free list/tree because it fully fits. Compared with the current design there is extra work with rb-tree updating.

LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case all we do is cut the free block. It is as fast as the current design. Most vmalloc allocations end up with this case, because the edge is always aligned to 1.

NE_FIT_TYPE - a much less common case. Basically it happens when the requested size and alignment fit neither the left nor the right edge, i.e. the allocation falls between them. In this case, during splitting, we have to build a remaining left free area and place it back into the free list/tree. Compared with the current design there are two extra steps: first, we have to allocate a new vmap_area structure; second, we have to insert that remaining free block into the address-sorted list/tree.

In order to optimize the first case there is a cache of free_vmap objects. Instead of allocating from slab we just take an object from the cache and reuse it. The second one is pretty optimized: since we know a start point in the tree we do not search from the top; instead, the traversal begins from the rb-tree node we split.
<snip>

De-allocation: ~O(log(N)) complexity. An area is not inserted straight away into the tree/list; instead we identify the spot first, checking if it can be merged with its neighbors. The list provides O(1) access to prev/next, so it is pretty fast to check. Summarizing: if merged, large coalesced areas are created; if not, the area is just linked, making more fragments.

There is one more thing that I should mention here. After modification of a VA node, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree. Apart from that, it can also be propagated back up to upper levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and its description.

Tests and stressing
-------------------
I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since the 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it. Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding the last one, I do not have any physical access to NUMA systems, therefore I emulated it. The time of stressing is days.

If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it: http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

After massive testing, I have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good.

Performance analysis
--------------------
I have used two systems to test. One is an i5-3320M CPU @ 2.60GHz and another is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel, whereas the HiKey960 uses 4.15. Both systems could run 5.1-rc1 as well, but the results were not ready by the time I am writing this. Currently the suite consists of 8 tests. Three of them correspond to the different types of splitting described above (to compare with the default); the other 5 do allocations under different conditions.

a) sudo ./test_vmalloc.sh performance

When the test driver is run in "performance" mode, it runs all available tests pinned to the first online CPU with sequential test execution order. We do it in order to get stable and repeatable results. Take a look at the time difference in "long_busy_list_alloc_test". It is not surprising, because the worst case is O(N).

How many cycles all tests took:
CPU0=646919905370 (default) cycles vs CPU0=193290498550 (patched) cycles
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt

How many cycles all tests took:
CPU0=3478683207 cycles vs CPU0=463767978 cycles
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

b) time sudo ./test_vmalloc.sh test_repeat_count=1

With this configuration, all tests are run on all available online CPUs. Before running, each CPU shuffles its test execution order. That gives random allocation behaviour, so it is a rough comparison, but it puts things in the picture for sure.

<default>           vs  <patched>
real    101m22.813s     real  0m56.805s
user    0m0.011s        user  0m0.015s
sys     0m5.076s        sys   0m0.023s

ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt

<default>           vs  <patched>
real    unknown         real  4m25.214s
user    unknown         user  0m0.011s
sys     unknown         sys   0m0.670s

I did not manage to complete this test on the "default" HiKey960 kernel version. After 24 hours it was still running, therefore I had to cancel it. That is why real/user/sys are "unknown".

This patch (of 3):

Currently an allocation of a new vmap area is done over busy list iteration (complexity O(n)) until a suitable hole is found between two busy areas. Therefore each new allocation causes the list to grow. Due to an over-fragmented list and different permissive parameters, an allocation can take a long time. For example on embedded devices it is milliseconds.

This patch organizes the KVA memory layout into free areas of the 1-ULONG_MAX range. It uses an augmented red-black tree that keeps blocks sorted by their offsets, in pair with a linked list keeping the free space in order of increasing addresses.

Nodes are augmented with the size of the maximum available free block in their left or right sub-tree. Thus, that allows us to make a decision and traverse toward the block that will fit and will have the lowest start address, i.e. it is sequential allocation.

Allocation: to allocate a new block, a search is done over the tree until a suitable lowest (left-most) block is found that is large enough to encompass the requested size, alignment and vstart point. If the block is bigger than the requested size, it is split.

De-allocation: when a busy vmap area is freed it can either be merged or inserted into the tree. The red-black tree allows efficiently finding a spot, whereas the linked list provides constant-time access to previous and next blocks to check if merging can be done. In case of merging of de-allocated memory chunks, a large coalesced area is created.

Complexity: ~O(log(N))

[urezki@gmail.com: v3]
Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
[urezki@gmail.com: v4]
Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 68ad4a33)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Xiongfeng Wang authored
hulk inclusion
category: feature
bugzilla: 47439
CVE: NA

-------------------------------------------------

Enable CONFIG_USERSWAP for hulk_defconfig and openeuler_defconfig.

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Guo Fan authored
hulk inclusion
category: feature
bugzilla: 47439
CVE: NA

-------------------------------------------------

This patch modifies userfaultfd to support userswap. To check whether the pages are dirty since the last swap-in, we make them clean when we swap in the pages. The userspace may swap in a large area of which only part was swapped out, so we need to skip the pages that are not swapped out.

Signed-off-by: Guo Fan <guofan5@huawei.com>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Guo Fan authored
hulk inclusion
category: feature
bugzilla: 47439
CVE: NA

-------------------------------------------------

To make sure no other userspace threads access the memory region we are swapping out, we need to unmap the memory region, map it to a new address, and use the new address to perform the swapout. We add a new flag 'MAP_REPLACE' for mmap() to unmap the pages of the input parameter 'VA' and remap them to a new tmpVA.

Signed-off-by: Guo Fan <guofan5@huawei.com>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Michal Hocko authored
mainline inclusion from mainline-v5.8-rc1
commit 2d3a36a4
category: bugfix
bugzilla: 47439
CVE: NA

---------------------------

ba841078 ("mm/mempolicy: Allow lookup_node() to handle fatal signal") has added special casing for a 0 return value because that was a possible gup return value when interrupted by a fatal signal. This has been fixed by ae46d2aa ("mm/gup: Let __get_user_pages_locked() return -EINTR for fatal signal") in the meantime, so ba841078 can be reverted.

This patch however doesn't go all the way to revert it, because the check for 0 is wrong and confusing here. Firstly, it is inherently unsafe to access the page when get_user_pages_locked returns 0 (aka no page returned). Fortunately this will not happen, because get_user_pages_locked will not return 0 when nr_pages > 0, unless FOLL_NOWAIT is specified, which is not the case here. Document this potential error code in gup code while we are at it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
  mm/gup.c
  [ wangxiongfeng: conflicts in comments ]

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Peter Xu authored
mainline inclusion from mainline-v5.7-rc1
commit ba841078
category: bugfix
bugzilla: 47439
CVE: NA

---------------------------

lookup_node() uses gup to pin the page and get node information. It checks against ret>=0 assuming the page will be filled in. However it's also possible that gup will return zero, for example, when the thread is quickly killed with a fatal signal. Teach lookup_node() to gracefully return an error -EFAULT if it happens.

Meanwhile, initialize "page" to NULL to avoid the potential risk of exploiting the pointer.

Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: <syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
  mm/mempolicy.c

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Hillf Danton authored
mainline inclusion from mainline-v5.7-rc1
commit ae46d2aa
category: bugfix
bugzilla: 47439
CVE: NA

---------------------------

__get_user_pages_locked() will return 0 instead of -EINTR after commit 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times"), which added extra code to allow gup to detect a fatal signal faster. Restore the original -EINTR behavior.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: <syzbot+3be1a33f04dc782e9fd5@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Peter Xu authored
mainline inclusion from mainline-v5.7-rc6
commit 475f4dfc
category: bugfix
bugzilla: 47439
CVE: NA

---------------------------

This part was overlooked when reworking the gup code for multiple retries. When we get the 2nd+ retry, we'll be running with the TRIED flag set. The current code will bail out on the 2nd retry because the !TRIED check will fail, so the retry logic will be skipped. What's worse is that it will also return zero, which erroneously hints to the caller that the page is faulted in while it's not.

The !TRIED flag check seems not to be needed even before the multiple retries change, because if we get a VM_FAULT_RETRY, it must be the 1st retry, and we should not have TRIED set for that.

Fix it by removing the !TRIED check; in the meantime, check against fatal signals properly before the page fault so we can still properly respond to the user killing the process during retries.

Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Link: http://lkml.kernel.org/r/20200502003523.8204-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Peter Xu authored
mainline inclusion from mainline-5.6
commit 4426e945
category: bugfix
bugzilla: 47439
CVE: NA

---------------------------

This is the gup counterpart of the change that allows VM_FAULT_RETRY to happen more than once. One thing to mention is that we must check for a fatal signal here before retrying, because GUP can be interrupted by it; otherwise we can loop forever.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
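The "check the fatal signal before retry" maps onto the retry loop in __get_user_pages_locked() roughly as follows. This is a paraphrased sketch, not the verbatim hunk, with argument handling simplified:

	for (;;) {
		/*
		 * Retries are no longer bounded, so bail out if a fatal
		 * signal interrupted GUP; otherwise we could loop forever.
		 */
		if (fatal_signal_pending(current))
			break;	/* the ae46d2aa entry above maps this to -EINTR */

		*locked = 1;
		down_read(&mm->mmap_sem);
		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
				       pages, NULL, locked);
		if (ret != 1 || *locked)
			break;	/* page pinned, error, or lock still held */
	}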
-
Peter Xu authored
mainline inclusion from mainline-5.6
commit 4064b982
category: bugfix
bugzilla: 47439
CVE: NA

---------------------------

The idea comes from a discussion between Linus and Andrea [1].

Before this patch we only allow a page fault to retry once. We achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing handle_mm_fault() the second time. This was majorly used to avoid unexpected starvation of the system by looping over forever to handle the page fault on a single page. However that should hardly happen, and after all for each code path to return a VM_FAULT_RETRY we'll first wait for a condition (during which time we should possibly yield the cpu) to happen before VM_FAULT_RETRY is really returned.

This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY. It means that the page fault handler now can retry the page fault for multiple times if necessary without the need to generate another page fault event. Meanwhile we still keep the FAULT_FLAG_TRIED flag so the page fault handler can still identify whether a page fault is the first attempt or not. Then we'll have these combinations of fault flags (only considering the ALLOW_RETRY flag and the TRIED flag):

  - ALLOW_RETRY and !TRIED:  this means the page fault allows retry, and this is the first try
  - ALLOW_RETRY and TRIED:   this means the page fault allows retry, and this is not the first try
  - !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry at all
  - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used

In existing code we have multiple places that have taken special care of the first condition above by checking against (fault_flags & FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect the first retry of a page fault by checking against both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED), because now even the 2nd try will have ALLOW_RETRY set; it then uses that helper in all existing special paths (see the sketch after this entry). One example is in __lock_page_or_retry(): now we'll drop the mmap_sem only in the first attempt of the page fault and we'll keep it in follow-up retries, so the old locking behavior will be retained.

This will be a nice enhancement for current code [2] and at the same time supporting material for the future userfaultfd-writeprotect work, since in that work there will always be an explicit userfault writeprotect retry for protected pages, and if that cannot resolve the page fault (e.g., when userfaultfd-writeprotect is used in conjunction with swapped pages) then we'll possibly need a 3rd retry of the page fault. It might also benefit other potential users who will have similar requirements like userfault write-protection.

GUP code is not touched yet and will be covered in a follow-up patch.

Please read the threads below for more information.

[1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
  arch/arc/mm/fault.c
  arch/arm64/mm/fault.c
  arch/x86/mm/fault.c
  drivers/gpu/drm/ttm/ttm_bo_vm.c
  include/linux/mm.h
  mm/internal.h

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
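The "simple helper" the message refers to is, as merged in mainline include/linux/mm.h (the backport may place it differently given the conflict list above):

/*
 * Returns true if the page fault allows retry and this is the first
 * attempt of the fault handling; false otherwise.
 */
static inline bool fault_flag_allow_retry_first(unsigned int flags)
{
	return (flags & FAULT_FLAG_ALLOW_RETRY) &&
	       (!(flags & FAULT_FLAG_TRIED));
}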
-
Cheng Jian authored
hulk inclusion
category: bugfix
bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
CVE: NA

---------------------------

Previous patches added fields to struct rq and sched_domain_shared, which changed the KABI. We could use some helper structures to fix this KABI change, but that is not necessary: these structures are only used internally and drivers are not aware of them, so we simply avoid the helpers.

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Cheng Jian authored
hulk inclusion category: bugfix bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- If we disable CONFIG_SMP, try_steal will lose its definition, resulting in a compile error as follows. kernel/sched/fair.c: In function ‘pick_next_task_fair’: kernel/sched/fair.c:7001:15: error: implicit declaration of function ‘try_steal’ [-Werror=implicit-function-declaration] new_tasks = try_steal(rq, rf); ^~~~~~~~~ We can use allnoconfig to reproduce this problem. Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Bin Li <huawei.libin@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
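One plausible shape of the fix, sketched here under the assumption that stealing is only meaningful on SMP: provide an empty inline stub so the call site in pick_next_task_fair() compiles unchanged (the stub body is illustrative):

	#ifdef CONFIG_SMP
	int try_steal(struct rq *rq, struct rq_flags *rf);
	#else
	static inline int try_steal(struct rq *rq, struct rq_flags *rf)
	{
		return 0;	/* uniprocessor: no other CPU to steal from */
	}
	#endif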
-
Cheng Jian authored
hulk inclusion category: config bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA -------------------------------------------------

Enable the steal-tasks feature by default to improve CPU utilization.

Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
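The change itself plausibly amounts to one line in the shipped config (the exact defconfig path is not shown in this log):

	CONFIG_SCHED_STEAL=y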
-
Cheng Jian authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA ---------------------------

Introduce CONFIG_SCHED_STEAL to limit the impact of task stealing:

1) With CONFIG_SCHED_STEAL turned off, none of the changes exist, because empty stub functions are used; this relies on compiler optimization.

2) With CONFIG_SCHED_STEAL enabled but STEAL and schedstats disabled, the schedstat checks introduce some overhead, but this has little effect on performance. This will be our default choice.

Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
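A sketch of what the Kconfig entry could look like; the placement, help text, and dependencies here are illustrative, not the exact patch:

	config SCHED_STEAL
		bool "Steal tasks to improve CPU utilization"
		depends on SMP
		help
		  When a CPU has no more CFS tasks to run and idle_balance()
		  finds nothing, try to steal a task from an overloaded CPU
		  in the same LLC before going idle.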
-
Cheng Jian authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA ---------------------------

Stealing tasks to improve CPU utilization solves some performance problems, such as with mysql, but not all scenarios benefit; hackbench, for example, regresses. So turn the feature off by default.

Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
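Since stealing is gated by a scheduler feature bit, the change is plausibly a one-line default flip in kernel/sched/features.h, sketched here:

	SCHED_FEAT(STEAL, false)

On a CONFIG_SCHED_DEBUG kernel, writing STEAL to /sys/kernel/debug/sched_features would still re-enable it at runtime.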
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA ---------------------------

Add schedstats to measure the effectiveness of searching for idle CPUs and stealing tasks. This is a temporary patch intended for use during development only.

SCHEDSTAT_VERSION is bumped to 16, and the following fields are added to the per-CPU statistics of /proc/schedstat:

field 10: # of times select_idle_sibling "easily" found an idle CPU -- prev or target is idle.
field 11: # of times select_idle_sibling searched and found an idle CPU.
field 12: # of times select_idle_sibling searched and found an idle core.
field 13: # of times select_idle_sibling failed to find anything idle.
field 14: time in nanoseconds spent in functions that search for idle CPUs and search for tasks to steal.
field 15: # of times an idle CPU steals a task from another CPU.
field 16: # of times try_steal finds overloaded CPUs but no task is migratable.

Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
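The counters would be bumped with the usual schedstat helpers at the points described above; a sketch for fields 15 and 16 with hypothetical field names (the names in the actual patch may differ):

	/* field 15: a task was successfully stolen onto this idle CPU */
	schedstat_inc(rq->steal);

	/* field 16: overloaded CPUs were found but no task was migratable */
	schedstat_inc(rq->steal_fail);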
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA ---------------------------

The STEAL feature causes regressions on hackbench on larger NUMA systems, so disable it on systems with more than sched_steal_node_limit nodes (default 2). Note that the feature remains enabled as seen in features.h and /sys/kernel/debug/sched_features, but stealing is only performed if nodes <= sched_steal_node_limit. This arrangement allows users to activate stealing on reboot by setting the kernel parameter sched_steal_node_limit on kernels built without CONFIG_SCHED_DEBUG. The parameter is temporary and will be deleted when the regression is fixed.

Details of the regression follow. With the STEAL feature set, hackbench is slower on many-node systems:

X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
Average of 10 runs of: hackbench <groups> processes 50000

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
       1   3.627   15.8   3.876    7.3      -6.5
       2   4.545   24.7   5.583   16.7     -18.6
       3   5.716   25.0   7.367   14.2     -22.5
       4   6.901   32.9   7.718   14.5     -10.6
       8   8.604   38.5   9.111   16.0      -5.6
      16   7.734    6.8  11.007    8.2     -29.8

Total CPU time increases. Profiling shows that CPU time increases uniformly across all functions, suggesting a systemic increase in cache or memory latency. This may be due to NUMA migrations, as they cause loss of LLC cache footprint and remote memory latencies.

The domains for this system and their flags are:

domain0 (SMT): 1 core
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
  SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY
  SD_WAKE_AFFINE

domain1 (MC): 1 socket
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
  SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_WAKE_AFFINE

domain2 (NUMA): 4 sockets
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
  SD_SERIALIZE SD_OVERLAP SD_NUMA SD_WAKE_AFFINE

domain3 (NUMA): 8 sockets
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA

Schedstats point to the root cause of the regression. hackbench is run 10 times per group and the average schedstat accumulation per-run and per-CPU is shown below. Note that domain3 moves are zero because SD_WAKE_AFFINE is not set there.

NO_STEAL
                                          --- domain2 ---     --- domain3 ---
  grp  time %busy   sched   idle    wake  steal remote  move pull remote move pull
    1  20.3  10.3   28710  14346   14366      0    490  3378    0   4039    0    0
    2  26.4  18.8   56721  28258   28469      0    792  7026   12   9229    0    7
    3  29.9  28.3   90191  44933   45272      0   5380  7204   19  16481    0    3
    4  30.2  35.8  121324  60409   60933      0   7012  9372   27  21438    0    5
    8  27.7  64.2  229174 111917  117272      0  11991  1837  168  44006    0   32
   16  32.6  74.0  334615 146784  188043      0   3404  1468   49  61405    0    8

STEAL
                                          --- domain2 ---     --- domain3 ---
  grp  time %busy   sched   idle    wake  steal remote  move pull remote move pull
    1  20.6  10.2   28490  14232   14261     18      3  3525    0   4254    0    0
    2  27.9  18.8   56757  28203   28562    303   1675  7839    5   9690    0    2
    3  35.3  27.7   87337  43274   44085    698    741 12785   14  15689    0    3
    4  36.8  36.0  118630  58437   60216   1579   2973 14101   28  18732    0    7
    8  48.1  73.8  289374 133681  155600  18646  35340 10179  171  65889    0   34
   16  41.4  82.5  268925  91908  177172  47498  17206  6940  176  71776    0   20

Cross-numa-node migrations are caused by load balancing pulls and wake_affine moves. Pulls are small and similar for no_steal and steal. However, moves are significantly higher for steal, and the rows above with the highest moves have the worst regressions for time; see for example grp=8.
Moves increase for steal due to the following logic in wake_affine_idle() for synchronous wakeup:

	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;	// move the task

The steal feature does a better job of smoothing the load between idle and busy CPUs, so nr_running is 1 more often, and moves are performed more often. For hackbench, cross-node affine moves early in the run are good because they colocate wakers and wakees from the same group on the same node, but continued moves later in the run are bad, because the wakee is moved away from peers on its previous node. Note that even no_steal is far from optimal; binding an instance of "hackbench 2" to each of the 8 NUMA nodes runs much faster than running "hackbench 16" with no binding.

Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node migrations and eliminates the difference between no_steal and steal performance. However, overall performance is lower than WA_IDLE because some migrations are helpful as explained above.

I have tried many heuristics in an attempt to optimize the number of cross-node moves in all conditions, with limited success. The fundamental problem is that the scheduler does not track which groups of tasks talk to each other. Parts of several groups become entrenched on the same node, filling it to capacity, leaving no room for either group to pull its peers over, and there is neither data nor mechanism for the scheduler to evict one group to make room for the other.

For now, disable STEAL on such systems until we can do better, or until it is shown that hackbench is atypical and most workloads benefit from stealing.

Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
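A hedged sketch of the gate this adds: stealing stays behind both the feature bit and the node-count check, with the limit settable on the kernel command line. The helper name here is assumed; only sched_steal_node_limit is named in the text:

	int sched_steal_node_limit = 2;		/* kernel parameter, default 2 */

	static bool steal_enabled(void)
	{
		/* STEAL stays visible in sched_features, but only acts
		 * on systems within the node limit. */
		return sched_feat(STEAL) &&
		       nr_node_ids <= sched_steal_node_limit;
	}

Booting with sched_steal_node_limit=8 would then reactivate stealing on the 8-node system above.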
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA ---------------------------

When a CPU has no more CFS tasks to run, and idle_balance() fails to find a task, then attempt to steal a task from an overloaded CPU in the same LLC, using the cfs_overload_cpus bitmap to efficiently identify candidates. To minimize search time, steal the first migratable task that is found when the bitmap is traversed. For fairness, search for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance() alone, because the search is cheap, so it may be called every time the CPU is about to go idle. idle_balance() does more work because it searches widely for the busiest queue, so to limit its CPU consumption, it declines to search if the system is too busy. Simple stealing does not offload the globally busiest queue, but it is much better than running nothing at all.

Stealing is controlled by the sched feature SCHED_STEAL, which is enabled by default. Stealing improves utilization with only a modest CPU overhead in scheduler code. In the following experiment, hackbench is run with varying numbers of groups (40 tasks per group), and the delta in /proc/schedstat is shown for each run, averaged per CPU, augmented with these non-standard stats:

  %find - percent of time spent in old and new functions that search for idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
  steal - number of times a task is stolen from another CPU.

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench <grps> process 100000
sched_wakeup_granularity_ns=15000000

  baseline
  grps    time  %busy  slice   sched   idle    wake  %find  steal
     1   8.084  75.02   0.10  105476  46291   59183   0.31      0
     2  13.892  85.33   0.10  190225  70958  119264   0.45      0
     3  19.668  89.04   0.10  263896  87047  176850   0.49      0
     4  25.279  91.28   0.10  322171  94691  227474   0.51      0
     8  47.832  94.86   0.09  630636 144141  486322   0.56      0

  new
  grps    time  %busy  slice   sched   idle    wake  %find  steal  %speedup
     1   5.938  96.80   0.24   31255   7190   24061   0.63   7433      36.1
     2  11.491  99.23   0.16   74097   4578   69512   0.84  19463      20.9
     3  16.987  99.66   0.15  115824   1985  113826   0.77  24707      15.8
     4  22.504  99.80   0.14  167188   2385  164786   0.75  29353      12.3
     8  44.441  99.86   0.11  389153   1616  387401   0.67  38190       7.6

Elapsed time improves by 8 to 36%, and CPU busy utilization is up by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks). The cost is at most 0.4% more find time.

Additional performance results follow. A negative "speedup" is a regression. Note: for all hackbench runs, sched_wakeup_granularity_ns is set to 15 msec. Otherwise, preemptions increase at higher loads and distort the comparison between baseline and new.
------------------ 1 Socket Results ------------------

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
       1   8.008    0.1   5.905    0.2      35.6
       2  13.814    0.2  11.438    0.1      20.7
       3  19.488    0.2  16.919    0.1      15.1
       4  25.059    0.1  22.409    0.1      11.8
       8  47.478    0.1  44.221    0.1       7.3

X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
       1   4.586    0.8   4.596    0.6      -0.3
       2   7.693    0.2   5.775    1.3      33.2
       3  10.442    0.3   8.288    0.3      25.9
       4  13.087    0.2  11.057    0.1      18.3
       8  24.145    0.2  22.076    0.3       9.3
      16  43.779    0.1  41.741    0.2       4.8

KVM 4-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
tbench, average of 11 runs

  clients  %speedup
        1      16.2
        2      11.7
        4       9.9
        8      12.8
       16      13.7

KVM 2-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

  Benchmark                   %speedup
  specjbb2015_critical_jops        5.7
  mysql_sysb1.0.14_mutex_2        40.6
  mysql_sysb1.0.14_oltp_2          3.9

------------------ 2 Socket Results ------------------

X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
       1   7.945    0.2   7.219    8.7      10.0
       2   8.444    0.4   6.689    1.5      26.2
       3  12.100    1.1   9.962    2.0      21.4
       4  15.001    0.4  13.109    1.1      14.4
       8  27.960    0.2  26.127    0.3       7.0

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
       1   5.826    5.4   5.840    5.0      -0.3
       2   5.041    5.3   6.171   23.4     -18.4
       3   6.839    2.1   6.324    3.8       8.1
       4   8.177    0.6   7.318    3.6      11.7
       8  14.429    0.7  13.966    1.3       3.3
      16  26.401    0.3  25.149    1.5       4.9

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Oracle database OLTP, logging disabled, NVRAM storage

  Customers   Users  %speedup
    1200000      40      -1.2
    2400000      80       2.7
    3600000     120       8.9
    4800000     160       4.4
    6000000     200       3.0

X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Results from the Oracle "Performance PIT".
  Benchmark                                                      %speedup
  mysql_sysb1.0.14_fileio_56_rndrd                                   19.6
  mysql_sysb1.0.14_fileio_56_seqrd                                   12.1
  mysql_sysb1.0.14_fileio_56_rndwr                                    0.4
  mysql_sysb1.0.14_fileio_56_seqrewr                                 -0.3
  pgsql_sysb1.0.14_fileio_56_rndrd                                   19.5
  pgsql_sysb1.0.14_fileio_56_seqrd                                    8.6
  pgsql_sysb1.0.14_fileio_56_rndwr                                    1.0
  pgsql_sysb1.0.14_fileio_56_seqrewr                                  0.5
  opatch_time_ASM_12.2.0.1.0_HP2M                                     7.5
  select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M                        5.1
  select-1_users_asmm_ASM_12.2.0.1.0_HP2M                             4.4
  swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M                      5.8
  lm3_memlat_L2                                                       4.8
  lm3_memlat_L1                                                       0.0
  ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching                60.1
  ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent                   5.2
  ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent                  -3.0
  ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks        2.4

X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

  NAS_OMP
  bench  class  ncpu  %improved(Mops)
  dc     B        72              1.3
  is     C        72              0.9
  is     D        72              0.7

sysbench mysql, average of 24 runs

          --- base ---    --- new ---
  nthr   events %stdev   events %stdev  %speedup
     1    331.0   0.25    331.0   0.24      -0.1
     2    661.3   0.22    661.8   0.22       0.0
     4   1297.0   0.88   1300.5   0.82       0.2
     8   2420.8   0.04   2420.5   0.04      -0.1
    16   4826.3   0.07   4825.4   0.05      -0.1
    32   8815.3   0.27   8830.2   0.18       0.1
    64  12823.0   0.24  12823.6   0.26       0.0

-------------------------------------------------------------

Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
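A simplified sketch of the stealing path described above. cfs_overload_cpus is the bitmap named in the text (shown here as a plain cpumask; the real patch may use a more scalable structure), and steal_from() is a hypothetical helper that locks the victim runqueue and migrates the first migratable task in next-to-run order:

	/* Called when a CPU is about to go idle and idle_balance()
	 * found nothing: scan the per-LLC overloaded-CPU bitmap and
	 * take the first migratable task found.
	 */
	static int try_steal(struct rq *dst_rq, struct rq_flags *rf)
	{
		struct cpumask *overload_cpus;
		int stolen = 0;
		int cpu;

		rcu_read_lock();
		overload_cpus = rcu_dereference(dst_rq->cfs_overload_cpus);
		if (!overload_cpus)
			goto out;

		for_each_cpu_wrap(cpu, overload_cpus, cpu_of(dst_rq) + 1) {
			if (cpu == cpu_of(dst_rq))
				continue;
			if (steal_from(dst_rq, rf, cpu)) {
				stolen = 1;	/* stole a task; run it */
				break;
			}
		}
	out:
		rcu_read_unlock();
		return stolen;	/* 0: nothing migratable found */
	}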
-