  1. Apr 15, 2021
  2. Apr 14, 2021
    • bpf, x86: Validate computation of branch displacements for x86-32 · 27b0b915
      Piotr Krysiuk authored
      
      stable inclusion
      from linux-4.19.186
      commit 7b77ae2a0d6f9e110e13e85d802124b111b3e027
      CVE: CVE-2021-29154
      
      --------------------------------
      
      commit 26f55a59dc65ff77cd1c4b37991e26497fc68049 upstream.
      
      The branch displacement logic in the BPF JIT compilers for x86 assumes
      that, for any generated branch instruction, the distance cannot
      increase between optimization passes.
      
      But this assumption can be violated due to how the distances are
      computed. Specifically, whenever a backward branch is processed in
      do_jit(), the distance is computed by subtracting the positions in the
      machine code from different optimization passes. This is because part
      of addrs[] is already updated for the current optimization pass, before
      the branch instruction is visited.
      
      And so the optimizer can expand blocks of machine code in some cases.
      
      This can confuse the optimizer logic, where it assumes that a fixed
      point has been reached for all machine code blocks once the total
      program size stops changing. And then the JIT compiler can output
      abnormal machine code containing incorrect branch displacements.
      
      To mitigate this issue, we assert that a fixed point is reached while
      populating the output image. This rejects any problematic programs.
      The issue affects both x86-32 and x86-64. We mitigate separately to
      ease backporting.
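
      A minimal sketch of the shape of this mitigation, using the bookkeeping
      variables the text above refers to (image, proglen, oldproglen and the
      per-instruction buffer temp as maintained by do_jit()); treat it as an
      illustration of the idea rather than the literal hunk:

          if (image) {
                  /*
                   * The image buffer was sized from the previous pass. If the
                   * current pass wants to emit more bytes, no fixed point was
                   * reached: refuse to emit the program instead of writing
                   * code with stale branch displacements.
                   */
                  if (unlikely(proglen + ilen > oldproglen)) {
                          pr_err("bpf_jit: fatal error\n");
                          return -EFAULT;
                  }
                  memcpy(image + proglen, temp, ilen);
          }
          proglen += ilen;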
      
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
    • bpf, x86: Validate computation of branch displacements for x86-64 · 39fd7bff
      Piotr Krysiuk authored
      
      stable inclusion
      from linux-4.19.186
      commit 5f26f1f838aa960045c712e13dbab8ff451fed74
      CVE: CVE-2021-29154
      
      --------------------------------
      
      commit e4d4d456436bfb2fe412ee2cd489f7658449b098 upstream.
      
      The branch displacement logic in the BPF JIT compilers for x86 assumes
      that, for any generated branch instruction, the distance cannot
      increase between optimization passes.
      
      But this assumption can be violated due to how the distances are
      computed. Specifically, whenever a backward branch is processed in
      do_jit(), the distance is computed by subtracting the positions in the
      machine code from different optimization passes. This is because part
      of addrs[] is already updated for the current optimization pass, before
      the branch instruction is visited.
      
      And so the optimizer can expand blocks of machine code in some cases.
      
      This can confuse the optimizer logic, where it assumes that a fixed
      point has been reached for all machine code blocks once the total
      program size stops changing. And then the JIT compiler can output
      abnormal machine code containing incorrect branch displacements.
      
      To mitigate this issue, we assert that a fixed point is reached while
      populating the output image. This rejects any problematic programs.
      The issue affects both x86-32 and x86-64. We mitigate separately to
      ease backporting.
      
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      39fd7bff
    • mm/vmalloc.c: fix percpu free VM area search criteria · 36d32b23
      Kuppuswamy Sathyanarayanan authored
      mainline inclusion
      from mainline-5.3-rc5
      commit 5336e52c
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      Recent changes to the vmalloc code by commit 68ad4a33
      ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
      cause spurious percpu allocation failures.  These, in turn, can result
      in panic()s in the slub code.  One such possible panic was reported by
      Dave Hansen in the following link: https://lkml.org/lkml/2019/6/19/939.
      Another related panic observed is:
      
       RIP: 0033:0x7f46f7441b9b
       Call Trace:
        dump_stack+0x61/0x80
        pcpu_alloc.cold.30+0x22/0x4f
        mem_cgroup_css_alloc+0x110/0x650
        cgroup_apply_control_enable+0x133/0x330
        cgroup_mkdir+0x41b/0x500
        kernfs_iop_mkdir+0x5a/0x90
        vfs_mkdir+0x102/0x1b0
        do_mkdirat+0x7d/0xf0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
      to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
      uses two lists (vmap_area_list & free_vmap_area_list) to track the used
      and free VM areas in VMALLOC space.  The pcpu_get_vm_areas(offsets[],
      sizes[], nr_vms, align) function is used for allocating congruent VM
      areas for the percpu memory allocator.  In order not to conflict with
      VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
      VMALLOC space.  So the search for a free vm_area for the given requirement
      starts near VMALLOC_END and moves upwards towards VMALLOC_START.
      
      Prior to commit 68ad4a33, the search for a free vm_area in
      pcpu_get_vm_areas() involved the following two main steps.

      Step 1:
          Find an aligned "base" address near VMALLOC_END.
          va = free vm area near VMALLOC_END
      Step 2:
          Loop through number of requested vm_areas and check,
              Step 2.1:
                 if (base < VMALLOC_START)
                    1. fail with error
              Step 2.2:
                 // end is offsets[area] + sizes[area]
                 if (base + end > va->vm_end)
                     1. Move the base downwards and repeat Step 2
              Step 2.3:
                 if (base + start < va->vm_start)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      But Commit 68ad4a33 removed Step 2.2 and modified Step 2.3 as below:
      
              Step 2.3:
                 if (base + start < va->vm_start || base + end > va->vm_end)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      The above change is the root cause of spurious percpu memory allocation
      failures.  For example, consider a case where a relatively large vm_area
      (~30 TB) was ignored in the free vm_area search because it did not pass
      the base + end < va->vm_end boundary check.  Ignoring such large free
      vm_areas would lead to not finding a free vm_area within the boundary of
      VMALLOC_START to VMALLOC_END, which in turn leads to allocation failures.
      
      So modify the search algorithm to include Step 2.2.
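
      A sketch of what re-adding Step 2.2 looks like inside the placement loop
      of pcpu_get_vm_areas(); the helper pvm_determine_end_from_reverse() and
      the loop variables follow the upstream function, but read this as an
      illustration, not the exact hunk:

          /*
           * Step 2.2: if the request does not fit below the end of this
           * free area, move base downwards within the same area and
           * retry, instead of skipping the (possibly huge) area entirely.
           */
          if (base + end > va->va_end) {
                  base = pvm_determine_end_from_reverse(&va, align) - end;
                  term_area = area;
                  continue;
          }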
      
      Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
      
      
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 5336e52c)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning · a9708156
      Arnd Bergmann authored
      mainline inclusion
      from mainline-5.2-rc7
      commit 2c929233
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      gcc gets confused in pcpu_get_vm_areas() because there are too many
      branches that affect whether 'lva' was initialized before it gets used:
      
        mm/vmalloc.c: In function 'pcpu_get_vm_areas':
        mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
            insert_vmap_area_augment(lva, &va->rb_node,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             &free_vmap_area_root, &free_vmap_area_list);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        mm/vmalloc.c:916:20: note: 'lva' was declared here
          struct vmap_area *lva;
                            ^~~
      
      Add an initialization to NULL, and check whether it has changed before
      the first use.
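
      A minimal sketch of the resulting pattern in the split path (names such
      as vmap_area_cachep, nva_start_addr and insert_vmap_area_augment() are
      taken from this code area; the snippet is illustrative):

          struct vmap_area *lva = NULL;

          if (type == NE_FIT_TYPE) {
                  /* The only branch that needs the extra node. */
                  lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
                  if (!lva)
                          return -1;

                  lva->va_start = va->va_start;
                  lva->va_end = nva_start_addr;
          }

          if (type != FL_FIT_TYPE) {
                  augment_tree_propagate_from(va);

                  if (lva)        /* type == NE_FIT_TYPE */
                          insert_vmap_area_augment(lva, &va->rb_node,
                                  &free_vmap_area_root, &free_vmap_area_list);
          }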
      
      [akpm@linux-foundation.org: tweak comments]
      Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
      
      
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 2c929233)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a9708156
    • mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro · 2003379b
      Uladzislau Rezki (Sony) authored
      mainline inclusion
      from mainline-5.2-rc1
      commit a6cf4e0f
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      This macro adds some debug code to check that vmap allocations happen
      in ascending order.

      By default this option is set to 0 and is not active.  Activating it
      requires setting it to 1 and recompiling the kernel.
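
      For illustration, the kind of check such a compile-time switch guards is
      a comparison between the fast augmented-tree lookup and a brute-force
      scan; the helper names below are assumptions for the sketch:

          #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0

          #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
          static void
          find_vmap_lowest_match_check(unsigned long size, unsigned long vstart)
          {
                  struct vmap_area *va_fast, *va_slow;

                  /* Lookup used by the allocator (augmented tree)... */
                  va_fast = find_vmap_lowest_match(size, 1, vstart);
                  /* ...must agree with a linear scan of the free list. */
                  va_slow = find_vmap_lowest_linear_match(size, 1, vstart);

                  if (va_fast != va_slow)
                          pr_emerg("wrong lowest match: tree %p vs list %p\n",
                                   va_fast, va_slow);
          }
          #endif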
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
      
      
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit a6cf4e0f)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2003379b
    • mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro · db0b5445
      Uladzislau Rezki (Sony) authored
      mainline inclusion
      from mainline-5.2-rc1
      commit bb850f4d
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      This macro adds some debug code to check that the augmented tree is
      maintained correctly, meaning that every node contains a valid
      subtree_max_size value.

      By default this option is set to 0 and is not active.  Activating it
      requires setting it to 1 and recompiling the kernel.
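
      For illustration, a sketch of the sort of validation this covers: walk
      the tree and recompute every node's augmented value from scratch (the
      helper compute_subtree_max_size() exists in this series; the walker
      itself is a simplified stand-in):

          #if DEBUG_AUGMENT_PROPAGATE_CHECK
          static void augment_tree_propagate_check(struct rb_node *n)
          {
                  struct vmap_area *va;
                  unsigned long expected;

                  if (!n)
                          return;

                  va = rb_entry(n, struct vmap_area, rb_node);

                  /* Recompute the augmented value... */
                  expected = compute_subtree_max_size(va);

                  /* ...and compare with what propagation stored. */
                  if (expected != va->subtree_max_size)
                          pr_emerg("tree is corrupted: %lu vs %lu\n",
                                   expected, va->subtree_max_size);

                  augment_tree_propagate_check(n->rb_left);
                  augment_tree_propagate_check(n->rb_right);
          }
          #endif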
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
      
      
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit bb850f4d)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      db0b5445
    • mm/vmalloc.c: keep track of free blocks for vmap allocation · a93bd443
      Uladzislau Rezki (Sony) authored
      mainline inclusion
      from mainline-5.2-rc1
      commit 68ad4a33
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      Patch series "improve vmap allocation", v3.
      
      Objective
      ---------
      
      Please have a look for the description at:
      
        https://lkml.org/lkml/2018/10/19/786
      
      but let me also summarize it a bit here as well.
      
      The current implementation has O(N) complexity. Requests with different
      permissive parameters can lead to long allocation times. When I say
      "long" I mean milliseconds.
      
      Description
      -----------
      
      This approach organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range, i.e.  an allocation is done by looking up free areas
      instead of finding a hole between two busy blocks.  It allows a lower
      number of objects to represent the free space, and therefore a less
      fragmented memory allocator, because free blocks are always kept as large
      as possible.
      
      It uses an augmented tree where all free areas are sorted in ascending
      order of the va->va_start address, paired with a linked list that provides
      O(1) access to prev/next elements.

      Since the tree is augmented, we also maintain the "subtree_max_size" of
      each VA, which reflects the maximum available free block in its left or
      right sub-tree.  Knowing that, we can easily traverse toward the lowest
      (left-most path) free area.
      
      Allocation: ~O(log(N)) complexity.  It is a sequential allocation method
      and therefore tends to maximize locality.  The search is done until the
      first suitable block that is large enough to encompass the requested
      parameters is found.  Bigger areas are split.
      
      I copy-paste here the description of how an area is split, since I
      described it in https://lkml.org/lkml/2018/10/19/786
      
      <snip>
      
      A free block can be split by three different ways.  Their names are
      FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e.  they
      correspond to how requested size and alignment fit to a free block.
      
      FL_FIT_TYPE - in this case a free block is just removed from the free
      list/tree because it fully fits.  Compared with the current design there
      is extra work with rb-tree updating.

      LE_FIT_TYPE/RE_FIT_TYPE - the left/right edge fits.  In this case we just
      cut a free block.  It is as fast as the current design.  Most of the
      vmalloc allocations end up in this case, because the edge is always
      aligned to 1.

      NE_FIT_TYPE - a much less common case.  Basically it happens when the
      requested size and alignment fit neither the left nor the right edge,
      i.e.  the allocation lies between them.  In this case, during splitting,
      we have to build a remaining left free area and place it back into the
      free list/tree.

      Compared with the current design there are two extra steps.  First, we
      have to allocate a new vmap_area structure.  Second, we have to insert
      that remaining free block into the address-sorted list/tree.
      
      In order to optimize the first case there is a cache with free_vmap
      objects.  Instead of allocating from the slab we just take an object from
      the cache and reuse it.

      The second step is pretty well optimized.  Since we know a start point in
      the tree we do not search from the top.  Instead, the traversal begins
      from the rb-tree node we split.
      <snip>
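
      As a compact summary of the four cases above, the classification boils
      down to comparing the requested region's edges with the free block's
      edges; a sketch consistent with the names used in the description:

          enum fit_type {
                  NOTHING_FIT = 0,
                  FL_FIT_TYPE = 1,        /* full fit */
                  LE_FIT_TYPE = 2,        /* left edge fit */
                  RE_FIT_TYPE = 3,        /* right edge fit */
                  NE_FIT_TYPE = 4         /* no edge fit, split in the middle */
          };

          static enum fit_type
          classify_va_fit_type(struct vmap_area *va,
                          unsigned long nva_start_addr, unsigned long size)
          {
                  /* The request must lie inside this free area. */
                  if (nva_start_addr < va->va_start ||
                      nva_start_addr + size > va->va_end)
                          return NOTHING_FIT;

                  if (va->va_start == nva_start_addr) {
                          if (va->va_end == nva_start_addr + size)
                                  return FL_FIT_TYPE;     /* remove whole block */
                          return LE_FIT_TYPE;             /* cut from the left */
                  }
                  if (va->va_end == nva_start_addr + size)
                          return RE_FIT_TYPE;             /* cut from the right */

                  return NE_FIT_TYPE;                     /* build a remainder */
          }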
      
      De-allocation: ~O(log(N)) complexity.  An area is not inserted straight
      away into the tree/list; instead we identify the spot first and check
      whether it can be merged with its neighbors.  The list provides O(1)
      access to prev/next, so it is pretty fast to check.  Summarizing: if
      merged, large coalesced areas are created; if not, the area is just
      linked, making more fragments.

      There is one more thing that I should mention here.  After modification
      of a VA node, its subtree_max_size is updated if it was/is the biggest
      area in its left or right sub-tree.  Apart from that, the change can also
      be propagated back to upper levels to fix the tree.  For more details
      please have a look at the __augment_tree_propagate_from() function and
      its description.
      
      Tests and stressing
      -------------------
      
      I use the "test_vmalloc.sh" test driver available under
      "tools/testing/selftests/vm/" since 5.1-rc1 kernel.  Just trigger "sudo
      ./test_vmalloc.sh" to find out how to deal with it.
      
      Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
      Regarding last one, i do not have any physical access to NUMA system,
      therefore i emulated it.  The time of stressing is days.
      
      If you run the test driver in "stress mode", you also need the patch that
      is in Andrew's tree but not in Linux 5.1-rc1.  So, please apply it:
      
      http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c
      
      After massive testing, I have not identified any problems like memory
      leaks, crashes or kernel panics.  I find it stable, but more testing would
      be good.
      
      Performance analysis
      --------------------
      
      I have used two systems to test.  One is an i5-3320M CPU @ 2.60GHz and
      the other is a HiKey960 (arm64) board.  The i5-3320M runs a 4.20 kernel,
      whereas the HiKey960 uses a 4.15 kernel.  Both systems could run 5.1-rc1
      as well, but those results were not ready by the time I was writing this.

      Currently the driver consists of 8 tests.  Three of them correspond to
      the different types of splitting (to compare with the default); we have 3
      of those (see above).  Another 5 do allocations under different
      conditions.
      
      a) sudo ./test_vmalloc.sh performance
      
      When the test driver is run in "performance" mode, it runs all available
      tests pinned to the first online CPU with a sequential test execution
      order.  We do this in order to get stable and repeatable results.  Take a
      look at the time difference in "long_busy_list_alloc_test".  It is not
      surprising, because the worst case is O(N).
      
      How many cycles all tests took:
      CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles
      
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt
      
      How many cycles all tests took:
      CPU0=3478683207 cycles vs CPU0=463767978 cycles
      
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
      
      b) time sudo ./test_vmalloc.sh test_repeat_count=1
      
      With this configuration, all tests are run on all available online CPUs.
      Before running, each CPU shuffles its test execution order, which gives
      random allocation behaviour.  So it is a rough comparison, but it
      certainly paints the picture.
      
      <default>            vs            <patched>
      real    101m22.813s                real    0m56.805s
      user    0m0.011s                   user    0m0.015s
      sys     0m5.076s                   sys     0m0.023s
      
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt
      
      <default>            vs            <patched>
      real    unknown                    real    4m25.214s
      user    unknown                    user    0m0.011s
      sys     unknown                    sys     0m0.670s
      
      I did not manage to complete this test on the "default HiKey960" kernel
      version.  After 24 hours it was still running, therefore I had to cancel
      it.  That is why real/user/sys are "unknown".
      
      This patch (of 3):
      
      Currently an allocation of a new vmap area is done by iterating over the
      busy list (complexity O(n)) until a suitable hole is found between two
      busy areas.  Therefore each new allocation causes the list to grow.  Due
      to an over-fragmented list and different permissive parameters, an
      allocation can take a long time.  For example, on embedded devices it is
      milliseconds.
      
      This patch organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range.  It uses an augmented red-black tree that keeps blocks
      sorted by their offsets, paired with a linked list that keeps the free
      space in order of increasing addresses.

      Nodes are augmented with the size of the maximum available free block in
      their left or right sub-tree.  That allows us to decide on and traverse
      toward the block that will fit and will have the lowest start address,
      i.e.  it is sequential allocation.
      
      Allocation: to allocate a new block, a search is done over the tree until
      a suitable lowest (left-most) block is found that is large enough to
      encompass the requested size, alignment and vstart point.  If the block is
      bigger than the requested size, it is split.

      De-allocation: when a busy vmap area is freed it can either be merged or
      inserted into the tree.  The red-black tree allows efficiently finding a
      spot, whereas the linked list provides constant-time access to the
      previous and next blocks to check whether merging can be done.  When a
      de-allocated memory chunk is merged, a large coalesced area is created.
      
      Complexity: ~O(log(N))
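
      A sketch of the node layout this describes (field names as used by the
      series; unrelated fields omitted):

          struct vmap_area {
                  unsigned long va_start;
                  unsigned long va_end;

                  /*
                   * Augmented value: the largest free block size available in
                   * this node's subtree. It lets a lookup skip whole subtrees
                   * that cannot satisfy a request and walk left-most instead.
                   */
                  unsigned long subtree_max_size;

                  struct rb_node rb_node;         /* address-sorted rb-tree */
                  struct list_head list;          /* address-sorted list */

                  /* ... remaining fields unchanged ... */
          };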
      
      [urezki@gmail.com: v3]
        Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
      
      
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 68ad4a33)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a93bd443
    • config: Enable CONFIG_USERSWAP · 6f799d99
      Xiongfeng Wang authored
      
      hulk inclusion
      category: feature
      bugzilla: 47439
      CVE: NA
      
      -------------------------------------------------
      
      Enable CONFIG_USERSWAP for hulk_defconfig and openeuler_defconfig.
      
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
    • userswap: support userswap via userfaultfd · c3e6287f
      Guo Fan authored
      
      hulk inclusion
      category: feature
      bugzilla: 47439
      CVE: NA
      
      -------------------------------------------------
      
      This patch modifies userfaultfd to support userswap.  To check whether
      the pages are dirty since the last swap-in, we make them clean when we
      swap in the pages.  The userspace may swap in a large area where part of
      it was not swapped out; we need to skip those pages that were not swapped
      out.
      
      Signed-off-by: Guo Fan <guofan5@huawei.com>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      c3e6287f
    • userswap: add a new flag 'MAP_REPLACE' for mmap() · e3452806
      Guo Fan authored
      
      hulk inclusion
      category: feature
      bugzilla: 47439
      CVE: NA
      
      -------------------------------------------------
      
      To make sure no other userspace threads access the memory region we are
      swapping out, we need to unmap the memory region, map it to a new address
      and use the new address to perform the swapout.  We add a new flag
      'MAP_REPLACE' for mmap() to unmap the pages of the input parameter 'VA'
      and remap them to a new tmpVA.
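
      Purely for illustration, a hedged sketch of how a userspace swapper might
      drive such a flag; the flag value and the exact call semantics are
      placeholders and are defined by the patch itself, not by this snippet:

          #include <stddef.h>
          #include <sys/mman.h>

          /* Placeholder bit; the real value comes from the uapi mman headers. */
          #ifndef MAP_REPLACE
          #define MAP_REPLACE 0x1000000
          #endif

          /*
           * Detach the region at 'va' to a fresh private address so no other
           * thread can touch it while its pages are swapped out by userspace.
           * Returns the new tmpVA (or MAP_FAILED).
           */
          static void *userswap_detach(void *va, size_t len)
          {
                  return mmap(va, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_REPLACE, -1, 0);
          }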
      
      Signed-off-by: Guo Fan <guofan5@huawei.com>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      e3452806
    • mm, mempolicy: fix up gup usage in lookup_node · 8bc9bb27
      Michal Hocko authored
      
      mainline inclusion
      from mainline-v5.8-rc1
      commit 2d3a36a4
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      ba841078 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
      has added a special casing for 0 return value because that was a possible
      gup return value when interrupted by fatal signal.  This has been fixed by
      ae46d2aa ("mm/gup: Let __get_user_pages_locked() return -EINTR for
      fatal signal") in the mean time so ba841078 can be reverted.
      
      This patch however doesn't go all the way to revert it because the check
      for 0 is wrong and confusing here.  Firstly it is inherently unsafe to
      access the page when get_user_pages_locked returns 0 (aka no page
      returned).
      
      Fortunately this will not happen because get_user_pages_locked will not
      return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified which is not
      the case here.  Document this potential error code in gup code while we
      are at it.
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Xu <peterx@redhat.com>
      Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
       Conflicts:
      	mm/gup.c
      [wangxiongfeng: conflicts in comments ]
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      8bc9bb27
    • mm/mempolicy: Allow lookup_node() to handle fatal signal · 169cc12f
      Peter Xu authored
      
      mainline inclusion
      from mainline-v5.7-rc1
      commit ba841078
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      lookup_node() uses gup to pin the page and get node information.  It
      checks against ret>=0 assuming the page will be filled in.  However it's
      also possible that gup will return zero, for example, when the thread is
      quickly killed with a fatal signal.  Teach lookup_node() to gracefully
      return an error -EFAULT if it happens.
      
      Meanwhile, initialize "page" to NULL to avoid potential risk of
      exploiting the pointer.
      
      Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
      Reported-by: <syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Conflicts:
      	mm/mempolicy.c
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      169cc12f
    • mm/gup: Let __get_user_pages_locked() return -EINTR for fatal signal · bcfd7200
      Hillf Danton authored
      
      mainline inclusion
      from mainline-v5.7-rc1
      commit ae46d2aa
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      __get_user_pages_locked() will return 0 instead of -EINTR after commit
      4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times"), which
      added extra code to allow gup to detect fatal signals faster.
      
      Restore the original -EINTR behavior.
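
      The restored behaviour amounts to translating the fatal-signal exit in
      the retry loop back into -EINTR; a sketch:

          /*
           * The task is dying: report -EINTR as before, instead of a
           * bare 0 that callers may misread as "nothing pinned, no error".
           */
          if (fatal_signal_pending(current)) {
                  ret = -EINTR;
                  break;
          }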
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
      Reported-by: <syzbot+3be1a33f04dc782e9fd5@syzkaller.appspotmail.com>
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      bcfd7200
    • mm/gup: fix fixup_user_fault() on multiple retries · 1a397268
      Peter Xu authored
      
      mainline inclusion
      from mainline-v5.7-rc6
      commit 475f4dfc
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      This part was overlooked when reworking the gup code for multiple
      retries.

      When we get the 2nd+ retry, the TRIED flag is already set.  The current
      code will bail out on the 2nd retry because the !TRIED check fails, so
      the retry logic is skipped.  What's worse is that it will also return
      zero, which erroneously hints to the caller that the page was faulted in
      while it was not.

      The !TRIED flag check seems not to have been needed even before the
      multiple-retries change, because if we get a VM_FAULT_RETRY it must be
      the 1st retry, and we should not have TRIED set for that.

      Fix it by removing the !TRIED check; at the same time, check against
      fatal signals properly before the page fault so we can still respond to
      the user killing the process during retries.
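
      A sketch of the reworked retry path in fixup_user_fault(), following the
      description above (4.19-era locking names such as mmap_sem; not the
      literal diff):

          retry:
                  /* Respond to a fatal signal even on the 2nd and later retries. */
                  if (unlikely(fatal_signal_pending(current)))
                          return -EINTR;

                  ret = handle_mm_fault(vma, address, fault_flags);

                  if (ret & VM_FAULT_RETRY) {
                          down_read(&mm->mmap_sem);

                          /*
                           * No "!(fault_flags & FAULT_FLAG_TRIED)" guard any
                           * more: keep retrying instead of returning 0.
                           */
                          fault_flags |= FAULT_FLAG_TRIED;
                          goto retry;
                  }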
      
      Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Link: http://lkml.kernel.org/r/20200502003523.8204-1-peterx@redhat.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      1a397268
    • mm/gup: allow VM_FAULT_RETRY for multiple times · 85c63908
      Peter Xu authored
      
      mainline inclusion
      from mainline-5.6
      commit 4426e945
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      This is the gup counterpart of the change that allows VM_FAULT_RETRY to
      happen more than once.  One thing to mention is that we must check for a
      fatal signal here before retrying, because GUP can be interrupted by it;
      otherwise we could loop forever.
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      85c63908
    • mm: allow VM_FAULT_RETRY for multiple times · 9745f703
      Peter Xu authored
      mainline inclusion
      from mainline-5.6
      commit 4064b982
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      The idea comes from a discussion between Linus and Andrea [1].
      
      Before this patch we only allow a page fault to retry once.  We achieved
      this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
      handle_mm_fault() the second time.  This was majorly used to avoid
      unexpected starvation of the system by looping over forever to handle the
      page fault on a single page.  However that should hardly happen, and after
      all for each code path to return a VM_FAULT_RETRY we'll first wait for a
      condition (during which time we should possibly yield the cpu) to happen
      before VM_FAULT_RETRY is really returned.
      
      This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
      flag when we receive VM_FAULT_RETRY.  It means that the page fault handler
      now can retry the page fault for multiple times if necessary without the
      need to generate another page fault event.  Meanwhile we still keep the
      FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
      page fault is the first attempt or not.
      
      Then we'll have these combinations of fault flags (only considering
      ALLOW_RETRY flag and TRIED flag):
      
        - ALLOW_RETRY and !TRIED:  this means the page fault allows to
                                   retry, and this is the first try
      
        - ALLOW_RETRY and TRIED:   this means the page fault allows to
                                   retry, and this is not the first try
      
        - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
                                   to retry at all
      
        - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
      
      In the existing code we have multiple places that take special care of
      the first condition above by checking against (fault_flags &
      FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to detect
      the first retry of a page fault by checking both (fault_flags &
      FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED), because now
      even the 2nd try will have ALLOW_RETRY set, and then uses that helper in
      all existing special paths.  One example is in __lock_page_or_retry(): now
      we drop the mmap_sem only on the first attempt of a page fault and keep it
      in follow-up retries, so the old locking behavior is retained.
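
      The helper described above reduces to a one-line predicate; sketched
      here for reference:

          /*
           * "First attempt" now means: retry is allowed and we have not tried
           * yet. TRIED alone no longer implies that retrying must stop.
           */
          static inline bool fault_flag_allow_retry_first(unsigned int flags)
          {
                  return (flags & FAULT_FLAG_ALLOW_RETRY) &&
                         !(flags & FAULT_FLAG_TRIED);
          }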
      
      This will be a nice enhancement for current code [2] at the same time a
      supporting material for the future userfaultfd-writeprotect work, since in
      that work there will always be an explicit userfault writeprotect retry
      for protected pages, and if that cannot resolve the page fault (e.g., when
      userfaultfd-writeprotect is used in conjunction with swapped pages) then
      we'll possibly need a 3rd retry of the page fault.  It might also benefit
      other potential users who will have similar requirement like userfault
      write-protection.
      
      GUP code is not touched yet and will be covered in follow up patch.
      
      Please read the thread below for more information.
      
      [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
      [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
      
      
      
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
       Conflicts:
      	arch/arc/mm/fault.c
      	arch/arm64/mm/fault.c
      	arch/x86/mm/fault.c
      	drivers/gpu/drm/ttm/ttm_bo_vm.c
      	include/linux/mm.h
      	mm/internal.h
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      9745f703
    • sched/fair: fix kabi broken due to adding fields in rq and sched_domain_shared · 8a53999a
      Cheng Jian authored
      hulk inclusion
      category: bugfix
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      Previous patches added fields to struct rq and sched_domain_shared,
      which changed the KABI.

      We could use some helper structures to fix this KABI change, but this is
      not necessary: these structures are only used internally and drivers are
      not aware of them, so we simply avoid the helpers.
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
    • sched/fair: fix try_steal compile error · edf8c7d2
      Cheng Jian authored
      hulk inclusion
      category: bugfix
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      If we disable CONFIG_SMP, try_steal will lose its definition,
      resulting in a compile error as follows.
      
      	kernel/sched/fair.c: In function ‘pick_next_task_fair’:
      	kernel/sched/fair.c:7001:15: error: implicit declaration of function ‘try_steal’ [-Werror=implicit-function-declaration]
      		new_tasks = try_steal(rq, rf);
      			    ^~~~~~~~~
      
      We can use allnoconfig to reproduce this problem.
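
      The usual fix for this class of error is either to guard the caller or
      to provide a stub for the !CONFIG_SMP build; a sketch of the stub
      approach (signature assumed from the call site above):

          #ifndef CONFIG_SMP
          /* Stealing needs other CPUs to steal from; compile it away on UP. */
          static inline int try_steal(struct rq *rq, struct rq_flags *rf)
          {
                  return 0;
          }
          #endif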
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Bin Li <huawei.libin@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      edf8c7d2
    • config: enable CONFIG_SCHED_STEAL by default · 402b5928
      Cheng Jian authored
      hulk inclusion
      category: config
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      -------------------------------------------------
      
      Enable task stealing by default to improve CPU utilization.
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      402b5928
    • sched/fair: introduce SCHED_STEAL · fbd5102b
      Cheng Jian authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      Introduce CONFIG_SCHED_STEAL to limit the impact of task stealing.

      1) If CONFIG_SCHED_STEAL is turned off, none of the changes take effect,
      because we use some empty stub functions; this relies on compiler
      optimization.

      2) If CONFIG_SCHED_STEAL is enabled but STEAL and schedstats are
      disabled, the schedstat check introduces some overhead, but this has
      little effect on performance.  This will be our default choice.
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fbd5102b
    • disable stealing by default · da163c22
      Cheng Jian authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      Stealing tasks to improve CPU utilization can solve some performance
      problems, for example with mysql, but not all scenarios benefit; hackbench
      is one example.

      So turn it off by default.
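
      In terms of the scheduler feature table this is a one-line default flip;
      a sketch using the STEAL feature name that appears in the schedstat
      tables later in this series (substitute the name actually used by this
      tree if it differs):

          /* Steal a task from an overloaded CPU when this CPU goes idle. */
          SCHED_FEAT(STEAL, false)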
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      da163c22
    • sched/fair: Provide idle search schedstats · 1d2cc076
      Steve Sistare authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      Add schedstats to measure the effectiveness of searching for idle CPUs
      and stealing tasks.  This is a temporary patch intended for use during
      development only.  SCHEDSTAT_VERSION is bumped to 16, and the following
      fields are added to the per-CPU statistics of /proc/schedstat:
      
      field 10: # of times select_idle_sibling "easily" found an idle CPU --
                prev or target is idle.
      field 11: # of times select_idle_sibling searched and found an idle cpu.
      field 12: # of times select_idle_sibling searched and found an idle core.
      field 13: # of times select_idle_sibling failed to find anything idle.
      field 14: time in nanoseconds spent in functions that search for idle
                CPUs and search for tasks to steal.
      field 15: # of times an idle CPU steals a task from another CPU.
      field 16: # of times try_steal finds overloaded CPUs but no task is
                 migratable.
      
      Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1d2cc076
    • sched/fair: disable stealing if too many NUMA nodes · 49353d1e
      Steve Sistare authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      The STEAL feature causes regressions on hackbench on larger NUMA systems,
      so disable it on systems with more than sched_steal_node_limit nodes
      (default 2).  Note that the feature remains enabled as seen in features.h
      and /sys/kernel/debug/sched_features, but stealing is only performed if
      nodes <= sched_steal_node_limit.  This arrangement allows users to activate
      stealing on reboot by setting the kernel parameter sched_steal_node_limit
      on kernels built without CONFIG_SCHED_DEBUG.  The parameter is temporary
      and will be deleted when the regression is fixed.
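
      A sketch of the gating this describes; the parameter name follows the
      commit message, the helper name is illustrative:

          static int sched_steal_node_limit = 2; /* default per this commit */

          static inline bool steal_enabled(void)
          {
                  int nodes = num_possible_nodes();

                  /* STEAL stays visible in sched_features but becomes a no-op. */
                  return sched_feat(STEAL) && nodes <= sched_steal_node_limit;
          }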
      
      Details of the regression follow.  With the STEAL feature set, hackbench
      is slower on many-node systems:
      
      X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs
      Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
      Average of 10 runs of: hackbench <groups> processes 50000
      
                --- base --    --- new ---
      groups    time %stdev    time %stdev  %speedup
           1   3.627   15.8   3.876    7.3      -6.5
           2   4.545   24.7   5.583   16.7     -18.6
           3   5.716   25.0   7.367   14.2     -22.5
           4   6.901   32.9   7.718   14.5     -10.6
           8   8.604   38.5   9.111   16.0      -5.6
          16   7.734    6.8  11.007    8.2     -29.8
      
      Total CPU time increases.  Profiling shows that CPU time increases
      uniformly across all functions, suggesting a systemic increase in cache
      or memory latency.  This may be due to NUMA migrations, as they cause
      loss of LLC cache footprint and remote memory latencies.
      
      The domains for this system and their flags are:
      
        domain0 (SMT) : 1 core
          SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
          SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY
          SD_WAKE_AFFINE
      
        domain1 (MC) : 1 socket
          SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
          SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
          SD_WAKE_AFFINE
      
        domain2 (NUMA) : 4 sockets
          SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
          SD_SERIALIZE SD_OVERLAP SD_NUMA
          SD_WAKE_AFFINE
      
        domain3 (NUMA) : 8 sockets
          SD_LOAD_BALANCE SD_BALANCE_NEWIDLE
          SD_SERIALIZE SD_OVERLAP SD_NUMA
      
      Schedstats point to the root cause of the regression.  hackbench is run
      10 times per group and the average schedstat accumulation per-run and
      per-cpu is shown below.  Note that domain3 moves are zero because
      SD_WAKE_AFFINE is not set there.
      
      NO_STEAL
                                               --- domain2 ---   --- domain3 ---
      grp time %busy sched  idle   wake steal remote  move pull remote  move pull
       1 20.3 10.3  28710  14346  14366     0    490  3378    0   4039     0    0
       2 26.4 18.8  56721  28258  28469     0    792  7026   12   9229     0    7
       3 29.9 28.3  90191  44933  45272     0   5380  7204   19  16481     0    3
       4 30.2 35.8 121324  60409  60933     0   7012  9372   27  21438     0    5
       8 27.7 64.2 229174 111917 117272     0  11991  1837  168  44006     0   32
      16 32.6 74.0 334615 146784 188043     0   3404  1468   49  61405     0    8
      
      STEAL
                                               --- domain2 ---   --- domain3 ---
      grp time %busy sched  idle   wake steal remote  move pull remote  move pull
       1 20.6 10.2  28490  14232  14261    18      3  3525    0   4254     0    0
       2 27.9 18.8  56757  28203  28562   303   1675  7839    5   9690     0    2
       3 35.3 27.7  87337  43274  44085   698    741 12785   14  15689     0    3
       4 36.8 36.0 118630  58437  60216  1579   2973 14101   28  18732     0    7
       8 48.1 73.8 289374 133681 155600 18646  35340 10179  171  65889     0   34
      16 41.4 82.5 268925  91908 177172 47498  17206  6940  176  71776     0   20
      
      Cross-numa-node migrations are caused by load balancing pulls and
      wake_affine moves.  Pulls are small and similar for no_steal and steal.
      However, moves are significantly higher for steal, and rows above with the
      highest moves have the worst regressions for time; see for example grp=8.
      
      Moves increase for steal due to the following logic in wake_affine_idle()
      for synchronous wakeup:
      
          if (sync && cpu_rq(this_cpu)->nr_running == 1)
              return this_cpu;        // move the task
      
      The steal feature does a better job of smoothing the load between idle
      and busy CPUs, so nr_running is 1 more often, and moves are performed
      more often.  For hackbench, cross-node affine moves early in the run are
      good because they colocate wakers and wakees from the same group on the
      same node, but continued moves later in the run are bad, because the wakee
      is moved away from peers on its previous node.  Note that even no_steal
      is far from optimal; binding an instance of "hackbench 2" to each of the
      8 NUMA nodes runs much faster than running "hackbench 16" with no binding.
      
      Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node
      migrations and erases the performance difference between no_steal and
      steal.  However, overall performance is lower than with WA_IDLE behavior
      left intact, because some migrations are helpful, as explained above.
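
      A minimal sketch of what that experiment could look like in the
      topology setup code is shown below (illustrative only; the helper
      name and its placement are assumptions, not part of this patch):

          /*
           * Illustrative sketch only: strip wake-affine from NUMA domains
           * so synchronous wakeups cannot pull tasks across nodes.  The
           * helper name and placement are assumptions.
           */
          static void sd_clear_numa_wake_affine(struct sched_domain *sd)
          {
              if (sd->flags & SD_NUMA)
                  sd->flags &= ~SD_WAKE_AFFINE;
          }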
      
      I have tried many heuristics in an attempt to optimize the number of
      cross-node moves in all conditions, with limited success.  The fundamental
      problem is that the scheduler does not track which groups of tasks talk to
      each other.  Parts of several groups become entrenched on the same node,
      filling it to capacity, leaving no room for either group to pull its peers
      over, and there is neither data nor mechanism for the scheduler to evict
      one group to make room for the other.
      
      For now, disable STEAL on such systems until we can do better, or it is
      shown that hackbench is atypical and most workloads benefit from stealing.
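
      One plausible shape for that gate, evaluated once when the topology
      is known, is sketched below (the threshold and names are illustrative
      assumptions, not necessarily what this series implements):

          /*
           * Illustrative sketch of gating stealing by NUMA node count;
           * the threshold and names are assumptions.
           */
          #define STEAL_NODE_LIMIT    2   /* allow stealing on 1-2 node systems */

          static bool steal_allowed __read_mostly;

          static void update_steal_allowed(void)
          {
              steal_allowed = num_possible_nodes() <= STEAL_NODE_LIMIT;
          }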
      
      Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      49353d1e
    • Steve Sistare's avatar
      sched/fair: Steal work from an overloaded CPU when CPU goes idle · bccfa644
      Steve Sistare authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      When a CPU has no more CFS tasks to run, and idle_balance() fails to find a
      task, then attempt to steal a task from an overloaded CPU in the same LLC,
      using the cfs_overload_cpus bitmap to efficiently identify candidates.  To
      minimize search time, steal the first migratable task that is found when
      the bitmap is traversed.  For fairness, search for migratable tasks on an
      overloaded CPU in order of next to run.
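
      A rough sketch of that flow is shown below.  The cfs_overload_cpus
      bitmap is from this series (treated here as a plain cpumask for
      simplicity); the helper functions are assumptions used only to show
      the shape of the search:

          /*
           * Rough sketch of the steal path described above.  The helpers
           * pick_first_migratable() and migrate_task_to_rq() are
           * illustrative assumptions, not the actual function names.
           */
          static int try_steal(struct rq *dst_rq)
          {
              int cpu;

              /* Walk CPUs in this LLC that have 2 or more CFS tasks. */
              for_each_cpu(cpu, dst_rq->cfs_overload_cpus) {
                  struct rq *src_rq = cpu_rq(cpu);
                  struct task_struct *p;

                  /*
                   * Take the first migratable task, in next-to-run order,
                   * so the search stays cheap and reasonably fair.
                   */
                  p = pick_first_migratable(src_rq, dst_rq->cpu);
                  if (p) {
                      migrate_task_to_rq(p, dst_rq);  /* detach + attach */
                      return 1;
                  }
              }
              return 0;
          }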
      
      This simple stealing yields a higher CPU utilization than idle_balance()
      alone, because the search is cheap, so it may be called every time the CPU
      is about to go idle.  idle_balance() does more work because it searches
      widely for the busiest queue, so to limit its CPU consumption, it declines
      to search if the system is too busy.  Simple stealing does not offload the
      globally busiest queue, but it is much better than running nothing at all.
      
      Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
      by default.
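
      The switch follows the usual sched feature pattern, roughly as below
      (macro name and placement are illustrative; sched features are toggled
      at runtime through /sys/kernel/debug/sched_features):

          /* kernel/sched/features.h: declare the feature, on by default. */
          SCHED_FEAT(STEAL, true)

          /* In the go-idle path: bail out early when the feature is off. */
          if (!sched_feat(STEAL))
              return 0;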
      
      Stealing improves utilization with only a modest CPU overhead in scheduler
      code.  In the following experiment, hackbench is run with varying numbers
      of groups (40 tasks per group), and the delta in /proc/schedstat is shown
      for each run, averaged per CPU, augmented with these non-standard stats:
      
        %find - percent of time spent in the old and new functions that search
          for idle CPUs and tasks to steal, and that set the overloaded-CPUs
          bitmap.
      
        steal - number of times a task is stolen from another CPU.
      
      X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
      Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
      hackbench <grps> process 100000
      sched_wakeup_granularity_ns=15000000
      
        baseline
        grps  time  %busy  slice   sched   idle     wake %find  steal
        1    8.084  75.02   0.10  105476  46291    59183  0.31      0
        2   13.892  85.33   0.10  190225  70958   119264  0.45      0
        3   19.668  89.04   0.10  263896  87047   176850  0.49      0
        4   25.279  91.28   0.10  322171  94691   227474  0.51      0
        8   47.832  94.86   0.09  630636 144141   486322  0.56      0
      
        new
        grps  time  %busy  slice   sched   idle     wake %find  steal  %speedup
        1    5.938  96.80   0.24   31255   7190    24061  0.63   7433  36.1
        2   11.491  99.23   0.16   74097   4578    69512  0.84  19463  20.9
        3   16.987  99.66   0.15  115824   1985   113826  0.77  24707  15.8
        4   22.504  99.80   0.14  167188   2385   164786  0.75  29353  12.3
        8   44.441  99.86   0.11  389153   1616   387401  0.67  38190   7.6
      
      Elapsed time improves by 8 to 36%, and CPU busy utilization rises by
      5 to 22 percentage points, hitting 99% for 2 or more groups (80 or
      more tasks).  The cost is at most 0.4% more find time.
      
      Additional performance results follow.  A negative "speedup" is a
      regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
      is set to 15 msec.  Otherwise, preemptions increase at higher loads and
      distort the comparison between baseline and new.
      
      ------------------ 1 Socket Results ------------------
      
      X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
      Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
      Average of 10 runs of: hackbench <groups> process 100000
      
                  --- base --    --- new ---
        groups    time %stdev    time %stdev  %speedup
             1   8.008    0.1   5.905    0.2      35.6
             2  13.814    0.2  11.438    0.1      20.7
             3  19.488    0.2  16.919    0.1      15.1
             4  25.059    0.1  22.409    0.1      11.8
             8  47.478    0.1  44.221    0.1       7.3
      
      X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
      Average of 10 runs of: hackbench <groups> process 100000
      
                  --- base --    --- new ---
        groups    time %stdev    time %stdev  %speedup
             1   4.586    0.8   4.596    0.6      -0.3
             2   7.693    0.2   5.775    1.3      33.2
             3  10.442    0.3   8.288    0.3      25.9
             4  13.087    0.2  11.057    0.1      18.3
             8  24.145    0.2  22.076    0.3       9.3
            16  43.779    0.1  41.741    0.2       4.8
      
      KVM 4-cpu
      Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
      tbench, average of 11 runs
      
        clients    %speedup
              1        16.2
              2        11.7
              4         9.9
              8        12.8
             16        13.7
      
      KVM 2-cpu
      Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
      
        Benchmark                     %speedup
        specjbb2015_critical_jops          5.7
        mysql_sysb1.0.14_mutex_2          40.6
        mysql_sysb1.0.14_oltp_2            3.9
      
      ------------------ 2 Socket Results ------------------
      
      X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
      Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
      Average of 10 runs of: hackbench <groups> process 100000
      
                  --- base --    --- new ---
        groups    time %stdev    time %stdev  %speedup
             1   7.945    0.2   7.219    8.7      10.0
             2   8.444    0.4   6.689    1.5      26.2
             3  12.100    1.1   9.962    2.0      21.4
             4  15.001    0.4  13.109    1.1      14.4
             8  27.960    0.2  26.127    0.3       7.0
      
      X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
      Average of 10 runs of: hackbench <groups> process 100000
      
                  --- base --    --- new ---
        groups    time %stdev    time %stdev  %speedup
             1   5.826    5.4   5.840    5.0      -0.3
             2   5.041    5.3   6.171   23.4     -18.4
             3   6.839    2.1   6.324    3.8       8.1
             4   8.177    0.6   7.318    3.6      11.7
             8  14.429    0.7  13.966    1.3       3.3
            16  26.401    0.3  25.149    1.5       4.9
      
      X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
      Oracle database OLTP, logging disabled, NVRAM storage
      
        Customers   Users   %speedup
          1200000      40       -1.2
          2400000      80        2.7
          3600000     120        8.9
          4800000     160        4.4
          6000000     200        3.0
      
      X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
      Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
      Results from the Oracle "Performance PIT".
      
        Benchmark                                           %speedup
      
        mysql_sysb1.0.14_fileio_56_rndrd                        19.6
        mysql_sysb1.0.14_fileio_56_seqrd                        12.1
        mysql_sysb1.0.14_fileio_56_rndwr                         0.4
        mysql_sysb1.0.14_fileio_56_seqrewr                      -0.3
      
        pgsql_sysb1.0.14_fileio_56_rndrd                        19.5
        pgsql_sysb1.0.14_fileio_56_seqrd                         8.6
        pgsql_sysb1.0.14_fileio_56_rndwr                         1.0
        pgsql_sysb1.0.14_fileio_56_seqrewr                       0.5
      
        opatch_time_ASM_12.2.0.1.0_HP2M                          7.5
        select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M             5.1
        select-1_users_asmm_ASM_12.2.0.1.0_HP2M                  4.4
        swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M           5.8
      
        lm3_memlat_L2                                            4.8
        lm3_memlat_L1                                            0.0
      
        ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching     60.1
        ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent        5.2
        ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent       -3.0
        ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4
      
      X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
      Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
      
        NAS_OMP
        bench class   ncpu    %improved(Mops)
        dc    B       72      1.3
        is    C       72      0.9
        is    D       72      0.7
      
        sysbench mysql, average of 24 runs
                --- base ---     --- new ---
        nthr   events  %stdev   events  %stdev %speedup
           1    331.0    0.25    331.0    0.24     -0.1
           2    661.3    0.22    661.8    0.22      0.0
           4   1297.0    0.88   1300.5    0.82      0.2
           8   2420.8    0.04   2420.5    0.04     -0.1
          16   4826.3    0.07   4825.4    0.05     -0.1
          32   8815.3    0.27   8830.2    0.18      0.1
          64  12823.0    0.24  12823.6    0.26      0.0
      
      -------------------------------------------------------------
      
      Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bccfa644