- Apr 15, 2021
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 561fb04a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Drop various work-arounds we have for workqueues:

- We no longer need the async_list for tracking sequential IO.
- We don't have to maintain our own mm tracking/setting.
- We don't need a separate workqueue for buffered writes. This didn't even work that well to begin with, as it was suboptimal for multiple buffered writers on multiple files.
- We can properly cancel pending interruptible work. This fixes deadlocks, particularly with socket IO, where we cannot cancel the requests when the io_uring is closed. Hence the ring will wait forever for these requests to complete, which may never happen. This is different from disk IO where we know requests will complete in a finite amount of time.
- Due to being able to cancel interruptible work that is already running, we can implement file table support for work. We need that for supporting system calls that add to a process file table.
- It gets us one step closer to adding async support for any system call.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	fs/io_uring.c
[ Patch b5420237("mm: refactor readahead defines in mm.h") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 771b53d0
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are:

- We can assign memory context smarter and more persistently if we manage the lifetime of threads.
- We can drop various work-arounds we have in io_uring, like the async_list.
- We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low, which hurts performance of multiple buffered file writers.
- We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets.
- We need the above cancel for being able to assign and use file tables from a process.
- We can implement a more thorough cancel operation in general.
- We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads.

This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node.

io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	fs/Kconfig
	fs/Makefile
	include/linux/sched.h
[ Patch d7fefcc8("mm/cma: add PF flag to force non cma alloc") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Thomas Gleixner authored
mainline inclusion
from mainline-5.2-rc1
commit 6d25be57
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The worker accounting for CPU bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself.

Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. When nr_running is updated when the task is returning from schedule(), it is later compared when the same is done from ttwu().

[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Conflicts:
	kernel/workqueue_internal.h
[ Patch 1b69ac6b("psi: fix aggregation idle shut-off") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
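For reference, a hedged sketch of the hook this patch places on the schedule() path (approximating mainline's kernel/sched/core.c; the exact surrounding code may differ in this backport):

	static inline void sched_submit_work(struct task_struct *tsk)
	{
		if (!tsk->state)
			return;
		/*
		 * If a worker went to sleep, notify and ask the workqueue
		 * whether it wants to wake up a task to maintain concurrency.
		 * Preemption is disabled here so the possible wakeup of
		 * another kworker cannot re-enter schedule().
		 */
		if (tsk->flags & PF_WQ_WORKER) {
			preempt_disable();
			wq_worker_sleeping(tsk);
			preempt_enable_no_resched();
		}
	}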
-
Thomas Gleixner authored
mainline inclusion
from mainline-5.1-rc1
commit 15917dc0
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The RTMUTEX tester was removed long ago but the PF bit stayed around. Remove it and free up the space.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Conflicts:
	include/linux/sched.h
[ Patch 73ab1cb2("umh: add exit routine for UMH process") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.5-rc1
commit 95a1b3ff
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Commit fb5ccc98 ("io_uring: Fix broken links with offloading") introduced a potential performance regression by unconditionally taking mm even for READ/WRITE_FIXED operations. Bring the logic handling it back. mm-faulted requests will go through the generic submission path, thus honoring links and drains, but will fail further on the req->has_user check.

Fixes: fb5ccc98 ("io_uring: Fix broken links with offloading")
Cc: stable@vger.kernel.org # v5.4
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.5-rc1
commit fa456228
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

submit->index is used only for the inbound check in the submission path (i.e. head < ctx->sq_entries). However, it will always be true, as:
1. it's already validated by io_get_sqring()
2. ctx->sq_entries can't be changed in between, because of the held ctx->uring_lock and ctx->refs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Dmitrii Dolgov authored
mainline inclusion
from mainline-5.5-rc1
commit c826bd7a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

To trace io_uring activity one can get information from workqueue and io trace events, but it looks like some parts could be hard to identify via this approach. Making what happens inside io_uring more transparent is important to be able to reason about many aspects of it, hence introduce the set of tracing events.

All such events could be roughly divided into two categories:

* those that help to understand correctness (from both the kernel and an application point of view). E.g. a ring creation, file registration, or waiting for available CQEs. The proposed approach is to get a pointer to an original structure of interest (ring context, or request), and then find relevant events. io_uring_queue_async_work also exposes a pointer to the work_struct, to be able to track down corresponding workqueue events.

* those that provide performance-related information. Mostly it's about events that change the flow of requests, e.g. whether an async work was queued, or delayed due to some dependencies. Another important case is how io_uring optimizations (e.g. registered files) are utilized.

Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	include/Kbuild
[ Patch 43c78d88("kbuild: compile-test kernel headers to ensure they are self-contained") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 11365043
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We might have cases where the need for a specific timeout is gone, so add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
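A minimal sketch of the ABI described above (field usage per the commit message; the helper name is hypothetical, and headers from a kernel with this patch are assumed):

	#include <linux/io_uring.h>
	#include <string.h>

	/* Fill an SQE that cancels a previously submitted timeout,
	 * identified by the user_data it was submitted with. */
	static void prep_timeout_remove(struct io_uring_sqe *sqe, __u64 timeout_data)
	{
		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_TIMEOUT_REMOVE;
		sqe->addr = timeout_data;	/* which timeout to cancel */
		sqe->user_data = 0x1234;	/* tag for the cancel's own CQE */
	}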
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit a41525ab
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
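A sketch of how an application opts into an absolute timeout with this patch's IORING_TIMEOUT_ABS flag (helper name hypothetical; only the flag differs from a relative timeout):

	#include <linux/io_uring.h>
	#include <linux/time_types.h>
	#include <string.h>

	/* Arm a timeout at an absolute CLOCK_MONOTONIC time. */
	static void prep_abs_timeout(struct io_uring_sqe *sqe,
				     struct __kernel_timespec *ts)
	{
		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_TIMEOUT;
		sqe->addr = (unsigned long) ts;		/* absolute deadline */
		sqe->len = 1;				/* one timespec */
		sqe->timeout_flags = IORING_TIMEOUT_ABS;
	}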
-
Jackie Liu authored
mainline inclusion
from mainline-5.5-rc1
commit ba5290cc
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There is no functional change, just a code cleanup: use s->in_async so the code knows where it is running.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 33a107f0
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE lifetime is different from that of the IO request itself: the SQE is consumed as soon as the kernel has seen the entry.

Certain applications don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently.

If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
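A minimal user-space sketch of the new setup flag (raw syscall, error handling omitted; assumes headers that define IORING_SETUP_CQSIZE and __NR_io_uring_setup):

	#include <linux/io_uring.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Small SQ ring (IO is submitted in batches), large CQ ring
	 * (many requests inflight). cq_entries must be a power of two. */
	static int setup_big_cq_ring(void)
	{
		struct io_uring_params p;

		memset(&p, 0, sizeof(p));
		p.flags = IORING_SETUP_CQSIZE;
		p.cq_entries = 4096;

		return (int) syscall(__NR_io_uring_setup, 64 /* SQ entries */, &p);
	}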
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit c3a31e60
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Allows the application to remove/replace/add files to/from a file set. Passes in a struct:

	struct io_uring_files_update {
		__u32 offset;
		__s32 *fds;
	};

that holds an array of fds, with the size of the array passed in through the usual nr_args part of the io_uring_register() system call. The logic is as follows:

1) If ->fds[i] is -1, the existing file at i + ->offset is removed from the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is replaced with ->fds[i].

For case #2, if the existing slot is currently empty (fd == -1), the new fd is simply added to the array.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
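A user-space sketch of the call described above (struct layout as defined by this patch; the indexes and helper name are illustrative):

	#include <linux/io_uring.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Replace the file at index 3 of the registered set with new_fd,
	 * and remove the file at index 4 (fd == -1 means remove). */
	static int update_files(int ring_fd, int new_fd)
	{
		__s32 fds[2] = { new_fd, -1 };
		struct io_uring_files_update up = {
			.offset	= 3,
			.fds	= fds,
		};

		return (int) syscall(__NR_io_uring_register, ring_fd,
				     IORING_REGISTER_FILES_UPDATE, &up, 2);
	}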
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 08a45173
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

This is in preparation for allowing updates to fixed file sets without requiring a full unregister+register.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit ba816ad6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Currently any dependent link is executed from a new workqueue context, which means that we'll be doing a context switch per link in the chain. If we are running the completion of the current request from our async workqueue and find that the next request is a link, then run it directly from the workqueue context instead of forcing another switch.

This improves the performance of linked SQEs, and reduces the CPU overhead.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc6
commit 044c1ab3
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

syzkaller reported an issue where it looks like a malicious app can trigger a use-after-free by reading the ctx->sq_array and ->rings values right after having installed the ring fd in the process file table. Defer ring fd installation until after we're done reading those values.

Fixes: 75b28aff ("io_uring: allocate the two rings together")
Reported-by: <syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc6
commit 7b20238d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

io_queue_link_head() owns shadow_req after taking it as an argument. By not freeing it in case of an error, it can leak the request along with the taken ctx->refs.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc5
commit 2b2ed975
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We currently assume that submissions from the sqthread are successful, and if IO polling is enabled, we use that value for knowing how many completions to look for. But if we overflowed the CQ ring or some requests simply got errored and already completed, they won't be available for polling.

For the case of IO polling and SQTHREAD usage, look at the pending poll list. If it ever hits empty, then we know that we don't have any more pollable requests inflight. For that case, simply reset the inflight count to zero.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc5
commit 498ccd9e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We currently use the ring values directly, but that can lead to issues if the application is malicious and changes these values on our behalf. Create in-kernel cached versions of them, and just overwrite the user side when we update them. This is similar to how we treat the sq/cq ring tail/head updates.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc5
commit 935d1e45
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

io_ring_submit() finalises with:
1. io_commit_sqring(), which releases the sqes to userspace
2. a call to io_queue_link_head(), which accesses the released head's sqe

Reorder them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc5
commit fb5ccc98
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

io_sq_thread() processes sqes by 8 without considering links. As a result, links will be randomly subdivided. The easiest way to fix it is to call io_get_sqring() inside io_submit_sqes(), as io_ring_submit() does.

Downsides:
1. This removes the optimisation of not grabbing mm_struct for fixed files.
2. It submits all sqes in one go, without finer-grained scheduling with cq processing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc5
commit 84d55dc5
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There is a bug where failed linked requests are returned not with the specified @user_data, but with garbage from the kernel stack. The reason is that io_fail_links() uses req->user_data, which is uninitialised when called from io_queue_sqe() on the failure path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
zhangyi (F) authored
mainline inclusion
from mainline-5.4-rc5
commit a1f58ba4
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The sequence number of a timeout req (req->sequence) indicates the expected completion request. Because each timeout req consumes a sequence number, the sequence numbers of timeout reqs on the timeout list shouldn't be the same. But now we may get the same (also incorrect) number if we insert a new entry before the last one, for example by submitting two such timeout reqs on a new ring instance:

                    req->sequence
 req_1 (count = 2):       2
 req_2 (count = 1):       2

Then, if we submit a nop req, req_2 will still time out even after the nop req has finished. Fix this problem by adjusting the sequence number of each reordered req when inserting a new entry.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
zhangyi (F) authored
mainline inclusion
from mainline-5.4-rc5
commit ef03681a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The sequence numbers of reqs on the timeout_list before the timeout req should be adjusted in io_timeout_fn(), because the current timeout req will consume a slot in the cq_ring and the cq_tail pointer will be increased; otherwise other timeout reqs may return in advance without waiting for enough wait_nr.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc5
commit bc808bce
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There are cases where it isn't always safe to block for submission, even if the caller asked to wait for events as well. Revert the previous optimization of doing that.

This reverts two commits:
bf7ec93c
c5766668

Fixes: c5766668 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
yangerkun authored
mainline inclusion
from mainline-5.4-rc4
commit 8b07a65a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not tmp_nxt.

Fixes: 5da0fb1a ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
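A toy illustration of why the bias has to land on tmp (values contrived; the real code compares u32 sequence numbers widened to a larger type):

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int cached_sq_head = 5;		/* wrapped past 0 */
		unsigned int nxt_sq_head = UINT_MAX - 2;	/* taken before the wrap */
		long long tmp = 10;			/* sequence from cached_sq_head */
		long long tmp_nxt = UINT_MAX - 1;	/* sequence from nxt_sq_head */

		/* The fix: bias the post-wrap sequence, i.e. tmp, not tmp_nxt. */
		if (cached_sq_head < nxt_sq_head)
			tmp += UINT_MAX;

		/* Ordering is now decided correctly across the u32 wrap: prints 1. */
		printf("tmp > tmp_nxt: %d\n", tmp > tmp_nxt);
		return 0;
	}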
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc4
commit 491381ce
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We've got two issues with the non-regular file handling for non-blocking IO:

1) We don't want to re-do a short read in full for a non-regular file, as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts, we need to punt to async context even if the file is opened as non-blocking. Otherwise the caller always gets -EAGAIN.

Add two new request flags to handle these cases. One is just a cache of the inode S_ISREG() status, the other tells io_uring that we always need to punt this request to async context, even if REQ_F_NOWAIT is set.

Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
yangerkun authored
mainline inclusion
from mainline-5.4-rc4
commit 5da0fb1a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Now we recalculate the sequence of a timeout with 'req->sequence = ctx->cached_sq_head + count - 1', and judge the right place to insert it into the timeout_list by comparing the number of requests we still expect to complete. But we have not considered overflow:

1. ctx->cached_sq_head + count - 1 may overflow, and a bigger count for the new timeout req can then yield a smaller req->sequence.
2. The current cached_sq_head may have overflowed compared with that of an earlier req, which will leave the timeout req with a smaller req->sequence.

This overflow will misorder the timeout_list, which can lead to the wrong completion order of the timeout_list. Fix it by reusing req->submit.sequence to store the count, and changing the insertion-sort logic in io_timeout.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc3
commit 7adf4eaf
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We have two ways a request can be deferred:

1) It's a regular request that depends on another one
2) It's a timeout that tracks completions

We have a shared helper to determine whether to defer, and that attempts to make the right decision based on the request. But we only have some of this information in the caller. Un-share the two timeout/defer helpers so the caller can use the right one.

Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc3
commit 8a997340
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We should not remove the workqueue, we just need to ensure that the workqueues are synced. The workqueues are torn down on ctx removal.

Cc: stable@vger.kernel.org
Fixes: 6b06314c ("io_uring: add file set registration")
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc3
commit 6805b32e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Any changes interesting to tasks waiting in io_cqring_wait() are committed with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs() also tries to do that, but with no reason; that means spurious wakeups on every io_free_req() and io_uring_enter(). Just use percpu_ref_put() instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc3
commit bf7ec93c
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

io_queue_link_head() accepts a @force_nonblock flag, but io_ring_submit() passes the opposite.

Fixes: c5766668 ("io_uring: optimize submit_and_wait API")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Arnd Bergmann authored
mainline inclusion
from mainline-5.4-rc2
commit bdf20073
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

All system calls use struct __kernel_timespec instead of the old struct timespec, but this one was just added with the old-style ABI. Change it now to enforce the use of __kernel_timespec, avoiding ABI confusion and the need for compat handlers on 32-bit architectures.

Any user space caller will have to use __kernel_timespec now, but this is unambiguous and works for any C library regardless of the time_t definition. A nicer way to specify the timeout would have been a less ambiguous 64-bit nanosecond value, but I suppose it's too late now to change that as this would impact both 32-bit and 64-bit users.

Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
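For reference, the UAPI type callers must now pass (from <linux/time_types.h>; tv_sec is 64 bits wide on all architectures):

	struct __kernel_timespec {
		__kernel_time64_t	tv_sec;		/* seconds */
		long long		tv_nsec;	/* nanoseconds */
	};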
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit bda52162
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

For batched IO, it's not uncommon for waiters to ask for more than 1 IO to complete before being woken up. This is a problem with wait_event(), since tasks will get woken for every IO that completes, re-check the condition, then go back to sleep. For batch counts on the order of what you do for high IOPS, that can result in 10s of extra wakeups for the waiting task.

Add a private wake function that checks for the wake-up count criteria being met before calling autoremove_wake_function(). Pavel reports that one test case he has runs 40% faster with proper batching of wakeups.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
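A hedged sketch of the private wake function (approximating the mainline code; io_should_wake() is assumed to check whether enough CQEs have posted, or a timeout has fired):

	struct io_wait_queue {
		struct wait_queue_entry wq;
		struct io_ring_ctx *ctx;
		unsigned to_wait;
		unsigned nr_timeouts;
	};

	static int io_wake_function(struct wait_queue_entry *curr, unsigned mode,
				    int wake_flags, void *key)
	{
		struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
							  wq);

		/* Consume the wakeup only once the batch target is met;
		 * returning -1 leaves the task asleep on the waitqueue. */
		if (!io_should_wake(iowq))
			return -1;

		return autoremove_wake_function(curr, mode, wake_flags, key);
	}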
-
yangerkun authored
mainline inclusion
from mainline-5.4-rc1
commit daa5de54
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

After 75b28aff ("io_uring: allocate the two rings together"), we compare sq.head with cached_cq_tail to determine whether there are any cqes pending. Actually, we should use cq.head.

Fixes: 75b28aff ("io_uring: allocate the two rings together")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit 32960613
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Currently we just -EINVAL a read or write to an fd that isn't backed by ->read_iter() or ->write_iter(). But we can handle them just fine, as long as we punt to async context first. Implement a simple loop function for doing ->read() or ->write() instead, and ensure we call it appropriately.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
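A hedged sketch of that loop (mainline names it loop_rw_iter(); the edge-case checks at the top of the real function are trimmed here):

	static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb,
				    struct iov_iter *iter)
	{
		ssize_t ret = 0;

		/* Issue one ->read()/->write() per iovec segment until the
		 * iterator is drained, a short transfer occurs, or an error. */
		while (iov_iter_count(iter)) {
			struct iovec iovec = iov_iter_iovec(iter);
			ssize_t nr;

			if (rw == READ)
				nr = file->f_op->read(file, iovec.iov_base,
						      iovec.iov_len, &kiocb->ki_pos);
			else
				nr = file->f_op->write(file, iovec.iov_base,
						       iovec.iov_len, &kiocb->ki_pos);

			if (nr < 0) {
				if (!ret)
					ret = nr;
				break;
			}
			ret += nr;
			if (nr != iovec.iov_len)
				break;
			iov_iter_advance(iter, nr);
		}

		return ret;
	}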
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit 5262f567
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There have been a few requests for functionality similar to io_getevents() and epoll_wait(), where the user can specify a timeout for waiting on events. I deliberately did not add support for this through the system call initially to avoid overloading the args, but I can see that the use cases for this are valid.

This adds support for IORING_OP_TIMEOUT. If a user wants to get woken when waiting for events, simply submit one of these timeout commands with your wait call (or before). This ensures that the application sleeping on the CQ ring waiting for events will get woken. The timeout command is passed in as a pointer to a struct timespec. Timeouts are relative. The timeout command also includes a way to auto-cancel after N events have passed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
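A minimal sketch of arming such a timeout: a relative 2-second timeout that also auto-completes after 8 other completions (helper name hypothetical; this uses the final __kernel_timespec ABI from the later bdf20073 patch above):

	#include <linux/io_uring.h>
	#include <linux/time_types.h>
	#include <string.h>

	static void prep_timeout(struct io_uring_sqe *sqe,
				 struct __kernel_timespec *ts)
	{
		ts->tv_sec = 2;
		ts->tv_nsec = 0;

		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_TIMEOUT;
		sqe->addr = (unsigned long) ts;	/* relative timeout */
		sqe->len = 1;			/* one timespec */
		sqe->off = 8;			/* auto-cancel after 8 events */
	}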
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit 9831a90c
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

If preempt isn't enabled in the kernel, we can run into hang issues with sqthread submissions. Use cond_resched() to play nice instead of cpu_relax(), if we end up starting the loop and not having any events pending for submissions.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jackie Liu authored
mainline inclusion
from mainline-5.4-rc1
commit a1041c27
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Sometimes io_get_req will return NULL, in which case we need to do the correct error handling; otherwise it will cause a kernel null pointer exception.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit 6cc47d1d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

If we end up getting woken in poll (due to a signal), then we may need to punt the poll request to an async worker. When we do that, we look up the list to queue at, dereferencing req->submit.sqe; however, that is only set for requests we initially decided to queue async.

This fixes a crash with poll command usage and wakeups that need to punt to async context.

Fixes: 54a91f3b ("io_uring: limit parallelism of buffered writes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jackie Liu authored
mainline inclusion
from mainline-5.4-rc1
commit 5f5ad9ce
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There is a potential dangling pointer problem: we never clean shadow_req, so if there are multiple link lists in a series of sqes, shadow_req is not reallocated and the last one continues to be used. But its memory has already been released, forming a dangling pointer. Clean it up and make sure that every new link list allocates a new shadow_req.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-