- Apr 15, 2021
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 561fb04a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Drop various work-arounds we have for workqueues:

- We no longer need the async_list for tracking sequential IO.
- We don't have to maintain our own mm tracking/setting.
- We don't need a separate workqueue for buffered writes. This didn't even work that well to begin with, as it was suboptimal for multiple buffered writers on multiple files.
- We can properly cancel pending interruptible work. This fixes deadlocks, particularly with socket IO, where we cannot cancel the requests when the io_uring is closed. Hence the ring will wait forever for these requests to complete, which may never happen. This is different from disk IO where we know requests will complete in a finite amount of time.
- Due to being able to cancel interruptible work that is already running, we can implement file table support for work. We need that for supporting system calls that add to a process file table.
- It gets us one step closer to adding async support for any system call.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	fs/io_uring.c
[ Patch b5420237("mm: refactor readahead defines in mm.h") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 771b53d0
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are:

- We can assign memory context smarter and more persistently if we manage the lifetime of threads.
- We can drop various work-arounds we have in io_uring, like the async_list.
- We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low, which hurts performance of multiple buffered file writers.
- We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets.
- We need the above cancel for being able to assign and use file tables from a process.
- We can implement a more thorough cancel operation in general.
- We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads.

This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node.

io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	fs/Kconfig
	fs/Makefile
	include/linux/sched.h
[ Patch d7fefcc8("mm/cma: add PF flag to force non cma alloc") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Thomas Gleixner authored
mainline inclusion
from mainline-5.2-rc1
commit 6d25be57
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The worker accounting for CPU bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself.

Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. When nr_running is updated when the task is returning from schedule(), it is later compared when the same is done from ttwu().

[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Conflicts:
	kernel/workqueue_internal.h
[ Patch 1b69ac6b("psi: fix aggregation idle shut-off") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
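For reference, a hedged sketch of the hook this patch places on the schedule() path (approximating mainline's kernel/sched/core.c; the exact surrounding code may differ in this backport):

	static inline void sched_submit_work(struct task_struct *tsk)
	{
		if (!tsk->state)
			return;
		/*
		 * If a worker went to sleep, notify and ask the workqueue
		 * whether it wants to wake up a task to maintain concurrency.
		 * Preemption is disabled here so the possible wakeup of
		 * another kworker cannot re-enter schedule().
		 */
		if (tsk->flags & PF_WQ_WORKER) {
			preempt_disable();
			wq_worker_sleeping(tsk);
			preempt_enable_no_resched();
		}
	}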
-
Thomas Gleixner authored
mainline inclusion
from mainline-5.1-rc1
commit 15917dc0
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The RTMUTEX tester was removed long ago but the PF bit stayed around. Remove it and free up the space.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Conflicts:
	include/linux/sched.h
[ Patch 73ab1cb2("umh: add exit routine for UMH process") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.5-rc1
commit 95a1b3ff
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Commit fb5ccc98 ("io_uring: Fix broken links with offloading") introduced a potential performance regression by unconditionally taking mm even for READ/WRITE_FIXED operations. Bring the logic handling it back. mm-faulted requests will go through the generic submission path, thus honoring links and drains, but will fail further on the req->has_user check.

Fixes: fb5ccc98 ("io_uring: Fix broken links with offloading")
Cc: stable@vger.kernel.org # v5.4
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.5-rc1
commit fa456228
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

submit->index is used only for the inbound check in the submission path (i.e. head < ctx->sq_entries). However, it will always be true, as:
1. it's already validated by io_get_sqring()
2. ctx->sq_entries can't be changed in between, because of the held ctx->uring_lock and ctx->refs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Dmitrii Dolgov authored
mainline inclusion
from mainline-5.5-rc1
commit c826bd7a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

To trace io_uring activity one can get information from workqueue and io trace events, but it looks like some parts could be hard to identify via this approach. Making what happens inside io_uring more transparent is important to be able to reason about many aspects of it, hence introduce the set of tracing events.

All such events could be roughly divided into two categories:

* those that help to understand correctness (from both the kernel and an application point of view). E.g. a ring creation, file registration, or waiting for available CQEs. The proposed approach is to get a pointer to an original structure of interest (ring context, or request), and then find relevant events. io_uring_queue_async_work also exposes a pointer to the work_struct, to be able to track down corresponding workqueue events.

* those that provide performance-related information. Mostly it's about events that change the flow of requests, e.g. whether an async work was queued, or delayed due to some dependencies. Another important case is how io_uring optimizations (e.g. registered files) are utilized.

Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	include/Kbuild
[ Patch 43c78d88("kbuild: compile-test kernel headers to ensure they are self-contained") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 11365043
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We might have cases where the need for a specific timeout is gone, so add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
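A minimal sketch of the ABI described above (field usage per the commit message; the helper name is hypothetical, and headers from a kernel with this patch are assumed):

	#include <linux/io_uring.h>
	#include <string.h>

	/* Fill an SQE that cancels a previously submitted timeout,
	 * identified by the user_data it was submitted with. */
	static void prep_timeout_remove(struct io_uring_sqe *sqe, __u64 timeout_data)
	{
		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_TIMEOUT_REMOVE;
		sqe->addr = timeout_data;	/* which timeout to cancel */
		sqe->user_data = 0x1234;	/* tag for the cancel's own CQE */
	}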
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit a41525ab
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
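A sketch of how an application opts into an absolute timeout with this patch's IORING_TIMEOUT_ABS flag (helper name hypothetical; only the flag differs from a relative timeout):

	#include <linux/io_uring.h>
	#include <linux/time_types.h>
	#include <string.h>

	/* Arm a timeout at an absolute CLOCK_MONOTONIC time. */
	static void prep_abs_timeout(struct io_uring_sqe *sqe,
				     struct __kernel_timespec *ts)
	{
		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_TIMEOUT;
		sqe->addr = (unsigned long) ts;		/* absolute deadline */
		sqe->len = 1;				/* one timespec */
		sqe->timeout_flags = IORING_TIMEOUT_ABS;
	}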
-
Jackie Liu authored
mainline inclusion
from mainline-5.5-rc1
commit ba5290cc
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There is no functional change, just a code cleanup: use s->in_async so the code knows where it is running.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 33a107f0
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE lifetime is different from that of the IO request itself: the SQE is consumed as soon as the kernel has seen the entry.

Certain applications don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently.

If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
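A minimal user-space sketch of the new setup flag (raw syscall, error handling omitted; assumes headers that define IORING_SETUP_CQSIZE and __NR_io_uring_setup):

	#include <linux/io_uring.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Small SQ ring (IO is submitted in batches), large CQ ring
	 * (many requests inflight). cq_entries must be a power of two. */
	static int setup_big_cq_ring(void)
	{
		struct io_uring_params p;

		memset(&p, 0, sizeof(p));
		p.flags = IORING_SETUP_CQSIZE;
		p.cq_entries = 4096;

		return (int) syscall(__NR_io_uring_setup, 64 /* SQ entries */, &p);
	}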
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit c3a31e60
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Allows the application to remove/replace/add files to/from a file set. Passes in a struct:

	struct io_uring_files_update {
		__u32 offset;
		__s32 *fds;
	};

that holds an array of fds, with the size of the array passed in through the usual nr_args part of the io_uring_register() system call. The logic is as follows:

1) If ->fds[i] is -1, the existing file at i + ->offset is removed from the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is replaced with ->fds[i].

For case #2, if the existing slot is currently empty (fd == -1), the new fd is simply added to the array.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
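A user-space sketch of the call described above (struct layout as defined by this patch; the indexes and helper name are illustrative):

	#include <linux/io_uring.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Replace the file at index 3 of the registered set with new_fd,
	 * and remove the file at index 4 (fd == -1 means remove). */
	static int update_files(int ring_fd, int new_fd)
	{
		__s32 fds[2] = { new_fd, -1 };
		struct io_uring_files_update up = {
			.offset	= 3,
			.fds	= fds,
		};

		return (int) syscall(__NR_io_uring_register, ring_fd,
				     IORING_REGISTER_FILES_UPDATE, &up, 2);
	}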
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit 08a45173
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

This is in preparation for allowing updates to fixed file sets without requiring a full unregister+register.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.5-rc1
commit ba816ad6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Currently any dependent link is executed from a new workqueue context, which means that we'll be doing a context switch per link in the chain. If we are running the completion of the current request from our async workqueue and find that the next request is a link, then run it directly from the workqueue context instead of forcing another switch.

This improves the performance of linked SQEs, and reduces the CPU overhead.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc6
commit 044c1ab3
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

syzkaller reported an issue where it looks like a malicious app can trigger a use-after-free by reading the ctx->sq_array and ->rings values right after having installed the ring fd in the process file table. Defer ring fd installation until after we're done reading those values.

Fixes: 75b28aff ("io_uring: allocate the two rings together")
Reported-by: <syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc6
commit 7b20238d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

io_queue_link_head() owns shadow_req after taking it as an argument. By not freeing it in case of an error, it can leak the request along with the taken ctx->refs.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc5
commit 2b2ed975
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We currently assume that submissions from the sqthread are successful, and if IO polling is enabled, we use that value for knowing how many completions to look for. But if we overflowed the CQ ring or some requests simply got errored and already completed, they won't be available for polling.

For the case of IO polling and SQTHREAD usage, look at the pending poll list. If it ever hits empty, then we know that we don't have any more pollable requests inflight. For that case, simply reset the inflight count to zero.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc5
commit 498ccd9e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We currently use the ring values directly, but that can lead to issues if the application is malicious and changes these values on our behalf. Create in-kernel cached versions of them, and just overwrite the user side when we update them. This is similar to how we treat the sq/cq ring tail/head updates.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc5
commit 935d1e45
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

io_ring_submit() finalises with:
1. io_commit_sqring(), which releases the sqes to userspace
2. a call to io_queue_link_head(), which accesses the released head's sqe

Reorder them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc5
commit fb5ccc98
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

io_sq_thread() processes sqes by 8 without considering links. As a result, links will be randomly subdivided. The easiest way to fix it is to call io_get_sqring() inside io_submit_sqes(), as io_ring_submit() does.

Downsides:
1. This removes the optimisation of not grabbing mm_struct for fixed files.
2. It submits all sqes in one go, without finer-grained scheduling with cq processing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc5
commit 84d55dc5
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There is a bug where failed linked requests are returned not with the specified @user_data, but with garbage from the kernel stack. The reason is that io_fail_links() uses req->user_data, which is uninitialised when called from io_queue_sqe() on the failure path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
zhangyi (F) authored
mainline inclusion
from mainline-5.4-rc5
commit a1f58ba4
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The sequence number of a timeout req (req->sequence) indicates the expected completion request. Because each timeout req consumes a sequence number, the sequence numbers of timeout reqs on the timeout list shouldn't be the same. But now we may get the same (also incorrect) number if we insert a new entry before the last one, for example by submitting two such timeout reqs on a new ring instance:

                    req->sequence
 req_1 (count = 2):       2
 req_2 (count = 1):       2

Then, if we submit a nop req, req_2 will still time out even after the nop req has finished. Fix this problem by adjusting the sequence number of each reordered req when inserting a new entry.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
zhangyi (F) authored
mainline inclusion
from mainline-5.4-rc5
commit ef03681a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

The sequence numbers of reqs on the timeout_list before the timeout req should be adjusted in io_timeout_fn(), because the current timeout req will consume a slot in the cq_ring and the cq_tail pointer will be increased; otherwise other timeout reqs may return in advance without waiting for enough wait_nr.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc5
commit bc808bce
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There are cases where it isn't always safe to block for submission, even if the caller asked to wait for events as well. Revert the previous optimization of doing that.

This reverts two commits:
bf7ec93c
c5766668

Fixes: c5766668 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
yangerkun authored
mainline inclusion
from mainline-5.4-rc4
commit 8b07a65a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not tmp_nxt.

Fixes: 5da0fb1a ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
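A toy illustration of why the bias has to land on tmp (values contrived; the real code compares u32 sequence numbers widened to a larger type):

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int cached_sq_head = 5;		/* wrapped past 0 */
		unsigned int nxt_sq_head = UINT_MAX - 2;	/* taken before the wrap */
		long long tmp = 10;			/* sequence from cached_sq_head */
		long long tmp_nxt = UINT_MAX - 1;	/* sequence from nxt_sq_head */

		/* The fix: bias the post-wrap sequence, i.e. tmp, not tmp_nxt. */
		if (cached_sq_head < nxt_sq_head)
			tmp += UINT_MAX;

		/* Ordering is now decided correctly across the u32 wrap: prints 1. */
		printf("tmp > tmp_nxt: %d\n", tmp > tmp_nxt);
		return 0;
	}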
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc4
commit 491381ce
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We've got two issues with the non-regular file handling for non-blocking IO:

1) We don't want to re-do a short read in full for a non-regular file, as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts, we need to punt to async context even if the file is opened as non-blocking. Otherwise the caller always gets -EAGAIN.

Add two new request flags to handle these cases. One is just a cache of the inode S_ISREG() status, the other tells io_uring that we always need to punt this request to async context, even if REQ_F_NOWAIT is set.

Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
yangerkun authored
mainline inclusion
from mainline-5.4-rc4
commit 5da0fb1a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Now we recalculate the sequence of a timeout with 'req->sequence = ctx->cached_sq_head + count - 1', and judge the right place to insert it into the timeout_list by comparing the number of requests we still expect to complete. But we have not considered overflow:

1. ctx->cached_sq_head + count - 1 may overflow, and a bigger count for the new timeout req can then yield a smaller req->sequence.
2. The current cached_sq_head may have overflowed compared with that of an earlier req, which will leave the timeout req with a smaller req->sequence.

This overflow will misorder the timeout_list, which can lead to the wrong completion order of the timeout_list. Fix it by reusing req->submit.sequence to store the count, and changing the insertion-sort logic in io_timeout.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc3
commit 7adf4eaf
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We have two ways a request can be deferred:

1) It's a regular request that depends on another one
2) It's a timeout that tracks completions

We have a shared helper to determine whether to defer, and that attempts to make the right decision based on the request. But we only have some of this information in the caller. Un-share the two timeout/defer helpers so the caller can use the right one.

Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc3
commit 8a997340
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

We should not remove the workqueue, we just need to ensure that the workqueues are synced. The workqueues are torn down on ctx removal.

Cc: stable@vger.kernel.org
Fixes: 6b06314c ("io_uring: add file set registration")
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc3
commit 6805b32e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Any changes interesting to tasks waiting in io_cqring_wait() are committed with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs() also tries to do that, but with no reason; that means spurious wakeups on every io_free_req() and io_uring_enter(). Just use percpu_ref_put() instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Begunkov authored
mainline inclusion
from mainline-5.4-rc3
commit bf7ec93c
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

io_queue_link_head() accepts a @force_nonblock flag, but io_ring_submit() passes the opposite.

Fixes: c5766668 ("io_uring: optimize submit_and_wait API")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Arnd Bergmann authored
mainline inclusion
from mainline-5.4-rc2
commit bdf20073
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

All system calls use struct __kernel_timespec instead of the old struct timespec, but this one was just added with the old-style ABI. Change it now to enforce the use of __kernel_timespec, avoiding ABI confusion and the need for compat handlers on 32-bit architectures.

Any user space caller will have to use __kernel_timespec now, but this is unambiguous and works for any C library regardless of the time_t definition. A nicer way to specify the timeout would have been a less ambiguous 64-bit nanosecond value, but I suppose it's too late now to change that as this would impact both 32-bit and 64-bit users.

Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
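For reference, the UAPI type callers must now pass (from <linux/time_types.h>; tv_sec is 64 bits wide on all architectures):

	struct __kernel_timespec {
		__kernel_time64_t	tv_sec;		/* seconds */
		long long		tv_nsec;	/* nanoseconds */
	};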
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit bda52162
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

For batched IO, it's not uncommon for waiters to ask for more than 1 IO to complete before being woken up. This is a problem with wait_event(), since tasks will get woken for every IO that completes, re-check the condition, then go back to sleep. For batch counts on the order of what you do for high IOPS, that can result in 10s of extra wakeups for the waiting task.

Add a private wake function that checks for the wake-up count criteria being met before calling autoremove_wake_function(). Pavel reports that one test case he has runs 40% faster with proper batching of wakeups.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
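A hedged sketch of the private wake function (approximating the mainline code; io_should_wake() is assumed to check whether enough CQEs have posted, or a timeout has fired):

	struct io_wait_queue {
		struct wait_queue_entry wq;
		struct io_ring_ctx *ctx;
		unsigned to_wait;
		unsigned nr_timeouts;
	};

	static int io_wake_function(struct wait_queue_entry *curr, unsigned mode,
				    int wake_flags, void *key)
	{
		struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
							  wq);

		/* Consume the wakeup only once the batch target is met;
		 * returning -1 leaves the task asleep on the waitqueue. */
		if (!io_should_wake(iowq))
			return -1;

		return autoremove_wake_function(curr, mode, wake_flags, key);
	}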
-
yangerkun authored
mainline inclusion
from mainline-5.4-rc1
commit daa5de54
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

After 75b28aff ("io_uring: allocate the two rings together"), we compare sq.head with cached_cq_tail to determine whether there are any cqes pending. Actually, we should use cq.head.

Fixes: 75b28aff ("io_uring: allocate the two rings together")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit 32960613
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Currently we just -EINVAL a read or write to an fd that isn't backed by ->read_iter() or ->write_iter(). But we can handle them just fine, as long as we punt to async context first. Implement a simple loop function for doing ->read() or ->write() instead, and ensure we call it appropriately.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
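A hedged sketch of that loop (mainline names it loop_rw_iter(); the edge-case checks at the top of the real function are trimmed here):

	static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb,
				    struct iov_iter *iter)
	{
		ssize_t ret = 0;

		/* Issue one ->read()/->write() per iovec segment until the
		 * iterator is drained, a short transfer occurs, or an error. */
		while (iov_iter_count(iter)) {
			struct iovec iovec = iov_iter_iovec(iter);
			ssize_t nr;

			if (rw == READ)
				nr = file->f_op->read(file, iovec.iov_base,
						      iovec.iov_len, &kiocb->ki_pos);
			else
				nr = file->f_op->write(file, iovec.iov_base,
						       iovec.iov_len, &kiocb->ki_pos);

			if (nr < 0) {
				if (!ret)
					ret = nr;
				break;
			}
			ret += nr;
			if (nr != iovec.iov_len)
				break;
			iov_iter_advance(iter, nr);
		}

		return ret;
	}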
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit 5262f567
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There have been a few requests for functionality similar to io_getevents() and epoll_wait(), where the user can specify a timeout for waiting on events. I deliberately did not add support for this through the system call initially to avoid overloading the args, but I can see that the use cases for this are valid.

This adds support for IORING_OP_TIMEOUT. If a user wants to get woken when waiting for events, simply submit one of these timeout commands with your wait call (or before). This ensures that the application sleeping on the CQ ring waiting for events will get woken. The timeout command is passed in as a pointer to a struct timespec. Timeouts are relative. The timeout command also includes a way to auto-cancel after N events have passed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
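A minimal sketch of arming such a timeout: a relative 2-second timeout that also auto-completes after 8 other completions (helper name hypothetical; this uses the final __kernel_timespec ABI from the later bdf20073 patch above):

	#include <linux/io_uring.h>
	#include <linux/time_types.h>
	#include <string.h>

	static void prep_timeout(struct io_uring_sqe *sqe,
				 struct __kernel_timespec *ts)
	{
		ts->tv_sec = 2;
		ts->tv_nsec = 0;

		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_TIMEOUT;
		sqe->addr = (unsigned long) ts;	/* relative timeout */
		sqe->len = 1;			/* one timespec */
		sqe->off = 8;			/* auto-cancel after 8 events */
	}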
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit 9831a90c
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

If preempt isn't enabled in the kernel, we can run into hang issues with sqthread submissions. Use cond_resched() to play nice instead of cpu_relax(), if we end up starting the loop and not having any events pending for submissions.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jackie Liu authored
mainline inclusion
from mainline-5.4-rc1
commit a1041c27
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

Sometimes io_get_req will return NULL, in which case we need to do the correct error handling; otherwise it will cause a kernel null pointer exception.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jens Axboe authored
mainline inclusion
from mainline-5.4-rc1
commit 6cc47d1d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

If we end up getting woken in poll (due to a signal), then we may need to punt the poll request to an async worker. When we do that, we look up the list to queue at, dereferencing req->submit.sqe; however, that is only set for requests we initially decided to queue async.

This fixes a crash with poll command usage and wakeups that need to punt to async context.

Fixes: 54a91f3b ("io_uring: limit parallelism of buffered writes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Jackie Liu authored
mainline inclusion
from mainline-5.4-rc1
commit 5f5ad9ce
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------

There is a potential dangling pointer problem: we never clean shadow_req, so if there are multiple link lists in a series of sqes, shadow_req is not reallocated and the last one continues to be used. But its memory has already been released, forming a dangling pointer. Clean it up and make sure that every new link list allocates a new shadow_req.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-