- Jul 19, 2021
-
-
Cheng Jian authored
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I3OX0C CVE: NA -------------------------------- We must ensure that the following four instructions are cache-aligned; otherwise, the performance of the libMicro pread benchmark suffers. 1: # uao_user_alternative 9f, str, sttr, xzr, x0, 8 str xzr, [x0], #8 nop subs x1, x1, #8 b.pl 1b with this patch: prc thr usecs/call samples errors cnt/samp size pread_z100 1 1 5.88400 807 0 1 102400 The pread result can range from 5 to 9 usecs/call depending on how this function happens to be aligned. Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Jason Yan authored
mainline inclusion from mainline-v5.14-rc1 commit 1ee2753422349723d27009f2f973d03289d430ab category: bugfix bugzilla: NA CVE: NA ----------------------------------------------- When a SCSI device is offline, a MODE SENSE command will return a result with only DID_NO_CONNECT set. In sd_read_write_protect_flag() only the status byte of the result is checked. Despite a returned status of DID_NO_CONNECT the command is considered successful and we read sdkp->write_prot from a buffer containing garbage. Modify scsi_status_is_good() to treat DID_NO_CONNECT as a failure case. Link: https://lore.kernel.org/r/20210330114727.234467-1-yanaijie@huawei.com Signed-off-by:
Jason Yan <yanaijie@huawei.com> conflicts: include/scsi/scsi.h Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Jason Yan <yanaijie@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
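A minimal standalone sketch of the idea, assuming the classic SCSI result layout where the host byte sits in bits 16-23 (the status-byte handling is simplified; this is not the mainline scsi_status_is_good()):

    #include <stdbool.h>

    #define DID_NO_CONNECT      0x01   /* host could not reach the device */
    #define SAM_STAT_GOOD       0x00

    #define host_byte(result)   (((result) >> 16) & 0xff)
    #define status_byte(result) ((result) & 0xff)   /* simplified */

    /* A result carrying DID_NO_CONNECT must never count as success,
     * no matter what the status byte happens to contain. */
    static bool status_is_good(int result)
    {
        if (host_byte(result) == DID_NO_CONNECT)
            return false;
        return status_byte(result) == SAM_STAT_GOOD;
    }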
-
Ye Bin authored
hulk inclusion category: bugfix bugzilla: NA CVE: NA ----------------------------------------------- This reverts commit e9e58048. Signed-off-by:
Ye Bin <yebin10@huawei.com> Reviewed-by:
Jason Yan <yanaijie@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Ye Bin authored
mainline inclusion from mainline-v5.14-rc1 commit 558d6450c7755aa005d89021204b6cdcae5e848f category: bugfix bugzilla: 109280 CVE: NA ----------------------------------------------- If a writeback of the superblock fails with an I/O error, the buffer is marked not uptodate. However, this can cause a WARN_ON to trigger when we attempt to write the superblock a second time. (Which might succeed this time, for certain types of block devices such as iSCSI devices over a flaky network.) Try to detect this case in flush_stashed_error_work(), and also change __ext4_handle_dirty_metadata() so we always set the uptodate flag, not just in the nojournal case. Before this commit, this problem can be replicated via: 1. dmsetup create dust1 --table '0 2097152 dust /dev/sdc 0 4096' 2. mount /dev/mapper/dust1 /home/test 3. dmsetup message dust1 0 addbadblock 0 10 4. cd /home/test 5. echo "XXXXXXX" > t After a few seconds, we get the following warning: [ 80.654487] end_buffer_async_write: bh=0xffff88842f18bdd0 [ 80.656134] Buffer I/O error on dev dm-0, logical block 0, lost async page write [ 85.774450] EXT4-fs error (device dm-0): ext4_check_bdev_write_error:193: comm kworker/u16:8: Error while async write back metadata [ 91.415513] mark_buffer_dirty: bh=0xffff88842f18bdd0 [ 91.417038] ------------[ cut here ]------------ [ 91.418450] WARNING: CPU: 1 PID: 1944 at fs/buffer.c:1092 mark_buffer_dirty.cold+0x1c/0x5e [ 91.440322] Call Trace: [ 91.440652] __jbd2_journal_temp_unlink_buffer+0x135/0x220 [ 91.441354] __jbd2_journal_unfile_buffer+0x24/0x90 [ 91.441981] __jbd2_journal_refile_buffer+0x134/0x1d0 [ 91.442628] jbd2_journal_commit_transaction+0x249a/0x3240 [ 91.443336] ? put_prev_entity+0x2a/0x200 [ 91.443856] ? kjournald2+0x12e/0x510 [ 91.444324] kjournald2+0x12e/0x510 [ 91.444773] ? woken_wake_function+0x30/0x30 [ 91.445326] kthread+0x150/0x1b0 [ 91.445739] ? commit_timeout+0x20/0x20 [ 91.446258] ? kthread_flush_worker+0xb0/0xb0 [ 91.446818] ret_from_fork+0x1f/0x30 [ 91.447293] ---[ end trace 66f0b6bf3d1abade ]--- Signed-off-by:
Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20210615090537.3423231-1-yebin10@huawei.com Signed-off-by:
Theodore Ts'o <tytso@mit.edu> Signed-off-by:
Ye Bin <yebin10@huawei.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
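A kernel-style sketch of the buffer-head pattern behind this fix (simplified; not the exact patch): after a failed write the buffer is left !uptodate, and redirtying it would trip the WARN_ON in fs/buffer.c, so the uptodate bit is restored first since the in-memory copy is still authoritative.

    if (buffer_write_io_error(sbh)) {
            /* The previous write failed; clear the error and re-mark the
             * buffer uptodate so mark_buffer_dirty() does not warn. */
            clear_buffer_write_io_error(sbh);
            set_buffer_uptodate(sbh);
    }
    mark_buffer_dirty(sbh);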
-
zhenpengzheng authored
hulk inclusion category: feature bugzilla: 50777 CVE: NA ------------------------------------------------------------------------- Ensure the netswift 10G NIC driver ko can be distributed in ISO on X86. Signed-off-by:
zhenpengzheng <zhenpengzheng@net-swift.com> Signed-off-by:
Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by:
Jian Cheng <cj.chengjian@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Coly Li authored
mainline inclusion from mainline-v5.7-rc1 commit 8e710227 category: performance bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=327 CVE: NA When registering a cache device, bch_btree_check() is called to check all btree nodes, to make sure the btree is consistent and not corrupted. bch_btree_check() is recursively executed in a single thread; when there is a lot of data cached and the btree is huge, it may take a very long time to check all the btree nodes. In my testing, I observed it took around 50 minutes to finish bch_btree_check(). When checking the bcache btree nodes, the cache set is not running yet and the whole tree is in a read-only state, so it is safe to create multiple threads to check the btree in parallel. This patch creates multiple threads, and each thread checks, one by one, the sub-trees indexed by the keys of the btree root node. The parallel thread number depends on how many keys are in the btree root node. At most BCH_BTR_CHKTHREAD_MAX (64) threads can be created, but in practice it should be min(cpu-number/2, root-node-keys-number). Signed-off-by:
Coly Li <colyli@suse.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
qinghaixiang <xuweiqhx@163.com> Signed-off-by:
Xu Wei <xuwei56@huawei.com> Acked-by:
Xie XiuQi <xiexiuqi@huawei.com> Reviewed-by:
Li Ruilin <liruilin4@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
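A standalone sketch of the stated thread-count policy, min(cpu-number/2, root-node-keys-number) capped at BCH_BTR_CHKTHREAD_MAX; the helper name is hypothetical.

    #include <unistd.h>

    #define BCH_BTR_CHKTHREAD_MAX 64

    static int btree_check_threads(int root_node_keys)
    {
        int n = (int)(sysconf(_SC_NPROCESSORS_ONLN) / 2);

        if (n > root_node_keys)
            n = root_node_keys;        /* at most one sub-tree per thread */
        if (n > BCH_BTR_CHKTHREAD_MAX)
            n = BCH_BTR_CHKTHREAD_MAX;
        return n > 0 ? n : 1;          /* always at least one checker */
    }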
-
Xu Wei authored
euleros inclusion category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=327 CVE: NA When the kernel config does not enable CONFIG_BCACHE, compiling the bcache module fails. This patch adds a check on the CONFIG_BCACHE macro to make sure the bcache module compiles successfully. Signed-off-by:
qinghaixiang <xuweiqhx@163.com> Signed-off-by:
Xu Wei <xuwei56@huawei.com> Acked-by:
Xie XiuQi <xiexiuqi@huawei.com> Reviewed-by:
Li Ruilin <liruilin4@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
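A minimal sketch of such a guard, using the kernel's IS_ENABLED() helper (where exactly the euleros patch places the check is not shown here):

    #include <linux/kconfig.h>   /* IS_ENABLED() */

    /* IS_ENABLED(CONFIG_BCACHE) is true for both =y and =m, so code that
     * references bcache symbols can be compiled out cleanly otherwise. */
    #if IS_ENABLED(CONFIG_BCACHE)
    #define HAVE_BCACHE 1
    #else
    #define HAVE_BCACHE 0
    #endif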
-
Xu Wei authored
euleros inclusion category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=327 CVE: NA When gc runs, bcache moves all data in a bucket, both clean and dirty. This causes heavy write amplification, which may shorten the cache device's life. This patch provides a switch for gc to move only dirty data, which reduces write amplification. Signed-off-by:
qinghaixiang <xuweiqhx@163.com> Signed-off-by:
Xu Wei <xuwei56@huawei.com> Acked-by:
Xie XiuQi <xiexiuqi@huawei.com> Reviewed-by:
Li Ruilin <liruilin4@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
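A standalone sketch of the switch's effect on the gc mover (names hypothetical): clean data can always be re-read from the backing device, so skipping it trades a colder cache for fewer bucket rewrites.

    #include <stdbool.h>

    static bool gc_move_only_dirty;     /* the proposed tunable */

    static bool gc_should_move(bool key_is_dirty)
    {
        /* With the switch off, behave as before and move everything. */
        return key_is_dirty || !gc_move_only_dirty;
    }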
-
Xu Wei authored
euleros inclusion category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=327 CVE: NA When available cache space is low, bcache switches to writethrough mode, so all write IO is sent directly to the backend device, which is usually an HDD. At the same time, the cache device flushes dirty data to the backend device in the bcache writeback process. Write IO from the user therefore damages the sequentiality of writeback, and if there is a lot of writeback IO, the user's write IO may be blocked. This patch adds a traffic policy to bcache to solve the problem and improve bcache performance when available cache space is low. Signed-off-by:
qinghaixiang <xuweiqhx@163.com> Signed-off-by:
Xu Wei <xuwei56@huawei.com> Acked-by:
Xie XiuQi <xiexiuqi@huawei.com> Reviewed-by:
Li Ruilin <liruilin4@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Liu Jian authored
mainline inclusion from mainline-net-next commit 23d2b94043ca8835bd1e67749020e839f396a1c2 category: bugfix bugzilla: NA CVE: NA -------------------------------- I got below panic when doing fuzz test: Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 4056 Comm: syz-executor.3 Tainted: G B 5.14.0-rc1-00195-gcff5c4254439-dirty #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack_lvl+0x7a/0x9b panic+0x2cd/0x5af end_report.cold+0x5a/0x5a kasan_report+0xec/0x110 ip_check_mc_rcu+0x556/0x5d0 __mkroute_output+0x895/0x1740 ip_route_output_key_hash_rcu+0x2d0/0x1050 ip_route_output_key_hash+0x182/0x2e0 ip_route_output_flow+0x28/0x130 udp_sendmsg+0x165d/0x2280 udpv6_sendmsg+0x121e/0x24f0 inet6_sendmsg+0xf7/0x140 sock_sendmsg+0xe9/0x180 ____sys_sendmsg+0x2b8/0x7a0 ___sys_sendmsg+0xf0/0x160 __sys_sendmmsg+0x17e/0x3c0 __x64_sys_sendmmsg+0x9e/0x100 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x462eb9 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f3df5af1c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462eb9 RDX: 0000000000000312 RSI: 0000000020001700 RDI: 0000000000000007 RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f3df5af26bc R13: 00000000004c372d R14: 0000000000700b10 R15: 00000000ffffffff It is one use-after-free in ip_check_mc_rcu. In ip_mc_del_src, the ip_sf_list of pmc has been freed under pmc->lock protection. But access to ip_sf_list in ip_check_mc_rcu is not protected by the lock. Signed-off-by:
Liu Jian <liujian56@huawei.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Reviewed-by:
Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
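A generic standalone sketch of the fix pattern, with a pthread mutex standing in for pmc->lock: a list that is freed under a lock must also be traversed under that lock (or an RCU grace period must separate the free from its readers).

    #include <pthread.h>
    #include <stdlib.h>

    struct src { struct src *next; };

    struct pmc {
        pthread_mutex_t lock;
        struct src *sf_list;
    };

    static void del_src(struct pmc *p)          /* like ip_mc_del_src() */
    {
        pthread_mutex_lock(&p->lock);
        while (p->sf_list) {
            struct src *next = p->sf_list->next;
            free(p->sf_list);
            p->sf_list = next;
        }
        pthread_mutex_unlock(&p->lock);
    }

    static int count_src(struct pmc *p)         /* like ip_check_mc_rcu() */
    {
        int n = 0;
        pthread_mutex_lock(&p->lock);           /* the missing protection */
        for (struct src *s = p->sf_list; s; s = s->next)
            n++;
        pthread_mutex_unlock(&p->lock);
        return n;
    }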
-
- Jul 18, 2021
-
-
Lu Jialin authored
hulk inclusion category: bugfix bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I CVE: NA -------- Signed-off-by:
Lu Jialin <lujialin4@huawei.com> Reviewed-by:
Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
- Jul 16, 2021
-
-
GONG, Ruiqi authored
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZURN CVE: NA -------- The kernel build fails with CONFIG_HALTPOLL_CPUIDLE=m because haltpoll_switch_governor() is not marked as an exported symbol. Fix this by adding the missing EXPORT_SYMBOL statement. Fixes: 97c22788 ("cpuidle: fix container_of err in cpuidle_device and cpuidle_driver") Signed-off-by:
GONG, Ruiqi <gongruiqi1@huawei.com> Cc: Jiajun Chen <chenjiajun8@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Reviewed-by:
Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
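A kernel-style sketch of the shape of the fix; the real signature of haltpoll_switch_governor() may differ.

    /* Without an EXPORT_SYMBOL entry the symbol is invisible to module
     * builds, so CONFIG_HALTPOLL_CPUIDLE=m fails at link time. */
    void haltpoll_switch_governor(struct cpuidle_driver *drv)
    {
            /* ... switch the active cpuidle governor ... */
    }
    EXPORT_SYMBOL(haltpoll_switch_governor);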
-
Yang Yingliang authored
hulk inclusion category: bugfix bugzilla: NA CVE: NA -------------------------------- Enable KASAN and UBSAN by default for test. Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Jing Liu authored
mainline inclusion from mainline-v5.3-rc1 commit 0b774629 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3YAEG CVE: NA ----------------------------- AVX512 BFLOAT16 instructions support the 16-bit BFLOAT16 floating-point format (BF16) for deep learning optimization. Intel adds the AVX512 BFLOAT16 feature in Cooper Lake; it is enumerated by CPUID.7.1.EAX[5]. Detailed information on the CPUID bit can be found here: https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf. Signed-off-by:
Jing Liu <jing2.liu@linux.intel.com> [Fix type mismatch in min, changing constant "1" to "1u". - Paolo] Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Jingyi Wang <wangjingyi11@huawei.com> Reviewed-by:
Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
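A runnable userspace check of the documented bit, CPUID leaf 7 subleaf 1, bit 5 of EAX, using GCC/Clang's <cpuid.h>:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* __get_cpuid_count() returns 0 if the leaf is unsupported. */
        if (__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx) &&
            (eax & (1u << 5)))
            puts("AVX512_BF16: supported");
        else
            puts("AVX512_BF16: not supported");
        return 0;
    }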
-
Paolo Bonzini authored
mainline inclusion from mainline-v5.3-rc1 commit 60cec433 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3YAEG CVE: NA ----------------------------- The has_leaf_count member was originally added for KVM's paravirtualization CPUID leaves. However, since then the leaf count _has_ been added to those leaves as well, so we can drop that special case. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Jingyi Wang <wangjingyi11@huawei.com> Reviewed-by:
Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Paolo Bonzini authored
mainline inclusion from mainline-v5.3-rc1 commit 50a9e1a4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3YAEG CVE: NA ----------------------------- do_cpuid_1_ent does not do the entire processing for a CPUID entry, it only retrieves the host's values. Rename it to match reality. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Jingyi Wang <wangjingyi11@huawei.com> Reviewed-by:
Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Paolo Bonzini authored
mainline inclusion from mainline-v5.3-rc1 commit d9aadaf6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3YAEG CVE: NA ----------------------------- do_cpuid_1_ent is typically called in two places by __do_cpuid_func for CPUID functions that have subleafs. Both places have to set the KVM_CPUID_FLAG_SIGNIFCANT_INDEX. Set that flag, and KVM_CPUID_FLAG_STATEFUL_FUNC as well, directly in do_cpuid_1_ent. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Jingyi Wang <wangjingyi11@huawei.com> Reviewed-by:
Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Paolo Bonzini authored
mainline inclusion from mainline-v5.3-rc1 commit 54d360d4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3YAEG CVE: NA ----------------------------- CPUID function 7 has multiple subleafs. Instead of having nested switch statements, move the logic to filter supported features to a separate function, and call it for each subleaf. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Jingyi Wang <wangjingyi11@huawei.com> Reviewed-by:
Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Paolo Bonzini authored
mainline inclusion from mainline-v5.3-rc1 commit ab8bcf64 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3YAEG CVE: NA ----------------------------- Rename it as well as __do_cpuid_ent and __do_cpuid_ent_emulated to have "func" in its name, and drop the index parameter which is always 0. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Jingyi Wang <wangjingyi11@huawei.com> Reviewed-by:
Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
- Jul 14, 2021
-
-
Theodore Ts'o authored
mainline inclusion from mainline-5.14 commit 61bb4a1c417e5b95d9edb4f887f131de32e419cb category: bugfix bugzilla: 173880 CVE: NA ------------------------------------------------- After commit 618f003199c6 ("ext4: fix memory leak in ext4_fill_super"), after the file system is remounted read-only, there is a race where the kmmpd thread can exit, causing sbi->s_mmp_tsk to point at freed memory, which the call to ext4_stop_mmpd() can trip over. Fix this by only allowing kmmpd() to exit when it is stopped via ext4_stop_mmpd(). Link: https://lore.kernel.org/r/20210707002433.3719773-1-tytso@mit.edu Reported-by:
Ye Bin <yebin10@huawei.com> Bug-Report-Link: <20210629143603.2166962-1-yebin10@huawei.com> Signed-off-by:
Theodore Ts'o <tytso@mit.edu> Reviewed-by:
Jan Kara <jack@suse.cz> Conflicts: fs/ext4/super.c Signed-off-by:
Baokun Li <libaokun1@huawei.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
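A generic kernel-style sketch of the fixed lifetime rule (not the ext4 code itself): the monitor thread only ever exits through kthread_stop(), so the stored task pointer cannot dangle.

    static int kmmpd(void *data)
    {
            while (!kthread_should_stop()) {
                    /* write the MMP block unless read-only, then sleep;
                     * on remount-ro stay idle instead of exiting */
            }
            return 0;
    }

    /* the only exit path, called from ext4_stop_mmpd()-like code: */
    if (sbi->s_mmp_tsk) {
            kthread_stop(sbi->s_mmp_tsk);
            sbi->s_mmp_tsk = NULL;
    }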
-
Luo Meng authored
mainline inclusion from mainline-v5.11-rc1 commit 16238415eb9886328a89fe7a3cb0b88c7335fe16 category: bugfix bugzilla: 38689 CVE: NA ----------------------------------------------- When the sum of fl->fl_start and l->l_len overflows, UBSAN shows the following warning: UBSAN: Undefined behaviour in fs/locks.c:482:29 signed integer overflow: 2 + 9223372036854775806 cannot be represented in type 'long long int' Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xe4/0x14e lib/dump_stack.c:118 ubsan_epilogue+0xe/0x81 lib/ubsan.c:161 handle_overflow+0x193/0x1e2 lib/ubsan.c:192 flock64_to_posix_lock fs/locks.c:482 [inline] flock_to_posix_lock+0x595/0x690 fs/locks.c:515 fcntl_setlk+0xf3/0xa90 fs/locks.c:2262 do_fcntl+0x456/0xf60 fs/fcntl.c:387 __do_sys_fcntl fs/fcntl.c:483 [inline] __se_sys_fcntl fs/fcntl.c:468 [inline] __x64_sys_fcntl+0x12d/0x180 fs/fcntl.c:468 do_syscall_64+0xc8/0x5a0 arch/x86/entry/common.c:293 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fix it by parenthesizing 'l->l_len - 1'. Signed-off-by:
Luo Meng <luomeng12@huawei.com> Signed-off-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Luo Meng <luomeng12@huawei.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
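A standalone illustration of why the parenthesization matters, plus the explicit guard; the helper is hypothetical, not fs/locks.c itself.

    #include <limits.h>
    #include <stdbool.h>

    /* end = start + len - 1, for len > 0. Writing it as start + (len - 1)
     * matters: with start = 2 and len = LLONG_MAX - 1, (start + len)
     * overflows first while start + (len - 1) is still representable.
     * The guard rejects the combinations that would still overflow. */
    static bool lock_end(long long start, long long len, long long *end)
    {
        if (len <= 0)
            return false;              /* sketch handles len > 0 only */
        if (start > LLONG_MAX - (len - 1))
            return false;              /* caller returns -EOVERFLOW */
        *end = start + (len - 1);
        return true;
    }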
-
Matthew Wilcox (Oracle) authored
mainline inclusion from mainline-v5.10 commit 14284fed category: bugfix bugzilla: 43547 CVE: NA ----------------------------------------------- When bringing (portions of) a page uptodate, we were marking blocks that were zeroed as being uptodate, but not blocks that were read from storage. Like the previous commit, this problem was found with generic/127 and a kernel which failed readahead I/Os. This bug causes writes to be silently lost when working with flaky storage. Fixes: 9dc55f13 ("iomap: add support for sub-pagesize buffered I/O without buffer heads") Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> conflicts: fs/iomap.c Signed-off-by:
Ye Bin <yebin10@huawei.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Matthew Wilcox (Oracle) authored
mainline inclusion from mainline-v5.10-rc1 commit e6e7ca92 category: bugfix bugzilla: 43551 CVE: NA ----------------------------------------------- If we find a page in write_begin which is !Uptodate, we need to clear any error on the page before starting to read data into it. This matches how filemap_fault(), do_read_cache_page() and generic_file_buffered_read() handle PageError on !Uptodate pages. When calling iomap_set_range_uptodate() in __iomap_write_begin(), blocks were not being marked as uptodate. This was found with generic/127 and a specially modified kernel which would fail (some) readahead I/Os. The test read some bytes in a prior page which caused readahead to extend into page 0x34. There was a subsequent write to page 0x34, followed by a read to page 0x34. Because the blocks were still marked as !Uptodate, the read caused all blocks to be re-read, overwriting the write. With this change, and the next one, the bytes which were written are marked as being Uptodate, so even though the page is still marked as !Uptodate, the blocks containing the written data are not re-read from storage. Fixes: 9dc55f13 ("iomap: add support for sub-pagesize buffered I/O without buffer heads") Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> conflicts: fs/iomap.c Signed-off-by:
Ye Bin <yebin10@huawei.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Christoph Hellwig authored
mainline inclusion from mainline-v5.5-rc1 commit d3b40439 category: bugfix bugzilla: 43551 CVE: NA ----------------------------------------------- That keeps the function a little easier to understand, and easier to modify for pending enhancements. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com> conflicts: fs/iomap.c Signed-off-by:
Ye Bin <yebin10@huawei.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Josef Bacik authored
mainline inclusion from mainline-5.12-rc1 commit c9a2f90f4d6b category: bugfix bugzilla: 50455 CVE: NA ------------------------------------------------- There exists a race where we can be attempting to create a new nbd configuration while a previous configuration is going down, both configured with DESTROY_ON_DISCONNECT. Normally devices all have a reference of 1, as they won't be cleaned up until the module is torn down. However with DESTROY_ON_DISCONNECT we'll make sure that there is only 1 reference (generally) on the device for the config itself, and then once the config is dropped, the device is torn down. The race that exists looks like this TASK1 TASK2 nbd_genl_connect() idr_find() refcount_inc_not_zero(nbd) * count is 2 here ^^ nbd_config_put() nbd_put(nbd) (count is 1) setup new config check DESTROY_ON_DISCONNECT put_dev = true if (put_dev) nbd_put(nbd) * free'd here ^^ In nbd_genl_connect() we assume that the nbd ref count will be 2, however clearly that won't be true if the nbd device had been setup as DESTROY_ON_DISCONNECT with its prior configuration. Fix this by getting rid of the runtime flag to check if we need to mess with the nbd device refcount, and use the device NBD_DESTROY_ON_DISCONNECT flag to check if we need to adjust the ref counts. This was reported by syzkaller with the following kasan dump BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:71 [inline] BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:27 [inline] BUG: KASAN: use-after-free in refcount_dec_not_one+0x71/0x1e0 lib/refcount.c:76 Read of size 4 at addr ffff888143bf71a0 by task systemd-udevd/8451 CPU: 0 PID: 8451 Comm: systemd-udevd Not tainted 5.11.0-rc7-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230 __kasan_report mm/kasan/report.c:396 [inline] kasan_report.cold+0x79/0xd5 mm/kasan/report.c:413 check_memory_region_inline mm/kasan/generic.c:179 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:185 instrument_atomic_read include/linux/instrumented.h:71 [inline] atomic_read include/asm-generic/atomic-instrumented.h:27 [inline] refcount_dec_not_one+0x71/0x1e0 lib/refcount.c:76 refcount_dec_and_mutex_lock+0x19/0x140 lib/refcount.c:115 nbd_put drivers/block/nbd.c:248 [inline] nbd_release+0x116/0x190 drivers/block/nbd.c:1508 __blkdev_put+0x548/0x800 fs/block_dev.c:1579 blkdev_put+0x92/0x570 fs/block_dev.c:1632 blkdev_close+0x8c/0xb0 fs/block_dev.c:1640 __fput+0x283/0x920 fs/file_table.c:280 task_work_run+0xdd/0x190 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:174 [inline] exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fc1e92b5270 Code: 73 01 c3 48 8b 0d 38 7d 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c1 20 00 00 75 10 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ee fb ff ff 48 89 04 24 RSP: 002b:00007ffe8beb2d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00007fc1e92b5270 RDX: 000000000aba9500 RSI: 0000000000000000 RDI: 0000000000000007 
RBP: 00007fc1ea16f710 R08: 000000000000004a R09: 0000000000000008 R10: 0000562f8cb0b2a8 R11: 0000000000000246 R12: 0000000000000000 R13: 0000562f8cb0afd0 R14: 0000000000000003 R15: 000000000000000e Allocated by task 1: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:401 [inline] ____kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:429 kmalloc include/linux/slab.h:552 [inline] kzalloc include/linux/slab.h:682 [inline] nbd_dev_add+0x44/0x8e0 drivers/block/nbd.c:1673 nbd_init+0x250/0x271 drivers/block/nbd.c:2394 do_one_initcall+0x103/0x650 init/main.c:1223 do_initcall_level init/main.c:1296 [inline] do_initcalls init/main.c:1312 [inline] do_basic_setup init/main.c:1332 [inline] kernel_init_freeable+0x605/0x689 init/main.c:1533 kernel_init+0xd/0x1b8 init/main.c:1421 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296 Freed by task 8451: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38 kasan_set_track+0x1c/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:356 ____kasan_slab_free+0xe1/0x110 mm/kasan/common.c:362 kasan_slab_free include/linux/kasan.h:192 [inline] slab_free_hook mm/slub.c:1547 [inline] slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1580 slab_free mm/slub.c:3143 [inline] kfree+0xdb/0x3b0 mm/slub.c:4139 nbd_dev_remove drivers/block/nbd.c:243 [inline] nbd_put.part.0+0x180/0x1d0 drivers/block/nbd.c:251 nbd_put drivers/block/nbd.c:295 [inline] nbd_config_put+0x6dd/0x8c0 drivers/block/nbd.c:1242 nbd_release+0x103/0x190 drivers/block/nbd.c:1507 __blkdev_put+0x548/0x800 fs/block_dev.c:1579 blkdev_put+0x92/0x570 fs/block_dev.c:1632 blkdev_close+0x8c/0xb0 fs/block_dev.c:1640 __fput+0x283/0x920 fs/file_table.c:280 task_work_run+0xdd/0x190 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:174 [inline] exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff888143bf7000 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 416 bytes inside of 1024-byte region [ffff888143bf7000, ffff888143bf7400) The buggy address belongs to the page: page:000000005238f4ce refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x143bf0 head:000000005238f4ce order:3 compound_mapcount:0 compound_pincount:0 flags: 0x57ff00000010200(slab|head) raw: 057ff00000010200 ffffea00004b1400 0000000300000003 ffff888010c41140 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888143bf7080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888143bf7100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888143bf7180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888143bf7200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-and-tested-by:
<syzbot+429d3f82d757c211bff3@syzkaller.appspotmail.com> Signed-off-by:
Josef Bacik <josef@toxicpanda.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
Luo Meng <luomeng12@huawei.com> Reviewed-by:
Jason Yan <yanaijie@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Paul Aurich authored
mainline inclusion from mainline-5.9-rc1 commit baf57b56 category: bugfix bugzilla: 40791 CVE: NA --------------------------- Handling a lease break for the cached root didn't free the smb2_lease_break_work allocation, resulting in a leak: unreferenced object 0xffff98383a5af480 (size 128): comm "cifsd", pid 684, jiffies 4294936606 (age 534.868s) hex dump (first 32 bytes): c0 ff ff ff 1f 00 00 00 88 f4 5a 3a 38 98 ff ff ..........Z:8... 88 f4 5a 3a 38 98 ff ff 80 88 d6 8a ff ff ff ff ..Z:8........... backtrace: [<0000000068957336>] smb2_is_valid_oplock_break+0x1fa/0x8c0 [<0000000073b70b9e>] cifs_demultiplex_thread+0x73d/0xcc0 [<00000000905fa372>] kthread+0x11c/0x150 [<0000000079378e4e>] ret_from_fork+0x22/0x30 Avoid this leak by only allocating when necessary. Fixes: a93864d9 ("cifs: add lease tracking to the cached root fid") Signed-off-by:
Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> # v4.18+ Reviewed-by:
Aurelien Aptel <aaptel@suse.com> Signed-off-by:
Steve French <stfrench@microsoft.com> Conflicts: fs/cifs/smb2misc.c [ Not apply 9bd45408("CIFS: Properly process SMB3 lease breaks") ] Signed-off-by:
Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
- Jul 12, 2021
-
-
Lu Jialin authored
hulk inclusion category: bugfix bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I CVE: NA -------- static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) { ... pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp); if (!pn) return 1; pnext = to_mgpn_ext(pn); pnext->lruvec_stat_local = alloc_percpu(struct lruvec_stat); } The extended struct *pnext is larger than *pn, but only sizeof(*pn) bytes were allocated, so pnext->lruvec_stat_local is out of bounds. Signed-off-by:
Lu Jialin <lujialin4@huawei.com> Reviewed-by:
Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
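A hedged sketch of the likely shape of the fix; the extension type and field names here are hypothetical, and only the sizing rule is the point: allocate the structure you actually dereference.

    struct mem_cgroup_per_node_extension {
            struct mem_cgroup_per_node pn;  /* base part must stay first */
            struct lruvec_stat __percpu *lruvec_stat_local;
    };

    /* Size the allocation by the extended struct, not the base one: */
    pnext = kzalloc_node(sizeof(*pnext), GFP_KERNEL, tmp);
    if (!pnext)
            return 1;
    pn = &pnext->pn;                        /* hand the base part back */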
-
Mimi Zohar authored
stable inclusion from linux-4.19.196 commit ff660863628fb144badcb3395cde7821c82c13a6 CVE: CVE-2021-35039 -------------------------------- [ Upstream commit 0c18f29aae7ce3dadd26d8ee3505d07cc982df75 ] Irrespective as to whether CONFIG_MODULE_SIG is configured, specifying "module.sig_enforce=1" on the boot command line sets "sig_enforce". Only allow "sig_enforce" to be set when CONFIG_MODULE_SIG is configured. This patch makes the presence of /sys/module/module/parameters/sig_enforce dependent on CONFIG_MODULE_SIG=y. Fixes: fda784e5 ("module: export module signature enforcement status") Reported-by:
Nayna Jain <nayna@linux.ibm.com> Tested-by:
Mimi Zohar <zohar@linux.ibm.com> Tested-by:
Jessica Yu <jeyu@kernel.org> Signed-off-by:
Mimi Zohar <zohar@linux.ibm.com> Signed-off-by:
Jessica Yu <jeyu@kernel.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
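A kernel-style sketch of the gating, close to the upstream shape: the variable and its module_param exist only when CONFIG_MODULE_SIG is configured, so the sysfs node disappears otherwise.

    #ifdef CONFIG_MODULE_SIG
    static bool sig_enforce = IS_ENABLED(CONFIG_MODULE_SIG_FORCE);
    module_param(sig_enforce, bool_enable_only, 0644);
    #endif /* CONFIG_MODULE_SIG */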
-
Yufen Yu authored
hulk inclusion category: feature bugzilla: 173267 CVE: NA --------------------------- For HiBench applications such as kmeans, wordcount and terasort, we can use this bpf tool to improve IO performance. Usage: make -C bpf ./test_xfs_file spec_readahead Signed-off-by:
Yufen Yu <yuyufen@huawei.com> Signed-off-by:
Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by:
Hou Tao <houtao1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Yufen Yu authored
hulk inclusion category: feature bugzilla: 173267 CVE: NA --------------------------- For HiBench applications, including kmeans, wordcount and terasort, the whole blk_xxx and blk_xxx.meta are read from disk sequentially, and almost all of the reads issued to disk are triggered by async readahead. However, a sequential read in a single thread does not mean sequential IO on disk when multiple threads do so concurrently. Interleaved sequential reads from multiple threads can make the IO issued to disk random, which limits disk IO throughput. To reduce disk randomization, we can consider increasing the readahead window; the IO generated by the filesystem is then bigger on each async readahead. But, limited by the disk's max_hw_sectors_kb, a big IO is split, and the whole bio has to wait for all split bios to complete, which can cause longer IO latency. Our traces show that many long latencies in threads are caused by waiting for async readahead IO to complete when the readahead window is set to a big value. That is, the threads read faster than the async readahead IO can complete. To improve performance, we provide a special async readahead method: * On the one hand, we read more sequential data from disk, which reduces disk randomization when multiple threads interleave. * On the other hand, the size of each IO issued to disk is 2M, which avoids big IO splits and long IO latency. Signed-off-by:
Yufen Yu <yuyufen@huawei.com> Signed-off-by:
Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by:
Hou Tao <houtao1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
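A runnable userspace analogy of the 2M-chunk idea using posix_fadvise(): prefetch a long sequential window, but request it in 2 MiB pieces so no single huge IO gets split and waited on as a whole.

    #include <fcntl.h>
    #include <sys/types.h>

    #define RA_CHUNK (2L << 20)             /* 2 MiB per readahead IO */

    static void prefetch_window(int fd, off_t off, off_t len)
    {
        for (off_t p = off; p < off + len; p += RA_CHUNK)
            posix_fadvise(fd, p, RA_CHUNK, POSIX_FADV_WILLNEED);
    }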
-
Yufen Yu authored
hulk inclusion category: feature bugzilla: 173267 CVE: NA --------------------------- If the page index of ra->prev_pos is equal to that of the current pos, the read is sequential, so clear the FMODE_RANDOM flag to enable async readahead. Usage: make -C bpf ./test_xfs_file clear Signed-off-by:
Yufen Yu <yuyufen@huawei.com> Signed-off-by:
Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by:
Hou Tao <houtao1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
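A standalone sketch of the sequentiality test described above, with PAGE_SHIFT fixed at 12 for the sketch:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12

    /* If the last read ended in the page the next read starts in,
     * treat the stream as sequential and drop FMODE_RANDOM. */
    static bool read_is_sequential(int64_t prev_pos, int64_t pos)
    {
        return (prev_pos >> PAGE_SHIFT) == (pos >> PAGE_SHIFT);
    }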
-
Yufen Yu authored
hulk inclusion category: feature bugzilla: 173267 CVE: NA --------------------------- Add a new member clear_f_mode to struct xfs_writable_file so that we can clear some flags of file->f_mode. Signed-off-by:
Yufen Yu <yuyufen@huawei.com> Signed-off-by:
Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by:
Hou Tao <houtao1@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
- Jul 09, 2021
-
-
yangerkun authored
hulk inclusion category: bugfix bugzilla: 172974 CVE: NA --------------------------- 72c9e4df ('jbd2: ensure abort the journal if detect IO error when writing original buffer back') added 'j_atomic_flags', which breaks KABI in many places such as jbd2_journal_destroy/jbd2_journal_abort. Fix it by adding a wrapper. Signed-off-by:
yangerkun <yangerkun@huawei.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
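A hedged sketch of the KABI-preserving wrapper pattern (names hypothetical; the actual hulk wrapper may differ): journal_t keeps its existing layout and the new flags live outside it.

    struct journal_wrapper {
            journal_t journal;              /* must remain the first member */
            unsigned long j_atomic_flags;   /* new state, invisible to KABI */
    };

    /* Because 'journal' is first, internal code can convert pointers: */
    #define journal_to_wrapper(j) ((struct journal_wrapper *)(j))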
-
Qu Wenruo authored
mainline inclusion from mainline-v5.13-rc5 commit 6d4572a9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I39MZM CVE: NA ------------------------------------------------------ [BUG] When the data space is exhausted, even if the inode has NOCOW attribute, we will still refuse to truncate unaligned range due to ENOSPC. The following script can reproduce it pretty easily: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev -b 1G mount -o nospace_cache $dev $mnt touch $mnt/foobar chattr +C $mnt/foobar xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null sync xfs_io -c "fpunch 0 2k" $mnt/foobar umount $mnt Currently this will fail at the fpunch part. [CAUSE] Because btrfs_truncate_block() always reserves space without checking the NOCOW attribute. Since the writeback path follows NOCOW bit, we only need to bother the space reservation code in btrfs_truncate_block(). [FIX] Make btrfs_truncate_block() follow btrfs_buffered_write() to try to reserve data space first, and fall back to NOCOW check only when we don't have enough space. Such always-try-reserve is an optimization introduced in btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call. This patch will export check_can_nocow() as btrfs_check_can_nocow(), and use it in btrfs_truncate_block() to fix the problem. Reference: https://patchwork.kernel.org/project/linux-btrfs/patch/20200130052822.11765-1-wqu@suse.com Reported-by:
Martin Doucha <martin.doucha@suse.com> Reviewed-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
Anand Jain <anand.jain@oracle.com> Signed-off-by:
Qu Wenruo <wqu@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com> Conflicts: fs/btrfs/file.c fs/btrfs/inode.c Signed-off-by:
Gou Hao <gouhao@uniontech.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Jiao Fenfang <jiaofenfang@uniontech.com> Reviewed-by:
Zhang Yi <yi.zhang@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
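A standalone sketch of the control flow described above (all functions are stand-ins, not the btrfs API): try the cheap data-space reservation first and consult the expensive NOCOW check only when the reservation fails.

    #include <errno.h>
    #include <stdbool.h>

    extern int  reserve_data_space(long len);      /* stand-in */
    extern bool can_nocow(long start, long len);   /* stand-in */

    static int prepare_truncate_block(long start, long len, bool *reserved)
    {
        int ret = reserve_data_space(len);

        *reserved = (ret == 0);
        if (ret == -ENOSPC && can_nocow(start, len))
            return 0;       /* overwrite in place, no new space needed */
        return ret;         /* 0 on reservation, or the real error */
    }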
-
ChenXiaoSong authored
hulk inclusion category: bugfix bugzilla: NA CVE: NA ------------------------------------------------- commit a2ff6d97 ("NFSv4.1: Don't rebind to the same source port when reconnecting to the server") add new member into struct rpc_xprt, which will break KABI. This patch try to fix it. Signed-off-by:
ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by:
Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
- Jul 08, 2021
-
-
Wang Hai authored
hulk inclusion category: bugfix bugzilla: 172330 CVE: NA -------------------------------- We can construct special USB packets that cause a kernel info leak via the following rndis steps. 1. Construct a packet that makes rndis call gen_ndis_set_resp(). In gen_ndis_set_resp(), BufOffset comes from the USB packet and is not checked, so BufOffset can be any value. Therefore, if the OID is RNDIS_OID_GEN_CURRENT_PACKET_FILTER, *params->filter can read data at any address. 2. Construct a packet that makes rndis call rndis_query_response(). In rndis_query_response(), if the OID is RNDIS_OID_GEN_CURRENT_PACKET_FILTER, the data of *params->filter is fetched and returned, resulting in an info leak. Therefore, we need to check BufOffset to prevent the leak. Here, the buffer size is USB_COMP_EP0_BUFSIZ; as long as "8 + BufOffset + BufLength" is less than USB_COMP_EP0_BUFSIZ, the packet is considered legal. Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by:
Wang Hai <wanghai38@huawei.com> Reviewed-by:
Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by:
Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
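A standalone sketch of the bounds check described above; USB_COMP_EP0_BUFSIZ's value here is illustrative, and the arithmetic is widened to 64 bits so the attacker-controlled fields cannot wrap it.

    #include <stdbool.h>
    #include <stdint.h>

    #define USB_COMP_EP0_BUFSIZ 4096   /* illustrative value */

    static bool rndis_buf_ok(uint32_t buf_offset, uint32_t buf_len)
    {
        /* legal only if 8 + BufOffset + BufLength stays in the buffer */
        return (uint64_t)8 + buf_offset + buf_len < USB_COMP_EP0_BUFSIZ;
    }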
-
- Jul 05, 2021
-
-
Kefeng Wang authored
hulk inclusion category: bugfix bugzilla: 172153 CVE: NA ------------------------------------------------- DO_ONCE DEFINE_STATIC_KEY_TRUE(___once_key); __do_once_done once_disable_jump(once_key); INIT_WORK(&w->work, once_deferred); struct once_work *w; w->key = key; schedule_work(&w->work); module unload //*the key is destroyed* process_one_work once_deferred BUG_ON(!static_key_enabled(work->key)); static_key_count((struct static_key *)x) //*access key, crash* When a module uses the DO_ONCE mechanism, it can crash due to the above concurrency problem; we can reproduce it with link [1]. Fix it by getting/putting the module refcount in the once work process. [1] https://lore.kernel.org/netdev/eaa6c371-465e-57eb-6be9-f4b16b9d7cbf@huawei.com/ Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Reported-by:
Minmin chen <chenmingmin@huawei.com> Signed-off-by:
Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
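A kernel-style sketch of the fix direction (this follows the shape of the upstream lib/once.c fix; treat the exact field placement as an assumption): pin the owning module for as long as the deferred work can run.

    struct once_work {
            struct work_struct work;
            struct static_key_true *key;
            struct module *module;          /* added: owner of the key */
    };

    /* queueing side (__do_once_done): */
    w->module = mod;
    __module_get(mod);                      /* keep the module alive */
    schedule_work(&w->work);

    /* work side (once_deferred): */
    static_branch_disable(w->key);
    module_put(w->module);                  /* unload is safe again */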
-
Zhang Xiaoxu authored
mainline inclusion from mainline-v5.14 commit 5483b904bf336948826594610af4c9bbb0d9e3aa category: bugfix bugzilla: 51898 CVE: NA --------------------------- When finding a task from the wait queue to wake up, a non-privileged task may be found rather than the privileged one. This may lead to a deadlock, the same as commit dfe1fe75e00e ("NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()"): the privileged delegreturn task is queued to the privileged list because all the slots are assigned. If there are not enough slots to wake up the non-privileged batch tasks (the session has fewer than 8 slots), the privileged delegreturn task may miss its wakeup because the task that was found cannot get a slot while the session is draining. So we should treat the privileged task as an emergency task and execute it as early as we can. Reported-by:
Hulk Robot <hulkci@huawei.com> Fixes: 5fcdfacc ("NFSv4: Return delegations synchronously in evict_inode") Cc: stable@vger.kernel.org Signed-off-by:
Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by:
Yue Haibing <yuehaibing@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Zhang Xiaoxu authored
mainline inclusion from mainline-v5.14 commit fcb170a9d825d7db4a3fb870b0300f5a40a8d096 category: bugfix bugzilla: 51898 CVE: NA --------------------------- The 'queue->nr' will wrap around from 0 to 255 when only the current priority queue has tasks. This may lead to a deadlock, the same as commit dfe1fe75e00e ("NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()"): the privileged delegreturn task is queued to the privileged list because all the slots are assigned. When a non-privileged task completes and releases its slot, another non-privileged task may be picked out, and its slot allocation may fail while the session is draining. If 'queue->nr' has wrapped around to 255 and there are not enough slots to service it, the privileged delegreturn task will miss its wakeup. So we should avoid the wraparound on 'queue->nr'. Reported-by:
Hulk Robot <hulkci@huawei.com> Fixes: 5fcdfacc ("NFSv4: Return delegations synchronously in evict_inode") Fixes: 1da177e4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by:
Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by:
Yue Haibing <yuehaibing@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
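A standalone sketch of the wraparound hazard (a stand-in for the per-priority counter, not the sunrpc code): an unsigned char decremented at zero wraps to 255 and lets one queue monopolize scheduling.

    #include <stdbool.h>

    static unsigned char nr_credits;   /* stand-in for 'queue->nr' */

    static bool consume_credit(void)
    {
        if (nr_credits == 0)
            return false;   /* rotate to another priority queue */
        nr_credits--;       /* never 0 - 1, so no wrap to 255 */
        return true;
    }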
-
Daniel Borkmann authored
mainline inclusion from mainline-v5.13-rc7 commit 9183671af6dbf60a1219371d4ed73e23f43b49db category: bugfix bugzilla: NA CVE: CVE-2021-33624 -------------------------------- The verifier only enumerates valid control-flow paths and skips paths that are unreachable in the non-speculative domain. And so it can miss issues under speculative execution on mispredicted branches. For example, a type confusion has been demonstrated with the following crafted program: // r0 = pointer to a map array entry // r6 = pointer to readable stack slot // r9 = scalar controlled by attacker 1: r0 = *(u64 *)(r0) // cache miss 2: if r0 != 0x0 goto line 4 3: r6 = r9 4: if r0 != 0x1 goto line 6 5: r9 = *(u8 *)(r6) 6: // leak r9 Since line 3 runs iff r0 == 0 and line 5 runs iff r0 == 1, the verifier concludes that the pointer dereference on line 5 is safe. But: if the attacker trains both the branches to fall-through, such that the following is speculatively executed ... r6 = r9 r9 = *(u8 *)(r6) // leak r9 ... then the program will dereference an attacker-controlled value and could leak its content under speculative execution via side-channel. This requires to mistrain the branch predictor, which can be rather tricky, because the branches are mutually exclusive. However such training can be done at congruent addresses in user space using different branches that are not mutually exclusive. That is, by training branches in user space ... A: if r0 != 0x0 goto line C B: ... C: if r0 != 0x0 goto line D D: ... ... such that addresses A and C collide to the same CPU branch prediction entries in the PHT (pattern history table) as those of the BPF program's lines 2 and 4, respectively. A non-privileged attacker could simply brute force such collisions in the PHT until observing the attack succeeding. Alternative methods to mistrain the branch predictor are also possible that avoid brute forcing the collisions in the PHT. A reliable attack has been demonstrated, for example, using the following crafted program: // r0 = pointer to a [control] map array entry // r7 = *(u64 *)(r0 + 0), training/attack phase // r8 = *(u64 *)(r0 + 8), oob address // [...] // r0 = pointer to a [data] map array entry 1: if r7 == 0x3 goto line 3 2: r8 = r0 // crafted sequence of conditional jumps to separate the conditional // branch in line 193 from the current execution flow 3: if r0 != 0x0 goto line 5 4: if r0 == 0x0 goto exit 5: if r0 != 0x0 goto line 7 6: if r0 == 0x0 goto exit [...] 187: if r0 != 0x0 goto line 189 188: if r0 == 0x0 goto exit // load any slowly-loaded value (due to cache miss in phase 3) ... 189: r3 = *(u64 *)(r0 + 0x1200) // ... and turn it into known zero for verifier, while preserving slowly- // loaded dependency when executing: 190: r3 &= 1 191: r3 &= 2 // speculatively bypassed phase dependency 192: r7 += r3 193: if r7 == 0x3 goto exit 194: r4 = *(u8 *)(r8 + 0) // leak r4 As can be seen, in training phase (phase != 0x3), the condition in line 1 turns into false and therefore r8 with the oob address is overridden with the valid map value address, which in line 194 we can read out without issues. However, in attack phase, line 2 is skipped, and due to the cache miss in line 189 where the map value is (zeroed and later) added to the phase register, the condition in line 193 takes the fall-through path due to prior branch predictor training, where under speculation, it'll load the byte at oob address r8 (unknown scalar type at that point) which could then be leaked via side-channel. 
One way to mitigate these is to 'branch off' an unreachable path, meaning, the current verification path keeps following the is_branch_taken() path and we push the other branch to the verification stack. Given this is unreachable from the non-speculative domain, this branch's vstate is explicitly marked as speculative. This is needed for two reasons: i) if this path is solely seen from speculative execution, then we later on still want the dead code elimination to kick in in order to sanitize these instructions with jmp-1s, and ii) to ensure that paths walked in the non-speculative domain are not pruned from earlier walks of paths walked in the speculative domain. Additionally, for robustness, we mark the registers which have been part of the conditional as unknown in the speculative path given there should be no assumptions made on their content. The fix in here mitigates type confusion attacks described earlier due to i) all code paths in the BPF program being explored and ii) existing verifier logic already ensuring that given memory access instruction references one specific data structure. An alternative to this fix that has also been looked at in this scope was to mark aux->alu_state at the jump instruction with a BPF_JMP_TAKEN state as well as direction encoding (always-goto, always-fallthrough, unknown), such that mixing of different always-* directions themselves as well as mixing of always-* with unknown directions would cause a program rejection by the verifier, e.g. programs with constructs like 'if ([...]) { x = 0; } else { x = 1; }' with subsequent 'if (x == 1) { [...] }'. For unprivileged, this would result in only single direction always-* taken paths, and unknown taken paths being allowed, such that the former could be patched from a conditional jump to an unconditional jump (ja). Compared to this approach here, it would have two downsides: i) valid programs that otherwise are not performing any pointer arithmetic, etc, would potentially be rejected/broken, and ii) we are required to turn off path pruning for unprivileged, where both can be avoided in this work through pushing the invalid branch to the verification stack. The issue was originally discovered by Adam and Ofek, and later independently discovered and reported as a result of Benedict and Piotr's research work. Fixes: b2157399 ("bpf: prevent out-of-bounds speculation") Reported-by:
Adam Morrison <mad@cs.tau.ac.il> Reported-by:
Ofek Kirzner <ofekkir@gmail.com> Reported-by:
Benedict Schlueter <benedict.schlueter@rub.de> Reported-by:
Piotr Krysiuk <piotras@gmail.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Reviewed-by:
John Fastabend <john.fastabend@gmail.com> Reviewed-by:
Benedict Schlueter <benedict.schlueter@rub.de> Reviewed-by:
Piotr Krysiuk <piotras@gmail.com> Acked-by:
Alexei Starovoitov <ast@kernel.org> Conflicts: kernel/bpf/verifier.c [yyl: bypass_spec_v1 is not introduced in kernel-4.19, use allow_ptr_leaks instead] Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
He Fengqing <hefengqing@huawei.com> Reviewed-by:
Kuohai Xu <xukuohai@huawei.com> Reviewed-by:
Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-