- Jun 02, 2022
-
-
Nadav Amit authored
stable inclusion from stable-4.19.239 commit bd4ad32e44ea472c83c7c21e5bee29a1bb1c24e8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 9e949a3886356fe9112c6f6f34a6e23d1d35407f upstream. The check in flush_smp_call_function_queue() for callbacks that are sent to offline CPUs currently checks whether the queue is empty. However, flush_smp_call_function_queue() has just deleted all the callbacks from the queue and moved all the entries into a local list. This checks would only be positive if some callbacks were added in the short time after llist_del_all() was called. This does not seem to be the intention of this check. Change the check to look at the local list to which the entries were moved instead of the queue from which all the callbacks were just removed. Fixes: 8d056c48 ("CPU hotplug, smp: flush any pending IPI callbacks before CPU offline") Signed-off-by:
Nadav Amit <namit@vmware.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220319072015.1495036-1-namit@vmware.com Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Nicolas Dichtel authored
stable inclusion from stable-4.19.239 commit 74b68f5249f16c5f7f675d0f604fa6ae20e3a151 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit e3fa461d8b0e185b7da8a101fe94dfe6dd500ac0 upstream. kongweibin reported a kernel panic in ip6_forward() when input interface has no in6 dev associated. The following tc commands were used to reproduce this panic: tc qdisc del dev vxlan100 root tc qdisc add dev vxlan100 root netem corrupt 5% CC: stable@vger.kernel.org Fixes: ccd27f05ae7b ("ipv6: fix 'disable_policy' for fwd packets") Reported-by:
kongweibin <kongweibin2@huawei.com> Signed-off-by:
Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by:
David Ahern <dsahern@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Patrick Wang authored
stable inclusion from stable-4.19.239 commit 7f4f020286e0bd4aaf40512c80c63a5e5bd629bc category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 23c2d497de21f25898fbea70aeb292ab8acc8c94 upstream. The kmemleak_*_phys() apis do not check the address for lowmem's min boundary, while the caller may pass an address below lowmem, which will trigger an oops: # echo scan > /sys/kernel/debug/kmemleak Unable to handle kernel paging request at virtual address ff5fffffffe00000 Oops [#1] Modules linked in: CPU: 2 PID: 134 Comm: bash Not tainted 5.18.0-rc1-next-20220407 #33 Hardware name: riscv-virtio,qemu (DT) epc : scan_block+0x74/0x15c ra : scan_block+0x72/0x15c epc : ffffffff801e5806 ra : ffffffff801e5804 sp : ff200000104abc30 gp : ffffffff815cd4e8 tp : ff60000004cfa340 t0 : 0000000000000200 t1 : 00aaaaaac23954cc t2 : 00000000000003ff s0 : ff200000104abc90 s1 : ffffffff81b0ff28 a0 : 0000000000000000 a1 : ff5fffffffe01000 a2 : ffffffff81b0ff28 a3 : 0000000000000002 a4 : 0000000000000001 a5 : 0000000000000000 a6 : ff200000104abd7c a7 : 0000000000000005 s2 : ff5fffffffe00ff9 s3 : ffffffff815cd998 s4 : ffffffff815d0e90 s5 : ffffffff81b0ff28 s6 : 0000000000000020 s7 : ffffffff815d0eb0 s8 : ffffffffffffffff s9 : ff5fffffffe00000 s10: ff5fffffffe01000 s11: 0000000000000022 t3 : 00ffffffaa17db4c t4 : 000000000000000f t5 : 0000000000000001 t6 : 0000000000000000 status: 0000000000000100 badaddr: ff5fffffffe00000 cause: 000000000000000d scan_gray_list+0x12e/0x1a6 kmemleak_scan+0x2aa/0x57e kmemleak_write+0x32a/0x40c full_proxy_write+0x56/0x82 vfs_write+0xa6/0x2a6 ksys_write+0x6c/0xe2 sys_write+0x22/0x2a ret_from_syscall+0x0/0x2 The callers may not quite know the actual address they pass(e.g. from devicetree). So the kmemleak_*_phys() apis should guarantee the address they finally use is in lowmem range, so check the address for lowmem's min boundary. Link: https://lkml.kernel.org/r/20220413122925.33856-1-patrick.wang.shcn@gmail.com Signed-off-by:
Patrick Wang <patrick.wang.shcn@gmail.com> Acked-by:
Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Juergen Gross authored
stable inclusion from stable-4.19.239 commit c0ca2da695d32e7e3c1d6ad31d17b499083645bf category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit e553f62f10d93551eb883eca227ac54d1a4fad84 upstream. Since commit 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") only zones with free memory are included in a built zonelist. This is problematic when e.g. all memory of a zone has been ballooned out when zonelists are being rebuilt. The decision whether to rebuild the zonelists when onlining new memory is done based on populated_zone() returning 0 for the zone the memory will be added to. The new zone is added to the zonelists only, if it has free memory pages (managed_zone() returns a non-zero value) after the memory has been onlined. This implies, that onlining memory will always free the added pages to the allocator immediately, but this is not true in all cases: when e.g. running as a Xen guest the onlined new memory will be added only to the ballooned memory list, it will be freed only when the guest is being ballooned up afterwards. Another problem with using managed_zone() for the decision whether a zone is being added to the zonelists is, that a zone with all memory used will in fact be removed from all zonelists in case the zonelists happen to be rebuilt. Use populated_zone() when building a zonelist as it has been done before that commit. There was a report that QubesOS (based on Xen) is hitting this problem. Xen has switched to use the zone device functionality in kernel 5.9 and QubesOS wants to use memory hotplugging for guests in order to be able to start a guest with minimal memory and expand it as needed. This was the report leading to the patch. Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com Fixes: 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") Signed-off-by:
Juergen Gross <jgross@suse.com> Reported-by:
Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by:
Michal Hocko <mhocko@suse.com> Acked-by:
David Hildenbrand <david@redhat.com> Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by:
Wei Yang <richard.weiyang@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Harshit Mogalapalli authored
stable inclusion from stable-4.19.239 commit eb5f51756944735ac70cd8bb38637cc202e29c91 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 64c4a37ac04eeb43c42d272f6e6c8c12bfcf4304 ] Smatch printed a warning: arch/x86/crypto/poly1305_glue.c:198 poly1305_update_arch() error: __memcpy() 'dctx->buf' too small (16 vs u32max) It's caused because Smatch marks 'link_len' as untrusted since it comes from sscanf(). Add a check to ensure that 'link_len' is not larger than the size of the 'link_str' buffer. Fixes: c69c1b6e ("cifs: implement CIFSParseMFSymlink()") Signed-off-by:
Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reviewed-by:
Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by:
Steve French <stfrench@microsoft.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Guillaume Nault authored
stable inclusion from stable-4.19.239 commit 1ef0088e43af1de4e3b365218c4d3179d9a37eec category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 726e2c5929de841fdcef4e2bf995680688ae1b87 ] After feeding a decapsulated packet to a veth device with act_mirred, skb_headlen() may be 0. But veth_xmit() calls __dev_forward_skb(), which expects at least ETH_HLEN byte of linear data (as __dev_forward_skb2() calls eth_type_trans(), which pulls ETH_HLEN bytes unconditionally). Use pskb_may_pull() to ensure veth_xmit() respects this constraint. kernel BUG at include/linux/skbuff.h:2328! RIP: 0010:eth_type_trans+0xcf/0x140 Call Trace: <IRQ> __dev_forward_skb2+0xe3/0x160 veth_xmit+0x6e/0x250 [veth] dev_hard_start_xmit+0xc7/0x200 __dev_queue_xmit+0x47f/0x520 ? skb_ensure_writable+0x85/0xa0 ? skb_mpls_pop+0x98/0x1c0 tcf_mirred_act+0x442/0x47e [act_mirred] tcf_action_exec+0x86/0x140 fl_classify+0x1d8/0x1e0 [cls_flower] ? dma_pte_clear_level+0x129/0x1a0 ? dma_pte_clear_level+0x129/0x1a0 ? prb_fill_curr_block+0x2f/0xc0 ? skb_copy_bits+0x11a/0x220 __tcf_classify+0x58/0x110 tcf_classify_ingress+0x6b/0x140 __netif_receive_skb_core.constprop.0+0x47d/0xfd0 ? __iommu_dma_unmap_swiotlb+0x44/0x90 __netif_receive_skb_one_core+0x3d/0xa0 netif_receive_skb+0x116/0x170 be_process_rx+0x22f/0x330 [be2net] be_poll+0x13c/0x370 [be2net] __napi_poll+0x2a/0x170 net_rx_action+0x22f/0x2f0 __do_softirq+0xca/0x2a8 __irq_exit_rcu+0xc1/0xe0 common_interrupt+0x83/0xa0 Fixes: e314dbdc ("[NET]: Virtual ethernet device driver.") Signed-off-by:
Guillaume Nault <gnault@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Waiman Long authored
stable inclusion from stable-4.19.238 commit ff929e3abef16c409a9beef08bcc29ab940d0ce6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit a431dbbc540532b7465eae4fc8b56a85a9fc7d17 upstream. The gcc 12 compiler reports a "'mem_section' will never be NULL" warning on the following code: static inline struct mem_section *__nr_to_section(unsigned long nr) { #ifdef CONFIG_SPARSEMEM_EXTREME if (!mem_section) return NULL; #endif if (!mem_section[SECTION_NR_TO_ROOT(nr)]) return NULL; : It happens with CONFIG_SPARSEMEM_EXTREME off. The mem_section definition is #ifdef CONFIG_SPARSEMEM_EXTREME extern struct mem_section **mem_section; #else extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]; #endif In the !CONFIG_SPARSEMEM_EXTREME case, mem_section is a static 2-dimensional array and so the check "!mem_section[SECTION_NR_TO_ROOT(nr)]" doesn't make sense. Fix this warning by moving the "!mem_section[SECTION_NR_TO_ROOT(nr)]" check up inside the CONFIG_SPARSEMEM_EXTREME block and adding an explicit NR_SECTION_ROOTS check to make sure that there is no out-of-bound array access. Link: https://lkml.kernel.org/r/20220331180246.2746210-1-longman@redhat.com Fixes: 3e347261 ("sparsemem extreme implementation") Signed-off-by:
Waiman Long <longman@redhat.com> Reported-by:
Justin Forbes <jforbes@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Peter Xu authored
stable inclusion from stable-4.19.238 commit 7265c880004d30691ea3715ce0eaf6ef3ce5d445 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 5abfd71d936a8aefd9f9ccd299dea7a164a5d455 upstream. Patch series "mm: Rework zap ptes on swap entries", v5. Patch 1 should fix a long standing bug for zap_pte_range() on zap_details usage. The risk is we could have some swap entries skipped while we should have zapped them. Migration entries are not the major concern because file backed memory always zap in the pattern that "first time without page lock, then re-zap with page lock" hence the 2nd zap will always make sure all migration entries are already recovered. However there can be issues with real swap entries got skipped errornoously. There's a reproducer provided in commit message of patch 1 for that. Patch 2-4 are cleanups that are based on patch 1. After the whole patchset applied, we should have a very clean view of zap_pte_range(). Only patch 1 needs to be backported to stable if necessary. This patch (of 4): The "details" pointer shouldn't be the token to decide whether we should skip swap entries. For example, when the callers specified details->zap_mapping==NULL, it means the user wants to zap all the pages (including COWed pages), then we need to look into swap entries because there can be private COWed pages that was swapped out. Skipping some swap entries when details is non-NULL may lead to wrongly leaving some of the swap entries while we should have zapped them. A reproducer of the problem: ===8<=== #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <stdio.h> #include <assert.h> #include <unistd.h> #include <sys/mman.h> #include <sys/types.h> int page_size; int shmem_fd; char *buffer; void main(void) { int ret; char val; page_size = getpagesize(); shmem_fd = memfd_create("test", 0); assert(shmem_fd >= 0); ret = ftruncate(shmem_fd, page_size * 2); assert(ret == 0); buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, shmem_fd, 0); assert(buffer != MAP_FAILED); /* Write private page, swap it out */ buffer[page_size] = 1; madvise(buffer, page_size * 2, MADV_PAGEOUT); /* This should drop private buffer[page_size] already */ ret = ftruncate(shmem_fd, page_size); assert(ret == 0); /* Recover the size */ ret = ftruncate(shmem_fd, page_size * 2); assert(ret == 0); /* Re-read the data, it should be all zero */ val = buffer[page_size]; if (val == 0) printf("Good\n"); else printf("BUG\n"); } ===8<=== We don't need to touch up the pmd path, because pmd never had a issue with swap entries. For example, shmem pmd migration will always be split into pte level, and same to swapping on anonymous. Add another helper should_zap_cows() so that we can also check whether we should zap private mappings when there's no page pointer specified. This patch drops that trick, so we handle swap ptes coherently. Meanwhile we should do the same check upon migration entry, hwpoison entry and genuine swap entries too. To be explicit, we should still remember to keep the private entries if even_cows==false, and always zap them when even_cows==true. The issue seems to exist starting from the initial commit of git. [peterx@redhat.com: comment tweaks] Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Marc Zyngier authored
stable inclusion from stable-4.19.238 commit c7daf1b4ad809692d5c26f33c02ed8a031066548 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 0df6664531a12cdd8fc873f0cac0dcb40243d3e9 upstream. It turns out that our polling of RWP is totally wrong when checking for it in the redistributors, as we test the *distributor* bit index, whereas it is a different bit number in the RDs... Oopsie boo. This is embarassing. Not only because it is wrong, but also because it took *8 years* to notice the blunder... Just fix the damn thing. Fixes: 021f6537 ("irqchip: gic-v3: Initial support for GICv3") Signed-off-by:
Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Reviewed-by:
Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Link: https://lore.kernel.org/r/20220315165034.794482-2-maz@kernel.org Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Miaohe Lin authored
stable inclusion from stable-4.19.238 commit 39a32f3c06f6d68a530bf9612afa19f50f12e93d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 4ad099559b00ac01c3726e5c95dc3108ef47d03e upstream. If mpol_new is allocated but not used in restart loop, mpol_new will be freed via mpol_put before returning to the caller. But refcnt is not initialized yet, so mpol_put could not do the right things and might leak the unused mpol_new. This would happen if mempolicy was updated on the shared shmem file while the sp->lock has been dropped during the memory allocation. This issue could be triggered easily with the below code snippet if there are many processes doing the below work at the same time: shmid = shmget((key_t)5566, 1024 * PAGE_SIZE, 0666|IPC_CREAT); shm = shmat(shmid, 0, 0); loop many times { mbind(shm, 1024 * PAGE_SIZE, MPOL_LOCAL, mask, maxnode, 0); mbind(shm + 128 * PAGE_SIZE, 128 * PAGE_SIZE, MPOL_DEFAULT, mask, maxnode, 0); } Link: https://lkml.kernel.org/r/20220329111416.27954-1-linmiaohe@huawei.com Fixes: 42288fe3 ("mm: mempolicy: Convert shared_policy mutex to spinlock") Signed-off-by:
Miaohe Lin <linmiaohe@huawei.com> Acked-by:
Michal Hocko <mhocko@suse.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [3.8] Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Paolo Bonzini authored
stable inclusion from stable-4.19.238 commit e2c328c2a8f9de8b761bd4025b66c63120c55761 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 01e67e04c28170c47700c2c226d732bbfedb1ad0 upstream. If an mremap() syscall with old_size=0 ends up in move_page_tables(), it will call invalidate_range_start()/invalidate_range_end() unnecessarily, i.e. with an empty range. This causes a WARN in KVM's mmu_notifier. In the past, empty ranges have been diagnosed to be off-by-one bugs, hence the WARNing. Given the low (so far) number of unique reports, the benefits of detecting more buggy callers seem to outweigh the cost of having to fix cases such as this one, where userspace is doing something silly. In this particular case, an early return from move_page_tables() is enough to fix the issue. Link: https://lkml.kernel.org/r/20220329173155.172439-1-pbonzini@redhat.com Reported-by:
<syzbot+6bde52d89cfdf9f61425@syzkaller.appspotmail.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Mauricio Faria de Oliveira authored
stable inclusion from stable-4.19.238 commit ef521626ac9f7e3b20eeb52b64a4f85992702c54 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 6c8e2a256915a223f6289f651d6b926cd7135c9e upstream. Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(*ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = *ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char *buf; fd = open(DEV, O_RDONLY | O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. - v5.17-rc3: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x0 # free | grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment #50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages") Signed-off-by:
Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Hill <daniel.hill@canonical.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Dongdong Tao <dongdong.tao@canonical.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Gerald Yang <gerald.yang@canonical.com> Cc: Heitor Alves de Siqueira <halves@canonical.com> Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> Cc: <stable@vger.kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> [mfo: backport: replace folio/test_flag with page/flag equivalents; real Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)") in v4.] Signed-off-by:
Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
NeilBrown authored
stable inclusion from stable-4.19.238 commit 194722a9e87eaf7fc3af9a38f1330a224342e730 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit c265de257f558a05c1859ee9e3fed04883b9ec0e ] The commit handling code is not safe against memory-pressure deadlocks when writing to swap. In particular, nfs_commitdata_alloc() blocks indefinitely waiting for memory, and this can consume all available workqueue threads. swap-out most likely uses STABLE writes anyway as COND_STABLE indicates that a stable write should be used if the write fits in a single request, and it normally does. However if we ever swap with a small wsize, or gather unusually large numbers of pages for a single write, this might change. For safety, make it explicit in the code that direct writes used for swap must always use FLUSH_STABLE. Signed-off-by:
NeilBrown <neilb@suse.de> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
NeilBrown authored
stable inclusion from stable-4.19.238 commit 2615cdae1daac1f1548394a5843252cac1049b6c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 64158668ac8b31626a8ce48db4cad08496eb8340 ] 1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding possible deadlocks with "fs_reclaim". These deadlocks could, I believe, eventuate if a buffered read on the swapfile was attempted. We don't need coherence with the page cache for a swap file, and buffered writes are forbidden anyway. There is no other need for i_rwsem during direct IO. So never take it for swap_rw() 2/ generic_write_checks() explicitly forbids writes to swap, and performs checks that are not needed for swap. So bypass it for swap_rw(). Signed-off-by:
NeilBrown <neilb@suse.de> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
NeilBrown authored
stable inclusion from stable-4.19.238 commit 9970fca80164b3cbfd700acd735aa043d40a73ed category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit c487216bec83b0c5a8803e5c61433d33ad7b104d ] When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. mempools are particularly a problem as memory can only be released back to the mempool by an async rpc task running. If all available workqueue threads are waiting on the mempool, no thread is available to return anything. rpc_malloc() can block, and this might cause deadlocks. So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if blocking is acceptable. Signed-off-by:
NeilBrown <neilb@suse.de> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Trond Myklebust authored
stable inclusion from stable-4.19.238 commit 2724d7743bb7df345feb7e4f88ecee97f0416d65 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 3e17898aca293a24dae757a440a50aa63ca29671 ] If memory allocation triggers a direct reclaim from the state recovery thread, then we can deadlock. Use memalloc_nofs_save/restore to ensure that doesn't happen. Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Sven Eckelmann authored
stable inclusion from stable-4.19.238 commit d70c60cf2bdbe7ef08e227427fc3de99a93931d5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit a02192151b7dbf855084c38dca380d77c7658353 ] Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is added to rtnetlink messages. This fixes iproute2 which otherwise resolved the link interface to an interface in the wrong namespace. Test commands: ip netns add nst ip link add dummy0 type dummy ip link add link macvtap0 link dummy0 type macvtap ip link set macvtap0 netns nst ip -netns nst link show macvtap0 Before: 10: macvtap0@gre0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500 link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff After: 10: macvtap0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500 link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0 Reported-by:
Leonardo Mörlein <freifunk@irrelefant.net> Signed-off-by:
Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20220228003240.1337426-1-sven@narfation.org Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Jordy Zomer authored
stable inclusion from stable-4.19.238 commit 7ae2c5b89da3cfaf856df880af27d3bb32a74b3d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit cd9c88da171a62c4b0f1c70e50c75845969fbc18 ] It appears like cmd could be a Spectre v1 gadget as it's supplied by a user and used as an array index. Prevent the contents of kernel memory from being leaked to userspace via speculative execution by using array_index_nospec. Signed-off-by:
Jordy Zomer <jordy@pwning.systems> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Ido Schimmel authored
stable inclusion from stable-4.19.238 commit 75517bd7e4afad5a800b4d14f80aa7e6f73a9681 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 0c51e12e218f20b7d976158fdc18019627326f7a ] In case user space sends a packet destined to a broadcast address when a matching broadcast route is not configured, the kernel will create a unicast neighbour entry that will never be resolved [1]. When the broadcast route is configured, the unicast neighbour entry will not be invalidated and continue to linger, resulting in packets being dropped. Solve this by invalidating unresolved neighbour entries for broadcast addresses after routes for these addresses are internally configured by the kernel. This allows the kernel to create a broadcast neighbour entry following the next route lookup. Another possible solution that is more generic but also more complex is to have the ARP code register a listener to the FIB notification chain and invalidate matching neighbour entries upon the addition of broadcast routes. It is also possible to wave off the issue as a user space problem, but it seems a bit excessive to expect user space to be that intimately familiar with the inner workings of the FIB/neighbour kernel code. [1] https://lore.kernel.org/netdev/55a04a8f-56f3-f73c-2aea-2195923f09d1@huawei.com/ Reported-by:
Wang Hai <wanghai38@huawei.com> Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Tested-by:
Wang Hai <wanghai38@huawei.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Randy Dunlap authored
stable inclusion from stable-4.19.238 commit 9727c3165ba48f58befb9f2b6069bcc6ca048309 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 460a79e18842caca6fa0c415de4a3ac1e671ac50 upstream. __setup() handlers should return 1 if the command line option is handled and 0 if not (or maybe never return 0; it just pollutes init's environment). The only reason that this particular __setup handler does not pollute init's environment is that the setup string contains a '.', as in "cgroup.memory". This causes init/main.c::unknown_boottoption() to consider it to be an "Unused module parameter" and ignore it. (This is for parsing of loadable module parameters any time after kernel init.) Otherwise the string "cgroup.memory=whatever" would be added to init's environment strings. Instead of relying on this '.' quirk, just return 1 to indicate that the boot option has been handled. Note that there is no warning message if someone enters: cgroup.memory=anything_invalid Link: https://lkml.kernel.org/r/20220222005811.10672-1-rdunlap@infradead.org Fixes: f7e1cb6e ("mm: memcontrol: account socket memory in unified hierarchy memory controller") Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Reported-by:
Igor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Reviewed-by:
Michal Koutný <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Rafael J. Wysocki authored
stable inclusion from stable-4.19.238 commit 28d5387c1994f5e1e0d41b30a1f3dd6e1f609252 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 40d8abf364bcab23bc715a9221a3c8623956257b upstream. If the NumEntries field in the _CPC return package is less than 2, do not attempt to access the "Revision" element of that package, because it may not be present then. Fixes: 337aadff ("ACPI: Introduce CPU performance controls using CPPC") BugLink: https://lore.kernel.org/lkml/20220322143534.GC32582@xsang-OptiPlex-9020/ Reported-by:
kernel test robot <oliver.sang@intel.com> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by:
Huang Rui <ray.huang@amd.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Theodore Ts'o authored
stable inclusion from stable-4.19.238 commit 0d3a6926f7e8be3c897fa46216ce13b119a9f56a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit cc5095747edfb054ca2068d01af20be3fcc3634f ] [un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance. A related race was noted by Jan Kara in 2018[1]; however, more recently instead of it being a very hard-to-hit race, it could be reliably triggered by process_vm_writev(2) which was discovered by Syzbot[2]. This is technically a bug in mm/gup.c, but arguably ext4 is fragile in that if some other kernel subsystem dirty pages without properly notifying the file system using page_mkwrite(), ext4 will BUG, while other file systems will not BUG (although data will still be lost). So instead of crashing with a BUG, issue a warning (since there may be potential data loss) and just mark the page as clean to avoid unprivileged denial of service attacks until the problem can be properly fixed. More discussion and background can be found in the thread starting at [2]. [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com Reported-by:
<syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com> Reported-by:
Lee Jones <lee.jones@linaro.org> Signed-off-by:
Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Dmitry Baryshkov authored
stable inclusion from stable-4.19.238 commit be8bc05f38d667eda1e820bc6f69234795be7809 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 524bb1da785a7ae43dd413cd392b5071c6c367f8 ] The function device_pm_check_callbacks() can be called under the spin lock (in the reported case it happens from genpd_add_device() -> dev_pm_domain_set(), when the genpd uses spinlocks rather than mutexes. However this function uncoditionally uses spin_lock_irq() / spin_unlock_irq(), thus not preserving the CPU flags. Use the irqsave/irqrestore instead. The backtrace for the reference: [ 2.752010] ------------[ cut here ]------------ [ 2.756769] raw_local_irq_restore() called with IRQs enabled [ 2.762596] WARNING: CPU: 4 PID: 1 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x34/0x50 [ 2.772338] Modules linked in: [ 2.775487] CPU: 4 PID: 1 Comm: swapper/0 Tainted: G S 5.17.0-rc6-00384-ge330d0d82eff-dirty #684 [ 2.781384] Freeing initrd memory: 46024K [ 2.785839] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2.785841] pc : warn_bogus_irq_restore+0x34/0x50 [ 2.785844] lr : warn_bogus_irq_restore+0x34/0x50 [ 2.785846] sp : ffff80000805b7d0 [ 2.785847] x29: ffff80000805b7d0 x28: 0000000000000000 x27: 0000000000000002 [ 2.785850] x26: ffffd40e80930b18 x25: ffff7ee2329192b8 x24: ffff7edfc9f60800 [ 2.785853] x23: ffffd40e80930b18 x22: ffffd40e80930d30 x21: ffff7edfc0dffa00 [ 2.785856] x20: ffff7edfc09e3768 x19: 0000000000000000 x18: ffffffffffffffff [ 2.845775] x17: 6572206f74206465 x16: 6c696166203a3030 x15: ffff80008805b4f7 [ 2.853108] x14: 0000000000000000 x13: ffffd40e809550b0 x12: 00000000000003d8 [ 2.860441] x11: 0000000000000148 x10: ffffd40e809550b0 x9 : ffffd40e809550b0 [ 2.867774] x8 : 00000000ffffefff x7 : ffffd40e809ad0b0 x6 : ffffd40e809ad0b0 [ 2.875107] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000 [ 2.882440] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff7edfc03a8000 [ 2.889774] Call trace: [ 2.892290] warn_bogus_irq_restore+0x34/0x50 [ 2.896770] _raw_spin_unlock_irqrestore+0x94/0xa0 [ 2.901690] genpd_unlock_spin+0x20/0x30 [ 2.905724] genpd_add_device+0x100/0x2d0 [ 2.909850] __genpd_dev_pm_attach+0xa8/0x23c [ 2.914329] genpd_dev_pm_attach_by_id+0xc4/0x190 [ 2.919167] genpd_dev_pm_attach_by_name+0x3c/0xd0 [ 2.924086] dev_pm_domain_attach_by_name+0x24/0x30 [ 2.929102] psci_dt_attach_cpu+0x24/0x90 [ 2.933230] psci_cpuidle_probe+0x2d4/0x46c [ 2.937534] platform_probe+0x68/0xe0 [ 2.941304] really_probe.part.0+0x9c/0x2fc [ 2.945605] __driver_probe_device+0x98/0x144 [ 2.950085] driver_probe_device+0x44/0x15c [ 2.954385] __device_attach_driver+0xb8/0x120 [ 2.958950] bus_for_each_drv+0x78/0xd0 [ 2.962896] __device_attach+0xd8/0x180 [ 2.966843] device_initial_probe+0x14/0x20 [ 2.971144] bus_probe_device+0x9c/0xa4 [ 2.975092] device_add+0x380/0x88c [ 2.978679] platform_device_add+0x114/0x234 [ 2.983067] platform_device_register_full+0x100/0x190 [ 2.988344] psci_idle_init+0x6c/0xb0 [ 2.992113] do_one_initcall+0x74/0x3a0 [ 2.996060] kernel_init_freeable+0x2fc/0x384 [ 3.000543] kernel_init+0x28/0x130 [ 3.004132] ret_from_fork+0x10/0x20 [ 3.007817] irq event stamp: 319826 [ 3.011404] hardirqs last enabled at (319825): [<ffffd40e7eda0268>] __up_console_sem+0x78/0x84 [ 3.020332] hardirqs last disabled at (319826): [<ffffd40e7fd6d9d8>] el1_dbg+0x24/0x8c [ 3.028458] softirqs last enabled at (318312): [<ffffd40e7ec90410>] _stext+0x410/0x588 [ 3.036678] softirqs last disabled at (318299): [<ffffd40e7ed1bf68>] __irq_exit_rcu+0x158/0x174 [ 3.045607] ---[ end trace 0000000000000000 ]--- Signed-off-by:
Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Darren Hart authored
stable inclusion from stable-4.19.238 commit 1e8a54689f6573367134b7dc96f9b5c86a0efd4f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 3f8dec116210ca649163574ed5f8df1e3b837d07 ] Platforms with large BERT table data can trigger soft lockup errors while attempting to print the entire BERT table data to the console at boot: watchdog: BUG: soft lockup - CPU#160 stuck for 23s! [swapper/0:1] Observed on Ampere Altra systems with a single BERT record of ~250KB. The original bert driver appears to have assumed relatively small table data. Since it is impractical to reassemble large table data from interwoven console messages, and the table data is available in /sys/firmware/acpi/tables/data/BERT limit the size for tables printed to the console to 1024 (for no reason other than it seemed like a good place to kick off the discussion, would appreciate feedback from existing users in terms of what size would maintain their current usage model). Alternatively, we could make printing a CONFIG option, use the bert_disable boot arg (or something similar), or use a debug log level. However, all those solutions require extra steps or change the existing behavior for small table data. Limiting the size preserves existing behavior on existing platforms with small table data, and eliminates the soft lockups for platforms with large table data, while still making it available. Signed-off-by:
Darren Hart <darren@os.amperecomputing.com> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Rafael J. Wysocki authored
stable inclusion from stable-4.19.238 commit 93c3e89c9f28f71ccbb51d91821dbe0aa9bfcb16 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 0c9992315e738e7d6e927ef36839a466b080dba6 ] ACPICA commit b1c3656ef4950098e530be68d4b589584f06cddc Prevent acpi_ns_walk_namespace() from crashing when called with start_node equal to ACPI_ROOT_OBJECT if the Namespace has not been instantiated yet and acpi_gbl_root_node is NULL. For instance, this can happen if the kernel is run with "acpi=off" in the command line. Link: https://github.com/acpica/acpica/commit/b1c3656ef4950098e530be68d4b589584f06cddc Link: https://lore.kernel.org/linux-acpi/CAJZ5v0hJWW_vZ3wwajE7xT38aWjY7cZyvqMJpXHzUL98-SiCVQ@mail.gmail.com/ Reported-by:
Hans de Goede <hdegoede@redhat.com> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Pablo Neira Ayuso authored
stable inclusion from stable-4.19.238 commit 126f86e865ef1e776e853b97b4af2af3d69d4b1c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit f2dd495a8d589371289981d5ed33e6873df94ecc ] Do not reset IP_CT_TCP_FLAG_BE_LIBERAL flag in out-of-sync scenarios coming before the TCP window tracking, otherwise such connections will fail in the window check. Update tcp_options() to leave this flag in place and add a new helper function to reset the tcp window state. Based on patch from Sven Auhagen. Fixes: c4832c7b ("netfilter: nf_ct_tcp: improve out-of-sync situation in TCP tracking") Tested-by:
Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Alexey Khoroshilov authored
stable inclusion from stable-4.19.238 commit 97cc035bdfe754dbfbc2c9b04bc46bc71ef5bc93 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit cb8fac6d2727f79f211e745b16c9abbf4d8be652 ] [You don't often get email from khoroshilov@ispras.ru. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification. ] Overflow check in not needed anymore after we switch to kmalloc_array(). Signed-off-by:
Alexey Khoroshilov <khoroshilov@ispras.ru> Fixes: a4f743a6 ("NFSv4.1: Convert open-coded array allocation calls to kmalloc_array()") Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Uwe Kleine-König authored
stable inclusion from stable-4.19.238 commit 157224dbc7270c1f9af791d96b68b8aa19f0168e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit dedab69fd650ea74710b2e626e63fd35584ef773 ] Set em485->active_timer = NULL isn't always enough to take out the stop timer. While there is a check that it acts in the right state (i.e. waiting for RTS-after-send to pass after sending some chars) but the following might happen: - CPU1: some chars send, shifter becomes empty, stop tx timer armed - CPU0: more chars send before RTS-after-send expired - CPU0: shifter empty irq, port lock taken - CPU1: tx timer triggers, waits for port lock - CPU0: em485->active_timer = &em485->stop_tx_timer, hrtimer_start(), releases lock() - CPU1: get lock, see em485->active_timer == &em485->stop_tx_timer, tear down RTS too early This fix bases on research done by Steffen Trumtrar. Fixes: b86f86e8 ("serial: 8250: fix potential deadlock in rs485-mode") Signed-off-by:
Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Link: https://lore.kernel.org/r/20220215160236.344236-1-u.kleine-koenig@pengutronix.de Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Andy Shevchenko authored
stable inclusion from stable-4.19.238 commit a8b63dbebbd8eab46dc24d158c05eeb3a6c2a436 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 67ec6dd0b257bd81b4e9fcac89b29da72f6265e5 ] The pci_get_slot() increases its reference count, the caller must decrement the reference count by calling pci_dev_put(). Fixes: 90b9aacf ("serial: 8250_pci: add Intel Tangier support") Fixes: f549e94e ("serial: 8250_pci: add Intel Penwell ports") Reported-by:
Qing Wang <wangqing@vivo.com> Signed-off-by:
Andy Shevchenko <andriy.shevchenko@linux.intel.com> Depends-on: d9eda9ba ("serial: 8250_pci: Intel MID UART support to its own driver") Link: https://lore.kernel.org/r/20220215100920.41984-1-andriy.shevchenko@linux.intel.com Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Jakub Kicinski authored
stable inclusion from stable-4.19.238 commit 270b9e733f035af26cbccc7a0e3a7f5b005e7f3f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit ed0c99dc0f499ff8b6e75b5ae6092ab42be1ad39 ] tp->rx_opt.mss_clamp is not populated, yet, during TFO send so we rise it to the local MSS. tp->mss_cache is not updated, however: tcp_v6_connect(): tp->rx_opt.mss_clamp = IPV6_MIN_MTU - headers; tcp_connect(): tcp_connect_init(): tp->mss_cache = min(mtu, tp->rx_opt.mss_clamp) tcp_send_syn_data(): tp->rx_opt.mss_clamp = tp->advmss After recent fixes to ICMPv6 PTB handling we started dropping PMTU updates higher than tp->mss_cache. Because of the stale tp->mss_cache value PMTU updates during TFO are always dropped. Thanks to Wei for helping zero in on the problem and the fix! Fixes: c7bb4b89033b ("ipv6: tcp: drop silly ICMPv6 packet too big messages") Reported-by:
Andre Nash <alnash@fb.com> Reported-by:
Neil Spring <ntspring@fb.com> Reviewed-by:
Wei Wang <weiwan@google.com> Acked-by:
Yuchung Cheng <ycheng@google.com> Acked-by:
Martin KaFai Lau <kafai@fb.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Reviewed-by:
Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220321165957.1769954-1-kuba@kernel.org Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Petr Machata authored
stable inclusion from stable-4.19.238 commit f75f4abeec4c04b600a15b50c89a481f1e7435ee category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 0caf6d9922192dd1afa8dc2131abfb4df1443b9f ] When a netlink message is received, netlink_recvmsg() fills in the address of the sender. One of the fields is the 32-bit bitfield nl_groups, which carries the multicast group on which the message was received. The least significant bit corresponds to group 1, and therefore the highest group that the field can represent is 32. Above that, the UB sanitizer flags the out-of-bounds shift attempts. Which bits end up being set in such case is implementation defined, but it's either going to be a wrong non-zero value, or zero, which is at least not misleading. Make the latter choice deterministic by always setting to 0 for higher-numbered multicast groups. To get information about membership in groups >= 32, userspace is expected to use nl_pktinfo control messages[0], which are enabled by NETLINK_PKTINFO socket option. [0] https://lwn.net/Articles/147608/ The way to trigger this issue is e.g. through monitoring the BRVLAN group: # bridge monitor vlan & # ip link add name br type bridge Which produces the following citation: UBSAN: shift-out-of-bounds in net/netlink/af_netlink.c:162:19 shift exponent 32 is too large for 32-bit type 'int' Fixes: f7fa9b10 ("[NETLINK]: Support dynamic number of multicast groups per netlink family") Signed-off-by:
Petr Machata <petrm@nvidia.com> Reviewed-by:
Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/2bef6aabf201d1fc16cca139a744700cff9dcb04.1647527635.git.petrm@nvidia.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Xin Xiong authored
stable inclusion from stable-4.19.238 commit 9843c9c98f26c6ad843260b19bfdaa2598f2ae1e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit fecbd4a317c95d73c849648c406bcf1b6a0ec1cf ] The reference counting issue happens in several error handling paths on a refcounted object "nc->dmac". In these paths, the function simply returns the error code, forgetting to balance the reference count of "nc->dmac", increased earlier by dma_request_channel(), which may cause refcount leaks. Fix it by decrementing the refcount of specific object in those error paths. Fixes: f88fc122 ("mtd: nand: Cleanup/rework the atmel_nand driver") Co-developed-by:
Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by:
Xiyu Yang <xiyuyang19@fudan.edu.cn> Co-developed-by:
Xin Tan <tanxin.ctf@gmail.com> Signed-off-by:
Xin Tan <tanxin.ctf@gmail.com> Signed-off-by:
Xin Xiong <xiongx18@fudan.edu.cn> Reviewed-by:
Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by:
Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/linux-mtd/20220304085330.3610-1-xiongx18@fudan.edu.cn Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Jiasheng Jiang authored
stable inclusion from stable-4.19.238 commit 4869932423fa08d93e8dc9df3c962be34122518b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 3e68f331c8c759c0daa31cc92c3449b23119a215 ] For the possible failure of the platform_get_irq(), the returned irq could be error number and will finally cause the failure of the request_irq(). Consider that platform_get_irq() can now in certain cases return -EPROBE_DEFER, and the consequences of letting request_irq() effectively convert that into -EINVAL, even at probe time rather than later on. So it might be better to check just now. Fixes: 2c22120f ("MTD: OneNAND: interrupt based wait support") Signed-off-by:
Jiasheng Jiang <jiasheng@iscas.ac.cn> Signed-off-by:
Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/linux-mtd/20220104162658.1988142-1-jiasheng@iscas.ac.cn Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Randy Dunlap authored
stable inclusion from stable-4.19.238 commit bb5b72645288a1a4cf1c8b87fb7f469a52119555 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit b665eae7a788c5e2bc10f9ac3c0137aa0ad1fc97 ] If an invalid option value is used with "printk.devkmsg=<value>", it is silently ignored. If a valid option value is used, it is honored but the wrong return value (0) is used, indicating that the command line option had an error and was not handled. This string is not added to init's environment strings due to init/main.c::unknown_bootoption() checking for a '.' in the boot option string and then considering that string to be an "Unused module parameter". Print a warning message if a bad option string is used. Always return 1 from the __setup handler to indicate that the command line option has been handled. Fixes: 750afe7b ("printk: add kernel parameter to control writes to /dev/kmsg") Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Reported-by:
Igor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Cc: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Ogness <john.ogness@linutronix.de> Reviewed-by:
John Ogness <john.ogness@linutronix.de> Reviewed-by:
Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by:
Petr Mladek <pmladek@suse.com> Signed-off-by:
Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220228220556.23484-1-rdunlap@infradead.org Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Adrian Hunter authored
stable inclusion from stable-4.19.238 commit 31ceb83b64d6e246a13db42547c2e9d6655027e1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit d680ff24e9e14444c63945b43a37ede7cd6958f9 ] Reset appropriate variables in the parser loop between parsing separate filters, so that they do not interfere with parsing the next filter. Fixes: 375637bc ("perf/core: Introduce address range filtering") Signed-off-by:
Adrian Hunter <adrian.hunter@intel.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220131072453.2839535-4-adrian.hunter@intel.com Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Randy Dunlap authored
stable inclusion from stable-4.19.238 commit ab6ba985f3da39861ebc0ba047e6b6abcdc7ce6d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit f3303ff649dbf7dcdc6a6e1a922235b12b3028f4 ] __setup() handlers should return 1 to indicate that the boot option has been handled. Returning 0 causes a boot option to be listed in the Unknown kernel command line parameters and also added to init's arg list (if no '=' sign) or environment list (if of the form 'a=b'). Unknown kernel command line parameters "erst_disable bert_disable hest_disable BOOT_IMAGE=/boot/bzImage-517rc6", will be passed to user space. Run /sbin/init as init process with arguments: /sbin/init erst_disable bert_disable hest_disable with environment: HOME=/ TERM=linux BOOT_IMAGE=/boot/bzImage-517rc6 Fixes: a3e2acc5 ("ACPI / APEI: Add Boot Error Record Table (BERT) support") Fixes: a08f82d0 ("ACPI, APEI, Error Record Serialization Table (ERST) support") Fixes: 9dc96664 ("ACPI, APEI, HEST table parsing") Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Reported-by:
Igor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Reviewed-by:
"Huang, Ying" <ying.huang@intel.com> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Herbert Xu authored
stable inclusion from stable-4.19.238 commit c7249eb9dc07e30028c750e9057e80b24793a8e8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- [ Upstream commit 66eae850333d639fc278d6f915c6fc01499ea893 ] The function crypto_authenc_decrypt_tail discards its flags argument and always relies on the flags from the original request when starting its sub-request. This is clearly wrong as it may cause the SLEEPABLE flag to be set when it shouldn't. Fixes: 92d95ba9 ("crypto: authenc - Convert to new AEAD interface") Reported-by:
Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by:
Herbert Xu <herbert@gondor.apana.org.au> Tested-by:
Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by:
Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Liguang Zhang authored
stable inclusion from stable-4.19.238 commit 77ebcaf87c43c8331bcc8a41a3d99b6db3109f0c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 92912b175178c7e895f5e5e9f1e30ac30319162b upstream. Writes to a Downstream Port's Slot Control register are PCIe hotplug "commands." If the Port supports Command Completed events, software must wait for a command to complete before writing to Slot Control again. pcie_do_write_cmd() sets ctrl->cmd_busy when it writes to Slot Control. If software notification is enabled, i.e., PCI_EXP_SLTCTL_HPIE and PCI_EXP_SLTCTL_CCIE are set, ctrl->cmd_busy is cleared by pciehp_isr(). But when software notification is disabled, as it is when pcie_init() powers off an empty slot, pcie_wait_cmd() uses pcie_poll_cmd() to poll for command completion, and it neglects to clear ctrl->cmd_busy, which leads to spurious timeouts: pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0 (issued 2264 msec ago) pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0 (issued 2288 msec ago) Clear ctrl->cmd_busy in pcie_poll_cmd() when it detects a Command Completed event (PCI_EXP_SLTSTA_CC). [bhelgaas: commit log] Fixes: a5dd4b4b ("PCI: pciehp: Wait for hotplug command completion where necessary") Link: https://lore.kernel.org/r/20211111054258.7309-1-zhangliguang@linux.alibaba.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=215143 Link: https://lore.kernel.org/r/20211126173309.GA12255@wunner.de Signed-off-by:
Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by:
Bjorn Helgaas <bhelgaas@google.com> Reviewed-by:
Lukas Wunner <lukas@wunner.de> Cc: stable@vger.kernel.org # v4.19+ Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Sakari Ailus authored
stable inclusion from stable-4.19.238 commit eeccbbde9695f0562ad342f0e0375134a7ba9d39 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit babc92da5928f81af951663fc436997352e02d3a upstream. __acpi_node_get_property_reference() is documented to return -ENOENT if the caller requests a property reference at an index that does not exist, not -EINVAL which it actually does. Fix this by returning -ENOENT consistenly, independently of whether the property value is a plain reference or a package. Fixes: c343bc2c ("ACPI: properties: Align return codes of __acpi_node_get_property_reference()") Cc: 4.14+ <stable@vger.kernel.org> # 4.14+ Signed-off-by:
Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-
Rik van Riel authored
stable inclusion from stable-4.19.238 commit 79a8b36e124001547906f3612959fa548b58bb95 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA CVE: NA -------------------------------- commit 3149c79f3cb0e2e3bafb7cfadacec090cbd250d3 upstream. In some cases it appears the invalidation of a hwpoisoned page fails because the page is still mapped in another process. This can cause a program to be continuously restarted and die when it page faults on the page that was not invalidated. Avoid that problem by unmapping the hwpoisoned page when we find it. Another issue is that sometimes we end up oopsing in finish_fault, if the code tries to do something with the now-NULL vmf->page. I did not hit this error when submitting the previous patch because there are several opportunities for alloc_set_pte to bail out before accessing vmf->page, and that apparently happened on those systems, and most of the time on other systems, too. However, across several million systems that error does occur a handful of times a day. It can be avoided by returning VM_FAULT_NOPAGE which will cause do_read_fault to return before calling finish_fault. Link: https://lkml.kernel.org/r/20220325161428.5068d97e@imladris.surriel.com Fixes: e53ac7374e64 ("mm: invalidate hwpoison page cache page in fault path") Signed-off-by:
Rik van Riel <riel@surriel.com> Reviewed-by:
Miaohe Lin <linmiaohe@huawei.com> Tested-by:
Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yongqiang Liu <liuyongqiang13@huawei.com>
-