Jun 30, 2021
Suzuki K Poulose authored
mainline inclusion
from mainline-v5.0-rc1
commit c9460dcb
category: bugfix
bugzilla: 46773
CVE: NA
References: https://gitee.com/src-openeuler/kernel/issues/I235Y8

---------------------------

We have two entries for the ARM64_WORKAROUND_CLEAN_CACHE capability:

1) ARM Errata 826319, 827319, 824069, 819472 on A53 r0p[012]
2) ARM Erratum 819472 on A53 r0p[01]

Both have the same workaround. Merge these entries to avoid duplicate
entries for a single capability. Add a new Kconfig entry to control the
"capability" entry, to make it easier to handle combinations of the
CONFIGs.

Cc: Will Deacon <will.deacon@arm.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Heiner Kallweit authored
mainline inclusion
from mainline-v5.1-rc1
commit cbfd12b3
category: bugfix
bugzilla: NA
CVE: NA

------------------------------------------------------

The call to the phylib state machine in phy_stop() just ensures that
the state machine isn't re-triggered, but a state machine call may be
scheduled already. So let's call phy_stop_machine(). This also allows
us to get rid of the call to phy_stop_machine() in phy_disconnect().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Weiwei Deng <dengweiwei@huawei.com>
Reviewed-by: Zhaohui Zhong <zhongzhaohui@huawei.com>
Reviewed-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: You Shengzui <youshengzui@huawei.com>
Reviewed-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Heiner Kallweit authored
mainline inclusion
from mainline-v5.1-rc1
commit 124eee3f
category: bugfix
bugzilla: NA
CVE: NA

------------------------------------------------------

When bringing down the netdevice (including detaching it), calling
netif_carrier_off() directly or indirectly triggers an asynchronous
linkwatch event. This linkwatch event may eventually fail to access
chip registers in the ndo_get_stats/ndo_get_stats64 callback because
the device isn't accessible any longer, see the call trace in [0].

To prevent this scenario, don't check for IFF_UP only, but also make
sure that the netdevice is present.

[0] https://lists.openwall.net/netdev/2018/03/15/62

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Weiwei Deng <dengweiwei@huawei.com>
Reviewed-by: Zhaohui Zhong <zhongzhaohui@huawei.com>
Reviewed-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: You Shengzui <youshengzui@huawei.com>
Reviewed-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Heiner Kallweit authored
mainline inclusion
from mainline-v4.20-rc1
commit e8cfd9d6
category: bugfix
bugzilla: NA
CVE: NA

----------------------------------------------------

phy_stop() may be called e.g. when suspending, therefore all needed
actions should be performed synchronously. So add a synchronous call
to the state machine.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Weiwei Deng <dengweiwei@huawei.com>
Reviewed-by: Zhaohui Zhong <zhongzhaohui@huawei.com>
Reviewed-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: You Shengzui <youshengzui@huawei.com>
Reviewed-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Mike Rapoport authored
mainline inclusion
from mainline-v5.1-rc1
commit 5c01a25a
category: bugfix
bugzilla: 42401
CVE: NA

-------------------------------------------------

Marc Gonzalez reported the following kmemleak crash:

	Unable to handle kernel paging request at virtual address ffffffc021e00000
	Mem abort info:
	  ESR = 0x96000006
	  Exception class = DABT (current EL), IL = 32 bits
	  SET = 0, FnV = 0
	  EA = 0, S1PTW = 0
	Data abort info:
	  ISV = 0, ISS = 0x00000006
	  CM = 0, WnR = 0
	swapper pgtable: 4k pages, 39-bit VAs, pgdp = (____ptrval____)
	[ffffffc021e00000] pgd=000000017e3ba803, pud=000000017e3ba803, pmd=0000000000000000
	Internal error: Oops: 96000006 [#1] PREEMPT SMP
	Modules linked in:
	CPU: 6 PID: 523 Comm: kmemleak Tainted: G S W 5.0.0-rc1 #13
	Hardware name: Qualcomm Technologies, Inc. MSM8998 v1 MTP (DT)
	pstate: 80000085 (Nzcv daIf -PAN -UAO)
	pc : scan_block+0x70/0x190
	lr : scan_block+0x6c/0x190
	Process kmemleak (pid: 523, stack limit = 0x(____ptrval____))
	Call trace:
	 scan_block+0x70/0x190
	 scan_gray_list+0x108/0x1c0
	 kmemleak_scan+0x33c/0x7c0
	 kmemleak_scan_thread+0x98/0xf0
	 kthread+0x11c/0x120
	 ret_from_fork+0x10/0x1c
	Code: f9000fb4 d503201f 97ffffd2 35000580 (f9400260)

The crash happens when a no-map area is allocated in
early_init_dt_alloc_reserved_memory_arch(). The allocated region is
registered with kmemleak, but it is then removed from memblock using
memblock_remove(), which is not kmemleak-aware.

Replacing memblock_phys_alloc_range() with memblock_find_in_range()
makes sure that the allocated memory is not added to kmemleak and then
memblock_remove()'ing this memory is safe.

As a bonus, since memblock_find_in_range() ensures the allocation in
the specified range, the bounds check can be removed.

[rppt@linux.ibm.com: of: fix parameters order for call to memblock_find_in_range()]
Link: http://lkml.kernel.org/r/20190221112619.GC32004@rapoport-lnx
Link: http://lkml.kernel.org/r/20190213181921.GB15270@rapoport-lnx
Fixes: 3f0c8206 ("drivers: of: add initialization code for dynamic reserved memory")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Prateek Patel <prpatel@nvidia.com>
Tested-by: Marc Gonzalez <marc.w.gonzalez@free.fr>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Theodore Ts'o authored
mainline inclusion
from mainline-v5.2-rc2
commit 58be0106
category: bugfix
bugzilla: NA
CVE: NA
https://gitee.com/openeuler/kernel/issues/I1CVQ4?from=project-issue

-------------------------------------------------

Fixes: eb9d1bf0 ("random: only read from /dev/random after its pool has received 128 bits")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Theodore Ts'o authored
mainline inclusion
from mainline-v5.2-rc1
commit eb9d1bf0
category: bugfix
bugzilla: NA
CVE: NA
https://gitee.com/openeuler/kernel/issues/I1CVQ4?from=project-issue

-------------------------------------------------

Immediately after boot, we allow reads from /dev/random before its
entropy pool has been fully initialized. Fix this so that we don't
allow this until the blocking pool has received 128 bits.

We do this by repurposing the initialized flag in the entropy pool
struct, and use the initialized flag in the blocking pool to indicate
whether it is safe to pull from the blocking pool.

To do this, we needed to rework when we decide to push entropy from the
input pool to the blocking pool, since the initialized flag for the
input pool was used for this purpose. To simplify things, we no longer
use the initialized flag for that purpose, nor do we use the
entropy_total field any more.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Coly Li authored
mainline inclusion
from mainline-v5.9-rc1
commit b35fd742
category: bugfix
bugzilla: NA
CVE: NA

-------------------------------------------------

If a loop device is created with a backing NVMe SSD, the current loop
driver doesn't correctly set its queue's limits.discard_granularity and
leaves it as 0. If a discard request is issued at LBA 0 on this loop
device, the req_sects calculated in __blkdev_issue_discard() will be 0,
and the zero-length discard request will trigger a BUG() panic in the
generic block layer at block/blk-mq.c:563.

	[  955.565006][ C39] ------------[ cut here ]------------
	[  955.559660][ C39] invalid opcode: 0000 [#1] SMP NOPTI
	[  955.622171][ C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G E 5.8.0-default+ #40
	[  955.622171][ C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
	[  955.622175][ C39] RIP: 0010:blk_mq_end_request+0x107/0x110
	[  955.622177][ C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
	[  955.622179][ C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
	[  955.749277][ C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
	[  955.749278][ C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
	[  955.749279][ C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
	[  955.749279][ C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
	[  955.749280][ C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
	[  955.749281][ C39] FS: 0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
	[  955.749282][ C39] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[  955.749282][ C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
	[  955.749283][ C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
	[  955.749284][ C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
	[  955.749284][ C39] PKRU: 55555554
	[  955.749285][ C39] Call Trace:
	[  955.749290][ C39]  blk_done_softirq+0x99/0xc0
	[  957.550669][ C39]  __do_softirq+0xd3/0x45f
	[  957.550677][ C39]  ? smpboot_thread_fn+0x2f/0x1e0
	[  957.550679][ C39]  ? smpboot_thread_fn+0x74/0x1e0
	[  957.550680][ C39]  ? smpboot_thread_fn+0x14e/0x1e0
	[  957.550684][ C39]  run_ksoftirqd+0x30/0x60
	[  957.550687][ C39]  smpboot_thread_fn+0x149/0x1e0
	[  957.886225][ C39]  ? sort_range+0x20/0x20
	[  957.886226][ C39]  kthread+0x137/0x160
	[  957.886228][ C39]  ? kthread_park+0x90/0x90
	[  957.886231][ C39]  ret_from_fork+0x22/0x30
	[  959.117120][ C39] ---[ end trace 3dacdac97e2ed164 ]---

This is the procedure to reproduce the panic:

	# modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
	# losetup -f /dev/nvme0n1 --direct-io=on
	# blkdiscard /dev/loop0 -o 0 -l 0x200

This patch fixes the issue by checking q->limits.discard_granularity in
__blkdev_issue_discard() before composing the discard bio. If the value
is 0, print a warning and return -EOPNOTSUPP to the caller to indicate
that this buggy device driver doesn't support discard requests.

Fixes: 9b15d109 ("block: improve discard bio alignment in __blkdev_issue_discard()")
Fixes: c52abf56 ("loop: Better discard support for block devices")
Reported-and-suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Evan Green <evgreen@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
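
The failing arithmetic can be reproduced in ordinary userspace C. Below
is a sketch (my own condensation of the sector math described above;
`round_down` mimics the kernel macro and `issue_discard` is
illustrative, not the actual __blkdev_issue_discard() source) showing
how a granularity of 0 trims a request at LBA 0 to zero sectors, and
how the fix refuses up front:

```c
#include <errno.h>
#include <stdio.h>

/* Power-of-two round-down, as the kernel macro behaves; for y == 0 the
 * mask becomes 0 and the whole result collapses to 0. */
#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

static long issue_discard(unsigned long sector, unsigned long nr_sects,
			  unsigned long granularity, int fixed)
{
	/* The fix: granularity 0 means "no discard support", so refuse
	 * the request instead of composing a broken bio. */
	if (fixed && granularity == 0)
		return -EOPNOTSUPP;

	/* Condensed alignment logic: trim the request so it ends on a
	 * granularity boundary. */
	unsigned long max_sects = round_down(0xffffffffUL, granularity) -
				  (sector & (granularity - 1));
	unsigned long req_sects = nr_sects < max_sects ? nr_sects : max_sects;

	printf("granularity=%lu -> req_sects=%lu%s\n", granularity, req_sects,
	       req_sects == 0 ? " (zero-length bio: BUG() in blk-mq)" : "");
	return 0;
}

int main(void)
{
	issue_discard(0, 1, 0, 0);   /* old behavior on the bad loop device */
	printf("fixed path returns %ld (-EOPNOTSUPP)\n",
	       issue_discard(0, 1, 0, 1));
	issue_discard(0, 8, 8, 1);   /* sane granularity: req_sects = 8 */
	return 0;
}
```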
-
Ming Lei authored
mainline inclusion
from mainline-v5.9-rc3
commit bcb21c8c
category: bugfix
bugzilla: NA
CVE: NA

-------------------------------------------------

In case of a block device backend, if the backend supports write
zeroes, the loop device will set the queue flag QUEUE_FLAG_DISCARD.
However, limits.discard_granularity isn't set up, which is wrong; see
the following description in Documentation/ABI/testing/sysfs-block:

	A discard_granularity of 0 means that the device does not
	support discard functionality.

Especially 9b15d109 ("block: improve discard bio alignment in
__blkdev_issue_discard()") starts to take q->limits.discard_granularity
into account when computing max discard sectors. A zero discard
granularity may cause a kernel oops, or fail discard requests even
though the loop queue claims discard support via QUEUE_FLAG_DISCARD.

Fix the issue by setting up the discard granularity and alignment.

Fixes: c52abf56 ("loop: Better discard support for block devices")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Evan Green <evgreen@chromium.org>
Cc: Gwendal Grignou <gwendal@chromium.org>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Eric W. Biederman authored
mainline inclusion
from mainline-v5.6
commit b95e31c0
category: bugfix
bugzilla: 32426
CVE: NA

------------------------

The reasons why the extra posix_cpu_timers_exit_group() invocation has
been added are not entirely clear from the commit message. Today all
that posix_cpu_timers_exit_group() does is stop timers that are
tracking the task from firing. Every other operation on those timers is
still allowed.

The practical implication of this is that posix_cpu_timer_del(), which
could not get the siglock after the thread group leader has exited
(because sighand == NULL), would be able to run successfully because
the timer was already dequeued.

With that locking issue fixed there is no point in disabling all of the
timers. So remove this "temporary" hack.

Fixes: e0a70217 ("posix-cpu-timers: workaround to suppress the problems with mt exec")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87o8tityzs.fsf@x220.int.ebiederm.org
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Muchun Song authored
mainline inclusion
from mainline-v5.9-rc1
commit 10de795a
category: bugfix
bugzilla: NA
CVE: NA

---------------------------

Fix the compiler warning (shown below) for !CONFIG_KPROBES_ON_FTRACE.

	kernel/kprobes.c: In function 'kill_kprobe':
	kernel/kprobes.c:1116:33: warning: statement with no effect [-Wunused-value]
	 1116 | #define disarm_kprobe_ftrace(p) (-ENODEV)
	      |                                 ^
	kernel/kprobes.c:2154:3: note: in expansion of macro 'disarm_kprobe_ftrace'
	 2154 |   disarm_kprobe_ftrace(p);

Link: https://lore.kernel.org/r/20200805142136.0331f7ea@canb.auug.org.au
Link: https://lkml.kernel.org/r/20200805172046.19066-1-songmuchun@bytedance.com
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 0cb2f137 ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Jian Cheng <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
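
The warning is easy to reproduce in isolation: a stub macro that
expands to a bare value becomes a statement with no effect, while a
static inline stub (the shape the fix moves toward) can have its return
value ignored without complaint and still type-checks its argument. A
minimal sketch with hypothetical stub names, compiled with
`gcc -Wunused-value`:

```c
#include <errno.h>
#include <stdio.h>

struct kprobe { int dummy; };

/* Old style: expands to a bare value; used as a statement it triggers
 * "warning: statement with no effect [-Wunused-value]". */
#define disarm_stub_macro(p) (-ENODEV)

/* New style: a function whose return value may be silently ignored,
 * and whose argument is still type-checked. */
static inline int disarm_stub_inline(struct kprobe *p)
{
	return -ENODEV;
}

int main(void)
{
	struct kprobe kp;

	/* disarm_stub_macro(&kp);     <- would warn as a bare statement */
	disarm_stub_inline(&kp);       /* no warning */

	printf("stub returns %d\n", disarm_stub_inline(&kp));
	return 0;
}
```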
-
Tommi Rantala authored
mainline inclusion
from mainline-5.6-rc6
commit 29b4f5f1
category: bugfix
bugzilla: 31774
CVE: NA

-------------------------------------------------

Since glibc 2.28, when running 'perf top --stdio', input handling no
longer works: hitting any key always just prints the "Mapped keys" help
text.

To fix it, call clearerr() in the display_thread() loop to clear any
sticky EOF error, as instructed in the glibc NEWS file
(https://sourceware.org/git/?p=glibc.git;a=blob;f=NEWS):

	* All stdio functions now treat end-of-file as a sticky condition.
	  If you read from a file until EOF, and then the file is enlarged
	  by another process, you must call clearerr or another function
	  with the same effect (e.g. fseek, rewind) before you can read the
	  additional data. This corrects a longstanding C99 conformance
	  bug. It is most likely to affect programs that use stdio to read
	  interactive input from a terminal. (Bug #1190.)

Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200305083714.9381-2-tommi.t.rantala@nokia.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Jian Cheng <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
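
The glibc behavior can be seen in a few lines. This is a minimal sketch
of the pattern the fix applies (a hypothetical key-reading loop, not
perf's actual display_thread()):

```c
#include <stdio.h>

int main(void)
{
	int c;

	for (;;) {
		/* With glibc >= 2.28, once getc() has returned EOF the
		 * stream stays at EOF; clear the sticky condition each
		 * iteration so later keystrokes are seen again. */
		clearerr(stdin);

		c = getc(stdin);
		if (c == EOF)
			continue;   /* perf gates this loop with poll(2) */
		if (c == 'q')
			break;
		printf("got key '%c'\n", c);
	}
	return 0;
}
```

With glibc < 2.28 the clearerr() call is harmless; with >= 2.28 it is
what lets the loop see input arriving after a transient EOF.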
-
Jacob Pan authored
mainline inclusion
from mainline-v5.7-rc1
commit 902baf61
category: bugfix
bugzilla: 34102
CVE: NA

-------------------------------------------------------------------------

Move the canonical address check before mmget_not_zero() to avoid an mm
reference leak.

Fixes: 9d8c3af3 ("iommu/vt-d: IOMMU Page Request needs to check if address is canonical.")
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
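
The leak pattern is generic: an early-return check placed after a
refcount acquisition leaves the count elevated on the bail-out path. A
userspace illustration (hypothetical types and names, not the iommu
driver's code):

```c
#include <stdbool.h>
#include <stdio.h>

struct mm { int refcount; };

static bool mmget_not_zero(struct mm *mm)
{
	if (mm->refcount == 0)
		return false;
	mm->refcount++;
	return true;
}

static void mmput(struct mm *mm) { mm->refcount--; }

/* Stand-in validity test for the example. */
static bool is_canonical_address(unsigned long addr) { return addr < 0x1000; }

static int handle_request(struct mm *mm, unsigned long addr)
{
	/* Fix: validate first. Before the change, mmget_not_zero() ran
	 * first and the bad-address path returned without mmput(),
	 * leaking one reference per bad request. */
	if (!is_canonical_address(addr))
		return -1;

	if (!mmget_not_zero(mm))
		return -1;

	/* ... fault handling would happen here ... */

	mmput(mm);
	return 0;
}

int main(void)
{
	struct mm mm = { .refcount = 1 };

	handle_request(&mm, 0xdeadbeef);   /* rejected, nothing leaked */
	printf("refcount after bad request: %d\n", mm.refcount);
	return 0;
}
```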
-
Yunsheng Lin authored
mainline inclusion
from mainline-v5.4-rc1
commit 6b0c54e7
category: bugfix
bugzilla: 21318
CVE: NA

-------------------------------------------------------------------------

The cookie is dereferenced before null checking in the function
iommu_dma_init_domain. This patch moves the dereferencing after the
null checking.

Fixes: fdbe574e ("iommu/dma: Allow MSI-only cookies")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Conflicts:
  drivers/iommu/dma-iommu.c
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
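
A generic sketch of the pattern being fixed (hypothetical types, not
drivers/iommu/dma-iommu.c itself): dereference only after the NULL
check, otherwise the check is dead code and a NULL cookie crashes:

```c
#include <stddef.h>
#include <stdio.h>

struct cookie { int type; };

static int init_domain(const struct cookie *cookie)
{
	int type;

	if (!cookie)
		return -1;

	/* Buggy order would be 'int type = cookie->type;' above the
	 * check: NULL is dereferenced before it can be rejected. */
	type = cookie->type;

	printf("cookie type %d\n", type);
	return 0;
}

int main(void)
{
	init_domain(NULL);                           /* safely rejected */
	init_domain(&(struct cookie){ .type = 1 });  /* normal path */
	return 0;
}
```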
-
Paul E. McKenney authored
mainline inclusion
from mainline-v5.6-rc1
commit 844a378d
category: bugfix
bugzilla: 28851
CVE: NA

-------------------------------------------------------------------------

The ->srcu_last_gp_end field is accessed from any CPU at any time by
synchronize_srcu(), so non-initialization references need to use
READ_ONCE() and WRITE_ONCE(). This commit therefore makes that change.

Reported-by: <syzbot+08f3e9d26e5541e1ecf2@syzkaller.appspotmail.com>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Conflicts:
  kernel/rcu/srcutree.c
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
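
For word-sized fields, READ_ONCE()/WRITE_ONCE() boil down to volatile
accesses that force exactly one real load or store and keep the
compiler from tearing, fusing, or re-reading the shared field. A
userspace approximation (simplified single-word versions; the kernel's
macros handle more cases):

```c
#include <stdio.h>

#define READ_ONCE(x)      (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

static unsigned long srcu_last_gp_end;

static void update_gp_end(unsigned long now)
{
	/* One plain store the compiler may not split or defer. */
	WRITE_ONCE(srcu_last_gp_end, now);
}

static unsigned long read_gp_end(void)
{
	/* One plain load, not cached from an earlier read. */
	return READ_ONCE(srcu_last_gp_end);
}

int main(void)
{
	update_gp_end(42);
	printf("last gp end: %lu\n", read_gp_end());
	return 0;
}
```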
-
Vladimir Murzin authored
mainline inclusion
from mainline-v5.6-rc1
commit 98346023
category: bugfix
bugzilla: 28680
CVE: NA

-------------------------------------------------------------------------

arm64 provides an always-working implementation of
futex_atomic_cmpxchg_inatomic(), so there is no need to check it at
runtime.

Reported-by: Piyush swami <Piyush.swami@arm.com>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Conflicts:
  arch/arm64/Kconfig
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Al Viro authored
mainline inclusion
from mainline-5.1-rc1
commit 6d7fbce7
category: bugfix
bugzilla: 30217
CVE: NA

---------------------------

Unused now, and impossible to use safely anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
David Rientjes authored
mainline inclusion
from mainline-5.6-rc1
commit f42f2552
category: bugfix
bugzilla: 29954
CVE: NA

-------------------------------------------------

If the thp defrag setting "defer" is used and a newline is *not* used
when writing to the sysfs file, this is interpreted as the
"defer+madvise" option. This is because we do prefix matching, and if
five characters are written without a newline, the current code ends up
comparing to the first five bytes of the "defer+madvise" option and
using that instead.

Use the more appropriate sysfs_streq() that handles the trailing
newline for us. Since this doubles as a nice cleanup, do it in
enabled_store() as well.

The current implementation relies on prefix matching: the number of
bytes compared is either the number of bytes written or the length of
the option being compared. With a newline, "defer\n" does not match
"defer+madvise"; without a newline, however, "defer" is considered to
match "defer+madvise" (prefix matching is only comparing the first five
bytes). The end result is that writing "defer" is broken unless it has
an additional trailing character.

This means that writing "madv" in the past would match and set
"madvise". With strict checking, that no longer is the case, but it is
unlikely anybody is currently doing this.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001171411020.56385@chino.kir.corp.google.com
Fixes: 21440d7e ("mm, thp: add new defer+madvise defrag option")
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f42f2552)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
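
A userspace model of the two comparison strategies (my own
re-implementation of the sysfs_streq() semantics, not mm/huge_memory.c
itself): sysfs writes may or may not carry a trailing newline, so the
whole token must be compared while tolerating exactly one trailing
'\n':

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Modeled on the kernel's sysfs_streq(): full-token match, one
 * optional trailing newline allowed on the written buffer. */
static bool streq_sysfs(const char *buf, const char *opt)
{
	size_t n = strlen(buf);

	if (n && buf[n - 1] == '\n')
		n--;
	return n == strlen(opt) && !strncmp(buf, opt, n);
}

/* The old test: match if the written bytes are a prefix of the option. */
static bool prefix_match(const char *buf, size_t count, const char *opt)
{
	return !strncmp(buf, opt, count);
}

int main(void)
{
	/* echo -n "defer" > defrag: the old code matched "defer+madvise". */
	printf("prefix: %d\n", prefix_match("defer", 5, "defer+madvise"));
	printf("streq : %d\n", streq_sysfs("defer", "defer+madvise"));
	printf("streq : %d\n", streq_sysfs("defer\n", "defer"));
	return 0;
}
```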
-
Trond Myklebust authored
mainline inclusion
from mainline-5.6-rc1
commit 57f64034
category: bugfix
bugzilla: 30516
CVE: NA

-----------------------------------------------

vfs_clone_file_range() can modify the metadata on the source file too,
so we need to commit that to stable storage as well.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Acked-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Trond Myklebust authored
mainline inclusion
from mainline-v5.5-rc1
commit a25e3726
category: bugfix
bugzilla: 27346
CVE: NA

-------------------------------------------------

The NFSv4.2 CLONE operation has implicit persistence requirements on
the target file, since there is no protocol requirement that the client
issue a separate operation to persist data. For that reason, we should
call vfs_fsync_range() on the destination file after a successful call
to vfs_clone_file_range().

Fixes: ffa0160a ("nfsd: implement the NFSv4.2 CLONE operation")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Conflicts:
  fs/nfsd/nfs4proc.c
  fs/nfsd/vfs.c
  42ec3d4c ("vfs: make remap_file_range functions take and return bytes completed")
  2e5dfc99 ("vfs: combine the clone and dedupe into a single remap_file_range")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Arvind Sankar authored
mainline inclusion
from v5.5-rc1
commit dacc9092
category: bugfix
bugzilla: 28820
CVE: NA

---------------------------

When checking whether the reported lfb_size makes sense, the height *
stride result is page-aligned before seeing whether it exceeds the
reported size. This doesn't work if height * stride is not an exact
number of pages. For example, as reported in the kernel bugzilla below,
an 800x600x32 EFI framebuffer gets skipped because of this.

Move the PAGE_ALIGN to after the check vs size.

Reported-by: Christopher Head <chead@chead.ca>
Tested-by: Christopher Head <chead@chead.ca>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206051
Link: https://lkml.kernel.org/r/20200107230410.2291947-1-nivedita@alum.mit.edu
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
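
The 800x600x32 case can be checked with plain arithmetic (assuming the
firmware reports lfb_size as exactly the height * stride product, as in
the report; constants and names here are illustrative):

```c
#include <stdio.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long height = 600;
	unsigned long stride = 800 * 4;            /* bytes per scanline */
	unsigned long lfb_size = height * stride;  /* 1,920,000: not a page multiple */

	printf("height*stride   = %lu\n", height * stride);
	printf("PAGE_ALIGN(h*s) = %lu\n", PAGE_ALIGN(height * stride));

	/* Old (buggy) order: align first, then compare -> wrongly skipped. */
	printf("old check rejects? %d\n", PAGE_ALIGN(height * stride) > lfb_size);

	/* Fixed order: compare the raw product, align afterwards. */
	printf("new check rejects? %d\n", height * stride > lfb_size);
	return 0;
}
```

PAGE_ALIGN(1,920,000) is 1,921,024, which exceeds the reported size, so
the old ordering rejected a perfectly valid framebuffer.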
-
Jan Stancek authored
mainline inclusion
from mainline-v5.3
commit afa8b475
category: bugfix
bugzilla: 17251
CVE: NA

-------------------------------------------------

KVM guests with commit c8c40767 ("x86/timer: Skip PIT initialization on
modern chipsets") applied to the guest kernel have been observed to
have unusually high CPU usage, with symptoms of an increase in VM exits
for HLT and MSR_WRITE (MSR_IA32_TSCDEADLINE).

This is caused by older QEMUs lacking support for X86_FEATURE_ARAT. The
lapic clock retains CLOCK_EVT_FEAT_C3STOP and nohz stays inactive.
There's no usable broadcast device either.

Do the PIT initialization if the guest CPU lacks X86_FEATURE_ARAT. On
real hardware it shouldn't matter, as ARAT and DEADLINE come together.

Fixes: c8c40767 ("x86/timer: Skip PIT initialization on modern chipsets")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Thomas Gleixner authored
mainline inclusion
from mainline-v5.6-rc1
commit 97992387
category: bugfix
bugzilla: 17251
CVE: NA

-------------------------------------------------

Tony reported a boot regression caused by the recent workaround for
systems which have a disabled (clock gate off) PIT.

On his machine the kernel fails to initialize the PIT because
apic_needs_pit() does not take into account whether the local APIC
interrupt delivery mode will actually allow to setup and use the local
APIC timer. This should be easy to reproduce with acpi=off on the
command line, which also disables HPET.

Due to the way the PIT/HPET and APIC setup ordering works (APIC setup
can require working PIT/HPET) the information is not available at the
point where apic_needs_pit() makes this decision.

To address this, split out the interrupt mode selection from
apic_intr_mode_init(), invoke the selection before making the decision
whether PIT is required or not, and add the missing checks into
apic_needs_pit().

Fixes: c8c40767 ("x86/timer: Skip PIT initialization on modern chipsets")
Reported-by: Anthony Buckley <tony.buckley000@gmail.com>
Tested-by: Anthony Buckley <tony.buckley000@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Daniel Drake <drake@endlessm.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206125
Link: https://lore.kernel.org/r/87sgk6tmk2.fsf@nanos.tec.linutronix.de
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Thomas Gleixner authored
mainline inclusion
from mainline-v5.3-rc1
commit c8c40767
category: bugfix
bugzilla: 17251
CVE: NA

-------------------------------------------------

Recent Intel chipsets including Skylake and ApolloLake have a special
ITSSPRC register which allows the 8254 PIT to be gated. When gated, the
8254 registers can still be programmed as normal, but there are no IRQ0
timer interrupts.

Some products such as the Connex L1430 and exone go Rugged E11 use this
register to ship with the PIT gated by default. This causes Linux to
fail to boot:

	Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot
	with apic=debug and send a report.

The panic happens before the framebuffer is initialized, so to the
user, it appears as an early boot hang on a black screen.

Affected products typically have a BIOS option that can be used to
enable the 8254 and make Linux work (Chipset -> South Cluster
Configuration -> Miscellaneous Configuration -> 8254 Clock Gating),
however it would be best to make Linux support the no-8254 case.

Modern systems allow to discover the TSC and local APIC timer
frequencies, so the calibration against the PIT is not required. These
systems have always-running timers and the local APIC timer works also
in deep power states. So the setup of the PIT including the IO-APIC
timer interrupt delivery checks are a pointless exercise.

Skip the PIT setup and the IO-APIC timer interrupt checks on these
systems, which avoids the panic caused by non-ticking PITs and also
speeds up the boot process.

Thanks to Daniel for providing the changelog, initial analysis of the
problem and testing against a variety of machines.

Reported-by: Daniel Drake <drake@endlessm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: linux@endlessm.com
Cc: rafael.j.wysocki@intel.com
Cc: hdegoede@redhat.com
Link: https://lkml.kernel.org/r/20190628072307.24678-1-drake@endlessm.com
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Daniel Drake authored
mainline inclusion
from mainline-v5.3-rc1
commit 52ae346b
category: bugfix
bugzilla: 17251
CVE: NA

-------------------------------------------------

This variable is a period unit (number of clock cycles per jiffy), not
a frequency (which is the number of cycles per second). Give it a more
appropriate name.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Daniel Drake <drake@endlessm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: len.brown@intel.com
Cc: linux@endlessm.com
Cc: rafael.j.wysocki@intel.com
Link: http://lkml.kernel.org/r/20190509055417.13152-2-drake@endlessm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
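
The distinction is just units. A tiny illustration (userspace
arithmetic; the clock rate and HZ values are assumptions for the
example, not taken from the patch):

```c
#include <stdio.h>

#define HZ 250UL   /* jiffies per second (config-dependent) */

int main(void)
{
	unsigned long lapic_clock_hz = 24000000UL;  /* assumed timer input clock */

	/* Calibration divides a frequency (cycles/second) by HZ
	 * (jiffies/second), yielding cycles per jiffy -- a period.
	 * Hence the rename: *_frequency was a misnomer. */
	unsigned long lapic_timer_period = lapic_clock_hz / HZ;

	printf("timer period: %lu cycles/jiffy\n", lapic_timer_period);
	return 0;
}
```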
-
Eric Auger authored
mainline inclusion
from mainline-5.3
commit 3855ba2d
category: bugfix
bugzilla: 17414
CVE: NA

-------------------------------------------------

In the case the RMRR device scope is a PCI-PCI bridge, let's check
whether the device belongs to the PCI sub-hierarchy.

Fixes: 0659b8dc ("iommu/vt-d: Implement reserved region get/put callbacks")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Eric Auger authored
mainline inclusion
from mainline-5.3
commit e143fd45
category: bugfix
bugzilla: 17401
CVE: NA

-------------------------------------------------

When reading the VT-d specification, and especially the Reserved Memory
Region Reporting Structure chapter, it is not obvious that a device
scope element cannot be a PCI-PCI bridge, in which case all downstream
ports are likely to access the reserved memory region. Let's handle
this case in device_has_rmrr().

Fixes: ea2447f7 ("intel-iommu: Prevent devices with RMRRs from being placed into SI Domain")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Eric Auger authored
mainline inclusion
from mainline-5.3
commit b9a7f981
category: bugfix
bugzilla: 17401
CVE: NA

-------------------------------------------------

Several call sites are about to check whether a device belongs to the
PCI sub-hierarchy of a candidate PCI-PCI bridge. Introduce a helper to
perform that check.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Ard Biesheuvel authored
mainline inclusion
from mainline-v4.20-rc1
commit ab8085c1
category: bugfix
bugzilla: NA
CVE: NA

-----------------------------------------------

As it turns out, the AVX2 multibuffer SHA routines are currently broken
[0], in a way that would have likely been noticed if this code were in
wide use. Since the code is too complicated to be maintained by anyone
except the original authors, and since the performance benefits for
real-world use cases are debatable to begin with, it is better to drop
it entirely for the moment.

[0] https://marc.info/?l=linux-crypto-vger&m=153476243825350&w=2

Suggested-by: Eric Biggers <ebiggers@google.com>
Cc: Megha Dey <megha.dey@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Eric Auger authored
[ Upstream commit 5f64ce54 ]

intel_iommu_get_resv_regions() aims to return the list of reserved
regions accessible by a given @device. However several devices can
access the same reserved memory region, and when building the list it
is not safe to use a single iommu_resv_region object, whose container
is the RMRR. This iommu_resv_region must be duplicated per device
reserved region list.

Let's remove the struct iommu_resv_region from the RMRR unit and
allocate the iommu_resv_region directly in
intel_iommu_get_resv_regions(). We hold the dmar_global_lock instead of
the rcu-lock to allow sleeping.

Fixes: 0659b8dc ("iommu/vt-d: Implement reserved region get/put callbacks")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Lu Jialin authored
hulk inclusion
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

--------

1) Rename struct mem_vmstats_percpu to struct mem_cgroup_stat_cpu
2) Move lruvec_stat_local from mem_cgroup_stat_cpu to
   mem_cgroup_per_node_extension
3) Rename vmstats and vmevents to stat and events in struct mem_cgroup
4) Rename vmstats_percpu to stat_cpu in struct mem_cgroup
5) Move vmstats_local from struct mem_cgroup to struct mem_cgroup_extension

Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Shakeel Butt authored
mainline inclusion
from mainline-v5.4-rc7
commit 7961eee3
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

__mem_cgroup_free() can be called on the failure path in
mem_cgroup_alloc(). However memcg_flush_percpu_vmstats() and
memcg_flush_percpu_vmevents(), which are called from
__mem_cgroup_free(), access fields of memcg which can potentially be
null if called from the failure path of mem_cgroup_alloc(). Indeed
syzbot has reported the following crash:

	kasan: CONFIG_KASAN_INLINE enabled
	kasan: GPF could be caused by NULL-ptr deref or user memory access
	general protection fault: 0000 [#1] PREEMPT SMP KASAN
	CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
	RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
	Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
	RSP: 0018:ffff888095c27980 EFLAGS: 00010206
	RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
	RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
	RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
	R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
	R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
	FS:  00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
	Call Trace:
	 __mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
	 mem_cgroup_free mm/memcontrol.c:5033 [inline]
	 mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
	 css_create kernel/cgroup/cgroup.c:5156 [inline]
	 cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
	 cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
	 kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
	 vfs_mkdir+0x42e/0x670 fs/namei.c:3807
	 do_mkdirat+0x234/0x2a0 fs/namei.c:3830
	 __do_sys_mkdir fs/namei.c:3846 [inline]
	 __se_sys_mkdir fs/namei.c:3844 [inline]
	 __x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
	 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
	 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix this by moving the flush to mem_cgroup_free(), as there is no need
to flush anything if we see a failure in mem_cgroup_alloc().

Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg")
Fixes: c350a99e ("mm: memcontrol: flush percpu vmstats before releasing memcg")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: <syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit b04fca9a1b9e2668a999dac3897c15fb8f4bb093)
Conflicts:
  mm/memcontrol.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Honglei Wang authored
mainline inclusion
from mainline-v5.4-rc4
commit b11edebb
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

Commit 1a61ab80 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") has made lruvec_page_state to use per-cpu
counters instead of calculating it directly from lru_zone_size with an
idea that this would be more effective.

Tim has reported that this is not really the case for their database
benchmark, which is showing an opposite result where lruvec_page_state
is taking up a huge chunk of CPU cycles (about 25% of the system time
which is roughly 7% of total cpu cycles) on 5.3 kernels. The workload
is running on a larger machine (96cpus), it has many cgroups (500) and
it is heavily direct reclaim bound.

Tim Chen said:

: The problem can also be reproduced by running simple multi-threaded
: pmbench benchmark with a fast Optane SSD swap (see profile below).
:
:   6.15%     3.08%  pmbench  [kernel.vmlinux]  [k] lruvec_lru_size
:   |
:   |--3.07%--lruvec_lru_size
:   |          |
:   |          |--2.11%--cpumask_next
:   |          |          |
:   |          |           --1.66%--find_next_bit
:   |          |
:   |           --0.57%--call_function_interrupt
:   |                     |
:   |                      --0.55%--smp_call_function_interrupt
:   |
:   |--1.59%--0x441f0fc3d009
:   |          _ops_rdtsc_init_base_freq
:   |          access_histogram
:   |          page_fault
:   |          __do_page_fault
:   |          handle_mm_fault
:   |          __handle_mm_fault
:   |          |
:   |           --1.54%--do_swap_page
:   |                     swapin_readahead
:   |                     swap_cluster_readahead
:   |                     |
:   |                      --1.53%--read_swap_cache_async
:   |                                __read_swap_cache_async
:   |                                alloc_pages_vma
:   |                                __alloc_pages_nodemask
:   |                                __alloc_pages_slowpath
:   |                                try_to_free_pages
:   |                                do_try_to_free_pages
:   |                                shrink_node
:   |                                shrink_node_memcg
:   |                                |
:   |                                |--0.77%--lruvec_lru_size
:   |                                |
:   |                                 --0.76%--inactive_list_is_low
:   |                                           |
:   |                                            --0.76%--lruvec_lru_size
:   |
:    --1.50%--measure_read
:              page_fault
:              __do_page_fault
:              handle_mm_fault
:              __handle_mm_fault
:              do_swap_page
:              swapin_readahead
:              swap_cluster_readahead
:              |
:               --1.48%--read_swap_cache_async
:                         __read_swap_cache_async
:                         alloc_pages_vma
:                         __alloc_pages_nodemask
:                         __alloc_pages_slowpath
:                         try_to_free_pages
:                         do_try_to_free_pages
:                         shrink_node
:                         shrink_node_memcg
:                         |
:                         |--0.75%--inactive_list_is_low
:                         |          |
:                         |           --0.75%--lruvec_lru_size
:                         |
:                          --0.73%--lruvec_lru_size

The likely culprit is the cache traffic the lruvec_page_state_local
generates. Dave Hansen says:

: I was thinking purely of the cache footprint. If it's reading
: pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
: bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
:
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles

Fix the regression by partially reverting the said commit and calculate
the lru size explicitly.

Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab80 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: <stable@vger.kernel.org> [5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit 649cad91d7ad38cbd918a74f8128e7190b1cffb5)
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Shakeel Butt authored
mainline inclusion
from mainline-v5.3-rc7
commit 6c1c2808
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

Instead of using raw_cpu_read(), use per_cpu() to read the actual data
of the corresponding cpu; otherwise we will be reading the data of the
current cpu once for each online CPU.

Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com
Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg")
Fixes: c350a99e ("mm: memcontrol: flush percpu vmstats before releasing memcg")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit ff99cde5226f5ab3e032b870bec6b1631574bd0b)
Conflicts:
  mm/memcontrol.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
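
A userspace model of the bug (plain arrays stand in for per-cpu
storage; the names are mine, not mm/memcontrol.c's): summing the
current CPU's slot once per online CPU counts one CPU's data nr_cpus
times, while indexing each CPU's own slot gives the real total:

```c
#include <stdio.h>

#define NR_CPUS 4

static long events[NR_CPUS] = { 10, 20, 30, 40 };
static int current_cpu = 0;

/* raw_cpu_read(): always the executing CPU's slot. */
static long raw_cpu_read(const long *pcp) { return pcp[current_cpu]; }

/* per_cpu(): the named CPU's slot. */
static long per_cpu(const long *pcp, int cpu) { return pcp[cpu]; }

int main(void)
{
	long buggy = 0, fixed = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		buggy += raw_cpu_read(events);   /* 10+10+10+10 = 40  */
		fixed += per_cpu(events, cpu);   /* 10+20+30+40 = 100 */
	}
	printf("buggy sum: %ld, fixed sum: %ld\n", buggy, fixed);
	return 0;
}
```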
-
Roman Gushchin authored
mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"

mainline inclusion
from mainline-v5.3-rc7
commit b4c46484
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

Commit 766a4c19 ("mm/memcontrol.c: keep local VM counters in sync with
the hierarchical ones") effectively decreased the precision of
per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.

That's good for displaying in memory.stat, but brings a serious
regression into the reclaim process.

One issue I've discovered and debugged is the following:
lruvec_lru_size() can return 0 instead of the actual number of pages in
the lru list, preventing the kernel from reclaiming the last remaining
pages. The result is yet another dying memory cgroups flooding. The
opposite is also happening: scanning an empty lru list is a waste of
cpu time.

Also, inactive_list_is_low() can return incorrect values, preventing
the active lru from being scanned and freed. It can fail both because
the sizes of the active and inactive lists are inaccurate, and because
the number of workingset refaults isn't precise. In other words, the
result is pretty random.

I'm not sure if using the approximate number of slab pages in
count_shadow_number() is acceptable, but the issues described above are
enough to partially revert the patch.

Let's keep per-memcg vmstat_local batched (they are only used for
displaying stats to the userspace), but keep lruvec stats precise. This
change fixes the dead memcg flooding on my setup.

Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com
Fixes: 766a4c19 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit 81b6bde6ce186b25297ffc0138f9bc4523f3497c)
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Yafang Shao authored
mainline inclusion
from mainline-v5.3-rc1
commit 766a4c19
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

After commit 815744d7 ("mm: memcontrol: don't batch updates of local VM
stats and events"), the local VM counters are not in sync with the
hierarchical ones. Below is one example in a leaf memcg on my server
(with 8 CPUs):

	inactive_file 3567570944
	total_inactive_file 3568029696

We find that the deviation is very great because the 'val' in
__mod_memcg_state() is in pages while the effective value in
memcg_stat_show() is in bytes. So the maximum of this deviation between
local VM stats and total VM stats can be (32 * number_of_cpu *
PAGE_SIZE), which may be an unacceptably great value.

We should keep the local VM stats in sync with the total stats. In
order to keep this behavior the same across counters, this patch
updates __mod_lruvec_state() and __count_memcg_events() as well.

Link: http://lkml.kernel.org/r/1562851979-10610-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit 2f5bb3c19249c51a41395f9a4400ec05c002a60a)
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Roman Gushchin authored
mainline inclusion
from mainline-v5.3-rc6
commit bb65f89b
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

Similar to vmstats, percpu caching of local vmevents leads to an
accumulation of errors on non-leaf levels. This happens because some
leftovers may remain in percpu caches, so that they are never
propagated up the cgroup tree and just disappear into nonexistence on
the releasing of the memory cgroup.

To fix this issue let's accumulate and propagate percpu vmevents values
before releasing the memory cgroup, similar to what we're doing with
vmstats.

Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
only over online cpus.

Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit 01d21e395a5b635a93820b4279d37f5cd6de6913)
Conflicts:
  mm/memcontrol.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Roman Gushchin authored
mainline inclusion
from mainline-v5.3-rc6
commit c350a99e
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

Percpu caching of local vmstats with the conditional propagation by the
cgroup tree leads to an accumulation of errors on non-leaf levels.

Let's imagine two nested memory cgroups A and A/B. Say, a process
belonging to A/B allocates 100 pagecache pages on CPU 0. The percpu
cache will spill 3 times, so that 32*3=96 pages will be accounted to
the A/B and A atomic vmstat counters, and 4 pages will remain in the
percpu cache.

Imagine A/B is near memory.max, so that every following allocation
triggers a direct reclaim on the local CPU. Say, each such attempt will
free 16 pages on a new cpu. That means every percpu cache will have -16
pages, except the first one, which will have 4 - 16 = -12. The A/B and
A atomic counters will not be touched at all.

Now a user removes A/B. All percpu caches are freed and the
corresponding vmstat numbers are forgotten. A has 96 pages more than
expected.

As memory cgroups are created and destroyed, errors do accumulate. Even
1-2 page differences can accumulate into large numbers.

To fix this issue let's accumulate and propagate percpu vmstat values
before releasing the memory cgroup. At this point these numbers are
stable and cannot be changed.

Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
only over online cpus.

Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit c22087d02d52e521e348ec207b7434f3b10f0637)
Conflicts:
  mm/memcontrol.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
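
A single-threaded model of the fix (hypothetical structures, not the
kernel's; the percpu values mirror the A/B scenario above): before
freeing a cgroup, fold every CPU's leftover delta into the stable
counters of the cgroup and its ancestors, so no pages silently vanish
with the percpu caches:

```c
#include <stdio.h>

#define NR_CPUS 4

struct memcg {
	long vmstat;              /* stable (atomic_t in the kernel) */
	long percpu[NR_CPUS];     /* cached, unpropagated deltas */
	struct memcg *parent;
};

static void memcg_flush_percpu_vmstats(struct memcg *memcg)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		sum += memcg->percpu[cpu];
		memcg->percpu[cpu] = 0;
	}
	/* Propagate up the tree so ancestors stay accurate. */
	for (struct memcg *mi = memcg; mi; mi = mi->parent)
		mi->vmstat += sum;
}

int main(void)
{
	struct memcg a = { .vmstat = 96 };
	struct memcg b = { .vmstat = 96,
			   .percpu = { -12, -16, -16, -16 },
			   .parent = &a };

	/* Fold the -60 pages of leftovers before "freeing" A/B. */
	memcg_flush_percpu_vmstats(&b);
	printf("A's counter after flush: %ld (96 without the flush)\n",
	       a.vmstat);
	return 0;
}
```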
-
Yafang Shao authored
mainline inclusion
from mainline-v5.3-rc1
commit dd923990
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

When we calculate total statistics for memcg1_stats and memcg1_events,
we use the index 'i' in the for loop as the events index. Actually we
should use memcg1_stats[i] and memcg1_events[i] as the events index.

Link: http://lkml.kernel.org/r/1562116978-19539-1-git-send-email-laoar.shao@gmail.com
Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit e97c8caa68a68615f64d3a94eb3ec225a90e56b9)
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
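
A tiny model of the indexing bug (the arrays here are made up, not the
kernel's tables): memcg1_stats is a list of sparse event ids, so the
totals must be indexed by memcg1_stats[i], not by the loop counter i:

```c
#include <stdio.h>

#define NR_EVENTS 8

static const int memcg1_stats[] = { 3, 5, 7 };   /* sparse event ids */
static long totals[NR_EVENTS] = { 0, 0, 0, 100, 0, 200, 0, 300 };

int main(void)
{
	size_t n = sizeof(memcg1_stats) / sizeof(*memcg1_stats);

	for (size_t i = 0; i < n; i++) {
		long buggy = totals[i];                 /* wrong slot     */
		long fixed = totals[memcg1_stats[i]];   /* intended slot  */

		printf("event %d: buggy=%ld fixed=%ld\n",
		       memcg1_stats[i], buggy, fixed);
	}
	return 0;
}
```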
-
Johannes Weiner authored
mainline inclusion
from mainline-v5.2-rc5
commit 815744d7
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA

-------------------------------------------------

The kernel test robot noticed a 26% will-it-scale pagefault regression
from commit 42a30035 ("mm: memcontrol: fix recursive statistics
correctness & scalabilty"). This appears to be caused by bouncing the
additional cachelines from the new hierarchical statistics counters.

We can fix this by getting rid of the batched local counters instead.

Originally, there were *only* group-local counters, and they were fully
maintained per cpu. A reader of a stats file high up in the cgroup tree
would have to walk the entire subtree and collect each level's per-cpu
counters to get the recursive view. This was prohibitively expensive,
and so we switched to per-cpu batched updates of the local counters
during a983b5eb ("mm: memcontrol: fix excessive complexity in
memory.stat reporting"), reducing the complexity from nr_subgroups *
nr_cpus to nr_subgroups.

With growing machines and cgroup trees, the tree walk itself became too
expensive for monitoring top-level groups, and this is when the culprit
patch added hierarchy counters on each cgroup level. When the per-cpu
batch size would be reached, both the local and the hierarchy counters
would get batch-updated from the per-cpu delta simultaneously.

This makes local and hierarchical counter reads blazingly fast, but it
unfortunately makes the write-side too cache line intense.

Since local counter reads were never a problem - we only centralized
them to accelerate the hierarchy walk - and use of the local counters
is becoming rarer due to replacement with hierarchical views (ongoing
rework in the page reclaim and workingset code), we can make those
local counters unbatched per-cpu counters again.

The scheme will then be as such:

  when a memcg statistic changes, the writer will:
  - update the local counter (per-cpu)
  - update the batch counter (per-cpu). If the batch is full:
    - spill the batch into the group's atomic_t
    - spill the batch into all ancestors' atomic_ts
    - empty out the batch counter (per-cpu)

  when a local memcg counter is read, the reader will:
  - collect the local counter from all cpus

  when a hierarchy memcg counter is read, the reader will:
  - read the atomic_t

We might be able to simplify this further and make the recursive
counters unbatched per-cpu counters as well (batch upward propagation,
but leave per-cpu collection to the readers), but that will require a
more in-depth analysis and testing of all the callsites. Deal with the
immediate regression for now.

Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry picked from commit f5c6200d96e39a6dd80d397d54d6cd7a22c7d968)
Conflicts:
  mm/memcontrol.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
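
A compact single-threaded model of the scheme described above
(structure and function names are mine, not mm/memcontrol.c's): writers
keep an unbatched local per-cpu counter plus a per-cpu batch that
spills into the group's and all ancestors' stable counters:

```c
#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS 4
#define BATCH   32

struct memcg {
	long local_pcpu[NR_CPUS];   /* unbatched; reader sums these */
	long batch_pcpu[NR_CPUS];   /* spills when |delta| >= BATCH */
	long hierarchy;             /* atomic_t in the kernel */
	struct memcg *parent;
};

static void mod_stat(struct memcg *memcg, int cpu, long val)
{
	memcg->local_pcpu[cpu] += val;   /* local: always per-cpu */

	memcg->batch_pcpu[cpu] += val;   /* hierarchical: batched */
	if (labs(memcg->batch_pcpu[cpu]) >= BATCH) {
		long delta = memcg->batch_pcpu[cpu];

		/* Spill into the group and all ancestors at once. */
		for (struct memcg *mi = memcg; mi; mi = mi->parent)
			mi->hierarchy += delta;
		memcg->batch_pcpu[cpu] = 0;
	}
}

static long read_local(const struct memcg *memcg)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += memcg->local_pcpu[cpu];
	return sum;
}

int main(void)
{
	struct memcg root = { 0 }, child = { .parent = &root };

	for (int i = 0; i < 1000; i++)
		mod_stat(&child, i % NR_CPUS, 1);

	printf("local (exact): %ld\n", read_local(&child));
	printf("hierarchy (batched): child=%ld root=%ld\n",
	       child.hierarchy, root.hierarchy);
	return 0;
}
```

With 1000 increments, the local read is exact while the hierarchical
counters lag by the deltas still sitting in the per-cpu batches; those
leftovers are exactly what the flush-before-release patches above
(bb65f89b, c350a99e) fold back in when a cgroup is destroyed.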
-