Skip to content
Snippets Groups Projects
  1. Sep 22, 2020
    • Suzuki K Poulose's avatar
      arm64: cpufeature: Fix the type of no FP/SIMD capability · 2cebf0b0
      Suzuki K Poulose authored
      
      stable inclusion
      from linux-4.19.104
      commit 12e2dca1f224fde0ec1dec10a3c6e178c6dd8a7a
      
      --------------------------------
      
      commit 449443c0 upstream.
      
      The NO_FPSIMD capability is defined with scope SYSTEM, which implies
      that the "absence" of FP/SIMD on at least one CPU is detected only
      after all the SMP CPUs are brought up. However, we use the status
      of this capability for every context switch. So, let us change
      the scope to LOCAL_CPU to allow the detection of this capability
      as and when the first CPU without FP is brought up.
      
      Also, the current type allows hotplugged CPU to be brought up without
      FP/SIMD when all the current CPUs have FP/SIMD and we have the userspace
      up. Fix both of these issues by changing the capability to
      BOOT_RESTRICTED_LOCAL_CPU_FEATURE.
      
      Fixes: 82e0191a ("arm64: Support systems without FP/ASIMD")
      Cc: Will Deacon <will@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: default avatarArd Biesheuvel <ardb@kernel.org>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarSuzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: default avatarWill Deacon <will@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      2cebf0b0
    • Robert Milkowski's avatar
      NFSv4: try lease recovery on NFS4ERR_EXPIRED · ae9b7851
      Robert Milkowski authored
      
      stable inclusion
      from linux-4.19.104
      commit 070818b71dca057ba69c53f2a3c8098508ece818
      
      --------------------------------
      
      commit 924491f2 upstream.
      
      Currently, if an nfs server returns NFS4ERR_EXPIRED to open(),
      we return EIO to applications without even trying to recover.
      
      Fixes: 272289a3 ("NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid")
      Signed-off-by: default avatarRobert Milkowski <rmilkowski@gmail.com>
      Reviewed-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      ae9b7851
    • Trond Myklebust's avatar
      NFS: Revalidate the file size on a fatal write error · c944eae2
      Trond Myklebust authored
      
      stable inclusion
      from linux-4.19.104
      commit 3060883146e59cb60c7e2d8981bc256f7c5283df
      
      --------------------------------
      
      commit 0df68ced upstream.
      
      If we suffer a fatal error upon writing a file, which causes us to
      need to revalidate the entire mapping, then we should also revalidate
      the file size.
      
      Fixes: d2ceb7e5 ("NFS: Don't use page_file_mapping after removing the page")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      c944eae2
    • Geert Uytterhoeven's avatar
      nfs: NFS_SWAP should depend on SWAP · 6dd98e92
      Geert Uytterhoeven authored
      
      stable inclusion
      from linux-4.19.104
      commit 008ff93deec291d3ab69f020537f0813f83ed23a
      
      --------------------------------
      
      commit 474c4f30 upstream.
      
      If CONFIG_SWAP=n, it does not make much sense to offer the user the
      option to enable support for swapping over NFS, as that will still fail
      at run time:
      
          # swapon /swap
          swapon: /swap: swapon failed: Function not implemented
      
      Fix this by adding a dependency on CONFIG_SWAP.
      
      Fixes: a564b8f0 ("nfs: enable swap on NFS")
      Signed-off-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      6dd98e92
    • Logan Gunthorpe's avatar
      PCI: Don't disable bridge BARs when assigning bus resources · cbab034d
      Logan Gunthorpe authored
      stable inclusion
      from linux-4.19.104
      commit 412cb7a7b0a0bbaf32a23951d604f59b8a8b0f20
      
      --------------------------------
      
      commit 9db8dc6d upstream.
      
      Some PCI bridges implement BARs in addition to bridge windows.  For
      example, here's a PLX switch:
      
        04:00.0 PCI bridge: PLX Technology, Inc. PEX 8724 24-Lane, 6-Port PCI
                  Express Gen 3 (8 GT/s) Switch, 19 x 19mm FCBGA (rev ca)
      	    (prog-if 00 [Normal decode])
            Flags: bus master, fast devsel, latency 0, IRQ 30, NUMA node 0
            Memory at 90a00000 (32-bit, non-prefetchable) [size=256K]
            Bus: primary=04, secondary=05, subordinate=0a, sec-latency=0
            I/O behind bridge: 00002000-00003fff
            Memory behind bridge: 90000000-909fffff
            Prefetchable memory behind bridge: 0000380000800000-0000380000bfffff
      
      Previously, when the kernel assigned resource addresses (with the
      pci=realloc command line parameter, for example) it could clear the struct
      resource corresponding to the BAR.  When this happened, lspci would report
      this BAR as "ignored":
      
         Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K]
      
      This is because the kernel reports a zero start address and zero flags
      in the corresponding sysfs resource file and in /proc/bus/pci/devices.
      Investigation with 'lspci -x', however, shows the BIOS-assigned address
      will still be programmed in the device's BAR registers.
      
      It's clearly a bug that the kernel lost track of the BAR value, but in most
      cases, this still won't result in a visible issue because nothing uses the
      memory, so nothing is affected.  However, when an IOMMU is in use, it will
      not reserve this space in the IOVA because the kernel no longer thinks the
      range is valid.  (See dmar_init_reserved_ranges() for the Intel
      implementation of this.)
      
      Without the proper reserved range, a DMA mapping may allocate an IOVA that
      matches a bridge BAR, which results in DMA accesses going to the BAR
      instead of the intended RAM.
      
      The problem was in pci_assign_unassigned_root_bus_resources().  When any
      resource from a bridge device fails to get assigned, the code set the
      resource's flags to zero.  This makes sense for bridge windows, as they
      will be re-enabled later, but for regular BARs, it makes the kernel
      permanently lose track of the fact that they decode address space.
      
      Change pci_assign_unassigned_root_bus_resources() and
      pci_assign_unassigned_bridge_resources() so they only clear "res->flags"
      for bridge *windows*, not bridge BARs.
      
      Fixes: da7822e5 ("PCI: update bridge resources to get more big ranges when allocating space (again)")
      Link: https://lore.kernel.org/r/20200108213208.4612-1-logang@deltatee.com
      
      
      [bhelgaas: commit log, check for pci_is_bridge()]
      Reported-by: default avatarKit Chow <kchow@gigaio.com>
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      cbab034d
    • Song Liu's avatar
      perf/core: Fix mlock accounting in perf_mmap() · 40945cd3
      Song Liu authored
      
      stable inclusion
      from linux-4.19.103
      commit a3623db43a3c06538591370db955d85b80657e17
      
      --------------------------------
      
      commit 00346155 upstream.
      
      Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
      a perf ring buffer may lead to an integer underflow in locked memory
      accounting. This may lead to the undesired behaviors, such as failures in
      BPF map creation.
      
      Address this by adjusting the accounting logic to take into account the
      possibility that the amount of already locked memory may exceed the
      current limit.
      
      Fixes: c4b75479 ("perf/core: Make the mlock accounting simple again")
      Suggested-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Acked-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      40945cd3
    • Konstantin Khlebnikov's avatar
      clocksource: Prevent double add_timer_on() for watchdog_timer · 0c47b8e8
      Konstantin Khlebnikov authored
      
      stable inclusion
      from linux-4.19.103
      commit 6284d30e96ede11d9d434eebfacbe4b4625b6c87
      
      --------------------------------
      
      commit febac332 upstream.
      
      Kernel crashes inside QEMU/KVM are observed:
      
        kernel BUG at kernel/time/timer.c:1154!
        BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
      
      At the same time another cpu got:
      
        general protection fault: 0000 [#1] SMP PTI of poinson pointer 0xdead000000000200 in:
      
        __hlist_del at include/linux/list.h:681
        (inlined by) detach_timer at kernel/time/timer.c:818
        (inlined by) expire_timers at kernel/time/timer.c:1355
        (inlined by) __run_timers at kernel/time/timer.c:1686
        (inlined by) run_timer_softirq at kernel/time/timer.c:1699
      
      Unfortunately kernel logs are badly scrambled, stacktraces are lost.
      
      Printing the timer->function before the BUG_ON() pointed to
      clocksource_watchdog().
      
      The execution of clocksource_watchdog() can race with a sequence of
      clocksource_stop_watchdog() .. clocksource_start_watchdog():
      
      expire_timers()
       detach_timer(timer, true);
        timer->entry.pprev = NULL;
       raw_spin_unlock_irq(&base->lock);
       call_timer_fn
        clocksource_watchdog()
      
      					clocksource_watchdog_kthread() or
      					clocksource_unbind()
      
      					spin_lock_irqsave(&watchdog_lock, flags);
      					clocksource_stop_watchdog();
      					 del_timer(&watchdog_timer);
      					 watchdog_running = 0;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      
      					spin_lock_irqsave(&watchdog_lock, flags);
      					clocksource_start_watchdog();
      					 add_timer_on(&watchdog_timer, ...);
      					 watchdog_running = 1;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      
        spin_lock(&watchdog_lock);
        add_timer_on(&watchdog_timer, ...);
         BUG_ON(timer_pending(timer) || !timer->function);
          timer_pending() -> true
          BUG()
      
      I.e. inside clocksource_watchdog() watchdog_timer could be already armed.
      
      Check timer_pending() before calling add_timer_on(). This is sufficient as
      all operations are synchronized by watchdog_lock.
      
      Fixes: 75c5158f ("timekeeping: Update clocksource with stop_machine")
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      0c47b8e8
    • Thomas Gleixner's avatar
      x86/apic/msi: Plug non-maskable MSI affinity race · 7ff6d61f
      Thomas Gleixner authored
      
      stable inclusion
      from linux-4.19.103
      commit 032a2bf9787acdaef31369045ff0cb0b301eee61
      
      --------------------------------
      
      commit 6f1a4891 upstream.
      
      Evan tracked down a subtle race between the update of the MSI message and
      the device raising an interrupt internally on PCI devices which do not
      support MSI masking. The update of the MSI message is non-atomic and
      consists of either 2 or 3 sequential 32bit wide writes to the PCI config
      space.
      
         - Write address low 32bits
         - Write address high 32bits (If supported by device)
         - Write data
      
      When an interrupt is migrated then both address and data might change, so
      the kernel attempts to mask the MSI interrupt first. But for MSI masking is
      optional, so there exist devices which do not provide it. That means that
      if the device raises an interrupt internally between the writes then a MSI
      message is sent built from half updated state.
      
      On x86 this can lead to spurious interrupts on the wrong interrupt
      vector when the affinity setting changes both address and data. As a
      consequence the device interrupt can be lost causing the device to
      become stuck or malfunctioning.
      
      Evan tried to handle that by disabling MSI accross an MSI message
      update. That's not feasible because disabling MSI has issues on its own:
      
       If MSI is disabled the PCI device is routing an interrupt to the legacy
       INTx mechanism. The INTx delivery can be disabled, but the disablement is
       not working on all devices.
      
       Some devices lose interrupts when both MSI and INTx delivery are disabled.
      
      Another way to solve this would be to enforce the allocation of the same
      vector on all CPUs in the system for this kind of screwed devices. That
      could be done, but it would bring back the vector space exhaustion problems
      which got solved a few years ago.
      
      Fortunately the high address (if supported by the device) is only relevant
      when X2APIC is enabled which implies interrupt remapping. In the interrupt
      remapping case the affinity setting is happening at the interrupt remapping
      unit and the PCI MSI message is programmed only once when the PCI device is
      initialized.
      
      That makes it possible to solve it with a two step update:
      
        1) Target the MSI msg to the new vector on the current target CPU
      
        2) Target the MSI msg to the new vector on the new target CPU
      
      In both cases writing the MSI message is only changing a single 32bit word
      which prevents the issue of inconsistency.
      
      After writing the final destination it is necessary to check whether the
      device issued an interrupt while the intermediate state #1 (new vector,
      current CPU) was in effect.
      
      This is possible because the affinity change is always happening on the
      current target CPU. The code runs with interrupts disabled, so the
      interrupt can be detected by checking the IRR of the local APIC. If the
      vector is pending in the IRR then the interrupt is retriggered on the new
      target CPU by sending an IPI for the associated vector on the target CPU.
      
      This can cause spurious interrupts on both the local and the new target
      CPU.
      
       1) If the new vector is not in use on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then interrupt entry code will
          ignore that spurious interrupt. The vector is marked so that the
          'No irq handler for vector' warning is supressed once.
      
       2) If the new vector is in use already on the local CPU then the IRR check
          might see an pending interrupt from the device which is using this
          vector. The IPI to the new target CPU will then invoke the handler of
          the device, which got the affinity change, even if that device did not
          issue an interrupt
      
       3) If the new vector is in use already on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then the handler of the device which
          uses that vector on the local CPU will be invoked.
      
      expose issues in device driver interrupt handlers which are not prepared to
      handle a spurious interrupt correctly. This not a regression, it's just
      exposing something which was already broken as spurious interrupts can
      happen for a lot of reasons and all driver handlers need to be able to deal
      with them.
      
      Reported-by: default avatarEvan Green <evgreen@chromium.org>
      Debugged-by: default avatarEvan Green <evgreen@chromium.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarEvan Green <evgreen@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      7ff6d61f
    • David Hildenbrand's avatar
      mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · d0a3efd5
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.103
      commit 0a69047d8235c60d88c6ca488d8dccc7c60d4d3c
      
      --------------------------------
      
      [ Upstream commit e822969c ]
      
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
      Playing with different memory sizes for a x86-64 guest, I discovered that
      some memmaps (highest section if max_mem does not fall on the section
      boundary) are marked as being valid and online, but contain garbage.  We
      have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
      memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW, we can easily have this scenario
      (esp., one of the main reasons sub-section hotadd of devmem was added).
      
      The issue is, that we have a valid memmap (pfn_valid()) for the whole
      section, and the whole section will be marked "online".
      pfn_to_online_page() will succeed, but the memmap contains garbage.
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
      4160M" - (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
      after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
      This patch fixes that by at least zero-ing out that memmap (so e.g.,
      page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
      unavailable struct pages") tried to fix a similar issue, but forgot to
      consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
      
      
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      d0a3efd5
    • Pavel Tatashin's avatar
      mm: return zero_resv_unavail optimization · 809dc9f5
      Pavel Tatashin authored
      stable inclusion
      from linux-4.19.103
      commit f19a50c1e3ba9f58ca5a591a82ac4852da8bc4ee
      
      --------------------------------
      
      [ Upstream commit ec393a0f ]
      
      When checking for valid pfns in zero_resv_unavail(), it is not necessary
      to verify that pfns within pageblock_nr_pages ranges are valid, only the
      first one needs to be checked.  This is because memory for pages are
      allocated in contiguous chunks that contain pageblock_nr_pages struct
      pages.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
      
      
      Signed-off-by: default avatarPavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: default avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: default avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      809dc9f5
    • Naoya Horiguchi's avatar
      mm: zero remaining unavailable struct pages · f8d9d8ce
      Naoya Horiguchi authored
      stable inclusion
      from linux-4.19.103
      commit 9ac5917a1d28220981512c4f4c391c90a997e0c6
      
      --------------------------------
      
      [ Upstream commit 907ec5fc ]
      
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap range to
      reserved memblock.  So, the node is marked as Normal zone even if the SRAT
      has Hot pluggable affinity.
      
      First and second patch fix the original issue which the commit tried to
      fix, then revert the commit.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags on
      the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
      default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This causes shrinking node 0's pfn range because it is calculated by the
      address range of memblock.memory.  So some of struct pages in the gap
      range are left uninitialized.
      
      We have a function zero_resv_unavail() which does zeroing the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
      patch extends it to cover all unavailable range, which fixes the reported
      issue.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      
      
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by-by: default avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: default avatarOscar Salvador <osalvador@suse.de>
      Tested-by: default avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: default avatarPavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      f8d9d8ce
    • Eric Biggers's avatar
      ext4: fix deadlock allocating crypto bounce page from mempool · d2b5dce2
      Eric Biggers authored
      
      stable inclusion
      from linux-4.19.103
      commit 987bb7a3fda122e8e5f98fcd8c01ae188fc27e1e
      
      --------------------------------
      
      [ Upstream commit 547c556f ]
      
      ext4_writepages() on an encrypted file has to encrypt the data, but it
      can't modify the pagecache pages in-place, so it encrypts the data into
      bounce pages and writes those instead.  All bounce pages are allocated
      from a mempool using GFP_NOFS.
      
      This is not correct use of a mempool, and it can deadlock.  This is
      because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never
      fail" mode for mempool_alloc() where a failed allocation will fall back
      to waiting for one of the preallocated elements in the pool.
      
      But since this mode is used for all a bio's pages and not just the
      first, it can deadlock waiting for pages already in the bio to be freed.
      
      This deadlock can be reproduced by patching mempool_alloc() to pretend
      that pool->alloc() always fails (so that it always falls back to the
      preallocations), and then creating an encrypted file of size > 128 KiB.
      
      Fix it by only using GFP_NOFS for the first page in the bio.  For
      subsequent pages just use GFP_NOWAIT, and if any of those fail, just
      submit the bio and start a new one.
      
      This will need to be fixed in f2fs too, but that's less straightforward.
      
      Fixes: c9af28fd ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org
      
      
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      d2b5dce2
    • Jens Axboe's avatar
      aio: prevent potential eventfd recursion on poll · 3c79f696
      Jens Axboe authored
      
      stable inclusion
      from linux-4.19.103
      commit d062d9826ae4659bc2f976f70b666b245d351650
      
      --------------------------------
      
      commit 01d7a356 upstream.
      
      If we have nested or circular eventfd wakeups, then we can deadlock if
      we run them inline from our poll waitqueue wakeup handler. It's also
      possible to have very long chains of notifications, to the extent where
      we could risk blowing the stack.
      
      Check the eventfd recursion count before calling eventfd_signal(). If
      it's non-zero, then punt the signaling to async context. This is always
      safe, as it takes us out-of-line in terms of stack and locking context.
      
      Cc: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      3c79f696
    • Jens Axboe's avatar
      eventfd: track eventfd_signal() recursion depth · c1910c45
      Jens Axboe authored
      
      stable inclusion
      from linux-4.19.103
      commit eaef83c4c0cb8c82ab7cea99479d49d35a5cd25d
      
      --------------------------------
      
      commit b5e683d5 upstream.
      
      eventfd use cases from aio and io_uring can deadlock due to circular
      or resursive calling, when eventfd_signal() tries to grab the waitqueue
      lock. On top of that, it's also possible to construct notification
      chains that are deep enough that we could blow the stack.
      
      Add a percpu counter that tracks the percpu recursion depth, warn if we
      exceed it. The counter is also exposed so that users of eventfd_signal()
      can do the right thing if it's non-zero in the context where it is
      called.
      
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      c1910c45
    • Vladis Dronov's avatar
      watchdog: fix UAF in reboot notifier handling in watchdog core code · 7350b47a
      Vladis Dronov authored
      
      stable inclusion
      from linux-4.19.103
      commit 1ca3742a9bba93977d9e7cae8ffdcaba03f93101
      
      --------------------------------
      
      commit 69503e58 upstream.
      
      After the commit 44ea3942 ("drivers/watchdog: make use of
      devm_register_reboot_notifier()") the struct notifier_block reboot_nb in
      the struct watchdog_device is removed from the reboot notifiers chain at
      the time watchdog's chardev is closed. But at least in i6300esb.c case
      reboot_nb is embedded in the struct esb_dev which can be freed on its
      device removal and before the chardev is closed, thus UAF at reboot:
      
      [    7.728581] esb_probe: esb_dev.watchdog_device ffff91316f91ab28
      ts# uname -r                            note the address ^^^
      5.5.0-rc5-ae6088-wdog
      ts# ./openwdog0 &
      [1] 696
      ts# opened /dev/watchdog0, sleeping 10s...
      ts# echo 1 > /sys/devices/pci0000\:00/0000\:00\:09.0/remove
      [  178.086079] devres:rel_nodes: dev ffff91317668a0b0 data ffff91316f91ab28
                 esb_dev.watchdog_device.reboot_nb memory is freed here ^^^
      ts# ...woken up
      [  181.459010] devres:rel_nodes: dev ffff913171781000 data ffff913174a1dae8
      [  181.460195] devm_unreg_reboot_notifier: res ffff913174a1dae8 nb ffff91316f91ab78
                                           attempt to use memory already freed ^^^
      [  181.461063] devm_unreg_reboot_notifier: nb->call 6b6b6b6b6b6b6b6b
      [  181.461243] devm_unreg_reboot_notifier: nb->next 6b6b6b6b6b6b6b6b
                      freed memory is filled with a slub poison ^^^
      [1]+  Done                    ./openwdog0
      ts# reboot
      [  229.921862] systemd-shutdown[1]: Rebooting.
      [  229.939265] notifier_call_chain: nb ffffffff9c6c2f20 nb->next ffffffff9c6d50c0
      [  229.943080] notifier_call_chain: nb ffffffff9c6d50c0 nb->next 6b6b6b6b6b6b6b6b
      [  229.946054] notifier_call_chain: nb 6b6b6b6b6b6b6b6b INVAL
      [  229.957584] general protection fault: 0000 [#1] SMP
      [  229.958770] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.5.0-rc5-ae6088-wdog
      [  229.960224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
      [  229.963288] RIP: 0010:notifier_call_chain+0x66/0xd0
      [  229.969082] RSP: 0018:ffffb20dc0013d88 EFLAGS: 00010246
      [  229.970812] RAX: 000000000000002e RBX: 6b6b6b6b6b6b6b6b RCX: 00000000000008b3
      [  229.972929] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffffffff9ccc46ac
      [  229.975028] RBP: 0000000000000001 R08: 0000000000000000 R09: 00000000000008b3
      [  229.977039] R10: 0000000000000001 R11: ffffffff9c26c740 R12: 0000000000000000
      [  229.979155] R13: 6b6b6b6b6b6b6b6b R14: 0000000000000000 R15: 00000000fffffffa
      ...   slub_debug=FZP poison ^^^
      [  229.989089] Call Trace:
      [  229.990157]  blocking_notifier_call_chain+0x43/0x59
      [  229.991401]  kernel_restart_prepare+0x14/0x30
      [  229.992607]  kernel_restart+0x9/0x30
      [  229.993800]  __do_sys_reboot+0x1d2/0x210
      [  230.000149]  do_syscall_64+0x3d/0x130
      [  230.001277]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  230.002639] RIP: 0033:0x7f5461bdd177
      [  230.016402] Modules linked in: i6300esb
      [  230.050261] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      
      Fix the crash by reverting 44ea3942 so unregister_reboot_notifier()
      is called when watchdog device is removed. This also makes handling of
      the reboot notifier unified with the handling of the restart handler,
      which is freed with unregister_restart_handler() in the same place.
      
      Fixes: 44ea3942 ("drivers/watchdog: make use of devm_register_reboot_notifier()")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: default avatarVladis Dronov <vdronov@redhat.com>
      Reviewed-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Link: https://lore.kernel.org/r/20200108125347.6067-1-vdronov@redhat.com
      
      
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarWim Van Sebroeck <wim@linux-watchdog.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      7350b47a
    • Vasily Averin's avatar
      jbd2_seq_info_next should increase position index · 9ceb9a3a
      Vasily Averin authored
      stable inclusion
      from linux-4.19.103
      commit 68e81e14ddb897d5392cb4967ac398b52435adea
      
      --------------------------------
      
      commit 1a8e9cf4 upstream.
      
      if seq_file .next fuction does not change position index,
      read after some lseek can generate unexpected output.
      
      Script below generates endless output
       $ q=;while read -r r;do echo "$((++q)) $r";done </proc/fs/jbd2/DEV/info
      
      https://bugzilla.kernel.org/show_bug.cgi?id=206283
      
      
      
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
      Cc: stable@kernel.org
      Signed-off-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/d13805e5-695e-8ac3-b678-26ca2313629f@virtuozzo.com
      
      
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      9ceb9a3a
    • Trond Myklebust's avatar
      NFS: Directory page cache pages need to be locked when read · d190c880
      Trond Myklebust authored
      
      stable inclusion
      from linux-4.19.103
      commit 729c1232c7f1c6427851d835d0d36c2d75f097bf
      
      --------------------------------
      
      commit 114de382 upstream.
      
      When a NFS directory page cache page is removed from the page cache,
      its contents are freed through a call to nfs_readdir_clear_array().
      To prevent the removal of the page cache entry until after we've
      finished reading it, we must take the page lock.
      
      Fixes: 11de3b11 ("NFS: Fix a memory leak in nfs_readdir")
      Cc: stable@vger.kernel.org # v2.6.37+
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Reviewed-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      d190c880
    • Trond Myklebust's avatar
      NFS: Fix memory leaks and corruption in readdir · 31c6482a
      Trond Myklebust authored
      
      stable inclusion
      from linux-4.19.103
      commit 68b1724316b0f0b554bc4af15f5ab8f52d2b1bed
      
      --------------------------------
      
      commit 4b310319 upstream.
      
      nfs_readdir_xdr_to_array() must not exit without having initialised
      the array, so that the page cache deletion routines can safely
      call nfs_readdir_clear_array().
      Furthermore, we should ensure that if we exit nfs_readdir_filler()
      with an error, we free up any page contents to prevent a leak
      if we try to fill the page again.
      
      Fixes: 11de3b11 ("NFS: Fix a memory leak in nfs_readdir")
      Cc: stable@vger.kernel.org # v2.6.37+
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Reviewed-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      31c6482a
    • Herbert Xu's avatar
      padata: Remove broken queue flushing · c4650c52
      Herbert Xu authored
      
      stable inclusion
      from linux-4.19.103
      commit dc34710a7aba5207e7cb99d11588c04535b3c53d
      
      --------------------------------
      
      [ Upstream commit 07928d9b ]
      
      The function padata_flush_queues is fundamentally broken because
      it cannot force padata users to complete the request that is
      underway.  IOW padata has to passively wait for the completion
      of any outstanding work.
      
      As it stands flushing is used in two places.  Its use in padata_stop
      is simply unnecessary because nothing depends on the queues to
      be flushed afterwards.
      
      The other use in padata_replace is more substantial as we depend
      on it to free the old pd structure.  This patch instead uses the
      pd->refcnt to dynamically free the pd structure once all requests
      are complete.
      
      Fixes: 2b73b07a ("padata: Flush the padata queues actively")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      c4650c52
    • Mikulas Patocka's avatar
      dm writecache: fix incorrect flush sequence when doing SSD mode commit · d1ed69bf
      Mikulas Patocka authored
      
      stable inclusion
      from linux-4.19.103
      commit a999296636fbbec86aa2af5025b89493532e907d
      
      --------------------------------
      
      commit aa950920 upstream.
      
      When committing state, the function writecache_flush does the following:
      1. write metadata (writecache_commit_flushed)
      2. flush disk cache (writecache_commit_flushed)
      3. wait for data writes to complete (writecache_wait_for_ios)
      4. increase superblock seq_count
      5. write the superblock
      6. flush disk cache
      
      It may happen that at step 3, when we wait for some write to finish, the
      disk may report the write as finished, but the write only hit the disk
      cache and it is not yet stored in persistent storage. At step 5 we write
      the superblock - it may happen that the superblock is written before the
      write that we waited for in step 3. If the machine crashes, it may result
      in incorrect data being returned after reboot.
      
      In order to fix the bug, we must swap steps 2 and 3 in the above sequence,
      so that we first wait for writes to complete and then flush the disk
      cache.
      
      Fixes: 48debafe ("dm: add writecache target")
      Cc: stable@vger.kernel.org # 4.18+
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      d1ed69bf
    • Mike Snitzer's avatar
      dm: fix potential for q->make_request_fn NULL pointer · b4109065
      Mike Snitzer authored
      
      stable inclusion
      from linux-4.19.103
      commit 9eb75d69e924722c3b062359f266073935ae0cf0
      
      --------------------------------
      
      commit 47ace7e0 upstream.
      
      Move blk_queue_make_request() to dm.c:alloc_dev() so that
      q->make_request_fn is never NULL during the lifetime of a DM device
      (even one that is created without a DM table).
      
      Otherwise generic_make_request() will crash simply by doing:
        dmsetup create -n test
        mount /dev/dm-N /mnt
      
      While at it, move ->congested_data initialization out of
      dm.c:alloc_dev() and into the bio-based specific init method.
      
      Reported-by: default avatarStefan Bader <stefan.bader@canonical.com>
      BugLink: https://bugs.launchpad.net/bugs/1860231
      
      
      Fixes: ff36ab34 ("dm: remove request-based logic from make_request_fn wrapper")
      Depends-on: c12c9a3c ("dm: various cleanups to md->queue initialization code")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        drivers/md/dm.c
      [yyl: change it as same as mainline patch]
      
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      b4109065
    • Milan Broz's avatar
      dm crypt: fix benbi IV constructor crash if used in authenticated mode · 221bef4f
      Milan Broz authored
      stable inclusion
      from linux-4.19.103
      commit 607d7cf285f75a5f57207b141398bec4888b3758
      
      --------------------------------
      
      commit 4ea9471f upstream.
      
      If benbi IV is used in AEAD construction, for example:
        cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256
      the constructor uses wrong skcipher function and crashes:
      
       BUG: kernel NULL pointer dereference, address: 00000014
       ...
       EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt]
       Call Trace:
        ? crypt_subkey_size+0x20/0x20 [dm_crypt]
        crypt_ctr+0x567/0xfc0 [dm_crypt]
        dm_table_add_target+0x15f/0x340 [dm_mod]
      
      Fix this by properly using crypt_aead_blocksize() in this case.
      
      Fixes: ef43aa38 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
      Cc: stable@vger.kernel.org # v4.12+
      Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051
      
      
      Reported-by: default avatarJerad Simpson <jbsimpson@gmail.com>
      Signed-off-by: default avatarMilan Broz <gmazyland@gmail.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      221bef4f
    • Joe Thornber's avatar
      dm space map common: fix to ensure new block isn't already in use · 2c5f84d3
      Joe Thornber authored
      
      stable inclusion
      from linux-4.19.103
      commit 1fac9f574cd2536abcdd645ee23d35c0e819d7ec
      
      --------------------------------
      
      commit 4feaef83 upstream.
      
      The space-maps track the reference counts for disk blocks allocated by
      both the thin-provisioning and cache targets.  There are variants for
      tracking metadata blocks and data blocks.
      
      Transactionality is implemented by never touching blocks from the
      previous transaction, so we can rollback in the event of a crash.
      
      When allocating a new block we need to ensure the block is free (has
      reference count of 0) in both the current and previous transaction.
      Prior to this fix we were doing this by searching for a free block in
      the previous transaction, and relying on a 'begin' counter to track
      where the last allocation in the current transaction was.  This
      'begin' field was not being updated in all code paths (eg, increment
      of a data block reference count due to breaking sharing of a neighbour
      block in the same btree leaf).
      
      This fix keeps the 'begin' field, but now it's just a hint to speed up
      the search.  Instead the current transaction is searched for a free
      block, and then the old transaction is double checked to ensure it's
      free.  Much simpler.
      
      This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
      DM thin-provisioning's snapshots are heavily used.
      
      Reported-by: default avatarEric Wheeler <dm-devel@lists.ewheeler.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      2c5f84d3
    • Dmitry Fomichev's avatar
      dm zoned: support zone sizes smaller than 128MiB · 25c8dfb0
      Dmitry Fomichev authored
      
      stable inclusion
      from linux-4.19.103
      commit 4ae8d3a5f3b47543fe90478b6707d1f03b966559
      
      --------------------------------
      
      commit b3996295 upstream.
      
      dm-zoned is observed to log failed kernel assertions and not work
      correctly when operating against a device with a zone size smaller
      than 128MiB (e.g. 32768 bits per 4K block). The reason is that the
      bitmap size per zone is calculated as zero with such a small zone
      size. Fix this problem and also make the code related to zone bitmap
      management be able to handle per zone bitmaps smaller than a single
      block.
      
      A dm-zoned-tools patch is required to properly format dm-zoned devices
      with zone sizes smaller than 128MiB.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      25c8dfb0
    • Amir Goldstein's avatar
      ovl: fix wrong WARN_ON() in ovl_cache_update_ino() · 2386832f
      Amir Goldstein authored
      
      stable inclusion
      from linux-4.19.103
      commit 65a876ee848b23170708648a418fe260d4c7ae3a
      
      --------------------------------
      
      commit 4c37e71b upstream.
      
      The WARN_ON() that child entry is always on overlay st_dev became wrong
      when we allowed this function to update d_ino in non-samefs setup with xino
      enabled.
      
      It is not true in case of xino bits overflow on a non-dir inode.  Leave the
      WARN_ON() only for directories, where assertion is still true.
      
      Fixes: adbf4f7e ("ovl: consistent d_ino for non-samefs with xino")
      Cc: <stable@vger.kernel.org> # v4.17+
      Signed-off-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      2386832f
    • Stephen Boyd's avatar
      alarmtimer: Unregister wakeup source when module get fails · 7cd19d93
      Stephen Boyd authored
      
      stable inclusion
      from linux-4.19.103
      commit b522ff023e2daf109de25c94695b400bd217b3aa
      
      --------------------------------
      
      commit 6b6d188a upstream.
      
      The alarmtimer_rtc_add_device() function creates a wakeup source and then
      tries to grab a module reference. If that fails the function returns early
      with an error code, but fails to remove the wakeup source.
      
      Cleanup this exit path so there is no dangling wakeup source, which is
      named 'alarmtime' left allocated which will conflict with another RTC
      device that may be registered later.
      
      Fixes: 51218298 ("alarmtimer: Ensure RTC module is not unloaded")
      Signed-off-by: default avatarStephen Boyd <swboyd@chromium.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarDouglas Anderson <dianders@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20200109155910.907-2-swboyd@chromium.org
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      7cd19d93
    • Kevin Hao's avatar
      irqdomain: Fix a memory leak in irq_domain_push_irq() · dfb27fde
      Kevin Hao authored
      
      stable inclusion
      from linux-4.19.103
      commit 4f7d834cece26b4544f2eb4e6f46e70ec3caf7b2
      
      --------------------------------
      
      commit 0f394dae upstream.
      
      Fix a memory leak reported by kmemleak:
      unreferenced object 0xffff000bc6f50e80 (size 128):
        comm "kworker/23:2", pid 201, jiffies 4294894947 (age 942.132s)
        hex dump (first 32 bytes):
          00 00 00 00 41 00 00 00 86 c0 03 00 00 00 00 00  ....A...........
          00 a0 b2 c6 0b 00 ff ff 40 51 fd 10 00 80 ff ff  ........@Q......
        backtrace:
          [<00000000e62d2240>] kmem_cache_alloc_trace+0x1a4/0x320
          [<00000000279143c9>] irq_domain_push_irq+0x7c/0x188
          [<00000000d9f4c154>] thunderx_gpio_probe+0x3ac/0x438
          [<00000000fd09ec22>] pci_device_probe+0xe4/0x198
          [<00000000d43eca75>] really_probe+0xdc/0x320
          [<00000000d3ebab09>] driver_probe_device+0x5c/0xf0
          [<000000005b3ecaa0>] __device_attach_driver+0x88/0xc0
          [<000000004e5915f5>] bus_for_each_drv+0x7c/0xc8
          [<0000000079d4db41>] __device_attach+0xe4/0x140
          [<00000000883bbda9>] device_initial_probe+0x18/0x20
          [<000000003be59ef6>] bus_probe_device+0x98/0xa0
          [<0000000039b03d3f>] deferred_probe_work_func+0x74/0xa8
          [<00000000870934ce>] process_one_work+0x1c8/0x470
          [<00000000e3cce570>] worker_thread+0x1f8/0x428
          [<000000005d64975e>] kthread+0xfc/0x128
          [<00000000f0eaa764>] ret_from_fork+0x10/0x18
      
      Fixes: 495c38d3 ("irqdomain: Add irq_domain_{push,pop}_irq() functions")
      Signed-off-by: default avatarKevin Hao <haokexin@gmail.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20200120043547.22271-1-haokexin@gmail.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      dfb27fde
    • Eric Dumazet's avatar
      rcu: Avoid data-race in rcu_gp_fqs_check_wake() · b4155769
      Eric Dumazet authored
      stable inclusion
      from linux-4.19.103
      commit 00b13445f92180a4ac83c5fc1b7cce3ea5ed9a6d
      
      --------------------------------
      
      commit 6935c398 upstream.
      
      The rcu_gp_fqs_check_wake() function uses rcu_preempt_blocked_readers_cgp()
      to read ->gp_tasks while other cpus might overwrite this field.
      
      We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler
      tricks and KCSAN splats like the following :
      
      BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / rcu_preempt_deferred_qs_irqrestore
      
      write to 0xffffffff85a7f190 of 8 bytes by task 7317 on cpu 0:
       rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
       rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
       __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
       rcu_read_unlock include/linux/rcupdate.h:645 [inline]
       __ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
       ip_queue_xmit+0x45/0x60 include/net/ip.h:236
       __tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_o...
      b4155769
    • Lu Shuaibing's avatar
      ipc/msg.c: consolidate all xxxctl_down() functions · ac8a1041
      Lu Shuaibing authored
      stable inclusion
      from linux-4.19.103
      commit 078dd7328e21ef634c5d28f1dc21841f590f6a75
      
      --------------------------------
      
      commit 889b3317 upstream.
      
      A use of uninitialized memory in msgctl_down() because msqid64 in
      ksys_msgctl hasn't been initialized.  The local | msqid64 | is created in
      ksys_msgctl() and then passed into msgctl_down().  Along the way msqid64
      is never initialized before msgctl_down() checks msqid64->msg_qbytes.
      
      KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool)
      reports:
      
      ==================================================================
      BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
      Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022
      
      CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      Call Trace:
       dump_stack+0x75/0xae
       __kumsan_report+0x17c/0x3e6
       kumsan_report+0xe/0x20
       msgctl_down+0x94/0x300
       ksys_msgctl.constprop.14+0xef/0x260
       do_syscall_64+0x7e/0x1f0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x4400e9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
      R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
      R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000
      
      The buggy address belongs to the page:
      page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x100000000000000()
      raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kumsan: bad access detected
      ==================================================================
      
      Syzkaller reproducer:
      msgctl$IPC_RMID(0x0, 0x0)
      
      C reproducer:
      // autogenerated by syzkaller (https://github.com/google/syzkaller)
      
      int main(void)
      {
        syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
        syscall(__NR_msgctl, 0, 0, 0);
        return 0;
      }
      
      [natechancellor@gmail.com: adjust indentation in ksys_msgctl]
        Link: https://github.com/ClangBuiltLinux/linux/issues/829
        Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
      Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
      
      
      Signed-off-by: default avatarLu Shuaibing <shuaibinglu@126.com>
      Signed-off-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Suggested-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: NeilBrown <neilb@suse.com>
      From: Andrew Morton <akpm@linux-foundation.org>
      Subject: ipc/msg.c: consolidate all xxxctl_down() functions
      
      Each line here overflows 80 cols by exactly one character.  Delete one tab
      per line to fix.
      
      Cc: Shaohua Li <shli@fb.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      ac8a1041
    • YueHaibing's avatar
      kernel/module: Fix memleak in module_add_modinfo_attrs() · 56cd947e
      YueHaibing authored
      
      stable inclusion
      from linux-4.19.103
      commit bdfaaf35ac49ef43d1f3a6ccf6f27b8e64f61828
      
      --------------------------------
      
      [ Upstream commit f6d061d6 ]
      
      In module_add_modinfo_attrs() if sysfs_create_file() fails
      on the first iteration of the loop (so i = 0), we forget to
      free the modinfo_attrs.
      
      Fixes: bc6f2a75 ("kernel/module: Fix mem leak in module_add_modinfo_attrs")
      Reviewed-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: default avatarJessica Yu <jeyu@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      56cd947e
    • Wei Yang's avatar
      mm/migrate.c: also overwrite error when it is bigger than zero · eb5042c1
      Wei Yang authored
      stable inclusion
      from linux-4.19.102
      commit b6606cc13491b4065a7a762890b9c21098812acd
      
      --------------------------------
      
      [ Upstream commit dfe9aa23 ]
      
      If we get here after successfully adding page to list, err would be 1 to
      indicate the page is queued in the list.
      
      Current code has two problems:
      
        * on success, 0 is not returned
        * on error, if add_page_for_migratioin() return 1, and the following err1
          from do_move_pages_to_node() is set, the err1 is not returned since err
          is 1
      
      And these behaviors break the user interface.
      
      Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
      
      
      Fixes: e0153fc2 ("mm: move_pages: return valid node id in status if the page is already on the target node").
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Acked-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      eb5042c1
    • David Hildenbrand's avatar
      mm/memory_hotplug: shrink zones when offlining memory · 8de9a7ef
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit 86834898d5a5e5aef9ae6d285201f2d99a4eb300
      
      --------------------------------
      
      commit feee6b29 upstream.
      
      -- snip --
      
      - Missing arm64 hot(un)plug support
      - Missing some vmem_altmap_offset() cleanups
      - Missing sub-section hotadd support
      - Missing unification of mm/hmm.c and kernel/memremap.c
      
      -- snip --
      
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone(() for that.  We now
      properly shrink the zones, even if we have DIMMs whereby
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks that fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      
      
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      [yyl: drop the zone parameter in arch/arm64/mm/mmu.c]
      
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      8de9a7ef
    • David Hildenbrand's avatar
      mm/memory_hotplug: fix try_offline_node() · 25086651
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit d98d053efa65169c4e29b8b477cd5bc2b7aef513
      
      --------------------------------
      
      commit 2c91f8fc upstream.
      
      -- snip --
      
      Only contextual issues:
      - Unrelated check_and_unmap_cpu_on_node() changes are missing.
      - Unrelated walk_memory_blocks() has not been moved/refactored yet.
      
      -- snip --
      
      try_offline_node() is pretty much broken right now:
      
       - The node span is updated when onlining memory, not when adding it. We
         ignore memory that was mever onlined. Bad.
      
       - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
         trigger a kernel panic. Bad for memory that is offline but also bad
         for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
         first PFN of a section might contain garbage.
      
       - Sections belonging to mixed nodes are not properly considered.
      
      As memory blocks might belong to multiple nodes, we would have to walk
      all pageblocks (or at least subsections) within present sections.
      However, we don't have a way to identify whether a memmap that is not
      online was initialized (relevant for ZONE_DEVICE).  This makes things
      more complicated.
      
      Luckily, we can piggy pack on the node span and the nid stored in memory
      blocks.  Currently, the node span is grown when calling
      move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
      removing memory, before calling try_offline_node().  Sysfs links are
      created via link_mem_sections(), e.g., during boot or when adding
      memory.
      
      If the node still spans memory or if any memory block belongs to the
      nid, we don't set the node offline.  As memory blocks that span multiple
      nodes cannot get offlined, the nid stored in memory blocks is reliable
      enough (for such online memory blocks, the node still spans the memory).
      
      Introduce for_each_memory_block() to efficiently walk all memory blocks.
      
      Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
      when removing ZONE_DEVICE memory to fix similar issues (access of
      garbage memmaps) - until we have a reliable way to identify whether
      these memmaps were properly initialized.  This implies later, that once
      a node had ZONE_DEVICE memory, we won't be able to set a node offline -
      which should be acceptable.
      
      Since commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") memory that is added is not
      assoziated with a zone/node (memmap not initialized).  The introducing
      commit 60a5a19e ("memory-hotplug: remove sysfs file of node")
      already missed that we could have multiple nodes for a section and that
      the zone/node span is updated when onlining pages, not when adding them.
      
      I tested this by hotplugging two DIMMs to a memory-less and cpu-less
      NUMA node.  The node is properly onlined when adding the DIMMs.  When
      removing the DIMMs, the node is properly offlined.
      
      Masayoshi Mizuma reported:
      
      : Without this patch, memory hotplug fails as panic:
      :
      :  BUG: kernel NULL pointer dereference, address: 0000000000000000
      :  ...
      :  Call Trace:
      :   remove_memory_block_devices+0x81/0xc0
      :   try_remove_memory+0xb4/0x130
      :   __remove_memory+0xa/0x20
      :   acpi_memory_device_remove+0x84/0x100
      :   acpi_bus_trim+0x57/0x90
      :   acpi_bus_trim+0x2e/0x90
      :   acpi_device_hotplug+0x2b2/0x4d0
      :   acpi_hotplug_work_fn+0x1a/0x30
      :   process_one_work+0x171/0x380
      :   worker_thread+0x49/0x3f0
      :   kthread+0xf8/0x130
      :   ret_from_fork+0x35/0x40
      
      [david@redhat.com: v3]
        Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
      
      
      Fixes: 60a5a19e ("memory-hotplug: remove sysfs file of node")
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visiable after d0dc12e8
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: default avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Nayna Jain <nayna@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        mm/memory_hotplug.c
      [yyl: adjust context]
      
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      25086651
    • Aneesh Kumar K.V's avatar
      mm/memunmap: don't access uninitialized memmap in memunmap_pages() · af893605
      Aneesh Kumar K.V authored
      stable inclusion
      from linux-4.19.100
      commit f291080659d2bba2a33e65de57e1c2ce11473dac
      
      --------------------------------
      
      commit 77e080e7 upstream.
      
      -- snip --
      
      - Missing mm/hmm.c and kernel/memremap.c unification.
      -- hmm code does not need fixes (no altmap)
      - Missing 7cc7867f ("mm/devm_memremap_pages: enable sub-section remap")
      
      -- snip --
      
      Patch series "mm/memory_hotplug: Shrink zones before removing memory",
      v6.
      
      This series fixes the access of uninitialized memmaps when shrinking
      zones/nodes and when removing memory.  Also, it contains all fixes for
      crashes that can be triggered when removing certain namespace using
      memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
      
      We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
      more involved (we don't have SECTION_IS_ONLINE as an indicator), and
      shrinking is only of limited use (set_zone_contiguous() cannot detect
      the ZONE_DEVICE as contiguous).
      
      We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount
      of code to a minimum.  Shrinking is especially necessary to keep
      zone->contiguous set where possible, especially, on memory unplug of
      DIMMs at zone boundaries.
      
      --------------------------------------------------------------------------
      
      Zones are now properly shrunk when offlining memory blocks or when
      onlining failed.  This allows to properly shrink zones on memory unplug
      even if the separate memory blocks of a DIMM were onlined to different
      zones or re-onlined to a different zone after offlining.
      
      Example:
      
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
        :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
        :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  98304
                present  65536
                managed  65536
        :/# echo 0 > /sys/devices/system/memory/memory43/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  32768
                present  32768
                managed  32768
        :/# echo 0 > /sys/devices/system/memory/memory41/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
      
      This patch (of 10):
      
      With an altmap, the memmap falling into the reserved altmap space are not
      initialized and, therefore, contain a garbage NID and a garbage zone.
      Make sure to read the NID/zone from a memmap that was initialized.
      
      This fixes a kernel crash that is observed when destroying a namespace:
      
        kernel BUG at include/linux/mm.h:1107!
        cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
            pc: c0000000004b9728: memunmap_pages+0x238/0x340
            lr: c0000000004b9724: memunmap_pages+0x234/0x340
        ...
            pid   = 3669, comm = ndctl
        kernel BUG at include/linux/mm.h:1107!
          devm_action_release+0x30/0x50
          release_nodes+0x268/0x2d0
          device_release_driver_internal+0x174/0x240
          unbind_store+0x13c/0x190
          drv_attr_store+0x44/0x60
          sysfs_kf_write+0x70/0xa0
          kernfs_fop_write+0x1ac/0x290
          __vfs_write+0x3c/0x70
          vfs_write+0xe4/0x200
          ksys_write+0x7c/0x140
          system_call+0x5c/0x68
      
      The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f ("mm,
      devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
      think we will never have driver reserved memory with
      MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
      
      [david@redhat.com: minimze code changes, rephrase description]
      Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
      
      
      Fixes: 2c2a5af6 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      af893605
    • David Hildenbrand's avatar
      drivers/base/node.c: simplify unregister_memory_block_under_nodes() · 4486b068
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit d830a11c6279c416424e0981b241e58bf51c7c8b
      
      --------------------------------
      
      commit d84f2f5a upstream.
      
      We don't allow to offline memory block devices that belong to multiple
      numa nodes.  Therefore, such devices can never get removed.  It is
      sufficient to process a single node when removing the memory block.  No
      need to iterate over each and every PFN.
      
      We already have the nid stored for each memory block.  Make sure that the
      nid always has a sane value.
      
      Please note that checking for node_online(nid) is not required.  If we
      would have a memory block belonging to a node that is no longer offline,
      then we would have a BUG in the node offlining code.
      
      Link: http://lkml.kernel.org/r/20190719135244.15242-1-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      4486b068
    • Dan Williams's avatar
      mm/hotplug: kill is_dev_zone() usage in __remove_pages() · 74334f69
      Dan Williams authored
      stable inclusion
      from linux-4.19.100
      commit b9cda6501a340c5d15487575a68454748e3132a4
      
      --------------------------------
      
      commit 96da4350 upstream.
      
      -- snip --
      
      Minor conflict, keep the altmap check.
      
      -- snip --
      
      The zone type check was a leftover from the cleanup that plumbed altmap
      through the memory hotplug path, i.e.  commit da024512 "mm: pass the
      vmem_altmap to arch_remove_memory and __remove_pages".
      
      Link: http://lkml.kernel.org/r/156092352642.979959.6664333788149363039.stgit@dwillia2-desk3.amr.corp.intel.com
      
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      74334f69
    • David Hildenbrand's avatar
      mm/memory_hotplug: remove "zone" parameter from sparse_remove_one_section · 457dd3c8
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit dc6be8597c8c2959a7ba0c7cc809094d5dd8d99a
      
      --------------------------------
      
      commit b9bf8d34 upstream.
      
      The parameter is unused, so let's drop it.  Memory removal paths should
      never care about zones.  This is the job of memory offlining and will
      require more refactorings.
      
      Link: http://lkml.kernel.org/r/20190527111152.16324-12-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      457dd3c8
    • David Hildenbrand's avatar
      mm/memory_hotplug: make unregister_memory_block_under_nodes() never fail · 3b0f0dae
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit 030d045dc067c6b67e4fb4a69e56d7cbda1b1f65
      
      --------------------------------
      
      commit a31b264c upstream.
      
      We really don't want anything during memory hotunplug to fail.  We
      always pass a valid memory block device, that check can go.  Avoid
      allocating memory and eventually failing.  As we are always called under
      lock, we can use a static piece of memory.  This avoids having to put
      the structure onto the stack, having to guess about the stack size of
      callers.
      
      Patch inspired by a patch from Oscar Salvador.
      
      In the future, there might be no need to iterate over nodes at all.
      mem->nid should tell us exactly what to remove.  Memory block devices
      with mixed nodes (added during boot) should properly fenced off and
      never removed.
      
      Link: http://lkml.kernel.org/r/20190527111152.16324-11-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      3b0f0dae
    • David Hildenbrand's avatar
      mm/memory_hotplug: remove memory block devices before arch_remove_memory() · 05065fee
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit d883abbc097e9e86ffba1342c6972ebdda06a127
      
      --------------------------------
      
      commit 4c4b7f9b upstream.
      
      Let's factor out removing of memory block devices, which is only
      necessary for memory added via add_memory() and friends that created
      memory block devices.  Remove the devices before calling
      arch_remove_memory().
      
      This finishes factoring out memory block device handling from
      arch_add_memory() and arch_remove_memory().
      
      Link: http://lkml.kernel.org/r/20190527111152.16324-10-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      05065fee
    • David Hildenbrand's avatar
      mm/memory_hotplug: create memory block devices after arch_add_memory() · 569f3192
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit aa49b6abcefc9baf8d254f75ed19c38b6866951f
      
      --------------------------------
      
      commit db051a0d upstream.
      
      Only memory to be added to the buddy and to be onlined/offlined by user
      space using /sys/devices/system/memory/...  needs (and should have!)
      memory block devices.
      
      Factor out creation of memory block devices.  Create all devices after
      arch_add_memory() succeeded.  We can later drop the want_memblock
      parameter, because it is now effectively stale.
      
      Only after memory block devices have been added, memory can be onlined
      by user space.  This implies, that memory is not visible to user space
      at all before arch_add_memory() succeeded.
      
      While at it
       - use WARN_ON_ONCE instead of BUG_ON in moved unregister_memory()
       - introduce find_memory_block_by_id() to search via block id
       - Use find_memory_block_by_id() in init_memory_block() to catch
         duplicates
      
      Link: http://lkml.kernel.org/r/20190527111152.16324-8-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@soleen.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      569f3192