  1. Sep 22, 2020
    • genirq/affinity: Handle affinity setting on inactive interrupts correctly · c998ccb2
      Thomas Gleixner authored
      
      stable inclusion
      from linux-4.19.134
      commit 2048e4375c552614d26a7191394d8a8398fe7a85
      
      --------------------------------
      
      commit baedb87d upstream.
      
      Setting interrupt affinity on inactive interrupts is inconsistent when
      hierarchical irq domains are enabled. The core code should just store the
      affinity and not call into the irq chip driver for inactive interrupts
      because the chip drivers may not be in a state to handle such requests.
      
      X86 has a hacky workaround for this, but all other irq chips do not,
      which causes problems e.g. on GIC v3 ITS.
      
      Instead of adding more ugly hacks all over the place, solve the problem in
      the core code. If the affinity is set on an inactive interrupt then:
      
          - Store it in the irq descriptor's affinity mask
          - Update the effective affinity to reflect that so user space has
            a consistent view
          - Don't call into the irq chip driver
      
      This is the core equivalent of the X86 workaround and works correctly
      because the affinity setting is established in the irq chip when the
      interrupt is activated later on.
      
      Note that this is only effective when hierarchical irq domains are
      enabled by the architecture. Doing it unconditionally would break legacy
      irq chip implementations.
      
      For hierarchical irq domains this works correctly because, by design,
      none of the drivers can depend on the affinity being set while the
      interrupt is in the inactive state.
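
      A minimal sketch of the core-code check this describes, following the
      shape of the upstream change in kernel/irq/manage.c (helper names taken
      from the irq core; locking and housekeeping flags omitted):

        static bool irq_set_affinity_deactivated(struct irq_data *data,
                                                 const struct cpumask *mask)
        {
                struct irq_desc *desc = irq_data_to_desc(data);

                /*
                 * Only for hierarchical irq domains; legacy chips may rely
                 * on affinity being programmed while inactive.
                 */
                if (!IS_ENABLED(CONFIG_IRQ_DOMAIN_HIERARCHY) ||
                    irqd_is_activated(data))
                        return false;

                /* Store the affinity and update the effective affinity ... */
                cpumask_copy(desc->irq_common_data.affinity, mask);
                irq_data_update_effective_affinity(data, mask);

                /* ... and do not call into the irq chip driver. */
                return true;
        }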
      
      Remove the X86 workaround as it is no longer required.
      
      Fixes: 02edee15 ("x86/apic/vector: Ignore set_affinity call for inactive interrupts")
      Reported-by: Ali Saidi <alisaidi@amazon.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Ali Saidi <alisaidi@amazon.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20200529015501.15771-1-alisaidi@amazon.com
      Link: https://lkml.kernel.org/r/877dv2rv25.fsf@nanos.tec.linutronix.de
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c998ccb2
    • kretprobe: Prevent triggering kretprobe from within kprobe_flush_task · ea29c178
      Jiri Olsa authored
      stable inclusion
      from linux-4.19.130
      commit 98abe944f93faf19c3707f8f188df5073e5668f8
      
      --------------------------------
      
      [ Upstream commit 9b38cc70 ]
      
      Ziqian reported a lockup when adding a retprobe on _raw_spin_lock_irqsave.
      My test was also able to trigger lockdep output:
      
       ============================================
       WARNING: possible recursive locking detected
       5.6.0-rc6+ #6 Not tainted
       --------------------------------------------
       sched-messaging/2767 is trying to acquire lock:
       ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0
      
       but task is already holding lock:
       ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&(kretprobe_table_locks[i].lock));
         lock(&(kretprobe_table_locks[i].lock));
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       1 lock held by sched-messaging/2767:
        #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50
      
       stack backtrace:
       CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6
       Call Trace:
        dump_stack+0x96/0xe0
        __lock_acquire.cold.57+0x173/0x2b7
        ? native_queued_spin_lock_slowpath+0x42b/0x9e0
        ? lockdep_hardirqs_on+0x590/0x590
        ? __lock_acquire+0xf63/0x4030
        lock_acquire+0x15a/0x3d0
        ? kretprobe_hash_lock+0x52/0xa0
        _raw_spin_lock_irqsave+0x36/0x70
        ? kretprobe_hash_lock+0x52/0xa0
        kretprobe_hash_lock+0x52/0xa0
        trampoline_handler+0xf8/0x940
        ? kprobe_fault_handler+0x380/0x380
        ? find_held_lock+0x3a/0x1c0
        kretprobe_trampoline+0x25/0x50
        ? lock_acquired+0x392/0xbc0
        ? _raw_spin_lock_irqsave+0x50/0x70
        ? __get_valid_kprobe+0x1f0/0x1f0
        ? _raw_spin_unlock_irqrestore+0x3b/0x40
        ? finish_task_switch+0x4b9/0x6d0
        ? __switch_to_asm+0x34/0x70
        ? __switch_to_asm+0x40/0x70
      
      The code within the kretprobe handler checks for probe reentrancy,
      so we won't trigger any _raw_spin_lock_irqsave probe in there.
      
      The problem is outside the kretprobe handler, in kprobe_flush_task, where we call:
      
        kprobe_flush_task
          kretprobe_table_lock
            raw_spin_lock_irqsave
              _raw_spin_lock_irqsave
      
      where _raw_spin_lock_irqsave triggers the kretprobe and installs
      kretprobe_trampoline handler on _raw_spin_lock_irqsave return.
      
      The kretprobe_trampoline handler is then executed with
      kretprobe_table_locks already held, and the first thing it does is to
      take kretprobe_table_locks again ;-) The whole lockup path looks like:
      
        kprobe_flush_task
          kretprobe_table_lock
            raw_spin_lock_irqsave
              _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed
      
              ---> kretprobe_table_locks locked
      
              kretprobe_trampoline
                trampoline_handler
                  kretprobe_hash_lock(current, &head, &flags);  <--- deadlock
      
      Add kprobe_busy_begin/end helpers that mark a section of code as having
      a fake probe installed, to prevent triggering another kprobe within that
      section.
      
      Use these helpers in kprobe_flush_task, so the probe recursion
      protection check is hit and the probe is never triggered, preventing the
      lockup above.
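
      A sketch of those helpers, modeled on the shape of the upstream change
      to kernel/kprobes.c (simplified; the dummy probe's address only needs to
      be some resident kernel text address):

        static struct kprobe kprobe_busy = {
                .addr = (void *)get_kprobe,
        };

        void kprobe_busy_begin(void)
        {
                struct kprobe_ctlblk *kcb;

                /* Pretend a probe is already running on this CPU. */
                preempt_disable();
                __this_cpu_write(current_kprobe, &kprobe_busy);
                kcb = get_kprobe_ctlblk();
                kcb->kprobe_status = KPROBE_HIT_ACTIVE;
        }

        void kprobe_busy_end(void)
        {
                __this_cpu_write(current_kprobe, NULL);
                preempt_enable();
        }

      kprobe_flush_task() then brackets its kretprobe_table_lock section with
      kprobe_busy_begin()/kprobe_busy_end(), so the reentrancy check in the
      kprobe handlers fires instead of installing the trampoline.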
      
      Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2
      
      Fixes: ef53d9c5 ("kprobes: improve kretprobe scalability with hashed locking")
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Reported-by: "Ziqian SUN (Zamir)" <zsun@redhat.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ea29c178
    • x86/apic/msi: Plug non-maskable MSI affinity race · 7ff6d61f
      Thomas Gleixner authored
      
      stable inclusion
      from linux-4.19.103
      commit 032a2bf9787acdaef31369045ff0cb0b301eee61
      
      --------------------------------
      
      commit 6f1a4891 upstream.
      
      Evan tracked down a subtle race between the update of the MSI message and
      the device raising an interrupt internally on PCI devices which do not
      support MSI masking. The update of the MSI message is non-atomic and
      consists of either 2 or 3 sequential 32bit wide writes to the PCI config
      space.
      
         - Write address low 32bits
         - Write address high 32bits (If supported by device)
         - Write data
      
      When an interrupt is migrated, both address and data might change, so
      the kernel attempts to mask the MSI interrupt first. But MSI masking is
      optional, so there exist devices which do not provide it. That means
      that if the device raises an interrupt internally between the writes, an
      MSI message built from half-updated state is sent.
      
      On x86 this can lead to spurious interrupts on the wrong interrupt
      vector when the affinity setting changes both address and data. As a
      consequence the device interrupt can be lost causing the device to
      become stuck or malfunctioning.
      
      Evan tried to handle that by disabling MSI across an MSI message
      update. That's not feasible because disabling MSI has issues of its own:
      
       If MSI is disabled the PCI device is routing an interrupt to the legacy
       INTx mechanism. The INTx delivery can be disabled, but the disablement is
       not working on all devices.
      
       Some devices lose interrupts when both MSI and INTx delivery are disabled.
      
      Another way to solve this would be to enforce the allocation of the same
      vector on all CPUs in the system for this kind of broken device. That
      could be done, but it would bring back the vector space exhaustion
      problems which were solved a few years ago.
      
      Fortunately the high address (if supported by the device) is only relevant
      when X2APIC is enabled which implies interrupt remapping. In the interrupt
      remapping case the affinity setting is happening at the interrupt remapping
      unit and the PCI MSI message is programmed only once when the PCI device is
      initialized.
      
      That makes it possible to solve it with a two step update:
      
        1) Target the MSI msg to the new vector on the current target CPU
      
        2) Target the MSI msg to the new vector on the new target CPU
      
      In both cases writing the MSI message is only changing a single 32bit word
      which prevents the issue of inconsistency.
      
      After writing the final destination it is necessary to check whether the
      device issued an interrupt while the intermediate state #1 (new vector,
      current CPU) was in effect.
      
      This is possible because the affinity change is always happening on the
      current target CPU. The code runs with interrupts disabled, so the
      interrupt can be detected by checking the IRR of the local APIC. If the
      vector is pending in the IRR then the interrupt is retriggered on the new
      target CPU by sending an IPI for the associated vector on the target CPU.
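
      A sketch of that two-step update, following the shape of
      msi_set_affinity() in arch/x86/kernel/apic/msi.c (heavily simplified:
      quirk checks, locking and error handling omitted; helper names as in the
      upstream patch):

        static int msi_set_affinity(struct irq_data *irqd,
                                    const struct cpumask *mask, bool force)
        {
                struct irq_cfg old_cfg, *cfg = irqd_cfg(irqd);
                struct irq_data *parent = irqd->parent_data;
                int ret;

                old_cfg = *cfg;

                /* Let the vector domain pick the new vector/target CPU. */
                ret = parent->chip->irq_set_affinity(parent, mask, force);
                if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
                        return ret;

                /* Step 1: new vector, current target CPU (one 32bit write). */
                old_cfg.vector = cfg->vector;
                irq_msi_update_msg(irqd, &old_cfg);

                /* Step 2: new vector, new target CPU (one 32bit write). */
                irq_msi_update_msg(irqd, cfg);

                /*
                 * If the device fired into the transient step #1 state, the
                 * vector is pending in the local APIC's IRR. Retrigger it on
                 * the new target CPU.
                 */
                if (lapic_vector_set_in_irr(cfg->vector))
                        irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);

                return ret;
        }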
      
      This can cause spurious interrupts on both the local and the new target
      CPU.
      
       1) If the new vector is not in use on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then interrupt entry code will
          ignore that spurious interrupt. The vector is marked so that the
          'No irq handler for vector' warning is suppressed once.
      
       2) If the new vector is in use already on the local CPU then the IRR
          check might see a pending interrupt from the device which is using
          this vector. The IPI to the new target CPU will then invoke the
          handler of the device which got the affinity change, even if that
          device did not issue an interrupt.
      
       3) If the new vector is in use already on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then the handler of the device which
          uses that vector on the local CPU will be invoked.
      
      Cases #2 and #3 might expose issues in device driver interrupt handlers
      which are not prepared to handle a spurious interrupt correctly. This is
      not a regression; it just exposes something which was already broken, as
      spurious interrupts can happen for a lot of reasons and all driver
      handlers need to be able to deal with them.
      
      Reported-by: Evan Green <evgreen@chromium.org>
      Debugged-by: Evan Green <evgreen@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Evan Green <evgreen@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7ff6d61f
    • mm/memory_hotplug: shrink zones when offlining memory · 8de9a7ef
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit 86834898d5a5e5aef9ae6d285201f2d99a4eb300
      
      --------------------------------
      
      commit feee6b29 upstream.
      
      -- snip --
      
      - Missing arm64 hot(un)plug support
      - Missing some vmem_altmap_offset() cleanups
      - Missing sub-section hotadd support
      - Missing unification of mm/hmm.c and kernel/memremap.c
      
      -- snip --
      
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone() for that. We now properly
      shrink the zones, even if we have DIMMs whereby:
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
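
      The resulting interface change, per the description above (prototypes
      follow the upstream patch; the stable backport may differ in detail):

        /*
         * New: zone shrinking now happens when offlining (or when onlining
         * failed), not when removing memory.
         */
        void remove_pfn_range_from_zone(struct zone *zone,
                                        unsigned long start_pfn,
                                        unsigned long nr_pages);

        /* The (potentially stale) zone parameter is gone from removal. */
        void __remove_pages(unsigned long pfn, unsigned long nr_pages,
                            struct vmem_altmap *altmap);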
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      
      
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      [yyl: drop the zone parameter in arch/arm64/mm/mmu.c]
      
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8de9a7ef
    • mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVE · 34b64663
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit 000a1d59cfe9d6e875462ed72de32770322c282b
      
      --------------------------------
      
      commit 80ec922d upstream.
      
      -- snip --
      
      Missing arm64 memory hot(un)plug support.
      
      -- snip --
      
      We want to improve error handling while adding memory by allowing the
      use of arch_remove_memory() and __remove_pages() even if
      CONFIG_MEMORY_HOTREMOVE is not set, to e.g. implement something like:
      
      	arch_add_memory()
      	rc = do_something();
      	if (rc) {
      		arch_remove_memory();
      	}
      
      We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as that would
      require untangling quite a few dependencies in the memory offlining
      code.
      
      Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      [yyl: remove CONFIG_MEMORY_HOTREMOVE in arch/arm64/mm/mmu.c]
      
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      34b64663
    • mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail · a860e45f
      David Hildenbrand authored
      stable inclusion
      from linux-4.19.100
      commit 5163b1ec3a0c3a2e1e53b7794b64866cd6ba8697
      
      --------------------------------
      
      commit ac5c9426 upstream.
      
      -- snip --
      
      Minor conflict in arch/powerpc/mm/mem.c
      
      -- snip --
      
      All callers of arch_remove_memory() ignore errors, and we should really
      try to remove any errors from the memory removal path. No more errors
      are reported from __remove_pages(). BUG() in the s390x code in case
      arch_remove_memory() is triggered; we may implement that properly later.
      WARN in case the powerpc code fails to remove the section mapping, which
      is better than ignoring the error completely for now.
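
      The resulting prototypes, as described (a sketch following the upstream
      patch; previously both returned an int that every caller ignored):

        void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
                            unsigned long nr_pages, struct vmem_altmap *altmap);
        void arch_remove_memory(int nid, u64 start, u64 size,
                                struct vmem_altmap *altmap);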
      
      Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mike Travis <mike.travis@hpe.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a860e45f
    • mm, memory_hotplug: add nid parameter to arch_remove_memory · 1ef436b7
      Oscar Salvador authored
      stable inclusion
      from linux-4.19.100
      commit 5c1f8f5358e8cd501245ee0e954dc0c0b231d6a2
      
      --------------------------------
      
      commit 2c2a5af6 upstream.
      
      -- snip --
      
      Missing unification of mm/hmm.c and kernel/memremap.c
      
      -- snip --
      
      Patch series "Do not touch pages in hot-remove path", v2.
      
      This patchset aims for two things:
      
       1) A better definition of the offline and hot-remove stages
       2) Solving bugs where we can access non-initialized pages
          during hot-remove operations [2] [3].
      
      This is achieved by moving all page/zone handling to the offline
      stage, so we do not need to access pages when hot-removing memory.
      
      [1] https://patchwork.kernel.org/cover/10691415/
      [2] https://patchwork.kernel.org/patch/10547445/
      [3] https://www.spinics.net/lists/linux-mm/msg161316.html
      
      This patch (of 5):
      
      This is preparation for the follow-up patches. The idea of passing the
      nid is that it will allow us to get rid of the zone parameter
      afterwards.
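
      The signature change this makes (a sketch; per the description, the node
      id is now passed in explicitly instead of being derived later):

        /* Before */
        int arch_remove_memory(u64 start, u64 size,
                               struct vmem_altmap *altmap);

        /* After */
        int arch_remove_memory(int nid, u64 start, u64 size,
                               struct vmem_altmap *altmap);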
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
      
      
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1ef436b7
  2. Aug 31, 2020
    • 32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option · e6d26bce
      Yury Norov authored
      
      hulk inclusion
      category: feature
      bugzilla: NA
      CVE: NA
      ---------------------------
      
      All new 32-bit architectures should have a 64-bit userspace off_t type,
      but existing architectures have 32-bit ones.
      
      To enforce the rule, a new config option is added to arch/Kconfig that
      defaults ARCH_32BIT_OFF_T to disabled for new 32-bit architectures. All
      existing 32-bit architectures enable it explicitly.
      
      The new option affects force_o_largefile() behaviour. Namely, if the
      userspace off_t is 64 bits long, we have no reason to refuse opening big
      files.
      
      Note that even if an architecture has only a 64-bit off_t in the kernel
      (arc, c6x, h8300, hexagon, nios2, openrisc, and unicore32), a libc may
      use a 32-bit off_t, and therefore want to limit the file size to 4GB
      unless specified differently in the open flags.
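
      The behavioural change boils down to one macro; a sketch following the
      shape of the upstream change to include/linux/fcntl.h (previously this
      was keyed off BITS_PER_LONG):

        #ifndef force_o_largefile
        #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
        #endif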
      
      Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      
       Conflicts:
      	arch/x86/um/Kconfig
      [wangxiongfeng: conflicts in arc...
      e6d26bce
  3. Jun 28, 2020
    • x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches. · 023933da
      Anthony Steinhauser authored
      
      [ Upstream commit 4d8df8cb ]
      
      Currently, it is possible to enable indirect branch speculation even
      after it was force-disabled using the PR_SPEC_FORCE_DISABLE option.
      Moreover, the PR_GET_SPECULATION_CTRL command afterwards gives an
      incorrect result (force-disabled when it is in fact enabled). This is
      also inconsistent vs. STIBP and the documentation, which clearly states
      that PR_SPEC_FORCE_DISABLE cannot be undone.
      
      Fix this by actually enforcing force-disabled indirect branch
      speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails
      with -EPERM as described in the documentation.
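
      A minimal userspace demonstration of the fixed semantics (a hypothetical
      test program; the prctl constants come from linux/prctl.h):

        #include <errno.h>
        #include <stdio.h>
        #include <sys/prctl.h>
        #include <linux/prctl.h>

        int main(void)
        {
                /* Force-disable indirect branch speculation ... */
                if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                          PR_SPEC_FORCE_DISABLE, 0, 0))
                        perror("PR_SPEC_FORCE_DISABLE");

                /* ... after which re-enabling must fail with EPERM. */
                errno = 0;
                if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                          PR_SPEC_ENABLE, 0, 0) && errno == EPERM)
                        puts("PR_SPEC_ENABLE correctly rejected with EPERM");

                /* The reported state must still say force-disabled. */
                printf("PR_GET_SPECULATION_CTRL: 0x%x\n",
                       prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                             0, 0, 0));
                return 0;
        }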
      
      Fixes: 9137bb27 ("x86/speculation: Add prctl() control for indirect branch speculation")
      Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      023933da
    • x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS. · ed5caf73
      Anthony Steinhauser authored
      [ Upstream commit 21998a35 ]
      
      When STIBP is unavailable or enhanced IBRS is available, Linux
      force-disables the IBPB mitigation of Spectre-BTB even when simultaneous
      multithreading is disabled. While attempts to enable IBPB using
      prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with
      EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...)
      equivalent), which is used e.g. by Chromium or OpenSSH, succeeds with no
      errors, but the application remains silently vulnerable to cross-process
      Spectre v2 attacks (classical BTB poisoning). At the same time the sysfs
      reporting (/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays
      that IBPB is conditionally enabled when in fact it is unconditionally
      disabled.
      
      STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP
      is unavailable, it makes no sense to also force-disable IBPB, because
      IBPB protects against cross-process Spectre-BTB attacks regardless of
      the SMT state. At the same time, since missing STIBP has only been
      observed on AMD CPUs, and AMD does not recommend using STIBP but
      recommends using IBPB, disabling IBPB because of missing STIBP goes
      directly against AMD's advice:
      https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf
      
      Similarly, enhanced IBRS is designed to protect against cross-core BTB
      poisoning and BTB-poisoning attacks from user space against the kernel
      (and BTB-poisoning attacks from guests against the hypervisor); it is
      not designed to prevent cross-process (or cross-VM) BTB poisoning
      between processes (or VMs) running on the same core. Therefore, even
      with enhanced IBRS it is necessary to flush the BTB during context
      switches, so there is no reason to force-disable IBPB when enhanced IBRS
      is available.
      
      Enable the prctl control of IBPB even when STIBP is unavailable or enhanced
      IBRS is available.
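
      A quick way to observe the reporting mentioned above (a hypothetical
      check that simply reads the sysfs file named in the text):

        #include <stdio.h>

        int main(void)
        {
                char line[256];
                FILE *f = fopen(
                        "/sys/devices/system/cpu/vulnerabilities/spectre_v2",
                        "r");

                if (!f) {
                        perror("spectre_v2");
                        return 1;
                }
                /* With the fix, IBPB shows up as conditional (prctl/seccomp)
                 * even when STIBP is unavailable or enhanced IBRS is used. */
                if (fgets(line, sizeof(line), f))
                        fputs(line, stdout);
                fclose(f);
                return 0;
        }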
      
      Fixes: 7cc765a6 ("x86/speculation: Enable prctl mode for spectre_v2_user")
      Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ed5caf73
    • x86/speculation: Add support for STIBP always-on preferred mode · 49120a73
      Thomas Lendacky authored
      
      [ Upstream commit 20c3a2c3 ]
      
      Different AMD processors may have different implementations of STIBP.
      When STIBP is conditionally enabled, some implementations would benefit
      from having STIBP always on instead of toggling the STIBP bit through MSR
      writes. This preference is advertised through a CPUID feature bit.
      
      When conditional STIBP support is requested at boot and the CPU advertises
      STIBP always-on mode as preferred, switch to STIBP "on" support. To show
      that this transition has occurred, create a new spectre_v2_user_mitigation
      value and a new spectre_v2_user_strings message. The new mitigation value
      is used in spectre_v2_user_select_mitigation() to print the new mitigation
      message as well as to return a new string from stibp_state().
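
      A sketch of the new mode (the enum's shape follows
      arch/x86/kernel/cpu/bugs.c; the *_STRICT_PREFERRED entry is the addition
      described above):

        enum spectre_v2_user_mitigation {
                SPECTRE_V2_USER_NONE,
                SPECTRE_V2_USER_STRICT,
                /* New: STIBP left on unconditionally because the CPU
                 * advertises always-on mode as preferred. */
                SPECTRE_V2_USER_STRICT_PREFERRED,
                SPECTRE_V2_USER_PRCTL,
                SPECTRE_V2_USER_SECCOMP,
        };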
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Link: https://lkml.kernel.org/r/20181213230352.6937.74943.stgit@tlendack-t1.amdoffice.net
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      49120a73
    • x86/speculation: Change misspelled STIPB to STIBP · 1c7eda3d
      Waiman Long authored
      
      [ Upstream commit aa77bfb3 ]
      
      STIBP stands for Single Thread Indirect Branch Predictors. The acronym,
      however, can be easily mis-spelled as STIPB. It is perhaps due to the
      presence of another related term - IBPB (Indirect Branch Predictor
      Barrier).
      
      Fix the mis-spelling in the code.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: KarimAllah Ahmed <karahmed@amazon.de>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/1544039368-9009-1-git-send-email-longman@redhat.com
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1c7eda3d
    • x86/speculation: Prevent rogue cross-process SSBD shutdown · 82cbcd3e
      Anthony Steinhauser authored
      
      commit dbbe2ad0 upstream.
      
      On context switch, the changes of TIF_SSBD and TIF_SPEC_IB are evaluated
      to adjust the mitigations accordingly. This is optimized to avoid the
      expensive MSR write if not needed.
      
      This optimization is buggy and allows an attacker to shutdown the SSBD
      protection of a victim process.
      
      The update logic reads the cached base value for the speculation control
      MSR which has neither the SSBD nor the STIBP bit set. It then OR's the
      SSBD bit only when TIF_SSBD is different and requests the MSR update.
      
      That means if TIF_SSBD of the previous and next task are the same, then
      the base value is not updated, even if TIF_SSBD is set. The MSR write is
      not requested.
      
      Subsequently if the TIF_STIBP bit differs then the STIBP bit is updated
      in the base value and the MSR is written with a wrong SSBD value.
      
      This was introduced when the per task/process conditional STIBP
      switching was added on top of the existing SSBD switching.
      
      It is exploitable if the attacker creates a process which enforces SSBD
      and has the contrary value of STIBP than the victim process (i.e. if the
      victim process enforces STIBP, the attacker process must not enforce it;
      if the victim process does not enforce STIBP, the attacker process must
      enforce it) and schedule it on the same core as the victim process. If
      the victim runs after the attacker the victim becomes vulnerable to
      Spectre V4.
      
      To fix this, update the MSR value independent of the TIF_SSBD difference
      and dependent on the SSBD mitigation method available. This ensures that
      a subsequent STIBP-initiated MSR write has the correct state of SSBD.
      
      [ tglx: Handle X86_FEATURE_VIRT_SSBD & X86_FEATURE_LS_CFG_SSBD correctly
              and massaged changelog ]
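
      A sketch of the fixed update logic, following the shape of
      __speculation_ctrl_update() in arch/x86/kernel/process.c (simplified to
      the SPEC_CTRL-MSR based SSBD case; CPU feature checks omitted, helper
      names from the kernel tree):

        static void __speculation_ctrl_update(unsigned long tifp,
                                              unsigned long tifn)
        {
                unsigned long tif_diff = tifp ^ tifn;
                u64 msr = x86_spec_ctrl_base;
                bool updmsr = false;

                /*
                 * SSBD: the bit is always recomputed from the next task's
                 * TIF state. A TIF_SSBD difference only decides whether an
                 * MSR write is requested, no longer what value is written.
                 */
                updmsr |= !!(tif_diff & _TIF_SSBD);
                msr |= ssbd_tif_to_spec_ctrl(tifn);

                /* STIBP: same pattern for the indirect branch bit. */
                updmsr |= !!(tif_diff & _TIF_SPEC_IB);
                msr |= stibp_tif_to_spec_ctrl(tifn);

                if (updmsr)
                        wrmsrl(MSR_IA32_SPEC_CTRL, msr);
        }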
      
      Fixes: 5bfbe3ad ("x86/speculation: Prepare for per task indirect branch speculation control")
      Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      82cbcd3e