  1. Jan 17, 2018
  2. Jan 16, 2018
  3. Jan 15, 2018
  4. Jan 14, 2018
  5. Jan 06, 2018
  6. Jan 05, 2018
  7. Jan 04, 2018
  8. Jan 03, 2018
  9. Dec 31, 2017
  10. Dec 30, 2017
    • genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI · bc976233
      Thomas Gleixner authored
      
      The new reservation mode for interrupts assigns a dummy vector when the
      interrupt is allocated and assigns a real vector when the interrupt is
      requested. The reservation mode prevents vector pressure when devices with
      a large number of queues/interrupts are initialized, but only a minimal
      subset of those queues/interrupts is actually used.
      
      This mode has an issue with MSI interrupts which cannot be masked. If the
      driver is not careful or the hardware emits an interrupt before the device
      irq is requested by the driver, then the interrupt ends up on the dummy
      vector as a spurious interrupt which can cause malfunction of the device or,
      in the worst case, a lockup of the machine.
      
      Change the logic for the reservation mode so that the early activation of
      MSI interrupts checks whether:
      
       - the device is a PCI/MSI device
       - the reservation mode of the underlying irqdomain is activated
       - PCI/MSI masking is globally enabled
       - the PCI/MSI device uses either MSI-X, which supports masking, or
         MSI with the maskbit supported.
      
      If one of those conditions is false, then clear the reservation mode flag
      in the irq data of the interrupt and invoke irq_domain_activate_irq() with
      the reserve argument cleared. In the x86 vector code, clear the can_reserve
      flag in the vector allocation data so a subsequent free_irq() won't create
      the same situation again. The interrupt stays assigned to a real vector
      until pci_disable_msi() is invoked and all allocations are undone.
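
The check described above can be sketched as a single predicate. This is a minimal illustration only: `struct msi_probe`, its fields, and `msi_can_reserve()` are hypothetical names invented for this sketch and are not the kernel's data structures or functions.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the state the check consults. */
struct msi_probe {
	bool is_pci_msi;      /* the device is a PCI/MSI device */
	bool domain_reserves; /* reservation mode of the irqdomain is activated */
	bool masking_enabled; /* PCI/MSI masking is globally enabled */
	bool is_msix;         /* MSI-X, which supports masking */
	bool has_maskbit;     /* plain MSI with the maskbit supported */
};

/* Reservation mode may stay on only if every listed condition holds. */
bool msi_can_reserve(const struct msi_probe *p)
{
	if (!p->is_pci_msi || !p->domain_reserves || !p->masking_enabled)
		return false;
	return p->is_msix || p->has_maskbit;
}
```

When the predicate is false, the commit's approach is to clear the reservation flag and activate the interrupt on a real vector right away.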
      
      Fixes: 4900be83 ("x86/vector/msi: Switch to global reservation mode")
      Reported-by: Alexandru Chirvasitu <achirvasub@gmail.com>
      Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
      Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>
      Cc: linux-media@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos
      bc976233
    • genirq/irqdomain: Rename early argument of irq_domain_activate_irq() · 702cb0a0
      Thomas Gleixner authored
      
      The 'early' argument of irq_domain_activate_irq() is actually used to
      denote reservation mode. To avoid confusion, rename it before abuse
      happens.
      
      No functional change.
      
      Fixes: 72491643 ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandru Chirvasitu <achirvasub@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loes...
      702cb0a0
    • x86/vector: Use IRQD_CAN_RESERVE flag · 945f50a5
      Thomas Gleixner authored
      
      Set the new CAN_RESERVE flag when the initial reservation for an interrupt
      happens. The flag is used in a subsequent patch to disable reservation mode
      for a certain class of MSI devices.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
      Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>
      Cc: linux-media@vger.kernel.org
      
      945f50a5
  11. Dec 29, 2017
    • x86/apic: Switch all APICs to Fixed delivery mode · a31e58e1
      Thomas Gleixner authored
      Some of the APIC incarnations are operating in lowest priority delivery
      mode. This worked as long as the vector management code allocated the same
      vector on all possible CPUs for each interrupt.
      
      Lowest priority delivery mode does not necessarily respect the affinity
      setting and may redirect to some other online CPU. This was documented
      somewhere in the old code, but the conversion to single target delivery
      failed to update the delivery mode of the affected APIC drivers, which
      results in spurious interrupts on some of the affected CPU/Chipset
      combinations.
      
      Switch the APIC drivers over to Fixed delivery mode and remove all
      leftovers of lowest priority delivery mode.
      
      Switching to Fixed delivery mode is not a problem on these CPUs because the
      kernel already uses Fixed delivery mode for IPIs. The reason for this is
      that the SDM explicitly forbids lowest prio mode for IPIs. The reason is
      obvious: If the irq routing does not honor destination targets in lowest
      p...
      a31e58e1
  12. Dec 28, 2017
    • x86/apic: Avoid wrong warning when parsing 'apic=' in X86-32 case · 4fcab669
      Dou Liyang authored
      
      There are two consumers of apic=:
        apic_set_verbosity() for setting the APIC debug level;
        parse_apic() for registering an APIC driver by hand.
      
      X86-32 supports both of them, but the kernel sometimes issues a spurious
      warning. For example, when the kernel is booted with 'apic=bigsmp' on the
      command line, early_param warns as follows:
      
      ...
      [    0.000000] APIC Verbosity level bigsmp not recognised use apic=verbose or apic=debug
      [    0.000000] Malformed early option 'apic'
      ...
      
      Wrap the warning code in a CONFIG_X86_64 guard to avoid this.
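
A minimal sketch of the idea, assuming a simplified parser: `parse_apic_verbosity()`, the enum, and the `is_64bit` parameter (standing in for the CONFIG_X86_64 compile-time guard) are hypothetical and only model the behaviour described above, not the kernel function itself.

```c
#include <stdbool.h>
#include <string.h>

enum apic_verbosity { APIC_QUIET, APIC_VERBOSE, APIC_DEBUG };

/*
 * Hypothetical stand-in for the verbosity consumer of apic=.
 * Returns 0 on success and -1 when the value is reported as malformed.
 * arg must not be NULL. Only when is_64bit is set is an unknown value an
 * error; on 32-bit the second consumer may still accept it as a driver
 * name, so it is silently ignored here.
 */
int parse_apic_verbosity(const char *arg, bool is_64bit,
			 enum apic_verbosity *out)
{
	if (strcmp(arg, "debug") == 0) {
		*out = APIC_DEBUG;
		return 0;
	}
	if (strcmp(arg, "verbose") == 0) {
		*out = APIC_VERBOSE;
		return 0;
	}
	/* Unknown value: leave *out untouched. */
	return is_64bit ? -1 : 0;
}
```

With this shape, 'apic=bigsmp' on a 32-bit build no longer triggers the "Malformed early option" message, while a 64-bit build still rejects it.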
      
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Cc: rdunlap@infradead.org
      Cc: corbet@lwn.net
      Link: https://lkml.kernel.org/r/20171204040313.24824-1-douly.fnst@cn.fujitsu.com
      4fcab669
    • x86-32: Fix kexec with stack canary (CONFIG_CC_STACKPROTECTOR) · ac461122
      Linus Torvalds authored
      
      Commit e802a51e ("x86/idt: Consolidate IDT invalidation") cleaned up
      and unified the IDT invalidation that existed in a couple of places.  It
      changed no actual real code.
      
      Despite not changing any actual real code, it _did_ change code generation:
      by implementing the common idt_invalidate() function in
      arch/x86/kernel/idt.c, it made the use of the function in
      arch/x86/kernel/machine_kexec_32.c be a real function call rather than an
      (accidental) inlining of the function.
      
      That, in turn, exposed two issues:
      
       - in load_segments(), we had incorrectly reset all the segment
         registers, which then made the stack canary load (which gcc does
         using offset of %gs) cause a trap.  Instead of %gs pointing to the
         stack canary, it will be the normal zero-based kernel segment, and
         the stack canary load will take a page fault at address 0x14.
      
       - to make this even harder to debug, we had invalidated the GDT just
         before calling idt_invalidate(), which meant that the fault happened
         with an invalid GDT, which in turn causes a triple fault and
         immediate reboot.
      
      Fix this by
      
       (a) not reloading the special segments in load_segments(). We currently
           don't do any percpu accesses (which would require %fs on x86-32) in
           this area, but there's no reason to think that we might not want to
           do them, and like %gs, it's pointless to break it.
      
       (b) doing idt_invalidate() before invalidating the GDT, to keep things
           at least _slightly_ more debuggable for a bit longer. Without an
           IDT, traps will not work. Without a GDT, traps also will not work,
           but neither will any segment loads etc. So in a very real sense,
           the GDT is even more core than the IDT.
      
      Fixes: e802a51e ("x86/idt: Consolidate IDT invalidation")
      Reported-and-tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.LFD.2.21.1712271143180.8572@i7.lan
      ac461122
  13. Dec 24, 2017
    • x86/ldt: Make the LDT mapping RO · 9f5cb6b3
      Thomas Gleixner authored
      
      Now that the LDT mapping is in a known area when PAGE_TABLE_ISOLATION is
      enabled, it's a primary target for attacks if a user space interface fails
      to validate a write address correctly. That can never happen, right?
      
      The SDM states:
      
          If the segment descriptors in the GDT or an LDT are placed in ROM, the
          processor can enter an indefinite loop if software or the processor
          attempts to update (write to) the ROM-based segment descriptors. To
          prevent this problem, set the accessed bits for all segment descriptors
          placed in a ROM. Also, remove operating-system or executive code that
          attempts to modify segment descriptors located in ROM.
      
      So it's a valid approach to set the ACCESS bit when setting up the LDT entry
      and to map the table RO. Fix up the selftest so it can handle that new mode.
      
      Remove the manual ACCESS bit setter in set_tls_desc() as this is now
      pointless. Folded the patch from Peter Zijlstra.
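
As an illustration of why presetting the bit matters: the accessed bit is the lowest bit of the type field, i.e. bit 40 of an 8-byte code or data segment descriptor. If it is already set, the processor has no reason to write the descriptor when the segment is loaded, which is what allows a read-only mapping. The helper names below are hypothetical and work on a raw 64-bit value rather than the kernel's descriptor struct.

```c
#include <stdint.h>

/* Accessed bit: lowest bit of the type field, bit 40 of the descriptor. */
#define DESC_ACCESSED (UINT64_C(1) << 40)

/* Return the descriptor with the accessed bit already set. */
uint64_t desc_preset_accessed(uint64_t desc)
{
	return desc | DESC_ACCESSED;
}

/* Non-zero if loading this descriptor would not need a write-back. */
int desc_is_accessed(uint64_t desc)
{
	return (desc & DESC_ACCESSED) != 0;
}
```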
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9f5cb6b3
    • x86/dumpstack: Indicate in Oops whether PTI is configured and enabled · 5f26d76c
      Vlastimil Babka authored
      
      CONFIG_PAGE_TABLE_ISOLATION is a relatively new and intrusive feature that
      may still have some corner cases which could take some time to manifest and
      be fixed. It would be useful to have Oops messages indicate whether it was
      enabled for building the kernel, and whether it was disabled during boot.
      
      Example of fully enabled:
      
      	Oops: 0001 [#1] SMP PTI
      
      Example of enabled during build, but disabled during boot:
      
      	Oops: 0001 [#1] SMP NOPTI
      
      We can decide to remove this after the feature has been tested in the field
      long enough.
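
A small sketch of how such a tag could be derived, assuming two booleans for "built in" and "active after boot". `pti_oops_tag()` and `format_oops_header()` are hypothetical helpers written for this illustration; they are not the kernel's dumpstack code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Empty when PTI is not built in, otherwise " PTI" or " NOPTI". */
const char *pti_oops_tag(bool built_in, bool active)
{
	if (!built_in)
		return "";
	return active ? " PTI" : " NOPTI";
}

/* Format a header line in the style of the examples above. */
int format_oops_header(char *buf, size_t len, unsigned int err, int count,
		       bool built_in, bool active)
{
	return snprintf(buf, len, "Oops: %04x [#%d] SMP%s",
			err, count, pti_oops_tag(built_in, active));
}
```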
      
      [ tglx: Made it use boot_cpu_has() as requested by Borislav ]
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Eduardo Valentin <eduval@amazon.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: bpetkov@suse.de
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: jkosina@suse.cz
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5f26d76c
    • x86/mm: Use/Fix PCID to optimize user/kernel switches · 6fd166aa
      Peter Zijlstra authored
      
      We can use PCID to retain the TLBs across CR3 switches; including those now
      part of the user/kernel switch. This increases performance of kernel
      entry/exit at the cost of more expensive/complicated TLB flushing.
      
      Now that we have two address spaces, one for kernel and one for user space,
      we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID
      (just like we use the PFN LSB for the PGD). Since we do TLB invalidation
      from kernel space, the existing code will only invalidate the kernel PCID.
      We augment that by marking the corresponding user PCID invalid and, upon
      switching back to userspace, using a flushing CR3 write for the switch.
      
      In order to access the user_pcid_flush_mask we use PER_CPU storage, which
      means the previously established SWAPGS vs CR3 ordering is now mandatory
      and required.
      
      Having to do this memory access does require additional registers. Most
      sites have a functioning stack and can spill one (RAX); sites without a
      functional stack need to provide the second scratch register by other means.
      
      Note: PCID is generally available on Intel Sandybridge and later CPUs.
      Note: Up until this point TLB flushing was broken in this series.
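
The bookkeeping described above can be sketched as follows. This is a simplified model under stated assumptions: the PCID field is 12 bits wide, so its top bit is bit 11; bit 63 of the CR3 value requests that cached translations be kept; the ASID is used directly as the kernel PCID and is smaller than 16. The struct and function names are hypothetical, and the real code does this in entry assembly with per-CPU data.

```c
#include <stdint.h>

#define PCID_USER_BIT 11                    /* top bit of the 12-bit PCID */
#define CR3_NOFLUSH   (UINT64_C(1) << 63)   /* keep TLB entries on CR3 write */

/* Hypothetical per-CPU state: one pending-flush bit per ASID slot (< 16). */
struct cpu_tlb_state {
	uint16_t user_pcid_flush_mask;
};

/* User PCID: the kernel PCID (here simply the ASID) with the top bit set. */
uint16_t user_pcid(unsigned int asid)
{
	return (uint16_t)(asid | (1u << PCID_USER_BIT));
}

/* Kernel-side invalidation cannot reach the user PCID, so mark it stale. */
void invalidate_user_asid(struct cpu_tlb_state *s, unsigned int asid)
{
	s->user_pcid_flush_mask |= (uint16_t)(1u << asid);
}

/* Build the CR3 value for the switch back to user space. */
uint64_t user_cr3(struct cpu_tlb_state *s, uint64_t user_pgd_pa,
		  unsigned int asid)
{
	uint64_t cr3 = user_pgd_pa | user_pcid(asid);
	uint16_t bit = (uint16_t)(1u << asid);

	if (s->user_pcid_flush_mask & bit) {
		/* Stale: clear the marker and do a flushing write. */
		s->user_pcid_flush_mask &= (uint16_t)~bit;
		return cr3;
	}
	return cr3 | CR3_NOFLUSH;
}
```

The flush is deferred to the next return to user space, so kernel-side invalidation stays cheap and the cost is paid only when the user address space is actually entered again.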
      
      Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6fd166aa
    • x86/pti: Put the LDT in its own PGD if PTI is on · f55f0501
      Andy Lutomirski authored
      
      With PTI enabled, the LDT must be mapped in the usermode tables somewhere.
      The LDT is per process, i.e. per mm.
      
      An earlier approach mapped the LDT on context switch into a fixmap area,
      but that's a big overhead and exhausted the fixmap space when NR_CPUS got
      big.
      
      Take advantage of the fact that there is an address space hole which
      provides a completely unused pgd. Use this pgd to manage per-mm LDT
      mappings.
      
      This has a down side: the LDT isn't (currently) randomized, and an attack
      that can write the LDT is instant root due to call gates (thanks, AMD, for
      leaving call gates in AMD64 but designing them wrong so they're only useful
      for exploits).  This can be mitigated by making the LDT read-only or
      randomizing the mapping, either of which is straightforward on top of this
      patch.
      
      This will significantly slow down LDT users, but that shouldn't matter for
      important workloads -- the LDT is only used by DOSEMU(2), Wine, and very
      old libc implementations.
      
      [ tglx: Cleaned it up. ]
      
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f55f0501
    • x86/entry: Align entry text section to PMD boundary · 2f7412ba
      Thomas Gleixner authored
      
      The (irq)entry text must be visible in the user space page tables. To allow
      simple PMD based sharing, make the entry text PMD aligned.
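
For illustration, PMD alignment is ordinary power-of-two rounding. The sketch assumes a PMD covers 2 MiB (the usual x86-64 value); the helper names are hypothetical, and the actual change is made in the linker script rather than in C.

```c
#include <stdint.h>

#define PMD_SIZE (UINT64_C(1) << 21) /* assumed: 2 MiB per PMD entry */

/* Round an address up to the next PMD boundary. */
uint64_t pmd_align_up(uint64_t addr)
{
	return (addr + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
}

/* Non-zero if the address sits exactly on a PMD boundary. */
int pmd_aligned(uint64_t addr)
{
	return (addr & (PMD_SIZE - 1)) == 0;
}
```

A section that starts and ends on such boundaries can be shared between page tables by copying whole PMD entries, with no need to split a PMD into individual PTEs.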
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2f7412ba
    • x86/mm/pti: Force entry through trampoline when PTI active · 8d4b0678
      Thomas Gleixner authored
      
      Force the entry through the trampoline only when PTI is active. Otherwise
      go through the normal entry code.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d4b0678
    • x86/mm/pti: Allocate a separate user PGD · d9e9a641
      Dave Hansen authored
      
      Kernel page table isolation requires two PGDs: one for the kernel, which
      contains the full kernel mapping plus the user space mapping, and one for
      user space, which contains the user space mappings and the minimal set of
      kernel mappings which are required by the architecture to be able to
      transition from and to user space.
      
      Add the necessary preliminaries.
      
      [ tglx: Split out from the big kaiser dump. EFI fixup from Kirill ]
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d9e9a641
    • x86/cpufeatures: Add X86_BUG_CPU_INSECURE · a89f040f
      Thomas Gleixner authored
      
      Many x86 CPUs leak information to user space due to missing isolation of
      user space and kernel space page tables. There are many well documented
      ways to exploit that.
      
      The upcoming software mitigation of isolating the user and kernel space
      page tables needs a misfeature flag so code can be made runtime
      conditional.
      
      Add the BUG bit which indicates that the CPU is affected and add a feature
      bit which indicates that the software mitigation is enabled.
      
      Assume for now that _ALL_ x86 CPUs are affected by this. Exceptions can be
      made later.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a89f040f
  14. Dec 23, 2017
    • init: Invoke init_espfix_bsp() from mm_init() · 613e396b
      Thomas Gleixner authored
      
      init_espfix_bsp() needs to be invoked before the page table isolation
      initialization. Move it into mm_init() which is the place where pti_init()
      will be added.
      
      While at it, get rid of the #ifdeffery and provide proper stub functions.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      613e396b
    • x86/cpu_entry_area: Move it out of the fixmap · 92a0f81d
      Thomas Gleixner authored
      
      Put the cpu_entry_area into a separate P4D entry. The fixmap gets too big
      and 0-day already hit a case where the fixmap PTEs were cleared by
      cleanup_highmap().
      
      Aside from that, the fixmap API is a pain as it's all backwards.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      92a0f81d
    • x86/cpu_entry_area: Move it to a separate unit · ed1bbc40
      Thomas Gleixner authored
      
      Separate the cpu_entry_area code out of cpu/common.c and the fixmap.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ed1bbc40
    • x86/microcode: Dont abuse the TLB-flush interface · 23cb7d46
      Peter Zijlstra authored
      
      Commit:
      
        ec400dde ("x86/microcode_intel_early.c: Early update ucode on Intel's CPU")
      
      ... grubbed into tlbflush internals without coherent explanation.
      
      Since it says it's a precaution and the SDM doesn't mention anything like
      this, take it out back.
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: fenghua.yu@intel.com
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      23cb7d46
    • x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack · 4fe2d8b1
      Dave Hansen authored
      
      If the kernel oopses while on the trampoline stack, it will print
      "<SYSENTER>" even if SYSENTER is not involved.  That is rather confusing.
      
      The "SYSENTER" stack is used for a lot more than SYSENTER now.  Give it a
      better string to display in stack dumps, and rename the kernel code to
      match.
      
      Also move the 32-bit code over to the new naming even though it still uses
      the entry stack only for SYSENTER.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4fe2d8b1
    • x86/ldt: Prevent LDT inheritance on exec · a4828f81
      Thomas Gleixner authored
      
      The LDT is inherited across fork() or exec(), but that makes no sense
      at all because exec() is supposed to start the process clean.
      
      The reason why this happens is that init_new_context_ldt() is called from
      init_new_context() which obviously needs to be called for both fork() and
      exec().
      
      It would be surprising if anything relies on that behaviour, so it seems to
      be safe to remove that misfeature.
      
      Split the context initialization into two parts. Clear the LDT pointer and
      initialize the mutex from the general context init and move the LDT
      duplication to arch_dup_mmap() which is only called on fork().
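
The split described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct ldt_model`, `struct mm_model`, `model_init_new_context()` and `model_dup_mmap()` are hypothetical stand-ins invented for illustration, not the kernel's types or functions, and the mutex initialization mentioned in the text is left out to keep the example short.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel's LDT and mm context. */
struct ldt_model {
    size_t nr_entries;
    unsigned long long *entries;   /* one 8-byte descriptor per entry */
};

struct mm_model {
    struct ldt_model *ldt;
};

/* Common path, run for both fork() and exec(): the new context has no LDT. */
static void model_init_new_context(struct mm_model *mm)
{
    mm->ldt = NULL;
}

/*
 * fork()-only path: give the child a private copy of the parent's LDT.
 * Returns 0 on success and -1 if an allocation fails.
 */
static int model_dup_mmap(const struct mm_model *oldmm, struct mm_model *mm)
{
    const struct ldt_model *old = oldmm->ldt;
    struct ldt_model *copy;

    /* No LDT, or an empty one: nothing to copy, the child keeps NULL. */
    if (!old || old->nr_entries == 0)
        return 0;

    copy = malloc(sizeof(*copy));
    if (!copy)
        return -1;

    copy->nr_entries = old->nr_entries;
    copy->entries = malloc(old->nr_entries * sizeof(*old->entries));
    if (!copy->entries) {
        free(copy);
        return -1;
    }
    memcpy(copy->entries, old->entries,
           old->nr_entries * sizeof(*old->entries));

    mm->ldt = copy;
    return 0;
}
```

In this model an exec()-like path calls only `model_init_new_context()` and therefore ends up without an LDT, while a fork()-like path calls both functions and ends up with its own copy.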
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: dan.j.williams@intel.com
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/ldt: Rework locking · c2b3496b
      Peter Zijlstra authored
      
      The LDT is duplicated on fork() and on exec(), which is wrong as exec()
      should start from a clean state, i.e. without LDT. To fix this the LDT
      duplication code will be moved into arch_dup_mmap() which is only called
      for fork().
      
      This introduces a locking problem. arch_dup_mmap() holds mmap_sem of the
      parent process, but the LDT duplication code needs to acquire
      mm->context.lock to access the LDT data safely, which is the reverse lock
      order of write_ldt() where mmap_sem nests into context.lock.
      
      Solve this by introducing a new rw semaphore which serializes the
      read/write_ldt() syscall operations, and by using context.lock to protect
      the actual installation of the LDT descriptor.
      
      So context.lock stabilizes mm->context.ldt and can nest inside of the new
      semaphore or mmap_sem.
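
The nesting rule stated above can be made concrete with a toy lock-order checker. This is a minimal sketch, not the kernel's lockdep and not real locking: the enum names, `struct held_locks`, `model_acquire()` and `model_release()` are hypothetical, and the model assigns both outer semaphores the same level (so it refuses to hold both at once), which is a simplifying assumption the text does not make.

```c
#include <stdbool.h>

/* Hypothetical identifiers for the three locks discussed in the text. */
enum lock_id {
    LOCK_NONE = 0,
    LOCK_MMAP_SEM,     /* mmap_sem of the process */
    LOCK_LDT_SEM,      /* the new rw semaphore for read/write_ldt() */
    LOCK_CONTEXT       /* mm->context.lock */
};

#define MODEL_MAX_HELD 4

/* Locks currently held by one execution path, innermost last. */
struct held_locks {
    enum lock_id stack[MODEL_MAX_HELD];
    int depth;
};

/* Outer semaphores are level 1, context.lock is level 2 (innermost). */
static int lock_level(enum lock_id id)
{
    switch (id) {
    case LOCK_MMAP_SEM:
    case LOCK_LDT_SEM:
        return 1;
    case LOCK_CONTEXT:
        return 2;
    default:
        return 0;
    }
}

/* Accept an acquisition only if it nests strictly inside what is held. */
static bool model_acquire(struct held_locks *h, enum lock_id id)
{
    int cur = h->depth ? lock_level(h->stack[h->depth - 1]) : 0;

    if (h->depth >= MODEL_MAX_HELD || lock_level(id) <= cur)
        return false;

    h->stack[h->depth++] = id;
    return true;
}

/* Drop the innermost held lock, if any. */
static void model_release(struct held_locks *h)
{
    if (h->depth > 0)
        h->depth--;
}
```

With this model, taking the new semaphore and then context.lock (the write_ldt() pattern) is accepted, as is taking mmap_sem and then context.lock (the fork() pattern), while taking context.lock first and mmap_sem second is rejected.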
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: dan.j.williams@intel.com
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. Dec 19, 2017
    • x86/stacktrace: Make zombie stack traces reliable · 6454b3bd
      Josh Poimboeuf authored
      
      Commit:
      
        1959a601 ("x86/dumpstack: Pin the target stack when dumping it")
      
      changed the behavior of stack traces for zombies.  Before that commit,
      /proc/<pid>/stack reported the last execution path of the zombie before
      it died:
      
        [<ffffffff8105b877>] do_exit+0x6f7/0xa80
        [<ffffffff8105bc79>] do_group_exit+0x39/0xa0
        [<ffffffff8105bcf0>] __wake_up_parent+0x0/0x30
        [<ffffffff8152dd09>] system_call_fastpath+0x16/0x1b
        [<00007fd128f9c4f9>] 0x7fd128f9c4f9
        [<ffffffffffffffff>] 0xffffffffffffffff
      
      After the commit, it just reports an empty stack trace.
      
      The new behavior is arguably more correct.  If the stack
      refcount has gone down to zero, then the task has already gone through
      do_exit() and isn't going to run anymore.  The stack could be freed at
      any time and is basically gone, so reporting an empty stack makes sense.
      
      However, save_stack_trace_tsk_reliable() treats such a missing stack
      condition as an error.  That can cause livepatch transition stalls if
      there are any unreaped zombies.  Instead, just treat it as a reliable,
      empty stack.
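
The change in behavior can be sketched with a small userspace model. Everything here is a hypothetical stand-in for illustration: `struct task_model`, `struct trace_model` and `model_save_trace_reliable()` are not the kernel's types or its save_stack_trace_tsk_reliable() implementation, and the "frames" are plain numbers rather than unwound return addresses.

```c
#include <stddef.h>

#define MODEL_MAX_FRAMES 8

/* Hypothetical task: refcount 0 means the stack is already gone. */
struct task_model {
    int stack_refcount;
    size_t nr_frames;
    unsigned long frames[MODEL_MAX_FRAMES];
};

struct trace_model {
    size_t nr_entries;
    unsigned long entries[MODEL_MAX_FRAMES];
};

/* Returns 0 if the trace stored in *trace is reliable, -1 otherwise. */
static int model_save_trace_reliable(const struct task_model *task,
                                     struct trace_model *trace)
{
    trace->nr_entries = 0;

    /*
     * Missing stack: the task has exited and will never run again,
     * so report a reliable, empty trace instead of an error.
     */
    if (task->stack_refcount == 0)
        return 0;

    /* Malformed input in this model: the trace cannot be vouched for. */
    if (task->nr_frames > MODEL_MAX_FRAMES)
        return -1;

    for (size_t i = 0; i < task->nr_frames; i++)
        trace->entries[trace->nr_entries++] = task->frames[i];

    return 0;
}
```

A caller that waits for every task to report a reliable trace, as the livepatch transition in the text does, is then no longer held up by an unreaped zombie: the zombie yields success with zero entries.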
      
      Reported-and-tested-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Fixes: af085d90 ("stacktrace/x86: add function for detecting reliable stack traces")
      Link: http://lkml.kernel.org/r/e4b09e630e99d0c1080528f0821fc9d9dbaeea82.1513631620.git.jpoimboe@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>