  1. Jul 06, 2015
    • x86/earlyprintk: Allow early_printk() to use console style parameters like '115200n8' · 827a82ff
      Steven Rostedt authored
      
      When I enable early_printk on a kernel, I cut and paste the
      console= input and add it to the earlyprintk parameter. But I
      noticed recently that ktest has not been detecting triple
      faults. The way it detects them is by seeing the kernel banner
      "Linux version .." with a different kernel version pop up. Then
      I noticed that early printk was no longer working on my console,
      which was why ktest was not seeing it.
      
      I bisected it down and it was added to 4.0 with this commit:
      
        ea9e9d80 ("Specify PCI based UART for earlyprintk")
      
      because it converted the simple_strtoul() call that parses the
      baud number into a kstrtoul() call. The problem with this is
      that I had, as my baud rate, 115200n8 (acceptable for
      console=ttyS0), but because of the "n8", kstrtoul() doesn't
      parse the baud rate and returns an error, which sets the baud
      rate to the default 9600. This explains the garbage on my
      screen.
      
      Now, the earlyprintk= kernel parameter documentation does not
      say it accepts that format. Thus, one answer would simply be for
      me to change my kernel parameters and remove the "n8", since it
      isn't parsed anyway. But I wonder if other people run into this,
      and it seems strange that the two serial consoles accept
      different input.
      
      I could also extend this to have earlyprintk do something with
      that "n8" or whatever it has and have it match the console
      parsing (which, BTW, still uses simple_strtoul(), as I guess it
      has to).
      
      This patch just makes my old kernel parameter parsing work like
      it used to.
      
      Although simple_strtoull() is considered obsolete, it is the
      only standard string parsing function that parses a number that
      is attached to text. Ironically, commit ea9e9d80 also added
      several calls to simple_strtoul()!
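      To see the difference, here is a stand-alone sketch (userspace
      strtoul() standing in for the kernel's simple_strtoul(); this is
      illustrative, not the patch itself):

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
                const char *arg = "115200n8";
                char *end;

                /* simple_strtoul()-style: parse digits, stop at the 'n' */
                unsigned long baud = strtoul(arg, &end, 10);
                printf("parsed %lu, trailing \"%s\"\n", baud, end);

                /* kstrtoul()-style: any trailing characters are an error,
                 * so "115200n8" is rejected and the caller falls back to
                 * the default 9600. */
                if (*end != '\0')
                        printf("kstrtoul would return -EINVAL\n");

                return 0;
        }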
      
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Cohen <david.a.cohen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stuart R. Anderson <stuart.r.anderson@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150706101434.5f6a351b@gandalf.local.home
      
      
      [ Cleaned it up a bit. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/espfix: Init espfix on the boot CPU side · 20d5e4a9
      Zhu Guihua authored
      
      We allocate pages with GFP_KERNEL in init_espfix_ap(), which is
      called before we enable local irqs, so the lockdep sub-system
      would (correctly) trigger a warning about the potentially
      blocking API.
      
      So we allocate them on the boot CPU side when the secondary CPU is
      brought up by the boot CPU, and hand them over to the secondary
      CPU.
      
      And we use alloc_pages_node() with the secondary CPU's node, to
      make sure the espfix stack is NUMA-local to the CPU that is
      going to use it.
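      A minimal sketch of the allocation step (espfix_alloc_page_for()
      is a made-up name for illustration; cpu_to_node() and
      alloc_pages_node() are the real APIs):

        /*
         * Runs on the boot CPU before 'cpu' is online, so GFP_KERNEL
         * is safe here, and the page lands on the secondary CPU's
         * own NUMA node.
         */
        static void *espfix_alloc_page_for(int cpu)
        {
                int node = cpu_to_node(cpu);
                struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);

                return page ? page_address(page) : NULL;
        }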
      
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Cc: <bp@alien8.de>
      Cc: <luto@amacapital.net>
      Cc: <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/c97add2670e9abebb90095369f0cfc172373ac94.1435824469.git.zhugh.fnst@cn.fujitsu.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/espfix: Add 'cpu' parameter to init_espfix_ap() · 1db87563
      Zhu Guihua authored
      
      Add a CPU index parameter to init_espfix_ap(), so that the
      parameter can be propagated to the function that does the
      espfix page allocation.
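      The resulting interface change, roughly:

        /* before: implicitly operated on the calling CPU */
        void init_espfix_ap(void);

        /* after: the boot CPU can set up espfix for a remote CPU */
        void init_espfix_ap(int cpu);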
      
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Cc: <bp@alien8.de>
      Cc: <luto@amacapital.net>
      Cc: <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/cde3fcf1b3211f3f03feb1a995bce3fee850f0fc.1435824469.git.zhugh.fnst@cn.fujitsu.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/kasan: Fix KASAN shadow region page tables · 5d5aa3cf
      Alexander Popov authored
      
      Currently the KASAN shadow region page tables are created
      without taking the physical offset (phys_base) into account.
      This causes a kernel halt when phys_base is not zero.
      
      So let's initialize the KASAN shadow region page tables in
      kasan_early_init() using __pa_nodebug(), which takes phys_base
      into account.
      
      This patch also separates x86_64_start_kernel() from KASAN low
      level details by moving kasan_map_early_shadow(init_level4_pgt)
      into kasan_early_init().
      
      Also remove the comment before clear_bss(), which no longer adds
      much to code readability; describing all the new ordering
      dependencies there would be too verbose.
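      The core of the fix, sketched (exact variable and flag names may
      differ from arch/x86/mm/kasan_init_64.c):

        /* fill the early shadow page table with entries that point at
         * the zero page's *runtime* physical address: */
        for (i = 0; i < PTRS_PER_PTE; i++)
                kasan_zero_pte[i] = __pte(__pa_nodebug(kasan_zero_page) |
                                          __PAGE_KERNEL);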
      
      Signed-off-by: Alexander Popov <alpopov@ptsecurity.com>
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: <stable@vger.kernel.org> # 4.0+
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1435828178-10975-3-git-send-email-a.ryabinin@samsung.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/init: Clear 'init_level4_pgt' earlier · d0f77d4d
      Andrey Ryabinin authored
      
      Currently x86_64_start_kernel() has two KASAN related
      function calls. The first call maps shadow to early_level4_pgt,
      the second maps shadow to init_level4_pgt.
      
      If we move clear_page(init_level4_pgt) earlier, we could hide
      KASAN low level detail from generic x86_64 initialization code.
      The next patch will do it.
      
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: <stable@vger.kernel.org> # 4.0+
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1435828178-10975-2-git-send-email-a.ryabinin@samsung.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/tsc: Let high latency PIT fail fast in quick_pit_calibrate() · 5aac644a
      Adrian Hunter authored
      
      If it takes longer than 12us to read the PIT counter lsb/msb,
      then the error margin will never fall below 500ppm within 50ms,
      and Fast TSC calibration will always fail.
      
      This patch detects when that will happen and fails fast. Note
      the failure message is not printed in that case because:
      1. it will always happen on that class of hardware
      2. the absence of the message is more informative than its
      presence
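      The arithmetic behind the threshold, as a comment (my reading of
      the patch, not text from it):

        /*
         * target:        error below 500 ppm within 50 ms
         * allowed error: 50 ms * 500e-6 = 25 us
         * two PIT lsb/msb reads bracket each interval, so each read
         * must finish in under ~12.5 us -- hence the 12 us cutoff.
         */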
      
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: http://lkml.kernel.org/r/556EB717.9070607@intel.com
      
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. Jul 04, 2015
    • x86/fpu: Fix boot crash in the early FPU code · b96fecbf
      Ingo Molnar authored
      
      Jan Kara and Thomas Gleixner reported boot crashes in the FPU
      code:
      
        general protection fault: 0000 [#1] SMP
        RIP: 0010:[<ffffffff81048a6c>]  [<ffffffff81048a6c>] mxcsr_feature_mask_init+0x1c/0x40
      
        2b:*  0f ae 85 00 fe ff ff    fxsave -0x200(%rbp)
      
      and bisected it down to the following FPU commit:
      
         91a8c2a5 ("x86/fpu: Clean up and fix MXCSR handling")
      
      The reason is that the on-stack FPU registers state variable,
      used by the FXSAVE instruction, did not have the required
      minimum alignment of 16 bytes, causing the general protection
      fault.
      
      This is most likely a GCC bug in older GCC versions, but the
      offending commit also added a bogus extra 32-byte alignment
      (which GCC ignored too).
      
      So fix this bug by making the variable static again, but also
      mark it __initdata this time, because fpu__init_system_mxcsr()
      is now an __init function.
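      The shape of the fix (a one-line sketch; see
      fpu__init_system_mxcsr() for context):

        /* static __initdata gives FXSAVE a reliably 16-byte-aligned
         * buffer, which the on-stack variable did not get with older
         * GCC versions: */
        static struct fxregs_state fxregs __initdata;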
      
      Reported-and-bisected-by: Jan Kara <jack@suse.cz>
      Reported-bisected-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150704075819.GA9201@gmail.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. Jun 30, 2015
    • perf/x86: Fix 'active_events' imbalance · 93472aff
      Peter Zijlstra authored
      
      Commit 1b7b938f ("perf/x86/intel: Fix PMI handling for Intel PT") conditionally
      increments active_events in x86_add_exclusive() but unconditionally decrements in
      x86_del_exclusive().
      
      These extra decrements can lead to the situation where
      active_events is zero and thus the PMI handler is 'disabled'
      while we have active events on the PMU generating PMIs.
      
      This leads to a truckload of:
      
        Uhhuh. NMI received for unknown reason 21 on CPU 28.
        Do you have a strange power saving mode enabled?
        Dazed and confused, but trying to continue
      
      messages and generally messes up perf.
      
      Remove the condition on the increment; a double increment
      balanced by a double decrement is perfectly fine.
      
      Restructure the code a little bit to make the unconditional inc
      a bit more natural.
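      The restored invariant, in sketch form (simplified, hypothetical
      function bodies):

        static int x86_add_exclusive(unsigned int what)
        {
                atomic_inc(&active_events);     /* unconditional */
                /* ... exclusivity checks ... */
                return 0;
        }

        static void x86_del_exclusive(unsigned int what)
        {
                atomic_dec(&active_events);     /* always pairs with the inc */
        }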
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alexander.shishkin@linux.intel.com
      Cc: brgerst@gmail.com
      Cc: dvlasenk@redhat.com
      Cc: luto@amacapital.net
      Cc: oleg@redhat.com
      Fixes: 1b7b938f ("perf/x86/intel: Fix PMI handling for Intel PT")
      Link: http://lkml.kernel.org/r/20150624144750.GJ18673@twins.programming.kicks-ass.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fpu: Fix FPU related boot regression when CPUID masking BIOS feature is enabled · db52ef74
      Ingo Molnar authored
      
      Mike Galbraith reported:
      
        " My i7-4790 box is having one hell of a time with this merge
          window, dead in the water.
      
          BIOS setting "Limit CPUID Maximum" upsets new fpu code
          mightily. "
      
      It turns out that Linux does a double workaround here, as per:
      
        066941bd ("x86: unmask CPUID levels on Intel CPUs")
      
      it undoes the BIOS workaround - but as a side effect the CPUID
      state is not completely constant during early init anymore,
      and the new FPU init code did not take this into account.
      
      So what happened is that the xstate init code did not have full
      CPUID available, which broke subsequent attempts to use xstate
      features.
      
      Fix this by ordering the early FPU init code to after we've
      stabilized the CPUID state.
      
      Reported-bisected-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150627082514.GA10894@gmail.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. Jun 26, 2015
    • libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani authored
      
      The ACPI NFIT table has System Physical Address Range Structure
      entries that describe the proximity ID of each range when
      ACPI_NFIT_PROXIMITY_VALID is set in the flags.
      
      Change acpi_nfit_register_region() to map a proximity ID to its
      node ID, and set it in the new numa_node field of
      nd_region_desc, which is then conveyed to the nd_region device.
      
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
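      The mapping step, sketched with the standard ACPI helper (not
      the literal diff):

        if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
                ndr_desc->numa_node = acpi_map_pxm_to_node(spa->proximity_domain);
        else
                ndr_desc->numa_node = NUMA_NO_NODE;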
      
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  5. Jun 25, 2015
    • libnvdimm, pmem: add libnvdimm support to the pmem driver · 9f53f9fa
      Dan Williams authored
      
      nd_pmem attaches to persistent memory regions and namespaces emitted by
      the libnvdimm subsystem, and, same as the original pmem driver, presents
      the system-physical-address range as a block device.
      
      The existing e820-type-12 to pmem setup is converted to an nvdimm_bus
      that emits an nd_namespace_io device.
      
      Note that the X in 'pmemX' is now derived from the parent
      region.  This provides some stability to the pmem device names
      from boot to boot.  The minor numbers are also made more
      predictable by passing 0 to alloc_disk().
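      For reference, passing 0 minors is what opts the driver into the
      extended-devt scheme (dynamic minor assignment) -- roughly:

        disk = alloc_disk(0);   /* 0 minors -> extended devt, dynamic minors */
        if (!disk)
                return -ENOMEM;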
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • x86, mirror: x86 enabling - find mirrored memory ranges · b05b9f5f
      Tony Luck authored

      UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
      address ranges.  See UEFI 2.5 spec pages 157-158:
      
        http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf
      
      
      
      On EFI-enabled systems, scan the memory map and tell memblock
      about any mirrored ranges.
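      In sketch form, close to what efi_find_mirror() does (the
      iteration macro's signature varies by kernel version):

        u64 mirror_size = 0, total_size = 0;
        efi_memory_desc_t *md;

        for_each_efi_memory_desc(md) {
                u64 start = md->phys_addr;
                u64 size = md->num_pages << EFI_PAGE_SHIFT;

                total_size += size;
                if (md->attribute & EFI_MEMORY_MORE_RELIABLE) {
                        memblock_mark_mirror(start, size);
                        mirror_size += size;
                }
        }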
      
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Tony Luck authored

      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during
      kernel code execution, except for some very specific cases
      where we are unlikely to ever be able to recover.
      
      Enter memory mirroring. This is actually the 3rd generation of
      memory mirroring.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory behind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for saf...
  6. Jun 19, 2015
    • x86/boot: Fix overflow warning with 32-bit binutils · 04c17341
      Borislav Petkov authored
      
      When building the kernel with 32-bit binutils built with support
      only for the i386 target, we get the following warning:
      
        arch/x86/kernel/head_32.S:66: Warning: shift count out of range (32 is not between 0 and 31)
      
      The problem is that in that case, binutils' internal type
      representation is 32-bit wide and the shift range overflows.
      
      In order to fix this, manipulate the shift expression which
      creates the 4GiB constant to not overflow the shift count.
      
      Suggested-by: Michael Matz <matz@suse.de>
      Reported-and-tested-by: Enrico Mioso <mrkiko.rs@gmail.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Honor the architectural performance monitoring version · 2c33645d
      Palik, Imre authored
      
      Architectural performance monitoring, version 1, doesn't support fixed counters.
      
      Currently, even if a hypervisor advertises support for architectural
      performance monitoring version 1, perf may still try to use the fixed
      counters, as the constraints are set up based on the CPU model.
      
      This patch ensures that perf honors the architectural performance monitoring
      version returned by CPUID, and it only uses the fixed counters for version 2
      and above.
      
      (Some of the ideas in this patch came from Peter Zijlstra.)
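      The check boils down to reading the version ID from CPUID leaf
      0xA and gating the fixed counters on it -- a sketch, not the
      exact diff:

        unsigned int eax, ebx, ecx, edx;
        int version;

        cpuid(0xa, &eax, &ebx, &ecx, &edx);
        version = eax & 0xff;           /* CPUID.0AH:EAX[7:0] */

        if (version > 1) {
                /* fixed-function counters are architectural from v2 on */
                x86_pmu.num_counters_fixed = edx & 0x1f;
        }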
      
      Signed-off-by: Imre Palik <imrep@amazon.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Anthony Liguori <aliguori@amazon.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433767609-1039-1-git-send-email-imrep.amz@gmail.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel: Fix PMI handling for Intel PT · 1b7b938f
      Alexander Shishkin authored
      
      Intel PT is a separate PMU and it is not using any of the x86_pmu
      code paths, which means in particular that the active_events counter
      remains intact when new PT events are created.
      
      However, PT uses the generic x86_pmu PMI handler for its PMI handling needs.
      
      The problem here is that the latter checks active_events and, in
      case it is zero, exits without calling the actual
      x86_pmu.handle_nmi(), which results in unknown NMI errors and
      massive data loss for PT.
      
      The effect is not visible if there are other perf events in the system
      at the same time that keep active_events counter non-zero, for instance
      if the NMI watchdog is running, so one needs to disable it to reproduce
      the problem.
      
      At the same time, the active_events counter besides doing what the name
      suggests also implicitly serves as a PMC hardware and DS area reference
      counter.
      
      This patch adds a separate reference counter for the PMC hardware, leaving
      active_events for actually counting the events and makes sure it also
      counts PT and BTS events.
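      Conceptually (simplified; upstream wraps the new counter behind
      small reserve/release helpers):

        static atomic_t pmc_refcount;   /* PMC hardware + DS area users  */
        static atomic_t active_events;  /* events the PMI handler serves */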
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Link: http://lkml.kernel.org/r/87k2v92t0s.fsf@ashishki-desk.ger.corp.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86/intel/bts: Fix DS area sharing with x86_pmu events · 6b099d9b
      Alexander Shishkin authored
      
      Currently, the intel_bts driver relies on the DS area allocated by the x86_pmu
      code in its event_init() path, which is a bug: creating a BTS event while
      no x86_pmu events are present results in a NULL pointer dereference.
      
      The same DS area is also used by PEBS sampling, which makes it quite a bit
      trickier to have a separate one for intel_bts' purposes.
      
      This patch makes intel_bts driver use the same DS allocation and reference
      counting code as x86_pmu to make sure it is always present when either
      intel_bts or x86_pmu need it.
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Link: http://lkml.kernel.org/r/1434024837-9916-2-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Add more Broadwell model numbers · 4b36f1a4
      Andi Kleen authored
      
      This patch adds additional Broadwell model numbers to perf:
      support for Broadwell with Iris Pro (Intel Core i7-57xxC)
      and for the Broadwell Server Xeon.
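      The model numbers in question, annotated (the comments are my
      addition, not quoted from the patch):

        case 61: /* Broadwell Core-M (already supported) */
        case 71: /* Broadwell with Iris Pro, e.g. Core i7-57xxC */
        case 79: /* Broadwell Server Xeon */
                /* shared Broadwell event table setup */
                break;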
      
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1434055942-28253-1-git-send-email-andi@firstfloor.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. Jun 17, 2015
    • x86: replace __init_or_module with __init in non-modular vsmp_64.c · 70c4f78b
      Paul Gortmaker authored
      
      The __init_or_module is from commit 05e12e1c
      ("x86: fix 27-rc crash on vsmp due to paravirt during module load").
      
      But as of commit 70511134
      ("Revert "x86: don't compile vsmp_64 for 32bit") this file became
      obj-y and hence is now only for built-in.  That makes any
      "_or_module" support no longer necessary.
      
      We need to distinguish between the two in order to do some header
      reorganization between init.h and module.h and we don't want to
      be including module.h in non-modular code.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling · 5b00c1eb
      Paul Gortmaker authored
      
      This was using module_init, but the current Kconfig situation is
      as follows:
      
      In arch/x86/kernel/cpu/Makefile:
      
        obj-$(CONFIG_CPU_SUP_INTEL)    += perf_event_intel_pt.o perf_event_intel_bts.o
      
      and in arch/x86/Kconfig.cpu:
      
        config CPU_SUP_INTEL
              default y
              bool "Support Intel processors" if PROCESSOR_SELECT
      
      So currently, the end user can not build this code into a module.
      If in the future, there is desire for this to be modular, then
      it can be changed to include <linux/module.h> and use module_init.
      
      But currently, in the non-modular case, a module_init becomes a
      device_initcall.  But this really isn't a device, so we should
      choose a more appropriate initcall bucket to put it in.
      
      The obvious choice here seems to be arch_initcall, but that does
      make it run earlier than it currently does via device_initcall.
      As long as perf_pmu_register() is functional, we should be OK.
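      The mechanical change is a one-liner (pt_init() being the
      driver's init routine):

        -module_init(pt_init);
        +arch_initcall(pt_init);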
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling · ca41d24c
      Paul Gortmaker authored
      
      This was using module_init, but the current Kconfig situation is
      as follows:
      
      In arch/x86/kernel/cpu/Makefile:
      
        obj-$(CONFIG_CPU_SUP_INTEL)    += perf_event_intel_pt.o perf_event_intel_bts.o
      
      and in arch/x86/Kconfig.cpu:
      
        config CPU_SUP_INTEL
              default y
              bool "Support Intel processors" if PROCESSOR_SELECT
      
      So currently, the end user can not build this code into a module.
      If in the future, there is desire for this to be modular, then
      it can be changed to include <linux/module.h> and use module_init.
      
      But currently, in the non-modular case, a module_init becomes a
      device_initcall.  But this really isn't a device, so we should
      choose a more appropriate initcall bucket to put it in.
      
      The obvious choice here seems to be arch_initcall, but that does
      make it run earlier than it currently does via device_initcall.
      As long as perf_pmu_register() is functional, we should be OK.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • x86: don't use module_init for non-modular core bootflag code · 1206f535
      Paul Gortmaker authored
      
      The bootflag.o is obj-y (always built in).  It will never be
      modular, so using module_init as an alias for __initcall is
      somewhat misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of arch_initcall (which
      makes sense for arch code) will thus change this registration
      from level 6-device to level 3-arch (i.e. slightly earlier).
      However no observable impact of that small difference has
      been observed during testing, or is expected.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
    • x86: don't use module_init in non-modular devicetree.c code · d54b675a
      Paul Gortmaker authored
      
      The devicetree.o is built for "OF" -- which is bool, and hence
      this code is either present or absent.  It will never be modular,
      so using module_init as an alias for __initcall can be somewhat
      misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of device_initcall
      directly in this change means that the runtime impact is
      zero -- it will remain at level 6 in initcall ordering.
      
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  8. Jun 12, 2015
    • x86/fpu: Fix double-increment in setup_xstate_features() · a8424003
      Dave Hansen authored
      
      I noticed that my MPX tracepoints were producing garbage for the
      lower and upper bounds:
      
      	mpx_bounds_register_exception: address referenced: 0x00007fffffffccb7 bounds: lower: 0x0 ~upper: 0xffffffffffffffff
      	mpx_bounds_register_exception: address referenced: 0x00007fffffffccbf bounds: lower: 0x0 ~upper: 0xffffffffffffffff
      
      This is, of course, bogus because 0x00007fffffffccbf is *within*
      the bounds.  I assumed that my instruction decoder was bad and
      went looking at it.  But I eventually realized that I was
      getting a '0' offset back from xstate_offsets[BNDREGS].
      
      It was being skipped in the initialization, which is obviously
      bogus, so remove the extra leaf++.
      
      This patch also initializes xstate_offsets/sizes[] to -1 so
      that bugs like this will oops instead of silently failing
      in interesting ways.
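      The bug, in sketch form (the loop shape follows
      setup_xstate_features(); the stray increment is the point):

        do {
                cpuid_count(XSTATE_CPUID, leaf, &eax, &ebx, &ecx, &edx);
                xstate_offsets[leaf] = ebx;
                xstate_sizes[leaf] = eax;
                leaf++;
                /* bug: a second 'leaf++' here skipped every other leaf,
                 * leaving xstate_offsets[BNDREGS] at 0 */
        } while (leaf < xfeatures_nr);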
      
      This was introduced by:
      
      	39f1acd2 ("x86/fpu/xstate: Don't assume the first zero xfeatures zero bit means the end")
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@sr71.net
      Link: http://lkml.kernel.org/r/20150611193400.2E0B00DB@viggo.jf.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. Jun 11, 2015
    • x86/crash: Allocate enough low memory when crashkernel=high · 94fb9334
      Joerg Roedel authored
      
      When the crash kernel is loaded above 4GiB in memory, the
      first kernel allocates only 72MiB of low-memory for the DMA
      requirements of the second kernel. On systems with many
      devices this is not enough and causes device driver
      initialization errors and failed crash dumps. Testing by
      SUSE and Red Hat has shown that 256MiB is a good default
      value for now, and the discussion has led to this value as
      well. So set this default value to 256MiB to make sure there
      is enough memory available for DMA.
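      The default computation, roughly (swiotlb_size_or_default() is
      the real helper; treat the expression as a sketch):

        /*
         * swiotlb needs its slice of low memory plus some slack;
         * 256MiB has proven to be a workable floor in testing.
         */
        low_size = max(swiotlb_size_or_default() + (8UL << 20),
                       256UL << 20);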
      
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      [ Reflow comment. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Baoquan He <bhe@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jörg Rödel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: kexec@lists.infradead.org
      Link: http://lkml.kernel.org/r/1433500202-25531-4-git-send-email-joro@8bytes.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/swiotlb: Try coherent allocations with __GFP_NOWARN · 186dfc9d
      Joerg Roedel authored
      
      When we boot a kdump kernel in high memory, there is by
      default only 72MB of low memory available. The swiotlb code
      takes 64MB of it (by default) so that there are only 8MB
      left to allocate from. On systems with many devices this
      causes page allocator warnings from
      dma_generic_alloc_coherent():
      
        systemd-udevd: page allocation failure: order:0, mode:0x280d4
        CPU: 0 PID: 197 Comm: systemd-udevd Tainted: G        W    3.12.28-4-default #1
        Hardware name: HP ProLiant DL980 G7, BIOS P66 07/30/2012
         ffff8800781335e0 ffffffff8150b1db 00000000000280d4 ffffffff8113af90
         0000000000000000 0000000000000000 ffff88007efdbb00 0000000100000000
         0000000000000000 0000000000000000 0000000000000000 0000000000000001
        Call Trace:
          dump_trace+0x7d/0x2d0
          show_stack_log_lvl+0x94/0x170
          show_stack+0x21/0x50
          dump_stack+0x41/0x51
          warn_alloc_failed+0xf0/0x160
          __alloc_pages_slowpath+0x72f/0x796
          __alloc_pages_nodemask+0x1ea/0x210
          dma_generic_alloc_coherent+0x96/0x140
          x86_swiotlb_alloc_coherent+0x1c/0x50
          ttm_dma_pool_alloc_new_pages+0xab/0x320 [ttm]
          ttm_dma_populate+0x3ce/0x640 [ttm]
          ttm_tt_bind+0x36/0x60 [ttm]
          ttm_bo_handle_move_mem+0x55f/0x5c0 [ttm]
          ttm_bo_move_buffer+0x105/0x130 [ttm]
          ttm_bo_validate+0xc1/0x130 [ttm]
          ttm_bo_init+0x24b/0x400 [ttm]
          radeon_bo_create+0x16c/0x200 [radeon]
          radeon_ring_init+0x11e/0x2b0 [radeon]
          r100_cp_init+0x123/0x5b0 [radeon]
          r100_startup+0x194/0x230 [radeon]
          r100_init+0x223/0x410 [radeon]
          radeon_device_init+0x6af/0x830 [radeon]
          radeon_driver_load_kms+0x89/0x180 [radeon]
          drm_get_pci_dev+0x121/0x2f0 [drm]
          local_pci_probe+0x39/0x60
          pci_device_probe+0xa9/0x120
          driver_probe_device+0x9d/0x3d0
          __driver_attach+0x8b/0x90
          bus_for_each_dev+0x5b/0x90
          bus_add_driver+0x1f8/0x2c0
          driver_register+0x5b/0xe0
          do_one_initcall+0xf2/0x1a0
          load_module+0x1207/0x1c70
          SYSC_finit_module+0x75/0xa0
          system_call_fastpath+0x16/0x1b
          0x7fac533d2788
      
      After these warnings the code enters a fall-back path and
      allocates directly from the swiotlb aperture in the end.
      So remove these warnings, as this is not a fatal error.
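      The fix is, in essence, one line in
      x86_swiotlb_alloc_coherent():

        /*
         * Don't print a warning when the first allocation attempt
         * fails -- we will fall back to the swiotlb aperture anyway.
         */
        flags |= __GFP_NOWARN;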
      
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      [ Simplify, reflow comment. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Baoquan He <bhe@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jörg Rödel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: kexec@lists.infradead.org
      Link: http://lkml.kernel.org/r/1433500202-25531-3-git-send-email-joro@8bytes.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. Jun 09, 2015
    • x86: Make is_64bit_mm() widely available · b0e9b09b
      Dave Hansen authored
      
      The uprobes code has a nice helper, is_64bit_mm(), that consults
      both the runtime and compile-time flags for 32-bit support.
      Instead of reinventing the wheel, pull it into an x86 header so
      we can use it for MPX.
      
      I prefer passing the 'mm' around over using
      test_thread_flag(TIF_IA32), because it makes it explicit where
      the context is coming from.
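      For reference, the helper as it existed in the uprobes code of
      that era, approximately:

        static inline bool is_64bit_mm(struct mm_struct *mm)
        {
                return !config_enabled(CONFIG_IA32_EMULATION) ||
                       !(mm->context.ia32_compat == TIF_IA32);
        }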
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150607183704.F0209999@viggo.jf.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mpx: Trace #BR exceptions · e7126cf5
      Dave Hansen authored
      
      This is the first in a series of MPX tracing patches.
      I've found these extremely useful in the process of
      debugging applications and the kernel code itself.
      
      This patch hooks into the bounds (#BR) exception very early and
      allows capturing the key registers that influence how the
      exception is handled.
      
      Note that bndcfgu/bndstatus are technically still
      64-bit registers even in 32-bit mode.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150607183703.5FE2619A@viggo.jf.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mpx: Introduce a boot-time disable flag · 8c3641e9
      Dave Hansen authored
      
      MPX has the _potential_ to cause some issues.  Say part of your
      init system tried to protect one of its components from buffer
      overflows with MPX.  If there were a false positive, it's
      possible that MPX could keep a system from booting.
      
      MPX could also potentially cause performance issues since it is
      present in hot paths like the unmap path.
      
      Allow it to be disabled at boot time.
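      The flag is wired up with the usual __setup() pattern -- a close
      sketch of the patch:

        static int __init x86_mpx_setup(char *s)
        {
                /* require an exact match without trailing characters */
                if (strlen(s))
                        return 0;

                setup_clear_cpu_cap(X86_FEATURE_MPX);
                pr_info("nompx: Intel Memory Protection Extensions disabled\n");

                return 1;
        }
        __setup("nompx", x86_mpx_setup);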
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150607183702.2E8B77AB@viggo.jf.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mpx: Clean up the code by not passing a task pointer around when unnecessary · 46a6e0cf
      Dave Hansen authored
      
      The MPX code can only work on the current task.  You can not,
      for instance, enable MPX management in another process or
      thread. You can also not handle a fault for another process or
      thread.
      
      Despite this, we pass a task_struct around prolifically.  This
      patch removes all of the task struct passing for code paths
      where the code can not deal with another task (which turns out
      to be all of them).
      
      This has no functional changes.  It's just a cleanup.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: bp@alien8.de
      Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mpx: Use the new get_xsave_field_ptr() API · a84eeaa9
      Dave Hansen authored
      
      The MPX registers (bndcsr/bndcfgu/bndstatus) are not directly
      accessible via normal instructions.  They essentially act as
      if they were floating point registers and are saved/restored
      along with those registers.
      
      There are two main paths in the MPX code where we care about
      the contents of these registers:
      
      	1. #BR (bounds) faults
      	2. the prctl() code where we are setting MPX up
      
      Both of those paths _might_ be called without the FPU having
      been used.  That means that 'tsk->thread.fpu.state' might
      never be allocated.
      
      Also, fpu_save_init() is not preempt-safe.  It was a bug to
      call it without disabling preemption.  The new
      get_xsave_addr() calls unlazy_fpu() instead and properly
      disables preemption.
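      A typical call site after the change looks roughly like this
      (the MPX code reads bndcsr this way):

        const struct bndcsr *bndcsr;

        bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
        if (!bndcsr)
                return -EINVAL;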
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suresh Siddha <sbsiddha@gmail.com>
      Cc: bp@alien8.de
      Link: http://lkml.kernel.org/r/20150607183701.BC0D37CF@viggo.jf.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fpu/xstate: Wrap get_xsave_addr() to make it safer · 04cd027b
      Dave Hansen authored
      
      The MPX code is calling a low-level FPU function
      (copy_fpregs_to_fpstate()).  This function cannot be called in
      all contexts, although it is safe to call directly in some
      cases.
      
      Although probably correct, the current code is ugly and
      potentially error-prone.  So, add a wrapper that calls
      the (slightly) higher-level fpu__save() (which is preempt-
      safe) and also ensures that we even *have* an FPU context
      (in the case that this was called when in lazy FPU mode).
      
      Ingo had this to say about the details about when we need
      preemption disabled:
      
      > it's indeed generally unsafe to access/copy FPU registers with preemption enabled,
      > for two reasons:
      >
      >   - on older systems that use FSAVE the instruction destroys FPU register
      >     contents, which has to be handled carefully
      >
      >   - even on newer systems if we copy to FPU registers (which this code doesn't)
      >     then we don't want a context switch to occur in the middle of it, because a
      >     context switch will write to the fpstate, potentially overwriting our new data
      >     with old FPU state.
      >
      > But it's safe to access FPU registers with preemption enabled in a couple of
      > special cases:
      >
      >   - potentially destructively saving FPU registers: the signal handling code does
      >     this in copy_fpstate_to_sigframe(), because it can rely on the signal restore
      >     side to restore the original FPU state.
      >
      >   - reading FPU registers on modern systems: we don't do this anywhere at the
      >     moment, mostly to keep symmetry with older systems where FSAVE is
      >     destructive.
      >
      >   - initializing FPU registers on modern systems: fpu__clear() does this. Here
      >     it's safe because we don't copy from the fpstate.
      >
      >   - directly writing FPU registers from user-space memory (!). We do this in
      >     fpu__restore_sig(), and it's safe because neither context switches nor
      >     irq-handler FPU use can corrupt the source context of the copy (which is
      >     user-space memory).
      >
      > Note that the MPX code's current use of copy_fpregs_to_fpstate() was safe I think,
      > because:
      >
      >  - MPX is predicated on eagerfpu, so the destructive F[N]SAVE instruction won't be
      >    used.
      >
      >  - the code was only reading FPU registers, and was doing it only in places that
      >    guaranteed that an FPU state was already active (i.e. didn't do it in
      >    kthreads)
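      The wrapper itself is short -- approximately (modulo exact field
      names in that kernel version):

        const void *get_xsave_field_ptr(int xsave_state)
        {
                struct fpu *fpu = &current->thread.fpu;

                if (!fpu->fpstate_active)
                        return NULL;
                /*
                 * fpu__save() is preempt-safe and flushes the live
                 * register state into the in-memory buffer first:
                 */
                fpu__save(fpu);

                return get_xsave_addr(&fpu->state.xsave, xsave_state);
        }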
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suresh Siddha <sbsiddha@gmail.com>
      Cc: bp@alien8.de
      Link: http://lkml.kernel.org/r/20150607183700.AA881696@viggo.jf.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/fpu/xstate: Fix up bad get_xsave_addr() assumptions · 0c4109be
      Dave Hansen authored
      
      get_xsave_addr() assumes that if an xsave bit is present in the
      hardware (pcntxt_mask), then it is present in a given xsave
      buffer.  Due to a bug in the xsave code on all of the systems
      that have MPX (and thus all the users of this code), that has
      been a true assumption.
      
      But, the bug is getting fixed, so our assumption is not going
      to hold any more.
      
      It's quite possible (and normal) for an enabled state to be
      present on 'pcntxt_mask', but *not* in 'xstate_bv'.  We need
      to consult 'xstate_bv'.
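      The added check amounts to the following (identifier names are
      placeholders; the field naming changed across kernel versions):

        /* the feature must actually be present in *this* buffer: */
        if (!(xsave_header->xstate_bv & xstate_feature_mask))
                return NULL;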
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150607183700.1E739B34@viggo.jf.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>