  1. Oct 27, 2022
    • x86/apic/vector: Fix ordering in vector assignment · ed49ea9e
      Thomas Gleixner authored
      mainline inclusion
      from v5.10
      commit 190113b4
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5Y4BR
      CVE: NA
      
      -------------------------------------------
      
      Prarit reported that depending on the affinity setting the
      
       ' irq $N: Affinity broken due to vector space exhaustion.'
      
      message is showing up in dmesg, but the vector space on the CPUs in the
      affinity mask is definitely not exhausted.
      
      Shung-Hsi provided traces and analysis which pinpoints the problem:
      
      The ordering of trying to assign an interrupt vector in
      assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
      valid node assigned. It does:
      
       1) Try the intersection of affinity mask and node mask
       2) Try the node mask
       3) Try the full affinity mask
       4) Try the full online mask
      
      Obviously #2 and #3 are in the wrong order as the requested affinity
      mask has to take precedence.
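
      The corrected ordering can be sketched as a small user-space model. This
      is not the kernel implementation: the bitmask type, try_assign() and
      assign_vector_any() are hypothetical names, and "a CPU has a free vector"
      is reduced to a single bit per CPU.

```c
#include <stdint.h>

/* Hypothetical model: one bit per CPU (up to 32 CPUs). */
typedef uint32_t cpu_bits_t;

/* Return the lowest CPU in 'mask' that still has a free vector, or -1. */
static int try_assign(cpu_bits_t mask, cpu_bits_t cpus_with_free_vectors)
{
    cpu_bits_t usable = mask & cpus_with_free_vectors;

    for (int cpu = 0; cpu < 32; cpu++) {
        if (usable & (1u << cpu))
            return cpu;
    }
    return -1;
}

/*
 * Corrected fallback order: the requested affinity mask is tried
 * before the node mask.
 */
static int assign_vector_any(cpu_bits_t affinity, cpu_bits_t node,
                             cpu_bits_t online, cpu_bits_t free_cpus)
{
    int cpu;

    cpu = try_assign(affinity & node, free_cpus); /* 1) affinity AND node */
    if (cpu >= 0)
        return cpu;
    cpu = try_assign(affinity, free_cpus);        /* 2) full affinity mask */
    if (cpu >= 0)
        return cpu;
    cpu = try_assign(node, free_cpus);            /* 3) node mask */
    if (cpu >= 0)
        return cpu;
    return try_assign(online, free_cpus);         /* 4) full online mask */
}
```

      With affinity = CPU2 only and node = CPU0-1, the model picks CPU2 as long
      as CPU2 has a free vector; with steps 2 and 3 swapped it would have
      picked CPU0 and broken the requested affinity.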
      
      In the observed cases #1 failed because the affinity m...
  2. Jul 08, 2022
    • x86: Fix return value of __setup handlers · 2467bc50
      Randy Dunlap authored
      stable inclusion
      from stable-4.19.247
      commit 61967ac7ba2814fefd033eb3979058057a18edc1
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5FNPY
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 12441ccdf5e2f5a01a46e344976cbbd3d46845c9 ]
      
      __setup() handlers should return 1 to obsolete_checksetup() in
      init/main.c to indicate that the boot option has been handled. A return
      of 0 causes the boot option/value to be listed as an Unknown kernel
      parameter and added to init's (limited) argument (no '=') or environment
      (with '=') strings. So return 1 from these x86 __setup handlers.
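
      A simplified user-space model of the return-value convention follows. The
      table layout, check_setup() and the "exampleopt" option are assumptions
      made for illustration; only the rule "return 1 means handled" comes from
      the text above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one __setup() table entry. */
struct setup_entry {
    const char *str;
    int (*fn)(const char *arg);
};

static int apicpmtimer_enabled;

/* Correct handler: returns 1 to report the option as handled. */
static int setup_apicpmtimer(const char *arg)
{
    (void)arg;
    apicpmtimer_enabled = 1;
    return 1;
}

/* Buggy handler: does its work but returns 0, so the option looks unknown. */
static int setup_exampleopt(const char *arg)
{
    (void)arg;
    return 0;
}

static const struct setup_entry setup_table[] = {
    { "apicpmtimer", setup_apicpmtimer },
    { "exampleopt",  setup_exampleopt },
};

/* Return 1 if a handler claimed the option, 0 if it would be passed to init. */
static int check_setup(const char *line)
{
    for (size_t i = 0; i < sizeof(setup_table) / sizeof(setup_table[0]); i++) {
        size_t n = strlen(setup_table[i].str);

        if (strncmp(line, setup_table[i].str, n) == 0 &&
            setup_table[i].fn(line + n))
            return 1;
    }
    return 0;
}
```

      In this model check_setup("exampleopt") returns 0 even though the handler
      ran, which corresponds to the "Unknown kernel command line parameters"
      symptom described above.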
      
      Examples:
      
        Unknown kernel command line parameters "apicpmtimer
          BOOT_IMAGE=/boot/bzImage-517rc8 vdso=1 ring3mwait=disable", will be
          passed to user space.
      
        Run /sbin/init as init process
         with arguments:
           /sbin/init
           apicpmtimer
         with environment:
           HOME=/
           TERM=linux
           BOOT_IMAGE=/boot/bzImage-517rc8
           vdso=1
           ring3mwait=disabl...
  3. Oct 25, 2021
    • x86/ioapic: Force affinity setup before startup · 5c126be8
      Thomas Gleixner authored
      
      stable inclusion
      from linux-4.19.205
      commit 697658a61db4f3aa213d76336ccf30e66e6c44ca
      
      --------------------------------
      
      commit 0c0e37dc11671384e53ba6ede53a4d91162a2cc5 upstream.
      
      The IO/APIC cannot handle interrupt affinity changes safely after startup
      other than from an interrupt handler. The startup sequence in the generic
      interrupt code violates that assumption.
      
      Mark the irq chip with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
      the default interrupt setting happens before the interrupt is started up
      for the first time.
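
      The effect of the flag on the startup sequence can be illustrated with a
      small model. The struct, the flag constant and the event log are
      hypothetical; the sketch only shows the ordering the flag is meant to
      select, not the genirq code itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for IRQCHIP_AFFINITY_PRE_STARTUP. */
#define CHIP_AFFINITY_PRE_STARTUP 0x1u

struct irq_model {
    unsigned int chip_flags;
    char log[8]; /* order of operations: 'A' = set affinity, 'S' = startup */
};

/* Append one event character to the log, keeping it NUL-terminated. */
static void record(struct irq_model *irq, char ev)
{
    size_t n = strlen(irq->log);

    if (n + 1 < sizeof(irq->log)) {
        irq->log[n] = ev;
        irq->log[n + 1] = '\0';
    }
}

/* Chips with the flag get their default affinity applied before startup. */
static void startup_irq(struct irq_model *irq)
{
    if (irq->chip_flags & CHIP_AFFINITY_PRE_STARTUP)
        record(irq, 'A');
    record(irq, 'S');
    if (!(irq->chip_flags & CHIP_AFFINITY_PRE_STARTUP))
        record(irq, 'A');
}
```

      A chip with the flag set logs "AS", a chip without it logs "SA", which is
      the sequence the IO/APIC cannot handle safely.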
      
      Fixes: 18404756 ("genirq: Expose default irq affinity mask (take 3)")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210729222542.832143400@linutronix.de
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/msi: Force affinity setup before startup · 297ac420
      Thomas Gleixner authored
      stable inclusion
      from linux-4.19.205
      commit 354b210062b1e50ef284f97590011c2231316eaa
      
      --------------------------------
      
      commit ff363f480e5997051dd1de949121ffda3b753741 upstream.
      
      The X86 MSI mechanism cannot handle interrupt affinity changes safely after
      startup other than from an interrupt handler, unless interrupt remapping is
      enabled. The startup sequence in the generic interrupt code violates that
      assumption.
      
      Mark the irq chips with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
      the default interrupt setting happens before the interrupt is started up
      for the first time.
      
      While the interrupt remapping MSI chip does not require this, there is no
      point in treating it differently as this might spare an interrupt to a CPU
      which is not in the default affinity mask.
      
      For the non-remapping case go to the direct write path when the interrupt
      is not yet started similar to the not yet activated case.
      
      Fixes: 18404756 ("genirq: Expose default irq affinity mask ...
  4. Jul 19, 2021
    • x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing · a9c92fd4
      Thomas Gleixner authored
      
      stable inclusion
      from linux-4.19.194
      commit 7e25cb1b22f81239ae3332e14a1d0cff7014bccd
      
      --------------------------------
      
      commit 7d65f9e80646c595e8c853640a9d0768a33e204c upstream.
      
      PIC interrupts do not support affinity setting and they can end up on
      any online CPU. Therefore, it's required to mark the associated vectors
      as system-wide reserved. Otherwise, the corresponding irq descriptors
      are copied to the secondary CPUs but the vectors are not marked as
      assigned or reserved. This works correctly for the IO/APIC case.
      
      When the IO/APIC is disabled via config, kernel command line or lack of
      enumeration then all legacy interrupts are routed through the PIC, but
      nothing marks them as system-wide reserved vectors.
      
      As a consequence, a subsequent allocation on a secondary CPU can result in
      allocating one of these vectors, which triggers the BUG() in
      apic_update_vector() because the interrupt descriptor slot is not empty.
      
      Imran tried to work around that by marking those interrupts as allocated
      when a CPU comes online. But that's wrong in case that the IO/APIC is
      available and one of the legacy interrupts, e.g. IRQ0, has been switched to
      PIC mode because then marking them as allocated will fail as they are
      already marked as system vectors.
      
      Stay consistent and update the legacy vectors after attempting IO/APIC
      initialization and mark them as system vectors in case that no IO/APIC is
      available.
      
      Fixes: 69cde000 ("x86/vector: Use matrix allocator for vector assignment")
      Reported-by: Imran Khan <imran.f.khan@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  5. Jun 30, 2021
  6. May 26, 2021
    • x86/apic/vector: Force interupt handler invocation to irq context · 3e134563
      Thomas Gleixner authored
      mainline inclusion
      from mainline-5.7
      commit 008f1d60
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      
      Sathyanarayanan reported that the PCI-E AER error injection mechanism
      can result in a NULL pointer dereference in apic_ack_edge():
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
       RIP: 0010:apic_ack_edge+0x1e/0x40
       Call Trace:
         handle_edge_irq+0x7d/0x1e0
         generic_handle_irq+0x27/0x30
         aer_inject_write+0x53a/0x720
      
      It crashes in irq_complete_move() which dereferences get_irq_regs() which
      is obviously NULL when this is called from non interrupt context.
      
      Of course the pointer could be checked, but that just papers over the real
      issue. Invoking the low level interrupt handling mechanism from random code
      can wreck the fragile interrupt affinity mechanism of x86 as interrupts
      can only be moved in interrupt context or with special care wh...
  7. Apr 07, 2021
  8. Oct 29, 2020
  9. Sep 22, 2020
    • genirq/affinity: Make affinity setting if activated opt-in · 7cf94405
      Thomas Gleixner authored
      
      stable inclusion
      from linux-4.19.141
      commit 5c4d9eefd314e763dcb2a499797176c17ad6ab69
      
      --------------------------------
      
      commit f0c7baca upstream.
      
      John reported that on a RK3288 system the perf per CPU interrupts are all
      affine to CPU0 and provided the analysis:
      
       "It looks like what happens is that because the interrupts are not per-CPU
        in the hardware, armpmu_request_irq() calls irq_force_affinity() while
        the interrupt is deactivated and then request_irq() with IRQF_PERCPU |
        IRQF_NOBALANCING.
      
        Now when irq_startup() runs with IRQ_STARTUP_NORMAL, it calls
        irq_setup_affinity() which returns early because IRQF_PERCPU and
        IRQF_NOBALANCING are set, leaving the interrupt on its original CPU."
      
      This was broken by the recent commit which blocked interrupt affinity
      setting in hardware before activation of the interrupt. While this works in
      general, it does not work for this particular case. Because, contrary to
      the initial analysis, not all interrupt chip drivers implement an activate
      callback, the safe cure is to make the deferred interrupt affinity setting
      at activation time opt-in.
      
      Implement the necessary core logic and make the two irqchip implementations
      for which this is required opt-in. In hindsight this would have been the
      right thing to do, but ...
      
      Fixes: baedb87d ("genirq/affinity: Handle affinity setting on inactive interrupts correctly")
      Reported-by: John Keeping <john@metanate.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/87blk4tzgm.fsf@nanos.tec.linutronix.de
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • genirq/affinity: Handle affinity setting on inactive interrupts correctly · c998ccb2
      Thomas Gleixner authored
      
      stable inclusion
      from linux-4.19.134
      commit 2048e4375c552614d26a7191394d8a8398fe7a85
      
      --------------------------------
      
      commit baedb87d upstream.
      
      Setting interrupt affinity on inactive interrupts is inconsistent when
      hierarchical irq domains are enabled. The core code should just store the
      affinity and not call into the irq chip driver for inactive interrupts
      because the chip drivers may not be in a state to handle such requests.
      
      X86 has a hacky workaround for that, but the other irq chips do not, which
      causes problems e.g. on GIC V3 ITS.
      
      Instead of adding more ugly hacks all over the place, solve the problem in
      the core code. If the affinity is set on an inactive interrupt then:
      
          - Store it in the irq descriptor's affinity mask
          - Update the effective affinity to reflect that so user space has
            a consistent view
          - Don't call into the irq chip driver
      
      This is the core equivalent of the X86 workaround and works correctly
      because the affinity setting is established in the irq chip when the
      interrupt is activated later on.
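
      The three rules above can be sketched as a user-space model. The struct
      and function names are hypothetical, and the chip driver is reduced to a
      counter plus the last mask it was programmed with.

```c
#include <stdbool.h>
#include <stdint.h>

struct irq_model {
    bool activated;
    uint32_t affinity;           /* requested affinity mask */
    uint32_t effective_affinity; /* what user space gets to see */
    uint32_t hw_affinity;        /* what the chip was programmed with */
    int chip_calls;              /* number of calls into the chip driver */
};

/* Stand-in for the irq chip's set_affinity callback. */
static void chip_set_affinity(struct irq_model *irq, uint32_t mask)
{
    irq->hw_affinity = mask;
    irq->chip_calls++;
}

static void set_affinity(struct irq_model *irq, uint32_t mask)
{
    /* Always store the request and keep the user-visible view consistent. */
    irq->affinity = mask;
    irq->effective_affinity = mask;

    /* Inactive interrupt: do not call into the chip driver. */
    if (!irq->activated)
        return;

    chip_set_affinity(irq, mask);
}

/* On activation the stored affinity is established in the chip. */
static void activate(struct irq_model *irq)
{
    irq->activated = true;
    chip_set_affinity(irq, irq->affinity);
}
```

      Setting the affinity before activate() leaves chip_calls at 0 while
      effective_affinity already shows the new mask; the chip is programmed
      exactly once when the interrupt is activated.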
      
      Note, that this is only effective when hierarchical irq domains are enabled
      by the architecture. Doing it unconditionally would break legacy irq chip
      implementations.
      
      For hierarchical irq domains this works correctly as none of the drivers can
      have a dependency on affinity setting in inactive state by design.
      
      Remove the X86 workaround as it is no longer required.
      
      Fixes: 02edee15 ("x86/apic/vector: Ignore set_affinity call for inactive interrupts")
      Reported-by: Ali Saidi <alisaidi@amazon.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Ali Saidi <alisaidi@amazon.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20200529015501.15771-1-alisaidi@amazon.com
      Link: https://lkml.kernel.org/r/877dv2rv25.fsf@nanos.tec.linutronix.de
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/apic/msi: Plug non-maskable MSI affinity race · 7ff6d61f
      Thomas Gleixner authored
      stable inclusion
      from linux-4.19.103
      commit 032a2bf9787acdaef31369045ff0cb0b301eee61
      
      --------------------------------
      
      commit 6f1a4891 upstream.
      
      Evan tracked down a subtle race between the update of the MSI message and
      the device raising an interrupt internally on PCI devices which do not
      support MSI masking. The update of the MSI message is non-atomic and
      consists of either 2 or 3 sequential 32bit wide writes to the PCI config
      space.
      
         - Write address low 32bits
         - Write address high 32bits (If supported by device)
         - Write data
      
      When an interrupt is migrated then both address and data might change, so
      the kernel attempts to mask the MSI interrupt first. But for MSI masking is
      optional, so there exist devices which do not provide it. That means that
      if the device raises an interrupt internally between the writes then an MSI
      message is sent built from half updated state.
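
      The half-updated state can be made concrete with a small model of the
      three sequential writes. The struct layout, partial_update() and the
      register values used with it are illustrative assumptions, not taken
      from the kernel or from a real device.

```c
#include <stdint.h>

/* The three 32-bit pieces of an MSI message, in the order they are written. */
struct msi_msg {
    uint32_t address_lo;
    uint32_t address_hi;
    uint32_t data;
};

/*
 * Return what the device holds after only the first 'steps' of the three
 * sequential config space writes have completed.
 */
static struct msi_msg partial_update(struct msi_msg cur, struct msi_msg next,
                                     int steps)
{
    if (steps >= 1)
        cur.address_lo = next.address_lo;
    if (steps >= 2)
        cur.address_hi = next.address_hi;
    if (steps >= 3)
        cur.data = next.data;
    return cur;
}
```

      If the device raises an interrupt after step 1 or 2, the message it sends
      combines the new address with the old data, which is the torn state the
      paragraph above describes.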
      
      On x86 this can lead to spurious interrupts on...
  10. Mar 05, 2020
    • x86/ioapic: Prevent inconsistent state when moving an interrupt · 4b9b7c5b
      Thomas Gleixner authored
      
      [ Upstream commit df439342 ]
      
      There is an issue with threaded interrupts which are marked ONESHOT
      and using the fasteoi handler:
      
        if (IS_ONESHOT())
          mask_irq();
        ....
        cond_unmask_eoi_irq()
          chip->irq_eoi();
            if (setaffinity_pending) {
               mask_ioapic();
               ...
      	 move_affinity();
      	 unmask_ioapic();
            }
      
      So if setaffinity is pending the interrupt will be moved and then
      unconditionally unmasked at the ioapic level, which is wrong in two
      aspects:
      
       1) It should be kept masked up to the point where the threaded handler
          finished.
      
       2) The physical chip state and the software masked state are inconsistent
      
      Guard both the mask and the unmask with a check for the software masked
      state. If the line is marked masked then the ioapic line is also masked, so
      both mask_ioapic() and unmask_ioapic() can be skipped safely.
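
      A minimal model of the guarded sequence is shown below. The struct and
      eoi_with_pending_move() are hypothetical names; the comments map each
      step to the pseudocode above.

```c
#include <stdbool.h>

struct ioapic_line {
    bool sw_masked;    /* software masked state kept by the core code */
    bool hw_masked;    /* physical mask state of the ioapic line */
    bool move_pending; /* setaffinity_pending in the pseudocode above */
    int moves;         /* number of affinity moves performed */
};

/* EOI path with both mask and unmask guarded by the software masked state. */
static void eoi_with_pending_move(struct ioapic_line *line)
{
    bool was_masked = line->sw_masked;

    if (!line->move_pending)
        return;

    if (!was_masked)
        line->hw_masked = true;  /* mask_ioapic() */

    line->moves++;               /* move_affinity() */
    line->move_pending = false;

    if (!was_masked)
        line->hw_masked = false; /* unmask_ioapic() */
}
```

      For a ONESHOT interrupt that is already masked, the line stays masked
      after the move, so software and hardware state remain consistent until
      the threaded handler has finished.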
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Fixes: 3aa551c9 ("genirq: add threaded interrupt handler support")
      Link: https://lkml.kernel.org/r/20191017101938.321393687@linutronix.de
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  11. Dec 27, 2019
    • x86/apic/32: Avoid bogus LDR warnings · bc86e142
      Jan Beulich authored and 谢秀奇 committed
      
      commit fe6f85ca upstream.
      
      The removal of the LDR initialization in the bigsmp_32 APIC code unearthed
      a problem in setup_local_APIC().
      
      The code checks unconditionally for a mismatch of the logical APIC id by
      comparing the early APIC id which was initialized in get_smp_config() with
      the actual LDR value in the APIC.
      
      Due to the removal of the bogus LDR initialization the check now can
      trigger on bigsmp_32 APIC systems emitting a warning for every booting
      CPU. This is of course a false positive because the APIC is not using
      logical destination mode.
      
      Restrict the check and the possibly resulting fixup to systems which are
      actually using the APIC in logical destination mode.
      
      [ tglx: Massaged changelog and added Cc stable ]
      
      Fixes: bae3a8d3 ("x86/apic: Do not initialize LDR and DFR for bigsmp")
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/666d8f91-b5a8-1afd-7add-821e72a35f03@suse.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu · e0779527
      Sean Christopherson authored and 谢秀奇 committed
      commit 7a22e03b upstream.
      
      Check that the per-cpu cluster mask pointer has been set prior to
      clearing a dying cpu's bit.  The per-cpu pointer is not set until the
      target cpu reaches smp_callin() during CPUHP_BRINGUP_CPU, whereas the
      teardown function, x2apic_dead_cpu(), is associated with the earlier
      CPUHP_X2APIC_PREPARE.  If an error occurs before the cpu is awakened,
      e.g. if do_boot_cpu() itself fails, x2apic_dead_cpu() will dereference
      the NULL pointer and cause a panic.
      
        smpboot: do_boot_cpu failed(-22) to wakeup CPU#1
        BUG: kernel NULL pointer dereference, address: 0000000000000008
        RIP: 0010:x2apic_dead_cpu+0x1a/0x30
        Call Trace:
         cpuhp_invoke_callback+0x9a/0x580
         _cpu_up+0x10d/0x140
         do_cpu_up+0x69/0xb0
         smp_init+0x63/0xa9
         kernel_init_freeable+0xd7/0x229
         ? rest_init+0xa0/0xa0
         kernel_init+0xa/0x100
         ret_from_fork+0x35/0x40
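
      The guard described above can be sketched as follows. The per-cpu
      pointer is modelled as a plain array; MAX_CPUS, struct cluster_mask and
      dead_cpu() are hypothetical stand-ins rather than the kernel's
      definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CPUS 8

struct cluster_mask {
    uint32_t mask; /* one bit per CPU in the cluster */
};

/* Model of the per-cpu pointer: stays NULL until the CPU has come up. */
static struct cluster_mask *cluster_masks[MAX_CPUS];

/* Teardown callback: only clear the bit if the pointer has been set. */
static int dead_cpu(unsigned int cpu)
{
    struct cluster_mask *cmsk;

    if (cpu >= MAX_CPUS)
        return -1;

    cmsk = cluster_masks[cpu];
    if (cmsk)
        cmsk->mask &= ~(1u << cpu);
    return 0;
}
```

      Calling dead_cpu() for a CPU whose bringup failed before the pointer was
      set is then a no-op instead of a NULL pointer dereference.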
      
      Fixes: 023a6117 ("x86/apic/x2apic: Simplify clust...
    • x86/apic/vector: Warn when vector space exhaustion breaks affinity · b6e4f231
      Neil Horman authored and 谢秀奇 committed
      
      [ Upstream commit 743dac49 ]
      
      On x86, CPUs are limited in the number of interrupts they can have affined
      to them as they only support 256 interrupt vectors per CPU. 32 vectors are
      reserved for the CPU and the kernel reserves another 22 for internal
      purposes. That leaves 202 vectors for assignment to devices.
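
      The arithmetic can be written out as below. The constant and function
      names are made up for this example; the three numbers are the ones
      quoted in the paragraph above.

```c
/* Numbers quoted in the changelog; the names are illustrative only. */
enum {
    TOTAL_VECTORS_PER_CPU   = 256,
    CPU_RESERVED_VECTORS    = 32,
    KERNEL_RESERVED_VECTORS = 22,
};

/* Vectors left per CPU for assignment to devices. */
static int device_vectors_per_cpu(void)
{
    return TOTAL_VECTORS_PER_CPU - CPU_RESERVED_VECTORS -
           KERNEL_RESERVED_VECTORS;
}

/* Upper bound on device vectors available within an affinity mask. */
static int device_vectors_in_mask(int cpus_in_mask)
{
    return cpus_in_mask * device_vectors_per_cpu();
}
```

      An affinity mask covering a single CPU can therefore hold at most 202
      device interrupts in this model before the vector space on that CPU is
      exhausted.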
      
      When an interrupt is set up or the affinity is changed by the kernel or the
      administrator, the vector assignment code attempts to honor the requested
      affinity mask. If the vector space on the CPUs in that affinity mask is
      exhausted the code falls back to a wider set of CPUs and assigns a vector
      on a CPU outside of the requested affinity mask silently.
      
      While the effective affinity is reflected in the corresponding
      /proc/irq/$N/effective_affinity* files the silent breakage of the requested
      affinity can lead to unexpected behaviour for administrators.
      
      Add a pr_warn() when this happens so that administrators get at least
      informed about it in the syslog.
      
      [ tglx: Massaged changelog and made the pr_warn() more informative ]
      
      Reported-by: <djuran@redhat.com>
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: <djuran@redhat.com>
      Link: https://lkml.kernel.org/r/20190822143421.9535-1-nhorman@tuxdriver.com
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/apic: Soft disable APIC before initializing it · bb0a3ab1
      Thomas Gleixner authored and 谢秀奇 committed
      
      [ Upstream commit 2640da4c ]
      
      If the APIC was already enabled on entry of setup_local_APIC() then
      disabling it soft via the SPIV register makes a lot of sense.
      
      That masks all LVT entries and brings it into a well defined state.
      
      Otherwise previously enabled LVTs which are not touched in the setup
      function stay unmasked and might surprise the just booting kernel.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20190722105219.068290579@linutronix.de
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/apic: Make apic_pending_intr_clear() more robust · 5f20f40e
      Thomas Gleixner authored and 谢秀奇 committed
      
      [ Upstream commit cc8bf191 ]
      
      In the course of developing shorthand based IPI support, issues with the
      function which tries to clear possibly pending ISR bits in the local APIC
      were observed.
      
        1) 0-day testing triggered the WARN_ON() in apic_pending_intr_clear().
      
           This warning is emitted when the function fails to clear pending ISR
           bits or observes pending IRR bits which are not delivered to the CPU
           after the stale ISR bit(s) are ACK'ed.
      
           Unfortunately the function only emits a WARN_ON() and fails to dump
           the IRR/ISR content. That's useless for debugging.
      
           Feng added spot on debug printk's which revealed that the stale IRR
           bit belonged to the APIC timer interrupt vector, but adding ad hoc
           debug code does not help with sporadic failures in the field.
      
           Rework the loop so the full IRR/ISR contents are saved and on failure
           dumped.
      
        2) The loop termination logic is interesting at best.
      
           If the machine has no TSC or cpu_khz is not known yet it tries 1
           million times to ack stale IRR/ISR bits. What?
      
           With TSC it uses the TSC to calculate the loop termination. It takes a
           timestamp at entry and terminates the loop when:
      
             (rdtsc() - start_timestamp) >= (cpu_khz << 10)
      
           That's roughly one second.
      
           Both methods are problematic. The APIC has 256 vectors, which means
           that in theory max. 256 IRR/ISR bits can be set. In practice this is
           impossible and the chance that more than a few bits are set is close
           to zero.
      
           With the pure loop based approach the 1 million retries are complete
           overkill.
      
           With TSC this can terminate too early in a guest which is running on a
           heavily loaded host even with only a couple of IRR/ISR bits set. The
           reason is that after acknowledging the highest priority ISR bit,
           pending IRRs must get serviced first before the next round of
           acknowledge can take place as the APIC (real and virtualized) does not
           honour EOI without a preceding interrupt on the CPU. And every APIC
           read/write takes a VMEXIT if the APIC is virtualized. While trying to
           reproduce the issue 0-day reported it was observed that the guest was
           scheduled out long enough under heavy load that it terminated after 8
           iterations.
      
           Make the loop terminate after 512 iterations. That's plenty enough
           in any case and does not take endless time to complete.
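
      Both points can be checked with a small model. It is a sketch under
      simplifying assumptions: the APIC is reduced to 256 ISR bits, each round
      acknowledges exactly one bit, and all names (apic_model, clear_pending(),
      tsc_budget_ms()) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ACK_ROUNDS 512

/* 8 x 32 = 256 in-service bits, one per vector. */
struct apic_model {
    uint32_t isr[8];
};

static bool any_pending(const struct apic_model *apic)
{
    for (int i = 0; i < 8; i++) {
        if (apic->isr[i])
            return true;
    }
    return false;
}

/* Acknowledge (clear) the highest priority in-service bit. */
static void ack_highest(struct apic_model *apic)
{
    for (int i = 7; i >= 0; i--) {
        for (int bit = 31; bit >= 0; bit--) {
            if (apic->isr[i] & (1u << bit)) {
                apic->isr[i] &= ~(1u << bit);
                return;
            }
        }
    }
}

/* Return the number of rounds used, or -1 if bits remain after the limit. */
static int clear_pending(struct apic_model *apic)
{
    int rounds = 0;

    while (any_pending(apic)) {
        if (rounds == MAX_ACK_ROUNDS)
            return -1;
        ack_highest(apic);
        rounds++;
    }
    return rounds;
}

/* Old TSC budget in milliseconds: cpu_khz cycles elapse per millisecond. */
static uint64_t tsc_budget_ms(uint64_t cpu_khz)
{
    return (cpu_khz << 10) / cpu_khz;
}
```

      Even with all 256 bits set the model finishes in 256 rounds, well inside
      the 512 round limit, and the old TSC budget works out to 1024 ms for any
      non-zero cpu_khz, which matches "roughly one second".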
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20190722105219.158847694@linutronix.de
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machines · d4d4c19f
      Thomas Gleixner authored and 谢秀奇 committed
      
      [ Upstream commit 3e5bedc2 ]
      
      Rahul Tanwar reported the following bug on DT systems:
      
      > 'ioapic_dynirq_base' contains the virtual IRQ base number. Presently, it is
      > updated to the end of hardware IRQ numbers but this is done only when IOAPIC
      > configuration type is IOAPIC_DOMAIN_LEGACY or IOAPIC_DOMAIN_STRICT. There is
      > a third type IOAPIC_DOMAIN_DYNAMIC which applies when IOAPIC configuration
      > comes from devicetree.
      >
      > See dtb_add_ioapic() in arch/x86/kernel/devicetree.c
      >
      > In case of IOAPIC_DOMAIN_DYNAMIC (DT/OF based system), 'ioapic_dynirq_base'
      > remains at its zero-initialized value. This means that for OF based systems,
      > virtual IRQ base will get set to zero.
      
      Such systems will very likely not even boot.
      
      For DT enabled machines ioapic_dynirq_base is irrelevant and not
      updated, so simply map the IRQ base 1:1 instead.
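
      A minimal sketch of the 1:1 mapping, assuming that a zero
      ioapic_dynirq_base identifies the DT case. The function below is a
      user-space stand-in that takes the base as a parameter; it is not the
      kernel's arch_dynirq_lower_bound() and ignores its other conditions.

```c
/*
 * Hypothetical stand-in: 'dynirq_base' models ioapic_dynirq_base, which
 * stays at 0 on DT enabled machines; 'from' is the lower bound requested
 * by the caller.
 */
static unsigned int dynirq_lower_bound(unsigned int dynirq_base,
                                       unsigned int from)
{
    /* A base of 0 means it was never updated, so map 'from' 1:1. */
    return dynirq_base ? dynirq_base : from;
}
```

      With a base of 0 the caller's value is passed through unchanged instead
      of the virtual IRQ base collapsing to zero.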
      
      Reported-by: Rahul Tanwar <rahul.tanwar@linux.intel.com>
      Tested-by: Rahul Tanwar <rahul.tanwar@linux.intel.com>
      Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: alan@linux.intel.com
      Cc: bp@alien8.de
      Cc: cheol.yong.kim@intel.com
      Cc: qi-ming.wu@intel.com
      Cc: rahul.tanwar@intel.com
      Cc: rppt@linux.ibm.com
      Cc: tony.luck@intel.com
      Link: http://lkml.kernel.org/r/20190821081330.1187-1-rahul.tanwar@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Revert "x86/apic: Include the LDR when clearing out APIC registers" · e8bee45f
      Linus Torvalds authored and 谢秀奇 committed
      [ Upstream commit 950b07c1 ]
      
      This reverts commit 558682b5.
      
      Chris Wilson reports that it breaks his CPU hotplug test scripts.  In
      particular, it breaks offlining and then re-onlining the boot CPU, which
      we treat specially (and the BIOS does too).
      
      The symptoms are that we can offline the CPU, but it then does not come
      back online again:
      
          smpboot: CPU 0 is now offline
          smpboot: Booting Node 0 Processor 0 APIC 0x0
          smpboot: do_boot_cpu failed(-1) to wakeup CPU#0
      
      Thomas says he knows why it's broken (my personal suspicion: our magic
      handling of the "cpu0_logical_apicid" thing), but for 5.3 the right fix
      is to just revert it, since we've never touched the LDR bits before, and
      it's not worth the risk to do anything else at this stage.
      
      [ Hotplugging of the boot CPU is special anyway, and should be off by
        default. See the "BOOTPARAM_HOTPLUG_CPU0" config option and the
        cpu0_hotplug kernel parameter.
      
        In general you should not do it, and it has various known limitations
        (hibernate and suspend require the boot CPU, for example).
      
        But it should work, even if the boot CPU is special and needs careful
        treatment       - Linus ]
      
      Link: https://lore.kernel.org/lkml/156785100521.13300.14461504732265570003@skylake-alporthouse-com/
      
      
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Bandan Das <bsd@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/apic: Include the LDR when clearing out APIC registers · 96b5717f
      Bandan Das authored and 谢秀奇 committed
      
      commit 558682b5 upstream.
      
      Although APIC initialization will typically clear out the LDR before
      setting it, the APIC cleanup code should reset the LDR.
      
      This was discovered with a 32-bit KVM guest jumping into a kdump
      kernel. The stale bits in the LDR triggered a bug in the KVM APIC
      implementation which caused the destination mapping for VCPUs to be
      corrupted.
      
      Note that this isn't intended to paper over the KVM APIC bug. The kernel
      has to clear the LDR when resetting the APIC registers except when X2APIC
      is enabled.
      
      This lacks a Fixes tag because the failure to clear the LDR goes way back
      into pre-git history.
      
      [ tglx: Made x2apic_enabled a function call as required ]
      
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190826101513.5080-3-bsd@redhat.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/apic: Do not initialize LDR and DFR for bigsmp · a5c9bd4c
      Bandan Das authored and 谢秀奇 committed
      commit bae3a8d3 upstream.
      
      Legacy apic init uses bigsmp for smp systems with 8 and more CPUs. The
      bigsmp APIC implementation uses physical destination mode, but it
      nevertheless initializes LDR and DFR. The LDR even ends up incorrectly with
      multiple bits being set.
      
      This does not cause a functional problem because LDR and DFR are ignored
      when physical destination mode is active, but it triggered a problem on a
      32-bit KVM guest which jumps into a kdump kernel.
      
      The multiple bits set unearthed a bug in the KVM APIC implementation. The
      code which creates the logical destination map for VCPUs ignores the
      disabled state of the APIC and ends up overwriting an existing valid entry
      and as a result, APIC calibration hangs in the guest during kdump
      initialization.
      
      Remove the bogus LDR/DFR initialization.
      
      This is not intended to work around the KVM APIC bug. The LDR/DFR
      initialization is wrong on its own.
      
      The issue goes back into th...
      a5c9bd4c
    • x86/apic: Handle missing global clockevent gracefully · 103f5b8f
      Thomas Gleixner authored and 谢秀奇 committed
      
      commit f897e60a upstream.
      
      Some newer machines do not advertise legacy timers. The kernel can handle
      that situation if the TSC and the CPU frequency are enumerated by CPUID or
      MSRs and the CPU supports TSC deadline timer. If the CPU does not support
      TSC deadline timer the local APIC timer frequency has to be known as well.
      
      Some Ryzen machines do not advertise legacy timers, but there is no
      reliable way to determine the bus frequency which feeds the local APIC
      timer when the machine allows overclocking of that frequency.
      
      As there is no legacy timer, the local APIC timer calibration crashes due to
      a NULL pointer dereference when accessing the global clock event device,
      which is not installed.
      
      Switch the calibration loop to a non-interrupt-based one, which polls
      either the TSC (if its frequency is known) or jiffies. The latter requires
      a global clockevent. As the machines which do not have a global clockevent
      installed have a known TSC frequency, this is a non-issue. For older
      machines where the TSC frequency is not known, there is no known case where
      the legacy timers do not exist, as that would have been reported long ago.
      
      Reported-by: Daniel Drake <drake@endlessm.com>
      Reported-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908091443030.21433@nanos.tec.linutronix.de
      Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1142926#c12
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      103f5b8f
    • x86/apic: Silence -Wtype-limits compiler warnings · 053b297e
      Qian Cai authored and 谢秀奇 committed
      
      [ Upstream commit ec633558 ]
      
      There are many compiler warnings like this:
      
      In file included from ./arch/x86/include/asm/smp.h:13,
                       from ./arch/x86/include/asm/mmzone_64.h:11,
                       from ./arch/x86/include/asm/mmzone.h:5,
                       from ./include/linux/mmzone.h:969,
                       from ./include/linux/gfp.h:6,
                       from ./include/linux/mm.h:10,
                       from arch/x86/kernel/apic/io_apic.c:34:
      arch/x86/kernel/apic/io_apic.c: In function 'check_timer':
      ./arch/x86/include/asm/apic.h:37:11: warning: comparison of unsigned
      expression >= 0 is always true [-Wtype-limits]
         if ((v) <= apic_verbosity) \
                 ^~
      arch/x86/kernel/apic/io_apic.c:2160:2: note: in expansion of macro
      'apic_printk'
        apic_printk(APIC_QUIET, KERN_INFO "..TIMER: vector=0x%02X "
        ^~~~~~~~~~~
      ./arch/x86/include/asm/apic.h:37:11: warning: comparison of unsigned
      expression >= 0 is always true [-Wtype-limits]
         if ((v) <= apic_verbosity) \
                 ^~
      arch/x86/kernel/apic/io_apic.c:2207:4: note: in expansion of macro
      'apic_printk'
          apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: "
          ^~~~~~~~~~~
      
      APIC_QUIET is 0, so silence them by making apic_verbosity type int.
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1562621805-24789-1-git-send-email-cai@lca.pw
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      053b297e
    • x86/irq: Seperate unused system vectors from spurious entry again · 072aca7f
      Thomas Gleixner authored and 谢秀奇 committed
      commit f8a8fe61 upstream
      
      Quite some time ago the interrupt entry stubs for unused vectors in the
      system vector range got removed and directly mapped to the spurious
      interrupt vector entry point.
      
      Sounds reasonable, but it's subtly broken. The spurious interrupt vector
      entry point pushes vector number 0xFF on the stack which makes the whole
      logic in __smp_spurious_interrupt() pointless.
      
      As a consequence any spurious interrupt which comes from a vector != 0xFF
      is treated as a real spurious interrupt (vector 0xFF) and not
      acknowledged. That subsequently stalls all interrupt vectors of equal and
      lower priority, which brings the system to a grinding halt.
      
      This can happen because even on 64-bit the system vector space is not
      guaranteed to be fully populated. A full compile time handling of the
      unused vectors is not possible because quite a few of them are conditionally
      populated at runtime.
      
      Bring the entry stubs ...
      072aca7f
    • x86/irq: Handle spurious interrupt after shutdown gracefully · 18cf32ca
      Thomas Gleixner authored and 谢秀奇 committed
      commit b7107a67 upstream
      
      Since the rework of the vector management, warnings about spurious
      interrupts have been reported. Robert provided some more information and
      did an initial analysis. The following situation leads to these warnings:
      
         CPU 0                  CPU 1               IO_APIC
      
                                                    interrupt is raised
                                                    sent to CPU1
                                Unable to handle
                                immediately
                                (interrupts off,
                                 deep idle delay)
         mask()
         ...
         free()
           shutdown()
           synchronize_irq()
           clear_vector()
                                do_IRQ()
                                  -> vector is clear
      
      Before the rework the vector entries of legacy interrupts were statically
      assigned and occupied precious vector space while most of them were
      unused. Due to that the above situation was handled silently because the
      vector was handled and the core handle...
      18cf32ca
    • x86/ioapic: Implement irq_get_irqchip_state() callback · 91fabc80
      Thomas Gleixner authored and 谢秀奇 committed
      
      commit dfe0cf8b upstream
      
      When an interrupt is shut down in free_irq() there might be an in-flight
      interrupt pending in the IO-APIC remote IRR which is not yet serviced. That
      means the interrupt has been sent to the target CPU's local APIC, but the
      target CPU is in a state which delays the servicing.
      
      So free_irq() would proceed to free resources and to clear the vector
      because synchronize_hardirq() does not see an interrupt handler in
      progress.
      
      That can trigger a spurious interrupt warning, which is harmless and just
      confuses users, but it also can leave the remote IRR in a stale state
      because once the handler is invoked the interrupt resources might be freed
      already and therefore acknowledgement is not possible anymore.
      
      Implement the irq_get_irqchip_state() callback for the IO-APIC irq chip. The
      callback is invoked from free_irq() via __synchronize_hardirq(). Check the
      remote IRR bit of the interrupt and return 'in flight' if it is set and the
      interrupt is configured in level mode. For edge mode the remote IRR has no
      meaning.
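      
      As a rough illustration of the check described above, the decision can be
      reduced to a predicate over two bits of the redirection entry. This is a
      minimal sketch, not the kernel implementation: struct rte_snapshot and
      irq_in_flight() are hypothetical names, and reading the entry from the
      hardware is left out.

```c
#include <stdbool.h>

/*
 * Hypothetical snapshot of the two redirection-entry fields the
 * description mentions. The real IO-APIC entry has more fields.
 */
struct rte_snapshot {
    bool level_triggered; /* true for level mode, false for edge mode */
    bool remote_irr;      /* remote IRR bit as read from the entry */
};

/*
 * Report 'in flight' only when the interrupt is level triggered and the
 * remote IRR bit is set. For edge mode the bit carries no meaning, so it
 * is ignored.
 */
static bool irq_in_flight(const struct rte_snapshot *rte)
{
    return rte->level_triggered && rte->remote_irr;
}
```

      A caller in the position of __synchronize_hardirq() would keep waiting
      while such a predicate returns true.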
      
      As this is only meaningful for level triggered interrupts this won't cure
      the potential spurious interrupt warning for edge triggered interrupts, but
      the edge trigger case does not result in stale hardware state. This has to
      be addressed at the vector/interrupt entry level separately.
      
      Fixes: 464d1230 ("x86/vector: Switch IOAPIC to global reservation mode")
      Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20190628111440.370295517@linutronix.de
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      91fabc80
    • x86/apic: Fix integer overflow on 10 bit left shift of cpu_khz · 2397a667
      Colin Ian King authored and 谢秀奇 committed
      
      [ Upstream commit ea136a11 ]
      
      The left shift of unsigned int cpu_khz will overflow for large values of
      cpu_khz, so cast it to a long long before shifting it to avoid overflow.
      For example, this can happen when cpu_khz is 4194305, i.e. ~4.2 GHz.
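      
      The effect can be reproduced in isolation. The sketch below assumes a
      32-bit unsigned int; shift_khz_narrow() and shift_khz_wide() are
      hypothetical helper names, and only the cast mirrors the fix described
      above.

```c
/*
 * Shift done in unsigned int arithmetic. With a 32-bit unsigned int the
 * result wraps modulo 2^32 before it is widened to the return type.
 */
static long long shift_khz_narrow(unsigned int cpu_khz)
{
    return cpu_khz << 10;
}

/*
 * Widen first, then shift. The cast makes the shift happen in long long
 * arithmetic, so the upper bits are preserved.
 */
static long long shift_khz_wide(unsigned int cpu_khz)
{
    return (long long)cpu_khz << 10;
}
```

      For cpu_khz = 4194305 the narrow variant wraps around to 1024, while the
      widened variant yields 4294968320.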
      
      Addresses-Coverity: ("Unintentional integer overflow")
      Fixes: 8c3ba8d0 ("x86, apic: ack all pending irqs when crashed/on kexec")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: kernel-janitors@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190619181446.13635-1-colin.king@canonical.com
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2397a667
    • x86/apic: Unify duplicated local apic timer clockevent initialization · 2e16f46d
      Jacob Pan authored and 谢秀奇 committed
      
      mainline inclusion
      from mainline-6eb4f082
      commit 6eb4f082
      category: bugfix
      bugzilla: 14416
      CVE: NA
      
      -------------------------------------------------
      
      Local APIC timer clockevent parameters can be calculated based on platform
      specific methods. However the code is mostly duplicated with the interrupt
      based calibration. The commit which increased the max_delta parameter
      updated only one place and made the implementations diverge.
      
      Unify it to prevent further damage.
      
      [ tglx: Rename function to lapic_init_clockevent() and adjust changelog a bit ]
      
      Fixes: 4aed89d6 ("x86, lapic-timer: Increase the max_delta to 31 bits")
      Reported-by: Daniel Drake <drake@endlessm.com>
      Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Len Brown <lenb@kernel.org>
      Link: https://lkml.kernel.org/r/1556213272-63568-1-git-send-email-jacob.jun.pan@linux.intel.com
      
      
      Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
      Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yao Hongbo <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2e16f46d
    • irq/matrix: Spread managed interrupts on allocation · 5fc68885
      Dou Liyang authored and 谢秀奇 committed
      
      [ Upstream commit 76f99ae5 ]
      
      Linux spreads out non-managed interrupts across the possible target CPUs
      to avoid vector space exhaustion.
      
      Managed interrupts are treated differently, as for them the vectors are
      reserved (with guarantee) when the interrupt descriptors are initialized.
      
      When the interrupt is requested a real vector is assigned. The assignment
      logic uses the first CPU in the affinity mask for assignment. If the
      interrupt has more than one CPU in the affinity mask, which happens when a
      multi-queue device has fewer queues than CPUs, then doing the same search as
      for non-managed interrupts makes sense as it puts the interrupt on the
      least interrupt-plagued CPU. For single-CPU affine vectors that's obviously
      a NOOP.
      
      Restructure the matrix allocation code so it does the 'best CPU' search, add
      the sanity check for an empty affinity mask and adapt the call site in the
      x86 vector management code.
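      
      The 'best CPU' search can be sketched roughly as follows. This is a
      simplified illustration, not the irq matrix code: struct cpu_vectors and
      find_best_cpu() are hypothetical, the affinity mask is modelled as a
      plain bitmask, and 'least interrupt-plagued' is assumed to mean the
      online CPU with the most free vectors.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-CPU bookkeeping; the kernel's real structure differs. */
struct cpu_vectors {
    int online;             /* non-zero if the CPU can take interrupts */
    unsigned int available; /* number of free vectors on this CPU */
};

/*
 * Return the index of the online CPU in 'mask' with the most free
 * vectors, or -1 if the mask is empty or no CPU in it has a free vector.
 * Bit i of 'mask' set means CPU i is part of the affinity mask.
 */
static int find_best_cpu(const struct cpu_vectors *cpus, size_t ncpus,
                         unsigned long mask)
{
    size_t max_bits = sizeof(mask) * CHAR_BIT;
    unsigned int best_avail = 0;
    int best = -1;
    size_t i;

    /* Sanity check for an empty affinity mask. */
    if (!mask)
        return -1;

    for (i = 0; i < ncpus && i < max_bits; i++) {
        if (!(mask & (1UL << i)) || !cpus[i].online)
            continue;
        if (cpus[i].available > best_avail) {
            best_avail = cpus[i].available;
            best = (int)i;
        }
    }
    return best;
}
```

      With a single CPU in the mask the loop degenerates to picking that CPU,
      which matches the NOOP remark above.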
      
      [ tglx: Added the empty mask check to the core and improved change log ]
      
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/20180908175838.14450-2-dou_liyang@163.com
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5fc68885
  12. Sep 08, 2018
  13. Aug 15, 2018
  14. Aug 05, 2018
    • x86: Don't include linux/irq.h from asm/hardirq.h · 447ae316
      Nicolai Stange authored
      The next patch in this series will have to make the definition of
      irq_cpustat_t available to entering_irq().
      
      Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
      dependencies like
      
        asm/smp.h
          asm/apic.h
            asm/hardirq.h
              linux/irq.h
                linux/topology.h
                  linux/smp.h
                    asm/smp.h
      
      or
      
        linux/gfp.h
          linux/mmzone.h
            asm/mmzone.h
              asm/mmzone_64.h
                asm/smp.h
                  asm/apic.h
                    asm/hardirq.h
                      linux/irq.h
                        linux/irqdesc.h
                          linux/kobject.h
                            linux/sysfs.h
                              linux/kernfs.h
                                linux/idr.h
                                  linux/gfp.h
      
      and others.
      
      This causes compilation errors because of the header guards becoming
      effective in the second inclusion: symbols/macros that had been defined
      before wouldn't be available to intermediate headers in the #include c...
      447ae316
  15. Jul 31, 2018
  16. Jul 24, 2018
  17. Jul 02, 2018
    • Revert "x86/apic: Ignore secondary threads if nosmt=force" · 506a66f3
      Thomas Gleixner authored
      
      Dave Hansen reported that it's outright dangerous to keep SMT siblings
      disabled completely so they are stuck in the BIOS and wait for SIPI.
      
      The reason is that Machine Check Exceptions are broadcast to siblings and
      the soft disabled sibling has CR4.MCE = 0. If an MCE is delivered to a
      logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
      reboots the machine. The MCE chapter in the SDM contains the following
      blurb:
      
          Because the logical processors within a physical package are tightly
          coupled with respect to shared hardware resources, both logical
          processors are notified of machine check errors that occur within a
          given physical processor. If machine-check exceptions are enabled when
          a fatal error is reported, all the logical processors within a physical
          package are dispatched to the machine-check exception handler. If
          machine-check exceptions are disabled, the logical processors enter the
          shutdown state and assert the IERR# signal. When enabling machine-check
          exceptions, the MCE flag in control register CR4 should be set for each
          logical processor.
      
      Reverting the commit which ignores siblings at enumeration time solves only
      half of the problem. The core cpuhotplug logic needs to be adjusted as
      well.
      
      This thoughtfully engineered mechanism also turns the boot process on all
      Intel HT enabled systems into an MCE lottery. MCE is enabled on the boot CPU
      before the secondary CPUs are brought up. Depending on the number of
      physical cores the window in which this situation can happen is smaller or
      larger. On a HSW-EX it's about 750ms:
      
      MCE is enabled on the boot CPU:
      
      [    0.244017] mce: CPU supports 22 MCE banks
      
      The corresponding sibling #72 boots:
      
      [    1.008005] .... node  #0, CPUs:    #72
      
      That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
      between these two points the machine is going to shut down. At least it's a
      known safe state.
      
      It's obvious that the early boot can be hit by an MCE as well and then runs
      into the same situation because MCEs are not yet enabled on the boot CPU.
      But after enabling them on the boot CPU, it does not make any sense to
      prevent the kernel from recovering.
      
      Adjust the nosmt kernel parameter documentation as well.
      
      Reverts: 2207def7 ("x86/apic: Ignore secondary threads if nosmt=force")
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Tony Luck <tony.luck@intel.com>
      506a66f3