  1. Apr 15, 2021
  2. Apr 14, 2021
    • bpf, x86: Validate computation of branch displacements for x86-32 · 27b0b915
      Piotr Krysiuk authored
      
      stable inclusion
      from linux-4.19.186
      commit 7b77ae2a0d6f9e110e13e85d802124b111b3e027
      CVE: CVE-2021-29154
      
      --------------------------------
      
      commit 26f55a59dc65ff77cd1c4b37991e26497fc68049 upstream.
      
      The branch displacement logic in the BPF JIT compilers for x86 assumes
      that, for any generated branch instruction, the distance cannot
      increase between optimization passes.
      
      But this assumption can be violated due to how the distances are
      computed. Specifically, whenever a backward branch is processed in
      do_jit(), the distance is computed by subtracting the positions in the
      machine code from different optimization passes. This is because part
      of addrs[] is already updated for the current optimization pass, before
      the branch instruction is visited.
      
      And so the optimizer can expand blocks of machine code in some cases.
      
      This can confuse the optimizer logic, where it assumes that a fixed
      point has been reached for all machine code blocks once the total
      program size stops changing. And then the JIT compiler can output
      abnormal machine code containing incorrect branch displacements.
      
      To mitigate this issue, we assert that a fixed point is reached while
      populating the output image. This rejects any problematic programs.
      The issue affects both x86-32 and x86-64. We mitigate separately to
      ease backporting.
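
      A minimal sketch of the shape of this mitigation, using the bookkeeping
      variables the text above refers to (image, proglen, oldproglen and the
      per-instruction buffer temp as maintained by do_jit()); treat it as an
      illustration of the idea rather than the literal hunk:

          if (image) {
                  /*
                   * The image buffer was sized from the previous pass. If the
                   * current pass wants to emit more bytes, no fixed point was
                   * reached: refuse to emit the program instead of writing
                   * code with stale branch displacements.
                   */
                  if (unlikely(proglen + ilen > oldproglen)) {
                          pr_err("bpf_jit: fatal error\n");
                          return -EFAULT;
                  }
                  memcpy(image + proglen, temp, ilen);
          }
          proglen += ilen;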
      
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
    • bpf, x86: Validate computation of branch displacements for x86-64 · 39fd7bff
      Piotr Krysiuk authored
      
      stable inclusion
      from linux-4.19.186
      commit 5f26f1f838aa960045c712e13dbab8ff451fed74
      CVE: CVE-2021-29154
      
      --------------------------------
      
      commit e4d4d456436bfb2fe412ee2cd489f7658449b098 upstream.
      
      The branch displacement logic in the BPF JIT compilers for x86 assumes
      that, for any generated branch instruction, the distance cannot
      increase between optimization passes.
      
      But this assumption can be violated due to how the distances are
      computed. Specifically, whenever a backward branch is processed in
      do_jit(), the distance is computed by subtracting the positions in the
      machine code from different optimization passes. This is because part
      of addrs[] is already updated for the current optimization pass, before
      the branch instruction is visited.
      
      And so the optimizer can expand blocks of machine code in some cases.
      
      This can confuse the optimizer logic, where it assumes that a fixed
      point has been reached for all machine code blocks once the total
      program size stops changing. And then the JIT compiler can output
      abnormal machine code containing incorrect branch displacements.
      
      To mitigate this issue, we assert that a fixed point is reached while
      populating the output image. This rejects any problematic programs.
      The issue affects both x86-32 and x86-64. We mitigate separately to
      ease backporting.
      
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      39fd7bff
    • mm/vmalloc.c: fix percpu free VM area search criteria · 36d32b23
      Kuppuswamy Sathyanarayanan authored
      mainline inclusion
      from mainline-5.3-rc5
      commit 5336e52c
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      Recent changes to the vmalloc code by commit 68ad4a33
      ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
      cause spurious percpu allocation failures.  These, in turn, can result
      in panic()s in the slub code.  One such possible panic was reported by
      Dave Hansen in the following link: https://lkml.org/lkml/2019/6/19/939.
      Another related panic observed is:
      
       RIP: 0033:0x7f46f7441b9b
       Call Trace:
        dump_stack+0x61/0x80
        pcpu_alloc.cold.30+0x22/0x4f
        mem_cgroup_css_alloc+0x110/0x650
        cgroup_apply_control_enable+0x133/0x330
        cgroup_mkdir+0x41b/0x500
        kernfs_iop_mkdir+0x5a/0x90
        vfs_mkdir+0x102/0x1b0
        do_mkdirat+0x7d/0xf0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
      to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
      uses two lists (vmap_area_list & free_vmap_area_list) to track the used
      and free VM areas in VMALLOC space.  The pcpu_get_vm_areas(offsets[],
      sizes[], nr_vms, align) function is used for allocating congruent VM
      areas for the percpu memory allocator.  In order not to conflict with
      VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
      VMALLOC space.  So the search for a free vm_area for the given requirement
      starts near VMALLOC_END and moves upwards towards VMALLOC_START.
      
      Prior to commit 68ad4a33, the search for a free vm_area in
      pcpu_get_vm_areas() involved the following two main steps.

      Step 1:
          Find an aligned "base" address near VMALLOC_END.
          va = free vm area near VMALLOC_END
      Step 2:
          Loop through number of requested vm_areas and check,
              Step 2.1:
                 if (base < VMALLOC_START)
                    1. fail with error
              Step 2.2:
                 // end is offsets[area] + sizes[area]
                 if (base + end > va->vm_end)
                     1. Move the base downwards and repeat Step 2
              Step 2.3:
                 if (base + start < va->vm_start)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      But Commit 68ad4a33 removed Step 2.2 and modified Step 2.3 as below:
      
              Step 2.3:
                 if (base + start < va->vm_start || base + end > va->vm_end)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      The above change is the root cause of spurious percpu memory allocation
      failures.  For example, consider a case where a relatively large vm_area
      (~30 TB) was ignored in the free vm_area search because it did not pass
      the base + end < va->vm_end boundary check.  Ignoring such large free
      vm_areas would lead to not finding a free vm_area within the boundary of
      VMALLOC_START to VMALLOC_END, which in turn leads to allocation failures.
      
      So modify the search algorithm to include Step 2.2.
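
      A sketch of what re-adding Step 2.2 looks like inside the placement loop
      of pcpu_get_vm_areas(); the helper pvm_determine_end_from_reverse() and
      the loop variables follow the upstream function, but read this as an
      illustration, not the exact hunk:

          /*
           * Step 2.2: if the request does not fit below the end of this
           * free area, move base downwards within the same area and
           * retry, instead of skipping the (possibly huge) area entirely.
           */
          if (base + end > va->va_end) {
                  base = pvm_determine_end_from_reverse(&va, align) - end;
                  term_area = area;
                  continue;
          }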
      
      Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
      
      
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 5336e52c)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning · a9708156
      Arnd Bergmann authored
      mainline inclusion
      from mainline-5.2-rc7
      commit 2c929233
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      gcc gets confused in pcpu_get_vm_areas() because there are too many
      branches that affect whether 'lva' was initialized before it gets used:
      
        mm/vmalloc.c: In function 'pcpu_get_vm_areas':
        mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
            insert_vmap_area_augment(lva, &va->rb_node,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             &free_vmap_area_root, &free_vmap_area_list);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        mm/vmalloc.c:916:20: note: 'lva' was declared here
          struct vmap_area *lva;
                            ^~~
      
      Add an initialization to NULL, and check whether it has changed before
      the first use.
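
      A minimal sketch of the resulting pattern in the split path (names such
      as vmap_area_cachep, nva_start_addr and insert_vmap_area_augment() are
      taken from this code area; the snippet is illustrative):

          struct vmap_area *lva = NULL;

          if (type == NE_FIT_TYPE) {
                  /* The only branch that needs the extra node. */
                  lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
                  if (!lva)
                          return -1;

                  lva->va_start = va->va_start;
                  lva->va_end = nva_start_addr;
          }

          if (type != FL_FIT_TYPE) {
                  augment_tree_propagate_from(va);

                  if (lva)        /* type == NE_FIT_TYPE */
                          insert_vmap_area_augment(lva, &va->rb_node,
                                  &free_vmap_area_root, &free_vmap_area_list);
          }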
      
      [akpm@linux-foundation.org: tweak comments]
      Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
      
      
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 2c929233)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a9708156
    • mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro · 2003379b
      Uladzislau Rezki (Sony) authored
      mainline inclusion
      from mainline-5.2-rc1
      commit a6cf4e0f
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      This macro adds some debug code to check that vmap allocations happen
      in ascending order.

      By default this option is set to 0 and is not active.  Activating it
      requires setting it to 1 and recompiling the kernel.
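
      For illustration, the kind of check such a compile-time switch guards is
      a comparison between the fast augmented-tree lookup and a brute-force
      scan; the helper names below are assumptions for the sketch:

          #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0

          #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
          static void
          find_vmap_lowest_match_check(unsigned long size, unsigned long vstart)
          {
                  struct vmap_area *va_fast, *va_slow;

                  /* Lookup used by the allocator (augmented tree)... */
                  va_fast = find_vmap_lowest_match(size, 1, vstart);
                  /* ...must agree with a linear scan of the free list. */
                  va_slow = find_vmap_lowest_linear_match(size, 1, vstart);

                  if (va_fast != va_slow)
                          pr_emerg("wrong lowest match: tree %p vs list %p\n",
                                   va_fast, va_slow);
          }
          #endif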
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
      
      
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit a6cf4e0f)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2003379b
    • mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro · db0b5445
      Uladzislau Rezki (Sony) authored
      mainline inclusion
      from mainline-5.2-rc1
      commit bb850f4d
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      This macro adds some debug code to check that the augmented tree is
      maintained correctly, meaning that every node contains a valid
      subtree_max_size value.

      By default this option is set to 0 and is not active.  Activating it
      requires setting it to 1 and recompiling the kernel.
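
      For illustration, a sketch of the sort of validation this covers: walk
      the tree and recompute every node's augmented value from scratch (the
      helper compute_subtree_max_size() exists in this series; the walker
      itself is a simplified stand-in):

          #if DEBUG_AUGMENT_PROPAGATE_CHECK
          static void augment_tree_propagate_check(struct rb_node *n)
          {
                  struct vmap_area *va;
                  unsigned long expected;

                  if (!n)
                          return;

                  va = rb_entry(n, struct vmap_area, rb_node);

                  /* Recompute the augmented value... */
                  expected = compute_subtree_max_size(va);

                  /* ...and compare with what propagation stored. */
                  if (expected != va->subtree_max_size)
                          pr_emerg("tree is corrupted: %lu vs %lu\n",
                                   expected, va->subtree_max_size);

                  augment_tree_propagate_check(n->rb_left);
                  augment_tree_propagate_check(n->rb_right);
          }
          #endif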
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
      
      
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit bb850f4d)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      db0b5445
    • mm/vmalloc.c: keep track of free blocks for vmap allocation · a93bd443
      Uladzislau Rezki (Sony) authored
      mainline inclusion
      from mainline-5.2-rc1
      commit 68ad4a33
      category: bugfix
      bugzilla: 15766, https://bugzilla.openeuler.org/show_bug.cgi?id=25
      CVE: NA
      
      -------------------------------------------------
      Patch series "improve vmap allocation", v3.
      
      Objective
      ---------
      
      Please have a look for the description at:
      
        https://lkml.org/lkml/2018/10/19/786
      
      but let me also summarize it a bit here as well.
      
      The current implementation has O(N) complexity. Requests with different
      permissive parameters can lead to long allocation times. When I say
      "long" I mean milliseconds.
      
      Description
      -----------
      
      This approach organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range, i.e.  an allocation is done by looking up free areas
      instead of finding a hole between two busy blocks.  It allows a lower
      number of objects to represent the free space, and therefore a less
      fragmented memory allocator, because free blocks are always kept as large
      as possible.
      
      It uses an augmented tree where all free areas are sorted in ascending
      order of the va->va_start address, paired with a linked list that provides
      O(1) access to prev/next elements.

      Since the tree is augmented, we also maintain the "subtree_max_size" of
      each VA, which reflects the maximum available free block in its left or
      right sub-tree.  Knowing that, we can easily traverse toward the lowest
      (left-most path) free area.
      
      Allocation: ~O(log(N)) complexity.  It is a sequential allocation method
      and therefore tends to maximize locality.  The search is done until the
      first suitable block that is large enough to encompass the requested
      parameters is found.  Bigger areas are split.
      
      I copy-paste here the description of how an area is split, since I
      described it in https://lkml.org/lkml/2018/10/19/786
      
      <snip>
      
      A free block can be split by three different ways.  Their names are
      FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e.  they
      correspond to how requested size and alignment fit to a free block.
      
      FL_FIT_TYPE - in this case a free block is just removed from the free
      list/tree because it fully fits.  Compared with the current design there
      is extra work with rb-tree updating.

      LE_FIT_TYPE/RE_FIT_TYPE - the left/right edge fits.  In this case we just
      cut a free block.  It is as fast as the current design.  Most of the
      vmalloc allocations end up in this case, because the edge is always
      aligned to 1.

      NE_FIT_TYPE - a much less common case.  Basically it happens when the
      requested size and alignment fit neither the left nor the right edge,
      i.e.  the allocation lies between them.  In this case, during splitting,
      we have to build a remaining left free area and place it back into the
      free list/tree.

      Compared with the current design there are two extra steps.  First, we
      have to allocate a new vmap_area structure.  Second, we have to insert
      that remaining free block into the address-sorted list/tree.
      
      In order to optimize the first case there is a cache with free_vmap
      objects.  Instead of allocating from the slab we just take an object from
      the cache and reuse it.

      The second step is pretty well optimized.  Since we know a start point in
      the tree we do not search from the top.  Instead, the traversal begins
      from the rb-tree node we split.
      <snip>
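
      As a compact summary of the four cases above, the classification boils
      down to comparing the requested region's edges with the free block's
      edges; a sketch consistent with the names used in the description:

          enum fit_type {
                  NOTHING_FIT = 0,
                  FL_FIT_TYPE = 1,        /* full fit */
                  LE_FIT_TYPE = 2,        /* left edge fit */
                  RE_FIT_TYPE = 3,        /* right edge fit */
                  NE_FIT_TYPE = 4         /* no edge fit, split in the middle */
          };

          static enum fit_type
          classify_va_fit_type(struct vmap_area *va,
                          unsigned long nva_start_addr, unsigned long size)
          {
                  /* The request must lie inside this free area. */
                  if (nva_start_addr < va->va_start ||
                      nva_start_addr + size > va->va_end)
                          return NOTHING_FIT;

                  if (va->va_start == nva_start_addr) {
                          if (va->va_end == nva_start_addr + size)
                                  return FL_FIT_TYPE;     /* remove whole block */
                          return LE_FIT_TYPE;             /* cut from the left */
                  }
                  if (va->va_end == nva_start_addr + size)
                          return RE_FIT_TYPE;             /* cut from the right */

                  return NE_FIT_TYPE;                     /* build a remainder */
          }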
      
      De-allocation: ~O(log(N)) complexity.  An area is not inserted straight
      away into the tree/list; instead we identify the spot first and check
      whether it can be merged with its neighbors.  The list provides O(1)
      access to prev/next, so it is pretty fast to check.  Summarizing: if
      merged, large coalesced areas are created; if not, the area is just
      linked, making more fragments.

      There is one more thing that I should mention here.  After modification
      of a VA node, its subtree_max_size is updated if it was/is the biggest
      area in its left or right sub-tree.  Apart from that, the change can also
      be propagated back to upper levels to fix the tree.  For more details
      please have a look at the __augment_tree_propagate_from() function and
      its description.
      
      Tests and stressing
      -------------------
      
      I use the "test_vmalloc.sh" test driver available under
      "tools/testing/selftests/vm/" since 5.1-rc1 kernel.  Just trigger "sudo
      ./test_vmalloc.sh" to find out how to deal with it.
      
      Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
      Regarding last one, i do not have any physical access to NUMA system,
      therefore i emulated it.  The time of stressing is days.
      
      If you run the test driver in "stress mode", you also need the patch that
      is in Andrew's tree but not in Linux 5.1-rc1.  So, please apply it:
      
      http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c
      
      After massive testing, I have not identified any problems like memory
      leaks, crashes or kernel panics.  I find it stable, but more testing would
      be good.
      
      Performance analysis
      --------------------
      
      I have used two systems to test.  One is an i5-3320M CPU @ 2.60GHz and
      the other is a HiKey960 (arm64) board.  The i5-3320M runs a 4.20 kernel,
      whereas the HiKey960 uses a 4.15 kernel.  Both systems could run 5.1-rc1
      as well, but those results were not ready by the time I was writing this.

      Currently the driver consists of 8 tests.  Three of them correspond to
      the different types of splitting (to compare with the default); we have 3
      of those (see above).  Another 5 do allocations under different
      conditions.
      
      a) sudo ./test_vmalloc.sh performance
      
      When the test driver is run in "performance" mode, it runs all available
      tests pinned to the first online CPU with a sequential test execution
      order.  We do this in order to get stable and repeatable results.  Take a
      look at the time difference in "long_busy_list_alloc_test".  It is not
      surprising, because the worst case is O(N).
      
      How many cycles all tests took:
      CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles
      
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt
      
      How many cycles all tests took:
      CPU0=3478683207 cycles vs CPU0=463767978 cycles
      
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
      
      b) time sudo ./test_vmalloc.sh test_repeat_count=1
      
      With this configuration, all tests are run on all available online CPUs.
      Before running, each CPU shuffles its test execution order, which gives
      random allocation behaviour.  So it is a rough comparison, but it
      certainly paints the picture.
      
      <default>            vs            <patched>
      real    101m22.813s                real    0m56.805s
      user    0m0.011s                   user    0m0.015s
      sys     0m5.076s                   sys     0m0.023s
      
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt
      
      <default>            vs            <patched>
      real    unknown                    real    4m25.214s
      user    unknown                    user    0m0.011s
      sys     unknown                    sys     0m0.670s
      
      I did not manage to complete this test on the "default HiKey960" kernel
      version.  After 24 hours it was still running, therefore I had to cancel
      it.  That is why real/user/sys are "unknown".
      
      This patch (of 3):
      
      Currently an allocation of a new vmap area is done by iterating over the
      busy list (complexity O(n)) until a suitable hole is found between two
      busy areas.  Therefore each new allocation causes the list to grow.  Due
      to an over-fragmented list and different permissive parameters, an
      allocation can take a long time.  For example, on embedded devices it is
      milliseconds.
      
      This patch organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range.  It uses an augmented red-black tree that keeps blocks
      sorted by their offsets, paired with a linked list that keeps the free
      space in order of increasing addresses.

      Nodes are augmented with the size of the maximum available free block in
      their left or right sub-tree.  That allows us to decide on and traverse
      toward the block that will fit and will have the lowest start address,
      i.e.  it is sequential allocation.
      
      Allocation: to allocate a new block, a search is done over the tree until
      a suitable lowest (left-most) block is found that is large enough to
      encompass the requested size, alignment and vstart point.  If the block is
      bigger than the requested size, it is split.

      De-allocation: when a busy vmap area is freed it can either be merged or
      inserted into the tree.  The red-black tree allows efficiently finding a
      spot, whereas the linked list provides constant-time access to the
      previous and next blocks to check whether merging can be done.  When a
      de-allocated memory chunk is merged, a large coalesced area is created.
      
      Complexity: ~O(log(N))
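
      A sketch of the node layout this describes (field names as used by the
      series; unrelated fields omitted):

          struct vmap_area {
                  unsigned long va_start;
                  unsigned long va_end;

                  /*
                   * Augmented value: the largest free block size available in
                   * this node's subtree. It lets a lookup skip whole subtrees
                   * that cannot satisfy a request and walk left-most instead.
                   */
                  unsigned long subtree_max_size;

                  struct rb_node rb_node;         /* address-sorted rb-tree */
                  struct list_head list;          /* address-sorted list */

                  /* ... remaining fields unchanged ... */
          };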
      
      [urezki@gmail.com: v3]
        Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
      
      
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 68ad4a33)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a93bd443
    • config: Enable CONFIG_USERSWAP · 6f799d99
      Xiongfeng Wang authored
      
      hulk inclusion
      category: feature
      bugzilla: 47439
      CVE: NA
      
      -------------------------------------------------
      
      Enable CONFIG_USERSWAP for hulk_defconfig and openeuler_defconfig.
      
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
    • userswap: support userswap via userfaultfd · c3e6287f
      Guo Fan authored
      
      hulk inclusion
      category: feature
      bugzilla: 47439
      CVE: NA
      
      -------------------------------------------------
      
      This patch modifies userfaultfd to support userswap.  To check whether
      the pages are dirty since the last swap-in, we make them clean when we
      swap in the pages.  The userspace may swap in a large area where part of
      it was not swapped out; we need to skip those pages that were not swapped
      out.
      
      Signed-off-by: Guo Fan <guofan5@huawei.com>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      c3e6287f
    • userswap: add a new flag 'MAP_REPLACE' for mmap() · e3452806
      Guo Fan authored
      
      hulk inclusion
      category: feature
      bugzilla: 47439
      CVE: NA
      
      -------------------------------------------------
      
      To make sure no other userspace threads access the memory region we are
      swapping out, we need to unmap the memory region, map it to a new address
      and use the new address to perform the swapout.  We add a new flag
      'MAP_REPLACE' for mmap() to unmap the pages of the input parameter 'VA'
      and remap them to a new tmpVA.
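
      Purely for illustration, a hedged sketch of how a userspace swapper might
      drive such a flag; the flag value and the exact call semantics are
      placeholders and are defined by the patch itself, not by this snippet:

          #include <stddef.h>
          #include <sys/mman.h>

          /* Placeholder bit; the real value comes from the uapi mman headers. */
          #ifndef MAP_REPLACE
          #define MAP_REPLACE 0x1000000
          #endif

          /*
           * Detach the region at 'va' to a fresh private address so no other
           * thread can touch it while its pages are swapped out by userspace.
           * Returns the new tmpVA (or MAP_FAILED).
           */
          static void *userswap_detach(void *va, size_t len)
          {
                  return mmap(va, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_REPLACE, -1, 0);
          }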
      
      Signed-off-by: Guo Fan <guofan5@huawei.com>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      e3452806
    • mm, mempolicy: fix up gup usage in lookup_node · 8bc9bb27
      Michal Hocko authored
      
      mainline inclusion
      from mainline-v5.8-rc1
      commit 2d3a36a4
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      ba841078 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
      has added a special casing for 0 return value because that was a possible
      gup return value when interrupted by fatal signal.  This has been fixed by
      ae46d2aa ("mm/gup: Let __get_user_pages_locked() return -EINTR for
      fatal signal") in the mean time so ba841078 can be reverted.
      
      This patch however doesn't go all the way to revert it because the check
      for 0 is wrong and confusing here.  Firstly it is inherently unsafe to
      access the page when get_user_pages_locked returns 0 (aka no page
      returned).
      
      Fortunately this will not happen because get_user_pages_locked will not
      return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified which is not
      the case here.  Document this potential error code in gup code while we
      are at it.
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Xu <peterx@redhat.com>
      Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
       Conflicts:
      	mm/gup.c
      [wangxiongfeng: conflicts in comments ]
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      8bc9bb27
    • mm/mempolicy: Allow lookup_node() to handle fatal signal · 169cc12f
      Peter Xu authored
      
      mainline inclusion
      from mainline-v5.7-rc1
      commit ba841078
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      lookup_node() uses gup to pin the page and get node information.  It
      checks against ret>=0 assuming the page will be filled in.  However it's
      also possible that gup will return zero, for example, when the thread is
      quickly killed with a fatal signal.  Teach lookup_node() to gracefully
      return an error -EFAULT if it happens.
      
      Meanwhile, initialize "page" to NULL to avoid potential risk of
      exploiting the pointer.
      
      Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
      Reported-by: <syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Conflicts:
      	mm/mempolicy.c
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      169cc12f
    • mm/gup: Let __get_user_pages_locked() return -EINTR for fatal signal · bcfd7200
      Hillf Danton authored
      
      mainline inclusion
      from mainline-v5.7-rc1
      commit ae46d2aa
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      __get_user_pages_locked() will return 0 instead of -EINTR after commit
      4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times"), which
      added extra code to allow gup to detect fatal signals faster.
      
      Restore the original -EINTR behavior.
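
      The restored behaviour amounts to translating the fatal-signal exit in
      the retry loop back into -EINTR; a sketch:

          /*
           * The task is dying: report -EINTR as before, instead of a
           * bare 0 that callers may misread as "nothing pinned, no error".
           */
          if (fatal_signal_pending(current)) {
                  ret = -EINTR;
                  break;
          }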
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
      Reported-by: <syzbot+3be1a33f04dc782e9fd5@syzkaller.appspotmail.com>
      Signed-off-by: Hillf Danton <hdanton@sina.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      bcfd7200
    • mm/gup: fix fixup_user_fault() on multiple retries · 1a397268
      Peter Xu authored
      
      mainline inclusion
      from mainline-v5.7-rc6
      commit 475f4dfc
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      This part was overlooked when reworking the gup code for multiple
      retries.

      When we get the 2nd+ retry, the TRIED flag is already set.  The current
      code will bail out on the 2nd retry because the !TRIED check fails, so
      the retry logic is skipped.  What's worse is that it will also return
      zero, which erroneously hints to the caller that the page was faulted in
      while it was not.

      The !TRIED flag check seems not to have been needed even before the
      multiple-retries change, because if we get a VM_FAULT_RETRY it must be
      the 1st retry, and we should not have TRIED set for that.

      Fix it by removing the !TRIED check; at the same time, check against
      fatal signals properly before the page fault so we can still respond to
      the user killing the process during retries.
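
      A sketch of the reworked retry path in fixup_user_fault(), following the
      description above (4.19-era locking names such as mmap_sem; not the
      literal diff):

          retry:
                  /* Respond to a fatal signal even on the 2nd and later retries. */
                  if (unlikely(fatal_signal_pending(current)))
                          return -EINTR;

                  ret = handle_mm_fault(vma, address, fault_flags);

                  if (ret & VM_FAULT_RETRY) {
                          down_read(&mm->mmap_sem);

                          /*
                           * No "!(fault_flags & FAULT_FLAG_TRIED)" guard any
                           * more: keep retrying instead of returning 0.
                           */
                          fault_flags |= FAULT_FLAG_TRIED;
                          goto retry;
                  }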
      
      Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Link: http://lkml.kernel.org/r/20200502003523.8204-1-peterx@redhat.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      1a397268
    • mm/gup: allow VM_FAULT_RETRY for multiple times · 85c63908
      Peter Xu authored
      
      mainline inclusion
      from mainline-5.6
      commit 4426e945
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      This is the gup counterpart of the change that allows VM_FAULT_RETRY to
      happen more than once.  One thing to mention is that we must check for a
      fatal signal here before retrying, because GUP can be interrupted by it;
      otherwise we could loop forever.
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      85c63908
    • mm: allow VM_FAULT_RETRY for multiple times · 9745f703
      Peter Xu authored
      mainline inclusion
      from mainline-5.6
      commit 4064b982
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      The idea comes from a discussion between Linus and Andrea [1].
      
      Before this patch we only allow a page fault to retry once.  We achieved
      this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
      handle_mm_fault() the second time.  This was majorly used to avoid
      unexpected starvation of the system by looping over forever to handle the
      page fault on a single page.  However that should hardly happen, and after
      all for each code path to return a VM_FAULT_RETRY we'll first wait for a
      condition (during which time we should possibly yield the cpu) to happen
      before VM_FAULT_RETRY is really returned.
      
      This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
      flag when we receive VM_FAULT_RETRY.  It means that the page fault handler
      now can retry the page fault for multiple times if necessary without the
      need to generate another page fault event.  Meanwhile we still keep the
      FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
      page fault is the first attempt or not.
      
      Then we'll have these combinations of fault flags (only considering
      ALLOW_RETRY flag and TRIED flag):
      
        - ALLOW_RETRY and !TRIED:  this means the page fault allows to
                                   retry, and this is the first try
      
        - ALLOW_RETRY and TRIED:   this means the page fault allows to
                                   retry, and this is not the first try
      
        - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
                                   to retry at all
      
        - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
      
      In the existing code we have multiple places that take special care of
      the first condition above by checking against (fault_flags &
      FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to detect
      the first retry of a page fault by checking both (fault_flags &
      FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED), because now
      even the 2nd try will have ALLOW_RETRY set, and then uses that helper in
      all existing special paths.  One example is in __lock_page_or_retry(): now
      we drop the mmap_sem only on the first attempt of a page fault and keep it
      in follow-up retries, so the old locking behavior is retained.
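
      The helper described above reduces to a one-line predicate; sketched
      here for reference:

          /*
           * "First attempt" now means: retry is allowed and we have not tried
           * yet. TRIED alone no longer implies that retrying must stop.
           */
          static inline bool fault_flag_allow_retry_first(unsigned int flags)
          {
                  return (flags & FAULT_FLAG_ALLOW_RETRY) &&
                         !(flags & FAULT_FLAG_TRIED);
          }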
      
      This will be a nice enhancement for current code [2] at the same time a
      supporting material for the future userfaultfd-writeprotect work, since in
      that work there will always be an explicit userfault writeprotect retry
      for protected pages, and if that cannot resolve the page fault (e.g., when
      userfaultfd-writeprotect is used in conjunction with swapped pages) then
      we'll possibly need a 3rd retry of the page fault.  It might also benefit
      other potential users who will have similar requirement like userfault
      write-protection.
      
      GUP code is not touched yet and will be covered in follow up patch.
      
      Please read the thread below for more information.
      
      [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
      [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
      
      
      
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
       Conflicts:
      	arch/arc/mm/fault.c
      	arch/arm64/mm/fault.c
      	arch/x86/mm/fault.c
      	drivers/gpu/drm/ttm/ttm_bo_vm.c
      	include/linux/mm.h
      	mm/internal.h
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      9745f703
    • sched/fair: fix kabi broken due to adding fields in rq and sched_domain_shared · 8a53999a
      Cheng Jian authored
      hulk inclusion
      category: bugfix
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      Previous patches added fields to struct rq and sched_domain_shared,
      which changed the KABI.

      We could use some helper structures to fix this KABI change, but this is
      not necessary: these structures are only used internally and drivers are
      not aware of them, so we simply avoid the helpers.
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
    • sched/fair: fix try_steal compile error · edf8c7d2
      Cheng Jian authored
      hulk inclusion
      category: bugfix
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      If we disable CONFIG_SMP, try_steal will lose its definition,
      resulting in a compile error as follows.
      
      	kernel/sched/fair.c: In function ‘pick_next_task_fair’:
      	kernel/sched/fair.c:7001:15: error: implicit declaration of function ‘try_steal’ [-Werror=implicit-function-declaration]
      		new_tasks = try_steal(rq, rf);
      			    ^~~~~~~~~
      
      We can use allnoconfig to reproduce this problem.
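
      The usual fix for this class of error is either to guard the caller or
      to provide a stub for the !CONFIG_SMP build; a sketch of the stub
      approach (signature assumed from the call site above):

          #ifndef CONFIG_SMP
          /* Stealing needs other CPUs to steal from; compile it away on UP. */
          static inline int try_steal(struct rq *rq, struct rq_flags *rf)
          {
                  return 0;
          }
          #endif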
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Bin Li <huawei.libin@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      edf8c7d2
    • config: enable CONFIG_SCHED_STEAL by default · 402b5928
      Cheng Jian authored
      hulk inclusion
      category: config
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      -------------------------------------------------
      
      Enable task stealing by default to improve CPU utilization.
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      402b5928
    • sched/fair: introduce SCHED_STEAL · fbd5102b
      Cheng Jian authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      Introduce CONFIG_SCHED_STEAL to limit the impact of task stealing.

      1) If CONFIG_SCHED_STEAL is turned off, none of the changes take effect,
      because we use some empty stub functions; this relies on compiler
      optimization.

      2) If CONFIG_SCHED_STEAL is enabled but STEAL and schedstats are
      disabled, the schedstat check introduces some overhead, but this has
      little effect on performance.  This will be our default choice.
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fbd5102b
    • disable stealing by default · da163c22
      Cheng Jian authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      Stealing tasks to improve CPU utilization can solve some performance
      problems, for example with mysql, but not all scenarios benefit; hackbench
      is one example.

      So turn it off by default.
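
      In terms of the scheduler feature table this is a one-line default flip;
      a sketch using the STEAL feature name that appears in the schedstat
      tables later in this series (substitute the name actually used by this
      tree if it differs):

          /* Steal a task from an overloaded CPU when this CPU goes idle. */
          SCHED_FEAT(STEAL, false)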
      
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      da163c22
    • sched/fair: Provide idle search schedstats · 1d2cc076
      Steve Sistare authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      Add schedstats to measure the effectiveness of searching for idle CPUs
      and stealing tasks.  This is a temporary patch intended for use during
      development only.  SCHEDSTAT_VERSION is bumped to 16, and the following
      fields are added to the per-CPU statistics of /proc/schedstat:
      
      field 10: # of times select_idle_sibling "easily" found an idle CPU --
                prev or target is idle.
      field 11: # of times select_idle_sibling searched and found an idle cpu.
      field 12: # of times select_idle_sibling searched and found an idle core.
      field 13: # of times select_idle_sibling failed to find anything idle.
      field 14: time in nanoseconds spent in functions that search for idle
                CPUs and search for tasks to steal.
      field 15: # of times an idle CPU steals a task from another CPU.
      field 16: # of times try_steal finds overloaded CPUs but no task is
                 migratable.
      
      Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1d2cc076
    • sched/fair: disable stealing if too many NUMA nodes · 49353d1e
      Steve Sistare authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      The STEAL feature causes regressions on hackbench on larger NUMA systems,
      so disable it on systems with more than sched_steal_node_limit nodes
      (default 2).  Note that the feature remains enabled as seen in features.h
      and /sys/kernel/debug/sched_features, but stealing is only performed if
      nodes <= sched_steal_node_limit.  This arrangement allows users to activate
      stealing on reboot by setting the kernel parameter sched_steal_node_limit
      on kernels built without CONFIG_SCHED_DEBUG.  The parameter is temporary
      and will be deleted when the regression is fixed.
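
      A sketch of the gating this describes; the parameter name follows the
      commit message, the helper name is illustrative:

          static int sched_steal_node_limit = 2; /* default per this commit */

          static inline bool steal_enabled(void)
          {
                  int nodes = num_possible_nodes();

                  /* STEAL stays visible in sched_features but becomes a no-op. */
                  return sched_feat(STEAL) && nodes <= sched_steal_node_limit;
          }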
      
      Details of the regression follow.  With the STEAL feature set, hackbench
      is slower on many-node systems:
      
      X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs
      Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
      Average of 10 runs of: hackbench <groups> processes 50000
      
                --- base --    --- new ---
      groups    time %stdev    time %stdev  %speedup
           1   3.627   15.8   3.876    7.3      -6.5
           2   4.545   24.7   5.583   16.7     -18.6
           3   5.716   25.0   7.367   14.2     -22.5
           4   6.901   32.9   7.718   14.5     -10.6
           8   8.604   38.5   9.111   16.0      -5.6
          16   7.734    6.8  11.007    8.2     -29.8
      
      Total CPU time increases.  Profiling shows that CPU time increases
      uniformly across all functions, suggesting a systemic increase in cache
      or memory latency.  This may be due to NUMA migrations, as they cause
      loss of LLC cache footprint and remote memory latencies.
      
      The domains for this system and their flags are:
      
        domain0 (SMT) : 1 core
          SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
          SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY
          SD_WAKE_AFFINE
      
        domain1 (MC) : 1 socket
          SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
          SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
          SD_WAKE_AFFINE
      
        domain2 (NUMA) : 4 sockets
          SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
          SD_SERIALIZE SD_OVERLAP SD_NUMA
          SD_WAKE_AFFINE
      
        domain3 (NUMA) : 8 sockets
          SD_LOAD_BALANCE SD_BALANCE_NEWIDLE
          SD_SERIALIZE SD_OVERLAP SD_NUMA
      
      Schedstats point to the root cause of the regression.  hackbench is run
      10 times per group and the average schedstat accumulation per-run and
      per-cpu is shown below.  Note that domain3 moves are zero because
      SD_WAKE_AFFINE is not set there.
      
      NO_STEAL
                                               --- domain2 ---   --- domain3 ---
      grp time %busy sched  idle   wake steal remote  move pull remote  move pull
       1 20.3 10.3  28710  14346  14366     0    490  3378    0   4039     0    0
       2 26.4 18.8  56721  28258  28469     0    792  7026   12   9229     0    7
       3 29.9 28.3  90191  44933  45272     0   5380  7204   19  16481     0    3
       4 30.2 35.8 121324  60409  60933     0   7012  9372   27  21438     0    5
       8 27.7 64.2 229174 111917 117272     0  11991  1837  168  44006     0   32
      16 32.6 74.0 334615 146784 188043     0   3404  1468   49  61405     0    8
      
      STEAL
                                               --- domain2 ---   --- domain3 ---
      grp time %busy sched  idle   wake steal remote  move pull remote  move pull
       1 20.6 10.2  28490  14232  14261    18      3  3525    0   4254     0    0
       2 27.9 18.8  56757  28203  28562   303   1675  7839    5   9690     0    2
       3 35.3 27.7  87337  43274  44085   698    741 12785   14  15689     0    3
       4 36.8 36.0 118630  58437  60216  1579   2973 14101   28  18732     0    7
       8 48.1 73.8 289374 133681 155600 18646  35340 10179  171  65889     0   34
      16 41.4 82.5 268925  91908 177172 47498  17206  6940  176  71776     0   20
      
      Cross-numa-node migrations are caused by load balancing pulls and
      wake_affine moves.  Pulls are small and similar for no_steal and steal.
      However, moves are significantly higher for steal, and rows above with the
      highest moves have the worst regressions for time; see for example grp=8.
      
      Moves increase for steal due to the following logic in wake_affine_idle()
      for synchronous wakeup:
      
          if (sync && cpu_rq(this_cpu)->nr_running == 1)
              return this_cpu;        // move the task
      
      The steal feature does a better job of smoothing the load between idle
      and busy CPUs, so nr_running is 1 more often, and moves are performed
      more often.  For hackbench, cross-node affine moves early in the run are
      good because they colocate wakers and wakees from the same group on the
      same node, but continued moves later in the run are bad, because the wakee
      is moved away from peers on its previous node.  Note that even no_steal
      is far from optimal; binding an instance of "hackbench 2" to each of the
      8 NUMA nodes runs much faster than running "hackbench 16" with no binding.
      
      Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node
      migrations and erases the performance difference between no_steal and
      steal.  However, overall performance is lower than with WA_IDLE behavior
      left intact, because some migrations are helpful, as explained above.
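
      A minimal sketch of what that experiment could look like in the
      topology setup code is shown below (illustrative only; the helper
      name and its placement are assumptions, not part of this patch):

          /*
           * Illustrative sketch only: strip wake-affine from NUMA domains
           * so synchronous wakeups cannot pull tasks across nodes.  The
           * helper name and placement are assumptions.
           */
          static void sd_clear_numa_wake_affine(struct sched_domain *sd)
          {
              if (sd->flags & SD_NUMA)
                  sd->flags &= ~SD_WAKE_AFFINE;
          }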
      
      I have tried many heuristics in an attempt to optimize the number of
      cross-node moves in all conditions, with limited success.  The fundamental
      problem is that the scheduler does not track which groups of tasks talk to
      each other.  Parts of several groups become entrenched on the same node,
      filling it to capacity, leaving no room for either group to pull its peers
      over, and there is neither data nor mechanism for the scheduler to evict
      one group to make room for the other.
      
      For now, disable STEAL on such systems until we can do better, or it is
      shown that hackbench is atypical and most workloads benefit from stealing.
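
      One plausible shape for that gate, evaluated once when the topology
      is known, is sketched below (the threshold and names are illustrative
      assumptions, not necessarily what this series implements):

          /*
           * Illustrative sketch of gating stealing by NUMA node count;
           * the threshold and names are assumptions.
           */
          #define STEAL_NODE_LIMIT    2   /* allow stealing on 1-2 node systems */

          static bool steal_allowed __read_mostly;

          static void update_steal_allowed(void)
          {
              steal_allowed = num_possible_nodes() <= STEAL_NODE_LIMIT;
          }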
      
      Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      49353d1e
    • Steve Sistare's avatar
      sched/fair: Steal work from an overloaded CPU when CPU goes idle · bccfa644
      Steve Sistare authored
      hulk inclusion
      category: feature
      bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23
      
      
      CVE: NA
      
      ---------------------------
      
      When a CPU has no more CFS tasks to run, and idle_balance() fails to find a
      task, then attempt to steal a task from an overloaded CPU in the same LLC,
      using the cfs_overload_cpus bitmap to efficiently identify candidates.  To
      minimize search time, steal the first migratable task that is found when
      the bitmap is traversed.  For fairness, search for migratable tasks on an
      overloaded CPU in order of next to run.
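
      A rough sketch of that flow is shown below.  The cfs_overload_cpus
      bitmap is from this series (treated here as a plain cpumask for
      simplicity); the helper functions are assumptions used only to show
      the shape of the search:

          /*
           * Rough sketch of the steal path described above.  The helpers
           * pick_first_migratable() and migrate_task_to_rq() are
           * illustrative assumptions, not the actual function names.
           */
          static int try_steal(struct rq *dst_rq)
          {
              int cpu;

              /* Walk CPUs in this LLC that have 2 or more CFS tasks. */
              for_each_cpu(cpu, dst_rq->cfs_overload_cpus) {
                  struct rq *src_rq = cpu_rq(cpu);
                  struct task_struct *p;

                  /*
                   * Take the first migratable task, in next-to-run order,
                   * so the search stays cheap and reasonably fair.
                   */
                  p = pick_first_migratable(src_rq, dst_rq->cpu);
                  if (p) {
                      migrate_task_to_rq(p, dst_rq);  /* detach + attach */
                      return 1;
                  }
              }
              return 0;
          }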
      
      This simple stealing yields a higher CPU utilization than idle_balance()
      alone, because the search is cheap, so it may be called every time the CPU
      is about to go idle.  idle_balance() does more work because it searches
      widely for the busiest queue, so to limit its CPU consumption, it declines
      to search if the system is too busy.  Simple stealing does not offload the
      globally busiest queue, but it is much better than running nothing at all.
      
      Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
      by default.
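
      The switch follows the usual sched feature pattern, roughly as below
      (macro name and placement are illustrative; sched features are toggled
      at runtime through /sys/kernel/debug/sched_features):

          /* kernel/sched/features.h: declare the feature, on by default. */
          SCHED_FEAT(STEAL, true)

          /* In the go-idle path: bail out early when the feature is off. */
          if (!sched_feat(STEAL))
              return 0;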
      
      Stealing improves utilization with only a modest CPU overhead in scheduler
      code.  In the following experiment, hackbench is run with varying numbers
      of groups (40 tasks per group), and the delta in /proc/schedstat is shown
      for each run, averaged per CPU, augmented with these non-standard stats:
      
        %find - percent of time spent in the old and new functions that search
          for idle CPUs and tasks to steal, and that set the overloaded-CPUs
          bitmap.
      
        steal - number of times a task is stolen from another CPU.
      
      X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
      Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
      hackbench <grps> process 100000
      sched_wakeup_granularity_ns=15000000
      
        baseline
        grps  time  %busy  slice   sched   idle     wake %find  steal
        1    8.084  75.02   0.10  105476  46291    59183  0.31      0
        2   13.892  85.33   0.10  190225  70958   119264  0.45      0
        3   19.668  89.04   0.10  263896  87047   176850  0.49      0
        4   25.279  91.28   0.10  322171  94691   227474  0.51      0
        8   47.832  94.86   0.09  630636 144141   486322  0.56      0
      
        new
        grps  time  %busy  slice   sched   idle     wake %find  steal  %speedup
        1    5.938  96.80   0.24   31255   7190    24061  0.63   7433  36.1
        2   11.491  99.23   0.16   74097   4578    69512  0.84  19463  20.9
        3   16.987  99.66   0.15  115824   1985   113826  0.77  24707  15.8
        4   22.504  99.80   0.14  167188   2385   164786  0.75  29353  12.3
        8   44.441  99.86   0.11  389153   1616   387401  0.67  38190   7.6
      
      Elapsed time improves by 8 to 36%, and CPU busy utilization rises by
      5 to 22 percentage points, hitting 99% for 2 or more groups (80 or
      more tasks).  The cost is at most 0.4% more find time.
      
      Additional performance results follow.  A negative "speedup" is a
      regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
      is set to 15 msec.  Otherwise, preemptions increase at higher loads and
      distort the comparison between baseline and new.
      
      ------------------ 1 Socket Results ------------------
      
      X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
      Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
      Average of 10 runs of: hackbench <groups> process 100000
      
                  --- base --    --- new ---
        groups    time %stdev    time %stdev  %speedup
             1   8.008    0.1   5.905    0.2      35.6
             2  13.814    0.2  11.438    0.1      20.7
             3  19.488    0.2  16.919    0.1      15.1
             4  25.059    0.1  22.409    0.1      11.8
             8  47.478    0.1  44.221    0.1       7.3
      
      X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
      Average of 10 runs of: hackbench <groups> process 100000
      
                  --- base --    --- new ---
        groups    time %stdev    time %stdev  %speedup
             1   4.586    0.8   4.596    0.6      -0.3
             2   7.693    0.2   5.775    1.3      33.2
             3  10.442    0.3   8.288    0.3      25.9
             4  13.087    0.2  11.057    0.1      18.3
             8  24.145    0.2  22.076    0.3       9.3
            16  43.779    0.1  41.741    0.2       4.8
      
      KVM 4-cpu
      Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
      tbench, average of 11 runs
      
        clients    %speedup
              1        16.2
              2        11.7
              4         9.9
              8        12.8
             16        13.7
      
      KVM 2-cpu
      Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
      
        Benchmark                     %speedup
        specjbb2015_critical_jops          5.7
        mysql_sysb1.0.14_mutex_2          40.6
        mysql_sysb1.0.14_oltp_2            3.9
      
      ------------------ 2 Socket Results ------------------
      
      X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
      Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
      Average of 10 runs of: hackbench <groups> process 100000
      
                  --- base --    --- new ---
        groups    time %stdev    time %stdev  %speedup
             1   7.945    0.2   7.219    8.7      10.0
             2   8.444    0.4   6.689    1.5      26.2
             3  12.100    1.1   9.962    2.0      21.4
             4  15.001    0.4  13.109    1.1      14.4
             8  27.960    0.2  26.127    0.3       7.0
      
      X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
      Average of 10 runs of: hackbench <groups> process 100000
      
                  --- base --    --- new ---
        groups    time %stdev    time %stdev  %speedup
             1   5.826    5.4   5.840    5.0      -0.3
             2   5.041    5.3   6.171   23.4     -18.4
             3   6.839    2.1   6.324    3.8       8.1
             4   8.177    0.6   7.318    3.6      11.7
             8  14.429    0.7  13.966    1.3       3.3
            16  26.401    0.3  25.149    1.5       4.9
      
      X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
      Oracle database OLTP, logging disabled, NVRAM storage
      
        Customers   Users   %speedup
          1200000      40       -1.2
          2400000      80        2.7
          3600000     120        8.9
          4800000     160        4.4
          6000000     200        3.0
      
      X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
      Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
      Results from the Oracle "Performance PIT".
      
        Benchmark                                           %speedup
      
        mysql_sysb1.0.14_fileio_56_rndrd                        19.6
        mysql_sysb1.0.14_fileio_56_seqrd                        12.1
        mysql_sysb1.0.14_fileio_56_rndwr                         0.4
        mysql_sysb1.0.14_fileio_56_seqrewr                      -0.3
      
        pgsql_sysb1.0.14_fileio_56_rndrd                        19.5
        pgsql_sysb1.0.14_fileio_56_seqrd                         8.6
        pgsql_sysb1.0.14_fileio_56_rndwr                         1.0
        pgsql_sysb1.0.14_fileio_56_seqrewr                       0.5
      
        opatch_time_ASM_12.2.0.1.0_HP2M                          7.5
        select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M             5.1
        select-1_users_asmm_ASM_12.2.0.1.0_HP2M                  4.4
        swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M           5.8
      
        lm3_memlat_L2                                            4.8
        lm3_memlat_L1                                            0.0
      
        ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching     60.1
        ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent        5.2
        ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent       -3.0
        ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4
      
      X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
      Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
      
        NAS_OMP
        bench class   ncpu    %improved(Mops)
        dc    B       72      1.3
        is    C       72      0.9
        is    D       72      0.7
      
        sysbench mysql, average of 24 runs
                --- base ---     --- new ---
        nthr   events  %stdev   events  %stdev %speedup
           1    331.0    0.25    331.0    0.24     -0.1
           2    661.3    0.22    661.8    0.22      0.0
           4   1297.0    0.88   1300.5    0.82      0.2
           8   2420.8    0.04   2420.5    0.04     -0.1
          16   4826.3    0.07   4825.4    0.05     -0.1
          32   8815.3    0.27   8830.2    0.18      0.1
          64  12823.0    0.24  12823.6    0.26      0.0
      
      -------------------------------------------------------------
      
      Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bccfa644