- Apr 14, 2021
-
-
Cheng Jian authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- Introduce CONFIG_SCHED_STEAL to limit the impact of task stealing. 1) If CONFIG_SCHED_STEAL is turned off, none of the changes take effect, because empty stub functions are used and the compiler optimizes them away. 2) If CONFIG_SCHED_STEAL is enabled but STEAL and schedstats are disabled, the schedstat check introduces a small overhead, but it has little effect on performance. This will be our default choice. Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Cheng Jian authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- Stealing tasks to improve CPU utilization solves some performance problems, for example with mysql, but not all scenarios benefit, hackbench being one example. So turn it off by default. Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- Add schedstats to measure the effectiveness of searching for idle CPUs and stealing tasks. This is a temporary patch intended for use during development only. SCHEDSTAT_VERSION is bumped to 16, and the following fields are added to the per-CPU statistics of /proc/schedstat:

field 10: # of times select_idle_sibling "easily" found an idle CPU -- prev or target is idle.
field 11: # of times select_idle_sibling searched and found an idle cpu.
field 12: # of times select_idle_sibling searched and found an idle core.
field 13: # of times select_idle_sibling failed to find anything idle.
field 14: time in nanoseconds spent in functions that search for idle CPUs and search for tasks to steal.
field 15: # of times an idle CPU steals a task from another CPU.
field 16: # of times try_steal finds overloaded CPUs but no task is migratable.

Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- The STEAL feature causes regressions on hackbench on larger NUMA systems, so disable it on systems with more than sched_steal_node_limit nodes (default 2). Note that the feature remains enabled as seen in features.h and /sys/kernel/debug/sched_features, but stealing is only performed if nodes <= sched_steal_node_limit. This arrangement allows users to activate stealing on reboot by setting the kernel parameter sched_steal_node_limit on kernels built without CONFIG_SCHED_DEBUG. The parameter is temporary and will be deleted when the regression is fixed. Details of the regression follow. With the STEAL feature set, hackbench is slower on many-node systems:

X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
Average of 10 runs of: hackbench <groups> processes 50000

            --- base --    --- new ---
groups      time %stdev    time %stdev  %speedup
     1     3.627   15.8   3.876    7.3      -6.5
     2     4.545   24.7   5.583   16.7     -18.6
     3     5.716   25.0   7.367   14.2     -22.5
     4     6.901   32.9   7.718   14.5     -10.6
     8     8.604   38.5   9.111   16.0      -5.6
    16     7.734    6.8  11.007    8.2     -29.8

Total CPU time increases. Profiling shows that CPU time increases uniformly across all functions, suggesting a systemic increase in cache or memory latency. This may be due to NUMA migrations, as they cause loss of LLC cache footprint and remote memory latencies. The domains for this system and their flags are:

domain0 (SMT) : 1 core
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
  SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY
  SD_WAKE_AFFINE

domain1 (MC) : 1 socket
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
  SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_WAKE_AFFINE

domain2 (NUMA) : 4 sockets
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
  SD_SERIALIZE SD_OVERLAP SD_NUMA SD_WAKE_AFFINE

domain3 (NUMA) : 8 sockets
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA

Schedstats point to the root cause of the regression. hackbench is run 10 times per group and the average schedstat accumulation per-run and per-cpu is shown below. Note that domain3 moves are zero because SD_WAKE_AFFINE is not set there.

NO_STEAL
                                         --- domain2 ---           --- domain3 ---
grp  time %busy   sched    idle    wake  steal remote  move pull   remote move pull
  1  20.3  10.3   28710   14346   14366      0    490  3378    0     4039    0    0
  2  26.4  18.8   56721   28258   28469      0    792  7026   12     9229    0    7
  3  29.9  28.3   90191   44933   45272      0   5380  7204   19    16481    0    3
  4  30.2  35.8  121324   60409   60933      0   7012  9372   27    21438    0    5
  8  27.7  64.2  229174  111917  117272      0  11991  1837  168    44006    0   32
 16  32.6  74.0  334615  146784  188043      0   3404  1468   49    61405    0    8

STEAL
                                         --- domain2 ---           --- domain3 ---
grp  time %busy   sched    idle    wake  steal remote  move pull   remote move pull
  1  20.6  10.2   28490   14232   14261     18      3  3525    0     4254    0    0
  2  27.9  18.8   56757   28203   28562    303   1675  7839    5     9690    0    2
  3  35.3  27.7   87337   43274   44085    698    741 12785   14    15689    0    3
  4  36.8  36.0  118630   58437   60216   1579   2973 14101   28    18732    0    7
  8  48.1  73.8  289374  133681  155600  18646  35340 10179  171    65889    0   34
 16  41.4  82.5  268925   91908  177172  47498  17206  6940  176    71776    0   20

Cross-numa-node migrations are caused by load balancing pulls and wake_affine moves. Pulls are small and similar for no_steal and steal. However, moves are significantly higher for steal, and rows above with the highest moves have the worst regressions for time; see for example grp=8.
Moves increase for steal due to the following logic in wake_affine_idle() for synchronous wakeup:

    if (sync && cpu_rq(this_cpu)->nr_running == 1)
        return this_cpu;        // move the task

The steal feature does a better job of smoothing the load between idle and busy CPUs, so nr_running is 1 more often, and moves are performed more often. For hackbench, cross-node affine moves early in the run are good because they colocate wakers and wakees from the same group on the same node, but continued moves later in the run are bad, because the wakee is moved away from peers on its previous node. Note that even no_steal is far from optimal; binding an instance of "hackbench 2" to each of the 8 NUMA nodes runs much faster than running "hackbench 16" with no binding. Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node migrations and eliminates the difference between no_steal and steal performance. However, overall performance is lower than WA_IDLE because some migrations are helpful as explained above. I have tried many heuristics in an attempt to optimize the number of cross-node moves in all conditions, with limited success. The fundamental problem is that the scheduler does not track which groups of tasks talk to each other. Parts of several groups become entrenched on the same node, filling it to capacity, leaving no room for either group to pull its peers over, and there is neither data nor mechanism for the scheduler to evict one group to make room for the other. For now, disable STEAL on such systems until we can do better, or it is shown that hackbench is atypical and most workloads benefit from stealing. Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
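As an illustration of the gating described above, a minimal sketch follows; the helper name steal_enabled() and the use of num_possible_nodes() are assumptions for illustration, not the literal patch.

    /* Hypothetical sketch: stealing stays compiled in and the STEAL feature
     * bit stays set, but the fast path additionally checks the node limit. */
    static int sched_steal_node_limit = 2;      /* assumed default, kernel parameter */

    static bool steal_enabled(void)
    {
            int nodes = num_possible_nodes();   /* assumption: source of the node count */

            return sched_feat(STEAL) && nodes <= sched_steal_node_limit;
    }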
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- When a CPU has no more CFS tasks to run, and idle_balance() fails to find a task, then attempt to steal a task from an overloaded CPU in the same LLC, using the cfs_overload_cpus bitmap to efficiently identify candidates. To minimize search time, steal the first migratable task that is found when the bitmap is traversed. For fairness, search for migratable tasks on an overloaded CPU in order of next to run. This simple stealing yields a higher CPU utilization than idle_balance() alone, because the search is cheap, so it may be called every time the CPU is about to go idle. idle_balance() does more work because it searches widely for the busiest queue, so to limit its CPU consumption, it declines to search if the system is too busy. Simple stealing does not offload the globally busiest queue, but it is much better than running nothing at all. Stealing is controlled by the sched feature SCHED_STEAL, which is enabled by default. Stealing improves utilization with only a modest CPU overhead in scheduler code. In the following experiment, hackbench is run with varying numbers of groups (40 tasks per group), and the delta in /proc/schedstat is shown for each run, averaged per CPU, augmented with these non-standard stats:

%find - percent of time spent in old and new functions that search for idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
steal - number of times a task is stolen from another CPU.

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench <grps> process 100000
sched_wakeup_granularity_ns=15000000

baseline
grps    time  %busy  slice   sched    idle     wake  %find  steal
1      8.084  75.02   0.10  105476   46291    59183   0.31      0
2     13.892  85.33   0.10  190225   70958   119264   0.45      0
3     19.668  89.04   0.10  263896   87047   176850   0.49      0
4     25.279  91.28   0.10  322171   94691   227474   0.51      0
8     47.832  94.86   0.09  630636  144141   486322   0.56      0

new
grps    time  %busy  slice   sched    idle     wake  %find  steal  %speedup
1      5.938  96.80   0.24   31255    7190    24061   0.63   7433      36.1
2     11.491  99.23   0.16   74097    4578    69512   0.84  19463      20.9
3     16.987  99.66   0.15  115824    1985   113826   0.77  24707      15.8
4     22.504  99.80   0.14  167188    2385   164786   0.75  29353      12.3
8     44.441  99.86   0.11  389153    1616   387401   0.67  38190       7.6

Elapsed time improves by 8 to 36%, and CPU busy utilization is up by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks). The cost is at most 0.4% more find time. Additional performance results follow. A negative "speedup" is a regression. Note: for all hackbench runs, sched_wakeup_granularity_ns is set to 15 msec. Otherwise, preemptions increase at higher loads and distort the comparison between baseline and new.
------------------ 1 Socket Results ------------------

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

            --- base --    --- new ---
groups      time %stdev    time %stdev  %speedup
     1     8.008    0.1   5.905    0.2      35.6
     2    13.814    0.2  11.438    0.1      20.7
     3    19.488    0.2  16.919    0.1      15.1
     4    25.059    0.1  22.409    0.1      11.8
     8    47.478    0.1  44.221    0.1       7.3

X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

            --- base --    --- new ---
groups      time %stdev    time %stdev  %speedup
     1     4.586    0.8   4.596    0.6      -0.3
     2     7.693    0.2   5.775    1.3      33.2
     3    10.442    0.3   8.288    0.3      25.9
     4    13.087    0.2  11.057    0.1      18.3
     8    24.145    0.2  22.076    0.3       9.3
    16    43.779    0.1  41.741    0.2       4.8

KVM 4-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
tbench, average of 11 runs

clients    %speedup
      1        16.2
      2        11.7
      4         9.9
      8        12.8
     16        13.7

KVM 2-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

Benchmark                     %speedup
specjbb2015_critical_jops          5.7
mysql_sysb1.0.14_mutex_2          40.6
mysql_sysb1.0.14_oltp_2            3.9

------------------ 2 Socket Results ------------------

X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

            --- base --    --- new ---
groups      time %stdev    time %stdev  %speedup
     1     7.945    0.2   7.219    8.7      10.0
     2     8.444    0.4   6.689    1.5      26.2
     3    12.100    1.1   9.962    2.0      21.4
     4    15.001    0.4  13.109    1.1      14.4
     8    27.960    0.2  26.127    0.3       7.0

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

            --- base --    --- new ---
groups      time %stdev    time %stdev  %speedup
     1     5.826    5.4   5.840    5.0      -0.3
     2     5.041    5.3   6.171   23.4     -18.4
     3     6.839    2.1   6.324    3.8       8.1
     4     8.177    0.6   7.318    3.6      11.7
     8    14.429    0.7  13.966    1.3       3.3
    16    26.401    0.3  25.149    1.5       4.9

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Oracle database OLTP, logging disabled, NVRAM storage

Customers   Users   %speedup
  1200000      40       -1.2
  2400000      80        2.7
  3600000     120        8.9
  4800000     160        4.4
  6000000     200        3.0

X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Results from the Oracle "Performance PIT".
Benchmark                                                     %speedup

mysql_sysb1.0.14_fileio_56_rndrd                                  19.6
mysql_sysb1.0.14_fileio_56_seqrd                                  12.1
mysql_sysb1.0.14_fileio_56_rndwr                                   0.4
mysql_sysb1.0.14_fileio_56_seqrewr                                -0.3

pgsql_sysb1.0.14_fileio_56_rndrd                                  19.5
pgsql_sysb1.0.14_fileio_56_seqrd                                   8.6
pgsql_sysb1.0.14_fileio_56_rndwr                                   1.0
pgsql_sysb1.0.14_fileio_56_seqrewr                                 0.5

opatch_time_ASM_12.2.0.1.0_HP2M                                    7.5
select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M                       5.1
select-1_users_asmm_ASM_12.2.0.1.0_HP2M                            4.4
swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M                     5.8

lm3_memlat_L2                                                      4.8
lm3_memlat_L1                                                      0.0

ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching               60.1
ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent                  5.2
ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent                 -3.0
ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks       2.4

X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

NAS_OMP
bench  class  ncpu  %improved(Mops)
dc     B      72    1.3
is     C      72    0.9
is     D      72    0.7

sysbench mysql, average of 24 runs
         --- base ---      --- new ---
nthr     events  %stdev    events  %stdev    %speedup
   1      331.0    0.25     331.0    0.24        -0.1
   2      661.3    0.22     661.8    0.22         0.0
   4     1297.0    0.88    1300.5    0.82         0.2
   8     2420.8    0.04    2420.5    0.04        -0.1
  16     4826.3    0.07    4825.4    0.05        -0.1
  32     8815.3    0.27    8830.2    0.18         0.1
  64    12823.0    0.24   12823.6    0.26         0.0

-------------------------------------------------------------
Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
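A minimal sketch of the steal path described above; the helper names (steal_from(), the sparsemask iterator) are assumptions for illustration, not the literal patch.

    /*
     * Illustrative sketch: when a CPU is about to go idle and idle_balance()
     * found nothing, walk the per-LLC overload bitmap and pull the first
     * migratable task.
     */
    static int try_steal(struct rq *dst_rq, struct rq_flags *rf)
    {
            struct sparsemask *overload_cpus = dst_rq->cfs_overload_cpus;
            int dst_cpu = cpu_of(dst_rq);
            int src_cpu;

            if (!sched_feat(STEAL) || !overload_cpus)
                    return 0;

            for_each_sparse_cpu(src_cpu, overload_cpus) {   /* assumed iterator */
                    if (src_cpu == dst_cpu)
                            continue;
                    if (steal_from(dst_rq, rf, src_cpu))    /* detach + attach one task */
                            return 1;                       /* dst_rq now has work */
            }
            return 0;
    }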
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- Define a simpler version of can_migrate_task called can_migrate_task_llc which does not require a struct lb_env argument, and judges whether a migration from one CPU to another within the same LLC should be allowed. Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
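A hedged sketch of what such a simplified check can look like on a 4.19-era kernel (field and helper names from that era; illustrative, not the exact patch):

    /*
     * Simplified sketch: a task may be stolen within the LLC only if the
     * destination CPU is allowed by its affinity mask and the task is not
     * currently running on the source CPU.
     */
    static bool can_migrate_task_llc(struct task_struct *p, struct rq *rq,
                                     struct rq *dst_rq)
    {
            int dst_cpu = cpu_of(dst_rq);

            if (!cpumask_test_cpu(dst_cpu, &p->cpus_allowed))
                    return false;           /* affinity forbids the move */

            if (task_running(rq, p))
                    return false;           /* cannot detach the current task */

            return true;
    }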
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- The detach_task function takes a struct lb_env argument, but only needs a few of its members. Pass the rq and cpu arguments explicitly so the function may be called from code that is not based on lb_env. No functional change. Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
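The shape of the refactor, as a sketch (the body mirrors what detach_task() generally does, but treat the details as assumptions):

    /* After the refactor, callers pass the source rq and destination CPU
     * directly instead of a struct lb_env. */
    static void detach_task(struct task_struct *p, struct rq *src_rq, int dst_cpu)
    {
            lockdep_assert_held(&src_rq->lock);

            p->on_rq = TASK_ON_RQ_MIGRATING;
            deactivate_task(src_rq, p, DEQUEUE_NOCLOCK);
            set_task_cpu(p, dst_cpu);
    }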
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- Move the update of idle_stamp from idle_balance to the call site in pick_next_task_fair, to prepare for a future patch that adds work to pick_next_task_fair which must be included in the idle_stamp interval. No functional change. Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- An overloaded CPU has more than 1 runnable task. When a CFS task wakes on a CPU, if h_nr_running transitions from 1 to more, then set the CPU in the cfs_overload_cpus bitmap. When a CFS task sleeps, if h_nr_running transitions from 2 to less, then clear the CPU in cfs_overload_cpus. Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
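A sketch of the bookkeeping on the enqueue/dequeue paths; overload_set()/overload_clear() are assumed helper names that set or clear this CPU's bit in cfs_overload_cpus.

    /* Illustrative only: mark the CPU overloaded when its CFS runnable count
     * rises above 1, and clear the mark when it drops back to 1 or below. */
    static void update_overload(struct rq *rq, unsigned int prev_h_nr_running)
    {
            unsigned int now = rq->cfs.h_nr_running;

            if (now > 1 && prev_h_nr_running <= 1)
                    overload_set(rq);       /* assumed: set bit in cfs_overload_cpus */
            else if (now <= 1 && prev_h_nr_running > 1)
                    overload_clear(rq);     /* assumed: clear bit */
    }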
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- Define and initialize a sparse bitmap of overloaded CPUs, per last-level-cache scheduling domain, for use by the CFS scheduling class. Save a pointer to cfs_overload_cpus in the rq for efficient access. Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- Add functions sd_llc_alloc_all() and sd_llc_free_all() to allocate and free data pointed to by struct sched_domain_shared at the last-level-cache domain. sd_llc_alloc_all() is called after the SD hierarchy is known, to eliminate the unnecessary allocations that would occur if we instead allocated in __sdt_alloc() and then figured out which shared nodes are redundant. Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Steve Sistare authored
hulk inclusion category: feature bugzilla: 38261, https://bugzilla.openeuler.org/show_bug.cgi?id=23 CVE: NA --------------------------- Provide struct sparsemask and functions to manipulate it. A sparsemask is a sparse bitmap. It reduces cache contention vs the usual bitmap when many threads concurrently set, clear, and visit elements, by reducing the number of significant bits per cacheline. For each cacheline chunk of the mask, only the first K bits of the first word are used, and the remaining bits are ignored, where K is a creation time parameter. Thus a sparsemask that can represent a set of N elements is approximately (N/K * CACHELINE) bytes in size. This type is simpler and more efficient than the struct sbitmap used by block drivers. Signed-off-by:
Steve Sistare <steven.sistare@oracle.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
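To make the size and indexing arithmetic concrete, here is a small self-contained C sketch of the layout math only (not the kernel implementation): with a 64-byte cacheline and density K, element i lives in cacheline chunk i / K at bit i % K, so a mask for N elements needs roughly N / K cachelines.

    #include <stdio.h>

    #define CACHELINE_BYTES 64UL

    /* bytes needed for a sparsemask of nelems elements at the given density */
    static unsigned long sparsemask_bytes(unsigned long nelems, unsigned long density)
    {
            unsigned long chunks = (nelems + density - 1) / density;

            return chunks * CACHELINE_BYTES;
    }

    int main(void)
    {
            unsigned long nelems = 288, density = 8, i = 100;

            printf("mask for %lu elements at density %lu: %lu bytes\n",
                   nelems, density, sparsemask_bytes(nelems, density));
            printf("element %lu -> chunk %lu, bit %lu\n",
                   i, i / density, i % density);
            return 0;
    }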
-
Cheng Jian authored
hulk inclusion category: bugfix bugzilla: 38260, https://bugzilla.openeuler.org/show_bug.cgi?id=22 CVE: NA --------------------------- Commit 92010bb6d6b8 ("sched/fair: Start tracking SCHED_IDLE tasks count in cfs_rq") broke KABI; fix it. Fixes: 92010bb6d6b8 ("sched/fair: Start tracking SCHED_IDLE tasks count in cfs_rq") Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com>
-
Viresh Kumar authored
mainline inclusion from mainline-v5.6-rc1 commit 17346452 category: feature bugzilla: 38260, https://bugzilla.openeuler.org/show_bug.cgi?id=22 CVE: NA --------------------------- There are instances where we keep searching for an idle CPU despite already having a sched-idle CPU (in find_idlest_group_cpu(), select_idle_smt() and select_idle_cpu()), and then there are places where we don't necessarily do that and return a sched-idle CPU as soon as we find one (in select_idle_sibling()). This looks a bit inconsistent and it may be worth having the same policy everywhere. On the other hand, choosing a sched-idle CPU over an idle one should be beneficial from a performance and power point of view as well, as we don't need to get the CPU online from a deep idle state, which wastes quite a lot of time and energy and delays the scheduling of the newly woken up task. This patch tries to simplify code around sched-idle CPU selection and make it consistent throughout. Testing is done with the help of rt-app on hikey board (ARM64 octa-core, 2 clusters, 0-3 and 4-7). The cpufreq governor was set to performance to avoid any side effects from CPU frequency. Following are the tests performed:

Test 1: 1-cfs-task: A single SCHED_NORMAL task is pinned to CPU5 which runs for 2333 us out of 7777 us (so gives time for the cluster to go in deep idle state).

Test 2: 1-cfs-1-idle-task: A single SCHED_NORMAL task is pinned on CPU5 and a single SCHED_IDLE task is pinned on CPU6 (to make sure cluster 1 doesn't go in deep idle state).

Test 3: 1-cfs-8-idle-task: A single SCHED_NORMAL task is pinned on CPU5 and eight SCHED_IDLE tasks are created which run forever (not pinned anywhere, so they run on all CPUs). Checked with kernelshark that as soon as the NORMAL task sleeps, a SCHED_IDLE task starts running on CPU5.

And here are the results on mean latency (in us), using the "st" tool.

$ st 1-cfs-task/rt-app-cfs_thread-0.log
    N     min   max   sum     mean     stddev
    642   90    592   197180  307.134  109.906

$ st 1-cfs-1-idle-task/rt-app-cfs_thread-0.log
    N     min   max   sum     mean     stddev
    642   67    311   113850  177.336  41.4251

$ st 1-cfs-8-idle-task/rt-app-cfs_thread-0.log
    N     min   max   sum     mean     stddev
    643   29    173   41364   64.3297  13.2344

The mean latency when we need to:
- wakeup from deep idle state is 307 us.
- wakeup from shallow idle state is 177 us.
- preempt a SCHED_IDLE task is 64 us.

Signed-off-by:
Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/b90cbcce608cef4e02a7bbfe178335f76d201bab.1573728344.git.viresh.kumar@linaro.org Signed-off-by:
Ingo Molnar <mingo@kernel.org> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
-
Cheng Jian authored
mainline inclusion from mainline-v5.6-rc1 commit 60588bfa category: bugfix bugzilla: 38260, https://bugzilla.openeuler.org/show_bug.cgi?id=22 CVE: NA --------------------------- select_idle_cpu() will scan the LLC domain for idle CPUs; it is always expensive. So commit 1ad3aaf3 ("sched/core: Implement new approach to scale select_idle_cpu()") introduced a way to limit how many CPUs we scan. But it may consume some of the 'nr' attempts on CPUs that are not allowed for the task and thus waste our attempts. The function then returns nr_cpumask_bits, and we can't find a CPU our task is allowed to run on. The cpumask may be too big, so, similar to select_idle_core(), use the per-CPU 'select_idle_mask' to prevent stack overflow. Fixes: 1ad3aaf3 ("sched/core: Implement new approach to scale select_idle_cpu()") Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by:
Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
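A simplified sketch of the scanning loop after the fix, using the 4.19-era cpus_allowed field; treat it as illustrative rather than the exact diff.

    /* Restrict the scan to CPUs the task may run on before spending the 'nr'
     * probe budget; the per-CPU scratch mask avoids a large cpumask on the stack. */
    static int scan_allowed_for_idle(struct task_struct *p, struct sched_domain *sd,
                                     int target, int nr)
    {
            struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
            int cpu;

            cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);

            for_each_cpu_wrap(cpu, cpus, target) {
                    if (!--nr)
                            return -1;              /* scan budget exhausted */
                    if (available_idle_cpu(cpu))
                            return cpu;
            }
            return -1;
    }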
-
Viresh Kumar authored
mainline inclusion from mainline-v5.4-rc1 commit 3c29e651 category: feature bugzilla: 38260, https://bugzilla.openeuler.org/show_bug.cgi?id=22 CVE: NA --------------------------- We try to find an idle CPU to run the next task, but in case we don't find an idle CPU it is better to pick a CPU which will run the task the soonest, for performance reasons. A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criterion, as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one should give better results, as it can run the task sooner than an idle CPU (which needs to be woken up from an idle state). This patch updates both fast and slow paths with this optimization. Signed-off-by:
Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: chris.redpath@arm.com Cc: quentin.perret@linaro.org Cc: songliubraving@fb.com Cc: steven.sistare@oracle.com Cc: subhra.mazumdar@oracle.com Cc: tkjos@google.com Link: https://lkml.kernel.org/r/eeafa25fdeb6f6edd5b2da716bc8f0ba7708cbcf.1561523542.git.viresh.kumar@linaro.org Signed-off-by:
Ingo Molnar <mingo@kernel.org> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
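The core test boils down to a helper along these lines (close to the mainline helper, but treat it as a sketch): a CPU counts as sched-idle when every runnable task on it is SCHED_IDLE.

    static int sched_idle_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            /* everything queued is SCHED_IDLE, and there is something queued */
            return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
                            rq->nr_running);
    }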
-
Viresh Kumar authored
mainline inclusion from mainline-v5.4-rc1 commit 43e9f7f2 category: feature bugzilla: 38260, https://bugzilla.openeuler.org/show_bug.cgi?id=22 CVE: NA --------------------------- Track how many tasks are present with SCHED_IDLE policy in each cfs_rq. This will be used by later commits. Signed-off-by:
Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: chris.redpath@arm.com Cc: quentin.perret@linaro.org Cc: songliubraving@fb.com Cc: steven.sistare@oracle.com Cc: subhra.mazumdar@oracle.com Cc: tkjos@google.com Link: https://lkml.kernel.org/r/0d3cdc427fc68808ad5bccc40e86ed0bf9da8bb4.1561523542.git.viresh.kumar@linaro.org Signed-off-by:
Ingo Molnar <mingo@kernel.org> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
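A minimal sketch of the counting, with helper names of my own; the real patch threads this through enqueue_task_fair()/dequeue_task_fair() and the throttling paths.

    /* Illustrative only: keep idle_h_nr_running in step with h_nr_running. */
    static void account_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
    {
            cfs_rq->h_nr_running++;
            cfs_rq->idle_h_nr_running += task_has_idle_policy(p);
    }

    static void account_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
    {
            cfs_rq->h_nr_running--;
            cfs_rq->idle_h_nr_running -= task_has_idle_policy(p);
    }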
-
Viresh Kumar authored
mainline inclusion from mainline-v5.0-rc1 commit 1da1843f category: feature bugzilla: 38260, https://bugzilla.openeuler.org/show_bug.cgi?id=22 CVE: NA --------------------------- We already have task_has_rt_policy() and task_has_dl_policy() helpers, create task_has_idle_policy() as well and update sched core to start using it. While at it, use task_has_dl_policy() at one more place. Signed-off-by:
Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org Signed-off-by:
Ingo Molnar <mingo@kernel.org> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com> Reviewed-by:
Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com>
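The helper is essentially this (shown as a sketch for context):

    static inline bool task_has_idle_policy(struct task_struct *p)
    {
            return idle_policy(p->policy);  /* i.e. p->policy == SCHED_IDLE */
    }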
-
Jan Kara authored
stable inclusion from linux-4.19.184 commit 30e7160bb4a94d9aa8bcc6274cd7ad2ac132c0fd -------------------------------- commit 163f0ec1df33cf468509ff38cbcbb5eb0d7fac60 upstream. Syzbot is reporting that ext4 can enter fs reclaim from kvmalloc() while the transaction is started like:

fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340
might_alloc include/linux/sched/mm.h:193 [inline]
slab_pre_alloc_hook mm/slab.h:493 [inline]
slab_alloc_node mm/slub.c:2817 [inline]
__kmalloc_node+0x5f/0x430 mm/slub.c:4015
kmalloc_node include/linux/slab.h:575 [inline]
kvmalloc_node+0x61/0xf0 mm/util.c:587
kvmalloc include/linux/mm.h:781 [inline]
ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline]
ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline]
ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649
ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224
ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380
ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493

This should be impossible since transaction start sets PF_MEMALLOC_NOFS. Add some assertions to the code to catch if something isn't working as expected early. Link: https://lore.kernel.org/linux-ext4/000000000000563a0205bafb7970@google.com/ Signed-off-by:
Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210222171626.21884-1-jack@suse.cz Signed-off-by:
Theodore Ts'o <tytso@mit.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Thomas Gleixner authored
stable inclusion from linux-4.19.184 commit fb5318f0c0370a4ee0d62effb36014ed452b72c7 -------------------------------- commit 291da9d4a9eb3a1cb0610b7f4480f5b52b1825e7 upstream. If CONFIG_DEBUG_LOCK_ALLOC=n then mutex_lock_io_nested() maps to mutex_lock() which is clearly wrong because mutex_lock() lacks the io_schedule_prepare()/finish() invocations. Map it to mutex_lock_io(). Fixes: f21860ba ("locking/mutex, sched/wait: Fix the mutex_lock_io_nested() define") Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/878s6fshii.fsf@nanos.tec.linutronix.de Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
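The mapping in question looks roughly like this (sketch of the CONFIG_DEBUG_LOCK_ALLOC=n case in include/linux/mutex.h; treat as illustrative):

    /* Broken mapping: silently dropped io_schedule_prepare()/io_schedule_finish()
     *   #define mutex_lock_io_nested(lock, subclass) mutex_lock(lock)
     * Fixed mapping: fall back to the io-aware variant instead. */
    #define mutex_lock_io_nested(lock, subclass) mutex_lock_io(lock)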
-
JeongHyeon Lee authored
stable inclusion from linux-4.19.184 commit 1410d3d7f470da01e66b1ba793140959f1e163fd -------------------------------- [ Upstream commit 88cd3e6c ] The verification is to support cases where the root hash is not secured by Trusted Boot, UEFI Secureboot or similar technologies. One of the use cases for this is for dm-verity volumes mounted after boot, the root hash provided during the creation of the dm-verity volume has to be secure and thus in-kernel validation implemented here will be used before we trust the root hash and allow the block device to be created. The signature being provided for verification must verify the root hash and must be trusted by the builtin keyring for verification to succeed. The hash is added as a key of type "user" and the description is passed to the kernel so it can look it up and use it for verification. Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root hash verification is needed. Kernel commandline dm_verity module parameter 'require_signatures' will indicate whether to force root hash signature verification (for all dm verity volumes). Signed-off-by:
Jaskaran Khurana <jaskarankhurana@linux.microsoft.com> Tested-and-Reviewed-by:
Milan Broz <gmazyland@gmail.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Mark Tomlinson authored
stable inclusion from linux-4.19.184 commit 74bc0e5132581974f4f2b559540a59665a7cf96a -------------------------------- [ Upstream commit abe7034b9a8d57737e80cc16d60ed3666990bdbf ] This reverts commit 443d6e86f821a165fae3fc3fc13086d27ac140b1. This (and the following) patch basically re-implemented the RCU mechanisms of patch 78454473. That patch was replaced because of the performance problems that it created when replacing tables. Now, we have the same issue: the call to synchronize_rcu() makes replacing tables slower by as much as an order of magnitude. Revert these patches and fix the issue in a different way. Signed-off-by:
Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Mark Tomlinson authored
stable inclusion from linux-4.19.184 commit 81bc258370c6eeb1f41d350325e8a2c8e20fafad -------------------------------- [ Upstream commit 175e476b8cdf2a4de7432583b49c871345e4f8a1 ] When a new table value was assigned, it was followed by a write memory barrier. This ensured that all writes before this point would complete before any writes after this point. However, to determine whether the rules are unused, the sequence counter is read. To ensure that all writes have been done before these reads, a full memory barrier is needed, not just a write memory barrier. The same argument applies when incrementing the counter, before the rules are read. Changing to using smp_mb() instead of smp_wmb() fixes the kernel panic reported in cc00bcaa (which is still present), while still maintaining the same speed of replacing tables. The smp_mb() barriers potentially slow the packet path, however testing has shown no measurable change in performance on a 4-core MIPS64 platform. Fixes: 7f5c6d4f ("netfilter: get rid of atomic ops in fast path") Signed-off-by:
Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
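A rough sketch of the ordering requirement on the table-replace side (approximate shape of xt_replace_table(), not the real function):

    /* Publish the new table, then wait until every CPU's seqcount shows the
     * old rules are no longer in use.  The pointer store must be ordered
     * against those later reads, which a write barrier alone cannot do. */
    static struct xt_table_info *
    replace_table_sketch(struct xt_table *table, struct xt_table_info *newinfo)
    {
            struct xt_table_info *oldinfo = table->private;
            unsigned int cpu;

            table->private = newinfo;

            /* was smp_wmb(): orders only the stores above; the seqcount
             * *reads* below also need ordering, hence a full barrier */
            smp_mb();

            for_each_possible_cpu(cpu) {
                    /* spin until the per-cpu seqcount indicates the old
                     * rules are no longer in flight on this CPU (elided) */
            }

            return oldinfo;
    }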
-
Mark Tomlinson authored
stable inclusion from linux-4.19.184 commit 0abcfaf058d77aa6450ceb29985e50f72bf6b782 -------------------------------- [ Upstream commit d3d40f237480abf3268956daf18cdc56edd32834 ] This reverts commit cc00bcaa. This (and the preceding) patch basically re-implemented the RCU mechanisms of patch 78454473. That patch was replaced because of the performance problems that it created when replacing tables. Now, we have the same issue: the call to synchronize_rcu() makes replacing tables slower by as much as an order of magnitude. Prior to using RCU a script calling "iptables" approx. 200 times was taking 1.16s. With RCU this increased to 11.59s. Revert these patches and fix the issue in a different way. Signed-off-by:
Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Pavel Tatashin authored
stable inclusion from linux-4.19.184 commit df6f09cb7143be1f981f42ec363e5551c5d890c2 -------------------------------- [ Upstream commit 141f8202cfa4192c3af79b6cbd68e7760bb01b5a ] The ppos points to a position in the old kernel memory (and in case of arm64 in the crash kernel since elfcorehdr is passed as a segment). The function should update the ppos by the amount that was read. This bug is not exposed by accident, but other platforms update this value properly. So, fix it in ARM64 version of elfcorehdr_read() as well. Signed-off-by:
Pavel Tatashin <pasha.tatashin@soleen.com> Fixes: e62aaeac ("arm64: kdump: provide /proc/vmcore file") Reviewed-by:
Tyler Hicks <tyhicks@linux.microsoft.com> Link: https://lore.kernel.org/r/20210319205054.743368-1-pasha.tatashin@soleen.com Signed-off-by:
Will Deacon <will@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
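The fixed helper is essentially the following (sketch; arm64's elfcorehdr lives in memory that is already mapped, hence the direct copy):

    ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
    {
            memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
            *ppos += count;         /* the previously missing position update */

            return count;
    }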
-
Florian Westphal authored
stable inclusion from linux-4.19.184 commit 2ca21906400986780cb5216e8bdd27201fd4a780 -------------------------------- [ Upstream commit b58f33d49e426dc66e98ed73afb5d97b15a25f2d ] Before this change, the mask is never included in the netlink message, so "conntrack -E expect" always prints 0.0.0.0. In older kernels the l3num callback struct was passed as argument, based on tuple->src.l3num. After the l3num indirection got removed, the call chain is based on m.src.l3num, but this value is 0xffff. Init l3num to the correct value. Fixes: f957be9d ("netfilter: conntrack: remove ctnetlink callbacks from l3 protocol trackers") Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Mikulas Patocka authored
stable inclusion from linux-4.19.184 commit 76aa61c55279fdaa8d428236ba8834edf313b372 -------------------------------- commit 4edbe1d7bcffcd6269f3b5eb63f710393ff2ec7a upstream. If there are not any dm devices, we need to zero the "dev" argument in the first structure dm_name_list. However, this can cause out of bounds write, because the "needed" variable is zero and len may be less than eight. Fix this bug by reporting DM_BUFFER_FULL_FLAG if the result buffer is too small to hold the "nl->dev" value. Signed-off-by:
Mikulas Patocka <mpatocka@redhat.com> Reported-by:
Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@vger.kernel.org Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Daniel Wagner authored
stable inclusion from linux-4.19.184 commit 4c083481b30a17568273deed595736e091d17a65 -------------------------------- [ Upstream commit 9ec491447b90ad6a4056a9656b13f0b3a1e83043 ] register_disk() suppresses uevents for devices with GENHD_FL_HIDDEN but enables uevents again at the end in order to announce the disk after possible partitions are created. When the device is removed the uevents are still on and user land sees 'remove' messages for devices which were never 'add'ed to the system.

KERNEL[95481.571887] remove /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block)

Let's suppress the uevents for GENHD_FL_HIDDEN by not enabling the uevents at all. Signed-off-by:
Daniel Wagner <dwagner@suse.de> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin Wilck <mwilck@suse.com> Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de Signed-off-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Frank Sorenson authored
stable inclusion from linux-4.19.184 commit 5cd09eeadd277c301344975e403bcd95d9c0f125 -------------------------------- [ Upstream commit ad3dbe35c833c2d4d0bbf3f04c785d32f931e7c9 ] CREATE requests return a post_op_fh3, rather than nfs_fh3. The post_op_fh3 includes an extra word to indicate 'handle_follows'. Without that additional word, create fails when full 64-byte filehandles are in use. Add NFS3_post_op_fh_sz, and correct the size calculation for NFS3_createres_sz. Signed-off-by:
Frank Sorenson <sorenson@redhat.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
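The size accounting change amounts to something like the following (sketch; the exact macro composition should be checked against fs/nfs/nfs3xdr.c rather than taken from here):

    /* post_op_fh3 = handle_follows word + nfs_fh3, so it needs one extra word */
    #define NFS3_post_op_fh_sz  (1 + NFS3_fh_sz)

    /* createres previously accounted only NFS3_fh_sz and under-sized the reply */
    #define NFS3_createres_sz   (1 + NFS3_post_op_fh_sz + NFS3_post_op_attr_sz + NFS3_wcc_data_sz)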
-
Vincent Whitchurch authored
stable inclusion from linux-4.19.183 commit b0834edc70e402244ed8da96664368c15d869582 -------------------------------- commit 05946d4b7a7349ae58bfa2d51ae832e64a394c2d upstream. smb311_update_preauth_hash() uses the shash in server->secmech without appropriate locking, and this can lead to sessions corrupting each other's preauth hashes. The following script can easily trigger the problem:

#!/bin/sh -e
NMOUNTS=10
for i in $(seq $NMOUNTS); do
    mkdir -p /tmp/mnt$i
    umount /tmp/mnt$i 2>/dev/null || :
done
while :; do
    for i in $(seq $NMOUNTS); do
        mount -t cifs //192.168.0.1/test /tmp/mnt$i -o ... &
    done
    wait
    for i in $(seq $NMOUNTS); do
        umount /tmp/mnt$i
    done
done

Usually within seconds this leads to one or more of the mounts failing with the following errors, and a "Bad SMB2 signature for message" is seen in the server logs:

CIFS: VFS: \\192.168.0.1 failed to connect to IPC (rc=-13)
CIFS: VFS: cifs_mount failed w/return code = -13

Fix it by holding the server mutex just like in the other places where the shashes are used. Fixes: 8bd68c6e ("CIFS: implement v3.11 preauth integrity") Signed-off-by:
Vincent Whitchurch <vincent.whitchurch@axis.com> CC: <stable@vger.kernel.org> Reviewed-by:
Aurelien Aptel <aaptel@suse.com> Signed-off-by:
Steve French <stfrench@microsoft.com> [aaptel: backport to kernel without CIFS_SESS_OP and multichannel] Signed-off-by:
Aurelien Aptel <aaptel@suse.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
zhangyi (F) authored
stable inclusion from linux-4.19.183 commit a8fb57ec924feec102d477c34a1e21685ff865e9 -------------------------------- commit 6b22489911b726eebbf169caee52fea52013fbdd upstream. Syzbot reports a warning that ext4 may create an empty ea_inode if an empty extended attribute is set on a file on a file system which has no free blocks left.

WARNING: CPU: 6 PID: 10667 at fs/ext4/xattr.c:1640 ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
...
Call trace:
ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
ext4_xattr_block_set+0x1d0/0x1b1c fs/ext4/xattr.c:1942
ext4_xattr_set_handle+0x8a0/0xf1c fs/ext4/xattr.c:2390
ext4_xattr_set+0x120/0x1f0 fs/ext4/xattr.c:2491
ext4_xattr_trusted_set+0x48/0x5c fs/ext4/xattr_trusted.c:37
__vfs_setxattr+0x208/0x23c fs/xattr.c:177
...

Now, ext4 tries to store the extended attribute in an external inode if ext4_xattr_block_set() returns -ENOSPC, but for the case of storing an empty extended attribute, storing the entry in the extended attribute block is enough. A simple reproducer is below.

fallocate test.img -l 1M
mkfs.ext4 -F -b 2048 -O ea_inode test.img
mount test.img /mnt
dd if=/dev/zero of=/mnt/foo bs=2048 count=500
setfattr -n "user.test" /mnt/foo

Reported-by:
<syzbot+98b881fdd8ebf45ab4ae@syzkaller.appspotmail.com> Fixes: 9c6e7853 ("ext4: reserve space for xattr entries/names") Cc: stable@kernel.org Signed-off-by:
zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210305120508.298465-1-yi.zhang@huawei.com Signed-off-by:
Theodore Ts'o <tytso@mit.edu> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Oleg Nesterov authored
stable inclusion from linux-4.19.183 commit 6cd1e19841fc245b44277d73e449c1dc82a56c73 -------------------------------- commit 5abbe51a526253b9f003e9a0a195638dc882d660 upstream. Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT. Add a new helper which sets restart_block->fn and calls a dummy arch_set_restart_data() helper. Fixes: 609c19a3 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code") Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210201174641.GA17871@redhat.com Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
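The helper introduced here is essentially the following (sketch based on the commit description):

    /* Set the restart function and let the architecture stash any extra state
     * (arch_set_restart_data() is a no-op by default). */
    static inline long set_restart_fn(struct restart_block *restart,
                                      long (*fn)(struct restart_block *))
    {
            restart->fn = fn;
            arch_set_restart_data(restart);
            return -ERESTART_RESTARTBLOCK;
    }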
-
Sagi Grimberg authored
stable inclusion from linux-4.19.183 commit 5d9873e46c6d5a3c358341e40c373b79677f14e2 -------------------------------- [ Upstream commit c4c6df5fc84659690d4391d1fba155cd94185295 ] We only setup io queues for nvme controllers, and it makes absolutely no sense to allow a controller (re)connect without any I/O queues. If we happen to fail setting the queue count for any reason, we should not allow this to be a successful reconnect as I/O has no chance in going through. Instead just fail and schedule another reconnect. Reported-by:
Chao Leng <lengchao@huawei.com> Fixes: 71102307 ("nvme-rdma: add a NVMe over Fabrics RDMA host driver") Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Chao Leng <lengchao@huawei.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Daniel Kobras authored
stable inclusion from linux-4.19.183 commit 98982cf7997414245477ffa90c9cdf8492c0b4bc -------------------------------- commit f1442d6349a2e7bb7a6134791bdc26cb776c79af upstream. If an auth module's accept op returns SVC_CLOSE, svc_process_common() enters a call path that does not call svc_authorise() before leaving the function, and thus leaks a reference on the auth module's refcount. Hence, make sure calls to svc_authenticate() and svc_authorise() are paired for all call paths, to make sure rpc auth modules can be unloaded. Signed-off-by:
Daniel Kobras <kobras@puzzle-itc.de> Fixes: 4d712ef1 ("svcauth_gss: Close connection when dropping an incoming message") Link: https://lore.kernel.org/linux-nfs/3F1B347F-B809-478F-A1E9-0BE98E22B0F0@oracle.com/T/#t Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Matthew Wilcox (Oracle) authored
stable inclusion from linux-4.19.181 commit 5bd7642bd62805a91f5e90d4af8b5465515273f5 -------------------------------- [ Upstream commit 149fc787353f65b7e72e05e7b75d34863266c3e2 ] Fix a sparse warning by using rcu_dereference(). Technically this is a bug and a sufficiently aggressive compiler could reload the `real_parent' pointer outside the protection of the rcu lock (and access freed memory), but I think it's pretty unlikely to happen. Link: https://lkml.kernel.org/r/20210221194207.1351703-1-willy@infradead.org Fixes: b18dc5f2 ("mm, oom: skip vforked tasks from being selected") Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Miaohe Lin <linmiaohe@huawei.com> Acked-by:
Michal Hocko <mhocko@suse.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
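The pattern after the fix looks like this (sketch of the in_vfork()-style check; treat as illustrative):

    static bool in_vfork(struct task_struct *tsk)
    {
            bool ret;

            rcu_read_lock();
            /* rcu_dereference() keeps the compiler from reloading real_parent
             * outside the RCU read-side critical section */
            ret = tsk->vfork_done &&
                  rcu_dereference(tsk->real_parent)->mm == tsk->mm;
            rcu_read_unlock();

            return ret;
    }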
-
Anna-Maria Behnsen authored
stable inclusion from linux-4.19.181 commit eda5858d4867f1681b42e8b13e5eac3fba29e915 -------------------------------- [ Upstream commit 46eb1701c046cc18c032fa68f3c8ccbf24483ee4 ] hrtimer_force_reprogram() and hrtimer_interrupt() invoke __hrtimer_get_next_event() to find the earliest expiry time of hrtimer bases. __hrtimer_get_next_event() does not update cpu_base::[softirq_]_expires_next to preserve reprogramming logic. That needs to be done at the callsites.

hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when the first expiring timer is a softirq timer and the soft interrupt is not activated. That's wrong because cpu_base::softirq_expires_next is left stale when the first expiring timer of all bases is a timer which expires in hard interrupt context. hrtimer_interrupt() never updates cpu_base::softirq_expires_next, which is wrong too.

That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that timer before the stale cpu_base::softirq_expires_next.

cpu_base::softirq_expires_next is cached to make the check for raising the soft interrupt fast. In the above case the soft interrupt won't be raised until clock monotonic reaches the stale cpu_base::softirq_expires_next value. That's incorrect, but what's worse is that if the softirq timer becomes the first expiring timer of all clock bases after the hard expiry timer has been handled, the reprogramming of the clockevent from hrtimer_interrupt() will result in an interrupt storm. That happens because the reprogramming does not use cpu_base::softirq_expires_next, it uses __hrtimer_get_next_event() which returns the actual expiry time. Once clock MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is raised and the storm subsides.

Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard bases separately, update softirq_expires_next and handle the case when a soft expiring timer is the first of all bases by comparing the expiry times and updating the required cpu base fields. Split this functionality into a separate function to be able to use it in hrtimer_interrupt() as well without copy paste. Fixes: 5da70160 ("hrtimer: Implement support for softirq based hrtimers") Reported-by:
Mikael Beckius <mikael.beckius@windriver.com> Suggested-by:
Thomas Gleixner <tglx@linutronix.de> Tested-by:
Mikael Beckius <mikael.beckius@windriver.com> Signed-off-by:
Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210223160240.27518-1-anna-maria@linutronix.de Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Aleksandr Miloserdov authored
stable inclusion from linux-4.19.181 commit 14e4d57cbafe0927a080bd6b635e997ab347aa07 -------------------------------- [ Upstream commit 14d24e2cc77411301e906a8cf41884739de192de ] TCM buffer length doesn't necessarily equal 8 + ADDITIONAL LENGTH which might be considered an underflow in case of Data-In size being greater than 8 + ADDITIONAL LENGTH. So truncate buffer length to prevent underflow. Link: https://lore.kernel.org/r/20210209072202.41154-3-a.miloserdov@yadro.com Reviewed-by:
Roman Bolshakov <r.bolshakov@yadro.com> Reviewed-by:
Bodo Stroesser <bostroesser@gmail.com> Signed-off-by:
Aleksandr Miloserdov <a.miloserdov@yadro.com> Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Aleksandr Miloserdov authored
stable inclusion from linux-4.19.181 commit 7abc17dced7593cf40a46f8f9b1aafd634a0ebb1 -------------------------------- [ Upstream commit 1c73e0c5e54d5f7d77f422a10b03ebe61eaed5ad ] TCM doesn't properly handle underflow case for service actions. One way to prevent it is to always complete command with target_complete_cmd_with_length(), however it requires access to data_sg, which is not always available. This change introduces target_set_cmd_data_length() function which allows to set command data length before completing it. Link: https://lore.kernel.org/r/20210209072202.41154-2-a.miloserdov@yadro.com Reviewed-by:
Roman Bolshakov <r.bolshakov@yadro.com> Reviewed-by:
Bodo Stroesser <bostroesser@gmail.com> Signed-off-by:
Aleksandr Miloserdov <a.miloserdov@yadro.com> Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Sasha Levin <sashal@kernel.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-
Geert Uytterhoeven authored
stable inclusion from linux-4.19.181 commit 6a8b02ead7f7ed10af3dbe180c524e0b5ce16c6e -------------------------------- [ Upstream commit f6bda644fa3a7070621c3bf12cd657f69a42f170 ] Kmemleak reports:

unreferenced object 0xc328de40 (size 64):
  comm "kworker/1:1", pid 21, jiffies 4294938212 (age 1484.670s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 e0 d8 fc eb 00 00 00 00  ................
    00 00 10 fe 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ad758d10>] pci_register_io_range+0x3c/0x80
    [<2c7f139e>] of_pci_range_to_resource+0x48/0xc0
    [<f079ecc8>] devm_of_pci_get_host_bridge_resources.constprop.0+0x2ac/0x3ac
    [<e999753b>] devm_of_pci_bridge_init+0x60/0x1b8
    [<a895b229>] devm_pci_alloc_host_bridge+0x54/0x64
    [<e451ddb0>] rcar_pcie_probe+0x2c/0x644

In case a PCI host driver's probe is deferred, the same I/O range may be allocated again, and be ignored, causing a memory leak. Fix this by (...
-
Linus Torvalds authored
stable inclusion from linux-4.19.181 commit c6afb0eeed94a7d504acdfdde7128573ec6a9e3e -------------------------------- commit 9b1ea29bc0d7b94d420f96a0f4121403efc3dd85 upstream. This reverts commit 8ff60eb052eeba95cfb3efe16b08c9199f8121cf. The kernel test robot reports a huge performance regression due to the commit, and the reason seems fairly straightforward: when there is contention on the page list (which is what causes acquire_slab() to fail), we do _not_ want to just loop and try again, because that will transfer the contention to the 'n->list_lock' spinlock we hold, and just make things even worse. This is admittedly likely a problem only on big machines - the kernel test robot report comes from a 96-thread dual socket Intel Xeon Gold 6252 setup, but the regression there really is quite noticeable: -47.9% regression of stress-ng.rawpkt.ops_per_sec and the commit that was marked as being fixed (7ced3719: "slub: Acquire_slab() avoid loop") actually did the loop exit early very intentionally (the hint being that "avoid loop" part of that commit message), exactly to avoid this issue. The correct thing to do may be to pick some kind of reasonable middle ground: instead of breaking out of the loop on the very first sign of contention, or trying over and over and over again, the right thing may be to re-try _once_, and then give up on the second failure (or pick your favorite value for "once"..). Reported-by:
kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/ Cc: Jann Horn <jannh@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by:
Christoph Lameter <cl@linux.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Yang Yingliang <yangyingliang@huawei.com> Signed-off-by:
Cheng Jian <cj.chengjian@huawei.com>
-