Skip to content
Snippets Groups Projects
  1. Apr 22, 2017
  2. Mar 25, 2017
  3. Feb 18, 2017
    • Daniel Borkmann's avatar
      bpf: make jited programs visible in traces · 74451e66
      Daniel Borkmann authored
      Long standing issue with JITed programs is that stack traces from
      function tracing check whether a given address is kernel code
      through {__,}kernel_text_address(), which checks for code in core
      kernel, modules and dynamically allocated ftrace trampolines. But
      what is still missing is BPF JITed programs (interpreted programs
      are not an issue as __bpf_prog_run() will be attributed to them),
      thus when a stack trace is triggered, the code walking the stack
      won't see any of the JITed ones. The same for address correlation
      done from user space via reading /proc/kallsyms. This is read by
      tools like perf, but the latter is also useful for permanent live
      tracing with eBPF itself in combination with stack maps when other
      eBPF types are part of the callchain. See offwaketime example on
      dumping stack from a map.
      
      This work tries to tackle that issue by making the addresses and
      symbols known to the kernel. The lookup from *kernel_text_address()
      is implemented through a latche...
      74451e66
  4. Dec 30, 2016
    • Matthias Tafelmeier's avatar
      net: dev_weight: TX/RX orthogonality · 3d48b53f
      Matthias Tafelmeier authored
      
      Oftenly, introducing side effects on packet processing on the other half
      of the stack by adjusting one of TX/RX via sysctl is not desirable.
      There are cases of demand for asymmetric, orthogonal configurability.
      
      This holds true especially for nodes where RPS for RFS usage on top is
      configured and therefore use the 'old dev_weight'. This is quite a
      common base configuration setup nowadays, even with NICs of superior processing
      support (e.g. aRFS).
      
      A good example use case are nodes acting as noSQL data bases with a
      large number of tiny requests and rather fewer but large packets as responses.
      It's affordable to have large budget and rx dev_weights for the
      requests. But as a side effect having this large a number on TX
      processed in one run can overwhelm drivers.
      
      This patch therefore introduces an independent configurability via sysctl to
      userland.
      
      Signed-off-by: default avatarMatthias Tafelmeier <matthias.tafelmeier@gmx.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3d48b53f
  5. Dec 09, 2016
  6. May 17, 2016
    • Daniel Borkmann's avatar
      bpf: add generic constant blinding for use in jits · 4f3446bb
      Daniel Borkmann authored
      This work adds a generic facility for use from eBPF JIT compilers
      that allows for further hardening of JIT generated images through
      blinding constants. In response to the original work on BPF JIT
      spraying published by Keegan McAllister [1], most BPF JITs were
      changed to make images read-only and start at a randomized offset
      in the page, where the rest was filled with trap instructions. We
      have this nowadays in x86, arm, arm64 and s390 JIT compilers.
      Additionally, later work also made eBPF interpreter images read
      only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
      arm, arm64 and s390 archs as well currently. This is done by
      default for mentioned JITs when JITing is enabled. Furthermore,
      we had a generic and configurable constant blinding facility on our
      todo for quite some time now to further make spraying harder, and
      first implementation since around netconf 2016.
      
      We found that for systems where untrusted users can load cBPF/eBPF
      code where JIT is enabled, start offset randomization helps a bit
      to make jumps into crafted payload harder, but in case where larger
      programs that cross page boundary are injected, we again have some
      part of the program opcodes at a page start offset. With improved
      guessing and more reliable payload injection, chances can increase
      to jump into such payload. Elena Reshetova recently wrote a test
      case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
      can leave some more room for payloads. Note that for all this,
      additional bugs in the kernel are still required to make the jump
      (and of course to guess right, to not jump into a trap) and naturally
      the JIT must be enabled, which is disabled by default.
      
      For helping mitigation, the general idea is to provide an option
      bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
      that for cases where JIT should be enabled for performance reasons,
      the generated image can be further hardened with blinding constants
      for unpriviledged users (bpf_jit_harden == 1), with trading off
      performance for these, but not for privileged ones. We also added
      the option of blinding for all users (bpf_jit_harden == 2), which
      is quite helpful for testing f.e. with test_bpf.ko. There are no
      further e.g. hardening levels of bpf_jit_harden switch intended,
      rationale is to have it dead simple to use as on/off. Since this
      functionality would need to be duplicated over and over for JIT
      compilers to use, which are already complex enough, we provide a
      generic eBPF byte-code level based blinding implementation, which is
      then just transparently JITed. JIT compilers need to make only a few
      changes to integrate this facility and can be migrated one by one.
      
      This option is for eBPF JITs and will be used in x86, arm64, s390
      without too much effort, and soon ppc64 JITs, thus that native eBPF
      can be blinded as well as cBPF to eBPF migrations, so that both can
      be covered with a single implementation. The rule for JITs is that
      bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
      and in case blinding is disabled, we follow normally with JITing the
      passed program. In case blinding is enabled and we fail during the
      process of blinding itself, we must return with the interpreter.
      Similarly, in case the JITing process after the blinding failed, we
      return normally to the interpreter with the non-blinded code. Meaning,
      interpreter doesn't change in any way and operates on eBPF code as
      usual. For doing this pre-JIT blinding step, we need to make use of
      a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
      to the JIT and not in any way part of the eBPF architecture. Just like
      in the same way as JITs internally make use of some helper registers
      when emitting code, only that here the helper register is one
      abstraction level higher in eBPF bytecode, but nevertheless in JIT
      phase. That helper register is needed since f.e. manually written
      program can issue loads to all registers of eBPF architecture.
      
      The core concept with the additional register is: blind out all 32
      and 64 bit constants by converting BPF_K based instructions into a
      small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
      is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
      and REG <OP> BPF_REG_AX, so actual operation on the target register
      is translated from BPF_K into BPF_X one that is operating on
      BPF_REG_AX's content. During rewriting phase when blinding, RND is
      newly generated via prandom_u32() for each processed instruction.
      64 bit loads are split into two 32 bit loads to make translation and
      patching not too complex. Only basic thing required by JITs is to
      call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
      pair, and to map BPF_REG_AX into an unused register.
      
      Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
      
      echo 0 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f5e9 + <x>:
        [...]
        39:   mov    $0xa8909090,%eax
        3e:   mov    $0xa8909090,%eax
        43:   mov    $0xa8ff3148,%eax
        48:   mov    $0xa89081b4,%eax
        4d:   mov    $0xa8900bb0,%eax
        52:   mov    $0xa810e0c1,%eax
        57:   mov    $0xa8908eb4,%eax
        5c:   mov    $0xa89020b0,%eax
        [...]
      
      echo 1 > /proc/sys/net/core/bpf_jit_harden
      
        ffffffffa034f1e5 + <x>:
        [...]
        39:   mov    $0xe1192563,%r10d
        3f:   xor    $0x4989b5f3,%r10d
        46:   mov    %r10d,%eax
        49:   mov    $0xb8296d93,%r10d
        4f:   xor    $0x10b9fd03,%r10d
        56:   mov    %r10d,%eax
        59:   mov    $0x8c381146,%r10d
        5f:   xor    $0x24c7200e,%r10d
        66:   mov    %r10d,%eax
        69:   mov    $0xeb2a830e,%r10d
        6f:   xor    $0x43ba02ba,%r10d
        76:   mov    %r10d,%eax
        79:   mov    $0xd9730af,%r10d
        7f:   xor    $0xa5073b1f,%r10d
        86:   mov    %r10d,%eax
        89:   mov    $0x9a45662b,%r10d
        8f:   xor    $0x325586ea,%r10d
        96:   mov    %r10d,%eax
        [...]
      
      As can be seen, original constants that carry payload are hidden
      when enabled, actual operations are transformed from constant-based
      to register-based ones, making jumps into constants ineffective.
      Above extract/example uses single BPF load instruction over and
      over, but of course all instructions with constants are blinded.
      
      Performance wise, JIT with blinding performs a bit slower than just
      JIT and faster than interpreter case. This is expected, since we
      still get all the performance benefits from JITing and in normal
      use-cases not every single instruction needs to be blinded. Summing
      up all 296 test cases averaged over multiple runs from test_bpf.ko
      suite, interpreter was 55% slower than JIT only and JIT with blinding
      was 8% slower than JIT only. Since there are also some extremes in
      the test suite, I expect for ordinary workloads that the performance
      for the JIT with blinding case is even closer to JIT only case,
      f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
      35 (+ blinding), and 151 (interpreter).
      
      BPF test suite, seccomp test suite, eBPF sample code and various
      bigger networking eBPF programs have been tested with this and were
      running fine. For testing purposes, I also adapted interpreter and
      redirected blinded eBPF image to interpreter and also here all tests
      pass.
      
        [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
        [2] https://github.com/01org/jit-spray-poc-for-ksp/
        [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
      
      
      
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarElena Reshetova <elena.reshetova@intel.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4f3446bb
  7. Feb 09, 2016
  8. Mar 21, 2015
    • Eric Dumazet's avatar
      net: increase sk_[max_]ack_backlog · becb74f0
      Eric Dumazet authored
      
      sk_ack_backlog & sk_max_ack_backlog were 16bit fields, meaning
      listen() backlog was limited to 65535.
      
      It is time to increase the width to allow much bigger backlog,
      if admins change /proc/sys/net/core/somaxconn &
      /proc/sys/net/ipv4/tcp_max_syn_backlog default values.
      
      Tested:
      
      echo 5000000 >/proc/sys/net/core/somaxconn
      echo 5000000 >/proc/sys/net/ipv4/tcp_max_syn_backlog
      
      Ran a SYNFLOOD test against a listener using listen(fd, 5000000)
      
      myhost~# grep request_sock_TCP /proc/slabinfo
      request_sock_TCP  4185642 4411940    304   13    1 : tunables   54   27    8 : slabdata 339380 339380      0
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      becb74f0
  9. Mar 12, 2015
    • Alexey Kodanev's avatar
      net: sysctl_net_core: check SNDBUF and RCVBUF for min length · b1cb59cf
      Alexey Kodanev authored
      
      sysctl has sysctl.net.core.rmem_*/wmem_* parameters which can be
      set to incorrect values. Given that 'struct sk_buff' allocates from
      rcvbuf, incorrectly set buffer length could result to memory
      allocation failures. For example, set them as follows:
      
          # sysctl net.core.rmem_default=64
            net.core.wmem_default = 64
          # sysctl net.core.wmem_default=64
            net.core.wmem_default = 64
          # ping localhost -s 1024 -i 0 > /dev/null
      
      This could result to the following failure:
      
      skbuff: skb_over_panic: text:ffffffff81628db4 len:-32 put:-32
      head:ffff88003a1cc200 data:ffff88003a1cc200 tail:0xffffffe0 end:0xc0 dev:<NULL>
      kernel BUG at net/core/skbuff.c:102!
      invalid opcode: 0000 [#1] SMP
      ...
      task: ffff88003b7f5550 ti: ffff88003ae88000 task.ti: ffff88003ae88000
      RIP: 0010:[<ffffffff8155fbd1>]  [<ffffffff8155fbd1>] skb_put+0xa1/0xb0
      RSP: 0018:ffff88003ae8bc68  EFLAGS: 00010296
      RAX: 000000000000008d RBX: 00000000ffffffe0 RCX: 0000000000000000
      RDX: ffff88003fdcf598 RSI: ffff88003fdcd9c8 RDI: ffff88003fdcd9c8
      RBP: ffff88003ae8bc88 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000001 R11: 00000000000002b2 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff88003d3f7300 R15: ffff88000012a900
      FS:  00007fa0e2b4a840(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000d0f7e0 CR3: 000000003b8fb000 CR4: 00000000000006f0
      Stack:
       ffff88003a1cc200 00000000ffffffe0 00000000000000c0 ffffffff818cab1d
       ffff88003ae8bd68 ffffffff81628db4 ffff88003ae8bd48 ffff88003b7f5550
       ffff880031a09408 ffff88003b7f5550 ffff88000012aa48 ffff88000012ab00
      Call Trace:
       [<ffffffff81628db4>] unix_stream_sendmsg+0x2c4/0x470
       [<ffffffff81556f56>] sock_write_iter+0x146/0x160
       [<ffffffff811d9612>] new_sync_write+0x92/0xd0
       [<ffffffff811d9cd6>] vfs_write+0xd6/0x180
       [<ffffffff811da499>] SyS_write+0x59/0xd0
       [<ffffffff81651532>] system_call_fastpath+0x12/0x17
      Code: 00 00 48 89 44 24 10 8b 87 c8 00 00 00 48 89 44 24 08 48 8b 87 d8 00
            00 00 48 c7 c7 30 db 91 81 48 89 04 24 31 c0 e8 4f a8 0e 00 <0f> 0b
            eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83
      RIP  [<ffffffff8155fbd1>] skb_put+0xa1/0xb0
      RSP <ffff88003ae8bc68>
      Kernel panic - not syncing: Fatal exception
      
      Moreover, the possible minimum is 1, so we can get another kernel panic:
      ...
      BUG: unable to handle kernel paging request at ffff88013caee5c0
      IP: [<ffffffff815604cf>] __alloc_skb+0x12f/0x1f0
      ...
      
      Signed-off-by: default avatarAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b1cb59cf
  10. Feb 14, 2015
  11. Feb 09, 2015
    • Eric Dumazet's avatar
      net:rfs: adjust table size checking · 93c1af6c
      Eric Dumazet authored
      
      Make sure root user does not try something stupid.
      
      Also make sure mask field in struct rps_sock_flow_table
      does not share a cache line with the potentially often dirtied
      flow table.
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      93c1af6c
    • Eric Dumazet's avatar
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet authored
      
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarTom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      567e4b79
  12. Feb 03, 2015
    • Willem de Bruijn's avatar
      net-timestamp: no-payload only sysctl · b245be1f
      Willem de Bruijn authored
      
      Tx timestamps are looped onto the error queue on top of an skb. This
      mechanism leaks packet headers to processes unless the no-payload
      options SOF_TIMESTAMPING_OPT_TSONLY is set.
      
      Add a sysctl that optionally drops looped timestamp with data. This
      only affects processes without CAP_NET_RAW.
      
      The policy is checked when timestamps are generated in the stack.
      It is possible for timestamps with data to be reported after the
      sysctl is set, if these were queued internally earlier.
      
      No vulnerability is immediately known that exploits knowledge
      gleaned from packet headers, but it may still be preferable to allow
      administrators to lock down this path at the cost of possible
      breakage of legacy applications.
      
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        (v1 -> v2)
        - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
        (rfc -> v1)
        - document the sysctl in Documentation/sysctl/net.txt
        - fix access control race: read .._OPT_TSONLY only once,
              use same value for permission check and skb generation.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b245be1f
  13. Nov 17, 2014
    • Eric Dumazet's avatar
      net: provide a per host RSS key generic infrastructure · 960fb622
      Eric Dumazet authored
      
      RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes
      RSS key.
      
      Some drivers use a constant (and well known key), some drivers use a random
      key per port, making bonding setups hard to tune. Well known keys increase
      attack surface, considering that number of queues is usually a power of two.
      
      This patch provides infrastructure to help drivers doing the right thing.
      
      netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
      even if they provide ethtool -X support to let user redefine the key later.
      
      A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
      RSS key even for drivers not providing ethtool -x support, in case some
      applications want to precisely setup flows to match some RX queues.
      
      Tested:
      
      myhost:~# cat /proc/sys/net/core/netdev_rss_key
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72
      
      myhost:~# ethtool -x eth0
      RX flow hash indirection table for eth0 with 8 RX ring(s):
          0:      0     1     2     3     4     5     6     7
      RSS hash key:
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      960fb622
  14. Nov 12, 2014
    • Joe Perches's avatar
      net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Joe Perches authored
      
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
      
      This may have some negative impact on messages that were
      emitted at KERN_INFO that are not not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled.  Even so,
      these messages are now _not_ emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ba7a46f1
  15. Dec 20, 2013
  16. Aug 31, 2013
    • stephen hemminger's avatar
      qdisc: allow setting default queuing discipline · 6da7c8fc
      stephen hemminger authored
      
      By default, the pfifo_fast queue discipline has been used by default
      for all devices. But we have better choices now.
      
      This patch allow setting the default queueing discipline with sysctl.
      This allows easy use of better queueing disciplines on all devices
      without having to use tc qdisc scripts. It is intended to allow
      an easy path for distributions to make fq_codel or sfq the default
      qdisc.
      
      This patch also makes pfifo_fast more of a first class qdisc, since
      it is now possible to manually override the default and explicitly
      use pfifo_fast. The behavior for systems who do not use the sysctl
      is unchanged, they still get pfifo_fast
      
      Also removes leftover random # in sysctl net core.
      
      Signed-off-by: default avatarStephen Hemminger <stephen@networkplumber.org>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6da7c8fc
  17. Aug 03, 2013
    • Roman Gushchin's avatar
      net: check net.core.somaxconn sysctl values · 5f671d6b
      Roman Gushchin authored
      
      It's possible to assign an invalid value to the net.core.somaxconn
      sysctl variable, because there is no checks at all.
      
      The sk_max_ack_backlog field of the sock structure is defined as
      unsigned short. Therefore, the backlog argument in inet_listen()
      shouldn't exceed USHRT_MAX. The backlog argument in the listen() syscall
      is truncated to the somaxconn value. So, the somaxconn value shouldn't
      exceed 65535 (USHRT_MAX).
      Also, negative values of somaxconn are meaningless.
      
      before:
      $ sysctl -w net.core.somaxconn=256
      net.core.somaxconn = 256
      $ sysctl -w net.core.somaxconn=65536
      net.core.somaxconn = 65536
      $ sysctl -w net.core.somaxconn=-100
      net.core.somaxconn = -100
      
      after:
      $ sysctl -w net.core.somaxconn=256
      net.core.somaxconn = 256
      $ sysctl -w net.core.somaxconn=65536
      error: "Invalid argument" setting key "net.core.somaxconn"
      $ sysctl -w net.core.somaxconn=-100
      error: "Invalid argument" setting key "net.core.somaxconn"
      
      Based on a prior patch from Changli Gao.
      
      Signed-off-by: default avatarRoman Gushchin <klamm@yandex-team.ru>
      Reported-by: default avatarChangli Gao <xiaosuo@gmail.com>
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5f671d6b
  18. Aug 02, 2013
  19. Jul 11, 2013
  20. Jun 26, 2013
    • Eliezer Tamir's avatar
      net: poll/select low latency socket support · 2d48d67f
      Eliezer Tamir authored
      
      select/poll busy-poll support.
      
      Split sysctl value into two separate ones, one for read and one for poll.
      updated Documentation/sysctl/net.txt
      
      Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
      sk_poll_ll if possible. sock_poll sets this flag in its return value
      to indicate to select/poll when a socket that can busy poll is found.
      
      When poll/select have nothing to report, call the low-level
      sock_poll again until we are out of time or we find something.
      
      Once the system call finds something, it stops setting POLL_LL, so it can
      return the result to the user ASAP.
      
      Signed-off-by: default avatarEliezer Tamir <eliezer.tamir@linux.intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2d48d67f
  21. Jun 18, 2013
  22. Jun 14, 2013
    • Willem de Bruijn's avatar
      net-rps: fixes for rps flow limit · 5f121b9a
      Willem de Bruijn authored
      
      Caught by sparse:
      - __rcu: missing annotation to sd->flow_limit
      - __user: direct access in cpumask_scnprintf
      
      Also
      - add endline character when printing bitmap if room in buffer
      - avoid bucket overflow by reducing FLOW_LIMIT_HISTORY
      
      The last item warrants some explanation. The hashtable buckets are
      subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
      to bucket size, since all packets may end up in a single bucket. The
      current (rather arbitrary) history value of 256 happens to match the
      buffer size (u8).
      
      As a result, with a single flow, the first 128 packets are accepted
      (correct), the second 128 packets dropped (correct) and then the
      history[] array has filled, so that each subsequent new packet
      causes an increment in the bucket for new_flow plus a decrement
      for old_flow: a steady state.
      
      This is fine if packets are dropped, as the steady state goes away
      as soon as a mix of traffic reappears. But, because the 256th packet
      overflowed the bucket to 0: no packets are dropped.
      
      Instead of explicitly adding an overflow check, this patch changes
      FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.
      
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      (first item)
      
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5f121b9a
  23. Jun 13, 2013
  24. Jun 11, 2013
  25. May 21, 2013
    • Willem de Bruijn's avatar
      rps: selective flow shedding during softnet overflow · 99bbc707
      Willem de Bruijn authored
      
      A cpu executing the network receive path sheds packets when its input
      queue grows to netdev_max_backlog. A single high rate flow (such as a
      spoofed source DoS) can exceed a single cpu processing rate and will
      degrade throughput of other flows hashed onto the same cpu.
      
      This patch adds a more fine grained hashtable. If the netdev backlog
      is above a threshold, IRQ cpus track the ratio of total traffic of
      each flow (using 4096 buckets, configurable). The ratio is measured
      by counting the number of packets per flow over the last 256 packets
      from the source cpu. Any flow that occupies a large fraction of this
      (set at 50%) will see packet drop while above the threshold.
      
      Tested:
      Setup is a muli-threaded UDP echo server with network rx IRQ on cpu0,
      kernel receive (RPS) on cpu0 and application threads on cpus 2--7
      each handling 20k req/s. Throughput halves when hit with a 400 kpps
      antagonist storm. With this patch applied, antagonist overload is
      dropped and the server processes its complete load.
      
      The patch is effective when kernel receive processing is the
      bottleneck. The above RPS scenario is a extreme, but the same is
      reached with RFS and sufficient kernel processing (iptables, packet
      socket tap, ..).
      
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      99bbc707
  26. Jan 29, 2013
  27. Nov 19, 2012
  28. Apr 21, 2012
  29. Apr 19, 2012
  30. Apr 18, 2012
  31. Feb 24, 2012
    • Ingo Molnar's avatar
      static keys: Introduce 'struct static_key', static_key_true()/false() and... · c5905afb
      Ingo Molnar authored
      static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
      
      So here's a boot tested patch on top of Jason's series that does
      all the cleanups I talked about and turns jump labels into a
      more intuitive to use facility. It should also address the
      various misconceptions and confusions that surround jump labels.
      
      Typical usage scenarios:
      
              #include <linux/static_key.h>
      
              struct static_key key = STATIC_KEY_INIT_TRUE;
      
              if (static_key_false(&key))
                      do unlikely code
              else
                      do likely code
      
      Or:
      
              if (static_key_true(&key))
                      do likely code
              else
                      do unlikely code
      
      The static key is modified via:
      
              static_key_slow_inc(&key);
              ...
              static_key_slow_dec(&key);
      
      The 'slow' prefix makes it abundantly clear that this is an
      expensive operation.
      
      I've updated all in-kernel code to use this everywhere. Note
      that I (intentionally) have not pushed through the renam...
      c5905afb
  32. Nov 18, 2011
    • Eric Dumazet's avatar
      net: use jump_label to shortcut RPS if not setup · adc9300e
      Eric Dumazet authored
      
      Most machines dont use RPS/RFS, and pay a fair amount of instructions in
      netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover
      RPS/RFS is not setup.
      
      Add a jump_label named rps_needed.
      
      If no device rps_map or global rps_sock_flow_table is setup,
      netif_receive_skb() / netif_rx() do a single instruction instead of many
      ones, including conditional jumps.
      
      jmp +0    (if CONFIG_JUMP_LABEL=y)
      
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      adc9300e
  33. May 28, 2011
  34. Apr 28, 2011
    • Eric Dumazet's avatar
      net: filter: Just In Time compiler for x86-64 · 0a14842f
      Eric Dumazet authored
      
      In order to speedup packet filtering, here is an implementation of a
      JIT compiler for x86_64
      
      It is disabled by default, and must be enabled by the admin.
      
      echo 1 >/proc/sys/net/core/bpf_jit_enable
      
      It uses module_alloc() and module_free() to get memory in the 2GB text
      kernel range since we call helpers functions from the generated code.
      
      EAX : BPF A accumulator
      EBX : BPF X accumulator
      RDI : pointer to skb   (first argument given to JIT function)
      RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
      r9d : skb->len - skb->data_len (headlen)
      r8  : skb->data
      
      To get a trace of generated code, use :
      
      echo 2 >/proc/sys/net/core/bpf_jit_enable
      
      Example of generated code :
      
      # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24
      
      flen=18 proglen=147 pass=3 image=ffffffffa00b5000
      JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
      JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
      JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
      JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
      JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
      JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
      JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
      JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
      JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
      JIT code: ffffffffa00b5090: c0 c9 c3
      
      BPF program is 144 bytes long, so native program is almost same size ;)
      
      (000) ldh      [12]
      (001) jeq      #0x800           jt 2    jf 8
      (002) ld       [26]
      (003) and      #0xffffff00
      (004) jeq      #0xc0a81400      jt 16   jf 5
      (005) ld       [30]
      (006) and      #0xffffff00
      (007) jeq      #0xc0a81400      jt 16   jf 17
      (008) jeq      #0x806           jt 10   jf 9
      (009) jeq      #0x8035          jt 10   jf 17
      (010) ld       [28]
      (011) and      #0xffffff00
      (012) jeq      #0xc0a81400      jt 16   jf 13
      (013) ld       [38]
      (014) and      #0xffffff00
      (015) jeq      #0xc0a81400      jt 16   jf 17
      (016) ret      #65535
      (017) ret      #0
      
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a14842f