  1. Sep 26, 2022
    • Lu Wei's avatar
      ipvlan: Fix out-of-bound bugs caused by unset skb->mac_header · 4b633c1e
      Lu Wei authored
      mainline inclusion
      from mainline-v6.0-rc6
      commit 81225b2ea161af48e093f58e8dfee6d705b16af4
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5SYBY
      
      
      CVE: NA
      
      --------------------------------
      
      If an AF_PACKET socket is used to send packets through ipvlan and the
      default xmit function of the AF_PACKET socket is changed from
      dev_queue_xmit() to packet_direct_xmit() via setsockopt() with the option
      name of PACKET_QDISC_BYPASS, the skb->mac_header may not be reset and
      remains at its initial value of 65535. This may trigger slab-out-of-bounds
      bugs such as the following:
      
      =================================================================
      BUG: KASAN: slab-out-of-bounds in ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan]
      CPU: 2 PID: 1768 Comm: raw_send Kdump: loaded Not tainted 6.0.0-rc4+ #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33
      Call Trace:
      print_address_description.constprop.0+0x1d/0x160
      print_report.cold+0x4f/0x112
      kasan_report+0xa3/0x130
      ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan]
      ipvlan_start_xmit+0x29/0xa0 [ipvlan]
      __dev_direct_xmit+0x2e2/0x380
      packet_direct_xmit+0x22/0x60
      packet_snd+0x7c9/0xc40
      sock_sendmsg+0x9a/0xa0
      __sys_sendto+0x18a/0x230
      __x64_sys_sendto+0x74/0x90
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The root cause is:
        1. packet_snd() only resets skb->mac_header when sock->type is SOCK_RAW
           and skb->protocol is not specified, as in packet_parse_headers().
      
        2. packet_direct_xmit() doesn't reset skb->mac_header the way
           dev_queue_xmit() does.
      
      In this case, skb->mac_header is 65535 when ipvlan_xmit_mode_l2() is
      called. So when ipvlan_xmit_mode_l2() gets the mac header with eth_hdr(),
      which uses "skb->head + skb->mac_header", an out-of-bound access occurs.
      
      This patch replaces eth_hdr() with skb_eth_hdr() in ipvlan_xmit_mode_l2()
      and resets the mac header in the multicast path to solve this out-of-bound bug.
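
      The difference between the two header lookups can be modelled in
      userspace. The sketch below is an illustration only: struct fake_skb and
      every fake_* helper are hypothetical stand-ins, not kernel code, and the
      14-byte Ethernet header length is the only constant taken from the real
      protocol. It shows why an offset of 65535 added to head lands outside
      the buffer, while a lookup based on data does not depend on mac_header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the few sk_buff fields involved (not the kernel struct). */
struct fake_skb {
	unsigned char *head;  /* start of the allocated buffer */
	unsigned char *data;  /* start of the frame, i.e. the Ethernet header on xmit */
	size_t buf_len;       /* bytes allocated behind head */
	uint16_t mac_header;  /* offset from head; 0xFFFF means "never set" */
};

#define FAKE_MAC_HEADER_UNSET ((uint16_t)~0U)
#define FAKE_ETH_HLEN 14

/* Models the idea of skb_mac_header_was_set(). */
static bool fake_mac_header_was_set(const struct fake_skb *skb)
{
	return skb->mac_header != FAKE_MAC_HEADER_UNSET;
}

/*
 * An eth_hdr()-style lookup reads at head + mac_header. This reports whether
 * a full Ethernet header at that offset would stay inside the buffer.
 */
static bool fake_eth_hdr_in_bounds(const struct fake_skb *skb)
{
	return (size_t)skb->mac_header + FAKE_ETH_HLEN <= skb->buf_len;
}

/* A skb_eth_hdr()-style lookup ignores mac_header and uses data. */
static const unsigned char *fake_skb_eth_hdr(const struct fake_skb *skb)
{
	return skb->data;
}

/* Models the idea of skb_reset_mac_header(): point mac_header at data. */
static void fake_reset_mac_header(struct fake_skb *skb)
{
	assert(skb->data >= skb->head);
	skb->mac_header = (uint16_t)(skb->data - skb->head);
}
```

      With a 256-byte buffer and mac_header left at 0xFFFF,
      fake_eth_hdr_in_bounds() reports false, whereas fake_skb_eth_hdr()
      still returns the start of the frame. After fake_reset_mac_header()
      both lookups agree.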
      
      Fixes: 2ad7bf36 ("ipvlan: Initial check-in of the IPVLAN driver.")
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      4b633c1e
  2. Sep 22, 2020
    • Mahesh Bandewar's avatar
      ipvlan: don't deref eth hdr before checking it's set · 64cbd7e0
      Mahesh Bandewar authored
      
      stable inclusion
      from linux-4.19.111
      commit eb273bb8205c7eeed8da0bca7842fa68fd62d0bb
      
      --------------------------------
      
      [ Upstream commit ad819276 ]
      
      IPvlan in L3 mode discards outbound multicast packets but performs
      the check before ensuring that the ether-header is set. This is
      an error that Eric found through code browsing.
      
      Fixes: 2ad7bf36 (“ipvlan: Initial check-in of the IPVLAN driver.”)
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Li Aichun <liaichun@huawei.com>
      Reviewed-by: guodeqing <geffrey.guo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      64cbd7e0
    • Eric Dumazet's avatar
      ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast() · d43caccb
      Eric Dumazet authored
      
      stable inclusion
      from linux-4.19.111
      commit cb9e7197bbebbdfd762c34f64b1e55d8b526c345
      
      --------------------------------
      
      [ Upstream commit afe207d8 ]
      
      Commit e18b353f ("ipvlan: add cond_resched_rcu() while
      processing muticast backlog") added a cond_resched_rcu() in a loop
      using rcu protection to iterate over slaves.
      
      This breaks RCU rules, so let's instead use cond_resched()
      at a point where we can reschedule.
      
      Fixes: e18b353f ("ipvlan: add cond_resched_rcu() while processing muticast backlog")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Li Aichun <liaichun@huawei.com>
      Reviewed-by: guodeqing <geffrey.guo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d43caccb
    • Mahesh Bandewar's avatar
      ipvlan: add cond_resched_rcu() while processing muticast backlog · 0d961ec2
      Mahesh Bandewar authored
      
      stable inclusion
      from linux-4.19.111
      commit 79a958d8a1e55bb66fe74a49d31fd8aa8474dfc0
      
      --------------------------------
      
      [ Upstream commit e18b353f ]
      
      If a substantial number of slaves are created, as simulated by
      Syzbot, the backlog processing could take much longer and result
      in the issue found in the Syzbot report.
      
      INFO: rcu_sched detected stalls on CPUs/tasks:
              (detected by 1, t=10502 jiffies, g=5049, c=5048, q=752)
      All QSes seen, last rcu_sched kthread activity 10502 (4294965563-4294955061), jiffies_till_next_fqs=1, root ->qsmask 0x0
      syz-executor.1  R  running task on cpu   1  10984 11210   3866 0x30020008 179034491270
      Call Trace:
       <IRQ>
       [<ffffffff81497163>] _sched_show_task kernel/sched/core.c:8063 [inline]
       [<ffffffff81497163>] _sched_show_task.cold+0x2fd/0x392 kernel/sched/core.c:8030
       [<ffffffff8146a91b>] sched_show_task+0xb/0x10 kernel/sched/core.c:8073
       [<ffffffff815c931b>] print_other_cpu_stall kernel/rcu/tree.c:1577 [inline]
       [<ffffffff815c931b>] check_cpu_stall kernel/rcu/tree.c:1695 [inline]
       [<ffffffff815c931b>] __rcu_pending kernel/rcu/tree.c:3478 [inline]
       [<ffffffff815c931b>] rcu_pending kernel/rcu/tree.c:3540 [inline]
       [<ffffffff815c931b>] rcu_check_callbacks.cold+0xbb4/0xc29 kernel/rcu/tree.c:2876
       [<ffffffff815e3962>] update_process_times+0x32/0x80 kernel/time/timer.c:1635
       [<ffffffff816164f0>] tick_sched_handle+0xa0/0x180 kernel/time/tick-sched.c:161
       [<ffffffff81616ae4>] tick_sched_timer+0x44/0x130 kernel/time/tick-sched.c:1193
       [<ffffffff815e75f7>] __run_hrtimer kernel/time/hrtimer.c:1393 [inline]
       [<ffffffff815e75f7>] __hrtimer_run_queues+0x307/0xd90 kernel/time/hrtimer.c:1455
       [<ffffffff815e90ea>] hrtimer_interrupt+0x2ea/0x730 kernel/time/hrtimer.c:1513
       [<ffffffff844050f4>] local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1031 [inline]
       [<ffffffff844050f4>] smp_apic_timer_interrupt+0x144/0x5e0 arch/x86/kernel/apic/apic.c:1056
       [<ffffffff84401cbe>] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
      RIP: 0010:do_raw_read_lock+0x22/0x80 kernel/locking/spinlock_debug.c:153
      RSP: 0018:ffff8801dad07ab8 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff12
      RAX: 0000000000000000 RBX: ffff8801c4135680 RCX: 0000000000000000
      RDX: 1ffff10038826afe RSI: ffff88019d816bb8 RDI: ffff8801c41357f0
      RBP: ffff8801dad07ac0 R08: 0000000000004b15 R09: 0000000000310273
      R10: ffff88019d816bb8 R11: 0000000000000001 R12: ffff8801c41357e8
      R13: 0000000000000000 R14: ffff8801cfb19850 R15: ffff8801cfb198b0
       [<ffffffff8101460e>] __raw_read_lock_bh include/linux/rwlock_api_smp.h:177 [inline]
       [<ffffffff8101460e>] _raw_read_lock_bh+0x3e/0x50 kernel/locking/spinlock.c:240
       [<ffffffff840d78ca>] ipv6_chk_mcast_addr+0x11a/0x6f0 net/ipv6/mcast.c:1006
       [<ffffffff84023439>] ip6_mc_input+0x319/0x8e0 net/ipv6/ip6_input.c:482
       [<ffffffff840211c8>] dst_input include/net/dst.h:449 [inline]
       [<ffffffff840211c8>] ip6_rcv_finish+0x408/0x610 net/ipv6/ip6_input.c:78
       [<ffffffff840214de>] NF_HOOK include/linux/netfilter.h:292 [inline]
       [<ffffffff840214de>] NF_HOOK include/linux/netfilter.h:286 [inline]
       [<ffffffff840214de>] ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:278
       [<ffffffff83a29efa>] __netif_receive_skb_one_core+0x12a/0x1f0 net/core/dev.c:5303
       [<ffffffff83a2a15c>] __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:5417
       [<ffffffff83a2f536>] process_backlog+0x216/0x6c0 net/core/dev.c:6243
       [<ffffffff83a30d1b>] napi_poll net/core/dev.c:6680 [inline]
       [<ffffffff83a30d1b>] net_rx_action+0x47b/0xfb0 net/core/dev.c:6748
       [<ffffffff846002c8>] __do_softirq+0x2c8/0x99a kernel/softirq.c:317
       [<ffffffff813e656a>] invoke_softirq kernel/softirq.c:399 [inline]
       [<ffffffff813e656a>] irq_exit+0x16a/0x1a0 kernel/softirq.c:439
       [<ffffffff84405115>] exiting_irq arch/x86/include/asm/apic.h:561 [inline]
       [<ffffffff84405115>] smp_apic_timer_interrupt+0x165/0x5e0 arch/x86/kernel/apic/apic.c:1058
       [<ffffffff84401cbe>] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
       </IRQ>
      RIP: 0010:__sanitizer_cov_trace_pc+0x26/0x50 kernel/kcov.c:102
      RSP: 0018:ffff880196033bd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
      RAX: ffff88019d8161c0 RBX: 00000000ffffffff RCX: ffffc90003501000
      RDX: 0000000000000002 RSI: ffffffff816236d1 RDI: 0000000000000005
      RBP: ffff880196033bd8 R08: ffff88019d8161c0 R09: 0000000000000000
      R10: 1ffff10032c067f0 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000
       [<ffffffff816236d1>] do_futex+0x151/0x1d50 kernel/futex.c:3548
       [<ffffffff816260f0>] C_SYSC_futex kernel/futex_compat.c:201 [inline]
       [<ffffffff816260f0>] compat_SyS_futex+0x270/0x3b0 kernel/futex_compat.c:175
       [<ffffffff8101da17>] do_syscall_32_irqs_on arch/x86/entry/common.c:353 [inline]
       [<ffffffff8101da17>] do_fast_syscall_32+0x357/0xe1c arch/x86/entry/common.c:415
       [<ffffffff84401a9b>] entry_SYSENTER_compat+0x8b/0x9d arch/x86/entry/entry_64_compat.S:139
      RIP: 0023:0xf7f23c69
      RSP: 002b:00000000f5d1f12c EFLAGS: 00000282 ORIG_RAX: 00000000000000f0
      RAX: ffffffffffffffda RBX: 000000000816af88 RCX: 0000000000000080
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000816af8c
      RBP: 00000000f5d1f228 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      rcu_sched kthread starved for 10502 jiffies! g5049 c5048 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1
      rcu_sched       R  running task on cpu   1  13048     8      2 0x90000000 179099587640
      Call Trace:
       [<ffffffff8147321f>] context_switch+0x60f/0xa60 kernel/sched/core.c:3209
       [<ffffffff8100095a>] __schedule+0x5aa/0x1da0 kernel/sched/core.c:3934
       [<ffffffff810021df>] schedule+0x8f/0x1b0 kernel/sched/core.c:4011
       [<ffffffff8101116d>] schedule_timeout+0x50d/0xee0 kernel/time/timer.c:1803
       [<ffffffff815c13f1>] rcu_gp_kthread+0xda1/0x3b50 kernel/rcu/tree.c:2327
       [<ffffffff8144b318>] kthread+0x348/0x420 kernel/kthread.c:246
       [<ffffffff84400266>] ret_from_fork+0x56/0x70 arch/x86/entry/entry_64.S:393
      
      Fixes: ba35f858 (“ipvlan: Defer multicast / broadcast processing to a work-queue”)
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Li Aichun <liaichun@huawei.com>
      Reviewed-by: guodeqing <geffrey.guo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0d961ec2
  3. Dec 27, 2019
    • linmiaohe's avatar
      ipvlan: disable l2e local xmit · 36ca2808
      linmiaohe authored and 谢秀奇 committed
      
      euler inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      
      Ipvlan l2e mode will cache skbuff for local xmit in
      ipvlan_xmit_mode_l2e. But when tso/gso is disabled,
      this would result in performance loss.
      
      So we should stop caching the skbuff when tso/gso is
      disabled.
      
      Signed-off-by: linmiaohe <linmiaohe@huawei.com>
      Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Keefe Liu <liuqifa@huawei.com>
      Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      36ca2808
    • Keefe LIU's avatar
      ipvlan: Introduce local xmit queue for l2e mode · 53d0bf95
      Keefe LIU authored and 谢秀奇 committed
      
      euler inclusion
      category: feature
      bugzilla: 9511
      CVE: NA
      
      -------------------------------------------------
      
      Consider two IPVlan devices set up on the same master. When they
      communicate with each other over TCP, the receive side is too fast
      for the sent packets to be coalesced, so the performance is not as
      good as expected.
      
      This patch introduces a local xmit queue for l2e mode. When packets
      are sent to IPVlan devices on the same master, they are cloned and
      added to the local xmit queue. This lets the sent packets be
      coalesced and improves TCP performance in this case.
      
      Signed-off-by: Keefe LIU <liuqifa@huawei.com>
      Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      53d0bf95
    • Keefe LIU's avatar
      ipvlan: Introduce l2e mode · 40247945
      Keefe LIU authored and 谢秀奇 committed
      
      euler inclusion
      category: feature
      bugzilla: 9511
      CVE: NA
      
      -------------------------------------------------
      
      In a typical IPvlan L2 setup, the master is in default-ns and each
      slave is in a different (slave) ns. In this setup, if the master and
      the slaves are in different nets, egress traffic originating from a
      slave-ns can't be forwarded to the master or to another machine
      whose ip is in the same net as the master, and it can't be
      forwarded to other interfaces in default-ns.
      
      This patch introduces a new mode, l2e, for ipvlan to achieve the above
      goals. It doesn't affect the original l2, l3 and l3s modes.
      
      As the ip tool doesn't support l2e mode, we use the module param
      "ipvlan_default_mode" to set the default work mode: 0 for l2,
      1 for l3, 2 for l2e, 3 for l3s; other values are currently invalid.
      Note that when we create ipvlan devices with the "ip" command and
      assign a mode, ipvlan will work in the assigned mode rather
      than the "ipvlan_default_mode".
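
      The value mapping and the precedence rule above can be sketched as
      follows. This is a userspace illustration, not the driver code: the
      sketch_* names are hypothetical, and only the numeric values (0 l2,
      1 l3, 2 l2e, 3 l3s) and the "explicit mode wins" rule come from this
      commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; values follow the commit message. */
enum sketch_mode {
	SKETCH_L2  = 0,
	SKETCH_L3  = 1,
	SKETCH_L2E = 2,
	SKETCH_L3S = 3,
};

/* Returns the mode name for a parameter value, or NULL if the value is invalid. */
static const char *sketch_mode_name(int value)
{
	switch (value) {
	case SKETCH_L2:  return "l2";
	case SKETCH_L3:  return "l3";
	case SKETCH_L2E: return "l2e";
	case SKETCH_L3S: return "l3s";
	default:         return NULL;
	}
}

/* Returns the parameter value for a mode name, or -1 if the name is unknown. */
static int sketch_mode_from_name(const char *name)
{
	assert(name != NULL);
	for (int v = SKETCH_L2; v <= SKETCH_L3S; v++) {
		if (strcmp(name, sketch_mode_name(v)) == 0)
			return v;
	}
	return -1;
}

/* A mode assigned at device creation overrides the module-wide default. */
static int sketch_effective_mode(int default_mode, bool has_explicit,
				 int explicit_mode)
{
	return has_explicit ? explicit_mode : default_mode;
}
```

      For example, with a default of 2 (l2e), a device created without a
      mode resolves to l2e, while one created with an explicit l3 mode
      resolves to l3.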
      
      Signed-off-by: Keefe LIU <liuqifa@huawei.com>
      Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      40247945
  4. Mar 05, 2018
  5. Mar 01, 2018
    • Paolo Abeni's avatar
      ipvlan: use per device spinlock to protect addrs list updates · 82308194
      Paolo Abeni authored
      
      This changeset moves ipvlan address under RCU protection, using
      a per ipvlan device spinlock to protect list mutation and RCU
      read access to protect list traversal.
      
      Also explicitly use RCU read lock to traverse the per port
      ipvlans list, so that we can now perform a full address lookup
      without asserting the RTNL lock.
      
      Overall this allows the ipvlan driver to check fully for duplicate
      addresses (before this commit, ipv6 addresses assigned by autoconf
      via prefix delegation were accepted without any check) and avoids
      the following RTNL assertion failure in the same code path:
      
       RTNL: assertion failed at drivers/net/ipvlan/ipvlan_core.c (124)
       WARNING: CPU: 15 PID: 0 at drivers/net/ipvlan/ipvlan_core.c:124 ipvlan_addr_busy+0x97/0xa0 [ipvlan]
       Modules linked in: ipvlan(E) ixgbe
       CPU: 15 PID: 0 Comm: swapper/15 Tainted: G            E    4.16.0-rc2.ipvlan+ #1782
       Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
       RIP: 0010:ipvlan_addr_busy+0x97/0xa0 [ipvlan]
       RSP: 0018:ffff881ff9e03768 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff881fdf2a9000 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 00000000000000f6 RDI: 0000000000000300
       RBP: ffff881fdf2a8000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: ffff881ff9e034c0 R12: ffff881fe07bcc00
       R13: 0000000000000001 R14: ffffffffa02002b0 R15: 0000000000000001
       FS:  0000000000000000(0000) GS:ffff881ff9e00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fc5c1a4f248 CR3: 000000207e012005 CR4: 00000000001606e0
       Call Trace:
        <IRQ>
        ipvlan_addr6_event+0x6c/0xd0 [ipvlan]
        notifier_call_chain+0x49/0x90
        atomic_notifier_call_chain+0x6a/0x100
        ipv6_add_addr+0x5f9/0x720
        addrconf_prefix_rcv_add_addr+0x244/0x3c0
        addrconf_prefix_rcv+0x2f3/0x790
        ndisc_router_discovery+0x633/0xb70
        ndisc_rcv+0x155/0x180
        icmpv6_rcv+0x4ac/0x5f0
        ip6_input_finish+0x138/0x6a0
        ip6_input+0x41/0x1f0
        ipv6_rcv+0x4db/0x8d0
        __netif_receive_skb_core+0x3d5/0xe40
        netif_receive_skb_internal+0x89/0x370
        napi_gro_receive+0x14f/0x1e0
        ixgbe_clean_rx_irq+0x4ce/0x1020 [ixgbe]
        ixgbe_poll+0x31a/0x7a0 [ixgbe]
        net_rx_action+0x296/0x4f0
        __do_softirq+0xcf/0x4f5
        irq_exit+0xf5/0x110
        do_IRQ+0x62/0x110
        common_interrupt+0x91/0x91
        </IRQ>
      
       v1 -> v2: drop unneeded in_softirq check in ipvlan_addr6_validator_event()
      
      Fixes: e9997c29 ("ipvlan: fix check for IP addresses in control path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      82308194
    • Paolo Abeni's avatar
      ipvlan: egress mcast packets are not exceptional · cccc200f
      Paolo Abeni authored
      
      Currently, if IPv6 is enabled on top of an ipvlan device in l3
      mode, the following warning message:
      
       Dropped {multi|broad}cast of type= [86dd]
      
      is emitted every time an RS is generated, and dmesg is soon
      filled with irrelevant messages. Replace pr_warn with pr_debug,
      to preserve debuggability without scaring the sysadmin.
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cccc200f
  6. Feb 22, 2018
  7. Dec 16, 2017
    • Mahesh Bandewar's avatar
      ipvlan: remove excessive packet scrubbing · c0d451c8
      Mahesh Bandewar authored
      
      IPvlan currently scrubs packets at every location where packets may be
      crossing a namespace boundary. Though this is desirable, IPvlan currently
      does it more than necessary, e.g. packets that are going to take the
      dev_forward_skb() path will get scrubbed there, so there is no point in
      scrubbing them before forwarding. Another side-effect of scrubbing is that
      pkt-type gets set to PACKET_HOST, which overrides what had already been set
      by the earlier path and causes erroneous delivery of the packets.
      
      Also, scrubbing packets just before calling dev_queue_xmit() has detrimental
      effects since packets lose skb->sk and, because of that, miss prio updates,
      get incorrect socket back-pressure, and can even break TSQ.
      
      Fixes: b93dd49c ('ipvlan: Scrub skb before crossing the namespace boundary')
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0d451c8
    • Mahesh Bandewar's avatar
      Revert "ipvlan: add L2 check for packets arriving via virtual devices" · 918150cb
      Mahesh Bandewar authored
      
      This reverts commit 92ff4264.
      
      Even though the check added is not that taxing, it's not really needed.
      First of all this will be per packet cost and second thing is that the
      eth_type_trans() already does this correctly. The excessive scrubbing
      in IPvlan was changing the pkt-type skb metadata of the packet which
      made it necessary to re-check the mac. The subsequent patch in this
      series removes the faulty packet-scrub.
      
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      918150cb
  8. Dec 12, 2017
    • Mahesh Bandewar's avatar
      ipvlan: add L2 check for packets arriving via virtual devices · 92ff4264
      Mahesh Bandewar authored
      
      Packets that don't have dest mac as the mac of the master device should
      not be entertained by the IPvlan rx-handler. This is mostly true as the
      packet path mostly takes care of that, except when the master device is
      a virtual device. As demonstrated in the following case -
      
        ip netns add ns1
        ip link add ve1 type veth peer name ve2
        ip link add link ve2 name iv1 type ipvlan mode l2
        ip link set dev iv1 netns ns1
        ip link set ve1 up
        ip link set ve2 up
        ip -n ns1 link set iv1 up
        ip addr add 192.168.10.1/24 dev ve1
        ip -n ns1 addr add 192.168.10.2/24 dev iv1
        ping -c2 192.168.10.2
        <Works!>
        ip neigh show dev ve1
        ip neigh show 192.168.10.2 lladdr <random> dev ve1
        ping -c2 192.168.10.2
        <Still works! Wrong!!>
      
      This patch adds that missing check in the IPvlan rx-handler.
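
      The shape of such a check can be sketched in userspace as below. This
      is not the patch itself: the sketch_* names are hypothetical, and
      letting multicast and broadcast destinations through is an assumption
      made for the illustration. Only the rule "the destination mac must be
      the master's mac" comes from this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6

/* The low bit of the first octet marks a group (multicast/broadcast) address. */
static bool sketch_is_multicast(const uint8_t *addr)
{
	return (addr[0] & 0x01) != 0;
}

/*
 * Accept a frame in the rx-handler only if it is addressed to the master
 * device. Group addresses are passed through (assumption for this sketch).
 */
static bool sketch_rx_accept(const uint8_t *dest, const uint8_t *master)
{
	assert(dest != NULL && master != NULL);
	if (sketch_is_multicast(dest))
		return true;
	return memcmp(dest, master, SKETCH_ETH_ALEN) == 0;
}
```

      In the veth scenario above, a frame sent to the random lladdr would
      fail the memcmp() and be rejected instead of reaching iv1.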
      
      Reported-by: Amit Sikka <amit.sikka@ericsson.com>
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92ff4264
  9. Dec 07, 2017
  10. Dec 03, 2017
  11. Nov 24, 2017
  12. Nov 11, 2017
  13. Oct 29, 2017
    • Mahesh Bandewar's avatar
      ipvlan: implement VEPA mode · fe89aa6b
      Mahesh Bandewar authored
      
      This is very similar to the Macvlan VEPA mode, however, there is some
      difference. IPvlan uses the mac-address of the lower device, so the VEPA
      mode has implications of ICMP-redirects for packets destined for its
      immediate neighbors sharing same master since the packets will have same
      source and dest mac. The external switch/router will send redirect msg.
      
      Having said that, this will be a useful tool for debugging,
      since IPvlan will not switch packets among its slaves and will rely
      completely on the external entity, as intended in 802.1Qbg.
      
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe89aa6b
    • Mahesh Bandewar's avatar
      ipvlan: introduce 'private' attribute for all existing modes. · a190d04d
      Mahesh Bandewar authored
      
      IPvlan has always operated in bridge mode. However, there are scenarios
      where each slave should be able to talk through the master device but
      not necessarily to the other slaves. Think of an environment where each
      namespace is a private and independent customer. In this scenario
      the machine hosting these namespaces does not want to reveal who
      the neighbors are, nor do the individual namespaces care to talk to a
      neighbor over a short-circuited network path.
      
      This patch implements the mode that is very similar to the 'private' mode
      in macvlan where individual slaves can send and receive traffic through
      the master device, just that they can not talk among slave devices.
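
      The forwarding decision for bridge, private and VEPA behaviour (the
      latter described in the VEPA commit listed above) can be summarised
      in a small userspace sketch. The enum, the flag names and the function
      are hypothetical and only model the behaviour described in the two
      commit messages; they are not the driver's data path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; no flag means bridge behaviour. */
#define SKETCH_F_PRIVATE 0x01u
#define SKETCH_F_VEPA    0x02u

enum sketch_action {
	SKETCH_DELIVER_LOCAL,   /* switch directly to the sibling slave */
	SKETCH_DROP,            /* slave-to-slave traffic is not allowed */
	SKETCH_SEND_TO_MASTER,  /* hand the packet to the master device */
};

/* Decide what happens to a packet transmitted by a slave. */
static enum sketch_action sketch_slave_tx(unsigned int flags,
					  bool dest_is_sibling_slave)
{
	/* Unknown flag bits are a caller bug in this sketch. */
	assert((flags & ~(SKETCH_F_PRIVATE | SKETCH_F_VEPA)) == 0);

	/* VEPA: never switch locally, rely on the external switch/router. */
	if (flags & SKETCH_F_VEPA)
		return SKETCH_SEND_TO_MASTER;

	/* Traffic for anything other than a sibling always goes out. */
	if (!dest_is_sibling_slave)
		return SKETCH_SEND_TO_MASTER;

	/* Private: siblings cannot talk to each other. */
	if (flags & SKETCH_F_PRIVATE)
		return SKETCH_DROP;

	/* Bridge: short-circuit delivery between siblings. */
	return SKETCH_DELIVER_LOCAL;
}
```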
      
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a190d04d
  14. Feb 12, 2017
  15. Dec 29, 2016
  16. Dec 24, 2016
    • Mahesh Bandewar's avatar
      ipvlan: fix multicast processing · e2525360
      Mahesh Bandewar authored
      
      In an IPvlan setup, a crash (see the trace below) occurs when the
      master is set in loopback mode, e.g.
      
        ethtool -K eth0 loopback on
      
        where eth0 is the master device for the IPvlan setup.
      
      The failure is caused by the faulty logic that determines if the
      packet is from TX-path vs. RX-path by just looking at the mac-
      addresses on the packet while processing multicast packets.
      
      In the loopback mode where this crash was happening, the packets
      that are sent out are reflected by the NIC and are processed on
      the RX path, but the mac-address check tricks the code into thinking
      the packet is from the TX path, and it wrongly uses dev_forward_skb()
      to pass packets to the slave (virtual) devices.
      
      This patch records the path while queueing packets and eliminates
      logic of looking at mac-addresses for the same decision.
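
      The idea of recording the path at enqueue time can be sketched in
      userspace as below. Everything here is hypothetical: struct sketch_pkt
      and the helpers are not driver code, and the sketch assumes the old
      heuristic compared the source mac with the master's mac, which this
      commit message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct sketch_pkt {
	uint8_t src_mac[6];
	bool tx_pkt;  /* recorded when the packet is queued */
};

/* New approach: the caller knows which path it is on, so store that fact. */
static void sketch_enqueue(struct sketch_pkt *pkt, bool from_tx_path)
{
	assert(pkt != NULL);
	pkt->tx_pkt = from_tx_path;
}

/* Old-style heuristic (assumed): "our own source mac means TX path". */
static bool sketch_guess_tx_by_mac(const struct sketch_pkt *pkt,
				   const uint8_t *master_mac)
{
	return memcmp(pkt->src_mac, master_mac, sizeof(pkt->src_mac)) == 0;
}
```

      A frame reflected by a NIC in loopback mode carries the master's own
      source mac, so the heuristic reports TX even though the frame was
      queued from the RX path. The recorded flag stays correct.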
      
      ------------[ cut here ]------------
      kernel BUG at include/linux/skbuff.h:1737!
      Call Trace:
       [<ffffffff921fbbc2>] dev_forward_skb+0x92/0xd0
       [<ffffffffc031ac65>] ipvlan_process_multicast+0x395/0x4c0 [ipvlan]
       [<ffffffffc031a9a7>] ? ipvlan_process_multicast+0xd7/0x4c0 [ipvlan]
       [<ffffffff91cdfea7>] ? process_one_work+0x147/0x660
       [<ffffffff91cdff09>] process_one_work+0x1a9/0x660
       [<ffffffff91cdfea7>] ? process_one_work+0x147/0x660
       [<ffffffff91ce086d>] worker_thread+0x11d/0x360
       [<ffffffff91ce0750>] ? rescuer_thread+0x350/0x350
       [<ffffffff91ce960b>] kthread+0xdb/0xe0
       [<ffffffff91c05c70>] ? _raw_spin_unlock_irq+0x30/0x50
       [<ffffffff91ce9530>] ? flush_kthread_worker+0xc0/0xc0
       [<ffffffff92348b7a>] ret_from_fork+0x9a/0xd0
       [<ffffffff91ce9530>] ? flush_kthread_worker+0xc0/0xc0
      
      Fixes: ba35f858 ("ipvlan: Defer multicast / broadcast processing to a work-queue")
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2525360
    • Eric Dumazet's avatar
      ipvlan: fix various issues in ipvlan_process_multicast() · b1227d01
      Eric Dumazet authored
      
      1) netif_rx() / dev_forward_skb() should not be called from process
      context.
      
      2) ipvlan_count_rx() should be called with preemption disabled.
      
      3) We should check if ipvlan->dev is up before feeding packets
      to netif_rx()
      
      4) We need to prevent device from disappearing if some packets
      are in the multicast backlog.
      
      5) One kfree_skb() should be a consume_skb() eventually
      
      Fixes: ba35f858 ("ipvlan: Defer multicast / broadcast processing to
      a work-queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1227d01
  17. Sep 19, 2016
    • Mahesh Bandewar's avatar
      ipvlan: Introduce l3s mode · 4fbae7d8
      Mahesh Bandewar authored
      
      In a typical IPvlan L3 setup, the master is in default-ns and
      each slave is in a different (slave) ns. In this setup egress
      packet processing for traffic originating from slave-ns will
      hit all NF_HOOKs in slave-ns as well as default-ns. However same
      is not true for ingress processing. All these NF_HOOKs are
      hit only in the slave-ns skipping them in the default-ns.
      IPvlan in L3 mode is restrictive and if admins want to deploy
      iptables rules in default-ns, this asymmetric data path makes it
      impossible to do so.
      
      This patch makes use of the l3_rcv() (added as part of l3mdev
      enhancements) to perform input route lookup on RX packets without
      changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
      to change the skb->dev just before handing over skb to L4.
      
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fbae7d8
  18. Jul 26, 2016
  19. Feb 22, 2016
  20. Nov 18, 2015
  21. Oct 22, 2015
    • Brenden Blanco's avatar
      ipvlan: read direct ifindex instead of iflink · 63b11e75
      Brenden Blanco authored
      
      In the ipv4 outbound path of an ipvlan device in l3 mode, the ifindex is
      being grabbed from dev_get_iflink. This works for the physical device
      case, since as the documentation of that function notes: "Physical
      interfaces have the same 'ifindex' and 'iflink' values.".  However, if
      the master device is a veth, and the pairs are in separate net
      namespaces, the route lookup will fail with -ENODEV due to outer veth
      pair being in a separate namespace from the ipvlan master/routing
      namespace.
      
        ns0    |   ns1    |   ns2
       veth0a--|--veth0b--|--ipvl0
      
      In ipvlan_process_v4_outbound(), a packet sent from ipvl0 in the above
      configuration will pass fl.flowi4_oif == veth0a to
      ip_route_output_flow(), but *net == ns1.
      
      Notice also that ipv6 processing is not using iflink. Since there is a
      discrepancy in usage, fix up both the v4 and v6 cases to use the local
      dev variable.
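
      Why the lookup fails can be illustrated with a small userspace model.
      The device table, the index numbers and the sketch_* names below are
      made up for the illustration. The only things taken from this commit
      message are that the master's iflink refers to veth0a while the route
      lookup runs in ns1.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dev {
	const char *name;
	int ifindex;  /* index of this device */
	int iflink;   /* for a veth, the index of its peer */
	int netns;    /* namespace the device lives in */
};

/* Find a device by index, but only within the given namespace. */
static const struct sketch_dev *sketch_lookup(const struct sketch_dev *devs,
					      size_t n, int netns, int ifindex)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (devs[i].netns == netns && devs[i].ifindex == ifindex)
			return &devs[i];
	}
	return NULL;  /* the analogue of -ENODEV in this sketch */
}
```

      With veth0a (ifindex 5, ns0) and veth0b (ifindex 7, iflink 5, ns1),
      looking up index 5 in ns1 finds nothing, while looking up veth0b's own
      ifindex 7 in ns1 succeeds.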
      
      Tested this with l3 ipvlan on top of veth, as well as with single
      physical interface in the top namespace.
      
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Reviewed-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      63b11e75
  22. Oct 08, 2015
  23. Jul 16, 2015
    • WANG Cong's avatar
      ipvlan: use rcu_deference_bh() in ipvlan_queue_xmit() · 0fba37a3
      WANG Cong authored
      
      In the tx path rcu_read_lock_bh() is held, so we need rcu_dereference_bh().
      This fixes the following warning:
      
       ===============================
       [ INFO: suspicious RCU usage. ]
       4.1.0-rc1+ #1007 Not tainted
       -------------------------------
       drivers/net/ipvlan/ipvlan.h:106 suspicious rcu_dereference_check() usage!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 1, debug_locks = 0
       1 lock held by dhclient/1076:
        #0:  (rcu_read_lock_bh){......}, at: [<ffffffff817e8d84>] rcu_lock_acquire+0x0/0x26
      
       stack backtrace:
       CPU: 2 PID: 1076 Comm: dhclient Not tainted 4.1.0-rc1+ #1007
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000001 ffff8800d381bac8 ffffffff81a4154f 000000003c1a3c19
        ffff8800d4d0a690 ffff8800d381baf8 ffffffff810b849f ffff880117d41148
        ffff880117d40000 ffff880117d40068 0000000000000156 ffff8800d381bb18
       Call Trace:
        [<ffffffff81a4154f>] dump_stack+0x4c/0x65
        [<ffffffff810b849f>] lockdep_rcu_suspicious+0x107/0x110
        [<ffffffff8165a522>] ipvlan_port_get_rcu+0x47/0x4e
        [<ffffffff8165ad14>] ipvlan_queue_xmit+0x35/0x450
        [<ffffffff817ea45d>] ? rcu_read_unlock+0x3e/0x5f
        [<ffffffff810a20bf>] ? local_clock+0x19/0x22
        [<ffffffff810b4781>] ? __lock_is_held+0x39/0x52
        [<ffffffff8165b64c>] ipvlan_start_xmit+0x1b/0x44
        [<ffffffff817edf7f>] dev_hard_start_xmit+0x2ae/0x467
        [<ffffffff817ee642>] __dev_queue_xmit+0x50a/0x60c
        [<ffffffff817ee7a7>] dev_queue_xmit_sk+0x13/0x15
        [<ffffffff81997596>] dev_queue_xmit+0x10/0x12
        [<ffffffff8199b41c>] packet_sendmsg+0xb6b/0xbdf
        [<ffffffff810b5ea7>] ? mark_lock+0x2e/0x226
        [<ffffffff810a1fcc>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff817d56f9>] sock_sendmsg_nosec+0x12/0x1d
        [<ffffffff817d7257>] sock_sendmsg+0x29/0x2e
        [<ffffffff817d72cc>] sock_write_iter+0x70/0x91
        [<ffffffff81199563>] __vfs_write+0x7e/0xa7
        [<ffffffff811996bc>] vfs_write+0x92/0xe8
        [<ffffffff811997d7>] SyS_write+0x47/0x7e
        [<ffffffff81a4d517>] system_call_fastpath+0x12/0x6f
      
      Fixes: 2ad7bf36 ("ipvlan: Initial check-in of the IPVLAN driver.")
      Cc: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0fba37a3
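
The warning in the commit above comes from a mismatch between the read-side lock that is held and the dereference helper that is used. The toy model below sketches that check in userspace. It is an assumption-laden simplification: `model_ctx`, `model_flavor` and `model_deref()` are hypothetical names, and the real lockdep checks accept more conditions than a single "held" flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-side flavors; stand-ins for rcu_read_lock() and
 * rcu_read_lock_bh() in this model only. */
enum model_flavor { MODEL_NONE = 0, MODEL_RCU, MODEL_RCU_BH };

struct model_ctx {
    enum model_flavor held; /* which read-side section the caller is in */
    int warnings;           /* number of "suspicious usage" reports */
};

/* Returns the pointer unchanged, but counts a warning when the flavor the
 * helper expects differs from the flavor actually held. */
void *model_deref(struct model_ctx *ctx, void *p, enum model_flavor want)
{
    assert(ctx != NULL);
    if (ctx->held != want)
        ctx->warnings++;
    return p;
}
```

In this model the tx path corresponds to `held == MODEL_RCU_BH`: calling `model_deref()` with `MODEL_RCU` counts a warning, and calling it with `MODEL_RCU_BH` does not.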
    • ipvlan: unhash addresses without synchronize_rcu · 6640e673
      Konstantin Khlebnikov authored
      
      All structures used in traffic forwarding are RCU-protected:
      ipvl_addr, ipvl_dev and ipvl_port. Thus we can unhash addresses
      without synchronization. An address is hashed back into the same
      bucket anyway; in the worst case a lockless lookup scans the hash
      once again.
      
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6640e673
  24. May 06, 2015
    • ipvlan: Defer multicast / broadcast processing to a work-queue · ba35f858
      Mahesh Bandewar authored
      
      Processing multicast / broadcast in the fast path drains performance,
      and having more links means more cloning, which brings performance
      down further.
      
      Broadcast, in particular, needs to be delivered to all the virtual
      links. The earlier trick of enabling the broadcast bit only for
      IPv4-only interfaces does not really work, since it breaks autoconf.
      This means broadcast has to be enabled for all the links unless
      protocol-specific hacks are added to the driver.
      
      This patch defers all (incoming as well as outgoing) multicast traffic to
      a work-queue, leaving only the unicast traffic in the fast path. If we
      later need additional tricks to further reduce the impact of this
      (multicast / broadcast) type of traffic, they can be implemented while
      processing this work without affecting the fast path.
      
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba35f858
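
The split between a unicast fast path and deferred multicast / broadcast handling, as described in the commit above, can be sketched in userspace as follows. This is a minimal model, not the driver's code: `model_pkt`, `model_port`, `model_xmit()` and `model_process_backlog()` are hypothetical names, and the bounded backlog with a drop counter is an assumption of the sketch. The only fact relied on is that an Ethernet destination address is multicast (broadcast included) when the lowest bit of its first octet is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BACKLOG_MAX 8 /* assumed limit, chosen for illustration */

struct model_pkt {
    uint8_t dst[6]; /* destination MAC address */
};

struct model_port {
    const struct model_pkt *backlog[MODEL_BACKLOG_MAX];
    size_t backlog_len; /* packets waiting for deferred processing */
    size_t fast_path;   /* unicast packets handled immediately */
    size_t dropped;     /* multicast packets dropped on a full backlog */
};

/* Multicast and broadcast addresses have the group bit set. */
bool model_is_multicast(const uint8_t *addr)
{
    return (addr[0] & 0x01) != 0;
}

/* Unicast stays in the fast path; everything else is queued for later. */
void model_xmit(struct model_port *port, const struct model_pkt *pkt)
{
    assert(port != NULL && pkt != NULL);
    if (!model_is_multicast(pkt->dst)) {
        port->fast_path++;
        return;
    }
    if (port->backlog_len == MODEL_BACKLOG_MAX) {
        port->dropped++;
        return;
    }
    port->backlog[port->backlog_len++] = pkt;
}

/* Stand-in for the deferred work: drains the backlog, returns the count. */
size_t model_process_backlog(struct model_port *port)
{
    size_t n = port->backlog_len;

    port->backlog_len = 0;
    return n;
}
```

Any per-link cloning or filtering would happen inside `model_process_backlog()`, which is the point of the design: that cost is paid outside the path taken by unicast packets.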
  25. Apr 03, 2015