
I am experimenting with netfilter in a Docker container. I have three containers, one a "router", and two "endpoints". They are each connected via pipework, so an external (host) bridge exists for each endpoint<->router connection. Something like this:

containerA (eth1) -- hostbridgeA -- (eth1) containerR
containerB (eth1) -- hostbridgeB -- (eth2) containerR

Then within the "router" container containerR, I have a bridge br0 configured like so:

bridge name     bridge id               STP enabled     interfaces
br0             8000.3a047f7a7006       no              eth1
                                                        eth2

I have net.bridge.bridge-nf-call-iptables=0 on the host as that was interfering with some of my other tests.

containerA has IP 192.168.10.1/24 and containerB has 192.168.10.2/24.

I then have a very simple ruleset that traces forwarded packets:

flush ruleset
table bridge filter {
    chain forward {
        type filter hook forward priority 0; policy accept;
        meta nftrace set 1
    }
}

With this, I find that only ARP packets are traced, and not ICMP packets. In other words, if I run nft monitor while containerA is pinging containerB, I can see the ARP packets traced, but not the ICMP packets. This surprises me, because based on my understanding of nftables' bridge filter chain types, the only time a packet wouldn't go through the forward stage is if it's sent via input to the host (in this case containerR). Per the Linux packet flow diagram:

Netfilter and Linux packet flows

I would still expect ICMP packets to take the forward path, just like ARP. I do see the packets if I trace pre- and post-routing. So my question is: what's happening here? Is there a flowtable or other short-circuit I'm not aware of? Is it specific to container networking and/or Docker? I can check with VMs rather than containers, but I'm interested to hear whether others are aware of this or have encountered it themselves.
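For reference, a ruleset that also traces the bridge pre- and post-routing hooks might look like this (a sketch, mirroring the forward chain above):

flush ruleset
table bridge filter {
    chain prerouting {
        type filter hook prerouting priority 0; policy accept;
        meta nftrace set 1
    }
    chain forward {
        type filter hook forward priority 0; policy accept;
        meta nftrace set 1
    }
    chain postrouting {
        type filter hook postrouting priority 0; policy accept;
        meta nftrace set 1
    }
}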

Edit: I have since created a similar setup with a set of Alpine Virtual Machines in VirtualBox. ICMP packets do reach the forward chain, so it seems something in the host, or Docker, is interfering with my expectations. I will leave this unanswered until I, or somebody else, can identify the reason, in case it's useful for others to know.

Thanks!

Minimal reproducible example

For this I'm using Alpine Linux 3.19.1 in a VM, with the community repository enabled in /etc/apk/repositories:

# Prerequisites of host
apk add bridge bridge-utils iproute2 docker openrc
service docker start
# When using linux bridges instead of openvswitch, disable iptables on bridges
sysctl net.bridge.bridge-nf-call-iptables=0

# Pipework to let me avoid docker's IPAM
git clone https://github.com/jpetazzo/pipework.git
cp pipework/pipework /usr/local/bin/

# Create two containers, each on their own network (bridge)
pipework brA $(docker create -itd --name hostA alpine:3.19) 192.168.10.1/24
pipework brB $(docker create -itd --name hostB alpine:3.19) 192.168.10.2/24

# Create the bridge-filtering container, then connect it to both of the other networks
R=$(docker create --cap-add NET_ADMIN -itd --name hostR alpine:3.19)
pipework brA -i eth1 $R 0/0
pipework brB -i eth2 $R 0/0
# Note: `hostR` doesn't have/need an IP address on the bridge for this example

# Add bridge tools and netfilter to the bridging container
docker exec hostR apk add bridge bridge-utils nftables
docker exec hostR brctl addbr br
docker exec hostR brctl addif br eth1 eth2
docker exec hostR ip link set dev br up

# hostA should be able to ping hostB
docker exec hostA ping -c 1 192.168.10.2
# 64 bytes from 192.168.10.2...

# Set nftables rules
docker exec hostR nft add table bridge filter
docker exec hostR nft add chain bridge filter forward '{type filter hook forward priority 0;}'
docker exec hostR nft add rule bridge filter forward meta nftrace set 1

# Now ping hostB from hostA while nft monitor is running...
docker exec hostA ping -c 4 192.168.10.2 &
docker exec hostR nft monitor
# Ping will succeed, but nft monitor will not show any echo-request/-reply packets traced, only ARPs
# Example:
trace id abc bridge filter forward packet: iif "eth2" oif "eth1" ether saddr ... daddr ... arp operation request
trace id abc bridge filter forward rule meta nftrace set 1 (verdict continue)
trace id abc bridge filter forward verdict continue
trace id abc bridge filter forward policy accept
...
trace id def bridge filter forward packet: iif "eth1" oif "eth2" ether saddr ... daddr ... arp operation reply
trace id def bridge filter forward rule meta nftrace set 1 (verdict continue)
trace id def bridge filter forward verdict continue
trace id def bridge filter forward policy accept

# Add tracing in prerouting and the ICMP packets are visible:
docker exec hostR nft add chain bridge filter prerouting '{type filter hook prerouting priority 0;}'
docker exec hostR nft add rule bridge filter prerouting meta nftrace set 1

# Run again
docker exec hostA ping -c 4 192.168.10.2 &
docker exec hostR nft monitor
# Ping still works (obviously), but we can see its packets in prerouting, which then disappear from
# the forward chain, while ARP shows up in both.
# Example:
trace id abc bridge filter prerouting packet: iif "eth1" ether saddr ... daddr ... ... icmp type echo-request ...
trace id abc bridge filter prerouting rule meta nftrace set 1 (verdict continue)
trace id abc bridge filter prerouting verdict continue
trace id abc bridge filter prerouting policy accept
...
trace id def bridge filter prerouting packet: iif "eth2" ether saddr ... daddr ... ... icmp type echo-reply ...
trace id def bridge filter prerouting rule meta nftrace set 1 (verdict continue)
trace id def bridge filter prerouting verdict continue
trace id def bridge filter prerouting policy accept
...
trace id 123 bridge filter prerouting packet: iif "eth1" ether saddr ... daddr ... ... arp operation request
trace id 123 bridge filter prerouting rule meta nftrace set 1 (verdict continue)
trace id 123 bridge filter prerouting verdict continue
trace id 123 bridge filter prerouting policy accept
trace id 123 bridge filter forward packet: iif "eth1" oif "eth2" ether saddr ... daddr ... arp operation request
trace id 123 bridge filter forward rule meta nftrace set 1 (verdict continue)
trace id 123 bridge filter forward verdict continue
trace id 123 bridge filter forward policy accept
...
trace id 456 bridge filter prerouting packet: iif "eth2" ether saddr ... daddr ... ... arp operation reply
trace id 456 bridge filter prerouting rule meta nftrace set 1 (verdict continue)
trace id 456 bridge filter prerouting verdict continue
trace id 456 bridge filter prerouting policy accept
trace id 456 bridge filter forward packet: iif "eth2" oif "eth1" ether saddr ... daddr ... arp operation reply
trace id 456 bridge filter forward rule meta nftrace set 1 (verdict continue)
trace id 456 bridge filter forward verdict continue
trace id 456 bridge filter forward policy accept
# Note the trace id matching across prerouting and forward chains

I tried this with openvswitch as well, but for simplicity I went with a Linux bridge example which yields the same result anyway. The only real difference with openvswitch is that net.bridge.bridge-nf-call-iptables=0 isn't needed, IIRC.

  • You might want to provide a reproducible setup ( stackoverflow.com/help/minimal-reproducible-example ) if you want someone to help on it. Commented Apr 5, 2024 at 10:40
  • Hi @A.B, Indeed I'm trying to come up with something to that effect now, which is partly why I tried using VMs instead of Docker. Thanks! Commented Apr 6, 2024 at 4:15

1 Answer


Introduction and simplified reproducer setup

Docker loads the br_netfilter module. Once loaded, it affects all present and future network namespaces. This is for historical and compatibility reasons, as described in my answer for this Q/A.
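To confirm whether that has happened on a given host, a quick check (a sketch, assuming the usual lsmod/sysctl tools are available):

# Is the module loaded (e.g. by Docker)?
lsmod | grep -w br_netfilter
# If it is, the bridge-nf toggles exist:
sysctl -a 2>/dev/null | grep 'bridge-nf'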

So when this is done on the host:

service docker start
# When using linux bridges instead of openvswitch, disable iptables on bridges
sysctl net.bridge.bridge-nf-call-iptables=0

This affects only the host network namespace. The future network namespace created for hostR will still get:

# docker exec hostR sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1

Below is a much simpler reproducer than OP's. It doesn't require Docker or a VM at all: it can be run on the current Linux host, requires only the iproute2 package, and creates a single bridge within the affected hostR named network namespace:

#!/bin/sh

modprobe br_netfilter  # as Docker would have done
sysctl net.bridge.bridge-nf-call-iptables=0  # actually it won't matter: netns hostR will still get 1 when created

ip netns add hostA
ip netns add hostB
ip netns add hostR

ip -n hostR link add name br address 02:00:00:00:01:00 up type bridge
ip -n hostR link add name eth1 up master br type veth peer netns hostA name eth1
ip -n hostR link add name eth2 up master br type veth peer netns hostB name eth1

ip -n hostA addr add dev eth1 192.168.10.1/24
ip -n hostA link set eth1 up
ip -n hostB addr add dev eth1 192.168.10.2/24
ip -n hostB link set eth1 up

ip netns exec hostR nft -f - <<'EOF'
table bridge filter          # for idempotence
delete table bridge filter   # for idempotence
table bridge filter {
    chain forward {
        type filter hook forward priority 0;
        meta nftrace set 1
    }
}
EOF

Note that br_netfilter still has its default settings in hostR network namespace:

# ip netns exec hostR sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1
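The module exposes sibling toggles for IPv6 and ARP too; a sketch checking all three inside the namespace (they all default to 1):

ip netns exec hostR sysctl net.bridge.bridge-nf-call-iptables \
                           net.bridge.bridge-nf-call-ip6tables \
                           net.bridge.bridge-nf-call-arptables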

Running on one side:

ip netns exec hostR nft monitor trace 

And elsewhere:

ip netns exec hostA ping -c 4 192.168.10.2 

will trigger the problem: no IPv4 packets are traced, only ARP (often appearing a few seconds late, with the usual lazy ARP cache refresh). This always triggers on kernels 6.6.x and earlier, and may or may not trigger on kernels 6.7.x and later (see below).
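To convince oneself that the echoes really do cross the bridge while the trace stays silent, the bridge interface can be watched directly (a sketch, assuming tcpdump is available):

# ICMP is visible on the bridge device itself...
ip netns exec hostR tcpdump -ni br icmp
# ...even though `nft monitor trace` shows no bridge filter forward events for it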


Effects of br_netfilter

This module makes frames on the bridge path traverse the Netfilter hooks for IPv4, which normally apply to the routing path. Here "hooks for IPv4" means both iptables and nftables in the ip family (the same happens for ARP and IPv6; IPv6 is not used here, so it won't be mentioned again).

That means the frames now reach the Netfilter hooks as described in ebtables/iptables interaction on a Linux-based bridge, section 5. Chain traversal for bridged IP packets:

  5. Chain traversal for bridged IP packets

    A bridged packet never enters any network code above layer 1 (Link Layer). So, a bridged IP packet/frame will never enter the IP code. Therefore all iptables chains will be traversed while the IP packet is in the bridge code. The chain traversal will look like this:

    Figure 5. Chain traversal for bridged IP packets


They are supposed to reach bridge filter forward (blue) first, followed by ip filter forward (green)...

... but not when the original hook priorities are changed, which in turn changes the order of the boxes above. The original hook priorities for the bridge family are described in nft(8):

Table 7. Standard priority names and hook compatibility for the bridge family

Name Value Hooks
dstnat -300 prerouting
filter -200 all
out 100 output
srcnat 300 postrouting

So the schematic above expects filter forward to hook at priority -200, not 0. When using 0, all bets are off.
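For instance, nft's named priorities can be used to place the chain at that standard value (a sketch; `filter` maps to -200 in the bridge family per the table above):

ip netns exec hostR nft add chain bridge filter forward \
    '{ type filter hook forward priority filter; }'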

Indeed, when the running kernel was compiled with option CONFIG_NETFILTER_NETLINK_HOOK, nft list hooks can be used to query all hooks in use in the current namespace, including br_netfilter's. For kernel 6.6.x or before:

# ip netns exec hostR nft list hooks
family ip {
        hook prerouting {
                -2147483648 ip_sabotage_in [br_netfilter]
        }
        hook postrouting {
                -0000000225 apparmor_ip_postroute
        }
}
family ip6 {
        hook prerouting {
                -2147483648 ip_sabotage_in [br_netfilter]
        }
        hook postrouting {
                -0000000225 apparmor_ip_postroute
        }
}
family bridge {
        hook prerouting {
                 0000000000 br_nf_pre_routing [br_netfilter]
        }
        hook input {
                +2147483647 br_nf_local_in [br_netfilter]
        }
        hook forward {
                -0000000001 br_nf_forward_ip [br_netfilter]
                 0000000000 chain bridge filter forward [nf_tables]
                 0000000000 br_nf_forward_arp [br_netfilter]
        }
        hook postrouting {
                +2147483647 br_nf_post_routing [br_netfilter]
        }
}

one can see that the kernel module br_netfilter (not deactivated in this network namespace) hooks at -1 for IPv4 and at 0 for ARP in the forward hook: the expected hook order isn't met, and disruption happens for bridge filter forward at OP's priority 0.
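Whether the running kernel provides this facility can be checked against its build configuration (a sketch; the config file location varies by distribution, and if built as a module it may need loading first):

zgrep NETFILTER_NETLINK_HOOK /proc/config.gz 2>/dev/null \
    || grep NETFILTER_NETLINK_HOOK "/boot/config-$(uname -r)"
# CONFIG_NETFILTER_NETLINK_HOOK=y   (or =m: then `modprobe nfnetlink_hook`)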

On kernel 6.7.x and later, since this commit, the default order after running the reproducer changes:

# ip netns exec hostR nft list hooks
[...]
family bridge {
        hook prerouting {
                 0000000000 br_nf_pre_routing [br_netfilter]
        }
        hook input {
                +2147483647 br_nf_local_in [br_netfilter]
        }
        hook forward {
                 0000000000 chain bridge filter forward [nf_tables]
                 0000000000 br_nf_forward [br_netfilter]
        }
        hook postrouting {
                +2147483647 br_nf_post_routing [br_netfilter]
        }
}

With this simplification, br_netfilter hooks only at priority 0 to handle forwarding, but what matters is that it now comes after bridge filter forward: the expected order, which does not cause OP's issue.

Since having two hooks at the same priority should be considered undefined ordering, this is a fragile setup: the problem can still be triggered from here (at least on kernel 6.7.x) simply by running:

rmmod br_netfilter
modprobe br_netfilter

which now changes the order:

[...]
        hook forward {
                 0000000000 br_nf_forward [br_netfilter]
                 0000000000 chain bridge filter forward [nf_tables]
        }
[...]

triggering the problem again, since br_netfilter is once more ordered before bridge filter forward.
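The reshuffled order can be confirmed the same way (a quick check; the grep is just to narrow the output):

ip netns exec hostR nft list hooks | grep -A 3 'hook forward'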

How to avoid this

To work around this in the network namespace (or container), choose one of these:

  • don't have br_netfilter loaded at all

    On host:

    rmmod br_netfilter 
  • or disable the effects of br_netfilter in the additional network namespace

    As explained, each new network namespace gets this feature enabled again when created. It has to be disabled where it matters: in the hostR network namespace:

    ip netns exec hostR sysctl net.bridge.bridge-nf-call-iptables=0 

    Once done, all br_netfilter hooks disappear in hostR, so there is no longer any disruption even when the unexpected order happens.

    There's one caveat. This doesn't work when using only Docker:

    # docker exec hostR sysctl net.bridge.bridge-nf-call-iptables=0
    sysctl: error setting key 'net.bridge.bridge-nf-call-iptables': Read-only file system
    # docker exec --privileged hostR sysctl net.bridge.bridge-nf-call-iptables=0
    sysctl: error setting key 'net.bridge.bridge-nf-call-iptables': Read-only file system

    because Docker protects some settings to prevent them from being tampered with by the container.

    Instead, one has to bind-mount (using ip netns attach ...) the container's network namespace, so it can be used by ip netns exec ... without the container's mount namespace getting in the way:

    ip netns attach hostR $(docker inspect --format '{{.State.Pid}}' hostR) 

    This now allows running the previous command from the host and affecting the container:

    ip netns exec hostR sysctl net.bridge.bridge-nf-call-iptables=0 
  • or use a priority that guarantees bridge filter forward to happen first

    As seen in the previous table, the standard filter priority in the bridge family is -200. So use -200, or else at most -2, to always run before br_netfilter whatever the kernel version:

    ip netns exec hostR nft delete chain bridge filter forward
    ip netns exec hostR nft add chain bridge filter forward '{ type filter hook forward priority -200; }'
    ip netns exec hostR nft add rule bridge filter forward meta nftrace set 1

    or likewise, if using Docker:

    docker exec hostR nft delete chain bridge filter forward
    docker exec hostR nft add chain bridge filter forward '{ type filter hook forward priority -200; }'
    docker exec hostR nft add rule bridge filter forward meta nftrace set 1
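With any of these three approaches in place, re-running the test should now show the ICMP echoes reaching the forward chain, roughly:

ip netns exec hostA ping -c 4 192.168.10.2 &   # or: docker exec hostA ping -c 4 192.168.10.2
ip netns exec hostR nft monitor trace
# expected, e.g.:
# trace id ... bridge filter forward packet: iif "eth1" oif "eth2" ... icmp type echo-request ...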

Tested on:

  • (OP's) alpine 3.19.1
  • Debian 12.5 with
    • stock Debian kernel 6.1.x
    • 6.6.x with CONFIG_NETFILTER_NETLINK_HOOK
    • 6.7.11 with CONFIG_NETFILTER_NETLINK_HOOK

Not tested with openvswitch bridges.


Final note: when doing network experiments, avoid Docker and the br_netfilter kernel module as much as possible. As my reproducer shows, it's quite easy to experiment using ip netns alone when only networking is involved (this can become more difficult if daemons, such as OpenVPN, are needed in the experiment).
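For completeness, the namespace reproducer can be torn down by deleting the namespaces (their veth interfaces disappear with them); a sketch:

ip netns del hostA
ip netns del hostB
ip netns del hostR
rmmod br_netfilter   # optional, if nothing else on the host needs it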

  • Thanks for the extremely detailed answer. I did wonder about my chain priority, but am not familiar enough to be sure and hadn't gone that far into testing it yet. I also accept that Docker is unnecessary here, although I was going to be introducing some services into the mix later on. Sticking with ip netns as much as possible seems sensible though. I'll be able to test your advice later this week, at which point I'm quite confident I'll be able to mark this answer correct, but bear with me until then. Commented Apr 8, 2024 at 3:15
  • @stevekez take your time. Just remember that modprobe br_netfilter is needed to trigger the problem. Commented Apr 8, 2024 at 6:22
  • Also I realize there is no kernel version fix. It's about having undefined behavior when using priority of exactly 0. I'll have to revisit my answer for this part. Requires kernel config option CONFIG_NETFILTER_NETLINK_HOOK to see this. Commented Apr 8, 2024 at 6:38
  • Fixed the answer: it's all about relative priority with br_netfilter (which changed in kernel 6.7.x), there's no kernel bug anywhere. Commented Apr 8, 2024 at 21:25
  • Actually it's more undefined order rather than behavior, but undefined order can lead to unexpected behavior. Commented Apr 9, 2024 at 6:51
