Replacing iptables with eBPF in Kubernetes with Cilium
The document discusses the replacement of iptables with eBPF in Kubernetes using Cilium, highlighting the limitations of iptables such as high resource consumption and inefficient rule management. It explains how eBPF enhances Kubernetes networking by enabling faster, more efficient rule processing and more advanced filtering. Cilium integrates eBPF for improved container networking and security while supporting various networking modes and policies.
3 What's wrong with legacy iptables?
iptables runs into several significant problems:
● Updates must be made by recreating and updating all rules in a single transaction.
● Chains of rules are implemented as a linked list, so almost all operations are O(n).
● The standard practice for implementing access control lists (ACLs), as followed by iptables, was to use a sequential list of rules.
● Matching is based on IPs and ports; iptables is not aware of L7 protocols.
● Every time there is a new IP or port to match, rules need to be added and the chain changed.
● It consumes a lot of resources on Kubernetes.
4 Complexity of iptables
● Linked list.
● All rules in the chain have to be replaced as a whole.
Rule 1 -> Rule 2 -> ... -> Rule n
Search O(n), Insert O(1), Delete O(n)
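To make the O(n) search cost concrete, here is a minimal Python sketch of first-match evaluation over a sequential rule chain. It only illustrates the data structure and is not how iptables is implemented in the kernel; the `Rule` fields and the `evaluate_chain` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    dst_ip: Optional[str]    # None acts as a wildcard (assumption of this sketch)
    dst_port: Optional[int]  # None acts as a wildcard
    action: str              # "ACCEPT" or "DROP"


def evaluate_chain(chain, dst_ip, dst_port, default="DROP"):
    """Walk the chain in order and return (action, rules_visited)."""
    visited = 0
    for rule in chain:
        visited += 1
        ip_ok = rule.dst_ip is None or rule.dst_ip == dst_ip
        port_ok = rule.dst_port is None or rule.dst_port == dst_port
        if ip_ok and port_ok:
            return rule.action, visited
    # No rule matched: the whole chain was scanned before falling back.
    return default, visited


# 200 rules, one per destination address 10.0.0.1 .. 10.0.0.200
chain = [Rule(f"10.0.0.{i}", 80, "ACCEPT") for i in range(1, 201)]

print(evaluate_chain(chain, "10.0.0.1", 80))    # matches the first rule
print(evaluate_chain(chain, "10.0.0.200", 80))  # matches only the last rule
print(evaluate_chain(chain, "10.9.9.9", 80))    # matches nothing
```

The number of rules visited grows with the position of the matching rule, and a packet that matches nothing pays for the full chain.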
5 Kubernetes uses iptables for...
● kube-proxy - the component that implements Services and load balancing with DNAT iptables rules
● most CNI plugins use iptables for Network Policies
7 Linux Network Stack
[Diagram: layers shown are HW, Bridge, OVS; Netdevice / Drivers; Traffic Shaping; Ethernet; IPv4, IPv6; Netfilter; TCP, UDP, Raw; Sockets; System Call Interface; Processes]
● The Linux kernel stack is split into multiple abstraction layers.
● Linux has kept strong userspace API compatibility for years.
● The diagram shows how complex the Linux kernel is after years of evolution.
● It cannot be replaced in the short term.
● It is very hard to bypass the layers.
● The Netfilter module has been supported by Linux for more than two decades, and packet filtering has to be applied to packets that move up and down the stack.
8 [Diagram: the same Linux network stack (HW, Bridge, OVS; Netdevice / Drivers; Traffic Shaping; Ethernet; IPv4, IPv6; Netfilter; TCP, UDP, Raw; Sockets; System Call Interface; Processes), annotated with BPF attachment points: BPF system calls, BPF Sockmap and Sockops, BPF TC hooks, BPF XDP, BPF kernel hooks, BPF cgroups]
11 BPF based filtering architecture
[Diagram labels: Netdev (physical or virtual device) on both sides; TC/XDP ingress hook; ingress chain selector leading to the INGRESS chain (local dst) or the FORWARD chain (remote dst); egress chain selector leading to the OUTPUT chain (local src), with remote-src traffic marked separately; TC egress hook; NetFilter; paths to and from the Linux stack; connection tracking steps: label packet, store session, update session]
12 BPF based tail calls
[Diagram: a packet comes in from the eBPF hook and passes through a chain of eBPF programs (#1, #2, #3, ..., #N) connected by tail calls, then goes back out to the eBPF hook. Stages shown: header parsing; IP.dst lookup (IP1 -> bitv1, IP2 -> bitv2, IP3 -> bitv3); IP.proto lookup (* -> bitv1, udp -> bitv2, tcp -> bitv3); bitwise AND of the bit-vectors; search for the first matching rule; update counters (rule1 -> cnt1, rule2 -> cnt2); ACTION drop/accept (rule1 -> act1, rule2 -> act2, rule3 -> act3). Two per-CPU arrays are shared across the entire program chain: the packet header offsets and the bit-vector with the temporary result.]
● LBVS is implemented as a chain of eBPF programs connected through tail calls.
● Each eBPF program can use a different matching algorithm (e.g., exact match, longest prefix match).
● Each eBPF program is injected only if there are rules operating on that field.
● Header parsing is done once, and the results are kept in a shared map for performance reasons.
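The bit-vector matching idea on this slide can be sketched in plain Python. This is a userspace illustration only, assuming LBVS refers to Linear Bit-Vector Search: the real pipeline runs as tail-called eBPF programs with per-CPU maps, and the rule set, field names, and the helpers `build_bitvectors` and `classify` below are hypothetical. Each field gets a lookup table from value to a bit-vector of rules that could match; the per-field results are ANDed, and the lowest set bit identifies the first matching rule.

```python
def build_bitvectors(rules, field):
    """Map each value of `field` to a bit-vector of the rules it can match.

    Bit i is set when rule i accepts that value; "*" is the wildcard.
    """
    wildcard = 0
    for i, rule in enumerate(rules):
        if rule[field] == "*":
            wildcard |= 1 << i
    table = {"*": wildcard}
    for i, rule in enumerate(rules):
        value = rule[field]
        if value != "*":
            # Wildcard rules match every concrete value as well.
            table[value] = table.get(value, wildcard) | (1 << i)
    return table


def classify(rules, tables, packet, default="drop"):
    """AND the per-field bit-vectors and act on the first matching rule."""
    result = (1 << len(rules)) - 1  # start with every rule as a candidate
    for field, table in tables.items():
        result &= table.get(packet[field], table["*"])
        if result == 0:
            return default  # no rule can match any more
    first = (result & -result).bit_length() - 1  # index of the lowest set bit
    return rules[first]["action"]


rules = [
    {"dst": "10.0.0.1", "proto": "tcp", "action": "accept"},
    {"dst": "10.0.0.2", "proto": "*", "action": "drop"},
    {"dst": "*", "proto": "udp", "action": "accept"},
]
tables = {field: build_bitvectors(rules, field) for field in ("dst", "proto")}

print(classify(rules, tables, {"dst": "10.0.0.1", "proto": "tcp"}))
print(classify(rules, tables, {"dst": "10.0.0.9", "proto": "tcp"}))
```

Because each field is resolved by its own table, a field that no rule refers to needs no table at all, which mirrors the point that a program is injected only when rules operate on that field.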
13 BPF goes into...
● Load balancers - katran
● perf
● systemd
● Suricata
● Open vSwitch - AF_XDP
● And many, many others
17 CNI Functionality
CNI is a CNCF (Cloud Native Computing Foundation) project for Linux containers. It consists of a specification and libraries for writing plugins. It only cares about the network connectivity of containers.
● ADD/DEL
General container runtime considerations for CNI. The container runtime must:
● create a new network namespace for the container before invoking any plugins
● determine the networks for the container and add the container to each network by calling the corresponding plugin for each network
● not invoke parallel operations for the same container
● order ADD and DEL operations for a container such that ADD is always eventually followed by a corresponding DEL
● not call ADD twice (without a corresponding DEL) for the same (network name, container id, name of the interface inside the container)
When a CNI ADD call is invoked, the plugin tries to add the container to the network by creating the respective veth pair and assigning an IP address from the respective IPAM plugin or from the host scope.
When a CNI DEL call is invoked, the plugin tries to remove the container from the network, release the IP address to the IPAM manager, and clean up the veth pair.
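The ordering rules for ADD and DEL can be pictured as a small bookkeeping structure keyed by (network name, container id, interface name). The following Python sketch is a toy model of what a runtime has to track, not code from any CNI library; `AttachmentTracker`, `CniOrderingError`, and the choice to let DEL succeed when nothing is attached are assumptions made for illustration.

```python
class CniOrderingError(Exception):
    """Raised when the ADD/DEL ordering rules are violated."""


class AttachmentTracker:
    """Tracks live (network, container id, interface) attachments."""

    def __init__(self):
        self._attached = set()

    def add(self, network, container_id, ifname):
        key = (network, container_id, ifname)
        if key in self._attached:
            # ADD must not be called twice without a corresponding DEL.
            raise CniOrderingError(f"duplicate ADD for {key}")
        self._attached.add(key)

    def delete(self, network, container_id, ifname):
        # Assumption of this sketch: DEL with nothing attached is a no-op.
        self._attached.discard((network, container_id, ifname))

    def pending_del(self):
        """Attachments that still need a DEL before the container is gone."""
        return sorted(self._attached)


tracker = AttachmentTracker()
tracker.add("cilium", "container1", "eth0")
print(tracker.pending_del())
tracker.delete("cilium", "container1", "eth0")
tracker.add("cilium", "container1", "eth0")  # allowed again after DEL
```

A runtime would also have to serialize operations per container, which this sketch leaves out.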
18 Cilium CNI plugin control flow
[Diagram: kubectl, Kubernetes API Server, Kubelet, CRI-Containerd, and the CNI plugin (Cilium), invoked via cni-add(), together with the Cilium Agent in userspace; the agent reaches the kernel through bpf_syscall(), where BPF maps and a BPF hook sit next to the Linux kernel network stack and eth0; a K8s Pod contains Container1 and Container2]
21 Networking modes
● Encapsulation. Use case: Cilium handling routing between nodes. [Diagram: Node A, Node B, and Node C connected over VXLAN]
● Direct routing. Use case: using cloud provider routers or a BGP routing daemon. [Diagram: Node A, Node B, and Node C connected through cloud or BGP routing]
26 L3 filtering – CIDR based, egress
[Diagram: a pod in Cluster A (labels: role=backend, IP: 10.0.0.1) is allowed to reach IP 10.0.1.1 in subnet 10.0.1.0/24 and denied from reaching IP 10.0.2.1 in subnet 10.0.2.0/24, or any IP not belonging to 10.0.1.0/24]
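The allow/deny decision in this example can be reproduced with Python's standard `ipaddress` module. This is only a sketch of the CIDR check itself, using the addresses from the slide; it is not how Cilium expresses or enforces the policy, and `ALLOWED_EGRESS` and `egress_allowed` are hypothetical names.

```python
import ipaddress

# Egress allow-list for the role=backend pod, taken from the slide's example.
ALLOWED_EGRESS = [ipaddress.ip_network("10.0.1.0/24")]


def egress_allowed(dst_ip, allowed=ALLOWED_EGRESS):
    """Return True when dst_ip falls inside one of the allowed CIDRs."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in allowed)


for dst in ("10.0.1.1", "10.0.2.1"):
    print(dst, "allow" if egress_allowed(dst) else "deny")
```

Any destination outside 10.0.1.0/24 falls through to deny, which matches the "any IP not belonging to 10.0.1.0/24" case.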
32 Standalone proxy, L7 filtering
[Diagram: Node A (Pod A + BPF, Envoy) and Node B (Pod B + BPF, Envoy) connected over VXLAN. On each node, BPF programs for L3/L4 filtering are generated, and BPF programs for L7 filtering are generated through libcilium.so]
34 Cluster Mesh
[Diagram: Cluster A and Cluster B, each with nodes running pods + BPF (Pod A, Pod B, Pod C; each container with eth0), and an external etcd]
40 Kubernetes Services
BPF, Cilium
● Hashtable (key -> value).
● Search O(1), Insert O(1), Delete O(1)
Iptables, kube-proxy
● Linked list (Rule 1, Rule 2, ..., Rule n).
● All rules in the chain have to be replaced as a whole.
● Search O(n), Insert O(1), Delete O(n)
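The hashtable side of this comparison can be sketched with a Python dict, whose lookups, inserts, and deletes take constant time on average. The key layout (virtual IP, port, protocol), the backend selection by flow hash, and the helper names are assumptions for illustration and do not reflect Cilium's actual BPF map format.

```python
# Service table: (virtual IP, port, protocol) -> list of backend addresses.
services = {}


def upsert_service(vip, port, proto, backends):
    """Insert or update one service without touching any other entry."""
    services[(vip, port, proto)] = list(backends)


def delete_service(vip, port, proto):
    """Remove one service; other entries stay in place."""
    services.pop((vip, port, proto), None)


def select_backend(vip, port, proto, flow_hash):
    """Single keyed lookup, then pick a backend by flow hash."""
    backends = services.get((vip, port, proto))
    if not backends:
        return None
    return backends[flow_hash % len(backends)]


upsert_service("10.96.0.10", 53, "udp", ["10.0.0.5:53", "10.0.0.6:53"])
upsert_service("10.96.0.20", 80, "tcp", ["10.0.1.7:8080"])

print(select_backend("10.96.0.10", 53, "udp", flow_hash=3))
delete_service("10.96.0.20", 80, "tcp")
print(select_backend("10.96.0.20", 80, "tcp", flow_hash=0))
```

Adding or removing a service changes a single entry, in contrast to a rule chain that has to be replaced as a whole.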
50 Why is Cilium awesome?
● It makes the disadvantages of iptables disappear, and always gets the best from the Linux kernel.
● Cluster Mesh / multi-cluster.
● Makes Istio faster.
● Offers L7 API-aware filtering as a Kubernetes resource.
● Integrates with other popular CNI plugins – Calico, Flannel, Weave, Lyft, AWS CNI.