Replacing iptables with eBPF in Kubernetes with Cilium
Cilium, eBPF, Envoy, Istio, Hubble

Michal Rostecki, Software Engineer (mrostecki@suse.com, mrostecki@opensuse.org)
Swaminathan Vasudevan, Software Engineer (svasudevan@suse.com)
What’s wrong with iptables?
What’s wrong with legacy iptables?

iptables runs into a couple of significant problems:
● Updates must be made by recreating and reloading all rules in a single transaction.
● Chains of rules are implemented as a linked list, so almost all operations are O(n).
● The standard practice for implementing access control lists (ACLs) with iptables is a sequential list of rules.
● Matching is based on IPs and ports; it is not aware of L7 protocols.
● Every time there is a new IP or port to match, rules need to be added and the chain changed.
● Resource consumption on Kubernetes is high.
Complexity of iptables
● Linked list (Rule 1 → Rule 2 → ... → Rule n).
● All rules in the chain have to be replaced as a whole.
● Search O(n), Insert O(1), Delete O(n).
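To make the O(n) point concrete, here is a minimal C sketch of sequential chain evaluation; the rule type is invented for the example, not taken from the netfilter sources:

/* Hypothetical rule chain, for illustration only. */
struct rule {
    unsigned int   match_ip;   /* destination IP to match */
    unsigned short match_port; /* destination port to match */
    int            verdict;    /* ACCEPT or DROP */
    struct rule   *next;       /* singly linked list */
};

/* Every packet traverses the chain from the head: O(n) per packet,
 * and deleting a rule also means walking the list to find it. */
int evaluate(const struct rule *chain, unsigned int ip, unsigned short port)
{
    for (const struct rule *r = chain; r != NULL; r = r->next)
        if (r->match_ip == ip && r->match_port == port)
            return r->verdict;
    return -1; /* no match: fall through to the chain's default policy */
}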
Kubernetes uses iptables for...
● kube-proxy, the component that implements Services and load balancing with DNAT iptables rules
● most CNI plugins, which use iptables to implement Network Policies
What is BPF?
Linux Network Stack

[Diagram: the stack from bottom to top: netdevices/drivers (HW, bridge, OVS), traffic shaping, Ethernet, IPv4/IPv6 with Netfilter, TCP/UDP and raw sockets, the system call interface, and user processes.]

● The Linux kernel stack is split into multiple abstraction layers.
● Linux has provided strong userspace API compatibility for years.
● This shows how complex the Linux kernel is, and how many years of evolution it carries.
● It cannot be replaced in the short term.
● The layers are very hard to bypass.
● The Netfilter module has been supported by Linux for more than two decades; packet filtering is applied to packets as they move up and down the stack.
BPF kernel hooks

[Diagram: the same stack, annotated with where BPF attaches: BPF system calls at the syscall interface, BPF sockmap and sockops at the socket layer, BPF cgroups, BPF TC hooks at traffic shaping, and BPF XDP down at the netdevice/driver level.]
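As a minimal taste of the lowest of these hooks, here is an XDP program sketch (assuming clang with the BPF target and libbpf headers). It does nothing but pass packets on, but it runs at the driver level before the rest of the stack; a real program would rewrite, redirect or drop packets here. It could be attached with iproute2, e.g. ip link set dev eth0 xdp obj xdp.o sec xdp.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Runs for every received packet, before the kernel allocates an skb. */
SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    return XDP_PASS; /* hand the packet on to the normal network stack */
}

char _license[] SEC("license") = "GPL";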
[Chart: packet processing throughput, in Mpps.]
BPF replaces IPtables

[Diagram: the iptables netfilter hooks (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING, with their filter and NAT tables and routing decisions between the netdevices and local processes) side by side with eBPF code attached at the TC and XDP hooks on the netdevices.]
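A hedged sketch of what "eBPF code at the TC hook" can look like in C: an illustrative classifier (not Cilium's actual datapath) that drops TCP traffic to port 8080, the way an iptables DROP rule on that port would.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("tc")
int drop_tcp_8080(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    /* Bounds checks on every header keep the verifier happy. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
    if ((void *)(tcp + 1) > data_end)
        return TC_ACT_OK;

    if (tcp->dest == bpf_htons(8080))
        return TC_ACT_SHOT; /* drop, like an iptables DROP rule */
    return TC_ACT_OK;       /* everything else continues up the stack */
}

char _license[] SEC("license") = "GPL";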
BPF based filtering architecture

[Diagram: packets enter at the TC/XDP ingress hook on the netdev; an ingress chain selector sends them to the INGRESS chain (local destination) or the FORWARD chain (remote destination), while an egress chain selector feeds locally generated traffic through the OUTPUT chain to the TC egress hook. Connection tracking stores and updates sessions and labels packets at each step, with Netfilter still reachable to and from the Linux stack.]
BPF based tail calls

[Diagram: a chain of eBPF programs connected through tail calls. Program #1 parses headers once and stores the offsets; later programs look up individual fields (IP.dst, IP.proto, ...) and produce bit vectors; a bitwise AND of the bit vectors selects the first matching rule, counters are updated and the action (drop/accept) is applied. Packet header offsets and the temporary bit vector live in per-CPU array maps shared across the entire program chain.]

● Each eBPF program can exploit a different matching algorithm (e.g. exact match, longest prefix match).
● Each eBPF program is injected only if there are rules operating on that field.
● LBVS (Linear Bit Vector Search) is implemented with a chain of eBPF programs, connected through tail calls.
● Header parsing is done once and the results are kept in a shared map for performance reasons.
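A sketch of the tail-call mechanism itself (map and stage names are invented for the example): programs are stored in a BPF_MAP_TYPE_PROG_ARRAY and each stage jumps to the next with bpf_tail_call(), which never returns on success.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Program array holding the later stages of the classification pipeline. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 8);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} prog_chain SEC(".maps");

SEC("tc")
int stage_parse(struct __sk_buff *skb)
{
    /* ... parse headers, store offsets in a shared per-CPU array map ... */
    bpf_tail_call(skb, &prog_chain, 1); /* jump to stage 1, e.g. IP.dst lookup */
    return TC_ACT_OK; /* reached only if slot 1 holds no program */
}

char _license[] SEC("license") = "GPL";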
BPF goes into...
● Load balancers: Katran
● perf
● systemd
● Suricata
● Open vSwitch (AF_XDP)
● And many, many others
BPF is used by...
Cilium
What is Cilium?
CNI Functionality

CNI is a CNCF (Cloud Native Computing Foundation) project for Linux containers. It consists of a specification and libraries for writing plugins. CNI only cares about the network connectivity of containers, through two operations: ADD and DEL.

General container runtime considerations for CNI: the container runtime must
● create a new network namespace for the container before invoking any plugins
● determine the networks the container should join and add the container to each of them by calling the corresponding plugin
● not invoke parallel operations for the same container
● order ADD and DEL operations for a container, such that ADD is always eventually followed by a corresponding DEL
● not call ADD twice (without a corresponding DEL) for the same (network name, container id, name of the interface inside the container) tuple.

When the CNI ADD call is invoked, the plugin adds the container to the network, creating the respective veth pairs and assigning an IP address from the respective IPAM plugin or using the host scope. When the CNI DEL call is invoked, the plugin removes the container network, releases the IP address back to the IPAM manager and cleans up the veth pairs.
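For illustration, a hedged C sketch of how a runtime drives a CNI plugin: the plugin is a plain executable, the operation is passed in the CNI_COMMAND environment variable, and the network configuration arrives as JSON on stdin. The paths and config values here are assumptions, not taken from a real deployment.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* The CNI contract: operation and identifiers come in via environment. */
    setenv("CNI_COMMAND", "ADD", 1);
    setenv("CNI_CONTAINERID", "example-container", 1);
    setenv("CNI_NETNS", "/var/run/netns/example", 1);
    setenv("CNI_IFNAME", "eth0", 1);
    setenv("CNI_PATH", "/opt/cni/bin", 1);

    /* The network configuration is fed to the plugin on stdin. */
    FILE *plugin = popen("/opt/cni/bin/cilium-cni", "w");
    if (!plugin)
        return 1;
    fputs("{\"cniVersion\":\"0.3.1\",\"name\":\"cilium\",\"type\":\"cilium-cni\"}\n",
          plugin);
    return pclose(plugin); /* the plugin prints the ADD result as JSON on stdout */
}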
Cilium CNI plugin control flow

[Diagram: kubectl talks to the Kubernetes API server; the kubelet drives CRI-containerd, which invokes the CNI plugin (Cilium) with cni-add(); the Cilium agent then uses bpf_syscall() to install BPF hooks and BPF maps in the kernel network stack, wiring Container1 and Container2 of the K8s pod to eth0.]
[Diagram: Cilium components, with BPF hook points and BPF maps shown in the Linux stack, driven by the orchestrator.]
[Diagram: containers A, B and C, each with an eth0 interface paired to an lxc* veth device, attached to the node's eth0.]
Networking modes

Encapsulation. Use case: Cilium handling routing between nodes.
[Diagram: Node A, Node B and Node C connected through VXLAN tunnels.]

Direct routing. Use case: using cloud provider routers, or a BGP routing daemon.
[Diagram: Node A, Node B and Node C connected through the cloud or BGP routing fabric.]
L3 filtering – label based, ingress

[Diagram: pods labeled role=frontend (10.0.0.1, 10.0.0.2) are allowed to reach pods labeled role=backend (10.0.0.3, 10.0.0.4); an unlabeled pod (10.0.0.5) is denied.]
L3 filtering – label based, ingress

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow frontends to access backends"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
L3 filtering – CIDR based, egress

[Diagram: a pod labeled role=backend (10.0.0.1) in cluster A is allowed to reach 10.0.1.1 in subnet 10.0.1.0/24; any IP not belonging to 10.0.1.0/24, such as 10.0.2.1 in 10.0.2.0/24, is denied.]
L3 filtering – CIDR based, egress

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow backends to access 10.0.1.0/24"
  endpointSelector:
    matchLabels:
      role: backend
  egress:
  - toCIDR:
    - "10.0.1.0/24"
L4 filtering

[Diagram: traffic to a pod labeled role=backend (10.0.0.1) is allowed on TCP/80 and denied on any other port.]
L4 filtering

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow access to backends only on TCP/80"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"
L7 filtering – API Aware Security

[Diagram: a pod (10.0.0.5) calling a pod labeled role=api (10.0.0.1): GET /articles/{id} is allowed, GET /private is denied.]
L7 filtering – API Aware Security

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "L7 policy to restrict access to specific HTTP endpoints"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"
      rules:
        http:
        - method: "GET"
          path: "/article/$"
Standalone proxy, L7 filtering

[Diagram: Node A and Node B, each running pods with BPF and an Envoy proxy; Cilium generates BPF programs for L3/L4 filtering, and generates BPF programs for L7 filtering in Envoy through libcilium.so; the nodes are connected over VXLAN.]
Features
Cluster Mesh

[Diagram: Cluster A and Cluster B, each with nodes running BPF-enabled pods and containers; pods communicate across clusters, with state shared through an external etcd.]
[Diagram: the standard service path: socket-to-service-to-socket traffic crosses TCP/IP, iptables, Ethernet and loopback on every hop, both inside a node and across eth0 to the network.]
[Diagram: the same path with Cilium CNI: local socket-to-socket traffic is short-circuited at the socket layer; only traffic leaving the node traverses TCP/IP, Ethernet and eth0 to the network.]
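The short-circuit in the diagram above is what the BPF sockmap/sockops hooks enable. A hedged sketch follows; the map layout and key derivation are simplified and illustrative, not Cilium's implementation.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Established local sockets, keyed here by an illustrative 64-bit key. */
struct {
    __uint(type, BPF_MAP_TYPE_SOCKHASH);
    __uint(max_entries, 65536);
    __uint(key_size, sizeof(__u64));
    __uint(value_size, sizeof(__u64));
} sock_map SEC(".maps");

SEC("sk_msg")
int msg_redirect(struct sk_msg_md *msg)
{
    __u64 key = ((__u64)msg->remote_ip4 << 32) | msg->remote_port;
    /* Deliver straight to the peer socket, skipping TCP/IP and iptables. */
    bpf_msg_redirect_hash(msg, &sock_map, &key, BPF_F_INGRESS);
    return SK_PASS;
}

char _license[] SEC("license") = "GPL";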
[Diagrams: Services A, B and C communicating with each other, and reaching an external GitHub service through the external cloud network.]
Kubernetes Services

BPF, Cilium:
● Hash table (key → value).
● Search O(1), Insert O(1), Delete O(1).

iptables, kube-proxy:
● Linked list (Rule 1 → Rule 2 → ... → Rule n).
● All rules in the chain have to be replaced as a whole.
● Search O(n), Insert O(1), Delete O(n).
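A sketch of the kind of BPF hash map that makes service lookup O(1); the key/value layout here is illustrative, not Cilium's actual map definitions.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
    __u32 vip;   /* service virtual IP, network byte order */
    __u16 port;  /* service port */
    __u16 pad;
};

struct svc_backend {
    __u32 ip;    /* selected backend pod IP */
    __u16 port;  /* backend port */
    __u16 pad;
};

/* Hash table: one bpf_map_lookup_elem(&services, &key) in the datapath
 * resolves a service in O(1), instead of walking an O(n) iptables chain. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct svc_key);
    __type(value, struct svc_backend);
} services SEC(".maps");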
[Chart: latency in usec vs. number of services in the cluster.]
CNI chaining
● Cilium provides policy enforcement, load balancing and multi-cluster connectivity.
● The chained plugin handles IP allocation, configuring the network interface, and encapsulation/routing inside the cluster.
Native support for AWS ENI
To sum it up
Why is Cilium awesome?
● It makes the disadvantages of iptables disappear, and always gets the best from the Linux kernel.
● Cluster Mesh / multi-cluster.
● It makes Istio faster.
● It offers L7 API-aware filtering as a Kubernetes resource.
● It integrates with the other popular CNI plugins: Calico, Flannel, Weave, Lyft, AWS CNI.