I’ve been working on a project that takes requests and load balances them between containers. Using Docker or Kubernetes makes this easy, but I wondered how these systems work, so I decided to try to implement it myself. I’ve been using containerd to programmatically create and manage containers. It works quite well for this task; however, it does not provide networking to or between containers. This is a problem for the load-balancing part of my project, so I searched for a solution. The solution I landed on uses a Linux bridge, virtual Ethernet devices, and IPVS to provide networking and load balancing to and between containers.
Definitions
- namespaces: A namespace in Linux isolates resources from each other. For example, a network namespace is an isolated group that has its own interfaces, routing table, and IP addresses. Processes can run inside a namespace and use its resources, but they cannot access resources in namespaces they are not part of (a short standalone example follows this list).
- bridge: A bridge in Linux is a virtual device that forwards traffic between other virtual devices. Unlike physical bridges, it is not limited to two ports.
- veth: A virtual Ethernet device gives you two connected ends called “peers”: traffic sent into one peer exits out of the other. The two peers can live in the same namespace or in different namespaces.
- vip: A virtual IP address is an address that is not tied to a specific interface. It acts as a single, stable address through which traffic is routed to one or more real IP addresses.
- IPVS: IP Virtual Server is a module in the Linux kernel (ip_vs) that can be used to load balance TCP/UDP traffic across a group of IP addresses. Each such group of real addresses is reached through a vip.
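To get a feel for network namespaces before diving in, here is a tiny standalone example using ip netns, which manages named network namespaces (the name demo is arbitrary; the containers later in this post get their namespaces from containerd instead):
# Create a named network namespace
ip netns add demo
# List the interfaces inside it: only an isolated loopback device exists
ip netns exec demo ip addr
# Remove the namespace again
ip netns del demo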
Problem
Each container has its own namespaces, including a network namespace, which allows containers to be separated from each other. However, sometimes we want two containers to communicate with each other, such as an application talking to a database. We may also want traffic to be load balanced between a group of containers. For example, let’s say we have two containers c0 and c1 running, each containing a webserver on port 8080 that prints a list of interfaces and their addresses (essentially ip addr served on 8080). We want to load balance between these two containers.
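Such a webserver can be very simple. As a rough idea, a hypothetical server.sh could be a small shell loop around nc (assuming BusyBox nc is available inside the container image):
#!/bin/sh
# Hypothetical server.sh: answer each connection on port 8080 with a minimal
# HTTP response containing the output of `ip addr`
while true; do
  { printf 'HTTP/1.1 200 OK\r\n\r\n'; ip addr; } | nc -l -p 8080
done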
Solution
When using containerd, the ctr CLI is helpful for inspecting what is happening. For example, if we want to see a list of running container tasks we can run ctr task list and get:
TASK PID STATUS
c0 65077 RUNNING
c1 65128 RUNNING
We can see more information about these containers by running lsns, which lists the existing namespaces.
NS TYPE NPROCS PID USER COMMAND
4026533267 mnt 2 65077 root /bin/sh ./server.sh
4026533268 uts 2 65077 root /bin/sh ./server.sh
4026533269 ipc 2 65077 root /bin/sh ./server.sh
4026533270 pid 2 65077 root /bin/sh ./server.sh
4026533271 net 2 65077 root /bin/sh ./server.sh
4026533381 mnt 2 65128 root /bin/sh ./server.sh
4026533382 uts 2 65128 root /bin/sh ./server.sh
4026533383 ipc 2 65128 root /bin/sh ./server.sh
4026533384 pid 2 65128 root /bin/sh ./server.sh
4026533385 net 2 65128 root /bin/sh ./server.sh
Looking at this printout we can see that a network namespace was created for each container (the rows where TYPE is net).
4026533271 net 2 65077 root /bin/sh ./server.sh
4026533385 net 2 65128 root /bin/sh ./server.sh
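We could also have filtered for these directly, since lsns can list namespaces of a single type:
# Show only network namespaces
lsns -t net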
A simple way to provide network connectivity to a container is to create a veth and then put one peer inside the network namespace belonging to the container.
# Create a veth with two peers, one named c0veth0 and the other c0veth1
ip link add c0veth0 type veth peer name c0veth1
# Move the c0veth0 to the c0 namespace, which is at PID 65077
ip link set c0veth0 netns 65077
We can check that this veth peer has been moved using the nsenter command, which allows us to run commands inside a namespace.
nsenter -t 65077 -n ip addr
2: c0veth0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether ba:79:db:15:10:b1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Now let’s give the c0veth0 interface an IP address.
nsenter -t 65077 -n ip link set c0veth0 up
# Set the address of c0veth0 to 10.0.0.2
nsenter -t 65077 -n ip addr add 10.0.0.2/32 broadcast 10.0.0.2 dev c0veth0
# Route traffic going to 10.0.0.1 through c0veth0 with the source address 10.0.0.2
nsenter -t 65077 -n ip route add 10.0.0.1 dev c0veth0 src 10.0.0.2
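As a quick sanity check, we can look at the namespace’s addresses and routes again through nsenter:
# Confirm the address and route inside c0's network namespace
nsenter -t 65077 -n ip addr show c0veth0
nsenter -t 65077 -n ip route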
And let’s give our c0veth1 interface the 10.0.0.1 address on a /24.
ip addr add 10.0.0.1/24 dev c0veth1
If we do a curl 10.0.0.2:8080 we see the webserver returning the list of interfaces.
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: c0veth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ba:79:db:15:10:b1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.0.2/32 brd 10.0.0.2 scope global c0veth0
valid_lft forever preferred_lft forever
inet6 fe80::b879:dbff:fe15:10b1/64 scope link
valid_lft forever preferred_lft forever
Creating a bridge
We can now see that we are able to create a veth and route traffic to a specific container. The next step is to create a load balancing setup that uses IPVS to route traffic between addresses. In this example c0 will have the IP address 10.0.0.2 and c1 will have 10.0.0.3. Assume that I have set up c1veth0 and c1veth1 using the same method as above.
Let me delete the 10.0.0.1 address from c0veth1, because we will need this address for the bridge.
ip addr del 10.0.0.1/24 dev c0veth1
Since we want routing to and between containers, let’s create a bridge and attach both c0veth1 and c1veth1 to it. This will allow the two veths to communicate on a shared network.
ip link add name br0 type bridge
ip link set dev br0 up
# Assign 10.0.0.1 to the bridge
ip addr add 10.0.0.1/24 dev br0
Now connect both c0veth1 and c1veth1 to br0.
ip link set c0veth1 master br0
ip link set c1veth1 master br0
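To verify, we can list the interfaces that are attached to the bridge:
# Show all interfaces whose master is br0
ip link show master br0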
Afterwards we can do curl 10.0.0.2:8080 and curl 10.0.0.3:8080 to reach each container.
Load Balancing
With these two addresses we can use IPVS to load balance between them. First enable the IPVS modules.
# Enable IPVS
modprobe ip_vs
# Enable the round robin scheduling algorithm.
# IPVS supports many scheduling algorithms; see the man page for the full list.
modprobe ip_vs_rr
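We can confirm that the modules were loaded:
# Check that the IPVS modules are present
lsmod | grep ip_vs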
Now create a virtual service with a vip of 10.0.0.4 that uses round robin on port 8080.
ipvsadm -A -t 10.0.0.4:8080 -s rr
Add both of the containers as targets.
ipvsadm -a -t 10.0.0.4:8080 -r 10.0.0.2:8080 -m
ipvsadm -a -t 10.0.0.4:8080 -r 10.0.0.3:8080 -m
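At this point we can inspect the IPVS table to confirm the virtual service and its two real servers:
# List the virtual server table with numeric addresses
ipvsadm -L -n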
If we curl the vip repeatedly, we can see requests rotate between the two containers.
user@host# curl 10.0.0.4:8080
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: c1veth0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 96:33:b4:9d:3a:b3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.0.3/32 brd 10.0.0.3 scope global c1veth0
valid_lft forever preferred_lft forever
inet6 fe80::9433:b4ff:fe9d:3ab3/64 scope link
valid_lft forever preferred_lft forever
user@host# curl 10.0.0.4:8080
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: c0veth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ba:79:db:15:10:b1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.0.2/32 brd 10.0.0.2 scope global c0veth0
valid_lft forever preferred_lft forever
inet6 fe80::b879:dbff:fe15:10b1/64 scope link
valid_lft forever preferred_lft forever
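If you want to undo everything, a rough teardown that reverses the steps above looks like this:
# Remove the virtual service
ipvsadm -D -t 10.0.0.4:8080
# Deleting one veth peer also removes the peer inside the container namespace
ip link del c0veth1
ip link del c1veth1
# Remove the bridge
ip link del br0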
End
I hope to continue experimenting with and learning about containers and networking, and in the future I will write more blog posts about it. Feel free to email me if you have questions or if I got something wrong.