how does linux nat a ping?

A few months ago, I found myself wondering how a command like ping 1.1.1.1 works from within a private network.

In most private networks, multiple hosts connect to the Internet through a router. For IPv4, the router performs network address translation (NAT) by rewriting the original host’s source address to the router’s public IP address. The router can look up the correct host for a reply packet based on the packet’s port field, at least for protocols like TCP and UDP.

But a command like ping doesn’t use TCP or UDP; it uses ICMP, and those packets do not have a port field. So how does NAT work for ICMP packets?

This led me down a deep rabbit hole: running experiments in network namespaces, capturing packets, reading RFCs, and tracing through the Linux source code. This post summarizes what I did and learned along the way.1

Before these experiments, I hadn’t spent much time in the Linux networking code – this is something new I’m learning. If I’ve made any mistakes please let me know so I can correct them.


Experiment setup

One of the best ways to understand Linux networking is through experimentation. These days, it’s easy to run experiments using network namespaces to simulate multiple devices on a single Linux machine.

This is the setup I wanted to test:

[Diagram showing the setup of the experiment]

There are two clients (client1 and client2) connected to a router (natbox) performing NAT from private network 192.168.99.0/24 to public network 10.0.100.0/24. The clients, natbox, and server are each separate network namespaces. Once everything is ready, a ping from either client to the server at 10.0.100.2 should get a reply!

For these experiments, I used a Fedora 38 Server VM running version 6.2.9 of the Linux kernel. Most of the commands below (ip, iptables, tcpdump, etc.) were run as the root user.2
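
To start over at any point without rebooting, everything can also be torn down by hand (assuming nothing else on the machine uses these namespaces or the bridge):

# Delete every named network namespace and the bridge.
ip -all netns delete
ip link delete br0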

Step 1: Connect two clients to a bridge

The first step is to create two clients connected to a bridge, like this:

[Diagram showing two clients connected to a bridge]

To set it up:

# Create a network namespace for each client.
ip netns add "client1"
ip netns add "client2"

# Create a virtual bridge.
ip link add name "br0" type bridge
ip link set dev "br0" up

# Disable iptables processing for bridges so rules don't block traffic over br0.
# This is necessary only if the br_netfilter module is enabled.
sysctl -w net.bridge.bridge-nf-call-iptables=0
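
# (Optional) Check whether br_netfilter is loaded at all; if it isn't, the
# sysctl above fails with "No such file or directory" and can be skipped.
lsmod | grep br_netfilter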

# Connect client1 to the bridge with a veth pair and assign IP address 192.168.99.1
ip link add dev "vethclient1" type veth peer name "eth0" netns "client1"
ip link set "vethclient1" master "br0"
ip link set "vethclient1" up
ip -n "client1" addr add dev "eth0" "192.168.99.1/24"
ip -n "client1" link set dev "eth0" up

# Same for client2, with IP address 192.168.99.2
ip link add dev "vethclient2" type veth peer name "eth0" netns "client2"
ip link set "vethclient2" master "br0"
ip link set "vethclient2" up
ip -n "client2" addr add dev "eth0" "192.168.99.2/24"
ip -n "client2" link set dev "eth0" up

If this worked, the two clients can ping each other over the bridge:

# ping client1 -> client2
ip netns exec client1 ping 192.168.99.2

# ping client2 -> client1
ip netns exec client2 ping 192.168.99.1
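
If the pings don’t work, a couple of read-only checks can help narrow down where the setup went wrong (assuming the names used above):

# Both veth ends should be listed as ports of br0.
bridge link show

# client1's eth0 should be UP with address 192.168.99.1/24.
ip -n client1 addr show dev eth0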

Step 2: Connect natbox and server

Next, create network namespaces for the natbox and server:

ip netns add "natbox"
ip netns add "server"

Then connect the natbox to the bridge:

ip link add dev "vethnatbox" type veth peer name "eth0" netns "natbox"
ip link set "vethnatbox" master "br0"
ip link set "vethnatbox" up
ip -n "natbox" addr add dev "eth0" "192.168.99.3/24"
ip -n "natbox" link set dev "eth0" up

The natbox needs a second interface in the 10.0.100.0/24 network, so add that and call it “eth1”. Since there’s only one server, there’s no need for a bridge – just connect the natbox and server directly with a veth pair:

ip -n "natbox" link add "eth1" type veth peer name "eth0" netns "server"
ip -n "natbox" addr add dev "eth1" "10.0.100.1/24"
ip -n "natbox" link set dev "eth1" up
ip -n "server" addr add dev "eth0" "10.0.100.2/24"
ip -n "server" link set dev "eth0" up

Now the natbox can reach both clients and the server. Test it with ping:

# ping natbox -> client1
ip netns exec natbox ping 192.168.99.1

# ping natbox -> client2
ip netns exec natbox ping 192.168.99.2

# ping natbox -> server
ip netns exec natbox ping 10.0.100.2

At this point, every network namespace, interface, and veth pair has been created:

[Diagram showing the setup of the experiment]

However, the clients cannot yet ping the server because the natbox isn’t forwarding traffic between its interfaces or performing NAT.
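
Before changing anything, it’s easy to confirm the natbox’s current state with a couple of read-only checks:

# Forwarding should be disabled (this typically prints net.ipv4.ip_forward = 0).
ip netns exec natbox sysctl net.ipv4.ip_forward

# And the "nat" table should contain no rules yet, only the default chain policies.
ip netns exec natbox iptables -t nat -S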

Step 3: Configure routing and NAT

Add a default route in each client to send traffic to the natbox:

ip -n client1 route add 0.0.0.0/0 via 192.168.99.3
ip -n client2 route add 0.0.0.0/0 via 192.168.99.3
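
A quick check that the route took effect (the default route should point at the natbox):

# Should include: default via 192.168.99.3 dev eth0
ip -n client1 route show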

For security reasons, Linux does not forward packets between interfaces unless specifically enabled. So configure the natbox to forward traffic by setting net.ipv4.ip_forward:

ip netns exec natbox sysctl "net.ipv4.ip_forward=1"

At this point, packets from a client will reach the server. However, these packets retain the original source IP in the 192.168.99.0/24 network, so replies from the server back to this IP will go… nowhere. Fix it by configuring the natbox to NAT traffic from the client IPs (in network 192.168.99.0/24) to the natbox’s public IP (10.0.100.1). The easiest way to do this is to add a MASQUERADE rule to the POSTROUTING chain of the iptables “nat” table:

ip netns exec natbox iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE
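
MASQUERADE picks the source address automatically from the outgoing interface, which is handy when the public IP can change. Since the natbox’s address on eth1 is fixed in this experiment, an explicit SNAT rule would work just as well (an equivalent alternative, not something the rest of this post depends on):

ip netns exec natbox iptables -t nat -A POSTROUTING -o eth1 -j SNAT --to-source 10.0.100.1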

At last, clients can reach the server through the natbox! Test it with ping:

# ping client1 -> server via natbox
ip netns exec client1 ping 10.0.100.2

# ping client2 -> server via natbox
ip netns exec client2 ping 10.0.100.2

Packet capture

Now capture ICMP packets in the client1 and server network namespaces while both clients ping the server:

ip netns exec client1 tcpdump -n icmp
ip netns exec server tcpdump -n icmp
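
It can also be interesting to capture on the natbox itself, where both the private and public sides of the same packets are visible (one capture per interface):

# Private side: source IPs in 192.168.99.0/24.
ip netns exec natbox tcpdump -n -i eth0 icmp

# Public side: the rewritten source IP 10.0.100.1.
ip netns exec natbox tcpdump -n -i eth1 icmp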

This is the tcpdump for client1:

08:01:33.549598 IP 192.168.99.1 > 10.0.100.2: ICMP echo request, id 31428, seq 1, length 64
08:01:33.549661 IP 10.0.100.2 > 192.168.99.1: ICMP echo reply, id 31428, seq 1, length 64
08:01:34.610605 IP 192.168.99.1 > 10.0.100.2: ICMP echo request, id 31428, seq 2, length 64
08:01:34.610654 IP 10.0.100.2 > 192.168.99.1: ICMP echo reply, id 31428, seq 2, length 64

… and the corresponding tcpdump for the server:

08:01:33.549643 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 31428, seq 1, length 64
08:01:33.549654 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 31428, seq 1, length 64
08:01:34.446611 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 33391, seq 1, length 64
08:01:34.446619 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 33391, seq 1, length 64
08:01:34.610635 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 31428, seq 2, length 64
08:01:34.610646 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 31428, seq 2, length 64
08:01:35.506411 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 33391, seq 2, length 64
08:01:35.506423 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 33391, seq 2, length 64

These captures show that the natbox rewrites the source IP address: client1 sends from 192.168.99.1, but the server sees the echo requests coming from 10.0.100.1 (and the replies are rewritten back on the way in). The ICMP “id” field, however, passes through unchanged: client1’s pings keep id 31428 on both sides, while the second set of pings visible on the server (id 33391) belongs to client2.

The “id” field seemed like it might allow the natbox to distinguish reply packets destined for each client. But what does the “id” field mean, and how is it chosen?

RFC 792

ICMP is a very, very old protocol. It is defined in RFC 792, which was published in 1981. The RFC specifies the exact structure of an ICMP echo and echo reply message:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |     Code      |          Checksum             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Identifier          |        Sequence Number        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Data ...
   +-+-+-+-+-

The “type” field distinguishes an echo request (8) from an echo reply (0). Code is always 0 (I guess it isn’t used for anything?). What about “sequence number” and “identifier”?

If code = 0, an identifier to aid in matching echos and replies, may be zero…

If code = 0, a sequence number to aid in matching echos and replies, may be zero…

The identifier and sequence number may be used by the echo sender to aid in matching the replies with the echo requests. For example, the identifier might be used like a port in TCP or UDP to identify a session, and the sequence number might be incremented on each echo request sent. The echoer returns these same values in the echo reply.

The RFC doesn’t say anything about how the IDs are actually chosen. That’s not part of the protocol specification, so the next step is to look at an implementation – in this case, the source code for the ping command.

Ping source code

The ping command is part of the “iputils” package, with source code available at github.com/iputils/iputils. There is a comment just before ping4_send_probe:

/*
 * pinger --
 * 	Compose and transmit an ICMP ECHO REQUEST packet.  The IP packet
 * will be added on by the kernel.  The ID field is a random number,
 * and the sequence number is an ascending integer.  The first several bytes
 * of the data portion are used to hold a UNIX "timeval" struct in VAX
 * byte-order, to compute the round-trip time.
 */

So ping chooses the identifier randomly. It’s a bit difficult to see where this actually happens in the code, but from what I understand:

  1. There is a struct ping_rts that has a field ident.
  2. The ident field defaults to -1, but can be overridden by the CLI flag “-e” to any value between zero and IDENTIFIER_MAX (0xFFFF).
  3. When rts->ident == -1, ping opens a socket with type SOCK_DGRAM and protocol IPPROTO_ICMP. In this configuration, it does not set source.sin_port, so the source port is left as zero.

I didn’t find much documentation for how Linux implements SOCK_DGRAM sockets with IPPROTO_ICMP, except for this description from the mailing list posting “net: ipv4: add IPPROTO_ICMP socket kind”:

ICMP headers given to send() are checked and sanitized. The type must be ICMP_ECHO and the code must be zero (future extensions might relax this, see below). The id is set to the number (local port) of the socket, the checksum is always recomputed.

I suspect that when ping doesn’t specify a source port (source.sin_port == 0), then the Linux kernel chooses a free port at random. This port then gets used as the ID for ICMP packets.
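
One way to see this from userspace is to trace the system calls ping makes (assuming strace is installed):

# If ping uses the unprivileged ICMP socket, this shows
# socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP); depending on privileges and how
# ping was built, it may open a raw socket (SOCK_RAW, IPPROTO_ICMP) instead.
ip netns exec client1 strace -e trace=socket,bind,connect ping -c 1 10.0.100.2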

ID conflict

What happens if two ping processes on different hosts both choose the exact same ID? Test it using ping -e to explicitly set the ICMP ID to the same value for both clients:

# ping from client1 -> server with ICMP ID 999
ip netns exec client1 ping 10.0.100.2 -e 999

# ping from client2 -> server with ICMP ID 999
ip netns exec client2 ping 10.0.100.2 -e 999

This time, the packet capture from the server shows something different:

10:22:18.807289 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 999, seq 1, length 64
10:22:18.807300 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 999, seq 1, length 64
10:22:19.838650 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 999, seq 2, length 64
10:22:19.838661 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 999, seq 2, length 64
10:22:20.011677 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 30218, seq 1, length 64
10:22:20.011687 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 30218, seq 1, length 64
10:22:20.862591 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 999, seq 3, length 64
10:22:20.862603 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 999, seq 3, length 64
10:22:21.054598 IP 10.0.100.1 > 10.0.100.2: ICMP echo request, id 30218, seq 2, length 64
10:22:21.054614 IP 10.0.100.2 > 10.0.100.1: ICMP echo reply, id 30218, seq 2, length 64

One of the clients is using ID 999, but the other one is using ID 30218. Where did that second ID come from? Time to go to the Linux source code.

Netfilter, conntrack, and NAT

The kernel subsystem responsible for implementing iptables rules is called “netfilter.” The iptables MASQUERADE rule is responsible for NAT’ing packets, so the NAT implementation for ICMP must be in netfilter. Grepping through the net/netfilter directory in the Linux repository, I found a few places where the ICMP “id” field is used.

In order to NAT packets, netfilter needs to store something called a connection. For TCP, not surprisingly, this represents the TCP connection, uniquely identified by the 5-tuple (src IP, src port, dst IP, dst port, L4 protocol). However, in netfilter the term “connection” has a broader meaning: it can correlate outgoing and incoming packets even for connectionless protocols like UDP and ICMP.

Examining the nf_conn data structure shows how a connection is represented: each nf_conn stores a pair of tuples, one for the original direction and one for the reply direction, and for ICMP each tuple carries the ICMP ID where TCP or UDP would have a port.

Connection tracking and NAT are closely related. To NAT a packet, netfilter needs to “remember” how it modified the outgoing packet so it can reverse those modifications on the reply packet. It does so by representing the modifications in a connection.
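
These connections can be inspected from userspace with the conntrack tool (assuming the conntrack-tools package is installed). While one of the pings from earlier is running, the natbox’s table should contain an icmp entry showing both the original and reply tuples, including the id:

# List tracked ICMP "connections" in the natbox namespace.
ip netns exec natbox conntrack -L -p icmp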

For ICMP, I believe netfilter works like this:

  1. When natbox receives an ICMP echo, nf_nat_setup_info creates a new connection. This is where it chooses whether it needs to rewrite the source IP address and/or the ICMP id field on the outgoing packet.
  2. For each incoming and outgoing ICMP packet, the function nf_nat_manip_pkt sets the source IP and ICMP id field to whatever is set in the connection. The argument ip_conntrack_dir dir determines whether the packet is treated as an outgoing echo (rewrite the source IP) or incoming reply (rewrite the destination IP).

nf_nat_setup_info is responsible for choosing the ICMP ID for the NAT’d packets, which it does by calling get_unique_tuple.

Here are the key steps:

  1. On line 541, find_best_ips_proto(zone, tuple, range, ct, maniptype) rewrites the source IP address.
  2. On lines 548-560, nf_nat_used_tuple(tuple, ct) checks whether the tuple is already being used; if not, the current tuple is returned. This explains why when two clients use different ICMP IDs, those IDs are preserved in the NAT’d packets.
  3. On line 563, nf_nat_l4proto_unique_tuple is called to perform protocol-specific NAT (in this case manipulating the ICMP ID field).
  4. In nf_nat_l4proto_unique_tuple, lines 393-403 set keyptr = &tuple->src.u.icmp.id to choose the ICMP ID field as the “key” to NAT, then jump to find_free_id at the end of the function.
  5. find_free_id on line 471 calls get_random_u16() to generate a random ID, adjusts the value into the range3 of valid ICMP IDs (on line 485), then checks if it’s used (another call to nf_nat_used_tuple on line 486).
  6. If a tuple with the random ID is not yet in use, it gets returned. Otherwise, netfilter searches for an unused ID in progressively smaller ranges starting at random offsets (lines 483-494).
  7. If an unused tuple cannot be found within a maximum number of attempts, then nf_nat_l4proto_unique_tuple returns, leaving the duplicate ID in the connection. Later, __nf_conntrack_confirm will detect the duplicate and drop the packet.
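
As footnote 3 mentions, the range that find_free_id searches can be constrained from iptables. A small sketch that replaces the MASQUERADE rule from the setup above, making ID rewrites much easier to trigger:

# Restrict NAT'd ICMP IDs to 100-200 so collisions (and re-mapping) are more likely.
ip netns exec natbox iptables -t nat -D POSTROUTING -o eth1 -j MASQUERADE
ip netns exec natbox iptables -t nat -A POSTROUTING -o eth1 -p icmp -j MASQUERADE --to-ports 100-200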

bpftrace

To verify my understanding of the netfilter code, I used a tool called bpftrace.4 After much tinkering, I ended up with this program to trace the kernel functions nf_nat_setup_info and nf_nat_manip_pkt:

// from linux/socket.h
#define AF_INET		2	/* Internet IP Protocol 	*/

// from net/netfilter/nf_nat.h
enum nf_nat_manip_type {
	NF_NAT_MANIP_SRC,
	NF_NAT_MANIP_DST
};

// from include/uapi/linux/netfilter/nf_conntrack_tuple_common.h
// Use #define instead of enum so we can use these in bpftrace array indices.
#define IP_CT_DIR_ORIGINAL 0
#define IP_CT_DIR_REPLY 1

kprobe:nf_nat_setup_info {
	// nf_nat_setup_info gets called twice, once in the prerouting chain
	// to modify the destination (actually a no-op), and once in the output
	// chain to modify the source (which is what we care about).
	$mtype = arg2;
	if ($mtype != NF_NAT_MANIP_SRC) {
		return;
	}

	$conn = (struct nf_conn *)arg0;
	if ($conn->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum == IPPROTO_ICMP) {
		@setupConn[tid] = $conn;
	}
}

kretprobe:nf_nat_setup_info {
	if (@setupConn[tid] == none) {
		return;
	}
	$conn = (struct nf_conn *)@setupConn[tid];
	$origTuple = $conn->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
	$replyTuple = $conn->tuplehash[IP_CT_DIR_REPLY].tuple;
	printf("nf_nat_setup_info: origTuple.addr=%s, origTuple.id=%d, replyTuple.addr=%s, replyTuple.id=%d\n",
		ntop(AF_INET, $origTuple.src.u3.ip),
		bswap($origTuple.src.u.icmp.id),
		ntop(AF_INET, $replyTuple.src.u3.ip),
		bswap($replyTuple.src.u.icmp.id));
	delete(@setupConn[tid]);
}

kprobe:nf_nat_manip_pkt {
	$mtype = arg2;
	$skb = (struct sk_buff *)arg0;
	$iphdr = (struct iphdr *)$skb->data;
	$icmphdr = (struct icmphdr *)($skb->data + $iphdr->ihl * 4);
	printf("nf_nat_manip_pkt before: mtype=%d, saddr=%s, daddr=%s, icmp.type=%d, icmp.id=%d\n",
		$mtype,
		ntop(AF_INET, $iphdr->saddr),
		ntop(AF_INET, $iphdr->daddr),
		bswap($icmphdr->type),
		bswap($icmphdr->un.echo.id));
	@manipType[tid] = $mtype;
	@manipSkb[tid] = $skb;
}

kretprobe:nf_nat_manip_pkt {
	$mtype = @manipType[tid];
	$skb = @manipSkb[tid];
	$iphdr = (struct iphdr *)$skb->data;
	$icmphdr = (struct icmphdr *)($skb->data + $iphdr->ihl * 4);
	printf("nf_nat_manip_pkt after: mtype=%d, saddr=%s, daddr=%s, icmp.type=%d, icmp.id=%d\n",
		$mtype,
		ntop(AF_INET, $iphdr->saddr),
		ntop(AF_INET, $iphdr->daddr),
		bswap($icmphdr->type),
		bswap($icmphdr->un.echo.id));
	delete(@manipType[tid]);
	delete(@manipSkb[tid]);
}

The important parts are the two pairs of probes: the kprobe/kretprobe on nf_nat_setup_info records new ICMP connections and prints the original and reply tuples (including the ICMP ID), while the pair on nf_nat_manip_pkt prints each packet’s source and destination IPs and ICMP ID before and after netfilter rewrites them.

To execute the program, I saved the above code to a file called trace.bt then ran bpftrace trace.bt as root. This is what the output looks like with two clients pinging the server using the same ICMP ID (999):

$ bpftrace trace.bt
Attaching 4 probes...
nf_nat_setup_info: origTuple.addr=192.168.99.1, origTuple.id=999, replyTuple.addr=10.0.100.2, replyTuple.id=999
nf_nat_manip_pkt before: mtype=0, saddr=192.168.99.1, daddr=10.0.100.2, icmp.type=8, icmp.id=999
nf_nat_manip_pkt after: mtype=0, saddr=10.0.100.1, daddr=10.0.100.2, icmp.type=8, icmp.id=999
nf_nat_manip_pkt before: mtype=1, saddr=10.0.100.2, daddr=10.0.100.1, icmp.type=0, icmp.id=999
nf_nat_manip_pkt after: mtype=1, saddr=10.0.100.2, daddr=192.168.99.1, icmp.type=0, icmp.id=999
nf_nat_setup_info: origTuple.addr=192.168.99.2, origTuple.id=999, replyTuple.addr=10.0.100.2, replyTuple.id=32809
nf_nat_manip_pkt before: mtype=0, saddr=192.168.99.2, daddr=10.0.100.2, icmp.type=8, icmp.id=999
nf_nat_manip_pkt after: mtype=0, saddr=10.0.100.1, daddr=10.0.100.2, icmp.type=8, icmp.id=32809
nf_nat_manip_pkt before: mtype=1, saddr=10.0.100.2, daddr=10.0.100.1, icmp.type=0, icmp.id=32809
nf_nat_manip_pkt after: mtype=1, saddr=10.0.100.2, daddr=192.168.99.2, icmp.type=0, icmp.id=999

The output shows that nf_nat_setup_info gets called twice, once for each client.5 For the first client (IP 192.168.99.1), both the original and reply tuple have the ICMP ID sent by the client (999). For the second client (IP 192.168.99.2), however, the reply tuple has been rewritten to ID 32809. For both clients, the source IP address has been rewritten to the IP of the natbox (10.0.100.1).

Once nf_nat_setup_info has created the connection, nf_nat_manip_pkt modifies the echo and echo reply ICMP packets. For the echo packet, mtype=0 (NF_NAT_MANIP_SRC) because the source IP is rewritten. Likewise, the reply packet has mtype=1 (NF_NAT_MANIP_DST) to rewrite the destination IP of the incoming reply back to the original client IP.

Conclusion

So that is how Linux NATs a ping! In the end, maybe the answer isn’t very surprising – and, in fact, I discovered much later that most of this behavior is documented in the Netfilter Hacking HOWTO. But it was a fun journey, and it’s nice to know exactly where this magic happens in the code.


  1. There’s a superuser question that explains how NAT works for ICMP. I guess I could have read that and moved on with my life, but I wanted to take the opportunity to dive deeper into Linux networking. ↩︎

  2. None of these commands are persistent: rebooting the machine will clear any network namespaces, iptables rules, and interfaces. ↩︎

  3. By default, this is the full range of IDs (0 to 65535), but it’s possible to constrain this by specifying --to-ports in the iptables MASQUERADE rule like this: iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE -p icmp --to-ports 100-200 ↩︎

  4. For more details about bpftrace, see Brendan Gregg’s 2019 article from LWN ↩︎

  5. I observed that if the client stops sending packets for more than 30 seconds (the default timeout: check cat /proc/sys/net/netfilter/nf_conntrack_icmp_timeout), then nf_nat_setup_info gets called again the next time the client pings, presumably because netfilter garbage collects connections after the timeout. I wasn’t able to trace exactly where the garbage collection happens, but I’d guess it’s part of gc_worker in nf_conntrack_core.c↩︎