Some early Linux IPC latency data
I’ve added benchmarks for UNIX domain sockets and TCP sockets over the loopback interface. UNIX domain sockets were super easy to implement thanks to the handy `socketpair` function (there’s a quick sketch below). It was not really any different from pipes. The difference is that since sockets are full duplex, you only need to create one pair. If the processes were unrelated, or if I wanted to be able to accept multiple connections, it would be much more like TCP sockets—i.e., a pain!
I say a pain because, in doing this, I ‘found out’ that, despite having written a non-zero number of server applications, I’d never done socket programming before. This wasn’t exactly a surprise, but it was definitely interesting to realise how little I knew about how to go about it. Luckily, man pages! (And *Advanced Programming in the UNIX Environment*.)
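For the `socketpair` case, the setup boils down to something like this—a minimal sketch, not the actual benchmark code, which also handles the timing, warmup, and CPU pinning:

```c
/* Sketch: parent and child talk over a single UNIX domain socket pair.
 * Because the socket is full duplex, one pair of fds covers both directions. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == -1) {
        perror("socketpair");
        exit(EXIT_FAILURE);
    }

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (pid == 0) {               /* child: echo one byte back */
        close(fds[0]);
        char c;
        if (read(fds[1], &c, 1) == 1)
            write(fds[1], &c, 1);
        close(fds[1]);
        _exit(0);
    }

    /* parent: send a byte, wait for the echo */
    close(fds[1]);
    char c = 'x';
    write(fds[0], &c, 1);
    read(fds[0], &c, 1);
    printf("parent got '%c' back\n", c);
    close(fds[0]);
    return 0;
}
```

The parent keeps `fds[0]` and the child keeps `fds[1]`; each end can both read and write, which is what makes a single pair enough.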
Here’s the quick tl;dr for TCP over IPv4 (a rough sketch of both sides follows the list):

- to listen for incoming connections:
  - create a socket with `socket(AF_INET, SOCK_STREAM, 0 /* default protocol */)`.[^1]
  - bind it to a port with `bind(sockfd, addr, addrlen)`, where `addr` is a struct that specifies the address to bind to. For `AF_INET`, this means the IP and port. In my case, I used `INADDR_LOOPBACK` and `0` to listen on some available port on `127.0.0.1`.[^2]
  - start listening on the socket with `listen(sockfd, 1 /* backlog */)`. I used a `backlog` of 1 because I only expect a single incoming connection.
  - finally, call `accept(sockfd, NULL /* addr */, NULL /* addrlen */)` to block until a connection comes in, which returns a new file descriptor to talk to the connecting process. I pass in `NULL` for the `addr` because I don’t care who’s talking to me!
- to connect to another process that’s listening:
  - create a socket with `socket(AF_INET, SOCK_STREAM, 0 /* default protocol */)`.
  - connect to the remote process with `connect(sockfd, addr, addrlen)`. The `addr` specifies the address to connect to; again, for `AF_INET` this means the IP and port.
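Putting those pieces together, here is a rough end-to-end sketch of both sides. It is an illustration rather than the benchmark code: it hard-codes port 5555 (an arbitrary choice) instead of binding to port 0, and it skips most error handling.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* 127.0.0.1, network byte order */
    addr.sin_port = htons(5555);                   /* ports are network byte order too */

    /* listening side: socket -> bind -> listen, all before forking so the
     * child cannot try to connect before the listener is ready */
    int lfd = socket(AF_INET, SOCK_STREAM, 0 /* default protocol: TCP */);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 1 /* backlog */);

    if (fork() == 0) {
        /* connecting side */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        char c = 'x';
        write(fd, &c, 1);
        close(fd);
        _exit(0);
    }

    /* block until the child connects; addr/addrlen are NULL because we
     * don't care who is calling */
    int cfd = accept(lfd, NULL, NULL);
    char c;
    read(cfd, &c, 1);
    printf("listener got '%c'\n", c);
    close(cfd);
    close(lfd);
    return 0;
}
```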
This brings me up to having programs to test latency for four IPC mechanisms:

- pipes
- eventfd
- UNIX domain sockets
- TCP sockets over the loopback interface
Here is some early latency data from my machine, with emphasis on the tail latencies:
| Percentile | 50 | 75 | 90 | 99 | 99.9 | 99.99 | 99.999 |
|---|---|---|---|---|---|---|---|
| pipes | 4255 | 4960 | 5208 | 5352 | 7814 | 16214 | 31290 |
| eventfd | 4353 | 4443 | 4760 | 5053 | 9445 | 14573 | 68528 |
| af_unix | 1439 | 1621 | 1655 | 1898 | 2681 | 11512 | 54714 |
| af_inet_loopback | 7287 | 7412 | 7857 | 8573 | 17412 | 20515 | 37019 |
Units are nanoseconds. Time is measured using `clock_gettime` with
`CLOCK_MONOTONIC`. The quantiles are for a million measurements; in all cases,
the binary was run with flags `--warmup-iters=10000 --iters=1 --repeat=1000000`
(see below).
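For concreteness, taking a single sample looks roughly like the function below, assuming a ping-pong style round trip over whichever file descriptor is being benchmarked (the real measurement loop, flag parsing, and quantile calculation live in the repo).

```c
/* Sketch: time one write/read round trip in nanoseconds using CLOCK_MONOTONIC.
 * Assumes the peer process echoes the byte straight back. */
#include <stdint.h>
#include <time.h>
#include <unistd.h>

uint64_t time_round_trip_ns(int fd) {
    struct timespec start, end;
    char c = 'x';

    clock_gettime(CLOCK_MONOTONIC, &start);
    write(fd, &c, 1);   /* ping */
    read(fd, &c, 1);    /* block until the pong comes back */
    clock_gettime(CLOCK_MONOTONIC, &end);

    int64_t ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000
               + (end.tv_nsec - start.tv_nsec);
    return (uint64_t)ns;
}
```

With `--iters=1 --repeat=1000000`, each of the million samples presumably corresponds to one such measurement.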
For me, the biggest surprise was how much faster UNIX domain sockets were than
anything else, and in particular, how much faster they are than eventfd. Or
that they are faster at all. The `read` call in each case blocks until a
corresponding `write`. I would have thought eventfd had the minimal amount of
extra work beyond that, since all it does is read and modify a `uint64_t`. In
fairness, each of the other programs is writing a single byte at present, but
I doubt that accounts for such a drastic difference.
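For reference, here is the eventfd primitive in isolation: a `write` adds to a 64-bit counter in the kernel, and a `read` blocks until the counter is non-zero, then returns its value and resets it to zero. This is just a sketch of the mechanism, not the benchmark, which presumably needs one eventfd per direction for the ping-pong.

```c
/* Sketch of the eventfd mechanism (Linux-specific). */
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void) {
    int efd = eventfd(0 /* initial counter */, 0 /* flags */);

    uint64_t one = 1;
    write(efd, &one, sizeof(one));   /* counter += 1 */

    uint64_t val;
    read(efd, &val, sizeof(val));    /* returns 1 and resets the counter to 0 */
    printf("read %llu from the eventfd\n", (unsigned long long)val);

    close(efd);
    return 0;
}
```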
Another fun thing is to see the difference in latency when pinning the two processes to specific CPUs. My machine has a dual-core processor, where each core has 2 hardware threads. Here’s a quick look at latencies for pipes with different CPU affinities:
| Percentile | 50 | 75 | 90 | 99 | 99.9 | 99.99 | 99.999 |
|---|---|---|---|---|---|---|---|
| default | 4255 | 4960 | 5208 | 5352 | 7814 | 16214 | 31290 |
| same CPU | 2386 | 2402 | 2564 | 3134 | 12255 | 15126 | 28225 |
| same core | 4232 | 4270 | 4395 | 4788 | 14408 | 17101 | 39052 |
| different core | 5043 | 5101 | 5170 | 5772 | 11894 | 38726 | 398796 |
I was expecting a difference between running on different cores and running on the same core, since crossing cores requires a trip through the L3 cache. I have no real idea of what difference I was expecting, but a microsecond could make sense if multiple locations needed to be accessed. This stuff is beyond my ken, so I’m just guessing.
What I was not expecting was a dramatic difference between ‘same CPU’ and ‘same core’. The CPUs are hardware threads on a single core. I can’t think of any reason there would be such a difference. I do want to check that it’s not due to scheduling weirdness, so I’ll probably boot up in single-user mode at some point to give it another go.
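For anyone wondering what the pinning itself looks like, it presumably comes down to a `sched_setaffinity` call made by the parent and child for their respective `--parent-cpu`/`--child-cpu` values; a minimal sketch (not necessarily the exact code in the repo):

```c
/* Sketch: pin the calling process to a single CPU (Linux-specific). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}
```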
If you want to run these on your own system, clone the repo and run `make`.
There will be four binaries produced, one for each of the mechanisms. They all
take the same command-line flags:
-c, --child-cpu=CPUID CPU to run the child on; default is to let the
scheduler do as it will
-i, -n, --iters=COUNT number of iterations to measure; default: 100000
-p, --parent-cpu=CPUID CPU to run the parent on; default is to let the
scheduler do as it will
-r, --repeat=COUNT number of times to repeat measurement; default: 1
-w, --warmup-iters=COUNT number of iterations before measurement; default:
1000
-?, --help Give this help list
--usage Give a short usage message
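For example, to reproduce something like the ‘same CPU’ row of the pinning table, you would pass the same CPU ID for both processes, e.g. `--parent-cpu=0 --child-cpu=0`, alongside the measurement flags used earlier (`--warmup-iters=10000 --iters=1 --repeat=1000000`).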
[^1]: The default protocol for `SOCK_STREAM` for the `AF_INET` socket family is TCP.

[^2]: A fun little thing to be aware of is that the `addr` must contain the IP address in network byte order. This necessitates converting the IP address and port using `htonl` and `htons`, respectively, from *h*ost to *n*etwork byte order (the `l` stands for `long`, which in this case means a `uint32_t` because `long`s used to be shorter; the `s` stands for `short`, which have stayed short at 16 bits long).