Why does TCP write latency get worse when work is interleaved?

Problem description

I have been profiling TCP latency (in particular, the write from user space to kernel space of a small message) in order to get some intuition for the latency of a write (acknowledging that this can be context-specific). I have noticed substantial inconsistency between tests that to me seem similar, and I'm very curious to figure out where the difference comes from. I understand that microbenchmarks can be problematic, but I still feel like I am missing some fundamental understanding (since the latency differences are ~10x).

The setup is that I have a C++ TCP server that accepts one client connection (from another process on the same CPU), and upon connecting with the client makes 20 system calls to write to the socket, sending one byte at a time. The full code of the server is copied at the end of this post. Here's the output, timing each write with boost/timer (which adds ~1 mic of noise):

$ clang++ -std=c++11 -stdlib=libc++ tcpServerStove.cpp -O3; ./a.out
18 mics
3 mics
3 mics
4 mics
3 mics
3 mics
4 mics
3 mics
5 mics
3 mics
...

I reliably find that the first write is significantly slower than the others. If I wrap 10,000 write calls in a timer, the average is 2 microseconds per write, yet the first call is always 15+ mics. Why is there this "warming up" phenomenon?
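
As an aside on measurement: boost::timer itself contributes ~1 mic of noise, so one way to look at the warm-up effect more directly is to time each write with clock_gettime(CLOCK_MONOTONIC) and issue an untimed throwaway write first. This is only a minimal sketch, assuming POSIX clock_gettime is available; the names nowNanos and socket_fd are illustrative, not from the code below:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

// Nanosecond timestamp from the monotonic clock.
static long nowNanos() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

// Time 20 one-byte writes on an already-connected socket, after one
// untimed warm-up write that absorbs any one-time setup cost.
void timeWrites(int socket_fd) {
    char byte = 0;
    write(socket_fd, &byte, 1);    // warm-up write, not timed
    for (int i = 0; i < 20; i++) {
        long t0 = nowNanos();
        write(socket_fd, &byte, 1);
        printf("%ld ns\n", nowNanos() - t0);
    }
}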

Relatedly, I ran an experiment where in between each write call I do some blocking CPU work (calculating a large prime number). This causes all the write calls to be slow:

$ clang++ -std=c++11 -stdlib=libc++ tcpServerStove.cpp -O3; ./a.out
20 mics
23 mics
23 mics
30 mics
23 mics
21 mics
21 mics
22 mics
22 mics
...

Given these results, I'm wondering if there is some kind of batching that happens during the process of copying bytes from the user buffer to the kernel buffer. If multiple write calls happen in quick succession, do they get coalesced into one kernel interrupt?
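
One way to watch this coalescing directly, assuming a Linux box (the posts here suggest macOS, where this ioctl does not exist), is to ask the kernel how many bytes are still queued in the socket's send buffer after a burst of writes, via SIOCOUTQ. A rough sketch, with an illustrative helper name:

#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>

// SIOCOUTQ reports how many bytes are queued in the socket send buffer but
// not yet sent by TCP. If several 1-byte writes land before the stack
// transmits, this count grows -- the merging happens in the socket buffer,
// not as one merged syscall or one interrupt per write().
void writeBurstAndReport(int socket_fd) {
    char byte = 0;
    for (int i = 0; i < 20; i++) {
        write(socket_fd, &byte, 1);
    }
    int unsent = 0;
    if (ioctl(socket_fd, SIOCOUTQ, &unsent) == 0) {
        printf("bytes still queued in the send buffer: %d\n", unsent);
    }
}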

In particular I am looking for some notion of how long write takes to copy buffers from user space to kernel space. If there is some coalescing effect that allows the average write to only take 2 mics when I do 10,000 in succession, then it would be unfairly optimistic to conclude that the write latency is 2 mics; it seems that my intuition should be that each write takes 20 microseconds. This seems surprisingly slow for the lowest latency you can get (a raw write call on one byte) without kernel bypass.

A final piece of data is that when I set up a ping-pong test between two processes on my computer (a TCP server and a TCP client), I average 6 mics per round trip (which includes a read, a write, as well as moving through the localhost network). This seems at odds with the 20 mic latencies for a single write seen above.

Full code for the TCP server:

// Server side C/C++ program to demonstrate Socket programming
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <boost/timer.hpp>
#include <unistd.h>

// Set up some blocking work.
bool isPrime(int n) {
    if (n < 2) {
        return false;
    }

    for (int i = 2; i < n; i++) {
        if (n % i == 0) {
            return false;
        }
    }

    return true;
}

// Compute the nth largest prime. Takes ~1 sec for n = 10,000
int getPrime(int n) {
    int numPrimes = 0;
    int i = 0;
    while (true) {
        if (isPrime(i)) {
            numPrimes++;
            if (numPrimes >= n) {
                return i;
            }
        }
        i++;
    }
}

int main(int argc, char const *argv[])
{
    int server_fd, new_socket, valread;
    struct sockaddr_in address;
    int opt = 1;
    int addrlen = sizeof(address);

    // Create socket for TCP server
    server_fd = socket(AF_INET, SOCK_STREAM, 0);

    // Prevent writes from being batched
    setsockopt(server_fd, SOL_SOCKET, TCP_NODELAY, &opt, sizeof(opt));
    setsockopt(server_fd, SOL_SOCKET, TCP_NOPUSH, &opt, sizeof(opt));
    setsockopt(server_fd, SOL_SOCKET, SO_SNDBUF, &opt, sizeof(opt));
    setsockopt(server_fd, SOL_SOCKET, SO_SNDLOWAT, &opt, sizeof(opt));

    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);

    bind(server_fd, (struct sockaddr *)&address, sizeof(address));

    listen(server_fd, 3);

    // Accept one client connection
    new_socket = accept(server_fd, (struct sockaddr *)&address, (socklen_t*)&addrlen);

    char sendBuffer[1] = {0};
    int primes[20] = {0};
    // Make 20 sequential writes to kernel buffer.
    for (int i = 0; i < 20; i++) {
        sendBuffer[0] = i;
        boost::timer t;
        write(new_socket, sendBuffer, 1);
        printf("%d mics\n", int(1e6 * t.elapsed()));

        // For some reason, doing some blocking work between the writes
        // slows down the writes by a factor of 10.
        // primes[i] = getPrime(10000 + i);
    }

    // Print a prime to make sure the compiler doesn't optimize
    // away the computations.
    printf("prime: %d\n", primes[8]);

}

Full code for the TCP client:

// Client side C/C++ program to demonstrate Socket programming
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char const *argv[])
{
    int sock, valread;
    struct sockaddr_in address;
    int opt = 1;
    int addrlen = sizeof(address);

    // We'll be passing uint32's back and forth
    unsigned char recv_buffer[1024] = {0};

    // Create socket for TCP client
    sock = socket(AF_INET, SOCK_STREAM, 0);

    setsockopt(sock, SOL_SOCKET, TCP_NODELAY, &opt, sizeof(opt));

    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);

    // Connect to the server
    if (connect(sock, (struct sockaddr *)&address, (socklen_t)addrlen) != 0) {
        throw("connect failed");
    }

    // Read whatever the server sends into recv_buffer.
    read(sock, recv_buffer, sizeof(recv_buffer));

    for (int i = 0; i < 10; i++) {
        printf("%d\n", recv_buffer[i]);
    }
}

I tried with and without the flags TCP_NODELAY, TCP_NOPUSH, SO_SNDBUF and SO_SNDLOWAT, with the idea that this might prevent batching (but my understanding is that this batching occurs between the kernel buffer and the network, not between the user buffer and the kernel buffer).
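
One detail worth double-checking in the listings above: TCP_NODELAY and TCP_NOPUSH are IPPROTO_TCP-level options rather than SOL_SOCKET-level ones, so the setsockopt calls in the server may not be setting what was intended. A sketch of the conventional levels, with return values checked (the helper name is illustrative, not from the original code):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

// TCP_NODELAY (and, on BSD/macOS, TCP_NOPUSH) live at IPPROTO_TCP;
// SO_SNDBUF and SO_SNDLOWAT live at SOL_SOCKET. Checking the return value
// makes a rejected option or level visible immediately.
void setLowLatencyOptions(int socket_fd) {
    int one = 1;
    if (setsockopt(socket_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0)
        perror("TCP_NODELAY");
    int sndbuf = 1;
    if (setsockopt(socket_fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) != 0)
        perror("SO_SNDBUF");
    int lowat = 1;
    if (setsockopt(socket_fd, SOL_SOCKET, SO_SNDLOWAT, &lowat, sizeof(lowat)) != 0)
        perror("SO_SNDLOWAT");
}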

Here is server code for the ping pong test:

// Server side C/C++ program to demonstrate Socket programming
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <boost/timer.hpp>
#include <unistd.h>

__inline__ uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "xorl %%eax,%%eax \n        cpuid"
        ::: "%rax", "%rbx", "%rcx", "%rdx");
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

// Big Endian (network order)
unsigned int fromBytes(unsigned char b[4]) {
    return b[3] | b[2]<<8 | b[1]<<16 | b[0]<<24;
}

void toBytes(unsigned int x, unsigned char (&b)[4]) {
    b[3] = x;
    b[2] = x>>8;
    b[1] = x>>16;
    b[0] = x>>24;
}

int main(int argc, char const *argv[])
{
    int server_fd, new_socket, valread;
    struct sockaddr_in address;
    int opt = 1;
    int addrlen = sizeof(address);
    unsigned char recv_buffer[4] = {0};
    unsigned char send_buffer[4] = {0};

    // Create socket for TCP server
    server_fd = socket(AF_INET, SOCK_STREAM, 0);

    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);

    bind(server_fd, (struct sockaddr *)&address, sizeof(address));

    listen(server_fd, 3);

    // Accept one client connection
    new_socket = accept(server_fd, (struct sockaddr *)&address, (socklen_t*)&addrlen);
    printf("Connected with client!\n");

    int counter = 0;
    unsigned int x = 0;
    auto start = rdtsc();
    boost::timer t;

    int n = 10000;
    while (counter < n) {
        valread = read(new_socket, recv_buffer, 4);
        x = fromBytes(recv_buffer);
        toBytes(x+1, send_buffer);
        write(new_socket, send_buffer, 4);
        ++counter;
    }

    printf("%f clock cycles per round trip (rdtsc)\n",  (rdtsc() - start) / double(n));
    printf("%f mics per round trip (boost timer)\n", 1e6 * t.elapsed() / n);
}

Here is client code for the ping pong test:

// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <boost/timer.hpp>
#include <unistd.h>

// Big Endian (network order)
unsigned int fromBytes(unsigned char b[4]) {
    return b[3] | b[2]<<8 | b[1]<<16 | b[0]<<24;
}

void toBytes(unsigned int x, unsigned char (&b)[4]) {
    b[3] = x;
    b[2] = x>>8;
    b[1] = x>>16;
    b[0] = x>>24;
}

int main(int argc, char const *argv[])
{
    int sock, valread;
    struct sockaddr_in address;
    int opt = 1;
    int addrlen = sizeof(address);

    // We'll be passing uint32's back and forth
    unsigned char recv_buffer[4] = {0};
    unsigned char send_buffer[4] = {0};

    // Create socket for TCP client
    sock = socket(AF_INET, SOCK_STREAM, 0);

    // Set TCP_NODELAY so that writes won't be batched
    setsockopt(sock, SOL_SOCKET, TCP_NODELAY, &opt, sizeof(opt));

    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);

    // Connect to the server
    if (connect(sock, (struct sockaddr *)&address, (socklen_t)addrlen) != 0) {
        throw("connect failed");
    }

    unsigned int lastReceived = 0;
    while (true) {
        toBytes(++lastReceived, send_buffer);
        write(sock, send_buffer, 4);
        valread = read(sock, recv_buffer, 4);
        lastReceived = fromBytes(recv_buffer);
    }
}

Answer

There are a few problems here.

To get closer to the answer, you need to have your client side do two things: 1. receive all the data. 2. keep track of how big each read was. I did this by:

int loc[N+1];
int nloc, curloc;
for (nloc = curloc = 0; curloc < N; nloc++) {
    int n = read(sock, recv_buffer + curloc, sizeof(recv_buffer) - curloc);
    if (n <= 0) {
        break;
    }
    curloc += n;
    loc[nloc] = curloc;
}
int last = 0;
for (int i = 0; i < nloc; i++) {
    printf("%*.*s ", loc[i] - last, loc[i] - last, recv_buffer + last);
    last = loc[i];
}
printf("\n");

and defining N to 20 (sorry, upbringing), and changing your server to write a-z one byte at a time. Now, when this prints out something like:

 a b c d e f g h i j k l m n o p q r s 

we know the server is sending 1 byte packets; however when it prints something like:

 a bcde fghi jklm nop qrs 

we suspect the server is sending mainly 4 byte packets.
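
If you want to confirm the packet sizes independently of the read() boundaries, watching the loopback interface also works; something along these lines (lo0 on macOS, lo on Linux):

$ sudo tcpdump -i lo0 -nn port 8080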

The root problem is that TCP_NODELAY doesn't do what you suspect. Nagle's algorithm accumulates output while there is an unacknowledged sent packet; TCP_NODELAY controls whether this is applied.

Regardless of TCP_NODELAY, you are still using a STREAM_SOCKET, which means that N writes can be combined into one. The socket is feeding the device, but simultaneously you are feeding the socket. Once a packet [mbuf, skbuff, ...] has been committed to the device, the socket needs to create a new packet on the next write(). As soon as the device is ready for a new packet, the socket can provide it, but until then the packet serves as a buffer. In buffering mode the write is very fast, since all the necessary data structures are available [as mentioned in comments and other answers].

You can control this buffering by adjusting the SO_SNDBUF and SO_SNDLOWAT socket options. Note, however, that the socket returned by accept does not inherit the buffer sizes of the listening socket. By reducing SO_SNDBUF to 1 you get output like the following:

abcdefghijklmnopqrst 
a bcdefgh ijkl mno pqrst 
a b cdefg hij klm nop qrst 
a b c d e f g h i j k l m n o p q r s t 

The output starts with the default settings, then successively adds TCP_NODELAY, TCP_NOPUSH, SO_SNDBUF (=1), and SO_SNDLOWAT (=1) on the server side for subsequent connections. Each iteration has a flatter time delta than the previous one.
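
Because the accepted socket does not inherit these settings, srv.c below copies them over from the listening socket by hand right after accept(). A stripped-down sketch of just that step (illustrative helper name, error handling omitted):

#include <sys/socket.h>

// Copy the listening socket's send-buffer settings onto the accepted socket,
// since accept() does not hand them down here. This mirrors the sopt()/gopt()
// calls in srv.c below.
void copySendOptions(int listen_fd, int conn_fd) {
    int val = 0;
    socklen_t len = sizeof(val);
    getsockopt(listen_fd, SOL_SOCKET, SO_SNDBUF, &val, &len);
    setsockopt(conn_fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val));
    len = sizeof(val);
    getsockopt(listen_fd, SOL_SOCKET, SO_SNDLOWAT, &val, &len);
    setsockopt(conn_fd, SOL_SOCKET, SO_SNDLOWAT, &val, sizeof(val));
}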

Your mileage will likely vary; this was on MacOS 10.12, and I changed your programs to a C thing with rdtsc() because I have trust issues.

/* srv.c */
// Server side C/C++ program to demonstrate Socket programming
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdbool.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>

#ifndef N
#define N 20
#endif
int nap = 0;
int step = 0;
extern long rdtsc(void);

void xerror(char *f) {
    perror(f);
    exit(1);
}
#define Z(x)   if ((x) == -1) { xerror(#x); }

void sopt(int fd, int opt, int val) {
    Z(setsockopt(fd, SOL_SOCKET, opt, &val, sizeof(val)));
}
int gopt(int fd, int opt) {
    int val;
    socklen_t r = sizeof(val);
    Z(getsockopt(fd, SOL_SOCKET, opt, &val, &r));
    return val;
}

#define POPT(fd, x)  printf("%s %d ", #x, gopt(fd, x))
void popts(char *tag, int fd) {
    printf("%s: ", tag);
    POPT(fd, SO_SNDBUF);
    POPT(fd, SO_SNDLOWAT);
    POPT(fd, TCP_NODELAY);
    POPT(fd, TCP_NOPUSH);
    printf("\n");
}

void stepsock(int fd) {
     switch (step++) {
     case 7:
    step = 2;
     case 6:
         sopt(fd, SO_SNDLOWAT, 1);
     case 5:
         sopt(fd, SO_SNDBUF, 1);
     case 4:
         sopt(fd, TCP_NOPUSH, 1);
     case 3:
         sopt(fd, TCP_NODELAY, 1);
     case 2:
     break;
     }
}

int main(int argc, char const *argv[])
{
    int server_fd, new_socket, valread;
    struct sockaddr_in address;
    int opt = 1;
    int addrlen = sizeof(address);



    // Create socket for TCP server
    server_fd = socket(AF_INET, SOCK_STREAM, 0);

    popts("original", server_fd);
    // Parse command-line options: -s steps the socket options per connection,
    // -n naps between writes, -o applies options directly (t/p/s/l).
    while ((opt = getopt(argc, argv, "sn:o:")) != -1) {
    switch (opt) {
    case 's': step = ! step; break;
    case 'n': nap = strtol(optarg, NULL, 0); break;
    case 'o':
        for (int i = 0; optarg[i]; i++) {
            switch (optarg[i]) {
            case 't': sopt(server_fd, TCP_NODELAY, 1); break;
            case 'p': sopt(server_fd, TCP_NOPUSH, 0); break;
            case 's': sopt(server_fd, SO_SNDBUF, 1); break;
            case 'l': sopt(server_fd, SO_SNDLOWAT, 1); break;
            default:
                exit(1);
            }
        }
    }
    }
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);

    if (bind(server_fd, (struct sockaddr *)&address, sizeof(address)) == -1) {
    xerror("bind");
    }
    popts("ready", server_fd);
    while (1) {
        if (listen(server_fd, 3) == -1) {
        xerror("listen");
        }

        // Accept one client connection
        new_socket = accept(server_fd, (struct sockaddr *)&address, (socklen_t*)&addrlen);
        if (new_socket == -1) {
        xerror("accept");
        }
            popts("accepted: ", new_socket);
        sopt(new_socket, SO_SNDBUF, gopt(server_fd, SO_SNDBUF));
        sopt(new_socket, SO_SNDLOWAT, gopt(server_fd, SO_SNDLOWAT));
        if (step) {
                stepsock(new_socket);
            }
        long tick[21];
        tick[0] = rdtsc();
        // Make N sequential writes to kernel buffer.
        for (int i = 0; i < N; i++) {
                char ch = 'a' + i;

        write(new_socket, &ch, 1);
        tick[i+1] = rdtsc();

        // Doing some blocking work between the writes (here, an optional nap)
        // slows down the writes by a factor of 10.
        if (nap) {
           sleep(nap);
        }
        }
        for (int i = 1; i < N+1; i++) {
        printf("%ld\n", tick[i] - tick[i-1]);
        }
        printf("_\n");

        close(new_socket);
    }
}

clnt.c:

#include <stdio.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>

#ifndef N
#define N 20
#endif
int nap = 0;

int main(int argc, char const *argv[])
{
    int sock, valread;
    struct sockaddr_in address;
    int opt = 1;
    int addrlen = sizeof(address);

    // We'll be passing uint32's back and forth
    unsigned char recv_buffer[1024] = {0};

    // Create socket for TCP server
    sock = socket(AF_INET, SOCK_STREAM, 0);

    // Set TCP_NODELAY so that writes won't be batched
    setsockopt(sock, SOL_SOCKET, TCP_NODELAY, &opt, sizeof(opt));

    while ((opt = getopt(argc,argv,"n:")) != -1) {
        switch (opt) {
        case 'n': nap = strtol(optarg, NULL, 0); break;
        default:
            exit(1);
        }
    }
    opt = 1;
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(8080);

    // Accept one client connection
    if (connect(sock, (struct sockaddr *)&address, (socklen_t)addrlen) != 0) {
        perror("connect failed");
    exit(1);
    }
    if (nap) {
    sleep(nap);
    }
    int loc[N+1];
    int nloc, curloc; 
    for (nloc = curloc = 0; curloc < N; nloc++) {
    int n = read(sock, recv_buffer + curloc, sizeof recv_buffer-curloc);
        if (n <= 0) {
        perror("read");
        break;
    }
    curloc += n;
    loc[nloc] = curloc;
    }
    int last = 0;
    for (int i = 0; i < nloc; i++) {
    int t = loc[i] - last;
    printf("%*.*s ", t, t, recv_buffer + last);
    last = loc[i];
    }
    printf("\n");
    return 0;
}

rdtsc.s:

.globl _rdtsc
_rdtsc:
    rdtsc
    shl $32, %rdx
    or  %rdx,%rax
    ret
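
For completeness, a plausible way to build and run these, assuming a macOS-style toolchain that matches the leading underscore in rdtsc.s (these exact commands are not from the original post):

$ clang srv.c rdtsc.s -o srv
$ clang clnt.c -o clnt
$ ./srv -s        # -s steps the socket options on each new connection
$ ./clnt          # run repeatedly in another terminal to see the groupings change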
