This article looks at a case of unexpectedly high Java BlockingQueue hand-off latency on Linux and how the cause was tracked down; it should be a useful reference for anyone debugging a similar problem.

Problem description

I am using BlockingQueues (trying both ArrayBlockingQueue and LinkedBlockingQueue) to pass objects between different threads in an application I'm currently working on. Performance and latency are relatively important in this application, so I was curious how much time it takes to pass objects between two threads using a BlockingQueue. In order to measure this, I wrote a simple program with two threads (one consumer and one producer), where I let the producer pass a timestamp (taken using System.nanoTime()) to the consumer; see the code below.

I recall reading somewhere on some forum that it took about 10 microseconds for someone else who tried this (I don't know on what OS and hardware that was), so I was not too surprised when it took ~30 microseconds for me on my Windows 7 box (Intel E7500 Core 2 Duo CPU, 2.93GHz), whilst running a lot of other applications in the background. However, I was quite surprised when I did the same test on our much faster Linux server (two Intel X5677 3.46GHz quad-core CPUs, running Debian 5 with kernel 2.6.26-2-amd64). I expected the latency to be lower than on my Windows box, but on the contrary it was much higher: ~75-100 microseconds! Both tests were done with Sun's Hotspot JVM version 1.6.0-23.

Has anyone else done any similar tests with similar results on Linux? Or does anyone know why it is so much slower on Linux (with better hardware)? Could it be that thread switching simply is this much slower on Linux compared to Windows? If that's the case, it seems like Windows is actually much better suited for some kinds of applications. Any help in understanding these relatively high figures is much appreciated.

Edit:

After a comment from DaveC, I also did a test where I restricted the JVM (on the Linux machine) to a single core (i.e. all threads running on the same core). This changed the results dramatically: the latency went down to below 20 microseconds, i.e. better than the results on the Windows machine. I also did some tests where I restricted the producer thread to one core and the consumer thread to another (trying both to have them on the same socket and on different sockets), but this did not seem to help: the latency was still ~75 microseconds. Btw, this test application is pretty much all that is running on the machine while performing the test.
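(The question doesn't say how the pinning was done; on Linux the usual tool is taskset, e.g. running taskset -c 0 java QueueTest confines the whole JVM to core 0, and individual threads can be pinned from native code via sched_setaffinity. The command here is illustrative; the class name matches the test code below.)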

Does anyone know if these results make sense? Should it really be that much slower if the producer and the consumer are running on different cores? Any input is really appreciated.

Edited again (6 January):
I experimented with different changes to the code and running environment:


  1. I upgraded the Linux kernel to 2.6.36.2 (from 2.6.26.2). After the kernel upgrade, the measured time changed to 60 microseconds with very small variation, compared with 75-100 before the upgrade. Setting CPU affinity for the producer and consumer threads had no effect, except when restricting them to the same core. When running on the same core, the measured latency was 13 microseconds.

  2. In the original code, I had the producer go to sleep for 1 second between every iteration, in order to give the consumer enough time to calculate the elapsed time and print it to the console. If I remove the call to Thread.sleep() and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly? My first guess was that the change had the effect that the producer called queue.put() before the consumer called queue.take(), so the consumer never had to block, but after playing around with a modified version of ArrayBlockingQueue I found this guess to be false: the consumer did in fact block. If you have some other guess, please let me know. (Btw, if I let the producer call both Thread.sleep() and barrier.await(), the latency remains at 60 microseconds.) A sketch of this variation follows the list below.

  3. I also tried another approach: instead of calling queue.take(), I called queue.poll() with a timeout of 100 microseconds. This reduced the average latency to below 10 microseconds, but is of course much more CPU intensive (though probably less CPU intensive than busy waiting?). This variation is also sketched below.
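For concreteness, here is a minimal sketch of the variations from points 2 and 3. It assumes the queue and barrier fields of the QueueTest class shown further down; the method names are illustrative, not from the original test.

// Point 2: rendezvous on the (reusable) CyclicBarrier instead of sleeping.
public void produceWithBarrier() throws Exception {
    queue.put(System.nanoTime());
    barrier.await(); // wait until the consumer has printed its result
}

public void consumeWithBarrier() throws Exception {
    long t = queue.take();
    System.out.println("Time: " + ((System.nanoTime() - t) / 1000));
    barrier.await(); // release the producer for the next iteration
}

// Point 3: poll with a timeout instead of a blocking take().
public void consumeWithPoll() throws Exception {
    Long t = null;
    while (t == null) {
        t = queue.poll(100, java.util.concurrent.TimeUnit.MICROSECONDS);
    }
    System.out.println("Time: " + ((System.nanoTime() - t) / 1000));
}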

Edited again (10 January) - Problem solved:
ninjalj suggested that the latency of ~60 microseconds was due to the CPU having to wake up from deeper sleep states, and he was completely right! After disabling C-states in the BIOS, the latency was reduced to <10 microseconds. This also explains why I got so much better latency under point 2 above: when I sent objects more frequently, the CPU was kept busy enough not to go into the deeper sleep states. Many thanks to everyone who has taken the time to read my question and share their thoughts here!
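(For anyone wanting to verify this on their own machine: if the kernel's cpuidle support is enabled, the available C-states can be listed with cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name, and the deeper states can be limited without a BIOS change via a kernel boot parameter such as processor.max_cstate=1; the exact mechanism depends on the kernel version and idle driver.)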

...

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CyclicBarrier;

public class QueueTest {

    // Small queue so producer and consumer hand off almost directly.
    ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(10);
    Thread consumerThread;
    // Makes producer and consumer start each run at the same time.
    CyclicBarrier barrier = new CyclicBarrier(2);
    static final int RUNS = 500000;
    // Producer sleep in ms between iterations; 0 during the warm-up run.
    volatile int sleep = 1000;

    public void start() {
        consumerThread = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    barrier.await();
                    for(int i = 0; i < RUNS; i++) {
                        consume();

                    }
                } catch (Exception e) {
                    e.printStackTrace();
                } 
            }
        });
        consumerThread.start();

        try {
            barrier.await();
        } catch (Exception e) { e.printStackTrace(); }

        for(int i = 0; i < RUNS; i++) {
            try {
                if(sleep > 0)
                    Thread.sleep(sleep);
                produce();

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public void produce() {
        try {
            // Hand the consumer a timestamp taken just before the put.
            queue.put(System.nanoTime());
        } catch (InterruptedException e) {
        }
    }

    public void consume() {
        try {
            long t = queue.take();
            long now = System.nanoTime();
            long time = (now - t) / 1000; // Divide by 1000 to get result in microseconds
            // Only print during the measured run; the warm-up run (sleep == 0) stays silent.
            if(sleep > 0) {
                System.out.println("Time: " + time);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        QueueTest test = new QueueTest();
        System.out.println("Starting...");
        // Run first once, ignoring results
        test.sleep = 0;
        test.start();
        // Run again, printing the results
        System.out.println("Starting again...");
        test.sleep = 1000;
        test.start();
    }
}


Recommended answer

Your test is not a good measure of queue handoff latency because you have a single thread reading off the queue which writes synchronously to System.out (doing a String and long concatenation while it is at it) before it takes again. To measure this properly you need to move this activity out of this thread and do as little work as possible in the taking thread.

You'd be better off just doing the calculation (now minus then) in the taker and adding the result to some other collection which is periodically drained by another thread that outputs the results. I tend to do this by adding to an appropriately presized array-backed structure accessed via an AtomicReference (hence the reporting thread just has to getAndSet on that reference with another instance of that storage structure in order to grab the latest batch of results; e.g. make 2 lists, set one as active, and every x seconds a thread wakes up and swaps the active and the passive ones). You can then report some distribution instead of every single result (e.g. a decile range), which means you don't generate vast log files with every run and still get useful information printed for you.
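A minimal sketch of that double-buffering idea; all names here are illustrative, not from the answer:

import java.util.concurrent.atomic.AtomicReference;

class LatencyRecorder {

    // Presized, array-backed batch; count is written only by the measuring thread.
    static final class Batch {
        final long[] values = new long[1000000];
        int count;
    }

    private final AtomicReference<Batch> active =
            new AtomicReference<Batch>(new Batch());

    // Hot path, called from the taking thread: just an array store.
    void record(long latencyNanos) {
        Batch b = active.get();
        if (b.count < b.values.length) {
            b.values[b.count++] = latencyNanos;
        }
    }

    // Called periodically by a reporting thread: swap in a fresh batch and
    // summarize the old one instead of logging every sample. (Samples recorded
    // exactly during the swap may be lost; acceptable for rough reporting.)
    void report() {
        Batch b = active.getAndSet(new Batch());
        long min = Long.MAX_VALUE, max = 0;
        for (int i = 0; i < b.count; i++) {
            if (b.values[i] < min) min = b.values[i];
            if (b.values[i] > max) max = b.values[i];
        }
        if (b.count > 0) {
            System.out.println(b.count + " samples, min=" + min + "ns, max=" + max + "ns");
        }
    }
}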

FWIW I concur with the times Peter Lawrey stated, and if latency is really critical then you need to think about busy waiting with appropriate CPU affinity (i.e. dedicating a core to that thread).
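As a sketch, a busy-waiting version of the test's consume() could spin on the non-blocking poll() (again reusing the queue field of QueueTest; this is not code from the answer):

// Spins on poll() instead of blocking in take(). This pins one core at
// 100% CPU, which is the price of the latency; pair it with affinity so
// the spinning thread owns its core.
public void consumeBusy() {
    Long t;
    while ((t = queue.poll()) == null) {
        // spin
    }
    System.out.println("Time: " + ((System.nanoTime() - t) / 1000));
}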

EDIT after Jan 6

You're looking at the difference between java.util.concurrent.locks.LockSupport#park (and the corresponding unpark) and Thread#sleep. Most j.u.c. stuff is built on LockSupport (often via AbstractQueuedSynchronizer, which ReentrantLock provides an implementation of, or directly), and this (in Hotspot) resolves down to sun.misc.Unsafe#park (and unpark), which tends to end up in the hands of the pthread (POSIX threads) lib. Typically that means pthread_cond_broadcast to wake up, and pthread_cond_wait or pthread_cond_timedwait for things like BlockingQueue#take.
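Not from the answer, but a tiny self-contained demo of that primitive; timing a single park/unpark hand-off directly should show a similar wake-up cost on a machine where deep C-states are enabled:

import java.util.concurrent.locks.LockSupport;

public class ParkDemo {

    static volatile long stamp;

    public static void main(String[] args) throws Exception {
        final Thread waiter = Thread.currentThread();
        Thread waker = new Thread(new Runnable() {
            public void run() {
                try {
                    Thread.sleep(1000); // give the CPU time to go idle
                } catch (InterruptedException e) {
                }
                stamp = System.nanoTime();
                LockSupport.unpark(waiter);
            }
        });
        waker.start();
        LockSupport.park(); // blocks like queue.take() would; may return spuriously
        System.out.println("wakeup took " + (System.nanoTime() - stamp) / 1000 + " us");
        waker.join();
    }
}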

I can't say I've ever looked at how Thread#sleep is actually implemented (because I've never come across anything low-latency that isn't a condition-based wait), but I would imagine that it causes the thread to be demoted by the scheduler in a more aggressive way than the pthread signalling mechanism does, and that is what accounts for the latency difference.
