Using Java 8 parallelStream inside Spark mapPartitions

Problem Description

I am trying to understand the behavior of Java 8 parallel streams inside Spark parallelism. When I run the code below, I expect the output size of listOfThings to be the same as the input size. But that's not the case; I sometimes have missing items in my output, and the behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine and the count matches every time.

// listRDD.count = 10
JavaRDD test = listRDD.mapPartitions(iterator -> {
    List listOfThings = IteratorUtils.toList(iterator);
    return listOfThings.parallelStream().map(
        //some stuff here
    ).collect(Collectors.toList());
});
// test.count = 9
// test.count = 10
// test.count = 8
// test.count = 7


Recommended Answer


  1. It's a very good question.
  2. What's going on here is a race condition. When you parallelize the stream, it splits the full list into several roughly equal chunks (based on the available threads and the size of the list) and then tries to run each chunk independently on its own thread.

But you are also using Apache Spark, a general-purpose computation engine known for computing work quickly. Spark uses the same approach (parallelizing the work) to perform the action.

What is happening in this scenario is that Spark has already parallelized the whole job, and inside each task you are parallelizing the work again. That is where the race condition starts: the Spark executor begins processing the partition, your code hands the work to a parallel stream, and the stream acquires other threads to do the processing. If the threads running the stream's work finish before the Spark executor completes its task, their results get added; otherwise the Spark executor goes ahead and reports its result to the master, and items come up missing.
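The original post elides the mapping code ("some stuff here"), but a common way items go missing from a parallel stream, with or without Spark, is side-effecting into a non-thread-safe collection instead of using a collector. A minimal, Spark-free sketch of that race (the unsafeMap/safeMap names and the doubling work are illustrative, not from the post):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelStreamSafety {
    // Unsafe: forEach on a parallel stream mutates a plain ArrayList from many
    // threads at once; adds can be lost (or the list corrupted), so the output
    // size may come out smaller than the input size, non-deterministically.
    public static List<Integer> unsafeMap(List<Integer> input) {
        List<Integer> out = new ArrayList<>();
        input.parallelStream().map(x -> x * 2).forEach(out::add); // race condition
        return out;
    }

    // Safe: Collectors.toList() gives each worker thread its own partial list
    // and merges them at the end, so no elements are dropped and encounter
    // order is preserved.
    public static List<Integer> safeMap(List<Integer> input) {
        return input.parallelStream().map(x -> x * 2).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input =
            IntStream.rangeClosed(1, 100_000).boxed().collect(Collectors.toList());
        System.out.println("safe size: " + safeMap(input).size()); // always 100000
        try {
            // Size varies run to run; it is often less than 100000.
            System.out.println("unsafe size: " + unsafeMap(input).size());
        } catch (ArrayIndexOutOfBoundsException e) {
            // Another possible outcome of the race: ArrayList's internal
            // resize logic is not thread-safe and can throw.
            System.out.println("unsafe variant crashed: " + e);
        }
    }
}
```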


  3. It is not a good approach to parallelize the work yourself; it will always cause you pain. Let Spark do it for you.
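If you nevertheless need CPU-bound intra-task parallelism, a commonly used workaround (not part of the answer above, and relying on the well-known but not officially specified behavior that a parallel stream runs in the ForkJoinPool that invokes its terminal operation) is to run the stream in a dedicated pool instead of the JVM-wide common pool, so concurrent tasks on the same executor do not contend for the same worker threads. A hedged sketch; mapInPool and the doubling work are illustrative:

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class IsolatedParallelStream {
    // Runs the parallel-stream work inside a dedicated ForkJoinPool rather
    // than the shared common pool. submit(...).join() blocks until the whole
    // mapped list is assembled, so the caller always gets a complete result.
    public static List<Integer> mapInPool(List<Integer> input, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.submit(() ->
                input.parallelStream()
                     .map(x -> x * 2) // placeholder for the real per-element work
                     .collect(Collectors.toList())
            ).join();
        } finally {
            pool.shutdown(); // always release the pool's threads
        }
    }

    public static void main(String[] args) {
        List<Integer> input =
            IntStream.rangeClosed(1, 1000).boxed().collect(Collectors.toList());
        // Output size always matches input size: collect() merges the
        // per-thread partial lists before join() returns.
        System.out.println(mapInPool(input, 4).size()); // 1000
    }
}
```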

Hope you understand what's going on here.

Thanks.
