如何在不同的Flink运算符中访问相同的变量

本文介绍了如何在不同的Flink运算符中访问相同的变量的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个收藏，例如val m = ConcurrentMap()，通常我可以使用以它为参数的方法，并且不同的线程可以调用传递相同m的方法.

I have a collection, e.g. val m = ConcurrentMap(), normally I can use a method taking it as parameter and different threads can call the method passing the same m.

可能是

val s = StreamExecutionEnvironment.getExecutionEnvironment
s.addSource(new MySource(m))
      .map(new MyMap(m))
      .addSink(new MySink(m))

这些参数将被序列化到不同的机器，并且似乎无法由不同的操作员共享.我发现ColocationGroup可能接近解决方案.这样对吗?怎么做?

These params would be serialized to different machines and seems that it could not be shared by different operators. I found that ColocationGroup maybe close to the solution. Is it right? How to do it?

推荐答案

无法在运算符之间甚至对于 same 运算符并行化的子任务之间共享内存中的数据结构，因为操作员的每个实例都可以在单独的JVM中运行.

There's no way to share an in-memory data structure between operators, or even between parallelized sub-tasks for the same operator, as each instance of an operator can be running in a separate JVM.

通常，您会想出如何设计工作流程以避免共享数据的方法，因为这通常会导致并发性和可伸缩性问题.

Normally you'd figure out how to design your workflow to avoid needing to share data, as that will often lead to concurrency and scalability issues.

如果不能使用数据分区来消除每个子任务都能看到所有数据的要求，则可以使用广播流确保操作员的每个子任务获得相同的数据.

You can use broadcast streams to ensure every sub-task of an operator gets the same data, if you can't use partitioning of the data to remove the requirement that every sub-task sees all the data.

最坏的情况是，您为此数据映射使用了一些共享数据存储(Cassandra，HBase等)，但是几乎总是可以通过重新设计工作流来避免这种情况.

Worst case, you use some shared data store (Cassandra, HBase, etc) for this map of data, but almost always you can avoid that by redesigning your workflow.

这篇关于如何在不同的Flink运算符中访问相同的变量的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！