问题描述
我有一个收藏,例如val m = ConcurrentMap()
,通常我可以使用以它为参数的方法,并且不同的线程可以调用传递相同m
的方法.
I have a collection, e.g. val m = ConcurrentMap()
, normally I can use a method taking it as parameter and different threads can call the method passing the same m
.
可能是
val s = StreamExecutionEnvironment.getExecutionEnvironment
s.addSource(new MySource(m))
.map(new MyMap(m))
.addSink(new MySink(m))
这些参数将被序列化到不同的机器,并且似乎无法由不同的操作员共享.我发现ColocationGroup
可能接近解决方案.这样对吗?怎么做?
These params would be serialized to different machines and seems that it could not be shared by different operators. I found that ColocationGroup
maybe close to the solution. Is it right? How to do it?
推荐答案
无法在运算符之间甚至对于 same 运算符并行化的子任务之间共享内存中的数据结构,因为操作员的每个实例都可以在单独的JVM中运行.
There's no way to share an in-memory data structure between operators, or even between parallelized sub-tasks for the same operator, as each instance of an operator can be running in a separate JVM.
通常,您会想出如何设计工作流程以避免共享数据的方法,因为这通常会导致并发性和可伸缩性问题.
Normally you'd figure out how to design your workflow to avoid needing to share data, as that will often lead to concurrency and scalability issues.
如果不能使用数据分区来消除每个子任务都能看到所有数据的要求,则可以使用广播流确保操作员的每个子任务获得相同的数据.
You can use broadcast streams to ensure every sub-task of an operator gets the same data, if you can't use partitioning of the data to remove the requirement that every sub-task sees all the data.
最坏的情况是,您为此数据映射使用了一些共享数据存储(Cassandra,HBase等),但是几乎总是可以通过重新设计工作流来避免这种情况.
Worst case, you use some shared data store (Cassandra, HBase, etc) for this map of data, but almost always you can avoid that by redesigning your workflow.
这篇关于如何在不同的Flink运算符中访问相同的变量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!