I am running a Spark cluster over C++ code wrapped in Python. I am currently testing different multi-threading configurations (at the Python level or the Spark level).

I am using Spark with standalone binaries, over an HDFS 2.5.4 cluster. The cluster currently has 10 slaves, with 4 cores each.

From what I can see, by default, Spark launches 4 slaves per node (I have 4 Python processes working on a slave node at a time).

How can I limit this number? I can see that there is a --total-executor-cores option for spark-submit, but there is little documentation on how it affects the distribution of executors over the cluster.

I will run tests to get a clear idea, but if someone knowledgeable knows what this option does, it would help.

Update:

I went through the Spark documentation again, and here is what I understand:

- By default, I have one executor per worker node (here 10 worker nodes, hence 10 executors).
- However, each worker can run several tasks in parallel. In standalone mode, the default behavior is to use all available cores, which explains why I observe 4 Python processes.

To limit the number of cores used per worker, and so limit the number of parallel tasks, I have at least 3 options:

- use --total-executor-cores with spark-submit (least satisfactory, since there is no indication of how the pool of cores is handled)
- use SPARK_WORKER_CORES in the configuration file
- use the -c option with the starting scripts

The documentation at http://spark.apache.org/docs/latest/spark-standalone.html helped me figure out what is going on.

What is still unclear to me is why, in my case, it is better to limit the number of parallel tasks per worker node to 1 and rely on the multithreading of my legacy C++ code. I will update this post with experiment results when I finish my study.
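One way to reason about the last point in the question is to count threads per node. The sketch below is only arithmetic, not a measurement. It assumes that a worker runs as many tasks at once as it offers cores, divided by spark.task.cpus (default 1), and that each task's C++ code starts a fixed number of threads. The function name per_node_load and the parameter threads_per_task are hypothetical and introduced only for illustration.

```python
def per_node_load(worker_cores, threads_per_task, task_cpus=1):
    """Return (concurrent_tasks, total_threads) for one worker node.

    worker_cores     -- cores the worker offers (e.g. SPARK_WORKER_CORES)
    threads_per_task -- hypothetical: threads the wrapped C++ code starts per task
    task_cpus        -- cores reserved per task (spark.task.cpus, default 1)
    """
    concurrent_tasks = worker_cores // task_cpus
    return concurrent_tasks, concurrent_tasks * threads_per_task


# 4 cores offered, C++ code also uses 4 threads: 4 tasks, 16 threads on 4 cores.
print(per_node_load(worker_cores=4, threads_per_task=4))  # (4, 16)
# 1 core offered, C++ code uses 4 threads: 1 task, 4 threads on 4 physical cores.
print(per_node_load(worker_cores=1, threads_per_task=4))  # (1, 4)
```

Under these assumptions, the default setting oversubscribes each 4-core node when the C++ code is itself multithreaded. That may be part of why one task per worker behaves better here, although only the experiments can confirm it.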
Solution

The documentation does not seem clear.

In my experience, the most common way to allocate resources is to specify the number of executors and the number of cores per executor, for example:

    $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster \
        --num-executors 10 \
        --driver-memory 4g \
        --executor-memory 2g \
        --executor-cores 4 \
        --queue thequeue \
        lib/spark-examples*.jar \
        10

However, as far as I know, this approach is limited to YARN and does not apply to standalone or Mesos-based Spark.

Instead, you can use the --total-executor-cores parameter, which sets the total number of cores, summed over all executors, assigned to the Spark job. In your case, with 40 cores in total, setting --total-executor-cores 40 would use all the available resources.

Unfortunately, I do not know how Spark distributes the workload when fewer cores than the total available are requested. When two or more jobs run simultaneously, however, this should be transparent to the user: Spark (or whichever resource manager is in use) allocates the resources according to the user's settings.
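On how a smaller --total-executor-cores value might be distributed: the standalone documentation describes a spark.deploy.spreadOut setting (default true) that tells the master either to spread an application over as many nodes as possible or to consolidate it onto as few as possible. The sketch below is a simplified model of that idea, not Spark's actual scheduler code. The function allocate_cores is hypothetical, and the model ignores memory limits and per-executor core settings.

```python
def allocate_cores(free_cores, total_executor_cores, spread_out=True):
    """Toy model of how a core budget could be split across workers.

    free_cores           -- list with the number of free cores on each worker
    total_executor_cores -- the budget passed as --total-executor-cores
    spread_out           -- mimics spark.deploy.spreadOut (assumed semantics)
    """
    assigned = [0] * len(free_cores)
    # Never hand out more cores than the cluster actually has free.
    remaining = min(total_executor_cores, sum(free_cores))
    if spread_out:
        # Round-robin: give one core at a time to each worker with capacity.
        i = 0
        while remaining > 0:
            if assigned[i] < free_cores[i]:
                assigned[i] += 1
                remaining -= 1
            i = (i + 1) % len(free_cores)
    else:
        # Consolidate: fill each worker completely before moving on.
        for i, free in enumerate(free_cores):
            take = min(free, remaining)
            assigned[i] = take
            remaining -= take
            if remaining == 0:
                break
    return assigned


cluster = [4] * 10  # 10 workers with 4 cores each, as in the question
print(allocate_cores(cluster, 10))                    # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(allocate_cores(cluster, 10, spread_out=False))  # [4, 4, 2, 0, 0, 0, 0, 0, 0, 0]
```

If this model matches the real behavior, --total-executor-cores 10 on the cluster from the question would give one core (and so one parallel task) per worker with the default setting. That is worth checking against the experiments mentioned in the question.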