When I add the spark-hive dependency to the spark-shell (spark-submit is affected as well) by using `spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5` in spark-defaults.conf, I get this error:

```
20/02/26 11:20:45 ERROR spark.SparkContext: Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
	at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
	at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
	at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
	at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
	at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
```

This happens because spark-shell cannot handle classifiers together with bundle dependencies, see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416.

A workaround for the classifier problem looks like this:

```
$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar
```

but DevOps won't accept this.

The complete list of dependencies looks like this (I have added line breaks for better readability):

```
root@a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307
```

(Everything works - except for Hive.)
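For reference, the classifier mismatch can be seen directly in the Ivy cache. The following is only an illustrative sketch: it assumes a Spark 2.4.5 installation on the PATH and the default cache location under ~/.ivy2/jars, neither of which is spelled out above.

```bash
# Trigger the same resolution from the command line instead of spark-defaults.conf
# (spark-submit --packages behaves the same way):
spark-shell --packages org.apache.spark:spark-hive_2.11:2.4.5

# Ivy downloads avro-mapred with the "hadoop2" classifier, while SparkContext
# looks for the unclassified file name and fails with FileNotFoundException:
ls ~/.ivy2/jars/ | grep avro-mapred
# org.apache.avro_avro-mapred-1.8.2-hadoop2.jar   (illustrative output)
```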
Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How? How do you combine Spark 2.4.5 with user-provided Hadoop and Hadoop 2.9 or 2.10? Is it necessary to build Spark to get around the Hive dependency problem?

Answer:

There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.

As my task actually was to minimize dependency problems, I chose to compile Spark 2.4.5 against Hadoop 2.10.0:

```
./dev/make-distribution.sh \
  --name hadoop-2.10.0 \
  --tgz \
  -Phadoop-2.7 -Dhadoop.version=2.10.0 \
  -Phive -Phive-thriftserver \
  -Pyarn
```

Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used (a short usage sketch is appended below).

In my personal opinion, compiling Spark is actually easier than configuring Spark with user-provided Hadoop.

Integration tests so far have not shown any problems; Spark can access both HDFS and S3 (MinIO). Illustrative MinIO settings are shown at the end of this answer.

Update 2021-04-08: If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments.
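A possible usage sketch, not part of the original answer: the archive name below is derived from the --name flag and the Spark version, so adjust it to whatever make-distribution.sh actually produced in your build.

```bash
# Unpack the freshly built distribution (archive name assumed from --name/--tgz):
tar -xzf spark-2.4.5-bin-hadoop-2.10.0.tgz -C /opt
export SPARK_HOME=/opt/spark-2.4.5-bin-hadoop-2.10.0

# Hive support is now compiled in, so org.apache.spark:spark-hive_2.11:2.4.5 can be
# dropped from spark.jars.packages; the remaining packages resolve without classifiers.
"$SPARK_HOME/bin/spark-shell"
```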
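For the S3 (MinIO) access mentioned above, a typical set of S3A options looks like the following. The endpoint and credentials are placeholders rather than values from the original post, and hadoop-aws (already listed in spark.jars.packages) must be on the classpath.

```bash
# Illustrative S3A configuration for a MinIO endpoint (placeholder values):
"$SPARK_HOME/bin/spark-shell" \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \
  --conf spark.hadoop.fs.s3a.access.key=ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY \
  --conf spark.hadoop.fs.s3a.path.style.access=true
```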