How to load a local file in sc.textFile, instead of HDFS


I'm following the great Spark tutorial.

At 46m:00s I try to load README.md, but it fails. This is what I'm doing:

$ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
bash-4.1# ls README.md
bash-4.1# ./bin/spark-shell
scala> val f = sc.textFile("README.md")
14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)



Try explicitly specifying sc.textFile("file:///path to the file/"). The error occurs when the Hadoop environment is set.
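For the session in the question, that could look roughly like the sketch below. The helper name toFileUri is made up for illustration, and the absolute path assumes README.md sits in the directory the question changed into; sc is the SparkContext that spark-shell provides.

```scala
import java.nio.file.Paths

object LocalPaths {
  // Hypothetical helper: turn a local path (relative or absolute) into an
  // explicit file:// URI, so it is not resolved against fs.defaultFS.
  def toFileUri(path: String): String =
    Paths.get(path).toAbsolutePath.normalize.toUri.toString
}

// Inside spark-shell, where `sc` is already defined:
//   val f = sc.textFile(LocalPaths.toFileUri("README.md"))
// or spell the URI out by hand (path assumed from the question's `cd`):
//   val f = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")
```

With a file:// URI, the file has to be readable at the same path on every worker node; in the single-container setup from the question that is already the case.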

SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.FileSystem.getDefaultUri if the scheme is absent from the path. This method reads the "fs.defaultFS" parameter of the Hadoop conf. If you set the HADOOP_CONF_DIR environment variable, the parameter is usually set to "hdfs://..."; otherwise it is "file:///".
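To make the resolution rule concrete, here is a simplified model of it using only java.net.URI. It is not Hadoop's actual code: the qualify function and its workingDir parameter are assumptions for illustration, with the default filesystem and home directory taken from the error message in the question.

```scala
import java.net.URI

object SchemeResolution {
  // Simplified model (not Hadoop's implementation): a path that already has a
  // scheme is returned unchanged; otherwise it is qualified with the default
  // filesystem's scheme and authority, and a relative path is first placed
  // under the working directory.
  def qualify(path: String, defaultFs: URI, workingDir: String): URI = {
    val uri = new URI(path)
    if (uri.getScheme != null) uri
    else {
      val absolute = if (path.startsWith("/")) path else s"$workingDir/$path"
      new URI(defaultFs.getScheme, defaultFs.getAuthority, absolute, null, null)
    }
  }

  def main(args: Array[String]): Unit = {
    // Default filesystem and home directory as seen in the question's error.
    val hdfsDefault = new URI("hdfs://sandbox:9000")
    // No scheme: resolved against the default filesystem.
    println(qualify("README.md", hdfsDefault, "/user/root"))
    // prints hdfs://sandbox:9000/user/root/README.md
    // Explicit scheme: left alone, so the local file is read.
    println(qualify("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md", hdfsDefault, "/user/root"))
  }
}
```

The first call reproduces the path from the InvalidInputException, which is why the bare "README.md" ends up being looked for in HDFS rather than in the current directory.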

