This article covers Apache Spark on HDFS: how to read 10k-100k small files at once. The question and the recommended answer below should be a useful reference for anyone facing the same problem.

Problem description

I could have up to 100 thousand small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:

// return a list of paths to small files
List<String> paths = getAllPaths();
// read up to 100000 small files at once into memory
sparkSession
    .read()
    .parquet(paths.toArray(new String[0])) // DataFrameReader.parquet takes String varargs, not a List
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);

Problem

The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that many files: it takes 38 seconds to read 490 small files and 266 seconds to read 3,420 files, so I suppose reading 100,000 files would take a very long time.

Would HAR or sequence files speed up batch reading of 10k-100k small files with Apache Spark? Why?

Would HAR or sequence files slow down persisting those small files? Why?

Batch read is the only operation required for these small files; I don't need to read them by id or anything else.
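For context on the "sequence files" option mentioned above: a Hadoop SequenceFile can pack many small files into one large HDFS file that is then read in a single pass. Below is a hedged Java sketch of reading such a packed file; the layout (original file name as a Text key, raw bytes as a BytesWritable value) and the path are assumptions for illustration only, not part of the original question.

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PackedSequenceFileSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("packed-sequence-file-sketch")
                               .setMaster("local[*]")); // local master only for a quick test

        // Hypothetical packing: key = original file name (Text),
        // value = raw file bytes (BytesWritable), all stored in one SequenceFile.
        JavaPairRDD<Text, BytesWritable> packed = jsc.sequenceFile(
                "hdfs:///data/small-files.seq", Text.class, BytesWritable.class);

        // Hadoop reuses Writable instances, so copy the data out
        // before caching, shuffling, or collecting.
        JavaPairRDD<String, byte[]> files = packed.mapToPair(
                kv -> new Tuple2<>(kv._1.toString(), kv._2.copyBytes()));

        System.out.println("small files unpacked: " + files.count());
        jsc.stop();
    }
}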

Recommended answer

From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?


Confirmation in the Spark 1.6.3 Java API documentation for SparkContext:
http://spark.apache.org/docs/1.6.3/api/java/index.html


Confirmation in the source code comments (branch 1.6) for class WholeTextFileInputFormat:
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala


For the record, Hadoop CombineInputFormat is the standard way to stuff multiple small files into a single Mapper; it can be used in Hive with the properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.

Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks (see the sketch below):
(a) you have to consume a whole directory and cannot filter files by name before loading them (you can only filter after loading);
(b) you have to post-process the RDD by splitting each file into multiple records, if required.
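Purely as an illustration of that route, here is a hedged Java sketch of wholeTextFiles(); the directory, the minPartitions value, and the per-file size computation standing in for drawback (b) are assumptions, and it only applies to plain-text content (the question's files are Parquet, so this shows the loading pattern rather than a drop-in fix).

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WholeTextFilesSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("whole-text-files-sketch")
                               .setMaster("local[*]")); // local master only for a quick test

        // Drawback (a): the whole directory is consumed; there is no per-file
        // name filter at load time. Each element is a (file path, file content) pair.
        JavaPairRDD<String, String> files =
                jsc.wholeTextFiles("hdfs:///data/small-files", 16 /* minPartitions, illustrative */);

        // Drawback (b): post-process each file's content into records if needed;
        // here we only compute per-file sizes as a stand-in for real parsing.
        JavaPairRDD<String, Integer> sizes = files.mapValues(String::length);

        System.out.println("files read: " + sizes.count());
        jsc.stop();
    }
}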

That nonetheless seems to be a viable solution; cf. that post: Spark partitioning/cluster enforcing


Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat; cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)
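As a hedged sketch of that idea (not the linked post's exact solution): Hadoop's CombineTextInputFormat packs many small text files into a few splits, driven by the standard mapreduce.input.fileinputformat.split.maxsize setting. The path and the 128 MB cap below are illustrative, and the question's Parquet files would need a CombineFileInputFormat-based custom reader instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CombineInputFormatSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("combine-input-format-sketch")
                               .setMaster("local[*]")); // local master only for a quick test

        Configuration conf = new Configuration();
        // Standard Hadoop knob: cap each combined split at ~128 MB so that
        // thousands of small files end up in a handful of splits (size is illustrative).
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);

        // CombineTextInputFormat groups many small *text* files into few splits.
        JavaPairRDD<LongWritable, Text> lines = jsc.newAPIHadoopFile(
                "hdfs:///data/small-files",   // hypothetical input directory
                CombineTextInputFormat.class,
                LongWritable.class,
                Text.class,
                conf);

        System.out.println("lines read: " + lines.count());
        jsc.stop();
    }
}

With splits capped near the HDFS block size, the number of tasks is driven by total input size rather than by the file count, which is what makes this family of approaches attractive for 10k-100k small files.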

That concludes this article on Apache Spark on HDFS: reading 10k-100k small files at once. Hopefully the recommended answer above is helpful.
