This article covers the question "What is an efficient way to read data from a large binary file?" and a recommended answer.

Problem description

I need to handle tens of gigabytes of data in one binary file. Each record in the data file is variable length.

So the file is like:

<len1><data1><len2><data2>..........<lenN><dataN>

The data contains integers, pointers, double values, and so on.

I found Python can not even handle this situation. There is no problem if I read the whole file into memory; that is fast. But the struct package does not seem to perform well: it almost gets stuck unpacking the bytes.
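For concreteness, a minimal sketch of the kind of loop being described, assuming each record is prefixed by a 4-byte little-endian length (the actual prefix width and record layout are not stated in the question):

import struct

# Sketch only: the '<I' (4-byte little-endian) length prefix is an assumption.
def read_records(path):
    with open(path, 'rb') as f:
        buf = f.read()                       # whole file in memory, as described
    records = []
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from('<I', buf, pos)   # one struct call per record
        pos += 4
        records.append(buf[pos:pos + length])            # raw bytes of one record
        pos += length
    return records

The per-record struct calls in a loop like this are where the time goes when the file holds many millions of records.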

Any help would be appreciated. Thanks.

Recommended answer

struct and array, which other answers recommend, are fine for the details of the implementation, and might be all you need if your needs are always to read all of the file (or a prefix of it) sequentially. Other options include buffer, mmap, even ctypes, depending on many details you don't mention regarding your exact needs. Maybe a small, specialized Cython-coded helper can offer all the extra performance you need, if no suitable and accessible library (in C, C++, Fortran, ...) already exists that can be interfaced with for the purpose of handling this humongous file as you need.
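As an illustration of the mmap route mentioned above, here is a rough sketch that lets the OS page the file in on demand and decodes fields in place with struct.unpack_from; the 4-byte length prefix and the '<qd' record layout (one 64-bit integer plus one double) are assumptions, not something given in the question:

import mmap
import struct

def iter_records(path):
    # Map the file read-only; pages are brought in lazily by the OS.
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = 0
            while pos < len(mm):
                (length,) = struct.unpack_from('<I', mm, pos)  # assumed length prefix
                pos += 4
                # Decode directly from the mapping, no intermediate copy;
                # '<qd' is an assumed record layout, not the real one.
                yield struct.unpack_from('<qd', mm, pos)
                pos += length

With a generator like this the file is processed one record at a time, so memory use stays flat regardless of the file size.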

But clearly there are peculiar issues here -- how can a data file contain pointers, for example, which are intrinsically a concept related to addressing memory? Are they maybe "offsets" instead, and, if so, how exactly are they based and encoded? Are your needs any more advanced than simple sequential reading (e.g., random access), and, if so, can you do a first "indexing" pass to get all the offsets from the start of the file to the start of each record into a more usable, compact, handily formatted auxiliary file? (That binary file of offsets would be a natural fit for array -- unless the offsets need to be longer than array supports on your machine!) What is the distribution of record lengths and compositions, and how many records make up the "tens of gigabytes"? Etc., etc.
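In the same spirit, a rough sketch of the "indexing" pass suggested above, collecting record start offsets into an array and saving them to an auxiliary file (the 'q' typecode, signed 64-bit, and the 4-byte length prefix are assumptions):

import struct
from array import array

def build_index(data_path, index_path):
    offsets = array('q')                 # 64-bit offsets, assuming 'q' is supported
    with open(data_path, 'rb') as f:
        pos = 0
        while True:
            header = f.read(4)           # assumed 4-byte length prefix
            if not header:
                break
            offsets.append(pos)          # byte offset where this record starts
            (length,) = struct.unpack('<I', header)
            f.seek(length, 1)            # skip the record body
            pos += 4 + length
    # Persist the index as a compact binary file for later random access.
    with open(index_path, 'wb') as out:
        offsets.tofile(out)
    return offsets

With such an index, reading record i later is just a seek to offsets[i] instead of a scan from the start of the file.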

You have a very large scale problem (and no doubt very large scale hardware to support it, since you mention that you can easily read all of the file into memory; that implies a 64-bit box with many tens of GB of RAM -- wow!), so it's well worth the detailed care needed to optimize the handling thereof -- but we can't help much with such detailed care unless we know enough details to do so!-).

