Problem description
I am writing a small program to process a big text file and do some replacements. The thing is that it never stops allocating new memory, so in the end it runs out of memory. I have reduced it to a simple program that simply counts the number of lines (see the code below) while still allocating more and more memory. I must admit that I know little about boost and boost spirit in particular. Could you please tell me what I am doing wrong? Thanks a million!
#include <string>
#include <iostream>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/bind.hpp>
#include <boost/ref.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>

// Token ids
enum token_ids {
    ID_EOL = 100
};

// Token definition
template <typename Lexer>
struct var_replace_tokens : boost::spirit::lex::lexer<Lexer> {
    var_replace_tokens() {
        this->self.add("\n", ID_EOL); // newline characters
    }
};

// Functor
struct replacer {
    typedef bool result_type;

    template <typename Token>
    bool operator()(Token const& t, std::size_t& lines) const {
        switch (t.id()) {
            case ID_EOL:
                lines++;
                break;
        }
        return true;
    }
};

int main(int argc, char** argv) {
    std::size_t lines = 0;
    var_replace_tokens<boost::spirit::lex::lexertl::lexer<
        boost::spirit::lex::lexertl::token<boost::spirit::istream_iterator> > > var_replace_functor;
    std::cin.unsetf(std::ios::skipws);
    boost::spirit::istream_iterator first(std::cin);
    boost::spirit::istream_iterator last;
    bool r = boost::spirit::lex::tokenize(first, last, var_replace_functor,
                                          boost::bind(replacer(), _1, boost::ref(lines)));
    if (r) {
        std::cerr << "Lines processed: " << lines << std::endl;
    } else {
        std::string rest(first, last);
        std::cerr << "Processing failed at: " << rest << " (line " << lines << ")" << std::endl;
    }
}
Answer
This behavior is by design.
You: "As far as I know, istream_iterator takes care of reading the input stream without having to store the whole stream into memory."
Yes. But you're not using std::istream_iterator. You're using Boost Spirit, which is a parser generator. Parsers need random access for backtracking.
Spirit supports input iterators by adapting the input sequence to a random-access sequence with the multi_pass adaptor. This iterator adaptor stores a variable-size buffer¹ for backtracking purposes. Certain constructs (expectation points, always-greedy operators like the Kleene star, etc.) tell the parser framework when it is safe to flush the buffer.
You're not parsing, just tokenizing. Nothing ever tells the iterator to flush its buffers.
The buffer is unbounded, so memory usage grows. Of course it is not a leak: as soon as the last copy of a multi_pass-adapted iterator goes out of scope, the shared backtracking buffer is freed.
The simplest solution is to use a random-access source. If you can, use a memory-mapped file.
Other solutions would involve telling the multi_pass adaptor to flush. The simplest way to achieve this is to use tokenize_and_parse. Even with a faux grammar like *(any_token), this should be enough to convince the parser framework that you will not be asking it to backtrack.
This answer deals with parsing multi-GiB file streams, comparing performance with tools like wc -l.
¹ http://www.boost.org/doc/libs/1_62_0/libs/spirit/doc/html/spirit/support/multi_pass.html By default it stores a shared deque. You can see this after running the test program for a little while under dd if=/dev/zero bs=1M | valgrind --tool=massif ./sotest:
This clearly shows:
100.00% (805,385,576B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.99% (805,306,368B) 0x4187D5: void boost::spirit::iterator_policies::split_std_deque::unique<char>::increment<boost::spirit::multi_pass<std::istream, boost::spirit::iterator_policies::default_policy<boost::spirit::iterator_policies::ref_counted, boost::spirit::iterator_policies::no_check, boost::spirit::iterator_policies::istream, boost::spirit::iterator_policies::split_std_deque> > >(boost::spirit::multi_pass<std::istream, boost::spirit::iterator_policies::default_policy<boost::spirit::iterator_policies::ref_counted, boost::spirit::iterator_policies::no_check, boost::spirit::iterator_policies::istream, boost::spirit::iterator_policies::split_std_deque> >&) (in /home/sehe/Projects/stackoverflow/sotest)
| ->99.99% (805,306,368B) 0x404BC3: main (in /home/sehe/Projects/stackoverflow/sotest)