


I am writing a small program that processes a big text file and does some replacements. The problem is that it never stops allocating memory, so it eventually runs out. I have reduced it to a simple program that only counts the number of lines (see the code below), and it still allocates more and more memory. I must admit that I know little about Boost, and Boost Spirit in particular. Could you please tell me what I am doing wrong? Thanks a million!

#include <cstddef>
#include <string>
#include <iostream>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/bind.hpp>
#include <boost/ref.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>

// Token ids
enum token_ids {
    ID_EOL = 100
};

// Token definition
template <typename Lexer>
struct var_replace_tokens : boost::spirit::lex::lexer<Lexer> {
    var_replace_tokens() {
        this->self.add("\n", ID_EOL); // newline characters
    }
};

// Functor called for every token; returning true continues tokenizing
struct replacer {
    typedef bool result_type;
    template <typename Token>
    bool operator()(Token const& t, std::size_t& lines) const {
        switch (t.id()) {
        case ID_EOL:
            ++lines; // count each newline token
            break;
        default:
            break;
        }
        return true;
    }
};

int main() {
    std::size_t lines = 0;

    var_replace_tokens< boost::spirit::lex::lexertl::lexer< boost::spirit::lex::lexertl::token< boost::spirit::istream_iterator> > > var_replace_functor;

    // Do not skip whitespace, otherwise the lexer never sees the newlines
    std::cin.unsetf(std::ios::skipws);

    boost::spirit::istream_iterator first(std::cin);
    boost::spirit::istream_iterator last;

    bool r = boost::spirit::lex::tokenize(first, last, var_replace_functor, boost::bind(replacer(), _1, boost::ref(lines)));

    if (r) {
        std::cerr << "Lines processed: " << lines << std::endl;
    } else {
        std::string rest(first, last);
        std::cerr << "Processing failed at: " << rest << " (line " << lines << ")" << std::endl;
    }
}



You: As far as I know, istream_iterator takes care of reading the input stream without having to store the whole stream in memory

Yes. But you're not using std::istream_iterator. You're using Boost Spirit, which is a parser generator. Parsers need random access for backtracking.

Spirit supports input iterators by wrapping the input sequence in the multi_pass adaptor, which makes a single-pass input sequence usable as a multi-pass (forward) sequence. This iterator adaptor stores a variable-size buffer¹ for backtracking purposes. Certain actions (expectation points, always-greedy operators like Kleene-*, etc.) tell the parser framework when it's safe to flush the buffer.


You're not parsing, just tokenizing. Nothing ever tells the iterator to flush its buffers.


The buffer is unbounded, so memory usage grows. Of course it's not a leak because as soon as the last copy of a multi-pass adapted iterator goes out of scope, the shared backtracking buffer is freed.


The simplest solution is to use a random access source. If you can, use a memory mapped file.
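A minimal sketch of the memory-mapped approach, assuming a POSIX system and a regular file (it will not work on a pipe). The helper name count_lines_mmap is hypothetical. It counts newlines directly instead of running the lexer, just to show that the mapping gives you a plain const char* range, which is a random-access source that needs no backtracking buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: count '\n' bytes in a file by mapping it read-only.
std::size_t count_lines_mmap(const char* path) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error(std::string("cannot open ") + path);

    struct stat st;
    if (::fstat(fd, &st) != 0) {
        ::close(fd);
        throw std::runtime_error("fstat failed");
    }
    if (st.st_size == 0) { // mmap rejects zero-length mappings
        ::close(fd);
        return 0;
    }

    std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd); // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) throw std::runtime_error("mmap failed");

    // The mapped bytes form an ordinary random-access range.
    const char* first = static_cast<const char*>(map);
    const char* last = first + size;
    std::size_t lines = static_cast<std::size_t>(std::count(first, last, '\n'));

    ::munmap(map, size);
    return lines;
}
```

The operating system pages the file in and out on demand, so resident memory stays bounded even for very large files. Boost.Iostreams also offers boost::iostreams::mapped_file_source as a portable wrapper. To keep using the lexer, the same pointer pair could presumably be passed to tokenize with the token type instantiated on const char* instead of boost::spirit::istream_iterator; that substitution is an assumption here and is not shown.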

Other solutions would involve telling the multi-pass adaptor to flush. The simplest way to achieve this would be to use tokenize_and_parse. Even with a faux grammar like *(any_token) this should be enough to convince the parser framework you will not be asking it to backtrack.

Another answer deals with streaming parsing of multi-GiB files and compares performance with tools like wc -l.

¹ http://www.boost.org/doc/libs/1_62_0/libs/spirit/doc/html/spirit/support/multi_pass.html By default it stores a shared deque. You can see this after running your test for a little while under massif, using dd if=/dev/zero bs=1M | valgrind --tool=massif ./sotest:


100.00% (805,385,576B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.99% (805,306,368B) 0x4187D5: void boost::spirit::iterator_policies::split_std_deque::unique<char>::increment<boost::spirit::multi_pass<std::istream, boost::spirit::iterator_policies::default_policy<boost::spirit::iterator_policies::ref_counted, boost::spirit::iterator_policies::no_check, boost::spirit::iterator_policies::istream, boost::spirit::iterator_policies::split_std_deque> > >(boost::spirit::multi_pass<std::istream, boost::spirit::iterator_policies::default_policy<boost::spirit::iterator_policies::ref_counted, boost::spirit::iterator_policies::no_check, boost::spirit::iterator_policies::istream, boost::spirit::iterator_policies::split_std_deque> >&) (in /home/sehe/Projects/stackoverflow/sotest)
| ->99.99% (805,306,368B) 0x404BC3: main (in /home/sehe/Projects/stackoverflow/sotest)


10-18 12:55