在 mapreduce 中操作迭代器

本文介绍了在 mapreduce 中操作迭代器的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试使用 hadoop 找到任何给定点的总和，我遇到的问题是从单个减速器中的给定键中获取所有值.看起来像这样.

I am trying to find the sum of any given points using hadoop, The issue I am having is on getting all values from a given key in a single reducer. It looks like this.

减速器:

 public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, DoubleWritable> output, Reporter reporter)
            throws IOException {
        Text word = new Text();

        Iterator<IntWritable> tr = values;
        IntWritable v;
        while (tr.hasNext()) {
             v = tr.next();

            Iterator<IntWritable> td = values;
            while (td.hasNext()) {

                IntWritable u = td.next();
                double sum = u+v;
                word.set( u + " + " + v);
                output.collect(word, new DoubleWritable(sum));
            }
        }
    }
}

并且我正在尝试创建 Iterator 变量的两个副本，以便我可以遍历第二个迭代器的所有值，同时从前一个迭代器(上面的两个 while 循环)获取单个值，但两个迭代器保存始终相同的值.

And I am trying to create two copies of the Iterator variable so that I can go through all the values of the second iterator while I get a single value from the previous Iterator( Two while loops above) but the two iterators hold the same value all the time.

我不确定这是否是正确的做法.

I am not sure if this is the right way to do it.

推荐答案

reducer 中的迭代器并不像你想象的那么简单.

The iterators in the reducer are not as simple as you might think.

问题是您迭代的项目总数可能不适合内存.这意味着迭代器可能正在从磁盘读取.如果您有两个独立的迭代器副本，那么您可以让其中一个远远领先于另一个，这意味着不能删除两个迭代器指向的位置之间的数据.

The issue is that the total number of items that you are iterating through might not fit into memory. That means that the iterator may be reading from disk. If you have two independent copies of the iterator, then you can have one of them far ahead of the other which implies that the data between where the two iterators point can't be dropped.

为了简化实现，Hadoop 不支持为 reduce 值使用多个迭代器.

For simplicity of implementation, Hadoop doesn't support having more than one iterator for the reduce values.

这样做的实际影响是你不能两次遍历同一个迭代器.这不太好，但情况确实如此.如果您完全知道项目的数量将适合内存，那么您可以按照 MrGomez 的建议将所有项目复制到一个列表中.如果您不知道这一点，您可能必须使用辅助存储.

The practical impact of this is that you can't go through the same iterator twice. That isn't nice, but it is the case. If you absolutely know that the number of items will fit into memory, then you can copy all the items into a list as suggested by MrGomez. If you don't know that, you may have to use secondary storage.

更好的方法是重新设计你的程序，这样你就不需要在 reducer 中无限存储.这可能会有点棘手，但有标准的方法可以解决这个问题.

The better approach is to redesign your program so that you don't need unbounded storage in the reducer. This can get a bit tricky, but there are standard approaches to the problem.

对于您的特定问题，相对于最大的减少输入集，输出大小呈二次增长.这通常是一个非常糟糕的主意.在大多数情况下，您不需要所有对，只需要最重要的对.如果你能以某种方式修剪这组对，那么你就全部设置好了，你可能能够移除所有对约束.

For your particular problem, you have a quadratic growth in output size relative to the largest reduce input set. This is usually a really bad idea. In most cases you don't need ALL pairs, just the most important pairs. If you can trim the set of pairs in some way, then you are all set and you may be able to remove the all pairs constraint.

例如，如果您试图找到每个归约集总和最大的 100 对，您可以保留一个具有迄今为止看到的最大 100 个输入的优先级队列和一个具有迄今为止看到的最大 100 个输入的优先级队列.对于每个新输入，您可以使用迄今为止看到的最大 100 个数字形成总和，并尝试将这些总和放入第二个队列中.最后，您应该将新输入粘贴到第一个队列中，并通过删除最小值(如有必要)将两个队列修剪为 100 个元素.在reduce的close方法中，应该dump优先队列.这种方法保证您只需要 min(n^2, 200) 个存储元素，从而避免了 n^2 问题，并通过保留 100 个最大的项目而不是所有项目来避免输入的双重传递.

For instance, if you are trying to find the 100 pairs with the largest sum for each reduce set, you can keep a priority queue with the 100 largest inputs seen so far and a priority queue with the 100 largest sums seen so far. For each new input, you can form the sum with the largest 100 numbers seen so far and try to stick those sums into the second queue. Finally, you should stick the new input into the first queue and trim both queues to 100 elements by deleting the smallest values (if necessary). In the close method of the reduce, you should dump the priority queue. This approach guarantees that you only need min(n^2, 200) elements of storage which avoids the n^2 problem and avoids the double pass through the input by keeping the 100 largest items seen rather than all items seen.

这篇关于在 mapreduce 中操作迭代器的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！