This article looks at how to work with the value iterator in a MapReduce reducer. It should be a useful reference for anyone running into the same problem; interested readers can follow along below.

Problem Description



I am trying to find the sum of any given points using Hadoop. The issue I am having is getting all the values for a given key in a single reducer. It looks like this.

Reducer:

 public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, DoubleWritable> output, Reporter reporter)
            throws IOException {
        Text word = new Text();

        // First "copy" of the iterator: walk over each value v.
        Iterator<IntWritable> tr = values;
        IntWritable v;
        while (tr.hasNext()) {
            v = tr.next();

            // Second "copy": try to pair v with every value u for this key.
            Iterator<IntWritable> td = values;
            while (td.hasNext()) {
                IntWritable u = td.next();
                double sum = u.get() + v.get();
                word.set(u + " + " + v);
                output.collect(word, new DoubleWritable(sum));
            }
        }
    }
}

I am trying to create two copies of the iterator variable so that I can go through all the values with the second iterator while getting a single value from the first iterator (the two while loops above), but the two iterators hold the same value all the time.

I am not sure if this is the right way to do it; any help is really appreciated.

Thanks,

Tsegay

Solution

The iterators in the reducer are not as simple as you might think.

The issue is that the total number of items that you are iterating through might not fit into memory. That means that the iterator may be reading from disk. If you have two independent copies of the iterator, then you can have one of them far ahead of the other which implies that the data between where the two iterators point can't be dropped.

For simplicity of implementation, Hadoop doesn't support having more than one iterator for the reduce values.

The practical impact of this is that you can't go through the same iterator twice. That isn't nice, but it is the case. If you absolutely know that the number of items will fit into memory, then you can copy all the items into a list as suggested by MrGomez. If you don't know that, you may have to use secondary storage.
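If the values for one key are known to fit in memory, a minimal sketch of that caching approach might look like the following (it reuses the old org.apache.hadoop.mapred API from the question; the class name CachingReduce and the pair-emitting logic are only illustrative). Note that Hadoop reuses the IntWritable instance handed out by the iterator, so the primitive value has to be copied out rather than the Writable object itself.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class CachingReduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, DoubleWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, DoubleWritable> output, Reporter reporter)
                throws IOException {
            // Copy the values into a list first. The framework reuses the
            // IntWritable object, so store the primitive int, not the Writable.
            List<Integer> cached = new ArrayList<Integer>();
            while (values.hasNext()) {
                cached.add(values.next().get());
            }

            // The list can now be traversed as many times as needed.
            Text word = new Text();
            for (int v : cached) {
                for (int u : cached) {
                    word.set(u + " + " + v);
                    output.collect(word, new DoubleWritable(u + v));
                }
            }
        }
    }

This is only safe when the number of values per key is bounded and small; for large keys the bounded-memory approach described below is the better option.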

The better approach is to redesign your program so that you don't need unbounded storage in the reducer. This can get a bit tricky, but there are standard approaches to the problem.

For your particular problem, you have a quadratic growth in output size relative to the largest reduce input set. This is usually a really bad idea. In most cases you don't need ALL pairs, just the most important pairs. If you can trim the set of pairs in some way, then you are all set and you may be able to remove the all pairs constraint.

For instance, if you are trying to find the 100 pairs with the largest sum for each reduce set, you can keep a priority queue with the 100 largest inputs seen so far and a priority queue with the 100 largest sums seen so far. For each new input, you can form the sum with the largest 100 numbers seen so far and try to stick those sums into the second queue. Finally, you should stick the new input into the first queue and trim both queues to 100 elements by deleting the smallest values (if necessary). In the close method of the reduce, you should dump the priority queue. This approach guarantees that you only need min(n^2, 200) elements of storage which avoids the n^2 problem and avoids the double pass through the input by keeping the 100 largest items seen rather than all items seen.
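As a rough sketch of that top-100 idea, adapted so the queues live inside a single reduce() call and the surviving pairs are emitted at the end of the call rather than in close() (the class name TopPairsReduce and the constant TOP are made up for illustration):

    import java.io.IOException;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class TopPairsReduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, DoubleWritable> {

        private static final int TOP = 100;

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, DoubleWritable> output, Reporter reporter)
                throws IOException {
            // Min-heaps: the head is the smallest element, so trimming to TOP
            // elements just means polling the head when a queue overflows.
            PriorityQueue<Integer> largestInputs = new PriorityQueue<Integer>();
            PriorityQueue<long[]> largestSums = new PriorityQueue<long[]>(TOP,
                    new Comparator<long[]>() {
                        public int compare(long[] a, long[] b) {
                            return Long.compare(a[2], b[2]); // order by the sum
                        }
                    });

            while (values.hasNext()) {
                int v = values.next().get();

                // Pair the new value with the largest inputs seen so far,
                // keeping only the TOP largest sums.
                for (int u : largestInputs) {
                    largestSums.offer(new long[] { u, v, (long) u + v });
                    if (largestSums.size() > TOP) {
                        largestSums.poll();
                    }
                }

                // Remember the new value itself, again keeping only TOP of them.
                largestInputs.offer(v);
                if (largestInputs.size() > TOP) {
                    largestInputs.poll();
                }
            }

            // Emit the surviving pairs: at most TOP per key, one pass over the input.
            Text word = new Text();
            for (long[] p : largestSums) {
                word.set(p[0] + " + " + p[1]);
                output.collect(word, new DoubleWritable(p[2]));
            }
        }
    }

The single pass over the values is the point: the iterator is never traversed twice, and memory stays bounded regardless of how many values arrive for a key.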

That wraps up this article on creating iterators in MapReduce; hopefully the answer above is helpful.
