Computing the top-N B values for each A value using MapReduce

Question

I am new to Hadoop and have been struggling to write a MapReduce algorithm that finds the top N values for each A value. Any help or guidance on a code implementation would be highly appreciated.

Input data
a,1
a,9
b,3
b,5
a,4
a,7
b,1

Output
a 1,4,7,9
b 1,3,5

I believe we should write a Mapper that reads each line, splits it into a key and a value, and emits the pair so that the reducer collects all values for a key. The sorting would then have to happen in the reducer.

Recommended answer

If the number of values per key is small enough, the simplest approach is probably best: have the reducer read all values associated with a given key and output the top N.
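The simple approach can be sketched in plain Python as an in-memory simulation. This is not Hadoop code: the function names `map_line`, `reduce_top_n`, and `run_job` are hypothetical, the shuffle phase is replaced by a dictionary, and the ascending output order is an assumption chosen to match the sample output in the question.

```python
import heapq
from collections import defaultdict


def map_line(line):
    """Mapper: parse a 'key,value' line into a (key, int) pair."""
    key, value = line.strip().split(",")
    return key, int(value)


def reduce_top_n(key, values, n):
    """Reducer: keep the n largest values for one key, listed in ascending order."""
    return key, sorted(heapq.nlargest(n, values))


def run_job(lines, n):
    # Stand-in for the shuffle phase: group the mapped values by key.
    groups = defaultdict(list)
    for line in lines:
        key, value = map_line(line)
        groups[key].append(value)
    # One reducer call per key.
    return dict(reduce_top_n(k, vs, n) for k, vs in sorted(groups.items()))


lines = ["a,1", "a,9", "b,3", "b,5", "a,4", "a,7", "b,1"]
for key, top in run_job(lines, n=4).items():
    print(key, ",".join(map(str, top)))
```

With `n=4` this reproduces the sample output from the question. The reducer holds every value of a key in memory at once, which is exactly why the approach only suits keys with few values.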

If the number of values per key is large enough that this would be a poor choice, a composite key will work better, and it requires a custom partitioner and comparator. You would partition on the natural key (here 'a' or 'b', so that all records for a key end up at the same reducer) but apply a secondary sort on the value (so that the reducer sees the largest values first and can stop after reading N of them).
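The composite-key idea can be illustrated with another in-memory Python simulation. Again this is only a sketch and not Hadoop code: `partition`, `sort_key`, and `run_secondary_sort_job` are hypothetical names that mimic the roles of a custom partitioner, a sort comparator, and grouping by natural key, and the CRC32-based partitioning is an arbitrary choice made for this example.

```python
import zlib
from itertools import groupby, islice


def map_line(line):
    """Mapper: emit a composite key (natural key, value) plus the value as payload."""
    key, value = line.strip().split(",")
    value = int(value)
    return (key, value), value


def partition(composite_key, num_reducers):
    # Partition on the natural key only, so all records for 'a' reach one reducer.
    return zlib.crc32(composite_key[0].encode()) % num_reducers


def sort_key(record):
    # Secondary sort: natural key ascending, then value descending.
    (natural, value), _ = record
    return (natural, -value)


def run_secondary_sort_job(lines, n, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for line in lines:
        record = map_line(line)
        partitions[partition(record[0], num_reducers)].append(record)

    result = {}
    for records in partitions:
        records.sort(key=sort_key)
        # Group by natural key; each group arrives with the largest value first.
        for natural, group in groupby(records, key=lambda r: r[0][0]):
            # The reducer reads only the first n values and ignores the rest.
            result[natural] = [value for _, value in islice(group, n)]
    return result


lines = ["a,1", "a,9", "b,3", "b,5", "a,4", "a,7", "b,1"]
print(run_secondary_sort_job(lines, n=2))
```

Because the values reach the reducer already sorted, it never has to buffer a whole key's values; it takes the first N and moves on. In the simulation the sort is done in memory, whereas in a real job that work would be done by the framework during the shuffle.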


09-16 03:57