How do we count the number of times the map function is called in a MapReduce program?

Problem description

I have to perform certain operations on my input data and write the result to HDFS using a MapReduce program. My input data looks like this:

abc  
some data  
some data  
some data  
def  
other data  
other data  
other data 

It continues in the same way, where abc and def are the headers and the "some data" lines are tab-separated records.

My task is to remove the header lines and append each header to the records below it, like this:

some data abc  
some data abc  
some data abc  
other data def  
other data def  
other data def  

Each header will have 50 records.

I am using the default record reader, so it reads one line at a time.

Now my problem is: how do I know that the map function has been called for the nth time? Is there a counter I can use for that, so that I can append the header to the string like this:

if (counter % 50 == 0)
   *some code*

Or are static variables the only way?

Solution

You can use member variables to keep a count of how many lines have been processed so far. Member variables are instance variables, so they are not reset each time the map function is called. You can initialize them in the mapper's setup method.
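As a rough illustration of the member-variable approach, here is a Hadoop-free Java sketch. `HeaderTaggingMapper`, its `output` list, and the `recordsPerHeader` parameter are hypothetical stand-ins, not Hadoop API; in a real job the class would extend `org.apache.hadoop.mapreduce.Mapper` and write through its `Context`. The sketch assumes that every line, headers included, reaches `map` in file order on a single mapper instance, so a header arrives every `recordsPerHeader + 1` calls (every 51st line when each header has 50 records).

```java
import java.util.ArrayList;
import java.util.List;

// Hadoop-free stand-in for a Mapper: setup() runs once per task,
// map() runs once per input line. All names here are hypothetical.
class HeaderTaggingMapper {
    private final int recordsPerHeader;

    private long linesSeen;        // member variable: survives across map() calls
    private String currentHeader;  // last header line seen by this instance

    // Stands in for context.write(...) in a real Mapper.
    final List<String> output = new ArrayList<>();

    HeaderTaggingMapper(int recordsPerHeader) {
        this.recordsPerHeader = recordsPerHeader;
    }

    void setup() {
        linesSeen = 0;
        currentHeader = null;
    }

    void map(String line) {
        // Each group is 1 header + N records, so headers sit at
        // positions 0, N+1, 2(N+1), ... of the line count.
        if (linesSeen % (recordsPerHeader + 1) == 0) {
            currentHeader = line.trim();
        } else {
            output.add(line.trim() + "\t" + currentHeader);
        }
        linesSeen++;
    }
}
```

With the sample from the question and `recordsPerHeader` set to 3, feeding the eight lines to `map` leaves six tagged records in `output`, the first being `some data` followed by a tab and `abc`. The count is only meaningful while one instance sees the whole file from its first line.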

You can, of course, also use a static variable to hold the counter.
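The difference between the two kinds of variable can be seen in plain Java. The sketch below uses a hypothetical `CountingTask` class rather than a real Hadoop mapper: a static field has one copy per JVM and is shared by every instance, while an instance field belongs to one object. Whether two map tasks ever share a JVM depends on the Hadoop version and configuration, so a static counter should not be assumed to count calls across tasks.

```java
// Hypothetical class contrasting a static counter with an instance counter.
class CountingTask {
    static long staticCount = 0; // one copy per JVM, shared by all instances
    long instanceCount = 0;      // one copy per object

    void map(String line) {
        staticCount++;
        instanceCount++;
    }
}
```

If two `CountingTask` objects in the same JVM handle two lines and one line respectively, their `instanceCount` values are 2 and 1, while `staticCount` is 3.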

Note that data in HDFS is stored in blocks. How are you going to handle the case where the data is split across two blocks?

To handle data that is split across two blocks, you might need Reducers. A property of reducers is that all the data (values) related to a particular key is always sent to the same (single) reducer. The input to the reducer is a key and a list of values, which in your case is a list of data records. So you can store them easily as your requirement dictates.

Optimization: You can use the same Reducer code as a Combiner to optimize the data transfer.

Idea: The Mapper emits the key and value as they are. When the Reducer receives the data, which is Key, List<value>, all of the values for that key have already been grouped by the MapReduce framework. You just need to emit them again. This is the output you are looking for.
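The grouping idea can be sketched without Hadoop as follows. `HeaderGroupingSketch`, `shuffle`, and `reduce` are hypothetical names that only imitate what the framework does between map and reduce; they are not Hadoop API. The sketch assumes the map output already consists of (header, record) pairs, which means the mapper still has to know which header each record belongs to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free imitation of "group by key, then re-emit". Names are hypothetical.
class HeaderGroupingSketch {

    // Stand-in for the shuffle: collect every value under its key, keys sorted.
    // Each element of mapOutput is a {key, value} pair.
    static Map<String, List<String>> shuffle(List<String[]> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // Stand-in for reduce(key, values): re-emit each record with its header appended.
    static List<String> reduce(String header, List<String> records) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            out.add(record + "\t" + header);
        }
        return out;
    }
}
```

Because all values for `abc` end up in one list, it does not matter which mapper, or which block, a given record came from; the reduce step sees them together.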

09-16 03:57