
Question

Is it possible to have paragraphs of text passed to a Mapper class instead of individual lines? I am looking for a ParagraphRecordReader implementation.

Recommended answer

The answer at https://stackoverflow.com/a/5398215/1660002 partly addresses this requirement. However, you can also simply set the configuration parameter textinputformat.record.delimiter to a double-newline string (for example, "\n\n") to solve this, so that each paragraph reaches the Mapper as a single record.
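To make the effect of the delimiter concrete, here is a minimal sketch in plain Java that splits text on "\n\n" the way a delimiter-based reader would hand records to a Mapper. It does not use Hadoop itself. The class name ParagraphSplitSketch and the method splitRecords are hypothetical names chosen for illustration, and the sketch ignores input-split boundaries and byte offsets that a real record reader has to handle.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphSplitSketch {
    // In a real job, the equivalent setting would be made on the job's
    // Configuration object before the job is created, for example:
    //   conf.set("textinputformat.record.delimiter", "\n\n");
    static final String DELIMITER = "\n\n";

    // Hypothetical helper: cuts the input at every occurrence of the
    // delimiter and returns the pieces in order. Single newlines inside
    // a paragraph are left untouched.
    public static List<String> splitRecords(String input, String delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int end = input.indexOf(delimiter, start);
            if (end < 0) {
                // No further delimiter: the rest of the input is the last record.
                records.add(input.substring(start));
                break;
            }
            records.add(input.substring(start, end));
            start = end + delimiter.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "first paragraph, line one\n"
                + "first paragraph, line two\n"
                + "\n"
                + "second paragraph\n";
        List<String> records = splitRecords(text, DELIMITER);
        for (int i = 0; i < records.size(); i++) {
            // Show embedded newlines as the two characters \n for readability.
            System.out.println("record " + i + ": " + records.get(i).replace("\n", "\\n"));
        }
    }
}
```

In this sketch a trailing single newline stays part of the last record, so a Mapper that cares about clean paragraph text would probably want to trim each value it receives.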

This configurable feature is available in the Apache Hadoop 0.23.x and 2.x releases, and also in both the CDH3 and CDH4 releases from Cloudera, if you use those.


10-28 21:32