This article describes an approach to creating an ARPA language model file containing nearly 50,000 words; it may be a useful reference for anyone facing the same problem.

Problem Description

I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is any other link available where I can get a language model for these many words?

Recommended Answer

I thought I'd answer this one since it has a few votes, although based on Christina's other questions I don't think this will be a usable answer for her since a 50,000-word language model almost certainly won't have an acceptable word error rate or recognition speed (or most likely even function for long) with in-app recognition systems for iOS that use this format of language model currently, due to hardware constraints. I figured it was worth documenting it because I think it may be helpful to others who are using a platform where keeping a vocabulary this size in memory is more of a viable thing, and maybe it will be a possibility for future device models as well.

There is no web-based tool I'm aware of like the Sphinx Knowledge Base Tool that will munge a 50,000-word plaintext corpus and return an ARPA language model. But, you can obtain an already-complete 64,000-word DMP language model (which can be used with Sphinx at the command line or in other platform implementations in the same way as an ARPA .lm file) with the following steps:

  1. Download this language model from the CMU speech site:

http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20HUB4%20Language%20Model/HUB4_trigram_lm.zip

In that folder is a file called language_model.arpaformat.DMP which will be your language model.

  2. Download this file from the CMU speech site; it will serve as your pronunciation dictionary:

https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

Convert the contents of cmu07a.dic to all uppercase letters.
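
That conversion is trivial to script. Here is a minimal Python sketch, assuming the file has already been downloaded to the working directory (the in-place rewrite is illustrative, not part of the original answer):

```python
# Uppercase the entire pronunciation dictionary so its headwords match
# the all-uppercase vocabulary used by the HUB4 language model.
with open("cmu07a.dic", "r") as f:
    contents = f.read()

with open("cmu07a.dic", "w") as f:
    f.write(contents.upper())
```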

If you want, you could also trim down the pronunciation dictionary by removing any words from it which aren't found in the corpus language_model.vocabulary (this would be a regex problem). These files are intended for use with one of the Sphinx English-language acoustic models.
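
A sketch of that trimming step in Python, under two assumptions not spelled out in the original answer: that language_model.vocabulary lists one word per line, and that alternate pronunciations in the dictionary carry a (2)-style suffix, as in the standard CMU dictionary format:

```python
import re

# Load the vocabulary shipped with the HUB4 model (assumed: one word per line).
with open("language_model.vocabulary") as f:
    vocab = {line.strip() for line in f if line.strip()}

kept = []
with open("cmu07a.dic") as f:
    for line in f:
        if not line.strip():
            continue
        # The headword is the first field; alternate pronunciations are
        # written as WORD(2), WORD(3), ... so strip the suffix before matching.
        base = re.sub(r"\(\d+\)$", "", line.split()[0])
        if base in vocab:
            kept.append(line)

# Write the trimmed dictionary to a new file rather than overwriting the original.
with open("cmu07a_trimmed.dic", "w") as f:
    f.writelines(kept)
```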

If the desire to use a 50,000-word English language model is driven by the idea of doing some kind of generalized large vocabulary speech recognition and not by the need to use a very specific 50,000 words (for instance, something specialized like a medical dictionary or 50,000-entry contact list), this approach should give those results if the hardware can handle it. There are probably going to be some Sphinx or Pocketsphinx settings that will need to be changed which will optimize searches through this size of model.
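
For what it's worth, here is a hedged sketch of where those knobs live, using the pocketsphinx Python bindings; the model paths and the beam/pruning values are illustrative starting points to experiment with, not settings from the original answer:

```python
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
# Paths are illustrative; point them at your acoustic model and the files from the steps above.
config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')
config.set_string('-lm', 'language_model.arpaformat.DMP')  # the HUB4 trigram model
config.set_string('-dict', 'cmu07a_trimmed.dic')           # uppercased, optionally trimmed dictionary
# Tightening the beams and the HMM-per-frame limit prunes the search harder,
# trading some accuracy for speed on a model this large (values are guesses).
config.set_float('-beam', 1e-40)      # a value closer to 1 means a tighter beam
config.set_float('-wbeam', 1e-20)     # word-exit beam
config.set_int('-maxhmmpf', 10000)    # cap on active HMMs per frame

decoder = Decoder(config)
decoder.start_utt()
with open('utterance.raw', 'rb') as f:   # mono 16 kHz 16-bit raw audio (illustrative)
    decoder.process_raw(f.read(), False, True)
decoder.end_utt()
print(decoder.hyp().hypstr if decoder.hyp() else '(no hypothesis)')
```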

That concludes this article on creating a 50,000-word ARPA language model file; hopefully the recommended answer above is helpful.
