What do I need to download to make nltk.tokenize.word_tokenize work?

Problem description

I am going to use nltk.tokenize.word_tokenize on a cluster where my account is tightly limited by a space quota. At home, I downloaded all NLTK resources with nltk.download(), but, as I found out, they take up ~2.5 GB.

This seems like overkill to me. Could you suggest the minimal (or almost minimal) dependencies for nltk.tokenize.word_tokenize? So far I have seen nltk.download('punkt'), but I am not sure whether it is sufficient or what its size is. What exactly should I run to make it work?

Recommended answer

You are right. You need the Punkt Tokenizer Models. They are about 13 MB, and nltk.download('punkt') should do the trick.

