This article explains how to modify Python's nltk.word_tokenize so that "#" is not treated as a delimiter; it may be a useful reference if you have run into the same problem.

Problem description

I am using Python's NLTK library to tokenize my sentences.

If my code is

import nltk

text = "C# billion dollars; we don't own an ounce C++"
print(nltk.word_tokenize(text))

I get the following output:

['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

The symbols ;, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, just as + is not a delimiter and C++ therefore appears as a single token?

I would like the output to be

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

I want C# to be treated as one token.

Recommended answer

Another idea: instead of altering how the text is tokenized, loop over the tokens and join every '#' with the preceding token.

from nltk import word_tokenize

txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)

i_offset = 0
for i, t in enumerate(tokens):
    i -= i_offset  # shift the index left by one for every join done so far
    if t == '#' and i > 0:
        # Merge '#' into the preceding token and drop the standalone '#'.
        left = tokens[:i - 1]
        joined = [tokens[i - 1] + t]
        right = tokens[i + 1:]
        tokens = left + joined + right
        i_offset += 1

>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
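The same merge can also be written as a single left-to-right pass that builds a new list, which avoids the index-offset bookkeeping in the loop above. A minimal sketch (the helper name merge_hash is our own, not from the answer):

```python
def merge_hash(tokens):
    """Merge each standalone '#' token into the token before it."""
    merged = []
    for t in tokens:
        if t == '#' and merged:
            merged[-1] += t  # glue '#' onto the previous token
        else:
            merged.append(t)
    return merged

tokens = ['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't",
          'own', 'an', 'ounce', 'C++']
print(merge_hash(tokens))
# ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
```

A leading '#' with nothing before it is left untouched, since there is no preceding token to join it to.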

