This article covers how to handle the 'utf-8' decoding error raised when loading a word2vec model with gensim.

Problem description

I have to use a word2vec model containing a large number of Chinese characters. The model was trained by my coworkers using Java and saved as a bin file.

I installed gensim and tried to load the model, but the following error occurred:

In [1]: import gensim  

In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

I tried loading the model in both Python 2.7 and 3.5, and it failed in the same way each time. How can I load the model in gensim? Thanks.

Recommended answer

The model was trained in Java on text containing a large number of Chinese characters, and I cannot determine the encoding of the original corpus. The error can be resolved as described in the gensim FAQ.

Call load_word2vec_format with a flag that ignores character decoding errors:

In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True, unicode_errors='ignore')
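A side note for readers on a newer gensim release: to the best of my knowledge, load_word2vec_format was later moved from Word2Vec to KeyedVectors, so the call above may no longer work as written. The sketch below shows the same fix through KeyedVectors. The helper name load_vectors and its default argument are hypothetical and only serve the illustration.

```python
def load_vectors(path, unicode_errors="ignore"):
    """Load binary word2vec-format vectors (hypothetical helper, not part of gensim)."""
    # Imported inside the function so this sketch can be pasted into a module
    # without requiring gensim at import time.
    from gensim.models import KeyedVectors

    # binary=True: the file is in the binary word2vec format.
    # unicode_errors: the error handler used when vocabulary words are decoded;
    # "ignore" drops undecodable bytes, "replace" substitutes U+FFFD.
    return KeyedVectors.load_word2vec_format(
        path, binary=True, unicode_errors=unicode_errors
    )
```

It would be called as load_vectors('/path/to/vectors.bin'), where the path is a placeholder for the actual bin file. Passing unicode_errors='replace' instead keeps a visible marker in each affected word.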

However, I don't know whether ignoring the decoding errors affects the loaded model.
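One way to get a feel for what ignoring the errors does is to reproduce the failure on a single word with plain Python. The sketch below uses a made-up two-character word cut one byte short. It assumes the damaged entries in the file look similar, which the question does not confirm.

```python
# A two-character word encoded as UTF-8, then cut one byte short so that the
# last character is incomplete (a made-up stand-in for a damaged vocabulary entry).
word = "汉字"
damaged = word.encode("utf-8")[:-1]

# Strict decoding (the default) fails with the same reason as in the question.
try:
    damaged.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

# "ignore" silently drops the incomplete trailing bytes.
ignored = damaged.decode("utf-8", errors="ignore")

# "replace" keeps a visible U+FFFD marker where decoding failed.
replaced = damaged.decode("utf-8", errors="replace")

# ascii() escapes non-ASCII characters, so printing is safe on any terminal.
print(strict_error)
print(ascii(ignored))
print(ascii(replaced))
```

With 'ignore', the affected word is stored under a shortened key, so looking it up by its original spelling will likely miss. Words that decode cleanly should not be affected, and the decoding step only concerns the vocabulary strings, not the vector values.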


10-18 15:21