How to load a word2vec model and call its functions inside a mapper

Problem description

I want to load a word2vec model and evaluate it on word analogy tasks (e.g. a is to b as c is to what?). To do this, I first load my w2v model:

model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

I then call a mapper to evaluate the model:

rdd_lines = spark.read.text("questions-words.txt").rdd.map(getAnswers)

The getAnswers function reads one line at a time from questions-words.txt, where each line contains a question and its answer for evaluating my model (e.g. Athens Greece Baghdad Iraq, where a=Athens, b=Greece, c=Baghdad and something=Iraq). After reading the line, I build current_question and actual_answer (e.g. current_question=Athens Greece Baghdad and actual_answer=Iraq). I then call the getAnalogy function, which computes the analogy (that is, given the question, it computes the answer). Finally, I return the result and write it to a text file.
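The line-parsing step described above can be sketched in plain Python. This is only an illustration: parse_question_line is a hypothetical helper name, not part of the original code, and the rule of skipping any line that does not contain four words is an assumption about how malformed or comment lines should be handled.

```python
def parse_question_line(line):
    """Split 'a b c d' into the question words [a, b, c] and the expected answer d.

    Hypothetical helper for illustration; returns None for lines that do not
    contain at least four space-separated words (assumed to be non-questions).
    """
    words = line.strip().split(' ', 3)
    if len(words) != 4:
        return None
    return words[:3], words[3]


current_question, actual_answer = parse_question_line("Athens Greece Baghdad Iraq\n")
print(current_question, actual_answer)
```

Keeping the question as a list of three words, rather than one joined string, matters later: getAnalogy indexes s[0], s[1] and s[2] and expects each to be a whole word.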

The problem is that I get the following exception:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.

I think it is thrown because I am using the model inside the map function. There is an existing question similar to my problem, but I do not know how to apply its answer to my code. How can I solve this? The full code follows:

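from __future__ import print_function  # needed for print(..., file=...) on Python 2

import sys

from pyspark.sql import SparkSession
from pyspark.mllib.feature import Word2VecModel


def getAnalogy(s, model):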
    try:
        qry = model.transform(s[0]) - model.transform(s[1]) - model.transform(s[2])    
        res = model.findSynonyms((-1)*qry,5) # return 5 "synonyms"
        res = [x[0] for x in res]
        for k in range(0,3):
            if s[k] in res:
                res.remove(s[k])
        return res[0]
    except ValueError:
        return "NOT FOUND"

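def getAnswers(text):
    # text is a Row with a single string column; text[0] is the raw line
    tmp = text[0].split(' ', 3)
    answer_list = []
    # keep the question as a list of words: getAnalogy indexes s[0], s[1], s[2]
    current_question = tmp[:3]
    actual_answer = tmp[-1]

    # NOTE: `model` is the driver-side Word2VecModel; referencing it here,
    # inside a function passed to map(), is what triggers the exception
    model_answer = getAnalogy(current_question, model)
    if model_answer == "NOT FOUND":  # compare strings with ==, not `is`
        answer_list.append("NOT FOUND\n")
    elif model_answer == actual_answer:
        answer_list.append("TRUE\n")
    else:
        answer_list.append("FALSE:\n")
    return answer_list  # return the list itself, not the bound append method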


if __name__ == "__main__":

    if len(sys.argv) != 3:
        print("Usage: my_test <model_path> <output_path>", file=sys.stderr)
        exit(-1)


    spark = SparkSession\
    .builder\
    .appName("my_test")\
    .getOrCreate()


    model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

    rdd_lines = spark.read.text("questions-words.txt").rdd.map(getAnswers)

    dataframe = rdd_lines.toDF()

    dataframe.write.text(str(sys.argv[2]))

    spark.stop()
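For reference, the arithmetic inside getAnalogy amounts to searching for the word whose vector is closest to b - a + c. A minimal pure-Python sketch of that idea follows, assuming cosine similarity as the ranking measure and using made-up three-dimensional vectors; toy_analogy, cosine and toy_vectors are hypothetical names for illustration and are not part of the Spark API.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def toy_analogy(question, vectors):
    """Return the word closest to b - a + c, excluding the question words."""
    a, b, c = question
    try:
        target = [vb - va + vc
                  for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    except KeyError:
        return "NOT FOUND"  # a question word is missing from the vocabulary
    candidates = [w for w in vectors if w not in question]
    if not candidates:
        return "NOT FOUND"
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Made-up vectors, chosen only so that the arithmetic is easy to follow
toy_vectors = {
    "athens": [1.0, 0.0, 1.0],
    "greece": [1.0, 1.0, 1.0],
    "baghdad": [0.0, 0.0, 2.0],
    "iraq": [0.0, 1.0, 2.0],
    "turkey": [2.0, 1.0, 0.0],
}

print(toy_analogy(["athens", "greece", "baghdad"], toy_vectors))
```

Because this version depends only on a plain dictionary, it contains no reference to a SparkContext, which is the property the exception above is complaining about.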

Recommended answer

As you have already suspected, you cannot use the model in a map function. On the other hand, the questions-words.txt file is not that big (~20K lines), so you are better off doing the evaluation with vanilla Python list comprehensions (this is essentially the first suggested answer in the question you have linked); it is not fast, but it is just a one-off task. Here is a way, using my getAnalogy function as you have augmented it for error handling (notice that I have already removed the 'comment' lines from questions-words.txt, and that you should convert the text to lowercase, something you don't seem to be doing in your code):

from pyspark.mllib.feature import Word2Vec, Word2VecModel
model = Word2VecModel.load(sc, "word2vec/demo_200") # model built with k=200
with open('/home/ctsats/word2vec/questions-words.txt') as f:
    lines = f.readlines()
lines2 = [x.lower() for x in lines] # all to lowercase
lines3 = [x.strip('\n') for x in lines2] # remove end-of-line characters
lines4 = [x.split(' ',3) for x in lines3]
lines4[0] # check (already lowercased):
# ['athens', 'greece', 'baghdad', 'iraq']

def getAnswers (text, model):
    actual_answer = text[-1]
    question = [text[0], text[1], text[2]]
    model_answer = getAnalogy(question, model)
    if model_answer == "NOT FOUND":
        correct_answer = "NOT FOUND"
    elif model_answer == actual_answer:
        correct_answer = "TRUE"
    else:
        correct_answer = "FALSE"
    return text, model_answer, correct_answer

Your evaluation list can then be built as

answer_list = [getAnswers(x, model) for x in lines4]

Here are the first 20 entries (with a model of k=200):

[(['athens', 'greece', 'baghdad', 'iraq'], u'turkey', 'FALSE'),
 (['athens', 'greece', 'bangkok', 'thailand'], u'turkey', 'FALSE'),
 (['athens', 'greece', 'beijing', 'china'], u'albania', 'FALSE'),
 (['athens', 'greece', 'berlin', 'germany'], u'germany', 'TRUE'),
 (['athens', 'greece', 'bern', 'switzerland'], u'liechtenstein', 'FALSE'),
 (['athens', 'greece', 'cairo', 'egypt'], u'albania', 'FALSE'),
 (['athens', 'greece', 'canberra', 'australia'], u'liechtenstein', 'FALSE'),
 (['athens', 'greece', 'hanoi', 'vietnam'], u'turkey', 'FALSE'),
 (['athens', 'greece', 'havana', 'cuba'], u'turkey', 'FALSE'),
 (['athens', 'greece', 'helsinki', 'finland'], u'finland', 'TRUE'),
 (['athens', 'greece', 'islamabad', 'pakistan'], u'turkey', 'FALSE'),
 (['athens', 'greece', 'kabul', 'afghanistan'], u'albania', 'FALSE'),
 (['athens', 'greece', 'london', 'england'], u'italy', 'FALSE'),
 (['athens', 'greece', 'madrid', 'spain'], u'portugal', 'FALSE'),
 (['athens', 'greece', 'moscow', 'russia'], u'russia', 'TRUE'),
 (['athens', 'greece', 'oslo', 'norway'], u'albania', 'FALSE'),
 (['athens', 'greece', 'ottawa', 'canada'], u'moldova', 'FALSE'),
 (['athens', 'greece', 'paris', 'france'], u'france', 'TRUE'),
 (['athens', 'greece', 'rome', 'italy'], u'italy', 'TRUE'),
 (['athens', 'greece', 'stockholm', 'sweden'], u'norway', 'FALSE')]
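Once answer_list exists, the verdicts can be tallied into a single accuracy figure. The following is a minimal sketch, assuming each entry is a (question, model_answer, verdict) tuple as returned by getAnswers above; summarize is a hypothetical helper, and excluding "NOT FOUND" entries from the denominator is one possible convention rather than something the answer prescribes. The sample rows are copied from the output shown above.

```python
from collections import Counter


def summarize(answer_list):
    """Count verdicts and compute accuracy over the questions the model answered."""
    counts = Counter(verdict for _, _, verdict in answer_list)
    attempted = counts["TRUE"] + counts["FALSE"]  # "NOT FOUND" entries are left out
    accuracy = counts["TRUE"] / float(attempted) if attempted else 0.0
    return counts, accuracy


# Four rows taken from the listing above
sample = [
    (['athens', 'greece', 'baghdad', 'iraq'], 'turkey', 'FALSE'),
    (['athens', 'greece', 'berlin', 'germany'], 'germany', 'TRUE'),
    (['athens', 'greece', 'bern', 'switzerland'], 'liechtenstein', 'FALSE'),
    (['athens', 'greece', 'helsinki', 'finland'], 'finland', 'TRUE'),
]

counts, accuracy = summarize(sample)
print(counts, accuracy)
```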
