How to convert a corpus to a data.frame that includes metadata in R

Question

How can I convert a corpus into a data frame in R that also contains the metadata? I already tried the suggestion from "convert corpus into data.frame in R", but the resulting data frame only contains the text lines from all documents in the corpus. I also need the document ID and, if possible, the line number of each text line, in two columns. So how can I extend this command to get that data: dataframe <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)?

I have already tried:

dataframe <- 
    data.frame(id=sapply(corpus, meta(corpus, "id")), 
               text=unlist(sapply(corpus, `[`, "content")), 
               stringsAsFactors=F)

But it didn't help; I only got the error message "Error in match.fun(FUN) : 'meta(corpus, "id")' ist nicht Funktion, Zeichen oder Symbol" (German for "is not a function, character or symbol").

The corpus is extracted from plain text files; here is an example:

> str(corpus)
[...]
$ 1178531510 :List of 2
  ..$ content: chr [1:67] " uberrasch sagt [...] gemacht echt schad verursacht" ...
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2015-08-16 14:44:11"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "1178531510" # <--- This is the ID i want in the data.frame
  .. ..$ language     : chr "de"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
[...]

Thanks a lot :)

Recommended answer

There are two problems. First, you should not repeat the argument corpus inside sapply: sapply expects a function as its second argument, so pass meta itself and supply "id" as an extra argument. Second, multi-paragraph texts are turned into character vectors of length > 1, which you should paste together before unlisting.

dataframe <- 
    data.frame(id=sapply(corpus, meta, "id"),
               text=unlist(lapply(sapply(corpus, '[', "content"),paste,collapse="\n")),
               stringsAsFactors=FALSE)
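
The command above yields one row per document. The question also asked for the line number of each text line, so here is a minimal sketch of one way to build one row per line instead. It uses only base R, and `docs` is a hypothetical stand-in: a plain named list that mimics the `content` / `meta$id` layout shown in the str() output in the question. It is not a real tm corpus, and the names `docs`, `rows` and `per_line` are made up for illustration.

```r
# Hypothetical stand-in for the corpus: a named list whose elements carry
# `content` (character vector of lines) and `meta$id`, as in the str() output.
docs <- list(
  "1178531510" = list(content = c("first line", "second line"),
                      meta = list(id = "1178531510")),
  "1178531511" = list(content = "only line",
                      meta = list(id = "1178531511"))
)

# Build one small data frame per document: the ID is repeated for every line,
# and seq_along() numbers the lines within that document.
rows <- lapply(docs, function(d) {
  data.frame(id   = rep(d$meta$id, length(d$content)),
             line = seq_along(d$content),
             text = d$content,
             stringsAsFactors = FALSE)
})

# Stack the per-document data frames and drop the generated row names.
per_line <- do.call(rbind, rows)
rownames(per_line) <- NULL

print(per_line)
```

With an actual tm corpus, the accessors `meta(d, "id")` and `content(d)` would presumably take the place of `d$meta$id` and `d$content`; that substitution is an assumption and has not been checked against the asker's data.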
