Apache Tika and document metadata

Question

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. I need to extract at least:

word count, author, title, timestamps, language, etc.

This is not so easy. My strategy is to use the Template Method pattern for six document types: I first determine the type of the document and then process it individually based on that type.

I know that Apache Tika should remove the need for this, but the document formats are quite different, right?

For example:

InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();

for(String s : metadata.names()) {
    System.out.println("Metadata name : "  + s);
}

I tried this for ODS, MS Office, and PDF documents, and the metadata differs a lot. There is an MSOffice interface that lists metadata keys for MS documents, and there is a Dublin Core metadata list. But how should one implement an application like this?

Could anybody who has experience with this please share it? Thank you.

Recommended answer

Generally, the parsers should return the same metadata key for the same kind of thing across all document formats. However, some kinds of metadata only occur in certain file types, so you won't get those from the others.
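In practice you may still see more than one key for the same logical field, or no key at all for some formats. One way to cope is to try a short list of candidate keys in order. The following is only a minimal sketch: it works on a plain `Map<String, String>` standing in for values copied out of Tika's `Metadata` object, the class name `MetadataFields` is hypothetical, and the candidate key names are assumptions that you should check against what `metadata.names()` prints for your own files.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class MetadataFields {
    // Candidate keys per logical field, tried in order. These names are
    // assumptions for illustration; verify them against metadata.names()
    // for your documents and Tika version.
    static final Map<String, List<String>> CANDIDATES = Map.of(
        "title", List.of("dc:title", "title"),
        "author", List.of("dc:creator", "meta:author", "Author"));

    // Returns the first non-blank value found for a logical field,
    // or an empty Optional if the format did not supply one.
    public static Optional<String> lookup(Map<String, String> metadata, String field) {
        for (String key : CANDIDATES.getOrDefault(field, List.of())) {
            String value = metadata.get(key);
            if (value != null && !value.isBlank()) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for values copied out of a Tika Metadata object.
        Map<String, String> sample = new LinkedHashMap<>();
        sample.put("Author", "Jane Doe");
        sample.put("dc:title", "Quarterly report");

        System.out.println(lookup(sample, "title").orElse("(none)"));
        System.out.println(lookup(sample, "author").orElse("(none)"));
        System.out.println(lookup(sample, "language").orElse("(none)"));
    }
}
```

Returning `Optional` makes the "this format does not carry that field" case explicit at the call site instead of leaking nulls.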

You might want to just use the AutoDetectParser and, if you need to do anything special with the metadata, handle that afterwards based on the MIME type, e.g.:

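Metadata metadata = new Metadata();
// The file name is a hint that helps type detection
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();

// AutoDetectParser detects the type and delegates to the matching parser
Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, context);

// Constant-first equals avoids a NullPointerException if no type was set
if ("application/pdf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
   // Do something special with the PDF metadata here
}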
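
If several formats need special handling, the single `if` above can grow into a lookup table keyed by MIME type, which may replace one Template Method subclass per format. This is a rough sketch under stated assumptions: the class `MimeDispatch`, its handlers, and the strings they return are hypothetical placeholders, and the metadata is again modelled as a plain `Map<String, String>` rather than Tika's `Metadata` class, so the example runs without Tika on the classpath.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class MimeDispatch {
    // Hypothetical per-format handlers; each returns a short label here,
    // where a real application would do its format-specific processing.
    private static final Map<String, Function<Map<String, String>, String>> HANDLERS =
        new HashMap<>();
    static {
        HANDLERS.put("application/pdf", md -> "pdf-specific handling");
        HANDLERS.put("application/vnd.oasis.opendocument.spreadsheet",
                     md -> "ods-specific handling");
    }

    // Strips parameters such as "; charset=UTF-8" and lower-cases the type.
    public static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Picks the handler for the detected type, falling back to a generic one.
    public static String process(Map<String, String> metadata) {
        String type = baseType(metadata.get("Content-Type"));
        return HANDLERS.getOrDefault(type, md -> "generic handling").apply(metadata);
    }

    public static void main(String[] args) {
        Map<String, String> pdf = Map.of("Content-Type", "application/pdf");
        Map<String, String> text = Map.of("Content-Type", "text/plain; charset=UTF-8");
        System.out.println(process(pdf));
        System.out.println(process(text));
    }
}
```

Formats that need nothing special fall through to the generic handler, so only the genuinely different cases need their own entry.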

