Best way to extract text from a Word doc without using COM/automation?

Problem description

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform, and that's non-negotiable in this case.)

Antiword seems like a reasonable option, but it looks like it might be abandoned.

A Python solution would be ideal, but doesn't appear to be available.

Recommended answer

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped this in Python functions, so it is easy to use from the parsing system (which is written in Python).

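import subprocess

def doc_to_text_catdoc(filename):
    # -w turns off line wrapping. Passing the arguments as a list avoids the
    # shell, so file names containing quotes or spaces are handled safely.
    # (os.popen3 no longer exists in Python 3; subprocess replaces it.)
    proc = subprocess.run(["catdoc", "-w", filename],
                          capture_output=True, text=True)
    # As in the original, any output on stderr is treated as a failure.
    if proc.stderr:
        raise OSError("Executing the command caused an error: %s" % proc.stderr)
    return proc.stdout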

# similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, BTW.
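
The answer only mentions a "similar doc_to_text_antiword()" without showing it. A minimal sketch of what that counterpart might look like follows. It is not the author's code: the helper name run_converter is hypothetical, and it assumes antiword is installed and prints plain text to stdout when called with just a file name.

```python
import subprocess
import sys

def run_converter(argv):
    """Run an external converter (hypothetical helper) and return its stdout."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    # Like the catdoc function, treat any stderr output as a failure;
    # a non-zero exit status is checked as well.
    if proc.returncode != 0 or proc.stderr:
        raise OSError("Executing %r caused an error: %s" % (argv, proc.stderr))
    return proc.stdout

def doc_to_text_antiword(filename):
    # Assumption: antiword is on PATH and writes the document's text to
    # stdout when given only the file name.
    return run_converter(["antiword", filename])

if __name__ == "__main__" and len(sys.argv) > 1:
    print(doc_to_text_antiword(sys.argv[1]))
```

If the antiword binary is missing, subprocess.run raises FileNotFoundError, which is a subclass of OSError, so callers can catch both failure modes with a single except OSError clause.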
