Decoding and accessing an mbox file with the Python mailbox module


Problem description

I need to migrate an email database to a CRM and have two problems:

I can access the mbox file, but the content is not properly decoded.

I want to create a dataframe-like structure with the following columns: date, from, to, subject, body.

I tried the following:

for i, message in enumerate(mbox):
    print("from   :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
        content = (part.get_payload(decode=True) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")

    if i == 10:
        break

and got the following output:

from   : =?UTF-8?Q?Gonzalo_Gasset_Yba=C3=B1ez?= <gonzalo.gasset@baud.es>
subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=
content: <generator object <genexpr> at 0x7fe025f3a350>
**************************************
from   : Mailtrack Reminder <reminders@mailtrack.io>
subject: Re: Presupuesto de Logotipo y =?utf-8?Q?Dise=C3=B1o?= Corporativo
 para nuevo proyecto
content: b'<!DOCTYPE html>\r\n<html>\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width">\r\n    <title>Reminder</title>\r\n</head>\r\n<style media="screen">\r\n    body {\r\n        font-family: Helvetica;\r\n    }\r\n</style>\r\n<body style="background-color: #f6f6f6; -webkit-font-smoothing: antialiased; font-size: 14px; line-height: 1.4; margin: 0; padding: 0; .....

Recommended answer

The concrete implementations of mailbox.Mailbox accept a factory argument that is used to build messages. By passing the parse method of a BytesParser initialised with the default policy (https://docs.python.org/3/library/email.policy.html#email.policy.default), we can generate EmailMessage objects, which decode headers and body text automatically.
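
To illustrate the difference the policy makes, the sketch below parses the same raw bytes twice: once without an explicit policy and once with the default policy. The message itself is made up for the demonstration (the sender address is a placeholder); only the Subject header is taken from the output shown in the question.

```python
from email import message_from_bytes
from email.parser import BytesParser
from email.policy import default

# A made-up message; the Subject header is the encoded one from the question.
raw = (
    b"From: sender@example.com\r\n"
    b"Subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=\r\n"
    b"\r\n"
    b"Hello\r\n"
)

# Without an explicit policy the legacy compat32 policy applies,
# so the header comes back still RFC 2047 encoded.
legacy = message_from_bytes(raw)
legacy_subject = legacy['subject']

# With the default policy the parser returns an EmailMessage
# whose headers are decoded on access.
modern = BytesParser(policy=default).parsebytes(raw)
modern_subject = str(modern['subject'])

print(legacy_subject)
print(modern_subject)
```

The first line printed is the encoded form seen in the question; the second is the readable subject.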

Selecting the actual body is trickier, and may depend on your particular requirements. In the code sample below, all parts whose main type is "text" are joined together, while non-text parts are rejected. You might wish to apply your own selection criteria.

from email.parser import BytesParser
from email.policy import default
import mailbox

# Placeholder: set this to the location of your mbox file.
path_to_mailbox = 'path/to/mailbox.mbox'

# The factory makes the mailbox yield EmailMessage objects, whose headers
# and bodies are decoded automatically under the default policy.
mbox = mailbox.mbox(path_to_mailbox, factory=BytesParser(policy=default).parse)

for message in mbox:
    print("date   :", message['date'])
    print("to     :", message['to'])
    print("from   :", message['from'])
    print("subject:", message['subject'])
    if message.is_multipart():
        contents = []
        for part in message.walk():
            # Keep only text parts; this also skips multipart containers.
            if part.get_content_maintype() != 'text':
                continue
            contents.append(part.get_content())
        content = '\n\n'.join(contents)
    else:
        content = message.get_content()
    print("content:", content)
    print("**************************************")
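
The question also asks for a dataframe-like structure with the columns date, from, to, subject and body. One possible way to get there, sketched under a few assumptions, is to collect one dict per message. The helper names extract_body and mbox_to_rows are hypothetical, and the body is chosen with EmailMessage.get_body (plain text preferred over HTML), which is a different selection rule from joining every text part. The demo writes a throwaway mbox with a single made-up message so that the sketch runs on its own.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage
from email.parser import BytesParser
from email.policy import default


def header_text(message, name):
    """Return a header as a plain str, or None if it is missing."""
    value = message[name]
    return str(value) if value is not None else None


def extract_body(message):
    """Return the preferred text body (plain, then HTML), or '' if none."""
    part = message.get_body(preferencelist=('plain', 'html'))
    return part.get_content() if part is not None else ''


def mbox_to_rows(path):
    """Read an mbox file into a list of dicts, one per message."""
    mbox = mailbox.mbox(path, factory=BytesParser(policy=default).parse)
    try:
        return [
            {
                'date': header_text(message, 'date'),
                'from': header_text(message, 'from'),
                'to': header_text(message, 'to'),
                'subject': header_text(message, 'subject'),
                'body': extract_body(message),
            }
            for message in mbox
        ]
    finally:
        mbox.close()


# Demo: build a temporary mbox holding one made-up message, then read it back.
demo = EmailMessage()
demo['From'] = 'Alice <alice@example.com>'
demo['To'] = 'bob@example.com'
demo['Date'] = 'Mon, 01 Jan 2024 10:00:00 +0000'
demo['Subject'] = 'Diseño corporativo'
demo.set_content('Hello from the demo message.\n')

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.mbox')
    writer = mailbox.mbox(demo_path)
    writer.add(demo)
    writer.close()  # close() flushes the message to disk
    rows = mbox_to_rows(demo_path)

for row in rows:
    print(row)
```

If pandas is installed, passing the resulting list to pandas.DataFrame(rows) should give a table with those five columns.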
