How to split an HTML page into multiple pages using Python and Beautiful Soup


Question

I have a simple HTML file like the one below. I pulled it from a wiki page, removed some HTML attributes, and reduced it to this simple page.

<html>
   <body>
      <h1>draw electronics schematics</h1>
      <h2>first header</h2>
      <p>
         <!-- ..some text images -->
      </p>
      <h3>some header</h3>
      <p>
         <!-- ..some image -->
      </p>
      <p>
         <!-- ..some text -->
      </p>
      <h2>second header</h2>
      <p>
         <!-- ..again some text and images -->
      </p>
   </body>
</html>

I read this HTML file with Python and Beautiful Soup like this:

from bs4 import BeautifulSoup

# Name the parser explicitly and close the file when done
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

pages = []

What I'd like to do is split this HTML page into two parts. The first part will be between the first <h2> header and the second <h2> header. The second part will be between the second <h2> header and the closing </body> tag. Then I'd like to store the parts in a list, e.g. pages, so that I can create multiple pages from one HTML page according to its <h2> tags.

Any ideas on how I should do this? Thanks.

Recommended answer

Look for the h2 tags, then use .next_sibling to grab every following element until you reach another h2 tag:

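from bs4 import BeautifulSoup, Tag

with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")
pages = []
h2tags = soup.find_all('h2')

def next_element(elem):
    """Return the next sibling that is a tag, or None if there is none."""
    while elem is not None:
        # Step to the next sibling; skip NavigableString objects (text, comments)
        elem = elem.next_sibling
        if isinstance(elem, Tag):
            return elem
    return None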

for h2tag in h2tags:
    page = [str(h2tag)]
    elem = next_element(h2tag)
    while elem and elem.name != 'h2':
        page.append(str(elem))
        elem = next_element(elem)
    pages.append('\n'.join(page))

Using your sample, this gives:

>>> pages
['<h2>first header</h2>\n<p>\n<!-- ..some text images -->\n</p>\n<h3>some header</h3>\n<p>\n<!-- ..some image -->\n</p>\n<p>\n<!-- ..some text -->\n</p>', '<h2>second header</h2>\n<p>\n<!-- ..again some text and images -->\n</p>']
>>> print(pages[0])
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
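
As a variation on the loop above, here is a minimal sketch that wraps the same idea in a function taking an HTML string, which makes it easier to try out without a file on disk. The names split_on_h2 and sample are hypothetical and not part of the original answer. It iterates over .next_siblings instead of calling .next_sibling repeatedly, and it assumes every h2 sits at the same nesting level as the content that belongs to it.

```python
from bs4 import BeautifulSoup, Tag


def split_on_h2(html):
    """Return one HTML fragment per <h2>, each running up to the next <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    pages = []
    for h2 in soup.find_all("h2"):
        parts = [str(h2)]
        for sibling in h2.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip text nodes and comments between tags
            if sibling.name == "h2":
                break  # the next section starts here
            parts.append(str(sibling))
        pages.append("\n".join(parts))
    return pages


# Hypothetical, compact version of the sample page from the question
sample = (
    "<html><body><h1>draw electronics schematics</h1>"
    "<h2>first header</h2><p>a</p><h3>some header</h3><p>b</p>"
    "<h2>second header</h2><p>c</p></body></html>"
)
pages = split_on_h2(sample)
```

Each entry in pages is a fragment, not a full document. If you want standalone pages you would still need to wrap each fragment in its own html and body tags before writing it out.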
