Parsing HTML on the command line; how to capture text in <strong></strong>?

Question

I'm trying to grab data from HTML output that looks like this:

<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....

I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:

grep "/strong" output.html | awk '{print $1}'

Grep on "/strong" to get the lines with the targets; that works fine.

Pipe to awk '{print $1}'. That works in case #1, when the target has no spaces, but fails in case #2, when the target has spaces: awk splits fields on whitespace, so only the first word is preserved, as below:

<strong>Target1NoSpaces</strong><span
<strong>Target2

Do you have any tips on hitting the target properly, either in my awk or in a different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.

Recommended answer

Use the look-behind and look-ahead features of Perl-compatible regular expressions in grep (the -P option of GNU grep; -o prints only the matched text). It should be simpler than using awk.

grep -oP "(?<=<strong>).*?(?=</strong>)" file

Output:

Target1NoSpaces
Target2 With Spaces
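
Since the question also asked about staying with awk, and since -P is specific to GNU grep, here is a minimal sketch of an awk variant. It assumes one <strong>...</strong> pair per line, as in the sample; the sample file and the helper name extract_strong are made up for illustration.

```shell
# Hypothetical sample file reproducing the two lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
EOF

# extract_strong is an illustrative helper name, not part of any tool.
# -F sets the field separator to the regex </?strong>, i.e. either the
# opening or the closing tag, so $2 is the text between the first pair
# of tags. The /<\/strong>/ pattern keeps only lines with a closing tag.
extract_strong() {
  awk -F'</?strong>' '/<\/strong>/ {print $2}' "$1"
}

extract_strong "$sample"
```

Because the field separator is the tag itself rather than whitespace, spaces inside the target no longer split it. A line with several <strong> pairs would only yield the first one with this sketch.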

Addendum:

This Ruby one-liner applies the same look-behind/look-ahead pattern with the /m flag, so that . also matches newlines and values spanning multiple lines are captured:

ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file

Input:

<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>

Output:

----------
Target
A
B
C
----------
Target D
----------
Target E
