Longest common substring in R: finding non-contiguous matches between two strings

Question

I have a question regarding finding the longest common substring in R. While searching through a few posts on StackOverflow, I got to know about the qualV package. However, I see that the LCS function in this package actually finds all characters from string1 which are present in string2, even if they are not contiguous.

To explain, if the strings are

string1: "hello"
string2: "hel12345lo"

I expect the output to be hel; however, I get hello. I must be doing something wrong. Please see my code below.

library(qualV)

a <- "hello"
b <- "hel123l5678o"

# Split each string into single characters, run LCS(), and paste the result
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

I have also tried the Rlibstree method, but I still get substrings that are not contiguous. The length of the result is also not what I expect. Please see below.

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21

Recommended answer

Here are three possible solutions.

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all prefixes of 'b' (substrings that start at position 1)
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"
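
The snippet above only tries substrings of b that start at its first character, so it would miss a common substring that begins later in b. As a rough sketch of a more general approach, here is a dynamic-programming version in base R. The function name longest_common_substring is made up for this example and does not come from any package; when several substrings tie for the longest, it returns the first one found in a.

```r
# Hypothetical helper (not from any package): longest common *contiguous*
# substring of two strings, using dynamic programming in base R.
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 || length(y) == 0) return("")
  # len[i + 1, j + 1] = length of the common run ending at x[i] and y[j]
  len <- matrix(0L, length(x) + 1, length(y) + 1)
  best <- 0L  # length of the longest run seen so far
  end <- 0L   # position in 'a' where that run ends
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        len[i + 1, j + 1] <- len[i, j] + 1L
        if (len[i + 1, j + 1] > best) {
          best <- len[i + 1, j + 1]
          end <- i
        }
      }
    }
  }
  if (best == 0L) "" else substr(a, end - best + 1L, end)
}

longest_common_substring("hello", "hel123l5678o")
# [1] "hel"
longest_common_substring("hello", "h1e2l3l4o5")
# [1] "h"
```

This takes time proportional to nchar(a) * nchar(b), which should be fine for short strings like the ones in the question.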

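There are also adist() and agrep() in base R, and the stringdist package has a few functions that run the LCS method. Here's a look at stringdist(). With method="lcs" it returns the number of unpaired characters, that is, the characters in either string that are not part of the longest common subsequence.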

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0
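
To see where the 7 comes from, it may help to compute the longest common subsequence length by hand. The sketch below uses only base R; lcs_length is a hypothetical helper written for this illustration, not a stringdist function. Each matched pair uses up one character from each string, so the unpaired count is the total number of characters minus twice the subsequence length.

```r
# Hypothetical helper: length of the longest common subsequence
# (characters in the same order, gaps allowed), by dynamic programming.
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  tab <- matrix(0L, length(x) + 1, length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1L                       # extend a match
      } else {
        max(tab[i, j + 1], tab[i + 1, j])    # skip a character in x or y
      }
    }
  }
  tab[length(x) + 1, length(y) + 1]
}

a <- "hello"
b <- "hel123l5678o"
lcs_length(a, b)
# [1] 5

# unpaired characters = total characters minus two per matched pair
nchar(a) + nchar(b) - 2L * lcs_length(a, b)
# [1] 7
```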

Now that I've explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE, we get a sequence of Matches, Insertions, etc. If we pass that sequence to stri_locate_all_regex(), we can use the resulting matrix to pull the matched pieces out of b.

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos"))
ta
# [1] "MMMIIIMIIIIM"

So the M values denote straight-across matches. We can go and get the substrings with stri_sub().

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o"
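
If only the longest contiguous piece is wanted, one option is to keep the longest run of M. Below is a minimal base R sketch; longest_match_run is a hypothetical name, and the transformation string is pasted in from the output above rather than recomputed. It assumes the string contains no D (deletion) codes, because only then does position k in the string line up with character k of b.

```r
# Hypothetical helper: find the longest run of "M" in a transformation
# string and return the matching piece of 'b'.
# Assumption: 'trafos' has no "D" codes, so position k in 'trafos'
# corresponds to character k of 'b'.
longest_match_run <- function(b, trafos) {
  runs <- gregexpr("M+", trafos)[[1]]
  if (runs[1] == -1L) return("")        # no matches at all
  len <- attr(runs, "match.length")
  k <- which.max(len)                   # first of the longest runs
  substr(b, runs[k], runs[k] + len[k] - 1L)
}

longest_match_run("hel123l5678o", "MMMIIIMIIIIM")
# [1] "hel"
```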

Sorry, I haven't explained that very well, as I'm not well versed in string distance algorithms.