Problem description
I have a question about finding the longest common substring in R. While searching through a few posts on StackOverflow, I came across the qualV package. However, the LCS function in this package actually finds all characters from string1 that are present in string2, even if they are not contiguous.
To explain, if the strings are string1: "hello" and string2: "hel12345lo", I expect the output to be hel, but I get hello instead. I must be doing something wrong. Please see my code below.
library(qualV)
a <- "hello"
b <- "hel123l5678o"
sapply(seq_along(a), function(i)
  paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
            substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
        collapse = ""))
I have also tried the Rlibstree method, but I still get substrings that are not contiguous. The length of the substring also differs from what I expect. Please see below.
> a = "hello"
> b = "h1e2l3l4o5"
> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"
> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
21
Recommended answer
Here are three possible solutions.
library(stringi)
library(stringdist)
a <- "hello"
b <- "hel123l5678o"
## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"
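Note that the snippet above only tries substrings of b that start at its first character, so it would miss a common substring located in the middle of b. As a minimal sketch of a more general approach in base R, the following uses a dynamic-programming table over both strings. The function name longest_common_substring is a hypothetical helper for illustration and is not part of qualV, stringi, or stringdist.

```r
# Hypothetical helper (not from any package):
# longest common *contiguous* substring of two single strings.
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 || length(y) == 0) return("")
  # len[i + 1, j + 1] = length of the common run ending at x[i] and y[j]
  len <- matrix(0L, length(x) + 1, length(y) + 1)
  best <- 0L  # length of the longest run seen so far
  end <- 0L   # index in x where that run ends
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        len[i + 1, j + 1] <- len[i, j] + 1L
        if (len[i + 1, j + 1] > best) {
          best <- len[i + 1, j + 1]
          end <- i
        }
      }
    }
  }
  if (best == 0L) return("")
  paste(x[(end - best + 1):end], collapse = "")
}

longest_common_substring("hello", "hel123l5678o")
# [1] "hel"
```

When several common substrings tie for the longest length, this sketch returns the one that occurs first in a.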
There are also adist() and agrep() in base R, and the stringdist package has a few functions that run the LCS method. Here's a look at stringdist(). It returns the number of unpaired characters.
stringdist(a, b, method="lcs")
# [1] 7
Filter("!", mapply(
stringdist,
stri_sub(b, 1, 1:nchar(b)),
stri_sub(a, 1, 1:nchar(b)),
MoreArgs = list(method = "lcs")
))
# h he hel
# 0 0 0
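To see where the value 7 comes from: with method="lcs", the distance counts the characters left unpaired once the longest common subsequence (characters in order, not necessarily adjacent) has been paired up, that is, nchar(a) + nchar(b) minus twice the subsequence length. A rough base R sketch of that arithmetic follows. The helper lcs_length is a hypothetical name introduced here for illustration and is not a stringdist function.

```r
# Hypothetical helper: length of the longest common subsequence
# (characters in order, not necessarily adjacent).
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  len <- matrix(0L, length(x) + 1, length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      len[i + 1, j + 1] <- if (x[i] == y[j]) {
        len[i, j] + 1L                      # extend the pairing
      } else {
        max(len[i, j + 1], len[i + 1, j])   # skip a character of x or y
      }
    }
  }
  len[length(x) + 1, length(y) + 1]
}

a <- "hello"
b <- "hel123l5678o"
paired <- lcs_length(a, b)          # all 5 characters of "hello" can be paired
nchar(a) + nchar(b) - 2 * paired    # characters left unpaired
# [1] 7
```

This is also why the distance alone cannot tell contiguous and non-contiguous matches apart: "hello" pairs completely with b even though the matching characters are scattered.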
Now that I've explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE, we get a sequence of matches, insertions, etc. If we give that sequence to stri_locate(), we can use the resulting matrix to pull the matches from a out of b.
ta <- drop(attr(adist(a, b, counts=TRUE), "trafos"))
# [1] "MMMIIIMIIIIM"
So the M values denote straight-across matches. We can go and get the substrings with stri_sub().
stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l" "o"
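Indexing b directly with positions in the transformation string works here because the alignment contains only matches and insertions. A deletion (D) consumes a character of a but none of b, so in general the positions would have to be shifted. The following is a hedged base R sketch that maps each operation back to its position in b and then keeps the longest run of matches. The variable names are illustrative, and adist() returns an edit-distance alignment, so the longest run is not guaranteed to be the longest common substring for every pair of inputs.

```r
a <- "hello"
b <- "hel123l5678o"

# Alignment string from adist(): M = match, I = insertion,
# D = deletion, S = substitution.
ta <- drop(attr(adist(a, b, counts = TRUE), "trafos"))
ops <- strsplit(ta, "")[[1]]

# Position in 'b' reached after each operation; "D" consumes no character of 'b'.
pos_b <- cumsum(ops != "D")

# Start and end (in alignment coordinates) of each run of identical operations
runs <- rle(ops)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
is_m <- runs$values == "M"

# Substrings of 'b' covered by runs of matches, then the longest of them
matched <- substring(b, pos_b[starts[is_m]], pos_b[ends[is_m]])
matched[which.max(nchar(matched))]
# [1] "hel"
```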
Sorry I haven't explained that very well, as I'm not well versed in string distance algorithms.