本文介绍了在一般的树遍历中找到最深的路径试图找到最大的公共子串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试解决2个字符串之间最大公共子字符串的问题。我将把我的问题减少到以下几点:我创建了一个,并根据我的理解最大的公共子串是由属于两个字符串的节点组成的最深路径。

I am trying to solve the problem of largest common substring between 2 Strings. I will reduce my problem to the following: I created a general suffix tree and as per my understanding the largest common substring is the deepest path consisting of nodes that belongs to both strings.

我的测试输入是:

String1 = xabc
String2 = abc

看来我构建的树是正确的,但我的问题是以下方法(我传递了根最初的树):

It seems that the tree I build is correct but my problem is the following method (I pass the root of the tree initially):

private void getCommonSubstring(SuffixNode node) {
   if(node == null)
    return;
   if(node.from == ComesFrom.Both){
    current.add(node);
   }
   else{
    if(max == null || current.size() > max.size()){
        max = current;
    }
    current = new ArrayList<SuffixNode>();
   }
   for(SuffixNode n:node.children){
    getCommonSubstring(n);
   }
}

我的目标是,为了找到属于两个字符串的节点的最深路径,我将遍历树(预订)并添加属于列表中两个字符串的节点(当前)。一旦我进入不属于两者的节点,我更新 max 列表,如果当前更大。

What I was aiming to do is, in order to find the deepest path with nodes that belong to both strings, I would traverse the tree (pre-order) and add nodes that belong to both strings in a list (current). Once I am in a node that is not part of both I update max list if current is bigger.

但代码是错误的。我对如何实现这一点很困惑,因为我没有为一般(非二进制)树编写代码。

But the code is erroneous. And I am confused on how to implement this, since I haven't written code for general (non-binary) trees in ages.

你能帮我搞清楚吗? ?

Could you help me figure this out?

更新:

根据@templatetypedef进行修改。无法做到这一点。

Update:
Modified as per @templatetypedef. Could not make this work either.

private void getCommonSubstring(SuffixNode node, List<SuffixNode> nodes) {
   if(node == null)
    return;
   if(node.from == ComesFrom.Both){
    nodes.add(node);
   }
   else{
       if(max == null || current.size() > max.size()){
       max = nodes;
    }
    nodes = new ArrayList<SuffixNode>();
   }
   for(SuffixNode n:node.children){
    List<SuffixNode> tmp = new ArrayList<SuffixNode>(nodes);
    getCommonSubstring(n, tmp);
   }
}


public class SuffixNode {
    Character character;
    Collection<SuffixNode> children;
    ComesFrom from;
    Character endMarker;
}


推荐答案

我在这里看到的一个问题是后缀树中节点的 depth 与沿该路径的字符串的 length 不同。后缀树中的每个边都用一系列字符注释,因此由一系列深度为5的节点编码的字符串可能比在深度为二的编码字符串具有更短的长度。您可能需要调整算法以通过跟踪到目前为止构建的字符串的有效长度来处理此问题,而不是到目前为止您已追踪的路径中的节点数。

One issue I see here is that the depth of a node in the suffix tree is not the same as the length of the string along that path. Each edge in a suffix tree is annotated with a range of characters, so a string encoded by a series of nodes of depth five might have shorter length than a string encoded at depth two. You will probably need to adjust your algorithm to handle this by tracking the effective length of the string that you've built up so far, rather than the number of nodes in the path that you've traced out up to this point.

我刚才注意到的第二个问题是你似乎只有一个当前的变量,它在所有不同的地方共享递归的分支。这可能会在递归调用中搞乱你的状态。例如,假设您在一个节点并且路径长度为3,并且有两个子节点 - 第一个只以第一个字符串的后缀结尾,第二个以后缀结尾两个字符串。在这种情况下,如果你对第一个字符串进行递归调用,你将最终执行该行

A second issue I just noticed is that you seem to only have one current variable that is getting shared across all the different branches of the recursion. This probably is messing up your state across recursive calls. For example, suppose that you're at a node and have a path of length three, and that there are two children - the first of which only ends in a suffix of the first string, and the second of which ends in a suffix of both strings. In that case, if you make the recursive call on the first string, you will end up executing the line

current = new ArrayList<SuffixNode>();

。这将清除所有历史记录,因此当您从此回调返回到原始节点并进入第二个子节点时,您将表现为目前没有构建的节点列表,而不是继续到目前为止你找到的三个节点。

in the recursive call. This will then clear all your history, so when you return from this call back up to the original node and descend into the second child node, you will act as if there is no list of nodes built up so far, instead of continuing on from the three nodes you found so far.

为了解决这个问题,我建议将当前一个参数添加到函数,然后在需要时创建一个新的 ArrayList ,而不是消灭现有的 ArrayList 。其他一些逻辑也可能需要改变以解决这个问题。

To fix this, I'd suggest making current a parameter to the function and then creating a new ArrayList when needed, rather than wiping out the existing ArrayList. Some of the other logic might have to change as well to account for this.

鉴于此,我建议像伪代码那样编写函数(因为我不这样做)知道后缀树实现的细节):

Given this, I would suggest writing the function in pseudocode like this (since I don't know the particulars of your suffix tree implementations):


  • 如果当前节点为null,则返回0.

  • 如果当前节点不是来自两个字符串,则返回0.

  • 否则:

    • 设置maxLen = 0。

    • 对于每个子节点:$ b​​ $ b

      • 计算以该节点为根的最长公共子字符串的长度。

      • 将该边缘的字符数添加到该长度。

      • 如果超过旧值,则更新maxLen。

      • If the current node is null, return 0.
      • If the current node doesn't come from both strings, return 0.
      • Otherwise:
        • Set maxLen = 0.
        • For each child node:
          • Compute the length of the longest common substring rooted at that node.
          • Add to that length the number of characters along the edge to that child.
          • Update maxLen if this exceeds the old value.

          希望这有帮助!

          这篇关于在一般的树遍历中找到最深的路径试图找到最大的公共子串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-14 18:44