最长最大重复子串

本文介绍了最长最大重复子串的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

子字符串的长度可以为1,2,3 ...
我要解决的问题涉及查找出现次数最多的子字符串。因此，它基本上可以分解为找到具有最大频率的字符。
但是，我发现可以使用O（n）中的后缀树找到最长的重复子字符串。
但是，后缀树返回将长度作为优先级的子字符串。
我想找到出现次数最多的子字符串，在这些子字符串中，我想找到最长的子字符串。
例如：

A substring can be of length 1,2,3...The question that I was trying to solve involved finding the substring that occurred the maximum number of times. So it basically broke down to finding the character having the maximum frequency.However, I found out that I can find the longest repeating substring using suffix tree in O(n).But, suffix tree returns the substring keeping the length as a priority.I wanted to find the substring which occurs the most number of times, and out of those substrings I want to find the longest one.For eg:

In the following string: ABCZLMNABCZLMNABC
A suffix tree will return ABCZLMN as the longest repeating substring.
However, what I am looking for is ABC;  as it is the longest out of all the ones having frequency = 3.

我尝试通过以下方法解决此问题：在两个索引i和j之间生成子字符串。之后，使用在O（n）中运行的Z算法在每种情况下查找这些子字符串的出现。但是总复杂度为O（n ^ 3）

I tried solving this problem by generating substring between two indices i and j. After that finding the occurrences of these substrings in each case using Z algorithm running in O(n). However the total complexity was O(n^3)

我的O（n ^ 3）代码

My O(n^3) code

map<ll,vector<string>> m;
    string s; cin >> s;
    for(ll i=0;i<s.length();i++){
        string c;
        for(ll len=0; i+len<s.length();len++){
            c+=s[i+len];
            ll z[N];
            ll l=0,r=0;
            string kk;
            for(ll p=0;p<c.length();p++){
                kk+=c[p];
            }
            kk+="#";
            for(ll p=0;p<s.length();p++){
                kk+=s[p];
            }
            for(ll k=1;k<kk.length();k++){
                if(k>r){
                    l=r=k;
                    while(r<c.length()&&kk[r-l]==kk[r])r++;
                    z[k]=r-l;
                    r--;
                }
                else{
                    ll m=k-l;
                    if(z[m]<r-k+l)z[k]=z[m];
                    else{
                        l=k;
                        while(r<c.length()&&kk[r-l]==kk[r])r++;
                        z[k]=r-l;
                        r--;
                    }
                }
            }
            ll occ=0;
            for(ll n=0;n<kk.length();n++){
                if(z[n]==c.length())occ++;
            }
            m[occ].push_back(c);
        }
    }

我找不到合适的解决方案它高效。
请帮助。
谢谢。

I am not able to find a suitable solution to make it efficient.Kindly help.Thank you.

推荐答案

单个字符被视为子字符串，因此，最大重复子字符串必须与频率等于字符串中最常见的字符。

A single character counts as a substring, so therefore the maximum repeating substring must occur with a frequency equal to the most common character in the string.

一个含义是最大重复子字符串中的每个字符在字符串中只能出现一次，因为如果如果出现多次，则该字符本身将成为最大重复字符串。例如，子字符串 dad在字符串 dadxdadydadzdadydad中出现5次，而子字符串 d发生10次。

One implication of that is that each character in the maximum repeating substring can only occur once in the string, because if it occurred more than once then that character on it's own would become the maximum repeating string. For example the substring "dad" occurs 5 times in the string "dadxdadydadzdadydad", but the substring "d" occurs 10 times.

它们也必须出现在同一位置每次排序（否则单个字符的频率将高于子字符串，并且本身就是最大重复子字符串）。它们也不能单独出现在子字符串中（否则它们将再次成为最大重复子字符串）。

They also have to appear in the same order each time (or else the individual characters would have a higher frequency than the substring and be the maximum repeating substring themselves). They also can't appear separately to the substring (or else yet again they would become the maximum repeating substring).

因此，最大重复子字符串必须由

Therefore, the maximum repeating substring must be made up of a subset (or all) of the equally most frequently occurring characters.

我们可以轻松地找出这些字符是哪一个字符，只需在字符串中进行一次遍历并计数即可。我们还可以通过跟踪每个字符之前和之后出现的字符来推断哪个字符出现的顺序，如果每次相同则存储该字符，否则存储零。例如，在字符串 abcxabcyabczabcyabc中，字符 b始终以 a开头，后跟 c：

We can easily figure out which characters these are just by making one pass through the string and counting them. We can also deduce which characters appear in which order, by keeping track of which characters appear before and after each character, storing the character if it is the same every time, and zero otherwise. For example, in the string "abcxabcyabczabcyabc", the character "b" is always preceded by "a" and followed by "c":

string s; cin >> s;
int i, freq[256];
char prev[256], next[256];
for(i = 1; i < 256; i++)
    freq[i] = prev[i] = next[i] = 0;
int maxFreq = 0;
for(i = 0; i < s.length(); i++)
{
    char c = s[i];
    char p = (i == 0) ? 0 : s[i-1];
    char n = (i < s.length() - 1) ? s[i+1] : 0;
    if(freq[c] == 0) // first time to encounter this character
    {
        prev[c] = p;
        next[c] = n;
    }
    else // check if it is always preceded and followed by the same characters:
    {
        if(prev[c] != p)
            prev[c] = 0;
        if(next[c] != n)
            next[c] = 0;
    }
    // increment frequency and track the maximum:
    if(++freq[c] > maxFreq)
        maxFreq = freq[c];
}

if(maxFreq == 0)
    return 0;

然后，我们可以遍历每个字符以及频率等于最大频率的字符，通过遵循 next 字符索引，找到我们可以以此字符开头的字符串的长度：

Then, we can iterate over each character and of the ones that have a frequency equal to the maximum frequency, find the length of string we can form starting with this character by following the next character indices:

int maxLen = 0;
int startingChar = 0;
for(i = 1; i < 256; i++)
{
    // should have a frequency equal to the max and not be preceded
    // by the same character each time (or it is in the middle of the string)
    if((freq[i] == maxFreq) && (prev[i] == 0))
    {
        int len = 1, j = i;
        while(next[j] != 0)
        {
            len++;
            j = next[j];
        }
        if(len > maxLen)
        {
            maxLen = len;
            startingChar = i;
        }
    }
}

一旦找到最大重复子字符串，将其打印出来：

Once we've found the maximum repeating substring, print it out:

// print out the maximum length string:
int j = startingChar;
while(j != 0)
{
    cout << (char)j;
    j = next[j];
}
cout << endl;

如果您不喜欢迭代这些固定大小的数组或需要支持UNICODE字符等，则可以使用 map 从字符类型到包含字符频率以及上一个和下一个字符的结构。

If you don't like iterating over those fixed size arrays or need to support UNICODE characters etc you can use a map from the character type to a struct containing the character's frequency and prev and next characters.

这篇关于最长最大重复子串的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！