java - More efficient way to find the most common character in a string

I wrote a method that finds the most common character in a string:

public static char getMax(String s) {

char maxappearchar = ' ';
int counter = 0;
int[] charcnt = new int[Character.MAX_VALUE + 1];


for (int i = 0 ; i < s.length() ; i++)
{
    char ch = s.charAt(i);
    // increment this character's cnt and compare it to our max.
    charcnt[ch]++ ;
    if (charcnt[ch] >= counter)
    {
        counter = charcnt[ch];
        maxappearchar = ch;
    }
}
System.out.println("the max char is   " +maxappearchar + "  and displayed  " +counter+ "  times");
return maxappearchar;
}

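I'm asking about two different solutions:

Solution 1 - the fastest code (is my attached code the fastest?)

Solution 2 - the most memory-efficient code, using fewer arrays and variables

I also wrote a version of my method using a HashMap - is that better suited to solution 2? If so, why? What are the pros and cons?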

Can the attached code be described in big-O terms (o^, o logn ...)? If so, why?

Best answer

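The fastest way to do this is to count the occurrences of every character, then take the maximum value in the count array. If your string is long, you'll get a decent speedup from not tracking the current maximum while looping over the characters in the String.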
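The count-then-scan idea described above can be sketched as follows. This is only an illustration, not a benchmarked implementation: the class name `CountThenScan` is made up for the example, and on ties it returns the character with the lowest code unit value, which differs from the question's code (that one returns whichever character reached the maximum last).

```java
public class CountThenScan {
    // Pass 1 only counts; pass 2 scans the count array once for the maximum.
    public static char getMax(String s) {
        int[] counts = new int[Character.MAX_VALUE + 1];
        for (int i = 0; i < s.length(); i++) {
            counts[s.charAt(i)]++;   // no max tracking inside the hot loop
        }

        char best = ' ';             // same default as the question's code for an empty string
        int bestCount = 0;
        for (int c = 0; c < counts.length; c++) {
            if (counts[c] > bestCount) {   // strict '>' keeps the lowest char on ties
                bestCount = counts[c];
                best = (char) c;
            }
        }
        return best;
    }
}
```

The counting loop does the same work per character regardless of the data, and the final scan always visits all 65536 entries, which only pays off when the string is long.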

See How to count frequency of characters in a string? for many other ideas about how to count frequencies.

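If your Strings are mostly ASCII, it should be worth adding a branch in the count loop that chooses between an array for the low 128 char values and a HashMap for the rest. The branch will predict well if your strings contain no non-ASCII characters. If there is a lot of alternation between ASCII and non-ASCII, the branch might hurt a bit compared to using a HashMap for everything.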

public static char getMax(String s) {

    char maxappearchar = ' ';
    int counter = 0;
    int[] ascii_count = new int[128];  // fast path for ASCII
    HashMap<Character,Integer> nonascii_count = new HashMap<Character,Integer>();

    for (int i = 0 ; i < s.length() ; i++)
    {
        char ch = s.charAt(i);  // This does appear to be the recommended way to iterate over a String
        // alternatively, iterate over 32bit Unicode codepoints, not UTF-16 chars, if that matters.
        if (ch < 128) {
            ascii_count[ch]++;
        } else {
            // some code to set or increment the nonascii_count[ch];
        }
    }

    // loop over ascii_count and find the highest element
    // loop over the keys in nonascii_count, and see if any of them are even higher.
    return maxappearchar;
}

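I'm not fleshing out the code because I don't write much Java, so I don't know whether there is a container that can do the insert-1-or-increment operation more efficiently than a HashMap get and put pair. https://stackoverflow.com/a/6712620/224132 suggests Guava's Multiset<Character>, which looks good.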
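One way the skeleton above could be filled in is sketched below. It is not the answer author's code. It assumes Java 8 or later, where `Map.merge` does the insert-1-or-increment step in a single call. The class name `AsciiFastPath` and the variable names are invented for the example, and ties are broken arbitrarily for the non-ASCII part because HashMap iteration order is unspecified.

```java
import java.util.HashMap;
import java.util.Map;

public class AsciiFastPath {
    public static char getMax(String s) {
        int[] asciiCount = new int[128];                          // fast path for ASCII
        Map<Character, Integer> nonAsciiCount = new HashMap<>();  // everything else

        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 128) {
                asciiCount[ch]++;
            } else {
                // Stores 1 if the key is absent, otherwise adds 1 to the stored count.
                nonAsciiCount.merge(ch, 1, Integer::sum);
            }
        }

        char best = ' ';      // returned unchanged for an empty string
        int bestCount = 0;

        // Scan the small ASCII table first.
        for (int c = 0; c < asciiCount.length; c++) {
            if (asciiCount[c] > bestCount) {
                bestCount = asciiCount[c];
                best = (char) c;
            }
        }
        // Then check whether any non-ASCII char occurred more often.
        for (Map.Entry<Character, Integer> e : nonAsciiCount.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Each `merge` call still boxes the `char` and the count, so this mainly helps when non-ASCII characters are rare. Whether it beats a plain 65536-entry array would have to be measured.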

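This may do better than your array of 2^16 ints. However, if you only ever touch the low 128 elements of that array, most of its memory is never touched. Allocated-but-untouched memory doesn't really hurt, and it doesn't use up RAM or swap space.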

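However, looping over all 65536 entries at the end means at least reading them, so the OS has to soft-page-fault that memory in and wire it up. It will also pollute the caches. So updating the max on every character might actually be the better choice. Microbenchmarks might show that iterating over the String and then looping over charcnt[Character.MAX_VALUE] wins, but they wouldn't account for the cache / TLB pollution caused by touching that much not-really-needed memory.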