Converting CESU-8 to UTF-8 with high performance

Problem description

I have some raw text that is usually a valid UTF-8 string. Every now and then, however, the input turns out to be a CESU-8 string instead. It is technically possible to detect this and convert to UTF-8, but because it happens so rarely, I would rather not spend a lot of CPU time on it.

Is there a fast way to detect whether a string is encoded as CESU-8 or UTF-8? I guess I could always blindly convert the "UTF-8" input to UTF-16LE and then back to UTF-8 with iconv(), and I would probably get the correct result every time, because CESU-8 is close enough to UTF-8 for this to work. Can you suggest anything faster? (I expect the input to be CESU-8 rather than valid UTF-8 in around 0.01-0.1% of all strings.)

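(CESU-8 is a non-standard string format that contains 16-bit surrogate pairs encoded as UTF-8. Technically, a UTF-8 string should contain the characters those surrogate pairs represent, not the surrogate pairs themselves.)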
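
To make the difference concrete, here is a minimal sketch that encodes one supplementary code point the CESU-8 way (split it into a surrogate pair, then encode each surrogate as a three-byte sequence) and prints it next to the regular four-byte UTF-8 form. The helper name cesu8_from_codepoint is hypothetical and used for illustration only.

```php
<?php
// Hypothetical helper (illustration only): encode one supplementary
// code point (U+10000..U+10FFFF) as CESU-8.
function cesu8_from_codepoint(int $cp): string {
    if ($cp < 0x10000 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('expected a supplementary code point');
    }
    $v  = $cp - 0x10000;           // 20-bit value split across the pair
    $hi = 0xD800 | ($v >> 10);     // high surrogate carries the top 10 bits
    $lo = 0xDC00 | ($v & 0x3FF);   // low surrogate carries the bottom 10 bits
    $out = '';
    foreach ([$hi, $lo] as $u) {
        // Each surrogate is written as an ordinary three-byte sequence.
        $out .= chr(0xE0 | ($u >> 12))
              . chr(0x80 | (($u >> 6) & 0x3F))
              . chr(0x80 | ($u & 0x3F));
    }
    return $out;
}

$utf8  = "\xF0\x9F\x98\x80";             // U+1F600 as valid UTF-8
$cesu8 = cesu8_from_codepoint(0x1F600);  // same character as CESU-8
echo bin2hex($utf8), "\n";   // f09f9880
echo bin2hex($cesu8), "\n";  // eda0bdedb880
```

Note that every encoded surrogate starts with the byte ED followed by a byte in A0-BF. Valid UTF-8 only ever follows ED with a byte in 80-9F, so that byte pair is what distinguishes the two formats.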

Recommended answer

Here's a more efficient version of your conversion function:

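// Match a high surrogate (ED A0-AF xx) immediately followed by a low surrogate (ED B0-BF xx).
$regex = '@(\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF])@';
$s = preg_replace_callback($regex, function($m) {
    $in = unpack("C*", $m[0]); // The six matched bytes as integers, indexed from 1.
    $in[2] += 1; // Effectively adds 0x10000 to the codepoint.
    return pack("C*",
        0xF0 | (($in[2] & 0x1C) >> 2),                          // 11110oaa
        0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2), // 10aabbbb
        0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),        // 10bbcccc
        $in[6]                                                  // 10dddddd, copied unchanged
    );
}, $s);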

The code converts only a high surrogate that is immediately followed by a low surrogate, and turns the two three-byte CESU-8 sequences directly into one four-byte UTF-8 sequence, i.e. from

ED       A0-AF    80-BF    ED       B0-BF    80-BF
11101101 1010aaaa 10bbbbbb 11101101 1011cccc 10dddddd

to

F0-F4    80-BF    80-BF    80-BF
11110oaa 10aabbbb 10bbcccc 10dddddd    // o is "overflow" bit
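
Since CESU-8 input is expected to be rare, one possible refinement is to skip the regex entirely when the string contains no ED byte at all. The sketch below wraps the same bit manipulation in a function with that cheap pre-check; the name cesu8_to_utf8 and the strpos fast path are assumptions added for illustration, and whether the pre-check actually saves time should be measured on real data.

```php
<?php
// Hypothetical wrapper (illustration only) around the conversion shown above.
function cesu8_to_utf8(string $s): string {
    // Fast path: every encoded surrogate starts with 0xED, so a string
    // without that byte cannot contain a CESU-8 surrogate pair.
    if (strpos($s, "\xED") === false) {
        return $s;
    }
    $regex = '@\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]@';
    $out = preg_replace_callback($regex, function ($m) {
        $in = unpack("C*", $m[0]);   // six bytes, indexed from 1
        $in[2] += 1;                 // adds 0x10000 to the code point
        return pack("C*",
            0xF0 | (($in[2] & 0x1C) >> 2),
            0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
            0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
            $in[6]
        );
    }, $s);
    // preg_replace_callback returns null on error; fall back to the input.
    return $out ?? $s;
}

// U+1F600 as CESU-8 (ED A0 BD ED B8 80) becomes F0 9F 98 80.
echo bin2hex(cesu8_to_utf8("a\xED\xA0\xBD\xED\xB8\x80b")), "\n"; // 61f09f988062
// Valid UTF-8 that happens to start with ED (U+D55C) is left untouched.
echo bin2hex(cesu8_to_utf8("\xED\x95\x9C")), "\n";               // ed959c
```

The second call shows why the ED pre-check is only a filter: Hangul syllables and other characters in U+D000..U+D7FF also begin with ED, so such strings still go through the regex but come out unchanged.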

