Why am I likely to see Chinese characters when I treat a random UTF-8 web page as UTF-16?

Problem description

Out of curiosity, I chose UTF-16 in the Encoding menu of a random English webpage to see what happens (on Chrome: Tools -> Encoding -> Unicode (UTF-16LE)). What interested me is that almost all of the mojibake I see consists of Chinese characters (and some integral signs).

Are there any statistical reasons for seeing Chinese characters when switching from ASCII/UTF-8 English to UTF-16? Are the random non-Chinese special characters from HTML tags?

Solution

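Since the smallest unit in UTF-16 is two bytes long, "low" characters such as Latin letters are encoded with a NUL byte as one half of the pair: 00 xx in UTF-16BE, xx 00 in UTF-16LE. Since normal content does not typically contain NUL bytes, it's virtually impossible to hit Latin characters when interpreting random byte sequences as UTF-16. Most bytes of UTF-8 encoded English content are printable ASCII, somewhere in the lower middle of the byte range (0x20 to 0x7E), like say 46 6F. Read as a single 16-bit unit, such a pair lands roughly between U+2020 and U+7E7E, and that happens to be where the CJK blocks are situated in Unicode. Since the Chinese ideograph block (CJK Unified Ideographs, U+4E00 to U+9FFF) is ginormous, you're very likely to hit it. The occasional integral signs come from pairs whose high byte is 0x22 (the double quote, which is common in HTML attributes): those land in the Mathematical Operators block (U+2200 to U+22FF).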
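The effect can be reproduced outside the browser. The following is a minimal Python sketch, not part of the original answer: the helper names `reinterpret_as_utf16le` and `is_cjk_ideograph` and the sample HTML string are made up for illustration, and the CJK test only checks the CJK Unified Ideographs block and its Extension A.

```python
import unicodedata


def reinterpret_as_utf16le(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as UTF-16LE."""
    data = text.encode("utf-8")
    if len(data) % 2:
        data = data[:-1]  # UTF-16 needs byte pairs, so drop a trailing odd byte
    return data.decode("utf-16-le", errors="replace")


def is_cjk_ideograph(ch: str) -> bool:
    """True for CJK Unified Ideographs (U+4E00..U+9FFF) and Extension A (U+3400..U+4DBF)."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


# Hypothetical sample: a fragment of English HTML.
sample = '<a href="http://example.com/">For example, some plain English text.</a>'
garbled = reinterpret_as_utf16le(sample)

# Print code points and names only, so the output stays ASCII-safe.
for ch in garbled:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")

cjk_share = sum(is_cjk_ideograph(ch) for ch in garbled) / len(garbled)
print(f"CJK ideographs: {cjk_share:.0%} of {len(garbled)} characters")
```

In UTF-16LE the second byte of each pair is the high byte, so the pair 46 6F ("Fo") becomes U+6F46, while 2F 22 (a slash followed by a double quote) becomes U+222F, one of the integral signs. Which pairs form depends on whether the text starts at an even or odd byte offset, so the exact characters vary from page to page.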


11-03 11:14