Why am I likely to see Chinese characters when I treat a random UTF-8 webpage as UTF-16?

Question

Out of curiosity, I chose UTF-16 in the Encoding menu of a random English webpage to see what would happen (in Chrome: Tools -> Encoding -> Unicode (UTF-16LE)). What interested me is that almost all of the mojibake I saw was Chinese characters (plus some integral signs).

Are there any statistical reasons for seeing Chinese characters when switching from ASCII/UTF-8 English to UTF-16? Are the random non-Chinese special characters from HTML tags?

Answer

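Since the smallest unit in UTF-16 is two bytes long, most "low" characters like Latin are encoded with a NUL byte as one half of the pair: 00 xx in big-endian order, xx 00 in little-endian. Since normal content does not typically contain NUL bytes, it's virtually impossible to hit Latin characters when interpreting random byte sequences as UTF-16. Most bytes of UTF-8 encoded English content are printable ASCII, roughly 20 to 7E, so a pair like 46 6F decodes to a code point somewhere in the lower middle of the 16-bit range (U+6F46 as little-endian, U+466F as big-endian). That happens to be where many Asian scripts are situated in Unicode, and since the CJK ideographs occupy a ginormous block there (U+4E00 to U+9FFF, plus U+3400 to U+4DBF for Extension A), you're very likely to hit it.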
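To make this concrete, here is a minimal Python sketch that re-decodes ASCII text as UTF-16LE and measures how much of the result lands in the CJK ideograph blocks. The helper names `reinterpret_as_utf16le` and `is_cjk_ideograph` and the sample string are made up for illustration, and the check only covers the two blocks mentioned above. It also hints at where the integral signs may come from: a `"` (byte 22) in the high position yields U+22xx, which is the Mathematical Operators block.

```python
def reinterpret_as_utf16le(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as UTF-16LE."""
    data = text.encode("utf-8")
    if len(data) % 2:
        # UTF-16 consumes bytes in pairs, so drop a dangling last byte.
        data = data[:-1]
    return data.decode("utf-16-le", errors="replace")


def is_cjk_ideograph(ch: str) -> bool:
    """True for CJK Unified Ideographs and CJK Extension A only."""
    cp = ord(ch)
    return 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF


# Hypothetical sample of English HTML, standing in for a real page.
sample = "<p>Hello, world! This is plain English text.</p>"
garbled = reinterpret_as_utf16le(sample)
cjk_share = sum(is_cjk_ideograph(c) for c in garbled) / len(garbled)

print(garbled)
print(f"share of CJK ideographs: {cjk_share:.0%}")

# The byte pair 46 6F ("Fo") from the answer becomes U+6F46.
print(hex(ord(reinterpret_as_utf16le("Fo"))))

# A plus sign followed by a double quote (2B 22) becomes U+222B, an integral sign.
print(hex(ord(reinterpret_as_utf16le('+"'))))
```

The second byte of each pair becomes the high byte in little-endian order, so a pair only falls outside the CJK blocks when that byte is a space or low punctuation such as `!`, `"`, `.` or `/`.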