Problem description
In Python 2:
$ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C
00000000 08 04 87 18 0a |.....|
00000005
In Python 3:
$ python3 -c 'print("\x08\x04\x87\x18")' | hexdump -C
00000000 08 04 c2 87 18 0a |......|
00000006
Why is there a byte "\xc2" here?
Edit:
I think that when the string has a non-ASCII character, Python 3 adds the byte "\xc2" to the string (as @Ashraful Islam said).
So how can I avoid this in Python 3?
Recommended answer
Consider the following snippet:
import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))
Run this with Python 2 and look at the result with hexdump -C:
00000000 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
Et cetera. No surprises; 128 bytes from 0x80 to 0xff.
Do the same with Python 3:
00000000 c2 80 c2 81 c2 82 c2 83 c2 84 c2 85 c2 86 c2 87 |................|
...
00000070 c2 b8 c2 b9 c2 ba c2 bb c2 bc c2 bd c2 be c2 bf |................|
00000080 c3 80 c3 81 c3 82 c3 83 c3 84 c3 85 c3 86 c3 87 |................|
...
000000f0 c3 b8 c3 b9 c3 ba c3 bb c3 bc c3 bd c3 be c3 bf |................|
To summarize:
- Everything from 0x80 to 0xbf has 0xc2 prepended.
- Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.
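The two observations can be checked mechanically in Python 3. This is only a sketch: `summary_rule` is a hypothetical helper name, not something from the original answer, and it covers only the range 0x80 to 0xff discussed here.

```python
def summary_rule(i):
    """Predict the UTF-8 bytes for code point i (0x80..0xff) from the two rules above."""
    if i < 0xC0:
        # 0x80..0xbf: the byte is kept and 0xc2 is prepended.
        return bytes([0xC2, i])
    # 0xc0..0xff: bit 6 (0x40) is cleared and 0xc3 is prepended.
    return bytes([0xC3, i & ~0x40])


# Compare the prediction with what Python's own encoder produces.
for i in range(0x80, 0x100):
    assert chr(i).encode("utf-8") == summary_rule(i)
```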
So, what's going on here?
In Python 2, strings are plain byte strings and no conversion is done. Tell it to write something outside the 0-127 ASCII range, it says "okey-doke!" and just writes those bytes. Simple.
In Python 3, strings are Unicode. When non-ASCII characters are written, they must be encoded in some way. On most systems the default output encoding is UTF-8.
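As a small illustration, the string from the question can be encoded by hand in Python 3. The sketch passes "utf-8" explicitly so that it does not depend on the terminal or locale settings; the variable names are only for illustration.

```python
# The text that print() sends to stdout, including the trailing newline.
text = "\x08\x04\x87\x18\n"

# Encoding is the step that turns Unicode code points into bytes.
encoded = text.encode("utf-8")

print(encoded)                   # b'\x08\x04\xc2\x87\x18\n'
print(len(text), len(encoded))   # 5 6
```

Five code points become six bytes, which matches the hexdump in the question: only U+0087 is outside ASCII, so only it grows to two bytes.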
So, how are these values encoded in UTF-8?
Code points from 0x80 to 0x7ff are encoded as follows:
110vvvvv 10vvvvvv
Where the 11 v characters are the bits of the code point.
Thus:
0x80                 hex
1000 0000            8-bit binary
000 1000 0000        11-bit binary
00010 000000         divide into vvvvv vvvvvv
11000010 10000000    resulting UTF-8 octets in binary
0xc2 0x80            resulting UTF-8 octets in hex

0xc0                 hex
1100 0000            8-bit binary
000 1100 0000        11-bit binary
00011 000000         divide into vvvvv vvvvvv
11000011 10000000    resulting UTF-8 octets in binary
0xc3 0x80            resulting UTF-8 octets in hex
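The same bit manipulation can be written out in Python 3 and compared with the built-in encoder. This is a sketch for the two-byte range only; `encode_two_byte` is a hypothetical name chosen for illustration.

```python
def encode_two_byte(code_point):
    """Encode a code point in 0x80..0x7ff as two UTF-8 octets (110vvvvv 10vvvvvv)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points 0x80..0x7ff use the two-byte form")
    # Top 5 of the 11 bits go after the 110 prefix.
    lead = 0b11000000 | (code_point >> 6)
    # Low 6 bits go after the 10 prefix.
    trail = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, trail])


print(encode_two_byte(0x80).hex())  # c280
print(encode_two_byte(0xC0).hex())  # c380
print(encode_two_byte(0x87).hex())  # c287
```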
So that's why you're getting a c2 before 87.
How to avoid all this in Python 3? Use the bytes type.
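One way to do that, as a minimal sketch: write a bytes object to the binary layer underneath standard output, `sys.stdout.buffer`, instead of passing a str to print(). The helper name `write_raw` is hypothetical, and the Latin-1 line is an alternative that only works when every code point in the string is below 0x100.

```python
import sys

# A bytes literal: each \xNN escape is exactly one byte, no encoding step involved.
payload = b"\x08\x04\x87\x18\n"

# Alternative: Latin-1 maps code points 0x00..0xff one-to-one onto single bytes.
latin1_payload = "\x08\x04\x87\x18\n".encode("latin-1")


def write_raw(data, stream=None):
    """Write bytes unchanged to a binary stream (default: the buffer behind stdout)."""
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(data)
    stream.flush()


if __name__ == "__main__":
    write_raw(payload)
```

Saved under a file name of your choice and piped through hexdump -C, this should show 08 04 87 18 0a with no c2 in front of 87.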