Why is the print output of the same string different in Python 2 and Python 3?

Question

In Python 2:

$ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C
00000000  08 04 87 18 0a                                    |.....|
00000005

In Python 3:

$ python3 -c 'print("\x08\x04\x87\x18")' | hexdump -C
00000000  08 04 c2 87 18 0a                                 |......|
00000006

Why is there a byte "\xc2" here?

Edit:

I think that when the string contains a non-ASCII character, Python 3 adds the byte "\xc2" to the string (as @Ashraful Islam said).

So how can I avoid this in Python 3?

Recommended answer

Consider the following snippet of code:

import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))

Run this with Python 2 and look at the result with hexdump -C:

00000000  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|

And so on. No surprises: 128 bytes, from 0x80 to 0xff.

Do the same with Python 3:

00000000  c2 80 c2 81 c2 82 c2 83  c2 84 c2 85 c2 86 c2 87  |................|
...
00000070  c2 b8 c2 b9 c2 ba c2 bb  c2 bc c2 bd c2 be c2 bf  |................|
00000080  c3 80 c3 81 c3 82 c3 83  c3 84 c3 85 c3 86 c3 87  |................|
...
000000f0  c3 b8 c3 b9 c3 ba c3 bb  c3 bc c3 bd c3 be c3 bf  |................|

To summarize:

  • Everything from 0x80 to 0xbf has 0xc2 prepended.
  • Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.
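
The two rules above can be checked in Python 3 without hexdump. The following is a minimal sketch that is not part of the original answer; the helper name expected_utf8 is made up for illustration.

```python
def expected_utf8(code_point):
    """Predict the two UTF-8 bytes for a code point in 0x80..0xff,
    using only the two rules from the summary."""
    if 0x80 <= code_point <= 0xBF:
        # Rule 1: the byte value is kept and 0xc2 is prepended.
        return bytes([0xC2, code_point])
    if 0xC0 <= code_point <= 0xFF:
        # Rule 2: bit 6 (value 0x40) is cleared and 0xc3 is prepended.
        return bytes([0xC3, code_point & ~0x40])
    raise ValueError("only code points 0x80..0xff are covered")


# Compare the prediction with what Python 3 actually produces.
mismatches = [
    i for i in range(0x80, 0x100)
    if chr(i).encode("utf-8") != expected_utf8(i)
]
print(mismatches)                        # an empty list means both rules hold
print(chr(0x87).encode("utf-8").hex())   # the byte pair from the question
```

An empty list confirms that the pattern seen in the hexdump is exactly what the UTF-8 encoder produces for this range.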

So, what's going on here?

In Python 2, strings (the str type) are plain byte strings and no conversion is done. Tell it to write something outside the 0-127 ASCII range and it says "okey-doke!" and just writes those bytes. Simple.

In Python 3, strings are Unicode. When characters are written, they must be encoded in some way, and for non-ASCII characters the result depends on the encoding. The default encoding is usually UTF-8 (it depends on the environment).
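
To make the encoding step visible, here is a small Python 3 sketch (not from the original answer) that encodes the string from the question explicitly with two different codecs. The variable names are illustrative only.

```python
import sys

text = "\x08\x04\x87\x18"

# The encoding print() uses for standard output; it depends on the environment.
print(getattr(sys.stdout, "encoding", None))

# UTF-8 turns U+0087 into two bytes, which is where the c2 comes from.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex())

# Latin-1 maps each code point 0..255 to the single byte of the same value.
latin1_bytes = text.encode("latin-1")
print(latin1_bytes.hex())
```

The standard-output encoding can also be overridden from outside the program with the PYTHONIOENCODING environment variable. Running the one-liner from the question with PYTHONIOENCODING=latin-1 set should emit the single byte 87 instead of c2 87.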

So, how are these values encoded in UTF-8?

Code points from 0x80 to 0x7ff are encoded as follows:

110vvvvv 10vvvvvv

where the 11 v characters are the bits of the code point.

Thus:

0x80                 hex
1000 0000            8-bit binary
000 1000 0000        11-bit binary
00010 000000         divide into vvvvv vvvvvv
11000010 10000000    resulting UTF-8 octets in binary
0xc2 0x80            resulting UTF-8 octets in hex

0xc0                 hex
1100 0000            8-bit binary
000 1100 0000        11-bit binary
00011 000000         divide into vvvvv vvvvvv
11000011 10000000    resulting UTF-8 octets in binary
0xc3 0x80            resulting UTF-8 octets in hex

So that's why you're getting a c2 before 87.
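
The bit manipulation in the worked examples can be written out as a few lines of Python 3. This is an illustrative sketch, not the answer author's code; the function name encode_two_byte is hypothetical, and it covers only the two-byte form described here.

```python
def encode_two_byte(code_point):
    """Encode a code point in 0x80..0x7ff as 110vvvvv 10vvvvvv."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("the two-byte UTF-8 form covers 0x80..0x7ff only")
    high_five = code_point >> 6         # top 5 of the 11 bits
    low_six = code_point & 0b111111     # bottom 6 of the 11 bits
    return bytes([0b11000000 | high_five, 0b10000000 | low_six])


for cp in (0x80, 0xC0, 0x87):
    octets = encode_two_byte(cp)
    # Show the octets in binary and in hex, mirroring the worked examples.
    print(hex(cp), " ".join(format(b, "08b") for b in octets), octets.hex())

# Cross-check against Python's built-in encoder for the whole two-byte range.
assert all(
    encode_two_byte(cp) == chr(cp).encode("utf-8")
    for cp in range(0x80, 0x800)
)
```

For 0x87 the top five bits are 00010 and the bottom six are 000111, which gives 11000010 10000111, that is c2 87.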

How to avoid all this in Python 3? Use the bytes type.
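
As a rough illustration of that advice (a sketch, not the answer author's code), the bytes from the question can be written to a binary stream with no encoding step at all. The helper write_raw is a made-up name, and an in-memory io.BytesIO stands in for the real output stream.

```python
import io

payload = b"\x08\x04\x87\x18\n"   # a bytes literal: no encoding step is involved


def write_raw(stream, data):
    """Write bytes to a binary stream unchanged and return the count written."""
    return stream.write(data)


# Demonstrate on an in-memory binary stream.
capture = io.BytesIO()
written = write_raw(capture, payload)
print(written, capture.getvalue().hex())
```

For real output, the binary layer underneath print() is available as sys.stdout.buffer. The command python3 -c 'import sys; sys.stdout.buffer.write(b"\x08\x04\x87\x18\n")' | hexdump -C should show the same five bytes as the Python 2 run in the question.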

