Question
I need to test whether a string is Unicode, and then whether it is UTF-8. After that, I need the string's length in bytes, including the BOM if one is used. How can this be done in Python?
Also, for didactic purposes, what does a byte-list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.
Later edit: pprint does that pretty well.
Recommended answer
try:
    string.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"
In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So, for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding it into UTF-8 gives the byte sequence "\xc3\xa9".
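The round trip described above can be sketched as follows. This is only an illustration, not part of the original answer; the u"" and b"" prefixes are used so that the literals should mean the same thing on Python 2.7 and on Python 3.3 or later (that portability is an assumption of this sketch), and the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# Round trip between a character string and its UTF-8 byte sequence.
text = u"\xe9"                      # one character, U+00E9 (e with acute accent)
encoded = text.encode("utf-8")      # character string -> byte sequence
decoded = encoded.decode("utf-8")   # byte sequence -> character string

print(len(text))                    # length in characters
print(len(encoded))                 # length in bytes
print(encoded == b"\xc3\xa9")       # the two bytes named in the text
print(decoded == text)              # decoding restores the original
```

The character count and the byte count differ because U+00E9 needs two bytes in UTF-8.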
In Python 3, this has changed: bytes is a sequence of bytes and str is a sequence of characters.
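For Python 3, the same check might look like the sketch below. It is not from the original answer: describe_utf8 is a hypothetical helper name, and the sketch assumes the input is already a bytes object (a Python 3 str is always Unicode, so only bytes can meaningfully be tested for UTF-8 validity). It also shows the byte length with a BOM and the byte-list view the question asks about.

```python
import codecs

def describe_utf8(data):
    """Hypothetical helper: return (is_utf8, byte_length, has_bom) for bytes."""
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    has_bom = data.startswith(codecs.BOM_UTF8)
    # len() on bytes counts bytes, so a leading BOM is included.
    return is_utf8, len(data), has_bom

plain = "\xe9".encode("utf-8")          # 2 bytes, no BOM
with_bom = "\xe9".encode("utf-8-sig")   # the utf-8-sig codec prepends the BOM

print(describe_utf8(plain))         # (True, 2, False)
print(describe_utf8(with_bom))      # (True, 5, True)
print(describe_utf8(b"\xff\xfe"))   # (False, 2, False)
print(list(with_bom))               # [239, 187, 191, 195, 169]
```

Note that a successful decode only shows the bytes are valid UTF-8; it cannot prove they were intended as UTF-8, since plain ASCII data, for instance, also decodes cleanly.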