How does unicodedata.normalize(form, unistr) work?

Question

The API documentation at http://docs.python.org/2/library/unicodedata.html#unicodedata.normalize says:

Return the normal form form for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

The documentation is rather vague. Can someone explain the valid values with some examples?

Recommended answer

I find the documentation pretty clear, but here are a few code examples:

from unicodedata import normalize

# The parenthesized print form runs on both Python 2 and Python 3.
print('%r' % normalize('NFD', u'\u00C7'))   # decompose: convert Ç to "C + ̧"
print('%r' % normalize('NFC', u'C\u0327'))  # compose: convert "C + ̧" to Ç

Both 'D' (=decompose) forms convert a single precomposed character (like ä) into two characters (a + a combining two-dots mark). Both 'C' (=compose) forms do the reverse.
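To make the decompose/compose round trip concrete, here is a small sketch written for Python 3 (the snippets in this answer target Python 2). The helper name `describe` is my own and not part of `unicodedata`; it just lists the Unicode name of each code point so the split is visible.

```python
from unicodedata import name, normalize

def describe(text):
    """Return the Unicode name of each code point in text."""
    return [name(ch) for ch in text]

composed = "\u00e4"                       # ä as one precomposed code point
decomposed = normalize("NFD", composed)   # a followed by a combining diaeresis

print(len(composed), describe(composed))
print(len(decomposed), describe(decomposed))

# NFC puts the pieces back together, so this round trip is lossless.
print(normalize("NFC", decomposed) == composed)
```

The decomposed string has length 2 even though it still renders as a single ä, which is why `len()` and slicing can behave differently depending on the form.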

The two "K" forms additionally convert characters that were added to Unicode for compatibility purposes. For example, to support software that cannot draw circles around symbols, there is a set of "circled numbers", like ① (code point U+2460). When we apply the canonical decomposition (NFD) to it, nothing happens:

print('%r' % normalize('NFD', u'\u2460'))     # unchanged: u'\u2460' on Python 2, '①' on Python 3

However, the compatibility decomposition (NFKD) returns the corresponding "compatible" character:

print('%r' % normalize('NFKD', u'\u2460'))    # the plain digit: u'1' on Python 2, '1' on Python 3
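As a side-by-side comparison, the sketch below (Python 3, with a sample string and variable names of my own choosing) runs all four forms over one string that contains both a precomposed letter and a compatibility character.

```python
from unicodedata import normalize

sample = "\u00c7\u2460"  # Ç followed by ①

# Apply each of the four valid forms to the same input.
results = {form: normalize(form, sample)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, result in results.items():
    # ascii() escapes non-ASCII code points so the differences are visible.
    print(form, len(result), ascii(result))
```

Only the two D forms split Ç into C plus a combining cedilla, and only the two K forms replace ① with the plain digit 1. The K forms therefore lose information: once ① has become 1, no form will turn it back.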

See http://en.wikipedia.org/wiki/Unicode_equivalence for more details.
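One practical consequence of Unicode equivalence is that two strings can look identical yet compare unequal. The following is a minimal Python 3 sketch; `same_text` is a hypothetical helper of my own, and NFC is just one reasonable default for the comparison.

```python
from unicodedata import normalize

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return normalize(form, a) == normalize(form, b)

precomposed = "caf\u00e9"    # é as a single code point
combining = "cafe\u0301"     # e followed by a combining acute accent

print(precomposed == combining)           # False: different code point sequences
print(same_text(precomposed, combining))  # True: canonically equivalent
```

Normalizing both sides to the same form before comparing is what matters; which of the two canonical forms is used makes no difference to the result.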
