Problem description
The simple question again: given a std::string, determine which of its characters are digits, symbols, white space, etc. with respect to the user's language and regional settings (locale).
I managed to split the string into characters using the Boost.Locale boundary analysis tool:
std::string text = u8"生きるか死ぬか";
boost::locale::boundary::segment_index<std::string::const_iterator> characters(
boost::locale::boundary::character,
text.begin(), text.end(),
boost::locale::generator()("ja_JP.UTF-8"));
for (const auto& ch : characters) {
// each 'ch' is a single character in Japanese
}
However, I do not see any way to go on and determine whether ch is a digit, a symbol or anything else. There are Boost string classification algorithms, but these don't seem to work with whatever *segment_index::iterator is.
Nor can I use std::isalpha(std::locale), because I am not sure whether a Boost segment can be converted to char or wchar_t.
Is there any neat way to classify symbols?
Recommended answer
There are a number of functions and objects supporting this in <locale>, but... The example text you give looks like UTF-8, which is a multibyte encoding, and the functions in <locale> don't work with multibyte encodings.
I'd suggest you get the ICU library and use it. Among other things, it allows testing for all of the properties defined in the Unicode Character Database. It also has macros or functions for iterating over a string (or at least an array of char), extracting one UTF-32 code point at a time (which is what you'd want to test).
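To make the "one UTF-32 code point at a time" step concrete, here is a minimal sketch in standard C++ only. The helper decode_utf8 is a hypothetical name introduced for illustration and is not part of ICU or Boost. With ICU you would more likely use its U8_NEXT macro for the iteration, then pass each code point to a classifier such as u_isdigit, u_isalpha or u_charType. The sketch only shows why classification has to happen on decoded code points and not on individual bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an ICU or Boost API): decode a UTF-8 byte
// string into UTF-32 code points. Malformed sequences yield U+FFFD.
// Simplification: overlong forms and surrogates are not rejected.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

Each element of the returned vector is one code point. That is the unit a Unicode-aware classifier expects, whereas the char-based functions in <locale> would see three separate bytes for a character such as 生.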