Problem description
Update
As @ikegami suggested, I reported this as a bug.
Consider the following C and Perl programs, both of which write the UTF-8 encoding of the string "αβγ" to standard output:
C version:
#include <stdio.h>

int main(void) {
    /* UTF-8 encoded alpha, beta, gamma */
    char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
    puts(x);
    return 0;
}
Output:
C:\…> chcp 65001
Active code page: 65001

C:\…> cttt.exe
αβγ
Perl version:
C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
αβγ
�
From what I can tell, the last octet, 0xb3, is being output again on another line, where it is being translated to U+FFFD.
Note that redirecting output eliminates this effect.
I can also verify that it is the last octet being repeated:
C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz
z
On the other hand, syswrite avoids this problem.
C:\…> perl -e "syswrite STDOUT, qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz
I have observed this in cmd.exe windows on Windows 8.1 Pro 64-bit and Windows Vista Home 32-bit using both self-built perl 5.18.2 and ActiveState's 5.16.3.
I do not see the problem in Cygwin, Linux, or Mac OS X environments. Also, Cygwin's perl 5.14.4 produces correct output in cmd.exe.
Also, when the code page is set to 437, the output from both the C and the Perl versions is identical:
C:\…> chcp 437
Active code page: 437

C:\…> cttt.exe
╬▒╬▓╬│

C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
╬▒╬▓╬│
What causes the last octet to be output twice when printing from a Perl program in cmd.exe with the code page set to 65001?
PS: I have some more information and screenshots on my blog. For this question, I have tried to distill everything to the simplest possible cases.
PPS: Leaving out the \n results in something even more interesting:
C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz}"
αβγxyzxyz

C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3}"
αβγ�γ�
The following program produces the correct output:
use utf8;
use strict;
use warnings;
use warnings qw(FATAL utf8);
binmode(STDOUT, ":unix:encoding(utf8):crlf");
print 'αβγxyz', "\n";
Output:
C:\…> chcp 65001
Active code page: 65001

C:\…> perl pttt.pl
αβγxyz
which seems to indicate to me that there is some funkiness with the :crlf layer. I do not understand the internals well enough to comment intelligently on this at this point.
After many experiments, I have come to the conclusion that, if the console is already set to code page 65001, binmode(STDOUT, ":unix:encoding(utf8):crlf"); will "work". However, note the following:
# Dump is assumed here to be YAML's Dump (e.g. loaded with "use YAML;")
binmode(STDOUT, ":unix:encoding(utf8):crlf");
print Dump [
    map {
        my $x = defined($_) ? $_ : '';
        $x =~ s/\A([0-9]+)\z/sprintf '0x%08x', $1/eg;
        $x;
    } PerlIO::get_layers(STDOUT, details => 1)
];
print "αβγxyz\n";
gives me:
---
- unix
- ''
- 0x01205200
- crlf
- ''
- 0x00c85200
- unix
- ''
- 0x01201200
- encoding
- utf8
- 0x00c89200
- crlf
- ''
- 0x00c8d200
αβγxyz
As before, I do not know enough to know the full consequences of this. I do intend to build a debug perl at some point to further diagnose this.
I examined this a little further. Here are some observations from that post:
The flags for the first unix layer are 0x01205200 = CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG. Why is CRLF set for the unix layer on Windows? I do not know enough about the internals to understand this.
However, the flags for the second unix layer, the one pushed by my explicit binmode, are 0x01201200 = 0x01205200 & ~CRLF. This is what would have made sense to me to begin with.
The flags for the first crlf layer are 0x00c85200 = CANWRITE | TRUNCATE | CRLF | LINEBUF | FASTGETS | TTY. The flags for the second crlf layer, which I push after the :encoding(utf8) layer, are 0x00c8d200 = 0x00c85200 | UTF8.
Now, if I open a file using open my $fh, '>:encoding(utf8)', 'ttt', and dump the same information, I get:
---
- unix
- ''
- 0x00201200
- crlf
- ''
- 0x00405200
- encoding
- utf8
- 0x00409200
As expected, the unix layer does not set the CRLF flag.