Why do I get the last octet repeated when my Perl program outputs a UTF-8 encoded string in cmd.exe?

Problem description

Update

As @ikegami suggested, I reported this as a bug.

Bug #121783 for perl5: Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output

Consider the following C and Perl programs, both of which output the UTF-8 encoding of the string "αβγ" on standard output:

C version:

#include <stdio.h>

int main(void) {
    /* UTF-8 encoded alpha, beta, gamma */
    char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
    puts(x);
    return 0;
}

Output:

C:\…> chcp 65001
Active code page: 65001

C:\…> cttt.exe
αβγ

Perl version:

C:\…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
αβγ
�

From what I can tell, the last octet, 0xb3, is being output again on another line, where it is rendered as U+FFFD.
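
To illustrate why a stray 0xb3 would show up as U+FFFD, here is a minimal sketch that decodes the intended bytes followed by one extra copy of the last octet. It uses Encode's lenient decoder as a stand-in for whatever the console does internally, which is an assumption; code_points is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets as UTF-8 and list the code points.
# With the default CHECK value, Encode replaces malformed input with U+FFFD.
sub code_points {
    my ($octets) = @_;
    return map { sprintf('U+%04X', ord($_)) } split(//, decode('UTF-8', $octets));
}

# The intended output, a newline, then the last octet on its own.
print join(' ', code_points("\xce\xb1\xce\xb2\xce\xb3\n\xb3")), "\n";
# prints: U+03B1 U+03B2 U+03B3 U+000A U+FFFD
```

0xb3 is a UTF-8 continuation byte, so on its own it does not form a valid sequence and a lenient decoder substitutes the replacement character.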

Note that redirecting output eliminates this effect.

I can also verify that it is the last octet being repeated:

C:\…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz
z

On the other hand, syswrite avoids this problem.

C:\…>  perl -e "syswrite STDOUT, qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz

I have observed this in cmd.exe windows on Windows 8.1 Pro 64-bit and Windows Vista Home 32-bit using both self-built perl 5.18.2 and ActiveState's 5.16.3.

I do not see the problem in Cygwin, Linux, or Mac OS X environments. Also, Cygwin's perl 5.14.4 produces correct output in cmd.exe.

Also, when the code page is set to 437, the output from both the C and the Perl versions is identical:

C:\…> chcp 437
Active code page: 437

C:\…> cttt.exe
╬▒╬▓╬│

C:\…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
╬▒╬▓╬│

What is causing the last octet to be output twice when printing from a Perl program in cmd.exe with the code page set to 65001?

PS: I have some more information and screenshots on my blog. For this question, I have tried to distill everything to the simplest possible cases.

PPS: Leaving out the \n results in something even more interesting:

C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz}"
αβγxyzxyz

C:\…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3}"
αβγ�γ�
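
One model that would reproduce the outputs above, offered as an untested hypothesis rather than a diagnosis: the console write reports the number of characters it displayed instead of the number of bytes it consumed, and the buffered layer treats the smaller count as a partial write and resends the tail. The sketch below simulates only that arithmetic. console_write and flush_with_retry are hypothetical names, and nothing here calls the real Windows API.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the console write: it "displays" every byte it
# is given, but reports a character count instead of a byte count.
sub console_write {
    my ($bytes, $screen) = @_;
    $$screen .= $bytes;
    # A stray continuation byte decodes to a single U+FFFD.
    return length(decode('UTF-8', $bytes));
}

# A flush loop that treats a short count as a partial write and retries
# with whatever it believes is still unwritten.
sub flush_with_retry {
    my ($bytes) = @_;
    my $screen = '';
    while (length $bytes) {
        my $reported = console_write($bytes, \$screen);
        substr($bytes, 0, $reported, '');
    }
    return $screen;
}

for my $input ("\xce\xb1\xce\xb2\xce\xb3xyz", "\xce\xb1\xce\xb2\xce\xb3") {
    my $screen = flush_with_retry($input);
    print join(' ', map { sprintf('%02x', ord($_)) } split(//, $screen)), "\n";
}
# prints: ce b1 ce b2 ce b3 78 79 7a 78 79 7a
# prints: ce b1 ce b2 ce b3 b2 ce b3 b3
```

Rendered as UTF-8, the first line is αβγxyzxyz and the second is αβγ, a replacement character, γ, and another replacement character, which matches the two PPS outputs. Whether this is what actually happens inside perl on Windows is not established here.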

Solution

The following program produces the correct output:

use utf8;
use strict;
use warnings;
use warnings qw(FATAL utf8);

binmode(STDOUT, ":unix:encoding(utf8):crlf");

print 'αβγxyz', "\n";

Output:

C:\…> chcp 65001
Active code page: 65001
C:\…> perl pttt.pl
αβγxyz

This seems to indicate that there is some funkiness with the :crlf layer. I do not understand the internals well enough to comment intelligently on this at this point.

After many experiments, I have come to the conclusion that, if the console is already set to code page 65001, binmode(STDOUT, ":unix:encoding(utf8):crlf"); will "work". However, note the following:

use YAML;  # assumption: Dump comes from a YAML module, given the YAML-style output below

binmode(STDOUT, ":unix:encoding(utf8):crlf");
print Dump [
    map {
        my $x = defined($_) ? $_ : '';
        $x =~ s/\A([0-9]+)\z/sprintf '0x%08x', $1/eg;
        $x;
    } PerlIO::get_layers(STDOUT, details => 1)
];
print "αβγxyz\n";

gives me:

---
- unix
- ''
- 0x01205200
- crlf
- ''
- 0x00c85200
- unix
- ''
- 0x01201200
- encoding
- utf8
- 0x00c89200
- crlf
- ''
- 0x00c8d200
αβγxyz
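
The dump above is a flat list in which every three entries describe one layer: name, arguments, flags. A minimal sketch that groups them may make the stack easier to read; layer_triples is a hypothetical helper name, and the values printed depend on the platform and on which layers have been pushed.

```perl
use strict;
use warnings;

# Group the flat list from PerlIO::get_layers(..., details => 1)
# into [name, arguments, flags] triples, in the order they are returned.
sub layer_triples {
    my ($fh) = @_;
    my @flat = PerlIO::get_layers($fh, details => 1);
    my @triples;
    push @triples, [ splice(@flat, 0, 3) ] while @flat;
    return @triples;
}

for my $layer (layer_triples(\*STDOUT)) {
    my ($name, $args, $flags) = @$layer;
    printf "%-10s %-6s 0x%08x\n", $name, defined $args ? $args : '', $flags;
}
```

Read this way, the dump shows five layers: the original unix and crlf pair, followed by the unix, encoding(utf8), and crlf layers added by the binmode call.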

As before, I do not know enough to know the full consequences of this. I do intend to build a debug perl at some point to further diagnose this.

I examined this a little further. Here are some observations from that post:

The flags for the first unix layer are 0x01205200 = CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG. Why is CRLF set for the unix layer on Windows? I do not know the internals well enough to understand this.

However, the flags for the second unix layer, the one pushed by my explicit binmode, are 0x01201200 = 0x01205200 & ~CRLF. This is what would have made sense to me to begin with.
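
The decomposition above can be checked mechanically. In the sketch below, the bit values are the PERLIO_F_* constants as I remember them from perliol.h; treat them as an assumption and verify them against the perl source you are using. flag_names is a hypothetical helper name.

```perl
use strict;
use warnings;

# PERLIO_F_* bits, lowest first. Assumed values: check against perliol.h.
my @FLAGS = (
    [ EOF      => 0x00000100 ], [ CANWRITE => 0x00000200 ],
    [ CANREAD  => 0x00000400 ], [ ERROR    => 0x00000800 ],
    [ TRUNCATE => 0x00001000 ], [ APPEND   => 0x00002000 ],
    [ CRLF     => 0x00004000 ], [ UTF8     => 0x00008000 ],
    [ UNBUF    => 0x00010000 ], [ WRBUF    => 0x00020000 ],
    [ RDBUF    => 0x00040000 ], [ LINEBUF  => 0x00080000 ],
    [ TEMP     => 0x00100000 ], [ OPEN     => 0x00200000 ],
    [ FASTGETS => 0x00400000 ], [ TTY      => 0x00800000 ],
    [ NOTREG   => 0x01000000 ],
);

# Return the names of all bits set in $value, joined with ' | '.
sub flag_names {
    my ($value) = @_;
    return join ' | ', map { $_->[0] } grep { $value & $_->[1] } @FLAGS;
}

printf "0x%08x = %s\n", $_, flag_names($_) for 0x01205200, 0x01201200;
# prints: 0x01205200 = CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG
# prints: 0x01201200 = CANWRITE | TRUNCATE | OPEN | NOTREG
```

The two values differ only in the 0x00004000 bit, which is the CRLF flag under the assumed table.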

Now, if I open a file using open my $fh, '>:encoding(utf8)', 'ttt', and dump the same information, I get:

---
- unix
- ''
- 0x00201200
- crlf
- ''
- 0x00405200
- encoding
- utf8
- 0x00409200

As expected, the unix layer does not set the CRLF flag.
