
I'm trying to write to a Unicode (UCS-2 Little Endian) file in Perl on Windows, like this:

open my $f, ">$fName" or die "can't write $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
print $f "ohai\ni can haz unicodez?\nkthxbye\n";
close $f;

It basically works, except that I no longer get the automatic LF -> CR/LF translation on output that I get with regular text files (the output files just have LF). If I leave out :raw or add :crlf in the "binmode" call, the output file is garbled. I've tried re-ordering the "directives" (i.e. :encoding before :raw) and can't get it to work. The same problem exists for reading.

4 Answers 4

2

This works for me on windows:

open my $f, ">:encoding(UCS-2LE):crlf", "test.txt";
print $f "ohai\ni can haz unicodez?\nkthxbye\n";
close $f;

Yielding UCS-2 LE output in test.txt of

ohai
i can haz unicodez?
kthxbye

6 Comments

Really?! The 2nd line looks like Asian characters when I open that in Notepad++ (which has Unicode support). And when I open it in HexEdit, I can see why... the lines end in \x00 \x0D \x0A (instead of \x00 \x0D \x00 \x0A), which makes the second line "out of sync".
I see the line ending in \x00 \x0D \x00 \x0A, or as hexlify-buffer puts it, 0d00 0a00. What version of perl are you using? I'm using the latest strawberry distribution.
Wow, this is weird... I am also using the latest Strawberry Perl (5.12), and I just tried it again to make sure... I am still seeing 00 0D 0A... Maybe you are using 5.10 and this is a bug introduced in 5.12?
This is perl 5, version 12, subversion 0 (v5.12.0) built for MSWin32-x86-multi-thread, not that this helps us much. I wonder if I have some sort of module installed that's affecting encoding and crlf. I'll try from another computer or two when I have access and post results.
Maybe the OS matters? I am on Windows 7 64-bit (although I am using MSWin32-x86-multi-thread same as you)
2

Here is what I have found to work, at least with perl 5.10.1:

Input:

open(my $f_in, '<:raw:perlio:via(File::BOM):crlf', $file); 

Output:

open(my $f_out, '>:raw:perlio:encoding(UTF-16LE):crlf:via(File::BOM)', $file); 

These handle BOM, CRLF translation, and UTF-16LE encoding/decoding transparently.
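As a rough way to check the CRLF and UTF-16LE parts of this on your own Perl, here is a minimal round-trip sketch. It assumes a plain UTF-16LE file without a BOM, so File::BOM is left out and only core modules are used; `round_trip` and `$LAYERS` are names made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: UTF-16LE with CRLF line endings and no BOM
# (File::BOM is omitted so that only core modules are needed).
my $LAYERS = ':raw:encoding(UTF-16LE):crlf';

# Hypothetical helper: write $text through the layers, then read it back
# through the same layers. Returns the decoded text and the size on disk.
sub round_trip {
    my ($text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open my $out, ">$LAYERS", $path or die "can't write $path: $!";
    print {$out} $text;
    close $out or die "can't close $path: $!";

    open my $in, "<$LAYERS", $path or die "can't read $path: $!";
    my $back = do { local $/; <$in> };   # slurp the whole file
    close $in;

    return ($back, -s $path);
}

my ($back, $size) = round_trip("ohai\nkthxbye\n");
my $status = $back eq "ohai\nkthxbye\n" ? 'round trip ok' : 'round trip differs';
print "$status, $size bytes on disk\n";
```

If the layers behave as described, the text comes back with plain "\n" line endings, and the file size is two bytes per character with each newline counted as two characters.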

Note that according to the perlmonks post below, if trying to specify with binmode() instead of open(), an extra ":pop" is required:

binmode $f_out, ':raw:pop:perlio:encoding(UTF-16LE):crlf'; 

which my experience corroborates. I was not able to get this to work with the ":via(File::BOM)" layer, however.

References:

http://www.perlmonks.org/?node_id=608532

http://metacpan.org/pod/File::BOM


2

The :crlf layer does a simple byte mapping of 0x0A -> 0x0D 0x0A (\n --> \r\n) in the output stream, but for the most part this isn't valid in any wide character encoding.
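To make the byte-mapping problem concrete, here is a small sketch that does not use I/O layers at all. It only mimics the mapping with a regex on strings, once on the already-encoded bytes and once on the characters before encoding. The helper names are hypothetical and this is not how PerlIO implements :crlf internally.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode first, then map the byte 0x0A to 0x0D 0x0A,
# which is what a byte-level translation sees on a UTF-16LE stream.
sub map_bytes_after_encode {
    my ($text) = @_;
    my $bytes = encode('UTF-16LE', $text);
    $bytes =~ s/\x0A/\x0D\x0A/g;
    return $bytes;
}

# Hypothetical helper: translate the characters first, then encode.
sub map_chars_before_encode {
    my ($text) = @_;
    (my $copy = $text) =~ s/\n/\r\n/g;
    return encode('UTF-16LE', $copy);
}

# Render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

print hex_dump(map_bytes_after_encode("a\nb")), "\n";   # 61 00 0d 0a 00 62 00
print hex_dump(map_chars_before_encode("a\nb")), "\n";  # 61 00 0d 00 0a 00 62 00
```

The first dump has an odd number of bytes, so every 16-bit unit after the inserted 0x0D is shifted by one byte. That matches the "out of sync" second line reported in the comments on the other answer.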

How about using raw mode and printing the CR explicitly?

print $f "ohai\r\ni can haz unicodez?\r\nkthxbye\r\n"; 

Or if portability is a concern, discover and explicitly use the correct line ending:

## never mind - $/ doesn't work
# print $f "ohai$/i can haz unicodez?$/kthxbye$/";
open DUMMY, '>', 'dummy'; print DUMMY "\n"; close DUMMY;
open DUMMY, '<:raw', 'dummy'; $EOL = <DUMMY>; close DUMMY;
unlink 'dummy';
...
print $f "ohai${EOL}i can haz unicodez?${EOL}kthxbye${EOL}";

Unrelated to the question, but Ωmega asked in a comment about the difference between :raw and :bytes. As documented in perldoc perlio, you can think of :raw as removing all I/O layers, and :bytes as removing a :utf8 layer. Compare the output of these two commands:

$ perl -E 'binmode *STDOUT,":crlf:raw"; say' | od -c
0000000  \n
0000001
$ perl -E 'binmode *STDOUT,":crlf:bytes";say' | od -c
0000000  \r  \n
0000002
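If od is not available (on plain Windows, for instance), the same comparison can be sketched with a temporary file and a byte count. `bytes_for_newline` is a hypothetical helper; it applies the layers with binmode, prints a single "\n", and reports the resulting file size.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: apply $layers with binmode, print one "\n",
# and report how many bytes ended up in the file.
sub bytes_for_newline {
    my ($layers) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh, $layers) or die "binmode failed: $!";
    print {$fh} "\n";
    close $fh or die "close failed: $!";
    return -s $path;
}

printf "%-12s %d byte(s)\n", $_, bytes_for_newline($_) for ':crlf:raw', ':crlf:bytes';
```

The two sizes should correspond to the final offsets in the od output: one byte when :raw undoes the CRLF translation, two bytes when :bytes leaves it in place.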

3 Comments

There's no way to make the :crlf layer work before the :encoding layer?
Not as far as I know. Maybe somebody else has an idea.
Could you please clarify a difference between :raw and :bytes?
0

So the implicit question seems to be: How the heck is this IO layer business supposed to be used properly? Or perhaps: Is there a bug in the implementation of the :crlf layer?

After 13 years, there is still no clarity. The short answer is that :crlf seems meant to be applied, somewhat counter-intuitively, after any :encoding(…) layer and not before it; the other order garbles UTF-16/UCS-2 output. This can be demonstrated with Perl 5.36.0 for MSYS on Windows (the one shipping with "Git for Windows"), which, by the way, does not have :crlf enabled by default:

$ perl -E "say for PerlIO::get_layers(*STDOUT)"
unix
perlio
$ perl -Mopen=":std,OUT,:encoding(UCS-2LE):crlf" -E say | od -c
0000000  \r  \0  \n  \0      → correct
$ perl -Mopen=":std,OUT,:crlf:encoding(UCS-2LE)" -E say | od -c
0000000  \r  \n  \0          → wrong!
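The same comparison can be sketched for a file opened with open() instead of -Mopen. This is only an illustration under some assumptions: `hex_written` is a hypothetical helper, and the leading :raw is my addition so that both runs start from the same layer stack regardless of platform defaults.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through $layers and return the raw
# bytes of the resulting file as space-separated hex pairs.
sub hex_written {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open my $out, ">$layers", $path or die "can't write $path: $!";
    print {$out} $text;
    close $out or die "can't close $path: $!";

    open my $in, '<:raw', $path or die "can't read $path: $!";
    my $bytes = do { local $/; <$in> };   # slurp the raw bytes
    close $in;

    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

# Assumption: a leading :raw is added to clear any platform default layers.
for my $layers (':raw:encoding(UCS-2LE):crlf', ':raw:crlf:encoding(UCS-2LE)') {
    printf "%-30s %s\n", $layers, hex_written($layers, "\n");
}
```

A well-formed UCS-2LE line ending is 0d 00 0a 00, so the dump shows directly which layer order produces it on a given Perl build.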

The PerlIO doc says:

On DOS/Windows like architectures where this layer is part of the defaults, it also acts like the :perlio layer, and removing the CRLF translation (such as with :raw) will only unset the CRLF translation flag.

Talk of a "translation flag" seems to suggest that it is another so-called pseudo-layer which sets a flag respected by other layers, but I don't know whether that is actually the case.

Since Perl 5.14, you can also apply another :crlf layer later, such as when the CRLF translation must occur after an encoding layer.

The very notion that CRLF translation should occur after encoding seems counter-intuitive in the case of UTF-16/UCS-2.

All this seems a bit quirky and not properly specified. To add to the confusion (or possibly gain more insight), read the open::layers doc, where it seems someone with more knowledge than me has studied the issue and made a list of "historical quirks" and "issues".

This is as far as my own research has brought me and I'll leave it there.
