CRLF translation with Unicode in Perl

Question

I'm trying to write to a Unicode (UCS-2 Little Endian) file in Perl on Windows, like this.

    open my $f, ">$fName" or die "can't write $fName\n";
    binmode $f, ':raw:encoding(UCS-2LE)';
    print $f "ohai\ni can haz unicodez?\nkthxbye\n";
    close $f;

It basically works, except that I no longer get the automatic LF -> CR/LF translation on output that I get with regular text files. (The output files contain only LF.) If I leave out :raw or add :crlf in the "binmode" call, the output file is garbled. I've tried re-ordering the "directives" (i.e. :encoding before :raw) and can't get it to work. The same problem exists for reading.

dsolimano · Accepted Answer · 2010-07-23 16:38:39Z


This works for me on Windows:

    open my $f, ">:encoding(UCS-2LE):crlf", "test.txt";
    print $f "ohai\ni can haz unicodez?\nkthxbye\n";
    close $f;

Yielding UCS-2 LE output in test.txt of

    ohai
    i can haz unicodez?
    kthxbye


6 Comments

JoelFan Over a year ago

Really?! The 2nd line looks like Asian characters when I open that in Notepad+ (which has Unicode support). And when I open it in HexEdit, I can see why... the lines end in \x00 \x0D \x0A (instead of \x00 \x0D \x00 \x0A) which makes the second line "out of sync"

dsolimano Over a year ago

I see the line ending in \x00 \x0D \x00 \x0A, or as hexlify-buffer puts it, 0d00 0a00. What version of Perl are you using? I'm using the latest Strawberry distribution.

JoelFan Over a year ago

Wow, this is weird... I am also using the latest Strawberry Perl (5.12), and I just tried it again to make sure... I am still seeing 00 0D 0A... Maybe you are using 5.10 and this is a bug introduced in 5.12?

dsolimano Over a year ago

This is perl 5, version 12, subversion 0 (v5.12.0) built for MSWin32-x86-multi-thread, not that this helps us much. I wonder if I have some sort of module installed that's affecting encoding and crlf. I'll try from another computer or two when I have access and post results.

JoelFan Over a year ago

Maybe the OS matters? I am on Windows 7 64-bit (although I am using MSWin32-x86-multi-thread same as you)


szabgab · Accepted Answer · 2014-04-29 13:01:17Z

Here is what I have found to work, at least with perl 5.10.1:

Input:

open(my $f_in, '<:raw:perlio:via(File::BOM):crlf', $file);

Output:

open(my $f_out, '>:raw:perlio:encoding(UTF-16LE):crlf:via(File::BOM)', $file);

These handle BOM, CRLF translation, and UTF-16LE encoding/decoding transparently.

Note that according to the perlmonks post below, if trying to specify with binmode() instead of open(), an extra ":pop" is required:

binmode $f_out, ':raw:pop:perlio:encoding(UTF-16LE):crlf';

which my experience corroborates. I was not able to get this to work with the ":via(File::BOM)" layer, however.
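To check which layers actually end up on a handle in each case, the built-in PerlIO::get_layers function can be used. The following is only a minimal sketch: the temporary file and the variable names are made up for illustration, and the layer lists it prints will differ between platforms and Perl versions, so no particular output is claimed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, removed automatically at exit.
my (undef, $path) = tempfile(UNLINK => 1);

# Layers given directly in the open() mode string.
open my $via_open, '>:raw:perlio:encoding(UTF-16LE):crlf', $path
    or die "can't write $path: $!";
my @open_layers = PerlIO::get_layers($via_open);
close $via_open;

# The same layers applied afterwards with binmode(), including the extra :pop.
open my $via_binmode, '>', $path or die "can't write $path: $!";
binmode $via_binmode, ':raw:pop:perlio:encoding(UTF-16LE):crlf'
    or die "binmode failed: $!";
my @binmode_layers = PerlIO::get_layers($via_binmode);
close $via_binmode;

# Print both stacks side by side for comparison.
print "open:    @open_layers\n";
print "binmode: @binmode_layers\n";
```

If the two printed stacks differ on a given system, that difference is a reasonable first place to look when output comes out garbled.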

References:

http://www.perlmonks.org/?node_id=608532

http://metacpan.org/pod/File::BOM

mob · Accepted Answer · 2022-08-21 18:53:44Z

The :crlf layer does a simple byte mapping of 0x0A -> 0x0D 0x0A (\n --> \r\n) in the output stream, but for the most part this isn't valid in any wide character encoding.

How about using raw mode and printing the CR explicitly?

print $f "ohai\r\ni can haz unicodez?\r\nkthxbye\r\n";

Or if portability is a concern, discover and explicitly use the correct line ending:

    ## never mind - $/ doesn't work
    # print $f "ohai$/i can haz unicodez?$/kthxbye$/";
    open DUMMY, '>', 'dummy'; print DUMMY "\n"; close DUMMY;
    open DUMMY, '<:raw', 'dummy'; $EOL = <DUMMY>; close DUMMY;
    unlink 'dummy';
    ...
    print $f "ohai${EOL}i can haz unicodez?${EOL}kthxbye${EOL}";
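As a minimal sketch of the explicit-CR approach, the script below writes one line through :raw:encoding(UTF-16LE) with a hand-written "\r\n" and then dumps the bytes that landed on disk. The helper names write_utf16le_raw and slurp_hex and the temporary file are hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text as UTF-16LE with no CRLF translation layer.
sub write_utf16le_raw {
    my ($path, $text) = @_;
    open my $out, '>:raw:encoding(UTF-16LE)', $path
        or die "can't write $path: $!";
    print {$out} $text;
    close $out or die "can't close $path: $!";
}

# Hypothetical helper: return the file's bytes as space-separated hex pairs.
sub slurp_hex {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "can't read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my (undef, $path) = tempfile(UNLINK => 1);
write_utf16le_raw($path, "hi\r\n");    # line ending spelled out by hand
print slurp_hex($path), "\n";          # 68 00 69 00 0D 00 0A 00
```

Because the CR is part of the string before encoding, each of CR and LF is encoded as a full two-byte code unit, which is what keeps the following lines "in sync".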

Unrelated to the question, but Ωmega asked in a comment about the difference between :raw and :bytes. As documented in perldoc perlio, you can think of :raw as removing all I/O layers, and :bytes as removing a :utf8 layer. Compare the output of these two commands:

    $ perl -E 'binmode *STDOUT,":crlf:raw"; say' | od -c
    0000000  \n
    0000001
    $ perl -E 'binmode *STDOUT,":crlf:bytes";say' | od -c
    0000000  \r  \n
    0000002

There's no way to make the :crlf layer work before the :encoding layer?
Could you please clarify the difference between :raw and :bytes?

Lumi · Accepted Answer · 2023-12-18 09:52:03Z

So the implicit question seems to be: How the heck is this IO layer business supposed to be used properly? Or perhaps: Is there a bug in the implementation of the :crlf layer?

After 13 years, there is no clarity. But the short answer is that :crlf seems to be meant to be applied, somewhat counter-intuitively, after any :encoding(…) layer and not before; doing otherwise results in garbled output for UTF-16/UCS-2. This can be demonstrated using Perl 5.36.0 MSYS on Windows (shipping with "Git for Windows"), which, by the way, does not have :crlf enabled by default:

    $ perl -E "say for PerlIO::get_layers(*STDOUT)"
    unix
    perlio
    $ perl -Mopen=":std,OUT,:encoding(UCS-2LE):crlf" -E say | od -c
    0000000  \r  \0  \n  \0    → correct
    $ perl -Mopen=":std,OUT,:crlf:encoding(UCS-2LE)" -E say | od -c
    0000000  \r  \n  \0        → wrong!

The PerlIO doc says:

On DOS/Windows like architectures where this layer is part of the defaults, it also acts like the :perlio layer, and removing the CRLF translation (such as with :raw) will only unset the CRLF translation flag.

Talk of a "translation flag" seems to suggest that it is another so-called pseudo-layer which sets a flag respected by other layers, but I don't know whether that is actually the case.

Since Perl 5.14, you can also apply another :crlf layer later, such as when the CRLF translation must occur after an encoding layer.

The very notion that CRLF translation should occur after encoding seems counter-intuitive in the case of UTF-16/UCS-2.

All this seems a bit quirky and not properly specified. To add to the confusion (or possibly gain more insight), read the open::layers doc, where it seems someone with more knowledge than me has studied the issue and made a list of "historical quirks" and "issues".

This is as far as my own research has brought me and I'll leave it there.
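For anyone who wants to repeat the check from a script rather than the command line, here is a rough sketch of a round trip with :crlf listed after :encoding. It assumes Perl 5.14 or later (per the PerlIO doc quoted above); the temporary file and variable names are made up for illustration, and behaviour on older Perls may differ.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, removed automatically at exit.
my (undef, $path) = tempfile(UNLINK => 1);

# :crlf is listed after :encoding, so "\n" becomes "\r\n" before encoding.
open my $out, '>:raw:encoding(UTF-16LE):crlf', $path
    or die "can't write $path: $!";
print {$out} "ohai\nkthxbye\n";
close $out or die "can't close $path: $!";

# Read the raw bytes back to see what actually landed on disk.
open my $raw, '<:raw', $path or die "can't read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;
my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
print "$hex\n";

# Read through the mirrored layer stack; CRLF pairs should come back as "\n".
open my $in, '<:raw:encoding(UTF-16LE):crlf', $path
    or die "can't read $path: $!";
my @lines = <$in>;
close $in;
print scalar(@lines), " lines read\n";
```

A correctly translated file should show each line ending as the four bytes 0D 00 0A 00 in the hex dump; a three-byte ending such as 0D 0A 00 indicates that the CR was inserted after encoding.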
