This simple code segment shows an issue I am having with JSON::XS encoding in Perl:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use JSON::XS;
    use utf8;

    binmode STDOUT, ":encoding(utf8)";

    my (%data);
    $data{code} = "Gewürztraminer";
    print "data{code} = " . $data{code} . "\n";

    my $json_text = encode_json \%data;
    print $json_text . "\n";

The output this yields is:

    johnnyb@boogie:~/Projects/repos > ./jsontest.pl
    data{code} = Gewürztraminer
    {"code":"Gewürztraminer"}

Now if I comment out the binmode line above I get:

    johnnyb@boogie:~/Projects/repos > ./jsontest.pl
    data{code} = Gew�rztraminer
    {"code":"Gewürztraminer"}

What is happening here? Note that I am trying to fix this behavior in a Perl CGI script in which binmode cannot be used, but I always get the "Ã¼" characters shown above in the returned JSON stream. How do I debug this? What am I missing?

1
  • Replace last line with print decode('UTF-8', $json_text, Encode::FB_CROAK) . "\n"; Commented Jul 1, 2015 at 20:21

2 Answers

14

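encode_json (short for JSON::XS->new->utf8->encode) already encodes its output as UTF-8. Printing that output to STDOUT, to which you've added an :encoding(utf8) layer, encodes it a second time. Effectively, you are doing encode_utf8(encode_utf8($uncoded_json)). Without the layer, the JSON comes out correctly, but the character string in $data{code} is printed as a single Latin-1 byte for "ü", which a UTF-8 terminal displays as "�".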
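The double encoding can be reproduced with the core Encode module alone. This is only a minimal sketch of the effect, not the asker's script; the variable names $text, $once and $twice are made up for illustration. It prints lengths rather than the strings themselves so the result does not depend on the terminal.

```perl
use strict;
use warnings;
use utf8;                                   # the source code below is UTF-8
use Encode qw( encode_utf8 decode_utf8 );

my $text  = "Gewürztraminer";               # string of Unicode characters
my $once  = encode_utf8($text);             # UTF-8 bytes: "ü" becomes 2 bytes
my $twice = encode_utf8($once);             # each of those bytes is encoded again

printf "characters:    %d\n", length $text;
printf "encoded once:  %d bytes\n", length $once;
printf "encoded twice: %d bytes\n", length $twice;

# Decoding the doubly encoded bytes once yields the mojibake characters
# "Ã" and "¼" in place of "ü", which is what the question shows.
my $mojibake = decode_utf8($twice);
print "one decode gives back the singly encoded form\n" if $mojibake eq $once;
```

Each encoding pass adds bytes for the non-ASCII character, so the three lengths differ even though every string looks similar on screen.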

Solution 1

    use open ':std', ':encoding(utf8)';  # Defaults
    binmode STDOUT;                      # Override defaults
    print encode_json(\%data);

Solution 2

    use open ':std', ':encoding(utf8)';   # Defaults
    print JSON::XS->new->encode(\%data);  # Or to_json from JSON.pm

Solution 3

The following works with any encoding on STDOUT by using \u escapes for non-ASCII:

print JSON::XS->new->ascii->encode(\%data); 
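As a rough illustration of Solution 3, the sketch below checks that the ascii option yields pure ASCII output that still decodes to the original text. It uses JSON::PP purely as an assumption of convenience: it ships with Perl and offers the same new/ascii/encode/decode interface, so JSON::XS can be swapped in. The variable names are hypothetical.

```perl
use strict;
use warnings;
use utf8;
use JSON::PP;   # assumption: stand-in for JSON::XS, same interface for these calls

my %data = ( code => "Gewürztraminer" );

# With ->ascii, every non-ASCII character is written as a \u escape.
my $json = JSON::PP->new->ascii->encode(\%data);

# The result contains only ASCII, so a second encoding pass by an
# :encoding(...) layer on STDOUT cannot change it.
print "pure ASCII\n" if $json !~ /[^\x00-\x7F]/;
print $json, "\n";

# Decoding restores the original character string.
my $round_trip = JSON::PP->new->decode($json);
print "round trip ok\n" if $round_trip->{code} eq $data{code};
```

The trade-off is a slightly larger payload, in exchange for output that survives whatever encoding the filehandle applies.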

In the comments, you mention it's actually a CGI script.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use utf8;                      # Encoding of source code.
    use open ':encoding(UTF-8)';   # Default encoding of file handles.

    BEGIN {
        binmode STDIN;                       # Usually does nothing on non-Windows.
        binmode STDOUT;                      # Usually does nothing on non-Windows.
        binmode STDERR, ':encoding(UTF-8)';  # For text sent to the log file.
    }

    use CGI      qw( -utf8 );
    use JSON::XS qw( encode_json );          # Import the function called below.

    {
        my $cgi = CGI->new();
        my $data = { code => "Gewürztraminer" };
        print $cgi->header('application/json');
        print encode_json($data);
    }
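Since a CGI response is simply whatever the script prints to STDOUT, the same idea can be sketched without CGI.pm. This is an illustrative variant under stated assumptions, not part of the answer above: json_response is a hypothetical helper, and JSON::PP stands in for JSON::XS only because it ships with Perl and exports a compatible encode_json.

```perl
use strict;
use warnings;
use utf8;
use JSON::PP qw( encode_json );   # assumption: stand-in for JSON::XS

# Hypothetical helper: builds the whole response as one byte string.
sub json_response {
    my ($data) = @_;
    my $body = encode_json($data);            # already UTF-8 encoded bytes
    return "Content-Type: application/json\r\n"
         . "Content-Length: " . length($body) . "\r\n"
         . "\r\n"
         . $body;
}

binmode STDOUT;                               # raw bytes, no encoding layer
print json_response({ code => "Gewürztraminer" });
```

Because the body is already encoded, the handle is left in binary mode and Content-Length can be taken directly from length($body).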

7 Comments

Yes - that does work and, as stated below, I think I am encoding data that is already UTF8. Not sure how to get around that. Unfortunately STDOUT really has no meaning in a CGI script (?) so I'm not sure I can use the method above.
You're not making any sense. You wouldn't have the problem if you didn't use STDOUT. And yes, CGI does use STDOUT.
I have not been able to figure out how to use a CGI object as a filehandle to hand to binmode. The encoded JSON data must be printed to the CGI object to be returned to the jQuery making the AJAX request. If you know how to do this I am all ears and eyes.
What are you talking about? We've already established that CGI uses STDOUT, so you use exactly the code I posted.
@Omortis CGI.pm doesn't send anything anywhere. Your code needs to print stuff. CGI.pm just gives you crappy old functions that generate HTML for you. Also keep in mind that CGI.pm is not in core any more. :)
3

JSON::XS encodes its output into octets. That is, $json_text holds the external, UTF-8 encoded representation of the string, not a string of Unicode characters. For more details see perlunicode. In short, the content of $json_text is ready to be transmitted through an I/O handle in binary mode. When you assign $data{code} under use utf8;, you get a scalar holding a string of Unicode characters. (Internally it is stored as UTF-8, but that is an implementation detail you should not rely on. The use utf8; pragma only means that the source code is encoded as UTF-8, nothing else.) If you want to print both scalars through an I/O handle with a UTF-8 encoding layer, you have to decode $json_text back into a string of Unicode characters.

    use strict;
    use warnings;
    use JSON::XS;
    use utf8;

    binmode STDOUT, ":encoding(utf8)";

    my (%data);
    $data{code} = "Gewürztraminer";
    print "data{code} = " . $data{code} . "\n";

    my $json_text = encode_json \%data;
    utf8::decode($json_text);
    print $json_text . "\n";

Or, as it is intended to be used, print the encoded string through an I/O handle in binary mode.

    my $json_text = encode_json \%data;
    binmode STDOUT;
    print $json_text . "\n";

Try

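    # Parentheses are needed: "." binds tighter than "?:", so without them
    # the newline would be appended to "OCTETS" only.
    print( (utf8::is_utf8($json_text) ? "UTF8" : "OCTETS") . "\n" );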

to see what is inside.
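Another way to look inside, which does not depend on Perl's internal flag, is to compare lengths and dump the code of each string element. The following is only a sketch: hex_dump is a hypothetical helper, and JSON::PP is assumed as a stand-in for JSON::XS because it ships with Perl and exports a compatible encode_json.

```perl
use strict;
use warnings;
use utf8;
use JSON::PP qw( encode_json );   # assumption: stand-in for JSON::XS

# Hypothetical helper: hex code of every element of a string.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $string;
}

my $json_octets = encode_json({ code => "Gewürztraminer" });

my $json_chars = $json_octets;
utf8::decode($json_chars) or die "output was not valid UTF-8";

# The octet form is one element longer: "ü" occupies two bytes (C3 BC)
# there, but is the single character FC after decoding.
printf "octets: %d elements: %s\n", length $json_octets, hex_dump($json_octets);
printf "chars:  %d elements: %s\n", length $json_chars,  hex_dump($json_chars);
```

The dump is plain ASCII, so it reads the same regardless of the terminal or of any layer on STDOUT.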

8 Comments

Rather wasteful to ask to UTF-8 encode it only to follow up with a decode and yet another encode!
Yes, but the point is to explain what is wrong rather than to make it efficient.
The code doesn't explain that, the explanation does.
Never ever use is_utf8. If you need to force the use of one of the two storage formats, use utf8::upgrade or utf8::downgrade, neither of which are needed here.
@Omortis: You read the data from DBI? Here we go. Check whether the data read from DBI is flagged internally as Unicode using is_utf8, and if not, use utf8::upgrade. It is a common bug in DBI drivers. Then use the approach in @ikegami's answer and choose whether you would like to have I/O in binary or UTF-8 mode.