
I have a Perl CGI script that retrieves Thai-language, UTF-8 strings from a PostgreSQL DB and returns them to a web-based front end as JSON. The strings are fine when I get them from the DB and after I encode them as JSON (based on writing to a log file). However, when the client receives them they are corrupted, for example:

featurename "à¹\u0082รà¸\u0087à¹\u0080รียà¸\u0099วัà¸\u0094ภาษี"

Clearly some chars are being converted to Unicode escape sequences, but not all.

I could really use some suggestions as to how to solve this.

Simplified code snippet follows. I am using 'utf8' and 'utf8::all', as well as 'JSON'.

Thanks in advance for any help you can provide.

    my $dataId = $cgi->param('dataid');
    my $table = "uploadpoints";
    my $sqlcommand = "select id,featurename from $table where dataid=$dataId;";
    my $stmt = $gDbh->prepare($sqlcommand);
    my $numrows = $stmt->execute;

    # print JSON header
    print <<EOM;
    Content-type: application/json; charset="UTF-8"

    EOM

    my @retarray;
    for (my $i = 0; ($i < $numrows); $i=$i+1) {
        my $hashref = $stmt->fetchrow_hashref("NAME_lc");
        #my $featurename = $hashref->{'featurename'};
        #logentry("Point $i feature name is: $featurename\n");
        push @retarray, $hashref;
    }
    my $json = encode_json(\@retarray);
    logentry("JSON\n $json");
    print $json;

I have modified and simplified the example, now running locally rather than via browser invocation:

    my $dataId = 5;
    my $table = "uploadpoints";
    my $sqlcommand = "select id,featurename from $table where dataid=$dataId and id=75;";
    my $stmt = $gDbh->prepare($sqlcommand);
    my $numrows = $stmt->execute;
    my @retarray;
    for (my $i = 0; ($i < $numrows); $i=$i+1) {
        my $hashref = $stmt->fetchrow_hashref("NAME_lc");
        my $featurename = $hashref->{'featurename'};
        print "featurename $featurename\n";
        push @retarray, $hashref;
    }
    my $json = encode_json(\@retarray);
    print $json;

Using hexdump as in Stefan's example, I've determined that the data as read from the database are already in UTF-8. It looks as though they are being re-encoded in the JSON encode method. But why?

The data in the JSON use exactly twice as many bytes as the original UTF-8.

    $ perl testcase.pl | hexdump -C
    00000000  66 65 61 74 75 72 65 6e  61 6d 65 20 e0 b9 82 e0  |featurename ....|
    00000010  b8 a3 e0 b8 87 e0 b9 80  e0 b8 a3 e0 b8 b5 e0 b8  |................|
    00000020  a2 e0 b8 99 e0 b9 81 e0  b8 88 e0 b9 88 e0 b8 a1  |................|
    00000030  e0 b8 88 e0 b8 b1 e0 b8  99 e0 b8 97 e0 b8 a3 e0  |................|
    00000040  b9 8c 0a 5b 7b 22 66 65  61 74 75 72 65 6e 61 6d  |...[{"featurenam|
    00000050  65 22 3a 22 c3 a0 c2 b9  c2 82 c3 a0 c2 b8 c2 a3  |e":"............|
    00000060  c3 a0 c2 b8 c2 87 c3 a0  c2 b9 c2 80 c3 a0 c2 b8  |................|
    00000070  c2 a3 c3 a0 c2 b8 c2 b5  c3 a0 c2 b8 c2 a2 c3 a0  |................|
    00000080  c2 b8 c2 99 c3 a0 c2 b9  c2 81 c3 a0 c2 b8 c2 88  |................|
    00000090  c3 a0 c2 b9 c2 88 c3 a0  c2 b8 c2 a1 c3 a0 c2 b8  |................|
    000000a0  c2 88 c3 a0 c2 b8 c2 b1  c3 a0 c2 b8 c2 99 c3 a0  |................|
    000000b0  c2 b8 c2 97 c3 a0 c2 b8  c2 a3 c3 a0 c2 b9 c2 8c  |................|
    000000c0  22 2c 22 69 64 22 3a 37  35 7d 5d                 |","id":75}]|
    000000cb

Further suggestions? I tried using decode on the UTF-8 string but got errors about wide characters.

I did read the recommended answer from Tom Christiansen, as well as his Unicode tutorials, but I will admit much of it went over my head. Also it seems my problem is considerably more constrained.

I did wonder whether retrieving the hash value and assigning it to a normal variable was doing some sort of auto-decoding or encoding. I do not really understand when Perl uses its internal character format as opposed to when it retains the external encoding.

UPDATE WITH SOLUTION

Turns out that since the string retrieved from the DB is already UTF-8 encoded, I need to use 'to_json' rather than 'encode_json'. This fixed the problem. Learned a lot about Perl Unicode handling in the process though...

Also recommend: http://perldoc.perl.org/perluniintro.html

Very clear exposition.

  • Use JSON->new->utf8->encode(\@retarray) instead? Commented Feb 15, 2019 at 8:19
  • I retracted my answer. It is obviously impossible to write test code that shows the correct behavior without the exact input string, i.e. your question is missing the exact contents of $hashref->{featurename}. Is it an octet string containing encoded UTF-8? Or a string in the internal Perl representation, i.e. UTF-8 decoded? Note that this can change depending on whether an operation is executed under use utf8; or no utf8;. Commented Feb 15, 2019 at 8:52
  • Just stumbled over Tom Christiansen's answer. Probably a recommended primer if you do anything with Unicode/UTF-8 in Perl. Commented Feb 15, 2019 at 16:32
  • Did you configure your db connection? SET client_encoding TO 'UTF-8' Commented Feb 15, 2019 at 19:33
  • Depending on your DBD::Pg version you also might need to set $dh->{pg_enable_utf8} = 1. Commented Feb 15, 2019 at 19:39
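Picking up the last two comments, both settings belong on the database side. A configuration sketch; the DSN, user, and password below are placeholders, not from the question, and in newer DBD::Pg versions the default pg_enable_utf8 value (-1) usually does the right thing automatically:

```perl
use DBI;

# Hypothetical connection; DSN and credentials are placeholders.
my $gDbh = DBI->connect(
    "dbi:Pg:dbname=mydb;host=localhost", "user", "secret",
    { RaiseError => 1, pg_enable_utf8 => 1 },  # older DBD::Pg: force decoding
);

# Ensure the server sends UTF-8 octets in the first place:
$gDbh->do("SET client_encoding TO 'UTF8'");
```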

1 Answer


NOTE: you should probably also read this answer, which makes my answer sub-par in comparison :-)

The problem is that you have to be sure in which format each string is, otherwise you'll get incorrect conversions. When handling UTF-8 a string can be in one of two formats:

  • an octet string holding the raw UTF-8 encoded bytes, or
  • a string in Perl's internal character format, i.e. already decoded.

If I/O is involved you also need to know if the I/O layer does UTF-8 de/encoding or not. For terminal I/O you also have to consider if it understands UTF-8 or not. Both taken together can make it difficult to get meaningful debug printouts from your code.

If your Perl code needs to process UTF-8 strings after reading them from the source, you must make sure that they are in the internal Perl format. Otherwise you'll get surprising results when you call code that expects Perl strings and not raw octet strings.

I try to show this in my example code:

    #!/usr/bin/perl
    use warnings;
    use strict;

    use JSON;

    open(my $utf8_stdout, '>& :encoding(UTF-8)', \*STDOUT)
        or die "can't reopen STDOUT as utf-8 file handle: $!\n";

    my $hex = "C480";
    print "${hex}\n";

    my $raw = pack('H*', $hex);
    print STDOUT "${raw}\n";
    print $utf8_stdout "${raw}\n";

    my $decoded;
    utf8::decode($decoded = $raw);
    print STDOUT ord($decoded), "\n";
    print STDOUT "${decoded}\n";       # Wide character in print at...
    print $utf8_stdout "${decoded}\n";

    my $json = JSON->new->encode([$decoded]);
    print STDOUT "${json}\n";          # Wide character in print at...
    print $utf8_stdout "${json}\n";

    $json = JSON->new->utf8->encode([$decoded]);
    print STDOUT "${json}\n";
    print $utf8_stdout "${json}\n";

    exit 0;

Copy & paste from my terminal (which supports UTF-8). Look closely at the differences between the lines:

    $ perl dummy.pl
    C480
    Ā
    Ä
    256
    Wide character in print at dummy.pl line 21.
    Ā
    Ā
    Wide character in print at dummy.pl line 25.
    ["Ā"]
    ["Ā"]
    ["Ā"]
    ["Ä"]

But compare this to the following, where STDOUT is not a terminal, but piped to another program. The hex dump always shows "c4 80", i.e. UTF-8 encoded.

    $ perl dummy.pl | hexdump -C
    Wide character in print at dummy.pl line 21.
    Wide character in print at dummy.pl line 22.
    Wide character in print at dummy.pl line 25.
    Wide character in print at dummy.pl line 26.
    00000000  43 34 38 30 0a c4 80 0a  c4 80 0a 5b 22 c4 80 22  |C480.......[".."|
    00000010  5d 0a 5b 22 c4 80 22 5d  0a 43 34 38 30 0a c4 80  |].[".."].C480...|
    00000020  0a 32 35 36 0a c4 80 0a  5b 22 c4 80 22 5d 0a 5b  |.256....[".."].[|
    00000030  22 c4 80 22 5d 0a                                 |".."].|
    00000036

