
I'm crawling a website and collecting information from its JSON. The results are saved in a hash. But some of the pages give me a "malformed UTF-8 character in JSON string" error. I noticed that the last letter in "cafe" produces the error. I think it is because of a mix of character encodings. So now I'm looking for a way to convert all kinds of characters to UTF-8 (I hope there is a way as perfect as that). I tried utf8::all, but it doesn't work (maybe I didn't use it correctly). I'm a beginner. Please help, thanks.


UPDATE

Well, after reading the article "Know the difference between character strings and UTF-8 strings" by brian d foy, I solved the problem with this code:

    use utf8;
    use Encode qw(encode_utf8);
    use JSON;

    my $json_data = qq( { "cat" : "Büster" } );
    $json_data = encode_utf8( $json_data );

    my $perl_hash = decode_json( $json_data );

Hope this helps someone else.

  • Also, you might look at whatever your web user-agent is doing and tell it not to decode the body. That should give you the raw octets so you don't have to encode what it decoded. Commented May 12, 2021 at 20:37

1 Answer


decode_json expects the JSON to have been encoded using UTF-8.

While your source file is encoded using UTF-8, you have Perl decode it by using use utf8; (as you should). This means your string contains Unicode characters, not the UTF-8 bytes that represent those characters.

As you've shown, you could encode the string before passing it to decode_json.

    use utf8;
    use Encode qw( encode_utf8 );
    use JSON   qw( decode_json );

    my $data_json = qq( { "cat" : "Büster" } );

    my $data = JSON->new->utf8(1)->decode(encode_utf8($data_json));
       -or-
    my $data = JSON->new->utf8->decode(encode_utf8($data_json));
       -or-
    my $data = decode_json(encode_utf8($data_json));

But you could simply tell JSON that the string is already decoded.

    use utf8;
    use JSON qw( from_json );

    my $data_json = qq( { "cat" : "Büster" } );

    my $data = JSON->new->utf8(0)->decode($data_json);
       -or-
    my $data = JSON->new->decode($data_json);
       -or-
    my $data = from_json($data_json);
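To see why the pairing matters, here is a minimal sketch that feeds the same document to both decoders in both forms. The variable names ($chars, $bytes, $mojibake, and so on) are made up for illustration. The exact error text depends on the JSON backend installed, so the script only checks that the mismatched call dies, not what it says.

```perl
use strict;
use warnings;
use utf8;                                # string literals below are character strings
use Encode qw( encode_utf8 );
use JSON   qw( decode_json from_json );

my $chars = qq( { "cat" : "Büster" } );  # Unicode characters: ü is one character
my $bytes = encode_utf8($chars);         # UTF-8 octets: ü is two bytes

# Matching pairs: both return the same data.
my $from_bytes = decode_json($bytes);    # utf8(1): input must be UTF-8 bytes
my $from_chars = from_json($chars);      # utf8(0): input must be characters

# Mismatch 1: bytes into from_json does not die, but every byte is
# treated as its own character, so the value comes back as mojibake.
my $mojibake = from_json($bytes);

# Mismatch 2: characters into decode_json is expected to die, because
# the lone ü is not a valid UTF-8 byte sequence.
my $ok    = eval { decode_json($chars); 1 };
my $error = $ok ? '' : $@;

binmode STDOUT, ':encoding(UTF-8)';
printf "decode_json(bytes): %s (length %d)\n",
    $from_bytes->{cat}, length $from_bytes->{cat};
printf "from_json(chars):   %s (length %d)\n",
    $from_chars->{cat}, length $from_chars->{cat};
printf "from_json(bytes):   %s (length %d)\n",
    $mojibake->{cat}, length $mojibake->{cat};
print  "decode_json(chars) died: ", ($error ne '' ? "yes" : "no"), "\n";
```

The second mismatch is the situation from the question: the string was already decoded to characters, and decode_json then tried to decode it again.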

2 Comments

"But you could simply tell JSON that the string is already decoded." Do you mean that the input of the decode function is already encoded to utf-8?
The question makes no sense. There's no "is", there is only "must be". Whether the input to $json->decode must be UTF-8 encoded or must not be encoded depends on whether you are using JSON->new->utf8(1)->decode (aka decode_json) (input must be UTF-8) or JSON->new->utf8(0)->decode (input must be Unicode chars).
