I'm trying to read tables into R from HTML pages that are mostly encoded in UTF-8 (and declare <meta charset="utf-8">) but have some strings in some other encodings (I think Windows-1252 or ISO 8859-1). Here's an example. I want everything decoded properly into an R data frame. XML::readHTMLTable takes an encoding argument but doesn't seem to allow one to try multiple encodings.

So, in R, how can I try several encodings for each line of the input file? In Python 3, I'd do something like:

with open('file', 'rb') as o:
    for line in o:
        try:
            line = line.decode('UTF-8')
        except UnicodeDecodeError:
            line = line.decode('Windows-1252')

1 Answer

R does seem to have library functions for guessing character encodings, such as stringi::stri_enc_detect, but when possible it's probably better to use the simpler deterministic method of trying a fixed set of encodings in order. The best way to do this seems to be to take advantage of the fact that when iconv fails to convert a string, it returns NA.

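linewise.decode = function(path)
    sapply(readLines(path), USE.NAMES = FALSE, function(line) {
        # Keep lines that are already valid UTF-8 (this includes pure ASCII).
        if (validUTF8(line))
            return(line)
        # Otherwise try each fallback encoding in order; iconv returns NA on failure.
        l2 = iconv(line, "Windows-1252", "UTF-8")
        if (!is.na(l2))
            return(l2)
        l2 = iconv(line, "Shift-JIS", "UTF-8")
        if (!is.na(l2))
            return(l2)
        stop("Encoding not detected")
    })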

If you create a test file with

$ python3 -c 'with open("inptest", "wb") as o: o.write(b"This line is ASCII\n" + "This line is UTF-8: I like π\n".encode("UTF-8") + "This line is Windows-1252: Müller\n".encode("Windows-1252") + "This line is Shift-JIS: ハローワールド\n".encode("Shift-JIS"))' 

then linewise.decode("inptest") indeed returns

[1] "This line is ASCII"
[2] "This line is UTF-8: I like π"
[3] "This line is Windows-1252: Müller"
[4] "This line is Shift-JIS: ハローワールド"
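The result depends on each encoding in the chain rejecting the lines meant for a later one. As a rough cross-check outside R, here is a minimal Python sketch of the same fallback order applied to the same four lines, built in memory. The names `ENCODINGS` and `decode_line` are made up for illustration, and Python's codecs stand in for iconv, so this only approximates what R does.

```python
# Illustrative cross-check of the fallback order; not part of the R solution.
ENCODINGS = ["utf-8", "cp1252", "shift_jis"]  # tried in this order


def decode_line(raw, encodings=ENCODINGS):
    """Return raw decoded with the first encoding that accepts it."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("Encoding not detected")


# Same four lines as the test file, built in memory instead of on disk.
data = (
    b"This line is ASCII\n"
    + "This line is UTF-8: I like π\n".encode("utf-8")
    + "This line is Windows-1252: Müller\n".encode("cp1252")
    + "This line is Shift-JIS: ハローワールド\n".encode("shift_jis")
)

decoded = [decode_line(raw) for raw in data.splitlines()]
for text in decoded:
    print(text)
```

The Shift-JIS line only reaches the third encoding because its bytes include values that Python's cp1252 codec leaves undefined. An encoding that accepts every byte sequence, such as ISO 8859-1, would never fall through, so it belongs last in the chain if it is used at all. Whether iconv rejects the same bytes may depend on the platform's iconv implementation.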

To use linewise.decode with XML::readHTMLTable, just say something like XML::readHTMLTable(linewise.decode("http://example.com")).

2 Comments

Doesn't it have to be the other way around: iconv(lines, from = "UTF-8", to = "Windows-1252")?
@BigDataScientist No, I want everything to ultimately be in UTF-8 (since it turns out that R doesn't have a binary-representation-free string type like Python 3's str), and Windows-1252 is the preexisting encoding of some of the lines I want to change.
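The direction discussed in these comments can be illustrated with a small Python sketch (variable names are hypothetical). The "from" encoding describes the bytes as they already are in the file, and the "to" encoding is what they should become.

```python
# Bytes as they might appear in the file: "Müller" stored as Windows-1252.
raw = "Müller".encode("cp1252")

# "from": interpret the existing bytes as Windows-1252.
text = raw.decode("cp1252")

# "to": re-encode the resulting text as UTF-8.
utf8 = text.encode("utf-8")

print(raw)   # the single byte 0xFC represents ü in Windows-1252
print(utf8)  # the two bytes 0xC3 0xBC represent ü in UTF-8

# Treating the original bytes as UTF-8 instead fails, because a lone
# 0xFC byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False
print(reverse_ok)
```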
