I've been working on Outlook imports (LinkedIn exports to Outlook format), but I'm having trouble with encoding. The Outlook-format CSV I get from exporting my LinkedIn contacts is not in UTF-8. Letters like ñ cause an exception in the mongoid_search gem when calling `str.to_s.mb_chars.normalize`. I think encoding is the issue, because the exception is raised when I call `mb_chars` (see first code example). I am not sure if this is a bug in the gem, but I was advised to sanitize the data nonetheless.
I tried using File Picker's new, community-supported gem to upload the CSV data. I then tried three encoding detectors and transcoders:
- Ruby port of a Python lib, `chardet`
  - Didn't work as expected
  - The port still contained Python code, preventing it from running in my app
- `rchardet19` gem
  - Detected `iso-8859` with .8/1 confidence
  - Tried to transcode with Iconv, but it crashed on "illegal characters" at `ñ`
- `Charlock_Holmes` gem
  - Detected `windows-1252` with 33/100 confidence
  - I assume that's the actual encoding, and `rchardet` got `iso-8859` because this one is based on that
  - This gem uses ICU and has a maintained branch "bundle-icu" which supports Heroku. When I try to transcode using `charlock`, I get the error `U_FILE_ACCESS_ERROR`, an ICU error code meaning "could not open file"
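For reference, here is a minimal sketch of the transcoding step I'm trying to get working, written with Ruby's built-in `String#encode` rather than Iconv or ICU. It assumes Ruby 1.9 or later and that the bytes really are windows-1252, which the 33/100 confidence above does not guarantee. The helper name `to_utf8` and the sample name are made up for illustration.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of raw bytes.
# Assumption: anything that is not already valid UTF-8 is windows-1252.
def to_utf8(raw, fallback = "Windows-1252")
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Reinterpret the bytes as the fallback encoding, then transcode.
  # Bytes with no mapping are replaced with "?" instead of raising.
  raw.dup.force_encoding(fallback).encode(
    Encoding::UTF_8,
    invalid: :replace,
    undef: :replace,
    replace: "?"
  )
end

# "Muñoz" as windows-1252 bytes (ñ is the single byte 0xF1).
sample = [0x4D, 0x75, 0xF1, 0x6F, 0x7A].pack("C*")

clean = to_utf8(sample)
puts clean.encoding          # UTF-8
puts clean.valid_encoding?   # true
```

Checking `valid_encoding?` first leaves data that is already UTF-8 untouched, so the fallback only applies to rows that would otherwise fail.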
Anybody know what to do here?