
Background

I'm doing a job for someone that involves downloading ~123,000 US government court decisions stored as text files (.txt). They seem to be mostly encoded as Windows-1252, but are apparently occasionally encoded as UCS-2 LE BOM (according to Notepad++). They may also occasionally use other encodings; I haven't figured out how to quickly get a complete list.

Problem

This variability in the encoding is preventing me from examining the UCS-2 files using Python.

I'd like a quick way to convert all of the files to UTF-8, regardless of their original encoding.

I have access to both a Linux and a Windows machine, so I can use solutions specific to either OS.
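
To make the question concrete, something along these lines is roughly what I have in mind. This is only a sketch based on my assumptions about the data (each file either starts with a UTF-16/UCS-2 LE byte order mark or is Windows-1252), and the directory paths are placeholders:

import os

# Sketch: decode each file based on whether it starts with the UTF-16 LE
# byte order mark (FF FE); otherwise assume Windows-1252. Re-save as UTF-8.
# Both paths are placeholders.
input_dir = '<fill this in>'
output_dir = '<fill this in>'
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    with open(os.path.join(input_dir, filename), 'rb') as infile:
        raw = infile.read()
    if raw.startswith(b'\xff\xfe'):
        # Python's utf-16 codec detects and strips the BOM.
        text = raw.decode('utf-16')
    else:
        text = raw.decode('cp1252')
    with open(os.path.join(output_dir, filename), 'w', encoding='utf-8') as outfile:
        outfile.write(text)

I'm not confident this two-way fallback is safe if other encodings are present, which is why I'd like a more robust way to detect or convert the encodings.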

What I've tried

I tried Python's cchardet library, but it doesn't seem to detect encodings as reliably as Notepad++: for at least one file, cchardet reports Windows-1252 while Notepad++ says the file is actually UCS-2 LE BOM.

import os
import cchardet

def print_the_encodings_used_by_all_files_in_a_directory():
    path_to_cases = '<fill this in>'
    encodings = set()
    detector = cchardet.UniversalDetector()
    for filename in os.listdir(path_to_cases):
        path_to_file = os.path.join(path_to_cases, filename)
        # Reset the detector for each file and feed it lines until it is
        # confident about the encoding.
        detector.reset()
        with open(path_to_file, 'rb') as infile:
            for line in infile.readlines():
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        # Collect the distinct encodings detected across all files.
        encodings.add(detector.result['encoding'])
    print(encodings)

Here's what a hex editor shows as the first two bytes of the file in question: [hex editor screenshot]
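
Since the UTF-16/UCS-2 little-endian byte order mark is the byte pair FF FE, I assume I could also sniff the first two bytes of each file directly rather than relying on statistical detection. A minimal sketch along those lines (the path is a placeholder):

# Check whether a file begins with the UTF-16 / UCS-2 LE byte order mark
# (0xFF 0xFE). The path is a placeholder.
path_to_file = '<fill this in>'

with open(path_to_file, 'rb') as infile:
    first_two_bytes = infile.read(2)

if first_two_bytes == b'\xff\xfe':
    print('Starts with a UTF-16/UCS-2 LE BOM')
else:
    print('No UTF-16/UCS-2 LE BOM; first two bytes are', first_two_bytes.hex())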
