
Background

I'm doing a job for someone that involves downloading ~123,000 US government court decisions stored as text files (.txt). They seem to be mostly encoded as Windows-1252, but are apparently occasionally encoded as UCS-2 LE BOM (according to Notepad++). They may also occasionally use other encodings; I haven't figured out how to quickly get a complete list.

Problem

This variability in the encoding is preventing me from examining the UCS-2 files using Python.

I'd like a quick way to convert all of the files to UTF-8, regardless of their original encoding.

I have access to both a Linux and a Windows machine, so I can use solutions specific to either OS.
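
To make the question concrete, something along these lines is roughly what I have in mind. This is only a sketch based on my assumptions about the data (each file either starts with a UTF-16/UCS-2 LE byte order mark or is Windows-1252), and the directory paths are placeholders:

import os

# Sketch: decode each file based on whether it starts with the UTF-16 LE
# byte order mark (FF FE); otherwise assume Windows-1252. Re-save as UTF-8.
# Both paths are placeholders.
input_dir = '<fill this in>'
output_dir = '<fill this in>'
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    with open(os.path.join(input_dir, filename), 'rb') as infile:
        raw = infile.read()
    if raw.startswith(b'\xff\xfe'):
        # Python's utf-16 codec detects and strips the BOM.
        text = raw.decode('utf-16')
    else:
        text = raw.decode('cp1252')
    with open(os.path.join(output_dir, filename), 'w', encoding='utf-8') as outfile:
        outfile.write(text)

I'm not confident this two-way fallback is safe if other encodings are present, which is why I'd like a more robust way to detect or convert the encodings.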

What I've tried

I tried Python's cchardet library, but it doesn't seem to detect encodings as reliably as Notepad++: for at least one file, cchardet reports Windows-1252 while Notepad++ says the file is actually UCS-2 LE BOM.

import os
import cchardet

def print_the_encodings_used_by_all_files_in_a_directory():
    path_to_cases = '<fill this in>'
    encodings = set()
    detector = cchardet.UniversalDetector()
    for filename in os.listdir(path_to_cases):
        path_to_file = os.path.join(path_to_cases, filename)
        # Reset the detector for each file and feed it lines until it is
        # confident about the encoding.
        detector.reset()
        with open(path_to_file, 'rb') as infile:
            for line in infile.readlines():
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        # Collect the distinct encodings detected across all files.
        encodings.add(detector.result['encoding'])
    print(encodings)

Here's what a hex editor shows as the first two bytes of the file in question: [hex editor screenshot]
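
Since the UTF-16/UCS-2 little-endian byte order mark is the byte pair FF FE, I assume I could also sniff the first two bytes of each file directly rather than relying on statistical detection. A minimal sketch along those lines (the path is a placeholder):

# Check whether a file begins with the UTF-16 / UCS-2 LE byte order mark
# (0xFF 0xFE). The path is a placeholder.
path_to_file = '<fill this in>'

with open(path_to_file, 'rb') as infile:
    first_two_bytes = infile.read(2)

if first_two_bytes == b'\xff\xfe':
    print('Starts with a UTF-16/UCS-2 LE BOM')
else:
    print('No UTF-16/UCS-2 LE BOM; first two bytes are', first_two_bytes.hex())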
