Background
I'm doing a job for someone that involved downloading ~123,000 US government court decisions stored as text files (.txt). Most of the files appear to be encoded as Windows-1252, but some are apparently encoded as UCS-2 LE BOM (according to Notepad++). They may also occasionally use other encodings; I haven't figured out how to quickly get a complete list.
Problem
This variability in the encoding is preventing me from examining the UCS-2 files using Python.
I'd like a quick way to convert all of the files to UTF-8, regardless of their original encoding.
I have access to both a Linux and a Windows machine, so I can use solutions specific to either OS.
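Roughly, what I'm hoping to end up with is something like the sketch below (untested; the directory path is a placeholder, and the assumption that anything without a BOM is Windows-1252 may well be wrong):

import codecs
import os

PATH = '<directory containing the .txt files>'

for filename in os.listdir(PATH):
    full_path = os.path.join(PATH, filename)
    with open(full_path, 'rb') as infile:
        raw = infile.read()
    # Trust the BOM when there is one; otherwise assume Windows-1252.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        text = raw.decode('utf-16')      # the 'utf-16' codec consumes the BOM
    elif raw.startswith(codecs.BOM_UTF8):
        text = raw.decode('utf-8-sig')   # strips the UTF-8 BOM if present
    else:
        text = raw.decode('cp1252')      # assumption: everything else is Windows-1252
    # Rewrite the file in place as UTF-8 (writing to a separate output
    # directory would obviously be safer than overwriting the originals).
    with open(full_path, 'w', encoding='utf-8', newline='') as outfile:
        outfile.write(text)

I'm not sure how robust that is if other encodings turn up, which is why I'd like a more reliable way to detect what each file actually uses.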
What I've tried
I tried using Python's cchardet library, but it doesn't seem to detect encodings as reliably as Notepad++ does: for at least one file, cchardet reports Windows-1252 while Notepad++ says the file is actually UCS-2 LE BOM.
import os
import re
import cchardet

def print_the_encodings_used_by_all_files_in_a_directory():
    path_to_cases = '<fill this in>'
    encodings = set()
    detector = cchardet.UniversalDetector()
    for index, filename in enumerate(os.listdir(path_to_cases)):
        path_to_file = os.path.join(path_to_cases, filename)
        detector.reset()
        with open(path_to_file, 'rb') as infile:
            for line in infile.readlines():
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        encodings.add(detector.result['encoding'])
    print(encodings)

Here's what a hex editor shows as the first two bytes of the file in question:
FF FE
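For what it's worth, checking those two bytes programmatically against the known UTF-16 LE BOM seems to confirm the disagreement with cchardet (quick sketch; the file path is a placeholder):

import codecs
import cchardet

path_to_file = '<path to the file Notepad++ reports as UCS-2 LE BOM>'

with open(path_to_file, 'rb') as infile:
    raw = infile.read()

# codecs.BOM_UTF16_LE is the byte pair FF FE that the hex editor shows.
print('Starts with UTF-16 LE BOM:', raw.startswith(codecs.BOM_UTF16_LE))
print('cchardet says:', cchardet.detect(raw)['encoding'])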