
I've received several text files, each containing thousands of lines of text. Because the files use Unicode encoding, each file ends up being around 1 GB. I know this might sound borderline ridiculous, but unfortunately that's the reality.


I'm using Python 2.7 on a Windows 7 machine. I've only started using Python but figured this would be a good chance to really start using the language. You've gotta use it to learn it, right?

What I'm hoping to do is make a copy of each of these massive files. The new copies would use ASCII character encoding and would ideally be significantly smaller. I know that changing the character encoding is a solution because I've had success opening a file in MS WordPad and saving it as a regular text file.


Using WordPad is a slow, manual process: I need to open the file, which takes forever because it's so big, and then save it as a new file, which takes just as long. I'd really like to automate this by having a script run in the background while I work on other things. I've written a bit of Python to do this, but it's not working correctly. What I've done so far is the following:

    import io
    import os

    def convertToAscii():
        # Getting a list of the current files in the directory
        cwd = os.getcwd()
        current_files = os.listdir(cwd)
        # I don't want to mess with all of the files, so I'll just pick the second one
        # since the first file is the script itself
        test_file = current_files[1]
        # Determining a new name for the ASCII-encoded file
        file_name_length = len(test_file)
        ascii_file_name = test_file[:file_name_length - 3 - 1] + "_ASCII" + test_file[file_name_length - 3 - 1:]
        # Then we open the new blank file
        the_file = open(ascii_file_name, 'w')
        # Finally, we open our original file for testing...
        with io.open(test_file, encoding='utf8') as f:
            # ...read it line by line
            for line in f:
                # ...encode each line into ASCII
                line.encode("ascii")
                # ...and then write the ASCII line to the new file
                the_file.write(line)
        # Finally, we close the new file
        the_file.close()

    convertToAscii()

And I end up with the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte 

But that doesn't make any sense.... The first line within all of the text files is either a blank line or a series of equal signs, such as ===========.

I was wondering if someone would be able to put me onto the right path for this. I understand that doing this operation can take a very long time since I'm essentially reading each file line by line and then encoding the string into ASCII. What must I do in order to get around my current issue? And is there a more efficient way to do this?

3 Comments
  • The result of line.encode("ascii") is discarded; you don't do anything with it. Commented Feb 26, 2018 at 19:42
  • Also, is there a particular reason you are using Python 2.7? If not, you should probably be using Python 3... Commented Feb 26, 2018 at 19:44
  • @juanpa.arrivillaga, I have other software which relies on Python 2. However, from what I know, both Python 2 and 3 can be set up on the same machine without too much contention. If you feel this can be accomplished much more easily using Python 3, then I am open to that. Commented Feb 26, 2018 at 19:48

2 Answers


For characters that exist in ASCII, UTF-8 already encodes them as single bytes. Opening a UTF-8 file that contains only single-byte characters and then saving it as ASCII should be a no-op.

For there to be any size difference, your files would have to be in a wider Unicode encoding, such as UTF-16 / UCS-2. That would also explain the utf8 codec complaining about unexpected bytes in the source file.

Find out what encoding your files actually use, then save them with the utf8 codec. That way your files will be just as small as ASCII for single-byte characters, but if your source files happen to contain any multibyte characters, the resulting file will still be able to represent them and you won't be doing a lossy conversion.
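
The traceback in the question is itself a clue: byte 0xff at position 0 is the first byte of a UTF-16 little-endian byte-order mark (FF FE). A minimal sketch of a BOM check, assuming the files actually start with one (sniff_encoding is a made-up helper name, and files without a BOM would still need manual inspection):

    import codecs

    def sniff_encoding(path):
        # Read the first few bytes and compare them against the known byte-order marks.
        with open(path, 'rb') as f:
            head = f.read(4)
        if head.startswith(codecs.BOM_UTF8):
            return 'utf-8-sig'   # UTF-8 with a BOM
        if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
            return 'utf-16'      # the utf-16 codec consumes the BOM itself
        return None              # no BOM found; inspect the file by other means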


3 Comments

  • Indeed, that's why it was my first suggestion. Thanks for adding the context though.
  • That was exactly it, thank you so much for the help! I just changed my assumed character encoding to UTF16 via with io.open(test_file, encoding='utf16') as f: and everything appears to be working smooth as butter now :D
  • And per your recommendation, I stuck to encoding each line of the file to UTF8 via line.encode("utf8"). That way, I don't have to worry about having any lossy conversions.
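
Putting that comment thread together, a minimal sketch of the corrected loop might look like this (the file names are placeholders; the key changes are reading the source as UTF-16 and actually using the value that encode returns):

    import io

    with io.open('input.txt', encoding='utf16') as src, open('input_ASCII.txt', 'wb') as dst:
        for line in src:
            # encode() returns a new byte string; it does not modify line in place
            dst.write(line.encode('utf8'))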

There's a potential speedup if you avoid splitting the file into lines, since the only thing that you're doing is joining the lines back together. This allows you to process the input in larger blocks.

Using the shutil.copyfileobj function (which is just read and write in a loop):

    import shutil

    with open('input.txt', encoding='u16') as infile, \
         open('output.txt', 'w', encoding='u8') as outfile:
        shutil.copyfileobj(infile, outfile)

(This uses Python 3, which lets you pass the encoding argument directly to the built-in open; on Python 2 the same thing works through the library function io.open.)
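
For completeness, roughly the same thing on the asker's Python 2.7 setup, where io.open supplies the encoding layer that the built-in open lacks (the file names and the UTF-16 source encoding are assumptions carried over from the comments above):

    import io
    import shutil

    with io.open('input.txt', encoding='utf-16') as infile, \
         io.open('output.txt', 'w', encoding='utf-8') as outfile:
        shutil.copyfileobj(infile, outfile)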

