I've received several text files, where each file contains thousands of lines of text. Because the files use Unicode encoding, each file ends up being around 1GB. I know this might sound borderline ridiculous, but it unfortunately is the reality.
I'm using Python 2.7 on a Windows 7 machine. I've only started using Python but figured this would be a good chance to really start using the language. You've gotta use it to learn it, right?
What I'm hoping to do is to be able to make a copy of all of these massive files. The new copies would use ASCII character encoding and would ideally be significantly smaller in size. I know that changing the character encoding is a solution because I've had success by opening a file in MS WordPad and saving it to a regular text file.
Using WordPad is a manual and slow process: I need to open the file, which takes forever because it's so big, and then save it as a new file, which also takes forever since it's so big. I'd really like to automate this by having a script run in the background while I work on other things. I've written a bit of Python to do this, but it's not working correctly. What I've done so far is the following:
```python
import io
import os

def convertToAscii():
    # Getting a list of the current files in the directory
    cwd = os.getcwd()
    current_files = os.listdir(cwd)

    # I don't want to mess with all of the files, so I'll just pick the
    # second one since the first file is the script itself
    test_file = current_files[1]

    # Determining a new name for the ASCII-encoded file
    file_name_length = len(test_file)
    ascii_file_name = test_file[:file_name_length - 3 - 1] + "_ASCII" + test_file[file_name_length - 3 - 1:]

    # Then we open the new blank file
    the_file = open(ascii_file_name, 'w')

    # Finally, we open our original file for testing...
    with io.open(test_file, encoding='utf8') as f:
        # ...read it line by line
        for line in f:
            # ...encode each line into ASCII
            line.encode("ascii")
            # ...and then write the ASCII line to the new file
            the_file.write(line)

    # Finally, we close the new file
    the_file.close()

convertToAscii()
```

And I end up with the following error:
```
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
```

But that doesn't make any sense... The first line within all of the text files is either a blank line or a series of equal signs, such as `===========`.
I was wondering if someone would be able to put me on the right path for this. I understand that this operation can take a very long time, since I'm essentially reading each file line by line and then encoding each string into ASCII. What must I do in order to get around my current issue? And is there a more efficient way to do this?


The result of `line.encode("ascii")` is discarded; you don't do anything with it. `encode` returns a new byte string rather than modifying `line` in place, so you need to capture the return value and write that to the output file instead of the original `line`.
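As for the `UnicodeDecodeError`: a `0xff` byte at position 0 is what you'd see at the start of a UTF-16-LE byte-order mark, and Windows tools often label UTF-16 as "Unicode", which would also explain the files being roughly twice the size of their ASCII equivalents. Here is a minimal sketch of both fixes together, assuming the files really are UTF-16 (you may need a different codec); the function name and file names are made up for illustration:

```python
import io

def convert_to_ascii(src_path, dst_path):
    # Decode as UTF-16, not UTF-8: a leading 0xff byte matches the
    # UTF-16-LE byte-order mark (assumption -- verify against your files).
    with io.open(src_path, encoding='utf-16') as src:
        with io.open(dst_path, 'wb') as dst:
            # Reading line by line keeps memory use low even for 1GB files.
            for line in src:
                # encode() returns a new byte string; write that result.
                # 'ignore' silently drops characters with no ASCII
                # equivalent; use 'replace' to substitute '?' instead.
                dst.write(line.encode('ascii', 'ignore'))

# Hypothetical usage:
# convert_to_ascii('huge_log.txt', 'huge_log_ASCII.txt')
```

The error handler is passed positionally (`'ignore'`) rather than as a keyword because Python 2.7's `unicode.encode` does not accept keyword arguments.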