259

I'm really confused with the codecs.open function. When I do:

file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()

It gives me the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

If I do:

file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()

It works fine.

The question is: why does the first method fail, and how do I insert the BOM?

If the second method is the correct way of doing it, what's the point of using codecs.open(filename, "w", "utf-8")?

3
  • 68
    Don’t use a BOM in UTF-8. Please. Commented Feb 9, 2012 at 11:12
  • 11
    @tchrist Huh? Why not? Commented Jun 1, 2013 at 5:16
  • 13
    @SalmanPK BOM is not needed in UTF-8 and only adds complexity (e.g. you can't just concatenate BOM'd files and result with valid text). See this Q&A; don't miss the big comment under Q Commented Aug 29, 2013 at 14:18
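To illustrate the concatenation point from the last comment above, a minimal sketch (my own example, not from the thread): joining two BOM'd texts leaves a stray U+FEFF in the middle of the result.

bom = u'\ufeff'
a = bom + u'hello'
b = bom + u'world'
print(repr(a + b))  # u'\ufeffhello\ufeffworld' -- a stray U+FEFF ends up mid-text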

9 Answers

320

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs
file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

(That seems to give the right answer - a file with bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
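A quick way to double-check the result (my own sketch, not part of the original answer): read the file back in binary mode and compare the first bytes against codecs.BOM_UTF8.

import codecs

with open("lol", "rb") as f:
    print(f.read(3) == codecs.BOM_UTF8)  # True if the file starts with EF BB BF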


5 Comments

Warning: codecs.open and the built-in open are not the same. If you do "from codecs import open", the name open will NOT behave the same as the built-in open.
you can also use codecs.open('test.txt', 'w', 'utf-8-sig') instead
I'm getting "TypeError: an integer is required (got type str)". I don't understand what we're doing here. Can someone please help? I need to append a string (paragraph) to a text file. Do I need to convert that into an integer first before writing?
@Mugen: The exact code I've written works fine as far as I can see. I suggest you ask a new question showing exactly what code you've got, and where the error occurs.
@Mugen you need to call codecs.open instead of just open
203

Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Do this

with codecs.open("test_output", "w", "utf-8-sig") as temp: temp.write("hi mom\n") temp.write(u"This has ♭") 

The resulting file is UTF-8 with the expected BOM.
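As a follow-up sketch (my addition, not part of the answer): opening the same file with "utf-8-sig" for reading strips the BOM transparently, so the decoded text starts with "hi mom" rather than U+FEFF.

import codecs

with codecs.open("test_output", "r", "utf-8-sig") as temp:
    print(repr(temp.read()))  # no leading u'\ufeff' in the decoded text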

4 Comments

Thanks. That worked (Windows 7 x64, Python 2.7.5 x64). This solution works well when you open the file in mode "a" (append).
This didn't work for me, Python 3 on Windows. I had to do this instead with open(file_name, 'wb') as bomfile: bomfile.write(codecs.BOM_UTF8) then re-open the file for append.
@user2905353: not required; this is handled by context management of open.
Solved my problem: a Python script copied from macOS to Windows now runs successfully.
82

It is very simple: just use this. No extra library is needed.

with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
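If you do want the BOM the question asks about, Python 3's built-in open also accepts 'utf-8-sig', which writes it for you; a minimal sketch assuming text is already defined:

with open('text.txt', 'w', encoding='utf-8-sig') as f:
    f.write(text)  # the BOM is prepended automatically by the codec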

Comments

12

@S-Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insights.

Jon Skeet is right (unusual) about the codecs module - it contains byte strings:

>>> import codecs
>>> codecs.BOM
'\xff\xfe'
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>>

Picking another nit, the BOM has a standard Unicode name, and it can be entered as:

>>> bom = u"\N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'\ufeff'

It is also accessible via unicodedata:

>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'\ufeff'
>>>
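Tying the two representations together (my own sketch): the byte string codecs.BOM_UTF8 is exactly U+FEFF encoded as UTF-8, which you can verify in the interpreter:

>>> import codecs
>>> codecs.BOM_UTF8.decode('utf-8') == u'\ufeff'
True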

Comments

10

I use the *nix file command to convert a file with an unknown charset into a UTF-8 file:

# -*- encoding: utf-8 -*-
# converting a file with an unknown encoding to utf-8
import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
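The commands module is Python 2 only; on Python 3 the same idea can be expressed with subprocess (a sketch under that assumption, keeping the file names from the answer):

import codecs
import subprocess

file_location = "jumper.sub"
file_encoding = subprocess.check_output(
    ['file', '-b', '--mime-encoding', file_location]).decode().strip()

with codecs.open(file_location, 'r', file_encoding) as src, \
     codecs.open(file_location + "b", 'w', 'utf-8') as dst:
    for line in src:
        dst.write(line)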

2 Comments

Use # coding: utf8 instead of # -*- coding: utf-8 -*-, which is far easier to remember.
I am really interested in seeing something like that working on Windows.
2

Python >= 3.4, using pathlib:

import pathlib

pathlib.Path("text.txt").write_text(text, encoding='utf-8')  # or 'utf-8-sig' for a BOM
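The matching read side (my sketch, reusing the import from the snippet above): read_text with 'utf-8-sig' decodes the file and drops a leading BOM if one is present.

text_back = pathlib.Path("text.txt").read_text(encoding='utf-8-sig')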

Comments

1
import chardet

def read_text_from_file(file_path):
    with open(file_path, 'rb') as f:
        raw_data = f.read()
    try:
        # Try to decode as UTF-8, ignoring errors
        return raw_data.decode('utf-8', errors='ignore')
    except UnicodeDecodeError:
        pass
    # If UTF-8 fails, try automatic encoding detection
    encoding = chardet.detect(raw_data)['encoding']
    if encoding is not None:
        try:
            return raw_data.decode(encoding, errors='ignore')
        except UnicodeDecodeError:
            pass
    raise Exception(f"Error: could not decode file {file_path} with UTF-8 or with the detected encoding.")

def write_to_file(text, file_path, encoding='utf8'):
    """
    This function writes text to a file.
    text: the text you want to write
    file_path: the path of the file you want to write to
    """
    with open(file_path, 'wb') as f:
        f.write(text.encode(encoding, 'ignore'))
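A usage example for the two helpers above (my sketch; the file names are made up):

text = read_text_from_file('input.txt')
write_to_file(text, 'output.txt')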

3 Comments

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
Hello, please don't post code only and add an explantation as to why you think that this is the optimal solution. People are supposed to learn from your answer, which might not occur if they just copy paste code without knowing why it should be used.
That was full code, with an example of how to open/close a UTF-8 file with Python.
0
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text

# OR (AND)

def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

# OR

document = Document()
document.add_heading(file_path.name, 0)
file_content = file_path.read_text(encoding='UTF-8')
document.add_paragraph(file_content)

# OR

def read_text_from_file(cale_fisier):
    text = cale_fisier.read_text(encoding='UTF-8')
    print("What I read: ", text)
    return text  # return the text that was read

def save_text_into_file(cale_fisier, text):
    f = open(cale_fisier, "w", encoding='utf-8')  # open the file
    print("What I wrote: ", text)
    f.write(text)  # write the content to the file

# OR

def read_text_from_file(file_path):
    with open(file_path, encoding='utf8', errors='ignore') as f:
        text = f.read()
    return text  # return the text that was read

# OR

def write_to_file(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))  # write the content to the file

SOURCE HERE:

Comments

-3

If you are using pandas I/O methods like DataFrame.to_excel(), add an encoding parameter, e.g.

df.to_excel("somefile.xlsx", sheet_name="export", encoding='utf-8')

This works for most international characters I believe.
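If the target is a CSV that Excel should open, DataFrame.to_csv also takes an encoding argument, and 'utf-8-sig' writes the BOM Excel looks for; a sketch assuming a DataFrame df (my own example, not from the answer):

import pandas as pd

df = pd.DataFrame({"name": [u"café", u"naïve"]})
df.to_csv("somefile.csv", index=False, encoding="utf-8-sig")  # BOM helps Excel detect UTF-8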

Comments
