259

I'm really confused with the codecs.open function. When I do:

file = codecs.open("temp", "w", "utf-8")
file.write(codecs.BOM_UTF8)
file.close()

It gives me the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

If I do:

file = open("temp", "w")
file.write(codecs.BOM_UTF8)
file.close()

It works fine.

The question is: why does the first method fail, and how do I insert the BOM?

If the second method is the correct way of doing it, what's the point of using codecs.open(filename, "w", "utf-8")?

3
  • 68
    Don’t use a BOM in UTF-8. Please. Commented Feb 9, 2012 at 11:12
  • 11
    @tchrist Huh? Why not? Commented Jun 1, 2013 at 5:16
  • 13
    @SalmanPK BOM is not needed in UTF-8 and only adds complexity (e.g. you can't just concatenate BOM'd files and result with valid text). See this Q&A; don't miss the big comment under Q Commented Aug 29, 2013 at 14:18
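To illustrate the concatenation point from the last comment above, a minimal sketch (my own example, not from the thread): joining two BOM'd texts leaves a stray U+FEFF in the middle of the result.

bom = u'\ufeff'
a = bom + u'hello'
b = bom + u'world'
print(repr(a + b))  # u'\ufeffhello\ufeffworld' -- a stray U+FEFF ends up mid-text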

9 Answers

320

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

import codecs
file = codecs.open("lol", "w", "utf-8")
file.write(u'\ufeff')
file.close()

(That seems to give the right answer - a file with bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
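A quick way to double-check the result (my own sketch, not part of the original answer): read the file back in binary mode and compare the first bytes against codecs.BOM_UTF8.

import codecs

with open("lol", "rb") as f:
    print(f.read(3) == codecs.BOM_UTF8)  # True if the file starts with EF BB BF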


5 Comments

Warning: codecs.open and the built-in open are not the same. If you do "from codecs import open", the name open will NOT behave the same as the built-in open.
you can also use codecs.open('test.txt', 'w', 'utf-8-sig') instead
I'm getting "TypeError: an integer is required (got type str)". I don't understand what we're doing here. Can someone please help? I need to append a string (paragraph) to a text file. Do I need to convert that into an integer first before writing?
@Mugen: The exact code I've written works fine as far as I can see. I suggest you ask a new question showing exactly what code you've got, and where the error occurs.
@Mugen you need to call codecs.open instead of just open
203

Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Do this

with codecs.open("test_output", "w", "utf-8-sig") as temp: temp.write("hi mom\n") temp.write(u"This has ♭") 

The resulting file is UTF-8 with the expected BOM.
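As a follow-up sketch (my addition, not part of the answer): opening the same file with "utf-8-sig" for reading strips the BOM transparently, so the decoded text starts with "hi mom" rather than U+FEFF.

import codecs

with codecs.open("test_output", "r", "utf-8-sig") as temp:
    print(repr(temp.read()))  # no leading u'\ufeff' in the decoded text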

4 Comments

Thanks. That worked (Windows 7 x64, Python 2.7.5 x64). This solution works well when you open the file in mode "a" (append).
This didn't work for me, Python 3 on Windows. I had to do this instead with open(file_name, 'wb') as bomfile: bomfile.write(codecs.BOM_UTF8) then re-open the file for append.
@user2905353: not required; this is handled by context management of open.
Solved my problem: a Python script copied from macOS to Windows now runs successfully.
82

It is very simple: just use this. No extra library is needed.

with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
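If you do want the BOM the question asks about, Python 3's built-in open also accepts 'utf-8-sig', which writes it for you; a minimal sketch assuming text is already defined:

with open('text.txt', 'w', encoding='utf-8-sig') as f:
    f.write(text)  # the BOM is prepended automatically by the codec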

Comments

12

@S-Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insights.

Jon Skeet is right (unusual) about the codecs module - it contains byte strings:

>>> import codecs
>>> codecs.BOM
'\xff\xfe'
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>>

Picking another nit, the BOM has a standard Unicode name, and it can be entered as:

>>> bom = u"\N{ZERO WIDTH NO-BREAK SPACE}"
>>> bom
u'\ufeff'

It is also accessible via unicodedata:

>>> import unicodedata
>>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
u'\ufeff'
>>>
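Tying the two representations together (my own sketch): the byte string codecs.BOM_UTF8 is exactly U+FEFF encoded as UTF-8, which you can verify in the interpreter:

>>> import codecs
>>> codecs.BOM_UTF8.decode('utf-8') == u'\ufeff'
True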

Comments

10

I use the *nix file command to convert a file with an unknown charset into a UTF-8 file:

# -*- encoding: utf-8 -*-
# converting a file with an unknown encoding to utf-8
import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
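The commands module is Python 2 only; on Python 3 the same idea can be expressed with subprocess (a sketch under that assumption, keeping the file names from the answer):

import codecs
import subprocess

file_location = "jumper.sub"
file_encoding = subprocess.check_output(
    ['file', '-b', '--mime-encoding', file_location]).decode().strip()

with codecs.open(file_location, 'r', file_encoding) as src, \
     codecs.open(file_location + "b", 'w', 'utf-8') as dst:
    for line in src:
        dst.write(line)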

2 Comments

Use # coding: utf8 instead of # -*- coding: utf-8 -*-, which is far easier to remember.
I am really interested in seeing something like that working on Windows.
2

Python >= 3.4, using pathlib:

import pathlib

pathlib.Path("text.txt").write_text(text, encoding='utf-8')  # or 'utf-8-sig' for a BOM
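The matching read side (my sketch, reusing the import from the snippet above): read_text with 'utf-8-sig' decodes the file and drops a leading BOM if one is present.

text_back = pathlib.Path("text.txt").read_text(encoding='utf-8-sig')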

Comments

1
import chardet

def read_text_from_file(file_path):
    with open(file_path, 'rb') as f:
        raw_data = f.read()
    try:
        # Try to decode as UTF-8, ignoring errors
        return raw_data.decode('utf-8', errors='ignore')
    except UnicodeDecodeError:
        pass
    # If UTF-8 fails, try automatic encoding detection
    encoding = chardet.detect(raw_data)['encoding']
    if encoding is not None:
        try:
            return raw_data.decode(encoding, errors='ignore')
        except UnicodeDecodeError:
            pass
    raise Exception(f"Error: could not decode file {file_path} with UTF-8 or with the detected encoding.")

def write_to_file(text, file_path, encoding='utf8'):
    """
    This function writes text to a file.
    text: the text you want to write
    file_path: the path of the file you want to write to
    """
    with open(file_path, 'wb') as f:
        f.write(text.encode(encoding, 'ignore'))
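A usage example for the two helpers above (my sketch; the file names are made up):

text = read_text_from_file('input.txt')
write_to_file(text, 'output.txt')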

3 Comments

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
Hello, please don't post code only and add an explantation as to why you think that this is the optimal solution. People are supposed to learn from your answer, which might not occur if they just copy paste code without knowing why it should be used.
That was full code, with an example of how to open/close a UTF-8 file with Python.
0
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text

# OR (AND)

def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

# OR

document = Document()
document.add_heading(file_path.name, 0)
file_content = file_path.read_text(encoding='UTF-8')
document.add_paragraph(file_content)

# OR

def read_text_from_file(cale_fisier):
    text = cale_fisier.read_text(encoding='UTF-8')
    print("What I read: ", text)
    return text  # return the text that was read

def save_text_into_file(cale_fisier, text):
    f = open(cale_fisier, "w", encoding='utf-8')  # open the file
    print("What I wrote: ", text)
    f.write(text)  # write the content to the file

# OR

def read_text_from_file(file_path):
    with open(file_path, encoding='utf8', errors='ignore') as f:
        text = f.read()
    return text  # return the text that was read

# OR

def write_to_file(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))  # write the content to the file

SOURCE HERE:

Comments

-3

If you are using pandas I/O methods like DataFrame.to_excel(), add an encoding parameter, e.g.

df.to_excel("somefile.xlsx", sheet_name="export", encoding='utf-8')

This works for most international characters I believe.
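If the target is a CSV that Excel should open, DataFrame.to_csv also takes an encoding argument, and 'utf-8-sig' writes the BOM Excel looks for; a sketch assuming a DataFrame df (my own example, not from the answer):

import pandas as pd

df = pd.DataFrame({"name": [u"café", u"naïve"]})
df.to_csv("somefile.csv", index=False, encoding="utf-8-sig")  # BOM helps Excel detect UTF-8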

Comments
