Remove non-UTF8 characters from file contents

Question

I'm trying to read usernames from a database and if there are non-UTF-8 characters, it throws UnicodeDecodeError.

I'm unsure of what all the non-UTF8 characters are and I'm looking for a solution.

I want to keep special symbols and filter out only the ones that aren't compatible with UTF-8. ³ and ™ (trademark) don't work with UTF-8; they're the only two I know of.

I still want to keep Chinese symbols, Arabic, etc. That's why I'm using UTF-8.

Code:

def is_author_used(author):
    with open("C:\\Users\\Administrator\\Desktop\\authors.txt", 'r', encoding='utf-8') as f:
        content = f.read().splitlines()
        if author in content:
            return True
    return False

def set_author_used(author):
    with open("C:\\Users\\Administrator\\Desktop\\authors.txt", 'a', encoding='utf-8') as f:
        f.write(author + '\r\n')

It seems that your files are simply not in UTF-8 format. Only characters up to 0x7f are stored in "the usual way" in UTF-8. If you have a byte >= 0x80, it is part of a multibyte character. Reading a file as UTF-8 when it isn't indeed leads to errors.
– glglgl, Commented Sep 15, 2017 at 7:58
What do you mean by "³ and ™ (trademark), don't work with UTF-8"? Those are perfectly good Unicode characters and all Unicode characters can be represented as UTF-8.
– Błotosmętek, Commented Sep 15, 2017 at 8:01
I know, that's the thing. But for some reason it throws an error. My text document is UTF-8.
– Joseph Jones, Commented Sep 15, 2017 at 8:02
@JosephJones Check the respective part in a hex editor and post the bytes and what decoded Unicode you are expecting. Also, you can try this tool: gnuwin32.sourceforge.net/packages/file.htm. I'm pretty convinced that your file is just not UTF-8.
– filmor, Commented Sep 15, 2017 at 8:33
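
The byte-level check suggested in the comments can also be done from Python instead of a hex editor. This is a minimal sketch, not the asker's code: the helper name `find_invalid_utf8` is hypothetical, and the sample bytes only simulate a file saved as Windows-1252, which is an assumption about the likely cause rather than something the question confirms.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for each run that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onward decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice we tried to decode
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, data[start:end]))
            pos = end  # skip the bad run and keep scanning
    return problems


# Assumed scenario: the same characters stored in two different encodings.
as_cp1252 = "abc³™".encode("cp1252")   # single bytes 0xb3 and 0x99
as_utf8 = "abc³™ 中文".encode("utf-8")  # valid multibyte sequences

print(find_invalid_utf8(as_cp1252))
print(find_invalid_utf8(as_utf8))
```

For a real file, the bytes would come from `open(path, "rb").read()`. If the reported bytes are single values such as 0xb3 or 0x99, the file was most likely written in a legacy single-byte encoding and not in UTF-8.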

Danil Speransky · Accepted Answer · 2017-09-15 07:57:45Z

3

Maybe something like this:

with open('text.txt', encoding='utf-8', errors='ignore') as f:
    content = f.read().splitlines()

5 Comments

glglgl Over a year ago

I am not sure ignoring errors will solve the problem.

Joseph Jones Over a year ago

What happens when it finds an error? Does it stop the reading? Or does it just replace specific location in 'content' with an empty character?

filmor Over a year ago

errors="ignore" does exactly what it says: it drops the parts of the input that are not valid UTF-8.
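
To make the behaviour asked about above concrete, here is a small sketch comparing the standard error handlers on the same bytes. The sample string is made up for illustration, and it uses `bytes.decode` rather than a file so that it is self-contained; the `errors` argument of `open()` selects the same handlers.

```python
# Assumed input: a username whose special characters were stored as Windows-1252.
raw = "user³name™".encode("cp1252")  # contains the single bytes 0xb3 and 0x99

# 'ignore' silently drops undecodable bytes; decoding continues afterwards.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD (the replacement character) for each bad byte.
replaced = raw.decode("utf-8", errors="replace")

# The default ('strict') raises instead of continuing.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored))
print(repr(replaced))
print(strict_failed)
```

So reading does not stop at the first error: with `ignore` the offending bytes simply vanish from `content`, and with `replace` they stay visible as U+FFFD markers.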

Brōtsyorfuzthrāx Over a year ago

Solving the problem (if this is what you want) without knowing the cause might have its uses, but there could be other issues down the road, as well as issues right away that you may or may not notice, depending on the cause.

tursunWali Over a year ago

errors='ignore' just ignores them while reading, but it did not remove all the non-UTF-8 characters. They are still there in the file. How do I remove them?
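
One possible way to address this last comment: `errors='ignore'` only affects the string returned by the read, so the file on disk stays unchanged until it is written back. The sketch below assumes that overwriting the file is acceptable; the helper name `rewrite_as_clean_utf8` and the demo file contents are hypothetical.

```python
import os
import tempfile


def rewrite_as_clean_utf8(path):
    """Drop bytes that are not valid UTF-8 and rewrite the file in place."""
    with open(path, "rb") as f:
        raw = f.read()
    text = raw.decode("utf-8", errors="ignore")
    # newline="" keeps the existing line endings exactly as they were
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text


# Demo on a throwaway file: one valid UTF-8 line, one line with stray bytes.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "authors.txt")
    with open(demo_path, "wb") as f:
        f.write("张伟\r\n".encode("utf-8") + b"bob\xb3\x99\r\n")

    cleaned = rewrite_as_clean_utf8(demo_path)

    # After the rewrite, a strict UTF-8 read no longer raises.
    with open(demo_path, encoding="utf-8", newline="") as f:
        reread = f.read()

print(repr(cleaned))
```

Note that this discards data. If the file turns out to be in a known legacy encoding such as Windows-1252, reading it with that encoding and then saving it as UTF-8 would keep characters like ³ and ™ instead of dropping them.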
