
I'm trying to read usernames from a database and if there are non-UTF-8 characters, it throws UnicodeDecodeError.

I'm unsure of what all the non-UTF8 characters are and I'm looking for a solution.

I want to keep special symbols and filter out only the ones that aren't compatible with UTF-8. ³ and ™ (trademark) don't work with UTF-8; they're the only two I know of.

I still want to keep Chinese characters, Arabic, etc. That's why I'm using UTF-8.

Code:

    def is_author_used(author):
        with open("C:\\Users\\Administrator\\Desktop\\authors.txt", 'r', encoding='utf-8') as f:
            content = f.read().splitlines()
            if author in content:
                return True
        return False

    def set_author_used(author):
        with open("C:\\Users\\Administrator\\Desktop\\authors.txt", 'a', encoding='utf-8') as f:
            f.write(author + '\r\n')
  • It seems that your files are simply not in UTF-8 format. Only characters up to 0x7f are stored in "the usual way" in UTF-8. If you have a byte >= 0x80, it is part of a multibyte character. Reading a file as UTF-8 when it isn't UTF-8 does indeed lead to errors. Commented Sep 15, 2017 at 7:58
  • the notepad/text document is in fact in UTF-8 Commented Sep 15, 2017 at 8:00
  • What do you mean by "³ and ™ (trademark), don't work with UTF-8"? Those are perfectly good Unicode characters and all Unicode characters can be represented as UTF-8. Commented Sep 15, 2017 at 8:01
  • I know, that's the thing. But for some reason it throws an error. My text document is UTF-8 Commented Sep 15, 2017 at 8:02
  • @JosephJones Check the respective part in a hex editor and post the bytes and what decoded unicode you are expecting. Also, you can try using this tool, gnuwin32.sourceforge.net/packages/file.htm, I'm pretty convinced that your file is just not UTF8. Commented Sep 15, 2017 at 8:33
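
If a hex editor isn't handy, the check suggested in the last comment can be approximated in Python. This is only a sketch: `find_invalid_utf8` is a hypothetical helper name, and the sample bytes are made up to mimic a name saved in a Windows code page rather than in UTF-8. It relies on the `start`, `end` attributes that `UnicodeDecodeError` exposes.

```python
def find_invalid_utf8(data, context=8):
    """Return (offset, bad_bytes, surrounding_bytes) for the first
    sequence in `data` that is not valid UTF-8, or None if all of it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the offending bytes within `data`.
        lo = max(exc.start - context, 0)
        return exc.start, data[exc.start:exc.end], data[lo:exc.end + context]
    return None


# Made-up sample: 0xB3 is how cp1252 stores the superscript three,
# which is not a valid UTF-8 sequence on its own.
sample = b"user\xb3\n"
print(find_invalid_utf8(sample))
```

For a real file, the bytes would come from opening it in binary mode (`open(path, "rb").read()`), so that nothing is decoded before the check runs.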

1 Answer


Maybe something like this:

    with open('text.txt', encoding='utf-8', errors='ignore') as f:
        content = f.read().splitlines()
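
Note that `errors='ignore'` drops the undecodable bytes, so a name containing ³ or ™ would lose those characters. If the goal is to keep them, one possible alternative is to decode each line as UTF-8 and fall back to a legacy code page only when that fails. The sketch below assumes the stray bytes are cp1252 (what Windows Notepad typically calls "ANSI" on Western systems); that is a guess, not something the question confirms, and `decode_line` and `read_authors` are hypothetical names.

```python
def decode_line(raw, fallback="cp1252"):
    """Decode one line as UTF-8, falling back to `fallback` if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ASSUMPTION: lines that are not UTF-8 were written as cp1252.
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return raw.decode(fallback, errors="replace")


def read_authors(data):
    """Split raw file bytes into lines and decode each line independently."""
    return [decode_line(line) for line in data.splitlines()]


# Made-up sample: first line is valid UTF-8, second is cp1252.
data = b"caf\xc3\xa9\r\nuser\xb3\x99\r\n"
print(read_authors(data))
```

Decoding line by line matters when a file mixes encodings, for example older entries saved as "ANSI" and newer ones appended with `encoding='utf-8'`.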

5 Comments

I am not sure ignoring errors will solve the problem.
What happens when it finds an error? Does it stop the reading? Or does it just replace specific location in 'content' with an empty character?
errors="ignore" does exactly what it says: it drops the parts of the input that are not valid UTF-8.
Solving the problem (if this is what you want) without knowing the cause might have its uses, but there could be other issues down the road, as well as issues right away that you may or may not notice, depending on the cause.
errors='ignore' just ignores the errors, but it did not remove all the non-UTF-8 characters; they are still there. How can I remove them?
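
To make the behaviour discussed in these comments concrete, here is a small demonstration on in-memory bytes (the sample bytes are made up). Decoding does not stop at the first error: with `errors='ignore'` the invalid bytes are skipped and everything else is kept, while `errors='replace'` puts U+FFFD in their place.

```python
# Made-up input: "user", a stray 0xB3, a space, a valid three-byte
# UTF-8 sequence (U+4E2D), and a stray 0x99.
raw = b"user\xb3 \xe4\xb8\xad\x99"

ignored = raw.decode("utf-8", errors="ignore")    # invalid bytes dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid bytes -> U+FFFD

print(repr(ignored))
print(repr(replaced))
```

Only bytes that cannot be decoded are affected. Anything that decodes as valid UTF-8 is kept, even if it is not the character the author intended, which may be why some unwanted characters appear to survive `errors='ignore'`.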
