
I'm trying to read and write a dataframe to a pipe-delimited file. Some of the characters are non-Roman letters (´, ç, ñ, etc.). But it breaks when I try to write out the accents as ASCII.

df = pd.read_csv('filename.txt', sep='|', encoding='utf-8')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='ascii')

-------

File "<ipython-input-63-ae528ab37b8f>", line 21, in <module>
    newdf.to_csv(filename, sep='|', index=False, encoding='ascii')
File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1344, in to_csv
    formatter.save()
File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1551, in save
    self._save()
File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1652, in _save
    self._save_chunk(start_i, end_i)
File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1678, in _save_chunk
    lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)
File "pandas\lib.pyx", line 1075, in pandas.lib.write_csv_rows (pandas\lib.c:19767)
UnicodeEncodeError: 'ascii' codec can't encode character '\xb4' in position 7: ordinal not in range(128)

If I change to_csv to have utf-8 encoding, then I can't read the file in properly:

newdf.to_csv('output.txt', sep='|', index=False, encoding='utf-8')
pd.read_csv('output.txt', sep='|')

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 2: invalid start byte

My goal is to have a pipe-delimited file that retains the accents and special characters.

Also, is there an easy way to figure out which line read_csv is breaking on? Right now I don't know how to get it to show me the bad character(s).

  • Possible duplicate of Pandas writing dataframe to CSV file. Commented Dec 19, 2016 at 18:30
  • Are you normalizing your Unicode strings to remove accents? I thought ASCII can't handle those letters... Commented Dec 19, 2016 at 18:35
  • @juanpa.arrivillaga: I edited my post to clarify what I'm looking for in my output. Commented Dec 19, 2016 at 18:40
  • @ale19 You cannot encode accents and special characters in ASCII; it is a bare-bones representation. That is why encodings like UTF-8 exist. Just write it out in UTF-8. Commented Dec 19, 2016 at 18:54

5 Answers


Check the answer here

It's a much simpler solution:

newdf.to_csv('filename.csv', encoding='utf-8') 
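The key is to use the same encoding on both sides of the round trip. A minimal sketch, using assumed sample data and a hypothetical `output.txt`:

```python
import pandas as pd

# Sample frame with non-ASCII characters (assumed data for illustration).
df = pd.DataFrame({"name": ["García", "Çelik", "Muñoz"]})

# Write and read back with the *same* encoding on both sides.
df.to_csv("output.txt", sep="|", index=False, encoding="utf-8")
roundtrip = pd.read_csv("output.txt", sep="|", encoding="utf-8")

print(roundtrip["name"].tolist())  # accents come back intact
```

Passing `encoding='utf-8'` to `read_csv` as well avoids the `UnicodeDecodeError` in the question, which happens when the reader assumes a different encoding than the writer used.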



You have some characters that are not ASCII and therefore cannot be encoded as you are trying to do. I would just use utf-8 as suggested in a comment.

To check which lines are causing the issue you can try something like this:

def is_not_ascii(string):
    return string is not None and any(ord(s) >= 128 for s in string)

df[df[col].apply(is_not_ascii)]

You'll need to specify the column col you are testing.
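For example, with a hypothetical column named `name` and assumed sample data, the filter returns only the rows containing non-ASCII characters:

```python
import pandas as pd

def is_not_ascii(string):
    # True when the value is a string containing any code point >= 128.
    return string is not None and any(ord(s) >= 128 for s in string)

df = pd.DataFrame({"name": ["plain", "café", "naïve", "ascii only"]})

col = "name"  # the column under test (assumed name)
bad_rows = df[df[col].apply(is_not_ascii)]
print(bad_rows)  # only the rows with accented characters
```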

2 Comments

Thanks. When I try your function (specifying the column), I get TypeError: ord() expected a character, but string of length 17 found. I'm assuming this is because ord() checks individual characters, but the column in question contains strings.
If you do df[df[col].apply(is_not_ascii)] then you get only the rows/indices where an offending character was found.


Another solution is to use the string methods encode/decode with the 'ignore' option, but note that this removes the non-ASCII characters:

df['text'] = df['text'].apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))

Instead of ignoring (removing) the character, you can replace it with '?' by passing errors='replace'. You can also register your own encoding error handler.

See the Python codecs error-handlers documentation for more information: https://docs.python.org/3/library/codecs.html#error-handlers
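A small sketch of the three options side by side, using an assumed sample string; the 'underscore' handler name is made up for this example:

```python
import codecs

text = "Muñoz café"  # contains the non-ASCII characters ñ and é

# 'ignore' drops the offending characters entirely.
print(text.encode("ascii", "ignore").decode("ascii"))   # Muoz caf

# 'replace' substitutes '?' for each one.
print(text.encode("ascii", "replace").decode("ascii"))  # Mu?oz caf?

# A custom handler (hypothetical name) that substitutes '_' instead:
def underscore_handler(err):
    # Return the replacement string and the position to resume encoding at.
    return ("_", err.end)

codecs.register_error("underscore", underscore_handler)
print(text.encode("ascii", "underscore").decode("ascii"))  # Mu_oz caf_
```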

1 Comment

Can we replace non-ascii with '' instead of ignore?

When I read a CSV file with Latin characters such as á, é, í, ó, ú, ñ, etc., my solution is to use encoding='latin_1':

df = pd.read_csv('filename.txt', sep='|', encoding='latin_1')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='latin_1')

You can read the complete list in the Python documentation's list of standard encodings.
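When you don't know the file's encoding up front, one sketch of a fallback strategy (assumed approach, hypothetical helper name) is to try UTF-8 first and fall back to Latin-1, which maps every byte 0x00-0xFF to a character and therefore never fails to decode:

```python
import pandas as pd

def read_pipe_file(path):
    # Hypothetical helper: try UTF-8 first, then fall back to Latin-1.
    for enc in ("utf-8", "latin_1"):
        try:
            return pd.read_csv(path, sep="|", encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")
```

The caveat is that Latin-1 always "succeeds", so if the file is actually in some other 8-bit encoding the accented characters may come back as the wrong letters rather than as an error.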

