
I'm trying to read and write a dataframe to a pipe-delimited file. Some of the characters are non-Roman letters (´, ç, ñ, etc.). But it breaks when I try to write out the accents as ASCII.

df = pd.read_csv('filename.txt', sep='|', encoding='utf-8')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='ascii')

-------

File "<ipython-input-63-ae528ab37b8f>", line 21, in <module>
    newdf.to_csv(filename, sep='|', index=False, encoding='ascii')
File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1344, in to_csv
    formatter.save()
File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1551, in save
    self._save()
File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1652, in _save
    self._save_chunk(start_i, end_i)
File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1678, in _save_chunk
    lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)
File "pandas\lib.pyx", line 1075, in pandas.lib.write_csv_rows (pandas\lib.c:19767)
UnicodeEncodeError: 'ascii' codec can't encode character '\xb4' in position 7: ordinal not in range(128)

If I change to_csv to have utf-8 encoding, then I can't read the file in properly:

newdf.to_csv('output.txt', sep='|', index=False, encoding='utf-8')
pd.read_csv('output.txt', sep='|')

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 2: invalid start byte

My goal is to have a pipe-delimited file that retains the accents and special characters.

Also, is there an easy way to figure out which line read_csv is breaking on? Right now I don't know how to get it to show me the bad character(s).

  • Possible duplicate of Pandas writing dataframe to CSV file. Commented Dec 19, 2016 at 18:30
  • Are you normalizing your Unicode strings to remove accents? I thought ASCII can't handle those letters... Commented Dec 19, 2016 at 18:35
  • @juanpa.arrivillaga: I edited my post to clarify what I'm looking for in my output. Commented Dec 19, 2016 at 18:40
  • @ale19 You cannot encode accents and special characters in ASCII; it is a bare-bones representation. That is why encodings like UTF-8 exist. Just write it out in UTF-8. Commented Dec 19, 2016 at 18:54

5 Answers


Check the answer here

It's a much simpler solution:

newdf.to_csv('filename.csv', encoding='utf-8') 
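The key is to use the same encoding on both sides of the round trip. A minimal sketch, using assumed sample data and a hypothetical `output.txt`:

```python
import pandas as pd

# Sample frame with non-ASCII characters (assumed data for illustration).
df = pd.DataFrame({"name": ["García", "Çelik", "Muñoz"]})

# Write and read back with the *same* encoding on both sides.
df.to_csv("output.txt", sep="|", index=False, encoding="utf-8")
roundtrip = pd.read_csv("output.txt", sep="|", encoding="utf-8")

print(roundtrip["name"].tolist())  # accents come back intact
```

Passing `encoding='utf-8'` to `read_csv` as well avoids the `UnicodeDecodeError` in the question, which happens when the reader assumes a different encoding than the writer used.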



You have some characters that are not ASCII and therefore cannot be encoded as you are trying to do. I would just use utf-8 as suggested in a comment.

To check which lines are causing the issue you can try something like this:

def is_not_ascii(string):
    return string is not None and any(ord(s) >= 128 for s in string)

df[df[col].apply(is_not_ascii)]

You'll need to specify the column col you are testing.
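For example, with a hypothetical column named `name` and assumed sample data, the filter returns only the rows containing non-ASCII characters:

```python
import pandas as pd

def is_not_ascii(string):
    # True when the value is a string containing any code point >= 128.
    return string is not None and any(ord(s) >= 128 for s in string)

df = pd.DataFrame({"name": ["plain", "café", "naïve", "ascii only"]})

col = "name"  # the column under test (assumed name)
bad_rows = df[df[col].apply(is_not_ascii)]
print(bad_rows)  # only the rows with accented characters
```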

2 Comments

Thanks. When I try your function (specifying the column), I get TypeError: ord() expected a character, but string of length 17 found. I'm assuming this is because ord() checks individual characters, but the column in question contains strings.
If you do df[df[col].apply(is_not_ascii)] then you get only the rows/indices where an offending character was found.


Another solution is to use the string methods encode/decode with the 'ignore' option, but note that this removes the non-ASCII characters:

df['text'] = df['text'].apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))

Instead of ignoring (removing) the character, you can replace it with '?' by passing errors='replace'. You can also register your own encoding error handler.

See the Python codecs error-handlers documentation for more information: https://docs.python.org/3/library/codecs.html#error-handlers
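A small sketch of the three options side by side, using an assumed sample string; the 'underscore' handler name is made up for this example:

```python
import codecs

text = "Muñoz café"  # contains the non-ASCII characters ñ and é

# 'ignore' drops the offending characters entirely.
print(text.encode("ascii", "ignore").decode("ascii"))   # Muoz caf

# 'replace' substitutes '?' for each one.
print(text.encode("ascii", "replace").decode("ascii"))  # Mu?oz caf?

# A custom handler (hypothetical name) that substitutes '_' instead:
def underscore_handler(err):
    # Return the replacement string and the position to resume encoding at.
    return ("_", err.end)

codecs.register_error("underscore", underscore_handler)
print(text.encode("ascii", "underscore").decode("ascii"))  # Mu_oz caf_
```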

1 Comment

Can we replace non-ascii with '' instead of ignore?

When I read a CSV file with Latin characters such as á, é, í, ó, ú, ñ, etc., my solution is to use encoding='latin_1':

df = pd.read_csv('filename.txt', sep='|', encoding='latin_1')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='latin_1')

You can read the complete list in the Python documentation's list of standard encodings.
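When you don't know the file's encoding up front, one sketch of a fallback strategy (assumed approach, hypothetical helper name) is to try UTF-8 first and fall back to Latin-1, which maps every byte 0x00-0xFF to a character and therefore never fails to decode:

```python
import pandas as pd

def read_pipe_file(path):
    # Hypothetical helper: try UTF-8 first, then fall back to Latin-1.
    for enc in ("utf-8", "latin_1"):
        try:
            return pd.read_csv(path, sep="|", encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")
```

The caveat is that Latin-1 always "succeeds", so if the file is actually in some other 8-bit encoding the accented characters may come back as the wrong letters rather than as an error.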

