
I have run into a problem with em dashes in my raw CSV data file that prevents pandas from reading it.

I tried a few variations, shown below. The first was:

    datalocation = filepath
    df = pd.read_csv(datalocation)

This raised the error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)

Other variations included:

    df = pd.read_csv(datalocation, encoding='utf-8')
    df = pd.read_csv(datalocation, encoding='utf-16')

This raised the error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

    df = pd.read_csv(datalocation, na_values=['—'])

This raised the error: line contains NULL byte

If the read were successful, the dataframe should look similar to the sample table below.

    +---------+------+----------+--------+
    | Country | Date | Delivery | Region |
    +---------+------+----------+--------+
    | a       | —    | 10       | foo    |
    | b       | —    | 30       | —      |
    | c       | 2    | —50      | foo—   |
    | —       | —    | 20       | —bar   |
    | a       | —    | 40       | bar—   |
    | —       | —    | —6—      | bar    |
    | b       | —    | 90—      | foo    |
    | c       | —    | 70       | bar    |
    | a       | —    | 80       | foo    |
    | c       | —    | 100      | foo—   |
    +---------+------+----------+--------+

After spending time researching this on SO, I understand that it has to do with a mismatch between the file's actual encoding and the one being used to decode it (Unicode/UTF-8/ASCII).
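One way to narrow down which encoding is involved is to look at the first few bytes of the file for a byte-order mark (BOM). A leading 0xff, as in the 'utf8' error above, is consistent with a UTF-16 little-endian BOM (0xff 0xfe), though that is only a hint and not proof. The following is a minimal sketch; the helper names `guess_encoding_from_bom` and `sniff_file` are made up for illustration and are not part of pandas.

```python
import codecs

# Known byte-order marks, longest first so that the UTF-32 LE mark
# (ff fe 00 00) is not mistaken for the UTF-16 LE mark (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(head):
    """Return a codec name if `head` (bytes) starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Hypothetical helper: read the first 4 bytes of `path` and check for a BOM."""
    with open(path, "rb") as f:
        return guess_encoding_from_bom(f.read(4))
```

If `sniff_file(datalocation)` returns a name, that name can be tried as the `encoding` argument of `pd.read_csv`. A result of `None` only means there is no BOM; the file could still be in any encoding.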

Is there a way to remove all the em dashes before running pd.read_csv? Keep in mind that I don't know which cells of the raw CSV file contain em dashes.

  • What happens if you try encoding='iso-8859-1' ? Commented Jul 28, 2015 at 19:59
  • @DanielMartin Just tried entering that in and received this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128) Commented Jul 28, 2015 at 20:30

1 Answer

I finally figured out how to do this: pre-process the data set into a new file, replacing the em dashes, and then read that new file. I wanted to share the method with anyone else running into this problem.

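    import pandas as pd

    EMDASH = '—'

    # Assumes the raw file is UTF-16 LE, as the read_csv call below implies.
    with open('scrubbed_file', 'wt', encoding='utf_16_le') as outfile:
        with open('original_file_location', 'rt', encoding='utf_16_le') as infile:
            for line in infile:
                # Replace every em dash with a plain hyphen, wherever it occurs.
                outfile.write(line.replace(EMDASH, '-'))

    df = pd.read_csv('scrubbed_file',
                     engine='python',
                     encoding='utf_16_le',
                     names=['Country', 'Date', 'Delivery', 'Region'],
                     delimiter='\t',
                     quotechar='"',
                     skiprows=2,
                     skipfooter=2,  # spelled skip_footer in older pandas versions
                     thousands=',')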
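If writing a second file is not wanted, the same scrubbing can be done in memory. This is only a sketch under assumptions: the function name `scrubbed_buffer` is made up for illustration, the default `encoding="utf-16"` assumes the file starts with a BOM, and the whole file is assumed to fit in memory.

```python
import io

EM_DASH = "\u2014"  # the em dash character


def scrubbed_buffer(path, encoding="utf-16", replacement="-"):
    """Read `path` as text, replace every em dash, and return a StringIO.

    `encoding` is an assumption about the raw file; change it to whatever
    the file actually uses.
    """
    # newline="" keeps line endings exactly as they are in the file.
    with open(path, "r", encoding=encoding, newline="") as infile:
        text = infile.read()
    return io.StringIO(text.replace(EM_DASH, replacement))
```

The returned buffer is file-like, so it can be passed straight to pandas, for example `pd.read_csv(scrubbed_buffer('original_file_location'), delimiter='\t')`. Note that replacing with a hyphen turns a cell such as `—50` into `-50`, which would then parse as a negative number; passing `replacement=""` may be the safer choice for numeric columns.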

Hope this helps anyone who is having trouble with problematic characters in their data.
