EM Dash in CSV causing issues with Pandas

Question

My raw CSV data file contains em dashes, and they prevent pandas from reading the file.

I tried a few variations, shown below.

    datalocation = filepath
    df = pd.read_csv(datalocation)

Received the error: 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)'

Other variations include:

    df = pd.read_csv(datalocation, encoding='utf-8')
    df = pd.read_csv(datalocation, encoding='utf-16')

Received the error: 'UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte'

 df = pd.read_csv(datalocation, na_values=['—'])

Received the error: 'line contains NULL byte'

If it was successful, the dataframe should be similar to the sample table below.

+---------+------+----------+--------+
| Country | Date | Delivery | Region |
+---------+------+----------+--------+
| a       | —    | 10       | foo    |
| b       | —    | 30       | —      |
| c       | 2    | —50      | foo—   |
| —       | —    | 20       | —bar   |
| a       | —    | 40       | bar—   |
| —       | —    | —6—      | bar    |
| b       | —    | 90—      | foo    |
| c       | —    | 70       | bar    |
| a       | —    | 80       | foo    |
| c       | —    | 100      | foo—   |
+---------+------+----------+--------+

After spending time researching the resources on SO, I understand that it has to do with some conflict across Unicode/UTF-8/ASCII.
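For readers hitting the same errors: the 0xff at position 0 in the second traceback is consistent with a UTF-16 byte-order mark (BOM), and checking the first few bytes of the file is a quick way to narrow down the encoding. The following is only a minimal sketch; the helper name `guess_encoding_from_bom` is hypothetical, and it can only recognise files that actually begin with a BOM.

```python
import codecs

# Known byte-order marks. UTF-32 comes first because its little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(raw):
    """Return a codec name if the bytes start with a known BOM, else None.

    Hypothetical helper for illustration; a file without a BOM yields None
    and needs another way of identifying its encoding.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


# Example use (the path is a placeholder):
# with open("original_file_location", "rb") as f:
#     print(guess_encoding_from_bom(f.read(4)))
```

If a codec name comes back, it can be tried as the `encoding` argument of `pd.read_csv`.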

Is there a way to remove all the em dashes before running 'pd.read_csv'? Keep in mind that I don't know exactly which cells of the raw CSV file contain them.

@DanielMartin Just tried entering that in and received this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
– Anthony Wu, Commented Jul 28, 2015 at 20:30

Anthony Wu · Accepted Answer · 2015-08-04 16:08:09Z

Finally figured out how to do this by pre-processing the data set into a new file before reading it! I wanted to share the method with anyone else who runs into this problem.

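    import re
    import pandas as pd

    EMDASH = '—'

    # Copy the raw file to a new one, replacing every em dash with a hyphen.
    # On Python 3 the encoding is passed explicitly so both files are
    # handled as UTF-16-LE, matching the read_csv call below.
    with open('scrubbed_file', 'wt', encoding='utf_16_le') as outfile:
        with open('original_file_location', 'rt', encoding='utf_16_le') as infile:
            for line in infile:
                outfile.write(re.sub(EMDASH, '-', line))

    df = pd.read_csv('scrubbed_file',
                     engine='python',
                     encoding='utf_16_le',
                     names=['Country', 'Date', 'Delivery', 'Region'],
                     delimiter='\t',
                     quotechar='"',
                     skiprows=2,
                     skipfooter=2,  # spelled skip_footer in older pandas
                     thousands=',')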

Hope this helps with anyone running into trouble with some problematic characters in their data frame.
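A variation on the same pre-processing idea, for anyone who would rather end up with a UTF-8 file: the sketch below decodes the source, swaps each em dash for a hyphen, and writes the result back out as UTF-8. The function name `scrub_em_dashes` and its parameters are hypothetical, and the default `src_encoding="utf-16"` assumes the raw file starts with a byte-order mark, which may not hold for other files.

```python
EM_DASH = "\u2014"  # the em dash character


def scrub_em_dashes(src_path, dst_path, src_encoding="utf-16", replacement="-"):
    """Copy src_path to dst_path as UTF-8, replacing every em dash.

    Hypothetical helper: src_encoding is an assumption about the input
    and should be changed to match the actual file.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as infile, \
         open(dst_path, "w", encoding="utf-8", newline="") as outfile:
        for line in infile:
            outfile.write(line.replace(EM_DASH, replacement))
```

The scrubbed file could then be loaded with something like `pd.read_csv(dst_path, encoding='utf-8', delimiter='\t')`, plus whichever of the other options from the call above still apply. Plain `str.replace` is used rather than `re.sub` because the pattern is a single literal character.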
