I have encountered an issue with em dashes in my raw CSV data file that prevents pandas from reading it.
I ran a few variations, shown below.

```python
datalocation = filepath
df = pd.read_csv(datalocation)
```

Received the error:

```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)
```
Other variations include:

```python
df = pd.read_csv(datalocation, encoding='utf-8')
df = pd.read_csv(datalocation, encoding='utf-16')
```

Received the error:

```
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
```
```python
df = pd.read_csv(datalocation, na_values=['—'])
```

Received the error:

```
Error: line contains NULL byte
```
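(For context, one thing I tried while diagnosing this: peeking at the file's first bytes to see whether it starts with a byte-order mark. The helper name `sniff_bom` is my own; this is just a sketch.)

```python
import codecs

def sniff_bom(path):
    """Guess an encoding from the file's byte-order mark, if it has one."""
    with open(path, 'rb') as f:          # read raw bytes, no decoding
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'                  # 0xFF 0xFE or 0xFE 0xFF at the start
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'               # 0xEF 0xBB 0xBF at the start
    return None                          # no BOM found
```

A leading `0xff` byte (as in the error above) would suggest a UTF-16 BOM, and a `0xef` byte a UTF-8 BOM.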
If the read were successful, the dataframe should look similar to the sample table below.
```
+---------+------+----------+--------+
| Country | Date | Delivery | Region |
+---------+------+----------+--------+
| a       | —    | 10       | foo    |
| b       | —    | 30       | —      |
| c       | 2    | —50      | foo—   |
| —       | —    | 20       | —bar   |
| a       | —    | 40       | bar—   |
| —       | —    | —6—      | bar    |
| b       | —    | 90—      | foo    |
| c       | —    | 70       | bar    |
| a       | —    | 80       | foo    |
| c       | —    | 100      | foo—   |
+---------+------+----------+--------+
```

After spending time researching related questions on SO, I understand this has to do with some conflict among Unicode/UTF-8/ASCII encodings.
Is there a way to remove all the em dashes prior to running `pd.read_csv`? Keep in mind that I don't know which cells of the raw CSV file contain em dashes.
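One approach I was considering, as a sketch: decode the whole file in Python first, strip the em dashes from the text, and hand the cleaned result to pandas. The `'utf-16'` default here is only my guess based on the `0xff` and NULL-byte errors above; the helper name is hypothetical.

```python
import io
import pandas as pd

def read_csv_without_em_dashes(path, encoding='utf-16'):
    """Decode the file, drop every em dash, then parse the cleaned text."""
    with open(path, encoding=encoding) as f:
        text = f.read().replace('\u2014', '')   # U+2014 EM DASH
    return pd.read_csv(io.StringIO(text))
```

Note this would turn a cell like `—50` into `50` and a lone `—` into an empty cell (which pandas reads as NaN); I'm not sure whether that is the right treatment for every column.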
Or would `encoding='iso-8859-1'` work?