
Since I was unable to find a one-stop answer to this problem, I am posting my solution after learning from different threads:

I am importing data using pandas as follows

import pandas as pd
data = pd.read_csv(".../file.csv", encoding='utf8')

This resulted in the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 352: invalid start byte
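For context (not in the original thread): 0x92 is not a valid start byte in UTF-8, but in Windows-1252 it is the right single quotation mark, a common culprit in CSVs exported from Windows tools. A minimal sketch of what is going on:

```python
raw = b"\x92"

# Decoding as UTF-8 fails, reproducing the error above:
try:
    raw.decode("utf8")
except UnicodeDecodeError as e:
    print(e)

# Decoding as Windows-1252 succeeds and yields a curly apostrophe:
print(raw.decode("cp1252"))  # → ’
```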

To counter this, I changed the encoding to Latin-1:

data = pd.read_csv(".../file.csv", encoding='Latin-1')

This resulted in the following error when trying to apply vectorizer.fit_transform():

ValueError: np.nan is an invalid document, expected byte or unicode string 
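The reason (my reading, not stated in the thread): with Latin-1 every byte decodes, so the read succeeds, but any empty CSV field is parsed as a float NaN rather than a string, and the vectorizer rejects NaN as a document. A minimal sketch with invented data:

```python
import io
import pandas as pd

# Hypothetical CSV with one empty 'desc' field:
csv = "desc,label\nfirst doc,1\n,2\nthird doc,3\n"
data = pd.read_csv(io.StringIO(csv))

# The empty field comes back as NaN, not as an empty string:
print(data["desc"].tolist())
```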
  • Thanks for your suggestion, @aryamccarthy; the error was creeping in because of encoding issues. Commented May 8, 2017 at 19:22

1 Answer


Import the data using 'Latin-1' encoding:

data = pd.read_csv(".../file.csv", encoding='Latin-1')

Next, execute the vectorizer.fit_transform() step as follows:

vectorizer.fit_transform(train['desc'].values.astype('U'))  # here train is the dataset and 'desc' is the text column key

This should resolve the issue.
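One caveat worth knowing (my addition, not from the answer): .astype('U') turns NaN into the literal string 'nan', which the vectorizer will count as a token. A sketch with a hypothetical stand-in for train['desc'], including a common alternative of filling NaN with empty strings first:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value:
desc = pd.Series(["first doc", np.nan, "third doc"])

# .astype('U') casts everything to unicode strings; NaN becomes 'nan':
print(desc.values.astype("U"))

# Alternative: replace NaN with an empty string before vectorizing,
# so the missing rows contribute no tokens at all:
print(desc.fillna("").values)
```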


