
Since I was unable to find a one-stop answer to this problem, I am posting my solution after learning from different threads:

I am importing data using pandas as follows

import pandas as pd
data = pd.read_csv(".../file.csv", encoding='utf8')

This resulted in the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 352: invalid start byte
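For context (not in the original thread): 0x92 is not a valid start byte in UTF-8, but in Windows-1252 it is the right single quotation mark, a common culprit in CSVs exported from Windows tools. A minimal sketch of what is going on:

```python
raw = b"\x92"

# Decoding as UTF-8 fails, reproducing the error above:
try:
    raw.decode("utf8")
except UnicodeDecodeError as e:
    print(e)

# Decoding as Windows-1252 succeeds and yields a curly apostrophe:
print(raw.decode("cp1252"))  # → ’
```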

To counter this, I changed the encoding to Latin-1:

data = pd.read_csv(".../file.csv", encoding='Latin-1')

This resulted in the following error when trying to apply vectorizer.fit_transform():

ValueError: np.nan is an invalid document, expected byte or unicode string 
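The reason (my reading, not stated in the thread): with Latin-1 every byte decodes, so the read succeeds, but any empty CSV field is parsed as a float NaN rather than a string, and the vectorizer rejects NaN as a document. A minimal sketch with invented data:

```python
import io
import pandas as pd

# Hypothetical CSV with one empty 'desc' field:
csv = "desc,label\nfirst doc,1\n,2\nthird doc,3\n"
data = pd.read_csv(io.StringIO(csv))

# The empty field comes back as NaN, not as an empty string:
print(data["desc"].tolist())
```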
  • Thanks for your suggestion, @aryamccarthy; the error was creeping in because of encoding issues. Commented May 8, 2017 at 19:22

1 Answer


Import the data using 'Latin-1' encoding:

data = pd.read_csv(".../file.csv", encoding='Latin-1')

Next, execute the vectorizer.fit_transform() step as follows:

vectorizer.fit_transform(train['desc'].values.astype('U'))  # here train is the dataset and 'desc' is the text column key

This should resolve the issue.
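One caveat worth knowing (my addition, not from the answer): .astype('U') turns NaN into the literal string 'nan', which the vectorizer will count as a token. A sketch with a hypothetical stand-in for train['desc'], including a common alternative of filling NaN with empty strings first:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value:
desc = pd.Series(["first doc", np.nan, "third doc"])

# .astype('U') casts everything to unicode strings; NaN becomes 'nan':
print(desc.values.astype("U"))

# Alternative: replace NaN with an empty string before vectorizing,
# so the missing rows contribute no tokens at all:
print(desc.fillna("").values)
```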


