
I am ingesting messages into a pandas DataFrame and attempting to run some machine learning functions on the data. When I run a tokenisation function I get a KeyError whose message basically spits out the content of one of the messages. Looking at that string, UTF-8 byte escapes appear in it, such as \xe2\x80\xa8 (the U+2028 line separator) and \xe2\x82\xac (the Euro currency sign).

  1. Is this the cause of the error?

  2. Why aren't these symbols kept as they appear in the original messages, or in the DataFrame?

    # coding=utf-8
    from __future__ import print_function
    import sys
    reload(sys)
    sys.setdefaultencoding("utf8")
    import os
    import re  # needed by tokenize_only; missing from the original post
    import pandas as pd

    path = '//directory1//'
    data = []
    for f in [f for f in os.listdir(path) if not f.startswith('.')]:
        with open(path + f, "r") as myfile:
            data.append(myfile.read().replace('\n', ' '))
    df = pd.DataFrame(data, columns=["message"])
    df["label"] = "1"

    path = '//directory2//'
    data = []
    for f in [f for f in os.listdir(path) if not f.startswith('.')]:
        with open(path + f, "r") as myfile:
            data.append(myfile.read().replace('\n', ' '))
    df2 = pd.DataFrame(data, columns=["message"])
    df2["label"] = "0"

    messages = pd.concat([df, df2], ignore_index=True)

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer

    stopwords = nltk.corpus.stopwords.words('english')

    def tokenize_only(text):
        # first tokenize by sentence, then by word, to ensure that
        # punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text)
                  for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters
        # (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        return filtered_tokens

    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                       min_df=0.2, stop_words='english',
                                       use_idf=True, tokenizer=tokenize_only,
                                       ngram_range=(1, 2))  # analyzer = word

    tfidf_matrix = tfidf_vectorizer.fit_transform(messages.message)  # fit the vectorizer to the corpora
    terms = tfidf_vectorizer.get_feature_names()

    totalvocab_tokenized = []
    for i in messages.message:
        # x = messages.message[i].decode('utf-8')
        x = unicode(messages.message[i], errors="replace")
        allwords_tokenized = tokenize_only(x)
        totalvocab_tokenized.extend(allwords_tokenized)

    vocab_frame = pd.DataFrame({'words': totalvocab_tokenized})
    print(vocab_frame)

I tried decoding each message to UTF-8, tried converting it to unicode, and also tried running without those two lines in the last for loop, but I keep getting the error.

Any ideas?

Thanks!

1 Answer

  1. It looks like you're printing a repr() of the data. If the UTF-8 bytes can't be rendered, Python escapes them. Print the actual string or Unicode object instead (see the sketch after this list).

  2. Get rid of the sys.setdefaultencoding("utf8") and reload(sys) calls - they only mask issues. If you then get new exceptions, let's investigate those.

  3. Open your text files with automatic decoding. Assuming your input is UTF-8:

    with io.open(path+f, "r", encoding="utf-8") as myfile: 
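For illustration, here is a minimal, self-contained sketch combining points 1 and 3. It is Python 2, to match the reload(sys) and unicode() calls in the question, and it reuses the hypothetical //directory1// path from the question; the example strings in the comments are made up:

    # -*- coding: utf-8 -*-
    from __future__ import print_function
    import io
    import os

    path = '//directory1//'  # hypothetical path, reused from the question

    data = []
    for f in [f for f in os.listdir(path) if not f.startswith('.')]:
        # io.open decodes the file while reading, so each message becomes
        # a unicode object rather than a raw UTF-8 byte string
        with io.open(path + f, "r", encoding="utf-8") as myfile:
            data.append(myfile.read().replace('\n', ' '))

    # Point 1: printing a container shows the repr() of its elements,
    # which escapes non-ASCII characters:
    print(data)             # e.g. [u'price: \u20ac5']
    # Printing the string itself renders the characters (assuming the
    # terminal encoding can represent them):
    for message in data:
        print(message)      # e.g. price: €5

Once decoding happens at the file boundary like this, the unicode(..., errors="replace") call and the sys.setdefaultencoding hack in the question become unnecessary: everything downstream is already text.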

1 Comment

Thanks for this, it seems to have worked! And good advice regarding sys.setdefaultencoding('utf-8'): I wasn't really aware of how it does what it does; I had just read that it could remove the errors.
