
I am ingesting messages into a pandas DataFrame and attempting to run some machine learning functions on the data. When I run a tokenisation function I get a KeyError whose message basically spits out the content of one of the messages. Looking at that string, UTF-8 byte escapes appear in it, such as \xe2\x80\xa8 (the U+2028 line separator) and \xe2\x82\xac (the Euro currency sign).

  1. Is this the cause of the error?

  2. Why aren't these symbols kept as they appear in the original messages, or in the DataFrame?

    # coding=utf-8
    from __future__ import print_function
    import sys
    reload(sys)
    sys.setdefaultencoding("utf8")
    import os
    import re  # needed by tokenize_only; missing from the original post
    import pandas as pd

    path = '//directory1//'
    data = []
    for f in [f for f in os.listdir(path) if not f.startswith('.')]:
        with open(path + f, "r") as myfile:
            data.append(myfile.read().replace('\n', ' '))
    df = pd.DataFrame(data, columns=["message"])
    df["label"] = "1"

    path = '//directory2//'
    data = []
    for f in [f for f in os.listdir(path) if not f.startswith('.')]:
        with open(path + f, "r") as myfile:
            data.append(myfile.read().replace('\n', ' '))
    df2 = pd.DataFrame(data, columns=["message"])
    df2["label"] = "0"

    messages = pd.concat([df, df2], ignore_index=True)

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer

    stopwords = nltk.corpus.stopwords.words('english')

    def tokenize_only(text):
        # first tokenize by sentence, then by word, to ensure that
        # punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text)
                  for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters
        # (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        return filtered_tokens

    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                       min_df=0.2, stop_words='english',
                                       use_idf=True, tokenizer=tokenize_only,
                                       ngram_range=(1, 2))  # analyzer = word

    tfidf_matrix = tfidf_vectorizer.fit_transform(messages.message)  # fit the vectorizer to the corpora
    terms = tfidf_vectorizer.get_feature_names()

    totalvocab_tokenized = []
    for i in messages.message:
        # x = messages.message[i].decode('utf-8')
        x = unicode(messages.message[i], errors="replace")
        allwords_tokenized = tokenize_only(x)
        totalvocab_tokenized.extend(allwords_tokenized)

    vocab_frame = pd.DataFrame({'words': totalvocab_tokenized})
    print(vocab_frame)

I tried decoding each message to UTF-8, tried converting it to unicode, and also tried running without those two lines in the last for loop, but I keep getting the error.

Any ideas?

Thanks!

1 Answer

  1. It looks like you're printing a repr() of the data. If the UTF-8 bytes can't be rendered, Python escapes them. Print the actual string or Unicode object instead (see the sketch after this list).

  2. Get rid of the sys.setdefaultencoding("utf8") and reload(sys) calls - they only mask issues. If you then get new exceptions, let's investigate those.

  3. Open your text files with automatic decoding. Assuming your input is UTF-8:

    with io.open(path+f, "r", encoding="utf-8") as myfile: 
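For illustration, here is a minimal, self-contained sketch combining points 1 and 3. It is Python 2, to match the reload(sys) and unicode() calls in the question, and it reuses the hypothetical //directory1// path from the question; the example strings in the comments are made up:

    # -*- coding: utf-8 -*-
    from __future__ import print_function
    import io
    import os

    path = '//directory1//'  # hypothetical path, reused from the question

    data = []
    for f in [f for f in os.listdir(path) if not f.startswith('.')]:
        # io.open decodes the file while reading, so each message becomes
        # a unicode object rather than a raw UTF-8 byte string
        with io.open(path + f, "r", encoding="utf-8") as myfile:
            data.append(myfile.read().replace('\n', ' '))

    # Point 1: printing a container shows the repr() of its elements,
    # which escapes non-ASCII characters:
    print(data)             # e.g. [u'price: \u20ac5']
    # Printing the string itself renders the characters (assuming the
    # terminal encoding can represent them):
    for message in data:
        print(message)      # e.g. price: €5

Once decoding happens at the file boundary like this, the unicode(..., errors="replace") call and the sys.setdefaultencoding hack in the question become unnecessary: everything downstream is already text.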

1 Comment

Thanks for this, it seems to have worked! And good advice regarding sys.setdefaultencoding('utf-8'): I wasn't really aware of how it does what it does; I had just read that it could remove the errors.
