I have read the existing posts on removing non-ASCII characters from a string in Python. My problem is that when I apply the same approach to a DataFrame that I read from a CSV file, it doesn't work. Any idea why?

import pandas as pd
import numpy as np
import re
import string
import unicodedata

def preprocess(x):
    # Convert to unicode
    text = unicode(x, "utf8")
    # Convert back to ascii
    x = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
    return x

preprocess("Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG")

'Ludwig Maximilian University of Munich / Munchen (LMU) and Siemens AG'

df = pd.DataFrame(["Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"])
df.columns = ['text']
df['text'] = df['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
df['text'][0]

'Ludwig Maximilian University of Munich / Munchen (LMU) and Siemens AG'

df1 = pd.read_csv('sample.csv')
df1['text'] = df1['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
df1['text'][0]

'Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG'

Note that df1 looks exactly like df when displayed.

1 Answer

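It's because the file contains the escape sequences as literal characters (a backslash followed by x, c, 3, and so on), and pandas reads them as-is, the way Python treats a raw string. It's essentially equivalent to: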

df = pd.DataFrame({"text": [r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"]}) 
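The difference between the two forms can be checked directly. This is only a small illustrative sketch; the variable names `interpreted` and `literal` are made up for the example and are not from the question:

```python
# In a normal string literal, Python turns each \xNN escape into one character.
interpreted = "M\xc3\xbcnchen"

# In a raw string (and in text read from a file), the backslashes stay as-is.
literal = r"M\xc3\xbcnchen"

print(len(interpreted))    # 8
print(len(literal))        # 14
print("\\" in interpreted) # False
print("\\" in literal)     # True
```

If the values in the DataFrame column contain actual backslashes, as `literal` does, they are in the second form and need to be unescaped before any normalization can see the accented character.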

To get the normalization to work properly, you'd have to process the escaped string. Just modify your preprocess function:

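def preprocess(x):
    # Interpret the literal \xNN sequences as real bytes (Python 2 only)
    decoded = x.decode('string_escape')
    # Decode those bytes as UTF-8
    text = unicode(decoded, 'utf8')
    # Decompose accented characters and drop anything that is not ASCII
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')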

It should work afterwards:

>>> df = pd.DataFrame({"text": [r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"]})
>>> df
                                                text
0  Ludwig Maximilian University of Munich / M\xc3...
>>> df['text'] = df['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
>>> df
                                                text
0  Ludwig Maximilian University of Munich / Munch...
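
The code above is Python 2: neither `unicode()` nor the `string_escape` codec exists in Python 3. A rough Python 3 equivalent might look like the sketch below. The name `preprocess_py3` is hypothetical, and the sketch assumes the input is pure ASCII text containing literal `\xNN` escapes, as in the example; input that already holds real non-ASCII characters would raise an error at the first step.

```python
import unicodedata


def preprocess_py3(x):
    # Assumption: x is ASCII-only text with literal \xNN escape sequences.
    # unicode_escape turns each \xNN into the character with that code point.
    unescaped = x.encode("ascii").decode("unicode_escape")
    # Map those code points back to single bytes, then decode them as UTF-8.
    text = unescaped.encode("latin-1").decode("utf-8")
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_bytes = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
    return ascii_bytes.decode("ascii")


raw = r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"
print(preprocess_py3(raw))
```

It can be applied to the column with the same `apply` call as before, with `preprocess_py3` in place of `preprocess`.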