I have a pandas dataframe, which I am creating from an excel (.xlsx) file. Using df = pd.read_excel()
One of the column names in the excel document is: IN - Original ex factory date.
I am trying to remove all special characters and white spaces as I will be saving this file in Avro format in Spark.
I am doing the following to clean the column names:
df.columns = df.columns.str.replace('(', '') df.columns = df.columns.str.replace(')', '') df.columns = df.columns.str.replace('=', '') df.columns = df.columns.str.replace('-', '_') df.columns = df.columns.str.replace('/', '_') df.columns = df.columns.str.replace('Â', '__') df.columns = df.columns.str.replace(' ', '_') After the above transformations, I am left with the above column name as:
IN___Original__ ex_factory_date
It doesn't remove the whitespace before ex_factory_date. And you can see all the other whitespaces are replaced by _.
I even tried to use unicodedata
def remove_accents(input_str): nfkd_form = unicodedata.normalize('NFKD', input_str) only_ascii = nfkd_form.encode('ASCII', 'ignore') return only_ascii for col in df.columns: print(col) neutral = remove_accents(col) print(normal) It is returning the same thing: IN___Original__ ex_factory_date
Is there something I am missing as to why this wihtespace is not removed? I need a clean column name.
df.columns = df.columns.str.replace(' ', '_')