Pandas dataframe column name not formatting as expected

Question

I have a pandas dataframe that I am creating from an Excel (.xlsx) file using df = pd.read_excel().

One of the column names in the excel document is: IN - OriginalÂ ex factory date.

I am trying to remove all special characters and white spaces as I will be saving this file in Avro format in Spark.

I am doing the following to clean the column names:

df.columns = df.columns.str.replace('(', '')
df.columns = df.columns.str.replace(')', '')
df.columns = df.columns.str.replace('=', '')
df.columns = df.columns.str.replace('-', '_')
df.columns = df.columns.str.replace('/', '_')
df.columns = df.columns.str.replace('Â', '__')
df.columns = df.columns.str.replace(' ', '_')

After the above transformations, I am left with the above column name as:

IN___Original__ ex_factory_date

The whitespace before ex_factory_date is not removed, even though every other space has been replaced by _.

I even tried to use unicodedata

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

for col in df.columns:
    print(col)
    neutral = remove_accents(col)
    print(neutral)

It is returning the same thing: IN___Original__ ex_factory_date

Is there something I am missing as to why this whitespace is not removed? I need a clean column name.

It may be a double space. Try using a double space in df.columns = df.columns.str.replace(' ', '_')
– Mahery Ranaivoson, Commented Feb 19, 2021 at 13:46
@MaheryRanaivoson It is not a double space; I checked thoroughly. It's a single space, but somehow it is not replaced by the regex.
– aki2all, Commented Feb 19, 2021 at 13:56

VirtualScooter · Accepted Answer · 2021-02-19 13:56:07Z

Most likely, your space is not the ASCII space character, but rather one of the many Unicode whitespace characters (see this extensive answer https://stackoverflow.com/a/37903645/5660315). This code should replace your whitespace:

# Pass regex=True explicitly: since pandas 2.0, .str.replace treats the pattern as a literal string by default
df.columns = df.columns.str.replace(r'\s', '_', regex=True)
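To confirm which character is actually hiding in the header, and to clean the name in one pass, something along these lines might help. This is only a sketch: the helper names describe_non_ascii and clean_column_name are made up for illustration, and the sample string assumes the stray character is a no-break space (U+00A0) sitting right after the Â, which is a common result of mis-decoded text but is not confirmed by the question.

```python
import re
import unicodedata


def describe_non_ascii(name):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in name
        if ord(ch) > 127
    ]


def clean_column_name(name):
    """Replace each run of characters outside [0-9A-Za-z] with one underscore."""
    # This catches ASCII spaces, punctuation, the visible Â and any Unicode
    # whitespace alike, so no per-character replace calls are needed.
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")


# Assumed header: "\u00c2" is the visible Â, "\u00a0" the hidden no-break space.
raw = "IN - Original\u00c2\u00a0ex factory date"

print(describe_non_ascii(raw))  # lists the Â and the no-break space
print(clean_column_name(raw))   # IN_Original_ex_factory_date
```

With a dataframe, the same helper could be applied as df.columns = [clean_column_name(c) for c in df.columns]. Note that this collapses runs of separators into a single underscore, so the result differs from the chained replace calls in the question, which keep one underscore per replaced character.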
