2

I have a CSV with headings that I'd like to save as Parquet (actually a delta table)

The column headings have spaces in them, which parquet can't handle. How do I change spaces to underscores?

This is what I have so far, cobbled together from other SO posts:

from pyspark.sql.functions import *

df = spark.read.option("header", True).option("delimiter", "\u0001").option("inferSchema", True).csv("/mnt/landing/MyFile.TXT")
names = df.schema.names
for name in names:
    df2 = df.withColumnRenamed(name, regexp_replace(name, ' ', '_'))

When I run this, the final line gives me this error:

TypeError: Column is not iterable

I thought this would be a common requirement given that parquet can't handle spaces but it's quite difficult to find any examples.

1
  • can you try with select: df.select([col(a).alias(b) for a,b in zip(df.columns,[re.sub(" ","_",i) for i in df.columns])]) Commented Jun 24, 2020 at 12:47

3 Answers

1

You need to use the reduce function to apply the renaming iteratively, because in your code df2 ends up with only the last column renamed...

The code would look like the following (instead of the for loop):

from functools import reduce

df2 = reduce(lambda data, name: data.withColumnRenamed(name, name.replace(' ', '_')), names, df)
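The reduce call threads the dataframe through successive withColumnRenamed calls as a left fold. The same pattern can be checked without Spark by folding over a plain list of names (the list accumulator here is purely illustrative, standing in for the dataframe):

```python
from functools import reduce

names = ["id", "id a", "id b"]

# Fold each rename into the accumulator, just as reduce threads the
# dataframe through successive withColumnRenamed calls.
renamed = reduce(lambda acc, name: acc + [name.replace(' ', '_')], names, [])
print(renamed)  # ['id', 'id_a', 'id_b']
```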

6 Comments

Great. I just had to add from functools import reduce to the top of this. So far this is doing what I want - I'll just check out some of the others also
In this case is the reduce function accepting three parameters: the lambda function, names, and df?
and it seems like names is passed to the name parameter and df is passed to the data parameter? Trying to understand what's going on here
All the examples of reduce that I see take two parameters?
OK so there are some good examples here showing both ways of doing it (reduce and loop) medium.com/@mrpowers/…
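On the three-parameter question in the comments: reduce(function, iterable, initializer) takes an optional third argument, the initial accumulator value. The initializer (df in the answer) becomes the first accumulator, and each element of the iterable (names) is folded in after it. A minimal sketch with numbers:

```python
from functools import reduce

# reduce(function, iterable, initializer): the initializer seeds the
# accumulator, so the lambda is called as f(10, 1), then f(11, 2), f(13, 3).
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 10)
print(total)  # 16
```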
1

You are getting the exception because the function regexp_replace returns a Column, but withColumnRenamed expects a String:

def regexp_replace(e: org.apache.spark.sql.Column, pattern: String, replacement: String): org.apache.spark.sql.Column
def withColumnRenamed(existingName: String, newName: String): org.apache.spark.sql.DataFrame
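In other words, the rename has to operate on the column name as a plain Python string, so re.sub or str.replace (not the Column-returning regexp_replace) is the right tool here. A quick check:

```python
import re

# Column names are ordinary Python strings; transform them with re.sub
# or str.replace before handing them to withColumnRenamed.
name = "id a"
new_name = re.sub(" ", "_", name)
print(new_name)                            # id_a
print(name.replace(" ", "_") == new_name)  # True
```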

Comments

0

Use .toDF (or .select) and pass a list of columns to create a new dataframe.

df.show()
#+---+----+----+
#| id|id a|id b|
#+---+----+----+
#|  1|   a|   b|
#|  2|   c|   d|
#+---+----+----+

new_cols = list(map(lambda x: x.replace(" ", "_"), df.columns))
df.toDF(*new_cols).show()

df.select([col(s).alias(s.replace(' ', '_')) for s in df.columns]).show()
#+---+----+----+
#| id|id_a|id_b|
#+---+----+----+
#|  1|   a|   b|
#|  2|   c|   d|
#+---+----+----+
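The new_cols computation is pure Python, so it can be verified without a Spark session. A sketch, assuming the three column names shown above:

```python
# Stand-in for df.columns from the example output.
cols = ["id", "id a", "id b"]

# Same mapping the answer applies before .toDF(*new_cols).
new_cols = list(map(lambda x: x.replace(" ", "_"), cols))
print(new_cols)  # ['id', 'id_a', 'id_b']
```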

1 Comment

Thanks for your input. I haven't tried your answer but I'm sure I'll come back to it.
