How to make dictionary of column names in PySpark?

Question

I am receiving files, and in some of them the columns are named differently. For example:

In file 1, column names are: "studentID", "ADDRESS", "Phone_number".
In file 2, column names are: "Common_ID", "Common_Address", "Mobile_number".
In file 3, column names are: "S_StudentID", "S_ADDRESS", "HOME_MOBILE".

After loading the file data into dataframes, I want to pass a dictionary that maps each incoming column name to a standard name, like this:

StudentId -> STUDENT_ID
Common_ID -> STUDENT_ID
S_StudentID -> STUDENT_ID
ADDRESS -> S_ADDRESS
Common_Address -> S_ADDRESS
S_ADDRESS -> S_ADDRESS

The reason is that my next dataframe reads columns by the names "STUDENT_ID" and "S_ADDRESS". If it does not find those names in the dataframe, it throws an error for files whose column names are not standardized. I want to rename the columns in the dataframe above and then read the values from those files. One more question: when I run the new dataframe, will it pick up the data under the column names taken from the dictionary?

ZygD · Accepted Answer · 2022-11-02 17:16:03Z

You can have the dictionary as you want and use toDF with a list comprehension in order to rename the columns.

Input dataframe and column names:

    from pyspark.sql import functions as F

    df = spark.createDataFrame([], 'Common_ID string, ADDRESS string, COL3 string')
    print(df.columns)
    # ['Common_ID', 'ADDRESS', 'COL3']

Dictionary and toDF:

    dict_cols = {
        'StudentId': 'STUDENT_ID',
        'Common_ID': 'STUDENT_ID',
        'S_StudentID': 'STUDENT_ID',
        'ADDRESS': 'S_ADDRESS',
        'Common_Address': 'S_ADDRESS',
        'S_ADDRESS': 'S_ADDRESS'
    }
    df = df.toDF(*[dict_cols.get(c, c) for c in df.columns])

Resultant column names:

print(df.columns) # ['STUDENT_ID', 'S_ADDRESS', 'COL3']
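As a rough sketch of how the same lookup behaves on the three files from the question, the name mapping can be tried in plain Python without a Spark session. The helper `standardized_names` and its duplicate check are hypothetical additions, not part of the answer; the check is there because two source columns mapped to the same target would leave the dataframe with two columns of the same name.

```python
from collections import Counter

# Mapping copied from the answer above.
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS',
}


def standardized_names(columns, mapping):
    """Return the new name for every column; unmapped names pass through."""
    new_names = [mapping.get(c, c) for c in columns]
    # Hypothetical guard: refuse a result that contains the same name twice.
    duplicates = [name for name, n in Counter(new_names).items() if n > 1]
    if duplicates:
        raise ValueError(f"duplicate target column names: {duplicates}")
    return new_names


# Column lists taken from the question.
file1 = ["studentID", "ADDRESS", "Phone_number"]
file2 = ["Common_ID", "Common_Address", "Mobile_number"]
file3 = ["S_StudentID", "S_ADDRESS", "HOME_MOBILE"]

names1 = standardized_names(file1, dict_cols)
names2 = standardized_names(file2, dict_cols)
names3 = standardized_names(file3, dict_cols)

print(names1)  # ['studentID', 'S_ADDRESS', 'Phone_number']
print(names2)  # ['STUDENT_ID', 'S_ADDRESS', 'Mobile_number']
print(names3)  # ['STUDENT_ID', 'S_ADDRESS', 'HOME_MOBILE']

# With a real dataframe the result would be applied as:
#   df = df.toDF(*standardized_names(df.columns, dict_cols))
```

Note that "studentID" from file 1 is left unchanged, because the dictionary key is spelled "StudentId" and the lookup is case-sensitive. A later select on 'STUDENT_ID' would fail for that file unless the key is corrected.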

5 Comments

BigData Lover Over a year ago

Hi @Zygd, if the column in our file is named Common_ID and we read STUDENT_ID in our dataframe, will it also select the data, given that we are also renaming S_StudentID and StudentId to 'STUDENT_ID'?

ZygD Over a year ago

Hello. Sorry, I didn't get the question, if it is a question.

BigData Lover Over a year ago

Hi @Zygd, what I am asking is this: suppose we get a file with the column name 'COMMON_ID', and we have a query that selects 'STUDENT_ID', as in df = df.select('STUDENT_ID'). We are converting COMMON_ID -> 'STUDENT_ID', and our dictionary has 3 keys that map to STUDENT_ID. When I use your code and create the dictionary, will it also be able to read the data from that column?

BigData Lover Over a year ago

My point is that my select statement is fixed, df = df.select('STUDENT_ID'), while the files arrive with different names that we map to STUDENT_ID in our dictionary. When I run df = df.select('STUDENT_ID'), will it read the data and ignore the other two key-value pairs that we don't need?

ZygD Over a year ago

The dictionary can have as many mappings as you need. The dictionary is not looped over. Keys in a dictionary must be unique, so when you pass your column name you get just one answer. You never touch the other dictionary values: when you look up a key, only that key is touched and its corresponding value is returned. So yes, the other dictionary entries are ignored, if you will. It is also quite easy to test: create some example dataframes and run some simple queries to see for yourself.
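Because each lookup touches only the exact key passed in, a column whose spelling differs only in letter case will not match. A minimal sketch of one way to relax that, assuming case-insensitive matching is acceptable for these files; the `lookup` dictionary and the `standardize_case_insensitive` helper are hypothetical names, not something from the answer:

```python
# Mapping copied from the answer.
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS',
}

# Hypothetical normalisation step: index the mapping by lower-cased key.
lookup = {key.lower(): value for key, value in dict_cols.items()}


def standardize_case_insensitive(columns):
    """Map each column through `lookup`, ignoring letter case."""
    return [lookup.get(c.lower(), c) for c in columns]


# File 1 from the question spells the ID column "studentID".
file1 = ["studentID", "ADDRESS", "Phone_number"]

exact = [dict_cols.get(c, c) for c in file1]
relaxed = standardize_case_insensitive(file1)

print(exact)    # ['studentID', 'S_ADDRESS', 'Phone_number']
print(relaxed)  # ['STUDENT_ID', 'S_ADDRESS', 'Phone_number']
```

With the exact lookup, file 1 keeps "studentID" and a fixed select on 'STUDENT_ID' would not find it. The lower-cased lookup maps it as intended, at the cost of treating names that differ only by case as the same column.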

wwnde · Accepted Answer · 2022-11-02 23:08:13Z

Use a dict and a list comprehension. An easier way, which also works when some of the columns are not in the dictionary, is:

    df.toDF(*[dict_cols[x] if x in dict_cols else x for x in df.columns]).show()
    +----------+---------+----+
    |STUDENT_ID|S_ADDRESS|COL3|
    +----------+---------+----+
    +----------+---------+----+
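For comparison, the conditional expression here and the `dict_cols.get(c, c)` form from the first answer appear to produce the same list of names. A small plain-Python check, using the column names of the example dataframe and assuming the same `dict_cols` mapping:

```python
# Mapping copied from the first answer.
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS',
}

# Column names of the example dataframe.
columns = ['Common_ID', 'ADDRESS', 'COL3']

# Form used in the first answer: fall back to the original name.
with_get = [dict_cols.get(c, c) for c in columns]

# Form used in this answer: explicit membership test.
with_conditional = [dict_cols[c] if c in dict_cols else c for c in columns]

print(with_get)                      # ['STUDENT_ID', 'S_ADDRESS', 'COL3']
print(with_get == with_conditional)  # True
```

Both forms leave 'COL3' untouched because it has no entry in the dictionary.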
