
I am receiving files, and in some of them the columns are named differently. For example:

  1. In file 1, column names are: "studentID", "ADDRESS", "Phone_number".
  2. In file 2, column names are: "Common_ID", "Common_Address", "Mobile_number".
  3. In file 3, column names are: "S_StudentID", "S_ADDRESS", "HOME_MOBILE".

After loading the file data into dataframes, I want to pass a dictionary that maps each incoming column name to a standard name, like this:

    StudentId -> STUDENT_ID
    Common_ID -> STUDENT_ID
    S_StudentID -> STUDENT_ID
    ADDRESS -> S_ADDRESS
    Common_Address -> S_ADDRESS
    S_ADDRESS -> S_ADDRESS

The reason for doing this is that my next dataframe reads columns by the names "STUDENT_ID" and "S_ADDRESS". If it does not find those names in the dataframe, it will throw an error for files whose column names are not standardized. I want to run that dataframe and get values from those files after renaming the columns as above. One more question: when I run the new dataframe, will it pick up the column under the name from the dictionary together with the data in it?

2 Answers


You can define the dictionary exactly as you described and use toDF with a list comprehension to rename the columns.

Input dataframe and column names:

    from pyspark.sql import functions as F

    df = spark.createDataFrame([], 'Common_ID string, ADDRESS string, COL3 string')
    print(df.columns)
    # ['Common_ID', 'ADDRESS', 'COL3']

Dictionary and toDF:

    dict_cols = {
        'StudentId': 'STUDENT_ID',
        'Common_ID': 'STUDENT_ID',
        'S_StudentID': 'STUDENT_ID',
        'ADDRESS': 'S_ADDRESS',
        'Common_Address': 'S_ADDRESS',
        'S_ADDRESS': 'S_ADDRESS'
    }
    # Look up each column name; names missing from the dictionary stay unchanged
    df = df.toDF(*[dict_cols.get(c, c) for c in df.columns])

Resultant column names:

    print(df.columns)
    # ['STUDENT_ID', 'S_ADDRESS', 'COL3']
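To see how the same dictionary behaves for the other files in the question, here is a minimal sketch that works on plain lists of column names, so it can be run without Spark. The helper name `rename_columns` is hypothetical and not part of PySpark; the column lists are copied from the question.

```python
# Mapping from the question: raw column name -> standard column name.
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS',
}

def rename_columns(columns, mapping):
    """Hypothetical helper: names missing from the mapping pass through unchanged."""
    return [mapping.get(c, c) for c in columns]

file2_cols = ['Common_ID', 'Common_Address', 'Mobile_number']
file3_cols = ['S_StudentID', 'S_ADDRESS', 'HOME_MOBILE']

print(rename_columns(file2_cols, dict_cols))
# ['STUDENT_ID', 'S_ADDRESS', 'Mobile_number']
print(rename_columns(file3_cols, dict_cols))
# ['STUDENT_ID', 'S_ADDRESS', 'HOME_MOBILE']
```

With a real dataframe, the returned list would be passed on as `df.toDF(*rename_columns(df.columns, dict_cols))`. The phone columns keep their original names because the question's dictionary has no entries for them.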

5 Comments

Hi @Zygd, if the column in our file is named Common_ID and we read Student_ID in our dataframe, will it also select the data from that column, given that we are also renaming S_StudentID and StudentId to 'STUDENT_ID'?
Hello. Sorry, I didn't get the question if it is a question.
Hi @Zygd, what I am asking is this: suppose we get a file with the column name 'COMMON_ID', and we have a query that selects 'STUDENT_ID', as in df=df.select('STUDENT_ID'). We are converting COMMON_ID -> 'STUDENT_ID', and our dictionary has 3 keys that all map to STUDENT_ID. When I use your code and create the dictionary, will it also be able to read the data from that column?
My select statement will be fixed as df=df.select('STUDENT_ID'), while the files we receive use different names that we map to STUDENT_ID in our dictionary. When I run df=df.select('STUDENT_ID'), will it read the data and ignore the other two key-value pairs, which we don't need?
A dictionary can have as many mappings as you need. The dictionary is not looped over. Keys in a dictionary must be unique, so when you pass your column name, you get just one answer. You never touch the other dictionary values. When you access the dictionary to find a matching value from a key-value pair, only the key that you need is touched and the corresponding value is returned. So, yes, the other dictionary entries will be neglected, if you will. But it is quite easy to test: create some example dataframes and run some simple queries to see for yourself.
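One way to test this without Spark is to check which required names would still be missing after the rename. This is only a sketch: the helper `missing_after_rename` and the `required` list are assumptions made for illustration, and the column lists are copied from the question.

```python
# Mapping from the question: raw column name -> standard column name.
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS',
}

# Names the fixed select statement expects (assumed from the question).
required = ['STUDENT_ID', 'S_ADDRESS']

def missing_after_rename(columns, mapping, required):
    """Hypothetical helper: required names absent once the mapping is applied."""
    renamed = {mapping.get(c, c) for c in columns}
    return [r for r in required if r not in renamed]

file1_cols = ['studentID', 'ADDRESS', 'Phone_number']
file2_cols = ['Common_ID', 'Common_Address', 'Mobile_number']

print(missing_after_rename(file2_cols, dict_cols, required))
# []
print(missing_after_rename(file1_cols, dict_cols, required))
# ['STUDENT_ID']
```

File 2 resolves cleanly, and the unused keys play no part in the lookup. File 1 still lacks STUDENT_ID because its column is spelled "studentID" while the dictionary key is "StudentId", and dictionary lookups are case-sensitive.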

Use a dict and a list comprehension. A simple approach, which also works when some of the columns are not in the dictionary, is:

    df.toDF(*[dict_cols[x] if x in dict_cols else x for x in df.columns]).show()
    +----------+---------+----+
    |STUDENT_ID|S_ADDRESS|COL3|
    +----------+---------+----+
    +----------+---------+----+
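Note that file 1 in the question uses "studentID" while the dictionary key is "StudentId", so an exact lookup leaves that column unrenamed. A possible workaround, sketched here as an assumption rather than something either answer proposes, is to lowercase both the keys and the column names before the lookup. The names `lookup` and `rename_case_insensitive` are hypothetical.

```python
# Mapping from the question: raw column name -> standard column name.
dict_cols = {
    'StudentId': 'STUDENT_ID',
    'Common_ID': 'STUDENT_ID',
    'S_StudentID': 'STUDENT_ID',
    'ADDRESS': 'S_ADDRESS',
    'Common_Address': 'S_ADDRESS',
    'S_ADDRESS': 'S_ADDRESS',
}

# Lowercased copy of the mapping, built once, so the lookup ignores case.
lookup = {key.lower(): value for key, value in dict_cols.items()}

def rename_case_insensitive(columns, lookup):
    """Hypothetical helper: match names ignoring case, keep unknown names as-is."""
    return [lookup.get(c.lower(), c) for c in columns]

file1_cols = ['studentID', 'ADDRESS', 'Phone_number']

print(rename_case_insensitive(file1_cols, lookup))
# ['STUDENT_ID', 'S_ADDRESS', 'Phone_number']
```

This only helps when keys differ by case alone. If two keys that differ only by case were mapped to different standard names, the lowercased dictionary would silently keep just one of them.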
