My pyspark version is 2.1.1. I am trying to join two dataframes (left outer) having two columns id and priority. I am creating my dataframes like this:
a = "select 123 as id, 1 as priority" a_df = spark.sql(a) b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority" b_df = spark.sql(b) c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority) c_df schema is coming as DataFrame[uid: int, priority: int, uid: int, priority: int]
The drop function is not removing the columns.
But if I try to do:
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority) Then priority column for a_df gets dropped.
Not sure if there is a version change issue or something else, but it feels very weird that drop function will behave like this.
I know the workaround can be to remove the unwanted columns first, and then do the join. But still not sure why drop function is not working?
Thanks in advance.