
I have a spark dataframe. I attempt to drop a column, but in some situations the column appears to still be there.

```python
my_range = spark.range(1000).toDF("number")
new_range = my_range.withColumn('num2', my_range.number * 2).drop('number')
# can still sort by "number" column
new_range.sort('number')
```

Is this a bug? Or am I missing something?

Spark version is v3.3.1, Python 3. I'm on a MacBook Pro M1 (2021).

2 Answers


I called .explain(True) on your sample dataset; let's take a look at the output:

```
== Parsed Logical Plan ==
'Sort ['number ASC NULLS FIRST], true
+- Project [num2#61L]
   +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
      +- Project [id#57L AS number#59L]
         +- Range (0, 1000, step=1, splits=Some(8))
```

The Parsed Logical Plan is the first, "raw" version of the query plan. Here you can see Project [num2#61L] before the sort: this is your drop.

But at the next stage (the Analyzed Logical Plan) it's different:

```
== Analyzed Logical Plan ==
num2: bigint
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [num2#61L, number#59L]
      +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
         +- Project [id#57L AS number#59L]
            +- Range (0, 1000, step=1, splits=Some(8))
```

Spark was smart enough to figure out that you need this column, so the projection before the sort now includes it. To stay consistent with your code, a new Project is added after the sort.

Now the last stage, the optimized logical plan:

```
== Optimized Logical Plan ==
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [(id#57L * 2) AS num2#61L, id#57L AS number#59L]
      +- Range (0, 1000, step=1, splits=Some(8))
```

In my opinion it's not a bug but Spark's design. Keep in mind that your code is executed within the same action, so thanks to Spark's lazy nature it is smart enough to adjust/optimize some of your code during planning.


3 Comments

I think you are right, but with a checkpoint it may not behave this way. The question is whether that's a valid approach.
Interesting example. For me this behaviour is still OK. A checkpoint breaks the lineage, so it's intuitive to me that Spark is unable to do the trick in this case; later on Spark will use what was persisted during the checkpoint.
we agree here...

The first answer is correct, and whether the Spark approach is a good design is open to debate; I think it is.

As an embellishment: a checkpoint, if used, will result in an error:

```python
spark.sparkContext.setCheckpointDir("/foo2/bar")
new_range = new_range.checkpoint()
new_range.sort('number').show()
```

returns:

```
AnalysisException: Column 'number' does not exist. Did you mean one of the following? [num2];
'Sort ['number ASC NULLS FIRST], true
+- LogicalRDD [num2#69L], false
```
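Relatedly, if you want code like this to fail fast regardless of lineage, a small defensive helper can check the schema first (the name `sort_if_present` is hypothetical; it relies only on `df.columns`, which already reflects the drop even when the optimizer could still resolve the column from lineage):

```python
def sort_if_present(df, col_name):
    # df.columns shows the post-drop schema, even when the optimizer
    # could still resolve the column from the DataFrame's lineage.
    if col_name not in df.columns:
        raise ValueError(f"Column {col_name!r} not found; have {df.columns}")
    return df.sort(col_name)
```

With the `new_range` from the question, `sort_if_present(new_range, 'number')` raises a ValueError instead of silently sorting on a dropped column.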

