
I have a spark dataframe. I attempt to drop a column, but in some situations the column appears to still be there.

```python
my_range = spark.range(1000).toDF("number")
new_range = my_range.withColumn('num2', my_range.number * 2).drop('number')
# can still sort by "number" column
new_range.sort('number')
```

Is this a bug? Or am I missing something?

Spark version is v3.3.1, Python 3. I'm on a MacBook Pro M1 (2021).

2 Answers


I called .explain(True) on your sample dataset; let's take a look at the output:

```
== Parsed Logical Plan ==
'Sort ['number ASC NULLS FIRST], true
+- Project [num2#61L]
   +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
      +- Project [id#57L AS number#59L]
         +- Range (0, 1000, step=1, splits=Some(8))
```

The Parsed Logical Plan is the first, "raw" version of the query plan. Here you can see Project [num2#61L] before the sort: this is your drop.

But at the next stage (the Analyzed Logical Plan) it's different:

```
== Analyzed Logical Plan ==
num2: bigint
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [num2#61L, number#59L]
      +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
         +- Project [id#57L AS number#59L]
            +- Range (0, 1000, step=1, splits=Some(8))
```

Spark was smart enough to figure out that you need this column, so the projection before the sort now includes it. To stay consistent with your code, a new Project is added after the sort.

Now the last stage, the optimized logical plan:

```
== Optimized Logical Plan ==
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [(id#57L * 2) AS num2#61L, id#57L AS number#59L]
      +- Range (0, 1000, step=1, splits=Some(8))
```

In my opinion it's not a bug but Spark's design. Keep in mind that your code is executed within the same action, so thanks to Spark's lazy nature it is smart enough to adjust/optimize some of your code during planning.


3 Comments

I think you are right, but with a checkpoint it may not behave this way. The question is whether that's a valid approach.
Interesting example. For me this behaviour is still OK. A checkpoint breaks the lineage, so it's intuitive to me that Spark is unable to do the trick in this case; later on Spark will use what was persisted during the checkpoint.
we agree here...

The first answer is correct, and whether the Spark approach is a good design is open to debate; I think it is.

As an embellishment: a checkpoint, if used, will result in an error:

```python
spark.sparkContext.setCheckpointDir("/foo2/bar")
new_range = new_range.checkpoint()
new_range.sort('number').show()
```

returns:

```
AnalysisException: Column 'number' does not exist. Did you mean one of the following? [num2];
'Sort ['number ASC NULLS FIRST], true
+- LogicalRDD [num2#69L], false
```
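Relatedly, if you want code like this to fail fast regardless of lineage, a small defensive helper can check the schema first (the name `sort_if_present` is hypothetical; it relies only on `df.columns`, which already reflects the drop even when the optimizer could still resolve the column from lineage):

```python
def sort_if_present(df, col_name):
    # df.columns shows the post-drop schema, even when the optimizer
    # could still resolve the column from the DataFrame's lineage.
    if col_name not in df.columns:
        raise ValueError(f"Column {col_name!r} not found; have {df.columns}")
    return df.sort(col_name)
```

With the `new_range` from the question, `sort_if_present(new_range, 'number')` raises a ValueError instead of silently sorting on a dropped column.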

