I called .explain(True) on your sample dataset; let's take a look at the output.
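For context, here is a minimal PySpark sketch of the kind of pipeline that produces a plan of this shape (the exact calls are my reconstruction from the plan, not necessarily your original code; the column names `number` and `num2` are taken from the plan itself):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = (
    spark.range(1000)                         # Range (0, 1000, ...)
    .withColumnRenamed("id", "number")        # Project [id AS number]
    .withColumn("num2", F.col("number") * 2)  # Project [number, number * 2 AS num2]
    .drop("number")                           # Project [num2]   <- the drop
    .sort("number")                           # Sort [number ASC NULLS FIRST]
)
df.explain(True)  # prints the parsed, analyzed, optimized and physical plans
```

The first block printed is the parsed plan: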
```
== Parsed Logical Plan ==
'Sort ['number ASC NULLS FIRST], true
+- Project [num2#61L]
   +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
      +- Project [id#57L AS number#59L]
         +- Range (0, 1000, step=1, splits=Some(8))
```
The Parsed Logical Plan is the first, "raw" version of the query plan (read it bottom-up). Here you can see `Project [num2#61L]` before the sort: this is your drop. Note also the leading quotes in `'Sort ['number ...]`: at this stage the sort column is not yet resolved.
But at the next stage (the Analyzed Logical Plan) it's different:
```
== Analyzed Logical Plan ==
num2: bigint
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [num2#61L, number#59L]
      +- Project [number#59L, (number#59L * cast(2 as bigint)) AS num2#61L]
         +- Project [id#57L AS number#59L]
            +- Range (0, 1000, step=1, splits=Some(8))
```
Spark was smart enough to figure out that the sort still needs this column, so the Project below the sort now includes `number#59L` again (if I remember correctly, the Catalyst analyzer rule responsible for this is `ResolveMissingReferences`). To stay consistent with your code, which drops the column, a new `Project [num2#61L]` is added on top of the sort.
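The analyzer's rewrite is essentially what you could do by hand: keep the sort key alive through the sort, then project it away. A sketch using the same hypothetical pipeline as above:

```python
# Manually spelled-out equivalent of what the analyzer produced:
df_manual = (
    spark.range(1000)
    .withColumnRenamed("id", "number")
    .withColumn("num2", F.col("number") * 2)
    .sort("number")   # sort while "number" still exists
    .select("num2")   # then drop it, matching the final Project [num2#61L]
)
df_manual.explain(True)  # yields essentially the same analyzed plan as above
```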
Now the last stage, the Optimized Logical Plan:
```
== Optimized Logical Plan ==
Project [num2#61L]
+- Sort [number#59L ASC NULLS FIRST], true
   +- Project [(id#57L * 2) AS num2#61L, id#57L AS number#59L]
      +- Range (0, 1000, step=1, splits=Some(8))
```
Notice that the optimizer also collapsed the three adjacent Projects into a single one and folded the constant cast away, so the multiplication is applied directly to `id#57L`. In my opinion this is not a bug but Spark design. Keep in mind that your whole query is planned within the same action, so due to Spark's lazy nature it is smart enough to adjust and optimize some of your code during planning.
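One quick way to convince yourself this is deliberate design: the analyzer only resurrects a dropped column for certain operators such as Sort; plainly selecting it after the drop still fails analysis. A hedged sketch (behavior may vary slightly between Spark versions):

```python
df_dropped = (
    spark.range(1000)
    .withColumnRenamed("id", "number")
    .withColumn("num2", F.col("number") * 2)
    .drop("number")
)

df_dropped.sort("number").show()      # works: the analyzer re-adds "number" for the Sort

# df_dropped.select("number").show()  # would raise AnalysisException: the column is really gone
```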