
From my Spark UI: what does it mean when stages are marked as skipped?

[Spark UI screenshot showing the job DAG with skipped stages]



Typically it means that the data has been fetched from cache, so there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data:

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.
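You can reproduce this with a minimal sketch (the data and app name below are made up): run two actions over the same reduceByKey lineage. The second job can reuse the shuffle files written by the first, and its map-side stage is reported as skipped in the UI.

```scala
// Minimal sketch, assuming a local Spark session; the data is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("skipped-stages-demo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val counts = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, 1))
  .reduceByKey(_ + _)   // introduces a shuffle

counts.collect()        // job 1: the map-side stage runs and writes shuffle files
counts.count()          // job 2: the shuffle files still exist, so the map-side stage
                        // shows up as "skipped" in the Spark UI instead of re-running
```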



Great answer. If you want to find out way more about the semantics of "skipped" and "pending" stages on the web UI, check out github.com/apache/spark/pull/3009, the pull request which first introduced these concepts. That PR is also an interesting read if you're curious about how skipped / pending stages interact with job-level progress bars.
If I am following correctly, does Spark skipping these stages mean they don't happen and they can be removed from the code altogether? Or is the code just very efficient with the cache, so I should leave it? @zero323
@SparkleGoat No. It means that these stages have been evaluated before, and the result is available without re-execution.
Another question: can caching and skipping stages make the output data different?
@SparkleGoat No, caching (and the skipping that results from it) is an internal Spark optimization and doesn't change the output data in any way.

Suppose you have an initial data frame with some data. You perform a couple of transformations on top of it and then run multiple actions on the final data frame. If you cached that data frame, Spark materializes it when the first action is called and keeps the result in memory in materialized form. When the next action is called, Spark walks the whole DAG, sees that the data frame was cached, and skips those stages, reusing the already materialized result it holds in memory.

When a stage is skipped you will see it marked as skipped in the Spark UI, and it speeds up your operation because Spark does not have to recompute the DAG from the root; it can start from the cached data frame. A rough sketch of this is shown below.
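Here is a minimal sketch of that pattern (the column names and data are made up): cache a data frame, materialize it with one action, and then run further actions against it, so the later jobs start from the cached result rather than recomputing the full lineage.

```scala
// Minimal sketch, assuming a local Spark session; data and column names are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("cached-dataframe-demo")
  .master("local[*]")
  .getOrCreate()

val counts = spark.range(0, 1000000)           // initial data frame
  .withColumn("bucket", col("id") % 100)       // a couple of transformations
  .groupBy("bucket")
  .count()
  .cache()                                     // mark the result for caching; nothing runs yet

counts.count()                    // first action: the whole DAG runs and the result is materialized in memory
counts.orderBy("bucket").show()   // later actions start from the cached data instead of recomputing the lineage

spark.stop()
```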

