
From my Spark UI: what does it mean when stages are marked as skipped?

[Spark UI screenshot showing the job DAG with skipped stages]



Typically it means that the data has been fetched from cache, so there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data:

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.
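You can reproduce this with a minimal sketch (the data and app name below are made up): run two actions over the same reduceByKey lineage. The second job can reuse the shuffle files written by the first, and its map-side stage is reported as skipped in the UI.

```scala
// Minimal sketch, assuming a local Spark session; the data is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("skipped-stages-demo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val counts = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, 1))
  .reduceByKey(_ + _)   // introduces a shuffle

counts.collect()        // job 1: the map-side stage runs and writes shuffle files
counts.count()          // job 2: the shuffle files still exist, so the map-side stage
                        // shows up as "skipped" in the Spark UI instead of re-running
```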



Great answer. If you want to find out way more about the semantics of "skipped" and "pending" stages on the web UI, check out github.com/apache/spark/pull/3009, the pull request which first introduced these concepts. That PR is also an interesting read if you're curious about how skipped / pending stages interact with job-level progress bars.
If I am following correctly, does Spark skipping these stages mean they don't happen and they can be removed from the code altogether? Or is the code just very efficient with the cache, so I should leave it? @zero323
@SparkleGoat No. It means that these stages have been evaluated before, and the result is available without re-execution.
Another question: can caching and skipping stages make the output data different?
@SparkleGoat No, caching (and the skipping that results from it) is an internal Spark optimization and doesn't change the output data in any way.

Suppose you have an initial data frame with some data. You perform a couple of transformations on top of it and then run multiple actions on the final data frame. If you cached that data frame, Spark materializes it when the first action is called and keeps the result in memory in materialized form. When the next action is called, Spark walks the whole DAG, sees that the data frame was cached, and skips those stages, reusing the already materialized result it holds in memory.

When a stage is skipped you will see it marked as skipped in the Spark UI, and it speeds up your operation because Spark does not have to recompute the DAG from the root; it can start from the cached data frame. A rough sketch of this is shown below.
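Here is a minimal sketch of that pattern (the column names and data are made up): cache a data frame, materialize it with one action, and then run further actions against it, so the later jobs start from the cached result rather than recomputing the full lineage.

```scala
// Minimal sketch, assuming a local Spark session; data and column names are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("cached-dataframe-demo")
  .master("local[*]")
  .getOrCreate()

val counts = spark.range(0, 1000000)           // initial data frame
  .withColumn("bucket", col("id") % 100)       // a couple of transformations
  .groupBy("bucket")
  .count()
  .cache()                                     // mark the result for caching; nothing runs yet

counts.count()                    // first action: the whole DAG runs and the result is materialized in memory
counts.orderBy("bucket").show()   // later actions start from the cached data instead of recomputing the lineage

spark.stop()
```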

