I have recently understood that Spark DAGs are executed lazily, and that intermediate results are never cached unless you explicitly call `df.cache()`.
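For example, this is the mental model I have (just a sketch of my understanding, with a deterministic column and made-up variable names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 3)                        # transformation: nothing is computed yet
doubled = df.selectExpr("id * 2 AS doubled")  # still lazy, just extends the DAG

doubled.show()   # action: the lineage is executed now
doubled.show()   # another action: I expect the whole lineage to be recomputed

doubled.cache()  # as I understand it, only this keeps results around for reuse
```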
Now I've run an experiment that, based on that fact, should give me different random numbers every time:
```python
from pyspark.sql.functions import rand

df = spark.range(0, 3)
df = df.select("id", rand().alias("rand"))
df.show()
```

Executing these lines multiple times gives me different random numbers each time, as expected. But if the computed values (`rand()` in this case) are never stored, then calling just `df.show()` repeatedly should give me new random numbers every time, because the `rand` column is not cached, right?
Calling `df.show()` a second time, however, gives me the same random numbers as before. So the values must be stored somewhere now, which I thought does not happen.
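Concretely, this is what I observe in a single session (the comments describe the behaviour I see; the actual values differ from run to run):

```python
df.show()  # three rows, each with some random value in 'rand'
df.show()  # the exact same three random values again, not new ones
```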
Where is my thinking wrong? And could you give me a minimal example of non-caching that results in new random numbers every time?