I have recently understood that Spark DAGs are executed lazily, and that intermediate results are never cached unless you explicitly call `df.cache()`.
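For example, this is the mental model I have (just a sketch of my understanding, with a deterministic column and made-up variable names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 3)                        # transformation: nothing is computed yet
doubled = df.selectExpr("id * 2 AS doubled")  # still lazy, just extends the DAG

doubled.show()   # action: the lineage is executed now
doubled.show()   # another action: I expect the whole lineage to be recomputed

doubled.cache()  # as I understand it, only this keeps results around for reuse
```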
Now I've run an experiment that, based on that fact, should give me different random numbers every time:
```python
from pyspark.sql.functions import rand

df = spark.range(0, 3)
df = df.select("id", rand().alias("rand"))
df.show()
```

Executing these lines multiple times gives me different random numbers each time, as expected. But if the computed values (`rand()` in this case) are never stored, then calling just `df.show()` repeatedly should give me new random numbers every time, because the `rand` column is not cached, right?
Calling `df.show()` a second time, however, gives me the same random numbers as before. So the values must be stored somewhere now, which I thought does not happen.
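Concretely, this is what I observe in a single session (the comments describe the behaviour I see; the actual values differ from run to run):

```python
df.show()  # three rows, each with some random value in 'rand'
df.show()  # the exact same three random values again, not new ones
```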
Where is my thinking wrong? And could you give me a minimal example of non-caching that results in new random numbers every time?