This is probably a basic question for any software architect, but I struggle with the concept.
Let's say I have a big Spark DataFrame stored on HDFS. I now do a filtering operation like this:
```python
df_new = my_big_hdfs_df.where("my_column='testvalue'")
print(type(df_new))
# <class 'pyspark.sql.dataframe.DataFrame'>
```
Where exactly is `df_new` stored? If this were regular Python, I would guess somewhere in memory. But is this true for PySpark as well? Or is it just some kind of reference? Is it persisted on disk somewhere in HDFS?
`df_new` isn't actually stored anywhere at the moment. Spark is lazy, so it doesn't evaluate `df_new` until it needs to. For now it just stores the instructions (the logical plan) needed to create `df_new`.
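Here's a quick sketch of that behaviour, assuming a `SparkSession` named `spark` and a small in-memory DataFrame standing in for your HDFS data (the output path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Stand-in for my_big_hdfs_df; in your case this would be read from HDFS.
df = spark.createDataFrame(
    [("testvalue", 1), ("other", 2)],
    ["my_column", "payload"],
)

# This only records a logical plan; nothing is read or computed yet.
df_new = df.where("my_column='testvalue'")

# You can inspect the plan Spark has built up so far.
df_new.explain()

# An *action* such as count() or show() triggers actual execution.
# Results flow through executor memory; they are not written back to
# HDFS unless you explicitly save them.
print(df_new.count())

# To keep the computed result around for reuse, persist it explicitly:
df_new.cache()  # materialized in executor memory on the next action
df_new.write.parquet("/tmp/df_new")  # hypothetical path; writes it to storage yourself
```

The key distinction is between *transformations* (`where`, `select`, etc.), which just extend the plan, and *actions* (`count`, `show`, `collect`, `write`), which make Spark actually run it.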