
This is probably a basic question for any software architects, but I struggle with the concept.

Let's say I have a big Spark DataFrame stored on hdfs. I now do a filtering operation like this:

df_new = my_big_hdfs_df.where("my_column='testvalue'")
print(type(df_new))

<class 'pyspark.sql.dataframe.DataFrame'>

Where exactly is df_new stored? If this was regular python, I would guess somewhere in memory. But is this true for PySpark as well? Or is it just some kind of reference? Is it persisted on disk somewhere in hdfs?

  • AFAIK the contents of df_new aren't actually stored anywhere at the moment. Spark is lazy, so it doesn't evaluate df_new until it needs to. For now it just stores the instructions needed to create df_new.

1 Answer


df_new is a transformation of my_big_hdfs_df: the result of applying the condition given to the where function.

In other words, df_new is just a logical plan that will be executed against the data as soon as an action is called on it.
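
You can inspect that recorded plan without triggering any computation. For example (a quick illustration, using df_new from the question):

# explain() prints the physical plan Spark has built for df_new;
# explain(True) also shows the parsed, analyzed, and optimized logical plans.
df_new.explain(True)

This call reads no data; it only prints the plan.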

The data is not touched until an action such as show(), count(), or foreach() is called.
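
To see the laziness concretely, here is a minimal sketch; the HDFS path and column name are placeholders standing in for the question's data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; substitute your own HDFS location.
my_big_hdfs_df = spark.read.parquet("hdfs:///path/to/big_table")

# Returns instantly: no data is read, Spark only records the plan.
df_new = my_big_hdfs_df.where("my_column='testvalue'")

# Only now does Spark scan the files and apply the filter.
print(df_new.count())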

When an action is called, the transformations are executed and the results live in executor memory only for the duration of that job; they are recomputed on the next action unless you cache them. Calling persist() keeps the computed result around for reuse, spilling partitions to local disk if the chosen storage level allows it. The result is written to HDFS only when you call a save action such as write.
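
As a rough sketch of the difference between caching and saving (the output path is a placeholder):

from pyspark import StorageLevel

# Keep the computed result around for reuse across actions.
# MEMORY_AND_DISK spills partitions to local disk if they don't fit in memory.
df_new.persist(StorageLevel.MEMORY_AND_DISK)

df_new.count()  # first action: runs the plan and caches the result
df_new.show(5)  # reuses the cached partitions instead of recomputing

# Nothing lands in HDFS until you explicitly save:
df_new.write.mode("overwrite").parquet("hdfs:///path/to/output")

df_new.unpersist()  # release the cached partitions when done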
