
This is probably a basic question for any software architects, but I struggle with the concept.

Let's say I have a big Spark DataFrame stored on hdfs. I now do a filtering operation like this:

df_new = my_big_hdfs_df.where("my_column='testvalue'")
print(type(df_new))

<class 'pyspark.sql.dataframe.DataFrame'>

Where exactly is df_new stored? If this was regular python, I would guess somewhere in memory. But is this true for PySpark as well? Or is it just some kind of reference? Is it persisted on disk somewhere in hdfs?

  • AFAIK the contents of df_new aren't actually stored anywhere at the moment. Spark is lazy, so it doesn't evaluate df_new until it needs to. For now it just stores the instructions needed to create df_new.

1 Answer


df_new is a transformation of my_big_hdfs_df: the result of applying the condition given to the where function.

In other words, df_new is just a logical plan that will be executed against the data as soon as an action is called on it.
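
You can inspect that recorded plan without triggering any computation. For example (a quick illustration, using df_new from the question):

# explain() prints the physical plan Spark has built for df_new;
# explain(True) also shows the parsed, analyzed, and optimized logical plans.
df_new.explain(True)

This call reads no data; it only prints the plan.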

The data is not touched until an action such as show(), count(), or foreach() is called.
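
To see the laziness concretely, here is a minimal sketch; the HDFS path and column name are placeholders standing in for the question's data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; substitute your own HDFS location.
my_big_hdfs_df = spark.read.parquet("hdfs:///path/to/big_table")

# Returns instantly: no data is read, Spark only records the plan.
df_new = my_big_hdfs_df.where("my_column='testvalue'")

# Only now does Spark scan the files and apply the filter.
print(df_new.count())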

When an action is called, the transformations are executed and the results live in executor memory only for the duration of that job; they are recomputed on the next action unless you cache them. Calling persist() keeps the computed result around for reuse, spilling partitions to local disk if the chosen storage level allows it. The result is written to HDFS only when you call a save action such as write.
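
As a rough sketch of the difference between caching and saving (the output path is a placeholder):

from pyspark import StorageLevel

# Keep the computed result around for reuse across actions.
# MEMORY_AND_DISK spills partitions to local disk if they don't fit in memory.
df_new.persist(StorageLevel.MEMORY_AND_DISK)

df_new.count()  # first action: runs the plan and caches the result
df_new.show(5)  # reuses the cached partitions instead of recomputing

# Nothing lands in HDFS until you explicitly save:
df_new.write.mode("overwrite").parquet("hdfs:///path/to/output")

df_new.unpersist()  # release the cached partitions when done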
