
I have 2 PySpark DataFrames: the first one contains ~500,000 rows and the second contains ~300,000 rows. I did 2 joins; the second join takes each cell from the second DataFrame (300,000 rows) and compares it with all the cells in the first DataFrame (500,000 rows).

So, this join is very slow. I broadcasted the DataFrames before the join.

Test 1:

df_join = df1.join(F.broadcast(df2), df1.String.contains(df2["search.subString"]), "left") 

The job took many hours to finish.

Test 2:

df_join = F.broadcast(df1).join(F.broadcast(df2), df1.String.contains(df2["search.subString"]), "left") 

This ran even slower than the first code above, so the performance is very bad.

I tried to cache the DataFrames before the join.

I used:

df.cache() for each DataFrame. But the performance is still not good.

I tried to use persist in memory_only:

df.persist(MEMORY_ONLY) ==> NameError: global name 'MEMORY_ONLY' is not defined

df.persist(StorageLevel.MEMORY_ONLY) ==> NameError: global name 'StorageLevel' is not defined

How can I persist the DataFrame in memory?

Can you please suggest a solution to improve the performance?

Thanks in advance.

  • from pyspark.sql import StorageLevel, you'd have to import the module Commented Dec 10, 2019 at 15:25
  • A regex type join will always be very slow. Can you not merge it in python? Commented Dec 10, 2019 at 15:28
  • @samkart I already added it but I got the error: ImportError: cannot import name StorageLevel Commented Dec 10, 2019 at 15:34
  • 1
    Ah! I think it should be from pyspark import StorageLevel Commented Dec 12, 2019 at 12:09
  • @samkart Hi, do you have some idea about this issue please ? stackoverflow.com/questions/59931770/… Thanks Commented Jan 27, 2020 at 13:34

1 Answer

Use

df=df.cache()

print(df.count())

Basically, you need to call an action to get the effect of caching.
