
I can't understand why my dataframe is only on one node. I have a cluster of 14 machines, each with 4 physical CPUs, running a Spark standalone cluster.

I am connected through a notebook and create my Spark context:

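The screenshot of that cell is lost; below is a minimal sketch of what such a setup might look like in PySpark (the master URL, app name, and config values are assumptions, not taken from the original):

    from pyspark.sql import SparkSession

    # Hypothetical standalone master URL and settings; adjust to your cluster.
    spark = (SparkSession.builder
             .master("spark://master-host:7077")
             .appName("partitioning-test")
             .config("spark.default.parallelism", 8)
             .getOrCreate())

    print(spark.sparkContext.defaultParallelism)  # expected: 8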

I expect a parallelism of 8 partitions, but when I create a dataframe I get only one partition:
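For reference, a hedged sketch of how one can observe this (the path is hypothetical):

    # Reading a gzipped CSV and inspecting the partition count.
    df = spark.read.csv("hdfs:///data/myfile.csv.gz", header=True)
    print(df.rdd.getNumPartitions())  # prints 1: gzip is not splittable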

What am I missing?

Thanks to the answer from user8371915 I repartitioned my dataframe (I was reading a compressed file (.csv.gz), so I understand it isn't splittable).
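A minimal sketch of that repartition step (the partition count here is arbitrary):

    df = df.repartition(10)
    print(df.rdd.getNumPartitions())  # now 10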

But when I do a "count" on it, I see it being executed on only one executor: here namely on executor n°1, even though the file is 700 MB and spans 6 blocks on HDFS. As far as I understand, the computation should run over 10 cores, across 5 nodes... but everything is computed on only one node :-(

1 Answer


There are two possibilities:

- your data is small enough to fit in a single partition, or
- your file is compressed with a non-splittable codec (gzip), so Spark cannot split it across partitions.

In the first case you may consider adjusting the parallelism parameters, but if you go with the defaults and still get one partition, the data is simply small.

In the second case it is best to unpack the file before loading it into Spark. If you cannot do that, repartition after loading, but the initial read will still be slow: a single task has to decompress the whole file.
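A hedged sketch of both options in PySpark (paths and partition counts are hypothetical):

    # Option 1: load an uncompressed copy -- splittable, one partition per HDFS block.
    df = spark.read.csv("hdfs:///data/myfile.csv", header=True)

    # Option 2: keep the .gz but repartition right after loading; the read
    # itself is still a single task, only later stages gain parallelism.
    df_gz = (spark.read.csv("hdfs:///data/myfile.csv.gz", header=True)
                  .repartition(10))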


4 Comments

I think I am in the second case; I updated my question accordingly.
It seems like a "repartition" on a ".gz" file doesn't work. I unzipped the ".gz" directly in HDFS and rebuilt a dataframe on it => now I can change the number of partitions and see faster results.
It will work fine, but it happens after loading; the first stage won't be affected.
I couldn't really repartition a .gz file. I had to convert it to plain CSV to get partitioning working correctly (see the sketch after these comments).
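A hedged sketch of that conversion using Spark itself (paths and the partition count are hypothetical):

    # Read the non-splittable .gz once (single task), then write it back
    # uncompressed so future reads get one partition per HDFS block.
    df = spark.read.csv("hdfs:///data/myfile.csv.gz", header=True)
    df.repartition(10).write.option("header", True).csv("hdfs:///data/myfile_csv")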
