
I can't understand why my dataframe is only on one node. I have a cluster of 14 machines, each with 4 physical CPUs, running a Spark standalone cluster.

I am connected through a notebook and create my Spark context:

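The screenshot of that cell is lost; below is a minimal sketch of what such a setup might look like in PySpark (the master URL, app name, and config values are assumptions, not taken from the original):

    from pyspark.sql import SparkSession

    # Hypothetical standalone master URL and settings; adjust to your cluster.
    spark = (SparkSession.builder
             .master("spark://master-host:7077")
             .appName("partitioning-test")
             .config("spark.default.parallelism", 8)
             .getOrCreate())

    print(spark.sparkContext.defaultParallelism)  # expected: 8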

I expect a parallelism of 8 partitions, but when I create a dataframe I get only one partition:
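For reference, a hedged sketch of how one can observe this (the path is hypothetical):

    # Reading a gzipped CSV and inspecting the partition count.
    df = spark.read.csv("hdfs:///data/myfile.csv.gz", header=True)
    print(df.rdd.getNumPartitions())  # prints 1: gzip is not splittable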

What am I missing?

Thanks to the answer from user8371915 I repartitioned my dataframe (I was reading a compressed file (.csv.gz), so I understand it isn't splittable).
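A minimal sketch of that repartition step (the partition count here is arbitrary):

    df = df.repartition(10)
    print(df.rdd.getNumPartitions())  # now 10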

But when I do a "count" on it, I see it being executed on only one executor: here namely on executor n°1, even though the file is 700 MB and spans 6 blocks on HDFS. As far as I understand, the computation should run over 10 cores, across 5 nodes... but everything is computed on only one node :-(

1 Answer


There are two possibilities:

- your data is small enough to fit in a single partition, or
- your file is compressed with a non-splittable codec (gzip), so Spark cannot split it across partitions.

In the first case you may consider adjusting the parallelism parameters, but if you go with the defaults and still get one partition, the data is simply small.

In the second case it is best to unpack the file before loading it into Spark. If you cannot do that, repartition after loading, but the initial read will still be slow: a single task has to decompress the whole file.
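A hedged sketch of both options in PySpark (paths and partition counts are hypothetical):

    # Option 1: load an uncompressed copy -- splittable, one partition per HDFS block.
    df = spark.read.csv("hdfs:///data/myfile.csv", header=True)

    # Option 2: keep the .gz but repartition right after loading; the read
    # itself is still a single task, only later stages gain parallelism.
    df_gz = (spark.read.csv("hdfs:///data/myfile.csv.gz", header=True)
                  .repartition(10))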


4 Comments

I think I am in the second case; I updated my question accordingly.
It seems like a "repartition" on a ".gz" file doesn't work. I unzipped the ".gz" directly in HDFS and rebuilt a dataframe on it => now I can change the number of partitions and see faster results.
It will work fine, but it happens after loading; the first stage won't be affected.
I couldn't really repartition a .gz file. I had to convert it to plain CSV to get partitioning working correctly (see the sketch after these comments).
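A hedged sketch of that conversion using Spark itself (paths and the partition count are hypothetical):

    # Read the non-splittable .gz once (single task), then write it back
    # uncompressed so future reads get one partition per HDFS block.
    df = spark.read.csv("hdfs:///data/myfile.csv.gz", header=True)
    df.repartition(10).write.option("header", True).csv("hdfs:///data/myfile_csv")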
