50

I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.

3 Comments
  • What is 'a very long time'? Can you tell us more about what and how you're trying to count? Commented Sep 6, 2016 at 21:20
  • See stackoverflow.com/questions/28413423/… and also the countApprox method in Spark if you don't need an exact answer. Commented Sep 6, 2016 at 21:32
  • Possible duplicate of Count number of rows in an RDD Commented Apr 24, 2017 at 14:38

2 Answers

62

It's going to take a lot of time either way, at least the first time.

One way is to cache the DataFrame, so you will be able to do more with it than just count.

E.g.:

    df.cache()
    df.count()

Subsequent operations don't take much time.


7

The time it takes to count the records in a DataFrame depends on the power of the cluster and how the data is stored. Performance optimizations can make Spark counts very quick.

It's easier for Spark to perform counts on Parquet files than on CSV/JSON files. Parquet files store row counts in the file footer, so Spark doesn't need to read all the rows and actually perform the count; it can just grab the footer metadata. CSV/JSON files don't have any such metadata.
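
To make that concrete, here is a minimal sketch, assuming a stock SparkSession and hypothetical input paths: the Parquet count can be served largely from footer metadata, while the CSV count requires a full scan and parse.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Parquet: Spark can answer count() from the row counts stored in file footers
    parquet_df = spark.read.parquet("/data/events_parquet")  # hypothetical path
    print(parquet_df.count())

    # CSV: every row has to be read and parsed before it can be counted
    csv_df = spark.read.option("header", "true").csv("/data/events_csv")  # hypothetical path
    print(csv_df.count())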

If the data is stored in a Postgres database, then the count operation will be performed by Postgres and count execution time will be a function of the database performance.
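
A rough sketch of pushing the count down to Postgres via Spark's JDBC source (this assumes a recent Spark version that supports the query option and the Postgres JDBC driver on the classpath; the URL, credentials, and table name are placeholders):

    # Sketch only: the connection details and table name are placeholders
    count_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")
        .option("query", "SELECT COUNT(*) AS n FROM events")  # the count executes inside Postgres
        .option("user", "spark_user")
        .option("password", "secret")
        .load()
    )
    print(count_df.first()["n"])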

Bigger clusters generally perform count operations faster (unless the data is skewed in a way that causes one node to do all the work, leaving the other nodes idle).

The Snappy compression codec is generally faster than gzip because it is splittable by Spark and faster to decompress.
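
For reference, the codec can be chosen when writing Parquet; a minimal sketch with hypothetical output paths (Snappy is already the default Parquet codec in recent Spark versions):

    # Snappy: the default Parquet codec in recent Spark versions, fast to decompress
    df.write.option("compression", "snappy").parquet("/data/events_snappy")  # hypothetical path

    # Gzip for comparison: smaller files, but slower to decompress
    df.write.option("compression", "gzip").parquet("/data/events_gzip")  # hypothetical path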

approx_count_distinct, which is powered by HyperLogLog under the hood, will be more performant for distinct counts, at the cost of precision.
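
A minimal sketch of trading a bit of precision for speed on a distinct count; the column name and the rsd (maximum relative error) value are just examples:

    from pyspark.sql import functions as F

    # Exact distinct count: requires a full aggregation over the data
    df.select(F.countDistinct("user_id")).show()  # "user_id" is a hypothetical column

    # Approximate distinct count via HyperLogLog, allowing up to ~5% relative error
    df.select(F.approx_count_distinct("user_id", rsd=0.05)).show()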

The other answer suggests caching before counting, which will actually slow down the count operation. Caching is an expensive operation that can take a lot more time than counting. Caching is an important performance optimization at times, but not if you just want a simple count.

2 Comments

would it be possible to give a scenario where caching of a large df and the associated costs are justified?
@Bonson - yes, caching is a powerful pattern that can speed certain types of queries a lot, especially for a DataFrame that'll get reused a lot. Suppose you have a DF, perform a big filtering operation, and then do a bunch of different types of computations on the filtered DF. Caching the filtered DF might help a lot. It can be good to repartition before caching, depending on the data. In short, yes, caching can help, but it can also hurt, so it needs to be applied intelligently.
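
A rough sketch of that pattern, with hypothetical column names and partition count:

    from pyspark.sql import functions as F

    # Filter once, optionally repartition, then cache the result for reuse
    filtered = df.filter(F.col("status") == "active").repartition(200)  # hypothetical column / count
    filtered.cache()

    # The first action materializes the cache; subsequent ones read from memory
    filtered.count()
    filtered.groupBy("country").count().show()      # hypothetical column
    filtered.agg(F.avg("purchase_amount")).show()   # hypothetical column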
