I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.
2 Answers
The time it takes to count the records in a DataFrame depends on the power of the cluster and how the data is stored. Performance optimizations can make Spark counts very quick.
It's easier for Spark to count Parquet files than CSV/JSON files. Parquet files store row counts in the footer metadata, so Spark doesn't need to read every row and perform the count itself; it can just grab the footer metadata. CSV / JSON files don't have any such metadata.
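For example, a one-time conversion to Parquet usually pays for itself if the data is counted or scanned repeatedly. This is a minimal sketch: the paths, the app name, and the `header` option are placeholders, so adjust them to your data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fast-count").getOrCreate()

# One-time conversion: read the CSV and write it back out as Parquet.
# (Paths are placeholders.)
csv_df = spark.read.option("header", "true").csv("/data/events.csv")
csv_df.write.mode("overwrite").parquet("/data/events_parquet")

# Counting the Parquet copy lets Spark lean on the footer metadata
# instead of parsing every row of text.
parquet_df = spark.read.parquet("/data/events_parquet")
print(parquet_df.count())
```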
If the data is stored in a Postgres database, the count operation will be performed by Postgres, so count execution time will be a function of the database's performance.
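A rough JDBC sketch is below, assuming an existing SparkSession named `spark` and the Postgres JDBC driver on the classpath; the connection details and table name are placeholders, and how much work lands on the database side depends on your Spark version and pushdown settings.

```python
# Connection details are placeholders.
postgres_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "public.events")
    .option("user", "spark_user")
    .option("password", "secret")
    .load()
)

# Per the point above, the runtime here is largely a function of
# Postgres performance rather than the size of the Spark cluster.
print(postgres_df.count())
```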
Bigger clusters generally perform count operations faster (unless the data is skewed in a way that causes one node to do all the work, leaving the other nodes idle).
The snappy compression algorithm is generally a better choice than gzip because it is splittable by Spark and faster to decompress.
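If you control the write, you can set the codec explicitly. Recent Spark versions already default Parquet to snappy, but the option below makes it explicit; `df` and the output path are placeholders.

```python
# Write the DataFrame as snappy-compressed Parquet (path is a placeholder).
df.write.mode("overwrite").option("compression", "snappy").parquet("/data/events_snappy")
```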
approx_count_distinct, which is powered by HyperLogLog under the hood, will be more performant for distinct counts, at the cost of some precision.
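Something like the sketch below, assuming a DataFrame `df` with a hypothetical `user_id` column; `rsd` is the maximum relative standard deviation you'll accept in the estimate.

```python
from pyspark.sql import functions as F

# Approximate distinct count via HyperLogLog; rsd=0.01 asks for roughly
# 1% relative standard deviation on the estimate.
df.select(
    F.approx_count_distinct("user_id", rsd=0.01).alias("approx_distinct_users")
).show()
```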
The other answer suggests caching before counting, which will actually slow down the count operation. Caching is an expensive operation that can take a lot more time than counting. Caching is an important performance optimization at times, but not if you just want a simple count.
Use the countApprox method in Spark if you don't need an exact answer.
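A rough sketch, assuming `df` is the DataFrame from the question: countApprox lives on the underlying RDD and returns whatever count it has after the timeout (in milliseconds) at the given confidence level. Note that in PySpark, going through `df.rdd` adds conversion overhead, so this mainly pays off when a full count would be much more expensive.

```python
# Approximate count: return a result after at most 1 second,
# even if not all tasks have finished.
approx_count = df.rdd.countApprox(timeout=1000, confidence=0.95)
print(approx_count)
```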